# Predicting market recession using macroeconomic data

The data collection and metadata notebook is Part 1 of this series: https://www.kaggle.com/devarshraval/1-data-collection-for-predicting-recession

In [None]:
import pandas as pd
df=pd.read_csv("../input/us-macroeconomic-data-19962020-source-fred/macrodata.csv",index_col=0,parse_dates=True)
df

In [None]:
import numpy as np
import matplotlib.pyplot as plt 

from statsmodels.tsa.stattools import adfuller #to check unit root in time series 
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

import seaborn as sns #for correlation heatmap

import warnings
warnings.filterwarnings('ignore')

### Data preprocessing

1) Add lags of the variables as additional features

2) Test stationarity of time series

3) Standardize the dataset

In [None]:
df.index.set_names(names='Date',inplace=True)

In [None]:
# add lags
for col in df.drop(['Regime'], axis=1):
    for n in [3,6,9,12,18]:
        df['{} {}M lag'.format(col, n)] = df[col].shift(n).ffill().values

# 1 month ahead prediction
df["Regime"]=df["Regime"].shift(-1)

In [None]:
df=df.dropna(axis=0)
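
To make the effect of the lag and target shifts concrete, here is a minimal sketch on a made-up six-month series. The column name `rate` and all values are illustrative assumptions, not taken from the dataset.

```python
import pandas as pd

# Toy monthly data; "rate" and its values are purely illustrative
idx = pd.date_range("2000-01-01", periods=6, freq="MS")
toy = pd.DataFrame(
    {
        "rate": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        "Regime": ["Normal", "Normal", "Normal", "Recession", "Recession", "Normal"],
    },
    index=idx,
)

# A 2-month lag: each row receives the value observed two months earlier
toy["rate 2M lag"] = toy["rate"].shift(2)

# 1-month-ahead target: each row now carries next month's regime
toy["Regime"] = toy["Regime"].shift(-1)

# The first two rows have no lag value and the last row has no target
clean = toy.dropna(axis=0)
print(clean)
```

The rows lost to `dropna` are the price of the longest lag (18 months in the notebook) plus one month for the shifted target.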

In the most intuitive sense, stationarity means that the statistical properties of a process generating a time series do not change over time. It does not mean that the series does not change over time, just that the way it changes does not itself change over time. In particular, the mean and variance do not change over time.
[Source: Shay Palachy, "Stationarity in time series analysis", towardsdatascience.com]
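
As a rough illustration of this definition, the sketch below compares simulated white noise (stationary) with a random walk (non-stationary) and shows that first-order differencing recovers the stationary series. The data is simulated and the helper `half_variances` is a hypothetical name introduced only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
noise = rng.normal(size=1000)   # stationary: constant mean and variance
walk = np.cumsum(noise)         # random walk: its spread changes over time

def half_variances(x):
    """Variance of the first and the second half of a series."""
    mid = len(x) // 2
    return x[:mid].var(), x[mid:].var()

# For the noise the two halves look alike; for the walk they usually do not
print("noise:", half_variances(noise))
print("walk: ", half_variances(walk))

# First-order differencing turns the random walk back into the noise
recovered = np.diff(walk)
print(np.allclose(recovered, noise[1:]))
```

This is why differencing is the usual first remedy when a unit-root test fails to reject non-stationarity.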

In [None]:
# Check stationarity with the augmented Dickey-Fuller (ADF) unit-root test, in three passes:
#   1. series that fail the test are replaced by their first-order difference
#   2. series that still fail are differenced again (second-order difference)
#   3. series that still fail are dropped from the feature set

from statsmodels.tsa.stattools import adfuller #to check unit root in time series 
threshold=0.01 #significance level for the ADF p-value

# Pass 1: result[1] is the p-value; above the threshold the unit root is not rejected
for column in df.drop(['Regime'], axis=1):
    result=adfuller(df[column])
    if result[1]>threshold:
        df[column]=df[column].diff()
df1=df.dropna(axis=0).copy() # copy so later assignments do not act on a view of df

# Pass 2: difference the remaining non-stationary series once more
for column in df1.drop(['Regime'], axis=1):
    result=adfuller(df1[column])
    if result[1]>threshold:
        df1[column]=df1[column].diff()
df1=df1.dropna(axis=0)

# Pass 3: collect and drop the columns that are still non-stationary
nonstationary_col=[]
for column in df1.drop(['Regime'], axis=1):
    result=adfuller(df1[column])
    if result[1]>threshold:
        nonstationary_col.append(column)
df1=df1.drop(nonstationary_col,axis=1)

In [None]:
nonstationary_col

In [None]:
from sklearn.preprocessing import StandardScaler
features=df1.drop(['Regime'],axis=1)
col_names=features.columns

scaler=StandardScaler()
scaler.fit(features)
standardized_features=scaler.transform(features)

df2=pd.DataFrame(data=standardized_features,columns=col_names,index=df1.index)
df2.insert(loc=1,column='Regime', value=df1['Regime'].values)
df2.shape
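
The cell above fits `StandardScaler` on the full sample. As a hedged variant (not the notebook's method), the sketch below writes out the same z-score transform by hand and estimates the mean and standard deviation on the training rows only, which avoids using information from the test period. The data and the 75/25 row split are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(loc=5.0, scale=3.0, size=(100, 2))   # made-up features
X_train, X_test = X[:75], X[75:]                    # chronological split

# Statistics are estimated on the training rows only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)   # population std (ddof=0), the convention StandardScaler uses

# The same statistics are then applied to both parts
Z_train = (X_train - mean) / std
Z_test = (X_test - mean) / std

print(Z_train.mean(axis=0), Z_train.std(axis=0))
```

With scikit-learn, the equivalent would be calling `fit` on the training features and `transform` on both sets.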

In [None]:
df2

In [None]:
# import packages for modelling
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
from sklearn import metrics

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score, roc_curve, auc
from sklearn import model_selection
from sklearn import preprocessing
from sklearn.model_selection import TimeSeriesSplit
from sklearn.feature_selection import SelectFromModel

from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
import xgboost as xgb

from matplotlib import pyplot as mp
import seaborn as sns
import os
import warnings
warnings.filterwarnings('ignore')

In [None]:
Label = df2["Regime"].apply(lambda regime: 1. if regime == 'Normal' else 0.)
df2.insert(loc=2, column="Label", value=Label.values)

#### Train Test Split
The first 75% of the sample (through 2014-09) is used as the training set and the remaining 25% (2014-10 onward) as the test set.

In [None]:
df_targets=df2['Label'].values
df_features=df2.drop(['Regime','Label'], axis=1)

df_training_features = df2[:'2014-09'].drop(['Regime','Label'], axis=1)
df_validation_features = df2['2014-10':].drop(['Regime','Label'], axis=1)

df_training_targets = df2[:'2014-09']['Label'].values


df_validation_targets = df2['2014-10':]['Label'].values

In [None]:
xtr=df_training_features[:'2006-03']
ytr=df2[:'2006-03']['Label'].values
xdev=df_training_features['2006-04':]
ydev=df2['2006-04':'2014-09']['Label'].values

In [None]:
print(len(df_training_features),len(df_training_targets),len(df_targets))
print(len(df_validation_features),len(df_validation_targets),len(df_features))
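
The sizes of the two parts can be checked with a small sketch. It assumes the cleaned monthly sample runs from 1997-10 to 2020-05, the range used in the plotting cell later in this notebook; the frame `toy` is a stand-in for `df2`.

```python
import numpy as np
import pandas as pd

# Stand-in for df2: one row per month over the assumed sample range
idx = pd.date_range(start="1997-10-01", end="2020-05-01", freq="MS")
toy = pd.DataFrame({"x": np.arange(len(idx))}, index=idx)

# Partial-string slicing on a DatetimeIndex includes the whole end month
train = toy.loc[:"2014-09"]
test = toy.loc["2014-10":]

print(len(train), len(test), round(len(train) / len(toy), 2))
```

Splitting by date rather than at random keeps every test observation strictly later than every training observation.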

## Modelling

Because the dataset is small and recession months occur in contiguous blocks, standard cross-validation functions are unreliable here: they can produce folds that contain only one class.

Instead, the logistic regression hyperparameters are tuned below with an explicit loop over candidate values.
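
The single-class problem can be seen in a minimal sketch. The label sequence is made up (one contiguous recession in 60 months) and the folds are plain contiguous chunks, which is an assumption chosen for simplicity and not the exact behaviour of any scikit-learn splitter.

```python
import numpy as np

# 1 = Normal, 0 = Recession; one contiguous recession block (made-up data)
labels = np.ones(60, dtype=int)
labels[30:38] = 0

# Five contiguous folds of 12 months each
folds = np.array_split(labels, 5)
for k, fold in enumerate(folds):
    classes = np.unique(fold)
    status = "single class" if len(classes) == 1 else "both classes"
    print(k, classes, status)

single_class_folds = [k for k, fold in enumerate(folds) if len(np.unique(fold)) == 1]
print(single_class_folds)
```

A metric such as ROC AUC cannot be computed on a fold that contains only one class, which is why the number of usable splits is so limited here.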

In [None]:
df_training_targets

In [None]:
seed=8
scoring='roc_auc' 
kfold = model_selection.TimeSeriesSplit(n_splits=2) 
models = []

models.append(('LR', LogisticRegression(C=1e09)))
models.append(('LR_L1', LogisticRegression(penalty = 'l1',solver='liblinear')))
models.append(('LR_L2', LogisticRegression(penalty = 'l2')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('GB', GradientBoostingClassifier()))
models.append(('ABC', AdaBoostClassifier()))
models.append(('RF', RandomForestClassifier()))
models.append(('XGB', xgb.XGBClassifier()))

results = []
names = []
lb = preprocessing.LabelBinarizer()

for name, model in models:
    cv_results = model_selection.cross_val_score(estimator = model, X = df_training_features, 
                                                 y = lb.fit_transform(df_training_targets), cv=kfold, scoring = scoring)
    
    model.fit(df_training_features, df_training_targets) # train the model
    # ROC curve and its area on the training data, both computed from predicted probabilities
    train_probs = model.predict_proba(df_training_features)[:,1]
    fpr, tpr, thresholds = metrics.roc_curve(df_training_targets, train_probs)
    train_auc = metrics.roc_auc_score(df_training_targets, train_probs)
    plt.plot(fpr, tpr, label='%s ROC (area = %0.2f)' % (name, train_auc))
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

plt.plot([0, 1], [0, 1],'r--')
plt.xlim([-0.05, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('1-Specificity(False Positive Rate)')
plt.ylabel('Sensitivity(True Positive Rate)')
plt.title('Receiver Operating Characteristic')
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show() 
warnings.filterwarnings('ignore')

In [None]:
fig = plt.figure()
fig.suptitle('Algorithm Comparison based on Cross Validation Scores')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

Based on the cross-validation scores, Logistic Regression with an L2 penalty and XGBoost perform best.

### Hyperparameter Tuning for Logistic Regression

In [None]:
from sklearn.metrics import roc_auc_score
C = np.reciprocal([0.00000001, 0.00000005, 0.0000001, 0.0000005, 0.000001, 0.000005, 0.00001, 0.00005, 
                         0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100, 500, 1000, 5000])
penalty=['l1','l2']

penals=pd.DataFrame(index=penalty)
cs=[]
scrs=[]
for p in penalty:
        scores=[]
        params=pd.DataFrame(index=C)
        for c in C:
            model=LogisticRegression(C=c,max_iter=10000,penalty=p,solver='liblinear')
            lr1=model.fit(xtr,ytr)
            ypreds=lr1.predict(xdev)
            score=roc_auc_score(ydev,ypreds)
            scores.append(score)
        params['rocauc']=scores
        maxc=params['rocauc'].idxmax()
        maxsc=params['rocauc'].max()
        scrs.append(maxsc)
        cs.append(maxc)
penals['C']=cs
penals['score']=scrs
penals

In [None]:
model=LogisticRegression(C=1000,penalty='l2',max_iter=10000,solver='liblinear')
lr1=model.fit(df_training_features,df_training_targets)
ypreds=lr1.predict(df_validation_features)
param=lr1.get_params()
score=roc_auc_score(df_validation_targets,ypreds)
score
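
Note that the scores above are computed from hard class predictions (`predict`). ROC AUC is more commonly computed from continuous scores such as `predict_proba`, and the two can differ. The sketch below illustrates the gap on made-up numbers with a small pairwise implementation of AUC; `pairwise_auc` is a hypothetical helper written for this example, not a library function.

```python
import numpy as np

def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count as half."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]   # every positive compared with every negative
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

# Made-up labels and predicted probabilities of the positive class
y_true = np.array([0, 0, 0, 1, 1, 1])
probs = np.array([0.1, 0.2, 0.6, 0.3, 0.7, 0.9])
hard = (probs > 0.5).astype(int)         # what a 0.5 cutoff would output

auc_from_probs = pairwise_auc(y_true, probs)
auc_from_hard = pairwise_auc(y_true, hard)
print(auc_from_probs, auc_from_hard)
```

Thresholding discards the ranking information inside each predicted class, so the score from hard labels summarizes a single operating point and not the whole ROC curve.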

#### XGBoost model

In [None]:
seed=8
scoring='roc_auc' 
kfold = model_selection.TimeSeriesSplit(n_splits=2) 
lb = preprocessing.LabelBinarizer()
xgboost = model_selection.GridSearchCV(estimator=xgb.XGBClassifier(),
                                       param_grid={'booster': ['gbtree'],
                                                  'max_depth':[2,3,5,10],
                                                  'learning_rate':[0.01,0.1,1]},
                                       scoring=scoring, cv=kfold).fit(df_training_features, 
                                                                      lb.fit_transform(df_training_targets)).best_estimator_
xgboost.fit(df_training_features, df_training_targets)



In [None]:
modelxg=xgb.XGBClassifier(learning_rate=0.001,n_estimators=1000,max_depth=100,booster='gbtree',n_jobs=-1).fit(df_training_features, df_training_targets)
ypredsxgb=modelxg.predict(df_validation_features)

xgbscore=roc_auc_score(df_validation_targets,ypredsxgb)
xgbscore

Thus, the XGBoost model appears to overfit the training data, while the Logistic Regression model with L2 penalty (LR_L2) performs better on the test data.

## Results

In [None]:
import datetime
# define periods of recession
rec_spans = []

rec_spans.append([datetime.datetime(2001,3,1), datetime.datetime(2001,10,1)])
rec_spans.append([datetime.datetime(2007,12,1), datetime.datetime(2009,5,1)])
rec_spans.append([datetime.datetime(2020,3,1), datetime.datetime(2020,5,1)])

In [None]:
prob_predictions = lr1.predict_proba(df_training_features)
prob_predictions = np.append(prob_predictions, lr1.predict_proba(df_validation_features), axis=0)
sample_range = pd.date_range(start='10/1/1997', end='5/1/2020', freq='MS')

plt.figure(figsize=(20,5))
plt.plot(sample_range.to_series().values, prob_predictions[:,0])
for start, end in rec_spans:
    plt.axvspan(start, end, alpha=0.25, color='grey') # shade the recession periods
plt.axhline(y=0.5, color='r', ls='dashed', alpha = 0.5)
plt.title('Recession Prediction Probabilities with Logistic Regression')
plt.show()

The part of the blue line after 2014 shows the model's predicted probabilities on the test period. The model does not predict a recession for two consecutive months until 2020, when the actual recession started.
One limitation of this model is that the selected sample period is shorter than in the online course, where data from 1960 to 2018 was used.
With further refinement, this model could be used to predict a market recession one month ahead.
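
The "two consecutive months" reading of the plot can be written down as a simple rule. The sketch below is one possible formalization under assumed settings (a 0.5 threshold and a run length of 2); the function name `first_sustained_signal` and the probabilities are made up for illustration.

```python
import numpy as np

def first_sustained_signal(probs, threshold=0.5, run_length=2):
    """Start index of the first run of `run_length` consecutive values above `threshold`, or None."""
    run = 0
    for i, p in enumerate(probs):
        run = run + 1 if p > threshold else 0   # extend or reset the current run
        if run >= run_length:
            return i - run_length + 1           # index where the run began
    return None

# Made-up monthly recession probabilities: one isolated spike, then a sustained signal
probs = np.array([0.1, 0.6, 0.2, 0.3, 0.7, 0.8, 0.9])
signal_start = first_sustained_signal(probs)
print(signal_start)
```

Requiring a run of months filters out isolated spikes, at the cost of reacting at least one month later.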

### Feature Importance
Feature importance is obtained using the shap library. The results are fairly intuitive: the Nasdaq lag feature, the US exchange rate, the federal funds rate and the AAA bond rate are the most prominent features influencing the model's results.

In [None]:
import shap
# load JS visualization code to notebook
shap.initjs()

# explain the logistic regression's predictions using SHAP,
# with the training features as the background dataset
explainer = shap.LinearExplainer(lr1,df_training_features)
shap_values = explainer.shap_values(df_training_features.values)

# visualize the first prediction's explanation (use matplotlib=True to avoid Javascript)
#shap.force_plot(explainer.expected_value, shap_values[0,:], df_training_features.iloc[0,:])

shap.summary_plot(shap_values, df_training_features)
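
For a linear model whose features are treated as independent, the SHAP value of a feature reduces to its coefficient times the feature's deviation from the background mean, measured in log-odds units for logistic regression. The sketch below reproduces that calculation by hand as an aid to reading the summary plot. It assumes this independent-features setting and uses made-up coefficients and data, so it is an approximation of what the explainer does and not a substitute for it.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))          # made-up standardized features
coef = np.array([1.5, -0.7, 0.0])     # made-up logistic regression coefficients
intercept = 0.2

def log_odds(X):
    """Linear score of the model before the sigmoid is applied."""
    return X @ coef + intercept

# Per-sample, per-feature contribution relative to the average sample
background_mean = X.mean(axis=0)
contributions = coef * (X - background_mean)

# Additivity: contributions sum to the prediction minus the baseline prediction
baseline = log_odds(background_mean[None, :])[0]
print(np.allclose(contributions.sum(axis=1), log_odds(X) - baseline))

# Global importance: mean absolute contribution per feature
importance = np.abs(contributions).mean(axis=0)
print(importance)
```

A feature with a zero coefficient contributes nothing, and because the features are standardized, a large coefficient translates directly into a wide spread in the summary plot.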

## Limitations and Future Scope
Some limitations of this model are described below; they can be treated as future tasks for Kaggle and GitHub users:

#### 1. Range of time periods:

The shorter time period was selected for two main reasons:

To include more features, such as exchange rates (EXUSUK, etc.), consumer goods demand (ACOGNO) and business inventories (BUSINV), for which data is not available further back in time.

In addition, the online course argues that recent data is more important for predicting future trends.

However, users can experiment with the time period to find an optimal range with minimal noise.

#### 2. Potential modern indicators are not used, such as:
(a) cryptocurrency rates,

(b) news sentiment analysis,

(c) research and innovation indices (especially important going forward owing to the public health and climate crises),

(d) political and geopolitical stability indices.