#### Introduction to Statistical Learning, Exercise 4.1

__Please do yourself a favour and only look at the solutions after you honestly tried to solve the exercises.__

# Classification on the Weekly Data Set

This data set is very similar to the `Smarket` data set, except that it contains 1,089 observations of *weekly* returns for 21 years, from the beginning of 1990 to the end of 2010.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf
import statsmodels.api as sm
import patsy
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report
from islpy import datasets, utils, lmplots
sns.set()
%matplotlib inline

### A. Visualisation

Produce some numerical and graphical summaries of the `Weekly` data set.

Do there appear to be any patterns?

In [None]:
weekly = datasets.Weekly()
weekly.head()

In [None]:
weekly.describe()

The distributions of the `Lag` variables are all similar, and so is the `Today` distribution. There is nothing obvious to conclude from the numerical summary.

We dare to make a pair plot matrix; this may take a while...

In [None]:
sns.pairplot(data=weekly)
plt.show()

Comments:

  - Some plots *do* show some structure.
  - Clearly `Volume` increased over time.
  - In general, `Volume` has more interesting correlation structures than the `Lag` variables.

### B. Logistic Regression

Use the full data set to perform a logistic regression with `Direction` as the response and the five lag variables (`Lag1` through `Lag5`) and `Volume` as predictors. Do any of the predictors seem to be significant?

In [None]:
logit_fit = smf.glm('Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume',
                    weekly, family=sm.families.Binomial()).fit()
logit_fit.summary()

According to the (rather generous) $p < 0.05$ criterion, only the `Lag2` variable seems to have a significant influence on the response. We would not bet any money on this, though.
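
One reason for caution is that six predictors are tested at once. As a rough sketch, assuming the six tests were independent (they are not exactly, so the number is only indicative), the chance that at least one predictor without any real effect falls below the threshold can be estimated as follows. The function name `familywise_error_rate` is ours, chosen for illustration.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive among n independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests

# Six predictors (Lag1 to Lag5 and Volume), each tested at the 5% level.
fwer = familywise_error_rate(0.05, 6)
print(f"{fwer:.3f}")  # → 0.265
```

Under that assumption, roughly one in four such regressions would show a "significant" predictor by chance alone.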

### C. Confusion Matrix

Compute the confusion matrix and overall fraction of correct predictions. What does the confusion matrix tell us about the types of mistakes the classifier makes?

In [None]:
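# statsmodels models the first level of Direction ('Down'),
# so predict() returns P(Down) and pred is True for a predicted 'Down'.
pred = logit_fit.predict() > 0.5
# Argument order (pred, truth): rows are predictions, columns are true labels.
cm = confusion_matrix(pred, logit_fit.model.endog)
# Fraction of correct predictions: diagonal over the number of observations.
cf = (cm[0, 0] + cm[1, 1]) / logit_fit.nobs
print(cm)
print(cf)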

Overall the predictions are correct 56% of the time and the *training error rate* is 44%.

In [None]:
cm[1, 1] / (cm[1, 0] + cm[1, 1])

When we predict that the market goes down, we are correct 53% of the time.

In [None]:
cm[0, 1] / (cm[0, 0] + cm[0, 1])

When we predict that the market goes up, we are wrong 44% of the time, so these predictions are correct only 56% of the time.

The overall performance is rather underwhelming.
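
A confusion matrix supports two different per-class rates, and they are easy to mix up: the share of correct predictions among all predictions of a class (read along a row here), and the share of true cases of a class that were found (read along a column here). The following pure-Python sketch assumes the same layout as the cells above, with rows as predictions and columns as true labels. The helper name `rates` and the toy counts are made up for illustration; they are not the `Weekly` results.

```python
def rates(cm):
    """Summarise a 2x2 confusion matrix laid out as cm[predicted][actual]."""
    total = sum(sum(row) for row in cm)
    accuracy = (cm[0][0] + cm[1][1]) / total
    # Row-wise: of all predictions of class 1, how many were right?
    precision_1 = cm[1][1] / (cm[1][0] + cm[1][1])
    # Column-wise: of all true class-1 cases, how many were found?
    recall_1 = cm[1][1] / (cm[0][1] + cm[1][1])
    return accuracy, precision_1, recall_1

# Toy counts (made up): 10 predictions of class 1, 6 of them correct,
# while 20 true class-1 cases exist in total.
toy = [[76, 14],
       [4, 6]]
accuracy, precision_1, recall_1 = rates(toy)
print(accuracy, precision_1, recall_1)
```

In this toy example the accuracy looks respectable, yet only a small share of the true class-1 cases is detected, which is the kind of imbalance a single overall rate can hide.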

### D. Regression on one Predictor

Now split the data set into a training data set (1990 - 2008) and a test data set (2009 - 2010) and perform a logistic regression using only the `Lag2` predictor on the training data set.

Compute the confusion matrix and overall rate of correct predictions on the test data set.

In [None]:
train = weekly[weekly.Year < 2009] 
test = weekly[weekly.Year >= 2009]

In [None]:
# Logistic regression on Lag2 only, fitted on the training years.
lm = smf.glm('Direction ~ Lag2', train, family=sm.families.Binomial()).fit()
lm.summary()

In [None]:
# The fitted model returns P(Down), so both vectors use True for 'Down'.
true_test = test['Direction'] == 'Down'
pred = lm.predict(test) > 0.5

In [None]:
cm = confusion_matrix(pred, true_test)
cf = (cm[0, 0] + cm[1, 1]) / test.shape[0]
print(cm)
print(cf)
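
Thresholding the predicted probability at 0.5 is equivalent to checking the sign of the linear predictor, because the logistic function crosses 0.5 exactly at zero. A minimal sketch with made-up coefficients (`b0` and `b1` are hypothetical, not the fitted values from the cell above):

```python
import math

def logistic(z: float) -> float:
    """Map a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

b0, b1 = -0.2, 0.5  # hypothetical intercept and Lag2 coefficient
checks = []
for lag2 in (-1.0, 0.4, 3.0):
    z = b0 + b1 * lag2
    # Both conditions agree: p > 0.5 exactly when z > 0.
    checks.append((logistic(z) > 0.5) == (z > 0))
print(checks)
```

With a single predictor this means the fitted model splits the `Lag2` axis at one cut point, $-b_0 / b_1$, and predicts one class on each side.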

### E. LDA

Repeat __D__ using linear discriminant analysis.

In [None]:
lda = LinearDiscriminantAnalysis()
lda_fit = lda.fit(train[['Lag2']], train['Direction']) 

In [None]:
pred = lda_fit.predict(test[['Lag2']]) == 'Down'

In [None]:
cm = confusion_matrix(pred, true_test)
cf = (cm[0, 0] + cm[1, 1]) / test.shape[0]
print(cm)
print(cf)

### F. QDA

Repeat __D__ using quadratic discriminant analysis.

In [None]:
qda = QuadraticDiscriminantAnalysis()
qda_fit = qda.fit(train[['Lag2']], train['Direction']) 

In [None]:
pred = qda_fit.predict(test[['Lag2']]) == 'Down'

In [None]:
cm = confusion_matrix(pred, true_test)
cf = (cm[0, 0] + cm[1, 1]) / test.shape[0]
print(cm)
print(cf)

### G. KNN

Repeat __D__ using a KNN classifier with $k=1$.

In [None]:
knn = KNeighborsClassifier(1)
knn_fit = knn.fit(train[['Lag2']], train['Direction'])

In [None]:
pred = knn_fit.predict(test[['Lag2']]) == 'Down'

In [None]:
cm = confusion_matrix(pred, true_test)
cf = (cm[0, 0] + cm[1, 1]) / test.shape[0]
print(cm)
print(cf)
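
With $k=1$ and a single predictor, the classifier simply copies the label of the training week whose `Lag2` value is closest to the query. A minimal pure-Python sketch of that rule, using made-up values rather than the `Weekly` data (the function name `predict_1nn` is ours, and scikit-learn's handling of ties may differ):

```python
def predict_1nn(train_x, train_y, x):
    """Return the label of the training point closest to x (1-D, k=1)."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

train_x = [-2.0, -0.5, 0.3, 1.8]        # made-up Lag2 values
train_y = ['Down', 'Up', 'Down', 'Up']  # made-up labels
preds = [predict_1nn(train_x, train_y, x) for x in (-1.9, 0.0, 2.5)]
print(preds)  # → ['Down', 'Down', 'Up']
```

Barring duplicated predictor values, every training point is its own nearest neighbour, so the training error of such a classifier is zero and only the test error is informative.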

### H. Comparison

Which of these methods appears to have the best performance?

Logistic regression and LDA perform best (and exactly the same), with a test error rate of 37.5%.

### I. Experimentation

Experiment with different combinations of predictors, including possible interactions and transformations.

Report on the performance of the best classifier you found.

This is a bit of an open-ended exercise. We provide two examples. 
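
The first example transforms `Lag1` with a signed square root, $\operatorname{sign}(x)\sqrt{|x|}$, which compresses large returns while keeping their sign. A small pure-Python sketch of that transformation (the helper name `signed_sqrt` is ours):

```python
import math

def signed_sqrt(x: float) -> float:
    """sign(x) * sqrt(|x|): shrinks magnitudes but preserves the sign."""
    return math.copysign(math.sqrt(abs(x)), x)

values = [signed_sqrt(v) for v in (-4.0, 0.0, 2.25)]
print(values)  # → [-2.0, 0.0, 1.5]
```

A plain square root would be undefined for negative returns, which is why the absolute value and the sign are handled separately.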

In [None]:
x_train = patsy.dmatrix('Lag2+I(np.sqrt(np.abs(Lag1))*np.sign(Lag1))-1',
                        train, return_type='dataframe')
x_train.rename({x_train.columns[1]: 'sqrtLag1'}, axis=1, inplace=True)
y_train = train['Direction']
x_test = patsy.dmatrix('Lag2+I(np.sqrt(np.abs(Lag1))*np.sign(Lag1))-1',
                       test, return_type='dataframe')
x_test.rename({x_test.columns[1]: 'sqrtLag1'}, axis=1, inplace=True)
y_test = test['Direction']

In [None]:
knn = KNeighborsClassifier(4)
knn_fit = knn.fit(x_train, y_train)

In [None]:
ax = sns.scatterplot(x=x_train['Lag2'], y=x_train['sqrtLag1'], hue=y_train)

ax = utils.plot_decision_contour(x_train['Lag2'], x_train['sqrtLag1'],
                                 knn_fit.predict_proba, ax=ax)
ax = utils.plot_decision_boundaries(x_train['Lag2'], x_train['sqrtLag1'],
                                   knn_fit.predict_proba, ax=ax)

In [None]:
pred = knn_fit.predict(x_test)

In [None]:
cm = confusion_matrix(pred, y_test)
cf = (cm[0, 0] + cm[1, 1]) / test.shape[0]
print(cm)
print(cf)

In [None]:
x_train = patsy.dmatrix('Lag2+Lag1-1', train, return_type='dataframe')
y_train = train['Direction']
x_test = patsy.dmatrix('Lag2+Lag1-1', test, return_type='dataframe')
y_test = test['Direction']

In [None]:
qda = QuadraticDiscriminantAnalysis()
qda_fit = qda.fit(x_train, y_train) 

In [None]:
ax = sns.scatterplot(x=x_train['Lag2'], y=x_train['Lag1'], hue=y_train)

ax = utils.plot_decision_contour(x_train['Lag2'], x_train['Lag1'],
                                 qda_fit.predict_proba, levels=None, ax=ax)
ax = utils.plot_decision_boundaries(x_train['Lag2'], x_train['Lag1'],
                                   qda_fit.predict_proba, ax=ax)

In [None]:
pred = qda_fit.predict(x_test)

In [None]:
cm = confusion_matrix(pred, y_test)
cf = (cm[0, 0] + cm[1, 1]) / test.shape[0]
print(cm)
print(cf)