#### Introduction to Statistical Learning, Lab 4.3

# Linear Discriminant Analysis


We will now perform a linear discriminant analysis (LDA) on the `Smarket` data set, trying to predict `Direction` using `Lag1` and `Lag2`. We go beyond the now-familiar `statsmodels` library and use the `sklearn` library.



In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf
import statsmodels.api as sm
import patsy
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix, classification_report, precision_score
from islpy import datasets, utils, lmplots
sns.set()
%matplotlib inline

We first load the data set.

In [None]:
smarket = datasets.Smarket()
smarket.head()

We use the observations before 2005 as the training sample and the observations from 2005 as the test sample.

In [None]:
X_train = smarket[smarket.Year < 2005][['Lag1', 'Lag2']]
Y_train = smarket[smarket.Year < 2005]['Direction']
X_test = smarket[smarket.Year == 2005][['Lag1', 'Lag2']]
Y_test = smarket[smarket.Year == 2005]['Direction']
X_test.head()

We perform an LDA fit on the training data set.

In [None]:
lda = LinearDiscriminantAnalysis()
lda_fit = lda.fit(X_train, Y_train)

In [None]:
print(lda.priors_)

This output indicates that $\hat{\pi}_1 = 0.492$ and $\hat{\pi}_2 = 0.508$. That is, 49.2% of the training observations correspond to days when the market went down.

In [None]:
means = pd.DataFrame(lda_fit.means_, index=('Down', 'Up'), columns=('Lag1', 'Lag2'))
means

The reported *means* are the average of each predictor in each class.
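As a sanity check, both the priors and the class means can be reproduced by hand: the priors are the class proportions in the training labels, and the means are per-class column averages. The following is a minimal sketch on a small made-up sample (the arrays `X_toy` and `y_toy` are hypothetical and not part of `Smarket`). It assumes the default behaviour of `LinearDiscriminantAnalysis`, where no `priors` argument is passed.

```python
import numpy as np

# Hypothetical toy sample: two predictors, two classes (not Smarket data).
X_toy = np.array([[ 0.5, -0.2],
                  [ 1.0,  0.4],
                  [-0.3,  0.1],
                  [-0.7, -0.5],
                  [ 0.2,  0.9]])
y_toy = np.array(['Down', 'Up', 'Down', 'Up', 'Up'])

# Classes in sorted order, which is also how sklearn orders `classes_`.
classes, counts = np.unique(y_toy, return_counts=True)

# Prior of each class = fraction of training observations in that class.
priors = counts / counts.sum()

# Class means = average of each predictor within each class.
means_toy = np.array([X_toy[y_toy == c].mean(axis=0) for c in classes])

print(dict(zip(classes, priors)))
print(means_toy)
```

Applied to `X_train` and `Y_train`, the same two steps should agree with `lda.priors_` and `lda_fit.means_`.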

We can read off the relation of the predictors to the market direction.

In [None]:
lda_fit.coef_

If $-0.0554\times\text{Lag1} - 0.0443\times\text{Lag2}$ is large, the LDA classifier predicts that the market will go up.
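For two classes, this linear combination is the LDA discriminant direction. As a minimal sketch under the textbook assumptions (a covariance matrix shared by both classes, estimated as the pooled within-class covariance), the direction is $\hat{\Sigma}^{-1}(\hat{\mu}_{\text{Up}} - \hat{\mu}_{\text{Down}})$. The data below are simulated and the helper `lda_direction` is hypothetical; we have not checked that its normalisation matches `coef_` digit for digit for every `solver`.

```python
import numpy as np

def lda_direction(X, y, neg, pos):
    """Two-class LDA coefficients and intercept from the pooled covariance.

    Hypothetical helper for illustration; a positive score favours `pos`.
    """
    X0, X1 = X[y == neg], X[y == pos]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    n0, n1 = len(X0), len(X1)
    # Pooled within-class covariance with n - 2 degrees of freedom.
    S = ((X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)) / (n0 + n1 - 2)
    coef = np.linalg.solve(S, mu1 - mu0)
    # Intercept: midpoint term plus the log prior odds.
    intercept = -0.5 * (mu1 + mu0) @ coef + np.log(n1 / n0)
    return coef, intercept

rng = np.random.default_rng(0)
# Simulated predictors: the 'Up' class is shifted towards negative values.
X_down = rng.normal(loc=[0.5, 0.5], scale=1.0, size=(200, 2))
X_up = rng.normal(loc=[-0.5, -0.5], scale=1.0, size=(200, 2))
X = np.vstack([X_down, X_up])
y = np.array(['Down'] * 200 + ['Up'] * 200)

coef, intercept = lda_direction(X, y, neg='Down', pos='Up')
score = X @ coef + intercept          # large score -> predict 'Up'
pred_sim = np.where(score > 0, 'Up', 'Down')
print(coef, intercept, (pred_sim == y).mean())
```

Because the simulated `Up` class sits at lower values of both predictors, both coefficients come out negative, which mirrors the signs seen for `Lag1` and `Lag2` above.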

We now check the performance of the classifier on the test data set.

In [None]:
pred = lda_fit.predict(X_test)
# Passing the predictions first puts predicted classes in rows and true classes in columns.
confusion_matrix(pred, Y_test)

Note that this *exactly* replicates the results from the logistic regression.

The following summary report also confirms the agreement with the logistic regression, in particular the 58% precision when predicting `Up`.

In [None]:
print(classification_report(Y_test, pred))
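Precision for a class is the share of correct predictions among all observations predicted to be in that class. A minimal sketch with made-up labels (`y_true_toy` and `y_pred_toy` are hypothetical, not the `Smarket` results) builds the table with predictions as rows and true labels as columns, which is the layout that `confusion_matrix(pred, Y_test)` yields, and reads the precision off the `Up` row.

```python
import numpy as np

labels = np.array(['Down', 'Up'])
# Hypothetical true and predicted labels (not the Smarket results).
y_true_toy = np.array(['Up', 'Up', 'Down', 'Up', 'Down', 'Down', 'Up', 'Up'])
y_pred_toy = np.array(['Up', 'Down', 'Up', 'Up', 'Down', 'Up', 'Up', 'Up'])

# table[i, j] = number of observations predicted as labels[i]
# whose true class is labels[j] (rows: predicted, columns: true).
table = np.array([[np.sum((y_pred_toy == p) & (y_true_toy == t)) for t in labels]
                  for p in labels])

# Precision for 'Up': correct 'Up' predictions over all 'Up' predictions.
precision_up = table[1, 1] / table[1].sum()
# Overall accuracy: diagonal over total.
accuracy = np.trace(table) / table.sum()

print(table)
print(precision_up, accuracy)
```

Swapping the two arguments transposes the table, so it is worth checking which axis holds the predictions before reading off a row or column.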

The `utils` module from the `islpy` library provides some utility functions to plot decision contours and boundaries.

They require a callable (function or method) that returns the class probabilities over the two-dimensional range spanned by two predictor arrays. They work out of the box only when the model has exactly these two predictors. Otherwise users have to supply a callable that properly marginalises over the other predictors.
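To illustrate the last point, here is a minimal sketch of such a wrapper. It assumes the plotting helpers call the function with an `(n, 2)` array and expect an `(n, n_classes)` array back, as the `predict_proba` method of an `sklearn` classifier does; this is inferred from how the helpers are used in this lab rather than from the `islpy` documentation. The three-predictor model `proba_3d` and the wrapper `fix_third_predictor` are hypothetical, and holding the extra predictor at a fixed value is only a crude stand-in for proper marginalisation.

```python
import numpy as np

def proba_3d(X):
    """Hypothetical two-class model with three predictors (logistic form)."""
    z = 0.3 - 0.8 * X[:, 0] - 0.5 * X[:, 1] + 0.2 * X[:, 2]
    p_up = 1.0 / (1.0 + np.exp(-z))
    return np.column_stack([1.0 - p_up, p_up])

def fix_third_predictor(predict_proba, value):
    """Return a callable of two predictors with the third held at `value`."""
    def wrapped(X2):
        X2 = np.asarray(X2, dtype=float)
        third = np.full((X2.shape[0], 1), value)
        return predict_proba(np.hstack([X2, third]))
    return wrapped

proba_2d = fix_third_predictor(proba_3d, value=0.0)
grid_points = np.array([[0.0, 0.0], [1.0, -1.0]])
print(proba_2d(grid_points))
```

The resulting `proba_2d` has the same calling convention as a two-predictor `predict_proba`, so under the stated assumption it could be passed to the plotting helpers in its place.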

Let's first overlay the $P=0.5$ decision contour on a scatter plot. This is the default behaviour of `utils.plot_decision_contour()`.

In [None]:
ax = sns.scatterplot(x=X_train['Lag1'], y=X_train['Lag2'], hue=Y_train)
ax = utils.plot_decision_contour(X_train['Lag1'], X_train['Lag2'],
                                 lda_fit.predict_proba, ax=ax)
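For LDA with two classes the $P=0.5$ contour is a straight line. A short derivation, assuming the posterior has the logistic form that follows from the LDA model, with $\beta_0$ standing for the fitted intercept (`lda_fit.intercept_`), $(\beta_1, \beta_2)$ for the entries of `lda_fit.coef_`, and $\beta_2 \neq 0$:

$$
P(\text{Up} \mid \text{Lag1}, \text{Lag2}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \text{Lag1} + \beta_2 \text{Lag2})}} = 0.5
\iff \beta_0 + \beta_1 \text{Lag1} + \beta_2 \text{Lag2} = 0
\iff \text{Lag2} = -\frac{\beta_0 + \beta_1 \text{Lag1}}{\beta_2}
$$

Other probability levels give parallel lines, since they only change the constant on the right-hand side of the middle equation.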

We can also overlay more contours by specifying the `levels` keyword argument. A value of `None` uses the automatic configuration from `matplotlib`'s `contour()` function.

In [None]:
ax = sns.scatterplot(x=X_train['Lag1'], y=X_train['Lag2'], hue=Y_train)
ax = utils.plot_decision_contour(X_train['Lag1'], X_train['Lag2'],
                                 lda_fit.predict_proba, ax=ax, 
                                 levels=None)

We can choose the category (class) the probabilities refer to (default: $0$). Note the change in the contour annotations: they are now reversed.

In [None]:
ax = sns.scatterplot(x=X_train['Lag1'], y=X_train['Lag2'], hue=Y_train)
ax = utils.plot_decision_contour(X_train['Lag1'], X_train['Lag2'],
                                 lda_fit.predict_proba, ax=ax, 
                                 levels=None, category=1)

The `plot_decision_boundaries()` function identifies the areas where a particular class probability is highest and overlays a coloured shade accordingly. This function works out of the box for multiple *response* classes, but the marginalisation restriction still applies if there are more than two *predictors*.

In [None]:
ax = sns.scatterplot(x=X_train['Lag1'], y=X_train['Lag2'], hue=Y_train)
ax = utils.plot_decision_boundaries(X_train['Lag1'], X_train['Lag2'],
                                   lda_fit.predict_proba, ax=ax)
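The idea behind such a shaded plot can be sketched in a few lines: evaluate the class probabilities on a grid and record, for each grid point, the class with the highest probability. The snippet below illustrates that idea and is not the `islpy` implementation; `toy_proba` is a hypothetical three-class model.

```python
import numpy as np

def toy_proba(X):
    """Hypothetical three-class probabilities via a softmax of linear scores."""
    scores = np.column_stack([X[:, 0], X[:, 1], -X[:, 0] - X[:, 1]])
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

# Grid spanning the two predictor ranges.
x1 = np.linspace(-2.0, 2.0, 5)
x2 = np.linspace(-2.0, 2.0, 5)
g1, g2 = np.meshgrid(x1, x2)
grid = np.column_stack([g1.ravel(), g2.ravel()])

# Index of the most probable class at each grid point, reshaped to the grid.
regions = toy_proba(grid).argmax(axis=1).reshape(g1.shape)
print(regions)
```

An array like `regions` can then be shaded with, for example, `plt.contourf(g1, g2, regions)`.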

We can also combine the decision contour and the shaded decision boundaries in one plot.

In [None]:
ax = sns.scatterplot(x=X_train['Lag1'], y=X_train['Lag2'], hue=Y_train)
ax = utils.plot_decision_contour(X_train['Lag1'], X_train['Lag2'],
                                 lda_fit.predict_proba, ax=ax)
ax = utils.plot_decision_boundaries(X_train['Lag1'], X_train['Lag2'],
                                   lda_fit.predict_proba, ax=ax)