#### Introduction to Statistical Learning, Lab 4.4

# Quadratic Discriminant Analysis


We will now perform a quadratic discriminant analysis (QDA) on the `Smarket` data set, trying to predict `Direction` using `Lag1` and `Lag2`. We again use the `sklearn` library.



In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf
import statsmodels.api as sm
import patsy
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.metrics import confusion_matrix, classification_report, precision_score
from islpwf import datasets, utils, lmplots
sns.set()
%matplotlib inline

We first load the data set.

In [None]:
smarket = datasets.Smarket()
smarket.head()

We use the observations before 2005 as the training sample and the observations from 2005 as the test sample.

In [None]:
X_train = smarket[smarket.Year < 2005][['Lag1', 'Lag2']]
Y_train = smarket[smarket.Year < 2005]['Direction']
X_test = smarket[smarket.Year == 2005][['Lag1', 'Lag2']]
Y_test = smarket[smarket.Year == 2005]['Direction']
X_test.head()

We perform a QDA fit on the training data set.

In [None]:
qda = QuadraticDiscriminantAnalysis()
qda_fit = qda.fit(X_train, Y_train)
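
To make the fit less of a black box, here is a minimal sketch of the QDA decision rule written out with `numpy`. It assumes the textbook discriminant (class-specific mean and covariance, plus the log prior) and runs on synthetic data, not on `Smarket`. The helper `qda_discriminants` is a hypothetical name introduced only for this illustration.

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# Synthetic two-class data, for illustration only (not the Smarket data).
rng = np.random.default_rng(1)
X0 = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 1.0]], size=100)
X1 = rng.multivariate_normal([1.0, 1.0], [[2.0, -0.5], [-0.5, 0.5]], size=100)
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)


def qda_discriminants(X_fit, y_fit, X_new):
    """Hypothetical helper: quadratic discriminant score per class and row."""
    scores = []
    for k in np.unique(y_fit):
        Xk = X_fit[y_fit == k]
        mu = Xk.mean(axis=0)                    # class mean
        cov = np.cov(Xk, rowvar=False)          # class-specific covariance
        prior = len(Xk) / len(X_fit)            # class proportion
        diff = X_new - mu
        # Squared Mahalanobis distance of every row to the class mean.
        maha = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
        _, logdet = np.linalg.slogdet(cov)
        scores.append(-0.5 * maha - 0.5 * logdet + np.log(prior))
    return np.column_stack(scores)


# Assign each observation to the class with the largest discriminant score.
manual_pred = qda_discriminants(X, y, X).argmax(axis=1)
sk_pred = QuadraticDiscriminantAnalysis().fit(X, y).predict(X)
agreement = np.mean(manual_pred == sk_pred)
print(agreement)
```

The manual rule and the `sklearn` predictions should agree on essentially every observation. Any disagreement would be limited to points lying numerically on the decision boundary.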

In [None]:
print(qda.priors_)

This output indicates that $\hat{\pi}_1 = 0.492$ and $\hat{\pi}_2 = 0.508$. That is, 49.2% of the training observations correspond to days when the market went down.
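
As a quick check on what `priors_` holds: when no priors are passed to the estimator, `sklearn` estimates them as the class proportions in the training labels. The sketch below uses made-up toy data (the names `X_toy`, `y_toy` and `qda_toy` are hypothetical) to show the equivalence.

```python
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# Made-up toy data, for illustration only (not the Smarket data).
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(20, 2)), columns=['Lag1', 'Lag2'])
y_toy = pd.Series(['Down'] * 8 + ['Up'] * 12, name='Direction')

qda_toy = QuadraticDiscriminantAnalysis().fit(X_toy, y_toy)

# Class proportions in the training labels, ordered like qda_toy.classes_.
proportions = (
    y_toy.value_counts(normalize=True).loc[qda_toy.classes_].to_numpy()
)

print(qda_toy.classes_)
print(qda_toy.priors_)
print(proportions)
```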

In [None]:
# Rows follow the order of qda_fit.classes_, columns the order of the predictors.
means = pd.DataFrame(qda_fit.means_, index=('Down', 'Up'), columns=('Lag1', 'Lag2'))
means

The reported *means* are the average of each predictor in each class.

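We can read off the relation of the predictors to the market direction: on days when the market goes up, the returns of the previous two days tend to be negative, and on days when it goes down they tend to be positive.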
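
The class means can also be reproduced with a plain `pandas` group-by, which may help to see what `means_` contains. This is a sketch on made-up toy data, not on `Smarket`, and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# Made-up toy data, for illustration only (not the Smarket data).
rng = np.random.default_rng(3)
X_toy = pd.DataFrame(rng.normal(size=(20, 2)), columns=['Lag1', 'Lag2'])
y_toy = pd.Series(['Down'] * 8 + ['Up'] * 12, name='Direction')

qda_toy = QuadraticDiscriminantAnalysis().fit(X_toy, y_toy)

# Average of each predictor within each class; groups are sorted by label,
# which matches the order of qda_toy.classes_.
group_means = X_toy.groupby(y_toy).mean()
print(group_means)
print(qda_toy.means_)
```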

We now check the performance of the classifier on the test data set.

In [None]:
pred = qda_fit.predict(X_test)
# sklearn expects (y_true, y_pred): rows are true classes, columns are predictions.
confusion_matrix(Y_test, pred)

In [None]:
print(classification_report(Y_test, pred))

It is interesting that the QDA predictions are correct about 60% of the time, even though the 2005 data were not used to fit the model. This is quite good for a notoriously difficult problem. QDA seems to capture the structure in the data quite well, but don't bet your money on it.
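
For reference, the overall accuracy is the sum of the diagonal of the confusion matrix divided by the total number of test observations. The sketch below checks this against `accuracy_score` on a small made-up pair of label vectors (not the lab's actual predictions).

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels, for illustration only.
y_true = np.array(['Up', 'Up', 'Down', 'Up', 'Down',
                   'Down', 'Up', 'Up', 'Down', 'Up'])
y_hat = np.array(['Up', 'Down', 'Down', 'Up', 'Up',
                  'Down', 'Up', 'Up', 'Up', 'Up'])

# Rows are true classes, columns are predicted classes (labels sorted).
cm = confusion_matrix(y_true, y_hat)

# Correct predictions sit on the diagonal.
accuracy = np.trace(cm) / cm.sum()
print(cm)
print(accuracy, accuracy_score(y_true, y_hat))
```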

We showcase the same type of plots as in the previous lab, this time using the test data set.

The `utils` module from the `islpwf` library provides some utility functions to plot decision contours and boundaries.

They require a callable (function or method) that returns the class probabilities over the two-dimensional range spanned by the two predictor arrays. They only work out of the box when the model has exactly the two predictors specified. Otherwise, users have to supply a callable that properly marginalises over the other predictors.
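
As a rough sketch of such a marginalising callable, suppose a model was fitted on three predictors but only the first two are to be plotted. One option is to average `predict_proba` over observed values of the third predictor. The exact signature the plotting utilities expect is not documented here, so this assumes the callable receives an array of shape `(n, 2)` and returns an array of shape `(n, n_classes)`, like `predict_proba`. The name `marginal_predict_proba` and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# Synthetic data with three predictors, for illustration only.
rng = np.random.default_rng(2)
X3 = rng.normal(size=(300, 3))
noise = rng.normal(scale=0.5, size=300)
y3 = (X3[:, 0] + X3[:, 1] ** 2 + 0.5 * X3[:, 2] + noise > 1).astype(int)
model3 = QuadraticDiscriminantAnalysis().fit(X3, y3)


def marginal_predict_proba(X2, model=model3, other=X3[:, 2], n_values=50):
    """Hypothetical helper: average class probabilities over the third predictor.

    Assumes X2 has shape (n, 2) and returns an array of shape (n, n_classes).
    """
    X2 = np.asarray(X2, dtype=float)
    values = other[:n_values]        # observed values of the third predictor
    probs = np.zeros((X2.shape[0], len(model.classes_)))
    for v in values:
        # Complete each two-predictor row with one value of the third predictor.
        X_full = np.column_stack([X2, np.full(X2.shape[0], v)])
        probs += model.predict_proba(X_full)
    return probs / len(values)


out = marginal_predict_proba(np.zeros((5, 2)))
print(out)
```

Averaging over observed values is only one way to marginalise. Fixing the other predictors at a single value, such as their mean, gives a conditional slice instead, which is a different quantity.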

Let's first overlay the $P=0.5$ decision contour on a scatter plot. This is the default behaviour of `utils.plot_decision_contour()`.

In [None]:
ax = sns.scatterplot(x=X_test['Lag1'], y=X_test['Lag2'], hue=Y_test)
ax = utils.plot_decision_contour(X_test['Lag1'], X_test['Lag2'],
                                 qda_fit.predict_proba, ax=ax)
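
To show what such a decision contour amounts to, here is a minimal sketch using only `numpy`, `matplotlib` and `sklearn`: evaluate `predict_proba` on a grid and draw the $P=0.5$ level. This is an assumption about the general technique, not the actual implementation of `utils.plot_decision_contour()`, and it uses synthetic data.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# Synthetic two-class data, for illustration only (not the Smarket data).
rng = np.random.default_rng(4)
X0 = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 1.0]], size=100)
X1 = rng.multivariate_normal([1.5, 1.5], [[2.0, -0.5], [-0.5, 0.5]], size=100)
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)
model = QuadraticDiscriminantAnalysis().fit(X, y)

# Regular grid covering the range of both predictors.
x1 = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
x2 = np.linspace(X[:, 1].min(), X[:, 1].max(), 100)
xx1, xx2 = np.meshgrid(x1, x2)
grid = np.column_stack([xx1.ravel(), xx2.ravel()])

# Probability of the first class (column 0) at every grid point.
p0 = model.predict_proba(grid)[:, 0].reshape(xx1.shape)

fig, ax = plt.subplots()
ax.scatter(X[:, 0], X[:, 1], c=y, s=10)
cs = ax.contour(xx1, xx2, p0, levels=[0.5])   # the P = 0.5 decision contour
ax.clabel(cs, fmt='%.1f')
```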

We can also overlay more contours by specifying the `levels` keyword argument. A value of `None` uses the automatic configuration from `matplotlib`'s `contour()` function.

In [None]:
ax = sns.scatterplot(x=X_test['Lag1'], y=X_test['Lag2'], hue=Y_test)
ax = utils.plot_decision_contour(X_test['Lag1'], X_test['Lag2'],
                                 qda_fit.predict_proba, ax=ax, 
                                 levels=None)

We can choose the category (class) the probabilities refer to with the `category` keyword argument (default: $0$). Note the change in the contour annotations: they are now reversed.

In [None]:
ax = sns.scatterplot(x=X_test['Lag1'], y=X_test['Lag2'], hue=Y_test)
ax = utils.plot_decision_contour(X_test['Lag1'], X_test['Lag2'],
                                 qda_fit.predict_proba, ax=ax, 
                                 levels=None, category=1)

The `plot_decision_boundaries()` function identifies the areas where a particular class probability is highest and overlays a coloured shade accordingly. This function works out of the box for multiple *response* classes, but the marginalisation restriction still applies if there are more than two *predictors*.

In [None]:
ax = sns.scatterplot(x=X_test['Lag1'], y=X_test['Lag2'], hue=Y_test)
ax = utils.plot_decision_boundaries(X_test['Lag1'], X_test['Lag2'],
                                   qda_fit.predict_proba, ax=ax)
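
The shaded regions can be sketched along the same lines: take the class with the highest probability at each grid point and fill the resulting areas. The example below uses three synthetic classes to illustrate the multi-class case. It is an assumption about the general approach, not the actual code of `utils.plot_decision_boundaries()`.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# Synthetic three-class data, for illustration only.
rng = np.random.default_rng(5)
centres = [(-2.0, 0.0), (2.0, 0.0), (0.0, 2.5)]
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(80, 2)) for c in centres])
y = np.repeat([0, 1, 2], 80)
model = QuadraticDiscriminantAnalysis().fit(X, y)

# Regular grid covering the range of both predictors.
x1 = np.linspace(X[:, 0].min(), X[:, 0].max(), 150)
x2 = np.linspace(X[:, 1].min(), X[:, 1].max(), 150)
xx1, xx2 = np.meshgrid(x1, x2)
grid = np.column_stack([xx1.ravel(), xx2.ravel()])

# Index of the most probable class at every grid point.
labels = model.predict_proba(grid).argmax(axis=1).reshape(xx1.shape)

fig, ax = plt.subplots()
# One filled band per class index: (-0.5, 0.5), (0.5, 1.5), (1.5, 2.5).
ax.contourf(xx1, xx2, labels, levels=np.arange(-0.5, 3), alpha=0.2)
ax.scatter(X[:, 0], X[:, 1], c=y, s=10)
```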

We can also combine the two.

In [None]:
ax = sns.scatterplot(x=X_test['Lag1'], y=X_test['Lag2'], hue=Y_test)
ax = utils.plot_decision_contour(X_test['Lag1'], X_test['Lag2'],
                                 qda_fit.predict_proba, ax=ax)
ax = utils.plot_decision_boundaries(X_test['Lag1'], X_test['Lag2'],
                                   qda_fit.predict_proba, ax=ax)