#### Introduction to Statistical Learning, Lab 4.5

# K-Nearest Neighbours


We will now perform a KNN classification analysis on the `Smarket` data set, trying to predict `Direction` using `Lag1` and `Lag2`. We again use the `sklearn` library.

The KNN classifier relies on the distances between observations in predictor space, which raises the question of how those distances should be measured. If every predictor were measured in the same units (say, metres), this would not be an issue.

But we are often faced with data sets containing predictors such as `income` (measured in thousands of dollars) and `age` (measured in years).

Intuitively, we know that a difference of 50 years is more important than an income difference of $1000.  

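The computer does not know that, though: it only sees the raw numbers, so whichever predictor happens to have the larger numeric scale dominates the distance. We therefore have to *normalise* our data.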
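To make the effect of units concrete, here is a minimal sketch with three made-up people (hypothetical values, not taken from any data set), assuming for illustration that income is recorded in plain dollars. It compares Euclidean distances before and after standardising each column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: columns are income (in dollars) and age (in years).
people = np.array([
    [50_000.0, 25.0],   # person A
    [51_000.0, 25.0],   # person B: same age as A, income differs by $1000
    [50_000.0, 75.0],   # person C: same income as A, age differs by 50 years
])

def dist(a, b):
    """Euclidean distance between two rows."""
    return float(np.linalg.norm(a - b))

# On the raw scale the $1000 income gap looks 20 times larger than the 50-year age gap.
raw_ab = dist(people[0], people[1])
raw_ac = dist(people[0], people[2])
print(f"raw:    A-B = {raw_ab:.1f}, A-C = {raw_ac:.1f}")

# After standardising each column, neither predictor dominates because of its units
# (for this symmetric toy example the two distances come out equal).
scaled = StandardScaler().fit_transform(people)
scaled_ab = dist(scaled[0], scaled[1])
scaled_ac = dist(scaled[0], scaled[2])
print(f"scaled: A-B = {scaled_ab:.2f}, A-C = {scaled_ac:.2f}")
```

On the raw scale, person B would count as a far more distant neighbour of A than person C, purely because of the choice of units.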




In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf
import statsmodels.api as sm
import patsy
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report
from islpy import datasets, utils, lmplots
sns.set()
%matplotlib inline

We first load the data set.

In [None]:
smarket = datasets.Smarket()
smarket.head()

We use the observations before 2005 as the training sample and the observations from 2005 as the test sample.

In [None]:
X_train = smarket[smarket.Year < 2005][['Lag1', 'Lag2']]
Y_train = smarket[smarket.Year < 2005]['Direction']
X_test = smarket[smarket.Year == 2005][['Lag1', 'Lag2']]
Y_test = smarket[smarket.Year == 2005]['Direction']
X_test.head()

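We first fit the scaler on our training data set. We use the `StandardScaler` from `sklearn`, which centres and rescales each predictor so that it has mean 0 and standard deviation 1 on the training data.

It is *very important* to apply the same fitted scaling to the test data rather than refitting it there. The scaler is part of the model fit, so, just like any other fitted parameter, its means and standard deviations must come from the training data only.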

In [None]:
scaler = StandardScaler().fit(X_train)

In [None]:
x_train = scaler.transform(X_train)
x_test = scaler.transform(X_test)
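As a quick check of what the scaler does, the following sketch uses synthetic stand-in data (not the `Smarket` predictors) and the illustrative names `train`, `test` and `manual`. It shows that the training columns end up exactly standardised, while the test columns are transformed with the stored training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-ins for the training and test predictors.
train = rng.normal(loc=5.0, scale=2.0, size=(200, 2))
test = rng.normal(loc=5.0, scale=2.0, size=(50, 2))

scaler = StandardScaler().fit(train)    # statistics come from the training data only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)    # reuses the training mean and scale

# The training columns now have mean 0 and standard deviation 1 ...
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
# ... while the test columns are only approximately standardised.
print(test_scaled.mean(axis=0), test_scaled.std(axis=0))

# The transformation is (x - mean_) / scale_ with the stored training statistics.
manual = (test - scaler.mean_) / scaler.scale_
print(np.allclose(manual, test_scaled))
```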

With the training and test data properly scaled, we are now ready to fit a KNN classifier. 

We first look at a KNN classifier with $k=1$.

In [None]:
knn1 = KNeighborsClassifier(1)
knn1_fit = knn1.fit(x_train, Y_train)

In [None]:
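pred = knn1_fit.predict(x_test)
# With this argument order the rows are predicted labels and the columns are true labels,
# i.e. the transpose of sklearn's usual confusion_matrix(y_true, y_pred) layout.
confusion_matrix(pred, Y_test)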

In [None]:
print(classification_report(Y_test, pred))

The results are rather poor: only about 50% of responses are predicted correctly, which is no better than random chance.
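The overall accuracy can be read directly off a confusion matrix: it is the sum of the diagonal (correct predictions) divided by the total count. The sketch below illustrates this on six made-up labels (hypothetical values, not the `Smarket` results).

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels, for illustration only.
y_true = np.array(["Up", "Up", "Down", "Down", "Up", "Down"])
y_pred = np.array(["Up", "Down", "Down", "Up", "Up", "Up"])

# Rows are true labels, columns are predicted labels, in the order given by `labels`.
cm = confusion_matrix(y_true, y_pred, labels=["Down", "Up"])
print(cm)

# Accuracy is the share of observations on the diagonal.
accuracy = np.trace(cm) / cm.sum()
print(accuracy, accuracy_score(y_true, y_pred))
```

Because the diagonal is unchanged by transposition, this calculation gives the same accuracy whichever way round the two label vectors are passed.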

Remember that $k=1$ is a model with high flexibility. So we expect low bias and high variance.
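One way to see this flexibility is that a 1-nearest-neighbour classifier reproduces its training labels perfectly, because every training point is its own nearest neighbour. The sketch below shows this on synthetic data whose labels are pure noise (an assumption made for illustration; it is not the `Smarket` data).

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
# Synthetic data: labels are drawn independently of the predictors (pure noise).
X = rng.normal(size=(400, 2))
y = rng.choice(["Down", "Up"], size=400)
X_tr, y_tr = X[:300], y[:300]
X_te, y_te = X[300:], y[300:]

knn = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)
train_acc = knn.score(X_tr, y_tr)   # each training point is its own nearest neighbour
test_acc = knn.score(X_te, y_te)    # unseen points: no signal to recover
print(f"training accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

The gap between training and test accuracy is the high variance at work: the classifier fits the noise in the training sample.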

Let's visualise the KNN classifier with $k=1$. We use the same utility functions as in the previous labs.

In [None]:
ax = sns.scatterplot(x=x_test[:, 0], y=x_test[:, 1], hue=Y_test)
ax = utils.plot_decision_contour(x_test[:, 0], x_test[:, 1],
                                 knn1_fit.predict_proba, ax=ax)
ax = utils.plot_decision_boundaries(x_test[:, 0], x_test[:, 1],
                                   knn1_fit.predict_proba, ax=ax)

We now try a KNN classifier with $k=3$.

In [None]:
knn3 = KNeighborsClassifier(3)
knn3_fit = knn3.fit(x_train, Y_train)

In [None]:
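pred = knn3_fit.predict(x_test)
# With this argument order the rows are predicted labels and the columns are true labels,
# i.e. the transpose of sklearn's usual confusion_matrix(y_true, y_pred) layout.
confusion_matrix(pred, Y_test)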

In [None]:
print(classification_report(Y_test, pred))

The results are still poor. It turns out increasing $k$ further does not improve the situation. Feel free to explore! 
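As a starting point for that exploration, here is a small sketch of a helper that fits a KNN classifier for several values of $k$ and records the test accuracy. The function name `accuracy_by_k` is hypothetical, and the demonstration uses synthetic stand-in data so that the block runs on its own.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def accuracy_by_k(x_train, y_train, x_test, y_test, ks):
    """Return {k: test accuracy} for a KNN classifier fitted with each k in ks."""
    results = {}
    for k in ks:
        model = KNeighborsClassifier(n_neighbors=k).fit(x_train, y_train)
        results[k] = model.score(x_test, y_test)
    return results

# Synthetic stand-in data with a weak, noisy relationship between predictors and label.
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 2))
noise = rng.normal(scale=2.0, size=500)
y = np.where(X[:, 0] + X[:, 1] + noise > 0, "Up", "Down")

scores = accuracy_by_k(X[:400], y[:400], X[400:], y[400:], ks=[1, 3, 5, 10, 20])
for k, acc in scores.items():
    print(f"k={k:2d}: test accuracy {acc:.3f}")
```

In this notebook, the same helper could be called as `accuracy_by_k(x_train, Y_train, x_test, Y_test, ks=range(1, 21))`.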

Let's visualise the KNN classifier with $k=3$. We use the same utility functions as before.

In [None]:
ax = sns.scatterplot(x=x_test[:, 0], y=x_test[:, 1], hue=Y_test)
ax = utils.plot_decision_contour(x_test[:, 0], x_test[:, 1],
                                 knn3_fit.predict_proba, ax=ax)
ax = utils.plot_decision_boundaries(x_test[:, 0], x_test[:, 1],
                                   knn3_fit.predict_proba, ax=ax)

For this particular problem, out of all the methods we have tried so far, QDA seems to perform best.