#### Introduction to Statistical Learning, Exercise 4.2

__Please do yourself a favour and only look at the solutions after you have honestly tried to solve the exercises.__

# Classification on the Auto Data Set

In this exercise you will develop a model to predict whether a given car has high or low mileage based on the `Auto` data set.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf
import statsmodels.api as sm
import patsy
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report
from islpy import datasets, utils, lmplots
sns.set()
%matplotlib inline

### A. Creating a Class Variable

Using the `Auto` data set, create a categorical (class) variable `highmpg` that is `True` if `mpg` is above the median of the `mpg` distribution and `False` otherwise.

Then create a new data frame from the `Auto` data set with `mpg` replaced by `highmpg`.

In [None]:
auto = datasets.Auto()
auto.set_index('name', inplace=True)
auto.head()

In [None]:
auto['highmpg'] = auto.mpg > auto.mpg.median()
auto.drop('mpg', axis=1, inplace=True)
auto.head()
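
A median split should produce two classes of roughly equal size, which is worth checking before fitting any classifier. The sketch below uses a small made-up `mpg` series (not values from the `Auto` data set) to show the check; the names `mpg_toy`, `highmpg_toy` and `class_counts` are introduced here for illustration only.

```python
import pandas as pd

# Toy mpg values, made up for illustration (not taken from the Auto data set).
mpg_toy = pd.Series([18.0, 15.0, 36.0, 24.0, 31.0, 14.0, 27.0, 22.0], name='mpg')

# Same rule as in the exercise: True if strictly above the median.
highmpg_toy = mpg_toy > mpg_toy.median()

# Number of observations in each class.
class_counts = highmpg_toy.value_counts()
print(class_counts)
```

In the notebook the same check would be `auto['highmpg'].value_counts()`. Observations exactly equal to the median end up in the `False` class, so the two classes need not be perfectly balanced.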

### B. Visualisation

Explore the data set graphically. Which of the predictors might be useful for predicting `highmpg`? Scatter plots and box plots are useful to answer this question. 

In [None]:
# relplot is a figure-level function and returns a FacetGrid, not an Axes
g = sns.relplot(x='horsepower', y='displacement', data=auto,
                hue='highmpg', size='weight')

As one might expect, `horsepower`, `weight` and `displacement` seem to be good predictors.

In [None]:
fig, (ax1, ax2, ax3, ax4) = plt.subplots(1, 4, figsize=(12, 5))
ax = sns.boxplot(x='highmpg', y='horsepower', data=auto, ax=ax1)
ax = sns.boxplot(x='highmpg', y='weight', data=auto, ax=ax2)
ax = sns.boxplot(x='highmpg', y='displacement', data=auto, ax=ax3)
ax = sns.boxplot(x='highmpg', y='acceleration', data=auto, ax=ax4)
plt.tight_layout()

The box plots confirm our previous findings. The `acceleration` predictor, by contrast, shows only a weak association with `highmpg`.

In [None]:
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 5))
ax = sns.scatterplot(x='horsepower', y='acceleration', data=auto, hue='highmpg', ax=ax1)
ax = sns.scatterplot(x='displacement', y='acceleration', data=auto, hue='highmpg', ax=ax2)
ax = sns.scatterplot(x='weight', y='acceleration', data=auto, hue='highmpg', ax=ax3)
plt.tight_layout()

Further scatter plots suggest that `acceleration` is not a good predictor. In particular, it is strongly correlated with `horsepower` and `displacement`. `horsepower` and `displacement` are also strongly correlated with each other.

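We conclude that `horsepower` and `weight` are good predictors. Since `displacement` is strongly correlated with `horsepower`, it is unlikely to add much information and is left out.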
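
The visual impressions from the plots can be backed up with numbers. The following is a minimal sketch, assuming a data frame with numeric predictors and a boolean target; `predictor_correlations` is a hypothetical helper introduced here, and the data it is run on is synthetic, so the printed values say nothing about the real `Auto` data.

```python
import numpy as np
import pandas as pd


def predictor_correlations(df, predictors, target):
    """Return the predictor-predictor correlation matrix and the
    correlation of each predictor with the (boolean) target."""
    # Casting to float turns True/False into 1.0/0.0.
    numeric = df[predictors + [target]].astype(float)
    corr = numeric.corr()
    return corr.loc[predictors, predictors], corr.loc[predictors, target]


# Synthetic stand-in data: heavier toy cars have more horsepower,
# and the lighter half is labelled as high mileage.
rng = np.random.default_rng(0)
weight = rng.normal(3000, 800, size=200)
horsepower = 0.03 * weight + rng.normal(0, 10, size=200)
toy = pd.DataFrame({'horsepower': horsepower, 'weight': weight})
toy['highmpg'] = toy['weight'] < toy['weight'].median()

between, with_target = predictor_correlations(
    toy, ['horsepower', 'weight'], 'highmpg')
print(between.round(2))
print(with_target.round(2))
```

In the notebook, a call such as `predictor_correlations(auto, ['horsepower', 'weight', 'displacement', 'acceleration'], 'highmpg')` should work in the same way, provided these columns are numeric.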

### C. Training and Test Data Set

Split the data set evenly into a training and a test data set. Find a way to ensure there is no bias in the observations selected for training.

We split the data set by taking alternating rows (observations): even-positioned rows go to the training set and odd-positioned rows to the test set. This should avoid bias in case the observations are ordered in some way.

In [None]:
train = auto.iloc[::2]
test = auto.iloc[1::2]

In [None]:
train.head()

In [None]:
test.head()
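
The alternating split above is deterministic. Another common option, shown here only as a possible alternative and not as the approach used in this solution, is a shuffled split that also keeps the class proportions equal in both halves. The sketch uses scikit-learn's `train_test_split` on a made-up toy frame; `toy`, `train_toy` and `test_toy` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with made-up values, 20 rows and balanced classes.
toy = pd.DataFrame({
    'horsepower': np.arange(50, 150, 5),
    'highmpg': [True, False] * 10,
})

# Shuffle, split 50/50 and stratify on the class label so that both
# halves contain the same share of high-mileage cars.
train_toy, test_toy = train_test_split(
    toy, test_size=0.5, shuffle=True,
    stratify=toy['highmpg'], random_state=0)

print(train_toy['highmpg'].value_counts())
print(test_toy['highmpg'].value_counts())
```

Fixing `random_state` keeps the split reproducible between runs.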

### D. LDA Classifier

Perform an LDA on the training data set using the variables you deemed most useful in __B__ in order to predict `highmpg`. What is the *test error* of the obtained model?  

In [None]:
x_train = train[['horsepower', 'weight']]
y_train = train['highmpg']
x_test = test[['horsepower', 'weight']]
y_test = test['highmpg']

In [None]:
fit = LinearDiscriminantAnalysis().fit(x_train, y_train)

In [None]:
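pred = fit.predict(x_test)
# scikit-learn expects the true labels first, then the predictions
cm = confusion_matrix(y_test, pred)
# Misclassification rate: 1 - (correct predictions / all predictions)
test_error = 1 - (cm[0, 0] + cm[1, 1]) / x_test.shape[0]
# test_error is a fraction, so let the format spec convert it to percent
print(f'Test error rate: {test_error:0.2%}')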
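
The test error used throughout this notebook is the misclassification rate, that is, one minus the share of observations on the diagonal of the confusion matrix. The sketch below, on made-up labels and predictions, shows that this is the same number as the mean of the mismatches and as one minus the accuracy. The variable names are introduced here for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels and predictions, for illustration only.
y_true = np.array([True, True, False, False, True, False, True, False])
y_hat = np.array([True, False, False, False, True, True, True, False])

# Rows are true classes, columns are predicted classes.
cm_toy = confusion_matrix(y_true, y_hat)

# Three equivalent ways to get the misclassification rate.
error_from_cm = 1 - np.trace(cm_toy) / cm_toy.sum()
error_from_mean = np.mean(y_hat != y_true)
error_from_accuracy = 1 - accuracy_score(y_true, y_hat)

# The rate is a fraction between 0 and 1; the % format spec
# multiplies by 100 and appends the percent sign.
print(f'Test error rate: {error_from_cm:0.2%}')  # prints "Test error rate: 25.00%"
```

Because only the diagonal enters the calculation, the error rate does not depend on the order of the two arguments to `confusion_matrix`, although the off-diagonal entries do.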

In [None]:
x1 = x_train['horsepower']
x2 = x_train['weight']
ax = sns.scatterplot(x=x1, y=x2, hue=y_train)
ax = utils.plot_decision_contour(x1, x2, fit.predict_proba, ax=ax)
ax = utils.plot_decision_boundaries(x1, x2, fit.predict_proba, ax=ax)

### E. QDA Classifier

Repeat __D__ with a QDA model.

In [None]:
fit = QuadraticDiscriminantAnalysis().fit(x_train, y_train)

In [None]:
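pred = fit.predict(x_test)
# True labels first, then the predictions
cm = confusion_matrix(y_test, pred)
test_error = 1 - (cm[0, 0] + cm[1, 1]) / x_test.shape[0]
# test_error is a fraction, so let the format spec convert it to percent
print(f'Test error rate: {test_error:0.2%}')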

In [None]:
x1 = x_train['horsepower']
x2 = x_train['weight']
ax = sns.scatterplot(x=x1, y=x2, hue=y_train)
ax = utils.plot_decision_contour(x1, x2, fit.predict_proba, ax=ax)
ax = utils.plot_decision_boundaries(x1, x2, fit.predict_proba, ax=ax)

### F. Logistic Regression

Repeat __D__ with a logistic regression model.

In [None]:
x_train_lr = patsy.dmatrix('horsepower+weight', train, return_type='dataframe')
x_test_lr = patsy.dmatrix('horsepower+weight', test, return_type='dataframe')

In [None]:
fit = sm.GLM(y_train, x_train_lr, family=sm.families.Binomial()).fit()
fit.summary()

In [None]:
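# Classify as high mileage if the predicted probability exceeds 0.5
pred = fit.predict(x_test_lr) > 0.5
# True labels first, then the predictions
cm = confusion_matrix(y_test, pred)
test_error = 1 - (cm[0, 0] + cm[1, 1]) / x_test_lr.shape[0]
# test_error is a fraction, so let the format spec convert it to percent
print(f'Test error rate: {test_error:0.2%}')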
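
The 0.5 threshold on the predicted probability corresponds to a linear decision boundary in the two predictors. This follows from the standard form of the logistic model, written here with coefficient names $\beta_0, \beta_1, \beta_2$ that are introduced only for this illustration:

$$
p(X) = \Pr(\text{highmpg} = 1 \mid X) = \frac{e^{\beta_0 + \beta_1\,\text{horsepower} + \beta_2\,\text{weight}}}{1 + e^{\beta_0 + \beta_1\,\text{horsepower} + \beta_2\,\text{weight}}}
$$

$$
p(X) > 0.5 \iff \beta_0 + \beta_1\,\text{horsepower} + \beta_2\,\text{weight} > 0
$$

The estimated coefficients appear in the output of `fit.summary()`.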

### G. KNN Classifier

Repeat __D__ with a KNN classifier. Vary the value of $k$. Which value of $k$ gives the best (lowest) test error rate?

In [None]:
best_error = 1.0
best_k = 1
best_fit = None
# Try k = 1, ..., 10 and keep the model with the lowest test error
for k in range(1, 11):
    fit = KNeighborsClassifier(n_neighbors=k).fit(x_train, y_train)
    pred = fit.predict(x_test)
    # True labels first, then the predictions
    cm = confusion_matrix(y_test, pred)
    test_error = 1 - (cm[0, 0] + cm[1, 1]) / x_test.shape[0]
    if test_error < best_error:
        best_error = test_error
        best_k = k
        best_fit = fit

In [None]:
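pred = best_fit.predict(x_test)
# True labels first, then the predictions
cm = confusion_matrix(y_test, pred)
test_error = 1 - (cm[0, 0] + cm[1, 1]) / x_test.shape[0]
# test_error is a fraction, so let the format spec convert it to percent
print(f'Test error rate (k={best_k}): {test_error:0.2%}')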

In [None]:
x1 = x_train['horsepower']
x2 = x_train['weight']
ax = sns.scatterplot(x=x1, y=x2, hue=y_train)
ax = utils.plot_decision_contour(x1, x2, best_fit.predict_proba, ax=ax)
ax = utils.plot_decision_boundaries(x1, x2, best_fit.predict_proba, ax=ax)
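
KNN classifies by distance, and `horsepower` and `weight` are measured on very different scales, so the larger-scale predictor can dominate the distance. `StandardScaler` is imported at the top of this notebook but is not used in the solution above. The following is one possible variant, not the method of this solution, that standardises the predictors inside a pipeline before the KNN step. It runs on synthetic stand-in data, and all names ending in `_toy` are introduced for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for horsepower and weight (not the Auto data).
rng = np.random.default_rng(0)
n = 200
hp_toy = rng.normal(100, 30, size=n)
wt_toy = rng.normal(3000, 800, size=n)
X_toy = np.column_stack([hp_toy, wt_toy])
# In this toy setup the class depends only on the small-scale feature.
y_toy = hp_toy > 100

# Alternating split, as in section C.
X_train_toy, y_train_toy = X_toy[::2], y_toy[::2]
X_test_toy, y_test_toy = X_toy[1::2], y_toy[1::2]

errors_toy = {}
for k in range(1, 11):
    # The scaler is fitted on the training data only and then
    # applied to the test data when predicting.
    model = make_pipeline(StandardScaler(),
                          KNeighborsClassifier(n_neighbors=k))
    model.fit(X_train_toy, y_train_toy)
    errors_toy[k] = np.mean(model.predict(X_test_toy) != y_test_toy)

best_k_toy = min(errors_toy, key=errors_toy.get)
print(f'Best k: {best_k_toy}, test error rate: {errors_toy[best_k_toy]:0.2%}')
```

Note that choosing $k$ by its test error, here and in the solution above, makes the reported error of the best model somewhat optimistic, because the test set has been used for model selection.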