#### Introduction to Statistical Learning, Lab 4.6

# Caravan Insurance Data


We will now perform a classification analysis on the `Caravan` data set, trying to predict `Purchase`, which indicates whether an individual purchases a caravan insurance policy.





In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf
import statsmodels.api as sm
import patsy
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report
from islpy import datasets, utils, lmplots
sns.set()
%matplotlib inline

We first load the data set.

In [None]:
help(datasets.Caravan)

In [None]:
caravan = datasets.Caravan()
caravan.head()

In this data set only 6% of customers purchased an insurance policy.

In [None]:
caravan['Purchase'].describe()

We split the data into a training set and a test set, using the first 1000 observations as test data and the rest as training data.

In [None]:
test = caravan[:1000]
x_test = test[caravan.columns.drop('Purchase')]
y_test = test['Purchase']
train = caravan[1000:]
x_train = train[caravan.columns.drop('Purchase')]
y_train = train['Purchase']

The KNN classifier relies on the distances between observations, computed from their predictors. This raises the question of what the proper distances are. If every predictor were measured in the same units (say, metres), that would not be an issue.

But we are often faced with data sets containing predictors such as `income` (measured in thousands of dollars) and `age` (measured in years).

Intuitively, we know that an age difference of 50 years is more important than an income difference of $1000.

The computer does not know that, though. We therefore have to *normalise* our data.
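
To make the scale problem concrete, here is a minimal sketch using three made-up customers (the values and the helper `distances_to_first` are purely illustrative and not part of the Caravan data). On the raw values the income difference dominates the Euclidean distance; after standardising each column, the two differences end up on a comparable footing.

```python
import numpy as np

# Hypothetical customers: columns are income (in dollars) and age (in years).
customers = np.array([
    [50_000.0, 25.0],   # reference customer
    [51_000.0, 25.0],   # same age, income differs by $1000
    [50_000.0, 75.0],   # same income, age differs by 50 years
])

def distances_to_first(x):
    """Euclidean distance from the first row to every other row."""
    return np.linalg.norm(x[1:] - x[0], axis=1)

raw = distances_to_first(customers)

# Standardise each column to zero mean and unit variance.
scaled_customers = (customers - customers.mean(axis=0)) / customers.std(axis=0)
scaled = distances_to_first(scaled_customers)

print("raw distances:   ", raw)     # the income difference dominates
print("scaled distances:", scaled)  # neither predictor dominates any more
```

With only three rows the standardisation is of course artificial; the point is merely that the unit of measurement, not the importance of the predictor, decides the raw distance.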

We use the `StandardScaler` from `sklearn` and train it on the training data set. 

It is *very important* to apply the *same* scaling to the training and test data, just as with any other step of the model fit. For the distances it makes little difference whether the scaling is determined from the training data or the full data set, although fitting it on the training data alone keeps information from the test set out of the model. What matters most is that the same scaling is applied to both data sets.
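
The idea of estimating the scaling once and reusing it can be sketched by hand with NumPy. The data below is synthetic and the variable names are assumptions made for this illustration. The calculation is column-wise mean and standard deviation, which is the same standardisation `StandardScaler` performs.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical predictors on very different scales (think income and age).
x = rng.normal(loc=[50_000.0, 40.0], scale=[10_000.0, 12.0], size=(200, 2))
x_test_demo, x_train_demo = x[:50], x[50:]

# Estimate the scaling parameters once, from the training rows only.
mean = x_train_demo.mean(axis=0)
std = x_train_demo.std(axis=0)

# Apply the *same* parameters to both sets.
x_train_scaled = (x_train_demo - mean) / std
x_test_scaled = (x_test_demo - mean) / std

print(x_train_scaled.mean(axis=0), x_train_scaled.std(axis=0))
print(x_test_scaled.mean(axis=0), x_test_scaled.std(axis=0))
```

The training columns come out with mean 0 and standard deviation 1 by construction. The test columns are only approximately standardised, which is expected and harmless.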

In [None]:
scaler = StandardScaler().fit(x_train[x_train.columns])

In [None]:
x_train = pd.DataFrame(scaler.transform(x_train[x_train.columns]), columns=x_train.columns)
x_test = pd.DataFrame(scaler.transform(x_test[x_test.columns]), columns=x_test.columns)

In [None]:
x_train.head()

With the training and test data properly scaled, we are now ready to fit a KNN classifier. 

We first look at a KNN classifier with $k=1$.

In [None]:
knn1 = KNeighborsClassifier(1).fit(x_train, y_train)

In [None]:
pred = knn1.predict(x_test)

In [None]:
(pred != y_test).mean()

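In [None]:
(y_test != 'No').mean()

The test error rate of about 12% seems quite good at first. But keep in mind the low prior probability of `Purchase`: the cell above gives the share of actual purchasers in the test set, which is exactly the error rate we would get by always predicting `No`. With only about 6% of customers purchasing, that trivial rule beats the KNN error rate!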
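
The baseline argument can be checked on a small made-up example. The labels below are hypothetical, chosen so that about 6% of them are `Yes`; they are not the Caravan test set.

```python
import numpy as np

# Hypothetical test labels: 60 purchasers out of 1000 (about 6%).
y_demo = np.array(['Yes'] * 60 + ['No'] * 940)

# A "classifier" that ignores the predictors and always answers 'No'.
always_no = np.full(y_demo.shape, 'No')

# Its error rate is simply the share of 'Yes' labels.
baseline_error = (always_no != y_demo).mean()
print(f"Baseline error rate: {baseline_error:.1%}")
```

Any classifier on such imbalanced data should be judged against this baseline rather than against an error rate of 50%.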

In [None]:
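# Note the argument order: passing the predictions first puts them on the
# rows of the matrix, with the true labels on the columns.
cm = confusion_matrix(pred, y_test)
print(cm)
# Fraction of predicted 'Yes' cases that really are 'Yes'.
yes_pred_rate = cm[1, 1] / (cm[1, 0] + cm[1, 1])
print(f"Correct prediction for 'Yes': {100*yes_pred_rate:.2f}%")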

Among the customers we *predict* to buy insurance, the share who actually do is about double the prior probability!
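
Because the orientation of the table is easy to mix up, here is a small sketch on hypothetical labels (not the Caravan data). It builds the table by hand with predictions on the rows, as in the cell above, and computes two different rates from it: the share of `Yes` predictions that are correct (often called precision) and the share of actual purchasers that were found (often called recall).

```python
import numpy as np

# Hypothetical labels, not taken from the Caravan data.
y_true = np.array(['No'] * 90 + ['Yes'] * 10)
y_hat = np.array(['No'] * 84 + ['Yes'] * 6 + ['Yes'] * 3 + ['No'] * 7)

labels = ['No', 'Yes']
# Rows index the prediction, columns the true label, matching the call
# confusion_matrix(pred, y_test) used in this lab.
table = np.array([[np.sum((y_hat == p) & (y_true == t)) for t in labels]
                  for p in labels])
print(table)

# Share of 'Yes' predictions that are correct (precision).
precision = table[1, 1] / (table[1, 0] + table[1, 1])
# Share of actual 'Yes' cases that were predicted as 'Yes' (recall).
recall = table[1, 1] / (table[0, 1] + table[1, 1])
print(f"precision: {precision:.2f}, recall: {recall:.2f}")
```

The rate printed in this lab is the first of the two: it divides by a row total, that is, by the number of `Yes` predictions.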

Let's see whether we can further improve this with different values of $k$. We first try a KNN classifier with $k=3$.

In [None]:
knn3 = KNeighborsClassifier(3).fit(x_train, y_train)

pred = knn3.predict(x_test)

cm = confusion_matrix(pred, y_test)
print(cm)
yes_pred_rate = cm[1, 1] /(cm[1, 0] + cm[1, 1])
print(f"Correct prediction for 'Yes': {100*yes_pred_rate:.2f}%")

The KNN classifier with $k=3$ is an improvement. We now try a KNN classifier with $k=5$.

In [None]:
knn5 = KNeighborsClassifier(5).fit(x_train, y_train)

pred = knn5.predict(x_test)

cm = confusion_matrix(pred, y_test)
print(cm)
yes_pred_rate = cm[1, 1] /(cm[1, 0] + cm[1, 1])
print(f"Correct prediction for 'Yes': {100*yes_pred_rate:.2f}%")

The KNN classifier with $k=5$ performs best so far, certainly better than random guessing.
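
To show what the classifier does for different $k$, the following is a hand-written sketch of KNN in NumPy, run in a loop over $k$ on synthetic data. It is not how `KNeighborsClassifier` is implemented, and the function `knn_predict` and the data are assumptions made for this illustration only.

```python
import numpy as np

def knn_predict(x_train, y_train, x_new, k):
    """Classify each row of x_new by majority vote among its k nearest
    training rows (Euclidean distance). Labels must be 0 or 1."""
    # Pairwise squared distances, shape (n_new, n_train).
    diffs = x_new[:, None, :] - x_train[None, :, :]
    dist2 = (diffs ** 2).sum(axis=2)
    # Indices of the k closest training rows for every new row.
    nearest = np.argsort(dist2, axis=1)[:, :k]
    # Fraction of positive neighbours; a strict majority predicts 1.
    votes = y_train[nearest].mean(axis=1)
    return (votes > 0.5).astype(int)

rng = np.random.default_rng(1)
# Synthetic, already standardised data with a rare positive class.
x_demo = rng.normal(size=(600, 2))
y_demo = (x_demo[:, 0] + rng.normal(scale=1.0, size=600) > 1.8).astype(int)
x_tr, y_tr = x_demo[200:], y_demo[200:]
x_te, y_te = x_demo[:200], y_demo[:200]

for k in (1, 3, 5):  # odd values of k avoid tied votes
    pred_k = knn_predict(x_tr, y_tr, x_te, k)
    n_yes = pred_k.sum()
    # Guard against division by zero when nothing is predicted positive.
    precision = (pred_k & y_te).sum() / n_yes if n_yes else float('nan')
    print(f"k={k}: {n_yes} positive predictions, precision {precision:.2f}")
```

Larger values of $k$ average over more neighbours, so with a rare class fewer observations tend to be predicted positive. Whether the predictions that remain are more accurate depends on the data.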

We now try a logistic regression for comparison.

In [None]:
x_train_lr = patsy.dmatrix('+'.join(x_train.columns), x_train, return_type='dataframe')
y_train_lr = (y_train == 'Yes').values
x_test_lr = patsy.dmatrix('+'.join(x_test.columns), x_test, return_type='dataframe')
y_test_lr = (y_test == 'Yes').values

In [None]:
lm = sm.GLM(y_train_lr, x_train_lr, family=sm.families.Binomial()).fit()
probs = lm.predict(x_test_lr)

In [None]:
pred = probs > 0.5

cm = confusion_matrix(pred, y_test_lr)
print(cm)
yes_pred_rate = cm[1, 1] /(cm[1, 0] + cm[1, 1])
print(f"Correct prediction for 'Yes': {100*yes_pred_rate:.2f}%")

This doesn't seem to work well. We predict 7 purchases and *all* of them are wrong!

But we don't have to place our *working point*, the probability cut-off above which we predict a purchase, at 0.5:

In [None]:
pred = probs > 0.25

cm = confusion_matrix(pred, y_test_lr)
print(cm)
yes_pred_rate = cm[1, 1] /(cm[1, 0] + cm[1, 1])
print(f"Correct prediction for 'Yes': {100*yes_pred_rate:.2f}%")

This works much better: we are right for 33% of our `Yes` predictions. That is about five times better than random guessing!
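
The effect of moving the cut-off can be sketched on a handful of made-up probabilities. The arrays and the helper `precision_at` below are hypothetical and unrelated to the fitted model above.

```python
import numpy as np

# Hypothetical predicted probabilities and true outcomes (1 = purchase).
probs_demo = np.array([0.05, 0.10, 0.20, 0.28, 0.30, 0.35, 0.45, 0.55, 0.60, 0.02])
truth_demo = np.array([0, 0, 0, 1, 0, 1, 0, 0, 1, 0])

def precision_at(threshold):
    """Return (number of positive predictions, share of them that are correct)."""
    predicted_yes = probs_demo > threshold
    n_yes = int(predicted_yes.sum())
    correct = int((predicted_yes & (truth_demo == 1)).sum())
    return n_yes, (correct / n_yes if n_yes else float('nan'))

for threshold in (0.5, 0.25):
    n_yes, precision = precision_at(threshold)
    print(f"threshold {threshold}: {n_yes} predicted purchases, precision {precision:.2f}")
```

Lowering the cut-off always flags at least as many customers. Whether the share of correct `Yes` predictions rises or falls depends on how well the probabilities rank the customers, so the cut-off is best chosen with the cost of a wasted sales effort in mind.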