#### Introduction to Statistical Learning, Exercise 5.1

__Please do yourself a favour and only look at the solutions after you have honestly tried to solve the exercises.__

# Validation Set Approach on Default Data Set

We'll look at the validation set approach using the `Default` data set. We would like to predict the probability of `default` based on the predictors `income` and `balance`. 

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf
import statsmodels.api as sm
import patsy
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report
from islpy import datasets, utils, lmplots
sns.set()
%matplotlib inline

### A.  Logistic Regression Fit

Fit a logistic regression model that predicts the probability of `default` based on the predictors `income` and `balance`.

In [None]:
default = datasets.Default()
default.head()

In [None]:
lf = smf.glm('default~income+balance', default, family=sm.families.Binomial()).fit()
lf.summary()

### B. Validation Set

Use the validation set approach to estimate the test error of the model. This involves the following steps.

  - Split the sample set into a training set and a validation set.
  
  - Fit the logistic regression model using only the data in the training sample.
  
  - Use the fitted model to predict the probability of `default` for each individual in the validation set.
  
  - Classify an individual as a defaulter if the predicted probability satisfies $p > 0.5$.
  
  - Compute the validation set error rate, that is, the fraction of individuals in the validation set that are misclassified.
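
Written out, the error rate in the last step is a simple average of misclassification indicators. The notation is introduced here for illustration only: $n_{\text{val}}$ is the number of observations in the validation set, $y_i$ the true label, $\hat{p}_i$ the predicted probability of default and $\hat{y}_i$ the predicted label.

$$
\hat{y}_i = I(\hat{p}_i > 0.5), \qquad
\text{Err}_{\text{val}} = \frac{1}{n_{\text{val}}} \sum_{i=1}^{n_{\text{val}}} I(y_i \neq \hat{y}_i)
$$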
  

In [None]:
train = default.sample(frac=0.5)
test = default.drop(train.index)

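We build the design matrices ourselves with `patsy` so that the encoding of the response is unambiguous: the formula expands `default` into the two indicator columns `default[No]` and `default[Yes]`, and we drop `default[No]` so that the model predicts the probability of `default[Yes]`.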

In [None]:
Y_train, X_train = patsy.dmatrices('default~income+balance', train, return_type='dataframe')
Y_train.drop('default[No]', axis=1, inplace=True)
Y_test, X_test = patsy.dmatrices('default~income+balance', test, return_type='dataframe')
Y_test.drop('default[No]', axis=1, inplace=True)

In [None]:
lf = sm.GLM(Y_train, X_train, family=sm.families.Binomial()).fit()
lf.summary()

In [None]:
pred = lf.predict(X_test)
defaulters = (pred > 0.5)

In [None]:
# sklearn expects (y_true, y_pred): rows are the true classes, columns the predictions
cm = confusion_matrix(Y_test, defaulters)
cm

In [None]:
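# Off-diagonal entries are the misclassifications; divide by the validation set size
# (lf.nobs is the number of training observations)
(cm[0, 1] + cm[1, 0]) / cm.sum()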

### C. Different Validation Sets

Repeat __B__ three times with different validation samples and comment on the results. This is *not* the time to fix a random seed: you want different samples, after all.

You can of course automate this and do it more than three times, if you so wish.

All ingredients are the same as in __B__, so we don't provide Python code for the solution.

You should observe some variation in the estimated classification error rate. 
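
For readers who want to automate the repetition, the loop can be sketched as follows. This is only a minimal sketch under stated assumptions: it deliberately runs on synthetic data rather than `Default`, so it shows the mechanics without giving the exercise away, and to stay self-contained it fits the logistic regression with a few Newton steps in NumPy instead of `statsmodels`. The helper names `fit_logistic` and `validation_error` and the synthetic coefficients are made up for this illustration.

```python
import numpy as np


def fit_logistic(X, y, n_iter=25):
    """Fit a logistic regression by Newton's method; X must contain an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # current fitted probabilities
        grad = X.T @ (y - p)                  # gradient of the log-likelihood
        hess = X.T @ (X * (p * (1.0 - p))[:, None])  # negative Hessian
        beta += np.linalg.solve(hess, grad)   # Newton update
    return beta


def validation_error(X, y, rng, frac=0.5):
    """Split at random, fit on the training part, return the validation error rate."""
    idx = rng.permutation(len(y))
    n_train = int(frac * len(y))
    train, val = idx[:n_train], idx[n_train:]
    beta = fit_logistic(X[train], y[train])
    p_val = 1.0 / (1.0 + np.exp(-X[val] @ beta))
    y_hat = (p_val > 0.5).astype(float)       # classify with the p > 0.5 rule
    return np.mean(y_hat != y[val])


# No seed on purpose: every run and every split should differ.
rng = np.random.default_rng()

# Synthetic stand-in for the real data (assumption: two numeric predictors).
n = 2000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
true_p = 1.0 / (1.0 + np.exp(-(-2.0 + 1.5 * x1 + 0.5 * x2)))
y = (rng.random(n) < true_p).astype(float)

errors = [validation_error(X, y, rng) for _ in range(10)]
print(np.round(errors, 4))
print("mean:", np.mean(errors), "std:", np.std(errors))
```

The spread of the printed values is the quantity of interest here: it indicates how much a single validation set estimate depends on the particular split.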

### D. Including a Qualitative Variable

Now include the qualitative `student` variable and estimate the test error using the validation set approach like above.

Comment on the effect of this variable on the estimated test error.

In [None]:
train = default.sample(frac=0.5)
test = default.drop(train.index)

In [None]:
formula = 'default~income+balance+student'
Y_train, X_train = patsy.dmatrices(formula, train, return_type='dataframe')
Y_train.drop('default[No]', axis=1, inplace=True)
Y_test, X_test = patsy.dmatrices(formula, test, return_type='dataframe')
Y_test.drop('default[No]', axis=1, inplace=True)

In [None]:
X_train.head()

In [None]:
lf = sm.GLM(Y_train, X_train, family=sm.families.Binomial()).fit()
lf.summary()

In [None]:
pred = lf.predict(X_test)
defaulters = (pred > 0.5)

In [None]:
# sklearn expects (y_true, y_pred): rows are the true classes, columns the predictions
cm = confusion_matrix(Y_test, defaulters)
cm

In [None]:
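# Off-diagonal entries are the misclassifications; divide by the validation set size
# (lf.nobs is the number of training observations)
(cm[0, 1] + cm[1, 0]) / cm.sum()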

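In our run the estimated test error rate is slightly higher than without `student`, although one might expect that including the `student` information should help. Since the estimate varies from split to split, as seen in __C__, a small difference like this should be interpreted with care.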

We also note that the $p$-value of the `income` predictor has gone through the roof after including `student`. The two predictors are probably correlated, which is worth investigating.
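
One quick way to look for such a correlation is to compute the correlation between a 0/1 student indicator and `income`. The sketch below uses made-up toy numbers, not the `Default` data, purely to show the computation; the group sizes, income levels and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (made up): students are given a lower typical income than non-students.
n = 1000
is_student = rng.random(n) < 0.3
income = np.where(is_student, 18_000.0, 40_000.0) + rng.normal(0.0, 5_000.0, size=n)

# Correlation between the 0/1 indicator and income (point-biserial correlation).
r = np.corrcoef(is_student.astype(float), income)[0, 1]
print(f"correlation(student, income) = {r:.2f}")
```

On the real data the same check would read something like `np.corrcoef(default['student'] == 'Yes', default['income'])`, assuming `student` is coded as `'Yes'`/`'No'`. When two predictors are strongly correlated, the model has trouble attributing the effect to either one, which inflates the standard errors and hence the $p$-values.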