#### Introduction to Statistical Learning, Exercise 5.2

__Please do yourself a favour and only look at the solutions after you have honestly tried to solve the exercises.__

# Bootstrap versus GLM Standard Error Report

We again look at the `Default` data set. We would like to predict the probability of `default` based on the predictors `income` and `balance`. The standard errors reported by the `GLM` fit are compared to those estimated by a bootstrap.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf
import statsmodels.api as sm
import patsy
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.utils import resample
from islpy import datasets, utils, lmplots
sns.set()
%matplotlib inline

### A.  Standard Errors from Logistic Regression Fit

Fit a logistic regression model that predicts the probability of `default` based on the predictors `income` and `balance`. Report the standard errors of the parameter estimates given by the `GLM` fit.

In [None]:
default = datasets.Default()
default.head()

In [None]:
Y_train, X_train = patsy.dmatrices('default~income+balance', default, return_type='dataframe')
# patsy one-hot encodes the categorical response; keep only the `default[Yes]` indicator.
Y_train.drop('default[No]', axis=1, inplace=True)

In [None]:
fit = sm.GLM(Y_train, X_train, family=sm.families.Binomial()).fit()
fit.summary().tables[1]
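
For intuition about where the standard errors in such a table come from: for a maximum-likelihood logistic fit they are commonly taken as the square roots of the diagonal of the inverse Fisher information $(X^T W X)^{-1}$, with $W$ the diagonal matrix of fitted Bernoulli variances $p_i(1-p_i)$. The following is a minimal NumPy sketch under stated assumptions, not the statsmodels implementation: it uses synthetic data instead of `Default`, and `logistic_fit_with_se`, `X_demo`, `y_demo` and `true_beta` are hypothetical names introduced only for this illustration.

```python
import numpy as np


def logistic_fit_with_se(X, y, n_iter=25):
    """Newton-Raphson fit of a logistic model; X must already contain an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))        # fitted probabilities
        w = p * (1.0 - p)                          # Bernoulli variances
        information = X.T @ (X * w[:, None])       # Fisher information X' W X
        beta = beta + np.linalg.solve(information, X.T @ (y - p))
    # Recompute the information at the final estimate and invert it.
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    w = p * (1.0 - p)
    cov = np.linalg.inv(X.T @ (X * w[:, None]))
    return beta, np.sqrt(np.diag(cov))


# Synthetic data (assumption: one standard-normal predictor, known coefficients).
rng = np.random.default_rng(0)
n = 2000
X_demo = np.column_stack([np.ones(n), rng.normal(size=n)])
true_beta = np.array([-1.0, 2.0])
y_demo = rng.binomial(1, 1.0 / (1.0 + np.exp(-X_demo @ true_beta)))

beta_hat, se_hat = logistic_fit_with_se(X_demo, y_demo)
print("estimates:      ", beta_hat)
print("standard errors:", se_hat)
```

These model-based standard errors rely on the logistic model being correctly specified, which is the assumption the bootstrap in part __B__ lets us check empirically.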

### B. Bootstrap

Write a `bootstrap()` function, as demonstrated in the lab, to estimate the standard errors of the model parameters.

This is a rather technical exercise. The concepts are the same as in the lab, but you won't be able to just copy the function. You'll have to adapt it to the `GLM` interface in one way or another. This is very common in everyday work.

In [None]:
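def bootstrap(y, x, model, r):
    # `model` is accepted for interface compatibility but not used:
    # every replicate refits a Binomial GLM on the resampled rows.
    n_coeff = x.shape[1]
    params = np.zeros((r, n_coeff))
    for i in range(r):
        # Draw n rows with replacement, keeping x and y aligned.
        xs, ys = resample(x, y, n_samples=x.shape[0])
        fit = sm.GLM(ys, xs, family=sm.families.Binomial()).fit()
        # np.asarray avoids positional indexing on a pandas Series.
        params[i, :] = np.asarray(fit.params)

    # Bootstrap estimate: mean of the replicates.
    betas = params.mean(axis=0)
    # Bootstrap standard error: sample standard deviation (divisor r - 1).
    errors = params.std(axis=0, ddof=1)

    return betas, errors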

In [None]:
betas, errors = bootstrap(Y_train, X_train, fit.model, 1000)

In [None]:
betas, errors
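
The bootstrap standard error is the sample standard deviation of the $r$ replicates. It can be written either by rescaling the population variance with $r/(r-1)$ or by passing `ddof=1` to NumPy; both give the same number. A small sketch with made-up replicates (the array `replicates` is a hypothetical stand-in for one column of bootstrap coefficients, not output from the `Default` fit):

```python
import numpy as np

rng = np.random.default_rng(1)
r = 1000
# Hypothetical stand-in for the r bootstrap replicates of a single coefficient.
replicates = rng.normal(loc=0.5, scale=2.0, size=r)

# Population variance (divisor r) rescaled to the sample variance (divisor r - 1).
se_manual = np.sqrt(r * np.var(replicates) / (r - 1))
# The same quantity via the ddof argument.
se_ddof = np.std(replicates, ddof=1)

print(se_manual, se_ddof)
```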

### C. Comment on the Results

Comment on the parameter estimates and standard error estimates from __A__ and __B__. Does the bootstrap work on this model and data set? How many samples ($B$ from the lectures) do you need to obtain good results from the bootstrap?

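With $B=1000$ the bootstrap performs very well on this data set: the bootstrap estimates and standard errors from __B__ can be compared directly, coefficient by coefficient, with the `GLM` values from __A__.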
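
To get a feeling for how the choice of $B$ matters, one can watch the bootstrap standard error settle as $B$ grows. The sketch below is an illustration under stated assumptions rather than a rerun of the exercise: it uses a synthetic sample and the sample mean as the statistic (so that an analytic standard error $s/\sqrt{n}$ is available for comparison and the loop stays fast), and `bootstrap_se_of_mean` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic, skewed sample (assumption: not the Default data set).
sample = rng.exponential(scale=1.0, size=200)
analytic_se = sample.std(ddof=1) / np.sqrt(sample.size)


def bootstrap_se_of_mean(data, n_boot, rng):
    """Standard deviation of the mean over n_boot resamples drawn with replacement."""
    idx = rng.integers(0, data.size, size=(n_boot, data.size))
    return data[idx].mean(axis=1).std(ddof=1)


for n_boot in (10, 100, 1000, 5000):
    se = bootstrap_se_of_mean(sample, n_boot, rng)
    print(f"B={n_boot:>5}  bootstrap SE={se:.4f}  analytic SE={analytic_se:.4f}")
```

For small $B$ the estimate typically fluctuates noticeably from run to run; the same experiment can be repeated with the `bootstrap()` function from __B__ by varying its `r` argument, at the cost of refitting the `GLM` for every replicate.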