# Part 1: Cross validation

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from ISLP import load_data
from ISLP.models import (ModelSpec as MS,
                         summarize,
                         poly)
from sklearn.model_selection import train_test_split
from functools import partial
from sklearn.model_selection import \
     (cross_validate,
      KFold,
      ShuffleSplit)
from sklearn.base import clone
from ISLP.models import sklearn_sm

In [None]:
import warnings
warnings.filterwarnings('ignore')

## 1a) Validation set approach
Objective: Use the validation set approach to evaluate the performance of a model that predicts `mpg` in the `Auto` dataset from the predictor `horsepower`.

In [None]:
Auto =  load_data('Auto')

In [None]:
train, validation = train_test_split(Auto, test_size=0.3)

In [None]:
train

In [None]:
validation

In [None]:
design = MS(['horsepower']).fit(train)
X_train = design.transform(train)
X_train

In [None]:
y_train = train.mpg

In [None]:
model = sm.OLS(y_train,X_train)
results = model.fit()
summarize(results)

In [None]:
y_valid_actual = validation.mpg
X_validation = design.transform(validation)
y_valid_predicted = results.predict(X_validation)
MSE = np.mean((y_valid_actual - y_valid_predicted)**2)
MSE

In [None]:
X_validation

In [None]:
def evalMSE(predictors,
           train,
           validation):
    # build design matrix and response vector
    design = MS(predictors).fit(train)
    X_train = design.transform(train)
    y_train = train.mpg 

    # train model
    model = sm.OLS(y_train,X_train)
    results = model.fit()

    # compute MSE
    y_valid_actual = validation.mpg
    X_validation = design.transform(validation)
    y_valid_predicted = results.predict(X_validation)
    MSE = np.mean((y_valid_actual - y_valid_predicted)**2)
    return MSE

In [None]:
evalMSE(['horsepower'], train, validation)

Let's compute the MSE for linear regression models including successively higher polynomial terms of `horsepower`.

In [None]:
predictors = [poly('horsepower', 2)] # polynomial terms of horsepower up to degree 2
MS(predictors).fit_transform(Auto)

In [None]:
MSE

In [None]:
from ISLP.models import poly
MSE = []
for i in range(5):
    predictors = [poly('horsepower', i+1)] # choose powers of horsepower as predictors up to degree i+1
    err = evalMSE(predictors, train, validation)
    MSE.append(err)

In [None]:
MSE

## 1b) Cross validation

Cross validation is most conveniently carried out with `scikit-learn`. To use the `scikit-learn` implementation of cross validation with our `statsmodels` linear model, we use the wrapper `sklearn_sm` provided by the `ISLP` library. From the lab in Chapter 5:

"The class `sklearn_sm()` has as its first argument a model from `statsmodels`. It can take two additional optional arguments: `model_str` which can be used to specify a formula, and `model_args` which should be a dictionary of additional arguments used when fitting the model. For example, to fit a logistic regression model we have to specify a family argument. This is passed as `model_args={'family':sm.families.Binomial()}`."

After specifying our design matrix `X` and the vector `y` we call the `scikit-learn` function `cross_validate` (see [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html)).

The result is a dictionary which contains, among other entries, the `test_score`. This is the entry we are interested in when we use cross validation to estimate the test error.
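The structure of the returned dictionary can be illustrated with a minimal sketch. It rests on two assumptions that are not part of this notebook: it uses synthetic stand-in data instead of `Auto`, and it swaps in `scikit-learn`'s own `LinearRegression` for the `sklearn_sm` wrapper. Because `LinearRegression` reports $R^2$ by default, the sketch requests `scoring='neg_mean_squared_error'` and flips the sign to obtain an MSE per fold.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (hypothetical values, not the Auto dataset)
rng = np.random.default_rng(0)
X_demo = rng.uniform(50, 200, size=(100, 1))
y_demo = 40 - 0.15 * X_demo[:, 0] + rng.normal(0, 3, size=100)

# 5-fold cross validation; scikit-learn returns the negated MSE as score
cv_demo = cross_validate(LinearRegression(), X_demo, y_demo, cv=5,
                         scoring='neg_mean_squared_error')

print(sorted(cv_demo.keys()))        # includes 'test_score'
fold_mse = -cv_demo['test_score']    # one MSE value per fold
cv_mse = fold_mse.mean()             # cross-validated estimate of the test MSE
print(cv_mse)
```

Averaging the per-fold scores gives the single number that is usually reported as the cross-validation estimate of the test error.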

In [None]:
...

The function `cross_validate` is the original `scikit-learn` function which carries out cross validation. We provide the following arguments:
- a model which needs `fit()` and `predict()` methods
- a design matrix `X` and a vector of training labels `y`
- the parameter `cv` specifying the number of folds for cross validation.

See the [official documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html) for more details.

We can repeat this procedure to compare different models. In the following we do this with various polynomial fits:

In [None]:
...

Instead of using $K = n$ folds as above (which results in leave-one-out cross-validation, LOOCV), we can also specify a smaller number $K$ of folds. There are two ways to do this:
- set `cv = K`,
- specify a cross validation generator such as `KFold` (see [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html))

Below we use the second approach, which is generally preferred because `KFold` gives us more control over how the data is split (for example, shuffling and a fixed random seed).
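As a rough sketch of what a `KFold` generator does, the following example splits 20 placeholder observations into 5 shuffled folds. The array `X_kfold_demo` and the chosen `random_state` are assumptions made only for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# 20 placeholder observations (hypothetical data, one feature column)
X_kfold_demo = np.arange(20).reshape(-1, 1)

# shuffle=True randomizes fold membership; random_state makes it reproducible
kfold_demo = KFold(n_splits=5, shuffle=True, random_state=0)

fold_sizes = []
held_out = []
for train_idx, test_idx in kfold_demo.split(X_kfold_demo):
    fold_sizes.append(len(test_idx))     # size of the held-out fold
    held_out.extend(test_idx.tolist())   # collect all held-out indices

print(fold_sizes)
```

Each observation is held out exactly once across the folds. A generator like this can be passed to `cross_validate` through the `cv` argument in place of an integer.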

In [None]:
...

# Part 2: Case study cross validation

(see Exercise 5.4.5)

In this case study we use the credit card dataset to predict the probability of default. We will build a logistic regression model and estimate its test error using the validation set approach and the cross-validation approach.

In [None]:
# run this cell to load the data
Default = load_data('Default')
Default

Background information on the dataset can be found [in the documentation](https://islp.readthedocs.io/en/latest/datasets/Default.html).

## Task 2.1
Fit a logistic regression model that uses `income` and `balance` to predict `default`.

In [None]:
# your code here
...

## Task 2.2
Using the validation set approach, estimate the test error of this model. In order to do this, you must perform the following steps:

i. Split the sample set into a training set and a validation set.

ii. Fit a multiple logistic regression model using only the training observations.

iii. Obtain a prediction of default status for each individual in the validation set by computing the posterior probability of default for that individual, and classifying the individual to the default category if the posterior probability is greater than 0.5.

iv. Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified.
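Steps iii and iv can be sketched on a handful of made-up numbers. The probabilities and labels below are hypothetical and do not come from the `Default` data; in the task itself the probabilities come from the fitted model.

```python
import numpy as np

# Hypothetical predicted probabilities of default and true labels
probs_demo = np.array([0.02, 0.70, 0.40, 0.55, 0.10])
actual_demo = np.array(['No', 'Yes', 'Yes', 'No', 'No'])

# Step iii: classify as 'Yes' when the probability exceeds 0.5
predicted_demo = np.where(probs_demo > 0.5, 'Yes', 'No')

# Step iv: fraction of misclassified observations
error_rate_demo = np.mean(predicted_demo != actual_demo)
print(error_rate_demo)   # 2 of 5 observations are misclassified
```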

In [None]:
# your code here
...

## Task 2.3
Repeat the process in Task 2.2 three times, using three different splits of the observations into a training set and a validation set. Comment on the results obtained.

In [None]:
# your code here
...

## Task 2.4
Now predict the test error of the model using 10-fold cross-validation. To do so, follow the steps we developed in Part 1b) of this notebook.

In [None]:
# your code here
...

## Task 2.5
Now consider a logistic regression model that predicts the probability of default using income, balance, and a dummy variable for student. Estimate the test error for this model using 10-fold cross-validation. Comment on whether or not including a dummy variable for student leads to a reduction in the test error rate.

In [None]:
# your code here
...

# Part 3: Implementation of bootstrap

## Estimating the accuracy of a statistic
### Introducing the dataset
We closely follow an example presented in [Computational and Inferential Thinking](https://inferentialthinking.com/chapters/13/3/Confidence_Intervals.html).

In [None]:
url = 'https://drive.google.com/uc?id='
file_id = "15xUDQPqkzKJBoxrafC9iNz4EgFlwbmM_"
births = pd.read_csv(url + file_id)
births

**Task 1**: Create a new column `Birth Weight (g)` which contains the birth weight in grams. Use the fact that 1 oz = 28.3495 g for your computations.

In [None]:
...

Birth weight is an important factor in the health of a newborn infant. Smaller babies tend to need more medical care in their first days than larger newborns. It is therefore helpful to have an estimate of birth weight before the baby is born. One way to do this is to examine the relationship between birth weight and the number of gestational days.

A simple measure of this relationship is the ratio of birth weight to the number of gestational days. The table `ratios` contains the birth weight in grams from `births`, as well as a column of the ratios. The first entry in that column was calculated as follows:
$$ \frac{3401.94 \text{ g}}{284 \text{ days}} \approx 11.98 \text{ g per day}.$$
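The arithmetic for this first entry can be checked in a few lines. The sketch below only uses the numbers quoted above and the conversion factor from Task 1; the variable names are made up for illustration. Note that 3401.94 is a weight in grams, which corresponds to 120 oz.

```python
OZ_TO_G = 28.3495        # grams per ounce, as stated in Task 1
weight_g = 3401.94       # first birth weight, in grams
gestational_days = 284   # gestational days of the first entry

weight_oz = weight_g / OZ_TO_G           # same weight expressed in ounces
ratio = weight_g / gestational_days      # grams per gestational day

print(round(weight_oz, 2), round(ratio, 2))
```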

In [None]:
ratios = pd.DataFrame({
    'Birth Weight' : births['Birth Weight (g)'],
    'Ratio BW:GD' : births['Birth Weight (g)'] / births['Gestational Days']
})
ratios

**Task 2:** Plot a histogram of the ratios.

In [None]:
fig,ax  = plt.subplots(figsize =(8,8))
sns.histplot(ratios, x = "Ratio BW:GD");

**Task 3:** Compute the median ratio and the maximum ratio in the sample.

In [None]:
...

In [None]:
...

### Bootstrapping for estimating the variability of the population median
We now want to estimate the population median. For this we are going to use the bootstrapping method.
We start by reviewing the idea in a graphical manner:
![bootstrap.png](bootstrap.png)

**Task 1**: Define a function `one_bootstrap_median` which will bootstrap the sample and return the median ratio in the bootstrapped sample.

- To bootstrap the sample, use the `pandas.DataFrame` method [`sample`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html). *Important*: Make sure to draw a sample of the same length as our original sample and make sure to sample with replacement.
- To compute the appropriate quantile, use the NumPy function [`quantile`](https://numpy.org/doc/stable/reference/generated/numpy.quantile.html).
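The resampling idea can be sketched on synthetic numbers. This is not the solution to the task: it uses a made-up array instead of the `ratios` table, and it swaps in NumPy's `Generator.choice` for the `DataFrame.sample` method. The two points it illustrates are the ones stressed above: the resample has the same length as the original sample, and it is drawn with replacement.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the observed ratios (hypothetical values)
sample_demo = rng.normal(12.0, 1.5, size=200)

# One bootstrap resample: same length as the sample, drawn with replacement
resample_demo = rng.choice(sample_demo, size=len(sample_demo), replace=True)

# The median is the 0.5 quantile of the resample
median_resample = np.quantile(resample_demo, 0.5)
print(median_resample)
```

Because values are drawn with replacement, some observations appear several times in the resample and others not at all, which is why the median differs from one resample to the next.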

In [None]:
...

**Task 2**: Initialize a NumPy vector `bootstrap_medians` with zeros of length 5000 (use the NumPy function [`zeros()`](https://numpy.org/doc/stable/reference/generated/numpy.zeros.html#numpy.zeros)). Then fill this vector with 5000 bootstrapped medians.

In [None]:
...

In [None]:
fig,ax  = plt.subplots(figsize =(8,8))
sns.histplot(x = bootstrap_medians, bins=25);

## Bootstrapping for estimating the accuracy of a Linear Regression Model
Now, we discuss how to use bootstrapping in order to assess the variability of the coefficient estimates and predictions from a statistical learning method. As an example, we look at a simple linear regression model based on the `Auto` dataset which predicts the `mpg` variable based on `horsepower`.

With the bootstrap method we are going to estimate the distribution of the coefficient for `horsepower` in this model, and we compare the standard error of this coefficient as estimated by `statsmodels` with our bootstrap estimate.
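The comparison can be sketched on synthetic data before working with `Auto`. The example below is an illustration under stated assumptions: the data are simulated with constant noise variance, `np.polyfit` stands in for `sm.OLS`, and the textbook formula $\mathrm{SE}(\hat\beta_1) = \sqrt{\hat\sigma^2 / \sum_i (x_i - \bar x)^2}$ stands in for the standard error reported by `statsmodels`. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data with a linear relationship and constant noise variance
n = 200
x = rng.uniform(50, 200, size=n)
y = 40 - 0.15 * x + rng.normal(0, 4, size=n)

# Least-squares fit; polyfit returns [slope, intercept] for degree 1
b1, b0 = np.polyfit(x, y, 1)

# Formula-based standard error of the slope
resid = y - (b0 + b1 * x)
sigma2 = resid @ resid / (n - 2)
se_formula = np.sqrt(sigma2 / np.sum((x - x.mean()) ** 2))

# Bootstrap: resample (x, y) pairs with replacement and refit each time
B = 1000
boot_slopes = np.zeros(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)
    boot_slopes[b] = np.polyfit(x[idx], y[idx], 1)[0]
se_boot = boot_slopes.std(ddof=1)

print(se_formula, se_boot)
```

When the linear model assumptions hold, as in this simulation, the two estimates should be close. On real data they can differ, since the formula relies on those assumptions while the bootstrap does not.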

**Task 1**: Define a function `one_bootstrap_model_coefficient` which creates a single bootstrap sample from the Auto dataframe, computes a regression model based on the single predictor `horsepower` and returns the model coefficient for `horsepower`.

In [None]:
...

**Task 2**: Initialize a NumPy vector `bootstrap_model_coefficients` with zeros of length 5000 (use the NumPy function [`zeros()`](https://numpy.org/doc/stable/reference/generated/numpy.zeros.html#numpy.zeros)). Then fill this vector with 5000 bootstrapped model coefficients.

In [None]:
...

In [None]:
fig,ax  = plt.subplots(figsize =(8,8))
sns.histplot(x = bootstrap_model_coefficients, bins=25);

**Task 3**: Estimate the standard error of the model coefficient for `horsepower` from the bootstrapped coefficients and assign it to the variable `standard_error_bootstrap`. Compare it to the `sm.OLS()` estimate, which should be assigned to the variable `standard_error_statsmodels`.

In [None]:
...

print('Bootstrapped standard error for model coefficient:', "{:10.4f}".format(standard_error_bootstrap))
print('Statsmodels OLS standard error estimate for model coefficient:', "{:10.4f}".format(standard_error_statsmodels))

# Part 4: Case study bootstrap

We continue to consider the use of a logistic regression model to predict the probability of default using income and balance on the Default data set. In particular, we will now compute estimates for the standard errors of the income and balance logistic regression coefficients in two different ways: 
1. using the bootstrap, and 
2. using the standard formula for computing the standard errors in the `sm.GLM()` function.

## Task 4.1
Using the `summarize()` and `sm.GLM()` functions, determine the estimated standard errors for the coefficients associated with income and balance in a multiple logistic regression model that uses both predictors.

In [None]:
# your answer here

## Task 4.2
Following the bootstrap example in Part 3 above, estimate the standard errors of the logistic regression coefficients for income and balance with the bootstrap.

In [None]:
# your answer here

## Task 4.3
Comment on the estimated standard errors obtained using the `sm.GLM()` function and using the bootstrap.

*your comment here*