## Preparing the environment

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from ISLP import load_data
from ISLP.models import (ModelSpec as MS,
                         summarize,
                         poly)
from sklearn.model_selection import train_test_split
from functools import partial
from sklearn.model_selection import \
     (cross_validate,
      KFold,
      ShuffleSplit)
from sklearn.base import clone
from sklearn.metrics import make_scorer
from ISLP.models import sklearn_sm

import plotly.express as px

In [2]:
import warnings
warnings.filterwarnings('ignore')

# Part 1: Case study bootstrap

We continue to consider the use of a logistic regression model to predict the probability of default using income and balance on the `Default` data set. In particular, we will now compute estimates for the standard errors of the income and balance logistic regression coefficients in two different ways: 
1. using the bootstrap, and 
2. using the standard formula for computing the standard errors in the `sm.GLM()` function.

In [3]:
# run this cell to load the data
Default = load_data('Default')
Default

Unnamed: 0,default,student,balance,income
0,No,No,729.526495,44361.625074
1,No,Yes,817.180407,12106.134700
2,No,No,1073.549164,31767.138947
3,No,No,529.250605,35704.493935
4,No,No,785.655883,38463.495879
...,...,...,...,...
9995,No,No,711.555020,52992.378914
9996,No,No,757.962918,19660.721768
9997,No,No,845.411989,58636.156984
9998,No,No,1569.009053,36669.112365


## Task 1.1
Using the `summarize()` and `sm.GLM()` functions, determine the estimated standard errors for the coefficients associated with `income` and `balance` in a multiple logistic regression model that uses both predictors.

In [4]:
predictors = Default.columns.drop(['student', 'default'])
design = MS(predictors).fit(Default)
X = design.transform(Default)

y = Default.default.map({
    'No' : 0,
    'Yes' : 1
})

model = sm.Logit(y,X)
results = model.fit()
summarize(results)

Optimization terminated successfully.
         Current function value: 0.078948
         Iterations 10


Unnamed: 0,coef,std err,z,P>|z|
intercept,-11.5405,0.435,-26.544,0.0
balance,0.0056,0.0,24.835,0.0
income,2.1e-05,5e-06,4.174,0.0


## Task 1.2
Following the bootstrap example in Part 2 above, estimate the standard errors of the logistic regression coefficients for income and balance with the bootstrap.

In [5]:
# Step 1: Define function to compute one bootstrap sample of the model coefficients
def one_bootstrap_model_coefficients():
    resample = Default.sample(frac = 1, 
                           replace = True)
    X_bs = design.transform(resample)
    y_bs = y[X_bs.index]
    results_bs = sm.Logit(y_bs,X_bs).fit(disp=0) # disp=0 option mutes output of fit()
    return results_bs.params['balance'], results_bs.params['income']
one_bootstrap_model_coefficients()

(0.005971566415850255, 2.27073226747463e-05)

In [6]:
# Step 2: draw the bootstrap samples and store the coefficient estimates
n_bs_samples = 5000
balance_coefficients = np.zeros(n_bs_samples)
income_coefficients = np.zeros(n_bs_samples)
for i in range(n_bs_samples):
    balance_coefficients[i], income_coefficients[i] = one_bootstrap_model_coefficients()

In [7]:
# Step 3a: comparison of bootstrap standard errors and standard errors as per statsmodels.Logit() - balance
print('Bootstrap estimation of standard error for balance parameter:   ', 
      '{:6e}'.format(np.std(balance_coefficients))
)
print('Statsmodels estimation of standard error for balance parameter: ', 
      '{:6e}'.format(summarize(results).loc['balance','std err'])
)

Bootstrap estimation of standard error for balance parameter:    2.298945e-04
Statsmodels estimation of standard error for balance parameter:  0.000000e+00


In [8]:
# Step 3b: comparison of bootstrap standard errors and standard errors as per statsmodels.Logit() - income
print('Bootstrap estimation of standard error for income parameter:   ', 
      '{:6e}'.format(np.std(income_coefficients))
)
print('Statsmodels estimation of standard error for income parameter: ', 
      '{:6e}'.format(summarize(results).loc['income','std err'])
)

Bootstrap estimation of standard error for income parameter:    4.854690e-06
Statsmodels estimation of standard error for income parameter:  4.990000e-06


## Task 1.3

Comment on the estimated standard errors obtained using the `sm.Logit()`/`sm.GLM()` function and using the bootstrap.

**Comment**:
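- For `balance`, `summarize(results)` reports a standard error of 0.0. This is a rounding artifact rather than a genuine disagreement: the summary table shows standard errors to three decimal places, and the value for `balance` is too small to show up. The unrounded value is available as `results.bse['balance']` and is the number that should be compared with the bootstrap standard error of about 2.30e-04.
- For `income`, the bootstrap estimate of the standard error (4.85e-06) is close to the one estimated by `statsmodels` (4.99e-06).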

# Part 2: Regularization

In this section we learn how to implement regularization for linear regression models using Ridge and Lasso regression.

We look at a [market research project by a pharmaceutical company](https://www.tandfonline.com/doi/abs/10.1080/02664763.2014.994480) (example taken from the textbook [Learning Data Science](https://learningds.org/ch/16/ms_regularization.html#lipovetsky) by S. Lau, J. Gonzalez and D. Nolan).

The objective of the study is to model consumer interest in purchasing a cold sore health-care product. The study authors gather data from 1,023 consumers. Each consumer is asked to rate on a 10-point scale 35 factors according to whether the factor matters to them when considering purchasing a cold sore treatment. They also rate their interest in purchasing the product.

We begin by reading in the data:

In [9]:
ma_df = pd.read_csv('market-analysis.csv')

The table below lists the 35 factors and gives the correlation of each factor with the outcome, the consumer's interest in purchasing the product:



|  | Corr | Description |  | Corr | Description |
| --- | --- | --------- | --- | --- | --------- |
| x1  | 0.70 | provides soothing relief | x19 | 0.54 | has a non-messy application |
| x2  | 0.58 | moisturizes cold sore blister | x20 | 0.70 | good for any stage of a cold |
| x3  | 0.69 | provides long-lasting relief | x21 | 0.49 | easy to apply/take |
| x4  | 0.70 | provides fast-acting relief | x22 | 0.52 | package keeps from contamination |
| x5 | 0.72 | shortens duration of a cold | x23 | 0.57 | easy to dispense a right amount |
| x6  | 0.68 | stops the virus from spreading | x24 | 0.63 | worth the price it costs |
| x7 | 0.67 | dries up cold sore | x25 | 0.57 | recommended most by pharmacists |
| x8 | 0.72 | heals fast | x26 | 0.54 | recommended by doctors |
| x9 | 0.72 | penetrates deep | x27 | 0.54 | FDA approved |
| x10 | 0.65 | relieves pain | x28 | 0.64 | a brand I trust |
| x11 |0.61 | prevents cold | x29 | 0.60 | clinically proven |
| x12 | 0.73 | prevents from getting worse | x30 | 0.68 | a brand I would recommend |
| x13 | 0.57 | medicated | x31 | 0.74 | an effective treatment |
| x14 | 0.61 | prescription strength | x32  |0.37 | portable |
| x15 | 0.63 | repairs damaged skin | x33 | 0.37 | discreet packaging |
| x16 | 0.67 | blocks virus from spreading | x34 | 0.55 | helps conceal cold sores |
| x17 | 0.42 | contains SPF | x35 | 0.63 | absorbs quickly |
| x18 | 0.57 | non-irritating | | | |


Based on their labels alone, some of these 35 features appear to measure similar aspects of desirability. We can compute the correlations between the explanatory variables to confirm this:

In [10]:
ma_df.corr()

Unnamed: 0,y,x1,x2,x3,x4,x5,x6,x7,x8,x9,...,x26,x27,x28,x29,x30,x31,x32,x33,x34,x35
y,1.0,0.698082,0.58447,0.689198,0.698104,0.715737,0.675171,0.674324,0.720245,0.715239,...,0.543472,0.5353,0.643565,0.599936,0.681561,0.744036,0.370732,0.365545,0.553436,0.627739
x1,0.698082,1.0,0.693949,0.741297,0.806206,0.745964,0.733237,0.715636,0.764331,0.781493,...,0.545061,0.555448,0.669401,0.632966,0.675034,0.786363,0.420447,0.327813,0.573375,0.714993
x2,0.58447,0.693949,1.0,0.632579,0.652231,0.62931,0.626027,0.60647,0.628247,0.667748,...,0.520121,0.535033,0.608308,0.600972,0.618307,0.658305,0.42604,0.355965,0.54633,0.631601
x3,0.689198,0.741297,0.632579,1.0,0.775006,0.764568,0.781304,0.716138,0.795831,0.766581,...,0.559115,0.571977,0.634122,0.665531,0.635772,0.795528,0.366715,0.343737,0.66746,0.697526
x4,0.698104,0.806206,0.652231,0.775006,1.0,0.791758,0.763113,0.742654,0.812947,0.78278,...,0.566515,0.550485,0.649257,0.628049,0.674396,0.816444,0.386155,0.316505,0.595784,0.715873
x5,0.715737,0.745964,0.62931,0.764568,0.791758,1.0,0.784662,0.728638,0.848203,0.767637,...,0.585059,0.541089,0.634308,0.631482,0.683331,0.82055,0.341707,0.315925,0.587228,0.692399
x6,0.675171,0.733237,0.626027,0.781304,0.763113,0.784662,1.0,0.723374,0.787879,0.776645,...,0.601352,0.610401,0.619591,0.69206,0.657196,0.781734,0.342936,0.366826,0.644575,0.683923
x7,0.674324,0.715636,0.60647,0.716138,0.742654,0.728638,0.723374,1.0,0.735611,0.762535,...,0.572395,0.576887,0.665014,0.677444,0.645748,0.751695,0.367301,0.38985,0.58125,0.707176
x8,0.720245,0.764331,0.628247,0.795831,0.812947,0.848203,0.787879,0.735611,1.0,0.780745,...,0.562062,0.535833,0.631658,0.645374,0.656173,0.837631,0.320365,0.31538,0.623694,0.689579
x9,0.715239,0.781493,0.667748,0.766581,0.78278,0.767637,0.776645,0.762535,0.780745,1.0,...,0.592798,0.603364,0.681121,0.699057,0.711266,0.815086,0.420671,0.354331,0.608952,0.743931


We observe, for example, that the last feature `x35` ("absorbs quickly") is highly correlated with `x1` ("provides soothing relief"), `x4` ("provides fast-acting relief") and `x9` ("penetrates deep").

## Task 2.1

Split the data into train and test sets. Use a test set size of 200 observations.
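
As a rough sketch of how such a split can be set up, the example below uses `train_test_split` on a synthetic stand-in for the data (the frame `demo_df`, its columns and the `random_state` value are assumptions for illustration). Passing an integer as `test_size` requests an absolute number of test observations rather than a fraction:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the survey (1,023)
rng = np.random.default_rng(0)
demo_df = pd.DataFrame(rng.normal(size=(1023, 4)),
                       columns=["y", "x1", "x2", "x3"])

y_demo = demo_df["y"]
X_demo = demo_df.drop(columns=["y"])

# An integer test_size is interpreted as a number of observations
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=200, random_state=42
)
print(X_tr.shape, X_te.shape)
```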

In [None]:
y = ma_df["y"]
X = ma_df.drop(columns=["y"])

X_train, X_test, y_train, y_test = ...

## Task 2.2

Standardize the features using `sklearn.preprocessing.StandardScaler()` (see [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)). Note that only the predictors need to be scaled.

In [None]:
from sklearn.preprocessing import StandardScaler

X_train_scaled = ...
X_test_scaled = ...

Run the cell below to check that the scaled training data has mean 0 and SD 1 (approximately):

In [None]:
X_train_scaled.mean(axis=0)

In [None]:
X_train_scaled.std(axis=0)

Note that this is **not** the case for the test data (**why?**):

In [None]:
X_test_scaled.mean(axis=0)

In [None]:
X_test_scaled.std(axis=0)
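
A minimal sketch of the usual pattern, on synthetic data (the arrays `X_tr` and `X_te` are made up for illustration): the scaler is fitted on the training data only and the same transformation is then applied to both sets. Because the test set is shifted and scaled with the training means and standard deviations, its own column means and standard deviations end up only close to 0 and 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X_tr = rng.normal(loc=5.0, scale=2.0, size=(800, 3))   # synthetic training set
X_te = rng.normal(loc=5.0, scale=2.0, size=(200, 3))   # synthetic test set

# Learn means and standard deviations from the training data only
scaler = StandardScaler().fit(X_tr)

X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)   # reuses the training statistics

print("train means:", X_tr_s.mean(axis=0))
print("test means: ", X_te_s.mean(axis=0))
```

Fitting the scaler on the test data as well would leak information about the test set into the preprocessing step.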

## Task 2.3

We start by computing an ordinary multiple linear regression model. For consistency with the subsequent tasks we use `sklearn.linear_model.LinearRegression` this time (see [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)).

In the following, train a Multiple Linear Regression model on the scaled training data.

Compute the model coefficients (using the `coef_` attribute of the trained model) and the mean squared error on the test data (using the function `sklearn.metrics.mean_squared_error()` (see [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html))).

*Note*: The design matrix and the results vector are passed only as arguments to the `fit()` method for `sklearn` models. This is different than for `statsmodels` where we passed the data already at the stage of initializing the model. Additionally, the order in which the design matrix and the results vector are passed to a `sklearn`-model is swapped compared to `statsmodels`!

Also note: for linear models in `sklearn` we do not need to manually create an `intercept` column as we can specify if we want an intercept to be included using the `fit_intercept` argument when initializing the model. This parameter is set to `True` by default.
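
The calling convention described in the notes can be sketched as follows. The data are synthetic and the names `X_tr`, `y_tr`, `X_te`, `y_te` and `true_coef` are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
true_coef = np.array([1.5, -2.0, 0.5])
X_tr = rng.normal(size=(300, 3))
X_te = rng.normal(size=(100, 3))
y_tr = 4.0 + X_tr @ true_coef + rng.normal(scale=0.1, size=300)
y_te = 4.0 + X_te @ true_coef + rng.normal(scale=0.1, size=100)

ols = LinearRegression()      # fit_intercept=True by default
ols.fit(X_tr, y_tr)           # design matrix first, response second

print("coefficients:", ols.coef_)
print("intercept:   ", ols.intercept_)
print("test MSE:    ", mean_squared_error(y_te, ols.predict(X_te)))
```

The intercept is stored separately in `intercept_`, so `coef_` contains one entry per column of the design matrix.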

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

coefficients = ...
mse = ...

print('Multiple Linear Regression model coefficients: ', coefficients)
print('Multiple Linear Regression test MSE: ', mse)

## Task 2.4

Repeat Task 2.3, but this time train your model on the unscaled data. What do you observe?

In [None]:
coefficients = ...
mse = ...

print('Multiple Linear Regression model coefficients: ', coefficients)
print('Multiple Linear Regression test MSE: ', mse)

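**Observation**: The model trained on the unscaled data is equivalent to the model trained on the scaled data: the two models' test MSEs are identical. Standardization shifts and rescales each feature, and ordinary least squares compensates by rescaling the coefficients and shifting the intercept, so the predictions do not change. Only the coefficient values differ.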
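
This invariance can be checked numerically. The sketch below uses synthetic data (all names are made up for illustration) and compares the predictions of the two fits. It also checks that each coefficient of the scaled fit equals the unscaled coefficient multiplied by that feature's training standard deviation, which `StandardScaler` stores in `scale_`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
# Two features on very different scales
X_tr = rng.normal(loc=[10.0, -3.0], scale=[5.0, 0.2], size=(200, 2))
X_te = rng.normal(loc=[10.0, -3.0], scale=[5.0, 0.2], size=(50, 2))
y_tr = 1.0 + X_tr @ np.array([0.3, 4.0]) + rng.normal(scale=0.5, size=200)

scaler = StandardScaler().fit(X_tr)

ols_raw = LinearRegression().fit(X_tr, y_tr)
ols_scaled = LinearRegression().fit(scaler.transform(X_tr), y_tr)

pred_raw = ols_raw.predict(X_te)
pred_scaled = ols_scaled.predict(scaler.transform(X_te))

print("same predictions:", np.allclose(pred_raw, pred_scaled))
print("raw coefficients:   ", ols_raw.coef_)
print("scaled coefficients:", ols_scaled.coef_)
```

The same equivalence does not hold once a penalty on the coefficients is added, which is why the features are standardized before fitting Lasso and Ridge models.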

## Task 2.5

Next, we implement Lasso regression using `sklearn.linear_model.Lasso` (see [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html)). 

In the following, train a Lasso model on the scaled training data using the regularization parameter $\lambda = 1$. Note that $\lambda$ is set by specifying the argument `alpha` in `sklearn.linear_model.Lasso`.

Compute the model coefficients (using the `coef_` attribute of the trained model) and the mean squared error on the test data (using the function `sklearn.metrics.mean_squared_error()` (see [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html))).

*Note*: The design matrix and the results vector are passed only as arguments to the `fit()` method for `sklearn` models. This is different than for `statsmodels` where we passed the data already at the stage of initializing the model. Additionally, the order in which the design matrix and the results vector are passed to a `sklearn`-model is swapped compared to `statsmodels`!

Also note: for linear models in `sklearn` we do not need to manually create an `intercept` column as we can specify if we want an intercept to be included using the `fit_intercept` argument when initializing the model.
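
For orientation, `sklearn.linear_model.Lasso` minimizes $\frac{1}{2n}\lVert y - Xw\rVert_2^2 + \alpha \lVert w\rVert_1$, where $n$ is the number of training observations. The sketch below fits a Lasso model with `alpha=1` on synthetic data with roughly standardized features (the data and the names `X_tr`, `y_tr`, `true_coef` are assumptions for illustration) to show the typical outcome: some coefficients are shrunk, others are set exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(4)
true_coef = np.array([3.0, 0.0, 0.0, -2.0, 0.0])   # only two active features
X_tr = rng.normal(size=(500, 5))
X_te = rng.normal(size=(200, 5))
y_tr = X_tr @ true_coef + rng.normal(size=500)
y_te = X_te @ true_coef + rng.normal(size=200)

lasso = Lasso(alpha=1.0)    # alpha plays the role of lambda
lasso.fit(X_tr, y_tr)

print("coefficients:", lasso.coef_)
print("test MSE:    ", mean_squared_error(y_te, lasso.predict(X_te)))
```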

In [None]:
from sklearn.linear_model import Lasso

coefficients = ...
mse = ...

print('Model coefficients: ', coefficients)
print('Lasso test MSE for alpha = 1: ', mse)

## Task 2.6

For values of $\lambda$ varying from 0.01 to 2 in steps of 0.01, train Lasso models and compute the model coefficients and the test MSEs. For each new value of $\lambda$, append the new model coefficients and test MSE to lists called `coefficients_Lasso` and `mses`. Store the $\lambda$ values in an array called `alphas`, which the plotting cells below use.
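
One possible shape for such a loop is sketched below on synthetic data (the data and `true_coef` are made up for illustration; `alphas`, `coefficients_Lasso` and `mses` follow the names used in this task). The grid is built with `np.linspace` rather than `np.arange` to avoid floating-point surprises at the end point:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(5)
true_coef = np.array([3.0, 0.0, 0.0, -2.0, 0.0])
X_tr = rng.normal(size=(500, 5))
X_te = rng.normal(size=(200, 5))
y_tr = X_tr @ true_coef + rng.normal(size=500)
y_te = X_te @ true_coef + rng.normal(size=200)

alphas = np.linspace(0.01, 2.0, 200)   # 0.01, 0.02, ..., 2.00
coefficients_Lasso = []
mses = []
for alpha in alphas:
    model = Lasso(alpha=alpha).fit(X_tr, y_tr)
    coefficients_Lasso.append(model.coef_)
    mses.append(mean_squared_error(y_te, model.predict(X_te)))

print("non-zero coefficients at smallest alpha:",
      np.count_nonzero(coefficients_Lasso[0]))
print("non-zero coefficients at largest alpha: ",
      np.count_nonzero(coefficients_Lasso[-1]))
```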

In [None]:
from sklearn.linear_model import Lasso

coefficients_Lasso = []
...

Run the two cells below to visualize your coefficients and your MSEs for the different $\lambda$ values.

In [None]:
col_names = ["x" + str(v) for v in np.arange(1, 36, 1)]

coefs_df = pd.DataFrame(coefficients_Lasso, columns=col_names)

coefs_df["lambda"] = alphas
coefs_long = pd.melt(coefs_df, id_vars=["lambda"], value_vars=col_names)

fig = px.line(coefs_long, x="lambda", y="value", color="variable", log_x=True)
fig.update_layout(
    showlegend=False, width=1000, height=500, yaxis_title="Coefficient",
    xaxis_title="Lambda"
)

In [None]:
px.line(x=alphas, y=mses,
        labels={"x": "Lambda", "y": "MSE"},
        width=700, height=500)

## Task 2.7

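Repeat the steps from Task 2.6, this time using Ridge regression ([`sklearn.linear_model.Ridge`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge)) with a parameter $\lambda$ that varies from $1$ to $3000$ in steps of $25$. Store the $\lambda$ values in `alphasR`, the coefficients in `coefficients_Ridge` and the test MSEs in `mses`, as expected by the plotting cells below.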
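
A sketch of the corresponding Ridge loop, again on synthetic data (the data and `true_coef` are assumptions for illustration). `sklearn.linear_model.Ridge` minimizes $\lVert y - Xw\rVert_2^2 + \alpha \lVert w\rVert_2^2$ without a $\frac{1}{2n}$ factor on the loss, which is one reason the useful range of $\lambda$ is much larger than for `Lasso`:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(6)
true_coef = np.array([3.0, 0.0, 0.0, -2.0, 0.0])
X_tr = rng.normal(size=(500, 5))
X_te = rng.normal(size=(200, 5))
y_tr = X_tr @ true_coef + rng.normal(size=500)
y_te = X_te @ true_coef + rng.normal(size=200)

alphasR = np.arange(1, 3001, 25)   # 1, 26, ..., 2976
coefficients_Ridge = []
mses = []
for alpha in alphasR:
    model = Ridge(alpha=alpha).fit(X_tr, y_tr)
    coefficients_Ridge.append(model.coef_)
    mses.append(mean_squared_error(y_te, model.predict(X_te)))

norms = [np.linalg.norm(c) for c in coefficients_Ridge]
print("coefficient norm at smallest alpha:", norms[0])
print("coefficient norm at largest alpha: ", norms[-1])
```

Unlike the Lasso, Ridge shrinks the coefficients towards zero without setting them exactly to zero.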

In [None]:
from sklearn.linear_model import Ridge

coefficients_Ridge = []
...

Run the two cells below to visualize the coefficients and the test score for the different $\lambda$ parameters.

In [None]:
coefficients_Ridge = np.squeeze(coefficients_Ridge)

col_names = ["x" + str(v) for v in np.arange(1, 36, 1)]

coefsR_df = pd.DataFrame(coefficients_Ridge, columns=col_names)
coefsR_df["lambda"] = alphasR

coefsR_long = pd.melt(coefsR_df, id_vars=["lambda"], value_vars=col_names)

fig = px.line(coefsR_long, x="lambda", y="value", color="variable", log_x=True)
fig.update_layout(
    showlegend=False, width=1000, height=500, 
    yaxis_title="Coefficient", xaxis_title="Lambda"
)
fig.show()

In [None]:
px.line(x=alphasR, y=mses,
        labels={"x": "Lambda", "y": "MSE"},
        width=700, height=500)

## Task 2.8

Now we use $10$-fold cross validation to compare the estimated test MSE of OLS multiple linear regression, Lasso regression and Ridge regression.
To do so, follow the steps outlined below:
- Initialize a `KFold` cross-validator (see [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html)). Make sure to set a random state so that the same folds are used for all models. Also make sure that the data is shuffled.
- With this cross-validator, compute the cross validation scores for the regular OLS model. Since in this part we stay completely within `sklearn` and do not use `statsmodels`, there is no need for using `sklearn_sm`. Make sure to specify the appropriate scorer using the `scoring` parameter.
- For `Lasso` and `Ridge` we need to pass the model in the form of a pipeline to `cross_validate` to make sure that the standardization is carried out on each of the folds separately. For this, use the function `sklearn.pipeline.make_pipeline` (see [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html) or [here](https://scikit-learn.org/stable/modules/compose.html) for more details).
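
The steps above can be sketched as follows, on synthetic data and with a deliberately short grid of penalties (the data, `X_demo`, `y_demo`, the three `alphas_L` values and the `random_state` are assumptions for illustration). The scorer `"neg_mean_squared_error"` returns the negative MSE, so the sign is flipped when averaging:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
true_coef = np.array([3.0, 0.0, 0.0, -2.0, 0.0])
X_demo = rng.normal(size=(400, 5))
y_demo = X_demo @ true_coef + rng.normal(size=400)

# Same shuffled folds for every model thanks to the fixed random_state
cross_val = KFold(n_splits=10, shuffle=True, random_state=0)

cv_results = cross_validate(LinearRegression(), X_demo, y_demo,
                            cv=cross_val, scoring="neg_mean_squared_error")
cv_err_OLS = -np.mean(cv_results["test_score"])

alphas_L = [0.01, 0.1, 1.0]
cv_err_L = []
for alpha in alphas_L:
    # The scaler is re-fitted on the training part of each fold
    pipe = make_pipeline(StandardScaler(), Lasso(alpha=alpha))
    res = cross_validate(pipe, X_demo, y_demo,
                         cv=cross_val, scoring="neg_mean_squared_error")
    cv_err_L.append(-np.mean(res["test_score"]))

print("OLS CV error:  ", cv_err_OLS)
print("Lasso CV error:", dict(zip(alphas_L, cv_err_L)))
```

A Ridge comparison follows the same pattern with `Ridge(alpha=alpha)` as the last pipeline step.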

In [None]:
cross_val = ...

In [None]:
# determine OLS cross validation score
...
cv_err_OLS = -np.mean(cv_results['test_score'])

In [None]:
# determine Lasso cross validation scores
from sklearn.pipeline import make_pipeline

...
cv_err_L = ...

...

In [None]:
# determine Ridge cross validation scores
...
cv_err_R = ...

...

In [None]:
print('Cross validation score OLS: ', cv_err_OLS)
print('Best cross validation score Lasso: ', min(cv_err_L), ' (for parameter alpha = ', 
      alphas_L[np.argmin(cv_err_L)],')')
print('Best cross validation score Ridge: ', min(cv_err_R), ' (for parameter alpha = ', 
      alphas_R[np.argmin(cv_err_R)],')')