#### Introduction to Statistical Learning, Lab 5.3

# Bootstrap

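The main advantage of the bootstrap is that it applies to a wide range of statistics, including ones for which no simple standard error formula is available. We first demonstrate it with a custom statistic on the `Portfolio` data set and then use it to assess the coefficient estimates of linear models on the `Auto` data set.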

We will use the linear models and tools from the `sklearn` library in this lab.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn.linear_model as skl_lm
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures
from islpy import datasets, utils, lmplots
sns.set()
%matplotlib inline

#### Bootstrap on Portfolio

We use $\alpha$ from the lecture as the statistic and define a function to calculate it.
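
For reference, and assuming the lecture follows the ISLR portfolio example, $\alpha$ is the fraction invested in $X$ that minimizes $\text{Var}(\alpha X + (1 - \alpha) Y)$:

$$
\alpha = \frac{\sigma_Y^2 - \sigma_{XY}}{\sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY}}
$$

Here $\sigma_X^2 = \text{Var}(X)$, $\sigma_Y^2 = \text{Var}(Y)$ and $\sigma_{XY} = \text{Cov}(X, Y)$. The estimate $\hat{\alpha}$ plugs the sample variances and covariance into this expression.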

In [2]:
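def alpha(x, y):
    # Plug-in estimate of alpha from the sample variances and covariance.
    cv = np.cov(x, y)  # 2x2 sample covariance matrix of (x, y)
    vx = cv[0, 0]      # Var(x)
    vy = cv[1, 1]      # Var(y)
    cxy = cv[1, 0]     # Cov(x, y)
    return (vy - cxy) / (vx + vy - 2 * cxy)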

In [3]:
portfolio = datasets.Portfolio()
x = portfolio.X[:100]
y = portfolio.Y[:100]

In [4]:
print(alpha(x, y))

0.57583207459283


We now sample a bootstrap data set with replacement and compute $\hat{\alpha}$.

In [5]:
from sklearn.utils import resample
xs, ys = resample(x, y, n_samples=100)

In [None]:
print(alpha(xs, ys))

To perform a bootstrap analysis, we re-sample many times, recompute the statistic on each bootstrap sample, and take the standard deviation of these estimates as the standard error. We define a function to do that.

In [None]:
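def bootstrap(x, y, r, fn):
    # Evaluate fn on r bootstrap samples of the paired data (x, y).
    stats = np.zeros(r)
    for i in range(r):
        # Draw rows with replacement, keeping each (x, y) pair together.
        xs, ys = resample(x, y, n_samples=x.shape[0])
        stats[i] = fn(xs, ys)

    stat = np.mean(stats)  # mean of the bootstrap estimates
    # Standard deviation of the estimates, rescaled to the (r - 1) denominator.
    se = np.sqrt((r * np.var(stats)) / (r - 1))

    return stat, se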
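
The expression used for `se` rescales NumPy's default variance (denominator `r`) to the sample variance (denominator `r - 1`). As a minimal sketch on made-up numbers (the array `estimates` is purely illustrative and not part of the lab), it should match `np.std` called with `ddof=1`:

```python
import numpy as np

# Hypothetical bootstrap estimates, for illustration only.
estimates = np.array([0.52, 0.61, 0.58, 0.55, 0.63, 0.57])
r = estimates.shape[0]

# Formula as written in bootstrap(): rescale the default (denominator r) variance.
se_manual = np.sqrt((r * np.var(estimates)) / (r - 1))

# Same quantity via the ddof ("delta degrees of freedom") argument.
se_ddof = np.std(estimates, ddof=1)

print(np.isclose(se_manual, se_ddof))
```

Either form can be used; the `ddof=1` version is shorter and states the intent more directly.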

In [None]:
a, se_a = bootstrap(x, y, 1000, alpha)

In [None]:
print(f'alpha: {a:.2f}, SE(alpha): {se_a:.4f}')

#### Bootstrap on Linear Model

We now perform a bootstrap analysis on the parameters of a linear regression model and compare the results to statistics produced by the regression fit.

In [None]:
auto = datasets.Auto()
x = auto[['horsepower']]
y = auto['mpg']
model = skl_lm.LinearRegression()

We again define a function that re-samples the data set many times and estimates the parameters and their standard errors.

In [None]:
def bootstrap_lm(x, y, model, r):
    # One column per coefficient, plus one for the intercept.
    n_coeff = x.shape[1]
    params = np.zeros((r, n_coeff + 1))
    for i in range(r):
        # Re-sample rows of (x, y) with replacement and refit the model.
        xs, ys = resample(x, y, n_samples=x.shape[0])
        lm = model.fit(xs, ys)
        params[i, 0] = lm.intercept_
        params[i, 1:] = lm.coef_

    # Mean and standard deviation (r - 1 denominator) of each coefficient.
    betas = np.zeros(n_coeff + 1)
    errors = np.zeros(n_coeff + 1)
    for i in range(n_coeff + 1):
        betas[i] = np.mean(params[:, i])
        errors[i] = np.sqrt((r * np.var(params[:, i])) / (r - 1))

    return betas, errors

In [None]:
bootstrap_lm(x, y, model, 1000)

$\hat{\beta}_0 = 39.94$, $\text{SE}(\hat{\beta}_0) = 0.85$ 

$\hat{\beta}_1 = -0.1580$, $\text{SE}(\hat{\beta}_1) = 0.0074$ 

(The results may vary slightly due to the random sampling.)
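
If reproducible numbers are wanted, one option is to pass a seeded random state to `resample` through its `random_state` parameter. The sketch below is an illustration on toy data, not the lab's code; the helper `bootstrap_means` and the array `values` are hypothetical names. Sharing one `RandomState` instance across the loop means each iteration still draws a different sample, while the whole run repeats exactly for a given seed.

```python
import numpy as np
from sklearn.utils import resample

def bootstrap_means(values, r, seed):
    # One generator for the whole run: samples differ between iterations,
    # but the sequence is identical whenever the same seed is used.
    rng = np.random.RandomState(seed)
    means = np.zeros(r)
    for i in range(r):
        sample = resample(values, random_state=rng)
        means[i] = sample.mean()
    return means

# Toy data, for illustration only.
values = np.arange(20, dtype=float)

run_a = bootstrap_means(values, 200, seed=0)
run_b = bootstrap_means(values, 200, seed=0)

print(np.array_equal(run_a, run_b))
```

Passing the same integer seed to every `resample` call inside the loop would instead return the same bootstrap sample each time, which is why a shared generator is used here.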

We now compare this to the results obtained from the `statsmodels` library.

In [None]:
import statsmodels.formula.api as smf

lm = smf.ols('mpg~horsepower', auto).fit()
lm.summary().tables[1]

The bootstrap standard errors differ somewhat from the formula-based ones reported by `statsmodels`. Is this a problem?
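
One way to think about this question, following the discussion in ISLR: the formula-based standard errors rely on the linear model being correct (the noise variance is estimated from its residuals) and treat the predictors as fixed, whereas the bootstrap relies on neither assumption. The sketch below is not part of the lab. It uses synthetic data with a deliberately curved relationship (all names and parameter values are illustrative assumptions) and computes both kinds of standard error for a straight-line fit so they can be compared side by side.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: the true relationship is quadratic, by assumption.
n = 200
x = rng.uniform(0, 4, size=n)
y = 1.0 + x ** 2 + rng.normal(0, 0.5, size=n)

# Design matrix for a straight-line (misspecified) model.
X = np.column_stack([np.ones(n), x])

def ols(X, y):
    # Least-squares coefficients [intercept, slope].
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Formula-based standard errors: sqrt(diag(sigma^2 * (X'X)^-1)).
beta = ols(X, y)
resid = y - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])
formula_se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

# Bootstrap standard errors: re-sample rows with replacement and refit.
r = 1000
boot = np.zeros((r, X.shape[1]))
for i in range(r):
    idx = rng.integers(0, n, size=n)
    boot[i] = ols(X[idx], y[idx])
boot_se = boot.std(axis=0, ddof=1)

print('formula SE:  ', formula_se)
print('bootstrap SE:', boot_se)
```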

We repeat the above with a model including a quadratic term. The `bootstrap_lm()` function defined above handles any number of predictor columns, so we can reuse it here.
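
As a small sketch of what `PolynomialFeatures` produces in this setting (the toy input is made up for illustration): with `degree=2` and `include_bias=False`, a single input column is expanded into the column itself and its square. The coefficients are therefore ordered as intercept, linear term, quadratic term.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Toy single-column input, standing in for horsepower.
toy = np.array([[2.0], [3.0]])

poly = PolynomialFeatures(degree=2, include_bias=False)
expanded = poly.fit_transform(toy)

print(expanded)  # columns: x, x**2
```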

In [None]:
poly = PolynomialFeatures(degree=2, include_bias=False)
x_train = poly.fit_transform(x)
bootstrap_lm(x_train, y, model, 1000)

$\hat{\beta}_0 = 56.97$, $\text{SE}(\hat{\beta}_0) = 2.00$ 

$\hat{\beta}_1 = -0.4674$, $\text{SE}(\hat{\beta}_1) = 0.0318$

$\hat{\beta}_2 = 0.0012$, $\text{SE}(\hat{\beta}_2) = 0.0002$

(The results may vary slightly due to the random sampling.)

In [None]:
lm = smf.ols('mpg~horsepower+I(horsepower**2)', auto).fit()
lm.summary().tables[1]

The bootstrap and formula-based standard errors now agree much more closely, because the quadratic model provides a better fit to the data!