# Bootstrap

The bootstrap is a widely applicable and extremely powerful statistical tool that can be used to quantify the uncertainty associated with a given estimator or statistical learning method. As a simple example, the bootstrap can be used to estimate the standard errors of the coefficients from a linear regression fit. Although standard errors were obtained automatically by `statsmodels` in {doc}`Linear regression <../00-prereq/overview>` and {doc}`Logistic regression <../03-logistic-reg/overview>`, the power of the bootstrap lies in the fact that it can be easily applied to a wide range of statistical learning methods.

Watch the 6-minute video below for a visual explanation of the bootstrap:

```{admonition} Video

<iframe width="700" height="394" src="https://www.youtube.com/embed/Xz0x-8-cgaQ?start=8" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

[Explaining the bootstrap, by StatQuest](https://www.youtube.com/embed/Xz0x-8-cgaQ?start=8)

```

## Import libraries and load data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import (
    train_test_split,
    LeaveOneOut,
    KFold,
    cross_val_score,
)
from sklearn.preprocessing import PolynomialFeatures
from sklearn.utils import resample

%matplotlib inline

In [None]:
auto_url = "https://github.com/pykale/transparentML/raw/main/data/Auto.csv"

auto_df = pd.read_csv(auto_url, na_values="?").dropna()
auto_df.info()

## The bootstrap

Suppose that we wish to estimate the uncertainty of a coefficient estimate $\hat{\beta}_1$ from a linear regression fit. We draw $B$ bootstrap samples from our dataset, each obtained by sampling with replacement and each the same size as the original dataset, refit the linear regression model on every sample, and record the resulting estimates $\hat{\beta}_1^{*1}, \hat{\beta}_1^{*2}, \dots, \hat{\beta}_1^{*B}$. With enough resamples (typically 1,000 or 10,000), we can plot the distribution of these bootstrapped estimates $\hat{\beta}_1^{*i}, i = 1,\dots, B$. Then, we can use the resulting distribution to quantify the variability of this estimate by calculating useful summary statistics, such as standard errors and confidence intervals.
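
As a worked formula, and writing $B$ for the number of bootstrap resamples, the bootstrap standard error of $\hat{\beta}_1$ is simply the sample standard deviation of the bootstrapped estimates:

$$
\mathrm{SE}_B(\hat{\beta}_1) = \sqrt{\frac{1}{B-1} \sum_{i=1}^{B} \left( \hat{\beta}_1^{*i} - \bar{\beta}_1^{*} \right)^2},
\qquad
\bar{\beta}_1^{*} = \frac{1}{B} \sum_{i=1}^{B} \hat{\beta}_1^{*i}
$$

One simple choice of confidence interval, the percentile interval, takes the 2.5th and 97.5th percentiles of $\hat{\beta}_1^{*1}, \dots, \hat{\beta}_1^{*B}$ as an approximate 95% interval for $\beta_1$.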

The power of the bootstrap lies in the ability to take repeated samples from the existing dataset, instead of collecting a new dataset each time. Also, the standard error estimates typically reported by statistical software rely on algebraic formulas and their underlying assumptions (for example, that the linear model is correctly specified and the noise variance is constant). Bootstrapped standard error estimates are computed directly from the resampled fits and do not depend on those assumptions, so they can be more reliable when the assumptions do not hold.
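
To make the comparison between formula-based and bootstrapped standard errors concrete, here is a minimal sketch using only NumPy. The data are synthetic and the names (`fit_slope`, `n_boot`, and so on) are assumptions made for illustration; they are not part of the `Auto` example. Because the synthetic data satisfy the usual linear-model assumptions, the two estimates should come out close to each other.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): y = 2 + 3x + noise
n = 200
x = rng.normal(size=n)
y = 2.0 + 3.0 * x + rng.normal(scale=1.0, size=n)


def fit_slope(x, y):
    """Least-squares slope of y on x (model includes an intercept)."""
    slope, _intercept = np.polyfit(x, y, deg=1)
    return slope


# Formula-based standard error of the slope under the usual OLS assumptions
slope_hat, intercept_hat = np.polyfit(x, y, deg=1)
residuals = y - (intercept_hat + slope_hat * x)
sigma2_hat = residuals @ residuals / (n - 2)
se_formula = np.sqrt(sigma2_hat / np.sum((x - x.mean()) ** 2))

# Bootstrap standard error: resample rows with replacement and refit
n_boot = 1000
boot_slopes = np.empty(n_boot)
for b in range(n_boot):
    idx = rng.integers(0, n, size=n)  # row indices drawn with replacement
    boot_slopes[b] = fit_slope(x[idx], y[idx])
se_bootstrap = boot_slopes.std(ddof=1)

print(f"formula SE:   {se_formula:.4f}")
print(f"bootstrap SE: {se_bootstrap:.4f}")
```

If the noise variance depended on `x`, the formula-based value would no longer be trustworthy, while the bootstrap procedure would stay the same.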


**Bootstrap example using `scikit-learn`** (adapted from this [blog post](https://ethanwicker.com/2021-02-23-bootstrap-resampling-001/))

In [None]:
# Defining number of iterations for bootstrap resample
n_iterations = 1000

# Initializing estimator
lin_reg = LinearRegression()

# Initializing DataFrame, to hold bootstrapped statistics
bootstrapped_stats = pd.DataFrame()

# Each loop iteration is a single bootstrap resample and model fit
for i in range(n_iterations):

    # Sampling n_samples from data, with replacement, as train
    # Defining test to be all observations not in train
    train = resample(auto_df, replace=True, n_samples=len(auto_df))
    test = auto_df[~auto_df.index.isin(train.index)]

    X_train = train.loc[:, ["horsepower", "weight"]]
    y_train = train.loc[:, ["mpg"]]

    X_test = test.loc[:, ["horsepower", "weight"]]
    y_test = test.loc[:, ["mpg"]]

    # Fitting linear regression model
    lin_reg.fit(X_train, y_train)

    # Storing stats in DataFrame, and concatenating with stats
    intercept = lin_reg.intercept_
    beta_horsepower = lin_reg.coef_.ravel()[0]
    beta_weight = lin_reg.coef_.ravel()[1]
    r_squared = lin_reg.score(X_test, y_test)

    bootstrapped_stats_i = pd.DataFrame(
        data=dict(
            intercept=intercept,
            beta_horsepower=beta_horsepower,
            beta_weight=beta_weight,
            r_squared=r_squared,
        )
    )

    # ignore_index=True gives each resample its own row label (0, 1, 2, ...)
    bootstrapped_stats = pd.concat(
        objs=[bootstrapped_stats, bootstrapped_stats_i], ignore_index=True
    )

In [None]:
bootstrapped_stats.head()
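
In the loop above, `test` holds the observations that were not drawn into a given bootstrap sample (often called out-of-bag observations). Each observation is left out of a resample of size $n$ with probability $(1 - 1/n)^n$, which approaches $e^{-1} \approx 0.368$ as $n$ grows, so roughly a third of the data is available for testing in each iteration. The sketch below checks this by simulation; `n_obs` is a hypothetical dataset size chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs = 392   # hypothetical dataset size, chosen for illustration
n_boot = 1000

oob_fractions = np.empty(n_boot)
for b in range(n_boot):
    # Draw row indices with replacement, as a bootstrap resample does
    idx = rng.integers(0, n_obs, size=n_obs)
    # Rows that never appear in idx are out-of-bag for this resample
    oob_fractions[b] = 1 - np.unique(idx).size / n_obs

expected = (1 - 1 / n_obs) ** n_obs
print(f"simulated mean out-of-bag fraction: {oob_fractions.mean():.3f}")
print(f"theoretical (1 - 1/n)^n:            {expected:.3f}")
```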

Plot the distribution of the bootstrapped estimates of the coefficients and the corresponding test scores from the `Auto` dataset:

In [None]:
# Plotting histograms
fig, axes = plt.subplots(1, 4, figsize=(18, 5))
sns.histplot(bootstrapped_stats["intercept"], color="royalblue", ax=axes[0], kde=True)
sns.histplot(bootstrapped_stats["beta_horsepower"], color="olive", ax=axes[1], kde=True)
sns.histplot(bootstrapped_stats["beta_weight"], color="gold", ax=axes[2], kde=True)
sns.histplot(bootstrapped_stats["r_squared"], color="teal", ax=axes[3], kde=True)
plt.show()
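
Beyond histograms, the bootstrapped estimates can be condensed into the summary statistics mentioned earlier: standard errors and confidence intervals. The following is a minimal sketch, assuming a DataFrame with one column per statistic and one row per resample. The helper `summarize_bootstrap` and the stand-in data `demo_stats` are hypothetical names introduced for illustration, and the interval shown is the simple percentile interval.

```python
import numpy as np
import pandas as pd


def summarize_bootstrap(stats, alpha=0.05):
    """Mean, standard error, and percentile interval for each column."""
    return pd.DataFrame(
        {
            "mean": stats.mean(),
            "std_error": stats.std(ddof=1),
            "ci_lower": stats.quantile(alpha / 2),
            "ci_upper": stats.quantile(1 - alpha / 2),
        }
    )


# Synthetic stand-in for a table of bootstrapped estimates (illustration only)
rng = np.random.default_rng(0)
demo_stats = pd.DataFrame(
    {
        "beta_a": rng.normal(loc=-0.05, scale=0.01, size=1000),
        "beta_b": rng.normal(loc=0.5, scale=0.1, size=1000),
    }
)

summary = summarize_bootstrap(demo_stats)
print(summary)
```

In this notebook, calling `summarize_bootstrap(bootstrapped_stats)` would give one row per bootstrapped statistic (`intercept`, `beta_horsepower`, `beta_weight`, `r_squared`).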

## Exercises


