# Assessing coefficients with bootstrapping

[Machine Learning Interpretability course](https://www.trainindata.com/p/machine-learning-interpretability)

In this notebook, we will use bootstrapping to estimate the error of the logistic regression coefficients.

If a coefficient is significant, then the contribution of that feature towards the predicted probability is meaningful.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from scipy import stats

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

### Load data

To obtain the data, check the folder `prepare-data` in this repo, or section 2 of the course.

In [2]:
# load titanic dataset

df = pd.read_csv('../titanic.csv')

# split data
X_train, X_test, y_train, y_test = train_test_split(
    df.drop("survived", axis=1), 
    df["survived"],
    test_size=0.15,
    random_state=1,
)

# scale the variables
scaler = StandardScaler().set_output(transform="pandas")

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

## Bootstrapping with Scikit-learn

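In the linear regression section, we used cross-validation and different partitions of the train set to estimate the error of the coefficients. Those are valid approaches, and you can use them for logistic regression as well.

Here, I will introduce an alternative way to estimate the error of the coefficients: **bootstrapping**. We resample the train set with replacement many times, fit a model on each resample, and use the spread of each coefficient across the models as its error.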
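To make the idea concrete before applying it to a model, here is a minimal sketch of bootstrapping the standard error of a simple statistic (the mean) on toy data. The helper `bootstrap_std_error` and the synthetic `toy_sample` are hypothetical names used only for illustration; they are not part of the course code.

```python
import numpy as np


def bootstrap_std_error(values, statistic=np.mean, n_resamples=1000, seed=0):
    """Estimate the standard error of `statistic` by resampling with replacement."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    n = len(values)
    estimates = np.empty(n_resamples)
    for b in range(n_resamples):
        # draw n observations with replacement from the original sample
        resample = rng.choice(values, size=n, replace=True)
        estimates[b] = statistic(resample)
    # the spread of the resampled statistics approximates the standard error
    return estimates.std(ddof=1)


# toy data (an assumption for illustration): 200 draws from a normal distribution
rng = np.random.default_rng(42)
toy_sample = rng.normal(loc=5.0, scale=2.0, size=200)

boot_se = bootstrap_std_error(toy_sample)
analytic_se = toy_sample.std(ddof=1) / np.sqrt(len(toy_sample))

print(f"bootstrap SE: {boot_se:.4f}, analytic SE: {analytic_se:.4f}")
```

For the mean, the bootstrap estimate should land close to the textbook formula. The appeal of bootstrapping is that the same recipe works for statistics without a convenient formula, such as the coefficients of a fitted model.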

In [3]:
# Train 50 models on different bootstrapped
# samples of the train set

s = dict()

for i in range(1, 51):

    # sample 80% of the train set with replacement;
    # the iteration number doubles as the random seed
    X_train_b = X_train.sample(frac=0.8, replace=True, random_state=i)
    y_train_b = y_train.loc[X_train_b.index]

    # train model without regularization
    logit = LogisticRegression(
        penalty=None, random_state=0).fit(X_train_b, y_train_b)

    # store coefficients, keyed by iteration number
    s[str(i)] = pd.Series(logit.coef_[0])

In [4]:
# put coefficients in a dataframe

df = pd.concat(s, axis=1)
df.index = logit.feature_names_in_

df.head()

| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| pclass | -0.534509 | -0.624764 | -0.351549 | -0.438266 | -0.512061 | -0.440676 | -0.578593 | -0.472116 | -0.615872 | -0.542379 | ... | -0.438867 | -0.382899 | -0.621534 | -0.416289 | -0.466371 | -0.52871 | -0.393777 | -0.406558 | -0.589398 | -0.394367 |
| sibsp | -0.064679 | -0.177056 | -0.328538 | -0.077019 | -0.084303 | -0.406636 | -0.178394 | -0.083164 | -0.206828 | -0.327789 | ... | -0.171062 | -0.243986 | -0.297647 | -0.402791 | -0.33638 | -0.346853 | -0.140565 | -0.400153 | -0.18564 | -0.285593 |
| parch | 0.015369 | 0.07158 | 0.062171 | -0.032246 | -0.0053 | 0.032669 | 0.17945 | 0.000523 | -0.147484 | 0.049313 | ... | 0.096691 | 0.301907 | 0.02419 | 0.140074 | 0.053208 | -0.002931 | 0.073487 | 0.099207 | 0.100552 | 0.050043 |
| sex_female | 1.04756 | 1.21641 | 1.220586 | 1.208595 | 1.168361 | 1.304313 | 1.194548 | 1.290748 | 1.190516 | 1.244512 | ... | 1.166265 | 1.08179 | 1.183863 | 1.295611 | 1.251984 | 1.338715 | 1.173924 | 1.10249 | 1.361943 | 1.141719 |
| embarked_S | 0.074015 | 0.135938 | 0.14195 | 0.168767 | -0.01513 | 0.201774 | -0.066315 | 0.039502 | 0.050337 | -0.025995 | ... | 0.009892 | -0.134835 | 0.180129 | -0.044966 | 0.1451 | 0.075629 | -0.089 | -0.084067 | 0.014126 | -0.034973 |


In [5]:
# Summarize variability of coefficients

coeff_summary = df.agg(["mean", "std"], axis=1)

coeff_summary

| | mean | std |
|---|---|---|
| pclass | -0.515202 | 0.117951 |
| sibsp | -0.240314 | 0.101813 |
| parch | 0.059448 | 0.097416 |
| sex_female | 1.218624 | 0.0919 |
| embarked_S | 0.012063 | 0.130206 |
| embarked_C | 0.255133 | 0.127116 |
| cabin_B | -0.287461 | 0.490073 |
| cabin_C | -0.444941 | 0.593502 |
| cabin_E | -0.067307 | 0.418947 |
| cabin_D | -0.247478 | 0.446451 |


In [6]:
# calculate z and the p-values

coeff_summary["z"] = coeff_summary["mean"] / coeff_summary["std"]

coeff_summary["p_values"] = stats.norm.sf(abs(coeff_summary["z"]))*2  # two sided

coeff_summary

| | mean | std | z | p_values |
|---|---|---|---|---|
| pclass | -0.515202 | 0.117951 | -4.367951 | 1.254175e-05 |
| sibsp | -0.240314 | 0.101813 | -2.360338 | 0.01825831 |
| parch | 0.059448 | 0.097416 | 0.61025 | 0.5416963 |
| sex_female | 1.218624 | 0.0919 | 13.260308 | 3.932762e-40 |
| embarked_S | 0.012063 | 0.130206 | 0.092647 | 0.9261837 |
| embarked_C | 0.255133 | 0.127116 | 2.007094 | 0.04473961 |
| cabin_B | -0.287461 | 0.490073 | -0.586567 | 0.5574943 |
| cabin_C | -0.444941 | 0.593502 | -0.749686 | 0.4534437 |
| cabin_E | -0.067307 | 0.418947 | -0.160658 | 0.8723629 |
| cabin_D | -0.247478 | 0.446451 | -0.554322 | 0.5793587 |
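As a sanity check on how `z` and the p-value are derived, the calculation can be repeated by hand for a single coefficient. The sketch below uses the rounded mean and standard deviation reported for `sibsp`, so the result should match the table only approximately.

```python
from scipy import stats

# rounded bootstrap mean and standard deviation of the sibsp coefficient
mean, std = -0.240314, 0.101813

# z: how many standard deviations the mean coefficient is away from zero
z = mean / std

# two-sided p-value under a standard normal distribution
p_value = 2 * stats.norm.sf(abs(z))

print(f"z = {z:.3f}, p = {p_value:.4f}")
```

Note that this test assumes the bootstrapped coefficients are approximately normally distributed around their mean.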


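The errors estimated empirically with bootstrapping are larger, and therefore the significance of the coefficients is smaller. Only `pclass`, `sibsp`, `sex_female` and `embarked_C` have p-values below 0.05.

This is probably because the variables are not completely independent; that is, there is some collinearity.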
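One quick way to look for collinearity is to inspect the pairwise correlations between the features. The following is a minimal sketch on synthetic data; the helper `correlated_pairs`, the `toy` dataframe and the 0.5 threshold are assumptions made for illustration and are not part of the course code.

```python
import numpy as np
import pandas as pd


def correlated_pairs(X: pd.DataFrame, threshold: float = 0.5) -> pd.Series:
    """Return feature pairs whose absolute Pearson correlation exceeds `threshold`."""
    corr = X.corr().abs()
    # keep only the upper triangle so each pair is listed once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# toy data: "b" is almost a copy of "a", while "c" is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=500),
    "c": rng.normal(size=500),
})

flagged = correlated_pairs(toy)
print(flagged)
```

In this notebook, the same check could be run as `correlated_pairs(X_train)`. Pairwise correlation only detects linear relationships between two variables at a time, so it is a first diagnostic rather than a complete test for collinearity.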