STAT 451: Machine Learning (Fall 2020)  
Instructor: Sebastian Raschka (sraschka@wisc.edu)  

Course website: http://pages.stat.wisc.edu/~sraschka/teaching/stat451-fs2020/

# L09: Model Evaluation 2 -- Confidence Intervals and Resampling

<br>
<br>
<br>

# 5. Out-of-Bag Bootstrap

In this section, we look at the out-of-bag (OOB) bootstrap method, which I recently implemented in mlxtend.

In [1]:
from mlxtend.evaluate import BootstrapOutOfBag
import numpy as np




oob = BootstrapOutOfBag(n_splits=3, random_seed=1)
for train, test in oob.split(np.array([1, 2, 3, 4, 5])):
    print(train, test)

[3 4 0 1 3] [2]
[0 0 1 4 4] [2 3]
[1 2 4 2 4] [0 3]
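
Note that the arrays printed above are indices into the dataset, not the data values themselves. To illustrate the idea behind such a split, here is a minimal NumPy sketch (not mlxtend's actual implementation): the training indices are drawn with replacement, and the test set consists of the indices that were never drawn. The helper name `oob_split` is hypothetical, and because a different random number generator is used, the indices will not match the output above.

```python
import numpy as np


def oob_split(n_samples, rng):
    """Return (train_idx, test_idx) for one out-of-bag bootstrap round.

    Hypothetical helper for illustration only.
    """
    indices = np.arange(n_samples)
    # Bootstrap sample: draw n_samples indices with replacement
    train_idx = rng.choice(indices, size=n_samples, replace=True)
    # Out-of-bag samples: indices that were never drawn
    test_idx = np.setdiff1d(indices, train_idx)
    return train_idx, test_idx


rng = np.random.default_rng(1)
for _ in range(3):
    train_idx, test_idx = oob_split(5, rng)
    print(train_idx, test_idx)
```

Since the test set is whatever the bootstrap sample leaves out, its size varies from round to round, which is also visible in the mlxtend output above.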


The reason I chose an object-oriented implementation is that we can plug it into scikit-learn's `cross_val_score` function, which is super convenient.

In [2]:
from mlxtend.data import iris_data
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split


X, y = iris_data()

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=123, stratify=y)


model = DecisionTreeClassifier(random_state=123)

Below, we first use the standard approach for `cross_val_score`, which performs 5-fold cross-validation when we set `cv=5`. Note that

- if the model is a scikit-learn classifier, stratified k-fold cross-validation will be performed by default, and the reported evaluation metric is the prediction accuracy;
- if the model is a scikit-learn regressor, standard k-fold cross-validation will be performed by default, and the reported evaluation metric is the $R^2$ score on the test folds.
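
As a quick check of the classifier default described above, the following sketch compares `cv=5` with an explicitly constructed `StratifiedKFold` splitter and an explicit `scoring='accuracy'`. It uses scikit-learn's `load_iris` on the full dataset rather than the `X_train` split from this notebook, purely to keep the example self-contained.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=123)

# Integer cv: scikit-learn picks the splitter based on the estimator type
scores_int = cross_val_score(clf, X_iris, y_iris, cv=5)

# Explicit splitter and metric that should mirror the classifier defaults
scores_explicit = cross_val_score(
    clf, X_iris, y_iris,
    cv=StratifiedKFold(n_splits=5),
    scoring='accuracy')

print('Same scores:', np.allclose(scores_int, scores_explicit))
```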

In [3]:
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print('CV scores', cv_scores)
print('Mean CV score', np.mean(cv_scores))
print('CV score Std', np.std(cv_scores))

CV scores [0.94444444 1.         1.         0.88888889 0.94444444]
Mean CV score 0.9555555555555555
CV score Std 0.04157397096415492


Now, let's plug our OOB object into the `cross_val_score` function:

In [4]:
# 5 splits

bootstrap_scores = \
    cross_val_score(model, X_train, y_train, 
                    cv=BootstrapOutOfBag(n_splits=5, random_seed=123))

print('Bootstrap scores', bootstrap_scores)
print('Mean Bootstrap score', np.mean(bootstrap_scores))
print('Score Std', np.std(bootstrap_scores))

Bootstrap scores [0.93548387 0.96774194 0.96875    0.93023256 0.97058824]
Mean Bootstrap score 0.9545593199770531
Score Std 0.017819915677477555


In [5]:
bootstrap_scores = \
    cross_val_score(model, X_train, y_train, 
                    cv=BootstrapOutOfBag(n_splits=200, random_seed=123))

print('Mean Bootstrap score', np.mean(bootstrap_scores))
print('Score Std', np.std(bootstrap_scores))

Mean Bootstrap score 0.9483980861793887
Score Std 0.039817322453014004


In [6]:
lower = np.percentile(bootstrap_scores, 2.5)
upper = np.percentile(bootstrap_scores, 97.5)
print('95%% Confidence interval: [%.2f, %.2f]' % (100*lower, 100*upper))

95% Confidence interval: [83.33, 100.00]


In [7]:
model.fit(X_train, y_train)
model.score(X_test, y_test)

0.95

<br>
<br>
<br>

## MLxtend functional bootstrap API

###  OOB Bootstrap

Below is a more convenient way to compute the OOB Bootstrap. Note that it tends to be overly pessimistic.

In [8]:
from mlxtend.evaluate import bootstrap_point632_score

bootstrap_scores = bootstrap_point632_score(model, 
                                            X_train, y_train, 
                                            n_splits=200, 
                                            method='oob',
                                            random_seed=123)

print('Mean Bootstrap score', np.mean(bootstrap_scores))
print('Score Std', np.std(bootstrap_scores))

Mean Bootstrap score 0.9483980861793887
Score Std 0.039817322453014004


In [9]:
lower = np.percentile(bootstrap_scores, 2.5)
upper = np.percentile(bootstrap_scores, 97.5)
print('95%% Confidence interval: [%.2f, %.2f]' % (100*lower, 100*upper))

95% Confidence interval: [83.33, 100.00]
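
One common explanation for the pessimistic tendency of the OOB estimate is that each bootstrap sample contains only about 63.2% of the unique training examples, so every model is fit on less distinct data than is actually available. The sketch below checks this fraction by simulation and compares it with the closed-form value $1 - (1 - 1/n)^n$. The helper name `mean_unique_fraction` is hypothetical, and `n = 90` is assumed to match the size of `X_train` (60% of the 150 iris examples).

```python
import numpy as np


def mean_unique_fraction(n_samples, n_rounds=1000, seed=123):
    """Average fraction of unique examples across bootstrap samples.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    fractions = []
    for _ in range(n_rounds):
        # One bootstrap sample of indices, drawn with replacement
        idx = rng.integers(0, n_samples, size=n_samples)
        fractions.append(np.unique(idx).size / n_samples)
    return float(np.mean(fractions))


n = 90  # assumed training set size
simulated = mean_unique_fraction(n)
expected = 1.0 - (1.0 - 1.0 / n) ** n

print('Simulated fraction: %.3f' % simulated)
print('Closed-form fraction: %.3f' % expected)
```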


###  .632 Bootstrap

The .632 Bootstrap is the default setting of `bootstrap_point632_score`; it tends to be overly optimistic.

In [10]:
bootstrap_scores = bootstrap_point632_score(model, 
                                            X_train, y_train, 
                                            n_splits=200,
                                            random_seed=123)
print('Mean Bootstrap score', np.mean(bootstrap_scores))
print('Score Std', np.std(bootstrap_scores))

Mean Bootstrap score 0.9673875904653735
Score Std 0.02516454779030485


In [11]:
lower = np.percentile(bootstrap_scores, 2.5)
upper = np.percentile(bootstrap_scores, 97.5)
print('95%% Confidence interval: [%.2f, %.2f]' % (100*lower, 100*upper))

95% Confidence interval: [89.47, 100.00]
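
To make the weighting behind the .632 estimate concrete, here is a minimal sketch of the textbook formula, which mixes the out-of-bag accuracy with the resubstitution (training) accuracy using fixed weights. The function name `point632_accuracy` is hypothetical, and this is not meant to reproduce how mlxtend computes the two accuracies internally.

```python
def point632_accuracy(acc_train, acc_oob):
    """Textbook .632 estimate for one bootstrap round.

    acc_train: accuracy on the data the model was fit on
    acc_oob:   accuracy on the out-of-bag examples
    """
    return 0.632 * acc_oob + 0.368 * acc_train


# Example: perfect training accuracy, 90% out-of-bag accuracy
print('%.4f' % point632_accuracy(acc_train=1.0, acc_oob=0.9))
```

For a model that can memorize its training data, such as an unpruned decision tree, the training accuracy is typically close to 1.0, which pulls the estimate upward and helps explain the optimistic tendency.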


###  .632+ Bootstrap

The .632+ Bootstrap method attempts to address the optimistic bias of the regular .632 Bootstrap.

In [12]:
bootstrap_scores = bootstrap_point632_score(model, X_train, y_train, 
                                            n_splits=200, 
                                            method='.632+',
                                            random_seed=123)
print('Mean Bootstrap score', np.mean(bootstrap_scores))
print('Score Std', np.std(bootstrap_scores))

Mean Bootstrap score 0.9658029542600898
Score Std 0.027801366648921747


In [13]:
lower = np.percentile(bootstrap_scores, 2.5)
upper = np.percentile(bootstrap_scores, 97.5)
print('95%% Confidence interval: [%.2f, %.2f]' % (100*lower, 100*upper))

95% Confidence interval: [88.40, 100.00]
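
For reference, the .632+ correction can be sketched as follows, using the textbook formulation in terms of error rates (error = 1 - accuracy). Instead of the fixed weight 0.632, it uses a weight $w = 0.632 / (1 - 0.368 R)$, where the relative overfitting rate $R$ depends on the no-information rate $\gamma$. The function names below are hypothetical, and the sketch is not meant to mirror mlxtend's internals.

```python
import numpy as np


def no_information_rate(y_true, y_pred):
    """Error rate expected if labels and predictions were independent."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    classes = np.union1d(y_true, y_pred)
    p = np.array([np.mean(y_true == c) for c in classes])  # label proportions
    q = np.array([np.mean(y_pred == c) for c in classes])  # prediction proportions
    return float(np.sum(p * (1.0 - q)))


def point632_plus_error(err_train, err_oob, gamma):
    """Textbook .632+ error estimate for one bootstrap round."""
    # The out-of-bag error is capped at the no-information rate
    err_oob = min(err_oob, gamma)
    # Relative overfitting rate, set to 0 when it is not well defined
    if err_oob > err_train and gamma > err_train:
        r = (err_oob - err_train) / (gamma - err_train)
    else:
        r = 0.0
    # The weight grows from 0.632 (no overfitting) toward 1 (r = 1)
    w = 0.632 / (1.0 - 0.368 * r)
    return (1.0 - w) * err_train + w * err_oob


gamma = no_information_rate([0, 0, 1, 1], [0, 1, 0, 1])
print('No-information rate:', gamma)
print('.632+ error: %.4f' % point632_plus_error(0.0, 0.1, gamma))
```

When the model does not overfit, the weight stays at 0.632 and the estimate coincides with the regular .632 Bootstrap; the more the model overfits, the more weight shifts to the out-of-bag error.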
