# L10: Model Evaluation 3 -- Cross-Validation and Model Selection

In [1]:
%load_ext watermark
%watermark -a 'Sebastian Raschka' -u -d -v -p numpy,mlxtend,matplotlib,sklearn

Author: Sebastian Raschka

Last updated: 2021-11-08

Python implementation: CPython
Python version       : 3.9.6
IPython version      : 7.29.0

numpy     : 1.21.2
mlxtend   : 0.19.0
matplotlib: 3.4.3
sklearn   : 1.0



In [2]:
import numpy as np
import matplotlib.pyplot as plt

<p style="margin-bottom:5cm;"></p>

## K-fold Cross-Validation in Scikit-Learn

- Simple demonstration of using a cross-validation iterator in scikit-learn

In [3]:
from sklearn.model_selection import KFold


rng = np.random.RandomState(123)

y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
X = rng.random_sample((y.shape[0], 4))


cv = KFold(n_splits=5)

for k in cv.split(X, y):
    print(k)

(array([2, 3, 4, 5, 6, 7, 8, 9]), array([0, 1]))
(array([0, 1, 4, 5, 6, 7, 8, 9]), array([2, 3]))
(array([0, 1, 2, 3, 6, 7, 8, 9]), array([4, 5]))
(array([0, 1, 2, 3, 4, 5, 8, 9]), array([6, 7]))
(array([0, 1, 2, 3, 4, 5, 6, 7]), array([8, 9]))


<p style="margin-bottom:5cm;"></p>

- In practice, we usually want to shuffle the dataset: if the data records are ordered by class label, unshuffled folds can end up with classes that are poorly represented (or missing entirely) in the training and test folds

In [4]:
cv = KFold(n_splits=5, random_state=123, shuffle=True)

for k in cv.split(X, y):
    print(k)

(array([1, 2, 3, 5, 6, 7, 8, 9]), array([0, 4]))
(array([0, 1, 2, 3, 4, 6, 8, 9]), array([5, 7]))
(array([0, 1, 2, 4, 5, 6, 7, 9]), array([3, 8]))
(array([0, 2, 3, 4, 5, 7, 8, 9]), array([1, 6]))
(array([0, 1, 3, 4, 5, 6, 7, 8]), array([2, 9]))
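
- To make the motivation for shuffling concrete, here is a minimal sketch of the failure case (the names `y_sorted` and `X_dummy` are made up for this illustration): with class-sorted labels and an unshuffled 2-fold split, each training fold contains only a single class

```python
import numpy as np
from sklearn.model_selection import KFold

# Labels sorted by class: five 0s followed by five 1s
y_sorted = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
X_dummy = np.zeros((y_sorted.shape[0], 1))  # placeholder features (illustrative only)

# Without shuffling, each fold is a contiguous block of records
unshuffled_folds = [
    (y_sorted[train_idx], y_sorted[valid_idx])
    for train_idx, valid_idx in KFold(n_splits=2).split(X_dummy)
]

for train_labels, valid_labels in unshuffled_folds:
    print('train:', train_labels, '| valid:', valid_labels)
```

- A classifier fit on such a fold never sees the class it is then evaluated on, which is why `shuffle=True` (or stratification) matters for ordered data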


<p style="margin-bottom:5cm;"></p>

- Note that the `KFold` iterator only provides us with the array indices; in practice, we are actually interested in the array values (feature values and class labels)

In [5]:
cv = KFold(n_splits=5, random_state=123, shuffle=True)

for train_idx, valid_idx in cv.split(X, y):
    print('train labels with shuffling', y[train_idx])

train labels with shuffling [0 0 0 1 1 1 1 1]
train labels with shuffling [0 0 0 0 0 1 1 1]
train labels with shuffling [0 0 0 0 1 1 1 1]
train labels with shuffling [0 0 0 0 1 1 1 1]
train labels with shuffling [0 0 0 0 1 1 1 1]


<p style="margin-bottom:5cm;"></p>

- As discussed in the lecture, it's important to stratify the splits (this is especially crucial for small datasets!)

In [6]:
from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits=5, random_state=123, shuffle=True)

for train_idx, valid_idx in cv.split(X, y):
    print('train labels', y[train_idx])

train labels [0 0 0 0 1 1 1 1]
train labels [0 0 0 0 1 1 1 1]
train labels [0 0 0 0 1 1 1 1]
train labels [0 0 0 0 1 1 1 1]
train labels [0 0 0 0 1 1 1 1]
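
- As a small check of what stratification buys us, the sketch below counts the class labels in each validation fold of an imbalanced toy label array (`y_imb`, a 15:5 class ratio, is an assumption made up for this example and is not part of the lecture data)

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 15 records of class 0, 5 records of class 1
y_imb = np.array([0] * 15 + [1] * 5)
X_dummy = np.zeros((y_imb.shape[0], 1))  # placeholder features (illustrative only)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)

# Count how many records of each class land in every validation fold
valid_counts = [
    np.bincount(y_imb[valid_idx], minlength=2)
    for _, valid_idx in cv.split(X_dummy, y_imb)
]

for counts in valid_counts:
    print('validation class counts', counts)
```

- Each validation fold keeps the 3:1 class ratio of the full label array, so no fold is left without examples of the minority class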


<p style="margin-bottom:5cm;"></p>

- After the illustrations of cross-validation above, the next cell demonstrates how we can actually use the iterators provided through scikit-learn to fit and evaluate a learning algorithm

In [7]:
from sklearn.tree import DecisionTreeClassifier
from mlxtend.data import iris_data
from sklearn.model_selection import train_test_split


X, y = iris_data()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123, test_size=0.15, 
                                                    shuffle=True, stratify=y)



cv = StratifiedKFold(n_splits=10, random_state=123, shuffle=True)

kfold_acc = 0.
for train_idx, valid_idx in cv.split(X_train, y_train):
    clf = DecisionTreeClassifier(random_state=123, max_depth=3).fit(X_train[train_idx], y_train[train_idx])
    y_pred = clf.predict(X_train[valid_idx])
    acc = np.mean(y_pred == y_train[valid_idx])*100
    kfold_acc += acc
kfold_acc /= cv.get_n_splits()  # average over the number of folds
    
clf = DecisionTreeClassifier(random_state=123, max_depth=3).fit(X_train, y_train)
y_pred = clf.predict(X_test)
test_acc = np.mean(y_pred == y_test)*100
    
print('Kfold Accuracy: %.2f%%' % kfold_acc)
print('Test Accuracy: %.2f%%' % test_acc)



Kfold Accuracy: 95.26%
Test Accuracy: 95.65%
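
- Besides the average, it is often useful to look at the spread of the per-fold accuracies. The following is a sketch of the same manual loop that keeps the individual fold scores; as an assumption for self-containedness, it loads Iris via scikit-learn's `load_iris` instead of `mlxtend.data.iris_data`, so the numbers may differ slightly from the cell above

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's copy of Iris stands in for mlxtend's iris_data()
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=123, test_size=0.15, shuffle=True, stratify=y)

cv = StratifiedKFold(n_splits=10, random_state=123, shuffle=True)

fold_accs = []
for train_idx, valid_idx in cv.split(X_train, y_train):
    clf = DecisionTreeClassifier(random_state=123, max_depth=3)
    clf.fit(X_train[train_idx], y_train[train_idx])
    # score() returns the accuracy on the held-out validation fold
    fold_accs.append(clf.score(X_train[valid_idx], y_train[valid_idx]) * 100)

fold_accs = np.array(fold_accs)
print('Kfold Accuracy: %.2f%% +/- %.2f%%' % (fold_accs.mean(), fold_accs.std()))
```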


<p style="margin-bottom:5cm;"></p>

- Usually, a more convenient way to use cross-validation through scikit-learn is to use the `cross_val_score` function (note that when `cv` is an integer and the estimator is a classifier, it performs stratified splitting by default)
- (remember to ask students about whitespaces according to pep8)

In [8]:
from sklearn.model_selection import cross_val_score


cv_acc = cross_val_score(estimator=DecisionTreeClassifier(random_state=123, max_depth=3),
                         X=X_train,
                         y=y_train,
                         cv=10,
                         n_jobs=-1)

print('Kfold Accuracy: %.2f%%' % (np.mean(cv_acc)*100))

Kfold Accuracy: 96.09%
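
- To illustrate the default behavior, the sketch below compares `cv=10` against an explicit, unshuffled `StratifiedKFold(n_splits=10)` for a classifier; the per-fold scores should coincide (as an assumption, Iris is loaded via scikit-learn's `load_iris` here to keep the example self-contained)

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's copy of Iris stands in for mlxtend's iris_data()
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=123, test_size=0.15, shuffle=True, stratify=y)

tree = DecisionTreeClassifier(random_state=123, max_depth=3)

# Integer cv: scikit-learn picks the splitter for us
scores_int = cross_val_score(tree, X_train, y_train, cv=10)

# Explicit stratified splitter without shuffling
scores_skf = cross_val_score(tree, X_train, y_train,
                             cv=StratifiedKFold(n_splits=10))

print('Same fold scores:', np.allclose(scores_int, scores_skf))
```

- Because the default splitter does not shuffle, its folds differ from those of a shuffled `StratifiedKFold`, which is one reason two cross-validation runs on the same data can report different averages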


<p style="margin-bottom:5cm;"></p>

- `cross_val_score` unfortunately has no parameter for specifying a random seed; this is not an issue in regular use cases, but it is limiting if you want to do "repeated cross-validation"
- The next cell illustrates how we can provide our own cross-validation iterator for convenience (note that the results match our "manual" `StratifiedKFold` approach we performed earlier)

In [9]:
from sklearn.model_selection import cross_val_score


cv_acc = cross_val_score(estimator=DecisionTreeClassifier(random_state=123, max_depth=3),
                         X=X_train,
                         y=y_train,
                         cv=StratifiedKFold(n_splits=10, random_state=123, shuffle=True),
                         n_jobs=-1)

print('Kfold Accuracy: %.2f%%' % (np.mean(cv_acc)*100))

Kfold Accuracy: 95.26%
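
- For "repeated cross-validation", one option is scikit-learn's `RepeatedStratifiedKFold` iterator, which reshuffles the data before each repetition. This is a minimal sketch rather than the approach used in the lecture; the choice of 5 repetitions is arbitrary, and Iris is loaded via `load_iris` as an assumption to keep the example self-contained

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import (RepeatedStratifiedKFold, cross_val_score,
                                     train_test_split)
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's copy of Iris stands in for mlxtend's iris_data()
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=123, test_size=0.15, shuffle=True, stratify=y)

# 10-fold stratified CV, repeated 5 times with different shuffles
rcv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=123)

rep_scores = cross_val_score(
    estimator=DecisionTreeClassifier(random_state=123, max_depth=3),
    X=X_train, y=y_train, cv=rcv)

# One row per repetition, one column per fold
per_repeat_acc = rep_scores.reshape(5, 10).mean(axis=1) * 100

print('Accuracy per repetition:', np.round(per_repeat_acc, 2))
print('Repeated Kfold Accuracy: %.2f%%' % per_repeat_acc.mean())
```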


<p style="margin-bottom:5cm;"></p>

## Bootstrap

- Recall bootstrapping from two lectures ago? Here is an iterator I implemented analogous to `KFold`

In [11]:
from mlxtend.evaluate import BootstrapOutOfBag

oob = BootstrapOutOfBag(n_splits=5, random_seed=99)
for train, test in oob.split(np.array([1, 2, 3, 4, 5])):
    print(train, test)

[1 3 1 0 1] [2 4]
[0 2 4 4 1] [3]
[3 1 1 0 3] [2 4]
[3 4 2 0 4] [1]
[0 0 4 1 3] [2]
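
- To show the idea behind such an iterator, here is a from-scratch NumPy sketch of out-of-bag bootstrap splitting. The function `bootstrap_oob_split` is hypothetical and is not mlxtend's implementation, so its indices will not match the output above even with the same seed

```python
import numpy as np


def bootstrap_oob_split(n_samples, n_splits, seed):
    """Yield (train_idx, test_idx) pairs for out-of-bag bootstrapping."""
    rng = np.random.RandomState(seed)
    all_idx = np.arange(n_samples)
    for _ in range(n_splits):
        # Bootstrap sample: draw n_samples indices WITH replacement
        train_idx = rng.choice(all_idx, size=n_samples, replace=True)
        # Out-of-bag set: every index that was never drawn
        test_idx = np.setdiff1d(all_idx, train_idx)
        yield train_idx, test_idx


splits = list(bootstrap_oob_split(n_samples=5, n_splits=5, seed=99))
for train_idx, test_idx in splits:
    print(train_idx, test_idx)
```

- Note that the training indices contain duplicates, and that the size of the out-of-bag test set varies from round to round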


<p style="margin-bottom:5cm;"></p>

- Analogous to the `KFold` iterator, we can use it in the `cross_val_score` function for convenience

In [13]:
cv_acc = cross_val_score(estimator=DecisionTreeClassifier(random_state=99, max_depth=3),
                         X=X_train,
                         y=y_train,
                         cv=BootstrapOutOfBag(n_splits=200, random_seed=99),
                         n_jobs=-1)

print('OOB Bootstrap Accuracy: %.2f%%' % (np.mean(cv_acc)*100))

OOB Bootstrap Accuracy: 94.73%


<p style="margin-bottom:5cm;"></p>

- Analogous to the `cross_val_score` function, you can use the `bootstrap_point632_score` function, which implements the .632 bootstrap method (which is less pessimistically biased than the out-of-bag bootstrap)

In [14]:
from mlxtend.evaluate import bootstrap_point632_score


cv_acc = bootstrap_point632_score(estimator=DecisionTreeClassifier(random_state=123, max_depth=3),
                                  X=X_train,
                                  y=y_train,
                                  random_seed=99)

print('.632 Bootstrap Accuracy: %.2f%%' % (np.mean(cv_acc)*100))

.632 Bootstrap Accuracy: 95.55%
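
- As a rough illustration of the .632 idea, the sketch below combines, in each bootstrap round, the out-of-bag accuracy (weight 0.632) with the resubstitution accuracy (weight 0.368). This is a hand-rolled sketch, not mlxtend's code: computing the resubstitution accuracy on the full dataset, using all of Iris via `load_iris`, and the variable names are assumptions, so the result will not reproduce the number above

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's copy of Iris, used in full for brevity
X, y = load_iris(return_X_y=True)

rng = np.random.RandomState(99)
n_samples = X.shape[0]
all_idx = np.arange(n_samples)

round_accs = []
for _ in range(200):
    # Bootstrap sample (with replacement) and its out-of-bag complement
    train_idx = rng.choice(all_idx, size=n_samples, replace=True)
    oob_idx = np.setdiff1d(all_idx, train_idx)
    if oob_idx.size == 0:
        continue  # extremely unlikely; nothing to evaluate on

    clf = DecisionTreeClassifier(random_state=123, max_depth=3)
    clf.fit(X[train_idx], y[train_idx])

    acc_oob = clf.score(X[oob_idx], y[oob_idx])  # pessimistically biased
    acc_resub = clf.score(X, y)                  # optimistically biased (assumed definition)

    # Weighted combination of the two estimates
    round_accs.append(0.632 * acc_oob + 0.368 * acc_resub)

acc_632 = np.mean(round_accs)
print('.632 Bootstrap Accuracy (sketch): %.2f%%' % (acc_632 * 100))
```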


- By default, `bootstrap_point632_score` uses the setting `method='.632'`
- By setting `method='.632+'`, we can also perform the .632+ bootstrap, which corrects for the optimistic bias of the .632 estimate; this is shown below

In [15]:
cv_acc = bootstrap_point632_score(estimator=DecisionTreeClassifier(random_state=123, max_depth=3),
                                  X=X_train,
                                  y=y_train,
                                  method='.632+',
                                  n_splits=200,
                                  random_seed=99)

print('.632+ Bootstrap Accuracy: %.2f%%' % (np.mean(cv_acc)*100))

.632+ Bootstrap Accuracy: 95.51%
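
- The .632+ variant replaces the fixed weight 0.632 with a weight that grows with the amount of overfitting. Below is a sketch of one common formulation in terms of error rates (error = 1 - accuracy); the helper names `no_information_rate` and `point632plus_weight` are hypothetical, and mlxtend's internals may differ in detail

```python
import numpy as np


def no_information_rate(y_true, y_pred):
    """Expected error if labels and predictions were independent."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes = np.unique(np.concatenate([y_true, y_pred]))
    p = np.array([(y_true == k).mean() for k in classes])  # label proportions
    q = np.array([(y_pred == k).mean() for k in classes])  # prediction proportions
    return float(np.sum(p * (1.0 - q)))


def point632plus_weight(err_train, err_oob, gamma):
    """Weight on the out-of-bag error in the .632+ estimate."""
    err_oob = min(err_oob, gamma)  # cap at the no-information rate
    if err_oob > err_train and gamma > err_train:
        # Relative overfitting rate, between 0 (none) and 1 (maximal)
        r = (err_oob - err_train) / (gamma - err_train)
    else:
        r = 0.0
    return 0.632 / (1.0 - 0.368 * r)


gamma = no_information_rate([0, 0, 1, 1], [0, 0, 1, 1])
w_no_overfit = point632plus_weight(err_train=0.1, err_oob=0.1, gamma=gamma)
w_max_overfit = point632plus_weight(err_train=0.0, err_oob=0.5, gamma=gamma)
print(gamma, w_no_overfit, w_max_overfit)
```

- The per-round error estimate is then `(1 - w) * err_train + w * err_oob`: without overfitting the weight stays at 0.632 (the plain .632 bootstrap), and with maximal overfitting it reaches 1, so the estimate falls back to the out-of-bag error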


- Finally, for your convenience, you can also set `method='oob'` to run a regular out-of-bag bootstrap:

In [16]:
cv_acc = bootstrap_point632_score(estimator=DecisionTreeClassifier(random_state=123, max_depth=3),
                                  X=X_train,
                                  y=y_train,
                                  method='oob',
                                  n_splits=200,
                                  random_seed=99)

print('OOB Bootstrap Accuracy: %.2f%%' % (np.mean(cv_acc)*100))

OOB Bootstrap Accuracy: 94.77%
