Nested CV starts with a standard $k$-fold CV split of the training set. Each of the $k$ outer iterations holds out one fold as the test fold and uses the remaining $k-1$ folds as the training fold. The training fold is then split further into $k'$ inner folds, each giving a training 'subfold' and a validation 'subfold'. Training and validation on the inner $k'$ folds select the optimal hyperparameters; the model refit with those hyperparameters is then scored on the outer test fold. This is repeated over all $k$ outer folds, and the $k$ test scores are averaged.

This method is intended to give an almost unbiased error estimate for the whole procedure, hyperparameter tuning included; see S. Varma and R. Simon, "Bias in Error Estimation When Using Cross-Validation for Model Selection", BMC Bioinformatics, 7(1): 91, 2006, https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-7-91.
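
To make the two loops described above concrete, here is a minimal hand-written sketch of nested CV. It is an illustration under assumptions, not the code used in the cells that follow: the helper name `nested_cv_scores` is hypothetical, the folds are shuffled `StratifiedKFold` splits, and the data comes from scikit-learn's bundled `load_breast_cancer` copy of the Wisconsin diagnostic data rather than from the CSV download.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier


def nested_cv_scores(X, y, param_grid, outer_k=5, inner_k=2, seed=42):
    """Hypothetical helper: one accuracy per outer fold, tuned on inner folds only."""
    outer = StratifiedKFold(n_splits=outer_k, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in outer.split(X, y):
        # Inner loop: the grid search only ever sees the outer training fold.
        inner = StratifiedKFold(n_splits=inner_k, shuffle=True, random_state=seed)
        search = GridSearchCV(DecisionTreeClassifier(random_state=seed),
                              param_grid=param_grid,
                              scoring='accuracy',
                              cv=inner)
        search.fit(X[train_idx], y[train_idx])
        # Outer loop: score the refit best model on the untouched test fold.
        scores.append(search.score(X[test_idx], y[test_idx]))
    return np.array(scores)


# Assumption: the bundled dataset stands in for the downloaded wdbc.data file.
X_demo, y_demo = load_breast_cancer(return_X_y=True)
demo_scores = nested_cv_scores(X_demo, y_demo, {'max_depth': [1, 3, 5, None]})
print(f"Nested CV accuracy = {demo_scores.mean():.3f} +/- {demo_scores.std():.3f}")
```

Passing a `GridSearchCV` object to `cross_val_score`, as the cells below do, performs the same two loops without the explicit `for`.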

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split

# Algorithms to compare
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

from sklearn.model_selection import GridSearchCV, cross_val_score

In [2]:
# Breast cancer data, with diagnosis as target variable
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data', header=None)
print(df.loc[:, 1].value_counts())

# Column 1 is the diagnosis label (B/M); columns 2 onward are the features
y = df.loc[:, 1].values
X = df.loc[:, 2:].values

le = LabelEncoder()
y = le.fit_transform(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=42)

1
B    357
M    212
Name: count, dtype: int64


In [3]:
# We will implement a 5 x 2 nested CV scheme here:
# the inner k' = 2 folds run the grid search that optimizes the estimator,
# and the outer k = 5 folds test the estimator refit with the optimal parameters

gs_dt = GridSearchCV(estimator=DecisionTreeClassifier(random_state=42),
                     param_grid=[{'max_depth': [1, 2, 3, 4, 5, 6, 7, None]}],
                     scoring='accuracy',
                     cv=2)

scores = cross_val_score(gs_dt, X_train, y_train, scoring='accuracy', cv=5)

print(f"Decision tree CV accuracy = {np.mean(scores):.3f} +/- {np.std(scores):.3f}.")

Decision tree CV accuracy = 0.925 +/- 0.030.
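
`cross_val_score` returns only the outer scores, so the hyperparameters chosen inside each outer fold stay hidden. A minimal sketch of one way to inspect them, assuming scikit-learn's `cross_validate` with `return_estimator=True` and, as a stand-in for the CSV download, the bundled `load_breast_cancer` data (the names `X_demo`, `y_demo`, `search` and `result` are illustrative only):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Assumption: bundled copy of the data, used so the sketch runs offline.
X_demo, y_demo = load_breast_cancer(return_X_y=True)

search = GridSearchCV(estimator=DecisionTreeClassifier(random_state=42),
                      param_grid={'max_depth': [1, 2, 3, 4, 5, 6, 7, None]},
                      scoring='accuracy',
                      cv=2)

# return_estimator=True keeps the fitted GridSearchCV from every outer fold.
result = cross_validate(search, X_demo, y_demo, scoring='accuracy', cv=5,
                        return_estimator=True)

for fold, (est, score) in enumerate(zip(result['estimator'], result['test_score']), start=1):
    print(f"Outer fold {fold}: best params = {est.best_params_}, accuracy = {score:.3f}")
```

If the selected `max_depth` varies a lot between outer folds, the tuning is probably unstable on this amount of data, which is worth knowing before trusting a single set of "best" parameters.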


In [4]:
pipe_svc = make_pipeline(StandardScaler(),
                         SVC(random_state=42))

param_range = np.logspace(-4, 3, 8)
param_grid = [
    # For the linear kernel in SVC, scan over regularization strengths
    {'svc__C': param_range,
     'svc__kernel': ['linear']},
    # For the RBF (Gaussian) kernel, scan over different regularization strengths
    # as well as different Gaussian widths
    {'svc__C': param_range,
     'svc__gamma': param_range,
     'svc__kernel': ['rbf']}]

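gs_svc = GridSearchCV(estimator=pipe_svc,
                      param_grid=param_grid,
                      scoring='accuracy',
                      cv=2,
                      refit=True,  # Refits the best candidate on all data passed to fit()
                      n_jobs=-1)

# cross_val_score below clones gs_svc, so this fit does not affect the nested CV score;
# it only leaves gs_svc itself fitted on the whole training set
gs_svc.fit(X_train, y_train)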

scores = cross_val_score(gs_svc, X_train, y_train, cv=5, scoring='accuracy')

print(f"SVM CV accuracy = {np.mean(scores):.3f} +/- {np.std(scores):.3f}.")

SVM CV accuracy = 0.972 +/- 0.006.
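
Nested CV estimates how well each tuning procedure generalizes; it does not by itself produce a final model. One plausible follow-up, sketched here under assumptions, is to rerun the grid search for the better-scoring algorithm on the whole training set and score it once on the held-out test split. The reduced `small_range` grid, the variable names, and the use of the bundled `load_breast_cancer` data in place of the CSV download are all choices made for this illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: bundled copy of the data, split the same way as in the notebook.
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25,
                                          stratify=y_demo, random_state=42)

# Assumption: a smaller grid than the one above, to keep the sketch quick.
small_range = np.logspace(-2, 2, 5)
final_search = GridSearchCV(
    estimator=make_pipeline(StandardScaler(), SVC(random_state=42)),
    param_grid=[{'svc__C': small_range, 'svc__kernel': ['linear']},
                {'svc__C': small_range, 'svc__gamma': small_range,
                 'svc__kernel': ['rbf']}],
    scoring='accuracy',
    cv=2,
    refit=True)  # best candidate is refit on all of X_tr

final_search.fit(X_tr, y_tr)

# The test split is used exactly once, after all tuning decisions are made.
test_accuracy = final_search.score(X_te, y_te)
print(f"Selected parameters: {final_search.best_params_}")
print(f"Held-out test accuracy = {test_accuracy:.3f}")
```

Because the scaler lives inside the pipeline, it is refit on each training portion during the search, so no statistics from validation or test data leak into the fitted model.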
