$k$-fold cross-validation is a robust way to estimate how well a model generalizes. First, do a train-test split. Then, split the training data into $k$ parts (folds). The model is trained from scratch $k$ times; on the $i^{th}$ run it is trained on all folds except the $i^{th}$, which is held out and used for evaluation. The overall score is the average of the $k$ evaluation scores. Do this for every model of interest (or for every set of hyperparameters to test). The best candidate by average score is then retrained from scratch on the entire training set.
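The splitting and averaging steps described above can be sketched in plain Python. This is a minimal illustration, not how scikit-learn implements it: the helper names (`k_fold_indices`, `cross_validate`, `fit_majority`, `accuracy`) and the toy data are assumptions made up for this example, the folds are contiguous with no shuffling or stratification, and the "model" is a trivial majority-class predictor.

```python
from statistics import mean


def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) index lists for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        val_idx = list(range(start, stop))
        train_idx = list(range(start)) + list(range(stop, n_samples))
        yield train_idx, val_idx
        start = stop


def cross_validate(fit, score, X, y, k):
    """Train from scratch on each of the k splits; return (mean, per-fold scores)."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores.append(score(model, [X[i] for i in val_idx], [y[i] for i in val_idx]))
    return mean(scores), scores


# Hypothetical stand-in model: always predict the majority training label.
def fit_majority(X_train, y_train):
    return max(set(y_train), key=y_train.count)


def accuracy(model, X_val, y_val):
    return sum(label == model for label in y_val) / len(y_val)


X = list(range(10))
y = [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]
mean_score, fold_scores = cross_validate(fit_majority, accuracy, X, y, k=5)
print(f"Per-fold accuracy: {fold_scores}, mean: {mean_score:.2f}")
```

Every sample lands in exactly one validation fold, so each data point contributes to the averaged score exactly once.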

A special case, useful for small datasets, is leave-one-out cross-validation (LOOCV). With $n$ data points in the training set, it amounts to $n$-fold cross-validation: each run holds out a single data point for evaluation and trains on the remaining $n - 1$.
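As a rough sketch of LOOCV, the loop below holds out one point at a time and trains on the rest. The function names, the 1-nearest-neighbour stand-in model, and the toy 1-D data are all hypothetical choices for illustration only.

```python
def nearest_neighbor(X_train, y_train, x):
    """Hypothetical stand-in model: 1-nearest-neighbour on 1-D inputs."""
    j = min(range(len(X_train)), key=lambda idx: abs(X_train[idx] - x))
    return y_train[j]


def loocv_accuracy(X, y, predict):
    """Leave-one-out CV: each point is the validation set exactly once."""
    hits = 0
    for i in range(len(X)):
        # Train on everything except sample i, then evaluate on sample i alone.
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        hits += predict(X_train, y_train, X[i]) == y[i]
    return hits / len(X)


X = [1.0, 1.2, 2.4, 2.6, 3.3, 3.8]
y = [0, 0, 0, 1, 1, 1]
acc = loocv_accuracy(X, y, nearest_neighbor)
print(f"LOOCV accuracy over {len(X)} single-point folds: {acc:.3f}")
```

Because the model is refit $n$ times, the cost grows linearly with the dataset size, which is why this variant is usually reserved for small datasets.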

A good standard value for k in k-fold cross-validation is 10, as empirical evidence shows. For instance, 
experiments by Ron Kohavi on various real-world datasets suggest that 10-fold cross-validation offers 
the best tradeoff between bias and variance (A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection by Kohavi, Ron, International Joint Conference on Artificial Intelligence (IJCAI), 14 (12): 1137-43, 1995, https://www.ijcai.org/Proceedings/95-2/Papers/016.pdf).

-- Sebastian Raschka

In [None]:
import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA

In [9]:
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data', header=None)
# Breast cancer data, with diagnosis as target variable
print(df.loc[:, 1].value_counts())

y = df.loc[:, 1].values
X = df.loc[:, 2:].values

le = LabelEncoder()
y = le.fit_transform(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=42)

1
B    357
M    212
Name: count, dtype: int64


In [None]:
pipe_lr = make_pipeline(StandardScaler(),
                         PCA(n_components=2),
                         LogisticRegression())

from sklearn.model_selection import StratifiedKFold

kfold = StratifiedKFold(n_splits=10).split(X_train, y_train)

scores = []
# Straightforward - for each of the k folds, use the fold training set and evaluate on the fold test set
for k, (train, test) in enumerate(kfold):
    pipe_lr.fit(X_train[train], y_train[train])
    scores.append(pipe_lr.score(X_train[test], y_train[test]))
    print(f"Fold {k + 1}: class distribution {np.bincount(y_train[train])}.")
    print(f"Accuracy: {scores[-1]:.3f}.")

print(f"Mean accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f}.")

Fold 1: class distribution [240 143].
Accuracy: 0.977.
Fold 2: class distribution [240 143].
Accuracy: 0.977.
Fold 3: class distribution [240 143].
Accuracy: 0.907.
Fold 4: class distribution [240 143].
Accuracy: 0.953.
Fold 5: class distribution [240 143].
Accuracy: 0.977.
Fold 6: class distribution [240 143].
Accuracy: 0.907.
Fold 7: class distribution [240 144].
Accuracy: 0.929.
Fold 8: class distribution [241 143].
Accuracy: 0.929.
Fold 9: class distribution [241 143].
Accuracy: 1.000.
Fold 10: class distribution [241 143].
Accuracy: 0.929.
Mean accuracy: 0.948 +/- 0.031.


In [None]:
# This can all be done within sklearn
from sklearn.model_selection import cross_val_score

scores = cross_val_score(estimator=pipe_lr, X=X_train, y=y_train, cv=10, n_jobs=-1) # n_jobs=-1 parallelizes over all available CPUs
print(f"CV accuracy scores: {scores}")
print(f"Mean accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f}.")

CV accuracy scores: [0.97674419 0.97674419 0.90697674 0.95348837 0.97674419 0.90697674
 0.92857143 0.92857143 1.         0.92857143]
Mean accuracy: 0.948 +/- 0.031.
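The remaining steps of the procedure (compare candidates by mean CV score, then retrain the winner on the whole training set) might look roughly like the sketch below. It is self-contained, so it loads scikit-learn's bundled copy of the Wisconsin diagnostic data via `load_breast_cancer` rather than the CSV above; note that the bundled copy encodes the labels the other way round (0 = malignant, 1 = benign). The candidate values for `n_components` are an arbitrary assumption for illustration, not a recommendation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# One pipeline per hyperparameter setting to compare (values chosen arbitrarily).
candidates = {
    n: make_pipeline(StandardScaler(), PCA(n_components=n), LogisticRegression())
    for n in (2, 5, 10)
}

# Mean 10-fold CV accuracy on the training set only; the test set stays untouched.
cv_means = {
    n: cross_val_score(pipe, X_train, y_train, cv=10).mean()
    for n, pipe in candidates.items()
}

# Retrain the best candidate from scratch on the entire training set,
# then evaluate once on the held-out test set.
best_n = max(cv_means, key=cv_means.get)
best_pipe = candidates[best_n].fit(X_train, y_train)
test_accuracy = best_pipe.score(X_test, y_test)

print(f"Mean CV accuracy per n_components: {cv_means}")
print(f"Selected n_components={best_n}, test accuracy: {test_accuracy:.3f}")
```

Keeping the test set out of the selection loop is the point of the initial train-test split: the final score is then an estimate on data that played no part in choosing the model.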
