# k-fold Cross Validation
Cross-validation works by splitting the dataset into k-parts (e.g. k = 10). Each split of the data is called a fold. The model is trained on k-1 folds with one held back as the test set. This is repeated so that each fold of the dataset has had the chance to be the hold back test set. After running cross-validation, you end up with k different performance scores. These are summarised using the mean and standard deviation.

In [11]:
from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

## Load data

In [2]:
# Pima Indians dataset from https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes
# Loads the csv file as a NumPy array using NumPy function loadtext()
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")

In [4]:
# See a sample of the dataset (first 5 rows)
dataset[0:5,]

array([[6.000e+00, 1.480e+02, 7.200e+01, 3.500e+01, 0.000e+00, 3.360e+01,
        6.270e-01, 5.000e+01, 1.000e+00],
       [1.000e+00, 8.500e+01, 6.600e+01, 2.900e+01, 0.000e+00, 2.660e+01,
        3.510e-01, 3.100e+01, 0.000e+00],
       [8.000e+00, 1.830e+02, 6.400e+01, 0.000e+00, 0.000e+00, 2.330e+01,
        6.720e-01, 3.200e+01, 1.000e+00],
       [1.000e+00, 8.900e+01, 6.600e+01, 2.300e+01, 9.400e+01, 2.810e+01,
        1.670e-01, 2.100e+01, 0.000e+00],
       [0.000e+00, 1.370e+02, 4.000e+01, 3.500e+01, 1.680e+02, 4.310e+01,
        2.288e+00, 3.300e+01, 1.000e+00]])

## Separate into X (features) and y (label)

In [7]:
X = dataset[:,0:8]
y = dataset[:,8]

## Configure the k-fold cross validation

In [6]:
k = 10
seed = 8

kfold = KFold(n_splits=k, random_state=seed)

## Stratified k-fold cross validation
Enforces the same distribution of classes in each fold as in the whole training dataset. Useful for imbalanced datasets (i.e. when there are a lot more instances for one class than another).

In [12]:
kfold_stratified = StratifiedKFold(n_splits=k, random_state=seed)

## Build the model

In [13]:
model_kfold = XGBClassifier()
model_kfold_stratified = XGBClassifier()

results_kfold = cross_val_score(model_kfold, X, y, cv=kfold)
results_kfold_stratified = cross_val_score(model_kfold_stratified, X, y, cv=kfold_stratified)

  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:


## Evaluate model

In [14]:
# Results for kfold
print("Accuracy: %.2f%% (%.2f%%)" % (results_kfold.mean()*100, results_kfold.std()*100))

Accuracy: 76.69% (7.11%)


In [15]:
# Results for kfold stratified
print("Accuracy: %.2f%% (%.2f%%)" % (results_kfold_stratified.mean()*100, results_kfold_stratified.std()*100))

Accuracy: 76.95% (5.88%)


Rule of thumb: use 10-fold cross-validation for regresion problems and stratified 10-fold cross-validation for classification problems.