The k-fold cross-validation procedure is a standard method for estimating the performance of a machine learning algorithm or configuration on a dataset.

A single run of the k-fold cross-validation procedure may yield a noisy estimate of model performance, because different splits of the data can produce very different results.

Variations on Cross-Validation
There are a number of variations on the k-fold cross-validation procedure.

Five commonly used variations are as follows:

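**Train/Test Split**: Taken to one extreme, k may be set to 2 (not 1, which would leave no held-out data) such that the data is split once into two parts to evaluate the model; each part serves once as the training set and once as the test set.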

**LOOCV**: Taken to another extreme, k may be set to the total number of observations in the dataset such that each observation is given a chance to be held out of the training set. This is called leave-one-out cross-validation, or LOOCV for short.

**Stratified**: The splitting of data into folds may be governed by criteria such as ensuring that each fold has the same proportion of observations with a given categorical value, such as the class outcome value. This is called stratified cross-validation.

**Repeated**: This is where the k-fold cross-validation procedure is repeated n times, where importantly, the data sample is shuffled prior to each repetition, which results in a different split of the sample.

**Nested**: This is where k-fold cross-validation is performed within each fold of cross-validation, often to perform hyperparameter tuning during model evaluation. This is called nested cross-validation or double cross-validation.
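
As an illustration of the nested variation, here is a minimal sketch that assumes scikit-learn's `GridSearchCV` as the inner tuning loop and `cross_val_score` as the outer evaluation loop. The dataset size, the `C` grid, and the names `X_demo`, `y_demo`, `inner_cv`, `outer_cv`, and `nested_scores` are illustrative choices, not part of the original notebook.

```python
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Small synthetic dataset so the sketch runs quickly (sizes are arbitrary)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=1
)

# Inner loop: tune C with 3-fold CV on each outer training set
inner_cv = KFold(n_splits=3, shuffle=True, random_state=1)
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # hypothetical grid
    scoring="accuracy",
    cv=inner_cv,
)

# Outer loop: evaluate the whole tuning procedure with 5-fold CV
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)
nested_scores = cross_val_score(
    search, X_demo, y_demo, scoring="accuracy", cv=outer_cv
)

print("nested CV accuracy: %.3f (%.3f)" % (mean(nested_scores), std(nested_scores)))
```

Because the grid search is refit inside every outer fold, the outer test data never influences the choice of `C`, which is the point of nesting.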

In [1]:
from numpy import mean, std

In [2]:
from sklearn.datasets import make_classification

In [3]:
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression

In [4]:
# Create Dataset

In [6]:
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, random_state=1)

In [7]:
print(X.shape, y.shape)

(1000, 20) (1000,)


In [7]:
# Cross Validation

In [8]:
kfold = KFold(n_splits=5, random_state=1, shuffle=True)

In [9]:
kfold

KFold(n_splits=5, random_state=1, shuffle=True)

In [9]:
# Create a Classification Model

In [10]:
clf = LogisticRegression()

In [11]:
scores = cross_val_score(clf, X, y, scoring='accuracy', cv=kfold, n_jobs=-1)

In [12]:
scores

array([0.865, 0.88 , 0.865, 0.87 , 0.85 ])

In [17]:
print("mean of accuracy",mean(scores),"\nstd of accuracy",std(scores)) 

mean of accuracy 0.866 
std of accuracy 0.009695359714832668


In [18]:
print("mean of accuracy",round(mean(scores),2)*100,"\nstd of accuracy",round(std(scores),2)*100) 

mean of accuracy 87.0 
std of accuracy 1.0


In [15]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()

In [16]:
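# 10 folds, matching the ten scores in the output below (shuffle and seed assumed to mirror the earlier KFold)
scores = cross_val_score(rf, X, y, scoring='accuracy', cv=KFold(n_splits=10, shuffle=True, random_state=1), n_jobs=-1)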

In [17]:
scores

array([0.93, 0.92, 0.94, 0.95, 0.91, 0.9 , 0.86, 0.9 , 0.92, 0.9 ])

In [18]:
print("mean of accuracy",round(mean(scores),2)*100,"\nstd of accuracy",round(std(scores),2)*100) 

mean of accuracy 91.0 
std of accuracy 2.0


# Repeated k-Fold Cross-Validation

    The estimate of model performance via k-fold cross-validation can be noisy.
    This means that each time the procedure is run, the dataset can be split into k folds differently; in turn, the distribution of performance scores changes, giving a different mean estimate of model performance.
    A noisy estimate of model performance can be frustrating as it may not be clear which result should be used to compare and select a final model to address the problem.
    One solution to reduce the noise in the estimated model performance is to increase the k-value. This will reduce the bias in the model’s estimated performance, although it will increase the variance, i.e. tie the result more closely to the specific dataset used in the evaluation.
    An alternative is to repeat the k-fold cross-validation procedure multiple times, shuffling the data before each repetition, and report the mean performance across all folds and all repeats.
    This repeated k-fold cross-validation has the benefit of improving the estimate of the mean model performance at the cost of fitting and evaluating many more models.
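
To make the noise concrete, the sketch below runs plain 5-fold cross-validation five times on the same synthetic dataset, changing only the shuffle seed, and prints the mean accuracy of each run. The seed range and the names `X_demo`, `y_demo`, `model`, and `seed_means` are illustrative assumptions; `max_iter=1000` is set only to avoid convergence warnings.

```python
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Same dataset recipe as the earlier cells
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=15, random_state=1
)
model = LogisticRegression(max_iter=1000)

seed_means = []
for seed in range(1, 6):
    # Only the shuffle seed changes between runs
    cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    fold_scores = cross_val_score(model, X_demo, y_demo, scoring="accuracy", cv=cv)
    seed_means.append(mean(fold_scores))
    print("seed=%d mean accuracy=%.3f" % (seed, seed_means[-1]))

# Gap between the most and least optimistic single-run estimate
print("spread across seeds: %.3f" % (max(seed_means) - min(seed_means)))
```

Any spread printed here comes purely from how the rows were assigned to folds, since the model and the data are identical in every run.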

In [19]:
from sklearn.model_selection import RepeatedKFold

# Same synthetic dataset as before
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, random_state=1)

# 10-fold cross-validation repeated 3 times: 30 scores in total
Rkfold = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(clf, X, y, scoring='accuracy', cv=Rkfold, n_jobs=-1)
print("mean of accuracy",round(mean(scores),2)*100,"\nstd of accuracy",round(std(scores),2)*100)

mean of accuracy 87.0 
std of accuracy 4.0


# Stratified k-Fold Cross-Validation

k-Fold Cross-Validation for Imbalanced Classification

In [20]:
# Create Imbalanced Dataset
X, y = make_classification(n_samples=1000, n_features=2, n_redundant=0, n_clusters_per_class=1, weights=[0.9], flip_y=0, random_state=3)

In [21]:
from numpy import unique
classes = unique(y)
total = len(y)

In [22]:
for c in classes:
    n_examples = len(y[y==c])
    percent = n_examples/total *100
    print('>Class=%d : %d/%d (%.1f%%)' % (c, n_examples, total, percent))

>Class=0 : 901/1000 (90.1%)
>Class=1 : 99/1000 (9.9%)


In [23]:
# Use the KFold class to randomly split the dataset into 5 folds and check the class composition of each train and test set

In [24]:
kfold = KFold(n_splits=5, shuffle=True, random_state=1)
for train,test in kfold.split(X):
    # Select ROWS
    train_X, test_X = X[train], X[test]
    train_y, test_y = y[train], y[test]
    
    # Summarize train and test
    train_0 , train_1 = len(train_y[train_y==0]), len(train_y[train_y==1])
    test_0, test_1 = len(test_y[test_y==0]), len(test_y[test_y==1]) 
    
    print('>Train: 0=%d, 1=%d, Test:0=%d, 1=%d' % (train_0, train_1, test_0, test_1))    

>Train: 0=716, 1=84, Test:0=185, 1=15
>Train: 0=728, 1=72, Test:0=173, 1=27
>Train: 0=722, 1=78, Test:0=179, 1=21
>Train: 0=718, 1=82, Test:0=183, 1=17
>Train: 0=720, 1=80, Test:0=181, 1=19


In [19]:
from sklearn.model_selection import StratifiedKFold

In [20]:
skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

In [27]:
for train,test in skfold.split(X,y):
    # Select ROWS
    train_X, test_X = X[train], X[test]
    train_y, test_y = y[train], y[test]
    
    # Summarize train and test
    train_0 , train_1 = len(train_y[train_y==0]), len(train_y[train_y==1])
    test_0, test_1 = len(test_y[test_y==0]), len(test_y[test_y==1]) 
    
    print('>Train: 0=%d, 1=%d, Test:0=%d, 1=%d' % (train_0, train_1, test_0, test_1))    

>Train: 0=720, 1=80, Test:0=181, 1=19
>Train: 0=721, 1=79, Test:0=180, 1=20
>Train: 0=721, 1=79, Test:0=180, 1=20
>Train: 0=721, 1=79, Test:0=180, 1=20
>Train: 0=721, 1=79, Test:0=180, 1=20

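The stratified splitter can be handed to `cross_val_score` in the same way as the plain `KFold` object earlier. The following is a minimal sketch under that assumption; it rebuilds the imbalanced dataset so that it runs on its own, and the names `X_imb`, `y_imb`, `strat_cv`, and `strat_scores` are illustrative.

```python
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Same imbalanced dataset recipe as above (roughly 90% class 0, 10% class 1)
X_imb, y_imb = make_classification(
    n_samples=1000, n_features=2, n_redundant=0, n_clusters_per_class=1,
    weights=[0.9], flip_y=0, random_state=3
)

# Each test fold keeps approximately the same class ratio as the full dataset
strat_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
strat_scores = cross_val_score(
    LogisticRegression(), X_imb, y_imb, scoring="accuracy", cv=strat_cv
)

print("stratified 5-fold accuracy: %.3f (%.3f)" % (mean(strat_scores), std(strat_scores)))
```

Note that when `cv` is given as an integer and the estimator is a classifier, `cross_val_score` already uses stratified folds; passing the splitter explicitly gives control over shuffling and the seed.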

In [28]:
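# StratifiedShuffleSplit draws independent random stratified train/test splits,
# so, unlike StratifiedKFold, its test sets may overlap between splits
from sklearn.model_selection import StratifiedShuffleSplit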

In [29]:
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
sss.get_n_splits(X, y)

5

In [30]:
for train,test in sss.split(X,y):
    # Select ROWS
    train_X, test_X = X[train], X[test]
    train_y, test_y = y[train], y[test]
    
    # Summarize train and test
    train_0 , train_1 = len(train_y[train_y==0]), len(train_y[train_y==1])
    test_0, test_1 = len(test_y[test_y==0]), len(test_y[test_y==1]) 
    
    print('>Train: 0=%d, 1=%d, Test:0=%d, 1=%d' % (train_0, train_1, test_0, test_1))    

>Train: 0=631, 1=69, Test:0=270, 1=30
>Train: 0=631, 1=69, Test:0=270, 1=30
>Train: 0=631, 1=69, Test:0=270, 1=30
>Train: 0=631, 1=69, Test:0=270, 1=30
>Train: 0=631, 1=69, Test:0=270, 1=30

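The two stratified splitters produce similar-looking class counts, but they differ in how often each row is tested. The sketch below counts, for every row, how many test sets it lands in under each splitter. The helper `count_test_appearances` and the names `X_imb`, `y_imb`, `skf_counts`, and `sss_counts` are hypothetical and introduced only for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit

# Same imbalanced dataset recipe as above
X_imb, y_imb = make_classification(
    n_samples=1000, n_features=2, n_redundant=0, n_clusters_per_class=1,
    weights=[0.9], flip_y=0, random_state=3
)

def count_test_appearances(splitter, X, y):
    """Return, for each row, the number of test sets it appears in."""
    counts = np.zeros(len(y), dtype=int)
    for _, test_idx in splitter.split(X, y):
        counts[test_idx] += 1  # indices are unique within a single split
    return counts

skf_counts = count_test_appearances(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=1), X_imb, y_imb
)
sss_counts = count_test_appearances(
    StratifiedShuffleSplit(n_splits=5, test_size=0.3, random_state=0), X_imb, y_imb
)

print("StratifiedKFold:        min=%d max=%d" % (skf_counts.min(), skf_counts.max()))
print("StratifiedShuffleSplit: min=%d max=%d" % (sss_counts.min(), sss_counts.max()))
```

With `StratifiedKFold` every row is tested exactly once. With `StratifiedShuffleSplit`, five test sets of 300 rows are drawn from 1000 rows, so at least some rows must be tested more than once.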