

## Cross Validation in Practice: main types and examples
One of the topics from weeks 1 and 2 of #mlzoomcamp that caught my attention was cross-validation. After some searching on Google, I found that there are many types of cross-validation used in machine learning. Since the list of available types is extensive, I focus only on the types mentioned most often in my search.

The most frequently mentioned methods are:
- [Holdout cross-validation](#method-1)
- [Leave p out cross-validation (LpOCV)](#method-2)
- [Leave one out cross-validation (LoOCV)](#method-3)
- [K-fold cross-validation](#method-4)
- [Stratified k-fold cross-validation](#method-5)


### Import modules

In [1]:
import os
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split # Holdout cross-validation
from sklearn.model_selection import LeavePOut # Leave p out cross-validation
from sklearn.model_selection import LeaveOneOut # Leave one out cross-validation
from sklearn.model_selection import KFold # K-fold cross-validation
from sklearn.model_selection import StratifiedKFold # Stratified k-fold cross-validation


In [2]:
# Read some data to test 
if not os.path.isfile("data.csv"):
    !wget https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv
        
df = pd.read_csv("data.csv")
X = df.iloc[np.arange(10)]
X

Unnamed: 0,Make,Model,Year,Engine Fuel Type,Engine HP,Engine Cylinders,Transmission Type,Driven_Wheels,Number of Doors,Market Category,Vehicle Size,Vehicle Style,highway MPG,city mpg,Popularity,MSRP
0,BMW,1 Series M,2011,premium unleaded (required),335.0,6.0,MANUAL,rear wheel drive,2.0,"Factory Tuner,Luxury,High-Performance",Compact,Coupe,26,19,3916,46135
1,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Convertible,28,19,3916,40650
2,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,High-Performance",Compact,Coupe,28,20,3916,36350
3,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Coupe,28,18,3916,29450
4,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,Luxury,Compact,Convertible,28,18,3916,34500
5,BMW,1 Series,2012,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Coupe,28,18,3916,31200
6,BMW,1 Series,2012,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Convertible,26,17,3916,44100
7,BMW,1 Series,2012,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,High-Performance",Compact,Coupe,28,20,3916,39300
8,BMW,1 Series,2012,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,Luxury,Compact,Convertible,28,18,3916,36900
9,BMW,1 Series,2013,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,Luxury,Compact,Convertible,27,18,3916,37200


In [3]:
# Extract target column
y = X.MSRP
y

0    46135
1    40650
2    36350
3    29450
4    34500
5    31200
6    44100
7    39300
8    36900
9    37200
Name: MSRP, dtype: int64

In [4]:
# Delete target column from features data
del X['MSRP']
X

Unnamed: 0,Make,Model,Year,Engine Fuel Type,Engine HP,Engine Cylinders,Transmission Type,Driven_Wheels,Number of Doors,Market Category,Vehicle Size,Vehicle Style,highway MPG,city mpg,Popularity
0,BMW,1 Series M,2011,premium unleaded (required),335.0,6.0,MANUAL,rear wheel drive,2.0,"Factory Tuner,Luxury,High-Performance",Compact,Coupe,26,19,3916
1,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Convertible,28,19,3916
2,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,High-Performance",Compact,Coupe,28,20,3916
3,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Coupe,28,18,3916
4,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,Luxury,Compact,Convertible,28,18,3916
5,BMW,1 Series,2012,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Coupe,28,18,3916
6,BMW,1 Series,2012,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Convertible,26,17,3916
7,BMW,1 Series,2012,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,High-Performance",Compact,Coupe,28,20,3916
8,BMW,1 Series,2012,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,Luxury,Compact,Convertible,28,18,3916
9,BMW,1 Series,2013,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,Luxury,Compact,Convertible,27,18,3916


### Holdout cross-validation <a class="anchor" id="method-1"></a>
This is the most basic approach. It consists of splitting the dataset into two subsets: train and validation (test). The split ratio is typically 80:20 or 70:30.

<img src="images/hold_out.jpg" width=400 height=400 />
<a href="https://www.mygreatlearning.com/blog/cross-validation/#sh211">Image Source</a>

This approach is recommended for large datasets. On small datasets, there is a high chance that the held-out data contains important information that the model never sees during training. **Note:** The measured performance of the model is very sensitive to the particular split, so it may vary from one split to another.

In [5]:
# Perform Hold Out method
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
print(f"Train set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")

Train set size: 7
Test set size: 3


In [6]:
X_train

Unnamed: 0,Make,Model,Year,Engine Fuel Type,Engine HP,Engine Cylinders,Transmission Type,Driven_Wheels,Number of Doors,Market Category,Vehicle Size,Vehicle Style,highway MPG,city mpg,Popularity
4,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,Luxury,Compact,Convertible,28,18,3916
0,BMW,1 Series M,2011,premium unleaded (required),335.0,6.0,MANUAL,rear wheel drive,2.0,"Factory Tuner,Luxury,High-Performance",Compact,Coupe,26,19,3916
3,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Coupe,28,18,3916
1,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Convertible,28,19,3916
7,BMW,1 Series,2012,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,High-Performance",Compact,Coupe,28,20,3916
8,BMW,1 Series,2012,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,Luxury,Compact,Convertible,28,18,3916
5,BMW,1 Series,2012,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Coupe,28,18,3916


In [7]:
X_test

Unnamed: 0,Make,Model,Year,Engine Fuel Type,Engine HP,Engine Cylinders,Transmission Type,Driven_Wheels,Number of Doors,Market Category,Vehicle Size,Vehicle Style,highway MPG,city mpg,Popularity
2,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,High-Performance",Compact,Coupe,28,20,3916
9,BMW,1 Series,2013,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,Luxury,Compact,Convertible,27,18,3916
6,BMW,1 Series,2012,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Convertible,26,17,3916
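
The note above about sensitivity to the split can be illustrated with a small sketch. It does not use the car data: `X_demo` and `y_demo` are synthetic values made up for illustration, and `LinearRegression` with mean absolute error is just one possible model and metric. The only thing that changes between iterations is `random_state`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): y = 3x + noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(30, 1))
y_demo = 3 * X_demo[:, 0] + rng.normal(0, 2, size=30)

errors = []
for seed in range(5):
    # Same data and same ratio, only the random split differs
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.3, random_state=seed
    )
    model = LinearRegression().fit(X_tr, y_tr)
    errors.append(mean_absolute_error(y_te, model.predict(X_te)))

print("MAE per split:", np.round(errors, 2))
print("Spread (max - min):", round(max(errors) - min(errors), 2))
```

If the errors differ noticeably between seeds, a single hold-out estimate should be read with caution, which is the motivation for the methods that follow.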


### Leave p out cross-validation (LpOCV) <a class="anchor" id="method-2"></a>
It is an exhaustive cross-validation method: every possible way of dividing the n observations into a validation set of p observations and a training set of (n - p) observations is tested.

To get the final accuracy, we average the accuracies of all iterations.
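
Because every combination is tested, the number of iterations equals the binomial coefficient C(n, p). The following sketch checks this for n = 10 and p = 3, the values used in this notebook; the placeholder array passed to `get_n_splits` is only there to provide the number of samples.

```python
from math import comb

import numpy as np
from sklearn.model_selection import LeavePOut

n, p = 10, 3

# Number of ways to choose p validation observations out of n
n_splits_formula = comb(n, p)

# scikit-learn reports the same count (placeholder data with n rows)
n_splits_sklearn = LeavePOut(p).get_n_splits(np.zeros((n, 1)))

print(n_splits_formula, n_splits_sklearn)  # 120 120
```

The count grows very quickly with n, which is why this method is usually practical only for small datasets.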

In [8]:
# Configure LeavePOut instance
lpo = LeavePOut(p=3)  # leave 3 observations out in each split
lpo.get_n_splits(X)  # number of combinations: C(10, 3)

120

In [9]:
for train_index, test_index in lpo.split(X.values):
    print("Train:", train_index, "Test:", test_index)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

Train: [3 4 5 6 7 8 9] Test: [0 1 2]
Train: [2 4 5 6 7 8 9] Test: [0 1 3]
Train: [2 3 5 6 7 8 9] Test: [0 1 4]
Train: [2 3 4 6 7 8 9] Test: [0 1 5]
Train: [2 3 4 5 7 8 9] Test: [0 1 6]
Train: [2 3 4 5 6 8 9] Test: [0 1 7]
Train: [2 3 4 5 6 7 9] Test: [0 1 8]
Train: [2 3 4 5 6 7 8] Test: [0 1 9]
Train: [1 4 5 6 7 8 9] Test: [0 2 3]
Train: [1 3 5 6 7 8 9] Test: [0 2 4]
Train: [1 3 4 6 7 8 9] Test: [0 2 5]
Train: [1 3 4 5 7 8 9] Test: [0 2 6]
Train: [1 3 4 5 6 8 9] Test: [0 2 7]
Train: [1 3 4 5 6 7 9] Test: [0 2 8]
Train: [1 3 4 5 6 7 8] Test: [0 2 9]
Train: [1 2 5 6 7 8 9] Test: [0 3 4]
Train: [1 2 4 6 7 8 9] Test: [0 3 5]
Train: [1 2 4 5 7 8 9] Test: [0 3 6]
Train: [1 2 4 5 6 8 9] Test: [0 3 7]
Train: [1 2 4 5 6 7 9] Test: [0 3 8]
Train: [1 2 4 5 6 7 8] Test: [0 3 9]
Train: [1 2 3 6 7 8 9] Test: [0 4 5]
Train: [1 2 3 5 7 8 9] Test: [0 4 6]
Train: [1 2 3 5 6 8 9] Test: [0 4 7]
Train: [1 2 3 5 6 7 9] Test: [0 4 8]
Train: [1 2 3 5 6 7 8] Test: [0 4 9]
Train: [1 2 3 4 7 8 9] Test: [0 5 6]
...

In [10]:
# Show the last split
X_train

Unnamed: 0,Make,Model,Year,Engine Fuel Type,Engine HP,Engine Cylinders,Transmission Type,Driven_Wheels,Number of Doors,Market Category,Vehicle Size,Vehicle Style,highway MPG,city mpg,Popularity
0,BMW,1 Series M,2011,premium unleaded (required),335.0,6.0,MANUAL,rear wheel drive,2.0,"Factory Tuner,Luxury,High-Performance",Compact,Coupe,26,19,3916
1,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Convertible,28,19,3916
2,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,High-Performance",Compact,Coupe,28,20,3916
3,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Coupe,28,18,3916
4,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,Luxury,Compact,Convertible,28,18,3916
5,BMW,1 Series,2012,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Coupe,28,18,3916
6,BMW,1 Series,2012,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Convertible,26,17,3916


In [11]:
X_test

Unnamed: 0,Make,Model,Year,Engine Fuel Type,Engine HP,Engine Cylinders,Transmission Type,Driven_Wheels,Number of Doors,Market Category,Vehicle Size,Vehicle Style,highway MPG,city mpg,Popularity
7,BMW,1 Series,2012,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,High-Performance",Compact,Coupe,28,20,3916
8,BMW,1 Series,2012,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,Luxury,Compact,Convertible,28,18,3916
9,BMW,1 Series,2013,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,Luxury,Compact,Convertible,27,18,3916


### Leave one out cross-validation (LoOCV) <a class="anchor" id="method-3"></a>

It is an exhaustive cross-validation method and a special case of leave p out cross-validation (LpOCV) with p = 1, so each observation is used exactly once as the validation set. To get the final accuracy, we average the accuracies of all iterations.

<img src="images/loocv.gif" width=400 height=400 />
<a href="https://en.wikipedia.org/wiki/File:LOOCV.gif">Image Source</a>
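
As a minimal sketch of the averaging step, `cross_val_score` can run every split and return one score per iteration. The data (`X_demo`, `y_demo`) are synthetic and the choice of `LinearRegression` is an assumption for illustration. Mean absolute error is used instead of accuracy because this is a regression example, and a score such as R² is not well defined on a single held-out observation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Synthetic data (an assumption for illustration): y = 2x + noise
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(12, 1))
y_demo = 2 * X_demo[:, 0] + rng.normal(0, 1, size=12)

# One score per split; scikit-learn returns the negated error
scores = cross_val_score(
    LinearRegression(),
    X_demo,
    y_demo,
    cv=LeaveOneOut(),
    scoring="neg_mean_absolute_error",
)
mae_per_split = -scores

print("Number of splits:", len(mae_per_split))
print("Mean MAE over all splits:", mae_per_split.mean())
```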


In [12]:
loo = LeaveOneOut()
for train_index, test_index in loo.split(X):
    print("Train:", train_index, "Test:", test_index)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

Train: [1 2 3 4 5 6 7 8 9] Test: [0]
Train: [0 2 3 4 5 6 7 8 9] Test: [1]
Train: [0 1 3 4 5 6 7 8 9] Test: [2]
Train: [0 1 2 4 5 6 7 8 9] Test: [3]
Train: [0 1 2 3 5 6 7 8 9] Test: [4]
Train: [0 1 2 3 4 6 7 8 9] Test: [5]
Train: [0 1 2 3 4 5 7 8 9] Test: [6]
Train: [0 1 2 3 4 5 6 8 9] Test: [7]
Train: [0 1 2 3 4 5 6 7 9] Test: [8]
Train: [0 1 2 3 4 5 6 7 8] Test: [9]


In [13]:
# Show the last split
X_train

Unnamed: 0,Make,Model,Year,Engine Fuel Type,Engine HP,Engine Cylinders,Transmission Type,Driven_Wheels,Number of Doors,Market Category,Vehicle Size,Vehicle Style,highway MPG,city mpg,Popularity
0,BMW,1 Series M,2011,premium unleaded (required),335.0,6.0,MANUAL,rear wheel drive,2.0,"Factory Tuner,Luxury,High-Performance",Compact,Coupe,26,19,3916
1,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Convertible,28,19,3916
2,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,High-Performance",Compact,Coupe,28,20,3916
3,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Coupe,28,18,3916
4,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,Luxury,Compact,Convertible,28,18,3916
5,BMW,1 Series,2012,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Coupe,28,18,3916
6,BMW,1 Series,2012,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Convertible,26,17,3916
7,BMW,1 Series,2012,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,High-Performance",Compact,Coupe,28,20,3916
8,BMW,1 Series,2012,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,Luxury,Compact,Convertible,28,18,3916


In [14]:
X_test

Unnamed: 0,Make,Model,Year,Engine Fuel Type,Engine HP,Engine Cylinders,Transmission Type,Driven_Wheels,Number of Doors,Market Category,Vehicle Size,Vehicle Style,highway MPG,city mpg,Popularity
9,BMW,1 Series,2013,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,Luxury,Compact,Convertible,27,18,3916


### K-fold cross-validation <a class="anchor" id="method-4"></a>

In this method, the dataset is divided into K sections (folds) of approximately equal size. In each iteration, one of these folds is used as the validation set and the remaining K-1 folds are used as the training set. This process is repeated until every fold has been used as the validation set exactly once.

<img src="images/KfoldCV.gif" width=400 height=400 />
<a href="https://en.wikipedia.org/wiki/File:KfoldCV.gif">Image Source</a>
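
The procedure described above can be sketched by hand with NumPy. `manual_kfold_indices` is a hypothetical helper written for this illustration, not part of scikit-learn; the last lines compare its output with `KFold` without shuffling.

```python
import numpy as np
from sklearn.model_selection import KFold


def manual_kfold_indices(n_samples, k):
    """Yield (train, test) index arrays for k consecutive folds."""
    folds = np.array_split(np.arange(n_samples), k)
    for i in range(k):
        test_idx = folds[i]
        # Every fold except the i-th one goes into the training set
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx


splits = list(manual_kfold_indices(10, 3))
for train_idx, test_idx in splits:
    print("Train:", train_idx, "Test:", test_idx)

# Sanity check against scikit-learn's KFold (no shuffling)
expected = list(KFold(n_splits=3).split(np.zeros((10, 1))))
matches = all(
    np.array_equal(tr, etr) and np.array_equal(te, ete)
    for (tr, te), (etr, ete) in zip(splits, expected)
)
print("Same splits as KFold:", matches)
```

When the number of samples is not divisible by K, the first folds receive one extra observation, so the folds are only approximately equal in size.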

**Note**: In some experiments, the training data may be unbalanced, and a model trained on it may become biased. K-fold cross-validation does not take the class distribution into account when splitting. One solution to this problem is to use stratified k-fold cross-validation.

In [15]:
# Create some data for a classification problem with unbalanced data
X_unb, y_unb = np.ones((16, 1)), np.hstack(([0] * 12, [1] * 4))
X_unb

array([[1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.]])

In [16]:
y_unb

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

In [17]:
kf = KFold(n_splits=4)
for train_index, test_index in kf.split(X_unb, y_unb):
    print("Train:", train_index, "Test:", test_index)

Train: [ 4  5  6  7  8  9 10 11 12 13 14 15] Test: [0 1 2 3]
Train: [ 0  1  2  3  8  9 10 11 12 13 14 15] Test: [4 5 6 7]
Train: [ 0  1  2  3  4  5  6  7 12 13 14 15] Test: [ 8  9 10 11]
Train: [ 0  1  2  3  4  5  6  7  8  9 10 11] Test: [12 13 14 15]


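However, as we can see below, a training or test set can end up containing a single class. This happens here because the labels in `y_unb` are sorted and `KFold` does not shuffle the data by default.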

In [18]:
for train_index, test_index in kf.split(X_unb, y_unb):
    print('Classes Distribution ([0, 1]):  Train -  {}   |   Test -  {}'.format(
        np.bincount(y_unb[train_index]), np.bincount(y_unb[test_index])))

Classes Distribution ([0, 1]):  Train -  [8 4]   |   Test -  [4]
Classes Distribution ([0, 1]):  Train -  [8 4]   |   Test -  [4]
Classes Distribution ([0, 1]):  Train -  [8 4]   |   Test -  [4]
Classes Distribution ([0, 1]):  Train -  [12]   |   Test -  [0 4]
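
One partial workaround, shown here only as a sketch, is to pass `shuffle=True` to `KFold`. Shuffling mixes the classes across folds, but it does not guarantee that each fold keeps the 12:4 proportion. The arrays below recreate the unbalanced data under new names (`X_demo`, `y_demo`), and the value of `random_state` is arbitrary.

```python
import numpy as np
from sklearn.model_selection import KFold

# Same unbalanced layout as above: 12 samples of class 0, 4 of class 1
X_demo = np.ones((16, 1))
y_demo = np.hstack(([0] * 12, [1] * 4))

kf_shuffled = KFold(n_splits=4, shuffle=True, random_state=1)

test_counts = []
for train_idx, test_idx in kf_shuffled.split(X_demo):
    # minlength=2 keeps a count for class 1 even when it is absent
    counts = np.bincount(y_demo[test_idx], minlength=2)
    test_counts.append(counts)
    print("Test class counts [0, 1]:", counts)
```

The per-fold counts depend on the random seed, so some folds may still contain few or no samples of the minority class.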


### Stratified k-fold cross-validation <a class="anchor" id="method-5"></a>

This cross-validation approach is similar to k-fold cross-validation in that the data is divided into k folds. However, the samples are assigned to folds so that the class proportions in each fold are approximately the same as in the entire dataset. This makes the approach suitable for unbalanced datasets.

In [19]:
skf = StratifiedKFold(n_splits=4)
for train_index, test_index in skf.split(X_unb, y_unb):
    print("Train:", train_index, "Test:", test_index)

Train: [ 3  4  5  6  7  8  9 10 11 13 14 15] Test: [ 0  1  2 12]
Train: [ 0  1  2  6  7  8  9 10 11 12 14 15] Test: [ 3  4  5 13]
Train: [ 0  1  2  3  4  5  9 10 11 12 13 15] Test: [ 6  7  8 14]
Train: [ 0  1  2  3  4  5  6  7  8 12 13 14] Test: [ 9 10 11 15]


In [20]:
for train_index, test_index in skf.split(X_unb, y_unb):
    print('Classes Distribution ([0, 1]):  Train -  {}   |   Test -  {}'.format(
        np.bincount(y_unb[train_index]), np.bincount(y_unb[test_index])))

Classes Distribution ([0, 1]):  Train -  [9 3]   |   Test -  [3 1]
Classes Distribution ([0, 1]):  Train -  [9 3]   |   Test -  [3 1]
Classes Distribution ([0, 1]):  Train -  [9 3]   |   Test -  [3 1]
Classes Distribution ([0, 1]):  Train -  [9 3]   |   Test -  [3 1]


**Note**: As can be seen, unlike k-fold, stratified k-fold cross-validation takes the class distribution into account when splitting, so it can be used on unbalanced datasets.
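
To show how the stratified splitter might be used to evaluate a classifier, here is a minimal sketch. The dataset comes from `make_classification` with an assumed 75:25 class ratio, and `LogisticRegression` with balanced accuracy is just one possible model and metric; none of these choices come from the course material.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic unbalanced data (an assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, weights=[0.75, 0.25], random_state=0
)

skf_demo = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

# One balanced-accuracy score per fold
scores = cross_val_score(
    LogisticRegression(max_iter=1000),
    X_demo,
    y_demo,
    cv=skf_demo,
    scoring="balanced_accuracy",
)

print("Score per fold:", np.round(scores, 3))
print("Mean score:", round(scores.mean(), 3))
```

Balanced accuracy is used here because plain accuracy can look high on unbalanced data even when the minority class is predicted poorly.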

## References
<ul>
    <li><a href="https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-and-model-selection">scikit-learn: Cross-validation: evaluating estimator performance</a></li>
    <li><a href="https://www.analyticsvidhya.com/blog/2021/05/4-ways-to-evaluate-your-machine-learning-model-cross-validation-techniques-with-python-code/">4 Ways to Evaluate your Machine Learning Model: Cross-Validation Techniques (with Python code)</a></li>
    <li><a href="https://towardsdatascience.com/understanding-8-types-of-cross-validation-80c935a4976d">Understanding 8 types of Cross-Validation</a></li>
    <li><a href="https://www.analyticssteps.com/blogs/7-types-cross-validation">7 Types of Cross-validation</a></li>
    <li><a href="https://www.kaggle.com/general/204878">Cross Validation and its types!</a></li>
    <li><a href="https://medium.com/nerd-for-tech/cross-validation-and-types-a7498a68f413">Cross Validation and types</a></li>
    <li><a href="https://www.mygreatlearning.com/blog/cross-validation/">What is Cross Validation in Machine learning? Types of Cross Validation</a></li>

</ul>
