# Introduction to Cross-Validation

-----

In previous notebooks, we split a dataset into a training set and a testing set for supervised learning (classification and regression). The training set is used to train the model, while the testing set is used to provide an unbiased evaluation of the model fit on the training set. This approach seems reasonable, but it has two problems:
1. The training set is only a subset of the available data. A model trained on it may fail to adequately capture the characteristics of the whole dataset.
2. Similarly, the evaluation metrics are computed from the testing set only. A different split may produce different metric values, which makes the metrics less reliable.

To address these problems, we introduce [cross-validation][wcv], a method that makes fuller use of the available data by repeatedly training and testing a model on different splits. This yields a set of scores, and therefore a more reliable estimate of model performance than a single split.

We will use the adult income data to demonstrate cross-validation.

-----

[wcv]: https://en.wikipedia.org/wiki/Cross-validation_(statistics)

## Table of Contents

[Data](#Data)

[Train-Test Split Approach](#Train-Test-Split-Approach)

[Cross Validation](#Cross-Validation)

- [KFold](#KFold)
- [Stratified KFold](#Stratified-KFold)
- [Leave One Out](#Leave-One-Out)
- [Leave P Out](#Leave-P-Out)
- [Shuffle Split](#Shuffle-Split)
- [Group KFold](#Group-KFold)

[Evaluate Model with Cross-Validation](#Evaluate-Model-with-Cross-Validation)
- [Custom Scorer](#Custom-Scorer)

-----

Before proceeding with the rest of this notebook, we first have our standard notebook setup code.

-----

In [1]:
# Set up Notebook
%matplotlib inline

# Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Suppress all warnings to keep the notebook output readable
import warnings
warnings.filterwarnings("ignore")

# Set default seaborn plotting style
sns.set_style('white')

-----

[[Back to TOC]](#Table-of-Contents)

## Data

We will use the [adult income data][uciad] throughout this notebook. With the help of the feature selection techniques introduced in the previous lesson, we choose the following training features: age, fnlwgt, education-level, marital-status, occupation, relationship, capital-gain, and hours-per-week.

The following Code cell prepares the data:

1. Load the data.
2. Create the label from the Salary column.
3. Encode the categorical features that have string values.
4. Combine the numerical features and the encoded categorical features.
5. Choose the training features.

-----
[uciad]: https://archive.ics.uci.edu/ml/datasets/Adult

In [2]:
from sklearn.preprocessing import LabelEncoder

# Read CSV data
adult_data = pd.read_csv('data/adult_income.csv')

# Create label column, one for >50K, zero otherwise.
adult_data['Label'] = adult_data['Salary'].map(lambda x : 1 if '>50K' in x else 0)

# Generate categorical features(with string values)
categorical_features = adult_data[['Workclass', 'Education', 'MaritalStatus', 
               'Occupation', 'Relationship', 'Race', 'Sex', 'NativeCountry']]

#encode categorical features
categorical_features = categorical_features.apply(LabelEncoder().fit_transform)

# Extract numerical features
numerical_features = adult_data[['Age', 'FNLWGT', 'EducationLevel', 'CapitalGain', 'CapitalLoss', 'HoursPerWeek']]

all_features = pd.concat([numerical_features, categorical_features], axis=1)

features = all_features[['Age', 'FNLWGT', 'EducationLevel', 'MaritalStatus', 'Occupation', 
                         'Relationship', 'CapitalGain', 'HoursPerWeek']]

label = adult_data['Label']

#display sample data
pd.concat([features, label], axis=1).sample(5, random_state=2)

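|      | Age | FNLWGT | EducationLevel | MaritalStatus | Occupation | Relationship | CapitalGain | HoursPerWeek | Label |
|------|-----|--------|----------------|---------------|------------|--------------|-------------|--------------|-------|
| 3846 | 22  | 351952 | 10             | 4             | 10         | 4            | 0           | 20           | 0     |
| 848  | 55  | 368797 | 14             | 2             | 4          | 0            | 0           | 60           | 1     |
| 1658 | 34  | 113198 | 12             | 2             | 1          | 0            | 0           | 28           | 0     |
| 3415 | 53  | 274276 | 9              | 0             | 1          | 1            | 0           | 40           | 0     |
| 3678 | 31  | 59969  | 9              | 2             | 1          | 2            | 0           | 35           | 0     |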


-----

[[Back to TOC]](#Table-of-Contents)

## Train-Test Split Approach

First, we split the dataset into training and testing sets as we did before. We split the dataset twice with different random states, train a decision tree classifier on each split, and compare the model accuracy. We choose a decision tree classifier because it simplifies data preparation: we don't need to create dummy features for categorical features, and we don't need to normalize continuous features.

In the following Code cells, we first create a decision tree classifier, then split the data twice with different random states, train the classifier on each split, and print the accuracy score.

From the output, we can see that the same decision tree classifier, trained on two splits of the same dataset, gives noticeably different accuracy scores (81.0% versus 77.7%). The only difference between the two runs is the random state used to split the data.


In [3]:
from sklearn.tree import DecisionTreeClassifier

adult_model = DecisionTreeClassifier(random_state=23)

In [4]:
from sklearn.model_selection import train_test_split
from sklearn import metrics

d_train, d_test, l_train, l_test = train_test_split(features, label, test_size=0.4, random_state=0)

adult_model = adult_model.fit(d_train, l_train)

predicted = adult_model.predict(d_test)
score = 100.0 * metrics.accuracy_score(l_test, predicted)
print(f'Decision Tree Classification [Adult Data] Score = {score:4.1f}%\n')

Decision Tree Classification [Adult Data] Score = 81.0%



In [5]:
d_train, d_test, l_train, l_test = train_test_split(features, label, test_size=0.4, random_state=2)

adult_model = adult_model.fit(d_train, l_train)

predicted = adult_model.predict(d_test)
score = 100.0 * metrics.accuracy_score(l_test, predicted)
print(f'Decision Tree Classification [Adult Data] Score = {score:4.1f}%\n')

Decision Tree Classification [Adult Data] Score = 77.7%
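
To see how much a single train-test split can move the score, the comparison above can be repeated over several random states. The following is a minimal sketch rather than part of the original analysis: it assumes a synthetic dataset from `make_classification` as a stand-in for the adult income data (so the numbers will differ from those above), and the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the adult income data)
X, y = make_classification(n_samples=1000, n_features=8, random_state=23)

split_scores = []
for seed in range(5):
    # Same data and model settings; only the split changes
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.4, random_state=seed)
    model = DecisionTreeClassifier(random_state=23).fit(X_train, y_train)
    split_scores.append(model.score(X_test, y_test))

split_scores = np.array(split_scores)
spread = split_scores.max() - split_scores.min()
print(f'Scores: {np.round(100 * split_scores, 1)}')
print(f'Spread (max - min): {100 * spread:4.1f} points')
```

The spread between the best and worst split gives a rough sense of how much a single reported accuracy depends on the luck of the split.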



-----

[[Back to TOC]](#Table-of-Contents)

## Cross Validation

A preferred way to address the problems caused by a single train-test split is to divide the data into multiple segments, or **folds**, hold one fold out for model validation, and use the other folds for training. We repeat the training and validation by iterating through the folds, so that every observation is used for both training and validation. By statistically combining the scores from these iterations, we obtain more reliable evaluation metrics. This approach is known as [**cross-validation**][wcv].

There are multiple different cross-validation methods. The scikit-learn library provides the following [cross-validation iterators][skcvi]:
- `KFold` splits the data into $k$ _folds_, trains on $k - 1$ folds, and validates on the remaining fold.
- `StratifiedKFold` is similar to `KFold`, but preserves the relative ratio of labeled classes within each fold.
- `GroupKFold` is similar to `KFold`, but keeps all observations from the same group in the same fold, so that a group never appears in both the training and the testing data.
- `LeaveOneOut` iteratively leaves one observation out to validate the model trained on the remaining data.
- `LeavePOut` iteratively leaves $p$ observations out to validate the model trained on the remaining data.
- `ShuffleSplit` generates a user-defined number of train/validate data sets by randomly shuffling the data before each split.

These different cross-validation techniques are demonstrated in the following set of Code cells. First, we create a ten-element array to demonstrate how these techniques generate training and validation data sets. Note that in this case we generate data with only one feature (a numerical value from zero through nine, inclusive), but the results extend to multi-feature data sets. In addition, these cross-validation techniques are implemented as iterators, so they can be used in a _for_ loop to create the training and testing data sets, which can then be processed inside the body of the loop.

-----
[wcv]: https://en.wikipedia.org/wiki/Cross-validation_(statistics)
[skcvi]: http://scikit-learn.org/stable/modules/cross_validation.html

In [6]:
# We create a ten element data array to demonstrate
# cross-validation iterators
data = np.arange(10)
print(f'{data}')

[0 1 2 3 4 5 6 7 8 9]


-----

### KFold

One of the most popular cross-validation techniques is k-fold cross-validation, which is implemented in the scikit-learn library by the [`KFold`][skkf] iterator. This technique generates $k$ train/test splits of the original data. If there are $n$ data values, each testing sample contains approximately $n/k$ values. This is demonstrated in the following Code cell, where we apply k-fold to a ten-element array. In this case $k=5$, so the original dataset is split into 5 parts of 2 elements each. There are then 5 iterations. In each iteration, one part is the testing set and the other 4 parts form the training set. If you combine the testing sets from all 5 iterations, you get the complete dataset.

-----
[skkf]: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html

In [7]:
from sklearn.model_selection import KFold

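# Create cross-validation iterator. No shuffling is applied, so
# random_state is omitted (recent scikit-learn versions raise an
# error when random_state is set while shuffle=False)
kf = KFold(n_splits=5)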

# Compute and display results
print('Train\t\t\t  Test')
print(30*'-')
for train, test in kf.split(data):
    print(f'{train}        {test}')

Train			  Test
------------------------------
[2 3 4 5 6 7 8 9]        [0 1]
[0 1 4 5 6 7 8 9]        [2 3]
[0 1 2 3 6 7 8 9]        [4 5]
[0 1 2 3 4 5 8 9]        [6 7]
[0 1 2 3 4 5 6 7]        [8 9]
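
The two properties described above (each test fold holds $n/k$ values, and the test folds together cover the whole dataset) can be checked directly. This is a small sketch that reuses the same ten-element array; the names `test_folds`, `fold_sizes`, and `combined` are illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold

data = np.arange(10)
kf = KFold(n_splits=5)

# Collect the test indices produced in each iteration
test_folds = [test for _, test in kf.split(data)]

# Each test fold holds n / k = 10 / 5 = 2 values
fold_sizes = [len(test) for test in test_folds]
print(f'Test fold sizes: {fold_sizes}')

# Together the test folds cover every index exactly once
combined = np.sort(np.concatenate(test_folds))
print(f'Combined test folds: {combined}')
print(f'Covers full dataset: {np.array_equal(combined, np.arange(len(data)))}')
```

Passing `shuffle=True` together with a `random_state` to `KFold` randomizes which values land in each fold while keeping both properties.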


-----

### Stratified KFold

When the data is unbalanced, we need to be careful when dividing it into training and testing data sets so that the class balance is preserved in both. Otherwise, the resulting model may perform poorly in the testing process (or even worse, on new unseen data). In this case, we can employ stratified sampling, which attempts to preserve the class proportions in the training and testing data sets. The scikit-learn library provides several stratified cross-validation techniques, including:
- [`StratifiedKFold`][skskf]
- [`StratifiedShuffleSplit`][sksss].

You may ask: "Isn't stratified k-fold always preferred over k-fold? Even with a balanced dataset, stratified k-fold works just as well as k-fold. Why do we ever need k-fold if we have stratified k-fold?"

The reason is a bit subtle. When we train and evaluate a model, we don't want to introduce any information beyond the dataset itself. Stratified k-fold splits the dataset based on the labels, which assumes that the proportion of each class in the available data reflects that of the population. This assumption is a kind of new information. Since the dataset we have is most likely a sample rather than the full population, we don't know whether its class proportions are representative. Plain k-fold, on the other hand, splits the dataset without looking at the labels, so it introduces no such assumption. As a rule of thumb, choose stratified k-fold for a very unbalanced dataset; otherwise choose k-fold.
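
The difference between the two iterators is easiest to see on a small, deliberately unbalanced example. The sketch below assumes a toy label array (16 negatives followed by 4 positives) that is not part of the adult income data; `positives_per_test_fold` is a hypothetical helper written for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy unbalanced labels (assumption): 16 negatives followed by 4 positives
toy_labels = np.array([0] * 16 + [1] * 4)
toy_data = np.arange(len(toy_labels)).reshape(-1, 1)

def positives_per_test_fold(cv):
    # Count the positive labels in the test portion of each split
    return [int(toy_labels[test].sum())
            for _, test in cv.split(toy_data, toy_labels)]

kfold_counts = positives_per_test_fold(KFold(n_splits=4))
strat_counts = positives_per_test_fold(StratifiedKFold(n_splits=4))

print(f'KFold positives per test fold:           {kfold_counts}')
print(f'StratifiedKFold positives per test fold: {strat_counts}')
```

With these ordered labels, unshuffled `KFold` places all four positives in the last test fold, while `StratifiedKFold` places one positive in every test fold.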

-----
[skskf]: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html
[sksss]: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html

-----

### Leave One Out


The simplest cross-validation technique to understand is the leave-one-out technique, which is implemented in the scikit-learn library by the [`LeaveOneOut`][skloo] iterator. This technique iteratively holds out one data value from the input data set for testing, while the others are used for training. Thus, if we have ten data values, we end up with ten splits, each having nine values for training and one for testing.

The following Code cell demonstrates this cross-validation technique, where we divide the original ten-element array into ten samples, where one value is held out for testing and nine are used for training.

-----

[skloo]: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.LeaveOneOut.html

In [8]:
from sklearn.model_selection import LeaveOneOut

# Create cross-validation iterator
loo = LeaveOneOut()

# Compute and display results
print('Train\t\t\t  Test')
print(30*'-')
for train, test in loo.split(data):
    print(f'{train}        {test}')

Train			  Test
------------------------------
[1 2 3 4 5 6 7 8 9]        [0]
[0 2 3 4 5 6 7 8 9]        [1]
[0 1 3 4 5 6 7 8 9]        [2]
[0 1 2 4 5 6 7 8 9]        [3]
[0 1 2 3 5 6 7 8 9]        [4]
[0 1 2 3 4 6 7 8 9]        [5]
[0 1 2 3 4 5 7 8 9]        [6]
[0 1 2 3 4 5 6 8 9]        [7]
[0 1 2 3 4 5 6 7 9]        [8]
[0 1 2 3 4 5 6 7 8]        [9]
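
Leave-one-out can also be viewed as k-fold cross-validation with $k$ equal to the number of data values. The following sketch checks this on the ten-element array; the names `loo_tests` and `kf_tests` are illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

data = np.arange(10)

# Test indices from leave-one-out and from k-fold with k = n
loo_tests = [test.tolist() for _, test in LeaveOneOut().split(data)]
kf_tests = [test.tolist() for _, test in KFold(n_splits=len(data)).split(data)]

n_loo_splits = LeaveOneOut().get_n_splits(data)
print(f'Number of leave-one-out splits: {n_loo_splits}')
print(f'Same test sets as KFold with k = n: {loo_tests == kf_tests}')
```

Because one model is trained per data value, leave-one-out becomes expensive on large datasets, which is one reason 5 or 10 folds are often used instead.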


-----

### Leave P Out

We can extend the leave-one-out cross-validation technique to leave multiple items out. Note, however, that this can generate a very large number of splits. This technique is implemented in the scikit-learn library in the [`LeavePOut`][sklpo] iterator. Given $n$ data points, selecting $p$ items to leave out is a combinatorial problem:

$$
{n \choose p}= \frac{n!}{p! (n - p)!}
$$

For example, when $n = 10$ and $p = 2$, forty-five different splits can be created. This cross-validation technique is demonstrated in the following Code cell, where, to limit the output, we select splits from only the first half of our data array ($n = 5$). This results in only ten splits, which fit more easily in this notebook.

Notice how the testing data clearly follow the combinatorial process: the first value is selected, and then the second one iterates through the remaining combinations. Next, the first value is incremented, and the process continues until no new, unique splits can be created.

-----
[sklpo]: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.LeavePOut.html

In [9]:
from sklearn.model_selection import LeavePOut

# Create cross-validation iterator
lpo = LeavePOut(p=2)

# Compute and display results
print('Train\t\tTest')
print(20*'-')

# We only use first five values in our data array to limit output
for train, test in lpo.split(data[:5]):
    print(f'{train}        {test}')

Train		Test
--------------------
[2 3 4]        [0 1]
[1 3 4]        [0 2]
[1 2 4]        [0 3]
[1 2 3]        [0 4]
[0 3 4]        [1 2]
[0 2 4]        [1 3]
[0 2 3]        [1 4]
[0 1 4]        [2 3]
[0 1 3]        [2 4]
[0 1 2]        [3 4]
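
The split counts quoted above can be checked against the binomial coefficient. This short sketch uses `math.comb` from the Python standard library (Python 3.8 or later) and the `get_n_splits` method of the iterator.

```python
from math import comb

import numpy as np
from sklearn.model_selection import LeavePOut

data = np.arange(10)
lpo = LeavePOut(p=2)

for n in (10, 5):
    expected = comb(n, 2)                # n! / (p! (n - p)!) with p = 2
    actual = lpo.get_n_splits(data[:n])  # number of splits scikit-learn generates
    print(f'n = {n:2d}, p = 2: formula gives {expected}, LeavePOut gives {actual}')
```

The count grows quickly with $n$ and $p$, so it is worth calling `get_n_splits` before looping over a `LeavePOut` iterator on real data.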


-----

### Shuffle Split

Shuffle split is implemented in the scikit-learn library by the [`ShuffleSplit`][skss] iterator. This form of cross-validation first randomly shuffles the data and then splits it into training and testing samples, repeating the process independently for each split. This provides an extra degree of randomness, since a data value can appear in the testing sets more or fewer times than with the other cross-validation techniques. This is demonstrated in the following Code cell, where we create ten splits of our original numerical sample, with 80% of the data used for training. If the test slots were shared equally, each number would appear in the test data twice, but the output shows that a number can occur anywhere from zero times (the value 2) to three times (for example, the value 5).

-----
[skss]: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ShuffleSplit.html

In [10]:
from sklearn.model_selection import ShuffleSplit

# Create cross-validation iterator
ss = ShuffleSplit(n_splits=10, test_size=0.2, random_state=23)

# Display results
print('\tTrain\t\tTest')
print(30*'-')
for train, test in ss.split(data):
    print(f'{train}      {test}')

	Train		Test
------------------------------
[2 9 4 7 1 0 6 3]      [5 8]
[8 2 1 4 5 9 6 3]      [0 7]
[4 3 6 2 1 9 5 0]      [7 8]
[2 5 3 8 9 1 4 7]      [0 6]
[0 7 4 8 2 1 6 3]      [9 5]
[4 2 5 0 8 7 9 6]      [1 3]
[1 5 9 7 2 6 0 3]      [4 8]
[0 8 4 5 3 2 9 7]      [6 1]
[5 2 8 6 1 3 4 0]      [9 7]
[6 4 3 0 2 1 7 8]      [5 9]
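
Rather than counting by eye, the number of times each value lands in a test set can be tallied in a loop. This is a small sketch using the same iterator settings as above; `test_counts` is an illustrative name.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

data = np.arange(10)
ss = ShuffleSplit(n_splits=10, test_size=0.2, random_state=23)

# Count how many times each index lands in a test set
test_counts = np.zeros(len(data), dtype=int)
for _, test in ss.split(data):
    test_counts[test] += 1

print(f'Value:        {data}')
print(f'Times tested: {test_counts}')
print(f'Total test slots: {test_counts.sum()}')
```

The total is always 20 (ten splits with two test values each), but the way those slots are distributed over the values depends on the random state.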


-----

### Group KFold

In some cases, data are naturally grouped, for example when they were collected in different runs on the same equipment or by different laboratories. Likewise, if the data are not independent, we can look for a natural grouping that enables cross-validation to be used. In these cases, we need a cross-validation technique that respects the group structure, so that no group spans both the training and testing data. These techniques are known as group-wise cross-validators. The scikit-learn library provides group-wise versions of the main cross-validation techniques, including:
- [`GroupKFold`][skgkf]
- [`LeaveOneGroupOut`][sklogo]
- [`LeavePGroupsOut`][sklpgo]
- [`GroupShuffleSplit`][skgss]

-----
[skgkf]: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GroupKFold.html
[sklogo]: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.LeaveOneGroupOut.html
[sklpgo]: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.LeavePGroupsOut.html
[skgss]: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GroupShuffleSplit.html
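
As a minimal sketch of group-wise splitting, the following applies `GroupKFold` to the ten-element array. The group labels are hypothetical (think of them as the laboratory that produced each value) and are an assumption made for this illustration.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

data = np.arange(10)
# Hypothetical group labels, one per data value
groups = np.array(['a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c'])

gkf = GroupKFold(n_splits=3)

overlaps = []
for train, test in gkf.split(data, groups=groups):
    train_groups = set(groups[train].tolist())
    test_groups = set(groups[test].tolist())
    # Groups present on both sides of the split (should be none)
    overlaps.append(train_groups & test_groups)
    print(f'Train groups: {sorted(train_groups)}    Test groups: {sorted(test_groups)}')
```

In every split, the set of groups in the training data and the set of groups in the testing data are disjoint, which is the property that distinguishes the group-wise iterators from their plain counterparts.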

<font color='red' size = '5'> Student Exercise </font>

In the preceding cells, we demonstrated different cross-validation techniques. Now that you have run the previous cells, try making changes to the notebook and see if you can predict the results.

1. Change the original data array to contain only odd integers between 1 and 21. How do the folds in the `KFold` cross-validation technique change?

-----

## Evaluate Model with Cross-Validation

We can now compute a cross-validation score for a given model. The easiest way to accomplish this is to use the [`cross_val_score`][skcvs] function in the scikit-learn library, which applies an estimator with a cross-validation technique to a data set. This function computes and returns an array of scores, one for each fold in the cross-validation. The scores are computed using the default score method of the provided estimator, such as accuracy for a classifier, but different metrics can be supplied via the `scoring` parameter.

In the following code, we apply the `cross_val_score` function to the decision tree classifier created above, using the adult income data. Since the data is unbalanced, we choose `StratifiedKFold` with 10 folds as the cross-validation iterator. The number of folds ($k$) is chosen with the number of instances in the dataset in mind: each train/test group should be large enough to be statistically representative of the broader dataset. 5 and 10 are commonly used values for $k$. Our adult income data has 4000 instances, so we choose 10 folds. For a smaller dataset, such as the Iris or the Tips dataset, we may choose 5 folds.

The `cross_val_score` function returns an array of scores, one for each iteration. We take the mean of the scores and print it as the accuracy score of our model. The accuracy is now 80.4%, which lies between the two scores we got from the train-test split approach. Because it averages over ten different test folds, it is considered the more reliable accuracy score.

-----
[skcvs]: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html

In [11]:
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

# No shuffling is applied, so random_state is omitted (recent
# scikit-learn versions raise an error otherwise)
skf = StratifiedKFold(n_splits=10)
score = cross_val_score(adult_model, features, label, cv=skf)

print(f'Accuracy Score from Cross Validation: {np.mean(score)*100:4.1f}%')

Accuracy Score from Cross Validation: 80.4%
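
To make explicit what `cross_val_score` does, the same scores can be produced with a manual loop over the cross-validation iterator. This is a sketch under stated assumptions: it uses a synthetic dataset from `make_classification` in place of the adult income data, and the variable names are illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the adult income data)
X, y = make_classification(n_samples=500, n_features=8, random_state=23)

model = DecisionTreeClassifier(random_state=23)
skf = StratifiedKFold(n_splits=10)

manual_scores = []
for train, test in skf.split(X, y):
    # Fit a fresh copy of the model on the training folds only
    fold_model = clone(model).fit(X[train], y[train])
    # Score on the held-out fold (accuracy for a classifier)
    manual_scores.append(fold_model.score(X[test], y[test]))

auto_scores = cross_val_score(model, X, y, cv=skf)

print(f'Manual loop mean accuracy:     {np.mean(manual_scores)*100:4.1f}%')
print(f'cross_val_score mean accuracy: {np.mean(auto_scores)*100:4.1f}% '
      f'(std {np.std(auto_scores)*100:3.1f})')
```

Reporting the standard deviation alongside the mean shows how much the score varies from fold to fold, which a single train-test split cannot reveal.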


-----

<font color='red' size = '5'> Student Exercise </font>

In the preceding cell, we used `StratifiedKFold` with 10 folds in `cross_val_score`. Try making the following changes and see what you get.

1. Change the number of splits, e.g., to 5 folds, in `StratifiedKFold`.
2. Use `KFold` instead of `StratifiedKFold`.

-----

### Custom Scorer

`cross_val_score` returns only one kind of score per call, which by default is the accuracy score for classification and the $R^2$ score for regression. You can set the `scoring` argument to get other cross-validation scores. The following Code cell demonstrates how to obtain an accuracy score (the default), a precision score for the positive class, a recall score for the positive class, and the area under the ROC curve from `cross_val_score`. Check out the [sklearn.metrics module](https://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics) for details.

In [12]:
# No shuffling is applied, so random_state is omitted (recent
# scikit-learn versions raise an error otherwise)
skf = StratifiedKFold(n_splits=10)

#accuracy
score = cross_val_score(adult_model, features, label, cv=skf)
#positive precision
precision_score = cross_val_score(adult_model, features, label, cv=skf, scoring='precision')
#positive recall
recall_score = cross_val_score(adult_model, features, label, cv=skf, scoring='recall')
#auc
auc_score = cross_val_score(adult_model, features, label, cv=skf, scoring='roc_auc')

print(f'Accuracy Score from Cross Validation: {np.mean(score)*100:4.1f}%')
print(f'Positive Precision Rate from Cross Validation: {np.mean(precision_score)*100:4.1f}%')
print(f'Positive Recall Rate from Cross Validation: {np.mean(recall_score)*100:4.1f}%')
print(f'Area Under Curve from Cross Validation: {np.mean(auc_score)*100:4.1f}%')

Accuracy Score from Cross Validation: 80.4%
Positive Precision Rate from Cross Validation: 56.9%
Positive Recall Rate from Cross Validation: 59.6%
Area Under Curve from Cross Validation: 73.1%
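
Because each `cross_val_score` call returns a single metric, the cell above runs the cross-validation four times. One possible alternative, sketched here and not part of the original notebook, is the `cross_validate` function, which accepts several scorers at once. The example also builds a custom scorer with `make_scorer`; the choice of the F-beta metric with `beta=2` and the synthetic dataset are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic, mildly unbalanced stand-in data (assumption)
X, y = make_classification(n_samples=500, n_features=8,
                           weights=[0.75, 0.25], random_state=23)

scoring = {
    'accuracy': 'accuracy',
    'precision': 'precision',
    'recall': 'recall',
    # Custom scorer: F-beta with beta=2 weights recall more than precision
    'f2': make_scorer(fbeta_score, beta=2),
}

results = cross_validate(DecisionTreeClassifier(random_state=23), X, y,
                         cv=StratifiedKFold(n_splits=10), scoring=scoring)

# Each metric is stored under the key 'test_<name>', one value per fold
for name in scoring:
    print(f'{name:>9}: {np.mean(results["test_" + name])*100:4.1f}%')
```

The returned dictionary also contains `fit_time` and `score_time` entries, so the model is fit once per fold regardless of how many metrics are requested.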


## Ancillary information

The following links are to additional documentation that you might find helpful in learning this material. Reading these web-accessible documents is completely optional.

1. [A Gentle Introduction to k-fold Cross-Validation][1]

-----

[1]: https://machinelearningmastery.com/k-fold-cross-validation/

**&copy; 2019: Gies College of Business at the University of Illinois.**

This notebook is released under the [Creative Commons license CC BY-NC-SA 4.0][ll]. Any reproduction, adaptation, distribution, dissemination or making available of this notebook for commercial use is not allowed unless authorized in writing by the copyright holder.

[ll]: https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode