# Cross validation

Cross-validation (CV) is an important step in the predictive modelling process. Luckily, it is not the hardest part to code. Let's have a look at the basic setup below.

## The dataset and classifier

First, let's introduce the dataset and divide it into a training and a test set:

In [4]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

X,y = make_classification(n_samples=1000, n_features=10,
                               n_informative=2, n_redundant=0, n_repeated=0,
                               n_classes=2,
                               n_clusters_per_class=1,
                               weights=(0.7,0.3),
                               class_sep=0.99, random_state=14)


# You already know about training and test splits:
X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)

To apply cross-validation, we need a classifier as well. Let's use logistic regression for now:

In [5]:
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(solver='liblinear')

## Applying cross validation

Here, we introduce the classifier into the CV process. It really is a process: for k-fold CV, we have a k-step process. The classifier is embedded into the whole process: it is fitted on each of the k training sets that are generated and scored on the corresponding held-out fold. We start with 5 folds (`cv=5`):

In [6]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(classifier, X_train, y_train, cv=5)
print('Accuracy scores: '+str(scores))

Accuracy scores: [0.97857143 1.         0.97857143 0.99285714 0.99285714]
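
To make the k-step process more tangible, the loop below spells it out by hand and compares it with `cross_val_score`. This is only a sketch: the data set and the names `X_demo`, `y_demo`, `model` and `manual_scores` are illustrative placeholders, and it relies on the fact that, for a classifier, an integer `cv` is turned into unshuffled stratified folds.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Illustrative synthetic data (not the data set used in the cells above)
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     n_informative=2, n_redundant=0,
                                     n_clusters_per_class=1, random_state=0)
model = LogisticRegression(solver='liblinear', random_state=0)

# One step per fold: fit on the training part, score on the held-out part
folds = StratifiedKFold(n_splits=5)
manual_scores = []
for train_idx, val_idx in folds.split(X_demo, y_demo):
    fold_model = clone(model)  # fresh, unfitted copy for every fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))
manual_scores = np.array(manual_scores)

# The library call should give the same per-fold accuracies
library_scores = cross_val_score(model, X_demo, y_demo, cv=5)
print(manual_scores)
print(library_scores)
```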


What have we just calculated? In this case, the accuracy, which is the default score for a classifier. Let's now try another metric, e.g., the AUC, this time with 10 folds:

In [7]:
outcomes = cross_val_score(classifier, X_train, y_train, cv=10, scoring='roc_auc')
print(outcomes)

[1.         0.99443414 1.         1.         0.98979592 1.
 1.         1.         0.97123016 1.        ]
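
Per-fold scores are usually condensed into a mean and a standard deviation before models are compared. A small sketch, using the AUC values printed above (the variable name `auc_per_fold` is just an illustrative choice):

```python
import numpy as np

# AUC per fold, copied from the output above
auc_per_fold = np.array([1., 0.99443414, 1., 1., 0.98979592,
                         1., 1., 1., 0.97123016, 1.])

# The mean summarises performance, the standard deviation the spread over folds
print(f"AUC: {auc_per_fold.mean():.3f} +/- {auc_per_fold.std():.3f}")
```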


If you are interested in multiple metrics at the same time, another function is more appropriate:

In [8]:
from sklearn.model_selection import cross_validate

# metrics you want to have computed
metrics = ['roc_auc','accuracy','precision']

# Training scores are not returned by default (we usually care most about the test scores).
# To show them anyway, we add the extra return_train_score parameter
outcomes = cross_validate(classifier, X_train, y_train, scoring=metrics, cv=10, return_train_score=True)
for metric in outcomes.keys():
    print(metric+" value: "+str(outcomes[metric]))

fit_time value: [0.00166774 0.00152278 0.00139213 0.00149798 0.00143528 0.0014863
 0.00129914 0.00148988 0.00146437 0.00148535]
score_time value: [0.00180101 0.00159717 0.00159097 0.00157166 0.00154829 0.00159287
 0.001544   0.00159526 0.00157309 0.00155425]
test_roc_auc value: [1.         0.99443414 1.         1.         0.98979592 1.
 1.         1.         0.97123016 1.        ]
train_roc_auc value: [0.99681514 0.9972073  0.99685079 0.99691021 0.99739744 0.99699214
 0.99693317 0.99672085 0.99949279 0.99673264]
test_accuracy value: [0.95774648 0.98591549 1.         1.         0.97183099 1.
 1.         0.98550725 0.98550725 1.        ]
train_accuracy value: [0.99205087 0.99205087 0.98887122 0.98887122 0.9936407  0.9889065
 0.9889065  0.99049128 0.99049128 0.9889065 ]
test_precision value: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
train_precision value: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]


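Now, the outcome is a dictionary with the fit and score times, and the different metrics per fold for both the training and the test part of each fold. Note that, since we have set aside a separate test set, the `test_` scores here are computed on the held-out folds of the training data, which act as our validation set in this case.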
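
Since every entry of that dictionary is an array with one value per fold, it can be reduced to one line per metric. The following is a hedged sketch on illustrative data; `X_demo`, `y_demo`, `results` and `summary` are placeholder names, not part of the cells above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Illustrative synthetic data (not the data set used in the cells above)
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     n_informative=2, n_redundant=0,
                                     n_clusters_per_class=1, random_state=0)

results = cross_validate(LogisticRegression(solver='liblinear'), X_demo, y_demo,
                         scoring=['roc_auc', 'accuracy'], cv=5,
                         return_train_score=True)

# Keep only the score entries (skip fit_time and score_time) and average them
summary = {}
for key, values in results.items():
    if key.startswith(('test_', 'train_')):
        summary[key] = (values.mean(), values.std())
        print(f"{key}: {values.mean():.3f} +/- {values.std():.3f}")
```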

## Setting up a pipeline

Remember when we talked about training, validation and test sets? We mentioned that the parameters of any pre-processing step (e.g., replacing missing values, transformations) should be learned from the training set only. The same transformation, with the same parameters, should then be applied to the test set. Resampling steps such as over- and under-sampling should only be applied to the training set. Otherwise, information from the test set can 'leak' into the training process, while the testing stage needs to be completely independent.
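
Done by hand for a single split, this rule looks roughly as follows. It is a minimal sketch on illustrative data, with standardisation chosen as the example transformation; the names `X_tr`, `X_te` and `scaler` are placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Illustrative synthetic data (not the data set used in the cells above)
X_demo, y_demo = make_classification(n_samples=300, n_features=5,
                                     n_informative=2, n_redundant=0,
                                     n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=0)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # mean and std are learned from the training data only
X_te_scaled = scaler.transform(X_te)      # the same parameters are reused on the test data

# Training features are centred at zero; test features generally are not exactly
print(X_tr_scaled.mean(axis=0).round(3))
print(X_te_scaled.mean(axis=0).round(3))
```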

To simplify this, we can set up a pipeline containing the various steps that need to be applied, i.e., transformation and training a classifier:

In [6]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

metrics = ['accuracy']

pipeline = make_pipeline(StandardScaler(), classifier)
outcomes = cross_validate(pipeline, X_train, y_train, scoring=metrics, cv=10, return_train_score=True)
for metric in outcomes.keys():
    print(metric+" value: "+str(outcomes[metric]))

fit_time value: [0.00207186 0.00186396 0.00192285 0.00184631 0.00183868 0.00193381
 0.00195336 0.00185204 0.00182962 0.00188851]
score_time value: [0.00029707 0.00029182 0.00029445 0.00031281 0.00028801 0.00028872
 0.00029397 0.00028563 0.0002861  0.00028586]
test_accuracy value: [0.97183099 0.98591549 1.         1.         0.97183099 0.98550725
 1.         0.98550725 0.98550725 1.        ]
train_accuracy value: [0.99205087 0.99205087 0.99046105 0.99046105 0.99205087 0.99049128
 0.9889065  0.99207607 0.99049128 0.9889065 ]


## Predictions for every sample

If you want to obtain, for every sample, the prediction made when that sample was in the test fold (in 10-fold CV, every sample ends up in the test fold exactly once), the following code can be used:

In [7]:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score

predictions = cross_val_predict(pipeline, X_train, y_train, cv=10)
print(accuracy_score(y_train, predictions))

0.9885714285714285
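
Because these predictions cover every training sample exactly once, they can be fed into any other metric function, for example a confusion matrix. A short sketch on illustrative data (`X_demo`, `y_demo` and `oof_pred` are placeholder names):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_predict

# Illustrative synthetic data (not the data set used in the cells above)
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     n_informative=2, n_redundant=0,
                                     n_clusters_per_class=1, random_state=0)

# One out-of-fold prediction per sample
oof_pred = cross_val_predict(LogisticRegression(solver='liblinear'),
                             X_demo, y_demo, cv=5)

# Rows are the true classes, columns the predicted classes
cm = confusion_matrix(y_demo, oof_pred)
print(cm)
```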


Typically, we will use cross-validation to see which classifier, or which parameters, work best over our training/validation sets. Then, finally, we use them on our test set for our final evaluation.
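
That final step could look like the sketch below. It assumes that the scaler plus logistic regression pipeline is the model we settled on, and it rebuilds the data set with the same parameters as at the start so that the block runs on its own; `final_model` and `test_accuracy` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same data set and split as at the start of this section
X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=2, n_redundant=0, n_repeated=0,
                           n_classes=2, n_clusters_per_class=1,
                           weights=(0.7, 0.3), class_sep=0.99, random_state=14)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Refit the chosen model on the full training set ...
final_model = make_pipeline(StandardScaler(), LogisticRegression(solver='liblinear'))
final_model.fit(X_train, y_train)

# ... and evaluate it once on the untouched test set
test_accuracy = final_model.score(X_test, y_test)
print(test_accuracy)
```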

## Adding sampling strategy to pipeline

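Since our data is imbalanced, we might want to preserve the class proportions in every fold. To do so, we can use the stratified CV procedure explicitly. Note that, for a classifier, passing an integer to `cv` already results in stratified folds, which is why the scores below are identical to the previous ones:

In [8]:
from sklearn.model_selection import StratifiedKFold

# random_state only has an effect when shuffle=True; recent scikit-learn
# versions raise an error if it is set without shuffling, so we leave it out
stratified_kfold = StratifiedKFold(n_splits=10)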
outcomes = cross_validate(pipeline, X_train, y_train, scoring=metrics, cv=stratified_kfold, return_train_score=True)
for metric in outcomes.keys():
    print(metric+" value: "+str(outcomes[metric]))

fit_time value: [0.00200963 0.00181651 0.00188613 0.00185299 0.00185609 0.0019474
 0.00192094 0.00187111 0.00183249 0.00187373]
score_time value: [0.00029373 0.00028753 0.00028324 0.00028539 0.00028801 0.00028729
 0.00028419 0.00028729 0.00028229 0.0002811 ]
test_accuracy value: [0.97183099 0.98591549 1.         1.         0.97183099 0.98550725
 1.         0.98550725 0.98550725 1.        ]
train_accuracy value: [0.99205087 0.99205087 0.99046105 0.99046105 0.99205087 0.99049128
 0.9889065  0.99207607 0.99049128 0.9889065 ]
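
To check that stratification really keeps the class proportions, we can compare the share of the positive class in every test fold with the overall share. This is a sketch on illustrative data; `X_demo`, `y_demo` and `fold_rates` are placeholder names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

# Illustrative imbalanced data (roughly 70/30), not the data set used above
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     n_informative=2, n_redundant=0,
                                     n_clusters_per_class=1,
                                     weights=(0.7, 0.3), random_state=0)

overall_rate = y_demo.mean()  # share of class 1 in the whole data set

skf = StratifiedKFold(n_splits=10)
fold_rates = np.array([y_demo[val_idx].mean()
                       for _, val_idx in skf.split(X_demo, y_demo)])

print(f"overall: {overall_rate:.3f}")
print(fold_rates.round(3))
```

Every fold should show a positive-class share close to the overall one; with plain `KFold` on unshuffled or sorted data, this is not guaranteed.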
