# Cross validation

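Cross-validation (CV) is an important step in the predictive modelling process: it estimates how well a model generalises to unseen data by repeatedly training it on one part of the dataset and evaluating it on the held-out remainder.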

Let's have a look at the basic setup below.

## The dataset and classifier

First, we generate a synthetic binary classification dataset:

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=2, n_redundant=0, n_repeated=0,
                           n_classes=2,
                           n_clusters_per_class=1,
                           weights=(0.7, 0.3),
                           class_sep=0.99, random_state=14)


To apply cross-validation, we need a **classifier** as well. Let's use logistic regression for now:

In [2]:
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(solver='liblinear')

## Applying cross validation

Next, we pass the classifier to the cross-validation routine.

With cv=10, the data is split into 10 folds. The classifier is refit 10 times, each time trained on 9 of the folds and scored on the remaining one:

In [3]:
from sklearn.model_selection import cross_val_score

# Evaluate the classifier by 10-fold cross-validation.
# The cv argument determines the splitting strategy. Possible inputs include:
# - `None`, to use the default 5-fold cross validation,
# - int, to specify the number of folds
scores = cross_val_score(classifier, X, y, cv=10)

print('Accuracy scores: '+str(scores))

Accuracy scores: [1.   0.96 1.   0.99 0.98 0.99 0.99 1.   0.98 1.  ]
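
What "cross_val_score" does in the cell above can be sketched by hand. The loop below is a minimal illustration, not the library's implementation. It assumes that an integer cv combined with a classifier produces stratified folds without shuffling, which is how scikit-learn documents it. The names splitter, manual_scores and library_scores are introduced for this sketch only.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Same dataset and classifier as in the cells above.
X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=2, n_redundant=0, n_repeated=0,
                           n_classes=2, n_clusters_per_class=1,
                           weights=(0.7, 0.3),
                           class_sep=0.99, random_state=14)
classifier = LogisticRegression(solver='liblinear')

# Assumption: cv=10 with a classifier corresponds to StratifiedKFold(n_splits=10).
splitter = StratifiedKFold(n_splits=10)

manual_scores = []
for train_idx, test_idx in splitter.split(X, y):
    model = clone(classifier)                 # fresh, unfitted copy for every fold
    model.fit(X[train_idx], y[train_idx])     # train on 9 folds
    manual_scores.append(model.score(X[test_idx], y[test_idx]))  # accuracy on the held-out fold
manual_scores = np.array(manual_scores)

library_scores = cross_val_score(classifier, X, y, cv=10)
print(np.allclose(manual_scores, library_scores))
```

If the assumption about the splitting strategy holds, the two arrays agree fold by fold.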


We can change the metric used for the evaluation. Above, we used accuracy, which is the default score for a classifier. Let's now use another metric, e.g., the AUC:

In [4]:
outcomes = cross_val_score(classifier, X, y, cv=10, scoring='roc_auc')
print(outcomes)

[1.         0.99666667 1.         1.         0.98571429 0.9952381
 1.         1.         0.98571429 1.        ]


If **multiple metrics** are required, use another function, "cross_validate()":

In [5]:
from sklearn.model_selection import cross_validate

# The metrics to compute, given as a list.
# Each metric name becomes part of a key in the result, e.g. 'test_roc_auc'.
metrics = ['roc_auc','accuracy','precision']

# By default, the training scores are not returned.
# To include them, we set the extra return_train_score parameter.
outcomes = cross_validate(classifier, X, y, scoring=metrics, cv=10, return_train_score=True)

# 'outcomes' is a dict; we can inspect its structure and print the entries by key.
# print(outcomes)

for metric in outcomes.keys():
    print(metric+" value: "+str(outcomes[metric]))

fit_time value: [0.00474572 0.         0.01558685 0.01561785 0.         0.00826049
 0.00688982 0.00805593 0.00656676 0.00850964]
score_time value: [0.00517964 0.         0.         0.00284147 0.01979542 0.
 0.00798917 0.00713611 0.         0.        ]
test_roc_auc value: [1.         0.99666667 1.         1.         0.98571429 0.9952381
 1.         1.         0.98571429 1.        ]
train_roc_auc value: [0.99738354 0.9975654  0.99718994 0.99741873 0.9988971  0.99775313
 0.9973542  0.99736594 0.99782939 0.9972134 ]
test_accuracy value: [1.   0.96 1.   0.99 0.98 0.99 0.99 1.   0.98 1.  ]
train_accuracy value: [0.99       0.99444444 0.99111111 0.99111111 0.99111111 0.99222222
 0.99       0.99       0.99333333 0.99111111]
test_precision value: [1.         1.         1.         1.         1.         1.
 1.         1.         0.96666667 1.        ]
train_precision value: [0.99621212 0.99626866 0.99622642 0.99622642 0.99622642 0.9962406
 0.99621212 0.99621212 1.         0.99621212]
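
Per-fold arrays like the ones above are often condensed into a mean and a standard deviation per metric. The snippet below is one possible way to do that, offered as a sketch; the summary dictionary is a name introduced here and is not part of scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Same dataset and classifier as in the cells above.
X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=2, n_redundant=0, n_repeated=0,
                           n_classes=2, n_clusters_per_class=1,
                           weights=(0.7, 0.3),
                           class_sep=0.99, random_state=14)
classifier = LogisticRegression(solver='liblinear')

outcomes = cross_validate(classifier, X, y,
                          scoring=['roc_auc', 'accuracy', 'precision'],
                          cv=10, return_train_score=True)

# Keep only the score entries (skip fit_time and score_time)
# and reduce each array of 10 fold values to (mean, standard deviation).
summary = {}
for key, values in outcomes.items():
    if key.startswith(('test_', 'train_')):
        summary[key] = (np.mean(values), np.std(values))
        print(f"{key}: {summary[key][0]:.3f} +/- {summary[key][1]:.3f}")
```

A training score that is much higher than the corresponding test score would hint at overfitting, which is the main reason to request the training scores at all.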


Remember that when we talked about training, validation and test sets, we mentioned that the pre-processing (e.g., replacing missing values, transformations, over- and under-sampling, etc.) must be handled **separately** for the training and the test set to avoid any bias.

**That is, the parameters of a transformation are learned on the training set only, and the same transformation, with those same parameters, is then applied to the test set.**

Otherwise, information from the test set can 'leak' into the training process, whereas the testing stage needs to be completely independent.
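
For a single train/test split, the rule can be sketched as follows. This is a minimal illustration using standardization as the transformation. The split size and the random_state of "train_test_split" are arbitrary values chosen for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Same dataset as in the cells above.
X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=2, n_redundant=0, n_repeated=0,
                           n_classes=2, n_clusters_per_class=1,
                           weights=(0.7, 0.3),
                           class_sep=0.99, random_state=14)

# Illustrative split: 80% training, 20% test (values chosen for this sketch).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # mean and scale learned on the training set only
X_test_scaled = scaler.transform(X_test)        # the same parameters reused on the test set

# The stored means come from the training data alone.
print(np.allclose(scaler.mean_, X_train.mean(axis=0)))
```

Calling fit_transform on the full X before splitting would instead let the test samples influence the learned mean and scale, which is exactly the leak described above.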

To simplify this, we can set up a pipeline containing the steps that need to be applied, i.e., the transformations followed by training a classifier. When a pipeline is cross-validated, every step is refit on the training folds only, so nothing from the held-out fold leaks into the pre-processing.

We use the function "make_pipeline" from sklearn to build a pipeline of transforms with a final estimator.

The resulting pipeline sequentially applies the list of transforms and then the final estimator. Intermediate steps of the pipeline must be 'transforms', that is, they must implement the fit and transform methods. The final estimator only needs to implement fit. The transformers in the pipeline can be cached using the memory argument.

The purpose of the pipeline is to **assemble several steps** that can be cross-validated together **while** setting different parameters.
For this, the parameters of each step can be set using the name of the step and the name of the parameter, separated by a '__'.

In [6]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

metrics = ['accuracy']

# Construct a Pipeline from the given estimators:
# standardize the features first, then fit the classifier.
pipeline = make_pipeline(StandardScaler(), classifier)

outcomes = cross_validate(pipeline, X, y, scoring=metrics, cv=10, return_train_score=True)
for metric in outcomes.keys():
    print(metric+" value: "+str(outcomes[metric]))

fit_time value: [0.00299191 0.         0.01568913 0.         0.0169239  0.
 0.         0.00266767 0.         0.01688075]
score_time value: [0.         0.         0.         0.         0.         0.
 0.01575208 0.         0.         0.00096369]
test_accuracy value: [1.   0.96 1.   0.99 0.98 0.99 1.   1.   0.98 1.  ]
train_accuracy value: [0.99222222 0.99555556 0.99111111 0.99222222 0.99222222 0.99222222
 0.99111111 0.99222222 0.99333333 0.99111111]
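
The '__' naming mentioned earlier can be sketched as follows. "make_pipeline" names each step after its lowercased class name, so the regularisation parameter C of the logistic regression is addressed as logisticregression__C. The grid of C values and the use of "GridSearchCV" are choices made for this illustration and are not part of the original setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same dataset and pipeline as in the cells above.
X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=2, n_redundant=0, n_repeated=0,
                           n_classes=2, n_clusters_per_class=1,
                           weights=(0.7, 0.3),
                           class_sep=0.99, random_state=14)
pipeline = make_pipeline(StandardScaler(), LogisticRegression(solver='liblinear'))

# Step names are generated from the class names.
step_names = list(pipeline.named_steps)
print(step_names)  # ['standardscaler', 'logisticregression']

# Set a parameter of one step directly: <step name>__<parameter name>.
pipeline.set_params(logisticregression__C=0.5)

# The same naming lets a search cross-validate the whole pipeline
# for several parameter values (illustrative grid).
param_grid = {'logisticregression__C': [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=10, scoring='accuracy')
search.fit(X, y)
print(search.best_params_)
```

Because the scaler is part of the pipeline, it is refit inside every fold for every candidate value, so the search stays free of leakage.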


## Predictions for every sample

The function "cross_val_predict" returns one prediction per sample: each sample is predicted by the model that was trained while that sample was in the held-out fold. The accuracy below is therefore computed over all samples at once, rather than averaged over the folds:

In [7]:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score

# Generate cross-validated estimates for each input data point.
predictions = cross_val_predict(pipeline, X, y, cv=10)

# print(predictions)

print(accuracy_score(y, predictions))

0.99
