*ref: https://inria.github.io/scikit-learn-mooc/python_scripts/02_numerical_pipeline_cross_validation.html*

We will discuss the practical aspects of assessing the generalization performance of our model via cross-validation instead of a single train-test split.

# Validation of a model

https://www.youtube.com/watch?v=kLWvI9fSnKc&t=51s

# Data preparation

In [3]:
import pandas as pd

adult_census = pd.read_csv("../../datasets/adult-census.csv")

In [4]:
target_name = "class"
target = adult_census[target_name]
data = adult_census.drop(columns=target_name)

In [6]:
numerical_columns = ["age", "capital-gain", "capital-loss", "hours-per-week"]
data_numeric = data[numerical_columns]

In [9]:
# We can now create a model using the make_pipeline tool to chain the preprocessing and
# the estimator in every iteration of the cross-validation.

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

model = make_pipeline(StandardScaler(), LogisticRegression())
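To see why the preprocessing is chained with the estimator, here is a minimal sketch on synthetic data. The make_classification dataset and the variable names are assumptions for illustration, not part of the census example. Fitting the pipeline fits StandardScaler on the training rows only, so statistics from the held-out rows never leak into the preprocessing.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numerical census columns (assumption).
X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X_train, y_train)

# make_pipeline names each step after its lowercased class name.
scaler = pipeline.named_steps["standardscaler"]

# The means learned by the scaler come from the training rows only.
means_from_train_only = np.allclose(scaler.mean_, X_train.mean(axis=0))
print("Scaler fitted on training rows only:", means_from_train_only)
print(f"Held-out accuracy: {pipeline.score(X_test, y_test):.3f}")
```

During cross-validation the same thing happens once per fold: the whole pipeline is refitted on that fold's training part.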

# Cross-validation
The cross_validate function performs cross-validation: you pass it the model, the data, and the target, and it fits and scores the model once per fold. Here cv=5 requests five folds.

In [11]:
%%time
from sklearn.model_selection import cross_validate

model = make_pipeline(StandardScaler(), LogisticRegression())
cv_result = cross_validate(model, data_numeric, target, cv=5)
cv_result

CPU times: total: 562 ms
Wall time: 563 ms


{'fit_time': array([0.07860184, 0.07812262, 0.07807779, 0.09371614, 0.07809162]),
 'score_time': array([0.01562452, 0.01563954, 0.0156579 , 0.01565933, 0.01721191]),
 'test_score': array([0.79557785, 0.80049135, 0.79965192, 0.79873055, 0.80436118])}

The output of cross_validate is a Python dictionary, which by default contains three entries:

- (i) the time to train the model on the training data for each fold, fit_time
- (ii) the time to predict with the model on the testing data for each fold, score_time
- (iii) the default score on the testing data for each fold, test_score.
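The three entries can be reproduced by hand, which may help clarify what happens in each fold. The sketch below is not how cross_validate is implemented internally; it is a simplified loop on synthetic data (make_classification and all variable names are assumptions), using StratifiedKFold because that is the splitter scikit-learn picks for a classifier when cv is an integer.

```python
import time

import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the census data (assumption).
X, y = make_classification(n_samples=500, n_features=4, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression())

# Stratified folds without shuffling, matching cv=5 for a classifier.
splitter = StratifiedKFold(n_splits=5)

manual = {"fit_time": [], "score_time": [], "test_score": []}
for train_idx, test_idx in splitter.split(X, y):
    fold_model = clone(model)  # fresh, unfitted copy for every fold

    start = time.perf_counter()
    fold_model.fit(X[train_idx], y[train_idx])
    manual["fit_time"].append(time.perf_counter() - start)

    start = time.perf_counter()
    score = fold_model.score(X[test_idx], y[test_idx])  # accuracy
    manual["score_time"].append(time.perf_counter() - start)
    manual["test_score"].append(score)

manual_scores = np.array(manual["test_score"])
reference_scores = cross_validate(model, X, y, cv=5)["test_score"]
print(manual_scores)
print(reference_scores)
```

The two printed arrays should agree, since both runs use the same folds and a deterministic solver; the timings will differ from run to run.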

In [12]:
# Let’s extract the scores computed on the test fold of each cross-validation round
# from the cv_result dictionary, then compute the mean accuracy and its variation across folds.

scores = cv_result["test_score"]
print(
    "The mean cross-validation accuracy is: "
    f"{scores.mean():.3f} ± {scores.std():.3f}"
)

The mean cross-validation accuracy is: 0.800 ± 0.003
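As a small check on how the "±" figure is obtained, the sketch below recomputes it from the five test scores printed earlier. One detail worth knowing: scores.std() uses NumPy's default ddof=0 (population standard deviation); the ddof=1 variant is shown only for comparison and is not what the notebook reports.

```python
import numpy as np

# Test scores copied from the cross_validate output shown earlier.
scores = np.array(
    [0.79557785, 0.80049135, 0.79965192, 0.79873055, 0.80436118]
)

mean = scores.mean()
std_population = scores.std()    # NumPy default, ddof=0
std_sample = scores.std(ddof=1)  # sample standard deviation, for comparison

print(f"{mean:.3f} ± {std_population:.3f} (ddof=0)")
print(f"{mean:.3f} ± {std_sample:.3f} (ddof=1)")
```

With only five folds the spread is a rough indication of variability rather than a precise estimate.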
