
# Cross-validation strategies

The previous notebooks introduced how to evaluate a model and how to create a
preprocessing pipeline suited to the final model.

In this notebook, we look in more detail at cross-validation strategies and at
some of the pitfalls that we can encounter.

Let's take the iris dataset and evaluate a logistic regression model.

In [None]:
from sklearn.datasets import load_iris

df, target = load_iris(as_frame=True, return_X_y=True)

In [None]:
df

In [None]:
target

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

logistic_regression = make_pipeline(StandardScaler(), LogisticRegression())
logistic_regression

In [None]:
import pandas as pd
from sklearn.model_selection import cross_validate, KFold

cv = KFold(n_splits=3)
cv_results = cross_validate(
    logistic_regression, df, target, cv=cv, return_train_score=True
)
cv_results = pd.DataFrame(cv_results)
cv_results[["train_score", "test_score"]]


We observe that the test score is always zero, which is really surprising. We can
check the target to understand why.

In [None]:
ax = target.plot()
_ = ax.set(
    xlabel="Sample index",
    ylabel="Target value",
    title="Iris dataset target values",
)


We observe that the data is ordered by target. This is a problem because the `KFold`
object does not shuffle the data before splitting it. Therefore, each test set
contains only a class that was never seen during `fit`.

In [None]:
for cv_fold_idx, (train_indices, test_indices) in enumerate(cv.split(df, target)):
    print(f"Fold {cv_fold_idx}:\n")
    print(
        f"Class counts on the train set:\n"
        f"{target.iloc[train_indices].value_counts()}"
    )
    print(
        f"Class counts on the test set:\n" f"{target.iloc[test_indices].value_counts()}"
    )
    print()
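
One way to see that the ordering is the culprit is to let `KFold` shuffle the samples before splitting. The sketch below is only an illustration: the names `X`, `y`, `model` and `shuffled_cv` are chosen here so as not to overwrite the variables used in the rest of the notebook, and `random_state=0` is an arbitrary choice.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(as_frame=True, return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression())

# shuffle=True permutes the sample indices before cutting them into folds,
# so the ordering of the dataset no longer determines the fold content.
shuffled_cv = KFold(n_splits=3, shuffle=True, random_state=0)

# Smallest number of distinct classes found in any training fold.
min_train_classes = min(
    y.iloc[train_indices].nunique() for train_indices, _ in shuffled_cv.split(X, y)
)
shuffled_scores = cross_validate(model, X, y, cv=shuffled_cv)["test_score"]

print(f"Classes in the smallest training fold: {min_train_classes}")
print(f"Test scores with shuffling: {shuffled_scores}")
```

Shuffling removes the degenerate folds, but it does not guarantee that the class proportions are identical from one fold to the next.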


We can use a `StratifiedKFold` object to ensure that the class distribution is
preserved in each fold. A side effect is that all classes are present in both the
training set and the testing set.

In [None]:
from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits=3)
cv_results = cross_validate(
    logistic_regression, df, target, cv=cv, return_train_score=True
)
cv_results = pd.DataFrame(cv_results)
cv_results[["train_score", "test_score"]]

In [None]:
for cv_fold_idx, (train_indices, test_indices) in enumerate(cv.split(df, target)):
    print(f"Fold {cv_fold_idx}:\n")
    print(
        f"Class counts on the train set:\n"
        f"{target.iloc[train_indices].value_counts()}"
    )
    print(
        f"Class counts on the test set:\n" f"{target.iloc[test_indices].value_counts()}"
    )
    print()


This is particularly useful when we have imbalanced classes. Let's check the class
distribution of the breast cancer dataset.

In [None]:
from sklearn.datasets import load_breast_cancer

df, target = load_breast_cancer(as_frame=True, return_X_y=True)

In [None]:
target.value_counts(normalize=True)


Here, we see that the proportion of the two classes is not equal. We can check the
class distribution in each fold using a `KFold` object.

In [None]:
cv = KFold(n_splits=3, shuffle=True, random_state=0)
for cv_fold_idx, (train_indices, test_indices) in enumerate(cv.split(df, target)):
    print(f"Fold {cv_fold_idx}:\n")
    print(
        "Class proportions on the train set:\n"
        f"{target.iloc[train_indices].value_counts(normalize=True)}\n"
    )
    print(
        "Class proportions on the test set:\n"
        f"{target.iloc[test_indices].value_counts(normalize=True)}"
    )
    print()


We observe that the class proportions are not exactly preserved: they vary from one
fold to another. We can use a `StratifiedKFold` object to ensure that the class
distribution is preserved in each fold.

In [None]:
cv = StratifiedKFold(n_splits=3)
for cv_fold_idx, (train_indices, test_indices) in enumerate(cv.split(df, target)):
    print(f"Fold {cv_fold_idx}:\n")
    print(
        "Class proportions on the train set:\n"
        f"{target.iloc[train_indices].value_counts(normalize=True)}\n"
    )
    print(
        "Class proportions on the test set:\n"
        f"{target.iloc[test_indices].value_counts(normalize=True)}"
    )
    print()
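
The two printouts can be condensed into a single number per strategy. The following sketch is an illustration rather than part of the original analysis: the helper `positive_rate_per_test_fold` and the "spread" summary (largest minus smallest proportion of class 1 across test folds) are names and choices introduced here.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, StratifiedKFold

X, y = load_breast_cancer(as_frame=True, return_X_y=True)


def positive_rate_per_test_fold(splitter, X, y):
    """Return the proportion of class 1 in the test set of each fold."""
    return np.array(
        [y.iloc[test_indices].mean() for _, test_indices in splitter.split(X, y)]
    )


splitters = {
    "KFold": KFold(n_splits=3, shuffle=True, random_state=0),
    "StratifiedKFold": StratifiedKFold(n_splits=3),
}
rate_spread = {}
for name, splitter in splitters.items():
    rates = positive_rate_per_test_fold(splitter, X, y)
    # Spread = gap between the most and the least "positive" test folds.
    rate_spread[name] = rates.max() - rates.min()
    print(f"{name}: rates={rates.round(3)}, spread={rate_spread[name]:.3f}")
```

A spread close to zero means that every test fold reflects the class proportions of the full dataset.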


Now, let's check the documentation of the `cross_validate` function to see whether
it already provides a way to stratify the data.

In [None]:
help(cross_validate)
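
The documentation states that when `cv` is an integer or `None`, a `StratifiedKFold` is used if the estimator is a classifier and the target is binary or multiclass, and a `KFold` otherwise. A minimal way to observe this resolution is the public helper `check_cv`. We assume here that it mirrors what `cross_validate` does internally; the variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import check_cv

_, y = load_breast_cancer(as_frame=True, return_X_y=True)

# check_cv resolves an integer `cv` into a concrete splitter object.
cv_for_classifier = check_cv(cv=3, y=y, classifier=True)
cv_for_regressor = check_cv(cv=3, y=y, classifier=False)

print(type(cv_for_classifier).__name__)  # StratifiedKFold
print(type(cv_for_regressor).__name__)  # KFold
```

In other words, passing `cv=3` together with a classifier already yields stratified folds, whereas passing an explicit `KFold` instance, as done earlier, overrides this default.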


Now, we will look at the notion of `groups` in cross-validation. We will use the
digits dataset and group the samples by writer.

In [None]:
from sklearn.datasets import load_digits

df, target = load_digits(return_X_y=True)


We create a simple model: a logistic regression preceded by a scaling of the data.

In [None]:
from sklearn.preprocessing import MinMaxScaler

logistic_regression = make_pipeline(MinMaxScaler(), LogisticRegression())


Let's start by evaluating the model using a `KFold` object, once without shuffling
and once with shuffling.

In [None]:
cv = KFold(n_splits=13)
cv_results = cross_validate(logistic_regression, df, target, cv=cv)
print(
    f"Mean test score: {cv_results['test_score'].mean():.3f} +/- "
    f"{cv_results['test_score'].std():.3f}"
)

In [None]:
cv = KFold(n_splits=13, shuffle=True, random_state=0)
cv_results = cross_validate(logistic_regression, df, target, cv=cv)
print(
    f"Mean test score: {cv_results['test_score'].mean():.3f} +/- "
    f"{cv_results['test_score'].std():.3f}"
)


Surprisingly, the mean test score is higher when shuffling the data. Let's check
whether this is due to the random seed.

In [None]:
for seed in range(10):
    cv = KFold(n_splits=13, shuffle=True, random_state=seed)
    cv_results = cross_validate(logistic_regression, df, target, cv=cv)
    print(
        f"Mean test score: {cv_results['test_score'].mean():.3f} +/- "
        f"{cv_results['test_score'].std():.3f}"
    )


Apparently not. The reason is that the samples are grouped by writer. By shuffling,
we spread the samples of each writer over the training and testing sets. Therefore,
the model is tested on writers that it has already seen during training, which
inflates the test score. If we want a model that generalizes well to new writers,
the samples of a given writer must not appear in both the training and testing sets.

Here, we provide a `groups` array that gives the writer ID of each sample.

In [None]:
from itertools import count
import numpy as np

# defines the lower and upper bounds of sample indices
# for each writer
writer_boundaries = [
    0,
    130,
    256,
    386,
    516,
    646,
    776,
    915,
    1029,
    1157,
    1287,
    1415,
    1545,
    1667,
    1797,
]
groups = np.zeros_like(target)
lower_bounds = writer_boundaries[:-1]
upper_bounds = writer_boundaries[1:]

for group_id, lb, up in zip(count(), lower_bounds, upper_bounds):
    groups[lb:up] = group_id
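
The same array can also be built without an explicit loop. The sketch below is an equivalent alternative, not the notebook's method: it reuses the boundaries listed above and introduces the names `samples_per_writer` and `groups_vectorized` for illustration.

```python
import numpy as np

# Same boundaries as above: writer i owns samples [bounds[i], bounds[i + 1]).
writer_boundaries = [
    0, 130, 256, 386, 516, 646, 776, 915,
    1029, 1157, 1287, 1415, 1545, 1667, 1797,
]

# Number of samples attributed to each writer.
samples_per_writer = np.diff(writer_boundaries)
# Repeat each writer ID as many times as the writer has samples.
groups_vectorized = np.repeat(np.arange(len(samples_per_writer)), samples_per_writer)

print(samples_per_writer)
print(np.unique(groups_vectorized))
```

Printing `samples_per_writer` is also a quick sanity check that every writer contributes a comparable number of samples.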

In [None]:
import matplotlib.pyplot as plt

plt.plot(groups)
plt.yticks(np.unique(groups))
plt.xticks(writer_boundaries, rotation=90)
plt.xlabel("Sample index")
plt.ylabel("Writer index")
_ = plt.title("Underlying writer groups existing in the dataset")


We can use this information to properly evaluate our model. We need to use the
`GroupKFold` object and pass the `groups` parameter to the `cross_validate` function.

In [None]:
from sklearn.model_selection import GroupKFold

cv = GroupKFold(n_splits=13)
cv_results = cross_validate(logistic_regression, df, target, groups=groups, cv=cv)
print(
    f"Mean test score: {cv_results['test_score'].mean():.3f} +/- "
    f"{cv_results['test_score'].std():.3f}"
)


We observe that the mean test score is even lower, but it is likely closer to the
performance that the model would reach on writers unseen during training.
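
To make the guarantee offered by `GroupKFold` concrete, here is a minimal sketch on hypothetical toy data (8 samples attributed to 4 made-up writers; all `*_toy` names are introduced for illustration). For each fold, it collects the writers that appear on both sides of the split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical toy data: 8 samples produced by 4 writers.
X_toy = np.arange(16).reshape(8, 2)
y_toy = np.array([0, 1, 0, 1, 0, 1, 0, 1])
groups_toy = np.array([0, 0, 1, 1, 2, 2, 3, 3])

shared_groups_per_fold = []
for train_indices, test_indices in GroupKFold(n_splits=4).split(
    X_toy, y_toy, groups=groups_toy
):
    train_groups = set(groups_toy[train_indices].tolist())
    test_groups = set(groups_toy[test_indices].tolist())
    # Writers present on both sides of the split (expected to be empty).
    shared_groups_per_fold.append(train_groups & test_groups)
    print(f"train writers: {sorted(train_groups)}, test writers: {sorted(test_groups)}")
```

Every entry of `shared_groups_per_fold` is empty: a writer is never used for both training and testing within the same fold.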