# Cross-Validation in scikit-learn

<a href="https://colab.research.google.com/github/thomasjpfan/ml-workshop-intermediate-1-of-2/blob/master/notebooks/01-cross-validation.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>

In [None]:
# Install dependencies for Google Colab
import sys
if 'google.colab' in sys.modules:
    %pip install -r https://raw.githubusercontent.com/thomasjpfan/ml-workshop-intermediate-1-of-2/master/requirements.txt

In [None]:
import sklearn
assert sklearn.__version__.startswith("1.0"), "Please install scikit-learn 1.0"

In [None]:
import seaborn as sns
sns.set_theme(context="notebook", font_scale=1.2,
              rc={"figure.figsize": [10, 6]})
sklearn.set_config(display="diagram")

## Load sample data

In [None]:
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

spam = fetch_openml(data_id=44, as_frame=True)
X, y = spam.data, spam.target
y = y.cat.codes

In [None]:
print(spam.DESCR)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y)

## Cross validation for model selection

### Try DummyClassifier

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.dummy import DummyClassifier

In [None]:
dummy_clf = DummyClassifier(strategy="prior")
dummy_scores = cross_val_score(dummy_clf, X_train, y_train)

In [None]:
dummy_scores

In [None]:
dummy_scores.mean()
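
As a sanity check on what this baseline measures: with `strategy="prior"`, the classifier always predicts the most frequent class, so its accuracy should match the majority-class fraction. The sketch below illustrates this on hypothetical toy data (`X_toy`, `y_toy` are made up for the example and are not the spam dataset).

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical toy data: 60 negatives and 40 positives.
# The features are random noise; the dummy classifier ignores them anyway.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([0] * 60 + [1] * 40)

# Default 5-fold cross-validation with the default accuracy scoring.
toy_scores = cross_val_score(DummyClassifier(strategy="prior"), X_toy, y_toy)

# Fraction of samples that belong to the most frequent class.
majority_fraction = np.bincount(y_toy).max() / len(y_toy)

print("mean CV accuracy:", toy_scores.mean())
print("majority fraction:", majority_fraction)
```

Any model worth keeping should beat this number, which is why the dummy score is a useful reference when the classes are not balanced.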

### Try KNeighborsClassifier

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

knc = make_pipeline(StandardScaler(), KNeighborsClassifier())
knc_scores = cross_val_score(knc, X_train, y_train)

In [None]:
knc_scores

In [None]:
knc_scores.mean()

### Try LogisticRegression

In [None]:
# make_pipeline and StandardScaler were already imported above
from sklearn.linear_model import LogisticRegression

In [None]:
log_reg = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=0)
)

In [None]:
log_reg_scores = cross_val_score(log_reg, X_train, y_train)

In [None]:
log_reg_scores

In [None]:
log_reg_scores.mean()

### Which model do we choose?

1. Dummy
2. KNeighborsClassifier
3. LogisticRegression

## Exercise 1

1. Is the target, `y`, balanced? (**Hint**: `value_counts`)
2. Train the best model on the training set and evaluate on the test data.
3. **Extra**: Pass `scoring='roc_auc'` to `cross_val_score` so that it returns the ROC AUC score instead of accuracy. Which model performs best in this case?

In [None]:
# %load solutions/01-ex01-solutions.py

## Cross validation strategies

### KFold

In [None]:
from sklearn.model_selection import KFold

cross_val_score(log_reg, X_train, y_train, cv=KFold(n_splits=4))
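
To see what `KFold` does under the hood, it can help to look at the indices it produces. The following is a minimal sketch on a hypothetical 8-sample array (`X_toy` is made up for the example): without shuffling, the test folds are consecutive blocks, and every sample lands in exactly one test fold.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical toy data: 8 samples with a single feature.
X_toy = np.arange(8).reshape(-1, 1)

# Collect the test indices of each fold (no shuffling by default).
kfold_test_folds = [
    test_idx.tolist() for _, test_idx in KFold(n_splits=4).split(X_toy)
]
print(kfold_test_folds)  # [[0, 1], [2, 3], [4, 5], [6, 7]]

# With shuffle=True the blocks are no longer consecutive,
# but each sample still appears in exactly one test fold.
shuffled_cv = KFold(n_splits=4, shuffle=True, random_state=0)
shuffled_test = np.concatenate(
    [test_idx for _, test_idx in shuffled_cv.split(X_toy)]
)
print(sorted(shuffled_test.tolist()))
```

If the rows of the dataset are ordered (for example, sorted by class), `shuffle=True` is usually worth considering.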

### RepeatedKFold

In [None]:
from sklearn.model_selection import RepeatedKFold

scores = cross_val_score(log_reg, X_train, y_train,
                         cv=RepeatedKFold(n_splits=4, n_repeats=2))

In [None]:
scores

In [None]:
scores.shape
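
The number of scores is `n_splits * n_repeats`, because the whole K-fold procedure is rerun with a different shuffle for each repeat. A small sketch on hypothetical toy data (`X_toy` is made up for the example) makes the count explicit:

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

# Hypothetical toy data: 12 samples with a single feature.
X_toy = np.arange(12).reshape(-1, 1)

# random_state makes the shuffles reproducible across runs.
rkf = RepeatedKFold(n_splits=4, n_repeats=2, random_state=0)
splits = list(rkf.split(X_toy))

n_total = rkf.get_n_splits()
print(n_total, len(splits))  # 8 8

# The first 4 splits form one complete repeat:
# together their test folds cover every sample once.
first_repeat_test = np.concatenate([test_idx for _, test_idx in splits[:4]])
print(sorted(first_repeat_test.tolist()))
```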

### StratifiedKFold

In [None]:
from sklearn.model_selection import StratifiedKFold

scores = cross_val_score(log_reg, X_train, y_train,
                         cv=StratifiedKFold(n_splits=4))

In [None]:
scores

This is a binary classification problem:

In [None]:
y.value_counts()

When `cv` is an integer and the estimator is a classifier, scikit-learn uses `StratifiedKFold` by default:

In [None]:
cross_val_score(log_reg, X_train, y_train, cv=4)
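
Why stratification matters can be shown with a small sketch. The labels below are hypothetical toy data (`y_toy`, sorted so that all positives come last); the helper `positives_per_fold` is also made up for this example. Plain `KFold` puts every positive in the last test fold, while `StratifiedKFold` keeps the class ratio in each fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical toy labels: 12 negatives followed by 4 positives.
y_toy = np.array([0] * 12 + [1] * 4)
X_toy = np.zeros((16, 1))  # features do not affect the splits


def positives_per_fold(cv):
    """Count the positive samples in each test fold (illustrative helper)."""
    return [int(y_toy[test_idx].sum()) for _, test_idx in cv.split(X_toy, y_toy)]


plain = positives_per_fold(KFold(n_splits=4))
stratified = positives_per_fold(StratifiedKFold(n_splits=4))

print("KFold:          ", plain)       # [0, 0, 0, 4]
print("StratifiedKFold:", stratified)  # [1, 1, 1, 1]
```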

### RepeatedStratifiedKFold

In [None]:
from sklearn.model_selection import RepeatedStratifiedKFold

scores = cross_val_score(
    log_reg, X_train, y_train,
    cv=RepeatedStratifiedKFold(n_splits=4, n_repeats=3))

In [None]:
scores

In [None]:
scores.shape

## Exercise 2

1. Use `sklearn.model_selection.cross_validate` instead of `cross_val_score` with `cv=4`.
2. What additional information does `cross_validate` provide?
3. Set `scoring=['f1', 'accuracy']` in `cross_validate` to evaluate on multiple metrics.

In [None]:
# %load solutions/01-ex02-solutions.py

## Appendix: TimeSeriesSplit

In [None]:
from sklearn.model_selection import TimeSeriesSplit
import numpy as np

X = np.arange(10)

In [None]:
tscv = TimeSeriesSplit(n_splits=3)
for train_index, test_index in tscv.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)

With `gap=2`, the two samples immediately before each test set are excluded from training:

In [None]:
tscv_gap = TimeSeriesSplit(n_splits=3, gap=2)
for train_index, test_index in tscv_gap.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
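
The effect of `gap` can also be checked programmatically. This is a minimal sketch on the same kind of 10-element array (named `X_toy` here so it does not clash with other variables): in every split the training indices come strictly before the test indices, and the distance between the last training index and the first test index is `gap + 1`.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_toy = np.arange(10)  # hypothetical toy series with 10 time steps
gap = 2

distances = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=3, gap=gap).split(X_toy):
    # Training data always precedes the test data in time.
    assert train_idx.max() < test_idx.min()
    # Distance between the last training step and the first test step.
    distances.append(int(test_idx.min() - train_idx.max()))

print(distances)  # [3, 3, 3]
```

A gap like this is one way to reduce leakage when neighbouring time steps are strongly correlated.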