# 02 - Cross-Validation and Time-Series Splits

---

## Learning Objectives

By the end of this notebook you will be able to:

- Explain why **cross-validation** gives a more reliable performance estimate than a single train/test split
- Implement **KFold** and **StratifiedKFold** with scikit-learn
- Use `cross_val_score` and `cross_validate` for quick evaluation
- Apply **GroupKFold** to user/group-level data
- Use **TimeSeriesSplit** and understand walk-forward validation
- Visualize cross-validation folds

---

## Prerequisites

- Completed **Notebook 01** (train/test/validation splits)
- Basic understanding of model fitting and scoring
- Familiarity with NumPy, Pandas, and Matplotlib

---

## Table of Contents

1. [Why Cross-Validation?](#1-why-cross-validation)
2. [KFold Cross-Validation](#2-kfold-cross-validation)
3. [StratifiedKFold](#3-stratifiedkfold)
4. [cross_val_score and cross_validate](#4-cross_val_score-and-cross_validate)
5. [GroupKFold for User-Level Data](#5-groupkfold-for-user-level-data)
6. [TimeSeriesSplit](#6-timeseriessplit)
7. [Walk-Forward Validation](#7-walk-forward-validation)
8. [Single Split vs. 5-Fold CV Comparison](#8-single-split-vs-5-fold-cv-comparison)
9. [Common Mistakes](#9-common-mistakes)
10. [Exercise](#10-exercise)

---

## 1. Why Cross-Validation?

A single train/test split suffers from **high variance**: the performance estimate depends heavily on *which* samples land in each set.

**$k$-Fold Cross-Validation** addresses this by:

1. Dividing the data into $k$ folds of (nearly) equal size
2. Training on $k-1$ folds and testing on the remaining fold
3. Repeating $k$ times so every sample serves as test data exactly once
4. Averaging the $k$ per-fold scores $L_1, \dots, L_k$

$$
\text{CV}_{(k)} = \frac{1}{k} \sum_{i=1}^{k} L_i
$$
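
The four steps and the averaging formula can be written out as a manual loop. This is a minimal sketch, not how the rest of the notebook runs CV: the dataset `X_demo, y_demo`, the seed, and the choice of `LogisticRegression` are assumptions made only for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Small synthetic problem, used only for this illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

k = 5
kf_demo = KFold(n_splits=k, shuffle=True, random_state=0)

fold_scores = []                                  # L_1, ..., L_k
times_tested = np.zeros(len(y_demo), dtype=int)   # how often each sample is in a test fold

for train_idx, test_idx in kf_demo.split(X_demo):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_demo[train_idx], y_demo[train_idx])                      # train on k-1 folds
    fold_scores.append(clf.score(X_demo[test_idx], y_demo[test_idx]))  # score on the held-out fold
    times_tested[test_idx] += 1

cv_estimate = float(np.mean(fold_scores))         # CV_(k) = (1/k) * sum of L_i
print(f"Fold scores: {np.round(fold_scores, 3)}")
print(f"CV estimate: {cv_estimate:.4f}")
print(f"Every sample tested exactly once: {bool((times_tested == 1).all())}")
```

Here $L_i$ is accuracy, because `LogisticRegression.score` returns accuracy; any other per-fold metric could be averaged the same way.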

**Bias-Variance trade-off of $k$**:

| $k$ | Bias | Variance | Cost |
|-----|------|----------|------|
| Small (e.g., 2) | Higher (less training data per fold) | Lower | Cheaper |
| Large (e.g., $n$, LOOCV) | Lower (nearly all data for training) | Higher | Expensive |
| Typical (5 or 10) | Good balance | Good balance | Moderate |

In practice, **$k = 5$ or $k = 10$** is the standard choice.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.patches import Patch

from sklearn.datasets import make_classification
from sklearn.model_selection import (
    KFold,
    StratifiedKFold,
    GroupKFold,
    TimeSeriesSplit,
    cross_val_score,
    cross_validate,
    train_test_split,
)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic dataset (reused throughout)
X, y = make_classification(
    n_samples=500,
    n_features=10,
    n_informative=5,
    n_redundant=2,
    n_classes=2,
    weights=[0.7, 0.3],
    random_state=42,
)
print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features")
print(f"Class distribution: {np.bincount(y)}")

---

## 2. KFold Cross-Validation

In [None]:
kf = KFold(n_splits=5, shuffle=True, random_state=42)

for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    print(
        f"Fold {fold}: "
        f"train size = {len(train_idx)}, "
        f"test size = {len(test_idx)}, "
        f"test class dist = {np.bincount(y[test_idx])}"
    )

Notice that plain `KFold` does **not** guarantee the same class ratio in each fold -- the class distribution can vary.
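
An extreme case appears when the rows are sorted by class and `KFold` is used without shuffling. The sketch below uses a toy label vector (`y_sorted`, an assumption for this illustration, not the notebook's dataset) to count the positive samples in each test fold.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy labels sorted by class: 70 negatives followed by 30 positives
y_sorted = np.array([0] * 70 + [1] * 30)
X_sorted = np.zeros((len(y_sorted), 1))   # features are irrelevant for the split itself

kf_plain = KFold(n_splits=5)              # shuffle=False is the default
positives_per_fold = [
    int(y_sorted[test_idx].sum()) for _, test_idx in kf_plain.split(X_sorted)
]
print(f"Positives per test fold: {positives_per_fold}")
```

The first three test folds contain no positive samples at all, so metrics such as F1 or ROC AUC would be undefined or misleading on them.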

---

## 3. StratifiedKFold

`StratifiedKFold` preserves the percentage of samples for each class in every fold.

In [None]:
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    class_pct = np.bincount(y[test_idx]) / len(test_idx) * 100
    print(
        f"Fold {fold}: "
        f"test size = {len(test_idx)}, "
        f"class % = [{class_pct[0]:.1f}, {class_pct[1]:.1f}]"
    )

Each fold now has roughly the same 70/30 split -- much better for imbalanced problems.

### Visualizing the folds

In [None]:
def plot_cv_folds(cv, X, y, groups=None, title="Cross-Validation Folds"):
    """Visualize which indices are train vs. test in each fold."""
    fig, ax = plt.subplots(figsize=(12, 4))
    n_samples = len(y)

    split_args = (X, y, groups) if groups is not None else (X, y)
    for fold, (train_idx, test_idx) in enumerate(cv.split(*split_args)):
        indices = np.zeros(n_samples)
        indices[test_idx] = 1
        ax.scatter(
            range(n_samples), [fold] * n_samples,
            c=indices, cmap="coolwarm", marker="|", s=30, linewidths=0.8
        )

    ax.set_yticks(range(cv.get_n_splits()))
    ax.set_yticklabels([f"Fold {i}" for i in range(cv.get_n_splits())])
    ax.set_xlabel("Sample index")
    ax.set_title(title)
    ax.legend(
        handles=[Patch(color="#3B4CC0", label="Train"), Patch(color="#B40426", label="Test")],
        loc="upper right"
    )
    plt.tight_layout()
    plt.show()

plot_cv_folds(skf, X, y, title="StratifiedKFold (5 folds)")

---

## 4. `cross_val_score` and `cross_validate`

These convenience functions automate the fit-predict-score loop.

In [None]:
model = LogisticRegression(max_iter=1000, random_state=42)

# cross_val_score: returns an array of scores (one per fold)
scores = cross_val_score(
    model, X, y,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="accuracy",
)

print(f"Fold scores: {scores}")
print(f"Mean accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")

In [None]:
# cross_validate: returns a dict with fit_time, score_time, and test scores
# Can also return train scores and multiple metrics
cv_results = cross_validate(
    model, X, y,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring=["accuracy", "f1", "roc_auc"],
    return_train_score=True,
)

results_df = pd.DataFrame(cv_results)
print(results_df.round(4).to_string(index=False))
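
The dictionary returned by `cross_validate` uses keys of the form `train_<metric>` and `test_<metric>`, which makes it easy to summarize the train/test gap per metric. The following is a small sketch on a separate illustrative dataset (`X_cvdemo`, `y_cvdemo` are assumptions, not the notebook's data).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Illustrative data, independent of the notebook's (X, y)
X_cvdemo, y_cvdemo = make_classification(n_samples=300, n_features=8, random_state=0)

res = cross_validate(
    LogisticRegression(max_iter=1000),
    X_cvdemo, y_cvdemo,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring=["accuracy", "f1"],
    return_train_score=True,
)

# Mean train score, mean test score, and their difference for each metric
for metric in ["accuracy", "f1"]:
    train_mean = res[f"train_{metric}"].mean()
    test_mean = res[f"test_{metric}"].mean()
    print(f"{metric:>8}: train={train_mean:.3f}  test={test_mean:.3f}  gap={train_mean - test_mean:+.3f}")
```

A large positive gap suggests overfitting; a gap near zero suggests the model generalizes about as well as it fits.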

---

## 5. GroupKFold for User-Level Data

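When samples are grouped (e.g., by user or patient), regular `KFold` can place samples from the same group in both the training and test folds, so the model is scored on groups it has effectively already seen. `GroupKFold` keeps all samples from a group in the same fold.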

In [None]:
# Simulate: 50 users, variable number of samples per user
rng = np.random.RandomState(42)
n_users = 50
samples_per_user = rng.randint(5, 15, size=n_users)
groups = np.repeat(np.arange(n_users), samples_per_user)
n_total = len(groups)

X_grp = rng.randn(n_total, 8)
y_grp = rng.randint(0, 2, n_total)

print(f"Total samples: {n_total}, Total groups: {n_users}")

gkf = GroupKFold(n_splits=5)

for fold, (tr_idx, te_idx) in enumerate(gkf.split(X_grp, y_grp, groups)):
    tr_groups = set(groups[tr_idx])
    te_groups = set(groups[te_idx])
    print(
        f"Fold {fold}: train={len(tr_idx)} samples ({len(tr_groups)} groups), "
        f"test={len(te_idx)} samples ({len(te_groups)} groups), "
        f"overlap={tr_groups & te_groups}"
    )
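
To see the leak that `GroupKFold` prevents, it helps to count how many groups land on both sides of a split under each splitter. This is a hedged sketch: `demo_groups` and the helper `shared_groups_per_fold` are hypothetical names introduced only for this comparison.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

# Hypothetical grouped data: 20 users with 6 samples each
demo_groups = np.repeat(np.arange(20), 6)
X_demo_grp = np.zeros((len(demo_groups), 1))   # features do not matter for the split itself

def shared_groups_per_fold(cv, use_groups):
    """Return, for each fold, how many groups appear in both train and test."""
    split_kwargs = {"groups": demo_groups} if use_groups else {}
    return [
        len(set(demo_groups[tr]) & set(demo_groups[te]))
        for tr, te in cv.split(X_demo_grp, **split_kwargs)
    ]

kfold_shared = shared_groups_per_fold(
    KFold(n_splits=5, shuffle=True, random_state=0), use_groups=False
)
gkf_shared = shared_groups_per_fold(GroupKFold(n_splits=5), use_groups=True)

print("KFold      shared groups per fold:", kfold_shared)
print("GroupKFold shared groups per fold:", gkf_shared)
```

With `GroupKFold` the count is zero in every fold, while shuffled `KFold` splits users across train and test.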

---

## 6. TimeSeriesSplit

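For time-ordered data, **future data must never appear in the training set**. `TimeSeriesSplit` implements an **expanding-window** strategy. With a test block of $k$ samples (by default $k$ = `n_samples // (n_splits + 1)`, with any leftover samples added to the first training block):

- Fold 0: train on `[0, ..., k-1]`, test on `[k, ..., 2k-1]`
- Fold 1: train on `[0, ..., 2k-1]`, test on `[2k, ..., 3k-1]`
- ...and so on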

The training set grows with each fold, and the test set is always in the "future".
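
The fold boundaries follow from simple arithmetic on the default test-block size. The sketch below recomputes where each test block should start and prints it next to what `TimeSeriesSplit` actually returns; the sizes `n_obs` and `n_folds` are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_obs, n_folds = 120, 5               # illustrative sizes
block = n_obs // (n_folds + 1)        # default number of test samples per fold

X_demo_ts = np.arange(n_obs).reshape(-1, 1)
ts_bounds = []                        # (train_start, train_end, test_start, test_end) per fold

for i, (tr, te) in enumerate(TimeSeriesSplit(n_splits=n_folds).split(X_demo_ts)):
    # Test blocks are laid out back to back, ending at the last sample
    expected_test_start = n_obs - (n_folds - i) * block
    ts_bounds.append((int(tr[0]), int(tr[-1]), int(te[0]), int(te[-1])))
    print(
        f"Fold {i}: train {tr[0]}..{tr[-1]}, test {te[0]}..{te[-1]} "
        f"(expected test start: {expected_test_start})"
    )
```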

In [None]:
# Simulated time series: 120 monthly observations
n_ts = 120
time_index = pd.date_range("2015-01-01", periods=n_ts, freq="MS")
X_ts = np.arange(n_ts).reshape(-1, 1)  # simple feature: time step
y_ts = np.sin(np.arange(n_ts) * 0.1) + rng.randn(n_ts) * 0.3

tscv = TimeSeriesSplit(n_splits=5)

for fold, (train_idx, test_idx) in enumerate(tscv.split(X_ts)):
    print(
        f"Fold {fold}: train [{train_idx[0]}..{train_idx[-1]}] ({len(train_idx)} pts), "
        f"test [{test_idx[0]}..{test_idx[-1]}] ({len(test_idx)} pts)"
    )

In [None]:
# Visualize TimeSeriesSplit
fig, ax = plt.subplots(figsize=(14, 4))

cmap_train = "#2196F3"
cmap_test = "#F44336"

for fold, (train_idx, test_idx) in enumerate(tscv.split(X_ts)):
    ax.barh(fold, len(train_idx), left=train_idx[0], height=0.6, color=cmap_train, edgecolor="white")
    ax.barh(fold, len(test_idx), left=test_idx[0], height=0.6, color=cmap_test, edgecolor="white")

ax.set_yticks(range(tscv.get_n_splits()))
ax.set_yticklabels([f"Fold {i}" for i in range(tscv.get_n_splits())])
ax.set_xlabel("Sample index (time order)")
ax.set_title("TimeSeriesSplit -- Train (blue) vs. Test (red)")
ax.legend(
    handles=[Patch(color=cmap_train, label="Train"), Patch(color=cmap_test, label="Test")],
    loc="upper left"
)
plt.tight_layout()
plt.show()

---

## 7. Walk-Forward Validation

Walk-forward validation extends `TimeSeriesSplit` with a **fixed-size sliding window** for training, rather than an ever-growing window. This is useful when:

- Older data becomes less relevant (concept drift)
- Training cost scales with dataset size

```
Fold 0:  [===TRAIN===][=TEST=]...........................
Fold 1:  ...[===TRAIN===][=TEST=].......................
Fold 2:  ......[===TRAIN===][=TEST=]...................
```

scikit-learn's `TimeSeriesSplit` supports this via `max_train_size`:

In [None]:
tscv_wf = TimeSeriesSplit(n_splits=5, max_train_size=30)

for fold, (train_idx, test_idx) in enumerate(tscv_wf.split(X_ts)):
    print(
        f"Fold {fold}: train [{train_idx[0]}..{train_idx[-1]}] (size {len(train_idx)}), "
        f"test [{test_idx[0]}..{test_idx[-1]}] (size {len(test_idx)})"
    )

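Notice that the training window never exceeds 30 samples. The first fold trains on only 20, because `max_train_size` is an upper bound and only 20 observations precede the first test block.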
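
Because `max_train_size` is only an upper bound, a fold with little history available trains on fewer samples. If every fold should use exactly the same window length, one option is a small hand-written generator. The sketch below is one possible design, not a scikit-learn API: `sliding_window_splits` and its parameters are hypothetical names.

```python
import numpy as np

def sliding_window_splits(n_samples, train_size, test_size):
    """Yield (train_idx, test_idx) pairs with a fixed-length training window.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    start = 0
    while start + train_size + test_size <= n_samples:
        train_idx = np.arange(start, start + train_size)
        test_idx = np.arange(start + train_size, start + train_size + test_size)
        yield train_idx, test_idx
        start += test_size            # slide forward by one test block

wf_folds = list(sliding_window_splits(n_samples=120, train_size=30, test_size=20))
for i, (tr, te) in enumerate(wf_folds):
    print(f"Fold {i}: train [{tr[0]}..{tr[-1]}] (size {len(tr)}), test [{te[0]}..{te[-1]}]")
```

The trade-off is that the earliest `train_size` observations are never used as test data, so fewer folds fit into the same series.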

---

## 8. Single Split vs. 5-Fold CV Comparison

Let's empirically compare the **stability** of a single random split vs. 5-fold cross-validation.

In [None]:
model_lr = LogisticRegression(max_iter=1000, random_state=42)
model_dt = DecisionTreeClassifier(max_depth=5, random_state=42)

n_experiments = 20
single_split_scores_lr = []
single_split_scores_dt = []

for seed in range(n_experiments):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.20, random_state=seed, stratify=y
    )
    # Logistic Regression
    model_lr.fit(X_tr, y_tr)
    single_split_scores_lr.append(model_lr.score(X_te, y_te))
    # Decision Tree
    model_dt.fit(X_tr, y_tr)
    single_split_scores_dt.append(model_dt.score(X_te, y_te))

# 5-fold CV scores
cv5_scores_lr = cross_val_score(
    LogisticRegression(max_iter=1000, random_state=42), X, y,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
cv5_scores_dt = cross_val_score(
    DecisionTreeClassifier(max_depth=5, random_state=42), X, y,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)

print("=== Logistic Regression ===")
print(f"  Single splits (20 runs): mean={np.mean(single_split_scores_lr):.4f}, std={np.std(single_split_scores_lr):.4f}")
print(f"  5-Fold CV:               mean={cv5_scores_lr.mean():.4f}, std={cv5_scores_lr.std():.4f}")

print("\n=== Decision Tree ===")
print(f"  Single splits (20 runs): mean={np.mean(single_split_scores_dt):.4f}, std={np.std(single_split_scores_dt):.4f}")
print(f"  5-Fold CV:               mean={cv5_scores_dt.mean():.4f}, std={cv5_scores_dt.std():.4f}")

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(13, 4), sharey=True)

# Logistic Regression
axes[0].hist(single_split_scores_lr, bins=10, alpha=0.6, color="#2196F3", label="Single splits")
axes[0].axvline(cv5_scores_lr.mean(), color="#F44336", linestyle="--", linewidth=2, label=f"5-fold CV mean ({cv5_scores_lr.mean():.3f})")
axes[0].set_xlabel("Accuracy")
axes[0].set_ylabel("Count")
axes[0].set_title("Logistic Regression")
axes[0].legend()

# Decision Tree
axes[1].hist(single_split_scores_dt, bins=10, alpha=0.6, color="#FF9800", label="Single splits")
axes[1].axvline(cv5_scores_dt.mean(), color="#F44336", linestyle="--", linewidth=2, label=f"5-fold CV mean ({cv5_scores_dt.mean():.3f})")
axes[1].set_xlabel("Accuracy")
axes[1].set_title("Decision Tree")
axes[1].legend()

plt.suptitle("Single Split Variability vs. 5-Fold CV", fontweight="bold", y=1.02)
plt.tight_layout()
plt.show()

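The histograms show how much a single-split estimate can vary from one random split to the next. The 5-fold CV mean (red dashed line) averages over five test folds that together cover every sample, so it depends less on any one partition. Note that the `std` printed for 5-fold CV is the spread *across folds* within one run, not the variability of the CV mean itself, so the two `std` values are not directly comparable.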
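
A like-for-like check of stability repeats both procedures with different seeds and compares the spread of the resulting estimates. This is a minimal sketch under stated assumptions: the dataset `X_stab, y_stab`, the 20 seeds, and the model are illustrative choices, and the printed numbers will depend on them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Illustrative data, independent of the notebook's (X, y)
X_stab, y_stab = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)
clf_stab = LogisticRegression(max_iter=1000)

single_estimates, cv_mean_estimates = [], []
for seed in range(20):
    # One estimate from a single stratified 80/20 split
    X_a, X_b, y_a, y_b = train_test_split(
        X_stab, y_stab, test_size=0.2, random_state=seed, stratify=y_stab
    )
    single_estimates.append(clf_stab.fit(X_a, y_a).score(X_b, y_b))
    # One estimate from the mean of a 5-fold CV run with a different shuffle
    cv_seeded = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    cv_mean_estimates.append(cross_val_score(clf_stab, X_stab, y_stab, cv=cv_seeded).mean())

print(f"Std of single-split estimates: {np.std(single_estimates):.4f}")
print(f"Std of 5-fold CV means:        {np.std(cv_mean_estimates):.4f}")
```

Both lists now hold 20 estimates of the same quantity, so their standard deviations can be compared directly.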

---

## 9. Common Mistakes

| Mistake | Why it is harmful | Fix |
|---------|-------------------|-----|
| **Using regular KFold on time-series data** | Future data leaks into training, inflating scores | Use `TimeSeriesSplit` |
| **Leaking validation into training** | Preprocessing (e.g., scaling) or feature selection fitted on all data before CV lets test-fold statistics influence training, giving optimistic scores | Use `Pipeline` so preprocessing is fitted inside each CV fold |
| **Ignoring group structure** | Same user/patient in both train and test folds | Use `GroupKFold` |
| **Not shuffling KFold** | If data is sorted by class, a test fold can contain only one class and scores become misleading | Set `shuffle=True` (but **not** for time series) |
| **Reporting training scores as performance** | Overfitting goes undetected | Always report *test fold* scores from CV |
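
The `Pipeline` fix from the table can be sketched as follows. Because the scaler is part of the estimator passed to `cross_val_score`, it is refit on the training folds of each split and never sees the test fold. The dataset `X_pipe, y_pipe` and the choice of `StandardScaler` are assumptions for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data, independent of the notebook's (X, y)
X_pipe, y_pipe = make_classification(n_samples=300, n_features=8, random_state=0)

# Scaling and the classifier are fitted together inside every CV fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

pipe_scores = cross_val_score(
    pipe, X_pipe, y_pipe,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
print(f"Pipeline CV accuracy: {pipe_scores.mean():.4f} (+/- {pipe_scores.std():.4f})")
```

The leaky alternative would call `StandardScaler().fit_transform(X_pipe)` once on the full dataset and then cross-validate the classifier on the scaled array.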

---

## 10. Exercise

**Task**: Using the synthetic dataset `(X, y)` defined at the top of this notebook:

1. Run **10-fold StratifiedKFold** cross-validation with a `DecisionTreeClassifier(max_depth=3, random_state=42)`.
2. Report the **mean accuracy** and **standard deviation** across folds.
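3. Compare the result to a 5-fold run of the same model -- is 10-fold more stable (lower std)? The 5-fold decision tree in Section 8 used `max_depth=5`, so rerun it with `max_depth=3` for a like-for-like comparison.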

*Hint*: Use `cross_val_score` with `cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=42)`.

In [None]:
# YOUR CODE HERE
# ----------------------------------------------------------------
# scores_10fold = cross_val_score(...)
# print(f"10-Fold CV: mean={...:.4f}, std={...:.4f}")
# ----------------------------------------------------------------