# 01 - Train, Test, and Validation Splits

---

## Learning Objectives

By the end of this notebook you will be able to:

- Explain **why** we split data into training and testing sets
- Use `sklearn.model_selection.train_test_split` with common ratios
- Apply **stratified splitting** to preserve class distributions
- Implement a **3-way split** (train / validation / test)
- Understand **GroupKFold** and when group-aware splitting is necessary
- Recognize common mistakes that lead to data leakage or biased evaluation

---

## Prerequisites

- Basic Python (lists, functions, f-strings)
- Familiarity with NumPy arrays and Pandas DataFrames
- High-level understanding of classification vs. regression

---

## Table of Contents

1. [Why Split Data?](#1-why-split-data)
2. [Train/Test Split with scikit-learn](#2-traintest-split-with-scikit-learn)
3. [Stratified Splits for Classification](#3-stratified-splits-for-classification)
4. [Three-Way Split: Train / Validation / Test](#4-three-way-split-train--validation--test)
5. [GroupKFold: Preventing Group Leakage](#5-groupkfold-preventing-group-leakage)
6. [Visualizing Split Sizes](#6-visualizing-split-sizes)
7. [Common Mistakes](#7-common-mistakes)
8. [Exercise](#8-exercise)

---

## 1. Why Split Data?

The fundamental goal in machine learning is **generalization** -- performing well on *unseen* data, not just the data the model was trained on.

- **Overfitting**: a model memorizes training noise instead of learning the true pattern
- **Underfitting**: a model is too simple to capture the underlying relationship
- We need a **held-out test set** that the model never sees during training so we can get an honest estimate of real-world performance

Mathematically, we want to minimize the **expected prediction error** on new data:

$$
\text{EPE} = \mathbb{E}\left[L\big(y, \hat{f}(x)\big)\right]
$$

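where $L$ is a loss function (e.g., squared error, log-loss) and the expectation is taken over new $(x, y)$ pairs drawn from the same distribution as the training data. Evaluating on training data gives an **optimistically biased** estimate of EPE, because the model has already been fit to those exact points.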
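
The optimism of training-set evaluation can be sketched as follows. This is a minimal illustration, assuming an unpruned `DecisionTreeClassifier` as a stand-in for a flexible model. The `_demo`, `_fit`, and `_holdout` names and the `flip_y=0.05` label noise are illustrative choices, not part of this notebook's main dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative dataset with some label noise (flip_y) so memorization hurts
X_demo, y_demo = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=5,
    n_redundant=2,
    flip_y=0.05,
    random_state=0,
)

# Hold out 20% of the rows that the model never sees while fitting
X_fit, X_holdout, y_fit, y_holdout = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0
)

# An unpruned tree can memorize the training rows, noise included
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_fit, y_fit)

train_acc = tree.score(X_fit, y_fit)          # accuracy on rows used for fitting
test_acc = tree.score(X_holdout, y_holdout)   # accuracy on unseen rows

print(f"Training accuracy: {train_acc:.3f}")
print(f"Held-out accuracy: {test_acc:.3f}")
```

The gap between the two numbers is the optimism that a held-out test set is meant to expose.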

---

## 2. Train/Test Split with scikit-learn

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Create a synthetic binary classification dataset
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=5,
    n_redundant=2,
    n_classes=2,
    weights=[0.7, 0.3],   # imbalanced: 70% class-0, 30% class-1
    random_state=42,
)

print(f"Full dataset  -> X shape: {X.shape}, y shape: {y.shape}")
print(f"Class distribution: {np.bincount(y)}")

### Typical ratios

| Split | Small data (< 10 k) | Medium data (10 k-100 k) | Large data (> 100 k) |
|-------|---------------------|--------------------------|----------------------|
| Train | 70-80 % | 80-90 % | 95-99 % |
| Test  | 20-30 % | 10-20 % | 1-5 % |

Key parameter: **`random_state`** -- set it to get reproducible splits.

In [None]:
# 80/20 split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

print(f"Train set -> {X_train.shape[0]} samples ({X_train.shape[0]/len(y)*100:.0f}%)")
print(f"Test  set -> {X_test.shape[0]} samples ({X_test.shape[0]/len(y)*100:.0f}%)")

In [None]:
# 70/30 split for comparison
X_train_70, X_test_30, y_train_70, y_test_30 = train_test_split(
    X, y, test_size=0.30, random_state=42
)

print(f"70/30 -> Train: {X_train_70.shape[0]}, Test: {X_test_30.shape[0]}")
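
To make the role of `random_state` concrete, here is a small sketch on a toy array (the `data` array and the `a_`/`b_`/`c_` names are illustrative, not part of the dataset above). It checks that two calls with the same seed return the same test rows, and prints what a different seed returns for comparison.

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(20)  # toy "dataset": the values double as row ids

# Same seed twice, then a different seed
a_train, a_test = train_test_split(data, test_size=0.25, random_state=0)
b_train, b_test = train_test_split(data, test_size=0.25, random_state=0)
c_train, c_test = train_test_split(data, test_size=0.25, random_state=1)

same_seed_match = np.array_equal(a_test, b_test)
different_seed_match = np.array_equal(a_test, c_test)

print("Seed 0 test rows:      ", a_test)
print("Seed 0 again:          ", b_test)
print("Seed 1 test rows:      ", c_test)
print("Same seed identical?   ", same_seed_match)
print("Different seed identical?", different_seed_match)
```

Leaving `random_state=None` (the default) draws a fresh shuffle on every call, which makes results hard to reproduce across notebook runs.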

---

## 3. Stratified Splits for Classification

When classes are **imbalanced**, a random split can give a test set with a very different class ratio than the full dataset. The `stratify` parameter ensures each split has approximately the same class proportions.

In [None]:
# --- Non-stratified split ---
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, random_state=42
)

# --- Stratified split ---
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

def class_pct(arr):
    counts = np.bincount(arr)
    return counts / counts.sum() * 100

print("Full dataset class %:     ", np.round(class_pct(y), 2))
print("Non-stratified test %:    ", np.round(class_pct(y_te), 2))
print("Stratified test %:        ", np.round(class_pct(y_te_s), 2))

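Notice how the stratified test set matches the class percentages of the full dataset (roughly 70/30) almost exactly, whereas the non-stratified test set can drift by a few percentage points depending on the seed.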
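
A single seed can make the non-stratified split look better or worse than it typically is. The sketch below repeats both kinds of split over 50 seeds and compares how much the minority-class share of the test set moves around. The seed range, the `minority_pct` helper, and the `_demo` names are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Same generator settings as the notebook's dataset, under separate names
X_demo, y_demo = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=5,
    n_redundant=2,
    n_classes=2,
    weights=[0.7, 0.3],
    random_state=42,
)

def minority_pct(labels):
    """Percentage of class-1 labels in an array."""
    return np.mean(labels == 1) * 100

plain_pcts, strat_pcts = [], []
for seed in range(50):
    # Plain random split
    _, _, _, y_te_plain = train_test_split(
        X_demo, y_demo, test_size=0.20, random_state=seed
    )
    # Stratified split with the same seed
    _, _, _, y_te_strat = train_test_split(
        X_demo, y_demo, test_size=0.20, random_state=seed, stratify=y_demo
    )
    plain_pcts.append(minority_pct(y_te_plain))
    strat_pcts.append(minority_pct(y_te_strat))

# Spread = max minus min of the class-1 percentage across seeds
plain_spread = max(plain_pcts) - min(plain_pcts)
strat_spread = max(strat_pcts) - min(strat_pcts)

print(f"Non-stratified class-1 % range: {min(plain_pcts):.1f} to {max(plain_pcts):.1f}")
print(f"Stratified class-1 % range:     {min(strat_pcts):.1f} to {max(strat_pcts):.1f}")
```

The stratified range should be far narrower, since stratification fixes the per-class counts in the test set up to rounding.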

---

## 4. Three-Way Split: Train / Validation / Test

When you need to **tune hyperparameters**, a single train/test split is not enough:

- **Training set** -- fit the model
- **Validation set** -- choose hyperparameters / compare models
- **Test set** -- final, unbiased evaluation (touch **once**)

Common ratios are **60 / 20 / 20** and **70 / 15 / 15**.

We achieve this by calling `train_test_split` **twice**.

In [None]:
# Step 1: separate out the test set (20%)
X_temp, X_test_3, y_temp, y_test_3 = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

# Step 2: split the remaining 80% into train (75% of 80% = 60%) and val (25% of 80% = 20%)
X_train_3, X_val_3, y_train_3, y_val_3 = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)

total = len(y)
print(f"Train:      {len(y_train_3)} samples  ({len(y_train_3)/total*100:.0f}%)")
print(f"Validation: {len(y_val_3)} samples  ({len(y_val_3)/total*100:.0f}%)")
print(f"Test:       {len(y_test_3)} samples  ({len(y_test_3)/total*100:.0f}%)")
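
The same two-step idea extends to other ratios such as 70 / 15 / 15. The helper below is a hypothetical convenience function (`three_way_split` is not part of scikit-learn). It converts the requested fractions to absolute row counts and passes them as integer `test_size` values, which sidesteps the "fraction of the remainder" arithmetic and any floating-point rounding surprises.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def three_way_split(X, y, val_frac=0.15, test_frac=0.15, random_state=42):
    """Hypothetical helper: stratified train/val/test split by overall fractions."""
    n = len(y)
    n_test = int(round(test_frac * n))  # absolute number of test rows
    n_val = int(round(val_frac * n))    # absolute number of validation rows

    # Step 1: carve off the test set
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=n_test, random_state=random_state, stratify=y
    )
    # Step 2: carve the validation set out of what remains
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=n_val, random_state=random_state, stratify=y_rest
    )
    return X_train, X_val, X_test, y_train, y_val, y_test

# Toy data: 1000 rows, 70% class 0 and 30% class 1
X_toy = np.arange(2000).reshape(1000, 2)
y_toy = np.array([0] * 700 + [1] * 300)

X_tr7, X_va7, X_te7, y_tr7, y_va7, y_te7 = three_way_split(X_toy, y_toy)

for name, labels in [("Train", y_tr7), ("Validation", y_va7), ("Test", y_te7)]:
    print(f"{name:<11}{len(labels):>5} rows, class-1 share {np.mean(labels == 1) * 100:.1f}%")
```

Passing fractions to both calls works too, but the second fraction must then be relative to the remainder (for example, 0.25 of the remaining 80% to get 20% overall).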

---

## 5. GroupKFold: Preventing Group Leakage

Sometimes rows are not independent. For example:

- Multiple medical images from the **same patient**
- Multiple transactions from the **same user**
- Multiple frames from the **same video**

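If the same group appears in both train *and* test, the model can "cheat" by memorizing group-specific patterns, and the test score overstates performance on genuinely new groups. **GroupKFold** keeps all samples from a given group on the same side of every split, so no group is ever divided between the training and test indices of a fold.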

In [None]:
from sklearn.model_selection import GroupKFold

# Simulate 200 samples from 20 users (10 samples each)
n_users = 20
samples_per_user = 10
n_total = n_users * samples_per_user

rng = np.random.RandomState(42)
groups = np.repeat(np.arange(n_users), samples_per_user)
X_grp = rng.randn(n_total, 5)
y_grp = rng.randint(0, 2, n_total)

gkf = GroupKFold(n_splits=5)

for fold, (train_idx, test_idx) in enumerate(gkf.split(X_grp, y_grp, groups)):
    train_groups = set(groups[train_idx])
    test_groups = set(groups[test_idx])
    overlap = train_groups & test_groups
    print(
        f"Fold {fold}: train groups {sorted(train_groups)}, "
        f"test groups {sorted(test_groups)}, overlap: {overlap}"
    )

Every fold prints an empty overlap (`set()`), which means no group appears on both sides of a split: no group leakage, exactly what we want.
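
`GroupKFold` produces several folds. When only a single group-aware train/test split is needed, `GroupShuffleSplit` (also in `sklearn.model_selection`) is one option. The sketch below rebuilds a 20-user layout under separate, illustrative `_demo` names. Note that `test_size` here is a fraction of the *groups*, not of the rows.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# 20 users with 10 rows each (illustrative data)
groups_demo = np.repeat(np.arange(20), 10)
rng_demo = np.random.RandomState(0)
X_demo = rng_demo.randn(len(groups_demo), 5)
y_demo = rng_demo.randint(0, 2, len(groups_demo))

# One split that holds out about 20% of the users
gss = GroupShuffleSplit(n_splits=1, test_size=0.20, random_state=42)
train_idx_g, test_idx_g = next(gss.split(X_demo, y_demo, groups=groups_demo))

train_users = set(groups_demo[train_idx_g].tolist())
test_users = set(groups_demo[test_idx_g].tolist())
shared_users = train_users & test_users

print(f"Train users: {sorted(train_users)}")
print(f"Test users:  {sorted(test_users)}")
print(f"Shared users: {shared_users}")
```

Because whole groups move together, the row-level split ratio only matches the requested fraction when groups are of similar size.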

---

## 6. Visualizing Split Sizes

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# --- Bar chart: split sizes ---
labels = ["Train", "Validation", "Test"]
sizes = [len(y_train_3), len(y_val_3), len(y_test_3)]
colors = ["#2196F3", "#FF9800", "#4CAF50"]

axes[0].bar(labels, sizes, color=colors, edgecolor="black")
for i, v in enumerate(sizes):
    axes[0].text(i, v + 10, str(v), ha="center", fontweight="bold")
axes[0].set_ylabel("Number of samples")
axes[0].set_title("3-Way Split Sizes (60 / 20 / 20)")

# --- Bar chart: class distribution per split ---
splits = {
    "Full": y,
    "Train": y_train_3,
    "Val": y_val_3,
    "Test": y_test_3,
}
x_pos = np.arange(len(splits))
width = 0.35

class0_pcts = [np.mean(v == 0) * 100 for v in splits.values()]
class1_pcts = [np.mean(v == 1) * 100 for v in splits.values()]

axes[1].bar(x_pos - width / 2, class0_pcts, width, label="Class 0", color="#90CAF9")
axes[1].bar(x_pos + width / 2, class1_pcts, width, label="Class 1", color="#EF9A9A")
axes[1].set_xticks(x_pos)
axes[1].set_xticklabels(splits.keys())
axes[1].set_ylabel("Percentage (%)")
axes[1].set_title("Class Distribution Across Splits (Stratified)")
axes[1].legend()

plt.tight_layout()
plt.show()

---

## 7. Common Mistakes

| Mistake | Why it is harmful | Fix |
|---------|-------------------|-----|
| **Fitting or tuning on test data** | The test score stops being an honest estimate of generalization and becomes overoptimistic | Keep the test set locked away until the final evaluation |
| **Not stratifying imbalanced data** | Splits can have very different class ratios, making metrics unreliable | Use `stratify=y` in `train_test_split` |
| **Ignoring group structure** | Leaks information between train/test when rows are not independent | Use `GroupKFold` or `GroupShuffleSplit` |
| **Using a fixed split on tiny datasets** | High variance in performance estimate | Use cross-validation (next notebook) |
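| **Preprocessing before splitting** | Statistics computed on the full dataset (e.g., scaler means) leak test-set information into training | Split first, then fit preprocessing on the training set only and apply it to the other splits (Notebook 03) |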
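
The last row of the table can be sketched in code. This is a minimal example, assuming `StandardScaler` as the preprocessing step (Notebook 03 may use something different). The `_demo`, `_fit`, and `_holdout` names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=5,
    n_redundant=2,
    random_state=42,
)

# 1) Split first
X_fit, X_holdout, y_fit, y_holdout = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)

# 2) Learn the scaling statistics from the training rows only
scaler = StandardScaler().fit(X_fit)

# 3) Apply the same, already-fitted transform to both splits
X_fit_scaled = scaler.transform(X_fit)
X_holdout_scaled = scaler.transform(X_holdout)

print("Train feature means after scaling:   ", np.round(X_fit_scaled.mean(axis=0), 3))
print("Held-out feature means after scaling:", np.round(X_holdout_scaled.mean(axis=0), 3))
```

The held-out means are close to zero but not exactly zero, which is expected: the scaler never saw those rows. Calling `fit` or `fit_transform` on the full dataset before splitting would hide that difference and leak test-set statistics into training.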

---

## 8. Exercise

**Task**: Using the `make_classification` dataset created above, perform the following:

1. Create a **90 / 10** stratified train/test split.
2. Print the class distribution (as percentages) for both the train and test sets.
3. Verify that the distributions are nearly identical.

*Hint*: Use `train_test_split` with `test_size=0.10`, `stratify=y`, and `random_state=42`.

In [None]:
# YOUR CODE HERE
# ----------------------------------------------------------------
# X_train_ex, X_test_ex, y_train_ex, y_test_ex = ...
# print class distributions
# ----------------------------------------------------------------