# Cross-Validation (CV)

### What Is Cross-Validation?

Cross-validation (CV) is a resampling technique that rotates which subset of the dataset serves as the test set. Unlike a single train/test split (split-sample validation), CV lets every observation be used for both training and testing, in different rounds.

---

### How It Works

Consider $K$-fold cross-validation (typically $K = 5$ or $K = 10$).
1. Partition the data into $K$ non-overlapping folds of (approximately) equal size.  
2. For $i = 1, \dots, K$, train the model on the other $K - 1$ folds and evaluate it on the $i$-th fold.
3. Average the scores from the $K$ iterations in Step 2.
4. Select and refit: pick the model/hyper-parameters with the best average score, then retrain that model on the full dataset.
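
The four steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the toy data, the helper name `kfold_cv_mse`, and the use of polynomial degree as the tuning parameter are hypothetical choices, not part of the notebook.

```python
import numpy as np

def kfold_cv_mse(x, y, degree, n_folds=5, seed=0):
    """Average held-out MSE of a degree-`degree` polynomial fit over K folds."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(x))             # Step 1: shuffle, then partition
    folds = np.array_split(perm, n_folds)      # K non-overlapping index sets
    scores = []
    for test_idx in folds:                     # Step 2: rotate the test fold
        train_idx = np.setdiff1d(perm, test_idx)
        coefs = np.polyfit(x[train_idx], y[train_idx], degree)
        y_hat = np.polyval(coefs, x[test_idx])
        scores.append(np.mean((y[test_idx] - y_hat) ** 2))
    return float(np.mean(scores))              # Step 3: average the K scores

# Toy data (hypothetical): a quadratic signal plus Gaussian noise
rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 60)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.3, size=x.size)

degrees = [1, 2, 3, 4, 5]
cv_scores = {d: kfold_cv_mse(x, y, d) for d in degrees}
best_degree = min(cv_scores, key=cv_scores.get)   # Step 4: select ...
final_coefs = np.polyfit(x, y, best_degree)       # ... and refit on all data
print(cv_scores, best_degree)
```

Because the same `seed` is used for every degree, all candidates are scored on identical folds, which keeps the comparison fair.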

---

### Why It Works 

1. The test fold in each iteration is never used to fit the model, so every score is computed on data the model has not seen.    
2. Theory shows that rotating through the folds (so that training sets overlap across iterations) has minimal impact on the quality of the averaged estimate.
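
To make the first point concrete, the $K$-fold estimate can be written out as follows. The notation is an assumption introduced here: $F_i$ is the index set of the $i$-th fold, $\hat f^{(-i)}$ is the model fitted without fold $i$, and squared-error loss is used to match the MSE in the demo.

$$
\widehat{\mathrm{CV}}_K \;=\; \frac{1}{K}\sum_{i=1}^{K} \frac{1}{|F_i|} \sum_{j \in F_i} \bigl(y_j - \hat f^{(-i)}(x_j)\bigr)^2
$$

Each inner sum only involves observations $j \in F_i$ that played no part in fitting $\hat f^{(-i)}$.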

### Pros & Cons

**Pros**

✅ Uses all data for both training and testing.  
✅ Backed by guarantees from statistical theory.     
✅ Simple idea that is easy to adapt to other data types.   

**Cons**

❌ Computationally expensive (need to run $K$ times for every model and every choice of tuning parameter).  
❌ Data leakage risk if not handled properly for dependent data.  
❌ Does not work well with small sample sizes.    
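
The leakage risk for dependent data can be illustrated with a small sketch. It assumes a hypothetical dataset of 10 groups (say, subjects) with 5 rows each, and compares folds built row by row against folds built group by group; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_groups, rows_per_group, n_folds = 10, 5, 5
groups = np.repeat(np.arange(n_groups), rows_per_group)   # group label per row

def shared_groups(test_idx):
    """Number of groups that appear on both sides of a train/test split."""
    train_idx = np.setdiff1d(np.arange(groups.size), test_idx)
    return len(np.intersect1d(groups[train_idx], groups[test_idx]))

# Row-wise folds: shuffle rows and ignore the group structure
row_folds = np.array_split(rng.permutation(groups.size), n_folds)
row_overlap = [shared_groups(f) for f in row_folds]

# Group-wise folds: shuffle group labels, keep each group's rows together
group_folds = [np.flatnonzero(np.isin(groups, g))
               for g in np.array_split(rng.permutation(n_groups), n_folds)]
group_overlap = [shared_groups(f) for f in group_folds]

print("row-wise  :", row_overlap)     # groups shared between train and test
print("group-wise:", group_overlap)   # no group is split across train and test
```

With row-wise folds, correlated rows from the same group end up in both the training and the test set, so the test score can look better than it would on a genuinely new group.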




In [1]:
import numpy as np, matplotlib.pyplot as plt, seaborn as sns
from ipywidgets import Button, Output, VBox
from IPython.display import display
from matplotlib.patches import Rectangle

In [2]:
# ------------------------------------------------------------------
# fixed true function on design grid
n = 100
x_grid = np.linspace(0, 4, n)
f_true = np.sin(np.pi*x_grid) + 0.5*np.sin(2*np.pi*x_grid)
sigma2 = 1.0
rng    = np.random.default_rng(42)

# one noisy sample we’ll keep for all CV runs
y_observed = f_true + rng.normal(scale=np.sqrt(sigma2), size=n)

def poly_design(x, K):
    """Design matrix with columns x, x**2, ..., x**K (no intercept column)."""
    return np.column_stack([x**k for k in range(1, K+1)])

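# theoretical risk for reference: squared bias + variance of the fitted values
# on the design grid (the irreducible noise sigma2 is not included)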
def theo_total(K):
    X = poly_design(x_grid, K)
    H = X @ np.linalg.pinv(X.T @ X) @ X.T
    bias2 = np.mean(((np.eye(n)-H) @ f_true)**2)
    var   = sigma2 * np.trace(H@H.T)/n
    return bias2 + var

Kmax = 20
Ks   = np.arange(1, Kmax+1)
tot_theo = np.array([theo_total(k) for k in Ks])

# ------------------------------------------------------------------
# widget components
out = Output()
rand_btn = Button(description="Randomize folds",
                  button_style="info", layout={'width':'160px'})
display(VBox([rand_btn, out]))

palette = sns.color_palette("husl", 5)  # one colour per fold

def redraw(*_):
    # ----- random 5-fold split ------------------------------------
    perm = rng.permutation(n)
    fold_sizes = np.full(5, n//5)
    fold_sizes[:n % 5] += 1
    folds = np.split(perm, np.cumsum(fold_sizes)[:-1])  # list of 5 index arrays

    # empirical risk arrays (5 folds × Kmax)
    emp_risk = np.zeros((5, Kmax))

    for fold_id, test_idx in enumerate(folds):
        train_idx = np.setdiff1d(np.arange(n), test_idx)

        for k_idx, K in enumerate(Ks):
            # design matrices
            X_tr = poly_design(x_grid[train_idx], K)
            X_te = poly_design(x_grid[test_idx],  K)

            beta  = np.linalg.pinv(X_tr.T @ X_tr) @ X_tr.T @ y_observed[train_idx]
            y_hat = X_te @ beta
            emp_risk[fold_id, k_idx] = np.mean((y_observed[test_idx] - y_hat)**2)

    cv_mean = emp_risk.mean(axis=0)

    # ------------------ plotting ----------------------------------
    with out:
        out.clear_output(wait=True)
        fig, (ax_bar, ax_risk) = plt.subplots(
            1, 2, figsize=(12, 5), gridspec_kw={'width_ratios':[1.2, 2]})

        # ---- left: fold composition (100 tiny bars per row) -------
        ax_bar.clear()                             # in case of redraw
        bar_height   = 0.8                         # a bit of vertical gap
        y_positions  = np.linspace(4, 0, 5)        # rows at y = 4, 3, 2, 1, 0

        for row, (y0, test_idx) in enumerate(zip(y_positions, folds)):
            test_set = set(test_idx)               # O(1) membership check
            for i in range(n):
                ax_bar.add_patch(Rectangle((i, y0),      # (x, y)
                                            1, bar_height,
                                            facecolor=palette[row] if i in test_set
                                                     else "lightgrey",
                                            edgecolor=None))  # ← no boundary

        ax_bar.set_xlim(0, n)
        ax_bar.set_ylim(-0.2, 5)                   # leave a tad of margin
        ax_bar.set_facecolor("white")
        ax_bar.set_xticks([]); ax_bar.set_yticks([])
        ax_bar.set_title("5-fold CV composition\n(coloured = test fold)")


        # ---- right: risk curves ----------------------------------
        ax_risk.plot(Ks, tot_theo, color="black", label="theoretical risk")
        for f in range(5):
            ax_risk.plot(Ks, emp_risk[f], color=palette[f], alpha=.8,
                         label=f"fold {f+1}")
        ax_risk.plot(Ks, cv_mean, color="red", lw=2.5,
                     label="CV average")
        best_k = Ks[np.argmin(cv_mean)]
        ax_risk.axvline(best_k, ls="--", color="red",
                        label=f"min CV risk (K={best_k})")
        ax_risk.set(xlabel="Polynomial degree K",
                    ylabel="Empirical / theoretical MSE",
                    title="Risk vs model complexity (5-fold CV)")
        ax_risk.legend(); ax_risk.grid(alpha=.3)

        plt.tight_layout()
        display(fig); plt.close(fig)

# wire button
rand_btn.on_click(redraw)
redraw()        # initial display



Check out this example by [MLU](https://mlu-explain.github.io/cross-validation/). Notice that they hold out an additional test set for the final comparison. You do not have to do this in every model evaluation, but it is common practice in competitions to ensure an honest evaluation of model performance.

> Note: Even if the test data is hidden from the participants, one can still tune models based on the metrics returned from the test data...
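
A minimal sketch of the hold-out idea, assuming hypothetical toy data and polynomial degree as the tuning parameter: cross-validation runs only on a development portion, and the held-out test set is scored once at the very end.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-1, 1, size=n)
y = np.sin(3 * x) + rng.normal(scale=0.3, size=n)   # hypothetical toy data

# Set aside 20% as a final test set that model selection never touches
perm = rng.permutation(n)
test_idx, dev_idx = perm[: n // 5], perm[n // 5 :]

def cv_mse(idx, degree, n_folds=5):
    """K-fold CV error of a polynomial fit, using only the rows in `idx`."""
    errs = []
    for fold in np.array_split(idx, n_folds):   # idx is already shuffled
        train = np.setdiff1d(idx, fold)
        coefs = np.polyfit(x[train], y[train], degree)
        errs.append(np.mean((y[fold] - np.polyval(coefs, x[fold])) ** 2))
    return float(np.mean(errs))

degrees = range(1, 8)
best = min(degrees, key=lambda d: cv_mse(dev_idx, d))   # tune on development rows
coefs = np.polyfit(x[dev_idx], y[dev_idx], best)        # refit on all of them
test_mse = float(np.mean((y[test_idx] - np.polyval(coefs, x[test_idx])) ** 2))
print(best, test_mse)
```

The test error is only an honest estimate if it is computed once; going back to adjust `degrees` after seeing `test_mse` reintroduces the problem described in the note.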

### Exercise

Perform a 5-fold cross-validation on the following dataset to select the best ridge penalty, assuming we fit a linear regression of $y$ on $x_1$, $x_2$, and the intercept term. 

In [4]:
import numpy as np

rng = np.random.default_rng(42)        # reproducible
n = 50                                # number of observations

# design matrix: each row is (x1, x2)
X = rng.normal(size=(n, 2))           # shape (50, 2)
# prepend a column of 1s for the intercept term
X = np.hstack((np.ones((n, 1)), X))   # shape (50, 3)

# noise term
epsilon = rng.normal(scale=0.5, size=n)

# response: intercept + x1 + x2**2 + noise
y = X[:, 0] + X[:, 1] + X[:, 2]**2 + epsilon


In [None]:
# You can start with one simple ridge regression model
# Fit a ridge regression model
# from sklearn.linear_model import Ridge
# ridge = Ridge(alpha=1.0)  # alpha is the penalty that we need to tune

# Now to create the folds for cross-validation, we can use KFold from sklearn
# from sklearn.model_selection import KFold

# Apply 5-fold cross-validation on ridge regression to tune the alpha parameter

# Visualize the cross-validation results

# Select the best alpha and retrain the model on the entire dataset


Mean Squared Error: 0.4675


### Appendix: variants of cross-validation

Not required! 


| Variant                                       | Typical syntax in `sklearn`                             | When to use                                                               | Key idea / mechanics                                                             | Pros                                               | Cons / warnings                                                  |
| --------------------------------------------- | ------------------------------------------------------- | ------------------------------------------------------------------------- | -------------------------------------------------------------------------------- | -------------------------------------------------- | ---------------------------------------------------------------- |
| **Standard $K$-fold**                         | `KFold(n_splits=K, shuffle=True)`                       | Any i.i.d. tabular data                                                   | Randomly partition into $K$ equal folds; hold one out each round                 | Simple, low variance                               | Folds may break stratification or group integrity                |
| **Stratified $K$-fold**                       | `StratifiedKFold`                                       | Classification with class imbalance                                       | Keeps class proportions identical in every fold                                  | Fair evaluation of minority class                  | Not defined for regression                                       |
| **Leave-One-Out (LOO)**                       | `LeaveOneOut()`                                         | Very small $n$ (≤ 100) or when bias must be minimal                       | Each observation is its own test set; $n$ fits                                   | Nearly unbiased; uses almost all data for training | High variance; $n$ times slower; not group-safe                  |
| **Shuffle-Split / Monte-Carlo CV**            | `ShuffleSplit(n_splits=50, test_size=0.2)`              | Large datasets where full $K$-fold is costly                              | Randomly sample train/test partitions many times                                 | Flexible split sizes; good variance-bias trade-off | Overlaps across splits ⇒ scores not independent                  |
| **Nested CV**                                 | `cross_val_score(GridSearchCV(...), X, y, cv=outer_cv)` | Hyper-parameter tuning when an unbiased generalisation estimate is needed | Outer loop estimates performance; inner loop tunes                               | Guards against “double dipping”                    | Computationally expensive                                        |
| **Group $K$-fold**                            | `GroupKFold(n_splits=K)`                                | Data clustered by patient, user, site, etc.                               | Ensures all samples of a group stay together in train or test                    | Prevents leakage across correlated rows            | Needs a reliable group label                                     |
| **Leave-One-Group-Out (LOGO)**                | `LeaveOneGroupOut()`                                    | Small # of groups (e.g., leave one patient out)                           | Each group becomes the lone test set in turn                                     | Max training data under group constraint           | High runtime if many groups                                      |
| **Blocked / Rolling-Origin (Time-Series CV)** | `TimeSeriesSplit(n_splits=K, gap=∆)`                    | Temporal or sequential data (finance, sensor, NLP)                        | Preserve order: training window precedes test window; can expand or slide        | Mimics real-time forecasting; no look-ahead bias   | Old data may be less relevant; fewer training pts in early folds |
| **Purged & Embargoed Split** (quant-finance)  | custom                                                  | High-frequency trades with overlapping labels                             | Remove (“purge”) overlapping events; embargo neighbours around test period       | Strictly avoids label leakage                      | Complex; reduces usable data                                     |
| **Cluster-based CV**                          | `GroupKFold` with cluster ID                            | Spatial or network data; clustering validation                            | First cluster the data (or use known clusters) and treat each cluster as a group | Tests generalisation to unseen clusters            | Performance depends on clustering quality                        |
| **Stratified-Group $K$-fold**                 | `StratifiedGroupKFold` (scikit-learn ≥ 1.0)             | Imbalanced classes **and** groups                                         | Simultaneously stratifies by label and respects group boundaries                 | Fair & leakage-free                                | May be impossible if constraints clash                           |
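
As a quick check of two rows in the table, the sketch below inspects the splits produced by `StratifiedKFold` and `TimeSeriesSplit`. The 20-row toy arrays are hypothetical and chosen only so the properties are easy to verify by eye.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)          # 20 rows; the index doubles as time order
y = np.array([0] * 15 + [1] * 5)          # imbalanced labels: 25% positives

# StratifiedKFold: each test fold should keep the overall positive rate
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
pos_rates = [float(y[test].mean()) for _, test in skf.split(X, y)]

# TimeSeriesSplit: every training index should precede every test index
tss = TimeSeriesSplit(n_splits=4)
ordered = [bool(train.max() < test.min()) for train, test in tss.split(X)]

print(pos_rates)   # positive rate in each test fold
print(ordered)     # whether training strictly precedes testing in each split
```

Printing the index arrays from `split` in the same way is a cheap sanity check before running any of the other variants on real data.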