# 4. Model Validation (Hands-on with KNN)

This notebook practices the validation concepts from class with very simple code.

Goals:
- Understand why validation is needed (overfitting).
- Build train/test splits manually (including stratification).
- Build train/validation/test splits for hyperparameter tuning.
- Implement k-fold and leave-one-out validation manually.
- Use `KNeighborsClassifier` with toy datasets from `sklearn.datasets`.


## 0. Imports and helper functions

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display

from sklearn.datasets import load_iris, make_moons
from sklearn.neighbors import KNeighborsClassifier

np.set_printoptions(suppress=True)


In [None]:
def accuracy_manual(y_true, y_pred):
    y_true = np.array(y_true)
    y_pred = np.array(y_pred)
    return (y_true == y_pred).mean()


def class_percentages(y):
    y = np.array(y)
    classes, counts = np.unique(y, return_counts=True)
    percentages = 100 * counts / len(y)
    return pd.DataFrame({
        'class': classes,
        'count': counts,
        'percentage': percentages.round(2)
    })


## 1. Why validation: train accuracy vs test accuracy (overfitting intuition)

In [None]:
def train_test_split_manual(X, y, test_size=0.2, seed=42, stratify=False):
    X = np.array(X)
    y = np.array(y)

    rng = np.random.default_rng(seed)
    n = len(y)

    if not stratify:
        indices = np.arange(n)
        rng.shuffle(indices)

        n_test = int(round(test_size * n))
        test_idx = indices[:n_test]
        train_idx = indices[n_test:]
    else:
        train_idx = []
        test_idx = []

        for cls in np.unique(y):
            cls_idx = np.where(y == cls)[0]
            rng.shuffle(cls_idx)

            n_cls_test = int(round(test_size * len(cls_idx)))
            test_idx.extend(cls_idx[:n_cls_test])
            train_idx.extend(cls_idx[n_cls_test:])

        train_idx = np.array(train_idx)
        test_idx = np.array(test_idx)

        rng.shuffle(train_idx)
        rng.shuffle(test_idx)

    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]


In [None]:
# Toy dataset with some noise
X_moons, y_moons = make_moons(n_samples=300, noise=0.28, random_state=7)

X_train_m, X_test_m, y_train_m, y_test_m = train_test_split_manual(
    X_moons, y_moons, test_size=0.30, seed=7, stratify=True
)

results = []
for k in [1, 3, 15, 35]:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train_m, y_train_m)

    train_pred = model.predict(X_train_m)
    test_pred = model.predict(X_test_m)

    results.append({
        'k': k,
        'train_accuracy': round(accuracy_manual(y_train_m, train_pred), 4),
        'test_accuracy': round(accuracy_manual(y_test_m, test_pred), 4)
    })

pd.DataFrame(results)


In [None]:
# Optional visualization (helps understand overfitting intuition)
plt.figure(figsize=(6, 4))
plt.scatter(X_moons[:, 0], X_moons[:, 1], c=y_moons, s=20, cmap='coolwarm')
plt.title('Toy dataset: make_moons')
plt.xlabel('feature 1')
plt.ylabel('feature 2')
plt.show()


### Exercise 1
Change `test_size`, `seed`, and `k` values above.

Questions:
- Which `k` gives very high train accuracy but lower test accuracy?
- Do your conclusions change when you change the random seed?
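
One way to approach these questions is to repeat the experiment over several seeds and compare the averages. The sketch below is a standalone illustration, not the notebook's own code: it assumes scikit-learn's `train_test_split` in place of `train_test_split_manual`, and `train_test_gap` is a hypothetical helper name.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Same toy dataset as in the cell above
X, y = make_moons(n_samples=300, noise=0.28, random_state=7)


def train_test_gap(k, seed, test_size=0.30):
    """Return (train_accuracy, test_accuracy) for one stratified split and one k."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y
    )
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    return model.score(X_tr, y_tr), model.score(X_te, y_te)


seeds = range(10)
for k in [1, 3, 15, 35]:
    # One row per seed: column 0 is train accuracy, column 1 is test accuracy
    accs = np.array([train_test_gap(k, s) for s in seeds])
    print(f'k={k:>2}  train={accs[:, 0].mean():.3f}  '
          f'test={accs[:, 1].mean():.3f} +/- {accs[:, 1].std():.3f}')
```

With `k=1` every training point is its own nearest neighbor, so train accuracy is 1.0 regardless of the seed; the interesting column is how far the test accuracy falls below it.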

## 2. Train/Test split details: randomness, reproducibility, proportion, stratification

In [None]:
iris = load_iris()
X_iris = iris.data
y_iris = iris.target
target_names = iris.target_names

print('Dataset size:', len(y_iris))
print('Class names:', list(target_names))
display(class_percentages(y_iris))

# Use only the first two features to make the classification problem harder
X_iris = X_iris[:, :2]


In [None]:
# Split without stratification
X_train_ns, X_test_ns, y_train_ns, y_test_ns = train_test_split_manual(
    X_iris, y_iris, test_size=0.20, seed=3, stratify=False
)

print('Without stratification - train class distribution')
display(class_percentages(y_train_ns))
print('Without stratification - test class distribution')
display(class_percentages(y_test_ns))


In [None]:
# Split with stratification
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split_manual(
    X_iris, y_iris, test_size=0.20, seed=3, stratify=True
)

print('With stratification - train class distribution')
display(class_percentages(y_train_s))
print('With stratification - test class distribution')
display(class_percentages(y_test_s))


In [None]:
# Train one KNN model with the non-stratified split
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_ns, y_train_ns)

y_pred_train = knn.predict(X_train_ns)
y_pred_test = knn.predict(X_test_ns)

print('Train accuracy:', round(accuracy_manual(y_train_ns, y_pred_train), 4))
print('Test accuracy :', round(accuracy_manual(y_test_ns, y_pred_test), 4))


In [None]:
# Train one KNN model with the stratified split
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_s, y_train_s)

y_pred_train = knn.predict(X_train_s)
y_pred_test = knn.predict(X_test_s)

print('Train accuracy:', round(accuracy_manual(y_train_s, y_pred_train), 4))
print('Test accuracy :', round(accuracy_manual(y_test_s, y_pred_test), 4))

### Exercise 2
Try these changes and rerun:
- `test_size=0.30`
- `seed=10`
- `n_neighbors` in KNN as 1, 3, 7, 11

Write one short conclusion: which setting seems more stable?
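
To judge stability, it can help to look at the test-set class counts over several seeds rather than a single one. The following is a minimal sketch under the assumption that scikit-learn's `train_test_split` behaves like the manual function for this purpose; `count_test_classes` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

_, y = load_iris(return_X_y=True)  # 150 labels, 50 per class


def count_test_classes(seed, stratified, test_size=0.20):
    """Number of test samples per class for one random split."""
    _, y_test = train_test_split(
        y, test_size=test_size, random_state=seed,
        stratify=y if stratified else None,
    )
    return np.bincount(y_test, minlength=3)


for stratified in (False, True):
    print('stratified =', stratified)
    for seed in range(5):
        print('  seed', seed, '-> test counts per class:',
              count_test_classes(seed, stratified))
```

With stratification the 30 test samples are always 10 per class; without it the counts drift from seed to seed, which is one source of unstable test accuracy.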

## 3. Split into Train / Validation / Test (for hyperparameter tuning)

In [None]:
def train_val_test_split_manual(X, y, val_size=0.20, test_size=0.20, seed=42, stratify=True):
    # First split: separate test set
    X_train_val, X_test, y_train_val, y_test = train_test_split_manual(
        X, y, test_size=test_size, seed=seed, stratify=stratify
    )

    # Second split: split train_val into train and validation
    # val_size is relative to the original full dataset
    val_relative = val_size / (1 - test_size)

    X_train, X_val, y_train, y_val = train_test_split_manual(
        X_train_val, y_train_val, test_size=val_relative, seed=seed + 1, stratify=stratify
    )

    return X_train, X_val, X_test, y_train, y_val, y_test


In [None]:
X_train, X_val, X_test, y_train, y_val, y_test = train_val_test_split_manual(
    X_iris, y_iris, val_size=0.20, test_size=0.20, seed=8, stratify=True
)

print('Sizes -> train:', len(y_train), 'val:', len(y_val), 'test:', len(y_test))
print('Train distribution')
display(class_percentages(y_train))
print('Validation distribution')
display(class_percentages(y_val))
print('Test distribution')
display(class_percentages(y_test))


In [None]:
# Hyperparameter tuning using validation set only
candidate_k = [1, 3, 5, 7, 9, 11, 13]
val_results = []

for k in candidate_k:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    val_pred = model.predict(X_val)

    val_results.append({
        'k': k,
        'validation_accuracy': round(accuracy_manual(y_val, val_pred), 4)
    })

val_df = pd.DataFrame(val_results).sort_values('validation_accuracy', ascending=False)
val_df


In [None]:
best_k = int(val_df.iloc[0]['k'])
print('Best k selected on validation:', best_k)

# Final model: train with train+validation, evaluate once on test
X_train_final = np.vstack([X_train, X_val])
y_train_final = np.concatenate([y_train, y_val])

final_model = KNeighborsClassifier(n_neighbors=best_k)
final_model.fit(X_train_final, y_train_final)
test_pred = final_model.predict(X_test)

print('Final test accuracy:', round(accuracy_manual(y_test, test_pred), 4))


### Exercise 3
Change `candidate_k` and split proportions.

Questions:
- Does the selected `best_k` change?
- Is test accuracy always the same as validation accuracy? Why not?
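
When changing the split proportions, it is useful to check the resulting set sizes first. The sketch below reproduces only the size arithmetic of the two successive splits (non-stratified case); `split_sizes` is a hypothetical helper introduced for illustration. With stratification the rounding happens per class, so sizes can differ by a sample or two on unbalanced data.

```python
def split_sizes(n, val_size=0.20, test_size=0.20):
    """Sizes (train, val, test) produced by two successive splits of n samples."""
    # First split: carve out the test set
    n_test = int(round(test_size * n))
    n_train_val = n - n_test

    # Second split: val_size is a fraction of the full dataset,
    # so it must be rescaled to a fraction of the remaining data
    val_relative = val_size / (1 - test_size)
    n_val = int(round(val_relative * n_train_val))

    return n_train_val - n_val, n_val, n_test


print(split_sizes(150))                                  # (90, 30, 30)
print(split_sizes(150, val_size=0.10, test_size=0.30))   # (90, 15, 45)
```

For the default values, `val_relative` is 0.20 / 0.80 = 0.25: a quarter of the 120 remaining samples gives the 30 validation samples requested.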

## 4. Manual K-Fold Cross-Validation

In [None]:
def make_stratified_folds(y, n_folds=5, seed=42):
    y = np.array(y)
    rng = np.random.default_rng(seed)

    folds = [[] for _ in range(n_folds)]

    for cls in np.unique(y):
        cls_idx = np.where(y == cls)[0]
        rng.shuffle(cls_idx)

        # Round-robin assignment keeps class proportions balanced
        for i, idx in enumerate(cls_idx):
            fold_id = i % n_folds
            folds[fold_id].append(idx)

    return [np.array(fold, dtype=int) for fold in folds]


def k_fold_cv_knn(X, y, n_neighbors=5, n_folds=5, seed=42):
    X = np.array(X)
    y = np.array(y)

    folds = make_stratified_folds(y, n_folds=n_folds, seed=seed)
    fold_scores = []

    for fold_id in range(n_folds):
        val_idx = folds[fold_id]

        train_parts = [folds[i] for i in range(n_folds) if i != fold_id]
        train_idx = np.concatenate(train_parts)

        X_train, y_train = X[train_idx], y[train_idx]
        X_val, y_val = X[val_idx], y[val_idx]

        model = KNeighborsClassifier(n_neighbors=n_neighbors)
        model.fit(X_train, y_train)

        y_val_pred = model.predict(X_val)
        fold_acc = accuracy_manual(y_val, y_val_pred)
        fold_scores.append(fold_acc)

    return np.array(fold_scores)


In [None]:
X_train, X_test, y_train, y_test = train_test_split_manual(
    X_iris, y_iris, test_size=0.20, seed=2, stratify=True
)

k_values = [1, 3, 5, 7, 9, 11, 13, 15]
cv_rows = []

for k in k_values:
    scores = k_fold_cv_knn(X_train, y_train, n_neighbors=k, n_folds=5, seed=4)
    cv_rows.append({
        'k': k,
        'fold_scores': np.round(scores, 4),
        'mean_cv_accuracy': round(scores.mean(), 4),
        'std_cv_accuracy': round(scores.std(), 4),
    })

cv_df = pd.DataFrame(cv_rows).sort_values('mean_cv_accuracy', ascending=False)
cv_df


In [None]:
plt.figure(figsize=(6, 4))
plt.scatter(cv_df['k'], cv_df['mean_cv_accuracy'], marker='o')
plt.title('5-fold CV mean accuracy vs k')
plt.xlabel('k (neighbors)')
plt.ylabel('mean CV accuracy')
# plt.ylim(0.70, 0.85)
plt.grid(alpha=0.3)
plt.show()


#### Precision of CV estimates compared to test accuracy

In [None]:
X_train.shape

In [None]:
cv_rows = []
for n_folds in [2, 5, 10, 20, 40]:
    scores = k_fold_cv_knn(X_train, y_train, n_neighbors=11, n_folds=n_folds, seed=4)
    # Evaluate the same model (k=11) on the test set so both numbers are comparable
    test_pred = KNeighborsClassifier(n_neighbors=11).fit(X_train, y_train).predict(X_test)
    test_accuracy = accuracy_manual(y_test, test_pred)
    cv_rows.append({
        'n_folds': n_folds,
        'fold_scores': np.round(scores, 4),
        'mean_cv_accuracy': round(scores.mean(), 4),
        'std_cv_accuracy': round(scores.std(), 4),
        'test_accuracy': round(test_accuracy, 4)
    })

cv_df = pd.DataFrame(cv_rows).sort_values('n_folds', ascending=True)
cv_df

### Exercise 4
- Change `n_folds` from 5 to 3 and 10.
- Change `seed`.

Questions:
- Which `k` is most stable (high mean, low std)?
- How do results compare to one train/test split?
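
As a sanity check on the manual implementation, the same experiment can be run with scikit-learn's built-in tools. This is a sketch, assuming `StratifiedKFold` and `cross_val_score` as stand-ins for `make_stratified_folds` and `k_fold_cv_knn`; the fold assignment differs from the manual one, so the numbers should be similar but not identical. `cv_scores` is a hypothetical helper name.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X = X[:, :2]  # same two features as in the notebook

# Hold out a test set first; cross-validation only sees the training part
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, random_state=2, stratify=y
)


def cv_scores(k, n_folds=5, seed=4):
    """Per-fold accuracies of KNN with k neighbors, using stratified k-fold CV."""
    cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    return cross_val_score(KNeighborsClassifier(n_neighbors=k), X_tr, y_tr, cv=cv)


for k in [1, 5, 11, 15]:
    scores = cv_scores(k)
    print(f'k={k:>2}  mean={scores.mean():.3f}  std={scores.std():.3f}')
```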

## 5. Leave-One-Out Cross-Validation (LOO)

In [None]:
def leave_one_out_cv_knn(X, y, n_neighbors=5):
    X = np.array(X)
    y = np.array(y)

    n = len(y)
    correct = 0

    for i in range(n):
        train_idx = np.array([j for j in range(n) if j != i])
        test_idx = i

        model = KNeighborsClassifier(n_neighbors=n_neighbors)
        model.fit(X[train_idx], y[train_idx])

        pred = model.predict(X[test_idx:test_idx+1])[0]
        if pred == y[test_idx]:
            correct += 1

    return correct / n


In [None]:
# Compare LOO, 5-fold and 10-fold CV on the training set.
# The training set is small, so LOO (one model per sample) still runs quickly.

comparison_rows = []
for k in [1, 3, 5, 7, 9, 11]:
    loo_acc = leave_one_out_cv_knn(X_train, y_train, n_neighbors=k)
    k5_acc = k_fold_cv_knn(X_train, y_train, n_neighbors=k, n_folds=5, seed=12).mean()
    k10_acc = k_fold_cv_knn(X_train, y_train, n_neighbors=k, n_folds=10, seed=12).mean()
    test_pred = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).predict(X_test)
    test_accuracy = accuracy_manual(y_test, test_pred)

    comparison_rows.append({
        'k': k,
        'loo_accuracy': round(loo_acc, 4),
        '5fold_accuracy': round(k5_acc, 4),
        '10fold_accuracy': round(k10_acc, 4),
        'test_accuracy': round(test_accuracy, 4)
    })

pd.DataFrame(comparison_rows)


### Exercise 5
- Repeat the comparison on a subset of the training data, first with 60 samples and then with 100.
- Time the execution of LOO and 5-fold CV using `%time`.

Questions:
- Which method is slower?
- Are LOO and 5-fold results very different?
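
Outside a notebook, where `%time` is not available, the timing can be done with `time.perf_counter`. The sketch below is one possible setup, not the notebook's own code: it assumes scikit-learn's `LeaveOneOut` and `StratifiedKFold` instead of the manual functions, and draws the subsets with `train_test_split`. `timed_cv` is a hypothetical helper.

```python
import time

from sklearn.datasets import load_iris
from sklearn.model_selection import (
    LeaveOneOut, StratifiedKFold, cross_val_score, train_test_split
)
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X = X[:, :2]  # same two features as in the notebook


def timed_cv(cv, X_sub, y_sub, k=5):
    """Return (mean accuracy, elapsed seconds) for one CV scheme."""
    start = time.perf_counter()
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_sub, y_sub, cv=cv)
    return scores.mean(), time.perf_counter() - start


for n in (60, 100):
    # Stratified random subset of n samples (the remaining samples are discarded)
    X_sub, _, y_sub, _ = train_test_split(
        X, y, train_size=n, stratify=y, random_state=0
    )
    schemes = {
        '5-fold': StratifiedKFold(n_splits=5, shuffle=True, random_state=12),
        'LOO': LeaveOneOut(),
    }
    for name, cv in schemes.items():
        acc, seconds = timed_cv(cv, X_sub, y_sub)
        print(f'n={n:>3}  {name:<6}  accuracy={acc:.3f}  time={seconds:.3f}s')
```

LOO fits one model per sample, so its cost grows with the subset size, while 5-fold always fits five models.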

## 6. Optional: Manual one-hot encoding (without sklearn)
This is a small extra example to keep preprocessing transparent for beginners.

The function `one_hot_encode_manual` takes a DataFrame and a column name, and creates one new binary column per category in that column. By default, it drops the last category (in sorted order) to avoid multicollinearity (the "dummy variable trap"). You can change this behavior with the `exclude_last` parameter.

#### When to exclude the last category?
- If you are using linear models (like linear regression or logistic regression), you should exclude one category to avoid multicollinearity.
- If you are using tree-based models (like decision trees or random forests), you can include all categories since these models are not affected by multicollinearity.
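- For KNN, include all categories (`exclude_last=False`) so that every pair of distinct categories is the same distance apart. If one category is dropped, it is encoded as all zeros and ends up closer to the other categories than they are to each other.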
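
The KNN point can be checked numerically. The sketch below is only an illustration: it uses `pandas.get_dummies` in place of the manual function defined in this section, and `pairwise_distances` is a hypothetical helper. It compares Euclidean distances between three categories with and without the last column.

```python
import numpy as np
import pandas as pd

sizes = pd.Series(['large', 'medium', 'small'], name='size')

# Columns come out in sorted order: large, medium, small
full = pd.get_dummies(sizes, dtype=int)   # all categories kept
reduced = full.iloc[:, :-1]               # last category ('small') dropped


def pairwise_distances(encoded):
    """Euclidean distance between every pair of encoded rows."""
    values = encoded.to_numpy(dtype=float)
    diff = values[:, None, :] - values[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    return pd.DataFrame(dist, index=list(sizes), columns=list(sizes)).round(3)


print('All categories kept:')
print(pairwise_distances(full))      # every off-diagonal distance is 1.414
print('Last category dropped:')
print(pairwise_distances(reduced))   # 'small' is at distance 1.0 from the others
```

With the last category dropped, `small` becomes the all-zeros row, so it sits at distance 1.0 from `large` and `medium`, while those two remain 1.414 apart. KNN would therefore treat `small` as more similar to everything else, which is an artifact of the encoding.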

In [None]:
def one_hot_encode_manual(df, column, exclude_last=True):
    df = df.copy()
    categories = sorted(df[column].unique())
    if len(categories) < 2:
        raise ValueError("Column must have at least 2 unique categories for one-hot encoding.")

    # If exclude_last is True, we encode all but the last category to avoid multicollinearity
    categories_to_encode = categories[:-1] if exclude_last else categories  

    for cat in categories_to_encode:
        new_col = f'{column}_{cat}'
        df[new_col] = (df[column] == cat).astype(int)

    df = df.drop(columns=[column])
    return df


toy_df = pd.DataFrame({
    'size': ['small', 'medium', 'small', 'large', 'medium'],
    'price': [10, 15, 12, 20, 18]
})

print('Original:')
display(toy_df)

print('One-hot encoded manually:')
display(one_hot_encode_manual(toy_df, 'size'))


## 7. Conclusions
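- A single train/test split is simple, but the measured accuracy depends on which samples the random split puts in the test set.
- Train/validation/test helps tune hyperparameters more safely, because the test set is never used to choose them.
- K-fold CV is usually more robust than one split, since every sample is used for validation exactly once.
- Leave-One-Out is informative but can be expensive: it fits one model per sample.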
- Manual implementations help understand each validation step clearly.