# Notebook 4: Hyperparameter Tuning — Grid, Random, and Bayesian Search

**Module ML600 — Optimization, Regularization, and Model Selection**

## Learning Objectives

By the end of this notebook you will be able to:

- Distinguish **hyperparameters** from **learned parameters** and explain why tuning matters
- Perform exhaustive search with **GridSearchCV**
- Perform efficient random search with **RandomizedSearchCV**
- Understand **nested cross-validation** for unbiased model evaluation
- Describe the concept of **Bayesian optimization** and know where to find libraries (Optuna, scikit-optimize)
- Apply best practices: always use CV, report on held-out test sets, avoid common pitfalls

## Prerequisites

- Familiarity with scikit-learn estimators (`fit` / `predict` / `score`)
- Understanding of cross-validation (see Notebook 03)
- Basic knowledge of Random Forests or any tree-based model
- Python libraries: `numpy`, `pandas`, `matplotlib`, `seaborn`, `scipy`, `scikit-learn` (imported as `sklearn`)

## Table of Contents

1. [Hyperparameters vs Parameters](#1)
2. [GridSearchCV — Exhaustive Search](#2)
3. [RandomizedSearchCV — Random Sampling](#3)
4. [Grid vs Random: Time and Performance Comparison](#4)
5. [Nested Cross-Validation](#5)
6. [Bayesian Optimization (Conceptual)](#6)
7. [Best Practices](#7)
8. [Common Mistakes](#8)
9. [Exercise](#9)

---
## Setup

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import time
import warnings

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import (
    train_test_split, GridSearchCV, RandomizedSearchCV,
    cross_val_score, KFold
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from scipy.stats import randint, uniform

warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)
print('Setup complete.')

### Load the Breast Cancer Dataset

In [None]:
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target  # 0 = malignant, 1 = benign

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE, stratify=y
)

print(f'Training set: {X_train.shape[0]} samples')
print(f'Test set:     {X_test.shape[0]} samples')
print(f'Features:     {X_train.shape[1]}')

<a id='1'></a>
## 1. Hyperparameters vs Parameters

| Aspect | **Parameters** | **Hyperparameters** |
|--------|---------------|--------------------|
| Set by | Learning algorithm during training | Engineer **before** training |
| Examples | Weights in linear regression, split thresholds in trees | `n_estimators`, `max_depth`, `learning_rate`, `C` |
| How chosen | Fit by minimizing a loss function on the training data | Chosen via search + cross-validation |
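The distinction in the table can be seen directly on a fitted estimator. The sketch below is illustrative only: it uses a `DecisionTreeClassifier` (not the Random Forest used in the rest of this notebook) because a single tree is easy to inspect, and the names `X_demo`, `y_demo`, and `tree` are hypothetical.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Hyperparameters: passed to the constructor, before any data is seen.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
hyperparams = tree.get_params()
print('max_depth (hyperparameter):', hyperparams['max_depth'])

# Learned parameters: exist only after fit(); by scikit-learn convention
# their attribute names end with an underscore (here: tree_).
tree.fit(X_demo, y_demo)
n_nodes = tree.tree_.node_count
root_feature = tree.tree_.feature[0]      # index of the feature used at the root split
root_threshold = tree.tree_.threshold[0]  # split threshold learned at the root
print('nodes learned:', n_nodes)
print('root split: feature', root_feature, 'threshold', round(root_threshold, 3))
```

Changing `max_depth` changes what the algorithm is allowed to learn; the split features and thresholds are then whatever training finds under that constraint.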

### Why tuning matters

- **Default hyperparameters** are sensible starting points but rarely optimal for your data
- Under-tuning can leave significant performance on the table
- Tuning against training-set performance alone rewards **overfitting**
- Tuning with CV scores each candidate on data it was not fit on, giving a more **reliable estimate** of generalization than a single split

In [None]:
# Baseline: RandomForest with defaults
rf_default = RandomForestClassifier(random_state=RANDOM_STATE)
rf_default.fit(X_train, y_train)
default_acc = rf_default.score(X_test, y_test)
print(f'Default RandomForest accuracy: {default_acc:.4f}')

<a id='2'></a>
## 2. GridSearchCV — Exhaustive Search

**GridSearchCV** tries **every combination** in a parameter grid and picks the best one via cross-validation.

Key arguments:
- `estimator` — the model
- `param_grid` — dict mapping parameter names to lists of values
- `cv` — number of folds (default 5)
- `scoring` — metric to optimize (e.g., `'accuracy'`, `'f1'`, `'roc_auc'`)
- `n_jobs` — parallel jobs (`-1` = all cores)
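To make "every combination" concrete, here is a minimal sketch of the loop that grid search performs, written with `ParameterGrid` and `cross_val_score`. It is not how `GridSearchCV` is implemented internally (the real class also parallelizes, records timings, and by default refits the best model on all the data); the deliberately tiny grid and the names `small_grid` and `manual_results` are hypothetical.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid, cross_val_score

X_demo, y_demo = load_breast_cancer(return_X_y=True)

# A deliberately tiny grid so the loop runs quickly: 2 x 2 = 4 combinations.
small_grid = {'n_estimators': [10, 25], 'max_depth': [3, None]}

manual_results = []
for params in ParameterGrid(small_grid):
    model = RandomForestClassifier(random_state=0, **params)
    # One mean CV score per combination.
    scores = cross_val_score(model, X_demo, y_demo, cv=3, scoring='accuracy')
    manual_results.append((scores.mean(), params))

# Pick the combination with the highest mean CV score.
best_score, best_params = max(manual_results, key=lambda r: r[0])
print(f'{len(manual_results)} combinations evaluated')
print(f'Best mean CV accuracy: {best_score:.4f} with {best_params}')
```

The cost is the product of the list lengths times the number of folds, which is why adding one more hyperparameter to a grid can multiply the run time.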

In [None]:
# Define the parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

total_combos = 1
for v in param_grid.values():
    total_combos *= len(v)
print(f'Total combinations to evaluate: {total_combos}')

In [None]:
# Run GridSearchCV
grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=RANDOM_STATE),
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    return_train_score=True
)

start_time = time.time()
grid_search.fit(X_train, y_train)
grid_time = time.time() - start_time

print(f'GridSearchCV completed in {grid_time:.2f} seconds')
print(f'Best CV accuracy:  {grid_search.best_score_:.4f}')
print(f'Best parameters:   {grid_search.best_params_}')

In [None]:
# Evaluate best model on held-out test set
grid_test_acc = grid_search.score(X_test, y_test)
print(f'Test accuracy (GridSearchCV best): {grid_test_acc:.4f}')
print(f'Test accuracy (default RF):        {default_acc:.4f}')

In [None]:
# Visualize top-10 hyperparameter combinations
results_df = pd.DataFrame(grid_search.cv_results_)
top10 = results_df.nsmallest(10, 'rank_test_score')[[
    'params', 'mean_test_score', 'std_test_score', 'mean_train_score', 'rank_test_score'
]].reset_index(drop=True)

fig, ax = plt.subplots(figsize=(10, 5))
ax.barh(range(len(top10)), top10['mean_test_score'], xerr=top10['std_test_score'],
        color=sns.color_palette('viridis', len(top10)), edgecolor='black')
ax.set_yticks(range(len(top10)))
ax.set_yticklabels([str(p) for p in top10['params']], fontsize=7)
ax.set_xlabel('Mean CV Accuracy')
ax.set_title('Top 10 Hyperparameter Combinations (GridSearchCV)')
plt.tight_layout()
plt.show()

<a id='3'></a>
## 3. RandomizedSearchCV — Random Sampling

When the grid is large, exhaustive search becomes **prohibitively slow**. `RandomizedSearchCV` samples a fixed number (`n_iter`) of random parameter combinations from specified distributions.

Advantages:
- **Much faster** for large search spaces
- Can sample from **continuous distributions** (not just discrete lists)
- Often reaches near-optimal solutions with far fewer evaluations, especially when only a few hyperparameters strongly affect the score (Bergstra & Bengio, 2012)
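The sampling step can be inspected on its own with `ParameterSampler`, which draws candidates from the same kind of specification that `RandomizedSearchCV` accepts. The search space below is a hypothetical example chosen to show one integer distribution, one continuous log-scaled distribution, and one plain list; it is not the space used later in this notebook.

```python
from scipy.stats import loguniform, randint
from sklearn.model_selection import ParameterSampler

demo_space = {
    'n_estimators': randint(50, 300),        # integers from 50 to 299 (upper bound exclusive)
    'learning_rate': loguniform(1e-3, 1e0),  # continuous, uniform on a log scale
    'max_features': ['sqrt', 'log2'],        # a list is sampled uniformly
}

# Draw 5 candidate configurations; random_state makes the draw repeatable.
sampled = list(ParameterSampler(demo_space, n_iter=5, random_state=0))
for candidate in sampled:
    print(candidate)
```

Because `n_iter` fixes the budget, widening a distribution or adding a hyperparameter does not increase the number of fits, unlike a grid.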

In [None]:
# Define distributions for random search
param_distributions = {
    'n_estimators': randint(50, 300),
    'max_depth': [3, 5, 10, 15, 20, None],
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10),
    'max_features': ['sqrt', 'log2', None]
}

random_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=RANDOM_STATE),
    param_distributions=param_distributions,
    n_iter=50,           # try 50 random combinations
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    random_state=RANDOM_STATE,
    return_train_score=True
)

start_time = time.time()
random_search.fit(X_train, y_train)
random_time = time.time() - start_time

print(f'RandomizedSearchCV completed in {random_time:.2f} seconds')
print(f'Best CV accuracy:  {random_search.best_score_:.4f}')
print(f'Best parameters:   {random_search.best_params_}')

In [None]:
random_test_acc = random_search.score(X_test, y_test)
print(f'Test accuracy (RandomizedSearchCV best): {random_test_acc:.4f}')

<a id='4'></a>
## 4. Grid vs Random: Time and Performance Comparison

In [None]:
comparison = pd.DataFrame({
    'Method': ['Default RF', 'GridSearchCV', 'RandomizedSearchCV'],
    'Test Accuracy': [default_acc, grid_test_acc, random_test_acc],
    'Best CV Accuracy': [None, grid_search.best_score_, random_search.best_score_],
    'Time (s)': [None, grid_time, random_time],
    'Combinations Tried': [1, total_combos, random_search.n_iter]
})
print(comparison.to_string(index=False))

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Accuracy comparison
methods = ['Default', 'Grid', 'Random']
accs = [default_acc, grid_test_acc, random_test_acc]
colors = ['#999999', '#2196F3', '#FF9800']
axes[0].bar(methods, accs, color=colors, edgecolor='black')
axes[0].set_ylabel('Test Accuracy')
axes[0].set_title('Test Accuracy Comparison')
axes[0].set_ylim(min(accs) - 0.02, max(accs) + 0.02)
for i, v in enumerate(accs):
    axes[0].text(i, v + 0.003, f'{v:.4f}', ha='center', fontweight='bold')

# Time comparison
axes[1].bar(['Grid', 'Random'], [grid_time, random_time],
            color=['#2196F3', '#FF9800'], edgecolor='black')
axes[1].set_ylabel('Time (seconds)')
axes[1].set_title('Search Time Comparison')
for i, v in enumerate([grid_time, random_time]):
    axes[1].text(i, v + 0.1, f'{v:.2f}s', ha='center', fontweight='bold')

plt.tight_layout()
plt.show()

<a id='5'></a>
## 5. Nested Cross-Validation

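`GridSearchCV` reports `best_score_` from the same cross-validation folds it used to choose the hyperparameters, so that score is optimistically biased. A single held-out test split corrects for this, but its estimate is noisy and can look too good if the test set happens to be "easy".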

**Nested CV** provides an unbiased estimate of generalization:

```
Outer CV (evaluation)
  |--- Fold 1 train -> Inner CV (tuning) -> best params -> score on Fold 1 test
  |--- Fold 2 train -> Inner CV (tuning) -> best params -> score on Fold 2 test
  |--- ...
```

- **Inner loop**: tunes hyperparameters via CV
- **Outer loop**: evaluates the *entire tuning procedure* on unseen data
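The two loops can also be written out explicitly, which makes it visible that each outer fold may select different hyperparameters. This is a minimal sketch under stated assumptions: it uses `StratifiedKFold` rather than plain `KFold`, a two-value grid, and small forests to keep the run short; the names `outer`, `inner`, `outer_scores`, and `chosen` are hypothetical.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X_demo, y_demo = load_breast_cancer(return_X_y=True)

outer = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

outer_scores, chosen = [], []
for train_idx, test_idx in outer.split(X_demo, y_demo):
    # Inner loop: tune using only the outer-training portion.
    search = GridSearchCV(
        RandomForestClassifier(n_estimators=25, random_state=0),
        param_grid={'max_depth': [3, None]},
        cv=inner,
        scoring='accuracy',
    )
    search.fit(X_demo[train_idx], y_demo[train_idx])

    # Outer loop: score the refit best model on data the tuning never saw.
    outer_scores.append(search.score(X_demo[test_idx], y_demo[test_idx]))
    chosen.append(search.best_params_)

for score, params in zip(outer_scores, chosen):
    print(f'outer-fold accuracy {score:.4f} with {params}')
```

The outer scores estimate the performance of the whole procedure (search plus refit), not of one fixed hyperparameter setting.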

In [None]:
# Nested CV: inner loop for tuning, outer loop for evaluation
inner_cv = KFold(n_splits=3, shuffle=True, random_state=RANDOM_STATE)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)

# Inner search object
inner_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=RANDOM_STATE),
    param_grid={
        'n_estimators': [50, 100, 200],
        'max_depth': [5, 10, None]
    },
    cv=inner_cv,
    scoring='accuracy',
    n_jobs=-1
)

# Outer evaluation
nested_scores = cross_val_score(
    inner_search, X_train, y_train, cv=outer_cv, scoring='accuracy'
)

print(f'Nested CV scores: {nested_scores}')
print(f'Nested CV mean:   {nested_scores.mean():.4f} +/- {nested_scores.std():.4f}')
print('\nThis gives a nearly unbiased estimate of how well our tuning procedure generalizes.')

<a id='6'></a>
## 6. Bayesian Optimization (Conceptual)

Both Grid and Random search are **uninformed** — each trial ignores the results of previous trials.

**Bayesian optimization** builds a **surrogate model** (usually a Gaussian Process or a Tree-structured Parzen Estimator) of the objective function and uses it to decide which hyperparameters to try next.

How it works:
1. Evaluate a few random points
2. Fit a surrogate model to map hyperparameters -> score
3. Use an **acquisition function** (e.g., Expected Improvement) to pick the next point that balances exploration vs. exploitation
4. Evaluate that point, update the surrogate, repeat
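The four steps above can be sketched in a few lines on a toy problem. This is an illustration under stated assumptions, not how Optuna or scikit-optimize are implemented: the objective `toy_objective` is a hypothetical one-dimensional stand-in for a CV score, the surrogate is scikit-learn's `GaussianProcessRegressor`, and the acquisition function is Expected Improvement evaluated on a fixed grid of candidate points.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def toy_objective(x):
    """Hypothetical stand-in for a CV score as a function of one hyperparameter."""
    return -(x - 2.0) ** 2 + 4.0

rng = np.random.default_rng(0)
candidates = np.linspace(0.0, 5.0, 201).reshape(-1, 1)

# Step 1: evaluate a few random points.
X_obs = rng.uniform(0.0, 5.0, size=(3, 1))
y_obs = toy_objective(X_obs).ravel()
initial_best = y_obs.max()

for _ in range(10):
    # Step 2: fit the surrogate to all observations so far.
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6,
                                  normalize_y=True, random_state=0)
    gp.fit(X_obs, y_obs)

    # Step 3: Expected Improvement over the best score seen so far.
    mu, sigma = gp.predict(candidates, return_std=True)
    improvement = mu - y_obs.max()
    z = improvement / np.maximum(sigma, 1e-9)
    ei = improvement * norm.cdf(z) + sigma * norm.pdf(z)
    x_next = candidates[np.argmax(ei)].reshape(1, 1)

    # Step 4: evaluate the chosen point and add it to the observations.
    X_obs = np.vstack([X_obs, x_next])
    y_obs = np.append(y_obs, toy_objective(x_next).ravel())

best_x = X_obs[np.argmax(y_obs), 0]
print(f'best x found: {best_x:.3f}, best score: {y_obs.max():.4f}')
```

The first term of `ei` rewards points the surrogate predicts to be good (exploitation); the second rewards points where the surrogate is uncertain (exploration).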

Popular libraries:
- **[Optuna](https://optuna.org/)** — flexible, supports pruning, great visualization
- **[scikit-optimize](https://scikit-optimize.github.io/)** — `BayesSearchCV` with scikit-learn API
- **[Hyperopt](http://hyperopt.github.io/hyperopt/)** — Tree-structured Parzen Estimators

> **Note**: Bayesian methods shine when evaluation is expensive (e.g., deep learning). For small sklearn models, RandomizedSearchCV is often sufficient.

In [None]:
# Conceptual illustration: Bayesian optimization flow (pseudocode)
# This cell is illustrative -- it does NOT require optuna to be installed.

bayesian_pseudocode = """
import optuna

def objective(trial):
    n_estimators = trial.suggest_int('n_estimators', 50, 300)
    max_depth    = trial.suggest_int('max_depth', 3, 20)
    min_samples_split = trial.suggest_int('min_samples_split', 2, 20)
    
    model = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        min_samples_split=min_samples_split,
        random_state=42
    )
    score = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy').mean()
    return score

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)

print(study.best_params)
print(study.best_value)
"""
print('Bayesian Optimization pseudocode (requires optuna):')
print(bayesian_pseudocode)

<a id='7'></a>
## 7. Best Practices

1. **Always use cross-validation** during tuning, not a single validation split
2. **Report final results on a held-out test set** that was never used during tuning
3. **Start broad, then narrow**: use RandomizedSearchCV to find promising regions, then GridSearchCV to fine-tune
4. **Fix `random_state`** for reproducibility
5. **Use `n_jobs=-1`** to parallelize across CPU cores
6. **Check for overfitting**: compare `mean_train_score` vs `mean_test_score` in CV results
7. **Consider nested CV** when you need an unbiased estimate of the tuning procedure
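Practice 3 ("start broad, then narrow") can be sketched as a two-stage search. The example below is a minimal illustration, assuming a single hyperparameter (`min_samples_leaf`), small forests, and 3-fold CV to keep it fast; the names `broad`, `narrow`, `center`, and `narrow_values` are hypothetical.

```python
from scipy.stats import randint
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X_demo, y_demo = load_breast_cancer(return_X_y=True)
base = RandomForestClassifier(n_estimators=25, random_state=0)

# Stage 1: broad random search over a wide range.
broad = RandomizedSearchCV(
    base, {'min_samples_leaf': randint(1, 30)},
    n_iter=6, cv=3, scoring='accuracy', random_state=0,
)
broad.fit(X_demo, y_demo)
center = int(broad.best_params_['min_samples_leaf'])

# Stage 2: small exhaustive grid around the best value from stage 1.
narrow_values = sorted({max(1, center - 1), center, center + 1})
narrow = GridSearchCV(base, {'min_samples_leaf': narrow_values},
                      cv=3, scoring='accuracy')
narrow.fit(X_demo, y_demo)

print(f'broad best:  {broad.best_params_}  CV accuracy {broad.best_score_:.4f}')
print(f'narrow best: {narrow.best_params_}  CV accuracy {narrow.best_score_:.4f}')
```

Both stages reuse the same data for selection, so the final `best_score_` is still a tuning score; the held-out test set remains the place to report performance.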

In [None]:
# Check for overfitting in GridSearchCV results
results = pd.DataFrame(grid_search.cv_results_)
results_sorted = results.sort_values('rank_test_score').head(15)

fig, ax = plt.subplots(figsize=(10, 5))
x = range(len(results_sorted))
ax.plot(x, results_sorted['mean_train_score'], 'o-', label='Train CV', color='#2196F3')
ax.plot(x, results_sorted['mean_test_score'], 's-', label='Validation CV', color='#FF5722')
ax.fill_between(x,
                results_sorted['mean_test_score'] - results_sorted['std_test_score'],
                results_sorted['mean_test_score'] + results_sorted['std_test_score'],
                alpha=0.2, color='#FF5722')
ax.set_xlabel('Rank (sorted by validation score)')
ax.set_ylabel('Accuracy')
ax.set_title('Train vs Validation CV Accuracy (Top 15 Configurations)')
ax.legend()
plt.tight_layout()
plt.show()

<a id='8'></a>
## 8. Common Mistakes

| Mistake | Why It Is Wrong | Fix |
|---------|----------------|-----|
| **Tuning on the test set** | Test set leaks into model selection, giving an optimistic estimate | Use CV on training data only; evaluate on test set **once** at the end |
| **Not using cross-validation** | A single train/val split is noisy and unreliable | Always use `cv >= 3` in `GridSearchCV` / `RandomizedSearchCV` |
| **Grid too coarse** | Misses good hyperparameter regions | Start with random search to find promising areas, then refine |
| **Grid too fine** | Wastes compute on negligible differences | Focus on the hyperparameters that affect the score most (e.g., check how `mean_test_score` varies with each one in `cv_results_`) |
| **Ignoring `random_state`** | Results are not reproducible | Set `random_state` in the estimator and in the search object |
| **Not reporting test set results** | CV score alone can still be optimistic if you tried many configs | Always hold out a final test set and report that number |

<a id='9'></a>
## 9. Exercise

**Task**: Tune a `GradientBoostingClassifier` on the breast cancer dataset.

1. Define a parameter grid with at least:
   - `n_estimators`: [50, 100, 200]
   - `learning_rate`: [0.01, 0.1, 0.2]
   - `max_depth`: [3, 5, 7]
2. Run `RandomizedSearchCV` with `n_iter=20` and `cv=5`
3. Print the best parameters and best CV score
4. Evaluate on the held-out test set
5. Compare with the default `GradientBoostingClassifier`

In [None]:
# YOUR CODE HERE
from sklearn.ensemble import GradientBoostingClassifier

# Step 1: Define parameter grid
# param_dist_gb = { ... }

# Step 2: Run RandomizedSearchCV
# rand_gb = RandomizedSearchCV(...)
# rand_gb.fit(X_train, y_train)

# Step 3: Print best params and CV score
# print(rand_gb.best_params_)
# print(rand_gb.best_score_)

# Step 4: Evaluate on test set
# print(rand_gb.score(X_test, y_test))

# Step 5: Compare with default
# gb_default = GradientBoostingClassifier(random_state=42)
# gb_default.fit(X_train, y_train)
# print(gb_default.score(X_test, y_test))

---
**End of Notebook 4** | Next: [05 — Feature Selection: Filter, Wrapper, and Embedded](05_Feature_Selection_Filter_Wrapper_Embedded.ipynb)