# Day 05 — Hyperparameter Tuning: Theory + Hands-On Walkthrough

Welcome to Day 05 of the ML Track.

In Day 02 and Day 03, we trained strong baseline models and learned how to evaluate them.
Today, we answer an important question:

> **How do we systematically improve a model without manually guessing settings?**

The answer is **hyperparameter tuning**.

---
## Learning goals

By the end of this notebook, you will be able to:

1. Explain the difference between **model parameters** and **hyperparameters**.
2. Use **cross-validation** correctly for model selection.
3. Run both **GridSearchCV** and **RandomizedSearchCV**.
4. Compare tuned models against a baseline using multiple metrics.
5. Interpret search results and choose a practical final model.


## Theory: Parameters vs Hyperparameters

- **Parameters** are learned from data during training.
  - Example (logistic regression): feature weights / coefficients.
- **Hyperparameters** are set *before* training.
  - Example: regularization strength `C`, solver choice, iteration limit.

Hyperparameters control model flexibility, optimization behavior, and generalization.
If we choose them poorly, we can underfit, overfit, or train inefficiently.
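
To make the distinction concrete, here is a minimal sketch (not part of the original walkthrough) that inspects a logistic regression before and after fitting. The value `C=0.5` is an arbitrary choice for illustration, and scaling the full dataset is acceptable here only because nothing is being evaluated.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # illustration only, no evaluation

# Hyperparameters: chosen by us and readable before any training happens.
clf = LogisticRegression(C=0.5, max_iter=2000)
hyperparams = clf.get_params()
print("C before fit:", hyperparams["C"])

# Parameters: learned from data, so they only exist after fit().
has_coef_before_fit = hasattr(clf, "coef_")
clf.fit(X_scaled, y)
print("coef_ exists before fit:", has_coef_before_fit)
print("coef_ shape after fit:", clf.coef_.shape)
```

The fitted `coef_` array holds one learned weight per feature, while `get_params()` returns only the settings passed to the constructor.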


## Theory: Why We Need Validation and Cross-Validation

If we tune directly on the test set, we leak information and get over-optimistic results.

Correct workflow:

1. Split data into **train** and **test**.
2. Use only the **training split** for tuning.
3. Within training data, use **k-fold cross-validation**:
   - Train on k-1 folds
   - Validate on the remaining fold
   - Repeat for all folds and average the score
4. Evaluate the selected model **once** on the untouched test set.

This keeps the test-set score an honest estimate of real-world performance, because no tuning decision was ever made using that data.
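
The workflow above can be sketched as an explicit loop. This is a rough illustration of what k-fold cross-validation does under the hood, assuming `k = 5` and F1 as the validation score; the variable names (`fit_idx`, `val_idx`, `fold_scores`) are chosen here for readability.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Step 1: hold out a test set that tuning never touches.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

# Steps 2-3: k-fold CV inside the training split only.
k = 5
folds = StratifiedKFold(n_splits=k, shuffle=True, random_state=42)
fold_scores = []
for fit_idx, val_idx in folds.split(X_train, y_train):
    # A fresh pipeline per fold, so the scaler is fitted on the k-1 folds only.
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
    model.fit(X_train[fit_idx], y_train[fit_idx])
    y_val_pred = model.predict(X_train[val_idx])
    fold_scores.append(f1_score(y_train[val_idx], y_val_pred))

mean_cv_f1 = float(np.mean(fold_scores))
print("Per-fold F1:", np.round(fold_scores, 4))
print("Mean CV F1:", round(mean_cv_f1, 4))
```

`X_test` and `y_test` are deliberately unused in the loop; they come into play only in step 4.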


## Setup and imports


In [None]:
import numpy as np
import pandas as pd

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
)
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    StratifiedKFold,
    train_test_split,
)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


## Load dataset (Breast Cancer Wisconsin)

We keep the same dataset used earlier in the track for apples-to-apples comparison.

- **Target classes**:
  - `0` = malignant
  - `1` = benign
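- This is a binary classification problem with numeric features.
- By default, scikit-learn's `precision_score`, `recall_score`, and `f1_score` treat label `1` (benign) as the positive class, so the metrics in this notebook are reported for the benign class.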


In [None]:
# Load data
cancer = load_breast_cancer(as_frame=True)
X = cancer.data
y = cancer.target

print(f"Feature matrix shape: {X.shape}")
print(f"Target shape: {y.shape}")
print("\nClass distribution:")
print(y.value_counts(normalize=True).rename("ratio"))


## Train/test split

We reserve 20% for final testing and use `stratify=y` to preserve class balance.


In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.20,
    random_state=42,
    stratify=y,
)

print(f"Train shape: {X_train.shape}, Test shape: {X_test.shape}")


## Build a baseline pipeline

Why a pipeline?

- Prevents data leakage by fitting preprocessing only on training folds.
- Keeps steps reusable and clean (`scaler` -> `model`).
- Makes tuning easier with namespaced hyperparameters like `model__C`.
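
As a small sketch of the namespacing mentioned above, the snippet below lists a pipeline's parameter names and changes one of them. The step names `scaler` and `model` match the ones used in this notebook; the value `0.1` is arbitrary.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("model", LogisticRegression(max_iter=2000)),
    ]
)

# Each step's hyperparameters are exposed as "<step name>__<parameter name>".
model_param_names = sorted(k for k in pipe.get_params() if k.startswith("model__"))
print(model_param_names[:5])

# set_params accepts the same names that appear as keys in a search grid.
pipe.set_params(model__C=0.1)
print(pipe.named_steps["model"].C)
```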


In [None]:
baseline_pipeline = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("model", LogisticRegression(max_iter=2000, random_state=42)),
    ]
)

baseline_pipeline.fit(X_train, y_train)
y_pred_baseline = baseline_pipeline.predict(X_test)
y_proba_baseline = baseline_pipeline.predict_proba(X_test)[:, 1]

baseline_metrics = {
    "accuracy": accuracy_score(y_test, y_pred_baseline),
    "precision": precision_score(y_test, y_pred_baseline),
    "recall": recall_score(y_test, y_pred_baseline),
    "f1": f1_score(y_test, y_pred_baseline),
    "roc_auc": roc_auc_score(y_test, y_proba_baseline),
}

pd.Series(baseline_metrics).round(4)


## Theory: Grid Search vs Randomized Search

### Grid Search (`GridSearchCV`)
- Exhaustively evaluates every combination in a predefined grid.
- Good when the search space is small and we want deterministic coverage.
- Cost grows multiplicatively with every added hyperparameter or candidate value.

### Randomized Search (`RandomizedSearchCV`)
- Samples a fixed number of combinations (`n_iter`).
- Often finds near-optimal settings much faster in larger spaces.
- Better compute/performance tradeoff in many real projects.
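
To get a feel for the cost difference, the sketch below counts candidates without training anything. It uses `ParameterGrid` and `ParameterSampler` from `sklearn.model_selection`; the search spaces mirror the ones used later in this notebook, and `n_splits = 5` is assumed for the fit count.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Grid search: every combination is evaluated.
grid = ParameterGrid(
    {
        "model__C": [0.001, 0.01, 0.1, 1.0, 10.0, 100.0],
        "model__solver": ["liblinear", "lbfgs"],
    }
)
n_grid = len(grid)  # 6 values of C times 2 solvers

# Randomized search: a fixed budget of draws from a much larger space.
space = {
    "model__C": np.logspace(-4, 3, 200),
    "model__solver": ["liblinear", "lbfgs"],
}
n_space = len(ParameterGrid(space))
samples = list(ParameterSampler(space, n_iter=24, random_state=42))

n_splits = 5  # assumed number of CV folds
print(f"Grid:   {n_grid} candidates, {n_grid * n_splits} CV fits")
print(f"Random: {len(samples)} of {n_space} candidates, {len(samples) * n_splits} CV fits")
```

Exhaustively covering the larger space would take 400 candidates; the randomized budget stays at 24 no matter how finely `C` is discretized.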


## Define cross-validation strategy

We use **StratifiedKFold** so each fold keeps class proportions similar.


In [None]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
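
A quick way to check the stratification claim is to compare the share of class `1` in each validation fold with the overall share. This standalone sketch uses the full dataset rather than the training split, purely for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

overall_ratio = y.mean()  # share of class 1 (benign) in the data
fold_ratios = [y[val_idx].mean() for _, val_idx in cv.split(X, y)]

print(f"Overall benign share: {overall_ratio:.3f}")
for i, ratio in enumerate(fold_ratios, start=1):
    print(f"Fold {i} benign share: {ratio:.3f}")
```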


## Grid Search (exhaustive tuning)

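We search over the logistic regression hyperparameters that usually matter most:

- `C`: inverse regularization strength (smaller values mean stronger regularization)
- `penalty`: regularization type (held at `l2` here, which both solvers support)
- `solver`: optimization algorithm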


In [None]:
grid_pipeline = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("model", LogisticRegression(max_iter=4000, random_state=42)),
    ]
)

param_grid = {
    "model__C": [0.001, 0.01, 0.1, 1.0, 10.0, 100.0],
    "model__solver": ["liblinear", "lbfgs"],
    "model__penalty": ["l2"],
}

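grid_search = GridSearchCV(
    estimator=grid_pipeline,
    param_grid=param_grid,
    scoring="f1",   # metric used to rank candidates
    cv=cv,          # stratified 5-fold splitter defined above
    n_jobs=-1,      # use all available CPU cores
    refit=True,     # retrain the best candidate on the full training split
)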

grid_search.fit(X_train, y_train)

print("Best GridSearch params:", grid_search.best_params_)
print("Best mean CV F1:", round(grid_search.best_score_, 4))


## Inspect top grid-search configurations


In [None]:
grid_results = pd.DataFrame(grid_search.cv_results_)
columns_to_show = [
    "mean_test_score",
    "std_test_score",
    "param_model__C",
    "param_model__solver",
    "rank_test_score",
]

grid_results[columns_to_show].sort_values("rank_test_score").head(8)


## Randomized Search (faster exploration)

Here we sample `C` from a wider and denser range (200 log-spaced values from 1e-4 to 1e3) but evaluate only `n_iter=24` sampled combinations.


In [None]:
random_pipeline = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("model", LogisticRegression(max_iter=4000, random_state=42)),
    ]
)

param_dist = {
    "model__C": np.logspace(-4, 3, 200),
    "model__solver": ["liblinear", "lbfgs"],
    "model__penalty": ["l2"],
}

random_search = RandomizedSearchCV(
    estimator=random_pipeline,
    param_distributions=param_dist,
    n_iter=24,
    scoring="f1",
    cv=cv,
    n_jobs=-1,
    random_state=42,
    refit=True,
)

random_search.fit(X_train, y_train)

print("Best RandomSearch params:", random_search.best_params_)
print("Best mean CV F1:", round(random_search.best_score_, 4))
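
As an alternative to a discretized list, `RandomizedSearchCV` also accepts distribution objects that expose an `rvs` method. The sketch below assumes `scipy.stats.loguniform` (SciPy is not imported elsewhere in this notebook) and only draws samples with `ParameterSampler`, without fitting any model; `param_dist_continuous` is a name introduced here for illustration.

```python
from scipy.stats import loguniform
from sklearn.model_selection import ParameterSampler

# Continuous log-uniform prior over C instead of 200 fixed values.
param_dist_continuous = {
    "model__C": loguniform(1e-4, 1e3),
    "model__solver": ["liblinear", "lbfgs"],
}

draws = list(ParameterSampler(param_dist_continuous, n_iter=5, random_state=42))
for draw in draws:
    print(draw)
```

A dictionary like this could be passed as `param_distributions` in place of `param_dist`; whether it finds a better `C` on this dataset is not something this sketch establishes.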


## Evaluate baseline vs tuned models on test data

We evaluate:
- Baseline model
- Best GridSearch model
- Best RandomSearch model

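Using multiple metrics gives a fuller picture than accuracy alone. In a real project, choose between the tuned candidates using their CV scores and reserve the test set for the single selected model.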


In [None]:
def evaluate_model(name, model, X_test, y_test):
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]
    return {
        "model": name,
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_proba),
    }

comparison = pd.DataFrame([
    evaluate_model("baseline", baseline_pipeline, X_test, y_test),
    evaluate_model("grid_best", grid_search.best_estimator_, X_test, y_test),
    evaluate_model("random_best", random_search.best_estimator_, X_test, y_test),
]).set_index("model")

comparison.round(4).sort_values("f1", ascending=False)


## Theory: How to choose the final model in practice

Do **not** pick based on one metric only.

A practical decision framework:

1. Prioritize the metric aligned with business risk (e.g., recall when missed positives are costly).
2. Compare CV score and test score (watch for instability/overfitting).
3. Prefer simpler/cheaper models when performance is similar.
4. Document why the final model was chosen.
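
One possible way to turn points 1 and 3 into code is sketched below. It tracks two metrics at once, refits on a prioritized one, and then prefers the most regularized candidate among those within one standard deviation of the best. That tie-breaking heuristic, the choice of recall as the priority, and the names `close_enough` and `chosen_C` are assumptions made for this example, not rules from this notebook. With scikit-learn's default `pos_label=1`, recall here refers to the benign class.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

pipe = Pipeline(
    [("scaler", StandardScaler()), ("model", LogisticRegression(max_iter=4000))]
)
search = GridSearchCV(
    pipe,
    param_grid={"model__C": [0.01, 0.1, 1.0, 10.0]},
    scoring={"f1": "f1", "recall": "recall"},  # track several metrics at once
    refit="recall",                            # assumed priority metric
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X_train, y_train)

results = pd.DataFrame(search.cv_results_)[
    ["param_model__C", "mean_test_recall", "std_test_recall", "mean_test_f1"]
]

# Heuristic (assumption): among candidates whose mean recall is within one
# standard deviation of the best, pick the smallest C (strongest regularization).
best = results.loc[results["mean_test_recall"].idxmax()]
threshold = best["mean_test_recall"] - best["std_test_recall"]
close_enough = results[results["mean_test_recall"] >= threshold]
chosen_C = float(close_enough["param_model__C"].astype(float).min())

print(results.round(4))
print("Chosen C under the heuristic:", chosen_C)
```

Whatever rule is used, recording it next to the results table covers point 4.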


## Summary of Day 05

Today we learned that hyperparameter tuning is a disciplined search process, not trial-and-error guessing.

### What we accomplished
- Built a robust baseline pipeline.
- Tuned logistic regression with both grid and randomized search.
- Used stratified 5-fold CV for reliable model selection.
- Compared baseline and tuned models on multiple test metrics.

### Connection to Day 06
In Day 06, we focus on **interpretability**: once a tuned model performs well, we must explain *why* it predicts what it does.
