# Day 2 (Part 2): Model Selection & Tuning

**WISE Workshop | Addis Ababa, Feb 2026**

In this notebook, you'll learn how to systematically improve your ML models through:
- Cross-validation for robust evaluation
- Hyperparameter tuning (grid search, random search)
- Feature importance analysis
- Final model selection

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sysylvia/ethiopia-ds-workshop-2026/blob/main/notebooks/03-model-tuning.ipynb)

## Setup

In [None]:
# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import (
    train_test_split, 
    cross_val_score,
    GridSearchCV,
    RandomizedSearchCV,
    KFold
)
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge, Lasso
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.inspection import permutation_importance

import warnings
warnings.filterwarnings('ignore')

# For reproducibility
np.random.seed(42)

print("Packages loaded!")

## Part 1: Load Data

We'll use the same supply chain dataset from the previous notebook.

In [None]:
# Load the supply chain dataset
url = "https://raw.githubusercontent.com/sysylvia/ethiopia-ds-workshop-2026/main/data/supply-chain-sample.csv"
df = pd.read_csv(url)

print(f"Data shape: {df.shape}")
print(f"\nColumns: {list(df.columns)}")
df.head()

In [None]:
# Prepare features (same as notebook 02)
# Encode categorical variables
le_region = LabelEncoder()
le_facility = LabelEncoder()
le_season = LabelEncoder()

df['region_encoded'] = le_region.fit_transform(df['region'])
df['facility_encoded'] = le_facility.fit_transform(df['facility_type'])
df['season_encoded'] = le_season.fit_transform(df['season'])

# Define features
feature_cols = [
    'population_served', 'month', 'previous_demand', 
    'distance_to_warehouse', 'stockout_last_month', 
    'avg_delivery_days', 'storage_capacity',
    'region_encoded', 'facility_encoded', 'season_encoded'
]

X = df[feature_cols]
y = df['actual_demand']

print(f"Features: {len(feature_cols)}")
print(f"Samples: {len(X)}")
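
To see what `LabelEncoder` does to a column, here is a minimal sketch on made-up region names (the values are illustrative assumptions, not taken from the workshop dataset). Each label becomes the index of its position in the sorted list of unique labels.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical region names, for illustration only
regions = ["Oromia", "Amhara", "Tigray", "Amhara"]

encoder = LabelEncoder()
codes = encoder.fit_transform(regions)

# classes_ holds the sorted unique labels; each code is an index into it
print(encoder.classes_.tolist())  # ['Amhara', 'Oromia', 'Tigray']
print(codes.tolist())             # [1, 0, 2, 0]

# inverse_transform maps codes back to the original labels
print(encoder.inverse_transform([2, 0]).tolist())  # ['Tigray', 'Amhara']

# A label that was not seen during fit raises ValueError
try:
    encoder.transform(["Somali"])
except ValueError as err:
    unseen_error = str(err)
    print("Unseen label:", unseen_error)
```

The integer codes imply an ordering that has no real meaning. Tree-based models such as Random Forest usually tolerate this, while linear models such as `Ridge` would treat the codes as numeric quantities, so one-hot encoding is often preferred there.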

In [None]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")

## Part 2: Why Tune Models?

### Hyperparameters vs Learned Parameters

| Type | What it is | Examples | How set? |
|------|-----------|----------|----------|
| **Learned parameters** | Values learned from data | Regression coefficients, tree splits | Training algorithm |
| **Hyperparameters** | Choices that control learning | Number of trees, learning rate, max depth | You choose! |

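**Key insight:** Poorly chosen hyperparameters can lead to underfitting (the model is too simple to capture the pattern) or overfitting (the model is flexible enough to memorize noise in the training data).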
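
The distinction can be made concrete with a small sketch. It assumes synthetic data (a noisy sine curve) and a single `DecisionTreeRegressor` rather than the workshop dataset; `max_depth` is the hyperparameter we pick, and the split rules stored in `tree_` are what training learns.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): noisy sine curve
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 6, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.3, size=200)

X_fit, y_fit = X_demo[:150], y_demo[:150]
X_val, y_val = X_demo[150:], y_demo[150:]

depth_results_demo = {}
for depth in [1, 4, None]:
    # max_depth is a hyperparameter: we choose it before training
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_fit, y_fit)
    # The split rules (tree.tree_) are learned parameters: set by fit()
    train_r2 = tree.score(X_fit, y_fit)
    val_r2 = tree.score(X_val, y_val)
    depth_results_demo[depth] = (tree.tree_.node_count, train_r2, val_r2)
    print(f"max_depth={depth}: nodes={tree.tree_.node_count}, "
          f"train R2={train_r2:.2f}, validation R2={val_r2:.2f}")
```

On data like this, the depth-1 tree tends to underfit (low scores on both sets), while the unlimited tree fits the training data almost perfectly but usually scores lower on validation data. Exact numbers depend on the random data.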

## Part 3: Cross-Validation Deep Dive

Cross-validation gives us a more robust estimate of model performance than a single train/test split.

### K-Fold Cross-Validation

```
Fold 1: [TEST] [train] [train] [train] [train]
Fold 2: [train] [TEST] [train] [train] [train]
Fold 3: [train] [train] [TEST] [train] [train]
Fold 4: [train] [train] [train] [TEST] [train]
Fold 5: [train] [train] [train] [train] [TEST]

Final score = average of all 5 folds
```
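
The rotation in the diagram can be written out by hand. The sketch below uses synthetic data and a plain `LinearRegression` as stand-ins (assumptions, not the workshop dataset or model) and loops over the folds produced by `KFold`, which is roughly what `cross_val_score` does internally.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic regression data (illustrative assumption)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.5, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_rmse = []
times_tested = np.zeros(len(X_demo), dtype=int)

for fold, (train_idx, test_idx) in enumerate(kf.split(X_demo), start=1):
    # Fit on four folds, score on the remaining one
    model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    pred = model.predict(X_demo[test_idx])
    rmse = np.sqrt(mean_squared_error(y_demo[test_idx], pred))
    fold_rmse.append(rmse)
    times_tested[test_idx] += 1
    print(f"Fold {fold}: train={len(train_idx)}, test={len(test_idx)}, RMSE={rmse:.2f}")

print(f"Mean RMSE: {np.mean(fold_rmse):.2f}")
print("Every sample tested exactly once:", bool((times_tested == 1).all()))
```

Passing `cv=5` to `cross_val_score` for a regressor uses `KFold` without shuffling. If the rows are ordered (for example by region or month), passing a `KFold(shuffle=True, random_state=...)` object as `cv` is one way to avoid folds that each contain only one kind of row.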

In [None]:
# Baseline: Random Forest with default settings
rf_default = RandomForestRegressor(random_state=42, n_jobs=-1)

# 5-fold cross-validation
cv_scores = cross_val_score(
    rf_default, X_train, y_train, 
    cv=5, 
    scoring='neg_mean_squared_error'
)

# Convert to RMSE (scores are negative MSE)
cv_rmse = np.sqrt(-cv_scores)

print("5-Fold Cross-Validation Results:")
print(f"  RMSE per fold: {[f'{x:.2f}' for x in cv_rmse]}")
print(f"  Mean RMSE: {cv_rmse.mean():.2f} (+/- {cv_rmse.std()*2:.2f}, two std across folds)")

In [None]:
# Visualize cross-validation results
fig, ax = plt.subplots(figsize=(8, 4))

folds = range(1, 6)
ax.bar(folds, cv_rmse, color='steelblue', alpha=0.7)
ax.axhline(y=cv_rmse.mean(), color='red', linestyle='--', label=f'Mean RMSE: {cv_rmse.mean():.2f}')
ax.fill_between([0.5, 5.5], 
                cv_rmse.mean() - cv_rmse.std(), 
                cv_rmse.mean() + cv_rmse.std(), 
                color='red', alpha=0.1, label='+/- 1 std')

ax.set_xlabel('Fold')
ax.set_ylabel('RMSE')
ax.set_title('Cross-Validation Performance Across Folds')
ax.legend()
plt.tight_layout()
plt.show()

## Part 4: Grid Search for Hyperparameter Tuning

**Grid Search** tries every combination of hyperparameters you specify.

For Random Forest, key hyperparameters include:
- `n_estimators`: Number of trees (more trees give more stable predictions with diminishing returns, at higher compute cost)
- `max_depth`: Maximum tree depth (shallower trees are less prone to overfitting)
- `min_samples_leaf`: Minimum samples in leaf nodes (prevents overly specific rules)

In [None]:
# Define hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, None],
    'min_samples_leaf': [1, 2, 5]
}

# Total combinations
total = len(param_grid['n_estimators']) * len(param_grid['max_depth']) * len(param_grid['min_samples_leaf'])
print(f"Total combinations to try: {total}")
print(f"With 5-fold CV: {total * 5} model fits")
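
The same count can be obtained by enumerating the grid. A small sketch using scikit-learn's `ParameterGrid`, with the grid copied under the hypothetical name `param_grid_demo` so the block runs on its own:

```python
from sklearn.model_selection import ParameterGrid

# Same values as the grid above, repeated so this block is self-contained
param_grid_demo = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, None],
    'min_samples_leaf': [1, 2, 5]
}

grid = ParameterGrid(param_grid_demo)
print(len(grid))  # 36

# Each element is one full hyperparameter setting that grid search will fit
for params in list(grid)[:3]:
    print(params)
```

Because the count is a product, adding one more hyperparameter with four values would multiply the number of fits by four. This is the main reason grid search becomes expensive quickly.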

In [None]:
# Run grid search
rf = RandomForestRegressor(random_state=42, n_jobs=-1)

grid_search = GridSearchCV(
    rf, 
    param_grid, 
    cv=5, 
    scoring='neg_mean_squared_error',
    verbose=1,
    return_train_score=True
)

print("Running grid search (this may take a minute)...")
grid_search.fit(X_train, y_train)

print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best CV RMSE: {np.sqrt(-grid_search.best_score_):.2f}")
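
One detail when comparing this number with Part 3: there we averaged the per-fold RMSE values, whereas `np.sqrt(-grid_search.best_score_)` takes the square root of the mean fold MSE. The two are usually close but not identical. A sketch with hypothetical fold MSE values chosen for easy arithmetic:

```python
import numpy as np

# Hypothetical per-fold MSE values, chosen for easy arithmetic
fold_mse = np.array([100.0, 144.0, 121.0, 169.0, 196.0])

mean_of_fold_rmse = np.sqrt(fold_mse).mean()   # Part 3 style: average the fold RMSEs
rmse_of_mean_mse = np.sqrt(fold_mse.mean())    # grid search style: sqrt of the mean MSE

print(f"Mean of per-fold RMSE: {mean_of_fold_rmse:.2f}")  # 12.00
print(f"RMSE of mean MSE:      {rmse_of_mean_mse:.2f}")   # 12.08
```

The square root of the mean is never smaller than the mean of the square roots, so a small gap between the two numbers does not by itself mean that tuning made the model worse.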

In [None]:
# View top 5 configurations
results = pd.DataFrame(grid_search.cv_results_)
results['rmse'] = np.sqrt(-results['mean_test_score'])

top5 = results.nsmallest(5, 'rmse')[[
    'param_n_estimators', 'param_max_depth', 'param_min_samples_leaf', 
    'rmse', 'rank_test_score'
]]

print("Top 5 Configurations:")
display(top5)
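
Because the search was run with `return_train_score=True`, `cv_results_` also contains training scores, and the gap between training and validation error is a simple overfitting check. The sketch below shows the idea on synthetic data from `make_regression` with a deliberately tiny grid (both are assumptions to keep it fast, not the workshop setup).

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data and a tiny grid, for illustration only
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=20.0, random_state=0)

search = GridSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=0),
    {'max_depth': [2, None]},
    cv=3,
    scoring='neg_mean_squared_error',
    return_train_score=True,
)
search.fit(X_demo, y_demo)

res = pd.DataFrame(search.cv_results_)
res['train_rmse'] = np.sqrt(-res['mean_train_score'])
res['cv_rmse'] = np.sqrt(-res['mean_test_score'])
# A large gap means the model does much better on data it has already seen
res['gap'] = res['cv_rmse'] - res['train_rmse']
print(res[['param_max_depth', 'train_rmse', 'cv_rmse', 'gap']])
```

A configuration with the lowest validation RMSE but a very large gap may be worth comparing against a slightly simpler one with similar validation error.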

## Part 5: Visualize Hyperparameter Effects

In [None]:
# Effect of max_depth on performance
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Plot 1: max_depth effect
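# Label max_depth=None explicitly before grouping: groupby drops missing keys by default,
# which would silently remove the unlimited-depth configurations from the plot
depth_label = results['param_max_depth'].map(lambda d: 'None' if pd.isna(d) else str(int(d)))
depth_results = results.groupby(depth_label, sort=False)['rmse'].mean().reset_index()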

axes[0].bar(depth_results['param_max_depth'].astype(str), depth_results['rmse'], color='steelblue')
axes[0].set_xlabel('max_depth')
axes[0].set_ylabel('Mean CV RMSE')
axes[0].set_title('Effect of max_depth')

# Plot 2: n_estimators effect
est_results = results.groupby('param_n_estimators')['rmse'].mean().reset_index()

axes[1].bar(est_results['param_n_estimators'].astype(str), est_results['rmse'], color='coral')
axes[1].set_xlabel('n_estimators')
axes[1].set_ylabel('Mean CV RMSE')
axes[1].set_title('Effect of n_estimators')

plt.tight_layout()
plt.show()

## Part 6: Random Search (Faster Alternative)

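When the hyperparameter space is large, **Random Search** evaluates a fixed number of randomly sampled combinations (`n_iter`) instead of trying every one, so the cost is set by your budget rather than by the size of the space.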

In [None]:
# Define parameter distributions for random search
from scipy.stats import randint, uniform

param_distributions = {
    'n_estimators': randint(50, 300),
    'max_depth': randint(3, 20),
    'min_samples_leaf': randint(1, 10),
    'min_samples_split': randint(2, 20)
}

# Run random search with 20 iterations
random_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42, n_jobs=-1),
    param_distributions,
    n_iter=20,
    cv=5,
    scoring='neg_mean_squared_error',
    random_state=42,
    verbose=1
)

print("Running random search...")
random_search.fit(X_train, y_train)

print(f"\nBest parameters: {random_search.best_params_}")
print(f"Best CV RMSE: {np.sqrt(-random_search.best_score_):.2f}")
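
It helps to know exactly what the distributions produce. The sketch below samples from `randint` and `uniform` directly and uses scikit-learn's `ParameterSampler` to preview the kind of settings a random search draws. The `max_features` entry is a hypothetical addition for illustration and is not part of the search above.

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import ParameterSampler

# randint(low, high) draws integers from low up to high - 1 (upper bound excluded)
n_estimators_samples = randint(50, 300).rvs(size=1000, random_state=42)
print(bool(n_estimators_samples.min() >= 50), bool(n_estimators_samples.max() <= 299))  # True True

# uniform(loc, scale) draws floats from [loc, loc + scale], here [0.5, 1.0]
fraction_samples = uniform(0.5, 0.5).rvs(size=1000, random_state=42)
print(bool(fraction_samples.min() >= 0.5), bool(fraction_samples.max() <= 1.0))  # True True

# Preview three sampled settings (max_features is an illustrative extra)
demo_distributions = {
    'n_estimators': randint(50, 300),
    'max_depth': randint(3, 20),
    'max_features': uniform(0.5, 0.5),
}
sampled = list(ParameterSampler(demo_distributions, n_iter=3, random_state=42))
for params in sampled:
    print(params)
```

Note that `randint(3, 20)` can never return 20, and that unlike the grid in Part 4 this search space does not include `max_depth=None`.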

## Part 7: Feature Importance Analysis

Understanding which features matter helps:
- Interpret the model
- Identify important supply chain factors
- Guide future data collection

In [None]:
# Get best model from grid search
best_model = grid_search.best_estimator_

# Built-in feature importance (Mean Decrease in Impurity)
importance_mdi = pd.DataFrame({
    'feature': feature_cols,
    'importance': best_model.feature_importances_
}).sort_values('importance', ascending=False)

print("Feature Importance (MDI - Built-in):")
display(importance_mdi)

In [None]:
# Permutation importance on held-out data (less biased toward high-cardinality features than MDI)
perm_importance = permutation_importance(
    best_model, X_test, y_test, 
    n_repeats=10, 
    random_state=42,
    n_jobs=-1
)

importance_perm = pd.DataFrame({
    'feature': feature_cols,
    'importance': perm_importance.importances_mean,
    'std': perm_importance.importances_std
}).sort_values('importance', ascending=False)

print("Feature Importance (Permutation):")
display(importance_perm)
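
Permutation importance is simple enough to write by hand: shuffle one column of the held-out data, re-score the model, and record how much the score drops. The sketch below does this on synthetic data with one informative feature and one pure-noise feature, using `LinearRegression` as a stand-in model (all assumptions for illustration).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n = 500
signal = rng.normal(size=n)   # feature that drives the target
noise = rng.normal(size=n)    # feature unrelated to the target
X_demo = np.column_stack([signal, noise])
y_demo = 3.0 * signal + rng.normal(0, 0.5, size=n)

X_fit, y_fit = X_demo[:400], y_demo[:400]
X_hold, y_hold = X_demo[400:], y_demo[400:]

model = LinearRegression().fit(X_fit, y_fit)
baseline = r2_score(y_hold, model.predict(X_hold))

drops = {}
for col, name in enumerate(['signal', 'noise']):
    scores = []
    for _ in range(10):  # repeat the shuffle, like n_repeats=10
        X_shuffled = X_hold.copy()
        X_shuffled[:, col] = rng.permutation(X_shuffled[:, col])
        scores.append(r2_score(y_hold, model.predict(X_shuffled)))
    # Importance = how much the score falls when this feature is scrambled
    drops[name] = baseline - np.mean(scores)
    print(f"{name}: mean drop in R2 = {drops[name]:.3f}")
```

Scrambling the informative feature should destroy most of the model's accuracy, while scrambling the noise feature should change the score very little. The model is never retrained, which is why this works with any fitted estimator.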

In [None]:
# Visualize both importance measures
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# MDI importance
sns.barplot(data=importance_mdi, x='importance', y='feature', ax=axes[0], color='steelblue')
axes[0].set_title('Built-in Feature Importance (MDI)')
axes[0].set_xlabel('Importance')

# Permutation importance with error bars
axes[1].barh(importance_perm['feature'], importance_perm['importance'], 
             xerr=importance_perm['std'], color='coral', alpha=0.7)
axes[1].set_title('Permutation Feature Importance')
axes[1].set_xlabel('Mean Importance')
axes[1].invert_yaxis()

plt.tight_layout()
plt.show()

## Part 8: Final Model Evaluation

Now we evaluate the tuned model on the **held-out test set**, which was not used to fit the model or to select hyperparameters (grid search used only cross-validation folds of the training set).

In [None]:
# Compare default vs tuned model on test set
rf_default = RandomForestRegressor(random_state=42, n_jobs=-1)
rf_default.fit(X_train, y_train)
default_pred = rf_default.predict(X_test)

tuned_pred = best_model.predict(X_test)

# Metrics comparison
comparison = pd.DataFrame({
    'Model': ['Default RF', 'Tuned RF'],
    'RMSE': [
        np.sqrt(mean_squared_error(y_test, default_pred)),
        np.sqrt(mean_squared_error(y_test, tuned_pred))
    ],
    'MAE': [
        mean_absolute_error(y_test, default_pred),
        mean_absolute_error(y_test, tuned_pred)
    ],
    'R-squared': [
        r2_score(y_test, default_pred),
        r2_score(y_test, tuned_pred)
    ]
})

print("Test Set Performance:")
display(comparison)
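
For reference, the three metrics in the table can be computed by hand. A small worked sketch on four made-up values (illustrative numbers, not workshop data):

```python
import numpy as np

# Made-up actual and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

errors = y_true - y_pred                      # [1, 0, -1, 0]

rmse = np.sqrt(np.mean(errors ** 2))          # penalizes large errors more heavily
mae = np.mean(np.abs(errors))                 # average size of an error

ss_res = np.sum(errors ** 2)                  # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2 = 1 - ss_res / ss_tot                      # share of variation explained

print(f"RMSE={rmse:.3f}, MAE={mae:.3f}, R2={r2:.3f}")  # RMSE=0.707, MAE=0.500, R2=0.900
```

RMSE and MAE are in the same units as the target (here, units of demand), while R-squared is unitless. An RMSE well above the MAE suggests that a few large errors dominate.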

In [None]:
# Visualize predictions
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

for ax, pred, name in zip(axes, [default_pred, tuned_pred], ['Default RF', 'Tuned RF']):
    ax.scatter(y_test, pred, alpha=0.5)
    ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
    ax.set_xlabel('Actual Demand')
    ax.set_ylabel('Predicted Demand')
    ax.set_title(f'{name}\nRMSE: {np.sqrt(mean_squared_error(y_test, pred)):.2f}')

plt.tight_layout()
plt.show()

## Part 9: Save Model for Day 3

We'll save the trained model so we can use it in the ML-to-ABM integration notebook.

In [None]:
import joblib

# Save the tuned model
model_info = {
    'model': best_model,
    'feature_cols': feature_cols,
    'encoders': {
        'region': le_region,
        'facility_type': le_facility,
        'season': le_season
    },
    'best_params': grid_search.best_params_,
    'test_rmse': np.sqrt(mean_squared_error(y_test, tuned_pred))
}

# In Colab this writes to temporary session storage; download the file to keep it after the session ends
joblib.dump(model_info, 'demand_model.joblib')
print("Model saved as 'demand_model.joblib'")
print(f"\nModel summary:")
print(f"  Best parameters: {model_info['best_params']}")
print(f"  Test RMSE: {model_info['test_rmse']:.2f}")
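
Saving the encoders with the model matters because new data must be encoded in exactly the same way before prediction. The round trip might look like the sketch below, which uses a toy bundle with two features and a temporary file (all names and values are illustrative assumptions, not the Day 3 code).

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the saved bundle (all names and values are illustrative)
seasons = np.array(['dry', 'rainy', 'dry', 'rainy', 'dry', 'rainy'])
encoder = LabelEncoder().fit(seasons)
X_toy = np.column_stack([encoder.transform(seasons), [10, 12, 11, 15, 9, 14]])
y_toy = np.array([100, 150, 105, 160, 95, 155])
model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_toy, y_toy)

path = os.path.join(tempfile.mkdtemp(), 'toy_model.joblib')
joblib.dump({
    'model': model,
    'feature_cols': ['season_encoded', 'previous_demand'],
    'encoders': {'season': encoder},
}, path)

# Later (for example in another notebook): load and reuse the same encoder
bundle = joblib.load(path)
new_season = bundle['encoders']['season'].transform(['rainy'])[0]
new_row = np.array([[new_season, 13]])  # column order must match feature_cols
print(bundle['model'].predict(new_row))
```

Files written by `joblib` are best loaded with the same scikit-learn version that created them, so it is worth noting the version alongside the saved file.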

## Summary

In this notebook, you learned:

1. **Cross-validation** provides robust performance estimates
2. **Grid search** systematically explores hyperparameter combinations
3. **Random search** explores large search spaces with a fixed budget of model fits
4. **Feature importance** helps interpret model decisions
5. **Final evaluation** on held-out test data gives an honest estimate of how well the model generalizes

### Key Takeaways for Supply Chain Forecasting

- Previous demand is typically the strongest predictor
- Facility characteristics (type, capacity) capture structural differences
- Seasonal patterns (rainy/dry) affect demand
- Distance and delivery time influence supply chain dynamics

**Next:** Use these predictions as inputs to Agent-Based Models (Day 3)

---

## Exercise (Optional)

Try tuning a Gradient Boosting model using the same process:

1. Define a parameter grid for `GradientBoostingRegressor` (try `n_estimators`, `learning_rate`, `max_depth`)
2. Run grid search with 5-fold CV
3. Compare the best GB model with the tuned RF model

Which performs better on this data?

In [None]:
# Your code here
# Hint: GradientBoostingRegressor key parameters:
# - n_estimators: [50, 100, 200]
# - learning_rate: [0.01, 0.1, 0.2]
# - max_depth: [3, 5, 7]
