# Notebook 6: End-to-End ML Project Template

**Module ML600 — Optimization, Regularization, and Model Selection**

A fully runnable walkthrough that covers every step from problem framing to model saving.

## Learning Objectives

By the end of this notebook you will be able to:

- **Frame** an ML problem: define the objective, metric, and baseline
- **Load and explore** data with summary statistics, distributions, and correlations
- **Split** data correctly (train/test, stratification considerations)
- **Build preprocessing pipelines** with `ColumnTransformer`
- **Train and compare** multiple models using cross-validation
- **Analyze errors** with residual plots and worst-prediction inspection
- **Tune hyperparameters** with `GridSearchCV`
- **Evaluate** the final model on a held-out test set
- **Save and reload** a trained pipeline with `joblib`
- Follow a **reproducibility checklist**

## Prerequisites

- Completed Notebooks 01–05 in this module (or equivalent knowledge)
- Familiarity with scikit-learn pipelines, cross-validation, and GridSearchCV
- Python libraries: `numpy`, `pandas`, `matplotlib`, `seaborn`, `sklearn`, `joblib`

## Table of Contents

1. [Problem Framing](#1)
2. [Data Loading and EDA](#2)
3. [Data Splitting](#3)
4. [Baseline Model](#4)
5. [Preprocessing Pipeline](#5)
6. [Model Training](#6)
7. [Evaluation — Cross-Validation](#7)
8. [Error Analysis](#8)
9. [Hyperparameter Tuning](#9)
10. [Final Evaluation on Test Set](#10)
11. [Model Saving](#11)
12. [Reproducibility Checklist](#12)
13. [Common Mistakes](#13)
14. [Exercise](#14)

---
## Setup

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
import warnings
import os

from sklearn.datasets import load_diabetes
from sklearn.model_selection import (
    train_test_split, cross_val_score, cross_val_predict, GridSearchCV
)
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import (
    mean_squared_error, mean_absolute_error, r2_score
)

warnings.filterwarnings('ignore')  # keeps output tidy; comment out when debugging
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)
print('Setup complete.')

<a id='1'></a>
## 1. Problem Framing

Before writing any code, clearly define:

| Question | Answer for This Project |
|----------|------------------------|
| **What are we predicting?** | A quantitative measure of disease progression one year after baseline (regression) |
| **What is the target variable?** | `target` column in the diabetes dataset |
| **What metric will we optimize?** | RMSE (Root Mean Squared Error) — lower is better |
| **What is a reasonable baseline?** | Predict the mean of the training set (`DummyRegressor`) |
| **What data do we have?** | 10 baseline variables: age, sex, BMI, blood pressure, 6 blood serum measurements |
| **How many samples?** | 442 |

> **Goal**: Build a regression model whose RMSE is clearly lower than that of the "predict the mean" baseline.
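
For reference, the three metrics reported throughout this notebook can be written as follows. These are the standard definitions, restated here as a reminder: $y_i$ is the actual value, $\hat{y}_i$ the prediction, $\bar{y}$ the mean of the actual values in the evaluated set, and $n$ the number of samples.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|,
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
$$

Any constant prediction $c$ gives $\sum_i (y_i - c)^2 \ge \sum_i (y_i - \bar{y})^2$, so a mean-of-training baseline scored on other data has $R^2 \le 0$.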

<a id='2'></a>
## 2. Data Loading and EDA

In [None]:
# Load data
diabetes = load_diabetes()
X = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
y = pd.Series(diabetes.target, name='target')

print(f'Dataset shape: {X.shape}')
print(f'Target range:  [{y.min():.0f}, {y.max():.0f}]')
print(f'Target mean:   {y.mean():.1f}')
print()
X.head()

In [None]:
# Basic statistics
X.describe().T

In [None]:
# Check for missing values
print('Missing values per column:')
print(X.isnull().sum().to_string())
print(f'\nTotal missing: {X.isnull().sum().sum()}')

In [None]:
# Distribution of features
fig, axes = plt.subplots(2, 5, figsize=(18, 7))
for i, col in enumerate(X.columns):
    ax = axes[i // 5, i % 5]
    ax.hist(X[col], bins=30, color='steelblue', edgecolor='black', alpha=0.7)
    ax.set_title(col, fontsize=10)
plt.suptitle('Feature Distributions', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

In [None]:
# Distribution of target variable
fig, ax = plt.subplots(figsize=(8, 4))
ax.hist(y, bins=30, color='#FF7043', edgecolor='black', alpha=0.8)
ax.axvline(y.mean(), color='black', linestyle='--', label=f'Mean = {y.mean():.1f}')
ax.set_xlabel('Disease Progression')
ax.set_ylabel('Count')
ax.set_title('Target Distribution')
ax.legend()
plt.tight_layout()
plt.show()

In [None]:
# Correlation matrix
full_df = X.copy()
full_df['target'] = y

fig, ax = plt.subplots(figsize=(12, 10))
corr = full_df.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask=mask, annot=True, fmt='.2f', cmap='RdBu_r',
            center=0, square=True, linewidths=0.5, ax=ax)
ax.set_title('Correlation Matrix (including target)')
plt.tight_layout()
plt.show()

print('Features most correlated with target:')
target_corr = corr['target'].drop('target').abs().sort_values(ascending=False)
print(target_corr.to_string())

<a id='3'></a>
## 3. Data Splitting

- We use an 80/20 train/test split
- For **classification** tasks, use `stratify=y` to preserve class proportions
- For **regression** (our case), stratification is not strictly needed, but we could stratify on binned target values if the distribution is skewed
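
The binned-target idea from the last bullet can be sketched as follows. This is a minimal illustration, assuming quartile bins built with `pd.qcut`; the names `X_demo`, `y_demo`, `y_bins` and the choice of four bins are hypothetical and not part of the notebook's own split.

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

# Load the same dataset used in this notebook
data = load_diabetes()
X_demo = pd.DataFrame(data.data, columns=data.feature_names)
y_demo = pd.Series(data.target, name="target")

# Bin the continuous target into quartiles (4 bins is an illustrative choice)
n_bins = 4
y_bins = pd.qcut(y_demo, q=n_bins, labels=False, duplicates="drop")

# Stratify on the bin labels, not on the raw continuous target
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_bins
)

# Compare the share of each bin in the two subsets
train_share = y_bins.loc[y_tr.index].value_counts(normalize=True).sort_index()
test_share = y_bins.loc[y_te.index].value_counts(normalize=True).sort_index()
print(pd.DataFrame({"train": train_share, "test": test_share}).round(3))
```

If the bin shares are close in both subsets, the train and test targets cover the same range in similar proportions. Whether this is worth doing depends on how skewed the target is and how small the test set is.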

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE
)

print(f'Training set: {X_train.shape[0]} samples')
print(f'Test set:     {X_test.shape[0]} samples')
print(f'\nTrain target mean: {y_train.mean():.1f}')
print(f'Test target mean:  {y_test.mean():.1f}')

<a id='4'></a>
## 4. Baseline Model

A baseline sets the **floor** for model performance. If your model cannot beat the baseline, something is wrong.

`DummyRegressor(strategy='mean')` predicts the mean of the training target for every sample.

In [None]:
dummy = DummyRegressor(strategy='mean')
dummy.fit(X_train, y_train)

y_pred_dummy = dummy.predict(X_test)
baseline_rmse = np.sqrt(mean_squared_error(y_test, y_pred_dummy))
baseline_mae = mean_absolute_error(y_test, y_pred_dummy)
baseline_r2 = r2_score(y_test, y_pred_dummy)

print('=== Baseline (Predict Mean) ===')
print(f'RMSE: {baseline_rmse:.2f}')
print(f'MAE:  {baseline_mae:.2f}')
print(f'R2:   {baseline_r2:.4f}')
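
One way to read the baseline RMSE: on the data it was fitted on, the mean predictor's RMSE equals the standard deviation of the target, and its R2 is zero. Beating the baseline therefore means explaining some of the target's spread. The sketch below checks this on synthetic data; `X_toy` and `y_toy` are made up for illustration and are not part of the notebook.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)
y_toy = rng.normal(loc=150.0, scale=75.0, size=200)  # synthetic target values
X_toy = np.zeros((200, 1))  # DummyRegressor ignores the features

dummy_toy = DummyRegressor(strategy="mean").fit(X_toy, y_toy)
pred_toy = dummy_toy.predict(X_toy)

rmse_toy = np.sqrt(mean_squared_error(y_toy, pred_toy))
r2_toy = r2_score(y_toy, pred_toy)

print(f"RMSE of mean predictor: {rmse_toy:.3f}")
print(f"Std of target (ddof=0): {np.std(y_toy):.3f}")
print(f"R2 on the fitting data: {r2_toy:.3f}")
```

On a separate test set the two numbers will differ slightly, because the training mean is not exactly the test mean.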

<a id='5'></a>
## 5. Preprocessing Pipeline

Even though the diabetes dataset is already preprocessed (centered, scaled), we build a full pipeline to demonstrate best practices.

In a real project, you would handle:
- **Numeric features**: imputation + scaling
- **Categorical features**: imputation + one-hot encoding

We use `ColumnTransformer` to apply different transformations to different column types.

In [None]:
# Identify column types
# In the diabetes dataset, all features are numeric
numeric_features = X_train.columns.tolist()
# categorical_features = []  # none in this dataset

print(f'Numeric features ({len(numeric_features)}): {numeric_features}')

# Build the preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        # If you had categorical features:
        # ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features),
    ],
    remainder='passthrough'  # keep any other columns unchanged
)

print('Preprocessor built.')
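
For a dataset that does contain categorical columns and missing values, the numeric and categorical branches described above might look like the sketch below. The toy DataFrame, its column names, and the imputation strategies are assumptions chosen for illustration; they are not taken from the diabetes data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with missing values in both column types
toy = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 51.0, 46.0],
    "bmi": [22.1, np.nan, 27.5, 30.2, 24.8],
    "smoker": pd.Series(["no", "yes", np.nan, "no", "former"], dtype=object),
})
num_cols = ["age", "bmi"]
cat_cols = ["smoker"]

# Numeric branch: impute, then scale
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical branch: impute, then one-hot encode
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_steps, num_cols),
        ("cat", categorical_steps, cat_cols),
    ],
    sparse_threshold=0.0,  # always return a dense array for easy inspection
)

toy_out = toy_preprocessor.fit_transform(toy)
print(toy_out.shape)
```

Nesting a small `Pipeline` inside each branch keeps imputation and encoding together, so both are refitted on the training folds whenever the outer pipeline is cross-validated.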

<a id='6'></a>
## 6. Model Training

We wrap each model in a `Pipeline` so that preprocessing and prediction are a single object.

In [None]:
# Define models to compare
models = {
    'Linear Regression': Pipeline([
        ('preprocessor', preprocessor),
        ('regressor', LinearRegression())
    ]),
    'Ridge (alpha=1.0)': Pipeline([
        ('preprocessor', preprocessor),
        ('regressor', Ridge(alpha=1.0, random_state=RANDOM_STATE))
    ]),
    'Random Forest': Pipeline([
        ('preprocessor', preprocessor),
        ('regressor', RandomForestRegressor(
            n_estimators=100, random_state=RANDOM_STATE
        ))
    ])
}

print(f'Models to compare: {list(models.keys())}')

In [None]:
# Fit all models on training data
for name, pipe in models.items():
    pipe.fit(X_train, y_train)
    print(f'{name}: fitted')
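
To see what the single-object design buys us, the sketch below fits a small pipeline and checks that the scaler's statistics come from the training rows only. It is a simplified stand-in: it uses a bare `StandardScaler` instead of the notebook's `ColumnTransformer`, and the names `demo_pipe`, `X_tr`, and `X_te` are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42
)

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", Ridge(alpha=1.0)),
])
demo_pipe.fit(X_tr, y_tr)

# The scaler's learned mean matches the training rows, not the full dataset
learned_mean = demo_pipe.named_steps["scaler"].mean_
matches_train = np.allclose(learned_mean, X_tr.mean(axis=0))
matches_full = np.allclose(learned_mean, X_all.mean(axis=0))
print(f"Matches training mean: {matches_train}")
print(f"Matches full-data mean: {matches_full}")

# predict() reuses the stored statistics; it never refits on the rows it scores
test_pred = demo_pipe.predict(X_te)
```

The same mechanism applies inside cross-validation: each fold clones the pipeline and fits the scaler on that fold's training portion only.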

<a id='7'></a>
## 7. Evaluation — Cross-Validation

We evaluate each model with 5-fold cross-validation on the **training set** to get an honest estimate before touching the test set.

In [None]:
cv_results = []

for name, pipe in models.items():
    # Negative MSE because sklearn maximizes the score
    neg_mse = cross_val_score(
        pipe, X_train, y_train, cv=5,
        scoring='neg_mean_squared_error'
    )
    rmse_scores = np.sqrt(-neg_mse)
    
    r2_scores = cross_val_score(
        pipe, X_train, y_train, cv=5, scoring='r2'
    )
    
    cv_results.append({
        'Model': name,
        'CV RMSE (mean)': rmse_scores.mean(),
        'CV RMSE (std)': rmse_scores.std(),
        'CV R2 (mean)': r2_scores.mean(),
        'CV R2 (std)': r2_scores.std()
    })

# Add baseline
dummy_neg_mse = cross_val_score(
    DummyRegressor(strategy='mean'), X_train, y_train, cv=5,
    scoring='neg_mean_squared_error'
)
dummy_r2 = cross_val_score(
    DummyRegressor(strategy='mean'), X_train, y_train, cv=5,
    scoring='r2'
)
cv_results.append({
    'Model': 'Baseline (Mean)',
    'CV RMSE (mean)': np.sqrt(-dummy_neg_mse).mean(),
    'CV RMSE (std)': np.sqrt(-dummy_neg_mse).std(),
    'CV R2 (mean)': dummy_r2.mean(),
    'CV R2 (std)': dummy_r2.std()
})

cv_df = pd.DataFrame(cv_results).sort_values('CV RMSE (mean)')
print('=== Model Comparison (5-Fold CV on Training Set) ===')
print(cv_df.to_string(index=False))
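
The loop above runs cross-validation twice per model, once per metric. A possible alternative, sketched here under the assumption that both metrics should come from the same folds, is `cross_validate` with a dictionary of scorers. The pipeline `demo_pipe` and the scorer keys `rmse` and `r2` are illustrative names.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_diabetes(return_X_y=True)
X_tr, _, y_tr, _ = train_test_split(X_all, y_all, test_size=0.2, random_state=42)

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", Ridge(alpha=1.0)),
])

# One CV run, two metrics; result keys are prefixed with "test_"
scores = cross_validate(
    demo_pipe, X_tr, y_tr, cv=5,
    scoring={"rmse": "neg_root_mean_squared_error", "r2": "r2"},
)
cv_rmse = -scores["test_rmse"]  # flip the sign back to a positive error
cv_r2 = scores["test_r2"]

print(f"CV RMSE: {cv_rmse.mean():.2f} +/- {cv_rmse.std():.2f}")
print(f"CV R2:   {cv_r2.mean():.3f} +/- {cv_r2.std():.3f}")
```

This halves the number of model fits and guarantees that RMSE and R2 describe exactly the same fitted models.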

In [None]:
# Visualize CV results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# RMSE
cv_sorted = cv_df.sort_values('CV RMSE (mean)', ascending=True)
colors = ['#4CAF50' if 'Baseline' not in m else '#999999' for m in cv_sorted['Model']]
axes[0].barh(cv_sorted['Model'], cv_sorted['CV RMSE (mean)'],
             xerr=cv_sorted['CV RMSE (std)'], color=colors, edgecolor='black')
axes[0].set_xlabel('RMSE (lower is better)')
axes[0].set_title('Cross-Validation RMSE')

# R2
cv_sorted_r2 = cv_df.sort_values('CV R2 (mean)', ascending=True)
colors_r2 = ['#2196F3' if 'Baseline' not in m else '#999999' for m in cv_sorted_r2['Model']]
axes[1].barh(cv_sorted_r2['Model'], cv_sorted_r2['CV R2 (mean)'],
             xerr=cv_sorted_r2['CV R2 (std)'], color=colors_r2, edgecolor='black')
axes[1].set_xlabel('R2 (higher is better)')
axes[1].set_title('Cross-Validation R2')

plt.tight_layout()
plt.show()

<a id='8'></a>
## 8. Error Analysis

Before tuning, inspect the errors of the best model to understand **where** it fails.

In [None]:
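# Pick the best trained model from CV for error analysis
# (restrict to entries in `models` so the baseline row can never be selected)
# We use cross_val_predict for residual analysis on training data
trained_cv = cv_df[cv_df['Model'].isin(list(models))]
best_model_name = trained_cv.sort_values('CV RMSE (mean)').iloc[0]['Model']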
print(f'Best model for error analysis: {best_model_name}')

best_pipe = models[best_model_name]
y_pred_cv = cross_val_predict(best_pipe, X_train, y_train, cv=5)

In [None]:
# Residual plots
residuals = y_train.values - y_pred_cv

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# 1. Actual vs Predicted
axes[0].scatter(y_train, y_pred_cv, alpha=0.5, edgecolors='black', linewidth=0.5)
min_val = min(y_train.min(), y_pred_cv.min())
max_val = max(y_train.max(), y_pred_cv.max())
axes[0].plot([min_val, max_val], [min_val, max_val], 'r--', lw=2, label='Perfect')
axes[0].set_xlabel('Actual')
axes[0].set_ylabel('Predicted')
axes[0].set_title('Actual vs Predicted (CV)')
axes[0].legend()

# 2. Residuals vs Predicted
axes[1].scatter(y_pred_cv, residuals, alpha=0.5, edgecolors='black', linewidth=0.5)
axes[1].axhline(0, color='red', linestyle='--', lw=2)
axes[1].set_xlabel('Predicted')
axes[1].set_ylabel('Residual')
axes[1].set_title('Residuals vs Predicted')

# 3. Residual distribution
axes[2].hist(residuals, bins=30, color='steelblue', edgecolor='black', alpha=0.7)
axes[2].axvline(0, color='red', linestyle='--', lw=2)
axes[2].set_xlabel('Residual')
axes[2].set_ylabel('Count')
axes[2].set_title('Residual Distribution')

plt.suptitle(f'Error Analysis: {best_model_name}', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

In [None]:
# Worst predictions analysis
error_df = X_train.copy()
error_df['actual'] = y_train.values
error_df['predicted'] = y_pred_cv
error_df['residual'] = residuals
error_df['abs_error'] = np.abs(residuals)

print('=== Top 10 Worst Predictions (by absolute error) ===')
worst = error_df.nlargest(10, 'abs_error')[['actual', 'predicted', 'residual', 'abs_error']]
print(worst.to_string())

print(f'\nMean absolute error: {error_df["abs_error"].mean():.2f}')
print(f'Median absolute error: {error_df["abs_error"].median():.2f}')
print(f'Max absolute error: {error_df["abs_error"].max():.2f}')

<a id='9'></a>
## 9. Hyperparameter Tuning

We tune two candidates, Ridge and Random Forest, with `GridSearchCV` on the training set only, then keep whichever reaches the lower cross-validated RMSE.

In [None]:
# Build a tunable pipeline
# We tune Ridge and RandomForest, then pick the best

# --- Ridge tuning ---
ridge_pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', Ridge(random_state=RANDOM_STATE))
])

ridge_param_grid = {
    'regressor__alpha': [0.01, 0.1, 0.5, 1.0, 5.0, 10.0, 50.0, 100.0]
}

ridge_grid = GridSearchCV(
    ridge_pipe, ridge_param_grid, cv=5,
    scoring='neg_mean_squared_error', n_jobs=-1, return_train_score=True
)
ridge_grid.fit(X_train, y_train)

print('=== Ridge Tuning ===')
print(f'Best alpha: {ridge_grid.best_params_["regressor__alpha"]}')
print(f'Best CV RMSE: {np.sqrt(-ridge_grid.best_score_):.2f}')
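
Because the grid search is run with `return_train_score=True`, its `cv_results_` attribute holds both training and validation scores for every candidate. The sketch below shows one way to tabulate them. It uses a reduced grid and a bare `StandardScaler`, and the names `demo_grid` and `table` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_diabetes(return_X_y=True)
X_tr, _, y_tr, _ = train_test_split(X_all, y_all, test_size=0.2, random_state=42)

demo_pipe = Pipeline([("scaler", StandardScaler()), ("regressor", Ridge())])
demo_grid = GridSearchCV(
    demo_pipe, {"regressor__alpha": [0.01, 1.0, 100.0]}, cv=5,
    scoring="neg_mean_squared_error", return_train_score=True,
)
demo_grid.fit(X_tr, y_tr)

# cv_results_ is a dict of arrays with one entry per parameter combination
res = pd.DataFrame(demo_grid.cv_results_)
table = pd.DataFrame({
    "alpha": res["param_regressor__alpha"],
    # Square root of the mean MSE, matching np.sqrt(-best_score_) above
    "train_rmse": np.sqrt(-res["mean_train_score"]),
    "val_rmse": np.sqrt(-res["mean_test_score"]),
})
print(table.round(2))
```

A large gap between the training and validation columns points to overfitting at that setting, while two similar but high values point to underfitting.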

In [None]:
# --- RandomForest tuning ---
rf_pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(random_state=RANDOM_STATE))
])

rf_param_grid = {
    'regressor__n_estimators': [100, 200, 300],
    'regressor__max_depth': [5, 10, 20, None],
    'regressor__min_samples_split': [2, 5, 10],
    'regressor__min_samples_leaf': [1, 2, 4]
}

rf_grid = GridSearchCV(
    rf_pipe, rf_param_grid, cv=5,
    scoring='neg_mean_squared_error', n_jobs=-1, return_train_score=True
)
rf_grid.fit(X_train, y_train)

print('=== Random Forest Tuning ===')
print(f'Best params: {rf_grid.best_params_}')
print(f'Best CV RMSE: {np.sqrt(-rf_grid.best_score_):.2f}')

In [None]:
# Compare tuned models
tuned_results = pd.DataFrame([
    {'Model': 'Ridge (tuned)', 'CV RMSE': np.sqrt(-ridge_grid.best_score_)},
    {'Model': 'Random Forest (tuned)', 'CV RMSE': np.sqrt(-rf_grid.best_score_)},
    {'Model': 'Baseline (Mean)', 'CV RMSE': np.sqrt(-dummy_neg_mse).mean()}
]).sort_values('CV RMSE')

print('=== Tuned Model Comparison (CV RMSE) ===')
print(tuned_results.to_string(index=False))

# Select final model
final_model_name = tuned_results.iloc[0]['Model']
if 'Ridge' in final_model_name:
    final_model = ridge_grid.best_estimator_
else:
    final_model = rf_grid.best_estimator_

print(f'\nFinal model selected: {final_model_name}')

<a id='10'></a>
## 10. Final Evaluation on Test Set

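Apart from scoring the parameter-free baseline in Section 4, this is the first time we touch the test set. Because the final model was chosen using cross-validation on the training set alone, the number we report here is our best estimate of real-world performance.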

In [None]:
# Final evaluation
y_pred_final = final_model.predict(X_test)

final_rmse = np.sqrt(mean_squared_error(y_test, y_pred_final))
final_mae = mean_absolute_error(y_test, y_pred_final)
final_r2 = r2_score(y_test, y_pred_final)

print('=' * 50)
print(f'FINAL TEST SET EVALUATION: {final_model_name}')
print('=' * 50)
print(f'RMSE: {final_rmse:.2f}')
print(f'MAE:  {final_mae:.2f}')
print(f'R2:   {final_r2:.4f}')
print()
print(f'Baseline RMSE: {baseline_rmse:.2f}')
print(f'Improvement over baseline: {((baseline_rmse - final_rmse) / baseline_rmse * 100):.1f}%')

In [None]:
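# Final model comparison table (for reporting only: the final model was
# already selected by CV, so these test scores must not drive model choice)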
final_comparison = []

# Baseline on test set
final_comparison.append({
    'Model': 'Baseline (Mean)',
    'Test RMSE': baseline_rmse,
    'Test MAE': baseline_mae,
    'Test R2': baseline_r2
})

# All models on test set
for name, pipe in models.items():
    yp = pipe.predict(X_test)
    final_comparison.append({
        'Model': name,
        'Test RMSE': np.sqrt(mean_squared_error(y_test, yp)),
        'Test MAE': mean_absolute_error(y_test, yp),
        'Test R2': r2_score(y_test, yp)
    })

# Tuned models on test set
for tag, grid_obj in [('Ridge (tuned)', ridge_grid), ('RF (tuned)', rf_grid)]:
    yp = grid_obj.best_estimator_.predict(X_test)
    final_comparison.append({
        'Model': tag,
        'Test RMSE': np.sqrt(mean_squared_error(y_test, yp)),
        'Test MAE': mean_absolute_error(y_test, yp),
        'Test R2': r2_score(y_test, yp)
    })

final_table = pd.DataFrame(final_comparison).sort_values('Test RMSE')
print('=== Complete Model Comparison (Test Set) ===')
print(final_table.to_string(index=False))

In [None]:
# Visualize final comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

ft = final_table.sort_values('Test RMSE')
colors_rmse = ['#E53935' if 'Baseline' in m else '#4CAF50' if 'tuned' in m else '#2196F3'
               for m in ft['Model']]
axes[0].barh(ft['Model'], ft['Test RMSE'], color=colors_rmse, edgecolor='black')
axes[0].set_xlabel('RMSE (lower is better)')
axes[0].set_title('Test RMSE Comparison')

ft_r2 = final_table.sort_values('Test R2')
colors_r2 = ['#E53935' if 'Baseline' in m else '#4CAF50' if 'tuned' in m else '#2196F3'
             for m in ft_r2['Model']]
axes[1].barh(ft_r2['Model'], ft_r2['Test R2'], color=colors_r2, edgecolor='black')
axes[1].set_xlabel('R2 (higher is better)')
axes[1].set_title('Test R2 Comparison')

plt.tight_layout()
plt.show()

<a id='11'></a>
## 11. Model Saving

Use `joblib` to save the entire pipeline (preprocessing + model) so it can be loaded later for inference.

In [None]:
# Save the final model pipeline
model_dir = 'saved_models'
os.makedirs(model_dir, exist_ok=True)
model_path = os.path.join(model_dir, 'diabetes_best_pipeline.joblib')

joblib.dump(final_model, model_path)
print(f'Model saved to: {model_path}')
print(f'File size: {os.path.getsize(model_path) / 1024:.1f} KB')

In [None]:
# Reload and verify
loaded_model = joblib.load(model_path)
y_pred_loaded = loaded_model.predict(X_test)

# Verify predictions match
assert np.allclose(y_pred_final, y_pred_loaded), 'Predictions do not match!'
print('Model loaded and verified: predictions match (np.allclose).')

# Clean up
os.remove(model_path)
os.rmdir(model_dir)
print('Cleaned up saved model files.')
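
When a saved pipeline has to outlive the current environment, one option is to store the library version next to it, since a pipeline pickled under one scikit-learn version is not guaranteed to load correctly under another. The sketch below is one possible convention, not a joblib feature: the dictionary keys `pipeline` and `sklearn_version` and the file name are assumptions.

```python
import os
import tempfile

import joblib
import numpy as np
import sklearn
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_diabetes(return_X_y=True)
demo_pipe = Pipeline([("scaler", StandardScaler()), ("regressor", Ridge(alpha=1.0))])
demo_pipe.fit(X_all, y_all)

# Bundle the pipeline with the library version it was trained under
artifact = {"pipeline": demo_pipe, "sklearn_version": sklearn.__version__}

# A temporary directory keeps the example free of leftover files
with tempfile.TemporaryDirectory() as tmp_dir:
    artifact_path = os.path.join(tmp_dir, "demo_pipeline.joblib")
    joblib.dump(artifact, artifact_path)
    restored = joblib.load(artifact_path)

# Warn if the loading environment differs from the training environment
if restored["sklearn_version"] != sklearn.__version__:
    print("Warning: scikit-learn version differs from the one used for training.")

same = np.allclose(
    demo_pipe.predict(X_all[:5]), restored["pipeline"].predict(X_all[:5])
)
print(f"Predictions match after reload: {same}")
```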

<a id='12'></a>
## 12. Reproducibility Checklist

Before sharing your project or deploying a model, verify:

- [ ] **`random_state`** fixed (here `RANDOM_STATE = 42`) in all stochastic components (`train_test_split`, models, and any CV splitter that shuffles)
- [ ] **`requirements.txt`** listing all package versions (`pip freeze > requirements.txt`)
- [ ] **Pipeline saved** with `joblib.dump()` (includes preprocessor + model)
- [ ] **Train/test split** is fixed and documented
- [ ] **No test data leakage**: preprocessing fitted on training data only
- [ ] **Results reported** on held-out test set (not CV scores)
- [ ] **Code is runnable** end-to-end from a clean environment
- [ ] **Data source** documented (or data saved/versioned)

In [None]:
# Print environment info for reproducibility
import sklearn
print('=== Environment ===')
print(f'numpy:        {np.__version__}')
print(f'pandas:       {pd.__version__}')
print(f'scikit-learn: {sklearn.__version__}')
print(f'seaborn:      {sns.__version__}')
print(f'random_state: {RANDOM_STATE}')

<a id='13'></a>
## 13. Common Mistakes

| Step | Mistake | Fix |
|------|---------|-----|
| **Problem Framing** | Jumping to modeling without defining success metric | Define the metric and baseline before any code |
| **EDA** | Skipping EDA and overlooking missing data or class imbalance | Always visualize distributions and check for nulls and correlations |
| **Splitting** | Fitting preprocessor on full data (train+test) | Fit only on training data; transform test data |
| **Baseline** | No baseline to compare against | Always establish a DummyRegressor/DummyClassifier floor |
| **Pipelines** | Separate preprocessing and model steps (risk of data leakage) | Use sklearn `Pipeline` to chain everything |
| **Evaluation** | Reporting CV scores as final results | CV is for model selection; final metric comes from held-out test set |
| **Error Analysis** | Not inspecting errors before tuning | Residual plots reveal systematic patterns that tuning alone cannot fix |
| **Tuning** | Tuning on the test set | Use `GridSearchCV` with CV on training data only |
| **Final Report** | Not comparing against baseline in the final table | Always include the baseline in comparison tables |
| **Reproducibility** | Forgetting `random_state` or not saving the pipeline | Use a checklist (see Section 12) |

<a id='14'></a>
## 14. Exercise

**Task**: Adapt this template to a **classification** problem.

1. Load `sklearn.datasets.load_breast_cancer()`
2. Frame the problem: which metric (accuracy, F1, AUC)? Which baseline (`DummyClassifier`)?
3. EDA: class balance, feature distributions
4. Split: 80/20, `stratify=y`, `random_state=42`
5. Baseline: `DummyClassifier(strategy='most_frequent')`
6. Fit at least 3 models: `LogisticRegression`, `RandomForestClassifier`, `GradientBoostingClassifier`
7. Cross-validate on training set
8. Tune the best model with `GridSearchCV`
9. Evaluate on test set
10. Save the final pipeline
11. Create a final comparison table

In [None]:
# YOUR CODE HERE
# Follow the same 12-step structure used above, but for classification.

# from sklearn.datasets import load_breast_cancer
# from sklearn.dummy import DummyClassifier
# from sklearn.linear_model import LogisticRegression
# from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
# from sklearn.metrics import accuracy_score, f1_score, classification_report

# Step 1: Load data
# ...

# Step 2-12: Follow the template above
# ...

---
**End of Notebook 6 and Module ML600** | Congratulations on completing the Optimization, Regularization, and Model Selection module!