# Notebook 03: Model Evaluation Metrics

Learn how to evaluate regression and classification models using appropriate metrics, diagnostic plots, and cross-validation.

## Learning Objectives
- Understand regression metrics (MSE, RMSE, MAE, R²)
- Understand classification metrics (Accuracy, Precision, Recall, F1, ROC-AUC)
- Visualize confusion matrices
- Plot learning curves and validation curves
- Use cross-validation effectively

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import (
    train_test_split, cross_val_score, learning_curve, validation_curve
)
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier, MLPRegressor
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.metrics import (
    # Regression metrics
    mean_squared_error, mean_absolute_error, r2_score,
    # Classification metrics
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report,
    roc_curve, roc_auc_score, precision_recall_curve, auc
)

np.random.seed(42)
plt.style.use('seaborn-v0_8-whitegrid')

## Part 1: Regression Metrics

In [None]:
# Generate regression data
X, y = make_regression(n_samples=500, n_features=10, noise=20, random_state=42)

# Split and scale
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train model
model = MLPRegressor(hidden_layer_sizes=(50, 25), max_iter=500, random_state=42)
model.fit(X_train_scaled, y_train)

# Predictions
y_pred = model.predict(X_test_scaled)

print(f"Training samples: {len(y_train)}")
print(f"Test samples: {len(y_test)}")

In [None]:
# Calculate regression metrics
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Regression Metrics:")
print("="*50)
print(f"Mean Squared Error (MSE):     {mse:.4f}")
print(f"Root Mean Squared Error:      {rmse:.4f}")
print(f"Mean Absolute Error (MAE):    {mae:.4f}")
print(f"R² Score:                     {r2:.4f}")

In [None]:
# Visualize predictions vs actual
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Scatter plot: Predicted vs Actual
axes[0].scatter(y_test, y_pred, alpha=0.5)
axes[0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
axes[0].set_xlabel('Actual')
axes[0].set_ylabel('Predicted')
axes[0].set_title(f'Predicted vs Actual (R² = {r2:.3f})')

# Residual plot
residuals = y_test - y_pred
axes[1].scatter(y_pred, residuals, alpha=0.5)
axes[1].axhline(y=0, color='r', linestyle='--')
axes[1].set_xlabel('Predicted')
axes[1].set_ylabel('Residuals')
axes[1].set_title('Residual Plot')

# Residual distribution
axes[2].hist(residuals, bins=30, edgecolor='black', alpha=0.7)
axes[2].axvline(x=0, color='r', linestyle='--')
axes[2].set_xlabel('Residual')
axes[2].set_ylabel('Frequency')
axes[2].set_title(f'Residual Distribution (MAE = {mae:.2f})')

plt.tight_layout()
plt.show()

### Metric Explanations

| Metric | Formula | Interpretation |
|--------|---------|----------------|
| MSE | Σ(y - ŷ)² / n | Penalizes large errors more |
| RMSE | √MSE | Same units as target |
| MAE | Σ\|y - ŷ\| / n | Average absolute error |
| R² | 1 - SS_res/SS_tot | Fraction of variance explained (at most 1; negative when the model is worse than predicting the mean) |
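
To make the formulas in the table concrete, the sketch below computes each metric with plain NumPy on a small made-up example. The helper `regression_metrics` and the arrays `y_true_demo`, `y_good_demo`, and `y_bad_demo` are illustrative names and values, not part of this notebook's data. The second call shows R² falling below zero when the predictions are worse than always predicting the mean.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE and R² directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)
    ss_res = np.sum(residuals ** 2)                   # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    return {
        "mse": mse,
        "rmse": np.sqrt(mse),
        "mae": np.mean(np.abs(residuals)),
        "r2": 1.0 - ss_res / ss_tot,
    }

# Hypothetical targets and two sets of predictions
y_true_demo = [3.0, 5.0, 7.0, 9.0]
y_good_demo = [2.5, 5.5, 7.0, 10.0]   # close to the targets
y_bad_demo = [9.0, 7.0, 5.0, 3.0]     # reversed order: worse than the mean

good = regression_metrics(y_true_demo, y_good_demo)
bad = regression_metrics(y_true_demo, y_bad_demo)
print(good)
print(f"R² for the bad predictions: {bad['r2']:.2f}")
```

Because RMSE squares the residuals before averaging, the single error of 1.0 in `y_good_demo` pulls RMSE above MAE.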

## Part 2: Classification Metrics

In [None]:
# Generate classification data
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10,
    n_classes=2, weights=[0.7, 0.3],  # Imbalanced
    random_state=42
)

# Split and scale
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train model
clf = MLPClassifier(hidden_layer_sizes=(50, 25), max_iter=500, random_state=42)
clf.fit(X_train_scaled, y_train)

# Predictions
y_pred = clf.predict(X_test_scaled)
y_prob = clf.predict_proba(X_test_scaled)[:, 1]

print(f"Class distribution in test set: {np.bincount(y_test)}")

In [None]:
# Calculate classification metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_prob)

print("Classification Metrics:")
print("="*50)
print(f"Accuracy:     {accuracy:.4f}")
print(f"Precision:    {precision:.4f}")
print(f"Recall:       {recall:.4f}")
print(f"F1 Score:     {f1:.4f}")
print(f"ROC-AUC:      {roc_auc:.4f}")

In [None]:
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Raw counts
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0])
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('Actual')
axes[0].set_title('Confusion Matrix (Counts)')

# Normalized
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=axes[1])
axes[1].set_xlabel('Predicted')
axes[1].set_ylabel('Actual')
axes[1].set_title('Confusion Matrix (Normalized)')

plt.tight_layout()
plt.show()

# Extract values
tn, fp, fn, tp = cm.ravel()
print(f"\nTrue Negatives: {tn}, False Positives: {fp}")
print(f"False Negatives: {fn}, True Positives: {tp}")
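
The two normalizations of a confusion matrix answer different questions. As a rough sketch on made-up counts (the matrix `cm_demo` is hypothetical, laid out as in scikit-learn with rows for actual and columns for predicted classes), dividing by row sums puts per-class recall on the diagonal, while dividing by column sums puts per-class precision there:

```python
import numpy as np

# Hypothetical binary confusion matrix: rows = actual, columns = predicted
cm_demo = np.array([[130, 10],
                    [20, 40]])

tn_demo, fp_demo, fn_demo, tp_demo = cm_demo.ravel()

# Row-normalized: each row sums to 1, diagonal holds per-class recall
row_norm = cm_demo / cm_demo.sum(axis=1, keepdims=True)
# Column-normalized: each column sums to 1, diagonal holds per-class precision
col_norm = cm_demo / cm_demo.sum(axis=0, keepdims=True)

recall_pos = row_norm[1, 1]      # tp / (tp + fn)
precision_pos = col_norm[1, 1]   # tp / (tp + fp)

print(f"Recall (class 1):    {recall_pos:.3f}")
print(f"Precision (class 1): {precision_pos:.3f}")
```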

In [None]:
# Detailed classification report
print("\nClassification Report:")
print("="*60)
print(classification_report(y_test, y_pred, target_names=['Class 0', 'Class 1']))

### Metric Explanations

| Metric | Formula | Use When |
|--------|---------|----------|
| Accuracy | (TP+TN)/(TP+TN+FP+FN) | Balanced classes |
| Precision | TP/(TP+FP) | Cost of false positives high |
| Recall | TP/(TP+FN) | Cost of false negatives high |
| F1 | 2·(P·R)/(P+R) | Balance precision & recall |
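
The "Balanced classes" caveat for accuracy can be illustrated with a trivial baseline. The sketch below assumes a made-up 70/30 label split (mirroring the `weights=[0.7, 0.3]` setting above) and a hypothetical classifier that always predicts class 0. All names ending in `_base` are illustrative only.

```python
import numpy as np

# Hypothetical labels with a 70/30 class split
y_true_base = np.array([0] * 70 + [1] * 30)
# A "classifier" that always predicts the majority class
y_pred_base = np.zeros_like(y_true_base)

tp = int(np.sum((y_pred_base == 1) & (y_true_base == 1)))
tn = int(np.sum((y_pred_base == 0) & (y_true_base == 0)))
fp = int(np.sum((y_pred_base == 1) & (y_true_base == 0)))
fn = int(np.sum((y_pred_base == 0) & (y_true_base == 1)))

accuracy_base = (tp + tn) / (tp + tn + fp + fn)
# Guard against division by zero when no positives are predicted
precision_base = tp / (tp + fp) if (tp + fp) else 0.0
recall_base = tp / (tp + fn) if (tp + fn) else 0.0
pr_sum = precision_base + recall_base
f1_base = 2 * precision_base * recall_base / pr_sum if pr_sum else 0.0

print(f"Accuracy: {accuracy_base:.2f}")
print(f"Recall:   {recall_base:.2f}")
print(f"F1:       {f1_base:.2f}")
```

Accuracy looks respectable even though the baseline never finds a single positive, which is why recall and F1 are more informative on imbalanced data.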

## Part 3: ROC and Precision-Recall Curves

In [None]:
# Calculate curves
fpr, tpr, thresholds_roc = roc_curve(y_test, y_prob)
precision_curve, recall_curve, thresholds_pr = precision_recall_curve(y_test, y_prob)
pr_auc = auc(recall_curve, precision_curve)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# ROC Curve
axes[0].plot(fpr, tpr, 'b-', lw=2, label=f'ROC (AUC = {roc_auc:.3f})')
axes[0].plot([0, 1], [0, 1], 'k--', lw=1, label='Random')
axes[0].set_xlabel('False Positive Rate')
axes[0].set_ylabel('True Positive Rate')
axes[0].set_title('ROC Curve')
axes[0].legend(loc='lower right')
axes[0].grid(True)

# Precision-Recall Curve
axes[1].plot(recall_curve, precision_curve, 'b-', lw=2, label=f'PR (AUC = {pr_auc:.3f})')
baseline = np.sum(y_test) / len(y_test)
axes[1].axhline(y=baseline, color='k', linestyle='--', label=f'Baseline ({baseline:.2f})')
axes[1].set_xlabel('Recall')
axes[1].set_ylabel('Precision')
axes[1].set_title('Precision-Recall Curve')
axes[1].legend(loc='lower left')
axes[1].grid(True)

plt.tight_layout()
plt.show()
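
One way to read ROC-AUC is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below checks that reading on four made-up scores. `pairwise_auc` is a hypothetical helper written for illustration, not a scikit-learn function, and it compares every positive with every negative, so it only suits small arrays.

```python
import numpy as np

def pairwise_auc(y_true, y_score):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Score difference for every positive/negative pair
    diff = pos[:, None] - neg[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and scores: 3 of the 4 pairs are ordered correctly
auc_demo = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(f"Pairwise AUC: {auc_demo:.2f}")
```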

In [None]:
# Threshold analysis
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Metrics vs threshold
thresholds = np.linspace(0.1, 0.9, 50)
precisions = []
recalls = []
f1s = []

for thresh in thresholds:
    y_pred_thresh = (y_prob >= thresh).astype(int)
    precisions.append(precision_score(y_test, y_pred_thresh, zero_division=0))
    recalls.append(recall_score(y_test, y_pred_thresh, zero_division=0))
    f1s.append(f1_score(y_test, y_pred_thresh, zero_division=0))

axes[0].plot(thresholds, precisions, 'b-', label='Precision')
axes[0].plot(thresholds, recalls, 'r-', label='Recall')
axes[0].plot(thresholds, f1s, 'g-', label='F1')
axes[0].set_xlabel('Threshold')
axes[0].set_ylabel('Score')
axes[0].set_title('Metrics vs Classification Threshold')
axes[0].legend()
axes[0].grid(True)

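# Find the threshold that maximizes F1. It is selected on the test set here for
# illustration; in practice, select it on a validation split to avoid an optimistic estimate.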
optimal_idx = np.argmax(f1s)
optimal_threshold = thresholds[optimal_idx]
axes[0].axvline(x=optimal_threshold, color='k', linestyle='--', alpha=0.5)

# Probability distribution
axes[1].hist(y_prob[y_test == 0], bins=30, alpha=0.5, label='Class 0', color='blue')
axes[1].hist(y_prob[y_test == 1], bins=30, alpha=0.5, label='Class 1', color='red')
axes[1].axvline(x=0.5, color='k', linestyle='--', label='Threshold=0.5')
axes[1].set_xlabel('Predicted Probability')
axes[1].set_ylabel('Count')
axes[1].set_title('Probability Distribution by Class')
axes[1].legend()

plt.tight_layout()
plt.show()

print(f"Optimal threshold (max F1): {optimal_threshold:.2f}")
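
A threshold tuned on the same data used to report the score tends to look better than it will on new data. A minimal sketch of the alternative, assuming separate validation and test arrays (the `*_demo` arrays and both helper functions are made up for illustration): pick the threshold on the validation split, then evaluate once on the test split.

```python
import numpy as np

def f1_at_threshold(y_true, y_prob, threshold):
    """F1 of the positive class when predicting 1 for y_prob >= threshold."""
    y_hat = (y_prob >= threshold).astype(int)
    tp = np.sum((y_hat == 1) & (y_true == 1))
    fp = np.sum((y_hat == 1) & (y_true == 0))
    fn = np.sum((y_hat == 0) & (y_true == 1))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0

def best_f1_threshold(y_true, y_prob, candidates):
    """Return the candidate threshold with the highest F1 (first one on ties)."""
    scores = [f1_at_threshold(y_true, y_prob, t) for t in candidates]
    return float(candidates[int(np.argmax(scores))])

# Hypothetical validation split: used only to choose the threshold
y_val_demo = np.array([0, 0, 1, 1, 0, 1])
p_val_demo = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.6])

# Hypothetical test split: used only to report the final score
y_test_demo = np.array([0, 1, 1, 0])
p_test_demo = np.array([0.25, 0.7, 0.32, 0.45])

candidates = np.array([0.3, 0.5, 0.7])
chosen = best_f1_threshold(y_val_demo, p_val_demo, candidates)
test_f1 = f1_at_threshold(y_test_demo, p_test_demo, chosen)

print(f"Chosen threshold: {chosen:.2f}")
print(f"Test F1 at that threshold: {test_f1:.2f}")
```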

## Part 4: Multi-class Classification Metrics

In [None]:
# Generate multi-class data
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10,
    n_classes=4, n_clusters_per_class=1,
    random_state=42
)

# Split and scale
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train model
clf_multi = MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=500, random_state=42)
clf_multi.fit(X_train_scaled, y_train)

# Predictions
y_pred_multi = clf_multi.predict(X_test_scaled)

print(f"Number of classes: {len(np.unique(y))}")
print(f"Test set distribution: {np.bincount(y_test)}")

In [None]:
# Multi-class metrics
print("Multi-class Classification Report:")
print("="*60)
print(classification_report(y_test, y_pred_multi))

# Different averaging methods
print("\nF1 Scores with different averaging:")
for avg in ['micro', 'macro', 'weighted']:
    f1_avg = f1_score(y_test, y_pred_multi, average=avg)
    print(f"  {avg:8s}: {f1_avg:.4f}")
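
The three averaging modes differ only in how per-class counts are combined. The sketch below reproduces them by hand on ten made-up labels (`y_true_mc` and `y_pred_mc` are hypothetical): macro is the unweighted mean of per-class F1, weighted uses class support as weights, and micro pools the counts across classes, which for single-label problems equals accuracy.

```python
import numpy as np

# Hypothetical 3-class labels and predictions
y_true_mc = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred_mc = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 0])
classes = np.unique(y_true_mc)

# One-vs-rest counts for each class
tp = np.array([np.sum((y_pred_mc == c) & (y_true_mc == c)) for c in classes])
fp = np.array([np.sum((y_pred_mc == c) & (y_true_mc != c)) for c in classes])
fn = np.array([np.sum((y_pred_mc != c) & (y_true_mc == c)) for c in classes])
support = tp + fn   # number of true examples per class

per_class_f1 = 2 * tp / (2 * tp + fp + fn)
macro_f1 = per_class_f1.mean()
weighted_f1 = np.sum(per_class_f1 * support) / support.sum()
micro_f1 = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())
accuracy_mc = np.mean(y_true_mc == y_pred_mc)

print(f"per-class: {per_class_f1}")
print(f"macro:     {macro_f1:.4f}")
print(f"weighted:  {weighted_f1:.4f}")
print(f"micro:     {micro_f1:.4f} (accuracy: {accuracy_mc:.4f})")
```

Macro averaging is pulled down by the weak minority class here, while weighted and micro averaging are dominated by the larger classes.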

In [None]:
# Multi-class confusion matrix
cm_multi = confusion_matrix(y_test, y_pred_multi)

plt.figure(figsize=(8, 6))
sns.heatmap(cm_multi, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Multi-class Confusion Matrix')
plt.show()

## Part 5: Cross-Validation

In [None]:
# Generate data
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                          n_classes=2, random_state=42)
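# Note: fitting the scaler on the full dataset lets every CV fold see statistics
# from its own validation data. This is kept here for simplicity; wrapping the
# scaler and model in a Pipeline avoids the leak.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)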

# Cross-validation with different metrics
clf = MLPClassifier(hidden_layer_sizes=(50,), max_iter=500, random_state=42)

metrics = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']
results = {}

for metric in metrics:
    scores = cross_val_score(clf, X_scaled, y, cv=5, scoring=metric)
    results[metric] = scores
    print(f"{metric:12s}: {scores.mean():.4f} (+/- {scores.std()*2:.4f})")
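
When preprocessing is fitted before cross-validation, each validation fold has already influenced the scaler. A minimal sketch of a leak-free setup follows, assuming scikit-learn's `make_pipeline`. It swaps in `LogisticRegression` and a smaller synthetic dataset (`X_demo`, `y_demo`) purely to keep the example fast; the same pattern should apply to the `MLPClassifier` used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic dataset for illustration only
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=5, random_state=0
)

# The scaler is refit on the training portion of every fold,
# so validation data never contributes to the scaling statistics
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe_scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="accuracy")

print(f"accuracy: {pipe_scores.mean():.4f} (+/- {pipe_scores.std() * 2:.4f})")
```

Note that the pipeline receives the unscaled `X_demo`; scaling happens inside each fold.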

In [None]:
# Visualize CV results
fig, ax = plt.subplots(figsize=(10, 6))

positions = np.arange(len(metrics))
bp = ax.boxplot([results[m] for m in metrics], positions=positions, widths=0.6)

ax.set_xticks(positions)
ax.set_xticklabels(metrics)
ax.set_ylabel('Score')
ax.set_title('5-Fold Cross-Validation Scores')

# Add mean markers
means = [results[m].mean() for m in metrics]
ax.scatter(positions, means, marker='D', color='red', s=50, zorder=3, label='Mean')
ax.legend()

plt.tight_layout()
plt.show()

## Part 6: Learning Curves

In [None]:
# Generate learning curve
clf = MLPClassifier(hidden_layer_sizes=(50,), max_iter=500, random_state=42)

train_sizes, train_scores, test_scores = learning_curve(
    clf, X_scaled, y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

# Calculate mean and std
train_mean = train_scores.mean(axis=1)
train_std = train_scores.std(axis=1)
test_mean = test_scores.mean(axis=1)
test_std = test_scores.std(axis=1)

# Plot
plt.figure(figsize=(10, 6))

plt.plot(train_sizes, train_mean, 'o-', color='blue', label='Training score')
plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.1, color='blue')

plt.plot(train_sizes, test_mean, 'o-', color='green', label='Cross-validation score')
plt.fill_between(train_sizes, test_mean - test_std, test_mean + test_std, alpha=0.1, color='green')

plt.xlabel('Training Set Size')
plt.ylabel('Accuracy')
plt.title('Learning Curve')
plt.legend(loc='lower right')
plt.grid(True)
plt.show()

print("Interpretation:")
print("- High training, low CV → Overfitting (need more data or regularization)")
print("- Both low → Underfitting (need more complex model)")
print("- Both converging high → Good fit")

## Part 7: Validation Curves

In [None]:
# Validation curve for alpha (regularization)
param_range = np.logspace(-5, 1, 10)

train_scores, test_scores = validation_curve(
    MLPClassifier(hidden_layer_sizes=(50,), max_iter=500, random_state=42),
    X_scaled, y,
    param_name='alpha',
    param_range=param_range,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

# Calculate mean and std
train_mean = train_scores.mean(axis=1)
train_std = train_scores.std(axis=1)
test_mean = test_scores.mean(axis=1)
test_std = test_scores.std(axis=1)

# Plot
plt.figure(figsize=(10, 6))

plt.semilogx(param_range, train_mean, 'o-', color='blue', label='Training score')
plt.fill_between(param_range, train_mean - train_std, train_mean + train_std, alpha=0.1, color='blue')

plt.semilogx(param_range, test_mean, 'o-', color='green', label='Cross-validation score')
plt.fill_between(param_range, test_mean - test_std, test_mean + test_std, alpha=0.1, color='green')

plt.xlabel('Alpha (Regularization)')
plt.ylabel('Accuracy')
plt.title('Validation Curve: Effect of Regularization')
plt.legend(loc='best')
plt.grid(True)
plt.show()

# Find optimal alpha
optimal_alpha = param_range[np.argmax(test_mean)]
print(f"Optimal alpha: {optimal_alpha:.6f}")

In [None]:
# Validation curve for hidden layer size
hidden_sizes = [(10,), (25,), (50,), (100,), (200,)]

train_scores_list = []
test_scores_list = []

from sklearn.model_selection import cross_validate

for size in hidden_sizes:
    clf = MLPClassifier(hidden_layer_sizes=size, max_iter=500, random_state=42)
    # Score training and validation folds from the same fits so the two curves are comparable
    cv_results = cross_validate(
        clf, X_scaled, y, cv=5, scoring='accuracy', return_train_score=True
    )

    train_scores_list.append(cv_results['train_score'].mean())
    test_scores_list.append(cv_results['test_score'].mean())

# Plot
plt.figure(figsize=(10, 6))

x_labels = [str(s[0]) for s in hidden_sizes]
x_pos = np.arange(len(hidden_sizes))

plt.plot(x_pos, train_scores_list, 'o-', color='blue', label='Training score')
plt.plot(x_pos, test_scores_list, 'o-', color='green', label='CV score')

plt.xticks(x_pos, x_labels)
plt.xlabel('Hidden Layer Size')
plt.ylabel('Accuracy')
plt.title('Effect of Network Size on Performance')
plt.legend()
plt.grid(True)
plt.show()

## Part 8: Comparing Models

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Define models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM': SVC(probability=True, random_state=42),
    'MLP': MLPClassifier(hidden_layer_sizes=(50,), max_iter=500, random_state=42)
}

# Compare with cross-validation
results = {}
print("Model Comparison (5-Fold CV):")
print("="*60)

for name, model in models.items():
    scores = cross_val_score(model, X_scaled, y, cv=5, scoring='accuracy')
    results[name] = scores
    print(f"{name:25s}: {scores.mean():.4f} (+/- {scores.std()*2:.4f})")

In [None]:
# Visualize comparison
fig, ax = plt.subplots(figsize=(12, 6))

names = list(results.keys())
positions = np.arange(len(names))

bp = ax.boxplot([results[name] for name in names], positions=positions, widths=0.6)

ax.set_xticks(positions)
ax.set_xticklabels(names, rotation=15)
ax.set_ylabel('Accuracy')
ax.set_title('Model Comparison: 5-Fold Cross-Validation')

# Add mean markers
means = [results[name].mean() for name in names]
ax.scatter(positions, means, marker='D', color='red', s=50, zorder=3)

plt.tight_layout()
plt.show()

## Summary

In this notebook, you learned:

### Regression Metrics
- **MSE/RMSE**: Penalize large errors
- **MAE**: Average absolute error
- **R²**: Fraction of variance explained (can be negative on held-out data)

### Classification Metrics
- **Accuracy**: Overall correctness
- **Precision**: True positives / predicted positives
- **Recall**: True positives / actual positives
- **F1**: Harmonic mean of precision and recall
- **ROC-AUC**: Area under ROC curve

### Model Evaluation Tools
- **Confusion Matrix**: Detailed error analysis
- **Cross-Validation**: Robust performance estimation
- **Learning Curves**: Diagnose overfitting/underfitting
- **Validation Curves**: Tune hyperparameters

### Key Takeaways
- Choose metrics based on your problem requirements
- Use cross-validation for robust estimates
- Learning curves help diagnose model issues
- Always consider class imbalance when choosing metrics

### Next Steps
Continue to **Notebook 04** to learn about linear models with parameter simulation.