# Lesson 10: Model Comparison

**What you'll learn:**
- Compare multiple models fairly
- Use cross-validation for reliable comparison
- Visualize comparisons
- Select the best model

---

## Section 1: Why Compare Models?

### READ

Different algorithms work better for different problems.
No single algorithm is "best" everywhere, so you need to test several on your own data and compare them.

**Fair comparison requires:**
- Same data split
- Same evaluation metric
- Multiple runs (cross-validation)
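
To see why a single run is not enough, the sketch below scores the same model on five different train/test splits. It uses synthetic data from `make_classification` as a stand-in (an assumption, not the lesson's dataset), and the names `X_demo`, `y_demo`, and `split_scores` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the lesson's dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

split_scores = []
for seed in range(5):
    # Only the split changes between iterations; model and metric stay fixed
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=seed
    )
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    score = f1_score(y_te, clf.predict(X_te), average='weighted')
    split_scores.append(score)
    print(f"split seed {seed}: F1 = {score:.3f}")

print(f"Spread across splits: {max(split_scores) - min(split_scores):.3f}")
```

If the spread is comparable to the gap between two models, a single split cannot tell you which one is better.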

### TRY IT - Setup

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import f1_score

# Load data
df = pd.read_csv('../datasets/tomatjus.csv')
X = df.drop('quality', axis=1)
y = df['quality']

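# Scale every feature to the [0, 1] range
# Note: the scaler is fitted on all rows here, so test-fold values influence
# the scaling during cross-validation. Wrapping the scaler and the model in a
# Pipeline avoids this leakage.
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)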

print(f"Samples: {len(X)}, Features: {X.shape[1]}")
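
The setup cell fits `MinMaxScaler` on every row before any splitting, so each test fold has already contributed to the min/max values. The effect is often small, but one way to rule it out is to put the scaler inside a pipeline so it is refitted on the training folds only. The sketch below assumes synthetic stand-in data; `knn_pipeline`, `cv_demo`, and `pipeline_scores` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption: not the lesson's dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# The scaler is part of the estimator, so cross_val_score fits it
# on the training portion of each fold and only applies it to the test fold
knn_pipeline = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=5))

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
pipeline_scores = cross_val_score(
    knn_pipeline, X_demo, y_demo, cv=cv_demo, scoring='f1_weighted'
)
print(f"KNN in pipeline: {pipeline_scores.mean():.3f} (+/- {pipeline_scores.std():.3f})")
```

With this pattern you pass the unscaled `X` to `cross_val_score` and let the pipeline handle scaling.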

---

## Section 2: Cross-Validation Comparison

### READ

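**Cross-validation** gives a more reliable comparison than a single train/test split:
1. Split the data into K folds
2. Train on K-1 folds, test on the remaining fold
3. Repeat K times, so each fold is used for testing exactly once
4. Average the K scores

Averaging over K splits reduces the influence of a single "lucky" or "unlucky" split.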
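
The four steps can be written out by hand, which is roughly what `cross_val_score` does for you. This is a minimal sketch on synthetic stand-in data (an assumption); `kfold`, `fold_scores`, and `mean_score` are illustrative names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the lesson's dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# Step 1: split the data into K = 5 folds
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_scores = []
# Step 3: the loop repeats once per fold
for fold, (train_idx, test_idx) in enumerate(kfold.split(X_demo, y_demo), start=1):
    # Step 2: train a fresh model on K-1 folds, test on the held-out fold
    model = DecisionTreeClassifier(random_state=42)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    preds = model.predict(X_demo[test_idx])
    score = f1_score(y_demo[test_idx], preds, average='weighted')
    fold_scores.append(score)
    print(f"Fold {fold}: F1 = {score:.3f}")

# Step 4: average the results
mean_score = np.mean(fold_scores)
print(f"Mean F1: {mean_score:.3f}")
```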

### TRY IT

In [None]:
# Define models to compare
models = [
    ('Logistic Regression', LogisticRegression(max_iter=1000)),
    ('Decision Tree', DecisionTreeClassifier(random_state=42)),
    ('KNN', KNeighborsClassifier(n_neighbors=5)),
    ('Random Forest', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('SVM', SVC(random_state=42))
]

# 5-fold stratified cross-validation
# Reusing this one object (fixed random_state) gives every model the same folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

print("Model Comparison (5-fold CV, F1-weighted)")
print("="*50)

results = {}
for name, model in models:
    scores = cross_val_score(model, X_scaled, y, cv=cv, scoring='f1_weighted')
    results[name] = scores
    print(f"{name:20s}: {scores.mean():.3f} (+/- {scores.std():.3f})")

### EXPLAIN

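- `StratifiedKFold`: Keeps the class proportions roughly the same in each fold
- `cross_val_score`: Fits a fresh copy of the model on each training split and returns one score per fold
- **Mean**: Average score across the 5 folds (typical performance)
- **Std**: How much the score varies between folds (lower = more stable)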
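
To illustrate what stratification changes, the sketch below compares the minority-class share in each test fold for plain `KFold` and `StratifiedKFold`. The labels are a made-up 80/20 split (an assumption for illustration), and `minority_share_per_fold` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 80 samples of class 0, 20 of class 1
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.zeros((len(y_demo), 1))  # feature values do not affect the split


def minority_share_per_fold(splitter):
    """Return the fraction of class-1 samples in each test fold."""
    return [
        float(y_demo[test_idx].mean())
        for _, test_idx in splitter.split(X_demo, y_demo)
    ]


plain = minority_share_per_fold(KFold(n_splits=5, shuffle=True, random_state=42))
stratified = minority_share_per_fold(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
)

print("KFold          :", [f"{s:.2f}" for s in plain])
print("StratifiedKFold:", [f"{s:.2f}" for s in stratified])
```

With stratification every test fold keeps the overall 20% minority share, while plain `KFold` lets it drift from fold to fold.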

---

## Section 3: Visualizing Comparison

In [None]:
# Box plot comparison
plt.figure(figsize=(10, 6))
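# Convert the dict views to lists so Matplotlib gets plain sequences
# (Matplotlib 3.9+ renames labels= to tick_labels=)
plt.boxplot(list(results.values()), labels=list(results.keys()))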
plt.ylabel('F1 Score (weighted)')
plt.title('Model Comparison (5-Fold Cross-Validation)')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

In [None]:
# Bar chart with error bars
means = [scores.mean() for scores in results.values()]
stds = [scores.std() for scores in results.values()]

plt.figure(figsize=(10, 6))
plt.bar(list(results.keys()), means, yerr=stds, capsize=5, color='steelblue', alpha=0.7)
plt.ylabel('F1 Score (weighted)')
plt.title('Model Comparison with Standard Deviation')
plt.xticks(rotation=45, ha='right')
plt.ylim(0, 1)
plt.tight_layout()
plt.show()

---

## Section 4: Selecting the Best Model

In [None]:
# Summary table
summary = pd.DataFrame({
    'Model': list(results.keys()),
    'Mean F1': [scores.mean() for scores in results.values()],
    'Std': [scores.std() for scores in results.values()]
}).sort_values('Mean F1', ascending=False)

print("\nModel Ranking:")
print(summary.to_string(index=False))

best_model = summary.iloc[0]['Model']
print(f"\nBest Model: {best_model}")
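
Picking the top row by mean F1 is a reasonable start, but the winning CV score was itself used for selection and tends to be slightly optimistic. One common approach, sketched here under assumptions, is to hold out a test set first, select by CV on the training part only, and report the final score on the untouched test set. The data is synthetic, and `candidates`, `cv_means`, `best_name`, and `test_f1` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the lesson's dataset)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)

# Set the test set aside before any model selection happens
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

candidates = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
}

# Select using cross-validation on the training data only
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_means = {
    name: cross_val_score(model, X_train, y_train, cv=cv_demo, scoring='f1_weighted').mean()
    for name, model in candidates.items()
}
best_name = max(cv_means, key=cv_means.get)

# Refit the winner on all training data, then score it once on the test set
final_model = candidates[best_name].fit(X_train, y_train)
test_f1 = f1_score(y_test, final_model.predict(X_test), average='weighted')

print(f"Selected: {best_name} (CV mean F1 = {cv_means[best_name]:.3f})")
print(f"Held-out test F1: {test_f1:.3f}")
```

When two models differ by less than their fold-to-fold standard deviation, it can also be sensible to prefer the simpler or faster one.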

---

## Quick Reference

```python
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Define cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Compare models (models is a list of (name, estimator) tuples)
for name, model in models:
    scores = cross_val_score(model, X, y, cv=cv, scoring='f1_weighted')
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

---

## Next Lesson

In **Lesson 11: Assignment Guide**, you'll learn:
- Step-by-step walkthrough with the NSL-KDD dataset
- Building a baseline model
- Applying optimization
- Preparing your report