# Lesson 07: Hyperparameter Tuning

**What you'll learn:**
- What hyperparameters are
- Manual tuning with loops
- GridSearchCV (automated search)
- Tuning KNN, Decision Tree, Random Forest

**This is ONE of the optimization techniques for your assignment!**

---

## Section 1: What are Hyperparameters?

### READ

**Parameters**: Learned BY the model during training
- (e.g., the split points in a decision tree)

**Hyperparameters**: Set by YOU before training
- (e.g., how many neighbors in KNN)

**Finding good hyperparameters can significantly improve performance!**

Common hyperparameters:
- KNN: `n_neighbors` (how many neighbors to check)
- Decision Tree: `max_depth` (how deep the tree can grow)
- Random Forest: `n_estimators` (how many trees)
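
To make the difference concrete, here is a minimal sketch. It assumes a small synthetic dataset from `make_classification` (not the lesson's dataset), and the names `X_demo`, `y_demo`, and `demo_tree` are made up for illustration. `max_depth` is chosen before training, while the split features and thresholds only exist after `fit`.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic stand-in data, only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)

# Hyperparameter: set by you, available before any training happens
demo_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
print("Hyperparameter max_depth:", demo_tree.get_params()["max_depth"])

# Parameters: learned from the data during fit()
demo_tree.fit(X_demo, y_demo)
is_split = demo_tree.tree_.feature >= 0   # leaf nodes are marked with a negative value
print("Learned split features:  ", demo_tree.tree_.feature[is_split])
print("Learned split thresholds:", demo_tree.tree_.threshold[is_split].round(3))
print("Depth actually reached:  ", demo_tree.get_depth())
```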

### TRY IT - Setup

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

# Load and prepare data
df = pd.read_csv('../datasets/tomatjus.csv')
X = df.drop('quality', axis=1)
y = df['quality']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")

---

## Section 2: Manual Tuning with Loops

### READ

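Simplest approach: try different values in a loop and see which works best.

Note: the loop below scores every K on the test set, so the winning score is an optimistic estimate. Section 3 avoids this by using cross-validation on the training data.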

### TRY IT

In [None]:
# Try different K values for KNN
print("KNN: Testing different K values")
print("-" * 40)

best_k = 1
best_score = 0

for k in [1, 3, 5, 7, 9, 11, 15, 21]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_scaled, y_train)
    score = f1_score(y_test, knn.predict(X_test_scaled), average='weighted')
    print(f"K={k:2d}: F1 = {score:.3f}")
    
    if score > best_score:
        best_score = score
        best_k = k

print(f"\nBest: K={best_k} with F1={best_score:.3f}")
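
The loop above picks K by looking at the test set. One possible alternative, sketched here under assumptions, keeps the same loop but scores each K with `cross_val_score` on the training data only. The data comes from `make_classification` as a stand-in, and `X_demo`, `y_demo`, `cv_scores`, and `best_k_cv` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Assumption: synthetic stand-in for the scaled training data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=42
)

cv_scores = {}
for k in [1, 3, 5, 7, 9, 11, 15, 21]:
    knn = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validation: the test set is never touched here
    fold_scores = cross_val_score(knn, X_demo, y_demo, cv=5, scoring="f1_weighted")
    cv_scores[k] = fold_scores.mean()
    print(f"K={k:2d}: mean CV F1 = {cv_scores[k]:.3f}")

best_k_cv = max(cv_scores, key=cv_scores.get)
print(f"\nBest by cross-validation: K={best_k_cv}")
```

With this approach the test set is used once, at the very end, to report the score of the chosen K.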

---

## Section 3: GridSearchCV - The Better Way

### READ

**GridSearchCV** automates tuning with cross-validation:
1. Define a grid of hyperparameter values to try
2. For each combination, do cross-validation
3. Return the best combination
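
As a rough sketch of steps 1 and 2, the combinations in a grid can be listed with the standard library alone. The names `demo_grid`, `combos`, and `n_folds` are illustrative, and the grid values are just an example.

```python
from itertools import product

# Example grid: 5 values for n_neighbors x 2 values for weights
demo_grid = {
    "n_neighbors": [3, 5, 7, 9, 11],
    "weights": ["uniform", "distance"],
}

# Every combination of one value per hyperparameter
names = list(demo_grid)
combos = [dict(zip(names, values)) for values in product(*(demo_grid[n] for n in names))]

n_folds = 5
total_fits = len(combos) * n_folds   # one model per combination per fold

print(f"Combinations: {len(combos)}")
print(f"Model fits with {n_folds}-fold CV: {total_fits}")
print("First combination:", combos[0])
```

The number of fits grows multiplicatively with every hyperparameter you add, which is why larger grids get slow. With its default `refit=True`, GridSearchCV also trains one extra model at the end, using the best combination on all the training data.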

**Cross-validation**: Split the training data into folds, train on some, validate on the others, and average the scores. More reliable than a single split!
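
To see what the folds look like, here is a small sketch using `KFold` on ten made-up sample positions (`sample_ids` is an illustrative name). Each sample is used for validation exactly once. For classifiers, passing `cv=5` to GridSearchCV uses the stratified variant (`StratifiedKFold`), which also keeps the class proportions similar in every fold; plain `KFold` is used here only because it is easier to read.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten made-up samples, identified by their position 0..9
sample_ids = np.arange(10).reshape(-1, 1)

kfold = KFold(n_splits=5)
folds = list(kfold.split(sample_ids))

for i, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"Fold {i}: train on {train_idx.tolist()}, validate on {val_idx.tolist()}")
```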

### TRY IT

In [None]:
# Define the parameter grid
param_grid = {
    'n_neighbors': [3, 5, 7, 9, 11],
    'weights': ['uniform', 'distance']  # uniform = all equal, distance = closer = more weight
}

print("Parameter Grid:")
print(param_grid)
print(f"\nTotal combinations to try: {len(param_grid['n_neighbors']) * len(param_grid['weights'])}")

In [None]:
# Create GridSearchCV
knn = KNeighborsClassifier()

grid_search = GridSearchCV(
    estimator=knn,
    param_grid=param_grid,
    cv=5,                    # 5-fold cross-validation
    scoring='f1_weighted',   # Metric to optimize
    verbose=1                # Show progress
)

# Run the search
print("Running GridSearchCV...")
grid_search.fit(X_train_scaled, y_train)

# Results
print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.3f}")
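
Besides the single best result, a fitted search keeps the scores of every combination in `cv_results_`. The sketch below shows one way to view them as a table. It runs its own small search on synthetic data so that it is self-contained; `X_demo`, `y_demo`, `demo_search`, and `summary` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Assumption: synthetic stand-in data, only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

demo_search = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"]},
    cv=5,
    scoring="f1_weighted",
)
demo_search.fit(X_demo, y_demo)

# One row per combination, best rank first
results = pd.DataFrame(demo_search.cv_results_)
columns = ["param_n_neighbors", "param_weights",
           "mean_test_score", "std_test_score", "rank_test_score"]
summary = results[columns].sort_values("rank_test_score")
print(summary.to_string(index=False))
```

If the top few rows have nearly the same `mean_test_score` and the differences are smaller than `std_test_score`, the "best" combination may not be meaningfully better than the runners-up.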

In [None]:
# Use the best model
best_model = grid_search.best_estimator_
predictions = best_model.predict(X_test_scaled)

print("\nTest Set Performance with Tuned Model:")
print(f"F1-score: {f1_score(y_test, predictions, average='weighted'):.3f}")

---

## Section 4: Tuning Decision Tree

In [None]:
# Decision Tree hyperparameters
param_grid_tree = {
    'max_depth': [3, 5, 10, 15, None],      # How deep the tree can grow
    'min_samples_split': [2, 5, 10],         # Min samples to split a node
    'criterion': ['gini', 'entropy']         # How to measure split quality
}

tree = DecisionTreeClassifier(random_state=42)
grid_tree = GridSearchCV(tree, param_grid_tree, cv=5, scoring='f1_weighted')
grid_tree.fit(X_train_scaled, y_train)

print("Decision Tree Best Parameters:")
print(grid_tree.best_params_)
print(f"Best CV Score: {grid_tree.best_score_:.3f}")
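
To get a feel for why `max_depth` is worth tuning, the sketch below compares training accuracy with cross-validated accuracy at several depths. It assumes a noisy synthetic dataset from `make_classification`, and `depth_report` is an illustrative name. A large gap between the two numbers is a typical sign of overfitting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic data with some label noise (flip_y), only for illustration
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=5, flip_y=0.1, random_state=42
)

depth_report = {}
for depth in [1, 3, 5, 10, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    cv_acc = cross_val_score(tree, X_demo, y_demo, cv=5).mean()   # held-out folds
    train_acc = tree.fit(X_demo, y_demo).score(X_demo, y_demo)    # same data it learned from
    depth_report[depth] = (train_acc, cv_acc)
    print(f"max_depth={str(depth):>4}: train acc = {train_acc:.3f}, CV acc = {cv_acc:.3f}")
```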

---

## Section 5: Tuning Random Forest

In [None]:
# Random Forest hyperparameters (smaller grid for speed)
param_grid_rf = {
    'n_estimators': [50, 100, 200],     # Number of trees
    'max_depth': [5, 10, None],         # Tree depth
    'min_samples_split': [2, 5]         # Min samples to split
}

rf = RandomForestClassifier(random_state=42)
grid_rf = GridSearchCV(rf, param_grid_rf, cv=5, scoring='f1_weighted', verbose=1)
grid_rf.fit(X_train_scaled, y_train)

print("\nRandom Forest Best Parameters:")
print(grid_rf.best_params_)
print(f"Best CV Score: {grid_rf.best_score_:.3f}")

---

## Section 6: Comparing Baseline vs Tuned

In [None]:
# Baseline (default parameters)
rf_baseline = RandomForestClassifier(random_state=42)
rf_baseline.fit(X_train_scaled, y_train)
baseline_score = f1_score(y_test, rf_baseline.predict(X_test_scaled), average='weighted')

# Tuned model
tuned_score = f1_score(y_test, grid_rf.best_estimator_.predict(X_test_scaled), average='weighted')

print("="*40)
print("BASELINE vs TUNED")
print("="*40)
print(f"Baseline F1: {baseline_score:.3f}")
print(f"Tuned F1:    {tuned_score:.3f}")
print(f"Improvement: {(tuned_score - baseline_score)*100:.1f} percentage points")
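
The subtraction in the comparison above gives an absolute gain, measured in percentage points. A relative gain is a different number. The sketch below works through both with hypothetical scores (`baseline_demo` and `tuned_demo` are made-up values, not results from this lesson).

```python
# Hypothetical scores, chosen only to illustrate the arithmetic
baseline_demo = 0.80
tuned_demo = 0.84

# Absolute gain: difference between the two scores, in percentage points
point_gain = (tuned_demo - baseline_demo) * 100

# Relative gain: difference as a share of the baseline score
relative_gain = (tuned_demo - baseline_demo) / baseline_demo * 100

print(f"Absolute gain: {point_gain:.1f} percentage points")
print(f"Relative gain: {relative_gain:.1f}%")
```

When you report an improvement in your assignment, say which of the two you mean.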

---

## Quick Reference

```python
from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {
    'param1': [value1, value2],
    'param2': [value1, value2]
}

# Run grid search
grid = GridSearchCV(model, param_grid, cv=5, scoring='f1_weighted')
grid.fit(X_train, y_train)

# Get results
print(grid.best_params_)
print(grid.best_score_)
best_model = grid.best_estimator_
```

---

## Next Lesson

In **Lesson 08: Feature Selection**, you'll learn:
- Why selecting features matters
- Correlation-based selection
- SelectKBest
- Testing if it improves your model