# 3.4 **Tune** Random Forest Hyperparameters - Optimize Student Departure Prediction

## Model Cycle: The 5 Key Steps

### 1. Build the Model : Create the Random Forest pipeline.  
### 2. Train the Model : Fit the model on the training data.  
### 3. Generate Predictions : Use the trained model to make predictions.  
### 4. Evaluate the Model : Assess performance using evaluation metrics.  
### **5. Improve the Model : Tune hyperparameters for optimal performance.**

## Introduction

In the previous notebooks, we built and evaluated Random Forest models with default or manually selected hyperparameters. Now we systematically search for the **best-performing hyperparameters**, meaning the combination with the highest cross-validated score among the candidates we try.

Hyperparameter tuning is critical because:
1. Default values are not always optimal for your specific dataset
2. Small changes in hyperparameters can significantly impact performance
3. Hyperparameters interact, so the best value for one often depends on the values of the others

### Learning Objectives

By the end of this notebook, you will be able to:

1. Identify the key hyperparameters that affect Random Forest performance
2. Use Grid Search to systematically explore hyperparameter combinations
3. Use Randomized Search for efficient exploration of large search spaces
4. Interpret hyperparameter tuning results and select the best model
5. Compare tuned models to baseline models and logistic regression

## 1. Load Dependencies and Data

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import pandas as pd
import numpy as np
import pickle
import time
import warnings
warnings.filterwarnings('ignore')

import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MinMaxScaler, StandardScaler, OneHotEncoder
from sklearn.model_selection import (
    GridSearchCV, RandomizedSearchCV, cross_val_score, 
    StratifiedKFold, validation_curve
)
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, roc_curve, confusion_matrix, classification_report
)
from scipy.stats import randint, uniform

pd.options.display.max_columns = None

In [None]:
# Set up file paths
root_filepath = '/content/drive/MyDrive/projects/Applied-Data-Analytics-For-Higher-Education-Course-2/'
data_filepath = f'{root_filepath}data/'
course3_filepath = f'{root_filepath}course_3/'
models_path = f'{course3_filepath}models/'

In [None]:
# Load training and testing data
df_training = pd.read_csv(f'{data_filepath}training.csv')
df_testing = pd.read_csv(f'{data_filepath}testing.csv')

print(f"Training data: {df_training.shape}")
print(f"Testing data: {df_testing.shape}")

In [None]:
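# Define features and target
# Drop the target from the feature frames so it can never reach the model,
# even if the preprocessor's column lists change later.
X_train = df_training.drop(columns=['SEM_3_STATUS'])
y_train = df_training['SEM_3_STATUS']

X_test = df_testing.drop(columns=['SEM_3_STATUS'])
y_test = df_testing['SEM_3_STATUS']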

print(f"Training samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")

In [None]:
# Rebuild the preprocessing pipeline
minmax_columns = ['HS_GPA', 'GPA_1', 'GPA_2', 'DFW_RATE_1', 'DFW_RATE_2']
standard_columns = ['UNITS_ATTEMPTED_1', 'UNITS_ATTEMPTED_2']
categorical_columns = ['GENDER', 'RACE_ETHNICITY', 'FIRST_GEN_STATUS']

preprocessor = ColumnTransformer(
    transformers=[
        ('minmax', MinMaxScaler(), minmax_columns),
        ('standard', StandardScaler(), standard_columns),
        ('onehot', OneHotEncoder(handle_unknown='ignore', 
                                  drop=['Female', 'Other', 'Unknown'], 
                                  sparse_output=False), categorical_columns)
    ],
    remainder='drop'
)

print("Preprocessor configured.")

## 2. Understanding Hyperparameters

### 2.1 Key Hyperparameters to Tune

Random Forests have many hyperparameters, but these are the most impactful:

| Hyperparameter | Description | Typical Range | Effect on Model |
|:---------------|:------------|:--------------|:----------------|
| `n_estimators` | Number of trees | 100 - 1000 | More trees = more stable, slower |
| `max_depth` | Maximum tree depth | 5 - 50 or None | Limits complexity, prevents overfitting |
| `max_features` | Features per split | 'sqrt', 'log2', float | Controls randomness and diversity |
| `min_samples_split` | Min samples to split | 2 - 20 | Prevents overfitting |
| `min_samples_leaf` | Min samples in leaf | 1 - 10 | Controls tree size |

In [None]:
# Visualize hyperparameter effects
fig = make_subplots(rows=2, cols=2, subplot_titles=(
    'n_estimators: More Trees = More Stable',
    'max_depth: Deeper = More Complex',
    'max_features: Lower = More Diverse',
    'min_samples_split: Higher = Less Overfit'
))

# Simulated effects: hand-picked, illustrative values (not measured on the course data)
x1 = [10, 50, 100, 200, 500, 1000]
y1_train = [0.98, 0.97, 0.96, 0.955, 0.95, 0.95]
y1_test = [0.70, 0.78, 0.82, 0.83, 0.835, 0.84]

x2 = [3, 5, 10, 15, 20, None]
x2_display = [3, 5, 10, 15, 20, 25]
y2_train = [0.75, 0.85, 0.92, 0.95, 0.98, 0.99]
y2_test = [0.73, 0.82, 0.84, 0.83, 0.80, 0.78]

x3 = [0.1, 0.2, 0.3, 0.5, 0.7, 1.0]
y3_diversity = [0.95, 0.85, 0.75, 0.55, 0.35, 0.15]
y3_accuracy = [0.72, 0.78, 0.82, 0.84, 0.85, 0.83]

x4 = [2, 5, 10, 15, 20, 30]
y4_train = [0.99, 0.96, 0.92, 0.88, 0.85, 0.80]
y4_test = [0.78, 0.82, 0.84, 0.83, 0.82, 0.78]

# Plot 1: n_estimators
fig.add_trace(go.Scatter(x=x1, y=y1_train, name='Train', line=dict(color='lightblue')), row=1, col=1)
fig.add_trace(go.Scatter(x=x1, y=y1_test, name='Test', line=dict(color='darkblue')), row=1, col=1)

# Plot 2: max_depth
fig.add_trace(go.Scatter(x=x2_display, y=y2_train, name='Train', line=dict(color='lightblue'), showlegend=False), row=1, col=2)
fig.add_trace(go.Scatter(x=x2_display, y=y2_test, name='Test', line=dict(color='darkblue'), showlegend=False), row=1, col=2)

# Plot 3: max_features
fig.add_trace(go.Scatter(x=x3, y=y3_diversity, name='Diversity', line=dict(color='orange')), row=2, col=1)
fig.add_trace(go.Scatter(x=x3, y=y3_accuracy, name='Accuracy', line=dict(color='green')), row=2, col=1)

# Plot 4: min_samples_split
fig.add_trace(go.Scatter(x=x4, y=y4_train, name='Train', line=dict(color='lightblue'), showlegend=False), row=2, col=2)
fig.add_trace(go.Scatter(x=x4, y=y4_test, name='Test', line=dict(color='darkblue'), showlegend=False), row=2, col=2)

fig.update_layout(height=600, title_text='Hyperparameter Effects on Model Performance (Illustrative)')
fig.show()

### 2.2 Hyperparameter Interactions

Hyperparameters don't work in isolation—they interact:

- **n_estimators + max_features**: More trees can compensate for aggressive feature subsampling
- **max_depth + min_samples_split**: Both control tree complexity; may be redundant to tune both aggressively
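- **Class imbalance**: `class_weight='balanced'` reweights the classes when split quality is computed, so the best depth and leaf-size settings can differ from those of an unweighted forest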

This is why we search **combinations** of hyperparameters, not just individual values.

## 3. Grid Search

### 3.1 Setting Up Grid Search

**Grid Search** exhaustively tries all combinations of specified hyperparameter values.

```
param_grid = {
    'n_estimators': [100, 200, 300],  # 3 values
    'max_depth': [5, 10, 15],          # 3 values
    'max_features': ['sqrt', 'log2']   # 2 values
}

Total combinations: 3 × 3 × 2 = 18
With 5-fold CV: 18 × 5 = 90 model fits
```

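**Pros**: Thorough; guaranteed to find the best-scoring combination within the grid

**Cons**: Computationally expensive; the number of combinations is the product of the list lengths, so it grows exponentially with the number of hyperparameters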
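
The counting above can be reproduced with the standard library alone. This is a minimal sketch of how a grid expands into individual candidates, not scikit-learn's own implementation; the variable names are illustrative.

```python
from itertools import product

param_grid = {
    "n_estimators": [100, 200, 300],   # 3 values
    "max_depth": [5, 10, 15],          # 3 values
    "max_features": ["sqrt", "log2"],  # 2 values
}

# The Cartesian product of the value lists yields one dict per candidate.
names = list(param_grid)
combinations = [dict(zip(names, values))
                for values in product(*param_grid.values())]

n_folds = 5
total_fits = len(combinations) * n_folds

print(len(combinations), total_fits)  # 18 90
print(combinations[0])
```

With the default `refit=True`, `GridSearchCV` performs one additional fit of the winning combination on the full training set after the search finishes.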

In [None]:
# Create base pipeline for tuning
rf_pipeline = Pipeline([
    ('preprocessing', preprocessor),
    ('classifier', RandomForestClassifier(
        bootstrap=True,
        oob_score=True,
        class_weight='balanced',
        random_state=42,
        n_jobs=-1
    ))
])

print("Base pipeline created for hyperparameter tuning.")

In [None]:
# Define parameter grid
# Note: Use 'classifier__' prefix for pipeline parameters
param_grid = {
    'classifier__n_estimators': [100, 200, 300],
    'classifier__max_depth': [5, 10, 15, 20, None],
    'classifier__max_features': ['sqrt', 'log2', 0.3],
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 4]
}

# Calculate total combinations
total_combinations = 1
for key, values in param_grid.items():
    total_combinations *= len(values)
    print(f"{key}: {len(values)} values")

print(f"\nTotal combinations: {total_combinations}")
print(f"With 5-fold CV: {total_combinations * 5} model fits")

**Note**: With 405 combinations and 5-fold CV, this would require 2025 model fits. We'll use a smaller grid for demonstration.
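
The `classifier__` prefix in the grid comes from the pipeline step name: a step's parameters are addressed as `<step name>__<parameter name>`. The sketch below uses a hypothetical two-step pipeline (a plain `StandardScaler` stands in for the notebook's preprocessor) to list the valid names and set two of them.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in pipeline; only the step names matter here.
pipe = Pipeline([
    ("preprocessing", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
])

# get_params() lists every tunable name, including prefixed step parameters.
classifier_params = sorted(
    name for name in pipe.get_params() if name.startswith("classifier__")
)
print(classifier_params[:5])

# set_params accepts the same prefixed names that a search grid uses.
pipe.set_params(classifier__n_estimators=200, classifier__max_depth=10)
print(pipe.named_steps["classifier"].n_estimators)  # 200
```

A misspelled key in a search grid raises an error at fit time, so checking names this way first can save a failed run.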

In [None]:
# Reduced parameter grid for faster execution
param_grid_reduced = {
    'classifier__n_estimators': [100, 200],
    'classifier__max_depth': [10, 15, None],
    'classifier__max_features': ['sqrt', 'log2'],
    'classifier__min_samples_split': [2, 5],
}

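# Derive the count from the grid so it stays correct if the grid changes
total_reduced = int(np.prod([len(values) for values in param_grid_reduced.values()]))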
print(f"Reduced combinations: {total_reduced}")
print(f"With 5-fold CV: {total_reduced * 5} model fits")

### 3.2 Running Grid Search

In [None]:
# Set up cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Create Grid Search
grid_search = GridSearchCV(
    rf_pipeline,
    param_grid_reduced,
    cv=cv,
    scoring='roc_auc',  # Optimize for ROC-AUC
    n_jobs=-1,
    verbose=2,
    return_train_score=True
)

print("Grid Search configured.")
print(f"Optimizing for: ROC-AUC")

In [None]:
# Run Grid Search
print("Running Grid Search...")
print("="*50)

start_time = time.time()
grid_search.fit(X_train, y_train)
grid_time = time.time() - start_time

print(f"\nGrid Search completed in {grid_time:.2f} seconds")
print(f"Best ROC-AUC score: {grid_search.best_score_:.4f}")

In [None]:
# Display best parameters
print("Best Hyperparameters:")
print("="*50)
for param, value in grid_search.best_params_.items():
    param_name = param.replace('classifier__', '')
    print(f"{param_name}: {value}")

### 3.3 Analyzing Grid Search Results

In [None]:
# Convert results to DataFrame
results_df = pd.DataFrame(grid_search.cv_results_)

# Select relevant columns
display_cols = [
    'param_classifier__n_estimators',
    'param_classifier__max_depth',
    'param_classifier__max_features',
    'param_classifier__min_samples_split',
    'mean_train_score',
    'mean_test_score',
    'std_test_score',
    'rank_test_score'
]

# Display top 10 configurations
top_results = results_df[display_cols].sort_values('rank_test_score').head(10)
top_results.columns = ['n_estimators', 'max_depth', 'max_features', 'min_samples_split', 
                       'Train Score', 'Test Score', 'Std', 'Rank']
top_results

In [None]:
# Visualize Grid Search results
fig = go.Figure()

# Add all results as scatter
fig.add_trace(go.Scatter(
    x=results_df['mean_train_score'],
    y=results_df['mean_test_score'],
    mode='markers',
    marker=dict(
        size=10,
        color=results_df['rank_test_score'],
        colorscale='Viridis_r',
        showscale=True,
        colorbar=dict(title='Rank')
    ),
    text=[f"n_est={row['param_classifier__n_estimators']}, depth={row['param_classifier__max_depth']}" 
          for _, row in results_df.iterrows()],
    hovertemplate='Train: %{x:.4f}<br>Test: %{y:.4f}<br>%{text}<extra></extra>',
    name='Configurations'
))

# Add diagonal line
fig.add_trace(go.Scatter(
    x=[0.7, 1.0], y=[0.7, 1.0],
    mode='lines',
    line=dict(color='gray', dash='dash'),
    name='Perfect Generalization'
))

# Mark best configuration
best_idx = results_df['rank_test_score'].idxmin()
fig.add_trace(go.Scatter(
    x=[results_df.loc[best_idx, 'mean_train_score']],
    y=[results_df.loc[best_idx, 'mean_test_score']],
    mode='markers',
    marker=dict(size=20, color='red', symbol='star'),
    name='Best Configuration'
))

fig.update_layout(
    title='Grid Search Results: Train vs Test Scores',
    xaxis_title='Mean Train Score (ROC-AUC)',
    yaxis_title='Mean Test Score (ROC-AUC)',
    height=500
)

fig.show()
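
In the scatter plot, a point's horizontal distance from the diagonal is its train-test gap. The sketch below computes that gap as a column, using hypothetical scores for three configurations (these are made-up numbers, not results from this notebook).

```python
import pandas as pd

# Hypothetical scores for three configurations (illustration only).
results = pd.DataFrame({
    "config": ["A", "B", "C"],
    "mean_train_score": [0.99, 0.93, 0.88],
    "mean_test_score": [0.84, 0.85, 0.83],
})

# A large gap means the configuration fits the training folds much better
# than the held-out folds, which is a sign of overfitting.
results["gap"] = results["mean_train_score"] - results["mean_test_score"]

ranked = results.sort_values("mean_test_score", ascending=False)
print(ranked.round(3))
```

The same two lines work on a frame built from `cv_results_`, provided the search was run with `return_train_score=True`.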

In [None]:
# Analyze hyperparameter effects
fig = make_subplots(rows=2, cols=2, subplot_titles=(
    'Effect of n_estimators',
    'Effect of max_depth',
    'Effect of max_features',
    'Effect of min_samples_split'
))

# n_estimators effect
n_est_means = results_df.groupby('param_classifier__n_estimators')['mean_test_score'].mean()
fig.add_trace(go.Bar(x=[str(x) for x in n_est_means.index], y=n_est_means.values, 
                     marker_color='darkblue'), row=1, col=1)

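# max_depth effect
# Cast to str first: groupby drops missing keys by default, which would hide max_depth=None
depth_key = results_df['param_classifier__max_depth'].astype(str)
depth_means = results_df.groupby(depth_key)['mean_test_score'].mean()
fig.add_trace(go.Bar(x=[str(x) for x in depth_means.index], y=depth_means.values,
                     marker_color='darkgreen'), row=1, col=2)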

# max_features effect
feat_means = results_df.groupby('param_classifier__max_features')['mean_test_score'].mean()
fig.add_trace(go.Bar(x=[str(x) for x in feat_means.index], y=feat_means.values,
                     marker_color='darkorange'), row=2, col=1)

# min_samples_split effect
split_means = results_df.groupby('param_classifier__min_samples_split')['mean_test_score'].mean()
fig.add_trace(go.Bar(x=[str(x) for x in split_means.index], y=split_means.values,
                     marker_color='darkred'), row=2, col=2)

fig.update_layout(
    title='Average Test Score by Hyperparameter Value',
    height=500,
    showlegend=False
)
fig.update_yaxes(title_text='Mean ROC-AUC')

fig.show()

## 4. Randomized Search

### 4.1 When to Use Randomized Search

**Randomized Search** samples a fixed number of random combinations from specified distributions.

**Advantages over Grid Search**:
- More efficient for large search spaces
- Can explore continuous distributions
- Often finds good solutions faster
- Doesn't require specifying exact values

**Rule of thumb**: Use Randomized Search when:
- The grid would have more than roughly 100-200 combinations
- Some hyperparameters are continuous
- You want to explore broadly first, then refine around the best region
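
A short calculation shows why a modest number of random draws is often enough. It assumes each candidate is drawn independently and that "top 5%" is measured under the sampling distribution; the function names are illustrative.

```python
import math

def prob_hit_top_fraction(n_iter: int, top_fraction: float) -> float:
    """P(at least one of n_iter independent draws lands in the top fraction)."""
    return 1.0 - (1.0 - top_fraction) ** n_iter

def draws_needed(top_fraction: float, confidence: float) -> int:
    """Smallest n_iter whose hit probability reaches the given confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - top_fraction))

# 50 draws: chance that at least one falls in the best 5% of the space.
print(round(prob_hit_top_fraction(50, 0.05), 3))  # 0.923

# Draws needed for a 95% chance of landing in the best 5%.
print(draws_needed(0.05, 0.95))  # 59
```

This bound does not depend on how many hyperparameters are searched, which is where random sampling gains over a grid.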

In [None]:
# Define parameter distributions for randomized search
param_distributions = {
    'classifier__n_estimators': randint(50, 500),  # Random integer in [50, 499]; upper bound is exclusive
    'classifier__max_depth': [5, 10, 15, 20, 25, 30, None],
    'classifier__max_features': ['sqrt', 'log2', 0.2, 0.3, 0.4, 0.5],
    'classifier__min_samples_split': randint(2, 20),  # Random integer in [2, 19]
    'classifier__min_samples_leaf': randint(1, 10)  # Random integer in [1, 9]
}

print("Parameter distributions defined for Randomized Search")
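
One detail of `scipy.stats.randint(low, high)` is easy to miss: the upper bound is exclusive. The sketch below draws a large sample to show the range actually covered; the variable names are illustrative.

```python
from scipy.stats import randint

# Same distribution as used for n_estimators: integers from 50 up to 499.
n_estimators_dist = randint(50, 500)

# Draw many values with a fixed seed and inspect the observed range.
samples = n_estimators_dist.rvs(size=10_000, random_state=42)
print(samples.min(), samples.max())

# The value 500 itself is never produced.
print(500 in samples)  # False
```

To include 500 as a possible value, the distribution would be written as `randint(50, 501)`.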

### 4.2 Implementing Randomized Search

In [None]:
# Create Randomized Search
random_search = RandomizedSearchCV(
    rf_pipeline,
    param_distributions,
    n_iter=50,  # Number of random combinations to try
    cv=cv,
    scoring='roc_auc',
    n_jobs=-1,
    verbose=1,
    random_state=42,
    return_train_score=True
)

print("Randomized Search configured.")
print(f"Sampling {random_search.n_iter} random combinations")

In [None]:
# Run Randomized Search
print("Running Randomized Search...")
print("="*50)

start_time = time.time()
random_search.fit(X_train, y_train)
random_time = time.time() - start_time

print(f"\nRandomized Search completed in {random_time:.2f} seconds")
print(f"Best ROC-AUC score: {random_search.best_score_:.4f}")

In [None]:
# Display best parameters from Randomized Search
print("Best Hyperparameters (Randomized Search):")
print("="*50)
for param, value in random_search.best_params_.items():
    param_name = param.replace('classifier__', '')
    print(f"{param_name}: {value}")

In [None]:
# Compare Grid Search vs Randomized Search
print("\nComparison: Grid Search vs Randomized Search")
print("="*50)
print(f"Grid Search:")
print(f"  Best Score: {grid_search.best_score_:.4f}")
print(f"  Time: {grid_time:.2f} seconds")
print(f"  Combinations tested: {len(grid_search.cv_results_['mean_test_score'])}")
print(f"\nRandomized Search:")
print(f"  Best Score: {random_search.best_score_:.4f}")
print(f"  Time: {random_time:.2f} seconds")
print(f"  Combinations tested: {len(random_search.cv_results_['mean_test_score'])}")

## 5. Tuning Individual Hyperparameters

### 5.1 Tuning n_estimators

Let's analyze how performance changes with the number of trees.

In [None]:
# Analyze n_estimators using validation curve
n_estimators_range = [10, 25, 50, 100, 150, 200, 300, 500]

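# Create a simple RF for validation curve (no pipeline for speed)
# First, preprocess the data once. Caveat: this fits the scalers on all of
# X_train before cross-validation. The effect is negligible for tree models,
# which are insensitive to monotone rescaling, but it would bias the scores
# of scale-sensitive models.
X_train_preprocessed = preprocessor.fit_transform(X_train)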

rf_simple = RandomForestClassifier(
    class_weight='balanced',
    random_state=42,
    n_jobs=-1
)

print("Computing validation curve for n_estimators...")
train_scores, test_scores = validation_curve(
    rf_simple, X_train_preprocessed, y_train,
    param_name='n_estimators',
    param_range=n_estimators_range,
    cv=5,
    scoring='roc_auc',
    n_jobs=-1
)

print("Validation curve computed.")

In [None]:
# Plot validation curve for n_estimators
fig = go.Figure()

# Training scores
fig.add_trace(go.Scatter(
    x=n_estimators_range,
    y=train_scores.mean(axis=1),
    mode='lines+markers',
    name='Training Score',
    line=dict(color='lightblue', width=2)
))

# Fill area for training std
fig.add_trace(go.Scatter(
    x=n_estimators_range + n_estimators_range[::-1],
    y=list(train_scores.mean(axis=1) + train_scores.std(axis=1)) + 
      list(train_scores.mean(axis=1) - train_scores.std(axis=1))[::-1],
    fill='toself',
    fillcolor='rgba(173,216,230,0.3)',
    line=dict(color='rgba(255,255,255,0)'),
    name='Training Std',
    showlegend=False
))

# Test scores
fig.add_trace(go.Scatter(
    x=n_estimators_range,
    y=test_scores.mean(axis=1),
    mode='lines+markers',
    name='Cross-Validation Score',
    line=dict(color='darkblue', width=2)
))

# Fill area for test std
fig.add_trace(go.Scatter(
    x=n_estimators_range + n_estimators_range[::-1],
    y=list(test_scores.mean(axis=1) + test_scores.std(axis=1)) + 
      list(test_scores.mean(axis=1) - test_scores.std(axis=1))[::-1],
    fill='toself',
    fillcolor='rgba(0,0,139,0.2)',
    line=dict(color='rgba(255,255,255,0)'),
    name='CV Std',
    showlegend=False
))

fig.update_layout(
    title='Validation Curve: n_estimators',
    xaxis_title='Number of Trees (n_estimators)',
    yaxis_title='ROC-AUC Score',
    height=450
)

fig.show()
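
`validation_curve` also accepts a pipeline, which keeps preprocessing inside each fold at the cost of extra compute. The sketch below shows the call on synthetic stand-in data (an assumption; it is not the course dataset), with a plain `StandardScaler` in place of the notebook's preprocessor.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the course dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

pipe = Pipeline([
    ("preprocessing", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
])

# With a pipeline, the swept parameter needs the step-name prefix.
n_estimators_demo = [10, 50, 100]
train_scores_demo, cv_scores_demo = validation_curve(
    pipe, X_demo, y_demo,
    param_name="classifier__n_estimators",
    param_range=n_estimators_demo,
    cv=3,
    scoring="roc_auc",
)

# One row per parameter value, one column per fold.
for n, score in zip(n_estimators_demo, cv_scores_demo.mean(axis=1)):
    print(f"n_estimators={n}: mean CV ROC-AUC = {score:.3f}")
```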

### 5.2 Tuning max_depth

In [None]:
# Validation curve for max_depth
max_depth_range = [3, 5, 7, 10, 12, 15, 20, 25, 30]

rf_depth = RandomForestClassifier(
    n_estimators=100,
    class_weight='balanced',
    random_state=42,
    n_jobs=-1
)

print("Computing validation curve for max_depth...")
train_scores_depth, test_scores_depth = validation_curve(
    rf_depth, X_train_preprocessed, y_train,
    param_name='max_depth',
    param_range=max_depth_range,
    cv=5,
    scoring='roc_auc',
    n_jobs=-1
)

print("Validation curve computed.")

In [None]:
# Plot validation curve for max_depth
fig = go.Figure()

fig.add_trace(go.Scatter(
    x=max_depth_range,
    y=train_scores_depth.mean(axis=1),
    mode='lines+markers',
    name='Training Score',
    line=dict(color='lightgreen', width=2)
))

fig.add_trace(go.Scatter(
    x=max_depth_range,
    y=test_scores_depth.mean(axis=1),
    mode='lines+markers',
    name='Cross-Validation Score',
    line=dict(color='darkgreen', width=2)
))

# Find optimal depth
optimal_idx = test_scores_depth.mean(axis=1).argmax()
optimal_depth = max_depth_range[optimal_idx]

fig.add_vline(x=optimal_depth, line_dash="dash", line_color="red",
              annotation_text=f"Optimal: {optimal_depth}")

fig.update_layout(
    title='Validation Curve: max_depth',
    xaxis_title='Maximum Tree Depth (max_depth)',
    yaxis_title='ROC-AUC Score',
    height=450
)

fig.show()

### 5.3 Tuning max_features

In [None]:
# Test different max_features values
max_features_values = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

rf_features = RandomForestClassifier(
    n_estimators=100,
    class_weight='balanced',
    random_state=42,
    n_jobs=-1
)

print("Computing validation curve for max_features...")
train_scores_feat, test_scores_feat = validation_curve(
    rf_features, X_train_preprocessed, y_train,
    param_name='max_features',
    param_range=max_features_values,
    cv=5,
    scoring='roc_auc',
    n_jobs=-1
)

print("Validation curve computed.")

In [None]:
# Plot validation curve for max_features
fig = go.Figure()

fig.add_trace(go.Scatter(
    x=max_features_values,
    y=train_scores_feat.mean(axis=1),
    mode='lines+markers',
    name='Training Score',
    line=dict(color='lightsalmon', width=2)
))

fig.add_trace(go.Scatter(
    x=max_features_values,
    y=test_scores_feat.mean(axis=1),
    mode='lines+markers',
    name='Cross-Validation Score',
    line=dict(color='darkorange', width=2)
))

# Mark sqrt and log2 equivalents (approximately)
n_features = X_train_preprocessed.shape[1]
sqrt_equiv = np.sqrt(n_features) / n_features
log2_equiv = np.log2(n_features) / n_features

fig.add_vline(x=sqrt_equiv, line_dash="dot", line_color="purple",
              annotation_text=f"sqrt ({sqrt_equiv:.2f})")
fig.add_vline(x=log2_equiv, line_dash="dot", line_color="gray",
              annotation_text=f"log2 ({log2_equiv:.2f})",
              annotation_position="bottom right")

fig.update_layout(
    title='Validation Curve: max_features (fraction of features)',
    xaxis_title='Fraction of Features (max_features)',
    yaxis_title='ROC-AUC Score',
    height=450
)

fig.show()
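
The `sqrt` and `log2` markers are approximate because scikit-learn truncates to a whole number of features. The sketch below mirrors the conversion rule described in the scikit-learn documentation for `max_features`; the helper name `resolve_max_features` is hypothetical and not part of the library.

```python
import math

def resolve_max_features(max_features, n_features: int) -> int:
    """Number of features considered per split (hypothetical helper)."""
    if max_features is None:
        return n_features                      # use every feature
    if max_features == "sqrt":
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    if isinstance(max_features, float):
        # A float is a fraction of the features, truncated, at least 1.
        return max(1, int(max_features * n_features))
    return int(max_features)                   # an integer count is used as given

for setting in ["sqrt", "log2", 0.5, 0.05, None, 4]:
    print(f"{setting!r:>6} -> {resolve_max_features(setting, 10)} of 10 features")
```

With 10 features, `'sqrt'` and `'log2'` both resolve to 3, so a search over both would test the same forest twice.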

### 5.4 Tuning min_samples_split and min_samples_leaf

In [None]:
# Validation curve for min_samples_split
min_split_range = [2, 5, 10, 15, 20, 30, 50]

rf_split = RandomForestClassifier(
    n_estimators=100,
    class_weight='balanced',
    random_state=42,
    n_jobs=-1
)

print("Computing validation curve for min_samples_split...")
train_scores_split, test_scores_split = validation_curve(
    rf_split, X_train_preprocessed, y_train,
    param_name='min_samples_split',
    param_range=min_split_range,
    cv=5,
    scoring='roc_auc',
    n_jobs=-1
)

print("Validation curve computed.")

In [None]:
# Plot validation curves for min_samples_split
fig = go.Figure()

fig.add_trace(go.Scatter(
    x=min_split_range,
    y=train_scores_split.mean(axis=1),
    mode='lines+markers',
    name='Training Score',
    line=dict(color='lightcoral', width=2)
))

fig.add_trace(go.Scatter(
    x=min_split_range,
    y=test_scores_split.mean(axis=1),
    mode='lines+markers',
    name='Cross-Validation Score',
    line=dict(color='darkred', width=2)
))

fig.update_layout(
    title='Validation Curve: min_samples_split',
    xaxis_title='Minimum Samples to Split (min_samples_split)',
    yaxis_title='ROC-AUC Score',
    height=450
)

fig.show()
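
`min_samples_leaf` can be examined the same way. The sketch below runs the sweep on synthetic stand-in data (an assumption; it is not the course dataset) and picks the value with the highest mean cross-validated score.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

# Synthetic stand-in data (not the course dataset).
X_leaf, y_leaf = make_classification(n_samples=300, n_features=8, random_state=0)

min_leaf_range = [1, 2, 4, 8, 16]
train_scores_leaf, cv_scores_leaf = validation_curve(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X_leaf, y_leaf,
    param_name="min_samples_leaf",
    param_range=min_leaf_range,
    cv=3,
    scoring="roc_auc",
)

# Larger leaves smooth the trees: training scores fall as the value grows.
best_leaf = min_leaf_range[cv_scores_leaf.mean(axis=1).argmax()]
print(f"Best min_samples_leaf on the demo data: {best_leaf}")
```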

## 6. Final Model Selection

### 6.1 Best Hyperparameters

Based on our tuning experiments, let's select the best hyperparameters.

In [None]:
# Compare best parameters from different searches
print("Best Hyperparameters Comparison:")
print("="*60)
print("\nGrid Search Best:")
for param, value in grid_search.best_params_.items():
    print(f"  {param.replace('classifier__', '')}: {value}")
print(f"  Score: {grid_search.best_score_:.4f}")

print("\nRandomized Search Best:")
for param, value in random_search.best_params_.items():
    print(f"  {param.replace('classifier__', '')}: {value}")
print(f"  Score: {random_search.best_score_:.4f}")

In [None]:
# Select the best model
if random_search.best_score_ >= grid_search.best_score_:
    best_model = random_search.best_estimator_
    best_params = random_search.best_params_
    best_score = random_search.best_score_
    best_source = 'Randomized Search'
else:
    best_model = grid_search.best_estimator_
    best_params = grid_search.best_params_
    best_score = grid_search.best_score_
    best_source = 'Grid Search'

print(f"\nSelected Best Model from: {best_source}")
print(f"Best CV Score: {best_score:.4f}")

### 6.2 Final Model Evaluation

In [None]:
# Evaluate on test set
y_pred = best_model.predict(X_test)
y_prob = best_model.predict_proba(X_test)[:, 1]

# Calculate metrics
test_metrics = {
    'Accuracy': accuracy_score(y_test, y_pred),
    'Precision': precision_score(y_test, y_pred),
    'Recall': recall_score(y_test, y_pred),
    'F1-Score': f1_score(y_test, y_pred),
    'ROC-AUC': roc_auc_score(y_test, y_prob)
}

print("Final Model Test Set Performance:")
print("="*50)
for metric, value in test_metrics.items():
    print(f"{metric}: {value:.4f}")

In [None]:
# Detailed classification report
print("\nClassification Report:")
print("="*50)
print(classification_report(y_test, y_pred, target_names=['Retained', 'Departed']))

In [None]:
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)

fig = go.Figure(go.Heatmap(
    z=cm,
    x=['Predicted Retained', 'Predicted Departed'],
    y=['Actual Retained', 'Actual Departed'],
    colorscale='Blues',
    text=cm,
    texttemplate='%{text}',
    textfont=dict(size=16)
))

fig.update_layout(
    title='Confusion Matrix - Tuned Random Forest',
    height=400
)

fig.show()
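
Each headline metric is simple arithmetic on the four cells of the confusion matrix. The sketch below works through it with hypothetical counts (not results from this notebook), laid out the way `confusion_matrix` returns them.

```python
# Hypothetical counts (illustration only). Rows are actual classes,
# columns are predicted classes, matching sklearn's confusion_matrix.
cm_demo = [[140, 10],   # actual Retained: TN, FP
           [20, 30]]    # actual Departed: FN, TP

(tn, fp), (fn, tp) = cm_demo

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are right
precision = tp / (tp + fp)                   # share of predicted departures that are real
recall = tp / (tp + fn)                      # share of real departures that were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")   # 0.8500
print(f"Precision: {precision:.4f}")  # 0.7500
print(f"Recall:    {recall:.4f}")     # 0.6000
print(f"F1-Score:  {f1:.4f}")         # 0.6667
```

In this example accuracy looks strong while recall shows that 40% of departing students were missed, which is why accuracy alone can mislead on imbalanced data.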

In [None]:
# ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
auc = roc_auc_score(y_test, y_prob)

fig = go.Figure()

fig.add_trace(go.Scatter(
    x=fpr, y=tpr,
    mode='lines',
    name=f'Tuned RF (AUC={auc:.4f})',
    line=dict(color='darkblue', width=3)
))

fig.add_trace(go.Scatter(
    x=[0, 1], y=[0, 1],
    mode='lines',
    name='Random (AUC=0.5)',
    line=dict(color='gray', dash='dash')
))

fig.update_layout(
    title='ROC Curve - Tuned Random Forest',
    xaxis_title='False Positive Rate',
    yaxis_title='True Positive Rate',
    height=450
)

fig.show()
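
ROC-AUC has a direct interpretation: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it by brute force over all pairs on a tiny made-up example; `pairwise_auc` is an illustrative helper, not a library function.

```python
from itertools import product

def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

# Tiny made-up example: two negatives, two positives.
y_demo = [0, 0, 1, 1]
scores_demo = [0.1, 0.4, 0.35, 0.8]

# Three of the four positive/negative pairs are ordered correctly.
print(pairwise_auc(y_demo, scores_demo))  # 0.75
```

Because only the ordering of scores matters, ROC-AUC is unaffected by the classification threshold, unlike precision and recall.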

## 7. Comparing to Baseline and Logistic Regression

In [None]:
# Load baseline RF model (the context manager closes the file handle)
with open(f'{models_path}rf_baseline_model.pkl', 'rb') as f:
    baseline_rf = pickle.load(f)
# Refit on the current training data
baseline_rf.fit(X_train, y_train)

# Baseline predictions
y_pred_baseline = baseline_rf.predict(X_test)
y_prob_baseline = baseline_rf.predict_proba(X_test)[:, 1]

In [None]:
# Try to load logistic regression model if available
try:
    with open(f'{models_path}l2_ridge_logistic_model.pkl', 'rb') as f:
        lr_model = pickle.load(f)
    lr_model.fit(X_train, y_train)
    y_pred_lr = lr_model.predict(X_test)
    y_prob_lr = lr_model.predict_proba(X_test)[:, 1]
    lr_available = True
    print("Logistic Regression model loaded.")
except Exception as exc:  # a bare except would also swallow KeyboardInterrupt
    lr_available = False
    print(f"Logistic Regression model not available for comparison ({type(exc).__name__}: {exc}).")

In [None]:
# Create comparison table
comparison_data = [
    {
        'Model': 'RF Baseline',
        'Accuracy': accuracy_score(y_test, y_pred_baseline),
        'Precision': precision_score(y_test, y_pred_baseline),
        'Recall': recall_score(y_test, y_pred_baseline),
        'F1-Score': f1_score(y_test, y_pred_baseline),
        'ROC-AUC': roc_auc_score(y_test, y_prob_baseline)
    },
    {
        'Model': 'RF Tuned',
        'Accuracy': test_metrics['Accuracy'],
        'Precision': test_metrics['Precision'],
        'Recall': test_metrics['Recall'],
        'F1-Score': test_metrics['F1-Score'],
        'ROC-AUC': test_metrics['ROC-AUC']
    }
]

if lr_available:
    comparison_data.append({
        'Model': 'Logistic Regression (L2)',
        'Accuracy': accuracy_score(y_test, y_pred_lr),
        'Precision': precision_score(y_test, y_pred_lr),
        'Recall': recall_score(y_test, y_pred_lr),
        'F1-Score': f1_score(y_test, y_pred_lr),
        'ROC-AUC': roc_auc_score(y_test, y_prob_lr)
    })

comparison_df = pd.DataFrame(comparison_data)
comparison_df

In [None]:
# Visualize comparison
fig = go.Figure()

metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'ROC-AUC']
colors = ['lightblue', 'darkblue', 'green']

for i, row in comparison_df.iterrows():
    values = [row[m] for m in metrics]
    fig.add_trace(go.Bar(
        name=row['Model'],
        x=metrics,
        y=values,
        marker_color=colors[i]
    ))

fig.update_layout(
    title='Model Comparison: Baseline vs Tuned RF vs Logistic Regression',
    xaxis_title='Metric',
    yaxis_title='Score',
    barmode='group',
    height=450,
    yaxis=dict(range=[0, 1])
)

fig.show()

In [None]:
# ROC curves comparison
fig = go.Figure()

# Baseline RF
fpr_base, tpr_base, _ = roc_curve(y_test, y_prob_baseline)
auc_base = roc_auc_score(y_test, y_prob_baseline)
fig.add_trace(go.Scatter(
    x=fpr_base, y=tpr_base,
    mode='lines',
    name=f'RF Baseline (AUC={auc_base:.4f})',
    line=dict(color='lightblue', width=2)
))

# Tuned RF
fig.add_trace(go.Scatter(
    x=fpr, y=tpr,
    mode='lines',
    name=f'RF Tuned (AUC={auc:.4f})',
    line=dict(color='darkblue', width=2)
))

# Logistic Regression if available
if lr_available:
    fpr_lr, tpr_lr, _ = roc_curve(y_test, y_prob_lr)
    auc_lr = roc_auc_score(y_test, y_prob_lr)
    fig.add_trace(go.Scatter(
        x=fpr_lr, y=tpr_lr,
        mode='lines',
        name=f'Logistic Regression (AUC={auc_lr:.4f})',
        line=dict(color='green', width=2)
    ))

# Random baseline
fig.add_trace(go.Scatter(
    x=[0, 1], y=[0, 1],
    mode='lines',
    name='Random (AUC=0.5)',
    line=dict(color='gray', dash='dash')
))

fig.update_layout(
    title='ROC Curve Comparison',
    xaxis_title='False Positive Rate',
    yaxis_title='True Positive Rate',
    height=500
)

fig.show()

In [None]:
# Save the tuned model (context managers flush and close the files)
tuned_model_path = f'{models_path}rf_tuned_best_model.pkl'
with open(tuned_model_path, 'wb') as f:
    pickle.dump(best_model, f)
print(f"Tuned model saved to: {tuned_model_path}")

# Save best parameters
params_path = f'{models_path}rf_best_params.pkl'
with open(params_path, 'wb') as f:
    pickle.dump(best_params, f)
print(f"Best parameters saved to: {params_path}")

## 8. Summary

In this notebook, we systematically tuned Random Forest hyperparameters for student departure prediction.

### Key Findings

#### Best Hyperparameters

In [None]:
print("Optimal Hyperparameters:")
print("="*50)
for param, value in best_params.items():
    print(f"{param.replace('classifier__', '')}: {value}")

### Tuning Methods Comparison

| Method | Pros | Cons | Best For |
|:-------|:-----|:-----|:---------|
| **Grid Search** | Thorough, exhaustive | Slow, exponential growth | Small search spaces |
| **Randomized Search** | Efficient, explores broadly | May miss the best combination | Large search spaces |
| **Validation Curves** | Visual understanding | Single parameter at a time | Initial exploration |

### Hyperparameter Insights

| Hyperparameter | Typical Finding (confirm against your own validation curves) |
|:---------------|:--------|
| `n_estimators` | More trees help, with diminishing returns, often after ~200 |
| `max_depth` | Moderate depth limits overfitting |
| `max_features` | 'sqrt' or a similar fraction usually works well for classification |
| `min_samples_split` | Small values (2-5) often work best |

### Performance Improvement

Tuning typically improves model performance by:
- Reducing overfitting through depth/sample constraints
- Finding the optimal trade-off between bias and variance
- Balancing tree diversity and individual accuracy

### Next Steps

With our tuned Random Forest model, you can:
1. Deploy the model for student departure predictions
2. Compare with other ensemble methods (Gradient Boosting)
3. Explore more advanced feature engineering

**Module Complete!** You have successfully learned to build, train, evaluate, and tune Random Forest models for classification.