# 4.5 Tune Boosted Model Hyperparameters

## Introduction

In the previous notebooks, we built and evaluated gradient boosting models with default or manually chosen hyperparameters. Now we focus on **systematic hyperparameter tuning** to optimize model performance.

Hyperparameter tuning is crucial because:
- Default parameters are general-purpose, not optimal for your specific data
- The right hyperparameters can significantly improve performance
- Careful tuning helps control overfitting without giving up predictive power

We'll explore three tuning strategies:
1. **Grid Search**: Exhaustive search over specified parameter values
2. **Randomized Search**: Random sampling from parameter distributions
3. **Bayesian Optimization**: Model-guided search that uses earlier results to choose the next trial (Optuna)

### Learning Objectives

By the end of this notebook, you will be able to:

1. Identify the most important hyperparameters for gradient boosting
2. Implement grid search and randomized search for hyperparameter tuning
3. Use Optuna for efficient Bayesian optimization
4. Analyze and visualize hyperparameter search results
5. Build a final optimized model for production use

## 1. Setup and Data Preparation

### 1.1 Import Libraries

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import time
import warnings
warnings.filterwarnings('ignore')

# Visualization
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Scikit-learn
from sklearn.model_selection import (
    train_test_split, cross_val_score, StratifiedKFold,
    GridSearchCV, RandomizedSearchCV
)
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, make_scorer
)
from scipy.stats import uniform, randint, loguniform

# Gradient Boosting Libraries
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

# Optuna for Bayesian optimization
import optuna
from optuna.samplers import TPESampler

print("Libraries loaded successfully!")

### 1.2 Load and Prepare Data

In [None]:
# Generate synthetic student data
np.random.seed(42)
n_students = 3000

data = {
    'STUDENT_ID': range(1, n_students + 1),
    'HS_GPA': np.random.normal(3.2, 0.5, n_students).clip(2.0, 4.0),
    'MATH_PLACEMENT': np.random.choice(['Remedial', 'College-Ready', 'Advanced'], n_students, p=[0.2, 0.5, 0.3]),
    'FIRST_GEN': np.random.choice(['Yes', 'No'], n_students, p=[0.35, 0.65]),
    'PELL_ELIGIBLE': np.random.choice(['Yes', 'No'], n_students, p=[0.40, 0.60]),
    'RESIDENCY': np.random.choice(['In-State', 'Out-of-State', 'International'], n_students, p=[0.7, 0.2, 0.1]),
    'UNITS_ATTEMPT_1': np.random.normal(14, 2, n_students).clip(6, 18).astype(int),
    'GPA_1': np.random.normal(2.8, 0.7, n_students).clip(0.0, 4.0),
    'DFW_RATE_1': np.random.beta(2, 8, n_students),
    'UNITS_ATTEMPT_2': np.random.normal(14, 2, n_students).clip(6, 18).astype(int),
    'GPA_2': np.random.normal(2.9, 0.6, n_students).clip(0.0, 4.0),
    'DFW_RATE_2': np.random.beta(2, 8, n_students),
}

df = pd.DataFrame(data)

# Derived features
df['CUM_GPA'] = (df['GPA_1'] + df['GPA_2']) / 2
df['CUM_UNITS'] = df['UNITS_ATTEMPT_1'] + df['UNITS_ATTEMPT_2']
df['AVG_DFW'] = (df['DFW_RATE_1'] + df['DFW_RATE_2']) / 2
df['GPA_TREND'] = df['GPA_2'] - df['GPA_1']

# Generate target
departure_prob = (
    0.3 - 0.15 * (df['CUM_GPA'] - 2.5) + 0.3 * df['AVG_DFW']
    + 0.05 * (df['FIRST_GEN'] == 'Yes') - 0.02 * (df['HS_GPA'] - 3.0)
    + 0.05 * (df['MATH_PLACEMENT'] == 'Remedial') - 0.05 * df['GPA_TREND']
)
departure_prob = departure_prob.clip(0.05, 0.95)
df['DEPARTED'] = np.random.binomial(1, departure_prob)

# Define columns
categorical_cols = ['MATH_PLACEMENT', 'FIRST_GEN', 'PELL_ELIGIBLE', 'RESIDENCY']
numerical_cols = ['HS_GPA', 'UNITS_ATTEMPT_1', 'GPA_1', 'DFW_RATE_1', 
                  'UNITS_ATTEMPT_2', 'GPA_2', 'DFW_RATE_2', 
                  'CUM_GPA', 'CUM_UNITS', 'AVG_DFW', 'GPA_TREND']
feature_cols = categorical_cols + numerical_cols

X = df[feature_cols]
y = df['DEPARTED']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Prepare encoded versions for different libraries
# One-hot for XGBoost
X_train_xgb = pd.get_dummies(X_train, columns=categorical_cols, drop_first=True)
X_test_xgb = pd.get_dummies(X_test, columns=categorical_cols, drop_first=True)
X_test_xgb = X_test_xgb.reindex(columns=X_train_xgb.columns, fill_value=0)

# Label encode for LightGBM
X_train_lgb = X_train.copy()
X_test_lgb = X_test.copy()
for col in categorical_cols:
    le = LabelEncoder()
    X_train_lgb[col] = le.fit_transform(X_train_lgb[col])
    X_test_lgb[col] = le.transform(X_test_lgb[col])

print(f"Training set: {len(X_train)} students")
print(f"Test set: {len(X_test)} students")
print(f"Departure rate: {y_train.mean():.1%}")

## 2. Key Hyperparameters for Gradient Boosting

### 2.1 Learning Rate and Number of Estimators

The **learning rate** (`learning_rate` or `eta`) and **number of estimators** (`n_estimators`) work together:

- Lower learning rate = more trees needed, but often better generalization
- Higher learning rate = fewer trees, faster training, risk of overfitting

**Rule of thumb**: Start with `learning_rate=0.1`, then try lower values with more trees.
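
To see why a lower learning rate calls for more trees, consider an idealized sketch (an assumption for illustration, not how any library actually trains): if every tree fit the current residual perfectly, shrinkage by `learning_rate` would leave a fraction `1 - learning_rate` of the residual after each round. The hypothetical helper `trees_needed` counts the rounds required to shrink the residual to a target fraction.

```python
import math

def trees_needed(learning_rate: float, target_fraction: float) -> int:
    """Rounds needed so that (1 - learning_rate)**n <= target_fraction.

    Idealized assumption: each tree fits the remaining residual exactly,
    so one boosting round removes a `learning_rate` share of it.
    """
    if not 0 < learning_rate < 1:
        raise ValueError("learning_rate must be in (0, 1)")
    if not 0 < target_fraction < 1:
        raise ValueError("target_fraction must be in (0, 1)")
    return math.ceil(math.log(target_fraction) / math.log(1 - learning_rate))

# Rounds needed to shrink the residual to 1% of its starting size
for lr in (0.3, 0.1, 0.05, 0.01):
    print(f"learning_rate={lr:<5} -> about {trees_needed(lr, 0.01)} trees")
```

Under this assumption, cutting the learning rate from 0.1 to 0.01 raises the count from 44 to 459, roughly tenfold, which is why the two parameters are tuned together rather than one at a time.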

In [None]:
# Hyperparameter categories
param_categories = {
    'Category': ['Boosting', 'Boosting', 'Tree', 'Tree', 'Tree', 
                 'Regularization', 'Regularization', 'Regularization',
                 'Sampling', 'Sampling', 'Class Balance'],
    'Parameter': ['n_estimators', 'learning_rate', 'max_depth', 'num_leaves', 'min_child_weight',
                  'gamma', 'reg_alpha', 'reg_lambda',
                  'subsample', 'colsample_bytree', 'scale_pos_weight'],
    'XGBoost Name': ['n_estimators', 'learning_rate', 'max_depth', 'max_leaves', 'min_child_weight',
                     'gamma', 'reg_alpha', 'reg_lambda',
                     'subsample', 'colsample_bytree', 'scale_pos_weight'],
    'LightGBM Name': ['n_estimators', 'learning_rate', 'max_depth', 'num_leaves', 'min_child_samples',
                      'min_split_gain', 'reg_alpha', 'reg_lambda',
                      'subsample', 'colsample_bytree', 'scale_pos_weight'],
    'CatBoost Name': ['iterations', 'learning_rate', 'depth', 'N/A', 'min_data_in_leaf',
                      'N/A', 'N/A', 'l2_leaf_reg',
                      'subsample', 'rsm', 'auto_class_weights'],
    'Typical Range': ['100-1000', '0.01-0.3', '3-10', '15-255', '1-10',
                      '0-5', '0-1', '0-1',
                      '0.5-1.0', '0.5-1.0', '1 or class ratio']
}

pd.DataFrame(param_categories)

### 2.2 Tree Complexity Parameters

These control how complex each individual tree can be:

| Parameter | Effect |
|:----------|:-------|
| `max_depth` | Maximum tree depth (deeper = more complex) |
| `num_leaves` | Maximum leaves per tree (LightGBM) |
| `min_child_weight` | Minimum sum of instance weights (Hessian) required in a child node; larger values make splits more conservative |
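
`max_depth` and `num_leaves` bound the same quantity from two directions: a binary tree of depth `d` has at most `2**d` leaves. The sketch below (the helper names are illustrative, not library API) checks which of the two settings is the binding constraint for a given pair.

```python
def max_leaves_for_depth(max_depth: int) -> int:
    """Upper bound on the number of leaves in a binary tree of this depth."""
    if max_depth < 0:
        raise ValueError("max_depth must be non-negative")
    return 2 ** max_depth

def binding_constraint(max_depth: int, num_leaves: int) -> str:
    """Report which of the two limits restricts tree size first."""
    cap = max_leaves_for_depth(max_depth)
    return "num_leaves" if num_leaves < cap else "max_depth"

for depth, leaves in [(3, 31), (6, 31), (8, 255)]:
    print(f"max_depth={depth}, num_leaves={leaves}: "
          f"depth allows {max_leaves_for_depth(depth)} leaves, "
          f"binding limit is {binding_constraint(depth, leaves)}")
```

With `max_depth=3` a tree can hold only 8 leaves, so a `num_leaves` of 31 has no effect; at `max_depth=6` the cap is 64 and `num_leaves=31` becomes the limit that matters.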

### 2.3 Regularization Parameters

| Parameter | Description |
|:----------|:------------|
| `gamma` | Minimum loss reduction required to make a further split |
| `reg_alpha` | L1 regularization on leaf weights |
| `reg_lambda` | L2 regularization on leaf weights |
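
These three parameters enter the split decision directly. The sketch below follows the regularized objective described in the XGBoost documentation, simplified to a single candidate split. `G` and `H` denote the sums of first- and second-order gradients in a node; the function names are illustrative and are not part of any library.

```python
def soft_threshold(g: float, reg_alpha: float) -> float:
    """Shrink the gradient sum toward zero by reg_alpha (L1 penalty)."""
    if g > reg_alpha:
        return g - reg_alpha
    if g < -reg_alpha:
        return g + reg_alpha
    return 0.0

def node_score(g: float, h: float, reg_lambda: float, reg_alpha: float = 0.0) -> float:
    """Structure score of a node: soft_threshold(G)^2 / (H + lambda)."""
    return soft_threshold(g, reg_alpha) ** 2 / (h + reg_lambda)

def split_gain(g_left, h_left, g_right, h_right,
               reg_lambda=1.0, reg_alpha=0.0, gamma=0.0) -> float:
    """Gain of a candidate split; it is kept only if the result is positive."""
    children = (node_score(g_left, h_left, reg_lambda, reg_alpha)
                + node_score(g_right, h_right, reg_lambda, reg_alpha))
    parent = node_score(g_left + g_right, h_left + h_right, reg_lambda, reg_alpha)
    return 0.5 * (children - parent) - gamma

# Made-up gradient statistics for one candidate split
stats = dict(g_left=-4.0, h_left=2.0, g_right=3.0, h_right=3.0)
print(split_gain(**stats, gamma=0.0))        # positive: split is made
print(split_gain(**stats, gamma=5.0))        # negative: split is pruned
print(split_gain(**stats, reg_lambda=10.0))  # larger lambda shrinks the gain
```

Raising `gamma` subtracts a fixed cost from every split, while raising `reg_lambda` or `reg_alpha` shrinks the scores themselves, so all three push the model toward fewer, more conservative splits.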

### 2.4 Sampling Parameters

| Parameter | Description |
|:----------|:------------|
| `subsample` | Fraction of samples used per tree |
| `colsample_bytree` | Fraction of features used per tree |
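
Conceptually, both parameters draw a random subset before each tree is built. The sketch below mimics that with NumPy on a toy matrix. It is an illustration only: the helper `sample_for_tree` is hypothetical, and the libraries' internal samplers differ in detail.

```python
import numpy as np

def sample_for_tree(X, subsample, colsample_bytree, rng):
    """Return the row and column indices one tree would be trained on."""
    n_rows, n_cols = X.shape
    n_rows_used = max(1, int(round(subsample * n_rows)))
    n_cols_used = max(1, int(round(colsample_bytree * n_cols)))
    rows = rng.choice(n_rows, size=n_rows_used, replace=False)
    cols = rng.choice(n_cols, size=n_cols_used, replace=False)
    return np.sort(rows), np.sort(cols)

rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 10))   # toy data, not the student dataset

rows, cols = sample_for_tree(X_toy, subsample=0.8, colsample_bytree=0.6, rng=rng)
X_tree = X_toy[np.ix_(rows, cols)]
print(X_tree.shape)   # (800, 6)
```

Each tree sees a different slice of the data, which decorrelates the trees and adds randomness that often reduces overfitting.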

In [None]:
# Visualize hyperparameter effects (schematic curves for intuition, not computed from data)
fig = make_subplots(rows=2, cols=2, subplot_titles=(
    'Learning Rate: Lower = More Conservative',
    'Max Depth: Higher = More Complex',
    'Regularization: Higher = Simpler Model',
    'Subsample: Lower = More Stochastic'
))

# Learning rate effect
x = np.linspace(0.01, 0.3, 100)
fig.add_trace(go.Scatter(
    x=x, y=1 - np.exp(-10*x),
    mode='lines', name='Faster Learning',
    line=dict(color='red')
), row=1, col=1)

# Max depth effect
x = np.arange(1, 11)
fig.add_trace(go.Scatter(
    x=x, y=1 - 1/x,
    mode='lines+markers', name='Complexity',
    line=dict(color='blue')
), row=1, col=2)

# Regularization effect
x = np.linspace(0, 1, 100)
fig.add_trace(go.Scatter(
    x=x, y=1 - x,
    mode='lines', name='Model Freedom',
    line=dict(color='green')
), row=2, col=1)

# Subsample effect
x = np.linspace(0.5, 1.0, 100)
fig.add_trace(go.Scatter(
    x=x, y=x,
    mode='lines', name='Stability',
    line=dict(color='orange')
), row=2, col=2)

fig.update_layout(height=500, showlegend=False,
                  title_text='Hyperparameter Effects on Model Behavior')
fig.show()

## 3. Manual Hyperparameter Exploration

### 3.1 Learning Rate vs. n_estimators Trade-off

In [None]:
# Explore learning rate and n_estimators combinations
learning_rates = [0.01, 0.05, 0.1, 0.2, 0.3]
n_estimators_list = [50, 100, 200, 500]

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
results = []

for lr in learning_rates:
    for n_est in n_estimators_list:
        model = XGBClassifier(
            n_estimators=n_est,
            learning_rate=lr,
            max_depth=6,
            random_state=42,
            eval_metric='logloss'
        )
        
        scores = cross_val_score(model, X_train_xgb, y_train, cv=cv, scoring='roc_auc')
        
        results.append({
            'learning_rate': lr,
            'n_estimators': n_est,
            'mean_auc': scores.mean(),
            'std_auc': scores.std()
        })

results_df = pd.DataFrame(results)

# Create heatmap
pivot = results_df.pivot(index='learning_rate', columns='n_estimators', values='mean_auc')

fig = go.Figure(data=go.Heatmap(
    z=pivot.values,
    x=[str(x) for x in pivot.columns],
    y=[str(x) for x in pivot.index],
    text=pivot.values.round(4),
    texttemplate='%{text}',
    colorscale='Blues',
    colorbar=dict(title='ROC-AUC')
))

fig.update_layout(
    title='Learning Rate vs n_estimators: Cross-Validation ROC-AUC',
    xaxis_title='n_estimators',
    yaxis_title='learning_rate',
    height=400
)

fig.show()

best_idx = results_df['mean_auc'].idxmax()
print(f"\nBest combination:")
print(f"  learning_rate: {results_df.loc[best_idx, 'learning_rate']}")
print(f"  n_estimators: {results_df.loc[best_idx, 'n_estimators']}")
print(f"  ROC-AUC: {results_df.loc[best_idx, 'mean_auc']:.4f}")

### 3.2 Max Depth Exploration

In [None]:
# Explore max_depth
max_depths = [2, 3, 4, 5, 6, 8, 10]

depth_results = []

for depth in max_depths:
    model = XGBClassifier(
        n_estimators=200,
        learning_rate=0.1,
        max_depth=depth,
        random_state=42,
        eval_metric='logloss'
    )
    
    # Get train and validation scores
    model.fit(X_train_xgb, y_train)
    train_score = roc_auc_score(y_train, model.predict_proba(X_train_xgb)[:, 1])
    
    cv_scores = cross_val_score(model, X_train_xgb, y_train, cv=cv, scoring='roc_auc')
    
    depth_results.append({
        'max_depth': depth,
        'train_auc': train_score,
        'cv_auc': cv_scores.mean(),
        'cv_std': cv_scores.std()
    })

depth_df = pd.DataFrame(depth_results)

fig = go.Figure()

fig.add_trace(go.Scatter(
    x=depth_df['max_depth'],
    y=depth_df['train_auc'],
    mode='lines+markers',
    name='Train AUC',
    line=dict(color='blue', width=2)
))

fig.add_trace(go.Scatter(
    x=depth_df['max_depth'],
    y=depth_df['cv_auc'],
    mode='lines+markers',
    name='CV AUC',
    line=dict(color='green', width=2),
    error_y=dict(type='data', array=depth_df['cv_std'])
))

fig.update_layout(
    title='Effect of max_depth on Model Performance',
    xaxis_title='max_depth',
    yaxis_title='ROC-AUC',
    height=400
)

fig.show()

print("Observation: As max_depth increases:")
print("- Training AUC increases (more complex model)")
print("- CV AUC may plateau or decrease (overfitting)")

## 4. Grid Search

### 4.1 Basic Grid Search

In [None]:
# Define parameter grid
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [4, 6, 8],
    'learning_rate': [0.05, 0.1],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

# Calculate total combinations
total_combinations = 1
for param, values in param_grid.items():
    total_combinations *= len(values)

print(f"Grid Search Configuration")
print("=" * 40)
print(f"Total parameter combinations: {total_combinations}")
print(f"With 3-fold CV: {total_combinations * 3} model fits")
print(f"\nParameter Grid:")
for param, values in param_grid.items():
    print(f"  {param}: {values}")

In [None]:
# Run Grid Search
xgb_base = XGBClassifier(
    random_state=42,
    eval_metric='logloss'
)

grid_search = GridSearchCV(
    estimator=xgb_base,
    param_grid=param_grid,
    scoring='roc_auc',
    cv=cv,  # same shuffled, stratified 3-fold split used in the manual exploration
    verbose=1,
    n_jobs=-1,  # Use all cores
    return_train_score=True
)

start_time = time.time()
grid_search.fit(X_train_xgb, y_train)
grid_time = time.time() - start_time

print(f"\nGrid Search completed in {grid_time:.1f} seconds")
print(f"\nBest Parameters:")
for param, value in grid_search.best_params_.items():
    print(f"  {param}: {value}")
print(f"\nBest CV Score: {grid_search.best_score_:.4f}")

### 4.2 Analyzing Grid Search Results

In [None]:
# Convert results to DataFrame
grid_results = pd.DataFrame(grid_search.cv_results_)

# Sort by mean test score
grid_results_sorted = grid_results.sort_values('mean_test_score', ascending=False)

# Display top 10 combinations
display_cols = ['param_n_estimators', 'param_max_depth', 'param_learning_rate',
                'param_subsample', 'param_colsample_bytree',
                'mean_test_score', 'std_test_score', 'rank_test_score']

print("Top 10 Parameter Combinations:")
grid_results_sorted[display_cols].head(10).round(4)

In [None]:
# Visualize parameter importance
# Calculate mean score for each parameter value
param_analysis = {}

for param in param_grid.keys():
    param_col = f'param_{param}'
    param_means = grid_results.groupby(param_col)['mean_test_score'].mean()
    param_analysis[param] = param_means

# Create subplots
fig = make_subplots(rows=2, cols=3, subplot_titles=list(param_grid.keys()))

positions = [(1,1), (1,2), (1,3), (2,1), (2,2)]

for idx, (param, means) in enumerate(param_analysis.items()):
    row, col = positions[idx]
    fig.add_trace(go.Bar(
        x=[str(x) for x in means.index],
        y=means.values,
        marker_color='steelblue'
    ), row=row, col=col)

fig.update_layout(height=500, showlegend=False,
                  title_text='Mean CV Score by Parameter Value')
fig.show()

## 5. Randomized Search

### 5.1 Randomized Search Setup

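Randomized search draws a fixed number of configurations (`n_iter`) from distributions over each parameter instead of evaluating every point on a grid. Because the distributions can be continuous, it can try values a grid would skip, and its cost stays fixed no matter how many parameters are included.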
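
Parameters such as `learning_rate` span orders of magnitude, which is why the search uses `loguniform` for them. A minimal standard-library sketch of the idea (the helper name `sample_loguniform` is illustrative): sample uniformly in log space, then exponentiate, so each order of magnitude receives equal weight.

```python
import math
import random

def sample_loguniform(low, high, rng):
    """Draw one value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(42)
low, high = 0.01, 0.3
n = 10_000

log_draws = [sample_loguniform(low, high, rng) for _ in range(n)]
lin_draws = [rng.uniform(low, high) for _ in range(n)]

# Share of draws that land in the low end of the range (below 0.05)
share_log = sum(v < 0.05 for v in log_draws) / n
share_lin = sum(v < 0.05 for v in lin_draws) / n
print(f"below 0.05: log-uniform {share_log:.2f}, uniform {share_lin:.2f}")
```

In expectation, roughly 47% of the log-uniform draws fall below 0.05, against roughly 14% of the uniform draws, so a plain uniform distribution would spend most of the budget on large learning rates.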

In [None]:
# Define parameter distributions
param_distributions = {
    'n_estimators': randint(50, 500),       # integers 50..499 (upper bound is exclusive)
    'max_depth': randint(3, 12),            # integers 3..11
    'learning_rate': loguniform(0.01, 0.3),
    'subsample': uniform(0.6, 0.4),         # uniform(loc, scale): 0.6 to 1.0
    'colsample_bytree': uniform(0.6, 0.4),  # 0.6 to 1.0
    'min_child_weight': randint(1, 10),     # integers 1..9
    'gamma': uniform(0, 0.5),               # 0.0 to 0.5
    'reg_alpha': loguniform(1e-5, 1),
    'reg_lambda': loguniform(1e-5, 1)
}

# Number of random samples
n_iter = 50

print(f"Randomized Search Configuration")
print("=" * 40)
print(f"Random samples: {n_iter}")
print(f"With 3-fold CV: {n_iter * 3} model fits")
print(f"\nParameter Distributions:")
for param, dist in param_distributions.items():
    # Frozen scipy distributions expose the family via .dist.name and the arguments via .args
    print(f"  {param}: {dist.dist.name}{dist.args}")

In [None]:
# Run Randomized Search
random_search = RandomizedSearchCV(
    estimator=xgb_base,
    param_distributions=param_distributions,
    n_iter=n_iter,
    scoring='roc_auc',
    cv=cv,  # same shuffled, stratified 3-fold split used in the manual exploration
    verbose=1,
    n_jobs=-1,
    random_state=42,
    return_train_score=True
)

start_time = time.time()
random_search.fit(X_train_xgb, y_train)
random_time = time.time() - start_time

print(f"\nRandomized Search completed in {random_time:.1f} seconds")
print(f"\nBest Parameters:")
for param, value in random_search.best_params_.items():
    if isinstance(value, float):
        print(f"  {param}: {value:.4f}")
    else:
        print(f"  {param}: {value}")
print(f"\nBest CV Score: {random_search.best_score_:.4f}")

### 5.2 Comparing Search Strategies

In [None]:
# Compare Grid vs Random Search
comparison = pd.DataFrame({
    'Metric': ['Best CV Score', 'Time (seconds)', 'Combinations Tried', 'Parameters Tuned'],
    'Grid Search': [f"{grid_search.best_score_:.4f}", f"{grid_time:.1f}", 
                    total_combinations, len(param_grid)],
    'Random Search': [f"{random_search.best_score_:.4f}", f"{random_time:.1f}",
                      n_iter, len(param_distributions)]
})

print("Grid Search vs Randomized Search")
comparison

In [None]:
# Compare the distribution of CV scores found by each search
random_results = pd.DataFrame(random_search.cv_results_)

fig = make_subplots(rows=1, cols=2, subplot_titles=(
    'Grid Search: Score Distribution',
    'Random Search: Score Distribution'
))

fig.add_trace(go.Histogram(
    x=grid_results['mean_test_score'],
    nbinsx=20,
    name='Grid Search',
    marker_color='blue'
), row=1, col=1)

fig.add_trace(go.Histogram(
    x=random_results['mean_test_score'],
    nbinsx=20,
    name='Random Search',
    marker_color='green'
), row=1, col=2)

fig.update_layout(height=400, showlegend=False,
                  title_text='Distribution of CV Scores Across Search')
fig.update_xaxes(title_text='Mean CV Score')
fig.update_yaxes(title_text='Count')

fig.show()

## 6. Bayesian Optimization with Optuna

### 6.1 Introduction to Optuna

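**Optuna** implements Bayesian-style optimization. Its default sampler, the Tree-structured Parzen Estimator (TPE), uses completed trials to decide where to sample next:

- Uses previous results to guide the search
- Balances exploration (trying new areas) and exploitation (refining good areas)
- Often reaches a good configuration in fewer trials than grid or random search
- Supports pruning (stopping unpromising trials early) when the objective reports intermediate scores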

### 6.2 Optuna Study

In [None]:
# Define the objective function
def objective(trial):
    # Suggest hyperparameters
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 500),
        'max_depth': trial.suggest_int('max_depth', 3, 12),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'subsample': trial.suggest_float('subsample', 0.6, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 10),
        'gamma': trial.suggest_float('gamma', 0, 0.5),
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-5, 1, log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-5, 1, log=True),
    }
    
    # Create model with suggested parameters
    model = XGBClassifier(
        **params,
        random_state=42,
        eval_metric='logloss'
    )
    
    # Cross-validation
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
    scores = cross_val_score(model, X_train_xgb, y_train, cv=cv, scoring='roc_auc')
    
    return scores.mean()

print("Optuna objective function defined!")

In [None]:
# Create and run the study
# Suppress Optuna's info-level logging (set before the study is created)
optuna.logging.set_verbosity(optuna.logging.WARNING)

sampler = TPESampler(seed=42)  # Tree-structured Parzen Estimator

study = optuna.create_study(
    direction='maximize',  # Maximize ROC-AUC
    sampler=sampler
)

start_time = time.time()
study.optimize(objective, n_trials=50, show_progress_bar=True)
optuna_time = time.time() - start_time

print(f"\nOptuna Study completed in {optuna_time:.1f} seconds")
print(f"\nBest Trial:")
print(f"  Value (ROC-AUC): {study.best_value:.4f}")
print(f"\nBest Parameters:")
for param, value in study.best_params.items():
    if isinstance(value, float):
        print(f"  {param}: {value:.4f}")
    else:
        print(f"  {param}: {value}")

### 6.3 Visualizing Optimization

In [None]:
# Plot optimization history
trials_df = study.trials_dataframe()

fig = go.Figure()

# All trials
fig.add_trace(go.Scatter(
    x=trials_df['number'],
    y=trials_df['value'],
    mode='markers',
    name='Trial Score',
    marker=dict(color='lightblue', size=8)
))

# Best so far
best_so_far = trials_df['value'].cummax()
fig.add_trace(go.Scatter(
    x=trials_df['number'],
    y=best_so_far,
    mode='lines',
    name='Best So Far',
    line=dict(color='red', width=2)
))

fig.update_layout(
    title='Optuna Optimization History',
    xaxis_title='Trial Number',
    yaxis_title='ROC-AUC Score',
    height=400
)

fig.show()

In [None]:
# Parameter importance
importance = optuna.importance.get_param_importances(study)

importance_df = pd.DataFrame({
    'Parameter': list(importance.keys()),
    'Importance': list(importance.values())
}).sort_values('Importance', ascending=True)

fig = px.bar(
    importance_df,
    x='Importance', y='Parameter',
    orientation='h',
    title='Hyperparameter Importance (Optuna)',
    color='Importance',
    color_continuous_scale='Blues'
)

fig.update_layout(height=450, yaxis_title='')
fig.show()

In [None]:
# Parameter relationship visualization
# Scatter each selected parameter against the trial score (marker color = trial number)
fig = make_subplots(rows=2, cols=2, subplot_titles=(
    'learning_rate', 'max_depth', 'n_estimators', 'subsample'
))

params_to_plot = ['learning_rate', 'max_depth', 'n_estimators', 'subsample']

for idx, param in enumerate(params_to_plot):
    row = idx // 2 + 1
    col = idx % 2 + 1
    
    fig.add_trace(go.Scatter(
        x=trials_df[f'params_{param}'],
        y=trials_df['value'],
        mode='markers',
        marker=dict(color=trials_df['number'], colorscale='Viridis', size=8),
        showlegend=False
    ), row=row, col=col)

fig.update_layout(height=600, title_text='Parameter-Score Relationships')
fig.show()

## 7. Final Model and Comparison

### 7.1 Training the Tuned Model

In [None]:
# Train final model with Optuna's best parameters
best_params = study.best_params

final_model = XGBClassifier(
    **best_params,
    random_state=42,
    eval_metric='logloss'
)

final_model.fit(X_train_xgb, y_train)

# Evaluate on test set
y_pred = final_model.predict(X_test_xgb)
y_pred_proba = final_model.predict_proba(X_test_xgb)[:, 1]

print("Final Tuned Model Performance on Test Set")
print("=" * 50)
print(f"Accuracy:  {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall:    {recall_score(y_test, y_pred):.4f}")
print(f"F1 Score:  {f1_score(y_test, y_pred):.4f}")
print(f"ROC-AUC:   {roc_auc_score(y_test, y_pred_proba):.4f}")

### 7.2 Performance Comparison

In [None]:
# Compare all models: Default, Grid, Random, Optuna

# Default model
default_model = XGBClassifier(
    random_state=42,
    eval_metric='logloss'
)
default_model.fit(X_train_xgb, y_train)
default_proba = default_model.predict_proba(X_test_xgb)[:, 1]
default_pred = default_model.predict(X_test_xgb)

# Grid Search model
grid_model = grid_search.best_estimator_
grid_proba = grid_model.predict_proba(X_test_xgb)[:, 1]
grid_pred = grid_model.predict(X_test_xgb)

# Random Search model
random_model = random_search.best_estimator_
random_proba = random_model.predict_proba(X_test_xgb)[:, 1]
random_pred = random_model.predict(X_test_xgb)

# Compile results
model_comparison = pd.DataFrame({
    'Model': ['Default XGBoost', 'Grid Search', 'Random Search', 'Optuna (Bayesian)'],
    'CV Score': [
        cross_val_score(default_model, X_train_xgb, y_train, cv=cv, scoring='roc_auc').mean(),
        grid_search.best_score_,
        random_search.best_score_,
        study.best_value
    ],
    'Test ROC-AUC': [
        roc_auc_score(y_test, default_proba),
        roc_auc_score(y_test, grid_proba),
        roc_auc_score(y_test, random_proba),
        roc_auc_score(y_test, y_pred_proba)
    ],
    'Test Accuracy': [
        accuracy_score(y_test, default_pred),
        accuracy_score(y_test, grid_pred),
        accuracy_score(y_test, random_pred),
        accuracy_score(y_test, y_pred)
    ],
    'Test F1': [
        f1_score(y_test, default_pred),
        f1_score(y_test, grid_pred),
        f1_score(y_test, random_pred),
        f1_score(y_test, y_pred)
    ],
    'Tuning Time (s)': [0, grid_time, random_time, optuna_time]
})

# Highlight the best value in each score column
# (restricting to score columns keeps the text 'Model' column from being compared)
score_cols = ['CV Score', 'Test ROC-AUC', 'Test Accuracy', 'Test F1']

def highlight_best(s):
    is_best = s == s.max()
    return ['background-color: lightgreen' if v else '' for v in is_best]

model_comparison.style.apply(highlight_best, subset=score_cols).format({
    'CV Score': '{:.4f}',
    'Test ROC-AUC': '{:.4f}',
    'Test Accuracy': '{:.4f}',
    'Test F1': '{:.4f}',
    'Tuning Time (s)': '{:.1f}'
})

In [None]:
# Visualize comparison
fig = go.Figure()

metrics = ['CV Score', 'Test ROC-AUC', 'Test Accuracy', 'Test F1']
colors = ['blue', 'green', 'orange', 'red']

for model, color in zip(model_comparison['Model'], colors):
    model_data = model_comparison[model_comparison['Model'] == model]
    fig.add_trace(go.Bar(
        name=model,
        x=metrics,
        y=[model_data[m].values[0] for m in metrics],
        marker_color=color
    ))

fig.update_layout(
    title='Model Comparison: Default vs Tuned',
    xaxis_title='Metric',
    yaxis_title='Score',
    barmode='group',
    height=450,
    yaxis=dict(range=[0, 1])  # full score range so that low F1 values are not clipped
)

fig.show()

In [None]:
# Summary of tuning methods
tuning_summary = pd.DataFrame({
    'Method': ['Grid Search', 'Random Search', 'Optuna (Bayesian)'],
    'Pros': [
        'Exhaustive, reproducible, simple',
        'Samples continuous ranges, cost fixed by n_iter',
        'Learns from earlier trials, often needs fewer evaluations'
    ],
    'Cons': [
        'Cost multiplies with each added parameter, misses values between grid points',
        'No learning from previous trials',
        'More setup, extra dependency'
    ],
    'Best For': [
        'Small parameter spaces, quick validation',
        'Medium-sized searches, continuous parameters',
        'Large searches, production tuning'
    ]
})

tuning_summary

## 8. Summary

In this notebook, we covered:

### Key Concepts

1. **Important Hyperparameters**:
   - Boosting: `learning_rate`, `n_estimators`
   - Tree: `max_depth`, `min_child_weight`
   - Regularization: `gamma`, `reg_alpha`, `reg_lambda`
   - Sampling: `subsample`, `colsample_bytree`

2. **Tuning Strategies**:
   - **Grid Search**: Exhaustive but expensive
   - **Random Search**: Efficient for continuous parameters
   - **Bayesian (Optuna)**: Learns from earlier trials; often needs the fewest evaluations

3. **Best Practices**:
   - Start with sensible defaults
   - Tune learning rate and n_estimators together
   - Use early stopping during tuning so that `n_estimators` does not have to be searched by hand
   - Validate final model on held-out test set

### Hyperparameter Tuning Workflow

```
1. Start with defaults
   |
2. Manual exploration of key parameters
   |
3. Coarse search (Grid or Random)
   |
4. Fine-tuning (Optuna/Bayesian)
   |
5. Final validation on test set
```

### Summary Table

| Topic | Key Takeaway |
|:------|:-------------|
| Learning Rate | Lower = more trees, often better generalization |
| Max Depth | 4-8 usually sufficient; higher may overfit |
| Regularization | Use when overfitting observed |
| Grid Search | Good for small, discrete spaces |
| Random Search | Good for continuous parameters |
| Optuna | Good default for larger searches and production tuning |

### Module 4 Complete!

Congratulations! You have completed Module 4 on Gradient Boosting. You now know how to:

1. Explain boosting theory and its connection to gradient descent
2. Build models with XGBoost, LightGBM, and CatBoost
3. Implement early stopping and cross-validation
4. Interpret models using SHAP values
5. Tune hyperparameters using multiple strategies

**Continue to Module 5** to learn about model deployment and productionization!