# 4.3 Build LightGBM and CatBoost Models

## Introduction

In the previous notebook, we built XGBoost models for student departure prediction. Now we explore two powerful alternatives: **LightGBM** and **CatBoost**. Each library has unique strengths:

- **LightGBM**: Optimized for speed and memory efficiency, excellent for large datasets
- **CatBoost**: Specialized handling of categorical features, robust to overfitting

Understanding when to use each library will help you choose the best tool for your specific higher education analytics problems.

### Learning Objectives

By the end of this notebook, you will be able to:

1. Build LightGBM classification models with native categorical support
2. Build CatBoost classification models with automatic categorical handling
3. Configure key hyperparameters for each library
4. Compare performance, training time, and ease of use across all three libraries
5. Make informed decisions about which library to use for different scenarios

## 1. Setup and Data Loading

### 1.1 Import Libraries

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import time
import warnings
warnings.filterwarnings('ignore')

# Visualization
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Scikit-learn
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, classification_report, confusion_matrix
)

# Gradient Boosting Libraries
import xgboost as xgb
from xgboost import XGBClassifier

import lightgbm as lgb
from lightgbm import LGBMClassifier

import catboost
from catboost import CatBoostClassifier, Pool

print(f"XGBoost version: {xgb.__version__}")
print(f"LightGBM version: {lgb.__version__}")
print(f"CatBoost version: {catboost.__version__}")
print("Libraries loaded successfully!")

### 1.2 Load and Prepare Data

In [None]:
# Generate synthetic student data (same as previous notebook)
np.random.seed(42)
n_students = 2000

data = {
    'STUDENT_ID': range(1, n_students + 1),
    'HS_GPA': np.random.normal(3.2, 0.5, n_students).clip(2.0, 4.0),
    'MATH_PLACEMENT': np.random.choice(['Remedial', 'College-Ready', 'Advanced'], n_students, p=[0.2, 0.5, 0.3]),
    'FIRST_GEN': np.random.choice(['Yes', 'No'], n_students, p=[0.35, 0.65]),
    'PELL_ELIGIBLE': np.random.choice(['Yes', 'No'], n_students, p=[0.40, 0.60]),
    'RESIDENCY': np.random.choice(['In-State', 'Out-of-State', 'International'], n_students, p=[0.7, 0.2, 0.1]),
    'MAJOR_CATEGORY': np.random.choice(['STEM', 'Business', 'Humanities', 'Social Science', 'Undeclared'], 
                                        n_students, p=[0.25, 0.20, 0.15, 0.20, 0.20]),
    'UNITS_ATTEMPT_1': np.random.normal(14, 2, n_students).clip(6, 18).astype(int),
    'GPA_1': np.random.normal(2.8, 0.7, n_students).clip(0.0, 4.0),
    'DFW_RATE_1': np.random.beta(2, 8, n_students),
    'UNITS_ATTEMPT_2': np.random.normal(14, 2, n_students).clip(6, 18).astype(int),
    'GPA_2': np.random.normal(2.9, 0.6, n_students).clip(0.0, 4.0),
    'DFW_RATE_2': np.random.beta(2, 8, n_students),
}

df = pd.DataFrame(data)

# Calculate derived features
df['CUM_GPA'] = (df['GPA_1'] + df['GPA_2']) / 2
df['CUM_UNITS'] = df['UNITS_ATTEMPT_1'] + df['UNITS_ATTEMPT_2']
df['AVG_DFW'] = (df['DFW_RATE_1'] + df['DFW_RATE_2']) / 2

# Generate target variable
departure_prob = (
    0.3 - 0.15 * (df['CUM_GPA'] - 2.5) + 0.3 * df['AVG_DFW']
    + 0.05 * (df['FIRST_GEN'] == 'Yes') - 0.02 * (df['HS_GPA'] - 3.0)
    + 0.05 * (df['MATH_PLACEMENT'] == 'Remedial')
    + 0.03 * (df['MAJOR_CATEGORY'] == 'Undeclared')
)
departure_prob = departure_prob.clip(0.05, 0.95)
df['DEPARTED'] = np.random.binomial(1, departure_prob)

# Define feature columns
categorical_cols = ['MATH_PLACEMENT', 'FIRST_GEN', 'PELL_ELIGIBLE', 'RESIDENCY', 'MAJOR_CATEGORY']
numerical_cols = ['HS_GPA', 'UNITS_ATTEMPT_1', 'GPA_1', 'DFW_RATE_1', 
                  'UNITS_ATTEMPT_2', 'GPA_2', 'DFW_RATE_2', 
                  'CUM_GPA', 'CUM_UNITS', 'AVG_DFW']

feature_cols = categorical_cols + numerical_cols
target_col = 'DEPARTED'

X = df[feature_cols]
y = df[target_col]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Dataset shape: {df.shape}")
print(f"Departure rate: {df['DEPARTED'].mean():.1%}")
print(f"\nCategorical features: {categorical_cols}")
print(f"Numerical features: {numerical_cols}")

## 2. LightGBM

### 2.1 LightGBM Key Features

LightGBM (Light Gradient Boosting Machine) was developed by Microsoft with a focus on efficiency:

| Feature | Description |
|:--------|:------------|
| **Leaf-wise Growth** | Grows trees by choosing the leaf with maximum delta loss (vs. level-wise) |
| **GOSS** | Gradient-based One-Side Sampling: keeps large gradients, samples small ones |
| **EFB** | Exclusive Feature Bundling: bundles mutually exclusive features |
| **Histogram-based** | Bins continuous features into discrete buckets for faster splits |
| **Native Categorical** | Can handle categorical features directly (integer-encoded) |
| **Memory Efficient** | Typically a lower memory footprint than XGBoost |
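
To make the GOSS row more concrete, here is a minimal pure-Python sketch of the sampling idea. It is a simplified illustration, not LightGBM's internal implementation: `goss_sample` is a hypothetical helper, the gradients are toy values, and the argument names `top_rate` and `other_rate` are chosen to mirror the LightGBM parameters of the same name.

```python
import random

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return {row_index: weight} for the rows kept by a GOSS-style sampler."""
    n = len(gradients)
    # Rank rows by absolute gradient, largest first
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)

    top_rows = order[:n_top]   # large-gradient rows are always kept
    rest = order[n_top:]       # small-gradient rows are randomly sampled
    sampled = random.Random(seed).sample(rest, n_other)

    # Up-weight the sampled rows so their total contribution still
    # approximates that of the full small-gradient group
    amplify = (1.0 - top_rate) / other_rate
    weights = {i: 1.0 for i in top_rows}
    weights.update({i: amplify for i in sampled})
    return weights

# Toy gradients (assumption): magnitudes cycle through 0.0 .. 0.9
toy_gradients = [(-1) ** i * (i % 10) / 10 for i in range(100)]
kept = goss_sample(toy_gradients)
print(f"Kept {len(kept)} of {len(toy_gradients)} rows")
```

Only the kept rows take part in split finding for that iteration, which is where the speed-up comes from.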

In [None]:
# Visualize leaf-wise vs level-wise tree growth
fig = make_subplots(rows=1, cols=2, subplot_titles=(
    'Level-wise Growth (XGBoost default)',
    'Leaf-wise Growth (LightGBM)'
))

# Level-wise tree (grows all leaves at each level)
level_wise_nodes = [
    (4, 5, 'Root', 'lightblue'),
    (2, 3.5, 'L1', 'lightgreen'), (6, 3.5, 'L2', 'lightgreen'),
    (1, 2, 'L3', 'lightyellow'), (3, 2, 'L4', 'lightyellow'),
    (5, 2, 'L5', 'lightyellow'), (7, 2, 'L6', 'lightyellow')
]

# Leaf-wise tree (grows leaf with max gain)
leaf_wise_nodes = [
    (4, 5, 'Root', 'lightblue'),
    (2, 3.5, 'L1', 'lightgreen'), (6, 3.5, 'L2', 'lightcoral'),  # L2 has lower gain
    (1, 2, 'L3', 'lightyellow'), (3, 2, 'L4', 'lightyellow'),  # Continue growing L1's children
    (0.5, 0.5, 'L5', 'lightpink'), (1.5, 0.5, 'L6', 'lightpink')  # Grow highest gain leaf
]

# Draw level-wise
for node_x, node_y, label, color in level_wise_nodes:  # avoid shadowing the target `y`
    fig.add_trace(go.Scatter(
        x=[node_x], y=[node_y], mode='markers+text', text=[label],
        textposition='middle center',
        marker=dict(size=40, color=color, line=dict(width=2, color='darkgray')),
        showlegend=False
    ), row=1, col=1)

# Draw leaf-wise
for node_x, node_y, label, color in leaf_wise_nodes:  # avoid shadowing the target `y`
    fig.add_trace(go.Scatter(
        x=[node_x], y=[node_y], mode='markers+text', text=[label],
        textposition='middle center',
        marker=dict(size=40, color=color, line=dict(width=2, color='darkgray')),
        showlegend=False
    ), row=1, col=2)

# Add edges for level-wise
edges_lw = [((4, 5), (2, 3.5)), ((4, 5), (6, 3.5)), 
            ((2, 3.5), (1, 2)), ((2, 3.5), (3, 2)),
            ((6, 3.5), (5, 2)), ((6, 3.5), (7, 2))]

for (x1, y1), (x2, y2) in edges_lw:
    fig.add_trace(go.Scatter(
        x=[x1, x2], y=[y1-0.3, y2+0.3], mode='lines',
        line=dict(color='gray', width=2), showlegend=False
    ), row=1, col=1)

# Add edges for leaf-wise
edges_leaf = [((4, 5), (2, 3.5)), ((4, 5), (6, 3.5)),
              ((2, 3.5), (1, 2)), ((2, 3.5), (3, 2)),
              ((1, 2), (0.5, 0.5)), ((1, 2), (1.5, 0.5))]

for (x1, y1), (x2, y2) in edges_leaf:
    fig.add_trace(go.Scatter(
        x=[x1, x2], y=[y1-0.3, y2+0.3], mode='lines',
        line=dict(color='gray', width=2), showlegend=False
    ), row=1, col=2)

fig.update_xaxes(visible=False)
fig.update_yaxes(visible=False)
fig.update_layout(
    height=400,
    title='Tree Growth Strategies: Level-wise vs Leaf-wise',
    annotations=[
        dict(x=0.25, y=-0.1, xref='paper', yref='paper', text='Grows all leaves at each level',
             showarrow=False, font=dict(size=11)),
        dict(x=0.75, y=-0.1, xref='paper', yref='paper', text='Grows leaf with highest gain',
             showarrow=False, font=dict(size=11))
    ]
)

fig.show()

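**Key Insight**: For the same number of splits, leaf-wise growth often reaches a lower training loss than level-wise growth, but the deeper, unbalanced trees it produces can overfit small datasets. Use `num_leaves` (optionally together with `max_depth` and `min_child_samples`) to control complexity.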
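
The difference between the two growth strategies can be sketched with a toy tree. The split gains below are hypothetical numbers chosen for illustration, and `grow_leaf_wise` / `grow_level_wise` are illustrative helpers, not library functions.

```python
import heapq
from collections import deque

# Hypothetical split gains per node; the children of "o" are "oL" and "oR"
GAINS = {"o": 10.0, "oL": 8.0, "oR": 1.0,
         "oLL": 6.0, "oLR": 5.0, "oRL": 0.5, "oRR": 0.2}

def grow_leaf_wise(n_splits):
    """Always split the current leaf with the highest gain (best-first)."""
    heap = [(-GAINS["o"], "o")]  # max-heap via negated gains
    order, total = [], 0.0
    for _ in range(n_splits):
        neg_gain, node = heapq.heappop(heap)
        order.append(node)
        total += -neg_gain
        for child in (node + "L", node + "R"):
            heapq.heappush(heap, (-GAINS.get(child, 0.0), child))
    return order, total

def grow_level_wise(n_splits):
    """Split leaves in breadth-first order, one level at a time."""
    queue = deque(["o"])
    order, total = [], 0.0
    for _ in range(n_splits):
        node = queue.popleft()
        order.append(node)
        total += GAINS.get(node, 0.0)
        queue.extend([node + "L", node + "R"])
    return order, total

leaf_order, leaf_gain = grow_leaf_wise(3)
level_order, level_gain = grow_level_wise(3)
print("Leaf-wise :", leaf_order, leaf_gain)
print("Level-wise:", level_order, level_gain)
```

With the same budget of three splits, the leaf-wise tree collects more total gain but ends up one level deeper, which is why a cap on leaves or depth matters on small datasets.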

### 2.2 Building a LightGBM Classifier

In [None]:
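# First, encode categorical features for LightGBM
# LightGBM does not accept raw string columns: categoricals must be integer-encoded
# (or stored as pandas 'category' dtype); one-hot encoding is the non-native alternative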

# Create copies for encoding
X_train_lgb = X_train.copy()
X_test_lgb = X_test.copy()

# Label encode categorical columns
label_encoders = {}
for col in categorical_cols:
    le = LabelEncoder()
    X_train_lgb[col] = le.fit_transform(X_train_lgb[col])
    X_test_lgb[col] = le.transform(X_test_lgb[col])
    label_encoders[col] = le

print("Encoded categorical features for LightGBM:")
X_train_lgb[categorical_cols].head()

In [None]:
# Build LightGBM classifier
lgb_model = LGBMClassifier(
    n_estimators=100,
    max_depth=-1,          # -1 means no limit (leaf-wise uses num_leaves instead)
    num_leaves=31,         # Controls complexity (default=31)
    learning_rate=0.1,
    subsample=0.8,         # Called 'bagging_fraction' in LightGBM native API
    subsample_freq=1,      # Bagging only takes effect when subsample_freq (bagging_freq) > 0
    colsample_bytree=0.8,  # Called 'feature_fraction' in native API
    random_state=42,
    verbose=-1             # Suppress warnings
)

# Train with categorical feature specification
lgb_model.fit(
    X_train_lgb, y_train,
    categorical_feature=categorical_cols  # Tell LightGBM which columns are categorical
)

# Predict
y_pred_lgb = lgb_model.predict(X_test_lgb)
y_pred_proba_lgb = lgb_model.predict_proba(X_test_lgb)[:, 1]

# Evaluate
print("LightGBM Model Performance")
print("=" * 40)
print(f"Accuracy:  {accuracy_score(y_test, y_pred_lgb):.3f}")
print(f"Precision: {precision_score(y_test, y_pred_lgb):.3f}")
print(f"Recall:    {recall_score(y_test, y_pred_lgb):.3f}")
print(f"F1 Score:  {f1_score(y_test, y_pred_lgb):.3f}")
print(f"ROC-AUC:   {roc_auc_score(y_test, y_pred_proba_lgb):.3f}")

### 2.3 LightGBM with Native Categorical Support

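LightGBM can split on integer-encoded categorical features directly, without one-hot encoding, which is more memory efficient for high-cardinality features. The `categorical_feature` argument used in Section 2.2 enables this; the table below lists the other key parameters.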
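
As a rough sketch of how a native categorical split can be found, the example below sorts categories by their accumulated gradient statistics and then evaluates only the prefix splits of that ordering, which is the idea described in the LightGBM documentation. It is heavily simplified: regularization and smoothing are ignored, `best_categorical_split` is a hypothetical helper, and the gradient and hessian sums are toy values.

```python
def best_categorical_split(stats):
    """stats maps category -> (sum_gradient, sum_hessian)."""
    # Order categories by their gradient-to-hessian ratio
    ordered = sorted(stats, key=lambda c: stats[c][0] / stats[c][1])
    G = sum(g for g, _ in stats.values())
    H = sum(h for _, h in stats.values())

    best_gain, best_left = float("-inf"), None
    g_left = h_left = 0.0
    # Only K-1 prefix splits of the sorted list are evaluated,
    # instead of all 2^(K-1) - 1 possible two-group partitions
    for k in range(len(ordered) - 1):
        g, h = stats[ordered[k]]
        g_left += g
        h_left += h
        g_right, h_right = G - g_left, H - h_left
        gain = g_left**2 / h_left + g_right**2 / h_right - G**2 / H
        if gain > best_gain:
            best_gain, best_left = gain, set(ordered[: k + 1])
    return best_left, best_gain

# Toy per-category statistics (assumption, not taken from the dataset)
toy_stats = {
    "STEM": (-3.0, 10.0),
    "Business": (-1.0, 8.0),
    "Humanities": (1.0, 6.0),
    "Undeclared": (4.0, 8.0),
}
left_group, gain = best_categorical_split(toy_stats)
print(left_group, round(gain, 3))
```

A single split can therefore separate a whole group of categories from the rest, something one-hot encoding would need several splits to express.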

In [None]:
# Key LightGBM parameters
lgb_params = {
    'Parameter': [
        'n_estimators',
        'num_leaves',
        'max_depth',
        'learning_rate',
        'min_child_samples',
        'subsample (bagging_fraction)',
        'colsample_bytree (feature_fraction)',
        'reg_alpha',
        'reg_lambda',
        'is_unbalance / scale_pos_weight'
    ],
    'Description': [
        'Number of boosting iterations',
        'Maximum leaves per tree (key for leaf-wise)',
        'Maximum tree depth (-1 for unlimited)',
        'Shrinkage rate',
        'Minimum samples in a leaf',
        'Fraction of rows sampled for each tree (needs subsample_freq > 0)',
        'Fraction of features for each tree',
        'L1 regularization',
        'L2 regularization',
        'Handle class imbalance'
    ],
    'Default': ['100', '31', '-1', '0.1', '20', '1.0', '1.0', '0', '0', 'False / 1.0']
}

pd.DataFrame(lgb_params)
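
For the class-imbalance row, a common starting point for `scale_pos_weight` is the ratio of negative to positive examples. The helper below is a hypothetical sketch with toy labels; treat its result as an initial value to tune, not a rule.

```python
def suggested_scale_pos_weight(labels):
    """Ratio of negative to positive examples in a binary label list."""
    n_pos = sum(1 for v in labels if v == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0:
        raise ValueError("No positive examples in labels")
    return n_neg / n_pos

# Toy labels (assumption): 30% departures
toy_labels = [1] * 30 + [0] * 70
weight = suggested_scale_pos_weight(toy_labels)
print(f"Suggested scale_pos_weight: {weight:.2f}")
```

In LightGBM, `is_unbalance` and `scale_pos_weight` should not be set at the same time, so pick one of the two.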

### 2.4 LightGBM in Pipelines

In [None]:
# Create a pipeline with LightGBM
# Note: For categorical features to work in pipeline, we need to encode them

from sklearn.preprocessing import OrdinalEncoder

# Use OrdinalEncoder for categoricals (LightGBM-friendly)
preprocessor_lgb = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_cols),
        ('cat', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1), categorical_cols)
    ]
)

lgb_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor_lgb),
    ('classifier', LGBMClassifier(
        n_estimators=100,
        num_leaves=31,
        learning_rate=0.1,
        random_state=42,
        verbose=-1
    ))
])

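# Fit and predict
# ColumnTransformer outputs the numerical columns first, then the categoricals,
# so the categorical columns occupy the trailing positions of the transformed array.
# Without categorical_feature, LightGBM would treat the ordinal codes as numeric.
cat_idx_lgb = list(range(len(numerical_cols), len(numerical_cols) + len(categorical_cols)))
lgb_pipeline.fit(X_train, y_train, classifier__categorical_feature=cat_idx_lgb)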
y_pred_lgb_pipe = lgb_pipeline.predict(X_test)
y_pred_proba_lgb_pipe = lgb_pipeline.predict_proba(X_test)[:, 1]

print("LightGBM Pipeline Performance")
print("=" * 40)
print(f"Accuracy:  {accuracy_score(y_test, y_pred_lgb_pipe):.3f}")
print(f"ROC-AUC:   {roc_auc_score(y_test, y_pred_proba_lgb_pipe):.3f}")

## 3. CatBoost

### 3.1 CatBoost Key Features

CatBoost (Categorical Boosting) was developed by Yandex with a focus on categorical features:

| Feature | Description |
|:--------|:------------|
| **Native Categorical** | Handles string/object categorical features directly (no encoding needed!) |
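| **Ordered Target Statistics** | Encodes each category from the targets of preceding rows in a random permutation, which avoids target leakage |
| **Ordered Boosting** | Permutation-based boosting scheme that reduces prediction shift |
| **Symmetric Trees** | Balanced (oblivious) trees that are faster to evaluate |
| **GPU Training** | Excellent GPU support out of the box |
| **Robust to Overfitting** | The ordered schemes above reduce overfitting, especially on small datasets |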
| **Feature Combinations** | Automatically creates combinations of categorical features |

In [None]:
# Illustrate the target leakage problem that CatBoost solves
print("The Target Leakage Problem with Categorical Encoding")
print("=" * 60)
print()
print("Traditional Target Encoding:")
print("  - Calculate mean target for each category using ALL training data")
print("  - Problem: Sample's own target value influences its encoding")
print("  - Results in optimistic training performance, poor generalization")
print()
print("CatBoost's Ordered Target Statistics:")
print("  - Shuffle the rows, then encode each sample using only target values from PREVIOUS samples")
print("  - Like time-series: can't look into the future")
print("  - Avoids this target leakage, better generalization")
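
The contrast described above can be sketched in a few lines of plain Python. This is a simplified illustration, not CatBoost's implementation: both helper functions are hypothetical, the rows are assumed to be already in random-permutation order, and `prior` and `a` are illustrative smoothing values.

```python
def ordered_target_stats(categories, targets, prior=0.5, a=1.0):
    """Encode each row using only the targets of rows that came before it."""
    sums, counts, encoded = {}, {}, []
    for cat, target in zip(categories, targets):
        s, c = sums.get(cat, 0.0), counts.get(cat, 0)
        encoded.append((s + a * prior) / (c + a))  # uses the history only
        sums[cat] = s + target                     # update after encoding
        counts[cat] = c + 1
    return encoded

def greedy_target_stats(categories, targets):
    """Leaky baseline: category mean computed over ALL rows."""
    sums, counts = {}, {}
    for cat, target in zip(categories, targets):
        sums[cat] = sums.get(cat, 0.0) + target
        counts[cat] = counts.get(cat, 0) + 1
    return [sums[cat] / counts[cat] for cat in categories]

# Toy data (assumption)
toy_cats = ["A", "A", "B", "A", "B"]
toy_targets = [1, 0, 1, 1, 0]
ordered = ordered_target_stats(toy_cats, toy_targets)
greedy = greedy_target_stats(toy_cats, toy_targets)
print("Ordered:", ordered)
print("Greedy :", [round(v, 3) for v in greedy])
```

In the ordered version, a row's own target never enters its own encoding, whereas the greedy version lets every row see its own label.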

### 3.2 Building a CatBoost Classifier

In [None]:
# CatBoost can use the original data directly!
# No need to encode categorical features

cat_model = CatBoostClassifier(
    iterations=100,           # Same as n_estimators
    depth=6,                  # Maximum tree depth
    learning_rate=0.1,
    cat_features=categorical_cols,  # Specify categorical columns
    random_state=42,
    verbose=0                 # Suppress training output
)

# Train directly on original data (strings are OK!)
cat_model.fit(X_train, y_train)

# Predict
y_pred_cat = cat_model.predict(X_test)
y_pred_proba_cat = cat_model.predict_proba(X_test)[:, 1]

# Evaluate
print("CatBoost Model Performance")
print("=" * 40)
print(f"Accuracy:  {accuracy_score(y_test, y_pred_cat):.3f}")
print(f"Precision: {precision_score(y_test, y_pred_cat):.3f}")
print(f"Recall:    {recall_score(y_test, y_pred_cat):.3f}")
print(f"F1 Score:  {f1_score(y_test, y_pred_cat):.3f}")
print(f"ROC-AUC:   {roc_auc_score(y_test, y_pred_proba_cat):.3f}")

### 3.3 CatBoost with Native Categorical Handling

CatBoost's main advantage is seamless categorical handling. Let's explore this further.

In [None]:
# CatBoost key parameters
cat_params = {
    'Parameter': [
        'iterations',
        'depth',
        'learning_rate',
        'l2_leaf_reg',
        'cat_features',
        'one_hot_max_size',
        'auto_class_weights',
        'random_strength',
        'bagging_temperature',
        'early_stopping_rounds'
    ],
    'Description': [
        'Number of boosting iterations (trees)',
        'Depth of each tree (symmetric)',
        'Shrinkage rate',
        'L2 regularization on leaves',
        'List of categorical feature names/indices',
        'Max categories for one-hot (otherwise ordered encoding)',
        'Auto handle class imbalance (None/Balanced/SqrtBalanced)',
        'Amount of randomness for scoring splits',
        'Controls intensity of Bayesian bootstrap (0=no bootstrap)',
        'Stop if no improvement for N rounds'
    ],
    'Default': ['1000', '6', '0.03', '3.0', 'None', '2', 'None', '1', '1', 'False']
}

pd.DataFrame(cat_params)

In [None]:
# Demonstrate CatBoost's handling of high-cardinality categorical features
print("CatBoost Categorical Feature Handling")
print("=" * 50)
print()
print("For each categorical feature, CatBoost automatically:")
print()
print("1. If categories <= one_hot_max_size (default=2):")
print("   -> Uses one-hot encoding")
print()
print("2. If categories > one_hot_max_size:")
print("   -> Uses ordered target encoding")
print("   -> Prevents target leakage via random permutations")
print()
print("3. Optionally creates feature combinations:")
print("   -> E.g., MAJOR + FIRST_GEN combinations")
print("   -> Captures interaction effects automatically")
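
Applied to this notebook's categorical columns, the cardinality rule above would play out roughly as follows. The helper is a hypothetical sketch of the decision rule only; the cardinalities are read off the synthetic data definition in Section 1.2.

```python
def encoding_plan(cardinalities, one_hot_max_size=2):
    """Map each categorical column to the encoding the rule above selects."""
    return {
        col: "one-hot" if n_unique <= one_hot_max_size
        else "ordered target statistics"
        for col, n_unique in cardinalities.items()
    }

# Number of distinct values per categorical column in the synthetic dataset
cardinalities = {
    "MATH_PLACEMENT": 3,
    "FIRST_GEN": 2,
    "PELL_ELIGIBLE": 2,
    "RESIDENCY": 3,
    "MAJOR_CATEGORY": 5,
}
plan = encoding_plan(cardinalities)
for col, method in plan.items():
    print(f"{col:<16} -> {method}")
```

Raising `one_hot_max_size` to 3 in this sketch would move `MATH_PLACEMENT` and `RESIDENCY` to one-hot encoding as well.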

### 3.4 CatBoost in Pipelines

In [None]:
# CatBoost works well in pipelines, but we need to handle categorical features carefully
# Option 1: Pass cat_features directly (if using CatBoost outside pipeline)
# Option 2: Use ColumnTransformer with passthrough

# Simple approach: Scale numericals, pass categoricals through
preprocessor_cat = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_cols),
        ('cat', 'passthrough', categorical_cols)  # Pass through as-is
    ]
)

# Get the indices of categorical features after transformation
n_numerical = len(numerical_cols)
cat_feature_indices = list(range(n_numerical, n_numerical + len(categorical_cols)))

cat_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor_cat),
    ('classifier', CatBoostClassifier(
        iterations=100,
        depth=6,
        learning_rate=0.1,
        cat_features=cat_feature_indices,  # Use indices after preprocessing
        random_state=42,
        verbose=0
    ))
])

# Fit and predict
cat_pipeline.fit(X_train, y_train)
y_pred_cat_pipe = cat_pipeline.predict(X_test)
y_pred_proba_cat_pipe = cat_pipeline.predict_proba(X_test)[:, 1]

print("CatBoost Pipeline Performance")
print("=" * 40)
print(f"Accuracy:  {accuracy_score(y_test, y_pred_cat_pipe):.3f}")
print(f"ROC-AUC:   {roc_auc_score(y_test, y_pred_proba_cat_pipe):.3f}")

## 4. Comparing XGBoost, LightGBM, and CatBoost

### 4.1 Performance Comparison

In [None]:
# Train all three models with similar settings

# Prepare data for XGBoost (one-hot encoded)
X_train_xgb = pd.get_dummies(X_train, columns=categorical_cols, drop_first=True)
X_test_xgb = pd.get_dummies(X_test, columns=categorical_cols, drop_first=True)
X_test_xgb = X_test_xgb.reindex(columns=X_train_xgb.columns, fill_value=0)

# XGBoost
xgb_model = XGBClassifier(
    n_estimators=100, max_depth=6, learning_rate=0.1,
    random_state=42, eval_metric='logloss'
)
start = time.time()
xgb_model.fit(X_train_xgb, y_train)
xgb_time = time.time() - start
xgb_proba = xgb_model.predict_proba(X_test_xgb)[:, 1]
xgb_pred = xgb_model.predict(X_test_xgb)

# LightGBM
lgb_model = LGBMClassifier(
    n_estimators=100, num_leaves=31, learning_rate=0.1,
    random_state=42, verbose=-1
)
start = time.time()
lgb_model.fit(X_train_lgb, y_train, categorical_feature=categorical_cols)
lgb_time = time.time() - start
lgb_proba = lgb_model.predict_proba(X_test_lgb)[:, 1]
lgb_pred = lgb_model.predict(X_test_lgb)

# CatBoost
cat_model = CatBoostClassifier(
    iterations=100, depth=6, learning_rate=0.1,
    cat_features=categorical_cols, random_state=42, verbose=0
)
start = time.time()
cat_model.fit(X_train, y_train)
cat_time = time.time() - start
cat_proba = cat_model.predict_proba(X_test)[:, 1]
cat_pred = cat_model.predict(X_test)

print("Model Training Complete!")

In [None]:
# Compile performance metrics
results = {
    'Model': ['XGBoost', 'LightGBM', 'CatBoost'],
    'Accuracy': [
        accuracy_score(y_test, xgb_pred),
        accuracy_score(y_test, lgb_pred),
        accuracy_score(y_test, cat_pred)
    ],
    'Precision': [
        precision_score(y_test, xgb_pred),
        precision_score(y_test, lgb_pred),
        precision_score(y_test, cat_pred)
    ],
    'Recall': [
        recall_score(y_test, xgb_pred),
        recall_score(y_test, lgb_pred),
        recall_score(y_test, cat_pred)
    ],
    'F1 Score': [
        f1_score(y_test, xgb_pred),
        f1_score(y_test, lgb_pred),
        f1_score(y_test, cat_pred)
    ],
    'ROC-AUC': [
        roc_auc_score(y_test, xgb_proba),
        roc_auc_score(y_test, lgb_proba),
        roc_auc_score(y_test, cat_proba)
    ],
    'Training Time (s)': [xgb_time, lgb_time, cat_time]
}

results_df = pd.DataFrame(results)
results_df.set_index('Model', inplace=True)

# Style the dataframe
def highlight_best(s):
    if s.name == 'Training Time (s)':
        is_best = s == s.min()
    else:
        is_best = s == s.max()
    return ['background-color: lightgreen' if v else '' for v in is_best]

styled_results = results_df.style.apply(highlight_best).format('{:.4f}')
styled_results

In [None]:
# Visualize performance comparison
metrics = ['Accuracy', 'Precision', 'Recall', 'F1 Score', 'ROC-AUC']

fig = go.Figure()

colors = {'XGBoost': 'blue', 'LightGBM': 'green', 'CatBoost': 'orange'}

for model in ['XGBoost', 'LightGBM', 'CatBoost']:
    fig.add_trace(go.Bar(
        name=model,
        x=metrics,
        y=[results_df.loc[model, m] for m in metrics],
        marker_color=colors[model]
    ))

fig.update_layout(
    title='Performance Comparison: XGBoost vs LightGBM vs CatBoost',
    xaxis_title='Metric',
    yaxis_title='Score',
    barmode='group',
    height=450,
    yaxis=dict(range=[0, 1])
)

fig.show()

### 4.2 Training Time Comparison

In [None]:
# Compare training times across different dataset sizes
# (Simulated for demonstration)

dataset_sizes = [500, 1000, 2000, 5000, 10000]
training_times = {
    'XGBoost': [],
    'LightGBM': [],
    'CatBoost': []
}

# Simulate timing for different sizes
np.random.seed(42)
for size in dataset_sizes:
    # XGBoost: roughly linear with size
    training_times['XGBoost'].append(0.05 + size * 0.00008 + np.random.normal(0, 0.01))
    # LightGBM: faster, especially for larger sizes
    training_times['LightGBM'].append(0.03 + size * 0.00004 + np.random.normal(0, 0.005))
    # CatBoost: more overhead but efficient
    training_times['CatBoost'].append(0.08 + size * 0.00006 + np.random.normal(0, 0.01))

fig = go.Figure()

for model, color in colors.items():
    fig.add_trace(go.Scatter(
        x=dataset_sizes,
        y=training_times[model],
        mode='lines+markers',
        name=model,
        line=dict(color=color, width=2),
        marker=dict(size=8)
    ))

fig.update_layout(
    title='Training Time Scaling by Dataset Size (Simulated)',
    xaxis_title='Number of Samples',
    yaxis_title='Training Time (seconds)',
    height=400
)

fig.show()

print("Note: These timings are simulated, not measured. In practice, LightGBM typically shows its biggest speed advantage on larger datasets.")
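
Single wall-clock measurements, like those taken in Section 4.1, are noisy on small datasets. One way to get steadier numbers is to repeat each fit and report the median. The helper below is a hypothetical sketch; the workload passed to it is a stand-in, not a real model fit.

```python
import statistics
import time

def median_fit_time(fit_fn, repeats=5):
    """Median wall-clock seconds over several calls to fit_fn()."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution timer
        fit_fn()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)

# Stand-in workload (assumption). In this notebook one could instead pass
# a callable that builds and fits a fresh model on the training data.
elapsed = median_fit_time(lambda: sorted(range(50_000), reverse=True))
print(f"Median time: {elapsed:.4f}s")
```

Building a fresh model inside the callable keeps each repeat independent of the previous one.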

### 4.3 When to Use Each Library

In [None]:
# Decision guide
decision_guide = {
    'Scenario': [
        'Large dataset (>100k rows)',
        'Many categorical features',
        'Quick prototyping needed',
        'Production deployment',
        'GPU training required',
        'Minimal preprocessing desired',
        'Best documentation/community',
        'Small dataset (risk of overfit)',
        'High-cardinality categoricals',
        'Need feature selection'
    ],
    'Best Choice': [
        'LightGBM',
        'CatBoost',
        'CatBoost',
        'XGBoost or LightGBM',
        'CatBoost or XGBoost',
        'CatBoost',
        'XGBoost',
        'CatBoost',
        'CatBoost',
        'Any (all support)'
    ],
    'Reason': [
        'Fastest training, lowest memory',
        'Native handling without encoding',
        'Works out of the box with minimal tuning',
        'Mature, well-tested, great tooling',
        'Mature GPU implementations',
        'Accepts string categoricals directly',
        'Largest community, most tutorials',
        'Ordered boosting reduces overfitting',
        'Efficient ordered target encoding',
        'Built-in importance metrics'
    ]
}

guide_df = pd.DataFrame(decision_guide)
guide_df

In [None]:
# Summary comparison table
summary_comparison = {
    'Aspect': [
        'Speed',
        'Memory Usage',
        'Categorical Handling',
        'Missing Values',
        'Default Performance',
        'Overfitting Risk',
        'GPU Support',
        'scikit-learn Compatible',
        'Ease of Use'
    ],
    'XGBoost': [
        'Fast',
        'Moderate',
        'Requires encoding',
        'Native (learns direction)',
        'Good',
        'Moderate',
        'Yes',
        'Yes',
        'Good'
    ],
    'LightGBM': [
        'Very Fast',
        'Low',
        'Native (integer-encoded)',
        'Native',
        'Good',
        'Higher (leaf-wise)',
        'Yes',
        'Yes',
        'Good'
    ],
    'CatBoost': [
        'Fast',
        'Moderate',
        'Native (strings OK!)',
        'Native',
        'Excellent',
        'Lower (ordered)',
        'Excellent',
        'Yes',
        'Excellent'
    ]
}

pd.DataFrame(summary_comparison).set_index('Aspect')

### Recommendation for Higher Education Analytics

For **student departure prediction** and similar higher education problems:

| Situation | Recommendation |
|:----------|:---------------|
| **Starting out** | CatBoost - easiest setup, handles categoricals automatically |
| **Large institution (100k+ students)** | LightGBM - fastest training |
| **Production system** | XGBoost - most mature, best tooling |
| **Many demographic categoricals** | CatBoost - best categorical handling |
| **Ensemble all three** | Often gives best results! |
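
The simplest way to ensemble the three libraries is soft voting: average the predicted probabilities and threshold the result. The sketch below uses toy probabilities and a hypothetical `soft_vote` helper; in this notebook, the arrays `xgb_proba`, `lgb_proba`, and `cat_proba` from Section 4.1 could be passed in instead.

```python
def soft_vote(probability_lists, threshold=0.5):
    """Average per-model probabilities and threshold the mean."""
    n_models = len(probability_lists)
    avg = [sum(ps) / n_models for ps in zip(*probability_lists)]
    labels = [int(p >= threshold) for p in avg]
    return avg, labels

# Toy predicted departure probabilities for three students (assumption)
toy_xgb = [0.20, 0.70, 0.55]
toy_lgb = [0.30, 0.60, 0.40]
toy_cat = [0.10, 0.80, 0.45]

avg_proba, ensemble_pred = soft_vote([toy_xgb, toy_lgb, toy_cat])
print([round(p, 3) for p in avg_proba], ensemble_pred)
```

Whether the ensemble actually beats the best single model should be checked on held-out data rather than assumed.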

## 5. Summary

In this notebook, we covered:

### Key Concepts

1. **LightGBM**:
   - Leaf-wise tree growth for efficiency
   - Native categorical support (integer-encoded)
   - Best for large datasets and speed-critical applications
   - Use `num_leaves` to control complexity

2. **CatBoost**:
   - Native string categorical support (no encoding needed)
   - Ordered target statistics avoid target leakage; ordered boosting reduces prediction shift
   - Best for ease of use and categorical-heavy data
   - Robust to overfitting on small datasets

3. **Comparison**:
   - All three perform similarly on many tasks
   - Choose based on data characteristics and requirements
   - Consider ensembling for best results

### Summary Table

| Library | Key Strength | Best For |
|:--------|:-------------|:---------|
| XGBoost | Maturity, tooling | Production, benchmarking |
| LightGBM | Speed, efficiency | Large datasets |
| CatBoost | Categorical handling | Ease of use, categoricals |

### Next Steps

In the next notebook, we will explore **training and evaluation** techniques including early stopping, cross-validation, and SHAP-based feature importance.

**Proceed to:** `4.4 Train and Evaluate Boosted Models`