# 6.1 Special Topics: Additional Boosting Algorithms

## Course 3: Advanced Classification Models for Student Success

## Introduction

In Module 2, we covered the three core tree-based models: Decision Trees, Random Forests, and XGBoost. This special topics module explores **additional algorithms** that are valuable to know but less commonly deployed in higher education settings.

This notebook covers:
- **AdaBoost** — the original boosting algorithm
- **LightGBM** — Microsoft's fast gradient boosting library
- **CatBoost** — Yandex's categorical-feature-optimized boosting library

### When to Use These Models

These models are worth exploring when:
- You need faster training on very large datasets (LightGBM)
- Your data has many categorical features (CatBoost)
- You want to understand the historical foundations of boosting (AdaBoost)
- You're benchmarking many algorithms for a research paper

## 1. AdaBoost: The Original Boosting Algorithm

**AdaBoost** (Adaptive Boosting), introduced in 1996, works by:
1. Giving all training samples equal weight
2. Training a weak learner (typically a decision stump)
3. Increasing weights of misclassified samples
4. Training the next weak learner on reweighted data
5. Combining all learners with weighted voting
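
The five steps above can be sketched in plain Python. This is a minimal illustration rather than scikit-learn's implementation: it assumes labels coded as -1/+1 and a single numeric feature, and the helper names (`stump_predict`, `fit_stump`, `adaboost_fit`, `adaboost_predict`) and the toy data are invented for this example.

```python
import math

def stump_predict(x, threshold, polarity):
    """Decision stump: predict `polarity` when x <= threshold, else the opposite."""
    return polarity if x <= threshold else -polarity

def fit_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost_fit(xs, ys, n_rounds=3):
    n = len(xs)
    weights = [1.0 / n] * n                        # Step 1: equal weights
    ensemble = []
    for _ in range(n_rounds):
        # Steps 2 and 4: fit a weak learner on the current weights
        err, threshold, polarity = fit_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)      # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)    # vote strength of this learner
        ensemble.append((alpha, threshold, polarity))
        # Step 3: misclassified samples (y * prediction == -1) gain weight
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]     # renormalize to sum to 1
    return ensemble

def adaboost_predict(ensemble, x):
    """Step 5: weighted vote of all stumps."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical toy data: no single threshold separates the two classes
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

ensemble = adaboost_fit(xs, ys, n_rounds=3)
predictions = [adaboost_predict(ensemble, x) for x in xs]
accuracy = sum(p == y for p, y in zip(predictions, ys)) / len(ys)
print(f"Training accuracy after 3 rounds: {accuracy:.2f}")
```

Each stump on its own misclassifies part of this toy set; the reweighting forces later stumps to focus on the points earlier ones got wrong, and the weighted vote combines them.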

### Key Characteristics
- Uses **decision stumps** (trees with depth=1) by default
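- Sensitive to **outliers and noise**: samples that keep being misclassified (including mislabeled ones) receive ever-larger weights, so later learners can end up fitting noise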
- Has been largely superseded by gradient boosting in practice

In [None]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import roc_auc_score
import pandas as pd
import numpy as np

# Load data (same preparation as Module 2)
train_df = pd.read_csv('../../data/training.csv')
test_df = pd.read_csv('../../data/testing.csv')
train_df['DEPARTED'] = (train_df['SEM_3_STATUS'] != 'E').astype(int)
test_df['DEPARTED'] = (test_df['SEM_3_STATUS'] != 'E').astype(int)

numeric_features = ['HS_GPA','HS_MATH_GPA','HS_ENGL_GPA','UNITS_ATTEMPTED_1','UNITS_ATTEMPTED_2',
    'UNITS_COMPLETED_1','UNITS_COMPLETED_2','DFW_UNITS_1','DFW_UNITS_2','GPA_1','GPA_2',
    'DFW_RATE_1','DFW_RATE_2','GRADE_POINTS_1','GRADE_POINTS_2']
categorical_features = ['RACE_ETHNICITY','GENDER','FIRST_GEN_STATUS','COLLEGE']

train_enc = pd.get_dummies(train_df[numeric_features + categorical_features],
                           columns=categorical_features, drop_first=True)
test_enc = pd.get_dummies(test_df[numeric_features + categorical_features],
                          columns=categorical_features, drop_first=True)
train_enc, test_enc = train_enc.align(test_enc, join='left', axis=1, fill_value=0)
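# Impute with training-set medians so the test set does not inform preprocessing
train_medians = train_enc.median()
train_enc = train_enc.fillna(train_medians)
test_enc = test_enc.fillna(train_medians)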

X_train, y_train = train_enc, train_df['DEPARTED']
X_test, y_test = test_enc, test_df['DEPARTED']

# AdaBoost
ada = AdaBoostClassifier(n_estimators=200, learning_rate=0.1, random_state=42)
ada.fit(X_train, y_train)
ada_prob = ada.predict_proba(X_test)[:, 1]

print(f"AdaBoost ROC-AUC: {roc_auc_score(y_test, ada_prob):.4f}")

## 2. LightGBM: Fast Gradient Boosting

**LightGBM** (Light Gradient Boosting Machine) by Microsoft focuses on speed and efficiency:

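- **Leaf-wise tree growth**: always splits the leaf with the largest loss reduction, rather than growing level by level as XGBoost does by default; this can overfit small datasets, so cap `num_leaves` or `max_depth`
- **Histogram-based splitting**: bins continuous features so the split search scans bins instead of every distinct value
- **GOSS** (Gradient-based One-Side Sampling): keeps all large-gradient samples and randomly subsamples the small-gradient ones (optional, not enabled by default)
- **EFB** (Exclusive Feature Bundling): bundles mutually exclusive features (sparse features that are rarely nonzero together, such as one-hot columns) into a single feature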
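
To make the GOSS bullet concrete, here is a rough sketch of the sampling step in plain Python. It is an illustration of the idea, not LightGBM's internal code: the function name `goss_sample` and the gradient values are made up, while `top_rate` and `other_rate` are chosen to mirror the names of the corresponding LightGBM parameters.

```python
import random

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=42):
    """Return {row_index: weight} for the rows kept by GOSS."""
    n = len(gradients)
    n_top = int(n * top_rate)
    n_other = int(n * other_rate)

    # Rank rows from largest to smallest absolute gradient
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top_rows, rest = order[:n_top], order[n_top:]

    # Keep every large-gradient row; randomly sample from the remaining rows
    sampled_rows = random.Random(seed).sample(rest, n_other)

    # Up-weight the sampled small-gradient rows so that their total
    # contribution still approximates the full small-gradient set
    amplify = (1 - top_rate) / other_rate
    weights = {i: 1.0 for i in top_rows}
    weights.update({i: amplify for i in sampled_rows})
    return weights

# Hypothetical gradients: row i has gradient magnitude i, with mixed signs
gradients = [float(i) if i % 2 == 0 else -float(i) for i in range(100)]
kept = goss_sample(gradients)
print(f"Kept {len(kept)} of {len(gradients)} rows")
```

The next tree is then fit on the kept rows only, using the returned weights, which is where the speedup on large datasets comes from.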

### When to Choose LightGBM over XGBoost
- Very large datasets (millions of rows)
- Training speed is critical
- Memory constraints

In [None]:
try:
    from lightgbm import LGBMClassifier

    lgbm = LGBMClassifier(
        n_estimators=150, learning_rate=0.1, max_depth=5,
        num_leaves=31, min_child_samples=20,
        subsample=0.8, subsample_freq=1,  # subsample is ignored unless subsample_freq > 0
        colsample_bytree=0.8,
        class_weight='balanced', random_state=42, verbose=-1
    )
    lgbm.fit(X_train, y_train)
    lgbm_prob = lgbm.predict_proba(X_test)[:, 1]

    print(f"LightGBM ROC-AUC: {roc_auc_score(y_test, lgbm_prob):.4f}")

except ImportError:
    print("LightGBM not installed. Install with: pip install lightgbm")

## 3. CatBoost: Native Categorical Feature Handling

**CatBoost** (Categorical Boosting) by Yandex is designed for data with many categorical features:

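- **Native categorical encoding**: converts categories to target-based statistics internally, so one-hot encoding is not required
- **Ordered boosting**: each sample's statistics are computed only from samples that precede it in a random permutation, which reduces target leakage
- **Symmetric trees**: every node at a given depth uses the same split, which speeds up prediction and acts as a mild regularizer
- **Good out-of-the-box performance** with minimal tuning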
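
The ordered idea behind the first two bullets can be sketched in a few lines. This is a simplified illustration of an ordered target statistic, not CatBoost's actual implementation: the function name `ordered_target_stats`, the `prior` and `prior_weight` arguments, and the college and outcome values are all hypothetical.

```python
import random

def ordered_target_stats(categories, targets, prior=0.5, prior_weight=1.0, seed=42):
    """Encode each row's category using only rows earlier in a random permutation."""
    n = len(categories)
    order = list(range(n))
    random.Random(seed).shuffle(order)          # random "arrival order" of rows

    sums, counts = {}, {}
    encoded = [0.0] * n
    for i in order:
        c = categories[i]
        s, k = sums.get(c, 0.0), counts.get(c, 0)
        # Smoothed mean of the target over EARLIER rows with the same category;
        # the row's own target is not used, which is what limits leakage
        encoded[i] = (s + prior * prior_weight) / (k + prior_weight)
        # Only now does this row's target become visible to later rows
        sums[c] = s + targets[i]
        counts[c] = k + 1
    return encoded

# Hypothetical data: college of enrollment and a 0/1 departure flag
colleges = ["Engineering", "Arts", "Engineering", "Science",
            "Arts", "Engineering", "Science", "Arts"]
departed = [1, 0, 0, 1, 0, 1, 0, 1]

encoded = ordered_target_stats(colleges, departed)
for college, value in zip(colleges, encoded):
    print(f"{college:<12} {value:.3f}")
```

Rows with the same category can receive different encodings depending on their position in the permutation. A plain target mean, by contrast, would include each row's own label in its encoding.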

### When to Choose CatBoost
- Data with many categorical features
- You want minimal preprocessing
- Small datasets where overfitting is a concern

In [None]:
try:
    from catboost import CatBoostClassifier

    cat = CatBoostClassifier(
        iterations=150, learning_rate=0.1, depth=5,
        auto_class_weights='Balanced',
        random_seed=42, verbose=0
    )
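    # X_train is already one-hot encoded, so CatBoost's native categorical
    # handling is not exercised here; pass raw columns via cat_features to use it.
    cat.fit(X_train, y_train)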
    cat_prob = cat.predict_proba(X_test)[:, 1]

    print(f"CatBoost ROC-AUC: {roc_auc_score(y_test, cat_prob):.4f}")

except ImportError:
    print("CatBoost not installed. Install with: pip install catboost")

## 4. Comparison of All Boosting Methods

| Feature | AdaBoost | XGBoost | LightGBM | CatBoost |
|:--------|:---------|:--------|:---------|:---------|
| **Year** | 1996 | 2014 | 2017 | 2017 |
| **Strategy** | Sample reweighting | Gradient boosting, level-wise trees by default | Gradient boosting, leaf-wise trees, optional GOSS | Gradient boosting with ordered boosting |
| **Speed** | Moderate | Fast | Very fast | Fast |
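| **Categorical handling** | Needs encoding | Needs encoding by default (native support is opt-in in recent versions) | Native (integer codes or pandas `category` dtype) | Native (string or integer) |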
| **Missing data** | Needs imputation | Native | Native | Native |
| **Best for** | Historical interest | General purpose | Large data, speed | Categorical features |
| **In this course** | Special topic | Core model (Module 2) | Special topic | Special topic |

### Bottom Line

For most higher education use cases, **XGBoost** (covered in Module 2) is sufficient. LightGBM and CatBoost offer marginal improvements in specific scenarios but add complexity to your toolkit.

## 5. Summary

- **AdaBoost**: Historical importance, largely superseded by gradient boosting
- **LightGBM**: Choose for very large datasets or when speed is critical
- **CatBoost**: Choose when you have many categorical features and want easy setup
- **For this course**: XGBoost is the recommended boosting algorithm for practical use

**Proceed to:** `6.2 Special Topics: Neural Networks`