# 05 - XGBoost (Hands-on)

Objectives:
- Train an XGBoost classifier on tabular data with early stopping
- Understand core hyperparameters: learning rate (eta), max_depth, subsample, colsample_bytree
- Evaluate with ROC-AUC, classification report; compare validation vs test
- Inspect feature importance and discuss leakage/overfitting risks

Assumptions:
- Boosting: models improve by fitting residuals/gradients of prior iterations
- Best for structured/tabular data; handles numeric features well
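
To make the "fitting residuals" idea concrete, here is a minimal sketch of boosting with one-split stumps under squared error. It is a simplification, not XGBoost's algorithm (XGBoost also uses second-order gradients and regularization); the helper names `fit_stump` and `boost` and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold on a 1-D feature that best fits the residuals."""
    best = None
    for thr in np.unique(x)[:-1]:  # drop the max so the right side is never empty
        left = x <= thr
        l_val, r_val = residual[left].mean(), residual[~left].mean()
        sse = ((residual - np.where(left, l_val, r_val)) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, thr, l_val, r_val)
    return best[1:]

def boost(x, y, n_rounds=50, eta=0.1):
    """Each round fits a stump to the current residuals and adds a shrunken step."""
    pred = np.full(len(y), y.mean())
    for _ in range(n_rounds):
        residual = y - pred  # negative gradient of 0.5 * (y - pred)**2
        thr, l_val, r_val = fit_stump(x, residual)
        pred = pred + eta * np.where(x <= thr, l_val, r_val)
    return pred

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = np.sin(x) + rng.normal(0, 0.1, size=200)

pred = boost(x, y)
mse_start = np.mean((y - y.mean()) ** 2)  # error of the constant baseline
mse_end = np.mean((y - pred) ** 2)        # error after boosting
print(round(mse_start, 4), round(mse_end, 4))
```

A smaller `eta` shrinks each step, so more rounds are needed to reach the same training error; that trade-off is why learning rate and number of trees are tuned together.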

Cautions/Data Prep:
- Missing values are handled natively but can still impact results
- Sensitive to overfitting: use early stopping, regularization, and CV
- Watch for data leakage (ensure proper splits; avoid using future info)
- Tuning (eta, max_depth, subsample, colsample_*) is crucial


In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='whitegrid', context='notebook')
np.random.seed(42)

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score, RocCurveDisplay, confusion_matrix
from xgboost import XGBClassifier


## 1) Load dataset and create train/validation/test split
We will create a validation set for early stopping separate from the test set.
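
As a quick check on how much data each split receives, the sketch below works through the arithmetic for two successive 20% splits. It assumes scikit-learn rounds a fractional `test_size` up to the next whole row; `split_sizes` is a hypothetical helper, not a library function.

```python
import math

def split_sizes(n, test_size):
    """Return (n_train, n_test), assuming the held-out size is rounded up."""
    n_test = math.ceil(test_size * n)
    return n - n_test, n_test

n_total = 569  # rows in the breast cancer dataset
n_train_full, n_test = split_sizes(n_total, 0.2)
n_train, n_val = split_sizes(n_train_full, 0.2)
print({'train': n_train, 'val': n_val, 'test': n_test})
```

With only a few hundred training rows and fewer than a hundred validation rows, validation AUC will be noisy, which is worth keeping in mind when comparing hyperparameter settings.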

In [None]:
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name='target')

# Train/test split first
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Split validation from the training portion
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.2, random_state=42, stratify=y_train_full
)
X.shape, y.value_counts(normalize=True).round(3)

## 2) Baseline XGBoost with early stopping
Keep settings CPU-friendly. Early stopping halts training once the validation metric has not improved for `early_stopping_rounds` consecutive rounds; when several evaluation sets are passed, XGBoost monitors the last one.
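
The stopping rule itself is simple enough to sketch in plain Python. This is a simplified illustration with a made-up score history, not XGBoost's internal code; `early_stop` and its arguments are hypothetical names.

```python
def early_stop(scores, patience, maximize=True):
    """Return (best_iteration, best_score, last_round_run) for a score history."""
    best_it, best = 0, scores[0]
    for i, s in enumerate(scores):
        improved = s > best if maximize else s < best
        if improved:
            best_it, best = i, s
        elif i - best_it >= patience:
            return best_it, best, i  # no improvement for `patience` rounds: stop
    return best_it, best, len(scores) - 1

# Hypothetical validation AUC per boosting round
history = [0.90, 0.95, 0.97, 0.96, 0.97, 0.965, 0.96]
result = early_stop(history, patience=3)
print(result)
```

Here the best score is reached at round 2, a tie at round 4 does not count as an improvement, and training stops at round 5. A larger patience tolerates longer plateaus at the cost of extra rounds.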

In [None]:
xgb = XGBClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=4,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_lambda=1.0,
    objective='binary:logistic',
    eval_metric='auc',
    n_jobs=-1,
    random_state=42,
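    tree_method='hist',  # fast CPU histogram algorithm
    # Passed to the constructor: the fit() argument is deprecated since XGBoost 1.6
    # and removed in later releases.
    early_stopping_rounds=20
)

xgb.fit(
    X_train, y_train,
    # Early stopping monitors the last eval_set entry (the validation set here)
    eval_set=[(X_train, y_train), (X_val, y_val)],
    verbose=False
)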

best_it = int(xgb.best_iteration) if hasattr(xgb, 'best_iteration') else None
best_score = float(xgb.best_score) if hasattr(xgb, 'best_score') else None
print({'best_iteration': best_it, 'best_val_auc': round(best_score, 4) if best_score else None})

proba_val = xgb.predict_proba(X_val)[:,1]
proba_test = xgb.predict_proba(X_test)[:,1]
print({'val_auc': round(roc_auc_score(y_val, proba_val), 4),
       'test_auc': round(roc_auc_score(y_test, proba_test), 4)})
pred_test = (proba_test >= 0.5).astype(int)
print(classification_report(y_test, pred_test, digits=3))
confusion_matrix(y_test, pred_test)

In [None]:
RocCurveDisplay.from_predictions(y_test, proba_test, name='XGBoost (test)')
plt.show()

## 3) Feature importance
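XGBoost exposes several importance types (for example `weight`, `gain`, and `cover`). Here we read `feature_importances_`, whose meaning is set by the model's `importance_type`; the default has changed between releases (gain for tree boosters in recent versions), so check your installed version before interpreting the ranking. The built-in `xgboost.plot_importance` helper is an optional alternative when a plotting backend is available.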
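
Importance types can rank features differently. The sketch below uses a hypothetical list of split records (feature name and loss reduction per split) to contrast `weight` (how often a feature is used to split) with `gain` (average loss reduction per split); the numbers are invented for illustration and do not come from a trained model.

```python
from collections import defaultdict

# Hypothetical (feature, gain) records, one per split across all trees
splits = [
    ("worst radius", 40.0), ("worst radius", 20.0),
    ("mean texture", 5.0), ("mean texture", 4.0), ("mean texture", 3.0),
]

weight = defaultdict(int)        # number of splits using the feature
total_gain = defaultdict(float)  # summed loss reduction over those splits
for feat, gain in splits:
    weight[feat] += 1
    total_gain[feat] += gain

avg_gain = {f: total_gain[f] / weight[f] for f in weight}

top_by_weight = max(weight, key=weight.get)
top_by_gain = max(avg_gain, key=avg_gain.get)
print(dict(weight), avg_gain)
print(top_by_weight, top_by_gain)
```

In this toy case the feature split on most often is not the one with the largest average gain, which is why the importance type should always be stated alongside a ranking.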

In [None]:
imp = pd.Series(xgb.feature_importances_, index=X.columns).sort_values(ascending=False)
imp.head(10)

## 4) Hyperparameter sweeps (CPU-friendly)
Small grids for `max_depth` and `learning_rate`. Observe validation and test AUC trends. In practice, use cross-validation for robust selection.
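
For the cross-validation mentioned above, the core idea is to assign every row to one of k folds while keeping class proportions similar in each. The sketch below is a minimal NumPy version under that assumption; in practice scikit-learn's `StratifiedKFold` is the usual tool, and `stratified_folds` here is a hypothetical helper run on made-up labels.

```python
import numpy as np

def stratified_folds(y, k, seed=0):
    """Assign each sample a fold id in [0, k) with classes spread evenly."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    fold = np.empty(len(y), dtype=int)
    for cls in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == cls))  # shuffle within class
        fold[idx] = np.arange(len(idx)) % k              # deal out round-robin
    return fold

y_demo = np.array([0] * 40 + [1] * 60)  # illustrative labels, 40% / 60%
fold = stratified_folds(y_demo, k=5)

for f in range(5):
    val_mask = fold == f  # this fold validates, the others train
    print(f, int(val_mask.sum()), round(float(y_demo[val_mask].mean()), 2))
```

Each hyperparameter setting would then be scored on every fold and the mean validation AUC compared, which is less sensitive to one lucky or unlucky split than a single validation set.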

In [None]:
def train_eval(eta=0.05, depth=4, subs=0.8, cols=0.8):
    model = XGBClassifier(
        n_estimators=400,
        learning_rate=eta,
        max_depth=depth,
        subsample=subs,
        colsample_bytree=cols,
        reg_lambda=1.0,
        objective='binary:logistic',
        eval_metric='auc',
        n_jobs=-1,
        random_state=42,
        tree_method='hist',
        # Passed to the constructor: the fit() argument is deprecated since XGBoost 1.6
        early_stopping_rounds=15
    )
    model.fit(
        X_train, y_train,
        eval_set=[(X_val, y_val)],
        verbose=False
    )
    vauc = roc_auc_score(y_val, model.predict_proba(X_val)[:,1])
    tauc = roc_auc_score(y_test, model.predict_proba(X_test)[:,1])
    return vauc, tauc

depths = [3, 4, 5, 6]
etas = [0.03, 0.05, 0.1]
results = []
for d in depths:
    for e in etas:
        v,t = train_eval(eta=e, depth=d)
        results.append((d, e, v, t))

res_df = pd.DataFrame(results, columns=['max_depth','eta','val_auc','test_auc'])
display(res_df.sort_values(['val_auc','test_auc'], ascending=False).head(10))

plt.figure(figsize=(7,4))
for d in depths:
    sub = res_df[res_df.max_depth==d]
    plt.plot(sub.eta, sub.val_auc, marker='o', label=f'depth={d}')
plt.xlabel('learning_rate (eta)')
plt.ylabel('Validation AUC')
plt.title('Validation AUC vs eta by max_depth')
plt.legend(); plt.tight_layout(); plt.show()

Row and column subsampling often improves generalization, at the cost of some run-to-run variance. Try a quick sweep over `subsample` and `colsample_bytree` at a fixed depth and eta.
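
Conceptually, each tree is grown on a random slice of the training matrix. The sketch below draws such a slice with NumPy using illustrative sizes; it is a simplified picture (fixed-size sampling without replacement), not XGBoost's exact sampling procedure, and `sample_tree_view` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows, n_cols = 364, 30  # illustrative training-set shape
subsample, colsample_bytree = 0.8, 0.8

def sample_tree_view(rng):
    """Pick the rows and columns that one tree would be allowed to see."""
    rows = rng.choice(n_rows, size=int(subsample * n_rows), replace=False)
    cols = rng.choice(n_cols, size=int(colsample_bytree * n_cols), replace=False)
    return np.sort(rows), np.sort(cols)

rows_a, cols_a = sample_tree_view(rng)
rows_b, cols_b = sample_tree_view(rng)
print(len(rows_a), len(cols_a))
print(len(np.intersect1d(rows_a, rows_b)))  # rows shared by two trees
```

Because successive trees see different rows and features, their errors are less correlated, which is the usual explanation for the generalization benefit.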

In [None]:
subs = [0.6, 0.8, 1.0]
cols = [0.6, 0.8, 1.0]
grid = []
for s in subs:
    for c in cols:
        v,t = train_eval(eta=0.05, depth=4, subs=s, cols=c)
        grid.append((s, c, v, t))
grid_df = pd.DataFrame(grid, columns=['subsample','colsample_bytree','val_auc','test_auc'])
display(grid_df.sort_values(['val_auc','test_auc'], ascending=False))

## 5) Overfitting and leakage notes
- Monitor the gap between validation and test AUC; large gaps may indicate overfitting or unlucky splits
- Ensure all preprocessing (if any) is fit only on training data, then applied to val/test
- Beware engineered features that accidentally use label information (e.g., target means computed across full data)
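
The target-mean case can be sketched in a few lines: category means are learned from training rows only and then looked up for validation rows, with unseen categories falling back to the training-set mean. The helper names and the tiny data are hypothetical.

```python
from collections import defaultdict

def fit_target_means(categories, targets):
    """Learn per-category target means from TRAINING data only."""
    sums, counts = defaultdict(float), defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    global_mean = sum(targets) / len(targets)
    return {c: sums[c] / counts[c] for c in sums}, global_mean

def apply_target_means(categories, means, fallback):
    """Encode any split using the training means; unseen categories get the fallback."""
    return [means.get(c, fallback) for c in categories]

train_cat, train_y = ["a", "a", "b", "b"], [1, 0, 1, 1]
val_cat = ["a", "b", "c"]  # "c" never appears in training

means, fallback = fit_target_means(train_cat, train_y)
val_encoded = apply_target_means(val_cat, means, fallback)
print(val_encoded)
```

Computing the means over train and validation rows together would let validation labels shape a feature the model is later scored on, which inflates validation AUC.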


## Exercises
Instructor solution cells are hidden/collapsed.
1. Learning rate sweep: Try `eta` in [0.01, 0.03, 0.05, 0.1] at `max_depth=4` and plot validation vs test AUC lines.
2. Early stopping sensitivity: Compare `early_stopping_rounds` = 5 vs 20 vs 50; report best_iteration and test AUC.
3. Sanity check: Shuffle labels on the training set and show that validation AUC drops near 0.5 (random). Restore labels afterwards.
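
As background for the sanity check, AUC is the probability that a randomly chosen positive is scored above a randomly chosen negative, so scores unrelated to the labels land near 0.5. The sketch below computes it by direct pairwise comparison on synthetic data; `pairwise_auc` is a hypothetical helper meant for small arrays, not a replacement for `roc_auc_score`.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=2000)
informative = y_demo + rng.normal(0, 1, size=2000)  # scores carry label signal
unrelated = rng.normal(0, 1, size=2000)             # scores ignore the labels

auc_informative = pairwise_auc(y_demo, informative)
auc_unrelated = pairwise_auc(y_demo, unrelated)
print(round(float(auc_informative), 3), round(float(auc_unrelated), 3))
```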


In [None]:
# Exercise 1: Learning rate sweep
# TODO: Evaluate eta in [0.01, 0.03, 0.05, 0.1] at depth=4 with early stopping; plot val/test AUC curves.
...

In [None]:
# Solution 1 (hidden)
etas2 = [0.01, 0.03, 0.05, 0.1]
vals, tests = [], []
for e in etas2:
    v,t = train_eval(eta=e, depth=4)
    vals.append(v); tests.append(t)
plt.figure(figsize=(6,4))
plt.plot(etas2, vals, marker='o', label='Val AUC')
plt.plot(etas2, tests, marker='s', label='Test AUC')
plt.xlabel('eta'); plt.ylabel('AUC'); plt.title('AUC vs learning rate (depth=4)')
plt.legend(); plt.tight_layout(); plt.show()
list(zip(etas2, [round(v,3) for v in vals], [round(t,3) for t in tests]))

In [None]:
# Exercise 2: Early stopping sensitivity
# TODO: Train with early_stopping_rounds in [5, 20, 50] and record best_iteration and test AUC.
...

In [None]:
# Solution 2 (hidden)
rows = []
for es in [5, 20, 50]:
    # early_stopping_rounds goes on the constructor (the fit() argument is deprecated since XGBoost 1.6)
    m = XGBClassifier(n_estimators=600, learning_rate=0.05, max_depth=4, subsample=0.8, colsample_bytree=0.8,
                      objective='binary:logistic', eval_metric='auc', n_jobs=-1, random_state=42, tree_method='hist',
                      early_stopping_rounds=es)
    m.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
    it = getattr(m, 'best_iteration', None)
    auc_t = roc_auc_score(y_test, m.predict_proba(X_test)[:,1])
    rows.append((es, it, auc_t))
pd.DataFrame(rows, columns=['early_stopping_rounds','best_iteration','test_auc'])

In [None]:
# Exercise 3: Sanity check with label shuffle
# TODO: Shuffle y_train (keeping X_train) and refit a small model; show val AUC ~ 0.5. Then restore original labels.
...

In [None]:
# Solution 3 (hidden)
y_train_shuff = y_train.sample(frac=1.0, random_state=123).reset_index(drop=True)
X_train_shuff = X_train.reset_index(drop=True)
# early_stopping_rounds goes on the constructor (the fit() argument is deprecated since XGBoost 1.6)
m = XGBClassifier(n_estimators=200, learning_rate=0.1, max_depth=3, subsample=0.8, colsample_bytree=0.8,
                  objective='binary:logistic', eval_metric='auc', n_jobs=-1, random_state=123, tree_method='hist',
                  early_stopping_rounds=10)
m.fit(X_train_shuff, y_train_shuff, eval_set=[(X_val, y_val)], verbose=False)
val_auc = roc_auc_score(y_val, m.predict_proba(X_val)[:,1])
round(val_auc,3)

## Wrap-up checklist
- [ ] Create train/val/test splits; avoid leakage
- [ ] Use early stopping and track best_iteration
- [ ] Tune eta, max_depth, subsample, colsample_bytree
- [ ] Compare validation vs test AUC to gauge generalization
- [ ] Inspect feature importance carefully; validate with robust metrics
