# 04 - Random Forest (Hands-on)

Objectives:
- Train a Random Forest classifier and understand bagging + feature randomness
- Use OOB score and test metrics to evaluate generalization
- Tune key hyperparameters (n_estimators, max_depth, max_features, min_samples_leaf)
- Interpret models via feature importance and permutation importance
- Consider class imbalance and basic compute/runtime trade-offs

Assumptions:
- Ensemble diversity: multiple de-correlated trees improve generalization
- Randomness: bootstrap samples and random feature subsets de-correlate the trees, which lowers the variance of the averaged prediction
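
The variance argument behind these assumptions can be checked with a small simulation. The sketch below is not part of the original notebook: it assumes a toy model in which each tree's prediction is a unit-variance Gaussian and every pair of trees has the same correlation `rho`. Under that assumption the variance of the average of `n_trees` predictions is `rho * sigma2 + (1 - rho) / n_trees * sigma2`, so adding trees only shrinks the second term, and de-correlating trees is what lowers the floor.

```python
import numpy as np

def ensemble_variance(rho, n_trees, sigma2=1.0):
    """Theoretical variance of the mean of n_trees equally correlated predictors."""
    return rho * sigma2 + (1.0 - rho) / n_trees * sigma2

def simulated_variance(rho, n_trees, n_draws=50_000, seed=0):
    """Empirical variance of the mean, built from a shared plus a private noise term."""
    rng = np.random.default_rng(seed)
    shared = rng.standard_normal((n_draws, 1))          # component common to all trees
    private = rng.standard_normal((n_draws, n_trees))   # tree-specific component
    preds = np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * private
    return preds.mean(axis=1).var()

for rho in (0.0, 0.3, 0.8):
    print(rho, round(ensemble_variance(rho, 50), 4), round(simulated_variance(rho, 50), 4))
```

With `rho = 0.8` the averaged variance stays close to 0.8 no matter how many trees are added, which is why feature subsampling matters in addition to bagging.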

Cautions/Data Prep:
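- Handle missing values beforehand (scikit-learn forests reject NaNs before version 1.4; newer releases handle them natively, but explicit imputation keeps behavior predictable)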
- Encode categorical features (one-hot) if present; scaling is not required
- Random Forests can be memory- and time-intensive on large data; lower `n_estimators` or cap `max_depth` to fit your budget
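
The preparation steps above can be bundled into a single pipeline. This is a minimal sketch on a hypothetical toy frame (the `age` and `city` columns and the `toy_*` names are made up for illustration); the Breast Cancer data used in this notebook has no missing values or categorical columns, so it does not need this step.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one numeric column with gaps, one categorical column.
toy = pd.DataFrame({
    'age': [25, np.nan, 47, 51, 38, np.nan, 29, 60],
    'city': ['A', 'B', 'A', 'C', 'B', 'C', 'A', 'B'],
})
toy_y = np.array([0, 0, 1, 1, 0, 1, 0, 1])

toy_prep = ColumnTransformer([
    ('num', SimpleImputer(strategy='median'), ['age']),          # fill NaNs with the median
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['city']),   # one-hot encode categories
])
toy_pipe = Pipeline([
    ('prep', toy_prep),
    ('rf', RandomForestClassifier(n_estimators=50, random_state=42)),  # no scaler needed
])
toy_pipe.fit(toy, toy_y)
print(toy_pipe.predict(toy))
```

Keeping the imputer and encoder inside the pipeline means they are fit on training folds only when the pipeline is cross-validated.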


In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='whitegrid', context='notebook')
np.random.seed(42)

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, RocCurveDisplay
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance


## 1) Load dataset and stratified split
We use the Breast Cancer dataset again for comparability with Decision Trees.

In [None]:
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name='target')
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X.shape, y.value_counts(normalize=True).round(3)

## 2) Baseline Random Forest with OOB score
Use `oob_score=True` to estimate generalization without a separate validation set: each tree is scored on the training rows left out of its bootstrap sample (requires `bootstrap=True`). Keep CPU-friendly settings.

In [None]:
rf = RandomForestClassifier(
    n_estimators=200,
    max_depth=None,
    max_features='sqrt',
    min_samples_leaf=1,
    bootstrap=True,
    oob_score=True,
    n_jobs=-1,
    random_state=42
)
rf.fit(X_train, y_train)
oob = getattr(rf, 'oob_score_', None)
proba = rf.predict_proba(X_test)[:, 1]
pred = (proba >= 0.5).astype(int)
print({'oob_score': round(oob, 4) if oob is not None else None,
       'test_auc': round(roc_auc_score(y_test, proba), 4)})
print(classification_report(y_test, pred, digits=3))
confusion_matrix(y_test, pred)
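
To see why the OOB estimate has enough data to work with, note that a bootstrap sample of size n leaves out any given row with probability (1 - 1/n)^n, which approaches 1/e (about 0.368) as n grows. The NumPy-only sketch below simulates this; it does not reproduce scikit-learn's internal sampling, and `n_rows = 455` is simply the size of the 80% training split used above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_trees = 455, 200

oob_fractions = []
for _ in range(n_trees):
    in_bag = rng.integers(0, n_rows, size=n_rows)   # bootstrap: draw rows with replacement
    oob_mask = np.ones(n_rows, dtype=bool)
    oob_mask[in_bag] = False                        # rows never drawn are out-of-bag for this tree
    oob_fractions.append(oob_mask.mean())

mean_oob = float(np.mean(oob_fractions))
expected_oob = (1 - 1 / n_rows) ** n_rows
print(round(mean_oob, 3), round(expected_oob, 3))
```

Each training row is therefore out-of-bag for roughly a third of the trees, and only those trees vote when its OOB prediction is formed.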

Plot the ROC curve for the test-set probabilities.

In [None]:
RocCurveDisplay.from_predictions(y_test, proba, name='RandomForest')
plt.show()

## 3) Feature importance and permutation importance
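Gini (impurity-based) importance is computed from training-set splits and can be biased toward high-cardinality or noisy features. Permutation importance, computed here on the test set, is slower but usually more robust; note that it can understate features that are strongly correlated with others.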

In [None]:
imp = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
imp.head(10)

In [None]:
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42, n_jobs=-1)
perm_imp = pd.Series(perm.importances_mean, index=X.columns).sort_values(ascending=False)
perm_imp.head(10)
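
One way to probe the bias described above is to append a column of pure noise and compare how the two measures treat it. The sketch below is an illustration under that assumption, not part of the original notebook; the `noise` column and the `*_demo` names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

demo = load_breast_cancer(as_frame=True)
X_demo = demo.data.copy()
X_demo['noise'] = np.random.default_rng(0).normal(size=len(X_demo))  # assumption: pure-noise column

Xd_tr, Xd_te, yd_tr, yd_te = train_test_split(
    X_demo, demo.target, test_size=0.2, random_state=42, stratify=demo.target
)
rf_demo = RandomForestClassifier(n_estimators=200, random_state=42).fit(Xd_tr, yd_tr)

# Impurity-based importance: derived from splits made on the training data.
gini_noise = pd.Series(rf_demo.feature_importances_, index=X_demo.columns)['noise']

# Permutation importance: drop in held-out accuracy when the column is shuffled.
perm_demo = permutation_importance(rf_demo, Xd_te, yd_te, n_repeats=10, random_state=42)
perm_noise = pd.Series(perm_demo.importances_mean, index=X_demo.columns)['noise']

print({'gini_noise': round(float(gini_noise), 4), 'perm_noise': round(float(perm_noise), 4)})
```

Fully grown trees tend to split on the noise column somewhere, so its impurity-based share is typically above zero, while shuffling it on held-out data should change accuracy very little.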

## 4) Hyperparameter effects (quick sweeps)
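CPU-friendly small sweeps for `n_estimators` and `max_depth`. Observe the AUC trends: more trees tend to stabilize the score, while depth controls overfitting. Because these sweeps score on the test split, treat them as illustration rather than model selection.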

In [None]:
def eval_rf(n_estimators=200, max_depth=None, max_features='sqrt', min_samples_leaf=1):
    """Fit a forest on the training split and return its test ROC-AUC."""
    model = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        max_features=max_features,
        min_samples_leaf=min_samples_leaf,
        n_jobs=-1,
        random_state=42
    ).fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:,1])
    return auc

ests = [50, 100, 200, 400]
aucs_e = [eval_rf(n_estimators=e) for e in ests]
plt.figure(figsize=(6,4))
plt.plot(ests, aucs_e, marker='o')
plt.xlabel('n_estimators')
plt.ylabel('Test ROC-AUC')
plt.title('Effect of n_estimators')
plt.tight_layout(); plt.show()
list(zip(ests, [round(a,3) for a in aucs_e]))
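
An alternative way to watch the score stabilize is to grow a single forest incrementally and track its OOB score, which avoids refitting from scratch and does not touch the test split. This is a sketch using `warm_start=True`; it fits on the full dataset only to stay self-contained (the `X_all`, `y_all`, and `rf_ws` names are illustrative), and `X_train, y_train` can be substituted directly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X_all, y_all = load_breast_cancer(return_X_y=True)

rf_ws = RandomForestClassifier(
    warm_start=True, oob_score=True, max_features='sqrt', random_state=42
)
oob_curve = []
for n in (50, 100, 200, 400):
    rf_ws.set_params(n_estimators=n)   # warm_start keeps existing trees and adds only the new ones
    rf_ws.fit(X_all, y_all)
    oob_curve.append((n, round(rf_ws.oob_score_, 4)))
print(oob_curve)
```

Note that `oob_score_` is accuracy by default, so this curve is not directly comparable to the ROC-AUC values plotted above.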

In [None]:
depths = [None, 3, 5, 7, 9]
aucs_d = [eval_rf(max_depth=d) for d in depths]
plt.figure(figsize=(6,4))
labels = ['None' if d is None else str(d) for d in depths]
plt.plot(labels, aucs_d, marker='o')
plt.xlabel('max_depth')
plt.ylabel('Test ROC-AUC')
plt.title('Effect of max_depth')
plt.tight_layout(); plt.show()
list(zip(labels, [round(a,3) for a in aucs_d]))
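
The sweeps above compare settings on the test split, which is convenient for a demo but lets the test data influence the choice of hyperparameters. One common alternative, sketched here with assumed names (`X_cv`, `cv_auc`) and 5 stratified folds, is to cross-validate on the training split only and keep the test split for a final check.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

X_cv, y_cv = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_cv, y_cv, test_size=0.2, random_state=42, stratify=y_cv
)

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_auc = {}
for depth in (None, 3, 5, 7, 9):
    candidate = RandomForestClassifier(n_estimators=100, max_depth=depth, random_state=42)
    scores = cross_val_score(candidate, X_tr, y_tr, cv=folds, scoring='roc_auc')
    cv_auc[str(depth)] = (round(scores.mean(), 4), round(scores.std(), 4))  # (mean, std) over folds
print(cv_auc)
```

The fold-to-fold standard deviation also gives a sense of whether differences between depths are larger than the noise.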

## 5) Class imbalance: class_weight
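Use `class_weight='balanced'` to weight classes inversely to their frequency and so compensate for skew. Compare F1/Recall/ROC-AUC to the baseline; this dataset is only mildly imbalanced (about 63% benign), so expect small differences.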

In [None]:
rf_bal = RandomForestClassifier(
    n_estimators=200,
    class_weight='balanced',
    n_jobs=-1,
    random_state=42
).fit(X_train, y_train)
prob_bal = rf_bal.predict_proba(X_test)[:,1]
pred_bal = (prob_bal >= 0.5).astype(int)
print('Balanced class weights:')
print(classification_report(y_test, pred_bal, digits=3))
print('ROC-AUC:', round(roc_auc_score(y_test, prob_bal), 3))
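
For reference, the `'balanced'` heuristic sets each class weight to `n_samples / (n_classes * count_of_that_class)`. The sketch below recomputes it by hand and checks it against scikit-learn's `compute_class_weight` helper; it uses the full label vector for simplicity (the `y_bc`, `manual_w`, and `auto_w` names are illustrative), whereas the forest applies the same formula to the labels it is fit on.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.utils.class_weight import compute_class_weight

_, y_bc = load_breast_cancer(return_X_y=True)
classes_bc = np.unique(y_bc)

# 'balanced' heuristic: n_samples / (n_classes * count_per_class)
manual_w = len(y_bc) / (len(classes_bc) * np.bincount(y_bc))
auto_w = compute_class_weight(class_weight='balanced', classes=classes_bc, y=y_bc)

print(dict(zip(classes_bc.tolist(), np.round(auto_w, 3).tolist())))
```

With 212 malignant (class 0) and 357 benign (class 1) samples, the minority class gets the larger weight, but the two weights are not far apart.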

## Exercises
Instructor solutions are hidden/collapsed.
1. Max features: Try `max_features` in ['sqrt','log2', 0.5] and compare AUC + feature importances.
2. Leaves: Sweep `min_samples_leaf` = [1, 2, 5, 10]. Observe how it affects overfitting and AUC.
3. Permutation importance stability: Increase `n_repeats` for permutation importance and compare top-5 rankings.


In [None]:
# Exercise 1: max_features
# TODO: Evaluate AUC for max_features in ['sqrt','log2', 0.5] using 200 trees; plot a bar chart of AUCs.
...

In [None]:
# Solution 1 (hidden)
opts = ['sqrt', 'log2', 0.5]
results = {}
for mf in opts:
    auc = eval_rf(n_estimators=200, max_features=mf)
    results[str(mf)] = auc
plt.figure(figsize=(5,3))
plt.bar(list(results.keys()), list(results.values()))
plt.ylabel('AUC')
plt.title('AUC vs max_features')
plt.tight_layout(); plt.show()
results

In [None]:
# Exercise 2: min_samples_leaf
# TODO: Sweep min_samples_leaf = [1,2,5,10] at fixed n_estimators=200 and max_features='sqrt'; plot AUC vs min_samples_leaf.
...

In [None]:
# Solution 2 (hidden)
leaves = [1,2,5,10]
aucs_l = [eval_rf(n_estimators=200, max_features='sqrt', min_samples_leaf=leaf) for leaf in leaves]
plt.figure(figsize=(5,3))
plt.plot(leaves, aucs_l, marker='o')
plt.xlabel('min_samples_leaf')
plt.ylabel('AUC')
plt.title('AUC vs min_samples_leaf')
plt.tight_layout(); plt.show()
list(zip(leaves, [round(a,3) for a in aucs_l]))

In [None]:
# Exercise 3: Permutation importance stability
# TODO: Compute permutation importance with n_repeats=5 and n_repeats=30; compare top-5 features in each case.
...

In [None]:
# Solution 3 (hidden)
perm5 = permutation_importance(rf, X_test, y_test, n_repeats=5, random_state=42, n_jobs=-1)
top5_a = pd.Series(perm5.importances_mean, index=X.columns).sort_values(ascending=False).head(5)
perm30 = permutation_importance(rf, X_test, y_test, n_repeats=30, random_state=42, n_jobs=-1)
top5_b = pd.Series(perm30.importances_mean, index=X.columns).sort_values(ascending=False).head(5)
top5_a, top5_b

## Wrap-up checklist
- [ ] Consider the OOB score for a quick generalization estimate
- [ ] Tune tree count and depth for stability vs compute
- [ ] Inspect both Gini and permutation importance
- [ ] Handle class imbalance if present
- [ ] Monitor memory/time for large datasets
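
For the last checklist item, a rough way to monitor cost is to time the fit and count the nodes across all trees. This is a sketch, not a profiler: total node count is used here as an assumed proxy for memory footprint, and the timings depend on the machine.

```python
import time
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X_t, y_t = load_breast_cancer(return_X_y=True)

cost = []
for n_est, depth in [(50, 5), (200, 5), (200, None)]:
    start = time.perf_counter()
    forest = RandomForestClassifier(
        n_estimators=n_est, max_depth=depth, random_state=42
    ).fit(X_t, y_t)
    elapsed = time.perf_counter() - start
    # Sum of nodes over all fitted trees: a rough proxy for model size.
    n_nodes = sum(est.tree_.node_count for est in forest.estimators_)
    cost.append({'n_estimators': n_est, 'max_depth': depth,
                 'fit_seconds': round(elapsed, 3), 'total_nodes': n_nodes})
print(cost)
```

On larger datasets, both numbers grow roughly linearly with `n_estimators`, which makes tree count the simplest knob for trading accuracy against compute.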
