# 03 - Decision Trees (Hands-on)

Objectives:
- Train a Decision Tree classifier and understand hierarchical splitting
- Compare impurity criteria (gini vs entropy), control overfitting via max_depth and pruning (ccp_alpha)
- Evaluate with accuracy, precision/recall/F1, ROC-AUC, confusion matrix
- Discuss class imbalance handling and interpretability (feature importances, shallow visualization)

Assumptions:
- Data can be recursively split by feature thresholds to improve purity
- Works with both numeric and categorical (after encoding) features
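
To make "split by feature thresholds to improve purity" concrete, here is a minimal sketch of a single split search on one feature. The helper names (`gini`, `entropy`, `best_threshold`) and the toy arrays are illustrative assumptions, not scikit-learn's internal implementation, which is optimized and handles many features at once.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Shannon entropy in bits: -sum_k p_k * log2(p_k)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_threshold(x, y, impurity=gini):
    """Try midpoints between sorted unique values; return (threshold, weighted child impurity)."""
    values = np.unique(x)
    best_t, best_score = None, np.inf
    for t in (values[:-1] + values[1:]) / 2:
        left, right = y[x <= t], y[x > t]
        # Weight each child's impurity by its share of the samples
        score = (len(left) * impurity(left) + len(right) * impurity(right)) / len(y)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Toy data (hypothetical): one feature that separates the classes cleanly
x_toy = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y_toy = np.array([0, 0, 0, 1, 1, 1])

parent_gini = gini(y_toy)                             # 0.5 for a 50/50 node
threshold, child_gini = best_threshold(x_toy, y_toy)  # 6.5 and 0.0 (both children pure)
print(parent_gini, threshold, child_gini)
```

A tree repeats this search over every feature at every node and keeps the split with the largest impurity reduction. Passing `impurity=entropy` swaps the criterion without changing the search.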

Cautions/Data Prep:
- Trees can overfit small/noisy datasets; use pruning or limit depth/leaves
- No scaling needed, but watch class imbalance; use stratified splits and/or class weights
- Very deep trees become less interpretable
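
The overfitting caution can be illustrated with a small sketch on synthetic data. The dataset from `make_classification` with 20% label noise (`flip_y=0.2`) is an assumption chosen to exaggerate the effect, and the specific limits (`max_depth=3`, `min_samples_leaf=20`) are arbitrary examples rather than recommended values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deliberately noisy data (assumption for illustration only)
X_noisy, y_noisy = make_classification(
    n_samples=400, n_features=10, n_informative=3, flip_y=0.2, random_state=0
)
Xn_tr, Xn_te, yn_tr, yn_te = train_test_split(
    X_noisy, y_noisy, test_size=0.3, random_state=0, stratify=y_noisy
)

settings = {
    'unconstrained': {},
    'max_depth=3': {'max_depth': 3},
    'min_samples_leaf=20': {'min_samples_leaf': 20},
}
gaps = {}
for name, params in settings.items():
    model = DecisionTreeClassifier(random_state=0, **params).fit(Xn_tr, yn_tr)
    # Store (train accuracy, test accuracy); a large gap suggests overfitting
    gaps[name] = (model.score(Xn_tr, yn_tr), model.score(Xn_te, yn_te))
    print(f'{name:>20}: train={gaps[name][0]:.3f}  test={gaps[name][1]:.3f}')
```

The unconstrained tree memorizes the training set, noise included, so its training accuracy says little about how it generalizes. The constrained trees trade some training accuracy for a smaller train/test gap.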


In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style='whitegrid', context='notebook')  # set_theme is the preferred name; sns.set is an alias
np.random.seed(42)

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import (classification_report, confusion_matrix, roc_auc_score,
                             RocCurveDisplay)


## 1) Load dataset and create stratified split
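We use the Breast Cancer Wisconsin dataset (binary classification; about 63% benign and 37% malignant, so only mildly imbalanced). In scikit-learn's encoding, label 1 is benign and label 0 is malignant, which matters when reading precision and recall for "class 1".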
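
As a quick sanity check on what `stratify` buys you, the sketch below compares the class-1 rate in a stratified and an unstratified test split. The `_demo`, `_s`, and `_u` variable names are introduced here for illustration; the split parameters mirror the ones used in this notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_breast_cancer(return_X_y=True)
full_rate = y_demo.mean()  # share of class 1 (benign) in the full dataset

# Stratified: class proportions are preserved in both parts
_, _, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
# Unstratified: proportions can drift by chance
_, _, y_tr_u, y_te_u = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(f'full dataset      : {full_rate:.3f}')
print(f'stratified test   : {y_te_s.mean():.3f}')
print(f'unstratified test : {y_te_u.mean():.3f}')
```

With a mildly imbalanced dataset the drift is usually small, but on rarer classes or smaller test sets an unstratified split can noticeably distort the metrics.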

In [None]:
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name='target')
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X.shape, y.value_counts(normalize=True).round(3)

## 2) Fit baseline tree and evaluate
Start with modest constraints to avoid severe overfitting (e.g., max_depth=5), then compare against an unconstrained tree (max_depth=None).
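
The evaluation cells print a classification report and a confusion matrix. As a reading aid, this sketch unpacks a confusion matrix by hand on toy labels (the arrays are hypothetical) and recomputes precision, recall, and F1 for class 1 from the four counts.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Toy labels and predictions (hypothetical)
y_true_toy = np.array([0, 0, 0, 1, 1, 1, 1, 1])
y_pred_toy = np.array([0, 1, 0, 1, 1, 0, 1, 1])

# For binary labels, rows are true classes and columns are predicted classes:
# [[tn, fp],
#  [fn, tp]]
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision = tp / (tp + fp)   # of predicted positives, how many are correct
recall = tp / (tp + fn)      # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(tn, fp, fn, tp)        # 2 1 1 4
print(precision, recall, f1)
```

The same counts drive every row of `classification_report`, so a surprising F1 can usually be traced back to one off-diagonal cell.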

In [None]:
tree5 = DecisionTreeClassifier(max_depth=5, random_state=42)
tree5.fit(X_train, y_train)
proba5 = tree5.predict_proba(X_test)[:,1]  # predicted probability of class 1 (benign)
pred5 = (proba5 >= 0.5).astype(int)        # hard labels at the default 0.5 threshold

print('== Max depth 5 ==')
print(classification_report(y_test, pred5, digits=3))
print('ROC-AUC:', round(roc_auc_score(y_test, proba5), 3))
cm5 = confusion_matrix(y_test, pred5)
cm5


In [None]:
tree_free = DecisionTreeClassifier(random_state=42)
tree_free.fit(X_train, y_train)
proba_free = tree_free.predict_proba(X_test)[:,1]
pred_free = (proba_free >= 0.5).astype(int)

print('== No depth limit ==')
print(classification_report(y_test, pred_free, digits=3))
print('ROC-AUC:', round(roc_auc_score(y_test, proba_free), 3))
cm_free = confusion_matrix(y_test, pred_free)
cm_free


Plot ROC curves and compare models.
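
ROC-AUC can be read as the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as one half. The sketch below checks that reading against `roc_auc_score` on toy scores; `pairwise_auc` is a helper written for this illustration, not a scikit-learn function.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]   # every positive against every negative
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

# Toy scores (hypothetical); note the tie at 0.4 across the two classes
y_rank = np.array([0, 0, 1, 1, 0, 1])
s_rank = np.array([0.1, 0.4, 0.4, 0.8, 0.2, 0.9])

auc_manual = pairwise_auc(y_rank, s_rank)
auc_sklearn = roc_auc_score(y_rank, s_rank)
print(auc_manual, auc_sklearn)
```

Ties matter for trees in particular: a tree emits one probability per leaf, so many test points share the same score and the ROC curve has only a few distinct points.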

In [None]:
disp5 = RocCurveDisplay.from_predictions(y_test, proba5, name='Tree depth=5')
dispF = RocCurveDisplay.from_predictions(y_test, proba_free, name='Tree no limit', ax=disp5.ax_)
plt.show()

## 3) Feature importance and shallow visualization
Interpretability: list the top features by impurity-based importance (`feature_importances_`). For plotting, we fit a separate shallow tree (max_depth=3) so the diagram stays readable.
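
When a plot is inconvenient (logs, code review, terminals), a shallow tree can also be rendered as text with `sklearn.tree.export_text`. This is a sketch under simplifying assumptions: a depth-2 tree fitted on the full dataset purely to show the output format, with `bc` and `small_tree` as illustrative names.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

bc = load_breast_cancer()
# Fitted on all rows only to demonstrate the rendering, not to evaluate the model
small_tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(bc.data, bc.target)

# One line per node: the split rule for internal nodes, the predicted class for leaves
rules = export_text(small_tree, feature_names=list(bc.feature_names))
print(rules)

# Impurity-based importances are normalized to sum to 1;
# features never used in a split get exactly 0
importances = small_tree.feature_importances_
used = [
    (bc.feature_names[i], round(float(importances[i]), 3))
    for i in importances.argsort()[::-1]
    if importances[i] > 0
]
print(used)
```

A depth-2 tree has at most three internal nodes, so at most three features can carry non-zero importance. That is a reminder that an importance of zero means "not used by this tree", not "uninformative".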

In [None]:
imp = pd.Series(tree5.feature_importances_, index=X.columns).sort_values(ascending=False)
imp.head(10)

In [None]:
tree3 = DecisionTreeClassifier(max_depth=3, random_state=42)
tree3.fit(X_train, y_train)
plt.figure(figsize=(16, 8))
plot_tree(tree3, feature_names=X.columns, class_names=data.target_names, filled=True, rounded=True)
plt.title('Decision Tree (max_depth=3)')
plt.show()

## 4) Hyperparameters: depth sweep and pruning
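First check for overfitting by plotting cross-validated accuracy against max_depth. Then try cost-complexity pruning: compute the candidate ccp_alpha values from the pruning path and pick the one with the best CV accuracy.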
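
To see what the pruning path represents, the sketch below refits one tree per candidate alpha and records its size. It assumes the same dataset and split parameters as this notebook; the `_p` and `_demo` names are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_tr_p, X_te_p, y_tr_p, y_te_p = train_test_split(
    X_bc, y_bc, test_size=0.2, random_state=42, stratify=y_bc
)

# The path lists the effective alphas at which a subtree gets pruned away,
# together with the total leaf impurity of the tree that remains
path_demo = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_tr_p, y_tr_p)
alphas_demo = np.clip(path_demo.ccp_alphas, 0.0, None)  # guard against tiny negative values

node_counts = []
for a in alphas_demo:
    t = DecisionTreeClassifier(random_state=42, ccp_alpha=a).fit(X_tr_p, y_tr_p)
    node_counts.append(t.tree_.node_count)

for a, imp, n in zip(alphas_demo, path_demo.impurities, node_counts):
    print(f'alpha={a:.5f}  leaf impurity={imp:.4f}  nodes={n}')
```

Alpha 0 corresponds to the full tree and the largest alpha prunes almost everything, so the useful candidates typically lie in between. That is why the search is scored with cross-validation rather than by training impurity.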

In [None]:
depths = list(range(1, 11))
cv_acc = []
for d in depths:
    clf = DecisionTreeClassifier(max_depth=d, random_state=42)
    scores = cross_val_score(clf, X_train, y_train, cv=5, scoring='accuracy')
    cv_acc.append(scores.mean())
plt.figure(figsize=(6,4))
plt.plot(depths, cv_acc, marker='o')
plt.xlabel('max_depth')
plt.ylabel('CV Accuracy')
plt.title('Depth vs CV Accuracy')
plt.xticks(depths)
plt.tight_layout(); plt.show()
list(zip(depths, [round(a,3) for a in cv_acc]))

In [None]:
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
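path = clf.cost_complexity_pruning_path(X_train, y_train)
# Clip at 0: floating-point error can occasionally yield tiny negative alphas, which ccp_alpha does not accept
ccp_alphas = np.clip(path.ccp_alphas, 0.0, None)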
accs = []
for a in ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=42, ccp_alpha=a)
    accs.append(cross_val_score(pruned, X_train, y_train, cv=5, scoring='accuracy').mean())
best_idx = int(np.argmax(accs))
best_alpha = float(ccp_alphas[best_idx])
plt.figure(figsize=(6,4))
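# A log axis cannot place alpha=0 (the unpruned tree), so plot only the positive alphas
pos = ccp_alphas > 0
plt.plot(ccp_alphas[pos], np.array(accs)[pos], marker='o')
plt.xscale('log')
plt.xlabel('ccp_alpha (log scale)')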
plt.ylabel('CV Accuracy')
plt.title('Pruning with CCP Alpha')
plt.tight_layout(); plt.show()
best_alpha, round(accs[best_idx],3)

In [None]:
pruned = DecisionTreeClassifier(random_state=42, ccp_alpha=best_alpha)
pruned.fit(X_train, y_train)
probaP = pruned.predict_proba(X_test)[:,1]
predP = (probaP >= 0.5).astype(int)
print('== Pruned tree ==')
print(classification_report(y_test, predP, digits=3))
print('ROC-AUC:', round(roc_auc_score(y_test, probaP), 3))
confusion_matrix(y_test, predP)

## 5) Class imbalance considerations
Use `class_weight='balanced'` or resampling when classes are skewed. Compare ROC-AUC/F1 with and without balancing.
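
For reference, `class_weight='balanced'` weights each class by `n_samples / (n_classes * count_of_that_class)`. The sketch below reproduces that formula on a toy label vector (the 90/10 split is a hypothetical example) and compares it with scikit-learn's `compute_class_weight`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels (hypothetical): 90 samples of class 0, 10 of class 1
y_skewed = np.array([0] * 90 + [1] * 10)
classes = np.unique(y_skewed)

weights_sklearn = compute_class_weight(class_weight='balanced', classes=classes, y=y_skewed)
weights_manual = len(y_skewed) / (len(classes) * np.bincount(y_skewed))

print(dict(zip(classes.tolist(), weights_sklearn.round(4))))  # class 0 -> 0.5556, class 1 -> 5.0
print(weights_manual.round(4))
```

In a tree, these weights scale each sample's contribution to the impurity calculation, so errors on the rare class cost more during splitting. On the breast cancer data the two weights are much closer together, so expect a modest effect.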

In [None]:
tree_bal = DecisionTreeClassifier(max_depth=5, class_weight='balanced', random_state=42)
tree_bal.fit(X_train, y_train)
proba_bal = tree_bal.predict_proba(X_test)[:,1]
pred_bal = (proba_bal >= 0.5).astype(int)
print('== Class weight balanced, depth=5 ==')
print(classification_report(y_test, pred_bal, digits=3))
print('ROC-AUC:', round(roc_auc_score(y_test, proba_bal), 3))


## Exercises
Complete the tasks below. Instructor solution cells are hidden; expand them only if needed.
1. Criterion: Train with `criterion='gini'` vs `criterion='entropy'` for a few depths; report differences in CV accuracy and test ROC-AUC.
2. Threshold tuning: Instead of 0.5, vary threshold from 0.2 to 0.8; plot Precision-Recall curve or trade-offs for the depth=5 model.
3. Pruning: Use a simple train/val split on X_train to select `ccp_alpha` (instead of CV). Does it differ from the CV-selected alpha?


In [None]:
# Exercise 1: Gini vs Entropy
# TODO: For depths in [3,5,7], compare CV accuracy and test ROC-AUC between gini and entropy.
...

In [None]:
# Solution 1 (hidden)
depths_try = [3,5,7]
res = []
for d in depths_try:
    for crit in ['gini','entropy']:
        clf = DecisionTreeClassifier(max_depth=d, criterion=crit, random_state=42)
        cv = cross_val_score(clf, X_train, y_train, cv=5, scoring='accuracy').mean()
        clf.fit(X_train, y_train)
        proba = clf.predict_proba(X_test)[:,1]
        auc = roc_auc_score(y_test, proba)
        res.append((d, crit, round(cv,3), round(auc,3)))
res

In [None]:
# Exercise 2: Threshold tuning
# TODO: Sweep thresholds, compute precision and recall, and plot precision vs recall for tree5.
...

In [None]:
# Solution 2 (hidden)
from sklearn.metrics import precision_score, recall_score
ths = np.linspace(0.2, 0.8, 13)  # 0.20, 0.25, ..., 0.80, matching the range stated in the exercise
prec, rec = [], []
for t in ths:
    p = (proba5 >= t).astype(int)
    prec.append(precision_score(y_test, p))
    rec.append(recall_score(y_test, p))
plt.figure(figsize=(5,4))
plt.plot(rec, prec, marker='o')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall tradeoff (threshold sweep)')
plt.tight_layout(); plt.show()
list(zip([round(x,2) for x in ths], [round(p,3) for p in prec], [round(r,3) for r in rec]))

In [None]:
# Exercise 3: Simple validation-based pruning
# TODO: Split X_train into (X_sub, X_val). For several ccp_alpha values, fit and evaluate on X_val; pick best alpha, refit on full X_train, and evaluate on X_test.
...

In [None]:
# Solution 3 (hidden)
Xs, Xv, ys, yv = train_test_split(X_train, y_train, test_size=0.25, random_state=42, stratify=y_train)
tmp = DecisionTreeClassifier(random_state=42).fit(Xs, ys)
alphas = tmp.cost_complexity_pruning_path(Xs, ys).ccp_alphas
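best_a, best_auc = 0.0, -np.inf
for a in np.clip(alphas, 0.0, None):  # guard against tiny negative alphas from floating-point error
    m = DecisionTreeClassifier(random_state=42, ccp_alpha=a).fit(Xs, ys)
    # Selection metric here is validation ROC-AUC; the CV search in section 4 used accuracy,
    # so part of any difference in the chosen alpha may come from the metric, not the split
    auc = roc_auc_score(yv, m.predict_proba(Xv)[:,1])
    if auc > best_auc:  # strict '>' keeps the smallest alpha among ties
        best_a, best_auc = a, auc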
final_m = DecisionTreeClassifier(random_state=42, ccp_alpha=best_a).fit(X_train, y_train)
final_auc = roc_auc_score(y_test, final_m.predict_proba(X_test)[:,1])
best_a, round(best_auc,3), round(final_auc,3)

## Wrap-up checklist
- [ ] Use stratified splits for classification problems
- [ ] Limit depth/leaves or use pruning to control overfitting
- [ ] Inspect feature importances and a shallow tree for interpretability
- [ ] Consider class weights or resampling for imbalance
- [ ] Report multiple metrics (accuracy, F1, ROC-AUC) and inspect confusion matrix
