# AAE 722 — Lab 8 (Supervised Learning)

**Author:** Wenshi (Gary) Sun  
**Note:** Each question cell is clearly labeled. I re-ran all code locally so outputs are visible.

---
### Environment & Notes
- Using `ISLP` datasets (`Hitters`, `OJ`, `Auto`).
- Random seeds fixed at `random_state=42`.
- Metrics rounded per question instructions.


In [None]:
import numpy as np
import pandas as pd
from matplotlib.pyplot import subplots
import sklearn.model_selection as skm
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier as DTC, DecisionTreeRegressor as DTR
from sklearn.ensemble import RandomForestRegressor as RF, GradientBoostingClassifier as GBC
from ISLP import load_data
from ISLP.models import ModelSpec as MS

def mse(y_true, y_pred):
    return np.mean((np.array(y_true) - np.array(y_pred))**2)
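
As a quick sanity check on the `mse` helper, the sketch below compares it against `sklearn.metrics.mean_squared_error` on three made-up values. The numbers are illustrative assumptions, not lab data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def mse(y_true, y_pred):
    # Same helper as in the setup cell: mean of squared residuals.
    return np.mean((np.array(y_true) - np.array(y_pred))**2)

# Made-up values, chosen only so the arithmetic is easy to follow by hand.
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]

manual = mse(y_true, y_pred)                      # (1 + 0 + 4) / 3
reference = mean_squared_error(y_true, y_pred)    # library implementation
print(manual, reference)
```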


## Question 1 — Bagging for Salary on Hitters
**Task.** Remove rows with missing *Salary* values, split 70/30 (random_state=42), fit a bagging model (RF with `max_features = #predictors`) using 300 trees to predict `Salary`. Report test MSE (2 decimals).

In [None]:
# Q1
Hitters = load_data('Hitters')
Hitters = Hitters.dropna(subset=['Salary'])
spec_q1 = MS(Hitters.columns.drop('Salary'), intercept=False)
D_q1 = spec_q1.fit_transform(Hitters)
X_q1 = np.asarray(D_q1)
y_q1 = Hitters['Salary'].to_numpy()
(Xtr_q1, Xte_q1, ytr_q1, yte_q1) = skm.train_test_split(
    X_q1, y_q1, test_size=0.30, random_state=42
)
bag_model = RF(
    n_estimators=300,
    max_features=Xtr_q1.shape[1],
    random_state=42
)
bag_model.fit(Xtr_q1, ytr_q1)
yhat_q1 = bag_model.predict(Xte_q1)
mse_q1 = mse(yte_q1, yhat_q1)
print('Q1 — Test MSE:', round(mse_q1, 2))
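
Q1 treats bagging as a random forest that considers every predictor at each split. A minimal sketch of that idea on synthetic data (the `make_regression` data and the smaller tree count are assumptions for illustration): setting `max_features` to the number of columns should behave the same as `max_features=None`, which scikit-learn documents as "use all features".

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the Hitters design matrix (not the lab data).
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# Bagging written two ways: an explicit feature count vs. None (all features).
rf_all_int = RandomForestRegressor(
    n_estimators=50, max_features=X.shape[1], random_state=0
).fit(X, y)
rf_all_none = RandomForestRegressor(
    n_estimators=50, max_features=None, random_state=0
).fit(X, y)

# With the same seed, both forests are expected to grow identical trees.
same_predictions = np.allclose(rf_all_int.predict(X), rf_all_none.predict(X))
print('Identical predictions:', same_predictions)
```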

## Question 2 — Random Forest with `max_features=5` (same split as Q1)
**Task.** Using the same train/test split from Q1, fit RF with `max_features=5`, 300 trees. Show a DataFrame of feature importances (descending). Identify the top variable.

In [None]:
# Q2
rf5 = RF(n_estimators=300, max_features=5, random_state=42)
rf5.fit(Xtr_q1, ytr_q1)
imp_q2 = (pd.DataFrame({'feature': D_q1.columns,
                        'importance': rf5.feature_importances_})
          .sort_values('importance', ascending=False)
          .reset_index(drop=True))
display(imp_q2.head(10))
top_var_q2 = imp_q2.loc[0, 'feature']
print('Q2 — Most important variable:', top_var_q2)
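
The `feature_importances_` used in Q2 are impurity-based and computed on the training data. As a hedged cross-check, the sketch below compares them with `sklearn.inspection.permutation_importance` on held-out data. The synthetic data, in which column 0 dominates by construction, is an assumption for illustration and not part of the lab.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: column 0 drives the response, columns 2 and 3 are noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 5.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
rf = RandomForestRegressor(n_estimators=100, max_features=2, random_state=0)
rf.fit(X_tr, y_tr)

# Permutation importance: drop in test-set score when one column is shuffled.
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)

top_impurity = int(np.argmax(rf.feature_importances_))
top_permutation = int(np.argmax(perm.importances_mean))
print('Top by impurity:', top_impurity, '| top by permutation:', top_permutation)
```

If the two rankings disagree on real data, that is often a sign of correlated or high-cardinality predictors and is worth a closer look.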

## Question 3 — Decision Tree Classifier on OJ
**Task.** Predict `Purchase` using all other variables. Split 75/25 (random_state=42). Fit `DTC(criterion="entropy", max_depth=4, min_samples_leaf=5, random_state=42)`. Report test accuracy (3 decimals).

In [None]:
# Q3
OJ = load_data('OJ')
spec_q3 = MS(OJ.columns.drop('Purchase'), intercept=False)
D_q3 = spec_q3.fit_transform(OJ)
X_q3 = np.asarray(D_q3)
y_q3 = OJ['Purchase'].to_numpy()
(Xtr_q3, Xte_q3, ytr_q3, yte_q3) = skm.train_test_split(
    X_q3, y_q3, test_size=0.25, random_state=42
)
dtc = DTC(criterion='entropy', max_depth=4, min_samples_leaf=5, random_state=42)
dtc.fit(Xtr_q3, ytr_q3)
acc_q3 = accuracy_score(yte_q3, dtc.predict(Xte_q3))
print('Q3 — Test accuracy:', round(acc_q3, 3))
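
Q3 splits on `criterion='entropy'`. The sketch below works through the quantity behind that criterion, Shannon entropy (base 2 here) and the information gain of one candidate split, on a toy node. The helper names `entropy` and `information_gain` and the four-label node are hypothetical, written only for this illustration.

```python
import numpy as np

def entropy(labels):
    # Shannon entropy (in bits) of the class proportions in a node.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(parent, left, right):
    # Parent entropy minus the size-weighted entropy of the two children.
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Toy node with two purchases of each brand, split perfectly into pure children.
parent = ['CH', 'CH', 'MM', 'MM']
left, right = ['CH', 'CH'], ['MM', 'MM']

parent_entropy = entropy(parent)
gain = information_gain(parent, left, right)
print('Parent entropy:', parent_entropy, '| gain:', gain)
```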

## Question 4 — Gradient Boosting on Auto (binary mpg_high)
**Task.** Remove missing rows. Create `mpg_high = 1(mpg > median)`. Split 70/30 (random_state=42). Use all variables except `mpg` and `name`. Fit `GBC(n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42)`. Report test accuracy (3 decimals).

In [None]:
# Q4
Auto = load_data('Auto').dropna()
Auto = Auto.copy()
Auto['mpg_high'] = (Auto['mpg'] > Auto['mpg'].median()).astype(int)
predictors_q4 = [c for c in Auto.columns if c not in ['mpg', 'mpg_high', 'name']]
spec_q4 = MS(predictors_q4, intercept=False)
D_q4 = spec_q4.fit_transform(Auto)
X_q4 = np.asarray(D_q4)
y_q4 = Auto['mpg_high'].to_numpy()
(Xtr_q4, Xte_q4, ytr_q4, yte_q4) = skm.train_test_split(
    X_q4, y_q4, test_size=0.30, random_state=42
)
gbc = GBC(n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42)
gbc.fit(Xtr_q4, ytr_q4)
acc_q4 = accuracy_score(yte_q4, gbc.predict(Xte_q4))
print('Q4 — Test accuracy:', round(acc_q4, 3))
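
To see how test accuracy evolves as trees are added in a boosted model like the one in Q4, `GradientBoostingClassifier.staged_predict` yields predictions after each boosting stage. This is a minimal sketch on synthetic data; the `make_classification` data and the smaller `n_estimators` are assumptions chosen to keep it fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem standing in for the mpg_high task.
X, y = make_classification(n_samples=400, n_features=6, n_informative=4,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

gb = GradientBoostingClassifier(n_estimators=50, learning_rate=0.1,
                                max_depth=3, random_state=0)
gb.fit(X_tr, y_tr)

# One accuracy value per boosting stage (after 1 tree, 2 trees, ...).
staged_acc = [accuracy_score(y_te, pred) for pred in gb.staged_predict(X_te)]
final_acc = accuracy_score(y_te, gb.predict(X_te))
best_stage = int(np.argmax(staged_acc)) + 1
print('Best stage:', best_stage, '| final accuracy:', round(final_acc, 3))
```

A curve that peaks well before the final stage would suggest that fewer trees, or a smaller learning rate, might generalize as well.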

## Question 5 — Regression Tree with Cost-Complexity Pruning (Hitters)
**Task.** On `Hitters` (same NA rule), split 70/30 (random_state=42). Fit `DTR(max_depth=6, random_state=42)`. Use `cost_complexity_pruning_path` to build a grid over `ccp_alpha`, then run `GridSearchCV` with `cv=KFold(n_splits=5, shuffle=True, random_state=42)` and `scoring='neg_mean_squared_error'`. Report: (a) number of leaf nodes of the best estimator; (b) test MSE of the pruned tree (2 decimals).

In [None]:
# Q5
Hitters2 = load_data('Hitters')
Hitters2 = Hitters2.dropna(subset=['Salary'])
spec_q5 = MS(Hitters2.columns.drop('Salary'), intercept=False)
D_q5 = spec_q5.fit_transform(Hitters2)
X_q5 = np.asarray(D_q5)
y_q5 = Hitters2['Salary'].to_numpy()
(Xtr_q5, Xte_q5, ytr_q5, yte_q5) = skm.train_test_split(
    X_q5, y_q5, test_size=0.30, random_state=42
)
tree0 = DTR(max_depth=6, random_state=42)
tree0.fit(Xtr_q5, ytr_q5)
path = tree0.cost_complexity_pruning_path(Xtr_q5, ytr_q5)
alphas = np.unique(path.ccp_alphas)
cv = skm.KFold(n_splits=5, shuffle=True, random_state=42)
gs = skm.GridSearchCV(
    estimator=DTR(max_depth=6, random_state=42),  # same depth cap as tree0, so the alpha grid matches the tree being pruned
    param_grid={'ccp_alpha': alphas},
    scoring='neg_mean_squared_error',
    cv=cv,
    refit=True
)
gs.fit(Xtr_q5, ytr_q5)
best_tree = gs.best_estimator_
n_leaves = best_tree.tree_.n_leaves
yhat_q5 = best_tree.predict(Xte_q5)
mse_q5 = mse(yte_q5, yhat_q5)
print('Q5 — Best ccp_alpha:', gs.best_params_['ccp_alpha'])
print('Q5 — #Leaf nodes:', n_leaves)
print('Q5 — Test MSE (pruned):', round(mse_q5, 2))
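
The pruning step in Q5 relies on larger `ccp_alpha` values producing smaller subtrees of the same initial tree. The sketch below illustrates that on synthetic data (the `make_regression` data and `max_depth=4` are assumptions for illustration): it refits at each alpha on the path and records the leaf count.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (not the Hitters data).
X, y = make_regression(n_samples=150, n_features=5, noise=15.0, random_state=0)

base = DecisionTreeRegressor(max_depth=4, random_state=0)
path = base.cost_complexity_pruning_path(X, y)

# Clip at zero as a precaution against tiny negative values from rounding.
alphas = np.unique(np.clip(path.ccp_alphas, 0.0, None))

# Refit the same tree specification at each alpha and count its leaves.
leaves = [
    DecisionTreeRegressor(max_depth=4, random_state=0, ccp_alpha=a)
    .fit(X, y)
    .get_n_leaves()
    for a in alphas
]
print(list(zip(np.round(alphas, 2), leaves)))
```

Cross-validation then picks the alpha, and hence the subtree size, with the lowest estimated prediction error.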

---
### Submission Checklist
- [ ] All cells executed and outputs visible.
- [ ] Notebook saved.
- [ ] Pushed to GitHub (AAE722 repo) in a clear path (e.g., `Labs/Lab8/AAE722_Lab8_WenshiSun.ipynb`).
- [ ] Direct link shared on Canvas (opens the notebook file directly).
