
# 🧪 Module 3: Hands-On Exercise A
## Heart Disease Classification (Logistic Regression • Trees • Ensembles • Metrics)

### Goal
- Compare **four classifiers** side-by-side  
- Practice evaluating with **multiple metrics** (not just accuracy)  
- Explore **thresholding**, **ROC curves**, and **model tuning**  
- Gain intuition for **trade-offs** between interpretability and performance  


In [None]:

# --- Imports ---
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, roc_curve, precision_recall_curve, confusion_matrix
)

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Dataset helper
from datasets_module3 import make_heart_disease_synth

SEED = 1955

# --- Step 1: Load Dataset ---
df = make_heart_disease_synth(n=600, seed=SEED)
df.head()



## 🔍 Step 1: Explore the Dataset
Use `df.head()`, `df.info()`, and `df.describe()` to understand the features and the target (`disease`).
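
The exploration calls above can be tried on a small stand-in frame first. This is a minimal sketch: the `toy` DataFrame and its `age` and `chest_pain` columns are hypothetical, chosen only to show what each call reports; only the `disease` target name comes from this exercise.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the course dataset; every column name except
# `disease` is an assumption made for illustration.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.integers(30, 80, size=12).astype(float),
    "chest_pain": rng.choice(["typical", "atypical", "none"], size=12),
    "disease": rng.integers(0, 2, size=12),
})
toy.loc[3, "age"] = np.nan  # inject one missing value

toy.info()             # dtypes and non-null counts per column
print(toy.describe())  # summary statistics for the numeric columns

missing = toy.isna().sum()                             # missing values per column
balance = toy["disease"].value_counts(normalize=True)  # class proportions
print(missing)
print(balance)
```

Checking the class proportions early is useful because a heavily imbalanced target makes accuracy alone a poor guide.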



## 🧼 Step 2: Clean & Prepare the Data
We will:
- Drop missing targets  
- Identify numeric and categorical columns  
- Build a preprocessing pipeline (impute + scale/encode)  


In [None]:

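# --- Step 2: Clean & Prepare ---
from sklearn.impute import SimpleImputer

# Drop rows with a missing target, then separate features from the target
df = df.dropna(subset=['disease'])
X = df.drop('disease', axis=1)
y = df['disease']

# Numeric columns are imputed and scaled; all other columns are imputed and one-hot encoded
num_cols = X.select_dtypes(include=[np.number]).columns.tolist()
cat_cols = X.select_dtypes(exclude=[np.number]).columns.tolist()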

num_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scale', StandardScaler())
])

cat_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

pre = ColumnTransformer([
    ('num', num_pipe, num_cols),
    ('cat', cat_pipe, cat_cols)
])


## 🔀 Step 3: Train/Test Split

In [None]:

# stratify=y keeps the disease / no-disease ratio the same in both splits
Xtr, Xte, ytr, yte = train_test_split(
    X, y, test_size=0.2, random_state=SEED, stratify=y
)
Xtr.shape, Xte.shape
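
Whether a split preserves the class balance depends on the `stratify` argument of `train_test_split`. The sketch below compares a plain and a stratified split on hypothetical labels with 30% positives; `X_toy` and `y_toy` are illustration-only names, not part of the exercise data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60 positives out of 200 rows (30%).
y_toy = np.array([1] * 60 + [0] * 140)
X_toy = np.arange(200).reshape(-1, 1)

# Plain split: the test-set positive rate can drift from 30%.
_, _, _, y_plain = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

# Stratified split: the test set keeps the 30% positive rate.
_, _, _, y_strat = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

print("Positive rate, full data      :", y_toy.mean())
print("Positive rate, plain test set :", y_plain.mean())
print("Positive rate, stratified test:", y_strat.mean())
```

With only a few hundred rows, a drift of a few percentage points in the test set can noticeably move precision and recall, which is why stratifying is often worth considering.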


## ⚙️ Step 4: Logistic Regression (Baseline)

In [None]:

log_reg = Pipeline([
    ('pre', pre),
    ('model', LogisticRegression(max_iter=500, random_state=SEED))
])

log_reg.fit(Xtr, ytr)
yhat_lr = log_reg.predict(Xte)
yprob_lr = log_reg.predict_proba(Xte)[:, 1]

# Metrics
print("Logistic Regression Metrics:")
print("Accuracy :", accuracy_score(yte, yhat_lr))
print("Precision:", precision_score(yte, yhat_lr))
print("Recall   :", recall_score(yte, yhat_lr))
print("F1 Score :", f1_score(yte, yhat_lr))
print("AUC      :", roc_auc_score(yte, yprob_lr))
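
`predict` applies a fixed 0.5 cut-off to the predicted probabilities, but the threshold is a choice. The sketch below shows how precision and recall move as the threshold changes. The labels and probabilities are hypothetical values made up for illustration, and `precision_recall_at` is a helper defined here, not a scikit-learn function.

```python
import numpy as np

# Hypothetical labels and predicted probabilities, for illustration only.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
y_prob = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.70, 0.30, 0.90])

def precision_recall_at(threshold):
    """Precision and recall when predicting 1 for probability >= threshold."""
    y_pred = (y_prob >= threshold).astype(int)
    tp = int(((y_pred == 1) & (y_true == 1)).sum())
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

for t in (0.3, 0.5, 0.7):
    p, r = precision_recall_at(t)
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")
```

Lowering the threshold catches every positive here at the cost of more false alarms; raising it does the opposite. In a screening setting, missing a true case is often the costlier error, which argues for a lower threshold. The same idea can be applied to `yprob_lr` by comparing it against a threshold of your choice.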


## 🌳 Step 5: Decision Tree Classifier

In [None]:

tree = Pipeline([
    ('pre', pre),
    ('model', DecisionTreeClassifier(max_depth=4, random_state=SEED))
])

tree.fit(Xtr, ytr)
yhat_tree = tree.predict(Xte)
yprob_tree = tree.predict_proba(Xte)[:, 1]

print("Decision Tree Metrics:")
print("Accuracy :", accuracy_score(yte, yhat_tree))
print("Precision:", precision_score(yte, yhat_tree))
print("Recall   :", recall_score(yte, yhat_tree))
print("F1 Score :", f1_score(yte, yhat_tree))
print("AUC      :", roc_auc_score(yte, yprob_tree))
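
One reason to keep a shallow tree around is that its decision rules can be read directly. The sketch below prints the rules of a depth-2 tree using `sklearn.tree.export_text`. The data comes from `make_classification` and the feature names `f0` to `f3` are hypothetical stand-ins, not the heart disease features.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data; the feature names are placeholders.
X_toy, y_toy = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0,
    random_state=0,
)
names = ["f0", "f1", "f2", "f3"]

toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_toy, y_toy)

# Each indented line is one split; leaves end in the predicted class.
rules = export_text(toy_tree, feature_names=names)
print(rules)
```

For the pipeline in this step, the fitted tree is reachable as `tree.named_steps['model']`, but the rules then refer to the scaled and one-hot-encoded columns rather than the raw ones.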


## 🌲 Step 6: Random Forest & Gradient Boosting

In [None]:

# Random Forest
rf = Pipeline([
    ('pre', pre),
    ('model', RandomForestClassifier(n_estimators=200, random_state=SEED))
])

# Gradient Boosting
gb = Pipeline([
    ('pre', pre),
    ('model', GradientBoostingClassifier(
        learning_rate=0.05, 
        n_estimators=200, 
        max_depth=3,
        random_state=SEED))
])

rf.fit(Xtr, ytr)
gb.fit(Xtr, ytr)

yhat_rf = rf.predict(Xte)
yprob_rf = rf.predict_proba(Xte)[:, 1]

yhat_gb = gb.predict(Xte)
yprob_gb = gb.predict_proba(Xte)[:, 1]

print("Random Forest AUC:", roc_auc_score(yte, yprob_rf))
print("Gradient Boosting AUC:", roc_auc_score(yte, yprob_gb))
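
Ensembles are harder to read than a single tree, but they still expose impurity-based feature importances through the `feature_importances_` attribute. The sketch below ranks features for a small forest. The data is synthetic, and with `shuffle=False` the informative features are the first two columns, an assumption of this toy setup only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: columns 0 and 1 are informative, the rest are noise.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_toy, y_toy)

# Importances are normalized to sum to 1; larger means more split gain.
importances = forest.feature_importances_
order = np.argsort(importances)[::-1]
for i in order:
    print(f"feature {i}: {importances[i]:.3f}")
```

These scores indicate which inputs the forest relied on, not the direction or size of an effect, so they are a weaker form of interpretability than logistic regression coefficients.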


## 📊 Step 7: Compare All Models (Metrics Table)

In [None]:

def evaluate(name, pred, prob):
    """Return one row of test-set metrics from hard predictions and positive-class probabilities."""
    return {
        "Model": name,
        "Accuracy": accuracy_score(yte, pred),
        "Precision": precision_score(yte, pred),
        "Recall": recall_score(yte, pred),
        "F1": f1_score(yte, pred),
        "AUC": roc_auc_score(yte, prob)
    }

results = pd.DataFrame([
    evaluate("Logistic Regression", yhat_lr, yprob_lr),
    evaluate("Decision Tree", yhat_tree, yprob_tree),
    evaluate("Random Forest", yhat_rf, yprob_rf),
    evaluate("Gradient Boosting", yhat_gb, yprob_gb)
])

results.sort_values("AUC", ascending=False)
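
Every threshold-based metric in the table is derived from the four cells of the confusion matrix. The sketch below makes that link explicit on hypothetical labels and predictions; the arrays are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and hard predictions.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_true, y_pred).ravel())

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the predicted positives, how many are real
recall = tp / (tp + fn)     # of the real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

The same call on `yte` and any of the `yhat_*` arrays shows which kind of error each model makes, something a single accuracy figure hides.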


## 📈 Step 8: ROC Curves (All Models)

In [None]:

plt.figure(figsize=(7,6))

for name, prob in [
    ("Logistic", yprob_lr),
    ("Tree", yprob_tree),
    ("RF", yprob_rf),
    ("GB", yprob_gb)
]:
    fpr, tpr, _ = roc_curve(yte, prob)
    plt.plot(fpr, tpr, label=name)

plt.plot([0,1],[0,1],'k--')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curves: All Models")
plt.legend()
plt.show()
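
The area under a ROC curve has a direct reading: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below checks this on hypothetical scores by counting pairs and comparing against `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted probabilities, for illustration only.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
y_prob = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.70, 0.30, 0.90])

pos = y_prob[y_true == 1]
neg = y_prob[y_true == 0]

# Compare every positive score with every negative score.
# A tie counts as half a win.
wins = (pos[:, None] > neg[None, :]).sum()
ties = (pos[:, None] == neg[None, :]).sum()
pairwise_auc = (wins + 0.5 * ties) / (len(pos) * len(neg))

print("Pairwise ranking AUC:", pairwise_auc)
print("roc_auc_score       :", roc_auc_score(y_true, y_prob))
```

Because AUC depends only on how cases are ranked, it says nothing about whether the probabilities are well calibrated or which threshold to pick.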


## 🛠️ Step 9: Model Tuning (Grid Search)

In [None]:

params = {"model__max_depth": [2,3,4,5,6,8]}

gs_tree = GridSearchCV(
    Pipeline([('pre', pre), ('model', DecisionTreeClassifier(random_state=SEED))]),
    param_grid=params,
    cv=5,
    scoring="accuracy",
    n_jobs=-1
)

gs_tree.fit(Xtr, ytr)

print("Best Tree Depth:", gs_tree.best_params_['model__max_depth'])
print("Best CV Acc    :", gs_tree.best_score_)
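
The cross-validated score comes from the training folds only, so it is worth confirming the tuned model on the held-out test set as well. The sketch below does this on synthetic data and tunes on `roc_auc` rather than accuracy, as one possible variation. All names here (`X_toy`, `search`, `best_tree`) are hypothetical and independent of the notebook variables.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data for the sketch.
X_toy, y_toy = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 3, 4, 5, 6, 8]},
    cv=5,
    scoring="roc_auc",  # tune on ranking quality instead of accuracy
)
search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is retrained on all of X_tr.
best_tree = search.best_estimator_
test_auc = roc_auc_score(y_te, best_tree.predict_proba(X_te)[:, 1])

print("Best depth :", search.best_params_["max_depth"])
print("CV AUC     :", search.best_score_)
print("Test AUC   :", test_auc)
```

In the notebook, the equivalent check would use `gs_tree.best_estimator_` on `Xte` and `yte`, so the tuned tree can be compared with the depth-4 tree from Step 5 on the same footing.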



## 🧠 Step 10: Reflection Questions

- Which model performed best overall? Why?  
- Which metric (Accuracy, Precision, Recall, F1, AUC) changed your opinion the most?  
- When might you prefer Logistic Regression over Random Forest?  
- Would you deploy Gradient Boosting if interpretability mattered?  
- How did tuning the Decision Tree affect performance?  

---
