# Exercise 8.1: Boosting on Pima Indian Diabetes Dataset

**Objective:** Compare boosting algorithms (AdaBoost vs GradientBoosting) on the Pima Indian Diabetes classification task.

## Experiment Setup
- **Dataset:** Pima Indian Diabetes (8 features, binary target)  
- **Test Size:** 30% holdout  
- **Metrics:** Accuracy, ROC AUC  
- **Models:**  
  1. AdaBoostClassifier   
  2. GradientBoostingClassifier (default log-loss objective, `loss='log_loss'`; named `'deviance'` in older scikit-learn releases)  
- **Stability:** 10 independent runs for each model

## 1️⃣ Imports

In [None]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve, confusion_matrix
from sklearn.inspection import permutation_importance


## 2️⃣ Data Loading & Inspection

Load Pima Indian Diabetes from a CSV URL, assign column names, and inspect.

In [None]:
# Headerless CSV from a GitHub mirror of the dataset
url = (
    'https://raw.githubusercontent.com/jbrownlee/Datasets/master/'
    'pima-indians-diabetes.csv'
)
cols = [
    'Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
    'Insulin', 'BMI', 'DiabetesPedigree', 'Age', 'Outcome'
]
df = pd.read_csv(url, header=None, names=cols)

print(df.head())
print("\nShape:", df.shape)
print("Class counts:", df['Outcome'].value_counts().to_dict())
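
When inspecting this dataset, it is worth knowing that zeros in columns such as `Glucose`, `BloodPressure`, `SkinThickness`, `Insulin`, and `BMI` are commonly interpreted as missing measurements rather than real values. The sketch below shows one way to count them. It uses a few hypothetical rows (not real records) so that it runs without a download; on the real data the same expression could be applied to `df`.

```python
import pandas as pd

# Hypothetical rows in the same column layout as the exercise (not real records).
sample = pd.DataFrame({
    'Glucose':       [148, 0, 183, 89],
    'BloodPressure': [72, 66, 0, 66],
    'SkinThickness': [35, 29, 0, 23],
    'Insulin':       [0, 0, 0, 94],
    'BMI':           [33.6, 26.6, 23.3, 0.0],
})

# Count zeros per column; for these measurements a zero is physiologically implausible.
zero_counts = (sample == 0).sum()
print(zero_counts.to_dict())
```

Whether to impute these zeros or leave them in place is a modelling choice; the rest of this exercise leaves them untouched.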

## 3️⃣ Train/Test Split & Normalization

- Split into 70% train / 30% test  
- Standardize features (zero mean, unit variance)

In [None]:
X = df.drop('Outcome', axis=1).values
y = df['Outcome'].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test  = scaler.transform(X_test)

print(f"Train size: {X_train.shape[0]}, Test size: {X_test.shape[0]}")
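
An alternative to scaling by hand is to wrap the scaler and the model in a pipeline, so that the scaler is always fitted on training data only. The following is a minimal sketch on synthetic data from `make_classification` (an assumption made so the block runs offline); the `*_demo` names are illustrative and not part of the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 8-feature diabetes data (assumption, for offline use).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

Xd_tr, Xd_te, yd_tr, yd_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo)

# The pipeline fits the scaler on the training fold and reuses it at predict time.
pipe = make_pipeline(StandardScaler(), GradientBoostingClassifier(random_state=0))
pipe.fit(Xd_tr, yd_tr)

demo_acc = pipe.score(Xd_te, yd_te)  # accuracy on the held-out 30%
print(f"Pipeline accuracy on synthetic data: {demo_acc:.3f}")
```

Tree-based boosters split on thresholds, so standardization is not expected to change their results much; the pipeline mainly keeps preprocessing tied to the training data.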

## 4️⃣ Model Definitions

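Initialize both classifiers with 100 estimators and a learning rate of 0.1. These values match the `GradientBoostingClassifier` defaults, but not the `AdaBoostClassifier` defaults (`n_estimators=50`, `learning_rate=1.0`).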

In [None]:
ada = AdaBoostClassifier(
    n_estimators=100,
    learning_rate=0.1,
    random_state=42
)

gbc = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=42
)

print(ada)
print(gbc)
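
To see which of the arguments above differ from the library defaults, one option is to compare against freshly constructed estimators with `get_params()`. This is a small sketch; the `*_defaults` names are illustrative.

```python
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

keys = ('n_estimators', 'learning_rate')

# Construct each estimator without arguments and read back its default parameters.
ada_defaults = {k: AdaBoostClassifier().get_params()[k] for k in keys}
gbc_defaults = {k: GradientBoostingClassifier().get_params()[k] for k in keys}

print("AdaBoost defaults: ", ada_defaults)
print("GradBoost defaults:", gbc_defaults)
```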

## 5️⃣ Single‐Run Training & Evaluation

Fit each model once and report accuracy, ROC AUC, and confusion matrix.

In [None]:
# Fit
ada.fit(X_train, y_train)
gbc.fit(X_train, y_train)

# Predict
y_ada = ada.predict(X_test)
y_gbc = gbc.predict(X_test)

# Metrics
acc_ada = accuracy_score(y_test, y_ada)
acc_gbc = accuracy_score(y_test, y_gbc)
auc_ada = roc_auc_score(y_test, ada.predict_proba(X_test)[:,1])
auc_gbc = roc_auc_score(y_test, gbc.predict_proba(X_test)[:,1])

print(f"AdaBoost    | Acc: {acc_ada:.3f}, AUC: {auc_ada:.3f}")
print(f"GradBoost   | Acc: {acc_gbc:.3f}, AUC: {auc_gbc:.3f}")
print("\nAdaBoost Confusion Matrix:\n", confusion_matrix(y_test, y_ada))
print("\nGradBoost Confusion Matrix:\n", confusion_matrix(y_test, y_gbc))
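
For a binary problem, scikit-learn lays out the confusion matrix as `[[TN, FP], [FN, TP]]`, so per-class rates can be read off directly. The sketch below uses small hand-made label vectors (illustrative only, not results from this exercise) to show the unpacking.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hand-made labels for illustration only.
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred_demo = np.array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0])

# ravel() flattens [[TN, FP], [FN, TP]] into four scalars.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

sensitivity = tp / (tp + fn)   # recall on the positive (diabetic) class
specificity = tn / (tn + fp)   # recall on the negative class
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"Sensitivity: {sensitivity:.3f}, Specificity: {specificity:.3f}")
```

On an imbalanced dataset such as this one, these two rates can be more informative than accuracy alone.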

## 6️⃣ Mean Performance Over 10 Runs

Repeat the stratified split, scaling, fitting, and evaluation for seeds 0 through 9, then report the mean ± standard deviation of each metric.


In [None]:
results = []
for seed in range(10):
    # split & scale
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y)
    sc = StandardScaler().fit(X_tr)
    X_tr, X_te = sc.transform(X_tr), sc.transform(X_te)
    
    # init
    m1 = AdaBoostClassifier(
        n_estimators=100, 
        learning_rate=0.1, 
        random_state=seed)
    m2 = GradientBoostingClassifier(
        n_estimators=100, 
        learning_rate=0.1,
        max_depth=3, 
        random_state=seed)
    
    # fit
    m1.fit(X_tr, y_tr)
    m2.fit(X_tr, y_tr)
    
    # eval
    a1 = accuracy_score(y_te, m1.predict(X_te))
    a2 = accuracy_score(y_te, m2.predict(X_te))
    auc1 = roc_auc_score(y_te, m1.predict_proba(X_te)[:,1])
    auc2 = roc_auc_score(y_te, m2.predict_proba(X_te)[:,1])
    results.append((a1, auc1, a2, auc2))

arr = np.array(results)
print("AdaBoost   Mean Acc: %.3f ± %.3f | Mean AUC: %.3f ± %.3f" % (
    arr[:,0].mean(), arr[:,0].std(), arr[:,1].mean(), arr[:,1].std()))
print("GradBoost  Mean Acc: %.3f ± %.3f | Mean AUC: %.3f ± %.3f" % (
    arr[:,2].mean(), arr[:,2].std(), arr[:,3].mean(), arr[:,3].std()))
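
Note that NumPy's `.std()` defaults to the population formula (`ddof=0`). With only 10 runs, the sample standard deviation (`ddof=1`) is slightly larger and may be the more appropriate spread estimate. The sketch below illustrates the difference on placeholder accuracies, which are made-up values and not results from this exercise.

```python
import numpy as np

# Placeholder accuracies for 10 runs (illustrative values only).
acc_runs = np.array([0.74, 0.77, 0.75, 0.73, 0.76, 0.78, 0.74, 0.75, 0.77, 0.76])

pop_std = acc_runs.std()            # ddof=0: divides by n
sample_std = acc_runs.std(ddof=1)   # ddof=1: divides by n - 1
std_err = sample_std / np.sqrt(len(acc_runs))  # standard error of the mean

print(f"mean={acc_runs.mean():.3f}  std(ddof=0)={pop_std:.4f}  "
      f"std(ddof=1)={sample_std:.4f}  sem={std_err:.4f}")
```

In the cell above, the same change would amount to passing `ddof=1` to each `.std()` call.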

## 7️⃣ ROC Curves

Plot ROC curves of both models on a single split (seed=42, the same split used for the single-run evaluation).

In [None]:
seed = 42
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=seed, stratify=y)
sc = StandardScaler().fit(X_tr)
X_tr, X_te = sc.transform(X_tr), sc.transform(X_te)

ada.fit(X_tr, y_tr)
gbc.fit(X_tr, y_tr)

plt.figure(figsize=(7,5))
for name, model in [('AdaBoost', ada), ('GradBoost', gbc)]:
    proba = model.predict_proba(X_te)[:,1]
    fpr, tpr, _ = roc_curve(y_te, proba)
    auc = roc_auc_score(y_te, proba)
    plt.plot(fpr, tpr, lw=2, label=f"{name} (AUC={auc:.3f})")
plt.plot([0,1],[0,1],'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves')
plt.legend()
plt.grid(alpha=0.3)
plt.show()
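
The AUC shown in the legend can also be read as a ranking statistic: the fraction of (positive, negative) pairs in which the positive example receives the higher score. The sketch below checks this interpretation against `roc_auc_score` on four hand-made scores (illustrative values only).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Tiny hand-made example: two negatives, two positives.
y_true_demo = np.array([0, 0, 1, 1])
scores_demo = np.array([0.1, 0.4, 0.35, 0.8])

pos = scores_demo[y_true_demo == 1]
neg = scores_demo[y_true_demo == 0]

# Compare every positive score with every negative score; ties count as half.
diff = pos[:, None] - neg[None, :]
auc_pairs = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

auc_sklearn = roc_auc_score(y_true_demo, scores_demo)
print(f"pairwise AUC = {auc_pairs:.2f}, roc_auc_score = {auc_sklearn:.2f}")
```

Because only the ordering of scores matters, any strictly increasing rescaling of the predicted probabilities leaves the AUC unchanged.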

## 8️⃣ Feature Importances (GBC)

Show impurity‐based importances from GradientBoosting.

In [None]:
# impurity-based
imp = gbc.feature_importances_
fi = pd.DataFrame({'feature': df.columns[:-1], 'impurity': imp})
fi = fi.sort_values('impurity', ascending=False)
print(fi)
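
Impurity-based importances are computed on the training data and can favour features with many distinct values. The `permutation_importance` function imported earlier offers a complementary view measured on held-out data. The following is a minimal sketch on synthetic data (an assumption so that it runs offline); the `*_demo` names and the choice of `scoring='roc_auc'` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in with 8 features, 3 of them informative (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0)
Xd_tr, Xd_te, yd_tr, yd_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo)

model_demo = GradientBoostingClassifier(random_state=0).fit(Xd_tr, yd_tr)

# Shuffle each feature 10 times on the test set and record the drop in ROC AUC.
perm = permutation_importance(
    model_demo, Xd_te, yd_te, scoring='roc_auc', n_repeats=10, random_state=0)

for i in perm.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {perm.importances_mean[i]:.4f} "
          f"± {perm.importances_std[i]:.4f}")
```

On the real data, the same call could be made with `gbc`, `X_te`, and `y_te` from the ROC section, using `df.columns[:-1]` for the feature names.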


## 9️⃣ Challenges
1. **Grid‐search** over `n_estimators` and `learning_rate` for both AdaBoost and GBC using `GridSearchCV`.  
2. **Try a different dataset** (e.g. Energy or Abalone) and repeat the boosting comparisons.  
3. **Loss variants:** for `GradientBoostingClassifier`, set `loss='exponential'` (the exponential loss used by AdaBoost) and compare performance with the default loss.  
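4. **Calibration:** use `CalibratedClassifierCV` to calibrate probabilities and check whether the Brier score or log loss improves. AUC depends only on the ranking of the scores, so it usually changes little.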
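
As a possible starting point for Challenge 1, the sketch below runs a deliberately tiny grid search for AdaBoost on synthetic data (an assumption so that it runs offline and quickly). The grid values, `cv=3`, and `scoring='roc_auc'` are illustrative choices, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Small illustrative grid: 2 x 2 = 4 candidate settings.
param_grid = {
    'n_estimators': [25, 50],
    'learning_rate': [0.1, 1.0],
}

search = GridSearchCV(
    AdaBoostClassifier(random_state=0),
    param_grid,
    scoring='roc_auc',
    cv=3,
)
search.fit(X_demo, y_demo)

print("Best params:", search.best_params_)
print(f"Best CV AUC: {search.best_score_:.3f}")
```

For the actual challenge, the search should be fitted on the training split only, with the test split kept aside for the final comparison.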