# Case Study 2: Gradient Boosting Classifier (Breast Cancer Dataset)

This notebook applies **Gradient Boosting (scikit-learn's `GradientBoostingClassifier`)** to the Breast Cancer Wisconsin dataset (569 samples, 30 numeric features, binary target). It covers training, evaluation on a held-out test set, feature importances, and a small hyperparameter search.

In [None]:
# Imports
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()  # sns.set() is an older alias for the same call

In [None]:
# Load data
data = load_breast_cancer()
X = data.data
y = data.target
feature_names = data.feature_names
pd.DataFrame(X, columns=feature_names).head()

In [None]:
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)
print(X_train.shape, X_test.shape)

In [None]:
# Fit Gradient Boosting
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
gb.fit(X_train, y_train)

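# Evaluate on the held-out test set
y_pred = gb.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))
# target_names maps class 0/1 to 'malignant'/'benign' in the report
print('\nClassification report:\n',
      classification_report(y_test, y_pred, target_names=data.target_names))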

In [None]:
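# Confusion matrix (rows: true class, columns: predicted class)
cm = confusion_matrix(y_test, y_pred)
class_names = list(data.target_names)  # ['malignant', 'benign']
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=class_names, yticklabels=class_names)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix - Gradient Boosting')
plt.show()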

In [None]:
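# Feature importances (impurity-based, computed from the training data)
importances = gb.feature_importances_
# Sort ascending so the most important feature ends up at the top of the bar chart
feat_imp = pd.Series(importances, index=feature_names).sort_values(ascending=True)
plt.figure(figsize=(8, 10))
feat_imp.plot(kind='barh')
plt.xlabel('Importance')
plt.title('Feature Importances - Gradient Boosting')
plt.tight_layout()
plt.show()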
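
Impurity-based importances are derived from the training data only, so it can help to cross-check them on held-out data. The sketch below uses `sklearn.inspection.permutation_importance` as one possible cross-check; it is an addition to the notebook rather than part of its original workflow, and `n_repeats=10` is an illustrative choice, not a tuned value.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Same data and split as the rest of the notebook
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, random_state=42, stratify=data.target
)

gb = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42
).fit(X_train, y_train)

# Shuffle one feature at a time on the test set and record the drop in accuracy.
# n_repeats=10 is an assumption made for this sketch.
result = permutation_importance(
    gb, X_test, y_test, n_repeats=10, scoring='accuracy', random_state=42
)
perm_imp = pd.Series(result.importances_mean, index=data.feature_names)
print(perm_imp.sort_values(ascending=False).head(10))
```

Features that rank high in both views are the safer ones to interpret; correlated features can share importance and may score low under permutation even when they carry signal.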

In [None]:
# Hyperparameter tuning (small grid: 2 x 2 x 2 = 8 combinations)
param_grid = {
    'n_estimators': [50, 100],
    'learning_rate': [0.01, 0.1],
    'max_depth': [2, 3]
}
# 3-fold cross-validation runs on the training split only; the test set stays untouched
gs = GridSearchCV(GradientBoostingClassifier(random_state=42), param_grid, cv=3, scoring='accuracy')
gs.fit(X_train, y_train)
print('Best params:', gs.best_params_)
print('Best CV score:', gs.best_score_)

# best_estimator_ is refit on the full training split with the best parameters
best = gs.best_estimator_
y_pred_best = best.predict(X_test)
print('Test accuracy (best):', accuracy_score(y_test, y_pred_best))
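
`best_params_` reports only the winning combination. To see how close the other candidates were, the fitted search also exposes `cv_results_`, which can be loaded into a DataFrame. The following is a minimal, self-contained sketch that repeats the grid search above; the variable name `cv_table` is introduced here for illustration. Note that `mean_test_score` refers to the cross-validation folds, not to `X_test`.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

param_grid = {
    'n_estimators': [50, 100],
    'learning_rate': [0.01, 0.1],
    'max_depth': [2, 3],
}
gs = GridSearchCV(GradientBoostingClassifier(random_state=42),
                  param_grid, cv=3, scoring='accuracy')
gs.fit(X_train, y_train)

# One row per parameter combination, best-ranked first
cv_table = (
    pd.DataFrame(gs.cv_results_)
    [['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
    .sort_values('rank_test_score')
    .reset_index(drop=True)
)
print(cv_table.to_string())
```

If the top few rows differ by less than their `std_test_score`, the grid is not separating the candidates clearly, and the simpler model (fewer or shallower trees) is a reasonable pick.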

## Takeaways

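- Gradient Boosting builds trees sequentially: each new tree is fit to the errors of the ensemble built so far.
- It's powerful for tabular data; regularize with `learning_rate`, `n_estimators`, and `max_depth`. A lower `learning_rate` usually needs more trees to reach the same fit.
- Consider early stopping for larger datasets (in scikit-learn, via `n_iter_no_change` and `validation_fraction`), so the number of trees is chosen from a validation score instead of being fixed up front.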
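
The early-stopping point above can be sketched with `GradientBoostingClassifier`'s built-in `n_iter_no_change` option. The values below (`n_estimators=500`, `validation_fraction=0.1`, `n_iter_no_change=10`) are illustrative assumptions, not tuned settings; `staged_predict` is used to show how test accuracy evolves as trees are added one after another.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

gb_es = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on the number of trees (assumed value)
    learning_rate=0.1,
    max_depth=3,
    validation_fraction=0.1,  # share of the training data held out internally
    n_iter_no_change=10,      # stop once the validation score stalls for 10 rounds
    tol=1e-4,                 # minimum improvement that counts as progress
    random_state=42,
)
gb_es.fit(X_train, y_train)
print('Trees actually fitted:', gb_es.n_estimators_)

# staged_predict yields the ensemble's predictions after each successive tree
staged_acc = [accuracy_score(y_test, p) for p in gb_es.staged_predict(X_test)]
print('Test accuracy after 1 tree:', staged_acc[0])
print('Test accuracy after all fitted trees:', staged_acc[-1])
```

On a dataset this small the internal validation split is only a few dozen samples, so the stopping point can be noisy; the option tends to pay off more when training is expensive.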