# Gradient-Boosting Classifiers

**Dataset:** UCI / Kaggle Bank Marketing (`bank-additional-full.csv`). ~41,000 rows, mixed types.

This notebook covers:

- Section 1: dataset, EDA, preprocessing (categorical, numerical, boolean)
- Section 2: sklearn GradientBoostingClassifier (train, tune, feature importance)
- Section 3: XGBoost (train, tune, feature importance + Optuna Bayesian tuning example)
- Section 4: LightGBM (train, tune, feature importance)
- Section 5: CatBoost (train, tune, feature importance)

**Notes:**
- Download `bank-additional-full.csv` from UCI (https://archive.ics.uci.edu/ml/datasets/Bank+Marketing) and place it next to this notebook.
- Hyperparameter tuning can take a while, depending on your machine.


**Installation Note:**

If you have not yet installed `xgboost`, `catboost`, and `lightgbm`, install them into your virtual environment with the command below. The optional Optuna example in Section 3 additionally needs `optuna` (`pip install optuna`).

```shell
pip install xgboost catboost lightgbm
```

## Section 1 — Load data, EDA and preprocessing

We use the Bank Marketing dataset (campaign results of a Portuguese bank). Target: `y` (yes/no) — converted to binary. The dataset contains numeric, categorical and binary features.


In [None]:
# Load libraries and dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

DATA_PATH = "./bank-additional-full.csv"  # place the file next to this notebook
df = pd.read_csv(DATA_PATH, sep=';')
print("Shape:", df.shape)
df.head()

In [None]:
# Quick EDA
print(df.dtypes.value_counts())
print("\nTarget distribution:")
print(df['y'].value_counts(normalize=True))
print("\nSample missing values per column:")
print(df.isna().sum().sort_values(ascending=False).head())

# Identify types
cat_cols = df.select_dtypes(include=['object']).columns.tolist()
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
# In this dataset, categorical and boolean-like (yes/no) columns are all stored as object dtype;
# at this point cat_cols still includes the target column 'y'.
print(f"\nCategorical cols ({len(cat_cols)}): {cat_cols}")
print(f"Numeric cols ({len(num_cols)}): {num_cols}")

# Convert target to 0/1
df['target'] = (df['y'] == 'yes').astype(int)
df.drop(columns=['y'], inplace=True)

# Basic numeric histograms (DataFrame.hist creates its own figure, so no plt.figure() call is needed)
df[num_cols].hist(bins=30, figsize=(12,6))
plt.tight_layout()
plt.show()

### Preprocessing plan

- **Categorical**: one-hot encoding for sklearn GB; ordinal encoding for XGBoost and LightGBM; CatBoost receives the raw columns and encodes them natively.
- **Numeric**: median imputation (though this dataset has no missing numeric values) and optional scaling for sklearn GB. Tree-based models are insensitive to monotone rescaling of features, so the scaling step does not affect the splits.
- **Train/test split**: stratified split on `target`.
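
To make the difference between the two categorical encodings in the plan concrete, here is a small sketch on a made-up `job` column (the toy values are illustrative, not drawn from the dataset). It shows how each encoder treats a category that was not seen during fitting, using the same `handle_unknown` settings as the pipelines in this notebook.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy categorical column (illustrative values only)
train = np.array([["admin."], ["technician"], ["admin."], ["services"]], dtype=object)
test = np.array([["services"], ["student"]], dtype=object)  # "student" is unseen

# One-hot: one indicator column per category; an unseen category becomes all zeros
ohe = OneHotEncoder(handle_unknown="ignore").fit(train)
onehot_test = ohe.transform(test).toarray()

# Ordinal: one integer per category; an unseen category maps to unknown_value
ordenc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1).fit(train)
ordinal_test = ordenc.transform(test)

print(ohe.categories_[0])  # learned categories, in sorted order
print(onehot_test)
print(ordinal_test)
```

One-hot encoding widens the matrix by one column per category, while ordinal encoding keeps a single column but imposes an arbitrary order, which tree models can usually work around with repeated splits.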


In [None]:
# Train/test split and create preprocessing pipelines
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

TARGET = 'target'
X = df.drop(columns=[TARGET])
y = df[TARGET]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print("Train:", X_train.shape, "Test:", X_test.shape)

# recompute types on train
cat_cols = X_train.select_dtypes(include=['object']).columns.tolist()
num_cols = X_train.select_dtypes(include=[np.number]).columns.tolist()

# preprocessing for sklearn GB (one-hot + scaling numeric)
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])
preprocessor_sklearn = ColumnTransformer(transformers=[
    ('num', numeric_transformer, num_cols),
    ('cat', categorical_transformer, cat_cols)
])

# preprocessing for other models (ordinal encode categoricals, scale not required)
ordinal_transformer = ColumnTransformer(transformers=[
    ('num', SimpleImputer(strategy='median'), num_cols),
    ('cat', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1), cat_cols)
])

# For CatBoost we'll pass raw data and list of categorical column names (it handles encoding)
print("Numeric cols:", len(num_cols), "Categorical cols:", len(cat_cols))

print("Numerical columns: ", num_cols)
print("Categorical columns: ", cat_cols)

## Section 2 — sklearn `GradientBoostingClassifier`

Gradient boosting builds an additive model of weak learners (trees). At iteration $m$ the model is:

$$F_m(x) = F_{m-1}(x) + \alpha h_m(x)$$

where $\alpha$ is the learning rate and $h_m$ is the new tree, fit to the negative gradient of the loss with respect to the current predictions $F_{m-1}(x)$.
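
The update rule can be sketched in a few lines of NumPy. This is a minimal illustration, not sklearn's implementation: it assumes squared-error loss (so the negative gradient is simply the residual), a single feature, and depth-1 stumps as weak learners. The helper `fit_stump` and the synthetic data are invented for the example.

```python
import numpy as np

def fit_stump(x, residual):
    """Best single-threshold regression stump on a 1-D feature (least squares)."""
    best = None
    for thr in np.unique(x)[:-1]:  # every threshold leaves both sides non-empty
        left, right = residual[x <= thr], residual[x > thr]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, thr, left.mean(), right.mean())
    _, thr, left_val, right_val = best
    return lambda z: np.where(z <= thr, left_val, right_val)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 6, size=200))
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)

alpha = 0.1                           # learning rate
F = np.full_like(y, y.mean())         # F_0: constant prediction
mse_history = [np.mean((y - F) ** 2)]

for m in range(100):
    residual = y - F                  # negative gradient of 1/2 (y - F)^2 w.r.t. F
    h_m = fit_stump(x, residual)      # weak learner fit to the negative gradient
    F = F + alpha * h_m(x)            # F_m = F_{m-1} + alpha * h_m
    mse_history.append(np.mean((y - F) ** 2))

print(f"MSE before boosting: {mse_history[0]:.4f}, after 100 rounds: {mse_history[-1]:.4f}")
```

For classification the same loop applies with log loss, where the negative gradient becomes `y - p` for predicted probability `p`.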

**Documentation:** 

- [GradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html)
- [GradientBoostingRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html)

**Main hyperparameters:** `n_estimators`, `learning_rate`, `max_depth`, `subsample`, `max_features`, `min_samples_leaf`.


In [None]:
# Train a baseline sklearn GradientBoostingClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score, accuracy_score
from sklearn.pipeline import make_pipeline

gb_pipe = make_pipeline(preprocessor_sklearn, GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.05, max_depth=3, random_state=42
))
gb_pipe.fit(X_train, y_train)
proba_gb = gb_pipe.predict_proba(X_test)[:,1]
print(f"Sklearn GB Test AUC: {roc_auc_score(y_test, proba_gb):.4f}")

### Hyperparameter tuning (RandomizedSearchCV)

We use `RandomizedSearchCV` over reasonable ranges. 
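
A detail worth checking when reading the search ranges: `scipy.stats.uniform(loc, scale)` samples from `[loc, loc + scale]`, not from `[low, high]`, and `randint(low, high)` excludes `high`. The short sketch below inspects the distributions with the same arguments used in this notebook.

```python
from scipy.stats import randint, uniform

lr_dist = uniform(0.01, 0.2)   # loc=0.01, scale=0.2 -> values in [0.01, 0.21]
sub_dist = uniform(0.6, 0.4)   # values in [0.6, 1.0]
depth_dist = randint(2, 6)     # integers 2, 3, 4, 5 (upper bound excluded)

# ppf(0) and ppf(1) give the lower and upper end of a continuous distribution
lr_low, lr_high = lr_dist.ppf(0), lr_dist.ppf(1)
sub_low, sub_high = sub_dist.ppf(0), sub_dist.ppf(1)
depth_values = sorted(set(depth_dist.rvs(size=1000, random_state=0).tolist()))

print("learning_rate range:", (lr_low, lr_high))
print("subsample range:", (sub_low, sub_high))
print("max_depth values:", depth_values)
```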


In [None]:
# RandomizedSearchCV for sklearn GB
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from scipy.stats import randint, uniform

param_distributions = {
    'gradientboostingclassifier__n_estimators': randint(100, 500),
    'gradientboostingclassifier__learning_rate': uniform(0.01, 0.2),
    'gradientboostingclassifier__max_depth': randint(2, 6),
    'gradientboostingclassifier__subsample': uniform(0.6, 0.4)
}

# create pipeline with named step so params can be referenced
from sklearn.pipeline import Pipeline
gb_pipeline = Pipeline(steps=[('pre', preprocessor_sklearn),
                              ('gradientboostingclassifier', GradientBoostingClassifier(random_state=42))])

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
rs = RandomizedSearchCV(gb_pipeline, param_distributions=param_distributions,
                        n_iter=20, scoring='roc_auc', cv=cv, verbose=2, n_jobs=-1, random_state=42)
rs.fit(X_train, y_train)
print("Best params:", rs.best_params_)
print("Best CV AUC:", rs.best_score_)

best_gb = rs.best_estimator_
proba_gb_best = best_gb.predict_proba(X_test)[:,1]
print("Tuned Sklearn GB Test AUC:", roc_auc_score(y_test, proba_gb_best))

### Feature importance (sklearn GB)

`sklearn` exposes `feature_importances_`, the (normalized) total reduction of the split criterion contributed by each feature.


In [None]:
# Extract feature names after preprocessing
feature_names_num = num_cols
# for onehot, get names from the OneHotEncoder fitted inside the tuned pipeline
# (gb_pipeline itself is only a template: RandomizedSearchCV fits clones of it)
ohe = best_gb.named_steps['pre'].named_transformers_['cat'].named_steps['onehot']
ohe_names = list(ohe.get_feature_names_out(cat_cols))
feature_names = feature_names_num + ohe_names

import numpy as np
import matplotlib.pyplot as plt

importances = best_gb.named_steps['gradientboostingclassifier'].feature_importances_
indices = np.argsort(importances)[-20:]

plt.figure(figsize=(8,6))
plt.barh(range(len(indices)), importances[indices], align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.title("Top 20 feature importances — sklearn GB")
plt.tight_layout()
plt.show()

## Section 3 — XGBoost (`xgboost.XGBClassifier`)

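XGBoost uses a second-order Taylor expansion of the loss at each boosting iteration. With $g_i$ and $h_i$ the first and second derivatives of the loss with respect to the previous prediction for sample $i$, and $\Omega$ a regularization term on the new tree $f_t$, the per-step objective is approximately: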

$$\mathcal{L}^{(t)} \approx \sum_i \left[g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2\right] + \Omega(f_t)$$
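
As a worked example, consider a single leaf with weight $w$. Assuming the usual XGBoost regularizer $\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \sum_j w_j^2$, the objective restricted to that leaf is $G w + \tfrac{1}{2}(H + \lambda) w^2$ with $G = \sum g_i$ and $H = \sum h_i$, which is minimized at $w^* = -G / (H + \lambda)$. The sketch below evaluates this for log loss, where $g_i = p_i - y_i$ and $h_i = p_i (1 - p_i)$; the labels and scores are toy values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy leaf: labels and current raw scores F_{t-1}(x_i) of the samples that fall in it
labels = [1, 1, 0, 1]
raw_scores = [0.0, 0.0, 0.0, 0.0]
reg_lambda = 1.0

probs = [sigmoid(s) for s in raw_scores]
grads = [p - y for p, y in zip(probs, labels)]    # g_i = p_i - y_i
hess = [p * (1.0 - p) for p in probs]             # h_i = p_i * (1 - p_i)

G, H = sum(grads), sum(hess)
leaf_weight = -G / (H + reg_lambda)               # minimizes G*w + 0.5*(H + lambda)*w^2
leaf_objective = -0.5 * G * G / (H + reg_lambda)  # objective value at the optimum

print(f"G = {G:.3f}, H = {H:.3f}")
print(f"optimal leaf weight = {leaf_weight:.3f}, objective contribution = {leaf_objective:.3f}")
```

With three positives out of four samples the leaf pushes the raw score upward (weight 0.5); a larger `reg_lambda` would shrink that step toward zero.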

**Documentation:**
- [XGBClassifier](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBClassifier)
- [XGBRegressor](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBRegressor)

**Main hyperparameters:** `n_estimators`, `learning_rate` (eta), `max_depth`, `min_child_weight`, `gamma`, `subsample`, `colsample_bytree`, `reg_alpha`, `reg_lambda`.

1. **`n_estimators`**  
   - The number of boosting rounds (trees) to fit.  
   - More trees can improve performance but increase training time and risk overfitting.  
   - Often tuned together with `learning_rate`.

2. **`learning_rate` (`eta`)**  
   - Step size shrinkage applied after each boosting step to prevent overfitting.  
   - Lower values make the model more robust but require more trees (`n_estimators`).  
   - Typical range: `0.01`–`0.3`.

3. **`max_depth`**  
   - Maximum depth of a tree.  
   - Controls model complexity — deeper trees can capture more patterns but may overfit.  
   - Typical range: `3`–`10`.

4. **`min_child_weight`**  
   - Minimum sum of instance weights (Hessian) in a child node.  
   - Higher values make the algorithm more conservative (prevent overfitting).  
   - Useful for controlling tree splitting when dataset has high variance.

5. **`gamma`** (`min_split_loss`)  
   - Minimum loss reduction required to make a further partition on a leaf node.  
   - Larger values make the algorithm more conservative.  
   - Acts as a regularization parameter for tree growth.

6. **`subsample`**  
   - Fraction of the training data randomly sampled for growing each tree.  
   - Helps prevent overfitting.  
   - Typical range: `0.5`–`1.0`.

7. **`colsample_bytree`**  
   - Fraction of features (columns) randomly sampled for each tree.  
   - Helps reduce correlation between trees and overfitting.  
   - Typical range: `0.5`–`1.0`.

8. **`reg_alpha`** (L1 regularization term on weights)  
   - Increases sparsity of weights (drives some leaf values to zero).  
   - Can help in feature selection.

9. **`reg_lambda`** (L2 regularization term on weights)  
   - Penalizes large leaf weights.  
   - Helps reduce model complexity and overfitting.

Full list of XGBoost parameters: https://xgboost.readthedocs.io/en/stable/parameter.html

In [None]:
# Train XGBoost (needs xgboost installed)
from xgboost import XGBClassifier
from sklearn.pipeline import Pipeline

# Prepare data with ordinal encoding (trees don't need one-hot)
X_train_ord = ordinal_transformer.fit_transform(X_train)
X_test_ord = ordinal_transformer.transform(X_test)

xgb = XGBClassifier(eval_metric='auc', n_estimators=300, learning_rate=0.05, max_depth=4, random_state=42)
xgb.fit(X_train_ord, y_train)
p_xgb = xgb.predict_proba(X_test_ord)[:,1]
print("XGBoost Test AUC:", roc_auc_score(y_test, p_xgb))

### Hyperparameter tuning (RandomizedSearchCV)

This section runs `RandomizedSearchCV` for XGBoost.


In [None]:
# Randomized search for XGBoost
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

xgb_param_dist = {
    'n_estimators': randint(100, 500),
    'learning_rate': uniform(0.01, 0.2),
    'max_depth': randint(3, 8),
    'subsample': uniform(0.6, 0.4),
    'colsample_bytree': uniform(0.6, 0.4),
    'reg_alpha': uniform(0, 1),
    'reg_lambda': uniform(0.5, 2)
}

xgb_clf = XGBClassifier(eval_metric='auc', random_state=42)

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
rs_xgb = RandomizedSearchCV(xgb_clf, param_distributions=xgb_param_dist, n_iter=25, scoring='roc_auc', cv=cv, verbose=2, n_jobs=-1, random_state=42)
rs_xgb.fit(X_train_ord, y_train)
print("Best XGB params:", rs_xgb.best_params_)
print("Best XGB CV AUC:", rs_xgb.best_score_)

best_xgb = rs_xgb.best_estimator_
p_xgb_best = best_xgb.predict_proba(X_test_ord)[:,1]
print("Tuned XGBoost Test AUC:", roc_auc_score(y_test, p_xgb_best))

### Feature importance (XGBoost)

XGBoost exposes `feature_importances_` and a `plot_importance` helper. The importance of a feature can be measured in several ways:

- **Weight (or Frequency):** This metric counts the number of times a feature is used to split the data across all trees in the model. Features with higher weight are used more frequently for splitting.

- **Gain:** This is the most common and often preferred metric. It represents the average gain (reduction in impurity) achieved by splits involving a particular feature across all trees. A higher gain indicates a more significant contribution to reducing the model's error.

- **Cover:** This metric reflects the average coverage or number of samples affected by splits involving a particular feature. It essentially measures the relative number of observations for which a feature is responsible in splits.
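
The three definitions can be illustrated on a hand-written list of splits. The numbers below are hypothetical, and the sketch uses plain sample counts for cover, whereas XGBoost measures cover through the sum of second-order gradients of the samples reaching a split.

```python
from collections import defaultdict

# Hypothetical splits collected from all trees: (feature, gain, samples covered)
splits = [
    ("age", 10.0, 100),
    ("age", 2.0, 40),
    ("euribor3m", 30.0, 100),
    ("job", 1.0, 20),
    ("age", 3.0, 10),
]

weight = defaultdict(int)
total_gain = defaultdict(float)
total_cover = defaultdict(float)
for feature, gain, cover in splits:
    weight[feature] += 1            # weight: how often the feature is used to split
    total_gain[feature] += gain
    total_cover[feature] += cover

avg_gain = {f: total_gain[f] / weight[f] for f in weight}    # 'gain' is an average per split
avg_cover = {f: total_cover[f] / weight[f] for f in weight}  # 'cover' is an average per split

print("weight:", dict(weight))
print("gain (average):", avg_gain)
print("cover (average):", avg_cover)
```

Here `age` ranks first by weight but `euribor3m` ranks first by gain, which is why the two plots later in this section can disagree.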


We'll use `feature_importances_` (gain-based if `importance_type='gain'`).


In [None]:
# Get feature names used for XGBoost (ordinal mapping produced numeric array)
# For ordinal_transformer, feature order is num_cols then cat_cols
xgb_feature_names = num_cols + cat_cols
import matplotlib.pyplot as plt
import numpy as np

importances = best_xgb.feature_importances_   # by default importance type is 'gain'
indices = np.argsort(importances)[-20:]

plt.figure(figsize=(8,6))
plt.barh(range(len(indices)), importances[indices], align='center')
plt.yticks(range(len(indices)), [xgb_feature_names[i] for i in indices])
plt.title("Top 20 feature importances — XGBoost (gain based)")
plt.tight_layout()
plt.show()

In [None]:
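# get_score() omits features that were never used in a split and, for models trained on
# NumPy arrays, keys them as 'f0', 'f1', ...; rebuild an array aligned with xgb_feature_names
weight_scores = best_xgb.get_booster().get_score(importance_type='weight')
importances = np.array([weight_scores.get(f'f{i}', 0.0) for i in range(len(xgb_feature_names))])
importances = importances / importances.sum()
indices = np.argsort(importances)[-20:]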

plt.figure(figsize=(8,6))
plt.barh(range(len(indices)), importances[indices], align='center')
plt.yticks(range(len(indices)), [xgb_feature_names[i] for i in indices])
plt.title("Top 20 feature importances — XGBoost (weight based)")
plt.tight_layout()
plt.show()

### Optional: Bayesian optimization with Optuna (XGBoost)

This subsection uses **Optuna** to find a high-performing set of hyperparameters. It executes multiple trials — be prepared to wait depending on `n_trials`.


In [None]:
# Optuna for XGBoost (ensure optuna is installed)
import optuna
from sklearn.model_selection import cross_val_score, StratifiedKFold

def objective(trial):
    param = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 500),
        # suggest_float replaces the deprecated suggest_uniform / suggest_loguniform
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.2, log=True),
        'max_depth': trial.suggest_int('max_depth', 3, 8),
        'subsample': trial.suggest_float('subsample', 0.6, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-3, 10.0, log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-3, 10.0, log=True)
    }
    clf = XGBClassifier(eval_metric='auc', random_state=42, **param)
    skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
    scores = cross_val_score(clf, X_train_ord, y_train, scoring='roc_auc', cv=skf, n_jobs=-1)
    return scores.mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=30, show_progress_bar=True)
print("Best optuna params:", study.best_params)
# Train final with best params
optuna_xgb = XGBClassifier(eval_metric='auc', random_state=42, **study.best_params)
optuna_xgb.fit(X_train_ord, y_train)
print("Optuna XGBoost Test AUC:", roc_auc_score(y_test, optuna_xgb.predict_proba(X_test_ord)[:,1]))

## Section 4 — LightGBM (`lightgbm.LGBMClassifier`)

LightGBM is optimized for performance on large datasets using histogram-based algorithms and techniques like GOSS and EFB.
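
The histogram idea can be sketched with NumPy: bucket a feature into a fixed number of bins, accumulate gradient and hessian sums per bin in one pass, then evaluate candidate splits only at bin boundaries. This is a simplified illustration under stated assumptions (equal-width bins, an XGBoost-style gain $G^2/(H+\lambda)$, synthetic gradients); it is not LightGBM's actual implementation and leaves out GOSS and EFB.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=1000)
# Toy gradients: negative for x < 4, positive otherwise, plus noise
grad = np.where(x < 4.0, -1.0, 1.0) + rng.normal(scale=0.1, size=x.size)
hess = np.ones_like(x)

n_bins, reg_lambda = 16, 1.0
edges = np.linspace(x.min(), x.max(), n_bins + 1)
bin_idx = np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)  # bin index per sample

# One pass over the data builds per-bin gradient and hessian sums
G_hist = np.bincount(bin_idx, weights=grad, minlength=n_bins)
H_hist = np.bincount(bin_idx, weights=hess, minlength=n_bins)

def score(G, H):
    return G * G / (H + reg_lambda)

G_total, H_total = G_hist.sum(), H_hist.sum()
# Candidate split k sends bins 0..k to the left child
G_left, H_left = np.cumsum(G_hist)[:-1], np.cumsum(H_hist)[:-1]
gains = (score(G_left, H_left)
         + score(G_total - G_left, H_total - H_left)
         - score(G_total, H_total))

best_bin = int(np.argmax(gains))
best_threshold = edges[best_bin + 1]
print(f"best split: x < {best_threshold:.2f} (gain {gains[best_bin]:.1f})")
```

Only `n_bins - 1` candidate thresholds are scored instead of one per distinct feature value, which is where most of the speed-up on large datasets comes from.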

**LightGBM Documentation:**
- [LightGBM Classifier](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html)
- [LightGBM Regressor](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMRegressor.html)

**Main hyperparameters:** `n_estimators`, `learning_rate`, `num_leaves`, `max_depth`, `min_child_samples` (alias of `min_data_in_leaf`), `subsample` (only active when `subsample_freq` > 0), `colsample_bytree`, `reg_alpha`, `reg_lambda`.

**Full list of LightGBM parameters:** https://lightgbm.readthedocs.io/en/latest/Parameters.html


In [None]:
# Train LightGBM (requires lightgbm installed)
import lightgbm as lgb

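lgb_clf = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05, random_state=42)
# Use ordinal encoded arrays; early stopping is a callback (fit(early_stopping_rounds=...) was removed in LightGBM 4.0)
# Caution: early stopping on the test set makes the reported test AUC optimistic
lgb_clf.fit(X_train_ord, y_train, eval_set=[(X_test_ord, y_test)], eval_metric='auc',
            callbacks=[lgb.early_stopping(stopping_rounds=50, verbose=False)])
print("LightGBM Test AUC:", roc_auc_score(y_test, lgb_clf.predict_proba(X_test_ord)[:,1]))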

### Hyperparameter tuning (RandomizedSearchCV)


In [None]:
# Randomized search for LightGBM
from scipy.stats import randint, uniform

lgb_param_dist = {
    'n_estimators': randint(100, 800),
    'learning_rate': uniform(0.01, 0.2),
    'num_leaves': randint(20, 150),
    'subsample': uniform(0.6, 0.4),
    'colsample_bytree': uniform(0.6, 0.4)
}

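# subsample (bagging_fraction) only takes effect when subsample_freq (bagging_freq) is non-zero
lgb_model = lgb.LGBMClassifier(subsample_freq=1, random_state=42)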
rs_lgb = RandomizedSearchCV(lgb_model, param_distributions=lgb_param_dist, n_iter=25, scoring='roc_auc', cv=cv, verbose=2, n_jobs=-1, random_state=42)
rs_lgb.fit(X_train_ord, y_train)
print("Best LGB params:", rs_lgb.best_params_)
print("Best LGB CV AUC:", rs_lgb.best_score_)

best_lgb = rs_lgb.best_estimator_
print("Tuned LightGBM Test AUC:", roc_auc_score(y_test, best_lgb.predict_proba(X_test_ord)[:,1]))

### Feature importance (LightGBM)


In [None]:
# Feature importances from LightGBM
# (default importance_type='split': number of times a feature is used in a split)
import numpy as np
import matplotlib.pyplot as plt

fi = best_lgb.feature_importances_
indices = np.argsort(fi)[-20:]
feature_names = num_cols + cat_cols

plt.figure(figsize=(8,6))
plt.barh(range(len(indices)), fi[indices], align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.title("Top 20 feature importances — LightGBM")
plt.tight_layout()
plt.show()

## Section 5 — CatBoost (`catboost.CatBoostClassifier`)

CatBoost handles categorical features natively and uses ordered boosting to reduce target leakage. Provide categorical feature names/indices to the model.
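
The idea behind the leakage-free categorical encoding can be sketched as follows: rows are visited in a random order, and each row's category is encoded from the targets of earlier rows only, so a row never sees its own label. This is a simplified sketch assuming a single permutation and the statistic `(positives_so_far + prior) / (rows_so_far + 1)`; CatBoost itself uses several permutations and further refinements. The function name and toy data are invented for the example.

```python
import random

def ordered_target_stats(categories, targets, prior=0.5, seed=0):
    """Encode each row using only the targets of rows that precede it in a random order."""
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)   # random permutation of the rows

    count_pos = {}                       # per category: positives seen so far
    count_all = {}                       # per category: rows seen so far
    encoded = [0.0] * len(categories)
    for i in order:
        c = categories[i]
        encoded[i] = (count_pos.get(c, 0) + prior) / (count_all.get(c, 0) + 1)
        # Only now does the row's own target enter the running statistics
        count_pos[c] = count_pos.get(c, 0) + targets[i]
        count_all[c] = count_all.get(c, 0) + 1
    return encoded

cats = ["married", "single", "married", "married", "single"]
ys = [1, 0, 0, 1, 1]
enc = ordered_target_stats(cats, ys)
print(enc)
```

The first row seen for each category is encoded with the prior alone, and later rows get progressively better-informed estimates.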

**Documentation:**

- [CatBoost Classifier](https://catboost.ai/docs/en/concepts/python-reference_catboostclassifier)
- [CatBoost Regressor](https://catboost.ai/docs/en/concepts/python-reference_catboostregressor)

**Main hyperparameters:** `iterations`, `learning_rate`, `depth`, `l2_leaf_reg`, `bagging_temperature`, `border_count`.

- `iterations`: Number of boosting rounds (trees) to build. More iterations can improve accuracy but risk overfitting.

- `learning_rate`: Step size for updating trees. Smaller values improve generalization but need more iterations.

- `depth`: Maximum depth of each tree, controlling model complexity. Higher depth captures interactions but risks overfitting.

- `l2_leaf_reg`: L2 regularization coefficient for leaf values. Helps prevent overfitting by penalizing large weights.

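- `bagging_temperature`: Controls the intensity of the Bayesian bootstrap used to weight training samples. At `0` every sample has weight 1 (no randomness); at `1` the weights are drawn from an exponential distribution; larger values make the bagging more aggressive.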

- `border_count`: Number of splits (bins) used for numeric feature discretization. Larger values give finer splits but increase computation.



In [None]:
# Train CatBoost (requires catboost installed)
from catboost import CatBoostClassifier

# Prepare data: pass raw DataFrame and list of categorical feature names
cat_features = cat_cols  # these are column NAMES in the original DataFrame
cb = CatBoostClassifier(iterations=500, learning_rate=0.05, depth=6, verbose=50, random_seed=42)
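# Caution: using the test set as eval_set for early stopping makes the reported test AUC optimistic
cb.fit(X_train, y_train, cat_features=cat_features, eval_set=(X_test, y_test), early_stopping_rounds=50)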
print("CatBoost Test AUC:", roc_auc_score(y_test, cb.predict_proba(X_test)[:,1]))

### Hyperparameter tuning (RandomizedSearchCV / CatBoost)


In [None]:
# Randomized search for CatBoost (uses sklearn wrapper)
from catboost import CatBoostClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

cb_param_dist = {
    'iterations': randint(100, 800),
    'learning_rate': uniform(0.01, 0.2),
    'depth': randint(3, 8),
    'l2_leaf_reg': uniform(1, 10)
}

cb_model = CatBoostClassifier(verbose=0, random_seed=42)
rs_cb = RandomizedSearchCV(cb_model, param_distributions=cb_param_dist, n_iter=20, scoring='roc_auc', cv=cv, verbose=2, n_jobs=-1, random_state=42)
# Note: the search could also run on the raw DataFrame if cat_features is set in the
# CatBoostClassifier constructor. To keep it simple, we fit on the ordinal-encoded arrays
# from above, so CatBoost treats the categorical columns as plain numeric features here.
rs_cb.fit(X_train_ord, y_train)
print("Best CatBoost params:", rs_cb.best_params_)
print("Best CatBoost CV AUC:", rs_cb.best_score_)

best_cb = rs_cb.best_estimator_
print("Tuned CatBoost Test AUC:", roc_auc_score(y_test, best_cb.predict_proba(X_test_ord)[:,1]))

### Feature importance (CatBoost)

CatBoost exposes `get_feature_importance` which supports several importance types (PredictionValuesChange, LossFunctionChange, ShapValues, etc.).


In [None]:
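# CatBoost feature importance (using PredictionValuesChange)
from catboost import Pool

# CatBoostClassifier has no get_pool method; build the Pool explicitly
train_pool = Pool(X_train, label=y_train, cat_features=cat_features)
fi_cb = cb.get_feature_importance(data=train_pool, type='PredictionValuesChange')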
feature_names_all = X_train.columns.tolist()
indices = np.argsort(fi_cb)[-20:]

import matplotlib.pyplot as plt
plt.figure(figsize=(8,6))
plt.barh(range(len(indices)), fi_cb[indices], align='center')
plt.yticks(range(len(indices)), [feature_names_all[i] for i in indices])
plt.title("Top 20 feature importances — CatBoost (PredictionValuesChange)")
plt.tight_layout()
plt.show()

## Summary & Practical tips

- All four gradient-boosting families are included with tuning and feature importance.
- Running all tuning steps will take time; adjust `n_iter` / `n_trials` for faster runs.
- Use early stopping where supported to reduce overfitting and speed up tuning.
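
The early-stopping rule mentioned in the last tip can be written out in plain Python. This is a generic sketch of the patience logic, not any library's implementation; the validation scores are hypothetical.

```python
def early_stopping_round(val_scores, patience):
    """Return (best_round, stopped_at) for validation scores where higher is better."""
    best_score, best_round = float("-inf"), -1
    for rnd, score in enumerate(val_scores):
        if score > best_score:
            best_score, best_round = score, rnd
        elif rnd - best_round >= patience:
            return best_round, rnd          # no improvement for `patience` rounds
    return best_round, len(val_scores) - 1  # ran out of rounds without triggering

# Hypothetical validation AUC per boosting round
auc_per_round = [0.70, 0.75, 0.78, 0.79, 0.788, 0.789, 0.787, 0.786, 0.80]
best_round, stopped_at = early_stopping_round(auc_per_round, patience=3)
print(f"best round: {best_round}, stopped at round: {stopped_at}")
```

Training stops at round 6 and keeps the model from round 3; the later improvement at round 8 is never reached, which is the trade-off a small `patience` makes. The scores should come from a validation split rather than the test set, otherwise the final test metric is biased upward.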

---

Place `bank-additional-full.csv` in the same folder as this notebook and run all cells.
