## Gradient Boosting

- achieves low-bias, low-variance models by training trees sequentially
- each tree in the sequence is fit to the residual left by the fit of the previous trees
    - residual = true response - predicted response

- with each tree in the sequence, the residual values get smaller on average



## Gradient Boosting Training

start with:
- all the training data
- a number of trees (hyperparameter)
- a lambda value, the same learning rate idea as in AdaBoost (hyperparameter)

train all trees sequentially:
- the first tree on the actual training data (X) and responses (y)
- the rest of the trees on X and the residuals from the previous tree
- no tree/observation weights, just different response values for each tree

the final gradient boosting fit is:
- the sum of the scaled fit of all trees
- the fit/boundary of each tree is scaled by $\lambda$ the learning rate:
    - usually between 0 and 1, can be higher
    - higher $\lambda$ (closer to 1), the fit/boundary of each tree is added with little attenuation -- less regularization
    - lower $\lambda$ (closer to 0), the fit/boundary of each tree is added with more attenuation -- more regularization
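
The training steps above can be sketched from scratch as follows. This is a minimal illustration under stated assumptions, not sklearn's implementation: the data is synthetic, the names `lam`, `trees`, and `boosted_predict` are made up for this example, and the running prediction starts at zero so that the first tree sees the actual responses, as described in the notes. sklearn's own model differs in details, such as starting from a constant initial prediction.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, purely illustrative (not the car dataset used later)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=200)

n_trees = 50  # number of trees (hyperparameter)
lam = 0.1     # learning rate lambda (hyperparameter)

trees = []
pred = np.zeros_like(y)  # running sum of the scaled tree fits
train_mse = []

for _ in range(n_trees):
    residual = y - pred  # equals y itself for the first tree
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residual)          # same X, different response values
    pred += lam * tree.predict(X)  # add the fit, scaled by lambda
    trees.append(tree)
    train_mse.append(np.mean((y - pred) ** 2))


def boosted_predict(X_new):
    """Final fit: the sum of the scaled fits of all trees."""
    return lam * sum(t.predict(X_new) for t in trees)


print(f"training MSE after 1 tree:   {train_mse[0]:.4f}")
print(f"training MSE after {n_trees} trees: {train_mse[-1]:.4f}")
```

With squared error and $\lambda$ between 0 and 1, the training MSE in this sketch cannot increase from one tree to the next, which matches the claim that the residuals shrink on average.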

the base model should be high bias and low variance: when a decision tree is the base model, it should be heavily regularized, for example with max_depth (hyperparameter)


number of trees should be picked carefully (hyperparameter)
- too high: overfitting, too low: underfitting

$\lambda$ scales the fit/boundary of each tree (hyperparameter)
- underfitting if too low, overfitting if too high
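
One way to see how the number of trees and $\lambda$ interact is to track the validation error after each tree. The sketch below uses `staged_predict`, which yields the prediction after each stage of a fitted sklearn model. The data is synthetic and the specific values (200 trees, three learning rates) are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data, purely illustrative
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(400, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=400)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=1)

val_mse = {}
for lam in [0.01, 0.1, 1.0]:
    gb = GradientBoostingRegressor(
        n_estimators=200, learning_rate=lam, max_depth=2, random_state=1
    )
    gb.fit(X_tr, y_tr)
    # staged_predict yields the prediction after 1, 2, ..., n_estimators trees
    val_mse[lam] = [mean_squared_error(y_val, p) for p in gb.staged_predict(X_val)]

for lam, errors in val_mse.items():
    best_n = int(np.argmin(errors)) + 1
    print(f"lambda={lam}: lowest validation MSE {min(errors):.3f} at {best_n} trees")
```

Plotting each list in `val_mse` against the tree index shows where a given $\lambda$ stops improving on held-out data.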

Gradient boosting also gives us the option to train each tree in the sequence on a subset of the training data, sampled without replacement

for each tree:
- if an observation has been seen before, its remaining residual is used as the response
- if an observation has not been seen before, the actual response is used

purposes:
- faster runtime (sklearn gradient boosting model is very slow)
- further regularization of the model -- the trees will be kept from collaboratively fitting to noise, outliers, etc

in sklearn this is controlled by the subsample hyperparameter:
- a value between 0 and 1
- underfitting if too low, trees wouldn't see enough data to train on
- overfitting (possibly) if too high since trees will see the same noise, outliers, etc
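
A short sketch of `subsample` in use, on synthetic data with illustrative settings. When `subsample` is below 1, sklearn exposes `oob_improvement_`, which records for each tree the improvement in loss on the observations that were left out of that tree's subset.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data, purely illustrative
rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(300, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=300)

gb = GradientBoostingRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=2,
    subsample=0.8,  # each tree is trained on 80% of the rows
    random_state=2,
)
gb.fit(X, y)

# Only available when subsample < 1.0: one entry per tree
oob = gb.oob_improvement_
print(oob.shape)
print(f"cumulative out-of-bag improvement: {np.cumsum(oob)[-1]:.3f}")
```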

## Gradient boosting -- Cost Function


the performance improves if each tree is fit to the negative gradient of the cost of the previous fit with respect to the predicted response

MSE = $\frac{1}{N} \sum_{i=1}^{N} \left( y^{(i)} - \hat{y}^{(i)} \right)^2$

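$\frac{\partial \text{MSE}}{\partial \hat{y}^{(i)}} = -\frac{2}{N} \left( y^{(i)} - \hat{y}^{(i)} \right)$

- for MSE, the negative gradient is proportional to the residual, so fitting each tree to the residual is a gradient descent step on the cost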
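
The link between the MSE gradient and the residual can be checked numerically. For a single prediction $\hat{y}^{(i)}$, the partial derivative of MSE is $-\frac{2}{N}(y^{(i)} - \hat{y}^{(i)})$. The sketch below compares that expression against a finite-difference estimate on made-up numbers; the helper `mse` and the step size `eps` are assumptions for this example.

```python
import numpy as np

# Made-up responses and predictions, purely illustrative
rng = np.random.default_rng(3)
y = rng.normal(size=5)
y_hat = rng.normal(size=5)
N = len(y)


def mse(pred):
    return np.mean((y - pred) ** 2)


# Analytic partial derivative with respect to each prediction
analytic = -2.0 / N * (y - y_hat)

# Central finite-difference estimate of the same partial derivatives
eps = 1e-6
numeric = np.zeros(N)
for i in range(N):
    step = np.zeros(N)
    step[i] = eps
    numeric[i] = (mse(y_hat + step) - mse(y_hat - step)) / (2 * eps)

residual = y - y_hat
print(np.allclose(analytic, numeric, atol=1e-6))
print(np.allclose(-analytic, 2.0 / N * residual))  # negative gradient is a scaled residual
```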


Can we use other cost functions?

- yes, and there is no theoretical way to know in advance which cost would work best
- the cost function is another hyperparameter
- most common types are:
    - MSE
    - MAE
    - Huber Cost
        - behaves like MSE for smaller residuals, smooth and differentiable around 0
        - behaves like MAE for larger residuals -- reduces the importance of outliers in the cost, good to avoid overfitting
- for classification, binomial deviance (log loss) is the most robust option and is almost always preferred in practice
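
The Huber cost listed above can be sketched in a few lines. This uses the standard textbook form with a threshold `delta` that separates "small" from "large" residuals; the function name `huber_cost` and the default `delta=1.0` are assumptions for this example, not sklearn's internals.

```python
import numpy as np


def huber_cost(residual, delta=1.0):
    """Per-observation Huber cost: quadratic near 0, linear in the tails."""
    residual = np.asarray(residual, dtype=float)
    abs_r = np.abs(residual)
    quadratic = 0.5 * residual ** 2         # MSE-like part for |r| <= delta
    linear = delta * (abs_r - 0.5 * delta)  # MAE-like part for |r| > delta
    return np.where(abs_r <= delta, quadratic, linear)


residuals = np.array([0.5, 3.0, 10.0])
print(huber_cost(residuals))  # grows linearly for the large residuals
print(0.5 * residuals ** 2)   # squared cost, for comparison
```

For the residual of 10, the squared cost is 50 while the Huber cost is 9.5, which is how outliers lose influence on the fit.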

Can we avoid overfitting if we regularize this cost function?
- Yes: this idea was the main motivation behind Extreme Gradient Boosting (XGBoost)
- Building on Gradient Boosting and XGBoost, two more extensions were proposed more recently and are widely used:
    - CatBoost ("cat" because it works well with categorical data)
    - LightGBM

In [3]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

In [2]:
trainf = pd.read_csv("/Users/vaibhavrangan/Downloads/Stat_303-3/Datasets/Car_features_train.csv")
trainp = pd.read_csv("/Users/vaibhavrangan/Downloads/Stat_303-3/Datasets/Car_prices_train.csv")
train = pd.merge(trainf, trainp)

# features and prices were read into swapped variable names; match the names to the files
testf = pd.read_csv("/Users/vaibhavrangan/Downloads/Stat_303-3/Datasets/Car_features_test.csv")
testp = pd.read_csv("/Users/vaibhavrangan/Downloads/Stat_303-3/Datasets/Car_prices_test.csv")
test = pd.merge(testf, testp)

predictors = ['mpg', 'engineSize', 'year', 'mileage']
target = 'price'
X_train = train[predictors]
y_train = train[target]
X_test = test[predictors]
y_test = test[target]


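No feature scaling is needed: the model is based on trees, which split on thresholds and are unaffected by the scale of the predictors.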

## Model Inputs

In [5]:
model = GradientBoostingRegressor(
    random_state=12,
    n_estimators=20,    # number of trees, same logic as AdaBoost
    learning_rate=0.1,  # works differently inside the algorithm than in AdaBoost, but has the same effect on the fit
    max_depth=4,        # tree hyperparameter that keeps each tree small and underfitting
    subsample=0.8,      # fraction of observations each tree sees; giving trees different data aims to reduce correlation between them
    loss="huber",       # cost function, robust to outliers
)

model.fit(X_train, y_train)

## Feature Importances


In [6]:
model.feature_importances_

array([0.04049564, 0.48502878, 0.34932893, 0.12514665])
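
The importances come back in the same order as the columns of `X_train`, so pairing them with the predictor names makes them easier to read. The sketch below hardcodes the values printed above so that it runs on its own; with the fitted model, `pd.Series(model.feature_importances_, index=predictors)` would build the same Series. The variable name `ranked` is chosen for this example.

```python
import pandas as pd

predictors = ['mpg', 'engineSize', 'year', 'mileage']
# Values copied from the output above
importances = [0.04049564, 0.48502878, 0.34932893, 0.12514665]

# Label each importance with its predictor and sort from most to least important
ranked = pd.Series(importances, index=predictors).sort_values(ascending=False)
print(ranked)
```

Here `engineSize` and `year` together account for roughly 83% of the total importance, and the values sum to 1.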

## Tuning the Model

In [10]:
# most expensive model we cover

model = GradientBoostingRegressor(random_state=12, loss="huber")

grid = {
    "n_estimators": [20, 50, 100],    # handled the same as AdaBoost
    "max_depth": [4, 6, 8],           # same as AdaBoost, but no estimator__ prefix because it is a direct model input
    "learning_rate": [0.01, 0.1, 1],  # same as AdaBoost
    "subsample": [0.5, 0.75, 1.0],    # don't try anything below 0.5; always include 1.0 as an option
}
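
A grid like this would typically be handed to a cross-validated search. The sketch below is one possible way to do that with `GridSearchCV`, under several assumptions: it uses synthetic data (`X_demo`, `y_demo`) and a reduced grid (`small_grid`) so that it runs quickly on its own, and the 3-fold split and RMSE scoring are illustrative choices, not settings taken from the notes.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in for X_train / y_train, purely illustrative
rng = np.random.default_rng(4)
X_demo = rng.uniform(-3, 3, size=(200, 3))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(scale=0.3, size=200)

model = GradientBoostingRegressor(random_state=12, loss="huber")

# Reduced grid (assumption) to keep the example fast
small_grid = {
    "n_estimators": [20, 50],
    "max_depth": [4, 6],
    "learning_rate": [0.1, 1],
    "subsample": [0.5, 1.0],
}

search = GridSearchCV(
    model,
    small_grid,
    scoring="neg_root_mean_squared_error",  # sklearn maximizes scores, hence the negated RMSE
    cv=KFold(n_splits=3, shuffle=True, random_state=12),
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"cross-validated RMSE: {-search.best_score_:.3f}")
```

With the car data, the same call would take `X_train`, `y_train`, and the full `grid`; that grid has 81 combinations, so the number of fits is 81 times the number of folds.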
