## Adaptive Boosting contd

- initialize all the observation weights as 1/(# of training observations)
- at the beginning all of the observations are equally important
- the error rate of the jth tree is the sum of the weights of the observations the model missed, divided by the sum of all the observation weights
- this is similar to 1-accuracy only with weighted observations
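
As a minimal sketch of the weighted error rate for a classification tree (the function name `weighted_error_rate` and the toy labels and weights are illustrative assumptions, not part of the notes):

```python
def weighted_error_rate(y_true, y_pred, weights):
    """Sum of the weights of the missed observations / sum of all weights."""
    missed = sum(w for yt, yp, w in zip(y_true, y_pred, weights) if yt != yp)
    return missed / sum(weights)

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]          # the 3rd and 5th observations are missed

n = len(y_true)
equal_weights = [1 / n] * n       # initial weights: 1 / (# of training observations)
r_equal = weighted_error_rate(y_true, y_pred, equal_weights)

# Hypothetical weights after some boosting rounds: the 3rd observation is heavy
uneven_weights = [0.1, 0.1, 0.5, 0.2, 0.1]
r_uneven = weighted_error_rate(y_true, y_pred, uneven_weights)

print(r_equal, r_uneven)
```

With equal weights the result matches 1 - accuracy (2 of 5 missed). With the uneven weights the same two misses cost more, because one of them carries half of the total weight.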

- using the error rate of the tree, calculate the tree weight
- the weight of the jth tree:
$a_j = \lambda \cdot \log\left(\frac{1 - r_j}{r_j}\right)$
 where $r_j$ is the error rate of the jth tree
- better tree performance means lower $r_j$ and higher $a_j$
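
A small sketch of the tree weight formula, assuming $\lambda$ is passed as `learning_rate` (the function name `tree_weight` is an illustrative choice):

```python
import math

def tree_weight(error_rate, learning_rate=1.0):
    """a_j = lambda * log((1 - r_j) / r_j)"""
    return learning_rate * math.log((1 - error_rate) / error_rate)

for r in (0.1, 0.3, 0.5):
    print(f"r_j = {r}: a_j = {tree_weight(r):.3f}")
```

A tree with an error rate of 0.5 gets a weight of 0, and the weight grows as the error rate shrinks. The formula is undefined at $r_j = 0$, so an implementation has to handle a perfect tree separately.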

using the tree weight update the observation weights:
- if the jth tree predicted the ith observation correctly, the weight of the ith observation remains the same
- if not, the weight of the ith observation is boosted by multiplying it by $e^{a_j}$

if a more successful tree misses an observation, that observation becomes more important ($a_j$ is higher, so $e^{a_j}$ is higher, so $w_i \cdot e^{a_j}$ is higher)
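
The update rule can be sketched as follows (the function name `update_weights` and the toy values are assumptions for illustration; weights are not renormalized here because the error rate already divides by the sum of all weights):

```python
import math

def update_weights(weights, y_true, y_pred, a_j):
    """Multiply the weight of every missed observation by e^{a_j}; keep the rest."""
    boost = math.exp(a_j)
    return [w * boost if yt != yp else w
            for w, yt, yp in zip(weights, y_true, y_pred)]

weights = [0.25, 0.25, 0.25, 0.25]
y_true = [1, 0, 1, 0]
y_pred = [1, 0, 0, 0]             # only the 3rd observation is missed

a_j = math.log(3)                 # tree weight for error rate 0.25 with lambda = 1
new_weights = update_weights(weights, y_true, y_pred, a_j)
print(new_weights)
```

Only the missed observation changes: its weight is tripled, while the three correctly predicted observations keep their weight of 0.25.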


the next tree is trained with 
- the same training data
- the updated observation weights

an observation with a higher weight contributes more to the cost (MSE, Gini) that the tree's training algorithm minimizes
this makes the next tree pay more attention to the observations that are missed by the previous tree(s)
- it will try to fit more closely or set the boundary to classify them correctly
- the more an observation is missed by consecutive trees, the higher its weight gets and the more attention the next tree will pay to it
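
One way a tree learner can take observation weights into account is to replace counts with weight sums when it computes the split criterion. The sketch below does this for Gini impurity; the function name `weighted_gini` and the toy node are assumptions for illustration:

```python
def weighted_gini(labels, weights):
    """Gini impurity where class proportions are shares of total weight."""
    total = sum(weights)
    impurity = 1.0
    for c in set(labels):
        share = sum(w for l, w in zip(labels, weights) if l == c) / total
        impurity -= share ** 2
    return impurity

node_labels = ["A", "A", "A", "B"]

gini_equal = weighted_gini(node_labels, [1, 1, 1, 1])    # B counts once
gini_boosted = weighted_gini(node_labels, [1, 1, 1, 3])  # B was missed and boosted

print(gini_equal, gini_boosted)
```

After the lone "B" observation is boosted, the same node looks more impure (0.5 instead of 0.375), so a split that separates it becomes more attractive to the next tree.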

So the AdaBoost algorithm:
- trains each tree on the entire training data, using the current observation weights
- assigns a weight to each tree
- updates the observation weights
- moves on to training the next tree
- repeats until all trees are trained
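
The loop above can be sketched from scratch in plain Python. This is a simplified illustration under several assumptions that are not in the notes: a single numeric feature, labels coded as -1/+1, a one-split decision stump as the base model, and a small clamp on the error rate to avoid dividing by zero. All function names are hypothetical.

```python
import math

def fit_stump(x, y, w):
    """Weighted stump: choose the threshold and direction with the lowest weighted error."""
    best = None
    for t in sorted(set(x)):
        for sign in (1, -1):
            pred = [sign if xi <= t else -sign for xi in x]
            err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
            if best is None or err < best[0]:
                best = (err, t, sign)
    return best[1], best[2]

def stump_predict(stump, x):
    t, sign = stump
    return [sign if xi <= t else -sign for xi in x]

def adaboost_fit(x, y, n_trees=3, learning_rate=1.0):
    n = len(x)
    w = [1 / n] * n                                   # equal initial weights
    stumps, tree_weights = [], []
    for _ in range(n_trees):
        stump = fit_stump(x, y, w)                    # same data, current weights
        pred = stump_predict(stump, x)
        r = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi) / sum(w)
        r = min(max(r, 1e-10), 1 - 1e-10)             # assumption: clamp to keep log finite
        a = learning_rate * math.log((1 - r) / r)     # tree weight
        w = [wi * math.exp(a) if p != yi else wi      # boost the missed observations
             for wi, p, yi in zip(w, pred, y)]
        stumps.append(stump)
        tree_weights.append(a)
    return stumps, tree_weights

def adaboost_predict(stumps, tree_weights, x):
    """Weighted vote: sign of the sum of tree_weight * prediction."""
    scores = [0.0] * len(x)
    for stump, a in zip(stumps, tree_weights):
        for i, p in enumerate(stump_predict(stump, x)):
            scores[i] += a * p
    return [1 if s >= 0 else -1 for s in scores]

x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

stumps, tree_weights = adaboost_fit(x, y, n_trees=3)
print(stumps)
print(adaboost_predict(stumps, tree_weights, x) == y)
```

On this toy data no single stump can separate the classes (the first one misses three observations), but the weighted vote of three stumps classifies every training observation correctly.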

## Adaptive Boosting -- Prediction

after training, for each test observation:
- each tree returns a prediction
- for regression, the weighted average of all the predictions is taken
    - tree weights are used for prediction
    - the prediction of each tree is multiplied by its normalized weight
- for classification, the weighted majority vote (mode) is taken
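
Both ways of combining the trees can be sketched as below; the function names and the toy predictions are assumptions for illustration:

```python
from collections import defaultdict

def weighted_average(predictions, tree_weights):
    """Regression: each prediction is multiplied by its normalized tree weight."""
    total = sum(tree_weights)
    return sum(p * a / total for p, a in zip(predictions, tree_weights))

def weighted_vote(predictions, tree_weights):
    """Classification: the class with the largest total tree weight wins."""
    votes = defaultdict(float)
    for p, a in zip(predictions, tree_weights):
        votes[p] += a
    return max(votes, key=votes.get)

reg_prediction = weighted_average([10.0, 20.0, 30.0], [2.0, 1.0, 1.0])
clf_prediction = weighted_vote(["A", "B", "B"], [3.0, 1.0, 1.0])

print(reg_prediction, clf_prediction)
```

In the classification example "B" has the plain majority, but "A" wins because the single tree voting for it has more weight than the other two combined. Note that scikit-learn's `AdaBoostRegressor` combines its trees with a weighted median rather than a weighted average, so its predictions can differ from this sketch.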

## Adaptive Boosting -- hyperparameters

- base model should be high bias low variance
- decision tree as the base model should be heavily regularized (to keep it high bias)
- this is possible with one of the tree hyperparameters, such as max_depth (first hyperparameter)
- since the goal is just to keep the tree small, one hyperparameter is enough

- the number of trees in the Adaboost model is another hyperparameter (2)
- $\lambda$ included in the tree weights calculation is another hyperparameter (3)
    - this value scales the tree weights and controls how abruptly the observation weights will change
    - determines how differently the next tree will be trained, i.e. how (un)correlated the tree predictions will be

# Weak Points

- prone to overfitting
    - boosting reduces the high bias of the base model but overdoing this can easily lead to overfitting
- sensitive to outliers
    - the observation weight of an outlier keeps growing as the trees in the boosting model keep missing it; eventually the model pays too much attention to the outlier and overfits
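
To get a feel for how quickly an outlier's weight can grow, the sketch below applies the update rule repeatedly. It makes a simplifying assumption that every tree has the same error rate (in practice the error rate changes as the weights change), and the function name `weight_after_misses` is hypothetical:

```python
import math

def weight_after_misses(initial_weight, error_rate, n_misses, learning_rate=1.0):
    """Weight of an observation that is missed by n_misses consecutive trees."""
    a = learning_rate * math.log((1 - error_rate) / error_rate)
    w = initial_weight
    for _ in range(n_misses):
        w *= math.exp(a)          # same boost as in the observation weight update
    return w

w0 = 1 / 100                      # initial weight with 100 training observations
fast = weight_after_misses(w0, error_rate=0.2, n_misses=5)
slow = weight_after_misses(w0, error_rate=0.2, n_misses=5, learning_rate=0.1)

print(fast / w0, slow / w0)
```

With $\lambda = 1$ the weight grows by a factor of $4^5 = 1024$ after five misses, while with $\lambda = 0.1$ it only doubles ($4^{0.5} = 2$). A smaller $\lambda$ is therefore one way to soften the effect of outliers.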


In [4]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.ensemble import AdaBoostClassifier, AdaBoostRegressor
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.metrics import root_mean_squared_error


In [2]:
trainf = pd.read_csv("/Users/vaibhavrangan/Downloads/Stat_303-3/Datasets/Car_features_train.csv")
trainp = pd.read_csv("/Users/vaibhavrangan/Downloads/Stat_303-3/Datasets/Car_prices_train.csv")
train = pd.merge(trainf, trainp)

testf = pd.read_csv("/Users/vaibhavrangan/Downloads/Stat_303-3/Datasets/Car_features_test.csv")
testp = pd.read_csv("/Users/vaibhavrangan/Downloads/Stat_303-3/Datasets/Car_prices_test.csv")
test = pd.merge(testf, testp)

predictors = ['mpg', 'engineSize', 'year', 'mileage']
target = 'price'
X_train = train[predictors]
y_train = train[target]
X_test = test[predictors]
y_test = test[target]

# Model Inputs

In [5]:
base_model = DecisionTreeRegressor(
    random_state = 12,
    max_depth = 4 # one hyperparameter is enough
)

model = AdaBoostRegressor(
    random_state = 12,
    estimator = base_model, # like bagging, the base model can technically be anything
    n_estimators = 20, # number of trees that are trained sequentially
    learning_rate = 0.1 # the lambda value that determines the scale of the tree weights and the correlation between tree predictions
)

model.fit(X_train, y_train)


# Feature Importances

In [6]:
# we can directly use .feature_importances_ with a trained model object
# no need to access each individual tree
model.feature_importances_

array([0.08507495, 0.58457068, 0.20795569, 0.12239868])
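
The array follows the column order of the training data, so it can be paired with the predictor names. The sketch below copies the values printed above instead of refitting the model:

```python
predictors = ['mpg', 'engineSize', 'year', 'mileage']
importances = [0.08507495, 0.58457068, 0.20795569, 0.12239868]  # output shown above

# Sort predictors from most to least important
ranked = sorted(zip(predictors, importances), key=lambda pair: pair[1], reverse=True)

for name, value in ranked:
    print(f"{name:<12}{value:.3f}")
```

For this model `engineSize` carries most of the importance, followed by `year`, `mileage` and `mpg`.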

# Tuning the Model

- we have three inputs to tune: one inside the base model and the other two inside the boosting model
- the random state and the base model are fixed

In [7]:
base_model = DecisionTreeRegressor(
    random_state=12
)

model = AdaBoostRegressor(
    random_state=12,
    estimator=base_model)

grid = {
    "estimator__max_depth": [4, 6, 8, 10], # use estimator__ to access the parameters of the base model
    "n_estimators": range(20, 150, 10),
    "learning_rate": [0.01, 0.1, 1, 10]
}

max_depth:
- too high: the base model will overfit, meaning the boosting model overfits, because boosting reduces bias, not variance
- too low: the base model risks capturing nothing because it is too shallow, so the boosting model still ends up underfitting

n_estimators:
- too low: not enough trees to capture different parts of the training data, underfitting
- too high: after a certain number of trees there is no signal left to capture, so the additional trees start fitting the noise in the data, overfitting



learning_rate:
- too high: each tree will focus too much on the observations that the previous tree missed, which leads to overfitting and makes the model sensitive to outliers and noise
- too low: each tree will train and predict similarly to the trees before it because the tree weights and the observation weights will not be updated sufficiently