<a href="https://colab.research.google.com/github/shaifali1102/Supervised-Learning/blob/main/Boosting_Practise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Boosting

- ensemble learning technique that combines weak learners (less accurate models) to create a strong learner (a single, highly accurate model)
- builds models sequentially
- each model is trained on the residuals (errors) of the models before it
- the first model is the mean model, which is the simplest possible model
- base learners have low variance and high bias
    - e.g. shallow DTs or decision stumps
- outputs of these learners are combined in an additive manner
- we take a weighted sum of the predictions from the models
    - a model may have a large residual or may contribute little to the final prediction
    - the weights control how much each model contributes to the final prediction
  $$
  F_m(x) = h_0(x) + \sum_{i=1}^{m} \gamma_ih_i(x)
  $$
      - here $m$, the number of base learners, is a hyperparameter
      - $h_0(x)$ is the mean model prediction
      - $\gamma_i$ is the weight assigned to the $i^{th}$ model
      - $h_i(x)$ is the $i^{th}$ model prediction
- Training and testing
  - Training
    - fit the base learners(DTs)
    - find the value of weights($\gamma_m$)
    - training is slow as it is sequential
  - Testing
    - hyperparam $M$ (the number of base learners) is already fixed
    - for a query point ($x_q$)
        - pass it through the base DTs and get predictions
        - multiply each prediction by its $\gamma_m$ and add the results to the mean model prediction to get the final prediction
- **Gradient Boosted DT(GBDT)**
    - it minimizes a loss function using gradient descent        
        - log loss for classification
        - MSE for regression
  $$
  L(y^i, \hat y^i) = (y^i - \hat y^i)^2
  $$

  $$
  \frac{\delta L}{\delta \hat y^i} = \frac{\delta (y^i - \hat y^i)^2}{\delta \hat y^i}
  $$

  $$
  \frac{\delta L}{\delta \hat y^i} = -2(y^i - \hat y^i)
  $$

  $$
  \frac{-\delta L}{\delta \hat y^i} = 2(y^i - \hat y^i)
  $$
    - the negative gradient is taken because we want to move the prediction in the direction that reduces the loss
    - so the residual is proportional to the negative gradient of the loss function w.r.t. the model prediction
    - $\hat y^i$ is the prediction for $x^i$ at the end of stage $k$, i.e. the output of the additive model $F_k$
  $$
  \hat y^i = F_k(x^i)
  $$
    - therefore, we can replace $\hat y^i$ in the gradient of our loss function:
  $$
  \frac{-\delta L}{\delta F_k(x^i)} = 2(y^i - \hat y^i)
  $$
    - these -ve gradients are called **pseudo residuals**
    - how do we use these pseudo residuals?
        - when building model $h_j$, we need {$x^i, err^i$} to train $h_j$
          - where $err^i = y^i - F_{j-1}(x^i)$
          - i.e. $err^i$ is the residual
        - so instead of calculating the residual we can calculate the pseudo residual
          - $err^i \approx pseudo~residual$
          - Therefore,
  $$
  err^i = \frac{-\delta L}{\delta F_{j-1}(x^i)}
  $$
      - at each step we fit a new model to the residuals left by the previous models, so that adding it reduces the remaining error
          - if we use pseudo residuals instead of residuals, we indirectly minimize the loss function at each step, and this works for any differentiable loss
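
The residual-fitting loop described above can be sketched from scratch for regression with MSE. This is a minimal illustration, not how sklearn implements it: the helper names `fit_gbdt` and `predict_gbdt`, the synthetic sine data, and the choice to use one constant weight (`learning_rate`) for every stage instead of a fitted $\gamma_m$ are all assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_gbdt(X, y, n_stages=50, learning_rate=0.1, max_depth=2):
    """Fit the mean model h_0 plus n_stages shallow trees on pseudo residuals."""
    h0 = float(np.mean(y))          # mean model h_0(x)
    pred = np.full(len(y), h0)      # F_0(x^i) for every training point
    trees = []
    for _ in range(n_stages):
        # The negative gradient of (y - F)^2 w.r.t. F is 2 * (y - F).
        # The constant factor 2 is dropped, so the target is the plain residual.
        pseudo_residual = y - pred
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, pseudo_residual)
        # Additive update with a constant weight (assumption: gamma_m = learning_rate).
        pred = pred + learning_rate * tree.predict(X)
        trees.append(tree)
    return h0, trees


def predict_gbdt(X, h0, trees, learning_rate=0.1):
    """Start from the mean model and add the weighted output of every tree."""
    pred = np.full(X.shape[0], h0)
    for tree in trees:
        pred = pred + learning_rate * tree.predict(X)
    return pred


# Hypothetical toy data: a noisy sine curve.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

h0, trees = fit_gbdt(X_demo, y_demo)
mse_mean_model = np.mean((y_demo - h0) ** 2)
mse_boosted = np.mean((y_demo - predict_gbdt(X_demo, h0, trees)) ** 2)
print(mse_mean_model, mse_boosted)
```

The training MSE of the boosted model should come out lower than that of the mean model alone, since every stage removes part of the remaining residual.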

- Bias-Variance Tradeoff
    - M = number of base learners
      - as M increases, the model tends to overfit
          - as more models are added, the training error tends to 0
      - as M decreases, the model tends to underfit
          - the prediction stays close to the mean model
    - Depth
      - as depth increases, the model tends to overfit
          - variance increases
    - GBDT will tend to overfit if it is not tuned properly
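
One way to observe the effect of M is to score the model after every stage. The sketch below uses `staged_predict` from sklearn's `GradientBoostingClassifier` on synthetic data; the dataset, the split, and the chosen hyperparameter values are assumptions made for illustration and are unrelated to the data used later in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data, only for illustration.
X_syn, y_syn = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=0
)

model = GradientBoostingClassifier(n_estimators=200, max_depth=3, random_state=0)
model.fit(X_tr, y_tr)

# staged_predict yields the ensemble's predictions after 1, 2, ..., M trees.
train_acc = [accuracy_score(y_tr, p) for p in model.staged_predict(X_tr)]
test_acc = [accuracy_score(y_te, p) for p in model.staged_predict(X_te)]

for m in (1, 10, 50, 200):
    print(m, round(train_acc[m - 1], 3), round(test_acc[m - 1], 3))
```

If the training accuracy keeps rising while the test accuracy flattens or drops, the extra trees are mostly fitting noise.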

- Regularization to prevent overfitting
    1. Regularization by Shrinkage
          - we add a hyperparam, the learning rate $v$
  $$
  F_M(x) = h_0(x) + v\sum_{m=1}^{M} \gamma_mh_m(x)
  $$
          - $v$ = learning rate; a smaller $v$ shrinks the contribution of every tree, so more trees are usually needed
          - range: $0 < v \le 1$
    2. Stochastic Gradient Boosting
        - we add randomization to reduce variance
        - GBDT -> pseudo residual + additive combining
        - Stochastic GBDT -> GBDT + Randomization(row and column sampling)
        - sklearn hyperparams
            - row sampling - subsample
            - column sampling - max_features
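
Both regularizers map onto constructor arguments of sklearn's `GradientBoostingClassifier`. The sketch below is only an illustration: the synthetic data and the specific values (`learning_rate=0.05`, `subsample=0.8`, `max_features='sqrt'`) are assumptions, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data, only for illustration.
X_syn, y_syn = make_classification(
    n_samples=500, n_features=12, n_informative=6, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=1
)

sgb = GradientBoostingClassifier(
    n_estimators=150,
    learning_rate=0.05,   # shrinkage v: scales the contribution of every tree
    subsample=0.8,        # row sampling: each tree sees 80% of the rows
    max_features="sqrt",  # column sampling: features considered at each split
    max_depth=3,
    random_state=1,
)
sgb.fit(X_tr, y_tr)

print(sgb.score(X_tr, y_tr), sgb.score(X_te, y_te))
```

Note that in sklearn `max_features` limits the features considered at each split rather than sampling columns once per tree.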

- Outlier impact
    - outliers will have high residuals, and each model is fit on the residuals of the previous model
    - this causes GBDT to focus on reducing the residuals of these outlier points
    - use Huber loss instead of MSE
        - MSE: squaring makes the loss grow very quickly as the error grows, so a few large errors dominate
        - Huber loss is called a hybrid loss function
            - behaves like MSE for small errors
            - behaves like MAE for large errors (mitigating MSE's sensitivity to outliers)
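        - RMSE is sometimes suggested as well
            - taking the root keeps the reported loss value on the scale of the target, so it does not explode for large errors; note that RMSE has the same minimizer as MSE, so it does not by itself reduce the influence of outliers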
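
The difference between the two losses shows up directly in the pseudo residuals. The sketch below compares the negative gradients of MSE and Huber loss on a toy vector with one outlier; the function names, the threshold `delta=1.0`, and the numbers are assumptions made for the example. In sklearn, the corresponding option is `GradientBoostingRegressor(loss='huber')`.

```python
import numpy as np


def mse_negative_gradient(y, pred):
    """Negative gradient of (y - pred)^2 w.r.t. pred."""
    return 2.0 * (y - pred)


def huber_negative_gradient(y, pred, delta=1.0):
    """Negative gradient of Huber loss w.r.t. pred.

    Huber loss is 0.5 * r^2 when |r| <= delta and
    delta * (|r| - 0.5 * delta) otherwise, with r = y - pred.
    """
    residual = y - pred
    return np.where(
        np.abs(residual) <= delta,
        residual,                    # quadratic region: behaves like MSE
        delta * np.sign(residual),   # linear region: behaves like MAE
    )


# Hypothetical targets; the last point is an outlier.
y_true = np.array([1.0, 2.0, 3.0, 100.0])
pred = np.array([1.5, 2.0, 2.5, 3.0])

mse_grad = mse_negative_gradient(y_true, pred)
huber_grad = huber_negative_gradient(y_true, pred)
print(mse_grad)
print(huber_grad)
```

For the outlier, the MSE pseudo residual is 194 while the Huber pseudo residual is clipped to 1, so the next tree is pulled far less toward that single point.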


    
  




<center><img src='https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/062/272/original/z.png?1705486735' width=700></center>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
import pickle

# download the pickled train/test splits from Google Drive (Colab shell commands)
!gdown 171Yoe_GSapyrmOnD9oBzHWNOD_OnQs0F
!gdown 1hnIlTPW3AMeB69EbeaXCRIrpMVT1Vwmc
!gdown 1nZtB_RtxMg_MgoRczb8UWQX-AEK_l3qE
!gdown 1zLDUErwKdmF-RacOyHEuI_z_46LssQtP


with open('X_train.pickle', 'rb') as handle:
    X_train = pickle.load(handle)

with open('X_test.pickle', 'rb') as handle:
    X_test = pickle.load(handle)

with open('Y_train.pickle', 'rb') as handle:
    y_train = pickle.load(handle)

with open('Y_test.pickle', 'rb') as handle:
    y_test = pickle.load(handle)

Downloading...
From: https://drive.google.com/uc?id=171Yoe_GSapyrmOnD9oBzHWNOD_OnQs0F
To: /content/Y_test.pickle
100% 31.7k/31.7k [00:00<00:00, 71.3MB/s]
Downloading...
From: https://drive.google.com/uc?id=1hnIlTPW3AMeB69EbeaXCRIrpMVT1Vwmc
To: /content/X_test.pickle
100% 253k/253k [00:00<00:00, 77.5MB/s]
Downloading...
From: https://drive.google.com/uc?id=1nZtB_RtxMg_MgoRczb8UWQX-AEK_l3qE
To: /content/Y_train.pickle
100% 126k/126k [00:00<00:00, 29.1MB/s]
Downloading...
From: https://drive.google.com/uc?id=1zLDUErwKdmF-RacOyHEuI_z_46LssQtP
To: /content/X_train.pickle
100% 1.01M/1.01M [00:00<00:00, 122MB/s]


## GBDT Implementation

In [None]:
pd.DataFrame(X_train).head()

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| 0 | 2213.879876 | 973.921683 | 2345.631834 | 439.476501 | 3104.622571 | 2053.534864 | 2769.461169 | 1490.383933 |
| 1 | 944.692689 | 197.246966 | 2114.598591 | 1277.642577 | 2925.766845 | 2934.17141 | 3320.243815 | 2655.539956 |
| 2 | 614.548915 | 190.64068 | 1552.583252 | 268.346133 | 3425.336511 | 3342.227722 | 3482.490542 | 3277.049734 |
| 3 | 461.600974 | 148.411723 | 614.729822 | 80.594145 | 40.985764 | 185.214018 | 44.823199 | 238.795383 |
| 4 | 226.981118 | 322.398827 | 1371.250212 | 239.889627 | 3771.013521 | 3805.124879 | 3589.184883 | 3463.405656 |


In [None]:
# GBDT
from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier(n_estimators=150, loss='log_loss', max_depth=2)

In [None]:
gbc.fit(X_train, y_train)


In [None]:
print(gbc.score(X_train, y_train))
print(gbc.score(X_test, y_test))

0.9553526128868595
0.9150393101699213


### Hyperparameter tuning

In [None]:
params = {
    'n_estimators': [50,100,150,200],
    'max_depth': [3, 4, 5, 7],
    'learning_rate': [0.1,0.2,0.3]
}

In [None]:
from sklearn.model_selection import RandomizedSearchCV

gbc = GradientBoostingClassifier()

# n_iter defaults to 10, so 10 of the 48 parameter combinations are sampled at random
rs = RandomizedSearchCV(gbc, params, scoring='accuracy', cv=3, n_jobs=-1, verbose=1)

rs.fit(X_train, y_train)

Fitting 3 folds for each of 10 candidates, totalling 30 fits


In [None]:
res = rs.cv_results_

for i in range(len(res["params"])):
  print(f"Parameters:{res['params'][i]} Mean_score: {res['mean_test_score'][i]} Rank: {res['rank_test_score'][i]}")


Parameters:{'n_estimators': 200, 'max_depth': 7, 'learning_rate': 0.3} Mean_score: 0.9583333333333334 Rank: 1
Parameters:{'n_estimators': 50, 'max_depth': 3, 'learning_rate': 0.3} Mean_score: 0.9248477929984779 Rank: 10
Parameters:{'n_estimators': 200, 'max_depth': 7, 'learning_rate': 0.1} Mean_score: 0.9578259766615931 Rank: 2
Parameters:{'n_estimators': 150, 'max_depth': 3, 'learning_rate': 0.3} Mean_score: 0.9361364789446981 Rank: 5
Parameters:{'n_estimators': 200, 'max_depth': 3, 'learning_rate': 0.1} Mean_score: 0.9336631151699645 Rank: 7
Parameters:{'n_estimators': 50, 'max_depth': 4, 'learning_rate': 0.3} Mean_score: 0.9323313039066464 Rank: 9
Parameters:{'n_estimators': 50, 'max_depth': 4, 'learning_rate': 0.2} Mean_score: 0.9340436326737697 Rank: 6
Parameters:{'n_estimators': 100, 'max_depth': 3, 'learning_rate': 0.2} Mean_score: 0.9323947234906139 Rank: 8
Parameters:{'n_estimators': 50, 'max_depth': 5, 'learning_rate': 0.2} Mean_score: 0.94019533231862 Rank: 4
Parameters:{'n_

In [None]:
print(rs.best_estimator_)

GradientBoostingClassifier(learning_rate=0.3, max_depth=7, n_estimators=200)
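
The per-candidate results of a randomized search can also be read as a table sorted by rank. The sketch below builds a `pandas` DataFrame from `cv_results_`; it runs its own small search on synthetic data so that it is self-contained, and the grid `param_grid`, the names `search` and `summary`, and the dataset are assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical synthetic data and a deliberately small grid, to keep this quick.
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)
param_grid = {
    "n_estimators": [20, 40],
    "max_depth": [2, 3],
    "learning_rate": [0.1, 0.3],
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    n_iter=4,          # sample 4 of the 8 combinations
    scoring="accuracy",
    cv=3,
    random_state=0,
)
search.fit(X_syn, y_syn)

# cv_results_ is a dict of arrays, so it converts directly to a DataFrame.
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
summary = (
    pd.DataFrame(search.cv_results_)[columns]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(summary)
```

Including `std_test_score` helps judge whether the gap between the top-ranked candidates is larger than the fold-to-fold variation.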


In [None]:
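# best_estimator_ has already been refit on the full training set
# (RandomizedSearchCV uses refit=True by default), so the fit below is redundant
gbc = rs.best_estimator_

gbc.fit(X_train, y_train)

# a training accuracy of 1.0 next to a lower test accuracy indicates some overfitting
print(gbc.score(X_train, y_train))
print(gbc.score(X_test, y_test))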

1.0
0.9619579000760842
