# Boosting
1. Overall
    * Combines weak learners into a stronger learner
    * Adds learners sequentially
    * Learns from mistakes: each new learner focuses on (and improves) the errors of the previous ones
    * Three popular variants: AdaBoost, Gradient Boosting, XGBoost
    * Disadvantages: sensitive to outliers, and slow to train because learners are fitted one after another
2. Diff (RF vs Boosting):
    * Random Forest: no communication between the trees in the forest | Boosting: trees are added sequentially and each one learns from the mistakes of the previous ones.
3. **AdaBoost:**
    * Randomly generate a bag of data (bootstrapping)
    * Fit a learner to each bag (creates a stump, i.e. a depth=1 tree)
    * Create another bag of data focusing on the errors made so far.
        * Adds additional weight to the datapoints that were previously misclassified.
        * Once a datapoint is classified correctly at a later point, its weight decreases.
    * Fitting Process, <img src="images/adaboost1.png" />
    * Finally the decision boundary would look like, <img src="images/adaboost2.png" width="25%" height="25%" />
    ## Change this...
    * Different stumps carry different weights in the final vote. A stump's weight is calculated from its training performance: $weight=\frac{1}{2}\log\left(\frac{1-totError}{totError}\right)$
    * A simple weight update for each datapoint: `weight = weight * exp(+/-1*learning_rate)`, with `+` for a misclassified datapoint and `-` for a correctly classified one.
    * Algo (AdaBoost fit):
         - def fit(X, Y, num_trees):
             - self.tree_list = create_list(num_trees)  # decision_trees
             - errors = []
             - for t in self.tree_list:
                 - Bagging, with a higher probability of selecting previously misclassified datapoints
                 - sample_x, sample_y = sample(X, Y, errors)
                 - t.fit(sample_x, sample_y, self.max_depth)
                 - errors = datapoints that t misclassifies  # used to build the next bag
    * AdaBoost predict:
        * <img src="images/adaboostpredict.png" align="center" width="50%" height="50%" />
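
The AdaBoost notes above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes binary labels encoded as -1/+1, uses scikit-learn's `DecisionTreeClassifier(max_depth=1)` as the stump, and draws each bag with probability proportional to the datapoint weights. The function names `adaboost_fit` and `adaboost_predict` and the toy data are made up for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def adaboost_fit(X, y, num_trees=10, seed=0):
    """Fit stumps on weighted bootstrap bags; labels y must be -1 or +1."""
    rng = np.random.default_rng(seed)
    n = len(y)
    weights = np.full(n, 1.0 / n)  # every datapoint starts equally important
    stumps, alphas = [], []
    for _ in range(num_trees):
        # Bagging: heavier (previously misclassified) datapoints are drawn more often
        idx = rng.choice(n, size=n, replace=True, p=weights)
        stump = DecisionTreeClassifier(max_depth=1, random_state=0)
        stump.fit(X[idx], y[idx])
        miss = stump.predict(X) != y
        # Weighted error, clipped so the log below stays finite
        tot_error = np.clip(weights[miss].sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - tot_error) / tot_error)  # stump weight
        # Raise the weight of errors, lower the weight of correct datapoints
        weights = weights * np.exp(np.where(miss, alpha, -alpha))
        weights = weights / weights.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas


def adaboost_predict(X, stumps, alphas):
    """Weighted vote of all stumps."""
    votes = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.where(votes >= 0, 1, -1)


# Toy data (assumption): the class depends only on the sign of the first feature
demo_rng = np.random.default_rng(0)
X_demo = demo_rng.uniform(-1, 1, size=(200, 2))
y_demo = np.where(X_demo[:, 0] > 0, 1, -1)
stumps, alphas = adaboost_fit(X_demo, y_demo, num_trees=10)
pred_demo = adaboost_predict(X_demo, stumps, alphas)
accuracy = np.mean(pred_demo == y_demo)
```

Here the stump weight uses the datapoint-weighted error, and the same value drives the datapoint weight update, which is the usual AdaBoost choice; the fixed `learning_rate` version in the notes is a simplification of that step.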
        
4. **Gradient Boosting:**
    * Existing learners are "frozen" (not modified once fitted)
    * New learner added to existing sequence of learners
    * Outside-in Approach:
        * e.g. How likely is it that a given couple will be divorced in 10 years? Start from a broad base rate and refine it step by step:
            * How many American marriages end in divorce?
            * How many in Rhode Island?
            * How many Catholic marriages end in divorce? etc.
    * Algo:
        * Generate an initial hypothesis
        * Repeat:
            * Calculate residuals (errors) for each training instance
            * Fit CART to residuals
            * Generate new hypothesis = previous hypothesis + learning rate * CART prediction of the residuals
            * The CART is usually a shallow tree, with depth in the range 3 to 5
    * Steps:
        * <img src="images/gboost1.png" align="center" width="50%" height="50%" />
        * <img src="images/gboost2.png" align="center" width="50%" height="50%" />
        * Updating the predicted height using the mean value in each leaf node, <img src="images/gboost3.png" align="center" width="50%" height="50%" />
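
The gradient boosting loop above can be sketched as follows. This is a minimal version for squared-error regression only, assuming the initial hypothesis is the mean of the targets and using scikit-learn's `DecisionTreeRegressor` as the CART. The helper names and the sine-wave toy data are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def gradient_boost_fit(X, y, num_trees=50, learning_rate=0.1, max_depth=3):
    """Each new tree is fitted to the residuals of the current hypothesis."""
    initial = float(np.mean(y))  # initial hypothesis: predict the mean
    prediction = np.full(len(y), initial)
    trees = []
    for _ in range(num_trees):
        residuals = y - prediction  # errors for each training instance
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)
        # New hypothesis = previous hypothesis + learning rate * tree output;
        # the earlier trees are never refitted ("frozen")
        prediction = prediction + learning_rate * tree.predict(X)
        trees.append(tree)
    return initial, trees


def gradient_boost_predict(X, initial, trees, learning_rate=0.1):
    """Sum the initial hypothesis and the scaled output of every tree."""
    return initial + learning_rate * sum(t.predict(X) for t in trees)


# Toy data (assumption): one noisy-free sine wave
X_demo = np.linspace(0, 6, 100).reshape(-1, 1)
y_demo = np.sin(X_demo).ravel()
initial, trees = gradient_boost_fit(X_demo, y_demo)
boosted = gradient_boost_predict(X_demo, initial, trees)
mse_mean = np.mean((y_demo - initial) ** 2)
mse_boosted = np.mean((y_demo - boosted) ** 2)
```

The learning rate passed to `gradient_boost_predict` has to match the one used in `gradient_boost_fit`, since it scales every tree's contribution.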

5. **Extreme Gradient Boosting**
    * Works well on large datasets
    * Builds its own type of tree on the data
    * Orders instances according to feature-value pairs
    * For each adjacent pair of instances (according to feature-values):
        * Calculate the mean of the two feature values as a candidate split threshold
        * Calculate similarity: whether a split at that threshold results in a good grouping
    * Uses a “similarity” score to determine the quality of feature-value pairs
    * Uses a “gain” score to determine whether a split will be performed or pruned
    * Contains a regularisation parameter (λ), which shrinks the similarity scores and so reduces the influence of any single observation
    * Uses trees of depth 6 by default
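
A small worked sketch of the similarity and gain scores may help here. It assumes the commonly quoted squared-error regression form, similarity = (sum of residuals)² / (number of residuals + λ) and gain = left similarity + right similarity - parent similarity. The function names and the four toy datapoints are assumptions for illustration; this is not XGBoost's API.

```python
import numpy as np


def similarity(residuals, lam=1.0):
    """Similarity score of one node; a larger lam shrinks the score."""
    residuals = np.asarray(residuals, dtype=float)
    return residuals.sum() ** 2 / (len(residuals) + lam)


def best_split(feature, residuals, lam=1.0):
    """Try the mean of each adjacent pair of sorted feature values as a threshold."""
    feature = np.asarray(feature, dtype=float)
    residuals = np.asarray(residuals, dtype=float)
    order = np.argsort(feature)
    feature, residuals = feature[order], residuals[order]
    parent = similarity(residuals, lam)
    best_threshold, best_gain = None, -np.inf
    for i in range(1, len(feature)):
        if feature[i] == feature[i - 1]:
            continue  # identical values cannot be separated
        threshold = (feature[i - 1] + feature[i]) / 2
        gain = (similarity(residuals[:i], lam)
                + similarity(residuals[i:], lam)
                - parent)
        if gain > best_gain:
            best_threshold, best_gain = threshold, gain
    return best_threshold, best_gain


# Toy data (assumption): four instances with their current residuals
feature_demo = [10, 20, 25, 35]
residuals_demo = [-10.5, 6.5, 7.5, -7.5]
threshold_reg, gain_reg = best_split(feature_demo, residuals_demo, lam=1.0)
threshold_noreg, gain_noreg = best_split(feature_demo, residuals_demo, lam=0.0)
```

In this toy case both settings pick the threshold 15, but the gain with λ=1 is lower than with λ=0, so a regularised split is more likely to fall below a pruning threshold.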

## Implementation

In [24]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor

from sklearn.metrics import mean_squared_error

## Understanding the data

In [2]:
df = pd.read_csv('dataset/forestfires.csv')
df.head()

Unnamed: 0,X,Y,month,day,FFMC,DMC,DC,ISI,temp,RH,wind,rain,area
0,7,5,mar,fri,86.2,26.2,94.3,5.1,8.2,51,6.7,0.0,0.0
1,7,4,oct,tue,90.6,35.4,669.1,6.7,18.0,33,0.9,0.0,0.0
2,7,4,oct,sat,90.6,43.7,686.9,6.7,14.6,33,1.3,0.0,0.0
3,8,6,mar,fri,91.7,33.3,77.5,9.0,8.3,97,4.0,0.2,0.0
4,8,6,mar,sun,89.3,51.3,102.2,9.6,11.4,99,1.8,0.0,0.0


In [8]:
# Encode month and day as integers in order of first appearance (not calendar order)
month_map = {mon: i for i, mon in enumerate(df.month.unique())}
day_map = {day: i for i, day in enumerate(df.day.unique())}
df['month'] = df['month'].map(month_map)
df['day'] = df['day'].map(day_map)
df.head()

Unnamed: 0,X,Y,month,day,FFMC,DMC,DC,ISI,temp,RH,wind,rain,area
0,7,5,0,0,86.2,26.2,94.3,5.1,8.2,51,6.7,0.0,0.0
1,7,4,1,1,90.6,35.4,669.1,6.7,18.0,33,0.9,0.0,0.0
2,7,4,1,2,90.6,43.7,686.9,6.7,14.6,33,1.3,0.0,0.0
3,8,6,0,0,91.7,33.3,77.5,9.0,8.3,97,4.0,0.2,0.0
4,8,6,0,3,89.3,51.3,102.2,9.6,11.4,99,1.8,0.0,0.0
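
The mapping above assigns codes in order of first appearance, so `mar` becomes 0 and `oct` becomes 1. A possible alternative, sketched below, is to encode in calendar order so the integers carry the natural ordering. It assumes every month and day uses the same lowercase three-letter abbreviation seen in the rows above; `encode_calendar` and the small demo frame are hypothetical.

```python
import pandas as pd

# Assumption: the dataset uses lowercase three-letter abbreviations throughout
MONTHS = ['jan', 'feb', 'mar', 'apr', 'may', 'jun',
          'jul', 'aug', 'sep', 'oct', 'nov', 'dec']
DAYS = ['mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun']

calendar_month_map = {m: i for i, m in enumerate(MONTHS, start=1)}
calendar_day_map = {d: i for i, d in enumerate(DAYS, start=1)}


def encode_calendar(frame):
    """Return a copy with month and day replaced by calendar-ordered integers."""
    out = frame.copy()
    out['month'] = out['month'].map(calendar_month_map)
    out['day'] = out['day'].map(calendar_day_map)
    return out


# Hypothetical demo frame with the same column names as the dataset
demo = pd.DataFrame({'month': ['mar', 'oct', 'oct'],
                     'day': ['fri', 'tue', 'sat']})
encoded = encode_calendar(demo)
```

Tree-based learners split on thresholds, so a calendar-ordered code lets a single split separate, say, summer months from the rest.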


In [13]:
# Hold out 20% of the rows for testing; 'area' is the regression target
x_train, x_test, y_train, y_test = train_test_split(
    df.drop(['area'], axis=1), df['area'], test_size=0.2, random_state=5)

In [16]:
reg = AdaBoostRegressor()
reg.fit(x_train,y_train)

AdaBoostRegressor()

In [18]:
pred = reg.predict(x_test)

In [20]:
# RMSE on the test set
np.sqrt(mean_squared_error(y_test, pred))

107.4579902462583
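
An RMSE on its own is hard to judge, so it can help to compare it with a trivial baseline that always predicts the training mean. The helper below is a sketch; `baseline_rmse` is a made-up name and the numbers are toy values, not results from the forest fires data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error


def baseline_rmse(y_train, y_test):
    """RMSE of a model that always predicts the mean of the training targets."""
    constant = np.full(len(y_test), np.mean(y_train))
    return float(np.sqrt(mean_squared_error(y_test, constant)))


# Toy values (assumption): the training mean is 2, each test error is 2
toy_rmse = baseline_rmse([0.0, 2.0, 4.0], [0.0, 4.0])
```

With the notebook's split this would be called as `baseline_rmse(y_train, y_test)`; a boosted model is only useful if its RMSE is clearly below that value.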

### Hyperparameter tuning

In [23]:
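# Search over the number of estimators and the learning rate with 5-fold CV
# (GridSearchCV ranks regressors by R^2 unless `scoring` is set)
params = {'n_estimators': [30, 40, 50, 60, 70],
          'learning_rate': [1., 0.1, 0.01, 0.001]}
grid = GridSearchCV(AdaBoostRegressor(), params, cv=5)
grid.fit(x_train, y_train)

# RMSE of the best estimator on the test set
np.sqrt(mean_squared_error(y_test, grid.predict(x_test)))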

107.15418339246146

In [25]:
grid.best_estimator_

AdaBoostRegressor(learning_rate=0.001)
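
`GradientBoostingRegressor` is imported at the top but not used in the cells above. The sketch below shows how the same grid search could look for it. To stay self-contained it runs on synthetic data from `make_regression` as a stand-in for the forest fires split, and the parameter grid is an arbitrary assumption, so no numbers here say anything about the real dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption); replace with the notebook's x_train etc.
X_syn, y_syn = make_regression(n_samples=200, n_features=5, noise=10.0,
                               random_state=5)
xs_train, xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=5)

# Illustrative grid; shallow trees match the depth range noted for CART
gb_params = {'n_estimators': [50, 100],
             'learning_rate': [0.1, 0.01],
             'max_depth': [3, 5]}

# Rank candidates by RMSE so the search matches the reported metric
gb_grid = GridSearchCV(GradientBoostingRegressor(random_state=5), gb_params,
                       cv=3, scoring='neg_root_mean_squared_error')
gb_grid.fit(xs_train, ys_train)

gb_rmse = float(np.sqrt(mean_squared_error(ys_test, gb_grid.predict(xs_test))))
```

Setting `scoring='neg_root_mean_squared_error'` makes the cross-validation objective the same quantity that is reported on the test set, instead of the default R^2.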