# Introduction
In this notebook, I explain in detail the nuts and bolts of some of the most popular boosting algorithms in use today, along with the features that differentiate them from one another. Towards the end, I also walk through the hyperparameter tuning process. I hope that after reading this notebook you will have a much better understanding of how each boosting algorithm works and why these algorithms often outperform other models :)

# Table of Contents
1. [Data Preprocessing](#sec1)
2. [Boosting Algorithms](#sec2)  
    * [2.1 Gradient Boost Regressor](#sec2.1)
    * [2.2 Extreme Gradient Boost](#sec2.2)
    * [2.3 LightGBM](#sec2.3)
    * [2.4 CatBoost](#sec2.4)
3. [Hyperparameter Tuning](#sec3)  
    * [3.1 Boosting Hyperparameters](#sec3.1)
    * [3.2 Grid Search](#sec3.2)
    * [3.3 Randomized Search](#sec3.3)

<a id='sec1'></a>
## [1. Data Preprocessing](#sec1)
In this section, I have summarized all the steps of the data preprocessing and feature engineering process. Since the main focus of this notebook is on understanding boosting algorithms, we will not go through this part in detail. The in-depth visualization and step-by-step approach can be found in another of my notebooks [here](https://www.kaggle.com/angqx95/data-science-workflow-top-2-with-tuning).

In [None]:
import numpy as np
import pandas as pd 
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
train = pd.read_csv('../input/home-data-for-ml-course/train.csv', index_col=0)
test = pd.read_csv('../input/home-data-for-ml-course/test.csv', index_col=0)

X = pd.concat([train.drop("SalePrice", axis=1),test], axis=0)
y = train[['SalePrice']]

In [None]:
def df_preprocessing(df):
    
    ## Dropping Features that are not useful for model prediction
    df.drop(['GarageYrBlt','TotRmsAbvGrd','1stFlrSF','GarageCars'], axis=1, inplace=True)    #drop highly correlated feature
    df.drop(['PoolQC','MiscFeature','Alley'], axis=1, inplace=True)         #drop top 3 columns with most number of missing values
    df.drop(['MoSold','YrSold'], axis=1, inplace=True)          #remove columns with no relationship with SalePrice
    
    df_col = df.columns     #remove columns with >96% same values
    overfit_col = []
    for i in df_col:
        counts = df[i].value_counts()
        top_count = counts.iloc[0]     #count of the most frequent value in the column
        if top_count / len(df) * 100 > 96:     #use df rather than the global X
            overfit_col.append(i)

    df = df.drop(overfit_col, axis=1)

    
    ## Removing outliers
    global train
    train = train.drop(train[train['LotFrontage'] > 200].index)
    train = train.drop(train[train['LotArea'] > 100000].index)
    train = train.drop(train[train['BsmtFinSF1'] > 4000].index)
    train = train.drop(train[train['TotalBsmtSF'] > 5000].index)
    train = train.drop(train[train['GrLivArea'] > 4000].index)
    
    
    ## Impute missing values
    ordd = ['GarageType','GarageFinish','BsmtFinType2','BsmtExposure','BsmtFinType1', 
       'GarageCond','GarageQual','BsmtCond','BsmtQual','FireplaceQu','Fence',"KitchenQual",
       "HeatingQC",'ExterQual','ExterCond']
    df[ordd] = df[ordd].fillna("NA")         #Ordinal columns replace missing values with NA
    
    cat = ["MasVnrType", "MSZoning", "Exterior1st", "Exterior2nd", "SaleType", "Electrical", "Functional"]
    df[cat] = df.groupby("Neighborhood")[cat].transform(lambda x: x.fillna(x.mode()[0]))      #Nominal columns replace missing value with most frequent occurrence aka mode
    
    df['LotFrontage'] = df.groupby('Neighborhood')['LotFrontage'].transform(lambda x: x.fillna(x.mean()))      #Replace with mean after grouping by Neighborhood
    df['GarageArea'] = df.groupby('Neighborhood')['GarageArea'].transform(lambda x: x.fillna(x.mean())) 
    df['MSZoning'] = df.groupby('MSSubClass')['MSZoning'].transform(lambda x: x.fillna(x.mode()[0]))

    cont = ["BsmtHalfBath", "BsmtFullBath", "BsmtFinSF1", "BsmtFinSF2", "BsmtUnfSF", "TotalBsmtSF", "MasVnrArea"]
    df[cont] = df[cont].fillna(df[cont].mean())       #Replace missing values with respective mean values for continuous features
    
    
    ## Mapping Ordinal Features
    ordinal_map = {'Ex': 5,'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1, 'NA':0}
    fintype_map = {'GLQ': 6,'ALQ': 5,'BLQ': 4,'Rec': 3,'LwQ': 2,'Unf': 1, 'NA': 0}
    expose_map = {'Gd': 4, 'Av': 3, 'Mn': 2, 'No': 1, 'NA': 0}
    fence_map = {'GdPrv': 4,'MnPrv': 3,'GdWo': 2, 'MnWw': 1,'NA': 0}
    
    ord_col = ['ExterQual','ExterCond','BsmtQual', 'BsmtCond','HeatingQC','KitchenQual','GarageQual','GarageCond', 'FireplaceQu']
    for col in ord_col:
        df[col] = df[col].map(ordinal_map)

    fin_col = ['BsmtFinType1','BsmtFinType2']
    for col in fin_col:
        df[col] = df[col].map(fintype_map)

    df['BsmtExposure'] = df['BsmtExposure'].map(expose_map)
    df['Fence'] = df['Fence'].map(fence_map)
    
    ## Change data type
    df['MSSubClass'] = df['MSSubClass'].apply(str)
    
    return df


## Feature Engineering
def feat_engineer(df):
    
    ## Add new features based on merging relevant existing features
    df['TotalLot'] = df['LotFrontage'] + df['LotArea']     #operate on the df argument, not the global X
    df['TotalBsmtFin'] = df['BsmtFinSF1'] + df['BsmtFinSF2']
    df['TotalSF'] = df['TotalBsmtSF'] + df['2ndFlrSF']
    df['TotalBath'] = df['FullBath'] + df['HalfBath']
    df['TotalPorch'] = df['OpenPorchSF'] + df['EnclosedPorch'] + df['ScreenPorch']
    
    
    ## Generate Binary columns indicating 0/1 the presence of such features
    colum = ['MasVnrArea','TotalBsmtFin','TotalBsmtSF','2ndFlrSF','WoodDeckSF','TotalPorch']
    for col in colum:
        col_name = col+'_bin'
        df[col_name] = df[col].apply(lambda x: 1 if x > 0 else 0)
    
    
    ## Convert categorical to numerical through One-hot encoding
    df = pd.get_dummies(df)
    return df


X = df_preprocessing(X)
X = feat_engineer(X)

In [None]:
## Scaling
cols = X.select_dtypes(np.number).columns
X[cols] = RobustScaler().fit_transform(X[cols])

## Log transformation of SalePrice
y["SalePrice"] = np.log(y['SalePrice'])

## Return train and test index
x = X.loc[train.index]
y = y.loc[train.index]
test = X.loc[test.index]

#Split train into train/validation set for training
X_train, X_val, y_train, y_val = train_test_split(x, y, test_size=0.2, random_state=2020)

print(X_train.shape)

<a id='sec2'></a>
## [2. Boosting Algorithms](#sec2)
In this section I would like to explain what **Boosting** is and introduce the popular boosting algorithms that people use. We will dive deep into the following algorithms:
1. Gradient Boost Regressor
2. Extreme Gradient Boost
3. LightGBM
4. CatBoost

#### Ensemble Methods
Before defining boosting, we need to understand the idea of ensembling. Ensembling is a technique that creates multiple models and then combines them to make a prediction. This reduces the reliance on a single model (which may not be accurate) and leverages many models to produce the final result.
The idea of ensemble methods is that by combining the predictions of multiple models, we can effectively:
1. Improve accuracy
2. Reduce Variance  
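
The variance claim can be checked with a small simulation. This is only a minimal sketch, assuming each model's prediction equals the true value plus independent noise (an idealization: real models are correlated, so the gain is smaller in practice). The names `n_models` and `noise_std` are made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 10.0
n_models, n_trials, noise_std = 25, 10_000, 2.0

# Each row is one trial; each column is the prediction of one hypothetical model
predictions = true_value + rng.normal(0.0, noise_std, size=(n_trials, n_models))

single_var = predictions[:, 0].var()            # variance of a single model
ensemble_var = predictions.mean(axis=1).var()   # variance of the averaged ensemble

print(f"single model variance: {single_var:.3f}")
print(f"ensemble variance:     {ensemble_var:.3f}")
```

Under the independence assumption, averaging `n_models` predictions divides the variance by roughly `n_models`. Boosting, covered next, combines its models sequentially rather than by simple averaging.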

#### Boosting
Boosting is the ensembling technique that we will focus on in this notebook. The idea of boosting is to begin with a weak learner and turn it into a strong learner step by step. It always starts off with a base learner, usually a tree-based model such as a decision tree. The algorithm trains this tree and identifies the mistakes it has made. It then builds another decision tree that places more emphasis on correcting these mistakes, for example by giving more weight to the misclassified instances (as in AdaBoost) or by fitting the remaining errors directly (as in gradient boosting, covered below), while placing less emphasis on the instances that were already predicted correctly. This process goes on until the model no longer improves or a preset number of trees is reached. Boosting therefore builds models sequentially: each new model learns from and corrects the mistakes of the models before it, so that the ensemble grows from a weak learner into a strong learner that can give accurate predictions.

![](https://miro.medium.com/max/1400/0*yDz8euzLbQvucBwx.png)

![](https://imgur.com/E7faDVa)

<a id='sec2.1'></a>
### [2.1 Gradient Boost Regressor](#sec2.1)
Gradient Boost is a tree-based boosting model that uses gradient descent as its optimization algorithm. Gradient Boost Regressor is the variant that deals with regression problems, where the target variable is continuous (*SalePrice* in our case).

Let's take a quick look at the algorithm for Gradient Boost <br>
<img src="https://i.imgur.com/OBmgEba.png" width='500' height='40'> <br>
We will go through the algorithm step by step to understand the entire algorithm in an intuitive manner.

#### Step 1
$$ F_{0}(x) = \arg\min_{\gamma} \sum^{n}_{i=1}L(y_i, \gamma) $$ <br>
We first build our base learner, $ F_{0}(x) $, using this formula. It says that we need to find the value of $ \gamma $ that minimizes the total loss $ \sum^{n}_{i=1}L(y_i, \gamma) $ over all $ n $ training instances. We assign this value of $ \gamma $ to our model $ F_{0}(x) $, so the base model predicts $ \gamma $ for every input. In regression problems, the loss function $ L(y_i, \gamma) $ is usually the squared error, $ (actual - predicted)^2 $, whose average over the data is the mean squared error (MSE). When the squared error is our loss function, the minimizing value of $ \gamma $ is always the average of the target values.  
So we assign our base model $ F_{0}(x) $ the average value of our target column. That sums up step 1 of the gradient boosting algorithm.
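
The claim that the average minimizes the squared-error loss can be checked numerically. The sketch below uses a handful of made-up target values (not taken from the housing data) and a brute-force search over candidate constants.

```python
import numpy as np

y = np.array([12.1, 11.8, 12.5, 11.9, 12.3])   # made-up target values

# Evaluate the total squared-error loss for a grid of candidate constants gamma
gammas = np.linspace(y.min(), y.max(), 701)
losses = ((y[None, :] - gammas[:, None]) ** 2).sum(axis=1)

best_gamma = gammas[losses.argmin()]
print(f"gamma with the smallest loss: {best_gamma:.3f}")
print(f"mean of the targets:          {y.mean():.3f}")
```

Up to the resolution of the grid, the constant with the smallest loss coincides with the mean of the targets.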

#### Step 2 (a)

$$ r_{im} = -\big[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\big]_{F(x) = F_{m-1}(x)} $$

In this step we calculate the derivative (gradient) of the loss function with respect to the predicted values of the current model, which is the base learner in the first iteration. The negative of this gradient is called the pseudo-residual. A residual is the difference between the actual and predicted values, $ r_{im} = (observed - predicted) $, and for the squared error loss the pseudo-residual is proportional to exactly this quantity. Because the formula effectively calculates a gradient, the algorithm gets its name: **Gradient Boost Regressor**.
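
As a worked example, assume the loss is the squared error scaled by one half. The factor $ \frac{1}{2} $ is a common convention that only simplifies the derivative; it is an assumption made here, not something fixed by the algorithm.

$$ L(y_i, F(x_i)) = \frac{1}{2}\big(y_i - F(x_i)\big)^2 $$

$$ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} = -\big(y_i - F(x_i)\big) $$

$$ r_{im} = -\big[-\big(y_i - F_{m-1}(x_i)\big)\big] = y_i - F_{m-1}(x_i) $$

With this loss, the pseudo-residual is exactly the ordinary residual: observed minus predicted.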




#### Step 2 (b)

After obtaining the residuals $ r_{im} $ from the base learner, we build a tree-based learner (a decision tree) and fit it to these residual values. Each new learner is therefore generated based on the errors of the previous generation.

#### Step 2 (c)
Now that we have fitted the new tree to the residuals of the previous learner, we can compute the output value of each leaf (terminal region) $ R_{jm} $ of the new tree

$$ \gamma_{jm} = \arg\min_{\gamma} \sum_{x_{i}\in R_{jm}}L(y_i, F_{m-1}(x_{i}) + \gamma) $$ <br>

For each leaf $ R_{jm} $, we calculate the output $ \gamma_{jm} $ that minimizes the loss function $ \sum_{x_{i}\in R_{jm}}L(y_i, F_{m-1}(x_{i}) + \gamma) $ over the instances that fall into that leaf.
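
For the squared error loss, this minimization has a simple answer: the output of a leaf is the mean of the residuals that fall into it. The sketch below checks this by brute force. The targets and the leaf assignment `leaf_of` are made up for illustration; in the real algorithm the leaves come from the tree fitted in step 2 (b).

```python
import numpy as np

y = np.array([12.4, 12.6, 12.9, 11.5, 11.7])   # made-up targets
F_prev = np.full_like(y, y.mean())             # predictions of the previous model
leaf_of = np.array([0, 0, 0, 1, 1])            # hypothetical leaf index of each sample

gammas = np.linspace(-1.0, 1.0, 2001)          # candidate leaf outputs
results = {}
for j in np.unique(leaf_of):
    in_leaf = leaf_of == j
    residuals = y[in_leaf] - F_prev[in_leaf]
    # Squared-error loss in this leaf for every candidate gamma
    losses = ((residuals[None, :] - gammas[:, None]) ** 2).sum(axis=1)
    results[int(j)] = (gammas[losses.argmin()], residuals.mean())
    print(f"leaf {j}: argmin gamma = {results[int(j)][0]:.3f}, "
          f"mean residual = {results[int(j)][1]:.3f}")
```

For other loss functions the leaf output is generally not the plain mean, which is why the algorithm states this step as a minimization.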


#### Step 2 (d)
We then update the model to $ F_{m}(x) $ by adding the newly generated tree, scaled by a learning rate, to the previous model $ F_{m-1}(x) $
$$
 F_{m}(x) = F_{m-1}(x) + v\sum_{j=1}^{J_{m}} \gamma_{jm}I(x \in R_{jm})
 $$
where $ v $ is the learning rate and $ J_{m} $ is the number of leaves of the $ m $-th tree.
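
Steps 1 to 2 (d) can be put together in a few lines. This is a minimal sketch under stated assumptions, not the scikit-learn implementation: it uses the squared error loss, synthetic one-dimensional data, and a depth-1 tree (a "stump") as a stand-in for the decision tree. The helpers `fit_stump` and `predict_stump` are hypothetical names written for this example.

```python
import numpy as np

def fit_stump(x, r):
    """Fit a depth-1 regression tree (stump) to residuals r on a single feature x."""
    best = None
    for threshold in np.unique(x)[:-1]:            # candidate split points
        left = x <= threshold
        gamma_left, gamma_right = r[left].mean(), r[~left].mean()
        sse = ((r[left] - gamma_left) ** 2).sum() + ((r[~left] - gamma_right) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, gamma_left, gamma_right)
    return best[1:]

def predict_stump(stump, x):
    threshold, gamma_left, gamma_right = stump
    return np.where(x <= threshold, gamma_left, gamma_right)

rng = np.random.default_rng(2020)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + rng.normal(0, 0.1, size=200)       # synthetic regression data

learning_rate, n_trees = 0.1, 100
F = np.full_like(y, y.mean())                      # Step 1: F_0 is the mean of the targets
stumps, mse_history = [], [np.mean((y - F) ** 2)]

for m in range(n_trees):
    residuals = y - F                              # Step 2(a): pseudo-residuals (squared error)
    stump = fit_stump(x, residuals)                # Step 2(b)-(c): fit a tree; leaf value = mean residual
    F = F + learning_rate * predict_stump(stump, x)    # Step 2(d): shrink and add to the model
    stumps.append(stump)
    mse_history.append(np.mean((y - F) ** 2))

print(f"training MSE of F_0:           {mse_history[0]:.4f}")
print(f"training MSE after {n_trees} trees: {mse_history[-1]:.4f}")
```

Each round can only lower the training error here, because the stump is the best piecewise-constant fit to the current residuals and the learning rate takes a fraction of that step. A smaller learning rate needs more trees but usually generalizes better.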

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

gbr = GradientBoostingRegressor(loss='squared_error',   #called 'ls' in scikit-learn versions before 1.0
                                learning_rate=0.1, n_estimators=500, criterion='friedman_mse',
                                min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_depth=3,
                                random_state=2020, max_features='sqrt',   #min_impurity_split was removed in scikit-learn 1.0
                                alpha=0.9, tol=0.0001)   #documentation defaults, except n_estimators, max_features and random_state

In [None]:
#Training
gbr.fit(X_train, y_train.values.ravel())

#Inference
y_pred = gbr.predict(X_val)

#Evaluate
np.sqrt(mean_squared_error(y_val,y_pred))

Find out more at:
* https://towardsdatascience.com/machine-learning-part-18-boosting-algorithms-gradient-boosting-in-python-ef5ae6965be4
* https://medium.com/mlreview/gradient-boosting-from-scratch-1e317ae4587d
* https://explained.ai/gradient-boosting/L2-loss.html


<a id='sec2.2'></a>
### [2.2 Extreme Gradient Boost (XGBoost)](#sec2.2)

XGBoost is an optimized implementation of the Gradient Boost architecture with additional system-level and algorithmic improvements.

<img src='https://miro.medium.com/max/1400/1*FLshv-wVDfu-i54OqvZdHg.png' width=600 height=500>

#### System Optimization
1. **Parallelization**: Boosting itself stays sequential, since each tree depends on the previous ones, but XGBoost parallelizes the construction of each individual tree, for example the search for the best split across features.
2. **Tree Pruning**: Trees are grown up to the 'max_depth' parameter and then pruned backward, removing splits that do not give enough gain. This approach improves performance significantly.
3. **Hardware Optimization**: XGBoost makes efficient use of hardware resources. Cache awareness is achieved by allocating internal buffers in each thread to store gradient statistics. 'Out-of-core' computing also optimizes the use of disk space when handling large datasets that do not fit into memory.

#### Algorithmic Enhancement
1. **Regularization**: It uses both L1 (Lasso) and L2 (Ridge) regularization to prevent overfitting.
2. **Sparsity Awareness**: XGBoost handles sparse inputs and missing values by automatically 'learning' the best default direction for missing values at each split, based on the training loss.
3. **Weighted Quantile Sketch**: Employs the distributed weighted quantile sketch (WQS) algorithm to find candidate split points effectively.
4. **Cross-Validation**: Comes with a built-in cross-validation method that evaluates the model at each boosting iteration.
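
To make the regularization point concrete, the sketch below computes the value of a single leaf with and without penalties, following the closed-form leaf weight given in the XGBoost paper (gradient sum over Hessian sum plus the L2 term, with the L1 term applied as a soft threshold). It is a simplified illustration, not XGBoost's actual code. The function `leaf_weight` is a hypothetical helper; its arguments `reg_lambda` and `reg_alpha` are only named after the corresponding XGBoost parameters.

```python
import numpy as np

def leaf_weight(grad, hess, reg_lambda=0.0, reg_alpha=0.0):
    """Regularized leaf value: soft-threshold the gradient sum (L1), shrink by lambda (L2)."""
    G, H = grad.sum(), hess.sum()
    G_thresholded = np.sign(G) * max(abs(G) - reg_alpha, 0.0)
    return -G_thresholded / (H + reg_lambda)

y = np.array([0.9, 1.1, 1.0])      # made-up targets of the samples in one leaf
pred = np.zeros_like(y)            # current predictions for those samples
grad = pred - y                    # first derivative of 0.5 * (y - pred)^2 w.r.t. pred
hess = np.ones_like(y)             # second derivative of the same loss

w_plain = leaf_weight(grad, hess)
w_l2 = leaf_weight(grad, hess, reg_lambda=3.0)
w_l1 = leaf_weight(grad, hess, reg_alpha=1.5)
print(f"no regularization: {w_plain:.3f}")
print(f"L2 (lambda=3.0):   {w_l2:.3f}")
print(f"L1 (alpha=1.5):    {w_l1:.3f}")
```

Without penalties the leaf value is just the mean residual. Both penalties pull it towards zero, so each tree makes a more conservative correction.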

#### Histogram-Based Splitting Methods
Finding the best split is a key challenge in decision tree models. A naive traversal of every feature value of every data point is computationally expensive, with a complexity of $O(n_{data}n_{features})$. Building on the observation that small changes in the split point do not make much difference to the performance of the tree, XGBoost and LightGBM both employ a histogram-based method that groups feature values into discrete bins and searches for splits only at the bin boundaries. This reduces the complexity of splitting to $O(n_{data}n_{bins})$.
<img src='https://i2.wp.com/mlexplained.com/wp-content/uploads/2018/01/binned_split_gbdt.png?resize=300%2C260&ssl=1'>

Such splitting methods also present the user with some decisions to make.
1. The number of bins creates a trade-off between accuracy and speed: more bins give higher accuracy but slower training.
2. How to divide the features into discrete bins is a non-trivial problem. Using gradient statistics is the most widely used approach, e.g. the 'approx' and 'hist' options of the *tree_method* parameter.
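
The difference between an exact and a histogram-based split search can be sketched on one synthetic feature. The example assumes the squared error loss and uses simple quantile-based bin edges. The real libraries build their bins differently (for example with gradient-weighted sketches), so treat this as an illustration of the idea only. `split_gain` is a hypothetical helper.

```python
import numpy as np

def split_gain(G_left, n_left, G, n):
    """Squared-error gain of a left/right split (constant factors dropped)."""
    return G_left ** 2 / n_left + (G - G_left) ** 2 / (n - n_left) - G ** 2 / n

rng = np.random.default_rng(0)
n = 5000
x = rng.uniform(0, 100, size=n)
y = np.where(x > 37.0, 5.0, 1.0) + rng.normal(0, 0.5, size=n)   # true break at x = 37
grad = y.mean() - y                      # gradients of the squared error at F_0
G = grad.sum()

# Exact search: sort the data and try a split after every data point
order = np.argsort(x)
gain_exact = split_gain(np.cumsum(grad[order])[:-1], np.arange(1, n), G, n)
t_exact = x[order][gain_exact.argmax()]

# Histogram search: bucket the data once, then only try the bin boundaries
n_bins = 32
bin_edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])   # 31 inner edges
bin_idx = np.searchsorted(bin_edges, x, side='left')              # bin of each sample
grad_hist = np.bincount(bin_idx, weights=grad, minlength=n_bins)  # gradient sum per bin
count_hist = np.bincount(bin_idx, minlength=n_bins)               # sample count per bin
gain_hist = split_gain(np.cumsum(grad_hist)[:-1], np.cumsum(count_hist)[:-1], G, n)
t_hist = bin_edges[gain_hist.argmax()]

print(f"exact:     {n - 1} candidate splits, best threshold {t_exact:.2f}")
print(f"histogram: {n_bins - 1} candidate splits, best threshold {t_hist:.2f}")
```

The histogram search evaluates far fewer candidates and lands on the bin edge closest to the true break, which is the accuracy/speed trade-off described in the first point above.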

Find out more at:
* https://xgboost.readthedocs.io/en/latest/python/python_api.html
* https://towardsdatascience.com/https-medium-com-vishalmorde-xgboost-algorithm-long-she-may-rein-edd9f99be63d

In [None]:
import xgboost as xgboost

xgb = xgboost.XGBRegressor(n_estimators=500,
                           max_depth=3,
                           learning_rate=0.1,
                           min_child_weight=2,
                           grow_policy='lossguide',
                           random_state=2020)

In [None]:
#Training
xgb.fit(X_train, y_train.values.ravel())

#Inference
y_pred = xgb.predict(X_val)

#Evaluate
np.sqrt(mean_squared_error(y_val,y_pred))

<a id='sec2.3'></a>
### [2.3 LightGBM](#sec2.3)
LightGBM is another gradient boosting framework that uses tree-based learning algorithms, created by Microsoft.
Features of LightGBM include:
1. High speed
2. Able to handle large datasets
3. Takes less memory to run

LightGBM grows trees vertically, as opposed to other tree algorithms, which grow trees horizontally. This means that LightGBM grows trees leaf-wise, choosing the leaf with the maximum delta loss to grow next, while other algorithms grow level-wise. For the same number of leaves, a leaf-wise algorithm can reduce more loss than a level-wise algorithm. Leaf-wise algorithms also tend to excel on larger datasets.
<img src='https://miro.medium.com/max/700/1*AZsSoXb8lc5N6mnhqX5JCg.png'>
<img src='https://miro.medium.com/max/700/1*whSa8rY4sgFQj1rEcWr8Ag.png'>

Level-wise training can be seen as a form of regularized training, since leaf-wise training can construct any tree that level-wise training can, but not vice versa. Hence, leaf-wise training is more flexible but also more prone to overfitting, which is why a larger dataset is beneficial.

#### Gradient-based One-Side Sampling (GOSS)
Not all data points contribute equally to training: data points with small gradients tend to be already well-trained (close to a local minimum). Hence, it is more efficient to concentrate on data points with larger gradients. A naive approach would be to ignore the data points with small gradients when computing the best split, but this leads to biased sampling and alters the data distribution. LightGBM solves this problem by keeping the data points with large gradients, randomly sampling from the data points with small gradients, and increasing the weights of these sampled points when computing their contribution to the change in loss (a form of importance sampling).
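
The sampling scheme can be sketched in a few lines. This follows the description in the LightGBM paper (keep the top fraction by absolute gradient, sample a fraction of the rest, and scale the sampled points by (1 - top fraction) / sampled fraction), applied to synthetic gradients. `goss_sample` is a hypothetical helper; its arguments `top_rate` and `other_rate` are only named after the corresponding LightGBM parameters.

```python
import numpy as np

def goss_sample(grad, top_rate=0.2, other_rate=0.1, rng=None):
    """Return (indices, weights) of the samples kept by a GOSS-style selection."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(grad)
    n_top, n_other = int(top_rate * n), int(other_rate * n)
    order = np.argsort(-np.abs(grad))                  # largest |gradient| first
    top_idx = order[:n_top]                            # keep every large-gradient sample
    other_idx = rng.choice(order[n_top:], size=n_other, replace=False)  # sample the rest
    weights = np.ones(n_top + n_other)
    weights[n_top:] = (1.0 - top_rate) / other_rate    # up-weight the sampled small gradients
    return np.concatenate([top_idx, other_idx]), weights

rng = np.random.default_rng(0)
grad = rng.normal(0, 1, size=10_000)                   # synthetic gradients
idx, w = goss_sample(grad, rng=rng)

full = np.abs(grad).sum()
estimate = (w * np.abs(grad[idx])).sum()
print(f"samples used: {len(idx)} of {len(grad)}")
print(f"sum of |gradients| on the full data: {full:.1f}")
print(f"weighted sum on the GOSS sample:     {estimate:.1f}")
```

Only 30% of the rows are used, yet the weighted statistic stays close to the full-data value, because the up-weighting compensates for the small-gradient rows that were left out.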


#### Exclusive Feature Bundling (EFB)
In order to speed up tree learning, LightGBM uses EFB, which bundles features together. In high-dimensional data, many features are mutually exclusive (they never take nonzero values simultaneously). LightGBM identifies these features and bundles them into a single feature, which reduces training complexity. There are 2 parts to this algorithm:
1. Identifying features that can be bundled together
2. Merging the features within the same bundle
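
Both parts can be illustrated on two tiny made-up features. This is a simplified sketch of the idea (shift one feature past the value range of the other), not LightGBM's implementation, which works on binned values and tolerates a small number of conflicts.

```python
import numpy as np

# Two sparse features that are never nonzero in the same row (mutually exclusive)
feature_a = np.array([0, 3, 0, 1, 0, 0, 2])   # values in 0..3
feature_b = np.array([5, 0, 0, 0, 2, 0, 0])   # values in 0..5

# Part 1: check that the features can be bundled (no row where both are nonzero)
conflicts = np.count_nonzero((feature_a != 0) & (feature_b != 0))
print("conflicting rows:", conflicts)

# Part 2: merge them by shifting feature_b past the value range of feature_a
offset = feature_a.max()
bundle = np.where(feature_b != 0, feature_b + offset, feature_a)
print("bundled feature: ", bundle)

# The original features can still be told apart inside the bundle
recovered_a = np.where(bundle <= offset, bundle, 0)
recovered_b = np.where(bundle > offset, bundle - offset, 0)
```

Because the two value ranges do not overlap inside the bundle, a split on the bundled feature can still separate the values of either original feature, while the tree learner scans one column instead of two.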

In [None]:
import lightgbm as lightgbm

lgb = lightgbm.LGBMRegressor(boosting_type='gbdt',
                             max_depth=3,
                             learning_rate=0.1,
                             n_estimators=500,
                             min_child_samples=2,
                             reg_alpha=0.0,
                             reg_lambda=0.1,
                             random_state=2020
                             )

In [None]:
#Training
lgb.fit(X_train, y_train.values.ravel())

#Inference
y_pred = lgb.predict(X_val)

#Evaluate
np.sqrt(mean_squared_error(y_val,y_pred))

Find out more on LightGBM:
* https://medium.com/@pushkarmandot/https-medium-com-pushkarmandot-what-is-lightgbm-how-to-implement-it-how-to-fine-tune-the-parameters-60347819b7fc
* https://docs.microsoft.com/en-gb/archive/blogs/machinelearning/lessons-learned-benchmarking-fast-machine-learning-algorithms
* https://mlexplained.com/2018/01/05/lightgbm-and-xgboost-explained/
* https://towardsdatascience.com/what-makes-lightgbm-lightning-fast-a27cf0d9785e

<a id='sec2.4'></a>
### [2.4 CatBoost](#sec2.4)
CatBoost is a tree-based boosting algorithm developed by Yandex. Unlike the tree-based boosting algorithms covered so far, CatBoost builds **Symmetric Trees**, which helps to decrease prediction time.
<img src='https://miro.medium.com/max/1400/1*AjrRnwvBuu-zK8CvEfM29w.png' width=600 height=500>


#### Ordered Boosting
The training procedure of CatBoost differs from other boosting algorithms. Instead of calculating the residual of each data point using a model that has been trained on that same data point, CatBoost calculates residuals using a model that has never seen the data point before. This requires training different models to calculate the residuals of different data points. Since training many different models is computationally expensive, CatBoost trains only about $\log(n)$ models, where $n$ is the number of data points.

#### Random Permutation
CatBoost further reduces overfitting by drawing random permutations of the dataset and applying ordered boosting to those permutations. The randomness in how the samples are weighted can be tuned through the parameter *bagging_temperature*, which controls the Bayesian bootstrap.

#### Handling Categorical Features
CatBoost handles categorical data by applying the concept of ordered boosting to **response coding**. In typical response coding, each category is represented by the mean of the target values of all the data points that share that category. This leads to the problem of **Target Leakage**, because the encoding of a data point depends on its own target. CatBoost, on the other hand, considers only the data points that come before the current one (in a random permutation) when calculating the mean of the target values for that category.
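
The ordered version of response coding can be sketched as follows. This is a minimal illustration on made-up data, assuming a smoothed mean with a prior as described in the CatBoost paper; it is not CatBoost's internal code. `ordered_target_encoding` and `prior_weight` are hypothetical names.

```python
import numpy as np

def ordered_target_encoding(categories, targets, prior, prior_weight=1.0, rng=None):
    """Encode each row using only the targets of rows that precede it in a random permutation."""
    rng = np.random.default_rng() if rng is None else rng
    order = rng.permutation(len(categories))
    sums, counts = {}, {}
    encoded = np.empty(len(categories), dtype=float)
    for i in order:
        c = categories[i]
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        encoded[i] = (s + prior_weight * prior) / (n + prior_weight)   # earlier rows only
        sums[c], counts[c] = s + targets[i], n + 1                     # update after encoding
    return encoded

categories = np.array(["A", "B", "A", "A", "B", "C"])
targets = np.array([10.0, 20.0, 12.0, 14.0, 22.0, 30.0])

encoded = ordered_target_encoding(categories, targets, prior=targets.mean(),
                                  rng=np.random.default_rng(0))
for c, t, e in zip(categories, targets, encoded):
    print(f"category {c}, target {t:5.1f} -> encoded {e:6.2f}")
```

The row's own target never enters its own encoding, which is what prevents the leakage. A category seen for the first time (such as the single "C" row) simply receives the prior.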

#### Categorical Feature Combinations
CatBoost also combines multiple categorical features automatically wherever it makes sense. It builds a base tree whose root node consists of only a single feature. For the child nodes, it then selects the next best feature and represents it in combination with the root node feature. Deeper down the tree, the number of categorical feature combinations increases.
<img src='https://miro.medium.com/max/1400/0*0SmuD3DHT013X2vd' width=400 height=300>

In [None]:
from catboost import CatBoostRegressor

cb = CatBoostRegressor(max_depth=3,
                       learning_rate=0.1,
                       n_estimators=500,
                       loss_function='RMSE',
                       boosting_type='Ordered',
                       min_child_samples=2,
                       l2_leaf_reg=0.1,
                       random_state=2020,
                       logging_level='Silent')

In [None]:
#Training
cb.fit(X_train, y_train.values.ravel())

#Inference
y_pred = cb.predict(X_val)

#Evaluate
np.sqrt(mean_squared_error(y_val,y_pred))

Find out more about Catboost here:
* https://medium.com/@hanishsidhu/whats-so-special-about-catboost-335d64d754ae#:~:text=CatBoost%20is%20based%20on%20gradient,needed%20for%20most%20business%20problems.
* https://towardsdatascience.com/introduction-to-gradient-boosting-on-decision-trees-with-catboost-d511a9ccbd14

<a id='sec3'></a>
## [3. Hyperparameter Tuning](#sec3)
Before we start on the technicalities of hyperparameter tuning, let's first define what we mean by *hyperparameter*. Hyperparameters are parameters whose values are set before the learning process begins. Hyperparameter tuning refers to choosing an optimal set of hyperparameters for a learning algorithm. Machine learning models do not learn hyperparameters; they learn parameters, such as the weights of the model.

<a id='sec3.1'></a>
### [3.1 Hyperparameters for Gradient Boost](#sec3.1)

Parameters of ensemble methods can be broken down into 3 categories:
1. Tree-Specific Parameters: These affect each individual tree in the model.
2. Boosting Parameters: These affect the boosting operation in the model.
3. Miscellaneous Parameters: Other parameters for overall functioning.

#### Tree Specific Parameters
![tree_param](https://www.analyticsvidhya.com/wp-content/uploads/2016/02/tree-infographic.png)

1. **min_samples_split**
 *   Defines the minimum number of samples required in a node before it can be considered for splitting
 *   High values prevent the model from learning relationships that are too specific to a particular sample, reducing overfitting
2. **min_samples_leaf**
 *   Minimum number of samples required in a leaf node
 *   Similar to min_samples_split; prevents overfitting
3. **max_depth**
 *   Maximum depth of the tree
 *   A higher depth increases the risk of overfitting
4. **max_leaf_nodes**
 *   Maximum number of leaf nodes in a tree
 *   Similar to max_depth
 *   max_leaf_nodes=k gives comparable results to max_depth=k-1 but is significantly faster to train at the expense of a slightly higher training error.
5. **max_features**
 *   Number of features to be considered when searching for the best split
 *   Rule of thumb: square root of the total number of features

#### Boosting Parameters

1. **learning_rate**
 *   Determines the impact of each tree on the final prediction
2. **n_estimators**
 *   Number of trees to be built
 *   A higher number of trees may lead to overfitting
3. **subsample**
 *   Fraction of samples to be selected for each tree
 *   Values slightly less than 1 make the model more robust by reducing variance

#### Tuning Strategies
Hyperparameter tuning can be broken down into the following main strategies.
1. Grid Search
2. Randomized Search

<a id='sec3.2'></a>
### [3.2 Grid Search](#sec3.2)
Grid Search is an approach where we prepare a set of candidate values for each hyperparameter and exhaustively train the model with every combination of these values. Let's see how to implement Grid Search in Python.

In [None]:
from sklearn.model_selection import GridSearchCV

learning_rate = [0.0001, 0.001, 0.01, 0.1]
n_estimators = [50, 100, 250, 500]
max_depth = [3,5,10]

param_grid = dict(learning_rate = learning_rate, #Dictionary with parameters names (str) as keys and lists of parameter settings to try as values
             n_estimators = n_estimators,
             max_depth = max_depth)

grid = GridSearchCV(estimator=gbr,
                    param_grid=param_grid,
                    scoring="neg_root_mean_squared_error",
                    verbose=1,
                    n_jobs=-1)

grid_gbr = grid.fit(X_train, y_train.values.ravel())   #pass a 1-D target, as in the training cells above
print('Best Score: ', grid_gbr.best_score_)
print('Best Params: ', grid_gbr.best_params_)
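
To get a feel for the cost of the search above, we can count the combinations in the grid. The sketch assumes 5-fold cross-validation, which is the default of `GridSearchCV` in recent scikit-learn versions; `cv_folds` is a name introduced only for this example.

```python
from itertools import product

# Same candidate values as in the grid above
learning_rate = [0.0001, 0.001, 0.01, 0.1]
n_estimators = [50, 100, 250, 500]
max_depth = [3, 5, 10]
cv_folds = 5   # assumed number of cross-validation folds

combinations = list(product(learning_rate, n_estimators, max_depth))
total_fits = len(combinations) * cv_folds
print("combinations:", len(combinations))
print("model fits:  ", total_fits)
```

Every additional hyperparameter multiplies the number of combinations, which is why the cost of grid search grows so quickly.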

#### Pros
- Covers every combination of the candidate values in the grid  

#### Cons
- Time-consuming and computationally expensive, as it searches every combination of values
- It may neglect important parameters and spend time optimizing unimportant parameters instead

<a id='sec3.3'></a>
### [3.3 Randomized Search](#sec3.3)
Randomized search differs from grid search in that it samples combinations of candidate hyperparameter values randomly instead of carrying out an exhaustive search. In sklearn, we set the *n_iter* parameter, which indicates the number of combinations to try.

In [None]:
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {'learning_rate' : [0.0001, 0.001, 0.01, 0.1],
                       'n_estimators' : [50, 100, 250, 500, 700, 900],
                       'max_depth' : list(range(1, 10)),     #max_depth must be at least 1
                       'min_samples_split': [2, 5, 10, 20, 40],
                       'max_features' : [None, "sqrt", "log2"]}     #None uses all features ("auto" in older scikit-learn versions)

rdm_grid = RandomizedSearchCV(estimator=gbr,
                             param_distributions = param_distributions,
                             n_iter = 100,
                             n_jobs = -1,
                             scoring="neg_root_mean_squared_error",
                             random_state=2020)     #fix the seed so the sampled combinations are reproducible

rdm_gbr = rdm_grid.fit(X_train, y_train.values.ravel())   #pass a 1-D target
print('Best Score: ', rdm_gbr.best_score_)
print('Best Params: ', rdm_gbr.best_params_)

#### Pros
- Decreased processing time, since we control the number of combinations tried
- May be more effective than Grid Search

#### Cons
- Depending on the number of searches and how large the parameter space is, some parameters might not be explored enough

Surprisingly, the research of [Bergstra et al., 2012](http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf) shows that randomized search can actually perform better than grid search for the same budget. From the figure, we can see that because of its structured search space, grid search tests only 3 distinct values of each hyperparameter. Randomized search, on the other hand, tests many more distinct values of each hyperparameter because of its randomness, which matters when only a few hyperparameters really affect performance.

<img src='https://miro.medium.com/max/1400/0*gaxqCRZa22tunZBJ.png' width=500 height=400>
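
The effect shown in the figure can be reproduced with a few lines. This is a toy sketch, assuming two hypothetical hyperparameters that both live on the interval from 0 to 1 and a budget of 9 trials for each strategy.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 9

# Grid search: 3 candidate values per hyperparameter -> 3 x 3 = 9 trials
grid_axis = np.linspace(0.0, 1.0, 3)
grid_trials = np.array([(a, b) for a in grid_axis for b in grid_axis])

# Random search: 9 trials drawn uniformly from the same unit square
random_trials = rng.uniform(0.0, 1.0, size=(n_trials, 2))

distinct_counts = {}
for name, trials in [("grid", grid_trials), ("random", random_trials)]:
    distinct_counts[name] = [len(np.unique(trials[:, k])) for k in range(2)]
    print(f"{name:>6}: {len(trials)} trials, "
          f"distinct values per hyperparameter = {distinct_counts[name]}")
```

With the same number of trials, random search probes every hyperparameter at 9 different values instead of 3. If only one of the two hyperparameters matters, that is three times as many useful evaluations.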



That's all folks! In this notebook, we have covered:
1. Boosting Algorithms
2. Hyperparameter tuning  



I hope you have learnt something from this notebook that will help in your future data science competitions/projects. 
Do give an **Upvote** if it has helped you :) Thank you!