# Boosting
## This notebook outlines the main concepts behind the Boosting technique used for ensembling Machine Learning models

Consider a scenario where you have to look at a picture and classify it as either a dog or a cat.

The usual approach is as follows:
- Devise some rules (features) that indicate whether the picture shows a dog or a cat

Example:
- Image has pointy ears: Cat
- Image has cat shaped eyes: Cat
- Image has bigger limbs: Dog
- Image has a wide mouth structure: Dog
- and so on

**Weak learner** - If you use only one of the above rules to reach a decision, you get a weak learner: a classifier that makes many inaccurate predictions on its own

**Combine** these weak learners (each of which is good at detecting its own rule) using a **weighted average** or **majority voting**
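As a rough illustration of weighted majority voting, the sketch below combines the outputs of four rules for a single image. The votes and the reliability weights are made-up values chosen for the example, and `weighted_vote` is a hypothetical helper, not part of any library.

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """Return the label whose supporters carry the largest total weight."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)

# Hypothetical outputs of four rules for one image, with made-up weights
rule_votes = ["cat", "cat", "dog", "dog"]
rule_weights = [0.9, 0.6, 0.4, 0.3]

label = weighted_vote(rule_votes, rule_weights)
print(label)  # cat (total weight 1.5 against 0.7)
```

With equal weights this reduces to plain majority voting.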

**Boosting** - an ensemble learning technique that combines multiple weak learners into a strong learner by training them sequentially, with each iteration focusing on the samples that the previous iterations misclassified

### Boosting Theory

The basic principle behind boosting is to generate **multiple weak learners** and **combine** their predictions to form one strong rule. The weak rules are generated by applying a base Machine Learning algorithm to different distributions of the data set, producing a new weak rule in each iteration. After multiple iterations, the weak learners are combined to form a strong learner that predicts more accurately.

### Algorithm

1. A subset is created from the original dataset.
2. Initially, all data points are given equal weights.
3. A base model is created on this subset.
4. This model is used to make predictions on the whole dataset.
5. Errors are calculated using the actual values and the predicted values.
6. Observations that are predicted incorrectly are given higher weights.
7. Another model is created and predictions are made on the dataset. (This model tries to correct the errors of the previous model.)
8. Similarly, multiple models are created, each correcting the errors of the previous model.
9. The final model (strong learner) is the weighted mean of all the models (weak learners).

Thus, the boosting algorithm combines a number of weak learners to form a strong learner. The individual models would not perform well on the entire dataset, but they work well for some part of the dataset. Thus, each model actually **boosts** the performance of the ensemble, hence the name **Boosting**.
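The steps above can be sketched from scratch on a tiny one-dimensional toy problem. This is a minimal sketch, not the implementation used by any library: it assumes labels in {-1, +1}, uses threshold "stumps" as weak learners, trains every stump on the full weighted dataset instead of a subset, and uses the classic AdaBoost update for the learner weight `alpha`. The helper names (`fit_stump`, `boost`, `predict`) and the toy data are illustrative.

```python
import numpy as np

def stump_predict(x, thr, polarity):
    """Predict `polarity` where x >= thr and `-polarity` elsewhere."""
    return np.where(x >= thr, polarity, -polarity)

def fit_stump(x, y, w):
    """Return (weighted error, threshold, polarity) of the best stump."""
    best = None
    for thr in np.unique(x):
        for polarity in (1, -1):
            err = np.sum(w[stump_predict(x, thr, polarity) != y])
            if best is None or err < best[0]:
                best = (err, thr, polarity)
    return best

def boost(x, y, n_rounds):
    w = np.full(len(x), 1.0 / len(x))          # equal initial weights
    ensemble = []
    for _ in range(n_rounds):
        err, thr, pol = fit_stump(x, y, w)     # fit a weak learner
        err = np.clip(err, 1e-10, 1 - 1e-10)   # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)  # weight of this learner
        pred = stump_predict(x, thr, pol)
        w = w * np.exp(-alpha * y * pred)      # raise weights of mistakes
        w = w / w.sum()
        ensemble.append((alpha, thr, pol))
    return ensemble

def predict(ensemble, x):
    """Sign of the alpha-weighted sum of the weak learners' votes."""
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return np.where(score >= 0, 1, -1)

# Toy data: no single threshold separates the two classes
x = np.arange(10.0)
y = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1, -1])

best_single_error = fit_stump(x, y, np.full(10, 0.1))[0]
ensemble = boost(x, y, n_rounds=3)
print(round(best_single_error, 2))           # 0.3
print((predict(ensemble, x) == y).mean())    # 1.0
```

On this toy data the best single stump misclassifies 3 of the 10 points, while three boosted stumps classify all of them correctly.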


![Boosting](https://raw.githubusercontent.com/subashgandyer/datasets/main/images/Boosting.png)

### Types of Boosting
- AdaBoost
- XGBoost
- LightGBM
- CatBoost

### Load the dataset

In [1]:
import pandas as pd
import numpy as np
data = pd.read_csv("https://raw.githubusercontent.com/subashgandyer/datasets/main/loan_prediction.csv")
data

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,Property_Area,Loan_Status,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History
0,1,0,0,0,0,2,1,5849.0,0.0,146.412162,360.0,1.0
1,1,1,1,0,0,0,0,4583.0,1508.0,128.000000,360.0,1.0
2,1,1,0,0,1,2,1,3000.0,0.0,66.000000,360.0,1.0
3,1,1,0,1,0,2,1,2583.0,2358.0,120.000000,360.0,1.0
4,1,0,0,0,0,2,1,6000.0,0.0,141.000000,360.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...
609,0,0,0,0,0,0,1,2900.0,0.0,71.000000,360.0,1.0
610,1,1,3,0,0,0,1,4106.0,0.0,40.000000,180.0,1.0
611,1,1,1,0,0,2,1,8072.0,240.0,253.000000,360.0,1.0
612,1,1,2,0,0,2,1,7583.0,0.0,187.000000,360.0,1.0


### Split into X and y

In [3]:
X = data.drop('Loan_Status', axis=1)
X

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,Property_Area,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History
0,1,0,0,0,0,2,5849.0,0.0,146.412162,360.0,1.0
1,1,1,1,0,0,0,4583.0,1508.0,128.000000,360.0,1.0
2,1,1,0,0,1,2,3000.0,0.0,66.000000,360.0,1.0
3,1,1,0,1,0,2,2583.0,2358.0,120.000000,360.0,1.0
4,1,0,0,0,0,2,6000.0,0.0,141.000000,360.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...
609,0,0,0,0,0,0,2900.0,0.0,71.000000,360.0,1.0
610,1,1,3,0,0,0,4106.0,0.0,40.000000,180.0,1.0
611,1,1,1,0,0,2,8072.0,240.0,253.000000,360.0,1.0
612,1,1,2,0,0,2,7583.0,0.0,187.000000,360.0,1.0


In [4]:
y = data['Loan_Status']
y

0      1
1      0
2      1
3      1
4      1
      ..
609    1
610    1
611    1
612    1
613    0
Name: Loan_Status, Length: 614, dtype: int64

In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size=0.2, 
                                                    random_state=1
)
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((491, 11), (491,), (123, 11), (123,))

## AdaBoost

### Algorithm

- Initially, all observations in the dataset are given equal weights.
- A model is built on a subset of data.
- Using this model, predictions are made on the whole dataset.
- Errors are calculated by comparing the predictions and actual values.
- While creating the next model, higher weights are given to the data points which were predicted incorrectly.
- Weights can be determined using the error value: the higher the error, the larger the weight assigned to the observation.
- This process is repeated until the error function stops changing, or the maximum number of estimators is reached.
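To see how the ensemble evolves as estimators are added, the sketch below trains scikit-learn's `AdaBoostClassifier` on synthetic data and reads the accuracy after each boosting round with `staged_score`. The synthetic dataset from `make_classification` and the `demo_` variable names are assumptions made for illustration; they are not the loan data used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the loan dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

# n_estimators caps the number of boosting rounds
demo_model = AdaBoostClassifier(n_estimators=25, random_state=1)
demo_model.fit(X_tr, y_tr)

# Test accuracy after 1, 2, ..., k boosting rounds
staged = list(demo_model.staged_score(X_te, y_te))
print(len(staged), staged[-1])

# Weighted training error of the first few weak learners
print(demo_model.estimator_errors_[:5])
```

Plotting `staged` against the round number is one way to judge whether adding more estimators still helps.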

### Import AdaBoostClassifier

In [10]:
from sklearn.ensemble import AdaBoostClassifier

### Build the model

In [11]:
model = AdaBoostClassifier(random_state=1)

### Fit the model

In [12]:
model.fit(X_train, y_train)

AdaBoostClassifier(random_state=1)

### Predict on the test data

In [13]:
model.score(X_test,y_test)

0.8130081300813008

## XGBoost

XGBoost (eXtreme Gradient Boosting) is an advanced implementation of the gradient boosting algorithm. It has proved to be a highly effective ML algorithm and is used extensively in machine learning competitions and hackathons. XGBoost has high predictive power and is often reported to be almost **10 times faster** than other gradient boosting implementations, although the actual speed-up depends on the data and the hardware. It also includes several forms of regularization, which reduce overfitting and improve overall performance. Hence it is also known as a **regularized boosting** technique.

#### Features

- **Regularization**:

Unlike XGBoost, the standard GBM implementation has no built-in regularisation.
The regularisation in XGBoost therefore helps to reduce overfitting.

- **Parallel Processing**:

XGBoost implements parallel processing and is faster than GBM.
XGBoost also supports implementation on Hadoop.

- **High Flexibility**:

XGBoost allows users to define custom optimization objectives and evaluation criteria adding a whole new dimension to the model.

- **Handling Missing Values**:

XGBoost has an in-built routine to handle missing values.

- **Tree Pruning**:

XGBoost makes splits up to the max_depth specified and then starts pruning the tree backwards and removes splits beyond which there is no positive gain.

- **Built-in Cross-Validation**:

XGBoost allows a user to run cross-validation at each iteration of the boosting process, which makes it easy to find a good number of boosting iterations in a single run.
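The Regularization and Tree Pruning features above can be made concrete with the formulas from XGBoost's regularized objective. The sketch below is a simplified illustration that assumes only an L2 penalty (`reg_lambda`) and a per-leaf penalty (`gamma`); the gradient and hessian sums are made-up numbers, and `leaf_weight` and `split_gain` are hypothetical helpers, not XGBoost API calls.

```python
def leaf_weight(grad_sum, hess_sum, reg_lambda=1.0):
    """Optimal leaf value under an L2 penalty: w* = -G / (H + lambda)."""
    return -grad_sum / (hess_sum + reg_lambda)

def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction from a split, minus the per-leaf penalty gamma."""
    def score(g, h):
        return g * g / (h + reg_lambda)
    children = score(g_left, h_left) + score(g_right, h_right)
    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (children - parent) - gamma

# A larger reg_lambda shrinks the leaf value towards zero (made-up G and H)
for lam in (0.0, 1.0, 10.0):
    print(lam, round(leaf_weight(-6.0, 4.0, lam), 4))

# The same split is kept with gamma=0 and pruned with gamma=3
gain_kept = split_gain(-4.0, 3.0, 2.0, 3.0, gamma=0.0)
gain_pruned = split_gain(-4.0, 3.0, 2.0, 3.0, gamma=3.0)
print(gain_kept > 0, gain_pruned > 0)  # True False
```

The `reg_lambda` and `gamma` names match the parameters shown in the fitted model's printout further down, where they appear as `reg_lambda=1` and `gamma=0`.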

### Import XGBoost Classifier

In [16]:
from xgboost import XGBClassifier

### Build the model

In [17]:
model = XGBClassifier(random_state=1,learning_rate=0.01)

### Fit the model

In [18]:
model.fit(X_train, y_train)



XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.01, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=4, num_parallel_tree=1, random_state=1,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

### Predict on the test data

In [19]:
model.score(X_test,y_test)

0.7804878048780488

## LightGBM

LightGBM tends to outperform the other algorithms when the dataset is **extremely large**. Compared to the other algorithms, LightGBM takes **less time** to run on a huge dataset.

LightGBM is a gradient boosting framework that uses tree-based algorithms and follows a **leaf-wise** approach, while other algorithms grow trees level by level.

![LightGBM](https://raw.githubusercontent.com/subashgandyer/datasets/main/images/LightGBM_leafs.png)
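The difference between the two growth strategies can be sketched with a toy tree in which the gain of every possible split is known in advance. The gains below are made-up numbers and the two functions are simplified illustrations, not LightGBM code: level-wise growth splits nodes in breadth-first order, while leaf-wise growth always splits the leaf with the largest gain.

```python
import heapq

# Hypothetical gains for splitting each node of a binary tree;
# node i has children 2*i + 1 and 2*i + 2 (made-up numbers).
GAINS = {0: 10.0, 1: 8.0, 2: 1.0, 3: 6.0, 4: 0.5, 5: 0.4, 6: 0.3}

def level_wise(max_splits):
    """Split nodes in breadth-first (level by level) order."""
    order, queue = [], [0]
    while queue and len(order) < max_splits:
        node = queue.pop(0)
        if node in GAINS:
            order.append(node)
            queue += [2 * node + 1, 2 * node + 2]
    return order

def leaf_wise(max_splits):
    """Always split the current leaf with the largest gain."""
    order, heap = [], [(-GAINS[0], 0)]
    while heap and len(order) < max_splits:
        _, node = heapq.heappop(heap)
        order.append(node)
        for child in (2 * node + 1, 2 * node + 2):
            if child in GAINS:
                heapq.heappush(heap, (-GAINS[child], child))
    return order

print(level_wise(3))  # [0, 1, 2]
print(leaf_wise(3))   # [0, 1, 3]
print(sum(GAINS[n] for n in level_wise(3)), sum(GAINS[n] for n in leaf_wise(3)))  # 19.0 24.0
```

With the same budget of three splits, the leaf-wise order collects more total gain here, but it also produces a deeper, less balanced tree.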

## Homework
### Import LightGBM

In [27]:
import lightgbm as lgb

### Create dataset

In [28]:
train_data=lgb.Dataset(X_train,label=y_train)

### Create parameters

In [29]:
params = {'learning_rate':0.001}

### Train the model

In [30]:
model = lgb.train(params, train_data, num_boost_round=100)  # 100 boosting rounds

You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 376
[LightGBM] [Info] Number of data points in the train set: 491, number of used features: 11
[LightGBM] [Info] Start training from score 0.688391
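The log shows that training starts from a score of 0.688391. With a learning rate of 0.001 and 100 rounds, the predictions cannot move far from that starting point. The arithmetic below is a rough bound under stated assumptions: no `objective` is set in `params`, so LightGBM's default squared-error regression objective is assumed, the labels are 0 or 1, and every leaf value is a mean residual of magnitude at most 1 before it is multiplied by the learning rate.

```python
learning_rate = 0.001    # from params
n_rounds = 100           # number of boosting rounds passed to lgb.train
start_score = 0.688391   # "Start training from score" in the log

# Assumption: each round shifts a prediction by at most
# learning_rate * 1.0, so the total shift is bounded by their product.
max_shift = learning_rate * n_rounds * 1.0

lowest = start_score - max_shift
highest = start_score + max_shift
print(round(lowest, 6), round(highest, 6))  # 0.588391 0.788391
```

Under these assumptions every prediction stays above 0.5, so a 0.5 threshold would assign class 1 to every row. Setting a binary objective or a larger learning rate are possible changes to try.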


### Predict on test data

In [31]:
predictions = model.predict(X_test)

In [32]:
predictions

array([0.68604029, 0.71807129, 0.67477524, 0.68604029, 0.70880289,
       0.63607412, 0.70853787, 0.67792374, 0.70606748, 0.62549547,
       0.6845242 , 0.70853787, 0.67792374, 0.69337514, 0.69133065,
       0.71807129, 0.68144025, 0.71813688, 0.68144025, 0.70883309,
       0.69206793, 0.70377748, 0.68640465, 0.70377748, 0.70880289,
       0.67871972, 0.69860557, 0.63607412, 0.71821164, 0.68620138,
       0.69538232, 0.67871972, 0.67777841, 0.62549547, 0.70463581,
       0.71394297, 0.7072053 , 0.69876666, 0.68144025, 0.67792374,
       0.70463581, 0.68144025, 0.68144025, 0.67777841, 0.68144025,
       0.67871972, 0.70293662, 0.62549547, 0.70377748, 0.69895806,
       0.68640465, 0.67871972, 0.7072053 , 0.70853787, 0.70880289,
       0.63607412, 0.69895806, 0.62549547, 0.70363323, 0.68604029,
       0.70906255, 0.7072053 , 0.71813688, 0.63607412, 0.62549547,
       0.68923875, 0.70984148, 0.70463581, 0.69538232, 0.70853787,
       0.70377748, 0.70377748, 0.68144025, 0.63607412, 0.69337

In [34]:
# Convert the predicted scores to class labels using a 0.5 threshold
predictions = np.where(predictions >= 0.5, 1.0, 0.0)

predictions

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1.])

### Check accuracy

In [35]:
y_test.value_counts()

1    84
0    39
Name: Loan_Status, dtype: int64

In [37]:
predictions_df = pd.DataFrame(predictions)
predictions_df.value_counts()

1.0    123
dtype: int64
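The two counts above are enough to work out the accuracy. The sketch below rebuilds the labels from the printed counts (84 rows of class 1, 39 rows of class 0) and compares them with 123 predictions of class 1; the row order is arbitrary, which does not matter because every prediction is the same. In the notebook itself, `sklearn.metrics.accuracy_score(y_test, predictions)` should give the same number.

```python
# Labels rebuilt from the printed value counts (order is arbitrary)
y_true = [1] * 84 + [0] * 39
y_pred = [1] * 123  # the thresholded model output: class 1 for every row

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
print(round(accuracy, 4))  # 0.6829
```

This equals the share of the majority class in the test set, so the model performs no better than always predicting class 1.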

## Bonus: Gradient Boosting

### Import the GradientBoostingClassifier

In [38]:
from sklearn.ensemble import GradientBoostingClassifier

### Build the model

In [39]:
model= GradientBoostingClassifier(learning_rate=0.01,random_state=1)

### Train the model

In [40]:
model.fit(X_train, y_train)

GradientBoostingClassifier(learning_rate=0.01, random_state=1)

### Predict on the test data

In [41]:
model.score(X_test,y_test)

0.7967479674796748

## Bonus: CatBoost

Handling categorical variables is a tedious process, especially when you have a large number of them. When categorical variables have many distinct labels (i.e. they have high cardinality), one-hot encoding adds one column per label, which greatly increases the dimensionality and makes the dataset difficult to work with.

CatBoost can **automatically deal with categorical variables** and, unlike many other machine learning algorithms, does not require extensive data preprocessing.
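One idea behind CatBoost's handling of categorical variables is to replace each category with a statistic of the target computed only from rows seen earlier, which keeps a single numeric column no matter how many labels exist. The sketch below is a simplified illustration of such an ordered target statistic, not CatBoost's actual implementation: it uses the given row order instead of random permutations, and the `prior` value, the helper name and the toy data are assumptions.

```python
def ordered_target_stats(categories, targets, prior=0.5, prior_weight=1.0):
    """Encode each row using only the targets of earlier rows in its category."""
    sums, counts, encoded = {}, {}, []
    for cat, target in zip(categories, targets):
        s = sums.get(cat, 0.0)
        n = counts.get(cat, 0)
        # Smoothed mean of the earlier targets for this category
        encoded.append((s + prior * prior_weight) / (n + prior_weight))
        sums[cat] = s + target
        counts[cat] = n + 1
    return encoded

# Toy data (made up for illustration)
areas = ["urban", "rural", "urban", "urban", "rural"]
labels = [1, 0, 1, 0, 1]

encoded = ordered_target_stats(areas, labels)
print([round(v, 4) for v in encoded])  # [0.5, 0.5, 0.75, 0.8333, 0.25]
```

Because each row is encoded without looking at its own target, the encoding is less prone to leaking the label into the feature.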

### Homework: 
### Import the CatBoostClassifier

In [None]:
from catboost import CatBoostClassifier

### Build the model

In [None]:
model=CatBoostClassifier()

### Take care of the categorical features

In [None]:
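# Indices of the non-float columns of X, which are treated as categorical
categorical_features_indices = np.where(X.dtypes != float)[0]

### Train the model

In [None]:
# CatBoost expects categorical columns to hold integer or string values
model.fit(X_train, y_train, cat_features=categorical_features_indices, eval_set=(X_test, y_test))

### Predict on the test data

In [None]:
model.score(X_test, y_test)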