![Gradient Boosting Machines](https://i.ibb.co/C7nDs5b/Gradient-Boosting-Machines.png)

- **ML Part 1** - Logistic Regression
- **ML Part 2** - K-Nearest Neighbors (KNN)
- **ML Part 3** - Support Vector Machine (SVM)
- **ML Part 4** - Artificial Neural Network (NN)
- **ML Part 5** - Classification and Regression Tree (CART)
- **ML Part 6** - Random Forests
- **ML Part 7 - Gradient Boosting Machines (GBM)**
- **ML Part 8** - XGBoost
- **ML Part 9** - LightGBM
- **ML Part 10** - CatBoost

## What is GBM?

Let’s start by understanding boosting. Boosting is a method of converting weak learners into a strong learner. In boosting, each new tree is fit on a modified version of the original data set. The gradient boosting machine (GBM) is most easily explained by first introducing the AdaBoost algorithm.

AdaBoost begins by training a decision tree in which each observation is assigned an equal weight. After evaluating the first tree, we increase the weights of the observations that are difficult to classify and lower the weights of those that are easy to classify. The second tree is therefore grown on this reweighted data, with the aim of improving on the predictions of the first tree. Our new model is therefore Tree 1 + Tree 2. We then compute the classification error of this 2-tree ensemble, update the weights again, and grow a third tree on the newly reweighted data. We repeat this process for a specified number of iterations. Subsequent trees help classify observations that were not well classified by the previous trees. The prediction of the final ensemble is therefore the weighted sum of the predictions made by the individual trees.
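To make the reweighting step concrete, here is a minimal sketch of a single AdaBoost round on synthetic data. The data set, the variable names (`X_ada`, `y_ada`, `alpha`) and the use of a depth-1 tree are assumptions chosen for illustration; the update rule is the classic binary AdaBoost one, not code taken from this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem (illustrative only), labels recoded to -1 / +1
X_ada, y_ada = make_classification(n_samples=200, n_features=5, random_state=0)
y_signed = np.where(y_ada == 1, 1, -1)

# Every observation starts with the same weight
weights = np.full(len(y_signed), 1.0 / len(y_signed))

# Fit a weak learner (a stump) on the weighted data
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
stump.fit(X_ada, y_signed, sample_weight=weights)
pred = stump.predict(X_ada)

# Weighted error of the stump and its vote in the final ensemble
miss = pred != y_signed
err = np.clip(np.sum(weights[miss]), 1e-10, 1 - 1e-10)
alpha = 0.5 * np.log((1 - err) / err)

# Misclassified points are up-weighted, correctly classified ones down-weighted
new_weights = weights * np.exp(-alpha * y_signed * pred)
new_weights /= new_weights.sum()

print("weighted error:", round(float(err), 4))
print("stump vote (alpha):", round(float(alpha), 4))
```

The next stump would be trained with `sample_weight=new_weights`, so it concentrates on the observations the first stump got wrong.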

Gradient boosting trains many models in a gradual, additive and sequential manner. The major difference between AdaBoost and gradient boosting is how the two algorithms identify the shortcomings of weak learners (e.g. decision trees). While AdaBoost identifies the shortcomings through highly weighted data points, gradient boosting does so through the gradients of the loss function. (In a simple model such as y = ax + b + e, the error term e deserves a special mention, because it is what the loss measures.) The loss function is a measure of how well the model's coefficients fit the underlying data. What counts as a sensible loss function depends on what we are trying to optimise. For example, if we are trying to predict house prices with a regression, the loss function would be based on the error between true and predicted prices. Similarly, if our goal is to classify credit defaults, the loss function would measure how good our predictive model is at classifying bad loans. One of the biggest motivations for using gradient boosting is that it allows one to optimise a user-specified cost function, rather than a fixed loss function that usually offers less control and does not necessarily correspond to the real-world application.


## How does GBM work?

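The idea goes back to Leo Breiman's observation that boosting can be interpreted as an optimization algorithm on a suitable cost function; Jerome H. Friedman later developed it into the explicit gradient boosting algorithm.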

Suppose we have a regression problem where we try to predict target values from an input x, as in the plots below.
Gradient boosting builds a function "F" that generates the predictions in the first iteration. It then calculates the difference between these predictions and the target values (the residuals) and fits a function "h" to these differences. In the second iteration, it combines "F" and "h" and again calculates the difference between predictions and targets. In this way, it keeps improving "F" by repeatedly adding to it, driving the difference between predictions and targets towards zero.
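The loop described above can be sketched in a few lines for the squared-error case, where the negative gradient of the loss is simply the residual. The helper names `fit_gbm` and `predict_gbm`, the noisy sine data and the chosen settings are assumptions made for this illustration; library implementations such as scikit-learn's are considerably more elaborate.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbm(X, y, n_iterations=50, learning_rate=0.1, max_depth=2):
    """Toy gradient boosting for squared-error loss (illustrative helper)."""
    f0 = float(np.mean(y))                 # initial constant prediction "F"
    prediction = np.full(len(y), f0)
    trees = []
    for _ in range(n_iterations):
        residuals = y - prediction         # negative gradient of squared error
        h = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        h.fit(X, residuals)                # "h" models what "F" still gets wrong
        prediction = prediction + learning_rate * h.predict(X)
        trees.append(h)
    return f0, trees

def predict_gbm(X, f0, trees, learning_rate=0.1):
    """Sum the initial prediction and the shrunken output of every tree."""
    prediction = np.full(X.shape[0], f0)
    for h in trees:
        prediction += learning_rate * h.predict(X)
    return prediction

# Synthetic 1-D regression data: a noisy sine curve
rng = np.random.default_rng(0)
X_toy = np.linspace(0, 10, 200).reshape(-1, 1)
y_toy = np.sin(X_toy).ravel() + rng.normal(scale=0.1, size=200)

# Training error after 1, 10 and 50 boosting iterations
mse_by_iterations = {}
for n in (1, 10, 50):
    f0, trees = fit_gbm(X_toy, y_toy, n_iterations=n)
    fitted = predict_gbm(X_toy, f0, trees)
    mse_by_iterations[n] = float(np.mean((y_toy - fitted) ** 2))
    print(n, "iterations -> training MSE", round(mse_by_iterations[n], 4))
```

The training error keeps shrinking as iterations are added, which is exactly why a separate validation set is needed to decide when to stop.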

In the image below, you can see the model's prediction in the 1st iteration as a red line on the left graph. I have also shown the difference between the predictions and the target value for each x value in the graph on the right.

![](https://miro.medium.com/max/1125/1*6M_WOL_-ZM9VHNbMTJ1Kqg.png)

The success of the model will increase as the iterations progress. You can see the same graph in the 10th iteration result below.

![](https://miro.medium.com/max/1125/1*MNKP3INB89Sft1oxlRltGQ.png)

In the 25th and 50th iteration results, you can see that the difference between the model's predictions and the targets approaches zero. In fact, the 50th iteration results show that the model is starting to overfit a bit. To prevent overfitting, it is useful to evaluate the results on a separate validation set and choose the number of iterations accordingly.

![](https://miro.medium.com/max/1125/1*fxlV-MohyZm-R5KKcSxgcw.png)

![](https://miro.medium.com/max/1125/1*vhCCMYp7G9xhDQaP2Mv4PA.png)
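One way to pick the number of iterations with a validation set is to score the model after every boosting stage. The sketch below assumes scikit-learn's `GradientBoostingClassifier` and its `staged_predict_proba` method, with synthetic data standing in for a real problem; the split sizes and the 300-tree budget are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic data and a held-out validation split (illustrative only)
X_syn, y_syn = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_syn, y_syn, test_size=0.25, random_state=0)

# Fit once with a generous number of trees
staged_model = GradientBoostingClassifier(n_estimators=300, random_state=0)
staged_model.fit(X_tr, y_tr)

# Validation loss after each boosting iteration (1 tree, 2 trees, ...)
val_loss = [log_loss(y_val, proba) for proba in staged_model.staged_predict_proba(X_val)]

# Iteration count with the lowest validation loss
best_n = int(np.argmin(val_loss)) + 1
print("Best number of iterations on the validation set:", best_n)
```

Plotting `val_loss` against the iteration number gives the typical picture: the curve falls, flattens, and then rises again once the model starts to overfit.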


## Advantages
- Often provides predictive accuracy that is hard to beat.
- Lots of flexibility - can optimize different loss functions and provides several hyperparameter tuning options that make the function fit very flexible.
- Little data pre-processing required - often works well with categorical and numerical values as is (depending on the implementation).
- Handles missing data - many implementations do not require imputation.

## Disadvantages
- GBMs will continue improving to minimize all errors. This can overemphasize outliers and cause overfitting. Cross-validation should be used to counter this.
- Computationally expensive - GBMs often require many trees (>1000), which can be time- and memory-intensive.
- The high flexibility results in many parameters that interact and heavily influence the behavior of the approach (number of iterations, tree depth, regularization parameters, etc.). This requires a large grid search during tuning.
- Less interpretable, although this can be addressed with various tools (variable importance, partial dependence plots, LIME, etc.).


## Hyperparameters
- Number of trees: The total number of trees to fit. GBMs often require many trees; however, unlike random forests, GBMs can overfit, so the goal is to find the number of trees that minimizes the loss function of interest under cross-validation.
- Depth of trees: The number d of splits in each tree, which controls the complexity of the boosted ensemble. Often d = 1 works well, in which case each tree is a stump consisting of a single split. More commonly, d is greater than 1, but it is unlikely that d > 10 will be required.
- Learning rate: Controls how quickly the algorithm proceeds down the gradient. Smaller values reduce the chance of overfitting but also increase the time needed to find the optimal fit. This is also called shrinkage.
- Subsampling: Controls whether or not you use a fraction of the available training observations. Using less than 100% of the training observations for each tree means you are implementing stochastic gradient boosting. This can help to reduce overfitting and to avoid getting stuck in a local minimum or plateau of the loss function.
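As a rough guide, the four hyperparameters above map onto scikit-learn's `GradientBoostingClassifier` arguments as shown in the sketch below. The values and the synthetic data are assumptions for illustration, not tuned recommendations, and `max_depth` limits the depth of each tree rather than counting its splits, so the correspondence to d is only approximate.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for a real problem
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

gbm = GradientBoostingClassifier(
    n_estimators=200,     # number of trees
    max_depth=2,          # depth of each tree
    learning_rate=0.05,   # shrinkage applied to each tree's contribution
    subsample=0.8,        # fraction of rows drawn for each tree
    random_state=0,
)

# 5-fold cross-validated accuracy for this configuration
cv_scores = cross_val_score(gbm, X_demo, y_demo, cv=5)
print("Mean CV accuracy: {:.4f}".format(cv_scores.mean()))
```

A smaller `learning_rate` usually needs a larger `n_estimators` to reach the same fit, which is why the two are normally tuned together.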



## Tuning a GBM Model and Early Stopping
Hyperparameter tuning is especially important for GBM modelling, since these models are prone to overfitting. The process of tuning the number of iterations for an iterative algorithm such as GBM is called “early stopping”. Early stopping monitors the model’s performance on a separate validation data set and stops the training procedure once performance on that data has not improved for a certain number of iterations.

It avoids overfitting by attempting to automatically select the inflection point where performance on the validation data set starts to decrease while performance on the training data set continues to improve as the model starts to overfit. In the context of GBM, early stopping can be based either on an out-of-bag sample set (“OOB”) or on cross-validation (“cv”). As mentioned above, the ideal time to stop training is when the validation error has decreased and stabilised, before it starts increasing due to overfitting.
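In scikit-learn, one way to get this behaviour is through the `n_iter_no_change`, `validation_fraction` and `tol` arguments of `GradientBoostingClassifier`. The sketch below uses synthetic data and arbitrary settings; note that this built-in mechanism holds out a single validation fraction rather than using the OOB or cross-validation schemes mentioned above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data (illustrative only)
X_es, y_es = make_classification(n_samples=500, n_features=10, random_state=0)

early_clf = GradientBoostingClassifier(
    n_estimators=1000,        # upper bound on the number of boosting iterations
    learning_rate=0.1,
    validation_fraction=0.2,  # share of the training data held out internally
    n_iter_no_change=10,      # stop after 10 iterations without improvement
    tol=1e-4,                 # minimum improvement that counts as progress
    random_state=0,
)
early_clf.fit(X_es, y_es)

# n_estimators_ holds the number of trees that were actually fitted
print("Trees requested:", early_clf.n_estimators)
print("Trees fitted   :", early_clf.n_estimators_)
```

If `n_estimators_` comes out well below the requested maximum, training stopped early because the held-out score had stopped improving.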



## Comparison with Other Models

### GBM vs Random Forest
Both are ensemble methods that consist of two steps:
- Producing a collection of simple ML models on subsets of the original data.
- Combining the collection into one "aggregated" model.

Random Forest uses bagging (bootstrap aggregating) for sampling.
- It aims to decrease variance, not bias.
- Its base learners are low-bias, high-variance models (deep trees).
- It is relatively robust to overfitting as more trees are added.
- It uses parallel ensembling.
- For the final prediction, it uses a simple majority vote for classification.

GBT (gradient boosted trees), in contrast, uses boosting.
- It aims to decrease bias, not variance.
- Its base learners are high-bias, low-variance models (shallow trees).
- It can overfit if too many trees are added.
- It uses sequential ensembling.
- For the final prediction, it uses a weighted combination of the trees' outputs for classification.

### GBM vs Neural Net
Gradient Boosting (LightGBM, XGBoost, CatBoost):
- works well on categorical features
- easy to tune
- works well on small datasets like the one we have here
- hard to include information from images
- runs on CPU, hence 6h of training time

Neural Net:
- hard to include categorical features
- works well with images and text
- needs a GPU, hence only 2h of training time
- hard to combine different feature types (categorical, numerical, images)
- not good on small datasets

In [None]:
# Import the necessary packages
import numpy as np
import pandas as pd

import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
%matplotlib inline

import sklearn
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler, LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import f1_score, recall_score, precision_score, confusion_matrix
from sklearn.metrics import r2_score, roc_auc_score, roc_curve, classification_report
from sklearn.ensemble import GradientBoostingClassifier

import warnings
warnings.filterwarnings('ignore')

In [None]:
# Import and read dataset
input_ = "../input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv"
df = pd.read_csv(input_)

df.head(10)

In [None]:
plt.figure(figsize=(15,15))
sns.heatmap(df.corr(),annot=True)
plt.show()

In [None]:
df.describe()

In [None]:
x = df.drop(columns='DEATH_EVENT')
y = df['DEATH_EVENT']

model = GradientBoostingClassifier()
model.fit(x,y)
print(model.feature_importances_)
feat_importances = pd.Series(model.feature_importances_, index=x.columns)
feat_importances.nlargest(12).plot(kind='barh')
plt.show()

When we examine the graph above, we can expect that the time, serum_creatinine and ejection_fraction features will contribute most to accuracy during training.

In [None]:
# Remove outliers (rows with ejection_fraction of 70 or above)
df = df[df['ejection_fraction']<70]

In [None]:
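#inp_data = df.drop(df[['DEATH_EVENT']], axis=1)
# Keep the three features highlighted above (columns 4, 7 and 11)
inp_data = df.iloc[:,[4,7,11]]
out_data = df['DEATH_EVENT']  # 1-D target avoids a column-vector warning in fit()

X_train, X_test, y_train, y_test = train_test_split(inp_data, out_data, test_size=0.2, random_state=0, shuffle=True)

## Applying Transformer
sc= StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)  # reuse the training statistics; never refit on test data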

In [None]:
## X_train, X_test, y_train, y_test Shape

print("X_train Shape : ", X_train.shape)
print("X_test Shape  : ", X_test.shape)
print("y_train Shape : ", y_train.shape)
print("y_test Shape  : ", y_test.shape)

In [None]:
## I coded this method for convenience and to avoid writing the same code over and over again

def result(clf):
    clf.fit(X_train, np.ravel(y_train))
    y_pred = clf.predict(X_test)
    y_score = clf.predict_proba(X_test)[:, 1]  # positive-class probability for ROC AUC

    print('Accuracy Score    : {:.4f}'.format(accuracy_score(y_test, y_pred)))
    print('GBM f1-score      : {:.4f}'.format(f1_score( y_test , y_pred)))
    print('GBM precision     : {:.4f}'.format(precision_score(y_test, y_pred)))
    print('GBM recall        : {:.4f}'.format(recall_score(y_test, y_pred)))
    print("GBM roc auc score : {:.4f}".format(roc_auc_score(y_test, y_score)))
    print("\n",classification_report(y_test, y_pred))  # argument order is (y_true, y_pred)

    # Confusion matrix as a percentage of all test samples
    cf_matrix = confusion_matrix(y_test, y_pred)
    plt.figure(figsize=(6,6))
    sns.heatmap((cf_matrix / np.sum(cf_matrix)*100), annot = True, fmt=".2f", cmap="Blues")
    plt.title("GBM Confusion Matrix (Rate)")
    plt.show()

    # Confusion matrix as raw counts
    plt.figure(figsize=(6,6))
    sns.heatmap(cf_matrix, annot=True, fmt="d", cmap="Blues",
                xticklabels=["FALSE","TRUE"],
                yticklabels=["FALSE","TRUE"],
                cbar=False)
    plt.title("GBM Confusion Matrix (Number)")
    plt.show()
    
def sample_result(
    loss='log_loss',  # called 'deviance' in scikit-learn versions before 1.1
    learning_rate=0.1,
    n_estimators=100,
    subsample=1.0,
    criterion='friedman_mse',
    min_samples_split=2,
    min_samples_leaf=1,
    max_depth=3,
    max_features=None
):

    # Repeat training on random train/test splits to see how much accuracy varies
    scores = []
    for i in range(0,500): # 500 samples
        X_train, X_test, y_train, y_test = train_test_split(inp_data, out_data, test_size=0.2)
        clf = GradientBoostingClassifier(
            loss=loss,
            learning_rate=learning_rate,
            n_estimators=n_estimators,
            subsample=subsample,
            criterion=criterion,
            min_samples_split=min_samples_split,
            min_samples_leaf=min_samples_leaf,
            max_depth=max_depth,
            max_features=max_features
        )
        sc=StandardScaler()
        X_train = sc.fit_transform(X_train)
        X_test = sc.transform(X_test)  # scale with training statistics only
        clf.fit(X_train, np.ravel(y_train))
        scores.append(accuracy_score(y_test, clf.predict(X_test)))

    plt.hist(scores)
    plt.show()
    print("Best Score: {}\nMean Score: {}".format(np.max(scores), np.mean(scores)))

---

## Simple Method
I applied GBM directly with its default settings, and the result is as follows:

In [None]:
clf = GradientBoostingClassifier(random_state=0)
result(clf)
sample_result()

## Advanced Method

In [None]:
param_grid = {
    "learning_rate": [0.1, 0.5, 0.01],
    "min_samples_split": [2,3,5],  # integer values must be at least 2
    "max_depth": [2,4,6],
    "max_features":["log2","sqrt"],
    "criterion": ["friedman_mse",  "squared_error"],  # "mae" is no longer accepted by recent scikit-learn
    "subsample":[0.1, 0.5, 1.0],
    "n_estimators":[500,1000]
}

clf = GradientBoostingClassifier()
grid = GridSearchCV(clf, param_grid, n_jobs=4, verbose=3, cv=5)
grid.fit(X_train, y_train)
grid.best_params_

In [None]:
clf = GradientBoostingClassifier(
    learning_rate=0.01,
    min_samples_split=5,
    max_depth=2,
    max_features='sqrt',
    criterion='friedman_mse',
    subsample=0.5,
    n_estimators=500,
    random_state=0
)

result(clf)
sample_result(
    learning_rate=0.01,
    min_samples_split=5,
    max_depth=2,
    max_features='sqrt',
    criterion='friedman_mse',
    subsample=0.5,
    n_estimators=500
)

In [None]:
Importance = pd.DataFrame({'Importance':clf.feature_importances_*100},index=df.iloc[:,[4,7,11]].columns)
Importance.sort_values(by='Importance',axis=0,ascending=True).plot(kind='barh',color='lightblue')
plt.xlabel('Importance for variable');

## Reporting
I evaluated the results with a confusion matrix; the results are as follows:

**Correctly predicted -> 93.33% (56 of 60 predictions are correct)**
- True Negative -> 68.33% (41 people) -> Those who were predicted not to die and who did not die
- True Positive -> 25.00% (15 people) -> Those who were predicted to die and who did die

**Wrongly predicted -> 6.67% (4 of 60 predictions are wrong)**
- False Positive -> 3.33% (2 people) -> Those who were predicted to die but who did not die
- False Negative -> 3.33% (2 people) -> Those who were predicted not to die but who did die
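
Taking the four cell counts listed above as given, the headline figures can be double-checked with a few lines of arithmetic. This is only a sanity check on the reported numbers; the variable names are chosen for illustration, and precision and recall are derived here rather than quoted from the notebook output.

```python
# Cell counts as listed in the report above
tn, tp, fp, fn = 41, 15, 2, 2
total = tn + tp + fp + fn

accuracy = (tp + tn) / total     # share of correct predictions
error_rate = (fp + fn) / total   # share of wrong predictions
precision = tp / (tp + fp)       # of those predicted to die, how many did
recall = tp / (tp + fn)          # of those who died, how many were predicted

print("Test samples :", total)
print("Accuracy     : {:.2%}".format(accuracy))
print("Error rate   : {:.2%}".format(error_rate))
print("Precision    : {:.2%}".format(precision))
print("Recall       : {:.2%}".format(recall))
```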