![XGBoost](https://i.ibb.co/DDM7r46/xgboost.png)

- **ML Part 1** - Logistic Regression
- **ML Part 2** - K-Nearest Neighbors (KNN)
- **ML Part 3** - Support Vector Machine (SVM)
- **ML Part 4** - Artificial Neural Network (NN)
- **ML Part 5** - Classification and Regression Tree (CART)
- **ML Part 6** - Random Forests
- **ML Part 7** - Gradient Boosting Machines (GBM)
- **ML Part 8 - XGBoost**
- **ML Part 9** - LightGBM
- **ML Part 10** - CatBoost

XGBoost is a decision-tree-based, gradient-boosting ML system. If your data is unstructured, such as images, text, or sound, deep learning with artificial neural networks is usually the right choice.
However, if you have structured (tabular) data and not a lot of it, I recommend starting with decision-tree-based algorithms. These algorithms have evolved a lot over time. You can see this evolution more easily in the flow below.

![](https://miro.medium.com/proxy/1*QJZ6W-Pck_W7RlIDwUIN9Q.jpeg)

XGBoost was first presented as a paper at the SIGKDD 2016 conference by two researchers at the University of Washington, Tianqi Chen and Carlos Guestrin, and it made a tremendous impression in the ML world. After its presentation, it became the star not only of academic competitions but also of Kaggle competitions, and it started to find application in industry. Today it has an open-source repo and a strong community, with contributions from many data scientists.

“Random forests are made up of multiple decision trees working together (an ensemble). Combining models that are uncorrelated with each other performs better than any single model working alone; ensemble learning is based on this. The lack of correlation helps the trees compensate for each other's individual mistakes.”
By analogy, we rarely make important decisions alone in real life. In a random forest, many weak decision trees (we call them weak learners) are trained independently and their votes are combined into a stronger model. Each tree is weak because it is trained on a random subset of the data set, which we obtain by bagging. Boosting, which XGBoost builds on, takes a different route: you start with a very weak tree, look at the mistakes it makes, and train the next tree to correct those mistakes, moving step by step toward a stronger model.

## Why Does XGBoost Work Well?
In fact, both XGBoost and Gradient-Boosting Machines (GBMs) are ensemble methods built from weak learners, and in both the weak learners are boosted using gradient descent.

![](https://miro.medium.com/proxy/1*FLshv-wVDfu-i54OqvZdHg.png)
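
To make the idea of boosting weak learners with gradient descent concrete, here is a minimal sketch in plain Python. It is not how XGBoost or any GBM library is implemented; it assumes a single feature, squared-error loss, and depth-1 trees (stumps), and the names `fit_stump`, `boost`, `x_toy`, and `y_toy` are made up for this illustration. For squared error, the negative gradient is simply the residual, so each round fits a stump to the current residuals.

```python
import math


def fit_stump(x, residuals):
    """Fit a depth-1 regression tree (a stump) to the residuals by least squares."""
    best = None
    # Candidate thresholds: every distinct value except the largest,
    # so that both sides of the split are non-empty.
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def boost(x, y, n_rounds=20, learning_rate=0.3):
    """Return the training MSE after each boosting round."""
    predictions = [sum(y) / len(y)] * len(y)  # start from the mean of y
    history = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is the residual y - f(x).
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        # Take a small step in the direction suggested by the new weak learner.
        predictions = [pi + learning_rate * stump(xi)
                       for xi, pi in zip(x, predictions)]
        history.append(sum((yi - pi) ** 2 for yi, pi in zip(y, predictions)) / len(y))
    return history


x_toy = [i / 10 for i in range(30)]
y_toy = [math.sin(xi) for xi in x_toy]
mse_history = boost(x_toy, y_toy)
print("MSE after round 1 : {:.4f}".format(mse_history[0]))
print("MSE after round 20: {:.4f}".format(mse_history[-1]))
```

Each stump on its own is a poor model, but the training error shrinks round after round because every new stump is fitted to what the ensemble still gets wrong.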

## System Optimizations
- **Parallelization:** XGBoost builds decision trees much faster by parallelizing their construction. This is possible because the two loops used to build the base learners are interchangeable: the outer loop enumerates the leaf nodes of the tree, and the inner loop computes the features. Normally this nesting limits parallelization, because the outer loop cannot proceed until the inner loop has finished; that is, the leaves of the tree cannot be formed before the features are calculated. XGBoost switches the order of the loops, which shortens the run time and greatly reduces the parallelization overhead.
- **Tree Pruning:** In GBMs, splitting a branch stops according to a negative-loss criterion at the point of the split. XGBoost instead grows the tree up to the depth set by the max_depth parameter and then prunes backwards. Because XGBoost prioritizes depth in this way, computational performance improves significantly.
- **Hardware Optimization:** XGBoost was designed from the outset to make better use of hardware resources. For example, it is cache-aware: each thread allocates an internal buffer and stores gradient statistics in it. In addition, improvements such as "out-of-core" computing make efficient use of disk space to handle data sets that do not fit into memory.


## Algorithmic Improvements
- **Regularization:** Overfitting can be penalized using both L1 (LASSO) and L2 (Ridge) regularization.
- **Sparsity Awareness:** Unfortunately, real-life data sets contain many missing values. XGBoost learns the best way to handle a missing value with its weak learners by looking at the training loss. Sometimes the missing values also follow a particular pattern (sensor or communication errors, etc.). XGBoost can handle these cases as well.
- **Weighted Quantile Sketch:** One of the biggest advantages of XGBoost is that it weights the observations in the data set when proposing candidate split points, so it can find good split points efficiently while growing the trees.
- **Cross-validation:** XGBoost comes with a built-in cross-validation (cv) routine, so you don't need an external tool such as scikit-learn to run a cv. Combined with early stopping, it also means you do not need to fix the exact number of boosting iterations in advance for each run.

In [None]:
# Import the necessary packages
import numpy as np
import pandas as pd

import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
%matplotlib inline

import sklearn
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler, LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import f1_score, recall_score, precision_score, confusion_matrix
from sklearn.metrics import r2_score, roc_auc_score, roc_curve, classification_report
from xgboost.sklearn import XGBClassifier

import warnings
warnings.filterwarnings('ignore')

In [None]:
# Import and read dataset
input_ = "../input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv"
df = pd.read_csv(input_)

df.head(10)

In [None]:
plt.figure(figsize=(15,15))
sns.heatmap(df.corr(),annot=True)
plt.show()

In [None]:
df.describe()

In [None]:
x = df.drop(columns='DEATH_EVENT')
y = df['DEATH_EVENT']

model = XGBClassifier()
model.fit(x,y)
print(model.feature_importances_)
feat_importances = pd.Series(model.feature_importances_, index=x.columns)
feat_importances.nlargest(12).plot(kind='barh')
plt.show()

In [None]:
for i in range(0,len(df.columns)):
    print("{} = {}".format(i,df.columns[i]))

In [None]:
# Drop outliers: keep only rows with ejection_fraction below 70
df = df[df['ejection_fraction']<70]

In [None]:
inp_data = df.drop(columns='DEATH_EVENT')
#inp_data = df.iloc[:,[11,7,4,0,1,8]]
out_data = df['DEATH_EVENT']

X_train, X_test, y_train, y_test = train_test_split(inp_data, out_data, test_size=0.2, random_state=0, shuffle=True)

## Applying Transformer
# Fit the scaler on the training data only, then reuse it for the test data
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [None]:
## X_train, X_test, y_train, y_test Shape

print("X_train Shape : ", X_train.shape)
print("X_test Shape  : ", X_test.shape)
print("y_train Shape : ", y_train.shape)
print("y_test Shape  : ", y_test.shape)

In [None]:
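# I wrote these helpers for convenience, to avoid writing the same code over and over again

def result(clf):
    # Recent XGBoost releases expect the early-stopping settings on the estimator, not in fit()
    clf.set_params(early_stopping_rounds=10, eval_metric="logloss")
    # Note: the test set is also used as the early-stopping validation set here
    clf.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)
    y_pred = clf.predict(X_test)
    y_proba = clf.predict_proba(X_test)[:, 1]  # probability of the positive class
    
    print('Accuracy Score    : {:.4f}'.format(accuracy_score(y_test, y_pred)))
    print('XGBoost f1-score      : {:.4f}'.format(f1_score(y_test, y_pred)))
    print('XGBoost precision     : {:.4f}'.format(precision_score(y_test, y_pred)))
    print('XGBoost recall        : {:.4f}'.format(recall_score(y_test, y_pred)))
    # ROC AUC is computed from predicted probabilities rather than hard labels
    print("XGBoost roc auc score : {:.4f}".format(roc_auc_score(y_test, y_proba)))
    # classification_report expects the true labels first
    print("\n", classification_report(y_test, y_pred))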
    
    plt.figure(figsize=(6,6))
    cf_matrix = confusion_matrix(y_test, y_pred)
    sns.heatmap((cf_matrix / np.sum(cf_matrix)*100), annot = True, fmt=".2f", cmap="Blues")
    plt.title("XGBoost Confusion Matrix (Rate)")
    plt.show()
    
    cm = confusion_matrix(y_test,y_pred)
    plt.figure(figsize=(6,6))
    sns.heatmap(cm, annot=True, cmap="Blues",
                xticklabels=["FALSE","TRUE"],
                yticklabels=["FALSE","TRUE"],
                cbar=False)
    plt.title("XGBoost Confusion Matrix (Number)")
    plt.show()
    
    
def report(**params):
    scores = []
    for i in range(0,250): # 250 random train/test splits
        X_train, X_test, y_train, y_test = train_test_split(inp_data, out_data, test_size=0.2, shuffle=True)
        sc = StandardScaler()
        # Recent XGBoost releases expect the early-stopping settings on the estimator
        clf = XGBClassifier(early_stopping_rounds=10, eval_metric="logloss", **params)
        X_train = sc.fit_transform(X_train)
        X_test = sc.transform(X_test)  # reuse the scaling learned on the training split
        clf.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)
        scores.append(accuracy_score(y_test, clf.predict(X_test)))

    # Feature importances of the model trained on the last split
    Importance = pd.DataFrame({'Importance':clf.feature_importances_*100},index=inp_data.columns)
    Importance.sort_values(by='Importance',axis=0,ascending=True).plot(kind='barh',color='lightblue')
    plt.xlabel('Importance for variable')
    plt.show()

    # Distribution of the accuracy over all splits, drawn on its own figure
    plt.hist(scores)
    plt.xlabel('Accuracy')
    plt.show()
    print("Best Score: {}\nMean Score: {}".format(np.max(scores), np.mean(scores)))

---
## Simple Method
I applied XGBoost directly without changing anything and the result is as follows:

In [None]:
clf = XGBClassifier(random_state=0)
result(clf)

In [None]:
report()

## Advanced Method
### Parameters

1. **eta [default=0.3]**
  - Analogous to learning rate in GBM
  - Makes the model more robust by shrinking the weights on each step
  - Typical final values to be used: 0.01-0.2


2. **min_child_weight [default=1]**
  - Defines the minimum sum of weights of all observations required in a child.
  - This is similar to min_samples_leaf in GBM, but not exactly. This refers to the minimum “sum of weights” of observations, while GBM uses the minimum “number of observations”.
  - Used to control over-fitting. Higher values prevent a model from learning relations which might be highly specific to the particular sample selected for a tree.
  - Values that are too high can lead to under-fitting, so it should be tuned using CV.


3. **max_depth [default=6]**
  - The maximum depth of a tree, same as GBM.
  - Used to control over-fitting as higher depth will allow model to learn relations very specific to a particular sample.
  - Should be tuned using CV.
  - Typical values: 3-10


4. **max_leaf_nodes**
  - The maximum number of terminal nodes or leaves in a tree.
  - Can be defined in place of max_depth. Since binary trees are created, a depth of ‘n’ would produce a maximum of 2^n leaves.
  - If this is defined, GBM will ignore max_depth. In XGBoost, the corresponding parameter is called max_leaves.


5. **gamma [default=0]**
  - A node is split only when the resulting split gives a positive reduction in the loss function. Gamma specifies the minimum loss reduction required to make a split.
  - Makes the algorithm conservative. The values can vary depending on the loss function and should be tuned.


6. **max_delta_step [default=0]**
  - The maximum delta step we allow each tree’s weight estimation to be. If the value is set to 0, it means there is no constraint. If it is set to a positive value, it can help make the update step more conservative.
  - Usually this parameter is not needed, but it might help in logistic regression when class is extremely imbalanced.
  - This is generally not used but you can explore further if you wish.


7. **subsample [default=1]**
  - Same as the subsample of GBM. Denotes the fraction of observations to be randomly sampled for each tree.
  - Lower values make the algorithm more conservative and prevent overfitting, but values that are too small might lead to under-fitting.
  - Typical values: 0.5-1


8. **colsample_bytree [default=1]**
  - Similar to max_features in GBM. Denotes the fraction of columns to be randomly sampled for each tree.
  - Typical values: 0.5-1


9. **colsample_bylevel [default=1]**
  - Denotes the subsample ratio of columns for each split, in each level.
  - I don’t use this often because subsample and colsample_bytree will do the job for you, but you can explore it further if you wish.


10. **lambda [default=1]**
  - L2 regularization term on weights (analogous to Ridge regression)
  - This is used to handle the regularization part of XGBoost. Though many data scientists don’t use it often, it should be explored to reduce overfitting.


11. **alpha [default=0]**
  - L1 regularization term on weight (analogous to Lasso regression)
  - Can be used in case of very high dimensionality so that the algorithm runs faster when implemented


12. **scale_pos_weight [default=1]**
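  - Controls the balance of positive and negative weights. A value greater than 1 should be used in case of high class imbalance, as it helps in faster convergence. A typical value to consider is sum(negative instances) / sum(positive instances).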
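
To see how lambda, gamma, and min_child_weight interact, here is a small sketch of the split-gain formula from the XGBoost paper, written in plain Python. It is a simplified illustration rather than the library's code: the function names are made up, the L1 term (alpha) is left out, and `G` and `H` stand for the sums of the first and second derivatives of the loss over the observations in a node.

```python
def leaf_weight(G, H, reg_lambda=1.0):
    """Optimal leaf value for a node with gradient sum G and hessian sum H."""
    return -G / (H + reg_lambda)


def split_gain(GL, HL, GR, HR, reg_lambda=1.0, gamma=0.0):
    """Loss reduction from splitting a node into a left and a right child."""
    def score(G, H):
        return G * G / (H + reg_lambda)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma


def should_split(GL, HL, GR, HR, reg_lambda=1.0, gamma=0.0, min_child_weight=1.0):
    """A split is kept only if both children are heavy enough and the gain is positive."""
    if HL < min_child_weight or HR < min_child_weight:
        return False
    return split_gain(GL, HL, GR, HR, reg_lambda, gamma) > 0


# Hypothetical gradient statistics for one candidate split
GL, HL, GR, HR = -4.0, 3.0, 5.0, 4.0

print(split_gain(GL, HL, GR, HR))                          # gain with default settings
print(should_split(GL, HL, GR, HR))                        # split is accepted
print(should_split(GL, HL, GR, HR, gamma=5.0))             # gamma too high: rejected
print(should_split(GL, HL, GR, HR, min_child_weight=3.5))  # left child too light: rejected
```

Note that the child "weight" compared against min_child_weight is the hessian sum, not a row count. For squared error each hessian is 1, so the two coincide, but for logistic loss each hessian is p(1 - p), which is why the parameter is described as a “sum of weights”.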


---

### Learning Task Parameters
These parameters are used to define the optimization objective and the metric to be calculated at each step.

1. objective [default=reg:squarederror, named reg:linear in older releases]
    - This defines the loss function to be minimized. Mostly used values are:
        - binary:logistic – logistic regression for binary classification, returns predicted probability (not class)
        - multi:softmax – multiclass classification using the softmax objective, returns predicted class (not probabilities)
            - you also need to set an additional num_class (number of classes) parameter defining the number of unique classes
        - multi:softprob – same as softmax, but returns predicted probability of each data point belonging to each class.
2. eval_metric [ default according to objective ]
    - The metric to be used for validation data.
    - The default values are rmse for regression and logloss for binary classification (error in XGBoost releases before 1.3).
    - Typical values are:
        - rmse – root mean square error
        - mae – mean absolute error
        - logloss – negative log-likelihood
        - error – Binary classification error rate (0.5 threshold)
        - merror – Multiclass classification error rate
        - mlogloss – Multiclass logloss
        - auc – Area under the curve
3. seed [default=0]
    - The random number seed.
    - Can be used for generating reproducible results and also for parameter tuning.
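
The two binary classification metrics that appear most often in this notebook, logloss and error, are easy to compute by hand. The sketch below does so in plain Python on four made-up predictions; the function names and the toy values are only for illustration.

```python
import math


def logloss(y_true, p_pred, eps=1e-15):
    """Negative log-likelihood, averaged over the observations."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def error_rate(y_true, p_pred, threshold=0.5):
    """Share of wrong predictions when probabilities above the threshold count as class 1."""
    wrong = sum(int(p > threshold) != y for y, p in zip(y_true, p_pred))
    return wrong / len(y_true)


y_true_toy = [1, 0, 1, 0]
p_pred_toy = [0.9, 0.2, 0.4, 0.6]  # predicted probability of class 1

print("logloss: {:.4f}".format(logloss(y_true_toy, p_pred_toy)))
print("error  : {:.2f}".format(error_rate(y_true_toy, p_pred_toy)))
```

The last two predictions are on the wrong side of 0.5, so the error is 0.50, while logloss also takes into account how confident each prediction was.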

If you’ve been using Scikit-Learn till now, these parameter names might not look familiar. The good news is that the xgboost module in Python has an sklearn wrapper called XGBClassifier, which uses the sklearn-style naming convention. The parameter names that change are:
   1. eta –> learning_rate
   2. lambda –> reg_lambda
   3. alpha –> reg_alpha

You might be wondering why we have defined everything except something similar to the “n_estimators” parameter in GBM. It does exist as a parameter in XGBClassifier. In the standard xgboost implementation, however, it has to be passed as “num_boost_round” when calling the train function.

---

#### Step 1: Tune max_depth and min_child_weight

In [None]:
param_grid = {
 'max_depth':range(3,10,2),
 'min_child_weight':range(1,6,2)
}

clf = XGBClassifier()
grid = GridSearchCV(clf, param_grid, n_jobs=-1, verbose=2, cv=10)
grid.fit(X_train, y_train)
grid.best_params_

In [None]:
clf = XGBClassifier(
    max_depth= 5,
    min_child_weight= 5,
)

result(clf)

In [None]:
report(
    max_depth= 5,
    min_child_weight= 5,
)

---

#### Step 2: Tune gamma

In [None]:
param_grid = {
    'gamma': [i/10.0 for i in range(0,8)]
}

clf = XGBClassifier()
grid = GridSearchCV(clf, param_grid, n_jobs=-1, verbose=2, cv=10)
grid.fit(X_train, y_train)
grid.best_params_

In [None]:
clf = XGBClassifier(
    max_depth= 5,
    min_child_weight= 5,
    gamma = 0.5,
)

result(clf)

In [None]:
report(
    max_depth= 5,
    min_child_weight= 5,
    gamma = 0.5,
)

---

#### Step 3: Tune subsample and colsample_bytree

In [None]:
param_grid = {
 'subsample':[i/10.0 for i in range(6,10)],
 'colsample_bytree': [i/10.0 for i in range(7,11)]  # 0.7 to 1.0; values above 1 are invalid
}

clf = XGBClassifier()
grid = GridSearchCV(clf, param_grid, n_jobs=-1, verbose=2, cv=10)
grid.fit(X_train, y_train)
grid.best_params_

In [None]:
clf = XGBClassifier(
    max_depth= 5,
    min_child_weight= 5,
    gamma = 0.5,
    colsample_bytree= 0.9,
    subsample= 0.8,
    seed=0
)

result(clf)

In [None]:
report(
    max_depth= 5,
    min_child_weight= 5,
    gamma = 0.5,
    colsample_bytree= 0.9,
    subsample= 0.8,
)

---

#### Step 4: Tuning Regularization Parameters

In [None]:
param_grid = {
 'reg_alpha': [0, 0.001, 0.005, 0.01, 0.05]
}

clf = XGBClassifier()
grid = GridSearchCV(clf, param_grid, n_jobs=-1, verbose=2, cv=10)
grid.fit(X_train, y_train)
grid.best_params_

In [None]:
clf = XGBClassifier(
    max_depth= 5,
    min_child_weight= 5,
    gamma = 0.5,
    colsample_bytree= 0.9,
    subsample= 0.8,
    reg_alpha= 0.01,
    seed=0
)

result(clf)

In [None]:
report(
    max_depth= 5,
    min_child_weight= 5,
    gamma = 0.5,
    colsample_bytree= 0.9,
    subsample= 0.8,
    reg_alpha= 0.01
)

---

#### Step 5: Reducing Learning Rate

In [None]:
clf = XGBClassifier(
    max_depth= 5,
    min_child_weight= 5,
    gamma = 0.5,
    colsample_bytree= 0.9,
    subsample= 0.8,
    reg_alpha= 0.01,
    learning_rate=0.1,
    objective= 'binary:logistic',
    seed= 0
)

result(clf)

In [None]:
report(
    max_depth= 5,
    min_child_weight= 5,
    gamma = 0.5,
    colsample_bytree= 0.9,
    subsample= 0.8,
    reg_alpha= 0.01,
    learning_rate=0.1,
    objective= 'binary:logistic',
)


---

In [None]:
param_grid = {
    'max_depth':range(3,10,2),
    'min_child_weight':range(1,6,2),
    'gamma': [i/10.0 for i in range(0,8)],
    'subsample':[i/10.0 for i in range(6,10)],
    'colsample_bytree': [i/10.0 for i in range(7,11)],  # 0.7 to 1.0; values above 1 are invalid
    'reg_alpha': [0, 0.001, 0.005, 0.01, 0.05],
    'objective':['binary:logistic']
}

clf = XGBClassifier()
grid = GridSearchCV(clf, param_grid, n_jobs=-1, verbose=2, cv=10)
grid.fit(X_train, y_train)
grid.best_params_

In [None]:
clf = XGBClassifier(
    max_depth= 5,
    min_child_weight= 3,
    gamma = 0.2,
    colsample_bytree= 0.7,
    subsample= 0.6,
    reg_alpha= 0.01,
    learning_rate=0.1,
    objective= 'binary:logistic',
    seed= 0
)

result(clf)

In [None]:
report(
    max_depth= 5,
    min_child_weight= 3,
    gamma = 0.2,
    colsample_bytree= 0.7,
    subsample= 0.6,
    reg_alpha= 0.01,
    learning_rate=0.01,
    objective= 'binary:logistic',
)