<html>
    <p style='background:Orange; color:white; font-size:30px; padding:7px;text-align:center;border-width: 5px;border-color: coral;border-style: solid'><b>Catboost-Tabular Playground Series</b></p>
</html>

<img src="https://avatars.mds.yandex.net/get-bunker/56833/dba868860690e7fe8b68223bb3b749ed8a36fbce/orig">

<h1><span class="label label-default" style="background-color:black;border-radius:100px 100px; font-weight: bold; font-family:papyrus; font-size:20px; color:#03e8fc; padding:10px">Contents</span></h1><br>

* [1.Introduction](#1)
* [2.About Catboost](#2)
    * [2.1.Tree Structure Type](#2.1)
    * [2.2.Categorical Feature Support](#2.2)
    * [2.3.Differentiating from classical boosting](#2.3)
* [3.Model Building](#3)
* [4.Shap](#4)

<h1><span class="label label-default" style="background-color:black;border-radius:100px 100px; font-weight: bold; font-family:papyrus; font-size:20px; color:#03e8fc; padding:10px">Introduction</span></h1><br>


<b>About the Data </b>

The dataset used for this competition is synthetic, but it is based on a real dataset and was generated using CTGAN. The original dataset deals with predicting the category of an eCommerce product given various attributes of the listing. Although the features are anonymized, they have properties relating to real-world features.

<b>Files</b>

* train.csv - the training data, one product (id) per row, with the associated features (feature_*) and class label (target)
* test.csv - the test data; you must predict the probability the id belongs to each class
* sample_submission.csv - a sample submission file in the correct format

<hr style="margin-width:10"></hr>

<a id="2"></a>
<h1><span class="label label-default" style="background-color:black;border-radius:100px 100px; font-weight: bold; font-family:papyrus; font-size:20px; color:#03e8fc; padding:10px">About Catboost</span></h1><br>

CatBoost is a machine learning algorithm that uses gradient boosting on decision trees. It is available as a powerful open-source library developed by Yandex. This notebook takes an in-depth look at how to apply the CatBoost algorithm to the data and improve the accuracy of the model.

The notebook walks through training a model with CatBoost and making predictions, and then looks at some useful features such as feature importances and SHAP interpretation.

<a id="2.1"></a>
<b>1. Tree Structure Type </b> 

CatBoost grows full symmetric binary trees (FSBT), also called oblivious trees: at each level of the tree, the same split condition is used for every node. With FSBT the results do not change much as the parameters vary, so the model is stable. As a result, little parameter tuning is required and the default values usually give good results. Below is an example of a full symmetric binary tree.

<img src="https://miro.medium.com/max/2008/1*AjrRnwvBuu-zK8CvEfM29w.png">
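To make the "same split at every level" idea concrete, here is a minimal sketch of how a symmetric tree can be evaluated. It is only an illustration: the splits, thresholds, leaf values, and the function names `leaf_index` and `predict` are made up for this example and are not taken from a trained CatBoost model.

```python
# Toy symmetric (oblivious) tree: exactly one (feature index, threshold) pair per level.
# All numbers below are made up for illustration.
splits = [(0, 0.5), (1, 2.0)]          # level 0 tests x[0] > 0.5, level 1 tests x[1] > 2.0
leaf_values = [0.1, 0.4, -0.3, 0.8]    # 2 ** depth leaves, indexed by the bit pattern

def leaf_index(x, splits):
    """Build the leaf index bit by bit: every level contributes one bit."""
    index = 0
    for feature, threshold in splits:
        index = (index << 1) | int(x[feature] > threshold)
    return index

def predict(x):
    """Look up the leaf value for one object."""
    return leaf_values[leaf_index(x, splits)]

for row in ([0.2, 1.0], [0.2, 3.0], [0.9, 1.0], [0.9, 3.0]):
    print(row, "-> leaf", leaf_index(row, splits), "value", predict(row))
```

Because every node on a level shares one condition, a tree of depth d needs only d comparisons and a table lookup, which is one reason this structure is fast to evaluate.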

<a id="2.2"></a>
<b>2. Categorical Features Support </b>

* One-hot encoding: CatBoost handles categorical features under the hood and converts those with few distinct values to one-hot encoding without manual intervention.
* Average label value (categorical feature with label value): replace each category with the average label value of the objects in that category and add it as a new column. Computed naively on the whole dataset, this uses each object's own label and therefore causes target leakage.
* Permutation of the data: to avoid the leakage, the records are permuted and, for each object, the average label value is calculated only from the objects that precede it in the permutation (not including the object itself).
* Feature combinations: combinations of categorical features are created greedily, keeping only the most impactful ones, to avoid generating combinations from features with many categories.
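The permutation-based encoding described above can be sketched in a few lines of plain Python. This is a simplified illustration, not CatBoost's implementation: the function name `ordered_target_stats` is hypothetical, the prior of 0.5 is an arbitrary choice, and the smoothing is assumed to have the form (sum + prior) / (count + 1).

```python
import random

def ordered_target_stats(categories, targets, prior=0.5, seed=0):
    """Encode each object using only the objects that precede it in a random permutation."""
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)
    sums, counts = {}, {}            # running target sum and count per category
    encoded = [None] * len(categories)
    for i in order:
        cat = categories[i]
        # Statistics are read before the update, so the object's own target is never used.
        encoded[i] = (sums.get(cat, 0.0) + prior) / (counts.get(cat, 0) + 1)
        sums[cat] = sums.get(cat, 0.0) + targets[i]
        counts[cat] = counts.get(cat, 0) + 1
    return encoded

cats = ["a", "b", "a", "a", "b"]
y = [1, 0, 1, 0, 1]
print(ordered_target_stats(cats, y))
```

Changing an object's own label leaves its own encoded value untouched, which is exactly the property that prevents the leakage.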

<a id="2.3"></a>
<b>3. Differentiating from Classical Boosting </b>

Classical boosting techniques use the weighted sum of the gradients of the objects in a leaf as the leaf estimate, which is prone to overfitting: the estimate is biased because the tree makes its estimate on the same objects on which it was built.

Ordered boosting works on a random permutation of the objects. The estimate for an object is made by a model built only on the objects that precede it in the permutation, so the object itself is never used.
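The difference can be seen in a deliberately tiny toy, sketched below. The "model" here is just a running mean rather than a tree, the function names are hypothetical, and the input list is assumed to be already in permuted order, so this shows only the idea of excluding the current object, not CatBoost's actual algorithm.

```python
def classical_residuals(y):
    """Residual of each object against a mean fit on all objects, itself included."""
    mean_all = sum(y) / len(y)
    return [yi - mean_all for yi in y]

def ordered_residuals(y, prior=0.0):
    """Residual of each object against a mean fit only on the objects that precede it."""
    residuals, running_sum = [], 0.0
    for i, yi in enumerate(y):
        estimate = running_sum / i if i > 0 else prior   # model built without object i
        residuals.append(yi - estimate)
        running_sum += yi
    return residuals

y = [1.0, 3.0, 2.0, 10.0]   # assumed to be already in permuted order
print(classical_residuals(y))   # [-3.0, -1.0, -2.0, 6.0]
print(ordered_residuals(y))     # [1.0, 2.0, 0.0, 8.0]
```

In the classical version the outlier 10.0 pulls the mean toward itself and so shrinks its own residual; in the ordered version it cannot influence the estimate it is compared against.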


<b>Source</b>
https://catboost.ai/news/catboost-enables-fast-gradient-boosting-on-decision-trees-using-gpus

<h1><span class="label label-default" style="background-color:black;border-radius:100px 100px; font-weight: bold; font-family:papyrus; font-size:20px; color:#03e8fc; padding:10px">Import required Libraries</span></h1><br>


In [None]:
import pandas as pd
import numpy as np
import datetime as dt
import warnings
warnings.filterwarnings('ignore')
import os
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
import catboost
from catboost import CatBoostClassifier
from catboost import Pool, cv
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import LabelEncoder

### Import Data

In [None]:
train=pd.read_csv('../input/tabular-playground-series-may-2021/train.csv')
test=pd.read_csv('../input/tabular-playground-series-may-2021/test.csv')

### Data Shape

In [None]:
display(train.shape)
display(test.shape)

### Check the records

In [None]:
train.head()

### Label Encoding

In [None]:
lencoder = LabelEncoder()
Y = pd.DataFrame(lencoder.fit_transform(train['target']), columns=['target'])

In [None]:
X=train.copy()
X.drop('target',axis=1,inplace=True)

In [None]:
display(X.shape)
display(Y.shape)

In [None]:
Y['target'].unique()  # 4 classes

### Split data into train and test sets

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split (X, Y, test_size = 0.20, random_state=42)

In [None]:
Y.target.unique()

### Scale the data

In [None]:
scaler = StandardScaler()
# Column 0 is the id, so fit the scaler on the feature columns only
scaler.fit(X_train.iloc[:,1:])
train_sc=scaler.transform(X_train.iloc[:,1:])

In [None]:
test_sc=scaler.transform(X_test.iloc[:,1:])

<a id="3"></a>


<h1><span class="label label-default" style="background-color:black;border-radius:100px 100px; font-weight: bold; font-family:papyrus; font-size:20px; color:#03e8fc; padding:10px">Model Building</span></h1><br>

### Create the Pool object

In [None]:
pool_train=Pool(train_sc,Y_train)
pool_val=Pool(test_sc,Y_test)

### Create a function to get the optimal number of trees

In [None]:
### Run CatBoost cross-validation to find the optimal number of iterations, keeping the other parameters fixed
### Inputs: params = dict of CatBoost parameters, poolX = Pool holding the training features and labels
def modelfit(params,poolX,useTrainCV=True,cv_folds=5,early_stopping_rounds=10):
    cvresult = None  ## stays None when cross-validation is skipped
    if useTrainCV:
        cvresult = cv(params=params, pool=poolX,nfold=cv_folds,early_stopping_rounds=early_stopping_rounds,plot=True,verbose=50)
    return cvresult ## DataFrame with one row per iteration, up to the iteration where early stopping triggered
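For readers unfamiliar with `early_stopping_rounds`, the following is a minimal sketch of a generic early-stopping rule applied to a list of validation losses. It is an assumption-laden illustration, not CatBoost's internal logic: the function name `best_iteration` and the loss curve are made up.

```python
def best_iteration(losses, early_stopping_rounds=10):
    """Return (best_index, stop_index) for a list of per-iteration validation losses."""
    best_index, best_loss = 0, float("inf")
    for i, loss in enumerate(losses):
        if loss < best_loss:
            best_index, best_loss = i, loss
        elif i - best_index >= early_stopping_rounds:
            return best_index, i      # no improvement for the allowed number of rounds
    return best_index, len(losses) - 1

# Made-up loss curve: improves until index 3, then slowly gets worse.
curve = [1.00, 0.80, 0.70, 0.65, 0.66, 0.67, 0.68, 0.69]
print(best_iteration(curve, early_stopping_rounds=3))   # (3, 6)
```

Note that in this sketch the iteration where training stops and the iteration with the best loss are not the same, which is worth keeping in mind when reading an iteration count off a results table.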

In [None]:
## Prepare the parameters for cv
params={
    'loss_function':'MultiClass',
    'iterations':1500,
    'verbose':50
}

In [None]:
### Returns the cv results; the number of rows gives the number of trees to grow
n_est=modelfit(params,pool_train)

In [None]:
n_est.shape[0]### number of optimal iteration

In [None]:
### Fit the model with iteration=885
cboost1=CatBoostClassifier(iterations=885,loss_function='MultiClass',random_seed=123,verbose=50)
cboost1.fit(train_sc,Y_train)

### Train Results

In [None]:
#Predict training set:
train_predictions = cboost1.predict(train_sc)
#Print model report:
print("\nModel Report Train")
print("Accuracy :{}".format(metrics.accuracy_score(Y_train, train_predictions)))
print("precision: {}".format(metrics.precision_score(Y_train, train_predictions,average=None)))
print("recall: {}".format(metrics.recall_score(Y_train, train_predictions,average=None)))
print("f1score: {}".format(metrics.f1_score(Y_train, train_predictions,average=None)))

### Test results

In [None]:
#Predict test set:
test_predictions = cboost1.predict(test_sc)
#Print model report:
print("\nModel Report Test")
print("Accuracy :{}".format(metrics.accuracy_score(Y_test, test_predictions)))
print("precision: {}".format(metrics.precision_score(Y_test, test_predictions,average=None)))
print("recall: {}".format(metrics.recall_score(Y_test, test_predictions,average=None)))
print("f1score: {}".format(metrics.f1_score(Y_test, test_predictions,average=None)))

In [None]:
cboost1.get_all_params()

### Round 2 
Train accuracy = 0.59, better than the CatBoost default. Fix iterations=885 and tune other parameters such as depth.

In [None]:
### Grid search: keep iterations=885 from above and tune depth
### This parameter mainly controls the complexity of the model
## Define the grid

param_test1 = {
    'depth':np.arange(6,11,1)
}

gsearch1 = GridSearchCV(estimator = CatBoostClassifier(iterations=885,loss_function='MultiClass',random_seed=123,depth=6), 
                                    param_grid = param_test1, scoring='accuracy',
                                    n_jobs=4, verbose=50,
                                    cv=5)
gsearch1.fit(train_sc,Y_train)
gsearch1.best_params_, gsearch1.best_score_

### Insight Round 2
Accuracy did not improve, so we keep depth=6. Next, tune the learning rate.


### Round 3


In [None]:
### Fix depth=6 and tune learning rate
param_test2 = {
    'learning_rate':[x/10.0 for x in np.arange(1,10,1)]
}
gsearch2 = GridSearchCV(estimator = CatBoostClassifier(iterations=885,loss_function='MultiClass',random_seed=123,depth=6), 
                                    param_grid = param_test2, scoring='accuracy',
                                    n_jobs=4, 
                                    cv=5)
gsearch2.fit(train_sc,Y_train)
gsearch2.best_params_, gsearch2.best_score_

### Insight Round 3
No improvement over Round 2. Keep iterations as it is and tune l2_leaf_reg.

### Round 4

In [None]:
### Keep iterations=885 and tune l2_leaf_reg.
### Fix depth=6 and tune the L2 regularization.
param_test3 = {
    'l2_leaf_reg':[1,2,3,4,5]
}
gsearch3 = GridSearchCV(estimator = CatBoostClassifier(iterations=885,
                                                       loss_function='MultiClass',
                                                       random_seed=123,
                                                       depth=6), 
                                                       param_grid = param_test3, 
                                                       scoring='accuracy',
                                                       n_jobs=4, 
                                                       cv=5)
gsearch3.fit(train_sc,Y_train)
gsearch3.best_params_, gsearch3.best_score_

### Insight Round 4
l2_leaf_reg=4 does not improve the accuracy.

### Round 5

In [None]:
## Fit CatBoost with the above params and check the train results
## Prepare the parameters for cv
params={
    'loss_function':'MultiClass',
    'iterations':885,
    'l2_leaf_reg':4,
    'depth':6,
}

In [None]:
### Returns the cv results; the number of rows gives the number of trees to grow
n_est_1=modelfit(params,pool_train)

In [None]:
n_est_1.shape[0]###885

### Train the model with full data

In [None]:
scaler = StandardScaler()
scaler.fit(X.iloc[:,1:])
X_sc=scaler.transform(X.iloc[:,1:])

In [None]:
### Fit the model with iterations=837
cboost2=CatBoostClassifier(iterations=837,loss_function='MultiClass',random_seed=123,l2_leaf_reg=4,depth=6,verbose=50)
cboost2.fit(X_sc,Y)

### Train results

In [None]:
#Predict training set:
train_predictions = cboost2.predict(X_sc)
#Print model report:
print("\nModel Report Train")
print("Accuracy :{}".format(metrics.accuracy_score(Y, train_predictions)))
print("precision: {}".format(metrics.precision_score(Y, train_predictions,average=None)))
print("recall: {}".format(metrics.recall_score(Y, train_predictions,average=None)))
print("f1score: {}".format(metrics.f1_score(Y, train_predictions,average=None)))

### Predict on Test

In [None]:
### scale the test set with the scaler already fitted on the full training data
test_sc=scaler.transform(test.iloc[:,1:])

In [None]:
test_prediction=cboost2.predict(test_sc,prediction_type='Probability')

In [None]:
test_prob=pd.DataFrame(test_prediction)

In [None]:
submission=pd.read_csv('../input/tabular-playground-series-may-2021/sample_submission.csv')

In [None]:
submission=pd.concat([test['id'],test_prob],axis=1)
submission.columns=['id','Class_1','Class_2','Class_3','Class_4']

In [None]:
### File for submission
submission.to_csv('submission.csv',index=False)

### Feature importance

These feature importances are non-negative. They are normalized and sum to 100, so you can read each value as a percentage of the total importance.

In [None]:
np.array(cboost2.get_feature_importance(prettified=True))

<hr style="width:5"></hr>

<a id="4"></a>
<h1><span class="label label-default" style="background-color:black;border-radius:100px 100px; font-weight: bold; font-family:papyrus; font-size:20px; color:#03e8fc; padding:10px">SHAP</span></h1><br>

SHAP values are calculated for each object in a dataset. Each feature of an object receives a SHAP value that describes how much it contributes to the prediction, and each object also gets a fixed base (expected) value. For each object, the base value plus the sum of the SHAP values of its features equals the model's raw prediction, that is, the score before it is converted to a probability.

Importance of SHAP values:

* Global interpretability: the collective SHAP values can show how much each predictor contributes, either positively or negatively, to the target variable. This is like the variable importance plot.
* Local interpretability: each observation gets its own set of SHAP values. This greatly increases its transparency. We can explain why a case receives its prediction and the contributions of the predictors. Traditional variable importance algorithms only show the results across the entire population but not on each individual case.
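The additive property (base value plus per-feature contributions equals the prediction) can be checked on a small example. The sketch below computes exact Shapley values by brute force for a made-up three-feature function against a single all-zero baseline. The names `model` and `shapley_values` are hypothetical, and this is not the tree-based algorithm CatBoost uses internally; it only illustrates the property.

```python
from itertools import permutations

def model(x):
    """Hypothetical model with an interaction term, used only for illustration."""
    return 2.0 * x[0] + x[1] * x[2]

def shapley_values(f, x, baseline):
    """Exact Shapley values: average marginal contribution over all feature orderings."""
    n = len(x)
    contributions = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = f(current)
        for j in order:
            current[j] = x[j]            # switch feature j from baseline to its real value
            value = f(current)
            contributions[j] += value - previous
            previous = value
    return [c / len(orderings) for c in contributions]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(model, x, baseline)
expected_value = model(baseline)
# Local explanation for this one object, followed by the additivity check
print(phi, expected_value + sum(phi), model(x))
```

The interaction term `x[1] * x[2]` is split evenly between the two features involved, and the contributions add up to the prediction, mirroring how the last column of CatBoost's SHAP output (the expected value) and the per-feature columns combine.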

In [None]:
shap_values = cboost2.get_feature_importance(
    pool_val, 
    'ShapValues'
)

In [None]:
### Get expected value for record 0 for 4 classes, last value is expected value
expected_value_cls1 = shap_values[0,0,-1]
expected_value_cls2 = shap_values[0,1,-1]
expected_value_cls3 = shap_values[0,2,-1]
expected_value_cls4 = shap_values[0,3,-1]

### Get shap values for 4 classes 
shap_values_cls1 = shap_values[0,0,:-1]
shap_values_cls2 = shap_values[0,1,:-1]
shap_values_cls3 = shap_values[0,2,:-1]
shap_values_cls4 = shap_values[0,3,:-1]

### Class 1 SHAP values

In [None]:
import shap
shap.initjs()
shap.force_plot(expected_value_cls1, shap_values_cls1, X_test.iloc[0,1:])

### Class 2 SHAP values

In [None]:
import shap
shap.initjs()
shap.force_plot(expected_value_cls2, shap_values_cls2, X_test.iloc[0,1:])

### Class 3 SHAP values

In [None]:
import shap
shap.initjs()
shap.force_plot(expected_value_cls3, shap_values_cls3, X_test.iloc[0,1:])

### Class 4 SHAP values

In [None]:
import shap
shap.initjs()
shap.force_plot(expected_value_cls4, shap_values_cls4, X_test.iloc[0,1:])

### Collective feature importance

The plots below summarize the impact of each feature across all observations, one plot per class. Features are ordered by overall impact, with the most influential features at the top and the least influential at the bottom.

In [None]:
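for i in range(4):
    # Drop the last column (expected value) and the id column so SHAP values and features line up
    shap.summary_plot(shap_values[:,i,:-1], X_test.iloc[:,1:])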