<center>
<img src="https://habrastorage.org/files/fd4/502/43d/fd450243dd604b81b9713213a247aa20.jpg">
    
## [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course 
### <center> Author: Mikhail Tribunskiy, @MITribunskiy
    
## <center> Tutorial
### <center> "CatBoost overview"

# Introduction
CatBoost is an open-source library for gradient boosting on decision trees, developed by Yandex. 
It can be used to solve problems such as: 
- classification (binary, multi-class)
- regression
- ranking

These tasks differ in the objective function that we try to minimize during gradient descent. Moreover, CatBoost has pre-built metrics to measure the quality of the model.
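To make the difference between objectives concrete, here is a minimal sketch of two common ones in plain Python: Logloss for binary classification and RMSE for regression. The helper names `logloss` and `rmse` are illustrative and are not part of the CatBoost API; the library computes these quantities internally.

```python
import math

def logloss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood for binary labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error for real-valued targets."""
    return math.sqrt(sum((y - f) ** 2 for y, f in zip(y_true, y_pred)) / len(y_true))

print(logloss([1, 0], [0.9, 0.1]))   # confident and correct -> small loss
print(rmse([3.0, 5.0], [2.0, 7.0]))
```

Both functions return a single number where lower is better, which is what makes them usable as objectives.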

On the official CatBoost [website](https://catboost.ai/#benchmark) you can find a comparison of CatBoost with other major gradient boosting libraries on benchmark datasets.

CatBoost introduces the following algorithmic advances:

**1. Categorical features support:**

For data with categorical features, CatBoost often achieves better accuracy than other algorithms. You do not need to preprocess categorical features (for example, with one-hot encoding); you just specify some hyperparameters (shown below; we will also use **HP** as shorthand for hyperparameters).
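One idea behind this support is the ordered target statistic described in the CatBoost paper: each category value is replaced by a smoothed average of the target, computed only from rows that come earlier in some ordering. The snippet below is a simplified illustration of that idea, not the library's implementation (which, among other things, works over random permutations of the rows). The function name and the smoothing parameter `a` are assumptions made for this sketch.

```python
def ordered_target_stats(categories, targets, prior, a=1.0):
    """Encode each row using only the targets of earlier rows with the same category."""
    sums, counts = {}, {}
    encoded = []
    for cat, y in zip(categories, targets):
        s = sums.get(cat, 0.0)
        n = counts.get(cat, 0)
        # smoothed mean of the targets seen so far for this category
        encoded.append((s + a * prior) / (n + a))
        # only now does the current row's target become visible to later rows
        sums[cat] = s + y
        counts[cat] = n + 1
    return encoded

cats = ["a", "b", "a", "a", "b"]
ys = [1, 0, 1, 0, 1]
prior = sum(ys) / len(ys)  # global mean of the target
encoded = ordered_target_stats(cats, ys, prior)
print(encoded)
```

Because a row never sees its own label, the encoding does not leak the target into the feature, which is the point of the "ordered" part.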

**2. Better overfitting handling:**

CatBoost implements ordered boosting, an alternative to the classic boosting algorithm.  
Classic gradient boosting tends to overfit quickly on small datasets. CatBoost has a special modification for such cases, so on small datasets where other algorithms overfit, CatBoost is less prone to the same problem.

**3. Fast and easy-to-use GPU-training:**

The versions of CatBoost installed with pip (*pip install catboost*) or conda (*conda install catboost*) support GPU out of the box. You just need to specify that you want to train your model on GPU in the corresponding HP (shown below).

**4. Other useful features:**

Missing value support (***nan_mode*** HP), great visualization.

Ok, let's get started:

In [None]:
# to run this kernel without errors, catboost must be updated to version 0.14.2 or later
!pip install catboost==0.14.2

In [None]:
import catboost
print(catboost.__version__)

# Classification task

We import the model class for solving classification tasks:

In [None]:
from catboost import CatBoostClassifier

Let's use one of the example datasets shipped with CatBoost. 

In [None]:
from catboost import datasets

train_df, test_df = datasets.amazon() # nice datasets with categorical features only :D
train_df.shape, test_df.shape

In [None]:
train_df.head()

train_df has the same number of columns as test_df, but it contains the label (target) column. Does test_df contain target values, too?

In [None]:
test_df.head()

Unfortunately, no. However, you can check your model predictions by submitting them [here](https://www.kaggle.com/c/amazon-employee-access-challenge/overview). That's an old Kaggle competition but you can still make submissions to test yourself. We will also use the same metric to evaluate our model: [area under the ROC curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic).
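For intuition about the metric: the area under the ROC curve equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. Below is a minimal sketch of that definition; the function name `roc_auc` is illustrative, and the double loop is quadratic, so it is suitable only for tiny examples.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs that are ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```

Note that AUC depends only on the ordering of the scores, not on their absolute values.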

Although the datasets contain numerical values, these features are actually codes for different properties of an employee: manager id, company role code and others. Thus, the features in these datasets are categorical.

Let's separate features and label values:

In [None]:
y = train_df['ACTION']
X = train_df.drop(columns='ACTION') # or X = train_df.drop('ACTION', axis=1)

In [None]:
X_test = test_df.drop(columns='id')

To make further results reproducible we will use a fixed random seed.

In [None]:
SEED = 1

To evaluate the result of training, let's split the training data into train and validation parts.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.25, random_state=SEED)

What time is it? Training time!

In [None]:
%%time

params = {'loss_function':'Logloss', # objective function
          'eval_metric':'AUC', # metric
          'verbose': 200, # output to stdout info about training process every 200 iterations
          'random_seed': SEED
         }
cbc_1 = CatBoostClassifier(**params)
cbc_1.fit(X_train, y_train, # data to train on (required parameters, unless we provide X as a pool object, will be shown below)
          eval_set=(X_valid, y_valid), # data to validate on
          use_best_model=True, # True if we don't want to save trees created after iteration with the best validation score
          plot=True # True for visualization of the training process (it is not shown in a published kernel - try executing this code)
         );

Only 0.829? That's not even in the top 50%. To be honest, our model could show better results if we allowed it to train for more iterations (the default is 1000). Still, how else can we improve our results? First of all, we should finally specify which features are categorical. In the model above, CatBoost treated the categorical features as numerical ones, so the category codes were interpreted as ordered values even though their order carries no meaning.
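Here is a toy illustration of why ordered codes can hurt. When only the "middle" code is associated with the positive class, no single threshold split on the numeric code can separate it, while a split that treats the code as a category can. The data and helper functions are made up for this sketch, and the equality split merely stands in for a categorical treatment; it is not how CatBoost processes categories internally.

```python
# Toy data: three role codes; only the middle code (20) has positive labels.
codes = [10, 10, 20, 20, 30, 30]
labels = [0, 0, 1, 1, 0, 0]

def majority_correct(side):
    """Number of labels on one side of a split that a majority vote gets right."""
    return max(side.count(0), side.count(1)) if side else 0

def best_threshold_accuracy(xs, ys):
    """Best accuracy of a single numeric split 'x <= t'."""
    best = 0.0
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        best = max(best, (majority_correct(left) + majority_correct(right)) / len(ys))
    return best

def equality_split_accuracy(xs, ys, value):
    """Accuracy of a single categorical split 'x == value'."""
    inside = [y for x, y in zip(xs, ys) if x == value]
    outside = [y for x, y in zip(xs, ys) if x != value]
    return (majority_correct(inside) + majority_correct(outside)) / len(ys)

numeric_acc = best_threshold_accuracy(codes, labels)
categorical_acc = equality_split_accuracy(codes, labels, 20)
print(numeric_acc, categorical_acc)
```

A deeper tree can of course recover from this with two numeric splits, but with many arbitrary codes that costs depth that could be spent on real structure.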

Let's fix that problem by specifying the HP ***cat_features=[i1, i2, ... , in]*** (list of integers):

In [None]:
cat_features = list(range(X.shape[1]))
print(cat_features)

Here we created a list of feature (column) indices that we want CatBoost to treat as categorical. In our dataset all features are categorical.

If not all features are categorical but we have their names, we can specify ***cat_features*** like this:

In [None]:
cat_features_names = X.columns # here we specify names of categorical features
cat_features = [X.columns.get_loc(col) for col in cat_features_names]
print(cat_features)

Or if we know how to distinguish a categorical feature from a numerical one:

In [None]:
condition = True # replace with a condition that holds only for the names of categorical features
cat_features_names = [col for col in X.columns if condition]
cat_features = [X.columns.get_loc(col) for col in cat_features_names]
print(cat_features)
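As one possible concrete condition, here is a sketch that picks every non-numeric column of a DataFrame. The toy frame and its column names are hypothetical. Keep in mind that a dtype-based rule would not work for the Amazon data used here, because its categorical features are stored as integer codes.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical toy frame mixing categorical and numerical columns
toy = pd.DataFrame({
    "role_code": ["A1", "B2", "A1"],          # string-coded category
    "dept": pd.Categorical(["x", "y", "x"]),  # pandas categorical
    "salary": [10.0, 12.5, 9.0],              # numerical feature
})

# Treat every column that is not numeric as categorical
toy_cat_names = [col for col in toy.columns if not is_numeric_dtype(toy[col])]
toy_cat_features = [toy.columns.get_loc(col) for col in toy_cat_names]
print(toy_cat_features)
```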

In [None]:
%%time

params = {'loss_function':'Logloss',
          'eval_metric':'AUC',
          'cat_features': cat_features,
          'verbose': 200,
          'random_seed': SEED
         }
cbc_2 = CatBoostClassifier(**params)
cbc_2.fit(X_train, y_train,
          eval_set=(X_valid, y_valid),
          use_best_model=True,
          plot=True
         );

Ah, much better. Moreover, we reached our best result much sooner (iteration 412), though overall training took much more time. We can handle this problem by specifying the HP ***early_stopping_rounds=N***, meaning that if the metric does not improve for N rounds, the model stops training:

In [None]:
%%time

params = {'loss_function':'Logloss',
          'eval_metric':'AUC',
          'cat_features': cat_features,
          'early_stopping_rounds': 200,
          'verbose': 200,
          'random_seed': SEED
         }
cbc_2 = CatBoostClassifier(**params)
cbc_2.fit(X_train, y_train, 
          eval_set=(X_valid, y_valid), 
          use_best_model=True, 
          plot=True
         );
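The stopping rule used above can be sketched in a few lines. This is a generic patience-based rule written for illustration, assuming a metric where higher is better (such as AUC); the exact bookkeeping inside CatBoost's overfitting detector may differ.

```python
def early_stopping_iteration(scores, patience):
    """Return (stop_iteration, best_iteration) for a 'higher is better' metric."""
    best_score = float("-inf")
    best_iter = -1
    for i, score in enumerate(scores):
        if score > best_score:
            best_score, best_iter = score, i
        elif i - best_iter >= patience:
            # no improvement for `patience` consecutive iterations: stop here
            return i, best_iter
    return len(scores) - 1, best_iter  # ran through all iterations

val_auc = [0.70, 0.75, 0.80, 0.79, 0.78, 0.79, 0.77]
stop_iter, best_iter = early_stopping_iteration(val_auc, patience=3)
print(stop_iter, best_iter)
```

Combined with ***use_best_model=True***, the trees built after `best_iter` are discarded, so the extra iterations cost time but not quality.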

By default CatBoost uses the CPU for calculations. What will change if we run the calculations on a GPU? To do so, we need to specify the HP ***task_type='GPU'***. Let's train our model on GPU (without the overfitting-detection HP ***early_stopping_rounds***):

In [None]:
%%time

params = {'loss_function':'Logloss',
          'eval_metric':'AUC',
          'cat_features': cat_features,
          'task_type': 'GPU',
          'verbose': 200,
          'random_seed': SEED
         }
cbc_3 = CatBoostClassifier(**params)
cbc_3.fit(X_train, y_train,
          eval_set=(X_valid, y_valid), 
          use_best_model=True,
          plot=True
         );

Well, the results didn't change much. Still, they are not the same, so it is worth trying. Moreover, some HP can be set only if the model trains on GPU (at least in the CatBoost version used here).  
These are: 
- tree growing policy (***grow_policy***)
- the minimum number of training samples in a leaf (***min_data_in_leaf***)
- the maximum number of leaves in the resulting tree (***max_leaves***)

etc.

These HP may significantly help in model tuning.

On some datasets training on GPU takes much less time. Training on GPU can be sped up further by specifying the HP ***border_count=N***, where N defines the number of splits considered for each feature. The CatBoost documentation suggests setting this parameter to **32** when training on GPU. In many cases, this does not affect the quality of the model but significantly speeds up training. Let's check whether that holds in our case:

In [None]:
%%time

params = {'loss_function':'Logloss',
          'eval_metric':'AUC',
          'cat_features': cat_features,
          'task_type': 'GPU',
          'border_count': 32,
          'verbose': 200,
          'random_seed': SEED
         }
cbc_4 = CatBoostClassifier(**params)
cbc_4.fit(X_train, y_train, 
          eval_set=(X_valid, y_valid), 
          use_best_model=True, 
          plot=True
         );
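To give a feel for what ***border_count*** controls, here is a rough sketch of quantizing a numerical feature into buckets using a fixed number of candidate split points. Placing the borders at evenly spaced quantiles is an assumption made for this example; CatBoost offers several border selection strategies (see the ***feature_border_type*** HP), and the feature values below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(size=1000)  # hypothetical numerical feature

def quantile_borders(x, border_count):
    """Pick `border_count` candidate split points at evenly spaced quantiles."""
    qs = np.linspace(0, 1, border_count + 2)[1:-1]  # drop the 0% and 100% quantiles
    return np.unique(np.quantile(x, qs))

borders = quantile_borders(values, 32)
bins = np.searchsorted(borders, values)  # bucket index for every value
print(len(borders), bins.min(), bins.max())
```

With fewer borders there are fewer candidate splits to evaluate per feature, which is where the speed-up comes from.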

Indeed, it doesn't affect the model quality much. However, training completed almost twice as fast. Nice!

In some cases we may suspect that some features give us misleading information. To experiment with this idea we can either create numerous slices of our data, or simply specify the model HP ***ignored_features=[i1, i2, ... , in]***, a list of column indices we want to ignore.

First, let's create columns with data that will confuse our model:

In [None]:
import numpy as np
import warnings
warnings.filterwarnings("ignore")

In [None]:
np.random.seed(SEED)
noise_cols = [f'noise_{i}' for i in range(5)]
for col in noise_cols:
    X_train[col] = y_train * np.random.rand(X_train.shape[0])
    X_valid[col] = np.random.rand(X_valid.shape[0])

In [None]:
X_train.head()
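Why are these columns so harmful? In the training part the noise is multiplied by the label, so it is exactly zero whenever the label is 0 and therefore leaks the target; in the validation part it is pure noise. The sketch below reproduces that construction on synthetic labels (the labels and sizes are made up for illustration) and compares the correlation with the target in both parts.

```python
import numpy as np

rng = np.random.default_rng(1)
y_tr = rng.integers(0, 2, size=1000)  # hypothetical binary training labels
y_va = rng.integers(0, 2, size=1000)  # hypothetical binary validation labels

noise_tr = y_tr * rng.random(1000)  # zero whenever the label is 0 -> leaks the target
noise_va = rng.random(1000)         # independent of the label

corr_tr = np.corrcoef(noise_tr, y_tr)[0, 1]
corr_va = np.corrcoef(noise_va, y_va)[0, 1]
print(round(corr_tr, 2), round(corr_va, 2))
```

A model that learns to rely on such a feature looks great on the training data and then fails on validation data, where the relationship does not exist.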

In [None]:
%%time

params = {'loss_function':'Logloss',
          'eval_metric':'AUC',
          'cat_features': cat_features,
          'verbose': 200,
          'random_seed': SEED
         }
cbc_5 = CatBoostClassifier(**params)
cbc_5.fit(X_train, y_train, 
          eval_set=(X_valid, y_valid), 
          use_best_model=True, 
          plot=True
         );

Wow, didn't expect it to be that low. Let's specify columns which we want to ignore:

In [None]:
ignored_features = list(range(X_train.shape[1] - 5, X_train.shape[1]))
print(ignored_features)

In [None]:
%%time

params = {'loss_function':'Logloss',
          'eval_metric':'AUC',
          'cat_features': cat_features,
          'ignored_features': ignored_features,
          'early_stopping_rounds': 200,
          'verbose': 200,
          'random_seed': SEED
         }
cbc_6 = CatBoostClassifier(**params)
cbc_6.fit(X_train, y_train, 
          eval_set=(X_valid, y_valid), 
          use_best_model=True, 
          plot=True
         );

Problem fixed. Good. We obtained the same results as with the cbc_2 model, where there were no misleading features. Now let's get rid of them:

In [None]:
X_train = X_train.drop(columns=noise_cols)
X_valid = X_valid.drop(columns=noise_cols)

In [None]:
X_train.head()

The CatBoostClassifier.fit() method can also accept a Pool object as training data:

In [None]:
from catboost import Pool

train_data = Pool(data=X_train,
                  label=y_train,
                  cat_features=cat_features
                 )

valid_data = Pool(data=X_valid,
                  label=y_valid,
                  cat_features=cat_features
                 )

In [None]:
%%time

params = {'loss_function':'Logloss',
          'eval_metric':'AUC',
#           'cat_features': cat_features, # we don't need to specify this parameter as 
#                                           pool object contains info about categorical features
          'early_stopping_rounds': 200,
          'verbose': 200,
          'random_seed': SEED
         }

cbc_7 = CatBoostClassifier(**params)
cbc_7.fit(train_data, # instead of X_train, y_train
          eval_set=valid_data, # instead of (X_valid, y_valid)
          use_best_model=True, 
          plot=True
         );

As we can see, we obtained the same results as for cbc_2. Then why should we bother creating Pool objects? A Pool object has some nice methods. For example, some parts of our data may be outdated or inaccurate. With ***Pool.set_weight()*** we can assign weights to instances (rows) of our data. Or we can divide the data into groups using ***Pool.set_group_id()*** and play with different weights for different groups using ***Pool.set_group_weight()***. We may also have a baseline calculated. In that case we can provide initial formula values for all input objects using ***Pool.set_baseline()***, so training starts from these values instead of from zero.

Finally, a Pool object is also a convenient way to keep related parts of the data (features, labels, categorical feature indices) together.
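To see what per-row weights mean for training, here is a minimal sketch of a weighted Logloss, assuming the usual definition as a weighted average of per-row losses. The function is illustrative and not part of the CatBoost API.

```python
import math

def weighted_logloss(y_true, p_pred, weights):
    """Weighted average of per-row log losses."""
    total = 0.0
    for y, p, w in zip(y_true, p_pred, weights):
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / sum(weights)

y_rows = [1, 1]
p_rows = [0.9, 0.1]  # the second row is predicted badly

equal = weighted_logloss(y_rows, p_rows, [1.0, 1.0])
downweighted = weighted_logloss(y_rows, p_rows, [1.0, 0.0])
print(equal, downweighted)
```

Giving an unreliable row a small weight reduces how strongly the model is pushed to fit it.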

For a more thorough cross-validation procedure we can use **cv** from catboost:

In [None]:
from catboost import cv

In [None]:
%%time

params = {'loss_function':'Logloss',
          'eval_metric':'AUC',
          'verbose': 200,
          'random_seed': SEED
         }

all_train_data = Pool(data=X,
                      label=y,
                      cat_features=cat_features
                     )

scores = cv(pool=all_train_data,
            params=params, 
            fold_count=4,
            seed=SEED, 
            shuffle=True,
            stratified=True, # if True the folds are made by preserving the percentage of samples for each class
            plot=True
           )
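The effect of stratification is easy to check on a toy example. The sketch below uses scikit-learn's `StratifiedKFold` (not CatBoost's own fold logic) on made-up imbalanced labels and counts how many minority-class rows land in each validation fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y_toy = np.array([1] * 16 + [0] * 4)  # imbalanced toy labels: 80% positive
X_toy = np.zeros((len(y_toy), 1))     # features are irrelevant for the split

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=1)
neg_per_fold = [int((y_toy[valid_idx] == 0).sum())
                for _, valid_idx in skf.split(X_toy, y_toy)]
print(neg_per_fold)
```

Every validation fold gets its share of the rare class, so a metric such as AUC can be computed on each fold.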

Now, what shall we do to estimate feature importance?

First of all, let's check the result of the **.get_feature_importance()** method of the fitted model:

In [None]:
cbc_7.get_feature_importance(prettified=True)

In [None]:
import pandas as pd

feature_importance_df = pd.DataFrame(cbc_7.get_feature_importance(prettified=True), columns=['feature', 'importance'])
feature_importance_df

Or in a more illustrative way:

In [None]:
from matplotlib import pyplot as plt
import seaborn as sns

plt.figure(figsize=(12, 6));
sns.barplot(x="importance", y="feature", data=feature_importance_df);
plt.title('CatBoost features importance:');

Let's go deeper:

In [None]:
import shap
explainer = shap.TreeExplainer(cbc_7) # insert your model
shap_values = explainer.shap_values(train_data) # insert your train Pool object

shap.initjs()
shap.force_plot(explainer.expected_value, shap_values[:100,:], X_train.iloc[:100,:])

That's an interactive plot. You can analyse your model by switching parameters for both abscissa and ordinate. Keep in mind that it's made from the slice of the data (first 100 instances).

Let us check summary plot:

In [None]:
shap.summary_plot(shap_values, X_train)

Looks like it matters who your manager is (MGR_ID) :D

In the diagram above, every employee (instance/row in our dataset) is represented by one dot in each row. The x position of the dot is the impact of that feature on the model’s prediction, and the color of the dot represents the value of that feature for that particular employee. Dots that do not fit on the row pile up to show density. Here we can see that the 'ROLE_ROLLUP_1' and 'ROLE_CODE' features have a low impact on the model prediction, and for most employees their impact is almost zero.
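A useful property behind these plots is additivity: for each row, the expected value plus the sum of that row's SHAP values equals the model output. The sketch below checks this on a plain linear model, where Shapley values have a simple closed form when features are treated as independent. It is not the tree algorithm used by `shap.TreeExplainer`; the data, weights and names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X_bg = rng.normal(size=(500, 3))  # hypothetical background data
w = np.array([2.0, -1.0, 0.0])    # the third feature has no effect
b = 0.5

def f(X):
    """A simple linear model."""
    return X @ w + b

expected_value = f(X_bg).mean()    # average model output over the background data
x = X_bg[0]                        # the row we want to explain
phi = w * (x - X_bg.mean(axis=0))  # Shapley values of a linear model

print(np.isclose(expected_value + phi.sum(), f(x)))
print(phi)
```

The feature with zero weight gets a zero contribution, just as the low-impact features in the summary plot cluster around zero.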

Finally, let us make a prediction:

In [None]:
%%time

from sklearn.model_selection import StratifiedKFold

n_fold = 4 # number of data folds
folds = StratifiedKFold(n_splits=n_fold, shuffle=True, random_state=SEED)

params = {'loss_function':'Logloss',
          'eval_metric':'AUC',
          'verbose': 200,
          'random_seed': SEED
         }

test_data = Pool(data=X_test,
                 cat_features=cat_features)

scores = []
prediction = np.zeros(X_test.shape[0])
for fold_n, (train_index, valid_index) in enumerate(folds.split(X, y)):
    
    X_train, X_valid = X.iloc[train_index], X.iloc[valid_index] # train and validation data splits
    y_train, y_valid = y.iloc[train_index], y.iloc[valid_index] # positional indexing, same as for X
    
    train_data = Pool(data=X_train, 
                      label=y_train,
                      cat_features=cat_features)
    valid_data = Pool(data=X_valid, 
                      label=y_valid,
                      cat_features=cat_features)
    
    model = CatBoostClassifier(**params)
    model.fit(train_data,
              eval_set=valid_data, 
              use_best_model=True
             )
    
    score = model.get_best_score()['validation_0']['AUC']
    scores.append(score)

    y_pred = model.predict_proba(test_data)[:, 1]
    prediction += y_pred

prediction /= n_fold
print('CV mean: {:.4f}, CV std: {:.4f}'.format(np.mean(scores), np.std(scores)))

In [None]:
import pandas as pd

sub = pd.read_csv('../input/amazon-employee-access-challenge/sampleSubmission.csv')
sub['Action'] = prediction
sub_name = 'catboost_submission.csv'
sub.to_csv(sub_name, index=False)

print(f'Saving submission file as: {sub_name}')

Let us submit it [here](https://www.kaggle.com/c/amazon-employee-access-challenge/submit). 0.90741 private score! That's a silver medal! Too bad the competition closed 6 years ago :D

### Some comments on Regression tasks

CatBoost also has tools for solving regression problems:

In [None]:
from catboost import CatBoostRegressor

Generally speaking, it differs from CatBoostClassifier only by the objective function (*Root Mean Square Error* (**RMSE**) for regression tasks by default) and by final predictions :D Training is done by the same method **.fit()** with the same tuning HP (more parameters [here](https://catboost.ai/docs/concepts/python-reference_parameters-list.html)).

If you are interested in other objectives and metrics, I would advise checking the [Objectives and metrics](https://catboost.ai/docs/concepts/loss-functions.html) section in the CatBoost documentation.

# Summary

To make a long story short, CatBoost provides useful tools for working easily with data that has many categorical features. It shows solid results when trained on unprocessed categorical features. Computation on GPU is enabled by a single parameter. Moreover, CatBoost has really comprehensive documentation.


# Resources

1. [CatBoost documentation](https://catboost.ai/docs/)
2. [CatBoost tutorials repository](https://github.com/catboost/tutorials)
3. [Introduction to gradient boosting on decision trees with Catboost](https://towardsdatascience.com/introduction-to-gradient-boosting-on-decision-trees-with-catboost-d511a9ccbd14)
4. [Working with categorical data: Catboost](https://medium.com/whats-your-data/working-with-categorical-data-catboost-8b5e11267a37)
5. [Interpretable Machine Learning with XGBoost](https://towardsdatascience.com/interpretable-machine-learning-with-xgboost-9ec80d148d27)