<a id="section-top"></a>

* [Introduction](#section-zero)
* [Data Preparation](#section-one)
* [Model Comparison](#section-two)

    - [Input: Numpy Arrays](#section-two-one)
    - [Input: Catboost Pool](#section-two-two)
    - [Catboost Pool with quantize](#section-two-three)


* [Conclusion](#section-three)
* [More](#section-four)

<a id="section-zero"></a>
# 0. Introduction

In this notebook, I will compare CatBoost's training speed with three different input types:

**Numpy arrays**,

**Catboost Pool**,

**Catboost Pool w/ quantize**


<a id="section-one"></a>
# 1. Data Preparation

In [None]:
import pandas as pd
import numpy as np
import time
import random
import os

from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

from sklearn.metrics import roc_auc_score
from catboost import CatBoostClassifier, Pool

In [None]:
def seed_everything(seed=42):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)

seed_everything(666)

In [None]:
train = pd.read_csv("../input/tabular-playground-series-sep-2021/train.csv")
test = pd.read_csv("../input/tabular-playground-series-sep-2021/test.csv")

In [None]:
train["missing"] = train.isnull().sum(axis = 1)
test["missing"] = test.isnull().sum(axis = 1)

In [None]:
target = "claim"
predictors = [x for x in train.columns if x not in ["id", target]]

kf = KFold(n_splits = 5, shuffle = True, random_state = 666)

In [None]:
# Per-group feature means are learned on train only, then looked up by each row's "missing" count
group_means = train.groupby("missing")[predictors].mean()
for df in (train, test):
    df[predictors] = df[predictors].fillna(group_means.reindex(df["missing"]).set_axis(df.index))

In [None]:
scaler = StandardScaler()

train[predictors] = scaler.fit_transform(train[predictors])
test[predictors] = scaler.transform(test[predictors])

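In short, I created a new feature, "missing" (the number of missing values in each row), filled the missing values with the mean of the rows that share the same "missing" count, and scaled the features with a standard scaler.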
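
The group-wise filling step can be illustrated on a toy frame. This is only a minimal sketch: the column names `f1` and `f2`, the values, and the helper `fill_from_groups` are made up for illustration, and it assumes the intent is to learn the per-group means on the training rows and look them up for both frames by each row's "missing" count.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from the competition files)
train = pd.DataFrame({"f1": [1.0, np.nan, 3.0, 5.0], "f2": [np.nan, 2.0, 4.0, 6.0]})
test = pd.DataFrame({"f1": [np.nan, 7.0], "f2": [8.0, 9.0]})
features = ["f1", "f2"]

# Count missing values per row before anything is filled
train["missing"] = train[features].isnull().sum(axis=1)
test["missing"] = test[features].isnull().sum(axis=1)

# Per-group feature means, learned from the training rows only
group_means = train.groupby("missing")[features].mean()

def fill_from_groups(df):
    # Pick the mean row that matches each record's group, then realign to df's index
    lookup = group_means.reindex(df["missing"])
    lookup.index = df.index
    return df[features].fillna(lookup)

train[features] = fill_from_groups(train)
test[features] = fill_from_groups(test)

print(train)
print(test)
```

Looking the means up by group keeps the test rows from being filled by position, which is what happens when a train-indexed frame is passed to `fillna` on the test set.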

<a id="section-two"></a>
# 2. Model Comparison

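I define a simple CatBoost classifier with mostly default parameters. For a fair comparison, I only set the iterations to 2000 and the learning rate to 0.1, use AUC as the evaluation metric, and train on the GPU.

For the comparison, I will use out-of-fold predictions.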
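
As a reminder of what out-of-fold prediction means, here is a minimal sketch on toy data. The target values and the stand-in "model" (it just predicts the mean of the training rows) are assumptions for illustration only; the real cells use the CatBoost classifier.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy target (hypothetical): 10 rows
y_toy = np.arange(10, dtype=float)
oof = np.full(len(y_toy), np.nan)

kf_toy = KFold(n_splits=5, shuffle=True, random_state=0)

for train_ix, valid_ix in kf_toy.split(y_toy):
    # Stand-in "model": predict the mean target of the training rows
    oof[valid_ix] = y_toy[train_ix].mean()

# Every row now holds a prediction from a model that never saw that row
print(oof)
```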

In [None]:
X = train[predictors]
y = train[target]
test = test[predictors]

In [None]:
model_cb = CatBoostClassifier(
    random_seed = 666,
    thread_count = -1,
    iterations = 2000,
    learning_rate = 0.1,
    eval_metric = "AUC",
    task_type = "GPU"
)

[take me to the top](#section-top)

<a id="section-two-one"></a>
# 2.1 Input: Numpy Arrays

2000 iterations with a learning rate of 0.1.

Out-of-fold predictions over 5 folds.

**The input is passed as plain arrays (here, pandas DataFrame slices), which is what we generally do.**

In [None]:
start = time.time()

oof_cb = np.zeros(len(X))

i = 0

while i < 5:
    
    for train_ix, test_ix in kf.split(X.values):
    
        train_X, train_y = X.iloc[train_ix], y.iloc[train_ix]
        test_X, test_y = X.iloc[test_ix], y.iloc[test_ix]

        model_cb.fit(
            train_X, train_y,
            eval_set = [(test_X, test_y)],
            early_stopping_rounds = 50,
            use_best_model = True,
            verbose = 0,
        )

        oof_cb[test_ix] = oof_cb[test_ix] + model_cb.predict_proba(test_X)[:, 1]
        
    i += 1
    
print("AUC score: \033[1m{}\033[0m".format(round(roc_auc_score(y, oof_cb), 5)))

elapsed_time = time.time() - start

print("\nAverage Elapsed time for \033[1mnumpy array\033[0m input: \t\t \033[1m{}\033[0m seconds".format(elapsed_time / 5))
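
One detail of the loop above: `oof_cb` accumulates the predictions of all five repetitions and is never divided by five, so it holds sums rather than probabilities. The AUC is unaffected because it only depends on the ranking of the scores. The sketch below checks this with a small pairwise AUC function; the labels, the two hypothetical "runs" (two instead of five, to keep it short), and the helper name `auc` are made up for illustration.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1, 0, 1]

# Hypothetical out-of-fold probabilities from two repetitions
runs = [
    [0.10, 0.40, 0.35, 0.80, 0.20, 0.90],
    [0.12, 0.38, 0.36, 0.79, 0.22, 0.88],
]

summed = [sum(col) for col in zip(*runs)]      # what the accumulation produces
averaged = [s / len(runs) for s in summed]     # proper mean probabilities

print(auc(labels, summed), auc(labels, averaged))
```

If the out-of-fold values are needed as probabilities later (for stacking, for example), they should be divided by the number of repetitions first.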

[take me to the top](#section-top)

<a id="section-two-two"></a>
# 2.2 Input: Catboost Pool

https://catboost.ai/docs/concepts/python-reference_pool.html

A Pool object can be created simply:

> **Pool(data, label, ...)**

For example:

`Pool(X, label = y)`

In [None]:
?Pool

In [None]:
start = time.time()

train_pool = Pool(X, label = y)

oof_cb = np.zeros(len(X))

i = 0

while i < 5:
    
    for train_ix, test_ix in kf.split(X.values):

        tr_pool = train_pool.slice(train_ix)
        val_pool = train_pool.slice(test_ix)

        model_cb.fit(
            tr_pool,
            eval_set = val_pool,
            early_stopping_rounds = 50,
            use_best_model = True,
            verbose = 0,
        )

        oof_cb[test_ix] = oof_cb[test_ix] + model_cb.predict_proba(val_pool)[:, 1]
        
    i += 1
    
print("AUC score: \033[1m{}\033[0m".format(round(roc_auc_score(y, oof_cb), 5)))

elapsed_time = time.time() - start

print("\nAverage Elapsed time for \033[1mCatboost Pool\033[0m input: \t\t \033[1m{}\033[0m seconds".format(elapsed_time / 5))

[take me to the top](#section-top)

<a id="section-two-three"></a>
# 2.3 Catboost Pool with quantize

https://catboost.ai/docs/concepts/speed-up-training.html

> **By default, the train and test datasets are quantized each time that the boosting is run.**
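
To make the idea concrete: quantization replaces each float feature value with the index of the bin it falls into, and the bin borders have to be computed from the data first. The sketch below is a conceptual illustration only. The border rule (evenly spaced ranks) and the helper names `compute_borders` and `quantize_values` are my own assumptions and are not CatBoost's actual algorithm.

```python
from bisect import bisect_right

def compute_borders(values, n_bins):
    """Pick n_bins - 1 split points at evenly spaced ranks of the sorted values."""
    ordered = sorted(values)
    return [ordered[len(ordered) * k // n_bins] for k in range(1, n_bins)]

def quantize_values(values, borders):
    """Replace each float with the index of the bin it falls into."""
    return [bisect_right(borders, v) for v in values]

feature = [0.3, 2.5, 1.1, 4.8, 3.9, 0.7, 2.2, 4.1]   # hypothetical feature column

borders = compute_borders(feature, n_bins=4)   # computed once
binned = quantize_values(feature, borders)     # computed once, reusable by every fit

print(borders)
print(binned)
```

Doing this work once up front, instead of at the start of every training run, is where the time saving comes from when the same data is fitted many times.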


In [None]:
start = time.time()

train_pool = Pool(X, label = y)
train_pool.quantize(task_type = "GPU")

oof_cb = np.zeros(len(X))

i = 0

while i < 5:
    
    for train_ix, test_ix in kf.split(X.values):

        tr_pool = train_pool.slice(train_ix)
        val_pool = train_pool.slice(test_ix)

        model_cb.fit(
            tr_pool,
            eval_set = val_pool,
            early_stopping_rounds = 50,
            use_best_model = True,
            verbose = 0,
        )

        oof_cb[test_ix] = oof_cb[test_ix] + model_cb.predict_proba(val_pool)[:, 1]
    
    i += 1
    
print("AUC score: \033[1m{}\033[0m".format(round(roc_auc_score(y, oof_cb), 5)))

elapsed_time = time.time() - start

print("\nAverage Elapsed time for \033[1mCatboost Pool quantized\033[0m input: \t\t \033[1m{}\033[0m seconds".format(elapsed_time / 5))
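
The three timing cells share the same pattern, so it could be factored into a small helper. This is only a sketch: `average_runtime` and `fake_cv_pass` are hypothetical names, and the stand-in workload replaces a real cross-validation pass.

```python
import time

def average_runtime(run_once, repeats=5):
    """Call run_once() `repeats` times and return the mean wall-clock seconds per call."""
    start = time.perf_counter()
    for _ in range(repeats):
        run_once()
    return (time.perf_counter() - start) / repeats

def fake_cv_pass():
    # Hypothetical stand-in for one full 5-fold cross-validation pass
    sum(i * i for i in range(10_000))

seconds = average_runtime(fake_cv_pass, repeats=5)
print("Average elapsed time: {:.4f} seconds".format(seconds))
```

Note that in the cells above, the Pool construction and the quantization happen once, outside the repeated loop, but are still included in the measured time.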

[take me to the top](#section-top)

<a id="section-three"></a>

# 3. Conclusion

**Reusing a quantized dataset is faster than the other input methods, as mentioned** [here](https://catboost.ai/docs/concepts/speed-up-training.html#reuzing-quantized-datasets)

**You should use Pool for CatBoost models. It improves training speed drastically.**


**Note**: Using CatBoost with GPU doesn't guarantee reproducibility, so results can differ between runs. https://github.com/catboost/catboost/issues/546#issuecomment-440647874


[take me to the top](#section-top)

<a id="section-four"></a>

# 4. More

Catboost GPU performance - https://github.com/catboost/catboost/issues/505#issuecomment-431484934

Catboost GPU reproducibility - https://github.com/catboost/catboost/issues/546

Catboost Pool documentation - https://catboost.ai/docs/concepts/python-reference_pool.html

Catboost speeding up training suggestions - https://catboost.ai/docs/concepts/speed-up-training.html



**My notebooks similar to this one:**

https://www.kaggle.com/mustafacicek/subsample-for-boosting-models

https://www.kaggle.com/mustafacicek/xgboost-train-and-fit-comparison


[take me to the top](#section-top)