# CatBoost basics

For this homework we will use the Amazon Employee Access Challenge dataset from a [Kaggle](https://www.kaggle.com) competition for our experiments. The data can be downloaded [here](https://www.kaggle.com/c/amazon-employee-access-challenge/data).

As a result of this tutorial you need to provide a TSV file with answers.
There are 17 questions in this tutorial. The resulting TSV file should consist of 17 lines; each line should contain the number of the question and the answer to it, separated by a tab. Questions are numbered from 1 to 17.
See an example of the resulting file here.
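
The answer file format described above can be produced with a few lines of Python. The sketch below is only an illustration: the helper name `format_answers` and the file name `answers.tsv` are assumptions, not part of the course tooling, and the three sample answers are the values this notebook prints for questions 1 to 3.

```python
def format_answers(answers):
    """Return TSV text with one 'question<TAB>answer' line per question."""
    return "".join(f"{number}\t{answer}\n" for number, answer in sorted(answers.items()))


# Sample answers for questions 1-3 (values printed in this notebook).
answers = {1: 1897, 2: 30872, 3: 7518}
tsv_text = format_answers(answers)
print(tsv_text, end="")

# To save the result (the file name is a hypothetical choice):
# with open("answers.tsv", "w") as f:
#     f.write(tsv_text)
```

A complete submission would contain all 17 entries in the dictionary.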

## Reading the data

Let's first download the data and put it into the folder `amazon`. Now we will read the data from the file.

In [2]:
!pip install catboost

Collecting catboost
  Downloading https://files.pythonhosted.org/packages/96/6c/6608210b29649267de52001b09e369777ee2a5cfe1c71fa75eba82a4f2dc/catboost-0.24-cp36-none-manylinux1_x86_64.whl (65.9MB)
     |████████████████████████████████| 65.9MB 45kB/s
Installing collected packages: catboost
Successfully installed catboost-0.24


In [3]:
!wget https://raw.githubusercontent.com/hse-aml/competitive-data-science/master/Programming_assignment_week_4_Catboost/grader_v2.py

--2020-08-07 03:13:50--  https://raw.githubusercontent.com/hse-aml/competitive-data-science/master/Programming_assignment_week_4_Catboost/grader_v2.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3748 (3.7K) [text/plain]
Saving to: ‘grader_v2.py’


2020-08-07 03:13:50 (58.4 MB/s) - ‘grader_v2.py’ saved [3748/3748]



In [4]:
import pandas as pd
import numpy as np
np.set_printoptions(precision=4)
import catboost
from catboost import datasets
from catboost import *

from grader_v2 import Grader

In [5]:
train_df, test_df = catboost.datasets.amazon()
train_df.head()

Unnamed: 0,ACTION,RESOURCE,MGR_ID,ROLE_ROLLUP_1,ROLE_ROLLUP_2,ROLE_DEPTNAME,ROLE_TITLE,ROLE_FAMILY_DESC,ROLE_FAMILY,ROLE_CODE
0,1,39353,85475,117961,118300,123472,117905,117906,290919,117908
1,1,17183,1540,117961,118343,123125,118536,118536,308574,118539
2,1,36724,14457,118219,118220,117884,117879,267952,19721,117880
3,1,36135,5396,117961,118343,119993,118321,240983,290919,118322
4,1,42680,5905,117929,117930,119569,119323,123932,19793,119325


In [6]:
grader = Grader()

## Preparing your data

Label values extraction

In [7]:
y = train_df.ACTION
X = train_df.drop('ACTION', axis=1)

Categorical features declaration

In [8]:
cat_features = list(range(0, X.shape[1]))
print(cat_features)

[0, 1, 2, 3, 4, 5, 6, 7, 8]


Now it makes sense to analyze the dataset.
First you need to calculate how many positive and negative objects are present in the train dataset.

**Question 1:**

How many negative objects are present in the train dataset X?

In [9]:
zero_count = y.shape[0] - y.sum()
grader.submit_tag('negative_samples', zero_count)

Current answer for task negative_samples is: 1897


**Question 2:**

How many positive objects are present in the train dataset X?

In [10]:
one_count = y.sum()
grader.submit_tag('positive_samples', one_count)

Current answer for task positive_samples is: 30872


In [11]:
print('Zero count = ' + str(zero_count) + ', One count = ' + str(one_count))

Zero count = 1897, One count = 30872
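
The two counts show how unbalanced the classes are. As a rough sketch (plain Python, using only the numbers printed above), the share of positive objects is also the accuracy that a trivial "always predict 1" classifier would reach on the full training set. The share on the validation slice used later may differ somewhat.

```python
zero_count = 1897    # negative objects, from the output above
one_count = 30872    # positive objects, from the output above

total = zero_count + one_count
positive_share = one_count / total

print(f"total objects  = {total}")
print(f"positive share = {positive_share:.4f}")
```

Keeping this baseline in mind helps when reading Accuracy values later: an accuracy only slightly above the positive share says little about the model.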


Now, for every feature, you need to calculate the number of unique values of that feature.

**Question 3:**
    
How many unique values does the feature RESOURCE have?

In [12]:
unique_vals_for_RESOURCE = X['RESOURCE'].nunique()
grader.submit_tag('resource_unique_values', unique_vals_for_RESOURCE)

Current answer for task resource_unique_values is: 7518


Now we can create a Pool object. This type is used for datasets in CatBoost. You can also use a numpy array or a dataframe, but working with the Pool class is the most efficient way in terms of memory and speed. We recommend creating a Pool from a file if your data is on disk, or from FeaturesData if you use numpy.

In [15]:
import sys
import os

In [16]:
print(sys.path)

['', '/env/python', '/usr/lib/python36.zip', '/usr/lib/python3.6', '/usr/lib/python3.6/lib-dynload', '/usr/local/lib/python3.6/dist-packages', '/usr/lib/python3/dist-packages', '/usr/local/lib/python3.6/dist-packages/IPython/extensions', '/root/.ipython']


In [20]:
!ls /usr/local/lib/python3.6/ | grep 

ls: cannot access '/env/python': No such file or directory


In [21]:
import numpy as np
from catboost import Pool

pool1 = Pool(data=X, label=y, cat_features=cat_features)
#pool2 = Pool(data='/usr/lib/python3.6/site-packages/catboost/cached_datasets/amazon/train.csv', delimiter=',', has_header=True)
pool2 = Pool(data=train_df, delimiter=',', has_header=True)
pool3 = Pool(data=X, cat_features=cat_features)

print('Dataset shape')
print('dataset 1:' + str(pool1.shape) + '\ndataset 2:' + str(pool2.shape)  + '\ndataset 3:' + str(pool3.shape))

print('\n')
print('Column names')
print('dataset 1: ')
print(pool1.get_feature_names()) 
print('\ndataset 2:')
print(pool2.get_feature_names())
print('\ndataset 3:')
print(pool3.get_feature_names())

Dataset shape
dataset 1:(32769, 9)
dataset 2:(32769, 10)
dataset 3:(32769, 9)


Column names
dataset 1: 
['RESOURCE', 'MGR_ID', 'ROLE_ROLLUP_1', 'ROLE_ROLLUP_2', 'ROLE_DEPTNAME', 'ROLE_TITLE', 'ROLE_FAMILY_DESC', 'ROLE_FAMILY', 'ROLE_CODE']

dataset 2:
['ACTION', 'RESOURCE', 'MGR_ID', 'ROLE_ROLLUP_1', 'ROLE_ROLLUP_2', 'ROLE_DEPTNAME', 'ROLE_TITLE', 'ROLE_FAMILY_DESC', 'ROLE_FAMILY', 'ROLE_CODE']

dataset 3:
['RESOURCE', 'MGR_ID', 'ROLE_ROLLUP_1', 'ROLE_ROLLUP_2', 'ROLE_DEPTNAME', 'ROLE_TITLE', 'ROLE_FAMILY_DESC', 'ROLE_FAMILY', 'ROLE_CODE']


## Split your data into train and validation

When training your model, you will have to detect overfitting and select the best parameters. To do that you need a validation dataset.
Normally you would use a random split, for example
`train_test_split` from `sklearn.model_selection`.
But for the purpose of this homework the train part will be the first 80% of the data and the evaluation part will be the last 20% of the data.

In [22]:
train_count = int(X.shape[0] * 0.8)

X_train = X.iloc[:train_count,:]
y_train = y[:train_count]
X_validation = X.iloc[train_count:, :]
y_validation = y[train_count:]

## Train your model

Now we will train our first model.

In [23]:
from catboost import CatBoostClassifier
model = CatBoostClassifier(
    iterations=5,
    random_seed=0,
    learning_rate=0.1
)
model.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_validation, y_validation),
    logging_level='Silent'
)
print('Model is fitted: ' + str(model.is_fitted()))
print('Model params:')
print(model.get_params())

Model is fitted: True
Model params:
{'random_seed': 0, 'learning_rate': 0.1, 'iterations': 5}


## Stdout of the training

In stdout you can see the values of the loss function on each iteration, or on each k-th iteration.
You can also see how much time has passed since the start of the training and how much time is left.
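
The loss function reported for this binary classifier is Logloss (the same name is used later in `best_score_['validation']['Logloss']`). As a minimal sketch of what that number means, the function below computes the average negative log-likelihood in plain Python. The sample labels and probabilities are made up for illustration, and the clipping constant `eps` is an assumption added to avoid `log(0)`.

```python
import math


def logloss(labels, probabilities, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted P(y=1)."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1 - eps)  # keep p strictly inside (0, 1)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(labels)


# Hypothetical labels and predicted probabilities of class 1.
labels = [1, 1, 0, 1]
probabilities = [0.9, 0.8, 0.3, 0.6]
print(logloss(labels, probabilities))
```

Lower is better: confident correct predictions push the value towards 0, while a constant prediction of 0.5 gives ln 2 ≈ 0.693.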

In [24]:
from catboost import CatBoostClassifier
model = CatBoostClassifier(
    iterations=15,
    verbose=3
)
model.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_validation, y_validation),
)

Learning rate set to 0.441257
0:	learn: 0.4218644	test: 0.4238882	best: 0.4238882 (0)	total: 10.1ms	remaining: 141ms
3:	learn: 0.2323838	test: 0.2357560	best: 0.2357560 (3)	total: 62.9ms	remaining: 173ms
6:	learn: 0.1868476	test: 0.1795406	best: 0.1795406 (6)	total: 118ms	remaining: 135ms
9:	learn: 0.1786571	test: 0.1689628	best: 0.1689628 (9)	total: 171ms	remaining: 85.5ms
12:	learn: 0.1731382	test: 0.1639559	best: 0.1639559 (12)	total: 226ms	remaining: 34.8ms
14:	learn: 0.1720164	test: 0.1623020	best: 0.1623020 (14)	total: 260ms	remaining: 0us

bestTest = 0.1623019855
bestIteration = 14



<catboost.core.CatBoostClassifier at 0x7fab4da36b38>

## Random seed

If you don't specify random_seed then random seed will be set to a new value each time.
After the training has finished you can look at the value of the random seed that was set.
If you train again with this random_seed, you will get the same results.

In [25]:
from catboost import CatBoostClassifier
model = CatBoostClassifier(
    iterations=5
)
model.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_validation, y_validation),
)

Learning rate set to 0.5
0:	learn: 0.3963217	test: 0.3986149	best: 0.3986149 (0)	total: 10.5ms	remaining: 42ms
1:	learn: 0.2934864	test: 0.2972563	best: 0.2972563 (1)	total: 20.1ms	remaining: 30.1ms
2:	learn: 0.2495301	test: 0.2543897	best: 0.2543897 (2)	total: 30.3ms	remaining: 20.2ms
3:	learn: 0.2235704	test: 0.2267986	best: 0.2267986 (3)	total: 49.3ms	remaining: 12.3ms
4:	learn: 0.2003235	test: 0.1962958	best: 0.1962958 (4)	total: 68.2ms	remaining: 0us

bestTest = 0.1962957582
bestIteration = 4



<catboost.core.CatBoostClassifier at 0x7fab4da36eb8>

In [26]:
random_seed = model.random_seed_
print('Used random seed = ' + str(random_seed))
model = CatBoostClassifier(
    iterations=5,
    random_seed=random_seed
)
model.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_validation, y_validation),
)

Used random seed = 0
Learning rate set to 0.5
0:	learn: 0.3963217	test: 0.3986149	best: 0.3986149 (0)	total: 10.5ms	remaining: 42.2ms
1:	learn: 0.2934864	test: 0.2972563	best: 0.2972563 (1)	total: 20.1ms	remaining: 30.1ms
2:	learn: 0.2495301	test: 0.2543897	best: 0.2543897 (2)	total: 29.7ms	remaining: 19.8ms
3:	learn: 0.2235704	test: 0.2267986	best: 0.2267986 (3)	total: 48.5ms	remaining: 12.1ms
4:	learn: 0.2003235	test: 0.1962958	best: 0.1962958 (4)	total: 71.9ms	remaining: 0us

bestTest = 0.1962957582
bestIteration = 4



<catboost.core.CatBoostClassifier at 0x7fab4da49080>

Try training 10 models with the parameters below and calculate the mean and the standard deviation of the Logloss error on the validation dataset.

**Question 4:**

What is the mean value of the Logloss metric on the validation dataset (X_validation, y_validation) after training `CatBoostClassifier` 10 times with different random seeds in the following way:

`model = CatBoostClassifier(
    iterations=300,
    learning_rate=0.1,
    random_seed={my_random_seed}
)
model.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_validation, y_validation),
)
`

In [30]:
scores = np.zeros(10)
for i in range(10):
    model = CatBoostClassifier(
        iterations=300,
        learning_rate=0.1,
        random_seed=i,
    )
    model.fit(
        X_train, y_train,
        cat_features=cat_features,
        eval_set=(X_validation, y_validation)
        
    )
    scores[i] = model.best_score_['validation']['Logloss']
print(scores)
print(scores.mean())

0:	learn: 0.5790122	test: 0.5797377	best: 0.5797377 (0)	total: 18.7ms	remaining: 5.59s
1:	learn: 0.4926882	test: 0.4940422	best: 0.4940422 (1)	total: 47.3ms	remaining: 7.05s
2:	learn: 0.4262945	test: 0.4277599	best: 0.4277599 (2)	total: 91.9ms	remaining: 9.1s
3:	learn: 0.3762555	test: 0.3777218	best: 0.3777218 (3)	total: 144ms	remaining: 10.7s
4:	learn: 0.3396747	test: 0.3417070	best: 0.3417070 (4)	total: 163ms	remaining: 9.63s
5:	learn: 0.3099710	test: 0.3121170	best: 0.3121170 (5)	total: 204ms	remaining: 9.97s
6:	learn: 0.2891905	test: 0.2917690	best: 0.2917690 (6)	total: 225ms	remaining: 9.43s
7:	learn: 0.2733016	test: 0.2762425	best: 0.2762425 (7)	total: 243ms	remaining: 8.87s
8:	learn: 0.2547005	test: 0.2563450	best: 0.2563450 (8)	total: 301ms	remaining: 9.73s
9:	learn: 0.2403536	test: 0.2406058	best: 0.2406058 (9)	total: 358ms	remaining: 10.4s
10:	learn: 0.2293277	test: 0.2286916	best: 0.2286916 (10)	total: 422ms	remaining: 11.1s
11:	learn: 0.2209746	test: 0.2198225	best: 0.21982

In [31]:
print(scores)
print(scores.mean())

[0.1383 0.1384 0.1376 0.1375 0.1374 0.1385 0.137  0.1382 0.1392 0.1384]
0.138053844266512


In [32]:
mean = scores.mean()
grader.submit_tag('logloss_mean', mean)

Current answer for task logloss_mean is: 0.138053844266512


**Question 5:**

What is the standard deviation of the Logloss metric over these 10 runs?

In [33]:
stddev = scores.std()
grader.submit_tag('logloss_std', stddev)

Current answer for task logloss_std is: 0.0006068927556740362
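
Note that NumPy's `std()` divides by N by default (`ddof=0`), which is the population standard deviation. The sketch below uses the standard library to show both variants on the ten scores printed above. Because those printed scores are rounded to four decimals, the result only approximates the submitted value.

```python
import statistics

# Validation Logloss of the 10 runs, as printed above (rounded to 4 decimals).
scores = [0.1383, 0.1384, 0.1376, 0.1375, 0.1374,
          0.1385, 0.137, 0.1382, 0.1392, 0.1384]

mean = statistics.mean(scores)
population_std = statistics.pstdev(scores)  # divides by N, like numpy's default
sample_std = statistics.stdev(scores)       # divides by N - 1

print(mean, population_std, sample_std)
```

With only 10 runs the two estimates differ noticeably, so it is worth stating which one is reported.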


## Metrics calculation and graph plotting

When experimenting in a Jupyter notebook you can see graphs of different errors during training.
To do that you need to use the `plot=True` parameter.

In [36]:
%matplotlib inline 


In [37]:
!pip install catboost
!pip install ipywidgets
!pip install shap
#!pip install sklearn
!jupyter nbextension enable --py widgetsnbextension

Collecting shap
  Downloading https://files.pythonhosted.org/packages/a8/77/b504e43e21a2ba543a1ac4696718beb500cfa708af2fb57cb54ce299045c/shap-0.35.0.tar.gz (273kB)
     |████████████████████████████████| 276kB 7.4MB/s
Building wheels for collected packages: shap
  Building wheel for shap (setup.py) ... done
  Created wheel for shap: filename=shap-0.35.0-cp36-cp36m-linux_x86_64.whl size=394115 sha256=eb7aa6b24f21be1cbe0bababa37a9c1fcdd36ecc87ab5bdb86a80bcc9a651ad5
  Stored in directory: /root/.cache/pip/wheels/e7/f7/0f/b57055080cf8894906b3bd3616d2fc2bfd0b12d5161bcb24ac
Successfully built shap
Installing collected packages: shap
Successfully installed shap-0.35.0
Enabling notebook extension jupyter-js-widgets/extension...
      - Validating: OK


In [38]:
from catboost import CatBoostClassifier
model = CatBoostClassifier(
    iterations=50,
    random_seed=63,
    learning_rate=0.1,
    custom_loss=['Accuracy']
)
model.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_validation, y_validation),
    logging_level='Silent',
    plot=True
)

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

<catboost.core.CatBoostClassifier at 0x7fab4588dac8>

**Question 6:**

What is the value of the accuracy metric on the evaluation dataset after training with parameters `iterations=50`, `random_seed=63`, `learning_rate=0.1`?

In [40]:
accuracy = model.best_score_['validation']['Accuracy']
grader.submit_tag('accuracy_6', accuracy)

Current answer for task accuracy_6 is: 0.9440036618858713


## Model comparison

In [41]:
model1 = CatBoostClassifier(
    learning_rate=0.5,
    iterations=1000,
    random_seed=64,
    train_dir='learning_rate_0.5',
    custom_loss = ['Accuracy']
)

model2 = CatBoostClassifier(
    learning_rate=0.05,
    iterations=1000,
    random_seed=64,
    train_dir='learning_rate_0.05',
    custom_loss = ['Accuracy']
)
model1.fit(
    X_train, y_train,
    eval_set=(X_validation, y_validation),
    cat_features=cat_features,
    verbose=100
)
model2.fit(
    X_train, y_train,
    eval_set=(X_validation, y_validation),
    cat_features=cat_features,
    verbose=100
)

0:	learn: 0.3007996	test: 0.3044268	best: 0.3044268 (0)	total: 21ms	remaining: 21s
100:	learn: 0.1221343	test: 0.1467235	best: 0.1444052 (48)	total: 6.6s	remaining: 58.7s
200:	learn: 0.0960692	test: 0.1537423	best: 0.1444052 (48)	total: 13.2s	remaining: 52.5s
300:	learn: 0.0775523	test: 0.1616180	best: 0.1444052 (48)	total: 20s	remaining: 46.4s
400:	learn: 0.0631211	test: 0.1658593	best: 0.1444052 (48)	total: 27s	remaining: 40.3s
500:	learn: 0.0516834	test: 0.1699314	best: 0.1444052 (48)	total: 34s	remaining: 33.8s
600:	learn: 0.0440555	test: 0.1756899	best: 0.1444052 (48)	total: 40.8s	remaining: 27.1s
700:	learn: 0.0382871	test: 0.1755893	best: 0.1444052 (48)	total: 47.3s	remaining: 20.2s
800:	learn: 0.0353594	test: 0.1801353	best: 0.1444052 (48)	total: 52.9s	remaining: 13.1s
900:	learn: 0.0318842	test: 0.1832694	best: 0.1444052 (48)	total: 58.9s	remaining: 6.47s
999:	learn: 0.0291325	test: 0.1854538	best: 0.1444052 (48)	total: 1m 4s	remaining: 0us

bestTest = 0.1444051519
bestIterati

<catboost.core.CatBoostClassifier at 0x7fab4386ea58>

In [45]:
!ls learning_rate_0.05

catboost_training.json	learn_error.tsv  test_error.tsv  tmp
learn			test		 time_left.tsv


In [42]:
from catboost import MetricVisualizer
MetricVisualizer(['learning_rate_0.05', 'learning_rate_0.5']).start()

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

**Question 7:**

Try training these models for 1000 iterations. Which model gives the better best Accuracy on the validation dataset?
By best accuracy we mean the accuracy on the best iteration, which might not be the last iteration.

In [46]:
best_model_name =  'learning_rate_0.05' # one of 'learning_rate_0.5', 'learning_rate_0.05'
grader.submit_tag('best_model_name', best_model_name)

Current answer for task best_model_name is: learning_rate_0.05


## Best iteration

If a validation dataset is present, then after training the model is shrunk to the number of trees at which it achieved the best evaluation metric value on the validation dataset.
By default the evaluation metric is the optimized metric, but you can set the evaluation metric to some other metric.
In the example below the evaluation metric is `Accuracy`.

In [47]:
from catboost import CatBoostClassifier
model = CatBoostClassifier(
    iterations=100,
    random_seed=63,
    learning_rate=0.5,
    eval_metric='Accuracy'
)
model.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_validation, y_validation),
    logging_level='Silent',
    plot=True
)

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

<catboost.core.CatBoostClassifier at 0x7fab4386ecf8>

In [48]:
print('Tree count: ' + str(model.tree_count_))

Tree count: 25


If you don't want the model to be shrunk, you can set `use_best_model=False`.

In [49]:
model = CatBoostClassifier(
    iterations=100,
    random_seed=63,
    learning_rate=0.5,
    eval_metric='Accuracy',
    use_best_model=False
)
model.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_validation, y_validation),
    logging_level='Silent',
    plot=True
)

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

<catboost.core.CatBoostClassifier at 0x7fab437202e8>

**Question 8:**
    
What will be the number of trees in the resulting model after training with a validation dataset with parameters `iterations=100`, `learning_rate=0.5`, `eval_metric='Accuracy'` and with parameter `use_best_model=False`?

In [50]:
tree_count = model.tree_count_
grader.submit_tag('num_trees', tree_count)

Current answer for task num_trees is: 100


## Cross-validation

The next functionality you need to know about is cross-validation.
For unbalanced datasets, stratified cross-validation can be useful because it keeps the class ratio roughly the same in every fold.

In [51]:
from catboost import cv

params = {}
params['loss_function'] = 'Logloss'
params['iterations'] = 80
params['custom_loss'] = 'AUC'
params['random_seed'] = 63
params['learning_rate'] = 0.5

cv_data = cv(
    params = params,
    pool = Pool(X, label=y, cat_features=cat_features),
    fold_count=5,
    inverted=False,
    shuffle=True,
    partition_random_seed=0,
    plot=True,
    stratified=True,
    verbose=False
)

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))
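
For intuition, stratification can be pictured as splitting each class separately, so that every fold keeps roughly the same class ratio. The sketch below is a simplified pure-Python illustration, not CatBoost's implementation. `stratified_folds` is a hypothetical helper, the labels are made up, and the sketch skips the shuffling that `cv` performs when `shuffle=True`.

```python
from collections import defaultdict


def stratified_folds(labels, fold_count):
    """Assign object indices to folds so each class is spread evenly."""
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    folds = [[] for _ in range(fold_count)]
    for indices in indices_by_class.values():
        # Deal the objects of one class to the folds in round-robin order.
        for position, index in enumerate(indices):
            folds[position % fold_count].append(index)
    return folds


# Hypothetical unbalanced labels: 8 positives and 2 negatives.
labels = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]
folds = stratified_folds(labels, fold_count=2)
for fold in folds:
    print(sorted(fold), [labels[i] for i in sorted(fold)])
```

With a plain random split, one of the folds could easily end up with no negative objects at all, which makes metrics such as AUC unreliable on that fold.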

Cross-validation returns the specified metric values on every iteration (or every k-th iteration, if you specify so).

In [53]:
print(cv_data[0:4])

   iterations  test-Logloss-mean  ...  test-AUC-mean  test-AUC-std
0           0           0.302032  ...       0.539535      0.005601
1           1           0.231446  ...       0.598963      0.009583
2           2           0.192522  ...       0.787625      0.003311
3           3           0.179467  ...       0.809551      0.011891

[4 rows x 7 columns]


Let's look at the mean value and standard deviation of Logloss for cv on the best iteration.

In [52]:
best_value = np.min(cv_data['test-Logloss-mean'])
best_iter = np.argmin(cv_data['test-Logloss-mean'])

print('Best validation Logloss score, stratified: {:.4f}±{:.4f} on step {}'.format(
    best_value,
    cv_data['test-Logloss-std'][best_iter],
    best_iter)
)

Best validation Logloss score, stratified: 0.1571±0.0028 on step 50


**Question 9:**

Try running stratified cross-validation with the same parameters. What will be the mean of the Logloss metric on the test folds of the stratified cross-validation on the best iteration?

In [54]:
mean_on_best_iteration = np.min(cv_data['test-Logloss-mean'])
grader.submit_tag('mean_logloss_cv', mean_on_best_iteration)

Current answer for task mean_logloss_cv is: 0.15707007154415673


**Question 10:**

Try running stratified cross-validation with the same parameters. What will be the standard deviation of the Logloss metric of the stratified cross-validation on the best iteration?

In [55]:
std_on_best_iteration = cv_data['test-Logloss-std'][best_iter]
grader.submit_tag('logloss_std_1', std_on_best_iteration)

Current answer for task logloss_std_1 is: 0.0028273089638071918


## Overfitting detector

A useful feature of the library is the overfitting detector.
Let's try training the model with early stopping.

In [57]:
model_with_early_stop = CatBoostClassifier(
    iterations=200,
    random_seed=63,
    learning_rate=0.5,
    od_type='Iter',
    od_wait=20,
    eval_metric = 'AUC'
)
model_with_early_stop.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_validation, y_validation),
    
    plot=True
)

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

0:	test: 0.5002435	best: 0.5002435 (0)	total: 80.6ms	remaining: 16s
1:	test: 0.6796596	best: 0.6796596 (1)	total: 164ms	remaining: 16.2s
2:	test: 0.8462429	best: 0.8462429 (2)	total: 219ms	remaining: 14.4s
3:	test: 0.8526637	best: 0.8526637 (3)	total: 302ms	remaining: 14.8s
4:	test: 0.8712343	best: 0.8712343 (4)	total: 396ms	remaining: 15.5s
5:	test: 0.8759795	best: 0.8759795 (5)	total: 457ms	remaining: 14.8s
6:	test: 0.8763159	best: 0.8763159 (6)	total: 524ms	remaining: 14.4s
7:	test: 0.8824011	best: 0.8824011 (7)	total: 594ms	remaining: 14.2s
8:	test: 0.8811621	best: 0.8824011 (7)	total: 652ms	remaining: 13.8s
9:	test: 0.8818324	best: 0.8824011 (7)	total: 719ms	remaining: 13.7s
10:	test: 0.8867967	best: 0.8867967 (10)	total: 794ms	remaining: 13.6s
11:	test: 0.8867720	best: 0.8867967 (10)	total: 836ms	remaining: 13.1s
12:	test: 0.8859881	best: 0.8867967 (10)	total: 909ms	remaining: 13.1s
13:	test: 0.8872667	best: 0.8872667 (13)	total: 978ms	remaining: 13s
14:	test: 0.8876331	best: 0.8

<catboost.core.CatBoostClassifier at 0x7fab4386ea20>

**Question 11:**

Now try training the model with the same parameters and with the overfitting detector, but with `eval_metric='AUC'`.
What will be the number of iterations after which the training will stop?
(Not the number of trees in the resulting model, but the number of iterations that the algorithm will perform before stopping.)

In [58]:
iterations_count = 43
grader.submit_tag('iterations_overfitting', iterations_count)

Current answer for task iterations_overfitting is: 43
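
With `od_type='Iter'`, training continues for at most `od_wait` iterations after the iteration with the best metric value. The sketch below mimics that rule on a made-up list of AUC values. It is a simplified illustration under that assumption, not CatBoost's code: `iterations_until_stop` is a hypothetical helper, and the exact off-by-one behaviour of the library should be checked against the training log.

```python
def iterations_until_stop(scores, wait):
    """Return (iterations performed, best iteration) for a higher-is-better metric."""
    best_score = float("-inf")
    best_iteration = -1
    for iteration, score in enumerate(scores):
        if score > best_score:
            best_score, best_iteration = score, iteration
        elif iteration - best_iteration >= wait:
            # No improvement for `wait` iterations in a row: stop here.
            return iteration + 1, best_iteration
    return len(scores), best_iteration


# Hypothetical validation AUC per iteration; the best value is at iteration 2.
auc_by_iteration = [0.50, 0.68, 0.85, 0.84, 0.83, 0.82, 0.81]
performed, best_iteration = iterations_until_stop(auc_by_iteration, wait=3)
print(performed, best_iteration)
```

The number of iterations performed is larger than the number of trees kept, because the model is cut back to the best iteration afterwards.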


## Snapshotting

If you train for a long time, for example for several hours, you need to save snapshots.
Otherwise, if your laptop or your server reboots, you will lose all the progress.
To do that you need to specify the `snapshot_file` parameter.
Try running the code below and interrupting the kernel after a short time.
Then try running the same cell again.
The training will resume from the iteration at which it was interrupted.
Note that all additional files are written by default into the `catboost_info` directory. This can be changed using the `train_dir` parameter. So the snapshot file will be there.

In [59]:
from catboost import CatBoostClassifier
model = CatBoostClassifier(
    iterations=40,
    save_snapshot=True,
    snapshot_file='snapshot.bkp',
    random_seed=43
)
model.fit(
    X_train, y_train,
    eval_set=(X_validation, y_validation),
    cat_features=cat_features,
    logging_level='Verbose'
)

Learning rate set to 0.288002
0:	learn: 0.4998892	test: 0.5009131	best: 0.5009131 (0)	total: 21.1ms	remaining: 824ms
1:	learn: 0.3928966	test: 0.3947405	best: 0.3947405 (1)	total: 41.9ms	remaining: 796ms
2:	learn: 0.3273849	test: 0.3293347	best: 0.3293347 (2)	total: 59.5ms	remaining: 734ms
3:	learn: 0.2861754	test: 0.2886289	best: 0.2886289 (3)	total: 70.3ms	remaining: 633ms
4:	learn: 0.2600956	test: 0.2633799	best: 0.2633799 (4)	total: 78.6ms	remaining: 550ms
5:	learn: 0.2409059	test: 0.2435218	best: 0.2435218 (5)	total: 91.2ms	remaining: 517ms
6:	learn: 0.2289714	test: 0.2311302	best: 0.2311302 (6)	total: 101ms	remaining: 477ms
7:	learn: 0.2117902	test: 0.2093182	best: 0.2093182 (7)	total: 122ms	remaining: 487ms
8:	learn: 0.2016288	test: 0.1975972	best: 0.1975972 (8)	total: 139ms	remaining: 479ms
9:	learn: 0.1953748	test: 0.1900229	best: 0.1900229 (9)	total: 157ms	remaining: 471ms
10:	learn: 0.1905659	test: 0.1839534	best: 0.1839534 (10)	total: 174ms	remaining: 460ms
11:	learn: 0.186

<catboost.core.CatBoostClassifier at 0x7fab5b0b4518>

## Model predictions

There are multiple ways to make predictions.
The easiest one is to call `predict` or `predict_proba`.
You can also make predictions using C++ code. For that see the [documentation](https://tech.yandex.com/catboost/doc/dg/concepts/c-plus-plus-api-docpage/).

In [60]:
print(model.predict_proba(data=X_validation))

[[0.0106 0.9894]
 [0.0212 0.9788]
 [0.0128 0.9872]
 ...
 [0.011  0.989 ]
 [0.3201 0.6799]
 [0.0403 0.9597]]


In [61]:
print(model.predict(data=X_validation))

[1 1 1 ... 1 1 1]


For binary classification the raw resulting value is not necessarily a value in `[0,1]`. It is some numeric value. To get the probability out of this value you need to calculate the sigmoid of that value.

In [62]:
raw_pred = model.predict(data=X_validation, prediction_type='RawFormulaVal')
print(raw_pred)

[4.536  3.8344 4.3463 ... 4.4948 0.7533 3.171 ]


In [63]:
import math
def sigmoid(x):
    return 1 / (1 + math.exp(-x))
probabilities = [sigmoid(x) for x in raw_pred]
print(np.array(probabilities))

[0.9894 0.9788 0.9872 ... 0.989  0.6799 0.9597]
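
One caveat about the sigmoid helper above: `math.exp(-x)` raises `OverflowError` for large negative raw values such as -1000. The variant below is a common numerically stable formulation, shown as a sketch. `stable_sigmoid` is a hypothetical name; the first two inputs are raw values printed above and the third is an artificial extreme.

```python
import math


def stable_sigmoid(x):
    """Sigmoid that avoids overflow by never exponentiating a large positive number."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # x < 0, so z is in (0, 1) or underflows to 0.0
    return z / (1.0 + z)


raw_values = [4.536, 0.7533, -1000.0]
probabilities = [stable_sigmoid(v) for v in raw_values]
print(probabilities)
```

For the raw values seen in this notebook both versions agree; the difference only matters for extreme scores.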


## Staged prediction

CatBoost also supports staged prediction: getting a prediction for each object on each iteration (or on each k-th iteration). This can be used if you want to calculate the values of some custom metric using the predictions.

In [64]:
predictions_gen = model.staged_predict_proba(data=X_validation, ntree_start=0, ntree_end=5, eval_period=1)
for iteration, predictions in enumerate(predictions_gen):
    print('Iteration ' + str(iteration) + ', predictions:')
    print(predictions)

Iteration 0, predictions:
[[0.3752 0.6248]
 [0.3752 0.6248]
 [0.3752 0.6248]
 ...
 [0.3752 0.6248]
 [0.3752 0.6248]
 [0.3752 0.6248]]
Iteration 1, predictions:
[[0.2887 0.7113]
 [0.2887 0.7113]
 [0.2887 0.7113]
 ...
 [0.2887 0.7113]
 [0.2887 0.7113]
 [0.2887 0.7113]]
Iteration 2, predictions:
[[0.2232 0.7768]
 [0.2345 0.7655]
 [0.2232 0.7768]
 ...
 [0.2232 0.7768]
 [0.2434 0.7566]
 [0.2345 0.7655]]
Iteration 3, predictions:
[[0.1763 0.8237]
 [0.1875 0.8125]
 [0.1763 0.8237]
 ...
 [0.1763 0.8237]
 [0.2028 0.7972]
 [0.1949 0.8051]]
Iteration 4, predictions:
[[0.144  0.856 ]
 [0.1535 0.8465]
 [0.144  0.856 ]
 ...
 [0.144  0.856 ]
 [0.1666 0.8334]
 [0.1598 0.8402]]
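
Each stage printed above is a table of `[P(class 0), P(class 1)]` rows, so a custom metric can be evaluated stage by stage. The sketch below does this for accuracy in plain Python. The two stages and the labels are made-up stand-ins for the arrays yielded by `staged_predict_proba`, and the 0.5 threshold is an assumption.

```python
def accuracy(labels, probabilities, threshold=0.5):
    """Share of objects whose thresholded P(class 1) matches the label."""
    predicted = [1 if row[1] > threshold else 0 for row in probabilities]
    hits = sum(1 for p, y in zip(predicted, labels) if p == y)
    return hits / len(labels)


# Hypothetical labels and two stages of [P(class 0), P(class 1)] rows.
labels = [1, 0, 1]
staged_probabilities = [
    [[0.40, 0.60], [0.40, 0.60], [0.40, 0.60]],
    [[0.30, 0.70], [0.55, 0.45], [0.35, 0.65]],
]

accuracy_per_stage = [accuracy(labels, stage) for stage in staged_probabilities]
print(accuracy_per_stage)
```

With a real model, the inner lists would be replaced by the items produced by the generator in the loop above.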


## Metric evaluation on a new dataset

You can also calculate metrics on a given dataset after training.

In [65]:
metrics = model.eval_metrics(data=pool1, metrics=['Logloss','AUC'], plot=True)

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

In [66]:
print('AUC values:')
print(np.array(metrics['AUC']))

AUC values:
[0.5052 0.5074 0.639  0.6435 0.6435 0.6705 0.737  0.8662 0.9039 0.905
 0.9141 0.9216 0.923  0.9296 0.9336 0.9371 0.9412 0.9413 0.9421 0.9426
 0.9443 0.9456 0.9456 0.9456 0.9456 0.9455 0.9454 0.9452 0.9458 0.946
 0.9465 0.9469 0.9472 0.947  0.9468 0.9467 0.9467 0.9471 0.947  0.947 ]


**Question 12:**

Now train a model in the following way:

`
from catboost import CatBoostClassifier
model = CatBoostClassifier(
    iterations=1000,
    learning_rate=0.05,
    random_seed=43
)
model.fit(
    X_train, y_train,
    eval_set=(X_validation, y_validation),
    cat_features=cat_features,
    logging_level='Verbose'
)
`

What will be the AUC value on iteration 550 if the metrics are evaluated on the initial X dataset?

In [67]:
from catboost import CatBoostClassifier
model = CatBoostClassifier(
    iterations=1000,
    learning_rate=0.05,
    random_seed=43
)
model.fit(
    X_train, y_train,
    eval_set=(X_validation, y_validation),
    cat_features=cat_features,
    logging_level='Verbose'
)

0:	learn: 0.6335962	test: 0.6340206	best: 0.6340206 (0)	total: 36.9ms	remaining: 36.8s
1:	learn: 0.5800784	test: 0.5804728	best: 0.5804728 (1)	total: 92ms	remaining: 45.9s
2:	learn: 0.5348017	test: 0.5355652	best: 0.5355652 (2)	total: 128ms	remaining: 42.5s
3:	learn: 0.4948441	test: 0.4962757	best: 0.4962757 (3)	total: 168ms	remaining: 41.7s
4:	learn: 0.4596710	test: 0.4614066	best: 0.4614066 (4)	total: 199ms	remaining: 39.5s
5:	learn: 0.4296351	test: 0.4315399	best: 0.4315399 (5)	total: 230ms	remaining: 38s
6:	learn: 0.4018046	test: 0.4036053	best: 0.4036053 (6)	total: 281ms	remaining: 39.8s
7:	learn: 0.3781803	test: 0.3800486	best: 0.3800486 (7)	total: 311ms	remaining: 38.6s
8:	learn: 0.3586405	test: 0.3606194	best: 0.3606194 (8)	total: 347ms	remaining: 38.2s
9:	learn: 0.3397761	test: 0.3415773	best: 0.3415773 (9)	total: 393ms	remaining: 38.9s
10:	learn: 0.3250980	test: 0.3270929	best: 0.3270929 (10)	total: 411ms	remaining: 37s
11:	learn: 0.3122087	test: 0.3143434	best: 0.3143434 (11

<catboost.core.CatBoostClassifier at 0x7fab45936668>

In [68]:
auc_value = model.eval_metrics(data=pool1, metrics=['AUC'])['AUC'][550]
grader.submit_tag('auc_550', auc_value)

Current answer for task auc_550 is: 0.9851195570316492


## Feature importances

Now we will learn how to understand which features are the most important ones. Let's first train a model that will not use feature combinations. To forbid feature combinations you need to use `max_ctr_complexity=1`. This speeds up training considerably, but it reduces the resulting quality.

In [69]:
from catboost import CatBoostClassifier
model = CatBoostClassifier(
    iterations=300,
    max_ctr_complexity=4,  # as run here; the text above asks for 1 (no feature combinations)
    random_seed=43
)
model.fit(
    X, y,
    cat_features=cat_features,
    verbose=50
)

Learning rate set to 0.137885
0:	learn: 0.5382281	total: 76.4ms	remaining: 22.9s
50:	learn: 0.1494065	total: 3.44s	remaining: 16.8s
100:	learn: 0.1423648	total: 7.29s	remaining: 14.4s
150:	learn: 0.1368037	total: 11.1s	remaining: 10.9s
200:	learn: 0.1324257	total: 14.9s	remaining: 7.36s
250:	learn: 0.1283281	total: 19s	remaining: 3.7s
299:	learn: 0.1251284	total: 22.8s	remaining: 0us


<catboost.core.CatBoostClassifier at 0x7fab45936a20>

Let's see which features are most important for the model without feature combinations.

In [70]:
importances = model.get_feature_importance(prettified=True)
print(importances)

         Feature Id  Importances
0          RESOURCE    20.963067
1            MGR_ID    17.835245
2     ROLE_DEPTNAME    15.684628
3  ROLE_FAMILY_DESC    11.586115
4     ROLE_ROLLUP_2    10.996416
5        ROLE_TITLE     6.214463
6     ROLE_ROLLUP_1     6.100384
7         ROLE_CODE     5.484116
8       ROLE_FAMILY     5.135566


**Question 13:**

Try training the model without the restriction of combinations, with other parameters set to the same values.
What will be the top 3 most important features for this model?

In [71]:
model = CatBoostClassifier(
    iterations=300,
    max_ctr_complexity=None,
    random_seed=43   
)
model.fit(
    X, y,
    cat_features=cat_features,
    verbose=50
)

Learning rate set to 0.137885
0:	learn: 0.5382281	total: 70.6ms	remaining: 21.1s
50:	learn: 0.1494065	total: 3.41s	remaining: 16.6s
100:	learn: 0.1423648	total: 7.26s	remaining: 14.3s
150:	learn: 0.1368037	total: 11.1s	remaining: 10.9s
200:	learn: 0.1324257	total: 14.9s	remaining: 7.32s
250:	learn: 0.1283281	total: 18.8s	remaining: 3.67s
299:	learn: 0.1251284	total: 22.6s	remaining: 0us


<catboost.core.CatBoostClassifier at 0x7fab4da5e198>

In [72]:
importances = model.get_feature_importance(prettified=True)
print(importances)

         Feature Id  Importances
0          RESOURCE    20.963067
1            MGR_ID    17.835245
2     ROLE_DEPTNAME    15.684628
3  ROLE_FAMILY_DESC    11.586115
4     ROLE_ROLLUP_2    10.996416
5        ROLE_TITLE     6.214463
6     ROLE_ROLLUP_1     6.100384
7         ROLE_CODE     5.484116
8       ROLE_FAMILY     5.135566


In [101]:
# Provide a comma-separated list of strings. Each string should be in single quotes, and the whole list should be in square brackets.
top3 = [['RESOURCE', 'MGR_ID', 'ROLE_DEPTNAME']]
grader.submit_tag('feature_importance_top3', top3)

Current answer for task feature_importance_top3 is: [['RESOURCE', 'MGR_ID', 'ROLE_DEPTNAME']]
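
As a minimal sketch, the top 3 can also be read off programmatically instead of by eye. The numbers below are copied from the table printed above; the plain dictionary and the helper `top_features` are illustrative stand-ins (assumptions made to keep the example self-contained), not part of the CatBoost API.

```python
# Importances copied from the table printed above. A plain dict is used here
# as a stand-in for the prettified DataFrame (an illustrative assumption).
importances = {
    'RESOURCE': 20.963067,
    'MGR_ID': 17.835245,
    'ROLE_DEPTNAME': 15.684628,
    'ROLE_FAMILY_DESC': 11.586115,
    'ROLE_ROLLUP_2': 10.996416,
    'ROLE_TITLE': 6.214463,
    'ROLE_ROLLUP_1': 6.100384,
    'ROLE_CODE': 5.484116,
    'ROLE_FAMILY': 5.135566,
}


def top_features(scores, k=3):
    """Return the k feature names with the largest importance (hypothetical helper)."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


top3_names = top_features(importances)
print(top3_names)

# The importances in this table add up to 100, so each value can be read
# as a percentage share.
total_importance = sum(importances.values())
print(round(total_importance, 3))
```

With the real prettified result, the same ranking should be available directly from the first rows of the table, since it is already sorted in descending order.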


## SHAP values

Let's train the model one more time.

In [74]:
from catboost import CatBoostClassifier
model = CatBoostClassifier(
    iterations=300,
    max_ctr_complexity=1,
    random_seed=43
)
model.fit(
    X, y,
    cat_features=cat_features,
    verbose=50
)

Learning rate set to 0.137885
0:	learn: 0.5379892	total: 45.2ms	remaining: 13.5s
50:	learn: 0.1642268	total: 1.89s	remaining: 9.23s
100:	learn: 0.1579621	total: 3.81s	remaining: 7.51s
150:	learn: 0.1533024	total: 5.73s	remaining: 5.66s
200:	learn: 0.1486749	total: 7.66s	remaining: 3.77s
250:	learn: 0.1446334	total: 9.57s	remaining: 1.87s
299:	learn: 0.1411890	total: 11.5s	remaining: 0us


<catboost.core.CatBoostClassifier at 0x7fab4386eb38>

The library provides a way to understand which features are important for a given object.
Let's take a look at the whole dataset X and analyze the influence of different features on the objects from this dataset.
We will now calculate importances for each object. After that we will visualize these importances.

In [75]:
pool1 = Pool(data=X, label=y, cat_features=cat_features)
shap_values = model.get_feature_importance(data=pool1, type='ShapValues', verbose=10000)
print(shap_values.shape)

Processing trees...
128/300 trees processed	passed time: 148ms	remaining time: 198ms
300/300 trees processed	passed time: 373ms	remaining time: 0us
Processing documents...
128/32769 documents processed	passed time: 3.37ms	remaining time: 859ms
10112/32769 documents processed	passed time: 235ms	remaining time: 526ms
20096/32769 documents processed	passed time: 474ms	remaining time: 299ms
30080/32769 documents processed	passed time: 730ms	remaining time: 65.2ms
(32769, 10)


Let's look at the prediction of the model for the 0-th object. The raw prediction is not a probability. To get the probability from the raw prediction, you need to calculate sigmoid(raw_prediction).

In [76]:
test_objects = [X.iloc[0:1]]

for obj in test_objects:
    print('Probability of class 1 = {:.4f}'.format(model.predict_proba(obj)[0][1]))
    print('Formula raw prediction = {:.4f}'.format(model.predict(obj, prediction_type='RawFormulaVal')[0]))
    print('\n')

Probability of class 1 = 0.9948
Formula raw prediction = 5.2546
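
The relationship between the two printed numbers can be checked by hand. The sketch below applies the logistic function to the raw value shown above; `sigmoid` is a small helper written for this example, not a CatBoost function.

```python
import math


def sigmoid(x):
    """Logistic function: maps a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


raw_prediction = 5.2546  # raw formula value printed above for the 0-th object
probability = sigmoid(raw_prediction)
print('{:.4f}'.format(probability))  # 0.9948
```

The result matches the probability of class 1 reported by `predict_proba`.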




The sum of all SHAP values for an object, including the base value stored in the last column, is equal to the resulting raw formula prediction.
On the graph output below we can see that there is a base value, which is the same for all objects.
Almost all the features have a positive influence on this object. The biggest step to the right is caused by the feature called 'MGR_ID'.

In [79]:
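import shap
shap.initjs()
# The last column of shap_values holds the base (expected) value, not a feature,
# so it is passed separately and dropped from the per-feature contributions.
shap.force_plot(shap_values[0, -1], shap_values[0, :-1], X.iloc[0, :])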

** Question 14: **

What is the most important feature for the 91st object?

In [80]:
most_important_feature = 'RESOURCE'
grader.submit_tag('most_important', most_important_feature)

Current answer for task most_important is: RESOURCE


** Question 15: **

Does it have positive or negative influence? Answer 1 if positive and -1 if negative.

In [81]:
influence_sign = -1
grader.submit_tag('shap_influence', influence_sign)

Current answer for task shap_influence is: -1
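
The matrix returned by CatBoost has one SHAP value per feature plus a final column with the base value, which is why its shape is (32769, 10) for 9 features. The sketch below shows one way to find the most influential feature and its sign for a single row. The row values and the shortened feature list are hypothetical and chosen only for illustration; they are not taken from the trained model.

```python
# Shortened, illustrative feature list (the real dataset has 9 features).
feature_names = ['RESOURCE', 'MGR_ID', 'ROLE_ROLLUP_1']

# Hypothetical row: one contribution per feature, then the base value last.
shap_row = [-1.2, 0.4, 0.1, 3.0]


def strongest_feature(row, names):
    """Return (name, sign) of the feature with the largest absolute SHAP value."""
    contributions = row[:-1]  # drop the trailing base value
    idx = max(range(len(contributions)), key=lambda i: abs(contributions[i]))
    sign = 1 if contributions[idx] > 0 else -1
    return names[idx], sign


name, sign = strongest_feature(shap_row, feature_names)
# Contributions plus the base value give the raw formula prediction.
raw_from_shap = sum(shap_row)
print(name, sign, round(raw_from_shap, 4))
```

With the real data, something like `strongest_feature(list(shap_values[i]), list(X.columns))` could be used, where `i` is the row index of the object in question. Whether the "91st object" means index 91 or index 90 is left to the reader to check against the assignment.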


You can also view aggregated information about the influences on the whole dataset.

In [82]:
# Drop the last column (base value) so the matrix has one column per feature in X.
shap.summary_plot(shap_values[:, :-1], X)

From this graph you can see that the values of the MGR_ID and RESOURCE features have a large negative impact for many objects.
You can also see that RESOURCE has the largest positive impact for many objects.

## Saving the model

You can save your model as a binary file. It is also possible to save the model as Python or C++ code.
If you save the model as a binary file, you can later look at the parameters with which the model was trained, including learning_rate and random_seed, which are set automatically if you don't specify them.

In [83]:
my_best_model = CatBoostClassifier(iterations=10)
my_best_model.fit(
    X_train, y_train,
    eval_set=(X_validation, y_validation),
    cat_features=cat_features,
    verbose=False
)
my_best_model.save_model('catboost_model.bin')

In [84]:
my_best_model.load_model('catboost_model.bin')
print(my_best_model.get_params())
print(my_best_model.random_seed_)
print(my_best_model.learning_rate_)

{'iterations': 10, 'loss_function': 'Logloss', 'verbose': 0}
0
0.5


## Hyperparameter tuning

You can tune the parameters to get better speed or better quality.
Here is the list of parameters that are important for speed and accuracy.

### Training speed

Here is the list of parameters that are important for speeding up the training.
Note that changing these parameters might decrease the quality.
1. iterations + learning rate
By default we train for 1000 iterations. You can decrease this number, but if you do, you need to increase the learning rate so that the process still converges. By default the learning rate is set depending on the number of iterations and on your dataset, so you might just use the default. But if you want to tune it, keep in mind that the more iterations you have, the smaller the learning rate should be.

2. boosting_type
By default we use Ordered boosting for smaller datasets, where we want to fight overfitting. This is computationally expensive. You can set boosting_type to Plain to disable it.

3. bootstrap_type
By default we sample weights from an exponential distribution. Sampling from a Bernoulli distribution is faster. To enable it, use bootstrap_type='Bernoulli' + subsample={some value < 1}

4. one_hot_max_size
By default we use one-hot encoding only for categorical features with a small number of distinct values. For all other categorical features we calculate statistics. This is expensive, while one-hot encoding is cheap, so you can speed up the training by setting one_hot_max_size to a larger value.

5. rsm
This parameter is very important, because it speeds up the training and does not affect the quality. So you should definitely use it, but only if you have hundreds of features.
If you have only a few features, it's better not to use this parameter.
If you have many features, the rule is the following: decrease rsm, for example to rsm=0.1. With this value the training needs more iterations to converge, usually about 20% more, but each iteration will be 10x faster. So the overall training will be faster even though the resulting model has more trees.

6. leaf_estimation_iterations
This parameter controls how leaf values are calculated after the tree structure has been selected.
If you have only a few features, for example 8 or 10, this step becomes the bottleneck.
The default value depends on the training objective. You can try setting it to 1 or 5, and if you have few features this might speed up the training.

7. max_ctr_complexity
By default catboost generates categorical feature combinations in a greedy way.
This is time-consuming. You can disable it by setting max_ctr_complexity=1, or allow only combinations of 2 features by setting max_ctr_complexity=2.
This will speed up the training only if you have categorical features.

8. border_count
If you are training the model on GPU, you can try decreasing border_count. This is the number of splits considered for each feature. By default it's set to 128, but you can try setting it to 32. In many cases this will not degrade the quality of the model and will speed up the training considerably.
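
The rsm trade-off described in item 5 can be worked through as simple arithmetic. The sketch below uses the figures quoted in the text (10x faster iterations, about 20% more of them); `net_speedup` is a helper written for this example.

```python
def net_speedup(per_iteration_speedup, extra_iterations_fraction):
    """Overall speedup when each iteration is faster but more iterations are needed."""
    return per_iteration_speedup / (1.0 + extra_iterations_fraction)


# Figures from the rsm rule of thumb: 10x faster iterations, 20% more of them.
rsm_speedup = net_speedup(10, 0.20)
print(round(rsm_speedup, 2))  # 8.33
```

So under these assumptions the total training time would drop by roughly a factor of 8, not 10, because of the extra trees.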

In [85]:
from catboost import CatBoostClassifier
fast_model = CatBoostClassifier(
    random_seed=63,
    iterations=150,
    learning_rate=0.01,
    boosting_type='Plain',
    bootstrap_type='Bernoulli',
    subsample=0.5,
    one_hot_max_size=20,
    rsm=0.5,
    leaf_estimation_iterations=5,
    max_ctr_complexity=1,
    border_count=32)

fast_model.fit(
    X_train, y_train,
    cat_features=cat_features,
    logging_level='Silent',
    plot=True
)

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

<catboost.core.CatBoostClassifier at 0x7fab2b16a550>

** Question 16: **

Try tuning the speed of the algorithm. What is the maximum speedup you can get by changing these parameters without decreasing the AUC on the best iteration on the eval dataset, compared to the AUC on the best iteration after training with default parameters and random seed = 0?
The answer should be a number; for example, 2.7 means you got a 2.7 times speedup.

In [86]:
speedup = 2.7
grader.submit_tag('speedup', speedup)

Current answer for task speedup is: 2.7
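
One way to measure such a speedup is to time the baseline and the tuned training runs and take the ratio. The sketch below is a minimal, hedged illustration: `timed` and `speedup_ratio` are hypothetical helpers, and the two `sleep` calls are stand-ins for what would be `fit` calls on the default and the tuned model.

```python
import time


def timed(fn):
    """Run fn() once and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def speedup_ratio(baseline_fn, candidate_fn):
    """How many times faster candidate_fn is than baseline_fn."""
    return timed(baseline_fn) / timed(candidate_fn)


# Stand-ins for the two training runs (assumptions for illustration only).
def slow_training():
    time.sleep(0.2)


def fast_training():
    time.sleep(0.05)


ratio = speedup_ratio(slow_training, fast_training)
print(round(ratio, 1))
```

A speedup only counts for this question if the AUC on the best iteration has not dropped, so the quality of both models needs to be compared as well as their timings.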


### Accuracy

The parameters listed below are important for getting the best quality of the model. Try changing these parameters to improve the quality of the resulting model.

In [89]:
tunned_model = CatBoostClassifier(
    random_seed=63,
    iterations=1000,
    learning_rate=0.03,
    l2_leaf_reg=3,
    bagging_temperature=1,
    random_strength=1,
    one_hot_max_size=2,
    leaf_estimation_method='Newton',
    depth=6,
    verbose=10
)
tunned_model.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_validation, y_validation),
    plot=True
)

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

0:	learn: 0.6568385	test: 0.6571059	best: 0.6571059 (0)	total: 72.7ms	remaining: 1m 12s
10:	learn: 0.4155487	test: 0.4164934	best: 0.4164934 (10)	total: 530ms	remaining: 47.6s
20:	learn: 0.3083466	test: 0.3103643	best: 0.3103643 (20)	total: 853ms	remaining: 39.8s
30:	learn: 0.2504686	test: 0.2508172	best: 0.2508172 (30)	total: 1.27s	remaining: 39.8s
40:	learn: 0.2147695	test: 0.2118084	best: 0.2118084 (40)	total: 1.77s	remaining: 41.4s
50:	learn: 0.1948465	test: 0.1885134	best: 0.1885134 (50)	total: 2.32s	remaining: 43.2s
60:	learn: 0.1831136	test: 0.1731985	best: 0.1731985 (60)	total: 2.85s	remaining: 43.9s
70:	learn: 0.1760215	test: 0.1643348	best: 0.1643348 (70)	total: 3.39s	remaining: 44.3s
80:	learn: 0.1711760	test: 0.1584812	best: 0.1584812 (80)	total: 3.97s	remaining: 45.1s
90:	learn: 0.1670447	test: 0.1533402	best: 0.1533402 (90)	total: 4.59s	remaining: 45.9s
100:	learn: 0.1636345	test: 0.1491871	best: 0.1491871 (100)	total: 5.24s	remaining: 46.6s
110:	learn: 0.1614307	test: 0.

<catboost.core.CatBoostClassifier at 0x7fab2b16ad68>

** Question 17: **

Try tuning these parameters to make the AUC on the eval dataset as large as possible. What is the maximum AUC value you have reached?

In [90]:
final_auc = 0.91
grader.submit_tag('final_auc', final_auc)

Current answer for task final_auc is: 0.91


In [102]:
STUDENT_EMAIL = "shakir.ahmed@student.sust.edu"
STUDENT_TOKEN = "4XDEKWd8q74TJ6FG"
grader.status()

You want to submit these numbers:
Task negative_samples: 1897
Task positive_samples: 30872
Task resource_unique_values: 7518
Task logloss_mean: 0.138053844266512
Task logloss_std: 0.0006068927556740362
Task accuracy_6: 0.9440036618858713
Task best_model_name: learning_rate_0.05
Task num_trees: 100
Task mean_logloss_cv: 0.15707007154415673
Task logloss_std_1: 0.0028273089638071918
Task iterations_overfitting: 43
Task auc_550: 0.9851195570316492
Task feature_importance_top3: [['RESOURCE', 'MGR_ID', 'ROLE_DEPTNAME']]
Task most_important: RESOURCE
Task shap_influence: -1
Task speedup: 2.7
Task final_auc: 0.91


In [103]:
grader.submit(STUDENT_EMAIL, STUDENT_TOKEN)

Submitted to Coursera platform. See results on assignment page!
