# CatBoost basics

For this homework we will use the Amazon Employee Access Challenge dataset from a [Kaggle](https://www.kaggle.com) competition. The data can be downloaded [here](https://www.kaggle.com/c/amazon-employee-access-challenge/data).

By the end of this tutorial you need to produce a TSV file with your answers.
There are 17 questions in this tutorial, numbered from 1 to 17. The resulting TSV file should consist of 17 lines; each line should contain the question number and the answer to it, separated by a tab.
See an example of the resulting file here.
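
The file format described above could be produced with Python's `csv` module. This is only a sketch: the `answers` dictionary, its placeholder values and the file name `answers.tsv` are assumptions made for illustration, not real answers or a name required by the grader.

```python
import csv
import os
import tempfile

# Hypothetical placeholder answers: question number -> answer.
answers = {question: "answer_{}".format(question) for question in range(1, 18)}

# Write into a temporary directory so the sketch does not touch your working files.
output_path = os.path.join(tempfile.mkdtemp(), "answers.tsv")

# One line per question: number, tab, answer.
with open(output_path, "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t", lineterminator="\n")
    for question in sorted(answers):
        writer.writerow([question, answers[question]])

# Read the file back to check its shape.
with open(output_path) as f:
    lines = f.read().splitlines()

print(len(lines), repr(lines[0]))
```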

## Reading the data

Let's first download the data and put it in the `amazon` folder. Here we load it with `catboost.datasets.amazon()`, which returns the train and test parts as DataFrames.

In [1]:
import pandas as pd
import numpy as np
np.set_printoptions(precision=4)
import catboost
from catboost import datasets
from catboost import *

from grader_v2 import Grader

In [2]:
train_df, test_df = catboost.datasets.amazon()
train_df.head()

Unnamed: 0,ACTION,RESOURCE,MGR_ID,ROLE_ROLLUP_1,ROLE_ROLLUP_2,ROLE_DEPTNAME,ROLE_TITLE,ROLE_FAMILY_DESC,ROLE_FAMILY,ROLE_CODE
0,1,39353,85475,117961,118300,123472,117905,117906,290919,117908
1,1,17183,1540,117961,118343,123125,118536,118536,308574,118539
2,1,36724,14457,118219,118220,117884,117879,267952,19721,117880
3,1,36135,5396,117961,118343,119993,118321,240983,290919,118322
4,1,42680,5905,117929,117930,119569,119323,123932,19793,119325


In [3]:
grader = Grader()

## Preparing your data

Label values extraction

In [26]:
y = train_df.ACTION
X = train_df.drop('ACTION', axis=1)

Categorical features declaration

In [27]:
cat_features = list(range(0, X.shape[1]))
print(cat_features)

[0, 1, 2, 3, 4, 5, 6, 7, 8]


Now it makes sense to analyze the dataset.
First, calculate how many positive and negative objects are present in the train dataset.

**Question 1:**

How many negative objects are present in the train dataset X?

In [28]:
zero_count = 1897
grader.submit_tag('negative_samples', zero_count)

Current answer for task negative_samples is: 1897


**Question 2:**

How many positive objects are present in the train dataset X?

In [29]:
one_count = 30872
grader.submit_tag('positive_samples', one_count)

Current answer for task positive_samples is: 30872


In [30]:
print('Zero count = ' + str(zero_count) + ', One count = ' + str(one_count))

Zero count = 1897, One count = 30872
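
Rather than typing the counts by hand, they could be computed from the label column. The sketch below uses a made-up toy Series in place of `train_df.ACTION`, so the numbers it prints are not the answers to the questions above; the names `y_toy`, `zero_count_toy` and `one_count_toy` are illustrative.

```python
import pandas as pd

# Toy label column standing in for train_df.ACTION (values are made up).
y_toy = pd.Series([1, 1, 0, 1, 0, 1, 1], name="ACTION")

# value_counts() returns the number of rows per distinct label.
counts = y_toy.value_counts()
zero_count_toy = int(counts.get(0, 0))
one_count_toy = int(counts.get(1, 0))

print("Zero count = {}, One count = {}".format(zero_count_toy, one_count_toy))
```

With the real data, the same calls applied to `y` should give the two counts directly.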


Now, for every feature, calculate the number of unique values it takes.

**Question 3:**
    
How many unique values does the feature RESOURCE have?

In [31]:
unique_vals_for_RESOURCE = 7518
grader.submit_tag('resource_unique_values', unique_vals_for_RESOURCE)

Current answer for task resource_unique_values is: 7518
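
The number of unique values per feature can be obtained in one call with pandas. This is a sketch on a made-up toy frame (`X_toy` is an illustrative stand-in for `X`), so the printed numbers are not the homework answers.

```python
import pandas as pd

# Toy frame standing in for X (values are made up).
X_toy = pd.DataFrame({
    "RESOURCE": [10, 10, 20, 30],
    "MGR_ID": [1, 2, 3, 4],
    "ROLE_CODE": [7, 7, 7, 7],
})

# nunique() counts distinct values in every column.
unique_counts = X_toy.nunique()
print(unique_counts)
```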


Now we can create a `Pool` object. This type is used for datasets in CatBoost. You can also pass a numpy array or a DataFrame directly, but working with the `Pool` class is the most efficient way in terms of memory and speed. We recommend creating a `Pool` from a file if your data is on disk, or from `FeaturesData` if you use numpy.

In [32]:
import numpy as np
from catboost import Pool

pool1 = Pool(data=X, label=y, cat_features=cat_features)
pool2 = Pool(data='/opt/conda/lib/python3.6/site-packages/catboost/cached_datasets/amazon/train.csv', delimiter=',', has_header=True)
pool3 = Pool(data=X, cat_features=cat_features)

print('Dataset shape')
print('dataset 1:' + str(pool1.shape) + '\ndataset 2:' + str(pool2.shape)  + '\ndataset 3:' + str(pool3.shape))

print('\n')
print('Column names')
print('dataset 1: ')
print(pool1.get_feature_names()) 
print('\ndataset 2:')
print(pool2.get_feature_names())
print('\ndataset 3:')
print(pool3.get_feature_names())

Dataset shape
dataset 1:(32769, 9)
dataset 2:(32769, 9)
dataset 3:(32769, 9)


Column names
dataset 1: 
['RESOURCE', 'MGR_ID', 'ROLE_ROLLUP_1', 'ROLE_ROLLUP_2', 'ROLE_DEPTNAME', 'ROLE_TITLE', 'ROLE_FAMILY_DESC', 'ROLE_FAMILY', 'ROLE_CODE']

dataset 2:
['RESOURCE', 'MGR_ID', 'ROLE_ROLLUP_1', 'ROLE_ROLLUP_2', 'ROLE_DEPTNAME', 'ROLE_TITLE', 'ROLE_FAMILY_DESC', 'ROLE_FAMILY', 'ROLE_CODE']

dataset 3:
['RESOURCE', 'MGR_ID', 'ROLE_ROLLUP_1', 'ROLE_ROLLUP_2', 'ROLE_DEPTNAME', 'ROLE_TITLE', 'ROLE_FAMILY_DESC', 'ROLE_FAMILY', 'ROLE_CODE']


## Split your data into train and validation

When training your model, you will have to detect overfitting and select the best parameters. To do that you need a validation dataset.
Normally you would use a random split, for example
`train_test_split` from `sklearn.model_selection`.
But for the purposes of this homework the train part will be the first 80% of the data and the evaluation part will be the last 20% of the data.

In [33]:
train_count = int(X.shape[0] * 0.8)

X_train = X.iloc[:train_count,:]
y_train = y[:train_count]
X_validation = X.iloc[train_count:, :]
y_validation = y[train_count:]
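
To make the difference between the two kinds of split concrete, the sketch below applies both to a made-up toy dataset. The random variant shuffles row order with pandas instead of calling `train_test_split`; all names ending in `_toy`, `_seq` and `_rnd` are illustrative.

```python
import pandas as pd

# Toy data standing in for X and y (values are made up).
X_toy = pd.DataFrame({"feature": range(10)})
y_toy = pd.Series([0, 1] * 5)

train_count = int(X_toy.shape[0] * 0.8)

# Sequential split, as used in this homework: first 80% / last 20%.
X_train_seq, X_val_seq = X_toy.iloc[:train_count], X_toy.iloc[train_count:]
y_train_seq, y_val_seq = y_toy.iloc[:train_count], y_toy.iloc[train_count:]

# Random split: shuffle the row order once, then cut at the same position.
shuffled = X_toy.sample(frac=1.0, random_state=0).index
X_train_rnd, X_val_rnd = X_toy.loc[shuffled[:train_count]], X_toy.loc[shuffled[train_count:]]
y_train_rnd, y_val_rnd = y_toy.loc[shuffled[:train_count]], y_toy.loc[shuffled[train_count:]]

print(list(X_val_seq.index), sorted(X_val_rnd.index))
```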

## Train your model

Now we will train our first model.

In [53]:
from catboost import CatBoostClassifier
model = CatBoostClassifier(
    iterations=5,
    random_seed=0,
    learning_rate=0.1
)
model.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_validation, y_validation),
    logging_level='Silent'
)
print('Model is fitted: ' + str(model.is_fitted()))
print('Model params:')
print(model.get_params())

Model is fitted: True
Model params:
{'random_seed': 0, 'loss_function': 'Logloss', 'learning_rate': 0.1, 'iterations': 5}


## Stdout of the training

During training, stdout shows the value of the loss function on each iteration, or on every k-th iteration if you set `verbose=k`.
It also shows how much time has passed since the start of training and how much time is left.

In [54]:
from catboost import CatBoostClassifier
model = CatBoostClassifier(
    iterations=15,
    verbose=3
)
model.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_validation, y_validation),
)

0:	learn: 0.3007996	test: 0.3044268	best: 0.3044268 (0)	total: 71.6ms	remaining: 1s
3:	learn: 0.1800237	test: 0.1674121	best: 0.1674121 (3)	total: 1.63s	remaining: 4.49s
6:	learn: 0.1706950	test: 0.1549520	best: 0.1549520 (6)	total: 3.06s	remaining: 3.5s
9:	learn: 0.1672391	test: 0.1495040	best: 0.1495040 (9)	total: 4.34s	remaining: 2.17s
12:	learn: 0.1645499	test: 0.1487789	best: 0.1487789 (12)	total: 5.93s	remaining: 912ms
14:	learn: 0.1630092	test: 0.1469375	best: 0.1469375 (14)	total: 7.06s	remaining: 0us

bestTest = 0.1469374586
bestIteration = 14



<catboost.core.CatBoostClassifier at 0x7f6dd6ca0cf8>

## Random seed

If you don't specify `random_seed`, the random seed will be set to a new value each time.
After the training has finished you can look at the value of the random seed that was used.
If you train again with this `random_seed`, you will get the same results.

In [55]:
from catboost import CatBoostClassifier
model = CatBoostClassifier(
    iterations=5
)
model.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_validation, y_validation),
)

0:	learn: 0.3007996	test: 0.3044268	best: 0.3044268 (0)	total: 80.7ms	remaining: 323ms
1:	learn: 0.2161146	test: 0.2152075	best: 0.2152075 (1)	total: 680ms	remaining: 1.02s
2:	learn: 0.1879597	test: 0.1797290	best: 0.1797290 (2)	total: 1.06s	remaining: 707ms
3:	learn: 0.1800237	test: 0.1674121	best: 0.1674121 (3)	total: 1.68s	remaining: 421ms
4:	learn: 0.1732668	test: 0.1581682	best: 0.1581682 (4)	total: 2.29s	remaining: 0us

bestTest = 0.1581682309
bestIteration = 4



<catboost.core.CatBoostClassifier at 0x7f6dd6ca0a20>

In [56]:
random_seed = model.random_seed_
print('Used random seed = ' + str(random_seed))
model = CatBoostClassifier(
    iterations=5,
    random_seed=random_seed
)
model.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_validation, y_validation),
)

Used random seed = 0
0:	learn: 0.3007996	test: 0.3044268	best: 0.3044268 (0)	total: 42.3ms	remaining: 169ms
1:	learn: 0.2161146	test: 0.2152075	best: 0.2152075 (1)	total: 642ms	remaining: 963ms
2:	learn: 0.1879597	test: 0.1797290	best: 0.1797290 (2)	total: 1.04s	remaining: 692ms
3:	learn: 0.1800237	test: 0.1674121	best: 0.1674121 (3)	total: 1.62s	remaining: 406ms
4:	learn: 0.1732668	test: 0.1581682	best: 0.1581682 (4)	total: 2.23s	remaining: 0us

bestTest = 0.1581682309
bestIteration = 4



<catboost.core.CatBoostClassifier at 0x7f6dd6cdf4a8>

Try training 10 models with the parameters below and calculate the mean and the standard deviation of the Logloss error on the validation dataset.

**Question 4:**

What is the mean value of the Logloss metric on the validation dataset (X_validation, y_validation) after training `CatBoostClassifier` 10 times with different random seeds in the following way:

`model = CatBoostClassifier(
    iterations=300,
    learning_rate=0.1,
    random_seed={my_random_seed}
)
model.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_validation, y_validation),
)
`

In [57]:
seed_list = range(10)
validation_loss = []
for seed in seed_list:
    model = CatBoostClassifier(
        iterations=300,
        learning_rate=0.1,
        random_seed=seed
        )
    model.fit(
        X_train, y_train,
        cat_features=cat_features,
        eval_set=(X_validation, y_validation),
        verbose = 50
        )
    validation_loss.append(model.best_score_['validation_0']['Logloss'])

0:	learn: 0.5790122	test: 0.5797377	best: 0.5797377 (0)	total: 90.1ms	remaining: 26.9s
50:	learn: 0.1606281	test: 0.1432391	best: 0.1432391 (50)	total: 25.1s	remaining: 2m 2s
100:	learn: 0.1555506	test: 0.1394226	best: 0.1394226 (100)	total: 48.6s	remaining: 1m 35s
150:	learn: 0.1515049	test: 0.1384309	best: 0.1384019 (148)	total: 1m 17s	remaining: 1m 16s
200:	learn: 0.1486424	test: 0.1381162	best: 0.1381162 (200)	total: 1m 46s	remaining: 52.7s
250:	learn: 0.1467162	test: 0.1379267	best: 0.1379267 (250)	total: 2m 16s	remaining: 26.6s
299:	learn: 0.1452004	test: 0.1379528	best: 0.1377092 (262)	total: 2m 44s	remaining: 0us

bestTest = 0.1377092211
bestIteration = 262

Shrink model to first 263 iterations.
0:	learn: 0.5785828	test: 0.5796621	best: 0.5796621 (0)	total: 375ms	remaining: 1m 52s
50:	learn: 0.1617489	test: 0.1457523	best: 0.1457523 (50)	total: 25.1s	remaining: 2m 2s
100:	learn: 0.1542598	test: 0.1402208	best: 0.1401744 (97)	total: 51.4s	remaining: 1m 41s
150:	learn: 0.1509185	

In [61]:
validation_loss

[0.1377092210691794,
 0.13908961147685442,
 0.13890494877003,
 0.13718808123056636,
 0.13969110529935072,
 0.13802919472444256,
 0.1371345869369057,
 0.13801195200538616,
 0.13811303812914325,
 0.13793673162140377]

In [9]:
validation_loss_hard_coded = np.array([0.1377092210691794,
 0.13908961147685442,
 0.13890494877003,
 0.13718808123056636,
 0.13969110529935072,
 0.13802919472444256,
 0.1371345869369057,
 0.13801195200538616,
 0.13811303812914325,
 0.13793673162140377])

In [10]:
mean = validation_loss_hard_coded.mean()
grader.submit_tag('logloss_mean', mean)

Current answer for task logloss_mean is: 0.138180847126


**Question 5:**

What is the standard deviation of the Logloss values from Question 4?

In [11]:
stddev = validation_loss_hard_coded.std()
grader.submit_tag('logloss_std', stddev)

Current answer for task logloss_std is: 0.000777781825619
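
Note that `ndarray.std()` computes the population standard deviation (`ddof=0`) by default. Which convention the grader expects is not stated here, so the comparison below is only illustrative; it reuses the ten validation losses listed above, and the name `losses` is illustrative.

```python
import numpy as np

# The ten validation losses recorded above.
losses = np.array([
    0.1377092210691794,
    0.13908961147685442,
    0.13890494877003,
    0.13718808123056636,
    0.13969110529935072,
    0.13802919472444256,
    0.1371345869369057,
    0.13801195200538616,
    0.13811303812914325,
    0.13793673162140377,
])

mean = losses.mean()
std_population = losses.std()     # ddof=0, NumPy's default
std_sample = losses.std(ddof=1)   # ddof=1, sample estimate

print("mean = {:.6f}".format(mean))
print("std (ddof=0) = {:.6f}".format(std_population))
print("std (ddof=1) = {:.6f}".format(std_sample))
```

With only ten runs the two estimates differ by about 5%, which matters if the answer is checked to several decimal places.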


## Metrics calculation and graph plotting

When experimenting in a Jupyter notebook you can see graphs of different metrics during training.
To do that, pass the `plot=True` parameter to `fit`.

In [25]:
from catboost import CatBoostClassifier
model = CatBoostClassifier(
    iterations=50,
    random_seed=63,
    learning_rate=0.1,
    custom_loss=['Accuracy']
)
model.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_validation, y_validation),
    logging_level='Silent',
    plot=True
)

A Jupyter Widget

<catboost.core.CatBoostClassifier at 0x7f6dd6921668>

**Question 6:**

What is the value of the Accuracy metric on the evaluation dataset after training with parameters `iterations=50`, `random_seed=63`, `learning_rate=0.1`?

In [12]:
accuracy = 0.9539213
grader.submit_tag('accuracy_6', accuracy)

Current answer for task accuracy_6 is: 0.9539213


## Model comparison

In [27]:
model1 = CatBoostClassifier(
    learning_rate=0.5,
    iterations=1000,
    random_seed=64,
    train_dir='learning_rate_0.5',
    custom_loss = ['Accuracy']
)

model2 = CatBoostClassifier(
    learning_rate=0.05,
    iterations=1000,
    random_seed=64,
    train_dir='learning_rate_0.05',
    custom_loss = ['Accuracy']
)
model1.fit(
    X_train, y_train,
    eval_set=(X_validation, y_validation),
    cat_features=cat_features,
    verbose=100
)
model2.fit(
    X_train, y_train,
    eval_set=(X_validation, y_validation),
    cat_features=cat_features,
    verbose=100
)

0:	learn: 0.3007996	test: 0.3044268	best: 0.3044268 (0)	total: 138ms	remaining: 2m 17s


KeyboardInterrupt: 

In [None]:
from catboost import MetricVisualizer
MetricVisualizer(['learning_rate_0.05', 'learning_rate_0.5']).start()

**Question 7:**

Try training these models for 1000 iterations. Which model gives the better best Accuracy on the validation dataset?
By best Accuracy we mean the accuracy on the best iteration, which might not be the last iteration.

In [13]:
best_model_name = 'learning_rate_0.05' # one of 'learning_rate_0.5', 'learning_rate_0.05'
grader.submit_tag('best_model_name', best_model_name)

Current answer for task best_model_name is: learning_rate_0.05


## Best iteration

If a validation dataset is present, then after training the model is shrunk to the number of trees at which it achieved the best evaluation metric value on the validation dataset.
By default the evaluation metric is the optimized metric, but you can set it to some other metric.
In the example below the evaluation metric is `Accuracy`.

In [29]:
from catboost import CatBoostClassifier
model = CatBoostClassifier(
    iterations=100,
    random_seed=63,
    learning_rate=0.5,
    eval_metric='Accuracy'
)
model.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_validation, y_validation),
    logging_level='Silent',
    plot=True
)

A Jupyter Widget

<catboost.core.CatBoostClassifier at 0x7f6dd6cd3390>

In [30]:
print('Tree count: ' + str(model.tree_count_))

Tree count: 72
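
The relationship between the best iteration and the tree count can be illustrated without CatBoost. The sketch below uses made-up validation accuracies and a hypothetical helper `trees_after_shrinking`. It assumes, as in the training log earlier (`bestIteration = 262` followed by "Shrink model to first 263 iterations"), that iterations are numbered from 0 and that the model keeps every tree up to and including the best one. Picking the first iteration on ties is an assumption of this sketch.

```python
def trees_after_shrinking(metric_values, higher_is_better=True):
    """Return (best_iteration, tree_count) for a list of per-iteration metric values."""
    pick = max if higher_is_better else min
    best_value = pick(metric_values)
    # index() returns the first iteration that reaches the best value.
    best_iteration = metric_values.index(best_value)
    # Iterations are zero-based, so the shrunk model keeps best_iteration + 1 trees.
    return best_iteration, best_iteration + 1

# Made-up validation accuracies, one per iteration.
toy_accuracy = [0.90, 0.93, 0.95, 0.94, 0.95]
best_iteration, tree_count = trees_after_shrinking(toy_accuracy)
print(best_iteration, tree_count)
```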


If you don't want the model to be shrunk, you can set `use_best_model=False`.

In [65]:
model = CatBoostClassifier(
    iterations=100,
    random_seed=63,
    learning_rate=0.5,
    eval_metric='Accuracy',
    use_best_model=False
)
model.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_validation, y_validation),
    logging_level='Silent',
    plot=True
)

A Jupyter Widget

<catboost.core.CatBoostClassifier at 0x7f6dd6c95a90>

**Question 8:**
    
What will be the number of trees in the resulting model after training with a validation dataset and the parameters `iterations=100`, `learning_rate=0.5`, `eval_metric='Accuracy'` and `use_best_model=False`?

In [14]:
tree_count = 100
grader.submit_tag('num_trees', tree_count)

Current answer for task num_trees is: 100


## Cross-validation

The next piece of functionality you need to know about is cross-validation.
For imbalanced datasets, stratified cross-validation can be useful.

In [32]:
from catboost import cv

params = {}
params['loss_function'] = 'Logloss'
params['iterations'] = 80
params['custom_loss'] = 'AUC'
params['random_seed'] = 63
params['learning_rate'] = 0.5

cv_data = cv(
    params = params,
    pool = Pool(X, label=y, cat_features=cat_features),
    fold_count=5,
    inverted=False,
    shuffle=True,
    partition_random_seed=0,
    plot=True,
    stratified=True,
    verbose=False
)

A Jupyter Widget

Cross-validation returns the specified metric values on every iteration (or on every k-th iteration, if you specify so).

In [33]:
print(cv_data[0:4])

   test-AUC-mean  test-AUC-std  test-Logloss-mean  test-Logloss-std  \
0       0.500000      0.000000           0.302197          0.000080   
1       0.625621      0.122336           0.222651          0.014472   
2       0.799508      0.012871           0.179930          0.004739   
3       0.824558      0.013151           0.165090          0.003799   

   train-AUC-mean  train-AUC-std  train-Logloss-mean  train-Logloss-std  
0        0.499984       0.000017            0.302203           0.000050  
1        0.614679       0.109875            0.225825           0.010991  
2        0.758325       0.022924            0.190024           0.004146  
3        0.781285       0.017559            0.178807           0.003176  


Let's look at the mean value and standard deviation of Logloss for cv on the best iteration.

In [34]:
best_value = np.min(cv_data['test-Logloss-mean'])
best_iter = np.argmin(cv_data['test-Logloss-mean'])

# The cv call above was run with stratified=True, so label the result accordingly.
print('Best validation Logloss score, stratified: {:.4f}±{:.4f} on step {}'.format(
    best_value,
    cv_data['test-Logloss-std'][best_iter],
    best_iter)
)

Best validation Logloss score, stratified: 0.1409±0.0056 on step 65
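
An equivalent way to pick the best row is `Series.idxmin()`, which returns the index label directly. The sketch below runs on a made-up toy frame that only borrows the column names of the cv output; `cv_toy` and its values are illustrative.

```python
import pandas as pd

# Toy frame with the same column names as the cv output above (values are made up).
cv_toy = pd.DataFrame({
    "test-Logloss-mean": [0.30, 0.22, 0.18, 0.19],
    "test-Logloss-std": [0.001, 0.014, 0.005, 0.004],
})

# idxmin() returns the index label of the smallest mean Logloss.
best_iter = cv_toy["test-Logloss-mean"].idxmin()
best_row = cv_toy.loc[best_iter]

print("Best Logloss: {:.4f}±{:.4f} on step {}".format(
    best_row["test-Logloss-mean"], best_row["test-Logloss-std"], best_iter))
```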


**Question 9:**

Try running stratified cross-validation with the same parameters. What will be the mean of the Logloss metric on the test folds of the stratified cross-validation on the best iteration?

In [15]:
mean_on_best_iteration = 0.1409
grader.submit_tag('mean_logloss_cv', mean_on_best_iteration)

Current answer for task mean_logloss_cv is: 0.1409


**Question 10:**

Try running stratified cross-validation with the same parameters. What will be the standard deviation of the Logloss metric on the test folds of the stratified cross-validation on the best iteration?

In [16]:
std_on_best_iteration = 0.0056
grader.submit_tag('logloss_std_1', std_on_best_iteration)

Current answer for task logloss_std_1 is: 0.0056


## Overfitting detector

A useful feature of the library is the overfitting detector.
Let's try training the model with early stopping.

In [37]:
model_with_early_stop = CatBoostClassifier(
    iterations=200,
    random_seed=63,
    learning_rate=0.5,
    od_type='Iter',
    od_wait=20,
    eval_metric = 'AUC'
)
model_with_early_stop.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_validation, y_validation),
    logging_level='Silent',
    plot=True
)

A Jupyter Widget

<catboost.core.CatBoostClassifier at 0x7f6dd6cdfa58>

**Question 11:**

Now try training the model with the same parameters and with the overfitting detector, but with `eval_metric='AUC'`.
What will be the number of iterations after which the training will stop?
(Not the number of trees in the resulting model, but the number of iterations that the algorithm will perform before stopping.)

In [19]:
iterations_count = 64
grader.submit_tag('iterations_overfitting', iterations_count)

Current answer for task iterations_overfitting is: 64
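
The difference between the number of iterations performed and the number of trees kept can be seen in a simplified, patience-based stopping rule. The sketch below is an illustration of the general idea behind `od_type='Iter'` with `od_wait`, not CatBoost's actual implementation, and its exact off-by-one behaviour may differ from the library. The helper `iterations_until_stop` and the metric values are made up.

```python
def iterations_until_stop(metric_values, wait, higher_is_better=True):
    """Return how many iterations run before a patience-based detector stops.

    Training stops once the metric has not improved for `wait`
    consecutive iterations after the best one.
    """
    best_value = None
    best_iteration = -1
    for iteration, value in enumerate(metric_values):
        improved = (
            best_value is None
            or (value > best_value if higher_is_better else value < best_value)
        )
        if improved:
            best_value, best_iteration = value, iteration
        elif iteration - best_iteration >= wait:
            # No improvement for `wait` iterations: stop here.
            return iteration + 1
    return len(metric_values)

# Made-up validation AUC values, one per iteration.
toy_auc = [0.70, 0.80, 0.85, 0.84, 0.85, 0.83, 0.82, 0.90]
performed = iterations_until_stop(toy_auc, wait=3)
print(performed)
```

In this toy run the best value is reached on iteration 2, so a shrunk model would keep 3 trees, while the detector only stops after 6 iterations have been performed.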


## Snapshotting

If you train for a long time, for example for several hours, you need to save snapshots.
Otherwise, if your laptop or server reboots, you will lose all the progress.
To do that, specify the `snapshot_file` parameter.
Try running the code below and interrupting the kernel after a short time.
Then try running the same cell again.
The training will resume from the iteration at which it was interrupted.
Note that all additional files are written by default into the `catboost_info` directory. This can be changed using the `train_dir` parameter. So the snapshot file will be there.

In [80]:
from catboost import CatBoostClassifier
model = CatBoostClassifier(
    iterations=40,
    save_snapshot=True,
    snapshot_file='snapshot.bkp',
    random_seed=43
)
model.fit(
    X_train, y_train,
    eval_set=(X_validation, y_validation),
    cat_features=cat_features,
    logging_level='Verbose'
)

0:	learn: 0.3284341	test: 0.3318014	best: 0.3318014 (0)	total: 293ms	remaining: 11.4s
1:	learn: 0.2435617	test: 0.2470202	best: 0.2470202 (1)	total: 610ms	remaining: 11.6s
2:	learn: 0.1973609	test: 0.1922760	best: 0.1922760 (2)	total: 1.28s	remaining: 15.8s
3:	learn: 0.1838698	test: 0.1745423	best: 0.1745423 (3)	total: 1.82s	remaining: 16.3s
4:	learn: 0.1769813	test: 0.1634033	best: 0.1634033 (4)	total: 2.29s	remaining: 16.1s
5:	learn: 0.1729134	test: 0.1575425	best: 0.1575425 (5)	total: 2.88s	remaining: 16.4s
6:	learn: 0.1719994	test: 0.1558761	best: 0.1558761 (6)	total: 3.2s	remaining: 15.1s
7:	learn: 0.1703007	test: 0.1537828	best: 0.1537828 (7)	total: 3.6s	remaining: 14.4s
8:	learn: 0.1682803	test: 0.1513517	best: 0.1513517 (8)	total: 4.11s	remaining: 14.2s
9:	learn: 0.1666971	test: 0.1503261	best: 0.1503261 (9)	total: 4.68s	remaining: 14s
10:	learn: 0.1660670	test: 0.1500227	best: 0.1500227 (10)	total: 5.09s	remaining: 13.4s
11:	learn: 0.1654023	test: 0.1500106	best: 0.1500106 (11

<catboost.core.CatBoostClassifier at 0x7f6dd6c77240>

## Model predictions

There are multiple ways to make predictions.
The easiest one is to call `predict` or `predict_proba`.
You can also make predictions using C++ code. For that see the [documentation](https://tech.yandex.com/catboost/doc/dg/concepts/c-plus-plus-api-docpage/).

In [81]:
print(model.predict_proba(data=X_validation))

[[ 0.0159  0.9841]
 [ 0.0157  0.9843]
 [ 0.0059  0.9941]
 ..., 
 [ 0.0071  0.9929]
 [ 0.3818  0.6182]
 [ 0.0263  0.9737]]


In [82]:
print(model.predict(data=X_validation))

[ 1.  1.  1. ...,  1.  1.  1.]


For binary classification the raw model output (`prediction_type='RawFormulaVal'`) is not necessarily a value in `[0,1]`; it is some real number. To get the probability out of this value you need to apply the sigmoid function to it.

In [83]:
raw_pred = model.predict(data=X_validation, prediction_type='RawFormulaVal')
print(raw_pred)

[ 4.1255  4.1387  5.122  ...,  4.9439  0.4819  3.6114]


In [84]:
import math
def sigmoid(x):
    return 1 / (1 + math.exp(-x))
probabilities = [sigmoid(x) for x in raw_pred]
print(np.array(probabilities))

[ 0.9841  0.9843  0.9941 ...,  0.9929  0.6182  0.9737]
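
The same conversion can be written in vectorized form with NumPy, which avoids the Python-level loop. The sketch below applies it to the six raw values visible in the printed output above; since those values are rounded to four decimals, the result only matches the printed probabilities approximately. The name `raw_sample` is illustrative.

```python
import numpy as np

def sigmoid(x):
    # Element-wise logistic function 1 / (1 + exp(-x)).
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))

# Raw values copied from the printed output above.
raw_sample = np.array([4.1255, 4.1387, 5.122, 4.9439, 0.4819, 3.6114])
probabilities = sigmoid(raw_sample)
print(probabilities)
```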


## Staged prediction

CatBoost also supports staged prediction, which gives a prediction for each object on each iteration (or on every k-th iteration). This can be used if you want to calculate the values of some custom metric from the predictions.

In [85]:
predictions_gen = model.staged_predict_proba(data=X_validation, ntree_start=0, ntree_end=5, eval_period=1)
for iteration, predictions in enumerate(predictions_gen):
    print('Iteration ' + str(iteration) + ', predictions:')
    print(predictions)

Iteration 0, predictions:
[[ 0.228  0.772]
 [ 0.228  0.772]
 [ 0.228  0.772]
 ..., 
 [ 0.228  0.772]
 [ 0.228  0.772]
 [ 0.228  0.772]]
Iteration 1, predictions:
[[ 0.1121  0.8879]
 [ 0.1273  0.8727]
 [ 0.1121  0.8879]
 ..., 
 [ 0.1121  0.8879]
 [ 0.221   0.779 ]
 [ 0.1526  0.8474]]
Iteration 2, predictions:
[[ 0.0599  0.9401]
 [ 0.0686  0.9314]
 [ 0.0599  0.9401]
 ..., 
 [ 0.0599  0.9401]
 [ 0.3378  0.6622]
 [ 0.0833  0.9167]]
Iteration 3, predictions:
[[ 0.0424  0.9576]
 [ 0.0486  0.9514]
 [ 0.0424  0.9576]
 ..., 
 [ 0.0424  0.9576]
 [ 0.3799  0.6201]
 [ 0.0594  0.9406]]
Iteration 4, predictions:
[[ 0.0267  0.9733]
 [ 0.0435  0.9565]
 [ 0.0267  0.9733]
 ..., 
 [ 0.0379  0.9621]
 [ 0.3549  0.6451]
 [ 0.0531  0.9469]]
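
As an illustration of computing a custom metric from staged predictions, the sketch below evaluates accuracy for every stage. To stay self-contained it iterates over a made-up list of probability arrays instead of the generator returned by `staged_predict_proba`; the helper `staged_accuracy`, the threshold of 0.5 and all data are assumptions for illustration.

```python
import numpy as np

def staged_accuracy(staged_probabilities, y_true, threshold=0.5):
    """Compute accuracy for every stage of an iterable of (n, 2) probability arrays."""
    y_true = np.asarray(y_true)
    scores = []
    for probabilities in staged_probabilities:
        # Column 1 holds the probability of the positive class.
        predicted = (np.asarray(probabilities)[:, 1] > threshold).astype(int)
        scores.append(float((predicted == y_true).mean()))
    return scores

# Made-up stand-in for the per-iteration outputs of staged prediction.
toy_stages = [
    np.array([[0.2, 0.8], [0.2, 0.8], [0.2, 0.8], [0.2, 0.8]]),
    np.array([[0.1, 0.9], [0.6, 0.4], [0.3, 0.7], [0.4, 0.6]]),
    np.array([[0.1, 0.9], [0.7, 0.3], [0.2, 0.8], [0.6, 0.4]]),
]
y_toy = [1, 0, 1, 0]

scores = staged_accuracy(toy_stages, y_toy)
print(scores)
```

With a real model, the same function could be given the generator from `staged_predict_proba` together with `y_validation`.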


  


## Metric evaluation on a new dataset

You can also calculate metrics on a dataset after training, using `eval_metrics`.

In [89]:
metrics = model.eval_metrics(data=pool1, metrics=['Logloss','AUC'], plot=True)

In [90]:
print('AUC values:')
print(np.array(metrics['AUC']))

AUC values:
[ 0.4999  0.6183  0.6184  0.6297  0.6447  0.6465  0.6411  0.6746  0.7442
  0.7388  0.7501  0.75    0.7794  0.7794  0.7758  0.7889  0.8162  0.834
  0.8387  0.8657  0.8657  0.8657  0.8875  0.9022  0.9068  0.9109  0.9223
  0.9256  0.9266  0.9283  0.9296  0.93    0.934   0.9356  0.9365  0.9366
  0.9379  0.9404  0.9405  0.9406  0.9416  0.9428  0.9436  0.944   0.9467
  0.9474  0.9498  0.9511  0.9519  0.9528  0.9539  0.9543  0.9546  0.9552
  0.9553  0.9554  0.9561  0.9562  0.9564  0.9565  0.9567  0.9567  0.957
  0.9594  0.9612  0.9632  0.9645  0.9645  0.965   0.965   0.9659  0.9659
  0.9668  0.9666  0.9675  0.9692  0.9702  0.9709  0.9711  0.9714  0.9722
  0.973   0.973   0.9733  0.9734  0.9737  0.9737  0.9739  0.9739  0.9739
  0.9738  0.9739  0.9742  0.9742  0.9742  0.9742  0.9742  0.9741  0.9741
  0.9744  0.9746  0.9746  0.9745  0.9745  0.9746  0.9749  0.9749  0.975
  0.975   0.975   0.975   0.975   0.975   0.9751  0.9753  0.9755  0.9756
  0.9756  0.9758  0.9759  0.9759  0.9761  

A Jupyter Widget

**Question 12:**

Now train a model in the following way:

`
from catboost import CatBoostClassifier
model = CatBoostClassifier(
    iterations=1000,
    learning_rate=0.05,
    random_seed=43
)
model.fit(
    X_train, y_train,
    eval_set=(X_validation, y_validation),
    cat_features=cat_features,
    logging_level='Verbose'
)
`

What will be the AUC value at iteration 550 if the metrics are evaluated on the initial dataset X?

In [88]:
from catboost import CatBoostClassifier
model = CatBoostClassifier(
    iterations=1000,
    learning_rate=0.05,
    random_seed=43
)
model.fit(
    X_train, y_train,
    eval_set=(X_validation, y_validation),
    cat_features=cat_features,
    logging_level='Verbose'
)

0:	learn: 0.6335962	test: 0.6340206	best: 0.6340206 (0)	total: 267ms	remaining: 4m 26s
1:	learn: 0.5800784	test: 0.5804728	best: 0.5804728 (1)	total: 653ms	remaining: 5m 26s
2:	learn: 0.5348017	test: 0.5355652	best: 0.5355652 (2)	total: 991ms	remaining: 5m 29s
3:	learn: 0.4948441	test: 0.4962757	best: 0.4962757 (3)	total: 1.48s	remaining: 6m 8s
4:	learn: 0.4596700	test: 0.4614066	best: 0.4614066 (4)	total: 1.87s	remaining: 6m 12s
5:	learn: 0.4291247	test: 0.4309207	best: 0.4309207 (5)	total: 2.45s	remaining: 6m 45s
6:	learn: 0.4027804	test: 0.4048004	best: 0.4048004 (6)	total: 2.86s	remaining: 6m 46s
7:	learn: 0.3792210	test: 0.3813320	best: 0.3813320 (7)	total: 3.26s	remaining: 6m 43s
8:	learn: 0.3570632	test: 0.3594469	best: 0.3594469 (8)	total: 3.75s	remaining: 6m 52s
9:	learn: 0.3389886	test: 0.3409673	best: 0.3409673 (9)	total: 4.17s	remaining: 6m 53s
10:	learn: 0.3225536	test: 0.3249813	best: 0.3249813 (10)	total: 4.59s	remaining: 6m 52s
11:	learn: 0.3096544	test: 0.3120757	best:

93:	learn: 0.1617124	test: 0.1455788	best: 0.1455788 (93)	total: 47.5s	remaining: 7m 38s
94:	learn: 0.1615787	test: 0.1454088	best: 0.1454088 (94)	total: 48.1s	remaining: 7m 38s
95:	learn: 0.1615018	test: 0.1453321	best: 0.1453321 (95)	total: 48.7s	remaining: 7m 38s
96:	learn: 0.1614838	test: 0.1453240	best: 0.1453240 (96)	total: 49.2s	remaining: 7m 38s
97:	learn: 0.1613493	test: 0.1451730	best: 0.1451730 (97)	total: 49.8s	remaining: 7m 38s
98:	learn: 0.1612984	test: 0.1451546	best: 0.1451546 (98)	total: 50.3s	remaining: 7m 38s
99:	learn: 0.1611222	test: 0.1449381	best: 0.1449381 (99)	total: 50.9s	remaining: 7m 38s
100:	learn: 0.1609553	test: 0.1447964	best: 0.1447964 (100)	total: 51.4s	remaining: 7m 37s
101:	learn: 0.1608943	test: 0.1447115	best: 0.1447115 (101)	total: 52s	remaining: 7m 38s
102:	learn: 0.1608000	test: 0.1446959	best: 0.1446959 (102)	total: 52.6s	remaining: 7m 38s
103:	learn: 0.1607811	test: 0.1446851	best: 0.1446851 (103)	total: 52.8s	remaining: 7m 35s
104:	learn: 0.1

183:	learn: 0.1562893	test: 0.1413826	best: 0.1413529 (182)	total: 1m 32s	remaining: 6m 49s
184:	learn: 0.1562284	test: 0.1412802	best: 0.1412802 (184)	total: 1m 32s	remaining: 6m 49s
185:	learn: 0.1562283	test: 0.1412801	best: 0.1412801 (185)	total: 1m 33s	remaining: 6m 47s
186:	learn: 0.1561934	test: 0.1412981	best: 0.1412801 (185)	total: 1m 33s	remaining: 6m 47s
187:	learn: 0.1561865	test: 0.1412822	best: 0.1412801 (185)	total: 1m 34s	remaining: 6m 47s
188:	learn: 0.1561855	test: 0.1412821	best: 0.1412801 (185)	total: 1m 34s	remaining: 6m 46s
189:	learn: 0.1561630	test: 0.1412833	best: 0.1412801 (185)	total: 1m 35s	remaining: 6m 45s
190:	learn: 0.1560906	test: 0.1412598	best: 0.1412598 (190)	total: 1m 35s	remaining: 6m 45s
191:	learn: 0.1560763	test: 0.1412665	best: 0.1412598 (190)	total: 1m 36s	remaining: 6m 45s
192:	learn: 0.1560762	test: 0.1412669	best: 0.1412598 (190)	total: 1m 36s	remaining: 6m 44s
193:	learn: 0.1560696	test: 0.1412523	best: 0.1412523 (193)	total: 1m 37s	remain

273:	learn: 0.1519447	test: 0.1393321	best: 0.1393321 (273)	total: 2m 22s	remaining: 6m 17s
274:	learn: 0.1519223	test: 0.1393304	best: 0.1393304 (274)	total: 2m 23s	remaining: 6m 17s
275:	learn: 0.1518849	test: 0.1393317	best: 0.1393304 (274)	total: 2m 23s	remaining: 6m 16s
276:	learn: 0.1518522	test: 0.1393071	best: 0.1393071 (276)	total: 2m 24s	remaining: 6m 16s
277:	learn: 0.1518385	test: 0.1393221	best: 0.1393071 (276)	total: 2m 24s	remaining: 6m 15s
278:	learn: 0.1517941	test: 0.1393123	best: 0.1393071 (276)	total: 2m 25s	remaining: 6m 15s
279:	learn: 0.1517207	test: 0.1392926	best: 0.1392926 (279)	total: 2m 25s	remaining: 6m 14s
280:	learn: 0.1516382	test: 0.1392190	best: 0.1392190 (280)	total: 2m 26s	remaining: 6m 14s
281:	learn: 0.1516030	test: 0.1392211	best: 0.1392190 (280)	total: 2m 26s	remaining: 6m 13s
282:	learn: 0.1515593	test: 0.1392389	best: 0.1392190 (280)	total: 2m 27s	remaining: 6m 13s
283:	learn: 0.1514680	test: 0.1392362	best: 0.1392190 (280)	total: 2m 27s	remain

363:	learn: 0.1492291	test: 0.1390084	best: 0.1389079 (354)	total: 3m 15s	remaining: 5m 41s
364:	learn: 0.1491846	test: 0.1389854	best: 0.1389079 (354)	total: 3m 15s	remaining: 5m 40s
365:	learn: 0.1491685	test: 0.1389916	best: 0.1389079 (354)	total: 3m 16s	remaining: 5m 40s
366:	learn: 0.1491665	test: 0.1389915	best: 0.1389079 (354)	total: 3m 16s	remaining: 5m 39s
367:	learn: 0.1490961	test: 0.1389736	best: 0.1389079 (354)	total: 3m 17s	remaining: 5m 39s
368:	learn: 0.1490872	test: 0.1389692	best: 0.1389079 (354)	total: 3m 18s	remaining: 5m 38s
369:	learn: 0.1490723	test: 0.1389852	best: 0.1389079 (354)	total: 3m 18s	remaining: 5m 38s
370:	learn: 0.1490125	test: 0.1389267	best: 0.1389079 (354)	total: 3m 19s	remaining: 5m 37s
371:	learn: 0.1489802	test: 0.1389401	best: 0.1389079 (354)	total: 3m 19s	remaining: 5m 37s
372:	learn: 0.1489752	test: 0.1389511	best: 0.1389079 (354)	total: 3m 20s	remaining: 5m 36s
373:	learn: 0.1489643	test: 0.1389569	best: 0.1389079 (354)	total: 3m 20s	remain

453:	learn: 0.1470704	test: 0.1390522	best: 0.1389002 (376)	total: 4m 6s	remaining: 4m 56s
454:	learn: 0.1470668	test: 0.1390508	best: 0.1389002 (376)	total: 4m 6s	remaining: 4m 55s
455:	learn: 0.1470663	test: 0.1390518	best: 0.1389002 (376)	total: 4m 7s	remaining: 4m 55s
456:	learn: 0.1470639	test: 0.1390501	best: 0.1389002 (376)	total: 4m 7s	remaining: 4m 54s
457:	learn: 0.1470313	test: 0.1390698	best: 0.1389002 (376)	total: 4m 8s	remaining: 4m 54s
458:	learn: 0.1470305	test: 0.1390709	best: 0.1389002 (376)	total: 4m 9s	remaining: 4m 53s
459:	learn: 0.1470232	test: 0.1390678	best: 0.1389002 (376)	total: 4m 9s	remaining: 4m 53s
460:	learn: 0.1470230	test: 0.1390685	best: 0.1389002 (376)	total: 4m 10s	remaining: 4m 52s
461:	learn: 0.1470147	test: 0.1390619	best: 0.1389002 (376)	total: 4m 10s	remaining: 4m 52s
462:	learn: 0.1470115	test: 0.1390619	best: 0.1389002 (376)	total: 4m 11s	remaining: 4m 51s
463:	learn: 0.1470088	test: 0.1390580	best: 0.1389002 (376)	total: 4m 11s	remaining: 4m

543:	learn: 0.1454142	test: 0.1390041	best: 0.1389002 (376)	total: 4m 59s	remaining: 4m 10s
544:	learn: 0.1454129	test: 0.1390053	best: 0.1389002 (376)	total: 4m 59s	remaining: 4m 10s
545:	learn: 0.1453689	test: 0.1389595	best: 0.1389002 (376)	total: 5m	remaining: 4m 9s
546:	learn: 0.1453661	test: 0.1389541	best: 0.1389002 (376)	total: 5m 1s	remaining: 4m 9s
547:	learn: 0.1453630	test: 0.1389597	best: 0.1389002 (376)	total: 5m 1s	remaining: 4m 8s
548:	learn: 0.1453624	test: 0.1389650	best: 0.1389002 (376)	total: 5m 2s	remaining: 4m 8s
549:	learn: 0.1453272	test: 0.1389728	best: 0.1389002 (376)	total: 5m 2s	remaining: 4m 7s
550:	learn: 0.1453272	test: 0.1389729	best: 0.1389002 (376)	total: 5m 3s	remaining: 4m 7s
551:	learn: 0.1453207	test: 0.1389662	best: 0.1389002 (376)	total: 5m 3s	remaining: 4m 6s
552:	learn: 0.1453139	test: 0.1389661	best: 0.1389002 (376)	total: 5m 4s	remaining: 4m 5s
553:	learn: 0.1452368	test: 0.1389722	best: 0.1389002 (376)	total: 5m 5s	remaining: 4m 5s
554:	lear

633:	learn: 0.1443304	test: 0.1387945	best: 0.1387695 (626)	total: 5m 50s	remaining: 3m 22s
634:	learn: 0.1443067	test: 0.1388296	best: 0.1387695 (626)	total: 5m 50s	remaining: 3m 21s
635:	learn: 0.1442955	test: 0.1388184	best: 0.1387695 (626)	total: 5m 51s	remaining: 3m 21s
636:	learn: 0.1442948	test: 0.1388215	best: 0.1387695 (626)	total: 5m 51s	remaining: 3m 20s
637:	learn: 0.1442833	test: 0.1388051	best: 0.1387695 (626)	total: 5m 52s	remaining: 3m 20s
638:	learn: 0.1442739	test: 0.1388149	best: 0.1387695 (626)	total: 5m 53s	remaining: 3m 19s
639:	learn: 0.1442633	test: 0.1388225	best: 0.1387695 (626)	total: 5m 53s	remaining: 3m 19s
640:	learn: 0.1442493	test: 0.1388261	best: 0.1387695 (626)	total: 5m 54s	remaining: 3m 18s
641:	learn: 0.1442056	test: 0.1388201	best: 0.1387695 (626)	total: 5m 54s	remaining: 3m 17s
642:	learn: 0.1441944	test: 0.1388252	best: 0.1387695 (626)	total: 5m 55s	remaining: 3m 17s
643:	learn: 0.1441916	test: 0.1388299	best: 0.1387695 (626)	total: 5m 56s	remain

723:	learn: 0.1430643	test: 0.1387495	best: 0.1387474 (717)	total: 6m 42s	remaining: 2m 33s
724:	learn: 0.1430604	test: 0.1387584	best: 0.1387474 (717)	total: 6m 43s	remaining: 2m 32s
725:	learn: 0.1430512	test: 0.1387335	best: 0.1387335 (725)	total: 6m 44s	remaining: 2m 32s
726:	learn: 0.1430380	test: 0.1387457	best: 0.1387335 (725)	total: 6m 44s	remaining: 2m 32s
727:	learn: 0.1430084	test: 0.1387023	best: 0.1387023 (727)	total: 6m 45s	remaining: 2m 31s
728:	learn: 0.1429966	test: 0.1387251	best: 0.1387023 (727)	total: 6m 46s	remaining: 2m 30s
729:	learn: 0.1429914	test: 0.1387287	best: 0.1387023 (727)	total: 6m 46s	remaining: 2m 30s
730:	learn: 0.1429490	test: 0.1387472	best: 0.1387023 (727)	total: 6m 47s	remaining: 2m 29s
731:	learn: 0.1429448	test: 0.1387476	best: 0.1387023 (727)	total: 6m 47s	remaining: 2m 29s
732:	learn: 0.1428836	test: 0.1387465	best: 0.1387023 (727)	total: 6m 48s	remaining: 2m 28s
733:	learn: 0.1428661	test: 0.1387486	best: 0.1387023 (727)	total: 6m 49s	remain

813:	learn: 0.1419410	test: 0.1387480	best: 0.1386987 (763)	total: 7m 34s	remaining: 1m 43s
814:	learn: 0.1419403	test: 0.1387489	best: 0.1386987 (763)	total: 7m 35s	remaining: 1m 43s
815:	learn: 0.1419293	test: 0.1387507	best: 0.1386987 (763)	total: 7m 36s	remaining: 1m 42s
816:	learn: 0.1419259	test: 0.1387571	best: 0.1386987 (763)	total: 7m 36s	remaining: 1m 42s
817:	learn: 0.1419245	test: 0.1387596	best: 0.1386987 (763)	total: 7m 37s	remaining: 1m 41s
818:	learn: 0.1419241	test: 0.1387615	best: 0.1386987 (763)	total: 7m 37s	remaining: 1m 41s
819:	learn: 0.1419186	test: 0.1387665	best: 0.1386987 (763)	total: 7m 38s	remaining: 1m 40s
820:	learn: 0.1419184	test: 0.1387666	best: 0.1386987 (763)	total: 7m 38s	remaining: 1m 40s
821:	learn: 0.1419162	test: 0.1387637	best: 0.1386987 (763)	total: 7m 39s	remaining: 1m 39s
822:	learn: 0.1419152	test: 0.1387665	best: 0.1386987 (763)	total: 7m 39s	remaining: 1m 38s
823:	learn: 0.1419129	test: 0.1387723	best: 0.1386987 (763)	total: 7m 40s	remain

903:	learn: 0.1411996	test: 0.1388787	best: 0.1386987 (763)	total: 8m 25s	remaining: 53.7s
904:	learn: 0.1411987	test: 0.1388666	best: 0.1386987 (763)	total: 8m 26s	remaining: 53.1s
905:	learn: 0.1411740	test: 0.1388914	best: 0.1386987 (763)	total: 8m 26s	remaining: 52.6s
906:	learn: 0.1411623	test: 0.1388933	best: 0.1386987 (763)	total: 8m 27s	remaining: 52s
907:	learn: 0.1411621	test: 0.1388941	best: 0.1386987 (763)	total: 8m 27s	remaining: 51.5s
908:	learn: 0.1411608	test: 0.1388969	best: 0.1386987 (763)	total: 8m 28s	remaining: 50.9s
909:	learn: 0.1411606	test: 0.1388939	best: 0.1386987 (763)	total: 8m 29s	remaining: 50.3s
910:	learn: 0.1411212	test: 0.1389058	best: 0.1386987 (763)	total: 8m 29s	remaining: 49.8s
911:	learn: 0.1411047	test: 0.1389358	best: 0.1386987 (763)	total: 8m 30s	remaining: 49.2s
912:	learn: 0.1410950	test: 0.1389286	best: 0.1386987 (763)	total: 8m 30s	remaining: 48.7s
913:	learn: 0.1410928	test: 0.1389204	best: 0.1386987 (763)	total: 8m 31s	remaining: 48.1s
...

994:	learn: 0.1401220	test: 0.1388882	best: 0.1386987 (763)	total: 9m 18s	remaining: 2.81s
995:	learn: 0.1401206	test: 0.1388876	best: 0.1386987 (763)	total: 9m 19s	remaining: 2.25s
996:	learn: 0.1401031	test: 0.1388953	best: 0.1386987 (763)	total: 9m 19s	remaining: 1.68s
997:	learn: 0.1400862	test: 0.1389271	best: 0.1386987 (763)	total: 9m 20s	remaining: 1.12s
998:	learn: 0.1400827	test: 0.1389350	best: 0.1386987 (763)	total: 9m 20s	remaining: 561ms
999:	learn: 0.1400826	test: 0.1389358	best: 0.1386987 (763)	total: 9m 21s	remaining: 0us

bestTest = 0.1386986824
bestIteration = 763

Shrink model to first 764 iterations.


<catboost.core.CatBoostClassifier at 0x7f6dd6cdf9e8>

In [20]:
auc_value = 0.9849756977745989
grader.submit_tag('auc_550', auc_value)

Current answer for task auc_550 is: 0.9849756977745989


## Feature importances

Now we will learn how to tell which features are the most important ones. Let's first train a model that does not use feature combinations. To forbid feature combinations, set `max_ctr_complexity=1`. This speeds up training considerably, but it reduces the resulting quality.

In [97]:
from catboost import CatBoostClassifier
model = CatBoostClassifier(
    iterations=300,
    max_ctr_complexity=4,  # allows feature combinations (the Question 13 setting); use 1 to forbid them
    random_seed=43
)
model.fit(
    X, y,
    cat_features=cat_features,
    verbose=50
)

0:	learn: 0.5454508	total: 552ms	remaining: 2m 45s
50:	learn: 0.1562605	total: 24.9s	remaining: 2m 1s
100:	learn: 0.1493646	total: 51s	remaining: 1m 40s
150:	learn: 0.1451569	total: 1m 20s	remaining: 1m 18s
200:	learn: 0.1432905	total: 1m 48s	remaining: 53.4s
250:	learn: 0.1417401	total: 2m 17s	remaining: 26.9s
299:	learn: 0.1402619	total: 2m 45s	remaining: 0us


<catboost.core.CatBoostClassifier at 0x7f6dd6c85160>

Let's see which features are most important for the model without feature combinations.

In [98]:
importances = model.get_feature_importance(prettified=True)
print(importances)

[('RESOURCE', 24.73509920590257), ('MGR_ID', 17.449161787258667), ('ROLE_DEPTNAME', 15.316223709876839), ('ROLE_ROLLUP_2', 11.490154799409593), ('ROLE_TITLE', 10.71183545081703), ('ROLE_FAMILY_DESC', 8.946143168072846), ('ROLE_FAMILY', 4.379723768290924), ('ROLE_CODE', 3.772023536810539), ('ROLE_ROLLUP_1', 3.199634573560977)]
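
The printed list can be reduced to the comma-separated format the grader expects. The sketch below copies the (name, importance) pairs from the output above and assumes `get_feature_importance(prettified=True)` returns a list of tuples, as it does in this run; other CatBoost versions may return a different structure, in which case the sorting step would need adjusting. The helper name `top_features` is hypothetical.

```python
# (feature, importance) pairs copied from the output above.
importances = [
    ('RESOURCE', 24.73509920590257),
    ('MGR_ID', 17.449161787258667),
    ('ROLE_DEPTNAME', 15.316223709876839),
    ('ROLE_ROLLUP_2', 11.490154799409593),
    ('ROLE_TITLE', 10.71183545081703),
    ('ROLE_FAMILY_DESC', 8.946143168072846),
    ('ROLE_FAMILY', 4.379723768290924),
    ('ROLE_CODE', 3.772023536810539),
    ('ROLE_ROLLUP_1', 3.199634573560977),
]


def top_features(pairs, k=3):
    """Return the names of the k most important features, joined by commas."""
    # Sort explicitly by importance instead of relying on the input order.
    ranked = sorted(pairs, key=lambda pair: pair[1], reverse=True)
    return ','.join(name for name, _ in ranked[:k])


top3 = top_features(importances)
print(top3)  # RESOURCE,MGR_ID,ROLE_DEPTNAME
```

Sorting explicitly keeps the helper correct even if the pairs arrive unsorted.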


**Question 13:**

Try training the model without the restriction on combinations, with the other parameters set to the same values.
What are the top 3 most important features for this model?

In [21]:
top3 = 'RESOURCE,MGR_ID,ROLE_DEPTNAME' # You should provide comma separated list of strings. Each string should be in single quotes. All list should be in square brackets.
grader.submit_tag('feature_importance_top3', top3)

Current answer for task feature_importance_top3 is: RESOURCE,MGR_ID,ROLE_DEPTNAME


## Shap values

Let's train the model one more time.

In [99]:
from catboost import CatBoostClassifier
model = CatBoostClassifier(
    iterations=300,
    max_ctr_complexity=1,
    random_seed=43
)
model.fit(
    X, y,
    cat_features=cat_features,
    verbose=50
)

0:	learn: 0.5443376	total: 253ms	remaining: 1m 15s
50:	learn: 0.1711369	total: 16.4s	remaining: 1m 20s
100:	learn: 0.1671705	total: 33.5s	remaining: 1m 6s
150:	learn: 0.1649220	total: 53.3s	remaining: 52.6s
200:	learn: 0.1632912	total: 1m 12s	remaining: 35.8s
250:	learn: 0.1622900	total: 1m 32s	remaining: 18.1s
299:	learn: 0.1613767	total: 1m 51s	remaining: 0us


<catboost.core.CatBoostClassifier at 0x7f6dd6c859b0>

The library provides a way to understand which features are important for a given object.
Let's take a look at the whole dataset X and analyze the influence of different features on the objects in it.
We will first calculate the importances for each object and then visualize them.

In [100]:
pool1 = Pool(data=X, label=y, cat_features=cat_features)
shap_values = model.get_feature_importance(data=pool1, fstr_type='ShapValues', verbose=10000)
print(shap_values.shape)

Processing trees...
128/300 trees processed	passed time: 96.7ms	remaining time: 130ms sec
300/300 trees processed	passed time: 398ms	remaining time: 0us sec
Processing documents...
128/32769 documents processed	passed time: 93.6ms	remaining time: 23.9s sec
10112/32769 documents processed	passed time: 896ms	remaining time: 2.01s sec
20096/32769 documents processed	passed time: 1.79s	remaining time: 1.13s sec
30080/32769 documents processed	passed time: 2.69s	remaining time: 241ms sec
(32769, 10)


Let's look at the model's prediction for the 0-th object. The raw prediction is not a probability; to get the probability from the raw prediction, compute sigmoid(raw_prediction).

In [101]:
test_objects = [X.iloc[0:1]]

for obj in test_objects:
    print('Probability of class 1 = {:.4f}'.format(model.predict_proba(obj)[0][1]))
    print('Formula raw prediction = {:.4f}'.format(model.predict(obj, prediction_type='RawFormulaVal')[0]))
    print('\n')

Probability of class 1 = 0.9899
Formula raw prediction = 4.5822
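
As a quick check of the relationship between the two numbers printed above, the sigmoid can be applied to the raw value by hand. This is a minimal sketch using only the standard library; the `sigmoid` helper is defined here and is not part of CatBoost.

```python
import math


def sigmoid(raw_prediction):
    """Map a raw (log-odds) prediction to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-raw_prediction))


raw = 4.5822  # raw formula value printed above
probability = sigmoid(raw)
print('Probability of class 1 = {:.4f}'.format(probability))  # 0.9899
```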




The sum of all SHAP values for an object, including the base value stored in the last column, is equal to the resulting raw formula prediction.
The graph output below shows a base value, which is the same for all objects.
Almost all the features have a positive influence on this object. The biggest step to the right is due to the feature called 'MGR_ID'.
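
The additive structure can be sketched on a single row. The numbers below are made up for illustration and are not taken from the model trained above. The only assumption carried over from the notebook is the layout: one contribution per feature, followed by the base value in the last position (which is why `shap_values` has 10 columns for 9 features).

```python
# Hypothetical SHAP row for one object, restricted to three features.
feature_names = ['RESOURCE', 'MGR_ID', 'ROLE_DEPTNAME']
shap_row = [0.5, -1.5, 0.25, 3.0]  # made-up values; the last entry is the base value

contributions = shap_row[:-1]
base_value = shap_row[-1]

# Additivity: the base value plus all contributions gives the raw prediction.
raw_prediction = base_value + sum(contributions)

# The most influential feature is the one with the largest absolute contribution.
top_index = max(range(len(contributions)), key=lambda i: abs(contributions[i]))
top_feature = feature_names[top_index]
influence_sign = 1 if contributions[top_index] > 0 else -1

print(raw_prediction, top_feature, influence_sign)
```

On a force plot, positive contributions push the prediction to the right of the base value and negative ones push it to the left.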

In [135]:
import shap
shap.initjs()
shap.force_plot(shap_values[91,-1], shap_values[91,:-1], X.iloc[91,:])

**Question 14:**

What is the most important feature for the 91-st object?

In [22]:
most_important_feature = 'RESOURCE'
grader.submit_tag('most_important', most_important_feature)

Current answer for task most_important is: RESOURCE


**Question 15:**

Does it have positive or negative influence? Answer 1 if positive and -1 if negative.

In [23]:
influence_sign = -1
grader.submit_tag('shap_influence', influence_sign)

Current answer for task shap_influence is: -1


You can also view aggregated information about the influences on the whole dataset.

In [138]:
# Drop the last column (the base value) so the matrix matches the 9 feature columns of X.
shap.summary_plot(shap_values[:, :-1], X)

From this graph you can see that the values of the MGR_ID and RESOURCE features have a large negative impact for many objects.
You can also see that RESOURCE has the largest positive impact for many objects.

## Saving the model

You can save your model as a binary file. It is also possible to save the model as Python or C++ code.
If you save the model as a binary file, you can later inspect the parameters the model was trained with, including learning_rate and random_seed, which are set automatically if you don't specify them.

In [34]:
my_best_model = CatBoostClassifier(iterations=10)
my_best_model.fit(
    X_train, y_train,
    eval_set=(X_validation, y_validation),
    cat_features=cat_features,
    verbose=False
)
my_best_model.save_model('catboost_model.bin')

In [35]:
my_best_model.load_model('catboost_model.bin')
print(my_best_model.get_params())
print(my_best_model.random_seed_)
print(my_best_model.learning_rate_)

{'loss_function': 'Logloss', 'iterations': 10, 'logging_level': 'Silent', 'verbose': 0}
0
0.5


## Hyperparameter tuning

You can tune the parameters to get better speed or better quality.
Here is the list of parameters that are important for speed and accuracy.

### Training speed

Here is the list of parameters that are important for speeding up the training.
Note that changing these parameters might decrease the quality.
1. iterations + learning rate
By default we train for 1000 iterations. You can decrease this number, but if you do, you need to increase the learning rate so that the process still converges. By default the learning rate is set depending on the number of iterations and on your dataset, so you can simply use the default. If you want to tune it, keep in mind that the more iterations you have, the lower the learning rate should be.

2. boosting_type
By default we use Ordered boosting for smaller datasets, where we want to fight overfitting. This is computationally expensive. You can set boosting_type to Plain to disable it.

3. bootstrap_type
By default we sample weights from an exponential distribution. Sampling from a Bernoulli distribution is faster. To enable it, use bootstrap_type='Bernoulli' together with subsample={some value < 1}.

4. one_hot_max_size
By default we use one-hot encoding only for categorical features with a small number of distinct values. For all other categorical features we calculate statistics. This is expensive, while one-hot encoding is cheap, so you can speed up the training by setting one_hot_max_size to a larger value.

5. rsm
This parameter is very important because it speeds up the training and does not affect the quality. You should definitely use it, but only if you have hundreds of features.
If you have only a few features, it is better not to use this parameter.
If you have many features, the rule is the following: decrease rsm, for example to rsm=0.1. With this value the training needs more iterations to converge, usually about 20% more, but each iteration will be 10x faster. The overall training time will therefore be shorter even though the resulting model has more trees.

6. leaf_estimation_iterations
This parameter controls the calculation of leaf values after the tree structure has been selected.
If you have only a few features, for example 8 or 10, this step becomes the bottleneck.
The default value depends on the training objective. You can try setting it to 1 or 5; with a small number of features this might speed up the training.

7. max_ctr_complexity
By default CatBoost generates categorical feature combinations in a greedy way.
This is time consuming. You can disable it by setting max_ctr_complexity=1, or allow only combinations of 2 features by setting max_ctr_complexity=2.
This speeds up the training only if you have categorical features.

8. If you are training the model on GPU, you can try decreasing border_count. This is the number of splits considered for each feature. By default it is set to 128, but you can try setting it to 32. In many cases this will not degrade the quality of the model and will speed up the training considerably.
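
The trade-off described for rsm can be worked through as simple arithmetic. The sketch below uses the figures quoted in the list (each iteration 10x faster, about 20% more iterations); the helper name `estimated_speedup` is hypothetical, and the result is a rough estimate rather than a measured value.

```python
def estimated_speedup(per_iteration_speedup, extra_iterations_fraction):
    """Overall speedup when iterations get faster but more of them are needed."""
    return per_iteration_speedup / (1.0 + extra_iterations_fraction)


# Figures from the rsm item above: 10x faster iterations, about 20% more of them.
rsm_speedup = estimated_speedup(10, 0.20)
print(round(rsm_speedup, 2))  # 8.33
```

The same formula applies to any parameter that trades more iterations for cheaper ones.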

In [45]:
from catboost import CatBoostClassifier
fast_model = CatBoostClassifier(
    random_seed=63,
    iterations=150,
    learning_rate=0.01,
    boosting_type='Plain',
    bootstrap_type='Bernoulli',
    subsample=0.5,
    one_hot_max_size=20,
    rsm=0.5,
    leaf_estimation_iterations=5,
    max_ctr_complexity=1,
    border_count=32,
    eval_metric='AUC')

fast_model.fit(
    X_train, y_train,
    cat_features=cat_features,
    logging_level='Silent',
    plot=True,
    eval_set=(X_validation, y_validation)
)

A Jupyter Widget

<catboost.core.CatBoostClassifier at 0x7fa83207f4a8>

**Question 16:**

Try tuning the speed of the algorithm. What is the maximum speedup you can get by changing these parameters without decreasing the AUC on the best iteration on the eval dataset, compared to the AUC on the best iteration after training with default parameters and random seed = 0?
The answer should be a number; for example, 2.7 means you got a 2.7x speedup.
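
One way to obtain the speedup number is to time both training runs and divide. The sketch below is only an illustration: `baseline_training` and `fast_training` are hypothetical stand-ins that sleep instead of training, and in the notebook they would be replaced by functions that call `fit` on the default model and on the tuned model.

```python
import time


def measure_seconds(train_fn):
    """Run train_fn once and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    train_fn()
    return time.perf_counter() - start


def speedup(baseline_fn, candidate_fn):
    """How many times faster candidate_fn is than baseline_fn."""
    return measure_seconds(baseline_fn) / measure_seconds(candidate_fn)


# Hypothetical stand-ins for the two training runs.
def baseline_training():
    time.sleep(0.2)


def fast_training():
    time.sleep(0.02)


ratio = speedup(baseline_training, fast_training)
print('speedup: {:.1f}x'.format(ratio))
```

The ratio only counts as an answer if the AUC on the best iteration of the faster model is not lower than that of the baseline, so both values should be compared before the speedup is reported.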

In [47]:
speedup = 18
grader.submit_tag('speedup', speedup)

Current answer for task speedup is: 18


### Accuracy

The parameters listed below are important for getting the best model quality. Try changing these parameters to improve the quality of the resulting model.

In [50]:
tunned_model = CatBoostClassifier(
    random_seed=63,
    iterations=1000,
    learning_rate=0.03,
    l2_leaf_reg=3,
    bagging_temperature=1,
    random_strength=1,
    one_hot_max_size=2,
    leaf_estimation_method='Newton',
    depth=6
)
tunned_model.fit(
    X_train, y_train,
    cat_features=cat_features,
    logging_level='Silent',
    eval_set=(X_validation, y_validation),
    plot=True
)

A Jupyter Widget

KeyboardInterrupt: 

In [57]:
tunned_model = CatBoostClassifier(
    random_seed=63,
    iterations=200,
    learning_rate=0.03,
    l2_leaf_reg=3,
    bagging_temperature=1,
    random_strength=1,
    one_hot_max_size=2,
    leaf_estimation_method='Newton',
    depth=6,
    eval_metric='AUC'
)
tunned_model.fit(
    X_train, y_train,
    cat_features=cat_features,
    logging_level='Silent',
    eval_set=(X_validation, y_validation),
    plot=True
)

A Jupyter Widget

<catboost.core.CatBoostClassifier at 0x7fa831fa2128>

**Question 17:**

Try tuning these parameters to make the AUC on the eval dataset as large as possible. What is the maximum AUC value you have reached?

In [58]:
final_auc = 0.9007342
grader.submit_tag('final_auc', final_auc)

Current answer for task final_auc is: 0.9007342


In [59]:
STUDENT_EMAIL = '<your email>'  # EMAIL HERE
STUDENT_TOKEN = '<your token>'  # TOKEN HERE
grader.status()

You want to submit these numbers:
Task negative_samples: 1897
Task positive_samples: 30872
Task resource_unique_values: 7518
Task logloss_mean: 0.138180847126
Task logloss_std: 0.000777781825619
Task accuracy_6: 0.9539213
Task best_model_name: learning_rate_0.05
Task num_trees: 100
Task mean_logloss_cv: 0.1409
Task logloss_std_1: 0.0056
Task iterations_overfitting: 64
Task auc_550: 0.9849756977745989
Task feature_importance_top3: RESOURCE,MGR_ID,ROLE_DEPTNAME
Task most_important: RESOURCE
Task shap_influence: -1
Task speedup: 18
Task final_auc: 0.9007342


In [60]:
grader.submit(STUDENT_EMAIL, STUDENT_TOKEN)

Submitted to Coursera platform. See results on assignment page!
