## Advanced options for multioutput handling

### Imports

In [34]:
import os
# set the device to run
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

os.makedirs('data', exist_ok=True)

import pandas as pd
from pandas import Series, DataFrame

from sklearn.model_selection import train_test_split

from py_boost import GradientBoosting 

# strategies to deal with multiple outputs
from py_boost.multioutput.sketching import RandomSamplingSketch
from py_boost.multioutput.target_splitter import RandomGroupsSplitter

### Download data from openml

In [2]:
!wget https://www.openml.org/data/get_csv/19335692/file1c556677f875.csv -O data/helena.csv

--2021-11-10 21:58:10--  https://www.openml.org/data/get_csv/19335692/file1c556677f875.csv
Resolving www.openml.org (www.openml.org)... 131.155.11.11
Connecting to www.openml.org (www.openml.org)|131.155.11.11|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: ‘helena.csv’

helena.csv              [        <=>         ]  14.56M  9.87MB/s    in 1.5s    

2021-11-10 21:58:12 (9.87 MB/s) - ‘helena.csv’ saved [15271704]



In [3]:
data = pd.read_csv('data/helena.csv')

data

Unnamed: 0,class,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27
0,41,0.005521,0.080556,0.110417,0.490822,0.586406,0.066414,0.092206,0.116352,0.379310,...,-0.342986,78.6894,17.237800,21.504200,14.43730,17.378000,9.61674,-0.609370,1.044830,1.481790
1,92,0.049398,0.147917,0.541667,0.542865,0.515608,0.105128,0.475550,0.049555,0.383460,...,2.639370,59.7879,5.393410,3.819610,11.49240,3.929470,5.91423,1.409210,4.749540,1.103820
2,24,0.548663,1.000000,1.000000,0.397029,0.627398,1.023440,1.004220,0.027381,0.451337,...,0.137427,58.1429,-3.365980,-0.037489,10.63470,2.660180,3.93377,-0.898220,2.137790,1.054470
3,29,0.023073,0.206250,0.238889,0.622998,0.764067,0.202599,0.177892,0.071232,0.531712,...,0.477009,55.4798,-1.051090,-4.755360,13.36710,2.852060,9.65162,0.224397,-0.220216,-0.273287
4,91,0.224427,0.433333,0.902083,0.814199,0.576879,0.344413,0.822975,0.026121,0.425875,...,0.521306,76.8475,-19.371700,32.270700,9.41442,4.343450,8.67710,-1.587580,1.117870,-0.545338
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
65191,88,0.007292,0.152083,0.061111,0.114431,0.406104,0.143170,0.053086,0.129365,0.215442,...,1.265300,53.2951,-1.416430,2.173900,13.66950,1.588520,2.02855,0.619052,0.622377,-0.363035
65192,77,0.411279,1.000000,0.430556,0.503805,0.207163,1.003740,0.412067,0.017673,0.044771,...,-2.842380,91.1178,-0.009207,-2.224830,1.30504,0.898489,1.80362,-2.726140,-0.184329,-0.476441
65193,71,0.999352,1.000000,1.000000,0.501360,0.501075,0.999384,0.999414,0.009636,0.000648,...,0.213472,84.4141,2.042450,13.849800,7.24428,1.443890,4.00495,-0.749115,1.025130,0.096257
65194,24,0.206175,0.383333,0.944444,0.749915,0.550936,0.292477,0.830921,0.033542,0.430515,...,0.879472,61.4110,17.354200,5.566660,16.22600,10.049400,6.04195,0.400956,0.375599,0.644575


In [26]:
data['class'].value_counts()

78    4005
55    3063
40    2992
39    2623
38    2216
      ... 
56     121
75     121
32     119
34     116
10     111
Name: class, Length: 100, dtype: int64

#### Looks like we have 100 classes here...

In [4]:
X = data.drop('class', axis=1).values.astype('float32')
y = data['class'].values.astype('int32')

X, X_test, y, y_test = train_test_split(X, y, test_size=0.2, random_state=42, shuffle=True)

### Train multiclass model

A simple multiclass use case looks the same as regression. By default, py_boost builds multioutput trees to handle multioutput problems: a single tree outputs a vector of length 100 for a 100-class task.


In [17]:
%%time
model = GradientBoosting('crossentropy',
                         ntrees=10000, lr=0.03, verbose=100, es=300, lambda_l2=1,
                         subsample=1, colsample=1, min_data_in_leaf=10,
                         max_bin=256, max_depth=6)

model.fit(X, y, 
          eval_sets = [{'X': X_test, 'y': y_test}])

[22:09:40] Stdout logging level is INFO.
[22:09:40] GDBT train starts. Max iter 10000, early stopping rounds 300
[22:09:40] Iter 0; Sample 0, Crossentropy = 4.304827220079593; 
[22:09:44] Iter 100; Sample 0, Crossentropy = 2.8611263531023345; 
[22:09:49] Iter 200; Sample 0, Crossentropy = 2.728235614034673; 
[22:09:53] Iter 300; Sample 0, Crossentropy = 2.6778626299315684; 
[22:09:58] Iter 400; Sample 0, Crossentropy = 2.6502818506254036; 
[22:10:02] Iter 500; Sample 0, Crossentropy = 2.63424278326861; 
[22:10:07] Iter 600; Sample 0, Crossentropy = 2.623488614688844; 
[22:10:12] Iter 700; Sample 0, Crossentropy = 2.615064449096285; 
[22:10:16] Iter 800; Sample 0, Crossentropy = 2.6071642743631136; 
[22:10:21] Iter 900; Sample 0, Crossentropy = 2.600947677453085; 
[22:10:25] Iter 1000; Sample 0, Crossentropy = 2.596409021535033; 
[22:10:30] Iter 1100; Sample 0, Crossentropy = 2.5919130667358017; 
[22:10:35] Iter 1200; Sample 0, Crossentropy = 2.5883597160700273; 
[22:10:39] Iter 1300; S

<py_boost.gpu.boosting.GradientBoosting at 0x7efd4b393610>

In [6]:
%%time
pred = model.predict(X_test)

pred.shape

CPU times: user 1.16 s, sys: 664 ms, total: 1.83 s
Wall time: 1.83 s


(13040, 100)
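
The prediction matrix has one column per class. Below is a minimal sketch of turning such a matrix into labels and scoring it, assuming (as the `crossentropy` loss suggests) that each row holds class probabilities and that the column index equals the class label. The toy arrays `toy_pred` and `toy_true` are made up for illustration and stand in for `pred` and `y_test`.

```python
import numpy as np

# Toy stand-ins for `pred` and `y_test`: 4 samples, 3 classes (made-up values).
toy_pred = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.2, 0.3, 0.5],
    [0.5, 0.4, 0.1],
], dtype=np.float32)
toy_true = np.array([0, 1, 2, 1], dtype=np.int32)

# Predicted label = index of the largest value in each row.
toy_labels = toy_pred.argmax(axis=1)

# Accuracy = share of rows where the predicted label matches the true one.
accuracy = float((toy_labels == toy_true).mean())

# Multiclass cross-entropy = mean negative log-probability of the true class.
eps = 1e-15
true_class_prob = toy_pred[np.arange(len(toy_true)), toy_true]
crossentropy = float(-np.log(np.clip(true_class_prob, eps, 1.0)).mean())

print(toy_labels, accuracy, crossentropy)
```

On the real data the same lines should work with `pred` and `y_test` in place of the toy arrays, provided the assumptions above hold.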

### Sketching strategies to speedup training

Training multioutput models takes more time than training single-output ones. That's why we implement a few strategies to simplify the tree structure search via gradient matrix sketching:

* ***RandomSamplingSketch*** (recommended)
* ***TopOutputsSketch***
* ***SVDSketch*** (needs RAPIDS (cuml) to be installed)
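
To give an intuition for what sketching means here: the tree structure is searched on a reduced gradient matrix that has far fewer columns than there are outputs. The NumPy code below is a conceptual illustration only, not py_boost's implementation. The helper names `top_outputs_sketch` and `random_sampling_sketch` are hypothetical, and using column norms as importance weights is an assumption.

```python
import numpy as np

def top_outputs_sketch(grad, k):
    """Keep the k columns with the largest Euclidean norm (hypothetical helper)."""
    norms = np.linalg.norm(grad, axis=0)
    top = np.argsort(norms)[::-1][:k]
    return grad[:, top]

def random_sampling_sketch(grad, k, rng):
    """Sample k columns with probability proportional to the squared
    column norm (hypothetical helper; the weighting is an assumption)."""
    weights = (grad ** 2).sum(axis=0)
    probs = weights / weights.sum()
    idx = rng.choice(grad.shape[1], size=k, replace=True, p=probs)
    return grad[:, idx]

rng = np.random.default_rng(0)
grad = rng.normal(size=(1000, 100))  # toy gradient matrix: 1000 rows, 100 outputs

# Both sketches reduce 100 gradient columns to 10.
print(top_outputs_sketch(grad, 10).shape)
print(random_sampling_sketch(grad, 10, rng).shape)
```

Split search then scans 10 columns per candidate split instead of 100, which is where a speedup would come from.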

Let's check how it works.

In [25]:
%%time

sketch = RandomSamplingSketch(10)

model = GradientBoosting('crossentropy',
                         ntrees=10000, lr=0.03, verbose=100, es=300, lambda_l2=1,
                         subsample=1, colsample=1, min_data_in_leaf=10,
                         max_bin=256, max_depth=6,
                         multioutput_sketch=sketch
                        )

model.fit(X, y, eval_sets = [{'X': X_test, 'y': y_test}])

[22:20:12] Stdout logging level is INFO.
[22:20:12] GDBT train starts. Max iter 10000, early stopping rounds 300
[22:20:12] Iter 0; Sample 0, Crossentropy = 4.366729785003158; 
[22:20:14] Iter 100; Sample 0, Crossentropy = 2.900768619314183; 
[22:20:15] Iter 200; Sample 0, Crossentropy = 2.760633129321527; 
[22:20:17] Iter 300; Sample 0, Crossentropy = 2.700729209640207; 
[22:20:18] Iter 400; Sample 0, Crossentropy = 2.666901549547681; 
[22:20:20] Iter 500; Sample 0, Crossentropy = 2.6433972865472493; 
[22:20:21] Iter 600; Sample 0, Crossentropy = 2.6286164429689336; 
[22:20:23] Iter 700; Sample 0, Crossentropy = 2.6165641306630825; 
[22:20:24] Iter 800; Sample 0, Crossentropy = 2.6068343569429606; 
[22:20:26] Iter 900; Sample 0, Crossentropy = 2.598758513549033; 
[22:20:27] Iter 1000; Sample 0, Crossentropy = 2.5918135165170693; 
[22:20:29] Iter 1100; Sample 0, Crossentropy = 2.587314895990852; 
[22:20:30] Iter 1200; Sample 0, Crossentropy = 2.5825381641023717; 
[22:20:32] Iter 1300; 

<py_boost.gpu.boosting.GradientBoosting at 0x7efd18392880>

In [20]:
pred = model.predict(X_test)
pred.shape

(13040, 100)

#### Nice speedup, and sometimes even better accuracy!

#### It's even faster than CatBoost with a close-to-identical setup :)

In [24]:
%%time
from catboost import CatBoostClassifier

model = CatBoostClassifier(
    grow_policy='Depthwise',
    bootstrap_type='Bernoulli',
    subsample=1.,
    border_count=256,
    iterations=10000,
    od_wait=300,
    max_depth=6,
    devices='0:0',
    learning_rate=0.03,
    l2_leaf_reg=1,
    min_data_in_leaf=10,
    score_function='L2',
    model_shrink_mode='Constant',
    task_type='GPU',
    verbose=100,
)

model.fit(X, y, eval_set = (X_test, y_test))    

0:	learn: 4.2940514	test: 4.3045907	best: 4.3045907 (0)	total: 44.4ms	remaining: 7m 24s
100:	learn: 2.7026061	test: 2.8639819	best: 2.8639819 (100)	total: 3.76s	remaining: 6m 8s
200:	learn: 2.4526127	test: 2.7287550	best: 2.7287550 (200)	total: 7.26s	remaining: 5m 54s
300:	learn: 2.3213709	test: 2.6789284	best: 2.6789284 (300)	total: 10.6s	remaining: 5m 42s
400:	learn: 2.2232475	test: 2.6516266	best: 2.6516266 (400)	total: 13.9s	remaining: 5m 33s
500:	learn: 2.1449561	test: 2.6351179	best: 2.6351179 (500)	total: 17.1s	remaining: 5m 25s
600:	learn: 2.0773091	test: 2.6237368	best: 2.6237368 (600)	total: 20.3s	remaining: 5m 17s
700:	learn: 2.0146066	test: 2.6149354	best: 2.6149321 (699)	total: 23.5s	remaining: 5m 11s
800:	learn: 1.9531132	test: 2.6072940	best: 2.6072841 (799)	total: 26.6s	remaining: 5m 5s
900:	learn: 1.8969742	test: 2.6016194	best: 2.6016194 (900)	total: 29.8s	remaining: 5m 1s
1000:	learn: 1.8410815	test: 2.5963370	best: 2.5963370 (1000)	total: 33s	remaining: 4m 56s
1100:

<catboost.core.CatBoostClassifier at 0x7efd1837f970>

### Target splitting strategies

As mentioned above, py_boost builds a single tree for all outputs. This is usually the recommended strategy, but there are other options. Possible strategies:

* ***'Single'*** is the default strategy
* ***'OneVsAll'*** (not recommended) builds a separate tree for each output
* ***RandomGroupsSplitter*** (experimental) randomly groups the outputs at each iteration and builds a tree per group. For example, you can build 3 trees predicting 100 classes
* ***Custom*** lets you build your own strategy to split the outputs into groups
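
The random grouping idea can be sketched in a few lines of NumPy. This is a conceptual illustration under stated assumptions, not py_boost's code: the helper `random_groups` is hypothetical, and the near-equal group sizes are an assumption about how outputs would be divided.

```python
import numpy as np

def random_groups(n_outputs, n_groups, rng):
    """Randomly partition output indices into n_groups groups of
    nearly equal size (hypothetical helper)."""
    shuffled = rng.permutation(n_outputs)
    return [np.sort(g) for g in np.array_split(shuffled, n_groups)]

rng = np.random.default_rng(42)

# A fresh partition is drawn at every boosting iteration;
# one tree would then be built per group.
for iteration in range(2):
    groups = random_groups(100, 3, rng)
    print(iteration, [len(g) for g in groups])
```

With 100 outputs and 3 groups, each iteration yields groups of 34, 33 and 33 outputs, and every output belongs to exactly one group.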

#### Now let's try RandomGroupsSplitter together with sketching

In [32]:
%%time

splitter = RandomGroupsSplitter(3)
sketch = RandomSamplingSketch(10)

model = GradientBoosting('crossentropy',
                         ntrees=10000, lr=0.03, verbose=100, es=300, lambda_l2=1,
                         subsample=1, colsample=1, min_data_in_leaf=10,
                         max_bin=256, max_depth=6,
                         multioutput_sketch=sketch,
                         target_splitter=splitter
                        )

model.fit(X, y, eval_sets = [{'X': X_test, 'y': y_test}])

[22:46:05] Stdout logging level is INFO.
[22:46:05] GDBT train starts. Max iter 10000, early stopping rounds 300
[22:46:06] Iter 0; Sample 0, Crossentropy = 4.326980387631079; 
[22:46:09] Iter 100; Sample 0, Crossentropy = 2.8532873930758953; 
[22:46:13] Iter 200; Sample 0, Crossentropy = 2.7270546642112574; 
[22:46:17] Iter 300; Sample 0, Crossentropy = 2.67468653221888; 
[22:46:21] Iter 400; Sample 0, Crossentropy = 2.646447996475138; 
[22:46:25] Iter 500; Sample 0, Crossentropy = 2.6275378505210005; 
[22:46:29] Iter 600; Sample 0, Crossentropy = 2.614286675159497; 
[22:46:33] Iter 700; Sample 0, Crossentropy = 2.6037103730812103; 
[22:46:38] Iter 800; Sample 0, Crossentropy = 2.596234914516479; 
[22:46:42] Iter 900; Sample 0, Crossentropy = 2.5894875879047965; 
[22:46:46] Iter 1000; Sample 0, Crossentropy = 2.584543545871069; 
[22:46:50] Iter 1100; Sample 0, Crossentropy = 2.5801520899776977; 
[22:46:54] Iter 1200; Sample 0, Crossentropy = 2.57597971092585; 
[22:46:58] Iter 1300; Sa

<py_boost.gpu.boosting.GradientBoosting at 0x7efb850701c0>

#### Didn't improve the score, but at least we could try :)