## Advanced options for multioutput handling

### Imports

In [1]:
import os
# set the device to run
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

os.makedirs('../data', exist_ok=True)

import pandas as pd
from pandas import Series, DataFrame

from sklearn.model_selection import train_test_split

from py_boost import GradientBoosting, SketchBoost

# strategies to deal with multiple outputs
from py_boost.multioutput.sketching import *
from py_boost.multioutput.target_splitter import *

### Downloading data from OpenML

In [2]:
!wget https://www.openml.org/data/get_csv/19335692/file1c556677f875.csv -O ../data/helena.csv

--2023-03-20 14:27:27--  https://www.openml.org/data/get_csv/19335692/file1c556677f875.csv
Resolving www.openml.org (www.openml.org)... 131.155.11.11
Connecting to www.openml.org (www.openml.org)|131.155.11.11|:443... connected.
HTTP request sent, awaiting response... 307 Temporary Redirect
Location: https://api.openml.org/data/get_csv/19335692/file1c556677f875.csv [following]
--2023-03-20 14:27:27--  https://api.openml.org/data/get_csv/19335692/file1c556677f875.csv
Resolving api.openml.org (api.openml.org)... 131.155.11.11
Connecting to api.openml.org (api.openml.org)|131.155.11.11|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: ‘../data/helena.csv’

../data/helena.csv      [      <=>           ]  14.56M  11.5MB/s    in 1.3s    

2023-03-20 14:27:28 (11.5 MB/s) - ‘../data/helena.csv’ saved [15271704]



In [3]:
data = pd.read_csv('../data/helena.csv')

data

Unnamed: 0,class,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27
0,41,0.005521,0.080556,0.110417,0.490822,0.586406,0.066414,0.092206,0.116352,0.379310,...,-0.342986,78.6894,17.237800,21.504200,14.43730,17.378000,9.61674,-0.609370,1.044830,1.481790
1,92,0.049398,0.147917,0.541667,0.542865,0.515608,0.105128,0.475550,0.049555,0.383460,...,2.639370,59.7879,5.393410,3.819610,11.49240,3.929470,5.91423,1.409210,4.749540,1.103820
2,24,0.548663,1.000000,1.000000,0.397029,0.627398,1.023440,1.004220,0.027381,0.451337,...,0.137427,58.1429,-3.365980,-0.037489,10.63470,2.660180,3.93377,-0.898220,2.137790,1.054470
3,29,0.023073,0.206250,0.238889,0.622998,0.764067,0.202599,0.177892,0.071232,0.531712,...,0.477009,55.4798,-1.051090,-4.755360,13.36710,2.852060,9.65162,0.224397,-0.220216,-0.273287
4,91,0.224427,0.433333,0.902083,0.814199,0.576879,0.344413,0.822975,0.026121,0.425875,...,0.521306,76.8475,-19.371700,32.270700,9.41442,4.343450,8.67710,-1.587580,1.117870,-0.545338
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
65191,88,0.007292,0.152083,0.061111,0.114431,0.406104,0.143170,0.053086,0.129365,0.215442,...,1.265300,53.2951,-1.416430,2.173900,13.66950,1.588520,2.02855,0.619052,0.622377,-0.363035
65192,77,0.411279,1.000000,0.430556,0.503805,0.207163,1.003740,0.412067,0.017673,0.044771,...,-2.842380,91.1178,-0.009207,-2.224830,1.30504,0.898489,1.80362,-2.726140,-0.184329,-0.476441
65193,71,0.999352,1.000000,1.000000,0.501360,0.501075,0.999384,0.999414,0.009636,0.000648,...,0.213472,84.4141,2.042450,13.849800,7.24428,1.443890,4.00495,-0.749115,1.025130,0.096257
65194,24,0.206175,0.383333,0.944444,0.749915,0.550936,0.292477,0.830921,0.033542,0.430515,...,0.879472,61.4110,17.354200,5.566660,16.22600,10.049400,6.04195,0.400956,0.375599,0.644575


In [4]:
data['class'].value_counts()

78    4005
55    3063
40    2992
39    2623
38    2216
      ... 
75     121
56     121
32     119
34     116
10     111
Name: class, Length: 100, dtype: int64

Looks like we have 100 classes here ...

In [5]:
X = data.drop('class', axis=1).values.astype('float32')
y = data['class'].values.astype('int32')

X, X_test, y, y_test = train_test_split(X, y, test_size=0.2, random_state=42, shuffle=True)

### Traininig a multiclass model

A simple use case for training a multiclass problem is the same as for regression. By default py_boost builds multioutout trees to handle multioutput problems (single tree outputs a vector of length 100 for 100 class task).


In [6]:
%%time
model = GradientBoosting(
    'crossentropy',
     ntrees=10000, lr=0.03, verbose=100, es=300, lambda_l2=1, gd_steps=1,
     subsample=1, colsample=1, min_data_in_leaf=10, use_hess=True,
     max_bin=256, max_depth=6, debug=True
)

model.fit(X, y, 
          eval_sets = [{'X': X_test, 'y': y_test}])

[14:27:34] Stdout logging level is INFO.
[14:27:34] GDBT train starts. Max iter 10000, early stopping rounds 300
[14:27:35] Iter 0; Sample 0, Crossentropy = 4.285501684682705; 
[14:27:42] Iter 100; Sample 0, Crossentropy = 2.7763018474667405; 
[14:27:49] Iter 200; Sample 0, Crossentropy = 2.6534368930852343; 
[14:27:57] Iter 300; Sample 0, Crossentropy = 2.6120720412816; 
[14:28:05] Iter 400; Sample 0, Crossentropy = 2.591461281805933; 
[14:28:14] Iter 500; Sample 0, Crossentropy = 2.5793298777967135; 
[14:28:22] Iter 600; Sample 0, Crossentropy = 2.5728810045047528; 
[14:28:31] Iter 700; Sample 0, Crossentropy = 2.5679381462019157; 
[14:28:40] Iter 800; Sample 0, Crossentropy = 2.56521209757055; 
[14:28:49] Iter 900; Sample 0, Crossentropy = 2.563091753755386; 
[14:28:57] Iter 1000; Sample 0, Crossentropy = 2.5616368979225217; 
[14:29:06] Iter 1100; Sample 0, Crossentropy = 2.5606824117112787; 
[14:29:15] Iter 1200; Sample 0, Crossentropy = 2.5603412056454866; 
[14:29:23] Iter 1300; S

<py_boost.gpu.boosting.GradientBoosting at 0x7f438d0259d0>

In [7]:
%%time
pred = model.predict(X_test)
pred.shape

CPU times: user 1.13 s, sys: 67.9 ms, total: 1.2 s
Wall time: 1.2 s


(13040, 100)

### Sketching strategies to speedup training

Computational costs of training multioutput models drastically increase when number of output grows. We implemented a few strategies to simplify tree structure search via gradinet matrix sketching:

* ***RandomSamplingSketch*** (recommended for use_hess=True)
* ***RandomProjectionSketch*** (recommended for use_hess=False)
* ***TopOutputsSketch***
* ***SVDSketch*** (needs RAPIDS (cuml) to be installed)

Let us check, how it works.

In [8]:
%%time

sketch = RandomProjectionSketch(1)
# sketch = RandomSamplingSketch(10)
# sketch = TopOutputsSketch(10)
# sketch = SVDSketch(n_components=1)

model = GradientBoosting(
    'crossentropy',
    ntrees=10000, lr=0.03, verbose=100, es=300, lambda_l2=1, gd_steps=1, 
    subsample=1, colsample=1, min_data_in_leaf=10, use_hess=False, 
    max_bin=256, max_depth=6,
    multioutput_sketch=sketch, debug=True
)

model.fit(X, y, eval_sets = [{'X': X_test, 'y': y_test}])

[14:29:52] Stdout logging level is INFO.
[14:29:52] GDBT train starts. Max iter 10000, early stopping rounds 300
[14:29:53] Iter 0; Sample 0, Crossentropy = 4.37780858910313; 
[14:29:54] Iter 100; Sample 0, Crossentropy = 2.8830431798283946; 
[14:29:55] Iter 200; Sample 0, Crossentropy = 2.7344265194452593; 
[14:29:56] Iter 300; Sample 0, Crossentropy = 2.6735993342291535; 
[14:29:58] Iter 400; Sample 0, Crossentropy = 2.639656949046086; 
[14:29:59] Iter 500; Sample 0, Crossentropy = 2.616071220871295; 
[14:30:00] Iter 600; Sample 0, Crossentropy = 2.599227755788996; 
[14:30:01] Iter 700; Sample 0, Crossentropy = 2.58695941504371; 
[14:30:03] Iter 800; Sample 0, Crossentropy = 2.577093448514365; 
[14:30:04] Iter 900; Sample 0, Crossentropy = 2.5705359899350455; 
[14:30:05] Iter 1000; Sample 0, Crossentropy = 2.5652103670154838; 
[14:30:06] Iter 1100; Sample 0, Crossentropy = 2.5596293871441924; 
[14:30:07] Iter 1200; Sample 0, Crossentropy = 2.5561329785264295; 
[14:30:09] Iter 1300; S

<py_boost.gpu.boosting.GradientBoosting at 0x7f435ec0d940>

In [9]:
%%time
pred = model.predict(X_test)
pred.shape

CPU times: user 1.76 s, sys: 76.1 ms, total: 1.84 s
Wall time: 1.84 s


(13040, 100)

### SketchBoost

Alternatively you can use SketchBoost class with built in sketching strateges. Just define number of outputs to keep

In [10]:
%%time
model = SketchBoost(
    'crossentropy',
     ntrees=10000, lr=0.03, verbose=100, es=300, lambda_l2=1, gd_steps=1, 
     subsample=1, colsample=1, min_data_in_leaf=10, 
     max_bin=256, max_depth=6,
)

model.fit(X, y, eval_sets = [{'X': X_test, 'y': y_test}])

[14:30:25] Stdout logging level is INFO.
[14:30:25] GDBT train starts. Max iter 10000, early stopping rounds 300
[14:30:25] Iter 0; Sample 0, Crossentropy = 4.373896924422614; 
[14:30:27] Iter 100; Sample 0, Crossentropy = 2.907183459675343; 
[14:30:28] Iter 200; Sample 0, Crossentropy = 2.751375678378877; 
[14:30:29] Iter 300; Sample 0, Crossentropy = 2.689612458132744; 
[14:30:30] Iter 400; Sample 0, Crossentropy = 2.6544527823729394; 
[14:30:32] Iter 500; Sample 0, Crossentropy = 2.6289818069997537; 
[14:30:33] Iter 600; Sample 0, Crossentropy = 2.6110374028609233; 
[14:30:34] Iter 700; Sample 0, Crossentropy = 2.59649351120474; 
[14:30:35] Iter 800; Sample 0, Crossentropy = 2.58602262416369; 
[14:30:37] Iter 900; Sample 0, Crossentropy = 2.577047973870354; 
[14:30:38] Iter 1000; Sample 0, Crossentropy = 2.5710195922726817; 
[14:30:39] Iter 1100; Sample 0, Crossentropy = 2.565049627730205; 
[14:30:40] Iter 1200; Sample 0, Crossentropy = 2.5607273376650648; 
[14:30:42] Iter 1300; Sam

<py_boost.gpu.sketch_boost.SketchBoost at 0x7f438ce5b820>

In [11]:
%%time
pred = model.predict(X_test)
pred.shape

CPU times: user 1.06 s, sys: 36.1 ms, total: 1.1 s
Wall time: 1.1 s


(13040, 100)

#### We see a nice speed up and sometimes even a better accuracy!

#### These modifications allow us to train a model faster than CatBoost does (with a similar setup).

In [12]:
%%time
from catboost import CatBoostClassifier

model = CatBoostClassifier(
                       grow_policy='Depthwise', 
                       bootstrap_type='Bernoulli',
                       subsample=1.,
                       border_count=256, 
                       iterations=10000, 
                       od_wait=300,
                       max_depth=6, 
                       devices='0:0', 
                       learning_rate=0.03, 
                       l2_leaf_reg=1, 
                       min_data_in_leaf=10, 
                       score_function='L2',
                       model_shrink_mode='Constant',
                       **{'task_type': 'GPU', 'verbose': 100, }
                    )

model.fit(X, y, eval_set = (X_test, y_test))    

0:	learn: 4.2940505	test: 4.3045904	best: 4.3045904 (0)	total: 42.5ms	remaining: 7m 4s
100:	learn: 2.7026067	test: 2.8639825	best: 2.8639825 (100)	total: 3.59s	remaining: 5m 51s
200:	learn: 2.4526129	test: 2.7287550	best: 2.7287550 (200)	total: 6.96s	remaining: 5m 39s
300:	learn: 2.3213712	test: 2.6789284	best: 2.6789284 (300)	total: 10.2s	remaining: 5m 27s
400:	learn: 2.2232472	test: 2.6516266	best: 2.6516266 (400)	total: 13.3s	remaining: 5m 17s
500:	learn: 2.1449561	test: 2.6351179	best: 2.6351179 (500)	total: 16.3s	remaining: 5m 9s
600:	learn: 2.0773086	test: 2.6237371	best: 2.6237371 (600)	total: 19.3s	remaining: 5m 2s
700:	learn: 2.0146061	test: 2.6149348	best: 2.6149321 (699)	total: 22.3s	remaining: 4m 55s
800:	learn: 1.9531141	test: 2.6072940	best: 2.6072838 (799)	total: 25.3s	remaining: 4m 50s
900:	learn: 1.8969740	test: 2.6016194	best: 2.6016194 (900)	total: 28.3s	remaining: 4m 46s
1000:	learn: 1.8410810	test: 2.5963373	best: 2.5963373 (1000)	total: 31.3s	remaining: 4m 41s
110

<catboost.core.CatBoostClassifier at 0x7f43613fc9d0>