## The simplest usage example of py_boost

### Installation (if needed)

In [1]:
# !pip install py-boost

### Imports

In [2]:
import os
# Optional: select the GPU device to run on (set before any CUDA library is imported)
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import joblib
from sklearn.datasets import make_regression

# simple case - just one class is used
from py_boost import GradientBoosting 

### Generate dummy regression data

In [3]:
%%time
X, y = make_regression(150000, 100, n_targets=10, random_state=42)
X_test, y_test = X[:50000], y[:50000]
X, y = X[-50000:], y[-50000:]

CPU times: user 1.96 s, sys: 1.45 s, total: 3.4 s
Wall time: 817 ms


### Train GBM

The only required argument here is the loss function. The type of task is determined by the loss function and the shape of the input target. The loss can be passed as a Loss instance or via a string alias:

* ***'mse'*** for the regression/multitask regression
* ***'msle'*** for the regression/multitask regression
* ***'bce'*** for the binary/multilabel classification
* ***'crossentropy'*** for the multiclass classification
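
Since the task type depends on the target shape as well as the loss, it can help to check the shape before training. The snippet below is a minimal NumPy sketch with made-up data (the array names are illustrative only): a 1-D target corresponds to a single-output task, and a 2-D target with several columns corresponds to a multitask one.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=(6, 3))  # made-up targets: 6 rows, 3 target columns

y_single = y[:, 0]  # 1-D target -> single-output regression with 'mse'
y_multi = y         # 2-D target -> multitask regression with the same loss

print(y_single.shape, y_multi.shape)
```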


Training is done by calling the .fit method. Possible arguments:

* ***'X'*** 
* ***'y'*** 
* ***'sample_weight'*** 
* ***'eval_sets'*** Validation sets are passed as a list of dicts with possible keys ['X', 'y', 'sample_weight']. Note: if multiple validation sets are passed, the best model is selected using the last one
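
The `eval_sets` structure described above can be sketched in plain Python. The helper `make_eval_set` and the placeholder data are hypothetical and not part of py_boost; the sketch only shows the list-of-dicts layout and that the set intended for model selection goes last.

```python
def make_eval_set(X, y, sample_weight=None):
    """Build one validation-set dict with the keys 'X', 'y' and, optionally, 'sample_weight'."""
    eval_set = {'X': X, 'y': y}
    if sample_weight is not None:
        eval_set['sample_weight'] = sample_weight
    return eval_set

# Tiny placeholder data; in practice these would be feature and target arrays.
X_holdout, y_holdout = [[0.0, 1.0], [1.0, 0.0]], [0.5, 1.5]
X_valid, y_valid = [[2.0, 3.0]], [2.5]

eval_sets = [
    make_eval_set(X_holdout, y_holdout),                   # monitored only
    make_eval_set(X_valid, y_valid, sample_weight=[1.0]),  # last one: used to select the best model
]
print([sorted(s) for s in eval_sets])
```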

#### The example below shows how to train a simple regression task:

In [4]:
%%time
model = GradientBoosting('mse')

model.fit(X, y[:, 0], eval_sets=[{'X': X_test, 'y': y_test[:, 0]},])

[20:25:59] Stdout logging level is INFO.
[20:25:59] GDBT train starts. Max iter 100, early stopping rounds 100
[20:25:59] Iter 0; Sample 0, rmse = 173.6750225293403; 
[20:25:59] Iter 10; Sample 0, rmse = 133.1954926671811; 
[20:26:00] Iter 20; Sample 0, rmse = 107.86632421368985; 
[20:26:00] Iter 30; Sample 0, rmse = 90.082420080482; 
[20:26:00] Iter 40; Sample 0, rmse = 76.44561629937125; 
[20:26:00] Iter 50; Sample 0, rmse = 65.61091639545339; 
[20:26:00] Iter 60; Sample 0, rmse = 56.80208140886104; 
[20:26:00] Iter 70; Sample 0, rmse = 49.577701821537644; 
[20:26:00] Iter 80; Sample 0, rmse = 43.604032056197624; 
[20:26:00] Iter 90; Sample 0, rmse = 38.69803479079091; 
[20:26:00] Iter 99; Sample 0, rmse = 34.992295325052034; 
CPU times: user 3.98 s, sys: 708 ms, total: 4.69 s
Wall time: 3.41 s


<py_boost.gpu.boosting.GradientBoosting at 0x7f245c30a790>

### Train for multioutput regression

Each built-in loss function has its own default metric, so specifying a metric is optional. To specify an evaluation metric, pass a Metric instance or use a string alias:

#### Default metrics:

* ***'rmse'*** is the default for the ***'mse'*** loss
* ***'rmsle'*** is the default for the  ***'msle'*** loss
* ***'bce'*** is the default for the ***'bce'*** loss
* ***'crossentropy'*** is the default for the ***'crossentropy'*** loss

#### Non-default metrics:

* ***'r2'*** for the regression/multitask regression
* ***'auc'*** for the binary classification
* ***'accuracy'*** for any classification task
* ***'precision'*** for any classification task
* ***'recall'*** for any classification task
* ***'f1'*** for any classification task
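
For reference, the quantity behind an R2 metric can be sketched in NumPy as follows. This assumes the standard coefficient of determination for a single output; how py_boost aggregates the score over several outputs is not specified here, and the function name `r2` is illustrative only.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination for a single 1-D target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return float(1.0 - ss_res / ss_tot)

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])
score = r2(y_true, y_pred)
print(score)
```

A perfect prediction scores 1, and always predicting the target mean scores 0, which is why the logged values in a training run rise toward 1 as the model improves.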


You can also specify other common GBDT hyperparameters, as shown below.

#### The example below trains a model for a multioutput regression task. No extra configuration is needed to switch the task; just pass a multidimensional target:

In [5]:
%%time
model = GradientBoosting('mse', 'r2_score',
                         ntrees=1000, lr=.01, verbose=100, es=200, lambda_l2=1,
                         subsample=.8, colsample=.8, min_data_in_leaf=10, min_gain_to_split=0, 
                         max_bin=256, max_depth=6)

model.fit(X, y, eval_sets=[{'X': X_test, 'y': y_test},])

[20:26:01] Stdout logging level is INFO.
[20:26:01] GDBT train starts. Max iter 1000, early stopping rounds 200
[20:26:01] Iter 0; Sample 0, R2_score = 0.0083971580680002; 
[20:26:03] Iter 100; Sample 0, R2_score = 0.5170809873680927; 
[20:26:04] Iter 200; Sample 0, R2_score = 0.7244484108548253; 
[20:26:06] Iter 300; Sample 0, R2_score = 0.8327945671515238; 
[20:26:07] Iter 400; Sample 0, R2_score = 0.8949818250346425; 
[20:26:09] Iter 500; Sample 0, R2_score = 0.9320841598288354; 
[20:26:11] Iter 600; Sample 0, R2_score = 0.9546882993697207; 
[20:26:12] Iter 700; Sample 0, R2_score = 0.9687345404426747; 
[20:26:14] Iter 800; Sample 0, R2_score = 0.9776046975136594; 
[20:26:16] Iter 900; Sample 0, R2_score = 0.983284793877143; 
[20:26:17] Iter 999; Sample 0, R2_score = 0.9869747857432605; 
CPU times: user 16.8 s, sys: 1.66 s, total: 18.5 s
Wall time: 17.2 s


<py_boost.gpu.boosting.GradientBoosting at 0x7f227a4c9ee0>

## Inference

#### Prediction is done by calling the .predict method

In [6]:
%%time
preds = model.predict(X_test)

preds.shape

CPU times: user 795 ms, sys: 491 ms, total: 1.29 s
Wall time: 1.28 s


(50000, 10)

In [7]:
preds

array([[-245.00179 , -154.63591 , -283.4694  , ..., -149.77898 ,
        -225.69731 , -242.11638 ],
       [-109.20388 , -106.41216 ,  -47.461292, ..., -120.21821 ,
        -112.02216 ,  -11.135624],
       [ -41.005196,  -66.21285 ,  138.89832 , ...,   11.006489,
         -31.018751, -215.91324 ],
       ...,
       [ -74.87248 ,  138.37398 ,   81.35473 , ...,  228.11902 ,
          39.559467,   20.036104],
       [ -13.227322,  128.68654 ,  232.82118 , ...,  136.53038 ,
         160.83208 ,  199.1988  ],
       [ -14.531125,   39.43056 ,  171.69994 , ...,   99.01615 ,
          26.81745 ,    7.780988]], dtype=float32)

#### Predictions at specific iterations can be obtained by calling the .predict_staged method

In [8]:
%%time
preds = model.predict_staged(X_test, iterations=[100, 300, 500])

preds.shape

CPU times: user 246 ms, sys: 220 ms, total: 466 ms
Wall time: 465 ms


(3, 50000, 10)
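
An array of shape (stages, rows, outputs) like the one above can be reduced to one error value per stage. The sketch below uses synthetic data as a stand-in for the staged predictions (no model is involved, and all names are illustrative) to show the axis handling.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_outputs = 5, 2
y_true = rng.normal(size=(n_rows, n_outputs))

# Stand-in for staged predictions: the same noise, shrinking at later stages.
noise = rng.normal(size=(n_rows, n_outputs))
staged = np.stack([y_true + scale * noise for scale in (1.0, 0.5, 0.1)])

# RMSE per stage: average the squared error over rows and outputs (axes 1 and 2).
rmse_per_stage = np.sqrt(np.mean((staged - y_true[None]) ** 2, axis=(1, 2)))
best_stage = int(np.argmin(rmse_per_stage))
print(staged.shape, rmse_per_stage, best_stage)
```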

#### Tree leaf indices at specific iterations can be predicted by calling the .predict_leaves method

In [9]:
%%time
preds = model.predict_leaves(X_test, iterations=[100, 300, 500])

preds.shape

CPU times: user 11.3 ms, sys: 7.57 ms, total: 18.8 ms
Wall time: 17.9 ms


(3, 50000, 1)

In [10]:
preds.T[0]

array([[10, 36, 19],
       [35, 48, 18],
       [31, 56, 21],
       ...,
       [59, 56, 26],
       [30, 57, 20],
       [53, 56, 20]], dtype=uint32)
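
One common use of such leaf indices is to turn them into one-hot features for a downstream model. The following is a minimal NumPy sketch under that assumption; it works on a small illustrative matrix in the same (rows, trees) layout as `preds.T[0]`, and none of the names are part of py_boost.

```python
import numpy as np

# Illustrative leaf indices: 4 rows, 3 trees.
leaves = np.array([[10, 36, 19],
                   [35, 48, 18],
                   [31, 56, 21],
                   [10, 56, 19]], dtype=np.uint32)

blocks = []
for t in range(leaves.shape[1]):
    # Map the leaf ids observed for this tree to consecutive column codes.
    uniq, codes = np.unique(leaves[:, t], return_inverse=True)
    block = np.zeros((leaves.shape[0], uniq.size), dtype=np.float32)
    block[np.arange(leaves.shape[0]), codes] = 1.0
    blocks.append(block)

onehot = np.hstack(blocks)  # one group of columns per tree
print(onehot.shape)
```

Rows that fall into the same leaf of a tree share a column, so the dot product of two encoded rows counts the trees in which they land together.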

#### Feature importances

In [11]:
model.get_feature_importance()

array([  41.,   51.,   37.,   43.,   54.,   51., 5674.,   52.,   62.,
         55.,   40.,   48.,   45.,   33.,   57., 5945., 5510.,   40.,
         37., 5460.,   29.,   35.,   52.,   62.,   34.,   41.,   48.,
         49.,   30.,   37.,   48.,   46.,   56.,   43.,   37.,   43.,
       6009.,   43.,   39.,   40.,   64.,   52.,   44.,   49.,   62.,
         38.,   39.,   46.,   41.,   50.,   52.,   58., 5848.,   47.,
         38.,   60.,   55.,   36.,   35.,   50.,   34.,   48.,   41.,
         36.,   38.,   41.,   38.,   43.,   44.,   65.,   45.,   29.,
         39.,   64.,   44.,   59.,   65.,   37.,   44.,   40.,   51.,
         38.,   51.,   49.,   59.,   38., 5605., 3504.,   55., 5843.,
         53., 6135.,   41.,   50.,   53.,   39.,   47.,   42.,   56.,
         65.], dtype=float32)
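
A raw importance vector like this is easier to read once ranked. The sketch below uses a short illustrative array rather than the output above, and it does not assume what the importance values measure, only that larger means more important.

```python
import numpy as np

# Illustrative importance values, not the model output above.
importance = np.array([41., 51., 5674., 43., 5945., 37.], dtype=np.float32)

top_k = 2
top_idx = np.argsort(importance)[::-1][:top_k]  # feature indices, most important first
share = importance / importance.sum()           # normalized so the values sum to 1

for i in top_idx:
    print(f"feature {i}: importance={importance[i]:.0f}, share={share[i]:.3f}")
```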

#### A trained model can be saved as a pickle for inference

In [12]:
joblib.dump(model, 'temp_model.pkl')

new_model = joblib.load('temp_model.pkl')
new_model.predict(X_test)

array([[-245.00179 , -154.63591 , -283.4694  , ..., -149.77898 ,
        -225.69731 , -242.11638 ],
       [-109.20388 , -106.41216 ,  -47.461292, ..., -120.21821 ,
        -112.02216 ,  -11.135624],
       [ -41.005196,  -66.21285 ,  138.89832 , ...,   11.006489,
         -31.018751, -215.91324 ],
       ...,
       [ -74.87248 ,  138.37398 ,   81.35473 , ...,  228.11902 ,
          39.559467,   20.036104],
       [ -13.227322,  128.68654 ,  232.82118 , ...,  136.53038 ,
         160.83208 ,  199.1988  ],
       [ -14.531125,   39.43056 ,  171.69994 , ...,   99.01615 ,
          26.81745 ,    7.780988]], dtype=float32)