## The simplest usage example of py_boost

### Installation (if needed)

**Note**: replace cupy-cuda110 with the package that matches your CUDA version!

In [1]:
# !pip install cupy-cuda110 py-boost

### Imports

In [2]:
import os
# Optional: select the GPU device to run on
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

os.makedirs('../data', exist_ok=True)

import joblib
from sklearn.datasets import make_regression
import numpy as np

# simple case - just one class is used
from py_boost import GradientBoosting
from py_boost.cv import CrossValidation

### Generation of dummy regression data

In [3]:
%%time
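# 150k samples, 100 features, 10 regression targets
X, y = make_regression(n_samples=150000, n_features=100, n_targets=10, random_state=42)
# hold out the first 50k rows for testing and train on the last 50k
X_test, y_test = X[:50000], y[:50000]
X, y = X[-50000:], y[-50000:]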

CPU times: user 2.32 s, sys: 1.6 s, total: 3.92 s
Wall time: 813 ms
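
The slicing above keeps the first 50,000 rows for testing and the last 50,000 for training, so the middle 50,000 rows are not used. A quick index-only check confirms that the two sets cannot overlap. The names `n_total`, `test_idx` and `train_idx` are illustrative and not part of the notebook:

```python
n_total = 150_000
indices = range(n_total)

test_idx = indices[:50_000]     # same slice as X[:50000]
train_idx = indices[-50_000:]   # same slice as X[-50000:]

# Rows shared by both sets, and rows that belong to neither
overlap = set(test_idx) & set(train_idx)
unused = n_total - len(test_idx) - len(train_idx)

print(len(test_idx), len(train_idx), len(overlap), unused)
```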


### Training a GBDT model

The only required argument is the loss function. Together with the shape of the target, it determines the task type. The loss can be passed as a Loss instance or as a string alias:

* ***'mse'*** for regression/multitask regression
* ***'msle'*** for regression/multitask regression
* ***'bce'*** for binary/multilabel classification
* ***'crossentropy'*** for multiclass classification
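
As a reading aid, the list above can be written as a small lookup on the loss alias and the target shape. The helper `describe_task` is hypothetical: it only restates the list and is not how py_boost decides the task internally.

```python
def describe_task(loss, target_shape):
    """Hypothetical helper: name the task implied by a loss alias and a target shape."""
    # A 1-D target has a single output; a 2-D target has one output per column.
    n_outputs = target_shape[1] if len(target_shape) > 1 else 1
    multi = n_outputs > 1

    if loss in ("mse", "msle"):
        return "multitask regression" if multi else "regression"
    if loss == "bce":
        return "multilabel classification" if multi else "binary classification"
    if loss == "crossentropy":
        return "multiclass classification"
    raise ValueError(f"unknown loss alias: {loss!r}")


print(describe_task("mse", (50000,)))
print(describe_task("mse", (50000, 10)))
```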

Training is done by calling the .fit method. It accepts the following arguments:

* ***'X'***: the feature matrix
* ***'y'***: the target (one-dimensional, or two-dimensional for multioutput tasks)
* ***'sample_weight'***: optional per-sample weights
* ***'eval_sets'***: validation sets, passed as a list of dicts with possible keys ['X', 'y', 'sample_weight']. Note: if multiple validation sets are passed, the best model is selected using the last one.
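
A minimal sketch of the validation-set structure described above, using tiny Python lists as stand-ins for the NumPy arrays a real run would use. The variable names and the sanity checks are illustrative assumptions, not part of py_boost:

```python
# Tiny stand-in data; in the notebook these would be NumPy arrays.
X_valid_a, y_valid_a = [[0.0, 1.0], [1.0, 0.0]], [1.0, 2.0]
X_valid_b, y_valid_b = [[2.0, 3.0], [3.0, 2.0], [4.0, 4.0]], [3.0, 4.0, 5.0]

eval_sets = [
    {"X": X_valid_a, "y": y_valid_a},
    {"X": X_valid_b, "y": y_valid_b, "sample_weight": [1.0, 0.5, 2.0]},
]

# Check that each dict uses only the documented keys and has matching lengths.
allowed_keys = {"X", "y", "sample_weight"}
for i, eval_set in enumerate(eval_sets):
    unknown = set(eval_set) - allowed_keys
    assert not unknown, f"eval set {i} has unexpected keys: {unknown}"
    assert len(eval_set["X"]) == len(eval_set["y"])

# Per the note above, the LAST set is the one used to select the best model.
selection_index = len(eval_sets) - 1
print(selection_index)
```

Putting the most trusted validation set last is therefore the safe ordering when several are passed.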

#### The example below trains a model for a simple regression task.

In [4]:
%%time
model = GradientBoosting('mse')

model.fit(X, y[:, 0], eval_sets=[{'X': X_test, 'y': y_test[:, 0]},])

[10:05:38] Stdout logging level is INFO.
[10:05:38] GDBT train starts. Max iter 100, early stopping rounds 100
[10:05:39] Iter 0; Sample 0, rmse = 173.68515689890785; 
[10:05:39] Iter 10; Sample 0, rmse = 133.2329134694332; 
[10:05:39] Iter 20; Sample 0, rmse = 107.90957666785495; 
[10:05:39] Iter 30; Sample 0, rmse = 90.08412889111113; 
[10:05:39] Iter 40; Sample 0, rmse = 76.43090124568918; 
[10:05:39] Iter 50; Sample 0, rmse = 65.558451192956; 
[10:05:40] Iter 60; Sample 0, rmse = 56.768458988756876; 
[10:05:40] Iter 70; Sample 0, rmse = 49.56556837526753; 
[10:05:40] Iter 80; Sample 0, rmse = 43.589192766231506; 
[10:05:40] Iter 90; Sample 0, rmse = 38.672162024198634; 
[10:05:40] Iter 99; Sample 0, rmse = 34.99783732571031; 
CPU times: user 8.95 s, sys: 1.34 s, total: 10.3 s
Wall time: 7.95 s


<py_boost.gpu.boosting.GradientBoosting at 0x7f0374b0e670>
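
The rmse values in the log are root mean squared errors on the validation set. For reference, here is a plain-Python sketch of that formula on made-up numbers. It illustrates the definition only and is not py_boost's GPU implementation:

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error of two equally long sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


y_true = [3.0, -1.0, 2.0, 0.0]
y_pred = [1.0, -1.0, 2.0, 2.0]

# Squared errors are 4, 0, 0, 4, so the mean is 2 and the RMSE is sqrt(2).
print(rmse(y_true, y_pred))
```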

### Training a GBDT model for multioutput regression

Each built-in loss function has its own default metric, so specifying a metric is optional. 
To use a different evaluation metric, pass a Metric instance or a string alias.

#### Default metrics:

* ***'rmse'*** is the default for the ***'mse'*** loss
* ***'rmsle'*** is the default for the ***'msle'*** loss
* ***'bce'*** is the default for the ***'bce'*** loss
* ***'crossentropy'*** is the default for the ***'crossentropy'*** loss

#### Non-default metrics:

* ***'r2'*** for the regression/multitask regression
* ***'auc'*** for the binary classification
* ***'accuracy'*** for any classification task
* ***'precision'*** for any classification task
* ***'recall'*** for any classification task
* ***'f1'*** for any classification task
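
For readers unfamiliar with the R2 metric, here is a plain-Python sketch of the standard single-output definition, 1 - SS_res / SS_tot, on made-up numbers. How py_boost aggregates the score across several targets is not covered here, so treat this as the textbook formula only:

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination for a single output."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r2_score(y_true, y_true)                    # no residual error
baseline = r2_score(y_true, [2.5, 2.5, 2.5, 2.5])     # always predict the mean
close = r2_score(y_true, [1.0, 2.0, 3.0, 5.0])        # one prediction off by 1

print(perfect, baseline, close)
```

A score of 1 means a perfect fit and 0 means no better than predicting the mean, which is why, unlike rmse, higher is better.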

It is possible to specify other common GBDT hyperparameters as shown below.

#### The following example trains a model for a multioutput regression task. No extra configuration is needed to switch to multioutput: just pass a multidimensional target.

In [5]:
%%time
model = GradientBoosting('mse', 'r2_score',
                         ntrees=1000, lr=.01, verbose=100, es=200, lambda_l2=1,
                         subsample=.8, colsample=.8, min_data_in_leaf=10, min_gain_to_split=0, 
                         max_bin=256, max_depth=6)

model.fit(X, y, eval_sets=[{'X': X_test, 'y': y_test},])

[10:05:40] Stdout logging level is INFO.
[10:05:40] GDBT train starts. Max iter 1000, early stopping rounds 200
[10:05:42] Iter 0; Sample 0, R2_score = 0.008394434444722109; 
[10:05:44] Iter 100; Sample 0, R2_score = 0.5168089065694275; 
[10:05:46] Iter 200; Sample 0, R2_score = 0.7243047110486456; 
[10:05:48] Iter 300; Sample 0, R2_score = 0.8328610027909523; 
[10:05:51] Iter 400; Sample 0, R2_score = 0.8950558515671254; 
[10:05:53] Iter 500; Sample 0, R2_score = 0.9320911339106711; 
[10:05:55] Iter 600; Sample 0, R2_score = 0.9547420897045367; 
[10:05:57] Iter 700; Sample 0, R2_score = 0.9687934126955178; 
[10:05:59] Iter 800; Sample 0, R2_score = 0.9776687079713445; 
[10:06:02] Iter 900; Sample 0, R2_score = 0.9833495654282334; 
[10:06:04] Iter 999; Sample 0, R2_score = 0.9870432315097194; 
CPU times: user 23.4 s, sys: 2.24 s, total: 25.6 s
Wall time: 23.7 s


<py_boost.gpu.boosting.GradientBoosting at 0x7f035e2a39a0>
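
The `es=200` argument appears in the log as "early stopping rounds 200". The sketch below shows a generic early-stopping rule of that kind: stop once the validation score has not improved for `patience` consecutive rounds, and keep the best round. The function name and the score values are made up, and this is not py_boost's own code:

```python
def best_iteration(scores, patience, higher_is_better=True):
    """Return (best_index, stop_index) under a generic early-stopping rule."""
    best_idx = 0
    for i in range(1, len(scores)):
        if higher_is_better:
            improved = scores[i] > scores[best_idx]
        else:
            improved = scores[i] < scores[best_idx]

        if improved:
            best_idx = i
        elif i - best_idx >= patience:
            # No improvement for `patience` rounds: stop here.
            return best_idx, i
    # Ran out of rounds before the rule fired.
    return best_idx, len(scores) - 1


scores = [0.10, 0.50, 0.70, 0.69, 0.68, 0.71, 0.70, 0.69, 0.68]
print(best_iteration(scores, patience=3))
```

In the run above, the logged R2 score was still rising at iteration 999, so a rule like this would not have stopped training early.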

## Inference

#### Predictions are made by calling the .predict method

In [6]:
%%time
preds = model.predict(X_test)

preds.shape

CPU times: user 1.37 s, sys: 650 ms, total: 2.02 s
Wall time: 2.06 s


(50000, 10)

In [7]:
preds

array([[-227.73753  , -139.2886   , -271.09726  , ..., -132.06668  ,
        -209.25096  , -226.7468   ],
       [-107.63527  , -105.17763  ,  -51.269665 , ..., -120.603195 ,
        -108.78121  ,  -16.261744 ],
       [ -37.47324  ,  -59.83011  ,  144.37334  , ...,   16.221537 ,
         -27.90023  , -213.23357  ],
       ...,
       [ -79.836494 ,  138.32225  ,   85.83367  , ...,  230.75499  ,
          37.589367 ,   16.244379 ],
       [  -5.5796094,  138.75368  ,  246.4468   , ...,  150.21175  ,
         174.16621  ,  205.89897  ],
       [  -6.7977576,   40.970036 ,  167.97235  , ...,   96.029816 ,
          27.76857  ,   11.001275 ]], dtype=float32)

#### Predictions at specific iterations can be obtained by calling the .predict_staged method

In [8]:
%%time
preds = model.predict_staged(X_test, iterations=[100, 300, 500])

preds.shape

CPU times: user 373 ms, sys: 306 ms, total: 679 ms
Wall time: 690 ms


(3, 50000, 10)
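
The leading axis of size 3 holds one prediction array per requested iteration. In a gradient boosting ensemble the prediction is a running sum of tree outputs, so a staged prediction is that sum truncated at a given tree. The sketch below shows the idea for one sample with made-up numbers. Whether py_boost counts iterations from 0 or from 1 is not stated here, so the indexing is an assumption:

```python
from itertools import accumulate

base_score = 10.0
# Contribution of each tree for ONE sample, learning rate already applied (made-up values).
tree_outputs = [4.0, 2.0, 1.0, 0.5, 0.25]

# running[k] = prediction after the first k trees (running[0] is the base score alone)
running = list(accumulate(tree_outputs, initial=base_score))

requested = [1, 3, 5]                      # number of trees to keep (assumed convention)
staged = [running[k] for k in requested]

print(staged)
```

The last entry of `running` equals the full-model prediction, which is why staged predictions are cheap to obtain from one trained model.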

#### Tree leaf indices at specific iterations can be obtained by calling the .predict_leaves method

In [9]:
%%time
preds = model.predict_leaves(X_test, iterations=[100, 300, 500])

preds.shape

CPU times: user 15.5 ms, sys: 3.89 ms, total: 19.3 ms
Wall time: 18.2 ms


(3, 50000, 1)

In [10]:
preds.T[0]

array([[14, 17,  5],
       [50, 44, 14],
       [32, 45, 54],
       ...,
       [54, 50, 15],
       [30, 45,  5],
       [60, 44, 13]], dtype=int32)
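
The leaf array has shape (3, 50000, 1), and transposing reverses the axes to (1, 50000, 3), so `preds.T[0]` holds one row per sample with its leaf index at each of the three requested iterations. The same regrouping is sketched below with nested lists, using the first three rows of the output above:

```python
# leaves[i][n][0]: leaf index of sample n at the i-th requested iteration
leaves = [
    [[14], [50], [32]],   # iteration 100
    [[17], [44], [45]],   # iteration 300
    [[5], [14], [54]],    # iteration 500
]

n_iters, n_samples = len(leaves), len(leaves[0])

# Equivalent of preds.T[0]: regroup so each row belongs to one sample
per_sample = [[leaves[i][n][0] for i in range(n_iters)] for n in range(n_samples)]

for row in per_sample:
    print(row)
```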

#### Feature importances

In [11]:
model.get_feature_importance()

array([  34.,   38.,   39.,   45.,   59.,   46., 5633.,   45.,   33.,
         65.,   43.,   51.,   36.,   43.,   37., 5962., 5585.,   54.,
         39., 5452.,   40.,   48.,   38.,   50.,   48.,   36.,   39.,
         45.,   55.,   44.,   55.,   52.,   64.,   55.,   45.,   43.,
       5880.,   44.,   48.,   42.,   44.,   41.,   57.,   56.,   44.,
         41.,   33.,   42.,   46.,   44.,   47.,   48., 5824.,   35.,
         45.,   52.,   44.,   41.,   56.,   45.,   62.,   42.,   46.,
         40.,   47.,   47.,   35.,   50.,   49.,   54.,   45.,   34.,
         32.,   56.,   54.,   51.,   61.,   42.,   58.,   53.,   57.,
         37.,   45.,   70.,   63.,   36., 5635., 3571.,   47., 5825.,
         45., 6166.,   42.,   45.,   57.,   47.,   45.,   56.,   30.,
         56.], dtype=float32)
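
Ten features stand out with importances in the thousands, which is consistent with `make_regression` generating ten informative features by default. A short sketch of ranking features by importance follows. It uses only the first 18 values from the array above, and `top_k` is an illustrative choice:

```python
# First 18 importance values copied from the output above
importance = [34., 38., 39., 45., 59., 46., 5633., 45., 33.,
              65., 43., 51., 36., 43., 37., 5962., 5585., 54.]

top_k = 3

# Feature indices sorted from most to least important
ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
top_features = ranked[:top_k]

print(top_features)
```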

#### The trained model can be saved as a pickle file for inference

In [12]:
joblib.dump(model, '../data/temp_model.pkl')

new_model = joblib.load('../data/temp_model.pkl')
new_model.predict(X_test)

array([[-227.73753  , -139.2886   , -271.09726  , ..., -132.06668  ,
        -209.25096  , -226.7468   ],
       [-107.63527  , -105.17763  ,  -51.269665 , ..., -120.603195 ,
        -108.78121  ,  -16.261744 ],
       [ -37.47324  ,  -59.83011  ,  144.37334  , ...,   16.221537 ,
         -27.90023  , -213.23357  ],
       ...,
       [ -79.836494 ,  138.32225  ,   85.83367  , ...,  230.75499  ,
          37.589367 ,   16.244379 ],
       [  -5.5796094,  138.75368  ,  246.4468   , ...,  150.21175  ,
         174.16621  ,  205.89897  ],
       [  -6.7977576,   40.970036 ,  167.97235  , ...,   96.029816 ,
          27.76857  ,   11.001275 ]], dtype=float32)
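
The reloaded model returns the same predictions as the original. The save/load round trip can be tried without a GPU using the standard-library `pickle` module in place of joblib, with a plain dict standing in for the model. Both substitutions are assumptions made to keep the sketch self-contained:

```python
import os
import pickle
import tempfile

# Stand-in for a trained model: any picklable Python object works the same way.
model_state = {"ntrees": 3, "lr": 0.1, "leaf_values": [0.5, -1.25, 2.0]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "temp_model.pkl")

    with open(path, "wb") as f:
        pickle.dump(model_state, f)

    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored object is an equal but independent copy of the original.
print(restored == model_state, restored is model_state)
```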

### Cross Validation

py_boost also provides a built-in cross-validation wrapper that produces out-of-fold predictions.

In [13]:
%%time
model = GradientBoosting('mse')
cv = CrossValidation(model)

oof_pred = cv.fit_predict(X, y, cv=5)

pred = cv.predict(X_test)
((pred - y_test) ** 2).mean() ** .5

[10:06:10] Stdout logging level is INFO.
[10:06:10] GDBT train starts. Max iter 100, early stopping rounds 100
[10:06:10] Iter 0; Sample 0, rmse = 175.42609649872603; 
[10:06:10] Iter 10; Sample 0, rmse = 144.38437583641723; 
[10:06:10] Iter 20; Sample 0, rmse = 122.64889288531302; 
[10:06:10] Iter 30; Sample 0, rmse = 106.2448795667879; 
[10:06:11] Iter 40; Sample 0, rmse = 93.29088329381031; 
[10:06:11] Iter 50; Sample 0, rmse = 82.61964778368379; 
[10:06:11] Iter 60; Sample 0, rmse = 73.54036275397404; 
[10:06:11] Iter 70; Sample 0, rmse = 65.77099620580852; 
[10:06:12] Iter 80; Sample 0, rmse = 59.169712596698275; 
[10:06:12] Iter 90; Sample 0, rmse = 53.48868788517551; 
[10:06:12] Iter 99; Sample 0, rmse = 49.021571915744666; 
[10:06:12] Stdout logging level is INFO.
[10:06:12] GDBT train starts. Max iter 100, early stopping rounds 100
[10:06:12] Iter 0; Sample 0, rmse = 176.46325625123137; 
[10:06:12] Iter 10; Sample 0, rmse = 145.3524019640615; 
[10:06:13] Iter 20; Sample 0, rms

47.28329104835014
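
An out-of-fold prediction for a sample comes from a model that never saw that sample during training. The sketch below shows the idea in plain Python, with contiguous folds and the training-set mean as a stand-in "model". The fold layout and the helper `kfold_indices` are illustrative assumptions and say nothing about how CrossValidation splits the data internally:

```python
def kfold_indices(n_samples, n_folds):
    """Split range(n_samples) into n_folds contiguous validation folds."""
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
oof_pred = [None] * len(y)

for valid_idx in kfold_indices(len(y), n_folds=3):
    valid = set(valid_idx)
    train_targets = [y[i] for i in range(len(y)) if i not in valid]
    fold_model = sum(train_targets) / len(train_targets)   # "model" = training mean
    for i in valid_idx:
        oof_pred[i] = fold_model    # predicted by a model that never saw sample i

print(oof_pred)
```

Every sample receives exactly one prediction, so the out-of-fold array has the same length as the training target and can be scored against it directly.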