## The simplest usage example of py_boost

### Installation (if needed)

**Note**: replace cupy-cuda110 with the CuPy package that matches your CUDA version.

In [1]:
!pip install cupy-cuda110 py-boost

Defaulting to user installation because normal site-packages is not writeable


### Imports

In [2]:
import os
# Optional: select the GPU to run on (set these before importing py_boost)
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

os.makedirs('data', exist_ok=True)

import joblib
from sklearn.datasets import make_regression

# simple case - just one class is used
from py_boost import GradientBoosting 

### Generation of dummy regression data

In [3]:
%%time
X, y = make_regression(150000, 100, n_targets=10, random_state=42)
X_test, y_test = X[:50000], y[:50000]
X, y = X[-50000:], y[-50000:]

CPU times: user 2.13 s, sys: 1.5 s, total: 3.63 s
Wall time: 821 ms


### Training a GBDT model

The only required argument is the loss function. Together with the shape of the target, it determines the task type. The loss can be passed as a Loss instance or as a string alias:

* ***'mse'*** for regression/multitask regression
* ***'msle'*** for regression/multitask regression
* ***'bce'*** for binary/multilabel classification
* ***'crossentropy'*** for multiclass classification
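
As a rough illustration of how the loss alias and the target shape together imply the task type, the sketch below restates the list above as a small lookup. The helper `describe_task` and the `TASKS` table are hypothetical names written only for this example; they are not part of py_boost and do not reflect how the library decides internally.

```python
# Hypothetical lookup, not part of py_boost: it only restates the alias list above.
# Each entry maps a loss alias to (single-output task, multi-output task).
TASKS = {
    'mse': ('regression', 'multitask regression'),
    'msle': ('regression', 'multitask regression'),
    'bce': ('binary classification', 'multilabel classification'),
    'crossentropy': ('multiclass classification', 'multiclass classification'),
}


def describe_task(loss, y_shape):
    """Return a readable task name for a loss alias and a target shape."""
    if loss not in TASKS:
        raise ValueError(f'unknown loss alias: {loss!r}')
    single, multi = TASKS[loss]
    # A 2-D target with more than one column means several outputs per sample.
    multi_output = len(y_shape) == 2 and y_shape[1] > 1
    return multi if multi_output else single


print(describe_task('mse', (50000,)))     # regression
print(describe_task('mse', (50000, 10)))  # multitask regression
print(describe_task('bce', (50000, 3)))   # multilabel classification
```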

Training is done by calling the .fit method. The possible arguments are the following:

* ***'X'*** 
* ***'y'*** 
* ***'sample_weight'*** 
* ***'eval_sets'***  
Validation sets are passed as a list of dicts, each with possible keys ['X', 'y', 'sample_weight']. Note: if multiple validation sets are passed, the best model is selected using the last one.
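
To make the expected structure of eval_sets concrete, here is a minimal sketch that builds two validation sets from random stand-in arrays. The helper `make_eval_set` is hypothetical (it is not a py_boost function); it only assembles the dict keys named above and checks that the lengths agree.

```python
import numpy as np

rng = np.random.default_rng(0)
X_valid = rng.normal(size=(200, 5))  # stand-in features
y_valid = rng.normal(size=200)       # stand-in target
weights = np.ones(200)               # stand-in sample weights


def make_eval_set(X, y, sample_weight=None):
    """Hypothetical helper: build one validation-set dict with the keys listed above."""
    if len(X) != len(y):
        raise ValueError('X and y must have the same number of rows')
    eval_set = {'X': X, 'y': y}
    if sample_weight is not None:
        if len(sample_weight) != len(y):
            raise ValueError('sample_weight must have one entry per row')
        eval_set['sample_weight'] = sample_weight
    return eval_set


eval_sets = [
    make_eval_set(X_valid[:100], y_valid[:100]),
    # Per the note above, the last set is the one used for model selection.
    make_eval_set(X_valid[100:], y_valid[100:], weights[100:]),
]

print([sorted(s) for s in eval_sets])
```

The resulting list has the same shape as the eval_sets argument passed to .fit in the cells of this notebook.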

#### The example below illustrates how to train a model for a simple regression task.

In [4]:
%%time
model = GradientBoosting('mse')

model.fit(X, y[:, 0], eval_sets=[{'X': X_test, 'y': y_test[:, 0]},])

[09:26:30] Stdout logging level is INFO.
[09:26:30] GDBT train starts. Max iter 100, early stopping rounds 100
[09:26:30] Iter 0; Sample 0, rmse = 173.67502218505908; 
[09:26:30] Iter 10; Sample 0, rmse = 133.1954896923392; 
[09:26:30] Iter 20; Sample 0, rmse = 107.86651387609845; 
[09:26:30] Iter 30; Sample 0, rmse = 90.08264442437226; 
[09:26:30] Iter 40; Sample 0, rmse = 76.44595416433026; 
[09:26:31] Iter 50; Sample 0, rmse = 65.61115373070031; 
[09:26:31] Iter 60; Sample 0, rmse = 56.802347598291114; 
[09:26:31] Iter 70; Sample 0, rmse = 49.57790908462357; 
[09:26:31] Iter 80; Sample 0, rmse = 43.604359308972406; 
[09:26:31] Iter 90; Sample 0, rmse = 38.69830902688293; 
[09:26:31] Iter 99; Sample 0, rmse = 34.9925890975328; 
CPU times: user 4.23 s, sys: 664 ms, total: 4.89 s
Wall time: 3.59 s


<py_boost.gpu.boosting.GradientBoosting at 0x7f76447a4190>

### Training a GBDT model in a multiregression case

Each built-in loss function has its own default metric, so specifying a metric is optional. 
To use a different evaluation metric, pass a Metric instance or a string alias.

#### Default metrics:

* ***'rmse'*** is the default for the ***'mse'*** loss
* ***'rmsle'*** is the default for the  ***'msle'*** loss
* ***'bce'*** is the default for the ***'bce'*** loss
* ***'crossentropy'*** is the default for the ***'crossentropy'*** loss

#### Non-default metrics:

* ***'r2'*** for the regression/multitask regression
* ***'auc'*** for the binary classification
* ***'accuracy'*** for any classification task
* ***'precision'*** for any classification task
* ***'recall'*** for any classification task
* ***'f1'*** for any classification task
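
For reference, the two regression metrics that appear in the training logs of this notebook can be written in a few lines of NumPy. This is a sketch of the textbook definitions for a single output; how py_boost aggregates these metrics over several outputs is not shown here and should not be inferred from it.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - p)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 6.0])

print(rmse(y_true, y_pred))  # 1.0
print(r2(y_true, y_pred))    # close to 0.2
```

Lower is better for rmse, while higher is better for r2.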

It is possible to specify other common GBDT hyperparameters as shown below.

#### The following example demonstrates how to train a model for a multioutput regression task. No extra configuration is needed to switch to the multioutput case; just pass a multidimensional target.

In [5]:
%%time
model = GradientBoosting('mse', 'r2_score',
                         ntrees=1000, lr=.01, verbose=100, es=200, lambda_l2=1,
                         subsample=.8, colsample=.8, min_data_in_leaf=10, min_gain_to_split=0, 
                         max_bin=256, max_depth=6)

model.fit(X, y, eval_sets=[{'X': X_test, 'y': y_test},])

[09:26:32] Stdout logging level is INFO.
[09:26:32] GDBT train starts. Max iter 1000, early stopping rounds 200
[09:26:32] Iter 0; Sample 0, R2_score = 0.008384934011665946; 
[09:26:33] Iter 100; Sample 0, R2_score = 0.5168156713553872; 
[09:26:35] Iter 200; Sample 0, R2_score = 0.7242761967419419; 
[09:26:37] Iter 300; Sample 0, R2_score = 0.8326476773549876; 
[09:26:39] Iter 400; Sample 0, R2_score = 0.8949085297478583; 
[09:26:40] Iter 500; Sample 0, R2_score = 0.9319644219954142; 
[09:26:42] Iter 600; Sample 0, R2_score = 0.9546549940950049; 
[09:26:44] Iter 700; Sample 0, R2_score = 0.9687287866556533; 
[09:26:45] Iter 800; Sample 0, R2_score = 0.9776128355209428; 
[09:26:47] Iter 900; Sample 0, R2_score = 0.983291931460586; 
[09:26:49] Iter 999; Sample 0, R2_score = 0.9869762282705536; 
CPU times: user 17.7 s, sys: 1.63 s, total: 19.3 s
Wall time: 18 s


<py_boost.gpu.boosting.GradientBoosting at 0x7f745a2ddbb0>
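
The log above mentions "early stopping rounds 200", which corresponds to the es argument. As a generic sketch of what a patience-based early stopping rule does (not py_boost's actual implementation), the hypothetical function `best_iteration` below scans a list of validation scores and stops once no improvement has been seen for `patience` rounds.

```python
def best_iteration(scores, patience, higher_is_better=True):
    """Return (best_index, stopped_at) for a patience-based early stopping rule.

    Generic illustration only; the name and signature are made up for this example.
    """
    best_idx = 0
    for i, score in enumerate(scores):
        if higher_is_better:
            improved = score > scores[best_idx]
        else:
            improved = score < scores[best_idx]
        if improved:
            best_idx = i
        elif i - best_idx >= patience:
            # No improvement for `patience` rounds: stop and keep the best round.
            return best_idx, i
    return best_idx, len(scores) - 1


val_r2 = [0.10, 0.40, 0.55, 0.60, 0.59, 0.58, 0.61, 0.60, 0.60, 0.60]

print(best_iteration(val_r2, patience=2))  # (3, 5)
print(best_iteration(val_r2, patience=3))  # (6, 9)
```

With a patience of 2 the scan stops before reaching the slightly better score at index 6, which is the usual trade-off: a small patience saves training time but can stop on a temporary plateau.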

## Inference

#### Prediction is done by calling the .predict method

In [6]:
%%time
preds = model.predict(X_test)

preds.shape

CPU times: user 902 ms, sys: 444 ms, total: 1.35 s
Wall time: 1.34 s


(50000, 10)

In [7]:
preds

array([[-239.43956  , -148.27985  , -281.3192   , ..., -142.55777  ,
        -214.58322  , -235.33083  ],
       [-112.88808  , -116.1815   ,  -64.24356  , ..., -132.47252  ,
        -121.00293  ,  -22.776104 ],
       [ -32.328655 ,  -52.749233 ,  144.76985  , ...,   21.727789 ,
         -20.009361 , -205.19566  ],
       ...,
       [ -72.1823   ,  138.6279   ,   87.44407  , ...,  229.96695  ,
          41.225166 ,   20.167723 ],
       [  -6.853402 ,  141.3769   ,  246.19102  , ...,  153.36969  ,
         179.23767  ,  208.78542  ],
       [ -21.059866 ,   31.169312 ,  168.37772  , ...,   90.69682  ,
          19.636633 ,    2.2348657]], dtype=float32)

#### Prediction for certain iterations can be done by calling the .predict_staged method

In [8]:
%%time
preds = model.predict_staged(X_test, iterations=[100, 300, 500])

preds.shape

CPU times: user 241 ms, sys: 248 ms, total: 489 ms
Wall time: 489 ms


(3, 50000, 10)
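
The staged output stacks one prediction array per requested iteration along the first axis. A typical use is to score each stage against the target; the sketch below does this with NumPy broadcasting on a synthetic stand-in array (the noise levels are made up so that later stages are more accurate), since running the real model requires a GPU.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_outputs = 1000, 10
y_true = rng.normal(size=(n_samples, n_outputs))

# Stand-in for model.predict_staged(...): one array per stage, with less noise
# at later stages to mimic a model that improves as trees are added.
staged = np.stack([
    y_true + rng.normal(scale=scale, size=y_true.shape)
    for scale in (1.0, 0.5, 0.1)
])

# Broadcasting subtracts y_true (1000, 10) from every stage of staged (3, 1000, 10);
# averaging over the sample and output axes leaves one RMSE value per stage.
stage_rmse = np.sqrt(((staged - y_true) ** 2).mean(axis=(1, 2)))

print(staged.shape)      # (3, 1000, 10)
print(stage_rmse.shape)  # (3,)
```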

#### Tree leaf indices for certain iterations can be predicted by calling the .predict_leaves method

In [9]:
%%time
preds = model.predict_leaves(X_test, iterations=[100, 300, 500])

preds.shape

CPU times: user 19.3 ms, sys: 3.26 ms, total: 22.6 ms
Wall time: 21.7 ms


(3, 50000, 1)

In [10]:
preds.T[0]

array([[11, 46,  9],
       [54, 46, 28],
       [32, 46, 55],
       ...,
       [54, 53, 18],
       [27, 46, 20],
       [60, 46, 27]], dtype=uint32)
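
The `preds.T[0]` expression above turns the (3, 50000, 1) array into one row per sample and one column per requested tree. The sketch below reproduces that reshaping on a random stand-in array and counts how many samples land in each leaf of the first tree. The number of leaves (64) is an assumption made for the example, not a value read from the model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trees, n_samples, n_leaves = 3, 1000, 64  # n_leaves is assumed for illustration

# Stand-in for model.predict_leaves(X_test, iterations=[100, 300, 500]).
leaves = rng.integers(0, n_leaves, size=(n_trees, n_samples, 1), dtype=np.uint32)

# .T reverses the axes: (3, 1000, 1) -> (1, 1000, 3); [0] drops the leading axis.
per_sample = leaves.T[0]
print(per_sample.shape)  # (1000, 3)

# Count how many samples fall into each leaf of the first requested tree.
counts = np.bincount(per_sample[:, 0].astype(np.int64), minlength=n_leaves)
print(counts.sum())  # 1000
```

Each column of `per_sample` can be treated as a categorical feature, for example as input to a downstream model.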

#### Feature importances

In [11]:
model.get_feature_importance()

array([  41.,   48.,   39.,   56.,   54.,   50., 5662.,   52.,   53.,
         56.,   40.,   52.,   40.,   52.,   30., 5941., 5405.,   39.,
         34., 5544.,   43.,   46.,   41.,   65.,   37.,   51.,   45.,
         51.,   42.,   52.,   46.,   46.,   57.,   58.,   42.,   55.,
       5997.,   49.,   27.,   32.,   57.,   44.,   43.,   37.,   43.,
         61.,   36.,   49.,   53.,   58.,   42.,   51., 5919.,   29.,
         52.,   60.,   44.,   40.,   51.,   34.,   56.,   45.,   40.,
         49.,   54.,   32.,   40.,   52.,   38.,   45.,   58.,   42.,
         45.,   50.,   53.,   53.,   70.,   34.,   33.,   58.,   74.,
         41.,   43.,   45.,   71.,   30., 5637., 3552.,   51., 5754.,
         46., 6103.,   34.,   41.,   48.,   40.,   40.,   44.,   43.,
         56.], dtype=float32)
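
Ten entries in the array above are far larger than the rest, which matches the default of 10 informative features in make_regression. A short way to pull out the most important features is to sort the indices by importance; the sketch below does so on a small stand-in array rather than on the real model output.

```python
import numpy as np

# Stand-in for model.get_feature_importance(); values chosen for illustration.
importance = np.array([41., 48., 5662., 56., 5941., 50.], dtype=np.float32)

top_k = 2
# argsort is ascending, so reverse it to rank features from most to least important.
order = np.argsort(importance)[::-1]
top = order[:top_k]

print(top.tolist())  # [4, 2]

# Share of the total importance carried by the selected features.
share = float(importance[top].sum() / importance.sum())
print(round(share, 3))
```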

#### The trained model can be pickled (here with joblib) and reloaded for inference

In [12]:
joblib.dump(model, 'data/temp_model.pkl')

new_model = joblib.load('data/temp_model.pkl')
new_model.predict(X_test)

array([[-239.43956  , -148.27985  , -281.3192   , ..., -142.55777  ,
        -214.58322  , -235.33083  ],
       [-112.88808  , -116.1815   ,  -64.24356  , ..., -132.47252  ,
        -121.00293  ,  -22.776104 ],
       [ -32.328655 ,  -52.749233 ,  144.76985  , ...,   21.727789 ,
         -20.009361 , -205.19566  ],
       ...,
       [ -72.1823   ,  138.6279   ,   87.44407  , ...,  229.96695  ,
          41.225166 ,   20.167723 ],
       [  -6.853402 ,  141.3769   ,  246.19102  , ...,  153.36969  ,
         179.23767  ,  208.78542  ],
       [ -21.059866 ,   31.169312 ,  168.37772  , ...,   90.69682  ,
          19.636633 ,    2.2348657]], dtype=float32)