## The simplest usage example of py_boost

### Installation (if needed)

**Note**: replace cupy-cuda110 with the CuPy package that matches your CUDA version (cupy-cuda110 is the build for CUDA 11.0).

In [1]:
# !pip install cupy-cuda110 py-boost

### Imports

In [2]:
import os
# Optional: select the GPU to run on
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

os.makedirs('data', exist_ok=True)

import joblib
from sklearn.datasets import make_regression

# simple case - just one class is used
from py_boost import GradientBoosting 

### Generation of dummy regression data

In [3]:
%%time
X, y = make_regression(150000, 100, n_targets=10, random_state=42)
X_test, y_test = X[:50000], y[:50000]
X, y = X[-50000:], y[-50000:]

CPU times: user 2.16 s, sys: 1.39 s, total: 3.55 s
Wall time: 818 ms
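
The two slices above take the first 50,000 rows for testing and the last 50,000 for training, so the 50,000 rows in between are left unused. A quick check on plain row indices (a sketch that uses `range` in place of the real arrays) confirms that the two parts do not overlap:

```python
n_rows = 150_000
rows = range(n_rows)

test_rows = rows[:50_000]     # same slice as X[:50000]
train_rows = rows[-50_000:]   # same slice as X[-50000:]

# The test rows end before the training rows begin, so nothing is shared.
overlap = set(test_rows) & set(train_rows)
unused = n_rows - len(test_rows) - len(train_rows)

print(len(test_rows), len(train_rows), len(overlap), unused)
```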


### Training a GBDT model

The only required argument is the loss function. Together with the shape of the target, it determines the task type. The loss function can be passed as a Loss instance or as a string alias:

* ***'mse'*** for the regression/multitask regression
* ***'msle'*** for the regression/multitask regression
* ***'bce'*** for the binary/multilabel classification
* ***'crossentropy'*** for the multiclass classification
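
To make the difference between the two regression aliases concrete, here is a minimal sketch of the textbook definitions of mean squared error and mean squared logarithmic error in plain Python. The helper names `mse` and `msle` are illustrative, and the `log1p` form of MSLE is an assumption about the usual definition, not a statement about py_boost internals.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error: average of squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def msle(y_true, y_pred):
    """Mean squared logarithmic error (assumed log1p form; needs values > -1)."""
    return sum((math.log1p(t) - math.log1p(p)) ** 2
               for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [1.0, 10.0, 100.0]
y_pred = [2.0, 20.0, 200.0]   # every prediction is off by a factor of two

print(mse(y_true, y_pred))    # dominated by the largest target
print(msle(y_true, y_pred))   # reacts to relative rather than absolute error
```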

Training is done by calling the .fit method. The possible arguments are:

* ***'X'***
* ***'y'***
* ***'sample_weight'***
* ***'eval_sets'***: a list of validation sets, each passed as a dict with the possible keys ['X', 'y', 'sample_weight']. Note: if multiple validation sets are passed, the best model is selected using the last one.
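
As a sketch of the validation set format described above, the snippet below builds two validation sets as dicts. Tiny Python lists stand in for the real arrays, and the names `valid_a` and `valid_b` are hypothetical; only the keys come from the text.

```python
# Each validation set is a dict; 'sample_weight' is optional.
valid_a = {'X': [[0.1, 0.2], [0.3, 0.4]], 'y': [1.0, 2.0]}
valid_b = {'X': [[0.5, 0.6]], 'y': [3.0], 'sample_weight': [2.0]}

eval_sets = [valid_a, valid_b]

# Per the note above, the last set is the one used to pick the best model.
selection_set = eval_sets[-1]

# Guard against typos in the keys before training starts.
allowed_keys = {'X', 'y', 'sample_weight'}
for s in eval_sets:
    assert set(s) <= allowed_keys, f"unexpected keys: {set(s) - allowed_keys}"

# With a model and real arrays, the list would be passed as:
# model.fit(X, y, eval_sets=eval_sets)
```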

#### The example below illustrates how to train a model for a simple regression task.

In [4]:
%%time
model = GradientBoosting('mse')

model.fit(X, y[:, 0], eval_sets=[{'X': X_test, 'y': y_test[:, 0]},])

[13:36:56] Stdout logging level is INFO.
[13:36:56] GDBT train starts. Max iter 100, early stopping rounds 100
[13:36:56] Iter 0; Sample 0, rmse = 173.67502259174105; 
[13:36:56] Iter 10; Sample 0, rmse = 133.19549012617102; 
[13:36:56] Iter 20; Sample 0, rmse = 107.86651480713061; 
[13:36:56] Iter 30; Sample 0, rmse = 90.08260826323315; 
[13:36:56] Iter 40; Sample 0, rmse = 76.44592871706898; 
[13:36:57] Iter 50; Sample 0, rmse = 65.61081640486391; 
[13:36:57] Iter 60; Sample 0, rmse = 56.80203684260939; 
[13:36:57] Iter 70; Sample 0, rmse = 49.57758870899029; 
[13:36:57] Iter 80; Sample 0, rmse = 43.60423015153368; 
[13:36:57] Iter 90; Sample 0, rmse = 38.698271034115564; 
[13:36:57] Iter 99; Sample 0, rmse = 34.99248027394042; 
CPU times: user 4.21 s, sys: 769 ms, total: 4.98 s
Wall time: 4.2 s


<py_boost.gpu.boosting.GradientBoosting at 0x7f993f1453d0>

### Training a GBDT model in a multiregression case

Each built-in loss function has its own default metric, so specifying a metric is optional.
If you need a different evaluation metric, you can pass a Metric instance or use a string alias.

#### Default metrics:

* ***'rmse'*** is the default for the ***'mse'*** loss
* ***'rmsle'*** is the default for the  ***'msle'*** loss
* ***'bce'*** is the default for the ***'bce'*** loss
* ***'crossentropy'*** is the default for the ***'crossentropy'*** loss

#### Non-default metrics:

* ***'r2_score'*** for the regression/multitask regression
* ***'auc'*** for the binary classification
* ***'accuracy'*** for any classification task
* ***'precision'*** for any classification task
* ***'recall'*** for any classification task
* ***'f1'*** for any classification task
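
For reference, the coefficient of determination behind the R² metric can be sketched in a few lines of plain Python. This is the textbook single-target formula under the illustrative name `r2`; how py_boost aggregates it across several targets is not covered here.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]

print(r2(y_true, y_true))                  # 1.0: perfect predictions
print(r2(y_true, [2.5, 2.5, 2.5, 2.5]))    # 0.0: always predicting the mean
```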

It is possible to specify other common GBDT hyperparameters as shown below.

#### The following example demonstrates how to train a model for a multioutput regression task. No extra configuration is needed to switch to the multioutput case: just pass a multidimensional target.

In [5]:
%%time
model = GradientBoosting('mse', 'r2_score',
                         ntrees=1000, lr=.01, verbose=100, es=200, lambda_l2=1,
                         subsample=.8, colsample=.8, min_data_in_leaf=10, min_gain_to_split=0, 
                         max_bin=256, max_depth=6)

model.fit(X, y, eval_sets=[{'X': X_test, 'y': y_test},])

[13:36:57] Stdout logging level is INFO.
[13:36:57] GDBT train starts. Max iter 1000, early stopping rounds 200
[13:36:58] Iter 0; Sample 0, R2_score = 0.008384933719017152; 
[13:37:00] Iter 100; Sample 0, R2_score = 0.5168160162984419; 
[13:37:02] Iter 200; Sample 0, R2_score = 0.7242818472323869; 
[13:37:04] Iter 300; Sample 0, R2_score = 0.8326812183358807; 
[13:37:06] Iter 400; Sample 0, R2_score = 0.8949632615853698; 
[13:37:09] Iter 500; Sample 0, R2_score = 0.9320433721142856; 
[13:37:11] Iter 600; Sample 0, R2_score = 0.9546620597013271; 
[13:37:13] Iter 700; Sample 0, R2_score = 0.9687442621799958; 
[13:37:15] Iter 800; Sample 0, R2_score = 0.9776257612467603; 
[13:37:18] Iter 900; Sample 0, R2_score = 0.9833175063506848; 
[13:37:20] Iter 999; Sample 0, R2_score = 0.987007280794451; 
CPU times: user 21.3 s, sys: 2.61 s, total: 23.9 s
Wall time: 22.7 s


<py_boost.gpu.boosting.GradientBoosting at 0x7f9b2ca75610>
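
The `es=200` setting shows up in the log above as "early stopping rounds 200". The following is a generic sketch of how an early stopping rule of this kind typically works, assuming training halts once the validation score has not improved for `es` consecutive iterations. The function `best_iteration` is hypothetical and is not part of py_boost, whose exact rule may differ.

```python
def best_iteration(scores, es, higher_is_better=True):
    """Return (best_iter, last_iter) for a sequence of validation scores."""
    best_iter = 0
    for i in range(1, len(scores)):
        if higher_is_better:
            improved = scores[i] > scores[best_iter]
        else:
            improved = scores[i] < scores[best_iter]
        if improved:
            best_iter = i
        elif i - best_iter >= es:
            return best_iter, i          # stopped early at iteration i
    return best_iter, len(scores) - 1    # ran through all iterations

# Made-up validation scores (higher is better, as with R2).
scores = [0.10, 0.50, 0.70, 0.69, 0.68, 0.66, 0.90]

print(best_iteration(scores, es=3))    # patience runs out before the late jump
print(best_iteration(scores, es=10))   # enough patience to reach the last score
```

The toy scores show the trade-off: a small `es` saves time but can stop before a late improvement, while a large `es` keeps training longer.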

## Inference

#### Prediction is done by calling the .predict method

In [6]:
%%time
preds = model.predict(X_test)

preds.shape

CPU times: user 969 ms, sys: 470 ms, total: 1.44 s
Wall time: 1.64 s


(50000, 10)

In [7]:
preds

array([[-240.39325  , -146.38701  , -278.4931   , ..., -140.4276   ,
        -214.81644  , -235.54738  ],
       [-112.46493  , -113.979645 ,  -60.654854 , ..., -129.82375  ,
        -119.83982  ,  -21.774942 ],
       [ -40.70634  ,  -63.056435 ,  141.13675  , ...,   11.234179 ,
         -33.281082 , -209.6839   ],
       ...,
       [ -75.78832  ,  133.4913   ,   79.55582  , ...,  223.86246  ,
          35.72327  ,   16.227137 ],
       [  -2.2073328,  144.83105  ,  251.2446   , ...,  155.44762  ,
         180.64246  ,  212.98679  ],
       [ -22.812143 ,   31.183327 ,  166.83586  , ...,   91.17909  ,
          19.619398 ,    2.1586423]], dtype=float32)

#### Predictions for specific iterations can be obtained by calling the .predict_staged method

In [8]:
%%time
preds = model.predict_staged(X_test, iterations=[100, 300, 500])

preds.shape

CPU times: user 292 ms, sys: 242 ms, total: 534 ms
Wall time: 610 ms


(3, 50000, 10)
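
Staged predictions make it easy to see how quality changes with the number of trees. The sketch below computes an RMSE per requested iteration from data laid out like the `(3, 50000, 10)` result above: iterations first, then samples, then outputs. Small nested lists stand in for the real arrays, the numbers are made up for illustration, and pooling the error over all outputs is an assumption.

```python
import math

def rmse(pred_rows, true_rows):
    """Root mean squared error pooled over all samples and outputs."""
    squared = [(p - t) ** 2
               for pred, true in zip(pred_rows, true_rows)
               for p, t in zip(pred, true)]
    return math.sqrt(sum(squared) / len(squared))

y_true = [[1.0, 2.0], [3.0, 4.0]]          # 2 samples, 2 outputs

staged = [                                  # 3 stages x 2 samples x 2 outputs
    [[0.0, 0.0], [0.0, 0.0]],
    [[0.5, 1.0], [1.5, 2.0]],
    [[1.0, 2.0], [3.0, 4.0]],
]
iterations = [100, 300, 500]

# One error value per requested iteration.
errors = {it: rmse(stage, y_true) for it, stage in zip(iterations, staged)}

for it, err in errors.items():
    print(f"iteration {it}: rmse = {err:.4f}")
```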

#### Tree leaf indices for specific iterations can be predicted by calling the .predict_leaves method

In [9]:
%%time
preds = model.predict_leaves(X_test, iterations=[100, 300, 500])

preds.shape

CPU times: user 19 ms, sys: 495 µs, total: 19.5 ms
Wall time: 48.8 ms


(3, 50000, 1)

In [10]:
preds.T[0]

array([[11, 21, 10],
       [54, 46, 24],
       [32, 46, 55],
       ...,
       [54, 53, 10],
       [27, 46, 20],
       [60, 46, 24]], dtype=uint32)
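
The `preds.T[0]` call above turns the `(3, 50000, 1)` array into a `(50000, 3)` one: transposing reverses the axes to `(1, 50000, 3)`, and indexing with `[0]` drops the leading axis. Each row then lists the leaf that one sample reaches at each of the three requested iterations. The sketch below mimics this with plain lists and `zip`, using the first three rows of the output above as stand-in data.

```python
# One list per requested iteration, one leaf index per sample
# (values copied from the first three rows of the output above).
leaves_by_iteration = [
    [11, 54, 32],   # iteration 100
    [21, 46, 46],   # iteration 300
    [10, 24, 55],   # iteration 500
]

# Transpose so that each row describes a single sample.
leaves_by_sample = [list(row) for row in zip(*leaves_by_iteration)]

print(leaves_by_sample)   # [[11, 21, 10], [54, 46, 24], [32, 46, 55]]
```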

#### Feature importances

In [11]:
model.get_feature_importance()

array([  44.,   46.,   48.,   35.,   48.,   54., 5641.,   47.,   29.,
         53.,   45.,   42.,   47.,   44.,   28., 5975., 5428.,   38.,
         33., 5406.,   45.,   40.,   41.,   70.,   34.,   46.,   51.,
         42.,   49.,   54.,   45.,   49.,   46.,   58.,   29.,   42.,
       5990.,   42.,   47.,   40.,   45.,   43.,   59.,   47.,   45.,
         60.,   31.,   41.,   61.,   50.,   40.,   45., 5989.,   37.,
         55.,   63.,   40.,   31.,   44.,   43.,   37.,   47.,   46.,
         50.,   45.,   37.,   45.,   48.,   39.,   65.,   38.,   38.,
         38.,   38.,   25.,   56.,   51.,   41.,   39.,   53.,   65.,
         49.,   43.,   45.,   55.,   35., 5687., 3626.,   46., 5826.,
         54., 6092.,   38.,   43.,   50.,   38.,   36.,   49.,   48.,
         52.], dtype=float32)
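
A common next step is to rank features by importance. The sketch below does this in plain Python on the first nine values of the array above; with the full array, `numpy.argsort` is the usual tool for the same job.

```python
# First nine importance values from the output above (features 0..8).
importances = [44.0, 46.0, 48.0, 35.0, 48.0, 54.0, 5641.0, 47.0, 29.0]

# Sort feature indices from most to least important.
ranked = sorted(range(len(importances)),
                key=lambda i: importances[i],
                reverse=True)

top_k = 3
for idx in ranked[:top_k]:
    print(f"feature {idx}: {importances[idx]:.0f}")
```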

#### The trained model can be saved as a pickle file for inference

In [12]:
joblib.dump(model, 'data/temp_model.pkl')

new_model = joblib.load('data/temp_model.pkl')
new_model.predict(X_test)

array([[-240.39325  , -146.38701  , -278.4931   , ..., -140.4276   ,
        -214.81644  , -235.54738  ],
       [-112.46493  , -113.979645 ,  -60.654854 , ..., -129.82375  ,
        -119.83982  ,  -21.774942 ],
       [ -40.70634  ,  -63.056435 ,  141.13675  , ...,   11.234179 ,
         -33.281082 , -209.6839   ],
       ...,
       [ -75.78832  ,  133.4913   ,   79.55582  , ...,  223.86246  ,
          35.72327  ,   16.227137 ],
       [  -2.2073328,  144.83105  ,  251.2446   , ...,  155.44762  ,
         180.64246  ,  212.98679  ],
       [ -22.812143 ,   31.183327 ,  166.83586  , ...,   91.17909  ,
          19.619398 ,    2.1586423]], dtype=float32)