## The simplest usage example of py_boost

### Installation (if needed)

**Note**: replace cupy-cuda110 with the CuPy package that matches your CUDA version.

In [1]:
# !pip install cupy-cuda110 py-boost
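Before moving on, it can help to confirm that the installation worked. The snippet below is a minimal sketch that uses only the standard library. The helper name `missing_packages` is made up for this example; `cupy` and `py_boost` are the import names of the two packages installed above.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names of the packages installed in the cell above
missing = missing_packages(["cupy", "py_boost"])
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable")
```

This only checks that the modules can be located; it does not verify that a GPU is visible or that the CuPy build matches the installed CUDA version.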

### Imports

In [2]:
import os
# Optional: select the GPU device to run on
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

os.makedirs('../data', exist_ok=True)

import joblib
from sklearn.datasets import make_regression

# simple case - just one class is used
from py_boost import GradientBoosting 

### Generation of dummy regression data

In [3]:
%%time
X, y = make_regression(150000, 100, n_targets=10, random_state=42)
X_test, y_test = X[:50000], y[:50000]
X, y = X[-50000:], y[-50000:]

CPU times: user 2 s, sys: 1.66 s, total: 3.67 s
Wall time: 1.02 s
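The slicing above takes the first 50000 rows as the test set and the last 50000 rows as the training set, so the two sets do not overlap and the middle 50000 rows are left unused. The sketch below reproduces the same slicing pattern on a scaled-down index array (15 rows instead of 150000; these sizes are stand-ins chosen for readability).

```python
import numpy as np

n_rows, n_take = 15, 5             # scaled-down stand-ins for 150000 and 50000
row_ids = np.arange(n_rows)

test_ids = row_ids[:n_take]        # first block -> hold-out set
train_ids = row_ids[-n_take:]      # last block  -> training set

overlap = np.intersect1d(test_ids, train_ids)
unused = np.setdiff1d(row_ids, np.union1d(test_ids, train_ids))

print("test rows:  ", test_ids)
print("train rows: ", train_ids)
print("overlap:    ", overlap.size)
print("unused rows:", unused.size)
```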


### Training a GBDT model

The only required argument is a loss function. Together with the shape of the target, it determines the task type. The loss function can be passed either as a Loss instance or as a string alias:

* ***'mse'*** for regression/multitask regression
* ***'msle'*** for regression/multitask regression
* ***'bce'*** for binary/multilabel classification
* ***'crossentropy'*** for multiclass classification
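The way the loss alias and the target shape combine can be sketched in plain Python. The helper `describe_task` below is hypothetical and is not part of py_boost. It only encodes one reading of the list above, under the assumption that a target with more than one column means a multitask or multilabel problem.

```python
def describe_task(loss_alias, target_shape):
    """Hypothetical helper: name the task implied by a loss alias and a target shape."""
    n_outputs = target_shape[1] if len(target_shape) > 1 else 1
    multi = n_outputs > 1
    if loss_alias in ("mse", "msle"):
        return "multitask regression" if multi else "regression"
    if loss_alias == "bce":
        return "multilabel classification" if multi else "binary classification"
    if loss_alias == "crossentropy":
        return "multiclass classification"
    raise ValueError(f"unknown loss alias: {loss_alias!r}")


print(describe_task("mse", (50000,)))      # 1-D target
print(describe_task("mse", (50000, 10)))   # 2-D target with 10 columns
print(describe_task("bce", (50000, 4)))    # 2-D target with 4 columns
```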

Training is done by calling the .fit method. The possible arguments are the following:

* ***'X'*** 
* ***'y'*** 
* ***'sample_weight'*** 
* ***'eval_sets'***  
Validation sets are passed as a list of dicts with possible keys ['X', 'y', 'sample_weight']. Note: if multiple validation sets are passed, the best model is selected using the last one.
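As an illustration of the expected structure, the sketch below builds a list of two validation sets from small random arrays and checks that each dict uses only the keys listed above. The arrays and their sizes are stand-ins, and treating 'sample_weight' as optional is an assumption based on the wording "possible keys". The call to .fit is left as a comment because it needs a GPU and a model instance.

```python
import numpy as np

rng = np.random.default_rng(0)
X_val_a, y_val_a = rng.normal(size=(8, 3)), rng.normal(size=8)
X_val_b, y_val_b = rng.normal(size=(6, 3)), rng.normal(size=6)

eval_sets = [
    {"X": X_val_a, "y": y_val_a},
    # Assumption: 'sample_weight' may be given per validation set
    {"X": X_val_b, "y": y_val_b, "sample_weight": np.ones(len(y_val_b))},
]

allowed_keys = {"X", "y", "sample_weight"}
for i, eval_set in enumerate(eval_sets):
    unexpected = set(eval_set) - allowed_keys
    rows_match = len(eval_set["X"]) == len(eval_set["y"])
    print(f"set {i}: keys={sorted(eval_set)}, unexpected={sorted(unexpected)}, rows match={rows_match}")

# model.fit(X, y, eval_sets=eval_sets)  # the last dict would be used to select the best model
```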

#### The example below illustrates how to train a simple regression task.

In [5]:
%%time
model = GradientBoosting('mse')

model.fit(X, y[:, 0], eval_sets=[{'X': X_test, 'y': y_test[:, 0]},])

[22:56:31] Stdout logging level is INFO.
[22:56:31] GDBT train starts. Max iter 100, early stopping rounds 100
[22:56:31] Iter 0; Sample 0, rmse = 173.67502273618487; 
[22:56:31] Iter 10; Sample 0, rmse = 133.19549011443806; 
[22:56:31] Iter 20; Sample 0, rmse = 107.8665815634426; 
[22:56:31] Iter 30; Sample 0, rmse = 90.0826883436304; 
[22:56:31] Iter 40; Sample 0, rmse = 76.4457603762527; 
[22:56:32] Iter 50; Sample 0, rmse = 65.61088819266847; 
[22:56:32] Iter 60; Sample 0, rmse = 56.80212578800188; 
[22:56:32] Iter 70; Sample 0, rmse = 49.57775349232627; 
[22:56:32] Iter 80; Sample 0, rmse = 43.60446814035017; 
[22:56:32] Iter 90; Sample 0, rmse = 38.698434828148464; 
[22:56:32] Iter 99; Sample 0, rmse = 34.99265881928977; 
CPU times: user 2.64 s, sys: 25 ms, total: 2.67 s
Wall time: 1.36 s


<py_boost.gpu.boosting.GradientBoosting at 0x7f167c29d910>

### Training a GBDT model in a multiregression case

Each built-in loss function has its own default metric, so specifying a metric is optional.
If you need a different evaluation metric, you can pass a Metric instance or use a string alias.

#### Default metrics:

* ***'rmse'*** is the default for the ***'mse'*** loss
* ***'rmsle'*** is the default for the  ***'msle'*** loss
* ***'bce'*** is the default for the ***'bce'*** loss
* ***'crossentropy'*** is the default for the ***'crossentropy'*** loss

#### Non-default metrics:

* ***'r2'*** for the regression/multitask regression
* ***'auc'*** for the binary classification
* ***'accuracy'*** for any classification task
* ***'precision'*** for any classification task
* ***'recall'*** for any classification task
* ***'f1'*** for any classification task
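For reference, the two regression metrics that appear in the training logs of this tutorial can be written in a few lines of NumPy. This is a sketch of the textbook definitions, not the py_boost implementation; in particular, averaging R2 uniformly over the output columns is an assumption, and the library may aggregate multioutput scores differently.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error over all samples and outputs."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r2(y_true, y_pred):
    """Coefficient of determination, averaged uniformly over output columns."""
    ss_res = np.sum((y_true - y_pred) ** 2, axis=0)
    ss_tot = np.sum((y_true - y_true.mean(axis=0)) ** 2, axis=0)
    return float(np.mean(1.0 - ss_res / ss_tot))


y_true = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
mean_pred = np.tile(y_true.mean(axis=0), (len(y_true), 1))

print("rmse, perfect prediction:", rmse(y_true, y_true))
print("r2, perfect prediction:  ", r2(y_true, y_true))
print("r2, mean prediction:     ", r2(y_true, mean_pred))
```

A perfect prediction gives an R2 of 1, while always predicting the column mean gives 0, which matches the way the R2 score climbs from near 0 towards 1 during training.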

It is possible to specify other common GBDT hyperparameters as shown below.

#### The following example trains a model for a multioutput regression task. No extra configuration is needed to switch to multioutput: just pass a multidimensional target.

In [6]:
%%time
model = GradientBoosting('mse', 'r2_score',
                         ntrees=1000, lr=.01, verbose=100, es=200, lambda_l2=1,
                         subsample=.8, colsample=.8, min_data_in_leaf=10, min_gain_to_split=0, 
                         max_bin=256, max_depth=6)

model.fit(X, y, eval_sets=[{'X': X_test, 'y': y_test},])

[22:56:35] Stdout logging level is INFO.
[22:56:35] GDBT train starts. Max iter 1000, early stopping rounds 200
[22:56:35] Iter 0; Sample 0, R2_score = 0.00838493382121076; 
[22:56:37] Iter 100; Sample 0, R2_score = 0.5168178350036745; 
[22:56:39] Iter 200; Sample 0, R2_score = 0.7242543017807709; 
[22:56:42] Iter 300; Sample 0, R2_score = 0.8327236690102027; 
[22:56:44] Iter 400; Sample 0, R2_score = 0.8949888496073136; 
[22:56:46] Iter 500; Sample 0, R2_score = 0.932081267698251; 
[22:56:48] Iter 600; Sample 0, R2_score = 0.9547251100659679; 
[22:56:50] Iter 700; Sample 0, R2_score = 0.968778829828431; 
[22:56:52] Iter 800; Sample 0, R2_score = 0.9776492835117481; 
[22:56:54] Iter 900; Sample 0, R2_score = 0.9833336498217328; 
[22:56:57] Iter 999; Sample 0, R2_score = 0.9870174981643824; 
CPU times: user 20.4 s, sys: 2.93 s, total: 23.3 s
Wall time: 22.1 s


<py_boost.gpu.boosting.GradientBoosting at 0x7f14597e3af0>

## Inference

#### Predictions are made by calling the .predict method

In [8]:
%%time
preds = model.predict(X_test)

preds.shape

CPU times: user 699 ms, sys: 538 ms, total: 1.24 s
Wall time: 1.26 s


(50000, 10)

In [9]:
preds

array([[-242.54619  , -150.13025  , -284.54285  , ..., -147.102    ,
        -219.06776  , -237.3776   ],
       [-121.74027  , -119.706116 ,  -63.835587 , ..., -133.57527  ,
        -122.1771   ,  -25.195803 ],
       [ -32.62834  ,  -53.10981  ,  147.76602  , ...,   22.266607 ,
         -18.788473 , -206.22401  ],
       ...,
       [ -85.29818  ,  130.21333  ,   85.10849  , ...,  222.686    ,
          30.748068 ,   10.428176 ],
       [  -5.4758997,  142.11945  ,  245.6441   , ...,  152.84828  ,
         177.40526  ,  208.26883  ],
       [ -20.150242 ,   34.07326  ,  165.9649   , ...,   93.3193   ,
          20.594887 ,    3.8033602]], dtype=float32)

#### Predictions for certain iterations can be obtained by calling the .predict_staged method

In [10]:
%%time
preds = model.predict_staged(X_test, iterations=[100, 300, 500])

preds.shape

CPU times: user 366 ms, sys: 257 ms, total: 623 ms
Wall time: 639 ms


(3, 50000, 10)
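The staged output has one leading axis entry per requested iteration, which makes it easy to track how the error changes as trees are added. The sketch below uses small synthetic arrays in place of the real model output: the `staged` array is a stand-in with the same (iterations, samples, targets) layout, built so that the noise shrinks at later stages.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_targets = 5, 2
y_true = rng.normal(size=(n_samples, n_targets))

# Stand-in for the predict_staged output: the same noise, scaled down stage by stage
noise = rng.normal(size=(n_samples, n_targets))
staged = np.stack([y_true + scale * noise for scale in (1.0, 0.5, 0.1)])

# (3, 5, 2) broadcasts against (5, 2); average over samples and targets per stage
rmse_per_stage = np.sqrt(((staged - y_true) ** 2).mean(axis=(1, 2)))

print("staged shape:  ", staged.shape)
print("rmse per stage:", rmse_per_stage)
```

With real predictions, the same one-liner applied to the array returned by .predict_staged and to y_test would give one RMSE value per requested iteration.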

#### Tree leaf indices for certain iterations can be predicted by calling the .predict_leaves method

In [11]:
%%time
preds = model.predict_leaves(X_test, iterations=[100, 300, 500])

preds.shape

CPU times: user 17.4 ms, sys: 0 ns, total: 17.4 ms
Wall time: 16.4 ms


(3, 50000, 1)

In [12]:
preds.T[0]

array([[11, 41, 15],
       [54, 41, 26],
       [32, 41, 17],
       ...,
       [54, 47, 22],
       [27, 41, 21],
       [60, 41, 26]], dtype=int32)
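Leaf indices are useful for inspecting how samples are grouped by a given tree. The sketch below works on a tiny hand-written array with the same (iterations, samples, 1) layout as the output above; the values are copied from the first and last rows shown, and the meaning of the trailing axis of size 1 is not assumed here beyond its shape.

```python
import numpy as np

# Stand-in with the layout (n_iterations, n_samples, 1)
leaves = np.array(
    [
        [[11], [54], [32], [54]],   # tree at the first requested iteration
        [[41], [41], [41], [47]],   # tree at the second requested iteration
        [[15], [26], [17], [22]],   # tree at the third requested iteration
    ],
    dtype=np.int32,
)

# One row per sample, one column per requested iteration (same as leaves.T[0])
per_sample = leaves[..., 0].T

# How many samples fall into each leaf of the first requested tree
leaf_ids, counts = np.unique(per_sample[:, 0], return_counts=True)

print("per-sample shape:", per_sample.shape)
print("leaf ids:        ", leaf_ids)
print("samples per leaf:", counts)
```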

#### Feature importances

In [13]:
model.get_feature_importance()

array([  55.,   40.,   32.,   29.,   70.,   53., 5580.,   39.,   49.,
         62.,   29.,   58.,   44.,   44.,   43., 5988., 5433.,   42.,
         38., 5545.,   46.,   43.,   50.,   75.,   33.,   36.,   47.,
         50.,   46.,   45.,   46.,   51.,   52.,   60.,   37.,   44.,
       6047.,   45.,   47.,   41.,   55.,   39.,   62.,   37.,   39.,
         43.,   45.,   48.,   45.,   52.,   51.,   49., 5913.,   37.,
         51.,   56.,   39.,   25.,   32.,   48.,   57.,   49.,   46.,
         40.,   58.,   33.,   41.,   43.,   35.,   58.,   33.,   44.,
         43.,   38.,   31.,   52.,   55.,   40.,   39.,   50.,   58.,
         44.,   29.,   37.,   61.,   40., 5613., 3633.,   40., 5774.,
         41., 6149.,   36.,   57.,   47.,   48.,   40.,   64.,   36.,
         55.], dtype=float32)
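In the array above, a handful of features have importances in the thousands while the rest stay below 100. A quick way to pull out such features is to sort or threshold the array. The sketch below uses a short stand-in array rather than the real model output, and the threshold of 10 times the median is an arbitrary choice made for this example.

```python
import numpy as np

# Stand-in for model.get_feature_importance(): two dominant features among six
importance = np.array([55.0, 40.0, 5580.0, 29.0, 5988.0, 53.0], dtype=np.float32)

# Feature indices sorted from most to least important
order = np.argsort(importance)[::-1]
top_k = order[:2]

# Alternative: keep every feature far above the typical importance level
dominant = np.flatnonzero(importance > 10 * np.median(importance))

print("ranking:          ", order)
print("top 2 features:   ", top_k)
print("dominant features:", dominant)
```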

#### The trained model can be pickled and saved for inference

In [15]:
joblib.dump(model, '../data/temp_model.pkl')

new_model = joblib.load('../data/temp_model.pkl')
new_model.predict(X_test)

array([[-240.81075  , -148.83202  , -280.99615  , ..., -142.74734  ,
        -214.31436  , -236.81242  ],
       [-108.171936 , -109.24172  ,  -60.211185 , ..., -125.89796  ,
        -117.73771  ,  -18.45293  ],
       [ -35.97747  ,  -56.86327  ,  143.64543  , ...,   17.656034 ,
         -24.660435 , -208.64627  ],
       ...,
       [ -76.424126 ,  131.95796  ,   76.04112  , ...,  221.9448   ,
          33.135143 ,   12.766483 ],
       [  -4.4583893,  142.41664  ,  250.57855  , ...,  153.35132  ,
         178.4664   ,  211.33765  ],
       [ -21.81873  ,   32.88923  ,  168.96144  , ...,   93.069725 ,
          22.013023 ,    3.3425033]], dtype=float32)
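A simple sanity check after saving is to reload the file and compare what comes back with the original. The sketch below shows the round trip with joblib on a plain dictionary holding a NumPy array, used here as a stand-in for the model, and writes to a temporary directory instead of ../data.

```python
import os
import tempfile

import joblib
import numpy as np

# Stand-in for a trained model: any picklable Python object works the same way
artifact = {"weights": np.arange(6, dtype=np.float32).reshape(2, 3)}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "temp_model.pkl")
    joblib.dump(artifact, path)
    restored = joblib.load(path)

same = np.array_equal(artifact["weights"], restored["weights"])
print("round trip preserved the data:", same)
```

For a real model, the analogous check would compare the output of .predict before saving with the output of .predict on the reloaded object.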