## The simplest usage example of py_boost

### Installation (if needed)

**Note**: replace cupy-cuda110 with the CuPy package that matches your CUDA version.

In [13]:
!pip install cupy-cuda110 py-boost

Defaulting to user installation because normal site-packages is not writeable
Collecting cupy-cuda110
  Downloading cupy_cuda110-9.6.0-cp36-cp36m-manylinux1_x86_64.whl (75.4 MB)
Collecting py-boost
  Using cached Py_Boost-0.1.0-py3-none-any.whl (39 kB)
Installing collected packages: py-boost, cupy-cuda110
Successfully installed cupy-cuda110-9.6.0 py-boost-0.1.0


### Imports

In [2]:
import os
# Optional: select the GPU device to run on
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import joblib
from sklearn.datasets import make_regression

# simple case - just one class is used
from py_boost import GradientBoosting 

### Generate dummy regression data

In [3]:
%%time
X, y = make_regression(150000, 100, n_targets=10, random_state=42)
X_test, y_test = X[:50000], y[:50000]
X, y = X[-50000:], y[-50000:]

CPU times: user 2.17 s, sys: 1.41 s, total: 3.58 s
Wall time: 830 ms
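The slicing above holds out the first 50,000 rows as the test set and trains on the last 50,000, so the middle 50,000 rows are not used at all. A minimal sketch with plain index ranges (the names `n_rows`, `test_idx` and `train_idx` are illustrative, not part of py_boost) confirms that the two sets do not overlap:

```python
# Mirror the notebook's slicing on row indices instead of real data.
n_rows = 150_000
rows = range(n_rows)

test_idx = rows[:50_000]     # same slice as X[:50000]
train_idx = rows[-50_000:]   # same slice as X[-50000:]

# The two index sets share no rows, so the test set is a true hold-out.
overlap = set(test_idx) & set(train_idx)
unused = n_rows - len(test_idx) - len(train_idx)

print(len(test_idx), len(train_idx), len(overlap), unused)
```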


### Train GBM

The only required argument is the loss function. The type of task is determined by the loss function and the shape of the target. The loss can be passed as a Loss instance or as a string alias:

* ***'mse'*** for regression/multitask regression
* ***'msle'*** for regression/multitask regression
* ***'bce'*** for binary/multilabel classification
* ***'crossentropy'*** for multiclass classification
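As a quick reference, the alias list above can be written as a lookup table. The helper `pick_loss` and the task names used as keys are hypothetical, introduced only for illustration; only the alias strings come from the list above.

```python
# Map a task description to a loss alias from the list above.
# Note: 'msle' is also listed for regression tasks.
LOSS_BY_TASK = {
    "regression": "mse",
    "multitask_regression": "mse",
    "binary": "bce",
    "multilabel": "bce",
    "multiclass": "crossentropy",
}

def pick_loss(task: str) -> str:
    """Return the loss alias for a task name, or raise on an unknown task."""
    try:
        return LOSS_BY_TASK[task]
    except KeyError:
        raise ValueError(f"unknown task: {task!r}") from None

print(pick_loss("multilabel"))
```

The returned string can then be passed as the first argument of GradientBoosting, as in the training examples in this notebook.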


Training is done by calling the .fit method. Possible arguments:

* ***'X'*** feature matrix
* ***'y'*** target values
* ***'sample_weight'*** optional per-sample weights
* ***'eval_sets'*** validation sets, passed as a list of dicts with possible keys ['X', 'y', 'sample_weight']. Note: if multiple validation sets are passed, the best model is selected using the last one
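A sketch of how the `eval_sets` list might be assembled. The helper `make_eval_set` is hypothetical and not part of py_boost; it only builds the dictionaries described above and omits `sample_weight` when no weights are given. Small lists stand in for real arrays.

```python
def make_eval_set(X, y, sample_weight=None):
    """Build one validation-set dict with the keys described above."""
    eval_set = {"X": X, "y": y}
    if sample_weight is not None:
        eval_set["sample_weight"] = sample_weight
    return eval_set

# Toy stand-ins for real feature matrices and targets.
X_a, y_a = [[0.0, 1.0], [1.0, 0.0]], [0.5, 1.5]
X_b, y_b = [[2.0, 3.0]], [2.5]

eval_sets = [
    make_eval_set(X_a, y_a),
    make_eval_set(X_b, y_b, sample_weight=[2.0]),  # last set: used to select the best model
]

# model.fit(X, y, eval_sets=eval_sets)  # with a py_boost GradientBoosting model
print([sorted(s) for s in eval_sets])
```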

#### The example below shows how to train a simple regression model:

In [4]:
%%time
model = GradientBoosting('mse')

model.fit(X, y[:, 0], eval_sets=[{'X': X_test, 'y': y_test[:, 0]},])

[09:23:52] Stdout logging level is INFO.
[09:23:52] GDBT train starts. Max iter 100, early stopping rounds 100
[09:24:01] Iter 0; Sample 0, rmse = 173.67502181700436; 
[09:24:01] Iter 10; Sample 0, rmse = 133.19548984492013; 
[09:24:01] Iter 20; Sample 0, rmse = 107.86651411759983; 
[09:24:01] Iter 30; Sample 0, rmse = 90.08263834155294; 
[09:24:02] Iter 40; Sample 0, rmse = 76.44572334886129; 
[09:24:02] Iter 50; Sample 0, rmse = 65.61088448487712; 
[09:24:02] Iter 60; Sample 0, rmse = 56.80205311670096; 
[09:24:02] Iter 70; Sample 0, rmse = 49.57775313009075; 
[09:24:02] Iter 80; Sample 0, rmse = 43.60435310801947; 
[09:24:02] Iter 90; Sample 0, rmse = 38.69833114722225; 
[09:24:02] Iter 99; Sample 0, rmse = 34.992585070918345; 
CPU times: user 13.4 s, sys: 1.01 s, total: 14.4 s
Wall time: 13.1 s


<py_boost.gpu.boosting.GradientBoosting at 0x7f04cc436730>

### Train for multiregression

Each built-in loss function has its own default metric, so specifying a metric is optional. To use a different evaluation metric, pass a Metric instance or a string alias:

#### Default metrics:

* ***'rmse'*** is the default for the ***'mse'*** loss
* ***'rmsle'*** is the default for the ***'msle'*** loss
* ***'bce'*** is the default for the ***'bce'*** loss
* ***'crossentropy'*** is the default for the ***'crossentropy'*** loss

#### Non-default metrics:

* ***'r2'*** for regression/multitask regression
* ***'auc'*** for binary classification
* ***'accuracy'*** for any classification task
* ***'precision'*** for any classification task
* ***'recall'*** for any classification task
* ***'f1'*** for any classification task
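For reference, the two regression metrics can be written out for a single target. This sketch assumes the conventional definitions, RMSE = sqrt(mean((y - y_hat)^2)) and R2 = 1 - SS_res / SS_tot. How py_boost aggregates them across several targets is not covered here.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error for one target."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r2(y_true, y_pred):
    """Coefficient of determination for one target."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]

print(rmse(y_true, y_pred))  # 0.5
print(r2(y_true, y_pred))
```

Lower is better for RMSE, while higher is better for R2, with 1.0 meaning a perfect fit and 0.0 matching a constant prediction of the target mean.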


Other common GBDT hyperparameters can also be set, as in the example below.
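The hyperparameters used in the multioutput example in this section can be gathered into a dictionary and unpacked into the constructor. The comments give the usual GBDT meaning of each name. They are a reading of the parameter names and of the training log (which reports a maximum of 1000 iterations and 200 early stopping rounds), not an authoritative description of py_boost internals.

```python
params = {
    "ntrees": 1000,          # maximum number of boosting iterations
    "lr": 0.01,              # learning rate
    "verbose": 100,          # log the validation metric every 100 iterations
    "es": 200,               # early stopping rounds
    "lambda_l2": 1,          # L2 regularization strength
    "subsample": 0.8,        # fraction of rows sampled (assumed meaning)
    "colsample": 0.8,        # fraction of features sampled (assumed meaning)
    "min_data_in_leaf": 10,  # minimum number of samples in a leaf
    "min_gain_to_split": 0,  # minimum gain required to make a split
    "max_bin": 256,          # maximum number of histogram bins per feature
    "max_depth": 6,          # maximum tree depth
}

# With py_boost installed and a GPU available:
# model = GradientBoosting('mse', 'r2_score', **params)
print(len(params))
```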

#### The example below trains a model for a multioutput regression task. No extra configuration is needed to switch the task; just pass a multidimensional target:

In [5]:
%%time
model = GradientBoosting('mse', 'r2_score',
                         ntrees=1000, lr=.01, verbose=100, es=200, lambda_l2=1,
                         subsample=.8, colsample=.8, min_data_in_leaf=10, min_gain_to_split=0, 
                         max_bin=256, max_depth=6)

model.fit(X, y, eval_sets=[{'X': X_test, 'y': y_test},])

[09:24:03] Stdout logging level is INFO.
[09:24:03] GDBT train starts. Max iter 1000, early stopping rounds 200
[09:24:06] Iter 0; Sample 0, R2_score = 0.008384933670233796; 
[09:24:07] Iter 100; Sample 0, R2_score = 0.5168154388837194; 
[09:24:09] Iter 200; Sample 0, R2_score = 0.7242525134774166; 
[09:24:11] Iter 300; Sample 0, R2_score = 0.8327166794367539; 
[09:24:12] Iter 400; Sample 0, R2_score = 0.8949136799796336; 
[09:24:14] Iter 500; Sample 0, R2_score = 0.9319944236361948; 
[09:24:16] Iter 600; Sample 0, R2_score = 0.9546369126361514; 
[09:24:18] Iter 700; Sample 0, R2_score = 0.9686988212185522; 
[09:24:19] Iter 800; Sample 0, R2_score = 0.9775841290463113; 
[09:24:21] Iter 900; Sample 0, R2_score = 0.9832721775452911; 
[09:24:23] Iter 999; Sample 0, R2_score = 0.9869727592306777; 
CPU times: user 20.2 s, sys: 2.04 s, total: 22.2 s
Wall time: 20.7 s


<py_boost.gpu.boosting.GradientBoosting at 0x7f032fe4fd60>
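The log above reports 200 early stopping rounds. A minimal sketch of the conventional early stopping rule follows, assuming training stops once the validation metric has failed to improve for `es` consecutive iterations and the best iteration is kept. The function `best_iteration` is hypothetical, and py_boost's exact rule may differ.

```python
def best_iteration(scores, es, greater_is_better=True):
    """Return (best_iter, stop_iter) for a sequence of validation scores."""
    best_iter, best_score = 0, scores[0]
    for i, score in enumerate(scores):
        improved = score > best_score if greater_is_better else score < best_score
        if improved:
            best_iter, best_score = i, score
        elif i - best_iter >= es:
            return best_iter, i  # no improvement for `es` rounds: stop here
    return best_iter, len(scores) - 1  # ran through all iterations

# R2-like scores that peak at iteration 3 and then degrade.
scores = [0.10, 0.50, 0.70, 0.80, 0.79, 0.78, 0.77, 0.76]
print(best_iteration(scores, es=3))  # (3, 6)
```

In the run above the R2 score was still improving at iteration 999, so under this rule early stopping would not have triggered.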

## Inference

#### Predictions are made by calling the .predict method

In [6]:
%%time
preds = model.predict(X_test)

preds.shape

CPU times: user 1.09 s, sys: 451 ms, total: 1.54 s
Wall time: 1.54 s


(50000, 10)

In [7]:
preds

array([[-241.8923  , -149.8982  , -282.2822  , ..., -146.42816 ,
        -219.73492 , -236.32663 ],
       [-113.9823  , -114.28678 ,  -60.7693  , ..., -128.62047 ,
        -116.691444,  -20.987906],
       [ -38.82068 ,  -60.793297,  144.03748 , ...,   14.923006,
         -28.45954 , -211.73474 ],
       ...,
       [ -80.47606 ,  131.23747 ,   83.49576 , ...,  223.02928 ,
          32.694347,   12.786787],
       [  -7.73693 ,  140.71503 ,  247.16324 , ...,  151.71132 ,
         176.81691 ,  208.22423 ],
       [ -15.201057,   37.500587,  171.29694 , ...,   96.09559 ,
          23.459866,    7.251984]], dtype=float32)

#### Predictions at specific iterations are made by calling the .predict_staged method

In [8]:
%%time
preds = model.predict_staged(X_test, iterations=[100, 300, 500])

preds.shape

CPU times: user 251 ms, sys: 243 ms, total: 494 ms
Wall time: 494 ms


(3, 50000, 10)
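The result has the shape (number of requested iterations, number of samples, number of outputs). A small sketch, using nested lists as a stand-in for the returned array and assuming the stages come back in the order they were requested, pairs each stage with its iteration number. The same `zip` works on a NumPy array because iterating over it runs along the first axis.

```python
iterations = [100, 300, 500]

# Stand-in for model.predict_staged(X_test, iterations=iterations):
# 3 stages x 2 samples x 2 outputs, filled with made-up numbers.
staged = [
    [[0.1, 0.2], [0.3, 0.4]],
    [[0.5, 0.6], [0.7, 0.8]],
    [[0.9, 1.0], [1.1, 1.2]],
]

# Pair each requested iteration with its block of predictions.
preds_at = dict(zip(iterations, staged))

print(preds_at[300])  # [[0.5, 0.6], [0.7, 0.8]]
```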

#### Tree leaf indices at specific iterations can be obtained by calling the .predict_leaves method

In [9]:
%%time
preds = model.predict_leaves(X_test, iterations=[100, 300, 500])

preds.shape

CPU times: user 233 ms, sys: 8.24 ms, total: 241 ms
Wall time: 240 ms


(3, 50000, 1)

In [10]:
preds.T[0]

array([[11, 22,  9],
       [54, 45, 24],
       [32, 45, 55],
       ...,
       [54, 52, 10],
       [27, 45, 21],
       [60, 45, 24]], dtype=uint32)
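The transpose in `preds.T[0]` turns the (3, 50000, 1) array into one row per sample and one column per requested iteration. The shape arithmetic can be checked on a small stand-in array (the values here are arbitrary, not real leaf indices):

```python
import numpy as np

# Stand-in for predict_leaves output: 3 iterations x 5 samples x 1 column.
leaves = np.arange(15, dtype=np.uint32).reshape(3, 5, 1)

# .T reverses the axes to (1, 5, 3); [0] drops the leading axis.
per_sample = leaves.T[0]

print(per_sample.shape)  # (5, 3)
print(per_sample[0])     # entries of sample 0 at each of the 3 iterations
```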

#### Feature importances

In [11]:
model.get_feature_importance()

array([  39.,   48.,   37.,   32.,   51.,   58., 5621.,   50.,   34.,
         63.,   35.,   45.,   45.,   55.,   38., 6062., 5546.,   43.,
         39., 5566.,   38.,   37.,   46.,   67.,   37.,   38.,   44.,
         51.,   43.,   41.,   57.,   53.,   53.,   59.,   37.,   53.,
       6064.,   48.,   39.,   46.,   54.,   51.,   64.,   35.,   60.,
         54.,   41.,   57.,   49.,   44.,   41.,   47., 5758.,   34.,
         51.,   70.,   35.,   30.,   41.,   36.,   61.,   36.,   41.,
         51.,   52.,   38.,   48.,   43.,   41.,   69.,   54.,   40.,
         45.,   40.,   46.,   60.,   40.,   39.,   43.,   50.,   62.,
         46.,   36.,   41.,   55.,   40., 5634., 3415.,   48., 5874.,
         51., 6055.,   33.,   50.,   47.,   44.,   31.,   54.,   33.,
         62.], dtype=float32)
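In the array above, ten features have importances in the thousands while the rest stay below 100, which is consistent with `make_regression` generating 10 informative features by default. A sketch for ranking features by importance follows, using a short made-up list in place of the real array. The helper `top_features` is hypothetical.

```python
def top_features(importances, k):
    """Return the indices of the k largest importances, largest first."""
    order = sorted(range(len(importances)), key=lambda i: importances[i], reverse=True)
    return order[:k]

# Made-up importances; with the real model use model.get_feature_importance().
importances = [39.0, 48.0, 5621.0, 32.0, 6062.0, 58.0]

print(top_features(importances, 2))  # [4, 2]
```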

#### The trained model can be saved as a pickle for inference

In [12]:
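# Make sure the target folder exists before saving
os.makedirs('data', exist_ok=True)
joblib.dump(model, 'data/temp_model.pkl')

# Load the model back and check that it still predicts
new_model = joblib.load('data/temp_model.pkl')
new_model.predict(X_test)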

array([[-241.8923  , -149.8982  , -282.2822  , ..., -146.42816 ,
        -219.73492 , -236.32663 ],
       [-113.9823  , -114.28678 ,  -60.7693  , ..., -128.62047 ,
        -116.691444,  -20.987906],
       [ -38.82068 ,  -60.793297,  144.03748 , ...,   14.923006,
         -28.45954 , -211.73474 ],
       ...,
       [ -80.47606 ,  131.23747 ,   83.49576 , ...,  223.02928 ,
          32.694347,   12.786787],
       [  -7.73693 ,  140.71503 ,  247.16324 , ...,  151.71132 ,
         176.81691 ,  208.22423 ],
       [ -15.201057,   37.500587,  171.29694 , ...,   96.09559 ,
          23.459866,    7.251984]], dtype=float32)
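A sketch of the save and load round trip that also creates the target directory first, since writing to `data/temp_model.pkl` fails if `data/` does not exist. A plain dictionary stands in for the trained model, and the standard-library `pickle` module is used here instead of `joblib` so the snippet runs without extra dependencies. A temporary directory keeps it from touching real files.

```python
import os
import pickle
import tempfile

stand_in_model = {"loss": "mse", "ntrees": 1000}  # placeholder for a trained model

with tempfile.TemporaryDirectory() as tmp:
    model_dir = os.path.join(tmp, "data")
    os.makedirs(model_dir, exist_ok=True)  # make sure the folder exists
    path = os.path.join(model_dir, "temp_model.pkl")

    with open(path, "wb") as f:
        pickle.dump(stand_in_model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == stand_in_model)  # True
```

With the real model, comparing `new_model.predict(X_test)` against the original predictions, as the identical output above suggests, is a simple way to confirm the round trip.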