## Example of cluster tree usage

### Installation (if needed)

**Note**: replace `cupy-cuda110` with the CuPy package that matches your CUDA version.

In [1]:
# !pip install cupy-cuda110

In [2]:
# !pip uninstall py-boost
# !cd ..; bash build_package.sh
# !pip install ../dist/Py_Boost-0.1.8-py3-none-any.whl

### Imports

In [3]:
import os
# Optional: set the device to run
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

os.makedirs('../data', exist_ok=True)

import joblib
import numpy as np
from sklearn.datasets import make_regression

# simple case - just one class is used
from py_boost import GradientBoosting

### Generation of dummy regression data

In [4]:
%%time
X, y = make_regression(150000, 100, n_targets=1, random_state=42)
X_test, y_test = X[:50000], y[:50000]
X, y = X[-50000:], y[-50000:]

CPU times: user 2.07 s, sys: 1.34 s, total: 3.41 s
Wall time: 782 ms


### Training a GBDT model

The only required argument is the loss function. Together with the shape of the target, it determines the task type. The loss can be passed as a Loss instance or as a string alias:

* ***'mse'*** for regression/multitask regression
* ***'msle'*** for regression/multitask regression
* ***'bce'*** for binary/multilabel classification
* ***'crossentropy'*** for multiclass classification

Training is done by calling the .fit method. The possible arguments are the following:

* ***'X'*** 
* ***'y'*** 
* ***'sample_weight'*** 
* ***'eval_sets'***: validation sets, passed as a list of dicts with possible keys ['X', 'y', 'sample_weight']. Note: if multiple validation sets are passed, the best model is selected using the last one.
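
A validation list of the shape described above can be sketched with plain NumPy. The arrays and their names (`X_val_a`, `X_val_b`, and so on) are hypothetical toy data, not part of this example's dataset, and the final `fit` call is left commented out because it needs py_boost and a GPU.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy validation splits (assumed shapes, for illustration only)
X_val_a, y_val_a = rng.normal(size=(200, 5)), rng.normal(size=200)
X_val_b, y_val_b = rng.normal(size=(100, 5)), rng.normal(size=100)

# Each validation set is a dict; 'sample_weight' is optional
eval_sets = [
    {'X': X_val_a, 'y': y_val_a},
    {'X': X_val_b, 'y': y_val_b, 'sample_weight': np.ones(100)},
]

# Basic sanity checks before handing the list to .fit
allowed_keys = {'X', 'y', 'sample_weight'}
for es in eval_sets:
    assert set(es) <= allowed_keys
    assert len(es['X']) == len(es['y'])

# model.fit(X, y, eval_sets=eval_sets)  # requires py_boost and a GPU
```

Because the last set is the one used for model selection, the split you trust most should go last in the list.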

#### The example below illustrates how to train a simple regression task.

In [5]:
%%time
model = GradientBoosting('mse', ntrees=1000, verbose=100)

model.fit(X, y, eval_sets=[{'X': X_test, 'y': y_test},])

[19:52:21] Stdout logging level is INFO.
[19:52:21] GDBT train starts. Max iter 1000, early stopping rounds 100
[19:52:21] Iter 0; Sample 0, rmse = 175.0854022376131; 
[19:52:22] Iter 100; Sample 0, rmse = 34.70492128150738; 
[19:52:23] Iter 200; Sample 0, rmse = 18.875099554835995; 
[19:52:23] Iter 300; Sample 0, rmse = 15.978998689640951; 
[19:52:24] Iter 400; Sample 0, rmse = 15.373219194770174; 
[19:52:25] Iter 500; Sample 0, rmse = 15.218443982066903; 
[19:52:26] Iter 600; Sample 0, rmse = 15.164208733705237; 
[19:52:27] Iter 700; Sample 0, rmse = 15.094296333644555; 
[19:52:28] Iter 800; Sample 0, rmse = 15.02916963023134; 
[19:52:29] Iter 900; Sample 0, rmse = 14.965925240030174; 
[19:52:29] Iter 999; Sample 0, rmse = 14.916092422911172; 
CPU times: user 11.8 s, sys: 908 ms, total: 12.7 s
Wall time: 11.4 s


<py_boost.gpu.boosting.GradientBoosting at 0x7f24280be130>

Next, we make predictions at several boosting stages and then try to cluster the resulting error paths.

In [6]:
%%time
iters_to_pred = np.arange(45).cumsum()

test_staged_pred = model.predict_staged(X_test, iterations=iters_to_pred)
# shape - (iterations, objects, output_dim)
test_staged_pred.shape

CPU times: user 902 ms, sys: 358 ms, total: 1.26 s
Wall time: 1.28 s


(45, 50000, 1)
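
To see which stages `np.arange(45).cumsum()` selects, the short check below prints the first few indices and the last one. It uses only NumPy. The comparison against 1000 assumes the `ntrees=1000` setting from the training cell.

```python
import numpy as np

iters_to_pred = np.arange(45).cumsum()

# The indices are triangular numbers: dense at early stages, sparse later
print(iters_to_pred[:6].tolist())           # [0, 1, 3, 6, 10, 15]
print(int(iters_to_pred[-1]))               # 990
print(np.diff(iters_to_pred)[:5].tolist())  # [1, 2, 3, 4, 5]

# All requested stages stay below the number of trained trees (assumed 1000)
assert iters_to_pred.max() < 1000
```

The growing gaps put more sampled stages at the start of training, where the logged rmse changes fastest.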

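Calculate the squared error of each object at every stage. After the transpose, each row is the error path of one object: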

In [7]:
error_paths = ((y_test[np.newaxis, :] - test_staged_pred[..., 0]) ** 2).T
error_paths[0]

array([7.76091614e+03, 7.91277821e+03, 6.88692060e+03, 6.62394368e+03,
       5.07055708e+03, 2.98405530e+03, 1.90369454e+03, 1.42643322e+03,
       1.02466615e+03, 5.42771424e+02, 2.73757809e+02, 1.03261732e+02,
       7.15174639e+00, 5.33720132e+00, 2.04350297e+01, 1.29327882e+01,
       9.64879681e+00, 1.40339937e+01, 9.99056914e+00, 1.22497631e+01,
       1.40067405e+01, 1.84562225e+01, 1.60254962e+01, 1.60130375e+01,
       1.84175665e+01, 1.70985937e+01, 1.78986452e+01, 1.90121710e+01,
       2.24468675e+01, 2.29831233e+01, 2.36873280e+01, 2.51048279e+01,
       2.08870983e+01, 1.90735637e+01, 1.98192184e+01, 1.88467989e+01,
       1.73784870e+01, 2.01083576e+01, 2.08514782e+01, 2.05896238e+01,
       2.08361521e+01, 2.09831657e+01, 1.89429070e+01, 1.98022395e+01,
       2.25275469e+01])
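
The broadcasting in the error computation can be checked on a tiny hand-made case. The arrays below are hypothetical toy values with the same layout as `test_staged_pred`, that is (iterations, objects, output_dim).

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])  # 4 objects

# Three hypothetical stages: all zeros, half the target, the exact target
staged = np.stack([
    np.zeros_like(y_true),
    0.5 * y_true,
    y_true.copy(),
])[..., np.newaxis]                       # shape (3, 4, 1)

# Same expression as in the cell above: (1, objects) - (iterations, objects)
error_paths = ((y_true[np.newaxis, :] - staged[..., 0]) ** 2).T

print(error_paths.shape)        # (4, 3)
print(error_paths[1].tolist())  # [4.0, 1.0, 0.0]
```

Each row is one object and each column one stage, so `error_paths[1]` traces how the squared error of the second object shrinks as more trees are used.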

Create cluster tree

In [8]:
from py_boost.gpu.cluster_tree import ClusterCandidates

clustering = ClusterCandidates(depth_range=list(range(1, 7)), min_data_in_leaf=100)

In [9]:
%%time
clustering.fit(X_test, error_paths)

CPU times: user 1.59 s, sys: 36 ms, total: 1.63 s
Wall time: 281 ms


<py_boost.gpu.cluster_tree.ClusterCandidates at 0x7f21881e9a90>

In [10]:
cluster_candidates = clustering.predict(X_test)
# shape - (n_objects, n_candidates)
cluster_candidates.shape

(50000, 6)

In [11]:
cluster_candidates

array([[0, 2, 3, 4, 6, 9],
       [0, 2, 3, 4, 6, 9],
       [0, 2, 3, 4, 6, 9],
       ...,
       [0, 2, 2, 3, 5, 8],
       [0, 2, 2, 3, 4, 6],
       [0, 2, 3, 4, 6, 9]], dtype=uint32)

In [12]:
cluster_candidates.max(axis=0)

array([ 1,  2,  3,  5,  7, 10], dtype=uint32)
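
A label matrix like this can be summarized per candidate with NumPy alone. The sketch below assumes that each column is one candidate clustering (presumably one per value in `depth_range`) and that labels are non-negative integers. The array is a hypothetical toy example: three rows are taken from the output above, and the last row is made up for illustration.

```python
import numpy as np

cluster_candidates = np.array([
    [0, 2, 3, 4, 6, 9],
    [0, 2, 2, 3, 5, 8],
    [0, 2, 2, 3, 4, 6],
    [1, 0, 0, 0, 0, 0],  # hypothetical row, added for illustration
], dtype=np.uint32)

# Number of distinct cluster labels in each candidate (column)
n_clusters = [len(np.unique(col)) for col in cluster_candidates.T]
print(n_clusters)  # [2, 2, 3, 3, 4, 4]

# Cluster sizes for one candidate, here the third column
sizes = np.bincount(cluster_candidates[:, 2].astype(np.int64))
print(sizes.tolist())  # [1, 0, 2, 1]
```

On the real output, counts like these show how finely each candidate partitions the test set, which helps when deciding which candidate to keep.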