# **High Performance Machine Learning**
### Demo of Hummingbird and Treelite


### Install external libraries
First we will install the Hummingbird libraries together with the ONNX Runtime libraries.
Next we will set up the Treelite libraries. We want to make sure the environment contains them.


In [1]:
!pip install --user hummingbird_ml[extra,onnx]
!pip install --user treelite treelite_runtime

Collecting hummingbird_ml[extra,onnx]
  Downloading hummingbird_ml-0.0.6-py2.py3-none-any.whl (77 kB)
Collecting onnxconverter-common>=1.6.0
  Downloading onnxconverter_common-1.7.0-py2.py3-none-any.whl (64 kB)
Collecting onnxruntime>=1.0.0; extra == "onnx"
  Downloading onnxruntime-1.5.1-cp37-cp37m-manylinux2014_x86_64.whl (3.8 MB)
Collecting onnxmltools>=1.6.0; extra == "onnx"
  Downloading onnxmltools-1.7.0-py2.py3-none-any.whl (252 kB)
Collecting keras2onnx
  Downloading keras2onnx-1.7.0-py3-none-any.whl (96 kB)
Collecting skl2onnx
  Downloading skl2onnx-1.7.0-py2.py3-none-any.whl (191 kB)

In [2]:
# Default imports needed to run the code
import torch

import numpy as np

#Gradient Boosted Tree libraries (XGBoost, LightGBM)
import xgboost as xgb
import lightgbm as lgb

#Treelite Imports
import treelite
import treelite_runtime     # runtime module

#Onnx Runtime libraries
import onnxruntime as ort
from onnxmltools.convert import convert_lightgbm
from onnxconverter_common.data_types import FloatTensorType

#Hummingbird libraries
from hummingbird.ml import convert
from hummingbird.ml import constants

# To measure the run speeds
from timeit import Timer

We will start by generating a random dataset for binary classification before moving on to real-life datasets.
This demonstrates the capabilities of the various libraries and allows a comparison among the approaches.
Note that we are working with NumPy arrays, rather than the Pandas DataFrames that you will encounter in real life.

In [3]:
# Create some random data for binary classification.
num_classes = 2
X = np.array(np.random.rand(10000, 28), dtype=np.float32)
y = np.random.randint(num_classes, size=10000)

Create a function that runs a statement several times to get a more accurate timing measurement.

In [4]:
def speed(inst, number=5, repeat=2):
    timer = Timer(inst, globals=globals())
    raw = np.array(timer.repeat(repeat, number=number))
    ave = raw.sum() / len(raw) / number
    mi, ma = raw.min() / number, raw.max() / number
    print("Average %1.3g Min=%1.3g Max=%1.3g" % (ave, mi, ma))
    return ave

### Create and train a model starting with LightGBM


In [5]:
model = lgb.LGBMClassifier()
model.fit(X, y)

LGBMClassifier()

In [6]:
# Use ONNXMLTools to convert the model to ONNX-ML.
# You can adjust the input shape to suit your needs, depending on whether the model will do batch or real-time inference.
# The shape declared here is stored in the saved file, as you will see later on.
initial_types = [("input", FloatTensorType([X.shape[0], X.shape[1]]))]  # Define the inputs for the ONNX model

#ONNX Model 
onnx_ml_model = convert_lightgbm(
    model, initial_types=initial_types, target_opset=9
)

The ONNX model:
![../input/lgbmonnx/lgbm_onnx.onnx.png](../input/lgbmonnx/lgbm_onnx.onnx.png)

In [7]:
onnx_ml_model

ir_version: 4
producer_name: "OnnxMLTools"
producer_version: "1.7.0"
domain: "onnxconverter-common"
model_version: 0
doc_string: ""
graph {
  node {
    input: "input"
    output: "label_tensor"
    output: "probability_tensor"
    name: "LgbmClassifier"
    op_type: "TreeEnsembleClassifier"
    attribute {
      name: "class_ids"
      ints: 0
      ints: 0
      ints: 0
      ints: 0
      ints: 0
      ints: 0
      ints: 0
      ints: 0
      ints: 0
      ints: 0
      ints: 0
      ints: 0
      ints: 0
      ints: 0
      ints: 0
      ints: 0
      ints: 0
      ints: 0
      ints: 0
      ints: 0
      ints: 0
      ints: 0
      ints: 0
      ints: 0
      ints: 0
      ints: 0
      ints: 0
      ints: 0
      ints: 0
      ints: 0
      ints: 0
      ints: 0
      ints: 0
      ints: 0
      ints: 0
      ints: 0
      ints: 0
      ints: 0
      ints: 0
      ints: 0
      ints: 0
      ints: 0
      ints: 0
      ints: 0
      ints: 0
      ints: 0
      ints: 0
      int

In [8]:
with open("lgbm_onnx.onnx", "wb") as f:
    f.write(onnx_ml_model.SerializeToString())

In [9]:
# Use Hummingbird to convert the ONNXML model to ONNX.
onnx_model = convert(onnx_ml_model, "onnx", X)

Let's run a test to see how long the code takes to run.
At the time this notebook was written, I got:
> Average 1.95 Min=1.92 Max=1.97

In [10]:
speed("onnx_model.predict(X)")

Average 1.77 Min=1.68 Max=1.85


1.769425600799997

### Reload the serialized ONNX file
We serialized the model earlier, so we can load it again and run it in an inference session.
Let's create the inference session.

In [11]:
sess = ort.InferenceSession("lgbm_onnx.onnx")

print("input name='{}' and shape={}".format(
    sess.get_inputs()[0].name, sess.get_inputs()[0].shape))
print("output name='{}' and shape={}".format(
    sess.get_outputs()[0].name, sess.get_outputs()[0].shape))

input name='input' and shape=[10000, 28]
output name='label' and shape=[1]


In [12]:
input_name = sess.get_inputs()[0].name
label_name = sess.get_outputs()[0].name

pred_onx = sess.run([label_name], {input_name: X.astype(np.float32)})[0]

### Speed tests for the inference session and the LightGBM baseline

In [13]:
sess.run([label_name], {input_name: X.astype(np.float32)})[0]

array([1, 0, 0, ..., 1, 1, 0], dtype=int64)

In [14]:
speed("model.predict(X)")

Average 0.0536 Min=0.0533 Max=0.0538


0.05355548790000171

### Let's move over to the Hummingbird library with the default GEMM implementation

#### GEneric Matrix Multiplication (GEMM) Algorithm
The evaluation of a tree is done as a series of three GEneric Matrix Multiplication (GEMM) operations interleaved with two element-wise logical operations.

Given a tree, we create five tensors which collectively capture the tree structure: A, B, C, D, and E.
1. **A** captures the relationship between input features and internal nodes.
2. **B** is set to the threshold value of each internal node.
3. For any leaf node and internal node pair, **C** captures whether the internal node is an ancestor of that leaf node, and if so, whether the leaf is in its left or right sub-tree.
4. **D** captures the count of the internal nodes in the path from a leaf node to the tree root for which the path continues into the left sub-tree.
5. Finally, **E** captures the mapping between leaf nodes and the class labels.
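
To make the five tensors concrete, here is a minimal NumPy sketch of the GEMM strategy on a hand-built toy tree. The tree itself, the class assignments, and the names `X_toy` and `gemm_tree_predict` are assumptions made for illustration. Hummingbird builds equivalent tensors internally in PyTorch, and its actual layout may differ.

```python
import numpy as np

# Toy tree (an assumption for illustration):
#   node0: x0 < 0.5 ? go to node1 : leaf2
#   node1: x1 < 0.3 ? leaf0 : leaf1
# Classes: leaf0 -> 0, leaf1 -> 1, leaf2 -> 1

# A (features x internal nodes): which feature each internal node tests.
A = np.array([[1, 0],
              [0, 1]], dtype=np.float32)
# B (internal nodes): threshold of each internal node.
B = np.array([0.5, 0.3], dtype=np.float32)
# C (internal nodes x leaves): +1 if the leaf is in the node's left sub-tree,
# -1 if it is in the right sub-tree, 0 if the node is not an ancestor.
C = np.array([[1, 1, -1],
              [1, -1, 0]], dtype=np.float32)
# D (leaves): number of ancestors for which the leaf lies in the left sub-tree.
D = np.array([2, 1, 0], dtype=np.float32)
# E (leaves x classes): one-hot mapping from leaf to class label.
E = np.array([[1, 0],
              [0, 1],
              [0, 1]], dtype=np.float32)

def gemm_tree_predict(X):
    T = X @ A                          # GEMM 1: pick the tested feature per node
    T = (T < B).astype(np.float32)     # logical op 1: evaluate every node condition
    T = T @ C                          # GEMM 2: combine conditions along each path
    T = (T == D).astype(np.float32)    # logical op 2: select the leaf that was reached
    T = T @ E                          # GEMM 3: map the leaf to its class
    return T.argmax(axis=1)

X_toy = np.array([[0.2, 0.1],
                  [0.2, 0.9],
                  [0.8, 0.1],
                  [0.8, 0.9]], dtype=np.float32)
toy_pred = gemm_tree_predict(X_toy)
print(toy_pred)  # [0 1 1 1]
```

Every node condition is evaluated for every row, which is what makes this formulation a good fit for dense matrix hardware but redundant for deep trees.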

In [15]:
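# Use Hummingbird to convert the ONNXML model to PyTorch with the GEMM strategy.
onnx_model = convert(onnx_ml_model, "pytorch", X, extra_config={"tree_implementation":"gemm"})

#Speed test with GPU
speed("onnx_model.to('cuda');onnx_model.predict(X)")

#Speed test labelled CPU only
# Note: the model was moved with .to('cuda') above and is not moved back here,
# so this timing may still run on the GPU; move it back first for a true CPU figure.
speed("onnx_model.predict(X)")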

Average 0.557 Min=0.00467 Max=1.11
Average 0.00458 Min=0.00457 Max=0.0046


0.00458139729999516

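At the time of the run, GPU performance averaged 0.62 seconds while the CPU averaged 0.00461 seconds. The GPU average is dominated by one slow call (Max=1.23 versus Min=0.00469), which is expected because there is a large overhead in transferring the model and data to and from the GPU.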

> Average 0.62 Min=0.00469 Max=1.23

> Average 0.00461 Min=0.00458 Max=0.00464
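
Because a single slow call can dominate the average, it can help to discard a warm-up call before timing. The sketch below is a hypothetical variant of the `speed` helper, named `speed_warm` here. It is not part of the original notebook and takes a zero-argument callable instead of a statement string.

```python
from timeit import Timer

def speed_warm(fn, warmup=1, number=5, repeat=2):
    """Time a zero-argument callable after discarding `warmup` initial calls."""
    # Warm-up calls absorb one-off costs (device transfer, lazy initialisation).
    for _ in range(warmup):
        fn()
    raw = Timer(fn).repeat(repeat, number=number)
    per_call = [t / number for t in raw]
    ave = sum(per_call) / len(per_call)
    print("Average %1.3g Min=%1.3g Max=%1.3g" % (ave, min(per_call), max(per_call)))
    return ave

# Stand-in workload so the sketch runs on its own.
calls = []
ave_time = speed_warm(lambda: calls.append(sum(range(1000))))
```

In the notebook this could be called as `speed_warm(lambda: onnx_model.predict(X))`, which reports steady-state latency separately from the first-call cost.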

### Tree Traversal Algorithm
Let's try another algorithm, which improves on the logic of GEMM. The GEMM strategy has a high degree of computational redundancy: it evaluates all internal nodes and leaf nodes when only a few of them actually need to be evaluated. The tree traversal algorithm tries to reduce this redundancy by mimicking the typical tree traversal, but implemented using tensor operations.
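
As a rough illustration of the idea, the NumPy sketch below walks a whole batch down a toy tree using only indexing operations. The node layout, the array names, and the trick of making leaves point to themselves are assumptions for this example, not a description of Hummingbird's internal tensors.

```python
import numpy as np

# Toy tree stored as flat arrays (an assumption for illustration):
#   node0: x0 < 0.5 ? node1 : node4
#   node1: x1 < 0.3 ? node2 : node3
#   node2, node3, node4 are leaves with classes 0, 1, 1.
# Leaves point to themselves so extra iterations leave the index unchanged.
left       = np.array([1, 2, 2, 3, 4])
right      = np.array([4, 3, 2, 3, 4])
feature    = np.array([0, 1, 0, 0, 0])
threshold  = np.array([0.5, 0.3, 0.0, 0.0, 0.0], dtype=np.float32)
leaf_class = np.array([-1, -1, 0, 1, 1])   # -1 marks internal nodes
max_depth  = 2

def tree_trav_predict(X):
    rows = np.arange(X.shape[0])
    idx = np.zeros(X.shape[0], dtype=np.int64)       # every row starts at the root
    for _ in range(max_depth):
        # Gather the feature value and threshold of each row's current node.
        go_left = X[rows, feature[idx]] < threshold[idx]
        # Move every row to its left or right child in one vectorised step.
        idx = np.where(go_left, left[idx], right[idx])
    return leaf_class[idx]

X_toy = np.array([[0.2, 0.1],
                  [0.2, 0.9],
                  [0.8, 0.1],
                  [0.8, 0.9]], dtype=np.float32)
trav_pred = tree_trav_predict(X_toy)
print(trav_pred)  # [0 1 1 1]
```

Each row only touches the nodes on its own path, so the work grows with tree depth rather than with the total number of nodes.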


In [16]:
# Use Hummingbird to convert the ONNXML model to PyTorch with the tree traversal strategy.
onnx_model = convert(onnx_ml_model, "pytorch", X, extra_config={"tree_implementation":"tree_trav"})

#Speed test with GPU
speed("onnx_model.to('cuda');onnx_model.predict(X)")

#Speed test with CPU only
speed("onnx_model.predict(X)")

Average 0.0075 Min=0.00546 Max=0.00954
Average 0.00548 Min=0.00537 Max=0.00558


0.0054752662999987935

At the time of the run, GPU performance averaged 0.007 seconds while the CPU averaged 0.00542 seconds. The GPU run is significantly faster than with the GEMM algorithm.

> Average 0.007 Min=0.00544 Max=0.00855

> Average 0.00542 Min=0.00536 Max=0.00548


### Perfect Tree Traversal Algorithm

Similar to the tree traversal algorithm, this strategy also mimics the tree traversal. However, here we assume the tree is a *perfect binary tree*. In a perfect binary tree, all internal nodes have exactly two children and all leaf nodes are at the same depth level.
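
One consequence of the perfect-tree assumption is that child positions can be computed arithmetically instead of being looked up. The sketch below stores a toy tree in level order, so the children of node `i` sit at `2*i + 1` and `2*i + 2`. The padding node and all names are assumptions for illustration, and Hummingbird's own indexing scheme may differ.

```python
import numpy as np

# Toy perfect binary tree of depth 2, stored in level order (an assumption):
#   node0: x0 < 0.5
#   node1: x1 < 0.3        (left child of node0)
#   node2: x0 < 0.0        (padding node; both of its leaves carry class 1)
# The four leaves, left to right, carry classes 0, 1, 1, 1.
depth      = 2
feature    = np.array([0, 1, 0])
threshold  = np.array([0.5, 0.3, 0.0], dtype=np.float32)
leaf_class = np.array([0, 1, 1, 1])

def perfect_tree_predict(X):
    rows = np.arange(X.shape[0])
    idx = np.zeros(X.shape[0], dtype=np.int64)        # start at the root
    for _ in range(depth):
        go_left = X[rows, feature[idx]] < threshold[idx]
        # Level-order layout: left child is 2*i + 1, right child is 2*i + 2.
        idx = 2 * idx + np.where(go_left, 1, 2)
    # After `depth` steps idx points past the internal nodes; shift to a leaf index.
    return leaf_class[idx - (2 ** depth - 1)]

X_toy = np.array([[0.2, 0.1],
                  [0.2, 0.9],
                  [0.8, 0.1],
                  [0.8, 0.9]], dtype=np.float32)
perf_pred = perfect_tree_predict(X_toy)
print(perf_pred)  # [0 1 1 1]
```

Dropping the child-index lookups removes two gathers per level, at the cost of padding shallow branches until the tree is perfect.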

In [17]:
# Use Hummingbird to convert the ONNXML model to PyTorch with the perfect tree traversal strategy.
onnx_model = convert(onnx_ml_model, "pytorch", X, extra_config={"tree_implementation":"perf_tree_trav"})
speed("onnx_model.to('cuda');onnx_model.predict(X)")
speed("onnx_model.predict(X)")

Average 0.00774 Min=0.00472 Max=0.0108
Average 0.00447 Min=0.00441 Max=0.00453


0.0044690769999988335

At the time of the run, GPU performance averaged 0.00785 seconds while the CPU averaged 0.00451 seconds, about a 16% reduction in CPU inference time compared with the tree traversal algorithm (0.00542 seconds).


In [18]:
model.booster_.save_model('lgbm_Classifier.txt')

<lightgbm.basic.Booster at 0x7f57b11ddad0>

In [19]:
# Let's load the saved model in Treelite

trl_model = treelite.Model.load('lgbm_Classifier.txt', model_format='lightgbm')

[07:16:13] /workspace/src/frontend/lightgbm.cc:544: model.num_tree = 100


In [20]:
toolchain = 'gcc'
trl_model.export_lib(toolchain=toolchain, libpath='./lgbm_numpy.so', verbose=True)

[07:16:13] /workspace/src/compiler/ast_native.cc:44: Using ASTNativeCompiler
[07:16:13] /workspace/src/compiler/ast/split.cc:24: Parallel compilation disabled; all member trees will be dumped to a single source file. This may increase compilation time and memory usage.
[07:16:14] /workspace/src/c_api/c_api.cc:286: Code generation finished. Writing code to files...
[07:16:14] /workspace/src/c_api/c_api.cc:291: Writing file main.c...
[07:16:14] /workspace/src/c_api/c_api.cc:291: Writing file recipe.json...
[07:16:14] /workspace/src/c_api/c_api.cc:291: Writing file header.h...

[07:16:14] /root/.local/lib/python3.7/site-packages/treelite/contrib/util.py:104: Compiling sources files in directory ./tmph4s2qbnc into object files (*.o)...
[07:16:25] /root/.local/lib/python3.7/site-packages/treelite/contrib/util.py:133: Generating dynamic shared library ./tmph4s2qbnc/predictor.so...
[07:16:25] /root/.local/lib/python3.7/site-packages/treelite/contrib/__init__.py:278: Generated shared library i

In [33]:
predictor = treelite_runtime.Predictor('./lgbm_numpy.so', verbose=True)
batch = treelite_runtime.Batch.from_npy2d(X)
out_pred = predictor.predict(batch)


[07:17:44] /root/.local/lib/python3.7/site-packages/treelite_runtime/predictor.py:309: Dynamic shared library /kaggle/working/lgbm_numpy.so has been successfully loaded into memory


In [36]:
speed('predictor.predict(batch)')

Average 0.000782 Min=0.000253 Max=0.00131


0.0007824436000021251

At the time of the run, the CPU averaged 0.000472 seconds. Compare this with Hummingbird's best CPU case of 0.00451 seconds: an almost **10X reduction**, because we were able to compile the model.

The base LightGBM model took about 0.05 seconds, which makes it roughly **100X slower** than the compiled model.

Note: The first batch run will be slower than the subsequent ones.
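
The speed-up figures above follow directly from the quoted timings. The short check below reproduces the arithmetic; the `speedup` helper is a hypothetical name introduced here, and the inputs are the averages reported in the text.

```python
def speedup(baseline_seconds, candidate_seconds):
    """Return how many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds

treelite_cpu    = 0.000472   # compiled Treelite model
hummingbird_cpu = 0.00451    # best Hummingbird CPU case (perfect tree traversal)
lightgbm_cpu    = 0.05       # base LightGBM model, as rounded in the text

vs_hummingbird = speedup(hummingbird_cpu, treelite_cpu)
vs_lightgbm = speedup(lightgbm_cpu, treelite_cpu)
print(f"vs Hummingbird: {vs_hummingbird:.1f}x")  # about 9.6x
print(f"vs LightGBM:    {vs_lightgbm:.0f}x")     # about 106x
```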


# **Let's try the Boston Dataset**

In [23]:
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release.
from sklearn.datasets import load_boston
X, y = load_boston(return_X_y=True)
print(f'dimensions of X = {X.shape}')
print(f'dimensions of y = {y.shape}')

dimensions of X = (506, 13)
dimensions of y = (506,)


In [24]:
import xgboost
dtrain = xgboost.DMatrix(X, label=y)
params = {'max_depth':3, 'eta':1, 'objective':'reg:squarederror', 'eval_metric':'rmse'}
bst = xgboost.train(params, dtrain, 20, [(dtrain, 'train')])

[0]	train-rmse:3.89050
[1]	train-rmse:3.38204
[2]	train-rmse:3.10513
[3]	train-rmse:2.84322
[4]	train-rmse:2.60580
[5]	train-rmse:2.45425
[6]	train-rmse:2.29526
[7]	train-rmse:2.17920
[8]	train-rmse:2.09359
[9]	train-rmse:1.96872
[10]	train-rmse:1.93416
[11]	train-rmse:1.83528
[12]	train-rmse:1.78750
[13]	train-rmse:1.71018
[14]	train-rmse:1.64747
[15]	train-rmse:1.57359
[16]	train-rmse:1.49626
[17]	train-rmse:1.43896
[18]	train-rmse:1.37123
[19]	train-rmse:1.30187


In [25]:
model = treelite.Model.from_xgboost(bst)

In [26]:
toolchain = 'gcc'

In [27]:
model.export_lib(toolchain=toolchain, libpath='./boston.so', verbose=True)

[07:16:26] /workspace/src/compiler/ast_native.cc:44: Using ASTNativeCompiler
[07:16:26] /workspace/src/compiler/ast/split.cc:24: Parallel compilation disabled; all member trees will be dumped to a single source file. This may increase compilation time and memory usage.
[07:16:26] /workspace/src/c_api/c_api.cc:286: Code generation finished. Writing code to files...
[07:16:26] /workspace/src/c_api/c_api.cc:291: Writing file main.c...
[07:16:26] /workspace/src/c_api/c_api.cc:291: Writing file recipe.json...
[07:16:26] /workspace/src/c_api/c_api.cc:291: Writing file header.h...
[07:16:26] /root/.local/lib/python3.7/site-packages/treelite/contrib/util.py:104: Compiling sources files in directory ./tmpsbfbm3ts into object files (*.o)...
[07:16:26] /root/.local/lib/python3.7/site-packages/treelite/contrib/util.py:133: Generating dynamic shared library ./tmpsbfbm3ts/predictor.so...
[07:16:26] /root/.local/lib/python3.7/site-packages/treelite/contrib/__init__.py:278: Generated shared library in

In [28]:
import treelite_runtime     # runtime module
predictor = treelite_runtime.Predictor('./boston.so', verbose=True)

[07:16:26] /root/.local/lib/python3.7/site-packages/treelite_runtime/predictor.py:309: Dynamic shared library /kaggle/working/boston.so has been successfully loaded into memory


In [29]:
batch = treelite_runtime.Batch.from_npy2d(X, rbegin=1, rend=500)

In [30]:
out_pred = predictor.predict(batch)
print(out_pred)

[20.186111  34.48005   37.285862  33.69207   28.283451  21.867765
 23.253294  17.29624   20.040396  16.950441  19.283855  21.816017
 20.342304  17.260868  20.342304  21.587475  18.442554  20.078833
 16.137157  13.929457  17.316753  17.636827  13.304058  15.001148
 13.821014  15.20974   14.832266  16.640896  20.459078  12.8455925
 17.52793   14.709333  14.155782  13.9590435 22.456371  20.259691
 22.808983  23.44831   30.543415  35.029102  28.100641  22.866552
 23.63423   22.805105  20.411028  20.411028  17.836796  14.136219
 19.081121  18.90171   20.825583  24.688742  23.183697  18.225296
 36.0195    23.34298   29.869055  23.83893   20.78524   18.64497
 17.47499   22.292019  24.869267  31.435106  23.953838  17.844997
 21.295427  18.80207   20.94234   23.24848   20.191523  23.353277
 23.414724  23.820236  21.87087   21.023169  21.270945  20.442343
 21.627295  25.187714  24.482016  22.956995  21.826084  22.557856
 24.683212  19.681183  23.851255  22.215725  30.038284  22.432909
 22.771053

In [31]:
dmX = xgboost.DMatrix(X)
speed('bst.predict(dmX)')

Average 0.000133 Min=5.66e-05 Max=0.00021


0.00013311490000091907

In [32]:
speed('out_pred = predictor.predict(batch)')

Average 0.000255 Min=0.000135 Max=0.000374


0.00025455820000388487