Copyright 2021 NVIDIA Corporation. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

<img src="http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png" style="width: 90px; float: center;">

# Triton server with FIL backend

## Overview

This notebook demonstrates how to deploy a [LightGBM model](https://lightgbm.readthedocs.io/en/latest/) in Triton Inference Server with the Forest Inference Library (FIL) backend. The FIL backend allows forest models trained by several popular machine learning frameworks (including XGBoost, LightGBM, Scikit-Learn, and cuML) to be deployed in Triton Inference Server, using the RAPIDS Forest Inference Library for fast GPU-based inference. With this backend, forest models can be deployed alongside deep learning models for fast, unified inference pipelines.

### Contents
* [Requirements](#Requirements)
* [Train LightGBM model on SKlearn Breast Cancer Dataset](#TrainLightGBMBinaryClassificationmodel)
* [Export, load and deploy LightGBM model in Triton Inference Server](#Export)
* [Determine throughput and latency using Perf Analyzer](#PerfAnalyzer)
* [Find best configuration using Model Analyzer](#ModelAnalyzer)
* [Deploy model with best configuration](#Deploy)
* [Triton Client](#Client)
* [Conclusion](#Conclusion)

## Requirements <a class="anchor" id="Requirements"></a>

* NVIDIA GPU (Pascal or newer; recommended GPUs: T4, V100 or A100)
* [Latest NVIDIA driver](https://docs.nvidia.com/datacenter/tesla/tesla-installation-notes/index.html)
* [Docker](https://docs.docker.com/get-docker/)
* [The NVIDIA container toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker)

## Setup <a class="anchor" id="Setup"></a>

To begin, check that the NVIDIA driver has been installed correctly. The `nvidia-smi` command should run and output information about the GPUs on your system:

In [None]:
!nvidia-smi

## Install LightGBM and Sklearn <a class="anchor" id="InstallLightGBM"></a>

Install LightGBM, Pandas and Scikit-Learn inside the container with the following pip command:

In [None]:
# Install sklearn and lightGBM
!pip3 install -U scikit-learn lightgbm pandas

In [None]:
import lightgbm as lgb 
lgb.__version__

## Train LightGBM Binary Classification model <a class="anchor" id="TrainLightGBMBinaryClassificationmodel"></a>

If you have a pre-trained LightGBM model, save it as `model.txt` and skip this step. In this section, we'll train a LightGBM model on the example Breast Cancer dataset provided by Scikit-Learn.

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings("ignore")

import os
import signal
import subprocess

**Load the Dataset:**

In [None]:
#loading the breast cancer dataset
data = load_breast_cancer()
X=pd.DataFrame(data.data,columns=data.feature_names)
# Set the datatype of the dataframe to float32; Triton currently doesn't accept float64 as input
X=X.astype(np.float32)
Y=data.target

In [None]:
# Check for null values in the dataframe
X.isnull().sum().sort_values(ascending=False)

**Split the Data into Train Test:** 

In [None]:
#train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,Y,test_size=0.3,random_state=20)

**Scale the Data:**

In [None]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

**Convert the data to LGB data format:**

In [None]:
d_train=lgb.Dataset(X_train, label=y_train)

**Set the LightGBM Params:**

In [None]:
#Specifying the parameter
params={}
params['learning_rate']=0.03
params['boosting_type']='gbdt' #GradientBoostingDecisionTree
params['objective']='binary' #Binary target feature
params['metric']='binary_logloss' #metric for binary classification
params['max_depth']=50

**Train the model:**

In [None]:
# Train the model
model = lgb.train(params, d_train, 100)  # train for 100 boosting rounds

**Make predictions on the test set:**

In [None]:
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(y_test, predictions)
print("Test Accuracy: {:.2f}".format(accuracy * 100.0))
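
The cell above turns probabilities into labels with Python's built-in `round`, which sends exact ties to the nearest even integer, so a score of exactly 0.5 becomes 0. A minimal sketch of an explicit cut-off is shown below; `to_labels` is a hypothetical helper and the scores are made up for illustration, not model output.

```python
def to_labels(probabilities, threshold=0.5):
    """Map positive-class probabilities to 0/1 labels with an explicit cut-off."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Made-up scores for illustration only (not output of the model above).
scores = [0.10, 0.50, 0.51, 0.97]

rounded = [round(s) for s in scores]  # built-in round: ties go to the even integer
explicit = to_labels(scores)          # here a tie at the threshold counts as positive

print(rounded)
print(explicit)
```

The two approaches only differ for scores that sit exactly on the threshold, but an explicit comparison makes the tie-breaking rule visible and lets you try cut-offs other than 0.5.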

In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, predictions)
print('Confusion matrix\n\n', cm)
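# scikit-learn's layout: rows are true labels, columns are predicted labels.
# Class 1 is treated as the positive class here.
print('\nTrue Negatives(TN) = ', cm[0,0])
print('\nTrue Positives(TP) = ', cm[1,1])
print('\nFalse Positives(FP) = ', cm[0,1])
print('\nFalse Negatives(FN) = ', cm[1,0])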
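
The four confusion-matrix cells are enough to derive the usual summary metrics. The sketch below assumes scikit-learn's layout (rows are true labels, columns are predicted labels) and treats class 1 as the positive class; `binary_metrics` is a hypothetical helper, and the counts are invented for illustration rather than taken from this notebook.

```python
def binary_metrics(cm):
    """Derive summary metrics from a 2x2 confusion matrix.

    Assumes rows are true labels, columns are predicted labels,
    and class 1 is the positive class.
    """
    tn, fp = cm[0]
    fn, tp = cm[1]
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Invented counts for illustration only (not results from this notebook).
example_cm = [[50, 10], [5, 35]]
metrics = binary_metrics(example_cm)
print(metrics)
```

The helper also accepts the NumPy array returned by `confusion_matrix`, since it only unpacks the two rows.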

## Export, load and deploy LightGBM model in Triton Inference Server <a class="anchor" id="Export"></a>

For deploying the trained LightGBM model in Triton Inference Server, follow the steps below:

**1. Create a model repository and save lightGBM model checkpoint:**

We'll need to create a model repository that looks as follows:

```
model_repository/
`-- lgb_example
    |-- 1
    |   `-- model.txt
    `-- config.pbtxt
```

In [None]:
# Create directory to save the model
!mkdir -p /model_repository/lgb_example/1

# Save your lightGBM model as model.txt
# For more information on saving lightGBM model check https://lightgbm.readthedocs.io/en/latest/R/reference/lgb.save.html
# Model can also be dumped to json format
model.save_model('/model_repository/lgb_example/1/model.txt')

**Note:**
The FIL backend's testing infrastructure includes a script for generating example models, putting them in the correct directory layout, and generating an associated config file. This can be helpful both for providing a template for your own models and for testing your Triton deployment. Please check this [link](https://github.com/triton-inference-server/fil_backend/blob/main/Example_Models.md) for the sample script.

**2. Create and save config.pbtxt**

To deploy the model in Triton Inference Server, we need to create a protobuf config file called config.pbtxt in the `model_repository/lgb_example/` directory. It contains information about the model and the deployment. A sample config file is available here: [link](https://github.com/triton-inference-server/fil_backend#configuration)

Essentially, the following parameters need to be updated to match your model:

```
name: "lgb_example"                      # Name of the model directory (lgb_example in our case)
backend: "fil"                           # Triton FIL backend for deploying forest models
max_batch_size: 8192
input [
 {
    name: "input__0"
    data_type: TYPE_FP32
    dims: [ 30 ]                         # Input feature dimensions, 30 for the Breast Cancer dataset
  }
]
output [
 {
    name: "output__0"
    data_type: TYPE_FP32
    dims: [ 2 ]                          # Output 2 for binary classification model
  }
]
instance_group [{ kind: KIND_GPU }]
parameters [
  {
    key: "model_type"
    value: { string_value: "lightgbm" }
  },
  {
    key: "predict_proba"
    value: { string_value: "false" }
  },
  {
    key: "output_class"
    value: { string_value: "true" }
  },
  {
    key: "threshold"
    value: { string_value: "0.5" }
  },
  {
    key: "algo"
    value: { string_value: "ALGO_AUTO" }
  },
  {
    key: "storage_type"
    value: { string_value: "AUTO" }
  },
  {
    key: "blocks_per_sm"
    value: { string_value: "0" }
  }
]
```
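
When the number of features or the parameter values change between models, it can help to generate the config text instead of editing it by hand. The sketch below renders the same fields as the sample above; `render_fil_config` is a hypothetical helper, not part of Triton or the FIL backend, and it only returns a string without writing any file.

```python
def render_fil_config(name, num_features, output_dim, parameters, max_batch_size=8192):
    """Render config.pbtxt text with the same fields as the sample config."""
    param_blocks = ",\n".join(
        f'  {{\n    key: "{key}"\n    value: {{ string_value: "{value}" }}\n  }}'
        for key, value in parameters.items()
    )
    return (
        f'name: "{name}"\n'
        'backend: "fil"\n'
        f"max_batch_size: {max_batch_size}\n"
        "input [\n"
        " {\n"
        '    name: "input__0"\n'
        "    data_type: TYPE_FP32\n"
        f"    dims: [ {num_features} ]\n"
        "  }\n"
        "]\n"
        "output [\n"
        " {\n"
        '    name: "output__0"\n'
        "    data_type: TYPE_FP32\n"
        f"    dims: [ {output_dim} ]\n"
        "  }\n"
        "]\n"
        "instance_group [{ kind: KIND_GPU }]\n"
        f"parameters [\n{param_blocks}\n]\n"
    )


# Parameter values copied from the sample config.
fil_parameters = {
    "model_type": "lightgbm",
    "predict_proba": "false",
    "output_class": "true",
    "threshold": "0.5",
    "algo": "ALGO_AUTO",
    "storage_type": "AUTO",
    "blocks_per_sm": "0",
}

# The Breast Cancer dataset has 30 features; in the notebook this could be X_train.shape[1].
config_text = render_fil_config("lgb_example", num_features=30, output_dim=2, parameters=fil_parameters)
print(config_text)
```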

Triton server looks for this configuration file before deploying the LightGBM model for inference, and sets up the server parameters according to it. For more information on model configuration, please refer to this [link](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md).

Store the above config as config.pbtxt in the `/model_repository/lgb_example/` directory as follows:

In [None]:
%%bash
# Writing config to file
cat > /model_repository/lgb_example/config.pbtxt <<EOL 
name: "lgb_example"                      # Name of the model directory (lgb_example in our case)
backend: "fil"                           # Triton FIL backend for deploying forest models
max_batch_size: 8192
input [
 {
    name: "input__0"
    data_type: TYPE_FP32
    dims: [ 30 ]                          # Input feature dimensions, 30 for the Breast Cancer dataset
  }
]
output [
 {
    name: "output__0"
    data_type: TYPE_FP32
    dims: [ 2 ]                          # Output 2 for binary classification model
  }
]
instance_group [{ kind: KIND_GPU }]
parameters [
  {
    key: "model_type"
    value: { string_value: "lightgbm" }
  },
  {
    key: "predict_proba"
    value: { string_value: "false" }
  },
  {
    key: "output_class"
    value: { string_value: "true" }
  },
  {
    key: "threshold"
    value: { string_value: "0.5" }
  },
  {
    key: "algo"
    value: { string_value: "ALGO_AUTO" }
  },
  {
    key: "storage_type"
    value: { string_value: "AUTO" }
  },
  {
    key: "blocks_per_sm"
    value: { string_value: "0" }
  }
]

EOL

The model repository should look like this:

```
model_repository/
`-- lgb_example
    |-- 1
    |   `-- model.txt
    `-- config.pbtxt
```
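
Before starting the server, it can be worth checking that both files are where the layout above expects them. This is a minimal sketch with a hypothetical helper, `missing_repository_files`; it is demonstrated on a temporary directory so that it runs without touching `/model_repository`.

```python
import tempfile
from pathlib import Path


def missing_repository_files(repo_root, model_name, version="1", model_file="model.txt"):
    """Return the expected files that are absent, relative to the repository root."""
    model_dir = Path(repo_root) / model_name
    expected = [model_dir / "config.pbtxt", model_dir / version / model_file]
    return [p.relative_to(repo_root).as_posix() for p in expected if not p.is_file()]


# Demonstrate on a throwaway directory with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    version_dir = Path(tmp) / "lgb_example" / "1"
    version_dir.mkdir(parents=True)
    (version_dir / "model.txt").write_text("placeholder")
    before = missing_repository_files(tmp, "lgb_example")  # config.pbtxt not written yet

    (Path(tmp) / "lgb_example" / "config.pbtxt").write_text("placeholder")
    after = missing_repository_files(tmp, "lgb_example")

print(before)
print(after)
```

In the notebook, the equivalent call would be `missing_repository_files("/model_repository", "lgb_example")`, which should return an empty list once both files are in place.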

**3. Deploy the model in Triton Inference Server**

Finally, we can deploy the lightGBM model in Triton Inference Server using the following command:

In [None]:
# Run the Triton Inference Server in a Subprocess from Jupyter notebook

triton_process = subprocess.Popen(["tritonserver", "--model-repository=/model_repository"], stdout=subprocess.PIPE, preexec_fn=os.setsid) 

The above command should load the model and log `successfully loaded 'lgb_example' version 1`. Triton server listens on the following endpoints:

```
Port 8000    -> HTTP Service
Port 8001    -> GRPC Service
Port 8002    -> Metrics
```

We can test the status of the server connection by running the curl command `curl -v <IP of machine>:8000/v2/health/ready`, which should return `HTTP/1.1 200 OK`.

**Note:** In our case, Triton Server and this notebook run on the same machine, so the IP is `localhost`.

In [None]:
!curl -v localhost:8000/v2/health/ready
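
Because the server is started in a background subprocess, the readiness endpoint may not answer on the first attempt. The sketch below polls it until it returns HTTP 200. `http_ready` and `wait_until_ready` are hypothetical helpers built on the standard library, and the demonstration uses a stand-in probe so it runs without a server.

```python
import time
import urllib.error
import urllib.request


def http_ready(url, timeout=2.0):
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(url, attempts=30, delay=1.0, probe=http_ready):
    """Call `probe(url)` until it succeeds or `attempts` calls have failed."""
    for _ in range(attempts):
        if probe(url):
            return True
        time.sleep(delay)
    return False


# Offline demonstration: a stand-in probe that succeeds on its third call.
answers = iter([False, False, True])
ready = wait_until_ready(
    "http://localhost:8000/v2/health/ready",
    delay=0,
    probe=lambda _url: next(answers),
)
print(ready)
```

Against a running server, `wait_until_ready("http://localhost:8000/v2/health/ready")` uses the real HTTP probe by default.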

## Determine throughput and latency with Perf Analyzer <a class="anchor" id="PerfAnalyzer"></a>

Once the model is deployed for inference in Triton, we can measure its inference performance using `perf_analyzer`. The `perf_analyzer` application generates inference requests to the deployed model and measures the throughput and latency of those requests. For more information on the `perf_analyzer` utility, please refer to this [link](https://github.com/triton-inference-server/server/blob/main/docs/perf_analyzer.md).

In [None]:
# Install nvidia-pyindex
!pip3 install nvidia-pyindex

In [None]:
# Install Triton client
!pip3 install tritonclient[http]

In [None]:
# Run Perf analyzer to simulate incoming inference requests to Triton server
# Stabilize p99 latency with a threshold of 5 msec, sweeping request concurrency from 10 to 15 at batch size 1
!perf_analyzer -m lgb_example --percentile=99 --latency-threshold=5 --concurrency-range=10:15 --async -b 1
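
To make `--percentile=99` and `--latency-threshold=5` concrete, the sketch below computes a p99 latency from a list of samples and compares it with a 5 ms budget. It uses the nearest-rank definition of a percentile and invented samples; it illustrates the metric only and is not how `perf_analyzer` is implemented.

```python
import math


def percentile_nearest_rank(samples, percentile):
    """Smallest sample such that at least `percentile` percent of samples are <= it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]


# Invented latencies in milliseconds: 98 fast requests and 2 slow ones.
latencies_ms = [1.0] * 98 + [4.0, 9.0]

p99 = percentile_nearest_rank(latencies_ms, 99)
mean = sum(latencies_ms) / len(latencies_ms)
within_budget = p99 <= 5

print(p99, mean, within_budget)
```

The mean hides the slow tail almost entirely, which is why a latency budget is usually stated against a high percentile instead.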

In [None]:
# Stopping Triton Server before proceeding further
os.killpg(os.getpgid(triton_process.pid), signal.SIGTERM)  # Send the signal to every process in the group
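
The start and stop cells rely on the server running in its own process group, so that a single signal reaches it and any children. The sketch below shows the same pattern on POSIX systems with a sleeping Python process standing in for `tritonserver`. `start_in_new_group` and `stop_group` are hypothetical helpers, and `start_new_session=True` is used in place of `preexec_fn=os.setsid`.

```python
import os
import signal
import subprocess
import sys


def start_in_new_group(command):
    """Start `command` as the leader of a new session and process group (POSIX only)."""
    return subprocess.Popen(command, start_new_session=True)


def stop_group(process, timeout=10):
    """Send SIGTERM to the whole process group, then wait for the leader to exit."""
    os.killpg(os.getpgid(process.pid), signal.SIGTERM)
    return process.wait(timeout=timeout)


# Stand-in for tritonserver so the sketch runs anywhere: a process that just sleeps.
sleeper = start_in_new_group([sys.executable, "-c", "import time; time.sleep(60)"])
exit_code = stop_group(sleeper)
print(exit_code)
```

Waiting on the process after signalling it confirms that the server has exited before the next cell starts a new one on the same ports.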

## Find best configuration using Model Analyzer <a class="anchor" id="ModelAnalyzer"></a>

[Triton Model Analyzer](https://github.com/triton-inference-server/model_analyzer) is a tool to profile and evaluate the best deployment configuration that maximizes inference performance of your model when deployed in Triton Inference Server. Using this tool, you can find the appropriate batch size, instances of your model etc. based on the constraints specified like maximum latency budget, minimum throughput and maximum GPU utilization limit. Model Analyzer installation steps are available here: [link](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/install.md)

In [None]:
# Install library
!pip3 install triton-model-analyzer

Create a config file specifying the profiling constraints as follows:
* `perf_throughput` - Specify minimum desired throughput.
* `perf_latency` - Specify maximum tolerable latency or latency budget.
* `gpu_used_memory` - Specify maximum GPU memory used by model.
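
As an illustration of how constraints of this kind narrow down candidate configurations, the sketch below filters a few measurements and then picks the highest-throughput survivor. The candidate names, numbers and dictionary layout are all hypothetical, and this is not Model Analyzer's implementation; only the three metric names come from the list above.

```python
def satisfies(measurement, min_throughput=None, max_latency_ms=None, max_gpu_memory_mb=None):
    """Check one measurement against whichever constraints are provided."""
    if min_throughput is not None and measurement["perf_throughput"] < min_throughput:
        return False
    if max_latency_ms is not None and measurement["perf_latency"] > max_latency_ms:
        return False
    if max_gpu_memory_mb is not None and measurement["gpu_used_memory"] > max_gpu_memory_mb:
        return False
    return True


# Hypothetical measurements, invented for illustration only.
candidates = [
    {"name": "config_a", "perf_throughput": 900.0, "perf_latency": 4.0, "gpu_used_memory": 600.0},
    {"name": "config_b", "perf_throughput": 1500.0, "perf_latency": 7.5, "gpu_used_memory": 700.0},
    {"name": "config_c", "perf_throughput": 1200.0, "perf_latency": 4.8, "gpu_used_memory": 650.0},
]

feasible = [c for c in candidates if satisfies(c, min_throughput=1000, max_latency_ms=5)]
best = max(feasible, key=lambda c: c["perf_throughput"])
print(best["name"])
```

Note that the fastest candidate is rejected because it exceeds the latency budget, which is the trade-off the constraints are meant to expose.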

In [None]:
%%bash
# Writing constraints to file
cat > model_analyzer_constraints.yaml <<EOL 
model_repository: /model_repository/
triton_launch_mode: "local"
latency_budget: 5
run_config_search_max_concurrency: 64
run_config_search_max_instance_count: 3
run_config_search_max_preferred_batch_size: 8
profile_models:
  lgb_example

EOL

In [None]:
# Run model_analyzer profiler on LightGBM model 
!model-analyzer profile -f model_analyzer_constraints.yaml --override-output-model-repository

The above command searches across various config parameters for the `lgb_example` LightGBM model. Complete execution of this cell might take a while (30-40 mins), as Model Analyzer searches for the optimum configuration based on the given constraints. When finished, Model Analyzer stores all of the profiling measurements it has taken in a binary file in the checkpoint directory. We can then generate and visualize results using the `analyze` command as follows:

In [None]:
# Install wkhtmltopdf to generate pdf reports
!apt-get update && apt-get install -y wkhtmltopdf

In [None]:
![ ! -d "analysis_results" ] && mkdir analysis_results
!model-analyzer analyze --analysis-models lgb_example -e analysis_results

The detailed summary report from Model Analyzer can be found under `/notebook/analysis_results/reports/summaries/lgb_example/result_summary.pdf`. It should look something like this:

![image.png](attachment:888cfd23-eb20-4187-9c3c-70febb9a9210.png)

![image.png](attachment:94b4ab62-052c-47d7-946f-03459ce7cd25.png)

## Deploy model with best configuration <a class="anchor" id="Deploy"></a>

Now we can deploy the model with the configuration that gives the best throughput and latency, as evaluated by Model Analyzer. In our case, that is the `lgb_example_i0` configuration. We can now copy its config.pbtxt to the model repository and restart Triton Inference Server.

In [None]:
%%bash
# Change the best model configuration as per required constraints
export best_model='lgb_example_i0'
cp -r output_model_repository/$best_model/ /model_repository/
mkdir -p /model_repository/$best_model/1 && cp /model_repository/lgb_example/1/model.txt /model_repository/$best_model/1/

In [None]:
# Run Triton Inference Server in background from Jupyter notebook

triton_process = subprocess.Popen(["tritonserver", "--model-repository=/model_repository"], stdout=subprocess.PIPE, preexec_fn=os.setsid) 

In [None]:
# Run the perf_analyzer with same parameters
# Change the model as per your configuration 
!perf_analyzer -m lgb_example_i0 --percentile=99 --latency-threshold=5 --concurrency-range=10:15 --async -b 1

In [None]:
# Stopping Triton Server before proceeding further
os.killpg(os.getpgid(triton_process.pid), signal.SIGTERM)  # Send the signal to every process in the group

## Triton Client <a class="anchor" id="Client"></a>

Once profiling is done and the final model configuration has been selected and deployed in Triton, we can test inference by sending real inference requests from Triton Client and checking the accuracy of the responses. For more information on installation steps, please check [Triton Client Github](https://github.com/triton-inference-server/client).

In [None]:
# Run Triton Inference Server in background from Jupyter notebook
triton_process = subprocess.Popen(["tritonserver", "--model-repository=/model_repository"], stdout=subprocess.PIPE, preexec_fn=os.setsid) 

In [None]:
# Check client library can be imported
import numpy
import tritonclient.http as triton_http

In [None]:
# Set up HTTP client.
http_client = triton_http.InferenceServerClient(
    url='localhost:8000',
    verbose=False,
    concurrency=1
)

# Set up Triton input and output objects for HTTP
triton_input_http = triton_http.InferInput(
    'input__0',
    (X_test.shape[0], X_test.shape[1]),
    'FP32'
)
triton_input_http.set_data_from_numpy(X_test, binary_data=True)
triton_output_http = triton_http.InferRequestedOutput(
    'output__0',
    binary_data=True
)

# Submit inference requests 
request_http = http_client.infer(
    'lgb_example',
    model_version='1',
    inputs=[triton_input_http],
    outputs=[triton_output_http]
)

# Get results as numpy arrays
result_http = request_http.as_numpy('output__0')
# Check that we got the same accuracy as previously
predictions = [round(value) for value in result_http]
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: {:.2f}".format(accuracy * 100.0))

The test accuracy of the model deployed in Triton with the FIL backend should match the one previously computed using the LightGBM library's predict function.
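
The test set here fits in a single request, but the config above sets `max_batch_size: 8192`, so larger inputs need to be split before they are sent. The sketch below shows one way to chunk rows; `iter_batches` is a hypothetical helper, and each yielded chunk would be sent as its own `infer` request.

```python
def iter_batches(rows, max_batch_size=8192):
    """Yield consecutive slices of `rows`, each at most `max_batch_size` long."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    for start in range(0, len(rows), max_batch_size):
        yield rows[start:start + max_batch_size]


# Stand-in data: 20000 rows, more than fits in one request.
rows = list(range(20000))
batch_sizes = [len(batch) for batch in iter_batches(rows)]
print(batch_sizes)
```

Because the helper only slices along the first axis, it works the same way on a NumPy array such as `X_test`.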

In [None]:
# Stopping Triton Server before proceeding further
os.killpg(os.getpgid(triton_process.pid), signal.SIGTERM)  # Send the signal to every process in the group

## Conclusion <a class="anchor" id="Conclusion"></a>

The Triton FIL backend can be used to deploy tree-based models trained in frameworks like LightGBM, Scikit-Learn, and cuML for fast GPU-based inference. Tree-based models can therefore be deployed alongside deep learning models in Triton Inference Server. Moreover, the Model Analyzer tool can be used to profile the models and find the deployment configuration that best satisfies the deployment constraints. The trained model can then be deployed in Triton using that configuration, and Triton Client can be used to send inference requests.