In [None]:
# Copyright 2021 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

<img src="http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png" style="width: 90px; float: center;">

# Triton server with FIL backend

## Overview

This notebook walks through the procedure for deploying an XGBoost model in Triton Inference Server with the Forest Inference Library (FIL) backend. The FIL backend allows forest models trained by several popular machine learning frameworks (including XGBoost, LightGBM, Scikit-Learn, and cuML) to be deployed in a Triton inference server using the RAPIDS Forest Inference Library for fast GPU-based inference. Using this backend, forest models can be deployed seamlessly alongside deep learning models for fast, unified inference pipelines.

## Requirements

* NVIDIA GPU (T4, V100, or A100)
* [Latest NVIDIA driver](https://docs.nvidia.com/datacenter/tesla/tesla-installation-notes/index.html)
* [Docker](https://docs.docker.com/get-docker/)
* [The NVIDIA container toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker)

## Setup

Before running this notebook, please check that the NVIDIA driver, Docker, and the NVIDIA CUDA Toolkit are installed. The `nvidia-smi` command should run successfully as follows:

In [None]:
!nvidia-smi

## Install XGBoost and Sklearn

We need to install XGBoost and scikit-learn inside the container using the following pip3 commands:

In [None]:
# Install sklearn first
!pip3 install -U scikit-learn

# Then install XGBoost
!pip3 install xgboost

## XGBoost model

If you have a pre-trained XGBoost model, save it as `xgboost.model` and skip this step. In this section, we'll train an XGBoost model on random data.

In [None]:
# Import required libraries
import numpy
from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

import os
import signal
import subprocess

In [None]:
# Generate dummy data to perform binary classification
seed = 7
numpy.random.seed(seed) # make the random data reproducible
features = 9 # number of sample features
samples = 10000 # number of samples
X = numpy.random.rand(samples, features).astype('float32')
Y = numpy.random.randint(2, size=samples)

test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)

In [None]:
model = XGBClassifier()
model.fit(X_train, y_train)

In [None]:
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(y_test, predictions)
print("Test Accuracy: %.2f%%" % (accuracy * 100.0))

## Export and load XGBoost model in Triton Inference Server

For deploying the trained XGBoost model in Triton Inference Server, follow the steps below:

**1. Create a model repository and save xgboost model checkpoint:**

We'll need to create a model repository that looks as follows:

```
model_repository/
`-- fil
    |-- 1
    |   `-- xgboost.model
    `-- config.pbtxt
```

In [None]:
# Create the directory to save the model (mkdir -p is a no-op if it already exists)
!mkdir -p /data/model_repository/fil/1

# Save your xgboost model as xgboost.model
model.save_model('/data/model_repository/fil/1/xgboost.model')

**2. Create and save config.pbtxt**

To deploy the model in Triton Inference Server, we need to create a protobuf config file called `config.pbtxt` under the `model_repository/fil/` directory. This file contains information about the model and the deployment. A sample config file is available here: [link](https://github.com/triton-inference-server/fil_backend#configuration)

Essentially, the following parameters need to be updated to match your configuration:

```
name: "fil"                              # Name of the model directory (fil in our case)
backend: "fil"                           # Triton FIL backend for deploying forest models
max_batch_size: 8192
input [
 {
    name: "input__0"
    data_type: TYPE_FP32
    dims: [ 9 ]                          # Input feature dimensions, in our sample case it's 9
  }
]
output [
 {
    name: "output__0"
    data_type: TYPE_FP32
    dims: [ 2 ]                          # Output 2 for binary classification model
  }
]
instance_group [{ kind: KIND_GPU }]
parameters [
  {
    key: "model_type"
    value: { string_value: "xgboost" }
  },
  {
    key: "predict_proba"
    value: { string_value: "false" }
  },
  {
    key: "output_class"
    value: { string_value: "true" }
  },
  {
    key: "threshold"
    value: { string_value: "0.5" }
  },
  {
    key: "algo"
    value: { string_value: "ALGO_AUTO" }
  },
  {
    key: "storage_type"
    value: { string_value: "AUTO" }
  },
  {
    key: "blocks_per_sm"
    value: { string_value: "0" }
  }
]

dynamic_batching {
  preferred_batch_size: [1024, 8192]
  max_queue_delay_microseconds: 100
}
```
For more information on sample configs, please refer to this [link](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md).

Store the above config as `config.pbtxt` in the `/data/model_repository/fil/` directory as follows:

In [None]:
%%bash
# Writing config to file
cat > /data/model_repository/fil/config.pbtxt <<EOL 
name: "fil"                              # Name of the model directory (fil in our case)
backend: "fil"                           # Triton FIL backend for deploying forest models
max_batch_size: 8192
input [
 {
    name: "input__0"
    data_type: TYPE_FP32
    dims: [ 9 ]                          # Input feature dimensions, in our sample case it's 9
  }
]
output [
 {
    name: "output__0"
    data_type: TYPE_FP32
    dims: [ 2 ]                          # Output 2 for binary classification model
  }
]
instance_group [{ kind: KIND_GPU }]
parameters [
  {
    key: "model_type"
    value: { string_value: "xgboost" }
  },
  {
    key: "predict_proba"
    value: { string_value: "false" }
  },
  {
    key: "output_class"
    value: { string_value: "true" }
  },
  {
    key: "threshold"
    value: { string_value: "0.5" }
  },
  {
    key: "algo"
    value: { string_value: "ALGO_AUTO" }
  },
  {
    key: "storage_type"
    value: { string_value: "AUTO" }
  },
  {
    key: "blocks_per_sm"
    value: { string_value: "0" }
  }
]

dynamic_batching {
  preferred_batch_size: [1024, 8192]
  max_queue_delay_microseconds: 100
}
EOL
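
If the number of features or model instances changes between experiments, the same file could be generated from Python rather than edited by hand. The following is only a sketch: `render_fil_config` is a hypothetical helper, not part of Triton or the FIL backend, and it reproduces just the fields and values used in the config above.

```python
def render_fil_config(num_features, num_outputs=2, instance_count=1):
    """Render config.pbtxt text with the same fields as the config above.

    Hypothetical helper for illustration; the values mirror this notebook's
    config and are not validated against the FIL backend.
    """
    parameters = {
        "model_type": "xgboost",
        "predict_proba": "false",
        "output_class": "true",
        "threshold": "0.5",
        "algo": "ALGO_AUTO",
        "storage_type": "AUTO",
        "blocks_per_sm": "0",
    }
    # One { key / value } entry per parameter, separated by commas
    parameter_block = ",\n".join(
        f'  {{\n    key: "{key}"\n    value: {{ string_value: "{value}" }}\n  }}'
        for key, value in parameters.items()
    )
    return f'''name: "fil"
backend: "fil"
max_batch_size: 8192
input [
  {{
    name: "input__0"
    data_type: TYPE_FP32
    dims: [ {num_features} ]
  }}
]
output [
  {{
    name: "output__0"
    data_type: TYPE_FP32
    dims: [ {num_outputs} ]
  }}
]
instance_group [{{ count: {instance_count} kind: KIND_GPU }}]
parameters [
{parameter_block}
]
dynamic_batching {{
  preferred_batch_size: [1024, 8192]
  max_queue_delay_microseconds: 100
}}
'''
```

For this notebook, `render_fil_config(X_train.shape[1])` would yield a config with `dims: [ 9 ]`, and the returned string could then be written to `/data/model_repository/fil/config.pbtxt`.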

Finally, the model repository should look like this:

```
model_repository/
`-- fil
    |-- 1
    |   `-- xgboost.model
    `-- config.pbtxt
```
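
Before starting the server, it can help to confirm that both files are where Triton expects them. The sketch below uses only the Python standard library; `missing_repository_files` is a hypothetical helper name, and its defaults simply mirror the layout shown above.

```python
from pathlib import Path


def missing_repository_files(repository_root, model_name="fil", version="1",
                             model_file="xgboost.model"):
    """Return the expected repository files that do not exist yet."""
    root = Path(repository_root)
    expected = [
        root / model_name / "config.pbtxt",
        root / model_name / version / model_file,
    ]
    return [str(path) for path in expected if not path.is_file()]
```

An empty list from `missing_repository_files("/data/model_repository")` would indicate that the layout matches the tree above.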

**3. Deploy the model in Triton Inference Server**

Finally, we can deploy the xgboost model in Triton Inference Server using the following command:

In [None]:
# Run the Triton Inference Server in a Subprocess from Jupyter notebook

# The os.setsid() is passed in the argument preexec_fn so
# it's run after the fork() and before  exec() to run the shell.
pro = subprocess.Popen(["tritonserver --model-repository=/data/model_repository"], stdout=subprocess.PIPE, 
                       shell=True, preexec_fn=os.setsid) 

The above command should load the model and print the log `successfully loaded 'fil' version 1`. Triton server listens on the following endpoints:

```
Port 8000    -> HTTP Service
Port 8001    -> GRPC Service
Port 8002    -> Metrics
```

We can test the status of the server connection by running the curl command `curl -v <IP of machine>:8000/v2/health/ready`, which should return `HTTP/1.1 200 OK`.

**NOTE:** In our case, Triton Server and this notebook are running on the same machine, so the IP is `localhost`.

In [None]:
!curl -v localhost:8000/v2/health/ready
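
Because the server is launched in a background subprocess, the readiness check can fail if it runs before the model has finished loading. A minimal sketch of polling the same `/v2/health/ready` endpoint from Python follows; `wait_until_ready` is a hypothetical helper, and the timeout and interval defaults are arbitrary assumptions.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(url="http://localhost:8000/v2/health/ready",
                     timeout_s=60.0, interval_s=1.0):
    """Poll the readiness endpoint until it returns HTTP 200 or time runs out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval_s) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # Server is not accepting requests yet; retry after a pause
        time.sleep(interval_s)
    return False
```

Calling `wait_until_ready()` right after the `subprocess.Popen` cell would block until the server reports ready, or return `False` once the timeout elapses.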

In [None]:
# Stopping Triton Server before proceeding further
os.killpg(os.getpgid(pro.pid), signal.SIGTERM)  # Send the signal to every process in the group

### Concurrent Model Execution

By default, Triton Inference Server loads a single instance of our XGBoost model. We can load multiple instances of the model on either the same GPU or different GPUs (depending on availability) in order to improve the throughput of incoming requests. For more information on concurrent model execution in Triton, please refer to the documentation on [instance groups](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#instance-groups) and [concurrent model execution](https://github.com/triton-inference-server/server/blob/r21.06/docs/architecture.md#concurrent-model-execution).

To deploy multiple instances of our XGBoost model, we need to make a small modification to `instance_group` in the config.pbtxt file. Change `instance_group [{ kind: KIND_GPU }]` to

```
  instance_group [
    {
      count: 2
      kind: KIND_GPU
    }
  ]
```
This results in Triton creating two instances of the XGBoost model on the same GPU. Write the modified config and then run the server:

In [None]:
%%bash
# Writing config to file
cat > /data/model_repository/fil/config.pbtxt <<EOL 
name: "fil"                              # Name of the model directory (fil in our case)
backend: "fil"                           # Triton FIL backend for deploying forest models
max_batch_size: 8192
input [
 {
    name: "input__0"
    data_type: TYPE_FP32
    dims: [ 9 ]                          # Input feature dimensions, in our sample case it's 9
  }
]
output [
 {
    name: "output__0"
    data_type: TYPE_FP32
    dims: [ 2 ]                          # Output 2 for binary classification model
  }
]
instance_group [                         # 2 instances of the model will be deployed in this case
    {
      count: 2
      kind: KIND_GPU
    }
  ]
parameters [
  {
    key: "model_type"
    value: { string_value: "xgboost" }
  },
  {
    key: "predict_proba"
    value: { string_value: "false" }
  },
  {
    key: "output_class"
    value: { string_value: "true" }
  },
  {
    key: "threshold"
    value: { string_value: "0.5" }
  },
  {
    key: "algo"
    value: { string_value: "ALGO_AUTO" }
  },
  {
    key: "storage_type"
    value: { string_value: "AUTO" }
  },
  {
    key: "blocks_per_sm"
    value: { string_value: "0" }
  }
]

dynamic_batching {
  preferred_batch_size: [1024, 8192]
  max_queue_delay_microseconds: 100
}
EOL

In [None]:
# Run the Triton Inference Server with 2 instances of our model in a Subprocess from Jupyter notebook

# The os.setsid() is passed in the argument preexec_fn so
# it's run after the fork() and before  exec() to run the shell.
pro = subprocess.Popen(["tritonserver --model-repository=/data/model_repository"], stdout=subprocess.PIPE, 
                       shell=True, preexec_fn=os.setsid) 

As the server logs show, two instances of the XGBoost model are created by the Triton FIL backend:

```
I0716 17:55:38.954195 313 api.cu:117] TRITONBACKEND_ModelInstanceInitialize: fil_0_0 (GPU device 0)
I0716 17:55:43.991152 313 api.cu:117] TRITONBACKEND_ModelInstanceInitialize: fil_0_1 (GPU device 0)
I0716 17:55:44.000355 313 model_repository_manager.cc:1212] successfully loaded 'fil' version 1
```
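
When the server log is long, the instance lines can be picked out programmatically. The sketch below assumes the log format shown in the excerpt above; `initialized_instances` is a hypothetical helper, and the pattern may need adjusting for other Triton versions.

```python
import re

# Matches lines such as
# "... TRITONBACKEND_ModelInstanceInitialize: fil_0_0 (GPU device 0)"
INSTANCE_PATTERN = re.compile(
    r"TRITONBACKEND_ModelInstanceInitialize: (\S+) \((.+?)\)"
)


def initialized_instances(log_text):
    """Return (instance_name, device) pairs found in the log text."""
    return INSTANCE_PATTERN.findall(log_text)
```

With `count: 2` in the config, the returned list would be expected to contain two entries on the same device.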

In [None]:
# Stopping Triton Server before proceeding further
os.killpg(os.getpgid(pro.pid), signal.SIGTERM)  # Send the signal to every process in the group

### Dynamic Batching

Dynamic batching is a feature of Triton that allows the server to combine inference requests so that a batch is created dynamically, which increases throughput. To use dynamic batching in Triton, set the following property in the config.pbtxt file:
```
dynamic_batching {
  preferred_batch_size: [1024, 8192]
  max_queue_delay_microseconds: 100
}
```
The *preferred_batch_size* property indicates the batch sizes that the dynamic batcher should attempt to create. When a batch of a preferred size cannot be created from the available requests, the *max_queue_delay_microseconds* property lets the batcher delay sending the batch, as long as no request is delayed longer than the configured value. If a new request arrives during this delay and allows the dynamic batcher to form a batch of a preferred size, that batch is sent immediately for inference. If the delay expires, the dynamic batcher sends the batch as is, even though it is not a preferred size.
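
The decision rule described above can be illustrated with a small model. This is a simplified sketch of the behaviour as described in this section, not Triton's actual scheduler; `choose_batch` and its arguments are hypothetical names.

```python
def choose_batch(queued, preferred_sizes, max_batch_size, delay_expired):
    """Return the batch size to dispatch now, or None to keep waiting.

    queued: number of requests currently waiting in the queue
    delay_expired: True once the oldest request has waited for
        max_queue_delay_microseconds
    """
    # Preferred sizes that the queued requests can fill right now
    reachable = [size for size in preferred_sizes
                 if size <= min(queued, max_batch_size)]
    if reachable:
        return max(reachable)  # a preferred batch is sent immediately
    if delay_expired and queued > 0:
        return min(queued, max_batch_size)  # send as is after the delay
    return None  # keep waiting for more requests
```

With the settings from this notebook, `choose_batch(1500, [1024, 8192], 8192, False)` dispatches 1024 requests at once, while 500 queued requests wait until the delay expires and are then sent as a batch of 500.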

## Model Analyzer

[Triton Model Analyzer](https://github.com/triton-inference-server/model_analyzer) is a tool that profiles your model and evaluates which deployment configuration maximizes its inference performance when deployed in Triton Inference Server. Using this tool, you can find appropriate settings such as the batch size and the number of model instances. Model Analyzer installation steps are available here: [link](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/install.md)

In [None]:
!pip3 install nvidia-pyindex

In [None]:
!pip3 install triton-model-analyzer

In [None]:
%%bash
# Run in a single bash cell so the exported variable is visible to wget and dpkg
export DCGM_VERSION=2.0.13
wget -q https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/datacenter-gpu-manager_${DCGM_VERSION}_amd64.deb && \
 dpkg -i datacenter-gpu-manager_${DCGM_VERSION}_amd64.deb

In [None]:
!model-analyzer profile -m /data/model_repository/ --profile-models fil --override-output-model-repository

## Triton Client

**Note:** The Triton server needs to be running to execute this section.

After model profiling is done and the final model configuration has been selected and deployed in Triton, we can test inference by sending an inference request from the Triton client. For more information on installation steps, please check the [Triton Client GitHub repository](https://github.com/triton-inference-server/client).

In [None]:
# Run Triton Inference Server in background from Jupyter notebook

# The os.setsid() is passed in the argument preexec_fn so
# it's run after the fork() and before  exec() to run the shell.
pro = subprocess.Popen(["tritonserver --model-repository=/data/model_repository"], stdout=subprocess.PIPE, 
                       shell=True, preexec_fn=os.setsid) 

In [None]:
# Install the dependencies needed to run the Triton client inside the Triton Inference Server container
!pip3 install nvidia-pyindex
!pip3 install tritonclient[http] # Install just http client for this notebook demo

In [None]:
# Check client library can be imported
import numpy
import tritonclient.http as triton_http

In [None]:
# Set up HTTP client.
http_client = triton_http.InferenceServerClient(
    url='localhost:8000',
    verbose=False,
    concurrency=1
)

# Set up Triton input and output objects for the HTTP client
triton_input_http = triton_http.InferInput(
    'input__0',
    (X_test.shape[0], X_test.shape[1]),
    'FP32'
)
triton_input_http.set_data_from_numpy(X_test, binary_data=True)
triton_output_http = triton_http.InferRequestedOutput(
    'output__0',
    binary_data=True
)

# Submit inference requests 
request_http = http_client.infer(
    'fil',
    model_version='1',
    inputs=[triton_input_http],
    outputs=[triton_output_http]
)

# Get results as numpy arrays
result_http = request_http.as_numpy('output__0')

# Check that we got the same accuracy as previously
predictions = [round(value) for value in result_http]
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

The test accuracy of the model deployed in Triton with the FIL backend should match the accuracy computed earlier with the XGBoost library's `predict` function.
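
The request above sends all of `X_test` in a single call, which works here because its roughly 3,300 rows fit within `max_batch_size: 8192`. For larger inputs, the rows would need to be split across several requests. A minimal sketch of computing the slice boundaries follows; `batch_slices` is a hypothetical helper, and its default mirrors the `max_batch_size` configured in this notebook.

```python
def batch_slices(num_rows, max_batch_size=8192):
    """Yield (start, stop) index pairs that cover num_rows in chunks."""
    if max_batch_size <= 0:
        raise ValueError("max_batch_size must be positive")
    for start in range(0, num_rows, max_batch_size):
        yield start, min(start + max_batch_size, num_rows)
```

Each pair can then be used as `X_test[start:stop]` to build one `InferInput` per request, with the per-request results concatenated afterwards.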