# The FIL Backend for Triton: Deployment Details and FAQs

## Introduction

This example notebook focuses on the technical details of deploying tree-based models with the FIL Backend for Triton. It is organized as a series of FAQs, each followed by example code that gives a practical illustration of the corresponding FAQ section.

The goal of this notebook is to offer information that goes beyond the basics and provide answers to practical questions that may arise when attempting a real-world deployment with the FIL backend. If you are a complete newcomer to the FIL backend and are looking for a short introduction to the basics of what the FIL backend is and how to use it, you are encouraged to check out [this introductory notebook](https://github.com/triton-inference-server/fil_backend/blob/main/notebooks/categorical-fraud-detection/Fraud_Detection_Example.ipynb).

While we do provide training code for example models, training models is *not* the subject of this notebook, and we will provide little detail on training. Instead, you are encouraged to use your own model(s) and data with this notebook to get a realistic picture of how your model will perform with Triton.

## Hardware Pre-Requisites
Most of this notebook is designed to run either on CPU or GPU. Sections that will only run on GPU will be marked in $\color{#76b900}{\text{green}}$. To guarantee that all cells will execute correctly if a GPU is not available, change `USE_GPU` in the following cell to `False`.

In [1]:
USE_GPU = True


## Software Pre-Requisites

TODO

### Bring Your Own Model
If you are bringing your own model(s) to use with this notebook, you will need only [Docker](https://docs.docker.com/engine/install/), NumPy, and Triton's Python client package. The following command can be used to install the Python dependencies in any Python environment:
```bash
pip install numpy tritonclient[all]
```
Note that the Triton client package also requires that `libb64` be available. This can be installed on Debian-based systems via
```bash
sudo apt install libb64-0d
```

### Train an Example Model
Training the example models in this notebook requires more extensive dependencies. You can install them all with the following conda environment file:
```yaml
---
name: triton_example
channels:
  - conda-forge
  - nvidia
  - rapidsai
dependencies:
  - cudatoolkit=11.4
  - cudf=21.12
  - cuml=21.12
  - cupy
  - jupyter
  - kaggle
  - matplotlib
  - numpy
  - pandas
  - pip
  - python=3.8
  - scikit-learn
  - pip:
      - tritonclient[all]
      - xgboost>=1.5,<1.6
```

# FAQ 1: What can I deploy with the FIL backend?
The first thing you will need to begin using the FIL backend is a serialized model file. The FIL backend supports **tree-based** models serialized to formats from a variety of frameworks, including the following:

## XGBoost JSON and binary models
XGBoost uses two serialization formats, both of which are natively supported by the FIL backend. All XGBoost models except for multi-output regression models are supported.
<div class="alert alert-block alert-info">
<b>VERSION NOTE:</b> Categorical variable support was added in XGBoost 1.5 as an experimental feature. The FIL backend has supported categorical variables since version 21.11.
</div>
<div class="alert alert-block alert-info">
<b>VERSION NOTE:</b> The XGBoost JSON format changed in XGBoost 1.6. The first version of the FIL backend to support these JSON changes will be 22.07.
</div>

## LightGBM text models
LightGBM's text serialization format is natively supported for all LightGBM model types except for multi-output regression models.

<div class="alert alert-block alert-info">
<b>VERSION NOTE:</b> Models trained on categorical variables have been supported since version 21.11 of the backend.
</div>

## Scikit-Learn/cuML tree models and other Treelite-supported models

The FIL backend supports the following model types from Scikit-Learn/cuML:
- GradientBoostingClassifier
- GradientBoostingRegressor
- IsolationForest
- RandomForestClassifier
- RandomForestRegressor
- ExtraTreesClassifier
- ExtraTreesRegressor

Since Scikit-Learn and cuML do not have native serialization formats for these models (instead relying on e.g. Pickle), we use Treelite's checkpoint format to support these models. This also means that *any* framework that can export to Treelite's checkpoint format will be supported by the FIL backend. As part of this notebook, we will provide an example of how to save a Scikit-Learn or cuML model to a Treelite checkpoint.

<div class="alert alert-block alert-info">
<b>VERSION NOTE:</b> Treelite's checkpoint format provides no forward/backward compatibility guarantees. It is therefore <b>strongly recommended</b> that you save Scikit-Learn and cuML models to Pickle so that they can be reconverted as needed. The table below shows the version of Treelite which <b>must</b> be used with each version of the FIL backend.
    
<table>
    <thead>
        <tr><th>FIL Backend Version</th><th>Treelite</th></tr>
    </thead>
    <tbody>
        <tr><td>21.08</td><td>1.3.0</td></tr>
        <tr><td>21.09-21.10</td><td>2.0.0</td></tr>
        <tr><td>21.11-22.02</td><td>2.1.0</td></tr>
        <tr><td>22.03-22.06</td><td>2.3.0</td></tr>
    </tbody>
</table>
    
</div>
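Because a checkpoint must be written with the exact Treelite version listed above, it can be handy to look up that version before converting a model. The following is a minimal sketch that encodes the table as data. The function name `required_treelite_version` is hypothetical and is not part of the FIL backend or Treelite.

```python
# Version ranges copied from the table above:
# (first backend release, last backend release, required Treelite version)
TREELITE_BY_BACKEND = [
    ((21, 8), (21, 8), "1.3.0"),
    ((21, 9), (21, 10), "2.0.0"),
    ((21, 11), (22, 2), "2.1.0"),
    ((22, 3), (22, 6), "2.3.0"),
]


def required_treelite_version(backend_version):
    """Return the Treelite version listed for a FIL backend release such as '22.05'."""
    year, month = (int(part) for part in backend_version.split("."))
    for first, last, treelite_version in TREELITE_BY_BACKEND:
        if first <= (year, month) <= last:
            return treelite_version
    raise ValueError(f"No Treelite version listed for FIL backend {backend_version}")


print(required_treelite_version("22.05"))
```

The returned string could then be compared against `treelite.__version__` in the environment used for conversion. Releases outside the table raise an error rather than guessing.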



### FAQ 1.1 Can I deploy non-tree Scikit-Learn models like LinearRegression?
No. The FIL backend only supports tree models and will continue to support only tree models in the future. Support for other model types may eventually be added to Triton via another backend.

### FAQ 1.2 Can I deploy Scikit-Learn/cuML Pipelines with the FIL backend?
No. If you wish to create pipelines of different models in Triton, check out Triton's [Python backend](https://github.com/triton-inference-server/python_backend#python-backend), which allows users to connect models supported by other backends with arbitrary Python logic.

### FAQ 1.3 Can I deploy Scikit-Learn/cuML models serialized with Pickle?
Pickle-serialized models can be converted to Treelite's checkpoint format using a script provided with the FIL Backend. This script is [documented here](https://github.com/triton-inference-server/fil_backend/blob/main/SKLearn_and_cuML.md#converting-to-treelite-checkpoints), and an example of its use will be included with this notebook. **Pickle models MUST be converted to Treelite checkpoints. They CANNOT be used directly by the FIL backend.**

### FAQ 1.4 Can I deploy Scikit-Learn/cuML models serialized with Joblib?
Joblib-serialized models can be loaded in Python and serialized to Treelite checkpoints. At the moment, the conversion scripts for Pickle-serialized models do **not** work with Joblib, but support for Joblib will be added in a later version. **Joblib models MUST be converted to Treelite checkpoints. They CANNOT be used directly by the FIL backend.**

# Example 1: Model Serialization

In the following example code snippets, we will demonstrate how model serialization works for each of the supported model types. In the cell below, indicate the type of model you would like to use.

If you are bringing your own model, please also provide the path to the serialized model. Otherwise, a model will be trained on random data in your selected format.

In addition to information on where and how the model is stored, we'll use the following cell to gather a bit of metadata on the model which we'll need later on, including the number of features the model expects and the number of classes it outputs. If you are using a regression model, use `1` for the number of classes.

In [2]:
# Allowed values for MODEL_FORMAT are xgboost_json, xgboost_bin, lightgbm, skl_pkl, cuml_pkl, skl_joblib,
# cuml_joblib, and treelite
MODEL_FORMAT = 'xgboost_json'

# If a path is provided to a model in the specified format, that model will be used for the following examples.
# Otherwise, if MODEL_PATH is left as None, a model will be trained and stored to a default location.
MODEL_PATH = None

# Set this value to the number of features (columns) in your dataset
NUM_FEATURES = 32

# Set this value to the number of possible output classes or 1 for regression models
NUM_CLASSES = 2

## Model Training/Loading

In this section, if a model path has been provided, we will load the model so that we can compare its output to what we get from Triton later in the notebook. If a model path has **not** been provided, a model of the indicated type will be trained and serialized to a default location. We will not provide detail or commentary on training, since this is not the focus of this notebook. Consult documentation or examples for your chosen framework if you would like to learn more about the training process.

In [3]:
RANDOM_SEED=0

In [4]:
import pandas as pd
from sklearn.datasets import make_classification
# Create random dataset. Even if we do not use this dataset for training, we will use it for testing later.
# If you would like to use a real dataset, load it here into X and y Pandas dataframes
X, y = make_classification(
    n_samples=5000,
    n_features=NUM_FEATURES,
    n_informative=max(NUM_FEATURES // 3, 1),
    n_classes=NUM_CLASSES,
    random_state=RANDOM_SEED
)
X = pd.DataFrame(X)
y = pd.DataFrame(y)

In [5]:
# Set model parameters for any models we need to train
NUM_TREES = 500
MAX_DEPTH = 10

In [6]:
model = None
# XGBoost
def train_xgboost(X, y, n_trees=NUM_TREES, max_depth=MAX_DEPTH):
    import xgboost as xgb
    
    if USE_GPU:
        tree_method = 'gpu_hist'
        predictor = 'gpu_predictor'
    else:
        tree_method = 'hist'
        predictor = 'cpu_predictor'
    
    model = xgb.XGBClassifier(
        eval_metric='error',
        objective='binary:logistic',
        tree_method=tree_method,
        max_depth=max_depth,
        n_estimators=n_trees,
        use_label_encoder=False,
        predictor=predictor
    )
    
    return model.fit(X, y)

def train_lightgbm(X, y, n_trees=NUM_TREES, max_depth=MAX_DEPTH):
    import lightgbm as lgb
    
    lgb_data = lgb.Dataset(X, y)
    
    # LightGBM expects num_class=1 for binary classification
    classes = NUM_CLASSES
    if classes <= 2:
        classes = 1
        objective = 'binary'
        metric = 'binary_logloss'
    else:
        objective = 'multiclass'
        metric = 'multi_logloss'
    training_params = {
        'metric': metric,
        'objective': objective,
        'num_class': classes,
        'max_depth': max_depth,
        'verbose': -1
    }
    return lgb.train(training_params, lgb_data, n_trees)

def train_cuml(X, y, n_trees=NUM_TREES, max_depth=MAX_DEPTH):
    from cuml.ensemble import RandomForestClassifier
    model = RandomForestClassifier(
        max_depth=max_depth, n_estimators=n_trees, random_state=RANDOM_SEED
    )
    return model.fit(X, y)

def train_skl(X, y, n_trees=NUM_TREES, max_depth=MAX_DEPTH):
    from sklearn.ensemble import RandomForestClassifier
    model = RandomForestClassifier(
        max_depth=max_depth, n_estimators=n_trees, random_state=RANDOM_SEED
    )
    return model.fit(X, y)
    
    
if MODEL_FORMAT in ('xgboost_json', 'xgboost_bin'):
    if MODEL_PATH is not None:
        # Load model just as a reference for later
        import xgboost as xgb
        model = xgb.Booster()
        model.load_model(MODEL_PATH)
        print('Congratulations! Your model is already in a natively-supported format')
    else:
        model = train_xgboost(X, y)
elif MODEL_FORMAT == 'lightgbm':
    if MODEL_PATH is not None:
        # Load model just as a reference for later
        import lightgbm as lgb
        model = lgb.Booster(model_file=MODEL_PATH)
        print('Congratulations! Your model is already in a natively-supported format')
    else:
        model = train_lightgbm(X, y)
elif MODEL_FORMAT in ('cuml_pkl', 'skl_pkl', 'cuml_joblib', 'skl_joblib'):
    if MODEL_PATH is not None:
        if MODEL_FORMAT in ('cuml_pkl', 'skl_pkl'):
            # Load model just as a reference for later
            import pickle
            # pickle.load requires an open binary file object, not a path
            with open(MODEL_PATH, 'rb') as file_:
                model = pickle.load(file_)
            print(
                "While pickle files are not natively supported, we will use a script to"
                " convert your model to a Treelite checkpoint later in this notebook."
            )
        else:
            print("Loading model from joblib file in order to convert it to Treelite checkpoint...")
            import joblib
            model = joblib.load(MODEL_PATH)
    elif MODEL_FORMAT.startswith('cuml'):
        model = train_cuml(X, y)
    elif MODEL_FORMAT.startswith('skl'):
        model = train_skl(X, y)

## The Model Repository
Triton expects models to be stored in a specific directory structure. We will create this directory structure now and then either serialize our model directly into it or, if a trained model was provided, copy the serialized model there.

Each model requires a configuration file stored in `$MODEL_REPO/$MODEL_NAME/config.pbtxt` and a model file stored in `$MODEL_REPO/$MODEL_NAME/$MODEL_VERSION/$MODEL_FILENAME`. Note that Triton supports storing multiple versions of a model in directories with different `$MODEL_VERSION` numbers, starting from `1`.
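To make the layout concrete, the sketch below builds an empty repository with two versions of one model in a temporary directory and prints the resulting relative paths. The model name, version numbers, and file name are placeholders chosen for illustration, and the files it creates are empty rather than servable models.

```python
import os
import tempfile


def build_layout(repo_root, model_name, versions, model_filename):
    """Create config.pbtxt plus one empty model file per version; return relative paths."""
    model_dir = os.path.join(repo_root, model_name)
    os.makedirs(model_dir, exist_ok=True)

    # One configuration file per model, shared by all versions
    created = [os.path.join(model_dir, "config.pbtxt")]

    # One numbered subdirectory per version, each holding a model file
    for version in versions:
        version_dir = os.path.join(model_dir, str(version))
        os.makedirs(version_dir, exist_ok=True)
        created.append(os.path.join(version_dir, model_filename))

    for path in created:
        with open(path, "w"):
            pass  # Empty placeholder file

    return sorted(os.path.relpath(path, repo_root) for path in created)


with tempfile.TemporaryDirectory() as repo_root:
    layout = build_layout(repo_root, "example_model", [1, 2], "xgboost.json")

for path in layout:
    print(path)
```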

In [7]:
import os
import shutil

MODEL_NAME = 'example_model'
MODEL_VERSION = 1
MODEL_REPO = os.path.abspath('data/model_repository')
MODEL_DIR = os.path.join(MODEL_REPO, MODEL_NAME)
VERSIONED_DIR = os.path.join(MODEL_DIR, str(MODEL_VERSION))

os.makedirs(VERSIONED_DIR, exist_ok=True)

# We will use the following variables to record information from the serialization
# process that we will require later
model_path = None
model_format = None

## Example 1.1: Serializing an XGBoost model

In [8]:
if MODEL_FORMAT == 'xgboost_json':
    # This is the default filename expected for XGBoost JSON models. It is recommended
    # that you stick with the default to avoid additional configuration.
    model_basename = 'xgboost.json'
    model_path = os.path.join(VERSIONED_DIR, model_basename)
    
    model_format = 'xgboost_json'
elif MODEL_FORMAT == 'xgboost_bin':
    # This is the default filename expected for XGBoost binary models. It is recommended
    # that you stick with the default to avoid additional configuration.
    model_basename = 'xgboost.model'
    model_path = os.path.join(VERSIONED_DIR, model_basename)
    
    # This is the format name Triton uses to indicate XGBoost binary models
    model_format = 'xgboost'

if MODEL_FORMAT.startswith('xgboost'):
    if MODEL_PATH is not None:  # Just need to copy existing file...
        shutil.copy(MODEL_PATH, model_path)
    else:
        model.save_model(model_path)  # XGB derives format from extension

## Example 1.2 Serializing a LightGBM model

In [9]:
if MODEL_FORMAT == 'lightgbm':
    # This is the default filename expected for LightGBM text models. It is recommended
    # that you stick with the default to avoid additional configuration.
    model_basename = 'model.txt'
    model_path = os.path.join(VERSIONED_DIR, model_basename)
    
    model_format = MODEL_FORMAT
    
    if MODEL_PATH is not None:  # Just need to copy existing file...
        shutil.copy(MODEL_PATH, model_path)
    else:
        model.save_model(model_path)    

## Example 1.3 Serializing an in-memory Scikit-Learn model
<a id='example_1.3'></a>The following will show how to serialize a SKL model from Python directly to a Treelite checkpoint format. This could be a model that you have just trained or a model that you have e.g. loaded from Joblib. Again it is strongly recommended that you **save trained models in Pickle/Joblib as well as Treelite** since Treelite provides no compatibility guarantees between versions.

In [10]:
if model is not None and MODEL_FORMAT.startswith('skl'):
    import pickle
    archival_path = os.path.join(VERSIONED_DIR, 'model.pkl')
    # pickle.dump requires an open binary file object, not a path
    with open(archival_path, 'wb') as file_:
        pickle.dump(model, file_)  # Create archival pickled version
    
    # This is the default filename expected for Treelite checkpoint models. It is recommended
    # that you stick with the default to avoid additional configuration.
    model_basename = 'checkpoint.tl'
    model_path = os.path.join(VERSIONED_DIR, model_basename)
    
    model_format = 'treelite_checkpoint'
    
    import treelite
    tl_model = treelite.sklearn.import_model(model)
    tl_model.serialize(model_path)

## Example 1.4 Serializing an in-memory cuML model
<a id='example_1.4'></a>The following will show how to serialize a cuML model from Python directly to a Treelite checkpoint format. This could be a model that you have just trained or a model that you have e.g. loaded from Joblib. Again it is strongly recommended that you **save trained models in Pickle/Joblib as well as Treelite** since Treelite provides no compatibility guarantees between versions.

In [11]:
if model is not None and MODEL_FORMAT.startswith('cuml'):
    import pickle
    archival_path = os.path.join(VERSIONED_DIR, 'model.pkl')
    # pickle.dump requires an open binary file object, not a path
    with open(archival_path, 'wb') as file_:
        pickle.dump(model, file_)  # Create archival pickled version
    
    # This is the default filename expected for Treelite checkpoint models. It is recommended
    # that you stick with the default to avoid additional configuration.
    model_basename = 'checkpoint.tl'
    model_path = os.path.join(VERSIONED_DIR, model_basename)
    
    model_format = 'treelite_checkpoint'
    
    model.convert_to_treelite_model().to_treelite_checkpoint(model_path)

## Example 1.5 Converting a pickled Scikit-Learn model
For convenience, the FIL backend provides a script which can be used to convert a pickle file containing a Scikit-Learn model directly to a Treelite checkpoint file. If you do not have access to that script or prefer to work directly from Python, you can always load the pickled model into memory and then serialize it as in [Example 1.3](#example_1.3).

In [12]:
if MODEL_PATH is not None and MODEL_FORMAT == 'skl_pkl':
    archival_path = os.path.join(VERSIONED_DIR, 'model.pkl')
    shutil.copy(MODEL_PATH, archival_path)
    
    !../../scripts/convert_sklearn {archival_path}

## Example 1.6 Converting a pickled cuML model
For convenience, the FIL backend provides a script which can be used to convert a pickle file containing a cuML model directly to a Treelite checkpoint file. If you do not have access to that script or prefer to work directly from Python, you can always load the pickled model into memory and then serialize it as in [Example 1.4](#example_1.4).

In [13]:
if MODEL_PATH is not None and MODEL_FORMAT == 'cuml_pkl':
    archival_path = os.path.join(VERSIONED_DIR, 'model.pkl')
    shutil.copy(MODEL_PATH, archival_path)
    
    !python ../../scripts/convert_cuml.py {archival_path}

# FAQ 2: How do I execute models on CPU only? On GPU?

In addition to a serialized model file, you must provide a `config.pbtxt` configuration file for each model you wish to serve with the FIL backend for Triton. Within that file, it is possible to specify whether a model will run on CPU or GPU and how many instances of the model you wish to serve. For example, adding the following entry to the configuration file will create one instance of the model for each available GPU and run those instances each on their own dedicated GPU:

```pbtxt
  instance_group [{ kind: KIND_GPU }]
```

If you wish to instead run exactly three instances on CPU, the following entry can be used:
```pbtxt
  instance_group [
    {
      count: 3
      kind: KIND_CPU
    }
  ]
```
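If you generate configuration files from Python, the two entries above can be produced by a small helper. This is only a sketch: `instance_group_entry` is a hypothetical function name, and it handles just the `kind` and `count` fields and the two kinds discussed in this section.

```python
def instance_group_entry(kind, count=None):
    """Return an instance_group entry for config.pbtxt as a single line of text."""
    if kind not in ("KIND_GPU", "KIND_CPU"):
        raise ValueError(f"This sketch only handles KIND_GPU and KIND_CPU, got {kind}")
    fields = []
    if count is not None:
        fields.append(f"count: {count}")
    fields.append(f"kind: {kind}")
    return "instance_group [{ " + " ".join(fields) + " }]"


# One instance per available GPU
print(instance_group_entry("KIND_GPU"))
# Exactly three instances on CPU
print(instance_group_entry("KIND_CPU", count=3))
```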

In the following example, we will create a configuration file that can be used to serve your model on either CPU or GPU depending on the value of the `USE_GPU` flag set earlier in this notebook.

# Example 2: Generating a configuration file

Based on the information provided about your model in previous cells, we can now construct a `config.pbtxt` that can be used to run that model on Triton. We will generate the configuration text and save it to the appropriate location.

For full information on configuration options, check out the FIL backend [documentation](https://github.com/triton-inference-server/fil_backend#configuration). For a detailed example of configuration file construction, you can also check out the [introductory notebook](https://nbviewer.org/github/triton-inference-server/fil_backend/blob/main/notebooks/categorical-fraud-detection/Fraud_Detection_Example.ipynb#The-Configuration-File).

In [14]:
# Maximum size in bytes for input and output arrays
MAX_MEMORY_BYTES = 60_000_000
bytes_per_sample = (NUM_FEATURES + NUM_CLASSES) * 4
max_batch_size = MAX_MEMORY_BYTES // bytes_per_sample

# Select deployment hardware (GPU or CPU)
if USE_GPU:
    instance_kind = 'KIND_GPU'
else:
    instance_kind = 'KIND_CPU'

config_text = f"""backend: "fil"
max_batch_size: {max_batch_size}
input [                                 
 {{  
    name: "input__0"
    data_type: TYPE_FP32
    dims: [ {NUM_FEATURES} ]                    
  }} 
]
output [
 {{
    name: "output__0"
    data_type: TYPE_FP32
    dims: [ 1 ]
  }}
]
instance_group [{{ kind: {instance_kind} }}]
parameters [
  {{
    key: "model_type"
    value: {{ string_value: "{model_format}" }}
  }},
  {{
    key: "predict_proba"
    value: {{ string_value: "false" }}
  }},
  {{
    key: "output_class"
    value: {{ string_value: "true" }}
  }},
  {{
    key: "threshold"
    value: {{ string_value: "0.5" }}
  }},
  {{
    key: "storage_type"
    value: {{ string_value: "AUTO" }}
  }}
]

dynamic_batching {{}}"""

config_path = os.path.join(MODEL_DIR, 'config.pbtxt')
with open(config_path, 'w') as file_:
    file_.write(config_text)
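The batch-size arithmetic at the top of the cell above can be checked by hand. The cell budgets 4 bytes, the width of a 32-bit float, for each input feature and for each of `NUM_CLASSES` output values, then divides the memory budget by that per-sample size. The sketch below repeats the calculation as a standalone function; `max_batch_size_for` is a hypothetical helper name.

```python
BYTES_PER_FLOAT32 = 4


def max_batch_size_for(num_features, num_classes, max_memory_bytes):
    """Largest batch whose inputs and outputs fit within max_memory_bytes."""
    bytes_per_sample = (num_features + num_classes) * BYTES_PER_FLOAT32
    return max_memory_bytes // bytes_per_sample


# Default values used in this notebook: 32 features, 2 classes, 60 MB budget
print(max_batch_size_for(32, 2, 60_000_000))
```

With the notebook defaults, each sample takes (32 + 2) × 4 = 136 bytes, so the maximum batch size is 60,000,000 // 136 = 441176.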

# FAQ 3: How can I quickly test configuration options?
Sometimes it is useful to be able to quickly iterate on the options available in the `config.pbtxt` file for your model. While it is not recommended for production deployments, Triton offers a "polling" mode which will automatically reload models when their configurations change. To use this option, launch the server with the `--model-control-mode=poll` flag. After changing the configuration, wait a few seconds for the model to reload, and then Triton will be ready to handle requests with the new configuration.

# Example 3: Launching the Triton server with polling mode
In the following cell, we will launch the server with the model repository we set up previously in this notebook. We will use polling mode in order to allow us to tweak the configuration file and observe the impact of our changes.

In [15]:
TRITON_IMAGE = 'nvcr.io/nvidia/tritonserver:22.05-py3'
# Only request GPU access from Docker when deploying on GPU
GPU_FLAGS = '--gpus all' if USE_GPU else ''
!docker run {GPU_FLAGS} -d -p 8000:8000 -p 8001:8001 -p 8002:8002 -v {MODEL_REPO}:/models --name tritonserver {TRITON_IMAGE} tritonserver --model-repository=/models --model-control-mode=poll

Unable to find image 'nvcr.io/nvidia/tritonserver:22.05-py3' locally
22.05-py3: Pulling from nvidia/tritonserver

In [17]:
import time
time.sleep(10)  # Wait for server to come up
!docker logs tritonserver


== Triton Inference Server ==

NVIDIA Release 22.05 (build 38317651)
Triton Server Version 2.22.0

Copyright (c) 2018-2022, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

  Using driver version 460.32.03 which has support for CUDA 11.2.  This container
  was built with CUDA 11.7 and will be run in Minor Version Compatibility mode.
  CUDA Forward Compatibility is preferred over Minor Version Compatibility for use
  with this container but was unavailable:
  [[System has unsupported display driver / cuda driver combination (CUDA_ERROR_SYSTEM_DRIVER_MISMATCH) cuInit()=803]]
  See https://docs.nvidia.com/deploy/

In later sections, we'll take advantage of polling mode to make tweaks to our configuration and observe their impact.
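The fixed ten-second sleep used above is simple, but it may be too short on a slow machine and needlessly long on a fast one. A possible alternative, sketched below, is to poll Triton's HTTP readiness endpoint until it responds. This assumes the server's HTTP port is published on `localhost:8000`, as in the `docker run` command above. `wait_for` and `triton_is_ready` are hypothetical helper names.

```python
import time
import urllib.request


def wait_for(check, timeout=60.0, interval=1.0):
    """Call check() repeatedly until it returns True or timeout seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def triton_is_ready(base_url="http://localhost:8000"):
    """Return True if the server answers its readiness endpoint with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v2/health/ready", timeout=2) as response:
            return response.status == 200
    except OSError:
        # Covers connection errors and HTTP error responses (URLError subclasses OSError)
        return False
```

In this notebook, a call such as `wait_for(triton_is_ready)` could stand in for `time.sleep(10)` before inspecting the logs.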

# FAQ 4: My models are exhausting Triton's memory. What can I do?
Tree-based models tend to have fairly modest memory needs, but when using several models together or when trying to process very large batches, you'll sometimes run into memory constraints.

For models deployed on GPU, Triton allocates a device memory pool when it is launched, and the FIL backend only allocates device memory from this pool, so one option is to simply increase the size of the pool until you reach hardware limits on available memory.
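As a sketch of this first option, the snippet below assembles a server command line that requests a larger pool. It assumes the `--cuda-memory-pool-byte-size=<GPU id>:<bytes>` option of the `tritonserver` executable; check `tritonserver --help` in your container to confirm the option and its default for your release. The 256 MiB figure and the helper name `triton_command` are arbitrary illustrations, not recommendations.

```python
def triton_command(model_repo, pool_bytes_by_gpu):
    """Build a tritonserver argument list with one memory-pool option per GPU."""
    command = ["tritonserver", f"--model-repository={model_repo}"]
    for gpu_id, pool_bytes in sorted(pool_bytes_by_gpu.items()):
        # Assumed option format: <GPU id>:<pool size in bytes>
        command.append(f"--cuda-memory-pool-byte-size={gpu_id}:{pool_bytes}")
    return command


# Request a 256 MiB pool on GPU 0 (value chosen only for illustration)
print(" ".join(triton_command("/models", {0: 256 * 1024 * 1024})))
```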

For models deployed on CPU or for deployments which have exceeded the hardware limits on available device memory, you may wish to instead reduce the memory consumption of models by tweaking configuration options.

## FAQ 4.1 How can I decrease the memory consumed by a model?
## FAQ 4.2 How do I increase Triton's memory pool?

In [18]:
!docker rm -f tritonserver

tritonserver
