<img src="https://developer.download.nvidia.com/notebooks/dlsw-notebooks/merlin_merlin_02-deploying-multi-stage-recsys-with-merlin-systems/nvidia_logo.png" style="width: 90px; float: right;">

# Deploying Online Multi-Stage RecSys with Triton Inference Server

At this point, we expect that you have already executed the first notebook `01-Building-Online-Multi-Stage-Recsys-Components.ipynb` and exported all the required files and models. 

We are going to generate recommended items for a given user query (user_id) by following the steps described in the figure below.

![img](./img/OnlineMultiStageRecsys.png)

We will serve the multi-stage recommender on [Triton Inference Server](https://github.com/triton-inference-server/server) (TIS). Below, we go through these steps one by one and show how each is used when serving a multi-stage system on Triton.


This notebook is created using the latest stable [merlin-tensorflow](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/merlin/containers/merlin-tensorflow/tags) container. 

### Import required libraries and functions

In [2]:
%pip install protobuf==3.20

Note: you may need to restart the kernel to use updated packages.


At this step, we assume you have already installed the feast and faiss-gpu (or faiss-cpu) libraries when running the first notebook `01-Building-Recommender-Systems-with-Merlin.ipynb`. 

In case you need to install them for running this example on GPU, execute the following script in a cell.
```
%pip install "feast<0.20" faiss-gpu
```
or the following script in a cell for CPU.
```
%pip install tensorflow-cpu "feast<0.20" faiss-cpu
```


In [3]:
import os
import numpy as np
import pandas as pd
import feast
import faiss

from nvtabular import ColumnSchema, Schema

from merlin.systems.dag.ensemble import Ensemble
from merlin.systems.dag.ops.session_filter import FilterCandidates
from merlin.systems.dag.ops.softmax_sampling import SoftmaxSampling
from merlin.systems.dag.ops.tensorflow import PredictTensorflow
from merlin.systems.dag.ops.unroll_features import UnrollFeatures
from merlin.systems.triton.utils import send_triton_request

import tensorflow as tf
import shutil

In [4]:
# Define output path for data
BASE_DIR = os.environ['PWD']
FEATURE_STORE_DIR = os.path.join(BASE_DIR, "feature_repo/")
MODEL_DIR = os.path.join(BASE_DIR, "models/")

DATA_DIR = os.path.join(BASE_DIR, "data")
DLRM_DIR = os.path.join(DATA_DIR, "dlrm")
QUERY_TOWER_DIR = os.path.join(DATA_DIR, "query_tower")
OUTPUT_DATA_DIR = os.path.join(DATA_DIR, "processed")
OUTPUT_RETRIEVAL_DATA_DIR = os.path.join(OUTPUT_DATA_DIR, "retrieval")


In [6]:
feature_store = feast.FeatureStore(FEATURE_STORE_DIR)

In [10]:
view = feature_store.get_feature_view("item_features")

In [11]:
view.entities[0]


'item_id'

In [12]:
view.features

[item_category-Int32, item_shop-Int32, item_brand-Int32, item_id_raw-Int32]

## Define Triton Ensemble

Define paths for ranking model, retrieval model

In [26]:
retrieval_model_path = os.path.join(MODEL_DIR, "candidate-retrieval/1/model.savedmodel/")

if not os.path.isdir(retrieval_model_path):
    shutil.copytree(QUERY_TOWER_DIR, retrieval_model_path)

    
ranking_model_path = os.path.join(MODEL_DIR, "ranking/1/model.savedmodel/")

if not os.path.isdir(ranking_model_path):
    shutil.copytree(DLRM_DIR, ranking_model_path)

In [27]:
from merlin.systems.dag.ops.feast import QueryFeast 

user_features = ["user_id_raw"] >> QueryFeast.from_feature_view(
    store=feature_store,
    view="user_features",
    column="user_id_raw",
    include_id=False,
)


Retrieve the top-K candidate items that are relevant for a given user using the retrieval model. We use the `PredictTensorflow()` operator, which takes a TensorFlow model and packages it so that TIS can run it with the TensorFlow backend.

In [11]:
# prevent TF from claiming all GPU memory
from merlin.models.loader.tf_utils import configure_tensorflow

configure_tensorflow()

2023-01-17 20:04:42.149365: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-01-17 20:04:42.151577: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-01-17 20:04:42.153074: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero


<function tensorflow.python.dlpack.dlpack.from_dlpack(dlcapsule)>

In [13]:
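from merlin.systems.dag.ops.faiss import QueryFaiss

# NOTE: `faiss_index_path` must be set to the location of your Faiss index file
# before running this cell.
topk_retrieval = int(
    os.environ.get("topk_retrieval", "100")
)
# Pass the user features through the query tower, then look up the nearest items in Faiss
retrieval = (
    user_features
    >> PredictTensorflow(retrieval_model_path)
    >> QueryFaiss(faiss_index_path, topk=topk_retrieval)
)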

2022-09-14 15:28:46.303447: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-09-14 15:28:47.443330: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 16249 MB memory:  -> device: 0, name: Quadro GV100, pci bus id: 0000:2d:00.0, compute capability: 7.0
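
To make the retrieval step more concrete, here is a minimal NumPy sketch of what a top-K lookup computes conceptually: score every item embedding against the user (query) embedding and keep the K best. This is a brute-force illustration under the assumption of inner-product similarity, not how Faiss or `QueryFaiss` is implemented. The function name `topk_inner_product` and the toy embeddings are hypothetical.

```python
import numpy as np


def topk_inner_product(query, item_embeddings, k):
    """Return indices and scores of the k items with the highest inner product."""
    scores = item_embeddings @ query  # one similarity score per item
    k = min(k, scores.shape[0])
    # argpartition selects the k best candidates without a full sort ...
    candidate_idx = np.argpartition(-scores, k - 1)[:k]
    # ... and only those k candidates are then sorted by descending score.
    order = np.argsort(-scores[candidate_idx])
    top_idx = candidate_idx[order]
    return top_idx, scores[top_idx]


# Toy data (hypothetical): 5 items with 3-dimensional embeddings.
item_embeddings = np.array(
    [
        [1.0, 0.0, 0.0],
        [0.0, 1.0, 0.0],
        [0.5, 0.5, 0.0],
        [0.0, 0.0, 1.0],
        [0.9, 0.1, 0.0],
    ],
    dtype=np.float32,
)
user_embedding = np.array([1.0, 0.2, 0.0], dtype=np.float32)

candidate_ids, candidate_scores = topk_inner_product(user_embedding, item_embeddings, k=3)
print(candidate_ids, candidate_scores)
```

An approximate nearest-neighbour index trades some of this exactness for speed, which matters once the item catalogue is too large to scan on every request.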


Fetch the item features from the feature store for the candidate items returned by the retrieval step above.

In [14]:
item_features = retrieval["candidate_ids"] >> QueryFeast.from_feature_view(
    store=feature_store,
    view="item_features",
    column="candidate_ids",
    output_prefix="item",
    include_id=True,
)

Merge the user features and item features to create the full set of combined features that were used in model training. We use the `UnrollFeatures` operator, which takes a target column and joins the "unroll" columns to the target. This helps when broadcasting a single set of user features to a set of items.

In [15]:
user_features_to_unroll = [
    "user_id",
    "user_shops",
    "user_profile",
    "user_group",
    "user_gender",
    "user_age",
    "user_consumption_2",
    "user_is_occupied",
    "user_geography",
    "user_intentions",
    "user_brands",
    "user_categories",
]

combined_features = item_features >> UnrollFeatures(
    "item_id", user_features[user_features_to_unroll]
)
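
The broadcasting idea behind this step can be sketched in plain NumPy: the single row of user features is repeated once per candidate item so that every item row carries the same user context. This is only an illustration of the concept, not the `UnrollFeatures` implementation. The helper `unroll_user_features` and the toy values are hypothetical.

```python
import numpy as np


def unroll_user_features(item_features, user_features):
    """Repeat each single-row user feature once per candidate item and merge."""
    num_items = len(next(iter(item_features.values())))
    combined = dict(item_features)
    for name, value in user_features.items():
        combined[name] = np.repeat(np.asarray(value), num_items)
    return combined


# Toy data (hypothetical): three candidate items and one user.
item_features = {
    "item_id": np.array([11, 42, 7], dtype=np.int32),
    "item_category": np.array([3, 1, 3], dtype=np.int32),
}
user_features_row = {
    "user_id": np.array([5], dtype=np.int32),
    "user_age": np.array([2], dtype=np.int32),
}

combined = unroll_user_features(item_features, user_features_row)
for name, column in combined.items():
    print(name, column)
```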

Rank the combined features using the trained ranking model, which is a DLRM model in this example. We pass the path of the ranking model to the `PredictTensorflow()` operator.

In [16]:
ranking = combined_features >> PredictTensorflow(ranking_model_path)

For the ordering we use the `SoftmaxSampling()` operator. Given the item ids and their predicted relevance scores, this operator orders the items by sampling from the softmax of the scores, which introduces some randomization into the ordering, and finally returns the top-k ordered items.

In [17]:
top_k = 10
ordering = combined_features["item_id_raw"] >> SoftmaxSampling(
    relevance_col=ranking["click/binary_classification_task"], topk=top_k, temperature=20.0
)
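
As a rough illustration of softmax-based ordering, the sketch below uses the Gumbel-top-k trick: adding Gumbel noise to the scaled scores and sorting draws items without replacement with probabilities proportional to the softmax of those scores. This is a stand-in technique for illustration and is not claimed to match the `SoftmaxSampling` implementation. In particular, dividing the scores by `temperature` is an assumption, and the function name and toy values are hypothetical.

```python
import numpy as np


def softmax_sample_topk(item_ids, scores, k, temperature=1.0, seed=None):
    """Sample an ordering of k items, favouring higher scores (Gumbel-top-k)."""
    rng = np.random.default_rng(seed)
    item_ids = np.asarray(item_ids)
    # Assumed convention: lower temperature -> closer to a plain sort by score.
    logits = np.asarray(scores, dtype=np.float64) / temperature
    # Perturbing the logits with Gumbel noise and sorting is equivalent to
    # sampling without replacement from softmax(logits).
    perturbed = logits + rng.gumbel(size=logits.shape)
    order = np.argsort(-perturbed)[: min(k, len(item_ids))]
    return item_ids[order]


# Toy data (hypothetical item ids and relevance scores).
ids = np.array([101, 102, 103, 104], dtype=np.int32)
relevance = np.array([0.1, 0.9, 0.5, 0.3])

sampled = softmax_sample_topk(ids, relevance, k=3, temperature=0.5, seed=0)
near_greedy = softmax_sample_topk(ids, relevance, k=3, temperature=1e-6, seed=0)
print(sampled, near_greedy)
```

With a very small temperature the noise becomes negligible and the result is effectively the items sorted by score. Larger temperatures let lower-scored items surface more often.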

### Export Graph as Ensemble
The last step is to create the ensemble artifacts that TIS can consume. To make these artifacts, we use the `Ensemble` class. This class represents an entire ensemble consisting of multiple models that run sequentially in TIS, initiated by an inference request. It is responsible for interpreting the graph and exporting the correct files for TIS.

When we create an `Ensemble` object, we pass in the graph and a schema representing the starting input of the graph. After we create the ensemble object, we export the graph, supplying an export path to the `ensemble.export()` function. This returns an ensemble config, which represents the entire inference pipeline, and a list of node-specific configs.

Create the folder to export the models and config files.

In [18]:
if not os.path.isdir(os.path.join(BASE_DIR, 'poc_ensemble')):
    os.makedirs(os.path.join(BASE_DIR, 'poc_ensemble'))

Create a request schema that we are going to use when sending a request to Triton Inference Server (TIS).

In [19]:
request_schema = Schema(
    [
        ColumnSchema("user_id_raw", dtype=np.int32),
    ]
)

In [20]:
# define the path where all the models and config files are exported to
export_path = os.path.join(BASE_DIR, 'poc_ensemble')

ensemble = Ensemble(ordering, request_schema)
ens_config, node_configs = ensemble.export(export_path)

# return the output column name
outputs = ensemble.graph.output_schema.column_names
print(outputs)

['ordered_ids']


Let's check our export_path structure

In [21]:
import seedir as sd
sd.seedir(export_path, style='lines', itemlimit=10, depthlimit=5, exclude_folders=['.ipynb_checkpoints', '__pycache__'], sort=True)

poc_ensemble/
├─0_queryfeast/
│ ├─1/
│ │ └─model.py
│ └─config.pbtxt
├─1_predicttensorflow/
│ ├─1/
│ │ └─model.savedmodel/
│ │   ├─assets/
│ │   ├─keras_metadata.pb
│ │   ├─saved_model.pb
│ │   └─variables/
│ │     ├─variables.data-00000-of-00001
│ │     └─variables.index
│ └─config.pbtxt
├─2_queryfaiss/
│ ├─1/
│ │ ├─index.faiss/
│ │ │ └─index.faiss
│ │ └─model.py
│ └─config.pbtxt
├─3_queryfeast/
│ ├─1/
│ │ └─model.py
│ └─config.pbtxt
├─4_unrollfeatures/
│ ├─1/
│ │ └─model.py
│ └─config.pbtxt
├─5_predicttensorflow/
│ ├─1/
│ │ └─model.savedmodel/
│ │   ├─assets/
│ │   ├─keras_metadata.pb
│ │   ├─saved_model.pb
│ │   └─variables/
│ │     ├─variables.data-00000-of-00001
│ │     └─variables.index
│ └─config.pbtxt
├─6_softmaxsampling/
│ ├─1/
│ │ └─model.py
│ └─config.pbtxt
└─ensemble_model/
  ├─1/
  └─config.pbtxt
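
Before starting the server, it can be handy to sanity-check that an exported directory follows the layout shown above: one folder per model, each with a `config.pbtxt` and a numeric version subdirectory. The following is a minimal sketch using only the standard library. The helper `check_model_repository` is hypothetical, and the demo builds a throwaway directory rather than reading the real `export_path`.

```python
import os
import tempfile


def check_model_repository(repo_path):
    """Map each model folder to a list of layout problems (empty list if none)."""
    problems = {}
    for name in sorted(os.listdir(repo_path)):
        model_dir = os.path.join(repo_path, name)
        if not os.path.isdir(model_dir):
            continue
        issues = []
        if not os.path.isfile(os.path.join(model_dir, "config.pbtxt")):
            issues.append("missing config.pbtxt")
        versions = [
            d
            for d in os.listdir(model_dir)
            if d.isdigit() and os.path.isdir(os.path.join(model_dir, d))
        ]
        if not versions:
            issues.append("missing numeric version directory")
        problems[name] = issues
    return problems


# Demo on a temporary directory (hypothetical contents, not the real export).
with tempfile.TemporaryDirectory() as repo:
    os.makedirs(os.path.join(repo, "0_queryfeast", "1"))
    open(os.path.join(repo, "0_queryfeast", "config.pbtxt"), "w").close()
    os.makedirs(os.path.join(repo, "broken_model"))
    report = check_model_repository(repo)

print(report)
```

To check the real export, call `check_model_repository(export_path)` and look for any model whose list of problems is not empty.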


### Starting Triton Server

It is time to deploy all the models as an ensemble model to [Triton Inference Server](https://github.com/triton-inference-server) (TIS). After we export the ensemble, we are ready to start TIS. You can start the Triton server by running the following command in your terminal:

```
tritonserver --model-repository=/ensemble_export_path/ --backend-config=tensorflow,version=2
```

For the `--model-repository` argument, specify the same path as the `export_path` that you specified previously in the `ensemble.export` method. This command will launch the server and load all the models to the server. Once all the models are loaded successfully, you should see `READY` status printed out in the terminal for each loaded model.
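
Besides watching the terminal output, readiness can also be polled programmatically. The sketch below queries the server's HTTP health endpoint with the standard library. It assumes the server is reachable at `http://localhost:8000` (Triton's default HTTP port) and exposes the `/v2/health/ready` route; adjust the URL if your deployment differs. The helper name `triton_is_ready` is hypothetical.

```python
import urllib.error
import urllib.request


def triton_is_ready(base_url="http://localhost:8000", timeout=2.0):
    """Return True if the readiness endpoint answers with HTTP 200, else False."""
    url = base_url.rstrip("/") + "/v2/health/ready"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx response: treat as not ready.
        return False


print("Triton ready:", triton_is_ready())
```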

### Retrieving Recommendations from Triton

Once our models are successfully loaded into TIS, we can send a request to TIS and get a response for our query with the `send_triton_request` utility function. 

Let's send a request to TIS for a given `user_id_raw` value.

In [22]:
# read in data for request
from merlin.core.dispatch import make_df

# create a request to be sent to TIS
request = make_df({"user_id_raw": [7]})
request["user_id_raw"] = request["user_id_raw"].astype(np.int32)
print(request)

   user_id_raw
0            7


Let's return raw item ids from TIS as top-k recommended items per given request.

In [25]:
response = send_triton_request(request_schema, request, outputs)
response

{'ordered_ids': array([[117],
        [415],
        [228],
        [985],
        [ 76],
        [410],
        [193],
        [120],
        [ 87],
        [139]], dtype=int32)}
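
The response maps each output column name to a NumPy array of shape `(top_k, 1)`. If a flat Python list of item ids is more convenient downstream, a small post-processing step such as the following sketch can be used. The `response` dictionary is rebuilt here from the output shown above so that the snippet runs on its own.

```python
import numpy as np

# Same structure and values as the response printed above.
response = {
    "ordered_ids": np.array(
        [[117], [415], [228], [985], [76], [410], [193], [120], [87], [139]],
        dtype=np.int32,
    )
}

# Flatten the (top_k, 1) array into a plain list of item ids, best first.
recommended_item_ids = response["ordered_ids"].ravel().tolist()
print(recommended_item_ids)
```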

That's it! You have finished deploying a multi-stage recommender system on Triton Inference Server using the Merlin framework.