<img src="https://developer.download.nvidia.com/notebooks/dlsw-notebooks/merlin_merlin_02-deploying-multi-stage-recsys-with-merlin-systems/nvidia_logo.png" style="width: 90px; float: right;">

# Deploying Online Multi-Stage RecSys with Triton Inference Server

This notebook assumes you have already run the first notebook, `01-Building-Online-Multi-Stage-Recsys-Components.ipynb`, and exported all the required files and models. If you have not run it, you can still obtain the datasets and pre-trained models by running:

```bash
cd Redis-Recsys
aws s3 cp s3://redisventures/merlin/merlin-recsys-data.zip ./data
```

We are going to generate recommended items for a given user query (user_id) by following the steps described in the figure below.

![img](https://raw.githubusercontent.com/RedisVentures/Redis-Recsys/master/assets/OnlineMultiStageRecsys.png)

We will serve the multi-stage recommender on [Triton Inference Server](https://github.com/triton-inference-server/server) (TIS).

### Import required libraries and functions

In [10]:
import warnings
warnings.filterwarnings("ignore")

import os
import numpy as np
import feast
import shutil

from nvtabular import ColumnSchema, Schema
from merlin.systems.dag.ensemble import Ensemble
from merlin.systems.dag.ops.session_filter import FilterCandidates
from merlin.systems.dag.ops.softmax_sampling import SoftmaxSampling
from merlin.systems.dag.ops.tensorflow import PredictTensorflow
from merlin.systems.dag.ops.unroll_features import UnrollFeatures
from merlin.systems.triton.utils import send_triton_request

In [2]:
# Define output path for data
BASE_DIR = "/workdir"
FEATURE_STORE_DIR = os.path.join(BASE_DIR, "feature_repo/")
TRITON_MODEL_REPO = os.path.join(BASE_DIR, "models/")

DATA_DIR = "/model-data/"
DLRM_DIR = os.path.join(DATA_DIR, "dlrm")
QUERY_TOWER_DIR = os.path.join(DATA_DIR, "query_tower")
OUTPUT_DATA_DIR = os.path.join(DATA_DIR, "processed")
OUTPUT_RETRIEVAL_DATA_DIR = os.path.join(OUTPUT_DATA_DIR, "retrieval")

## Define Triton Ensemble
To run our RecSys in Triton, we need to assemble the pieces that will run together as an ensemble.

### Setup Triton Model Repo

Define the paths for the retrieval and ranking models in the Triton Model Repo. We need to copy our trained models from `DATA_DIR` to `TRITON_MODEL_REPO` so that Triton can load them on startup.

In [3]:
retrieval_model_path = os.path.join(TRITON_MODEL_REPO, "1-user-embeddings/1/model.savedmodel/")
ranking_model_path = os.path.join(TRITON_MODEL_REPO, "5-ranking/1/model.savedmodel/")

# Copy the pretrained Query Tower model into our Triton Model Repository
if not os.path.isdir(retrieval_model_path):
    shutil.copytree(QUERY_TOWER_DIR, retrieval_model_path)

# Copy the pretrained DLRM ranking model into our Triton Model Repository
if not os.path.isdir(ranking_model_path):
    shutil.copytree(DLRM_DIR, ranking_model_path)

### Explore Triton Model Repo
Below we will take a look at the multi-stage ensemble that functions as a DAG of operations within Triton.

In [4]:
import seedir as sd

sd.seedir(
    TRITON_MODEL_REPO,
    style='lines',
    itemlimit=10,
    depthlimit=5,
    exclude_folders=['.ipynb_checkpoints', '__pycache__'],
    sort=True
)

models/
├─0-query-user-features/
│ ├─1/
│ │ └─model.py
│ └─config.pbtxt
├─1-user-embeddings/
│ ├─1/
│ │ └─model.savedmodel/
│ │   ├─keras_metadata.pb
│ │   ├─saved_model.pb
│ │   └─variables/
│ │     ├─variables.data-00000-of-00001
│ │     └─variables.index
│ └─config.pbtxt
├─2-redis-vss-candidates/
│ ├─1/
│ │ └─model.py
│ └─config.pbtxt
├─3-query-item-features/
│ ├─1/
│ │ └─model.py
│ └─config.pbtxt
├─4-unroll-features/
│ ├─1/
│ │ └─model.py
│ └─config.pbtxt
├─5-ranking/
│ ├─1/
│ │ └─model.savedmodel/
│ │   ├─.merlin/
│ │   │ └─input_schema.json
│ │   ├─keras_metadata.pb
│ │   ├─saved_model.pb
│ │   └─variables/
│ │     ├─variables.data-00000-of-00001
│ │     └─variables.index
│ └─config.pbtxt
├─6-softmax-sampling/
│ ├─1/
│ │ └─model.py
│ └─config.pbtxt
└─ensemble-model/
  ├─1/
  │ └─.gitkeep
  └─config.pbtxt


The subfolders in the model repo above (**prefixed 0 through 6**) represent distinct stages in the RecSys ensemble.

- `0-query-user-features/` - fetch user features from Redis.
- `1-user-embeddings/` - generate user embeddings from the Query Tower (TensorFlow) model.
- `2-redis-vss-candidates/` - perform vector similarity search (VSS) to find the k-nearest-neighbor (KNN) items using RediSearch.
- `3-query-item-features/` - fetch item features from Redis.
- `4-unroll-features/` - combine and unroll user and item features.
- `5-ranking/` - rank the top user/item pairs with the DLRM (TensorFlow) model.
- `6-softmax-sampling/` - sort all inputs in descending order, introduce some randomization via softmax sampling, and return top-k ordered items.
- `ensemble-model/` - tie stages 0-6 together into a single servable pipeline.
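To make stage 2 in the list above more concrete, here is a minimal brute-force sketch of a nearest-neighbor lookup by cosine similarity. It only illustrates the idea: RediSearch performs this search inside a vector index rather than with a Python loop, and the item ids, two-dimensional vectors, and function names below are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_items(user_embedding, item_embeddings, k):
    """Return the k (item_id, score) pairs most similar to the user, best first.

    item_embeddings is a dict mapping item_id -> embedding vector.
    """
    scored = [
        (item_id, cosine_similarity(user_embedding, vector))
        for item_id, vector in item_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy data: four items with 2-d embeddings.
item_embeddings = {
    101: [1.0, 0.0],
    102: [0.0, 1.0],
    103: [0.9, 0.1],
    104: [-1.0, 0.0],
}
candidates = knn_items([1.0, 0.0], item_embeddings, k=2)
print(candidates)
```

In the real pipeline the query vector would be the output of `1-user-embeddings/`, and the returned ids would be handed to `3-query-item-features/`.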

The `ensemble-model` contains the orchestration of all of the individual steps. To learn more about the general structure of a Triton model repo, see the [Triton model repository documentation](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_repository.md).
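Before starting the server, it can help to confirm that every model folder follows the layout shown in the tree above: a `config.pbtxt` plus at least one numbered version directory. The helper below is a minimal sketch under that assumption. `check_model_repo` is a hypothetical name, not part of Merlin or Triton, and the demo runs against a throwaway temporary directory rather than the real repo.

```python
import os
import tempfile


def check_model_repo(repo_path):
    """Map each model folder to a list of layout problems (empty list = looks fine)."""
    report = {}
    for name in sorted(os.listdir(repo_path)):
        model_dir = os.path.join(repo_path, name)
        # Skip plain files and hidden folders such as .ipynb_checkpoints.
        if name.startswith(".") or not os.path.isdir(model_dir):
            continue
        issues = []
        if not os.path.isfile(os.path.join(model_dir, "config.pbtxt")):
            issues.append("missing config.pbtxt")
        versions = [
            entry
            for entry in os.listdir(model_dir)
            if entry.isdigit() and os.path.isdir(os.path.join(model_dir, entry))
        ]
        if not versions:
            issues.append("no numeric version directory")
        report[name] = issues
    return report


# Demo on a temporary repo: one well-formed model and one empty folder.
with tempfile.TemporaryDirectory() as repo:
    os.makedirs(os.path.join(repo, "good-model", "1"))
    open(os.path.join(repo, "good-model", "config.pbtxt"), "w").close()
    os.makedirs(os.path.join(repo, "bad-model"))
    report = check_model_repo(repo)

print(report)
```

Pointing the same function at `TRITON_MODEL_REPO` should return an empty list for every stage if the repo matches the tree above.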

## Starting Triton Server

Now we can deploy all the models as a single ensemble to Triton Inference Server ([TIS](https://github.com/triton-inference-server)). With the ensemble exported to the model repository, we are ready to start TIS. You can start the Triton server by running the following command in a Jupyter terminal:

```bash
tritonserver --model-repository=/workdir/models --backend-config=tensorflow,version=2
```

*For the `--model-repository` argument, specify the path to the Triton Model Repo stored in the variable `TRITON_MODEL_REPO`.* This command launches the server and loads all the models. Once every model has loaded successfully, you should see a `READY` status printed in the terminal for each one.
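Instead of watching the terminal, you can also poll the server's HTTP readiness endpoint from Python. The sketch below uses only the standard library and assumes the server listens on `localhost:8000`, the address used by the `perf_analyzer` call later in this notebook. `server_is_ready` and `wait_until_ready` are hypothetical helper names.

```python
import time
import urllib.request


def server_is_ready(base_url="http://localhost:8000"):
    """Return True if GET /v2/health/ready answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v2/health/ready", timeout=2) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts, and HTTP error responses.
        return False


def wait_until_ready(check=server_is_ready, timeout_s=120.0, interval_s=2.0):
    """Call check() repeatedly until it returns True or timeout_s has elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling `wait_until_ready()` in a notebook cell before sending the first request avoids racing the model load. The `check` argument is injectable so the loop can be exercised without a running server.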

### Retrieving Recommendations from Triton

Once our models are loaded into TIS, we can send it a request and get a response for our query using the simple Python client in `client.py`.
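The contents of `client.py` are not reproduced in this notebook. As a rough illustration of what such a request can look like, the sketch below builds an inference request for Triton's HTTP/REST API using only the standard library. The input name `user_id_raw`, its `[1, 1]` shape, the model name `ensemble-model`, and port 8000 come from elsewhere in this notebook. The `INT32` datatype and the helper names are assumptions and may not match the real client.

```python
import json
import urllib.request


def build_infer_request(user_id, input_name="user_id_raw", datatype="INT32"):
    """Build a JSON inference payload for one user id (datatype is an assumption)."""
    return {
        "inputs": [
            {
                "name": input_name,
                "shape": [1, 1],
                "datatype": datatype,
                "data": [user_id],
            }
        ]
    }


def recommend(user_id, base_url="http://localhost:8000", model="ensemble-model"):
    """POST the payload to the model's infer endpoint and return the output tensors."""
    body = json.dumps(build_infer_request(user_id)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v2/models/{model}/infer",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read())["outputs"]


# Building the payload needs no running server.
payload = build_infer_request(23)
print(json.dumps(payload))
```

With the server running, `recommend(23)` would return the list of output tensors from the ensemble. Their names and shapes depend on how the ensemble was exported.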

Let's send a request to TIS for a given `user_id_raw` value. If you make multiple requests in a row for the same user, you should see slightly different results because of the randomness introduced by softmax sampling.

In [7]:
!python client.py --user 23

Finding recommendations for User 23
Recommended Product Ids in 0.7311577796936035 seconds
[[ 39]
 [260]
 [107]
 [ 91]
 [461]
 [ 47]
 [ 97]
 [105]
 [153]
 [ 58]
 [487]
 [296]
 [217]
 [204]
 [113]
 [180]]
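The run-to-run variation comes from the softmax sampling stage. The sketch below shows one common way to implement that kind of sampling: turn the ranking scores into softmax weights, then draw items without replacement with an "exponential race", where each item gets a random key and the smallest keys win. This is a standard-library illustration under assumed scores and item ids, not the code of the Merlin `SoftmaxSampling` operator, and the `temperature` parameter is an assumption.

```python
import math
import random


def softmax_sample_topk(item_ids, scores, k, temperature=1.0, rng=random):
    """Sample k distinct items, favoring high scores but not strictly sorting by them."""
    # Softmax weights, shifted by the max score for numerical stability.
    top_score = max(scores)
    weights = [math.exp((s - top_score) / temperature) for s in scores]
    # Exponential race: key = Exp(1) draw / weight, so larger weights tend to
    # get smaller keys. Items whose weight underflowed to 0 are pushed last.
    keys = [
        rng.expovariate(1.0) / w if w > 0 else math.inf
        for w in weights
    ]
    order = sorted(range(len(item_ids)), key=lambda i: keys[i])
    return [item_ids[i] for i in order[:k]]


# Hypothetical candidates and ranking scores.
ids = [101, 102, 103, 104, 105]
scores = [0.9, 0.7, 0.5, 0.2, 0.1]

picks = softmax_sample_topk(ids, scores, k=3, rng=random.Random(0))
print(picks)

# A very low temperature makes the weights so peaked that the result is
# effectively the plain descending sort by score.
greedy = softmax_sample_topk(ids, scores, k=3, temperature=1e-3, rng=random.Random(1))
print(greedy)
```

Raising the temperature flattens the weights and increases variety between requests. Lowering it moves the output toward the deterministic top-k.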


In [9]:
!perf_analyzer -m ensemble-model -u localhost:8000 --input-data=sample.json --shape=user_id_raw:1,1 -t 2

 Successfully read data for 1 stream/streams with 1 step/steps.
*** Measurement Settings ***
  Batch size: 1
  Service Kind: Triton
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 2
  Client: 
    Request count: 769
    Throughput: 42.7175 infer/sec
    Avg latency: 46744 usec (standard deviation 20083 usec)
    p50 latency: 36466 usec
    p90 latency: 85368 usec
    p95 latency: 86602 usec
    p99 latency: 88286 usec
    Avg HTTP time: 46738 usec (send/recv 51 usec + response wait 46687 usec)
  Server: 
    Inference count: 1541
    Execution count: 1541
    Successful request count: 1541
    Avg request latency: 46971 usec (overhead 16 usec + queue 13060 usec + compute 33895 usec)

  Composing models: 
  0-query-user-features, version: 
      Inference count: 1543
      Execution count: 1543
      Successful request count: 1543
      Avg request latency: 602

## Conclusion

That's it! You have finished deploying an online multi-stage recommender system on Triton Inference Server with Redis!