# 1. Overview

(TO BE UPDATED)

This notebook shows how to take the standard DLRM model trained in the HugeCTR_DLRM_Training notebook and deploy the saved model to Triton Inference Server. It then collects inference benchmarks with the Triton performance analyzer tool (`perf_analyzer`).

1. [Overview](#1)
2. [Generate the DLRM Deployment Configuration](#2)
3. [Start Tritonserver](#3)
4. [Run Inference with Sample Data](#4)
5. [Run Inference with the Tritonserver Python Client](#5)

## 2. Generate the DLRM Deployment Configuration

In [1]:
# Define the folders that store the model-related files
# Standard Libraries
import os
from time import time
import re
import shutil
import glob
import warnings

DATA_DIR = "/model-data/criteo/"
model_folder = os.path.join(DATA_DIR, "model")        # Triton model repository root
dlrm_model_repo = os.path.join(model_folder, "dlrm")  # holds config.pbtxt
dlrm_version = os.path.join(dlrm_model_repo, "1")     # holds the trained model files


### Generate Triton configuration for DLRM Deployment 

In [24]:
%%writefile $dlrm_model_repo/config.pbtxt
name: "dlrm"
backend: "hugectr"
max_batch_size: 64
input [
  {
    name: "DES"
    data_type: TYPE_FP32
    dims: [ -1 ]
  },
  {
    name: "CATCOLUMN"
    data_type: TYPE_INT64
    dims: [ -1 ]
  },
  {
    name: "ROWINDEX"
    data_type: TYPE_INT32
    dims: [ -1 ]
  }
]
output [
  {
    name: "OUTPUT0"
    data_type: TYPE_FP32
    dims: [ -1 ]
  }
]
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [0]
  }
]

parameters [
  {
    key: "config"
    value: { string_value: "/model-data/criteo/model/dlrm/1/dlrm.json" }
  },
  {
    key: "gpucache"
    value: { string_value: "true" }
  },
  {
    key: "hit_rate_threshold"
    value: { string_value: "0.5" }
  },
  {
    key: "gpucacheper"
    value: { string_value: "0.1" }
  },
  {
    key: "label_dim"
    value: { string_value: "1" }
  },
  {
    key: "slots"
    value: { string_value: "26" }
  },
  {
    key: "cat_feature_num"
    value: { string_value: "26" }
  },
  {
    key: "des_feature_num"
    value: { string_value: "13" }
  },
  {
    key: "max_nnz"
    value: { string_value: "2" }
  },
  {
    key: "embedding_vector_size"
    value: { string_value: "128" }
  },
  {
    key: "embeddingkey_long_type"
    value: { string_value: "true" }
  }
]

Overwriting /model-data/criteo/model/dlrm/config.pbtxt


### Generate HPS configuration for DLRM deployment

In [19]:
%%writefile $model_folder/ps.json
{
    "volatile_db": {
      "type": "redis_cluster",
      "address": "172.20.0.31:6373,172.20.0.32:6374,172.20.0.33:6375",
      "user_name":  "default",
      "password": "",
      "num_partitions": 8,
      "allocation_rate": 268435456,
      "max_get_batch_size": 10000,
      "max_set_batch_size": 10000,
      "overflow_margin": 10000000,
      "overflow_policy": "evict_oldest",
      "overflow_resolution_target": 0.99,
      "initial_cache_rate": 1.0,
      "cache_missed_embeddings": false,
      "update_filters": ["^hps_.+$"]
    },
    "supportlonglong": true,
    "models":[
        {
            "model":"dlrm",
            "sparse_files":["/model-data/criteo/model/dlrm/1/dlrm0_sparse_20000.model"],
            "dense_file":"/model-data/criteo/model/dlrm/1/dlrm_dense_20000.model",
            "network_file":"/model-data/criteo/model/dlrm/1/dlrm.json",
            "num_of_worker_buffer_in_pool": 2,
            "num_of_refresher_buffer_in_pool":1,
            "deployed_device_list":[0],
            "max_batch_size":64,
            "default_value_for_each_table":[0.0,0.0],
            "hit_rate_threshold":0.5,
            "gpucacheper":0.1,
            "gpucache": true,
            "cache_refresh_percentage_per_iteration":0.0,
            "maxnum_des_feature_per_sample": 13,
            "maxnum_catfeature_query_per_table_per_sample":[26],
            "embedding_vecsize_per_table":[128],
            "slot_num":26
        }
    ]  
}

Overwriting /model-data/criteo/model/ps.json


In [5]:
!ls -l $dlrm_version
!ls -l $dlrm_model_repo

total 9268
-rw-r--r-- 1 1000 1000    3704 Nov 21 21:08 dlrm.json
drwxr-xr-x 2 1000 1000    4096 Nov 21 21:08 dlrm0_sparse_20000.model
-rw-r--r-- 1 1000 1000 9479684 Nov 21 21:08 dlrm_dense_20000.model
total 8
drwxr-xr-x 3 1000 1000 4096 Nov 21 21:08 1
-rw-r--r-- 1 1000 1000 1230 Nov 22 00:45 config.pbtxt


## 3. Start Tritonserver

Follow these steps to start Triton.

First, open a terminal other than the one running this JupyterLab instance and run the following command to enter the `triton` container:
```bash
docker exec -it large-scale-recsys_merlin_1 bash
```

Once inside the container, start `tritonserver` with the following command:
```bash
tritonserver --model-repository=/model-data/criteo/model/ \
    --load-model=dlrm \
    --model-control-mode=explicit \
    --backend-directory=/usr/local/hugectr/backends \
    --backend-config=hugectr,ps=/model-data/criteo/model/ps.json \
    --http-port=8000 --grpc-port=8001 --metrics-port=8002
```

At this point, you should see Triton connecting to Redis and loading the DLRM model embeddings.


## 4. Run Inference with Sample Data 

### Check Triton Deployment

In [22]:
!curl -v localhost:8000/v2/health/ready

*   Trying 127.0.0.1:8000...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 8000 (#0)
> GET /v2/health/ready HTTP/1.1
> Host: localhost:8000
> User-Agent: curl/7.68.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Content-Length: 0
< Content-Type: text/plain
< 
* Connection #0 to host localhost left intact
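
Loading the embeddings can take a while, so a script that runs right after server start may need to wait before sending requests. The sketch below polls the same `/v2/health/ready` endpoint using only the Python standard library. The per-model URL follows the KServe v2 HTTP protocol that Triton implements. The helper names `http_ready` and `wait_until_ready` and the default retry counts are illustrative assumptions.

```python
import time
import urllib.error
import urllib.request

SERVER_READY_URL = "http://localhost:8000/v2/health/ready"
MODEL_READY_URL = "http://localhost:8000/v2/models/dlrm/ready"


def http_ready(url, timeout=2.0):
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status.
        return False


def wait_until_ready(probe, attempts=30, delay=1.0, sleep=time.sleep):
    """Call `probe()` until it returns True or `attempts` is exhausted."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

A typical call would be `wait_until_ready(lambda: http_ready(MODEL_READY_URL))`, which returns `False` if the model is still not ready after all attempts. Passing the probe as a callable keeps the retry loop independent of the HTTP code.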


### Batch size = 1

In [20]:
!perf_analyzer -m dlrm -u localhost:8000 --input-data /model-data/criteo/1.json --shape CATCOLUMN:26 --shape DES:13 --shape ROWINDEX:27

 Successfully read data for 1 stream/streams with 2 step/steps.
*** Measurement Settings ***
  Batch size: 1
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
  Client: 
    Request count: 27776
    Throughput: 1542.54 infer/sec
    Avg latency: 643 usec (standard deviation 227 usec)
    p50 latency: 619 usec
    p90 latency: 689 usec
    p95 latency: 725 usec
    p99 latency: 895 usec
    Avg HTTP time: 640 usec (send/recv 29 usec + response wait 611 usec)
  Server: 
    Inference count: 27777
    Execution count: 27777
    Successful request count: 27777
    Avg request latency: 524 usec (overhead 1 usec + queue 35 usec + compute input 0 usec + compute infer 488 usec + compute output 0 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 1542.54 infer/sec, latency 643 usec
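
The file passed to `--input-data` is not shown in this notebook. As a rough sketch, assuming it follows the `{"data": [...]}` JSON layout that `perf_analyzer` documents for user-supplied input data, a file with matching tensor lengths could be built as follows. The sample values are synthetic placeholders rather than real Criteo rows, and the function names `make_step` and `make_perf_input` are hypothetical.

```python
import json


def make_step(samples):
    """Flatten (dense, categorical) samples into one perf_analyzer step.

    Each sample is a pair: 13 dense floats and 26 categorical keys.
    The keys are assumed to be already shifted into the global key space.
    """
    dense, keys = [], []
    for dense_row, key_row in samples:
        dense.extend(float(v) for v in dense_row)
        keys.extend(int(k) for k in key_row)
    return {
        "DES": dense,
        "CATCOLUMN": keys,
        # One key per slot, so the row offsets are simply 0..len(keys).
        "ROWINDEX": list(range(len(keys) + 1)),
    }


def make_perf_input(steps):
    """Wrap the steps in the top-level {"data": [...]} object."""
    return {"data": [make_step(samples) for samples in steps]}


# Placeholder sample for illustration only (not real Criteo data).
sample = ([0.0] * 13, list(range(26)))
payload = make_perf_input([[sample], [sample]])  # two steps, one sample each
text = json.dumps(payload)

print({name: len(values) for name, values in payload["data"][0].items()})
# prints {'DES': 13, 'CATCOLUMN': 26, 'ROWINDEX': 27}
```

Writing `text` to a file gives something that could be passed to `--input-data`. For meaningful latency numbers the samples should come from real data, since the keys requested affect the embedding cache hit rate.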


### Batch size = 64

In [17]:
!perf_analyzer -m dlrm -u localhost:8000 --input-data /model-data/criteo/64.json --shape CATCOLUMN:1664 --shape DES:832 --shape ROWINDEX:1665

 Successfully read data for 1 stream/streams with 1 step/steps.
*** Measurement Settings ***
  Batch size: 1
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
  Client: 
    Request count: 41139
    Throughput: 2285.01 infer/sec
    Avg latency: 433 usec (standard deviation 137 usec)
    p50 latency: 415 usec
    p90 latency: 497 usec
    p95 latency: 550 usec
    p99 latency: 635 usec
    Avg HTTP time: 430 usec (send/recv 28 usec + response wait 402 usec)
  Server: 
    Inference count: 126930
    Execution count: 126930
    Successful request count: 126930
    Avg request latency: 293 usec (overhead 1 usec + queue 42 usec + compute input 0 usec + compute infer 250 usec + compute output 0 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 2285.01 infer/sec, latency 433 usec
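
Both runs report `Batch size: 1` because the number of samples is encoded in the tensor lengths given by `--shape`, not in the `perf_analyzer` batch setting. The lengths follow from the feature counts in `config.pbtxt`. Here is a short sketch of that arithmetic, assuming one key per categorical slot (as in the commands above); the function names are hypothetical.

```python
DES_FEATURE_NUM = 13  # "des_feature_num" in config.pbtxt
CAT_FEATURE_NUM = 26  # "cat_feature_num" and "slots" in config.pbtxt


def input_lengths(num_samples):
    """Length of each flattened input tensor for `num_samples` samples."""
    num_keys = num_samples * CAT_FEATURE_NUM
    return {
        "DES": num_samples * DES_FEATURE_NUM,
        "CATCOLUMN": num_keys,
        # Row offsets need one more entry than there are keys.
        "ROWINDEX": num_keys + 1,
    }


def shape_flags(num_samples):
    """Build the --shape arguments for perf_analyzer."""
    lengths = input_lengths(num_samples)
    order = ("CATCOLUMN", "DES", "ROWINDEX")
    return " ".join(f"--shape {name}:{lengths[name]}" for name in order)


for n in (1, 64):
    print(n, shape_flags(n))
# prints:
# 1 --shape CATCOLUMN:26 --shape DES:13 --shape ROWINDEX:27
# 64 --shape CATCOLUMN:1664 --shape DES:832 --shape ROWINDEX:1665
```

The same arithmetic gives the 261 row offsets used for 10 samples in the Python client section (10 × 26 + 1).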


## 5. Run Inference with the Tritonserver Python Client

In [23]:
import numpy as np
import pandas as pd

import tritonclient.http as httpclient
from tritonclient.utils import np_to_triton_dtype

model_name = 'dlrm'
CATEGORICAL_COLUMNS = ["C" + str(x) for x in range(1, 27)]
CONTINUOUS_COLUMNS = ["I" + str(x) for x in range(1, 14)]
LABEL_COLUMNS = ['label']
# Number of distinct keys in each of the 26 categorical features
emb_size_array = [4976199, 25419, 14705, 7112, 19283, 4, 6391, 1282, 60, 3289052, 282487, 138210, 11, 2203, 8901, 67, 4, 948, 15, 5577159, 1385790, 4348882, 178673, 10023, 88, 34]
# Starting offset of each feature's keys: the total size of all preceding features
shift = np.insert(np.cumsum(emb_size_array), 0, 0)[:-1]
test_df = pd.read_csv("/model-data/criteo/infer_test.csv", sep=',')

with httpclient.InferenceServerClient("localhost:8000") as client:
    # Flatten the first 10 samples into single-row arrays
    dense_features = np.array([list(test_df.head(10)[CONTINUOUS_COLUMNS].values.flatten())], dtype='float32')
    embedding_columns = np.array([list((test_df.head(10)[CATEGORICAL_COLUMNS] + shift).values.flatten())], dtype='int64')
    # Row offsets with one key per slot: 10 samples * 26 slots + 1 = 261 entries
    row_ptrs = np.array([list(range(0, 261))], dtype='int32')

    # Input names and dtypes must match config.pbtxt
    inputs = [
        httpclient.InferInput("DES", dense_features.shape,
                              np_to_triton_dtype(dense_features.dtype)),
        httpclient.InferInput("CATCOLUMN", embedding_columns.shape,
                              np_to_triton_dtype(embedding_columns.dtype)),
        httpclient.InferInput("ROWINDEX", row_ptrs.shape,
                              np_to_triton_dtype(row_ptrs.dtype)),
    ]

    inputs[0].set_data_from_numpy(dense_features)
    inputs[1].set_data_from_numpy(embedding_columns)
    inputs[2].set_data_from_numpy(row_ptrs)
    outputs = [
        httpclient.InferRequestedOutput("OUTPUT0")
    ]

    response = client.infer(model_name,
                            inputs,
                            request_id=str(1),
                            outputs=outputs)

    result = response.get_response()
    print(result)
    print("Prediction Result:")
    # One predicted probability per sample
    print(response.as_numpy("OUTPUT0"))

{'id': '1', 'model_name': 'dlrm', 'model_version': '1', 'parameters': {'NumSample': 10, 'DeviceID': 0}, 'outputs': [{'name': 'OUTPUT0', 'datatype': 'FP32', 'shape': [10], 'parameters': {'binary_data_size': 40}}]}
Prediction Result:
[0.01985384 0.02970626 0.02543451 0.02905972 0.08103204 0.02941077
 0.02769326 0.0242354  0.02630902 0.02453931]
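
The `shift` array in the client code moves each feature's local category IDs into one shared key space by adding the total size of all preceding features. Below is a small pure-Python sketch of the same idea, using three made-up feature sizes instead of the 26 Criteo ones. The function names `key_offsets` and `to_global_keys` are hypothetical.

```python
def key_offsets(sizes):
    """Exclusive prefix sum: offsets[i] == sum(sizes[:i])."""
    offsets = []
    total = 0
    for size in sizes:
        offsets.append(total)
        total += size
    return offsets


def to_global_keys(row, offsets):
    """Shift each feature's local ID by that feature's offset."""
    return [local_id + offset for local_id, offset in zip(row, offsets)]


sizes = [4, 3, 5]             # hypothetical feature cardinalities
offsets = key_offsets(sizes)  # [0, 4, 7]
print(to_global_keys([2, 0, 4], offsets))  # prints [2, 4, 11]
```

With these offsets, ID 0 of the second feature becomes key 4 and cannot collide with any ID of the first feature, whose keys occupy 0 to 3. `key_offsets` computes the same values as `np.insert(np.cumsum(sizes), 0, 0)[:-1]`.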
