# Overview
The **03_model_inference_hps_trt_ensemble.ipynb** notebook covers the following tasks:
  * Configure three backends in Triton format
  * Deploy to inference with Triton ensemble mode
  * Validate deployed ensemble model with dummy dataset

In [1]:
import os
import shutil
import numpy as np
import tritonhttpclient
import tritonclient.http as httpclient
from tritonclient.utils import np_to_triton_dtype  # the only helper used below



## Configure 3 backends in Triton format
The 3 backends are:
* "hps_embedding" backend, HPS Triton backend for embedding lookup serving
* "trt_naive_dnn_dense" backend, TensorRT Triton backend for dense model serving
* "hps_trt_ensemble" backend, integrates the above two backends and serves as one ensemble service

In [2]:
args = dict()
args["slot_num"] = 3

### Prepare Triton Inference Server directories

In [3]:
BASE_DIR = "/model_repo"

!mkdir -p $BASE_DIR/hps_embedding/1
!mkdir -p $BASE_DIR/trt_naive_dnn_dense/1
!mkdir -p $BASE_DIR/hps_trt_ensemble/1
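
The `mkdir -p` calls above can also be written in plain Python, which is convenient when the setup runs outside a notebook. The following is a minimal sketch; the helper name `make_model_dirs` is hypothetical, and the demo writes to a temporary directory rather than `/model_repo`.

```python
import os
import tempfile

# Model names taken from the notebook; each one needs a version folder "1".
MODEL_NAMES = ["hps_embedding", "trt_naive_dnn_dense", "hps_trt_ensemble"]

def make_model_dirs(base_dir, model_names=MODEL_NAMES, version="1"):
    """Create <base_dir>/<model>/<version> for every model, like `mkdir -p`."""
    created = []
    for name in model_names:
        path = os.path.join(base_dir, name, version)
        os.makedirs(path, exist_ok=True)  # no error if the directory exists
        created.append(path)
    return created

# Demo against a throwaway directory; pass "/model_repo" to mirror the notebook.
with tempfile.TemporaryDirectory() as tmp:
    demo_paths = make_model_dirs(tmp)
    demo_ok = all(os.path.isdir(p) for p in demo_paths)
```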

In [4]:
# check created repository 
!tree /model_repo

/model_repo
├── hps_embedding
│   ├── 1
│   │   └── naive_dnn_sparse.model
│   │       ├── emb_vector
│   │       └── key
│   ├── config.pbtxt
│   └── hps_embedding.json
├── hps_trt_ensemble
│   └── 1
└── trt_naive_dnn_dense
    └── 1

7 directories, 4 files


### Configure "hps_embedding" HPS backend
For more details on building the HPS backend, see the [Hierarchical Parameter Server Demo](../../samples/Hierarchical_Parameter_Server_Deployment.ipynb).

In [5]:
%%writefile $BASE_DIR/hps_embedding/config.pbtxt
name: "hps_embedding"
backend: "hps"
max_batch_size:0
input [
  {
    name: "KEYS"
    data_type: TYPE_INT64
    dims: [ -1, -1 ]
  },
  {
    name: "NUMKEYS"
    data_type: TYPE_INT32
    dims: [ -1, -1]
  }
]
output [
  {
    name: "OUTPUT0"
    data_type: TYPE_FP32
    dims: [ -1 ]
  }
]
version_policy: {
        specific:{versions: 1}
},
instance_group [
  {
    count: 1
    kind : KIND_GPU
    gpus:[0]
  }
]

Overwriting /model_repo/hps_embedding/config.pbtxt


Generate the HPS configuration that deploys the embedding table:

In [6]:
%%writefile $BASE_DIR/hps_embedding/hps_embedding.json
{
    "supportlonglong": true,
    "models": [{
        "model": "hps_embedding",
        "sparse_files": ["/model_repo/hps_embedding/1/naive_dnn_sparse.model"],
        "num_of_worker_buffer_in_pool": 3,
        "embedding_table_names":["sparse_embedding1"],
        "embedding_vecsize_per_table": [16],
        "maxnum_catfeature_query_per_table_per_sample": [3],
        "default_value_for_each_table": [1.0],
        "deployed_device_list": [0],
        "max_batch_size": 65536,
        "cache_refresh_percentage_per_iteration": 0.2,
        "hit_rate_threshold": 1.0,
        "gpucacheper": 1.0,
        "gpucache": true
        }
    ]
}

Overwriting /model_repo/hps_embedding/hps_embedding.json
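
Before starting the server, it can help to sanity-check the JSON. The sketch below assumes, as the key names suggest, that each per-table list carries one entry per embedding table; the helper `check_hps_model_entry` is hypothetical and not part of HPS, and the dictionary is a trimmed copy of the file written above.

```python
import json

# Keys whose lists are assumed to hold one value per embedding table.
PER_TABLE_KEYS = [
    "embedding_vecsize_per_table",
    "maxnum_catfeature_query_per_table_per_sample",
    "default_value_for_each_table",
]

def check_hps_model_entry(model_cfg):
    """Return a list of length mismatches found in one "models" entry."""
    problems = []
    n_tables = len(model_cfg["embedding_table_names"])
    for key in PER_TABLE_KEYS:
        if len(model_cfg[key]) != n_tables:
            problems.append(
                f"{key} has {len(model_cfg[key])} entries, expected {n_tables}"
            )
    return problems

# Trimmed copy of hps_embedding.json from the cell above.
hps_config = json.loads("""
{
    "supportlonglong": true,
    "models": [{
        "model": "hps_embedding",
        "embedding_table_names": ["sparse_embedding1"],
        "embedding_vecsize_per_table": [16],
        "maxnum_catfeature_query_per_table_per_sample": [3],
        "default_value_for_each_table": [1.0]
    }]
}
""")

hps_problems = check_hps_model_entry(hps_config["models"][0])
print(hps_problems)  # prints []
```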


In [7]:
!cp -r ./naive_dnn_sparse.model /model_repo/hps_embedding/1/

### Configure "trt_naive_dnn_dense" TensorRT backend 
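**Note**: This configuration expects the serialized TensorRT engine `naive_dnn_dense.trt`, which was built with a fixed batch size of 1024. The engine file is copied into the model repository in the cell that follows the configuration.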

In [8]:
%%writefile $BASE_DIR/trt_naive_dnn_dense/config.pbtxt
platform: "tensorrt_plan"
default_model_filename: "naive_dnn_dense.trt"
backend: "tensorrt"
max_batch_size: 0

input [
  {
    name: "input_1"
    data_type: TYPE_FP32
    dims: [49152]
    reshape: { shape: [1024, 48] }
  }
]
output [
  {
      name: "fc_3"
      data_type: TYPE_FP32
      dims: [-1,1]
  }
]

instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus:[0]

  }
]

Writing /model_repo/trt_naive_dnn_dense/config.pbtxt
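
The `dims` and `reshape` values in this configuration appear to follow from numbers used elsewhere in the notebook: the fixed batch size of 1024, `slot_num = 3`, and the embedding vector size of 16. The short calculation below is only a sketch of that arithmetic, under the assumption that the flat input is the concatenation of all looked-up vectors for one batch.

```python
batch_size = 1024         # fixed batch size of the TensorRT engine
slot_num = 3              # args["slot_num"]
embedding_vec_size = 16   # "embedding_vecsize_per_table" in hps_embedding.json

# Each sample contributes one embedding vector per slot.
features_per_sample = slot_num * embedding_vec_size
# The HPS output is flat, so the dense model declares one long vector.
flat_input_len = batch_size * features_per_sample

print(features_per_sample, flat_input_len)  # prints 48 49152
```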


In [9]:
!cp -r ./naive_dnn_dense.trt /model_repo/trt_naive_dnn_dense/1/

### Configure "hps_trt_ensemble" Triton backend

In [10]:
%%writefile $BASE_DIR/hps_trt_ensemble/config.pbtxt
name: "hps_trt_ensemble"
platform: "ensemble"
max_batch_size: 0
input [
  {
    name: "EMB_KEY"
    data_type: TYPE_INT64
    dims: [-1,-1]
  },
  {
    name: "EMB_N_KEY"
    data_type: TYPE_INT32
    dims: [-1,-1]
  }
]
output [
  {
    name: "DENSE_OUTPUT"
    data_type: TYPE_FP32
    dims: [-1, 1]
  }
]
ensemble_scheduling {
  step [
    {
      model_name: "hps_embedding"
      model_version: -1
      input_map {
        key: "KEYS"
        value: "EMB_KEY"
      }
      input_map {
        key: "NUMKEYS"
        value: "EMB_N_KEY"
      }
      output_map {
        key: "OUTPUT0"
        value: "LOOKUP_VECTORS"
      }
    },
    {
      model_name: "trt_naive_dnn_dense"
      model_version: -1
      input_map {
        key: "input_1"
        value: "LOOKUP_VECTORS"
      }
      output_map {
        key: "fc_3"
        value: "DENSE_OUTPUT"
      }
    }
  ]
}

Writing /model_repo/hps_trt_ensemble/config.pbtxt
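
The `input_map` and `output_map` entries connect ensemble-level tensor names to each model's own tensor names. The sketch below mirrors that wiring in plain Python dictionaries and checks that every tensor is produced before it is consumed. It is an illustration only: the helper `find_wiring_problems` is hypothetical, and it treats the listed step order as the execution order, which is a simplification.

```python
# Tensor names copied from the hps_trt_ensemble config.pbtxt above.
ENSEMBLE_INPUTS = ["EMB_KEY", "EMB_N_KEY"]
ENSEMBLE_OUTPUTS = ["DENSE_OUTPUT"]
STEPS = [
    {
        "model_name": "hps_embedding",
        "input_map": {"KEYS": "EMB_KEY", "NUMKEYS": "EMB_N_KEY"},
        "output_map": {"OUTPUT0": "LOOKUP_VECTORS"},
    },
    {
        "model_name": "trt_naive_dnn_dense",
        "input_map": {"input_1": "LOOKUP_VECTORS"},
        "output_map": {"fc_3": "DENSE_OUTPUT"},
    },
]

def find_wiring_problems(inputs, outputs, steps):
    """List ensemble tensors that are read or returned but never produced."""
    available = set(inputs)  # tensors that exist before any step runs
    problems = []
    for step in steps:
        for model_input, tensor in step["input_map"].items():
            if tensor not in available:
                problems.append(
                    f"{step['model_name']}.{model_input} reads unknown tensor {tensor}"
                )
        # Outputs of this step become available to later steps.
        available.update(step["output_map"].values())
    for tensor in outputs:
        if tensor not in available:
            problems.append(f"ensemble output {tensor} is never produced")
    return problems

wiring_problems = find_wiring_problems(ENSEMBLE_INPUTS, ENSEMBLE_OUTPUTS, STEPS)
print(wiring_problems)  # prints []
```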


### Check the generated directory and configurations

In [11]:
!tree /model_repo

/model_repo
├── hps_embedding
│   ├── 1
│   │   └── naive_dnn_sparse.model
│   │       ├── emb_vector
│   │       └── key
│   ├── config.pbtxt
│   └── hps_embedding.json
├── hps_trt_ensemble
│   ├── 1
│   └── config.pbtxt
└── trt_naive_dnn_dense
    ├── 1
    │   └── naive_dnn_dense.trt
    └── config.pbtxt

7 directories, 7 files


## Start Triton Inference Server, load 3 backends

We assume that you have checked your **tritonserver** version and confirmed that you can run the `tritonserver` command inside your Docker container.

For this tutorial, start Triton with the following command:
> **tritonserver --model-repository=/model_repo/ --backend-config=hps,ps=/model_repo/hps_embedding/hps_embedding.json --load-model=hps_trt_ensemble --model-control-mode=explicit**

If tritonserver starts successfully, you should see a log similar to the following:

```bash
+----------+--------------------------------+--------------------------------+
| Backend  | Path                           | Config                         |
+----------+--------------------------------+--------------------------------+
| tensorrt | /opt/tritonserver/backends/ten | {"cmdline":{"auto-complete-con |
|          | sorrt/libtriton_tensorrt.so    | fig":"true","min-compute-capab |
|          |                                | ility":"6.000000","backend-dir |
|          |                                | ectory":"/opt/tritonserver/bac |
|          |                                | kends","default-max-batch-size |
|          |                                | ":"4"}}                        |
|          |                                |                                |
| hps      | /opt/tritonserver/backends/hps | {"cmdline":{"auto-complete-con |
|          | /libtriton_hps.so              | fig":"true","backend-directory |
|          |                                | ":"/opt/tritonserver/backends" |
|          |                                | ,"min-compute-capability":"6.0 |
|          |                                | 00000","ps":"/model_repo/hps_e |
|          |                                | mbedding/hps_embedding.json"," |
|          |                                | default-max-batch-size":"4"}}  |
|          |                                |                                |
+----------+--------------------------------+--------------------------------+

+---------------------+---------+--------+
| Model               | Version | Status |
+---------------------+---------+--------+
| hps_embedding       | 1       | READY  |
| hps_trt_ensemble    | 1       | READY  |
| trt_naive_dnn_dense | 1       | READY  |
+---------------------+---------+--------+
```
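
The launch command from this section can also be assembled in Python, for example to start the server from a script. This is a sketch only: the helper `build_tritonserver_cmd` is hypothetical, and the flags are exactly those of the command quoted in this section.

```python
import shlex

def build_tritonserver_cmd(model_repo, hps_json, model_name):
    """Return the tritonserver argument list used in this tutorial."""
    return [
        "tritonserver",
        f"--model-repository={model_repo}",
        f"--backend-config=hps,ps={hps_json}",
        f"--load-model={model_name}",
        "--model-control-mode=explicit",
    ]

cmd = build_tritonserver_cmd(
    "/model_repo/",
    "/model_repo/hps_embedding/hps_embedding.json",
    "hps_trt_ensemble",
)
# Render the list as a single shell-style command line for inspection.
cmd_line = shlex.join(cmd)
print(cmd_line)
```

The list form can be handed to `subprocess.Popen(cmd)` to start the server in the background, provided `tritonserver` is on the `PATH` of the container.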

## Validate deployed ensemble model with dummy dataset
### Step.1 Check Tritonserver health
**Note**: With the default Triton server settings, the HTTP endpoint listens on port `8000`.

In [12]:
!curl -v localhost:8000/v2/health/ready

*   Trying 127.0.0.1:8000...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 8000 (#0)
> GET /v2/health/ready HTTP/1.1
> Host: localhost:8000
> User-Agent: curl/7.68.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Content-Length: 0
< Content-Type: text/plain
< 
* Connection #0 to host localhost left intact


In [13]:
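try:
    # tritonclient.http (imported as httpclient) supersedes the deprecated tritonhttpclient module
    triton_client = httpclient.InferenceServerClient(url="localhost:8000", verbose=True)
    print("client created.")
except Exception as e:
    print("channel creation failed: " + str(e))
    raise  # without a client, the checks below cannot run

triton_client.is_server_live()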

client created.
GET /v2/health/live, headers None
<HTTPSocketPoolResponse status=200 headers={'content-length': '0', 'content-type': 'text/plain'}>


True

### Step.2 Check loaded backends

In [14]:
triton_client.get_model_repository_index()

POST /v2/repository/index, headers None

<HTTPSocketPoolResponse status=200 headers={'content-type': 'application/json', 'content-length': '175'}>
bytearray(b'[{"name":"hps_embedding","version":"1","state":"READY"},{"name":"hps_trt_ensemble","version":"1","state":"READY"},{"name":"trt_naive_dnn_dense","version":"1","state":"READY"}]')


[{'name': 'hps_embedding', 'version': '1', 'state': 'READY'},
 {'name': 'hps_trt_ensemble', 'version': '1', 'state': 'READY'},
 {'name': 'trt_naive_dnn_dense', 'version': '1', 'state': 'READY'}]
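
Rather than reading the index by eye, the returned list can be checked programmatically. A minimal sketch, using the index printed above as sample data; the helper `not_ready_models` is hypothetical.

```python
def not_ready_models(index, expected):
    """Return the expected model names whose state is not READY."""
    states = {entry["name"]: entry.get("state") for entry in index}
    return sorted(name for name in expected if states.get(name) != "READY")

# Repository index as returned by get_model_repository_index() above.
index = [
    {"name": "hps_embedding", "version": "1", "state": "READY"},
    {"name": "hps_trt_ensemble", "version": "1", "state": "READY"},
    {"name": "trt_naive_dnn_dense", "version": "1", "state": "READY"},
]
expected = ["hps_embedding", "trt_naive_dnn_dense", "hps_trt_ensemble"]

print(not_ready_models(index, expected))  # prints []
```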

### Step.3 Prepare mock request

**Note**: The TensorRT engine for the dense network was built with a fixed batch size of 1024, so we can only send requests of exactly this batch size.

In [15]:
# generate a mock request based on the model training settings
batch_size = 1024
num_keys = batch_size * args["slot_num"]  # one key per slot for every sample
key_tensor  = np.random.randint(1, 10, (1, num_keys)).astype(np.int64)
nkey_tensor = np.full((1, 1), num_keys).astype(np.int32)
print("Input key tensor is \n{}, \nnumber of key tensor is \n{}".format(key_tensor, nkey_tensor))

inputs = [
    httpclient.InferInput("EMB_KEY", 
                          key_tensor.shape,
                          np_to_triton_dtype(np.int64)),
    httpclient.InferInput("EMB_N_KEY", 
                          nkey_tensor.shape,
                          np_to_triton_dtype(np.int32)),
]
inputs[0].set_data_from_numpy(key_tensor)
inputs[1].set_data_from_numpy(nkey_tensor)

outputs = [
    httpclient.InferRequestedOutput("DENSE_OUTPUT")
]

Input key tensor is 
[[1 2 2 ... 3 9 9]], 
number of key tensor is 
[[3072]]


### Step.4 Send request to Triton server

In [16]:
model_name = "hps_trt_ensemble"

with httpclient.InferenceServerClient("localhost:8000") as client:
    response = client.infer(model_name,
                            inputs,
                            outputs=outputs)
    result = response.get_response()
    
    print("Prediction result is {}".format(response.as_numpy("DENSE_OUTPUT")))
    print("Response details:\n{}".format(result))

Prediction result is [[2932.007 ]
 [2992.7078]
 [3417.2224]
 ...
 [2994.281 ]
 [3024.6218]
 [3153.9033]]
Response details:
{'model_name': 'hps_trt_ensemble', 'model_version': '1', 'parameters': {'sequence_id': 0, 'sequence_start': False, 'sequence_end': False}, 'outputs': [{'name': 'DENSE_OUTPUT', 'datatype': 'FP32', 'shape': [1024, 1], 'parameters': {'binary_data_size': 4096}}]}
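
The response details offer a quick consistency check: an FP32 output of shape `[1024, 1]` should occupy 1024 × 1 × 4 = 4096 bytes, which matches the reported `binary_data_size`. The sketch below spells out that check; the helper `expected_binary_size` and the small datatype table are assumptions for illustration.

```python
import math

# Bytes per element for the Triton datatypes that appear in this notebook.
DTYPE_BYTES = {"FP32": 4, "INT32": 4, "INT64": 8}

def expected_binary_size(shape, datatype):
    """Number of bytes a dense tensor of this shape and datatype occupies."""
    return math.prod(shape) * DTYPE_BYTES[datatype]

# Output entry copied from the response details above.
output_meta = {
    "name": "DENSE_OUTPUT",
    "datatype": "FP32",
    "shape": [1024, 1],
    "parameters": {"binary_data_size": 4096},
}

size = expected_binary_size(output_meta["shape"], output_meta["datatype"])
print(size == output_meta["parameters"]["binary_data_size"])  # prints True
```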
