## Using a Triton Inference Server

### Deploying SSEPT Model with Triton Inference Server

We use NVIDIA's Triton Inference Server to serve our SSEPT model in a scalable and optimized way. This includes:

- Deploying via Triton's Python backend
- Enabling dynamic batching
- Running on GPU (or CPU)
- Integrating with a Flask UI
- Running benchmarks from a Jupyter container

---

#### Triton Model Structure

Directory layout:
```
serve-system-chi/models/
└── recommender_model/
    ├── config.pbtxt
    └── 1/
        ├── SSE_PT10kemb.onnx or .pth
        └── model.py
```


#### config.pbtxt (example)

```protobuf
name: "recommender_model"
backend: "python"
max_batch_size: 16
input [
  { name: "USER_ID" data_type: TYPE_INT64 dims: [1] },
  { name: "SEQ" data_type: TYPE_INT64 dims: [50] }
]
output [
  { name: "TOP_K" data_type: TYPE_INT64 dims: [5] }
]
instance_group [
  { count: 1 kind: KIND_GPU gpus: [ 0 ] }
]
```
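Because `max_batch_size` is greater than zero, Triton treats the `dims` in this config as per-item shapes and expects a leading batch dimension on every request tensor. The small sketch below illustrates that rule. `request_shape` is a hypothetical helper written for illustration, not part of Triton's API.

```python
# Per-item dims and batch limit copied from the example config
MAX_BATCH_SIZE = 16
DIMS = {"USER_ID": [1], "SEQ": [50], "TOP_K": [5]}


def request_shape(name: str, batch: int) -> list[int]:
    """Return the full tensor shape for a given batch size (hypothetical helper)."""
    if not 1 <= batch <= MAX_BATCH_SIZE:
        raise ValueError(f"batch must be in 1..{MAX_BATCH_SIZE}, got {batch}")
    # The batch dimension is prepended to the per-item dims
    return [batch, *DIMS[name]]


print(request_shape("USER_ID", 1))  # [1, 1]
print(request_shape("SEQ", 8))      # [8, 50]
```

This is why a single-request client sends `USER_ID` with shape `[1, 1]` and `SEQ` with shape `[1, 50]`.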


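#### model.py (Python backend interface)

Implement a `TritonPythonModel` class:

```python
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # Called once at load time: load the ONNX or PyTorch model
        self.model = load_model(...)
        self.device = select_device_from(args)

    def execute(self, requests):
        # Triton expects one response per request, in the same order
        responses = []
        for request in requests:
            # Extract user_id and sequence
            user_id = pb_utils.get_input_tensor_by_name(request, "USER_ID").as_numpy()
            seq = pb_utils.get_input_tensor_by_name(request, "SEQ").as_numpy()
            # Run inference (model-specific), then return top_k as output
            top_k = ...  # int64 array of shape [batch, 5]
            out = pb_utils.Tensor("TOP_K", top_k)
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```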
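How the "return top_k" step looks depends on what the model emits. As a minimal sketch, assuming the model yields one score per candidate item and that the list index doubles as the item ID (both are assumptions, not confirmed by the source), the five best items could be selected as follows. `top_k_items` is a hypothetical helper; inside `model.py`, `numpy.argsort` or `torch.topk` would be natural equivalents.

```python
import heapq


def top_k_items(scores: list[float], k: int = 5) -> list[int]:
    """Return the indices of the k highest scores, best first (hypothetical helper)."""
    # nlargest avoids sorting the full score list when k is small
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)


scores = [0.1, 0.9, 0.3, 0.8, 0.05, 0.7, 0.6]
print(top_k_items(scores))  # [1, 3, 5, 6, 2]
```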


### Start Triton + Flask + Jupyter

```bash
docker compose -f ~/serve-system-chi/docker/docker-compose-triton.yaml up -d
```

This launches:

- `triton_server`: GPU inference backend
- `flask_ui`: sends JSON requests to Triton
- `jupyter`: used to run benchmarks

Verify the Triton logs:

```bash
docker logs triton_server -f
```

The output should include:

```
| recommender_model | 1 | READY |
Started GRPCInferenceService at 0.0.0.0:8001
Started HTTPService at 0.0.0.0:8000
```


### Access UI and Jupyter

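Flask inference UI:

```
http://<FLOATING_IP>/
```

Upload a sequence JSON or image input (depending on the model).

Jupyter Notebook: get the access token from the container logs:

```bash
docker logs jupyter
```

Then open the following URL, substituting the token from the logs, and open the notebook `work/triton.ipynb`:

```
http://<FLOATING_IP>:8888/lab?token=...
```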


### Benchmark Inference via Triton Client


```python
from tritonclient.http import InferenceServerClient, InferInput, InferRequestedOutput
import numpy as np

# HTTP endpoint; from another container on the Compose network, use "triton_server:8000"
client = InferenceServerClient("localhost:8000")

# max_batch_size > 0, so each shape carries a leading batch dimension
input_user = InferInput("USER_ID", [1, 1], "INT64")
input_seq  = InferInput("SEQ", [1, 50], "INT64")

input_user.set_data_from_numpy(np.array([[42]], dtype=np.int64))
input_seq.set_data_from_numpy(np.random.randint(0, 100, size=(1, 50), dtype=np.int64))

outputs = [InferRequestedOutput("TOP_K")]

res = client.infer("recommender_model", [input_user, input_seq], outputs=outputs)
print(res.as_numpy("TOP_K"))
```
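A single request says little about latency. As a rough sketch, a helper like the one below could time repeated calls and summarize them. `time_calls` is a hypothetical name, and the dummy workload stands in for `client.infer(...)` so that the block runs without a server.

```python
import statistics
import time


def time_calls(fn, n: int = 100) -> dict[str, float]:
    """Call fn() n times and return latency statistics in milliseconds."""
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        fn()
        latencies.append((time.perf_counter() - start) * 1000.0)
    latencies.sort()
    return {
        "mean_ms": statistics.mean(latencies),
        "median_ms": statistics.median(latencies),
        # Approximate 95th percentile taken from the sorted samples
        "p95_ms": latencies[min(n - 1, (95 * n) // 100)],
    }


# Dummy workload; replace with a real inference call when a server is available
stats = time_calls(lambda: sum(range(10_000)), n=50)
print({name: round(value, 3) for name, value in stats.items()})
```

With a live server, passing `lambda: client.infer("recommender_model", [input_user, input_seq], outputs=outputs)` as `fn` would measure end-to-end request latency as seen by the client.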


### Serving a PyTorch model

The Triton client comes with a performance analyzer, which we can use to send requests to the server and get some statistics back. Let’s try it:

```bash
perf_analyzer -u triton_server:8000 -m recommender_model --input-data input.json -b 1 --concurrency-range 8
perf_analyzer -u triton_server:8000 -m recommender_model --input-data input.json -b 1 --concurrency-range 16
```

Make a note of the line showing the total average request latency, and the breakdown including:

-   `queue`, the queuing delay
-   and `compute infer`, the inference delay
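The `perf_analyzer` commands above read request data from `input.json`, which this section does not show. As a sketch, assuming the file follows perf_analyzer's JSON input format (a top-level `data` list with one object per request, keyed by input name) and that item IDs fall in the 0 to 99 range used in the client example, such a file could be generated like this. `make_input_data` is a hypothetical helper, and the user ID range is an arbitrary placeholder.

```python
import json
import random


def make_input_data(num_requests: int = 4, seq_len: int = 50, num_items: int = 100) -> dict:
    """Build perf_analyzer-style input data for USER_ID and SEQ (hypothetical helper)."""
    rng = random.Random(0)  # fixed seed so the file is reproducible
    return {
        "data": [
            {
                # One user ID per request; the range 0..999 is an assumption
                "USER_ID": [rng.randrange(1000)],
                # Flat list of seq_len item IDs, matching dims: [50]
                "SEQ": [rng.randrange(num_items) for _ in range(seq_len)],
            }
            for _ in range(num_requests)
        ]
    }


with open("input.json", "w") as f:
    json.dump(make_input_data(), f)
```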

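### Enable Dynamic Batching

Edit the model config (`config.pbtxt`) to add a `dynamic_batching` block. Triton then combines queued requests into batches, preferring the listed sizes, and delays a request by at most 100 microseconds while waiting for others to join its batch:

```protobuf
dynamic_batching {
  preferred_batch_size: [4, 6, 8, 10]
  max_queue_delay_microseconds: 100
}
```

Then rebuild and restart:

```bash
docker compose -f ~/serve-system-chi/docker/docker-compose-triton.yaml build triton_server
docker compose -f ~/serve-system-chi/docker/docker-compose-triton.yaml up triton_server --force-recreate -d
```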


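### Multi-Instance GPU Scaling

Edit the model config to run two model instances on each of two GPUs:

```protobuf
instance_group [
  { count: 2 kind: KIND_GPU gpus: [ 0 ] },
  { count: 2 kind: KIND_GPU gpus: [ 1 ] }
]
```

Then scale up to four instances per GPU:

```protobuf
instance_group [
  { count: 4 kind: KIND_GPU gpus: [ 0 ] },
  { count: 4 kind: KIND_GPU gpus: [ 1 ] }
]
```

After each change, rebuild and restart `triton_server`, then rerun the benchmark:

```bash
perf_analyzer -u triton_server:8000 -m recommender_model --input-data input.json -b 1 --concurrency-range 8
```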

### Migrate to ONNX Backend

Benchmark the ONNX model, passing each input's shape with its own `--shape` flag:

```bash
perf_analyzer -u triton_server:8000 -m recommender_model_onnx -b 1 --shape USER_ID:1 --shape SEQ:50 --concurrency-range 16
```

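### Flask + ONNX Integration

In the Compose file, set the Flask service's build context to the `triton_onnx` ref of the repository and set the model name:

```yaml
context: https://github.com/teaching-on-testbeds/gourmetgram.git#triton_onnx
environment:
  - MODEL_NAME=recommender_model_onnx
```

Then rebuild and restart the Flask service:

```bash
docker compose -f ~/serve-system-chi/docker/docker-compose-triton.yaml build flask
docker compose -f ~/serve-system-chi/docker/docker-compose-triton.yaml up flask --force-recreate -d
```

Test the Flask UI:

```
http://<FLOATING_IP>
```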