## Using a Triton Inference Server

[Triton Inference Server](https://developer.nvidia.com/triton-inference-server) is an open-source project by NVIDIA for high-performance ML model deployment. In this section, we will practice deploying models using Triton; after you have finished, you should be able to:

-   serve a model using Triton Inference Server with Python backend
-   use dynamic batching to improve performance
-   scale your model to run on multiple GPUs, and/or with multiple instances on the same GPU
-   benchmark the Triton service, and recognize indications of potential problems
-   and use optimized backends

### Anatomy of a Triton model with Python backend

To start, run

``` bash
# runs on node-serve-system
mkdir -p ~/serve-system-chi/models/
cp -r ~/serve-system-chi/models_staging/food_classifier ~/serve-system-chi/models/
```

to copy [our first configuration](https://github.com/teaching-on-testbeds/serve-system-chi/tree/main/models_staging/food_classifier) into the directory from which Triton will load models.

Our initial implementation serves our food image classifier using PyTorch. Here’s how it works.

In the [Dockerfile](https://github.com/teaching-on-testbeds/serve-system-chi/blob/main/docker/Dockerfile.triton), the Triton server is started with the command

``` bash
tritonserver --model-repository=/models
```

where the `/models` directory is organized as follows:

    models/
    └── food_classifier
        ├── 1
        │   ├── food11.pth
        │   └── model.py
        └── config.pbtxt

It includes:

-   a top-level directory whose name is the “model name”
-   a configuration file `config.pbtxt` inside that directory. We’ll look at that shortly.
-   and a subdirectory for each model version. We have model version 1, so we have a subdirectory 1. Inside this directory is a `model.py`, which describes how the model will run.
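As a quick check on this layout, here is a minimal sketch that walks a model repository and reports each model name with its version subdirectories. The helper `list_models` is hypothetical (it is not part of Triton), and it assumes every model directory carries a `config.pbtxt`, as ours does.

``` python
from pathlib import Path

def list_models(repo_root):
    """Return {model_name: [version, ...]} for a Triton-style model repository."""
    models = {}
    for model_dir in sorted(Path(repo_root).iterdir()):
        # Skip stray files and directories without a config.pbtxt
        # (assumption: every model in this tutorial ships one).
        if not model_dir.is_dir() or not (model_dir / "config.pbtxt").is_file():
            continue
        # Version subdirectories have purely numeric names, e.g. "1".
        versions = sorted(
            int(p.name) for p in model_dir.iterdir()
            if p.is_dir() and p.name.isdigit()
        )
        models[model_dir.name] = versions
    return models
```

For the tree above, `list_models("models")` would report the `food_classifier` model with version 1.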

Let’s [look at the configuration file first](https://github.com/teaching-on-testbeds/serve-system-chi/blob/main/models_staging/food_classifier/config.pbtxt). Here are the contents of `config.pbtxt`:

    name: "food_classifier"
    backend: "python"
    max_batch_size: 16
    input [
      {
        name: "INPUT_IMAGE"
        data_type: TYPE_STRING
        dims: [1]
      }
    ]
    output [
      {
        name: "FOOD_LABEL"
        data_type: TYPE_STRING
        dims: [1]
      },
      {
        name: "PROBABILITY"
        data_type: TYPE_FP32
        dims: [1]
      }
    ]
      instance_group [
        {
          count: 1
          kind: KIND_GPU
          gpus: [ 0 ]
        }
    ]

We have defined:

-   a `name`, which must match the directory name
-   a `backend` - we are using the basic [Python backend](https://github.com/triton-inference-server/python_backend). This is a highly flexible backend which allows us to define how our model will run by providing Python code in a `model.py` file.
-   a `max_batch_size` - we have set it to 16, but generally you would set this according to the GPU memory available
-   the `name`, `data_type`, and `dims` (dimensions) of each `input` to the model
-   the `name`, `data_type`, and `dims` (dimensions) of each `output` from the model
-   an `instance_group` with the `count` (number of copies of the model that we want to serve) and details of the device we want to serve it on (we will serve it on GPU 0). Note that to run the model on CPU instead, we could have used

<!-- -->

      instance_group [
        {
          count: 1
          kind: KIND_CPU
        }
      ]

Next, let’s [look at `model.py`](https://github.com/teaching-on-testbeds/serve-system-chi/blob/main/models_staging/food_classifier/1/model.py). For a Triton model with Python backend, the `model.py` must define a class named `TritonPythonModel` with an `execute` method; the `initialize` and `finalize` methods are optional. Ours has:

-   An `initialize` method to load the model, move it to the device specified in the `args` passed from the Triton server, and put it in inference mode. This will run as soon as Triton starts and loads models from the directory passed to it:

``` python
def initialize(self, args):
    model_dir = os.path.dirname(__file__)
    model_path = os.path.join(model_dir, "food11.pth")

    # From args, get info about what device the model is supposed to be on
    instance_kind = args.get("model_instance_kind", "cpu").lower()
    if instance_kind == "gpu":
        device_id = int(args.get("model_instance_device_id", 0))
        torch.cuda.set_device(device_id)
        self.device = torch.device(f"cuda:{device_id}" if torch.cuda.is_available() else 'cpu')
    else:
        self.device = torch.device('cpu')

    self.model = torch.load(model_path, map_location=self.device, weights_only=False)
    self.model.to(self.device)
    self.model.eval()

    self.transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])
    self.classes = np.array([
        "Bread", "Dairy product", "Dessert", "Egg", "Fried food",
        "Meat", "Noodles/Pasta", "Rice", "Seafood", "Soup",
        "Vegetable/Fruit"
    ])
```

-   A `preprocess` method, which will run on each input image that is passed:

``` python
def preprocess(self, image_data):
    # The input is a base64-encoded image, delivered either as bytes or as str.
    # Normalize to str first, so the payload is base64-decoded exactly once.
    if isinstance(image_data, bytes):
        image_data = image_data.decode("utf-8")
    image_data = base64.b64decode(image_data)

    image = Image.open(io.BytesIO(image_data)).convert('RGB')

    # Add a leading batch dimension: [C, H, W] -> [1, C, H, W]
    img_tensor = self.transform(image).unsqueeze(0)
    return img_tensor
```

-   and an `execute`, which will apply to batches of requests sent to this model:

``` python
def execute(self, requests):
    # Gather inputs from all requests
    batched_inputs = []
    for request in requests:
        in_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT_IMAGE")
        input_data_array = in_tensor.as_numpy()  # each assumed to be shape [1, 1]
        # Preprocess each input (resulting in a tensor of shape [1, C, H, W])
        batched_inputs.append(self.preprocess(input_data_array[0, 0]))
    
    # Combine inputs along the batch dimension
    batched_tensor = torch.cat(batched_inputs, dim=0).to(self.device)
    print("BatchSize: ", len(batched_inputs))
    # Run inference once on the full batch
    with torch.no_grad():
        outputs = self.model(batched_tensor)
    
    # Process the outputs and split them for each request
    responses = []
    for i, request in enumerate(requests):
        output = outputs[i:i+1]  # select the i-th output
        prob, predicted_class = torch.max(output, 1)
        predicted_label = self.classes[predicted_class.item()]
        probability = torch.sigmoid(prob).item()
        
        # Create numpy arrays with shape [1, 1] for consistency.
        out_label_np = np.array([[predicted_label]], dtype=object)
        out_prob_np = np.array([[probability]], dtype=np.float32)
        
        out_tensor_label = pb_utils.Tensor("FOOD_LABEL", out_label_np)
        out_tensor_prob = pb_utils.Tensor("PROBABILITY", out_prob_np)
        
        inference_response = pb_utils.InferenceResponse(
            output_tensors=[out_tensor_label, out_tensor_prob])
        responses.append(inference_response)
    
    return responses
```
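Note that `execute` reports `torch.sigmoid` of the largest raw output as `PROBABILITY`. That value lies between 0 and 1, but if the model's final layer emits unnormalized class scores (an assumption, since the model definition is not shown here), it is not a probability distribution over the 11 classes. A common alternative is a softmax over the class dimension (in PyTorch, `torch.softmax(outputs, dim=1)`). The pure-Python sketch below illustrates the idea; `softmax` and `top_class` are illustrative helpers, not part of `model.py`.

``` python
import math

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_class(logits, classes):
    """Return the most likely class label and its softmax probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return classes[best], probs[best]
```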

Finally, now that we understand how the server works, let’s [look at how the Flask app sends requests to it](https://github.com/teaching-on-testbeds/gourmetgram/blob/triton/app.py). Inside the Flask app, we now have a function which is called whenever there is a new image uploaded to `predict` or `test`, which sends the image to the Triton server:

``` python
def request_triton(image_path):
    try:
        # Connect to Triton server
        triton_client = httpclient.InferenceServerClient(url=TRITON_SERVER_URL)

        # Prepare inputs and outputs
        with open(image_path, 'rb') as f:
            image_bytes = f.read()

        inputs = []
        inputs.append(httpclient.InferInput("INPUT_IMAGE", [1, 1], "BYTES"))

        encoded_str =  base64.b64encode(image_bytes).decode("utf-8")
        input_data = np.array([[encoded_str]], dtype=object)
        inputs[0].set_data_from_numpy(input_data)

        outputs = []
        outputs.append(httpclient.InferRequestedOutput("FOOD_LABEL", binary_data=False))
        outputs.append(httpclient.InferRequestedOutput("PROBABILITY", binary_data=False))

        # Run inference
        results = triton_client.infer(model_name=FOOD11_MODEL_NAME, inputs=inputs, outputs=outputs)

        predicted_class = results.as_numpy("FOOD_LABEL")[0,0]
        probability = results.as_numpy("PROBABILITY")[0,0]

        return predicted_class, probability

    except Exception as e:
        print(f"Error during inference: {e}")  
        return None, None  
```
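The `httpclient` calls above wrap Triton's HTTP/REST inference API. As a rough sketch of what goes over the wire when binary data is not used, the request is a JSON document posted to `/v2/models/<model name>/infer`. The helper name `build_infer_payload` is hypothetical; the tensor names and shape are the ones used by `request_triton`.

``` python
import base64
import json

def build_infer_payload(image_bytes):
    """Build a JSON inference request body for the food classifier."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return {
        "inputs": [{
            "name": "INPUT_IMAGE",
            "shape": [1, 1],        # batch of 1, one string element
            "datatype": "BYTES",    # wire name for a TYPE_STRING tensor
            "data": [encoded],
        }],
        "outputs": [{"name": "FOOD_LABEL"}, {"name": "PROBABILITY"}],
    }

# Placeholder bytes stand in for the contents of a real image file.
body = json.dumps(build_infer_payload(b"placeholder image bytes"))
```

The resulting `body` could then be sent with any HTTP client as a POST request to `http://triton_server:8000/v2/models/food_classifier/infer`.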

### Bring up containers

To start, run

``` bash
# runs on node-serve-system
docker compose -f ~/serve-system-chi/docker/docker-compose-triton.yaml up -d
```

This uses a [Docker Compose configuration](https://github.com/teaching-on-testbeds/serve-system-chi/blob/main/docker/docker-compose-triton.yaml) to bring up three containers:

-   one container with NVIDIA Triton Server, with the host’s GPUs passed to the container, and with the `models` directory (containing the model and its configuration) passed as a bind mount
-   one container that hosts the Flask app, which will serve the user interface and send inference requests to the Triton server
-   one Jupyter container with the Triton client installed, for us to conduct a performance evaluation of the Triton server

Watch the logs from the Triton server as it starts up:

``` bash
# runs on node-serve-system
docker logs triton_server -f
```

Once the Triton server starts up, you should see something like

    +-----------------+---------+--------+
    | Model           | Version | Status |
    +-----------------+---------+--------+
    | food_classifier | 1       | READY  |
    +-----------------+---------+--------+

and then some additional output. Near the end, you will see

    "Started GRPCInferenceService at 0.0.0.0:8001"
    "Started HTTPService at 0.0.0.0:8000"
    "Started Metrics Service at 0.0.0.0:8002"

(and then some messages about not getting GPU power consumption, which is fine and not a concern.)

You can use Ctrl+C to stop watching the logs once you see this output.

Let’s test this service. In a browser, open

    http://A.B.C.D

but substitute the floating IP assigned to your instance, to access the Flask app. Upload an image and press “Submit” to get its class label.

Finally, check the logs of the Jupyter container:

``` bash
# runs on node-serve-system
docker logs jupyter
```

and look for a line like

    http://127.0.0.1:8888/lab?token=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

Paste this into a browser tab, but in place of 127.0.0.1, substitute the floating IP assigned to your instance, to open the Jupyter notebook interface that is running *on your compute instance*.

Then, in the file browser on the left side, open the “work” directory and then click on the `triton.ipynb` notebook to continue.

Meanwhile, on the host, run

``` bash
# runs on node-serve-system
nvtop
```

to monitor GPU usage - we will refer back to this a few times as we run through the rest of this notebook.

### Serving a PyTorch model

The Triton client comes with a performance analyzer, which we can use to send requests to the server and get some statistics back. Let’s try it:

``` bash
# runs inside Jupyter container
perf_analyzer -u triton_server:8000 -m food_classifier --input-data input.json -b 1
```

     Successfully read data for 1 stream/streams with 1 step/steps.
    *** Measurement Settings ***
      Batch size: 1
      Service Kind: TRITON
      Using "time_windows" mode for stabilization
      Stabilizing using average latency and throughput
      Measurement window: 5000 msec
      Using synchronous calls for inference

    Request concurrency: 1
      Client: 
        Request count: 1035
        Throughput: 57.1653 infer/sec
        Avg latency: 17393 usec (standard deviation 324 usec)
        p50 latency: 17333 usec
        p90 latency: 17563 usec
        p95 latency: 17797 usec
        p99 latency: 18695 usec
        Avg HTTP time: 17386 usec (send/recv 210 usec + response wait 17176 usec)
      Server: 
        Inference count: 1035
        Execution count: 1035
        Successful request count: 1035
        Avg request latency: 16593 usec (overhead 3 usec + queue 67 usec + compute input 55 usec + compute infer 16383 usec + compute output 83 usec)
    Inferences/Second vs. Client Average Batch Latency
    Concurrency: 1, throughput: 57.1653 infer/sec, latency 17393 us

Make a note of the line showing the total average request latency, and the breakdown including:

-   `queue`, the queuing delay
-   and `compute infer`, the inference delay
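To compare this breakdown across runs, it can help to pull the numbers out of the `Avg request latency` line programmatically. The following is a minimal sketch; the helper `parse_latency_breakdown` is hypothetical and assumes the line format shown in the output above.

``` python
import re

def parse_latency_breakdown(line):
    """Parse perf_analyzer's 'Avg request latency' line into usec values."""
    total = int(re.search(r"Avg request latency: (\d+) usec", line).group(1))
    # Everything after the opening parenthesis is "name N usec + name N usec ..."
    breakdown = line.split("(", 1)[1]
    parts = {
        name.strip(): int(value)
        for name, value in re.findall(r"([a-z ]+?) (\d+) usec", breakdown)
    }
    return total, parts

line = ("Avg request latency: 16593 usec (overhead 3 usec + queue 67 usec + "
        "compute input 55 usec + compute infer 16383 usec + compute output 83 usec)")
total, parts = parse_latency_breakdown(line)
infer_share = parts["compute infer"] / total  # fraction of latency spent in inference
```

With a single client, nearly all of the server-side latency is `compute infer`, and `queue` is negligible.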

<!--

    Avg request latency: 18689 usec (overhead 2 usec + queue 22 usec + compute input 44 usec + compute infer 18570 usec + compute output 49 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 51.549 infer/sec, latency 19311 usec

-->

Let’s further exercise this service. In the command above, a single client sends continuous requests to the server - each time a response is returned, a new request is generated. Now, let’s configure **8** concurrent clients, each sending continuous requests - as soon as any client gets a response, it sends a new request:

``` bash
# runs inside Jupyter container
perf_analyzer -u triton_server:8000 -m food_classifier --input-data input.json -b 1 --concurrency-range 8
```

<!-- 

    Avg request latency: 151375 usec (overhead 3 usec + queue 132341 usec + compute input 59 usec + compute infer 18922 usec + compute output 49 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 8, throughput: 52.3786 infer/sec, latency 151983 usec

-->
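As a rough sanity check on what to expect when several closed-loop clients share one model instance, we can estimate throughput and latency from the per-request service time. This is a back-of-the-envelope sketch, not a model of Triton's scheduler: the function `closed_loop_estimate` is hypothetical, and it assumes every request takes the same service time and every client always has exactly one request in flight. The 16.5 ms service time is an illustrative value close to the `compute infer` time measured with one client.

``` python
def closed_loop_estimate(clients, service_time_ms, instances=1):
    """Estimate (throughput in infer/sec, latency in ms, queue delay in ms)."""
    # At most min(clients, instances) requests are being served at any moment.
    busy = min(clients, instances)
    throughput = busy * 1000.0 / service_time_ms
    # Little's law: requests in flight = throughput * latency.
    latency_ms = clients / throughput * 1000.0
    queue_ms = latency_ms - service_time_ms
    return throughput, latency_ms, queue_ms

throughput, latency_ms, queue_ms = closed_loop_estimate(clients=8, service_time_ms=16.5)
```

Under these assumptions, throughput stays at roughly one request per service time no matter how many clients there are, while latency grows in proportion to the number of clients, because each request waits behind the others.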

While the inference time (`compute infer`) remains low, the overall system latency is high because of `queue` delay. Only one sample is processed at a time, and other samples have to wait in a queue for their turn. Here, since there are 8 concurrent clients sending continuous requests, the delay is approximately 8x the inference delay. With more concurrent requests, the queuing delay would grow even larger:

``` bash
# runs inside Jupyter container
perf_analyzer -u triton_server:8000 -m food_classifier --input-data input.json -b 1 --request-distribution=poisson --concurrency-range 1:8 --collect-metrics -f metrics_output.csv
```

     Successfully read data for 1 stream/streams with 1 step/steps.
    *** Measurement Settings ***
      Batch size: 1
      Service Kind: TRITON
      Using "time_windows" mode for stabilization
      Stabilizing using average latency and throughput
      Measurement window: 5000 msec
      Latency limit: 0 msec
      Concurrency limit: 8 concurrent requests
      Using synchronous calls for inference

    Request concurrency: 1
      Client: 
        Request count: 1034
        Throughput: 57.1949 infer/sec
        Avg latency: 17390 usec (standard deviation 1336 usec)
        p50 latency: 17185 usec
        p90 latency: 17526 usec
        p95 latency: 18181 usec
        p99 latency: 21263 usec
        Avg HTTP time: 17383 usec (send/recv 197 usec + response wait 17186 usec)
      Server: 
        Inference count: 1034
        Execution count: 1034
        Successful request count: 1034
        Avg request latency: 16608 usec (overhead 3 usec + queue 67 usec + compute input 56 usec + compute infer 16401 usec + compute output 79 usec)
      Server Prometheus Metrics: 
        Avg GPU Util

``` bash
# runs inside Jupyter container
perf_analyzer -u triton_server:8000 -m food_classifier --input-data input.json -b 1 --request-distribution=poisson --concurrency-range 1:8 --log-frequency=10 --collect-metrics -f metrics_output_test.csv
```

     Successfully read data for 1 stream/streams with 1 step/steps.
    *** Measurement Settings ***
      Batch size: 1
      Service Kind: TRITON
      Using "time_windows" mode for stabilization
      Stabilizing using average latency and throughput
      Measurement window: 5000 msec
      Latency limit: 0 msec
      Concurrency limit: 8 concurrent requests
      Using synchronous calls for inference

    Request concurrency: 1
      Client: 
        Request count: 931
        Throughput: 51.4832 infer/sec
        Avg latency: 19321 usec (standard deviation 1501 usec)
        p50 latency: 19045 usec
        p90 latency: 19528 usec
        p95 latency: 19992 usec
        p99 latency: 24033 usec
        Avg HTTP time: 19315 usec (send/recv 226 usec + response wait 19089 usec)
      Server: 
        Inference count: 931
        Execution count: 931
        Successful request count: 931
        Avg request latency: 18521 usec (overhead 4 usec + queue 66 usec + compute input 55 usec + compute infer 18315 usec + compute output 81 usec)
      Server Prometheus Metrics: 
        Avg GPU Utilizat

``` bash
# runs inside Jupyter container
perf_analyzer -u triton_server:8000 -m food_classifier --input-data input.json -b 1 --request-distribution=poisson --concurrency-range 1:8 --collect-metrics -f metrics_output_test2.csv
```

     Successfully read data for 1 stream/streams with 1 step/steps.
    *** Measurement Settings ***
      Batch size: 1
      Service Kind: TRITON
      Using "time_windows" mode for stabilization
      Stabilizing using average latency and throughput
      Measurement window: 5000 msec
      Latency limit: 0 msec
      Concurrency limit: 8 concurrent requests
      Using synchronous calls for inference

    Request concurrency: 1
      Client: 
        Request count: 932
        Throughput: 51.5819 infer/sec
        Avg latency: 19274 usec (standard deviation 1565 usec)
        p50 latency: 18980 usec
        p90 latency: 19505 usec
        p95 latency: 20193 usec
        p99 latency: 23958 usec
        Avg HTTP time: 19267 usec (send/recv 220 usec + response wait 19047 usec)
      Server: 
        Inference count: 932
        Execution count: 932
        Successful request count: 932
        Avg request latency: 18459 usec (overhead 3 usec + queue 66 usec + compute input 56 usec + compute infer 18254 usec + compute output 78 usec)
      Server Prometheus Metrics: 
        Avg GPU Utilizat

Although the delay is large (over 100 ms), it’s not because of inadequate compute - if you check the `nvtop` display on the host while the tests above are running, you will note low GPU utilization! Take a screenshot of the `nvtop` output while a test is running.

We *could* get more throughput without increasing prediction latency, by batching requests:

``` bash
# runs inside Jupyter container
perf_analyzer -u triton_server:8000 -m food_classifier --input-data input.json -b 16 --concurrency-range 1
```

But, that’s not very helpful in a situation when requests come from individual users, one at a time.

### Dynamic batching

Earlier, we noted that our model can achieve higher throughput with low latency by performing inference on batches of input samples, instead of individual samples. However, our client sends requests with individual samples.

To improve performance, we can ask the Triton server to batch incoming requests whenever possible, and send them through the model together instead of one at a time. In other words, if the server is ready to handle the next request, and it finds four requests waiting in the queue, it should serve those four as a batch instead of just taking the next request in line.
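In Triton, this behavior is enabled with a `dynamic_batching` block in the model's `config.pbtxt`; the exact settings used for this section are not shown here. To illustrate the idea itself, the following toy sketch forms a batch from whatever is waiting in the queue, up to the configured maximum. It is a simplification under stated assumptions: `take_batch` and the request names are hypothetical, and Triton's real scheduler has additional options (such as a maximum queue delay) that this does not model.

``` python
from collections import deque

def take_batch(queue, max_batch_size):
    """Remove and return up to max_batch_size waiting requests, oldest first."""
    batch = []
    while queue and len(batch) < max_batch_size:
        batch.append(queue.popleft())
    return batch

# Four requests arrived while the model instance was busy.
waiting = deque(["req-1", "req-2", "req-3", "req-4"])

# When the instance becomes free, all four are served in a single execution.
batch = take_batch(waiting, max_batch_size=16)
```

Inside `execute`, such a batch shows up as a `requests` list with more than one element, which is why `model.py` concatenates the inputs before running the model once.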

Before benchmarking, get the per-batch stats for the model (on a freshly started server, all of the counts are zero):

``` bash
# runs inside Jupyter container
curl http://triton_server:8000/v2/models/food_classifier/versions/1/stats
```

    {"model_stats":[{"name":"food_classifier","version":"1","last_inference":0,"inference_count":0,"execution_count":0,"inference_stats":{"success":{"count":0,"ns":0},"fail":{"count":0,"ns":0},"queue":{"count":0,"ns":0},"compute_input":{"count":0,"ns":0},"compute_infer":{"count":0,"ns":0},"compute_output":{"count":0,"ns":0},"cache_hit":{"count":0,"ns":0},"cache_miss":{"count":0,"ns":0}},"response_stats":{},"batch_stats":[],"memory_usage":[]}]}


Then, run the benchmark:

``` bash
# runs inside Jupyter container
perf_analyzer -u triton_server:8000 -m food_classifier --input-data input.json -b 1 --concurrency-range 8
```

     Successfully read data for 1 stream/streams with 1 step/steps.
    *** Measurement Settings ***
      Batch size: 1
      Service Kind: TRITON
      Using "time_windows" mode for stabilization
      Stabilizing using average latency and throughput
      Measurement window: 5000 msec
      Using synchronous calls for inference

    Request concurrency: 8
      Client: 
        Request count: 1081
        Throughput: 59.7671 infer/sec
        Avg latency: 133233 usec (standard deviation 26632 usec)
        p50 latency: 129976 usec
        p90 latency: 134445 usec
        p95 latency: 134874 usec
        p99 latency: 152350 usec
        Avg HTTP time: 133221 usec (send/recv 204 usec + response wait 133017 usec)
      Server: 
        Inference count: 1082
        Execution count: 1082
        Successful request count: 1082
        Avg request latency: 132408 usec (overhead 2 usec + queue 115751 usec + compute input 58 usec + compute infer 16517 usec + compute output 79 usec)
    Inferences/Second vs. Client Average Batch Latency
    Concurrency: 8, throughput: 59.7671 infer/sec, la

<!--

    Avg request latency: 100423 usec (overhead 6 usec + queue 44892 usec + compute input 197 usec + compute infer 55111 usec + compute output 216 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 8, throughput: 78.6276 infer/sec, latency 101232 usec

-->

and get per-batch stats again:

``` bash
# runs inside Jupyter container
curl http://triton_server:8000/v2/models/food_classifier/versions/1/stats
```

<!--

{"model_stats":[{"name":"food_classifier","version":"1","last_inference":1741928954242,"inference_count":1436,"execution_count":386,"inference_stats":{"success":{"count":1436,"ns":144129653806},"fail":{"count":0,"ns":0},"queue":{"count":1436,"ns":64542800676},"compute_input":{"count":1436,"ns":283368073},"compute_infer":{"count":1436,"ns":78984688177},"compute_output":{"count":1436,"ns":309635270},"cache_hit":{"count":0,"ns":0},"cache_miss":{"count":0,"ns":0}},"response_stats":{},"batch_stats":[{"batch_size":1,"compute_input":{"count":26,"ns":1754466},"compute_infer":{"count":26,"ns":757012965},"compute_output":{"count":26,"ns":2038319}},{"batch_size":2,"compute_input":{"count":127,"ns":14474588},"compute_infer":{"count":127,"ns":3718519926},"compute_output":{"count":127,"ns":13184875}},{"batch_size":3,"compute_input":{"count":55,"ns":7182962},"compute_infer":{"count":55,"ns":2144383142},"compute_output":{"count":55,"ns":7683505}},{"batch_size":4,"compute_input":{"count":9,"ns":1446080},"compute_infer":{"count":9,"ns":456549788},"compute_output":{"count":9,"ns":1596636}},{"batch_size":5,"compute_input":{"count":73,"ns":14796021},"compute_infer":{"count":73,"ns":4268808423},"compute_output":{"count":73,"ns":16766209}},{"batch_size":6,"compute_input":{"count":82,"ns":19691717},"compute_infer":{"count":82,"ns":5577222019},"compute_output":{"count":82,"ns":22604780}},{"batch_size":7,"compute_input":{"count":14,"ns":4742974},"compute_infer":{"count":14,"ns":1103416079},"compute_output":{"count":14,"ns":4618631}}],"memory_usage":[]}]}

-->

Note that the stats show that some requests were served in batch sizes greater than 1, even though each client sent a single request at a time.
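The statistics response is a single line of JSON, which is hard to read by eye. Here is a minimal sketch that summarizes how many executions ran at each batch size. The helper `batch_size_histogram` is hypothetical; it relies only on the `model_stats`, `batch_stats`, `batch_size`, and `compute_infer` fields that appear in the stats output, and it assumes the response covers a single model. The sample string is a shortened, illustrative response.

``` python
import json

def batch_size_histogram(stats_json):
    """Map batch size -> number of executions, from the model statistics JSON."""
    stats = json.loads(stats_json)
    histogram = {}
    for model in stats["model_stats"]:
        for entry in model["batch_stats"]:
            histogram[entry["batch_size"]] = entry["compute_infer"]["count"]
    return histogram

# Shortened, illustrative response in the same shape as the server's output.
sample = json.dumps({"model_stats": [{
    "name": "food_classifier", "version": "1",
    "batch_stats": [
        {"batch_size": 1, "compute_infer": {"count": 26, "ns": 757012965}},
        {"batch_size": 2, "compute_infer": {"count": 127, "ns": 3718519926}},
    ],
}]})

histogram = batch_size_histogram(sample)
# Total inferences = sum over batch sizes of (batch size * executions).
total_inferences = sum(size * count for size, count in histogram.items())
```

If every key in the histogram is 1, no batching took place; entries with larger keys mean the server combined requests from different clients.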

### Scaling up

Another easy way to improve performance is to scale up! Let’s edit the model configuration:

``` bash
# runs on node-serve-system
nano ~/serve-system-chi/models/food_classifier/config.pbtxt
```

and change

      instance_group [
        {
          count: 1
          kind: KIND_GPU
          gpus: [ 0 ]
        }
    ]

to run two instances on GPU 0 and two instances on GPU 1:

      instance_group [
        {
          count: 2
          kind: KIND_GPU
          gpus: [ 0 ]
        },
        {
          count: 2
          kind: KIND_GPU
          gpus: [ 1 ]
        }
    ]

Save the file (use Ctrl+O then Enter, then Ctrl+X).

Re-build the container image with this change:

``` bash
# runs on node-serve-system
docker compose -f ~/serve-system-chi/docker/docker-compose-triton.yaml build triton_server
```

and then bring the server back up:

``` bash
# runs on node-serve-system
docker compose -f ~/serve-system-chi/docker/docker-compose-triton.yaml up triton_server --force-recreate -d
```

and use

``` bash
# runs on node-serve-system
docker logs triton_server
```

to make sure the server comes up and is ready.

On the host, run

``` bash
# runs on node-serve-system
nvidia-smi
```

and note that there are two instances of `triton_python_backend` processes running on GPU 0, and two on GPU 1.

Then, benchmark *this* service with increased concurrency:

``` bash
# runs inside Jupyter container
perf_analyzer -u triton_server:8000 -m food_classifier --input-data input.json -b 1 --concurrency-range 8
```

<!-- 

    Avg request latency: 40707 usec (overhead 3 usec + queue 7036 usec + compute input 75 usec + compute infer 33514 usec + compute output 78 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 8, throughput: 192.849 infer/sec, latency 41374 usec

-->

Although there is still some queuing delay (because our degree of concurrency, 8, is still higher than the number of server instances, 4), and the inference time has also increased because the instances share compute resources, the prediction delay is still on the order of tens of milliseconds, not over 100 ms as it was previously with concurrency 8!

Also, if you look at the `nvtop` output on the host while running this test, you will observe higher GPU utilization than before. That is good: we want to use the GPU, and underutilization is wasteful. (Take a screenshot!) However, we are still not fully utilizing the GPU.

Let’s try increasing the number of instances again. Edit the model configuration:

``` bash
# runs on node-serve-system
nano ~/serve-system-chi/models/food_classifier/config.pbtxt
```

and change

      instance_group [
        {
          count: 2
          kind: KIND_GPU
          gpus: [ 0 ]
        },
        {
          count: 2
          kind: KIND_GPU
          gpus: [ 1 ]
        }
    ]

to

      instance_group [
        {
          count: 4
          kind: KIND_GPU
          gpus: [ 0 ]
        },
        {
          count: 4
          kind: KIND_GPU
          gpus: [ 1 ]
        }
    ]

Re-build the container image with this change:

``` bash
# runs on node-serve-system
docker compose -f ~/serve-system-chi/docker/docker-compose-triton.yaml build triton_server
```

and then bring the server back up:

``` bash
# runs on node-serve-system
docker compose -f ~/serve-system-chi/docker/docker-compose-triton.yaml up triton_server --force-recreate -d
```

and use

``` bash
# runs on node-serve-system
docker logs triton_server
```

to make sure the server comes up and is ready.

Then, re-run our benchmark:

``` bash
# runs inside Jupyter container
perf_analyzer -u triton_server:8000 -m food_classifier --input-data input.json -b 1 --concurrency-range 8
```

<!--

    Avg request latency: 66737 usec (overhead 2 usec + queue 466 usec + compute input 61 usec + compute infer 66118 usec + compute output 89 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 8, throughput: 118.688 infer/sec, latency 67559 usec

-->

This makes things worse: our inference time is higher, even though `nvtop` shows that we are still underutilizing the GPU. (Take a screenshot!)

Our system is not limited by the GPU, which remains underutilized. Instead, the bottleneck is the overhead of the Python backend and our `model.py` implementation.

### Serving an ONNX model

The Python backend we have been using is flexible, but not necessarily the most performant. To get better performance, we will use one of the highly optimized backends in Triton. Since we already have an ONNX model, let’s use the ONNX backend.

To serve a model using the ONNX backend, we will create a [directory structure like this](https://github.com/teaching-on-testbeds/serve-system-chi/tree/main/models_staging/food_classifier_onnx):

    food_classifier_onnx/
    ├── 1
    │   └── model.onnx
    └── config.pbtxt

There is no more `model.py` - Triton serves the model directly; we just have to name it `model.onnx`. In [`config.pbtxt`](https://github.com/teaching-on-testbeds/serve-system-chi/blob/main/models_staging/food_classifier_onnx/config.pbtxt), we will specify the backend as `onnxruntime`:

    name: "food_classifier_onnx"
    backend: "onnxruntime"
    max_batch_size: 16
    input [
      {
        name: "input"  # has to match ONNX model's input name
        data_type: TYPE_FP32
        dims: [3, 224, 224]  # has to match ONNX input shape
      }
    ]
    output [
      {
        name: "output"  # has to match ONNX model output name
        data_type: TYPE_FP32  # output is a list of probabilities
        dims: [11]  # one value per class
      }
    ]
      instance_group [
        {
          count: 1
          kind: KIND_GPU
          gpus: [ 0 ]
        }
    ]
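Unlike the Python-backend model, this configuration takes a `TYPE_FP32` input with `dims: [3, 224, 224]`, so the resizing and normalization that `model.py` performed must now happen before the request is sent. The sketch below shows the per-pixel normalization, using the mean and standard deviation from the `initialize` method shown earlier, and the size of one input tensor. The helper names are hypothetical, and pure Python is used only for illustration; in practice the same `torchvision` transforms would be applied on the client.

``` python
# Per-channel statistics, copied from the transforms in model.py
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)

def normalize_pixel(rgb):
    """Scale one 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return tuple((value / 255.0 - m) / s for value, m, s in zip(rgb, MEAN, STD))

def fp32_tensor_bytes(dims):
    """Size in bytes of one FP32 tensor with the given dims (4 bytes per element)."""
    n = 1
    for d in dims:
        n *= d
    return 4 * n

# One request carries a full 3 x 224 x 224 float tensor.
payload_bytes = fp32_tensor_bytes([3, 224, 224])
```

Each request to this model therefore carries a few hundred kilobytes of tensor data, rather than a compressed image.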

Copy this to Triton’s models directory:

``` bash
# runs on node-serve-system
cp -r ~/serve-system-chi/models_staging/food_classifier_onnx ~/serve-system-chi/models/
```

Re-build the container image with this change:

``` bash
# runs on node-serve-system
docker compose -f ~/serve-system-chi/docker/docker-compose-triton.yaml build triton_server
```

and then bring the server back up:

``` bash
# runs on node-serve-system
docker compose -f ~/serve-system-chi/docker/docker-compose-triton.yaml up triton_server --force-recreate -d
```

and use

``` bash
# runs on node-serve-system
docker logs triton_server
```

to make sure the server comes up and is ready. Note that the server will load two models: the original `food_classifier` with Python backend, and the `food_classifier_onnx` model we just added.

Let’s benchmark our service. Our ONNX model won’t accept image bytes directly - it expects images that already have been pre-processed into arrays. So, our benchmark command will be a little bit different:

``` bash
# runs inside Jupyter container
perf_analyzer -u triton_server:8000 -m food_classifier_onnx -b 1 --shape IMAGE:3,224,224
```

    *** Measurement Settings ***
      Batch size: 1
      Service Kind: TRITON
      Using "time_windows" mode for stabilization
      Stabilizing using average latency and throughput
      Measurement window: 5000 msec
      Using synchronous calls for inference

    Request concurrency: 1
      Client: 
        Request count: 2400
        Throughput: 125.502 infer/sec
        Avg latency: 7373 usec (standard deviation 868 usec)
        p50 latency: 7388 usec
        p90 latency: 8330 usec
        p95 latency: 8442 usec
        p99 latency: 9236 usec
        Avg HTTP time: 7365 usec (send/recv 520 usec + response wait 6845 usec)
      Server: 
        Inference count: 2400
        Execution count: 2400
        Successful request count: 2400
        Avg request latency: 5285 usec (overhead 35 usec + queue 80 usec + compute input 231 usec + compute infer 4918 usec + compute output 19 usec)
    Inferences/Second vs. Client Average Batch Latency
    Concurrency: 1, throughput: 125.502 infer/sec, latency 7373 usec


In [37]:
perf_analyzer -u triton_server:8000  -m food_classifier_onnx -b 1 --shape IMAGE:3,224,224 --request-distribution=poisson --concurrency-range=1:8 --collect-metrics -f metrics_onnx1.csv

*** Measurement Settings ***
  Batch size: 1
  Service Kind: TRITON
  Using "time_windows" mode for stabilization
  Stabilizing using average latency and throughput
  Measurement window: 5000 msec
  Latency limit: 0 msec
  Concurrency limit: 8 concurrent requests
  Using synchronous calls for inference

Request concurrency: 1
  Client: 
    Request count: 2227
    Throughput: 114.747 infer/sec
    Avg latency: 7948 usec (standard deviation 907 usec)
    p50 latency: 7758 usec
    p90 latency: 9213 usec
    p95 latency: 9415 usec
    p99 latency: 9502 usec
    Avg HTTP time: 7939 usec (send/recv 586 usec + response wait 7353 usec)
  Server: 
    Inference count: 2227
    Execution count: 2227
    Successful request count: 2227
    Avg request latency: 5581 usec (overhead 37 usec + queue 82 usec + compute input 238 usec + compute infer 5202 usec + compute output 21 usec)
  Server Prometheus Metrics: 
    Avg GPU Utilization:
      GPU-81207bda-7e38-0495-0510-11595cdbff2c : 0%
      GPU-f

In [None]:
# Simulation of M/D/1 queue varying λ from 36.5-328.5 considering an average μ=365 (to get values for ρ in [0.1,0.9])

In [78]:
  perf_analyzer -u triton_server:8000 \
  -m food_classifier_onnx \
  -b 1 \
  --shape IMAGE:3,224,224 \
  --request-rate-range 11:99:11 \
  --request-distribution=poisson \
  --collect-metrics \
  -f test_output.csv

*** Measurement Settings ***
  Batch size: 1
  Service Kind: TRITON
  Using "time_windows" mode for stabilization
  Stabilizing using average latency and throughput
  Measurement window: 5000 msec
  Latency limit: 0 msec
  Request Rate limit: 99 requests per seconds
  Using poisson distribution on request generation
  Using synchronous calls for inference

Request Rate: 11 inference requests per second
  Client: 
    Request count: 161
    Throughput: 8.89096 infer/sec
    Avg latency: 10257 usec (standard deviation 1565 usec)
    p50 latency: 9837 usec
    p90 latency: 10652 usec
    p95 latency: 14947 usec
    p99 latency: 16994 usec
    Avg HTTP time: 10247 usec (send/recv 1062 usec + response wait 9185 usec)
  Server: 
    Inference count: 161
    Execution count: 161
    Successful request count: 161
    Avg request latency: 7253 usec (overhead 44 usec + queue 485 usec + compute input 201 usec + compute infer 6496 usec + compute output 26 usec)
  Server Prometheus Metrics: 
    Av

In [None]:
#Run above gave an average μ of 20000 => adjusting λ for the test below to actually get ρ values between [0.1,0.9]
#Simulation of M/D/1 queue varying λ from 2000-18000 considering an average μ=20000 (to get values for ρ in [0.1,0.9])

In [None]:
#!/bin/bash

# Estimated service rate μ (inferences/sec)
MU=100

# Triton HTTP endpoint; defaults to the address used by the other perf_analyzer commands
SERVER_URL="${SERVER_URL:-triton_server:8000}"

SUMMARY_FILE="rho_summary.csv"

# Write header if file doesn't exist
if [ ! -f "$SUMMARY_FILE" ]; then
  echo "rho,lambda,inferences_per_sec,server_queue,p50_latency" > "$SUMMARY_FILE"
fi

# Loop over rho values
for RHO in 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9; do
  # Arrival rate λ = ρ · μ, truncated to an integer request rate
  LAMBDA=$(awk "BEGIN {printf \"%d\", $RHO * $MU}")
  OUTFILE="temp_rho_${RHO}.csv"

  echo "▶ Running for ρ=$RHO (λ=$LAMBDA req/sec)"

  perf_analyzer -u "$SERVER_URL" \
    -m food_classifier_onnx \
    -b 1 \
    --shape IMAGE:3,224,224 \
    --request-rate-range "${LAMBDA}:${LAMBDA}:1" \
    --request-distribution=poisson \
    --collect-metrics \
    -f "$OUTFILE"

  if [ -f "$OUTFILE" ]; then
    # Take the first data row (the line after the CSV header)
    LINE=$(tail -n +2 "$OUTFILE" | head -1)
    if [ -n "$LINE" ]; then
      INF_SEC=$(echo "$LINE" | cut -d',' -f2)
      SERVER_QUEUE=$(echo "$LINE" | cut -d',' -f6)
      P50_LATENCY=$(echo "$LINE" | cut -d',' -f12)
      echo "$RHO,$LAMBDA,$INF_SEC,$SERVER_QUEUE,$P50_LATENCY" >> "$SUMMARY_FILE"
      echo "Logged for ρ=$RHO"
    else
      echo "No stable data at ρ=$RHO"
    fi
  else
    echo "Failed to run perf_analyzer for ρ=$RHO"
  fi
done

echo "Done! Summary in $SUMMARY_FILE"


▶ Running for ρ=0.1 (λ=10 req/sec)
*** Measurement Settings ***
  Batch size: 1
  Service Kind: TRITON
  Using "time_windows" mode for stabilization
  Stabilizing using average latency and throughput
  Measurement window: 5000 msec
  Latency limit: 0 msec
  Request Rate limit: 10 requests per seconds
  Using poisson distribution on request generation
  Using synchronous calls for inference

Request Rate: 10 inference requests per second
  Client: 
    Request count: 182
    Throughput: 10.0259 infer/sec
    Avg latency: 10016 usec (standard deviation 1013 usec)
    p50 latency: 9949 usec
    p90 latency: 10168 usec
    p95 latency: 10347 usec
    p99 latency: 15253 usec
    Avg HTTP time: 10007 usec (send/recv 1043 usec + response wait 8964 usec)
  Server: 
    Inference count: 182
    Execution count: 182
    Successful request count: 182
    Avg request latency: 7011 usec (overhead 45 usec + queue 234 usec + compute input 218 usec + compute infer 6487 usec + com

In [70]:
perf_analyzer -u triton_server:8000 \
  -m food_classifier_onnx \
  -b 1 \
  --shape IMAGE:3,224,224 \
  --request-rate-range 11:99:11 \
  --request-distribution=poisson \
  --collect-metrics \
  -f test_output.csv

*** Measurement Settings ***
  Batch size: 1
  Service Kind: TRITON
  Using "time_windows" mode for stabilization
  Stabilizing using average latency and throughput
  Measurement window: 5000 msec
  Latency limit: 0 msec
  Request Rate limit: 99 requests per seconds
  Using poisson distribution on request generation
  Using synchronous calls for inference

Request Rate: 11 inference requests per second
  Client: 
    Request count: 161
    Throughput: 8.89008 infer/sec
    Avg latency: 10213 usec (standard deviation 1569 usec)
    p50 latency: 9825 usec
    p90 latency: 10831 usec
    p95 latency: 14911 usec
    p99 latency: 17069 usec
    Avg HTTP time: 10203 usec (send/recv 1061 usec + response wait 9142 usec)
  Server: 
    Inference count: 161
    Execution count: 161
    Successful request count: 161
    Avg request latency: 7207 usec (overhead 43 usec + queue 476 usec + compute input 202 usec + compute infer 6458 usec + compute output 27 usec)
  Server Prometheus Metrics: 
    Av

This model has much better inference performance than our PyTorch model with Python backend did, in a similar test. Also, if we monitor with `nvtop`, we should see higher GPU utilization while the test is running (which is a good thing!) (Take a screenshot!)

Let’s try scaling *this* model up. Edit the model configuration:

``` bash
# runs on node-serve-system
nano ~/serve-system-chi/models/food_classifier_onnx/config.pbtxt
```

and change

      instance_group [
        {
          count: 1
          kind: KIND_GPU
          gpus: [ 0 ]
        }
    ]

to

      instance_group [
        {
          count: 2      
          kind: KIND_GPU
          gpus: [ 0, 1 ]
        }
    ]

Save the file (use Ctrl+O then Enter, then Ctrl+X).

Re-build the container image with this change:

``` bash
# runs on node-serve-system
docker compose -f ~/serve-system-chi/docker/docker-compose-triton.yaml build triton_server
```

and then bring the server back up:

``` bash
# runs on node-serve-system
docker compose -f ~/serve-system-chi/docker/docker-compose-triton.yaml up triton_server --force-recreate -d
```

and use

``` bash
# runs on node-serve-system
docker logs triton_server
```

to make sure the server comes up and is ready.

Then, run our benchmark with higher concurrency. (With `count: 2` and `gpus: [ 0, 1 ]`, Triton runs 2 instances on each GPU; we chose this because we noticed that a single instance used less than half a GPU.)

Watch the `nvtop` output as you run this test! (Take a screenshot!)

In [None]:
# runs inside Jupyter container
perf_analyzer -u triton_server:8000  -m food_classifier_onnx -b 1 --shape IMAGE:3,224,224 --concurrency-range 8 

<!-- 

    Avg request latency: 3961 usec (overhead 18 usec + queue 697 usec + compute input 97 usec + compute infer 3137 usec + compute output 11 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 8, throughput: 1182.39 infer/sec, latency 6089 usec

-->

This time, we should see that our model is fully utilizing the GPU (that’s good!), and our inference performance is much better than the PyTorch model with Python backend could achieve with concurrency 8.

Let’s see how we do with even higher concurrency:

In [None]:
# runs inside Jupyter container
perf_analyzer -u triton_server:8000  -m food_classifier_onnx -b 1 --shape IMAGE:3,224,224 --concurrency-range 16  

<!-- 



    Avg request latency: 9960 usec (overhead 19 usec + queue 6793 usec + compute input 100 usec + compute infer 3036 usec + compute output 11 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 16, throughput: 1257.15 infer/sec, latency 12025 usec

-->

We still have some queue delay, since the rate at which requests arrive is greater than the service rate of the models. But, we can feel good that we are no longer underutilizing the GPUs!
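
One way to sanity-check a closed-loop benchmark like this is Little's law: the average number of requests in flight equals throughput multiplied by average latency. The sketch below applies it to the reference measurements for the two runs above (recorded in the notebook source): 1182.39 infer/sec at 6089 usec with concurrency 8, and 1257.15 infer/sec at 12025 usec with concurrency 16. The function name `requests_in_flight` is an illustrative choice, not part of any Triton tool.

```python
def requests_in_flight(throughput_per_sec: float, avg_latency_usec: float) -> float:
    """Little's law: L = throughput * latency (latency converted to seconds)."""
    return throughput_per_sec * (avg_latency_usec / 1_000_000)


# concurrency -> (throughput in infer/sec, average client latency in usec)
runs = {8: (1182.39, 6089), 16: (1257.15, 12025)}

for concurrency, (throughput, latency_usec) in runs.items():
    in_flight = requests_in_flight(throughput, latency_usec)
    print(f"concurrency {concurrency}: ~{in_flight:.1f} requests in flight")
```

Both estimates land close to the configured concurrency. Doubling the concurrency roughly doubles the latency while throughput barely moves, which is what we would expect once the server is saturated and the extra requests simply wait in the queue.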

There’s one more issue we should address: our ONNX model doesn’t directly work with our Flask server now, because the inputs and outputs are different. The ONNX model expects a pre-processed array, and returns a list of class probabilities.

Since the pre-processing and post-processing don’t need a GPU anyway, we’ll move them to the Flask app.

Edit the Docker compose file:

``` bash
# runs on node-serve-system
nano ~/serve-system-chi/docker/docker-compose-triton.yaml
```

and change

      flask:
        build:
          context: https://github.com/teaching-on-testbeds/gourmetgram.git#triton

to

      flask:
        build:
          context: https://github.com/teaching-on-testbeds/gourmetgram.git#triton_onnx

to use [a version of our Flask app where the pre- and post-processing is built in](https://github.com/teaching-on-testbeds/gourmetgram/blob/triton_onnx/app.py). Also change

          - FOOD11_MODEL_NAME=food_classifier

to

          - FOOD11_MODEL_NAME=food_classifier_onnx

so that our Flask app will send requests to the new ONNX model service.

Then run

``` bash
# runs on node-serve-system
docker compose -f ~/serve-system-chi/docker/docker-compose-triton.yaml build flask
```

to re-build the container image, and

``` bash
# runs on node-serve-system
docker compose -f ~/serve-system-chi/docker/docker-compose-triton.yaml up flask --force-recreate -d
```

to restart the Flask container with the new image.

Let’s test this service. In a browser, open

    http://A.B.C.D

but substitute the floating IP assigned to your instance, to access the Flask app. Upload an image and press “Submit” to get its class label.

Then, download this entire notebook for later reference.

## Experiment

In [92]:
perf_analyzer -u triton_server:8000  -m food_classifier_onnx -b 1 --shape IMAGE:3,224,224 \
--request-distribution=constant --request-rate-range 100

*** Measurement Settings ***
  Batch size: 1
  Service Kind: TRITON
  Using "time_windows" mode for stabilization
  Stabilizing using average latency and throughput
  Measurement window: 5000 msec
  Using uniform distribution on request generation
  Using synchronous calls for inference

Request Rate: 100 inference requests per second
  Client: 
    Request count: 1927
    Throughput: 100.003 infer/sec
    Avg latency: 8476 usec (standard deviation 639 usec)
    p50 latency: 8722 usec
    p90 latency: 8888 usec
    p95 latency: 9036 usec
    p99 latency: 9330 usec
    Avg HTTP time: 8467 usec (send/recv 380 usec + response wait 8087 usec)
  Server: 
    Inference count: 1927
    Execution count: 1927
    Successful request count: 1927
    Avg request latency: 6353 usec (overhead 43 usec + queue 97 usec + compute input 232 usec + compute infer 5955 usec + compute output 25 usec)
Inferences/Second vs. Client Average Batch Latency
Request Rate: 100, throughput: 100.003 infer/sec, latency 

In [93]:
perf_analyzer -u triton_server:8000  -m food_classifier_onnx -b 1 --shape IMAGE:3,224,224 \
    --request-distribution=poisson --request-rate-range 100

*** Measurement Settings ***
  Batch size: 1
  Service Kind: TRITON
  Using "time_windows" mode for stabilization
  Stabilizing using average latency and throughput
  Measurement window: 5000 msec
  Using poisson distribution on request generation
  Using synchronous calls for inference

Request Rate: 100 inference requests per second
  Client: 
    Request count: 1832
    Avg send request rate: 99.36 infer/sec
    Throughput: 97.53 infer/sec
    Avg latency: 10952 usec (standard deviation 3851 usec)
    p50 latency: 9494 usec
    p90 latency: 16554 usec
    p95 latency: 18753 usec
    p99 latency: 22897 usec
    Avg HTTP time: 10943 usec (send/recv 742 usec + response wait 10201 usec)
  Server: 
    Inference count: 1832
    Execution count: 1832
    Successful request count: 1832
    Avg request latency: 8080 usec (overhead 37 usec + queue 2816 usec + compute input 222 usec + compute infer 4983 usec + compute output 21 usec)
Inferences/Second vs. Client Average Batch Latency
Request 

In [94]:
perf_analyzer -u triton_server:8000  -m food_classifier_onnx -b 1 --shape IMAGE:3,224,224 \
    --request-distribution=poisson --request-rate-range 200

*** Measurement Settings ***
  Batch size: 1
  Service Kind: TRITON
  Using "time_windows" mode for stabilization
  Stabilizing using average latency and throughput
  Measurement window: 5000 msec
  Using poisson distribution on request generation
  Using synchronous calls for inference

Request Rate: 200 inference requests per second
  Client: 
    Request count: 3918
    Avg send request rate: 203.79 infer/sec
    Throughput: 196.63 infer/sec
    Avg latency: 9522 usec (standard deviation 3035 usec)
    p50 latency: 9128 usec
    p90 latency: 13364 usec
    p95 latency: 14019 usec
    p99 latency: 16278 usec
    Avg HTTP time: 9513 usec (send/recv 678 usec + response wait 8835 usec)
  Server: 
    Inference count: 3918
    Execution count: 3918
    Successful request count: 3918
    Avg request latency: 6745 usec (overhead 25 usec + queue 3241 usec + compute input 168 usec + compute infer 3296 usec + compute output 14 usec)
Inferences/Second vs. Client Average Batch Latency
Request R

In [95]:
perf_analyzer -u triton_server:8000  -m food_classifier_onnx -b 1 --shape IMAGE:3,224,224 \
    --request-distribution=poisson --request-rate-range 500

*** Measurement Settings ***
  Batch size: 1
  Service Kind: TRITON
  Using "time_windows" mode for stabilization
  Stabilizing using average latency and throughput
  Measurement window: 5000 msec
  Using poisson distribution on request generation
  Using synchronous calls for inference

Request Rate: 500 inference requests per second
  Client: 
    Request count: 8779
    Avg send request rate: 437.15 infer/sec
    Throughput: 372.62 infer/sec
    Avg latency: 10006 usec (standard deviation 1103 usec)
    p50 latency: 10286 usec
    p90 latency: 10522 usec
    p95 latency: 10758 usec
    p99 latency: 13163 usec
    Avg HTTP time: 9998 usec (send/recv 984 usec + response wait 9014 usec)
  Server: 
    Inference count: 8780
    Execution count: 8780
    Successful request count: 8780
    Avg request latency: 7278 usec (overhead 18 usec + queue 4609 usec + compute input 105 usec + compute infer 2535 usec + compute output 10 usec)
Inferences/Second vs. Client Average Batch Latency
Request

In [96]:
perf_analyzer -u triton_server:8000  -m food_classifier_onnx -b 1 --shape IMAGE:3,224,224 \
    --request-distribution=poisson --request-rate-range 500 --concurrency-range 4

Usage: perf_analyzer [options]
==== SYNOPSIS ====
 
	--version 
	-m <model name>
	-x <model version>
	--bls-composing-models <string>
	--model-signature-name <model signature name>
	--service-kind <"triton"|"openai"|"tfserving"|"torchserve"|"triton_c_api">
	--endpoint <string>
	-v

I. MEASUREMENT PARAMETERS: 
	--async (-a)
	--sync
	--measurement-interval (-p) <measurement window (in msec)>
	--concurrency-range <start:end:step>
	--periodic-concurrency-range <start:end:step>
	--session-concurrency <session concurrency>
	--request-period <number of responses>
	--request-rate-range <start:end:step>
	--request-distribution <"poisson"|"constant">
	--request-intervals <path to file containing time intervals in microseconds>
	--serial-sequences
	--binary-search
	--num-of-sequences <number of concurrent sequences>
	--latency-threshold (-l) <latency threshold (in msec)>
	--max-threads <thread counts>
	--stability-percentage (-s) <deviation threshold for stable measurement (in percentage)>
	--max

: 99

In [97]:
perf_analyzer -u triton_server:8000  -m food_classifier_onnx -b 1 --shape IMAGE:3,224,224 \
    --request-distribution=poisson --request-rate-range 500 --async

*** Measurement Settings ***
  Batch size: 1
  Service Kind: TRITON
  Using "time_windows" mode for stabilization
  Stabilizing using average throughput
  Measurement window: 5000 msec
  Using poisson distribution on request generation
  Using asynchronous calls for inference

Request Rate: 500 inference requests per second
  Client: 
    Request count: 7629
    Avg send request rate: 510.34 infer/sec
    Throughput: 374.00 infer/sec
    Avg latency: 20344624 usec (standard deviation 20347 usec)
    p50 latency: 20293961 usec
    p90 latency: 22349032 usec
    p95 latency: 22599814 usec
    p99 latency: 22829916 usec
    Avg HTTP time: 20344738 usec (send/recv 13233 usec + response wait 20331505 usec)
  Server: 
    Inference count: 7628
    Execution count: 7628
    Successful request count: 7628
    Avg request latency: 20307088 usec (overhead 19 usec + queue 20304426 usec + compute input 112 usec + compute infer 2520 usec + compute output 10 usec)
Inferences/Second vs. Client Averag

In [98]:
perf_analyzer -u triton_server:8000  -m food_classifier_onnx -b 1 --shape IMAGE:3,224,224 \
    --request-distribution=poisson --request-rate-range 330 --async

*** Measurement Settings ***
  Batch size: 1
  Service Kind: TRITON
  Using "time_windows" mode for stabilization
  Stabilizing using average throughput
  Measurement window: 5000 msec
  Using poisson distribution on request generation
  Using asynchronous calls for inference

Request Rate: 330 inference requests per second
  Client: 
    Request count: 7381
    Avg send request rate: 335.79 infer/sec
    Throughput: 331.94 infer/sec
    Avg latency: 27890 usec (standard deviation 33067 usec)
    p50 latency: 14914 usec
    p90 latency: 84995 usec
    p95 latency: 115915 usec
    p99 latency: 136416 usec
    Avg HTTP time: 27616 usec (send/recv 519 usec + response wait 27097 usec)
  Server: 
    Inference count: 7381
    Execution count: 7381
    Successful request count: 7381
    Avg request latency: 24404 usec (overhead 17 usec + queue 21716 usec + compute input 85 usec + compute infer 2574 usec + compute output 11 usec)
Inferences/Second vs. Client Average Batch Latency
Request Rate

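For an M/D/1 queue with arrival rate $\lambda$, service rate $\mu$, and utilization $\rho = \lambda/\mu$, the expected waiting time in the queue is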
$$\frac{1}{2\mu}\cdot\frac{\rho}{1-\rho}$$
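
To get a feel for the magnitudes this formula predicts, here is a minimal sketch that evaluates it for a few utilization levels. It assumes a service rate of μ = 330 requests per second (the value used as `max_service_rate` in this experiment) and ρ = λ/μ; the function name `md1_expected_wait_sec` is hypothetical.

```python
def md1_expected_wait_sec(mu: float, rho: float) -> float:
    """Expected time in queue for M/D/1: (1 / (2*mu)) * (rho / (1 - rho))."""
    if not 0 <= rho < 1:
        raise ValueError("rho must be in [0, 1) for a stable queue")
    return (1.0 / (2.0 * mu)) * (rho / (1.0 - rho))


MU = 330.0  # assumed service rate, in requests per second

for rho in (0.2, 0.5, 0.9):
    wait_usec = md1_expected_wait_sec(MU, rho) * 1e6
    print(f"rho={rho:.1f}  lambda={rho * MU:.0f} req/s  expected wait = {wait_usec:.0f} usec")
```

Under these assumptions the predicted wait grows from roughly 0.4 ms at ρ = 0.2 to roughly 13.6 ms at ρ = 0.9, so the queue delay is expected to rise sharply as the request rate approaches the service rate.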

In [116]:
rho_val=(0.2 0.9)
max_service_rate=330
step_size=$(awk -v d="$max_service_rate" 'BEGIN { printf "%.2f", 0.1 * d }')
request_arg=""
for rho in "${rho_val[@]}"; do
    request_rate=$(awk -v v="$rho" -v d="$max_service_rate" 'BEGIN { printf "%.2f", v * d }')
    echo $request_rate
    request_arg+="$request_rate:"
done

request_arg+="$step_size"
echo $request_arg

66.00
297.00
66.00:297.00:33.00


In [119]:
perf_analyzer -u triton_server:8000  -m food_classifier_onnx -b 1 --shape IMAGE:3,224,224 \
    --request-distribution=poisson --request-rate-range $request_arg --async -f result_test.csv

*** Measurement Settings ***
  Batch size: 1
  Service Kind: TRITON
  Using "time_windows" mode for stabilization
  Stabilizing using average throughput
  Measurement window: 5000 msec
  Latency limit: 0 msec
  Request Rate limit: 297 requests per seconds
  Using poisson distribution on request generation
  Using asynchronous calls for inference

Request Rate: 66 inference requests per second
  Client: 
    Request count: 1177
    Throughput: 63.4895 infer/sec
    Avg latency: 11440 usec (standard deviation 3672 usec)
    p50 latency: 9765 usec
    p90 latency: 16339 usec
    p95 latency: 18821 usec
    p99 latency: 25492 usec
    Avg HTTP time: 11412 usec (send/recv 1107 usec + response wait 10305 usec)
  Server: 
    Inference count: 1177
    Execution count: 1177
    Successful request count: 1177
    Avg request latency: 8211 usec (overhead 42 usec + queue 2161 usec + compute input 188 usec + compute infer 5792 usec + compute output 26 usec)
Request Rate: 99 inference requests per 

In [112]:
perf_analyzer -u triton_server:8000  -m food_classifier_onnx -b 1 --shape IMAGE:3,224,224 \
    --request-distribution=poisson --request-rate-range 33:297:33 --async -f result_test.csv

*** Measurement Settings ***
  Batch size: 1
  Service Kind: TRITON
  Using "time_windows" mode for stabilization
  Stabilizing using average throughput
  Measurement window: 5000 msec
  Latency limit: 0 msec
  Request Rate limit: 297 requests per seconds
  Using poisson distribution on request generation
  Using asynchronous calls for inference

Request Rate: 33 inference requests per second
Failed to obtain stable measurement.
Failed to obtain stable measurement.



: 2

In [118]:
perf_analyzer -u triton_server:8000  -m food_classifier_onnx -b 1 --shape IMAGE:3,224,224 \
    --request-distribution=poisson --request-rate-range 66 --async -f result_test.csv

*** Measurement Settings ***
  Batch size: 1
  Service Kind: TRITON
  Using "time_windows" mode for stabilization
  Stabilizing using average throughput
  Measurement window: 5000 msec
  Using poisson distribution on request generation
  Using asynchronous calls for inference

Request Rate: 66 inference requests per second
  Client: 
    Request count: 1181
    Throughput: 63.4872 infer/sec
    Avg latency: 11337 usec (standard deviation 3668 usec)
    p50 latency: 9671 usec
    p90 latency: 16294 usec
    p95 latency: 18700 usec
    p99 latency: 25029 usec
    Avg HTTP time: 11287 usec (send/recv 1117 usec + response wait 10170 usec)
  Server: 
    Inference count: 1181
    Execution count: 1181
    Successful request count: 1181
    Avg request latency: 8078 usec (overhead 42 usec + queue 2090 usec + compute input 187 usec + compute infer 5732 usec + compute output 26 usec)
Inferences/Second vs. Client Average Batch Latency
Request Rate: 66, throughput: 63.4872 infer/sec, latency 113

In [None]:
perf_analyzer -u triton_server:8000  -m food_classifier_onnx -b 1 --shape IMAGE:3,224,224 \
    --request-distribution=poisson --request-rate-range 100 --async -f result_test.csv

## Batching Experiment

Earlier, we noted that our model can achieve higher throughput with low latency by performing inference on batches of input samples, instead of individual samples. However, our client sends requests with individual samples.

To improve performance, we can ask the Triton server to batch incoming requests whenever possible, and run them through the model together instead of one at a time. In other words, if the server is ready to handle the next request, and it finds four requests waiting in the queue, it should serve those four as a batch instead of just taking the next request in line.
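
The idea can be illustrated with a toy queue. This is only a sketch of the batching concept, not Triton's actual scheduler (which also takes preferred batch sizes and a maximum queue delay into account); `take_batch` and the request names are hypothetical.

```python
from collections import deque


def take_batch(queue: deque, max_batch_size: int) -> list:
    """Remove and return up to max_batch_size waiting requests, oldest first."""
    batch = []
    while queue and len(batch) < max_batch_size:
        batch.append(queue.popleft())
    return batch


# Four requests are waiting when the model instance becomes free.
pending = deque(["req-1", "req-2", "req-3", "req-4"])
batch = take_batch(pending, max_batch_size=8)
print(len(batch), "requests served together;", len(pending), "left in the queue")
```

Without batching, the same four requests would need four separate executions of the model, each waiting for the previous one to finish.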

Let’s edit the model configuration:

``` bash
# runs on node-serve-system
nano ~/summer2025/models/food_classifier_onnx/config.pbtxt
```

and at the end, add

    dynamic_batching {
      preferred_batch_size: [4, 6, 8, 10]
      max_queue_delay_microseconds: 100
    }

Save the file (use Ctrl+O then Enter, then Ctrl+X).

Re-build the container image with this change:

``` bash
# runs on node-serve-system
docker compose -f ~/exp-chi/docker/docker/docker-compose.yaml build triton_server
```

and then bring the server back up:

``` bash
# runs on node-serve-system
docker compose -f ~/exp-chi/docker/docker/docker-compose.yaml up triton_server --force-recreate -d
```

and use

``` bash
# runs on node-serve-system
docker logs triton_server
```

to make sure the server comes up and is ready.

Before we benchmark this service again, let’s get some pre-benchmark stats about how many requests have been served, broken down by batch size. (If you’ve just restarted the server, it would be zero!)

In [154]:
# runs inside Jupyter container
curl http://triton_server:8000/v2/models/food_classifier_onnx/versions/1/stats 

{"model_stats":[{"name":"food_classifier_onnx","version":"1","last_inference":0,"inference_count":0,"execution_count":0,"inference_stats":{"success":{"count":0,"ns":0},"fail":{"count":0,"ns":0},"queue":{"count":0,"ns":0},"compute_input":{"count":0,"ns":0},"compute_infer":{"count":0,"ns":0},"compute_output":{"count":0,"ns":0},"cache_hit":{"count":0,"ns":0},"cache_miss":{"count":0,"ns":0}},"response_stats":{},"batch_stats":[],"memory_usage":[]}]}


Then let's run a benchmark. I will run benchmarks at a constant request rate of 100 requests per second (constant distribution) for client batch sizes {1, 8, 10, 12, 16}, while the model's configuration has `max_batch_size` set to 16.
I will then change `max_batch_size` back to 1 and rerun a couple of benchmarks. The server should then refuse any request with a batch size greater than one.

Note: I reran the first benchmarks after changing the configuration again, to save the results in CSV files.
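
When reading the results of these runs, keep in mind that `-b` sets how many samples each request carries, so the offered load in inferences per second is the request rate multiplied by the batch size. The sketch below compares that offered load with the throughput reported by `perf_analyzer` in this section for `-b 1`, `-b 8`, and `-b 16`; the function name `offered_load` is an illustrative choice.

```python
def offered_load(request_rate: float, batch_size: int) -> float:
    """Inferences per second requested: requests/sec times samples per request."""
    return request_rate * batch_size


REQUEST_RATE = 100  # requests per second, as in the benchmark commands

# batch size -> throughput (infer/sec) reported by perf_analyzer in this section
measured = {1: 100.001, 8: 800.134, 16: 1315.81}

for batch_size, throughput in measured.items():
    offered = offered_load(REQUEST_RATE, batch_size)
    print(f"-b {batch_size}: offered {offered:.0f} infer/sec, "
          f"measured {throughput:.0f} ({throughput / offered:.0%} of offered)")
```

For `-b 1` and `-b 8` the server keeps up with the offered load. For `-b 16` it completes fewer inferences than are requested, so requests pile up in the queue, which is consistent with the multi-second queue time reported for that run.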

In [155]:
perf_analyzer -u triton_server:8000  -m food_classifier_onnx -b 1 --shape IMAGE:3,224,224 \
    --request-distribution=constant --request-rate-range 100 --async -f batch1.csv

*** Measurement Settings ***
  Batch size: 1
  Service Kind: TRITON
  Using "time_windows" mode for stabilization
  Stabilizing using average throughput
  Measurement window: 5000 msec
  Using uniform distribution on request generation
  Using asynchronous calls for inference

Request Rate: 100 inference requests per second
  Client: 
    Request count: 1905
    Throughput: 100.001 infer/sec
    Avg latency: 9454 usec (standard deviation 2274 usec)
    p50 latency: 9490 usec
    p90 latency: 10029 usec
    p95 latency: 10289 usec
    p99 latency: 12677 usec
    Avg HTTP time: 9341 usec (send/recv 491 usec + response wait 8850 usec)
  Server: 
    Inference count: 1905
    Execution count: 1905
    Successful request count: 1905
    Avg request latency: 6370 usec (overhead 40 usec + queue 137 usec + compute input 156 usec + compute infer 6011 usec + compute output 25 usec)
Inferences/Second vs. Client Average Batch Latency
Request Rate: 100, throughput: 100.001 infer/sec, latency 9454 u

Now get the statistics again, broken down by batch size:

In [156]:
# runs inside Jupyter container
curl http://triton_server:8000/v2/models/food_classifier_onnx/versions/1/stats

{"model_stats":[{"name":"food_classifier_onnx","version":"1","last_inference":1750104654444,"inference_count":3878,"execution_count":3878,"inference_stats":{"success":{"count":3878,"ns":2822039192172},"fail":{"count":0,"ns":0},"queue":{"count":3878,"ns":2794667220082},"compute_input":{"count":3878,"ns":575425899},"compute_infer":{"count":3878,"ns":26580321067},"compute_output":{"count":3878,"ns":84801843},"cache_hit":{"count":0,"ns":0},"cache_miss":{"count":0,"ns":0}},"response_stats":{},"batch_stats":[{"batch_size":1,"compute_input":{"count":3878,"ns":575425899},"compute_infer":{"count":3878,"ns":26580321067},"compute_output":{"count":3878,"ns":84801843}}],"memory_usage":[]}]}


For the run above, `max_batch_size` was set to 1 in the configuration. It is now changed to 16, so let's run the benchmark again with larger batches.

In [157]:
perf_analyzer -u triton_server:8000  -m food_classifier_onnx -b 8 --shape IMAGE:3,224,224 \
    --request-distribution=constant --request-rate-range 100 --async -f batch8.csv

*** Measurement Settings ***
  Batch size: 8
  Service Kind: TRITON
  Using "time_windows" mode for stabilization
  Stabilizing using average throughput
  Measurement window: 5000 msec
  Using uniform distribution on request generation
  Using asynchronous calls for inference

Request Rate: 100 inference requests per second
  Client: 
    Request count: 1988
    Throughput: 800.134 infer/sec
    Avg latency: 13301 usec (standard deviation 4342 usec)
    p50 latency: 12626 usec
    p90 latency: 13868 usec
    p95 latency: 14726 usec
    p99 latency: 29348 usec
    Avg HTTP time: 13106 usec (send/recv 4005 usec + response wait 9101 usec)
  Server: 
    Inference count: 15904
    Execution count: 1988
    Successful request count: 1988
    Avg request latency: 6820 usec (overhead 22 usec + queue 200 usec + compute input 572 usec + compute infer 6010 usec + compute output 15 usec)
Inferences/Second vs. Client Average Batch Latency
Request Rate: 100, throughput: 800.134 infer/sec, latency 1

In [158]:
# runs inside Jupyter container
curl http://triton_server:8000/v2/models/food_classifier_onnx/versions/1/stats

{"model_stats":[{"name":"food_classifier_onnx","version":"1","last_inference":1750104778051,"inference_count":55494,"execution_count":10330,"inference_stats":{"success":{"count":10330,"ns":21019163113344},"fail":{"count":0,"ns":0},"queue":{"count":10330,"ns":20937878192070},"compute_input":{"count":10330,"ns":4276098761},"compute_infer":{"count":10330,"ns":76586698802},"compute_output":{"count":10330,"ns":168676035},"cache_hit":{"count":0,"ns":0},"cache_miss":{"count":0,"ns":0}},"response_stats":{},"batch_stats":[{"batch_size":1,"compute_input":{"count":3878,"ns":575425899},"compute_infer":{"count":3878,"ns":26580321067},"compute_output":{"count":3878,"ns":84801843}},{"batch_size":8,"compute_input":{"count":6452,"ns":3700672862},"compute_infer":{"count":6452,"ns":50006377735},"compute_output":{"count":6452,"ns":83874192}}],"memory_usage":[]}]}
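
The raw statistics are easier to interpret as an average time per execution and per sample. The following sketch does that for the `compute_infer` entries, using a trimmed copy of the `batch_stats` field from the response above. The helper name `infer_time_by_batch_size` is hypothetical; with the full response you would pass `response["model_stats"][0]` instead of the trimmed dictionary.

```python
import json

# Trimmed copy of the "batch_stats" field from the response above.
stats_json = """
{"batch_stats": [
  {"batch_size": 1, "compute_infer": {"count": 3878, "ns": 26580321067}},
  {"batch_size": 8, "compute_infer": {"count": 6452, "ns": 50006377735}}
]}
"""


def infer_time_by_batch_size(stats: dict) -> dict:
    """Map batch size -> (usec per execution, usec per sample) for compute_infer."""
    result = {}
    for entry in stats["batch_stats"]:
        count = entry["compute_infer"]["count"]
        if count == 0:
            continue  # no executions recorded at this batch size
        per_execution_usec = entry["compute_infer"]["ns"] / count / 1000
        per_sample_usec = per_execution_usec / entry["batch_size"]
        result[entry["batch_size"]] = (per_execution_usec, per_sample_usec)
    return result


times = infer_time_by_batch_size(json.loads(stats_json))
for size, (per_exec, per_sample) in times.items():
    print(f"batch size {size}: {per_exec:.0f} usec per execution, "
          f"{per_sample:.0f} usec per sample")
```

In this data, an execution with 8 samples takes only slightly longer than an execution with 1 sample, so the inference time per sample is roughly seven times lower. Note that these counters are cumulative, so they include every run since the server started.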


In [159]:
perf_analyzer -u triton_server:8000  -m food_classifier_onnx -b 16 --shape IMAGE:3,224,224 \
    --request-distribution=constant --request-rate-range 100 --async -f batch16.csv

*** Measurement Settings ***
  Batch size: 16
  Service Kind: TRITON
  Using "time_windows" mode for stabilization
  Stabilizing using average throughput
  Measurement window: 5000 msec
  Using uniform distribution on request generation
  Using asynchronous calls for inference

Request Rate: 100 inference requests per second
  Client: 
    Request count: 1573
    Throughput: 1315.81 infer/sec
    Avg latency: 11139137 usec (standard deviation 72627 usec)
    p50 latency: 6505055 usec
    p90 latency: 18334846 usec
    p95 latency: 18869078 usec
    p99 latency: 19328583 usec
    Avg HTTP time: 11148685 usec (send/recv 4189141 usec + response wait 6959544 usec)
  Server: 
    Inference count: 25168
    Execution count: 1573
    Successful request count: 1573
    Avg request latency: 5502057 usec (overhead 32 usec + queue 5489924 usec + compute input 1233 usec + compute infer 10840 usec + compute output 28 usec)
Inferences/Second vs. Client Average Batch Latency
Request Rate: 100, throug

In [160]:
# runs inside Jupyter container
curl http://triton_server:8000/v2/models/food_classifier_onnx/versions/1/stats

{"model_stats":[{"name":"food_classifier_onnx","version":"1","last_inference":1750104938614,"inference_count":159062,"execution_count":16803,"inference_stats":{"success":{"count":16803,"ns":46389981414662},"fail":{"count":0,"ns":0},"queue":{"count":16803,"ns":46227405651444},"compute_input":{"count":16803,"ns":13195332776},"compute_infer":{"count":16803,"ns":148603962452},"compute_output":{"count":16803,"ns":333853326},"cache_hit":{"count":0,"ns":0},"cache_miss":{"count":0,"ns":0}},"response_stats":{},"batch_stats":[{"batch_size":1,"compute_input":{"count":3878,"ns":575425899},"compute_infer":{"count":3878,"ns":26580321067},"compute_output":{"count":3878,"ns":84801843}},{"batch_size":8,"compute_input":{"count":6452,"ns":3700672862},"compute_infer":{"count":6452,"ns":50006377735},"compute_output":{"count":6452,"ns":83874192}},{"batch_size":16,"compute_input":{"count":6473,"ns":8919234015},"compute_infer":{"count":6473,"ns":72017263650},"compute_output":{"count":6473,"ns":165177291}}],"m

In [161]:
perf_analyzer -u triton_server:8000  -m food_classifier_onnx -b 12 --shape IMAGE:3,224,224 \
    --request-distribution=constant --request-rate-range 100 --async -f batch12.csv

*** Measurement Settings ***
  Batch size: 12
  Service Kind: TRITON
  Using "time_windows" mode for stabilization
  Stabilizing using average throughput
  Measurement window: 5000 msec
  Using uniform distribution on request generation
  Using asynchronous calls for inference

Request Rate: 100 inference requests per second
  Client: 
    Request count: 2001
    Throughput: 1266.01 infer/sec
    Avg latency: 11482101 usec (standard deviation 78780 usec)
    p50 latency: 15692173 usec
    p90 latency: 21291909 usec
    p95 latency: 21462857 usec
    p99 latency: 21558780 usec
    Avg HTTP time: 11474427 usec (send/recv 6894382 usec + response wait 4580045 usec)
  Server: 
    Inference count: 24036
    Execution count: 2003
    Successful request count: 2003
    Avg request latency: 977366 usec (overhead 29 usec + queue 967909 usec + compute input 1083 usec + compute infer 8323 usec + compute output 21 usec)
Inferences/Second vs. Client Average Batch Latency
Request Rate: 100, throughp

In [162]:
perf_analyzer -u triton_server:8000  -m food_classifier_onnx -b 10 --shape IMAGE:3,224,224 \
    --request-distribution=constant --request-rate-range 100 --async -f batch10.csv

*** Measurement Settings ***
  Batch size: 10
  Service Kind: TRITON
  Using "time_windows" mode for stabilization
  Stabilizing using average throughput
  Measurement window: 5000 msec
  Using uniform distribution on request generation
  Using asynchronous calls for inference

Request Rate: 100 inference requests per second
  Client: 
    Request count: 1926
    Throughput: 1000.21 infer/sec
    Avg latency: 16407 usec (standard deviation 2974 usec)
    p50 latency: 15821 usec
    p90 latency: 16753 usec
    p95 latency: 19954 usec
    p99 latency: 32344 usec
    Avg HTTP time: 16350 usec (send/recv 4309 usec + response wait 12041 usec)
  Server: 
    Inference count: 19250
    Execution count: 1925
    Successful request count: 1925
    Avg request latency: 8688 usec (overhead 22 usec + queue 459 usec + compute input 822 usec + compute infer 7368 usec + compute output 15 usec)
Inferences/Second vs. Client Average Batch Latency
Request Rate: 100, throughput: 1000.21 infer/sec, latency

In [167]:
perf_analyzer -u triton_server:8000  -m food_classifier_onnx -b 14 --shape IMAGE:3,224,224 \
    --request-distribution=constant --request-rate-range 100 --async -f batch14.csv

*** Measurement Settings ***
  Batch size: 14
  Service Kind: TRITON
  Using "time_windows" mode for stabilization
  Stabilizing using average throughput
  Measurement window: 5000 msec
  Using uniform distribution on request generation
  Using asynchronous calls for inference

Request Rate: 100 inference requests per second
  Client: 
    Request count: 1778
    Throughput: 1313.12 infer/sec
    Avg latency: 4093885 usec (standard deviation 80487 usec)
    p50 latency: 4960687 usec
    p90 latency: 6151674 usec
    p95 latency: 6185497 usec
    p99 latency: 6221961 usec
    Avg HTTP time: 4096567 usec (send/recv 2113403 usec + response wait 1983164 usec)
  Server: 
    Inference count: 24906
    Execution count: 1779
    Successful request count: 1779
    Avg request latency: 1260688 usec (overhead 26 usec + queue 1250052 usec + compute input 1195 usec + compute infer 9394 usec + compute output 20 usec)
Inferences/Second vs. Client Average Batch Latency
Request Rate: 100, throughput: 

In [166]:
perf_analyzer -u triton_server:8000  -m food_classifier_onnx -b 14 --shape IMAGE:3,224,224 \
    --request-distribution=constant --request-rate-range 100 --async -f batch14.csv

*** Measurement Settings ***
  Batch size: 14
  Service Kind: TRITON
  Using "time_windows" mode for stabilization
  Stabilizing using average throughput
  Measurement window: 5000 msec
  Using uniform distribution on request generation
  Using asynchronous calls for inference

Request Rate: 100 inference requests per second
Failed to obtain stable measurement.
Failed to obtain stable measurement.



: 2

In [164]:
perf_analyzer -u triton_server:8000  -m food_classifier_onnx -b 4 --shape IMAGE:3,224,224 \
    --request-distribution=constant --request-rate-range 100 --async -f batch4.csv

*** Measurement Settings ***
  Batch size: 4
  Service Kind: TRITON
  Using "time_windows" mode for stabilization
  Stabilizing using average throughput
  Measurement window: 5000 msec
  Using uniform distribution on request generation
  Using asynchronous calls for inference

Request Rate: 100 inference requests per second
  Client: 
    Request count: 1958
    Throughput: 399.993 infer/sec
    Avg latency: 13504 usec (standard deviation 4333 usec)
    p50 latency: 13197 usec
    p90 latency: 14326 usec
    p95 latency: 15646 usec
    p99 latency: 25560 usec
    Avg HTTP time: 13333 usec (send/recv 2783 usec + response wait 10550 usec)
  Server: 
    Inference count: 7832
    Execution count: 1958
    Successful request count: 1958
    Avg request latency: 6607 usec (overhead 38 usec + queue 241 usec + compute input 495 usec + compute infer 5808 usec + compute output 24 usec)
Inferences/Second vs. Client Average Batch Latency
Request Rate: 100, throughput: 399.993 infer/sec, latency 1

In [165]:
perf_analyzer -u triton_server:8000  -m food_classifier_onnx -b 6 --shape IMAGE:3,224,224 \
    --request-distribution=constant --request-rate-range 100 --async -f batch6.csv

*** Measurement Settings ***
  Batch size: 6
  Service Kind: TRITON
  Using "time_windows" mode for stabilization
  Stabilizing using average throughput
  Measurement window: 5000 msec
  Using uniform distribution on request generation
  Using asynchronous calls for inference

Request Rate: 100 inference requests per second
  Client: 
    Request count: 1911
    Throughput: 600.047 infer/sec
    Avg latency: 13628 usec (standard deviation 2321 usec)
    p50 latency: 13657 usec
    p90 latency: 14445 usec
    p95 latency: 14651 usec
    p99 latency: 19401 usec
    Avg HTTP time: 13517 usec (send/recv 2699 usec + response wait 10818 usec)
  Server: 
    Inference count: 11466
    Execution count: 1911
    Successful request count: 1911
    Avg request latency: 6483 usec (overhead 39 usec + queue 123 usec + compute input 695 usec + compute infer 5600 usec + compute output 26 usec)
Inferences/Second vs. Client Average Batch Latency
Request Rate: 100, throughput: 600.047 infer/sec, latency 
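One way to read the runs above is to compare queue time with total server-side request latency: when most of the latency is queueing, requests are arriving faster than the model can process them. The sketch below uses the `queue` and `Avg request latency` values from the runs whose output is complete (batch sizes 4, 6, 10, 12, 14 and 16). The 0.5 cutoff and the helper name `is_saturated` are assumptions chosen for illustration, not a rule from `perf_analyzer`.

```python
# (batch size, queue usec, avg server request latency usec), copied from the
# "Avg request latency" lines of the perf_analyzer runs above.
runs = [
    (4, 241, 6607),
    (6, 123, 6483),
    (10, 459, 8688),
    (12, 967909, 977366),
    (14, 1250052, 1260688),
    (16, 5489924, 5502057),
]

# Assumed cutoff: call a run saturated if over half its latency is queueing.
QUEUE_FRACTION_THRESHOLD = 0.5


def is_saturated(queue_usec, total_usec, threshold=QUEUE_FRACTION_THRESHOLD):
    """Return True when queue time dominates the server-side latency."""
    return queue_usec / total_usec > threshold


saturated = [batch for batch, queue, total in runs if is_saturated(queue, total)]

for batch, queue, total in runs:
    print(f"batch {batch:2d}: queue is {queue / total:6.1%} of server latency")
print("saturated batch sizes:", saturated)
```

By this measure, the runs at batch sizes 4, 6 and 10 keep up with 100 requests per second, while those at 12, 14 and 16 spend almost all of their server-side latency waiting in the queue.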

Next, we set `max_batch_size` back to 1 in `config.pbtxt` to see what happens when the server does not accept batches.

In [152]:
perf_analyzer -u triton_server:8000  -m food_classifier_onnx -b 1 --shape IMAGE:3,224,224 \
    --request-distribution=constant --request-rate-range 100 --async 

*** Measurement Settings ***
  Batch size: 1
  Service Kind: TRITON
  Using "time_windows" mode for stabilization
  Stabilizing using average throughput
  Measurement window: 5000 msec
  Using uniform distribution on request generation
  Using asynchronous calls for inference

Request Rate: 100 inference requests per second
  Client: 
    Request count: 1908
    Throughput: 99.9963 infer/sec
    Avg latency: 9849 usec (standard deviation 2974 usec)
    p50 latency: 9818 usec
    p90 latency: 10559 usec
    p95 latency: 11348 usec
    p99 latency: 12183 usec
    Avg HTTP time: 9688 usec (send/recv 597 usec + response wait 9091 usec)
  Server: 
    Inference count: 1908
    Execution count: 1908
    Successful request count: 1908
    Avg request latency: 6479 usec (overhead 42 usec + queue 143 usec + compute input 194 usec + compute infer 6073 usec + compute output 26 usec)
Inferences/Second vs. Client Average Batch Latency
Request Rate: 100, throughput: 99.9963 infer/sec, latency 9849 u

In [153]:
perf_analyzer -u triton_server:8000  -m food_classifier_onnx -b 8 --shape IMAGE:3,224,224 \
    --request-distribution=constant --request-rate-range 100 --async 

*** Measurement Settings ***
  Batch size: 8
  Service Kind: TRITON
  Using "time_windows" mode for stabilization
  Stabilizing using average throughput
  Measurement window: 5000 msec
  Using uniform distribution on request generation
  Using asynchronous calls for inference

Request Rate: 100 inference requests per second
Failed to retrieve results from inference request.
Thread [0] had error: [request id: 0] inference request batch-size must be <= 1 for 'food_classifier_onnx'

Thread [1] had error: [request id: 281474976710656] inference request batch-size must be <= 1 for 'food_classifier_onnx'

Thread [2] had error: [request id: 562949953421312] inference request batch-size must be <= 1 for 'food_classifier_onnx'

Thread [3] had error: [request id: 844424930131968] inference request batch-size must be <= 1 for 'food_classifier_onnx'




: 99

The error tells us that the server rejects requests with batch size 8, because the configuration only allows a maximum batch size of 1. There is no point in testing larger batches, since they would fail in the same way.
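To avoid starting runs that cannot succeed, we can check the candidate batch sizes against the configured limit first. This is a minimal sketch: the helper names are hypothetical, and it assumes that Triton's model configuration endpoint (`/v2/models/<model name>/config`) returns JSON with a `max_batch_size` field.

```python
import json
from urllib.request import urlopen


def fetch_max_batch_size(base_url, model_name):
    """Read max_batch_size from the model configuration endpoint.

    Assumes the server returns the model configuration as JSON.
    """
    with urlopen(f"{base_url}/v2/models/{model_name}/config") as resp:
        config = json.load(resp)
    return config.get("max_batch_size", 0)


def allowed_batch_sizes(candidates, max_batch_size):
    """Keep only batch sizes the server would accept.

    Mirrors the error above: request batch-size must be <= max_batch_size.
    """
    return [b for b in candidates if b <= max_batch_size]


# With the limit of 1 from the error message above, only batch size 1 remains.
print(allowed_batch_sizes([1, 4, 8, 16], max_batch_size=1))

# Against a running server (not executed here):
# limit = fetch_max_batch_size("http://triton_server:8000", "food_classifier_onnx")
# print(allowed_batch_sizes([1, 4, 8, 16], limit))
```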