Copyright (c) 2023 Qualcomm Innovation Center, Inc. All rights reserved. <br> SPDX-License-Identifier: BSD-3-Clause-Clear

**Takeaways:** Users will learn how to tune the Resnet50 model for best throughput and latency

**Before you start:** 
- There are some commands (folder locations etc) that will need to be updated in this notebook based on the platform and installation location. Some commands might need sudo prefix to run properly.
- The terms 'model' and 'network' are used interchangeably in this notebook. 
- The terms 'NSP' and 'AI compute core' are used interchangeably in this notebook.

**Last Verified Qualcomm Cloud AI Platform SDK and Apps SDK Version:** Platform SDK 1.10.0.193 and Apps SDK 1.10.0.193 

# <span style='color:Blue'> Performance Tuning on Cloud AI </span>

##  Pre-requisite reading 
New users on Cloud AI platforms are expected to go over the Cloud AI SoC architecture and the key compile/runtime parameters that determine performance. This is discussed in the Tune Performance section in the Inference workflow documentation. 


## Introduction 
This notebook is for beginners and will take the user through the workflow to achieve best throughput and latency on Cloud AI platforms for the Resnet50 model. 

Here is the workflow that will be demonstrated in this notebook. 

1. **Install required packages**: Begin by installing all the required packages.
2. **Import the model**: Download the Resnet50 model and export it to ONNX. 
3. **Device Health Check**: Query the device health using the qaic-util tool. 
4. **Identify best throughput configuration**: Use the Model Configurator tool to identify the configuration with the best throughput. 
5. **Identify least latency configuration**: Go over the key parameters to tweak for least latency.


# <span style='color:Blue'> 1. Install required packages </span>

We will install the required Python packages 

In [2]:
!pip install -r requirements.txt

Collecting onnx==1.12.0 (from -r requirements.txt (line 1))
  Downloading onnx-1.12.0.tar.gz (10.1 MB)
  Preparing metadata (setup.py) ... done
Collecting numpy==1.23.4 (from -r requirements.txt (line 3))
  Downloading numpy-1.23.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.1 MB)
Collecting torch==1.13.0 (from -r requirements.txt (line 5))
  Downloading torch-1.13.0-cp311-cp311-manylinux1_x86_64.whl (890.2 MB)
Collecting pillow==8.3.2 (from -r requirements.txt (line 6))
  Downloading Pillow-8.3.2.tar.gz (48.8 MB)

  Preparing metadata (pyproject.toml) ... done
ERROR: Cannot install -r requirements.txt (line 1), -r requirements.txt (line 4), numpy==1.23.4 and onnxruntime==1.15.1 because these package versions have conflicting dependencies.

The conflict is caused by:
    The user requested numpy==1.23.4
    onnx 1.12.0 depends on numpy>=1.16.6
    onnxruntime 1.15.1 depends on numpy>=1.24.2
    The user requested numpy==1.23.4
    onnx 1.12.0 depends on numpy>=1.16.6
    onnxruntime 1.15.0 depends on numpy>=1.24.2

To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict

ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/topics/dependency-resolution/#dealing-with-dependency-conflicts

[notice] A new release of pip is available: 23.1.2 ->

# <span style='color:Blue'>2. Download the model </span>

Download the pretrained Resnet50 model. 

In [8]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
!optimum-cli export onnx --model 'microsoft/resnet-50' resnet-50_onnx/ --opset 11 --task image-classification --width 224 --height 224 --num_channels 3

2023-08-12 23:02:38.877369: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2023-08-12 23:02:38.877413: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Framework not specified. Using pt to export to ONNX.
Could not find image processor class in the image processor config or the model config. Loading based on pattern matching with the model's feature extractor configuration.
Using framework PyTorch: 1.11.0+cu102
  if num_channels != self.num_channels:
Post-processing the exported models...
Validating models in subprocesses...
2023-08-12 23:02:45.720204: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
202

# <span style='color:Blue'> 3. Check device health </span>

The 'qaic-util' tool can be used to query the health of all the Cloud AI cards in the server. 

- "Status: Ready" indicates that the card is in good health. 
- "Status: Error" indicates that the card is not in good health; contact a system administrator to rectify the issue. 

In [1]:
!/opt/qti-aic/tools/qaic-util -q | grep "Status"

	Status:Ready
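
When a server has several cards, the status check can also be scripted. The following is a minimal sketch, assuming the status lines look like the `Status:Ready` line shown above; the helper names `parse_card_status` and `all_cards_ready` are hypothetical and not part of the SDK.

```python
import re


def parse_card_status(qaic_util_output):
    """Collect the value of every line that starts with 'Status' (assumed format)."""
    statuses = []
    for line in qaic_util_output.splitlines():
        match = re.match(r"\s*Status\s*:\s*(\S+)", line)
        if match:
            statuses.append(match.group(1))
    return statuses


def all_cards_ready(qaic_util_output):
    """True only if at least one card was reported and every card is Ready."""
    statuses = parse_card_status(qaic_util_output)
    return bool(statuses) and all(status == "Ready" for status in statuses)


# Sample text copied from the cell output above.
sample_output = "\tStatus:Ready\n"
print(all_cards_ready(sample_output))
```

On a host with the SDK installed, the text could come from `subprocess.run(["/opt/qti-aic/tools/qaic-util", "-q"], capture_output=True, text=True).stdout` instead of the sample string.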


# <span style='color:Blue'> 4. Identify the best throughput configuration </span>

Model Configurator is a Python script that finds the configuration of batch size, cores, etc. that maximizes throughput (inf/s) for a given model. Its inputs are the model and a search space that the tool iterates over.   

1. Use the optimal config (model configurator output) as an indicator of the best compile flags.
2. Compile with qaic-exec and run with qaic-runner using optimal config from step 1 to validate the performance. 

Key compile optimization flags are:
- **aic-num-cores** : Number of AI compute cores (NSPs) the model is compiled for
- **bs** : Batch size
- **mos** : Maximum output splitting; the number of AI compute cores across which an output channel (and its associated weights) is split
- **ols** : Overlap splitting; enables output splitting to improve core-level parallelism, e.g. between the tensor and vector units

Key runtime optimization flags are:
- **instances/activations** : Number of instances of the compiled binary that can be run, based on the number of AI compute cores on the card 
- **set-size** : Number of inferences per instance that can be queued up on the host. Pipelining inferences hides host-side overhead. CV models typically benefit from higher set sizes for higher throughput (at the cost of increased latency).  
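
To make the mapping from these parameters to command lines concrete, here is a minimal sketch that assembles `qaic-exec` and `qaic-runner` argument lists. It only uses flags that appear in the commands later in this notebook; the helper names are hypothetical, the symbol name `batch_size` is specific to this model, and the reading of `-a` as instances and `-S` as set-size is an assumption based on how those commands are used here.

```python
def build_compile_args(model_path, out_dir, cores, bs, mos, ols):
    """Assemble a qaic-exec argument list (flags taken from this notebook)."""
    return [
        "/opt/qti-aic/exec/qaic-exec",
        "-aic-hw",
        f"-m={model_path}",
        f"-onnx-define-symbol=batch_size,{bs}",  # symbol name is model specific
        f"-aic-num-cores={cores}",
        f"-mos={mos}",
        f"-ols={ols}",
        f"-aic-binary-dir={out_dir}",
        "-convert-to-fp16",
        "-compile-only",
    ]


def build_run_args(binary_dir, device_id, instances, set_size):
    """Assemble a qaic-runner argument list (assumes -a = instances, -S = set-size)."""
    return [
        "/opt/qti-aic/exec/qaic-runner",
        "--test-data", binary_dir,
        "-d", str(device_id),
        "-a", str(instances),
        "-S", str(set_size),
    ]


compile_args = build_compile_args("resnet-50_onnx/model.onnx", "./compiled_fp16",
                                  cores=2, bs=4, mos=1, ols=2)
run_args = build_run_args("./compiled_fp16", device_id=0, instances=7, set_size=10)
print(" ".join(compile_args))
print(" ".join(run_args))
```

Building the commands as lists keeps each flag separate, so they can be passed directly to `subprocess.run` without shell quoting.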

## Config Optimizer 
The optimized search is run on one or more searchable parameters (aic-num-cores, instances, batch-size, etc.). The search space is provided through a JSON configuration file. 
This table captures the key elements of the configuration file. 

| Key | Type | Description | Recommended value |
| :- | :- | :- | :- |
|"max_func_eval"|Integer|Maximum number of evaluations to perform for each initial point. Increase this number if the search does not converge.|200|
|"objective"|String|Search objective. Options are "maximize_inf_rate" or "minimize_latency".|Choose "maximize_inf_rate" for maximum throughput or "minimize_latency" for minimum latency|
|"params"|JSON object|Search range for each of the parameters (cores, mos, ols, etc.), given as min and max values|See table below|
|"initial_values"|List of JSON objects|List of initial values for the search parameters. A fresh search is initiated from each of these points and the results are returned. Initial values must be picked from within the search range defined in "params".|Provide multiple initial values as shown in the example JSON|
|"static_params"|JSON object|Optional static values to be used for searchable parameters that have been excluded from the search space| |


### Parameter Range


| Parameter | Recommended search range | Valid range | Comments |
| :- | :- | :- | :- |
|cores|1 to number of NSPs on device|1 to number of NSPs on device| |
|mos|1 to number of NSPs on device|1 to number of NSPs on device| |
|ols|1-8|Integers > 0| |
|batch-size (bs)|1-16|Integers > 0|Min and max values must be powers of 2. The max value depends on the model.|
|instances|1 to number of NSPs on device|1 to number of NSPs on device| |

In [17]:
# Let's assume the maximum number of NSPs on the device is 14. 
!cat resnet_base_dopt_throughput.json

{
  "max_func_eval": 200,
  "objective": "maximize_inf_rate",
  "params": {
    "cores": {
      "min": 1,
      "max": 14
    },
    "mos": {
      "min": 1,
      "max": 8
    },
    "ols": {
      "min": 1,
      "max": 8
    },
    "bs": {
      "min": 1,
      "max": 16
    },
    "instances": {
      "min": 1,
      "max": 14
    }
  },
  "initial_values": [
    {
      "cores": 1,
      "mos": 1,
      "ols": 1,
      "bs": 1,
      "instances": 14
    },
    {
      "cores": 2,
      "mos": 1,
      "ols": 1,
      "bs": 1,
      "instances": 7
    },
    {
      "cores": 4,
      "mos": 1,
      "ols": 1,
      "bs": 1,
      "instances": 3
    },
    {
      "cores": 7,
      "mos": 1,
      "ols": 1,
      "bs": 1,
      "instances": 2
    },
    {
      "cores": 14,
      "mos": 1,
      "ols": 1,
      "bs": 1,
      "instances": 1
    }
  ]
}
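
Two rules from the tables above are easy to get wrong when editing this file by hand: every initial value must lie inside the range given in "params", and the batch-size min and max must be powers of 2. The following is a minimal sketch of a pre-flight check; `validate_search_config` is a hypothetical helper, not part of the Model Configurator, and the inline config is a shortened stand-in for the file above.

```python
def is_power_of_two(n):
    return n > 0 and (n & (n - 1)) == 0


def validate_search_config(config):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    params = config["params"]
    for index, point in enumerate(config.get("initial_values", [])):
        for name, value in point.items():
            if name not in params:
                problems.append(f"initial_values[{index}]: '{name}' has no range in params")
                continue
            low, high = params[name]["min"], params[name]["max"]
            if not low <= value <= high:
                problems.append(
                    f"initial_values[{index}]: {name}={value} outside [{low}, {high}]"
                )
    if "bs" in params:
        for bound in ("min", "max"):
            if not is_power_of_two(params["bs"][bound]):
                problems.append(f"bs {bound}={params['bs'][bound]} is not a power of 2")
    return problems


# Shortened stand-in for resnet_base_dopt_throughput.json.
example_config = {
    "max_func_eval": 200,
    "objective": "maximize_inf_rate",
    "params": {
        "cores": {"min": 1, "max": 14},
        "mos": {"min": 1, "max": 8},
        "ols": {"min": 1, "max": 8},
        "bs": {"min": 1, "max": 16},
        "instances": {"min": 1, "max": 14},
    },
    "initial_values": [
        {"cores": 1, "mos": 1, "ols": 1, "bs": 1, "instances": 14},
        {"cores": 14, "mos": 1, "ols": 1, "bs": 1, "instances": 1},
    ],
}
print(validate_search_config(example_config))
```

To check the real file, load it with `json.load(open("resnet_base_dopt_throughput.json"))` and pass the result to the same function.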

In [29]:
# Dump the model inputs and outputs 

import onnx
model = onnx.load("resnet-50_onnx/model.onnx")
for _input in model.graph.input:
    print(_input)
for _output in model.graph.output:
    print(_output)

name: "pixel_values"
type {
  tensor_type {
    elem_type: 1
    shape {
      dim {
        dim_param: "batch_size"
      }
      dim {
        dim_param: "num_channels"
      }
      dim {
        dim_param: "height"
      }
      dim {
        dim_param: "width"
      }
    }
  }
}

name: "logits"
type {
  tensor_type {
    elem_type: 1
    shape {
      dim {
        dim_param: "batch_size"
      }
      dim {
        dim_value: 1000
      }
    }
  }
}
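
Every `dim_param` printed above (`batch_size`, `num_channels`, `height`, `width`) is a symbolic dimension that has to be bound before compilation. The following is a minimal sketch of turning those names into the `-onnx-define-symbol` flags used by the commands in this notebook; `symbol_flags` is a hypothetical helper, and the flag spellings are copied from those commands.

```python
def symbol_flags(fixed_dims, batch_symbol=None):
    """Build ONNX symbol flags.

    fixed_dims   -- mapping of symbolic dimension name to its concrete value
    batch_symbol -- name of the batch dimension when the tool sweeps batch size itself
    """
    flags = []
    if batch_symbol is not None:
        # Form used with model_configurator.py, which searches over batch size.
        flags.append(f"-onnx-define-symbol-batch-size={batch_symbol}")
    for name, value in fixed_dims.items():
        flags.append(f"-onnx-define-symbol={name},{value}")
    return flags


image_dims = {"num_channels": 3, "height": 224, "width": 224}

# Model Configurator: batch size is searched, so only its symbol name is given.
print(symbol_flags(image_dims, batch_symbol="batch_size"))

# qaic-exec: batch size is fixed like any other dimension.
print(symbol_flags({"batch_size": 4, **image_dims}))
```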



In [38]:
# Run model_configurator tool to identify the best throughput configuration. 
# For CV networks, higher set sizes are preferred for higher throughput. Default set-size is 10. 
!python3 /opt/qti-aic/scripts/qaic-model-configurator/model_configurator.py resnet-50_onnx/model.onnx onnx \
-onnx-define-symbol-batch-size=batch_size \
-onnx-define-symbol=num_channels,3 -onnx-define-symbol=height,224  -onnx-define-symbol=width,224 \
-multicast-weights \
-optimized-config-search=resnet_base_dopt_throughput.json -max-compilation-threads=16 -time=5 \
-convert-to-fp16 -device-id=0 

  "class": algorithms.Blowfish,
2023-08-13 00:06:08.943 - [INFO]: Starting /opt/qti-aic/scripts/qaic-model-configurator/model_configurator.py resnet-50_onnx/model.onnx onnx -onnx-define-symbol-batch-size=batch_size -onnx-define-symbol=num_channels,3 -onnx-define-symbol=height,224 -onnx-define-symbol=width,224 -multicast-weights -optimized-config-search=resnet_base_dopt_throughput.json -max-compilation-threads=16 -time=5 -convert-to-fp16 -device-id=0
2023-08-13 00:06:08.944 - [INFO]: Model Name: model.onnx
2023-08-13 00:06:08.945 - [INFO]: Hostname: ac120r4-08-giga, Physical Cores: 32, Logical Cores: 32, Memory: 125.8 GB
2023-08-13 00:06:09.083 - [INFO]: Running optimized search
[2023-08-13 00:06:09.122] [[32minfo[m] Compiling model with compiler parameters: [(cores=1, mos=[1], ols=1, batchSize=1)]
[2023-08-13 00:06:42.576] [[32minfo[m] Running model at path model_configurator_output/compiled_models/qpc-cores-1-mos-1-ols-1-bs-1-output with runnerParams (instances=14)
[2023-08-13 00:

[2023-08-13 00:16:59.433] [[32minfo[m] Model at path model_configurator_output/compiled_models/qpc-cores-1-mos-1-ols-1-bs-1-output with runnerParams (instances=7) running on device ID 0
[2023-08-13 00:17:05.038] [[32minfo[m] Running model at path model_configurator_output/compiled_models/qpc-cores-2-mos-1-ols-1-bs-1-output with runnerParams (instances=6)
[2023-08-13 00:17:05.403] [[32minfo[m] Model at path model_configurator_output/compiled_models/qpc-cores-2-mos-1-ols-1-bs-1-output with runnerParams (instances=6) running on device ID 0
[2023-08-13 00:17:11.001] [[32minfo[m] Running model at path model_configurator_output/compiled_models/qpc-cores-2-mos-2-ols-1-bs-1-output with runnerParams (instances=7)
[2023-08-13 00:17:11.439] [[32minfo[m] Model at path model_configurator_output/compiled_models/qpc-cores-2-mos-2-ols-1-bs-1-output with runnerParams (instances=7) running on device ID 0
[2023-08-13 00:17:16.554] [[32minfo[m] Running model at path model_configurator_output/c

[2023-08-13 00:24:16.892] [[32minfo[m] Running model at path model_configurator_output/compiled_models/qpc-cores-2-mos-1-ols-2-bs-4-output with runnerParams (instances=6)
[2023-08-13 00:24:17.546] [[32minfo[m] Model at path model_configurator_output/compiled_models/qpc-cores-2-mos-1-ols-2-bs-4-output with runnerParams (instances=6) running on device ID 0
[2023-08-13 00:24:23.144] [[32minfo[m] Running model at path model_configurator_output/compiled_models/qpc-cores-2-mos-2-ols-2-bs-4-output with runnerParams (instances=7)
[2023-08-13 00:24:23.844] [[32minfo[m] Model at path model_configurator_output/compiled_models/qpc-cores-2-mos-2-ols-2-bs-4-output with runnerParams (instances=7) running on device ID 0
[2023-08-13 00:24:29.464] [[32minfo[m] Running model at path model_configurator_output/compiled_models/qpc-cores-2-mos-1-ols-2-bs-8-output with runnerParams (instances=7)
[2023-08-13 00:24:30.599] [[32minfo[m] Model at path model_configurator_output/compiled_models/qpc-core

[2023-08-13 00:30:51.884] [[32minfo[m] Model at path model_configurator_output/compiled_models/qpc-cores-3-mos-1-ols-1-bs-2-output with runnerParams (instances=3) running on device ID 0
[2023-08-13 00:30:57.447] [[32minfo[m] Running model at path model_configurator_output/compiled_models/qpc-cores-2-mos-1-ols-1-bs-16-output with runnerParams (instances=3)
[2023-08-13 00:30:59.119] [[32minfo[m] Model at path model_configurator_output/compiled_models/qpc-cores-2-mos-1-ols-1-bs-16-output with runnerParams (instances=3) running on device ID 0
[2023-08-13 00:31:04.689] [[32minfo[m] Compiling model with compiler parameters: [(cores=3, mos=[1], ols=1, batchSize=8), (cores=3, mos=[1], ols=2, batchSize=4), (cores=3, mos=[2], ols=1, batchSize=4), (cores=4, mos=[1], ols=1, batchSize=4)]
[2023-08-13 00:32:14.425] [[32minfo[m] Running model at path model_configurator_output/compiled_models/qpc-cores-2-mos-1-ols-1-bs-4-output with runnerParams (instances=2)
[2023-08-13 00:32:14.921] [[32m

[2023-08-13 00:39:36.883] [[32minfo[m] Model at path model_configurator_output/compiled_models/qpc-cores-5-mos-1-ols-1-bs-8-output with runnerParams (instances=1) running on device ID 0
[2023-08-13 00:39:42.419] [[32minfo[m] Running model at path model_configurator_output/compiled_models/qpc-cores-6-mos-1-ols-1-bs-8-output with runnerParams (instances=2)
[2023-08-13 00:39:43.295] [[32minfo[m] Model at path model_configurator_output/compiled_models/qpc-cores-6-mos-1-ols-1-bs-8-output with runnerParams (instances=2) running on device ID 0
[2023-08-13 00:39:48.858] [[32minfo[m] Running model at path model_configurator_output/compiled_models/qpc-cores-5-mos-2-ols-1-bs-8-output with runnerParams (instances=2)
[2023-08-13 00:39:49.769] [[32minfo[m] Model at path model_configurator_output/compiled_models/qpc-cores-5-mos-2-ols-1-bs-8-output with runnerParams (instances=2) running on device ID 0
[2023-08-13 00:39:55.341] [[32minfo[m] Running model at path model_configurator_output/c

[2023-08-13 00:48:40.848] [[32minfo[m] Model at path model_configurator_output/compiled_models/qpc-cores-7-mos-1-ols-1-bs-4-output with runnerParams (instances=2) running on device ID 0
[2023-08-13 00:48:45.907] [[32minfo[m] Compiling model with compiler parameters: [(cores=7, mos=[1], ols=1, batchSize=16)]
[2023-08-13 00:49:49.658] [[32minfo[m] Running model at path model_configurator_output/compiled_models/qpc-cores-7-mos-1-ols-1-bs-16-output with runnerParams (instances=2)
[2023-08-13 00:49:51.314] [[32minfo[m] Model at path model_configurator_output/compiled_models/qpc-cores-7-mos-1-ols-1-bs-16-output with runnerParams (instances=2) running on device ID 0
[2023-08-13 00:49:56.880] [[32minfo[m] Compiling model with compiler parameters: [(cores=6, mos=[1], ols=1, batchSize=16), (cores=7, mos=[1], ols=2, batchSize=16), (cores=7, mos=[2], ols=1, batchSize=16)]
[2023-08-13 00:51:07.598] [[32minfo[m] Running model at path model_configurator_output/compiled_models/qpc-cores-6-

[2023-08-13 01:03:16.478] [[32minfo[m] Model at path model_configurator_output/compiled_models/qpc-cores-14-mos-1-ols-2-bs-1-output with runnerParams (instances=1) running on device ID 0
[2023-08-13 01:03:21.520] [[32minfo[m] Running model at path model_configurator_output/compiled_models/qpc-cores-14-mos-1-ols-1-bs-2-output with runnerParams (instances=1)
[2023-08-13 01:03:21.831] [[32minfo[m] Model at path model_configurator_output/compiled_models/qpc-cores-14-mos-1-ols-1-bs-2-output with runnerParams (instances=1) running on device ID 0
[2023-08-13 01:03:26.889] [[32minfo[m] Compiling model with compiler parameters: [(cores=12, mos=[1], ols=1, batchSize=16)]
[2023-08-13 01:04:31.403] [[32minfo[m] Running model at path model_configurator_output/compiled_models/qpc-cores-12-mos-1-ols-1-bs-16-output with runnerParams (instances=1)
[2023-08-13 01:04:32.959] [[32minfo[m] Model at path model_configurator_output/compiled_models/qpc-cores-12-mos-1-ols-1-bs-16-output with runnerP

[2023-08-13 01:17:34.588] [[32minfo[m] Compiling model with compiler parameters: [(cores=9, mos=[1], ols=2, batchSize=16), (cores=10, mos=[1], ols=2, batchSize=8), (cores=10, mos=[2], ols=2, batchSize=16), (cores=11, mos=[1], ols=2, batchSize=16)]
[2023-08-13 01:19:02.899] [[32minfo[m] Running model at path model_configurator_output/compiled_models/qpc-cores-9-mos-1-ols-2-bs-16-output with runnerParams (instances=1)
[2023-08-13 01:19:04.469] [[32minfo[m] Model at path model_configurator_output/compiled_models/qpc-cores-9-mos-1-ols-2-bs-16-output with runnerParams (instances=1) running on device ID 0
[2023-08-13 01:19:09.529] [[32minfo[m] Running model at path model_configurator_output/compiled_models/qpc-cores-10-mos-1-ols-2-bs-8-output with runnerParams (instances=1)
[2023-08-13 01:19:10.392] [[32minfo[m] Model at path model_configurator_output/compiled_models/qpc-cores-10-mos-1-ols-2-bs-8-output with runnerParams (instances=1) running on device ID 0
[2023-08-13 01:19:15.438

[2023-08-13 01:34:30.145] [[32minfo[m] Model at path model_configurator_output/compiled_models/qpc-cores-8-mos-1-ols-5-bs-16-output with runnerParams (instances=1) running on device ID 0
2023-08-13 01:34:36.417 - [INFO]: Optimized search results:
  OptimizerStatus                      InitialValue                      cores mos   ols  batchSize  instances  Objective (inf/sec)
0      SUCCESS     (cores=1, mos=[1], ols=1, batchSize=1, instances=14)    1    [1]   2       2         14            4419.97      
1      SUCCESS      (cores=2, mos=[1], ols=1, batchSize=1, instances=7)    2    [1]   2       4          7            6418.51      
2      SUCCESS      (cores=4, mos=[1], ols=1, batchSize=1, instances=3)    5    [1]   2       8          2            5473.46      
3      SUCCESS      (cores=7, mos=[1], ols=1, batchSize=1, instances=2)    7    [1]   3      16          2            6091.01      
4      SUCCESS     (cores=14, mos=[1], ols=1, batchSize=1, instances=1)    8    [1]   2    
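
The results table can also be parsed to pick the winning row programmatically. The following is a minimal sketch, assuming the column layout shown above; `parse_rows` is a hypothetical helper. Only the four complete rows are included, because the last row is cut off in the captured output.

```python
import re

ROW_PATTERN = re.compile(
    r"^\s*(?P<index>\d+)\s+(?P<status>\w+)\s+\((?P<initial>.*?)\)"
    r"\s+(?P<cores>\d+)\s+\[(?P<mos>\d+)\]\s+(?P<ols>\d+)"
    r"\s+(?P<bs>\d+)\s+(?P<instances>\d+)\s+(?P<objective>\d+(?:\.\d+)?)\s*$"
)


def parse_rows(text):
    """Parse complete result rows; truncated or header lines are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW_PATTERN.match(line)
        if match is None:
            continue
        row = {key: int(match.group(key)) for key in ("cores", "mos", "ols", "bs", "instances")}
        row["status"] = match.group("status")
        row["objective"] = float(match.group("objective"))
        rows.append(row)
    return rows


# Rows copied from the output above (the truncated fifth row is omitted).
RESULTS = """\
0      SUCCESS     (cores=1, mos=[1], ols=1, batchSize=1, instances=14)    1    [1]   2       2         14            4419.97
1      SUCCESS      (cores=2, mos=[1], ols=1, batchSize=1, instances=7)    2    [1]   2       4          7            6418.51
2      SUCCESS      (cores=4, mos=[1], ols=1, batchSize=1, instances=3)    5    [1]   2       8          2            5473.46
3      SUCCESS      (cores=7, mos=[1], ols=1, batchSize=1, instances=2)    7    [1]   3      16          2            6091.01
"""

rows = parse_rows(RESULTS)
# For the "maximize_inf_rate" objective, higher is better.
best = max((r for r in rows if r["status"] == "SUCCESS"), key=lambda r: r["objective"])
print(best)
```

Among the rows that are visible here, the best one is cores=2, ols=2, batch size 4 with 7 instances, which is the configuration compiled in the next section. For a "minimize_latency" search the selection would use `min` instead of `max`.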

## Measure end-to-end latency 

Here are the steps to measure the end-to-end latency, as well as the breakdown of latency, for the best configuration identified in the previous step. 

1. Compile the model using 'qaic-exec' with the configuration parameters identified by the model configurator in the previous step. 
2. Execute the compiled model using 'qaic-runner' with the runtime parameters identified by the model configurator in the previous step. Run 'qaic-runner' with the flags that dump latency information.
3. Post-process the latency information to obtain the latency distribution across inferences (mean, median, 95th and 99th percentiles).
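
Step 3 boils down to summary statistics over per-inference samples. The following is a minimal sketch using only the standard library; it is not the `latency_stats_python3.py` script used in this notebook, it works on a plain list of numbers rather than the profiling file format, and it uses linear interpolation between neighbouring samples for percentiles.

```python
import statistics


def percentile(sorted_samples, pct):
    """Percentile with linear interpolation between the two nearest ranks."""
    if not sorted_samples:
        raise ValueError("no samples")
    rank = (len(sorted_samples) - 1) * pct / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(sorted_samples) - 1)
    fraction = rank - lower
    return sorted_samples[lower] + fraction * (sorted_samples[upper] - sorted_samples[lower])


def summarize(samples):
    """Return mean, min, median, tail percentiles and max for a list of latencies."""
    ordered = sorted(samples)
    return {
        "mean": statistics.fmean(ordered),
        "min": ordered[0],
        "50%": percentile(ordered, 50),
        "95%": percentile(ordered, 95),
        "99%": percentile(ordered, 99),
        "max": ordered[-1],
    }


# Synthetic samples for illustration only (not measured data).
summary = summarize(list(range(1, 101)))
for name, value in summary.items():
    print(f"{name:>5}: {value:.2f}")
```

The 95th and 99th percentiles matter more than the mean for latency-sensitive deployments, because they describe the slowest inferences a client is likely to see.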

In [32]:
## Compile the model 

!rm -rf compiled_fp16
!/opt/qti-aic/exec/qaic-exec -v -aic-hw  \
-m=resnet-50_onnx/model.onnx \
-onnx-define-symbol=batch_size,4 \
-onnx-define-symbol=num_channels,3 -onnx-define-symbol=height,224  -onnx-define-symbol=width,224 \
-mos=1 -ols=2 -aic-num-cores=2 \
-stats-batchsize=4 -aic-binary-dir=./compiled_fp16 \
-multicast-weights -convert-to-fp16 \
-aic-hw-version=2.0 -compile-only

Reading ONNX Model from resnet-50_onnx/model.onnx
Compile started ............... 
Compiling model with FP16 precision.
Generated binary is present at ./compiled_fp16


In [33]:
## Execute the compiled model with the latency flags. qaic-runner generates random data if input data is not passed to it. 
!mkdir resnet50_stats
!/opt/qti-aic/exec/qaic-runner --test-data ./compiled_fp16  -d 0 -a 7 -S 10 \
--aic-profiling-type latency --aic-profiling-out-dir ./resnet50_stats \
--aic-profiling-start-iter 100 --aic-profiling-num-samples 99999 --time 20 

mkdir: cannot create directory ‘resnet50_stats’: File exists
 ---- Stats ----
InferenceCnt 31905 TotalDuration 20042730us BatchSize 4 Inf/Sec 6367.396
Deleting previous file: ./resnet50_stats/aic-profiling-program-0-latency.txt
Writing file:./resnet50_stats/aic-profiling-program-0-latency.txt
Deleting previous file: ./resnet50_stats/aic-profiling-program-1-latency.txt
Writing file:./resnet50_stats/aic-profiling-program-1-latency.txt
Writing file:./resnet50_stats/aic-profiling-program-2-latency.txt
Writing file:./resnet50_stats/aic-profiling-program-3-latency.txt
Writing file:./resnet50_stats/aic-profiling-program-4-latency.txt
Writing file:./resnet50_stats/aic-profiling-program-5-latency.txt
Writing file:./resnet50_stats/aic-profiling-program-6-latency.txt


### Latency breakdown 

The end-to-end inference latency can be broken down into 4 major categories - Application, Linux Runtime (LRT) processing, Kernel mode driver (KMD) processing and Cloud AI device processing. 


![Latency Breakdown](Images/Latency.jpg)

**Key Latency Stats** (units in us)
- *totalRoundtripTime* : Time from the point where the application (qaic-runner in this case) calls the Runtime API to the point where post-processing is complete and control returns to the application, indicating that the inference is complete
- *preProcTime* : Time taken to pre-process the data (model input) on the host
- *postProc* : Time taken to post-process the data (model output) on the host
- *execTotal* : Time from the inference object being submitted to the kernel, through completion on hardware, until processing returns to user space


In [34]:
## Post process the latency information to identify percentile distribution

!python3 latency_stats_python3.py ./resnet50_stats/aic-profiling-program-0-latency.txt \
./resnet50_stats/aic-profiling-program-1-latency.txt \
./resnet50_stats/aic-profiling-program-2-latency.txt \
./resnet50_stats/aic-profiling-program-3-latency.txt \
./resnet50_stats/aic-profiling-program-4-latency.txt \
./resnet50_stats/aic-profiling-program-5-latency.txt \
./resnet50_stats/aic-profiling-program-6-latency.txt 

All activations combined:
                         mean        min        50%        75%        90%        95%        99%        max
hostRoundTrip       43.497543  39.139000  43.118000  45.156000  46.514000  47.111800  47.935920  49.395000
enqTime              0.006717   0.001940   0.006200   0.008150   0.009270   0.010230   0.019833   0.222081
preProcTime          0.375052   0.192681   0.326532   0.438332   0.590709   0.693545   0.935699   2.232812
submitTime           0.001776   0.000430   0.001720   0.002010   0.002260   0.002420   0.003320   0.198211
execTotal           43.531787  39.151401  43.152593  45.191984  46.552473  47.154850  47.978552  49.447936
exectoVc             0.004855   0.002000   0.005000   0.006000   0.006000   0.006000   0.008000   0.198000
execToComplete      43.492688  39.136000  43.114000  45.152000  46.510000  47.106800  47.930960  49.389000
postProc             0.005212   0.001850   0.004520   0.005850   0.007870   0.009530   0.014589   0.207241
totalRoundt

For the chosen configuration, the observed throughput is ~6400 images/sec with a mean latency of ~44 ms per batched (bs=4) inference. 
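
These two numbers can be cross-checked against each other. The following is a rough sketch based on Little's law, under the assumption that each of the 7 instances keeps all 10 set-size slots occupied for the whole run and that the mean round-trip time is the ~43.5 ms reported above; `estimated_throughput` is a hypothetical helper.

```python
def estimated_throughput(instances, set_size, batch_size, latency_ms):
    """Images in flight divided by the time each batch spends in the pipeline."""
    images_in_flight = instances * set_size * batch_size
    return images_in_flight / (latency_ms / 1000.0)


measured = 6367.396  # Inf/Sec reported by qaic-runner above
estimate = estimated_throughput(instances=7, set_size=10, batch_size=4, latency_ms=43.5)
print(f"estimated {estimate:.0f} inf/s vs measured {measured:.0f} inf/s")
```

The estimate lands within a few percent of the measured rate, which suggests that most of the ~44 ms round trip is time spent queued behind other inferences in the set rather than time spent computing a single batch.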

# <span style='color:Blue'> 5. Identify the least latency configuration </span>

To identify the least latency configuration, run the model_configurator tool with the "objective" parameter set to "minimize_latency". 

Minimum latency is achieved when the batch-size, instances and set-size are all set to 1. 
We need to iterate over the number of cores used to compile the network to identify the least latency config. 

The key difference between the dopt.json files used for best throughput and for least latency is in the initial values. The number of instances is always set to 1 for least latency. For best throughput, the initial value of instances = floor((total number of cores in the device) / (number of cores used to compile the model)). 
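
The rule for the initial values can be sketched as follows. `initial_values` is a hypothetical helper; fixing mos, ols and bs at 1 mirrors the throughput JSON shown earlier, and using the same starting values for the latency file is an assumption, since `resnet_base_dopt_min_latency.json` is not printed in this notebook.

```python
def initial_values(total_nsp, core_choices, objective):
    """Generate starting points for the search, one per core count."""
    points = []
    for cores in core_choices:
        if objective == "maximize_inf_rate":
            # Fill the card: as many instances as fit in the available cores.
            instances = total_nsp // cores
        elif objective == "minimize_latency":
            instances = 1
        else:
            raise ValueError(f"unknown objective: {objective}")
        points.append({"cores": cores, "mos": 1, "ols": 1, "bs": 1, "instances": instances})
    return points


throughput_points = initial_values(14, [1, 2, 4, 7, 14], "maximize_inf_rate")
latency_points = initial_values(14, [1, 2, 4, 7, 14], "minimize_latency")
print([p["instances"] for p in throughput_points])
print([p["instances"] for p in latency_points])
```

For a 14-NSP device the throughput case yields 14, 7, 3, 2 and 1 instances, matching the initial values in the throughput JSON used earlier.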

Let's go through these steps for the Resnet50 model.

In [47]:
!python3 /opt/qti-aic/scripts/qaic-model-configurator/model_configurator.py resnet-50_onnx/model.onnx onnx \
-onnx-define-symbol-batch-size=batch_size \
-onnx-define-symbol=num_channels,3 -onnx-define-symbol=height,224  -onnx-define-symbol=width,224 \
-multicast-weights \
-optimized-config-search=resnet_base_dopt_min_latency.json -max-compilation-threads=16 -time=5 \
-convert-to-fp16 -device-id=0 -set-size 1

  "class": algorithms.Blowfish,
2023-08-14 00:28:17.684 - [INFO]: Starting /opt/qti-aic/scripts/qaic-model-configurator/model_configurator.py resnet-50_onnx/model.onnx onnx -onnx-define-symbol-batch-size=batch_size -onnx-define-symbol=num_channels,3 -onnx-define-symbol=height,224 -onnx-define-symbol=width,224 -multicast-weights -optimized-config-search=resnet_base_dopt_min_latency.json -max-compilation-threads=16 -time=5 -convert-to-fp16 -device-id=0 -set-size 1
2023-08-14 00:28:17.684 - [INFO]: Model Name: model.onnx
2023-08-14 00:28:17.685 - [INFO]: Hostname: ac120r4-08-giga, Physical Cores: 32, Logical Cores: 32, Memory: 125.8 GB
2023-08-14 00:28:17.825 - [INFO]: Running optimized search
[2023-08-14 00:28:17.866] [[32minfo[m] Compiling model with compiler parameters: [(cores=1, mos=[1], ols=1, batchSize=1)]
[2023-08-14 00:28:51.257] [[32minfo[m] Running model at path model_configurator_output/compiled_models/qpc-cores-1-mos-1-ols-1-bs-1-output with runnerParams (instances=1)
[20

[2023-08-14 00:32:49.902] [[32minfo[m] Running model at path model_configurator_output/compiled_models/qpc-cores-5-mos-1-ols-1-bs-1-output with runnerParams (instances=1)
[2023-08-14 00:32:50.098] [[32minfo[m] Model at path model_configurator_output/compiled_models/qpc-cores-5-mos-1-ols-1-bs-1-output with runnerParams (instances=1) running on device ID 0
[2023-08-14 00:32:55.131] [[32minfo[m] Running model at path model_configurator_output/compiled_models/qpc-cores-5-mos-1-ols-1-bs-1-output with runnerParams (instances=1)
[2023-08-14 00:32:55.323] [[32minfo[m] Model at path model_configurator_output/compiled_models/qpc-cores-5-mos-1-ols-1-bs-1-output with runnerParams (instances=1) running on device ID 0
[2023-08-14 00:33:00.353] [[32minfo[m] Running model at path model_configurator_output/compiled_models/qpc-cores-5-mos-1-ols-1-bs-1-output with runnerParams (instances=1)
[2023-08-14 00:33:00.543] [[32minfo[m] Model at path model_configurator_output/compiled_models/qpc-core

[2023-08-14 00:38:55.996] [[32minfo[m] Model at path model_configurator_output/compiled_models/qpc-cores-8-mos-1-ols-1-bs-1-output with runnerParams (instances=1) running on device ID 0
[2023-08-14 00:39:01.034] [[32minfo[m] Running model at path model_configurator_output/compiled_models/qpc-cores-10-mos-1-ols-1-bs-1-output with runnerParams (instances=1)
[2023-08-14 00:39:01.250] [[32minfo[m] Model at path model_configurator_output/compiled_models/qpc-cores-10-mos-1-ols-1-bs-1-output with runnerParams (instances=1) running on device ID 0
[2023-08-14 00:39:06.286] [[32minfo[m] Running model at path model_configurator_output/compiled_models/qpc-cores-9-mos-2-ols-1-bs-1-output with runnerParams (instances=1)
[2023-08-14 00:39:06.511] [[32minfo[m] Model at path model_configurator_output/compiled_models/qpc-cores-9-mos-2-ols-1-bs-1-output with runnerParams (instances=1) running on device ID 0
[2023-08-14 00:39:11.549] [[32minfo[m] Running model at path model_configurator_output

[2023-08-14 00:46:50.573] [[32minfo[m] Running model at path model_configurator_output/compiled_models/qpc-cores-11-mos-2-ols-1-bs-1-output with runnerParams (instances=1)
[2023-08-14 00:46:50.823] [[32minfo[m] Model at path model_configurator_output/compiled_models/qpc-cores-11-mos-2-ols-1-bs-1-output with runnerParams (instances=1) running on device ID 0
[2023-08-14 00:46:55.864] [[32minfo[m] Running model at path model_configurator_output/compiled_models/qpc-cores-11-mos-1-ols-2-bs-1-output with runnerParams (instances=1)
[2023-08-14 00:46:56.126] [[32minfo[m] Model at path model_configurator_output/compiled_models/qpc-cores-11-mos-1-ols-2-bs-1-output with runnerParams (instances=1) running on device ID 0
[2023-08-14 00:47:01.169] [[32minfo[m] Running model at path model_configurator_output/compiled_models/qpc-cores-11-mos-1-ols-1-bs-2-output with runnerParams (instances=1)
[2023-08-14 00:47:01.490] [[32minfo[m] Model at path model_configurator_output/compiled_models/qpc

[2023-08-14 00:49:36.235] [[32minfo[m] Model at path model_configurator_output/compiled_models/qpc-cores-2-mos-1-ols-1-bs-2-output with runnerParams (instances=1) running on device ID 0
[2023-08-14 00:49:41.272] [[32minfo[m] Running model at path model_configurator_output/compiled_models/qpc-cores-2-mos-1-ols-1-bs-1-output with runnerParams (instances=2)
[2023-08-14 00:49:41.488] [[32minfo[m] Model at path model_configurator_output/compiled_models/qpc-cores-2-mos-1-ols-1-bs-1-output with runnerParams (instances=2) running on device ID 0
[2023-08-14 00:49:46.530] [[32minfo[m] Running model at path model_configurator_output/compiled_models/qpc-cores-2-mos-2-ols-1-bs-1-output with runnerParams (instances=1)
[2023-08-14 00:49:46.712] [[32minfo[m] Model at path model_configurator_output/compiled_models/qpc-cores-2-mos-2-ols-1-bs-1-output with runnerParams (instances=1) running on device ID 0
[2023-08-14 00:49:51.748] [[32minfo[m] Running model at path model_configurator_output/c

[2023-08-14 00:55:44.825] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-11-mos-8-ols-2-bs-1-output with runnerParams (instances=1)
[2023-08-14 00:55:45.156] [info] Model at path model_configurator_output/compiled_models/qpc-cores-11-mos-8-ols-2-bs-1-output with runnerParams (instances=1) running on device ID 0
[2023-08-14 00:55:50.191] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-12-mos-7-ols-2-bs-1-output with runnerParams (instances=1)
[2023-08-14 00:55:50.481] [info] Model at path model_configurator_output/compiled_models/qpc-cores-12-mos-7-ols-2-bs-1-output with runnerParams (instances=1) running on device ID 0
[2023-08-14 00:55:55.526] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-12-mos-8-ols-1-bs-1-output with runnerParams (instances=1)
[2023-08-14 00:55:55.765] [info] Model at path model_configurator_output/compiled_models/qpc

[2023-08-14 00:59:29.839] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-4-mos-1-ols-1-bs-2-output with runnerParams (instances=1)
[2023-08-14 00:59:30.110] [info] Model at path model_configurator_output/compiled_models/qpc-cores-4-mos-1-ols-1-bs-2-output with runnerParams (instances=1) running on device ID 0
[2023-08-14 00:59:35.144] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-4-mos-1-ols-1-bs-1-output with runnerParams (instances=2)
[2023-08-14 00:59:35.374] [info] Model at path model_configurator_output/compiled_models/qpc-cores-4-mos-1-ols-1-bs-1-output with runnerParams (instances=2) running on device ID 0
[2023-08-14 00:59:40.418] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-4-mos-2-ols-1-bs-1-output with runnerParams (instances=1)
[2023-08-14 00:59:40.610] [info] Model at path model_configurator_output/compiled_models/qpc-core

[2023-08-14 01:05:28.505] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-13-mos-1-ols-1-bs-2-output with runnerParams (instances=1)
[2023-08-14 01:05:28.816] [info] Model at path model_configurator_output/compiled_models/qpc-cores-13-mos-1-ols-1-bs-2-output with runnerParams (instances=1) running on device ID 0
[2023-08-14 01:05:33.860] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-13-mos-2-ols-1-bs-1-output with runnerParams (instances=1)
[2023-08-14 01:05:34.115] [info] Model at path model_configurator_output/compiled_models/qpc-cores-13-mos-2-ols-1-bs-1-output with runnerParams (instances=1) running on device ID 0
[2023-08-14 01:05:39.149] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-13-mos-1-ols-2-bs-1-output with runnerParams (instances=1)
[2023-08-14 01:05:39.448] [info] Model at path model_configurator_output/compiled_models/qpc

[2023-08-14 01:09:22.420] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-7-mos-1-ols-1-bs-2-output with runnerParams (instances=1)
[2023-08-14 01:09:22.719] [info] Model at path model_configurator_output/compiled_models/qpc-cores-7-mos-1-ols-1-bs-2-output with runnerParams (instances=1) running on device ID 0
[2023-08-14 01:09:27.758] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-7-mos-1-ols-1-bs-1-output with runnerParams (instances=2)
[2023-08-14 01:09:28.021] [info] Model at path model_configurator_output/compiled_models/qpc-cores-7-mos-1-ols-1-bs-1-output with runnerParams (instances=2) running on device ID 0
[2023-08-14 01:09:33.085] [info] Compiling model with compiler parameters: [(cores=14, mos=[1], ols=1, batchSize=2), (cores=14, mos=[1], ols=2, batchSize=1), (cores=14, mos=[2], ols=1, batchSize=1)]
[2023-08-14 01:11:24.010] [info] Running model at path model_co
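
The directory names in the log above appear to encode the compile parameters of each run (cores, mos, ols, batch size), and each line also reports the number of instances. As a rough sketch, the helper below pulls those values out of a single log line. The naming pattern is an assumption inferred from the log text, not a documented format, and `parse_run_line` is a hypothetical helper that is not part of the SDK.

```python
import re

# Assumed directory naming pattern, inferred from the log lines above:
#   qpc-cores-<N>-mos-<N>-ols-<N>-bs-<N>-output
QPC_NAME = re.compile(r"qpc-cores-(\d+)-mos-(\d+)-ols-(\d+)-bs-(\d+)-output")
INSTANCES = re.compile(r"instances=(\d+)")


def parse_run_line(line):
    """Return the configuration named in one log line, or None if it is incomplete."""
    name = QPC_NAME.search(line)
    inst = INSTANCES.search(line)
    if name is None or inst is None:
        # Truncated or unrelated lines carry no usable configuration.
        return None
    cores, mos, ols, batch_size = (int(group) for group in name.groups())
    return {
        "cores": cores,
        "mos": mos,
        "ols": ols,
        "batch_size": batch_size,
        "instances": int(inst.group(1)),
    }


sample = (
    "Running model at path model_configurator_output/compiled_models/"
    "qpc-cores-7-mos-1-ols-1-bs-1-output with runnerParams (instances=2)"
)
print(parse_run_line(sample))
```

Lines that were cut off before the full directory name simply return `None`, so the helper can be applied to every line of a saved log without special handling.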

The configuration with 11 cores and batch size 1 is returned as the one with the lowest latency per image. Developers could also use the configuration in row 3, which uses 7 cores: on a device with 14 cores, two such instances can run concurrently at the cost of a slight increase in latency. 
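
The core-count arithmetic behind this choice can be sketched as follows. It assumes that every instance needs its own dedicated set of cores, so the number of concurrent instances is the integer quotient of device cores by cores per instance. `max_instances` is a hypothetical helper used only for illustration.

```python
def max_instances(device_cores, cores_per_instance):
    """Number of instances that fit on the device if cores are not shared (assumption)."""
    if cores_per_instance <= 0:
        raise ValueError("cores_per_instance must be positive")
    return device_cores // cores_per_instance


DEVICE_CORES = 14  # core count of the device discussed in this notebook

for cores in (7, 11):
    print(f"{cores} cores per instance -> {max_instances(DEVICE_CORES, cores)} instance(s)")
```

Under this assumption, the 11-core configuration leaves 3 cores idle, while the 7-core configuration can use all 14 cores with two instances.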

In [23]:
## Compile for 7 cores (aic-num-cores=7); the first run below uses 1 instance
!rm -rf compiled_fp16
!rm -rf resnet50_stats
!mkdir resnet50_stats

!/opt/qti-aic/exec/qaic-exec -v -aic-hw  \
-m=resnet-50_onnx/model.onnx \
-onnx-define-symbol=batch_size,1 \
-onnx-define-symbol=num_channels,3 -onnx-define-symbol=height,224  -onnx-define-symbol=width,224 \
-mos=1 -ols=1 -aic-num-cores=7 \
-stats-batchsize=1 -aic-binary-dir=./compiled_fp16 \
-multicast-weights -convert-to-fp16 \
-aic-hw-version=2.0 -compile-only

Reading ONNX Model from resnet-50_onnx/model.onnx
Compile started ............... 
Compiling model with FP16 precision.
Generated binary is present at ./compiled_fp16


In [28]:
# Running a single instance of the compiled binary
!/opt/qti-aic/exec/qaic-runner --test-data ./compiled_fp16  -d 0 -a 1 -S 1 \
--aic-profiling-type latency --aic-profiling-out-dir ./resnet50_stats \
--aic-profiling-start-iter 100 --aic-profiling-num-samples 99999 --time 10 

 ---- Stats ----
InferenceCnt 12009 TotalDuration 10000635us BatchSize 1 Inf/Sec 1200.824
Deleting previous file: ./resnet50_stats/aic-profiling-program-0-latency.txt
Writing file:./resnet50_stats/aic-profiling-program-0-latency.txt
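
When comparing several runs, it can help to read the summary line into numbers instead of copying them by hand. The sketch below parses the `InferenceCnt ... Inf/Sec` line shown above and recomputes the throughput from the inference count and duration as a sanity check. The line layout is taken from this output only and may differ in other SDK versions; `parse_stats` is a hypothetical helper.

```python
import re

# Layout assumed from the qaic-runner output shown above.
STATS_LINE = re.compile(
    r"InferenceCnt (\d+) TotalDuration (\d+)us BatchSize (\d+) Inf/Sec ([\d.]+)"
)


def parse_stats(line):
    """Parse one stats summary line into a dict of numbers."""
    match = STATS_LINE.search(line)
    if match is None:
        raise ValueError("not a stats summary line")
    count, duration_us, batch_size = (int(match.group(i)) for i in (1, 2, 3))
    return {
        "inference_count": count,
        "duration_s": duration_us / 1e6,
        "batch_size": batch_size,
        "reported_inf_per_sec": float(match.group(4)),
        # Cross-check: inferences divided by elapsed seconds.
        "recomputed_inf_per_sec": count / (duration_us / 1e6),
    }


line = "InferenceCnt 12009 TotalDuration 10000635us BatchSize 1 Inf/Sec 1200.824"
stats = parse_stats(line)
print(stats["reported_inf_per_sec"], round(stats["recomputed_inf_per_sec"], 3))
```

The reported and recomputed values agree to within rounding for this run, which suggests that `Inf/Sec` here is simply the inference count divided by the total duration.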


In [29]:
# Measure latency for the single instance
!python3 latency_stats_python3.py ./resnet50_stats/aic-profiling-program-0-latency.txt 

All activations combined:
                        mean       min       50%       75%       90%       95%       99%       max
hostRoundTrip       0.684067  0.634000  0.680000  0.691000  0.719000  0.798600  0.843000  0.884000
enqTime             0.004928  0.002380  0.004770  0.005910  0.006690  0.007310  0.013023  0.039070
preProcTime         0.036922  0.030020  0.038401  0.038810  0.040480  0.041221  0.043546  0.066970
submitTime          0.001740  0.000730  0.001820  0.001990  0.002060  0.002130  0.002220  0.024850
execTotal           0.719364  0.651073  0.714634  0.725984  0.749424  0.836470  0.882172  0.935055
exectoVc            0.004541  0.003000  0.005000  0.005000  0.005000  0.005000  0.005000  0.019000
execToComplete      0.679526  0.629000  0.675000  0.687000  0.715000  0.794000  0.838000  0.880000
postProc            0.002600  0.001240  0.002570  0.002810  0.002950  0.003020  0.003140  0.009320
totalRoundtripTime  0.792106  0.718413  0.781705  0.799574  0.806634  0.926719  0.9

In [30]:
# Running 2 instances of the compiled binary
!/opt/qti-aic/exec/qaic-runner --test-data ./compiled_fp16  -d 0 -a 2 -S 1 \
--aic-profiling-type latency --aic-profiling-out-dir ./resnet50_stats \
--aic-profiling-start-iter 100 --aic-profiling-num-samples 99999 --time 10 

 ---- Stats ----
InferenceCnt 21911 TotalDuration 10000800us BatchSize 1 Inf/Sec 2190.925
Deleting previous file: ./resnet50_stats/aic-profiling-program-0-latency.txt
Writing file:./resnet50_stats/aic-profiling-program-0-latency.txt
Deleting previous file: ./resnet50_stats/aic-profiling-program-1-latency.txt
Writing file:./resnet50_stats/aic-profiling-program-1-latency.txt


In [31]:
# Measure latency for 2 instances
!python3 latency_stats_python3.py ./resnet50_stats/aic-profiling-program-0-latency.txt \
./resnet50_stats/aic-profiling-program-1-latency.txt 

All activations combined:
                        mean       min       50%       75%       90%       95%       99%       max
hostRoundTrip       0.810078  0.658000  0.766000  0.894000  0.930000  0.936000  0.967000  1.092000
enqTime             0.004089  0.001680  0.003260  0.004770  0.007320  0.008040  0.009850  0.054131
preProcTime         0.033236  0.014310  0.031820  0.034276  0.037950  0.040680  0.047958  0.246621
submitTime          0.001468  0.000370  0.001370  0.001800  0.002060  0.002210  0.002470  0.018550
execTotal           0.834520  0.673993  0.777944  0.925465  0.960235  0.967630  0.980073  1.209497
exectoVc            0.003387  0.002000  0.003000  0.004000  0.004000  0.005000  0.005000  0.019000
execToComplete      0.806691  0.653000  0.764000  0.891000  0.927000  0.933000  0.965000  1.088000
postProc            0.002094  0.000820  0.002390  0.002600  0.002771  0.003050  0.003340  0.010550
totalRoundtripTime  0.887163  0.727994  0.828104  0.983996  1.010106  1.0

Going from one instance to two increases throughput by about 1.8x (~1200 -> ~2190 inf/s), while the mean total round-trip latency rises by only about 12% (~0.79 ms -> ~0.89 ms). 
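
The comparison can be reproduced from the numbers reported in the two runs above. This is a minimal sketch that uses the `Inf/Sec` values and the mean `totalRoundtripTime` entries, taking the latency unit to be milliseconds as in the text. The variable names are illustrative only.

```python
# Values copied from the qaic-runner and latency outputs above.
single = {"inf_per_sec": 1200.824, "mean_latency_ms": 0.792106}  # 1 instance
dual = {"inf_per_sec": 2190.925, "mean_latency_ms": 0.887163}    # 2 instances

throughput_gain = dual["inf_per_sec"] / single["inf_per_sec"]
latency_increase_pct = (dual["mean_latency_ms"] / single["mean_latency_ms"] - 1) * 100
# Fraction of ideal 2x scaling achieved by the second instance.
scaling_efficiency_pct = throughput_gain / 2 * 100

print(f"Throughput gain:    {throughput_gain:.2f}x")
print(f"Latency increase:   {latency_increase_pct:.1f}%")
print(f"Scaling efficiency: {scaling_efficiency_pct:.1f}%")
```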