Copyright (c) 2023 Qualcomm Innovation Center, Inc. All rights reserved. <br>
SPDX-License-Identifier: BSD-3-Clause-Clear

**Takeaways:** Users will learn how to tune the BERT base neural network model for best throughput and latency.

**Before you start:** 
- Some commands in this notebook (folder locations, etc.) need to be updated to match your platform and installation location. Some commands might need a sudo prefix to run properly.
- The terms 'model' and 'network' are used interchangeably in this notebook. 
- The terms 'NSP' and 'AI compute core' are used interchangeably in this notebook.

**Last Verified Qualcomm Cloud AI Platform SDK and Apps SDK Version:** Platform SDK 1.10.0.193 and Apps SDK 1.10.0.193 

**SKU used:** Cloud AI 100 Pro card

# <span style='color:Blue'> Performance Tuning on Cloud AI </span>

##  Pre-requisite reading 
New users of Cloud AI platforms are expected to review the Cloud AI SoC architecture and the key compile and runtime parameters that determine performance. These are discussed in the Tune Performance section of the Inference workflow documentation. 


## Introduction 
This notebook is for beginners and will take the user through the workflow to achieve best throughput and latency on Cloud AI platforms for the 'bert-base-cased' model from HuggingFace. 

Here is the workflow that will be demonstrated in this notebook. 

1. **Install required packages**: Begin by installing all the required packages.
2. **Import the model**: Download the bert-base-cased model from HuggingFace and export it to ONNX.
3. **Device health check**: Query the device health using the qaic-util tool.
4. **Identify best throughput configuration**: Use the Model Configurator tool to identify the configuration with the best throughput.
5. **Identify least latency configuration**: Go over the key parameters to tweak for least latency.


# <span style='color:Blue'> 1. Install required packages </span>

Install the required Python packages listed in requirements.txt.

In [1]:
!pip install -r requirements.txt

Collecting torch===1.11.0 (from -r requirements.txt (line 5))
  Using cached torch-1.11.0-cp38-cp38-manylinux1_x86_64.whl (750.6 MB)
Installing collected packages: torch
  Attempting uninstall: torch
    Found existing installation: torch 1.13.0
    Uninstalling torch-1.13.0:
      Successfully uninstalled torch-1.13.0
Successfully installed torch-1.11.0

# <span style='color:Blue'>2. Download the model </span>

Download the pretrained bert-base-cased model and export it to ONNX (opset 11, question-answering task) using optimum-cli. 

In [2]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
!optimum-cli export onnx --model bert-base-cased --cache_dir model_files/cased --opset 11 --task question-answering bert_base_cased_onnx/

Framework not specified. Using pt to export to ONNX.
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Using framework PyTorch: 1.11.0+cu102
Overriding 1 configuration item(s)
	- use_cache -> False
Post-processing the exported models...
Validating models in subprocesses...
Validating ONNX model bert_base_cased_onnx/model.onnx...
	-[✓] ONNX model output names match reference model (end_logits, start_logits)
	- Validating ONNX Model output "start_logits":
		-[✓] (2, 16) matches (2, 16)
		-[✓] all values close (atol: 0.0001)
	- Validating ONNX Model output "end_logits":
		-[✓] (2, 16) matches (2, 16)
		-[✓] all values close (atol: 0.0001)
The ONNX export succeeded and the exported model was saved at: bert_base_cased_onnx


In [3]:
import numpy as np
import onnx
from onnx import numpy_helper
        
def fix_onnx_fp16(
    gen_models_path: str,
    model_base_name: str,
) -> str:
    """Clip FP32 constants that fall outside the FP16 range.

    Returns the base name of the model to use: the original name if nothing
    was clipped, otherwise the name of the newly saved, clipped model.
    """
    finfo = np.finfo(np.float16)
    fp16_max = finfo.max
    fp16_min = finfo.min

    model = onnx.load(f"{gen_models_path}/{model_base_name}.onnx")
    fp16_fix = False
    # Note: _get_all_tensors is a private onnx helper and may change between releases.
    for tensor in onnx.external_data_helper._get_all_tensors(model):
        nptensor = numpy_helper.to_array(tensor, gen_models_path)
        # Only FP32 tensors holding values beyond the FP16 limits need fixing.
        if nptensor.dtype == np.float32 and (
            np.any(nptensor > fp16_max) or np.any(nptensor < fp16_min)
        ):
            print(f'tensor value : {nptensor} above {fp16_max} or below {fp16_min}')
            nptensor = np.clip(nptensor, fp16_min, fp16_max)
            new_tensor = numpy_helper.from_array(nptensor, tensor.name)
            tensor.CopyFrom(new_tensor)
            fp16_fix = True

    if fp16_fix:
        # Save the clipped model under a new name; the original file is left untouched.
        print("Found constants out of FP16 range, clipped to FP16 range")
        model_base_name += "_fix_outofrange_fp16"
        onnx.save(model, f=f"{gen_models_path}/{model_base_name}.onnx")
        print(f"Saving modified onnx file at {gen_models_path}/{model_base_name}.onnx")
    return model_base_name

fp16_model_name = fix_onnx_fp16(gen_models_path="bert_base_cased_onnx", model_base_name="model")

tensor value : -3.4028234663852886e+38 above 65504.0 or below -65504.0
Found constants out of FP16 range, clipped to FP16 range
Saving modified onnx file at bert_base_cased_onnx/model_fix_outofrange_fp16.onnx
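
The reason for the clipping step can be illustrated with a small standard-library sketch. It is not part of the original notebook: `fits_fp16` and `clip_to_fp16` are hypothetical helper names, and the sample value is the constant printed in the cell output above. Python's `struct` format `e` packs IEEE 754 half-precision values and raises `OverflowError` when a value is too large for FP16.

```python
import struct

# Largest finite half-precision value (the same number the cell above prints as fp16_max).
FP16_MAX = 65504.0


def fits_fp16(value: float) -> bool:
    """Return True if `value` can be packed as FP16 without overflowing."""
    try:
        struct.pack("<e", value)
        return True
    except OverflowError:
        return False


def clip_to_fp16(value: float) -> float:
    """Clamp `value` into [-FP16_MAX, FP16_MAX], mirroring np.clip in the cell above."""
    return max(-FP16_MAX, min(FP16_MAX, value))


fp32_constant = -3.4028234663852886e+38  # value reported in the cell output
print(fits_fp16(fp32_constant))                # False
print(clip_to_fp16(fp32_constant))             # -65504.0
print(fits_fp16(clip_to_fp16(fp32_constant)))  # True
```

Without clipping, such a constant cannot be represented as a finite FP16 value once the model is converted to FP16.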


# <span style='color:Blue'> 3. Check device health </span>

The 'qaic-util' tool can be used to query the health of all the Cloud AI cards in the server. 

- "Status:Ready" indicates that the card is healthy.
- "Status:Error" indicates that the card is not healthy; contact a system administrator to rectify the issue.

In [4]:
!/opt/qti-aic/tools/qaic-util -q | grep "Status"

	Status:Ready
	Status:Ready
	Status:Ready
	Status:Ready
	Status:Ready
	Status:Ready
	Status:Ready
	Status:Ready
	Status:Ready
	Status:Ready
	Status:Ready
	Status:Ready
	Status:Ready
	Status:Ready
	Status:Ready
	Status:Ready
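
When there are many cards, the status lines can be summarized programmatically. The following is a minimal sketch, not part of the original notebook: `summarize_status` is a hypothetical helper, and the sample text only mimics the 16 status lines shown above rather than calling the tool.

```python
from collections import Counter


def summarize_status(qaic_util_output: str) -> Counter:
    """Count the values found on the 'Status:' lines of `qaic-util -q` output."""
    counts = Counter()
    for line in qaic_util_output.splitlines():
        line = line.strip()
        if line.startswith("Status:"):
            counts[line.split(":", 1)[1].strip()] += 1
    return counts


# Sample text shaped like the output above (16 status lines, all Ready).
sample = "\n".join(["\tStatus:Ready"] * 16)
summary = summarize_status(sample)
all_ready = set(summary) == {"Ready"}
print(dict(summary))
print("All cards ready:", all_ready)
```

On a live system, the same function could be fed the text returned by `subprocess.run(["/opt/qti-aic/tools/qaic-util", "-q"], capture_output=True, text=True).stdout`, assuming the tool is installed at the path used in this notebook.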


# <span style='color:Blue'> 4. Identify the best throughput configuration </span>

Model Configurator is a Python script that searches for the configuration (batch size, number of cores, etc.) that gives the best throughput (inferences per second, inf/s) for a given model. Its inputs are the model and a search space that the tool iterates over.   

1. Use the optimal configuration reported by Model Configurator as an indicator of the best compile flags.
2. Compile with qaic-exec and run with qaic-runner using the optimal configuration from step 1 to validate the performance. 

Key compile optimization flags are:
- **aic-num-cores**: number of AI compute cores (NSPs) used to compile the model
- **bs**: batch size
- **mos**: maximum output splitting; the number of AI compute cores across which an output channel (and its associated weights) is split
- **ols**: overlap splitting; enables output splitting to improve core-level parallelism, e.g. between the tensor and vector units

Key runtime optimization flags are:
- **instance/activations**: number of instances of the compiled binary that can be run, which depends on the number of AI compute cores on the card
- **set-size**: number of inferences per instance that can be queued up on the host; pipelining inferences hides host-side overhead
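
The link between cores per instance and the number of instances can be sketched as follows. This is an illustration under an assumption, not a documented rule: it assumes that each instance occupies `aic-num-cores` NSPs, so the instance count is bounded by the integer division of the NSP count by the cores per instance. The initial values in `bert_base_dopt.json` later in this notebook are consistent with that assumption for a 14-NSP device. `max_instances` is a hypothetical helper name.

```python
def max_instances(total_nsp: int, cores_per_instance: int) -> int:
    """Upper bound on instances, assuming each instance occupies its own cores."""
    if not 1 <= cores_per_instance <= total_nsp:
        raise ValueError("cores_per_instance must be between 1 and total_nsp")
    return total_nsp // cores_per_instance


# 14 NSPs is the device size assumed later in this notebook.
for cores in (1, 2, 4, 7, 14):
    print(f"cores={cores:2d} -> up to {max_instances(14, cores)} instance(s)")
```

Core counts that do not divide the NSP count evenly (for example 4 cores on 14 NSPs) leave some NSPs idle under this assumption.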

## Config Optimizer 
The optimized search runs over one or more searchable parameters (aic-num-cores, instances, batch size, etc.). The search space is provided through a JSON configuration file. 
The following table captures the key elements of the configuration file. 

| Key | Type | Description | Recommended value |
| :- | :- | :- | :- |
| "max_func_eval" | Integer | Maximum number of evaluations to perform for each initial point. This number can be increased if successful convergence is not achieved. | 200 |
| "objective" | String | Search objective. Options are "maximize_inf_rate". | |
| "params" | JSON object | Search range for each of the parameters (cores, mos, ols, etc.), given as min and max values. | See table below |
| "initial_values" | List of JSON objects | List of initial values for the search parameters. A fresh search is started from each of these points and the results are returned. Initial values must be picked from within the search range defined in "params". | Provide multiple initial values as shown in the example JSON |
| "static_params" | JSON object | Optional static values for searchable parameters that have been excluded from the search space. | |


### Parameter Range


| Parameter | Recommended search range | Valid range | Comments |
| :- | :- | :- | :- |
| cores | 1 to number of NSPs on device | 1 to number of NSPs on device | |
| mos | 1 to number of NSPs on device | 1 to number of NSPs on device | |
| ols | 1-8 | Integers > 0 | |
| batch-size (bs) | 1-16 | Integers > 0 | Min and max values must be powers of 2. The max value depends on the model. |
| instances | 1 to number of NSPs on device | 1 to number of NSPs on device | |

In [5]:
# Let's assume the maximum number of NSPs on the device is 14.
!cat bert_base_dopt.json

{
  "max_func_eval": 200,
  "objective": "maximize_inf_rate",
  "params": {
    "cores": {
      "min": 1,
      "max": 14
    },
    "mos": {
      "min": 1,
      "max": 8
    },
    "ols": {
      "min": 1,
      "max": 8
    },
    "bs": {
      "min": 1,
      "max": 16
    },
    "instances": {
      "min": 1,
      "max": 14
    }
  },
  "initial_values": [
    {
      "cores": 1,
      "mos": 1,
      "ols": 1,
      "bs": 1,
      "instances": 14
    },
    {
      "cores": 2,
      "mos": 1,
      "ols": 1,
      "bs": 1,
      "instances": 7
    },
    {
      "cores": 4,
      "mos": 1,
      "ols": 1,
      "bs": 1,
      "instances": 3
    },
    {
      "cores": 7,
      "mos": 1,
      "ols": 1,
      "bs": 1,
      "instances": 2
    },
    {
      "cores": 14,
      "mos": 1,
      "ols": 1,
      "bs": 1,
      "instances": 1
    }
  ]
}
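
Before launching a long search, the configuration file can be checked against the rules stated in the tables above (min not greater than max, batch-size bounds that are powers of 2, initial values inside the search range). The sketch below is an illustration only: `check_search_config` is a hypothetical helper, not part of Model Configurator, and the embedded JSON is a shortened copy of the file shown above.

```python
import json


def is_power_of_two(n: int) -> bool:
    return n > 0 and (n & (n - 1)) == 0


def check_search_config(config: dict) -> list:
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    params = config.get("params", {})
    for name, bounds in params.items():
        if bounds["min"] > bounds["max"]:
            problems.append(f"{name}: min is greater than max")
    bs = params.get("bs")
    if bs and not (is_power_of_two(bs["min"]) and is_power_of_two(bs["max"])):
        problems.append("bs: min and max must be powers of 2")
    for i, point in enumerate(config.get("initial_values", [])):
        for name, value in point.items():
            bounds = params.get(name)
            # Only parameters that appear in "params" are range-checked.
            if bounds and not bounds["min"] <= value <= bounds["max"]:
                problems.append(f"initial_values[{i}]: {name}={value} is out of range")
    return problems


config = json.loads("""
{
  "max_func_eval": 200,
  "objective": "maximize_inf_rate",
  "params": {
    "cores": {"min": 1, "max": 14},
    "mos": {"min": 1, "max": 8},
    "ols": {"min": 1, "max": 8},
    "bs": {"min": 1, "max": 16},
    "instances": {"min": 1, "max": 14}
  },
  "initial_values": [
    {"cores": 1, "mos": 1, "ols": 1, "bs": 1, "instances": 14},
    {"cores": 14, "mos": 1, "ols": 1, "bs": 1, "instances": 1}
  ]
}
""")
print(check_search_config(config))  # an empty list when no problem is found
```

To check the actual file, load it with `json.load(open("bert_base_dopt.json"))` instead of the embedded string.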

In [6]:
# Dump the model inputs and outputs 

import onnx
model = onnx.load("bert_base_cased_onnx/model_fix_outofrange_fp16.onnx")
for _input in model.graph.input:
    print(_input)
for _output in model.graph.output:
    print(_output)

name: "input_ids"
type {
  tensor_type {
    elem_type: 7
    shape {
      dim {
        dim_param: "batch_size"
      }
      dim {
        dim_param: "sequence_length"
      }
    }
  }
}

name: "attention_mask"
type {
  tensor_type {
    elem_type: 7
    shape {
      dim {
        dim_param: "batch_size"
      }
      dim {
        dim_param: "sequence_length"
      }
    }
  }
}

name: "token_type_ids"
type {
  tensor_type {
    elem_type: 7
    shape {
      dim {
        dim_param: "batch_size"
      }
      dim {
        dim_param: "sequence_length"
      }
    }
  }
}

name: "start_logits"
type {
  tensor_type {
    elem_type: 1
    shape {
      dim {
        dim_param: "batch_size"
      }
      dim {
        dim_param: "sequence_length"
      }
    }
  }
}

name: "end_logits"
type {
  tensor_type {
    elem_type: 1
    shape {
      dim {
        dim_param: "batch_size"
      }
      dim {
        dim_param: "sequence_length"
      }
    }
  }
}
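
The dump above can be read as follows: `elem_type: 7` is INT64 and `elem_type: 1` is FLOAT in the ONNX `TensorProto.DataType` enum, and every dimension is symbolic. The sketch below resolves the symbolic dimensions to concrete shapes. It is an illustration only: `resolve_shape` is a hypothetical helper, `sequence_length=128` is the value passed to Model Configurator in the next cell, and `batch_size=1` is just an example value since the batch size is part of the search.

```python
# Subset of the ONNX TensorProto.DataType enum that is relevant to this model.
ONNX_ELEM_TYPES = {1: "FLOAT", 7: "INT64", 10: "FLOAT16"}


def resolve_shape(dims, bindings):
    """Replace symbolic dimension names with the concrete sizes in `bindings`."""
    return tuple(bindings[d] if isinstance(d, str) else d for d in dims)


# Inputs and outputs as printed above: (name, elem_type, dims).
symbolic = ("batch_size", "sequence_length")
tensors = [
    ("input_ids", 7, symbolic),
    ("attention_mask", 7, symbolic),
    ("token_type_ids", 7, symbolic),
    ("start_logits", 1, symbolic),
    ("end_logits", 1, symbolic),
]

bindings = {"batch_size": 1, "sequence_length": 128}
for name, elem_type, dims in tensors:
    print(f"{name:15s} {ONNX_ELEM_TYPES[elem_type]:6s} {resolve_shape(dims, bindings)}")
```

Both symbols have to be defined at compile time, which is what the `-onnx-define-symbol` and `-onnx-define-symbol-batch-size` options in the next cell are for.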



In [7]:
!python3 /opt/qti-aic/scripts/qaic-model-configurator/model_configurator.py bert_base_cased_onnx/model_fix_outofrange_fp16.onnx onnx \
-onnx-define-symbol=sequence_length,128 -onnx-define-symbol-batch-size=batch_size -multicast-weights -convert-to-fp16 \
-optimized-config-search bert_base_dopt.json -max-compilation-threads 8 -time 5 \
-device-id 15

2023-09-18 18:36:35.966 - [INFO]: Starting /opt/qti-aic/scripts/qaic-model-configurator/model_configurator.py bert_base_cased_onnx/model_fix_outofrange_fp16.onnx onnx -onnx-define-symbol=sequence_length,128 -onnx-define-symbol-batch-size=batch_size -multicast-weights -convert-to-fp16 -optimized-config-search bert_base_dopt.json -max-compilation-threads 8 -time 5 -device-id 15
2023-09-18 18:36:35.966 - [INFO]: Model Name: model_fix_outofrange_fp16.onnx
2023-09-18 18:36:35.972 - [INFO]: Hostname: smr18c11-01-06, Physical Cores: 128, Logical Cores: 256, Memory: 503.8 GB
2023-09-18 18:36:37.131 - [INFO]: Running optimized search
2023-09-18 18:36:37.169 - [INFO]: Running optimization algorithm from initial value: (cores=1, mos=[1], ols=1, batchSize=1, instances=14)
2023-09-18 18:36:37.169 - [INFO]: Evaluating the following SearchPoints: [(cores=1, mos=[1], ols=1, batchSize=1, instances=14)]
2023-09-18 18:36:37.170 - [INFO]: Compiling : compilerParams=(cores=1, mos=[1], ols=1, batchSize=1)
2

[2023-07-19 13:04:31.184] [info] Model at path model_configurator_output/compiled_models/qpc-cores-2-mos-1-ols-1-bs-1-output with runnerParams (instances=6) running on device ID 0
[2023-07-19 13:04:36.276] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-2-mos-2-ols-1-bs-1-output with runnerParams (instances=7)
[2023-07-19 13:04:36.703] [info] Model at path model_configurator_output/compiled_models/qpc-cores-2-mos-2-ols-1-bs-1-output with runnerParams (instances=7) running on device ID 0
[2023-07-19 13:04:41.810] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-2-mos-1-ols-2-bs-1-output with runnerParams (instances=7)
[2023-07-19 13:04:42.245] [info] Model at path model_configurator_output/compiled_models/qpc-cores-2-mos-1-ols-2-bs-1-output with runnerParams (instances=7) running on device ID 0
[2023-07-19 13:04:47.350] [info] Running model at path model_configurator_output/c

[2023-07-19 13:07:08.306] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-1-mos-1-ols-2-bs-8-output with runnerParams (instances=7)
[2023-07-19 13:07:08.713] [info] Model at path model_configurator_output/compiled_models/qpc-cores-1-mos-1-ols-2-bs-8-output with runnerParams (instances=7) running on device ID 0
[2023-07-19 13:07:14.320] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-1-mos-1-ols-1-bs-16-output with runnerParams (instances=7)
[2023-07-19 13:07:14.736] [info] Model at path model_configurator_output/compiled_models/qpc-cores-1-mos-1-ols-1-bs-16-output with runnerParams (instances=7) running on device ID 0
[2023-07-19 13:07:20.344] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-1-mos-1-ols-1-bs-8-output with runnerParams (instances=8)
[2023-07-19 13:07:20.796] [info] Model at path model_configurator_output/compiled_models/qpc-co

[2023-07-19 13:09:58.374] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-2-mos-1-ols-2-bs-8-output with runnerParams (instances=7)
[2023-07-19 13:09:58.813] [info] Model at path model_configurator_output/compiled_models/qpc-cores-2-mos-1-ols-2-bs-8-output with runnerParams (instances=7) running on device ID 0
[2023-07-19 13:10:04.418] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-2-mos-1-ols-1-bs-16-output with runnerParams (instances=7)
[2023-07-19 13:10:04.858] [info] Model at path model_configurator_output/compiled_models/qpc-cores-2-mos-1-ols-1-bs-16-output with runnerParams (instances=7) running on device ID 0
[2023-07-19 13:10:09.992] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-2-mos-1-ols-1-bs-4-output with runnerParams (instances=7)
[2023-07-19 13:10:10.429] [info] Model at path model_configurator_output/compiled_models/qpc-co

[2023-07-19 13:13:40.196] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-2-mos-1-ols-3-bs-8-output with runnerParams (instances=6)
[2023-07-19 13:13:40.604] [info] Model at path model_configurator_output/compiled_models/qpc-cores-2-mos-1-ols-3-bs-8-output with runnerParams (instances=6) running on device ID 0
[2023-07-19 13:13:45.734] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-2-mos-2-ols-3-bs-8-output with runnerParams (instances=7)
[2023-07-19 13:13:46.180] [info] Model at path model_configurator_output/compiled_models/qpc-cores-2-mos-2-ols-3-bs-8-output with runnerParams (instances=7) running on device ID 0
[2023-07-19 13:13:51.326] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-2-mos-1-ols-4-bs-8-output with runnerParams (instances=7)
[2023-07-19 13:13:51.760] [info] Model at path model_configurator_output/compiled_models/qpc-core

[2023-07-19 13:16:12.937] [info] Model at path model_configurator_output/compiled_models/qpc-cores-2-mos-1-ols-8-bs-4-output with runnerParams (instances=6) running on device ID 0
[2023-07-19 13:16:18.064] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-2-mos-1-ols-8-bs-4-output with runnerParams (instances=6)
[2023-07-19 13:16:18.485] [info] Model at path model_configurator_output/compiled_models/qpc-cores-2-mos-1-ols-8-bs-4-output with runnerParams (instances=6) running on device ID 0
[2023-07-19 13:16:23.600] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-2-mos-1-ols-8-bs-4-output with runnerParams (instances=6)
[2023-07-19 13:16:23.997] [info] Model at path model_configurator_output/compiled_models/qpc-cores-2-mos-1-ols-8-bs-4-output with runnerParams (instances=6) running on device ID 0
[2023-07-19 13:16:29.117] [info] Running model at path model_configurator_output/c

[2023-07-19 13:19:04.890] [info] Model at path model_configurator_output/compiled_models/qpc-cores-2-mos-2-ols-4-bs-8-output with runnerParams (instances=7) running on device ID 0
[2023-07-19 13:19:10.021] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-2-mos-1-ols-4-bs-16-output with runnerParams (instances=7)
[2023-07-19 13:19:10.471] [info] Model at path model_configurator_output/compiled_models/qpc-cores-2-mos-1-ols-4-bs-16-output with runnerParams (instances=7) running on device ID 0
[2023-07-19 13:19:15.633] [info] Compiling model with compiler parameters: [(cores=2, mos=[1], ols=8, batchSize=8)]
[2023-07-19 13:19:43.102] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-2-mos-1-ols-8-bs-8-output with runnerParams (instances=7)
[2023-07-19 13:19:43.619] [info] Model at path model_configurator_output/compiled_models/qpc-cores-2-mos-1-ols-8-bs-8-output with runnerParams (

[2023-07-19 13:23:16.726] [info] Model at path model_configurator_output/compiled_models/qpc-cores-3-mos-1-ols-1-bs-2-output with runnerParams (instances=2) running on device ID 0
[2023-07-19 13:23:21.774] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-3-mos-1-ols-1-bs-2-output with runnerParams (instances=2)
[2023-07-19 13:23:22.056] [info] Model at path model_configurator_output/compiled_models/qpc-cores-3-mos-1-ols-1-bs-2-output with runnerParams (instances=2) running on device ID 0
[2023-07-19 13:23:27.107] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-3-mos-1-ols-1-bs-2-output with runnerParams (instances=2)
[2023-07-19 13:23:27.384] [info] Model at path model_configurator_output/compiled_models/qpc-cores-3-mos-1-ols-1-bs-2-output with runnerParams (instances=2) running on device ID 0
[2023-07-19 13:23:32.434] [info] Running model at path model_configurator_output/c

[2023-07-19 13:25:51.159] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-3-mos-1-ols-1-bs-2-output with runnerParams (instances=3)
[2023-07-19 13:25:51.472] [info] Model at path model_configurator_output/compiled_models/qpc-cores-3-mos-1-ols-1-bs-2-output with runnerParams (instances=3) running on device ID 0
[2023-07-19 13:25:56.551] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-2-mos-1-ols-1-bs-16-output with runnerParams (instances=4)
[2023-07-19 13:25:56.882] [info] Model at path model_configurator_output/compiled_models/qpc-cores-2-mos-1-ols-1-bs-16-output with runnerParams (instances=4) running on device ID 0
[2023-07-19 13:26:02.466] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-2-mos-1-ols-1-bs-16-output with runnerParams (instances=4)
[2023-07-19 13:26:02.790] [info] Model at path model_configurator_output/compiled_models/qpc-c

[2023-07-19 13:28:42.489] [info] Model at path model_configurator_output/compiled_models/qpc-cores-2-mos-1-ols-2-bs-16-output with runnerParams (instances=4) running on device ID 0
[2023-07-19 13:28:48.064] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-2-mos-1-ols-1-bs-16-output with runnerParams (instances=5)
[2023-07-19 13:28:48.431] [info] Model at path model_configurator_output/compiled_models/qpc-cores-2-mos-1-ols-1-bs-16-output with runnerParams (instances=5) running on device ID 0
[2023-07-19 13:28:54.019] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-2-mos-1-ols-1-bs-8-output with runnerParams (instances=4)
[2023-07-19 13:28:54.341] [info] Model at path model_configurator_output/compiled_models/qpc-cores-2-mos-1-ols-1-bs-8-output with runnerParams (instances=4) running on device ID 0
[2023-07-19 13:28:59.911] [info] Running model at path model_configurator_outpu

[2023-07-19 13:32:28.912] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-4-mos-1-ols-1-bs-8-output with runnerParams (instances=3)
[2023-07-19 13:32:29.248] [info] Model at path model_configurator_output/compiled_models/qpc-cores-4-mos-1-ols-1-bs-8-output with runnerParams (instances=3) running on device ID 0
[2023-07-19 13:32:34.329] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-4-mos-1-ols-1-bs-16-output with runnerParams (instances=2)
[2023-07-19 13:32:34.604] [info] Model at path model_configurator_output/compiled_models/qpc-cores-4-mos-1-ols-1-bs-16-output with runnerParams (instances=2) running on device ID 0
[2023-07-19 13:32:39.656] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-4-mos-2-ols-1-bs-16-output with runnerParams (instances=3)
[2023-07-19 13:32:39.974] [info] Model at path model_configurator_output/compiled_models/qpc-c

[2023-07-19 13:35:29.228] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-4-mos-1-ols-3-bs-16-output with runnerParams (instances=3)
[2023-07-19 13:35:29.636] [info] Model at path model_configurator_output/compiled_models/qpc-cores-4-mos-1-ols-3-bs-16-output with runnerParams (instances=3) running on device ID 0
[2023-07-19 13:35:34.710] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-4-mos-1-ols-3-bs-16-output with runnerParams (instances=3)
[2023-07-19 13:35:35.125] [info] Model at path model_configurator_output/compiled_models/qpc-cores-4-mos-1-ols-3-bs-16-output with runnerParams (instances=3) running on device ID 0
[2023-07-19 13:35:40.206] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-4-mos-1-ols-3-bs-16-output with runnerParams (instances=3)
[2023-07-19 13:35:40.607] [info] Model at path model_configurator_output/compiled_models/qpc

[2023-07-19 13:39:14.759] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-4-mos-1-ols-4-bs-16-output with runnerParams (instances=3)
[2023-07-19 13:39:15.196] [info] Model at path model_configurator_output/compiled_models/qpc-cores-4-mos-1-ols-4-bs-16-output with runnerParams (instances=3) running on device ID 0
[2023-07-19 13:39:20.274] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-3-mos-1-ols-3-bs-16-output with runnerParams (instances=3)
[2023-07-19 13:39:20.637] [info] Model at path model_configurator_output/compiled_models/qpc-cores-3-mos-1-ols-3-bs-16-output with runnerParams (instances=3) running on device ID 0
[2023-07-19 13:39:26.207] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-4-mos-1-ols-3-bs-8-output with runnerParams (instances=3)
[2023-07-19 13:39:26.546] [info] Model at path model_configurator_output/compiled_models/qpc-

[2023-07-19 13:43:05.987] [info] Model at path model_configurator_output/compiled_models/qpc-cores-7-mos-1-ols-1-bs-2-output with runnerParams (instances=2) running on device ID 0
[2023-07-19 13:43:11.045] [info] Compiling model with compiler parameters: [(cores=5, mos=[1], ols=1, batchSize=2)]
[2023-07-19 13:43:35.904] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-5-mos-1-ols-1-bs-2-output with runnerParams (instances=1)
[2023-07-19 13:43:36.180] [info] Model at path model_configurator_output/compiled_models/qpc-cores-5-mos-1-ols-1-bs-2-output with runnerParams (instances=1) running on device ID 0
[2023-07-19 13:43:41.222] [info] Compiling model with compiler parameters: [(cores=5, mos=[1], ols=1, batchSize=1), (cores=6, mos=[1], ols=1, batchSize=2), (cores=6, mos=[1], ols=2, batchSize=1), (cores=6, mos=[2], ols=1, batchSize=1)]
[2023-07-19 13:44:30.138] [info] Running model at path model_configurator_o

[2023-07-19 13:47:07.041] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-7-mos-1-ols-2-bs-2-output with runnerParams (instances=2)
[2023-07-19 13:47:07.466] [info] Model at path model_configurator_output/compiled_models/qpc-cores-7-mos-1-ols-2-bs-2-output with runnerParams (instances=2) running on device ID 0
[2023-07-19 13:47:12.520] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-7-mos-1-ols-1-bs-4-output with runnerParams (instances=2)
[2023-07-19 13:47:12.932] [info] Model at path model_configurator_output/compiled_models/qpc-cores-7-mos-1-ols-1-bs-4-output with runnerParams (instances=2) running on device ID 0
[2023-07-19 13:47:17.984] [info] Compiling model with compiler parameters: [(cores=7, mos=[1], ols=1, batchSize=8)]
[2023-07-19 13:47:46.555] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-7-mos-1-ols-1-bs-8-output with runnerPa

[2023-07-19 13:50:25.548] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-7-mos-1-ols-1-bs-4-output with runnerParams (instances=1)
[2023-07-19 13:50:25.846] [info] Model at path model_configurator_output/compiled_models/qpc-cores-7-mos-1-ols-1-bs-4-output with runnerParams (instances=1) running on device ID 0
[2023-07-19 13:50:30.890] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-7-mos-2-ols-1-bs-4-output with runnerParams (instances=2)
[2023-07-19 13:50:31.353] [info] Model at path model_configurator_output/compiled_models/qpc-cores-7-mos-2-ols-1-bs-4-output with runnerParams (instances=2) running on device ID 0
[2023-07-19 13:50:36.412] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-7-mos-1-ols-2-bs-4-output with runnerParams (instances=2)
[2023-07-19 13:50:36.876] [info] Model at path model_configurator_output/compiled_models/qpc-core

[2023-07-19 13:54:11.190] [info] Model at path model_configurator_output/compiled_models/qpc-cores-6-mos-1-ols-2-bs-16-output with runnerParams (instances=1) running on device ID 0
[2023-07-19 13:54:16.230] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-6-mos-1-ols-2-bs-16-output with runnerParams (instances=1)
[2023-07-19 13:54:16.528] [info] Model at path model_configurator_output/compiled_models/qpc-cores-6-mos-1-ols-2-bs-16-output with runnerParams (instances=1) running on device ID 0
[2023-07-19 13:54:21.574] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-6-mos-1-ols-2-bs-16-output with runnerParams (instances=1)
[2023-07-19 13:54:21.872] [info] Model at path model_configurator_output/compiled_models/qpc-cores-6-mos-1-ols-2-bs-16-output with runnerParams (instances=1) running on device ID 0
[2023-07-19 13:54:26.914] [info] Compiling model with compiler parameters: [(

[2023-07-19 13:57:40.591] [info] Model at path model_configurator_output/compiled_models/qpc-cores-7-mos-2-ols-2-bs-16-output with runnerParams (instances=2) running on device ID 0
[2023-07-19 13:57:46.152] [info] Compiling model with compiler parameters: [(cores=6, mos=[1], ols=8, batchSize=16)]
[2023-07-19 13:58:46.470] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-6-mos-1-ols-8-bs-16-output with runnerParams (instances=1)
[2023-07-19 13:58:46.964] [info] Model at path model_configurator_output/compiled_models/qpc-cores-6-mos-1-ols-8-bs-16-output with runnerParams (instances=1) running on device ID 0
[2023-07-19 13:58:52.017] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-6-mos-1-ols-8-bs-16-output with runnerParams (instances=1)
[2023-07-19 13:58:52.540] [info] Model at path model_configurator_output/compiled_models/qpc-cores-6-mos-1-ols-8-bs-16-output with runnerPara

[2023-07-19 14:03:03.186] [info] Model at path model_configurator_output/compiled_models/qpc-cores-6-mos-1-ols-3-bs-16-output with runnerParams (instances=2) running on device ID 0
[2023-07-19 14:03:08.250] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-6-mos-1-ols-2-bs-8-output with runnerParams (instances=2)
[2023-07-19 14:03:08.632] [info] Model at path model_configurator_output/compiled_models/qpc-cores-6-mos-1-ols-2-bs-8-output with runnerParams (instances=2) running on device ID 0
[2023-07-19 14:03:13.683] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-6-mos-2-ols-2-bs-16-output with runnerParams (instances=2)
[2023-07-19 14:03:13.986] [info] Model at path model_configurator_output/compiled_models/qpc-cores-6-mos-2-ols-2-bs-16-output with runnerParams (instances=2) running on device ID 0
[2023-07-19 14:03:19.538] [info] Running model at path model_configurator_outpu

[2023-07-19 14:07:28.364] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-14-mos-1-ols-2-bs-1-output with runnerParams (instances=1)
[2023-07-19 14:07:28.694] [info] Model at path model_configurator_output/compiled_models/qpc-cores-14-mos-1-ols-2-bs-1-output with runnerParams (instances=1) running on device ID 0
[2023-07-19 14:07:33.735] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-13-mos-1-ols-1-bs-1-output with runnerParams (instances=1)
[2023-07-19 14:07:33.995] [info] Model at path model_configurator_output/compiled_models/qpc-cores-13-mos-1-ols-1-bs-1-output with runnerParams (instances=1) running on device ID 0
[2023-07-19 14:07:39.042] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-14-mos-2-ols-1-bs-1-output with runnerParams (instances=1)
[2023-07-19 14:07:39.300] [info] Model at path model_configurator_output/compiled_models/qpc

[2023-07-19 14:12:24.453] [info] Model at path model_configurator_output/compiled_models/qpc-cores-14-mos-1-ols-1-bs-16-output with runnerParams (instances=1) running on device ID 0
[2023-07-19 14:12:29.495] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-13-mos-2-ols-1-bs-16-output with runnerParams (instances=1)
[2023-07-19 14:12:29.904] [info] Model at path model_configurator_output/compiled_models/qpc-cores-13-mos-2-ols-1-bs-16-output with runnerParams (instances=1) running on device ID 0
[2023-07-19 14:12:34.955] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-13-mos-1-ols-2-bs-16-output with runnerParams (instances=1)
[2023-07-19 14:12:35.323] [info] Model at path model_configurator_output/compiled_models/qpc-cores-13-mos-1-ols-2-bs-16-output with runnerParams (instances=1) running on device ID 0
[2023-07-19 14:12:40.367] [info] Running model at path model_configurato

[2023-07-19 14:16:40.129] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-12-mos-2-ols-1-bs-16-output with runnerParams (instances=1)
[2023-07-19 14:16:40.394] [info] Model at path model_configurator_output/compiled_models/qpc-cores-12-mos-2-ols-1-bs-16-output with runnerParams (instances=1) running on device ID 0
[2023-07-19 14:16:45.437] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-12-mos-1-ols-2-bs-16-output with runnerParams (instances=1)
[2023-07-19 14:16:45.790] [info] Model at path model_configurator_output/compiled_models/qpc-cores-12-mos-1-ols-2-bs-16-output with runnerParams (instances=1) running on device ID 0
[2023-07-19 14:16:50.837] [info] Running model at path model_configurator_output/compiled_models/qpc-cores-11-mos-1-ols-1-bs-16-output with runnerParams (instances=1)
[2023-07-19 14:16:51.151] [info] Model at path model_configurator_output/compiled_model

## Measure end-to-end latency 

The steps below measure the end-to-end latency, and its breakdown, for the best configuration identified in the previous step.

1. Compile the model using 'qaic-exec' with the configuration parameters identified by the model configurator in the previous step.
2. Execute the compiled model using 'qaic-runner' with the runtime parameters identified by the model configurator, adding the flags that dump latency information.
3. Post-process the latency information to obtain the latency distribution across inferences (mean, median, 95th and 99th percentiles).
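
As an illustration of what step 3 computes, here is a minimal sketch of a percentile summary over a list of latency samples. It is not the `latency_stats_python3.py` script used later in this notebook and it does not parse the profiling files; the helper names `percentile` and `summarize` and the sample values are hypothetical.

```python
import statistics


def percentile(samples, q):
    """Linearly interpolated percentile (q in [0, 100]) of a non-empty list."""
    ordered = sorted(samples)
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def summarize(samples):
    """Return the mean, median, 95th and 99th percentile of the samples."""
    return {
        "mean": statistics.fmean(samples),
        "50%": percentile(samples, 50),
        "95%": percentile(samples, 95),
        "99%": percentile(samples, 99),
    }


# Hypothetical per-inference latencies in ms (illustrative values only,
# not taken from a real profiling run).
latencies_ms = [2.18, 2.17, 2.19, 2.18, 2.24, 2.18, 2.17, 2.20, 2.18, 2.28]

summary = summarize(latencies_ms)
for name, value in summary.items():
    print(f"{name:>4}: {value:.3f}")
```

The high percentiles (95% and 99%) matter because a few slow inferences can dominate user-visible latency even when the mean looks good.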

In [8]:
## Compile the model 

!rm -rf compiled_fp16
!/opt/qti-aic/exec/qaic-exec -v -aic-hw  -convert-to-fp16 \
-mos=1 -ols=3 -aic-num-cores=2 \
-m=bert_base_cased_onnx/model_fix_outofrange_fp16.onnx \
-onnx-define-symbol=sequence_length,128 -onnx-define-symbol=batch_size,8 \
-multicast-weights -stats-batchsize=8 -aic-binary-dir=./compiled_fp16 \
-aic-hw-version=2.0 -compile-only

Reading ONNX Model from bert_base_cased_onnx/model_fix_outofrange_fp16.onnx
Compile started ............... 
Compiling model with FP16 precision.
Generated binary is present at ./compiled_fp16


In [9]:
## Execute the compiled model with the latency flags 
!mkdir -p bert_base_cased_stats
!/opt/qti-aic/exec/qaic-runner --test-data ./compiled_fp16  -d 15 -S 10 -a 7 \
--aic-profiling-type latency --aic-profiling-out-dir ./bert_base_cased_stats \
--aic-profiling-start-iter 100 --aic-profiling-num-samples 99999 --time 20 

Using bounded random inputs
 ---- Stats ----
InferenceCnt 6517 TotalDuration 20215390us BatchSize 8 Inf/Sec 2579.025
Writing file:./bert_base_cased_stats/aic-profiling-program-0-latency.txt
Writing file:./bert_base_cased_stats/aic-profiling-program-1-latency.txt
Writing file:./bert_base_cased_stats/aic-profiling-program-2-latency.txt
Writing file:./bert_base_cased_stats/aic-profiling-program-3-latency.txt
Writing file:./bert_base_cased_stats/aic-profiling-program-4-latency.txt
Writing file:./bert_base_cased_stats/aic-profiling-program-5-latency.txt
Writing file:./bert_base_cased_stats/aic-profiling-program-6-latency.txt


### Latency breakdown 

The end-to-end inference latency can be broken down into 4 major categories: Application, Linux Runtime (LRT) processing, Kernel Mode Driver (KMD) processing and Cloud AI device processing.


![Latency Breakdown](Images/Latency.jpg)

**Key Latency Stats** (units in us)
- *totalRoundtripTime*: Time from the point where the application (qaic-runner in this case) calls the Runtime API until post-processing is complete and control returns to the application, indicating that the inference is complete
- *preProcTime*: Time taken to pre-process the data (model input) on the host
- *postProc*: Time taken to post-process the data (model output) on the host
- *execTotal*: Time from the inference object being submitted to the kernel, through completion on the hardware, until processing is returned to user space


In [10]:
## Post process the latency information to identify percentile distribution

!python3 latency_stats_python3.py ./bert_base_cased_stats/aic-profiling-program-0-latency.txt 

All activations combined:
                          mean         min         50%         75%         90%         95%         99%         max
hostRoundTrip       216.995348  214.439000  216.945500  217.741000  218.385100  218.728200  219.563000  220.446000
enqTime               0.002045    0.001090    0.001855    0.001960    0.002091    0.002271    0.007768    0.026280
preProcTime           0.006399    0.005860    0.006330    0.006540    0.006831    0.006995    0.007402    0.016070
submitTime            0.001481    0.001180    0.001480    0.001500    0.001592    0.001635    0.001697    0.001810
execTotal           217.049413  214.484501  216.996301  217.793079  218.440483  218.780040  219.614072  220.500821
exectoVc              0.005437    0.005000    0.005000    0.006000    0.006000    0.006000    0.006000    0.007000
execToComplete      216.989911  214.433000  216.940500  217.735000  218.380100  218.722750  219.557710  220.440000
postProc              0.003199    0.002800    0.003170

For the best throughput configuration, the measured throughput is ~2579 inf/sec with a mean latency (hostRoundTrip) of ~217 ms per batched (bs=8) inference.
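
These two numbers can be cross-checked against each other. The sketch below applies Little's law under the assumption that every queue slot stays busy for the whole run, so the number of inferences in flight is instances x set size x batch size (`-a 7`, `-S 10` and batch size 8 in the run above). The function name `estimated_throughput` is illustrative, and the result is an approximation rather than a figure reported by the tools.

```python
def estimated_throughput(instances, set_size, batch_size, latency_ms):
    """Approximate inf/sec, assuming all queue slots stay full (Little's law)."""
    in_flight = instances * set_size * batch_size
    return in_flight / (latency_ms / 1000.0)


# Values from the run above: -a 7, -S 10, batch size 8, ~217 ms round trip.
est = estimated_throughput(instances=7, set_size=10, batch_size=8, latency_ms=217.0)
print(f"Estimated throughput: {est:.0f} inf/sec")
```

The estimate lands close to the measured ~2579 inf/sec, which suggests that the ~217 ms figure includes the time a request spends queued behind the other requests in its set, not only device execution time.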

# <span style='color:Blue'> 5. Identify the least latency configuration </span>

Identifying the least latency configuration requires running the model_configurator tool with the "objective" parameter set to "minimize_latency". 

Minimum latency is achieved when the batch size, instances and set size are all set to 1. 
We then iterate over the number of cores used to compile the network to identify the least latency configuration. 

The key difference between the dopt.json files used for best throughput and for least latency is in the initial values. For least latency, the number of instances is always set to 1. For best throughput, the initial value of instances = floor((total number of cores in the device) / (number of cores used to compile the model)). 
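
The initial value of instances for the best throughput search can be sketched as follows. This is only an illustration of the floor formula, not code from the model configurator; the function name `initial_instances` is hypothetical, and the total of 14 cores is the device core count used elsewhere in this notebook.

```python
def initial_instances(total_cores, cores_per_model):
    """floor(total cores on the device / cores used to compile the model)."""
    if not 1 <= cores_per_model <= total_cores:
        raise ValueError("cores_per_model must be between 1 and total_cores")
    return total_cores // cores_per_model


TOTAL_CORES = 14  # assumption: a device exposing 14 AI compute cores

for cores in (1, 2, 4, 7, 14):
    print(f"cores={cores:2d} -> initial instances={initial_instances(TOTAL_CORES, cores)}")
```

For example, a model compiled for 2 cores starts the search at 7 instances, while a model compiled for 7 cores starts at 2.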

Let's go through these steps for the bert-base-cased model.

In [12]:
!python3 /opt/qti-aic/scripts/qaic-model-configurator/model_configurator.py bert_base_cased_onnx/model_fix_outofrange_fp16.onnx onnx \
-onnx-define-symbol=sequence_length,128 -onnx-define-symbol-batch-size=batch_size -multicast-weights -convert-to-fp16 \
-optimized-config-search bert_base_dopt_min_latency.json -max-compilation-threads 16 -time 5 \
-device-id 15 -set-size 1

2023-09-18 20:04:38.161 - [INFO]: Starting /opt/qti-aic/scripts/qaic-model-configurator/model_configurator.py bert_base_cased_onnx/model_fix_outofrange_fp16.onnx onnx -onnx-define-symbol=sequence_length,128 -onnx-define-symbol-batch-size=batch_size -multicast-weights -convert-to-fp16 -optimized-config-search bert_base_dopt_min_latency.json -max-compilation-threads 16 -time 5 -device-id 15 -set-size 1
2023-09-18 20:04:38.161 - [INFO]: Model Name: model_fix_outofrange_fp16.onnx
2023-09-18 20:04:38.166 - [INFO]: Hostname: smr18c11-01-06, Physical Cores: 128, Logical Cores: 256, Memory: 503.8 GB
2023-09-18 20:04:38.948 - [INFO]: Running optimized search
2023-09-18 20:04:38.992 - [INFO]: Running optimization algorithm from initial value: (cores=1, mos=[1], ols=1, batchSize=1, instances=1)
2023-09-18 20:04:38.992 - [INFO]: Evaluating the following SearchPoints: [(cores=1, mos=[1], ols=1, batchSize=1, instances=1)]
2023-09-18 20:04:38.992 - [INFO]: Compiling : compilerParams=(cores=1, mos=[1]

The configuration with 14 cores and batch size 1 is returned as the one that provides the least latency per inference. Developers could also use row 3, which uses 7 cores: on a device with 14 cores, users can then run 2 instances with a slight increase in latency. 

In [13]:
## aic-num-cores =7 and instances = 1
!rm -rf compiled_fp16
!rm -rf bert_base_cased_stats
!mkdir bert_base_cased_stats

!/opt/qti-aic/exec/qaic-exec -v -aic-hw  \
-m=bert_base_cased_onnx/model_fix_outofrange_fp16.onnx \
-onnx-define-symbol=sequence_length,128 -onnx-define-symbol=batch_size,1 \
-mos=1 -ols=1 -aic-num-cores=7 \
-stats-batchsize=1 -aic-binary-dir=./compiled_fp16 \
-multicast-weights -convert-to-fp16 \
-aic-hw-version=2.0 -compile-only
    

Reading ONNX Model from bert_base_cased_onnx/model_fix_outofrange_fp16.onnx
Compile started ............... 
Compiling model with FP16 precision.
Generated binary is present at ./compiled_fp16


In [14]:
# Running a single instance of the compiled binary
!/opt/qti-aic/exec/qaic-runner --test-data ./compiled_fp16 -d 15 -a 1 -S 1 \
--aic-profiling-type latency --aic-profiling-out-dir ./bert_base_cased_stats \
--aic-profiling-start-iter 100 --aic-profiling-num-samples 99999 --time 10 

Using bounded random inputs
 ---- Stats ----
InferenceCnt 4252 TotalDuration 10001975us BatchSize 1 Inf/Sec 425.116
Writing file:./bert_base_cased_stats/aic-profiling-program-0-latency.txt


In [15]:
# Measure latency for the single instance
!python3 latency_stats_python3.py ./bert_base_cased_stats/aic-profiling-program-0-latency.txt 

All activations combined:
                        mean       min       50%       75%       90%       95%       99%       max
hostRoundTrip       2.181382  2.166000  2.180000  2.183000  2.186000  2.188000  2.238000  2.275000
enqTime             0.001924  0.001190  0.001590  0.001730  0.001900  0.002194  0.011135  0.023400
preProcTime         0.004641  0.001040  0.004771  0.004860  0.004980  0.005030  0.005220  0.031140
submitTime          0.001523  0.000540  0.001490  0.001520  0.001640  0.001670  0.001710  0.181829
execTotal           2.231959  2.207576  2.230786  2.234186  2.237605  2.240640  2.288105  2.324786
exectoVc            0.006165  0.004000  0.006000  0.006000  0.007000  0.007000  0.007000  0.023000
execToComplete      2.175217  2.161000  2.174000  2.177000  2.180000  2.182000  2.232000  2.263000
postProc            0.002358  0.001891  0.002400  0.002510  0.002520  0.002574  0.002631  0.013170
totalRoundtripTime  2.284630  2.230046  2.283536  2.291049  2.297533  2.301629  2.3

In [16]:
# Running 2 instances of the compiled binary
!/opt/qti-aic/exec/qaic-runner --test-data ./compiled_fp16 -d 0 -a 2 -S 1 \
--aic-profiling-type latency --aic-profiling-out-dir ./bert_base_cased_stats \
--aic-profiling-start-iter 100 --aic-profiling-num-samples 99999 --time 10 

Using bounded random inputs
 ---- Stats ----
InferenceCnt 7000 TotalDuration 10001468us BatchSize 1 Inf/Sec 699.897
Deleting previous file: ./bert_base_cased_stats/aic-profiling-program-0-latency.txt
Writing file:./bert_base_cased_stats/aic-profiling-program-0-latency.txt
Writing file:./bert_base_cased_stats/aic-profiling-program-1-latency.txt


In [17]:
# Measure latency for 2 instances
!python3 latency_stats_python3.py ./bert_base_cased_stats/aic-profiling-program-0-latency.txt \
./bert_base_cased_stats/aic-profiling-program-1-latency.txt 

All activations combined:
                        mean       min       50%       75%       90%       95%       99%       max
hostRoundTrip       2.698562  2.353000  2.704000  2.801000  2.859000  2.884000  2.917020  3.078000
enqTime             0.001925  0.000460  0.001570  0.001760  0.001910  0.002160  0.011850  0.024550
preProcTime         0.004667  0.000700  0.004620  0.004870  0.005130  0.005300  0.005940  0.033830
submitTime          0.001429  0.000320  0.001450  0.001500  0.001630  0.001690  0.001850  0.002250
execTotal           2.744018  2.393854  2.749188  2.846948  2.904316  2.930054  2.966103  3.227031
exectoVc            0.006294  0.004000  0.006000  0.007000  0.007000  0.007000  0.008000  0.028000
execToComplete      2.692269  2.347000  2.698000  2.794250  2.852000  2.877000  2.911010  3.071000
postProc            0.002415  0.001300  0.002370  0.002640  0.002940  0.002960  0.003080  0.011260
totalRoundtripTime  2.794029  2.443674  2.800323  2.897700  2.955614  2.982057  3.0

Running 2 instances increases throughput by about 65% (425 -> 700 inf/s), while the mean totalRoundtripTime rises by about 22% (~2.28 ms to ~2.79 ms). Whether that latency increase is acceptable depends on the latency budget of the application.
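
The trade-off between the two runs can be quantified directly from the reported figures. This is a minimal sketch using the Inf/Sec values and the mean totalRoundtripTime from the outputs above; the helper name `percent_change` is illustrative.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Inf/Sec reported by qaic-runner for 1 and 2 instances.
throughput_gain = percent_change(425.116, 699.897)
# Mean totalRoundtripTime (ms) for 1 and 2 instances.
latency_increase = percent_change(2.284630, 2.794029)

print(f"Throughput gain:  {throughput_gain:.1f}%")
print(f"Latency increase: {latency_increase:.1f}%")
```

Throughput does not double with the second instance, so the two instances are probably sharing some resource on the device or host; the profiling data shown here does not identify which one.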