Copyright (c) 2023 Qualcomm Innovation Center, Inc. All rights reserved. <br>
SPDX-License-Identifier: BSD-3-Clause-Clear

**Takeaways:** Users will learn how to onboard a BERT base neural network model on Cloud AI devices and run inference

**Before you start:** 
- Some commands in this notebook (folder locations, for example) must be updated to match your platform and installation location. 
- The terms 'model' and 'network' are used interchangeably in this notebook. 

**Last Verified Qualcomm Cloud AI Platform SDK and Apps SDK Version:** Platform SDK 1.10.0.193 and Apps SDK 1.10.0.193 

## Introduction 
This notebook is for beginners and walks through the full workflow, from onboarding a BERT-family question-answering model from Hugging Face (the `distilbert-base-cased-distilled-squad` checkpoint is used below) to running inference on Cloud AI devices. 

Here is the Cloud AI inference workflow at a high level. 


![Workflow](Images/Workflow.jpg)


We will follow this sequence of steps in the notebook. 

1. **Install required packages**: Install the required packages and import the necessary libraries.
2. **torch-cpu inference**: Import the model, generate an input and run the model on CPU.
3. **ONNX conversion**: Convert the PyTorch model to ONNX format. 
4. **Compilation**: Compile the model for Qualcomm Cloud AI 100.
5. **Creating a Session and setting up inputs**: Create a `qaic` session and prepare the model for the `qaic` runtime.
6. **Inference on Cloud AI using Python APIs**: Run inference using the `qaic` API and decode the output.
7. **Inference on Cloud AI using qaic-runner CLI**: Run inference with qaic-runner CLI and decode the output.

# 1. Install required packages 

We will install the required packages. 

In [1]:
# Let's make sure the Python interpreter path is set properly.
import sys
sys.executable

'/opt/qti-aic/dev/python/qaic-env/bin/python'

In [2]:
!pip3 install -r requirements.txt

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Processing /opt/qti-aic/dev/lib/x86_64/qaic-0.0.1-py3-none-any.whl (from -r requirements.txt (line 8))
qaic is already installed with the same version as the provided wheel. Use --force-reinstall to force an installation of the wheel.
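
After the install step, it can help to confirm that the packages this notebook imports are visible to the interpreter. The following is a minimal sketch using only the standard library; `missing_packages` is a hypothetical helper name, and the package list simply mirrors the imports used in this notebook.

```python
from importlib.util import find_spec
from typing import Iterable, List


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the top-level package names that the import system cannot find."""
    return [name for name in names if find_spec(name) is None]


# Top-level packages imported later in this notebook.
required = ["transformers", "torch", "onnx", "onnxsim", "numpy"]
missing = missing_packages(required)
if missing:
    print(f"Missing packages: {', '.join(missing)}")
else:
    print("All required packages are importable")
```

`qaic` is left out of the list because the notebook adds its location to `sys.path` before importing it.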


## Import the necessary libraries.

We will import the pre-trained model from the Hugging Face ```transformers``` library. 


In [3]:
import transformers
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import sys
sys.path.append("/opt/qti-aic/examples/apps/qaic-python-sdk")
import qaic
import os
import torch
import onnx
from onnxsim import simplify
import argparse
import numpy as np
from typing import Dict, List, Optional, Tuple
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# 2. Inference using torch-cpu

### Choose a model from ```transformers``` library 
You can use any pretrained question-answering model from the library; if you do, create a matching ```<model_name>-config.yaml``` file containing the compilation and execution options. 

In [4]:
model_card = 'distilbert-base-cased-distilled-squad' # Provide a model name supported in transformers library.

In [5]:
# Delete pre-generated model
os.system(f'rm -fr {model_card}')

0

In [6]:
# Import the pre-trained model
model = AutoModelForQuestionAnswering.from_pretrained(model_card)

# setup the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_card)

In [7]:
# Sentence example
question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"

inputs = tokenizer(question, text, return_tensors="pt", max_length=128, truncation=True, padding="max_length")
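
The tokenizer call above truncates or pads every input to 128 tokens and returns a matching attention mask. The sketch below illustrates that padding logic in plain Python. It is not the tokenizer's implementation: `pad_to_length` is a hypothetical helper, the token IDs are made-up values, and `pad_id=0` is an assumption about the vocabulary. A real tokenizer also preserves special tokens when truncating, which this sketch does not.

```python
from typing import List, Tuple


def pad_to_length(token_ids: List[int], length: int, pad_id: int = 0) -> Tuple[List[int], List[int]]:
    """Truncate or right-pad token IDs to `length` and build the matching attention mask."""
    kept = token_ids[:length]                        # truncation
    n_pad = length - len(kept)
    input_ids = kept + [pad_id] * n_pad              # padding
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Illustrative token IDs (assumed values, not produced by the real tokenizer).
ids, mask = pad_to_length([101, 2627, 1108, 102], length=8)
print(ids)   # [101, 2627, 1108, 102, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```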

In [8]:
# Run on CPU
with torch.no_grad():
    outputs = model(**inputs)


In [9]:
# process the output
start_token_index = outputs.start_logits.argmax()
end_token_index = outputs.end_logits.argmax()
predict_answer_tokens = inputs.input_ids[0, start_token_index : end_token_index + 1]
print(f'Answer : {tokenizer.decode(predict_answer_tokens)}')

2023-11-22 14:41:49.693161: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2023-11-22 14:41:49.693185: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


Answer : a nice puppet
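
Taking the argmax of the start and end logits independently works for this example, but it can yield an end index that precedes the start index, which decodes to an empty answer. A common alternative, sketched here in plain Python, is to pick the pair with the highest combined score subject to `start <= end`. `best_span` and its `max_len` limit are hypothetical names introduced for illustration, and the logits are made-up values.

```python
from typing import List, Tuple


def best_span(start_logits: List[float], end_logits: List[float], max_len: int = 30) -> Tuple[int, int]:
    """Return (start, end) maximizing start_logits[start] + end_logits[end] with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_logits):
        # Only consider end positions at or after the start, within max_len tokens.
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


# Independent argmax would give start=3 and end=1 here, which is an empty slice.
start = [0.1, 0.2, 0.3, 5.0]
end = [0.0, 4.0, 0.1, 3.5]
print(best_span(start, end))  # (3, 3)
```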


# 3. ONNX conversion

In [10]:
# Setup dir for saving onnx and qpc files
gen_models_path = f"{model_card}/generatedModels"
os.makedirs(gen_models_path, exist_ok=True)
model_base_name = model_card


In [11]:
# Set dynamic dims and axes.
dynamic_dims = {0: 'batch', 1 : 'sequence'}
dynamic_axes = {
    "input_ids" : dynamic_dims,
    "attention_mask" : dynamic_dims,
    # Keys must match the names in input_names/output_names below.
    "start_logits" : dynamic_dims,
    "end_logits" : dynamic_dims
}
input_names = ["input_ids", "attention_mask"]
inputList = [inputs.input_ids, inputs.attention_mask]

model.eval() # Put the model in inference mode.

torch.onnx.export(
    model,
    args=tuple(inputList),
    f=f"{gen_models_path}/{model_base_name}.onnx",
    verbose=False,
    input_names=input_names,
    output_names=['start_logits', 'end_logits'],
    dynamic_axes=dynamic_axes,
    opset_version=11,
)
print("INFO: ONNX Model is being generated successfully")

  mask, torch.tensor(torch.finfo(scores.dtype).min)


INFO: ONNX Model is being generated successfully
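
The keys of `dynamic_axes` are matched against the input and output names passed to the exporter; to my understanding, a key that matches neither does not make any axis dynamic. A small pre-check, sketched below, can catch such a mismatch before exporting. `unknown_axis_names` is a hypothetical helper, not part of `torch.onnx`.

```python
from typing import Dict, List


def unknown_axis_names(dynamic_axes: Dict[str, dict], input_names: List[str],
                       output_names: List[str]) -> List[str]:
    """Return dynamic_axes keys that match neither an input name nor an output name."""
    known = set(input_names) | set(output_names)
    return [name for name in dynamic_axes if name not in known]


dims = {0: "batch", 1: "sequence"}
axes = {"input_ids": dims, "attention_mask": dims, "start_logits": dims, "end_logits": dims}
bad = unknown_axis_names(axes, ["input_ids", "attention_mask"], ["start_logits", "end_logits"])
print(bad)  # []
```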


#### Modification 
Modify the ONNX file to handle constants that fall outside the FP16 range (```> FP16_Max``` or ```< FP16_Min```). 
```fix_onnx_fp16``` is a helper function for this purpose. <br> In the exported model, -inf is represented by the minimum FP32 value, which cannot be represented in FP16. The helper function clips such constants to the FP16 range. 

In [12]:
from onnx import numpy_helper
        
def fix_onnx_fp16(
    gen_models_path: str,
    model_base_name: str,
) -> str:
    finfo = np.finfo(np.float16)
    fp16_max = finfo.max
    fp16_min = finfo.min
    model = onnx.load(f"{gen_models_path}/{model_base_name}.onnx")
    fp16_fix = False
    for tensor in onnx.external_data_helper._get_all_tensors(model):
        nptensor = numpy_helper.to_array(tensor, gen_models_path)
        if nptensor.dtype == np.float32 and (
            np.any(nptensor > fp16_max) or np.any(nptensor < fp16_min)
        ):
            # print(f'tensor value : {nptensor} above {fp16_max} or below {fp16_min}')
            nptensor = np.clip(nptensor, fp16_min, fp16_max)
            new_tensor = numpy_helper.from_array(nptensor, tensor.name)
            tensor.CopyFrom(new_tensor)
            fp16_fix = True
            
    if fp16_fix:
        # Save FP16 model
        print("Found constants out of FP16 range, clipped to FP16 range")
        model_base_name += "_fix_outofrange_fp16"
        onnx.save(model, f=f"{gen_models_path}/{model_base_name}.onnx")
        print(f"Saving modified onnx file at {gen_models_path}/{model_base_name}.onnx")
    return model_base_name

fp16_model_name = fix_onnx_fp16(gen_models_path=gen_models_path, model_base_name=model_base_name)


Found constants out of FP16 range, clipped to FP16 range
Saving modified onnx file at distilbert-base-cased-distilled-squad/generatedModels/distilbert-base-cased-distilled-squad_fix_outofrange_fp16.onnx
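
To see why the clipping is needed, the sketch below uses the standard library's half-precision `struct` format to show that the FP32 minimum overflows FP16, while the clipped value does not. It is an isolated illustration on plain floats, not a replacement for `fix_onnx_fp16`; `fits_fp16` and `clip_to_fp16` are hypothetical names.

```python
import struct
from typing import List

FP16_MAX = 65504.0         # largest finite float16 value
FP32_MIN = -3.4028235e38   # approximately the most negative finite float32 value


def fits_fp16(value: float) -> bool:
    """True if `value` can be packed as an IEEE 754 half-precision float without overflow."""
    try:
        struct.pack("<e", value)
        return True
    except OverflowError:
        return False


def clip_to_fp16(values: List[float]) -> List[float]:
    """Clamp each value into the finite float16 range [-65504, 65504]."""
    return [max(-FP16_MAX, min(FP16_MAX, v)) for v in values]


constants = [FP32_MIN, 0.0, 1.0]
print([fits_fp16(v) for v in constants])                   # [False, True, True]
print(clip_to_fp16(constants))                             # [-65504.0, 0.0, 1.0]
print(all(fits_fp16(v) for v in clip_to_fp16(constants)))  # True
```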


# 4. Compilation step

The `qaic-exec` CLI tool is used to compile the model for Qualcomm Cloud AI 100. Its input is the `onnx` file generated above. The tool produces a QPC (Qualcomm Program Container) binary file in the path given by the `-aic-binary-dir` argument. 

### Breakdown of key compile parameters.
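The onnx file is compiled 
- with 4 NSP cores (`-aic-num-cores=4`)
- with float 16 precision (`-convert-to-fp16`)
- with the ONNX symbols `batch` and `sequence` defined as 1 and 128 (`-onnx-define-symbol`)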


In [13]:
# COMPILE using qaic-exec
os.system(f'rm -fr {model_card}/generatedModels/{model_card}_fix_outofrange_fp16_qpc')

!/opt/qti-aic/exec/qaic-exec \
-m=distilbert-base-cased-distilled-squad/generatedModels/distilbert-base-cased-distilled-squad_fix_outofrange_fp16.onnx \
-aic-num-cores=4 \
-convert-to-fp16 \
-onnx-define-symbol=batch,1 -onnx-define-symbol=sequence,128 \
-aic-binary-dir=distilbert-base-cased-distilled-squad/generatedModels/distilbert-base-cased-distilled-squad_fix_outofrange_fp16_qpc \
-aic-hw -aic-hw-version=2.0 \
-compile-only


Reading ONNX Model from distilbert-base-cased-distilled-squad/generatedModels/distilbert-base-cased-distilled-squad_fix_outofrange_fp16.onnx
loading compiler from: /opt/qti-aic/dev/lib/x86_64/libQAicCompiler.so
Compile started ............... 
Compiling model with FP16 precision.
Generated binary is present at distilbert-base-cased-distilled-squad/generatedModels/distilbert-base-cased-distilled-squad_fix_outofrange_fp16_qpc
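
When compiling several models or sequence lengths, it can be convenient to assemble the command in Python instead of hard-coding paths in a shell cell. The sketch below builds the argument list using only the flags that appear in the cell above. `build_compile_command` is a hypothetical helper and the paths are placeholders; the command is printed, not executed.

```python
from typing import Dict, List, Optional


def build_compile_command(onnx_path: str, qpc_dir: str, num_cores: int = 4,
                          symbols: Optional[Dict[str, int]] = None) -> List[str]:
    """Assemble the qaic-exec argument list from the flags used in this notebook."""
    cmd = [
        "/opt/qti-aic/exec/qaic-exec",
        f"-m={onnx_path}",
        f"-aic-num-cores={num_cores}",
        "-convert-to-fp16",
    ]
    # One -onnx-define-symbol flag per dynamic dimension name.
    for name, value in (symbols or {}).items():
        cmd.append(f"-onnx-define-symbol={name},{value}")
    cmd += [f"-aic-binary-dir={qpc_dir}", "-aic-hw", "-aic-hw-version=2.0", "-compile-only"]
    return cmd


# Placeholder paths for illustration only.
cmd = build_compile_command(
    "model/generatedModels/model_fix_outofrange_fp16.onnx",
    "model/generatedModels/model_fix_outofrange_fp16_qpc",
    symbols={"batch": 1, "sequence": 128},
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where the SDK is installed.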


## NOTE:

There are three different approaches to invoke the device for inference. 

1. Utilizing a command line interface (CLI) command - ```qaic-runner```
2. Employing the `Python` API (as shown below)
3. Leveraging the `C++` API.

# 5. Creating a Session and setting up inputs

Now that we have compiled the model for Qualcomm Cloud AI 100, we can set up a session to run inference on the device. The ```qaic``` library is a set of APIs that provides support for running inference on the AIC100 backend. 

```Session```: Session is the entry point of these APIs. It is a factory method that the user calls to create a session instance with the AIC100 backend.

### API:
```Session(model_path: str, **kwargs)```


### Examples:
Creating Session with options passed as KW args
```python
sess = qaic.Session(model_path='/path/to/qpc', num_activations = 8, set_size=10) 
```
 
Creating a Session by passing options in yaml file
```python
sess = qaic.Session(model_path='/path/to/qpc', options_path='/path/xyz/options.yaml')
```


### **Limitations**
- APIs are compatible only with Python 3.8 
- These APIs are supported only on x86 platforms

Let's create a BERT session. 
`distilbert-base-cased-distilled-squad-config.yaml` contains inference parameters such as `num_activations`, which `qaic.Session` uses 
along with the input data for inference on the Cloud AI device.


In [14]:
# Contents of our yaml
options_path = f'{model_card}-config.yaml'
_ = os.system(f'cat {options_path}')

# Inference Parameters
num_activations: 2
set_size: 10
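
If you switch to a different model, an options file with the same two parameters can be generated programmatically. The sketch below writes the file with plain string formatting so that it needs no YAML library. `write_session_options` is a hypothetical helper, and the file is written to a temporary directory so the sketch does not overwrite the real config.

```python
import os
import tempfile


def write_session_options(path: str, num_activations: int, set_size: int) -> None:
    """Write the two inference parameters as a flat YAML mapping."""
    with open(path, "w") as f:
        f.write("# Inference Parameters\n")
        f.write(f"num_activations: {num_activations}\n")
        f.write(f"set_size: {set_size}\n")


# Placeholder file name in a temporary directory.
options_file = os.path.join(tempfile.mkdtemp(), "example-model-config.yaml")
write_session_options(options_file, num_activations=2, set_size=10)
with open(options_file) as f:
    written = f.read()
print(written)
```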

In [15]:
# Set the path of QPC generated with qaic-exec
qpcPath = f'{model_card}/generatedModels/{model_card}_fix_outofrange_fp16_qpc'

In [16]:

bert_sess = qaic.Session(model_path= qpcPath+'/programqpc.bin', options_path=options_path)
bert_sess.setup() # Loads the network to the device. If setup() is not called, the network gets loaded just before the first inference.
# alternatively, you can also provide arguments in the function call.
# bert_sess = qaic.Session(model_path= qpcPath+'/programqpc.bin', num_activations=2)

loading /opt/qti-aic/dev/lib/x86_64/libQAic.so


Here we are setting `num_activations = 2` and `set_size = 10` through the yaml file.
Additionally, you can provide `device_id` as an inference parameter. 

Please find more details about the options [here](https://docs.qualcomm.com/bundle/resource/topics/AIC_Developer_Guide/).

In [17]:
# Here we are reading out all the input and output shapes/types
input_shape, input_type = bert_sess.model_input_shape_dict['input_ids']
attn_shape, attn_type = bert_sess.model_input_shape_dict['attention_mask']
s_output_shape, s_output_type = bert_sess.model_output_shape_dict['start_logits']
e_output_shape, e_output_type = bert_sess.model_output_shape_dict['end_logits']
print(f'Input token shape {input_shape} and type {input_type}')
print(f'Input attention mask shape {attn_shape} and type {attn_type}')
print(f'start_logits shape {s_output_shape} and type {s_output_type}')
print(f'end_logits shape {e_output_shape} and type {e_output_type}')

Input token shape (1, 128) and type int64
Input attention mask shape (1, 128) and type int64
start_logits shape (1, 128) and type float32
end_logits shape (1, 128) and type float32


# 6. Inference on Cloud AI using Python APIs

In [18]:
## Check health of the cards before deploying the inference. 
## Status:Ready indicates that the card is in good health and ready to accept inferences
## Status:Error indicates that the card is not in good health. Please contact the system administrator
!/opt/qti-aic/tools/qaic-util -q | grep -e "Status" -e "QID" -e "Nsp Free" -e "Dram Total"

QID 0
	Status:Ready
	Dram Total:33554432 KB
	Nsp Free:8
QID 1
	Status:Ready
	Dram Total:33554432 KB
	Nsp Free:16
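
The health check can also be automated by parsing the filtered text. The sketch below assumes the output has exactly the layout shown above (a `QID <n>` line followed by `Key:Value` lines) and runs on a pasted sample, not on a live device. `parse_card_status` is a hypothetical helper.

```python
from typing import Dict


def parse_card_status(text: str) -> Dict[int, Dict[str, str]]:
    """Group the 'Key:Value' lines of the filtered qaic-util output by QID."""
    cards: Dict[int, Dict[str, str]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("QID"):
            current = int(line.split()[1])
            cards[current] = {}
        elif current is not None and ":" in line:
            key, value = line.split(":", 1)
            cards[current][key.strip()] = value.strip()
    return cards


# Sample copied from the output above.
sample = """QID 0
\tStatus:Ready
\tDram Total:33554432 KB
\tNsp Free:8
QID 1
\tStatus:Ready
\tDram Total:33554432 KB
\tNsp Free:16
"""
cards = parse_card_status(sample)
ready = [qid for qid, info in cards.items() if info.get("Status") == "Ready"]
print(ready)                 # [0, 1]
print(cards[0]["Nsp Free"])  # 8
```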


In [19]:
# Create an input dictionary for the given input.
input_dict = {"input_ids": inputs.input_ids.numpy().astype(input_type), "attention_mask" : inputs.attention_mask.numpy().astype(attn_type)}

In [20]:
# Run the model on Qualcomm Cloud AI 100
output = bert_sess.run(input_dict)

In [21]:
# Restructure the data from output buffer with output_shape, output_type
start_token_index = np.frombuffer(output['start_logits'], dtype=s_output_type).reshape(s_output_shape).argmax()
end_token_index = np.frombuffer(output['end_logits'], dtype=e_output_type).reshape(e_output_shape).argmax()

# Decode the output.
predict_answer_tokens = inputs.input_ids[0, start_token_index : end_token_index + 1]
print(f'Answer : {tokenizer.decode(predict_answer_tokens)}')

Answer : a nice puppet
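
The `np.frombuffer(...).reshape(...)` step above reinterprets a raw output buffer as typed values before taking the argmax. The sketch below shows the same idea with the standard library's `array` module on a hand-made buffer, so it runs without a device. The byte buffer is a stand-in for `output['start_logits']`, and `decode_float32` and `argmax` are hypothetical helpers.

```python
from array import array
from typing import List


def decode_float32(buffer: bytes) -> List[float]:
    """Reinterpret raw bytes as native-endian float32 values."""
    values = array("f")
    values.frombytes(buffer)
    return values.tolist()


def argmax(values: List[float]) -> int:
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


raw = array("f", [0.5, 2.0, -1.0]).tobytes()  # stand-in for a device output buffer
logits = decode_float32(raw)
print(logits)          # [0.5, 2.0, -1.0]
print(argmax(logits))  # 1
```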


In [22]:
# reset the session to release the NSP cores
bert_sess.reset()
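
If an exception is raised between creating the session and calling `reset()`, the NSP cores may stay reserved. One way to guard against that is a small context manager, sketched below, that always calls `reset()` on exit. It assumes only that the session object has a `reset()` method, as used above. `released_on_exit` is a hypothetical helper, and `FakeSession` is a stand-in so the sketch runs without a device.

```python
from contextlib import contextmanager


@contextmanager
def released_on_exit(session):
    """Yield the session and call its reset() afterwards, even if the body raises."""
    try:
        yield session
    finally:
        session.reset()


class FakeSession:
    """Stand-in for a qaic.Session so the sketch runs without a device."""

    def __init__(self):
        self.reset_called = False

    def reset(self):
        self.reset_called = True


fake = FakeSession()
try:
    with released_on_exit(fake):
        raise RuntimeError("inference failed")
except RuntimeError:
    pass
print(fake.reset_called)  # True
```

With a real session, the body of the `with` block would hold the `run()` call and the output decoding.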