Copyright (c) 2023 Qualcomm Innovation Center, Inc. All rights reserved. <br>
SPDX-License-Identifier: BSD-3-Clause-Clear

**Takeaways:** Users will learn how to onboard a BERT base neural network model on Cloud AI devices and run inference.

**Before you start:** 
- Some commands in this notebook (folder locations, etc.) need to be updated based on your platform and installation location. 
- The terms 'model' and 'network' are used interchangeably in this notebook. 

**Last Verified Qualcomm Cloud AI Platform SDK and Apps SDK Version:** Platform SDK 1.10.0.193 and Apps SDK 1.10.0.193 

## Introduction 
This notebook is for beginners and will take the user through the workflow, from onboarding the 'Bert Base Cased' model (from HuggingFace) to execution of inference on Cloud AI devices. 

Here is the Cloud AI inference workflow at a high level. 


![Workflow](Images/Workflow.jpg)


We will follow this sequence of steps in the notebook. 

1. **Install required packages**: Install the required packages and import the necessary libraries.
2. **torch-cpu inference**: Import the model, generate an input and run the model on CPU.
3. **ONNX conversion**: Convert the PyTorch model to ONNX format. 
4. **Compilation**: Compile the model for Qualcomm Cloud AI 100.
5. **Creating a Session and setting up inputs**: Create a qaic session and prepare the models for qaic runtime.
6. **Inference on Cloud AI using Python APIs**: Run inference efficiently using the qaic API and `ThreadPoolExecutor`. 
7. **Inference on Cloud AI using qaic-runner CLI**: Run inference using qaic-runner CLI. 



# 1. Install required packages 

We will install the required packages. 

In [1]:
# Let's make sure the Python interpreter path is set properly.
import sys
sys.executable

'/opt/qti-aic/dev/python/qaic-env/bin/python'

In [2]:
!pip3 install -r requirements.txt

Processing /opt/qti-aic/dev/lib/x86_64/qaic-0.0.1-py3-none-any.whl (from -r requirements.txt (line 8))
qaic is already installed with the same version as the provided wheel. Use --force-reinstall to force an installation of the wheel.


## Import the necessary libraries.

We will import the pre-trained model from the Hugging Face ```transformers``` library. 


In [3]:
import transformers
from transformers import AutoTokenizer, AutoModelForMaskedLM
import sys
sys.path.append("/opt/qti-aic/examples/apps/qaic-python-sdk")
import qaic
import os
import torch
import onnx
from onnxsim import simplify
import argparse
import numpy as np
from typing import Dict, List, Optional, Tuple
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# 2. Inference using torch-cpu

### Choose a model from ```transformers``` library 
You can choose any pretrained model, but you must also create a matching ```<model_name>-config.yaml``` file containing the compilation and execution options. 

In [4]:
model_card = 'bert-base-cased' # Provide a model name supported in transformers library.

In [5]:
# Delete any pre-generated model
os.system('rm -fr bert-base-cased')

0

In [6]:
# Import the pre-trained model
model = AutoModelForMaskedLM.from_pretrained(model_card)

# setup the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_card)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [7]:
# Sentence example
sentences = [
"The [MASK] sat on the mat.",
"The lion is considered [MASK] of the jungle.",
"I saw a [MASK] in the park.",
"The cat is playing in the [MASK].",
"The dog is [MASK] a cookie.",
"The cat is drinking a glass of [MASK].",
"The [MASK] is sleeping in its bed.",
"The elephant is walking down the [MASK].",
"That person is talking on the [MASK].",
"Are you reading a [MASK]?",
]

In [8]:
def get_example_input(sentence, tokenizer):
    max_length = 128
    encodings = tokenizer(sentence, max_length=max_length, truncation=True, padding="max_length", return_tensors='pt')
    inputIds = encodings["input_ids"]
    attentionMask = encodings["attention_mask"]
    mask_token_index = torch.where(encodings['input_ids'] == tokenizer.mask_token_id)[1]
    return inputIds, attentionMask, mask_token_index

In [9]:
# get input example 
inputIds, attentionMask, mask_token_index = get_example_input(sentences[0],tokenizer)

In [10]:
# Gradients are not needed for inference, so disable them once for the whole loop
with torch.no_grad():
    for sentence in sentences:
        inputIds, attentionMask, mask_token_index = get_example_input(sentence, tokenizer)
        # Run the forward pass to get per-token vocabulary logits
        model_output = model(input_ids=inputIds, attention_mask=attentionMask)

        token_logits = model_output.logits
        mask_token_logits = token_logits[0, mask_token_index, :]
        # Pick the highest-scoring token at the [MASK] position
        word = tokenizer.decode([torch.argmax(mask_token_logits)])
        print(sentence.replace("[MASK]", "\""+word+"\""))

2023-08-24 10:29:39.860072: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2023-08-24 10:29:39.860091: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


The "boy" sat on the mat.
The lion is considered "protector" of the jungle.
I saw a "car" in the park.
The cat is playing in the "garden".
The dog is "given" a cookie.
The cat is drinking a glass of "wine".
The "dog" is sleeping in its bed.
The elephant is walking down the "street".
That person is talking on the "phone".
Are you reading a "book"?


# 3. ONNX conversion

In [11]:
# Setup dir for saving onnx and qpc files
gen_models_path = f"{model_card}/generatedModels"
os.makedirs(gen_models_path, exist_ok=True)
model_base_name = model_card


In [12]:
# Set dynamic dims and axes.
dynamic_dims = {0: 'batch', 1 : 'sequence'}
dynamic_axes = {
    "input_ids" : dynamic_dims,
    "attention_mask" : dynamic_dims,
    "logits" : dynamic_dims
}
input_names = ["input_ids", "attention_mask"]
inputList = [inputIds, attentionMask]

model.eval() # set the model to inference mode.

torch.onnx.export(
    model,
    args=tuple(inputList),
    f=f"{gen_models_path}/{model_base_name}.onnx",
    verbose=False,
    input_names=input_names,
    output_names=["logits"],
    dynamic_axes=dynamic_axes,
    opset_version=11,
)
print("INFO: ONNX Model is being generated successfully")

INFO: ONNX Model is being generated successfully


#### Modification 
Modify the ONNX file to handle constants that fall outside the FP16 range (```> FP16_Max``` or ```< FP16_Min```). 
```fix_onnx_fp16``` is a helper function for this purpose. <br> In the exported model, -inf is represented by the minimum FP32 value. The helper function clips such values to the minimum FP16 value. 

In [13]:
from onnx import numpy_helper
        
def fix_onnx_fp16(
    gen_models_path: str,
    model_base_name: str,
) -> str:
    finfo = np.finfo(np.float16)
    fp16_max = finfo.max
    fp16_min = finfo.min
    model = onnx.load(f"{gen_models_path}/{model_base_name}.onnx")
    fp16_fix = False
    for tensor in onnx.external_data_helper._get_all_tensors(model):
        nptensor = numpy_helper.to_array(tensor, gen_models_path)
        if nptensor.dtype == np.float32 and (
            np.any(nptensor > fp16_max) or np.any(nptensor < fp16_min)
        ):
            # print(f'tensor value : {nptensor} above {fp16_max} or below {fp16_min}')
            nptensor = np.clip(nptensor, fp16_min, fp16_max)
            new_tensor = numpy_helper.from_array(nptensor, tensor.name)
            tensor.CopyFrom(new_tensor)
            fp16_fix = True
            
    if fp16_fix:
        # Save FP16 model
        print("Found constants out of FP16 range, clipped to FP16 range")
        model_base_name += "_fix_outofrange_fp16"
        onnx.save(model, f=f"{gen_models_path}/{model_base_name}.onnx")
        print(f"Saving modified onnx file at {gen_models_path}/{model_base_name}.onnx")
    return model_base_name

fp16_model_name = fix_onnx_fp16(gen_models_path=gen_models_path, model_base_name=model_base_name)


Found constants out of FP16 range, clipped to FP16 range
Saving modified onnx file at bert-base-cased/generatedModels/bert-base-cased_fix_outofrange_fp16.onnx
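
To see why the clipping matters, here is a minimal sketch (not part of the original notebook) that casts an FP32 constant to FP16 with and without clipping. The `mask_value` array is a made-up stand-in for a constant in the exported graph; only NumPy is assumed.

```python
import numpy as np

fp16 = np.finfo(np.float16)
fp32 = np.finfo(np.float32)

# Hypothetical constant: the FP32 minimum standing in for -inf, plus two ordinary values.
mask_value = np.array([fp32.min, 0.0, 1.0], dtype=np.float32)

# A direct cast overflows, so the first element becomes -inf in FP16.
with np.errstate(over="ignore"):
    naive = mask_value.astype(np.float16)

# Clipping to the FP16 range first keeps every value finite.
clipped = np.clip(mask_value, fp16.min, fp16.max).astype(np.float16)

print("naive cast  :", naive)
print("clipped cast:", clipped)
```

The clipped array keeps a large negative but finite value, which is what `fix_onnx_fp16` writes back into the model.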


# 4. Compilation step

The `qaic-exec` CLI tool is used to compile the model for Qualcomm Cloud AI 100. The input to this tool is the `onnx` file generated above. The tool produces a QPC (Qualcomm Program Container) binary file in the path defined by the `-aic-binary-dir` argument. 

### Breakdown of key compile parameters.
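We compile the ONNX file 
- with 4 NSP cores (`-aic-num-cores=4`)
- with float16 precision (`-convert-to-fp16`)
- with the ONNX symbols `batch` and `sequence` fixed to 1 and 128 (`-onnx-define-symbol`)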
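
If you want to vary these parameters, one option is to assemble the command in Python. The sketch below is an illustration only: `build_compile_command` is a hypothetical helper, and it uses only the flags that appear in this notebook. It prints the command rather than running it.

```python
import shlex

def build_compile_command(onnx_path, qpc_dir, num_cores=4, symbols=None):
    """Assemble the qaic-exec invocation used in this notebook (hypothetical helper)."""
    symbols = symbols or {"batch": 1, "sequence": 128}
    args = [
        "/opt/qti-aic/exec/qaic-exec",
        f"-m={onnx_path}",
        f"-aic-num-cores={num_cores}",  # number of NSP cores
        "-convert-to-fp16",             # float16 precision
    ]
    # One -onnx-define-symbol flag per dynamic axis that is fixed at compile time
    args += [f"-onnx-define-symbol={name},{value}" for name, value in symbols.items()]
    args += [
        f"-aic-binary-dir={qpc_dir}",
        "-aic-hw",
        "-aic-hw-version=2.0",
        "-compile-only",
    ]
    return args

base = "bert-base-cased/generatedModels/bert-base-cased_fix_outofrange_fp16"
command = build_compile_command(f"{base}.onnx", f"{base}_qpc")
print(shlex.join(command))
```

Keeping the arguments in a list makes it straightforward to pass them to `subprocess.run` later without shell quoting issues.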


In [14]:
# COMPILE using qaic-exec
os.system('rm -fr bert-base-cased/generatedModels/bert-base-cased_fix_outofrange_fp16_qpc')

!/opt/qti-aic/exec/qaic-exec \
-m=bert-base-cased/generatedModels/bert-base-cased_fix_outofrange_fp16.onnx \
-aic-num-cores=4 \
-convert-to-fp16 \
-onnx-define-symbol=batch,1 -onnx-define-symbol=sequence,128 \
-aic-binary-dir=bert-base-cased/generatedModels/bert-base-cased_fix_outofrange_fp16_qpc \
-aic-hw -aic-hw-version=2.0 \
-compile-only


Reading ONNX Model from bert-base-cased/generatedModels/bert-base-cased_fix_outofrange_fp16.onnx
loading compiler from: /opt/qti-aic/dev/lib/x86_64/libQAicCompiler.so
Compile started ............... 
Compiling model with FP16 precision.
Generated binary is present at bert-base-cased/generatedModels/bert-base-cased_fix_outofrange_fp16_qpc


## NOTE:

There are three different approaches to invoke the device for inference. 

1. Using a command line interface (CLI) command - ```qaic-runner```
2. Using the `Python` API (as shown below)
3. Using the `C++` API.

# 5. Creating a Session and setting up inputs

Now that we have compiled the model for Qualcomm Cloud AI 100, we can set up a session to run inference on the device. The ```qaic``` library is a set of APIs that support running inference on the AIC100 backend. 

```Session```: Session is the entry point of these APIs. It is a factory method that the user calls to create a session instance with the AIC100 backend.

### API:
```Session(model_qpc_path: str, **kwargs)```


### Examples:
Creating Session with options passed as KW args
```python
sess = qaic.Session(model_path='/path/to/qpc', num_activations = 8, set_size=10) 
```
 
Creating a Session by passing options in yaml file
```python
sess = qaic.Session(model_path='/path/to/qpc', options_path='/path/xyz/options.yaml')
```


### **Limitations**
- APIs are compatible only with Python 3.8 
- These APIs are supported only on x86 platforms

Let's create a BERT session. 
`bert-base-cased-config.yaml` contains inference parameters such as `num_activations`, which `qaic.Session` uses 
along with the input data for inference on the Cloud AI device.


In [15]:
# Contents of our yaml
options_path = f'{model_card}-config.yaml'
_ = os.system(f'cat {options_path}')

# Inference Parameters
num_activations: 2
set_size: 10
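
If you pick a different model, you need a matching config file. The following sketch writes one with the same two keys shown above; `write_session_config` is a hypothetical helper, and the example writes into a temporary directory so it does not touch the notebook's own file.

```python
import os
import tempfile

def write_session_config(model_card, num_activations, set_size, directory="."):
    """Write `<model_card>-config.yaml` with the two options shown above (hypothetical helper)."""
    path = os.path.join(directory, f"{model_card}-config.yaml")
    with open(path, "w") as config_file:
        config_file.write("# Inference Parameters\n")
        config_file.write(f"num_activations: {num_activations}\n")
        config_file.write(f"set_size: {set_size}\n")
    return path

# Write into a temporary directory and print the result for inspection.
with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = write_session_config("bert-base-cased", 2, 10, directory=tmp_dir)
    with open(config_path) as config_file:
        config_text = config_file.read()

print(config_text)
```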

In [16]:
# Set the path of QPC generated with qaic-exec
qpcPath = 'bert-base-cased/generatedModels/bert-base-cased_fix_outofrange_fp16_qpc'

In [17]:

bert_sess = qaic.Session(model_path= qpcPath+'/programqpc.bin', options_path=options_path)
bert_sess.setup() # Loads the network to the device. If setup() is not called, the network gets loaded just before the first inference.
# alternatively, you can also provide arguments in the function call.
# bert_sess = qaic.Session(model_path= qpcPath+'/programqpc.bin', num_activations=2)

loading /opt/qti-aic/dev/lib/x86_64/libQAic.so


Here we are setting `num_activations = 2` and `set_size = 10` through the YAML file shown above.
Additionally, you can provide `device_id` as an inference parameter. 

Please find more details about the options [here](https://docs.qualcomm.com/bundle/resource/topics/AIC_Developer_Guide/).

In [18]:
# Here we are reading out all the input and output shapes/types
input_shape, input_type = bert_sess.model_input_shape_dict['input_ids']
attn_shape, attn_type = bert_sess.model_input_shape_dict['attention_mask']
output_shape, output_type = bert_sess.model_output_shape_dict['logits']
print(f'Input token shape {input_shape} and type {input_type}')
print(f'Input attention mask shape {attn_shape} and type {attn_type}')
print(f'Output logits shape {output_shape} and type {output_type}')

Input token shape (1, 128) and type int64
Input attention mask shape (1, 128) and type int64
Output logits shape (1, 128, 28996) and type float32


# 6. Inference on Cloud AI using Python APIs

### `set_size`
This helps manage parallelism on the host. When pre/post-processing runs on the host, `set_size` lets several pre/post-processing tasks run in parallel on the host around submission to the device.
The `set_size` value is the number of pre/post-processing tasks that can run in parallel on the host.

### `num_activations`
The number of instances of the network loaded onto the device. This allows multiple instances of the same network to run in parallel on the device.

### worker threads (in `ThreadPoolExecutor`)
Since `session.run` is a blocking call, we need threads to submit inferences in parallel.
These threads are only responsible for submitting inference requests to the runtime in parallel.

#### General guidance on number of threads

By default, use 10 threads. When tuning `num_activations` and `set_size`, set the number of threads to `num_activations`\*`set_size` to get good performance.

That said, the number of threads should not be so high that the process spends too much time creating threads.

Our guidance is to keep the thread count at `num_activations`\*`set_size`, while monitoring how much time is spent spawning threads and switching between them. 
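
As a rough illustration of this guidance, the sketch below sizes a thread pool from `num_activations` and `set_size` and times a batch of submissions. `fake_blocking_run` is a hypothetical stand-in that sleeps instead of calling the device, so the timing only shows threading overhead, not real inference latency.

```python
import concurrent.futures
import time

num_activations = 2
set_size = 10
max_workers = num_activations * set_size  # guidance from the text above

def fake_blocking_run(index):
    """Stand-in for a blocking `session.run` call; sleeps instead of running inference."""
    time.sleep(0.01)
    return index

with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
    start = time.perf_counter()
    futures = [executor.submit(fake_blocking_run, i) for i in range(40)]
    # Results arrive in completion order, so sort them to compare with the inputs
    completed = sorted(f.result() for f in concurrent.futures.as_completed(futures))
    elapsed = time.perf_counter() - start

print(f"{max_workers} workers finished {len(completed)} requests in {elapsed:.3f}s")
```

Re-running with different `max_workers` values is one simple way to check whether thread creation and switching start to dominate.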


### helper functions

`buildinput` : Generates the input data for a given sentence.

`infer` : Runs inference using `bert_sess.run` and post-processes the output to find the masked word.

In [19]:
# Run the model on Qualcomm Cloud AI 100
import concurrent.futures

def buildinput(sentence):
    inputIds, attentionMask, mask_token_index = get_example_input(sentence,tokenizer)
    input_dict = {"input_ids": inputIds.numpy().astype(input_type), "attention_mask" : attentionMask.numpy().astype(attn_type)}
    inputs = {'dict' : input_dict, 'mask_token' : mask_token_index}
    return inputs


def infer(input_data, input_index):
    output = bert_sess.run(input_data['dict'])
    mask_token_index = input_data['mask_token']
    token_logits = np.frombuffer(output['logits'], dtype=output_type).reshape(output_shape) 
    masked_word = tokenizer.decode([np.argmax(token_logits[0, mask_token_index, :])])
    return masked_word, input_index


with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    my_input_data = [buildinput(sentence) for sentence in sentences]
    futures = [executor.submit(infer, input_data, i) for i, input_data in enumerate(my_input_data)]

    for i, future in enumerate(concurrent.futures.as_completed(futures)):
        masked_word, input_index = future.result()
        print(f'processed {input_index} :',sentences[input_index].replace("[MASK]", "\""+masked_word+"\""))

processed 1 : The lion is considered "protector" of the jungle.
processed 0 : The "boy" sat on the mat.
processed 2 : I saw a "car" in the park.
processed 3 : The cat is playing in the "garden".
processed 4 : The dog is "given" a cookie.
processed 5 : The cat is drinking a glass of "wine".
processed 7 : The elephant is walking down the "street".
processed 6 : The "dog" is sleeping in its bed.
processed 8 : That person is talking on the "phone".
processed 9 : Are you reading a "book"?
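
The post-processing in `infer` reinterprets the returned buffer with a known dtype and shape before taking the argmax at the mask position. Here is a minimal sketch of that step on toy data, assuming (as the `np.frombuffer` call suggests) that the output arrives as a flat buffer. The shape, the mask position, and the scores are all made up.

```python
import numpy as np

# Small made-up shape; the notebook's real logits shape is (1, 128, 28996).
toy_shape = (1, 4, 6)  # (batch, sequence, vocabulary)
toy_logits = np.zeros(toy_shape, dtype=np.float32)
mask_position = 2      # hypothetical index of [MASK] in the sequence
toy_logits[0, mask_position, 5] = 3.0  # make vocabulary id 5 the top score

# Emulate a flat output buffer with raw bytes.
raw_buffer = toy_logits.tobytes()

# Reinterpret the bytes with the known dtype, then restore the shape.
restored = np.frombuffer(raw_buffer, dtype=np.float32).reshape(toy_shape)

# The predicted token id is the highest-scoring vocabulary entry at the mask position.
predicted_id = int(np.argmax(restored[0, mask_position, :]))
print(predicted_id)
```

In the notebook, the resulting id is then passed to `tokenizer.decode` to recover the word.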


In [20]:
# reset the session to release the NSP cores
bert_sess.reset()

# 7. Inference using qaic-runner CLI tool

In [21]:
# store the example input in files
import os

def save_encodings(inputIds, attentionMask, i, path):
    os.makedirs(path, exist_ok=True)
    inputIds.detach().numpy().tofile(os.path.join(path, 'input_ids.raw'))
    attentionMask.detach().numpy().tofile(os.path.join(path, 'input_mask.raw'))
    print('Encodings saved at:', path)
    with open("inputs_list.txt", "a") as myfile:
        myfile.write(f'{path}/input_ids.raw,{path}/input_mask.raw\n')

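# Remove any stale list file, because save_encodings appends to it on every call
if os.path.exists("inputs_list.txt"):
    os.remove("inputs_list.txt")

# get example input
for i, sentence in enumerate(sentences):
    inputIds, attentionMask, mask_token_index = get_example_input(sentence, tokenizer)
    save_encodings(inputIds, attentionMask, i, path=f'bert-base-cased/inputFiles/{i}')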


Encodings saved at: bert-base-cased/inputFiles/0
Encodings saved at: bert-base-cased/inputFiles/1
Encodings saved at: bert-base-cased/inputFiles/2
Encodings saved at: bert-base-cased/inputFiles/3
Encodings saved at: bert-base-cased/inputFiles/4
Encodings saved at: bert-base-cased/inputFiles/5
Encodings saved at: bert-base-cased/inputFiles/6
Encodings saved at: bert-base-cased/inputFiles/7
Encodings saved at: bert-base-cased/inputFiles/8
Encodings saved at: bert-base-cased/inputFiles/9
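
Files written with `tofile` contain only the raw bytes, so the dtype and shape must be supplied again when they are read back. The sketch below shows the round trip on a small made-up array of token ids written to a temporary directory; the notebook's real inputs are int64 with shape (1, 128).

```python
import os
import tempfile
import numpy as np

# Made-up token ids padded to length 8, for illustration only.
input_ids = np.array([[101, 1109, 103, 2068, 102, 0, 0, 0]], dtype=np.int64)

with tempfile.TemporaryDirectory() as tmp_dir:
    raw_path = os.path.join(tmp_dir, "input_ids.raw")
    input_ids.tofile(raw_path)  # writes bytes only, with no header
    file_size = os.path.getsize(raw_path)
    # dtype and shape are not stored in the file, so pass them explicitly
    restored = np.fromfile(raw_path, dtype=np.int64).reshape(input_ids.shape)

print(file_size, restored.shape)
```

The file size equals the element count times the item size (8 elements of 8 bytes each here), which is a quick sanity check that the expected dtype was written.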


In [22]:
cmd = f'/opt/qti-aic/exec/qaic-runner \
    -t {qpcPath} \
    -a 1 \
    -n 100\
    --aic-batch-input-file-list inputs_list.txt\
    --write-output-start-iter 0\
    --write-output-num-samples 10\
    --write-output-dir bert-base-cased/outputFiles\
    -d 0'
os.system(cmd)

loading /opt/qti-aic/dev/lib/x86_64/libQAic.so
Writing file:bert-base-cased/outputFiles/logits-activation-0-inf-0.bin
Writing file:bert-base-cased/outputFiles/logits-activation-0-inf-1.bin
Writing file:bert-base-cased/outputFiles/logits-activation-0-inf-2.bin
Writing file:bert-base-cased/outputFiles/logits-activation-0-inf-3.bin
Writing file:bert-base-cased/outputFiles/logits-activation-0-inf-4.bin
Writing file:bert-base-cased/outputFiles/logits-activation-0-inf-5.bin
Writing file:bert-base-cased/outputFiles/logits-activation-0-inf-6.bin
Writing file:bert-base-cased/outputFiles/logits-activation-0-inf-7.bin
Writing file:bert-base-cased/outputFiles/logits-activation-0-inf-8.bin
Writing file:bert-base-cased/outputFiles/logits-activation-0-inf-9.bin
Number of Files(as per batch input):10
 ---- Stats ----
InferenceCnt 100 TotalDuration 389731us BatchSize 1 Inf/Sec 256.587


0
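
The reported throughput can be checked from the stats line itself. This sketch parses the line printed above with a regular expression (the pattern is an assumption based on this one sample) and recomputes inferences per second for the batch size 1 case.

```python
import re

stats_line = "InferenceCnt 100 TotalDuration 389731us BatchSize 1 Inf/Sec 256.587"

match = re.search(
    r"InferenceCnt (\d+) TotalDuration (\d+)us BatchSize (\d+) Inf/Sec ([\d.]+)",
    stats_line,
)
inference_count = int(match.group(1))
duration_us = int(match.group(2))
batch_size = int(match.group(3))
reported = float(match.group(4))

# Throughput = inferences completed / elapsed seconds (batch size is 1 in this run)
recomputed = inference_count / (duration_us / 1_000_000)
print(f"reported {reported:.3f} inf/s, recomputed {recomputed:.3f} inf/s")
```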

In [23]:
output_shape = (1, 128, 28996)

for i, sentence in enumerate(sentences):
    inputIds, attentionMask, mask_token_index = get_example_input(sentence, tokenizer)
    token_logits = np.fromfile(f'bert-base-cased/outputFiles/logits-activation-0-inf-{i}.bin', output_type).reshape(output_shape)
    # Decode the output.
    masked_word = tokenizer.decode([np.argmax(token_logits[0, mask_token_index, :])])
    print(f'Input index {i} :',sentences[i].replace("[MASK]", "\""+masked_word+"\""))

Input index 0 : The "boy" sat on the mat.
Input index 1 : The lion is considered "protector" of the jungle.
Input index 2 : I saw a "car" in the park.
Input index 3 : The cat is playing in the "garden".
Input index 4 : The dog is "given" a cookie.
Input index 5 : The cat is drinking a glass of "wine".
Input index 6 : The "dog" is sleeping in its bed.
Input index 7 : The elephant is walking down the "street".
Input index 8 : That person is talking on the "phone".
Input index 9 : Are you reading a "book"?
