# QNN Model Prepare on Linux

The Qualcomm AI Engine Direct SDK allows clients to run ML models on HTP hardware. The following steps describe how to prepare the TinyLlama 1.1B model on a Linux host for deployment to an Android device with HTP capability.

Before continuing, ensure all steps from [README](README.md) are completed. 

This document uses the terms Qualcomm Neural Network (QNN) and Qualcomm AI Engine Direct SDK interchangeably.


# Prerequisites

1. Qualcomm AI Engine Direct SDK (with Ubuntu Linux support)
2. Ubuntu 22.04 installation with required packages for QNN Tools
3. Android Platform tools version 31 or greater
4. Anaconda (with the supplied environment.yaml) or a Python virtual environment (venv) to run this notebook
5. TinyLlama `.onnx` files and their corresponding AIMET encodings (generated via AIMET workflow)

This workflow assumes that you have generated the following artifacts with the AIMET TinyLlama workflow:

- TinyLlama 1.1B model and its AIMET encodings
- One `.pkl` file per network: a NumPy object array saved as a Python pickle, containing data required by the model conversion step.

![dir_struct](../assets/step-1_output_dir_contents.png "Overall directory Structure from notebook 1")

# Workflow


All the models and encodings are processed independently via different executable QNN utilities available in the Qualcomm AI Engine Direct SDK.

To prepare TinyLlama 1.1B models for inference, the QNN executable utilities require an Ubuntu 22.04 environment. The preparation consists of the following steps:

1. Generate the AR-1 and AR-128 ONNX models from the AR-2073 exported model.
2. Model splitting is not required to run TinyLlama 1.1B (w4a16) on NSP, so set `num_splits=1`.
3. Apply the MHA2SHA transformation to convert all multi-head attention (MHA) blocks to single-head attention (SHA).
4. Convert the `.onnx` files to their equivalent QNN DLC representation.
5. Generate the quantized QNN DLC models.
6. Generate the QNN context binaries for the QNN HTP backend.

After preparing the TinyLlama 1.1B model for inference, the next step is to execute the QNN context binaries for inference on a Snapdragon Android device.


![QNN Work flow](../assets/qnn-workflow.png)

### Set up the AIMET model export directory
1. Create a folder called `assets` inside `example2/host_linux`.

2. Create a folder called `models` inside `assets` (`example2/host_linux/assets`).

3. The `assets/models` path must contain the following AIMET model export artifacts:

   - `onnx` folder: contains the `.onnx` and `.encodings` files
   - `test_vectors` folder: contains the `.pkl` files for QNN conversion
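
The layout described above can be checked before running the rest of the notebook. The following is a minimal sketch, assuming only the folder names and file extensions listed above; `check_export_dir` is a hypothetical helper and is not part of the SDK or of this repository.

```python
from pathlib import Path

def check_export_dir(models_dir):
    """Return a list of problems found in the AIMET export layout (empty if OK)."""
    root = Path(models_dir)
    problems = []
    # Each entry: (subfolder, file suffixes that must appear at least once)
    expected = [("onnx", [".onnx", ".encodings"]), ("test_vectors", [".pkl"])]
    for folder, suffixes in expected:
        sub = root / folder
        if not sub.is_dir():
            problems.append(f"missing folder: {sub}")
            continue
        # Search recursively, since test vectors may sit in a subfolder
        present = {p.suffix for p in sub.rglob("*") if p.is_file()}
        for suffix in suffixes:
            if suffix not in present:
                problems.append(f"no {suffix} file under {sub}")
    return problems
```

Calling `check_export_dir("assets/models")` and printing the returned list gives a quick indication of what is missing before any long-running step starts.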

### Configure QNN SDK path

The following step configures the Qualcomm AI Engine Direct SDK, which enables running TinyLlama 1.1B on the device. 

In [1]:
# !ln -sf /tmp/qnn assets/qnn

### Install the required python packages

In [2]:
# %pip install --quiet -r requirements.txt

## Set up models and Qualcomm AI Engine Direct SDK variables

In [3]:
import os
import subprocess
import concurrent.futures
import time
from pathlib import Path
# Compile with multiple worker processes (True) or a single one (False)
go_parallel = True

workfolder = os.getcwd()

# Set up environment variable to reference LLAMA_MODELS
LLAMA_MODELS = workfolder + "/assets/models"

# Set QNN_SDK_ROOT to the install location of the Qualcomm AI Engine Direct SDK
QNN_SDK_ROOT = '/opt/qcom/aistack/qairt/2.34.2.250528/'

# Check that QNN_SDK_ROOT and LLAMA_MODELS exist
assert os.path.exists(QNN_SDK_ROOT), "QNN_SDK_ROOT path does not exist"
assert os.path.exists(LLAMA_MODELS), "LLAMA_MODELS path does not exist"
os.environ['QNN_SDK_ROOT'] = QNN_SDK_ROOT


In [4]:
import sys
sys.path.append(workfolder+'/../../../common/G2G')
sys.path.append(workfolder+'/../../../common/G2G/split_onnx_utils')
sys.path.append(workfolder+'/../../../common/')
from utilities.nsptargets import NspTargets
from utilities.profiler import event_marker

# Set up nsp target specification
nsp_target = NspTargets.Windows.GEN2

CL = 4096
ARNs = [1,128]
EXPORT_AR = 2073
EXPORT_CONTEXT_LENGTH = 4096
onnx_name = "qwen3"
# Model splitting is not required to run TinyLlama (w4a16) on NSP, so set num_splits=1
num_splits = 1

splits = range(1, num_splits+1)
arn_list = [ arn for arn in ARNs for i in splits ]
split_idxs = [i for arn in ARNs for i in splits]
print('All task list:', [f"ar{arn}-{n}" for arn,n in zip(arn_list,split_idxs)])

All task list: ['ar1-1', 'ar128-1']
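
The task list pairs every AR value with every split index. As an illustration of how the list grows when `num_splits` is larger than 1, here is a small sketch; `task_list` is a hypothetical helper that reproduces the two list comprehensions above with `itertools.product`.

```python
from itertools import product

def task_list(arns, num_splits):
    """Pair every AR value with every 1-based split index."""
    splits = range(1, num_splits + 1)
    return [f"ar{arn}-{split}" for arn, split in product(arns, splits)]

print(task_list([1, 128], 1))  # ['ar1-1', 'ar128-1']
print(task_list([1, 128], 2))  # ['ar1-1', 'ar1-2', 'ar128-1', 'ar128-2']
```

With a single split, the list matches the output shown above.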


# Prepare TinyLlama 1.1B model for Inference

The following section uses the Qualcomm AI Engine Direct SDK to prepare the TinyLlama 1.1B model for on-target inference.

In [5]:
os.makedirs(f"{workfolder}/assets/models_ar_n", exist_ok=True)

import change_hardcoding
def gen_ar(arn):
    change_hardcoding.execute(
            f"{LLAMA_MODELS}", 
            f"{workfolder}/assets/models_ar_n/ar{arn}-cl{CL}", 
            [f" {EXPORT_AR},{arn}",f" -{EXPORT_AR},-1",f" {EXPORT_CONTEXT_LENGTH},{CL}",f" {EXPORT_CONTEXT_LENGTH-EXPORT_AR},{CL-arn}"]
            )

with event_marker(f'prepare-export'):
    with concurrent.futures.ProcessPoolExecutor(max_workers = len(ARNs) if go_parallel else 1) as executor:
        # list() consumes the iterator so exceptions raised in workers surface here
        results = list(executor.map(gen_ar,ARNs))
print("Prepare AR128 AR1 export done.")

Checking graph input/output/value_info
[1, 2073] => [1, 128] : input_ids
[1, 1, 2073, 64] => [1, 1, 128, 64] : position_ids_cos
[1, 1, 2073, 64] => [1, 1, 128, 64] : position_ids_sin
[1, 1, 2073, 4096] => [1, 1, 128, 4096] : attention_mask
[1, 8, 128, 2023] => [1, 8, 128, 3968] : past_key_0_in
[1, 8, 2023, 128] => [1, 8, 3968, 128] : past_value_0_in
[1, 2073, 151936] => [1, 128, 151936] : logits
[1, 8, 128, 2073] => [1, 8, 128, 128] : past_key_0_out
[1, 8, 2073, 128] => [1, 8, 128, 128] : past_value_0_out
Checking initializer

Checking graph input/output/value_info
[1, 2073] => [1, 1] : input_ids
[1, 1, 2073, 64] => [1, 1, 1, 64] : position_ids_cos
[1, 1, 2073, 64] => [1, 1, 1, 64] : position_ids_sin
[1, 1, 2073, 4096] => [1, 1, 1, 4096] : attention_mask
[1, 8, 128, 2023] => [1, 8, 128, 4095] : past_key_0_in
[1, 8, 2023, 128] => [1, 8, 4095, 128] : past_value_0_in
[1, 2073, 151936] => [1, 1, 151936] : logits
[1, 8, 128, 2073] => [1, 8, 128, 1] : past_key_0_out
[1, 8, 2073, 128] => [1, 8,
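
The replacement pairs passed to `change_hardcoding.execute` map three hard-coded dimensions of the export: the export sequence length becomes the AR value, the context length stays at `CL`, and the past key/value length (context length minus sequence length) becomes `CL - arn`. How `change_hardcoding` applies these pairs internally is not shown here, so the following is only a sketch of the dimension arithmetic; `rewrite_shape` is a hypothetical function and its defaults are the values set earlier in this notebook.

```python
def rewrite_shape(shape, arn, export_ar=2073, export_cl=4096, cl=4096):
    """Map the dimensions of an exported tensor shape to the AR-n variant."""
    mapping = {
        export_ar: arn,                   # sequence length of the export
        export_cl: cl,                    # full context length
        export_cl - export_ar: cl - arn,  # past key/value length
    }
    # Dimensions that are not in the mapping (batch, heads, head size) are kept
    return [mapping.get(dim, dim) for dim in shape]

print(rewrite_shape([1, 8, 128, 2023], arn=128))  # [1, 8, 128, 3968]
print(rewrite_shape([1, 1, 2073, 4096], arn=1))   # [1, 1, 1, 4096]
```

A value-based mapping like this would also rewrite an unrelated dimension that happens to equal one of the three keys, which is one reason to inspect the logged shape changes after the step runs.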

## Preprocess ONNX 

Before using the QNN toolchain to compile and generate the context binary for an LLM, we may need to split the model and generate the following artifacts:
- ONNX file for each split of the model
- input vectors for each split
- golden output vectors for each split

TinyLlama 1.1B (w4a16) does not require model splitting to run on NSP targets, but the notebook supports the model splitting logic if required for a use-case.

To run TinyLlama 1.1B as a full model without splitting, we set **num_splits=1** above in the notebook.

We need to specify the following parameters to run the notebook and generate all necessary artifacts:
- number of splits of the model (set to 1 to compile TinyLlama as a full model without splitting)
- path to the TinyLlama ONNX file
- path to the TinyLlama encodings file
- path to the `*.pkl` files
  

![Split](../assets/ModelSplit.png)

### Set up environment variables for the Qualcomm AI Engine Direct SDK tools

In [6]:
import os
import utils

qnn_env = os.environ.copy()
qnn_env["QNN_SDK_ROOT"] = QNN_SDK_ROOT
qnn_env["PYTHONPATH"] = QNN_SDK_ROOT + "/benchmarks/QNN/:" + QNN_SDK_ROOT + "/lib/python"
qnn_env["PATH"] = QNN_SDK_ROOT + "/bin/x86_64-linux-clang:" + qnn_env["PATH"]
qnn_env["LD_LIBRARY_PATH"] = QNN_SDK_ROOT + "/lib/x86_64-linux-clang"
qnn_env["HEXAGON_TOOLS_DIR"] = QNN_SDK_ROOT + "/bin/x86_64-linux-clang"
qnn_env["LLM"] = "1"
qnn_env["split_embedding"] = "0"
qnn_env["split_lmhead"] = "0"
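# Update in place: rebinding os.environ to a plain dict would not change the process environment
os.environ.update(qnn_env)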
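
An alternative to editing the global environment is to build a dedicated mapping and pass it to each tool through the `env` argument of `subprocess`. The sketch below assumes the `bin/x86_64-linux-clang` and `lib/x86_64-linux-clang` folders used in the cell above; `build_tool_env` is a hypothetical helper.

```python
import os

def build_tool_env(sdk_root, base_env=None):
    """Return a copy of base_env with the SDK's host tools and libraries added."""
    env = dict(os.environ if base_env is None else base_env)
    host_dir = "x86_64-linux-clang"
    bin_dir = os.path.join(sdk_root, "bin", host_dir)
    lib_dir = os.path.join(sdk_root, "lib", host_dir)
    env["QNN_SDK_ROOT"] = sdk_root
    # Prepend so the SDK tools win over same-named tools already on PATH
    env["PATH"] = os.pathsep.join(filter(None, [bin_dir, env.get("PATH", "")]))
    # Keep any existing library path instead of overwriting it
    env["LD_LIBRARY_PATH"] = os.pathsep.join(
        filter(None, [lib_dir, env.get("LD_LIBRARY_PATH", "")])
    )
    return env
```

Because the function returns a new dictionary, the notebook's own environment stays untouched and each tool invocation can receive exactly the settings it needs.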

### Split the ONNX export

This step splits a model into multiple parts based on the number of splits specified.

Expected execution time: under 15 minutes

In [7]:
def thread_split(arn):
    name = f"ar{arn}-cl{CL}"
    model_export = f"{workfolder}/assets/models_ar_n"
    model_artifact = f"{workfolder}/assets/artifacts/ar{arn}-cl{CL}/"
    os.makedirs(model_artifact, exist_ok = True)
    
    # create symlink to export
    symlink_src = os.path.join(model_artifact, 'src')
    symlink_path = Path(symlink_src)
    if symlink_path.is_symlink():
        os.unlink(symlink_src)
    os.symlink(src = os.path.join(model_export, name), dst = symlink_src)
    
    os.makedirs(f"{model_artifact}/split_onnx", exist_ok = True)
    TEST_VECTOR_PICKLE_TYPE = "pkl"
    print(f"Starting {onnx_name}.onnx")
    utils.split_onnx(onnxfile = f"{model_artifact}/src/onnx/{onnx_name}.onnx", modelname = name, 
                     pickle_filedir = os.path.join(model_export, f"ar{arn}-cl{CL}/test_vectors"),
                     num_splits = num_splits, output_dir = model_artifact, split_embedding = False,
                     encoding_file = f"{model_artifact}/src/onnx/{onnx_name}.encodings",using_qairt_workflow = True
                     )
    print(f"Ending {onnx_name}.onnx")

with event_marker(f'split-onnx'):
    with concurrent.futures.ProcessPoolExecutor(max_workers = len(ARNs) if go_parallel else 1) as executor:
        # list() consumes the iterator so exceptions raised in workers surface here
        results = list(executor.map(thread_split,ARNs))
print("All ONNX models split.")

Starting qwen3.onnx
Starting qwen3.onnx
Loading /home/azureuser/zack/qnn-expr/llama32-compute/qwen3_mha_model/Step-2/host_linux/assets/artifacts/ar1-cl4096//src/onnx/qwen3.onnx
Loading /home/azureuser/zack/qnn-expr/llama32-compute/qwen3_mha_model/Step-2/host_linux/assets/artifacts/ar128-cl4096//src/onnx/qwen3.onnx
Per_layer_output_names: ['/Add_4/Add_output_0']
Per_layer_output_names: ['/Add_4/Add_output_0']
Using per-layer output shape: [1, 128, 1024]
Using per-layer output shape: [1, 1, 1024]
Names_to_split []
Names_to_split []
Saving /home/azureuser/zack/qnn-expr/llama32-compute/qwen3_mha_model/Step-2/host_linux/assets/artifacts/ar128-cl4096//split_onnx/ar128-cl4096_1_of_1.onnx
Saving /home/azureuser/zack/qnn-expr/llama32-compute/qwen3_mha_model/Step-2/host_linux/assets/artifacts/ar1-cl4096//split_onnx/ar1-cl4096_1_of_1.onnx
/home/azureuser/zack/qnn-expr/llama32-compute/qwen3_mha_model/Step-2/host_linux/assets/models_ar_n/ar128-cl4096/test_vectors/qt
/home/azureuser/zack/qnn-expr/llama32-compute/qwen3_mha_model/Step-2/host_linux/assets/models_ar_n/ar1-cl4096/test_vectors/qt
Mapping test vector 'lm_head_conv_Conv' to '/lm_head_conv_Conv/Conv'
Mapping test vector 'lm_head_conv_Conv_output_0_nchw' to '/lm_head_conv_Conv_output_0_nchw/Transpose'
Ending qwen3.onnx
Mapping test vector 'lm_head_conv
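
Exceptions raised inside workers dispatched with `executor.map` only surface when the returned iterator is consumed, so a failed task can go unnoticed if the result is never read. The sketch below demonstrates that behavior with a thread pool and a hypothetical `work` function, chosen for brevity; the notebook itself uses a process pool.

```python
import concurrent.futures

def work(n):
    """Hypothetical worker that fails for one input."""
    if n == 2:
        raise ValueError(f"task {n} failed")
    return n * 10

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    lazy = executor.map(work, [1, 2, 3])  # nothing is raised here

try:
    results = list(lazy)                  # the ValueError surfaces here
except ValueError as exc:
    caught = str(exc)
    print("worker error:", caught)
```

Wrapping the call in `list(...)` is therefore a simple way to make a parallel step fail loudly.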

### Convert attention layers from MHA to SHA

The `mha2sha-onnx-converter` tool converts a model from MHA representation to its equivalent SHA representation. The encoding files generated from the AIMET workflow are provided as an input to this step via the `--exported-model-encoding-path` option.

This step generates a new `.onnx` file that represents the model in SHA format.

Expected execution time: ~10 minutes

In [10]:
mha2sha_root = workfolder+"/../../../common/G2G/MHA2SHA"
g2g_env = os.environ.copy()
g2g_env["PYTHONPATH"] = os.pathsep.join([g2g_env.get("PYTHONPATH", ""), os.path.join(mha2sha_root, "src/python")])
g2g_env["PATH"] = os.pathsep.join([g2g_env.get("PATH", ""), os.path.join(mha2sha_root, "bin")])
print(f"MHA2SHA tool root set to: {mha2sha_root}")

def thread_g2g(arn,split):
    import os
    os.chmod(os.path.join(mha2sha_root, "bin", "mha2sha-onnx-converter"), 0o777)
    os.chmod(os.path.join(mha2sha_root, "bin", "env_setup.sh"), 0o777)
    model_artifact = f"{workfolder}/assets/artifacts/ar{arn}-cl{CL}/"
    split_work_dir = os.path.join(model_artifact,f"{split}_of_{num_splits}")
    name = f"ar{arn}-cl{CL}_{split}_of_{num_splits}"
    os.makedirs(split_work_dir, exist_ok = True)
    sha_folder = f"{split_work_dir}/sha_output/"
    os.makedirs(sha_folder, exist_ok = True)
    print(f"mha2sha-onnx-converter {name} running...")
    args=["mha2sha-onnx-converter",
                        "--sha-export-path", sha_folder,
                        "--model-name", name,
                        "--exported-model-encoding-path", f"{model_artifact}/src/onnx/{onnx_name}.encodings",
                        "--exported-model-path", f"{model_artifact}/split_onnx/{name}.onnx",
                        "--base-llm", "llama3",
                        "--mha-conv",
                        "--nchw-aligned",
                        "--handle-internal-rmsnorm",
                        "--log-level", "debug"]
    proc = subprocess.Popen(args,stdout = subprocess.PIPE, stderr = subprocess.PIPE, env = g2g_env)
    output, error = proc.communicate()
    print(output.decode(),error.decode())
    print(f"mha2sha-onnx-converter {name} done.")

for arn, split in zip(arn_list, split_idxs):
    thread_g2g(arn, split)
print(f"All mha2sha convert done.")

MHA2SHA tool root set to: /home/azureuser/zack/qnn-expr/llama32-compute/qwen3_mha_model/Step-2/host_linux/../../../common/G2G/MHA2SHA
mha2sha-onnx-converter ar1-cl4096_1_of_1 running...
                              Qualcomm MHA2SHA
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Flag                           ┃ Value                                   ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ --ar-num [ To be deprecate]    │ None                                    │
├────────────────────────────────┼─────────────────────────────────────────┤
│ --base-llm                     │ "llama3"                                │
├────────────────────────────────┼─────────────────────────────────────────┤
│ --build-ar                     │ None                                    │
├───────────
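
The cells in this notebook print each tool's output but do not inspect its exit code, so a failing tool does not stop the run. One possible wrapper is sketched below; `run_tool` is a hypothetical helper, and the Python interpreter is used as a stand-in command so that the example runs without the SDK.

```python
import subprocess
import sys

def run_tool(args, env=None):
    """Run a command, echo its output, and raise if it exits with an error."""
    proc = subprocess.run(args, capture_output=True, text=True, env=env)
    print(proc.stdout, proc.stderr)
    if proc.returncode != 0:
        raise RuntimeError(f"{args[0]} exited with code {proc.returncode}")
    return proc.stdout

# Stand-in command: the Python interpreter prints a line and exits with code 0
out = run_tool([sys.executable, "-c", "print('ok')"])
```

The same helper could take the `args` list and `g2g_env` built in the cell above in place of the stand-in command.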

## Convert the model from ONNX representation to QNN DLC representation

The Qualcomm AI Engine Direct SDK `qairt-converter` tool converts a model from ONNX representation to its equivalent QNN DLC representation. The encoding files generated from the AIMET workflow are provided as an input to this step via the `--quantization_overrides model.encodings` option.

This step generates a `.dlc` file that represents the model as a series of QNN API calls.

Expected execution time: under 20 minutes

In [9]:
def thread_convert(arn,split):
    model_artifact = f"{workfolder}/assets/artifacts/ar{arn}-cl{CL}/"
    split_work_dir = os.path.join(model_artifact,f"{split}_of_{num_splits}")
    name = f"ar{arn}-cl{CL}_{split}_of_{num_splits}"
    os.makedirs(split_work_dir, exist_ok = True)
    out_dir = os.path.join(split_work_dir, "converted_model")
    os.makedirs(out_dir, exist_ok = True)
    
    # create symlink to export
    for src in [f"input_list_{name}.txt",f"test_inputs_{name}"]:
        symlink_input = os.path.join(split_work_dir, src)
        symlink_path = Path(symlink_input)
        if symlink_path.is_symlink():
            os.unlink(symlink_input)
        os.symlink(src = os.path.join(model_artifact, src), dst = symlink_input)       

    input_onnx=f"{split_work_dir}/sha_output/{name}.onnx"
    quantization_overrides= f"{split_work_dir}/sha_output/{name}.encodings"
    
    args = [QNN_SDK_ROOT + "/bin/x86_64-linux-clang/qairt-converter",
                    "--input_network", input_onnx,
                    "--quantization_overrides", quantization_overrides,
                    "-o", f'{out_dir}/{name}.dlc'
                    ]
    options = utils.get_input_layout(input_onnx, using_qairt_workflow = True)
    for entry in options:
        args+=entry
    
    proc = subprocess.Popen(args, stdout = subprocess.PIPE, stderr = subprocess.PIPE, env = qnn_env)
    output, error = proc.communicate()
    print(output.decode(), error.decode())
    print(f"qairt-converter {name} done!")

with event_marker(f'convert-onnx'):
    for arn, split in zip(arn_list, split_idxs):
        thread_convert(arn, split)

print(f"All qairt-converter done.")

Loading /home/azureuser/zack/qnn-expr/llama32-compute/qwen3_mha_model/Step-2/host_linux/assets/artifacts/ar1-cl4096/1_of_1/sha_output/ar1-cl4096_1_of_1.onnx


FileNotFoundError: [Errno 2] No such file or directory: '/home/azureuser/zack/qnn-expr/llama32-compute/qwen3_mha_model/Step-2/host_linux/assets/artifacts/ar1-cl4096/1_of_1/sha_output/ar1-cl4096_1_of_1.onnx'
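
The converter reads `sha_output/{name}.onnx` and `sha_output/{name}.encodings`, which the MHA-to-SHA step produces, so that step has to complete first. A precondition check along the following lines reports missing inputs with a clearer message; `missing_inputs` and `require_inputs` are hypothetical helpers.

```python
from pathlib import Path

def missing_inputs(paths):
    """Return the subset of paths that do not exist on disk."""
    return [str(p) for p in map(Path, paths) if not p.exists()]

def require_inputs(stage, paths):
    """Raise a descriptive error if any input of a stage is missing."""
    missing = missing_inputs(paths)
    if missing:
        raise FileNotFoundError(
            f"{stage}: run the previous step first, missing {missing}"
        )
```

In `thread_convert`, a call such as `require_inputs("qairt-converter", [input_onnx, quantization_overrides])` placed before launching the tool would stop early with the list of missing files.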

## Quantized QNN DLC model

The Qualcomm AI Engine Direct SDK `qairt-quantizer` tool compiles the model `.dlc` and input `.raw` files into a quantized `.dlc` file (`{name}_quantized.dlc` in the code below).

The inputs to this stage are the input raw files and the `.dlc` file generated in the previous step.

Expected execution time: under 10 minutes


In [None]:
def thread_genlib(arn,split):
    model_artifact = f"{workfolder}/assets/artifacts/ar{arn}-cl{CL}/"
    split_work_dir = os.path.join(model_artifact,f"{split}_of_{num_splits}")
    name = f"ar{arn}-cl{CL}_{split}_of_{num_splits}"
    os.chdir(split_work_dir)
    out_dir = os.path.join(split_work_dir,"compiled_model")
    os.makedirs( os.path.join(split_work_dir,"compiled_model"), exist_ok = True)

    float_dlc_file = os.path.join(split_work_dir, "converted_model", f'{name}.dlc')
    quantized_dlc_file = os.path.join(out_dir, f'{name}_quantized.dlc')  
    ip_list_file = os.path.join(model_artifact, f'input_list_{name}.txt')
    
    proc = subprocess.Popen([QNN_SDK_ROOT + "/bin/x86_64-linux-clang/qairt-quantizer",
                            "--input_dlc", float_dlc_file,
                            "--input_list", ip_list_file,
                            "--output_dlc", quantized_dlc_file,
                            "--act_bitwidth", "16",
                            "--bias_bitwidth", "32"
                            ],stdout = subprocess.PIPE, stderr = subprocess.PIPE, env = qnn_env)
    output, error = proc.communicate()
    print(output.decode(), error.decode())
    print(f"qairt-quantizer {name} done!")
    os.chdir(workfolder)

with event_marker(f'qairt-quantizer'):
    for arn, split in zip(arn_list, split_idxs):
        thread_genlib(arn, split)

print(f"All qairt-quantizer done.")
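
Each stage writes into a fixed location under `assets/artifacts`, and the next stage reads from it. The sketch below collects the paths used by the cells above in one place; `stage_paths` is a hypothetical helper, and the folder and file names are the ones that appear in the notebook code.

```python
import os

def stage_paths(workfolder, arn, split, cl=4096, num_splits=1):
    """Return the per-stage artifact paths for one (AR, split) task."""
    name = f"ar{arn}-cl{cl}_{split}_of_{num_splits}"
    artifact = os.path.join(workfolder, "assets", "artifacts", f"ar{arn}-cl{cl}")
    split_dir = os.path.join(artifact, f"{split}_of_{num_splits}")
    return {
        "split_onnx": os.path.join(artifact, "split_onnx", f"{name}.onnx"),
        "sha_onnx": os.path.join(split_dir, "sha_output", f"{name}.onnx"),
        "float_dlc": os.path.join(split_dir, "converted_model", f"{name}.dlc"),
        "quantized_dlc": os.path.join(
            split_dir, "compiled_model", f"{name}_quantized.dlc"
        ),
    }

for stage, path in stage_paths("/work", arn=128, split=1).items():
    print(f"{stage:14s} {path}")
```

Listing the paths this way makes it easier to see which stage a missing file belongs to.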


## QNN HTP weight sharing context binary

The Qualcomm AI Engine Direct SDK `qnn-context-binary-generator` tool creates a QNN context binary applicable to the QNN HTP backend. This binary can be deployed to run on a Snapdragon 8 Gen2 / Gen4 device that runs Android. This step requires the ar128 and ar1 quantized DLCs from the previous step and the `libQnnHtp.so` library, available in the Qualcomm AI Engine Direct SDK.

Provide additional options that pertain to the QNN HTP backend by passing the `libQnnHtpNetRunExtensions.so` library, which implements extensions for the QNN HTP backend. The library is available in the Qualcomm AI Engine Direct SDK.

### Define HTP perf settings

In [None]:
import os
import json 

def make_config_file(index, folder, src_graphs, soc_id=43, dsp_arch="v73"): # For GEN4, set soc_id=69 and dsp_arch="v79" here
    htp_config_json = os.path.join(folder, f"HtpConfigFile_API_{index}.json")
    perf_config_json = os.path.join(folder, f"PerfSetting_API_{index}.conf")

    soc_id = int(soc_id)
    with open(htp_config_json, 'w') as f:
        config = {
            "backend_extensions": {
                "shared_library_path": "libQnnHtpNetRunExtensions.so", 
                "config_file_path": f"{perf_config_json}"
            }
        }
        
        json.dump(config, f, indent=4)

    with open(perf_config_json,'w') as f:
        config = {
            "graphs": [{
                "O": 3.0,
                "vtcm_mb": 8,
                "graph_names": src_graphs,
                "fp16_relaxed_precision": 0
            }],
            "devices": [
                {
                    "soc_id": soc_id,
                    "dsp_arch": dsp_arch,
                    "cores": [
                        {
                            "perf_profile": "burst",
                            "rpc_control_latency": 100
                        }
                    ],
                    "pd_session": "unsigned"
                }
            ], 
            "context": {
                    "weight_sharing_enabled": len(src_graphs) > 1
            }, 
            "memory": {
                    "mem_type": "shared_buffer"
            }
        }
        json.dump(config, f, indent = 4)    
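
The `soc_id` and `dsp_arch` values have to match the target device. The comment on `make_config_file` gives two pairs (43 / v73 as the default, and 69 / v79 for GEN4), and the prose in this section names Gen2 and Gen4 devices; labelling the default pair as Gen2 is an assumption made here. The notebook reads these values from `NspTargets`, so the lookup below is only an illustrative stand-in; `HTP_TARGETS` and `htp_target` are hypothetical names.

```python
# (soc_id, dsp_arch) pairs taken from the comment on make_config_file.
# Mapping the default pair to "gen2" is an assumption of this sketch.
HTP_TARGETS = {
    "gen2": (43, "v73"),
    "gen4": (69, "v79"),
}

def htp_target(generation):
    """Look up (soc_id, dsp_arch) for a device generation, case-insensitively."""
    key = generation.lower()
    if key not in HTP_TARGETS:
        raise ValueError(
            f"unknown target {generation!r}; known: {sorted(HTP_TARGETS)}"
        )
    return HTP_TARGETS[key]

soc_id, dsp_arch = htp_target("gen4")
print(soc_id, dsp_arch)  # 69 v79
```

The returned pair can be passed straight to `make_config_file` as its `soc_id` and `dsp_arch` arguments.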

### Compile context binary
Expected execution time: ~10 minutes

In [None]:
import subprocess

soc_id = nsp_target.soc_id
dsp_arch = nsp_target.dsp_arch

def thread_gen_ws_cb(i):
    ar128_src = f"{workfolder}/assets/artifacts/ar128-cl{CL}/"
    ar1_src = f"{workfolder}/assets/artifacts/ar1-cl{CL}/"
    output_dir = f"{workfolder}/assets/artifacts/ar128-ar1-cl{CL}_conf_files/"
    ctx_output_dir = f"{workfolder}/assets/artifacts/ar128-ar1-cl{CL}/"  
    os.makedirs(output_dir, exist_ok = True)
    os.makedirs(ctx_output_dir, exist_ok = True)

    src1_split_folder = os.path.join(ar128_src, f"{i}_of_{num_splits}", "compiled_model")
    src2_split_folder = os.path.join(ar1_src, f"{i}_of_{num_splits}", "compiled_model")

    src1_graph_name = f"ar128-cl{CL}_{i}_of_{num_splits}"
    src1_q_dlc = os.path.join(src1_split_folder, f"{src1_graph_name}_quantized.dlc")
    src2_graph_name = f"ar1-cl{CL}_{i}_of_{num_splits}"
    src2_q_dlc = os.path.join(src2_split_folder, f"{src2_graph_name}_quantized.dlc")

    graph_list = [src1_graph_name, src2_graph_name]
    make_config_file(i, output_dir, graph_list, soc_id, dsp_arch)

    cmd = ["qnn-context-binary-generator",
            "--log_level=verbose",
            "--backend","libQnnHtp.so",
            "--model", "libQnnModelDlc.so",
            "--input_output_tensor_mem_type", "memhandle",
            "--output_dir", ctx_output_dir,
            "--config_file",f"{output_dir}/HtpConfigFile_API_{i}.json",
            "--binary_file", f"weight_sharing_model_{i}_of_{num_splits}.serialized",
            "--dlc_path", f"{src1_q_dlc},{src2_q_dlc}"]
    proc = subprocess.Popen(cmd, stdout = subprocess.PIPE, stderr = subprocess.PIPE, env = qnn_env)
    output, error = proc.communicate()
    print(output.decode(), error.decode())
    print(f'#{i} weight sharing model generated')

with event_marker(f'gen-binary'):
    with concurrent.futures.ProcessPoolExecutor(max_workers = len(splits) if go_parallel else 1) as executor:
        # list() consumes the iterator so exceptions raised in workers surface here
        results = list(executor.map(thread_gen_ws_cb, splits))
print(f"All weight shared qnn-context-binary generated.")
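
After the generator finishes, it is worth confirming that a context binary was actually written. The exact file name produced by the tool is not shown in this notebook, so the sketch below matches on the `weight_sharing_model_` prefix passed to `--binary_file`; `list_context_binaries` is a hypothetical helper.

```python
from pathlib import Path

def list_context_binaries(ctx_output_dir, prefix="weight_sharing_model_"):
    """Return (file name, size in bytes) for each generated file with the prefix."""
    files = sorted(Path(ctx_output_dir).glob(f"{prefix}*"))
    return [(f.name, f.stat().st_size) for f in files if f.is_file()]
```

An empty list, or an entry with a size of zero, for the `ar128-ar1-cl{CL}` output folder suggests that the generator failed and that its log output should be inspected.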

### Save profiling stats

In [None]:
from utilities.profiler import EventProfiler
EventProfiler().report()
EventProfiler().json_dump(os.path.join(workfolder, 'assets/profiling_stats.json'))

Upon completion of these steps to prepare the TinyLlama 1.1B model for inference, the QNN context binaries are available in `./assets/artifacts`.
The next step is to execute the prepared models (now represented as serialized context binaries) on a Snapdragon 8 Gen2 / Gen4 Android device using executable utilities available in the Qualcomm AI Engine Direct SDK.


Copyright (c) 2024 Qualcomm Technologies, Inc. and/or its subsidiaries.