# AIMET Quantization workflow for Llama 3.2 3B Context Length 4K

This notebook shows a working code example of how to use AIMET to quantize the LlamaV3.2 model. Note that the cells below instantiate `Qwen/Qwen3-0.6B` through the same workflow.


---

### Required packages

The notebook assumes that AIMET and the LlamaV3.2-related packages are already installed.


In [1]:
# Install packages only if running in jupyter notebook mode
# if hasattr(__builtins__,'__IPYTHON__'):
#     !sudo -H apt-get -qq update
#     !sudo -H apt-get -qq install libc++-dev
#     !sudo -H pip install --quiet --upgrade --root-user-action=ignore --no-cache-dir transformers==4.43.2
#     !sudo -H pip install --quiet --upgrade --root-user-action=ignore --no-cache-dir tokenizers==0.19.0

### Overall flow

This notebook covers the following steps:

1. Setting QNN SDK
2. Instantiate and adapt FP32 model
3. Complete the last step(s) of model adaptation
4. Convert FP32 model to FP16
5. Model Evaluation
6. Model Sample Input
7. Prepare model using AIMET model preparer pro
8. Quantization
9. Export

### What this notebook is not

- This notebook is not intended to show the full scope of optimization. For example, the flow deliberately omits QAT and KD-QAT so that the notebook executes more quickly.


### 1.1 Setting QNN SDK


In [2]:
import sys
import os

QNN_SDK_ROOT = '/opt/qcom/aistack/qairt/2.34.2.250528/'  # QAIRT SDK 2.34.2 (matches the [INFO] output below)
assert QNN_SDK_ROOT != '', 'Please point the QNN_SDK_ROOT variable to your QNN SDK'
os.environ['QNN_SDK_ROOT'] = QNN_SDK_ROOT

# NOTE: both calls below run envsetup.sh in a child shell, so the variables it exports
# do not persist in this kernel. os.system() uses /bin/sh, which may not provide `source`.
os.system(f'source {QNN_SDK_ROOT}/bin/envsetup.sh')

!source {QNN_SDK_ROOT}/bin/envsetup.sh

# The paths this Python process needs are therefore set explicitly.
lib_clang_path = os.path.join(QNN_SDK_ROOT, 'lib', 'x86_64-linux-clang')
sys.path.insert(0, QNN_SDK_ROOT + '/lib/python')
LD_LIBRARY_PATH = os.getenv('LD_LIBRARY_PATH', None)
os.environ['LD_LIBRARY_PATH'] = (lib_clang_path + ':' + LD_LIBRARY_PATH) if LD_LIBRARY_PATH is not None else lib_clang_path
enable_fp16 = False  # Flag to enable the end-to-end FP16 flow; set to False for the FP32 flow

sh: 1: source: not found


[INFO] AISW SDK environment set
[INFO] QNN_SDK_ROOT: /opt/qcom/aistack/qairt/2.34.2.250528
[INFO] SNPE_ROOT: /opt/qcom/aistack/qairt/2.34.2.250528
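
The `sh: 1: source: not found` line above is a reminder that `os.system` and `!` run commands in a child shell, so variables exported by `envsetup.sh` never reach the notebook kernel. One possible workaround, sketched here under the assumption that `bash` is available, is to source the script in a subprocess and copy the resulting environment back. The helper name `env_after_sourcing` and the demo script are hypothetical and not part of the QNN SDK.

```python
import os
import subprocess
import tempfile


def env_after_sourcing(script_path):
    """Return the environment (as a dict) that results from sourcing script_path in bash."""
    out = subprocess.run(
        # "$1" is the script path; `env -0` prints NUL-separated NAME=value pairs
        ["bash", "-c", 'source "$1" >/dev/null 2>&1 && env -0', "bash", script_path],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return dict(item.split("=", 1) for item in out.split("\0") if "=" in item)


# Demo with a throwaway script standing in for envsetup.sh (hypothetical content)
with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
    f.write("export DEMO_SDK_ROOT=/tmp/demo_sdk\n")
    demo_script = f.name

captured = env_after_sourcing(demo_script)
os.remove(demo_script)
print(captured["DEMO_SDK_ROOT"])
```

In the notebook, something like `os.environ.update(env_after_sourcing(os.path.join(QNN_SDK_ROOT, 'bin', 'envsetup.sh')))` would then apply the captured variables to the kernel, assuming the script behaves the same way under `bash`.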


### 1.2 Setting NSP Target


In [3]:
sys.path.append('../../common/')
from utilities.nsptargets import NspTargets

# Windows GEN 2 is supported for this notebook
nsp_target = NspTargets.Windows.GEN2

## Select the quantsim config based on the target
## HACK: consider replacing this with a fixed config in the future
htp_config_file = f'/home/azureuser/zack/qnn-expr/llama32-compute/qwen3_mha_model/htp_quantsim_config_{nsp_target.dsp_arch}.json'
print(htp_config_file)

/home/azureuser/zack/qnn-expr/llama32-compute/qwen3_mha_model/htp_quantsim_config_v73.json


---

### 2. Instantiate and adapt FP32 model

Next, we instantiate the FP32 model using our own implementation of the Qwen3 model (`QNNQwen3`).


In [4]:
import os
import sys

os.environ['HF_TOKEN'] = '<YOUR_HF_TOKEN>'  # placeholder: supply your own token and never commit a real one
import torch
from hf_tokenizers import Tokenizer
sys.path.append("../../../qwen3_torch")
from modeling_qwen3 import QNNQwen3, QNNLLMUtils

model_name = 'qwen3'
model_id = 'Qwen/Qwen3-0.6B'

cache_dir = './cache_dir'
output_dir = f'./output_dir_{os.path.basename(model_id)}'
os.makedirs(output_dir, exist_ok=True)

qnn_model = QNNQwen3.from_pretrained(model_id, cache_dir=cache_dir)
tokenizer = Tokenizer("/home/azureuser/zack/qnn-expr/experiments/qwen3_tokenizer.json")


# ==== restrict to use only 1 layer ====
qnn_model.model.layers = qnn_model.model.layers[:1]
qnn_model.config.num_hidden_layers = 1
config = qnn_model.config
# NOTE: the value labeled "num_hidden_size" below is actually config.num_attention_heads
print(f'num_layer: {config.num_hidden_layers}'
      f',  num_hidden_size :{config.num_attention_heads},  num_kv_heads: {config.num_key_value_heads}')
# ======================================

qnn_model.qnn_init()
config = qnn_model.config
SEQ_LEN = 2073
KV_LEN = 4096 - SEQ_LEN
device = "cuda" if torch.cuda.is_available() else "cpu"
qnn_model.to(device)
qnn_llm_utils = QNNLLMUtils(SEQ_LEN, KV_LEN, device, config)



num_layer: 1,  num_hidden_size :16,  num_kv_heads: 8


In [5]:
qnn_model

QNNQwen3(
  (model): Module(
    (embed_tokens): Embedding(151936, 1024)
    (layers): ModuleList(
      (0): QNNQwen3DecoderLayer(
        (self_attn): QNNQwen3Attention(
          (q_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (k_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (q_proj_conv): Conv2d(1024, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (k_proj_conv): Conv2d(1024, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (v_proj_conv): Conv2d(1024, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (o_proj_conv): Conv2d(2048, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
        )
        (mlp): QNNQwen3MLP(
          (act_fn): SiLU()
          (gate_proj_conv): Conv2d(1024, 3072, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (up_proj_conv): Conv2d(1024, 3072, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (down_proj_conv): Conv2d(3072, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
        )
     

#### 2.1 Direct verification

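Using the model loaded above, we run a short greedy-decoding loop to check that inference works end to end. With the full model the output should be reasonable text; with the single-layer truncation applied above, expect gibberish, as in the output below.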


In [6]:
eos_token_id = 151645
eos_tokens = {151645}
input_text = (
    "<|im_start|>user\nIntroduce Newton's first law of motion. Be short and concise.<|im_end|>"
    "\n<|im_start|>assistant\n"
)
input_ids_list = tokenizer.encode(input_text)  # a list of ints

n_past = 0
curr_len = len(input_ids_list)

if curr_len < SEQ_LEN:
    input_ids_list = input_ids_list + [eos_token_id] * (SEQ_LEN - curr_len)

input_ids = torch.tensor(input_ids_list, dtype=torch.long, device=device).unsqueeze(0)
attention_mask = qnn_llm_utils.get_attention_mask(n_past, curr_len)
position_ids = qnn_llm_utils.get_position_ids(n_past, SEQ_LEN)
cos, sin = qnn_llm_utils.get_cos_sin(attention_mask, position_ids)  # attn_mask as dummy input to get device
all_layer_kv_caches = qnn_llm_utils.get_kv_cache()
last_token_indices = torch.tensor([n_past + curr_len - 1], dtype=torch.long, device=device)

generated_ids = []

for i in range(10):
    with torch.no_grad():

        outputs = qnn_model(input_ids, cos, sin, attention_mask, all_layer_kv_caches)
        next_token_logits = outputs[0][:, last_token_indices, :]
        next_token_id = torch.argmax(next_token_logits, dim=-1).item()
        generated_ids.append(next_token_id)

        ## update the inputs for next token
        all_layer_kv_caches = qnn_llm_utils.update_kv_cache(all_layer_kv_caches, outputs[1:], n_past, curr_len)
        n_past += curr_len
        curr_len = 1
        next_input_ids = [next_token_id] + [eos_token_id] * (SEQ_LEN - 1)
        input_ids = torch.tensor(next_input_ids, dtype=torch.long, device=device).unsqueeze(
            0
        )  # set bs_size = 1 manually
        last_token_indices = torch.tensor([curr_len - 1], dtype=torch.long, device=device)
        attention_mask = qnn_llm_utils.get_attention_mask(n_past, curr_len)
        position_ids = qnn_llm_utils.get_position_ids(n_past, SEQ_LEN)
        cos, sin = qnn_llm_utils.get_cos_sin(attention_mask, position_ids)

    if next_token_id in eos_tokens:
        break

generated_text = tokenizer.decode(generated_ids)
print(generated_text)

[nodeelefaultsfulsivenessivenessivenessiveness
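
The loop above always feeds a window of exactly `SEQ_LEN` tokens: the prompt is padded on the first call, and each later call carries one real token plus padding. The bookkeeping (`n_past`, `curr_len`, and which logit position to read) can be sketched in plain Python. This is only an illustration: `toy_model` is a hypothetical stand-in that ignores attention and the KV cache, and the tiny `SEQ_LEN` and `PAD` values are made up.

```python
SEQ_LEN = 8  # toy window size (the notebook uses 2073)
PAD = 0      # toy padding id (the notebook pads with the EOS token id)


def pad(ids):
    """Right-pad a list of token ids to the fixed window length."""
    return ids + [PAD] * (SEQ_LEN - len(ids))


def toy_model(window):
    """Hypothetical stand-in: the argmax 'prediction' at each position is (token + 1) % 100."""
    return [(t + 1) % 100 for t in window]


def greedy_decode(prompt_ids, max_new_tokens, eos_id):
    n_past, curr_len = 0, len(prompt_ids)
    window = pad(prompt_ids)
    generated = []
    for _ in range(max_new_tokens):
        preds = toy_model(window)
        next_id = preds[curr_len - 1]  # read only the last valid (non-padded) position
        generated.append(next_id)
        n_past += curr_len             # everything fed so far now counts as past context
        curr_len = 1                   # later steps carry a single real token
        window = pad([next_id])
        if next_id == eos_id:
            break
    return generated, n_past


tokens, n_past = greedy_decode([5, 6, 7], max_new_tokens=4, eos_id=10)
print(tokens, n_past)  # [8, 9, 10] 5
```

Reading position `curr_len - 1` rather than the last position of the window is what keeps the padding from influencing the chosen token.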


---

### 3. Convert FP32 model to FP16


The following code contains a temporary measure needed to maintain model accuracy when converting to FP16. RMSNorm operators are very sensitive to reduced bitwidth, so their input tensor must first be upcast to FP32. Once the QNN converter is able to recognize and coalesce RMSNorm operations, this upconversion will be handled automatically. Until then, we insert operators that upcast tensors to FP32 before the first RMSNorm component operation and downcast them back to FP16 once all RMSNorm component ops are complete.


In [7]:
from aimet_torch import elementwise_ops

class PreCast(torch.nn.Module):
    def __init__(self, module, dtype):
        super(PreCast, self).__init__()
        self.module = module
        self.upcast = elementwise_ops.Cast(dtype)

    def forward(self, *inputs):
        casted_inputs = [self.upcast(input) for input in inputs]
        return self.module(*casted_inputs)

class PostCast(torch.nn.Module):
    def __init__(self, module, dtype):
        super(PostCast, self).__init__()
        self.module = module
        self.downcast = elementwise_ops.Cast(dtype)

    def forward(self, *inputs):
        output = self.module(*inputs)
        casted_output = self.downcast(output)
        return casted_output
    
# Helper function to convert FP32 model to FP16
# Inserts upcast and downcast operators around RMSnorm operators if found in the graph
## NEXA: After model preparation, rms norm will have two ops norm_Pow and norm_Mul_1,
#        This is why we are using the norm_Pow, and norm_Mul_1.
def convert_model_to_fp16(model):
    model.half()
    for name, module in model.named_modules():
        if name.endswith("norm_Pow"):
            setattr(model, name, PreCast(module, torch.float32))
        if name.endswith("norm_Mul_1"):
            setattr(model, name, PostCast(module, torch.float16))
            
# Helper function to convert FP16 model back to FP32
# Removes upcast and downcast operators inserted by convert_model_to_fp16, if present
def convert_model_to_fp32(model):
    model.float()

    # Snapshot the module list first, since modules are replaced while iterating
    for name, module in list(model.named_modules()):
        # Only unwrap modules that were actually wrapped by convert_model_to_fp16
        if isinstance(module, (PreCast, PostCast)) and name.endswith(("norm_Pow", "norm_Mul_1")):
            setattr(model, name, module.module)

2025-07-26 16:29:11,197 - root - INFO - aimetpro-release-1.34.0_Build_Id_0.207.0.44.torch-gpu-pt113-release
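
To see why the squaring step inside RMSNorm is fragile in FP16, the sketch below emulates half-precision rounding with the standard library's `struct` half-float format. It is an illustration only: the helper names are hypothetical, Python floats stand in for the higher-precision path, and real kernels may accumulate differently. The largest finite FP16 value is 65504, so squaring an activation of 300 already overflows.

```python
import math
import struct


def to_fp16(x):
    """Round x to the nearest FP16 value; finite values beyond the FP16 range become inf."""
    try:
        return struct.unpack("e", struct.pack("e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def rms_fp16(values):
    """Root-mean-square with every intermediate result rounded to FP16."""
    acc = 0.0
    for v in values:
        sq = to_fp16(to_fp16(v) * to_fp16(v))  # x^2, the step that overflows
        acc = to_fp16(acc + sq)
    return to_fp16(math.sqrt(to_fp16(acc / len(values))))


def rms_high_precision(values):
    """Same computation carried out in Python floats (stand-in for the upcast path)."""
    return math.sqrt(sum(v * v for v in values) / len(values))


activations = [300.0] * 4
rms_half = rms_fp16(activations)
rms_full = rms_high_precision(activations)
print(rms_half, rms_full)
```

Upcasting before the first component op and downcasting after the last one, as `PreCast` and `PostCast` do, keeps the intermediate square inside a representable range.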




In [8]:
model = qnn_model
if enable_fp16:
    convert_model_to_fp16(model)

---

### 4. Model Evaluation


In [9]:
from torch.nn import CrossEntropyLoss
from tqdm import tqdm
from datasets import load_dataset
from aimet_torch.pro.utils.profiler import event_marker

test_dataset = load_dataset("nexa4ai/qwen3_wiki_calib", split="test")
num_total_batches = len(test_dataset)


def ppl_eval(test_dataset, model, num_batches=0, data_type=torch.float32):
    first_sample = next(iter(test_dataset))["input_ids"]
    first_input_ids = torch.tensor(first_sample, dtype=torch.long, device=device).unsqueeze(0)
    curr_len = first_input_ids.shape[1]
    _, attention_mask, cos, sin, all_layer_kv_cache = qnn_llm_utils.prepare_inputs(first_input_ids)

    attention_mask = attention_mask.to(data_type)
    cos = cos.to(data_type)
    sin = sin.to(data_type)
    all_layer_kv_cache = [kv_cache.to(data_type) for kv_cache in all_layer_kv_cache]
    
    loss = 0
    if num_batches == 0:
        num_batches = num_total_batches
    else:
        num_batches = min(num_batches, num_total_batches)
    for batch_id, batch in enumerate(tqdm(test_dataset, total=num_batches, desc="Evaluating PPL")):
        if batch_id >= num_batches:
            break

        input_ids = batch["input_ids"]
        input_ids = torch.tensor(input_ids, dtype=torch.long, device=device).unsqueeze(0)
        curr_len = input_ids.shape[1]

        input_ids = torch.cat(
            [input_ids, torch.full((1, SEQ_LEN - curr_len), qnn_llm_utils.eos_token_id, device=input_ids.device)],
            dim=-1,
        )

        with torch.no_grad():
            output = model(input_ids, cos, sin, attention_mask, all_layer_kv_cache)

        lm_logits = output[0]
        shift_logits = lm_logits[:, :-1, :][:, :curr_len, :].contiguous().to(dtype=torch.float32)
        shift_labels = input_ids[:, 1:][:, :curr_len].contiguous().to(shift_logits.device)

        loss_fct = CrossEntropyLoss()
        loss += loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))

    loss = loss / num_batches
    ppl = loss.exp()
    return ppl

with event_marker("FP eval"):
    orig_ppl = ppl_eval(test_dataset, model, num_batches=20)
print(f"PPL: {orig_ppl}")

2025-07-26 16:29:11,505 - datasets - INFO - PyTorch version 1.13.1+cu117 available.
2025-07-26 16:29:14,005 - Utils - INFO - Created RAM watermark daemon process(pid=3666636) for pid=3666497, polling at 100.0 ms
2025-07-26 16:29:14,004 - Utils - INFO - Created Latency/Memory profiler: empty_cache=False
2025-07-26 16:29:14,012 - Utils - INFO - memory usage @ 'FP eval >> ' : GPU default:2.5 GB, RAM 3.1 GB


Evaluating PPL: 100%|██████████| 20/20 [00:00<00:00, 37.17it/s]

2025-07-26 16:29:21,080 - Utils - INFO - memory usage @ 'FP eval << ' : GPU default:7.8 GB, RAM 3.1 GB
2025-07-26 16:29:21,081 - Utils - INFO - Event FP eval : time=0:00:07, GPU=5.4 GB(+ 2.5 GB), RAM=33.0 MB(+ 3.1 GB)
PPL: 2703437.75
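
For reference, perplexity is the exponential of the mean cross-entropy over the evaluated positions, so a model that guesses uniformly over a vocabulary of size V scores exactly V. The embedding table printed earlier has 151936 entries, so a PPL of about 2.7 million is worse than a uniform guess, which is plausible for a model truncated to one layer. The following is a minimal pure-Python sketch of the metric (not the notebook's `ppl_eval`, which works on tensors and averages per-batch losses); the function names are illustrative.

```python
import math


def cross_entropy(logits, label):
    """Negative log-probability of `label` under softmax(logits), computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def perplexity(per_position_logits, labels):
    """exp(mean cross-entropy) over all positions."""
    losses = [cross_entropy(l, y) for l, y in zip(per_position_logits, labels)]
    return math.exp(sum(losses) / len(losses))


# Uniform logits over a 4-token vocabulary: perplexity is 4 up to rounding
uniform_ppl = perplexity([[0.0] * 4] * 3, [0, 1, 2])

# A confident, correct prediction drives perplexity toward 1
confident_ppl = perplexity([[10.0, 0.0, 0.0, 0.0]], [0])
print(uniform_ppl, confident_ppl)
```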





---

### 5. Model Sample Input


In [10]:
def get_dummy_data():
    input_ids = next(iter(test_dataset))["input_ids"]
    input_ids = torch.tensor(input_ids, dtype=torch.long, device=model.device).unsqueeze(0)
    input_ids, attention_mask, cos, sin, all_layer_kv_cache = qnn_llm_utils.prepare_inputs(input_ids)
    inputs_values = [input_ids, cos, sin, attention_mask]
    inputs_values.extend(all_layer_kv_cache)
    inputs_keys = ['input_ids', 'position_ids_cos', 'position_ids_sin','attention_mask'] 
    
    kv_inputs_keys = []
    for i in range(config.num_hidden_layers):
        kv_inputs_keys.append(f"past_key_{i}_in")
        kv_inputs_keys.append(f"past_value_{i}_in")
    inputs_keys.extend(kv_inputs_keys)
    inputs = dict(zip(inputs_keys, inputs_values))
    return inputs, all_layer_kv_cache

dummy_inputs, all_layer_kv_cache = get_dummy_data()
len(dummy_inputs)

# Traverse the inputs and print each key with its value's shape and dtype
for key, value in dummy_inputs.items():
    print(f"{key}: {value.shape}, {value.dtype}")

input_ids: torch.Size([1, 2073]), torch.int64
position_ids_cos: torch.Size([1, 1, 2073, 64]), torch.float32
position_ids_sin: torch.Size([1, 1, 2073, 64]), torch.float32
attention_mask: torch.Size([1, 1, 2073, 4096]), torch.float32
past_key_0_in: torch.Size([1, 8, 128, 2023]), torch.float32
past_value_0_in: torch.Size([1, 8, 2023, 128]), torch.float32
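
The shapes printed above follow from a handful of sizes: `SEQ_LEN = 2073`, `KV_LEN = 4096 - SEQ_LEN = 2023`, 8 KV heads, and a head dimension of 128. The sketch below recomputes them as a sanity check. The layout rules it encodes (keys stored transposed, cos/sin covering half the head dimension, mask width equal to `SEQ_LEN + KV_LEN`) are inferred from the printed shapes and are assumptions, not a documented contract of `QNNLLMUtils`.

```python
def expected_input_shapes(seq_len, kv_len, num_kv_heads, head_dim, num_layers):
    """Recompute the input shapes from the basic sizes (layout inferred from the printout)."""
    shapes = {
        "input_ids": (1, seq_len),
        "position_ids_cos": (1, 1, seq_len, head_dim // 2),
        "position_ids_sin": (1, 1, seq_len, head_dim // 2),
        # each query position can attend to the past cache plus the current window
        "attention_mask": (1, 1, seq_len, seq_len + kv_len),
    }
    for i in range(num_layers):
        shapes[f"past_key_{i}_in"] = (1, num_kv_heads, head_dim, kv_len)    # keys: transposed layout
        shapes[f"past_value_{i}_in"] = (1, num_kv_heads, kv_len, head_dim)
    return shapes


shapes = expected_input_shapes(seq_len=2073, kv_len=4096 - 2073,
                               num_kv_heads=8, head_dim=128, num_layers=1)
for name, shape in shapes.items():
    print(name, shape)
```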


---

### 6. Prepare model using AIMET model preparer pro

#### 6.1 KVCache MHA model preparation


In [11]:
import time

from aimet_torch.utils import load_pytorch_model
import aimet_torch.pro.ir_graph_op_handler as ir_graph_op_handler
from aimet_torch import onnx_utils
from aimet_torch.pro import model_preparer
onnx_utils.EXPORT_TO_ONNX_DIRECT = True
# Setting KEEP_ORIGINAL_MODEL_STRUCTURE to False flattens the prepared model.
# It must be False because we rely on a flat model structure to enable weight sharing.
ir_graph_op_handler.KEEP_ORIGINAL_MODEL_STRUCTURE = False

from aimet_utils.rmsnorm_update import RmsNorm, RmsNormOphandler
# Update the IR graph op handler registry with the new RmsNormOphandler class
# (ir_graph_op_handler is already imported above)
ir_graph_op_handler.ir_to_handler_dict['RmsNorm'] = RmsNormOphandler

# Register RmsNorm op definition in custom_modules_for_qnn
from aimet_torch.pro import custom_modules_for_qnn
setattr(custom_modules_for_qnn, 'RmsNorm', RmsNorm)



dummy_input, all_layer_kv_cache = get_dummy_data()
input_names = list(dummy_input.keys())
output_names = ['logits'] 
for i in range(config.num_hidden_layers):
    output_names.append(f'past_key_{i}_out')
    output_names.append(f'past_value_{i}_out')

# Build converter args
converter_args_param = ['--input_layout']
converter_args_value = 'NONTRIVIAL'
converter_args = []
for input_param in converter_args_param:
    for input_name in input_names:
        converter_args += [input_param, input_name, converter_args_value]

skip_prepare = False # This is done only once
prepare_path = os.path.join(output_dir, 'prepare')
os.makedirs(prepare_path, exist_ok=True)
prepare_filename = f'{model_name}_kvcache_{config.num_hidden_layers}_layer'


if skip_prepare:
    with event_marker(f"KVCache load pre-prepared {prepare_filename}", flush_ram=True):
        prepared_model_path = os.path.join(prepare_path, f'{prepare_filename}.py')
        if not os.path.exists(prepared_model_path):
            raise ValueError(f"prepared artifacts not found in {prepare_path}")
        else:
            print(f'WARNING: preparation skipped for model={prepare_filename}, prepared at {time.ctime(os.path.getmtime(prepared_model_path))}')
            prepared_model = load_pytorch_model(path=prepare_path, filename=prepare_filename,
                                                model_name=prepare_filename, load_state_dict=True)

else:
    with event_marker("KVCache prepare model", flush_ram=True):
        if enable_fp16:
            convert_model_to_fp32(model)
        # Use the dummy_input created at the top of this cell
        dummy_input_for_prepare = {
            "input_ids": dummy_input["input_ids"],
            "cos": dummy_input["position_ids_cos"],
            "sin": dummy_input["position_ids_sin"],
            "attention_mask": dummy_input["attention_mask"],
            "all_layers_kv_cache": all_layer_kv_cache
        }
        prepared_model = model_preparer.prepare_model(model,
                                                      dummy_input_for_prepare,
                                                      model_name=prepare_filename,
                                                      filename=prepare_filename,
                                                      path=prepare_path,
                                                      input_names=input_names,
                                                      output_names=output_names,
                                                      onnx_export_args={"opset_version":14},
                                                      converter_args=converter_args)
del model # original model no longer needed



2025-07-26 16:29:36,613 - Utils - INFO - memory usage @ 'KVCache prepare model[gc] >> ' : GPU default:2.6 GB, RAM 3.2 GB
2025-07-26 16:29:44,565 - root - INFO - Input shape info 


2025-07-26 16:29:44,565 - 270 - INFO - Input shape info 


2025-07-26 16:30:26,637 - Utils - INFO - memory usage @ 'KVCache prepare model[gc] << ' : GPU default:7.6 GB, RAM 8.1 GB
2025-07-26 16:30:26,638 - Utils - INFO - Event KVCache prepare model[gc] : time=0:00:50, GPU=5.1 GB(+ 2.6 GB), RAM=4.9 GB(+ 3.2 GB)


#### 6.2 Convert prepared model to FP16


In [12]:
if enable_fp16:
    convert_model_to_fp16(prepared_model)

#### 6.3 Model prepare verification

Verify that the prepared KV cache model produces approximately the same PPL as the FP model.


In [13]:
prepared_model.to(device)

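# Same as the earlier ppl_eval, except that the prepared (flattened) model takes the
# KV caches as separate positional inputs instead of a single list.
def ppl_eval(test_dataset, model, num_batches=0, data_type=torch.float32):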
    first_sample = next(iter(test_dataset))["input_ids"]
    first_input_ids = torch.tensor(first_sample, dtype=torch.long, device=device).unsqueeze(0)
    curr_len = first_input_ids.shape[1]
    _, attention_mask, cos, sin, all_layer_kv_cache = qnn_llm_utils.prepare_inputs(first_input_ids)

    attention_mask = attention_mask.to(data_type)
    cos = cos.to(data_type)
    sin = sin.to(data_type)
    all_layer_kv_cache = [kv_cache.to(data_type) for kv_cache in all_layer_kv_cache]
    
    loss = 0
    if num_batches == 0:
        num_batches = num_total_batches
    else:
        num_batches = min(num_batches, num_total_batches)
    for batch_id, batch in enumerate(tqdm(test_dataset, total=num_batches, desc="Evaluating PPL")):
        if batch_id >= num_batches:
            break

        input_ids = batch["input_ids"]
        input_ids = torch.tensor(input_ids, dtype=torch.long, device=device).unsqueeze(0)
        curr_len = input_ids.shape[1]

        input_ids = torch.cat(
            [input_ids, torch.full((1, SEQ_LEN - curr_len), qnn_llm_utils.eos_token_id, device=input_ids.device)],
            dim=-1,
        )

        with torch.no_grad():
            all_inputs = (input_ids, cos, sin, attention_mask, *all_layer_kv_cache)
            output = model(*all_inputs)

        lm_logits = output[0]
        shift_logits = lm_logits[:, :-1, :][:, :curr_len, :].contiguous().to(dtype=torch.float32)
        shift_labels = input_ids[:, 1:][:, :curr_len].contiguous().to(shift_logits.device)

        loss_fct = CrossEntropyLoss()
        loss += loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))

    loss = loss / num_batches
    ppl = loss.exp()
    return ppl

In [14]:
with event_marker("KVcache prepared FP eval", flush_ram=True):
    with torch.no_grad():
        prepared_kvcache_ppl = ppl_eval(test_dataset, prepared_model, num_batches=20)
print(f"ppl score of KVCACHE prepared fp model: {prepared_kvcache_ppl}")

2025-07-26 16:30:26,875 - Utils - INFO - memory usage @ 'KVcache prepared FP eval[gc] >> ' : GPU default:3.8 GB, RAM 3.3 GB


Evaluating PPL: 100%|██████████| 20/20 [00:00<00:00, 37.00it/s]

2025-07-26 16:30:34,973 - Utils - INFO - memory usage @ 'KVcache prepared FP eval[gc] << ' : GPU default:9.1 GB, RAM 3.3 GB
2025-07-26 16:30:34,973 - Utils - INFO - Event KVcache prepared FP eval[gc] : time=0:00:08, GPU=5.4 GB(+ 3.8 GB), RAM=1.1 MB(+ 3.3 GB)
ppl score of KVCACHE prepared fp model: 2703445.5
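
The two scores (2703437.75 before preparation, 2703445.5 after) differ only in the last digits, which is what small floating-point reordering in a flattened graph would produce. A quick way to make "the same PPL" precise is a relative-tolerance check, sketched here; the tolerance of `1e-4` is an arbitrary illustrative choice, not an AIMET recommendation.

```python
import math

fp_ppl = 2703437.75        # "PPL:" printed by the FP eval cell
prepared_ppl = 2703445.5   # printed by the prepared-model eval cell

rel_diff = abs(prepared_ppl - fp_ppl) / fp_ppl
print(f"relative difference: {rel_diff:.2e}")

# Treat the models as equivalent if the scores agree to within the chosen tolerance
scores_match = math.isclose(fp_ppl, prepared_ppl, rel_tol=1e-4)
print(scores_match)
```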





---

## 8. Quantization

The _Quantization_ step is the primary focus of this notebook. This section can be modified to run various quantization experiments.


---

### 8.1 Create quantsim configured for QNN HTP target


In [15]:
from aimet_common.defs import QuantScheme
from aimet_torch.v2.quantsim import QuantizationSimModel
import copy

sim_model = copy.deepcopy(prepared_model)
sim_model.to(device)

def get_dummy_data():
    input_ids = next(iter(test_dataset))["input_ids"]
    input_ids = torch.tensor(input_ids, dtype=torch.long, device=device).unsqueeze(0)
    input_ids, attention_mask, cos, sin, all_layer_kv_cache = qnn_llm_utils.prepare_inputs(input_ids)
    inputs_values = [input_ids, cos, sin, attention_mask]
    inputs_values.extend(all_layer_kv_cache)
    inputs_keys = ['input_ids', 'position_ids_cos', 'position_ids_sin','attention_mask'] 
    
    kv_inputs_keys = []
    for i in range(config.num_hidden_layers):
        kv_inputs_keys.append(f"past_key_{i}_in")
        kv_inputs_keys.append(f"past_value_{i}_in")
    inputs_keys.extend(kv_inputs_keys)
    inputs = dict(zip(inputs_keys, inputs_values))
    return inputs, all_layer_kv_cache

dummy_input, _ = get_dummy_data()
dummy_input_for_quantsim = tuple(dummy_input.values())
with event_marker("create KVCache Quantsim"):
    quantsim = QuantizationSimModel(model=sim_model,
                                    quant_scheme=QuantScheme.post_training_tf,
                                    dummy_input=dummy_input_for_quantsim,
                                    default_output_bw=16,
                                    default_param_bw=4,
                                    in_place=True,
                                    config_file=htp_config_file)

2025-07-26 16:30:42,582 - Utils - INFO - memory usage @ 'create KVCache Quantsim >> ' : GPU default:5.0 GB, RAM 3.3 GB
2025-07-26 16:30:42,765 - Quant - INFO - Unsupported op type BatchPermutation
2025-07-26 16:30:42,765 - Quant - INFO - Unsupported op type CropAndResize
2025-07-26 16:30:42,766 - Quant - INFO - Unsupported op type BatchToSpace
2025-07-26 16:30:42,766 - Quant - INFO - Unsupported op type SpaceToBatch
2025-07-26 16:30:42,766 - Quant - INFO - Unsupported op type GroupNormalization
2025-07-26 16:30:42,767 - Quant - INFO - Unsupported op type LayerNormalization
2025-07-26 16:30:42,767 - Quant - INFO - Unsupported op type Mean
2025-07-26 16:30:42,768 - Quant - INFO - Unsupported op type RMSNormalization
2025-07-26 16:30:42,768 - Quant - INFO - Unsupported op type Squeeze
2025-07-26 16:30:42,768 - Quant - INFO - Unsupported op type Unsqueeze
2025-07-26 16:30:42,769 - Quant - INFO - Unsupported op type Compress
2025-07-26 16:30:42,769 - Quant - INFO - Unsupported op type Ident
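
The quantsim above uses 4-bit parameters and 16-bit activations. As a rough intuition for what those bitwidths mean, the sketch below implements a simplified min/max-based asymmetric quantize-dequantize step in plain Python. It is an assumption-laden illustration, not AIMET's implementation: the function names are hypothetical, and details such as per-channel handling and the exact encoding rules are left out.

```python
def minmax_encoding(values, bitwidth):
    """Derive (scale, offset) from the observed range, always including zero."""
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    levels = 2 ** bitwidth - 1
    scale = ((hi - lo) / levels) or 1.0  # guard against an all-zero tensor
    offset = round(lo / scale)           # integer grid position of the range minimum
    return scale, offset


def fake_quantize(x, scale, offset, bitwidth):
    """Quantize x to the integer grid, clamp it, and map it back to a float."""
    q = round(x / scale) - offset
    q = max(0, min(2 ** bitwidth - 1, q))
    return (q + offset) * scale


values = [-1.0, 0.0, 0.5, 2.0]
results = {}
for bw in (4, 16):
    scale, offset = minmax_encoding(values, bw)
    max_err = max(abs(fake_quantize(v, scale, offset, bw) - v) for v in values)
    results[bw] = (scale, max_err)
    print(f"{bw}-bit: scale={scale:.6f}, max error={max_err:.6f}")
```

For in-range values the rounding error stays within half a quantization step, which is why the 16-bit activations are far more forgiving than the 4-bit weights.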

---

### 8.2 Setting 16bit x 8bit matmuls

We keep the key and value tensors at 8 bits, which reduces the data I/O cost associated with KV-cache orchestration.


In [16]:
from aimet_torch.v2.experimental.quantsim_utils import set_matmul_second_input_producer_to_8bit_symmetric

set_matmul_second_input_producer_to_8bit_symmetric(quantsim)
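
As the helper's name suggests, the tensors feeding the second matmul input are set to 8-bit symmetric encodings. A minimal sketch of what symmetric 8-bit quantization means is given here; it is a generic illustration with a hypothetical function name, not the code path AIMET runs.

```python
def symmetric_int8(values):
    """Quantize to signed 8-bit integers with a single scale and no zero-point."""
    max_abs = max(abs(v) for v in values) or 1.0  # guard against an all-zero tensor
    scale = max_abs / 127
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale


keys = [-0.8, -0.1, 0.0, 0.3, 1.27]
q, scale = symmetric_int8(keys)
dequantized = [x * scale for x in q]
print(q, scale)
```

Because zero always maps to the integer 0, no offset has to be stored or applied, which keeps cached key and value tensors cheap to move and to consume.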

---

### 8.3 Concat encoding unification

Configure concat ops so that their input and output activations share the same encoding.


In [17]:
from aimet_torch.v2.experimental import propagate_output_encodings
import aimet_torch.elementwise_ops as aimet_ops
propagate_output_encodings(quantsim, aimet_ops.Concat)

---

### 8.4 Manual Mixed Precision

Apply the mixed-precision configuration to selected ops.


In [18]:
from llm_utils.mixed_precision_overrides import ManualQuantsimMixedPrecisionConfig
quantsim_adjuster = ManualQuantsimMixedPrecisionConfig(mixed_precision_config_file= "./config/mixed_precision_config/exceptions.json")
quantsim_adjuster.apply_exceptions(quantsim)

Applying \w*lm_head_(MatMul|conv_Conv):	{'param_exceptions': {'bitwidth': 8}}
Applying \w*norm_(Mul_1|Mul_1.module):	{'input_exceptions': [{'input_index': 0, 'bitwidth': 16, 'asymmetric': True}]}
Applying \w*norm_(Pow|Pow.module|ReduceMean|Add|Sqrt|Div|Mul):	{'output_exceptions': [{'output_index': 0, 'enabled': False}]}
Applying \w*v_proj_(MatMul|conv_Conv):	{'output_exceptions': [{'output_index': 0, 'bitwidth': 8, 'asymmetric': False}]}
Applying \w*Concat_\d+:	{'output_exceptions': [{'output_index': 0, 'bitwidth': 8, 'asymmetric': False}]}
Applying QuantizedRmsNorm:	{'param_exceptions': {'asymmetric': True, 'bitwidth': 16}}
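
The keys in the log above are regular expressions matched against module names. The sketch below shows how such patterns select modules, using three of the logged patterns. Two caveats: using `re.fullmatch` is an assumption about how `ManualQuantsimMixedPrecisionConfig` matches names, and `lm_head_conv_Conv` and `model_Concat_3` are hypothetical module names chosen for illustration.

```python
import re

# Patterns copied from the log output; descriptions paraphrase the logged overrides
EXCEPTIONS = {
    r"\w*lm_head_(MatMul|conv_Conv)": "8-bit params",
    r"\w*v_proj_(MatMul|conv_Conv)": "8-bit symmetric output",
    r"\w*Concat_\d+": "8-bit symmetric output",
}


def matching_rules(module_name):
    """Return the descriptions of every exception whose pattern matches the whole name."""
    return [desc for pattern, desc in EXCEPTIONS.items()
            if re.fullmatch(pattern, module_name)]


names = ["lm_head_conv_Conv", "v_proj_conv_Conv", "q_proj_conv_Conv", "model_Concat_3"]
matches = {name: matching_rules(name) for name in names}
for name, rules in matches.items():
    print(f"{name}: {rules}")
```

A module that matches no pattern, such as `q_proj_conv_Conv` here, simply keeps the defaults set when the quantsim was created.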


---

### 8.5 Sequential MSE

Apply the sequential MSE technique to optimize parameter encodings.


In [19]:
from aimet_torch.v2.seq_mse import apply_seq_mse
from aimet_torch.seq_mse import SeqMseParams
from aimet_torch.utils import load_pytorch_model
from torch.utils.data import DataLoader

train_dataset = load_dataset("nexa4ai/qwen3_wiki_calib", split="train")
train_dataloader = DataLoader(train_dataset, batch_size=1, shuffle=True)


input_ids = next(iter(train_dataloader))["input_ids"]
input_ids = torch.tensor(input_ids, dtype=torch.long, device="cuda").unsqueeze(0)
input_ids, attention_mask, cos, sin, all_layer_kv_cache = qnn_llm_utils.prepare_inputs(input_ids)

def _forward_fn(model, inputs):
    # slice inputs so that we only end up doing inference using first n tokens
    input_ids = inputs["input_ids"]
    input_ids = torch.tensor(input_ids, dtype=torch.long, device="cuda").unsqueeze(0)
    curr_len = input_ids.shape[1]
    # Pad to the fixed window length; 151645 is the Qwen3 EOS token id used for padding above
    input_ids = torch.cat(
        [input_ids, torch.full((1, SEQ_LEN - curr_len), 151645, device=input_ids.device)],
        dim=-1,
    )
    all_inputs = (input_ids, cos, sin, attention_mask, *all_layer_kv_cache)
    model(*all_inputs)


## NOTE: num_batches can be lowered (e.g. from 20 to 1) to save time during this stage; this run uses 20.
params = SeqMseParams(num_batches=20,
                      inp_symmetry="symqt",
                      num_candidates=20,
                      loss_fn="mse",
                      forward_fn=_forward_fn)

In [20]:
with event_marker("SeqMSE"):
    prepared_model.to("cuda")
    quantsim.model.to("cuda")
    apply_seq_mse(prepared_model, quantsim, train_dataloader, params)

del prepared_model

2025-07-26 16:30:52,146 - Utils - INFO - memory usage @ 'SeqMSE >> ' : GPU default:5.0 GB, RAM 3.3 GB
2025-07-26 16:30:53,863 - Utils - INFO - Caching 20 batches from data loader at path location: /tmp/tmp5eizl03u/cached_dataset
2025-07-26 16:30:53,922 - SeqMse - INFO - Finding and freezing optimal param encodings candidate of module: q_proj_conv_Conv
2025-07-26 16:30:59,485 - SeqMse - INFO - Finding and freezing optimal param encodings candidate of module: k_proj_conv_Conv




2025-07-26 16:31:03,067 - SeqMse - INFO - Finding and freezing optimal param encodings candidate of module: v_proj_conv_Conv
2025-07-26 16:31:06,683 - SeqMse - INFO - Finding and freezing optimal param encodings candidate of module: o_proj_conv_Conv
2025-07-26 16:31:10,244 - SeqMse - INFO - Finding and freezing optimal param encodings candidate of module: gate_proj_conv_Conv
2025-07-26 16:31:14,010 - SeqMse - INFO - Finding and freezing optimal param encodings candidate of module: up_proj_conv_Conv
2025-07-26 16:31:17,785 - SeqMse - INFO - Finding and freezing optimal param encodings candidate of module: down_proj_conv_Conv
2025-07-26 16:31:21,667 - Utils - INFO - memory usage @ 'SeqMSE << ' : GPU default:8.0 GB, RAM 3.3 GB
2025-07-26 16:31:21,668 - Utils - INFO - Event SeqMSE : time=0:00:29, GPU=3.0 GB(+ 5.0 GB), RAM=6.6 MB(+ 3.3 GB)
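
The core idea behind the search logged above can be sketched in a few lines: try a grid of clipping ranges for a layer's weights and keep the one whose quantized weights best reproduce the layer's floating-point output on calibration inputs. The code below is a heavily simplified, single-channel illustration in plain Python. The function names are hypothetical, and AIMET's actual algorithm differs in many details (for example, it works per channel and proceeds layer by layer through the model).

```python
def quantize_symmetric(weights, max_abs, bitwidth=4):
    """Symmetric quantize-dequantize of weights with clipping range [-max_abs, max_abs]."""
    qmax = 2 ** (bitwidth - 1) - 1
    scale = max_abs / qmax
    return [max(-qmax - 1, min(qmax, round(w / scale))) * scale for w in weights]


def output_mse(weights, q_weights, inputs):
    """Mean squared error between FP and quantized dot-product outputs."""
    err = 0.0
    for x in inputs:
        ref = sum(a * b for a, b in zip(x, weights))
        out = sum(a * b for a, b in zip(x, q_weights))
        err += (ref - out) ** 2
    return err / len(inputs)


def search_clip(weights, inputs, num_candidates=20, bitwidth=4):
    """Return (best clipping max, its loss) over an evenly spaced candidate grid."""
    full = max(abs(w) for w in weights)
    best = None
    for cand in range(num_candidates):
        cand_max = full / num_candidates * (cand + 1)
        loss = output_mse(weights, quantize_symmetric(weights, cand_max, bitwidth), inputs)
        if best is None or loss < best[1]:
            best = (cand_max, loss)
    return best


weights = [0.1, -0.2, 0.15, 0.05, 1.0]  # one outlier weight stretches the range
inputs = [[1.0, 1.0, 1.0, 1.0, 0.1], [0.5, -1.0, 2.0, 1.0, 0.0], [-1.0, 0.5, 0.5, 2.0, 0.05]]
best_max, best_loss = search_clip(weights, inputs)
full_range_loss = output_mse(weights, quantize_symmetric(weights, 1.0), inputs)
print(best_max, best_loss, full_range_loss)
```

With only 4 bits per weight, clipping an outlier can cost less than spending the whole integer grid on it, which is the trade-off the candidate search explores.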


---

### 8.6 Calibration

Compute activation encodings using AIMET.


In [21]:
from torch.utils.data import DataLoader
train_dataset = load_dataset("nexa4ai/qwen3_wiki_calib", split="train")
train_dataloader = DataLoader(train_dataset, batch_size=1, shuffle=True)


input_ids = next(iter(train_dataloader))["input_ids"]
input_ids = torch.tensor(input_ids, dtype=torch.long, device="cuda").unsqueeze(0)
input_ids, attention_mask, cos, sin, all_layer_kv_cache = qnn_llm_utils.prepare_inputs(input_ids)

def _forward_fn(model, kwargs):
    data_loader = kwargs['data_loader']
    max_iterations = kwargs['num_batches']
    
    for batch_id, batch in enumerate(tqdm(data_loader, total=max_iterations)):
        if batch_id < max_iterations:
            input_ids = batch['input_ids']
            input_ids = torch.tensor(input_ids, dtype=torch.long, device="cuda").unsqueeze(0)
            curr_len = input_ids.shape[1]
            # Pad to the fixed window length; 151645 is the Qwen3 EOS token id used for padding above
            input_ids = torch.cat(
                [input_ids, torch.full((1, SEQ_LEN - curr_len), 151645, device=input_ids.device)],
                dim=-1,
            )
            all_inputs = (input_ids, cos, sin, attention_mask, *all_layer_kv_cache)
            model(*all_inputs)
        else:
            break
        
kwargs = {
   'data_loader': train_dataloader,
   'num_batches': 20
}

with event_marker("compute encoding", flush_ram=True):
    quantsim.model.to("cuda")
    quantsim.compute_encodings(_forward_fn, kwargs)

2025-07-26 16:31:31,194 - Utils - INFO - memory usage @ 'compute encoding[gc] >> ' : GPU default:3.8 GB, RAM 3.3 GB


100%|██████████| 20/20 [00:01<00:00, 18.31it/s]

2025-07-26 16:31:32,312 - Utils - INFO - memory usage @ 'compute encoding[gc] << ' : GPU default:7.4 GB, RAM 3.3 GB
2025-07-26 16:31:32,312 - Utils - INFO - Event compute encoding[gc] : time=0:00:01, GPU=3.5 GB(+ 3.8 GB), RAM=1.3 MB(+ 3.3 GB)





### 8.7 Evaluate the KV cache sim model


In [22]:
with event_marker("KV cache sim eval", flush_ram=True):
    with torch.no_grad():
        quantsim.model.to("cuda")
        sim_ppl = ppl_eval(test_dataset, quantsim.model, num_batches=20)

print(f"ppl score of KVCACHE sim fp model: {sim_ppl}")

2025-07-26 16:31:32,493 - Utils - INFO - memory usage @ 'KV cache sim eval[gc] >> ' : GPU default:3.8 GB, RAM 3.3 GB


Evaluating PPL: 100%|██████████| 20/20 [00:01<00:00, 12.82it/s]

2025-07-26 16:31:41,594 - Utils - INFO - memory usage @ 'KV cache sim eval[gc] << ' : GPU default:9.7 GB, RAM 3.3 GB
2025-07-26 16:31:41,595 - Utils - INFO - Event KV cache sim eval[gc] : time=0:00:09, GPU=5.9 GB(+ 3.8 GB), RAM=32.5 MB(+ 3.3 GB)





ppl score of KVCACHE sim fp model: 2184613.75


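### 8.8 Real Test

We run the same greedy-decoding loop as in section 2.1 on the quantized model to check that it can still perform inference. As before, the single-layer truncation means the generated text itself is not expected to be meaningful.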


In [23]:
eos_token_id = 151645
eos_tokens = {151645}
input_text = (
    "<|im_start|>user\nIntroduce Newton's first law of motion. Be short and concise.<|im_end|>"
    "\n<|im_start|>assistant\n"
)
input_ids_list = tokenizer.encode(input_text)  # a list of ints

n_past = 0
curr_len = len(input_ids_list)

if curr_len < SEQ_LEN:
    input_ids_list = input_ids_list + [eos_token_id] * (SEQ_LEN - curr_len)

input_ids = torch.tensor(input_ids_list, dtype=torch.long, device=device).unsqueeze(0)
attention_mask = qnn_llm_utils.get_attention_mask(n_past, curr_len)
position_ids = qnn_llm_utils.get_position_ids(n_past, SEQ_LEN)
cos, sin = qnn_llm_utils.get_cos_sin(attention_mask, position_ids)  # attention_mask is passed as a dummy input to pick up the device
all_layer_kv_caches = qnn_llm_utils.get_kv_cache()
last_token_indices = torch.tensor([n_past + curr_len - 1], dtype=torch.long, device=device)

generated_ids = []

for i in range(10):
    with torch.no_grad():
        prepared_inputs = (input_ids, cos, sin, attention_mask, *all_layer_kv_caches)
        outputs = quantsim.model(*prepared_inputs)
        next_token_logits = outputs[0][:, last_token_indices, :]
        next_token_id = torch.argmax(next_token_logits, dim=-1).item()
        generated_ids.append(next_token_id)

        # Update the inputs for the next token: only one new token is fed per step,
        # padded to SEQ_LEN, while earlier tokens are carried in the KV cache.
        all_layer_kv_caches = qnn_llm_utils.update_kv_cache(all_layer_kv_caches, outputs[1:], n_past, curr_len)
        n_past += curr_len
        curr_len = 1
        next_input_ids = [next_token_id] + [eos_token_id] * (SEQ_LEN - 1)
        # unsqueeze(0) adds the batch dimension (batch size 1)
        input_ids = torch.tensor(next_input_ids, dtype=torch.long, device=device).unsqueeze(0)
        last_token_indices = torch.tensor([curr_len - 1], dtype=torch.long, device=device)
        attention_mask = qnn_llm_utils.get_attention_mask(n_past, curr_len)
        position_ids = qnn_llm_utils.get_position_ids(n_past, SEQ_LEN)
        cos, sin = qnn_llm_utils.get_cos_sin(attention_mask, position_ids)

    if next_token_id in eos_tokens:
        break

generated_text = tokenizer.decode(generated_ids)
print(generated_text)

[node[nodeeleeleeleeleelelementlementlement
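The loop above feeds a fixed-length window of `SEQ_LEN` tokens on every pass and tracks two counters: `n_past` (tokens already held in the KV cache) and `curr_len` (valid tokens in the current window). The sketch below reproduces only that bookkeeping in plain Python, without the model. The helper names `pad_to_seq_len` and `decode_schedule` are hypothetical and are not part of `qnn_llm_utils`.

```python
def pad_to_seq_len(token_ids, seq_len, pad_id):
    """Right-pad a token list to the fixed model input length."""
    if len(token_ids) > seq_len:
        # The notebook cell does not handle this case; the sketch rejects it.
        raise ValueError("prompt is longer than seq_len")
    return token_ids + [pad_id] * (seq_len - len(token_ids))

def decode_schedule(prompt_len, num_steps):
    """Return (n_past, curr_len, last_token_index) for each forward pass."""
    schedule = []
    n_past, curr_len = 0, prompt_len
    for _ in range(num_steps):
        # The logits of interest sit at the last valid position in the window.
        schedule.append((n_past, curr_len, curr_len - 1))
        n_past += curr_len   # everything just processed is now in the cache
        curr_len = 1         # later passes carry a single new token
    return schedule

padded = pad_to_seq_len([11, 12, 13], seq_len=5, pad_id=0)
steps = decode_schedule(prompt_len=3, num_steps=3)
print(padded)  # [11, 12, 13, 0, 0]
print(steps)   # [(0, 3, 2), (3, 1, 0), (4, 1, 0)]
```

The first pass is the prefill over the whole prompt; each later pass reads the logits at index 0 because only one real token is present in the padded window.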


---

## 9. Export

The pipeline calls below export the ONNX model, the encodings, and the test vectors for the KVCache model.


---

### 9.1 Generating test vectors for QNN SDK

Only a single batch of data is used here, so the computation can run on the CPU. Running this step on CUDA can easily run out of memory.

<span style="color:#d62728;font-weight:bold;">TODO</span>: We still need to determine what these test vectors are for and how we will use them. We did not run through this cell for now.


In [24]:
%load_ext autoreload
%autoreload 2
from llm_utils.test_vectors import generate_test_vectors
# del prepared_model  
test_vector_layers = [
    "model_layers_\\d+_input_layernorm_Pow",
    "lm_head_conv_Conv"
]
with event_marker("generate test vector"):
    quantsim.model.to("cpu")
    generate_test_vectors(quantsim, qnn_llm_utils, train_dataloader, output_dir, 
                          num_batches=1, test_vector_layers=test_vector_layers, input_names=input_names)

2025-07-26 16:31:42,548 - Utils - INFO - memory usage @ 'generate test vector >> ' : GPU default:3.8 GB, RAM 3.3 GB


  return nested_map(t, lambda x: torch.tensor(x) if isinstance(x, QuantizedTensorBase) else x)
Test vector generation: 100%|██████████| 1/1 [01:10<00:00, 70.92s/it]

2025-07-26 16:32:53,707 - Utils - INFO - memory usage @ 'generate test vector << ' : GPU default:3.8 GB, RAM 22.7 GB
2025-07-26 16:32:53,708 - Utils - INFO - Event generate test vector : time=0:01:11, GPU=0.0 bytes(+ 3.8 GB), RAM=19.4 GB(+ 3.3 GB)
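The entries in `test_vector_layers` look like regular expressions over layer names (note the `\\d+`). The sketch below shows how such patterns would select layers if they are applied as full-string matches. That matching rule is an assumption, since the internals of `generate_test_vectors` are not shown here, and the layer names in `layer_names` are hypothetical examples.

```python
import re

# Same patterns as in the cell above, written as raw strings.
patterns = [
    r"model_layers_\d+_input_layernorm_Pow",
    r"lm_head_conv_Conv",
]

# Hypothetical layer names, for illustration only.
layer_names = [
    "model_layers_0_input_layernorm_Pow",
    "model_layers_27_input_layernorm_Pow",
    "model_layers_0_self_attn_q_proj_Conv",
    "lm_head_conv_Conv",
]

# Assumption: a layer is selected when any pattern matches its full name.
selected = [
    name for name in layer_names
    if any(re.fullmatch(pattern, name) for pattern in patterns)
]
print(selected)
```

With these inputs the attention projection layer is skipped, while both layer-norm `Pow` layers and the LM head are kept.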





---

### 9.2 Export KVCache Model


In [25]:
dummy_input, _ = get_dummy_data()
dummy_input = tuple(dummy_input.values())

In [26]:
from aimet_torch.utils import change_tensor_device_placement
from aimet_torch.onnx_utils import OnnxExportApiArgs
onnx_dir = os.path.join(output_dir, 'onnx')
os.makedirs(onnx_dir, exist_ok=True)

if enable_fp16:
    # Convert FP16 model back to FP32 for ONNX export
    convert_model_to_fp32(quantsim.model)

quantsim.model.to("cpu")
onnx_api_args = OnnxExportApiArgs(input_names=input_names, output_names=output_names, opset_version=14)
sample_inputs = change_tensor_device_placement(dummy_input, torch.device('cpu'))
with event_marker("KVCache export", flush_ram=True):
    quantsim.export(onnx_dir, model_name, sample_inputs, onnx_export_args=onnx_api_args)

2025-07-26 16:33:01,582 - Utils - INFO - memory usage @ 'KVCache export[gc] >> ' : GPU default:2.7 GB, RAM 5.4 GB
2025-07-26 16:33:15,299 - Quant - INFO - Layers excluded from quantization: []
2025-07-26 16:33:19,150 - Utils - INFO - memory usage @ 'KVCache export[gc] << ' : GPU default:2.7 GB, RAM 11.6 GB
2025-07-26 16:33:19,151 - Utils - INFO - Event KVCache export[gc] : time=0:00:17, GPU=1.3 MB(+ 2.7 GB), RAM=6.2 GB(+ 5.4 GB)


---

### Summary


In [27]:
from aimet_torch.pro.utils.profiler import EventProfiler
EventProfiler().report()
EventProfiler().json_dump(os.path.join(output_dir, 'profiling_stats'))

import json
with open(f'{output_dir}/ppl.json', 'wt') as f:
    json.dump({
        "original": float(orig_ppl),
        "prepared_kvcache": float(prepared_kvcache_ppl),
        "QuantSim": float(sim_ppl),
    }, f, indent=2)

2025-07-26 16:33:19,196 - Utils - INFO - #0: Event FP eval : time=0:00:07, GPU=5.4 GB(+ 2.5 GB), RAM=33.0 MB(+ 3.1 GB)
2025-07-26 16:33:19,197 - Utils - INFO - #1: Event KVCache prepare model[gc] : time=0:00:50, GPU=5.1 GB(+ 2.6 GB), RAM=4.9 GB(+ 3.2 GB)
2025-07-26 16:33:19,198 - Utils - INFO - #2: Event KVcache prepared FP eval[gc] : time=0:00:08, GPU=5.4 GB(+ 3.8 GB), RAM=1.1 MB(+ 3.3 GB)
2025-07-26 16:33:19,198 - Utils - INFO - #3: Event create KVCache Quantsim : time=0:00:00, GPU=3.0 GB(+ 5.0 GB), RAM=1.1 MB(+ 3.3 GB)
2025-07-26 16:33:19,199 - Utils - INFO - #4: Event SeqMSE : time=0:00:29, GPU=3.0 GB(+ 5.0 GB), RAM=6.6 MB(+ 3.3 GB)
2025-07-26 16:33:19,199 - Utils - INFO - #5: Event compute encoding[gc] : time=0:00:01, GPU=3.5 GB(+ 3.8 GB), RAM=1.3 MB(+ 3.3 GB)
2025-07-26 16:33:19,200 - Utils - INFO - #6: Event KV cache sim eval[gc] : time=0:00:09, GPU=5.9 GB(+ 3.8 GB), RAM=32.5 MB(+ 3.3 GB)
2025-07-26 16:33:19,200 - Utils - INFO - #7: Event generate test vector : time=0:01:11, GPU
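The `ppl.json` file written above can be read back to compare the three scores. The sketch below is one possible way to do that, computing each score's relative change against the `original` entry. The helper name `relative_ppl_change` is hypothetical, and the numbers are placeholders rather than results from this notebook.

```python
import json
import os
import tempfile

def relative_ppl_change(scores, baseline_key="original"):
    """Relative change of every score versus the baseline entry."""
    baseline = scores[baseline_key]
    return {
        key: (value - baseline) / baseline
        for key, value in scores.items()
        if key != baseline_key
    }

# Placeholder values for illustration only; not measured results.
example_scores = {"original": 10.0, "prepared_kvcache": 10.0, "QuantSim": 12.5}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "ppl.json")
    with open(path, "wt") as f:
        json.dump(example_scores, f, indent=2)
    with open(path, "rt") as f:
        loaded = json.load(f)

changes = relative_ppl_change(loaded)
print(changes)
```

A value near zero for `prepared_kvcache` would indicate that model preparation preserved accuracy, while the `QuantSim` value reflects the cost of quantization.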

Copyright (c) 2024 Qualcomm Technologies, Inc. and/or its subsidiaries.
