Static Per-Tensor Quantization of KV Cache #3995

dw61 · 2025-06-04T23:07:15Z

I am working with Qualcomm's AIMET, and here is the problem: Qualcomm's current hardware only supports static graphs. For activation quantization, this means I need to use static quantization, with fixed scales and zero points. Another limitation of the hardware is that it only supports per-tensor quantization for activations.

I am currently looking at quantization in QNN for language models, and the KV cache is a distinct problem. Each KV cache is treated as an activation, so I can only quantize it per tensor. Because of the static graph limitation, I need to allocate the full length of 4096 for it, which means each tensor has shape 4096×128 even after I split the heads.
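To make the setup concrete, here is a minimal NumPy sketch of static per-tensor quantization applied to one 4096×128 tensor. It is not AIMET or QNN code: the function names, the random calibration data, and the simple min/max calibration rule are all assumptions chosen for illustration.

```python
import numpy as np

def compute_static_params(calib, num_bits=8):
    """Derive one fixed (scale, zero_point) pair for the whole tensor.

    Hypothetical helper: uses plain min/max calibration on an unsigned
    integer grid, with the range widened so that 0.0 is representable.
    """
    qmin, qmax = 0, 2**num_bits - 1
    lo = min(float(calib.min()), 0.0)
    hi = max(float(calib.max()), 0.0)
    scale = (hi - lo) / (qmax - qmin)
    if scale == 0.0:  # guard against an all-zero calibration tensor
        scale = 1.0
    zero_point = int(round(qmin - lo / scale))
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point

def quantize(x, scale, zero_point, num_bits=8):
    """Map floats to integers using the fixed per-tensor parameters."""
    qmax = 2**num_bits - 1
    q = np.round(x / scale) + zero_point
    return np.clip(q, 0, qmax).astype(np.int32)

def dequantize(q, scale, zero_point):
    """Map integers back to approximate float values."""
    return (q.astype(np.float32) - zero_point) * scale

# Stand-in for one head's KV cache: 4096 positions x 128 channels.
rng = np.random.default_rng(0)
calib = rng.standard_normal((4096, 128)).astype(np.float32)

# The parameters are computed once, offline, and then frozen in the graph.
scale, zero_point = compute_static_params(calib)

q = quantize(calib, scale, zero_point)
calib_hat = dequantize(q, scale, zero_point)
max_err = float(np.abs(calib - calib_hat).max())
print(f"scale={scale:.6f} zero_point={zero_point} max_err={max_err:.6f}")
```

Every one of the 4096×128 values shares the same scale and zero point, so a few large outliers stretch the range and coarsen the grid for all other values. Values seen at run time that fall outside the calibrated range are clipped.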

So I guess the precision loss would be large, right? I believe other frameworks, such as ONNX, have the same problem. What I would like to understand is:

In my case, with per-tensor quantization, can the LLM still produce reasonable output?
What other methods exist for KV cache quantization in LLMs?

Answered by quic-ashvkuma

quic-ashvkuma · 2025-06-19T01:50:45Z

You're correct that we require a static graph to run on device, and that we use per-tensor quantization for activations (including the KV cache). Typically, we keep KV cache activations at 8 bits and the rest of the activations in the model at 16 bits. We've found that this (depending on the technique you use for quantizing your weights) can get you quite close to matching the floating-point performance of the model.
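As a rough illustration of the gap between the two bit widths mentioned above, the sketch below applies per-tensor quantization at 8 and 16 bits to the same random tensor and compares the signal-to-quantization-noise ratio (SQNR). It is not AIMET code and does not reproduce any model-level accuracy result: the helper names are hypothetical, the data is synthetic, and for brevity the range is taken from the tensor itself as a stand-in for a calibration pass.

```python
import numpy as np

def fake_quantize(x, num_bits):
    """Quantize and immediately dequantize with one scale/zero point.

    Hypothetical helper using min/max range estimation on an unsigned grid.
    """
    qmax = 2**num_bits - 1
    lo = min(float(x.min()), 0.0)
    hi = max(float(x.max()), 0.0)
    scale = (hi - lo) / qmax
    if scale == 0.0:  # guard against an all-zero tensor
        scale = 1.0
    zero_point = round(-lo / scale)
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax)
    return (q - zero_point) * scale

def sqnr_db(x, x_hat):
    """Signal-to-quantization-noise ratio in decibels."""
    noise = np.mean((x - x_hat) ** 2)
    return float(10.0 * np.log10(np.mean(x**2) / noise))

# Synthetic stand-in for a 4096 x 128 activation tensor.
rng = np.random.default_rng(0)
x = rng.standard_normal((4096, 128))

sqnr_8 = sqnr_db(x, fake_quantize(x, 8))
sqnr_16 = sqnr_db(x, fake_quantize(x, 16))
print(f"8-bit SQNR:  {sqnr_8:.1f} dB")
print(f"16-bit SQNR: {sqnr_16:.1f} dB")
```

Each extra bit halves the step size, so the 16-bit grid is far finer than the 8-bit one for the same range. Whether 8 bits is enough for the KV cache of a particular model still has to be checked on that model.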
