Static Per-Tensor Quantization of KV Cache #3995
I am working with Qualcomm's AIMET and have run into the following problem. Qualcomm's current hardware only supports static graphs, so activations must use static quantization, i.e. fixed scales and zero points. The hardware is also limited to per-tensor quantization for activations. I am currently looking at quantization in QNN for language models, and the KV cache is a distinct problem: each KV cache is treated as an activation, so I can only quantize it per tensor. Because of the static-graph limitation, I need to allocate the full 4096 length for it, which means each tensor has shape 4096*128 even after I split the heads. So I would guess the precision loss is substantial, right? I believe other frameworks, such as ONNX, have the same problem. What I wish to understand here is:
Replies: 1 comment
You're correct that we require a static graph to run on device, and that we use per-tensor quantization for activations (including the KV cache). Typically, we keep KV cache activations at 8 bits and the rest of the activations in the model at 16 bits. We've found that this can get you quite close to matching the floating-point performance of the model, depending on the technique you use for quantizing your weights.
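
To make the trade-off concrete, here is a minimal NumPy sketch of static per-tensor quantization applied to a tensor of the 4096*128 shape mentioned in the question. It is not AIMET or QNN code: the helper names (`calibrate`, `quantize`, `dequantize`), the asymmetric unsigned scheme, and the synthetic Gaussian data are all assumptions made for illustration. It only shows how a single fixed scale and zero point, chosen offline, behave at 8 and 16 bits.

```python
import numpy as np

def calibrate(samples, num_bits):
    """Pick one fixed (scale, zero_point) for the whole tensor from calibration data."""
    qmax = 2 ** num_bits - 1
    lo = min(float(samples.min()), 0.0)  # keep 0.0 exactly representable
    hi = max(float(samples.max()), 0.0)
    scale = (hi - lo) / qmax
    zero_point = int(round(-lo / scale))
    return scale, zero_point

def quantize(x, scale, zero_point, num_bits):
    """Map floats to unsigned integers using the fixed per-tensor parameters."""
    qmax = 2 ** num_bits - 1
    q = np.round(x / scale) + zero_point
    return np.clip(q, 0, qmax).astype(np.int64)

def dequantize(q, scale, zero_point):
    """Map the integers back to floats."""
    return (q.astype(np.float64) - zero_point) * scale

rng = np.random.default_rng(0)
# Hypothetical stand-ins: calibration data seen offline, and a KV tensor seen at runtime.
calib = rng.normal(size=(4096, 128))
kv = rng.normal(size=(4096, 128))

errors = {}
for bits in (8, 16):
    scale, zp = calibrate(calib, bits)  # fixed once, reused for every input
    kv_hat = dequantize(quantize(kv, scale, zp, bits), scale, zp)
    errors[bits] = float(np.abs(kv - kv_hat).mean())
    print(f"{bits}-bit mean abs error: {errors[bits]:.6f}")
```

On well-behaved synthetic data like this, the 8-bit error stays small relative to the value range. Real KV caches can contain outliers that stretch the per-tensor range, so the actual loss has to be measured on the model in question.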