Would Per-Tensor Int8 KV Cache Break Small LLMs? #3994
Unanswered
Leo Wang (dw61) asked this question in Q&A
Replies: 1 comment
Hi Leo Wang (@dw61), sorry for the delayed response. This is a very good question 😊 According to our observations, small LLMs weren't particularly more sensitive to an int8 KV cache.
We’ve observed that the current KV cache uses per-tensor int8 quantization, but each KV tensor is large (992×64). Wouldn’t this lead to significant precision loss, especially in small models like a 0.5B LLM, where parameter sensitivity is high? Has Qualcomm encountered performance issues from this setup?
Also, for activations, are there alternative quantization methods beyond per-tensor?
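To make the concern concrete, here is a minimal sketch comparing per-tensor and per-channel symmetric int8 quantization on a synthetic 992×64 tensor. It is not Qualcomm's implementation. The data is random, the assumption that a couple of the 64 channels carry much larger magnitudes is made up for illustration, and the names `quantize_dequantize`, `rmse`, and `compare` are hypothetical. Per-channel scaling is one common alternative to per-tensor; whether the toolchain supports it for activations is a separate question.

```python
import math
import random


def quantize_dequantize(values, scale):
    """Symmetric int8 round trip: quantize to [-127, 127], then dequantize."""
    out = []
    for v in values:
        q = max(-127, min(127, round(v / scale)))
        out.append(q * scale)
    return out


def rmse(a, b):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def compare(rows=992, cols=64, outlier_cols=(3, 17), outlier_gain=20.0, seed=0):
    """Return (per_tensor_rmse, per_channel_rmse) on synthetic data.

    Assumption: the 64-wide axis is treated as the channel axis, and the
    channels listed in outlier_cols are scaled up to mimic outliers.
    """
    rng = random.Random(seed)
    # data[c] holds one channel (column) of length `rows`.
    data = []
    for c in range(cols):
        gain = outlier_gain if c in outlier_cols else 1.0
        data.append([rng.gauss(0.0, 1.0) * gain for _ in range(rows)])
    flat = [v for col in data for v in col]

    # Per-tensor: a single scale derived from the global max magnitude.
    tensor_scale = max(abs(v) for v in flat) / 127.0
    per_tensor = quantize_dequantize(flat, tensor_scale)

    # Per-channel: one scale per column, so outlier channels do not
    # stretch the quantization step of the other channels.
    per_channel = []
    for col in data:
        channel_scale = max(abs(v) for v in col) / 127.0
        per_channel.extend(quantize_dequantize(col, channel_scale))

    return rmse(flat, per_tensor), rmse(flat, per_channel)


if __name__ == "__main__":
    t, c = compare()
    print(f"per-tensor RMSE:  {t:.4f}")
    print(f"per-channel RMSE: {c:.4f}")
```

With outlier channels present, the single per-tensor scale is set by the largest values, so the remaining channels get a coarser step than they need. The gap between the two errors on real KV tensors depends on how uneven the channel ranges actually are, which this synthetic data does not tell us.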