Would Per-Tensor Int8 KV Cache Break Small LLMs? #3994

Leo Wang (dw61) · 2025-06-04T20:52:06Z


We’ve observed that the current KV cache uses per-tensor int8 quantization, but each KV tensor is large (992×64). Wouldn’t this lead to significant precision loss, especially in small models like a 0.5B LLM, where parameter sensitivity is high? Has Qualcomm encountered performance issues from this setup?

Also, for activations, are there alternative quantization methods beyond per-tensor?

Kyunggeun Lee (quic-kyunggeu) · 2025-06-24T22:22:54Z


Hi Leo Wang (@dw61), sorry for the delayed response. This is a very good question 😊

In our observations, small LLMs were not noticeably more sensitive to an int8 KV cache than larger ones.
With an int8 KV cache, small LLMs (<=1B) have mostly shown decent accuracy.
We did observe that some models can be sensitive to an int8 KV cache, but this was due to model characteristics, not model size.
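
For anyone who wants to get a feel for the effect before testing a real model, here is a rough, standard-library-only sketch. It is not the toolkit's implementation: the helper names (`quantize_dequantize`, `per_tensor_mse`, `per_channel_mse`), the symmetric max-based scale, the Gaussian test data, and the reading of 992×64 as (tokens × head dim) are all assumptions made for illustration. It compares one shared int8 scale against one scale per channel, on a well-behaved tensor and on a copy where a single channel is 50x larger.

```python
import random


def quantize_dequantize(values, num_bits=8):
    """Symmetric fake-quantization using ONE scale for all of `values`."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for int8
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the integer grid, clamp to [-qmax, qmax], map back to floats.
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def per_tensor_mse(tensor):
    """Quantization error when the whole 2-D tensor shares a single scale."""
    flat = [v for row in tensor for v in row]
    return mse(flat, quantize_dequantize(flat))


def per_channel_mse(tensor):
    """Quantization error when each column (channel) gets its own scale."""
    columns = list(zip(*tensor))
    # Columns have equal length, so the mean of per-column MSEs is the overall MSE.
    return sum(mse(col, quantize_dequantize(col)) for col in columns) / len(columns)


rng = random.Random(0)
num_tokens, head_dim = 992, 64  # assumed meaning of the 992x64 shape in the question

# Synthetic stand-in for a KV tensor: every channel has a similar range.
well_behaved = [[rng.gauss(0.0, 1.0) for _ in range(head_dim)] for _ in range(num_tokens)]

# Same data, but channel 0 is 50x larger, mimicking a model with outlier channels.
with_outlier = [
    [v * 50.0 if j == 0 else v for j, v in enumerate(row)] for row in well_behaved
]

for name, tensor in [("well-behaved", well_behaved), ("outlier channel", with_outlier)]:
    print(f"{name}: per-tensor MSE={per_tensor_mse(tensor):.6f}, "
          f"per-channel MSE={per_channel_mse(tensor):.6f}")
```

In this toy setup the per-tensor error grows sharply once a single channel dominates the range, regardless of tensor size, which is in line with sensitivity coming from model characteristics rather than model size. Real KV activations are not Gaussian, so this only indicates the trend and is no substitute for measuring accuracy on the actual model.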
