[Quantization] Int8 Dynamic Quantization for LLM #3312
yang-ahuan asked this question in Q&A · Unanswered

According to the documentation, when applying dynamic quantization to LLMs, DYNAMIC_QUANTIZATION_GROUP_SIZE must be nonzero [1]. However, with any nonzero DYNAMIC_QUANTIZATION_GROUP_SIZE (or the default of 32), I observe that weight-only quantization has lower latency than dynamic quantization. Is this behavior expected? Any insights would be appreciated!
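
For reference, a minimal sketch of the comparison I have in mind (the model ID, prompt, and generation settings are placeholders; DYNAMIC_QUANTIZATION_GROUP_SIZE is the runtime property described in [1], with 0 disabling dynamic quantization):

```python
import time

from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("The weather today is", return_tensors="pt")

configs = {
    # 0 disables dynamic quantization -> pure INT8 weight-only inference
    "weight-only": {"DYNAMIC_QUANTIZATION_GROUP_SIZE": "0"},
    # a nonzero group size enables dynamic quantization of activations
    "dynamic (group size 32)": {"DYNAMIC_QUANTIZATION_GROUP_SIZE": "32"},
}

for name, ov_config in configs.items():
    model = OVModelForCausalLM.from_pretrained(
        model_id, export=True, load_in_8bit=True, ov_config=ov_config
    )
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=128, do_sample=False)
    print(f"{name}: {time.perf_counter() - start:.2f} s")
```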
BTW, I know that the group size for INT8 weight compression must be -1 [2], but I'm not sure whether that constraint is what causes the results above.
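
As I understand it, [2] enforces per-channel quantization for the INT8 weight-compression modes, with group-wise quantization reserved for the 4-bit modes. A minimal sketch of that constraint, assuming an already exported OpenVINO IR (the path is a placeholder):

```python
import nncf
import openvino as ov

# Placeholder path to an exported OpenVINO IR
ov_model = ov.Core().read_model("llm/openvino_model.xml")

# Per-channel quantization (group_size=-1) is the only group size NNCF
# accepts for INT8 weight compression.
compressed = nncf.compress_weights(
    ov_model,
    mode=nncf.CompressWeightsMode.INT8_ASYM,
    group_size=-1,
)

# A nonzero group size here would be rejected by the check referenced in [2]:
# nncf.compress_weights(ov_model, mode=nncf.CompressWeightsMode.INT8_ASYM, group_size=32)
```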
[1] https://docs.openvino.ai/2025/openvino-workflow-generative/inference-with-optimum-intel.html#enabling-openvino-runtime-optimizations
[2] https://github.com/openvinotoolkit/nncf/blob/develop/nncf/quantization/algorithms/weight_compression/algorithm.py#L131

Replies: 1 comment

Hi @yang-ahuan, thanks for reporting this!