[Kernels] Support blocked fp8 quantization for compressed tensors MoE #25219
Conversation
Code Review
This pull request adds support for blocked FP8 quantization for Mixture of Experts (MoE) layers, primarily to enable DeepGEMM kernels. The changes are mostly in vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py. My review identifies a critical issue with parameter registration that could lead to incorrect behavior and makes the code confusing. I've also noted a minor issue in vllm/model_executor/warmup/deep_gemm_warmup.py.
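For readers following along: blocked fp8 quantization of this kind stores one scale per 128×128 weight tile, which is the layout the DeepGEMM kernels consume (applied per expert for MoE weights). Below is a minimal PyTorch sketch of the idea, not the actual vLLM implementation; the helper name is hypothetical.

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

def quantize_weight_blocked(w: torch.Tensor, block: int = 128):
    """Quantize a 2-D weight to fp8 with one scale per (block x block) tile."""
    n, k = w.shape
    n_t, k_t = -(-n // block), -(-k // block)  # ceil-div tile counts
    # Zero-pad so the matrix tiles evenly, then view it as tiles.
    padded = torch.zeros(n_t * block, k_t * block, dtype=w.dtype, device=w.device)
    padded[:n, :k] = w
    tiles = padded.view(n_t, block, k_t, block)
    # Per-tile scale chosen so each tile's absmax maps onto the fp8 max.
    amax = tiles.abs().amax(dim=(1, 3), keepdim=True).float().clamp(min=1e-12)
    scale = amax / FP8_MAX
    q = (tiles.float() / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    q = q.view(n_t * block, k_t * block)[:n, :k].contiguous()
    # Dequantization is w ≈ q.float() * scale, broadcast per tile.
    return q, scale.reshape(n_t, k_t)
```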
Force-pushed from 216add1 to 1ce4114.
@mgoin, @tlrmchlsmth I think this is ready to go as a first cut.
Looks good, and thanks for the work! Could you also add performance metrics using `vllm bench throughput ...`?
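For reference, such a run might look like the following; the model path is a placeholder and the exact flags are assumptions that can differ across vLLM versions.

```bash
vllm bench throughput \
    --model <blocked-fp8-compressed-tensors-moe-model> \
    --input-len 1000 \
    --output-len 100 \
    --num-prompts 500
```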
Benchmark with DP=4 + deepep_high_throughput:
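A sketch of how such a run could be launched; the environment variable and flags match recent vLLM, but treat the exact invocation as an assumption, and the model path as a placeholder.

```bash
VLLM_ALL2ALL_BACKEND=deepep_high_throughput \
vllm serve <blocked-fp8-moe-model> \
    --data-parallel-size 4 \
    --enable-expert-parallel
```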
…vllm-project#25219)
Signed-off-by: Bill Nell <bnell@redhat.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>

…#25219)
Signed-off-by: Bill Nell <bnell@redhat.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Purpose
Add fp8 blocked quantization support to compressed tensors MoE.
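For illustration, a blocked fp8 scheme in a compressed-tensors checkpoint's quantization_config might look roughly like this; the field names follow the compressed-tensors format as I understand it, and the specific values are assumptions rather than anything taken from this PR.

```json
{
  "quantization_config": {
    "format": "float-quantized",
    "config_groups": {
      "group_0": {
        "targets": ["Linear"],
        "weights": {
          "num_bits": 8,
          "type": "float",
          "strategy": "block",
          "block_structure": [128, 128],
          "symmetric": true,
          "dynamic": false
        },
        "input_activations": {
          "num_bits": 8,
          "type": "float",
          "strategy": "group",
          "group_size": 128,
          "dynamic": true
        }
      }
    }
  }
}
```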
Test Plan
Run lm-eval on the following models:
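A run of this shape would be typical for each model; the model path is a placeholder and the task and flags are illustrative assumptions, not taken from this PR.

```bash
lm_eval --model vllm \
    --model_args pretrained=<model>,tensor_parallel_size=1 \
    --tasks gsm8k \
    --batch_size auto
```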
Test Result