[Bugfix] Allow skipping MoE in NVFP4 (fix for MTP) #25987
base: main
Conversation
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Code Review
This pull request introduces a bugfix to allow skipping Mixture-of-Experts (MoE) layers during NVFP4 quantization, which is crucial for models like nvidia/DeepSeek-R1-FP4 when using Multi-Token Prediction (MTP).
The main changes are:
- In `vllm/model_executor/layers/quantization/modelopt.py`, `ModelOptNvFp4Config.get_quant_method` now checks whether an MoE layer is in the exclusion list and returns `None` if so.
- In `vllm/model_executor/layers/fused_moe/layer.py`, the `FusedMoE` layer's `__init__` method is updated to handle the `None` return from `get_quant_method` by falling back to the unquantized method, effectively skipping quantization for that layer.
- Several related changes in `deepseek_v2.py`, `deepseek_mtp.py`, and `deepseek_eagle.py` refactor how the model configuration is passed to `DeepseekV2DecoderLayer` to correctly support draft models in speculative decoding scenarios.
The changes are well-structured and correctly address the identified issue. The refactoring for config propagation is clean and necessary. The overall implementation looks solid.
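For illustration only, here is a minimal, self-contained sketch of the fallback path described above. The class names loosely mirror vLLM (`FusedMoE`, `UnquantizedFusedMoEMethod`), but the signatures and placeholder config are simplified assumptions, not the real vLLM API.

```python
# Sketch only (not vLLM source): simplified stand-ins showing how FusedMoE can
# fall back to an unquantized method when get_quant_method returns None.
from typing import Optional


class UnquantizedFusedMoEMethod:
    """Placeholder for the default, unquantized MoE execution path."""


class QuantizationConfig:
    """Placeholder quant config whose get_quant_method may return None."""

    def get_quant_method(self, layer: object, prefix: str) -> Optional[object]:
        # Returning None signals "skip quantization for this layer",
        # e.g. because the prefix matched an exclude_modules pattern.
        return None


class FusedMoE:
    def __init__(self, quant_config: Optional[QuantizationConfig], prefix: str):
        quant_method = None
        if quant_config is not None:
            quant_method = quant_config.get_quant_method(self, prefix)
        if quant_method is None:
            # Fall back to the unquantized MoE method instead of erroring,
            # so excluded layers (such as the MTP block) simply run unquantized.
            quant_method = UnquantizedFusedMoEMethod()
        self.quant_method = quant_method


layer = FusedMoE(QuantizationConfig(), prefix="model.layers.61.mlp.experts")
assert isinstance(layer.quant_method, UnquantizedFusedMoEMethod)
```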
Looks reasonable to me, thanks for the fix
@benchislett The basic model failure seems related
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
@benchislett please merge with main to fix the docker
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Purpose
There is no fallback in `ModelOptNvFp4Config.get_quant_method` for when the quant config should skip an MoE layer. This is a problem for nvidia/DeepSeek-R1-FP4 when running with MTP, since the entire MTP layer is left unquantized and should be skipped by quantization: https://huggingface.co/nvidia/DeepSeek-R1-FP4/blob/main/hf_quant_config.json#L188
"exclude_modules": [
...
"model.layers.61*",
...
]
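As a rough sketch of the missing fallback, `get_quant_method` can check the `exclude_modules` patterns and return `None` for matching layers. The pattern matching via `fnmatch` and the simplified constructor below are assumptions for illustration; the real `ModelOptNvFp4Config` has a richer interface.

```python
# Sketch only: simplified exclusion check mirroring the exclude_modules
# patterns from hf_quant_config.json above.
from fnmatch import fnmatch
from typing import Optional


class NvFp4MoEMethod:
    """Stand-in for the NVFP4-quantized MoE method."""


class ModelOptNvFp4Config:
    def __init__(self, exclude_modules: list[str]):
        self.exclude_modules = exclude_modules

    def get_quant_method(self, layer: object, prefix: str) -> Optional[object]:
        # "model.layers.61*" covers the whole MTP layer of DeepSeek-R1-FP4,
        # so every module under it (including the MoE experts) is skipped.
        if any(fnmatch(prefix, pattern) for pattern in self.exclude_modules):
            return None
        return NvFp4MoEMethod()


cfg = ModelOptNvFp4Config(exclude_modules=["model.layers.61*"])
assert cfg.get_quant_method(None, "model.layers.61.mlp.experts") is None
assert cfg.get_quant_method(None, "model.layers.3.mlp.experts") is not None
```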
This PR includes some diff from #25953.
Testing
Evaluated in combination with #25984, see results there.