[Bugfix][Qwen3-Next] add prefixes to shared_expert in qwen3-next and mlp in qwen2moe to successfully load ignored params in quantized models #24960
Conversation
Signed-off-by: toncao <cpatonn@gmail.com>
Code Review
This pull request aims to fix a KeyError during the loading of quantized models for qwen3-next and qwen2moe by adding the necessary prefixes to the shared_expert and mlp modules. The changes correctly add a prefix parameter to Qwen2MoeMLP and use it to construct proper names for its submodules. The fix for qwen3-next appears complete, as it passes the correct prefix to its shared_expert. However, the fix for qwen2moe seems incomplete, as the new prefix parameter is not used at the call sites within qwen2_moe.py, which could lead to the same loading errors. I've left a specific comment on this.
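For context, here is a minimal sketch of the naming pattern the review describes. QuantAwareLinear and ExampleMoeMLP are hypothetical stand-ins (not vLLM's real parallel linear layers): the constructor receives a prefix and extends it for each submodule, so every layer ends up knowing its fully qualified name.

```python
# Illustrative sketch only, not the actual vLLM implementation.
import torch.nn as nn


class QuantAwareLinear(nn.Linear):
    # Hypothetical stand-in for vLLM's parallel linear layers; a real layer
    # would use `prefix` to consult the quantization config (e.g. ignore lists).
    def __init__(self, in_features: int, out_features: int, prefix: str = ""):
        super().__init__(in_features, out_features, bias=False)
        self.prefix = prefix


class ExampleMoeMLP(nn.Module):
    # Stand-in for Qwen2MoeMLP: the prefix is extended per submodule so each
    # one is addressable by its fully qualified name.
    def __init__(self, hidden_size: int, intermediate_size: int, prefix: str = ""):
        super().__init__()
        self.gate_up_proj = QuantAwareLinear(
            hidden_size, 2 * intermediate_size, prefix=f"{prefix}.gate_up_proj")
        self.down_proj = QuantAwareLinear(
            intermediate_size, hidden_size, prefix=f"{prefix}.down_proj")


mlp = ExampleMoeMLP(16, 32, prefix="model.layers.0.mlp.shared_expert")
print(mlp.gate_up_proj.prefix)  # model.layers.0.mlp.shared_expert.gate_up_proj
```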
hidden_act: str,
quant_config: Optional[QuantizationConfig] = None,
reduce_results: bool = True,
prefix: str = "",
While adding the prefix parameter here is a good step, it seems the fix for qwen2moe is incomplete. This Qwen2MoeMLP is instantiated in two places in this file, but the prefix is not passed in either case:
- In Qwen2MoeSparseMoeBlock for the shared_expert (L131).
- In Qwen2MoeDecoderLayer for the non-MoE mlp (L302).
Without passing the prefix, the submodules inside Qwen2MoeMLP won't have the correct names, which will likely lead to the same KeyError for quantized qwen2moe models. This seems critical to fix for the PR to fully achieve its goal.
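For concreteness, a minimal, runnable sketch of the call-site change the reviewer is describing; ToyMLP, ToySparseMoeBlock, and ToyDecoderLayer are hypothetical stand-ins for the real qwen2_moe.py classes, and only the prefix handling mirrors the suggested fix (the real constructors take many more arguments).

```python
# Toy illustration of passing the prefix at both call sites; not vLLM code.
import torch.nn as nn


class ToyMLP(nn.Module):
    # Stand-in for Qwen2MoeMLP; it just records the prefix it was given.
    def __init__(self, prefix: str = ""):
        super().__init__()
        self.prefix = prefix


class ToySparseMoeBlock(nn.Module):
    # First call site: the shared expert gets prefix=f"{prefix}.shared_expert".
    def __init__(self, prefix: str = ""):
        super().__init__()
        self.shared_expert = ToyMLP(prefix=f"{prefix}.shared_expert")


class ToyDecoderLayer(nn.Module):
    # Second call site: the dense (non-MoE) MLP gets prefix=f"{prefix}.mlp".
    def __init__(self, prefix: str = ""):
        super().__init__()
        self.mlp = ToyMLP(prefix=f"{prefix}.mlp")


print(ToySparseMoeBlock(prefix="model.layers.0.mlp").shared_expert.prefix)
# -> model.layers.0.mlp.shared_expert
print(ToyDecoderLayer(prefix="model.layers.1").mlp.prefix)
# -> model.layers.1.mlp
```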
@toncao could you please look at this? The bot is right.
Yes, absolutely! I have updated the PR to reflect this in the most recent commit ed5e2a29cdde3a34dfaa3843d52a8f4d2e01543e.
Thank you
@jeejeelee Apologies, I commented on the wrong thread. Could you please convert this PR to "ready to review" again? Thanks.
Signed-off-by: toncao <cpatonn@gmail.com>
…litPR into model_register
* 'model_register' of https://github.com/dsxsteven/vllm_splitPR: (138 commits)
  Retrieve `sliding_window` from text config in Gemma3 MM (vllm-project#25085)
  [Docs] Fix API Reference (vllm-project#25140)
  [Kernel] Better inf handling for grouped topk cu (vllm-project#24886)
  [CLI] Use streaming in CLI chat and completion commands (vllm-project#23769)
  [benchmark] add peak throughput metrics and plot (vllm-project#23867)
  [Spec Decode] Efficient padded speculation (vllm-project#24539)
  [V0 Deprecation] Remove more V0 tests (vllm-project#25117)
  [EPLB] Add EPLB support for hunyuan_v1 (vllm-project#23078)
  [XPU] Whisper model support on XPU Platform (vllm-project#25123)
  Mark prompt logprobs as incompatible with prompt embeds at API level (vllm-project#25077)
  [Model] enable data parallel for InternVL vision encoder (vllm-project#23909)
  [Kernels] Overlap shared experts with combine instead of dispatch (vllm-project#24254)
  [Bugfix][Qwen3-Next] add prefixes to shared_expert in qwen3-next and mlp in qwen2moe to successfully load ignored params in quantized models (vllm-project#24960)
  [Core][MM] Cleanup `MultiModalCache` (vllm-project#25006)
  [Docs] Clean up the contributing README (vllm-project#25099)
  [MM Encoder] Apply DP ViT for Qwen3-VL model series (vllm-project#24955)
  [Kernels] Enable DeepGEMM by default (vllm-project#24462)
  [V0 Deprecation] Skip PP test (vllm-project#25128)
  [V0 Deprecation] Remove misc V0 tests (vllm-project#25118)
  [V0 Deprecation] Remove V0 Tracing & Metrics tests (vllm-project#25115)
  ...
I built vLLM from the latest source; the error still exists.
params
I am using the model nm-testing/qwen3-80b-fp8-dynamic, and my hardware is 2 x A100.
@halexan Thanks for trying the new model! Can you try using this model on H100 machines? I don't think A100 would work. Thanks!
I might be wrong but I think this is due to
Edit:
…mlp in qwen2moe to successfully load ignored params in quantized models (vllm-project#24960)
Signed-off-by: toncao <cpatonn@gmail.com>
Co-authored-by: toncao <cpatonn@gmail.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
…mlp in qwen2moe to successfully load ignored params in quantized models (vllm-project#24960)
Signed-off-by: toncao <cpatonn@gmail.com>
Co-authored-by: toncao <cpatonn@gmail.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: charlifu <charlifu@amd.com>
Purpose
Because qwen3-next does not pass prefixes when it loads shared_expert modules, those modules cannot be matched against the ignore list in the quantization config, which ultimately leads to errors such as:
Please see the linked discussion and model for more information.
If the shared_expert modules are not added to the ignore list (i.e., they are quantized as well), there are no loading errors, but model outputs deteriorate.
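To make the failure mode concrete, below is a minimal sketch (not vLLM code, with the matching logic simplified and the module names assumed) of why the prefix matters: the ignore list in a quantization config is keyed by fully qualified module names, so a module built with an empty prefix produces names that cannot be matched.

```python
# Simplified illustration of ignore-list matching; module names are assumed.
ignore_list = {
    "model.layers.0.mlp.shared_expert.gate_up_proj",
    "model.layers.0.mlp.shared_expert.down_proj",
}


def is_ignored(module_name: str) -> bool:
    # The real matching logic in vLLM is more involved; exact membership is
    # enough to show the effect of a missing prefix.
    return module_name in ignore_list


# With the prefix threaded through (this PR), the name matches the ignore list:
print(is_ignored("model.layers.0.mlp.shared_expert.gate_up_proj"))  # True

# Without the prefix, the module is only known as ".gate_up_proj", the lookup
# fails, and the loader treats an unquantized checkpoint weight as quantized,
# which surfaces as the loading error reported above.
print(is_ignored(".gate_up_proj"))  # False
```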
Essential Elements of an Effective PR Description Checklist
Update supported_models.md and examples for a new model.