
Conversation

@benchislett (Collaborator) commented on Sep 30, 2025

Purpose

There is no fallback in ModelOptNvFp4Config.get_quant_method for the case where quant_config should skip an MoE layer.

This is a problem for nvidia/DeepSeek-R1-FP4 when running with MTP, since the entire MTP layer is left unquantized and should be skipped by quantization:

https://huggingface.co/nvidia/DeepSeek-R1-FP4/blob/main/hf_quant_config.json#L188
"exclude_modules": [
...
"model.layers.61*",
...
]
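
For illustration, here is a minimal sketch of how a glob-style exclude_modules entry covers every submodule of the MTP layer; the fnmatch-based helper is an assumption for demonstration, not necessarily vLLM's exact matching logic:

from fnmatch import fnmatch

# Pattern taken from the hf_quant_config.json excerpt above.
exclude_modules = ["model.layers.61*"]

def is_excluded(prefix: str) -> bool:
    # Any module whose dotted prefix matches an exclusion glob should be
    # left unquantized.
    return any(fnmatch(prefix, pattern) for pattern in exclude_modules)

# Layer 61 is the MTP layer in nvidia/DeepSeek-R1-FP4, so everything under
# it matches the glob, including its MoE experts.
assert is_excluded("model.layers.61.mlp.experts")
assert not is_excluded("model.layers.0.mlp.experts")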

This PR includes part of the diff from #25953.

Testing

Evaluated in combination with #25984, see results there.

Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request introduces a bugfix to allow skipping Mixture-of-Experts (MoE) layers during NVFP4 quantization, which is crucial for models like nvidia/DeepSeek-R1-FP4 when using Multi-Token Prediction (MTP).

The main changes are:

  1. In vllm/model_executor/layers/quantization/modelopt.py, the ModelOptNvFp4Config.get_quant_method now checks if an MoE layer is in the exclusion list and returns None if so.
  2. In vllm/model_executor/layers/fused_moe/layer.py, the FusedMoE layer's __init__ method is updated to handle the None return from get_quant_method by falling back to the unquantized method, effectively skipping quantization for that layer (see the sketch after this list).
  3. Several related changes in deepseek_v2.py, deepseek_mtp.py, and deepseek_eagle.py refactor how the model configuration is passed to DeepseekV2DecoderLayer to correctly support draft models in speculative decoding scenarios.
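
A hedged sketch of changes 1 and 2 together, with stub classes standing in for the real vLLM types (the class bodies and matching rule are illustrative assumptions, not the literal diff):

from fnmatch import fnmatch

class UnquantizedFusedMoEMethod:
    """Stand-in for vLLM's unquantized fused-MoE method."""

class NvFp4FusedMoEMethod:
    """Stand-in for the NVFP4 fused-MoE method."""

class ModelOptNvFp4Config:
    def __init__(self, exclude_modules):
        self.exclude_modules = exclude_modules

    def get_quant_method(self, layer, prefix):
        # Change 1: an excluded layer now yields None instead of a
        # quantized method.
        if any(fnmatch(prefix, p) for p in self.exclude_modules):
            return None
        return NvFp4FusedMoEMethod()

class FusedMoE:
    def __init__(self, quant_config=None, prefix=""):
        quant_method = (quant_config.get_quant_method(self, prefix)
                        if quant_config is not None else None)
        # Change 2: None falls back to the unquantized path, so the
        # excluded MTP layer is simply skipped by quantization.
        self.quant_method = quant_method or UnquantizedFusedMoEMethod()

cfg = ModelOptNvFp4Config(exclude_modules=["model.layers.61*"])
assert isinstance(FusedMoE(cfg, "model.layers.61.mlp.experts").quant_method,
                  UnquantizedFusedMoEMethod)
assert isinstance(FusedMoE(cfg, "model.layers.0.mlp.experts").quant_method,
                  NvFp4FusedMoEMethod)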

The changes are well-structured and correctly address the identified issue. The refactoring for config propagation is clean and necessary. The overall implementation looks solid.

@benchislett added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Sep 30, 2025
@mgoin (Member) left a comment


Looks reasonable to me, thanks for the fix

@mgoin (Member) commented on Oct 1, 2025

@benchislett The basic model failure seems related

[2025-10-01T00:28:07Z] (EngineCore_DP0 pid=151)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 1139, in <lambda>
[2025-10-01T00:28:07Z] (EngineCore_DP0 pid=151)     lambda prefix: DeepseekV2DecoderLayer(vllm_config, prefix,
[2025-10-01T00:28:07Z] (EngineCore_DP0 pid=151)                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2025-10-01T00:28:07Z] (EngineCore_DP0 pid=151)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 997, in __init__
[2025-10-01T00:28:07Z] (EngineCore_DP0 pid=151)     config = config or vllm_config.model_config.hf_config
[2025-10-01T00:28:07Z] (EngineCore_DP0 pid=151)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2025-10-01T00:28:07Z] (EngineCore_DP0 pid=151)   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_device.py", line 103, in __torch_function__
[2025-10-01T00:28:07Z] (EngineCore_DP0 pid=151)     return func(*args, **kwargs)
[2025-10-01T00:28:07Z] (EngineCore_DP0 pid=151)            ^^^^^^^^^^^^^^^^^^^^^
[2025-10-01T00:28:07Z] (EngineCore_DP0 pid=151) RuntimeError: Boolean value of Tensor with more than one value is ambiguous
[2025-10-01T00:28:07Z] FAILED
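
For context on the traceback: config = config or vllm_config.model_config.hf_config evaluates bool(config), which raises exactly this RuntimeError whenever config arrives as a multi-element tensor. A minimal repro with the usual fix, an explicit None check (my illustration of the failure mode, not necessarily the PR's final patch):

import torch

def resolve_config(config, fallback):
    # "config or fallback" would call bool(config) and raise
    # "Boolean value of Tensor with more than one value is ambiguous"
    # for a multi-element tensor; testing identity against None avoids
    # the truthiness evaluation entirely.
    return fallback if config is None else config

t = torch.tensor([1, 2])
assert resolve_config(t, "hf_config") is t
assert resolve_config(None, "hf_config") == "hf_config"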

Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
@benchislett added the bug label (Something isn't working) on Oct 2, 2025
@mgoin (Member) commented on Oct 2, 2025

@benchislett please merge with main to fix the Docker build

mergify bot commented on Oct 3, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @benchislett.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label on Oct 3, 2025
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
mergify bot removed the needs-rebase label on Oct 3, 2025
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Labels
bug (Something isn't working), deepseek (Related to DeepSeek models), ready (ONLY add when PR is ready to merge/full CI is needed), speculative-decoding