[Bugfix] Fix precision corruption when shared_experts_stream=None #28942
Conversation
…_stream=None Signed-off-by: zhyajie <yajizhan@amd.com>
Code Review
This pull request effectively addresses a precision corruption bug that occurs when shared_experts_stream is None. The root cause, an in-place mutation of the hidden_states tensor by the routed expert computation before it's used by the shared experts, is correctly identified. The fix involves reordering the operations to execute the shared experts before the routed expert computation, which is a logical and direct solution. Additionally, the changes correctly extend auxiliary stream support to cuda-alike platforms like ROCm, which is crucial for the tested environment. The provided test plan is comprehensive and clearly demonstrates the fix's effectiveness. The code changes are clean, targeted, and appear to be correct. I have no further recommendations.
tjtanaa
left a comment
LGTM
@zhyajie are the hidden states mutated only by the ROCm AITER fused MoE, or does the Triton fused MoE mutate them as well?

Both are expected to mutate the hidden states. The precision issue I initially discovered was in the Kimi v2 thinking model, which uses the Triton fused MoE kernel.
@robertgshaw2-redhat @SageMoore @alexm-redhat can you please take a look at both PRs? Which one is the better solution?
SageMoore
left a comment
Looks reasonable to me. Thanks for the fix!
Hi, I have successfully verified the fix provided in this PR using the RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic model on a ROCm/MI300X environment (TP=4).

Thanks for validating with another model as well.
…lm-project#28942) Signed-off-by: zhyajie <yajizhan@amd.com> Co-authored-by: zhyajie <yajizhan@amd.com>
Fix precision issues when shared_experts_stream=None
Purpose
Fix precision issues when shared_experts_stream=None by executing the shared experts before the routed experts kernel, preventing hidden_states from being mutated in place.
Test Plan
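The reordering can be illustrated with a toy sketch in plain Python (the function names here are illustrative stand-ins, not vLLM's actual API): when the routed-expert kernel mutates its input buffer in place, the shared experts must read hidden_states before that kernel runs.

```python
def routed_experts(hidden_states):
    # Stand-in for a fused-MoE kernel that mutates its input buffer in place.
    for i in range(len(hidden_states)):
        hidden_states[i] *= 2
    return hidden_states

def shared_experts(hidden_states):
    # Stand-in for the shared experts; must see the original hidden_states.
    return [x + 1 for x in hidden_states]

def forward_buggy(hidden_states):
    routed = routed_experts(hidden_states)  # mutates hidden_states in place
    shared = shared_experts(hidden_states)  # reads the corrupted buffer
    return [r + s for r, s in zip(routed, shared)]

def forward_fixed(hidden_states):
    shared = shared_experts(hidden_states)  # reads the original values first
    routed = routed_experts(hidden_states)  # in-place mutation is now harmless
    return [r + s for r, s in zip(routed, shared)]

print(forward_buggy([1.0, 2.0]))  # [5.0, 9.0] — shared experts saw doubled inputs
print(forward_fixed([1.0, 2.0]))  # [4.0, 7.0] — correct combination
```

This is the same reordering the PR applies: with a separate shared_experts_stream the shared experts run on a copy of the activations, but with shared_experts_stream=None both paths share one buffer, so ordering matters.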
Tested on ROCm (gfx942) with DeepSeek-R1 model using TP=8.
Start vLLM Server
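The exact launch command was not preserved in this page; a typical invocation matching the stated setup (DeepSeek-R1, TP=8) might look like the following — the model path is a placeholder from the test output:

```bash
vllm serve path/deepseek-ai/DeepSeek-R1 \
  --tensor-parallel-size 8
```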
Send Test Request
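The original request body was not preserved; judging from the responses below (5 prompt tokens, 16 completion tokens, a "capital of China" completion), a request along these lines would reproduce it — the prompt text is an assumption:

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "path/deepseek-ai/DeepSeek-R1",
    "prompt": "The capital of China",
    "max_tokens": 16
  }'
```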
Test Result
Before Fix (Corrupted Output)

```json
{
  "id": "cmpl-00b08a50df1b4a968ab5e41ccbeeced8",
  "object": "text_completion",
  "created": 1763468384,
  "model": "path/deepseek-ai/DeepSeek-R1",
  "choices": [
    {
      "index": 0,
      "text": " Beijingisnownsgrmetically Beijing BeijingBeijingBeijingBeijingBeijingBeijingBeijingBeijingBeijing",
      "finish_reason": "length"
    }
  ],
  "usage": { "prompt_tokens": 5, "total_tokens": 21, "completion_tokens": 16 }
}
```

After Fix (Correct Output)
```json
{
  "id": "cmpl-dc933ac440a3460ca2922efcaff43f0e",
  "object": "text_completion",
  "created": 1763468925,
  "model": "path/deepseek-ai/DeepSeek-R1",
  "choices": [
    {
      "index": 0,
      "text": " is Beijing, and the capital of Russia is Moscow. Both China and Russia have",
      "finish_reason": "length"
    }
  ],
  "usage": { "prompt_tokens": 5, "total_tokens": 21, "completion_tokens": 16 }
}
```