[Bugfix] fix IMA issue in certain cases of the moe marlin kernel #28619
Code Review
This pull request provides a critical fix for a race condition in the MoE Marlin kernel. The change in csrc/moe/marlin_moe_wna16/marlin_template.h correctly addresses an issue where multiple threads could write to the same shared memory location, which would lead to incorrect results. By introducing a proper boundary check, the fix ensures memory safety and correctness. Additionally, this PR enables overlapped execution for Marlin kernels in vllm/model_executor/layers/fused_moe/shared_fused_moe.py, a performance optimization that was likely blocked by this bug. The changes are well-implemented and address the issue effectively.
if (idx < block_num_valid_tokens) {
  if constexpr (w_type == vllm::kFE2M1f && s_type == vllm::kFE4M3fn) {
    sh_block_topk_weights[idx] = __hmul2(
        global_scale, Dtype::num2num2(Dtype::float2num(
                          topk_weights_ptr[sh_block_sorted_ids[idx]])));
  } else {
    sh_block_topk_weights[idx] = Dtype::num2num2(
        Dtype::float2num(topk_weights_ptr[sh_block_sorted_ids[idx]]));
  }
}
This change correctly fixes a critical bug. The previous logic, which reset out-of-bounds idx values to 0, could lead to a race condition where multiple threads would write to sh_block_topk_weights[0] simultaneously. This would cause incorrect results and undefined behavior. By wrapping the operation in an if (idx < block_num_valid_tokens) check, you ensure that out-of-bounds accesses are safely skipped. This is the correct and robust approach to prevent this issue.
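To make the effect of the guard concrete, here is a minimal host-side sketch in plain C++. It is not the vLLM kernel: `BlockState`, `load_topk_weights`, and the `guarded` flag are hypothetical names chosen for illustration, each loop iteration stands in for one thread, and the unguarded mode only models the general failure (indexing global memory with an id read from a slot past the valid count), not the exact previous code.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side model of one thread block. Field names mirror the
// identifiers in the diff, but this struct does not exist in vLLM.
struct BlockState {
    std::vector<int32_t> sorted_ids;      // stands in for sh_block_sorted_ids
    std::vector<float> topk_weights_out;  // stands in for sh_block_topk_weights
    int num_valid_tokens;                 // stands in for block_num_valid_tokens
};

// Copies one top-k weight per "thread" and returns how many reads would have
// landed outside topk_weights. On a GPU such a read could surface as an
// illegal memory access; here it is only counted.
int load_topk_weights(BlockState& block,
                      const std::vector<float>& topk_weights,
                      bool guarded) {
    int bad_reads = 0;
    const int block_size = static_cast<int>(block.sorted_ids.size());
    for (int idx = 0; idx < block_size; ++idx) {  // one iteration per thread
        // The guard from the diff: slots past the valid count are skipped.
        if (guarded && idx >= block.num_valid_tokens) {
            continue;
        }
        const int32_t token = block.sorted_ids[idx];
        if (token < 0 || static_cast<std::size_t>(token) >= topk_weights.size()) {
            ++bad_reads;  // the id is garbage, so the read would be out of range
            continue;
        }
        block.topk_weights_out[idx] = topk_weights[token];
    }
    return bad_reads;
}
```

With two valid tokens in a four-slot block whose remaining slots hold stale ids, the guarded call reports no out-of-range reads and leaves the unused output slots untouched, while the unguarded call reports one per stale slot.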
Code Review
This pull request addresses a critical bug in the MoE marlin kernel and re-enables a related optimization. The change in csrc/moe/marlin_moe_wna16/marlin_template.h correctly fixes a race condition where multiple threads could write to the same shared memory location when handling tokens near block boundaries. The previous logic incorrectly reset an out-of-bounds index to 0, causing this data race. The new implementation properly guards the memory access with a conditional check, resolving the issue. The second change in vllm/model_executor/layers/fused_moe/shared_fused_moe.py re-enables overlapped execution for marlin kernels. This optimization was likely disabled due to the bug, and its re-introduction is a good performance improvement. The changes are correct and effectively address the underlying problem.
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
So, the illegal memory access appeared when we read some garbage from
Yes.
Just wondering how multi-stream execution affects the appearance of this issue...
I’m not sure. Before we read values from global memory into
Although I still have a slight feeling that something is left unsaid, because I don’t understand how multi-stream execution affected this.
yewentao256 left a comment
Let's try again before force-merging.
@yewentao256 The failed LoRA test seems unrelated.
fix #28220