[Bugfix] Disable shared expert overlap if Marlin MoE is used #28410
Conversation
Signed-off-by: mgoin <mgoin64@gmail.com>
Code Review
This pull request introduces a bugfix to disable shared expert overlap when Marlin MoE kernels are used. This is achieved by adding a use_marlin_kernels property to the FusedMoE layer, which checks for a use_marlin attribute in the quantization method. The SharedFusedMoE layer is then updated to use this property to conditionally disable overlapping computation. The relevant Marlin-based MoE quantization methods (AWQMoEMethod, CompressedTensorsWNA16MarlinMoEMethod, GPTQMarlinMoEMethod, and Mxfp4MoEMethod) have been correctly updated to set this use_marlin flag. The changes are well-contained and correctly implemented to address the issue. I have no further comments.
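As a rough illustration of the mechanism described above (a sketch only: the `use_marlin_kernels` property, the `use_marlin` attribute, and the `SharedFusedMoE` name come from the review text, while the surrounding class bodies and the `_use_overlapped` field are simplified stand-ins, not the actual vLLM implementation):

```python
class FusedMoE:
    # `quant_method` is the per-layer quantization method instance
    # (e.g. AWQMoEMethod, GPTQMarlinMoEMethod, Mxfp4MoEMethod, ...).

    @property
    def use_marlin_kernels(self) -> bool:
        # True when the selected quantization method routes through the
        # Marlin MoE kernels (those methods set `use_marlin` on themselves).
        return getattr(self.quant_method, "use_marlin", False)


class SharedFusedMoE(FusedMoE):
    @property
    def use_overlapped(self) -> bool:
        # Shared-expert overlap runs the shared experts concurrently with the
        # routed experts; skip it when Marlin kernels are in use, since they
        # may not be safe under multi-stream execution.
        return self._use_overlapped and not self.use_marlin_kernels
```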
Can you also revert the disabling of overlap in that CI test? Otherwise LGTM
Signed-off-by: mgoin <mgoin64@gmail.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
vadiklyutiy left a comment
I second the motion
But actually my worry is: how do we know that other MoE backends are multi-stream safe?
It is good if it fails with an illegal memory access, as in this case: at least we know there is a bug. But frequently a race condition might just corrupt the hidden state and we get random incorrect output...
I agree with this - it looks like we are disabling this because it's failing in CI, but we need to do more comprehensive testing for the other kernels as well (especially on older HW with less memory). cc @mgoin / @vadiklyutiy
My experience says that comprehensive testing alone isn't enough (the bugs are rare and random). The multi-stream part of the code should be designed and reviewed to be stream-safe.
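One way to make that testing a bit more systematic, as a minimal sketch only (`run_moe_layer` and its `overlap` flag are hypothetical hooks, not an existing vLLM API): run identical inputs with shared-expert overlap on and off and require the outputs to agree, repeating many trials because races fire nondeterministically.

```python
import torch


def check_overlap_stream_safety(run_moe_layer, hidden_states, iters=50):
    """Compare overlapped vs. non-overlapped MoE outputs over many trials.

    `run_moe_layer(x, overlap)` is a hypothetical hook that runs the MoE
    forward pass with shared-expert overlap enabled or disabled. A single
    matching trial proves little, so repeat and fail on the first divergence.
    """
    reference = run_moe_layer(hidden_states, overlap=False)
    for i in range(iters):
        candidate = run_moe_layer(hidden_states, overlap=True)
        if not torch.allclose(candidate, reference, rtol=1e-3, atol=1e-3):
            raise AssertionError(f"Output diverged on trial {i}: possible race")
```

Even so, this only catches races that happen to fire during the trials, which is the point above: the multi-stream path also needs to be designed and reviewed for stream safety.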
Purpose
For now it seems that Marlin MoE might not be safe with multiple CUDA streams, which come into play when shared expert overlap is used. This was disabled within CI in #28324, so this PR disables shared expert overlap whenever Marlin is used, both to avoid user issues and to fix CI.
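For context, a generic PyTorch sketch (not vLLM's actual code) of what shared expert overlap implies at the CUDA level: the shared experts are launched on a side stream while the routed experts run on the current stream, so any kernel that is not stream-safe can race; disabling overlap falls back to a single-stream, in-order path.

```python
import torch


def moe_forward(x, routed_experts, shared_experts, overlap=True):
    if not overlap or not torch.cuda.is_available():
        # Safe fallback: everything runs in order on the current stream.
        return routed_experts(x) + shared_experts(x)

    side_stream = torch.cuda.Stream()
    side_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side_stream):
        # Shared experts run concurrently with the routed-expert kernels below.
        shared_out = shared_experts(x)
    routed_out = routed_experts(x)  # e.g. Marlin MoE kernels on the current stream
    torch.cuda.current_stream().wait_stream(side_stream)
    return routed_out + shared_out
```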
Test Plan
Test Result