
Conversation

@mgoin (Member) commented Nov 10, 2025

Purpose

For now it seems that Marlin MoE might not be safe with multiple CUDA streams, which come into play when shared expert overlap is used. The affected test was disabled in CI in #28324, so this PR disables shared expert overlap whenever Marlin is used, both to avoid user issues and to fix CI.
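
For context, a minimal sketch of the shared-expert-overlap pattern (not vLLM's actual SharedFusedMoE code; `routed_experts` and `shared_expert` are hypothetical stand-ins): the shared expert is launched on a secondary CUDA stream so it runs concurrently with the routed experts, which is exactly the situation a kernel that is not multi-stream safe can race under.

```python
import torch

def forward_with_overlap(hidden_states, routed_experts, shared_expert):
    # Launch the shared expert on a side stream so it overlaps with the
    # routed-expert computation on the default stream.
    side_stream = torch.cuda.Stream()
    side_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side_stream):
        shared_out = shared_expert(hidden_states)
    routed_out = routed_experts(hidden_states)  # runs on the default stream
    # Re-join the streams before combining the two partial results.
    torch.cuda.current_stream().wait_stream(side_stream)
    return routed_out + shared_out
```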

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: mgoin <mgoin64@gmail.com>
@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a bugfix to disable shared expert overlap when Marlin MoE kernels are used. This is achieved by adding a use_marlin_kernels property to the FusedMoE layer, which checks for a use_marlin attribute in the quantization method. The SharedFusedMoE layer is then updated to use this property to conditionally disable overlapping computation. The relevant Marlin-based MoE quantization methods (AWQMoEMethod, CompressedTensorsWNA16MarlinMoEMethod, GPTQMarlinMoEMethod, and Mxfp4MoEMethod) have been correctly updated to set this use_marlin flag. The changes are well-contained and correctly implemented to address the issue. I have no further comments.
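
A minimal sketch of that mechanism, assuming heavily simplified class shapes (the real FusedMoE and SharedFusedMoE carry far more state; only the `use_marlin_kernels` property and the conditional disable follow the review's description):

```python
class FusedMoE:
    def __init__(self, quant_method):
        self.quant_method = quant_method

    @property
    def use_marlin_kernels(self) -> bool:
        # Marlin-backed quant methods (AWQMoEMethod, GPTQMarlinMoEMethod,
        # CompressedTensorsWNA16MarlinMoEMethod, Mxfp4MoEMethod) set a
        # `use_marlin` attribute; default to False for everything else.
        return getattr(self.quant_method, "use_marlin", False)


class SharedFusedMoE(FusedMoE):
    def __init__(self, quant_method, use_overlapped=True):
        super().__init__(quant_method)
        # Marlin MoE is not known to be multi-stream safe, so never overlap
        # the shared expert with the routed experts when Marlin is in use.
        self.use_overlapped = use_overlapped and not self.use_marlin_kernels
```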

@robertgshaw2-redhat (Collaborator) commented:

Can you also revert the disabling of overlap in that CI test? Otherwise LGTM

Signed-off-by: mgoin <mgoin64@gmail.com>
@robertgshaw2-redhat enabled auto-merge (squash) November 10, 2025 19:30
@github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Nov 10, 2025
@mgoin added the bug (Something isn't working) and moe labels Nov 10, 2025
@mergify bot commented Nov 10, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @mgoin.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
@mergify bot removed the needs-rebase label Nov 11, 2025
@vadiklyutiy (Collaborator) left a comment


I second the motion

@vadiklyutiy (Collaborator) commented:

But actually my worry is: how do we know that the other MoE backends are multi-stream safe?
It is hard to be 100% sure, but maybe we should at least review them and check.
Otherwise such race-condition bugs may be really hard to find and identify.
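
A hypothetical sketch of the kind of spot check being suggested (`moe_forward` is a stand-in for whatever backend is under review, not vLLM test code), assuming races surface as nondeterministic output mismatches; as the following comments note, such probabilistic testing can flag a bug but cannot prove safety:

```python
import torch

def check_multi_stream_safety(moe_forward, x, trials=20, atol=1e-3):
    """Run the forward pass on a side stream many times and compare against
    a single-stream reference; races tend to appear only intermittently."""
    reference = moe_forward(x)  # computed on the default stream only
    side = torch.cuda.Stream()
    for _ in range(trials):
        side.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(side):
            out = moe_forward(x)
        torch.cuda.current_stream().wait_stream(side)
        if not torch.allclose(out, reference, atol=atol):
            return False  # mismatch: likely a stream-safety bug
    return True  # no mismatch observed (not a proof of safety)
```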

@vadiklyutiy (Collaborator) commented:

It is good if it fails with an illegal memory access, as in this case: at least we know there is a bug. But frequently a race condition might just corrupt the hidden state, and we get randomly incorrect output...

@robertgshaw2-redhat merged commit e5f599d into vllm-project:main Nov 11, 2025 (55 checks passed)
@zhewenl (Collaborator) commented Nov 11, 2025

> But actually my worry is: how do we know that the other MoE backends are multi-stream safe?

I agree with this - looks like we are disabling this because it's failing in CI, but we need to do more comprehensive testing of the other kernels as well (especially on older hardware with less memory)?

cc @mgoin / @vadiklyutiy

@vadiklyutiy (Collaborator) commented Nov 12, 2025

> > But actually my worry is: how do we know that the other MoE backends are multi-stream safe?
>
> I agree with this - looks like we are disabling this because it's failing in CI, but we need to do more comprehensive testing of the other kernels as well (especially on older hardware with less memory)?
>
> cc @mgoin / @vadiklyutiy

My experience says that comprehensive testing alone isn't enough (the bugs are rare and random). The multi-stream parts of the code should be designed and reviewed to be stream-safe.

fangyuchu pushed a commit to fangyuchu/vllm that referenced this pull request Nov 12, 2025
@yewentao256 deleted the disable-marlin-shared-expert-overlap branch November 12, 2025 14:38
geodavic pushed a commit to geodavic/vllm that referenced this pull request Nov 16, 2025
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
charlotte12l pushed a commit to charlotte12l/vllm that referenced this pull request Dec 5, 2025