[GPTOSS][DP/EP][Marlin] Enable GPTOSS DP/EP using Marlin kernels #25488
Conversation
Code Review
This pull request enables GPT-OSS DP/EP with Marlin kernels by introducing a MarlinExperts modular kernel and integrating it into the mxfp4 quantization backend. The changes correctly refactor the MoE logic to use the modular kernel framework. However, I've found a high-severity issue in the MarlinExperts implementation where workspace management is incorrect, which could lead to performance degradation due to repeated memory allocations. I've provided a suggestion to fix this.
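For illustration, a minimal sketch of the workspace-caching pattern such a fix typically takes; the class and method names below are hypothetical and do not match the actual MarlinExperts code.

```python
# Hypothetical sketch only: shows caching a workspace buffer so it is not
# re-allocated on every forward pass. Names do not match the real
# MarlinExperts implementation in vLLM.
from typing import Optional

import torch


class WorkspaceCache:
    def __init__(self) -> None:
        self._buf: Optional[torch.Tensor] = None

    def get(self, numel: int, device: torch.device) -> torch.Tensor:
        # Allocate only when the cached buffer is missing, too small, or on
        # the wrong device; otherwise reuse a slice of the existing buffer.
        if (self._buf is None or self._buf.numel() < numel
                or self._buf.device != device):
            self._buf = torch.zeros(numel, dtype=torch.int32, device=device)
        return self._buf[:numel]
```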
Looks good to me. Just had a few minor comments. Will this PR interfere with #21166?
Force-pushed from c1c70de to d9d38b0.
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
LGTM!
Great work dealing with the tricky cases in Marlin, LGTM!
def _is_marlin_mxfp4_w4an(quant_config: Optional[FusedMoEQuantConfig] = None):
nit: call it _is_marlin_mxfp4_w4aN to make it clearer
Given Marlin packed weight matrices w1_packed and w2_packed,
return the MoE intermediate size N
"""
marlin_tile_size = 16
What does this tile size actually correspond to? Would be good to leave a note.
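For illustration, a hedged sketch of the calculation this constant supports, assuming a GPTQ-Marlin-style layout in which the packed weight's reduction dimension is folded by a 16-element tile; the shape assumed below is illustrative, not the actual vLLM layout.

```python
# Illustrative sketch only: assumes the packed w2 tensor has shape
# (num_experts, N // marlin_tile_size, packed_hidden_dim), i.e. Marlin folds
# the reduction (intermediate) dimension by its 16-element tile.
# The real layout in vLLM may differ.
import torch

MARLIN_TILE_SIZE = 16  # assumed Marlin tile along the reduction dimension


def moe_intermediate_size(w2_packed: torch.Tensor) -> int:
    # w2 maps intermediate -> hidden, so undoing the tiling on its reduction
    # dimension recovers the MoE intermediate size N.
    return w2_packed.shape[1] * MARLIN_TILE_SIZE
```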
Purpose
Enable GPT-OSS DP/EP for the DeepEPHighThroughput All2All backend using the Marlin codepath. This can serve as an alternative to matmul_ogs from Triton.
Example serving command:
VLLM_MXFP4_USE_MARLIN=1 VLLM_ALL2ALL_BACKEND="deepep_high_throughput" canhazgpu run -g2 -- vllm serve openai/gpt-oss-120b --tensor-parallel-size 1 --data-parallel-size 2 --enable-expert-parallel --no-enable-prefix-caching --port 9010
Test Plan
GPT-OSS evals
Server command:
VLLM_MXFP4_USE_MARLIN=1 VLLM_ALL2ALL_BACKEND="deepep_high_throughput" vllm serve openai/gpt-oss-120b --tensor-parallel-size 1 --data-parallel-size 2 --enable-expert-parallel --no-enable-prefix-caching --port 9010
Eval command:
OPENAI_API_KEY=empty python -m gpt_oss.evals --model openai/gpt-oss-120b --eval gpqa --n-threads 128 --base-url http://localhost:9010/v1
Test Result
Benchmarks
Please find results here
TL;DR:
Compared with TP: better TTFT; worse TPOT, since deepep_high_throughput enforces eager mode.
Compared with OAITritonExperts: worse TTFT and TPOT.