
Conversation

varun-sundar-rabindranath (Contributor)

@varun-sundar-rabindranath varun-sundar-rabindranath commented Sep 23, 2025

Purpose

Enable GPT-OSS DP/EP for the DeepEPHighThroughput All2All backend using the Marlin codepath. This can serve as an alternative to matmul_ogs from Triton.

Example serving command:
VLLM_MXFP4_USE_MARLIN=1 VLLM_ALL2ALL_BACKEND="deepep_high_throughput" canhazgpu run -g2 -- vllm serve openai/gpt-oss-120b --tensor-parallel-size 1 --data-parallel-size 2 --enable-expert-parallel --no-enable-prefix-caching --port 9010

Test Plan

GPT-OSS evals
Server command: VLLM_MXFP4_USE_MARLIN=1 VLLM_ALL2ALL_BACKEND="deepep_high_throughput" vllm serve openai/gpt-oss-120b --tensor-parallel-size 1 --data-parallel-size 2 --enable-expert-parallel --no-enable-prefix-caching --port 9010
Eval command: OPENAI_API_KEY=empty python -m gpt_oss.evals --model openai/gpt-oss-120b --eval gpqa --n-threads 128 --base-url http://localhost:9010/v1

Test Result


gpt-oss-120b

Reasoning Effort   GPQA
low                0.652
medium             0.723
high               0.797

Benchmarks

Please find results here
TL;DR:
Compared with TP: better TTFT; worse TPOT, since deepep_high_throughput enforces eager mode.
Compared with OAITritonExperts: worse TTFT and TPOT.

@gemini-code-assist bot left a comment

Code Review

This pull request enables GPT-OSS DP/EP with Marlin kernels by introducing a MarlinExperts modular kernel and integrating it into the mxfp4 quantization backend. The changes correctly refactor the MoE logic to use the modular kernel framework. However, I've found a high-severity issue in the MarlinExperts implementation where workspace management is incorrect, which could lead to performance degradation due to repeated memory allocations. I've provided a suggestion to fix this.
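The allocation concern can be illustrated with a common grow-only caching pattern. This is a generic sketch under stated assumptions, not the PR's actual fix: the class name and `bytearray` stand in for what would be a device tensor in real vLLM code.

```python
class WorkspaceCache:
    """Illustrative sketch (not the PR's actual code): cache the Marlin
    workspace so repeated forward passes reuse one allocation instead of
    allocating on every call."""

    def __init__(self):
        self._buf = None  # would be a device tensor in real code

    def get(self, numel):
        # Reallocate only when the request outgrows the cached buffer.
        if self._buf is None or len(self._buf) < numel:
            self._buf = bytearray(numel)
        return memoryview(self._buf)[:numel]


cache = WorkspaceCache()
cache.get(1024)
first = cache._buf
cache.get(512)        # smaller request: no new allocation
assert cache._buf is first
cache.get(2048)       # larger request: grow once, then keep reusing
assert cache._buf is not first
```

The key property is that allocation happens only on growth, so steady-state forward passes incur no allocator traffic.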

@varun-sundar-rabindranath varun-sundar-rabindranath marked this pull request as ready for review September 24, 2025 14:06
@varun-sundar-rabindranath (Author)

@zyongye @mgoin @bnellnm please take a look when you have a moment. Thanks 🙌


bnellnm commented Sep 30, 2025

Looks good to me. Just had a few minor comments. Will this PR interfere with #21166?

Varun Sundar Rabindranath added 2 commits October 1, 2025 17:47
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
@mergify mergify bot added the documentation Improvements or additions to documentation label Oct 1, 2025
@varun-sundar-rabindranath (Author)

> Looks good to me. Just had a few minor comments. Will this PR interfere with #21166?

Thanks for the review! I don't think this PR will interfere with #21166; this PR has more to do with Marlin than with mxfp4.

@bnellnm left a comment

LGTM!

@mgoin mgoin added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 3, 2025
@mgoin (Member) left a comment

Great work dealing with the tricky cases in Marlin, LGTM!

On the snippet:

    def _is_marlin_mxfp4_w4an(quant_config: Optional[FusedMoEQuantConfig] = None):

@mgoin commented:

nit: call it _is_marlin_mxfp4_w4aN to make it clearer

On the snippet:

    Given Marlin packed weight matrices w1_packed and w2_packed,
    return the MoE intermediate size N
    """
    marlin_tile_size = 16

@mgoin commented:
What does this tile size actually correspond to? Would be good to leave a note
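For context, Marlin lays weights out in 16×16 tiles, so a packed weight's tiled dimension is the original dimension divided by 16, and the intermediate size can be read back from the packed shape. A hedged, illustrative sketch of that arithmetic (the helper name and the assumed packed layout are illustrations, not the PR's actual code):

```python
MARLIN_TILE_SIZE = 16  # Marlin operates on 16x16 weight tiles

def intermediate_size_from_w2(w2_packed_shape):
    """Illustrative only: assuming w2 (intermediate N -> hidden K) is packed
    as (num_experts, N // MARLIN_TILE_SIZE, packed_K), recover N from the
    tiled dimension."""
    _, packed_n, _ = w2_packed_shape
    return packed_n * MARLIN_TILE_SIZE

# Hypothetical shape: N = 2880 packs to 2880 // 16 = 180 rows of tiles.
assert intermediate_size_from_w2((128, 180, 512)) == 2880
```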

@github-project-automation github-project-automation bot moved this from To Triage to Ready in gpt-oss Issues & Enhancements Oct 3, 2025
@mgoin mgoin merged commit 7ef40bb into vllm-project:main Oct 4, 2025
56 checks passed
tomeras91 pushed a commit to tomeras91/vllm that referenced this pull request Oct 6, 2025
…m-project#25488)

Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
karan pushed a commit to karan/vllm that referenced this pull request Oct 6, 2025
…m-project#25488)

Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Karan Goel <3261985+karan@users.noreply.github.com>
southfreebird pushed a commit to southfreebird/vllm that referenced this pull request Oct 7, 2025
…m-project#25488)

Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
Labels
documentation Improvements or additions to documentation gpt-oss Related to GPT-OSS models ready ONLY add when PR is ready to merge/full CI is needed
Projects
Status: Done
3 participants