[ROCm][Quantization][2/N] Refactor quark_moe w4a8 w/ oracle #39136
BowenBao wants to merge 2 commits into vllm-project:main from
Conversation
Code Review
This pull request introduces support for ROCm-specific MXFP4 MoE backends, specifically targeting GFX950 architectures using AITER triton kernels. Key changes include the addition of the AiterW4A8ExpertsMonolithic class for W4A8 (MXFP4 weights with static FP8 activations) and the expansion of the oracle system to handle backend selection and weight conversion for these new schemes. Additionally, the QuarkOCP_MX_MoEMethod has been refactored to unify backend selection through the oracle, allowing for the removal of the redundant QuarkOCP_MX_MoEMethod_OSS class. Review feedback suggests enhancing the descriptiveness of error messages regarding missing input scales and simplifying the complex emulation mode logic to improve code maintainability.
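The oracle-based backend selection described above can be sketched roughly as follows. This is an illustrative sketch only: the enum members, function name, and selection rules here are assumptions, not vLLM's actual API.

```python
from enum import Enum


# Hypothetical backend enum; vLLM's real enum has different members.
class MoeBackend(Enum):
    AITER = "AITER"
    MARLIN = "MARLIN"
    EMULATION = "EMULATION"


def select_backend(weight_scheme: str, act_scheme: str, is_gfx950: bool) -> MoeBackend:
    """Toy oracle: pick a MoE backend from the quant scheme and platform.

    Mirrors the idea in the PR: GFX950 with MXFP4 weights and FP8/MXFP4
    activations routes to the AITER kernels, other MXFP4 configs fall back
    to Marlin, and everything else is emulated.
    """
    if is_gfx950 and weight_scheme == "mxfp4" and act_scheme in ("mxfp4", "fp8"):
        return MoeBackend.AITER
    if weight_scheme == "mxfp4":
        return MoeBackend.MARLIN
    return MoeBackend.EMULATION
```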
AndreasKaratzas
left a comment
LGTM. Some minor questions only.
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed 1a54e39 to e2573f7.
Hi @BowenBao, the pre-commit checks have failed. Please run:

```shell
uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Force-pushed 1655483 to 1c160f4.
- Add oracle backend selection for MXFP4 MoE
- Add unittest cases, fix w4a8 weight re-assign
- Refactor kernel selection and move out aiter kernel

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Bowen Bao <bowenbao@amd.com>
Force-pushed 1c160f4 to 7a6e293.
Signed-off-by: Bowen Bao <bowenbao@amd.com>
```diff
  MARLIN = "MARLIN"
- # ROCm AITER
+ # ROCm AITER backends
  AITER = "AITER"
```
I think we should rename to AITER_CK and AITER_TRITON_MXFP4_FP8 for clarity ... AITER_CK supports both W4A4 and W4A16, and we're experimenting with W4A8 right now.
In general that makes sense; either we keep a single backend enum & expert class for AITER_CK, or we keep a separate wrapper for each quant config, as in #41436.
With AITER_CK, the only issue I see so far is that you can't immediately tell which config combos it uses or supports. The current weight postprocessing (shuffling, etc.) branches on backend; with a single AITER_CK we'd introduce more complex logic there to further distinguish between configs. That is, assuming w4a16, w4a4, and potentially w4a8 each need different postprocessing logic, which seems to be the case in the existing code.
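One way to keep a single AITER_CK backend while still distinguishing per-config weight postprocessing is a dispatch table keyed on the quant scheme, so adding w4a8 becomes one table entry instead of another branch. This is a sketch of that design choice, not the code under review; the scheme names follow the PR's diff, but the shuffle functions are placeholders.

```python
from typing import Callable


# Placeholder postprocessing functions: the real per-scheme shuffles
# depend on the AITER kernel layouts, which are not shown in this PR.
def _postprocess_w4a4(weight: list[int]) -> list[int]:
    return list(reversed(weight))  # stand-in for a w4a4-specific shuffle


def _postprocess_w4a16(weight: list[int]) -> list[int]:
    return weight[:]  # stand-in: identity copy


# Scheme string -> postprocessing fn; a w4a8 entry ("w_mxfp4_a_fp8")
# would slot in here without touching the dispatch logic.
_POSTPROCESS: dict[str, Callable[[list[int]], list[int]]] = {
    "w_mxfp4_a_mxfp4": _postprocess_w4a4,
    "w_mxfp4": _postprocess_w4a16,
}


def postprocess_weight(scheme: str, weight: list[int]) -> list[int]:
    if scheme not in _POSTPROCESS:
        raise ValueError(f"no AITER_CK postprocessing registered for {scheme!r}")
    return _POSTPROCESS[scheme](weight)
```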
```diff
  # TODO: Remove once all OCP MX schemes use the kernel abstraction
- _AITER_NATIVE_OCP_MX_SCHEMES = ("w_mxfp4", "w_mxfp4_a_mxfp4")
+ _AITER_NATIVE_OCP_MX_SCHEMES = ("w_mxfp4", "w_mxfp4_a_mxfp4", "w_mxfp4_a_fp8")
  self.emulate = (
```
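The `self.emulate` assignment is cut off in the diff, but it presumably falls back to emulation when the configured scheme has no native AITER kernel. A minimal sketch of that check, with the predicate name and the `aiter_available` parameter assumed for illustration:

```python
# Tuple taken from the diff above; "w_mxfp4_a_fp8" is the new w4a8 scheme.
_AITER_NATIVE_OCP_MX_SCHEMES = ("w_mxfp4", "w_mxfp4_a_mxfp4", "w_mxfp4_a_fp8")


def needs_emulation(scheme: str, aiter_available: bool) -> bool:
    # Emulate when AITER is unavailable or the scheme lacks a native kernel.
    return not (aiter_available and scheme in _AITER_NATIVE_OCP_MX_SCHEMES)
```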
Shouldn't we remove this flag/override entirely and just let the oracle set the backend?
```diff
@@ -392,10 +430,15 @@ def _return_or_raise(
      )

      for backend in AVAILABLE_BACKENDS:
```
this should be `_get_priority_backends` if we're renaming `select_gpt_oss_mxfp4_moe_backend` -> `select_mxfp4_moe_backend` like this ... let's discuss offline
good catch, let's resolve in follow-ups.
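The loop under discussion walks candidate backends in priority order and returns the first supported one; the distinction being raised is whether the candidate list is the global `AVAILABLE_BACKENDS` or a scheme-specific priority list. A hedged sketch of the pattern (`_get_priority_backends` and the `supported` predicate are assumptions, not vLLM's actual signatures):

```python
def _get_priority_backends(scheme: str) -> list[str]:
    # Hypothetical priority ordering; the real list would be
    # platform- and scheme-dependent.
    if scheme == "w_mxfp4_a_fp8":
        return ["AITER", "MARLIN", "EMULATION"]
    return ["MARLIN", "EMULATION"]


def select_mxfp4_moe_backend(scheme: str, supported: set[str]) -> str:
    """Return the first priority backend that the current setup supports."""
    for backend in _get_priority_backends(scheme):
        if backend in supported:
            return backend
    raise RuntimeError(f"no MoE backend supports scheme {scheme!r}")
```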
QuarkOCP_MX_MoEMethod_OSS and add aiter w4a8 backend.

```shell
pytest -s -v tests/evals/gpt_oss/test_gpqa_correctness.py \
    --config-list-file=tests/evals/gpt_oss/configs/models-gfx950.txt
pytest -s -v tests/evals/gsm8k/test_gsm8k_correctness.py \
    --config-list-file=tests/evals/gsm8k/configs/models-qwen35-mi355.txt
```