
[ROCm][Quantization][2/N] Refactor quark_moe w4a8 w/ oracle #39136

Open
BowenBao wants to merge 2 commits into vllm-project:main from BowenBao:bowenbao/oracle_w4a8

Conversation


@BowenBao (Contributor) commented Apr 7, 2026

  1. Remove QuarkOCP_MX_MoEMethod_OSS and add the AITER w4a8 backend. (A rough sketch of the oracle-style backend selection this builds on follows the validation commands below.)
  2. Add unit tests for ROCm w4a16 and w4a8 fused MoE.
  3. Validated locally with:
pytest -s -v tests/evals/gpt_oss/test_gpqa_correctness.py \
    --config-list-file=tests/evals/gpt_oss/configs/models-gfx950.txt

pytest -s -v tests/evals/gsm8k/test_gsm8k_correctness.py \
    --config-list-file=tests/evals/gsm8k/configs/models-qwen35-mi355.txt
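
For context, the "oracle" in item 1 refers to a centralized backend-selection function. A minimal sketch of the idea, using hypothetical names (`MoEBackend`, `select_backend`) rather than the PR's actual API:

```python
# Minimal sketch of oracle-style backend selection; all names here are
# hypothetical and simplified, not the PR's actual API.
from enum import Enum


class MoEBackend(Enum):
    TRITON = "TRITON"   # generic fallback / emulation path
    MARLIN = "MARLIN"   # weight-only MXFP4
    AITER = "AITER"     # ROCm AITER kernels


def select_backend(scheme: str, is_rocm: bool, is_gfx950: bool) -> MoEBackend:
    """Return the preferred backend for an OCP MX quantization scheme."""
    if is_rocm and is_gfx950 and scheme == "w_mxfp4_a_fp8":
        # w4a8: MXFP4 weights with FP8 activations, served by AITER on gfx950
        return MoEBackend.AITER
    if scheme == "w_mxfp4":
        # w4a16: weight-only MXFP4
        return MoEBackend.MARLIN
    return MoEBackend.TRITON
```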

@mergify Bot added the gpt-oss (Related to GPT-OSS models) and rocm (Related to AMD ROCm) labels Apr 7, 2026
@github-project-automation Bot moved this to Todo in AMD Apr 7, 2026

@gemini-code-assist Bot left a comment


Code Review

This pull request introduces support for ROCm-specific MXFP4 MoE backends, specifically targeting GFX950 architectures using AITER triton kernels. Key changes include the addition of the AiterW4A8ExpertsMonolithic class for W4A8 (MXFP4 weights with static FP8 activations) and the expansion of the oracle system to handle backend selection and weight conversion for these new schemes. Additionally, the QuarkOCP_MX_MoEMethod has been refactored to unify backend selection through the oracle, allowing for the removal of the redundant QuarkOCP_MX_MoEMethod_OSS class. Review feedback suggests enhancing the descriptiveness of error messages regarding missing input scales and simplifying the complex emulation mode logic to improve code maintainability.
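
As a reading aid for the terminology above: "W4A8" here means MXFP4 (4-bit microscaled) weights paired with statically scaled FP8 activations. A rough emulation-style sketch of those numerics, assuming a recent PyTorch with FP8 dtypes available; this is not the fused AITER kernel path:

```python
import torch


def w4a8_emulated_linear(
    x: torch.Tensor,            # activations, e.g. bf16
    w_dequant: torch.Tensor,    # MXFP4 weights already dequantized to x.dtype
    input_scale: torch.Tensor,  # static per-tensor activation scale
) -> torch.Tensor:
    # Quantize activations to FP8 with the static scale, then dequantize,
    # mimicking the numerics of an FP8 activation path in high precision.
    x_fp8 = (x / input_scale).to(torch.float8_e4m3fn)
    x_deq = x_fp8.to(x.dtype) * input_scale
    return x_deq @ w_dequant.t()
```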

Comment thread vllm/model_executor/layers/fused_moe/oracle/mxfp4.py
Comment thread vllm/model_executor/layers/quantization/quark/quark_moe.py
Comment thread vllm/model_executor/layers/fused_moe/oracle/mxfp4.py Outdated
@BowenBao marked this pull request as ready for review April 7, 2026 01:58

@AndreasKaratzas (Collaborator) left a comment


LGTM. Some minor questions only.

Comment thread tests/kernels/moe/test_ocp_mx_moe.py
Comment thread tests/kernels/moe/test_ocp_mx_moe.py
Comment thread tests/kernels/moe/test_ocp_mx_moe.py

mergify Bot commented Apr 9, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @BowenBao.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Comment thread vllm/model_executor/layers/fused_moe/experts/gpt_oss_triton_kernels_moe.py Outdated
Comment thread vllm/model_executor/layers/quantization/quark/quark_moe.py
Comment thread tests/kernels/moe/test_ocp_mx_moe.py
@BowenBao force-pushed the bowenbao/oracle_w4a8 branch from 1a54e39 to e2573f7 on April 22, 2026 21:23
@mergify Bot removed the needs-rebase label Apr 22, 2026

mergify Bot commented Apr 22, 2026

Hi @BowenBao, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

Comment thread vllm/model_executor/layers/fused_moe/experts/gpt_oss_triton_kernels_moe.py Outdated
Comment thread vllm/model_executor/layers/fused_moe/oracle/mxfp4.py
Comment thread vllm/model_executor/layers/quantization/quark/quark_moe.py Outdated
@BowenBao force-pushed the bowenbao/oracle_w4a8 branch from 1655483 to 1c160f4 on April 29, 2026 21:52
Comment thread vllm/model_executor/layers/fused_moe/oracle/mxfp4.py
- Add oracle backend selection for MXFP4 MOE
- Add unittest cases, fix w4a8 weight re-assign
- Refactor kernel selection and move out aiter kernel

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Bowen Bao <bowenbao@amd.com>
@BowenBao force-pushed the bowenbao/oracle_w4a8 branch from 1c160f4 to 7a6e293 on April 30, 2026 16:05
Comment thread vllm/model_executor/layers/fused_moe/oracle/mxfp4.py
Comment thread vllm/model_executor/layers/fused_moe/oracle/mxfp4.py
Comment thread vllm/model_executor/layers/fused_moe/experts/gpt_oss_triton_kernels_moe.py Outdated
Signed-off-by: Bowen Bao <bowenbao@amd.com>
@github-project-automation Bot moved this from To Triage to Ready in gpt-oss Issues & Enhancements Apr 30, 2026
@robertgshaw2-redhat enabled auto-merge (squash) April 30, 2026 19:23
@robertgshaw2-redhat added the ready label (ONLY add when PR is ready to merge/full CI is needed) Apr 30, 2026
MARLIN = "MARLIN"
# ROCm AITER
AITER = "AITER"
# ROCm AITER backends

@Rohan138 (Contributor) commented Apr 30, 2026


I think we should rename these to AITER_CK and AITER_TRITON_MXFP4_FP8 for clarity ... AITER_CK supports both W4A4 and W4A16, and we're experimenting with W4A8 right now.
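
A sketch of the proposed naming, as one reading of the suggestion (illustrative only, not the PR's code):

```python
from enum import Enum


class OCPMxMoEBackend(Enum):
    MARLIN = "MARLIN"
    # ROCm AITER CK kernels: W4A4 and W4A16 today, W4A8 experimental
    AITER_CK = "AITER_CK"
    # ROCm AITER triton kernels: MXFP4 weights with FP8 activations
    AITER_TRITON_MXFP4_FP8 = "AITER_TRITON_MXFP4_FP8"
```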

@BowenBao (Contributor, Author) replied:

In general that makes sense: either we keep a single backend enum & expert class for AITER_CK, or we keep a separate wrapper for each quant config as in #41436.

With AITER_CK, the only issue I see so far is that you can't immediately tell which config combos it uses or supports. Current weight postprocessing (shuffling etc.) branches on backends; with AITER_CK we would introduce more complex logic there to further distinguish between configs. That is, assuming w4a16, w4a4, and potentially w4a8 each need different postprocessing logic, which seems to be the case in the existing code.
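
To illustrate the postprocessing concern, a minimal sketch with hypothetical helper names (`shuffle_for_triton`, `shuffle_ck_*`); this is not vLLM code:

```python
def process_weights(backend: str, scheme: str, w):
    # Per-config backends branch once, on the backend name.
    if backend == "AITER_TRITON_MXFP4_FP8":
        return shuffle_for_triton(w)
    # An umbrella AITER_CK backend must branch again on the scheme,
    # since w4a16 / w4a4 / w4a8 each need a different weight layout.
    if backend == "AITER_CK":
        if scheme == "w_mxfp4":
            return shuffle_ck_w4a16(w)
        if scheme == "w_mxfp4_a_mxfp4":
            return shuffle_ck_w4a4(w)
        if scheme == "w_mxfp4_a_fp8":
            return shuffle_ck_w4a8(w)
    return w


# Placeholder shuffles; real kernels require specific weight layouts.
def shuffle_for_triton(w): return w
def shuffle_ck_w4a16(w): return w
def shuffle_ck_w4a4(w): return w
def shuffle_ck_w4a8(w): return w
```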

  # TODO: Remove once all OCP MX schemes use the kernel abstraction
- _AITER_NATIVE_OCP_MX_SCHEMES = ("w_mxfp4", "w_mxfp4_a_mxfp4")
+ _AITER_NATIVE_OCP_MX_SCHEMES = ("w_mxfp4", "w_mxfp4_a_mxfp4", "w_mxfp4_a_fp8")
  self.emulate = (
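
As a reading aid, the scheme strings encode the weight ("w_") and activation ("a_") dtypes; the w4aN annotations in the comments below are an interpretation, not documented naming:

```python
# Scheme strings encode weight ("w_") and activation ("a_") dtypes.
_AITER_NATIVE_OCP_MX_SCHEMES = (
    "w_mxfp4",          # MXFP4 weights, 16-bit activations -> w4a16
    "w_mxfp4_a_mxfp4",  # MXFP4 weights, MXFP4 activations  -> w4a4
    "w_mxfp4_a_fp8",    # MXFP4 weights, FP8 activations    -> w4a8 (this PR)
)
```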
Contributor commented:

Shouldn't we remove this flag/override entirely and just let the oracle set the backend?

@BowenBao (Contributor, Author) replied:

Yes, in #41436, after w4a4 is handled.

@@ -392,10 +430,15 @@ def _return_or_raise(
)

for backend in AVAILABLE_BACKENDS:
Contributor commented:

This should be _get_priority_backends if we're changing the names from select_gpt_oss_mxfp4_moe_backend -> select_mxfp4_moe_backend like this ... let me discuss offline.

@BowenBao (Contributor, Author) replied:

Good catch; let's resolve in follow-ups.
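
For readers following along, the catch is roughly the following (simplified, hypothetical names):

```python
AVAILABLE_BACKENDS = ["MARLIN", "TRITON", "AITER"]


def _get_priority_backends(scheme: str) -> list[str]:
    # Preference order depends on the scheme/platform, e.g. AITER first
    # for w4a8 on ROCm, generic TRITON as fallback.
    if scheme == "w_mxfp4_a_fp8":
        return ["AITER", "TRITON"]
    return AVAILABLE_BACKENDS


def select_mxfp4_moe_backend(scheme: str) -> str:
    # The loop should walk the priority-ordered list, not the raw set of
    # available backends, so the oracle's preference actually applies.
    for backend in _get_priority_backends(scheme):
        if backend_is_supported(backend, scheme):
            return backend
    raise RuntimeError(f"No MXFP4 MoE backend supports scheme {scheme!r}")


def backend_is_supported(backend: str, scheme: str) -> bool:
    # Placeholder capability check.
    return True
```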


Labels

gpt-oss (Related to GPT-OSS models), ready (ONLY add when PR is ready to merge/full CI is needed), rocm (Related to AMD ROCm)

Projects

AMD: Todo
gpt-oss Issues & Enhancements: Ready


5 participants