
Conversation

@charlifu (Contributor) commented Sep 25, 2025

This PR adds fusion passes for AITER that fuse layernorm + FP8 block quant and SiLU + FP8 block quant.
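To make the fused operation concrete, here is a pure-Python sketch of the per-block (group) FP8 quantization that forms the second half of both fusions. The function name, the illustrative block size, and the use of plain lists instead of tensors are assumptions for readability, not vLLM's or AITER's actual API; real kernels operate on GPU tensors and cast to `float8_e4m3`.

```python
FP8_E4M3_MAX = 448.0  # max representable magnitude of float8_e4m3fn

def block_quant_fp8(row, block_size=4):
    """Quantize a 1-D list of floats per block: each block shares one scale.

    Returns (quantized_values, per_block_scales). Real kernels would cast
    the scaled values to float8; here we only scale, to show the math.
    """
    scales, quantized = [], []
    for start in range(0, len(row), block_size):
        block = row[start:start + block_size]
        amax = max(abs(v) for v in block) or 1e-12  # avoid divide-by-zero
        scale = amax / FP8_E4M3_MAX
        scales.append(scale)
        quantized.extend(v / scale for v in block)
    return quantized, scales
```

The fusion passes in this PR pattern-match a layernorm (or SiLU) followed by this kind of quantization and replace the pair with a single AITER kernel, avoiding one round trip through global memory.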

@gemini-code-assist bot left a comment


Code Review

This pull request adds support for fusing layernorm and SiLU operations with FP8 block quantization for AITER on ROCm. The changes introduce new fusion patterns for torch.compile and the corresponding custom operator implementations. They also extend AITER support to non-MI300 ROCm devices by providing a Triton-based fallback for GEMM operations.

My review found a critical issue in the type hints for the newly added custom operator implementations in vllm/model_executor/layers/layernorm.py. The residual parameter is incorrectly typed as torch.Tensor instead of Optional[torch.Tensor], which will lead to a TypeError during compilation and tracing. I've provided suggestions to fix this. The rest of the changes look good and are consistent with the goal of the pull request.

Comment on lines +116 to +117
x: torch.Tensor, residual: torch.Tensor, weight: torch.Tensor,
variance_epsilon: float) -> tuple[torch.Tensor, torch.Tensor]:

Severity: critical

The residual parameter is typed as torch.Tensor, but it can be None when called from the AiterRMSGroupQuantFP8Pattern fusion pass. This will cause a TypeError during compilation. Please change the type hint to Optional[torch.Tensor] to reflect that it can be None.

Suggested change
-    x: torch.Tensor, residual: torch.Tensor, weight: torch.Tensor,
+    x: torch.Tensor, residual: Optional[torch.Tensor], weight: torch.Tensor,
     variance_epsilon: float) -> tuple[torch.Tensor, torch.Tensor]:
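The reviewer's point generalizes: custom-op registration infers an argument schema from the annotations, so an argument annotated as a bare tensor type will reject `None` at trace time even if the function body handles it. The following torch-free sketch (hypothetical function names, `float` standing in for `torch.Tensor`) shows how a schema check driven by type hints distinguishes the two annotations.

```python
from typing import Optional, get_type_hints

def rms_norm_bad(x: float, residual: float) -> float:
    # Body tolerates None, but the annotation claims it never happens.
    return x + (residual or 0.0)

def rms_norm_good(x: float, residual: Optional[float]) -> float:
    return x + (residual or 0.0)

def accepts_none(fn, arg_name):
    """Mimic a schema check: does the annotation permit None?"""
    hint = get_type_hints(fn)[arg_name]
    # Optional[T] resolves to Union[T, None], whose __args__ contains NoneType.
    return type(None) in getattr(hint, "__args__", ())
```

A tracer performing the equivalent of `accepts_none` would raise a `TypeError` for the first signature whenever the fusion pass calls the op with `residual=None`, which is exactly the failure mode the review flags.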

Comment on lines +156 to +157
x: torch.Tensor, residual: torch.Tensor, weight: torch.Tensor,
variance_epsilon: float) -> tuple[torch.Tensor, torch.Tensor]:

Severity: critical

The residual parameter in this fake implementation is typed as torch.Tensor, but it can be None. This will cause a TypeError during fake-tensor tracing for the fusion pass. Please change the type hint to Optional[torch.Tensor].

Suggested change
-    x: torch.Tensor, residual: torch.Tensor, weight: torch.Tensor,
+    x: torch.Tensor, residual: Optional[torch.Tensor], weight: torch.Tensor,
     variance_epsilon: float) -> tuple[torch.Tensor, torch.Tensor]:

haoyangli-amd and others added 21 commits September 25, 2025 16:16
…ect#24649)

Signed-off-by: Haoyang Li <lihaoyang0109@gmail.com>
Co-authored-by: Haoyang Li <haoyang.li@amd.com>
Signed-off-by: charlifu <charlifu@amd.com>
…mark_serving_multi_turn) (vllm-project#23255)

Signed-off-by: daniels <daniels@pliops.com>
Signed-off-by: charlifu <charlifu@amd.com>
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
Signed-off-by: charlifu <charlifu@amd.com>
Signed-off-by: rouchenzi <ruochenwen@gmail.com>
Signed-off-by: rouchenzi <40842833+rouchenzi@users.noreply.github.com>
Co-authored-by: Bowen Wang <abmfy@icloud.com>
Signed-off-by: charlifu <charlifu@amd.com>
…llm-project#24969)

Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
Signed-off-by: charlifu <charlifu@amd.com>
Signed-off-by: whx-sjtu <2952154980@qq.com>
Signed-off-by: charlifu <charlifu@amd.com>
Signed-off-by: whx-sjtu <2952154980@qq.com>
Signed-off-by: charlifu <charlifu@amd.com>
Signed-off-by: Zhuohan Li <zhuohan123@gmail.com>
Signed-off-by: charlifu <charlifu@amd.com>
Signed-off-by: windsonsea <haifeng.yao@daocloud.io>
Signed-off-by: charlifu <charlifu@amd.com>
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
Signed-off-by: charlifu <charlifu@amd.com>
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: charlifu <charlifu@amd.com>
…tract_tool_call_required_streaming (vllm-project#24668)

Signed-off-by: Shijun Yin <shijun.yin@outlook.com>
Signed-off-by: charlifu <charlifu@amd.com>
…#25065)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: charlifu <charlifu@amd.com>
Signed-off-by: Daniel Afrimi <danielafrimi8@gmail.com>
Co-authored-by: root <root@cw-dfw-h100-001-305-026.cm.cluster>
Signed-off-by: charlifu <charlifu@amd.com>
…vllm-project#25046)

Signed-off-by: jiang1.li <jiang1.li@intel.com>
Signed-off-by: charlifu <charlifu@amd.com>
Signed-off-by: Aidyn-A <aidyn.b.aitzhan@gmail.com>
Signed-off-by: charlifu <charlifu@amd.com>
Signed-off-by: Dylan Maloy <34420038+dolpm@users.noreply.github.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: charlifu <charlifu@amd.com>
…mentation. (vllm-project#24957)

Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
Signed-off-by: charlifu <charlifu@amd.com>
…d warning. (vllm-project#25010)

Signed-off-by: samzong <samzong.lu@gmail.com>
Signed-off-by: charlifu <charlifu@amd.com>
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Signed-off-by: charlifu <charlifu@amd.com>
…project#24970)

Signed-off-by: samzong <samzong.lu@gmail.com>
Signed-off-by: charlifu <charlifu@amd.com>
@mergify bot added labels Sep 25, 2025: ci/build, deepseek (Related to DeepSeek models), frontend, llama (Related to Llama models), multi-modality (Related to multi-modality (#4194)), new-model (Requests to new models), performance (Performance-related issues), qwen (Related to Qwen models), gpt-oss (Related to GPT-OSS models), structured-output, speculative-decoding, v1
@mergify mergify bot added the tpu Related to Google TPUs label Sep 25, 2025

mergify bot commented Sep 25, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @charlifu.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 25, 2025
@ProExpertProg (Collaborator) commented:
Thanks for submitting this, this is really exciting!

I'm currently overhauling custom op matching in #24604. We also recently added a torch implementation of group quant; could you compare its performance with AITER? Could you also compare the performance of the fused AITER kernel against the fused torch.compile kernel for rmsnorm + quant? Happy to help out with instructions, but overall:
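The comparison requested above could be scripted with a small timing harness. This is a hypothetical sketch: the candidate callables would be the real torch and AITER kernels on a ROCm machine, and for GPU kernels you would additionally need `torch.cuda.synchronize()` before reading the clock so queued work is actually finished when timed.

```python
import time

def bench(fn, *args, warmup=3, iters=10):
    """Return mean wall-clock seconds per call after a few warmup calls."""
    for _ in range(warmup):
        fn(*args)  # warmup: trigger compilation/caching outside the timed loop
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - start) / iters

# Hypothetical usage, comparing two stand-in implementations:
#   t_torch = bench(torch_group_quant, x)
#   t_aiter = bench(aiter_group_quant, x)
#   print(f"torch: {t_torch:.3e}s  aiter: {t_aiter:.3e}s")
```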

@charlifu charlifu closed this Sep 25, 2025
@github-project-automation github-project-automation bot moved this from To Triage to Done in gpt-oss Issues & Enhancements Sep 25, 2025
@charlifu charlifu deleted the amd/aiter_fusion_pass branch September 25, 2025 16:44
@charlifu (Contributor, Author) commented:

New PR #25693

@charlifu charlifu changed the title [Rocm][torch.compile] Adding layernorm + fp8 block quant and silu + fp8 block quant for Aiter New PR number #25693 Sep 25, 2025
Labels: ci/build, deepseek (Related to DeepSeek models), documentation (Improvements or additions to documentation), frontend, gpt-oss (Related to GPT-OSS models), kv-connector, llama (Related to Llama models), multi-modality (Related to multi-modality (#4194)), needs-rebase, new-model (Requests to new models), performance (Performance-related issues), qwen (Related to Qwen models), rocm (Related to AMD ROCm), speculative-decoding, structured-output, tool-calling, tpu (Related to Google TPUs), v1
Projects
Status: Done