
Conversation

@bringlein (Contributor) commented Sep 11, 2025

Purpose

This PR splits the triton_attn backend of V1 into two backends: a Triton-only, platform-independent triton_attn and a ROCm-specific rocm_attn that includes the aiter kernels.
This makes both backends easier to maintain. Changes to the vLLM-internal Triton backend also no longer need to ensure full compatibility with the external aiter library; for example, the standardization of kv-cache layouts in #21624 was blocked by this (and is solved by this PR as well).
Backend selection is updated to pick the new rocm_attn whenever any of the aiter-specific variables are set (see the sketch below).
Both backends still use the same Triton kernels, but the logic that selects them is now separated.
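
For illustration only, a minimal Python sketch of the selection behaviour described above. The helper name select_attn_backend and the exact environment-variable names are assumptions for the example; the real selection logic lives in the platform code and may differ:

    import os

    def select_attn_backend(is_rocm: bool) -> str:
        """Sketch: route to the ROCm-specific backend when aiter is requested."""
        # Any aiter-specific toggle being set (variable names assumed here)
        # implies the rocm_attn backend on ROCm platforms.
        aiter_requested = any(
            os.environ.get(var, "0") == "1"
            for var in ("VLLM_ROCM_USE_AITER", "VLLM_ROCM_USE_AITER_MHA")
        )
        if is_rocm and aiter_requested:
            return "ROCM_ATTN_VLLM_V1"    # rocm_attn (may use aiter kernels)
        return "TRITON_ATTN_VLLM_V1"      # triton_attn (platform-independent)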

CC: @tdoublep @SageMoore @LucasWilkinson @jvlunteren

Test Plan

Correctness tests (lm_eval on GSM8K) and a manual comparison of the file diff between the old backend and the new, split backends.

Test Result

Correctness

VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1 lm_eval --model vllm --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct --tasks gsm8k --num_fewshot 5 --batch_size auto --limit 500

on main, H100:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.798|±  |0.0180|
|     |       |strict-match    |     5|exact_match|↑  |0.782|±  |0.0185|

with this PR on H100:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.798|±  |0.0180|
|     |       |strict-match    |     5|exact_match|↑  |0.782|±  |0.0185|

on main, MI300:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.796|±  |0.0180|
|     |       |strict-match    |     5|exact_match|↑  |0.776|±  |0.0187|

with this PR on MI300:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.798|±  |0.0180|
|     |       |strict-match    |     5|exact_match|↑  |0.778|±  |0.0186|

To test the "new" ROCm backend on MI300:

VLLM_ATTENTION_BACKEND=ROCM_ATTN_VLLM_V1 lm_eval --model vllm --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct --tasks gsm8k --num_fewshot 5 --batch_size auto --limit 500
....
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.798|±  |0.0180|
|     |       |strict-match    |     5|exact_match|↑  |0.778|±  |0.0186|

git diff

Initially, I renamed triton_attn.py to rocm_attn.py with git mv .... However, GitHub now shows rocm_attn.py as a completely new file, even though it differs from triton_attn.py on main only by renamings.

So, to facilitate the PR review, I created the diff manually:

git show origin/main:vllm/v1/attention/backends/triton_attn.py > tmp.py

diff tmp.py vllm/v1/attention/backends/rocm_attn.py 
9a10
> from vllm import _custom_ops as ops                       
27,31d27
< if current_platform.is_cuda_alike():                        
<     from vllm import _custom_ops as ops
< elif current_platform.is_xpu():                           
<     from vllm._ipex_ops import ipex_ops as ops
<                                                                                                                                           
36c32
< class TritonAttentionMetadata:                                                                                                            
---    
> class RocmAttentionMetadata:                                                                                                              
65,66c61,62
< class TritonAttentionMetadataBuilder(                                                                                                     
<         AttentionMetadataBuilder[TritonAttentionMetadata]):
---                                          
> class RocmAttentionMetadataBuilder(
>         AttentionMetadataBuilder[RocmAttentionMetadata]):
84c80                                                                                                                                       
<     ) -> TritonAttentionMetadata:              
---
>     ) -> RocmAttentionMetadata:
95c91
<               fast_build: bool = False) -> TritonAttentionMetadata:
---
>               fast_build: bool = False) -> RocmAttentionMetadata:
123c119
<         attn_metadata = TritonAttentionMetadata(
---
>         attn_metadata = RocmAttentionMetadata(
141c137
< class TritonAttentionBackend(AttentionBackend):
---
> class RocmAttentionBackend(AttentionBackend):
166c162
<         return "TRITON_ATTN_VLLM_V1"
---
>         return "ROCM_ATTN_VLLM_V1"
169,170c165,166
<     def get_impl_cls() -> type["TritonAttentionImpl"]:
<         return TritonAttentionImpl
---
>     def get_impl_cls() -> type["RocmAttentionImpl"]:
>         return RocmAttentionImpl
174c170
<         return TritonAttentionMetadata
---
>         return RocmAttentionMetadata
192,193c188,189
<     def get_builder_cls() -> type["TritonAttentionMetadataBuilder"]: 
<         return TritonAttentionMetadataBuilder
---
>     def get_builder_cls() -> type["RocmAttentionMetadataBuilder"]:
>         return RocmAttentionMetadataBuilder
205c201
< class TritonAttentionImpl(AttentionImpl):
---
> class RocmAttentionImpl(AttentionImpl):
244c240
<         TritonAttentionBackend.validate_head_size(head_size)
---
>         RocmAttentionBackend.validate_head_size(head_size)
250c246
<                                       "TritonAttentionImpl")
---
>                                       "RocmAttentionImpl")
261c257
<                     "Using aiter unified attention for TritonAttentionImpl")
---
>                     "Using aiter unified attention for RocmAttentionImpl")
267c263
<                     "Using vllm unified attention for TritonAttentionImpl")
---
>                     "Using vllm unified attention for RocmAttentionImpl")
308c304
<                 " for TritonAttentionImpl")
---
>                 " for RocmAttentionImpl")



Signed-off-by: Burkhard Ringlein <ngl@zurich.ibm.com>
@mergify bot added the rocm (Related to AMD ROCm) and v1 labels Sep 11, 2025
@gemini-code-assist (bot) left a comment:

Code Review

This PR splits the triton_attn backend into triton_attn and rocm_attn to facilitate easier maintenance and platform-specific adaptations. The review focuses on identifying potential issues related to correctness and maintainability, particularly in the new rocm_attn.py file and modifications to platforms/rocm.py.

"RocmAttentionImpl")

self.fp8_dtype = current_platform.fp8_dtype()
self.force_prefill_decode_attn = \
A Collaborator commented:

Maybe for simplicity we can remove this variable altogether, so that we don't have two identical unified-attention code paths. The logic (for ROCm) could be just:

    if backend == ROCM_ATTN_VLLM_V1:
        use rocm_attn with split attention (or aiter)
    else:
        use triton_attn with unified attention
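
For illustration, a minimal Python sketch of this proposed simplification (hypothetical helper and return values, not the actual vLLM selection code): only the ROCm backend would take the split-attention/aiter path, so the force_prefill_decode_attn toggle could go away:

    def attention_code_path(backend_name: str, use_aiter: bool) -> str:
        # Proposed rule: the backend name alone decides the code path.
        if backend_name == "ROCM_ATTN_VLLM_V1":
            # ROCm backend: split prefill/decode attention, optionally via aiter.
            return "aiter split attention" if use_aiter else "triton split attention"
        # Default backend: single unified-attention Triton kernel.
        return "triton unified attention"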

@bringlein (Contributor, Author) replied:

Could be a good idea, but maybe as a follow-up PR? I think the focus of this PR should be to make such optimizations easy to add later.

@tdoublep (Member) left a comment:

LGTM

@tdoublep tdoublep enabled auto-merge (squash) September 22, 2025 13:22
@github-actions bot added the ready (ONLY add when PR is ready to merge/full CI is needed) label Sep 22, 2025
@tdoublep tdoublep merged commit 175811e into vllm-project:main Sep 22, 2025
55 checks passed
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
…ckends (vllm-project#24648)

Signed-off-by: Burkhard Ringlein <ngl@zurich.ibm.com>
charlifu pushed a commit to ROCm/vllm that referenced this pull request Sep 25, 2025
…ckends (vllm-project#24648)

Signed-off-by: Burkhard Ringlein <ngl@zurich.ibm.com>
Signed-off-by: charlifu <charlifu@amd.com>
yewentao256 pushed a commit that referenced this pull request Oct 3, 2025
…ckends (#24648)

Signed-off-by: Burkhard Ringlein <ngl@zurich.ibm.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>