Conversation

vadiklyutiy
Contributor

@vadiklyutiy vadiklyutiy commented Sep 17, 2025

Purpose

One of the main performance gaps in the Qwen3-next model is low GPU utilisation in GDN attention.
The low utilisation is caused by several GPU<->host memory transfers, which in turn are triggered by causal_conv1d_fn.
Qwen3-next did not yet support the specific convolution metadata whose purpose is to avoid these memory transfers.

Add conv metadata (similar to mamba2).

+ corrected tensor.tensor -> tensor.Tensor type annotation.
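The idea can be illustrated with a small sketch (function and field names here are illustrative, not the actual vLLM code): the per-batch offsets that the conv kernel consumes are built once, entirely on the device, so causal_conv1d_fn never has to pull sequence lengths back to the host during the forward pass.

```python
# Illustrative sketch, assuming PyTorch; not the actual vLLM implementation.
import torch

def build_conv_metadata(seq_lens: torch.Tensor) -> dict[str, torch.Tensor]:
    """Build the cumulative start offsets a causal-conv1d-style kernel
    needs, keeping everything on seq_lens' device so no GPU<->host
    round-trip is required per step."""
    query_start_loc = torch.zeros(seq_lens.numel() + 1,
                                  dtype=torch.int32,
                                  device=seq_lens.device)
    # Prefix-sum of per-request lengths -> per-request token offsets.
    query_start_loc[1:] = torch.cumsum(seq_lens, dim=0)
    return {"query_start_loc": query_start_loc}

seq_lens = torch.tensor([5, 3, 7], dtype=torch.int32)
meta = build_conv_metadata(seq_lens)
print(meta["query_start_loc"].tolist())  # [0, 5, 8, 15]
```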

Test Result

H200, tp=4

vllm bench serve --backend vllm --model Qwen/Qwen3-Next-80B-A3B-Instruct   --endpoint /v1/completions --dataset-name random --random-input 32768   --random-output 1 --max-concurrency 8 --num-prompt 8

Before

============ Serving Benchmark Result ============
Successful requests:                     8
Maximum request concurrency:             8
Benchmark duration (s):                  4.56
Total input tokens:                      262144
Total generated tokens:                  8
Request throughput (req/s):              1.76
Output token throughput (tok/s):         1.76
Total Token throughput (tok/s):          57510.95
---------------Time to First Token----------------
Mean TTFT (ms):                          2600.46
Median TTFT (ms):                        2601.40
P99 TTFT (ms):                           4511.95
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00
Median TPOT (ms):                        0.00
P99 TPOT (ms):                           0.00
---------------Inter-token Latency----------------
Mean ITL (ms):                           0.00
Median ITL (ms):                         0.00
P99 ITL (ms):                            0.00
==================================================

After

============ Serving Benchmark Result ============
Successful requests:                     8
Maximum request concurrency:             8
Benchmark duration (s):                  3.61
Total input tokens:                      262144
Total generated tokens:                  8
Request throughput (req/s):              2.22
Output token throughput (tok/s):         2.22
Total Token throughput (tok/s):          72623.02
---------------Time to First Token----------------
Mean TTFT (ms):                          2067.73
Median TTFT (ms):                        2068.62
P99 TTFT (ms):                           3570.84
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00
Median TPOT (ms):                        0.00
P99 TPOT (ms):                           0.00
---------------Inter-token Latency----------------
Mean ITL (ms):                           0.00
Median ITL (ms):                         0.00
P99 ITL (ms):                            0.00
==================================================

The speedup is 26%.
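For reference, the quoted figure follows directly from the end-to-end durations (equivalently, the total token throughputs) reported above:

```python
# Speedup derived from the benchmark durations reported above.
before_s, after_s = 4.56, 3.61   # benchmark duration, before / after
speedup = before_s / after_s - 1
print(f"{speedup:.0%}")          # 26%
```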

Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces an optimization for the Qwen3-next model by adding mamba2-style convolution metadata to the GDN attention mechanism. This change aims to reduce GPU-host memory transfers in causal_conv1d_fn, leading to a significant performance improvement as demonstrated by the benchmark results. The implementation correctly extends GDNAttentionMetadata and integrates the metadata preparation and usage within the model's forward pass. The changes are well-targeted and effective. I have one minor suggestion to improve type hint correctness.
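As a rough sketch of the shape of such a change (field names here are hypothetical, not the actual vLLM definitions), the attention metadata gains an optional conv-metadata member that is filled once during metadata preparation and reused by the conv kernel:

```python
# Hypothetical sketch only: the real GDNAttentionMetadata in vLLM has
# different fields. The point is that mamba2-style conv metadata becomes
# an optional member, built once per batch, so causal_conv1d_fn does not
# recompute it (avoiding GPU<->host transfers).
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class GDNAttentionMetadata:
    num_prefill_tokens: int = 0
    num_decode_tokens: int = 0
    # New: conv metadata prepared on-device during metadata building.
    conv_metadata: Optional[dict[str, Any]] = None

meta = GDNAttentionMetadata(num_prefill_tokens=32768,
                            conv_metadata={"query_start_loc": [0, 32768]})
print(meta.conv_metadata["query_start_loc"])  # [0, 32768]
```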

@vadiklyutiy vadiklyutiy changed the title add mamba2 metadata to GDN attn [PERF] Add conv1d metadata to GDN attn Sep 18, 2025
@vadiklyutiy
Contributor Author

Could someone please run the CI?

@sighingnow sighingnow enabled auto-merge (squash) September 18, 2025 12:34
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 18, 2025
@sighingnow sighingnow merged commit 072d7e5 into vllm-project:main Sep 18, 2025
65 checks passed
debroy-rh pushed a commit to debroy-rh/vllm that referenced this pull request Sep 19, 2025
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
charlifu pushed a commit to ROCm/vllm that referenced this pull request Sep 25, 2025
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
Signed-off-by: charlifu <charlifu@amd.com>
Labels
qwen Related to Qwen models ready ONLY add when PR is ready to merge/full CI is needed v1