Conversation

vadiklyutiy
Contributor

@vadiklyutiy vadiklyutiy commented Sep 17, 2025

Purpose

One of the main performance gaps in the Qwen3-next model is low GPU utilisation in GDN attention.
The low utilisation is caused by several GPU<->host memory transfers, which in turn are triggered by causal_conv1d_fn.
Qwen3-next did not yet support the specific convolution metadata whose purpose is to avoid these memory transfers.

Add conv metadata (similar to mamba2).

+ corrected tensor.tensor -> tensor.Tensor type annotation.
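The idea can be illustrated with a small sketch (function and field names here are illustrative, not the actual vLLM code): the per-batch offsets that the conv kernel consumes are built once, entirely on the device, so causal_conv1d_fn never has to pull sequence lengths back to the host during the forward pass.

```python
# Illustrative sketch, assuming PyTorch; not the actual vLLM implementation.
import torch

def build_conv_metadata(seq_lens: torch.Tensor) -> dict[str, torch.Tensor]:
    """Build the cumulative start offsets a causal-conv1d-style kernel
    needs, keeping everything on seq_lens' device so no GPU<->host
    round-trip is required per step."""
    query_start_loc = torch.zeros(seq_lens.numel() + 1,
                                  dtype=torch.int32,
                                  device=seq_lens.device)
    # Prefix-sum of per-request lengths -> per-request token offsets.
    query_start_loc[1:] = torch.cumsum(seq_lens, dim=0)
    return {"query_start_loc": query_start_loc}

seq_lens = torch.tensor([5, 3, 7], dtype=torch.int32)
meta = build_conv_metadata(seq_lens)
print(meta["query_start_loc"].tolist())  # [0, 5, 8, 15]
```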

Test Result

H200, tp=4

vllm bench serve --backend vllm --model Qwen/Qwen3-Next-80B-A3B-Instruct   --endpoint /v1/completions --dataset-name random --random-input 32768   --random-output 1 --max-concurrency 8 --num-prompt 8

Before

============ Serving Benchmark Result ============
Successful requests:                     8
Maximum request concurrency:             8
Benchmark duration (s):                  4.56
Total input tokens:                      262144
Total generated tokens:                  8
Request throughput (req/s):              1.76
Output token throughput (tok/s):         1.76
Total Token throughput (tok/s):          57510.95
---------------Time to First Token----------------
Mean TTFT (ms):                          2600.46
Median TTFT (ms):                        2601.40
P99 TTFT (ms):                           4511.95
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00
Median TPOT (ms):                        0.00
P99 TPOT (ms):                           0.00
---------------Inter-token Latency----------------
Mean ITL (ms):                           0.00
Median ITL (ms):                         0.00
P99 ITL (ms):                            0.00
==================================================

After

============ Serving Benchmark Result ============
Successful requests:                     8
Maximum request concurrency:             8
Benchmark duration (s):                  3.61
Total input tokens:                      262144
Total generated tokens:                  8
Request throughput (req/s):              2.22
Output token throughput (tok/s):         2.22
Total Token throughput (tok/s):          72623.02
---------------Time to First Token----------------
Mean TTFT (ms):                          2067.73
Median TTFT (ms):                        2068.62
P99 TTFT (ms):                           3570.84
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00
Median TPOT (ms):                        0.00
P99 TPOT (ms):                           0.00
---------------Inter-token Latency----------------
Mean ITL (ms):                           0.00
Median ITL (ms):                         0.00
P99 ITL (ms):                            0.00
==================================================

The speedup is 26%.
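For reference, the quoted figure follows directly from the end-to-end durations (equivalently, the total token throughputs) reported above:

```python
# Speedup derived from the benchmark durations reported above.
before_s, after_s = 4.56, 3.61   # benchmark duration, before / after
speedup = before_s / after_s - 1
print(f"{speedup:.0%}")          # 26%
```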

Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces an optimization for the Qwen3-next model by adding mamba2-style convolution metadata to the GDN attention mechanism. This change aims to reduce GPU-host memory transfers in causal_conv1d_fn, leading to a significant performance improvement as demonstrated by the benchmark results. The implementation correctly extends GDNAttentionMetadata and integrates the metadata preparation and usage within the model's forward pass. The changes are well-targeted and effective. I have one minor suggestion to improve type hint correctness.
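As a rough sketch of the shape of such a change (field names here are hypothetical, not the actual vLLM definitions), the attention metadata gains an optional conv-metadata member that is filled once during metadata preparation and reused by the conv kernel:

```python
# Hypothetical sketch only: the real GDNAttentionMetadata in vLLM has
# different fields. The point is that mamba2-style conv metadata becomes
# an optional member, built once per batch, so causal_conv1d_fn does not
# recompute it (avoiding GPU<->host transfers).
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class GDNAttentionMetadata:
    num_prefill_tokens: int = 0
    num_decode_tokens: int = 0
    # New: conv metadata prepared on-device during metadata building.
    conv_metadata: Optional[dict[str, Any]] = None

meta = GDNAttentionMetadata(num_prefill_tokens=32768,
                            conv_metadata={"query_start_loc": [0, 32768]})
print(meta.conv_metadata["query_start_loc"])  # [0, 32768]
```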

@vadiklyutiy vadiklyutiy changed the title add mamba2 metadata to GDN attn [PERF] Add conv1d metadata to GDN attn Sep 18, 2025
@vadiklyutiy
Contributor Author

Could someone please run the CI?

@sighingnow sighingnow enabled auto-merge (squash) September 18, 2025 12:34
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 18, 2025
@sighingnow sighingnow merged commit 072d7e5 into vllm-project:main Sep 18, 2025
65 checks passed
debroy-rh pushed a commit to debroy-rh/vllm that referenced this pull request Sep 19, 2025
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
charlifu pushed a commit to ROCm/vllm that referenced this pull request Sep 25, 2025
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
Signed-off-by: charlifu <charlifu@amd.com>
Labels
qwen Related to Qwen models ready ONLY add when PR is ready to merge/full CI is needed v1