[Attention][Async] Eliminate seq_lens_cpu in FlashAttention metadata building with DCP > 1
#29449
Conversation
Code Review
This pull request aims to improve performance by eliminating several GPU-to-CPU synchronization points in the FlashAttention metadata building process, especially when Decode Context Parallelism (DCP) is active. The changes correctly replace CPU-side tensor operations with their GPU-side equivalents and avoid a costly .item() call by computing a safe upper bound for the maximum sequence length.
However, I've identified a critical issue in the calculation of this upper bound (max_dcp_context_kv_len). The current formula is only correct when cp_kv_cache_interleave_size is 1. For other values, it can underestimate the required buffer size, potentially leading to out-of-bounds memory access. I have provided a detailed comment with a corrected formula to fix this bug.
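To make that concrete, here is a minimal sketch of the kind of chunk-aware bound the review describes (the function name and exact formula are a reconstruction from the comment, not the PR's actual code):

```python
def max_dcp_context_kv_len_bound(
    max_seq_len: int,
    dcp_world_size: int,
    cp_kv_cache_interleave_size: int,
) -> int:
    """Upper bound on the per-rank KV context length under DCP.

    Tokens are distributed round-robin across DCP ranks in contiguous
    chunks of `cp_kv_cache_interleave_size` tokens, so one rank holds at
    most ceil(ceil(S / I) / W) chunks of I tokens each.
    """
    def cdiv(a: int, b: int) -> int:
        return (a + b - 1) // b

    interleave = cp_kv_cache_interleave_size
    num_chunks = cdiv(max_seq_len, interleave)
    max_chunks_per_rank = cdiv(num_chunks, dcp_world_size)
    return max_chunks_per_rank * interleave
```

When `cp_kv_cache_interleave_size == 1` this reduces to `cdiv(max_seq_len, dcp_world_size)`, matching the simpler formula; for larger interleave sizes the chunk rounding keeps the bound from undercounting the partial chunk a single rank may hold.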
Other changes, such as creating tensors directly on the target device in utils.py, are good practice and well-implemented.
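As a sketch of that pattern (illustrative, not the PR's diff):

```python
import torch

data = [3, 17, 42]

# Avoid: allocates on CPU first, then copies to the GPU.
lens = torch.tensor(data).to("cuda")

# Prefer: construct the tensor directly on the target device.
lens = torch.tensor(data, device="cuda")
```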
💡 Codex Review
Here are some automated review suggestions for this pull request.
LucasWilkinson left a comment
LGTM! Thanks for doing this!
Purpose
Currently, when DCP > 1, FlashAttention uses the host-side seq_lens_cpu to compute the DCP KV context lens. This requires that the host and device be synchronized, which interferes with asynchronous speculative decoding. This PR modifies the logic to use only device-side tensors, and employs a safe upper bound for max_seq_len.
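A rough sketch of the sync-free pattern (variable names and the exact bound here are illustrative assumptions, not the PR's actual code):

```python
import torch

dcp_world_size = 4    # assumed DCP group size
max_model_len = 8192  # host-side scalar, known without touching the GPU

seq_lens = torch.randint(1, max_model_len, (256,), device="cuda")

# Before: a host-side reduction forces a GPU->CPU sync point.
#   max_seq_len = int(seq_lens_cpu.max().item())

# After: keep the per-request context lens on device...
dcp_context_kv_lens = (seq_lens + dcp_world_size - 1) // dcp_world_size

# ...and size buffers with a safe upper bound computed purely on the host.
max_seq_len_bound = (max_model_len + dcp_world_size - 1) // dcp_world_size
```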
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.