
Conversation

@MatthewBonanni (Contributor) commented Nov 25, 2025

Purpose

Currently, when DCP > 1, FlashAttention uses the host-side seq_lens_cpu to compute the DCP KV context lens. This requires that the host and device be synchronized, which interferes with asynchronous speculative decoding. This PR modifies the logic to use only device-side tensors, and employs a safe upper bound for max_seq_len.
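
For illustration, here is a minimal sketch of the idea (not the code in this PR): the per-rank DCP context KV lengths are computed from the device-resident seq_lens tensor, and the maximum is replaced by a host-side upper bound so no .item() / GPU-to-CPU copy is needed. The function name, arguments, and the round-robin (interleave size 1) sharding assumption are hypothetical.

import torch

def dcp_context_kv_lens_sketch(
    seq_lens: torch.Tensor,    # device tensor, [num_reqs], total KV length per request
    query_lens: torch.Tensor,  # device tensor, [num_reqs], new tokens per request
    max_seq_len: int,          # host-side bound on seq_lens (e.g. max_model_len)
    dcp_rank: int,
    dcp_world_size: int,
) -> tuple[torch.Tensor, int]:
    # Context tokens are the KV entries that precede the current query tokens.
    context_lens = seq_lens - query_lens
    # Round-robin sharding (assumed): rank r holds token i iff i % dcp_world_size == r,
    # i.e. ceil((context_len - r) / dcp_world_size) tokens, clamped at zero.
    per_rank = (
        (context_lens - dcp_rank).clamp(min=0) + dcp_world_size - 1
    ) // dcp_world_size
    # Avoid per_rank.max().item(), which would force a host-device sync;
    # a conservative host-side bound is enough for buffer sizing.
    max_dcp_context_kv_len = (max_seq_len + dcp_world_size - 1) // dcp_world_size
    return per_rank, max_dcp_context_kv_len

The bound may over-allocate slightly, but it never blocks on the device, which is what asynchronous scheduling needs.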

Test Plan

vllm serve deepseek-ai/DeepSeek-R1 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 2}' \
  -dp 8 \
  --enable-expert-parallel \
  -dcp 8 \
  --async-scheduling

Test Result



Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
@gemini-code-assist bot left a comment

Code Review

This pull request aims to improve performance by eliminating several GPU-to-CPU synchronization points in the FlashAttention metadata building process, especially when Decode Context Parallelism (DCP) is active. The changes correctly replace CPU-side tensor operations with their GPU-side equivalents and avoid a costly .item() call by computing a safe upper bound for the maximum sequence length.

However, I've identified a critical issue in the calculation of this upper bound (max_dcp_context_kv_len). The current formula is only correct when cp_kv_cache_interleave_size is 1. For other values, it can underestimate the required buffer size, potentially leading to out-of-bounds memory access. I have provided a detailed comment with a corrected formula to fix this bug.

Other changes, such as creating tensors directly on the target device in utils.py, are good practice and well-implemented.
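
To make the concern concrete, here is a sketch of the two bounds under the assumed block-interleaved sharding, where KV tokens are assigned to DCP ranks in contiguous blocks of cp_kv_cache_interleave_size. The helper names are hypothetical and the actual formula in the PR may differ.

def cdiv(a: int, b: int) -> int:
    return (a + b - 1) // b

def naive_bound(max_seq_len: int, dcp_world_size: int) -> int:
    # Correct only when cp_kv_cache_interleave_size == 1.
    return cdiv(max_seq_len, dcp_world_size)

def block_aware_bound(max_seq_len: int, dcp_world_size: int, interleave: int) -> int:
    # A rank can hold at most cdiv(max_seq_len, interleave * world_size) full blocks.
    return cdiv(max_seq_len, interleave * dcp_world_size) * interleave

# Example: max_seq_len=10, dcp_world_size=4, interleave=4
# naive_bound       -> cdiv(10, 4) = 3
# actual worst case -> rank 0 holds tokens 0..3, i.e. 4 (naive bound is too small)
# block_aware_bound -> cdiv(10, 16) * 4 = 4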

@chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


@MatthewBonanni (Contributor, Author) commented:

cc @LucasWilkinson

@MatthewBonanni changed the title from "[Attention][Async] Eliminate sync in FlashAttention metadata building with DCP > 1" to "[Attention][Async] Eliminate seq_lens_cpu in FlashAttention metadata building with DCP > 1" Nov 25, 2025
@LucasWilkinson (Collaborator) left a comment


LGTM! Thanks for doing this!

@LucasWilkinson added the ready label and removed the v1 label Nov 26, 2025
@mergify bot added the v1 label Nov 26, 2025
@vllm-bot merged commit 7774019 into vllm-project:main Nov 27, 2025
53 of 55 checks passed
@MatthewBonanni deleted the fa_eliminate_cpu branch November 27, 2025 04:15
penfever pushed a commit to mlfoundations/vllm that referenced this pull request Dec 1, 2025
[Attention][Async] Eliminate seq_lens_cpu in FlashAttention metadata building with DCP > 1 (vllm-project#29449)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Signed-off-by: Benjamin Feuer <penfever@gmail.com>

kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025
[Attention][Async] Eliminate seq_lens_cpu in FlashAttention metadata building with DCP > 1 (vllm-project#29449)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>

amd-hhashemi pushed a commit to amd-hhashemi/vllm that referenced this pull request Dec 2, 2025
[Attention][Async] Eliminate seq_lens_cpu in FlashAttention metadata building with DCP > 1 (vllm-project#29449)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
Labels: ready, v1
