[Attention][Async] Eliminate seq_lens_cpu in FlashAttention metadata building with DCP > 1
#29449
Conversation
Code Review
This pull request aims to improve performance by eliminating several GPU-to-CPU synchronization points in the FlashAttention metadata building process, especially when Decode Context Parallelism (DCP) is active. The changes correctly replace CPU-side tensor operations with their GPU-side equivalents and avoid a costly .item() call by computing a safe upper bound for the maximum sequence length.
However, I've identified a critical issue in the calculation of this upper bound (max_dcp_context_kv_len). The current formula is only correct when cp_kv_cache_interleave_size is 1. For other values, it can underestimate the required buffer size, potentially leading to out-of-bounds memory access. I have provided a detailed comment with a corrected formula to fix this bug.
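To make that concrete, here is a minimal sketch of the kind of chunk-aware bound the review describes (the function name and exact formula are a reconstruction from the comment, not the PR's actual code):

```python
def max_dcp_context_kv_len_bound(
    max_seq_len: int,
    dcp_world_size: int,
    cp_kv_cache_interleave_size: int,
) -> int:
    """Upper bound on the per-rank KV context length under DCP.

    Tokens are distributed round-robin across DCP ranks in contiguous
    chunks of `cp_kv_cache_interleave_size` tokens, so one rank holds at
    most ceil(ceil(S / I) / W) chunks of I tokens each.
    """
    def cdiv(a: int, b: int) -> int:
        return (a + b - 1) // b

    interleave = cp_kv_cache_interleave_size
    num_chunks = cdiv(max_seq_len, interleave)
    max_chunks_per_rank = cdiv(num_chunks, dcp_world_size)
    return max_chunks_per_rank * interleave
```

When `cp_kv_cache_interleave_size == 1` this reduces to `cdiv(max_seq_len, dcp_world_size)`, matching the simpler formula; for larger interleave sizes the chunk rounding keeps the bound from undercounting the partial chunk a single rank may hold.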
Other changes, such as creating tensors directly on the target device in utils.py, are good practice and well-implemented.
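As a sketch of that pattern (illustrative, not the PR's diff):

```python
import torch

data = [3, 17, 42]

# Avoid: allocates on CPU first, then copies to the GPU.
lens = torch.tensor(data).to("cuda")

# Prefer: construct the tensor directly on the target device.
lens = torch.tensor(data, device="cuda")
```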
💡 Codex Review
Here are some automated review suggestions for this pull request.
LucasWilkinson left a comment
LGTM! Thanks for doing this!
Purpose
Currently, when DCP > 1, FlashAttention uses the host-side seq_lens_cpu to compute the DCP KV context lens. This requires that the host and device be synchronized, which interferes with asynchronous speculative decoding. This PR modifies the logic to use only device-side tensors, and employs a safe upper bound for max_seq_len.
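A rough sketch of the sync-free pattern (variable names and the exact bound here are illustrative assumptions, not the PR's actual code):

```python
import torch

dcp_world_size = 4    # assumed DCP group size
max_model_len = 8192  # host-side scalar, known without touching the GPU

seq_lens = torch.randint(1, max_model_len, (256,), device="cuda")

# Before: a host-side reduction forces a GPU->CPU sync point.
#   max_seq_len = int(seq_lens_cpu.max().item())

# After: keep the per-request context lens on device...
dcp_context_kv_lens = (seq_lens + dcp_world_size - 1) // dcp_world_size

# ...and size buffers with a safe upper bound computed purely on the host.
max_seq_len_bound = (max_model_len + dcp_world_size - 1) // dcp_world_size
```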
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.