[CLI env var] Add VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH in env variables #25274
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of tests runs, and you can ask your reviewers to trigger select CI tests on top of that. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request introduces a new environment variable, `VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH`, to allow users to configure `max_num_splits` for FlashAttention with CUDA graphs. The changes correctly add the environment variable definition and parsing logic and integrate it into the attention backend. However, there is a logical flaw in how the environment variable is defined and consumed: the current implementation in `vllm/envs.py` causes the check in `vllm/v1/attention/backends/mla/flashattn_mla.py` to always be true, leading to dead code. My review provides suggestions to align the implementation with the existing pattern for optional integer environment variables in the codebase, which will fix the logical issue and improve code clarity and consistency.
@@ -118,6 +118,7 @@
 VLLM_SERVER_DEV_MODE: bool = False
 VLLM_V1_OUTPUT_PROC_CHUNK_SIZE: int = 128
 VLLM_MLA_DISABLE: bool = False
+VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH: int = 16
For consistency with other optional integer environment variables like `VLLM_FLASH_ATTN_VERSION`, it's better to define this as `Optional[int]` and handle the default value in the consumer module (`flashattn_mla.py`). This makes the intent clearer that the variable is optional and has a fallback. This change is related to another suggested change for the lambda function of this environment variable.
-VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH: int = 16
+VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH: Optional[int] = None
vllm/envs.py (Outdated)
"VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH": | ||
lambda: int(os.getenv("VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH", "16")), |
To allow checking whether the environment variable was explicitly set, this lambda should return `None` when the variable is not present. The current implementation always returns an integer, which causes a logical flaw in `flashattn_mla.py`. Using `maybe_convert_int` without a default for `os.getenv` is the standard pattern in this file for optional integer variables like `VLLM_FLASH_ATTN_VERSION`.
"VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH": | |
lambda: int(os.getenv("VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH", "16")), | |
"VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH": | |
lambda: maybe_convert_int(os.getenv("VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH")), |
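To make the suggested pattern concrete, here is a minimal, self-contained sketch; the `maybe_convert_int` helper below is written for illustration and is only assumed to match the real helper in `vllm/envs.py`:

```python
import os
from typing import Optional


def maybe_convert_int(value: Optional[str]) -> Optional[int]:
    # Illustrative helper: keep None as None so callers can distinguish
    # "variable unset" from "variable set"; otherwise parse as int.
    if value is None:
        return None
    return int(value)


def get_max_num_splits() -> Optional[int]:
    # No default is passed to os.getenv, so an unset variable yields None
    # instead of silently becoming 16.
    return maybe_convert_int(
        os.getenv("VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH"))


print(get_max_num_splits())  # None when the variable is not exported
```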
if envs.VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH is not None:
    logger.info_once("Getting flash attention max num splits for "
                     "cuda graph from environment variable, value=%s",
                     envs.VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH)
    self.max_num_splits = envs.VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH
else:
    self.max_num_splits = _DEFAULT_MAX_NUM_SPLITS_FOR_CUDA_GRAPH
There is a logical flaw here. The `if` condition `envs.VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH is not None` will always evaluate to `True` because of how the environment variable is defined in `vllm/envs.py`. The `lambda` function for it always returns an integer (defaulting to 16 if not set), never `None`. This makes the `else` block unreachable (dead code).

To fix this, you should modify `vllm/envs.py` to follow the pattern of other optional integer environment variables. Specifically:

- Change the type hint to `Optional[int] = None`.
- Change the lambda to use `maybe_convert_int(os.getenv(...))` without a default, so it returns `None` if the variable is not set.

With those changes in `vllm/envs.py`, this block of code will work as intended. I've added separate comments in `vllm/envs.py` with the specific suggestions.
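To make the flaw concrete, here is a small standalone sketch (not project code) contrasting the two getter styles when the variable is unset:

```python
import os
from typing import Optional

VAR = "VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH"
os.environ.pop(VAR, None)  # simulate a user who never set the variable


def getter_with_default() -> int:
    # Current style: the "16" default means None can never be returned.
    return int(os.getenv(VAR, "16"))


def getter_optional() -> Optional[int]:
    # Suggested style: unset -> None, so the consumer's fallback branch can run.
    raw = os.getenv(VAR)
    return int(raw) if raw is not None else None


print(getter_with_default())  # 16   -> the `else` branch in flashattn_mla.py is dead code
print(getter_optional())      # None -> the fallback to the module default applies
```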
Thanks for the contribution! Could you also update the non-MLA flash attention backend to use this env var? Regarding Gemini's comments, I think you can get rid of …
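For context, a hedged sketch of what wiring the same variable into the non-MLA FlashAttention backend could look like; the function name and default constant are assumptions for illustration, not the PR's actual diff:

```python
import vllm.envs as envs

# Assumed local fallback; the real constant in the backend may differ.
_DEFAULT_MAX_NUM_SPLITS_FOR_CUDA_GRAPH = 16


def resolve_max_num_splits() -> int:
    """Prefer the user-supplied env var, otherwise use the backend default."""
    value = envs.VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH
    return value if value is not None else _DEFAULT_MAX_NUM_SPLITS_FOR_CUDA_GRAPH
```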
Thanks Matt, updated.
LGTM! Thanks for the contribution!
… users have control over cuda graph max_num_splits in cli level Signed-off-by: qqma <qqma@amazon.com>
…LA flash attention Signed-off-by: qqma <qqma@amazon.com>
Force-pushed from 2755ba8 to 5257c7b
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 5257c7b to 7a418c7
Are the API process count and rank related to this PR? Seems like a bad merge.
Is this still needed?
Is this still needed?
Oh, seems like this merged PR https://github.com/vllm-project/vllm/pull/23717/files is included in mine somehow. Let me try to fix it.
Force-pushed from 551ed9e to 7a4b528
…v variables (vllm-project#25274) Signed-off-by: qqma <qqma@amazon.com> Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Co-authored-by: qqma <qqma@amazon.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
…v variables (vllm-project#25274) Signed-off-by: qqma <qqma@amazon.com> Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Co-authored-by: qqma <qqma@amazon.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> Signed-off-by: charlifu <charlifu@amd.com>
…v variables (#25274) Signed-off-by: qqma <qqma@amazon.com> Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Co-authored-by: qqma <qqma@amazon.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> Signed-off-by: yewentao256 <zhyanwentao@126.com>
Purpose
Add `VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH` to the environment variables so users can control the CUDA graph `max_num_splits` at the CLI level. While applying #23958, I realized that the `_DEFAULT_MAX_NUM_SPLITS_FOR_CUDA_GRAPH` value is copied from flash_attn (code ref) and is noted as something to tune if needed, so it makes sense to surface it to the front end.
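A minimal usage sketch of the new knob (the entry point, model name, and chosen value below are illustrative assumptions, not part of this PR):

```python
import os

# Set the variable before vLLM reads its environment variables,
# e.g. before constructing the engine in this process.
os.environ["VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH"] = "32"

from vllm import LLM  # illustrative offline-inference entry point

llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1")  # illustrative model
outputs = llm.generate(["Hello, my name is"])
print(outputs[0].outputs[0].text)
```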
Test Plan
Tested based off the docker image vllm/vllm-openai:v0.10.2 with this PR.

Test Result
Quality check
Command:
Flash Attention Quality Check on Mixtral-8x7B