Directly get max encoder len from VLLM config in V1 #24866
Conversation
Code Review
This pull request aims to improve performance by optimizing how the maximum encoder length is retrieved. However, the current implementation introduces a critical correctness issue by using an incorrect value, which creates an inconsistency between memory allocation for the KV cache and the metadata used in the attention mechanism. This could lead to out-of-bounds memory access. I've provided a detailed comment explaining the issue and recommending a safer approach.
This change introduces a critical correctness issue by using a value for `max_encoder_len` that is inconsistent with how the cross-attention KV cache is allocated. This can lead to out-of-bounds memory access and other runtime errors.

The new implementation uses `scheduler_config.max_num_encoder_input_tokens`, which is derived from `max_num_batched_tokens`. This is a general batching configuration (e.g., 16384), not the model-specific maximum encoder sequence length (e.g., 1500 for Whisper).

Meanwhile, the memory for the cross-attention KV cache is allocated based on the correct, smaller value via `MULTIMODAL_REGISTRY.get_encdec_max_encoder_len` in `CrossAttentionSpec.max_memory_usage_bytes`. Using a much larger `max_seq_len` in the attention metadata here creates a dangerous discrepancy.
While the performance motivation is valid, a safer solution is to compute the correct value once during engine initialization and cache it in the configuration. For now, it's best to revert to the original implementation to avoid memory corruption issues.
```python
def _get_max_encoder_len(vllm_config: "VllmConfig") -> int:
    return MULTIMODAL_REGISTRY.get_encdec_max_encoder_len(
        vllm_config.model_config)
```
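For reference, a minimal sketch of the "compute once during initialization and cache it" suggestion might look like the following; the class name and import paths are assumptions for illustration, not code from the vLLM tree:

```python
from vllm.config import VllmConfig               # assumed import path
from vllm.multimodal import MULTIMODAL_REGISTRY  # assumed import path


class CrossAttentionBuilderSketch:
    """Hypothetical builder that pays the registry lookup once at init."""

    def __init__(self, vllm_config: VllmConfig):
        # Model-specific maximum encoder length (e.g. 1500 for Whisper),
        # computed once instead of on every scheduling/decoding step.
        self._max_encoder_len = MULTIMODAL_REGISTRY.get_encdec_max_encoder_len(
            vllm_config.model_config)

    @property
    def max_encoder_len(self) -> int:
        # Hot-path callers read the cached value; no registry work per step.
        return self._max_encoder_len
```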
LGTM, but best for @russellb @heheda12345 to confirm.
@russellb @heheda12345 Could you please review this PR when you have time? Thanks!
Before: Average latency was 1300ms.
After: Average latency is now 305ms.
The change makes sense to me, but the speedup is much larger than I would expect from this change. What is the expensive operation that leads to such a large speedup?
Thanks for the PR! You will also need to update your commit message(s) to include the `Signed-off-by` line.
Improves performance by getting the max encoder length directly from the initialized `vllm_config.scheduler_config`. This avoids the expensive lookup and re-computation previously done by `MULTIMODAL_REGISTRY.get_encdec_max_encoder_len`. Signed-off-by: Sugar-zsg <952242923@qq.com>
Force-pushed from f53f33b to d992f22.
During decoding, each call to `_prepare_inputs` triggers the cross-attention builder, which in turn executes `MULTIMODAL_REGISTRY.get_encdec_max_encoder_len`. In my tests, this method takes around 10ms per call, while the decoder's forward computation itself only takes about 2ms.
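For anyone who wants to reproduce that comparison, a rough, hypothetical timing helper could be used as sketched below; the surrounding `vllm_config` object is assumed to be an already-initialized `VllmConfig` for an encoder-decoder model such as Whisper:

```python
import time

from vllm.multimodal import MULTIMODAL_REGISTRY  # assumed import path


def per_call_ms(fn, iters: int = 100) -> float:
    """Average wall-clock time of fn() in milliseconds."""
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) * 1000 / iters


# Hypothetical usage, assuming `vllm_config` is already constructed:
# registry_ms = per_call_ms(
#     lambda: MULTIMODAL_REGISTRY.get_encdec_max_encoder_len(
#         vllm_config.model_config))
# cached_ms = per_call_ms(
#     lambda: vllm_config.scheduler_config.max_num_encoder_input_tokens)
# print(f"registry: {registry_ms:.2f} ms/call, cached: {cached_ms:.4f} ms/call")
```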
@heheda12345 @DarkLight1337 These have been resolved. Please check when you have a chance.
Thanks!
I confirmed that this value in the `scheduler_config` is initialized from the same code in the multimodal registry:
```python
self.scheduler_config.max_num_encoder_input_tokens = \
    MULTIMODAL_REGISTRY.get_encdec_max_encoder_len(self.model_config)
```
I haven't enabled auto-merge yet in case you wanted to look again @heheda12345
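Given that initialization, the fast path this PR switches the helper to is presumably just an attribute read; a minimal sketch of the idea, not the exact diff:

```python
def _get_max_encoder_len(vllm_config: "VllmConfig") -> int:
    # Reads the value that engine initialization already cached on the
    # scheduler config, instead of recomputing it through the multimodal
    # registry on every _prepare_inputs call.
    return vllm_config.scheduler_config.max_num_encoder_input_tokens
```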
This is the same change that was made in vllm-project#24866. In that PR, it was pointed out that this code: `MULTIMODAL_REGISTRY.get_encdec_max_encoder_len(...)` is much slower than getting the same value that's cached on the scheduler config: `scheduler_config.max_num_encoder_input_tokens`. This PR makes the change in more spots: the scheduler, KV cache manager, and GPU model runner. Related to issue vllm-project#24946.

Signed-off-by: Russell Bryant <rbryant@redhat.com>
Related PR making the same change in some other areas: #24989
Improves performance by getting the max encoder length directly from the initialized `vllm_config.scheduler_config`. This avoids the expensive lookup and re-computation previously done by `MULTIMODAL_REGISTRY.get_encdec_max_encoder_len`.

Test Results: average latency dropped from 1300ms (before) to 305ms (after).