[Core] Get num_encoder_tokens from scheduler config #24989
Conversation
This is the same change that was made in vllm-project#24866. In that PR, it was pointed out that this code:

```python
MULTIMODAL_REGISTRY.get_encdec_max_encoder_len(...)
```

is much slower than reading the same value, which is cached on the scheduler config:

```python
scheduler_config.max_num_encoder_input_tokens
```

This PR makes the same change in more spots: the scheduler, the KV cache manager, and the GPU model runner.

Related to issue vllm-project#24946.

Signed-off-by: Russell Bryant <rbryant@redhat.com>
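For context, here is a minimal sketch of the caching pattern the PR relies on. `MockRegistry` and `MockSchedulerConfig` are illustrative stand-ins, not vLLM's real classes; the point is simply that the expensive value is computed once at config-build time and read as a plain attribute on hot paths:

```python
# Illustrative sketch only -- these mocks are not vLLM's actual classes.

class MockRegistry:
    def get_encdec_max_encoder_len(self, model_config: dict) -> int:
        # The real vLLM call does non-trivial work per invocation,
        # which is why it is slow inside scheduling loops.
        return model_config["max_encoder_len"]

class MockSchedulerConfig:
    def __init__(self, registry: MockRegistry, model_config: dict):
        # Cache the expensive lookup once, up front.
        self.max_num_encoder_input_tokens = \
            registry.get_encdec_max_encoder_len(model_config)

cfg = MockSchedulerConfig(MockRegistry(), {"max_encoder_len": 3000})
# Hot path: a cheap attribute read instead of a registry call.
num_encoder_tokens = cfg.max_num_encoder_input_tokens
```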
Code Review

This pull request aims to optimize performance by replacing a slow function call, `get_encdec_max_encoder_len`, with a cached value from the scheduler configuration. However, the implementation introduces a critical issue. The cached value used, `scheduler_config.max_num_encoder_input_tokens`, is an alias for `max_num_batched_tokens`, which is a scheduling parameter and not the model-specific maximum encoder length. Using this incorrect value for calculating memory usage and allocating the cross-attention KV cache can lead to memory under-allocation and potential out-of-bounds memory access. I have provided comments in each of the affected files recommending that these changes be reverted until a correct caching mechanism for the maximum encoder length is implemented.
```python
num_encoder_tokens = \
    self.scheduler_config.max_num_encoder_input_tokens
```
This change replaces the call to `get_encdec_max_encoder_len` with `self.scheduler_config.max_num_encoder_input_tokens`. However, `max_num_encoder_input_tokens` is initialized to `max_num_batched_tokens` in `SchedulerConfig`, which is not the correct value for the model's maximum encoder length (e.g., a fixed value like 3000 for Whisper). Using a potentially smaller value from `max_num_batched_tokens` will lead to under-allocation of the cross-attention KV cache, which can cause out-of-bounds memory access. This is a critical issue. The original call was functionally correct, though slow. I recommend reverting this change until the value is cached correctly in the configuration.
```diff
- num_encoder_tokens = \
-     self.scheduler_config.max_num_encoder_input_tokens
+ num_encoder_tokens = MULTIMODAL_REGISTRY.\
+     get_encdec_max_encoder_len(
+         self.vllm_config.model_config)
```
Here is where it gets set to the correct value that we want:
Lines 2788 to 2790 in 218454b:

```python
elif self.model_config.is_encoder_decoder:
    self.scheduler_config.max_num_encoder_input_tokens = \
        MULTIMODAL_REGISTRY.get_encdec_max_encoder_len(self.model_config)
```
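In other words, the alias is only the default; config post-init overwrites it for encoder-decoder models before the scheduler, KV cache manager, or model runner read it. A simplified sketch of that ordering (the names are stand-ins, not the real vLLM objects):

```python
# Sketch of the initialization order the quoted config code implies.

class SketchSchedulerConfig:
    def __init__(self, max_num_batched_tokens: int):
        # Default: aliased to the scheduling limit...
        self.max_num_batched_tokens = max_num_batched_tokens
        self.max_num_encoder_input_tokens = max_num_batched_tokens

def finalize(cfg: SketchSchedulerConfig, is_encoder_decoder: bool,
             encdec_max_len: int) -> None:
    # ...but for encoder-decoder models it is replaced with the model's
    # true maximum encoder length, so later cached reads are correct.
    if is_encoder_decoder:
        cfg.max_num_encoder_input_tokens = encdec_max_len

cfg = SketchSchedulerConfig(max_num_batched_tokens=8192)
finalize(cfg, is_encoder_decoder=True, encdec_max_len=3000)
assert cfg.max_num_encoder_input_tokens == 3000
```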
This isn't a huge improvement (about a 3% throughput increase), but it's a good step. There's still something more significant limiting throughput that I'm tracking down.
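If you want to sanity-check the relative cost yourself, a rough `timeit` harness along these lines would show the gap; `recomputed_lookup` is just a placeholder for the per-call work the registry lookup performs, not vLLM's actual function:

```python
import timeit

class Cfg:
    max_num_encoder_input_tokens = 3000

cfg = Cfg()

def recomputed_lookup() -> int:
    # Placeholder cost standing in for per-call registry work.
    return sum(range(1_000))

n = 100_000
attr = timeit.timeit(lambda: cfg.max_num_encoder_input_tokens, number=n)
call = timeit.timeit(recomputed_lookup, number=n)
print(f"attribute read: {attr:.4f}s  recomputed call: {call:.4f}s")
```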