
Conversation

russellb
Member

This is the same change that was made in #24866. In that PR, it was
pointed out that this code:

    MULTIMODAL_REGISTRY.get_encdec_max_encoder_len(...)

is much slower than getting the same value that's cached on the
scheduler config:

    scheduler_config.max_num_encoder_input_tokens

This PR makes the same change in more spots: the scheduler, the KV cache
manager, and the GPU model runner.

Related to issue #24946.

Signed-off-by: Russell Bryant <rbryant@redhat.com>

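To make the change concrete, here is a minimal before/after sketch of the kind of call site this PR touches. It uses a simplified stand-in for the scheduler config rather than vLLM's real classes; only get_encdec_max_encoder_len and max_num_encoder_input_tokens come from the PR, everything else is illustrative.

    from dataclasses import dataclass


    @dataclass
    class SchedulerConfigStub:
        # In vLLM this attribute is filled in at engine-init time for
        # encoder-decoder models (see the config snippet quoted below).
        max_num_encoder_input_tokens: int


    class CrossAttentionSizingStub:
        """Illustrative call site; the real ones are in the scheduler,
        KV cache manager, and GPU model runner."""

        def __init__(self, scheduler_config: SchedulerConfigStub):
            self.scheduler_config = scheduler_config

        def num_encoder_tokens(self) -> int:
            # Before: a per-call registry lookup, roughly
            #   MULTIMODAL_REGISTRY.get_encdec_max_encoder_len(model_config),
            # which re-derives the value on every call.
            # After: a plain attribute read of the cached value.
            return self.scheduler_config.max_num_encoder_input_tokens
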
Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request aims to optimize performance by replacing a slow function call, get_encdec_max_encoder_len, with a cached value from the scheduler configuration. However, the implementation introduces a critical issue. The cached value used, scheduler_config.max_num_encoder_input_tokens, is an alias for max_num_batched_tokens, which is a scheduling parameter and not the model-specific maximum encoder length. Using this incorrect value for calculating memory usage and allocating the cross-attention KV cache can lead to memory under-allocation and potential out-of-bounds memory access. I have provided comments in each of the affected files recommending that these changes be reverted until a correct caching mechanism for the maximum encoder length is implemented.

Comment on lines +468 to +469
    num_encoder_tokens = \
        self.scheduler_config.max_num_encoder_input_tokens
Contributor


critical

This change replaces the call to get_encdec_max_encoder_len with self.scheduler_config.max_num_encoder_input_tokens. However, max_num_encoder_input_tokens is initialized to max_num_batched_tokens in SchedulerConfig, which is not the correct value for the model's maximum encoder length (e.g., a fixed value like 3000 for Whisper). Using a potentially smaller value from max_num_batched_tokens will lead to under-allocation of the cross-attention KV cache, which can cause out-of-bounds memory access. This is a critical issue. The original call was functionally correct, though slow. I recommend reverting this change until the value is cached correctly in the configuration.

Suggested change

    - num_encoder_tokens = \
    -     self.scheduler_config.max_num_encoder_input_tokens
    + num_encoder_tokens = MULTIMODAL_REGISTRY.\
    +     get_encdec_max_encoder_len(
    +         self.vllm_config.model_config)

@russellb
Member Author


Here is where it gets set to the correct value that we want:

vllm/vllm/config/__init__.py, lines 2788 to 2790 in 218454b:

    elif self.model_config.is_encoder_decoder:
        self.scheduler_config.max_num_encoder_input_tokens = \
            MULTIMODAL_REGISTRY.get_encdec_max_encoder_len(self.model_config)
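
The important detail is the ordering: the expensive registry lookup runs once while the engine config is being finalized, and the hot paths only ever read the cached attribute afterwards. A rough sketch of that init-time caching, assuming simplified class and function names (only max_num_encoder_input_tokens, max_num_batched_tokens, is_encoder_decoder, and get_encdec_max_encoder_len mirror the real code):

    class SchedulerConfigSketch:
        def __init__(self, max_num_batched_tokens: int):
            self.max_num_batched_tokens = max_num_batched_tokens
            # Default: an alias of max_num_batched_tokens, which is what the
            # review bot flagged above.
            self.max_num_encoder_input_tokens = max_num_batched_tokens


    def finalize_config(scheduler_config, model_config, registry) -> None:
        # Runs once while the engine config is finalized: for encoder-decoder
        # models, overwrite the default with the model's true maximum encoder
        # length, so later attribute reads match what the registry call
        # would return.
        if model_config.is_encoder_decoder:
            scheduler_config.max_num_encoder_input_tokens = \
                registry.get_encdec_max_encoder_len(model_config)

Because this runs before any scheduling or profiling, the attribute reads in the scheduler, KV cache manager, and GPU model runner see the corrected value, not the max_num_batched_tokens default.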

@russellb
Member Author

This isn't a huge improvement (about 3% throughput increase), but it's a good step. There's still something more significant limiting throughput that I'm tracking down.

@DarkLight1337 enabled auto-merge (squash) September 16, 2025 19:16
@github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Sep 16, 2025
@vllm-bot merged commit 58d4c70 into vllm-project:main Sep 17, 2025
43 of 48 checks passed
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
Signed-off-by: Russell Bryant <rbryant@redhat.com>
charlifu pushed a commit to ROCm/vllm that referenced this pull request Sep 25, 2025
Signed-off-by: Russell Bryant <rbryant@redhat.com>
Signed-off-by: charlifu <charlifu@amd.com>
Labels: ready, v1