Directly get max encoder len from VLLM config in V1 #24866
Conversation
Code Review
This pull request aims to improve performance by optimizing how the maximum encoder length is retrieved. However, the current implementation introduces a critical correctness issue by using an incorrect value, which creates an inconsistency between memory allocation for the KV cache and the metadata used in the attention mechanism. This could lead to out-of-bounds memory access. I've provided a detailed comment explaining the issue and recommending a safer approach.
This change introduces a critical correctness issue by using a value for `max_encoder_len` that is inconsistent with how the cross-attention KV cache is allocated. This can lead to out-of-bounds memory access and other runtime errors.

The new implementation uses `scheduler_config.max_num_encoder_input_tokens`, which is derived from `max_num_batched_tokens`. This is a general batching configuration (e.g., 16384), not the model-specific maximum encoder sequence length (e.g., 1500 for Whisper).

Meanwhile, the memory for the cross-attention KV cache is allocated based on the correct, smaller value via `MULTIMODAL_REGISTRY.get_encdec_max_encoder_len` in `CrossAttentionSpec.max_memory_usage_bytes`. Using a much larger `max_seq_len` in the attention metadata here creates a dangerous discrepancy.
While the performance motivation is valid, a safer solution is to compute the correct value once during engine initialization and cache it in the configuration. For now, it's best to revert to the original implementation to avoid memory corruption issues.
```python
def _get_max_encoder_len(vllm_config: "VllmConfig") -> int:
    return MULTIMODAL_REGISTRY.get_encdec_max_encoder_len(
        vllm_config.model_config)
```
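For reference, a minimal sketch of the "compute once during initialization and cache it" suggestion might look like the following; the class name and import paths are assumptions for illustration, not code from the vLLM tree:

```python
from vllm.config import VllmConfig               # assumed import path
from vllm.multimodal import MULTIMODAL_REGISTRY  # assumed import path


class CrossAttentionBuilderSketch:
    """Hypothetical builder that pays the registry lookup once at init."""

    def __init__(self, vllm_config: VllmConfig):
        # Model-specific maximum encoder length (e.g. 1500 for Whisper),
        # computed once instead of on every scheduling/decoding step.
        self._max_encoder_len = MULTIMODAL_REGISTRY.get_encdec_max_encoder_len(
            vllm_config.model_config)

    @property
    def max_encoder_len(self) -> int:
        # Hot-path callers read the cached value; no registry work per step.
        return self._max_encoder_len
```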
LGTM, but best for @russellb @heheda12345 to confirm.
@russellb @heheda12345 Could you please review this PR when you have time? Thanks!
Before: Average latency was 1300ms.
After: Average latency is now 305ms.
The change makes sense to me, but the speedup is much larger than I would expect from this change. What is the expensive operation that leads to such a large speedup?
Thanks for the PR! You will also need to update your commit message(s) to include the `Signed-off-by` line.
Improves performance by getting the max encoder length directly from the initialized `vllm_config.scheduler_config`. This avoids the expensive lookup and re-computation previously done by `MULTIMODAL_REGISTRY.get_encdec_max_encoder_len`. Signed-off-by: Sugar-zsg <952242923@qq.com>
Force-pushed from f53f33b to d992f22.
During decoding, each call to `_prepare_inputs` triggers the cross-attention builder, which in turn executes `MULTIMODAL_REGISTRY.get_encdec_max_encoder_len`. In my tests, this method takes around 10ms per call, while the decoder's forward computation itself only takes about 2ms.
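For anyone who wants to reproduce that comparison, a rough, hypothetical timing helper could be used as sketched below; the surrounding `vllm_config` object is assumed to be an already-initialized `VllmConfig` for an encoder-decoder model such as Whisper:

```python
import time

from vllm.multimodal import MULTIMODAL_REGISTRY  # assumed import path


def per_call_ms(fn, iters: int = 100) -> float:
    """Average wall-clock time of fn() in milliseconds."""
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) * 1000 / iters


# Hypothetical usage, assuming `vllm_config` is already constructed:
# registry_ms = per_call_ms(
#     lambda: MULTIMODAL_REGISTRY.get_encdec_max_encoder_len(
#         vllm_config.model_config))
# cached_ms = per_call_ms(
#     lambda: vllm_config.scheduler_config.max_num_encoder_input_tokens)
# print(f"registry: {registry_ms:.2f} ms/call, cached: {cached_ms:.4f} ms/call")
```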
@heheda12345 @DarkLight1337 These have been resolved. Please check when you have a chance.
Thanks!
I confirmed that this value in the `scheduler_config` is initialized from the same code in the multimodal registry:
```python
self.scheduler_config.max_num_encoder_input_tokens = \
    MULTIMODAL_REGISTRY.get_encdec_max_encoder_len(self.model_config)
```
I haven't enabled auto-merge yet in case you wanted to look again @heheda12345
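Given that initialization, the fast path this PR switches the helper to is presumably just an attribute read; a minimal sketch of the idea, not the exact diff:

```python
def _get_max_encoder_len(vllm_config: "VllmConfig") -> int:
    # Reads the value that engine initialization already cached on the
    # scheduler config, instead of recomputing it through the multimodal
    # registry on every _prepare_inputs call.
    return vllm_config.scheduler_config.max_num_encoder_input_tokens
```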
This is the same change that was made in vllm-project#24866. In that PR, it was pointed out that this code: `MULTIMODAL_REGISTRY.get_encdec_max_encoder_len(...)` is much slower than getting the same value that's cached on the scheduler config: `scheduler_config.max_num_encoder_input_tokens`. This PR makes the change in more spots: the scheduler, KV cache manager, and GPU model runner. Related to issue vllm-project#24946.

Signed-off-by: Russell Bryant <rbryant@redhat.com>
Related PR making the same change in some other areas: #24989
Improves performance by getting the max encoder length directly from the initialized `vllm_config.scheduler_config`. This avoids the expensive lookup and re-computation previously done by `MULTIMODAL_REGISTRY.get_encdec_max_encoder_len`.

Test Results: average latency dropped from 1300ms (before) to 305ms (after).