Conversation

ywang96
Member

@ywang96 ywang96 commented Sep 27, 2025

Purpose

Previously we assumed that the maximum number of multimodal tokens is equal (or close) to the number of embeddings produced by the multimodal encoder. This is no longer true now that more models inject a large number of special tokens in between the multimodal embeddings, which poses a challenge because ViT memory profiling was coupled with encoder cache memory profiling.

This PR fixes this issue and thereby eliminates the risk of OOM when the encoder output size does not match the maximum number of multimodal tokens.
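
To make the mismatch concrete, here is a rough back-of-the-envelope illustration (made-up numbers; the variable names below are illustrative, not actual vLLM identifiers):

```python
# Illustrative numbers only: a model whose processor injects special tokens
# (e.g. image-start/end markers, row separators) around the ViT embeddings.
num_vit_embeddings = 1024      # embeddings actually produced by the encoder
num_special_tokens = 68        # extra tokens injected into the placeholder

# The encoder cache budget is driven by the total multimodal tokens per item...
max_mm_tokens_per_item = num_vit_embeddings + num_special_tokens

# ...but the old profiling run only placed the raw encoder output in the cache,
# so peak memory was profiled for a smaller tensor than production can hold.
unprofiled_tokens = max_mm_tokens_per_item - num_vit_embeddings
print(unprofiled_tokens)  # 68 tokens' worth of cache memory never profiled
```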

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Roger Wang <hey@rogerw.io>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request addresses a potential out-of-memory issue during memory profiling for multimodal models. The issue arises when the number of tokens produced by the multimodal encoder is smaller than the allocated budget in the encoder cache, which can happen when special tokens are injected around the embeddings. The fix correctly pads the dummy encoder outputs to match the encoder_budget during the profiling run. This ensures that memory profiling accounts for the maximum possible size of the embeddings in the cache, thus preventing OOM errors in production. The implementation is straightforward and correctly handles the padding. I've reviewed the changes and they look good. I don't have any critical or high-severity comments.
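
In spirit, the padding step looks roughly like the following (a minimal sketch, not the actual vLLM code; the function name and the per-item padding assumption are illustrative):

```python
import torch

def pad_dummy_encoder_output(
    dummy_output: torch.Tensor,  # [num_encoder_tokens, hidden_size]
    encoder_budget: int,
) -> torch.Tensor:
    """Zero-pad a dummy encoder output so the profiling run caches a tensor as
    large as the encoder cache budget, rather than just the raw ViT output."""
    num_tokens, hidden_size = dummy_output.shape
    if num_tokens >= encoder_budget:
        return dummy_output
    padding = dummy_output.new_zeros(encoder_budget - num_tokens, hidden_size)
    return torch.cat([dummy_output, padding], dim=0)
```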

@ywang96 ywang96 added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 28, 2025
Contributor

@wwl2755 wwl2755 left a comment

LGTM! As long as we can make sure the encoder_budget can accommodate the padded/scattered shape (which should be addressed in #25557).

@ywang96 ywang96 enabled auto-merge (squash) September 28, 2025 00:46
@ywang96 ywang96 merged commit 6931144 into vllm-project:main Sep 28, 2025
45 checks passed
simon-mo pushed a commit that referenced this pull request Sep 28, 2025
…25810)

Signed-off-by: Roger Wang <hey@rogerw.io>
Signed-off-by: simon-mo <simon.mo@hey.com>
baonudesifeizhai pushed a commit to baonudesifeizhai/vllm that referenced this pull request Sep 28, 2025
…llm-project#25810)

Signed-off-by: Roger Wang <hey@rogerw.io>
Signed-off-by: baonudesifeizhai <baonudesifeizhai@gmail.com>
pdasigi pushed a commit to pdasigi/vllm that referenced this pull request Oct 2, 2025
yewentao256 pushed a commit that referenced this pull request Oct 3, 2025
…25810)

Signed-off-by: Roger Wang <hey@rogerw.io>
Signed-off-by: yewentao256 <zhyanwentao@126.com>