Conversation

ywang96
Member

@ywang96 ywang96 commented Sep 27, 2025

Purpose

Previously we assumed that the maximum number of multimodal tokens is equal (or close) to the number of embeddings produced by the multimodal encoder. This is no longer true now that more models inject a large number of special tokens in between the multimodal embeddings, which poses a challenge because ViT memory profiling was coupled with encoder cache memory profiling.

This PR fixes this issue and thereby eliminates the risk of OOM when the encoder output size does not match the maximum number of multimodal tokens.
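
To make the mismatch concrete, here is a rough back-of-the-envelope illustration (made-up numbers; the variable names below are illustrative, not actual vLLM identifiers):

```python
# Illustrative numbers only: a model whose processor injects special tokens
# (e.g. image-start/end markers, row separators) around the ViT embeddings.
num_vit_embeddings = 1024      # embeddings actually produced by the encoder
num_special_tokens = 68        # extra tokens injected into the placeholder

# The encoder cache budget is driven by the total multimodal tokens per item...
max_mm_tokens_per_item = num_vit_embeddings + num_special_tokens

# ...but the old profiling run only placed the raw encoder output in the cache,
# so peak memory was profiled for a smaller tensor than production can hold.
unprofiled_tokens = max_mm_tokens_per_item - num_vit_embeddings
print(unprofiled_tokens)  # 68 tokens' worth of cache memory never profiled
```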

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Roger Wang <hey@rogerw.io>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request addresses a potential out-of-memory issue during memory profiling for multimodal models. The issue arises when the number of tokens produced by the multimodal encoder is smaller than the allocated budget in the encoder cache, which can happen when special tokens are injected around the embeddings. The fix correctly pads the dummy encoder outputs to match the encoder_budget during the profiling run. This ensures that memory profiling accounts for the maximum possible size of the embeddings in the cache, thus preventing OOM errors in production. The implementation is straightforward and correctly handles the padding. I've reviewed the changes and they look good. I don't have any critical or high-severity comments.
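
In spirit, the padding step looks roughly like the following (a minimal sketch, not the actual vLLM code; the function name and the per-item padding assumption are illustrative):

```python
import torch

def pad_dummy_encoder_output(
    dummy_output: torch.Tensor,  # [num_encoder_tokens, hidden_size]
    encoder_budget: int,
) -> torch.Tensor:
    """Zero-pad a dummy encoder output so the profiling run caches a tensor as
    large as the encoder cache budget, rather than just the raw ViT output."""
    num_tokens, hidden_size = dummy_output.shape
    if num_tokens >= encoder_budget:
        return dummy_output
    padding = dummy_output.new_zeros(encoder_budget - num_tokens, hidden_size)
    return torch.cat([dummy_output, padding], dim=0)
```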

@ywang96 ywang96 added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 28, 2025
Contributor

@wwl2755 wwl2755 left a comment

LGTM! As long as we can make sure the encoder_budget can accommodate the padded/scattered shape (which should be addressed in #25557).

@ywang96 ywang96 enabled auto-merge (squash) September 28, 2025 00:46
@ywang96 ywang96 merged commit 6931144 into vllm-project:main Sep 28, 2025
45 checks passed
simon-mo pushed a commit that referenced this pull request Sep 28, 2025
…25810)

Signed-off-by: Roger Wang <hey@rogerw.io>
Signed-off-by: simon-mo <simon.mo@hey.com>
baonudesifeizhai pushed a commit to baonudesifeizhai/vllm that referenced this pull request Sep 28, 2025
…llm-project#25810)

Signed-off-by: Roger Wang <hey@rogerw.io>
Signed-off-by: baonudesifeizhai <baonudesifeizhai@gmail.com>
pdasigi pushed a commit to pdasigi/vllm that referenced this pull request Oct 2, 2025
yewentao256 pushed a commit that referenced this pull request Oct 3, 2025
…25810)

Signed-off-by: Roger Wang <hey@rogerw.io>
Signed-off-by: yewentao256 <zhyanwentao@126.com>