[MM] Optimize memory profiling for scattered multimodal embeddings #25810
Conversation
Code Review
This pull request addresses a potential out-of-memory issue during memory profiling for multimodal models. The issue arises when the number of tokens produced by the multimodal encoder is smaller than the allocated budget in the encoder cache, which can happen when special tokens are injected around the embeddings. The fix pads the dummy encoder outputs to match the encoder_budget during the profiling run, ensuring that memory profiling accounts for the maximum possible size of the embeddings in the cache and thus preventing OOM errors in production. The implementation is straightforward and handles the padding correctly. I've reviewed the changes and they look good; I don't have any critical or high-severity comments.
LGTM! As long as we can make sure the encoder_budget can accommodate the padded/scattered shape (which should be addressed in #25557).
…25810) Signed-off-by: Roger Wang <hey@rogerw.io> Signed-off-by: simon-mo <simon.mo@hey.com>
…llm-project#25810) Signed-off-by: Roger Wang <hey@rogerw.io> Signed-off-by: baonudesifeizhai <baonudesifeizhai@gmail.com>
…llm-project#25810) Signed-off-by: Roger Wang <hey@rogerw.io>
…25810) Signed-off-by: Roger Wang <hey@rogerw.io> Signed-off-by: yewentao256 <zhyanwentao@126.com>
Purpose
Previously, we assumed that the maximum number of multimodal tokens is equal (or close) to the number of embeddings produced by the multimodal encoder. This is no longer true now that more models inject a large number of special tokens in between multimodal embeddings, which poses a challenge because ViT memory profiling was coupled with encoder cache memory profiling.
This PR fixes this issue and therefore eliminates the risk of OOM when the encoder output size does not match the maximum number of multimodal tokens.
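To illustrate the idea behind the fix, here is a minimal sketch of padding dummy encoder outputs up to the encoder cache budget during profiling. The function and variable names are hypothetical and do not reflect vLLM's actual internal API; it only demonstrates the padding technique described above.

```python
import torch


def pad_dummy_encoder_outputs(
    dummy_outputs: list[torch.Tensor], encoder_budget: int
) -> list[torch.Tensor]:
    """Pad each dummy encoder output along the token dimension so it
    occupies the full encoder cache budget during the profiling run.

    This is an illustrative sketch (hypothetical names), showing how
    profiling can account for the worst-case cache footprint even when
    the encoder emits fewer tokens than the budget.
    """
    padded = []
    for emb in dummy_outputs:
        num_tokens, hidden_size = emb.shape
        if num_tokens < encoder_budget:
            # Zero-pad so the profiled allocation matches the maximum
            # possible embedding size in the encoder cache.
            pad = emb.new_zeros(encoder_budget - num_tokens, hidden_size)
            emb = torch.cat([emb, pad], dim=0)
        padded.append(emb)
    return padded
```

With this padding in place, the profiling run allocates as much memory as the encoder cache could ever hold, so a production workload that fills the budget cannot exceed the profiled peak.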
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.