[Model] Move `vision_feature_select_strategy` into `resolve_visual_encoder_outputs` #25938
…coder_outputs` Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Code Review
This pull request is a nice refactoring that centralizes the `vision_feature_select_strategy` logic into the `resolve_visual_encoder_outputs` utility function. This successfully removes duplicated code across several vision models, improving maintainability. The changes are consistent and well-applied across all relevant files.
However, I've found a critical issue in the `resolve_visual_encoder_outputs` function in `vllm/model_executor/models/vision.py`. The logic for applying `post_layer_norm` will cause a runtime error, and the condition to check if the last layer is being used is also incorrect. I've provided a detailed comment with a suggested fix for this.
vllm/model_executor/models/vision.py
Outdated
uses_last_layer = select_layers[-1] in (len(hs_pool) - 1, -1)
if post_layer_norm is not None and uses_last_layer:
    hs_pool[-1] = post_layer_norm(encoder_outputs)
There appear to be two issues in this block of code:
- The condition to check if the last layer is being used seems incorrect. `uses_last_layer` is checked against `len(hs_pool) - 1`, which is `len(select_layers) - 1`. This doesn't seem to correctly identify if the last layer of the encoder is being used. It should probably check against `max_possible_layers`, for example: `select_layers[-1] in (max_possible_layers - 1, -1)`.
- `post_layer_norm` is being called with `encoder_outputs`, which is a list of tensors when `select_layers` is provided. This will cause a runtime error. It should be called with the last hidden state tensor, which is `encoder_outputs[-1]`.
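To make the first point concrete, here is a small standalone sketch of the two membership checks. The helper names and the layer counts are hypothetical and chosen only for illustration; it also assumes `max_possible_layers` counts every hidden state the full encoder can produce. It is not the code in `vision.py`.

```python
def uses_last_layer_by_pool(select_layers, hs_pool_len):
    # Check from the outdated snippet: compares the last requested
    # index against the size of the *selected* pool.
    return select_layers[-1] in (hs_pool_len - 1, -1)


def uses_last_layer_by_depth(select_layers, max_possible_layers):
    # Suggested check: compares against the depth of the full encoder.
    return select_layers[-1] in (max_possible_layers - 1, -1)


max_possible_layers = 24  # illustrative value, not taken from any model

# Case 1: the last layer IS requested by its positive index.
select_layers = [3, 23]
missed = uses_last_layer_by_pool(select_layers, len(select_layers))
caught = uses_last_layer_by_depth(select_layers, max_possible_layers)

# Case 2: the last layer is NOT requested, but the pool has two entries.
early_layers = [0, 1]
false_positive = uses_last_layer_by_pool(early_layers, len(early_layers))
correct_negative = uses_last_layer_by_depth(early_layers, max_possible_layers)

print(missed, caught, false_positive, correct_negative)
```

Under these assumptions the pool-based check misses the last layer in the first case and wrongly reports it in the second, while the depth-based check handles both.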
Here is a suggested fix for both issues:
- uses_last_layer = select_layers[-1] in (len(hs_pool) - 1, -1)
- if post_layer_norm is not None and uses_last_layer:
-     hs_pool[-1] = post_layer_norm(encoder_outputs)
+ uses_last_layer = select_layers[-1] in (max_possible_layers - 1, -1)
+ if post_layer_norm is not None and uses_last_layer:
+     hs_pool[-1] = post_layer_norm(encoder_outputs[-1])
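The effect of passing `encoder_outputs[-1]` rather than the whole list can be sketched without any tensor library. In the toy version below, plain lists of floats stand in for tensors, `toy_norm` is a hypothetical stand-in for `post_layer_norm`, and `select_and_norm` is an illustrative helper that assumes every encoder layer is loaded; none of this is the actual vLLM implementation.

```python
def toy_norm(x):
    # Stand-in for a post layer norm: centers ONE vector of floats.
    mean = sum(x) / len(x)
    return [v - mean for v in x]


def select_and_norm(encoder_outputs, select_layers, max_possible_layers,
                    post_layer_norm=None):
    # Pick the requested hidden states (assumes all layers are present).
    hs_pool = [encoder_outputs[i] for i in select_layers]
    uses_last_layer = select_layers[-1] in (max_possible_layers - 1, -1)
    if post_layer_norm is not None and uses_last_layer:
        # Normalize only the final hidden state, not the whole list.
        hs_pool[-1] = post_layer_norm(encoder_outputs[-1])
    return hs_pool


encoder_outputs = [[1.0, 3.0], [2.0, 6.0], [10.0, 20.0]]

with_last = select_and_norm(encoder_outputs, [0, -1], 3, toy_norm)
without_last = select_and_norm(encoder_outputs, [0, 1], 3, toy_norm)

# Passing the whole list to the norm fails, mirroring the reported bug.
try:
    toy_norm(encoder_outputs)
    list_call_failed = False
except TypeError:
    list_call_failed = True
```

Here the norm runs only when the final hidden state is among the selected layers, and the other selected features are returned untouched.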
These suggestions look reasonable, so I have applied them. cc @alex-jw-brooks
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
LGTM
46f187a to 88d9dd3
…coder_outputs` (vllm-project#25938) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
…coder_outputs` (#25938) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Signed-off-by: yewentao256 <zhyanwentao@126.com>
Purpose
Clean up some duplicate code across models that use common vision encoders. This also avoids applying layernorm on the features that are not selected.
Also, clean up the signature of `resolve_visual_encoder_outputs`.
Test Plan
Unblock all multimodal tests
Test Result