Skip to content

Conversation

DarkLight1337
Copy link
Member

@DarkLight1337 DarkLight1337 commented Sep 30, 2025

Purpose

Clean up some duplicate code across models that use common vision encoders. This also avoids applying layernorm on the features that are not selected.

Also, clean up the signature of resolve_visual_encoder_outputs.

Test Plan

Unblock all multimodal tests

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

…coder_outputs`

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
@DarkLight1337 DarkLight1337 added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 30, 2025
@mergify mergify bot added the multi-modality Related to multi-modality (#4194) label Sep 30, 2025
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request is a nice refactoring that centralizes the vision_feature_select_strategy logic into the resolve_visual_encoder_outputs utility function. This successfully removes duplicated code across several vision models, improving maintainability. The changes are consistent and well-applied across all relevant files.

However, I've found a critical issue in the resolve_visual_encoder_outputs function in vllm/model_executor/models/vision.py. The logic for applying post_layer_norm will cause a runtime error, and the condition to check if the last layer is being used is also incorrect. I've provided a detailed comment with a suggested fix for this.

Comment on lines 172 to 174
uses_last_layer = select_layers[-1] in (len(hs_pool) - 1, -1)
if post_layer_norm is not None and uses_last_layer:
hs_pool[-1] = post_layer_norm(encoder_outputs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

There appear to be two issues in this block of code:

  1. The condition to check if the last layer is being used seems incorrect. uses_last_layer is checked against len(hs_pool) - 1, which is len(select_layers) - 1. This doesn't seem to correctly identify if the last layer of the encoder is being used. It should probably check against max_possible_layers, for example: select_layers[-1] in (max_possible_layers - 1, -1).

  2. post_layer_norm is being called with encoder_outputs, which is a list of tensors when select_layers is provided. This will cause a runtime error. It should be called with the last hidden state tensor, which is encoder_outputs[-1].

Here is a suggested fix for both issues:

Suggested change
uses_last_layer = select_layers[-1] in (len(hs_pool) - 1, -1)
if post_layer_norm is not None and uses_last_layer:
hs_pool[-1] = post_layer_norm(encoder_outputs)
uses_last_layer = select_layers[-1] in (max_possible_layers - 1, -1)
if post_layer_norm is not None and uses_last_layer:
hs_pool[-1] = post_layer_norm(encoder_outputs[-1])

Copy link
Member Author

@DarkLight1337 DarkLight1337 Sep 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These suggestions look reasonable, so I have applied them. cc @alex-jw-brooks

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Copy link
Member

@Isotr0py Isotr0py left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) September 30, 2025 08:55
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
@DarkLight1337 DarkLight1337 force-pushed the mv-vision-feature-select-strategy branch from 46f187a to 88d9dd3 Compare September 30, 2025 09:37
@DarkLight1337 DarkLight1337 merged commit d7e34b4 into vllm-project:main Sep 30, 2025
52 checks passed
@DarkLight1337 DarkLight1337 deleted the mv-vision-feature-select-strategy branch September 30, 2025 11:24
lkm2835 added a commit to lkm2835/vllm that referenced this pull request Sep 30, 2025
pdasigi pushed a commit to pdasigi/vllm that referenced this pull request Oct 2, 2025
…coder_outputs` (vllm-project#25938)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
yewentao256 pushed a commit that referenced this pull request Oct 3, 2025
…coder_outputs` (#25938)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
multi-modality Related to multi-modality (#4194) ready ONLY add when PR is ready to merge/full CI is needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants