[MM][Perf] Minor Optimization on Qwen3-VL fast_pos_embed_interpolate
#25337
Conversation
Code Review

This pull request introduces a performance optimization to the `fast_pos_embed_interpolate` method in `vllm/model_executor/models/qwen3_vl.py`. The changes refactor the method to perform computations on the GPU using vectorized PyTorch operations, avoiding expensive list manipulations and CPU-GPU data transfers. A constant `num_grid_per_side` is now pre-calculated in the `__init__` method to avoid repeated computation. The new implementation is more efficient and readable, leveraging batched tensor operations for embedding lookups, which should yield the performance improvements shown in the PR description. The logic appears correct and functionally equivalent to the previous implementation. I have no high or critical severity comments on these changes.
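The review above can be illustrated with a minimal sketch of vectorized bilinear interpolation of a learned 2D position-embedding grid. This is not vLLM's actual implementation; the function name and parameters (`pos_embed`, `num_grid_per_side`, target sizes) are illustrative, and it only shows the general technique of replacing per-position Python loops with batched tensor indexing:

```python
# Hedged sketch: batched bilinear interpolation of grid position embeddings.
# Illustrative only -- not the exact code from qwen3_vl.py.
import torch

def fast_pos_embed_interpolate(pos_embed: torch.Tensor,
                               num_grid_per_side: int,
                               tgt_h: int, tgt_w: int) -> torch.Tensor:
    """pos_embed: (num_grid_per_side**2, dim) learned grid embeddings."""
    dim = pos_embed.shape[1]
    # Fractional source coordinates for every target grid position.
    ys = torch.linspace(0, num_grid_per_side - 1, tgt_h)
    xs = torch.linspace(0, num_grid_per_side - 1, tgt_w)
    y0, x0 = ys.floor().long(), xs.floor().long()
    y1 = (y0 + 1).clamp(max=num_grid_per_side - 1)
    x1 = (x0 + 1).clamp(max=num_grid_per_side - 1)
    wy = (ys - y0).unsqueeze(1)   # (tgt_h, 1) vertical blend weights
    wx = (xs - x0).unsqueeze(0)   # (1, tgt_w) horizontal blend weights

    grid = pos_embed.view(num_grid_per_side, num_grid_per_side, dim)
    # Gather all four neighboring corners in batched ops, no Python loops.
    top = grid[y0][:, x0] * (1 - wx)[..., None] + grid[y0][:, x1] * wx[..., None]
    bot = grid[y1][:, x0] * (1 - wx)[..., None] + grid[y1][:, x1] * wx[..., None]
    out = top * (1 - wy)[..., None] + bot * wy[..., None]
    return out.view(tgt_h * tgt_w, dim)
```

When the target size equals the source grid size, the interpolation weights collapse to identity and the original embeddings are returned unchanged, which is a convenient sanity check for this kind of refactor.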
vllm-project#25337) Signed-off-by: Roger Wang <hey@rogerw.io>
vllm-project#25337) Signed-off-by: Roger Wang <hey@rogerw.io> Signed-off-by: charlifu <charlifu@amd.com>
#25337) Signed-off-by: Roger Wang <hey@rogerw.io> Signed-off-by: yewentao256 <zhyanwentao@126.com>
Purpose
Test Plan
10 QPS of VisionArena on Qwen3-VL 4B on A100
Test Result
Main
This branch
MMMU matched
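The performance gap the PR targets comes from replacing per-element list manipulation with a single batched gather. Below is a hedged microbenchmark sketch of that general pattern; the table and index sizes are illustrative and are not the PR's actual benchmark (which used 10 QPS of VisionArena on an A100):

```python
# Hedged sketch: why a batched embedding lookup beats a Python loop.
# Sizes are illustrative, not taken from Qwen3-VL.
import time
import torch

def lookup_looped(table: torch.Tensor, idx: torch.Tensor) -> torch.Tensor:
    # One small op per index; on GPU each would be a separate kernel launch.
    return torch.stack([table[i] for i in idx.tolist()])

def lookup_batched(table: torch.Tensor, idx: torch.Tensor) -> torch.Tensor:
    # Single vectorized gather via advanced indexing.
    return table[idx]

table = torch.randn(2304, 1152)          # grid positions x hidden dim
idx = torch.randint(0, 2304, (4096,))    # flattened target positions

t0 = time.perf_counter()
a = lookup_looped(table, idx)
t1 = time.perf_counter()
b = lookup_batched(table, idx)
t2 = time.perf_counter()

assert torch.equal(a, b)  # identical results, very different cost
print(f"looped: {t1 - t0:.4f}s, batched: {t2 - t1:.4f}s")
```

The same idea extends to the interpolation weights themselves: computing them as tensors keeps everything on the device and avoids CPU-GPU round trips.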
Essential Elements of an Effective PR Description Checklist
Update `supported_models.md` and `examples` for a new model.