[MM][Perf] Minor Optimization on Qwen3-VL `fast_pos_embed_interpolate` #25337

ywang96 · 2025-09-21T08:27:19Z

Purpose

Test Plan

Serving benchmark on VisionArena at 10 QPS, with Qwen3-VL 4B on an A100.

Test Result

Main

============ Serving Benchmark Result ============
Successful requests:                     1000      
Request rate configured (RPS):           10.00     
Benchmark duration (s):                  101.85    
Total input tokens:                      94327     
Total generated tokens:                  120882    
Request throughput (req/s):              9.82      
Output token throughput (tok/s):         1186.81   
Peak output token throughput (tok/s):    2862.00   
Peak concurrent requests:                133.00    
Total Token throughput (tok/s):          2112.91   
---------------Time to First Token----------------
Mean TTFT (ms):                          229.53    
Median TTFT (ms):                        180.19    
P99 TTFT (ms):                           928.83    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          40.65     
Median TPOT (ms):                        36.29     
P99 TPOT (ms):                           87.93     
---------------Inter-token Latency----------------
Mean ITL (ms):                           39.96     
Median ITL (ms):                         17.36     
P99 ITL (ms):                            186.27    
==================================================

This branch

============ Serving Benchmark Result ============
Successful requests:                     1000      
Request rate configured (RPS):           10.00     
Benchmark duration (s):                  101.66    
Total input tokens:                      94327     
Total generated tokens:                  120735    
Request throughput (req/s):              9.84      
Output token throughput (tok/s):         1187.67   
Peak output token throughput (tok/s):    2310.00   
Peak concurrent requests:                124.00    
Total Token throughput (tok/s):          2115.57   
---------------Time to First Token----------------
Mean TTFT (ms):                          203.78    
Median TTFT (ms):                        162.26    
P99 TTFT (ms):                           848.32    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          36.27     
Median TPOT (ms):                        31.53     
P99 TPOT (ms):                           80.10     
---------------Inter-token Latency----------------
Mean ITL (ms):                           36.00     
Median ITL (ms):                         16.07     
P99 ITL (ms):                            170.49    
==================================================

MMMU results matched between main and this branch.
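For a quick read of the two tables above, the relative change per latency metric can be computed directly from the reported numbers. This is only a convenience sketch: the dictionary names and the `pct_reduction` helper are made up for illustration, and each run was a single pass, so the percentages are indicative rather than statistically rigorous.

```python
# Latencies in ms, copied from the "Main" and "This branch" results above.
main = {
    "Mean TTFT": 229.53, "Median TTFT": 180.19, "P99 TTFT": 928.83,
    "Mean TPOT": 40.65, "Median TPOT": 36.29, "P99 TPOT": 87.93,
    "Mean ITL": 39.96, "Median ITL": 17.36, "P99 ITL": 186.27,
}
branch = {
    "Mean TTFT": 203.78, "Median TTFT": 162.26, "P99 TTFT": 848.32,
    "Mean TPOT": 36.27, "Median TPOT": 31.53, "P99 TPOT": 80.10,
    "Mean ITL": 36.00, "Median ITL": 16.07, "P99 ITL": 170.49,
}


def pct_reduction(before: float, after: float) -> float:
    """Relative reduction in percent; positive means `after` is lower."""
    return (before - after) / before * 100.0


reductions = {name: pct_reduction(main[name], branch[name]) for name in main}

for name, reduction in reductions.items():
    print(f"{name}: {main[name]:.2f} -> {branch[name]:.2f} ms "
          f"({reduction:.1f}% lower)")
```

Every latency metric is lower on the branch, while request throughput is essentially unchanged because both runs are capped by the configured 10 RPS request rate.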

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Roger Wang <hey@rogerw.io>

gemini-code-assist

Code Review

This pull request introduces a performance optimization to the `fast_pos_embed_interpolate` method in `vllm/model_executor/models/qwen3_vl.py`. The changes refactor the method to perform computations on the GPU using vectorized PyTorch operations, avoiding expensive list manipulations and CPU-GPU data transfers. A constant `num_grid_per_side` is now pre-calculated in the `__init__` method to avoid repeated calculations. The new implementation is more efficient and readable, leveraging batched tensor operations for embedding lookups and calculations, which should lead to the performance improvements shown in the PR description. The logic appears correct and functionally equivalent to the previous implementation. I have no high or critical severity comments on these changes.
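The approach described in the review can be illustrated with a minimal sketch. This is not the code merged in this PR: the function name `interpolate_pos_embed`, its signature, and the assumption that the learned table is stored row-major with shape `(num_grid_per_side**2, dim)` are all hypothetical. It only shows how the bilinear corner indices and weights can be built as tensors on the embedding's device and applied with one batched lookup, instead of assembling Python lists on the CPU.

```python
import torch


def interpolate_pos_embed(pos_embed: torch.Tensor, num_grid_per_side: int,
                          h: int, w: int) -> torch.Tensor:
    """Bilinearly resample a (G*G, D) position table to an (h*w, D) grid.

    Hypothetical helper for illustration; G is num_grid_per_side, which the
    caller is assumed to have computed once up front.
    """
    device = pos_embed.device
    g = num_grid_per_side

    # Fractional sample coordinates in the source grid, created on-device.
    h_idxs = torch.linspace(0, g - 1, h, dtype=torch.float32, device=device)
    w_idxs = torch.linspace(0, g - 1, w, dtype=torch.float32, device=device)

    h_floor, w_floor = h_idxs.long(), w_idxs.long()
    h_ceil = torch.clamp(h_floor + 1, max=g - 1)
    w_ceil = torch.clamp(w_floor + 1, max=g - 1)

    dh = (h_idxs - h_floor).unsqueeze(1)  # (h, 1) vertical fraction
    dw = (w_idxs - w_floor).unsqueeze(0)  # (1, w) horizontal fraction

    # Row offsets into the flattened table, broadcast against the columns.
    base_floor = (h_floor * g).unsqueeze(1)
    base_ceil = (h_ceil * g).unsqueeze(1)

    # Four corner indices and their bilinear weights, each of shape (4, h, w).
    indices = torch.stack([
        base_floor + w_floor.unsqueeze(0),
        base_floor + w_ceil.unsqueeze(0),
        base_ceil + w_floor.unsqueeze(0),
        base_ceil + w_ceil.unsqueeze(0),
    ])
    weights = torch.stack([
        (1 - dh) * (1 - dw),
        (1 - dh) * dw,
        dh * (1 - dw),
        dh * dw,
    ])

    # One batched gather for all corners: (4, h*w, D), then a weighted sum.
    corners = pos_embed[indices.reshape(4, -1)]
    weights = weights.reshape(4, -1, 1).to(pos_embed.dtype)
    return (corners * weights).sum(dim=0)


# Usage: resample a 4x4 table of 8-dim embeddings to a 6x10 patch grid.
table = torch.randn(16, 8)
resampled = interpolate_pos_embed(table, num_grid_per_side=4, h=6, w=10)
print(resampled.shape)
```

Keeping the index and weight construction in tensor form means no intermediate Python lists are built per image, which is the kind of overhead the review says the change removes.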

ywang96 added 3 commits September 21, 2025 06:03

add

aa6bd8d

Signed-off-by: Roger Wang <hey@rogerw.io>

update

51ae2dc

Signed-off-by: Roger Wang <hey@rogerw.io>

update

cfb879e

Signed-off-by: Roger Wang <hey@rogerw.io>

ywang96 requested a review from sighingnow as a code owner September 21, 2025 08:27

mergify bot added the qwen label (Related to Qwen models) Sep 21, 2025

gemini-code-assist bot reviewed Sep 21, 2025

cleanup

dde249f

Signed-off-by: Roger Wang <hey@rogerw.io>

ywang96 requested a review from Isotr0py September 21, 2025 08:41

ywang96 mentioned this pull request Sep 21, 2025

[Model] Support Qwen3-VL Model Series #24727

Merged

12 tasks

Isotr0py approved these changes Sep 21, 2025

Isotr0py enabled auto-merge (squash) September 21, 2025 09:17

github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Sep 21, 2025

Isotr0py merged commit 30d0891 into vllm-project:main Sep 21, 2025
59 checks passed

Isotr0py mentioned this pull request Sep 21, 2025

[Perf] Further optimization for Qwen3-VL fast_pos_embed_interpolate #25347

Merged

5 tasks

kingsmad pushed a commit to kingsmad/vllm that referenced this pull request Sep 22, 2025

[MM][Perf] Minor Optimization on Qwen3-VL fast_pos_embed_interpolate (vllm-project#25337)

c3f7ed3

Signed-off-by: Roger Wang <hey@rogerw.io>

FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025

[MM][Perf] Minor Optimization on Qwen3-VL fast_pos_embed_interpolate (vllm-project#25337)

9a90102

Signed-off-by: Roger Wang <hey@rogerw.io>

charlifu pushed a commit to ROCm/vllm that referenced this pull request Sep 25, 2025

[MM][Perf] Minor Optimization on Qwen3-VL fast_pos_embed_interpolate (vllm-project#25337)

aacff96

Signed-off-by: Roger Wang <hey@rogerw.io>
Signed-off-by: charlifu <charlifu@amd.com>

yewentao256 pushed a commit that referenced this pull request Oct 3, 2025

[MM][Perf] Minor Optimization on Qwen3-VL fast_pos_embed_interpolate (#25337)

5fd95c7

Signed-off-by: Roger Wang <hey@rogerw.io>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
