ywang96 (Member) commented Sep 21, 2025

Purpose

Test Plan

Serving benchmark on VisionArena at 10 QPS, with Qwen3-VL 4B on an A100.

Test Result

Main

============ Serving Benchmark Result ============
Successful requests:                     1000      
Request rate configured (RPS):           10.00     
Benchmark duration (s):                  101.85    
Total input tokens:                      94327     
Total generated tokens:                  120882    
Request throughput (req/s):              9.82      
Output token throughput (tok/s):         1186.81   
Peak output token throughput (tok/s):    2862.00   
Peak concurrent requests:                133.00    
Total Token throughput (tok/s):          2112.91   
---------------Time to First Token----------------
Mean TTFT (ms):                          229.53    
Median TTFT (ms):                        180.19    
P99 TTFT (ms):                           928.83    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          40.65     
Median TPOT (ms):                        36.29     
P99 TPOT (ms):                           87.93     
---------------Inter-token Latency----------------
Mean ITL (ms):                           39.96     
Median ITL (ms):                         17.36     
P99 ITL (ms):                            186.27    
==================================================

This branch

============ Serving Benchmark Result ============
Successful requests:                     1000      
Request rate configured (RPS):           10.00     
Benchmark duration (s):                  101.66    
Total input tokens:                      94327     
Total generated tokens:                  120735    
Request throughput (req/s):              9.84      
Output token throughput (tok/s):         1187.67   
Peak output token throughput (tok/s):    2310.00   
Peak concurrent requests:                124.00    
Total Token throughput (tok/s):          2115.57   
---------------Time to First Token----------------
Mean TTFT (ms):                          203.78    
Median TTFT (ms):                        162.26    
P99 TTFT (ms):                           848.32    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          36.27     
Median TPOT (ms):                        31.53     
P99 TPOT (ms):                           80.10     
---------------Inter-token Latency----------------
Mean ITL (ms):                           36.00     
Median ITL (ms):                         16.07     
P99 ITL (ms):                            170.49    
==================================================
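
To make the two reports easier to compare, the latency figures can be reduced to relative changes. The snippet below is a minimal sketch that only restates the numbers pasted above; the dictionary layout and the helper name `percent_reduction` are illustrative and not part of vLLM or its benchmark tooling.

```python
# Latency figures in milliseconds, copied from the "Main" and
# "This branch" benchmark reports above.
main = {
    "Mean TTFT": 229.53, "Median TTFT": 180.19, "P99 TTFT": 928.83,
    "Mean TPOT": 40.65, "Median TPOT": 36.29, "P99 TPOT": 87.93,
    "Mean ITL": 39.96, "Median ITL": 17.36, "P99 ITL": 186.27,
}
branch = {
    "Mean TTFT": 203.78, "Median TTFT": 162.26, "P99 TTFT": 848.32,
    "Mean TPOT": 36.27, "Median TPOT": 31.53, "P99 TPOT": 80.10,
    "Mean ITL": 36.00, "Median ITL": 16.07, "P99 ITL": 170.49,
}


def percent_reduction(before: float, after: float) -> float:
    """Relative drop from `before` to `after`, in percent (hypothetical helper)."""
    return (before - after) / before * 100.0


# One entry per metric: how much lower the branch is than main.
reductions = {name: percent_reduction(main[name], branch[name]) for name in main}

for name, pct in reductions.items():
    print(f"{name}: {main[name]:.2f} -> {branch[name]:.2f} ms ({pct:.1f}% lower)")
```

On these figures, every latency metric is roughly 7% to 13% lower on the branch, while request throughput is essentially unchanged (9.82 vs. 9.84 req/s). This is a single run per configuration, so small differences should be read with that in mind.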

MMMU results matched.


Signed-off-by: Roger Wang <hey@rogerw.io>
@ywang96 ywang96 requested a review from sighingnow as a code owner September 21, 2025 08:27
@mergify mergify bot added the qwen Related to Qwen models label Sep 21, 2025
gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces a performance optimization to the fast_pos_embed_interpolate method in vllm/model_executor/models/qwen3_vl.py. The changes refactor the method to perform computations on the GPU using vectorized PyTorch operations, avoiding expensive list manipulations and CPU-GPU data transfers. A constant num_grid_per_side is now pre-calculated in the __init__ method to avoid repeated calculations. The new implementation is more efficient and readable, leveraging batched tensor operations for embedding lookups and calculations, which should lead to the performance improvements shown in the PR description. The logic appears correct and functionally equivalent to the previous implementation. I have no high or critical severity comments on these changes.
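
The review above describes replacing list manipulation with batched tensor operations. The following is a rough sketch of that general idea, not the code merged in this PR: it bilinearly interpolates a square table of learned position embeddings to an arbitrary `h x w` grid using only broadcasting and a single gather. The function name `interpolate_pos_embed` and its signature are hypothetical; only `num_grid_per_side` is a name taken from the review, and the real `fast_pos_embed_interpolate` in `vllm/model_executor/models/qwen3_vl.py` may differ in details (for example, handling several images at once).

```python
import torch


def interpolate_pos_embed(
    pos_embed: torch.Tensor, num_grid_per_side: int, h: int, w: int
) -> torch.Tensor:
    """Bilinearly resample a (G*G, D) embedding table to (h*w, D).

    Illustrative sketch only; assumes the table is stored row-major.
    """
    g = num_grid_per_side
    device = pos_embed.device

    # Fractional source coordinates for every target row and column.
    h_pos = torch.linspace(0, g - 1, h, device=device)
    w_pos = torch.linspace(0, g - 1, w, device=device)

    # Neighbouring integer grid lines, clamped at the table border.
    h_lo, w_lo = h_pos.long(), w_pos.long()
    h_hi = torch.clamp(h_lo + 1, max=g - 1)
    w_hi = torch.clamp(w_lo + 1, max=g - 1)

    # Distance from the lower grid line, shaped for broadcasting to (h, w).
    dh = (h_pos - h_lo).unsqueeze(1)
    dw = (w_pos - w_lo).unsqueeze(0)

    # Flat row-major indices of the four corners, each of shape (h, w).
    rows_lo, rows_hi = h_lo.unsqueeze(1) * g, h_hi.unsqueeze(1) * g
    cols_lo, cols_hi = w_lo.unsqueeze(0), w_hi.unsqueeze(0)
    indices = torch.stack(
        [rows_lo + cols_lo, rows_lo + cols_hi, rows_hi + cols_lo, rows_hi + cols_hi]
    ).reshape(4, -1)

    # Matching bilinear weights, shape (4, h*w, 1).
    weights = torch.stack(
        [(1 - dh) * (1 - dw), (1 - dh) * dw, dh * (1 - dw), dh * dw]
    ).reshape(4, -1, 1).to(pos_embed.dtype)

    # One batched lookup for all corners, then a weighted sum over corners.
    return (pos_embed[indices] * weights).sum(dim=0)
```

Because the indices and weights are built on the same device as the table, nothing in this sketch round-trips through Python lists or the CPU, which is the property the review credits for the latency improvement.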

Signed-off-by: Roger Wang <hey@rogerw.io>
@ywang96 ywang96 requested a review from Isotr0py September 21, 2025 08:41
@Isotr0py Isotr0py enabled auto-merge (squash) September 21, 2025 09:17
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 21, 2025
@Isotr0py Isotr0py merged commit 30d0891 into vllm-project:main Sep 21, 2025
59 checks passed
kingsmad pushed a commit to kingsmad/vllm that referenced this pull request Sep 22, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
charlifu pushed a commit to ROCm/vllm that referenced this pull request Sep 25, 2025
vllm-project#25337)

Signed-off-by: Roger Wang <hey@rogerw.io>
Signed-off-by: charlifu <charlifu@amd.com>
yewentao256 pushed a commit that referenced this pull request Oct 3, 2025
#25337)

Signed-off-by: Roger Wang <hey@rogerw.io>
Signed-off-by: yewentao256 <zhyanwentao@126.com>