
Conversation

@lgeiger lgeiger commented Sep 12, 2025

This is a followup to #24443 from @david6666666

When I profiled Qwen2.5-VL, it seems an implicit CUDA sync was still happening during the indexing:

hidden_states = hidden_states[reverse_indices, :]

This is because reverse_indices is now computed on the CPU and needs to be copied to the GPU in a blocking way before the indexing operation can happen.

[Profiler screenshot, 2025-09-12]

This PR copies reverse_indices to the GPU with a non-blocking copy, removing this sync. An alternative would be to call invert_permutation on the window_index GPU tensor, which would run it on the GPU. Since it's a simple indexing operation, it's probably not worth doing on the GPU.
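For illustration, here is a minimal sketch of the pattern described above (pinned memory plus a non-blocking host-to-device copy). This is not the actual vLLM code; the function names are hypothetical, and only the technique matches the PR:

```python
import torch


def invert_permutation_cpu(window_index: torch.Tensor) -> torch.Tensor:
    # Inverting a permutation is cheap on CPU, so there is no need
    # to launch a GPU kernel for it.
    reverse_indices = torch.empty_like(window_index)
    reverse_indices[window_index] = torch.arange(
        window_index.numel(), dtype=window_index.dtype
    )
    return reverse_indices


def reorder(hidden_states: torch.Tensor, window_index: torch.Tensor) -> torch.Tensor:
    reverse_indices = invert_permutation_cpu(window_index)
    if hidden_states.is_cuda:
        # Pinned (page-locked) host memory lets the H2D copy run
        # asynchronously; non_blocking=True then avoids the implicit
        # CUDA sync that a blocking copy would introduce.
        reverse_indices = reverse_indices.pin_memory().to(
            hidden_states.device, non_blocking=True
        )
    return hidden_states[reverse_indices, :]
```

On a CPU tensor the pinning branch is skipped, but the permutation logic is the same: indexing with the inverse permutation restores the original row order.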

I ran end-to-end benchmarks on the lmarena-ai/VisionArena-Chat dataset but didn't see any change in performance, which I don't quite understand. However, the profile clearly shows that the CUDA sync is gone with this change, which is a good thing in general and might be relevant with big multimodal inputs.

Related to #23884, so tagging @ywang96 for review.

Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
@lgeiger lgeiger requested a review from sighingnow as a code owner September 12, 2025 11:08
@mergify mergify bot added the qwen Related to Qwen models label Sep 12, 2025

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a performance optimization for the Qwen2.5-VL model by eliminating an implicit CUDA synchronization. The change correctly uses pinned memory for a CPU-based tensor (reverse_indices) and then performs a non-blocking copy to the GPU. This is a standard and effective technique to improve performance by avoiding stalls in the CUDA stream. The implementation appears correct and directly addresses the issue described. I have not identified any issues of high or critical severity in these changes.

@Isotr0py Isotr0py requested a review from ywang96 September 12, 2025 13:40
@DarkLight1337 DarkLight1337 enabled auto-merge (squash) September 12, 2025 13:45
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 12, 2025

lgeiger commented Sep 12, 2025

Merging main to re-trigger failing CI

@DarkLight1337 DarkLight1337 merged commit b0d1213 into vllm-project:main Sep 12, 2025
43 checks passed
@lgeiger lgeiger deleted the qwen-gpu-sync branch September 12, 2025 16:43
skyloevil pushed a commit to skyloevil/vllm that referenced this pull request Sep 13, 2025
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
dsxsteven pushed a commit to dsxsteven/vllm_splitPR that referenced this pull request Sep 15, 2025
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
bbartels pushed a commit to bbartels/vllm that referenced this pull request Sep 15, 2025
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
Signed-off-by: bbartels <benjamin@bartels.dev>
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>