[Models] Prevent CUDA sync in Qwen2.5-VL #24741
Merged
This is a follow-up to #24443 from @david6666666.
When I profiled Qwen2.5-VL, it looked like an implicit CUDA sync was still happening during the indexing:
vllm/vllm/model_executor/models/qwen2_5_vl.py
Line 826 in 59d5d2c
This is because `reverse_indices` is now computed on the CPU and needs to be copied to the GPU in a blocking way before the indexing operation can happen. This PR copies `reverse_indices` to the GPU with a non-blocking copy, removing this sync.

An alternative would be to call `invert_permutation` on the `window_index` GPU tensor, which would run the computation on the GPU. Since it's a simple indexing operation, it's probably not worth doing on the GPU.

I ran end-to-end benchmarks on the
`lmarena-ai/VisionArena-Chat` dataset but didn't see any change in performance, which I don't quite understand. However, the profile clearly shows that the CUDA sync is gone with this change, which is a good thing in general and might be relevant with big multimodal inputs.

Related to #23884, so tagging @ywang96 for review.
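For context, the fix boils down to the standard PyTorch pattern for avoiding a host-to-device sync: pin the CPU tensor and pass `non_blocking=True` to the copy. The sketch below is illustrative only, not the actual vLLM code; the variable names mirror the PR, and the fallback to CPU is just so the snippet runs anywhere.

```python
import torch

# Use CUDA when available; the sync being discussed only exists on GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# reverse_indices is computed on the CPU, as in the scenario the PR fixes.
reverse_indices = torch.tensor([1, 3, 0, 2])

# Blocking copy: a plain .to(device) from pageable host memory makes the
# host wait for the transfer, which shows up as an implicit CUDA sync.
blocking = reverse_indices.to(device)

# Non-blocking copy: pinning the host memory allows the H2D transfer to be
# issued asynchronously, so the subsequent indexing op no longer forces a
# sync on the stream. (pin_memory() is only meaningful/available with CUDA.)
if torch.cuda.is_available():
    reverse_indices = reverse_indices.pin_memory()
nonblocking = reverse_indices.to(device, non_blocking=True)

# The indexing operation can now be enqueued without stalling the host.
x = torch.arange(4, device=device)
out = x[nonblocking]
```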
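As a reference for the alternative mentioned above, inverting a permutation is a simple scatter: if `out = x[window_index]`, then the inverse satisfies `inverse[window_index[i]] == i`, and indexing with it undoes the shuffle. The plain-Python sketch below is a stand-in for the real tensor helper, just to show the operation's shape:

```python
def invert_permutation(perm):
    """Return the inverse of a permutation given as a list of indices.

    If permuted = [x[i] for i in perm], then indexing permuted with the
    inverse restores x, because inverse[perm[i]] == i for every i.
    """
    inverse = [0] * len(perm)
    for i, p in enumerate(perm):
        inverse[p] = i
    return inverse

# Example: a window_index permutation and its reverse_indices.
window_index = [2, 0, 3, 1]
reverse_indices = invert_permutation(window_index)  # [1, 3, 0, 2]

# Round trip: permuting and then un-permuting restores the original order.
x = ["a", "b", "c", "d"]
permuted = [x[i] for i in window_index]
restored = [permuted[i] for i in reverse_indices]
assert restored == x
```

Because this is a single cheap pass over a small index tensor, doing it on the CPU and copying the result asynchronously (as this PR does) is a reasonable trade-off versus launching a kernel for it.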