[Models][Qwen3VL] Speedup fast_pos_embed_interpolate
#26647
Conversation
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
- weights = weights.to(
-     dtype=self.dtype, device=self.device, non_blocking=True
- )
+ weights = weights.to(dtype=self.dtype)
weights will already be on self.device so no need to copy it again.
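A minimal sketch of the pattern in question, with made-up names and shapes rather than the actual Qwen3VL code: because the interpolation weights are built from tensors that were created on `self.device`, only the dtype cast is still required.

```python
import torch

# Illustrative sketch only; names and shapes are not the vLLM implementation.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dtype = torch.float16

# The weights are derived from index tensors created directly on `device`,
# so they already live on the right device.
frac = torch.rand(2, 16, device=device)
weights = torch.stack([1 - frac, frac], dim=0)

# A device copy is therefore redundant; only the dtype cast remains.
weights = weights.to(dtype=dtype)
assert weights.device.type == device.type
```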
Code Review
This pull request introduces several well-reasoned optimizations to the fast_pos_embed_interpolate function in Qwen3VL. The changes, including refactoring weight calculations, vectorizing index computations, and using more efficient PyTorch operations like .sum() instead of unbind() followed by manual addition, are mathematically sound and contribute to the reported 11% performance improvement. The code is now more concise and idiomatic. I've reviewed the changes and found them to be correct and beneficial for performance without sacrificing readability. This is a solid improvement.
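For illustration, a hedged sketch of the `unbind()`-plus-manual-addition versus `.sum()` equivalence mentioned above; the shapes are invented and do not mirror Qwen3VL exactly.

```python
import torch

# Illustrative shapes only: 4 bilinear corners x tokens x hidden size.
embeds = torch.randn(4, 1024, 1152)

# Before: unbind into four tensors and add them manually (several small ops).
e00, e01, e10, e11 = embeds.unbind(dim=0)
out_manual = e00 + e01 + e10 + e11

# After: a single reduction over the corner dimension.
out_sum = embeds.sum(dim=0)

torch.testing.assert_close(out_manual, out_sum)
```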
Thanks for the contribution!
Re: caching t, h, w: I had the same idea when I first cleaned up the code here, that we could build a small cache on CPU for this, but I also wonder whether the host-to-device (h2d) transfer cost is worth the effort.
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
Head branch was pushed to by a user without write access
Thanks for the fast review. I also updated the tiling logic in 24b6717, which slightly improves things further. I updated the benchmarks above.
I doubt that this would actually be a performance win, since the GPU-to-CPU transfer into such a cache would need to be synchronous to be safe, which might hurt performance more. However, I could add a simple cache that keeps the GPU tensors only for the duration of this function, which would at least help requests with multiple images of the same size by preventing re-computation within the loop. @ywang96 let me know what you think, I'm happy to make a PR.
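For what it's worth, a rough sketch of what such a per-call cache could look like, assuming a hypothetical helper `compute_pos_embed(t, h, w)` that returns the interpolated embedding on the GPU (this is not the existing vLLM API):

```python
import torch
from typing import Callable

# Hypothetical sketch: the cache lives only for a single call, so GPU tensors
# are reused across images of the same size but never held after the call.
def interpolate_all(
    grid_thw: list[tuple[int, int, int]],
    compute_pos_embed: Callable[[int, int, int], torch.Tensor],
) -> list[torch.Tensor]:
    cache: dict[tuple[int, int, int], torch.Tensor] = {}
    outputs: list[torch.Tensor] = []
    for thw in grid_thw:
        if thw not in cache:
            cache[thw] = compute_pos_embed(*thw)
        outputs.append(cache[thw])
    return outputs
```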
Purpose
Follow-up to #25337 and #25347.
fast_pos_embed_interpolate launches many small CUDA ops, so this PR slightly simplifies and optimises the implementation.
/cc @ywang96 @Isotr0py
Test Plan & Results
I verified that the new implementation doesn't change the computation.
A quick micro benchmark on an H100 shows a 15% speedup of fast_pos_embed_interpolate, and I don't think it reduces the readability of the code.
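For reference, a rough outline of how such a micro benchmark could be set up with torch.utils.benchmark; the dummy function below is only a stand-in for the real fast_pos_embed_interpolate call, which depends on the loaded Qwen3VL model.

```python
import torch
import torch.utils.benchmark as benchmark

device = "cuda" if torch.cuda.is_available() else "cpu"

def dummy_interpolate() -> torch.Tensor:
    # Stand-in workload; replace with the actual fast_pos_embed_interpolate call.
    x = torch.randn(1024, 1152, device=device)
    return x * 2

timer = benchmark.Timer(
    stmt="fn()",
    globals={"fn": dummy_interpolate},
    label="fast_pos_embed_interpolate (stand-in)",
)
# torch.utils.benchmark synchronises CUDA around the timed region, so the
# measurement reflects kernel execution time rather than just launch overhead.
print(timer.timeit(100))
```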