[Core][MultiModalHasher] Don't convert memoryviews to bytes during hashing #24925
Conversation
Code Review
This pull request introduces a performance optimization for hashing multimodal inputs by avoiding intermediate byte conversions and copies. The use of generators to yield bytes or memoryview objects, which are then directly consumed by the blake3 hasher, is a solid approach. The changes are well-reasoned and the performance gains are evident. However, I've identified a pre-existing critical issue in the handling of bfloat16 tensors that will cause a runtime error. This should be addressed to ensure the stability of the hashing mechanism.
vllm/multimodal/hasher.py
Unrelated to this PR: what's the reason for converting all images to RGBA before hashing? This not only introduces some compute cost but also increases the amount of numpy data that needs to be hashed by 30%.
I remember this was due to some security concerns. See #17378
Looking at the unit tests, this prevents hash collisions for images with different modes or palettes. How do you feel about hashing the mode, palette and data separately? That should avoid the need to convert all images first. I'm happy to make a follow-up PR for something like this.
data = {"mode": obj.mode, "data": np.asarray(obj)}
if obj.palette is not None:
    data["palette"] = obj.palette.palette
yield from cls.iter_item_to_bytes("image", data)
> hashing the mode, palette and data separately

Yeah, I was going to suggest that; it should be much cheaper than needing to convert the image itself.
Feel free to do it
I'll make a follow-up PR, since this would actually change the hash.
vllm/multimodal/hasher.py
.view(np.uint8) needs to be used since blake3 only supports uint8/int8 buffers. Though calling .view(np.uint8) is very fast since it doesn't need to copy the data.
Very much appreciate the contribution and optimization! @lgeiger

> blake3 already accepts uint8 buffers so this PR removes the need to convert everything back to bytes in item_to_bytes

I actually didn't know this 😅 and I left a question
LGTM but @DarkLight1337 should also take a look
Thanks for optimizing this!
This solution still has problems. In fact, there are still many parameters in the image that may affect its state, such as the tRNS (transparency) chunk.
This PR aims at addressing efficiency; the uniqueness of the hash should be the same as before. If that's not the case (or if you feel that the uniqueness has some problems), please open a separate issue and provide more details.
Purpose
As mentioned in #22044, hashing large multimodal inputs can be inefficient. #19484 removed a data copy of numpy arrays by relying on memory views for C-contiguous data.
blake3 already accepts uint8 buffers, so this PR removes the need to convert everything back to bytes in item_to_bytes.
See vllm/vllm/multimodal/hasher.py, line 81 at 5bcc153.
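A minimal sketch of this idea — yielding buffers straight into the hasher rather than concatenating everything into one bytes object first. Here hashlib.blake2b stands in for blake3 (both accept buffer-protocol objects in update()), and iter_item_to_buffers is a hypothetical name, not the PR's actual function:

```python
import hashlib

import numpy as np

def iter_item_to_buffers(key: str, obj):
    """Hypothetical generator yielding bytes or memoryview objects
    that the hasher consumes directly, avoiding intermediate copies."""
    yield key.encode("utf-8")
    if isinstance(obj, np.ndarray) and obj.flags.c_contiguous:
        # Zero-copy: reinterpret the array's memory as a uint8 memoryview.
        yield obj.view(np.uint8).data
    elif isinstance(obj, np.ndarray):
        yield obj.tobytes()  # non-contiguous data still needs a copy
    elif isinstance(obj, bytes):
        yield obj
    else:
        yield repr(obj).encode("utf-8")

h = hashlib.blake2b()  # stand-in for blake3; same buffer-based update()
for buf in iter_item_to_buffers("image", np.zeros((2, 2), dtype=np.float32)):
    h.update(buf)
digest = h.hexdigest()
```

Feeding the hasher incrementally like this produces the same digest as hashing the concatenated bytes, but without materializing the concatenation.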
Implementation-wise, serialize_item now yields either bytes or a memoryview, which blake3 directly consumes.
Test Plan
Correctness should be covered by the existing hasher tests on CI.
The performance can be measured using:
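The original benchmark snippet is not preserved on this page; a stand-in micro-benchmark along the same lines (hashlib.blake2b instead of blake3, and a synthetic 4K RGBA-sized array) might look like:

```python
import hashlib
import timeit

import numpy as np

# Roughly a 4K RGBA image worth of data (~33 MB).
img = np.zeros((2160, 3840, 4), dtype=np.uint8)

def hash_with_copy(a: np.ndarray) -> str:
    h = hashlib.blake2b()
    h.update(a.tobytes())  # materializes an extra ~33 MB bytes object
    return h.hexdigest()

def hash_with_view(a: np.ndarray) -> str:
    h = hashlib.blake2b()
    h.update(a.view(np.uint8).data)  # zero-copy memoryview
    return h.hexdigest()

# Both paths must produce identical digests.
assert hash_with_copy(img) == hash_with_view(img)

t_copy = timeit.timeit(lambda: hash_with_copy(img), number=5)
t_view = timeit.timeit(lambda: hash_with_view(img), number=5)
print(f"copy: {t_copy:.3f}s  view: {t_view:.3f}s")
```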
Test Result
For a 4K image this speeds up hashing by ~30%. This is not massive, but for a lot of multimodal input it might become noticeable.