
Conversation

@lgeiger (Contributor) commented Sep 16, 2025

Purpose

As mentioned in #22044, hashing large multimodal inputs can be inefficient. #19484 removed a data copy of numpy arrays by relying on memory views for C-contiguous data.

blake3 already accepts uint8 buffers, so this PR removes the need to convert everything back to bytes in `item_to_bytes`:

```python
return b''.join(kb + vb for kb, vb in cls.iter_item_to_bytes(key, obj))
```

Implementation-wise, `serialize_item` now yields either `bytes` or a `memoryview`, which blake3 consumes directly.
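The streaming pattern can be sketched as follows. This is a minimal illustration, not vLLM's actual code: `iter_chunks` is a hypothetical stand-in for `iter_item_to_bytes`, and the stdlib's blake2b stands in for blake3, since both accept buffer-protocol objects.

```python
# Sketch (assumed names): feed each yielded chunk to the hasher directly
# instead of joining everything into one bytes object first.
import hashlib

import numpy as np


def iter_chunks(key: str, arr: np.ndarray):
    # Yield bytes for the key and a zero-copy memoryview for the data.
    yield key.encode("utf-8")
    yield arr.view(np.uint8).data  # .data is a memoryview; no copy


arr = np.arange(8, dtype=np.float64)

# Old approach: join all chunks into one bytes object (extra copy).
joined = hashlib.blake2b(b"".join(bytes(c) for c in iter_chunks("data", arr)))

# New approach: stream each chunk into the hasher.
streamed = hashlib.blake2b()
for chunk in iter_chunks("data", arr):
    streamed.update(chunk)

assert joined.hexdigest() == streamed.hexdigest()
```

Streaming produces the same digest as the joined version, but never materializes the concatenated buffer.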

Test Plan

Correctness should be covered by the existing hasher tests on CI.

The performance can be measured using:

```python
import numpy as np
from vllm.multimodal.hasher import MultiModalHasher

np.random.seed(42)
data = np.random.randn(3840, 2160, 4)

%timeit MultiModalHasher.hash_kwargs(data=data)
```

Test Result

For a 4K image this speeds up hashing by ~30%. This is not massive, but with a lot of multimodal input it might become noticeable.

```
# main
168 ms ± 892 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# This PR
120 ms ± 1.09 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

@mergify mergify bot added the multi-modality Related to multi-modality (#4194) label Sep 16, 2025
@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces a performance optimization for hashing multimodal inputs by avoiding intermediate byte conversions and copies. The use of generators to yield bytes or memoryview objects, which are then directly consumed by the blake3 hasher, is a solid approach. The changes are well-reasoned and the performance gains are evident. However, I've identified a pre-existing critical issue in the handling of bfloat16 tensors that will cause a runtime error. This should be addressed to ensure the stability of the hashing mechanism.

@lgeiger (Contributor, Author)

Unrelated to this PR: what's the reason for converting all images to RGBA before hashing? This not only introduces some compute cost but also increases the amount of numpy data that needs to be hashed by 30%.

Member

I remember this was due to some security concerns. See #17378

@lgeiger (Contributor, Author) Sep 16, 2025
Looking at the unit tests, this prevents hash collisions for images with different modes or palettes. How do you feel about hashing the mode, palette and data separately, which should remove the need to convert all images first? I'm happy to make a follow-up PR for something like this.

```python
data = {"mode": obj.mode, "data": np.asarray(obj)}
if obj.palette is not None:
    data["palette"] = obj.palette.palette
yield from cls.iter_item_to_bytes("image", data)
```
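A small illustration of the collision this guards against, assuming Pillow is available: two images can have byte-identical pixel buffers while meaning entirely different things, so the mode must be part of the hashed payload.

```python
# Two images with identical raw bytes but different modes: hashing only
# the pixel data would collide, hashing the mode alongside it would not.
import numpy as np
from PIL import Image

raw = bytes(range(256))
img_l = Image.frombytes("L", (16, 16), raw)  # grayscale values
img_p = Image.frombytes("P", (16, 16), raw)  # palette indices

# The pixel buffers alone are identical...
assert np.asarray(img_l).tobytes() == np.asarray(img_p).tobytes()
# ...but the modes differ, so including the mode disambiguates the hash.
assert img_l.mode != img_p.mode
```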

Member

> hashing the mode, palette and data separately

Yeah, I was going to suggest that; it should be much cheaper than converting the image itself.

Member

Feel free to do it

@lgeiger (Contributor, Author)

I'll make a follow-up PR, since this would actually change the hash.

Comment on lines +63 to +62
@lgeiger (Contributor, Author) Sep 16, 2025

`.view(np.uint8)` needs to be used since blake3 only supports uint8/int8 buffers. Calling `.view(np.uint8)` is very cheap since it doesn't copy the data.
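A quick check (independent of vLLM) showing the reinterpretation is free:

```python
# .view(np.uint8) reinterprets the existing buffer byte-by-byte rather
# than allocating a new one, so it is effectively zero-cost.
import numpy as np

arr = np.random.randn(4, 4)      # float64, 8 bytes per element
u8 = arr.view(np.uint8)

assert u8.base is arr            # shares arr's memory: no copy was made
assert u8.nbytes == arr.nbytes   # same bytes, just reinterpreted
assert u8.shape == (4, 32)       # last axis expands by the itemsize (8)
```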

@ywang96 (Member) left a comment

Very much appreciate the contribution and optimization! @lgeiger

> blake3 already accepts uint8 buffers so this PR removes the need to convert everything back to bytes in `item_to_bytes`

I actually didn't know this 😅 and I left a question

@ywang96 (Member) left a comment

LGTM but @DarkLight1337 should also take a look

…shing

Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
@DarkLight1337 (Member)

Thanks for optimizing this!

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) September 16, 2025 13:59
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 16, 2025
@DarkLight1337 DarkLight1337 merged commit 0836928 into vllm-project:main Sep 16, 2025
49 of 51 checks passed
@lgeiger lgeiger deleted the mm-hash-memoryview branch September 16, 2025 15:37
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
…shing (vllm-project#24925)

Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
@kexinoh commented Sep 28, 2025

This solution still has problems. In fact, there are still many image parameters that may affect its state, such as the tRNS (transparency) state.
@lgeiger

@DarkLight1337 (Member) commented Sep 28, 2025

This PR aims at addressing efficiency. The uniqueness of the hash should be the same as before. If that's not the case (or if you feel that the uniqueness has some problems), please open a separate issue and provide more details.

Labels
multi-modality Related to multi-modality (#4194) ready ONLY add when PR is ready to merge/full CI is needed