Support batch size > 1 image-text inference #36682
base: main
Conversation
Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. When it is ready for review, please click the
Force-pushed from 2d81f59 to 03e338e (compare)
Question before reviewing: why do we pass an empty list for a no-image prompt? What if we just do
@zucchini-nlp Assuming the batch size is 2, we expect the length of the image lists to be the same as the batch size.
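A tiny sketch of the batched layout being discussed, with strings standing in for real PIL images (names are illustrative, not from the PR): one image list per text prompt, where a prompt without images gets an empty list.

```python
# Hypothetical sketch of the batched input layout under discussion:
# one (possibly empty) image list per text prompt. Strings stand in
# for real PIL images.
texts = ["<image> Describe this picture.", "What is the capital of France?"]
images = [["fake_image_0"], []]  # sample 0 has one image, sample 1 has none

# The outer length of `images` must match the batch size, i.e. len(texts).
assert len(images) == len(texts)
```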
I see, makes sense. Also cc @yonigozlan since you added these functions: do you see any edge cases if we check `any`?
Otherwise LGTM
Hi @hiyouga! Thanks for flagging this issue. I agree we should support such inputs. The issue I see is that we wouldn't catch an error now if we have
@yonigozlan agreed, I think we can expect users to use a consistent format within one input. @hiyouga there's a failing test caused by this PR, I think; can you take a look?
Force-pushed from ac56330 to 0b9acfc (compare)
Hi @zucchini-nlp, I have made the necessary changes to
images = [self.image1]
with self.assertRaises(ValueError):
    processor(text=text, images=images, padding=True)
Didn't get why this doesn't throw an error anymore. IMO passing flat images is ambiguous, and we throw errors instead of trying to infer which text corresponds to which image.
@hiyouga great, thanks for handling the tests!
I see why we need to flatten images with the new changes, but I don't like calling it every time one image is needed. I'd suggest saving one image in a variable at the beginning and adding a small comment on why we do that, so future us don't delete it :)
Force-pushed from e2c82a4 to 5d4a4fb (compare)
@zucchini-nlp I have added a variable named
There is some logic in the Idefics processor that matches images with text according to image tokens, so I decided to remove such cases from the unit tests to avoid CI failures.
transformers/src/transformers/models/idefics3/processing_idefics3.py Lines 257 to 268 in a22a437
# Search for the first image in the image list.
first_image_in_list = make_flat_list_of_images(images_list)[0]
Great, something more verbose would be better IMO. Something like:
# Search for the first image in the image list.
# NOTE: we can't slice the first image with images_list[0][0] in case the first batch contains no images. See #36682
Approving, a tiny nit about comments, overall LGTM. Thanks for handling it!
I will request one more review, then we can merge. Core code is not modified so we can ask for vision maintainer's review :)
@zucchini-nlp Thanks for your review! I have adjusted the comments about this change.
Thanks for working on this @hiyouga, here are a few more comments 🤗
@@ -357,16 +358,19 @@ def preprocess(
     # All transformations expect numpy arrays.
     images_list = [[to_numpy_array(image) for image in images] for images in images_list]
+    # Search for the first image in the image list.
+    # NOTE: we can't slice the first image with images_list[0][0] if the first batch contains no images. See #36682
+    first_image_in_list = make_flat_list_of_images(images_list)[0]
Looks a bit like overkill to use this function, with lots of validations under the hood, just to get the first element. Maybe try this one-liner. What do you think?
first_image_in_list = [images for images in images_list if images][0][0]
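As a quick sanity check (toy data, with strings standing in for real images), the suggested one-liner does pick out the first real image even when the first sublist is empty:

```python
# Toy check of the suggested one-liner: grab the first image across a
# nested list whose first sublist is empty. Strings stand in for images.
images_list = [[], ["img_a", "img_b"], ["img_c"]]

# Keep only non-empty sublists, then take the first image of the first one.
first_image_in_list = [images for images in images_list if images][0][0]
print(first_image_in_list)  # -> img_a
```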
src/transformers/image_utils.py
Outdated
@@ -243,7 +243,7 @@ def make_flat_list_of_images(
     if (
         isinstance(images, (list, tuple))
         and all(isinstance(images_i, (list, tuple)) for images_i in images)
-        and all(is_valid_list_of_images(images_i) for images_i in images)
+        and any(is_valid_list_of_images(images_i) for images_i in images)
Hmm, I would rather extend it to support an empty list; otherwise we might have something not relevant in one of the lists, e.g.:
- and any(is_valid_list_of_images(images_i) for images_i in images)
+ and all(is_valid_list_of_images(images_i) or not images_i for images_i in images)
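A toy illustration of the difference between the two predicates, using strings as stand-in "images" and a deliberately simplified validity check (the real `is_valid_list_of_images` inspects actual image types): `any` lets invalid entries slip through alongside one valid list, while the suggested `all(... or not images_i)` accepts empty lists but still rejects junk.

```python
# Simplified stand-in for is_valid_list_of_images: non-empty list of
# "images" (here, strings). The real function checks image types.
def is_valid_list_of_images(images):
    return bool(images) and all(isinstance(i, str) for i in images)

images = [["img"], []]  # second sample has no images

# `any` accepts the empty-list case...
assert any(is_valid_list_of_images(i) for i in images)
# ...but also accepts junk as long as one sublist is valid:
assert any(is_valid_list_of_images(i) for i in [["img"], [42]])

# The suggested check accepts empty lists yet rejects the junk:
assert all(is_valid_list_of_images(i) or not i for i in images)
assert not all(is_valid_list_of_images(i) or not i for i in [["img"], [42]])
```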
src/transformers/image_utils.py
Outdated
@@ -277,7 +277,7 @@ def make_nested_list_of_images(
     if (
         isinstance(images, (list, tuple))
         and all(isinstance(images_i, (list, tuple)) for images_i in images)
-        and all(is_valid_list_of_images(images_i) for images_i in images)
+        and any(is_valid_list_of_images(images_i) for images_i in images)
same here
Hi @qubvel, thank you for the review. I have updated the PR according to your suggestions. I also modified a key of the
We should keep
@qubvel got it, how about the latest one?
What does this PR do?
This PR follows #35558
Consider a batch of image lists where the first example has 1 image and the second example has 0 images.
With the latest code, this raises a ValueError: "Invalid input type. Must be a single image, a list of images, or a list of batches of images."
In this PR, we use `any` instead of `all` to judge whether the input is a valid nested list of images. Note that this behavior is the same as in transformers 4.48.0:
https://github.com/huggingface/transformers/blob/v4.48.0/src/transformers/models/mllama/image_processing_mllama.py#L535-L541
transformers/src/transformers/models/mllama/image_processing_mllama.py
Lines 535 to 541 in 6bc0fbc
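A minimal sketch of the failure mode this PR fixes, under simplified assumptions: the helper below is a stand-in for the pre-PR validation (the real check also inspects image types), and strings stand in for images. With an `all`-style check, a batch whose second sample has no images is rejected.

```python
# Simplified stand-in for the pre-PR validation: every sublist must be
# a non-empty list. The real check also verifies actual image types.
def nested_list_check_old(images):
    if isinstance(images, list) and all(
        isinstance(i, list) and len(i) > 0 for i in images
    ):
        return images
    raise ValueError(
        "Invalid input type. Must be a single image, a list of images, "
        "or a list of batches of images."
    )

# A batch of 2 where the second sample has no images fails the old check:
try:
    nested_list_check_old([["img"], []])
except ValueError as err:
    print("old behavior:", err)
```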
Who can review?
@zucchini-nlp