Support batch size > 1 image-text inference #36682

Open · wants to merge 10 commits into base: main

Conversation

@hiyouga (Contributor) commented Mar 12, 2025

What does this PR do?

This PR follows #35558

Consider a batch of image lists where the first example has 1 image and the second example has 0 images, e.g.:

images = [
  [Image],
  []
]

Using the latest code, this raises a ValueError: Invalid input type. Must be a single image, a list of images, or a list of batches of images.

In this PR, we use any instead of all to determine whether the input is a valid nested list of images. Note that this behavior matches transformers 4.48.0.

https://github.com/huggingface/transformers/blob/v4.48.0/src/transformers/models/mllama/image_processing_mllama.py#L535-L541

# If it's a list of batches, it's already in the right format
elif (
    isinstance(images, (list, tuple))
    and all(isinstance(images_i, (list, tuple)) for images_i in images)
    and any(is_valid_list_of_images(images_i) for images_i in images)
):
    output_images = images
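For illustration, here is a toy sketch of the check, with strings standing in for images; `is_valid_list_of_images` below is a simplified stand-in for the transformers helper, not its real implementation:

```python
# Simplified stand-in for transformers' is_valid_list_of_images:
# a non-empty list/tuple of image-like objects (plain strings in this sketch).
def is_valid_list_of_images(images):
    return bool(images) and all(isinstance(i, str) for i in images)

def is_nested_list_of_images(images, check):
    # `check=all` rejects batches that contain an empty sublist;
    # `check=any` (this PR) accepts them as long as one sublist holds images.
    return (
        isinstance(images, (list, tuple))
        and all(isinstance(images_i, (list, tuple)) for images_i in images)
        and check(is_valid_list_of_images(images_i) for images_i in images)
    )

batch = [["image_1"], []]  # second example has no images
print(is_nested_list_of_images(batch, check=all))  # False: rejected before this PR
print(is_nested_list_of_images(batch, check=any))  # True: accepted with `any`
```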

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@zucchini-nlp

@github-actions github-actions bot marked this pull request as draft March 12, 2025 17:56

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. When it is ready for review, please click the Ready for review button (at the bottom of the PR page).

@hiyouga hiyouga marked this pull request as ready for review March 12, 2025 17:58
@hiyouga hiyouga force-pushed the patch-14 branch 3 times, most recently from 2d81f59 to 03e338e Compare March 13, 2025 11:48
@zucchini-nlp (Member)

Question before reviewing: why do we pass an empty list for a no-image prompt? What if we just do images = [ [Image] ] instead of images = [ [Image], [] ]?

@hiyouga (Contributor Author) commented Mar 13, 2025

@zucchini-nlp Assuming the batch size is 2, we expect the length of the image list to be the same as the batch size.

@zucchini-nlp (Member) left a comment:

I see, makes sense. Also cc @yonigozlan since you added these functions, do you see any edge cases if we check any?

Otherwise LGTM

@yonigozlan (Member)

Hi @hiyouga! Thanks for flagging this issue. I agree we should support inputs such as [[image1], []]. Right now it seems to be causing some issues with the SmolVLM processor, but that is more of a problem with SmolVLM than with this PR.

The issue I see is that we wouldn't catch an error now if we have [[image1], image2], for example, when we should. But we cannot catch every possible wrong input format, so this might not be too bad. WDYT @zucchini-nlp?

@zucchini-nlp (Member)

@yonigozlan agreed, I think we can expect users to use consistent format within one input.

@hiyouga there's a failing test which I think is caused by this PR; can you take a look?

@hiyouga hiyouga force-pushed the patch-14 branch 7 times, most recently from ac56330 to 0b9acfc Compare March 14, 2025 16:21
@hiyouga (Contributor Author) commented Mar 14, 2025

Hi @zucchini-nlp, I have made the necessary changes to Gemma3ImageProcessor, Idefics2ImageProcessor, Idefics3ImageProcessor and SmolVLMImageProcessor so that they support inputs like [[image], []] and [[], [image]].

Comment on lines -292 to -294
-images = [self.image1]
-with self.assertRaises(ValueError):
-    processor(text=text, images=images, padding=True)
@zucchini-nlp (Member):

didn't get why this doesn't throw an error anymore. IMO passing flat images is ambiguous, and we should throw errors instead of trying to infer which text corresponds to which image

@zucchini-nlp (Member) left a comment:

@hiyouga great, thanks for handling the tests!

I see why we need to flatten images with the new changes, but I don't like calling it every time one image is needed. I'd suggest saving one image in a variable at the beginning and adding a small comment on why we do that, so future us don't delete it :)

@hiyouga hiyouga force-pushed the patch-14 branch 7 times, most recently from e2c82a4 to 5d4a4fb Compare March 17, 2025 16:19
@hiyouga (Contributor Author) commented Mar 17, 2025

@zucchini-nlp I have added a variable named first_image_in_list to hold the one image we need for determining the input format.

didn't get why this doesn't throw error anymore, IMO passing flat images is ambiguous, and we throw errors instead of trying to infer which text corresponds to which image

There is some logic in the Idefics processor that matches each image with its text according to image tokens, so I decided to remove such cases from the unit tests to avoid CI failures.

https://github.com/huggingface/transformers/blob/v4.49.0/src/transformers/models/idefics3/processing_idefics3.py#L257-L268

if text is not None:
    if sum(n_images_in_text) != len(images):
        raise ValueError(
            f"The total number of {self.image_token.content} tokens in the prompts should be the same as the number of images passed."
            f" Found {sum(n_images_in_text)} {self.image_token.content} tokens and {len(images)} images."
        )
    # Reorganize the images to match the prompts
    cumsum_images_in_text = [0] + list(accumulate(n_images_in_text))
    images = [
        images[cumsum_images_in_text[i] : cumsum_images_in_text[i + 1]]
        for i in range(len(n_images_in_text))
    ]

Comment on lines 718 to 720
# Search for the first image in the image list.
first_image_in_list = make_flat_list_of_images(images_list)[0]
@zucchini-nlp (Member):

great, something more verbose would be better imo. Something like:

Search for the first image in the image list. NOTE: we can't slice the first image with images_list[0][0] in case the first batch contains no images. See #36682

@zucchini-nlp (Member) left a comment:

Approving, a tiny nit about comments, overall LGTM. Thanks for handling it!

I will request one more review, then we can merge. Core code is not modified so we can ask for vision maintainer's review :)

@hiyouga (Contributor Author) commented Mar 17, 2025

@zucchini-nlp Thanks for your review! I have adjusted the comments about this change.

@qubvel (Member) left a comment:

Thanks for working on this @hiyouga, here are a few more comments 🤗

@@ -357,16 +358,19 @@ def preprocess(

 # All transformations expect numpy arrays.
 images_list = [[to_numpy_array(image) for image in images] for images in images_list]
+# Search for the first image in the image list.
+# NOTE: we can't slice the first image with images_list[0][0] if the first batch contains no images. See #36682
+first_image_in_list = make_flat_list_of_images(images_list)[0]
@qubvel (Member):

Looks a bit like overkill to use this function, with lots of validation under the hood, just to get the first element. Maybe try this one-liner. What do you think?

first_image_in_list = [images for images in images_list if images][0][0]
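As a quick sanity check of the one-liner (with strings standing in for images): the comprehension keeps only non-empty sublists, so the first image is found even when the first example has no images:

```python
# Strings stand in for images; empty sublists are filtered out before indexing,
# so this works for batches like [[], [img, ...]].
images_list = [[], ["img_1", "img_2"], []]
first_image_in_list = [images for images in images_list if images][0][0]
print(first_image_in_list)  # img_1
```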

@@ -243,7 +243,7 @@ def make_flat_list_of_images(
 if (
     isinstance(images, (list, tuple))
     and all(isinstance(images_i, (list, tuple)) for images_i in images)
-    and all(is_valid_list_of_images(images_i) for images_i in images)
+    and any(is_valid_list_of_images(images_i) for images_i in images)
@qubvel (Member):

hmm, I would rather extend it to support an empty list; otherwise, we might have something irrelevant in one of the lists, e.g.

Suggested change
-and any(is_valid_list_of_images(images_i) for images_i in images)
+and all(is_valid_list_of_images(images_i) or not images_i for images_i in images)

@@ -277,7 +277,7 @@ def make_nested_list_of_images(
 if (
     isinstance(images, (list, tuple))
     and all(isinstance(images_i, (list, tuple)) for images_i in images)
-    and all(is_valid_list_of_images(images_i) for images_i in images)
+    and any(is_valid_list_of_images(images_i) for images_i in images)
@qubvel (Member):

same here
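A toy comparison of the two checks (strings stand in for images; `is_valid_list_of_images` below is a simplified stand-in, not the real transformers helper) shows why the `or not images_i` form is stricter than plain `any`:

```python
# Simplified stand-in: a non-empty list whose elements all look like images.
def is_valid_list_of_images(images):
    return bool(images) and all(isinstance(i, str) and i.startswith("img") for i in images)

mixed = [["img_1"], ["not_an_image"]]  # second sublist is invalid, not empty

# `any` lets the invalid sublist slip through:
print(any(is_valid_list_of_images(i) for i in mixed))  # True

# `all(... or not images_i)` still accepts empty sublists but rejects invalid ones:
print(all(is_valid_list_of_images(i) or not i for i in mixed))           # False
print(all(is_valid_list_of_images(i) or not i for i in [["img_1"], []]))  # True
```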

@hiyouga (Contributor Author) commented Mar 17, 2025

Hi @qubvel, thank you for the review. I have updated the PR according to your suggestions. I also renamed a keyword of the pad method because the previous one was ambiguous: images -> images_list

@qubvel (Member) commented Mar 17, 2025

We should keep images for backward compatibility and for consistency with other pad methods. However, we can still resolve the ambiguity with proper type hints and docstrings.

@hiyouga (Contributor Author) commented Mar 17, 2025

@qubvel got it, how about the latest one?
