Support batch size > 1 image-text inference #36682

Open · wants to merge 10 commits into base: main

Conversation

@hiyouga (Contributor) commented Mar 12, 2025

What does this PR do?

This PR follows #35558

Consider a batch of image lists where the first example has 1 image and the second example has 0 images, e.g.:

images = [
  [Image],
  []
]

Using the latest code, this raises a ValueError: Invalid input type. Must be a single image, a list of images, or a list of batches of images.

In this PR, we use any instead of all to determine whether the input is a valid nested list of images. Note that this behavior matches transformers 4.48.0.

https://github.com/huggingface/transformers/blob/v4.48.0/src/transformers/models/mllama/image_processing_mllama.py#L535-L541

# If it's a list of batches, it's already in the right format
elif (
    isinstance(images, (list, tuple))
    and all(isinstance(images_i, (list, tuple)) for images_i in images)
    and any(is_valid_list_of_images(images_i) for images_i in images)
):
    output_images = images
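For illustration, here is a toy sketch of the check, with strings standing in for images; `is_valid_list_of_images` below is a simplified stand-in for the transformers helper, not its real implementation:

```python
# Simplified stand-in for transformers' is_valid_list_of_images:
# a non-empty list/tuple of image-like objects (plain strings in this sketch).
def is_valid_list_of_images(images):
    return bool(images) and all(isinstance(i, str) for i in images)

def is_nested_list_of_images(images, check):
    # `check=all` rejects batches that contain an empty sublist;
    # `check=any` (this PR) accepts them as long as one sublist holds images.
    return (
        isinstance(images, (list, tuple))
        and all(isinstance(images_i, (list, tuple)) for images_i in images)
        and check(is_valid_list_of_images(images_i) for images_i in images)
    )

batch = [["image_1"], []]  # second example has no images
print(is_nested_list_of_images(batch, check=all))  # False: rejected before this PR
print(is_nested_list_of_images(batch, check=any))  # True: accepted with `any`
```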

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@zucchini-nlp

@github-actions github-actions bot marked this pull request as draft March 12, 2025 17:56

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. When it is ready for review, please click the Ready for review button (at the bottom of the PR page).

@hiyouga hiyouga marked this pull request as ready for review March 12, 2025 17:58
@hiyouga hiyouga force-pushed the patch-14 branch 3 times, most recently from 2d81f59 to 03e338e Compare March 13, 2025 11:48
@zucchini-nlp (Member)

Question before reviewing: why do we pass an empty list for a no-image prompt? What if we just do images = [ [Image] ] instead of images = [ [Image], [] ]?

@hiyouga (Contributor Author) commented Mar 13, 2025

@zucchini-nlp Assuming the batch size is 2, we expect the length of the image list to be the same as the batch size.

@zucchini-nlp (Member) left a comment:

I see, makes sense. Also cc @yonigozlan since you added these functions, do you see any edge cases if we check any?

Otherwise LGTM

@yonigozlan (Member)

Hi @hiyouga! Thanks for flagging this issue. I agree we should support inputs such as [[image1], []]. Right now it seems to be causing some issues with the SmolVLM processor, but that is more of a problem with SmolVLM than with this PR.

The issue I see is that we wouldn't catch an error now if we have [[image1], image2], for example, when we should. But we cannot catch every possible wrong input format, so this might not be too bad. WDYT @zucchini-nlp?

@zucchini-nlp (Member)

@yonigozlan agreed, I think we can expect users to use consistent format within one input.

@hiyouga there's a failing test which I think is caused by this PR; can you take a look?

@hiyouga hiyouga force-pushed the patch-14 branch 7 times, most recently from ac56330 to 0b9acfc Compare March 14, 2025 16:21
@hiyouga (Contributor Author) commented Mar 14, 2025

Hi @zucchini-nlp, I have made the necessary changes to Gemma3ImageProcessor, Idefics2ImageProcessor, Idefics3ImageProcessor and SmolVLMImageProcessor so that they support inputs like [[image], []] and [[], [image]].

Comment on lines -292 to -294
-images = [self.image1]
-with self.assertRaises(ValueError):
-    processor(text=text, images=images, padding=True)
@zucchini-nlp (Member):

didn't get why this doesn't throw an error anymore. IMO passing flat images is ambiguous, and we should throw errors instead of trying to infer which text corresponds to which image

@zucchini-nlp (Member) left a comment:

@hiyouga great, thanks for handling the tests!

I see why we need to flatten images with the new changes, but I don't like calling it every time one image is needed. I'd suggest saving one image in a variable at the beginning and adding a small comment on why we do that, so future us don't delete it :)

@hiyouga hiyouga force-pushed the patch-14 branch 7 times, most recently from e2c82a4 to 5d4a4fb Compare March 17, 2025 16:19
@hiyouga (Contributor Author) commented Mar 17, 2025

@zucchini-nlp I have added a variable named first_image_in_list to hold the one image we need for determining the input format.

didn't get why this doesn't throw error anymore, IMO passing flat images is ambiguous, and we throw errors instead of trying to infer which text corresponds to which image

There is some logic in the Idefics processor that matches each image with its text according to image tokens, so I decided to remove such cases from the unit tests to avoid CI failures.

https://github.com/huggingface/transformers/blob/v4.49.0/src/transformers/models/idefics3/processing_idefics3.py#L257-L268

if text is not None:
    if sum(n_images_in_text) != len(images):
        raise ValueError(
            f"The total number of {self.image_token.content} tokens in the prompts should be the same as the number of images passed."
            f" Found {sum(n_images_in_text)} {self.image_token.content} tokens and {len(images)} images."
        )
    # Reorganize the images to match the prompts
    cumsum_images_in_text = [0] + list(accumulate(n_images_in_text))
    images = [
        images[cumsum_images_in_text[i] : cumsum_images_in_text[i + 1]]
        for i in range(len(n_images_in_text))
    ]

Comment on lines 718 to 720
# Search for the first image in the image list.
first_image_in_list = make_flat_list_of_images(images_list)[0]
@zucchini-nlp (Member):

great, something more verbose would be better imo. Something like:

Search for the first image in the image list. NOTE: we can't slice the first image with images_list[0][0] in case the first batch contains no images. See #36682

@zucchini-nlp (Member) left a comment:

Approving, a tiny nit about comments, overall LGTM. Thanks for handling it!

I will request one more review, then we can merge. Core code is not modified so we can ask for vision maintainer's review :)

@hiyouga (Contributor Author) commented Mar 17, 2025

@zucchini-nlp Thanks for your review! I have adjusted the comments about this change.

@qubvel (Member) left a comment:

Thanks for working on this @hiyouga, here are a few more comments 🤗

@@ -357,16 +358,19 @@ def preprocess(

 # All transformations expect numpy arrays.
 images_list = [[to_numpy_array(image) for image in images] for images in images_list]
+# Search for the first image in the image list.
+# NOTE: we can't slice the first image with images_list[0][0] if the first batch contains no images. See #36682
+first_image_in_list = make_flat_list_of_images(images_list)[0]
@qubvel (Member):

Looks a bit like overkill to use this function, with lots of validation under the hood, just to get the first element. Maybe try this one-liner. What do you think?

first_image_in_list = [images for images in images_list if images][0][0]
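As a quick sanity check of the one-liner (with strings standing in for images): the comprehension keeps only non-empty sublists, so the first image is found even when the first example has no images:

```python
# Strings stand in for images; empty sublists are filtered out before indexing,
# so this works for batches like [[], [img, ...]].
images_list = [[], ["img_1", "img_2"], []]
first_image_in_list = [images for images in images_list if images][0][0]
print(first_image_in_list)  # img_1
```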

@@ -243,7 +243,7 @@ def make_flat_list_of_images(
 if (
     isinstance(images, (list, tuple))
     and all(isinstance(images_i, (list, tuple)) for images_i in images)
-    and all(is_valid_list_of_images(images_i) for images_i in images)
+    and any(is_valid_list_of_images(images_i) for images_i in images)
@qubvel (Member):

hmm, I would rather extend it to support an empty list; otherwise, we might have something irrelevant in one of the lists, e.g.

Suggested change
-and any(is_valid_list_of_images(images_i) for images_i in images)
+and all(is_valid_list_of_images(images_i) or not images_i for images_i in images)

@@ -277,7 +277,7 @@ def make_nested_list_of_images(
 if (
     isinstance(images, (list, tuple))
     and all(isinstance(images_i, (list, tuple)) for images_i in images)
-    and all(is_valid_list_of_images(images_i) for images_i in images)
+    and any(is_valid_list_of_images(images_i) for images_i in images)
@qubvel (Member):

same here
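A toy comparison of the two checks (strings stand in for images; `is_valid_list_of_images` below is a simplified stand-in, not the real transformers helper) shows why the `or not images_i` form is stricter than plain `any`:

```python
# Simplified stand-in: a non-empty list whose elements all look like images.
def is_valid_list_of_images(images):
    return bool(images) and all(isinstance(i, str) and i.startswith("img") for i in images)

mixed = [["img_1"], ["not_an_image"]]  # second sublist is invalid, not empty

# `any` lets the invalid sublist slip through:
print(any(is_valid_list_of_images(i) for i in mixed))  # True

# `all(... or not images_i)` still accepts empty sublists but rejects invalid ones:
print(all(is_valid_list_of_images(i) or not i for i in mixed))           # False
print(all(is_valid_list_of_images(i) or not i for i in [["img_1"], []]))  # True
```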

@hiyouga (Contributor Author) commented Mar 17, 2025

Hi @qubvel, thank you for the review. I have updated the PR according to your suggestions. I also renamed a keyword of the pad method because the previous one was ambiguous: images -> images_list

@qubvel (Member) commented Mar 17, 2025

We should keep images for backward compatibility and for consistency with other pad methods. However, we can still resolve the ambiguity with proper type hints and docstrings.

@hiyouga (Contributor Author) commented Mar 17, 2025

@qubvel got it, how about the latest one?
