Gemma3 can't be fine-tuned on multi-image examples #36816
Comments
@FredrikNoren Gemma3 works with multi-image inputs at inference time, so I'd assume training should be no different. The expected input format with several images is as follows; can you verify the train script follows it?

```python
images = [[im1, im2], [im3]]
texts = ["Are these identical images? <image> <image>", "Describe this: <image>"]
inputs = processor(images=images, text=texts, return_tensors='pt')
print(inputs['pixel_values'].shape)  # 3, 3, 896, 896
```
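A minimal sketch of why the first dimension above is 3 rather than the batch size of 2 (illustrative only, using strings as stand-ins for images, not the actual processor internals): the processor stacks all images across the batch into one tensor, so the first dim of `pixel_values` is the total image count.

```python
# Toy illustration of the nested-images input format: 2 samples, 3 images total.
# Strings stand in for the actual PIL images im1, im2, im3.
images = [["im1", "im2"], ["im3"]]
texts = [
    "Are these identical images? <image> <image>",
    "Describe this: <image>",
]

# The processor flattens the nested list, so pixel_values has one entry
# per image overall, not one entry per sample.
total_images = sum(len(per_sample) for per_sample in images)
print(total_images)  # 3 -> first dim of pixel_values
print(len(texts))    # 2 -> batch size seen by the trainer
```

This mismatch between the first dim of `pixel_values` and the batch size is exactly what trips up a trainer that assumes dim 0 is always the batch dimension.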
@zucchini-nlp Yup, that's the format I'm giving the processor, and the output of the processor I'm getting is:

But the problem is that the trainer seems to treat the first dimension of `pixel_values` as the batch dimension.
Oh, I see now. Actually, Gemma3 is not the first model where the first dim doesn't match the batch dim; earlier we had qwen2-vl, where the first dim was the image sequence length. Lemme check out how it worked, or if it worked at all.
I would guess the training here is also multi-GPU, so I'll link it to #33666.
@zucchini-nlp I'm planning to do multi-GPU as well, but I haven't yet. So far this is just single-GPU.
@FredrikNoren I asked the team internally and found that training was tested and works with Gemma3 multi-image cases. So I believe there is something wrong in the way your training is set up. Could you please open the issue in the TRL repo instead?
@zucchini-nlp Hm, I suspect it's not a TRL issue, but I've created an issue for them here now: huggingface/trl#3121. Would you mind asking the team that was able to train internally if they can share their code? Also: what is the expected output format of `collate_fn` in the case of multi-image training?
Yes, sure. The team is aware of the issue and someone from TRL will take a look soon.
System Info
There are more details in here: google-deepmind/gemma#193
But in short: it seems that multi-image training is not implemented yet.
Who can help?
No response
Information

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
The following code works:
but if I increase `N_IMAGES` to 2, it crashes with the following error:

Expected behavior
I'd expect either:

`[batch, n_images, c, w, h]`, and that the model can handle that.

In the first case:
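A hedged sketch of the relationship between the two layouts discussed here (toy sizes; real Gemma3 crops are 896x896): the flattened layout the processor actually emits, `[total_images, c, h, w]`, and the nested per-sample layout `[batch, n_images, c, h, w]` this issue asks for. The round trip below only works when every sample in the batch has the same number of images.

```python
import numpy as np

# Toy dimensions; substitute h = w = 896 for real Gemma3 image crops.
batch, n_images, c, h, w = 2, 2, 3, 4, 4

flat = np.zeros((batch * n_images, c, h, w))     # processor-style output
nested = flat.reshape(batch, n_images, c, h, w)  # per-sample grouping
restored = nested.reshape(-1, c, h, w)           # back to the flat layout

print(nested.shape)    # (2, 2, 3, 4, 4)
print(restored.shape)  # (4, 3, 4, 4)
```

With ragged image counts per sample (as in the `[[im1, im2], [im3]]` example above), no such rectangular reshape exists, which is presumably why the processor emits the flattened layout in the first place.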