Started getting new warnings for gemma3 after upgrading from 4.49.0-gemma3 to 4.50.0 #36942
Comments
@HJJ256 hey, can you include a short reproducer, or is that the inference script from the model docs? From the warnings it seems that the model doesn't fit entirely in the GPU, but I am interested in what changed between v4.49 and v4.50.
The model doesn't fit completely in the GPU, but it was able to generate output with 4.49.0-gemma3. I am using the same code present in the model docs here, with the addition of the torch_dtype=torch.bfloat16 parameter in Gemma3ForConditionalGeneration.from_pretrained. This was the only dialog I used to get with 4.49.0-gemma3.
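For context, a minimal loading/generation sketch along the lines of the model docs might look like the following (a sketch, assuming the google/gemma-3-12b-it checkpoint and the processor chat-template API; not the reporter's exact script):

```python
import torch
from transformers import AutoProcessor, Gemma3ForConditionalGeneration

model_id = "google/gemma-3-12b-it"  # assumed checkpoint, matching "Gemma3-12b-it" below

# Load with automatic device placement and bf16 weights, as described above.
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {"role": "user", "content": [{"type": "text", "text": "Write a short poem about GPUs."}]},
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(out[0], skip_special_tokens=True))
```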
I see, I got issues with this as well. cc @SunMarc, seems related to accelerate.
Can you share the full traceback, @HJJ256? I'm able to run the model correctly on both branches with cpu/disk offload. Can you also share the device_map of the model after loading?
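For reference, the device map that accelerate computed can be read off the loaded model; hf_device_map is set whenever a device_map was used (a small sketch):

```python
# Assuming `model` was loaded with device_map="auto" as in the snippet above.
import json

print(json.dumps(model.hf_device_map, indent=2, default=str))
```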
The only thing that we need to fix is to specify
Yeah, we can update, let me open a PR. From the issue description it seems the user is already loading in bf16, but I'm not sure unless confirmed by the user.
Yeah, I'm not sure about the issue he's experiencing, as I didn't manage to reproduce the device mismatch.
Logs
I think the device map is incorrect; however, I am loading the model as specified, with torch_dtype=torch.bfloat16. Traceback on request.
nvidia-smi output when the model is loaded (there is no other code running on the machine).
Also, this is the device_map output when I run the code with the 4.49.0-gemma3 release.
This looks completely correct for Gemma3.
nvidia-smi output:
accelerate version: 1.5.2
I tried passing the above device_map directly with v4.50.0 and got a CUDA OOM error.
I still don't understand how it was able to load with the previous version.
I am getting this error during inference
@SunMarc @zucchini-nlp It seems the problem only occurs when the vision_tower is split across multiple devices: in "/opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py", _get_layer_device_map_for_cache_init gets num_hidden_layers from the text_config only, and then tries to map the layers of the vision_tower (SigLIP) model onto the layers of the gemma3_text model.
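To make the described failure mode concrete, here is a self-contained sketch (not the actual transformers code) of how matching decoder-layer indices against every device_map key can accidentally pick up vision_tower entries:

```python
# Illustrative only: the device_map keys and device values below are hypothetical.
device_map = {
    "vision_tower.vision_model.encoder.layers.10": "cpu",
    "language_model.model.layers.10": 0,
}
num_text_layers = 48  # text_config.num_hidden_layers (value illustrative)

layer_device_map = {}
for name, device in device_map.items():
    for idx in range(num_text_layers):
        # Index-based matching over *all* keys: the vision-tower entry wins for idx=10.
        if f".{idx}." in f"{name}." and idx not in layer_device_map:
            layer_device_map[idx] = device

print(layer_device_map[10])  # -> "cpu", even though decoder layer 10 sits on GPU 0
```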
Oh interesting! That is something I was going to work on; I've seen problems with multi-GPU inference for Gemma3 a few times. I didn't think it could also affect cases with one GPU, when the device map simply contains "language_model". We call the LM part "text_model" in some models (we can check if we have other names as well). cc @SunMarc if you have bandwidth to make a workaround. Side note for the cache: @gante, imo
In v4.50, for some reason, the device_map that you get is different. Can you share the value of the args that go into this function? If you can find the faulty commit that triggered this device_map + OOM issue using git bisect, that would be even better.
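If someone does bisect, one option is a small check script used with git bisect run (a sketch; the model ID, the editable install of transformers from the bisected repo, and the text-only prompt are assumptions):

```python
# bisect_check.py -- hypothetical helper, e.g. `git bisect run python bisect_check.py`
# Assumes transformers is installed editable (`pip install -e .`) from the repo being
# bisected, so each checked-out commit is what actually gets imported.
import sys

import torch
from transformers import AutoProcessor, Gemma3ForConditionalGeneration

model_id = "google/gemma-3-12b-it"  # assumed checkpoint

try:
    model = Gemma3ForConditionalGeneration.from_pretrained(
        model_id, device_map="auto", torch_dtype=torch.bfloat16
    )
    processor = AutoProcessor.from_pretrained(model_id)
    inputs = processor(text="ping", return_tensors="pt").to(model.device)
    model.generate(**inputs, max_new_tokens=5)
except (RuntimeError, torch.cuda.OutOfMemoryError) as err:
    print(f"bad commit: {err}", file=sys.stderr)
    sys.exit(1)  # non-zero exit -> git bisect marks the commit as bad
sys.exit(0)      # clean run -> good
```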
Parallel to the comment above: I'm updating how we find the decoder-layer device mapping for cache initialization, to handle the case where a non-decoder module matches the layer-index pattern. This should fix the case @zucchini-nlp described, as well as the device map example above, where the language model is on a single device (but the vision model is not).
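A rough sketch of the kind of prefix-aware matching being described, where only keys under the language-model submodule are considered when assigning decoder-layer devices (the actual PR may well differ; the prefix list is an assumption based on the naming discussion above):

```python
def decoder_layer_device_map(device_map, num_text_layers,
                             decoder_prefixes=("language_model", "text_model")):
    """Map each decoder-layer index to a device, skipping non-decoder modules
    such as the vision tower. Illustrative sketch only."""
    layer_devices = {}
    for name, device in device_map.items():
        if not name.startswith(decoder_prefixes):
            continue  # ignore vision_tower, multi_modal_projector, etc.
        for idx in range(num_text_layers):
            if f".{idx}." in f"{name}.":
                layer_devices.setdefault(idx, device)
                break
    return layer_devices

# With the hypothetical device_map from the earlier sketch, layer 10 now resolves to GPU 0:
# decoder_layer_device_map(device_map, 48)[10] -> 0
```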
/opt/conda/lib/python3.10/site-packages/accelerate/utils/modeling.py:1569: UserWarning: Current model requires 33280 bytes of buffer for offloaded layers, which seems does not fit any GPU's remaining memory. If you are experiencing a OOM later, please consider using offload_buffers=True.
warnings.warn(
[2025-03-24 19:49:10,626-accelerate.utils.modeling] - Based on the current allocation process, no modules could be assigned to the following devices due to insufficient memory:
These minimum requirements are specific to this allocation attempt and may vary. Consider increasing the available memory for these devices to at least the specified minimum, or adjusting the model config.
Loading checkpoint shards: 100% 5/5 [00:00<00:00, 40.00it/s]
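Regarding the buffer warning in the log above: offload_buffers can be passed straight through from_pretrained, as the warning suggests (a sketch; whether it resolves this particular OOM is not confirmed):

```python
import torch
from transformers import Gemma3ForConditionalGeneration

# Sketch: follow the accelerate warning and offload buffers together with offloaded weights.
model = Gemma3ForConditionalGeneration.from_pretrained(
    "google/gemma-3-12b-it",   # assumed checkpoint
    device_map="auto",
    torch_dtype=torch.bfloat16,
    offload_buffers=True,      # addresses "consider using offload_buffers=True"
)
```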
After the model loads, on running generate, I get the following error:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
Note: I am pushing my input_ids to "cuda"
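As a general note for this kind of device-mismatch error, inputs are usually sent to model.device (the device of the model's leading parameters) rather than a hard-coded "cuda"; whether that is enough here depends on the broken device map (sketch, reusing model and processor from the loading snippet earlier):

```python
# Sketch: place inputs on whatever device the model's leading parameters use.
prompt = "Hello"  # placeholder prompt
inputs = processor(text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=20)
```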
When I load and test with 4.49.0-gemma3, everything works fine. Is there a specific change in 4.50.0 that affects model loading?
Model: Gemma3-12b-it
GPU: NVIDIA RTX 4000 Ada
device_map: auto
torch dtype: bfloat16