Commit 9a3e0ca

multimodal
1 parent b446733 commit 9a3e0ca

6 files changed, +261 -146 lines changed

docs/source/en/model_doc/bert.md (+1 -1)

@@ -28,7 +28,7 @@ rendered properly in your Markdown viewer.

 [BERT](https://huggingface.co/papers/1810.04805) is a bidirectional transformer pretrained on unlabeled text to predict masked tokens in a sentence and to predict whether one sentence follows another. The main idea is that by randomly masking some tokens, the model can train on text to the left and right, giving it a more thorough understanding. BERT is also very versatile because its learned language representations can be adapted for other NLP tasks by fine-tuning an additional layer or head.

-You can find all the original BERT checkpoints under the [BERT collection](https://huggingface.co/collections/google/bert-release-64ff5e7a4be99045d1896dbc).
+You can find all the original BERT checkpoints under the BERT [collection](https://huggingface.co/collections/google/bert-release-64ff5e7a4be99045d1896dbc).

 > [!TIP]
 > Click on the BERT models in the right sidebar for more examples of how to apply BERT to different language tasks.
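
The BERT paragraph above describes masked-token prediction; as a minimal sketch, any checkpoint from the linked collection can be exercised with the fill-mask pipeline (the specific checkpoint id here is an assumption, not part of the diff):

```py
from transformers import pipeline

# Fill in the [MASK] token with a pretrained BERT checkpoint.
fill_mask = pipeline("fill-mask", model="google-bert/bert-base-uncased")
print(fill_mask("Plants create [MASK] through a process known as photosynthesis."))
```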

docs/source/en/model_doc/gemma3.md (+136 -85)

@@ -15,36 +15,63 @@ rendered properly in your Markdown viewer.

 -->

-# Gemma3
+<div style="float: right;">
+    <div class="flex flex-wrap space-x-1">
+        <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
+        <img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
+    </div>
+</div>

-## Overview
+# Gemma 3

-The Gemma 3 model was proposed in the [Gemma 3 Techncial Report](https://goo.gle/Gemma3Report) by Google. It is a vision-language model composed by a [SigLIP](siglip) vision encoder and a [Gemma 2](gemma_2) language decoder, linked by a multimodal linear projection. It cuts an image into a fixed number of tokens, in the same way as SigLIP, as long as the image does not exceed certain aspect ratio. For images that exceed the given aspect ratio, it crops the image into multiple smaller patches and concatenates them with the base image embedding. One particularity is that the model uses bidirectional attention on all the image tokens. In addition, the model interleaves sliding window local attention with full causal attention in the language backbone, where each sixth layer is a full causal attention layer.
+[Gemma 3](https://goo.gle/Gemma3Report) is a multimodal model, available in pretrained and instruction-tuned variants with 1B, 4B, 12B, and 27B parameters. The architecture is mostly the same as the previous Gemma versions. The key differences are alternating 5 local sliding window self-attention layers for every global self-attention layer, support for a longer context length of 128K tokens, and a [SigLIP](./siglip) encoder that can "pan & scan" high-resolution images to prevent information in images from being lost.

-This model was contributed by [Ryan Mullins](https://huggingface.co/RyanMullins), [Raushan Turganbay](https://huggingface.co/RaushanTurganbay) [Arthur Zucker](https://huggingface.co/ArthurZ), and [Pedro Cuenca](https://huggingface.co/pcuenq).
+The instruction-tuned Gemma 3 model was post-trained with knowledge distillation and reinforcement learning.

+You can find all the original Gemma 3 checkpoints under the [Gemma 3](https://huggingface.co/collections/meta-llama/llama-2-family-661da1f90a9d678b6f55773b) release.

-## Usage tips
+> [!TIP]
+> Click on the Gemma 3 models in the right sidebar for more examples of how to apply Gemma to different vision and language tasks.

+The example below demonstrates how to generate text based on an image with [`Pipeline`] or the [`AutoModel`] class.

-- For image+text and image-only inputs use `Gemma3ForConditionalGeneration`.
-- For text-only inputs use `Gemma3ForCausalLM` for generation to avoid loading the vision tower.
-- Each sample can contain multiple images, and the number of images can vary between samples. However, make sure to pass correctly batched images to the processor, where each batch is a list of one or more images.
-- The text passed to the processor should have a `<start_of_image>` token wherever an image should be inserted.
-- The processor has its own `apply_chat_template` method to convert chat messages to model inputs. See the examples below for more details on how to use it.
+<hfoptions id="usage">
+<hfoption id="Pipeline">

+```py
+import torch
+from transformers import pipeline

-### Image cropping for high resolution images
-
-The model supports cropping images into smaller patches when the image aspect ratio exceeds a certain value. By default the images are not cropped and only the base image is forwarded to the model. Users can set `do_pan_and_scan=True` to obtain several crops per image along with the base image to improve the quality in DocVQA or similar tasks requiring higher resolution images.
+pipeline = pipeline(
+    task="image-text-to-text",
+    model="google/gemma-3-4b-pt",
+    device=0,
+    torch_dtype=torch.bfloat16
+)
+pipeline(
+    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
+    text="<start_of_image> What is shown in this image?"
+)
+```

-Pan and scan is an inference time optimization to handle images with skewed aspect ratios. When enabled, it improves performance on tasks related to document understanding, infographics, OCR, etc.
+</hfoption>
+<hfoption id="AutoModel">

-```python
+```py
+import torch
+from transformers import AutoProcessor, Gemma3ForConditionalGeneration

-processor = AutoProcessor.from_pretrained("google/gemma-3-4b-it", padding_side="left")
+model = Gemma3ForConditionalGeneration.from_pretrained(
+    "google/gemma-3-4b-it",
+    torch_dtype=torch.bfloat16,
+    device_map="auto",
+    attn_implementation="sdpa"
+)
+processor = AutoProcessor.from_pretrained(
+    "google/gemma-3-4b-it",
+    padding_side="left"
+)

-url = "https://media.istockphoto.com/id/1192867753/photo/cow-in-berchida-beach-siniscola.jpg?s=612x612&w=0&k=20&c=v0hjjniwsMNfJSuKWZuIn8pssmD5h5bSN1peBd1CmH4="
 messages = [
     {
         "role": "system",
@@ -54,7 +81,7 @@ messages = [
     },
     {
         "role": "user", "content": [
-            {"type": "image", "url": url},
+            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
             {"type": "text", "text": "What is shown in this image?"},
         ]
     },
@@ -65,24 +92,36 @@ inputs = processor.apply_chat_template(
     return_dict=True,
     return_tensors="pt",
     add_generation_prompt=True,
-    do_pan_and_scan=True,
-).to(model.device)
+).to("cuda")

+output = model.generate(**inputs, max_new_tokens=50, cache_implementation="static")
+print(processor.decode(output[0], skip_special_tokens=True))
 ```

+</hfoption>
+</hfoptions>

-## Usage Example
+Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.

-### Single-image Inference
+The example below uses [torchao](../quantization/torchao) to only quantize the weights to int4.

-```python
-from transformers import AutoProcessor, Gemma3ForConditionalGeneration
+```py
+# pip install torchao
+import torch
+from transformers import TorchAoConfig, Gemma3ForConditionalGeneration, AutoProcessor

-model_id = "google/gemma-3-4b-it"
-model = Gemma3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")
-processor = AutoProcessor.from_pretrained(model_id, padding_side="left")
+quantization_config = TorchAoConfig("int4_weight_only", group_size=128)
+model = Gemma3ForConditionalGeneration.from_pretrained(
+    "google/gemma-3-27b-it",
+    torch_dtype=torch.bfloat16,
+    device_map="auto",
+    quantization_config=quantization_config
+)
+processor = AutoProcessor.from_pretrained(
+    "google/gemma-3-27b-it",
+    padding_side="left"
+)

-url = "https://media.istockphoto.com/id/1192867753/photo/cow-in-berchida-beach-siniscola.jpg?s=612x612&w=0&k=20&c=v0hjjniwsMNfJSuKWZuIn8pssmD5h5bSN1peBd1CmH4="
 messages = [
     {
         "role": "system",
@@ -92,7 +131,7 @@ messages = [
     },
     {
         "role": "user", "content": [
-            {"type": "image", "url": url},
+            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
             {"type": "text", "text": "What is shown in this image?"},
         ]
     },
@@ -103,69 +142,81 @@ inputs = processor.apply_chat_template(
     return_dict=True,
     return_tensors="pt",
     add_generation_prompt=True,
-).to(model.device)
+).to("cuda")

-output = model.generate(**inputs, max_new_tokens=50)
-print(processor.decode(output[0], skip_special_tokens=True)[inputs.input_ids.shape[1]: ])
+output = model.generate(**inputs, max_new_tokens=50, cache_implementation="static")
+print(processor.decode(output[0], skip_special_tokens=True))
 ```

-### Multi-image Inference
-
-```python
-model_id = "google/gemma-3-4b-it"
-model = Gemma3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")
-processor = AutoProcessor.from_pretrained(model_id, padding_side="left")
-
-url_cow = "https://media.istockphoto.com/id/1192867753/photo/cow-in-berchida-beach-siniscola.jpg?s=612x612&w=0&k=20&c=v0hjjniwsMNfJSuKWZuIn8pssmD5h5bSN1peBd1CmH4="
-url_stop = "https://www.ilankelman.org/stopsigns/australia.jpg"
-messages = [
-    {
-        "role": "system",
-        "content": [
-            {"type": "text", "text": "You are a helpful assistant."}
-        ]
-    },
-    {
-        "role": "user", "content": [
-            {"type": "image", "url": url_cow},
-            {"type": "image", "url": url_stop},
-            {"type": "text", "text": "Are these two images identical?"},
-        ]
-    },
-]
-inputs = processor.apply_chat_template(
-    messages,
-    tokenize=True,
-    return_dict=True,
-    return_tensors="pt",
-    add_generation_prompt=True,
-).to(model.device)
-
-output = model.generate(**inputs, max_new_tokens=50)
-print(processor.decode(output[0], skip_special_tokens=True)[inputs.input_ids.shape[1]: ])
-
-```
-
-### Text-only inference
-
-You can use the VLMs for text-only generation by omitting images in your input. However, you can also load the models in text-only mode as shown below. This will skip loading the vision tower and will save resources when you just need the LLM capabilities.
-```python
-from transformers import AutoTokenizer, Gemma3ForCausalLM
-
-model_id = "google/gemma-3-1b-it"
-
-tokenizer = AutoTokenizer.from_pretrained(model_id)
-model = Gemma3ForCausalLM.from_pretrained(model_id, device_map="auto")
-
-input_ids = tokenizer("Write me a poem about Machine Learning.", return_tensors="pt").to(model.device)
-
-outputs = model.generate(**input_ids, max_new_tokens=100)
-text = tokenizer.batch_decode(outputs, skip_special_tokens=True)
+Use the [`~transformers.utils.AttentionMaskVisualizer`] to better understand what tokens the model can and cannot attend to.

-print(text)
+```py
+from transformers.utils.attention_visualizer import AttentionMaskVisualizer

+visualizer = AttentionMaskVisualizer("google/gemma-3-4b-it")
+visualizer("<img>What is shown in this image?")
 ```

+## Notes
+
+- Use [`Gemma3ForConditionalGeneration`] for image-and-text and image-only inputs.
+- Gemma 3 supports multiple input images, but make sure the images are correctly batched before passing them to the processor. Each batch should be a list of one or more images.
+
+    ```py
+    url_cow = "https://media.istockphoto.com/id/1192867753/photo/cow-in-berchida-beach-siniscola.jpg?s=612x612&w=0&k=20&c=v0hjjniwsMNfJSuKWZuIn8pssmD5h5bSN1peBd1CmH4="
+    url_cat = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
+
+    messages = [
+        {
+            "role": "system",
+            "content": [
+                {"type": "text", "text": "You are a helpful assistant."}
+            ]
+        },
+        {
+            "role": "user",
+            "content": [
+                {"type": "image", "url": url_cow},
+                {"type": "image", "url": url_cat},
+                {"type": "text", "text": "Which image is cuter?"},
+            ]
+        },
+    ]
+    ```
+- Text passed to the processor should have a `<start_of_image>` token wherever an image should be inserted.
+- The processor has its own [`~ProcessorMixin.apply_chat_template`] method to convert chat messages to model inputs.
+- By default, the images aren't cropped and only the base image is forwarded to the model. In high resolution images or images with non-square aspect ratios, artifacts can result because the vision encoder uses a fixed resolution of 896x896. To prevent these artifacts and improve performance during inference, set `do_pan_and_scan=True` to crop the image into multiple smaller patches and concatenate them with the base image embedding. You can disable pan and scan for faster inference.
+
+    ```diff
+    inputs = processor.apply_chat_template(
+        messages,
+        tokenize=True,
+        return_dict=True,
+        return_tensors="pt",
+        add_generation_prompt=True,
+    +   do_pan_and_scan=True,
+    ).to("cuda")
+    ```
+- For text-only inputs, use [`AutoModelForCausalLM`] instead to skip loading the vision components and save resources.
+
+    ```py
+    import torch
+    from transformers import AutoModelForCausalLM, AutoTokenizer
+
+    tokenizer = AutoTokenizer.from_pretrained(
+        "google/gemma-3-1b-pt",
+    )
+    model = AutoModelForCausalLM.from_pretrained(
+        "google/gemma-3-1b-pt",
+        torch_dtype=torch.bfloat16,
+        device_map="auto",
+        attn_implementation="sdpa"
+    )
+    input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to("cuda")
+
+    output = model.generate(**input_ids, cache_implementation="static")
+    print(tokenizer.decode(output[0], skip_special_tokens=True))
+    ```

 ## Gemma3ImageProcessor

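
The batching note added in the hunk above stops at building `messages`; a minimal sketch of the corresponding batched processor call is shown here. The nested-list batching, the `padding` argument, and the prompt strings are assumptions based on that note, not content from the diff:

```py
import requests
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("google/gemma-3-4b-it")

url_cow = "https://media.istockphoto.com/id/1192867753/photo/cow-in-berchida-beach-siniscola.jpg?s=612x612&w=0&k=20&c=v0hjjniwsMNfJSuKWZuIn8pssmD5h5bSN1peBd1CmH4="
url_cat = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
img_cow = Image.open(requests.get(url_cow, stream=True).raw)
img_cat = Image.open(requests.get(url_cat, stream=True).raw)

# Two samples in one batch; each sample gets its own list of images, and its
# text contains one <start_of_image> token per image.
texts = [
    "<start_of_image><start_of_image> Which image is cuter?",
    "<start_of_image> What is shown in this image?",
]
images = [[img_cow, img_cat], [img_cat]]

inputs = processor(images=images, text=texts, padding=True, return_tensors="pt")
print(inputs["input_ids"].shape, inputs["pixel_values"].shape)
```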

docs/source/en/model_doc/llama.md (+2 -2)

@@ -33,7 +33,7 @@ You can find all the original Llama checkpoints under the [Huggy Llama](https://
 > [!TIP]
 > Click on the Llama models in the right sidebar for more examples of how to apply Llama to different language tasks.

-The example below demonstrates how to generate text with [`Pipeline`], [`AutoModel`], and from the command line.
+The example below demonstrates how to generate text with [`Pipeline`] or the [`AutoModel`] class, and from the command line.

 <hfoptions id="usage">
 <hfoption id="Pipeline">
@@ -107,7 +107,7 @@ output = model.generate(**input_ids, cache_implementation="static")
 print(tokenizer.decode(output[0], skip_special_tokens=True))
 ```

-Use the `visualize_attention_mask` method to better understand what tokens the model can and cannot attend to.
+Use the [`~transformers.utils.AttentionMaskVisualizer`] utility to better understand what tokens the model can and cannot attend to.

 ```py
 from transformers.utils.attention_visualizer import AttentionMaskVisualizer
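
The hunk above only shows the import line of the existing visualizer example; based on the usage the gemma3.md diff adds, a minimal sketch of the same utility with a Llama checkpoint looks like this (the checkpoint id is an assumption, any Llama checkpoint should work):

```py
from transformers.utils.attention_visualizer import AttentionMaskVisualizer

# huggyllama/llama-7b matches the Huggy Llama collection the doc links to.
visualizer = AttentionMaskVisualizer("huggyllama/llama-7b")
visualizer("Plants create energy through a process known as")
```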

docs/source/en/model_doc/llama2.md (+2 -2)

@@ -28,7 +28,7 @@ rendered properly in your Markdown viewer.

 Llama 2-Chat is trained with supervised fine-tuning (SFT), and reinforcement learning with human feedback (RLHF) - rejection sampling and proximal policy optimization (PPO) - is applied to the fine-tuned model to align the chat model with human preferences.

-You can find all the original Llama 2 checkpoints under the [Llama 2 Family collection](https://huggingface.co/collections/meta-llama/llama-2-family-661da1f90a9d678b6f55773b).
+You can find all the original Llama 2 checkpoints under the [Llama 2 Family](https://huggingface.co/collections/meta-llama/llama-2-family-661da1f90a9d678b6f55773b) collection.

 > [!TIP]
 > Click on the Llama 2 models in the right sidebar for more examples of how to apply Llama to different language tasks.
@@ -107,7 +107,7 @@ output = model.generate(**input_ids, cache_implementation="static")
 print(tokenizer.decode(output[0], skip_special_tokens=True))
 ```

-Use the `visualize_attention_mask` method to better understand what tokens the model can and cannot attend to.
+Use the [`~transformers.utils.AttentionMaskVisualizer`] to better understand what tokens the model can and cannot attend to.

 ```py
 from transformers.utils.attention_visualizer import AttentionMaskVisualizer
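
For the chat checkpoints in the Llama 2 Family collection linked above, a minimal sketch of generating a reply with the text-generation pipeline might look like this (the checkpoint id and chat-style input are assumptions; the Meta Llama 2 checkpoints are gated and require accepting the license):

```py
import torch
from transformers import pipeline

chat = pipeline(
    task="text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
messages = [{"role": "user", "content": "Explain RLHF in one sentence."}]
# Chat-formatted input; the pipeline applies the model's chat template.
print(chat(messages, max_new_tokens=64)[0]["generated_text"])
```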
