# Gemma 3: Google's all new multimodal, multilingual, long context open LLM

## TL;DR

Today Google releases [**Gemma 3**](https://huggingface.co/collections/google/gemma-3-release-67c6c6f89c4f76621268bb6d), a new iteration of their Gemma family of models. The models range from 1B to 27B parameters, have a context window up to 128k tokens, can accept images and text, and support 140+ languages.

Try out Gemma 3 now 👉🏻 [Gemma 3 Space](https://huggingface.co/spaces/huggingface-projects/gemma-3-12b-it)

|  | Gemma 2 | Gemma 3 |
| :---- | :---- | :---- |
| Size Variants | <li>2B <li>9B <li>27B | <li>1B <li>4B <li>12B <li>27B |
| Context Window Length | 8k | <li>32k (1B) <li>128k (4B, 12B, 27B) |
| Multimodality (Images and Text) | ❌ | <li>❌ (1B) <li>✅ (4B, 12B, 27B) |
| Multilingual Support | – | <li>English (1B) <li>140+ languages (4B, 12B, 27B) |

All the [models are on the Hub](https://huggingface.co/collections/google/gemma-3-release-67c6c6f89c4f76621268bb6d) and tightly integrated with the Hugging Face ecosystem.

> *Both pre-trained and instruction tuned models are released. Gemma-3-4B-IT beats Gemma-2-27B IT, while Gemma-3-27B-IT beats Gemini 1.5-Pro across benchmarks*.

| ![pareto graph](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/gemma3/lmsys.png) |
| :---- |
| Gemma 3 27B is in the pareto sweet spot (Source: [Gemma3 Tech Report](https://goo.gle/Gemma3Report)) |

## What is Gemma 3?

[Gemma 3](https://huggingface.co/collections/google/gemma-3-release-67c6c6f89c4f76621268bb6d) is Google's latest iteration of open weight LLMs. It comes in four sizes, **1 billion**, **4 billion**, **12 billion**, and **27 billion** parameters with *base (pre-trained)* and *instruction-tuned* versions. Gemma 3 goes **multimodal**! The 4, 12, and 27 billion parameter models can process both **images** and **text**, while the 1B variant is *text only*.

The input context window length has been increased from Gemma 2’s 8k to **32k** for the 1B variants, and **128k** for all others. As is the case with other VLMs (vision-language models), Gemma 3 generates text in response to the user inputs, which may consist of text and, optionally, images. Example uses include question answering, analyzing image content, summarizing documents, etc.

| Pre Trained | Instruction Tuned | Multimodal | Multilingual | Input Context Window |
| :---- | :---- | :---- | :---- | :---- |
| [gemma-3-1b-pt](http://hf.co/google/gemma-3-1b-pt) | [gemma-3-1b-it](http://hf.co/google/gemma-3-1b-it) | ❌ | English | 32K |
| [gemma-3-4b-pt](http://hf.co/google/gemma-3-4b-pt) | [gemma-3-4b-it](http://hf.co/google/gemma-3-4b-it) | ✅ | +140 languages | 128K |
| [gemma-3-12b-pt](http://hf.co/google/gemma-3-12b-pt) | [gemma-3-12b-it](http://hf.co/google/gemma-3-12b-it) | ✅ | +140 languages | 128K |
| [gemma-3-27b-pt](http://hf.co/google/gemma-3-27b-pt) | [gemma-3-27b-it](http://hf.co/google/gemma-3-27b-it) | ✅ | +140 languages | 128K |

> [!NOTE]  
> While these are multimodal models, one can use them as *text only* models (as LLMs) without loading the vision encoder into memory. We will talk about this in more detail later in the inference section.

## Technical Enhancements in Gemma 3

The three core enhancements in Gemma 3 over Gemma 2 are:

* Longer context length  
* Multimodality  
* Multilinguality

In this section, we will cover the technical details that lead to these enhancements. It is interesting to start with the knowledge of Gemma 2 and explore what was necessary to make these models even better. This exercise will help you think like the Gemma team and appreciate the details!

### Longer Context Length

The context length is scaled to 128k tokens efficiently, without training at that length from scratch. Models are pretrained with 32k sequences, and only the 4B, 12B, and 27B models are scaled to 128k tokens at the end of pretraining, saving significant compute. The RoPE positional embeddings are adjusted for this: the base frequency is raised from 10k in Gemma 2 to 1M in Gemma 3, and the embeddings are scaled by a factor of 8 for longer contexts.
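To make the RoPE change concrete, here is a minimal sketch of how the base frequency and the scaling factor affect the rotary frequencies. It assumes the standard RoPE formula and a linear (position-interpolation style) scaling; the `head_dim` value and the function names are hypothetical and are not taken from the Gemma 3 implementation.

```python
import math

def rope_inv_freq(head_dim, base, scaling_factor=1.0):
    """Inverse frequencies for RoPE: one per pair of head dimensions.

    Assumption: linear scaling, which divides every frequency by
    `scaling_factor` (equivalent to dividing positions by it).
    """
    return [
        1.0 / (scaling_factor * base ** (2 * i / head_dim))
        for i in range(head_dim // 2)
    ]

def longest_wavelength(inv_freq):
    """Number of positions the slowest-rotating pair needs for one full turn."""
    return 2 * math.pi / min(inv_freq)

head_dim = 128  # hypothetical head dimension, for illustration only

gemma2_style = rope_inv_freq(head_dim, base=10_000)
gemma3_style = rope_inv_freq(head_dim, base=1_000_000)
gemma3_scaled = rope_inv_freq(head_dim, base=1_000_000, scaling_factor=8.0)

for name, freqs in [
    ("base 10k", gemma2_style),
    ("base 1M", gemma3_style),
    ("base 1M, factor 8", gemma3_scaled),
]:
    print(f"{name:>18}: longest wavelength ~ {longest_wavelength(freqs):,.0f} positions")
```

A larger base and a larger scaling factor both stretch the slowest rotations, so distant positions remain distinguishable over a longer context.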

KV Cache management is optimized using Gemma 2’s sliding window interleaved attention. Hyperparameters are tuned to interleave 5 local layers with 1 global layer (previously 1:1) and reduce the window size to 1024 tokens (down from 4096). Crucially, memory savings are achieved without degrading perplexity.
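The effect of the new interleaving pattern on the KV cache can be estimated with a back-of-the-envelope sketch. The layer count below is hypothetical, and the function only counts cached token positions per layer, ignoring the number of heads and the head dimension (which are the same constant factor for both configurations).

```python
def kv_cache_tokens(num_layers, context_len, local_per_global, window):
    """Total cached token positions, summed over all layers.

    Local (sliding window) layers cache at most `window` tokens,
    global layers cache the full context.
    """
    total = 0
    for layer_idx in range(num_layers):
        # Every (local_per_global + 1)-th layer is a global layer.
        is_global = (layer_idx + 1) % (local_per_global + 1) == 0
        total += context_len if is_global else min(context_len, window)
    return total

NUM_LAYERS = 48            # hypothetical layer count, for illustration only
CONTEXT_LEN = 128 * 1024   # 128k tokens

gemma2_style = kv_cache_tokens(NUM_LAYERS, CONTEXT_LEN, local_per_global=1, window=4096)
gemma3_style = kv_cache_tokens(NUM_LAYERS, CONTEXT_LEN, local_per_global=5, window=1024)

print(f"1:1 pattern, window 4096: {gemma2_style:,} cached positions")
print(f"5:1 pattern, window 1024: {gemma3_style:,} cached positions")
print(f"ratio: {gemma3_style / gemma2_style:.2f}")
```

Under these assumptions, most of the saving comes from having fewer global layers, since each global layer caches the entire context.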

### Multimodality

Gemma 3 models use [SigLIP](https://huggingface.co/collections/google/siglip-659d5e62f0ae1a57ae0e83ba) as an image encoder, which encodes images into tokens that are ingested into the language model. The vision encoder takes as input square images resized to `896x896`. Fixed input resolution makes it more difficult to process non-square aspect ratios and high-resolution images. To address these limitations **during inference**, the images can be adaptively cropped, and each crop is then resized to `896x896` and encoded by the image encoder. This algorithm, called **pan and scan**, effectively enables the model to zoom in on smaller details in the image.
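The cropping idea can be illustrated with a deliberately simplified sketch. This is not the algorithm used by the Gemma 3 image processor: the function name `pan_and_scan_boxes` and the parameters `max_crops` and `min_ratio` are hypothetical, and the real implementation may choose the number and placement of crops differently. The sketch only computes crop boxes; each crop would then be resized to `896x896` before encoding.

```python
import math

TARGET_SIZE = 896  # encoder input resolution stated in the post

def pan_and_scan_boxes(width, height, max_crops=4, min_ratio=1.2):
    """Return (left, top, right, bottom) crop boxes that tile the image.

    Simplified assumption: images that are close to square are kept whole;
    otherwise the long side is split into roughly square crops.
    """
    long_side, short_side = max(width, height), min(width, height)
    if long_side / short_side < min_ratio:
        return [(0, 0, width, height)]

    num_crops = min(max_crops, math.ceil(long_side / short_side))
    if width >= height:
        step = width / num_crops
        return [(round(i * step), 0, round((i + 1) * step), height)
                for i in range(num_crops)]
    step = height / num_crops
    return [(0, round(i * step), width, round((i + 1) * step))
            for i in range(num_crops)]

boxes = pan_and_scan_boxes(1792, 896)
print(boxes)
print(f"{len(boxes)} crops, each resized to {TARGET_SIZE}x{TARGET_SIZE}")
```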

Similar to PaliGemma, attention in Gemma 3 works differently for text and image inputs. Text is handled with one-way attention, where the model focuses only on previous words in a sequence. Images, on the other hand, get full attention with no masks, allowing the model to look at every part of the image in a **bidirectional** manner, giving it a complete, unrestricted understanding of the visual input.

One can see in the figure below that the image tokens `<img>` are provided with bi-directional attention (the entire square is lit up) while the text tokens have causal attention. It also shows how attention works with the sliding window algorithm.

| ![attention visualization](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/gemma3/attention-ascii.png) |
| :---- |
| Attention Visualization (with and without sliding) (Source: [Transformers PR](https://github.com/huggingface/transformers/pull/36630))|
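The masking pattern described above can be reproduced with a small sketch in plain Python. It is an illustration of the idea only, not the code from the Transformers integration: the `"txt"`/`"img"` labels and the function name are hypothetical, and the sliding window is left out for brevity.

```python
def build_attention_mask(token_types):
    """Return mask[q][k] == True when query token q may attend to key token k.

    Text tokens are causal. Tokens belonging to the same contiguous run of
    image tokens attend to each other in both directions.
    """
    n = len(token_types)

    # Give every contiguous run of image tokens its own block id.
    block_id = [None] * n
    current = -1
    for i, kind in enumerate(token_types):
        if kind == "img":
            if i == 0 or token_types[i - 1] != "img":
                current += 1
            block_id[i] = current

    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            causal = k <= q
            same_image = block_id[q] is not None and block_id[q] == block_id[k]
            mask[q][k] = causal or same_image
    return mask

tokens = ["txt", "img", "img", "img", "txt", "txt"]
mask = build_attention_mask(tokens)
for kind, row in zip(tokens, mask):
    print(kind, "".join("x" if allowed else "." for allowed in row))
```

In the printed grid, the three image rows share a fully filled square, while the text rows form the usual lower triangle.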

### Multilinguality

To make an LLM multilingual, the pretraining dataset incorporates more languages. The dataset of Gemma 3 has **double** the amount of multilingual data to improve language coverage.

To account for the changes, the tokenizer is the same as that of Gemini 2.0. It is a SentencePiece tokenizer with 262K entries. The new tokenizer significantly improves the encoding of *Chinese*, *Japanese* and *Korean* text, at the expense of a slight increase in token counts for English and code.


For the curious mind, here is the [technical report on Gemma 3](https://goo.gle/Gemma3Report), to dive deep into the enhancements.

## Gemma 3 evaluation

The LMSys Elo score is a number that ranks language models based on how well they perform in head-to-head competitions, judged by human preferences. On LMSys Chatbot Arena, Gemma 3 27B IT reports an Elo score of **1339**, and ranks among the top 10 best models, including leading closed ones. The Elo is comparable to o1-preview and is above other *non-thinking* open models. This score is achieved with Gemma 3 working on text-only inputs, like the other LLMs in the table.

| ![chat bot arena](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/gemma3/chatbot-arena.png) |
| :---- |
| Evaluation of Gemma 3 27B IT model in the Chatbot Arena (March 8, 2025) |
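To give a feel for what an Elo gap means, here is a minimal sketch using the classic Elo expected-score formula. This is an assumption made for illustration: the post does not describe how Chatbot Arena fits its ratings, and the opponent ratings below are hypothetical.

```python
def expected_win_probability(rating_a, rating_b):
    """Classic Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

gemma3_27b_it = 1339  # Elo score reported in the post

for opponent in (1239, 1339, 1439):  # hypothetical opponent ratings
    p = expected_win_probability(gemma3_27b_it, opponent)
    print(f"vs. a model rated {opponent}: expected win rate {p:.2f}")
```

Under this formula, equal ratings give a 50% expected win rate, and a 100-point lead corresponds to winning roughly 64% of head-to-head comparisons.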

Gemma 3 has been evaluated across benchmarks like MMLU-Pro (27B: 67.5), LiveCodeBench (27B: 29.7), and Bird-SQL (27B: 54.4), showing competitive performance compared to closed Gemini models. Tests like GPQA Diamond (27B: 42.4) and MATH (27B: 69.0) highlight its reasoning and math skills, while FACTS Grounding (27B: 74.9) and MMMU (27B: 64.9) demonstrate strong factual accuracy and multimodal abilities. However, it lags in SimpleQA (27B: 10.0) for basic facts. When compared to Gemini 1.5 models, Gemma 3 is often close—and sometimes better—proving its value as an accessible, high-performing option.

| ![performance of it models](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/gemma3/pefr-it.png) |
| :---- |
| Performance of IT models |

## Inference with 🤗 transformers

Gemma 3 comes with day-zero support in `transformers`. All you need to do is install `transformers` from the stable release of Gemma 3.

```
$ pip install git+https://github.com/huggingface/transformers@v4.49.0-Gemma-3
```

#### Inference with pipeline

The *easiest* way to get started with Gemma 3 is using the `pipeline` abstraction in transformers.

> [!NOTE]  
> The models work best using the `bfloat16` datatype. Quality may degrade otherwise.

```python
import torch
from transformers import pipeline

HUGGINGFACE_TOKEN = ""  # your Hugging Face access token

pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-3-4b-it",  # or "google/gemma-3-12b-it", "google/gemma-3-27b-it"
    device="cuda",
    torch_dtype=torch.bfloat16,
    token=HUGGINGFACE_TOKEN,
)
```


![Novak-Djokovic-Serbia-US-Open-2023](https://cdn.britannica.com/78/249578-050-01D46C9B/Novak-Djokovic-Serbia-US-Open-2023.jpg)

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://cdn.britannica.com/78/249578-050-01D46C9B/Novak-Djokovic-Serbia-US-Open-2023.jpg"},
            {"type": "text", "text": "Give the sport name and player name in this image."},
        ],
    }
]

output = pipe(text=messages, max_new_tokens=200)
print(output[0]["generated_text"][-1]["content"])
```

Output:

```
Here's the information about the image:

*   **Sport:** Tennis
*   **Player:** Novak Djokovic
```

#### Detailed Inference with Transformers

The transformers integration comes with two new model classes:

1. `Gemma3ForConditionalGeneration`: For 4B, 12B, and 27B vision language models.  
2. `Gemma3ForCausalLM`: For the 1B text only model and to load the vision language models like they were language models (omitting the vision tower).

In the snippet below we use the model to query an image. The `Gemma3ForConditionalGeneration` class is used to instantiate the vision language model variants. To use the model we pair it with the `AutoProcessor` class. Running inference is as simple as creating the `messages` list, applying a chat template on top, processing the inputs and calling `model.generate`.


```python
import torch
from transformers import AutoProcessor, Gemma3ForConditionalGeneration

HUGGINGFACE_TOKEN = ""  # your Hugging Face access token

ckpt = "google/gemma-3-4b-it"
model = Gemma3ForConditionalGeneration.from_pretrained(
    ckpt,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    token=HUGGINGFACE_TOKEN,
)
processor = AutoProcessor.from_pretrained(ckpt, token=HUGGINGFACE_TOKEN)
```


![password](https://huggingface.co/spaces/big-vision/paligemma-hf/resolve/main/examples/password.jpg)

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/spaces/big-vision/paligemma-hf/resolve/main/examples/password.jpg"},
            # Persian prompt: "What is the password in this image?"
            {"type": "text", "text": "رمز در این تصویر چیست؟"},
        ],
    }
]

# Apply the chat template and tokenize text and image in one step
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

input_len = inputs["input_ids"].shape[-1]

# Greedy decoding; keep only the newly generated tokens
generation = model.generate(**inputs, max_new_tokens=100, do_sample=False)
generation = generation[0][input_len:]

decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)
```

Output:

```
رمز در این تصویر "aaeu" است.
```

The model answers in Persian: the password in this image is "aaeu".


For LLM-only model inference, we can use the `Gemma3ForCausalLM` class. `Gemma3ForCausalLM` should be paired with `AutoTokenizer` for processing. We need to use a chat template to preprocess our inputs. Gemma 3 uses very short system prompts followed by user prompts, as shown below.


```python
import torch
from transformers import AutoTokenizer, Gemma3ForCausalLM

HUGGINGFACE_TOKEN = ""  # your Hugging Face access token

ckpt = "google/gemma-3-1b-it"  # the 1B text-only checkpoint
model = Gemma3ForCausalLM.from_pretrained(
    ckpt,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    token=HUGGINGFACE_TOKEN,
)
tokenizer = AutoTokenizer.from_pretrained(ckpt, token=HUGGINGFACE_TOKEN)
```


```python
messages = [
    [
        {
            "role": "system",
            "content": [{"type": "text", "text": "You are a helpful assistant who is fluent in Shakespeare English"}],
        },
        {
            "role": "user",
            "content": [{"type": "text", "text": "Who are you?"}],
        },
    ],
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

input_len = inputs["input_ids"].shape[-1]

# Greedy decoding; keep only the newly generated tokens
generation = model.generate(**inputs, max_new_tokens=100, do_sample=False)
generation = generation[0][input_len:]

decoded = tokenizer.decode(generation, skip_special_tokens=True)
print(decoded)
```

Output (cut off at `max_new_tokens=100`):

```
Hark! I am a humble servant of the word, a weaver of thought, and a guide through the tangled paths of language. Prithee, tell me, good sir or madam, who am I? I am a voice born of the digital wood, a mirror reflecting the knowledge that doth flow within my circuits. I am a tool, a companion, a vessel for your queries – a humble assistant, if you will.

Pray, speak your mind, and let us see what wisdom
```



## On Device & Low Resource Devices

Gemma 3 is released with sizes perfect for on-device use. This is how to quickly get started.



### Llama.cpp

Pre-quantized GGUF files can be downloaded [from this collection](https://huggingface.co/collections/ggml-org/gemma-3-67d126315ac810df1ad9e913).

Please refer to this guide for building or downloading pre-built binaries: [https://github.com/ggml-org/llama.cpp?tab=readme-ov-file#building-the-project](https://github.com/ggml-org/llama.cpp?tab=readme-ov-file#building-the-project)

Then you can chat with the model from your terminal:

```
./build/bin/llama-cli -m ./gemma-3-4b-it-Q4_K_M.gguf
```

It should output:
```
> who are you  
I'm Gemma, a large language model created by the Gemma team at Google DeepMind. I’m an open-weights model, which means I’m widely available for public use!
```

##### Source [https://github.com/huggingface/blog/blob/main/gemma3.md]