[Bugfix] Revert custom attention mask for gemma3-mm #28995
Conversation
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Code Review
This pull request partially reverts changes related to custom attention mask generation for Gemma3 multimodal models, which is intended to fix a failing test. The changes involve removing the uses_custom_attention_masks logic from ModelConfig, GpuModelRunner, and transformers_utils.config. The custom mask generation methods generate_attention_masks and prepare_attn_masks are also removed from gemma3_mm.py. Additionally, the get_multimodal_embeddings method in gemma3_mm.py has been renamed to embed_multimodal to align with the current interface, which is a good refactoring. The changes are clean, consistent, and effectively address the stated purpose of the PR. I find no issues with this change.
💡 Codex Review
vllm/vllm/model_executor/models/gemma3.py
Lines 203 to 240 in cf34776
def forward(
    self,
    positions: torch.Tensor,
    hidden_states: torch.Tensor,
    **kwargs,
) -> torch.Tensor:
    qkv, _ = self.qkv_proj(hidden_states)
    q, k, v = qkv.split([self.q_size, self.kv_size, self.kv_size], dim=-1)
    q = q.unflatten(-1, (self.num_heads, self.head_dim))
    q = self.q_norm(q)
    q = q.flatten(-2, -1)
    k = k.unflatten(-1, (self.num_kv_heads, self.head_dim))
    k = self.k_norm(k)
    k = k.flatten(-2, -1)
    q, k = self.rotary_emb(positions, q, k)
    attn_output = self.attn(q, k, v)
    if not kwargs.get("has_images", False):
        # Fast path for text-only inputs. The performance for text-only
        # inputs is not affected by the naive attention below.
        output, _ = self.o_proj(attn_output)
        return output
    # NOTE(woosuk): Gemma3 uses bidirectional attention between image tokens
    # that correspond to the same image while using causal attention
    # otherwise. Current attention backends cannot handle this pattern, so
    # we temporarily use a naive attention implementation with mask tensors.
    # We intentionally keep the attention backend as-is and only override
    # `attn_output` with the naive implementation's output. This minimizes
    # changes to existing model runners and attention backends. The call to
    # `self.attn(q, k, v)` is only used to populate the KV cache - its
    # output is discarded and overwritten below. While this duplicates
    # computation, it maintains compatibility.
    # TODO(woosuk): Optimize by implementing custom attention kernels.
    attn_output = self.naive_attn_with_masks(q, k, v, out=attn_output, **kwargs)
Gemma3’s attention module only enables the bidirectional mask for image tokens when has_images is passed in kwargs, and it expects accompanying seq_lens and mask tensors (global_attn_masks/local_attn_masks). After this change the GPU model runner no longer sets has_images or builds those masks when preparing multimodal batches, so the check in Gemma3Attention.forward is always false and the code falls back to the standard causal attention. That causes all multimodal Gemma3 requests to run with purely causal masks, preventing image patches from attending to each other as the model definition requires. Any inference that includes images will therefore produce incorrect attention patterns.
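To make the described mask pattern concrete, here is a minimal, self-contained sketch (not vLLM code) of how a causal mask can be combined with bidirectional blocks for tokens of the same image. The helper name `build_gemma3_style_mask` and the `token_types` encoding are assumptions for illustration only:

```python
import torch


def build_gemma3_style_mask(token_types: torch.Tensor) -> torch.Tensor:
    """Boolean [seq, seq] mask: causal everywhere, but bidirectional among
    tokens that belong to the same image.

    token_types: 1-D tensor; 0 marks text tokens, and tokens of the same
    image share the same positive id. (Hypothetical helper for illustration.)
    """
    seq_len = token_types.numel()
    # Standard causal mask: query i may attend to keys j <= i.
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Same-image mask: True where query and key carry the same positive image id.
    same_image = (token_types.unsqueeze(0) == token_types.unsqueeze(1)) & (
        token_types > 0
    ).unsqueeze(0)
    # A key is attendable if it is causal OR inside the same image block.
    return causal | same_image


# Example: two text tokens, a 3-token image, then one more text token.
mask = build_gemma3_style_mask(torch.tensor([0, 0, 1, 1, 1, 0]))
print(mask.int())
```

In vLLM this pattern is plumbed through the per-request mask tensors the review mentions (global_attn_masks/local_attn_masks) rather than a single helper like this, but the allowed-attention structure is the same.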
hey @Isotr0py, this breaks the mm support for Gemma3 GGUF. Also, the test96 failure is related to different tokens between the Transformers and vLLM V1 engines in a very specific non-GGUF pan-and-scan scenario; test95 uses pan-and-scan too and works fine. Are you sure this is the best way out? As an alternative, couldn't we suspend test96 (it largely duplicates test95) while we investigate why the generation differs? I volunteer for that investigation.
Not really. I verified that the GGUF e2e mm test still passes when submitting this PR. Therefore, given that the incorrectly implemented custom mask has broken existing CI, we should revert it to make CI pass again rather than hide the failure by suspending the test.
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> Signed-off-by: LuminolT <lumischen01@gmail.com>
Restores custom attention mask generation for Gemma3 GGUF multimodal models that was partially reverted in vllm-project#28995. Implements robust GGUF-only guards to ensure the feature only applies to GGUF models and does not affect HF models.

Changes:
- Add uses_custom_attention_masks() utility with GGUF file format check
- Add uses_custom_attention_masks property to ModelConfig
- Initialize uses_custom_attention_masks in GPUModelRunner
- Restore generate_attention_masks() method to Gemma3ForConditionalGeneration
- Implement 3-layer defense-in-depth guard mechanism

The implementation uses check_gguf_file() to guarantee that the custom attention mask logic only triggers for GGUF files, preventing the issue that caused the original revert, where HF models incorrectly triggered the custom logic.

Tested with GGUF models (1B, 4B, 270M) for both text-only and multimodal inference. HF model compatibility verified via the pytest multimodal test suite.

Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
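For context on the follow-up's GGUF-only gating, here is a hedged sketch of how such a guard could look. The names `_CUSTOM_MASK_ARCHS` and `_is_gguf_file` and the exact signature are assumptions for illustration, not the actual uses_custom_attention_masks()/check_gguf_file() utilities referenced in the commit message:

```python
from pathlib import Path

# Illustrative sketch only; names and signatures are assumptions, not vLLM's
# actual implementation.
_CUSTOM_MASK_ARCHS = {"Gemma3ForConditionalGeneration"}


def _is_gguf_file(model: str) -> bool:
    """Return True only for a local .gguf file that starts with the GGUF magic bytes."""
    path = Path(model)
    if not (path.is_file() and path.suffix == ".gguf"):
        return False
    with path.open("rb") as f:
        return f.read(4) == b"GGUF"


def uses_custom_attention_masks(model: str, architectures: list[str]) -> bool:
    """Gate the custom-mask path on both file format and architecture, so that
    HF checkpoints never take the naive masked-attention branch."""
    return _is_gguf_file(model) and bool(_CUSTOM_MASK_ARCHS & set(architectures))
```

The model runner would presumably consult this flag once at initialization and only build per-request image masks when it is set, which is what keeps HF models on the unchanged fast path.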
Purpose
Test Plan
Test Result
The failing gemma3 test should pass now.
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.