[Bugfix] Revert custom attention mask for gemma3-mm #28995
Conversation
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Code Review
This pull request partially reverts changes related to custom attention mask generation for Gemma3 multimodal models, which is intended to fix a failing test. The changes involve removing the uses_custom_attention_masks logic from ModelConfig, GpuModelRunner, and transformers_utils.config. The custom mask generation methods generate_attention_masks and prepare_attn_masks are also removed from gemma3_mm.py. Additionally, the get_multimodal_embeddings method in gemma3_mm.py has been renamed to embed_multimodal to align with the current interface, which is a good refactoring. The changes are clean, consistent, and effectively address the stated purpose of the PR. I find no issues with this change.
💡 Codex Review
vllm/vllm/model_executor/models/gemma3.py
Lines 203 to 240 in cf34776
def forward(
    self,
    positions: torch.Tensor,
    hidden_states: torch.Tensor,
    **kwargs,
) -> torch.Tensor:
    qkv, _ = self.qkv_proj(hidden_states)
    q, k, v = qkv.split([self.q_size, self.kv_size, self.kv_size], dim=-1)
    q = q.unflatten(-1, (self.num_heads, self.head_dim))
    q = self.q_norm(q)
    q = q.flatten(-2, -1)
    k = k.unflatten(-1, (self.num_kv_heads, self.head_dim))
    k = self.k_norm(k)
    k = k.flatten(-2, -1)
    q, k = self.rotary_emb(positions, q, k)
    attn_output = self.attn(q, k, v)
    if not kwargs.get("has_images", False):
        # Fast path for text-only inputs. The performance for text-only
        # inputs is not affected by the naive attention below.
        output, _ = self.o_proj(attn_output)
        return output
    # NOTE(woosuk): Gemma3 uses bidirectional attention between image tokens
    # that correspond to the same image while using causal attention
    # otherwise. Current attention backends cannot handle this pattern, so
    # we temporarily use a naive attention implementation with mask tensors.
    # We intentionally keep the attention backend as-is and only override
    # `attn_output` with the naive implementation's output. This minimizes
    # changes to existing model runners and attention backends. The call to
    # `self.attn(q, k, v)` is only used to populate the KV cache - its
    # output is discarded and overwritten below. While this duplicates
    # computation, it maintains compatibility.
    # TODO(woosuk): Optimize by implementing custom attention kernels.
    attn_output = self.naive_attn_with_masks(q, k, v, out=attn_output, **kwargs)
Gemma3’s attention module only enables the bidirectional mask for image tokens when has_images is passed in kwargs, and it expects accompanying seq_lens and mask tensors (global_attn_masks/local_attn_masks). After this change the GPU model runner no longer sets has_images or builds those masks when preparing multimodal batches, so the check in Gemma3Attention.forward is always false and the code falls back to the standard causal attention. That causes all multimodal Gemma3 requests to run with purely causal masks, preventing image patches from attending to each other as the model definition requires. Any inference that includes images will therefore produce incorrect attention patterns.
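To make the described mask pattern concrete, here is a minimal, self-contained sketch (not vLLM code) of how a causal mask can be combined with bidirectional blocks for tokens of the same image. The helper name `build_gemma3_style_mask` and the `token_types` encoding are assumptions for illustration only:

```python
import torch


def build_gemma3_style_mask(token_types: torch.Tensor) -> torch.Tensor:
    """Boolean [seq, seq] mask: causal everywhere, but bidirectional among
    tokens that belong to the same image.

    token_types: 1-D tensor; 0 marks text tokens, and tokens of the same
    image share the same positive id. (Hypothetical helper for illustration.)
    """
    seq_len = token_types.numel()
    # Standard causal mask: query i may attend to keys j <= i.
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Same-image mask: True where query and key carry the same positive image id.
    same_image = (token_types.unsqueeze(0) == token_types.unsqueeze(1)) & (
        token_types > 0
    ).unsqueeze(0)
    # A key is attendable if it is causal OR inside the same image block.
    return causal | same_image


# Example: two text tokens, a 3-token image, then one more text token.
mask = build_gemma3_style_mask(torch.tensor([0, 0, 1, 1, 1, 0]))
print(mask.int())
```

In vLLM this pattern is plumbed through the per-request mask tensors the review mentions (global_attn_masks/local_attn_masks) rather than a single helper like this, but the allowed-attention structure is the same.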
hey @Isotr0py, this breaks the mm support for Gemma3 GGUF. Also, the test96 failure is related to different tokens between the Transformers and vLLM V1 engines in a very specific non-GGUF pan-and-scan scenario; test95 uses pan-and-scan too and works fine. Are you sure this is the best way out? As an alternative, couldn't we suspend test96 (it largely duplicates test95) while we investigate why the generation differs? I volunteer for that investigation.
Not really. I verified that the GGUF e2e mm test still passes when submitting this PR. Therefore, given that the incorrectly implemented custom mask has broken existing CI, we should revert it to make CI pass again rather than hide the failure by suspending the test.
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> Signed-off-by: LuminolT <lumischen01@gmail.com>
Restores custom attention mask generation for Gemma3 GGUF multimodal models that was partially reverted in vllm-project#28995. Implements robust GGUF-only guards to ensure the feature only applies to GGUF models and does not affect HF models.

Changes:
- Add uses_custom_attention_masks() utility with GGUF file format check
- Add uses_custom_attention_masks property to ModelConfig
- Initialize uses_custom_attention_masks in GPUModelRunner
- Restore generate_attention_masks() method to Gemma3ForConditionalGeneration
- Implement 3-layer defense-in-depth guard mechanism

The implementation uses check_gguf_file() to guarantee that the custom attention mask logic only triggers for GGUF files, preventing the issue that caused the original revert, where HF models incorrectly triggered the custom logic.

Tested with GGUF models (1B, 4B, 270M) for both text-only and multimodal inference. HF model compatibility verified via the pytest multimodal test suite.

Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
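For context on the follow-up's GGUF-only gating, here is a hedged sketch of how such a guard could look. The names `_CUSTOM_MASK_ARCHS` and `_is_gguf_file` and the exact signature are assumptions for illustration, not the actual uses_custom_attention_masks()/check_gguf_file() utilities referenced in the commit message:

```python
from pathlib import Path

# Illustrative sketch only; names and signatures are assumptions, not vLLM's
# actual implementation.
_CUSTOM_MASK_ARCHS = {"Gemma3ForConditionalGeneration"}


def _is_gguf_file(model: str) -> bool:
    """Return True only for a local .gguf file that starts with the GGUF magic bytes."""
    path = Path(model)
    if not (path.is_file() and path.suffix == ".gguf"):
        return False
    with path.open("rb") as f:
        return f.read(4) == b"GGUF"


def uses_custom_attention_masks(model: str, architectures: list[str]) -> bool:
    """Gate the custom-mask path on both file format and architecture, so that
    HF checkpoints never take the naive masked-attention branch."""
    return _is_gguf_file(model) and bool(_CUSTOM_MASK_ARCHS & set(architectures))
```

The model runner would presumably consult this flag once at initialization and only build per-request image masks when it is set, which is what keeps HF models on the unchanged fast path.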
Purpose
Test Plan
Test Result
The failing gemma3 test should pass now.
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.