Feat/enable glm4 moe by ochougul · Pull Request #991 · quic/efficient-transformers

ochougul · 2026-05-15T10:05:46Z

No description provided.

This PR adds support for `num_crops` and `valid_size` from vLLM in the override configs for the E-PD case.

---------

Signed-off-by: Varun Gupta <vargupt@qti.qualcomm.com>
Signed-off-by: Rishin Raj <rishinr@qti.qualcomm.com>
Co-authored-by: Rishin Raj <rishinr@qti.qualcomm.com>

quic-rishinr · 2026-05-18T09:15:10Z


-def blocked_qkv_attention_forward(
+def blocked_kv_attention_forward(
+    module: nn.Module,


Please add a docstring for better code readability.

quic-rishinr · 2026-05-18T09:17:22Z

+    sinks: Optional[torch.Tensor] = None,
+    **kwargs,
+) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
+    if position_bias is not None or sinks is not None or sliding_window is not None or attention_mask is not None:


Please split this method into smaller, reusable methods.

quic-rishinr · 2026-05-18T10:13:13Z

+    **kwargs,
+) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
+    if position_bias is not None or sinks is not None or sliding_window is not None or attention_mask is not None:
+        return _blocked_hqkv_attention_forward_online(


Please split this method as well and refactor it.

quic-rishinr · 2026-05-19T06:16:38Z

        return g.onnxscript_op(CtxGather3D, data, ctx_indices).setTypeAs(data)


+class CtxGatherFunc3DGeneralized(torch.autograd.Function):


This looks exactly the same as `CtxGatherFunc3D`. Do we need this redundant class?

quic-rishinr · 2026-05-19T06:22:24Z

+#
+# -----------------------------------------------------------------------------
+
+import copy


Can you also add a small test for this under the unit tests?

quic-hemagnih · 2026-05-19T11:34:15Z

+    return q_embed.to(q.dtype), k_embed.to(k.dtype)
+
+
 def eager_attention_forward_blocked_kv(


I couldn't find the caller of this function. Where is it called?

quic-hemagnih · 2026-05-19T11:40:15Z

        return topk_indices, topk_weights

    def forward(self, hidden_states):
        # orig_i, orig_w = self.orig_forward(hidden_states)


Are we selecting the top-k experts the same way the original `get_topk_indices` method computes them? The original code appears to use group-constrained top-k selection.

Do we get the same experts in both cases?
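To make the question concrete, here is a minimal pure-Python sketch of how the two selection strategies can diverge. It assumes a group-limited scheme in which each group is scored by the sum of its two highest expert scores and only experts in the best `topk_group` groups stay eligible; whether that matches the original `get_topk_indices` needs to be checked against the upstream code. The function names and the router scores are hypothetical.

```python
from typing import List


def plain_topk(scores: List[float], top_k: int) -> List[int]:
    """Indices of the top_k highest scores, ignoring expert groups."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:top_k])


def group_limited_topk(scores: List[float], n_group: int, topk_group: int, top_k: int) -> List[int]:
    """Assumed scheme: keep the best topk_group groups, then pick top_k experts inside them."""
    group_size = len(scores) // n_group
    group_scores = []
    for g in range(n_group):
        members = sorted(scores[g * group_size:(g + 1) * group_size], reverse=True)
        # Assumption: a group is ranked by the sum of its two best experts.
        group_scores.append(sum(members[:2]))
    kept_groups = sorted(range(n_group), key=lambda g: group_scores[g], reverse=True)[:topk_group]
    eligible = [i for i in range(len(scores)) if i // group_size in kept_groups]
    order = sorted(eligible, key=lambda i: scores[i], reverse=True)
    return sorted(order[:top_k])


# Hypothetical router scores: 2 groups of 3 experts each.
scores = [0.9, 0.1, 0.1, 0.6, 0.6, 0.5]
plain = plain_topk(scores, top_k=2)
grouped = group_limited_topk(scores, n_group=2, topk_group=1, top_k=2)
print(plain, grouped)
```

In this sketch the plain selection keeps the single strongest expert from the first group, while the group-limited selection drops that group entirely. With a single group (`n_group=1`, `topk_group=1`) the two functions return the same experts, so the answer depends on the model config.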

quic-hemagnih · 2026-05-19T11:43:38Z

+            key_idx = torch.arange(split_block_len, device=query.device)
+            pad_mask = key_idx.unsqueeze(0) >= valid_in_chunk.unsqueeze(1)
+            attn_weights_block = attn_weights_block.masked_fill(pad_mask.view(1, 1, split, 1, split_block_len), -3.0e4)
+


Can we please use a constant instead of magic numbers like `-3.0e4`? Can we use `MIN_MASKED_ATTENTION_VALUE`?

quic-hemagnih · 2026-05-19T11:44:27Z

+        )
+        query_pos = position_ids.repeat(1, num_kv_groups)
+        causal_mask = key_abs[None, :, None, :] > query_pos[:, None, :, None]
+        attn_weights_block = attn_weights_block.masked_fill(causal_mask.unsqueeze(1), -3.0e4)


Please replace `-3.0e4` with `MIN_MASKED_ATTENTION_VALUE` in all instances.
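As a rough illustration of the requested change, the sketch below routes every masked fill through one named constant instead of repeating the literal. It is pure Python rather than the PyTorch code in the diff, `masked_softmax` is a hypothetical helper, and the constant's value here is just the literal from the diff; the repository's actual `MIN_MASKED_ATTENTION_VALUE` may be defined differently.

```python
import math
from typing import List

# Placeholder value copied from the diff; the real MIN_MASKED_ATTENTION_VALUE
# lives in the repository's constants and may differ.
MIN_MASKED_ATTENTION_VALUE = -3.0e4


def masked_softmax(scores: List[float], masked: List[bool]) -> List[float]:
    """Softmax over scores after filling masked positions with the shared constant."""
    filled = [MIN_MASKED_ATTENTION_VALUE if m else s for s, m in zip(scores, masked)]
    peak = max(filled)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in filled]
    total = sum(exps)
    return [e / total for e in exps]


# The last position is masked (for example, padding or a future token).
weights = masked_softmax([1.0, 2.0, 3.0], [False, False, True])
print(weights)
```

With a single definition, the padding mask and the causal mask cannot drift apart if the fill value is changed later.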

quic-hemagnih · 2026-05-19T11:46:29Z

+
+    @staticmethod
+    def symbolic(g: torch.Graph, data: torch.Value, ctx_indices: torch.Value) -> torch.Value:
+        return g.onnxscript_op(CtxGather3D, data, ctx_indices)


Shouldn't we add `.setTypeAs(data)`?

quic-hemagnih · 2026-05-19T11:53:37Z

+            max_block = torch.where(skip_future, torch.full_like(max_block, MIN_MASKED_ATTENTION_VALUE), max_block)
+            exp_block = torch.where(skip_future, torch.zeros_like(exp_block), exp_block)
+
+        _, v_block = past_key_value.read_only_blockedKV(start_index, end_index, layer_idx, cache_kwargs)


Can we combine this `v_block` read with the `k_block` read above? That would avoid doing the I/O twice. I think all the parameters passed to both calls are the same and do not change in between.
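The suggestion can be sketched with a toy cache that counts reads. `CountingCache` and `read_block` are hypothetical stand-ins, not the repository's `read_only_blockedKV` API, and the arithmetic is a placeholder for the attention math; the sketch only assumes that one call returns both the key and value slices for the same index range.

```python
from typing import List, Tuple


class CountingCache:
    """Hypothetical stand-in for the KV cache; counts how often a block is read."""

    def __init__(self, keys: List[float], values: List[float]) -> None:
        self.keys = keys
        self.values = values
        self.reads = 0

    def read_block(self, start: int, end: int) -> Tuple[List[float], List[float]]:
        self.reads += 1
        return self.keys[start:end], self.values[start:end]


def block_output_two_reads(cache: CountingCache, query: float, start: int, end: int) -> float:
    """Pattern from the diff: read K, compute scores, then read again for V."""
    k_block, _ = cache.read_block(start, end)
    scores = [query * k for k in k_block]
    _, v_block = cache.read_block(start, end)
    return sum(s * v for s, v in zip(scores, v_block))


def block_output_one_read(cache: CountingCache, query: float, start: int, end: int) -> float:
    """Suggested pattern: read K and V together, once per block."""
    k_block, v_block = cache.read_block(start, end)
    scores = [query * k for k in k_block]
    return sum(s * v for s, v in zip(scores, v_block))


cache_a = CountingCache([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
cache_b = CountingCache([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
out_two = block_output_two_reads(cache_a, 0.5, 0, 2)
out_one = block_output_one_read(cache_b, 0.5, 0, 2)
print(out_two, out_one, cache_a.reads, cache_b.reads)
```

Both variants produce the same result here. One possible trade-off is that the value block stays live for longer in the single-read version, so the real saving would need to be confirmed on the exported graph.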

quic-hemagnih · 2026-05-19T11:55:26Z

    def forward(self, hidden_states):
        # orig_i, orig_w = self.orig_forward(hidden_states)
        hidden_states = hidden_states.view(-1, self.config.hidden_size)
        # import ipdb; ipdb.set_trace()c


Please remove this commented-out code.

quic-hemagnih · 2026-05-19T11:56:00Z

-                    value_states,
-                    attention_mask,
-                    dropout=0.0 if not self.training else self.attention_dropout,
+            # sin and cos are specific to RoPE models; position_ids needed for the static cache


Can we remove all of this commented-out code?

## Performance Note

When compiling the Qwen2.5 vision encoder with subfunctions enabled, a performance degradation is observed. This is primarily due to a computation within the VisionAttention module that remains invariant across layers, resulting in unnecessary repeated execution and increased runtime overhead.

---------

Signed-off-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com>

…cialization naming (quic#1006)

This PR verifies the reported Llama decode-only compile behaviour.

**Findings**:
- `decode_only=True` is currently not supported by `QEFFAutoModelForCausalLM.export()`.
- `retain_full_kv=True` is only applicable to specialized disaggregated-serving models such as GPT-OSS/Kimi, and has no effect for Llama.
- `prefill_seq_len=1` with `prefill_only=False` was incorrectly tagged as Prefill in the generated specialization metadata.

**Changes**:
- Raise a clear `NotImplementedError` when `decode_only=True` is passed to CausalLM export.
- Warn and ignore `retain_full_kv=True` for non-specialized models such as Llama.
- Tag `prefill_seq_len=1` and `prefill_only=False` specializations as Decode.
- Add quickcheck unit coverage for unsupported `decode_only`, ignored Llama `retain_full_kv`, and PL=1 decode specialization naming.

**Validation**:
- Verified that PL=1 now emits `"name": "Decode"` in the specialization output.
- Added a unit test and ran the focused quickcheck tests successfully:
  - `PYENV_VERSION=qeff pytest -q tests/unit_test/models/test_model_quickcheck.py::TestCausalLMFlagDiagnostics`

cc: @quic-hemagnih @quic-rishinr

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
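The flag handling described in this commit message could look roughly like the sketch below. All three helpers are hypothetical illustrations written for this note; they are not the functions in `QEFFAutoModelForCausalLM`, and the message texts are invented.

```python
import warnings


def specialization_name(prefill_seq_len: int, prefill_only: bool) -> str:
    """Name a specialization; PL=1 without prefill_only is treated as Decode."""
    if prefill_seq_len == 1 and not prefill_only:
        return "Decode"
    return "Prefill"


def check_export_flags(decode_only: bool = False) -> None:
    """Fail early and explicitly for the unsupported decode_only export mode."""
    if decode_only:
        raise NotImplementedError("decode_only=True is not supported for CausalLM export")


def resolve_retain_full_kv(retain_full_kv: bool, is_specialized_model: bool) -> bool:
    """Warn and ignore retain_full_kv for models where it has no effect."""
    if retain_full_kv and not is_specialized_model:
        warnings.warn("retain_full_kv=True has no effect for this model; ignoring it")
        return False
    return retain_full_kv


name_decode = specialization_name(prefill_seq_len=1, prefill_only=False)
name_prefill = specialization_name(prefill_seq_len=128, prefill_only=False)
print(name_decode, name_prefill)
```

Raising instead of silently ignoring `decode_only` makes the unsupported path visible at export time, while the warning keeps existing scripts that pass `retain_full_kv` running.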

Previous changes to the non-blocked attention forward caused the attention mask generated for CCL to be incorrect in the standard attention forward. This is fixed by returning the updated attention mask from the past_key_value_update helper function.

---------

Signed-off-by: Kushal Dulla <quic_kdulla@quicinc.com>
Signed-off-by: Kushal Dulla <kdulla@qti.qualcomm.com>

…changes to v5.5.4 and restore PyTorch/ORT parity (quic#876)

- Rebased downstream wrapper stack to transformers v5.3.0 and aligned coupled deps (huggingface-hub, peft, diffusers) in project config.
- Updated model wrapper compatibility paths across causal/VLM/audio/export flows to match upstream v5 APIs while preserving downstream public behavior.
- Hardened cache compatibility layer and runtime glue for mixed legacy/new cache semantics used by downstream generation/export paths.
- Fixed attention/mask/rotary call-path mismatches introduced by upstream API changes (including model-specific signature updates).
- Updated AWQ/quantizer and export compatibility paths to remain ONNX-safe.
- Validation evidence:

```
python -m pytest -q tests/test_model_quickcheck.py -n 16
Result: 26 passed.
```

- [x] QAic Verification Pending
- [x] E2E CI read out

cc: @quic-rishinr @quic-hemagnih @asmigosw @anujgupt-github

---------

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>
Signed-off-by: asmigosw <asmigosw@qti.qualcomm.com>
Co-authored-by: Asmita Goswami <asmigosw@qti.qualcomm.com>
Co-authored-by: Hem Agnihotri <hemagnih@qti.qualcomm.com>

…o_empty() (quic#952)

fix: improve weight offloading to handle plain tensor attrs and use to_empty()

Replace manual storage resizing with `to_empty(device="meta")` for parameters/buffers and explicitly handle plain tensor attributes (e.g. stacked expert weights in MoE models) that are not registered as parameters or buffers. This ensures all tensors are properly moved to the meta device, reducing memory usage after ONNX export.

Add unit tests for plain tensor attribute clearing.

---------

Signed-off-by: Rishin Raj <rishinr@qti.qualcomm.com>

vbaddi · 2026-05-27T05:05:03Z

duplicate #988

Enable GLM4-MOE chunked prefill MoE, KV-blocked attention, and disaggregated serving export with subfunctions.

- GLM4-MOE decode path
- Chunked prefill MoE path with packed expert dispatch
- KV-blocked attention path
- Disaggregated prefill/decode serving example
- ONNX subfunction export for decode and prefill

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>

Use the headpar_offline KV-blocking path by default for GQA-compatible KV blocking, with a fallback to the previous online implementation for unsupported masking/bias cases. Revert to the previous commit if this fails. WIP.

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
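The default-with-fallback routing described in this commit can be sketched as a small selector. The condition mirrors the check visible in the reviewed diff (`position_bias`, `sinks`, `sliding_window`, or `attention_mask` being set); `choose_kv_blocking_path` and its string return values are hypothetical and only illustrate the decision, not the actual attention code.

```python
from typing import Any, Optional


def choose_kv_blocking_path(
    position_bias: Optional[Any] = None,
    sinks: Optional[Any] = None,
    sliding_window: Optional[int] = None,
    attention_mask: Optional[Any] = None,
) -> str:
    """Return which (hypothetically named) path would handle the call."""
    # Any extra masking or bias input is treated as unsupported by the offline path.
    needs_fallback = any(
        arg is not None for arg in (position_bias, sinks, sliding_window, attention_mask)
    )
    return "online" if needs_fallback else "headpar_offline"


default_path = choose_kv_blocking_path()
fallback_path = choose_kv_blocking_path(sliding_window=128)
print(default_path, fallback_path)
```

Keeping the decision in one place would also address the review request to split the large forward function into smaller, reusable pieces.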

Trace chunked prefill exports with the requested prefill_seq_len so packed MoE dispatch unrolls all packed chunks, restore the `torch.full_like` index init, and add ONNX coverage for the second packed chunk slice.

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>

Enable On Device Sampling for Qwen3ForCausalLM (quic#963)

e084766

Add `QEffQwen3ForCausalLM` in the list of supported architectures for `SamplerTransform` in order to enable On Device Sampling. Signed-off-by: quic-sanising <sanising@qti.qualcomm.com>

vbaddi force-pushed the feat/enable_glm4_moe branch from 6e91468 to 77e65e9 on May 15, 2026, 14:41

quic-rishinr reviewed May 19, 2026

Updated the ROPE dtype for custom_dtype (quic#989)

af6865a

The RoPE dtype was set to `torch.get_default_dtype()`, which is float32 by default. It now uses the dtype from the config, as set by the user.

---------

Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>

quic-hemagnih requested changes May 19, 2026

abhishek-singh591 and others added 5 commits May 22, 2026 15:40

vbaddi added 8 commits May 27, 2026 10:55

feat(0514): Use head-parallel KV path for combined blocking

59095ea

Route KV-containing combined blocking modes through the headpar_offline path when supported, and pass user-tiled compile flags explicitly in the GLM4 MoE disagg example. Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>

nit: fix the license header in the example file

4953a7c

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>

fix(0415): fix: avoid unsupported prefill MoE reductions in subfuncti…

df6d647

…on export and update example Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>

fix(0526): Align MoE prefill blocking with bench path, same like gpto…

bce4b73

…ss/qwen3/pr935 Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>

fix(0527): align GLM4-MoE with transformers 5.5 cache and expert APIs

5c632b7

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>

vbaddi force-pushed the feat/enable_glm4_moe branch from 8452e31 to 5c632b7 on May 27, 2026, 05:52

9 participants