Feat/enable glm4 moe #991

Open
ochougul wants to merge 16 commits into quic:glm_air_branch from vbaddi:feat/enable_glm4_moe

@ochougul

No description provided.

Add `QEffQwen3ForCausalLM` in the list of supported architectures for
`SamplerTransform` in order to enable On Device Sampling.

Signed-off-by: quic-sanising <sanising@qti.qualcomm.com>
vbaddi force-pushed the feat/enable_glm4_moe branch from 6e91468 to 77e65e9 on May 15, 2026 14:41
This PR adds support for `num_crops` and `valid_size` from vLLM
in override configs in the E-PD case.

---------

Signed-off-by: Varun Gupta <vargupt@qti.qualcomm.com>
Signed-off-by: Rishin Raj <rishinr@qti.qualcomm.com>
Co-authored-by: Rishin Raj <rishinr@qti.qualcomm.com>

def blocked_qkv_attention_forward(
def blocked_kv_attention_forward(
module: nn.Module,

Please add a docstring for better code readability.

sinks: Optional[torch.Tensor] = None,
**kwargs,
) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
if position_bias is not None or sinks is not None or sliding_window is not None or attention_mask is not None:

please split the method to smaller reusable methods
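
One way to act on this suggestion is to move the fallback condition into a small predicate, so the dispatcher reads as a single branch. The sketch below is illustrative only: `needs_online_fallback` and `choose_blocked_kv_path` are hypothetical names, and the returned strings stand in for the real offline and online kernels, which this thread does not show in full.

```python
from typing import Any, Optional


def needs_online_fallback(
    position_bias: Optional[Any] = None,
    sinks: Optional[Any] = None,
    sliding_window: Optional[int] = None,
    attention_mask: Optional[Any] = None,
) -> bool:
    """Return True when any optional masking/bias feature is requested.

    Mirrors the `is not None` condition quoted in the diff above.
    """
    optional_features = (position_bias, sinks, sliding_window, attention_mask)
    return any(feature is not None for feature in optional_features)


def choose_blocked_kv_path(**features: Any) -> str:
    """Pick a path name; the strings are placeholders for the real kernels."""
    return "online" if needs_online_fallback(**features) else "offline"
```

Keeping the predicate separate also makes it straightforward to unit test the routing decision without building tensors.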

**kwargs,
) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
if position_bias is not None or sinks is not None or sliding_window is not None or attention_mask is not None:
return _blocked_hqkv_attention_forward_online(

Please split this method as well and refactor it.

return g.onnxscript_op(CtxGather3D, data, ctx_indices).setTypeAs(data)


class CtxGatherFunc3DGeneralized(torch.autograd.Function):

This looks exactly the same as `CtxGatherFunc3D`. Do we need this redundant class?

#
# -----------------------------------------------------------------------------

import copy

Can you also add a small test for this under unit tests?

The RoPE dtype was set to `torch.get_default_dtype()`, which is float32 by
default. It now uses the dtype from the config, as set by the user.

---------

Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>
return q_embed.to(q.dtype), k_embed.to(k.dtype)


def eager_attention_forward_blocked_kv(

I couldn't find the caller of this function. Where is it called?

return topk_indices, topk_weights

def forward(self, hidden_states):
# orig_i, orig_w = self.orig_forward(hidden_states)

Are we selecting the top-k experts the same way the original get_topk_indices method does? The original code appears to use group-constrained top-k selection.

Are we getting the same experts in both cases?
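
A small pure-Python sketch can make the question concrete. It assumes a generic group-limited routing scheme: score each group by the sum of its two best experts, keep the best `topk_group` groups, then take the top-k among the surviving experts. Whether this matches the original `get_topk_indices` exactly is an assumption, and the function names and scores below are made up for illustration.

```python
from typing import List


def plain_topk(scores: List[float], top_k: int) -> List[int]:
    """Indices of the top_k highest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:top_k]


def group_limited_topk(scores: List[float], n_group: int, topk_group: int, top_k: int) -> List[int]:
    """Top-k restricted to experts that live in the best-scoring groups."""
    group_size = len(scores) // n_group
    # Score each group by the sum of its two best experts (assumed rule).
    group_scores = []
    for g in range(n_group):
        members = sorted(scores[g * group_size:(g + 1) * group_size], reverse=True)
        group_scores.append(sum(members[:2]))
    kept_groups = set(plain_topk(group_scores, topk_group))
    # Only experts inside a kept group remain candidates.
    candidates = [i for i in range(len(scores)) if i // group_size in kept_groups]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    return candidates[:top_k]


scores = [0.9, 0.1, 0.1, 0.1,   # group 0: one strong expert
          0.6, 0.5, 0.1, 0.1]   # group 1: two fairly strong experts
print(plain_topk(scores, 2))                # [0, 4]
print(group_limited_topk(scores, 2, 1, 2))  # [4, 5]
```

In this toy case the two selections differ, so a parity check of the rewritten `forward` against the original router on real hidden states would be the reliable way to answer the question.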

key_idx = torch.arange(split_block_len, device=query.device)
pad_mask = key_idx.unsqueeze(0) >= valid_in_chunk.unsqueeze(1)
attn_weights_block = attn_weights_block.masked_fill(pad_mask.view(1, 1, split, 1, split_block_len), -3.0e4)


Can we please use a named constant instead of magic numbers like -3.0e4? Can we use MIN_MASKED_ATTENTION_VALUE?

)
query_pos = position_ids.repeat(1, num_kv_groups)
causal_mask = key_abs[None, :, None, :] > query_pos[:, None, :, None]
attn_weights_block = attn_weights_block.masked_fill(causal_mask.unsqueeze(1), -3.0e4)

Please replace -3.0e4 with MIN_MASKED_ATTENTION_VALUE at all instances
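
For illustration, the requested pattern looks roughly like the sketch below, written in plain Python rather than torch. The value assigned to `MIN_MASKED_ATTENTION_VALUE` here is a stand-in copied from the literal in the diff; the project's actual constant may be defined differently, and `mask_future_keys` is a hypothetical helper.

```python
import math
from typing import List

# Stand-in value (assumption): the literal from the diff, given one name.
MIN_MASKED_ATTENTION_VALUE = -3.0e4


def mask_future_keys(scores: List[float], key_positions: List[int], query_position: int) -> List[float]:
    """Replace scores of keys located after the query with the masking constant."""
    return [
        MIN_MASKED_ATTENTION_VALUE if key_pos > query_position else score
        for score, key_pos in zip(scores, key_positions)
    ]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


masked = mask_future_keys([1.0, 2.0, 3.0], key_positions=[0, 1, 2], query_position=1)
weights = softmax(masked)  # the key at position 2 receives zero weight
```

With a single named constant, changing the masking value later is a one-line edit instead of a search across every `masked_fill` call.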


@staticmethod
def symbolic(g: torch.Graph, data: torch.Value, ctx_indices: torch.Value) -> torch.Value:
return g.onnxscript_op(CtxGather3D, data, ctx_indices)

Shouldn't we add `.setTypeAs(data)`?

max_block = torch.where(skip_future, torch.full_like(max_block, MIN_MASKED_ATTENTION_VALUE), max_block)
exp_block = torch.where(skip_future, torch.zeros_like(exp_block), exp_block)

_, v_block = past_key_value.read_only_blockedKV(start_index, end_index, layer_idx, cache_kwargs)

Can we combine this read of v_block with the K-block read above? That would avoid doing the I/O twice. I think the parameters passed in both calls are the same and do not change in between.
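
To illustrate the suggestion, here is a plain-Python sketch of blockwise attention with an online softmax in which each block's keys and values come from a single read. `read_kv_block` is a hypothetical stand-in for the cache accessor; whether `read_only_blockedKV` can serve both tensors from one call without extra cost is an assumption the PR author would need to confirm.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def read_kv_block(keys: Sequence[Vector], values: Sequence[Vector], start: int, end: int) -> Tuple[Sequence[Vector], Sequence[Vector]]:
    """Hypothetical cache read: return keys and values for [start, end) together."""
    return keys[start:end], values[start:end]


def blocked_attention(query: Vector, keys: Sequence[Vector], values: Sequence[Vector], block_len: int) -> Vector:
    """Single-query attention computed block by block with a running softmax."""
    running_max = -math.inf
    running_sum = 0.0
    output = [0.0] * len(values[0])
    for start in range(0, len(keys), block_len):
        # One read per block yields both K and V.
        k_block, v_block = read_kv_block(keys, values, start, start + block_len)
        scores = [sum(q * k for q, k in zip(query, key)) for key in k_block]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated so far to the new running maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        output = [o * correction for o in output]
        for score, value in zip(scores, v_block):
            weight = math.exp(score - new_max)
            running_sum += weight
            output = [o + weight * v for o, v in zip(output, value)]
        running_max = new_max
    return [o / running_sum for o in output]
```

The result matches ordinary softmax attention over the full key set; only the order of accumulation changes.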

def forward(self, hidden_states):
# orig_i, orig_w = self.orig_forward(hidden_states)
hidden_states = hidden_states.view(-1, self.config.hidden_size)
# import ipdb; ipdb.set_trace()c

Please remove this commented code

value_states,
attention_mask,
dropout=0.0 if not self.training else self.attention_dropout,
# sin and cos are specific to RoPE models; position_ids needed for the static cache

Can we remove all this commented code?

abhishek-singh591 and others added 5 commits May 22, 2026 15:40
## Performance Note

When compiling the Qwen2.5 vision encoder with subfunctions enabled, a
performance degradation is observed. This is primarily due to a
computation within the VisionAttention module that remains invariant
across layers, resulting in unnecessary repeated execution and increased
runtime overhead.

---------

Signed-off-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com>
…cialization naming (quic#1006)

This PR verifies the reported Llama decode-only compile behaviour

**Findings**:

- `decode_only=True` is currently not supported by
`QEFFAutoModelForCausalLM.export()`.
- `retain_full_kv=True` is only applicable to specialized
disaggregated-serving models such as GPT-OSS/Kimi, and has no
effect for Llama.
- `prefill_seq_len=1` with `prefill_only=False` was incorrectly tagged as
Prefill in the generated specialization metadata.

**Changes**:

- Raise a clear `NotImplementedError` when `decode_only=True` is passed
to CausalLM export.
- Warn and ignore `retain_full_kv=True` for non-specialized models such
as Llama.
- Tag `prefill_seq_len=1` and `prefill_only=False` specializations as
Decode.
- Add quickcheck unit coverage for unsupported `decode_only`, ignored
Llama `retain_full_kv`, and PL=1 decode specialization
naming.
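
The behaviour described in this list can be summarised in a short sketch. The function names, the `supports_retain_full_kv` parameter, and the "Prefill" label for every remaining case are illustrative assumptions rather than the actual QEfficient API.

```python
import warnings


def specialization_name(prefill_seq_len: int, prefill_only: bool) -> str:
    """Name a specialization; PL=1 without prefill_only is treated as decode."""
    if prefill_seq_len == 1 and not prefill_only:
        return "Decode"
    return "Prefill"  # assumed label for all remaining cases


def check_export_flags(
    decode_only: bool = False,
    retain_full_kv: bool = False,
    supports_retain_full_kv: bool = False,
) -> bool:
    """Validate flags and return the effective retain_full_kv value."""
    if decode_only:
        raise NotImplementedError("decode_only=True is not supported by CausalLM export")
    if retain_full_kv and not supports_retain_full_kv:
        # Non-specialized models (for example Llama) ignore the flag.
        warnings.warn("retain_full_kv=True has no effect for this model; ignoring it")
        return False
    return retain_full_kv
```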

**Validation**:

- Verified that PL=1 now emits `"name": "Decode"` in the specialization output.
- Added a unit test and ran the focused quickcheck tests successfully:
  - `PYENV_VERSION=qeff pytest -q tests/unit_test/models/test_model_quickcheck.py::TestCausalLMFlagDiagnostics`

cc: @quic-hemagnih @quic-rishinr

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
Previous changes to the non-blocked attention forward caused the
attention mask generated for CCL to be incorrect in the standard
attention forward. The past_key_value_update helper function now
returns the updated attention mask to fix this.

---------

Signed-off-by: Kushal Dulla <quic_kdulla@quicinc.com>
Signed-off-by: Kushal Dulla <kdulla@qti.qualcomm.com>
…changes to v5.5.4 and restore PyTorch/ORT parity (quic#876)

- Rebased downstream wrapper stack to transformers v5.3.0 and aligned
coupled deps (huggingface-hub, peft, diffusers) in project config.
- Updated model wrapper compatibility paths across
causal/VLM/audio/export flows to match upstream v5 APIs while preserving
downstream public behavior.
- Hardened cache compatibility layer and runtime glue for mixed
legacy/new cache semantics used by downstream generation/export paths.
- Fixed attention/mask/rotary call-path mismatches introduced by
upstream API changes (including model-specific signature updates).
- Updated AWQ/quantizer and export compatibility paths to remain
ONNX-safe.
- Validation evidence:
```
python -m pytest -q tests/test_model_quickcheck.py -n 16
Result: 26 passed.
```

- [x] QAic Verification Pending
- [x] E2E CI read out

cc: @quic-rishinr @quic-hemagnih @asmigosw @anujgupt-github

---------

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>
Signed-off-by: asmigosw <asmigosw@qti.qualcomm.com>
Co-authored-by: Asmita Goswami <asmigosw@qti.qualcomm.com>
Co-authored-by: Hem Agnihotri <hemagnih@qti.qualcomm.com>
…o_empty() (quic#952)

fix: improve weight offloading to handle plain tensor attrs and use
to_empty()

Replace manual storage resizing with `to_empty(device="meta")` for
parameters/buffers and explicitly handle plain tensor attributes (e.g.
stacked expert weights in MoE models) that are not registered as
parameters or buffers. This ensures all tensors are properly moved to
the meta device, reducing memory usage after ONNX export.
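
As a rough sketch of the approach described above (not the code from this PR), the snippet below moves registered parameters and buffers with `to_empty(device="meta")` and then replaces any remaining plain tensor attributes by hand. `TinyMoE` and `offload_to_meta` are hypothetical names used only for this example.

```python
import torch
from torch import nn


class TinyMoE(nn.Module):
    """Toy module with one registered layer and one plain tensor attribute."""

    def __init__(self) -> None:
        super().__init__()
        self.gate = nn.Linear(4, 2)
        # Plain tensor attribute: not registered as a parameter or buffer.
        self.stacked_experts = torch.randn(2, 4, 4)


def offload_to_meta(model: nn.Module) -> None:
    """Move every tensor owned by the model to the meta device."""
    # Handles registered parameters and buffers, recursively.
    model.to_empty(device="meta")
    # Plain tensor attributes are invisible to to_empty, so replace them manually.
    for module in model.modules():
        for name, value in list(vars(module).items()):
            if isinstance(value, torch.Tensor) and value.device.type != "meta":
                setattr(module, name, torch.empty_like(value, device="meta"))


model = TinyMoE()
offload_to_meta(model)
```

After the call, shapes and dtypes are preserved but no tensor holds real storage, which is why the module must not be used for computation afterwards.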

Add unit tests for plain tensor attribute clearing

---------

Signed-off-by: Rishin Raj <rishinr@qti.qualcomm.com>
@vbaddi

vbaddi commented May 27, 2026

duplicate #988

vbaddi added 8 commits May 27, 2026 10:55
Enable GLM4-MOE chunked prefill MoE, KV-blocked attention, and disaggregated
serving export with subfunctions.

  - GLM4-MOE decode path
  - Chunked prefill MoE path with packed expert dispatch
  - KV-blocked attention path
  - Disaggregated prefill/decode serving example
  - ONNX subfunction export for decode and prefill

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
Use the headpar_offline KV-blocking path by default for GQA-compatible KV
blocking, with fallback to the previous online implementation for
unsupported masking/bias cases.

Revert to the previous commit if this fails. WIP.

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
Route KV-containing combined blocking modes through the
headpar_offline path when supported, and pass user-tiled
compile flags explicitly in the GLM4 MoE disagg example.

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
…on export and update example

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
Trace chunked prefill exports with the requested prefill_seq_len so packed MoE dispatch unrolls all packed chunks,
restore the torch.full_like index init, and add ONNX coverage for the second packed chunk slice.

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
…ss/qwen3/pr935

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
vbaddi force-pushed the feat/enable_glm4_moe branch from 8452e31 to 5c632b7 on May 27, 2026 05:52
9 participants