Feat/enable glm4 moe #991
Add `QEffQwen3ForCausalLM` to the list of supported architectures for `SamplerTransform` to enable on-device sampling. Signed-off-by: quic-sanising <sanising@qti.qualcomm.com>
6e91468 to 77e65e9
This PR adds support for `num_crops` and `valid_size` from vLLM in the override configs for E-PD. --------- Signed-off-by: Varun Gupta <vargupt@qti.qualcomm.com> Signed-off-by: Rishin Raj <rishinr@qti.qualcomm.com> Co-authored-by: Rishin Raj <rishinr@qti.qualcomm.com>
def blocked_qkv_attention_forward(
def blocked_kv_attention_forward(
    module: nn.Module,
Please add a docstring for better code readability.
    sinks: Optional[torch.Tensor] = None,
    **kwargs,
) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
    if position_bias is not None or sinks is not None or sliding_window is not None or attention_mask is not None:
Please split this method into smaller, reusable methods.
    **kwargs,
) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
    if position_bias is not None or sinks is not None or sliding_window is not None or attention_mask is not None:
        return _blocked_hqkv_attention_forward_online(
Please split and refactor this method as well.
        return g.onnxscript_op(CtxGather3D, data, ctx_indices).setTypeAs(data)


class CtxGatherFunc3DGeneralized(torch.autograd.Function):
This class looks exactly the same as `CtxGatherFunc3D`. Do we need this redundant class?
#
# -----------------------------------------------------------------------------

import copy
Can you also add a small test for this under unit tests?
The RoPE dtype was set to `torch.get_default_dtype()`, which is float32 by default. Changed it to use the dtype from the config, which is set by the user. --------- Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>
    return q_embed.to(q.dtype), k_embed.to(k.dtype)


def eager_attention_forward_blocked_kv(
I couldn't find the caller of this function. Where is it called?
        return topk_indices, topk_weights

    def forward(self, hidden_states):
        # orig_i, orig_w = self.orig_forward(hidden_states)
Are we selecting the top-k experts the same way the original `get_topk_indices` method does? The original code seems to use group-constrained top-k selection.
Do we get the same experts in both cases?
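To make the question concrete, here is a minimal pure-Python sketch of how plain top-k and group-constrained top-k can pick different experts for the same router scores. It is not the code from this PR or from the original `get_topk_indices`; the function names are hypothetical, and the group score (sum of the two largest expert scores in each group) is an assumption made for illustration.

```python
from typing import List


def plain_topk(scores: List[float], k: int) -> List[int]:
    """Indices of the k largest scores, ignoring expert groups."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def group_limited_topk(scores: List[float], n_group: int, topk_group: int, k: int) -> List[int]:
    """Top-k restricted to experts that belong to the topk_group best groups.

    Assumption: a group's score is the sum of its two largest expert scores.
    """
    group_size = len(scores) // n_group
    groups = [list(range(g * group_size, (g + 1) * group_size)) for g in range(n_group)]
    group_scores = [sum(sorted((scores[i] for i in g), reverse=True)[:2]) for g in groups]
    kept = sorted(range(n_group), key=lambda g: group_scores[g], reverse=True)[:topk_group]
    allowed = [i for g in kept for i in groups[g]]
    return sorted(allowed, key=lambda i: scores[i], reverse=True)[:k]


# 8 experts in 4 groups of 2; expert 0 has the highest score overall.
scores = [0.9, 0.1, 0.6, 0.6, 0.5, 0.5, 0.2, 0.1]
print(plain_topk(scores, 2))                # [0, 2]
print(group_limited_topk(scores, 4, 1, 2))  # [2, 3]
```

In this example expert 0 wins under plain top-k but is excluded once only the best group is kept, so the two schemes are equivalent only when the group restriction never removes a winner. A parity test against the original routing would show whether that holds here.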
key_idx = torch.arange(split_block_len, device=query.device)
pad_mask = key_idx.unsqueeze(0) >= valid_in_chunk.unsqueeze(1)
attn_weights_block = attn_weights_block.masked_fill(pad_mask.view(1, 1, split, 1, split_block_len), -3.0e4)
Can we please use a constant instead of magic numbers like -3.0e4? Can we use `MIN_MASKED_ATTENTION_VALUE`?
)
query_pos = position_ids.repeat(1, num_kv_groups)
causal_mask = key_abs[None, :, None, :] > query_pos[:, None, :, None]
attn_weights_block = attn_weights_block.masked_fill(causal_mask.unsqueeze(1), -3.0e4)
Please replace -3.0e4 with `MIN_MASKED_ATTENTION_VALUE` in all instances.
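A small sketch of the requested change, written in plain Python rather than with tensors so it runs anywhere. The value assigned to `MIN_MASKED_ATTENTION_VALUE` below is only a placeholder for illustration (the project's real constant may hold a different value), and `apply_causal_mask` is a hypothetical helper, not a function from this PR.

```python
from typing import List

# Placeholder value for illustration only; the real constant may differ.
MIN_MASKED_ATTENTION_VALUE = -1.0e4


def apply_causal_mask(scores: List[List[float]], query_pos: List[int]) -> List[List[float]]:
    """Replace the score of every key that lies after the query position."""
    return [
        [MIN_MASKED_ATTENTION_VALUE if key_pos > q_pos else s for key_pos, s in enumerate(row)]
        for row, q_pos in zip(scores, query_pos)
    ]


# Query 0 sits at position 0 and may only see key 0; query 1 sits at position 2 and sees all keys.
masked = apply_causal_mask([[0.5, 0.2, 0.1], [0.3, 0.4, 0.9]], query_pos=[0, 2])
print(masked)
```

With a single named constant, the padding mask and the causal mask stay in sync, and changing the fill value later touches one line instead of every `masked_fill` call.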
    @staticmethod
    def symbolic(g: torch.Graph, data: torch.Value, ctx_indices: torch.Value) -> torch.Value:
        return g.onnxscript_op(CtxGather3D, data, ctx_indices)
Shouldn't we add `.setTypeAs(data)`?
max_block = torch.where(skip_future, torch.full_like(max_block, MIN_MASKED_ATTENTION_VALUE), max_block)
exp_block = torch.where(skip_future, torch.zeros_like(exp_block), exp_block)

_, v_block = past_key_value.read_only_blockedKV(start_index, end_index, layer_idx, cache_kwargs)
Can we combine this read of `v_block` with the `k_block` read above? This would avoid doing the I/O twice. I think the parameters passed to both calls are the same and do not change in between.
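The suggestion can be illustrated with a toy cache that counts reads. `CountingCache` and `read_block` are hypothetical stand-ins, not the cache class or `read_only_blockedKV` from this PR. The sketch assumes nothing modifies the cache between the two reads, which would need to be confirmed in the real code.

```python
from typing import List, Tuple


class CountingCache:
    """Hypothetical stand-in for a KV cache that counts how often a block is read."""

    def __init__(self, keys: List[float], values: List[float]) -> None:
        self.keys, self.values, self.reads = keys, values, 0

    def read_block(self, start: int, end: int) -> Tuple[List[float], List[float]]:
        self.reads += 1
        return self.keys[start:end], self.values[start:end]


cache = CountingCache(keys=[1.0, 2.0, 3.0, 4.0], values=[10.0, 20.0, 30.0, 40.0])

# Two calls with identical arguments, each discarding half of the result.
k_block, _ = cache.read_block(0, 2)
_, v_block = cache.read_block(0, 2)
reads_separate = cache.reads

# One call, keeping both halves.
cache.reads = 0
k_block, v_block = cache.read_block(0, 2)
reads_combined = cache.reads

print(reads_separate, reads_combined)
```

Whether the single read is a real saving depends on how the exported graph treats the second call, so it is worth checking the resulting ONNX rather than assuming.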
def forward(self, hidden_states):
    # orig_i, orig_w = self.orig_forward(hidden_states)
    hidden_states = hidden_states.view(-1, self.config.hidden_size)
    # import ipdb; ipdb.set_trace()c
Please remove this commented-out code.
value_states,
attention_mask,
dropout=0.0 if not self.training else self.attention_dropout,
# sin and cos are specific to RoPE models; position_ids needed for the static cache
Can we remove all of this commented-out code?
## Performance Note

When compiling the Qwen2.5 vision encoder with subfunctions enabled, a performance degradation is observed. This is primarily due to a computation within the VisionAttention module that remains invariant across layers, resulting in unnecessary repeated execution and increased runtime overhead.

---------

Signed-off-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com>
…cialization naming (quic#1006)

This PR verifies the reported Llama decode-only compile behaviour.

**Findings**:
- `decode_only=True` is currently not supported by `QEFFAutoModelForCausalLM.export()`.
- `retain_full_kv=True` is only applicable to specialized disaggregated-serving models such as GPT-OSS/Kimi, and has no effect for Llama.
- `prefill_seq_len=1` with `prefill_only=False` was incorrectly tagged as Prefill in the generated specialization metadata.

**Changes**:
- Raise a clear `NotImplementedError` when `decode_only=True` is passed to CausalLM export.
- Warn and ignore `retain_full_kv=True` for non-specialized models such as Llama.
- Tag specializations with `prefill_seq_len=1` and `prefill_only=False` as Decode.
- Add quickcheck unit coverage for unsupported `decode_only`, ignored Llama `retain_full_kv`, and PL=1 decode specialization naming.

**Validation**:
- Verified PL=1 now emits `"name": "Decode"` in specialization output.
- Added a unit test and ran the focused quickcheck tests successfully:
  - `PYENV_VERSION=qeff pytest -q tests/unit_test/models/test_model_quickcheck.py::TestCausalLMFlagDiagnostics`

cc: @quic-hemagnih @quic-rishinr

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
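The three behaviours listed under **Changes** can be summarised in a short sketch. This is not the code from quic#1006: `resolve_flags` and `is_specialized_model` are hypothetical names, and naming every non-decode case "Prefill" is an assumption made to keep the example small.

```python
import warnings


def resolve_flags(
    prefill_seq_len: int,
    prefill_only: bool = False,
    decode_only: bool = False,
    retain_full_kv: bool = False,
    is_specialized_model: bool = False,  # hypothetical: True for GPT-OSS/Kimi-style models
) -> dict:
    """Validate export flags and pick a specialization name (illustrative only)."""
    if decode_only:
        raise NotImplementedError("decode_only=True is not supported for CausalLM export")
    if retain_full_kv and not is_specialized_model:
        warnings.warn("retain_full_kv=True has no effect for this model; ignoring it")
        retain_full_kv = False
    # PL=1 without prefill_only is a decode specialization, not a prefill one.
    is_decode = prefill_seq_len == 1 and not prefill_only
    return {"name": "Decode" if is_decode else "Prefill", "retain_full_kv": retain_full_kv}


print(resolve_flags(prefill_seq_len=1)["name"])    # Decode
print(resolve_flags(prefill_seq_len=128)["name"])  # Prefill
```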
Previous changes to the non-blocked attention forward caused the attention mask generated for CCL to be incorrect in the standard attention forward. The `past_key_value_update` helper function now returns the updated attention mask to fix this. --------- Signed-off-by: Kushal Dulla <quic_kdulla@quicinc.com> Signed-off-by: Kushal Dulla <kdulla@qti.qualcomm.com>
…changes to v5.5.4 and restore PyTorch/ORT parity (quic#876)

- Rebased downstream wrapper stack to transformers v5.3.0 and aligned coupled deps (huggingface-hub, peft, diffusers) in project config.
- Updated model wrapper compatibility paths across causal/VLM/audio/export flows to match upstream v5 APIs while preserving downstream public behavior.
- Hardened cache compatibility layer and runtime glue for mixed legacy/new cache semantics used by downstream generation/export paths.
- Fixed attention/mask/rotary call-path mismatches introduced by upstream API changes (including model-specific signature updates).
- Updated AWQ/quantizer and export compatibility paths to remain ONNX-safe.
- Validation evidence:
  ```
  python -m pytest -q tests/test_model_quickcheck.py -n 16
  Result: 26 passed.
  ```
- [x] QAic Verification Pending
- [x] E2E CI read out

cc: @quic-rishinr @quic-hemagnih @asmigosw @anujgupt-github

---------

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>
Signed-off-by: asmigosw <asmigosw@qti.qualcomm.com>
Co-authored-by: Asmita Goswami <asmigosw@qti.qualcomm.com>
Co-authored-by: Hem Agnihotri <hemagnih@qti.qualcomm.com>
…o_empty() (quic#952)

fix: improve weight offloading to handle plain tensor attrs and use to_empty()

Replace manual storage resizing with `to_empty(device="meta")` for parameters/buffers and explicitly handle plain tensor attributes (e.g. stacked expert weights in MoE models) that are not registered as parameters or buffers. This ensures all tensors are properly moved to the meta device, reducing memory usage after ONNX export.

Add unit tests for plain tensor attribute clearing.

---------

Signed-off-by: Rishin Raj <rishinr@qti.qualcomm.com>
Duplicate of #988
Enable GLM4-MOE chunked prefill MoE, KV-blocked attention, and disaggregated serving export with subfunctions.

- GLM4-MOE decode path
- Chunked prefill MoE path with packed expert dispatch
- KV-blocked attention path
- Disaggregated prefill/decode serving example
- ONNX subfunction export for decode and prefill

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
Use the headpar_offline KV-blocking path by default for GQA-compatible KV blocking, with a fallback to the previous online implementation for unsupported masking/bias cases. Revert to the previous commit if this fails. WIP. Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
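As background for the "online" fallback mentioned above, the sketch below shows the general online-softmax technique for attending over keys one block at a time: a running maximum and running sum are kept, and earlier partial results are rescaled whenever the maximum grows. It assumes this is what "online" refers to here. It is a single-query, scalar-value toy in plain Python, not the implementation in this PR, and both function names are hypothetical.

```python
import math
from typing import List


def softmax_attention(scores: List[float], values: List[float]) -> float:
    """Reference: softmax over all scores at once, then a weighted sum of values."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    return sum(e * v for e, v in zip(exps, values)) / sum(exps)


def blocked_attention(scores: List[float], values: List[float], block: int) -> float:
    """Same result, visiting keys block by block with a running max and running sum."""
    running_max, running_sum, acc = -math.inf, 0.0, 0.0
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        new_max = max(running_max, max(s_blk))
        # Rescale what was accumulated under the old max; this is 0.0 for the first block.
        scale = math.exp(running_max - new_max)
        exps = [math.exp(s - new_max) for s in s_blk]
        running_sum = running_sum * scale + sum(exps)
        acc = acc * scale + sum(e * v for e, v in zip(exps, v_blk))
        running_max = new_max
    return acc / running_sum


scores = [0.1, 2.0, -1.0, 0.7, 3.2]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
print(softmax_attention(scores, values), blocked_attention(scores, values, block=2))
```

The two functions agree up to floating-point rounding for any block size, which is the property a parity test between the offline and online paths would rely on.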
Route KV-containing combined blocking modes through the headpar_offline path when supported, and pass user-tiled compile flags explicitly in the GLM4 MoE disagg example. Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
…on export and update example Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
Trace chunked prefill exports with the requested `prefill_seq_len` so packed MoE dispatch unrolls all packed chunks, restore the `torch.full_like` index init, and add ONNX coverage for the second packed chunk slice. Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
…ss/qwen3/pr935 Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
8452e31 to 5c632b7
No description provided.