Add Glm4MoeForCausalLM Support #619

Closed
quic-shagun wants to merge 14 commits into quic:main from quic-shagun:glm_air

@quic-shagun
Contributor

This PR adds support for the zai-org/GLM-4.5-Air model, an open-source MoE model whose performance and accuracy are better than those of many closed-source models:

[image]

@quic-rishinr
Contributor

quic-rishinr commented Nov 18, 2025

@shagsood, do we have approval for this model? Also, please add this model to the validated model list.

@quic-rishinr
Contributor

@vbaddi, can you please review this PR?

@quic-rishinr quic-rishinr requested a review from vbaddi November 18, 2025 09:17
@quic-sgunnala

> @shagsood, do we have approval for this model? Also, please add this model to the validated model list.

Yes, we have legal approval for this model.


class QEffGlm4MoeMoE(Glm4MoeMoE):
    """
    MoE Block
Contributor

nit: Can we start using our optimized MoE block for the prefill/decode use case here?

Contributor Author

Is there a specific PR I should refer to for this?

key_states,
value_states,
attention_mask,
dropout=0.0 if not self.training else self.attention_dropout,
Contributor

Can we remove this dropout argument, since we are not using it in eager_attention_forward?

value: torch.Tensor,
attention_mask: Optional[torch.Tensor],
scaling: float,
dropout: float = 0.0,
Contributor

Please remove this, since it's not used.
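The two comments above ask for the unused dropout argument to be dropped from both the call site and the signature. A minimal sketch of what an inference-only attention helper without that argument could look like is shown below. It is not the PR's code: the function name `eager_attention_forward_sketch`, the additive-mask convention, and the tensor shapes are assumptions made for illustration.

```python
from typing import Optional

import torch


def eager_attention_forward_sketch(
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
    attention_mask: Optional[torch.Tensor],
    scaling: float,
) -> torch.Tensor:
    """Scaled dot-product attention with no dropout argument (hypothetical helper)."""
    # [B, H, Q, D] @ [B, H, D, K] -> [B, H, Q, K]
    attn_weights = torch.matmul(query, key.transpose(-2, -1)) * scaling
    if attention_mask is not None:
        # Assumption: additive mask, 0 where attention is allowed, -inf where it is blocked.
        attn_weights = attn_weights + attention_mask
    attn_weights = torch.softmax(attn_weights, dim=-1)
    # [B, H, Q, K] @ [B, H, K, D] -> [B, H, Q, D]
    return torch.matmul(attn_weights, value)


torch.manual_seed(0)
q = torch.rand(1, 2, 3, 4)
k = torch.rand(1, 2, 3, 4)
v = torch.rand(1, 2, 3, 4)
# Causal mask: strictly upper-triangular entries are blocked.
causal_mask = torch.triu(torch.full((3, 3), float("-inf")), diagonal=1)
out = eager_attention_forward_sketch(q, k, v, causal_mask, scaling=4 ** -0.5)
```

Since dropout with probability 0.0 is a no-op, removing the parameter should not change outputs for an export path that always runs in inference mode; the call site then simply stops passing `dropout=...`.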

topk_weights = router_scores.gather(1, topk_indices)  # [T, 8]

if self.norm_topk_prob:
    topk_weights = topk_weights / (topk_weights.sum(dim=-1, keepdim=True) + 1e-20)
Contributor

nit: Is subfunction verified? I feel this .sum() needs to be replaced with einsum.
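To make the suggestion concrete, here is a small sketch comparing the `.sum()` reduction from the diff with an einsum-based reduction. It only checks that the two forms agree numerically; whether einsum is actually required for subfunction export is the reviewer's open question and is not verified here. The helper names, the router score tensor, and the sizes (4 tokens, 16 experts, top 8) are hypothetical.

```python
import torch


def normalize_topk_sum(topk_weights: torch.Tensor, eps: float = 1e-20) -> torch.Tensor:
    """Normalization as written in the diff: reduce over experts with .sum()."""
    denom = topk_weights.sum(dim=-1, keepdim=True) + eps
    return topk_weights / denom


def normalize_topk_einsum(topk_weights: torch.Tensor, eps: float = 1e-20) -> torch.Tensor:
    """Same normalization, with the reduction expressed as an einsum."""
    # "tk->t" sums over the selected-expert axis k for every token t.
    denom = torch.einsum("tk->t", topk_weights).unsqueeze(-1) + eps
    return topk_weights / denom


torch.manual_seed(0)
router_scores = torch.rand(4, 16)                      # [T, E], hypothetical scores
topk_indices = router_scores.topk(8, dim=-1).indices   # [T, 8]
topk_weights = router_scores.gather(1, topk_indices)   # [T, 8]

out_sum = normalize_topk_sum(topk_weights)
out_einsum = normalize_topk_einsum(topk_weights)
```

Both variants return weights that sum to one per token, so swapping one for the other should be a drop-in change as far as numerics are concerned.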

@anujgupt-github
Contributor

@shagsood, can you rebase and bring this to the main branch?

shagsood and others added 12 commits May 7, 2026 15:40
Signed-off-by: shagsood <shagsood@qti.qualcomm.com> (4 commits)
Signed-off-by: Onkar Chougule <ochougul@qti.qualcomm.com> (8 commits)
@quic-rishinr
Contributor

@asmigosw @mamtsing, please review the PR.

@quic-rishinr quic-rishinr requested review from asmigosw and vbaddi May 8, 2026 07:27
@quic-rishinr quic-rishinr requested review from quic-mamta and removed request for ochougul, quic-amitraj, quic-hemagnih and quic-rishinr May 8, 2026 07:27
@quic-rishinr
Contributor

@ochougul, please fix the lint, DCO, and other failures.

@quic-hemagnih
Contributor

@ochougul, can you please fix the lint errors?

Signed-off-by: Onkar Chougule <ochougul@qti.qualcomm.com>
@quic-rishinr
Contributor

A new PR has been raised with the following features. We will close this PR and proceed with PR #988.
• GLM4-MOE prefill/decode path
• Chunked prefill MoE path with packed expert dispatch
• KV-blocked attention path, with headpar_offline as the default for all blocking combinations
• Disaggregated prefill/decode serving example
• ONNX subfunction export for decode and prefill

9 participants