
[ET-VK] Add fused HuggingFace RoPE operator (apply_rotary_emb_hf) #18592

Merged
meta-codesync[bot] merged 2 commits into gh/SS-JIA/514/base from gh/SS-JIA/514/head
Mar 31, 2026

Conversation

@SS-JIA
Contributor

@SS-JIA SS-JIA commented Mar 30, 2026

Stack from ghstack (oldest at bottom):

Add a fused rotary positional embedding operator for the HuggingFace RoPE
convention used by Qwen3, Phi-4-mini, and other HF-based models.

The existing et_vk.apply_rotary_emb only matches the stock Meta/Llama RoPE
pattern (interleaved pairs via reshape+unbind+stack+flatten). HF models use a
different convention (split-half via slice+neg+cat), causing Qwen3's RoPE to
decompose into ~560 GPU dispatches per decode step instead of 16 fused
dispatches (~1,295 µs/decode, 7% of total).
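
The difference between the two conventions can be sketched in plain Python (a minimal illustration of the math, not the actual ExecuTorch code; the length-4 head and the angle choices are arbitrary):

```python
def apply_rope_meta(x, cos, sin):
    """Meta/Llama convention: rotate adjacent interleaved pairs (x[2i], x[2i+1])."""
    out = []
    for i in range(0, len(x), 2):
        c, s = cos[i // 2], sin[i // 2]
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out

def rotate_half(x):
    """HF convention: negate the second half and swap halves (slice+neg+cat)."""
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]

def apply_rope_hf(x, cos, sin):
    """HF RoPE: x*cos + rotate_half(x)*sin, pairing element i with i + D/2."""
    rh = rotate_half(x)
    return [x[i] * cos[i] + rh[i] * sin[i] for i in range(len(x))]

# In the HF layout cos/sin are full-dim with cos[i] == cos[i + D/2];
# element 0 is rotated against element 2 (distance D/2), not element 1.
cos, sin = [0.0, 1.0, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0]  # angles pi/2 and 0
print(apply_rope_hf([1.0, 2.0, 3.0, 4.0], cos, sin))    # [-3.0, 2.0, 1.0, 4.0]
print(apply_rope_meta([1.0, 2.0, 3.0, 4.0], [0.0, 1.0], [1.0, 0.0]))  # [-2.0, 1.0, 3.0, 4.0]
```

Same input and same angles, different outputs: this is why a matcher for the interleaved pattern cannot fuse the HF graph.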

This commit adds et_vk.apply_rotary_emb_hf with:

  • Pattern matching: HfRotaryEmbeddingPattern in patterns/rope_hf.py using
    SubgraphMatcher to detect the HF RoPE graph and replace with fused op.
    Supports both full rotation (freqs_dim == head_dim) and partial rotation
    (freqs_dim < head_dim, e.g. Phi-4-mini with partial_rotary_factor=0.75)
    by registering two pattern variants in get_hf_rope_graphs().
  • GLSL shader: rotary_embedding_hf.glsl which pairs elements at distance D/2
    (half-apart) instead of adjacent pairs, computing half_dim from the metadata
    UBO for dynamic shape support
  • C++ dispatch: add_rotary_embedding_hf_node with corrected assertion
    (head_dim == freqs_dim, not freqs_dim*2) since HF freqs are full-dim
  • Custom op registration in both xplat and fbcode
  • Op tests covering multiple configurations and dynamic prefill→decode resize
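
A rough Python model of what the fused kernel computes per head row, including the partial-rotation pass-through (function and parameter names here are hypothetical, for illustration only; the real implementation is the GLSL shader above):

```python
def fused_rope_hf(x, cos_half, sin_half, freqs_dim):
    """Model of the fused op: rotate x[i] against x[i + half] within the
    first freqs_dim elements; anything past freqs_dim (partial rotation,
    i.e. partial_rotary_factor < 1.0) passes through untouched."""
    half = freqs_dim // 2  # the shader derives half_dim from the metadata UBO
    out = list(x)
    for i in range(half):
        c, s = cos_half[i], sin_half[i]
        a, b = x[i], x[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out

# Full rotation, head_dim == freqs_dim == 4 (angles pi/2 and 0):
print(fused_rope_hf([1.0, 2.0, 3.0, 4.0], [0.0, 1.0], [1.0, 0.0], 4))
# -> [-3.0, 2.0, 1.0, 4.0]

# Partial rotation, freqs_dim == 2 of head_dim == 6: the tail is unchanged.
print(fused_rope_hf([1.0, 2.0, 3.0, 4.0, 9.0, 9.0], [0.0], [1.0], 2))
# -> [-2.0, 1.0, 3.0, 4.0, 9.0, 9.0]
```

Because the pairing is a fixed stride of half_dim, one dispatch can cover the whole tensor instead of materializing the slice/neg/cat intermediates.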

Also adds a convert_phi4_mini_weights binary target to the phi_4_mini TARGETS
file to enable converting HF checkpoint weights to Meta format.

Authored with Claude.

Differential Revision: D98741178

@pytorch-bot

pytorch-bot Bot commented Mar 30, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18592

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 1 Cancelled Job, 2 Pending, 5 Unrelated Failures

As of commit 14f27e1 with merge base d7cc5d7:

NEW FAILURES - The following jobs have failed:

CANCELLED JOB - The following job was cancelled. Please retry:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

SS-JIA pushed a commit that referenced this pull request Mar 30, 2026
@meta-cla bot added the CLA Signed label Mar 30, 2026
@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

SS-JIA pushed a commit that referenced this pull request Mar 30, 2026
@meta-codesync bot merged commit c1d9b15 into gh/SS-JIA/514/base Mar 31, 2026
153 of 170 checks passed
@meta-codesync bot deleted the gh/SS-JIA/514/head branch March 31, 2026 01:44
@meta-codesync bot temporarily deployed to cherry-pick-bot March 31, 2026 01:45 (Inactive)
SS-JIA pushed a commit that referenced this pull request Mar 31, 2026
rascani pushed a commit to rascani/executorch that referenced this pull request Apr 1, 2026

Labels

CLA Signed · fb-exported · meta-exported
