
[ET-VK] Add fused HuggingFace RoPE operator (apply_rotary_emb_hf) #18592

Merged
meta-codesync[bot] merged 2 commits into gh/SS-JIA/514/base from gh/SS-JIA/514/head
Mar 31, 2026

Conversation

@SS-JIA
Contributor

@SS-JIA SS-JIA commented Mar 30, 2026

Stack from ghstack (oldest at bottom):

Add a fused rotary positional embedding operator for the HuggingFace RoPE
convention used by Qwen3, Phi-4-mini, and other HF-based models.

The existing et_vk.apply_rotary_emb only matches the stock Meta/Llama RoPE
pattern (interleaved pairs via reshape+unbind+stack+flatten). HF models use a
different convention (split-half via slice+neg+cat), causing Qwen3's RoPE to
decompose into ~560 GPU dispatches per decode step instead of 16 fused
dispatches (~1,295 µs/decode, 7% of total).
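
The difference between the two conventions can be sketched in plain Python (a minimal illustration of the math, not the actual ExecuTorch code; the length-4 head and the angle choices are arbitrary):

```python
def apply_rope_meta(x, cos, sin):
    """Meta/Llama convention: rotate adjacent interleaved pairs (x[2i], x[2i+1])."""
    out = []
    for i in range(0, len(x), 2):
        c, s = cos[i // 2], sin[i // 2]
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out

def rotate_half(x):
    """HF convention: negate the second half and swap halves (slice+neg+cat)."""
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]

def apply_rope_hf(x, cos, sin):
    """HF RoPE: x*cos + rotate_half(x)*sin, pairing element i with i + D/2."""
    rh = rotate_half(x)
    return [x[i] * cos[i] + rh[i] * sin[i] for i in range(len(x))]

# In the HF layout cos/sin are full-dim with cos[i] == cos[i + D/2];
# element 0 is rotated against element 2 (distance D/2), not element 1.
cos, sin = [0.0, 1.0, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0]  # angles pi/2 and 0
print(apply_rope_hf([1.0, 2.0, 3.0, 4.0], cos, sin))    # [-3.0, 2.0, 1.0, 4.0]
print(apply_rope_meta([1.0, 2.0, 3.0, 4.0], [0.0, 1.0], [1.0, 0.0]))  # [-2.0, 1.0, 3.0, 4.0]
```

Same input and same angles, different outputs: this is why a matcher for the interleaved pattern cannot fuse the HF graph.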

This commit adds et_vk.apply_rotary_emb_hf with:

  • Pattern matching: HfRotaryEmbeddingPattern in patterns/rope_hf.py using
    SubgraphMatcher to detect the HF RoPE graph and replace with fused op.
    Supports both full rotation (freqs_dim == head_dim) and partial rotation
    (freqs_dim < head_dim, e.g. Phi-4-mini with partial_rotary_factor=0.75)
    by registering two pattern variants in get_hf_rope_graphs().
  • GLSL shader: rotary_embedding_hf.glsl which pairs elements at distance D/2
    (half-apart) instead of adjacent pairs, computing half_dim from the metadata
    UBO for dynamic shape support
  • C++ dispatch: add_rotary_embedding_hf_node with corrected assertion
    (head_dim == freqs_dim, not freqs_dim*2) since HF freqs are full-dim
  • Custom op registration in both xplat and fbcode
  • Op tests covering multiple configurations and dynamic prefill→decode resize
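
A rough Python model of what the fused kernel computes per head row, including the partial-rotation pass-through (function and parameter names here are hypothetical, for illustration only; the real implementation is the GLSL shader above):

```python
def fused_rope_hf(x, cos_half, sin_half, freqs_dim):
    """Model of the fused op: rotate x[i] against x[i + half] within the
    first freqs_dim elements; anything past freqs_dim (partial rotation,
    i.e. partial_rotary_factor < 1.0) passes through untouched."""
    half = freqs_dim // 2  # the shader derives half_dim from the metadata UBO
    out = list(x)
    for i in range(half):
        c, s = cos_half[i], sin_half[i]
        a, b = x[i], x[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out

# Full rotation, head_dim == freqs_dim == 4 (angles pi/2 and 0):
print(fused_rope_hf([1.0, 2.0, 3.0, 4.0], [0.0, 1.0], [1.0, 0.0], 4))
# -> [-3.0, 2.0, 1.0, 4.0]

# Partial rotation, freqs_dim == 2 of head_dim == 6: the tail is unchanged.
print(fused_rope_hf([1.0, 2.0, 3.0, 4.0, 9.0, 9.0], [0.0], [1.0], 2))
# -> [-2.0, 1.0, 3.0, 4.0, 9.0, 9.0]
```

Because the pairing is a fixed stride of half_dim, one dispatch can cover the whole tensor instead of materializing the slice/neg/cat intermediates.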

Also adds a convert_phi4_mini_weights binary target to the phi_4_mini TARGETS
file to enable converting HF checkpoint weights to Meta format.

Authored with Claude.

Differential Revision: D98741178

@pytorch-bot

pytorch-bot Bot commented Mar 30, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18592

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 1 Cancelled Job, 2 Pending, 5 Unrelated Failures

As of commit 14f27e1 with merge base d7cc5d7:

NEW FAILURES - The following jobs have failed:

CANCELLED JOB - The following job was cancelled. Please retry:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

SS-JIA pushed a commit that referenced this pull request Mar 30, 2026
@meta-cla bot added the CLA Signed label Mar 30, 2026
@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

SS-JIA pushed a commit that referenced this pull request Mar 30, 2026
@meta-codesync bot merged commit c1d9b15 into gh/SS-JIA/514/base Mar 31, 2026
153 of 170 checks passed
@meta-codesync bot deleted the gh/SS-JIA/514/head branch March 31, 2026 01:44
@meta-codesync bot temporarily deployed to cherry-pick-bot March 31, 2026 01:45 (Inactive)
SS-JIA pushed a commit that referenced this pull request Mar 31, 2026
rascani pushed a commit to rascani/executorch that referenced this pull request Apr 1, 2026

Labels

CLA Signed · fb-exported · meta-exported
