
[ez][ET-VK][partitioner] Allow layout-agnostic ops to accept quantized layouts#19395

Merged
meta-codesync[bot] merged 2 commits into gh/SS-JIA/526/base from gh/SS-JIA/526/head
May 9, 2026

Conversation

@SS-JIA (Contributor) commented May 8, 2026

Stack from ghstack (oldest at bottom):

Two changes that together let the partitioner keep PACKED_INT8 layouts flowing through identity-like ops, eliminating spurious clone dispatches:

1. utils.py: ANY_STORAGE_INCL_PACKED_INT8 (renamed from ALL_STORAGES_REPSET) previously claimed every layout (including PACKED_INT8_*) on the texture side, but PACKED_INT8 is buffer-only by convention: the texture indexing helpers and required_image_extents know nothing about quantized layouts. The texture side is therefore narrowed to all_memory_layouts (float-only). Every existing call site is either an intersection identity or a wildcard for non-tensor / not-yet-prepacked args, so the narrowing is non-breaking, and the repset can now act as a true universal set when intersected against quant-aware repsets. The new name slots cleanly next to ANY_STORAGE / ANY_BUFFER / ANY_TEXTURE and tells the reader exactly what is added: "like ANY_STORAGE, but also admits PACKED_INT8 (on the buffer side)". A hedged sketch of this narrowing follows the list.

2. op_registry.py: switch view_copy / clone / _clone_dim_order / alias_copy from inputs_storage=ANY_STORAGE to inputs_storage=ANY_STORAGE_INCL_PACKED_INT8. ANY_STORAGE is float-only, so when one of these no-op identity ops sits between two q8ta ops, the BFS in TagMemoryMetaPass.constrain_op_*_repset short-circuits (zero overlap with PACKED_INT8_BUFFER) and forces transitions on both sides. With ANY_STORAGE_INCL_PACKED_INT8 they now admit both float and quantized layouts, and the redundant-op transform folds them away; the second sketch below illustrates the intersection behavior.
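
To make the repset change concrete, here is a minimal sketch under a simplified model of repsets as per-storage layout sets. All names here (RepSet, FLOAT_LAYOUTS, PACKED_INT8_LAYOUTS, and the layout strings) are illustrative placeholders, not the actual utils.py definitions:

```python
# Hedged sketch: a repset pairs the layouts a tensor may take in buffer
# storage with those it may take in texture storage. Names are invented
# for illustration and do not mirror the real ExecuTorch sources.
from dataclasses import dataclass
from typing import FrozenSet

# Float memory layouts (placeholders).
FLOAT_LAYOUTS: FrozenSet[str] = frozenset(
    {"WIDTH_PACKED", "HEIGHT_PACKED", "CHANNELS_PACKED"}
)
# Quantized int8x4 layouts, buffer-only by convention (placeholders).
PACKED_INT8_LAYOUTS: FrozenSet[str] = frozenset(
    {"PACKED_INT8_A", "PACKED_INT8_B"}
)

@dataclass(frozen=True)
class RepSet:
    buffer_layouts: FrozenSet[str]
    texture_layouts: FrozenSet[str]

    def intersect(self, other: "RepSet") -> "RepSet":
        # Per-storage intersection: two ops can agree on a representation
        # only where their layout sets overlap.
        return RepSet(
            self.buffer_layouts & other.buffer_layouts,
            self.texture_layouts & other.texture_layouts,
        )

    def is_empty(self) -> bool:
        return not (self.buffer_layouts or self.texture_layouts)

# Float-only universal set: what ANY_STORAGE admits.
ANY_STORAGE = RepSet(FLOAT_LAYOUTS, FLOAT_LAYOUTS)

# The renamed repset. Before this change its texture side also claimed
# PACKED_INT8_*; it is now narrowed so quantized layouts are buffer-only.
ANY_STORAGE_INCL_PACKED_INT8 = RepSet(
    FLOAT_LAYOUTS | PACKED_INT8_LAYOUTS,  # buffer side: float + quantized
    FLOAT_LAYOUTS,                        # texture side: float-only
)
```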

The 31 other ops using ANY_STORAGE are real compute ops (binaryop, comparison, softmax, argreduce, permute_copy, etc.) whose float-only kernels do not accept quantized int8x4 layouts (q8ta_* are separate ops), so those are left alone.
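
Continuing the sketch above, here is why the float-only repset forces transitions around an identity op sitting between two q8ta ops, while the widened repset lets the quantized layout flow through (again, all names are the sketch's own, not the real op_registry.py API):

```python
# Repset of a q8ta producer/consumer: a quantized buffer layout with no
# texture representation (placeholder standing in for PACKED_INT8_BUFFER).
PACKED_INT8_BUFFER = RepSet(PACKED_INT8_LAYOUTS, frozenset())

# Hypothetical stand-in for the op_registry.py switch: identity-like ops
# now advertise the widened repset for their inputs.
IDENTITY_OPS = ("view_copy", "clone", "_clone_dim_order", "alias_copy")
inputs_storage = {op: ANY_STORAGE_INCL_PACKED_INT8 for op in IDENTITY_OPS}

# Float-only vs. quantized: the intersection is empty on both storage
# sides, so the tagging pass would have to insert layout transitions
# (materialized as clone dispatches) on both sides of the identity op.
assert ANY_STORAGE.intersect(PACKED_INT8_BUFFER).is_empty()

# Widened vs. quantized: the buffer layout survives the intersection,
# the identity op inherits the quantized representation directly, and
# the redundant-op transform can then fold it away.
assert not ANY_STORAGE_INCL_PACKED_INT8.intersect(PACKED_INT8_BUFFER).is_empty()
```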

On RefineNet 24feat (1x3x256x144), the 8 _clone_dim_order ops the partitioner had been inserting around the 4 fused q8ta_pixel_shuffle nodes are now folded by the delegate. Runtime q8ta_clone dispatches drop from 11 to 3; the 3 remaining dispatches are unrelated and come from the original model graph.

Differential Revision: D103770022


pytorch-bot Bot commented May 8, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19395

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 1 New Failure

As of commit aaf89d4 with merge base c564936:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla Bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on May 8, 2026

github-actions Bot commented May 8, 2026

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

meta-codesync Bot merged commit 76862a6 into gh/SS-JIA/526/base on May 9, 2026
173 of 175 checks passed
meta-codesync Bot deleted the gh/SS-JIA/526/head branch on May 9, 2026 at 04:57
meta-codesync Bot temporarily deployed to cherry-pick-bot on May 9, 2026 at 04:57 (Inactive)
SS-JIA pushed a commit that referenced this pull request on May 9, 2026

Pull Request resolved: #19395

ghstack-source-id: 379519734
Differential Revision: [D103770022](https://our.internmc.facebook.com/intern/diff/D103770022/)

Labels

CLA Signed (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed), fb-exported, meta-exported
