[ez][ET-VK][partitioner] Allow layout-agnostic ops to accept quantized layouts#19395
meta-codesync[bot] merged 2 commits into gh/SS-JIA/526/base
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19395
Note: Links to docs will display an error until the docs builds have been completed. ❗ 1 Active SEV: there is 1 currently active SEV; if your PR is affected, please view it below. ❌ 1 New Failure: as of commit aaf89d4 with merge base c564936, the following job has failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Merged commit 76862a6 into gh/SS-JIA/526/base
Stack from ghstack (oldest at bottom):
Two changes that together let the partitioner keep PACKED_INT8 layouts flowing through identity-like ops, eliminating spurious clone dispatches:
utils.py: ANY_STORAGE_INCL_PACKED_INT8 (renamed from ALL_STORAGES_REPSET) previously claimed every layout (including PACKED_INT8_*) on the texture side, but PACKED_INT8 is buffer-only by convention — the texture indexing helpers and required_image_extents don't know about quantized layouts. Narrow the texture side to all_memory_layouts (float-only). Every existing call site is either an intersection identity or a wildcard for non-tensor / not-yet-prepacked args, so this narrowing is non-breaking, and the repset can now act as a true universal set when intersected against quant-aware repsets. The new name slots cleanly next to ANY_STORAGE / ANY_BUFFER / ANY_TEXTURE and tells the reader exactly what is added: "like ANY_STORAGE, but also admits PACKED_INT8 (on the buffer side)".
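To make the shape of the change concrete, here is a minimal sketch that models a repset as a plain set of (storage, layout) pairs. This is an illustration of the idea, not the actual utils.py code: FLOAT_LAYOUTS, PACKED_INT8_LAYOUTS, and the layout names inside them are hypothetical stand-ins for the real enums.

```python
# Minimal model, NOT the actual ExecuTorch sources: a repset is treated as
# a frozenset of (storage, memory_layout) pairs. All names below are
# hypothetical stand-ins for the real enums/lists in utils.py.

FLOAT_LAYOUTS = {"WIDTH_PACKED", "CHANNELS_PACKED"}  # stand-in for all_memory_layouts
PACKED_INT8_LAYOUTS = {"PACKED_INT8_4W4C"}           # stand-in for the PACKED_INT8_* layouts

# ANY_STORAGE: float-only layouts, on both storage types.
ANY_STORAGE = frozenset(
    (storage, layout)
    for storage in ("buffer", "texture")
    for layout in FLOAT_LAYOUTS
)

# ANY_STORAGE_INCL_PACKED_INT8: like ANY_STORAGE, but the buffer side also
# admits the PACKED_INT8 layouts. The texture side stays float-only because
# PACKED_INT8 is buffer-only by convention.
ANY_STORAGE_INCL_PACKED_INT8 = ANY_STORAGE | frozenset(
    ("buffer", layout) for layout in PACKED_INT8_LAYOUTS
)
```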
op_registry.py: switch view_copy / clone / _clone_dim_order / alias_copy from inputs_storage=ANY_STORAGE to inputs_storage=ANY_STORAGE_INCL_PACKED_INT8. ANY_STORAGE is float-only, so when one of these no-op identity ops sits between two q8ta ops, the BFS in TagMemoryMetaPass.constrain_op_*_repset short-circuits (zero overlap with PACKED_INT8_BUFFER) and forces transitions on both sides. With ANY_STORAGE_INCL_PACKED_INT8 they now admit both float and quantized layouts, and the redundant-op transform folds them away.
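Using the toy repsets from the sketch above, the before/after behavior of the constraint propagation is easy to see. constrain_through_op below is an illustrative stand-in for the intersection performed in TagMemoryMetaPass.constrain_op_*_repset, not its real signature:

```python
# Hypothetical repset a q8ta producer would hand to its consumer.
PACKED_INT8_BUFFER = frozenset({("buffer", "PACKED_INT8_4W4C")})

def constrain_through_op(incoming_repset, op_inputs_storage):
    """Intersect the producer's repset with the op's allowed input repset.

    An empty intersection means the identity op cannot carry the layout
    through, so transitions (clones) get forced on both sides of it.
    """
    return incoming_repset & op_inputs_storage

# Before: the identity op advertises inputs_storage=ANY_STORAGE (float-only),
# so the intersection is empty and the BFS short-circuits.
assert constrain_through_op(PACKED_INT8_BUFFER, ANY_STORAGE) == frozenset()

# After: inputs_storage=ANY_STORAGE_INCL_PACKED_INT8 keeps the quantized
# layout flowing through, so the now-redundant clone can be folded away.
assert (
    constrain_through_op(PACKED_INT8_BUFFER, ANY_STORAGE_INCL_PACKED_INT8)
    == PACKED_INT8_BUFFER
)
```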
The 31 other ops using ANY_STORAGE are real compute ops (binaryop, comparison, softmax, argreduce, permute_copy, etc.) whose float-only kernels do not accept quantized int8x4 layouts (q8ta_* are separate ops), so those are left alone.
On RefineNet 24feat (1x3x256x144), the 8 _clone_dim_order ops the partitioner had been inserting around the 4 fused q8ta_pixel_shuffle nodes are now folded by the delegate. Runtime q8ta_clone dispatches drop from 11 to 3 (the 3 that remain are unrelated; they come from the original model graph).
Differential Revision: D103770022