Replace tosa_dim_order with explicit NCHW↔NHWC permutes (#19015)
mcremon-meta wants to merge 2 commits into main
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19015

Note: Links to docs will display an error until the docs builds have been completed.

❌ 10 New Failures, 1 Cancelled Job, 4 Unrelated Failures

As of commit 1736792 with merge base 89600b3, there are new failures, one cancelled job (please retry), and unrelated failures that are either likely flaky on trunk or already present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid the broken-trunk failures.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
@mcremon-meta has exported this pull request. If you are a Meta employee, you can view the originating Diff in D100712787.
Hi, great work! I have left some comments below. Overall, I think it would be easier for us to land the PR and have you run a final internal check, rather than the other way around: we have easier access to CI, and we cover a broader scope of what the backend needs to handle (e.g. dim-order input). The resulting behavior should be very similar from your perspective. Does that sound good to you?
c4dbf62 to 7199184
Summary:
Pull Request resolved: #19002

Move 6 permute optimization passes and their shared infrastructure from executorch/backends/cadence/aot/ to executorch/backends/transforms/ so they can be shared between the Cadence and Arm backends without a cross-backend dependency.

New files:
- permute_pass_utils.py: base classes (HierarchicalInplacePassInterface, RemoveOrReplacePassInterface, FuseOpPairsAcrossBranchesPass) and utilities (get_arg, set_arg, get_transposed_dims, get_permuted_dims, get_shape, get_edge_overload_packet)
- fuse_cascaded_transpose_or_permute_ops.py
- fuse_cascaded_view_ops.py
- fuse_transpose_or_permute_op_pairs_pass.py
- remove_permutes_around_elementwise_ops.py
- postpone_permute_below_squeeze_view.py
- replace_nop_transpose_or_permute_with_view.py

The shared versions omit register_cadence_pass decorators and Cadence-specific ops from default op sets. Cadence files will subclass these and re-add the decorators and ops.

Added OSS tests (test_permute_optimization_passes.py) for the 4 passes that can be imported without quantized op registration: FuseCascadedTransposeOrPermuteOps, FuseCascadedViewOps, PostponePermuteOpBelowSqueezeOrUnsqueezeLikeView, and ReplaceNopTransposeOrPermuteWithViewPass. These run in GitHub CI via pytest and are discovered automatically through pytest.ini testpaths.

Differential Revision: D101459577

Reviewed By: ethansfng
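As a rough illustration of what fusing cascaded permutes and detecting a no-op permute amount to, here is a minimal pure-Python sketch on dimension lists. The helper names `compose_permutes` and `is_nop_permute` are hypothetical and are not taken from the shared passes, which operate on graph nodes rather than plain lists. Only the simplest no-op case (an identity permutation) is covered.

```python
from typing import List, Sequence


def compose_permutes(first: Sequence[int], second: Sequence[int]) -> List[int]:
    """Return dims such that x.permute(first).permute(second) == x.permute(dims)."""
    rank = list(range(len(first)))
    if sorted(first) != rank or sorted(second) != rank:
        raise ValueError("both arguments must be permutations of the same rank")
    # Output axis j of the second permute reads axis second[j] of the
    # intermediate tensor, which in turn reads axis first[second[j]] of x.
    return [first[axis] for axis in second]


def is_nop_permute(dims: Sequence[int]) -> bool:
    """A permute that maps every axis to itself does not move any data."""
    return list(dims) == list(range(len(dims)))


NCHW_TO_NHWC = [0, 2, 3, 1]
NHWC_TO_NCHW = [0, 3, 1, 2]

# Two cascaded permutes collapse into one; here the result is the identity.
fused = compose_permutes(NCHW_TO_NHWC, NHWC_TO_NCHW)
print(fused, is_nop_permute(fused))  # [0, 1, 2, 3] True
```

Working on dimension lists keeps the arithmetic visible: a fused permute is an index lookup, and a fused permute that turns out to be the identity can be dropped entirely.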
Summary:
Pull Request resolved: #19015

Replace implicit `tosa_dim_order`-based layout handling with explicit `permute_copy` ops around TOSA operators that require NHWC layout.

### Rewrite passes insert explicit NCHW↔NHWC permutes

`RewriteConvPass`, `RewriteAvgPool2dPass`, and `RewriteMaxPool2dPass` now insert `aten.permute_copy` nodes (NCHW→NHWC before the TOSA op, NHWC→NCHW after) instead of relying on `ToTosaMemoryFormatPass` for layout conversion. This makes layout transitions visible in the graph.

### Grouped conv decomposition in NHWC

`RewriteConvPass` decomposes grouped convolutions (non-depthwise) into per-group `TOSA.CONV2D` ops operating entirely in NHWC, with a single input/output permute pair wrapping the whole group. Supports INT8 and INT16 (with and without bias) quantisation paths, including the full INT16+bias chain: CONV2D → RESCALE(INT48→INT32) → ADD(bias) → RESCALE(INT32→INT16).

### `ToTosaMemoryFormatPass` scoped down

Now only assigns non-identity dim_order to parameter/buffer placeholders (for weight serialisation) and graph I/O. Inserts `permute_copy` instead of `tosa.TRANSPOSE`. Skips users that already carry a matching permute (inserted by the rewrite passes).

### TOSA dialect op metas expect NHWC

All TOSA op meta functions (`CONV2D`, `CONV3D`, `DEPTHWISE_CONV2D`, `AVG_POOL2D`, `MAX_POOL2D`, `TRANSPOSE_CONV2D`) now assume NHWC input layout and produce NHWC output shapes.

### Removed `tosa_dim_order` shape remapping

`tosa_shape()` no longer reorders dimensions; it only resolves symints. `_get_matching_fake_tensor()` returns `node.meta["val"]` directly. Serialisation mapping always uses identity dim_order.

### Operator serialisation simplified

`op_amax`, `op_amin`, `op_any`, `op_cat`, `op_sum`, and `op_permute` no longer remap reduction/concat axes through `dim_order` since tensors are already in the layout expected by TOSA.
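To make the layout arithmetic concrete, the following is a minimal sketch on plain shape tuples, not the code from this PR. `permuted_shape` and `remap_axis` are hypothetical helpers; `remap_axis` only illustrates the kind of axis bookkeeping that a `dim_order`-based scheme needs and that explicit permutes make unnecessary.

```python
from typing import Sequence, Tuple

NCHW_TO_NHWC = (0, 2, 3, 1)
NHWC_TO_NCHW = (0, 3, 1, 2)


def permuted_shape(shape: Sequence[int], dims: Sequence[int]) -> Tuple[int, ...]:
    """Shape after permute(dims): output axis i takes input axis dims[i]."""
    return tuple(shape[d] for d in dims)


def remap_axis(axis: int, dims: Sequence[int]) -> int:
    """Position that an input axis ends up in after permute(dims)."""
    return list(dims).index(axis)


nchw = (1, 8, 32, 32)  # N, C, H, W
nhwc = permuted_shape(nchw, NCHW_TO_NHWC)
print(nhwc)  # (1, 32, 32, 8)

# The permute inserted after the TOSA op restores the original layout.
print(permuted_shape(nhwc, NHWC_TO_NCHW))  # (1, 8, 32, 32)

# A reduction over channels (axis 1 in NCHW) targets axis 3 in NHWC.
print(remap_axis(1, NCHW_TO_NHWC))  # 3
```

With the permutes present as graph nodes, each operator sees its tensors in the layout they are actually stored in, so an axis argument can be used as written.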
### Permute optimisation passes added

Six shared passes from `executorch/backends/transforms/` are now run after TOSA lowering to fuse, cancel, and simplify the permutes introduced above:

- `RemovePermutesAroundElementwiseOps` (extended for `RESCALE`)
- `FuseTransposeOrPermuteOpPairsPass` (extended for `RESCALE`)
- `ReplaceNopTransposeOrPermuteWithViewPass`
- `PostponePermuteOpBelowSqueezeOrUnsqueezeLikeView`
- `FuseCascadedTransposeOrPermuteOps`
- `FuseCascadedViewOps`

### Removed passes

`DecomposeConvWithInt16ActivationPass` and `DecomposeGroupedConvPass` are removed; their logic is now handled inline by `RewriteConvPass`. `RewriteSlicePass` is repositioned after the permute optimisations.

### Ethos-U55 partitioner simplified

The dual NCHW/NHWC permute constraint check is removed since tensors are always in the expected layout at partition time.

Differential Revision: D100712787
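The cancellation idea behind removing permutes around elementwise ops can be sketched on a toy linear chain. This is an illustration under stated assumptions, not the shared pass: the op encoding, `is_inverse_pair`, and `cancel_permutes_around_elementwise` are hypothetical. It handles only single-input chains and treats `RESCALE` as layout-agnostic.

```python
from typing import List, Tuple

# Toy op encoding (an assumption for this sketch): ("permute", dims),
# ("elementwise", name) or ("tosa", name). Real passes work on graph nodes.
Op = Tuple[str, object]

NCHW_TO_NHWC = (0, 2, 3, 1)
NHWC_TO_NCHW = (0, 3, 1, 2)


def is_inverse_pair(first, second) -> bool:
    """True if permute(first) followed by permute(second) is the identity."""
    return len(first) == len(second) and all(
        first[axis] == i for i, axis in enumerate(second)
    )


def cancel_permutes_around_elementwise(chain: List[Op]) -> List[Op]:
    """Drop permute -> elementwise ops -> inverse permute in a linear chain."""
    out: List[Op] = []
    i = 0
    while i < len(chain):
        kind, payload = chain[i]
        if kind == "permute":
            # Scan past the run of elementwise ops that follows this permute.
            j = i + 1
            while j < len(chain) and chain[j][0] == "elementwise":
                j += 1
            if (
                j < len(chain)
                and chain[j][0] == "permute"
                and is_inverse_pair(payload, chain[j][1])
            ):
                # Elementwise ops commute with permutes, so the pair cancels.
                out.extend(chain[i + 1 : j])
                i = j + 1
                continue
        out.append(chain[i])
        i += 1
    return out


chain = [
    ("tosa", "CONV2D"),
    ("permute", NHWC_TO_NCHW),
    ("elementwise", "RESCALE"),
    ("permute", NCHW_TO_NHWC),
    ("tosa", "CONV2D"),
]
print(cancel_permutes_around_elementwise(chain))
```

In this example the two convolutions end up connected through `RESCALE` with no layout change in between, which is the situation the explicit permute pairs are meant to reduce to after optimisation.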
7199184 to 40bde3c
40bde3c to 1736792