
Replace tosa_dim_order with explicit NCHW↔NHWC permutes (#19015)

Open
mcremon-meta wants to merge 2 commits into main from export-D100712787

Conversation


@mcremon-meta mcremon-meta commented Apr 21, 2026

Summary:

Replace implicit `tosa_dim_order`-based layout handling with explicit
`permute_copy` ops around TOSA operators that require NHWC layout.

### Rewrite passes insert explicit NCHW↔NHWC permutes

`RewriteConvPass`, `RewriteAvgPool2dPass`, and `RewriteMaxPool2dPass`
now insert `aten.permute_copy` nodes (NCHW→NHWC before the TOSA op,
NHWC→NCHW after) instead of relying on `ToTosaMemoryFormatPass` for
layout conversion. This makes layout transitions visible in the graph.
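The layout sandwich these passes build can be sketched on plain shape tuples; the helper names below are illustrative, not the backend's actual code:

```python
# Minimal sketch (not the actual pass code): the NCHW->NHWC sandwich the
# rewrite passes build around an NHWC-only TOSA op, shown on plain tuples.
NCHW_TO_NHWC = (0, 2, 3, 1)  # dims given to aten.permute_copy before the op
NHWC_TO_NCHW = (0, 3, 1, 2)  # dims given to aten.permute_copy after the op

def permute(shape, dims):
    """Apply a permute_copy-style dim reordering to a shape tuple."""
    return tuple(shape[d] for d in dims)

def compose(outer, inner):
    """dims of permute(permute(x, inner), outer) as a single permute."""
    return tuple(inner[d] for d in outer)

nchw = (1, 8, 32, 32)                # N, C, H, W
nhwc = permute(nchw, NCHW_TO_NHWC)   # (1, 32, 32, 8) - what the TOSA op sees
back = permute(nhwc, NHWC_TO_NCHW)   # restores the original layout
assert nhwc == (1, 32, 32, 8)
assert back == nchw
# The pair cancels: composing the two permutes yields the identity, which
# is exactly what the later optimisation passes exploit to remove them.
assert compose(NHWC_TO_NCHW, NCHW_TO_NHWC) == (0, 1, 2, 3)
```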

### Grouped conv decomposition in NHWC

`RewriteConvPass` decomposes grouped convolutions (non-depthwise) into
per-group `TOSA.CONV2D` ops operating entirely in NHWC, with a single
input/output permute pair wrapping the whole group. Supports INT8,
INT16 (with and without bias) quantisation paths, including the full
INT16+bias chain: CONV2D → RESCALE(INT48→INT32) → ADD(bias) →
RESCALE(INT32→INT16).
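The per-group channel bookkeeping behind such a decomposition can be sketched as follows; the function name and slicing layout are illustrative, not the pass's actual code:

```python
# Hedged sketch of the per-group shape bookkeeping a grouped-conv
# decomposition has to do: input channels (the last NHWC dim) are split
# into G contiguous slices, each fed to its own CONV2D, and the per-group
# outputs are concatenated back on the channel axis.
def group_slices(in_channels, out_channels, groups):
    """(input-channel range, output-channel range) for each per-group conv."""
    assert in_channels % groups == 0 and out_channels % groups == 0
    ic, oc = in_channels // groups, out_channels // groups
    return [((g * ic, (g + 1) * ic), (g * oc, (g + 1) * oc))
            for g in range(groups)]

# A conv with C_in=8, C_out=12, groups=4 becomes four CONV2D ops, each
# seeing 2 input channels and producing 3 output channels.
slices = group_slices(8, 12, 4)
assert slices[0] == ((0, 2), (0, 3))
assert slices[3] == ((6, 8), (9, 12))
```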

### `ToTosaMemoryFormatPass` scoped down

Now only assigns non-identity dim_order to parameter/buffer
placeholders (for weight serialisation) and graph I/O. Inserts
`permute_copy` instead of `tosa.TRANSPOSE`. Skips users that already
carry a matching permute (inserted by the rewrite passes).
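The "skip users that already carry a matching permute" idea can be illustrated on a toy node structure (not the real torch.fx types, and not the pass's actual check):

```python
# Illustrative sketch: if a rewrite pass already inserted an NCHW->NHWC
# permute_copy on every outgoing edge of a node, the memory-format pass
# must not insert a second one. Node is a stand-in for an fx node.
from dataclasses import dataclass, field

NCHW_TO_NHWC = (0, 2, 3, 1)

@dataclass
class Node:
    target: str
    dims: tuple = ()
    users: list = field(default_factory=list)

def has_matching_permute(node, dims=NCHW_TO_NHWC):
    """True if every user is already a permute_copy with the wanted dims."""
    return all(u.target == "aten.permute_copy" and u.dims == dims
               for u in node.users)

conv_in = Node("placeholder")
conv_in.users = [Node("aten.permute_copy", NCHW_TO_NHWC)]
assert has_matching_permute(conv_in)       # already handled: skip
conv_in.users.append(Node("aten.add"))
assert not has_matching_permute(conv_in)   # some user still needs a permute
```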

TOSA dialect op metas expect NHWC

All TOSA op meta functions (CONV2D, CONV3D, DEPTHWISE_CONV2D,
AVG_POOL2D, MAX_POOL2D, TRANSPOSE_CONV2D) now assume NHWC
input layout and produce NHWC output shapes.
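As a sketch of the kind of NHWC shape rule such a meta implements (the standard conv output-size formula; function name and signature are illustrative, not the backend's API):

```python
# Sketch of an NHWC CONV2D shape rule. Weight layout follows TOSA's
# [O, kH, kW, I] convention; pad is (top, bottom, left, right).
def conv2d_nhwc_out_shape(in_shape, weight_ohwi, stride, pad, dilation=(1, 1)):
    n, h, w, _ = in_shape            # input is NHWC
    o, kh, kw, _ = weight_ohwi
    oh = (h + pad[0] + pad[1] - dilation[0] * (kh - 1) - 1) // stride[0] + 1
    ow = (w + pad[2] + pad[3] - dilation[1] * (kw - 1) - 1) // stride[1] + 1
    return (n, oh, ow, o)            # output stays NHWC

# A 3x3 conv with stride 1 and pad 1 preserves the spatial dims:
assert conv2d_nhwc_out_shape((1, 32, 32, 8), (16, 3, 3, 8),
                             (1, 1), (1, 1, 1, 1)) == (1, 32, 32, 16)
```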

### Removed `tosa_dim_order` shape remapping

`tosa_shape()` no longer reorders dimensions—just resolves symints.
`_get_matching_fake_tensor()` returns `node.meta["val"]` directly.
Serialisation mapping always uses identity dim_order.
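A minimal sketch of what a `tosa_shape()` that only resolves symints might look like; the real function operates on torch SymInts, modeled here as a substitution dict:

```python
# Resolve symbolic dims to concrete ints while preserving dim order -
# no tosa_dim_order reordering happens any more. Illustrative only.
def tosa_shape(shape, symint_values):
    return tuple(symint_values.get(d, d) for d in shape)

assert tosa_shape(("N", 32, 32, 8), {"N": 1}) == (1, 32, 32, 8)
```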

### Operator serialisation simplified

`op_amax`, `op_amin`, `op_any`, `op_cat`, `op_sum`, and `op_permute`
no longer remap reduction/concat axes through `dim_order` since
tensors are already in the layout expected by TOSA.
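The axis remapping that is now unnecessary can be shown in two lines; the helper name is illustrative:

```python
# Sketch of the dim_order axis remap that op_sum & co. no longer need.
def remap_axis(axis, dim_order):
    """Position of logical `axis` in a tensor stored with `dim_order`."""
    return dim_order.index(axis)

# Old behaviour: reducing the NCHW channel axis (1) of a tensor stored
# NHWC meant serialising axis 3.
assert remap_axis(1, (0, 2, 3, 1)) == 3
# New behaviour: dim_order is always identity, so axes pass through.
assert remap_axis(1, (0, 1, 2, 3)) == 1
```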

### Permute optimisation passes added

Six shared passes from `executorch/backends/transforms/` are now run
after TOSA lowering to fuse, cancel, and simplify the permutes
introduced above:

- `RemovePermutesAroundElementwiseOps` (extended for `RESCALE`)
- `FuseTransposeOrPermuteOpPairsPass` (extended for `RESCALE`)
- `ReplaceNopTransposeOrPermuteWithViewPass`
- `PostponePermuteOpBelowSqueezeOrUnsqueezeLikeView`
- `FuseCascadedTransposeOrPermuteOps`
- `FuseCascadedViewOps`
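The core arithmetic behind two of these passes fits in a few lines, shown here on bare dim tuples (helper names are illustrative, not the passes' API):

```python
# Hedged sketch: cascaded permutes compose into one, and an identity
# result means the whole chain is a no-op and can be dropped.
def compose(outer, inner):
    """dims equivalent to permute(permute(x, inner), outer)."""
    return tuple(inner[d] for d in outer)

def is_nop(dims):
    return dims == tuple(range(len(dims)))

# FuseCascadedTransposeOrPermuteOps-style fusion: two permutes become one,
# and here the NCHW->NHWC / NHWC->NCHW pair inserted by the rewrite passes
# cancels outright, so the pair can be removed.
fused = compose((0, 3, 1, 2), (0, 2, 3, 1))
assert is_nop(fused)
```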

### Removed passes

`DecomposeConvWithInt16ActivationPass` and `DecomposeGroupedConvPass`
are removed—their logic is now handled inline by `RewriteConvPass`.
`RewriteSlicePass` is repositioned after the permute optimisations.

### Ethos-U55 partitioner simplified

The dual NCHW/NHWC permute constraint check is removed since tensors
are always in the expected layout at partition time.

Differential Revision: D100712787


pytorch-bot Bot commented Apr 21, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19015

Note: Links to docs will display an error until the docs builds have been completed.

❌ 10 New Failures, 1 Cancelled Job, 4 Unrelated Failures

As of commit 1736792 with merge base 89600b3:

NEW FAILURES - The following jobs have failed:

CANCELLED JOB - The following job was cancelled. Please retry:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but were already present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla Bot added the CLA Signed label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) Apr 21, 2026

meta-codesync Bot commented Apr 21, 2026

@mcremon-meta has exported this pull request. If you are a Meta employee, you can view the originating Diff in D100712787.

@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

3l1 added the partner: arm (for backend delegation, kernels, demo, etc. from the 3rd-party partner, Arm), ciflow/trunk, and module: arm (issues related to arm backend) labels Apr 21, 2026
@AdrianLundell
Collaborator

Hi, great work! I have written some comments below, but overall I think it might be easier for us to land the PR and let you run a final internal check rather than the other way around, since we have somewhat easier access to CI and a broader scope in what the backend needs to handle (e.g. dim-order input). The resulting behavior should be very similar from your perspective. Does that sound good to you?

  • I am seeing an increased number of transposes on some graphs in our transpose-count suite, mainly due to the following issues:

    • Transposes are not fused over elementwise branching in all cases
    • Transposes do not fuse in upward/downward forks
    • As long as we don't see regressions on important models and have good regression tests, I'm fine with landing the current behaviour for now and following up with improvements later.
  • The override and beartype dependencies will have to be removed from permute_pass_utils

  • A number of tests are failing:

  • There is no reason to remove the various validations done for the max/avg-pool2d tosa ops.

  • Some aesthetic nits:

    • The rewrite max_pool2d/avg_pool passes are completely rewritten as call-passes; a minimal diff that only inserts transposes in the call_operator pass would be preferable.
    • The way the new passes are introduced into the arm-backend differs from how we do it normally, would be good to follow the general structure.
    • You can remove the ToTosaMemoryFormat and tosa.TRANSPOSE traces and permute op helpers completely since they are no longer used.

Summary:
Pull Request resolved: #19002

Move 6 permute optimization passes and their shared infrastructure from
executorch/backends/cadence/aot/ to executorch/backends/transforms/ so
they can be shared between the Cadence and Arm backends without a
cross-backend dependency.

New files:
- permute_pass_utils.py: base classes (HierarchicalInplacePassInterface,
  RemoveOrReplacePassInterface, FuseOpPairsAcrossBranchesPass) and
  utilities (get_arg, set_arg, get_transposed_dims, get_permuted_dims,
  get_shape, get_edge_overload_packet)
- fuse_cascaded_transpose_or_permute_ops.py
- fuse_cascaded_view_ops.py
- fuse_transpose_or_permute_op_pairs_pass.py
- remove_permutes_around_elementwise_ops.py
- postpone_permute_below_squeeze_view.py
- replace_nop_transpose_or_permute_with_view.py
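Two of the utilities named above can be sketched from their names alone; the signatures below are guesses for illustration, not the actual `permute_pass_utils` API:

```python
# Hedged sketches of get_transposed_dims / get_permuted_dims, inferred
# from the names only.
def get_transposed_dims(dim0, dim1, rank):
    """Permutation dims equivalent to transpose(dim0, dim1) at this rank."""
    dims = list(range(rank))
    dims[dim0], dims[dim1] = dims[dim1], dims[dim0]
    return tuple(dims)

def get_permuted_dims(dims, permutation):
    """Compose an existing permute's dims with a further permutation."""
    return tuple(dims[p] for p in permutation)

assert get_transposed_dims(1, 3, 4) == (0, 3, 2, 1)
# An NCHW->NHWC permute followed by NHWC->NCHW composes to the identity:
assert get_permuted_dims((0, 2, 3, 1), (0, 3, 1, 2)) == (0, 1, 2, 3)
```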

The shared versions omit register_cadence_pass decorators and
cadence-specific ops from default op sets. Cadence files will subclass
these and re-add the decorators and ops.

Added OSS tests (test_permute_optimization_passes.py) for the 4 passes
that can be imported without quantized op registration:
FuseCascadedTransposeOrPermuteOps, FuseCascadedViewOps,
PostponePermuteOpBelowSqueezeOrUnsqueezeLikeView, and
ReplaceNopTransposeOrPermuteWithViewPass. These run in GitHub CI via
pytest and are discovered automatically through pytest.ini testpaths.

Differential Revision: D101459577

Reviewed By: ethansfng
mcremon-meta added a commit that referenced this pull request Apr 22, 2026
@meta-codesync meta-codesync Bot changed the title Replace tosa_dim_order with explicit NCHW↔NHWC permutes Replace tosa_dim_order with explicit NCHW↔NHWC permutes (#19015) Apr 22, 2026

Labels

- ciflow/trunk
- CLA Signed (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed)
- fb-exported
- meta-exported
- module: arm (issues related to arm backend)
- partner: arm (for backend delegation, kernels, demo, etc. from the 3rd-party partner, Arm)
