route EthosU input/output memcpy through overridable hook (#19264) by 3l1 · Pull Request #19264 · pytorch/executorch

3l1 · 2026-05-01T21:06:05Z

Summary:

The EthosU backend's input/output scratch shuffling currently does a plain
CPU std::memcpy of every input tensor into the scratch buffer and every
output tensor out of it on every inference. On Cortex-M55-based firmware
targets that have a DMA engine, this is a significant CPU load: inference
time is spent in memcpy that could instead be DMA-offloaded so that
the M55 sleeps while the transfer runs.

This change introduces a thin extern-C indirection — arm_ethos_io_memcpy
— that the EthosU backend uses everywhere it currently calls memcpy for
input/output scratch shuffling. The default (weak) implementation lives
in a separate translation unit (EthosUBackend_IoMemcpy.cpp) and just
calls std::memcpy, so behavior is unchanged for any consumer that doesn't
override it.
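
A minimal sketch of the indirection described above, for readers who have not seen the pattern. The signature is inferred from the review excerpt further down; the helper `copy_input_to_scratch` is hypothetical and is not the backend's actual call site. In the PR the weak default sits in its own translation unit; it is shown next to the caller here only to keep the sketch self-contained.

```cpp
#include <cstddef>
#include <cstring>

// Weak default: forwards to std::memcpy. Because the symbol is weak, a
// firmware image that defines a non-weak function with the same name
// replaces this one at link time.
extern "C" __attribute__((weak)) void
arm_ethos_io_memcpy(void* dst, const void* src, size_t size) {
  std::memcpy(dst, src, size);
}

// Hypothetical call site: stage one input tensor into the scratch buffer
// through the hook instead of calling memcpy directly.
inline void copy_input_to_scratch(
    char* scratch, const void* tensor_data, size_t nbytes) {
  arm_ethos_io_memcpy(scratch, tensor_data, nbytes);
}
```

With no override linked in, the call behaves exactly like a direct std::memcpy.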

Firmware targets can supply a strong-symbol override (e.g. routing
through a DMA engine) without touching the upstream backend code.
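
As an illustration of what such an override could look like, here is a sketch under stated assumptions. The `fake_dma_*` functions, the `g_dma_transfers` counter and the `kDmaMinBytes` threshold are all hypothetical stand-ins, not part of ExecuTorch or any vendor SDK; the "DMA" is simulated with std::memcpy so the sketch can run on a host.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical DMA driver, simulated on the host. Real firmware would
// program the DMA controller here and sleep until its completion IRQ.
static int g_dma_transfers = 0;

static void fake_dma_start(void* dst, const void* src, size_t size) {
  std::memcpy(dst, src, size);  // stand-in for the hardware transfer
  ++g_dma_transfers;
}

static void fake_dma_wait_done() {
  // Real firmware: e.g. wait for an interrupt so the core can sleep.
}

// Hypothetical threshold: below it, DMA setup cost is assumed to outweigh
// the benefit, so the copy stays on the CPU.
constexpr size_t kDmaMinBytes = 64;

// Strong (non-weak) definition: at link time it takes precedence over the
// weak default shipped with the backend.
extern "C" void arm_ethos_io_memcpy(void* dst, const void* src, size_t size) {
  if (size < kDmaMinBytes) {
    std::memcpy(dst, src, size);
    return;
  }
  // On a core with a data cache, a real override would typically also need
  // cache maintenance around the transfer.
  fake_dma_start(dst, src, size);
  fake_dma_wait_done();
}
```

The override lives entirely in the firmware's own sources; the backend keeps calling the same symbol.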

Implementation notes:

- The weak default lives in its own TU so the compiler in the call-site
  TUs cannot inline its body and bypass the link-time override. This is
  the same pattern bolt_arm_memcpy_external uses.
- Three call sites updated: input scratch copy in EthosUBackend.cpp, the
  layout-adjustment chunk loop in EthosUBackend.cpp, and the output
  scratch copy in EthosUBackend_Cortex_M.cpp.
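
The chunk loop itself is not shown in this description, so the following is only an illustration of how a chunked copy could route every chunk through the hook. The helper `copy_chunks` and its parameters are hypothetical; the weak default is repeated so the sketch compiles on its own.

```cpp
#include <cstddef>
#include <cstring>

// Weak default, as described in the summary.
extern "C" __attribute__((weak)) void
arm_ethos_io_memcpy(void* dst, const void* src, size_t size) {
  std::memcpy(dst, src, size);
}

// Hypothetical helper: copy `rows` chunks of `chunk_bytes` each from a
// densely packed source into a destination whose rows start `dst_stride`
// bytes apart. Every chunk goes through the hook, so an override sees
// all of the traffic.
inline void copy_chunks(
    char* dst,
    size_t dst_stride,
    const char* src,
    size_t chunk_bytes,
    size_t rows) {
  for (size_t r = 0; r < rows; ++r) {
    arm_ethos_io_memcpy(
        dst + r * dst_stride, src + r * chunk_bytes, chunk_bytes);
  }
}
```

A direct memcpy left inside such a loop would bypass any firmware override, which is presumably why this call site is listed alongside the two scratch copies.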

bypass-github-export-checks
bypass-github-pytorch-ci-checks
bypass-github-executorch-ci-checks

Reviewed By: rascani

Differential Revision: D103455766

pytorch-bot · 2026-05-01T21:06:09Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19264

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 15 New Failures, 3 Unrelated Failures

As of commit 845995e with merge base 0f9de6a:

NEW FAILURES - The following jobs have failed:

pull / test-mcu-cortex-m-backend / linux-job (gh)
undefined reference to `io_memcpy'
trunk / test-arm-backend-ethos-u (test_memory_allocation) / linux-job (gh)
RuntimeError: Command docker exec -t 12387f9c4d3d63bf28e5d5b9ca0c6f067f25c62c18c4f4f0ac77d53880a83036 /exec failed with exit code 2
trunk / test-arm-backend-ethos-u (test_pytest_models_ethos_u55) / linux-job (gh)
undefined reference to `io_memcpy'
trunk / test-arm-backend-ethos-u (test_pytest_models_ethos_u85) / linux-job (gh)
undefined reference to `io_memcpy'
trunk / test-arm-backend-ethos-u (test_pytest_ops_ethos_u55) / linux-job (gh)
undefined reference to `io_memcpy'
trunk / test-arm-backend-ethos-u (test_pytest_ops_ethos_u85) / linux-job (gh)
undefined reference to `io_memcpy'
trunk / test-arm-backend-ethos-u (test_run_ethos_u55) / linux-job (gh)
undefined reference to `io_memcpy'
trunk / test-arm-backend-ethos-u (test_run_ethos_u85) / linux-job (gh)
undefined reference to `io_memcpy'
trunk / test-arm-backend-zephyr (cortex-m55) / linux-job (gh)
undefined reference to `io_memcpy'
trunk / test-arm-backend-zephyr (ethos-u55) / linux-job (gh)
undefined reference to `io_memcpy'
trunk / test-arm-backend-zephyr (ethos-u85) / linux-job (gh)
undefined reference to `io_memcpy'
trunk / test-arm-ootb-linux (run_deit_e2e_ethos_u) / linux-job (gh)
undefined reference to `io_memcpy'
trunk / test-arm-ootb-linux (run_ootb_tests_ethos_u) / linux-job (gh)
undefined reference to `io_memcpy'
trunk / test-cortex-m-e2e / run (mv2) / mv2 (gh)
undefined reference to `io_memcpy'
trunk / test-cortex-m-e2e / run (mv3) / mv3 (gh)
undefined reference to `io_memcpy'

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / unittest / macos / macos-job (gh) (trunk failure)
##[error]The operation was canceled.
pull / unittest-editable / macos / macos-job (gh) (trunk failure)
##[error]The operation was canceled.
trunk / unittest-release / macos / macos-job (gh) (trunk failure)
##[error]The operation was canceled.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-codesync · 2026-05-01T21:06:13Z

@3l1 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D103455766.

github-actions · 2026-05-01T21:07:00Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

zingo

Nice idea, like it!

3l1 · 2026-05-06T01:24:08Z

⚠️ NOTE: many failing tests - looking... (suspect missing inclusion in some build script)

digantdesai · 2026-05-06T02:53:01Z

+// unit so the compiler in the call-site TUs cannot inline this body and
+// bypass the link-time override (same trick as bolt_arm_memcpy_external).
+extern "C" __attribute__((weak)) void
+io_memcpy(void* dst, const void* src, size_t size) {


Regular memcpy should already be weak for embedded toolchains, or we may be able to override it through compiler flags, but this is also OK.

3l1 requested a review from digantdesai as a code owner May 1, 2026 21:06

meta-cla Bot added the CLA Signed label May 1, 2026

github-actions Bot added the ciflow/trunk and module: arm labels May 1, 2026

meta-codesync Bot added fb-exported meta-exported labels May 1, 2026

meta-codesync Bot force-pushed the export-D103455766 branch from ddea8da to ffc9927 Compare May 1, 2026 21:07

3l1 added the partner: arm label May 1, 2026

3l1 requested a review from gggekov May 1, 2026 21:10

zingo approved these changes May 4, 2026

Comment thread backends/arm/runtime/EthosUBackend_IoMemcpy.cpp Outdated

rascani approved these changes May 4, 2026

meta-codesync Bot changed the title ~~route EthosU input/output memcpy through overridable hook~~ route EthosU input/output memcpy through overridable hook (#19264) May 5, 2026

meta-codesync Bot force-pushed the export-D103455766 branch from ffc9927 to 8eeb57c Compare May 5, 2026 22:00

meta-codesync Bot force-pushed the export-D103455766 branch from 8eeb57c to 3fe2220 Compare May 5, 2026 23:59

meta-codesync Bot force-pushed the export-D103455766 branch from 3fe2220 to 845995e Compare May 6, 2026 00:13

digantdesai reviewed May 6, 2026

digantdesai approved these changes May 6, 2026

3l1 wants to merge 1 commit into main from export-D103455766
