route EthosU input/output memcpy through overridable hook (#19264)

Open
3l1 wants to merge 1 commit into main from export-D103455766

Contributor

@3l1 3l1 commented May 1, 2026

Summary:

The EthosU backend's input/output scratch shuffling currently does a plain
CPU std::memcpy of every input tensor into the scratch buffer and every
output tensor out of it on every inference. On Cortex-M55-based firmware
targets that have a DMA engine, this is a significant CPU load: inference
time is spent in memcpy that could instead be offloaded to DMA, letting
the M55 sleep while the transfer runs.

This change introduces a thin extern-C indirection — arm_ethos_io_memcpy
— that the EthosU backend uses everywhere it currently calls memcpy for
input/output scratch shuffling. The default (weak) implementation lives
in a separate translation unit (EthosUBackend_IoMemcpy.cpp) and just
calls std::memcpy, so behavior is unchanged for any consumer that doesn't
override it.

Firmware targets can supply a strong-symbol override (e.g. routing
through a DMA engine) without touching the upstream backend code.
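
As a rough illustration of what such an override might look like on the firmware side, here is a minimal sketch. The signature follows the hook as shown in the review thread; the `hypothetical_dma_*` helpers and the transfer counter are placeholders invented for this example (simulated with std::memcpy so it runs on a host), not part of any real driver or of this PR.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical stand-ins for a vendor DMA driver. On a host build they
// simply simulate the transfer with std::memcpy.
static int g_dma_transfers = 0;

static void hypothetical_dma_start(void* dst, const void* src, std::size_t size) {
  std::memcpy(dst, src, size);  // Simulated transfer.
  ++g_dma_transfers;
}

static void hypothetical_dma_wait() {
  // Assumption: real firmware would block or sleep here until the DMA
  // completion interrupt fires, and may also need cache maintenance.
}

// Strong (non-weak) definition: when linked into the firmware image, the
// linker prefers it over the backend's weak default.
extern "C" void arm_ethos_io_memcpy(void* dst, const void* src, std::size_t size) {
  if (size == 0) {
    return;  // Nothing to transfer.
  }
  hypothetical_dma_start(dst, src, size);
  hypothetical_dma_wait();
}
```

Because the override is an ordinary function with C linkage, it would live in the firmware's own sources and need no changes to the backend.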

Implementation notes:

  • The weak default lives in its own TU so the compiler in the call-site
    TUs cannot inline its body and bypass the link-time override. This is
    the same pattern bolt_arm_memcpy_external uses.
  • Three call sites updated: input scratch copy in EthosUBackend.cpp, the
    layout-adjustment chunk loop in EthosUBackend.cpp, and the output
    scratch copy in EthosUBackend_Cortex_M.cpp.
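
The weak-default pattern described in the notes above can be sketched as follows. This is an illustration under assumptions, not the PR's actual code: `copy_input_to_scratch` is a hypothetical stand-in for a backend call site, and both pieces are shown in one file only so the example is self-contained. In the layout the notes describe, the weak definition sits in its own translation unit and call sites see only a declaration. The `weak` attribute is GCC/Clang syntax.

```cpp
#include <cstddef>
#include <cstring>

// Weak default: used only if no strong definition of the same symbol is
// linked in. Behavior is identical to a plain std::memcpy.
extern "C" __attribute__((weak)) void
arm_ethos_io_memcpy(void* dst, const void* src, std::size_t size) {
  std::memcpy(dst, src, size);
}

// Hypothetical call site: the backend routes its input/output copies
// through the hook instead of calling memcpy directly.
void copy_input_to_scratch(char* scratch, const char* input, std::size_t nbytes) {
  arm_ethos_io_memcpy(scratch, input, nbytes);
}
```

With no override linked in, the copy behaves exactly as before; linking a strong `arm_ethos_io_memcpy` replaces it without recompiling the call sites.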

bypass-github-export-checks
bypass-github-pytorch-ci-checks
bypass-github-executorch-ci-checks

Reviewed By: rascani

Differential Revision: D103455766

@3l1 3l1 requested a review from digantdesai as a code owner May 1, 2026 21:06

pytorch-bot Bot commented May 1, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19264

Note: Links to docs will display an error until the docs builds have been completed.

❌ 15 New Failures, 3 Unrelated Failures

As of commit 845995e with merge base 0f9de6a (image):

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed label May 1, 2026
Contributor

meta-codesync Bot commented May 1, 2026

@3l1 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D103455766.


github-actions Bot commented May 1, 2026

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@meta-codesync meta-codesync Bot force-pushed the export-D103455766 branch from ddea8da to ffc9927 Compare May 1, 2026 21:07
@3l1 3l1 added the partner: arm label May 1, 2026
@3l1 3l1 requested a review from gggekov May 1, 2026 21:10
Collaborator

@zingo zingo left a comment

Nice idea, like it!

Comment thread backends/arm/runtime/EthosUBackend_IoMemcpy.cpp Outdated
@meta-codesync meta-codesync Bot changed the title route EthosU input/output memcpy through overridable hook route EthosU input/output memcpy through overridable hook (#19264) May 5, 2026
meta-codesync Bot pushed a commit that referenced this pull request May 5, 2026
@meta-codesync meta-codesync Bot force-pushed the export-D103455766 branch from ffc9927 to 8eeb57c Compare May 5, 2026 22:00
meta-codesync Bot pushed a commit that referenced this pull request May 5, 2026
@meta-codesync meta-codesync Bot force-pushed the export-D103455766 branch from 8eeb57c to 3fe2220 Compare May 5, 2026 23:59
@meta-codesync meta-codesync Bot force-pushed the export-D103455766 branch from 3fe2220 to 845995e Compare May 6, 2026 00:13
Contributor Author

3l1 commented May 6, 2026

⚠️ NOTE: many failing tests - looking... (suspect missing inclusion in some build script)

// unit so the compiler in the call-site TUs cannot inline this body and
// bypass the link-time override (same trick as bolt_arm_memcpy_external).
extern "C" __attribute__((weak)) void
io_memcpy(void* dst, const void* src, size_t size) {
Contributor

Regular memcpy should already be weak in embedded toolchains, or we may be able to override it through compiler flags, but this is also OK.

Labels

ciflow/trunk
CLA Signed (This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.)
fb-exported
meta-exported
module: arm (Issues related to arm backend)
partner: arm (For backend delegation, kernels, demo, etc. from the 3rd-party partner, Arm)

4 participants