optimized: add grid_sampler_2d.out (NEON) and sum.IntList_out (Vectorized<float>) #19119

Open
jgibson2 wants to merge 5 commits into pytorch:main from PolyCam:jgibson/upstream-optimized-grid-sum

Conversation

@jgibson2
Contributor

Summary

Two new optimized CPU kernels registered alongside the existing optimized_kernels library. Both replace the portable reference kernel (still available as fallback for unsupported inputs) with vectorized implementations that accumulate in fp32, which also sidesteps the fp16 precision issue noted in #19117 for grid_sampler_2d bilinear.

Measured end-to-end on a real depth model (Pixel 9 / arm64-v8a, fp16 inputs, shapes representative of the model's hot path):

| Op                                   | Portable | This PR | Speedup |
| ------------------------------------ | -------- | ------- | ------- |
| grid_sampler_2d.out                  | 17.3 ms  | 3.4 ms  | 5.1×    |
| sum.IntList_out (5 calls, aggregate) | 3.0 ms   | 0.56 ms | 5.4×    |

grid_sampler_2d.out

aarch64 NEON, bilinear + zeros padding only (the dominant mode for depth / MVS / spatial transformer networks). Processes 4 channels per iteration with a vectorized FMA chain. fp16 inputs are promoted to fp32 for weight computation and accumulation, cast back on store — the portable kernel's fp16 weight subtractions like (ix_se - ix) otherwise suffer catastrophic cancellation (same concern as #19117). Unsupported modes and non-aarch64 targets delegate to the portable kernel.
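For intuition, here is a scalar sketch of the bilinear + zeros-padding math. It is illustrative only: the function name, pixel-space coordinates, and single-channel layout are simplifications (the real kernel also maps normalized grid coordinates to pixel space and processes four channels per iteration with NEON FMAs), but it shows the key property that corner weights and the accumulation stay in fp32:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Bilinear sample with zeros padding at one (ix, iy) pixel location,
// single channel, H x W row-major input. Weights and accumulation are
// fp32; an fp16 kernel converts loads to fp32 here and casts back on store.
float bilinear_zeros(const std::vector<float>& img, int H, int W,
                     float ix, float iy) {
  const int x0 = static_cast<int>(std::floor(ix));
  const int y0 = static_cast<int>(std::floor(iy));
  const int x1 = x0 + 1, y1 = y0 + 1;
  // Corner weights in fp32: fp16 differences like (ix_se - ix) are where
  // the portable kernel loses precision to cancellation.
  const float wx1 = ix - x0, wx0 = 1.0f - wx1;
  const float wy1 = iy - y0, wy0 = 1.0f - wy1;
  auto at = [&](int y, int x) -> float {  // zeros padding outside bounds
    return (x >= 0 && x < W && y >= 0 && y < H) ? img[y * W + x] : 0.0f;
  };
  return wy0 * (wx0 * at(y0, x0) + wx1 * at(y0, x1)) +
         wy1 * (wx0 * at(y1, x0) + wx1 * at(y1, x1));
}
```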

sum.IntList_out

at::vec::Vectorized<float>-based implementation of the single-dim reduction fast path (both innermost-contiguous and strided cases). Cross-architecture SIMD via PyTorch's existing vector abstraction; always accumulates in fp32 regardless of input dtype. Multi-dim reductions, dtype-converting reductions, and complex types delegate to portable.
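A scalar sketch of the innermost-contiguous fast path (hypothetical shape and names; the actual kernel replaces the inner loop with `at::vec::Vectorized<float>` lanes plus a horizontal reduction):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reduce the innermost (contiguous) dim of a tensor viewed as
// [outer, reduce_len]. Accumulates in fp32 regardless of input dtype;
// an fp16 input would be widened element-by-element on load.
std::vector<float> sum_innermost(const std::vector<float>& in,
                                 std::size_t outer, std::size_t reduce_len) {
  std::vector<float> out(outer, 0.0f);
  for (std::size_t o = 0; o < outer; ++o) {
    float acc = 0.0f;  // the real kernel keeps several fp32 lanes in a
                       // Vectorized<float> and horizontally adds at the end
    const float* row = in.data() + o * reduce_len;
    for (std::size_t i = 0; i < reduce_len; ++i) {
      acc += row[i];
    }
    out[o] = acc;
  }
  return out;
}
```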

Integration

  • Sources added to OPTIMIZED_KERNELS_SRCS in build_variables.bzl and to OPTIMIZED_ATEN_OPS in op_registration_util.bzl. Single source of truth for both Buck and CMake builds.
  • optimized.yaml registers the ops with the standard opt_* naming convention used by sibling kernels.
  • kernels/optimized/CMakeLists.txt scopes the -march=armv8.2-a+fp16 flag to just op_grid_sampler_2d.cpp via set_source_files_properties, so x86_64 builds are unaffected. The kernel has #ifdef __aarch64__ guards and falls through to portable on non-arm64 targets.

Test plan

  • Builds cleanly for Android arm64-v8a, Android x86_64 (via scripts/build_android_library.sh), and host (macOS / Apple Clang 21).
  • Existing kernels/test/op_grid_sampler_2d_test.cpp and op_sum_test.cpp unit tests continue to pass — both target the aten::sum_outf / aten::grid_sampler_2d_outf codegen-dispatched entry points, so they automatically exercise the optimized kernels when linked.
  • Numerical verification against an fp32 reference (run portable in fp32, cast to fp16) on the shapes the polycam depth model uses — all cases pass within fp16 ULP.
  • End-to-end Pixel 9 latency on a representative trained model matches the handwritten-NEON reference implementation to within run-to-run noise while producing more accurate fp16 outputs (fp32 accumulation).

Candidate successor to #19117 for the grid_sampler half — applies the same precision fix but at the optimized-kernel layer, so callers who link optimized_ops_lib get both the correctness fix and the speedup.

…List_out

@pytorch-bot

pytorch-bot Bot commented Apr 24, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19119

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 3 New Failures, 2 Unrelated Failures

As of commit 4e98f5a with merge base de8ce55:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla Bot added the CLA Signed label (authors need to sign the CLA before a PR can be reviewed) on Apr 24, 2026
@jgibson2
Contributor Author

@pytorchbot label "release notes: none"

The NEON fast path indexes input/grid/out directly assuming contiguous
NCHW default-dim-order layout — no use of .strides() or .dim_order().
If the caller passes anything else (NHWC, transposed, strided, channels-
last), we'd read wrong memory and silently produce garbage output.

Add the same check pattern op_sum.cpp already uses at L150-151:
tensor_is_default_dim_order + tensor_is_contiguous on input, grid, and
out. If any fails, delegate to the portable kernel (which handles
arbitrary strides / dim orders correctly via .strides()).

No perf impact on the hot path — the checks are a handful of scalar
comparisons run once per call, and the common polycam depth model case
is already default-contiguous so the fast path is still taken.
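
Conceptually, the guard reduces to a stride walk. A self-contained sketch of the idea (the kernel itself would call ExecuTorch's `tensor_is_default_dim_order` / `tensor_is_contiguous` helpers rather than hand-rolling this):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// True iff `strides` describe a dense, default-dim-order (row-major)
// layout for `sizes`: the precondition the NEON fast path relies on
// when it indexes the buffers directly.
bool is_default_contiguous(const std::vector<int64_t>& sizes,
                           const std::vector<int64_t>& strides) {
  int64_t expected = 1;
  for (size_t i = sizes.size(); i-- > 0;) {
    if (strides[i] != expected) {
      return false;  // e.g. NHWC, transposed, or strided views fail here
    }
    expected *= sizes[i];
  }
  return true;
}
```

A production check also has to tolerate size-1 dims, whose strides can be arbitrary; the ExecuTorch helpers handle that, this sketch does not.
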
@GregoryComer
Member

GregoryComer commented Apr 24, 2026

@JacobSzwejbka @manuelcandales @digantdesai Do you have any concerns with conditionally linking an op in optimized only on some architectures?

Apply lintrunner -a auto-fixes to satisfy CI (CLANGFORMAT and
CMAKEFORMAT). No functional changes.
@GregoryComer left a comment (Member)

Thanks for the PR! I left one comment about runtime gating the armv8.2+fp16 code. Other than that, it looks good.

We are currently looking at adding better support for in-tree CPU operator implementations with arch-specific dispatch, so this type of thing should become easier soon.

set_source_files_properties(
  ${EXECUTORCH_ROOT}/kernels/optimized/cpu/op_grid_sampler_2d.cpp
  PROPERTIES COMPILE_OPTIONS "-march=armv8.2-a+fp16"
)

Would it be possible to split out the native f16 path? Right now, it will potentially SIGILL on ARM hardware without f16 support. If possible, I'd recommend something like this:

  • Move the native f16 impl into a separate source file. Scope the march +fp16 to just this file.
  • Add a variant that does the f16<->f32 conversion in software.
  • In the top-level kernel, check hardware support using cpuinfo_has_arm_neon_fp16 and route to the implementation.

Address review feedback on pytorch#19119: the previous
op_grid_sampler_2d.cpp compiled the whole file with
-march=armv8.2-a+fp16, which meant the resulting binary would SIGILL on
ARMv8.0 / ARMv8.1 chips that lack the fp16 extension.

Split the fp16 path into two translation units:

  * op_grid_sampler_2d_fp16_hw.cpp — hardware fp16 fast path. Uses
    vld1_f16 / vcvt_f32_f16 / vfmaq_f32 / vcvt_f16_f32 / vst1_f16.
    Compiled with -march=armv8.2-a+fp16 (flag scoped to this TU via
    set_source_files_properties in CMake, and via compiler_flags on a
    dedicated runtime.cxx_library in targets.bzl).

  * op_grid_sampler_2d.cpp — hosts the fp32 NEON path, a new fp16
    software-convert path, and the runtime dispatcher. Plain ARMv8
    only. The SW path converts fp16<->fp32 via c10::Half's portable
    operator float() / constructor (no hardware fp16 instructions) and
    does all compute on NEON fp32 lanes. Slower per conversion than the
    HW path but safe on any ARMv8 CPU.

The dispatcher calls cpuinfo_initialize() + cpuinfo_has_arm_neon_fp16()
(cpuinfo already transitively linked via extension_threadpool) and
routes to the appropriate variant. fp32 inputs use the unchanged NEON
fp32 path; any unsupported layout/padding/interpolation still falls
through to the portable kernel.

Buck: adds a new runtime.cxx_library(op_grid_sampler_2d_fp16_hw) in
kernels/optimized/cpu/targets.bzl with the +fp16 compile flag gated on
ovr_config//cpu:arm64, and wires op_grid_sampler_2d to depend on it and
on cpuinfo.

No behavior change on fp32 inputs. fp16 inputs on +fp16-capable chips
keep the existing fast path at the same speed; fp16 inputs on chips
without the extension now run the SW variant instead of crashing.
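
For reference, the SW path's per-element conversion amounts to fp16 bit manipulation. A standalone sketch of what a portable half-to-float conversion computes (hypothetical code, not the actual c10::Half source):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Decode an IEEE fp16 bit pattern to float. Mirrors what a portable
// (no hardware fp16) conversion has to do: split sign/exponent/mantissa
// and rescale. Illustrative sketch, not the c10::Half implementation.
float half_to_float(uint16_t h) {
  const uint32_t sign = (h >> 15) & 1;
  const uint32_t exp = (h >> 10) & 0x1F;
  const uint32_t mant = h & 0x3FF;
  float out;
  if (exp == 0) {          // zero or subnormal: mant * 2^-24
    out = std::ldexp(static_cast<float>(mant), -24);
  } else if (exp == 31) {  // inf or NaN
    out = mant ? std::numeric_limits<float>::quiet_NaN()
               : std::numeric_limits<float>::infinity();
  } else {                 // normal: (0x400 | mant) * 2^(exp - 25)
    out = std::ldexp(static_cast<float>(0x400 | mant),
                     static_cast<int>(exp) - 25);
  }
  return sign ? -out : out;
}
```
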
@jgibson2
Contributor Author

Addressed the SIGILL concern — new commit 53697e9 splits the fp16 path into HW and SW variants with runtime dispatch:

  • op_grid_sampler_2d_fp16_hw.cpp (new) — hardware fp16 fast path. Uses vld1_f16 / vcvt_f32_f16 / vcvt_f16_f32 / vst1_f16. Compiled with -march=armv8.2-a+fp16 (scoped via set_source_files_properties in CMake; compiler_flags gated on ovr_config//cpu:arm64 in the Buck target).
  • op_grid_sampler_2d.cpp — now hosts the fp32 path, a new fp16 software-convert path, and the runtime dispatcher. Plain ARMv8 only. The SW path converts fp16↔fp32 via c10::Half's portable conversions (no hardware fp16 instructions) and does all compute on NEON fp32 lanes — slower per conversion but safe on any ARMv8 chip.

Dispatcher is at op_grid_sampler_2d.cpp:394-425. Key lines:

if (input.scalar_type() == ScalarType::Half) {
  if (cpuinfo_initialize() && cpuinfo_has_arm_neon_fp16()) {
    // HW path (in _fp16_hw.cpp)
  }
  // fall through to SW path (in this file)
}

cpuinfo is already transitively linked via extension_threadpool, so the dependency was a one-line Buck addition. No change to behavior on fp16-capable chips (Pixel 9, S24 FE, etc.); chips without the +fp16 extension now run the SW variant instead of raising SIGILL.

One thing I'd appreciate guidance on: the fp16 SW and HW paths share ~60 lines of loop body verbatim — same FMA chain, same indexing, same weight math. I kept them copy-pasted for clarity rather than macro/template gymnastics. Happy to DRY that up if you'd prefer.
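One macro-free way to share that loop body would be a template over a load/store policy. A minimal sketch, with a trivial float policy standing in for hypothetical `Fp16HwIo` (NEON-intrinsic) and `Fp16SwIo` (c10::Half) policies:

```cpp
#include <cassert>
#include <cstddef>

// Shared loop body, parameterized on a load/store policy. A
// hypothetical Fp16HwIo would load via vld1_f16/vcvt_f32_f16; a
// Fp16SwIo would convert through c10::Half. The math in between is
// written once.
template <typename Io>
void scale_kernel(const typename Io::elem* in, typename Io::elem* out,
                  std::size_t n, float scale) {
  for (std::size_t i = 0; i < n; ++i) {
    const float v = Io::load(in[i]);  // promote to fp32
    out[i] = Io::store(v * scale);    // shared fp32 math, narrow on store
  }
}

struct FloatIo {  // trivial stand-in policy for illustration
  using elem = float;
  static float load(elem v) { return v; }
  static elem store(float v) { return v; }
};
```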

Buck's op_registration_util._enforce_deps rejects any dep starting with
`:op_` on the theory that op_targets should not depend on other
op_targets. The previously-named `:op_grid_sampler_2d_fp16_hw` tripped
that check when op_grid_sampler_2d (which is an op_target) declared it
as a dep.

Rename the internal helper library to `grid_sampler_2d_fp16_hw_impl`,
matching the existing `add_sub_impl` / `binary_ops` naming for
op-specific implementation helpers. No change to file contents, CMake,
or the C++ dispatch — only the Buck target name and the corresponding
dep reference.
@meta-codesync
Contributor

meta-codesync Bot commented Apr 24, 2026

@GregoryComer has imported this pull request. If you are a Meta employee, you can view this in D102420839.


Labels

CLA Signed (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) · release notes: none (do not include this in the release notes)

3 participants