optimized: add grid_sampler_2d.out (NEON) and sum.IntList_out (Vectorized&lt;float&gt;) #19119

jgibson2 wants to merge 5 commits into pytorch:main
Conversation
…List_out

Two new optimized CPU kernels registered alongside the existing optimized_kernels library. Both replace the portable reference kernel (still available as fallback for unsupported inputs) with a vectorized implementation that accumulates in fp32, avoiding the fp16 precision issues noted in pytorch#19117 for grid_sampler_2d bilinear.

Measured end-to-end on a real depth model (Pixel 9, fp16 inputs, shapes representative of the model's hot path):

| Op | Portable | This PR | Speedup |
| -------------------------------- | -------- | ------- | ------- |
| grid_sampler_2d.out | 17.3 ms | 3.4 ms | 5.1x |
| sum.IntList_out (5 calls, total) | 3.0 ms | 0.56 ms | 5.4x |

### grid_sampler_2d.out

aarch64 NEON, bilinear + zeros padding only. Processes 4 channels per iteration with a vectorized FMA chain. fp16 inputs are promoted to fp32 for weight computation and accumulation, then cast back on store — the portable kernel's fp16 weight subtractions like `(ix_se - ix)` otherwise suffer catastrophic cancellation. Unsupported modes and non-aarch64 targets delegate to the portable kernel.

### sum.IntList_out

at::vec::Vectorized&lt;float&gt;-based implementation of the single-dim reduction fast path (both innermost-contiguous and strided cases). Cross-architecture SIMD via PyTorch's existing vector abstraction; accumulates in fp32 regardless of input dtype. Multi-dim reductions, dtype-converting reductions, and complex types delegate to portable.

### Integration

- Sources added to OPTIMIZED_KERNELS_SRCS in build_variables.bzl and to OPTIMIZED_ATEN_OPS in op_registration_util.bzl. Single source of truth for both Buck and CMake builds.
- optimized.yaml registers the ops with the standard opt_* naming convention used by sibling kernels.
- kernels/optimized/CMakeLists.txt scopes the -march=armv8.2-a+fp16 flag to just op_grid_sampler_2d.cpp via set_source_files_properties, so x86_64 builds are unaffected. The kernel has #ifdef __aarch64__ guards and falls through to portable on non-arm64 targets.
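Editor's note: a rough sketch of the `Vectorized<float>` accumulation pattern the sum.IntList_out description refers to, assuming ATen's `ATen/cpu/vec/vec.h` header is available. It covers only the contiguous-innermost fp32 case; none of the strided, fp16-widening, or multi-dim handling the actual kernel needs is shown, and it is not code from this PR.

```cpp
#include <ATen/cpu/vec/vec.h>

#include <cstdint>

// Accumulate in a single fp32 vector register, horizontal-reduce once at the
// end, then finish any remainder with a scalar tail.
float sum_contiguous_fp32(const float* data, int64_t n) {
  using Vec = at::vec::Vectorized<float>;
  Vec acc(0.0f);
  int64_t i = 0;
  for (; i + Vec::size() <= n; i += Vec::size()) {
    acc = acc + Vec::loadu(data + i);
  }
  // Horizontal sum of the vector accumulator.
  float buf[Vec::size()];
  acc.store(buf);
  float total = 0.0f;
  for (int64_t k = 0; k < Vec::size(); ++k) {
    total += buf[k];
  }
  // Scalar tail for the last n % Vec::size() elements.
  for (; i < n; ++i) {
    total += data[i];
  }
  return total;
}
```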
❌ 3 New Failures, 2 Unrelated Failures as of commit 4e98f5a with merge base de8ce55.

BROKEN TRUNK: the remaining failed jobs were already failing on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
@pytorchbot label "release notes: none"
The NEON fast path indexes input/grid/out directly assuming contiguous NCHW default-dim-order layout — no use of .strides() or .dim_order(). If the caller passes anything else (NHWC, transposed, strided, channels-last), we'd read wrong memory and silently produce garbage output.

Add the same check pattern op_sum.cpp already uses at L150-151: tensor_is_default_dim_order + tensor_is_contiguous on input, grid, and out. If any check fails, delegate to the portable kernel (which handles arbitrary strides / dim orders correctly via .strides()).

No perf impact on the hot path — the checks are a handful of scalar comparisons run once per call, and the common polycam depth model case is already default-contiguous, so the fast path is still taken.
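Editor's note: a minimal sketch of that guard, assuming the helper names follow the op_sum.cpp pattern (`tensor_is_default_dim_order` / `tensor_is_contiguous` from ExecuTorch's tensor_util); this is an illustration, not the PR's actual diff.

```cpp
#include <executorch/runtime/core/exec_aten/exec_aten.h>
#include <executorch/runtime/core/exec_aten/util/tensor_util.h>

using executorch::aten::Tensor;

// True only when all three tensors can be indexed as plain contiguous NCHW,
// i.e. when the NEON fast path's raw pointer arithmetic is valid.
bool can_use_neon_fast_path(
    const Tensor& input, const Tensor& grid, const Tensor& out) {
  using executorch::runtime::tensor_is_contiguous;
  using executorch::runtime::tensor_is_default_dim_order;
  return tensor_is_default_dim_order(input) && tensor_is_contiguous(input) &&
      tensor_is_default_dim_order(grid) && tensor_is_contiguous(grid) &&
      tensor_is_default_dim_order(out) && tensor_is_contiguous(out);
}
// When this returns false, the op should delegate to the portable kernel,
// which walks .strides() / dim order explicitly.
```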
@JacobSzwejbka @manuelcandales @digantdesai Do you have any concerns with conditionally linking an op in optimized only on some architectures?
Apply lintrunner -a auto-fixes to satisfy CI (CLANGFORMAT and CMAKEFORMAT). No functional changes.
GregoryComer left a comment:
Thanks for the PR! I left one comment about runtime gating the armv8.2+fp16 code. Other than that, it looks good.
We are currently looking at adding better support for in-tree CPU operator implementations with arch-specific dispatch, so this type of thing should become easier soon.
```cmake
)
set_source_files_properties(
  ${EXECUTORCH_ROOT}/kernels/optimized/cpu/op_grid_sampler_2d.cpp
  PROPERTIES COMPILE_OPTIONS "-march=armv8.2-a+fp16"
```
Would it be possible to split out the native f16 path? Right now, it will potentially SIGILL on ARM hardware without f16 support. If possible, I'd recommend something like this:
- Move the native f16 impl into a separate source file. Scope the march +fp16 to just this file.
- Add a variant that does the f16<->f32 conversion in software.
- In the top-level kernel, check hardware support using cpuinfo_has_arm_neon_fp16 and route to the implementation.
Address review feedback on pytorch#19119: the previous op_grid_sampler_2d.cpp compiled the whole file with -march=armv8.2-a+fp16, which meant the resulting binary would SIGILL on ARMv8.0 / ARMv8.1 chips that lack the fp16 extension. Split the fp16 path into two translation units:

* op_grid_sampler_2d_fp16_hw.cpp — hardware fp16 fast path. Uses vld1_f16 / vcvt_f32_f16 / vfmaq_f32 / vcvt_f16_f32 / vst1_f16. Compiled with -march=armv8.2-a+fp16 (flag scoped to this TU via set_source_files_properties in CMake, and via compiler_flags on a dedicated runtime.cxx_library in targets.bzl).

* op_grid_sampler_2d.cpp — hosts the fp32 NEON path, a new fp16 software-convert path, and the runtime dispatcher. Plain ARMv8 only. The SW path converts fp16<->fp32 via c10::Half's portable operator float() / constructor (no hardware fp16 instructions) and does all compute on NEON fp32 lanes. Slower per conversion than the HW path but safe on any ARMv8 CPU.

The dispatcher calls cpuinfo_initialize() + cpuinfo_has_arm_neon_fp16() (cpuinfo already transitively linked via extension_threadpool) and routes to the appropriate variant. fp32 inputs use the unchanged NEON fp32 path; any unsupported layout/padding/interpolation still falls through to the portable kernel.

Buck: adds a new runtime.cxx_library(op_grid_sampler_2d_fp16_hw) in kernels/optimized/cpu/targets.bzl with the +fp16 compile flag gated on ovr_config//cpu:arm64, and wires op_grid_sampler_2d to depend on it and on cpuinfo.

No behavior change on fp32 inputs. fp16 inputs on +fp16-capable chips keep the existing fast path at the same speed; fp16 inputs on chips without the extension now run the SW variant instead of crashing.
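Editor's note: a rough sketch of the promote-to-fp32 pattern the intrinsics listed above imply (widen fp16, accumulate with FMAs in fp32, narrow only on store). It assumes a translation unit compiled with -march=armv8.2-a+fp16; the real kernel's loop structure, channel indexing, and weight math live in op_grid_sampler_2d_fp16_hw.cpp and are not reproduced here.

```cpp
#include <arm_neon.h>

// Blend four corner samples with bilinear weights, keeping all arithmetic in
// fp32 lanes so the fp16 cancellation issue from pytorch#19117 cannot recur.
static inline void bilinear_blend_fp16(
    const float16_t* nw, const float16_t* ne,
    const float16_t* sw, const float16_t* se,
    float w_nw, float w_ne, float w_sw, float w_se,
    float16_t* out) {
  // Widen 4 fp16 values per corner to fp32 lanes (vld1_f16 + vcvt_f32_f16).
  float32x4_t v_nw = vcvt_f32_f16(vld1_f16(nw));
  float32x4_t v_ne = vcvt_f32_f16(vld1_f16(ne));
  float32x4_t v_sw = vcvt_f32_f16(vld1_f16(sw));
  float32x4_t v_se = vcvt_f32_f16(vld1_f16(se));
  // FMA chain accumulating entirely in fp32.
  float32x4_t acc = vmulq_n_f32(v_nw, w_nw);
  acc = vfmaq_n_f32(acc, v_ne, w_ne);
  acc = vfmaq_n_f32(acc, v_sw, w_sw);
  acc = vfmaq_n_f32(acc, v_se, w_se);
  // Narrow back to fp16 only on the final store (vcvt_f16_f32 + vst1_f16).
  vst1_f16(out, vcvt_f16_f32(acc));
}
```

Per the commit message, the SW variant would compute the same fp32 FMA chain but replace the vld1_f16/vcvt pairs with per-element c10::Half-to-float conversions.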
Addressed the SIGILL concern — new commit 53697e9 splits the fp16 path into HW and SW variants with runtime dispatch. The dispatcher is:

```cpp
if (input.scalar_type() == ScalarType::Half) {
  if (cpuinfo_initialize() && cpuinfo_has_arm_neon_fp16()) {
    // HW path (in _fp16_hw.cpp)
  }
  // fall through to SW path (in this file)
}
```

cpuinfo is already transitively linked via extension_threadpool.

One thing I'd appreciate guidance on: the fp16 SW and HW paths share ~60 lines of loop body verbatim — same FMA chain, same indexing, same weight math. I kept them copy-pasted for clarity rather than macro/template gymnastics. Happy to DRY that up if you'd prefer.
Buck's op_registration_util._enforce_deps rejects any dep starting with `:op_` on the theory that op_targets should not depend on other op_targets. The previously-named `:op_grid_sampler_2d_fp16_hw` tripped that check when op_grid_sampler_2d (which is an op_target) declared it as a dep. Rename the internal helper library to `grid_sampler_2d_fp16_hw_impl`, matching the existing `add_sub_impl` / `binary_ops` naming for op-specific implementation helpers. No change to file contents, CMake, or the C++ dispatch — only the Buck target name and the corresponding dep reference.
@GregoryComer has imported this pull request. If you are a Meta employee, you can view this in D102420839.
## Summary

Two new optimized CPU kernels registered alongside the existing `optimized_kernels` library. Both replace the portable reference kernel (still available as fallback for unsupported inputs) with vectorized implementations that accumulate in fp32, which also sidesteps the fp16 precision issue noted in #19117 for `grid_sampler_2d` bilinear.

Measured end-to-end on a real depth model (Pixel 9 / arm64-v8a, fp16 inputs, shapes representative of the model's hot path):

| Op | Portable | This PR | Speedup |
| -------------------------------- | -------- | ------- | ------- |
| `grid_sampler_2d.out` | 17.3 ms | 3.4 ms | 5.1x |
| `sum.IntList_out` (5 calls, aggregate) | 3.0 ms | 0.56 ms | 5.4x |

### grid_sampler_2d.out

aarch64 NEON, bilinear + zeros padding only (the dominant mode for depth / MVS / spatial transformer networks). Processes 4 channels per iteration with a vectorized FMA chain. fp16 inputs are promoted to fp32 for weight computation and accumulation, cast back on store — the portable kernel's fp16 weight subtractions like `(ix_se - ix)` otherwise suffer catastrophic cancellation (same concern as #19117). Unsupported modes and non-aarch64 targets delegate to the portable kernel.

### sum.IntList_out

`at::vec::Vectorized<float>`-based implementation of the single-dim reduction fast path (both innermost-contiguous and strided cases). Cross-architecture SIMD via PyTorch's existing vector abstraction; always accumulates in fp32 regardless of input dtype. Multi-dim reductions, dtype-converting reductions, and complex types delegate to portable.

### Integration

- Sources added to `OPTIMIZED_KERNELS_SRCS` in `build_variables.bzl` and to `OPTIMIZED_ATEN_OPS` in `op_registration_util.bzl`. Single source of truth for both Buck and CMake builds.
- `optimized.yaml` registers the ops with the standard `opt_*` naming convention used by sibling kernels.
- `kernels/optimized/CMakeLists.txt` scopes the `-march=armv8.2-a+fp16` flag to just `op_grid_sampler_2d.cpp` via `set_source_files_properties`, so x86_64 builds are unaffected. The kernel has `#ifdef __aarch64__` guards and falls through to portable on non-arm64 targets.

## Test plan

- Built for Android (`scripts/build_android_library.sh`) and host (macOS / Apple Clang 21).
- `kernels/test/op_grid_sampler_2d_test.cpp` and `op_sum_test.cpp` unit tests continue to pass — both target the `aten::sum_outf` / `aten::grid_sampler_2d_outf` codegen-dispatched entry points, so they automatically exercise the optimized kernels when linked.

Candidate successor to #19117 for the grid_sampler half — applies the same precision fix but at the optimized-kernel layer, so callers who link `optimized_ops_lib` get both the correctness fix and the speedup.