optimized: add grid_sampler_2d.out (NEON) and sum.IntList_out (Vectorized&lt;float&gt;) #19119

jgibson2 wants to merge 5 commits into pytorch:main
Conversation
…List_out

Two new optimized CPU kernels registered alongside the existing optimized_kernels library. Both replace the portable reference kernel (still available as fallback for unsupported inputs) with a vectorized implementation that accumulates in fp32, avoiding the fp16 precision issues noted in pytorch#19117 for grid_sampler_2d bilinear.

Measured end-to-end on a real depth model (Pixel 9, fp16 inputs, shapes representative of the model's hot path):

| Op | Portable | This PR | Speedup |
| -------------------------------- | -------- | ------- | ------- |
| grid_sampler_2d.out | 17.3 ms | 3.4 ms | 5.1x |
| sum.IntList_out (5 calls, total) | 3.0 ms | 0.56 ms | 5.4x |

### grid_sampler_2d.out

aarch64 NEON, bilinear + zeros padding only. Processes 4 channels per iteration with a vectorized FMA chain. fp16 inputs are promoted to fp32 for weight computation and accumulation, then cast back on store — the portable kernel's fp16 weight subtractions like `(ix_se - ix)` otherwise suffer catastrophic cancellation. Unsupported modes and non-aarch64 targets delegate to the portable kernel.

### sum.IntList_out

at::vec::Vectorized&lt;float&gt;-based implementation of the single-dim reduction fast path (both innermost-contiguous and strided cases). Cross-architecture SIMD via PyTorch's existing vector abstraction; accumulates in fp32 regardless of input dtype. Multi-dim reductions, dtype-converting reductions, and complex types delegate to portable.

### Integration

- Sources added to OPTIMIZED_KERNELS_SRCS in build_variables.bzl and to OPTIMIZED_ATEN_OPS in op_registration_util.bzl. Single source of truth for both Buck and CMake builds.
- optimized.yaml registers the ops with the standard opt_* naming convention used by sibling kernels.
- kernels/optimized/CMakeLists.txt scopes the -march=armv8.2-a+fp16 flag to just op_grid_sampler_2d.cpp via set_source_files_properties, so x86_64 builds are unaffected. The kernel has #ifdef __aarch64__ guards and falls through to portable on non-arm64 targets.
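Editor's note: a rough sketch of the `Vectorized<float>` accumulation pattern the sum.IntList_out description refers to, assuming ATen's `ATen/cpu/vec/vec.h` header is available. It covers only the contiguous-innermost fp32 case; none of the strided, fp16-widening, or multi-dim handling the actual kernel needs is shown, and it is not code from this PR.

```cpp
#include <ATen/cpu/vec/vec.h>

#include <cstdint>

// Accumulate in a single fp32 vector register, horizontal-reduce once at the
// end, then finish any remainder with a scalar tail.
float sum_contiguous_fp32(const float* data, int64_t n) {
  using Vec = at::vec::Vectorized<float>;
  Vec acc(0.0f);
  int64_t i = 0;
  for (; i + Vec::size() <= n; i += Vec::size()) {
    acc = acc + Vec::loadu(data + i);
  }
  // Horizontal sum of the vector accumulator.
  float buf[Vec::size()];
  acc.store(buf);
  float total = 0.0f;
  for (int64_t k = 0; k < Vec::size(); ++k) {
    total += buf[k];
  }
  // Scalar tail for the last n % Vec::size() elements.
  for (; i < n; ++i) {
    total += data[i];
  }
  return total;
}
```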
❌ 3 New Failures, 2 Unrelated Failures as of commit 4e98f5a with merge base de8ce55.

BROKEN TRUNK: the remaining failed jobs were already failing on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
@pytorchbot label "release notes: none"
The NEON fast path indexes input/grid/out directly assuming contiguous NCHW default-dim-order layout — no use of .strides() or .dim_order(). If the caller passes anything else (NHWC, transposed, strided, channels-last), we'd read wrong memory and silently produce garbage output.

Add the same check pattern op_sum.cpp already uses at L150-151: tensor_is_default_dim_order + tensor_is_contiguous on input, grid, and out. If any check fails, delegate to the portable kernel (which handles arbitrary strides / dim orders correctly via .strides()).

No perf impact on the hot path — the checks are a handful of scalar comparisons run once per call, and the common polycam depth model case is already default-contiguous, so the fast path is still taken.
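Editor's note: a minimal sketch of that guard, assuming the helper names follow the op_sum.cpp pattern (`tensor_is_default_dim_order` / `tensor_is_contiguous` from ExecuTorch's tensor_util); this is an illustration, not the PR's actual diff.

```cpp
#include <executorch/runtime/core/exec_aten/exec_aten.h>
#include <executorch/runtime/core/exec_aten/util/tensor_util.h>

using executorch::aten::Tensor;

// True only when all three tensors can be indexed as plain contiguous NCHW,
// i.e. when the NEON fast path's raw pointer arithmetic is valid.
bool can_use_neon_fast_path(
    const Tensor& input, const Tensor& grid, const Tensor& out) {
  using executorch::runtime::tensor_is_contiguous;
  using executorch::runtime::tensor_is_default_dim_order;
  return tensor_is_default_dim_order(input) && tensor_is_contiguous(input) &&
      tensor_is_default_dim_order(grid) && tensor_is_contiguous(grid) &&
      tensor_is_default_dim_order(out) && tensor_is_contiguous(out);
}
// When this returns false, the op should delegate to the portable kernel,
// which walks .strides() / dim order explicitly.
```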
@JacobSzwejbka @manuelcandales @digantdesai Do you have any concerns with conditionally linking an op in optimized only on some architectures?
Apply lintrunner -a auto-fixes to satisfy CI (CLANGFORMAT and CMAKEFORMAT). No functional changes.
GregoryComer left a comment:
Thanks for the PR! I left one comment about runtime gating the armv8.2+fp16 code. Other than that, it looks good.
We are currently looking at adding better support for in-tree CPU operator implementations with arch-specific dispatch, so this type of thing should become easier soon.
```cmake
)
set_source_files_properties(
  ${EXECUTORCH_ROOT}/kernels/optimized/cpu/op_grid_sampler_2d.cpp
  PROPERTIES COMPILE_OPTIONS "-march=armv8.2-a+fp16"
```
Would it be possible to split out the native f16 path? Right now, it will potentially SIGILL on ARM hardware without f16 support. If possible, I'd recommend something like this:
- Move the native f16 impl into a separate source file. Scope the march +fp16 to just this file.
- Add a variant that does the f16<->f32 conversion in software.
- In the top-level kernel, check hardware support using cpuinfo_has_arm_neon_fp16 and route to the implementation.
Address review feedback on pytorch#19119: the previous op_grid_sampler_2d.cpp compiled the whole file with -march=armv8.2-a+fp16, which meant the resulting binary would SIGILL on ARMv8.0 / ARMv8.1 chips that lack the fp16 extension. Split the fp16 path into two translation units:

* op_grid_sampler_2d_fp16_hw.cpp — hardware fp16 fast path. Uses vld1_f16 / vcvt_f32_f16 / vfmaq_f32 / vcvt_f16_f32 / vst1_f16. Compiled with -march=armv8.2-a+fp16 (flag scoped to this TU via set_source_files_properties in CMake, and via compiler_flags on a dedicated runtime.cxx_library in targets.bzl).

* op_grid_sampler_2d.cpp — hosts the fp32 NEON path, a new fp16 software-convert path, and the runtime dispatcher. Plain ARMv8 only. The SW path converts fp16<->fp32 via c10::Half's portable operator float() / constructor (no hardware fp16 instructions) and does all compute on NEON fp32 lanes. Slower per conversion than the HW path but safe on any ARMv8 CPU.

The dispatcher calls cpuinfo_initialize() + cpuinfo_has_arm_neon_fp16() (cpuinfo already transitively linked via extension_threadpool) and routes to the appropriate variant. fp32 inputs use the unchanged NEON fp32 path; any unsupported layout/padding/interpolation still falls through to the portable kernel.

Buck: adds a new runtime.cxx_library(op_grid_sampler_2d_fp16_hw) in kernels/optimized/cpu/targets.bzl with the +fp16 compile flag gated on ovr_config//cpu:arm64, and wires op_grid_sampler_2d to depend on it and on cpuinfo.

No behavior change on fp32 inputs. fp16 inputs on +fp16-capable chips keep the existing fast path at the same speed; fp16 inputs on chips without the extension now run the SW variant instead of crashing.
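Editor's note: a rough sketch of the promote-to-fp32 pattern the intrinsics listed above imply (widen fp16, accumulate with FMAs in fp32, narrow only on store). It assumes a translation unit compiled with -march=armv8.2-a+fp16; the real kernel's loop structure, channel indexing, and weight math live in op_grid_sampler_2d_fp16_hw.cpp and are not reproduced here.

```cpp
#include <arm_neon.h>

// Blend four corner samples with bilinear weights, keeping all arithmetic in
// fp32 lanes so the fp16 cancellation issue from pytorch#19117 cannot recur.
static inline void bilinear_blend_fp16(
    const float16_t* nw, const float16_t* ne,
    const float16_t* sw, const float16_t* se,
    float w_nw, float w_ne, float w_sw, float w_se,
    float16_t* out) {
  // Widen 4 fp16 values per corner to fp32 lanes (vld1_f16 + vcvt_f32_f16).
  float32x4_t v_nw = vcvt_f32_f16(vld1_f16(nw));
  float32x4_t v_ne = vcvt_f32_f16(vld1_f16(ne));
  float32x4_t v_sw = vcvt_f32_f16(vld1_f16(sw));
  float32x4_t v_se = vcvt_f32_f16(vld1_f16(se));
  // FMA chain accumulating entirely in fp32.
  float32x4_t acc = vmulq_n_f32(v_nw, w_nw);
  acc = vfmaq_n_f32(acc, v_ne, w_ne);
  acc = vfmaq_n_f32(acc, v_sw, w_sw);
  acc = vfmaq_n_f32(acc, v_se, w_se);
  // Narrow back to fp16 only on the final store (vcvt_f16_f32 + vst1_f16).
  vst1_f16(out, vcvt_f16_f32(acc));
}
```

Per the commit message, the SW variant would compute the same fp32 FMA chain but replace the vld1_f16/vcvt pairs with per-element c10::Half-to-float conversions.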
Addressed the SIGILL concern — new commit 53697e9 splits the fp16 path into HW and SW variants with runtime dispatch. The dispatcher is:

```cpp
if (input.scalar_type() == ScalarType::Half) {
  if (cpuinfo_initialize() && cpuinfo_has_arm_neon_fp16()) {
    // HW path (in _fp16_hw.cpp)
  }
  // fall through to SW path (in this file)
}
```

cpuinfo is already transitively linked via extension_threadpool.

One thing I'd appreciate guidance on: the fp16 SW and HW paths share ~60 lines of loop body verbatim — same FMA chain, same indexing, same weight math. I kept them copy-pasted for clarity rather than macro/template gymnastics. Happy to DRY that up if you'd prefer.
Buck's op_registration_util._enforce_deps rejects any dep starting with `:op_` on the theory that op_targets should not depend on other op_targets. The previously-named `:op_grid_sampler_2d_fp16_hw` tripped that check when op_grid_sampler_2d (which is an op_target) declared it as a dep. Rename the internal helper library to `grid_sampler_2d_fp16_hw_impl`, matching the existing `add_sub_impl` / `binary_ops` naming for op-specific implementation helpers. No change to file contents, CMake, or the C++ dispatch — only the Buck target name and the corresponding dep reference.
@GregoryComer has imported this pull request. If you are a Meta employee, you can view this in D102420839.
## Summary

Two new optimized CPU kernels registered alongside the existing `optimized_kernels` library. Both replace the portable reference kernel (still available as fallback for unsupported inputs) with vectorized implementations that accumulate in fp32, which also sidesteps the fp16 precision issue noted in #19117 for `grid_sampler_2d` bilinear.

Measured end-to-end on a real depth model (Pixel 9 / arm64-v8a, fp16 inputs, shapes representative of the model's hot path):

| Op | Portable | This PR | Speedup |
| -------------------------------- | -------- | ------- | ------- |
| `grid_sampler_2d.out` | 17.3 ms | 3.4 ms | 5.1x |
| `sum.IntList_out` (5 calls, aggregate) | 3.0 ms | 0.56 ms | 5.4x |

### grid_sampler_2d.out

aarch64 NEON, bilinear + zeros padding only (the dominant mode for depth / MVS / spatial transformer networks). Processes 4 channels per iteration with a vectorized FMA chain. fp16 inputs are promoted to fp32 for weight computation and accumulation, cast back on store — the portable kernel's fp16 weight subtractions like `(ix_se - ix)` otherwise suffer catastrophic cancellation (same concern as #19117). Unsupported modes and non-aarch64 targets delegate to the portable kernel.

### sum.IntList_out

`at::vec::Vectorized<float>`-based implementation of the single-dim reduction fast path (both innermost-contiguous and strided cases). Cross-architecture SIMD via PyTorch's existing vector abstraction; always accumulates in fp32 regardless of input dtype. Multi-dim reductions, dtype-converting reductions, and complex types delegate to portable.

### Integration

- Sources added to `OPTIMIZED_KERNELS_SRCS` in `build_variables.bzl` and to `OPTIMIZED_ATEN_OPS` in `op_registration_util.bzl`. Single source of truth for both Buck and CMake builds.
- `optimized.yaml` registers the ops with the standard `opt_*` naming convention used by sibling kernels.
- `kernels/optimized/CMakeLists.txt` scopes the `-march=armv8.2-a+fp16` flag to just `op_grid_sampler_2d.cpp` via `set_source_files_properties`, so x86_64 builds are unaffected. The kernel has `#ifdef __aarch64__` guards and falls through to portable on non-arm64 targets.

## Test plan

- Built for Android (`scripts/build_android_library.sh`) and host (macOS / Apple Clang 21).
- `kernels/test/op_grid_sampler_2d_test.cpp` and `op_sum_test.cpp` unit tests continue to pass — both target the `aten::sum_outf` / `aten::grid_sampler_2d_outf` codegen-dispatched entry points, so they automatically exercise the optimized kernels when linked.

Candidate successor to #19117 for the grid_sampler half — applies the same precision fix but at the optimized-kernel layer, so callers who link `optimized_ops_lib` get both the correctness fix and the speedup.