
portable: accumulate in fp32 for Half/BFloat16 in grid_sampler_2d bilinear#19117

Merged
GregoryComer merged 2 commits into pytorch:main from PolyCam:jgibson/upstream-grid-sampler-fp16
Apr 24, 2026

Conversation

jgibson2 (Contributor) commented Apr 24, 2026

Summary

The bilinear grid_sampler_2d portable kernel computes interpolation weights via subtractions like (ix_se - ix), where an integer corner coordinate and the fractional sample coordinate sit close together in pixel space. In fp16 (10 bits of mantissa) that's classic catastrophic cancellation — the result has only a handful of significant bits. The downstream weighted-sum accumulation then loses further precision.
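
To make the magnitude concrete, here's a tiny standalone demo (mine, not from the PR; it simulates fp16 storage near |x| ≈ 43 by quantizing to the fp16 grid spacing of 2^-5 for that magnitude range):

#include <cmath>
#include <cstdio>

// For 32 <= |x| < 64 the spacing between representable fp16 values is
// 2^-5 = 0.03125, so round-to-nearest-fp16 is quantization to that grid.
static float round_to_fp16_grid(float x) {
  return std::nearbyint(x * 32.0f) / 32.0f;
}

int main() {
  float ix = 42.6f;                     // true sample coordinate
  float ix_h = round_to_fp16_grid(ix);  // 42.59375 once stored as fp16
  float ix_se = 43.0f;                  // integer corner, exact in fp16
  float w16 = ix_se - ix_h;             // 0.40625
  float w32 = ix_se - ix;               // 0.40000
  std::printf("weight: fp16 %.5f vs fp32 %.5f (rel err %.2f%%)\n",
              w16, w32, 100.0f * std::fabs(w16 - w32) / w32);
  return 0;
}

The subtraction itself rounds nothing new (the operands are within a factor of two of each other); it just exposes the coordinate's quantization error, which is tiny in absolute terms but a sizable fraction of a weight near 0.4.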

Measured on a unit test exercising interior grid points with fp16 inputs, the kernel drifts by ~0.1 absolute from an fp32 reference. That's visible as incorrect depth / flow output near non-integer sample points, which is most of them.

Fix

An AccType<CTYPE> trait mapping Half and BFloat16 to float, leaving every other dtype unchanged. It is used for the intermediate coordinates, the weight computation, and the out_val accumulation. Loads cast CTYPE -> ACC; the final store casts ACC -> CTYPE once. Only internal math is promoted; memory layout / public API / tensor dtypes are unchanged.

#include <type_traits>

// Map reduced-precision float types to fp32 for internal math; identity
// for every other dtype.
template <typename CTYPE>
using AccType = std::conditional_t<
    std::is_same_v<CTYPE, executorch::aten::Half> ||
        std::is_same_v<CTYPE, executorch::aten::BFloat16>,
    float,
    CTYPE>;
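
For illustration, a minimal sketch of how the trait slots into the bilinear path (a hypothetical helper with simplified single-channel indexing and no padding handling; not the kernel verbatim):

#include <cmath>
#include <cstdint>

template <typename CTYPE>
CTYPE bilinear_sample(const CTYPE* in, int64_t W, float x, float y) {
  using ACC = AccType<CTYPE>;  // float for Half/BFloat16, CTYPE otherwise
  const ACC ix = static_cast<ACC>(x), iy = static_cast<ACC>(y);
  const ACC ix_nw = std::floor(ix), iy_nw = std::floor(iy);
  const ACC ix_se = ix_nw + 1, iy_se = iy_nw + 1;
  // Weights computed in ACC, so (ix_se - ix) runs in fp32 for half types.
  const ACC nw = (ix_se - ix) * (iy_se - iy);
  const ACC ne = (ix - ix_nw) * (iy_se - iy);
  const ACC sw = (ix_se - ix) * (iy - iy_nw);
  const ACC se = (ix - ix_nw) * (iy - iy_nw);
  auto load = [&](ACC yy, ACC xx) {  // loads cast CTYPE -> ACC
    return static_cast<ACC>(
        in[static_cast<int64_t>(yy) * W + static_cast<int64_t>(xx)]);
  };
  ACC out_val = load(iy_nw, ix_nw) * nw + load(iy_nw, ix_se) * ne +
                load(iy_se, ix_nw) * sw + load(iy_se, ix_se) * se;
  return static_cast<CTYPE>(out_val);  // single ACC -> CTYPE store
}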

Effects

  • fp32 / Int / any non-half dtype: AccType<T> is T, so the generated code is byte-identical. No behavior change (pinned down by the static_asserts after this list).
  • Half / BFloat16: max_abs vs an fp32 reference drops from ~0.1 to 0 on the shapes I tested (N=1..2, C=7..64, H/W up to 96, both align_corners values).
  • Perf: a handful of fp16↔fp32 conversions per output element. Not measurable at op level; well within the portable kernel's scalar cost envelope.
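
The first bullet can be pinned down at compile time. Illustrative static_asserts (assuming the trait above is in scope):

#include <cstdint>
#include <type_traits>

static_assert(std::is_same_v<AccType<float>, float>);
static_assert(std::is_same_v<AccType<int32_t>, int32_t>);
static_assert(std::is_same_v<AccType<executorch::aten::Half>, float>);
static_assert(std::is_same_v<AccType<executorch::aten::BFloat16>, float>);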

Scope

Only touches the bilinear interpolation path. The nearest-mode path doesn't do weighted-sum accumulation and doesn't have the cancellation issue — left alone in this change.

Test plan

  • Builds clean for Android arm64 and host (Apple Clang 21).
  • Verified numerically via a standalone harness that runs the kernel with matched fp32 / fp16 inputs and compares against an fp32-then-downcast reference. All shapes pass within a single fp16 ULP (or are bit-exact). fp32 tests remain bit-identical to the pre-change kernel.
  • Existing kernels/test/op_grid_sampler_2d_test.cpp unit tests continue to pass (both the fp32 shapes that were previously tested and the fp16 path I'm specifically fixing).

Happy to add an fp16-specific test case to op_grid_sampler_2d_test.cpp if useful for CI coverage here — just let me know the preferred approach.

cc @larryliu0820 @manuelcandales

portable: accumulate in fp32 for Half/BFloat16 in grid_sampler_2d bilinear

The bilinear grid_sampler_2d kernel computes interpolation weights via
subtractions of the form `(ix_se - ix)` and `(iy_se - iy)` where both
operands are close integer-valued coordinates in pixel space. In fp16
(10 bits of mantissa) that's classic catastrophic cancellation —
subtracting two close values produces a result with only a handful of
significant bits. The downstream `out_val += in[...] * weight`
accumulation then further loses precision.

Concretely, on random interior grid points with fp16 inputs, the kernel
can drift by ~0.1 in absolute terms from an fp32 reference — visible as
incorrect interpolation near non-integer sample points.

Fix: an `AccType<CTYPE>` trait that maps Half and BFloat16 to float and
leaves every other dtype unchanged. Used for the intermediate coordinate,
weight computation, and `out_val` accumulation. Loads cast CTYPE -> ACC
at read time, and the final store casts ACC -> CTYPE once. Only the
internal math is promoted; memory layout is unchanged.

Effects:

  * fp32 / Int / any non-half dtype: byte-identical output (AccType<T>
    is T).
  * Half / BFloat16: max_abs vs an fp32 reference drops from ~0.1 to
    bit-exact agreement on the test shapes exercised (N=1..2, C=7..64,
    H/W up to 96, both align_corners values).
  * Perf: a handful of fp16 <-> fp32 conversions per output element,
    unmeasurable at op level.

Only touches the bilinear path. The nearest-mode path doesn't accumulate
and doesn't have the same issue — left alone in this change.
pytorch-bot commented Apr 24, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19117

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below.

❌ 1 New Failure, 9 Pending

As of commit 38da787 with merge base de8ce55, one new job failure was reported.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla Bot added the CLA Signed label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) Apr 24, 2026
jgibson2 added a commit to PolyCam/executorch that referenced this pull request Apr 24, 2026
Match the precision of the portable kernel (after pytorch#19117)
and avoid fp16 catastrophic cancellation on weight computation. The NEON
half variant previously did interpolation weight computation and FMA
accumulation in fp16 via vmul_f16 / vfma_f16; this change loads fp16,
promotes to float32x4 via vcvt_f32_f16, does the four-corner FMA chain in
fp32, and casts back to fp16 on store.
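
For reference, a sketch of that promote-accumulate-narrow pattern (hypothetical helper names; the actual kernel fuses this into its channel loop):

#include <arm_neon.h>

// Four channels of one output pixel: fp16 loads, four-corner FMA chain in
// fp32, a single narrowing conversion on the way out.
static float16x4_t corner_fma(
    const __fp16* nw, const __fp16* ne, const __fp16* sw, const __fp16* se,
    float w_nw, float w_ne, float w_sw, float w_se) {
  float32x4_t acc = vmulq_n_f32(vcvt_f32_f16(vld1_f16(nw)), w_nw);
  acc = vfmaq_n_f32(acc, vcvt_f32_f16(vld1_f16(ne)), w_ne);
  acc = vfmaq_n_f32(acc, vcvt_f32_f16(vld1_f16(sw)), w_sw);
  acc = vfmaq_n_f32(acc, vcvt_f32_f16(vld1_f16(se)), w_se);
  return vcvt_f16_f32(acc);  // the only fp32 -> fp16 rounding step
}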

Speed impact: two vcvt per 4-channel group — single-cycle on modern ARM,
unmeasurable at op level in a full-model benchmark (3.5 ms for a typical
call shape, unchanged).

Precision impact: max_abs vs an fp32-then-down-cast reference drops from
~0.1 to 0 on the shapes the polycam depth model uses.
jgibson2 added a commit to PolyCam/executorch that referenced this pull request Apr 24, 2026
Standalone aarch64 binary that exercises both NEON kernels (grid_sampler_2d
and sum.IntList_out) across fp32 and fp16 inputs on the shapes the polycam
depth model actually uses. Opt-in via -DEXECUTORCH_BUILD_CUSTOM_VERIFY=ON
so default builds (including the AAR) are not affected.

The reference for fp16 tests is portable run on up-cast fp32 inputs, then
down-cast to fp16 — independent of whatever portable's fp16 path happens
to do. That keeps the test meaningful whether or not the upstream
portable-fp16 fix (pytorch#19117) has landed yet.

Pass/fail uses numpy.testing.assert_allclose semantics:
  |a - b| <= abs_tol + rel_tol * |b|
Avoids the "relative error explodes at zero crossings" trap for
mean-zero reductions and bilinear samples near cancellation points.
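
In code, the check is a one-liner (sketch; the tolerance values here are placeholders, not the harness's):

#include <cmath>

// assert_allclose semantics: b is the reference, so the relative term
// scales with |b| and abs_tol covers the near-zero region.
static bool allclose(float a, float b,
                     float rel_tol = 1e-3f, float abs_tol = 1e-3f) {
  return std::fabs(a - b) <= abs_tol + rel_tol * std::fabs(b);
}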

Usage:
  cmake -DEXECUTORCH_BUILD_CUSTOM_VERIFY=ON ...
  cmake --build <out> --target verify_custom_kernels
  adb push <out>/kernels/optimized/verify_custom_kernels /data/local/tmp/
  adb shell /data/local/tmp/verify_custom_kernels
jgibson2 (Contributor, Author) commented:

@pytorchbot label "release notes: none"

pytorch-bot Bot added the release notes: none label (do not include this in the release notes) Apr 24, 2026
jgibson2 added a commit to PolyCam/executorch that referenced this pull request Apr 24, 2026
…List_out

Two new optimized CPU kernels registered alongside the existing
optimized_kernels library. Both replace the portable reference kernel
(still available as fallback for unsupported inputs) with a vectorized
implementation that accumulates in fp32, avoiding the fp16 precision
issues noted in pytorch#19117 for grid_sampler_2d bilinear.

Measured end-to-end on a real depth model (Pixel 9, fp16 inputs, shapes
representative of the model's hot path):

| Op                               | Portable | This PR | Speedup |
| -------------------------------- | -------- | ------- | ------- |
| grid_sampler_2d.out              | 17.3 ms  | 3.4 ms  | 5.1x    |
| sum.IntList_out (5 calls, total) | 3.0 ms   | 0.56 ms | 5.4x    |

### grid_sampler_2d.out

aarch64 NEON, bilinear + zeros padding only. Processes 4 channels per
iteration with a vectorized FMA chain. fp16 inputs are promoted to fp32
for weight computation and accumulation, then cast back on store — the
portable kernel's fp16 weight subtractions like `(ix_se - ix)` otherwise
suffer catastrophic cancellation. Unsupported modes and non-aarch64
targets delegate to the portable kernel.

### sum.IntList_out

at::vec::Vectorized<float>-based implementation of the single-dim
reduction fast path (both innermost-contiguous and strided cases).
Cross-architecture SIMD via PyTorch's existing vector abstraction;
accumulates in fp32 regardless of input dtype. Multi-dim reductions,
dtype-converting reductions, and complex types delegate to portable.
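
The contiguous fast path has roughly this shape (a sketch under assumed names, using the ATen header path; the real kernel also covers the strided case and promotes fp16 loads to fp32):

#include <ATen/cpu/vec/vec.h>
#include <cstdint>

float sum_contiguous(const float* data, int64_t n) {
  using Vec = at::vec::Vectorized<float>;
  Vec acc(0.f);
  int64_t i = 0;
  for (; i + Vec::size() <= n; i += Vec::size()) {
    acc = acc + Vec::loadu(data + i);  // unaligned vector load
  }
  // Horizontal sum of the vector lanes, then the scalar tail.
  float lanes[Vec::size()];
  acc.store(lanes);
  float sum = 0.f;
  for (int64_t j = 0; j < Vec::size(); ++j) {
    sum += lanes[j];
  }
  for (; i < n; ++i) {
    sum += data[i];
  }
  return sum;
}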

### Integration

- Sources added to OPTIMIZED_KERNELS_SRCS in build_variables.bzl and to
  OPTIMIZED_ATEN_OPS in op_registration_util.bzl. Single source of
  truth for both Buck and CMake builds.
- optimized.yaml registers the ops with the standard opt_* naming
  convention used by sibling kernels.
- kernels/optimized/CMakeLists.txt scopes the -march=armv8.2-a+fp16
  flag to just op_grid_sampler_2d.cpp via set_source_files_properties,
  so x86_64 builds are unaffected. The kernel has #ifdef __aarch64__
  guards and falls through to portable on non-arm64 targets.
jgibson2 added a commit to PolyCam/executorch that referenced this pull request Apr 24, 2026
Same one-char fix as pytorch#19117 (and our PR #2): the
DESCRIPTION argument to `set(...CACHE TYPE DOCSTRING)` was expanded
unquoted, so multi-word descriptions on STRING options passed via `-D`
spilled their trailing words into subsequent set() args.

This was latent until PR #3 introduced EXECUTORCH_VULKAN_FP16_PRECISION
with a multi-word help string — builds that set it (e.g. via
scripts/build_android_library.sh forwarding the env var) then fail.

Carried here so this branch remains self-contained and buildable
independent of the merge order of PR #2. Drops cleanly after PR #2
lands; git will treat the duplicate line as a no-op.
jgibson2 added a commit to PolyCam/executorch that referenced this pull request Apr 24, 2026
Standalone aarch64 binary that cross-checks opt_grid_sampler_2d_out and
opt_sum_dim_out against an fp32 reference derived from the portable
kernel (portable run on up-cast fp32 inputs, then down-cast to fp16).
Reference is independent of portable's own fp16 path, so the test stays
meaningful regardless of pytorch#19117's merge state.

Pass/fail uses numpy.testing.assert_allclose semantics:
  |a - b| <= abs_tol + rel_tol * |b|
Avoids the "relative error explodes at zero crossings" trap for
mean-zero reductions and bilinear samples near cancellation points.

Opt-in via -DEXECUTORCH_BUILD_OPTIMIZED_VERIFY=ON so default builds are
unaffected. Build + run:

  cmake -DEXECUTORCH_BUILD_OPTIMIZED_VERIFY=ON ...
  cmake --build <out> --target verify_optimized_kernels
  adb push <out>/kernels/optimized/verify_optimized_kernels /data/local/tmp/
  adb shell /data/local/tmp/verify_optimized_kernels

Exits 0 on all-pass; reports max_abs / max_rel(far) / near_zero / viol
per test case. 12 test cases across grid_sampler and sum, covering the
shapes the polycam depth model uses plus a few edge cases (odd channel
count, align_corners=1, multi-batch).
GregoryComer (Member) left a comment

Looks good to me. Thanks for the fix!

Can you resolve the linter error? We should be able to merge once CI is green.

GregoryComer added the module: kernels label (issues related to kernel libraries and utilities, and code under kernels/) Apr 24, 2026
Apply lintrunner -a formatting to satisfy CI.
jgibson2 (Contributor, Author) replied:

> Looks good to me. Thanks for the fix!
>
> Can you resolve the linter error? We should be able to merge once CI is green.

Should be fixed now!

GregoryComer (Member) commented Apr 24, 2026

Cadence and macOS tests are flakes. Merging.

@GregoryComer GregoryComer merged commit 60ffe19 into pytorch:main Apr 24, 2026
167 of 171 checks passed
@jgibson2 jgibson2 deleted the jgibson/upstream-grid-sampler-fp16 branch April 25, 2026 00:14

Labels

CLA Signed, module: kernels, release notes: none