
portable: accumulate in fp32 for Half/BFloat16 in grid_sampler_2d bilinear#19117

Merged
GregoryComer merged 2 commits into pytorch:main from PolyCam:jgibson/upstream-grid-sampler-fp16
Apr 24, 2026

Conversation

jgibson2 (Contributor) commented Apr 24, 2026

Summary

The bilinear grid_sampler_2d portable kernel computes interpolation weights via subtractions like (ix_se - ix), where an integer corner coordinate and the fractional sample coordinate sit close together in pixel space. In fp16 (10 bits of mantissa) that's classic catastrophic cancellation — the result has only a handful of significant bits. The downstream weighted-sum accumulation then loses further precision.
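
To make the magnitude concrete, here's a tiny standalone demo (mine, not from the PR; it simulates fp16 storage near |x| ≈ 43 by quantizing to the fp16 grid spacing of 2^-5 for that magnitude range):

#include <cmath>
#include <cstdio>

// For 32 <= |x| < 64 the spacing between representable fp16 values is
// 2^-5 = 0.03125, so round-to-nearest-fp16 is quantization to that grid.
static float round_to_fp16_grid(float x) {
  return std::nearbyint(x * 32.0f) / 32.0f;
}

int main() {
  float ix = 42.6f;                     // true sample coordinate
  float ix_h = round_to_fp16_grid(ix);  // 42.59375 once stored as fp16
  float ix_se = 43.0f;                  // integer corner, exact in fp16
  float w16 = ix_se - ix_h;             // 0.40625
  float w32 = ix_se - ix;               // 0.40000
  std::printf("weight: fp16 %.5f vs fp32 %.5f (rel err %.2f%%)\n",
              w16, w32, 100.0f * std::fabs(w16 - w32) / w32);
  return 0;
}

The subtraction itself rounds nothing new (the operands are within a factor of two of each other); it just exposes the coordinate's quantization error, which is tiny in absolute terms but a sizable fraction of a weight near 0.4.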

Measured on a unit test exercising interior grid points with fp16 inputs, the kernel drifts by ~0.1 absolute from an fp32 reference. That's visible as incorrect depth / flow output near non-integer sample points, which is most of them.

Fix

An AccType<CTYPE> trait mapping Half and BFloat16 to float, leaving every other dtype unchanged. It is used for the intermediate coordinates, the weight computation, and the out_val accumulation. Loads cast CTYPE -> ACC; the final store casts ACC -> CTYPE once. Only internal math is promoted; memory layout / public API / tensor dtypes are unchanged.

#include <type_traits>

// Map reduced-precision float types to fp32 for internal math; identity
// for every other dtype.
template <typename CTYPE>
using AccType = std::conditional_t<
    std::is_same_v<CTYPE, executorch::aten::Half> ||
        std::is_same_v<CTYPE, executorch::aten::BFloat16>,
    float,
    CTYPE>;
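
For illustration, a minimal sketch of how the trait slots into the bilinear path (a hypothetical helper with simplified single-channel indexing and no padding handling; not the kernel verbatim):

#include <cmath>
#include <cstdint>

template <typename CTYPE>
CTYPE bilinear_sample(const CTYPE* in, int64_t W, float x, float y) {
  using ACC = AccType<CTYPE>;  // float for Half/BFloat16, CTYPE otherwise
  const ACC ix = static_cast<ACC>(x), iy = static_cast<ACC>(y);
  const ACC ix_nw = std::floor(ix), iy_nw = std::floor(iy);
  const ACC ix_se = ix_nw + 1, iy_se = iy_nw + 1;
  // Weights computed in ACC, so (ix_se - ix) runs in fp32 for half types.
  const ACC nw = (ix_se - ix) * (iy_se - iy);
  const ACC ne = (ix - ix_nw) * (iy_se - iy);
  const ACC sw = (ix_se - ix) * (iy - iy_nw);
  const ACC se = (ix - ix_nw) * (iy - iy_nw);
  auto load = [&](ACC yy, ACC xx) {  // loads cast CTYPE -> ACC
    return static_cast<ACC>(
        in[static_cast<int64_t>(yy) * W + static_cast<int64_t>(xx)]);
  };
  ACC out_val = load(iy_nw, ix_nw) * nw + load(iy_nw, ix_se) * ne +
                load(iy_se, ix_nw) * sw + load(iy_se, ix_se) * se;
  return static_cast<CTYPE>(out_val);  // single ACC -> CTYPE store
}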

Effects

  • fp32 / Int / any non-half dtype: AccType<T> is T, so the generated code is byte-identical. No behavior change (pinned down by the static_asserts after this list).
  • Half / BFloat16: max_abs vs an fp32 reference drops from ~0.1 to 0 on the shapes I tested (N=1..2, C=7..64, H/W up to 96, both align_corners values).
  • Perf: a handful of fp16↔fp32 conversions per output element. Not measurable at op level; well within the portable kernel's scalar cost envelope.
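
The first bullet can be pinned down at compile time. Illustrative static_asserts (assuming the trait above is in scope):

#include <cstdint>
#include <type_traits>

static_assert(std::is_same_v<AccType<float>, float>);
static_assert(std::is_same_v<AccType<int32_t>, int32_t>);
static_assert(std::is_same_v<AccType<executorch::aten::Half>, float>);
static_assert(std::is_same_v<AccType<executorch::aten::BFloat16>, float>);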

Scope

Only touches the bilinear interpolation path. The nearest-mode path doesn't do weighted-sum accumulation and doesn't have the cancellation issue — left alone in this change.

Test plan

  • Builds clean for Android arm64 and host (Apple Clang 21).
  • Verified numerically via a standalone harness that runs the kernel with matched fp32 / fp16 inputs and compares against an fp32-then-downcast reference. All shapes pass within a single fp16 ULP (or are bit-exact). fp32 tests remain bit-identical to the pre-change kernel.
  • Existing kernels/test/op_grid_sampler_2d_test.cpp unit tests continue to pass (both the fp32 shapes that were previously tested and the fp16 path I'm specifically fixing).

Happy to add an fp16-specific test case to op_grid_sampler_2d_test.cpp if useful for CI coverage here — just let me know the preferred approach.

cc @larryliu0820 @manuelcandales

portable: accumulate in fp32 for Half/BFloat16 in grid_sampler_2d bilinear

The bilinear grid_sampler_2d kernel computes interpolation weights via
subtractions of the form `(ix_se - ix)` and `(iy_se - iy)` where both
operands are close integer-valued coordinates in pixel space. In fp16
(10 bits of mantissa) that's classic catastrophic cancellation —
subtracting two close values produces a result with only a handful of
significant bits. The downstream `out_val += in[...] * weight`
accumulation then further loses precision.

Concretely, on random interior grid points with fp16 inputs, the kernel
can drift by ~0.1 in absolute terms from an fp32 reference — visible as
incorrect interpolation near non-integer sample points.

Fix: an `AccType<CTYPE>` trait that maps Half and BFloat16 to float and
leaves every other dtype unchanged. Used for the intermediate coordinate,
weight computation, and `out_val` accumulation. Loads cast CTYPE -> ACC
at read time, and the final store casts ACC -> CTYPE once. Only the
internal math is promoted; memory layout is unchanged.

Effects:

  * fp32 / Int / any non-half dtype: byte-identical output (AccType<T>
    is T).
  * Half / BFloat16: max_abs vs an fp32 reference drops from ~0.1 to
    bit-exact agreement on the test shapes exercised (N=1..2, C=7..64,
    H/W up to 96, both align_corners values).
  * Perf: a handful of fp16 <-> fp32 conversions per output element,
    unmeasurable at op level.

Only touches the bilinear path. The nearest-mode path doesn't accumulate
and doesn't have the same issue — left alone in this change.
pytorch-bot commented Apr 24, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19117

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below.

❌ 1 New Failure, 9 Pending

As of commit 38da787 with merge base de8ce55, one new job failure was reported.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla Bot added the CLA Signed label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) Apr 24, 2026
jgibson2 added a commit to PolyCam/executorch that referenced this pull request Apr 24, 2026
Match the precision of the portable kernel (after pytorch#19117)
and avoid fp16 catastrophic cancellation on weight computation. The NEON
half variant previously did interpolation weight computation and FMA
accumulation in fp16 via vmul_f16 / vfma_f16; this change loads fp16,
promotes to float32x4 via vcvt_f32_f16, does the four-corner FMA chain in
fp32, and casts back to fp16 on store.
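
For reference, a sketch of that promote-accumulate-narrow pattern (hypothetical helper names; the actual kernel fuses this into its channel loop):

#include <arm_neon.h>

// Four channels of one output pixel: fp16 loads, four-corner FMA chain in
// fp32, a single narrowing conversion on the way out.
static float16x4_t corner_fma(
    const __fp16* nw, const __fp16* ne, const __fp16* sw, const __fp16* se,
    float w_nw, float w_ne, float w_sw, float w_se) {
  float32x4_t acc = vmulq_n_f32(vcvt_f32_f16(vld1_f16(nw)), w_nw);
  acc = vfmaq_n_f32(acc, vcvt_f32_f16(vld1_f16(ne)), w_ne);
  acc = vfmaq_n_f32(acc, vcvt_f32_f16(vld1_f16(sw)), w_sw);
  acc = vfmaq_n_f32(acc, vcvt_f32_f16(vld1_f16(se)), w_se);
  return vcvt_f16_f32(acc);  // the only fp32 -> fp16 rounding step
}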

Speed impact: two vcvt per 4-channel group — single-cycle on modern ARM,
unmeasurable at op level in a full-model benchmark (3.5 ms for a typical
call shape, unchanged).

Precision impact: max_abs vs an fp32-then-down-cast reference drops from
~0.1 to 0 on the shapes the polycam depth model uses.
jgibson2 added a commit to PolyCam/executorch that referenced this pull request Apr 24, 2026
Standalone aarch64 binary that exercises both NEON kernels (grid_sampler_2d
and sum.IntList_out) across fp32 and fp16 inputs on the shapes the polycam
depth model actually uses. Opt-in via -DEXECUTORCH_BUILD_CUSTOM_VERIFY=ON
so default builds (including the AAR) are not affected.

The reference for fp16 tests is portable run on up-cast fp32 inputs, then
down-cast to fp16 — independent of whatever portable's fp16 path happens
to do. That keeps the test meaningful whether or not the upstream
portable-fp16 fix (pytorch#19117) has landed yet.

Pass/fail uses numpy.testing.assert_allclose semantics:
  |a - b| <= abs_tol + rel_tol * |b|
Avoids the "relative error explodes at zero crossings" trap for
mean-zero reductions and bilinear samples near cancellation points.
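
In code, the check is a one-liner (sketch; the tolerance values here are placeholders, not the harness's):

#include <cmath>

// assert_allclose semantics: b is the reference, so the relative term
// scales with |b| and abs_tol covers the near-zero region.
static bool allclose(float a, float b,
                     float rel_tol = 1e-3f, float abs_tol = 1e-3f) {
  return std::fabs(a - b) <= abs_tol + rel_tol * std::fabs(b);
}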

Usage:
  cmake -DEXECUTORCH_BUILD_CUSTOM_VERIFY=ON ...
  cmake --build <out> --target verify_custom_kernels
  adb push <out>/kernels/optimized/verify_custom_kernels /data/local/tmp/
  adb shell /data/local/tmp/verify_custom_kernels
jgibson2 (Contributor, Author) commented:

@pytorchbot label "release notes: none"

pytorch-bot Bot added the release notes: none label (do not include this in the release notes) Apr 24, 2026
jgibson2 added a commit to PolyCam/executorch that referenced this pull request Apr 24, 2026
…List_out

Two new optimized CPU kernels registered alongside the existing
optimized_kernels library. Both replace the portable reference kernel
(still available as fallback for unsupported inputs) with a vectorized
implementation that accumulates in fp32, avoiding the fp16 precision
issues noted in pytorch#19117 for grid_sampler_2d bilinear.

Measured end-to-end on a real depth model (Pixel 9, fp16 inputs, shapes
representative of the model's hot path):

| Op                               | Portable | This PR | Speedup |
| -------------------------------- | -------- | ------- | ------- |
| grid_sampler_2d.out              | 17.3 ms  | 3.4 ms  | 5.1x    |
| sum.IntList_out (5 calls, total) | 3.0 ms   | 0.56 ms | 5.4x    |

### grid_sampler_2d.out

aarch64 NEON, bilinear + zeros padding only. Processes 4 channels per
iteration with a vectorized FMA chain. fp16 inputs are promoted to fp32
for weight computation and accumulation, then cast back on store — the
portable kernel's fp16 weight subtractions like `(ix_se - ix)` otherwise
suffer catastrophic cancellation. Unsupported modes and non-aarch64
targets delegate to the portable kernel.

### sum.IntList_out

at::vec::Vectorized<float>-based implementation of the single-dim
reduction fast path (both innermost-contiguous and strided cases).
Cross-architecture SIMD via PyTorch's existing vector abstraction;
accumulates in fp32 regardless of input dtype. Multi-dim reductions,
dtype-converting reductions, and complex types delegate to portable.
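
The contiguous fast path has roughly this shape (a sketch under assumed names, using the ATen header path; the real kernel also covers the strided case and promotes fp16 loads to fp32):

#include <ATen/cpu/vec/vec.h>
#include <cstdint>

float sum_contiguous(const float* data, int64_t n) {
  using Vec = at::vec::Vectorized<float>;
  Vec acc(0.f);
  int64_t i = 0;
  for (; i + Vec::size() <= n; i += Vec::size()) {
    acc = acc + Vec::loadu(data + i);  // unaligned vector load
  }
  // Horizontal sum of the vector lanes, then the scalar tail.
  float lanes[Vec::size()];
  acc.store(lanes);
  float sum = 0.f;
  for (int64_t j = 0; j < Vec::size(); ++j) {
    sum += lanes[j];
  }
  for (; i < n; ++i) {
    sum += data[i];
  }
  return sum;
}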

### Integration

- Sources added to OPTIMIZED_KERNELS_SRCS in build_variables.bzl and to
  OPTIMIZED_ATEN_OPS in op_registration_util.bzl. Single source of
  truth for both Buck and CMake builds.
- optimized.yaml registers the ops with the standard opt_* naming
  convention used by sibling kernels.
- kernels/optimized/CMakeLists.txt scopes the -march=armv8.2-a+fp16
  flag to just op_grid_sampler_2d.cpp via set_source_files_properties,
  so x86_64 builds are unaffected. The kernel has #ifdef __aarch64__
  guards and falls through to portable on non-arm64 targets.
jgibson2 added a commit to PolyCam/executorch that referenced this pull request Apr 24, 2026
Same one-char fix as pytorch#19117 (and our PR #2): the
DESCRIPTION argument to `set(...CACHE TYPE DOCSTRING)` was expanded
unquoted, so multi-word descriptions on STRING options passed via `-D`
spilled their trailing words into subsequent set() args.

This was latent until PR #3 introduced EXECUTORCH_VULKAN_FP16_PRECISION
with a multi-word help string — builds that set it (e.g. via
scripts/build_android_library.sh forwarding the env var) then fail.

Carried here so this branch remains self-contained and buildable
independent of the merge order of PR #2. Drops cleanly after PR #2
lands; git will treat the duplicate line as a no-op.
jgibson2 added a commit to PolyCam/executorch that referenced this pull request Apr 24, 2026
Standalone aarch64 binary that cross-checks opt_grid_sampler_2d_out and
opt_sum_dim_out against an fp32 reference derived from the portable
kernel (portable run on up-cast fp32 inputs, then down-cast to fp16).
Reference is independent of portable's own fp16 path, so the test stays
meaningful regardless of pytorch#19117's merge state.

Pass/fail uses numpy.testing.assert_allclose semantics:
  |a - b| <= abs_tol + rel_tol * |b|
Avoids the "relative error explodes at zero crossings" trap for
mean-zero reductions and bilinear samples near cancellation points.

Opt-in via -DEXECUTORCH_BUILD_OPTIMIZED_VERIFY=ON so default builds are
unaffected. Build + run:

  cmake -DEXECUTORCH_BUILD_OPTIMIZED_VERIFY=ON ...
  cmake --build <out> --target verify_optimized_kernels
  adb push <out>/kernels/optimized/verify_optimized_kernels /data/local/tmp/
  adb shell /data/local/tmp/verify_optimized_kernels

Exits 0 on all-pass; reports max_abs / max_rel(far) / near_zero / viol
per test case. 12 test cases across grid_sampler and sum, covering the
shapes the polycam depth model uses plus a few edge cases (odd channel
count, align_corners=1, multi-batch).
GregoryComer (Member) left a comment

Looks good to me. Thanks for the fix!

Can you resolve the linter error? We should be able to merge once CI is green.

GregoryComer added the module: kernels label (issues related to kernel libraries and utilities, and code under kernels/) Apr 24, 2026
Apply lintrunner -a formatting to satisfy CI.
jgibson2 (Contributor, Author) replied:

> Looks good to me. Thanks for the fix!
>
> Can you resolve the linter error? We should be able to merge once CI is green.

Should be fixed now!

GregoryComer (Member) commented Apr 24, 2026

Cadence and macOS tests are flakes. Merging.

@GregoryComer GregoryComer merged commit 60ffe19 into pytorch:main Apr 24, 2026
167 of 171 checks passed
@jgibson2 jgibson2 deleted the jgibson/upstream-grid-sampler-fp16 branch April 25, 2026 00:14

Labels

CLA Signed, module: kernels, release notes: none