portable: accumulate in fp32 for Half/BFloat16 in grid_sampler_2d bilinear (#19117)
Merged
GregoryComer merged 2 commits into pytorch:main, Apr 24, 2026
Conversation
…inear
The bilinear grid_sampler_2d kernel computes interpolation weights via
subtractions of the form `(ix_se - ix)` and `(iy_se - iy)` where both
operands are close integer-valued coordinates in pixel space. In fp16
(10 bits of mantissa) that's classic catastrophic cancellation —
subtracting two close values produces a result with only a handful of
significant bits. The downstream `out_val += in[...] * weight`
accumulation then further loses precision.
Concretely, on random interior grid points with fp16 inputs, the kernel
can drift by ~0.1 in absolute terms from an fp32 reference — visible as
incorrect interpolation near non-integer sample points.
Fix: an `AccType<CTYPE>` trait that maps Half and BFloat16 to float and
leaves every other dtype unchanged. Used for the intermediate coordinate,
weight computation, and `out_val` accumulation. Loads cast CTYPE -> ACC
at read time, and the final store casts ACC -> CTYPE once. Only the
internal math is promoted; memory layout is unchanged.
Effects:
* fp32 / Int / any non-half dtype: byte-identical output (AccType<T>
is T).
* Half / BFloat16: max_abs vs an fp32 reference drops from ~0.1 to
bit-exact agreement on the test shapes exercised (N=1..2, C=7..64,
H/W up to 96, both align_corners values).
* Perf: a handful of fp16 <-> fp32 conversions per output element,
unmeasurable at op level.
Only touches the bilinear path. The nearest-mode path doesn't accumulate
and doesn't have the same issue — left alone in this change.
jgibson2 added a commit to PolyCam/executorch that referenced this pull request on Apr 24, 2026
Match the precision of the portable kernel (after pytorch#19117) and avoid fp16 catastrophic cancellation on weight computation.

The NEON half variant previously did interpolation weight computation and FMA accumulation in fp16 via `vmul_f16` / `vfma_f16`; this change loads fp16, promotes to `float32x4` via `vcvt_f32_f16`, does the four-corner FMA chain in fp32, and casts back to fp16 on store.

Speed impact: two `vcvt` per 4-channel group — single-cycle on modern ARM, unmeasurable at op level in a full-model benchmark (3.5 ms for a typical call shape, unchanged).

Precision impact: max_abs vs an fp32-then-down-cast reference drops from ~0.1 to 0 on the shapes the polycam depth model uses.
jgibson2 added a commit to PolyCam/executorch that referenced this pull request on Apr 24, 2026
Standalone aarch64 binary that exercises both NEON kernels (grid_sampler_2d and sum.IntList_out) across fp32 and fp16 inputs on the shapes the polycam depth model actually uses. Opt-in via -DEXECUTORCH_BUILD_CUSTOM_VERIFY=ON so default builds (including the AAR) are not affected.

The reference for fp16 tests is portable run on up-cast fp32 inputs, then down-cast to fp16 — independent of whatever portable's fp16 path happens to do. That keeps the test meaningful whether or not the upstream portable-fp16 fix (pytorch#19117) has landed yet.

Pass/fail uses numpy.testing.assert_allclose semantics:

    |a - b| <= abs_tol + rel_tol * |b|

Avoids the "relative error explodes at zero crossings" trap for mean-zero reductions and bilinear samples near cancellation points.

Usage:

    cmake -DEXECUTORCH_BUILD_CUSTOM_VERIFY=ON ...
    cmake --build <out> --target verify_custom_kernels
    adb push <out>/kernels/optimized/verify_custom_kernels /data/local/tmp/
    adb shell /data/local/tmp/verify_custom_kernels
Contributor (Author)
@pytorchbot label "release notes: none"
jgibson2 added a commit to PolyCam/executorch that referenced this pull request on Apr 24, 2026
…List_out

Two new optimized CPU kernels registered alongside the existing optimized_kernels library. Both replace the portable reference kernel (still available as fallback for unsupported inputs) with a vectorized implementation that accumulates in fp32, avoiding the fp16 precision issues noted in pytorch#19117 for grid_sampler_2d bilinear.

Measured end-to-end on a real depth model (Pixel 9, fp16 inputs, shapes representative of the model's hot path):

| Op | Portable | This PR | Speedup |
| -------------------------------- | -------- | ------- | ------- |
| grid_sampler_2d.out | 17.3 ms | 3.4 ms | 5.1x |
| sum.IntList_out (5 calls, total) | 3.0 ms | 0.56 ms | 5.4x |

### grid_sampler_2d.out

aarch64 NEON, bilinear + zeros padding only. Processes 4 channels per iteration with a vectorized FMA chain. fp16 inputs are promoted to fp32 for weight computation and accumulation, then cast back on store — the portable kernel's fp16 weight subtractions like `(ix_se - ix)` otherwise suffer catastrophic cancellation. Unsupported modes and non-aarch64 targets delegate to the portable kernel.

### sum.IntList_out

at::vec::Vectorized&lt;float&gt;-based implementation of the single-dim reduction fast path (both innermost-contiguous and strided cases). Cross-architecture SIMD via PyTorch's existing vector abstraction; accumulates in fp32 regardless of input dtype. Multi-dim reductions, dtype-converting reductions, and complex types delegate to portable.

### Integration

- Sources added to OPTIMIZED_KERNELS_SRCS in build_variables.bzl and to OPTIMIZED_ATEN_OPS in op_registration_util.bzl. Single source of truth for both Buck and CMake builds.
- optimized.yaml registers the ops with the standard opt_* naming convention used by sibling kernels.
- kernels/optimized/CMakeLists.txt scopes the -march=armv8.2-a+fp16 flag to just op_grid_sampler_2d.cpp via set_source_files_properties, so x86_64 builds are unaffected.
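The per-source flag scoping mentioned in the integration notes can look roughly like this (a sketch; file and flag names are taken from the text above, not from the actual CMakeLists.txt):

```cmake
# Sketch: restrict the fp16 ISA flag to the one NEON translation unit so
# x86_64 objects in the same target never see -march=armv8.2-a+fp16.
set_source_files_properties(
  op_grid_sampler_2d.cpp
  PROPERTIES
    COMPILE_OPTIONS "-march=armv8.2-a+fp16"
)
```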
The kernel has #ifdef __aarch64__ guards and falls through to portable on non-arm64 targets.
jgibson2 added a commit to PolyCam/executorch that referenced this pull request on Apr 24, 2026
Same one-char fix as pytorch#19117 (and our PR #2): the DESCRIPTION argument to `set(... CACHE TYPE DOCSTRING)` was expanded unquoted, so multi-word descriptions on STRING options passed via `-D` spilled their trailing words into subsequent set() args.

This was latent until PR #3 introduced EXECUTORCH_VULKAN_FP16_PRECISION with a multi-word help string — builds that set it (e.g. via scripts/build_android_library.sh forwarding the env var) then fail.

Carried here so this branch remains self-contained and buildable independent of the merge order of PR #2. Drops cleanly after PR #2 lands; git will treat the duplicate line as a no-op.
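The quoting hazard behind the fix, reduced to a minimal illustration (this shows the general CMake behavior, not the exact ExecuTorCH helper that was patched):

```cmake
# Broken: with the docstring unquoted, the trailing words become extra
# arguments to set(), which are invalid in CACHE mode.
# set(MY_OPTION "default" CACHE STRING Multi word help)

# Fixed (the one-character class of fix: add quotes). The docstring is
# now a single argument no matter how many words it contains.
set(MY_OPTION "default" CACHE STRING "Multi word help")
```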
jgibson2 added a commit to PolyCam/executorch that referenced this pull request on Apr 24, 2026
Standalone aarch64 binary that cross-checks opt_grid_sampler_2d_out and opt_sum_dim_out against an fp32 reference derived from the portable kernel (portable run on up-cast fp32 inputs, then down-cast to fp16). Reference is independent of portable's own fp16 path, so the test stays meaningful regardless of pytorch#19117's merge state.

Pass/fail uses numpy.testing.assert_allclose semantics:

    |a - b| <= abs_tol + rel_tol * |b|

Avoids the "relative error explodes at zero crossings" trap for mean-zero reductions and bilinear samples near cancellation points.

Opt-in via -DEXECUTORCH_BUILD_OPTIMIZED_VERIFY=ON so default builds are unaffected.

Build + run:

    cmake -DEXECUTORCH_BUILD_OPTIMIZED_VERIFY=ON ...
    cmake --build <out> --target verify_optimized_kernels
    adb push <out>/kernels/optimized/verify_optimized_kernels /data/local/tmp/
    adb shell /data/local/tmp/verify_optimized_kernels

Exits 0 on all-pass; reports max_abs / max_rel(far) / near_zero / viol per test case. 12 test cases across grid_sampler and sum, covering the shapes the polycam depth model uses plus a few edge cases (odd channel count, align_corners=1, multi-batch).
GregoryComer approved these changes on Apr 24, 2026

Member GregoryComer left a comment:
Looks good to me. Thanks for the fix!
Can you resolve the linter error? We should be able to merge once CI is green.
Apply `lintrunner -a` formatting to satisfy CI.
Contributor (Author)
Should be fixed now!

Member
Cadence and macOS tests are flakes. Merging.
Summary
The bilinear grid_sampler_2d portable kernel computes interpolation weights via subtractions like `(ix_se - ix)` where both operands are close integer-valued coordinates in pixel space. In fp16 (10 bits of mantissa) that's classic catastrophic cancellation — the result has only a handful of significant bits. The downstream weighted-sum accumulation then loses further precision.

Measured on a unit test exercising interior grid points with fp16 inputs, the kernel drifts by ~0.1 absolute from an fp32 reference. That's visible as incorrect depth / flow output near non-integer sample points, which is most of them.
Fix
An `AccType<CTYPE>` trait mapping `Half` and `BFloat16` to `float`, leaving every other dtype unchanged. Used for the intermediate coordinate, weight computation, and `out_val` accumulation. Loads cast `CTYPE -> ACC`; the final store casts `ACC -> CTYPE` once. Only internal math is promoted; memory layout / public API / tensor dtypes are unchanged.

Effects

- fp32 / Int / any non-half dtype: `AccType<T>` is `T`, so the generated code is byte-identical. No behavior change.
- Half / BFloat16: `max_abs` vs an fp32 reference drops from ~0.1 to 0 on the shapes I tested (N=1..2, C=7..64, H/W up to 96, both `align_corners` values).

Scope
Only touches the bilinear interpolation path. The nearest-mode path doesn't do weighted-sum accumulation and doesn't have the cancellation issue — left alone in this change.
Test plan
kernels/test/op_grid_sampler_2d_test.cpp unit tests continue to pass (both the fp32 shapes that were previously tested, and the fp16 path I'm specifically fixing).

Happy to add an fp16-specific test case to op_grid_sampler_2d_test.cpp if useful for CI coverage here — just let me know the preferred approach.

cc @larryliu0820 @manuelcandales