Skip to content

geotiff: ngjit 2x2 kernels for float mean/min/max/median overviews (#2413)#2475

Merged
brendancol merged 3 commits into
mainfrom
issue-2413
May 27, 2026
Merged

geotiff: ngjit 2x2 kernels for float mean/min/max/median overviews (#2413)#2475
brendancol merged 3 commits into
mainfrom
issue-2413

Conversation

@brendancol
Copy link
Copy Markdown
Contributor

Closes #2413.

Summary

  • Adds xrspatial/geotiff/_overview_kernels.py with four type-specialized @ngjit 2x2 kernels (mean, min, max, median) that walk output pixels directly, fuse the sentinel comparison into the read, and bounds-check odd shapes so no NaN pad copy is needed.
  • Wires the kernels into the float branch of _block_reduce_2d via a _NGJIT_REDUCE_METHODS set. Integer dtypes plus mode / cubic / nearest keep their existing numpy paths.
  • All-NaN / all-sentinel blocks still return NaN so the caller's downstream sentinel rewrite is unchanged.

The integer-dtype native kernel is intentionally left as a follow-up. The existing numpy path computes the sentinel mask at native integer width before promoting to float64, which is required for 64-bit sentinels (INT64_MAX, UINT64_MAX) that round when cast to float64. A test pins that contract.

Backend coverage

CPU only. GPU (_block_reduce_2d_gpu) and dask paths aren't changed.

Measured

Best-of-5 on this machine, jit warmup excluded:

case new old
1024^2 f32 mean 0.58 ms ~11 ms
1024^2 f32 mean + sentinel 0.63 ms ~18 ms
4096^2 f32 mean + sentinel 10.86 ms ~331 ms
4096^2 f32 median + sentinel 24.60 ms ~338 ms

Matches the numbers in #2413.

Test plan

  • 129 new parity tests in tests/test_overview_kernels_2413.py (kernel vs numpy reference across float32/float64, even/odd shapes, with/without sentinel, NaN edge cases, inf sentinel, NaN-as-nodata no-op, median-of-4 vs np.median)
  • All 328 existing cases in tests/write/test_overview.py still pass
  • 912 tests across tests/write/ plus the new file pass together
  • Integer 64-bit sentinel contract pinned by test_integer_dtype_still_routes_through_numpy_path

…2413)

Replace the generic numpy reshape + nan-aggregation pipeline in the float
mean / min / max / median branch of `_block_reduce_2d` with type-specialized
ngjit kernels that walk output pixels directly, fuse the sentinel
comparison into the read, and bounds-check odd shapes instead of
pre-padding. All-NaN / all-sentinel blocks still return NaN so the
caller's downstream sentinel rewrite is unchanged.

Integer dtypes and the `mode` / `cubic` / `nearest` methods keep their
existing numpy paths because the integer sentinel mask has to be
computed at native dtype width (64-bit sentinels round when cast to
float64).

Measured speedups on this machine match the issue's numbers:
1024^2 f32 mean 0.58 ms vs ~11 ms numpy; 4096^2 f32 mean+sentinel
10.86 ms vs ~331 ms numpy.
@github-actions github-actions Bot added the performance PR touches performance-sensitive code label May 27, 2026
Copy link
Copy Markdown
Contributor Author

@brendancol brendancol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review: ngjit 2x2 kernels for float mean/min/max/median overviews

The change is tight: drop-in float fast path, integer/mode/cubic/nearest untouched, parity pinned by a numpy reference. Existing 328 cases in tests/write/test_overview.py still pass and 129 new parametrized cases match the reference. No correctness blockers.

Blockers

None.

Suggestions

  • Accumulator dtype drift for float32 inputs. In _overview_kernels.py:42 (total = 0.0) and :81 / :120 (best = 0.0), the locals start as Python floats so numba unifies them to float64. For float32 input, the mean is computed in float64 and downcast on store; min/max widen each compare to float64. Behavior is fine (and arguably more precise for mean than the prior np.nanmean of float32), but it does mean the "bit-identical" claim in #2413 will not hold across the board. Worth pinning the choice with an arr.dtype.type(0) initializer (or a typed signature) if you want the float32 fast path to actually run in float32. Otherwise add a one-liner in the module docstring noting that float32 inputs are accumulated in float64.
  • No benchmark in benchmarks/. The geotiff module has no benchmarks at all today, so this is not a regression. #2413 cited specific numbers, so a small benchmarks/benchmarks/geotiff_overview.py (1024^2 and 4096^2 f32, with/without sentinel, mean/median) would let asv catch a future numba- or compiler-driven regression. Could land as a follow-up.

Nits

  • _overview.py:347 — the dispatch does a dict lookup (_KERNELS[method]) per call. Fine in practice; an if/elif chain would shave ~1 us but adds branch noise. Leave as is unless profiling says otherwise.
  • _overview_kernels.py:189-201 — the median sort network is correct but dense. A short example trace in the comment (e.g. "(4,2,3,1) -> (2,4,3,1) -> (2,3,4,1) -> ... -> (1,2,3,4)") would help future readers verify the swap order at a glance.
  • _overview_kernels.py — the four kernels share a near-identical 4-cell fetch + NaN/sentinel filter. Factoring would hurt ngjit inlining, so the duplication is justified, but worth a one-line note in the module docstring saying the duplication is intentional so the next reader does not try to DRY it.

What looks good

  • The integer-routing contract is pinned by test_integer_dtype_still_routes_through_numpy_path using INT64_MAX. That is exactly the case a future refactor would silently break.
  • inf-as-sentinel and NaN-as-nodata edge cases are both covered, mirroring the existing _block_reduce_2d gates.
  • Module placement matches the _compression.py precedent the issue cited.
  • The bounds-check pattern (r0+1 < h, c0+1 < w) avoids the NaN-pad copy for odd shapes — that was one of the perf wins called out in the issue and the kernel structure makes it explicit.

Checklist

  • Algorithm matches reference (numpy nan-aware aggregation, pinned by parity tests)
  • All implemented backends consistent — CPU only by design, GPU/dask untouched
  • NaN handling correct (all-NaN -> NaN; partial -> in-bounds-only)
  • Edge cases covered (1x1, single row/col, all-sentinel, all-NaN, partial residual)
  • Dask chunk boundaries — N/A (CPU-only kernel, dask path unchanged)
  • No premature materialization or unnecessary copies (output buffer only)
  • Benchmark — not added; module has none today
  • README feature matrix — N/A (no new public function, no backend coverage change)
  • Docstrings present and accurate

Address review nits on #2475:
- Module docstring now spells out that the mean accumulator is float64
  by way of numba's Python-float unification (matches the prior
  np.nanmean accuracy on float32) and that min/max follow the same
  pattern with no value drift since both are exact reductions.
- Note that per-method kernel duplication is intentional so the JIT
  keeps the inner loop body visible to inlining; do not refactor into
  a shared helper.
- Annotate the median compare-swap ladder with a concrete n=4 trace
  so the swap order is verifiable at a glance.
Copy link
Copy Markdown
Contributor Author

@brendancol brendancol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up review

Dispositions on the prior review:

Suggestions

  • Accumulator dtype drift: addressed via docstring (commit 7abf705). The mean accumulator is intentionally float64 (matches np.nanmean accuracy on float32 inputs) and the docstring now states that explicitly. Min/max use the same pattern but the result is identical since both reductions are exact — also called out.
  • No benchmark in benchmarks/: deferred. The geotiff module has no benchmarks at all today, and adding one for just overview reduction would set a one-off precedent. A broader benchmarks/benchmarks/geotiff_*.py cluster is a better follow-up issue and out of scope here.

Nits

  • _KERNELS[method] dict lookup: dismissed — micro-overhead, current shape is fine.
  • Median sort network needs a trace: addressed (commit 7abf705). Comment now includes the n=4 (4,2,3,1) → (1,2,3,4) trace.
  • Kernel duplication note: addressed (commit 7abf705). Module docstring now records that the per-method duplication is deliberate so ngjit can inline the loop body.

Re-check

  • All 129 new parity cases and the 328 existing tests/write/test_overview.py cases pass after the docstring edits.

@brendancol brendancol merged commit 619cdd8 into main May 27, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance PR touches performance-sensitive code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

GeoTIFF: replace _block_reduce_2d numpy nan-aggregations with @ngjit 2x2 kernels

1 participant