Fix GPU variety undercount for kernels larger than 5x5 by brendancol · Pull Request #2800 · xarray-contrib/xarray-spatial

brendancol · 2026-06-01T17:44:55Z

What changed

The CUDA variety kernel used a fixed 25-element cuda.local.array, so it silently capped unique-value counts at 25. A 7x7 all-unique window returned 25 on GPU versus 49 on CPU, which is a correctness bug: the GPU result was wrong and nothing signalled it.
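To make the failure mode concrete, here is a minimal pure-Python model of a distinct-value count backed by a fixed-size scratch buffer. It is not the old kernel's code: the function name `variety_with_fixed_buffer` and the exact bookkeeping are assumptions for illustration, and only the 25-slot limit comes from the description above.

```python
import math

BUFFER_SIZE = 25  # the scratch-buffer length named in the PR description


def variety_with_fixed_buffer(values, buffer_size=BUFFER_SIZE):
    """Hypothetical model: count distinct non-NaN values using a fixed buffer.

    Once the buffer is full, new distinct values have nowhere to go and are
    silently dropped, so the count can never exceed buffer_size.
    """
    buffer = [math.nan] * buffer_size
    n_unique = 0
    for v in values:
        if v != v:  # NaN is the only value that is not equal to itself
            continue
        found = False
        for i in range(n_unique):
            if buffer[i] == v:
                found = True
                break
        if not found and n_unique < buffer_size:
            buffer[n_unique] = v
            n_unique += 1
    return n_unique


window_5x5 = [float(i) for i in range(25)]
window_7x7 = [float(i) for i in range(49)]
print(variety_with_fixed_buffer(window_5x5))  # 25, correct for a 5x5 window
print(variety_with_fixed_buffer(window_7x7))  # 25, although 49 values are distinct
```

Under this model a 5x5 window can never overflow the buffer, which is consistent with the bug only appearing for kernels larger than 5x5.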

Rewrote _focal_variety_cuda to count distinct values without a scratch buffer. For each valid non-NaN cell it scans only the cells that precede it in the same window and increments the count when none of them holds the same value, so each distinct value is counted once, at its first occurrence. The cost is O(n^2) comparisons per pixel for a window of n cells, with no cuda.local.array, so it works for arbitrary kernel sizes and matches the CPU _calc_variety exactly.
The rewrite removes both the 25-value cap and the register-pressure concern the old buffer was sized for.
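The buffer-free count can be sketched in plain Python as follows. This is an illustration of the technique, not the body of `_focal_variety_cuda`: the name `count_distinct_no_buffer` is hypothetical, the window is passed in as a list of rows, and raster-boundary and kernel-mask handling are left out. The all-NaN case returns NaN, as the review notes for the CPU path.

```python
import math


def count_distinct_no_buffer(window):
    """Count distinct non-NaN values in a 2D window without extra storage.

    A cell contributes to the count only if no earlier cell (in row-major
    order) holds the same value, so every distinct value is counted once,
    at its first occurrence.
    """
    krows = len(window)
    kcols = len(window[0])
    count = 0
    any_valid = False
    for k in range(krows):
        for h in range(kcols):
            v = window[k][h]
            if v != v:  # skip NaN
                continue
            any_valid = True
            target = k * kcols + h  # flat index of the current cell
            seen = False
            for flat in range(target):  # earlier cells only
                pk, ph = divmod(flat, kcols)
                if window[pk][ph] == v:
                    seen = True
                    break
            if not seen:
                count += 1
    return float(count) if any_valid else math.nan


window = [[float(r * 7 + c) for c in range(7)] for r in range(7)]
print(count_distinct_no_buffer(window))  # 49.0
```

The only state is a handful of scalars, so the memory footprint does not depend on the window size. The trade-off is the quadratic number of comparisons per pixel.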

Backend coverage

GPU only (cupy / dask+cupy go through this kernel). numpy and dask+numpy were already correct and are unchanged.

Test plan

test_variety_gpu_large_kernel_parity[7] / [9]: cupy matches numpy on 7x7 (49) and 9x9 (81) all-unique windows. Verified on a real GPU.
test_variety_large_kernel_numpy[7] / [9]: numpy reference returns 49 and 81; runs without a GPU.
Full test_focal.py suite: 154 passed.
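For readers who want to reproduce the expected values without the test suite, the all-unique windows can be sketched with NumPy alone. The helpers `all_unique_window` and `reference_variety` below are hypothetical names, not functions from xarray-spatial or its tests; they only show why an n x n all-unique window should yield n*n.

```python
import numpy as np


def all_unique_window(n):
    """Build an n x n float window in which every cell holds a different value."""
    return np.arange(n * n, dtype=np.float64).reshape(n, n)


def reference_variety(window):
    """Distinct non-NaN value count for one window, computed with np.unique."""
    flat = window.ravel()
    return int(np.unique(flat[~np.isnan(flat)]).size)


for n in (5, 7, 9):
    # 5x5 fits within 25 values; 7x7 and 9x9 are the sizes the old cap broke.
    assert reference_variety(all_unique_window(n)) == n * n
```

A parity test would compare the GPU result against a reference of this kind. All-unique input is the worst case for a capped counter, which is why it exposes the bug.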

The CUDA variety kernel used a fixed 25-element cuda.local.array, so it silently capped unique-value counts at 25. A 7x7 all-unique window returned 25 on GPU versus 49 on CPU. Rewrite the kernel to count distinct values without a scratch buffer: for each valid non-NaN cell, scan only the earlier cells in the same window and increment the count when no earlier cell matches. This drops the cap and the register-pressure concern, and matches the CPU implementation for arbitrary kernel sizes. Add test_variety_gpu_large_kernel_parity asserting cupy matches numpy on 7x7 and 9x9 all-unique windows, plus a numpy-only large-kernel test that runs without a GPU.

brendancol

PR Review: Fix GPU variety undercount for kernels larger than 5x5

Blockers (must fix before merge)

None.

Suggestions (should fix, not blocking)

None.

Nits (optional improvements)

focal.py:934-951: the buffer-free scan is correct, but the inner double-break does a few no-op passes. Once the running flat index pk*kcols+ph reaches k*kcols+h, the ph loop breaks out of the current row, then every later pk row breaks again on its first iteration. No correctness impact. Comparing against a precomputed target = k*kcols+h would exit cleanly, but the current form is fine.
benchmarks/benchmarks/focal.py:36: the focal_stats benchmark runs with default stats, so it never exercises the variety path. Not required for this fix, but a variety case would catch future regressions in this kernel's cost.

What looks good

The GPU count now matches the CPU _calc_variety exactly: each distinct value is counted once, at its earliest occurrence in the window. Verified on a real GPU (cupy == numpy for 7x7 -> 49 and 9x9 -> 81).
NaN handling matches the CPU path (v != v skip, all-NaN window returns NaN).
Dropping the cuda.local.array removes both the 25-value cap and the register-pressure reason it existed.
Tests are split so the numpy reference is checked even without a GPU, and the cupy parity assertion is gated by the cuda_and_cupy_available marker.
Full test_focal.py suite passes (154 tests).

Checklist

Algorithm matches CPU reference
All implemented backends produce consistent results (cupy verified on GPU)
NaN handling is correct
Edge cases covered (existing all-NaN, single-cell tests still pass)
Dask chunk boundaries handled (dask+cupy focal test passes)
No premature materialization or unnecessary copies
Benchmark exists or is not needed (no variety-specific benchmark; not required)
README feature matrix updated (n/a, no new function)
Docstrings present and accurate (internal kernel, comment added)

Break the outer pk loop once pk*kcols reaches the target flat index so the scan stops at the row boundary instead of re-breaking on the first cell of every later row. No behaviour change; verified by the variety tests including the GPU parity cases.

brendancol

Follow-up review (after `8710016`)

The prior-cell scan now breaks the outer pk loop once pk*kcols reaches the target flat index (focal.py:935-937), so it stops at the row boundary instead of re-breaking on the first cell of each later row. This resolves the only actionable nit from the first pass. No behaviour change.
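The loop shape described here can be sketched in plain Python. This is an assumption-laden illustration rather than the code at focal.py:935-937: `seen_earlier` is a hypothetical helper, and it uses an early `return` where a kernel would more likely set a flag.

```python
def seen_earlier(window, k, h):
    """Return True if window[k][h] also appears in an earlier row-major cell."""
    kcols = len(window[0])
    v = window[k][h]
    target = k * kcols + h  # flat index of the cell being tested
    for pk in range(len(window)):
        if pk * kcols >= target:
            # Row pk starts at or after the target, so neither this row nor
            # any later row contains an earlier cell: stop the whole scan.
            break
        for ph in range(kcols):
            if pk * kcols + ph >= target:
                break  # reached the target inside its own row
            if window[pk][ph] == v:
                return True
    return False


window = [
    [1.0, 2.0, 3.0],
    [1.0, 5.0, 2.0],
    [7.0, 8.0, 9.0],
]
print(seen_earlier(window, 1, 0))  # True: 1.0 already appears at (0, 0)
print(seen_earlier(window, 1, 1))  # False: 5.0 is new
```

Without the outer check, every row after the target row would enter the inner loop and break on its first cell. The result is the same either way, which is why this change carries no behaviour change.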

Variety tests pass, including the on-GPU cupy/numpy parity cases for 7x7 (49) and 9x9 (81).
flake8 clean on focal.py.

Remaining item, dismissed with reason:

Variety-specific benchmark: out of scope for a correctness fix. A benchmark measures cost, not correctness, so it would not have caught the 25-value cap this PR removes. The focal_stats benchmark already exists for cost tracking.

No blockers or suggestions.

github-actions bot added the "performance" label (PR touches performance-sensitive code) on Jun 1, 2026

brendancol merged commit b7ae072 into main Jun 2, 2026
7 checks passed

brendancol merged 2 commits into main from issue-2775
