Skip to content

Fix GPU variety undercount for kernels larger than 5x5#2800

Merged
brendancol merged 2 commits into
mainfrom
issue-2775
Jun 2, 2026
Merged

Fix GPU variety undercount for kernels larger than 5x5#2800
brendancol merged 2 commits into
mainfrom
issue-2775

Conversation

@brendancol
Copy link
Copy Markdown
Contributor

Closes #2775

What changed

The CUDA variety kernel used a fixed 25-element cuda.local.array, so it silently capped unique-value counts at 25. A 7x7 all-unique window returned 25 on GPU versus 49 on CPU. For a correctness module that is wrong.

  • Rewrote _focal_variety_cuda to count distinct values without a scratch buffer. For each valid non-NaN cell it scans only the earlier cells in the same window and increments the count when no earlier cell matches. O(window^2) per pixel, no cuda.local.array, so it works for arbitrary kernel sizes and matches the CPU _calc_variety exactly.
  • The rewrite removes both the 25-value cap and the register-pressure concern the old buffer was sized for.

Backend coverage

GPU only (cupy / dask+cupy go through this kernel). numpy and dask+numpy were already correct and are unchanged.

Test plan

  • test_variety_gpu_large_kernel_parity[7] / [9]: cupy matches numpy on 7x7 (49) and 9x9 (81) all-unique windows. Verified on a real GPU.
  • test_variety_large_kernel_numpy[7] / [9]: numpy reference returns 49 and 81; runs without a GPU.
  • Full test_focal.py suite: 154 passed.

The CUDA variety kernel used a fixed 25-element cuda.local.array, so it
silently capped unique-value counts at 25. A 7x7 all-unique window
returned 25 on GPU versus 49 on CPU.

Rewrite the kernel to count distinct values without a scratch buffer:
for each valid non-NaN cell, scan only the earlier cells in the same
window and increment the count when no earlier cell matches. This drops
the cap and the register-pressure concern, and matches the CPU
implementation for arbitrary kernel sizes.

Add test_variety_gpu_large_kernel_parity asserting cupy matches numpy on
7x7 and 9x9 all-unique windows, plus a numpy-only large-kernel test that
runs without a GPU.
@github-actions github-actions Bot added the performance PR touches performance-sensitive code label Jun 1, 2026
Copy link
Copy Markdown
Contributor Author

@brendancol brendancol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review: Fix GPU variety undercount for kernels larger than 5x5

Blockers (must fix before merge)

None.

Suggestions (should fix, not blocking)

None.

Nits (optional improvements)

  • focal.py:934-951: the buffer-free scan is correct, but the inner double-break does a few no-op passes. Once the running flat index pk*kcols+ph reaches k*kcols+h, the ph loop breaks out of the current row, then every later pk row breaks again on its first iteration. No correctness impact. Comparing against a precomputed target = k*kcols+h would exit cleanly, but the current form is fine.
  • benchmarks/benchmarks/focal.py:36: the focal_stats benchmark runs with default stats, so it never exercises the variety path. Not required for this fix, but a variety case would catch future regressions in this kernel's cost.

What looks good

  • The GPU count now matches the CPU _calc_variety exactly: each distinct value is counted once, at its earliest occurrence in the window. Verified on a real GPU (cupy == numpy for 7x7 -> 49 and 9x9 -> 81).
  • NaN handling matches the CPU path (v != v skip, all-NaN window returns NaN).
  • Dropping the cuda.local.array removes both the 25-value cap and the register-pressure reason it existed.
  • Tests are split so the numpy reference is checked even without a GPU, and the cupy parity assertion is gated by the cuda_and_cupy_available marker.
  • Full test_focal.py suite passes (154 tests).

Checklist

  • Algorithm matches CPU reference
  • All implemented backends produce consistent results (cupy verified on GPU)
  • NaN handling is correct
  • Edge cases covered (existing all-NaN, single-cell tests still pass)
  • Dask chunk boundaries handled (dask+cupy focal test passes)
  • No premature materialization or unnecessary copies
  • Benchmark exists or is not needed (no variety-specific benchmark; not required)
  • README feature matrix updated (n/a, no new function)
  • Docstrings present and accurate (internal kernel, comment added)

Break the outer pk loop once pk*kcols reaches the target flat index so
the scan stops at the row boundary instead of re-breaking on the first
cell of every later row. No behaviour change; verified by the variety
tests including the GPU parity cases.
Copy link
Copy Markdown
Contributor Author

@brendancol brendancol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up review (after 8710016)

The prior-cell scan now breaks the outer pk loop once pk*kcols reaches the target flat index (focal.py:935-937), so it stops at the row boundary instead of re-breaking on the first cell of each later row. This resolves the only actionable nit from the first pass. No behaviour change.

  • Variety tests pass, including the on-GPU cupy/numpy parity cases for 7x7 (49) and 9x9 (81).
  • flake8 clean on focal.py.

Remaining item, dismissed with reason:

  • Variety-specific benchmark: out of scope for a correctness fix. A benchmark measures cost, not correctness, so it would not have caught the 25-value cap this PR removes. The focal_stats benchmark already exists for cost tracking.

No blockers or suggestions.

@brendancol brendancol merged commit b7ae072 into main Jun 2, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance PR touches performance-sensitive code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

GPU variety focal statistic silently undercounts unique values for kernels larger than 5x5

1 participant