Skip to content

proximity: parallelize per-chunk-row compute in dask KDTree count pass#2726

Merged
brendancol merged 2 commits into
mainfrom
deep-sweep-performance-proximity-2026-05-29-01
May 30, 2026
Merged

proximity: parallelize per-chunk-row compute in dask KDTree count pass#2726
brendancol merged 2 commits into
mainfrom
deep-sweep-performance-proximity-2026-05-29-01

Conversation

@brendancol
Copy link
Copy Markdown
Contributor

Closes #2724.

The unbounded dask KDTree path in proximity/allocation/direction runs a Phase-0 pass that counts target pixels chunk by chunk. That pass called raster.data.blocks[iy, ix].compute() one chunk at a time inside a nested Python loop, so the reads ran strictly serially even when the scheduler could have read several chunks at once.

Change

Measurement

64-chunk grid, 3 ms/chunk simulated read latency: ~350 ms sequential vs ~57 ms row-batched (~6x). Target counts and caches come out byte-identical.

Backends

Affects the dask+numpy and dask+cupy unbounded KDTree path (dask+cupy routes through this after .get()). numpy and cupy single-array paths are untouched. Verified the dask+cupy result still matches the numpy baseline end-to-end on this CUDA host.

Test plan

  • xrspatial/tests/test_proximity.py passes (71 tests)
  • New: structural test that .compute() is not called in the inner chunk loop
  • New: behavioural test that target_counts/total/caches match a sequential baseline
  • dask+cupy unbounded run matches numpy baseline

_stream_target_counts called raster.data.blocks[iy, ix].compute() one
chunk at a time in a nested Python loop, so chunk reads ran strictly
serially even when the scheduler could read several at once. Switch to
one da.compute() per chunk-row, which lets the scheduler read each row
in parallel while keeping peak driver memory bounded to a single row of
chunks (the streaming guarantee from gh-879 still holds). Same shape as
the polygonize row-batch in #2608.

Measured ~6x on a 64-chunk grid with 3ms/chunk read latency; counts and
caches are byte-identical to the sequential pass.

Add a structural test that compute() is not called in the inner chunk
loop, and a behavioural test that target_counts/total/caches match a
sequential baseline.
@github-actions github-actions Bot added the performance PR touches performance-sensitive code label May 29, 2026
Copy link
Copy Markdown
Contributor Author

@brendancol brendancol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review: proximity: parallelize per-chunk-row compute in dask KDTree count pass

Blockers (must fix before merge)

None.

Suggestions (should fix, not blocking)

None.

Nits (optional improvements)

  • test_stream_target_counts_matches_sequential_baseline recomputes every chunk three times (once inside _stream_target_counts, then twice more in the two assertion loops). Cheap for a 24x30 raster, so fine as-is; flagging it in case the fixture grows.

What looks good

  • The fix is the right shape. da.compute(*(blocks for ix in row)) returns results in argument order, so row_chunks[ix] lines up with chunk (iy, ix) and everything downstream stays untouched. Counting, the 25% cache budget, and eviction are unchanged.
  • Memory stays bounded. Peak goes from one chunk to one chunk-row (sqrt(N) chunks), which is the bound the #879 streaming design needs. The code comment calls this out.
  • Good test pairing: the AST check that .compute() no longer sits in the inner loop catches a regression to the old pattern, and the baseline test confirms counts and caches match the sequential version exactly.
  • dask+cupy was checked end-to-end against the numpy baseline (it reaches this pass after .get()), so both dask backends are covered.

Checklist

  • Algorithm matches reference/paper (no algorithm change; read scheduling only)
  • All implemented backends produce consistent results (dask+numpy and dask+cupy verified)
  • NaN handling is correct (unchanged)
  • Edge cases are covered by tests (no-targets and all-targets paths already covered; baseline test added)
  • Dask chunk boundaries handled correctly (per-row batching preserves the memory bound)
  • No premature materialization or unnecessary copies (peak bounded to one chunk-row)
  • Benchmark exists or is not needed (benchmarks/benchmarks/proximity.py exists; internal change)
  • README feature matrix updated (not applicable; no API change)
  • Docstrings present and accurate (no signature change)

@brendancol brendancol merged commit 1a0433a into main May 30, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance PR touches performance-sensitive code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

proximity: serialized per-chunk .compute() in dask KDTree target-count pass

1 participant