proximity: parallelize per-chunk-row compute in dask KDTree count pass by brendancol · Pull Request #2726 · xarray-contrib/xarray-spatial

brendancol · 2026-05-29T23:55:56Z

Closes #2724.

The unbounded dask KDTree path in proximity/allocation/direction runs a Phase-0 pass that counts target pixels chunk by chunk. That pass called raster.data.blocks[iy, ix].compute() one chunk at a time inside a nested Python loop, so the reads ran strictly serially even when the scheduler could have read several chunks at once.

Change

_stream_target_counts now issues one da.compute() per chunk-row instead of one blocking .compute() per chunk. The scheduler reads each row in parallel, and peak driver memory stays bounded to a single row of chunks, so the streaming guarantee from proximity KDTree: sequential .compute() loop + unbounded target accumulation #879 still holds. Same shape as the polygonize row-batch in polygonize dask backend serializes chunks via per-chunk dask.compute() #2608.
Per-chunk counting, the 25% memory cache budget, and cache eviction are unchanged.

Measurement

64-chunk grid, 3 ms/chunk simulated read latency: ~350 ms sequential vs ~57 ms row-batched (~6x). Target counts and caches come out byte-identical.

Backends

Affects the dask+numpy and dask+cupy unbounded KDTree path (dask+cupy routes through this after .get()). numpy and cupy single-array paths are untouched. Verified the dask+cupy result still matches the numpy baseline end-to-end on this CUDA host.

Test plan

xrspatial/tests/test_proximity.py passes (71 tests)
New: structural test that .compute() is not called in the inner chunk loop
New: behavioural test that target_counts/total/caches match a sequential baseline
dask+cupy unbounded run matches numpy baseline

_stream_target_counts called raster.data.blocks[iy, ix].compute() one chunk at a time in a nested Python loop, so chunk reads ran strictly serially even when the scheduler could read several at once. Switch to one da.compute() per chunk-row, which lets the scheduler read each row in parallel while keeping peak driver memory bounded to a single row of chunks (the streaming guarantee from gh-879 still holds). Same shape as the polygonize row-batch in #2608. Measured ~6x on a 64-chunk grid with 3ms/chunk read latency; counts and caches are byte-identical to the sequential pass. Add a structural test that compute() is not called in the inner chunk loop, and a behavioural test that target_counts/total/caches match a sequential baseline.

brendancol

PR Review: proximity: parallelize per-chunk-row compute in dask KDTree count pass

Blockers (must fix before merge)

None.

Suggestions (should fix, not blocking)

None.

Nits (optional improvements)

test_stream_target_counts_matches_sequential_baseline recomputes every chunk three times (once inside _stream_target_counts, then twice more in the two assertion loops). Cheap for a 24x30 raster, so fine as-is; flagging it in case the fixture grows.

What looks good

The fix is the right shape. da.compute(*(blocks for ix in row)) returns results in argument order, so row_chunks[ix] lines up with chunk (iy, ix) and everything downstream stays untouched. Counting, the 25% cache budget, and eviction are unchanged.
Memory stays bounded. Peak goes from one chunk to one chunk-row (sqrt(N) chunks), which is the bound the #879 streaming design needs. The code comment calls this out.
Good test pairing: the AST check that .compute() no longer sits in the inner loop catches a regression to the old pattern, and the baseline test confirms counts and caches match the sequential version exactly.
dask+cupy was checked end-to-end against the numpy baseline (it reaches this pass after .get()), so both dask backends are covered.

Checklist

Algorithm matches reference/paper (no algorithm change; read scheduling only)
All implemented backends produce consistent results (dask+numpy and dask+cupy verified)
NaN handling is correct (unchanged)
Edge cases are covered by tests (no-targets and all-targets paths already covered; baseline test added)
Dask chunk boundaries handled correctly (per-row batching preserves the memory bound)
No premature materialization or unnecessary copies (peak bounded to one chunk-row)
Benchmark exists or is not needed (benchmarks/benchmarks/proximity.py exists; internal change)
README feature matrix updated (not applicable; no API change)
Docstrings present and accurate (no signature change)

…e-proximity-2026-05-29-01

github-actions Bot added the performance PR touches performance-sensitive code label May 29, 2026

brendancol commented May 29, 2026

View reviewed changes

Merge remote-tracking branch 'origin/main' into deep-sweep-performanc…

7fa7291

…e-proximity-2026-05-29-01

brendancol merged commit 1a0433a into main May 30, 2026
7 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

proximity: parallelize per-chunk-row compute in dask KDTree count pass#2726

proximity: parallelize per-chunk-row compute in dask KDTree count pass#2726
brendancol merged 2 commits into
mainfrom
deep-sweep-performance-proximity-2026-05-29-01

brendancol commented May 29, 2026

Uh oh!

brendancol left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

brendancol commented May 29, 2026

Change

Measurement

Backends

Test plan

Uh oh!

brendancol left a comment

Choose a reason for hiding this comment

PR Review: proximity: parallelize per-chunk-row compute in dask KDTree count pass

Blockers (must fix before merge)

Suggestions (should fix, not blocking)

Nits (optional improvements)

What looks good

Checklist

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant