proximity: parallelize per-chunk-row compute in dask KDTree count pass#2726
Merged
Merged
Conversation
_stream_target_counts called raster.data.blocks[iy, ix].compute() one chunk at a time in a nested Python loop, so chunk reads ran strictly serially even when the scheduler could read several at once. Switch to one da.compute() per chunk-row, which lets the scheduler read each row in parallel while keeping peak driver memory bounded to a single row of chunks (the streaming guarantee from gh-879 still holds). Same shape as the polygonize row-batch in #2608. Measured ~6x on a 64-chunk grid with 3ms/chunk read latency; counts and caches are byte-identical to the sequential pass. Add a structural test that compute() is not called in the inner chunk loop, and a behavioural test that target_counts/total/caches match a sequential baseline.
brendancol
commented
May 29, 2026
Contributor
Author
brendancol
left a comment
There was a problem hiding this comment.
PR Review: proximity: parallelize per-chunk-row compute in dask KDTree count pass
Blockers (must fix before merge)
None.
Suggestions (should fix, not blocking)
None.
Nits (optional improvements)
test_stream_target_counts_matches_sequential_baselinerecomputes every chunk three times (once inside_stream_target_counts, then twice more in the two assertion loops). Cheap for a 24x30 raster, so fine as-is; flagging it in case the fixture grows.
What looks good
- The fix is the right shape.
da.compute(*(blocks for ix in row))returns results in argument order, sorow_chunks[ix]lines up with chunk(iy, ix)and everything downstream stays untouched. Counting, the 25% cache budget, and eviction are unchanged. - Memory stays bounded. Peak goes from one chunk to one chunk-row (sqrt(N) chunks), which is the bound the #879 streaming design needs. The code comment calls this out.
- Good test pairing: the AST check that
.compute()no longer sits in the inner loop catches a regression to the old pattern, and the baseline test confirms counts and caches match the sequential version exactly. - dask+cupy was checked end-to-end against the numpy baseline (it reaches this pass after
.get()), so both dask backends are covered.
Checklist
- Algorithm matches reference/paper (no algorithm change; read scheduling only)
- All implemented backends produce consistent results (dask+numpy and dask+cupy verified)
- NaN handling is correct (unchanged)
- Edge cases are covered by tests (no-targets and all-targets paths already covered; baseline test added)
- Dask chunk boundaries handled correctly (per-row batching preserves the memory bound)
- No premature materialization or unnecessary copies (peak bounded to one chunk-row)
- Benchmark exists or is not needed (benchmarks/benchmarks/proximity.py exists; internal change)
- README feature matrix updated (not applicable; no API change)
- Docstrings present and accurate (no signature change)
…e-proximity-2026-05-29-01
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #2724.
The unbounded dask KDTree path in
proximity/allocation/directionruns a Phase-0 pass that counts target pixels chunk by chunk. That pass calledraster.data.blocks[iy, ix].compute()one chunk at a time inside a nested Python loop, so the reads ran strictly serially even when the scheduler could have read several chunks at once.Change
_stream_target_countsnow issues oneda.compute()per chunk-row instead of one blocking.compute()per chunk. The scheduler reads each row in parallel, and peak driver memory stays bounded to a single row of chunks, so the streaming guarantee from proximity KDTree: sequential.compute()loop + unbounded target accumulation #879 still holds. Same shape as the polygonize row-batch in polygonize dask backend serializes chunks via per-chunk dask.compute() #2608.Measurement
64-chunk grid, 3 ms/chunk simulated read latency: ~350 ms sequential vs ~57 ms row-batched (~6x). Target counts and caches come out byte-identical.
Backends
Affects the dask+numpy and dask+cupy unbounded KDTree path (dask+cupy routes through this after
.get()). numpy and cupy single-array paths are untouched. Verified the dask+cupy result still matches the numpy baseline end-to-end on this CUDA host.Test plan
xrspatial/tests/test_proximity.pypasses (71 tests).compute()is not called in the inner chunk looptarget_counts/total/caches match a sequential baseline