Merge sink_d8 labels across dask tile boundaries (#1394) #1395

Merged
brendancol merged 1 commit into main from issue-1394 on Apr 30, 2026
@brendancol
Contributor

Fixes #1394.

Summary

  • _run_dask_numpy in xrspatial/hydro/sink_d8.py ran per-tile connected-component labeling (CCL) with globally unique IDs but never merged equivalent labels across tile boundaries, so a single connected sink that spanned a chunk boundary showed up as several separate sinks.
  • Added a union-find pass that walks every interior tile edge (4-connected) plus both diagonals (so 8-connectivity is preserved across corner-shared tiles), records label equivalences, and remaps labels to their roots via a second map_blocks pass.
  • The dask+cupy path delegates to the numpy dask path, so it inherits the fix.
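
To make the boundary merge concrete, here is a minimal sketch of the idea on an already-materialized label array. It is not the code from this PR: the names merge_across_boundaries, row_cuts, and col_cuts are hypothetical, background is assumed to be 0, and the remap is done in-process rather than through a second map_blocks pass.

```python
import numpy as np


def find(parent, x):
    # Follow parent pointers to the root, halving the path as we go.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def union(parent, a, b):
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        # Keep the smaller ID as root so the result is deterministic.
        parent[max(ra, rb)] = min(ra, rb)


def merge_across_boundaries(labels, row_cuts, col_cuts):
    """Merge labels that touch across tile edges (8-connectivity).

    labels   : 2D int array of per-tile labels, 0 = background (assumption).
    row_cuts : row indices where a new tile starts (hypothetical parameter).
    col_cuts : column indices where a new tile starts (hypothetical parameter).
    """
    parent = {int(v): int(v) for v in np.unique(labels) if v != 0}
    h, w = labels.shape

    def link(a, b):
        if a != 0 and b != 0:
            union(parent, int(a), int(b))

    # Horizontal tile edges: compare row r-1 with row r, including the
    # two diagonal neighbours so corner-sharing tiles are also covered.
    for r in row_cuts:
        for c in range(w):
            for dc in (-1, 0, 1):
                if 0 <= c + dc < w:
                    link(labels[r - 1, c], labels[r, c + dc])

    # Vertical tile edges: compare column c-1 with column c, same idea.
    for c in col_cuts:
        for r in range(h):
            for dr in (-1, 0, 1):
                if 0 <= r + dr < h:
                    link(labels[r, c - 1], labels[r + dr, c])

    # Remap every label to its root.
    out = labels.copy()
    for v in parent:
        root = find(parent, v)
        if root != v:
            out[labels == v] = root
    return out


# 4x4 raster split into 2x2 tiles; labels 1 and 2 touch only diagonally
# across the shared corner, label 3 is isolated.
tiles = np.array([
    [1, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 2, 0],
    [3, 0, 0, 0],
])
merged = merge_across_boundaries(tiles, row_cuts=[2], col_cuts=[2])
```

In this toy input, label 2 is folded into label 1 through the diagonal link, while label 3 stays separate because none of its neighbours across the tile edge are labeled.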

Test plan

  • pytest xrspatial/hydro/tests/test_sink_d8.py — 40 passing (21 original + 19 new)
  • pytest xrspatial/hydro/tests/ — full hydro suite still passes (772 tests)
  • Original reproducer from #1394 (sink_d8 dask backend splits connected sinks across tile boundaries) now matches the numpy result
  • Added regression tests for horizontal, vertical, diagonal, four-tile-block, and separation cases at multiple chunk shapes
  • Added a _label_count_matches_numpy test so the number of unique sink labels stays equal across backends

Notes

  • Cross-tile merging needs a global view of the labeled raster, so the implementation calls .compute() once on the per-tile result. The streaming benefit of dask is preserved during the per-tile CCL phase; only the boundary scan and label remap require the materialized array. CCL is fundamentally a global operation, so this matches what xrspatial/sieve.py already does for its dask path.
  • Labels in the dask output are not byte-identical to the numpy output (the dask path uses position-based IDs from each tile, then merges, while the numpy path uses one position-based ID across the whole raster). The new tests check label-partition equivalence (cells in the same numpy component are in the same dask component) rather than literal ID equality.
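
As an illustration of what label-partition equivalence means, the following is a rough sketch of such a comparison. It is not the helper used in test_sink_d8.py: same_partition is a hypothetical name, and background is assumed to be 0 (the real output may mark non-sink cells differently). This version checks a one-to-one mapping between the two label sets, which is slightly stronger than the one-directional statement above and also implies equal label counts.

```python
import numpy as np


def same_partition(a, b, background=0):
    """Return True if a and b group foreground cells identically,
    regardless of which integer ID each group carries."""
    a = np.asarray(a)
    b = np.asarray(b)
    if a.shape != b.shape:
        return False

    fg_a = a != background
    fg_b = b != background
    # Both outputs must agree on which cells are labeled at all.
    if not np.array_equal(fg_a, fg_b):
        return False
    if not fg_a.any():
        return True

    # Distinct (label_in_a, label_in_b) pairs seen on foreground cells.
    pairs = np.unique(np.stack([a[fg_a], b[fg_a]], axis=1), axis=0)

    # A bijection means no a-label and no b-label appears in two pairs.
    return (len(np.unique(pairs[:, 0])) == len(pairs)
            and len(np.unique(pairs[:, 1])) == len(pairs))


numpy_labels = np.array([[7, 7, 0], [0, 0, 9]])
dask_labels = np.array([[2, 2, 0], [0, 0, 5]])   # same groups, different IDs
split_labels = np.array([[2, 3, 0], [0, 0, 5]])  # first sink split in two

ok = same_partition(numpy_labels, dask_labels)
broken = same_partition(numpy_labels, split_labels)
```

Here ok is True because only the IDs differ, and broken is False because one numpy component maps to two labels, which is the failure mode described in #1394.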

The per-tile CCL in _run_dask_numpy assigned globally unique IDs but
never merged equivalent labels across tile boundaries, so a single
connected sink that spanned a chunk boundary ended up as several
separate sinks. Add a union-find pass over boundary equivalences
(4-connected edges plus the two diagonals for 8-connectivity) and
remap labels to their roots. Cover horizontal, vertical, diagonal,
four-tile block, and separation cases in test_sink_d8.py.
@github-actions github-actions Bot added the performance PR touches performance-sensitive code label Apr 30, 2026
@brendancol brendancol merged commit 7653275 into main Apr 30, 2026
11 checks passed
