sieve: fix convergence warning and speed up labeling/adjacency #1162
Description
Author of Proposal: @brendancol
Reason or problem
Reviewed the sieve implementation (added in #1159) for accuracy and performance. Found a few things worth fixing.
Accuracy
- Silent convergence limit. The merge loop caps at 50 iterations but never warns if it hits the limit. On pathological inputs (lots of cascading same-value merges), the function just returns a partially-sieved result with no indication anything went wrong.
- Integer nodata gap. Integer rasters can't express NaN, so classified rasters that use sentinel values like -9999 or 255 for nodata get those pixels treated as valid data. Not a bug -- the docstring says "NaN pixels are preserved" -- but worth noting for a future nodata parameter.
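A minimal sketch of the convergence warning, assuming the merge loop can be expressed as a repeated step that reports whether it changed anything. The names run_until_converged and step are hypothetical and are not taken from the existing sieve code; only the 50-iteration cap comes from the description above.

```python
import warnings

MAX_ITERATIONS = 50  # cap quoted in this issue


def run_until_converged(step, max_iterations=MAX_ITERATIONS):
    """Call step() until it reports no change; warn if the cap is reached.

    `step` is a hypothetical callable that returns True when it merged
    at least one region and False when there was nothing left to merge.
    Returns the number of passes that were run.
    """
    for iteration in range(max_iterations):
        if not step():
            return iteration + 1  # converged within the cap
    # Every allowed pass still merged something, so the result may be
    # only partially sieved. Tell the caller instead of staying silent.
    warnings.warn(
        f"sieve did not converge after {max_iterations} iterations; "
        "the result may still contain regions below the threshold",
        RuntimeWarning,
        stacklevel=2,
    )
    return max_iterations
```

A test for this can use warnings.catch_warnings(record=True) with a step that always returns True and a small max_iterations.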
Performance
- Per-value labeling. _label_all_regions calls scipy.ndimage.label once per unique raster value. For a land-cover raster with 50 classes on a 10k x 10k grid, that's 50 separate label passes over 100M pixels. A numba union-find can do this in one pass.
- Python-level adjacency loop. _build_adjacency extracts unique border pairs with np.unique then iterates them in a Python for loop. For large rasters with many region boundaries, this is the bottleneck after labeling.
- Unnecessary re-labeling. The outer loop re-labels the entire raster every iteration even when the inner merge loop didn't create any new same-value adjacencies (which is the only case that changes component structure).
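To illustrate the union-find idea, here is a rough sketch in plain Python that labels all values in one scan instead of one pass per unique value. It assumes 4-connectivity and ignores NaN handling, both of which are assumptions rather than statements about the current code. The names find and label_regions are hypothetical, and a real version would presumably run the same loops over arrays under numba.

```python
def find(parent, i):
    """Return the root of i, shortening the path as we walk up."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path halving
        i = parent[i]
    return i


def label_regions(grid):
    """Label 4-connected regions of equal value in a 2D list of lists.

    Returns a grid of the same shape with labels 1..n in scan order.
    The cost does not depend on how many unique values the grid holds.
    """
    rows, cols = len(grid), len(grid[0])
    parent = list(range(rows * cols))

    # One raster scan: union each pixel with its left and upper
    # neighbour when they carry the same value.
    for r in range(rows):
        for c in range(cols):
            idx = r * cols + c
            if c > 0 and grid[r][c - 1] == grid[r][c]:
                a, b = find(parent, idx), find(parent, idx - 1)
                if a != b:
                    parent[max(a, b)] = min(a, b)
            if r > 0 and grid[r - 1][c] == grid[r][c]:
                a, b = find(parent, idx), find(parent, idx - cols)
                if a != b:
                    parent[max(a, b)] = min(a, b)

    # Second scan: map each root to a compact label.
    labels = [[0] * cols for _ in range(rows)]
    compact = {}
    for r in range(rows):
        for c in range(cols):
            root = find(parent, r * cols + c)
            labels[r][c] = compact.setdefault(root, len(compact) + 1)
    return labels
```

Two disconnected patches of the same value get different labels, which is the same component structure the per-value scipy.ndimage.label calls produce.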
Proposal
- Warn when the 50-iteration limit is reached
- Replace per-value scipy.ndimage.label calls with a single-pass numba union-find
- Vectorize the adjacency builder (numpy fancy indexing instead of Python loop)
- Track whether merges changed the value topology; skip re-labeling when they didn't
- Add tests for the convergence warning and for larger synthetic rasters
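One possible shape for the vectorized adjacency builder is sketched below. It assumes 4-connectivity and a 2D integer label array, and it returns a flat, sorted neighbour listing rather than whatever structure _build_adjacency returns today. The function name build_adjacency_flat and its return layout are hypothetical.

```python
import numpy as np


def build_adjacency_flat(labels):
    """Return (regions, starts, neighbours) for a 2D label array.

    The neighbours of regions[i] are neighbours[starts[i]:starts[i + 1]]
    (or to the end of the array for the last region).
    """
    # Neighbouring label pairs from shifted slices, no Python loop.
    horiz = np.stack([labels[:, :-1].ravel(), labels[:, 1:].ravel()], axis=1)
    vert = np.stack([labels[:-1, :].ravel(), labels[1:, :].ravel()], axis=1)
    pairs = np.concatenate([horiz, vert], axis=0)

    # Keep pairs that cross a region boundary, order each pair so
    # (a, b) and (b, a) collapse, then deduplicate.
    pairs = pairs[pairs[:, 0] != pairs[:, 1]]
    pairs = np.unique(np.sort(pairs, axis=1), axis=0)

    # Mirror the pairs so every region lists all of its neighbours,
    # then sort by region (primary key) and neighbour (secondary key).
    both = np.concatenate([pairs, pairs[:, ::-1]], axis=0)
    order = np.lexsort((both[:, 1], both[:, 0]))
    both = both[order]

    # First index of each region in the sorted listing.
    regions, starts = np.unique(both[:, 0], return_index=True)
    return regions, starts, both[:, 1]
```

The flat layout keeps the per-pair work inside numpy; any remaining Python-level iteration is over regions, not over border pairs.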
Value
The sieve function targets noisy classified rasters, which can easily have thousands of small regions. These changes keep it usable on larger inputs without changing the public API.
Drawbacks
Adds numba as a runtime dependency for the labeling path, but the project already uses numba everywhere (@ngjit).
Unresolved questions
Whether to add a nodata parameter for integer sentinel values. Leaving that for a separate issue.