sieve: fix convergence warning and speed up labeling/adjacency #1162

Author of Proposal: @brendancol

Reason or problem

Reviewed the sieve implementation (added in #1159) for accuracy and performance. Found a few things worth fixing.

Accuracy

  1. Silent convergence limit. The merge loop caps at 50 iterations but never warns if it hits the limit. On pathological inputs (lots of cascading same-value merges), the function just returns a partially-sieved result with no indication anything went wrong.

  2. Integer nodata gap. Integer rasters can't express NaN, so classified rasters that use sentinel values like -9999 or 255 for nodata get those pixels treated as valid data. Not a bug -- the docstring says "NaN pixels are preserved" -- but worth noting for a future nodata parameter.
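
A minimal sketch of the warning behavior described in item 1, not the actual sieve code. The helper name `run_until_stable` and the `step` callable are hypothetical stand-ins for the merge loop; only the 50-iteration cap comes from the issue.

```python
import warnings

MAX_ITERATIONS = 50  # the cap mentioned in the issue


def run_until_stable(step, max_iterations=MAX_ITERATIONS):
    """Call step() until it reports no change or the cap is reached.

    `step` is a hypothetical callable that performs one merge pass and
    returns True if it merged anything. Returns True on convergence and
    False if the cap was hit (after emitting a warning).
    """
    for _ in range(max_iterations):
        if not step():
            return True  # nothing merged on this pass: converged
    warnings.warn(
        f"sieve did not converge after {max_iterations} iterations; "
        "the result may still contain regions below the threshold",
        RuntimeWarning,
        stacklevel=2,
    )
    return False
```

With this shape, a caller that hits the cap still gets a result back, but the partial sieve is no longer silent.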

Performance

  1. Per-value labeling. _label_all_regions calls scipy.ndimage.label once per unique raster value. For a land-cover raster with 50 classes on a 10k x 10k grid, that's 50 separate label passes over 100M pixels. A numba union-find can do this in one pass.

  2. Python-level adjacency loop. _build_adjacency extracts unique border pairs with np.unique then iterates them in a Python for loop. For large rasters with many region boundaries, this is the bottleneck after labeling.

  3. Unnecessary re-labeling. The outer loop re-labels the entire raster every iteration even when the inner merge loop didn't create any new same-value adjacencies (which is the only case that changes component structure).
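
To illustrate item 1, here is a rough union-find labeling sketch that handles every raster value in a single scan instead of one `scipy.ndimage.label` call per value. It is written in plain Python for clarity and is not the proposed numba implementation. The function names, the 4-connectivity, and the integer-only input are assumptions made for the example.

```python
import numpy as np


def _find(parent, i):
    # Follow parent links to the root, halving the path as we go.
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def _union(parent, a, b):
    ra, rb = _find(parent, a), _find(parent, b)
    if ra != rb:
        # Attach the larger root index under the smaller one.
        if ra < rb:
            parent[rb] = ra
        else:
            parent[ra] = rb


def label_equal_value_regions(data):
    """Label 4-connected regions of equal value (assumed connectivity).

    Returns (labels, n_regions) with labels numbered from 1.
    """
    rows, cols = data.shape
    parent = np.arange(rows * cols)
    # Single scan: join each pixel with its left and upper neighbour
    # whenever the two share the same value.
    for r in range(rows):
        for c in range(cols):
            idx = r * cols + c
            if c > 0 and data[r, c] == data[r, c - 1]:
                _union(parent, idx, idx - 1)
            if r > 0 and data[r, c] == data[r - 1, c]:
                _union(parent, idx, idx - cols)
    # Resolve every pixel to its root, then compact roots to 1..n.
    roots = np.array([_find(parent, i) for i in range(rows * cols)])
    _, inverse = np.unique(roots, return_inverse=True)
    labels = inverse.reshape(rows, cols) + 1
    return labels, int(labels.max())
```

The cost of the scan does not depend on the number of distinct classes, which is the point of the proposal: 50 classes cost the same as 2.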

Proposal

  • Warn when the 50-iteration limit is reached
  • Replace per-value scipy.ndimage.label calls with a single-pass numba union-find
  • Vectorize the adjacency builder (numpy fancy indexing instead of Python loop)
  • Track whether merges changed the value topology; skip re-labeling when they didn't
  • Add tests for the convergence warning and for larger synthetic rasters
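
One possible shape for the vectorized adjacency builder, sketched with NumPy only. The issue does not say which data structure `_build_adjacency` produces, so the CSR-style `(indptr, indices)` layout, the function names, and the 4-connectivity here are all assumptions for illustration.

```python
import numpy as np


def adjacent_label_pairs(labels):
    """Return unique (low, high) label pairs that share a 4-connected edge."""
    # Left/right and up/down neighbour pairs as flat arrays.
    a = np.concatenate([labels[:, :-1].ravel(), labels[:-1, :].ravel()])
    b = np.concatenate([labels[:, 1:].ravel(), labels[1:, :].ravel()])
    # Keep only pairs that cross a region boundary.
    mask = a != b
    a, b = a[mask], b[mask]
    # Order each pair so (2, 5) and (5, 2) collapse to one entry.
    pairs = np.stack([np.minimum(a, b), np.maximum(a, b)], axis=1)
    return np.unique(pairs, axis=0)


def build_adjacency_csr(pairs, n_labels):
    """Turn unique pairs into a CSR-style neighbour list without a Python loop.

    Neighbours of label k are indices[indptr[k]:indptr[k + 1]].
    Labels are assumed to run from 1 to n_labels.
    """
    # Make the relation symmetric: each pair contributes both directions.
    src = np.concatenate([pairs[:, 0], pairs[:, 1]])
    dst = np.concatenate([pairs[:, 1], pairs[:, 0]])
    # Sort by source label, then by neighbour label.
    order = np.lexsort((dst, src))
    src, dst = src[order], dst[order]
    # Count neighbours per label and convert counts to offsets.
    counts = np.bincount(src, minlength=n_labels + 1)
    indptr = np.concatenate([[0], np.cumsum(counts)])
    return indptr, dst
```

For example, with `labels = np.array([[1, 1, 2], [3, 1, 2], [3, 3, 2]])`, the neighbours of region 1 are read as `indices[indptr[1]:indptr[2]]`.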

Value

The sieve function targets noisy classified rasters, which can easily have thousands of small regions. These changes keep it usable on larger inputs without changing the public API.

Drawbacks

Makes the labeling path depend on numba at runtime. This is not a new dependency, since the project already uses numba throughout (@ngjit).

Unresolved questions

Whether to add a nodata parameter for integer sentinel values. Leaving that for a separate issue.
