sieve(): numpy and cupy backends have no memory guard #1296

@brendancol

Description

sieve() on the numpy and cupy backends has no memory guard. _label_connected allocates parent, rank, and region_map_flat as int32 arrays plus a float64 result copy (about 20 bytes per pixel of working memory) before any check runs.

The dask paths guard via _available_memory_bytes() (lines 343-355, 366-381). The public sieve() API at line 489 dispatches numpy DataArrays straight into _sieve_numpy with no check. _sieve_cupy at line 308 calls data.get() then _sieve_numpy, inheriting the gap.

A 50000x50000 numpy raster asks for ~50 GB of host memory before anything errors out.
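For reference, a back-of-the-envelope sketch of that accounting, assuming only the allocations listed above (purely illustrative):

```python
# Rough working-set estimate for _label_connected, per the allocation
# list above: three int32 arrays plus one float64 result copy.
def label_working_set_bytes(rows: int, cols: int) -> int:
    n = rows * cols
    int32_arrays = 3 * 4 * n  # parent, rank, region_map_flat
    float64_copy = 8 * n      # result copy
    return int32_arrays + float64_copy  # 20 bytes/pixel

print(label_working_set_bytes(50_000, 50_000) / 1e9)  # 50.0 (GB)
```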

This is the same asymmetric-guard pattern already fixed in cost_distance (#1262), mahalanobis (#1288), multispectral (#1291), and kde (#1287).

Expected behavior

sieve() raises MemoryError with a clear message on every backend when the projected working set exceeds available memory, matching the existing _sieve_dask behavior.

Proposed fix

Add _check_memory(rows, cols) and _check_gpu_memory(rows, cols) helpers (28 bytes/pixel, 50% threshold) and call them from _sieve_numpy and _sieve_cupy before the union-find allocations run.
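A minimal sketch of what those helpers could look like. It uses psutil as a stand-in for the host probe (the real patch would presumably reuse the existing _available_memory_bytes() helper the dask paths already call) and cupy.cuda.runtime.memGetInfo() for the device side; the 28 bytes/pixel and 50% numbers come from the proposal above, and all names here are illustrative:

```python
import cupy
import psutil  # stand-in for the existing _available_memory_bytes() helper

_BYTES_PER_PIXEL = 28  # guard estimate from the proposal above
_THRESHOLD = 0.5       # refuse jobs projected over 50% of what is available

def _check_memory(rows, cols):
    """Raise MemoryError if the projected host working set is too large."""
    required = rows * cols * _BYTES_PER_PIXEL
    available = psutil.virtual_memory().available
    if required > available * _THRESHOLD:
        raise MemoryError(
            f"sieve() needs ~{required / 2**30:.1f} GiB of working memory, "
            f"but only {available / 2**30:.1f} GiB of host memory is available"
        )

def _check_gpu_memory(rows, cols):
    """Raise MemoryError if the projected device working set is too large."""
    required = rows * cols * _BYTES_PER_PIXEL
    free, _total = cupy.cuda.runtime.memGetInfo()
    if required > free * _THRESHOLD:
        raise MemoryError(
            f"sieve() needs ~{required / 2**30:.1f} GiB of GPU memory, "
            f"but only {free / 2**30:.1f} GiB is free on the device"
        )
```

Since _sieve_cupy copies to host via data.get() before delegating to _sieve_numpy, the host-side check would fire on the cupy path too once _sieve_numpy guards itself.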

Followup (separate)

The int32 indices in _label_connected silently truncate when the pixel count exceeds 2^31 (rasters larger than ~46340x46340). The docstring notes the limit, but nothing enforces it at runtime. On typical hosts the memory guard rejects rasters that large before the int32 issue triggers, so this is a documentation/clarity follow-up rather than an exploitable bug.
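If we want a belt-and-suspenders runtime check anyway, something like this (hypothetical helper name) would make the documented limit explicit:

```python
import numpy as np

# Hypothetical guard making the documented int32 limit explicit at runtime.
_MAX_LABEL_PIXELS = np.iinfo(np.int32).max  # 2**31 - 1

def _check_label_capacity(rows: int, cols: int) -> None:
    if rows * cols > _MAX_LABEL_PIXELS:
        raise ValueError(
            f"{rows}x{cols} raster has {rows * cols} pixels, exceeding the "
            f"int32 indexing limit ({_MAX_LABEL_PIXELS}) of _label_connected"
        )
```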

Labels: bug, high-priority, oom
