Drop defensive copies of freshly-allocated nodata-mask arrays #1553

@brendancol

Reason or Problem

xrspatial/geotiff/__init__.py has four spots where the code does:

arr = arr.copy()
arr[mask] = np.nan

at lines 518, 990, 1050, and 1448. Each one duplicates a multi-MB array right before an in-place mutation, doubling peak memory during a TIFF read or write with nodata. On a 1024x1024 float32 read that wastes 4 MB; on a raster close to the size of available RAM it can turn a read that would fit into an out-of-memory failure.
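The 4 MB figure can be checked with a short sketch. The sentinel value -9999.0 and the variable names are illustrative assumptions, not taken from xrspatial; the snippet only reproduces the copy-then-mask pattern on a standalone array.

```python
import numpy as np

# Stand-in for a freshly decoded 1024x1024 float32 raster (hypothetical data).
NODATA = -9999.0  # assumed sentinel, for illustration only
arr = np.full((1024, 1024), NODATA, dtype=np.float32)
mask = arr == NODATA

# The pattern under discussion: duplicate, then mutate the duplicate.
copied = arr.copy()
copied[mask] = np.nan

# The duplicate is a second, independent buffer of the same size.
extra_mib = copied.nbytes / 2**20
shares = np.shares_memory(arr, copied)
print(f"extra allocation: {extra_mib} MiB, shares memory: {shares}")
```

While both `arr` and `copied` are alive, the process holds two full-size buffers, which is where the doubled peak comes from.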

Proposal

Drop the .copy() only where the source array is provably uniquely owned. Audit each site:

  • Line 518 (open path, _geotiff_to_xarray): arr comes from read_to_array, which returns a fresh allocation from _read_tiles or _read_strips (both call np.empty or np.full). The orientation handler returns np.ascontiguousarray(...) for non-default orientations, and the band-slice path produces a view of a freshly-allocated 3D buffer the parent function does not retain. Drop the copy. The array is uniquely owned.

  • Line 1448 (_delayed_read_window): arr comes from either _fetch_decode_cog_http_tiles or read_to_array. Both freshly allocate; the optional band slice is a view of a buffer no caller holds. Drop the copy.

  • Line 990 (to_geotiff): arr comes from np.asarray(raw) or np.moveaxis(arr, 0, -1), both of which can be views of a caller-owned numpy buffer. Mutating without a copy would corrupt the user's input. Keep the copy and document why.

  • Line 1050 (_write_single_tile): same situation as line 990. arr = np.asarray(chunk_data) may alias caller data. Keep the copy and document why.

Net change: drop two unnecessary copies on the read path; keep two defensive copies on the write path.
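The read/write distinction in the audit can be illustrated with np.shares_memory. This is a minimal sketch on standalone arrays: `user`, `buf`, and `band` are hypothetical stand-ins for the caller's input and the reader's internal buffer, not the actual xrspatial code paths.

```python
import numpy as np

# Write path: the caller hands us an array they still own.
user = np.arange(12, dtype=np.float32).reshape(3, 2, 2)
as_arr = np.asarray(user)            # no copy for an ndarray input
moved = np.moveaxis(as_arr, 0, -1)   # a view with reordered axes
write_path_aliases = np.shares_memory(moved, user)

# Mutating the view in place would leak into the caller's data,
# so the write path copies first.
safe = moved.copy()
safe[safe == 0.0] = np.nan
user_untouched = not np.isnan(user).any()

# Read path: the buffer is allocated locally and nobody else holds it.
buf = np.zeros((2, 2, 3), dtype=np.float32)
band = buf[:, :, 0]                  # a view, but only of our own buffer
read_path_aliases = np.shares_memory(band, user)

print(write_path_aliases, user_untouched, read_path_aliases)
```

A view is only a problem when someone else can observe the base buffer; on the read path described above, no other reference exists, so in-place mutation is not visible to any caller.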

Design:
Each dropped site is local. Replace arr = arr.copy(); arr[mask] = np.nan with arr[mask] = np.nan. No API changes.
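A before/after sketch of the proposed replacement, assuming a float array and a sentinel comparison for the mask. The function names and `fresh_read` are hypothetical helpers written for this example; they are not functions in xrspatial.

```python
import numpy as np

def apply_nodata_with_copy(arr, nodata):
    """Current pattern: duplicate the array, then mask the duplicate."""
    mask = arr == nodata
    arr = arr.copy()
    arr[mask] = np.nan
    return arr

def apply_nodata_in_place(arr, nodata):
    """Proposed pattern: mask the uniquely owned array directly."""
    mask = arr == nodata
    arr[mask] = np.nan
    return arr

def fresh_read():
    """Stand-in for a reader that returns a fresh allocation."""
    out = np.full((4, 4), -9999.0, dtype=np.float32)
    out[1:3, 1:3] = 1.5
    return out

before = apply_nodata_with_copy(fresh_read(), -9999.0)
src = fresh_read()
after = apply_nodata_in_place(src, -9999.0)

# Same values either way; the in-place form reuses the original buffer.
same_values = np.array_equal(before, after, equal_nan=True)
reused_buffer = after is src
```

The in-place form is only safe when the input is a fresh allocation, as `fresh_read` is here; passing a caller-owned array to `apply_nodata_in_place` would mutate the caller's data.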

Usage:
No user-facing change. Reads of TIFFs with nodata return the same values as before, with the array allocated once instead of twice.

Value:
Removes a peak-memory doubler from the eager-read and dask-delayed-read paths. For large rasters this is the difference between fitting in RAM and OOM.

Stakeholders and Impacts

Stakeholders: anyone reading large TIFFs with sentinel nodata. Impact: lower peak memory, no behavioural change.

Drawbacks

None that I can see. Safety relies on the source allocations remaining uniquely owned; a regression test exercises the float-nodata read path.

Alternatives

Leave the copies in place. That costs a 2x peak-memory hit on every nodata-bearing read.

Additional Notes

Found during an efficiency audit. One fix per PR per project guidelines.


Labels: enhancement (New feature or request), performance (PR touches performance-sensitive code)
