Reason or Problem
xrspatial/geotiff/__init__.py has four spots where the code does:
arr = arr.copy()
arr[mask] = np.nan
at lines 518, 990, 1050, and 1448. Each one duplicates a multi-MB array right before an in-place mutation, doubling peak memory during a TIFF read or write with nodata. On a 1024x1024 float32 read that wastes 4 MB; on a raster that already takes up a large share of RAM, the extra copy can push the process out of memory.
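To make the arithmetic concrete, here is a minimal sketch of the pattern on a 1024x1024 float32 array. The zero-valued array and the mask are stand-ins for illustration, not the actual reader output.

```python
import numpy as np

# A 1024x1024 float32 raster: 1024 * 1024 * 4 bytes = 4 MiB.
arr = np.zeros((1024, 1024), dtype=np.float32)
mask = arr == 0  # stand-in for the nodata mask (illustrative only)

# The pattern in question: a second full-size buffer is allocated
# while the first one is still alive.
copied = arr.copy()
copied[mask] = np.nan

base_bytes = arr.nbytes                   # size of one buffer
peak_bytes = arr.nbytes + copied.nbytes   # both buffers alive at once
print(base_bytes, peak_bytes)
```

The boolean mask adds its own (smaller) allocation in either variant, so it is left out of the comparison.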
Proposal
Drop the .copy() only where the source array is provably uniquely owned. Audit each site:
- Line 518 (open path, _geotiff_to_xarray): arr comes from read_to_array, which returns a fresh allocation from _read_tiles or _read_strips (both call np.empty or np.full). The orientation handler returns np.ascontiguousarray(...) for non-default orientations, and the band-slice path produces a view of a freshly-allocated 3D buffer the parent function does not retain. Drop the copy. The array is uniquely owned.
- Line 1448 (_delayed_read_window): arr comes from either _fetch_decode_cog_http_tiles or read_to_array. Both freshly allocate; the optional band slice is a view of a buffer no caller holds. Drop the copy.
- Line 990 (to_geotiff): arr comes from np.asarray(raw) or np.moveaxis(arr, 0, -1), both of which can be views of a caller-owned numpy buffer. Mutating without a copy would corrupt the user's input. Keep the copy and document why.
- Line 1050 (_write_single_tile): same shape. arr = np.asarray(chunk_data) may alias caller data. Keep the copy and document why.
Net change: drop two unnecessary copies on the read path; keep two defensive copies on the write path.
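The distinction between the two paths can be shown with plain NumPy. This is a sketch only: `raw`, `cube`, and `read_band` are hypothetical stand-ins, not the code in xrspatial/geotiff/__init__.py.

```python
import numpy as np

NODATA = -9999.0

# Write path: the caller owns `raw`. np.asarray does not copy an ndarray,
# so `arr` is the caller's buffer.
raw = np.array([[1.0, NODATA], [3.0, 4.0]], dtype=np.float32)
arr = np.asarray(raw)
aliases_input = arr is raw

# np.moveaxis returns a view as well.
cube = np.zeros((3, 2, 2), dtype=np.float32)
moved = np.moveaxis(cube, 0, -1)
moved_is_view = np.shares_memory(moved, cube)

# The defensive copy keeps the caller's data intact.
safe = arr.copy()
safe[safe == NODATA] = np.nan
input_intact = bool(raw[0, 1] == NODATA)

# Read path: the buffer is allocated inside the function and never escapes,
# so mutating a view of it in place cannot affect any caller.
def read_band(band):  # hypothetical stand-in for a reader
    buf = np.full((2, 2, 3), NODATA, dtype=np.float32)
    out = buf[:, :, band]       # view of a buffer nobody else holds
    out[out == NODATA] = np.nan
    return out

band0 = read_band(0)
```

On the write path the copy is what protects `raw`; on the read path there is no outside reference to protect.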
Design:
Each dropped site is local. Replace arr = arr.copy(); arr[mask] = np.nan with arr[mask] = np.nan. No API changes.
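A before/after sketch of the change at one read-path site. The function names are hypothetical and the mask computation is simplified to an equality test; the real sites do this inline.

```python
import numpy as np

def mask_nodata_before(arr, nodata):
    """Current pattern: copy, then mutate the copy."""
    mask = arr == nodata
    arr = arr.copy()        # second full-size allocation
    arr[mask] = np.nan
    return arr

def mask_nodata_after(arr, nodata):
    """Proposed pattern: mutate in place. Only safe when `arr` is uniquely owned."""
    mask = arr == nodata
    arr[mask] = np.nan
    return arr

# A freshly allocated buffer, as the read path is described as producing.
fresh = np.full((4, 4), -9999.0, dtype=np.float32)
fresh[0, 0] = 1.0

expected = mask_nodata_before(fresh, -9999.0)  # leaves `fresh` untouched
result = mask_nodata_after(fresh, -9999.0)     # reuses the `fresh` buffer
```

Both variants produce the same values; the second returns the input buffer itself instead of a new one.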
Usage:
No user-facing change. Reads of TIFFs with nodata return the same array as before; allocated once instead of twice.
Value:
Removes a peak-memory doubler from the eager-read and dask-delayed-read paths. For large rasters this is the difference between fitting in RAM and OOM.
Stakeholders and Impacts
Stakeholders: anyone reading large TIFFs with sentinel nodata. Impact: lower peak memory, no behavioural change.
Drawbacks
None that I can see. Safety relies on the source allocations remaining uniquely owned; a regression test exercises the float-nodata read path.
Alternatives
Leave the copies in place. That costs a 2x peak-memory hit on every nodata-bearing read.
Additional Notes
Found during an efficiency audit. One fix per PR per project guidelines.