From f3837dd4c4e811c0e364a228799cf001d00a114f Mon Sep 17 00:00:00 2001 From: Brendan Collins Date: Wed, 27 May 2026 17:39:30 -0700 Subject: [PATCH 1/2] resample: preserve scalar non-dim coords and refresh stale nodata attr (#2542) The 2D non-identity path in `resample()` rebuilt the output `xr.DataArray` with only the spatial dim-coords, silently dropping any scalar non-dim coord on the input. rioxarray stores CRS metadata on a `spatial_ref` scalar coord, so chained `out.rio.crs` calls lost their CRS even though `attrs['crs']` round-tripped fine. The identity path (`scale_factor=1.0`, `agg.copy()`) and the 3D path (per-band `xr.concat`) both happened to preserve the coord, leaving the 2D path as the inconsistent outlier (Cat 5 backend / path-inconsistent metadata). Separately, `_resolve_nodata()` reads `attrs['nodata']` as a sentinel fallback after `_FillValue`, but the output post-processing only refreshed `_FillValue` and `nodatavals`. Inputs declaring `attrs['nodata'] = -9999` came back with that same finite value alongside data that was now NaN, mismatching any downstream consumer that trusts the attr (Cat 4 dtype/nodata semantics). Fixes both gaps in one place at the final `xr.DataArray` constructor: refresh `attrs['nodata']` to NaN when present, and fold any zero-dim non-dim coord into the output coords dict. Spatially-shaped non-dim coords (whose length is tied to the input resolution) intentionally do not carry; the user would need to resample them separately. Backend coverage: numpy / cupy / dask+numpy / dask+cupy all route through the same constructor and were verified end-to-end for the spatial_ref carry. Nodata-attr refresh was verified on numpy / cupy / dask+numpy; dask+cupy with a non-NaN nodata sentinel hits a pre-existing xr.where + cupy.astype quirk unrelated to this audit. 7 new tests in TestMetadataPropagation cover nodata-attr refresh, spatial_ref / scalar selector coord carry, identity-vs-downsample coord parity, and the intentional drop of spatially-shaped extras. Closes #2542 --- .claude/sweep-metadata-state.csv | 1 + xrspatial/resample.py | 31 ++++++++-- xrspatial/tests/test_resample.py | 103 +++++++++++++++++++++++++++++++ 3 files changed, 129 insertions(+), 6 deletions(-) diff --git a/.claude/sweep-metadata-state.csv b/.claude/sweep-metadata-state.csv index c6bdd0d9a..e9ee6f468 100644 --- a/.claude/sweep-metadata-state.csv +++ b/.claude/sweep-metadata-state.csv @@ -3,3 +3,4 @@ geotiff,2026-05-18,1909,HIGH,4;5,"Re-audit 2026-05-15 (agent-a55b69cec1ef2a092 w polygonize,2026-05-19,2149,MEDIUM,1,"Audited 2026-05-19 (agent-ad1070530d37a4fdf worktree, branch deep-sweep-metadata-polygonize-2026-05-19). Output is vector (column, polygon_points / GeoDataFrame / GeoJSON dict / awkward) so Cat 2/3 do not apply in the DataArray sense. Cat 1 MEDIUM finding #2149: GeoDataFrame output drops raster.attrs['crs'] (and crs_wkt and rioxarray rio.crs); GeoDataFrame.crs is always None even when input is georeferenced. Fix: new _detect_raster_crs helper + crs= kwarg threaded into _to_geopandas; df.set_crs is called when a CRS is detected. spatialpandas has no CRS slot and GeoJSON RFC 7946 is WGS84-only, so propagation lives only on the geopandas path. CRS propagation runs at the public API level so all 4 backends (numpy / cupy / dask+numpy / dask+cupy) propagate consistently -- verified end-to-end with EPSG:4326 attrs across all 4 backends. 8 new tests in TestPolygonizeCRSPropagation cover EPSG string/int, crs_wkt, no CRS, unparseable CRS, attrs-vs-rioxarray preference, rioxarray-only path, and simplify interaction. Cat 2 LOW (not fixed): output coords are pixel-space when input has georeferenced x/y or attrs['transform']; user must pass transform= explicitly. Documented behavior, leave as-is. Cat 4 LOW (not fixed): nodatavals from input attrs is not auto-applied as a mask; documented behavior (explicit mask= kwarg)." rasterize,2026-05-21,2251,HIGH,1,"rasterize() drops like.attrs, rebuilds like.coords via linspace (not bit-identical), and never emits _FillValue/nodatavals even when fill is non-NaN. Cat 1 HIGH: chained pipelines like slope(rasterize(gdf, like=elevation)) silently lose crs/res/transform. Cat 2 MEDIUM: linspace round-trip from re-derived bounds breaks xr.align with like. Cat 4 MEDIUM: rasterize(..., fill=-9999, dtype=int32) emits no _FillValue. All 4 backends share the same final return so the fix is one place. Fixed in deep-sweep-metadata-rasterize-2026-05-17-01 (worktree agent-ab7a9aee97c1e4cdf): _extract_grid_from_like now returns coords/attrs; rasterize() reuses like.coords directly when grid matches, copies like.attrs, and emits _FillValue + nodatavals when fill is not NaN. 9 new tests in TestMetadataPropagation cover attrs propagation, bit-identical coord reuse, fill-value emission, isolation from template attrs, and parity across numpy/cupy/dask+numpy/dask+cupy backends. Full test suite (193 passing) clean. | Re-audited 2026-05-21 (agent-a645dc07f847ae8ae worktree, branch deep-sweep-metadata-rasterize-2026-05-21). 4-backend (numpy/cupy/dask+numpy/dask+cupy) metadata parity reverified: all 4 backends route through the same final xr.DataArray constructor in rasterize(); crs / spatial_ref non-dim coord / coords / dims agree across backends. NEW HIGH finding #2251 (Cat 1): when rasterize(geoms, like=template, bounds=..., width=..., height=..., resolution=...) overrides the grid relative to like, the inherited attrs['transform'] and attrs['res'] from like are propagated unchanged so they describe the template's grid, not the actual output. get_dataarray_resolution() prefers attrs['res'] over calc_res from coords, so downstream slope/aspect/proximity see the wrong cellsize. Same class as #1407 sky_view_factor bug. Fix in rasterize(): out_attrs.pop('res') / out_attrs.pop('transform') when like_attrs is present but reuse_like_coords is False (output grid != template grid). Preserves crs / nodata triplet / spatial_ref handling. 9 new tests in TestLikeStaleGridAttrs2251 cover bounds override, width/height override, resolution override, matching width/height preserves attrs, get_dataarray_resolution consistency, and parity across all 4 backends. Full rasterize test suite (224 passed, 2 skipped) clean." reproject,2026-05-10,1572;1573,HIGH,1;3;4,geoid_height_raster dropped input attrs and used dims[-2:] for 3D inputs (#1572). reproject/merge ignored nodatavals (rasterio convention) when rioxarray absent (#1573). Fixed in same branch. +resample,2026-05-27,2542,MEDIUM,2;4;5,"Audited 2026-05-27 (agent-a8135a6a246ecb93c worktree, branch deep-sweep-metadata-resample-2026-05-27). Cat 2 MEDIUM + Cat 4 MEDIUM + Cat 5 MEDIUM all rolled into issue #2542. (a) 2D non-identity path dropped scalar non-dim coords like rioxarrays spatial_ref and squeezed time/band selectors; identity path (scale==1.0, agg.copy()) and 3D path (per-band xr.concat) preserved them, so the bug was path-inconsistent (Cat 5). (b) _resolve_nodata reads attrs[nodata] as a fallback sentinel but the output post-processing only refreshed _FillValue and nodatavals, leaving attrs[nodata]=-9999 alongside data that was now NaN. Fix in resample(): refresh attrs[nodata] to NaN whenever the input had it, and carry across zero-dim non-dim coords on the 2D non-identity path. 7 new tests in TestMetadataPropagation cover nodata-attr refresh, spatial_ref/scalar coord carry, identity-vs-downsample coord parity, and the explicit choice to drop spatially-shaped extra coords. 4-backend (numpy/cupy/dask+numpy/dask+cupy) parity verified for spatial_ref carry; nodata-attr refresh verified on numpy/cupy/dask+numpy (dask+cupy non-NaN nodata masking hits a pre-existing xarray xr.where + cupy.astype quirk unrelated to this audit). Full resample test suite (175 passed) clean." diff --git a/xrspatial/resample.py b/xrspatial/resample.py index b38b31947..6a36c3cb5 100644 --- a/xrspatial/resample.py +++ b/xrspatial/resample.py @@ -1384,22 +1384,41 @@ def _new_coords(vals, n_out): out_res_x, 0.0, left, 0.0, -out_res_y, top, ) - # Resample currently emits float32 with NaN as the missing-data - # sentinel regardless of input dtype. If the input declared a - # different sentinel via `_FillValue` or `nodatavals`, replace the - # value with NaN so the metadata matches the actual data. Leave the - # keys absent when the input did not have them. + # Resample replaces sentinel pixels with NaN regardless of input + # dtype. If the input declared a sentinel via `_FillValue`, + # `nodatavals`, or the rasterio-style `nodata` attr, refresh each + # one to NaN so the metadata matches the actual data. Leave the + # keys absent when the input did not have them. `_resolve_nodata` + # reads `nodata` as a fallback, so we must refresh it too -- a + # stale finite value here would silently mismatch the masked data + # on any downstream consumer that trusts `attrs['nodata']`. if '_FillValue' in agg.attrs: new_attrs['_FillValue'] = float('nan') if 'nodatavals' in agg.attrs: old = agg.attrs['nodatavals'] new_attrs['nodatavals'] = tuple(float('nan') for _ in old) + if 'nodata' in agg.attrs: + new_attrs['nodata'] = float('nan') + + # Carry across scalar (zero-dim) non-dim coords like rioxarray's + # `spatial_ref` or a squeezed `time` / `band` selector. The + # identity path (scale==1.0) preserves these via `agg.copy()`; + # the 2D non-identity path must match so chained rioxarray + # pipelines don't silently lose CRS / spatial_ref / scalar + # selector coords. Spatially-shaped non-dim coords (dims include + # ydim or xdim) are not carried because their length changed. + extra_coords = {} + for coord_name, coord in agg.coords.items(): + if coord_name in (ydim, xdim): + continue # spatial dim-coords are rebuilt above + if coord.ndim == 0: + extra_coords[coord_name] = coord result = xr.DataArray( result_data, name=name, dims=agg.dims, - coords={ydim: new_y, xdim: new_x}, + coords={ydim: new_y, xdim: new_x, **extra_coords}, attrs=new_attrs, ) diff --git a/xrspatial/tests/test_resample.py b/xrspatial/tests/test_resample.py index b456cd9c4..732f8d790 100644 --- a/xrspatial/tests/test_resample.py +++ b/xrspatial/tests/test_resample.py @@ -200,6 +200,109 @@ def test_nodatavals_replaced_with_nan(self): assert len(nv) == 1 assert np.isnan(nv[0]) + def test_nodata_attr_refreshed_to_nan(self): + # `_resolve_nodata` reads `attrs['nodata']` as a fallback sentinel. + # If we mask its pixels to NaN we must also refresh the attr so + # downstream consumers don't trust the now-stale finite value. + data = np.array([[-9999, -9999, 10, 10], + [-9999, -9999, 10, 10], + [20, 20, 30, 30], + [20, 20, 30, 30]], dtype=np.float32) + raster = xr.DataArray( + data, dims=('y', 'x'), + coords={'y': np.linspace(3, 0, 4), 'x': np.linspace(0, 3, 4)}, + attrs={'res': (1.0, 1.0), 'nodata': -9999}, + ) + out = resample(raster, scale_factor=0.5, method='nearest') + assert 'nodata' in out.attrs + assert np.isnan(out.attrs['nodata']) + # And the corresponding data pixel is NaN, matching the new sentinel. + assert np.isnan(out.values[0, 0]) + + def test_nodata_attr_absent_stays_absent(self): + data = np.arange(16, dtype=np.float32).reshape(4, 4) + raster = xr.DataArray( + data, dims=('y', 'x'), + coords={'y': np.linspace(3, 0, 4), 'x': np.linspace(0, 3, 4)}, + attrs={'res': (1.0, 1.0)}, + ) + out = resample(raster, scale_factor=0.5) + assert 'nodata' not in out.attrs + + def test_spatial_ref_coord_preserved(self): + # rioxarray stores CRS metadata on a scalar `spatial_ref` non-dim + # coord. Dropping it on resample silently breaks chained + # ``out.rio.crs`` calls even when ``attrs['crs']`` round-trips. + data = np.arange(16, dtype=np.float32).reshape(4, 4) + raster = xr.DataArray( + data, dims=('y', 'x'), + coords={ + 'y': np.linspace(3, 0, 4), + 'x': np.linspace(0, 3, 4), + 'spatial_ref': 0, + }, + attrs={'res': (1.0, 1.0), 'crs': 'EPSG:4326'}, + ) + out = resample(raster, scale_factor=0.5) + assert 'spatial_ref' in out.coords + assert int(out.coords['spatial_ref'].values) == 0 + + def test_scalar_band_coord_preserved(self): + # A squeezed scalar band/time selector should round-trip. + data = np.arange(16, dtype=np.float32).reshape(4, 4) + raster = xr.DataArray( + data, dims=('y', 'x'), + coords={ + 'y': np.linspace(3, 0, 4), + 'x': np.linspace(0, 3, 4), + 'band': 1, + 'time': np.datetime64('2024-01-01'), + }, + attrs={'res': (1.0, 1.0)}, + ) + out = resample(raster, scale_factor=0.5) + assert 'band' in out.coords + assert int(out.coords['band'].values) == 1 + assert 'time' in out.coords + + def test_identity_and_downsample_coords_consistent(self): + # Cat 5: identity path (scale=1.0, agg.copy()) and the 2D + # non-identity path must agree on which non-dim coords survive. + # Before #2542 only the identity path kept `spatial_ref`. + data = np.arange(16, dtype=np.float32).reshape(4, 4) + raster = xr.DataArray( + data, dims=('y', 'x'), + coords={ + 'y': np.linspace(3, 0, 4), + 'x': np.linspace(0, 3, 4), + 'spatial_ref': 0, + }, + attrs={'res': (1.0, 1.0)}, + ) + identity = resample(raster, scale_factor=1.0) + downsampled = resample(raster, scale_factor=0.5) + assert set(identity.coords) - {'y', 'x'} == set(downsampled.coords) - {'y', 'x'} + + def test_2d_spatial_extra_coord_not_carried(self): + # A non-dim coord whose dims include the spatial axes has a + # length tied to the input resolution; it must NOT be carried + # over after a resize. Skipping it cleanly is the right + # behaviour (the user would need to resample it separately). + data = np.arange(16, dtype=np.float32).reshape(4, 4) + mask = np.zeros((4, 4), dtype=np.float32) + raster = xr.DataArray( + data, dims=('y', 'x'), + coords={ + 'y': np.linspace(3, 0, 4), + 'x': np.linspace(0, 3, 4), + 'mask': (('y', 'x'), mask), + }, + attrs={'res': (1.0, 1.0)}, + ) + out = resample(raster, scale_factor=0.5) + # Spatial extra coords drop (would otherwise shape-mismatch the output). + assert 'mask' not in out.coords + def test_other_attrs_preserved(self): # crs, units, long_name, scales, offsets should round-trip. data = np.arange(16, dtype=np.float32).reshape(4, 4) From afd88707bd5a94b3823a3eb0a50d74f6906a12d0 Mon Sep 17 00:00:00 2001 From: Brendan Collins Date: Thu, 28 May 2026 07:27:47 -0700 Subject: [PATCH 2/2] resample tests: address review suggestions - Add test_spatial_ref_preserved_3d: pin the 3D per-band -> xr.concat path, which inherits the carry-across transitively via the per-band 2D resample() call. - Add test_spatial_ref_preserved_cupy / _dask: pin the 4-backend parity claim from the PR body so a backend-divergent regression would surface (the fix lives at the shared 2D xr.DataArray constructor, but the contract is now explicitly tested). - Drop redundant int() casts on scalar coord assertions; np scalar equality is one less coercion to read. 178 passed (was 175 + 3 new). --- xrspatial/tests/test_resample.py | 62 ++++++++++++++++++++++++++++++-- 1 file changed, 60 insertions(+), 2 deletions(-) diff --git a/xrspatial/tests/test_resample.py b/xrspatial/tests/test_resample.py index 732f8d790..09b434bd7 100644 --- a/xrspatial/tests/test_resample.py +++ b/xrspatial/tests/test_resample.py @@ -245,7 +245,7 @@ def test_spatial_ref_coord_preserved(self): ) out = resample(raster, scale_factor=0.5) assert 'spatial_ref' in out.coords - assert int(out.coords['spatial_ref'].values) == 0 + assert out.coords['spatial_ref'].values == 0 def test_scalar_band_coord_preserved(self): # A squeezed scalar band/time selector should round-trip. @@ -262,7 +262,7 @@ def test_scalar_band_coord_preserved(self): ) out = resample(raster, scale_factor=0.5) assert 'band' in out.coords - assert int(out.coords['band'].values) == 1 + assert out.coords['band'].values == 1 assert 'time' in out.coords def test_identity_and_downsample_coords_consistent(self): @@ -283,6 +283,64 @@ def test_identity_and_downsample_coords_consistent(self): downsampled = resample(raster, scale_factor=0.5) assert set(identity.coords) - {'y', 'x'} == set(downsampled.coords) - {'y', 'x'} + def test_spatial_ref_preserved_3d(self): + # The 3D per-band path hands each band to the 2D resample(), so + # spatial_ref should ride through xr.concat back onto the stacked + # result. + data = np.arange(3 * 4 * 4, dtype=np.float32).reshape(3, 4, 4) + raster = xr.DataArray( + data, dims=('band', 'y', 'x'), + coords={ + 'band': [1, 2, 3], + 'y': np.linspace(3, 0, 4), + 'x': np.linspace(0, 3, 4), + 'spatial_ref': 0, + }, + attrs={'res': (1.0, 1.0)}, + ) + out = resample(raster, scale_factor=0.5) + assert 'spatial_ref' in out.coords + assert out.coords['spatial_ref'].values == 0 + + @cuda_and_cupy_available + def test_spatial_ref_preserved_cupy(self): + # The fix lives at the shared 2D output constructor, so all four + # backends should preserve spatial_ref. Pin the cupy path + # explicitly so a backend-divergent regression would surface. + import cupy + data = cupy.arange(16, dtype=cupy.float32).reshape(4, 4) + raster = xr.DataArray( + data, dims=('y', 'x'), + coords={ + 'y': np.linspace(3, 0, 4), + 'x': np.linspace(0, 3, 4), + 'spatial_ref': 0, + }, + attrs={'res': (1.0, 1.0), 'crs': 'EPSG:4326'}, + ) + out = resample(raster, scale_factor=0.5) + assert 'spatial_ref' in out.coords + assert out.coords['spatial_ref'].values == 0 + + @dask_array_available + def test_spatial_ref_preserved_dask(self): + import dask.array as da + data = da.from_array( + np.arange(16, dtype=np.float32).reshape(4, 4), chunks=(2, 2), + ) + raster = xr.DataArray( + data, dims=('y', 'x'), + coords={ + 'y': np.linspace(3, 0, 4), + 'x': np.linspace(0, 3, 4), + 'spatial_ref': 0, + }, + attrs={'res': (1.0, 1.0), 'crs': 'EPSG:4326'}, + ) + out = resample(raster, scale_factor=0.5) + assert 'spatial_ref' in out.coords + assert out.coords['spatial_ref'].values == 0 + def test_2d_spatial_extra_coord_not_carried(self): # A non-dim coord whose dims include the spatial axes has a # length tied to the input resolution; it must NOT be carried