Describe the bug
Two related problems in `xrspatial/resample.py`'s dask aggregate paths (`_run_dask_numpy` and `_run_dask_cupy`):
- Boundary contamination. The aggregate dask path calls `dask.array.overlap.overlap` with `boundary='nearest'`. At the global edge of the input array, the overlap pad is filled with duplicated edge cells. Output pixels whose aggregate window straddles that edge then sample those duplicates, which biases min/max/median. Mean is less affected because the duplicates are real values, but they are still triple-counted near corners.
- Wasted/inconsistent cumulative bookkeeping. The aggregate path computes `global_in_h`, `cum_in_y`, `cum_in_x`, `out_y`, `out_x`, `cum_out_y`, `cum_out_x` once before `_ensure_min_chunksize` may rechunk to satisfy the depth requirement, then conditionally recomputes them when the rechunk changed the layout. The first computation is wasted, and the conditional recompute uses `data.chunks[0] != tuple(cum_in_y[1:] - cum_in_y[:-1])` as a roundabout chunk-equality check.
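The contamination can be reproduced standalone with plain dask (a toy illustration, not xrspatial code): with `boundary='nearest'`, the pad at the global edge repeats edge cells, and the median over a corner window shifts away from what an eager window would give.

```python
import numpy as np
import dask.array as da
from dask.array.overlap import overlap

# With boundary='nearest' the pad at the global edge repeats the edge
# cells, so the corner of the padded block contains duplicates.
x = da.arange(16, dtype="f8").reshape(4, 4).rechunk(2)
block = overlap(x, depth=1, boundary="nearest").blocks[0, 0].compute()

corner = block[:3, :3]             # padded window over the corner pixel
true_window = x[:2, :2].compute()  # cells the eager path would sample
# the duplicated zeros drag the median down: 1.0 instead of 2.5
assert np.median(corner) == 1.0
assert np.median(true_window) == 2.5
```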
Expected behavior
Aggregate dask results should match eager numpy bit-identically for min/max/median (same kernel, no boundary padding bias). The bookkeeping should compute once.
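The matching behavior can be sketched standalone (in the spirit of the NaN-skipping kernels described in the fix below, not xrspatial's actual kernel code): pad with NaN and reduce with a NaN-skipping median, so pad cells drop out and the corner window reduces to exactly the eager value.

```python
import numpy as np
import dask.array as da
from dask.array.overlap import overlap

# NaN-skipping reduction: pad cells are ignored, empty windows give NaN.
def nanskip_median(window):
    vals = [v for v in window.ravel() if not np.isnan(v)]
    return np.nan if not vals else float(np.median(vals))

x = da.arange(16, dtype="f8").reshape(4, 4).rechunk(2)
block = overlap(x, depth=1, boundary=np.nan).blocks[0, 0].compute()

assert np.isnan(block[0, 0])                 # pad is NaN, not a duplicate
assert nanskip_median(block[:3, :3]) == 2.5  # matches eager numpy on x[:2, :2]
```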
Fix
- Use `boundary=np.nan` on the aggregate overlap. The aggregate kernels already skip NaN via `if not np.isnan(v)` and return NaN for empty windows, so padded NaN cells are ignored naturally.
- Compute `min_size` from the scale-driven minimum and the depth-driven `max(2*depth_y+1, 2*depth_x+1)` up front, call `_ensure_min_chunksize` once, then build the cumulative arrays once.
- Leave the interp dask path on `boundary='nearest'` so it stays consistent with scipy's `mode='nearest'` semantics that the eager numpy interp path uses.
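The compute-once ordering can be sketched as follows. The values of `depth_y`, `depth_x`, and the scale-driven minimum are made-up stand-ins, and a plain `rechunk` stands in for `_ensure_min_chunksize` (the sizes below divide evenly, so no runt chunks appear).

```python
import numpy as np
import dask.array as da

# Derive min_size up front from both constraints, rechunk once, then
# build the cumulative chunk-offset arrays exactly once.
depth_y, depth_x, scale_min = 2, 3, 4
min_size = max(scale_min, 2 * depth_y + 1, 2 * depth_x + 1)

data = da.zeros((98, 105), chunks=5)
if min(min(data.chunks[0]), min(data.chunks[1])) < min_size:
    data = data.rechunk(min_size)  # stand-in for _ensure_min_chunksize

# cumulative row/column offsets, built once, after any rechunking
cum_in_y = np.cumsum((0,) + data.chunks[0])
cum_in_x = np.cumsum((0,) + data.chunks[1])
```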