Two small performance items in xrspatial/reproject/__init__.py.
Batch the four CuPy .get() calls per chunk
_reproject_chunk_cupy (around lines 357-364) does four sequential .get() calls to bring nanmin/nanmax of the row/col pixel arrays back to host:
    r_min_val = float(cp.nanmin(src_row_px).get())
    if not np.isfinite(r_min_val):
        return cp.full(chunk_shape, nodata, dtype=cp.float64)
    r_max_val = float(cp.nanmax(src_row_px).get())
    c_min_val = float(cp.nanmin(src_col_px).get())
    c_max_val = float(cp.nanmax(src_col_px).get())
Each .get() is a synchronous device-to-host transfer: the host blocks until the reduction has finished and the scalar has been copied back, so four calls mean four pipeline stalls. Stacking the four reductions into a single 4-element CuPy array and pulling that across in one .get() cuts the round-trips from four to one per chunk. The finite checks then run on host scalars at negligible cost.
The same pattern repeats in _reproject_dask_cupy around lines 1122-1128.
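A minimal sketch of the batched version, not the actual patch. The helper name pixel_bounds and its xp parameter are hypothetical and exist only for illustration; the sketch assumes cp.stack accepts the 0-d arrays returned by cp.nanmin/cp.nanmax, as np.stack does. It defaults to NumPy so it runs without a GPU; passing xp=cp would exercise the CuPy path.

```python
import numpy as np


def pixel_bounds(src_row_px, src_col_px, xp=np):
    """Return (r_min, r_max, c_min, c_max) as host floats.

    Hypothetical helper: xp is the array module (numpy or cupy).
    """
    # Run all four reductions on the device and pack the results
    # into one 4-element array.
    stacked = xp.stack([
        xp.nanmin(src_row_px),
        xp.nanmax(src_row_px),
        xp.nanmin(src_col_px),
        xp.nanmax(src_col_px),
    ])
    # One device-to-host transfer for CuPy; NumPy arrays are already on host.
    host = stacked.get() if hasattr(stacked, "get") else np.asarray(stacked)
    return tuple(float(v) for v in host)


rows = np.array([[1.0, np.nan], [3.5, 2.0]])
cols = np.array([[np.nan, 7.0], [0.5, 4.0]])
r_min_val, r_max_val, c_min_val, c_max_val = pixel_bounds(rows, cols)

# The finite check from the original code now runs on a host scalar.
needs_nodata_fill = not np.isfinite(r_min_val)
```

One trade-off to keep in mind: the original code returns early after the first reduction when the chunk is entirely NaN, while the batched form always launches all four reductions before checking. That is likely a good exchange, since the reductions are cheap relative to a synchronization, but it is worth confirming on a benchmark.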
Drop redundant .copy() after .astype()
numpy.ndarray.astype() and cupy.ndarray.astype() both default to copy=True, so they always return a new array. The follow-up .copy() in:
_reproject_chunk_numpy multi-band path (line ~290)
_reproject_chunk_numpy single-band path (line ~305)
_reproject_chunk_cupy (line ~443)
_reproject_dask_cupy (line ~1193)
is therefore redundant and can be removed. No correctness change; one fewer array allocation per chunk.
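The copy semantics can be checked with a few lines of NumPy. The array name window is made up for this illustration; the check uses NumPy only so that it runs without a GPU.

```python
import numpy as np

window = np.arange(6, dtype=np.float32).reshape(2, 3)

# astype() with the default copy=True allocates a new array,
# whether or not the dtype actually changes.
converted = window.astype(np.float64)
same_dtype = window.astype(np.float32)

print(np.shares_memory(window, converted))   # False
print(np.shares_memory(window, same_dtype))  # False

# Writing to the result leaves the source untouched, so a trailing
# .copy() adds nothing except another allocation.
converted[0, 0] = -1.0
print(window[0, 0])  # 0.0
```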
Impact
For an N-chunk reprojection on GPU, the batching removes roughly 3 * N synchronous device-to-host transfers. The .copy() removal saves one window-sized allocation per chunk. Existing parity tests cover correctness.