Reduce CuPy host round-trips and remove redundant copies in reproject #1457

@brendancol

Description

Two small performance items in xrspatial/reproject/__init__.py.

Batch the four CuPy .get() calls per chunk

_reproject_chunk_cupy (around lines 357-364) issues four sequential .get() calls to bring the nanmin/nanmax of the row/col pixel arrays back to the host:

r_min_val = float(cp.nanmin(src_row_px).get())
if not np.isfinite(r_min_val):
    return cp.full(chunk_shape, nodata, dtype=cp.float64)
r_max_val = float(cp.nanmax(src_row_px).get())
c_min_val = float(cp.nanmin(src_col_px).get())
c_max_val = float(cp.nanmax(src_col_px).get())

Each .get() is a synchronous device-to-host transfer that stalls the GPU pipeline. Stacking the four reductions into a single 4-element CuPy array and pulling it across in one .get() cuts the round-trips from four to one per chunk. The finite checks then run on host scalars, which costs nothing extra.

The same pattern repeats in _reproject_dask_cupy around lines 1122-1128.
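A minimal sketch of the batched version, assuming src_row_px and src_col_px are the CuPy arrays from the snippet above (the helper name _src_window_bounds is made up for illustration):

import cupy as cp
import numpy as np

def _src_window_bounds(src_row_px, src_col_px):
    # Stack all four reductions into one 4-element device array so a
    # single .get() moves everything to the host at once.
    bounds = cp.stack([
        cp.nanmin(src_row_px),
        cp.nanmax(src_row_px),
        cp.nanmin(src_col_px),
        cp.nanmax(src_col_px),
    ])
    r_min_val, r_max_val, c_min_val, c_max_val = (float(v) for v in bounds.get())
    return r_min_val, r_max_val, c_min_val, c_max_val

The call site keeps the existing early return; the finite check now runs on a host float: if not np.isfinite(r_min_val): return cp.full(chunk_shape, nodata, dtype=cp.float64).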

Drop redundant .copy() after .astype()

numpy.ndarray.astype() and cupy.ndarray.astype() both default to copy=True, so they always return a new array. The follow-up .copy() in:

  • _reproject_chunk_numpy multi-band path (line ~290)
  • _reproject_chunk_numpy single-band path (line ~305)
  • _reproject_chunk_cupy (line ~443)
  • _reproject_dask_cupy (line ~1193)

is therefore redundant and can be removed. No correctness change; one fewer array allocation per chunk.
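A quick host-side check of the default copy semantics (the same contract holds for cupy.ndarray.astype):

import numpy as np

a = np.arange(4, dtype=np.float32)
b = a.astype(np.float64)  # copy=True is the default
assert not np.shares_memory(a, b)  # astype already returned a fresh array
# A trailing b.copy() would allocate a third array for nothing.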

Impact

For an N-chunk reprojection on GPU, the batching eliminates roughly 3 * N synchronous device-to-host transfers. The .copy() removal saves one window-sized allocation per chunk. Existing parity tests cover correctness.
