Batch GDS-fallback per-tile GPU to host transfer #1552

@brendancol

Description

Reason or Problem

gpu_decode_tiles_from_file in xrspatial/geotiff/_gpu_decode.py has a fallback path that fires when kvikio successfully reads compressed tiles into GPU memory but nvCOMP cannot decompress them on-device (LZW, or any codec other than ZSTD/deflate). On that path the tiles have to come back to host memory so the CPU decoder can run.

Today (line 1491) that copy is a per-tile loop:

compressed_tiles = [t.get().tobytes() for t in d_tiles]

Each .get() is its own device-to-host copy on the default stream, so the copies serialise. With many tiles, the per-DMA setup overhead dominates the wall time, not the bytes actually moved.

Proposal

The same module already has the batched D2H pattern. In the deflate path (lines 2317-2330) the code concatenates per-tile cupy buffers into one contiguous device buffer, runs a single .get(), then slices the host bytes by tile offsets. The LZW fallback should do the same thing:

sizes = [int(t.size) for t in d_tiles]
offsets = np.concatenate(([0], np.cumsum(sizes))).astype(np.int64)
combined = cupy.concatenate(d_tiles)
host_buf = combined.get()
compressed_tiles = [
    bytes(host_buf[offsets[i]:offsets[i + 1]])
    for i in range(len(d_tiles))
]

d_tiles is a list of 1-D cupy.uint8 arrays (see _try_kvikio_read_tiles, line 870), so axis-0 concatenation works.

Design: One concat kernel plus one D2H DMA replaces N kernels and N DMAs. The host slice still hands back the same list[bytes] gpu_decode_tiles expects, so nothing downstream changes.
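The offset/slice bookkeeping above can be sanity-checked on the host alone. A minimal sketch, using numpy arrays as stand-ins for the cupy device tiles (tile sizes here are illustrative, not from the issue), confirming the batched path hands back the exact same list[bytes] as the per-tile loop:

```python
import numpy as np

# Stand-ins for d_tiles: a list of 1-D uint8 arrays of varying size.
rng = np.random.default_rng(0)
tiles = [rng.integers(0, 256, size=n, dtype=np.uint8) for n in (5, 13, 1, 8)]

# Per-tile path (what the current loop at line 1491 produces).
per_tile = [t.tobytes() for t in tiles]

# Batched path: one concatenation, one copy, then host-side slicing.
sizes = [int(t.size) for t in tiles]
offsets = np.concatenate(([0], np.cumsum(sizes))).astype(np.int64)
host_buf = np.concatenate(tiles)  # on GPU: cupy.concatenate(d_tiles).get()
batched = [
    bytes(host_buf[offsets[i]:offsets[i + 1]])
    for i in range(len(tiles))
]

assert batched == per_tile
```

Since the slicing happens entirely on the host after the single .get(), the only GPU-side difference from this sketch is that np.concatenate becomes cupy.concatenate.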

Usage: No public API change. Purely internal to the GDS fallback branch.

Value: Expected ~1.66x on this hop, matching the speedup quoted in the H2D batched-upload comment a few hundred lines down. Bigger COGs with lots of small LZW tiles benefit the most.

Stakeholders and Impacts

Users reading LZW-compressed COGs from NVMe with kvikio installed. The ZSTD-direct, nvCOMP, and non-GDS paths are not touched.

Drawbacks

cupy.concatenate allocates a device buffer the size of all compressed tiles. For LZW that's small compared to decompressed size, but worth flagging.

Alternatives

Pinned host buffer + async copies would also work. Concatenate is simpler and matches the existing pattern in the same file.

Metadata

Assignees

No one assigned

    Labels

    enhancement: New feature or request
    gpu: CuPy / CUDA GPU support
    performance: PR touches performance-sensitive code
