Batch GDS-fallback per-tile GPU to host transfer #1552

@brendancol

Description

Reason or Problem

gpu_decode_tiles_from_file in xrspatial/geotiff/_gpu_decode.py has a fallback path that fires when kvikio successfully reads compressed tiles into GPU memory but nvCOMP cannot decompress them on-device (LZW, or any codec other than ZSTD/deflate). On that path the tiles have to come back to host memory so the CPU decoder can run.

Today (line 1491) that copy is a per-tile loop:

compressed_tiles = [t.get().tobytes() for t in d_tiles]

Each .get() is its own device-to-host copy on the default stream, so the copies serialise. With many tiles, the per-DMA setup overhead dominates the wall time, not the bytes actually moved.

Proposal

The same module already has the batched D2H pattern. In the deflate path (lines 2317-2330) the code concatenates per-tile cupy buffers into one contiguous device buffer, runs a single .get(), then slices the host bytes by tile offsets. The LZW fallback should do the same thing:

sizes = [int(t.size) for t in d_tiles]
offsets = np.concatenate(([0], np.cumsum(sizes))).astype(np.int64)
combined = cupy.concatenate(d_tiles)
host_buf = combined.get()
compressed_tiles = [
    bytes(host_buf[offsets[i]:offsets[i + 1]])
    for i in range(len(d_tiles))
]

d_tiles is a list of 1-D cupy.uint8 arrays (see _try_kvikio_read_tiles, line 870), so axis-0 concatenation works.

Design: One concat kernel plus one D2H DMA replaces N kernels and N DMAs. The host slice still hands back the same list[bytes] gpu_decode_tiles expects, so nothing downstream changes.
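The offset/slice bookkeeping above can be sanity-checked on the host alone. A minimal sketch, using numpy arrays as stand-ins for the cupy device tiles (tile sizes here are illustrative, not from the issue), confirming the batched path hands back the exact same list[bytes] as the per-tile loop:

```python
import numpy as np

# Stand-ins for d_tiles: a list of 1-D uint8 arrays of varying size.
rng = np.random.default_rng(0)
tiles = [rng.integers(0, 256, size=n, dtype=np.uint8) for n in (5, 13, 1, 8)]

# Per-tile path (what the current loop at line 1491 produces).
per_tile = [t.tobytes() for t in tiles]

# Batched path: one concatenation, one copy, then host-side slicing.
sizes = [int(t.size) for t in tiles]
offsets = np.concatenate(([0], np.cumsum(sizes))).astype(np.int64)
host_buf = np.concatenate(tiles)  # on GPU: cupy.concatenate(d_tiles).get()
batched = [
    bytes(host_buf[offsets[i]:offsets[i + 1]])
    for i in range(len(tiles))
]

assert batched == per_tile
```

Since the slicing happens entirely on the host after the single .get(), the only GPU-side difference from this sketch is that np.concatenate becomes cupy.concatenate.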

Usage: No public API change. Purely internal to the GDS fallback branch.

Value: Expected ~1.66x on this hop, matching the speedup quoted in the H2D batched-upload comment a few hundred lines down. Bigger COGs with lots of small LZW tiles benefit the most.

Stakeholders and Impacts

Users reading LZW-compressed COGs from NVMe with kvikio installed. The ZSTD-direct, nvCOMP, and non-GDS paths are not touched.

Drawbacks

cupy.concatenate allocates a device buffer the size of all compressed tiles. For LZW that's small compared to decompressed size, but worth flagging.

Alternatives

Pinned host buffer + async copies would also work. Concatenate is simpler and matches the existing pattern in the same file.

Metadata

Assignees

No one assigned

    Labels

    enhancement: New feature or request
    gpu: CuPy / CUDA GPU support
    performance: PR touches performance-sensitive code
