GPU JPEG-tiled 3-band read crashes with cudaErrorIllegalAddress and poisons CUDA context #1549

@brendancol

Bug

open_geotiff(path, gpu=True) and read_geotiff_gpu(path) raise cupy.cuda.runtime.CUDARuntimeError: cudaErrorIllegalAddress on a tiled JPEG TIFF with PhotometricInterpretation=RGB (3-band, samples=3) using JPEGTables (TIFF tag 347). The CPU path reads the same file fine. Single-band JPEG-tiled (photometric='minisblack', samples=1) also works on GPU. Only the 3-band JPEGTables splice path crashes.
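To triage whether a given file has the layout described above, the first IFD can be inspected without touching any GPU code. The following is a minimal standard-library sketch, not part of xrspatial: the function names are hypothetical, it handles only classic (non-BigTIFF) files, and it looks only at the first IFD. The tag numbers (259 Compression, 262 PhotometricInterpretation, 277 SamplesPerPixel, 322 TileWidth, 347 JPEGTables) come from the TIFF specification.

```python
import struct

# Tag numbers from the TIFF 6.0 specification and its JPEG technical note.
COMPRESSION, PHOTOMETRIC, SAMPLES_PER_PIXEL = 259, 262, 277
TILE_WIDTH, JPEG_TABLES = 322, 347


def read_first_ifd_tags(data: bytes) -> dict:
    """Return {tag: value} for the first IFD of a classic TIFF.

    Only single SHORT/LONG values are decoded; any other entry is
    recorded as None, which is enough to test for its presence.
    """
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None:
        raise ValueError("missing TIFF byte-order mark")
    magic, ifd_offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("this sketch handles classic TIFF (magic 42) only")
    (n_entries,) = struct.unpack(order + "H", data[ifd_offset:ifd_offset + 2])
    tags = {}
    for i in range(n_entries):
        start = ifd_offset + 2 + 12 * i  # each IFD entry is 12 bytes
        tag, ftype, count = struct.unpack(order + "HHI", data[start:start + 8])
        value_field = data[start + 8:start + 12]
        if count == 1 and ftype == 3:    # SHORT stored inline
            tags[tag] = struct.unpack(order + "H", value_field[:2])[0]
        elif count == 1 and ftype == 4:  # LONG stored inline
            tags[tag] = struct.unpack(order + "I", value_field)[0]
        else:
            tags[tag] = None             # present, value not decoded here
    return tags


def matches_reported_layout(tags: dict) -> bool:
    """True for tiled, JPEG-compressed, 3-sample RGB with JPEGTables."""
    return (
        tags.get(COMPRESSION) == 7       # 7 = JPEG (new-style)
        and tags.get(PHOTOMETRIC) == 2   # 2 = RGB
        and tags.get(SAMPLES_PER_PIXEL) == 3
        and TILE_WIDTH in tags
        and JPEG_TABLES in tags
    )
```

For a file on disk, `matches_reported_layout(read_first_ifd_tags(open(path, 'rb').read()))` reports whether tag 347 is actually present alongside the RGB tiling, which is worth confirming before attributing the crash to the JPEGTables splice.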

After the crash the CUDA context is poisoned. Any later GPU call in the same Python process fails with the same illegal-address error, even on an unrelated file.
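The poisoned state can be confirmed with a tiny probe that is independent of any TIFF file. This is a hedged sketch, not something xrspatial provides: `cuda_context_usable` is a hypothetical helper that attempts a synchronize plus a small allocation and reduction through CuPy. Because `cudaErrorIllegalAddress` is a sticky CUDA error, the probe would be expected to keep returning `False` until the process exits.

```python
from typing import Optional


def cuda_context_usable() -> Optional[bool]:
    """Probe the current CUDA context with a trivial workload.

    Returns None when CuPy cannot be imported, True when the probe
    succeeds, and False when any step of the probe raises.
    """
    try:
        import cupy as cp
    except ImportError:
        return None
    try:
        cp.cuda.Device().synchronize()  # surfaces pending sticky errors
        probe = cp.arange(4)            # small device allocation
        return int(probe.sum()) == 6    # small kernel plus copy to host
    except Exception:
        # Any failure here means the GPU is not usable from this process.
        return False
```

Calling it once before and once after the failing `open_geotiff(path, gpu=True)` call separates "the decode failed" from "the decode failed and took the context with it".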

Reproducer

import numpy as np
import tifffile
from xrspatial.geotiff import open_geotiff

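np.random.seed(0)
arr = np.random.randint(0, 255, (256, 256, 3), dtype=np.uint8)
path = '/tmp/jpeg_rgb.tif'
# 3-band RGB, 128x128 tiles, JPEG-compressed
tifffile.imwrite(path, arr, photometric='rgb', tile=(128, 128), compression='jpeg')

# CPU path: reads the file without error
cpu = open_geotiff(path)
print('CPU OK:', cpu.shape, cpu.dtype)

# GPU path: raises CUDARuntimeError (cudaErrorIllegalAddress)
gpu = open_geotiff(path, gpu=True)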

Observed:

CPU OK: (256, 256, 3) uint8
RuntimeWarning: read_geotiff_gpu: GPU decode failed (CUDARuntimeError:
cudaErrorIllegalAddress: an illegal memory access was encountered);
falling back to CPU.
...
File "cupy/cuda/pinned_memory.pyx", line 309, in PinnedMemoryPool.malloc
File "cupy_backends/cuda/api/runtime.pyx", line 584, in hostAlloc
File "cupy_backends/cuda/api/runtime.pyx", line 146, in check_status
cupy_backends.cuda.api.runtime.CUDARuntimeError:
cudaErrorIllegalAddress: an illegal memory access was encountered

The auto-mode wrapper at xrspatial/geotiff/__init__.py:372 catches the first failure and warns about CPU fallback, but the call with explicit gpu=True then re-raises during pinned-memory allocation. The pinned allocator failing on its first call after the JPEG decode is the signature of a corrupted CUDA context, not a fresh allocation problem.

Expected

Either decode the file, or raise a RuntimeError that the auto-mode wrapper can catch cleanly. The CUDA context must remain usable so that follow-up GPU calls (and the auto-fallback path) keep working.
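The contract described above can be sketched as follows. This is an illustration only, not the code in `xrspatial/geotiff/__init__.py`: `read_geotiff_sketch`, its `mode` parameter, and the injected `gpu_reader` / `cpu_reader` callables are hypothetical names. It also assumes the GPU reader fails before launching a faulting kernel (for example by validating tile layout on the host), since no wrapper can repair a context that is already poisoned.

```python
import warnings


def read_geotiff_sketch(path, gpu_reader, cpu_reader, mode="auto"):
    """Dispatch a read with a clean, catchable failure contract.

    mode="auto": try the GPU reader, warn and fall back to CPU on failure.
    mode="gpu":  try the GPU reader, re-raise failures as RuntimeError.
    """
    try:
        return gpu_reader(path)
    except Exception as exc:
        if mode == "gpu":
            # Explicit GPU request: surface one well-defined error type,
            # keeping the original exception as the cause.
            raise RuntimeError(f"GPU decode failed for {path!r}") from exc
        warnings.warn(
            f"GPU decode failed ({type(exc).__name__}: {exc}); "
            "falling back to CPU.",
            RuntimeWarning,
            stacklevel=2,
        )
        return cpu_reader(path)
```

The point of the design is that callers only ever see a result or a `RuntimeError`, and the fallback path never depends on GPU state left behind by the failed attempt.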

Impact

A single file with this layout disables GPU support for the rest of the Python process. Worker processes in a dask cluster die. The auto-mode fallback becomes unreliable: the warning fires, but downstream GPU work still crashes.
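Until the decode itself is fixed, one possible stopgap is to run a suspect GPU read in a child interpreter, so that a poisoned context dies with the child instead of the worker. This is a sketch of that idea and not something the issue proposes: `run_isolated` is a hypothetical helper, and it reports only success or failure rather than returning the decoded array.

```python
import subprocess
import sys
from typing import Tuple


def run_isolated(snippet: str, timeout: float = 120.0) -> Tuple[bool, str]:
    """Run a Python snippet in a fresh interpreter.

    Returns (succeeded, stderr_text). A CUDA context created by the
    snippet belongs to the child process and is discarded when it exits.
    """
    proc = subprocess.run(
        [sys.executable, "-c", snippet],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode == 0, proc.stderr
```

For the reproducer above, the snippet could be `"from xrspatial.geotiff import open_geotiff; open_geotiff('/tmp/jpeg_rgb.tif', gpu=True)"`; a `False` result tells the parent to use the CPU path for that file while its own GPU context stays intact.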

Audit pass

Found during the geotiff backend parity sweep on 2026-05-09. Reproduced cleanly on this host with tifffile==2024.9.20, cupy-cuda12x, Python 3.14.

Related: #1520 (added JPEGTables splice for tiled JPEG read on the CPU side).
