Merged
2 changes: 1 addition & 1 deletion .claude/sweep-performance-state.csv
@@ -18,7 +18,7 @@ fire,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,
flood,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,
focal,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,
geodesic,2026-03-31T18:00:00Z,N/A,compute-bound,0,,
geotiff,2026-05-12,SAFE,IO-bound,1,1688,"Pass 4 (2026-05-10): re-audit after #1559 (centralise attrs across all read backends). New _populate_attrs_from_geo_info helper at __init__.py:301 runs once per read, not per-chunk -- no perf impact. Probe: 2560x2560 deflate-tiled file opened via read_geotiff_dask yields 400 tasks (4 tasks/chunk for 100 chunks), well under 1M cap. read_geotiff_gpu(1024x1024) returns cupy.ndarray end-to-end with no host round-trip (226ms incl. write+decode). No new HIGH/MEDIUM findings. SAFE/IO-bound holds. | Pass 3 (2026-05-10): SAFE/IO-bound. Audited 4 perf commits: #1558 (in-place NaN writes on uniquely-owned buffers correct), #1556 (fp-predictor ngjit ~297us/tile for 256x256 float32), #1552 (single cupy.concatenate + one .get() for batched D2H at _gpu_decode.py:870-913), #1551 (parallel decode threshold >=65536px engages 256x256 default at _reader.py:1121). Bench: 8192x8192 f32 deflate+pred2 256-tile write 782ms; 4096x4096 f32 deflate read 83ms with parallel decode. Deferred LOW (none filed, all <10% MEDIUM threshold): _writer.py:459/1109 redundant .copy() before predictor encode (~1% per tile), _compression.py:280 lzw_decompress dst[:n].copy() (~2% per LZW tile decode), _writer.py:1419 seg_np.copy() before in-place NaN substitution (negligible, conditional path), _CloudSource.read_range opens fresh fsspec handle per range (pre-existing, predates audit scope). nvCOMP per-tile D2H batching break-even confirmed (variable sizes need staging buffer, no win). | Pass 3 (2026-05-10): audited f157746,39322c3,f23ec8f,1aac3b7. All 5 commits correct. Redundant .copy() in _writer.py:459,1109 and _compression.py:280 (1-2% overhead, LOW). _CloudSource.read_range() per-call open is pre-existing arch issue. No HIGH/MEDIUM regressions. SAFE. | re-audit 2026-05-02: 6 commits since 2026-04-16 (predictor=3 CPU encode/decode, GPU predictor stride fix, validate_tile_layout, BigTIFF LONG8 offsets, AREA_OR_POINT VRT, per-tile alloc guard). 1M dask chunk cap intact at __init__.py:948; adler32 batch transfer intact at _gpu_decode.py:1825. New code is metadata validation and dispatcher logic with no extra materialization or per-tile sync points. No HIGH/MEDIUM regressions. | Pass 5 (2026-05-12): re-audit identified MEDIUM in _gpu_decode.py:1577 _try_nvcomp_from_device_bufs: per-tile cupy.empty + trailing cupy.concatenate doubled peak VRAM and added serial concat. Filed #1659 and fixed to single-buffer + pointer offsets (matches LZW/deflate/host-buffer patterns at L1847/L1878/L1114). Microbench (alloc+concat overhead only, not full nvCOMP latency): n=256 tile_bytes=65536 drops 3.66ms->0.69ms, n=256 tile_bytes=262144 drops 8.18ms->0.13ms. Tests: 5 new tests in test_nvcomp_from_device_bufs_single_alloc_1659.py (codec short-circuit, no-lib short-circuit, memory-guard contract, real ZSTD round-trip via nvCOMP, structural single-buffer check). 1458 existing geotiff tests pass, 3 unrelated matplotlib/py3.14 failures pre-existing. SAFE/IO-bound verdict holds. | Pass 6 (2026-05-12): re-audit on top of #1659. New HIGH in _try_kvikio_read_tiles at _gpu_decode.py:941: per-tile cupy.empty() + blocking IOFuture.get() inside loop serialised GDS reads to ~1 outstanding pread, missed parallelism the kvikio worker pool was designed for, paid per-tile cupy.empty setup (matches #1659 anti-pattern in nvCOMP path), and lacked _check_gpu_memory guard. Filed #1688 and fixed to single contiguous buffer + batched submit + guard. Microbench with 8-worker pool simulation: 256 tiles@1ms latency drops 256ms->38.7ms (~6.6x); single-thread simulation 256ms->28.5ms (9x). Tests: 9 new tests in test_kvikio_batched_pread_1688.py (kvikio-absent path, single-buffer pointer arithmetic, submit-before-get ordering, memory guard, partial-read fallback, round-trip data, zero-size/all-sparse tiles). All 1577 geotiff tests pass except pre-existing matplotlib/py3.14 failures."
geotiff,2026-05-12,SAFE,IO-bound,0,1713,"Pass 7 (2026-05-12): re-audit identified 4 MEDIUM findings, all real, all backed by microbenches. (1) unpack_bits sub-byte loops for bps=2/4/12 in _compression.py:836-878 were 100-200x slower than vectorised numpy (filed #1713, fixed in this branch: bps=4 2M pixels drops from 165ms to 3ms = 55x; bps=2/12 similar). (2) _write_vrt_tiled at __init__.py:1708 uses scheduler='synchronous' on independent tile writes; measured 33% slowdown on 256-tile zstd write vs threads scheduler (filed #1714, no fix yet). (3) _nvcomp_batch_compress at _gpu_decode.py:2522-2526 still does per-tile cupy.get().tobytes() despite #1552 / #1659 fixing the same pattern elsewhere; measured 45% reduction with concat+single get on n=1024 (filed #1712, no fix yet). (4) _nvcomp_batch_compress at _gpu_decode.py:2457 uses per-tile cupy.empty allocations; 1024 tiles 16KB drops from 4.7ms to 1.0ms with single contiguous + views (bundled into #1712). Cat 6 OOM verdict: SAFE/IO-bound holds -- read_geotiff_dask caps task count at _MAX_DASK_CHUNKS=50_000 and per-chunk memory is bounded by chunk size. _inflate_tiles_kernel resource usage on Ampere: 67 regs/thread, 2896B local/thread, 8192B shared/block (LZW kernel: 29 regs, 24576B shared) -- register pressure under control; high local memory in inflate is unavoidable (LZ77 state) but only thread 0 in each block uses it. | Pass 4 (2026-05-10): re-audit after #1559 (centralise attrs across all read backends). New _populate_attrs_from_geo_info helper at __init__.py:301 runs once per read, not per-chunk -- no perf impact. Probe: 2560x2560 deflate-tiled file opened via read_geotiff_dask yields 400 tasks (4 tasks/chunk for 100 chunks), well under 1M cap. read_geotiff_gpu(1024x1024) returns cupy.ndarray end-to-end with no host round-trip (226ms incl. write+decode). No new HIGH/MEDIUM findings. SAFE/IO-bound holds. | Pass 3 (2026-05-10): SAFE/IO-bound. Audited 4 perf commits: #1558 (in-place NaN writes on uniquely-owned buffers correct), #1556 (fp-predictor ngjit ~297us/tile for 256x256 float32), #1552 (single cupy.concatenate + one .get() for batched D2H at _gpu_decode.py:870-913), #1551 (parallel decode threshold >=65536px engages 256x256 default at _reader.py:1121). Bench: 8192x8192 f32 deflate+pred2 256-tile write 782ms; 4096x4096 f32 deflate read 83ms with parallel decode. Deferred LOW (none filed, all <10% MEDIUM threshold): _writer.py:459/1109 redundant .copy() before predictor encode (~1% per tile), _compression.py:280 lzw_decompress dst[:n].copy() (~2% per LZW tile decode), _writer.py:1419 seg_np.copy() before in-place NaN substitution (negligible, conditional path), _CloudSource.read_range opens fresh fsspec handle per range (pre-existing, predates audit scope). nvCOMP per-tile D2H batching break-even confirmed (variable sizes need staging buffer, no win). | Pass 3 (2026-05-10): audited f157746,39322c3,f23ec8f,1aac3b7. All 5 commits correct. Redundant .copy() in _writer.py:459,1109 and _compression.py:280 (1-2% overhead, LOW). _CloudSource.read_range() per-call open is pre-existing arch issue. No HIGH/MEDIUM regressions. SAFE. | re-audit 2026-05-02: 6 commits since 2026-04-16 (predictor=3 CPU encode/decode, GPU predictor stride fix, validate_tile_layout, BigTIFF LONG8 offsets, AREA_OR_POINT VRT, per-tile alloc guard). 1M dask chunk cap intact at __init__.py:948; adler32 batch transfer intact at _gpu_decode.py:1825. New code is metadata validation and dispatcher logic with no extra materialization or per-tile sync points. No HIGH/MEDIUM regressions. | Pass 5 (2026-05-12): re-audit identified MEDIUM in _gpu_decode.py:1577 _try_nvcomp_from_device_bufs: per-tile cupy.empty + trailing cupy.concatenate doubled peak VRAM and added serial concat. Filed #1659 and fixed to single-buffer + pointer offsets (matches LZW/deflate/host-buffer patterns at L1847/L1878/L1114). Microbench (alloc+concat overhead only, not full nvCOMP latency): n=256 tile_bytes=65536 drops 3.66ms->0.69ms, n=256 tile_bytes=262144 drops 8.18ms->0.13ms. Tests: 5 new tests in test_nvcomp_from_device_bufs_single_alloc_1659.py (codec short-circuit, no-lib short-circuit, memory-guard contract, real ZSTD round-trip via nvCOMP, structural single-buffer check). 1458 existing geotiff tests pass, 3 unrelated matplotlib/py3.14 failures pre-existing. SAFE/IO-bound verdict holds. | Pass 6 (2026-05-12): re-audit on top of #1659. New HIGH in _try_kvikio_read_tiles at _gpu_decode.py:941: per-tile cupy.empty() + blocking IOFuture.get() inside loop serialised GDS reads to ~1 outstanding pread, missed parallelism the kvikio worker pool was designed for, paid per-tile cupy.empty setup (matches #1659 anti-pattern in nvCOMP path), and lacked _check_gpu_memory guard. Filed #1688 and fixed to single contiguous buffer + batched submit + guard. Microbench with 8-worker pool simulation: 256 tiles@1ms latency drops 256ms->38.7ms (~6.6x); single-thread simulation 256ms->28.5ms (9x). Tests: 9 new tests in test_kvikio_batched_pread_1688.py (kvikio-absent path, single-buffer pointer arithmetic, submit-before-get ordering, memory guard, partial-read fallback, round-trip data, zero-size/all-sparse tiles). All 1577 geotiff tests pass except pre-existing matplotlib/py3.14 failures."
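The Pass 5/6/7 findings (#1659, #1688, #1712) all flag the same anti-pattern: a per-tile `empty()` allocation inside a loop followed by a trailing `concatenate`, versus one contiguous allocation sliced into per-tile views. A minimal CPU sketch of the two shapes, with numpy standing in for cupy so it runs without a GPU (`gather_tiles_per_tile` and `gather_tiles_single_buffer` are illustrative names, not functions from this codebase):

```python
import numpy as np

def gather_tiles_per_tile(tile_sizes, fill):
    # Anti-pattern flagged in #1659/#1688/#1712: one allocation per
    # tile, then a trailing concatenate that copies every byte again
    # and roughly doubles peak memory while it runs.
    tiles = []
    for i, n in enumerate(tile_sizes):
        buf = np.empty(n, dtype=np.uint8)
        fill(i, buf)
        tiles.append(buf)
    return np.concatenate(tiles)

def gather_tiles_single_buffer(tile_sizes, fill):
    # Fix pattern: one contiguous allocation up front; each tile is
    # filled through a zero-copy view at its byte offset, so there is
    # no final concatenate and no second peak.
    offsets = np.concatenate(([0], np.cumsum(tile_sizes)))
    out = np.empty(int(offsets[-1]), dtype=np.uint8)
    for i in range(len(tile_sizes)):
        fill(i, out[offsets[i]:offsets[i + 1]])
    return out

sizes = [65536] * 4
fill = lambda i, buf: buf.fill(i)
assert np.array_equal(gather_tiles_per_tile(sizes, fill),
                      gather_tiles_single_buffer(sizes, fill))
```

On a GPU the single-buffer form also replaces many small device allocations (each a potential sync point) with one, which is where the microbench gains above come from.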
glcm,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,"Downgraded to MEDIUM. da.stack without rechunk is scheduling overhead, not OOM risk."
hillshade,2026-04-16T12:00:00Z,SAFE,compute-bound,0,,"Re-audit after Horn's method rewrite (PR 1175): clean stencil, map_overlap depth=(1,1), no materialization. Zero findings."
hydro,2026-05-01,RISKY,memory-bound,0,1416,"Fixed-in-tree 2026-05-01: hand_mfd._hand_mfd_dask now assembles via da.map_blocks instead of eager da.block of pre-computed tiles (matches hand_dinf pattern). Remaining MEDIUM: sink_d8 CCL fully materializes labels (inherently global), flow_accumulation_mfd frac_bdry held in driver dict instead of memmap-backed BoundaryStore. D8 iterative paths (flow_accum/fill/watershed/basin/stream_*) use serial-tile sweep with memmap-backed boundary store -- per-tile RAM bounded but driver iterates O(diameter) times. flow_direction_*, flow_path/snap_pour_point/twi/hand_d8/hand_dinf are SAFE."
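The hydro row's "memmap-backed boundary store" keeps per-tile RAM bounded during the serial-tile sweep by spilling each tile's edge values to disk. A hypothetical sketch of that idea, assuming nothing about the real `BoundaryStore` API beyond the name (the class, shapes, and methods here are illustrative only):

```python
import os
import tempfile
import numpy as np

class BoundaryStore:
    # Hypothetical sketch: persist the 4 edge vectors (top, bottom,
    # left, right) of every tile to a disk-backed memmap so the driver
    # holds O(tile) resident memory while sweeping tiles serially.
    def __init__(self, n_tiles, edge_len, path=None):
        self.path = path or os.path.join(tempfile.mkdtemp(), "bdry.dat")
        self.store = np.memmap(self.path, dtype=np.float64, mode="w+",
                               shape=(n_tiles, 4, edge_len))

    def put(self, tile_id, tile):
        # Record only the boundary of the tile, not its interior.
        self.store[tile_id, 0] = tile[0, :]
        self.store[tile_id, 1] = tile[-1, :]
        self.store[tile_id, 2] = tile[:, 0]
        self.store[tile_id, 3] = tile[:, -1]

    def top(self, tile_id):
        return np.array(self.store[tile_id, 0])

store = BoundaryStore(n_tiles=4, edge_len=8)
tile = np.arange(64, dtype=np.float64).reshape(8, 8)
store.put(0, tile)
assert np.array_equal(store.top(0), tile[0, :])
```

The remaining cost noted in the row is not memory but iteration count: the driver still revisits tiles O(diameter) times until boundary values stop changing.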
103 changes: 74 additions & 29 deletions xrspatial/geotiff/_compression.py
@@ -834,44 +834,89 @@ def unpack_bits(data: np.ndarray, bps: int, pixel_count: int) -> np.ndarray:
        out = np.unpackbits(data)[:pixel_count]
        return out.astype(np.uint8)
    elif bps == 2:
        # 4 pixels per byte, MSB-first. Vectorised across the packed
        # byte array so a large tile decodes in one pass instead of N
        # Python iterations (issue #1713). The strided assignments
        # below mirror the original per-byte loop's write pattern,
        # including the contract that positions beyond
        # ``len(data) * 4`` are left as the ``np.empty`` initial
        # garbage value -- matching the prior behaviour for short
        # input buffers.
        out = np.empty(pixel_count, dtype=np.uint8)
        n_full_bytes = min(len(data), (pixel_count + 3) // 4)
        if n_full_bytes > 0:
            raw = data[:n_full_bytes]
            written = n_full_bytes * 4
            limit = min(written, pixel_count)
            # Compute each of the four bit-slots for every covered
            # byte, then write the prefix that fits into ``out``.
            slot0 = (raw >> 6) & np.uint8(0x03)
            slot1 = (raw >> 4) & np.uint8(0x03)
            slot2 = (raw >> 2) & np.uint8(0x03)
            slot3 = raw & np.uint8(0x03)
            full_idx = limit - (limit % 4)
            if full_idx > 0:
                quart = full_idx // 4
                out[0:full_idx:4] = slot0[:quart]
                out[1:full_idx:4] = slot1[:quart]
                out[2:full_idx:4] = slot2[:quart]
                out[3:full_idx:4] = slot3[:quart]
            # Handle the tail (1 to 3 elements past the last full quad).
            tail = limit - full_idx
            tail_byte = full_idx // 4
            if tail >= 1:
                out[full_idx] = slot0[tail_byte]
            if tail >= 2:
                out[full_idx + 1] = slot1[tail_byte]
            if tail >= 3:
                out[full_idx + 2] = slot2[tail_byte]
        return out
    elif bps == 4:
        # 2 pixels per byte, high nibble first. Vectorised across the
        # packed byte array (issue #1713). Positions beyond
        # ``len(data) * 2`` keep the ``np.empty`` initial value,
        # matching the prior per-byte loop's contract for short
        # input buffers.
        out = np.empty(pixel_count, dtype=np.uint8)
        n_full_bytes = min(len(data), (pixel_count + 1) // 2)
        if n_full_bytes > 0:
            raw = data[:n_full_bytes]
            written = n_full_bytes * 2
            limit = min(written, pixel_count)
            hi = (raw >> 4) & np.uint8(0x0F)
            lo = raw & np.uint8(0x0F)
            full_idx = limit - (limit % 2)
            if full_idx > 0:
                pairs = full_idx // 2
                out[0:full_idx:2] = hi[:pairs]
                out[1:full_idx:2] = lo[:pairs]
            if limit - full_idx >= 1:
                out[full_idx] = hi[full_idx // 2]
        return out
    elif bps == 12:
        # 2 pixels per 3 bytes, MSB-first. Vectorised across the
        # packed byte array (issue #1713). Pairs whose third byte is
        # past ``len(data)`` are skipped, preserving the prior
        # ``np.empty`` semantics for short input buffers.
        out = np.empty(pixel_count, dtype=np.uint16)
        n_pairs = pixel_count // 2
        remainder = pixel_count % 2
        if n_pairs > 0:
            # Only those triples that fit entirely within ``data`` get
            # written. The prior code used ``off + 2 < len(data)``
            # (strict less than), which is satisfied when
            # ``3i + 2 < len(data)``. Solving for the largest valid
            # ``i`` gives ``i_max = (len(data) - 3) // 3``, so the pair
            # count cap is ``i_max + 1 = len(data) // 3``.
            max_pairs = len(data) // 3
            usable_pairs = min(n_pairs, max_pairs)
            if usable_pairs > 0:
                raw = data[:usable_pairs * 3].astype(np.uint16)
                b0 = raw[0::3]
                b1 = raw[1::3]
                b2 = raw[2::3]
                out[0:usable_pairs * 2:2] = (b0 << 4) | (b1 >> 4)
                out[1:usable_pairs * 2:2] = ((b1 & 0x0F) << 8) | b2
        if remainder and n_pairs * 3 + 1 < len(data):
            off = n_pairs * 3
            out[pixel_count - 1] = (int(data[off]) << 4) | (int(data[off + 1]) >> 4)
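The equivalence the diff's comments claim -- strided vectorised writes reproducing the per-byte loop's output, including the short-tail case -- can be checked directly. A standalone sketch for the representative bps=4 case (`unpack4_loop` and `unpack4_vec` are illustrative re-implementations, not the library's `unpack_bits` itself):

```python
import numpy as np

def unpack4_loop(data, n):
    # Reference per-byte loop in the pre-#1713 shape: 2 pixels per
    # byte, high nibble first.
    out = np.zeros(n, dtype=np.uint8)
    for i in range(min(len(data), (n + 1) // 2)):
        b = int(data[i])
        if 2 * i < n:
            out[2 * i] = (b >> 4) & 0x0F
        if 2 * i + 1 < n:
            out[2 * i + 1] = b & 0x0F
    return out

def unpack4_vec(data, n):
    # Vectorised form: strided writes of the high/low nibble arrays,
    # with an explicit tail write when n is odd.
    out = np.zeros(n, dtype=np.uint8)
    n_bytes = min(len(data), (n + 1) // 2)
    if n_bytes > 0:
        raw = data[:n_bytes]
        limit = min(n_bytes * 2, n)
        hi = (raw >> 4) & np.uint8(0x0F)
        lo = raw & np.uint8(0x0F)
        full = limit - (limit % 2)
        if full > 0:
            out[0:full:2] = hi[:full // 2]
            out[1:full:2] = lo[:full // 2]
        if limit - full >= 1:
            out[full] = hi[full // 2]
    return out

rng = np.random.default_rng(0)
data = rng.integers(0, 256, size=1000, dtype=np.uint8)
for n in (1, 2, 777, 1999, 2000):
    assert np.array_equal(unpack4_loop(data, n), unpack4_vec(data, n))
```

The sketch uses `np.zeros` instead of the library's `np.empty` so equality holds bytewise even when `data` is shorter than the output; the real code deliberately leaves such positions uninitialised, as its comments note.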