Skip to content

geotiff: grow HTTP COG header prefetch past 64 KiB#1727

Merged
brendancol merged 1 commit into
mainfrom
issue-1718
May 12, 2026
Merged

geotiff: grow HTTP COG header prefetch past 64 KiB#1727
brendancol merged 1 commit into
mainfrom
issue-1718

Conversation

@brendancol
Copy link
Copy Markdown
Contributor

Summary

_parse_cog_http_meta used to fetch 16 KiB, parse IFDs, and retry once with 64 KiB if nothing parsed. Files whose IFD chain or out-of-line tag arrays (TileOffsets, GeoAsciiParams, GDAL_METADATA) sat past 64 KiB silently lost IFDs or raised ValueError from parse_ifd's bounds checks. Local-file reads of the same files worked because they aren't bounded by the prefetch.

Fix

Replace the 16 KiB / 64 KiB two-shot with a grow loop:

  • Start at INITIAL_HTTP_HEADER_BYTES = 16 KiB. Fast path is unchanged.
  • Parse IFDs. If parse_all_ifds raised because an out-of-line tag pointer landed past the buffer, treat it as truncation.
  • Look at the highest byte offset the parsed IFDs reference (currently the tail IFD's next_ifd_offset; parse_ifd already validates tag value pointers against len(data) before returning).
  • If the buffer covers the chain, done; otherwise double the fetch size and retry.
  • Cap at MAX_HTTP_HEADER_BYTES = 4 MiB and raise a clear ValueError when exceeded.
  • Stop growing when the server returns the same length as the previous call (EOF).

Repro

A COG written with a large gdal_metadata_xml and several overview levels places IFD #2 past 65536 bytes. Before this PR, _read_cog_http either lost the overview IFDs or raised. After, it grows the prefetch and reads correctly.

Tests

xrspatial/geotiff/tests/test_http_meta_buffer_1718.py:

  • test_small_cog_uses_single_initial_read: small COG fires exactly one 16 KiB read.
  • test_ifd_chain_past_64kib_resolves: direct _parse_cog_http_meta against an in-memory source returns the full IFD list when the chain extends past 64 KiB.
  • test_end_to_end_http_read_with_big_metadata: full _read_cog_http round-trip via a local http.server, plus an overview-level read.
  • test_cap_raises_clear_error_on_excessive_chain: patched cap + crafted next-IFD pointer triggers the clear ValueError.

Test plan

  • pytest xrspatial/geotiff/tests/test_http_meta_buffer_1718.py -xvs (4 passed)
  • pytest xrspatial/geotiff/tests/test_cog_http_concurrent.py -x (6 passed)
  • Full xrspatial/geotiff/tests/ suite (1819 passed; 4 pre-existing failures unrelated to this change)

Closes #1718.

@github-actions github-actions Bot added the performance PR touches performance-sensitive code label May 12, 2026
@brendancol brendancol requested a review from Copilot May 12, 2026 19:13
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves the GeoTIFF HTTP COG reader by making _parse_cog_http_meta dynamically grow its initial header/metadata prefetch beyond the previous 64 KiB limit, so COGs with deeper IFD chains or large out-of-line tag payloads can be parsed correctly over HTTP range requests.

Changes:

  • Replace the fixed 16 KiB + one-time 64 KiB retry logic with a doubling “grow loop” capped by MAX_HTTP_HEADER_BYTES.
  • Add helper logic to decide whether the parsed IFD chain is fully resolved (vs truncated by the current buffer).
  • Add a new test suite covering the fast path, grow path, end-to-end HTTP reads, and cap/exception behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File Description
xrspatial/geotiff/_reader.py Adds grow-loop prefetch logic (with initial/cap constants and an extent helper) for HTTP COG metadata parsing.
xrspatial/geotiff/tests/test_http_meta_buffer_1718.py Introduces tests that reproduce the >64 KiB metadata/IFD-chain scenario and validate the new grow behavior and cap error.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +1596 to +1611
while True:
try:
ifds = parse_all_ifds(header_bytes, header)
required = _ifd_required_extent(ifds, header, len(header_bytes))
# Chain is fully resolved when every IFD parsed cleanly and
# the tail next_ifd_offset is reachable within the buffer
# (required == 0 means end-of-chain).
if ifds and required <= len(header_bytes):
break
except ValueError:
# parse_ifd raises when an out-of-line tag points past the
# buffer. Treat it the same as a truncated chain: grow and
# retry. If we are already at the cap and still failing, let
# the next iteration's cap check raise a clear error.
ifds = []

Comment on lines +1626 to +1627
except ValueError:
ifds = []
Comment on lines +1536 to +1548
"""Return the highest byte offset the parsed IFDs reference.

Used to decide whether the prefetch buffer is large enough to hold the
entire IFD chain plus every out-of-line tag value. We compare this
against ``len(data)`` in :func:`_parse_cog_http_meta`; if it exceeds the
buffer, the chain is truncated and the caller must grow and retry.

The walk re-derives each tag's value-area placement directly from the
IFD layout (entry table base + entry slot) rather than re-parsing the
raw bytes. For out-of-line tags ``parse_ifd`` already resolved the
pointer and validated ``ptr + size <= data_len``; the *interesting*
extent for the grow loop is the next-IFD pointer of the chain tail,
plus an "is there a next IFD we have not yet seen" probe.
Comment on lines +216 to +223
import tempfile
with tempfile.NamedTemporaryFile(suffix='_cap_1718.tif', delete=False) as f:
path = f.name
write(arr, path, compression='deflate', tiled=True, tile_size=16,
cog=True, overview_levels=[1])
with open(path, 'rb') as f:
payload = bytearray(f.read())

Comment on lines +229 to +236
# far-off value that no buffer growth will ever satisfy.
bo = header.byte_order
first_ifd_off = header.first_ifd_offset
import struct as _struct
num_entries = _struct.unpack_from(f'{bo}H', payload, first_ifd_off)[0]
next_off_pos = first_ifd_off + 2 + num_entries * 12
far = 10**12 # 1 TB, well past any cap
_struct.pack_into(f'{bo}I', payload, next_off_pos, far & 0xFFFFFFFF)
_parse_cog_http_meta used to fetch 16 KiB and retry once with 64 KiB.
COGs whose IFD chain or out-of-line tag arrays (TileOffsets,
GeoAsciiParams, GDAL_METADATA) sat past 64 KiB silently lost IFDs or
raised from parse_ifd's bounds checks. Replace the two-shot with a
grow loop that doubles the buffer until the IFD chain resolves,
capped at MAX_HTTP_HEADER_BYTES (4 MiB). Fast path is unchanged.

Closes #1718.
@brendancol brendancol merged commit eb6955e into main May 12, 2026
10 of 11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance PR touches performance-sensitive code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

geotiff: HTTP COG metadata parsing capped at 64 KiB drops large IFD chains

2 participants