Skip to content

geotiff: golden corpus phase 3 PR 5, HTTP/COG backend (#1930)#2041

Merged
brendancol merged 2 commits into
mainfrom
1930-phase3-5-http
May 18, 2026
Merged

geotiff: golden corpus phase 3 PR 5, HTTP/COG backend (#1930)#2041
brendancol merged 2 commits into
mainfrom
1930-phase3-5-http

Conversation

@brendancol
Copy link
Copy Markdown
Contributor

Summary

Phase 3 PR 5 of the golden corpus plan in #1930. The HTTP path in xrspatial.geotiff is COG-only: open_geotiff('http://...') routes to _reader._read_cog_http which uses range requests to fetch only metadata and the tiles a window needs. Plain stripped or tiled fixtures over HTTP would require a full-object download, which is a different code path; this module focuses narrowly on the COG fixture.

The in-process HTTP server reuses the range-supporting BaseHTTPRequestHandler pattern from test_http_meta_buffer_1718.py. It binds an ephemeral 127.0.0.1 port, serves the COG fixture's bytes, and lets the test read through open_geotiff(url) with the SSRF guard opened for loopback via XRSPATIAL_GEOTIFF_ALLOW_PRIVATE_HOSTS=1 (the same hatch every other in-process HTTP test in the repo uses).

The corpus has one COG fixture today (cog_internal_overview_uint16, phase 2 PR 7). It passes the oracle through the HTTP path. A test_at_least_one_cog_fixture_exists guard catches the case where a future refactor accidentally drops the only COG entry.

Test plan

  • pytest xrspatial/geotiff/tests/test_golden_corpus_http_1930.py: 2 passed
  • In-process server cleans up cleanly on test exit (httpd.shutdown() + server_close() in a try/finally)

Refs #1930. Fifth of six backend-wiring PRs; VRT remains.

@github-actions github-actions Bot added the performance PR touches performance-sensitive code label May 18, 2026
brendancol added a commit that referenced this pull request May 18, 2026
* geotiff: oracle understands masked-nodata candidates (#1930)

Closes the 5-of-7 xfail shared by every phase-3 backend module.

xrspatial reads integer GeoTIFFs whose nodata tag carries an
integer sentinel by masking sentinel-equal pixels to NaN and
upcasting the array to float (issue #1988). It stamps
``attrs['masked_nodata'] = True`` so a write round-trip can put
the tag back. The oracle previously did strict dtype + raw-pixel
comparison, so the dtype shift and sentinel-to-NaN rewrite both
registered as mismatches.

This change adds ``_normalise_for_masked_nodata`` to ``_oracle.py``.
When the candidate reports the contract, the rasterio reference is
cast to the candidate's float dtype and pixels equal to the
sentinel are replaced with NaN; the dtype and pixel assertions
then run on directly comparable arrays. Fixtures that do not
report the contract pass through unchanged, so the 23+ currently
passing fixtures and the citation / JPEG xfails are untouched.

Drops the five nodata-masking entries from ``_PARITY_GAPS`` in
each of the three on-main backend modules (eager numpy, dask
numpy, GPU). The matching ``xfail(strict=True)`` flips to a real
failure when the test starts passing, so the oracle change and
the cleanup must land together to keep main green.

Test coverage in ``golden_corpus/test_oracle.py``:

* masked_nodata candidate matches the oracle
* masked_nodata flag missing -- strict dtype check still fires
* candidate that forgot to mask a sentinel pixel -- pixel check fires
* candidate with wrong ``attrs['nodata']`` -- nodata check fires
* plain float-NaN fixtures are not affected by the new path
* masked_nodata flag plus NaN sentinel passes through cleanly

Follow-ups: the same five entries will be removed from the
``_PARITY_GAPS`` tables in the still-open phase-3 PRs #2040
(dask+GPU), #2041 (HTTP), and #2042 (VRT) via separate commits
on each branch.

* geotiff: tighten masked-nodata oracle guards (#1930)

Address PR #2046 review:

* Reject fractional sentinels via ``float(nd_float).is_integer()``.
  Without the guard, a sentinel like ``3.5`` would cast to ``3``
  and mask every 3-valued pixel.
* Reject sentinels outside the source integer dtype's range via
  ``info.min <= int(nd_float) <= info.max``. Without the guard,
  ``np.uint16(-1.0)`` wraps to ``65535`` and would mask the
  dtype-max pixel.

Both guards mirror the upstream xrspatial reader's checks in
``xrspatial/geotiff/__init__.py``, so the oracle's interpretation
of the masked-nodata contract is now strictly tighter than what
the reader can emit.

Added two tests:

* ``test_masked_nodata_fractional_sentinel_does_not_mask`` --
  builds a fractional-nodata fixture, confirms the oracle stays on
  the raw-pixel path (dtype mismatch fires, not pixel mismatch).
* ``test_masked_nodata_out_of_range_sentinel_does_not_mask`` --
  rasterio refuses to write an out-of-range nodata at the writer
  level, so this calls ``_normalise_for_masked_nodata`` directly
  with synthesised inputs and confirms the inputs pass through
  unchanged.
The HTTP path in ``xrspatial.geotiff`` is COG-only:
``open_geotiff('http://...')`` routes to ``_read_cog_http`` which
uses range requests to fetch only metadata and the tiles a window
needs. Plain stripped/tiled fixtures would require a full-object
download, which is a different code path; this module focuses
narrowly on the COG fixture.

The in-process HTTP server reuses the range-supporting
``BaseHTTPRequestHandler`` pattern from
``test_http_meta_buffer_1718.py``. It binds an ephemeral
127.0.0.1 port, serves the COG fixture's bytes, and lets the test
read through ``open_geotiff(url)`` with the SSRF guard opened for
loopback via ``XRSPATIAL_GEOTIFF_ALLOW_PRIVATE_HOSTS=1``.

The single corpus COG fixture (``cog_internal_overview_uint16``)
passes the oracle through the HTTP path. A
``test_at_least_one_cog_fixture_exists`` guard surfaces the case
where a future refactor accidentally drops the only COG entry.
Address phase 3 PR 5 review:

* Pull the ``lossy`` flag from the manifest entry via a new
  ``_is_lossy(fixture_id)`` helper. The hardcoded ``lossy=False`` in
  the oracle call would silently fail bit-exact comparison if a
  future lossy COG (e.g. JPEG-COG) landed in the corpus.
* Add ``_resolved_fixtures()`` so the manifest is parsed once per
  helper. Mirrors the eager / dask / GPU modules.
* Update the module docstring to reflect that the current corpus has
  no COG fixture in a parity gap; document the path forward if one
  is added.
* Join the HTTP server thread with a 2 s timeout in the test's
  ``finally`` block so a hung ``httpd.shutdown()`` surfaces cleanly
  rather than waiting on process exit.
@brendancol brendancol force-pushed the 1930-phase3-5-http branch from 16f6c50 to e9e66e1 Compare May 18, 2026 17:58
@brendancol brendancol merged commit 983156e into main May 18, 2026
4 of 5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance PR touches performance-sensitive code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant