Skip to content

geotiff: golden corpus phase 3 PR 6, VRT backend (#1930)#2042

Merged
brendancol merged 2 commits into
mainfrom
1930-phase3-6-vrt
May 18, 2026
Merged

geotiff: golden corpus phase 3 PR 6, VRT backend (#1930)#2042
brendancol merged 2 commits into
mainfrom
1930-phase3-6-vrt

Conversation

@brendancol
Copy link
Copy Markdown
Contributor

Summary

Phase 3 PR 6 of the golden corpus plan in #1930. The VRT path is reached when open_geotiff sees a .vrt source; it delegates to read_vrt which parses the GDAL VRT XML and stitches the listed source GeoTIFFs back into a single DataArray.

This module synthesises a two-source horizontal mosaic from a corpus fixture: the source .tif is rasterio-copied twice into tmp_path with the second copy's origin shifted east by one image width, then write_vrt builds the VRT XML, and open_geotiff reads it back. The expected mosaic is np.concatenate of the source pixels along the x axis.

Each cell asserts three things:

  1. The rasterio read of the VRT matches the manual concat (sanity check that the VRT itself is well-formed).
  2. The xrspatial read matches the same concat (the actual parity assertion).
  3. The xrspatial dtype matches the rasterio dtype.

Two source fixtures are parametrised: dtype_uint8 and dtype_uint16. Both are bit-exact comparable and use the manifest default transform. _VRT_SOURCE_IDS is validated against the manifest so a renamed source fixture surfaces a typo rather than silently letting the test skip.

Test plan

  • pytest xrspatial/geotiff/tests/test_golden_corpus_vrt_1930.py: 3 passed
  • Mosaic geometry verified independently via the rasterio read assertion

Refs #1930. Sixth and final backend-wiring PR; closes the phase 3 set (eager numpy #2036, dask+numpy #2038, GPU #2039, dask+GPU #2040, HTTP/COG #2041, VRT here).

@github-actions github-actions Bot added the performance PR touches performance-sensitive code label May 18, 2026
brendancol added a commit that referenced this pull request May 18, 2026
* geotiff: oracle understands masked-nodata candidates (#1930)

Closes the 5-of-7 xfail shared by every phase-3 backend module.

xrspatial reads integer GeoTIFFs whose nodata tag carries an
integer sentinel by masking sentinel-equal pixels to NaN and
upcasting the array to float (issue #1988). It stamps
``attrs['masked_nodata'] = True`` so a write round-trip can put
the tag back. The oracle previously did strict dtype + raw-pixel
comparison, so the dtype shift and sentinel-to-NaN rewrite both
registered as mismatches.

This change adds ``_normalise_for_masked_nodata`` to ``_oracle.py``.
When the candidate reports the contract, the rasterio reference is
cast to the candidate's float dtype and pixels equal to the
sentinel are replaced with NaN; the dtype and pixel assertions
then run on directly comparable arrays. Fixtures that do not
report the contract pass through unchanged, so the 23+ currently
passing fixtures and the citation / JPEG xfails are untouched.

Drops the five nodata-masking entries from ``_PARITY_GAPS`` in
each of the three on-main backend modules (eager numpy, dask
numpy, GPU). The matching ``xfail(strict=True)`` flips to a real
failure when the test starts passing, so the oracle change and
the cleanup must land together to keep main green.

Test coverage in ``golden_corpus/test_oracle.py``:

* masked_nodata candidate matches the oracle
* masked_nodata flag missing -- strict dtype check still fires
* candidate that forgot to mask a sentinel pixel -- pixel check fires
* candidate with wrong ``attrs['nodata']`` -- nodata check fires
* plain float-NaN fixtures are not affected by the new path
* masked_nodata flag plus NaN sentinel passes through cleanly

Follow-ups: the same five entries will be removed from the
``_PARITY_GAPS`` tables in the still-open phase-3 PRs #2040
(dask+GPU), #2041 (HTTP), and #2042 (VRT) via separate commits
on each branch.

* geotiff: tighten masked-nodata oracle guards (#1930)

Address PR #2046 review:

* Reject fractional sentinels via ``float(nd_float).is_integer()``.
  Without the guard, a sentinel like ``3.5`` would cast to ``3``
  and mask every 3-valued pixel.
* Reject sentinels outside the source integer dtype's range via
  ``info.min <= int(nd_float) <= info.max``. Without the guard,
  ``np.uint16(-1.0)`` wraps to ``65535`` and would mask the
  dtype-max pixel.

Both guards mirror the upstream xrspatial reader's checks in
``xrspatial/geotiff/__init__.py``, so the oracle's interpretation
of the masked-nodata contract is now strictly tighter than what
the reader can emit.

Added two tests:

* ``test_masked_nodata_fractional_sentinel_does_not_mask`` --
  builds a fractional-nodata fixture, confirms the oracle stays on
  the raw-pixel path (dtype mismatch fires, not pixel mismatch).
* ``test_masked_nodata_out_of_range_sentinel_does_not_mask`` --
  rasterio refuses to write an out-of-range nodata at the writer
  level, so this calls ``_normalise_for_masked_nodata`` directly
  with synthesised inputs and confirms the inputs pass through
  unchanged.
Synthesises a two-source horizontal mosaic from a corpus fixture:
the source ``.tif`` is rasterio-copied twice into ``tmp_path`` with
the second copy's origin shifted east by one image width, then
``write_vrt`` builds the VRT XML and ``open_geotiff`` reads it
back. The expected mosaic is ``np.concatenate`` of the source
pixels along the x axis.

Each cell asserts three things: the rasterio read of the VRT
matches the manual concat (sanity check that the VRT itself is
well-formed), the xrspatial read matches the same concat (the
actual parity assertion), and the xrspatial dtype matches the
rasterio dtype.

Two source fixtures are parametrised: ``dtype_uint8`` and
``dtype_uint16``. Both are bit-exact comparable and use the
manifest default transform. ``_VRT_SOURCE_IDS`` is validated
against the manifest so a renamed source fixture surfaces a
typo rather than silently letting the test skip.
Address phase 3 PR 6 review:

* Replace the hand-rolled pixel + dtype assertion with
  ``compare_to_oracle(vrt_path, candidate)`` so the VRT cell
  checks pixels, dtype, transform, CRS, and nodata the same way
  every other phase-3 backend module does.
* Keep the independent ``np.concatenate`` sanity check before the
  oracle call. The oracle also reads through rasterio; if the VRT
  writer drifts the oracle would silently agree with itself, so
  the manual concat verifies the rasterio reference is what we
  expect before using it.
* Use ``_height, width`` to mark the unpacked height as unused.
@brendancol brendancol force-pushed the 1930-phase3-6-vrt branch from c7a999d to fb50742 Compare May 18, 2026 18:00
@brendancol brendancol merged commit 1fae0a5 into main May 18, 2026
4 of 5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance PR touches performance-sensitive code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant