
Surface chunked VRT decode-time hole gap under missing_sources='warn' (#2989)#2991

Merged: brendancol merged 2 commits into main from issue-2989 on Jun 6, 2026

@brendancol
Contributor

Closes #2989

What

The chunked VRT path pre-populates attrs['vrt_holes'] from a parse-time os.path.exists sweep, which only catches sources whose backing file is missing. A source that exists but fails to decode at compute time (corrupt, truncated, codec error) reads as a hole inside the per-chunk worker and is warned there, but that record cannot be reduced back onto the parent DataArray's attrs without eagerly decoding every source. Eager decode would defeat lazy reading, so it is off the table.

This PR makes the gap explicit instead of silent:

  • Documents attrs['vrt_holes'] under chunked dispatch as a lower bound (statically-detectable missing-file holes only), in the _read_vrt docstring, the inline static-sweep comment, and docs/source/reference/geotiff.rst.
  • Emits one GeoTIFFFallbackWarning up front under missing_sources='warn' when the requested window touches an existing (decode-capable) source, so a caller cannot treat the absence of the attr as proof of a complete mosaic. The warning is suppressed when no in-window source exists on disk, since nothing can decode-fail in that case.

No behavior change for missing_sources='raise' (still fails closed up front on missing files) or for the eager path.
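
The gating described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `HeadsUpWarning` stands in for `GeoTIFFFallbackWarning`, and `finish_chunked_build`, its parameters, the exception type, and the message text are all hypothetical. It also assumes the attr is simply absent when the static sweep finds no holes.

```python
import warnings


class HeadsUpWarning(UserWarning):
    """Hypothetical stand-in for GeoTIFFFallbackWarning."""


def finish_chunked_build(static_holes, decode_capable, missing_sources="warn"):
    """Return the attrs a chunked read would carry after the static sweep.

    static_holes:   in-window sources whose backing file is missing
    decode_capable: True if any in-window source exists on disk
    """
    if missing_sources == "raise":
        # Fail closed up front on missing files; no heads-up in this mode.
        # (The exception type here is an assumption.)
        if static_holes:
            raise FileNotFoundError(f"missing VRT sources: {static_holes}")
        return {}

    attrs = {}
    if static_holes:
        # Lower bound only: decode-time holes are not known yet.
        attrs["vrt_holes"] = list(static_holes)
    if missing_sources == "warn" and decode_capable:
        # Something in the window could still fail to decode at compute time.
        warnings.warn(
            "attrs['vrt_holes'] lists missing files only; a source that "
            "exists may still fail to decode at compute time",
            HeadsUpWarning,
            stacklevel=2,
        )
    return attrs
```

With `decode_capable=False` (every in-window source is missing), the sketch stays silent, which mirrors the suppression rule in the second bullet.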

Backend coverage

VRT reads are CPU/dask only and do not go through the GPU decoder pipeline; the warning lands on the chunked dispatcher shared by dask+numpy and dask+cupy. No numpy/cupy eager-path change.

Test plan

  • Present-but-corrupt source is absent from build-time attrs['vrt_holes']
  • Build-time heads-up GeoTIFFFallbackWarning fires under chunked 'warn'
  • Corrupt source still warns at compute time
  • Heads-up suppressed when no in-window source exists on disk
  • Full xrspatial/geotiff/tests/vrt/ suite green (502 passed)
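
The first and third items rest on the same observation: a file can exist and still be undecodable. The toy below illustrates that gap with the standard library only. `toy_decode` is a hypothetical stand-in for a real decoder (it checks only the classic TIFF magic bytes), and the function names are not taken from the test suite.

```python
import os
import tempfile

# Little- and big-endian headers of a classic (non-BigTIFF) TIFF file.
TIFF_MAGIC = (b"II*\x00", b"MM\x00*")


def static_holes(paths):
    """Parse-time sweep: only a missing file counts as a hole."""
    return [p for p in paths if not os.path.exists(p)]


def toy_decode(path):
    """Hypothetical decoder stand-in: reject files without TIFF magic bytes."""
    with open(path, "rb") as f:
        if f.read(4) not in TIFF_MAGIC:
            raise ValueError(f"not a TIFF: {path}")


def compute_time_holes(paths):
    """What per-chunk workers would discover, at one decode per source."""
    holes = []
    for p in paths:
        try:
            toy_decode(p)
        except (OSError, ValueError):
            holes.append(p)
    return holes


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.tif")
    corrupt = os.path.join(tmp, "corrupt.tif")
    with open(good, "wb") as f:
        f.write(b"II*\x00" + b"\x00" * 4)
    with open(corrupt, "wb") as f:
        f.write(b"garbage")

    build_time = static_holes([good, corrupt])      # sees no holes
    run_time = compute_time_holes([good, corrupt])  # finds corrupt.tif
```

Here `build_time` is empty while `run_time` contains the corrupt file, which is why the build-time attr can only be a lower bound unless every source is decoded eagerly.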

…#2989)

The chunked VRT path's parse-time os.path.exists sweep records only
missing-file holes in attrs['vrt_holes']. A source that exists but fails
to decode at compute time reads as a hole in the per-chunk worker and is
warned there, but cannot be reduced back onto the parent DataArray's
attrs without an eager decode of every source (which would defeat lazy
reading).

Document attrs['vrt_holes'] under chunked dispatch as a lower bound
(statically-detectable missing-file holes only) and emit one build-time
GeoTIFFFallbackWarning under missing_sources='warn' when the requested
window touches an existing (decode-capable) source, so callers do not
treat the absence of the attr as proof of a complete mosaic.
@github-actions (bot) added the performance label (PR touches performance-sensitive code) on Jun 6, 2026
Contributor Author

@brendancol left a comment

PR Review: Surface chunked VRT decode-time hole gap under missing_sources='warn' (#2991)

Blockers (must fix before merge)

None.

Suggestions (should fix, not blocking)

  • xrspatial/geotiff/_backends/vrt.py:1290-1308 -- the decode_capable scan re-walks vrt.bands and every source with the same window-intersection test the static-hole loop at line ~1185 already runs. On a mosaic with many tiles that is a second full pass. Consider computing decode_capable inside the existing static-hole loop: a source that intersects the window and exists is decode-capable, one that intersects and is missing is a hole. That avoids the duplicate traversal at the cost of a slightly busier loop body. Not blocking: the scan breaks on the first existing in-window source, so the common case is cheap.

Nits (optional improvements)

  • xrspatial/geotiff/_backends/vrt.py:1310-1312 -- import warnings and from .._runtime import GeoTIFFFallbackWarning sit inside the if decode_capable block. That matches the eager path's local-import style in _vrt.py:1621, so it is consistent. A module-level import would be marginally cleaner; leave it if matching the existing pattern is the intent.

What looks good

  • The window-intersection predicate is copied verbatim from the static-hole guard, so the heads-up and the attrs population agree on what "in window" means.
  • The warning is gated on missing_sources == 'warn' and sits after the raise guard, so 'raise' and strict mode are untouched.
  • Suppressing the warning when no in-window source exists on disk keeps it from firing on all-missing VRTs where nothing can decode-fail, and a test pins that.
  • Docstring, inline comment, and geotiff.rst describe the lower-bound contract consistently.

Checklist

  • Algorithm matches reference (N/A, no numerical change)
  • All implemented backends consistent (VRT reads are CPU/dask only)
  • NaN handling correct (N/A, no pixel-value change)
  • Edge cases covered by tests (corrupt source, all-missing, present source)
  • Dask chunk boundaries handled correctly (window test reused from existing guard)
  • No premature materialization (build-time scan is path-existence only, no decode)
  • Benchmark not needed
  • README feature matrix not applicable (no new function)
  • Docstrings accurate

Compute the build-time heads-up flag inside the existing source-window
scan instead of re-walking vrt.bands a second time. The static-hole loop
now applies the window-intersection test to every in-window source and
branches on os.path.exists: present sources mark decode_capable, missing
ones append to chunked_holes. Single pass, same behaviour.
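
A minimal sketch of that single pass, under stated assumptions: the real loop walks `vrt.bands` and their sources, whereas this version flattens them into one list. The `Source` record, the `intersects` helper, the `sweep` function, the injectable `exists` parameter, and the choice to record holes as paths are all hypothetical.

```python
import os
from dataclasses import dataclass


@dataclass
class Source:
    """Hypothetical source record: a path plus its destination rectangle."""
    path: str
    x0: int
    y0: int
    x1: int  # exclusive
    y1: int  # exclusive


def intersects(src, window):
    """True if the source rectangle overlaps window = (x0, y0, x1, y1)."""
    wx0, wy0, wx1, wy1 = window
    return src.x0 < wx1 and src.x1 > wx0 and src.y0 < wy1 and src.y1 > wy0


def sweep(sources, window, exists=os.path.exists):
    """One pass: collect missing-file holes and the decode_capable flag."""
    chunked_holes = []
    decode_capable = False
    for src in sources:
        if not intersects(src, window):
            continue  # out-of-window sources affect neither result
        if exists(src.path):
            decode_capable = True  # present, so it may still fail to decode
        else:
            chunked_holes.append(src.path)  # statically detectable hole
    return chunked_holes, decode_capable
```

Because both results come from the same window test, the heads-up and the attr cannot disagree about which sources are in the window. Passing `exists` as a parameter is only a convenience for exercising the sketch without touching the filesystem.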
Contributor Author

@brendancol left a comment

Follow-up review (after a69477a)

Disposition of prior findings

  • Suggestion (duplicate traversal, _backends/vrt.py): fixed. decode_capable is now computed inside the existing static-hole sweep. The loop applies the window-intersection test to every in-window source once and branches on os.path.exists: present sources set the flag, missing ones append to chunked_holes. No second pass.
  • Nit (local import of warnings / GeoTIFFFallbackWarning): dismissed. It matches the eager path's local-import of the same symbols at _vrt.py:1621. A module-level import here would diverge from that established pattern, so leaving it keeps the two paths consistent.

State now

  • Single source-list walk; behaviour unchanged.
  • Full xrspatial/geotiff/tests/vrt/ suite green (502 passed); the four new #2989 tests pass.
  • flake8 clean on the changed file.

No new findings.

@brendancol merged commit 9a1b360 into main on Jun 6, 2026
7 checks passed
Successfully merging this pull request may close these issues.

Chunked VRT missing_sources='warn' under-reports decode-time holes in attrs['vrt_holes']
