Skip to content

geotiff: push down byte-affecting validation into array-level writers (#2138)#2147

Merged
brendancol merged 5 commits into
mainfrom
issue-2138
May 20, 2026
Merged

geotiff: push down byte-affecting validation into array-level writers (#2138)#2147
brendancol merged 5 commits into
mainfrom
issue-2138

Conversation

@brendancol
Copy link
Copy Markdown
Contributor

Closes #2138.

Summary

  • Push byte-affecting validation from to_geotiff down into _write and _write_streaming so direct callers cannot bypass it: compression-name list, JPEG opt-in gate (write_geotiff_gpu emits JPEG TIFFs that other readers reject #1845), max_z_error sign + LERC-only pairing, crs_epsg bool rejection, crs_wkt fallback gate (geotiff: CRS resolution should fail closed on unparseable strings by default #1929), defensive copy on the NaN-to-sentinel rewrite, and float16 / bool_ auto-promotion.
  • Rename the three array-level entry points to underscore-prefixed canonical names (_write, _write_streaming, _read_to_array). The non-underscored names are kept as aliases so the ~50 internal call sites that import them still work. Module docstrings on _writer.py and _reader.py now spell out the private contract.
  • Out of scope (kept in to_geotiff): the four checks that need data.dims / data.attrs state -- 3D dim-order validation, band-first reorder, fail-closed georeferenced-transform, and the read-side masked_nodata handling. The issue's hybrid recommendation calls these out as DataArray-level by construction.

Backend coverage

CPU writer (numpy / dask-via-_write_streaming). The GPU writer already runs its own JPEG + CRS-fallback gates before calling into _write, so its CPU-fallback path now inherits the same set of checks without further changes.

Test plan

  • pytest xrspatial/geotiff/tests/test_lowlevel_write_pushdown_2138.py -- 26 new tests covering each push-down gap plus byte parity between _write and to_geotiff for every lossless codec in _VALID_COMPRESSIONS.
  • pytest xrspatial/geotiff/tests/ -- full geotiff suite: 4271 passed, 25 skipped.
  • pytest xrspatial/tests/test_geotiff_streaming_bigtiff_threshold_1785.py xrspatial/tests/test_rasterize.py -- dependent tests outside geotiff/.

…#2138)

Hardens the lower-level write entry points so direct callers cannot
silently bypass the checks that ``to_geotiff`` runs upstream. Six
byte-affecting gaps now fire at ``_write`` / ``_write_streaming``:
compression-name validation against the canonical list, JPEG opt-in
gate (#1845), ``max_z_error`` sign + LERC pairing, ``crs_epsg`` bool
rejection, defensive copy on the NaN-to-sentinel rewrite, and
``float16`` / ``bool_`` auto-promotion. Renames ``write`` /
``write_streaming`` / ``read_to_array`` to underscore-prefixed
canonical names, with the old names kept as aliases to avoid breaking
the ~50 internal call sites that import them. Adds 26 push-down +
byte-parity tests and updates module docstrings to spell out the
private contract.
@github-actions github-actions Bot added the performance PR touches performance-sensitive code label May 19, 2026
…xport (#2138)

Two follow-ups uncovered by self-review of the push-down work:

* ``_write_vrt_tiled`` / ``_write_single_tile`` now forward
  ``allow_internal_only_jpeg`` and ``allow_unparseable_crs`` to the
  per-tile ``_write`` call. Without this, ``to_geotiff(da, '...vrt',
  compression='jpeg', allow_internal_only_jpeg=True)`` would clear
  the upstream gate then trip the new push-down gate inside
  ``_write``.
* ``xrspatial/geotiff/__init__.py`` no longer does
  ``from ._writer import write``. The import bound ``write`` as an
  attribute of ``xrspatial.geotiff`` even though the name was not in
  ``__all__`` and not part of the documented public API. Mirrors the
  ``read_to_array`` cleanup from #1708.

Adds a namespace-no-leak regression test alongside the existing
``read_to_array`` pin.
Copy link
Copy Markdown
Contributor Author

@brendancol brendancol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review: geotiff: push down byte-affecting validation into array-level writers (#2138)

Read the PR head at issue-2138. The hybrid approach matches the issue's recommendation: byte-affecting checks pushed down into _write / _write_streaming, DataArray-state-dependent checks left in to_geotiff, and the three array-level entry points renamed to underscore-prefixed canonical names with public-named aliases kept to avoid disturbing ~50 internal call sites.

Blockers (must fix before merge)

None.

Suggestions (should fix, not blocking)

  • xrspatial/geotiff/_writers/eager.py:776,1063,1080 -- _write_single_tile did not forward the new allow_internal_only_jpeg / allow_unparseable_crs kwargs to its inner write(...) call. A direct caller of to_geotiff(da, 'foo.vrt', compression='jpeg', allow_internal_only_jpeg=True) would clear the upstream gate then trip the new push-down gate inside _write. Fixed in commit b81d298 by threading the kwargs through _write_vrt_tiled -> _write_single_tile -> write.
  • xrspatial/geotiff/__init__.py:112 -- from ._writer import write leaks the array-level entry point onto xrspatial.geotiff.write even though write is not in __all__. Mirrors the read_to_array cleanup #1708 already did. Fixed in commit b81d298 by dropping the unused re-export and adding a namespace-no-leak regression test.

Nits (optional improvements)

  • xrspatial/geotiff/_writer.py:1711-1714 -- the new TypeError branch on non-string compression is reachable only by direct callers (the public to_geotiff wrapper rejects non-string compression upstream with the same type-only check). The downstream _compression_tag call would still raise AttributeError on the bare string-method access, so the explicit guard is defensible. Worth a one-line comment noting that the branch is unreachable from to_geotiff and exists only for direct callers.
  • xrspatial/geotiff/_writer.py:1920-1924 -- the float16 / bool_ promotion guards on isinstance(data, np.ndarray) so dask arrays fall through. _write is documented as numpy-only (the dask path goes through _write_streaming), so the guard is correct, but a one-line comment explaining why the promotion is skipped for dask arrays would help the next reader.

What looks good

  • The hybrid scope matches the issue exactly. Eight write-side validation gaps are inventoried; the six byte-affecting ones are pushed down, the two that need DataArray state (3D dim-order, fail-closed transform) stay in to_geotiff. Read-side gaps are correctly identified as DataArray-attrs-dependent and left in open_geotiff.
  • The _validate_lowlevel_write_kwargs helper centralises the gates so _write and _write_streaming cannot drift. The entry_point argument makes the error source unambiguous.
  • restore_sentinel-gated NaN-to-sentinel rewrite with defensive copy at _writer.py:1933-1941 matches the eager-path semantics exactly and respects the #1988 opt-out.
  • Byte-parity tests at tests/test_lowlevel_write_pushdown_2138.py:243-273 are the right shape: every entry in the lossless codec set rather than spot-checking one. JPEG is correctly excluded with an explicit reason in a comment.
  • The aliases write = _write, write_streaming = _write_streaming, read_to_array = _read_to_array are object-identity preserving, which test_aliases_match_underscore_names pins so a future refactor cannot drift the alias out of sync.
  • Test count is right for the scope: 26 new push-down + byte-parity tests, three existing tests updated for the renamed entry-point name in error messages.

Checklist

  • Algorithm matches issue gap inventory
  • All implemented backends produce consistent results (numpy via _write, dask via _write_streaming; byte-parity tests pass on every lossless codec)
  • NaN handling is correct (defensive copy preserves caller buffer; sentinel-on-disk verified)
  • Edge cases are covered by tests (bool crs_epsg, negative max_z_error, float16, bool_, unknown compression)
  • Dask chunk boundaries handled correctly (push-down validation fires before any tile-row math)
  • No premature materialization or unnecessary copies (the new arr.copy() only runs when there is a NaN to rewrite)
  • Benchmark exists or is not needed (validation-only PR; no perf-sensitive path)
  • README feature matrix updated (if applicable) -- N/A, no public API change
  • Docstrings present and accurate (module docstrings spell out the private contract; _write / _write_streaming document the new kwargs and the DataArray-state gap)

…uards (#2138)

Two comments clarifying the reasoning for guards that are intentionally
narrower than they look:

* The ``TypeError`` branch on non-string ``compression`` is unreachable
  from ``to_geotiff`` (which only forwards ``str``); it exists so
  direct callers get a typed error instead of an ``AttributeError``
  from ``compression.lower()``.
* The ``isinstance(data, np.ndarray)`` gate on the float16 / bool_
  promotion intentionally skips dask arrays because ``_write_streaming``
  handles dtype promotion separately on the dask path.
Copy link
Copy Markdown
Contributor Author

@brendancol brendancol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up review pass after addressing review feedback

Second pass against the head of issue-2138 after the merge from main and the review nits.

Disposition of the original review findings

  • Suggestion: _write_single_tile does not forward JPEG / CRS opt-ins -- fixed in commit b81d298. The kwargs now thread through to_geotiff -> _write_vrt_tiled -> _write_single_tile -> _write.
  • Suggestion: __init__.py re-exports write even though it is not in __all__ -- fixed in commit b81d298. Dropped the unused re-export and added a namespace-no-leak test pinning the contract.
  • Nit: comment the TypeError branch as unreachable from to_geotiff -- fixed in commit adbfd38.
  • Nit: comment the isinstance(data, np.ndarray) gate as intentional -- fixed in commit adbfd38.

Re-review notes

  • The merge from origin/main (commits b0c6ff9 and 24b014c, both allow_rotated CRS handling) touched __init__.py. Confirmed no interaction with the #2138 changes -- the _writer.write re-export drop sat in the writer-imports block, the rotated-CRS work sits in the reader-side block.
  • Full geotiff suite: 4285 passed, 25 skipped, 1 xfailed after the merge.
  • mergeStateStatus is still computing on the GitHub side; CI is running.

No new findings.

CI on Ubuntu / Windows / macOS does not include the optional ``lz4``
and ``lerc`` codec packages in every matrix slot. The byte-parity
sweep and the LERC lossless round-trip test now probe the codec
import the way ``_compression`` itself does and skip cleanly instead
of raising ``ImportError``. The push-down validation tests
(reject negative ``max_z_error``, reject ``max_z_error`` with non-LERC
codecs) do not need the codecs because they raise before any encode
step runs.
@brendancol brendancol merged commit 621e420 into main May 20, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance PR touches performance-sensitive code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

geotiff: harden or privatize lower-level write/read entry points that bypass top-level validation

1 participant