Refactor GeoTIFF Phase 5b: extract _decode.py from _reader.py#2254

Merged
brendancol merged 3 commits into main from issue-2246-decode-extraction
May 21, 2026

Conversation

@brendancol
Contributor

Closes #2246
Part of #2211

Summary

Mechanical extraction of strip/tile decode orchestration out of _reader.py into a new _decode.py. Behavior-neutral.

Moves to _decode.py:

  • _apply_predictor, _packed_byte_count, _int_nodata_in_range, _resolve_masked_fill, _decode_strip_or_tile
  • _read_strips, _read_tiles (the local strip/tile readers)
  • _apply_orientation, _apply_orientation_with_geo, _apply_photometric_miniswhite, _miniswhite_inverted_nodata
  • _NATIVE_ORDER, _PARALLEL_DECODE_PIXEL_THRESHOLD

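Left in _reader.py:

  • Top-level entry points (_read_to_array, read_to_array)
  • The COG-HTTP fetch+decode paths
  • The pixel-safety guards (MAX_PIXELS_DEFAULT, _check_dimensions, _check_source_dimensions, PixelSafetyLimitError)
  • The sparse-layout helpers (_sparse_fill_value, _has_sparse, _compute_full_image_byte_budget, _ifd_required_extent), which move with _layout.py in PR-H (#2247)

_reader.py drops from 2201 to 1407 lines; the moved code lands in _decode.py at 897 lines. The public and internal import surface from xrspatial.geotiff._reader is preserved via a back-import block, matching how PR-E handled _sources.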
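For readers unfamiliar with the back-import pattern, here is a minimal sketch of the idea. It uses two in-memory stand-in modules (the names demo_decode and demo_reader are hypothetical, not the real xrspatial modules) so it runs without the package installed. The actual back-import block in _reader.py is not reproduced here.

```python
import sys
import types

# Hypothetical stand-in for the new home of the moved code (_decode.py).
decode_mod = types.ModuleType("demo_decode")
exec(
    "def _apply_predictor(data):\n"
    "    return data\n",
    decode_mod.__dict__,
)
sys.modules["demo_decode"] = decode_mod

# Hypothetical stand-in for the old home (_reader.py). The single import
# line below plays the role of the back-import block: it re-exports the
# moved name so existing `from demo_reader import ...` callers keep working.
reader_mod = types.ModuleType("demo_reader")
exec("from demo_decode import _apply_predictor\n", reader_mod.__dict__)
sys.modules["demo_reader"] = reader_mod

from demo_decode import _apply_predictor as via_decode
from demo_reader import _apply_predictor as via_reader

# Both import paths resolve to the very same function object.
same_object = via_reader is via_decode
```

Both import paths hand back one shared object, which is why downstream consumers do not need to change their imports.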

Test plan

  • Full geotiff test suite passes: 5033 passed, 68 skipped (one pre-existing lz4 failure on main deselected and unrelated to this PR)
  • Direct imports work: from xrspatial.geotiff._reader import _read_strips, _read_tiles, _apply_predictor, _decode_strip_or_tile, ...
  • Direct imports from new module work: from xrspatial.geotiff._decode import ...
  • Parallel-decode dispatch tests still observe ThreadPoolExecutor construction (patched at concurrent.futures.ThreadPoolExecutor rather than _reader.ThreadPoolExecutor to track the move)

Notes

Two strip-decode tests that monkey-patched _reader_mod.ThreadPoolExecutor now patch concurrent.futures.ThreadPoolExecutor instead. The dispatch contract under test (parallel branch engages on multi-strip, sparse-only short-circuits) is unchanged; the patch location is an implementation detail that tracks the move.
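A minimal sketch of why patching concurrent.futures.ThreadPoolExecutor observes a function-local import. The function read_strips_sketch is a hypothetical stand-in for the real readers, and the doubling "decode" step is purely illustrative.

```python
import concurrent.futures
from unittest import mock


def read_strips_sketch(strips):
    # Function-local import: the name is looked up on the concurrent.futures
    # module each time the function runs, so a patch applied there is seen.
    from concurrent.futures import ThreadPoolExecutor

    with ThreadPoolExecutor(max_workers=2) as pool:
        # Placeholder "decode" step, not the real decode logic.
        return list(pool.map(lambda s: s * 2, strips))


real_executor = concurrent.futures.ThreadPoolExecutor

# wraps= keeps the real executor behavior while recording constructions.
with mock.patch(
    "concurrent.futures.ThreadPoolExecutor", wraps=real_executor
) as spy:
    result = read_strips_sketch([1, 2, 3])

constructed = spy.call_count
```

Because the executor is resolved at call time from its defining module, the test no longer depends on which xrspatial module happens to host the reader.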

No public API change.

@github-actions github-actions Bot added the performance PR touches performance-sensitive code label May 21, 2026
Contributor Author

@brendancol brendancol left a comment

PR Review: Refactor GeoTIFF Phase 5b, extract _decode.py from _reader.py

Blockers

None. The extraction is mechanical and behavior-preserving. All 5033 geotiff tests pass.

Suggestions

  1. xrspatial/geotiff/_decode.py lines 343-348 and 543-548: the lazy from ._reader import ... inside _read_strips / _read_tiles is a hidden runtime dependency on _reader.py. The module docstring already calls this a leaf module for transport-independent decode, but it doesn't say the module deliberately avoids a top-level _reader import to break a cycle. The inline comments are good; lifting that note into the docstring is a one-liner.

  2. xrspatial/geotiff/_decode.py lines 364 and 521: _read_strips / _read_tiles lost their max_pixels: int annotation when the signature became max_pixels=_MAX_PIXELS_UNSET. _resolve_max_pixels keeps the runtime behavior, but type checkers and IDEs no longer see the int. The sentinel itself is the right shape given MAX_PIXELS_DEFAULT stays in _reader.py. Annotating max_pixels: int = ... (callers should still pass an int) restores the hint at zero cost.

  3. xrspatial/geotiff/_decode.py lines 478 and 678: the function-local from concurrent.futures import ThreadPoolExecutor is the seam tests now patch. The comment says so on the tile path. The strip path got the same comment but lost the #1551 issue cross-reference the tile path has. One-line fix.
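
To illustrate suggestion 2, here is a minimal sketch of a sentinel default that keeps an int annotation. The limit value, the read_strips_sketch function, and the body of _resolve_max_pixels are hypothetical; only the names _MAX_PIXELS_UNSET, _resolve_max_pixels, and MAX_PIXELS_DEFAULT come from the PR, and their real definitions are not shown here.

```python
# Hypothetical value; the real limit lives in _reader.py.
MAX_PIXELS_DEFAULT = 1_000_000

# Unique sentinel meaning "the caller did not pass max_pixels".
_MAX_PIXELS_UNSET = object()


def _resolve_max_pixels(max_pixels):
    # The default is looked up at call time, not at definition time.
    # In this sketch it is a module global; the PR describes the real
    # helper doing a lazy import from _reader instead.
    if max_pixels is _MAX_PIXELS_UNSET:
        return MAX_PIXELS_DEFAULT
    return max_pixels


def read_strips_sketch(
    n_pixels: int,
    # Annotated as int for type checkers and IDEs; the sentinel default
    # needs an ignore because object() is not an int.
    max_pixels: int = _MAX_PIXELS_UNSET,  # type: ignore[assignment]
) -> int:
    limit = _resolve_max_pixels(max_pixels)
    if n_pixels > limit:
        raise ValueError(f"{n_pixels} pixels exceeds limit {limit}")
    return limit


default_limit = read_strips_sketch(10)
explicit_limit = read_strips_sketch(10, max_pixels=50)
```

Callers still see max_pixels: int in signatures and hover text, while the module avoids reading the default at import time.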

Nits

  4. xrspatial/geotiff/_decode.py line 22: import os as _os_module mirrors the alias in _reader.py. There's no os shadow risk inside _decode.py, so plain import os would read fine here. Keeping the alias for grep continuity is also defensible.

  5. xrspatial/geotiff/tests/test_parallel_strip_decode_2100.py: a few patches still target _reader_mod._PARALLEL_DECODE_PIXEL_THRESHOLD. They work because the back-import binds at module load, but the canonical home is now _decode. Switching the patch target (or leaving a comment about why both work) avoids a future puzzle.

  6. xrspatial/geotiff/_decode.py lines 50-60: _resolve_max_pixels does from ._reader import MAX_PIXELS_DEFAULT on every default-value call. It's a sys.modules dict lookup once _reader is loaded, so the cost is microseconds. Mentioning it for completeness, not worth changing.

What looks good

  • Function bodies are byte-identical to the originals, diffed against origin/main.
  • The back-import block in _reader.py keeps every downstream consumer working: _backends/dask.py, _backends/gpu.py, _backends/vrt.py, _vrt.py, _sidecar.py, _gpu_decode.py, _writer.py, the test suite.
  • _decode.py has zero top-level dependency on _reader.py, so the import cycle is broken at module load.
  • Tests that monkey-patched _reader_mod.ThreadPoolExecutor were updated to patch concurrent.futures.ThreadPoolExecutor, which is the more durable seam now that the executor is function-local in both readers.
  • The redundant import os as _os shadow inside the original _read_tiles was cleaned up to use the module-level _os_module binding. Behavior-neutral side benefit.
  • Unused imports dropped from _reader.py: decompress, predictor_decode, fp_predictor_decode, unpack_bits, lerc_decompress_with_mask, COMPRESSION_NONE, extract_geo_info, RASTER_PIXEL_IS_POINT, GeoTransform. All were only used inside the moved functions.
  • _reader.py drops from 2201 to 1407 lines (-794), in the "another ~600-800 lines move" range the issue targets.

Checklist

  • Algorithm matches reference / paper, n/a, mechanical extraction
  • All implemented backends produce consistent results
  • NaN handling is correct (unchanged)
  • Edge cases are covered by tests (5033 passed)
  • Dask chunk boundaries handled correctly (unchanged)
  • No premature materialization or unnecessary copies
  • Benchmark exists or is not needed, n/a
  • README feature matrix updated (if applicable), n/a
  • Docstrings present and accurate (preserved verbatim)

Contributor Author

@brendancol brendancol left a comment

PR Review (round 2): follow-up after review fixes

Blockers

None.

Suggestions

None. The three round-1 suggestions are addressed:

  1. The _decode.py module docstring now calls out that the module deliberately has no top-level _reader import, names the lazily-imported symbols, and points at PR-H (#2247) as the point where the lazy imports collapse back.
  2. _read_strips and _read_tiles got their max_pixels: int annotation back (with a type: ignore[assignment] on the sentinel default).
  3. The strip path's ThreadPoolExecutor comment now cross-references issue #1551.

Nits

The audit on nit #5 from round 1 turned up a real correctness issue worth recording. Several local-path patches of _PARALLEL_DECODE_PIXEL_THRESHOLD were targeting _reader_mod, but after PR-G the live binding _read_strips / _read_tiles reads is in _decode. Patching the _reader back-import alone leaves the live binding in _decode unchanged, so the "force serial" half of the parallel-vs-serial parity tests was silently a no-op (both sides ran parallel and trivially compared equal). Local-path patches in test_parallel_strip_decode_2100.py and test_parallel_strip_decode_sparse_2100.py now target _decode_mod; HTTP-path patches stay on _reader_mod because _fetch_decode_cog_http_* still lives there and uses the back-imported binding.
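
The binding behavior described above can be reproduced in a small sketch. The module names sketch_decode and sketch_reader, the threshold value, and the comparison inside _read_strips are all hypothetical stand-ins, not the real xrspatial code.

```python
import sys
import types
from unittest import mock

# Stand-in for _decode.py: owns the threshold and the function that reads it.
decode_mod = types.ModuleType("sketch_decode")
exec(
    "_PARALLEL_DECODE_PIXEL_THRESHOLD = 1000  # hypothetical value\n"
    "def _read_strips(n_pixels):\n"
    "    # Looks the threshold up in this module's own globals.\n"
    "    if n_pixels >= _PARALLEL_DECODE_PIXEL_THRESHOLD:\n"
    "        return 'parallel'\n"
    "    return 'serial'\n",
    decode_mod.__dict__,
)
sys.modules["sketch_decode"] = decode_mod

# Stand-in for _reader.py: the back-import copies references into a second
# namespace; it does not link the two names together.
reader_mod = types.ModuleType("sketch_reader")
exec(
    "from sketch_decode import _PARALLEL_DECODE_PIXEL_THRESHOLD, _read_strips\n",
    reader_mod.__dict__,
)

huge = 10**12  # intended to force the serial branch

# Patching the back-imported name only rebinds it in sketch_reader.
with mock.patch.object(reader_mod, "_PARALLEL_DECODE_PIXEL_THRESHOLD", huge):
    patched_reader = reader_mod._read_strips(5000)

# Patching the defining module changes what _read_strips actually reads.
with mock.patch.object(decode_mod, "_PARALLEL_DECODE_PIXEL_THRESHOLD", huge):
    patched_decode = reader_mod._read_strips(5000)
```

In this sketch patched_reader stays on the parallel branch while patched_decode takes the serial one, which mirrors the silent no-op the audit found.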

Remaining round-1 nits intentionally not addressed:

  • Nit #4 (import os as _os_module): kept for grep continuity with _reader.py.
  • Nit #6 (_resolve_max_pixels lazy lookup): microsecond optimization in a non-hot path. PR-H will collapse the lazy imports anyway.

What looks good

  • 5033 geotiff tests still pass on the latest commit (pre-existing lz4 writer test deselected, unrelated).
  • The patch-target audit confirms the only _reader_mod._decode_strip_or_tile patches remaining are in HTTP-path tests (test_cog_http_parallel_decode_2026_05_15.py, test_parallel_strip_decode_2100.py line 229/236) where _fetch_decode_cog_http_* calls the back-imported binding.
  • The doc note about PR-H means a future maintainer reading _decode.py cold will not be surprised by the lazy from ._reader import ... inside the read functions.

Checklist

  • Algorithm matches reference / paper, n/a, mechanical extraction
  • All implemented backends produce consistent results
  • NaN handling is correct (unchanged)
  • Edge cases are covered by tests (5033 passed)
  • Dask chunk boundaries handled correctly (unchanged)
  • No premature materialization or unnecessary copies
  • Benchmark exists or is not needed, n/a
  • README feature matrix updated (if applicable), n/a
  • Docstrings present and accurate (preserved verbatim, plus expanded module docstring)

Clean. Nits-only remaining, all by design.

Mechanical extraction of strip/tile decode orchestration out of
_reader.py into a new _decode.py module. Behavior-neutral.

Moves the transport-independent decode helpers:

- _apply_predictor, _packed_byte_count, _int_nodata_in_range,
  _resolve_masked_fill, _decode_strip_or_tile
- _read_strips, _read_tiles (the local strip/tile readers)
- _apply_orientation, _apply_orientation_with_geo,
  _apply_photometric_miniswhite, _miniswhite_inverted_nodata
- _NATIVE_ORDER, _PARALLEL_DECODE_PIXEL_THRESHOLD

Left in _reader.py: top-level entry points (_read_to_array,
read_to_array), the COG-HTTP fetch+decode paths, the pixel-safety
guards (MAX_PIXELS_DEFAULT, _check_dimensions,
_check_source_dimensions, PixelSafetyLimitError), and the sparse-
layout helpers (_sparse_fill_value, _has_sparse,
_compute_full_image_byte_budget, _ifd_required_extent) which move
with _layout.py in PR-H (#2247).

_reader.py drops from 2201 to 1407 lines; the moved code lands in
_decode.py at 897 lines. The full public/internal import surface
from xrspatial.geotiff._reader is preserved via a back-import
block, matching the pattern PR-E used for _sources.

Two strip-decode tests that monkey-patched
``_reader_mod.ThreadPoolExecutor`` now patch
``concurrent.futures.ThreadPoolExecutor`` instead, tracking the
decode functions to their new home. The dispatch contract under
test (parallel branch engages on multi-strip, sparse-only short-
circuits) is unchanged.

Part of #2211.

- Docstring: spell out that _decode.py deliberately has no top-level
  _reader import so the two modules can sit on either side of the
  circular relationship.
- Restore the ``max_pixels: int`` annotation on _read_strips and
  _read_tiles. _MAX_PIXELS_UNSET is still the runtime default so the
  lookup of MAX_PIXELS_DEFAULT can stay lazy.
- Cross-reference issue #1551 in the strip path's ThreadPoolExecutor
  comment so it matches the tile path note.
- Update local-path _PARALLEL_DECODE_PIXEL_THRESHOLD patches in the
  parallel-strip tests to target ``_decode`` instead of ``_reader``.
  After PR-G the live binding _read_strips reads is in _decode; the
  back-imported name in _reader is a separate reference, so patching
  _reader alone silently no-ops the "force serial" half of the
  parallel-vs-serial parity tests. HTTP-path patches stay on _reader
  because _fetch_decode_cog_http_strips still lives there and uses
  the back-imported binding.

The previous assertion required coalesced wall time to be less than half
the baseline, which couples the test to per-tile decode cost on the
runner. On macOS arm64 in CI the constant decode overhead is large enough
that the ratio fails even when coalescing correctly saves the expected
~7 RTTs of network time.

Assert on absolute RTTs saved instead, which is what coalescing actually
controls.
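
A small worked sketch of why the ratio assertion is runner-dependent while an absolute-RTT assertion is not. All timings below are made-up illustrative numbers in milliseconds, and rtts_saved is a hypothetical helper, not the test's real code.

```python
RTT_MS = 50              # hypothetical simulated round-trip time
EXPECTED_RTTS_SAVED = 7  # matches the "~7 RTTs" figure in the message above


def rtts_saved(baseline_ms, coalesced_ms, rtt_ms=RTT_MS):
    # Network time removed by coalescing, expressed in round trips.
    return (baseline_ms - coalesced_ms) / rtt_ms


def passes_ratio_check(baseline_ms, coalesced_ms):
    # The old style of assertion: coalesced must be under half the baseline.
    return coalesced_ms < baseline_ms / 2


# Fast runner: negligible decode cost on top of the network time.
fast = {"baseline_ms": 450, "coalesced_ms": 100}

# Slow runner: the same network behavior plus a constant 500 ms of decode
# overhead added to both measurements.
slow = {"baseline_ms": 950, "coalesced_ms": 600}

fast_ratio_ok = passes_ratio_check(**fast)
slow_ratio_ok = passes_ratio_check(**slow)
fast_saved = rtts_saved(**fast)
slow_saved = rtts_saved(**slow)
```

Under these assumed numbers the constant overhead cancels out of the difference, so both runners show the same RTT savings, while the ratio check passes only on the fast one.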
@brendancol brendancol force-pushed the issue-2246-decode-extraction branch from bd13f0e to 20bdb28 Compare May 21, 2026 15:46
@brendancol brendancol merged commit 709c814 into main May 21, 2026
4 of 5 checks passed

Labels

performance PR touches performance-sensitive code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Refactor GeoTIFF Phase 5b: extract _decode.py from _reader.py (PR-G of #2211)

1 participant