Accumulate color_ramp stats during the streaming dask write by brendancol · Pull Request #3600 · xarray-contrib/xarray-spatial

brendancol · 2026-07-02T03:09:07Z

to_geotiff(dask_data, path, color_ramp=...) executed the source graph twice: the streaming write computed every chunk, then _finite_stats ran dask.compute over the same source to get the sidecar statistics. A counting map_blocks layer on a 16-chunk source measured 32 chunk executions. After this change it's 16.

_write_streaming takes an optional chunk_observer callback, called with each materialized buffer at its three materialization sites (row band, wide-raster column segment, strip band), after the CuPy transfer and before the dtype cast / NaN-to-sentinel restore. The buffers partition the raster, so the observer sees each pixel exactly once, even on the segmented wide path where source chunks can be computed more than once.
A new StreamingStats accumulator in _symbology.py folds those buffers into min/max/mean/stddev (Chan's parallel mean/M2 combine, float64 accumulators, population stddev, same finite/nodata exclusion as _finite_stats). write_symbology_sidecars accepts it through a new stats= parameter.
The GPU (gpu=True) and VRT write paths keep the separate stats pass: the GPU writer fully materializes anyway, and VRT writes per-tile through a different code path. Docstrings now say which paths pay the extra pass and where color_ramp_range still helps.

Backend coverage: dask+numpy and dask+cupy (gpu=False) go through the accumulator. numpy/cupy eager writes are unchanged; the data is already in memory there and stats are one cheap pass.

Bench on this machine: 8192x8192 float32 deflate round trip (read chunks=1024, write with stats) went from 1.53s to 1.44s. The bigger effect is that total source reads halve, which matters when the source graph is expensive to re-execute (HTTP COGs, cold storage, long compute pipelines).

Test plan:

13 new tests in test_color_ramp_single_pass_3597.py: execution counters on the row-band, strip, and segmented wide paths; stats parity against _finite_stats and plain numpy; nodata sentinel and int dtype exclusion; all-NaN and multiband gating; color_ramp_range still skips stats; a dask+cupy gpu=False streaming leg (run on a CUDA host).
Full geotiff write suite passes (1216 passed, 1 skipped); whole geotiff suite 5371 passed.
flake8 + isort clean on touched files.

to_geotiff(dask_data, path, color_ramp=...) executed the source graph twice: once in _write_streaming for the pixels and again in _finite_stats for the sidecar statistics (measured 32 chunk executions for a 16-chunk source). The streaming writer already materialises every pixel exactly once, so thread an optional chunk_observer through its three materialisation sites (row band, wide-raster segment, strip band) and fold the buffers into a StreamingStats accumulator (Chan's parallel mean/M2 combine, float64 accumulators, population stddev to match _finite_stats). The GPU and VRT write paths keep the separate stats pass; docs updated to say which paths pay it.

brendancol

PR Review: Accumulate color_ramp stats during the streaming dask write

Blockers (must fix before merge)

None found.

Suggestions (should fix, not blocking)

xrspatial/geotiff/_symbology.py:271 -- n_b = vals.size is a numpy scalar (np.intp), so self._count becomes a numpy int after the first update. The moment math is safe (every product folds into float64 left-to-right), but coercing with int(vals.size) keeps the running count in unbounded Python ints and removes any doubt about numpy-scalar arithmetic at very large pixel counts.
xrspatial/geotiff/tests/write/test_color_ramp_single_pass_3597.py:203 -- rel=1e-12 compares Chan's combine against dask's nanstd (a different float64 reduction order). It passes here with margin, but two distinct algorithms agreeing to 1e-12 is tighter than the test needs; 1e-9 would still catch real regressions without inviting a platform-specific flake.

Nits (optional improvements)

xrspatial/geotiff/_writer.py:1219 (and the two sibling sites) -- the observer call is three near-identical two-line blocks. Fine as is; a tiny _observe(buf) local would make the partition contract greppable from one place, but the writer's style leans explicit, so either way.

What looks good

The observer hook sits after the CuPy .get() / np.asarray and before the dtype cast and NaN-to-sentinel restore, so the accumulator sees logical values on all three paths (row band _writer.py:1218, wide segment _writer.py:1269, strip band _writer.py:1392), and the buffers partition the raster even when the segmented path recomputes source chunks. The segmented-path test pins exactly this with full-width chunks.
StreamingStats matches _finite_stats semantics point for point: finite-only, nodata excluded unless NaN, population stddev, None on empty. The NaN-nodata normalization in __init__ mirrors the has_nodata gate in _finite_stats.
The wiring is gated so nobody pays who didn't opt in: color_ramp_range set, multiband, categorical, pack=True, and the GPU/VRT paths all keep their previous behavior, and the docstrings now say which paths still run the separate stats pass.
Execution-count tests use a zero-size-guarded counter under a lock, so they are robust to dask meta-inference calls and threaded schedulers.
13 new tests cover row-band, strip, and segmented paths, int/nodata/all-NaN/multiband gating, accumulator parity, and a dask+cupy gpu=False leg run on a CUDA host.

Checklist

Algorithm matches reference (Chan et al. parallel mean/M2 combine; population stddev matches GDAL's STATISTICS_STDDEV convention)
All implemented backends produce consistent results (dask+numpy, dask+cupy via streaming; eager numpy/cupy unchanged)
NaN handling is correct (finite-only accumulation; all-NaN write emits no sidecars)
Edge cases are covered by tests (int dtypes, nodata sentinel, constant buffers, empty accumulator)
Dask chunk boundaries handled correctly (buffers partition the raster; recomputed chunks on the segmented path do not double-count)
No premature materialization or unnecessary copies (all-valid buffers skip the boolean-index copy; no astype(float64) copy)
Benchmark exists or is not needed (no ASV geotiff benchmarks exist; execution counts are pinned by tests instead)
README feature matrix updated (not applicable -- no new public function)
Docstrings present and accurate (chunk_observer, stats=, and the color_ramp/color_ramp_range path notes)

)

brendancol

Follow-up review after 94f20a5:

Suggestion 1 (numpy-scalar count): fixed. update now coerces n_b = int(vals.size) so the running count and the moment arithmetic stay in Python ints / float64.
Suggestion 2 (tolerance): fixed. The two cross-algorithm StreamingStats vs _finite_stats assertions now use rel=1e-9.
Nit (shared _observe helper in _writer.py): dismissed. Three explicit two-line guards match the writer's existing style of repeating the get/asarray/astype/restore blocks per path; a closure would save four lines and add indirection.

38 symbology/color-ramp tests pass after the changes; flake8/isort clean. No remaining findings.

…e-geotiff-2026-07-01

/#3598/#3600 docs + sibling CSV rows + #3595 fix)

deep-sweep security pass 2026-07-02 against xrspatial/geotiff. Delta since 2026-07-01 is write-path/docs only (#3595, #3600, #3592, #3593, #3588); no untrusted-input parser changed. Re-verified the read/parse surface (header caps, safe_xml DOCTYPE rejection, cog_http byte caps, sidecar SSRF hardening, decompression-bomb cap, VRT path-traversal containment, LZW/GPU kernel bounds). No new CRITICAL/HIGH/MEDIUM. Only state CSV updated.

brendancol added 2 commits July 1, 2026 23:07

Update performance sweep state for geotiff pass 16 (#3597)

fba4321

brendancol added performance PR touches performance-sensitive code geotiff GeoTIFF module dask Dask backend / chunked arrays sweep-performance Found by /sweep-performance labels Jul 2, 2026

brendancol commented Jul 2, 2026

View reviewed changes

Address review: plain-int count, loosen cross-algorithm tolerance (#3597

94f20a5

)

brendancol commented Jul 2, 2026

View reviewed changes

Merge remote-tracking branch 'origin/main' into deep-sweep-performanc…

094734e

…e-geotiff-2026-07-01

brendancol merged commit a7e94a1 into main Jul 2, 2026
10 checks passed

brendancol added a commit that referenced this pull request Jul 2, 2026

Supersede af18dfe: keep the both-sides conflict resolution (main's #3596

9e012c7

/#3598/#3600 docs + sibling CSV rows + #3595 fix)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Accumulate color_ramp stats during the streaming dask write#3600

Accumulate color_ramp stats during the streaming dask write#3600
brendancol merged 4 commits into
mainfrom
deep-sweep-performance-geotiff-2026-07-01

brendancol commented Jul 2, 2026

Uh oh!

brendancol left a comment

Uh oh!

brendancol left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

brendancol commented Jul 2, 2026

Uh oh!

brendancol left a comment

Choose a reason for hiding this comment

PR Review: Accumulate color_ramp stats during the streaming dask write

Blockers (must fix before merge)

Suggestions (should fix, not blocking)

Nits (optional improvements)

What looks good

Checklist

Uh oh!

brendancol left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant