Performance and correctness fixes for the chunked-read path. These are user-impacting: the previous behaviour turned `ov.pp.scale` on a million-cell h5ad into a roughly one-day operation, while the post-fix wall-clock is on the order of minutes for the same shape.
Fixes
- `chunked()` on `TransformedBackedArray`: O(n²) → O(n). The wrapped (normalize/log1p) chunked iterator was making each chunk read scan from row 0 of the Rust backend; combined with a stale `_read_rows` fallback this was quadratic in the number of chunks. (#3)
- `BackedArray._read_rows`: use `elem[s:e]` as the primary slice path (the comment claiming `PyArrayElem` has no `__getitem__` was outdated); fall back to the chunked scan only on exception. (#3)
- `chunked_scale` refuses re-scale on an already-scaled X (the `ScaledBackedArray` subclass check that silently dropped the prior mean/std). (#3)
- `chunked_normalize_total` NaN guard: `np.nan_to_num(..., nan=1.0, posinf=1.0, neginf=1.0)` on `norm_factors` before the `== 0` check, before storing on `adata.obs`. (#3)
- `chunked_mean_var` sparse-native: `E[X²] − E[X]²` per batch via `chunk.multiply(chunk).sum(axis=0)` + Welford merge across batches. Eliminates the per-chunk densification that dominated wall-clock at atlas scale. (#3)
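For readers who want to see the shape of the sparse-native mean/variance computation, here is a minimal sketch of the idea, not the shipped implementation. The function names (`batch_stats`, `merge`, `chunked_mean_var_sketch`), the `chunk_size` parameter, and the choice of population variance (ddof=0) are assumptions made for illustration. It also assumes a non-empty SciPy sparse matrix that supports row slicing.

```python
import numpy as np
import scipy.sparse as sp


def batch_stats(chunk):
    """Per-column (count, mean, M2) for one sparse chunk, without densifying.

    M2 is the sum of squared deviations from the chunk mean.
    """
    n = chunk.shape[0]
    s1 = np.asarray(chunk.sum(axis=0), dtype=np.float64).ravel()
    # Element-wise square stays sparse, so no dense copy of the chunk is made.
    s2 = np.asarray(chunk.multiply(chunk).sum(axis=0), dtype=np.float64).ravel()
    mean = s1 / n
    # E[X^2] - E[X]^2, clipped at zero to absorb floating-point round-off.
    var = np.maximum(s2 / n - mean ** 2, 0.0)
    return n, mean, var * n


def merge(a, b):
    """Combine two (count, mean, M2) summaries (parallel Welford update)."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * (n_b / n)
    m2 = m2_a + m2_b + delta ** 2 * (n_a * n_b / n)
    return n, mean, m2


def chunked_mean_var_sketch(X, chunk_size=1000):
    """Column means and population variances of X, one row chunk at a time."""
    acc = None
    for start in range(0, X.shape[0], chunk_size):
        stats = batch_stats(X[start:start + chunk_size])
        acc = stats if acc is None else merge(acc, stats)
    n, mean, m2 = acc
    return mean, m2 / n
```

Merging per-batch summaries rather than accumulating raw global sums of `X` and `X²` keeps the cross-batch step numerically stable, while each batch still only needs two sparse column sums.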
Tests
9 regression tests added in tests/test_chunked_perf.py + tests/test_chunked_correctness.py: position-independent slice reads, wrapped-vs-raw chunked ratio < 5×, chunk-content equivalence, double-scale refusal, zero-count cell NaN guard, sparse-native mean/var matches dense-Welford within float32-origin tolerance.
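As an illustration of what the zero-count cell NaN guard protects against, the following sketch applies the same `np.nan_to_num` call and `== 0` check to a small vector of per-cell factors. The helper name `safe_norm_factors` and the definition of the factors as `counts / target_sum` are assumptions for this example, not the library's code.

```python
import numpy as np


def safe_norm_factors(counts_per_cell, target_sum):
    """Hypothetical helper: per-cell divisors that never contain NaN, inf or 0."""
    counts = np.asarray(counts_per_cell, dtype=np.float64)
    # A zero or non-finite target_sum would otherwise produce NaN/inf here.
    with np.errstate(divide="ignore", invalid="ignore"):
        factors = counts / target_sum
    # Replace non-finite values first, then handle exact zeros.
    factors = np.nan_to_num(factors, nan=1.0, posinf=1.0, neginf=1.0)
    factors[factors == 0] = 1.0
    return factors


# The middle cell has zero counts; its factor becomes 1.0 instead of 0.0,
# so dividing that cell's row by the factor leaves it at zero rather than NaN.
factors = safe_norm_factors([10.0, 0.0, 5.0], target_sum=5.0)
```

With these inputs `factors` is `[2.0, 1.0, 1.0]`, and every entry stays finite even when `target_sum` itself is zero.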
Wheels
Linux x86_64 · macOS x86_64 · macOS aarch64 + sdist. Trusted-publisher upload to PyPI.