Compute the Getis-Ord Gi* statistic in emerging_hotspots()#3040
Conversation
emerging_hotspots() carried its own copy of the local-mean z-score bug that hotspots() had before #2803: it convolved each time step with a normalized kernel and z-scored the local mean against the global mean and std, dropping the neighborhood weight sum, the squared weight sum, the sample count n, and the variance adjustment term. Replace the per-step local-mean z-score with the real Gi* z-score, reusing the _gistar_zscore and _gistar_global_stats helpers from #2803. The global mean, population std, and valid-cell count n are taken over the whole space-time cube; the per-step weight sums come from convolving the validity mask with the kernel and squared kernel, so raster edges and interior NaN cells drop out of every neighborhood sum. The dask backends now run the per-step convolutions through the dispatching convolve_2d (matching the hotspots() dask path) instead of a manual map_overlap, so the dask edge handling lines up with numpy and cupy. Empty-neighborhood cells stay NaN so the Mann-Kendall step can still detect all-NaN pixels. The Mann-Kendall trend classification is unchanged.
brendancol
left a comment
There was a problem hiding this comment.
PR Review: Compute the Getis-Ord Gi* statistic in emerging_hotspots()
This swaps the per-step local-mean z-score for the real Gi* statistic, following the #2803 fix to hotspots(). I read the full emerging_hotspots.py and the focal.py Gi* helpers it reuses, ran the existing suite plus the new tests, and checked numpy/dask parity across all four boundary modes.
Blockers (must fix before merge)
None.
Suggestions (should fix, not blocking)
None.
Nits (optional improvements)
- benchmarks/benchmarks/focal.py has a time_hotspots benchmark but none for emerging_hotspots. Pre-existing gap, not introduced here, and not needed for a bugfix. Worth a follow-up only if the function's runtime starts to matter.
- _gistar_step_zscore (emerging_hotspots.py:130) uses cupy.get_array_module guarded by
cupy is not None, while _gistar_step_dask uses da.* directly. Same convolution terms, two idioms. Fine as is, but a one-line comment on why the single-array path needs the xp switch and the dask path does not would save the next reader a double-take.
What looks good
- The core fix is correct. Global mean, population std, and valid-cell count n come from the whole space-time cube, and the per-step weight sums come from convolving the validity mask with the kernel and squared kernel. The per-step output matches a hand-rolled Gi* reference to 1e-5 for both a binary and a weighted kernel.
- Reworking the dask backends to call convolve_2d instead of a manual map_overlap was the right move. It drops the edge-handling discrepancy the old map_overlap path had: numpy and dask now agree to 2.4e-7 on the interior with identical NaN positions for nan, nearest, reflect, and wrap.
- Keeping empty-neighborhood cells as NaN in _gistar_step_finalize is well reasoned and tested. A plain 0 would have broken the all-NaN-pixel detection in the Mann-Kendall step.
- Orphaned imports from the removed map_overlap path (_convolve_2d_numpy, _convolve_2d_cupy, _boundary_to_dask) were cleaned up, and the pre-existing unused not_implemented_func import was left alone, which is the right call for a surgical bugfix.
- New tests cover the binary kernel, the weighted kernel where W != W2, a regression guard that the output no longer matches the old local-mean formula, and interior-NaN exclusion.
Checklist
- Algorithm matches reference: yes, the Gi* formula from #2803
- All implemented backends consistent: yes, numpy/dask verified, cupy paths reuse the same helpers
- NaN handling correct: yes
- Edge cases covered by tests: yes
- Dask chunk boundaries handled correctly: yes, via convolve_2d dispatch
- No premature materialization: yes, global stats computed once with a single da.compute
- Benchmark exists or not needed: not needed for a bugfix
- README feature matrix updated: not applicable, no new function
- Docstrings present and accurate: yes, unchanged and still correct
brendancol
left a comment
There was a problem hiding this comment.
Follow-up review (after 86fc634)
Re-reviewed after the follow-up commit. The only change is a 4-line comment in _gistar_step_zscore explaining why the single-array path picks numpy vs cupy via cupy.get_array_module while the dask path uses da.* directly. That clears the one nit worth fixing in-PR.
The other nit (no emerging_hotspots benchmark in benchmarks/benchmarks/focal.py) is deferred: it is a pre-existing gap, the benchmark file is not touched by this PR, and adding one would push past the scope of a bugfix.
No blockers, no suggestions, nothing else outstanding. All 49 tests in test_emerging_hotspots.py pass locally.
Closes #2804
emerging_hotspots()carried its own copy of the local-mean z-score bug thathotspots()had before #2803. It convolved each time step with a normalized kernel and z-scored the local mean against the global mean and std, which drops the neighborhood weight sum, the squared weight sum, the sample count n, and the variance adjustment term.What changed:
_gistar_zscoreand_gistar_global_statshelpers from Implement the Getis-Ord Gi* statistic in hotspots() #2803. The global mean, population std, and valid-cell count n are taken over the whole space-time cube; the per-step weight sums come from convolving the validity mask with the kernel and squared kernel, so raster edges and interior NaN cells drop out of every neighborhood sum.convolve_2d(matching thehotspots()dask path) instead of a manualmap_overlap, so the dask edge handling matches numpy and cupy.Backends: numpy, cupy, dask+numpy, dask+cupy all use the same Gi* math through the existing
ArrayTypeFunctionMappingdispatch.Test plan:
pytest xrspatial/tests/test_emerging_hotspots.py xrspatial/tests/test_focal.pypass (327 tests).