Describe the bug
hand_mfd._hand_mfd_dask (xrspatial/hydro/hand_mfd.py L600-628) builds the
final result by computing every tile via blocks[iy, ix].compute() into a
list-of-lists of numpy arrays, then calls da.block(rows) to assemble. As a result,
peak driver memory holds the full output array rather than one chunk at a time.
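For illustration, a minimal sketch of that eager pattern; the helper name and the simplified loop are mine, only the blocks[iy, ix].compute() / da.block(rows) structure comes from the report:

```python
import dask.array as da

def assemble_eagerly(result_da: da.Array) -> da.Array:
    # Sketch of the problematic assembly: every tile is pulled onto the driver
    # as a numpy array before da.block re-wraps them, so the whole grid is
    # resident in driver memory at once.
    rows = []
    for iy in range(result_da.numblocks[0]):
        row = []
        for ix in range(result_da.numblocks[1]):
            row.append(result_da.blocks[iy, ix].compute())  # materializes the tile
        rows.append(row)
    return da.block(rows)  # laziness is already lost at this point
```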
The sister implementation hand_dinf._hand_dinf_dask (xrspatial/hydro/hand_dinf.py
L605-624) handles the same step correctly: it uses da.map_blocks so each
tile is recomputed lazily from the converged boundary state.
The boundary-propagation phase before assembly (L577-596) is fine — it iterates
per-tile via _process_tile_hand_mfd which holds at most one tile in RAM.
The bug is purely in the assembly phase.
Expected behavior
hand_mfd on a dask DataArray returns a lazy dask result whose memory cost
scales with chunk size, not with the full grid — matching hand_dinf and the
rest of the hydro subpackage.
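A hedged check of that expectation; the import path and argument list below are assumptions for illustration, not the module's confirmed API:

```python
import dask.array as da
import xarray as xr
from xrspatial.hydro.hand_mfd import hand_mfd  # import path assumed from the file layout above

# Small synthetic inputs; real grids would be far larger.
elev = xr.DataArray(da.random.random((8_000, 8_000), chunks=(1_000, 1_000)), dims=("y", "x"))
fa = xr.DataArray(da.random.random((8_000, 8_000), chunks=(1_000, 1_000)), dims=("y", "x"))

result = hand_mfd(elev, fa)  # argument list is a guess, for illustration only
# The result should still be dask-backed; computing it should peak at roughly
# one chunk's worth of memory per task, not the full grid on the driver.
assert isinstance(result.data, da.Array)
```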
Affected backends
dask+numpy: directly affected.
dask+cupy: affected via _hand_mfd_dask_cupy (L631-647), which converts
to numpy dask, calls _hand_mfd_dask, then converts back.
Severity
HIGH. On a 30 TB grid this materializes 30 TB on the driver before output.
Tests use 3x3 chunks, so the issue does not surface in CI.
Fix sketch
Replace the assembly loop with da.map_blocks over (fa_da, elev_da), slicing
fractions_da[:, y_start:y_end, x_start:x_end] per tile inside the closure
(or rechunk fractions once up front to align with fa_da chunks).
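A minimal sketch of that fix, assuming fa_da and elev_da are 2-D dask arrays with matching chunks, fractions is an (n_dir, ny, nx) array each task can slice, and _solve_tile is a hypothetical stand-in for the existing per-tile kernel; the rechunk-up-front variant would instead pass a chunk-aligned fractions array to map_blocks as a third argument:

```python
import dask.array as da
import numpy as np

def _solve_tile(fa_block, elev_block, frac_block, boundary_state):
    # Hypothetical stand-in for the real per-tile HAND solve; it only needs to
    # return an array of the tile's shape for this sketch to run.
    return np.full(elev_block.shape, np.nan, dtype=elev_block.dtype)

def assemble_lazily(fa_da, elev_da, fractions, boundary_state):
    def tile(fa_block, elev_block, block_info=None):
        # block_info[0]["array-location"] gives this tile's (start, stop) extents,
        # so the matching window of the fractions array is sliced per tile instead
        # of materializing the whole grid.
        (y_start, y_end), (x_start, x_end) = block_info[0]["array-location"]
        frac_block = fractions[:, y_start:y_end, x_start:x_end]
        return _solve_tile(fa_block, elev_block, frac_block, boundary_state)

    # Each output chunk is produced lazily, so peak memory follows chunk size.
    return da.map_blocks(
        tile, fa_da, elev_da,
        meta=np.empty((0, 0), dtype=elev_da.dtype),
    )
```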
Additional context
Found during the 2026-05-01 hydro performance sweep. CSV updated:
hydro row now reads oom_verdict=WILL OOM, bottleneck=memory-bound,
high_count=1.