Add preview() for memory-safe downsampling of large dask arrays #986

@brendancol

Description

Author of Proposal: Community request

Reason or problem

You have a 30TB dask-backed raster and want to see what it looks like. You can't .compute() it on a machine with 16GB of RAM, so users write throwaway slicing hacks that lose coordinates and mishandle NaN.

Proposal

Add a preview() function that downsamples a DataArray (or Dataset) to a target pixel size (e.g. 1000x1000) without blowing memory.

Design:

Use xarray's coarsen with block averaging. For dask arrays this stays lazy: each chunk is reduced independently, so peak memory is roughly the largest chunk plus the output array. NumPy/CuPy arrays already fit in memory, so the same operation applies with no special handling.
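The laziness claim is easy to demonstrate with plain xarray/dask (array sizes and the factor-of-10 window here are illustrative):

```python
import numpy as np
import dask.array as da
import xarray as xr

# A large raster backed by dask; nothing is materialized yet.
arr = xr.DataArray(
    da.random.random((10_000, 10_000), chunks=(1_000, 1_000)),
    dims=("y", "x"),
    coords={"y": np.arange(10_000), "x": np.arange(10_000)},
)

# Coarsen by a factor of 10 along each dim. boundary="trim" drops any
# ragged edge instead of raising on non-divisible shapes.
small = arr.coarsen(y=10, x=10, boundary="trim").mean()

# Still lazy: the result is a dask array, and coordinates were
# downsampled (block-averaged) alongside the data.
print(type(small.data))
print(small.shape)  # (1000, 1000)
```

Only on `small.compute()` (or `.plot()`) do chunks get reduced, one at a time.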

Accepts both xr.DataArray and xr.Dataset. For Datasets, each data variable is independently downsampled via the existing @supports_dataset decorator, then collected back into a smaller Dataset. Same memory guarantees apply.
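`@supports_dataset` is the project's existing decorator; as a rough stand-in for how the Dataset path would behave, it can be sketched as applying the DataArray function to every data variable (the decorator and `downsample` names here are hypothetical):

```python
import numpy as np
import xarray as xr

def supports_dataset(func):
    # Hypothetical stand-in for xrspatial's decorator: map a
    # DataArray function over every data variable of a Dataset.
    def wrapper(obj, *args, **kwargs):
        if isinstance(obj, xr.Dataset):
            return obj.map(lambda v: func(v, *args, **kwargs))
        return func(obj, *args, **kwargs)
    return wrapper

@supports_dataset
def downsample(agg, factor=2):
    # Block-average every dimension by the same factor.
    return agg.coarsen({d: factor for d in agg.dims}, boundary="trim").mean()

ds = xr.Dataset({
    "elevation": (("y", "x"), np.arange(16.0).reshape(4, 4)),
    "slope": (("y", "x"), np.ones((4, 4))),
})
small = downsample(ds, factor=2)
print(small["elevation"].shape)  # (2, 2)
```

Each variable is reduced independently, so the per-chunk memory bound holds for Datasets too.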

Backend support:

  • NumPy: coarsen().mean()
  • CuPy: stride-based subsampling (xarray coarsen has edge cases with cupy)
  • Dask+NumPy: coarsen().mean(), lazy, won't OOM
  • Dask+CuPy: coarsen().mean(), lazy, stays on GPU

Returns a small xr.DataArray (or xr.Dataset) with the coordinates downsampled to match.

Usage:

import xarray as xr
import xrspatial

big = xr.open_zarr("huge_dem.zarr")["elevation"]  # 30TB dask array
small = xrspatial.preview(big, width=1000)  # ~8MB output
small.plot()

# Also works with Datasets
ds = xr.open_zarr("huge_dem.zarr")  # multiple variables
small_ds = xrspatial.preview(ds, width=1000)
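A minimal sketch of how preview() could compute the coarsening factors from a target width (the function body, parameter names, and defaults are assumptions for illustration, not the final API):

```python
import math
import numpy as np
import xarray as xr

def preview(agg, width=1000, height=None):
    # Hypothetical sketch: block-average down to roughly width x height
    # pixels, keeping (downsampled) coordinates.
    ydim, xdim = agg.dims[-2], agg.dims[-1]
    height = height or width
    # Integer coarsening factors; 1 means the axis is already small enough.
    fx = max(1, math.ceil(agg.sizes[xdim] / width))
    fy = max(1, math.ceil(agg.sizes[ydim] / height))
    if fx == 1 and fy == 1:
        return agg
    return agg.coarsen({ydim: fy, xdim: fx}, boundary="trim").mean()

big = xr.DataArray(
    np.random.rand(4000, 6000),
    dims=("y", "x"),
    coords={"y": np.linspace(0, 1, 4000), "x": np.linspace(0, 1, 6000)},
)
small = preview(big, width=1000)
print(dict(small.sizes))  # {'y': 1000, 'x': 1000}
```

For dask-backed input, nothing in this sketch forces computation; the caller decides when to materialize (e.g. via .plot()).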

Stakeholders and impacts

Anyone working with large dask rasters. No changes to existing functions.

Drawbacks

Block averaging blurs fine detail. That's the tradeoff with any downsampling for preview.

Alternatives

  • canvas_like() exists but needs datashader and materializes the full array.
  • Slicing the underlying array with [::stride_y, ::stride_x] works but loses coordinates and doesn't average (so it aliases).
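The contrast between the slicing hack and coarsen is concrete (a small synthetic raster for illustration):

```python
import numpy as np
import xarray as xr

arr = xr.DataArray(
    np.arange(100.0).reshape(10, 10),
    dims=("y", "x"),
    coords={"y": np.linspace(0, 90, 10), "x": np.linspace(0, 90, 10)},
)

# Slicing the raw ndarray drops coordinates entirely...
hacked = arr.values[::2, ::2]        # plain numpy, no coords, no averaging

# ...while coarsen keeps them, block-averaged onto the new grid.
small = arr.coarsen(y=2, x=2).mean()
print(type(hacked))                  # numpy.ndarray
print(small.coords["x"].values[:2])  # [ 5. 25.]
```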

Unresolved questions

  • Whether to support other aggregation methods (min, max, median) beyond mean.


Labels: enhancement (New feature or request)