shoyer/xarray-beam

Xarray-Beam

Xarray-Beam is a Python library for building Apache Beam pipelines with Xarray datasets.

The project aims to facilitate data transformations and analysis on large-scale multi-dimensional labeled arrays, such as:

  • Ad-hoc computation on Xarray data, by dividing an xarray.Dataset into many smaller pieces ("chunks").
  • Adjusting array chunks, using the Rechunker algorithm.
  • Ingesting large multi-dimensional array datasets into an analysis-ready, cloud-optimized format, namely Zarr (see also Pangeo Forge).
  • Calculating statistics (e.g., "climatology") across distributed datasets with arbitrary groups.
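
To make the last item concrete, here is a minimal sketch of why grouped statistics such as a climatology fit a chunked pipeline: each chunk is reduced to per-group partial sums and counts, and those partials can be merged in any order before the final division. This is plain Python, not Xarray-Beam's API; the function names and the (group, value) record layout are assumptions made for illustration.

```python
from collections import defaultdict


def partial_sums(chunk):
    """Reduce one chunk of (group, value) records to {group: (sum, count)}."""
    acc = defaultdict(lambda: (0.0, 0))
    for group, value in chunk:
        total, count = acc[group]
        acc[group] = (total + value, count + 1)
    return dict(acc)


def merge_partials(partials):
    """Combine per-chunk partial results; the order of chunks does not matter."""
    merged = {}
    for partial in partials:
        for group, (total, count) in partial.items():
            prev_total, prev_count = merged.get(group, (0.0, 0))
            merged[group] = (prev_total + total, prev_count + count)
    return merged


def finalize(merged):
    """Turn merged (sum, count) pairs into per-group means."""
    return {group: total / count for group, (total, count) in merged.items()}


# Two hypothetical chunks of (month, value) records.
chunks = [
    [(1, 2.0), (1, 4.0), (2, 10.0)],
    [(1, 6.0), (2, 20.0), (3, 5.0)],
]
climatology = finalize(merge_partials(partial_sums(c) for c in chunks))
```

Because only small (sum, count) pairs cross chunk boundaries, no single worker ever needs to hold the full dataset.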

Xarray-Beam is implemented as a thin layer on top of existing libraries for working with large-scale Xarray datasets. For example, it leverages Dask for describing lazy arrays and for executing multi-threaded computation on a single machine.

🚨 Warning: Xarray-Beam is new and unpolished 🚨

Expect sharp edges 🔪 and performance cliffs 🧗, particularly related to the management of lazy data with Dask and reading/writing data with Zarr. We have used it to efficiently process 5 TB datasets. We expect it to scale to PB-size datasets, but that's easier said than done. We welcome feedback and contributions from early adopters, and hope to have it ready for a wider audience soon.

How does Xarray-Beam compare to Dask?

We love Dask! Xarray-Beam explores a different part of the design space for distributed data pipelines than Xarray's built-in Dask integration:

  • Xarray-Beam is built around explicit manipulation of (ChunkKey, xarray.Dataset) pairs to perform operations on distributed datasets, where ChunkKey is an immutable dict keeping track of the offsets from the origin for a small contiguous "chunk" of a larger distributed dataset. This requires more boilerplate but is also more robust than generating distributed computation graphs in Dask using Xarray's built-in API. The user is expected to have a mental model for how their data pipeline is distributed across many machines.
  • Xarray-Beam distributes datasets by splitting them into many xarray.Dataset chunks, rather than the chunks of NumPy arrays typically used by Xarray with Dask (unless using xarray.map_blocks). Chunks of datasets are a more convenient data model for writing ad-hoc whole-dataset transformations, but are potentially a bit less efficient.
  • Beam (like Spark) was designed around a higher-level model for distributed computation than Dask (although Dask has been making progress in this direction). Roughly speaking, this trade-off favors scalability over flexibility.
  • Beam allows for executing distributed computation using multiple runners, notably including Google Cloud Dataflow and Apache Spark. These runners are more mature than Dask, and in many cases are supported as a service by major commercial cloud providers.
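
The idea of keying each chunk by its offsets from the origin can be sketched in plain Python. The helpers below are hypothetical and are not Xarray-Beam's ChunkKey implementation; a read-only MappingProxyType stands in for the immutable dict described above.

```python
import itertools
from types import MappingProxyType


def iter_chunk_offsets(sizes, chunks):
    """Yield one read-only offsets mapping per chunk of a larger array.

    sizes: dimension name -> total length, e.g. {"time": 6, "x": 4}
    chunks: dimension name -> chunk length along that dimension
    """
    dims = list(sizes)
    # Offsets from the origin at which chunks start along each dimension.
    starts = [range(0, sizes[d], chunks.get(d, sizes[d])) for d in dims]
    for combo in itertools.product(*starts):
        # A read-only view, standing in for an immutable dict of offsets.
        yield MappingProxyType(dict(zip(dims, combo)))


def chunk_slices(offsets, sizes, chunks):
    """Translate an offsets mapping into the slices that select that chunk."""
    return {
        d: slice(o, min(o + chunks.get(d, sizes[d]), sizes[d]))
        for d, o in offsets.items()
    }


sizes = {"time": 6, "x": 4}
chunks = {"time": 4, "x": 2}
keys = [dict(k) for k in iter_chunk_offsets(sizes, chunks)]
last_slices = chunk_slices(keys[-1], sizes, chunks)
```

With a real dataset, passing such slices to xarray.Dataset.isel would select the corresponding chunk, so each (offsets, chunk) pair carries enough information to place the chunk back into the larger dataset.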

Figure: Xarray-Beam data model vs Xarray-Dask

These design choices are not set in stone. In particular, in the future we could imagine writing a high-level xarray_beam.Dataset that emulates the xarray.Dataset API, similar to the popular high-level DataFrame APIs in Beam, Spark and Dask. This could be built on top of the lower-level transformations currently in Xarray-Beam, or alternatively could use a "chunks of NumPy arrays" representation similar to that used by dask.array.

Getting started

Xarray-Beam requires recent versions of xarray, dask, rechunker and zarr. It needs the latest release of Apache Beam (2.31.0 or later). For good performance when writing Zarr files, we strongly recommend patching Xarray with this pull request.

TODO(shoyer): write a tutorial here! For now, see the test suite for examples.

Disclaimer

Xarray-Beam is an experiment that we are sharing with the outside world in the hope that it will be useful. It is not a supported Google product. We welcome feedback, bug reports and code contributions, but cannot guarantee they will be addressed.

See the "Contribution guidelines" for more.

Credits

Contributors:

  • Stephan Hoyer
  • Jason Hickey
  • Cenk Gazen
