Xarray-Beam

Xarray-Beam is a Python library for building Apache Beam pipelines with Xarray datasets.

The project aims to facilitate data transformations and analysis on large-scale multi-dimensional labeled arrays, such as:

Ad-hoc computation on Xarray data, by dividing a xarray.Dataset into many smaller pieces ("chunks").
Adjusting array chunks, using the Rechunker algorithm
Ingesting large multi-dimensional array datasets into an analysis-ready, cloud-optimized format, namely Zarr (see also Pangeo Forge)
Calculating statistics (e.g., "climatology") across distributed datasets with arbitrary groups.

Xarray-Beam is implemented as a thin layer on top of existing libraries for working with large-scale Xarray datasets. For example, it leverages Dask for describing lazy arrays and for executing multi-threaded computation on a single machine.

🚨 Warning: Xarray-Beam is new and unpolished 🚨

Expect sharp edges 🔪 and performance cliffs 🧗, particularly related to the management of lazy data with Dask and reading/writing data with Zarr. We have used it to efficiently process 5 TB datasets. We expect it to scale to PB size datasets but that's easier said than done. We welcome feedback and contributions from early adopters, and hope to have it ready for wider audience soon.

How does Xarray-Beam compare to Dask?

We love Dask! Xarray-Beam explores a different part of the design space for distributed data pipelines than Xarray's built-in Dask integration:

Xarray-Beam is built around explicit manipulation of (ChunkKey, xarray.Dataset) pairs to perform operations on distributed datasets, where ChunkKey is an immutable dict keeping track of the offsets from the origin for a small contiguous "chunk" of a larger distributed dataset. This requires more boilerplate but is also more robust than generating distributed computation graphs in Dask using Xarray's built-in API. The user is expected to have a mental model for how their data pipeline is distributed across many machines.
Xarray-Beam distributes datasets by splitting them into many xarray.Dataset chunks, rather than the chunks of NumPy arrays typically used by Xarray with Dask (unless using xarray.map_blocks). Chunks of datasets is a more convenient data-model for writing ad-hoc whole dataset transformations, but is potentially a bit less efficient.
Beam (like Spark) was designed around a higher-level model for distributed computation than Dask (although Dask has been making progress in this direction). Roughly speaking, this trade-off favors scalability over flexibility.
Beam allows for executing distributed computation using multiple runners, notably including Google Cloud Dataflow and Apache Spark. These runners are more mature than Dask, and in many cases are supported as a service by major commercial cloud providers.

These design choices are not set in stone. In particular, in the future we could imagine writing a high-level xarray_beam.Dataset that emulates the xarray.Dataset API, similar to the popular high-level DataFrame APIs in Beam, Spark and Dask. This could be built on top of the lower-level transformations currently in Xarray-Beam, or alternatively could use a "chunks of NumPy arrays" representation similar to that used by dask.array.

Getting started

Xarray-Beam requires recent versions of xarray, dask, rechunker and zarr. It needs the latest release of Apache Beam (2.31.0 or later). For good performance when writing Zarr files, we strongly recommend patching Xarray with this pull request.

TODO(shoyer): write a tutorial here! For now, see the test suite for examples.

Disclaimer

Xarray-Beam is an experiment that we are sharing with the outside world in the hope that it will be useful. It is not a supported Google product. We welcome feedback, bug reports and code contributions, but cannot guarantee they will be addressed.

See the "Contribution guidelines" for more.

Credits

Contributors:

Stephan Hoyer
Jason Hickey
Cenk Gazen

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
.github/workflows		.github/workflows
examples		examples
static		static
xarray_beam		xarray_beam
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
conftest.py		conftest.py
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Xarray-Beam

How does Xarray-Beam compare to Dask?

Getting started

Disclaimer

Credits

About

Uh oh!

Releases

Packages

Languages

License

shoyer/xarray-beam

Folders and files

Latest commit

History

Repository files navigation

Xarray-Beam

How does Xarray-Beam compare to Dask?

Getting started

Disclaimer

Credits

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages