Frequently Asked Questions

Why is pandas not enough?

pandas, thanks to its unrivaled speed and flexibility, has emerged as the premier Python package for working with labeled arrays. So why are we contributing to further fragmentation in the ecosystem for working with data arrays in Python?

Sometimes, we really want to work with collections of higher-dimensional arrays (ndim > 2), or arrays for which the order of dimensions (e.g., columns vs. rows) shouldn't really matter. For example, climate and weather data are often natively expressed in 4 or more dimensions: time, x, y and z.

pandas does support N-dimensional panels, but the implementation is very limited:

  • You need to create a new factory type for each dimensionality.
  • You can't do math between NDPanels with different dimensionality.
  • Each dimension in an NDPanel has a name (e.g., 'labels', 'items', 'major_axis', etc.) but the dimension names refer to order, not their meaning. You can't specify that an operation should be applied along the "time" axis.

Fundamentally, the N-dimensional panel is limited by its context in pandas's tabular model, which treats a 2D DataFrame as a collection of 1D Series, a 3D Panel as a collection of 2D DataFrames, and so on. pandas gets a lot of things right, but scientific users need fully multi-dimensional data structures.
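
To make the contrast concrete, here is a minimal sketch of dimensions that are addressed by name rather than by position. It assumes the xarray package (the name under which xray is now distributed); the array values and the dimension names "time", "y" and "x" are made up for illustration.

```python
import numpy as np
import xarray as xr  # assumption: xray is installed under its current name, xarray

# A small 3D array whose dimensions carry names (values are illustrative only).
temps = xr.DataArray(
    np.arange(24.0).reshape(2, 3, 4),
    dims=("time", "y", "x"),
)

# Reduce along "time" by name; no need to remember which axis number it is.
time_mean = temps.mean(dim="time")

# Arithmetic between a 3D and a 2D array: dimensions are matched by name.
anomalies = temps - time_mean

# Reordering the dimensions does not change how the operation is spelled.
reordered_mean = temps.transpose("x", "y", "time").mean(dim="time")
```

The same `mean(dim="time")` call works whether "time" is the first or the last axis, which is the property the panel names ('items', 'major_axis', ...) cannot offer.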

When should I use xray instead of pandas?

It's not an either/or choice! xray provides robust support for converting back and forth between the tabular data structures of pandas and its own multi-dimensional data structures.

That said, you should only bother with xray if some aspect of your data is fundamentally multi-dimensional. If your data is unstructured or one-dimensional, stick with pandas, which is a more developed toolkit for doing data analysis in Python.
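
As a rough illustration of the round trip between the two libraries, the sketch below converts a pandas Series with a two-level index into a 2D labeled array and back. It assumes the xarray package (xray's current name); the index names "site" and "step" and the data values are hypothetical.

```python
import pandas as pd
import xarray as xr  # assumption: xray is installed under its current name, xarray

# A tabular (long-format) Series indexed by two levels; values are illustrative.
index = pd.MultiIndex.from_product(
    [["a", "b"], [0, 1, 2]], names=["site", "step"]
)
series = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=index, name="value")

# pandas -> labeled array: each index level becomes a named dimension.
array = xr.DataArray.from_series(series)

# labeled array -> pandas: the dimensions become a MultiIndex again.
round_trip = array.to_series()
```

Here `array` has the dimensions ("site", "step"), and `round_trip` carries the same values and index level names as the original Series.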

How can I use xray with heterogeneous data?

All items in an xray.DataArray must have a single (homogeneous) data type. To work with heterogeneous or structured data types in xray, put separate DataArray objects in a single xray.Dataset.

The Dataset object allows for most of the flexibility of heterogeneous data without the complexity or performance cost, because each of its constituent arrays has only a single dtype.
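
A minimal sketch of this pattern follows, assuming the xarray package (xray's current name). The variable names, station labels and values are invented for the example.

```python
import numpy as np
import xarray as xr  # assumption: xray is installed under its current name, xarray

# Three variables with different dtypes that share one "station" dimension.
ds = xr.Dataset(
    {
        "temperature": ("station", np.array([11.5, 9.0, 14.2])),       # float
        "name": ("station", np.array(["north", "east", "south"])),     # string
        "active": ("station", np.array([True, False, True])),          # bool
    },
    coords={"station": [101, 102, 103]},
)

# Selecting by label applies to every variable at once,
# while each variable keeps its own homogeneous dtype.
subset = ds.sel(station=[101, 103])
```

Each entry of `ds` is still an ordinary single-dtype array, so per-variable operations stay fast, while label-based selection keeps the variables aligned.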

What is your approach to metadata?

We are firm believers in the power of labeled data! In addition to dimensions and coordinates, xray supports arbitrary metadata in the form of global (Dataset) and variable-specific (DataArray) attributes (attrs).

Automatic interpretation of labels is powerful but also reduces flexibility. With xray, we draw a firm line between labels that the library understands (dims and coords) and labels for users and user code (attrs). For example, we do not automatically interpret and enforce units or CF conventions. (An exception is serialization to netCDF with cf_conventions=True.)

An implication of this choice is that we do not propagate attrs through most operations unless explicitly flagged (some methods have a keep_attrs option). Similarly, xray usually drops conflicting attrs when combining arrays and datasets instead of raising an exception, unless explicitly requested with the option compat='identical'. The guiding principle is that metadata should not be allowed to get in the way.
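
The sketch below illustrates the keep_attrs flag and the difference between comparing values and comparing values plus metadata. It assumes the xarray package (xray's current name); the "units" attribute and the data are made up. The flag is passed explicitly in both calls so the example does not depend on the library's default.

```python
import numpy as np
import xarray as xr  # assumption: xray is installed under its current name, xarray

temps = xr.DataArray(
    np.array([1.0, 2.0, 3.0]), dims="time", attrs={"units": "K"}
)

# attrs are user metadata: the caller decides whether a reduction carries them along.
without_attrs = temps.mean(dim="time", keep_attrs=False)
with_attrs = temps.mean(dim="time", keep_attrs=True)

# Same values, different user metadata.
relabeled = temps.assign_attrs(units="degC")
same_values = temps.equals(relabeled)        # ignores attrs
same_metadata = temps.identical(relabeled)   # also compares attrs
```

Because attrs are opaque to the library, `equals` treats the two arrays as the same, and only the stricter `identical` check notices the conflicting "units" entries.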

What other netCDF related Python libraries should I know about?

netCDF4-python provides a lower-level interface for working with netCDF and OpenDAP datasets in Python. We use netCDF4-python internally in xray, and have contributed a number of improvements and fixes upstream. xray does not yet support all of netCDF4-python's features, such as writing to netCDF groups or modifying files on-disk.

Iris (supported by the UK Met Office) provides similar tools for in-memory manipulation of labeled arrays, aimed specifically at weather and climate data needs. Indeed, the Iris iris.cube.Cube was the direct inspiration for xray's xray.DataArray. xray and Iris take very different approaches to handling metadata: Iris strictly interprets CF conventions. Iris particularly shines at mapping, thanks to its integration with Cartopy.

UV-CDAT is another Python library that implements in-memory netCDF-like variables and tools for working with climate data.

We think the design decisions we have made for xray (namely, basing it on pandas) make it a faster and more flexible data analysis tool. That said, Iris and CDAT have some great domain-specific functionality, and we would love to have support for converting their native objects to and from xray (see issues 37 and 133).