# Data Structures 

xarray has two core data structures. Both are fundamentally N-dimensional:

- DataArray is xarray's implementation of a labeled, N-dimensional array.
- Dataset is a multi-dimensional, in-memory array database. It is a dict-like container of DataArray objects aligned along any number of shared dimensions.

-----

### Outline
- Arrays of Numbers (NumPy is Python's most common array library)
- The common data model (labels, netCDF)
- N-Dimensional labeled arrays (xarray)
  - DataArray
  - Dataset
  
### Tutorial Duration
10 minutes

### Going Further

Xarray Documentation on Data Structures: http://xarray.pydata.org/en/latest/data-structures.html

In [3]:
import numpy as np
import xarray as xr

## Arrays of Numbers

Unlabeled, N-dimensional arrays of numbers (e.g., NumPy’s ndarray) are the most widely used data structure in scientific computing. However, they lack a meaningful representation of the metadata associated with their data. Implementing such functionality is left to individual users and domain-specific packages. As a result, programmers frequently encounter pitfalls in the form of questions like “is the time axis of my array in the first or third index position?” or “does my array of timestamps still align with my data after resampling?”.

In [4]:
myvar = np.random.random(size=(2, 3, 6))
myvar

array([[[0.51797238, 0.36960625, 0.50105619, 0.86524812, 0.84462665,
         0.66155016],
        [0.92739413, 0.98730513, 0.38429811, 0.77778468, 0.96466869,
         0.95699085],
        [0.39442295, 0.96815621, 0.66922827, 0.95608537, 0.43409976,
         0.61155936]],

       [[0.66672558, 0.71694335, 0.87073351, 0.8472963 , 0.77740641,
         0.23121478],
        [0.81680208, 0.01495479, 0.78600093, 0.9823639 , 0.01090374,
         0.74945738],
        [0.36535074, 0.14439526, 0.72094285, 0.4588826 , 0.32375205,
         0.84569376]]])
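The axis-ordering pitfall described above can be made concrete with a small sketch. It assumes, purely for illustration, that an array shaped like `myvar` is laid out as (lat, lon, time); nothing in the bare array records that layout.

```python
import numpy as np

# Assumed layout for illustration: (lat, lon, time) = (2, 3, 6).
data = np.arange(36, dtype=float).reshape(2, 3, 6)

# Averaging over time means remembering that time is axis 2.
time_mean = data.mean(axis=2)
print(time_mean.shape)  # (2, 3)

# If upstream code reorders the axes to (time, lat, lon), the same call
# still runs, but it now averages over longitude instead of time.
reordered = np.moveaxis(data, 2, 0)  # shape (6, 2, 3)
wrong_mean = reordered.mean(axis=2)
print(wrong_mean.shape)  # (6, 2)
```

No error is raised in the second case; only the shape of the result hints that the wrong axis was reduced.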

## The Common Data Model and Inspiration from NetCDF

![](images/dataset-diagram.png)
*An example of how a dataset (netCDF or xarray) for a weather forecast might be structured. This dataset has three dimensions, time, y, and x, each of which is also a one-dimensional coordinate. Temperature and precipitation are three-dimensional data variables. Also included in the dataset are two-dimensional coordinates latitude and longitude, having dimensions y and x, and reference time, a zero-dimensional (scalar) coordinate.*

xarray adopts Unidata’s self-describing Common Data Model on which the network Common Data Form (netCDF) is built [20, 7]. NetCDF provides a well-defined data model for labeled N-dimensional array-oriented scientific data analysis.
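The structure described in the figure caption above can be sketched in xarray roughly as follows. The dimension sizes, dates, and random values are assumptions chosen for illustration and do not come from a real forecast.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Illustrative sizes only (assumed, not taken from the figure).
nt, ny, nx = 4, 3, 2
rng = np.random.default_rng(0)

ds = xr.Dataset(
    data_vars={
        # Three-dimensional data variables.
        "temperature": (("time", "y", "x"), rng.random((nt, ny, nx))),
        "precipitation": (("time", "y", "x"), rng.random((nt, ny, nx))),
    },
    coords={
        # One-dimensional coordinates, one per dimension.
        "time": pd.date_range("2000-01-01", periods=nt),
        "y": np.arange(ny),
        "x": np.arange(nx),
        # Two-dimensional coordinates defined on (y, x).
        "latitude": (("y", "x"), rng.random((ny, nx))),
        "longitude": (("y", "x"), rng.random((ny, nx))),
        # Zero-dimensional (scalar) coordinate.
        "reference_time": pd.Timestamp("2000-01-01"),
    },
)
print(ds)
```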

## Xarray Data Structures

![](images/xarray-data-structures.png)

The Common Data Model and netCDF form the basis of the xarray data model and provide a natural and portable serialization format. Building on netCDF, xarray features two main data structures: the DataArray and the Dataset. The API for these data structures is summarized in the following sections and in the figure above.

## `xarray.DataArray`

The DataArray is xarray’s implementation of a labeled, multi-dimensional array. It has several key properties:

- data: N-dimensional array (NumPy or dask) holding the array's values,
- coords: dict-like container of arrays (coordinates) that label each point (e.g., 1-dimensional arrays of numbers, datetime objects or strings),
- dims: dimension names for each axis [e.g., (‘time’, ‘latitude’, ‘longitude’)],
- attrs: OrderedDict holding arbitrary metadata (e.g. units or descriptions), and
- name: an arbitrary name for the array.

xarray uses dims and coords to enable its core metadata-aware operations. Dimensions provide names that xarray uses instead of the axis argument found in many NumPy functions. Coordinates are ancillary variables used to enable fast label-based indexing and alignment, building on the functionality of the pandas Index. DataArray objects can also have a name and can hold arbitrary metadata in their attrs property, which can be used to further describe the data (e.g., by providing units). Names and attributes are strictly for users and user-written code; in general, xarray makes no attempt to interpret them and propagates them only in unambiguous cases. In contrast, xarray does interpret and persist coordinates in operations that transform xarray objects.



In [None]:
my_da = xr.DataArray(myvar)
my_da

In [None]:
# Adding labels/metadata
my_da = xr.DataArray(myvar,
                     dims=('lat', 'lon', 'time'),
                     coords={'lat': [15., 30.], 'lon': [-110., -115., -120.]},
                     attrs={'long_name': 'temperature', 'units': 'C'},
                     name='temp')
my_da

In [None]:
# The underlying data is still there:
my_da.data
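As a minimal sketch of how dimension names and coordinates replace positional axes, the example below rebuilds a small array with the same dims and coords as `my_da`. It uses deterministic values (an assumption, so the results are easy to check) instead of random data.

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.arange(36, dtype=float).reshape(2, 3, 6),
    dims=("lat", "lon", "time"),
    coords={"lat": [15.0, 30.0], "lon": [-110.0, -115.0, -120.0]},
    name="temp",
)

# Reduce over a named dimension instead of a positional axis.
time_mean = da.mean(dim="time")
print(time_mean.dims)  # ('lat', 'lon')

# Select by coordinate label rather than by integer position.
point = da.sel(lat=30.0, lon=-115.0)
print(point.dims)  # ('time',)

# Reordering the dimensions does not change a reduction that is named by dim.
same_mean = da.transpose("time", "lat", "lon").mean(dim="time")
print(bool((same_mean == time_mean).all()))  # True
```

Because the reduction is addressed by name, the transposed array gives the same answer without any change to the calling code.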

## `xarray.Dataset`

The Dataset is xarray’s multi-dimensional equivalent of a DataFrame. It is a dict-like container of labeled arrays (DataArrays) with aligned dimensions. It is designed as an in-memory representation of a netCDF dataset. In addition to the dict-like interface of the dataset itself, which can be used to access any DataArray in a Dataset, datasets have four key properties:

- data_vars: OrderedDict of DataArray objects corresponding to data variables,
- coords: OrderedDict of DataArray objects intended to label points used in data_vars (e.g., 1-dimensional arrays of numbers, datetime objects or strings),
- dims: dictionary mapping from dimension names to the fixed length of each dimension (e.g., {‘x’: 6, ‘y’: 6, ‘time’: 8}), and
- attrs: OrderedDict to hold arbitrary metadata pertaining to the dataset.

DataArray objects inside a Dataset may have any number of dimensions but are presumed to share a common coordinate system. Coordinates can also have any number of dimensions but denote constant/independent quantities, unlike the varying/dependent quantities that belong in data. Figure 3 illustrates these concepts for an example Dataset containing meteorological data.

In [None]:
# Datasets are dict-like containers of DataArrays

xr.Dataset()

In [None]:
my_ds = xr.Dataset({'temperature': my_da})
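# roughly equivalent to the following (the explicit name is needed
# because my_da itself is named 'temp'):
# my_da.to_dataset(name='temperature')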
my_ds

In [None]:
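# The variable is stored under the key 'precipitation'; the DataArray's own
# name ('pcp') is replaced by that key inside the Dataset.
my_ds['precipitation'] = xr.DataArray(np.random.random(myvar.shape),
                                      dims=('lat', 'lon', 'time'),
                                      coords={'lat': [15., 30.], 'lon': [-110., -115., -120.]},
                                      attrs={'long_name': 'precipitation', 'units': 'mm'},
                                      name='pcp')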

my_ds.attrs['history'] = 'created for the xarray tutorial'

In [None]:
my_ds
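The key properties listed above can be inspected directly. The sketch below builds a stand-in for `my_ds` with constant values (an assumption, to keep it self-contained) and uses `sizes` for the name-to-length mapping, which recent xarray releases appear to favor over treating `dims` as a mapping.

```python
import numpy as np
import xarray as xr

dims = ("lat", "lon", "time")
coords = {"lat": [15.0, 30.0], "lon": [-110.0, -115.0, -120.0]}

ds = xr.Dataset(
    {
        "temperature": xr.DataArray(np.zeros((2, 3, 6)), dims=dims, coords=coords),
        "precipitation": xr.DataArray(np.ones((2, 3, 6)), dims=dims, coords=coords),
    },
    attrs={"history": "created for the xarray tutorial"},
)

print(list(ds.data_vars))   # names of the data variables
print(list(ds.coords))      # names of the coordinates
print(dict(ds.sizes))       # dimension name -> length
print(ds.attrs["history"])  # dataset-level metadata

# Dict-like access returns a DataArray.
temp = ds["temperature"]

# Selecting on a shared coordinate applies to every data variable at once.
subset = ds.sel(lat=15.0)
print(dict(subset.sizes))
```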