In [None]:
import xarray as xr
import numpy as np
import pandas as pd

# Introduction

- Why xarray?
  - numpy arrays are not enough
  - names, labels, attributes
  - distinction between dimension coordinates and non-dimension coordinates (this distinction might disappear in the future)

# Data structures

xarray mainly provides two types: `DataArray` and `Dataset`. The `DataArray` class attaches dimension names, coordinates and attributes to multi-dimensional arrays while `Dataset` combines multiple arrays.

Both classes are normally created by reading data, but to understand them let's first look at creating them programmatically.

## DataArray

- in-detail description
  - attach labels, name and attribute to array

structure:
- DataArray construction
  - data + dims
  - coords
  - attrs
  - name
 
- data: array-like
- coords: dict of str to array-like / DataArray
- dims: sequence (tuple / list) of hashable (mostly str)
- name: hashable (mostly str)
- attrs: dict (arbitrary dict)

**Todo**: dtypes?

- construction and repr
  - numeric types (bool, int, float, complex)
  - strings
  - datetime / cftime
  - object

To programmatically create a `DataArray`, we use its constructor:
```python
xr.DataArray([data, coords, dims, name, attrs])
```
where `data` can be anything that implements the interface of a `numpy` array (`numpy`, `dask`, `sparse` (WIP), `pint` (WIP), etc.) or anything that can be converted to a `numpy` array using `numpy.array`.
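As a minimal sketch of the conversion described above, the following passes a nested Python list as `data`. The variable name `from_list` is only illustrative. Since no `dims` are given, xarray falls back to default dimension names.

```python
import numpy as np
import xarray as xr

# a nested list is not an array, so it is converted to a numpy array
from_list = xr.DataArray([[1, 2], [3, 4]])

print(type(from_list.data))  # the wrapped data is a numpy array
print(from_list.dims)        # default names, since we passed no dims
```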

**Todo**: `dims`, `coords`, `attrs`

As an example, let's create a `DataArray` with two dimensions from a `numpy` array:

In [None]:
da = xr.DataArray(np.ones((3, 4)), dims=("x", "y"), name="a")
da

**Todo**: explain the HTML/text repr in depth

The representation of the new array (its `repr`) consists of:
- the name of the `DataArray` (`'a'`). If we don't provide a name, this will be omitted.
- the dimensions of the array `(x: 3, y: 4)`: this tells us that the first dimension is named `x` and has a size of `3` while the second dimension is named `y` and has a size of `4`
- a preview of the data
- a list of coordinates
- a list of attributes

Since we didn't provide them, these dimensions don't have coordinates and there are no attributes. If we want to attach coordinates and/or attributes, we can do that with the `coords` and `attrs` parameters:

In [None]:
da = xr.DataArray(
    np.ones((3, 4)),
    dims=("x", "y"),
    coords={"x": ["a", "b", "c"], "y": np.arange(4), "u": ("x", np.arange(3), {"attr1": 0})},
    attrs={"attribute": "string", "flag": 1},
)
da

With the values passed to `coords`, we attached values to `x` and `y` and also created a non-dimension coordinate named `u` with the tuple syntax. That special syntax can be used as a shortcut and is roughly equivalent to
```python
xr.DataArray(data=np.arange(3), dims="x", attrs={"attr1": 0})
```
so we can use that to add `attrs` to the coordinate. Note: using `{"y": np.arange(4)}` has the same result as `{"y": ("y", np.arange(4))}`.
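To check the note about the two spellings of the `y` coordinate, here is a small sketch that builds the array both ways and compares the results. The names `plain` and `explicit` are only illustrative.

```python
import numpy as np
import xarray as xr

data = np.ones((3, 4))

# plain array: the coordinate is placed along the dimension with the same name
plain = xr.DataArray(data, dims=("x", "y"), coords={"y": np.arange(4)})

# tuple syntax: the dimension of the coordinate is spelled out explicitly
explicit = xr.DataArray(data, dims=("x", "y"), coords={"y": ("y", np.arange(4))})

# identical compares values, dimensions, coordinates, names and attributes
print(plain.identical(explicit))
```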

Since `attrs` is a normal Python `dict`, there is no restriction on the keys / values. However, by convention, big arrays should not be used as values. Instead, use coordinates or a data variable in a `Dataset`.

Once we have created the `DataArray`, we can look at its data, dimensions, coordinates and attributes:

In [None]:
da.data

In [None]:
da.dims

In [None]:
da.coords

In [None]:
da.attrs
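Individual coordinates can be inspected in the same way. The following sketch rebuilds the array from above and looks at the non-dimension coordinate `u`; it assumes nothing beyond the constructor call already shown.

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.ones((3, 4)),
    dims=("x", "y"),
    coords={"x": ["a", "b", "c"], "y": np.arange(4), "u": ("x", np.arange(3), {"attr1": 0})},
    attrs={"attribute": "string", "flag": 1},
)

# mapping from dimension name to size
print(dict(da.sizes))

# a coordinate is itself a DataArray with its own dims and attrs
u = da.coords["u"]
print(u.dims)
print(u.attrs)
```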

Coordinates become useful when we try to operate on two objects with different coordinate values:

In [None]:
a = xr.DataArray(np.full((3, 4), 3), dims=("x", "y"), coords={"x": ["a", "b", "c"], "y": np.arange(4)})
a

In [None]:
b = xr.DataArray(
    np.full((5, 4, 2), 0.5),
    dims=("x", "y", "z"),
    coords={"x": ["z", "f", "c", "r", "b"], "y": [5, 1, 0, 9], "z": [-1, 4]},
)
b

In [None]:
a * b

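In the result, only the coordinate values present in both arrays along the common dimensions `x` and `y` were kept, while `z`, which only exists in `b`, was carried over unchanged.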
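To make the alignment concrete, this sketch repeats the multiplication and prints the sizes and coordinate values of the result. The name `result` is only illustrative.

```python
import numpy as np
import xarray as xr

a = xr.DataArray(
    np.full((3, 4), 3),
    dims=("x", "y"),
    coords={"x": ["a", "b", "c"], "y": np.arange(4)},
)
b = xr.DataArray(
    np.full((5, 4, 2), 0.5),
    dims=("x", "y", "z"),
    coords={"x": ["z", "f", "c", "r", "b"], "y": [5, 1, 0, 9], "z": [-1, 4]},
)

result = a * b

# only the labels shared by a and b survive along x and y
print(dict(result.sizes))
print(sorted(result["x"].values.tolist()))
print(sorted(result["y"].values.tolist()))
```

Only `"b"` and `"c"` appear in both `x` coordinates and only `0` and `1` in both `y` coordinates, so the result has two entries along each of those dimensions.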

## Dataset

`Dataset` objects collect multiple data variables, each with possibly different dimensions.

The constructor of `Dataset` takes three optional parameters:
```python
xr.Dataset([data_vars, coords, attrs])
```

where `coords` and `attrs` have the same structure as for `DataArray`. `data_vars` is a dictionary mapping names to either `DataArray` objects or the special tuple syntax (`(dims, data[, attrs])`).

For example, let's create a `Dataset` with two variables:

In [None]:
xr.Dataset(data_vars={
    "a": (("x", "y"), np.ones((3, 4))),
    "b": (("t", "x"), np.full((8, 3), 3), {"attr": "value"}),
})

We can see that in total, the `Dataset` has three dimensions: `t`, `x`, and `y`. However, neither `a` nor `b` is three-dimensional.
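The same observation can be checked programmatically. This sketch assigns the `Dataset` from above to a variable (the name `ds` is only illustrative) and compares its dimensions with those of each variable.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(data_vars={
    "a": (("x", "y"), np.ones((3, 4))),
    "b": (("t", "x"), np.full((8, 3), 3), {"attr": "value"}),
})

# dimensions of the Dataset as a whole
print(dict(ds.sizes))

# dimensions of each data variable
for name, var in ds.data_vars.items():
    print(name, var.dims)
```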

As with `DataArray`, a `Dataset` really becomes useful once we assign coordinates:

In [None]:
xr.Dataset(
    data_vars={
        "a": (("x", "y"), np.ones((3, 4))),
        "b": (("t", "x"), np.full((8, 3), 3)),
    },
    coords={
        "x": ["a", "b", "c"],
        "y": np.arange(4),
        "t": pd.date_range("2020-07-05", periods=8, freq="D"),
    },
)

If we have variables with different coordinate values along the same dimension, we can't use the shortcut syntax anymore. Instead, we need to use `DataArray` objects:

In [None]:
x_a = np.arange(1, 4)
x_b = np.arange(-1, 3)

a = xr.DataArray(np.linspace(0, 1, 3), dims="x", coords={"x": x_a})
b = xr.DataArray(np.zeros(4), dims="x", coords={"x": x_b})

xr.Dataset(data_vars={"a": a, "b": b})

This combines the coordinates and fills in `float` `nan` values for missing data: for example, `b` doesn't have a value for `x == 3`, so `nan` was used.
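As a quick check of the filling behaviour, the following sketch rebuilds the `Dataset` (the name `ds` is only illustrative) and prints the combined coordinate together with both variables.

```python
import numpy as np
import xarray as xr

x_a = np.arange(1, 4)   # 1, 2, 3
x_b = np.arange(-1, 3)  # -1, 0, 1, 2

a = xr.DataArray(np.linspace(0, 1, 3), dims="x", coords={"x": x_a})
b = xr.DataArray(np.zeros(4), dims="x", coords={"x": x_b})

ds = xr.Dataset(data_vars={"a": a, "b": b})

# the combined coordinate covers the labels of both variables
print(ds["x"].values)

# positions without data in the original arrays are filled with nan
print(ds["a"].values)
print(ds["b"].values)
```

Here `a` is missing at `x == -1` and `x == 0`, while `b` is missing at `x == 3`.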