# Data Structures

To keep this documentation generic we typically use dimensions `x` or `y`, but this should *not* be seen as a recommendation to use these labels for anything but actual positions or offsets in space.

## Variable

### Basics

[scipp.Variable](../generated/scipp.Variable.rst#scipp.Variable) is a labeled multi-dimensional array.
A variable can be constructed using:

- `values`: a multi-dimensional array of values, e.g., a [numpy.ndarray](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.html#numpy.ndarray)
- `variances`: an (optional) multi-dimensional array of variances for the array values
- `dims`: a list of dimension labels (strings) for each axis of the array
- `unit`: an (optional) physical unit of the values in the array

Note that variables, unlike [DataArray](data-structures.ipynb#DataArray) and its eponym [xarray.DataArray](http://xarray.pydata.org/en/stable/data-structures.html#dataarray), do *not* have coordinate dicts.

In [None]:
import numpy as np
import scipp as sc

In [None]:
var = sc.Variable(values=np.random.rand(2, 4), dims=['x', 'y'])

In [None]:
sc.show(var)

In [None]:
var

In [None]:
var.unit

In [None]:
var.values

In [None]:
print(var.variances)

Variances must have the same shape as values, and units are specified using the [scipp.units](../python-reference/units.rst) module:

In [None]:
var = sc.Variable(values=np.random.rand(2, 4),
                  variances=np.random.rand(2, 4),
                  dims=['x', 'y'],
                  unit=sc.units.m/sc.units.s)
sc.show(var)

In [None]:
var

In [None]:
var.variances

### 0-D variables (scalars)

A 0-dimensional variable contains a single value (and an optional variance).
The most convenient way to create a scalar variable is by multiplying a value by a unit:

In [None]:
scalar = 1.2 * sc.units.m
sc.show(scalar)
scalar

For convenience, singular versions of the `values` and `variances` properties are provided:

In [None]:
print(scalar.value)
print(scalar.variance)

Creating scalar variables with variances or with a custom `dtype` is possible using the constructor:

In [None]:
var_0d = sc.Variable(variances=True, dtype=sc.dtype.float32, unit=sc.units.kg)
var_0d

In [None]:
var_0d.value = 2.3
var_0d.variance

An exception is raised from the `value` and `variance` properties if the variable is not 0-dimensional.
Note that a variable with one or more dimension extent(s) of 1 contains just a single value as well, but the `value` property will nevertheless raise an exception.

## DataArray

### Basics

[scipp.DataArray](../generated/scipp.DataArray.rst#scipp.DataArray) is a labeled array with associated coordinates.
A data array is essentially a [Variable](../generated/scipp.Variable.rst#scipp.Variable) object with attached dicts of coordinates, masks, and attributes.

A data array has the following key properties:

- `data`: the variable holding the array data.
- `coords`: a dict-like container of coordinates for the array, accessed using a string as dict key.
- `masks`: a dict-like container of masks for the array, accessed using a string as dict key.
- `attrs`: a dict-like container of "attributes" for the array, accessed using a string as dict key.

See also the [xarray documentation](http://xarray.pydata.org/en/stable/data-structures.html#coordinates).

The key distinction between coordinates (added via the `coords` property) and attributes (added via the `attrs` property) is that the former are required to match ("align") in operations between data arrays whereas the latter are not.

`masks` allows for storing boolean-valued masks alongside data.
All four have items that are internally a [Variable](../generated/scipp.Variable.rst#scipp.Variable), i.e., they have a physical unit and optionally variances.

In [None]:
d = sc.DataArray(
    data=sc.Variable(dims=['y', 'x'], values=np.random.rand(2, 3)),
    coords={
        'y': sc.Variable(dims=['y'], values=np.arange(2.0), unit=sc.units.m),
        'x': sc.Variable(dims=['x'], values=np.arange(3.0), unit=sc.units.m)},
    attrs={
        'aux': sc.Variable(dims=['x'], values=np.random.rand(3))})
sc.show(d)

Note how the `'aux'` attribute is essentially a secondary coordinate for the x dimension.
The dict-like `coords`, `masks`, and `attrs` properties give access to the respective underlying variables:

In [None]:
d.coords['x']

In [None]:
d.attrs['aux']

Access to coords and attrs in a unified manner is possible with the `meta` property.
Essentially this allows us to ignore whether a coordinate is aligned or not:

In [None]:
d.meta['x']

In [None]:
d.meta['aux']

Further details about data arrays are discussed implicitly in the section on datasets below, since each item in a dataset behaves equivalently to a data array.

### Distinction between dimension coords and non-dimension coords, and aligned and unaligned coords

When the name of a coord matches its dimension, e.g., if `d.coords['x']` depends on dimension `'x'` as in the above example, we call this coord a *dimension coordinate*.
Otherwise it is called a *non-dimension coord*.
It is important to highlight that for practical purposes (such as matching in operations) **dimension coords and non-dimension coords are handled equivalently**.
Essentially:

- **Non-dimension coordinates are coordinates**.
- There is at most one dimension coord for each dimension, but there can be multiple non-dimension coords.
- In the special case of non-dimension coords that have more than 1 dimension, they are considered to be labels for their inner dimension.
- Operations such as value-based slicing that accept an input dimension and require lookup of coordinate values will only consider dimension coordinates.

The concept of dimension coords is unrelated to the concept of coord "alignment", i.e., whether axis labels are stored in `coords` or `attrs`.
In particular, dimension coords could be made attrs if desired, and non-dimension coords can be (and often are) "aligned" coords.

## Dataset

[scipp.Dataset](../generated/scipp.Dataset.rst#scipp.Dataset) is a dict-like container of data arrays.
Individual items of a dataset ("data arrays") are accessed using a string as a dict key.

In a dataset the coordinates of the sub-arrays are enforced to be *aligned*.
That is, a dataset is not actually just a dict of data arrays.
Instead, the individual arrays share coordinates, labels, and attributes.
It is therefore not possible to combine arbitrary data arrays into a dataset.
If, e.g., the extents in a certain dimension mismatch, or if coordinate/label values mismatch, insertion of the mismatching data array will fail.

Typically a dataset is not created from individual data arrays.
Instead we may provide a dict of variables (the data of the items), and dicts for coords and labels:

In [None]:
d = sc.Dataset(
    {'a': sc.Variable(dims=['x', 'y'], values=np.random.rand(2, 3)),
     'b': sc.Variable(dims=['x'], values=np.random.rand(2)),
     'c': sc.Variable(1.0)},  # 0-D item
    coords={
        'x': sc.Variable(dims=['x'], values=np.arange(2.0), unit=sc.units.m),
        'y': sc.Variable(dims=['y'], values=np.arange(3.0), unit=sc.units.m),
        'aux': sc.Variable(dims=['y'], values=np.random.rand(3))})
sc.show(d)

In [None]:
d

In [None]:
d.coords['x'].values

The name of a data item serves as a dict key.
Item access returns a view (`DataArrayView`) onto the data in the dataset and its corresponding coordinates, i.e., no copy is made.
Apart from that it behaves exactly like `DataArray`.

In [None]:
sc.show(d['a'])
d['a']

Each data item is linked to its corresponding coordinates, labels, and attributes.
These are accessed using the `coords` and `attrs` properties.
The variable holding the data of the dataset item is accessible via the `data` property:

In [None]:
d['a'].data

For convenience, properties of the data variable are also properties of the data item:

In [None]:
d['a'].values

In [None]:
d['a'].variances

In [None]:
d['a'].unit

Coordinates and attributes of a data item include only those that are relevant to the item's dimensions; all others are hidden.
For example, when accessing `'b'`, which does not depend on the `'y'` dimension, the coord for `'y'` as well as the `'aux'` labels are not part of the item's `coords`:

In [None]:
sc.show(d['b'])

Similarly, when accessing a 0-dimensional data item, it will have no coordinates or labels:

In [None]:
sc.show(d['c'])

All variables in a dataset must have consistent dimensions.
Thanks to labeled dimensions, transposed data is supported:

In [None]:
d['d'] = sc.Variable(dims=['y', 'x'], values=np.random.rand(3, 2))
sc.show(d)
d

The usual `dict`-like methods are available for `Dataset`:

In [None]:
for name in d:
    print(name)

In [None]:
'a' in d

In [None]:
'e' in d