# Introduction to Scipp

###### Multi-D data arrays with labeled dimensions

`scipp` is heavily inspired by `xarray`. For many applications `xarray` is certainly more suitable (and definitely much more mature) than `scipp`, but it lacks a number of features that matter in other situations. If your use case requires one or several of the items on the following list, using `scipp` may be worth considering:

- Handling of physical units.
- Propagation of uncertainties.
- Support for histograms, i.e. bin-edge axes, which are longer than the data extent by one.
- Support for event data, a particular form of sparse data with 1-D (or N-D) arrays of random-length lists, with very small list entries.
- Written in C++ for better performance (for certain applications), in combination with Python bindings.

This notebook demonstrates key functionality and usage of the `scipp` library. See the [documentation](https://scipp.readthedocs.io/en/latest/) for more information.

## Getting started
### What is a `Dataset`?
The main data container in `scipp` is called a `Dataset`.
There are two basic analogies to aid in thinking about a `Dataset`:
1. As a `dict` of `numpy.ndarray`s, with the addition of named dimensions and units.
2. As a table.
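
The two analogies above can be made concrete with a small pure-Python sketch. This is not `scipp` API: the `Column` class is a hypothetical stand-in, and the numbers for `bob` are made up for illustration.

```python
# Hypothetical stand-in for illustration only; this is not scipp's API.
from dataclasses import dataclass


@dataclass
class Column:
    dims: tuple                    # dimension labels, e.g. ("row",)
    values: list                   # the data itself
    unit: str = "dimensionless"    # physical unit, kept as a plain string here


# Analogy 1: a dict mapping names to labelled, unit-carrying arrays.
dataset = {
    "alice": Column(dims=("row",), values=[1.0, 1.1, 1.2], unit="m"),
    "bob": Column(dims=("row",), values=[2.0, 2.1, 2.2], unit="m"),
}

# Analogy 2: the same data read row by row, as a table.
names = list(dataset)
rows = list(zip(*(dataset[name].values for name in names)))
for row in rows:
    print(dict(zip(names, row)))
```

A real `Dataset` additionally carries coordinates and variances, which this sketch leaves out.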

### Creating a dataset

In [None]:
import numpy as np
import scipp as sc

In [None]:
d = sc.Dataset()
d

## Using `Dataset` as a table
We can not only think about a dataset as a table, we can also use it as one.
This will demonstrate the basic ways of creating datasets and interacting with them.

In [None]:
d.set_coord(sc.Dim.Row, values=np.arange(3))
d["alice"] = sc.Variable([sc.Dim.Row], values=[1.0,1.1,1.2], variances=[0.01,0.01,0.02], unit=sc.units.m)
d

The datatype (`dtype`) is derived from the provided data, so passing `np.arange(3)` will yield a variable (column) containing 64-bit integers.

Datasets with up to one dimension can be displayed as a simple table:

In [None]:
sc.table(d)

A variable (column) in a dataset (table) is identified by its name (`"alice"`). A 1D variable has a coordinate (`Row`) and holds `Values` and, optionally, `Variances`, which are grouped together in a common structure.

Each variable (column) comes with a physical unit attached to it, which we should set up correctly as early as possible.

In [None]:
d["alice"].unit = sc.units.m
sc.table(d)

The unit can also be set when constructing the `Variable` by using the `unit` keyword argument:

In [None]:
d["alice"] = sc.Variable([sc.Dim.Row], values=[1.0,1.1,1.2], variances=[0.01,0.01,0.02], unit=sc.units.m)
sc.table(d)

Units and uncertainties are handled automatically in operations.

In [None]:
d *= d
sc.table(d)
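
As a rough illustration of what "handled automatically" involves, the following pure-Python sketch applies first-order variance propagation to a product. It is not `scipp` code: `multiply` is a hypothetical helper, and it assumes the two operands are uncorrelated, even when a variable is multiplied by itself. Whether `scipp` makes exactly this choice should be checked against its documentation.

```python
import math


def multiply(a_val, a_var, b_val, b_var):
    """First-order variance propagation for a product of uncorrelated operands."""
    value = a_val * b_val
    variance = b_val**2 * a_var + a_val**2 * b_var
    return value, variance


# Same numbers as the "alice" column above.
values = [1.0, 1.1, 1.2]
variances = [0.01, 0.01, 0.02]

# Assumption: x * x is treated like a product of two independent quantities.
squared = [multiply(x, v, x, v) for x, v in zip(values, variances)]
unit = "m^2"  # m * m

for value, variance in squared:
    print(f"{value:.4f} +/- {math.sqrt(variance):.4f} {unit}")
```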

Operations between columns are supported by indexing into a dataset with a name.

In [None]:
d["bob"] = d["alice"]
d

In [None]:
sc.table(d)

It is also possible to get a quick graphical preview of the contents of your `Dataset` by using the `show()` function:

In [None]:
sc.show(d)

In [None]:
d["bob"] += d["alice"]
sc.table(d)

The contents of a `Dataset` can also be displayed on a graph using the `plot` function:

In [None]:
sc.plot(d)

Operations between rows are supported by indexing into a dataset with a dimension label and an index.

Slicing dimensions behaves similarly to `numpy`:
If a single index is given, the dimension is dropped; if a range is given, the dimension is kept.
For a `Dataset`, in the former case the corresponding coordinates are dropped, whereas in the latter case they are preserved.

In [None]:
a = np.arange(8)

In [None]:
a[4]

In [None]:
a[4:5]

In [None]:
d[sc.Dim.Row, 1] += d[sc.Dim.Row, 2]
sc.table(d)

Note the key advantage over `numpy` or `MATLAB`:
We specify the index dimension, so we always know which dimension we are slicing.
The advantage is not so apparent in 1D, but will become clear once we move to higher-dimensional data.
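
A minimal sketch of why naming the dimension helps in 2D, using nested lists rather than `scipp`. The helper `slice_dim` is hypothetical and only handles two dimensions.

```python
def slice_dim(data, dims, dim, index):
    """Select `index` along the dimension named `dim` of a 2-D nested list."""
    axis = dims.index(dim)
    if axis == 0:
        return data[index]
    return [row[index] for row in data]


# A 2 x 3 array with dimensions ("y", "x").
data = [[1, 2, 3],
        [4, 5, 6]]
dims = ("y", "x")

x_slice = slice_dim(data, dims, "x", 1)  # values at x=1 for every y
y_slice = slice_dim(data, dims, "y", 1)  # values at y=1 for every x
print(x_slice, y_slice)

# The same request gives the same answer if the data is stored transposed.
transposed = [list(col) for col in zip(*data)]
x_slice_transposed = slice_dim(transposed, ("x", "y"), "x", 1)
print(x_slice_transposed)
```

With positional indexing, the caller would have to know the storage order to pick the right axis; with a label, the request stays the same.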

### Summary
There are a number of ways to select and operate on a single row, a range of rows, a single variable (column) or multiple variables (columns) of a dataset:

In [None]:
# Single row (dropping corresponding coordinates)
sc.table(d[sc.Dim.Row, 0])
# Size-1 row range (keeping corresponding coordinates)
sc.table(d[sc.Dim.Row, 0:1])
# Range of rows
sc.table(d[sc.Dim.Row, 1:3])
# Single variable
sc.table(d["alice"].data)
# Subset containing a single variable, keeping coordinates
sc.table(d["alice"])

### Exercise 1
1. Combining row slicing and "column" slicing, add the last row of the data for Alice to the first row of data for Bob.
2. Using the slice-range notation `a:b`, try adding the last two rows to the first two rows. Why does this fail?

### Solution 1

In [None]:
d["bob"][sc.Dim.Row, 0] += d["alice"][sc.Dim.Row, -1]
sc.table(d)

If a range is given when slicing, the corresponding coordinate is preserved, and operations between misaligned data are prevented.

In [None]:
try:
    d["bob"][sc.Dim.Row, 0:2] += d["alice"][sc.Dim.Row, 1:3]
except RuntimeError:
    print("Failed as expected!")
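
The safety catch can be pictured as a coordinate comparison that runs before the arithmetic. The sketch below is plain Python with a hypothetical `add_aligned` helper and made-up numbers; it only mirrors the idea, not `scipp`'s implementation.

```python
def add_aligned(a, b):
    """Add two (coords, values) pairs, refusing mismatched coordinates."""
    a_coords, a_values = a
    b_coords, b_values = b
    if a_coords != b_coords:
        raise RuntimeError(f"coordinate mismatch: {a_coords} vs {b_coords}")
    return a_coords, [x + y for x, y in zip(a_values, b_values)]


row = [0, 1, 2]
bob = [10.0, 20.0, 30.0]    # illustrative values only
alice = [1.0, 2.0, 3.0]

# Range slices keep their coordinates, so rows 0:2 and 1:3 do not line up.
first_two = (row[0:2], bob[0:2])
last_two = (row[1:3], alice[1:3])

try:
    add_aligned(first_two, last_two)
    outcome = "added"
except RuntimeError as err:
    outcome = f"refused: {err}"
print(outcome)
```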

We can operate directly on the underlying values to circumvent the safety catch:

In [None]:
d["bob"][sc.Dim.Row, 0:2].values += d["alice"][sc.Dim.Row, 1:3].values
sc.table(d)

but note that the propagation of errors is then not taken into account by the operation, as we are simply adding two `numpy` arrays together.
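
The difference can be seen in a small pure-Python sketch with made-up numbers. It assumes uncorrelated operands, for which the variances of a sum simply add.

```python
bob_values, bob_variances = [1.0, 2.0], [0.1, 0.2]
alice_values, alice_variances = [3.0, 4.0], [0.3, 0.4]

# Values-only addition: the variances of `bob` are left untouched.
raw_sum = [b + a for b, a in zip(bob_values, alice_values)]
raw_variances = list(bob_variances)

# Propagated addition of uncorrelated operands: the variances add as well.
propagated_variances = [vb + va for vb, va in zip(bob_variances, alice_variances)]

print(raw_sum, raw_variances, propagated_variances)
```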

We can also imagine ways to explicitly drop coordinates from a subset, e.g., `d['bob'].drop_coords()`, to allow direct operations on subsets. This is currently not supported.

### Exercise 2

The slicing notation for variables (columns) and rows does not return a copy, but a view object.
This is very similar to how `numpy` operates:

In [None]:
a_slice = a[0:3]
a_slice += 100
a

Using the slicing notation, create a new table (or replace the existing dataset `d`) by one that does not contain the first and last row of `d`.

### Solution 2

In [None]:
d2 = d[sc.Dim.Row, 1:-1].copy()

# Or:
# from copy import copy
# d2 = copy(d[sc.Dim.Row, 1:-1])

sc.table(d2)

## More advanced operations with tables
In addition to binary operators, basic functions like `concatenate`, `sort`, and `merge` are available.

In [None]:
d = sc.concatenate(d[sc.Dim.Row, 0:3], d[sc.Dim.Row, 1:3], sc.Dim.Row)
d = sc.sort(d, sc.Dim.Row)
eve = sc.Dataset()
eve["eve"] = sc.Variable([sc.Dim.Row], values=np.arange(5).astype(np.float64))
d.merge(eve)
sc.table(d)
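
For readers who want to see the table semantics spelled out, here is a pure-Python sketch of the three operations on a dict of columns. The helpers `concatenate`, `sort_by`, and `merge` are hypothetical and only imitate the idea; they are not the `scipp` functions, and the `row` column here plays the role of the coordinate.

```python
def concatenate(a, b):
    """Append the rows of table `b` to table `a` (same column names)."""
    return {name: a[name] + b[name] for name in a}


def sort_by(table, key):
    """Reorder all columns so that column `key` is ascending."""
    order = sorted(range(len(table[key])), key=table[key].__getitem__)
    return {name: [col[i] for i in order] for name, col in table.items()}


def merge(a, b):
    """Combine the columns of two tables that share the same row coordinate."""
    if a["row"] != b["row"]:
        raise RuntimeError("row coordinates differ")
    return {**a, **b}


t = {"row": [0, 1, 2], "alice": [1.0, 1.1, 1.2]}
tail = {name: col[1:3] for name, col in t.items()}   # rows 1 and 2 again
t = sort_by(concatenate(t, tail), "row")
eve = {"row": t["row"], "eve": [float(i) for i in range(5)]}
t = merge(t, eve)
print(t)
```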

### Exercise 3
Add the sum of the data for `alice` and `bob` as a new variable (column) to the dataset.

### Solution 3

In [None]:
d['sum'] = d['alice'] + d['bob']
sc.table(d)

### Interaction with `numpy` and scalars
Variables in a dataset are exposed in a `numpy`-compatible buffer format, so we can directly hand them to `numpy` functions.

In [None]:
d['eve'] = np.exp(d['eve'])
sc.table(d)

Direct access to the `numpy`-like underlying data array is possible using the `values` property:

In [None]:
d['eve'].values

### Exercise 4
1. As above for `np.exp` applied to the data for Eve, apply a `numpy` function to the data for Alice.
2. What happens to the unit and uncertainties when modifying data with external code such as `numpy`?

### Solution 4

In [None]:
d['alice'] = np.sin(d['alice'])
sc.table(d)

`numpy` operations are not aware of units and uncertainties. The result is therefore meaningless unless the user handles units and uncertainties manually.

Corollary: Whenever available, built-in operators and functions should be preferred over the use of `numpy`.
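
To show what handling this manually would involve, the sketch below wraps `math.sin` with a unit check and first-order variance propagation, `var(sin x) ≈ cos(x)^2 * var(x)`. The function name and the string-based units are assumptions made for illustration; this is not how `scipp` represents units.

```python
import math


def sin_with_uncertainty(value, variance, unit):
    """sin(x) with first-order variance propagation; x must be an angle."""
    if unit not in ("rad", "dimensionless"):
        raise ValueError(f"sin is undefined for unit {unit!r}")
    return math.sin(value), math.cos(value) ** 2 * variance


# An angle in radians is accepted and its variance is propagated.
y, y_var = sin_with_uncertainty(0.0, 0.01, "rad")
print(y, y_var)

# A length is rejected instead of silently producing a number.
try:
    sin_with_uncertainty(1.0, 0.01, "m")
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```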

### Exercise 5
1. Try adding a scalar value such as `1.5` to the data for Eve.
2. Try the same for Alice or Bob. Why is it not working?

### Solution 5

In [None]:
d['eve'] += 1.5
sc.table(d)

The data for Alice has a unit, so a direct addition with a dimensionless quantity fails:

In [None]:
try:
    d['alice'] += 1.5
except RuntimeError:
    print("Failed as expected!")

We can use `Variable` to provide a scalar quantity with an attached unit:

In [None]:
d['alice'] += sc.Variable(1.5, unit=sc.units.m*sc.units.m)
sc.table(d)
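
The behaviour in this solution amounts to a unit comparison before the addition. A minimal pure-Python sketch, with a hypothetical `add` helper, units as plain strings, and an illustrative value for `alice`:

```python
def add(a, b):
    """Add two (value, unit) pairs; the units must match."""
    (a_value, a_unit), (b_value, b_unit) = a, b
    if a_unit != b_unit:
        raise RuntimeError(f"cannot add {b_unit} to {a_unit}")
    return a_value + b_value, a_unit


alice = (1.44, "m^2")  # illustrative value only

# A bare number is dimensionless, so the addition is refused.
try:
    add(alice, (1.5, "dimensionless"))
    bare_scalar_accepted = True
except RuntimeError:
    bare_scalar_accepted = False

# A scalar carrying the matching unit is accepted.
result = add(alice, (1.5, "m^2"))
print(bare_scalar_accepted, result)
```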

Continue to [Dataset in a Nutshell - Part 2](demo-part2.ipynb) to see how datasets are used with multi-dimensional data.