**Audience**: Beginner.

**Prerequisites**:
Familiarity with basics of scipp.
If you are new to scipp, we recommend walking through the [Getting Started](https://scipp.github.io/tutorials/getting-started.html) tutorial first.

**Objectives**:
Develop an understanding of `scipp.Dataset` and how to use it for representing tabular data.

# 1-D datasets and tables
## What is a `Dataset`?

If you are familiar with Pandas then you can think of scipp's `Dataset` as an equivalent to [pandas.DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html).
While `Dataset` is more general in that it supports multi-dimensional entries, we can use a dataset with 1-D entries as a table, similar to `pandas.DataFrame`.
Pandas offers more powerful and more specialized functionality for processing tabular data.
On the other hand, scipp provides features such as support for physical units and powerful routines for binning and histogramming data (see the follow-up tutorial
[From tabular data to binned data](https://scipp.github.io/tutorials/from-tabular-data-to-binned-data.html) for an introduction to this topic).
Therefore the choice between Pandas and scipp depends on the application.

## Creating a dataset

In [None]:
import numpy as np

import scipp as sc

We start by creating an empty dataset:

In [None]:
ds = sc.Dataset()
ds

## Using `Dataset` as a table

We can think of a dataset as a table and, indeed, use it as one.
This section demonstrates the basic ways of creating datasets and interacting with them.
Columns can be added one by one.
We create a `scipp.Variable` using `array`, one of scipp's [creation functions](https://scipp.github.io/reference/creation-functions.html), and immediately insert it into the dataset:

In [None]:
ds['alice'] = sc.array(
    dims=['row'], values=[1.0, 1.1, 1.2], variances=[0.01, 0.01, 0.02], unit='m'
)
ds

Under the hood, the column for `'alice'` contains two sub-columns with values and associated variances (uncertainties).
The uncertainties are optional.
The datatype (`dtype`) is derived from the provided data, so passing `np.arange(3)` will yield a variable (column) containing 64-bit integers.

As this dataset is 1-D we can visualize it as a table:

In [None]:
sc.table(ds)

For many practical purposes we want to associate a set of values (and optionally a unit) with our dimension.
Let us introduce a coordinate for `row` so that we can assign a row number starting at zero.
Coordinates, just like data columns, are variables:

In [None]:
ds.coords['row'] = sc.arange('row', 3, unit=None)
sc.table(ds)

Here the coordinate acts as a row header for the table.
Coordinates have a purpose similar to Pandas' *index*, but they are more general.
For example we can have multiple coordinates.

More details of the dataset are visible in its HTML representation:

In [None]:
ds

A data item (column) in a dataset (table) is identified by its name (`'alice'`).
Note how each coordinate and data item is associated with a tuple of dimension labels and a shape tuple:

In [None]:
print(ds.coords['row'].dims)
print(ds.coords['row'].shape)
print(ds['alice'].dims)
print(ds['alice'].shape)

It is important to understand the difference between items in a dataset (`scipp.DataArray`, includes coordinates), the variable that holds the data of the item (`scipp.Variable`), and the actual values.
The following illustrates the differences:

In [None]:
from IPython.display import display

display(sc.table(ds['alice']))  # data array, includes coordinates
# The variable holding the data, i.e., the dimension labels, units, values,
# and optional variances
display(sc.table(ds['alice'].data))
# Just the array of values, shorthand for ds['alice'].data.values
print("values:", ds['alice'].values)

A dataset works much like a Python `dict`.
For example we can insert new entries (here: columns):

In [None]:
ds['bob'] = ds['alice'].copy()  # make a deep copy

The `show()` function provides a quick graphical preview of the structure of a dataset:

In [None]:
sc.show(ds)

Operations between columns are supported by indexing into a dataset with a name:

In [None]:
ds['bob'] += ds['alice']
sc.table(ds)

Note how the coordinate is unchanged by this operation.
As a rule, operations *compare* coordinates (and fail if there is a mismatch).
In this case the coordinates are guaranteed to be the same since we operate with two columns of the same dataset, but the same logic applies when using entries from different datasets.

The contents of a dataset can be displayed on a graph using the `plot` method (or function):

In [None]:
ds.plot()

This plot demonstrates the advantage of "labeled" data, provided by a dataset:
Axes are automatically labeled and multiple items identified by their name are plotted.
Furthermore, scipp's support for units and uncertainties means that all relevant information is directly included in a default plot.

Operations between rows are supported by indexing into a dataset with a dimension label and an index.

Slicing dimensions behaves similarly to `numpy`:
If a single index is given, the dimension is dropped; if a range is given, the dimension is kept.
For a dataset, in the former case the corresponding coordinates are turned into attributes, whereas in the latter case the coordinate is preserved.
Compare:

In [None]:
a = np.arange(8)

In [None]:
a[4]

In [None]:
a[4:5]

In [None]:
ds['row', 1]

In [None]:
ds['row', 1:2]

Attributes are stored in a dictionary similar to coordinates.
The difference is that attributes are not required to match in operations.
Therefore we can perform the following operation without triggering an error about a coordinate mismatch:

In [None]:
ds['row', 1] += ds['row', 2]
sc.table(ds)

Note the key advantage over `numpy` or `MATLAB`:
We specify the index dimension, so we always know which dimension we are slicing.
The advantage is not so apparent in 1-D, but will become clear once we move to higher-dimensional data.

### Summary

There are a number of ways to select and operate on a single row, a range of rows, a single variable (column) or multiple variables (columns) of a dataset:

In [None]:
# Single row
display(sc.table(ds['row', 0:1]))
# Range of rows
display(sc.table(ds['row', 1:3]))
# Single column (column pair if variance is present) including coordinate columns
display(sc.table(ds["alice"]))
# Single variable (column pair if variance is present)
display(sc.table(ds["alice"].data))
# Column values without header
print("values:", ds["alice"].values)

### Exercise 1
1. Combining row slicing and "column" indexing, add the last row of the data for `'alice'` to the first row of data for `'bob'`.
2. Using the slice-range notation `a:b`, try adding the last two rows to the first two rows. Why does this fail?

### Solution 1

In [None]:
ds['bob']['row', 0] += ds['alice']['row', -1]
sc.table(ds)

If a range is given when slicing, the corresponding coordinate is preserved, and operations between misaligned data are prevented.

In [None]:
ds['bob']['row', 0:2] += ds['alice']['row', 1:3]  # will raise an exception

To circumvent the safety catch we can operate on the underlying variables containing the data.
The data is accessed using the `data` property:

In [None]:
ds['bob']['row', 0:2].data += ds['alice']['row', 1:3].data
sc.table(ds)

### Exercise 2

The slicing notation for variables (columns) and rows does not return a copy, but a view object.
This is very similar to how `numpy` operates:

In [None]:
a_slice = a[0:3]
a_slice += 100
a
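
For the `numpy` side of this comparison, the difference between a view and an independent array can be made explicit with `np.shares_memory`. The sketch below uses fresh names (`base`, `view`, `independent`) so that the array `a` above is not modified further:

```python
import numpy as np

base = np.arange(8)
view = base[0:3]                # basic slicing returns a view
independent = base[0:3].copy()  # an explicit copy owns its own buffer

view += 100       # also modifies the first three entries of `base`
independent += 1  # leaves `base` untouched

print(np.shares_memory(base, view), np.shares_memory(base, independent))
print(base)
```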

Using the slicing notation, create a new table (or replace the existing dataset `ds`) with one that does not contain the first and last rows of `ds`.

### Solution 2

In [None]:
ds2 = ds['row', 1:-1].copy()

# Or:
# from copy import copy
# ds2 = copy(ds['row', 1:-1])

sc.table(ds2)

Note that the call to `copy()` is essential.
If it is omitted we simply have a view onto the same data, and the original data is modified if the view is modified:

In [None]:
just_a_view = ds['row', 1:-1]
sc.to_html(just_a_view)
just_a_view['alice'].values[0] = 666
sc.table(ds)

## Appending rows and columns
We can append rows using `concat`, and add columns using `merge`:

In [None]:
ds = sc.concat([ds['row', 0:3], ds['row', 1:3]], 'row')

eve = sc.Dataset(data={'eve': sc.arange('row', 5.0)})
ds = sc.merge(ds, eve)

sc.table(ds)

### Exercise 3
Add the sum of the data for `alice` and `bob` as a new variable (column) to the dataset.

### Solution 3

In [None]:
ds['sum'] = ds['alice'] + ds['bob']
sc.table(ds)

## Interaction with `numpy` and scalars

Values (or variances) in a dataset are exposed in a `numpy`-compatible buffer format.
Direct access to the `numpy`-like underlying data array is possible using the `values` and `variances` properties:

In [None]:
ds['eve'].values

In [None]:
ds['alice'].variances

We can directly hand the buffers to `numpy` functions:

In [None]:
ds['eve'].values = np.exp(ds['eve'].values)
sc.table(ds)

### Exercise 4
1. As above for `np.exp` applied to the data for Eve, apply a `numpy` function to the data for Alice.
2. What happens to the unit and uncertainties when modifying data with external code such as `numpy`?

### Solution 4

In [None]:
ds['alice'].values = np.sin(ds['alice'].values)
sc.table(ds)

NumPy operations are not aware of the unit and uncertainties. Therefore the result is "garbage" unless the user has made sure that units and uncertainties are handled manually.

Corollary: Whenever available, built-in operators and functions should be preferred over the use of `numpy`: these will handle units and uncertainties for you.

### Exercise 5
1. Try adding a scalar value such as `1.5` to the `values` for `'eve'` or `'alice'`.
2. Try the same using the `data` property, instead of the `values` property.
   Why is it not working for `'alice'`?

### Solution 5

In [None]:
ds['eve'].values += 1.5
ds['alice'].values += 1.5
sc.table(ds)

Instead of `values` we can use the `data` property.
This will also correctly deal with variances, if applicable, whereas the direct operation with `values` is unaware of the presence of variances:

In [None]:
ds['eve'].data += 1.5

The `data` for Alice has a unit, so a direct addition with a dimensionless quantity fails:

In [None]:
ds['alice'].data += 1.5  # will raise an exception

We can use `sc.scalar` to create a scalar `Variable` with an attached unit that matches the unit of the data:

In [None]:
scale = sc.scalar(1.5, unit='m')  # same unit as the data for 'alice'
ds['alice'].data += scale
sc.table(ds)

Continue to [Multi-dimensional datasets](multi-d-datasets.ipynb) to see how datasets are used with multi-dimensional data.