# Dataset: A motivation from the Mantid perspective

If you are not familiar with Mantid or data-reduction for neutron-scattering data, this part of the introduction is likely irrelevant for you.
Skip ahead to [Dataset in a Nutshell - Part 1](demo-part1.ipynb) to start learning about the basics of using `Dataset`.

## Status quo

An incomplete overview of the aspects of Mantid that motivated the development of the `Dataset` container is given below.

![Mantid workspace inheritance tree](img/WorkspaceHierarchy.jpg)

- Not enough workspace types to support all aspects of all use-cases.
  Often a bit too inflexible to readily support new requirements.
  - Constant-wavelength data stored as length-1 histograms.
  - Polarization analysis?
  - Sample-environment parameter scans?
  - Handling multiple runs or data belonging to the same measurement/experiment.
  - `HKL` values stored in a histogram?!
  - Imaging?
  - Store a variety of beamline monitors in a flexible way?
  - <img src="img/sans-limitation.png" height="80%" width="80%">
  - Difficult to store certain information in a natural way, e.g., `E_i` or `E_f` for inelastic scattering.
  - Little hope of improving performance across the board, because the existing data structure is baked into a large number of algorithms.
- Partially inconsistent and incomplete Python interface.
- Limited leeway for interaction with other Python packages such as `numpy`, and for Python-style code in general.
- Have to memorize the API of many different workspaces, and the names of the corresponding algorithms.
  - Similar or identical concepts require the use of different algorithms for different workspace types, and sometimes even different algorithms for the same workspace type.
  - Hard to teach, with a long learning curve for developers as well as users.
- [List of other issues](https://github.com/mantidproject/documents/blob/master/Project-Management/CoreTeam/workspace-notes.md).

## Google Trends (category "Science")
<img src="img/google-trends.png" height="70%" width="70%">

## Mantid going the same direction?
<img src="img/cumulative-algorithm-count.png" height="80%" width="80%">

## Existing technology?
<img src="img/1280px-NumPy_logo.png" height="25%" width="25%">
<img src="img/1280px-Pandas_logo.png" height="25%" width="25%">
<img src="img/dataset-diagram-logo.png" height="25%" width="25%">

Of these, [xarray](http://xarray.pydata.org/en/stable/) has the most overlap with our requirements for a data container.
However, we are missing:
- C++ backend.
- Handling of physical units.
- Propagation of uncertainties.
- Support for histograms, i.e., we cannot have a bin-edge axis, which is longer by 1 than the other variables in that dimension.
- Support for event data.
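
To illustrate the histogram point in the list above: a histogram with `N` bins has `N + 1` bin edges, so the edge axis does not fit next to the counts if every variable along a dimension must have the same length. The sketch below uses plain `numpy` and made-up time-of-flight values (the name `tof` and all numbers are assumptions for illustration only, not taken from any real measurement).

```python
import numpy as np

# Hypothetical time-of-flight values, made up for illustration.
tof = np.array([0.5, 1.2, 1.7, 2.1, 2.9, 3.3])

# Histogram into 4 equal-width bins covering [0, 4].
counts, edges = np.histogram(tof, bins=4, range=(0.0, 4.0))

# The bin-edge axis is longer by 1 than the counts along the same dimension.
print(counts.shape)  # (4,)
print(edges.shape)   # (5,)
```

A container that requires equal lengths along a dimension forces a workaround here, such as storing bin centers instead of edges.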

## Dataset

`Dataset` as presented here is a C++ library with Python exports.
It is inspired by `xarray.Dataset`, but adds support for the key features that are (currently) missing in `xarray`.

A good way to think about a `Dataset` is as a Python `dict` of `numpy.ndarray` objects (in reality everything is implemented in C++), with the addition of labeled dimensions and units.
The entries in the `dict` are grouped into two main categories, coordinates and data.
This distinction yields an intuitive and well-defined behavior of operations:
- Coordinates are compared.
- Data is operated on.

*Example: An addition of two datasets will compare all coordinates and abort if they are incompatible. If it succeeds, all data variables are matched based on their tag/name and added.*
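
The "coordinates are compared, data is operated on" rule can be sketched in plain Python. This is only a conceptual toy, not the `Dataset` API: the nested-`dict` layout, the function name `add_datasets`, and the variable names are all hypothetical, and labeled dimensions and units are left out. Requiring both operands to hold the same set of data names is also an assumption of this sketch.

```python
import numpy as np


def add_datasets(a, b):
    """Add two toy datasets: compare coordinates, then add data by name."""
    # Coordinates are compared, never operated on.
    if a["coords"].keys() != b["coords"].keys():
        raise ValueError("coordinate names differ")
    for name, values in a["coords"].items():
        if not np.array_equal(values, b["coords"][name]):
            raise ValueError(f"coordinate {name!r} is incompatible")
    # Data variables are matched by name and added.
    # (Assumption of this toy: both operands hold the same data names.)
    if a["data"].keys() != b["data"].keys():
        raise ValueError("data variable names differ")
    data = {name: values + b["data"][name] for name, values in a["data"].items()}
    return {"coords": dict(a["coords"]), "data": data}


x = np.array([0.0, 1.0, 2.0])
run1 = {"coords": {"x": x}, "data": {"sample": np.array([1.0, 2.0, 3.0])}}
run2 = {"coords": {"x": x.copy()}, "data": {"sample": np.array([10.0, 20.0, 30.0])}}

total = add_datasets(run1, run2)
print(total["data"]["sample"])
```

If the `x` coordinate of `run2` were shifted, the addition would abort with a `ValueError` instead of silently adding mismatched data.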

`Dataset` is a *single* and *simple* container that provides a *uniform* interface to data that Mantid stores in:
- `MatrixWorkspace` and its child classes, notably `Workspace2D` and `EventWorkspace`.
- `TableWorkspace`.
- `MDHistoWorkspace`.
- Instrument-2.0 types (`DetectorInfo` and `ComponentInfo`).
- `EventList`, although other, more efficient forms of event storage could also be supported.
- `Histogram`, in special cases where 1D histograms are needed; the default data layout of a dataset removes the need for this.
- Various others, such as `Run` or `TimeSeriesProperty`, if desired.

In addition, it covers many other cases that are currently impossible to represent in a single workspace or in an intuitive manner.

A basic dataset might look like this:

<img src="img/dataset-3d-two-data-variables.png" height="66%" width="66%">

At this point, we recommend a glimpse at [Dataset in a Nutshell - Part 3](demo-part3.ipynb) to get an idea of what the basics introduced in Parts 1 and 2 can be used for.
We do not suggest attempting to understand the details, but consider that in very few lines of code we can:
- Merge event data into a dataset with multiple variables (sample and vanadium run). Subsequent operations are applied to both.
- Inspect the structure of the dataset, including the nested "tables" of event data for each pixel.
- Histogram the data, keeping or removing the events.
- Add three different types of monitors to the same dataset (event-mode, histogram-mode, pixellated).
- Plot various slices of the data and monitors.
- Rebin, convert units, and normalize to monitors or vanadium.
- Add a temperature dimension and axis and concatenate data from different sample temperatures, transforming the initial 2D data into 3D data.

Continue to [Dataset in a Nutshell - Part 1](demo-part1.ipynb) to start learning about the basics of using `Dataset`.