# Exploring data

When working with a dataset, the first step is usually to understand what data and metadata it contains.
In this chapter we explore how scipp supports this.

This tutorial contains exercises, but solutions are included directly.
We encourage you to download this notebook and run through it step by step before looking at the solutions.

First, in addition to importing `scipp`, we import `scippneutron`, which is required for loading NeXus files:

In [1]:
import scipp as sc
import scippneutron as scn
import numpy as np

You are running a "Debug" build of scipp. For optimal performance use a "Release" build.


We start by loading some data (download [here](https://github.com/ess-dmsc-dram/loki_tube_scripts/raw/master/test/test_data/LARMOR00049338.nxs)), in this case measured with a prototype of the [LoKI](https://europeanspallationsource.se/instruments/loki) detectors at the [LARMOR beamline](https://www.isis.stfc.ac.uk/Pages/Larmor.aspx):

In [2]:
data = scn.load(filename='/Users/spu92482/Downloads/LARMOR00049338.nxs')

Workspace run log 'good_frames' has unrecognised units: 'frames'
Workspace run log 'raw_frames' has unrecognised units: 'frames'


Note that the exercises in the following are fictional and do not represent the actual SANS data reduction workflow.

### Step 1: Use the HTML representation to see what the loaded data contains

The HTML representation is what Jupyter displays for a scipp object.
- Take some time to explore this view and try to understand all the information (dimensions, dtypes, units, ...).
- Note that sections can be expanded, and values can be shown by clicking the icons to the right.

In [3]:
data


BinEdgeError: Flatten: the bin edges cannot be joined together.

<scipp.DataArray>
Dimensions: Sizes[spectrum:114688, tof:1000, ]
Coordinates:
  position                vector_3_float64              [m]  (spectrum)  [(0.778000, 0.130467, 29.858778), (0.775065, 0.130467, 29.858778), ..., (-0.569652, -0.022866, 29.953283), (-0.572000, -0.022866, 29.953283)]
  sample_position         vector_3_float64              [m]  ()  [(0.000000, 0.000000, 25.300000)]
  source_position         vector_3_float64              [m]  ()  [(0.000000, 0.000000, 0.000000)]
  spectrum                    int32  [dimensionless]  (spectrum)  [11, 12, ..., 114697, 114698]
  tof                       float64            [µs]  (tof [bin-edge])  [5.000000, 105.000000, ..., 99905.000000, 100000.000000]
Data:
                            float64         [counts]  (spectrum, tof)  [0.000000, 0.000000, ..., 0.000000, 0.000000]  [0.000000, 0.000000, ..., 0.000000, 0.000000]
Attributes:
  A1HCent                 DataArray  [dimensionless]  ()  [<scipp.DataArray>
Dimensions: Sizes[time:24, 

### Step 2: Plot the data

Scipp objects can be plotted using the `plot()` method.
Alternatively `sc.plot(obj)` can be used.
Since this is neutron-scattering data, we can also use the "instrument view", provided by `scn.instrument_view(obj)` (assuming `scippneutron` was imported as `scn`).

- Plot the loaded data and familiarize yourself with the controls.
- Create the instrument view and familiarize yourself with the controls.

In [4]:
data.plot()


In [5]:
scn.instrument_view(data)


### Step 3: Exploring metadata

Above we saw that many attributes are scalar variables with `dtype=DataArray`.
The single value in a scalar variable is accessed using the `value` property.
Compare:

In [6]:
data.attrs['proton_charge_by_period']

In [7]:
data.attrs['proton_charge_by_period'].value

0.3437764048576355

Exercises:
1. Find some attributes of `data` with `dtype=DataArray` and plot their `value`.
   Also try `sc.table(attr.value)` to show a table representation.
2. Find and plot a monitor.
3. Try to normalize `data` to monitor 1.
   Why does this fail?
4. Plot all the monitors on the same plot.
   Note that `sc.plot()` can be used with a Python `dict` for this purpose: `sc.plot({'a': something, 'b': something_else})`.
5. Convert all the monitors from `'tof'` to `'wavelength'` using, e.g., `mon1_wav = scn.convert(mon1, 'tof', 'wavelength', scatter=False)`.
6. Inspect the HTML view and note how the "unit conversion" changed the dimensions and units.
7. Re-plot all the monitors on the same plot, now in `'wavelength'`.
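The physics behind the conversion in exercise 5 can be sketched in plain Python. This is only an illustration of the de Broglie relation $\lambda = h t / (m_n L)$ for a component that does not scatter the beam, not scipp's or scippneutron's implementation. The function name, the 10 m flight path, and the 10 ms time-of-flight are hypothetical values chosen for the example.

```python
# Plain-Python sketch of the TOF -> wavelength relation for a non-scattering
# component such as a monitor. Illustration only, not scippneutron's code.
PLANCK_H = 6.62607015e-34         # J s (exact SI value)
NEUTRON_MASS = 1.67492749804e-27  # kg (CODATA 2018)


def tof_to_wavelength_angstrom(tof_us, flight_path_m):
    """Convert a time-of-flight in microseconds to a wavelength in angstrom."""
    tof_s = tof_us * 1e-6
    velocity = flight_path_m / tof_s  # m/s
    wavelength_m = PLANCK_H / (NEUTRON_MASS * velocity)
    return wavelength_m * 1e10


# Hypothetical values: a monitor 10 m from the source, neutron arriving after 10 ms.
example_wavelength = tof_to_wavelength_angstrom(tof_us=10_000.0, flight_path_m=10.0)
print(f"{example_wavelength:.3f} angstrom")
```

Because the result depends on the flight path, components at different positions map the same time-of-flight to different wavelengths, which is why the coordinate changes from `'tof'` to `'wavelength'` rather than just being rescaled.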

In [8]:
sc.table(data.attrs['DCMagField2'].value)


In [9]:
try:
    data / data.attrs['monitor1'].value
except sc.CoordError:
    print('Data and monitor are in unit TOF, but pixels and monitors are at different position, so data is not comparable')

Data and monitor are in unit TOF, but pixels and monitors are at different position, so data is not comparable


In [10]:
mon1 = data.attrs['monitor1'].value
scn.convert(mon1, 'tof', 'wavelength', scatter=False)

In [11]:
sc.plot({f'monitor{i}':data.attrs[f'monitor{i}'].value for i in [1,2,3,4,5]}, norm='log')


In [12]:
sc.plot({f'monitor{i}':scn.convert(data.attrs[f'monitor{i}'].value, 'tof', 'wavelength', scatter=False) for i in [1,2,3,4,5]}, norm='log')


### Step 4: Fixing metadata

Exercises:
1. The `sample_position` coord is wrong; shift the sample by `delta = sc.vector(value=[0.01,0.01,0.04], unit=sc.units.m)`.
2. Because of a glitch in the timing system the time-of-flight has an offset of $2.3~\mu s$.
   Fix the corresponding coordinate.
3. Use the HTML view of `data` to verify that you applied the corrections/calibrations there, rather than in a copy.

In [13]:
data.coords['sample_position'] += sc.vector(value=[0.01,0.01,0.04], unit=sc.units.m)
data.coords['tof'] += 2.3 * sc.Unit('us') # note how we forgot to fix the monitor's TOF
data

BinEdgeError: Flatten: the bin edges cannot be joined together.

<scipp.DataArray>
Dimensions: Sizes[spectrum:114688, tof:1000, ]
Coordinates:
  position                vector_3_float64              [m]  (spectrum)  [(0.778000, 0.130467, 29.858778), (0.775065, 0.130467, 29.858778), ..., (-0.569652, -0.022866, 29.953283), (-0.572000, -0.022866, 29.953283)]
  sample_position         vector_3_float64              [m]  ()  [(0.010000, 0.010000, 25.340000)]
  source_position         vector_3_float64              [m]  ()  [(0.000000, 0.000000, 0.000000)]
  spectrum                    int32  [dimensionless]  (spectrum)  [11, 12, ..., 114697, 114698]
  tof                       float64            [µs]  (tof [bin-edge])  [7.300000, 107.300000, ..., 99907.300000, 100002.300000]
Data:
                            float64         [counts]  (spectrum, tof)  [0.000000, 0.000000, ..., 0.000000, 0.000000]  [0.000000, 0.000000, ..., 0.000000, 0.000000]
Attributes:
  A1HCent                 DataArray  [dimensionless]  ()  [<scipp.DataArray>
Dimensions: Sizes[time:24, 

Note how adding such an offset fails if we omit the unit:

In [14]:
try:
    data.coords['tof'] += 2.3
except sc.UnitError as e:
    print(e)

Cannot add µs and dimensionless.


This has several advantages:
- We are protected from accidental errors.
  If someone changes the unit of data or metadata without our knowledge, e.g., from `us` to `ms`, this mechanism protects us from silent errors corrupting the data.
- It makes the code clearer and more readable, both for others as well as for our future selves.
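The idea behind this protection can be illustrated with a toy class in plain Python. This is a minimal sketch under stated assumptions, not how scipp implements units: `Quantity` and `UnitMismatch` are hypothetical names, units are compared as plain strings, and no unit conversion is attempted.

```python
from dataclasses import dataclass


class UnitMismatch(Exception):
    """Raised when two quantities with different units are added."""


@dataclass(frozen=True)
class Quantity:
    """Toy value-with-unit; hypothetical, units are compared as strings."""
    value: float
    unit: str

    def __add__(self, other):
        if not isinstance(other, Quantity):
            # A bare number carries no unit, so treat it as dimensionless.
            other = Quantity(float(other), "dimensionless")
        if other.unit != self.unit:
            raise UnitMismatch(f"Cannot add {self.unit} and {other.unit}.")
        return Quantity(self.value + other.value, self.unit)


tof = Quantity(5.0, "us")
shifted = tof + Quantity(2.3, "us")  # units agree, so the addition succeeds

try:
    tof + 2.3  # unit omitted: rejected instead of silently adding
    error_message = None
except UnitMismatch as err:
    error_message = str(err)

print(shifted, error_message)
```

The point of the design is that a mismatch raises immediately at the line that caused it, instead of producing a plausible-looking but wrong number.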

### Step 5: A closer look at the data

The 2-D plot we obtained above by default is often not very enlightening.
Define:

In [15]:
counts = sc.sum(data, 'tof')

Exercises:
1. Create a plot of `counts` and also try the instrument view.
2. How many counts are there in total, in all spectra combined?
3. Plot a single spectrum of `data` as a 1-D plot using the slicing syntax to access the spectrum.
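What the reductions in these exercises compute can be sketched with nested Python lists. The numbers below are made up for illustration (3 spectra with 4 time-of-flight bins each) and are unrelated to the LARMOR file; the scipp calls named in the comments are the ones used in this chapter.

```python
# Toy data: 3 spectra x 4 TOF bins, made-up counts.
toy_counts = [
    [0, 2, 5, 1],
    [1, 0, 3, 0],
    [4, 4, 0, 2],
]

# Summing over the TOF axis leaves one number per spectrum,
# analogous to sc.sum(data, 'tof').
per_spectrum = [sum(row) for row in toy_counts]

# The grand total can be reached in two steps or in one.
total_two_steps = sum(per_spectrum)
total_one_step = sum(sum(row) for row in toy_counts)

# Selecting one spectrum leaves a 1-D sequence over TOF,
# analogous to data['spectrum', 1].
single_spectrum = toy_counts[1]

print(per_spectrum, total_two_steps, total_one_step, single_spectrum)
```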

In [16]:
# slice is optional, making plot more readable in the documentation
counts['spectrum', 56000:62000].plot()


In [17]:
scn.instrument_view(counts, norm='log')


In [18]:
# sc.sum(counts, 'spectrum') # would be another solution
sc.sum(data).value

58142417.0

In [19]:
data['spectrum',10000].plot()


As seen in the instrument view the detectors consist of 4 layers of tubes, each containing 7 straws.
Let us try to split up our data, so we can compare layers.
There are other (and probably better) ways to do this, but here we try to define an integer variable containing a layer index:

In [20]:
z = sc.geometry.z(data.coords['position'])
near = sc.min(z)
far = sc.max(z)
layer = ((z-near)*400).astype(sc.dtype.int32)
layer.unit = ''
layer.plot()

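How the scale factor controls the grouping can be seen in a plain-Python sketch of the same arithmetic. The z positions below are hypothetical (four groups 3 cm apart with three positions each), so the factors that work here, 34 and 215, apply only to this toy geometry and not to the real detector.

```python
def layer_index(z_values, factor):
    """Integer layer index: truncate the scaled distance from the nearest z."""
    near = min(z_values)
    return [int((z - near) * factor) for z in z_values]


# Hypothetical z positions in metres: 4 tube layers, 3 straw depths per layer.
toy_z = [
    29.855, 29.860, 29.865,
    29.885, 29.890, 29.895,
    29.915, 29.920, 29.925,
    29.945, 29.950, 29.955,
]

too_coarse = layer_index(toy_z, 8)      # everything collapses into index 0
tube_layers = layer_index(toy_z, 34)    # one index per group of three
straw_layers = layer_index(toy_z, 215)  # every z position gets its own index

print(sorted(set(too_coarse)))
print(tube_layers)
print(len(set(straw_layers)))
```

A factor that is too small merges layers, while one that is too large splits them; factors that place a position exactly on an integer boundary are best avoided, because floating-point rounding can then push it into either neighbouring index.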

Exercises:
- Change the magic parameter `400` in the cell above until pixels fall cleanly into layers, either 4 layers of tubes or 12 layers of straws.
- Store `layer` as a new coord in `data`.
- Use `sc.groupby(data, group='layer').sum('spectrum')` to group spectra into layers.
- Inspect and understand the HTML view of the result.
- Plot the result.
  There are two options:
  - Use `plot` with `projection='1d'`
  - Use `sc.plot` after collapsing dimensions, `sc.collapse(grouped, keep='tof')`
- Bonus: When grouping by straw layers, there is a different number of straws in the center layer of each tube (3 instead of 2) due to the flower-pattern arrangement of straws.
  Define a helper data array with data set to 1 for each spectrum, group by layers and sum over spectrum as above, and use this result to normalize the layer-grouped data from above to spectrum count.

In [21]:
# NOTE:
# - set magic factor to, e.g., 150 to group by straw layer
# - set magic factor to, e.g., 40 to group by tube layer
layer = ((z-near)*150).astype(sc.dtype.int32)
layer.unit = ''
data.coords['layer'] = layer
grouped = sc.groupby(data, group='layer').sum('spectrum')
grouped.plot(projection='1d')
sc.plot(sc.collapse(grouped, keep='tof'))


In [22]:
norm = sc.DataArray(data=layer*0+1, coords={'layer':layer})
norm = sc.groupby(norm, group='layer').sum('spectrum')
sc.plot(sc.collapse(grouped/norm, keep='tof'))
