# From tabular data to binned data

## Overview

Binned data in scipp is conceptually equivalent to an array of tables.
In other words, it represents an array of records (table rows) sorted into an (often multi-dimensional) array of "bins".
In this tutorial we begin by learning how to set up tabular data appropriate for histogramming and binning with scipp.
The main focus will then be binning the tabular data and basic usage of the resulting binned data.

We will use a file from a simulated neutron-scattering experiment at the powder diffractometer *DREAM* at the European Spallation Source.
The approach and techniques shown here are, however, applicable more generally and are not specific to this scientific area.

## Loading tabular data

We will use a file created by a McStas simulation for a diamond sample, with Geant4 simulating the detectors.
We can use [pandas.read_csv](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) to load the table as a [pandas.DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html).
`pandas.read_csv` is much faster than [numpy.loadtxt](https://numpy.org/doc/stable/reference/generated/numpy.loadtxt.html) and also slightly more convenient:

In [None]:
import pandas as pd

filename = 'https://public.esss.dk/groups/scipp/scippneutron/4/data_dream_diamond.zip'
df = pd.read_csv(filename, sep='\t')
df

`scipp.compat.from_pandas` can convert the `pandas.DataFrame` to a `scipp.Dataset`.
The column names encode the physical units so we must extract them manually:

In [None]:
import scipp as sc

ds = sc.compat.from_pandas(df)
ds.coords.pop('row')  # we have no use for this row index
for key in list(ds):
    name, *remainder = key.split(' ')
    ds[name] = ds.pop(key)
    ds[name].unit = remainder[0][1:-1] if remainder else None
sc.table(ds[:10])
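
To see what the loop above does to a single column name, the parsing step can be isolated in plain Python.
This is only a minimal sketch: the helper `split_name_and_unit` and the example headers are hypothetical, and the code assumes nothing beyond the `name [unit]` pattern implied by the loop.

```python
def split_name_and_unit(column):
    """Split a header such as 'x_pos [mm]' into ('x_pos', 'mm').

    Headers without a bracketed unit, e.g. 'module', yield (name, None).
    """
    name, *remainder = column.split(' ')
    # Strip the surrounding brackets from '[mm]' to obtain 'mm'.
    unit = remainder[0][1:-1] if remainder else None
    return name, unit


# Hypothetical headers, for illustration only; the real file may differ.
examples = ['x_pos [mm]', 'wavelength [angstrom]', 'module']
parsed = [split_name_and_unit(column) for column in examples]
print(parsed)
```

Columns without a unit, such as logical indices, end up with `None`, which matches the `unit = ... else None` branch in the cell above.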

This 1-D dataset represents the tabular data that was read from the file.
In the above table, each row (record) describes an *event*, in this case the detection of a neutron, with its associated metadata such as the detector module or the x, y, and z position.

To histogram or bin data by a column, scipp must know which columns are metadata and which column holds data values.
The table is actually a table of metadata values for events with an implicit data value of "1 count" each.
To continue we convert this into a data array:

In [None]:
table = sc.DataArray(sc.ones(sizes=ds.sizes, unit='counts'))
for name in ds:
    table.coords[name] = ds[name].data
table

## Histogramming and binning

We are now ready to bin or histogram our data.
Scipp uses the following terminology:

- Binning preserves the original data records as a table associated with each bin.
- Histogramming adds up the values from all contributing records into a single value per bin.

In both cases we need to define bin edges.
As an initial 1-D example, we will compute a wavelength histogram, so we create a variable of wavelength edges:

In [None]:
wavelength = sc.linspace(
    'wavelength',
    table.coords['wavelength'].min().value,
    table.coords['wavelength'].max().value,
    num=1001,
    unit='angstrom',
)

The table can now be histogrammed using `sc.histogram`:

In [None]:
histogrammed = sc.histogram(table, bins=wavelength)
histogrammed

Alternatively, we can use `sc.bin`, which keeps the underlying events and their metadata:

In [None]:
binned = sc.bin(table, edges=[wavelength])
binned

Since we used the same bin edges for histogramming and binning, computing the sum of values within each bin (given by `binned.bins.sum()`) yields the same result as histogramming directly.
Therefore only a single line is visible in the following plot:

In [None]:
bin_sums = binned.bins.sum()
sc.plot({'histogrammed': histogrammed, 'binned': bin_sums})
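
The equivalence can also be illustrated without scipp.
The following is a rough sketch in plain NumPy, using made-up uniformly distributed wavelengths instead of the DREAM file; it is not how scipp implements `sc.bin` or `sc.histogram`, and all names are chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
# Made-up event table: one wavelength and one count per event.
wavelength = rng.uniform(1.0, 5.0, size=1000)
counts = np.ones_like(wavelength)

edges = np.linspace(1.0, 5.0, num=11)  # 11 edges define 10 bins
n_bins = len(edges) - 1

# Histogramming: only the per-bin sums survive.
histogrammed, _ = np.histogram(wavelength, bins=edges, weights=counts)

# Binning: determine the bin of every event and keep the events themselves.
# The clip only matters for an event lying exactly on the last edge.
bin_index = np.clip(np.searchsorted(edges, wavelength, side='right') - 1, 0, n_bins - 1)
events_per_bin = [counts[bin_index == i] for i in range(n_bins)]

# Summing within each bin afterwards reproduces the histogram.
bin_sums = np.array([events.sum() for events in events_per_bin])
print(np.allclose(histogrammed, bin_sums))
```

The difference is that `events_per_bin` still holds every event, so further operations on individual events remain possible after binning.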

Since the above combines data from all pixels, the wavelength distribution is not really meaningful.

While the results of histogramming and of summing the binned data may appear similar or identical, the internal structure is very different.
The histogrammed data consists of essentially two arrays, one for the values (yellow) and one for the wavelengths (green):

In [None]:
sc.show(histogrammed)

The top-level structure of the binned data is the same, i.e., we have an array of values and an array of wavelengths.
The difference is that each value (bin) stores all contributing table rows:

In [None]:
sc.show(binned)
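
One way to picture an "array of tables" is a single event table sorted by bin, plus an offset marking where each bin starts.
The sketch below uses plain NumPy and made-up values; it is a conceptual picture under that assumption and not a description of scipp internals. The helper `rows_in_bin` is hypothetical.

```python
import numpy as np

# Made-up events: a wavelength and a detector id per event.
wavelength = np.array([2.7, 1.2, 3.9, 1.8, 2.1, 3.1])
detector = np.array([10, 11, 12, 13, 14, 15])
edges = np.array([1.0, 2.0, 3.0, 4.0])  # 4 edges define 3 bins
n_bins = len(edges) - 1

# Bin index of every event, then a stable sort so rows of one bin are adjacent.
bin_index = np.searchsorted(edges, wavelength, side='right') - 1
order = np.argsort(bin_index, kind='stable')
sorted_wavelength = wavelength[order]
sorted_detector = detector[order]

# offsets[i]:offsets[i + 1] is the row range of bin i in the sorted table.
offsets = np.concatenate([[0], np.cumsum(np.bincount(bin_index, minlength=n_bins))])


def rows_in_bin(i):
    """Return the table rows (wavelengths, detector ids) stored in bin i."""
    rows = slice(offsets[i], offsets[i + 1])
    return sorted_wavelength[rows], sorted_detector[rows]


print(rows_in_bin(1))
```

In this picture a histogram would keep only the length (or sum) of each row range, whereas binned data keeps the rows.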

### Exercise 1

- Define bin edges for `z_pos` and use them to histogram and bin `table`.
  Plot the results.
- Define different bin edges for `z_pos`, e.g., with more values.
  Use `sc.bin` with the new edges *on the result of the binning from the first bullet*, i.e., *not* on the original table `table`.
  Why is this possible?

In [None]:
# your code here

## Multi-dimensional spatial binning

`sc.bin` can handle multiple dimensions:

In [None]:
x = table.coords['x_pos']
y = table.coords['y_pos']
z = table.coords['z_pos']
x_pos = sc.linspace('x_pos', x.min().value, x.max().value, num=31, unit='mm')
y_pos = sc.linspace('y_pos', y.min().value, y.max().value, num=31, unit='mm')
z_pos = sc.linspace('z_pos', z.min().value, z.max().value, num=31, unit='mm')
binned = sc.bin(table, edges=[z_pos, y_pos, x_pos])
binned['z_pos', 20:].bins.sum().plot(norm='log', aspect='equal')

Above we can see a cut through the detector assembly, which has the shape of a thick cylinder mantle.

The advantage of binned data over histogrammed data is that metadata for each underlying event is still present.
We can therefore change the binning, or bin in additional dimensions.
For example, we can select the slice containing $z = 0$ and turn it into a higher-resolution cut:

In [None]:
x_fine = sc.linspace('x_pos', x.min().value, x.max().value, num=41, unit='mm')
y_fine = sc.linspace('y_pos', y.min().value, y.max().value, num=101, unit='mm')
z_slice = binned['z_pos', sc.scalar(0.0, unit='mm')]
xy_cut = sc.bin(z_slice, edges=[y_fine, x_fine])
xy_cut

In [None]:
xy_cut.bins.sum().transpose().plot(aspect='equal')
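
The idea of slicing coarse bins and then re-binning at a finer resolution can be sketched in plain NumPy.
The positions below are made up and uniformly distributed, and the variable names are chosen for illustration; the code only mimics the steps above and does not use scipp.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
# Made-up event positions in mm.
x = rng.uniform(-100.0, 100.0, size=5000)
y = rng.uniform(-100.0, 100.0, size=5000)
z = rng.uniform(-50.0, 50.0, size=5000)

# Coarse binning along z: remember the bin of each event instead of summing.
z_edges = np.linspace(-50.0, 50.0, num=6)  # 5 bins, each 20 mm wide
z_bin = np.searchsorted(z_edges, z, side='right') - 1

# Select the slice whose bin contains z = 0 ...
slice_index = np.searchsorted(z_edges, 0.0, side='right') - 1
in_slice = z_bin == slice_index

# ... and histogram only those events on a finer x-y grid.
x_fine = np.linspace(-100.0, 100.0, num=41)
y_fine = np.linspace(-100.0, 100.0, num=101)
xy_cut, _, _ = np.histogram2d(y[in_slice], x[in_slice], bins=[y_fine, x_fine])
print(xy_cut.shape)
```

This works only because the x and y values of each event are still available after the z binning; a z histogram alone would not allow it.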

### Exercise 2

- Compute the radius from `x` and `y` (defined above) and store it as a new coordinate in `table`.
- Define bin edges for the radius.
- Bin `table` by `z_pos` and the radius.
- Plot the result.

In [None]:
# your code here

In [None]:
# TODO remove
radius = sc.sqrt(x**2 + y**2)
table.coords['radius'] = radius
# Use a separate name for the edges so the per-event radius is not overwritten
radius_edges = sc.linspace('radius', radius.min().value, radius.max().value, num=13, unit='mm')
sc.bin(table, edges=[z_pos, radius_edges]).bins.sum().plot()
#da.coords['phi'] = sc.atan2(y=y,x=x)
#phi = sc.linspace('phi', 0.0, 0.7, num=200, unit='rad')
#da.coords['theta'] = sc.atan(sc.sqrt(x**2+y**2)/z)
#theta = sc.linspace('theta', 0.7, 1.6, num=400, unit='rad')

## Multi-dimensional logical binning

Above we binned according to x, y, and z.
This reflects neither the physics nor the logical structure of the detectors and is generally not very useful.
The original table additionally contains information about the logical structure of the detector array.
In this case it is divided into modules, segments, counters, wires, and strips.
We define:

In [None]:
# Note that indices in the file are 1-based, not 0-based
groups = {
    dim: sc.arange(dim, 1, table.coords[dim].max().value + 1, unit=None, dtype='int64')
    for dim in ['module', 'segment', 'counter', 'wire', 'strip']
}
groups

Instead of using `sc.bin` with the `edges` keyword argument we can use the `groups` keyword argument to perform a binning based on discrete values.
The result is 5-D:

In [None]:
binned = sc.bin(table, groups=list(groups.values()))
binned
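
Grouping differs from binning with edges in that each discrete value selects its group directly.
As a minimal sketch in plain NumPy, with made-up 1-based indices for just two of the five dimensions (this is an illustration, not scipp's implementation):

```python
import numpy as np

# Made-up logical indices per event; 1-based, as in the file.
module = np.array([1, 2, 2, 1, 3, 2])
strip = np.array([1, 1, 2, 2, 1, 1])

n_module = module.max()
n_strip = strip.max()

# One group per (module, strip) pair.
# Subtract 1 to turn the 1-based labels into 0-based array indices.
counts = np.zeros((n_module, n_strip), dtype=np.int64)
np.add.at(counts, (module - 1, strip - 1), 1)
print(counts)
```

Groups that received no events, here module 3 with strip 2, still exist in the output and simply hold zero counts.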

### Exercise 3

- Group `table` using `sc.bin` but only by strip and wire.
- Plot the result.
- The wire index increases with the cylinder radius; the strip index increases with z (or decreases with scattering angle).
  Explain the plot, e.g., why is intensity decreasing with increasing wire index?

In [None]:
# your code here

In [None]:
#TODO remove
sc.bin(table, groups=[groups['strip'], groups['wire']]).bins.sum().plot()

## From event-based metadata to bin-based metadata

For each detected neutron our data records the position of the associated voxel.
After the logical grouping above, every bin corresponds to a voxel.

It can be more practical to store the voxel position for every bin (voxel) instead of for every event.
This can be achieved, e.g., by computing the mean for every bin.
Note that in this case all events in a voxel record the same voxel position, so this procedure is wasteful; in practice we may prefer loading the voxel positions directly from a file.

We can also combine the x, y, and z components into a single array of position vectors:

In [None]:
pos = sc.zeros(sizes=binned.sizes, dtype=sc.DType.vector3, unit='mm')
pos.fields.x = binned.bins.coords['voxel_x'].bins.mean()
pos.fields.y = binned.bins.coords['voxel_y'].bins.mean()
pos.fields.z = binned.bins.coords['voxel_z'].bins.mean()
binned.coords['position'] = pos
binned

Equipped with the position of every voxel, we can compute the number of neutrons counted per voxel and create a 3-D scatter plot.
The "scatter points" correspond to the voxel positions.
In this particular case some voxels had no associated neutrons so the computed position is invalid and no scatter point is shown:

In [None]:
counts_per_voxel = binned.bins.sum()
counts_per_voxel.plot(projection='3d', positions='position', pixel_size=10)
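
Why voxels without events end up with no valid position can be seen in a small sketch in plain NumPy.
It assumes that "invalid" means the NaN that results from taking a mean over zero events; the arrays are made up and the code does not use scipp.

```python
import numpy as np

# Made-up events: a 0-based voxel index and the x position recorded with each event.
voxel = np.array([0, 0, 2, 2, 2])
voxel_x = np.array([1.5, 1.5, 4.0, 4.0, 4.0])
n_voxel = 4  # voxels 1 and 3 received no events

event_count = np.bincount(voxel, minlength=n_voxel)
x_sum = np.bincount(voxel, weights=voxel_x, minlength=n_voxel)

# The mean is sum / count; for an empty voxel this is 0 / 0, which gives NaN.
with np.errstate(invalid='ignore'):
    mean_x = x_sum / event_count
print(mean_x)
```

Since every event in a voxel carries the same position, the mean of a non-empty voxel simply recovers that position.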

We can also inspect an individual component such as a strip:

In [None]:
binned['strip', 200].bins.sum().plot(projection='3d', positions='position', pixel_size=10)

### Exercise 4

Above, in [Multi-dimensional logical binning](#Multi-dimensional-logical-binning), we binned into individual voxels (based on 5 distinct logical indices) and then computed voxel positions.

- Repeat this without binning by wire, i.e., use only module, segment, counter, and strip.
- Compute the resulting mean positions analogously to before.
- Create a scatter plot as before.
  This should yield a rough projection onto a cylinder.

In [None]:
# your code here
# TODO remove
proj = sc.bin(table, groups=[groups[key] for key in groups if key != 'wire'])
pos = sc.zeros(sizes=proj.sizes, dtype=sc.DType.vector3, unit='mm')
pos.fields.x = proj.bins.coords['voxel_x'].bins.mean()
pos.fields.y = proj.bins.coords['voxel_y'].bins.mean()
pos.fields.z = proj.bins.coords['voxel_z'].bins.mean()
proj.bins.sum().plot(projection='3d', positions=pos, pixel_size=10, norm='log')

## Binning with edges and groups combined

It is also possible to "group" and "bin" at the same time.
Since strips roughly correspond to scattering angle, a plot against wavelength and strip may be useful.

### Exercise 5

- Use `sc.bin` to bin `table` by strip and wavelength.
- Plot the result.

#### Solution

In [None]:
sc.bin(table, groups=[groups['strip']], edges=[wavelength]).bins.sum().plot(norm='log')
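
Combining a discrete axis with a continuous one can be pictured as a 2-D count array in which the strip index selects the row and the wavelength edges determine the column.
The following plain NumPy sketch uses made-up events and illustrative names; it mirrors the idea and is not scipp's implementation.

```python
import numpy as np

rng = np.random.default_rng(seed=2)
# Made-up events: a 1-based strip index and a wavelength in angstrom.
strip = rng.integers(1, 5, size=2000)  # strips 1 to 4
wavelength = rng.uniform(1.0, 5.0, size=2000)

n_strip = 4
wavelength_edges = np.linspace(1.0, 5.0, num=9)  # 9 edges define 8 bins

# Discrete axis: the strip index selects the row directly.
# Continuous axis: the bin edges determine the column.
column = np.searchsorted(wavelength_edges, wavelength, side='right') - 1
counts = np.zeros((n_strip, len(wavelength_edges) - 1), dtype=np.int64)
np.add.at(counts, (strip - 1, column), 1)
print(counts.shape)
```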