# From tabular data to binned data

## Loading tabular data

In this tutorial we will use a file created by a McStas simulation for a diamond sample, with Geant4 simulating the detectors.
The file looks as follows:

In [None]:
filename = '/home/simon/mantid/instruments/DREAM/data_dream_diamond.csv'

with open(filename) as f:
    header = f.readline()
    print(header, f.readline())

The header line defines column names and (in some cases) units.
We extract them:

In [None]:
import re

pattern = re.compile(r'(\w+)(?:\s*\[(\w+)\])?')
name_to_unit = {m[1]: m[2] for m in pattern.finditer(header)}
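
To see what the pattern captures, the following minimal sketch applies it to a made-up header line. The column names and units in `sample_header` are illustrative assumptions, not the actual contents of the file:

```python
import re

# Hypothetical header line; the columns of the real file may differ.
sample_header = "det_id\tlambda [Angstrom]\tx_pos [mm]\n"

pattern = re.compile(r'(\w+)(?:\s*\[(\w+)\])?')
# Group 1 is the column name, group 2 the optional unit inside brackets.
sample_units = {m[1]: m[2] for m in pattern.finditer(sample_header)}
print(sample_units)
```

Columns without a bracketed unit map to `None`. Note that `\w+` only matches letters, digits, and underscores, so a compound unit such as `m/s` would not be captured by this pattern.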

We can now use [pandas.read_csv](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) to load the table as a [pandas.DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html).
It is about 10 times faster than [numpy.loadtxt](https://numpy.org/doc/stable/reference/generated/numpy.loadtxt.html) and also slightly more convenient:

In [None]:
import pandas as pd

# Pass the column names as an ordered list, replacing the names in the header row.
df = pd.read_csv(filename, sep='\t', header=0, names=list(name_to_unit))
df

`scipp.compat.from_pandas` can convert the `pandas.DataFrame` to a `scipp.Dataset`.
Only the units must be set manually:

In [None]:
import scipp as sc
ds = sc.compat.from_pandas(df)
for name, unit in name_to_unit.items():
    ds[name].unit = unit
ds

In [None]:
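import pandas as pd
import scipp as sc

# Alternative: read the zipped file directly; pandas infers the compression from the extension.
filename = '/home/simon/mantid/instruments/DREAM/data_dream_diamond.zip'
df = pd.read_csv(filename, sep='\t')
ds = sc.compat.from_pandas(df)
# Split each column label into a name and an optional bracketed unit.
for key in list(ds.keys()):
    name, *remainder = key.split(' ')
    ds[name] = ds.pop(key)
    ds[name].unit = remainder[0][1:-1] if remainder else None
ds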

This 1-D dataset represents the tabular data that was read from the file:

In [None]:
sc.table(ds[:10])

In the above table, each row (record) describes an *event*, in this case the detection of a neutron, with its associated metadata such as the detector module or the x, y, and z position.
The table is thus actually a table of metadata values for events with a value of "1 count" each.
To continue we convert this into a data array:

In [None]:
da = sc.DataArray(sc.ones(sizes=ds.sizes, unit='counts'))
da.coords.update({name: ds[name].data for name in ds})
da

## Basic histogramming and binning

We are now ready to bin or histogram our data.
In both cases we need to define bin edges or grouping coordinates.
As an initial 1-D example, we will compute a wavelength histogram:

In [None]:
wavelength = sc.linspace('lambda',
                         da.coords['lambda'].min().value,
                         da.coords['lambda'].max().value,
                         num=1001,
                         unit='Angstrom')

This can be histogrammed using `sc.histogram`:

In [None]:
histogrammed = sc.histogram(da, bins=wavelength)
histogrammed

Alternatively, we can use `sc.bin`, which keeps the underlying events and their metadata:

In [None]:
binned = sc.bin(da, edges=[wavelength])
binned

Since we used the same bin edges for histogramming and binning, computing the sum of values within each bin (given by `binned.bins.sum()`) gives the same result as histogramming directly.
Therefore only a single line is visible in the following plot:

In [None]:
sc.plot({'histogrammed': histogrammed, 'binned': binned.bins.sum()})

Since the above combines data from all pixels, the wavelength distribution is not really meaningful.
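
The difference between histogramming and binning can be illustrated without scipp. The following plain-Python sketch uses made-up events and a hypothetical helper `bin_index`; the treatment of the right-most edge is a choice made for this sketch and is not meant to describe scipp's behavior:

```python
from bisect import bisect_right


def bin_index(value, edges):
    """Return the index of the bin containing value, or None if out of range."""
    if value < edges[0] or value > edges[-1]:
        return None
    # In this sketch the last bin is closed on the right.
    return min(bisect_right(edges, value) - 1, len(edges) - 2)


# Made-up events: (wavelength, detector id); each event carries one count.
events = [(1.2, 7), (1.9, 3), (2.5, 7), (2.6, 1), (3.4, 3)]
edges = [1.0, 2.0, 3.0, 4.0]

# Histogramming: only the per-bin totals survive.
histogram = [0] * (len(edges) - 1)
# Binning: each bin keeps the events that fell into it, including metadata.
bins = [[] for _ in range(len(edges) - 1)]

for event in events:
    i = bin_index(event[0], edges)
    if i is not None:
        histogram[i] += 1
        bins[i].append(event)

# Summing the counts in each bin recovers the histogram.
bin_sums = [len(b) for b in bins]
print(histogram, bin_sums)
```

The binned form costs more memory, but the detector id of every event remains available for later regrouping.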

## Multi-dimensional binning

Binning is more powerful than `sc.histogram`.
Let us bin in multiple dimensions:

In [None]:
x = da.coords['x_pos']
y = da.coords['y_pos']
z = da.coords['z_pos']
x_pos = sc.linspace('x_pos', x.min().value, x.max().value, num=31, unit='mm')
y_pos = sc.linspace('y_pos', y.min().value, y.max().value, num=31, unit='mm')
z_pos = sc.linspace('z_pos', z.min().value, z.max().value, num=31, unit='mm')
binned = sc.bin(da, edges=[z_pos, y_pos, x_pos])
binned['z_pos', 20:].bins.sum().plot(norm='log', aspect='equal')

Above we can see a cut through the detector, which has the shape of a thick cylinder mantle.
The advantage of binned data over histogrammed data is that the metadata for each underlying event is still present.
We can therefore change the binning, or bin in additional dimensions.
For example, we can select the slice containing $z = 0$ and turn it into a higher-resolution cut:

In [None]:
x_fine = sc.linspace('x_pos', x.min().value, x.max().value, num=41, unit='mm')
y_fine = sc.linspace('y_pos', y.min().value, y.max().value, num=101, unit='mm')
xy_cut = sc.bin(binned['z_pos', sc.scalar(0.0, unit='mm')], edges=[y_fine, x_fine])
xy_cut

In [None]:
xy_cut.bins.sum().transpose().plot(aspect='equal')

Above we binned according to x, y, and z.
This reflects neither the physics nor the logical structure of the detectors and is generally not very useful.
The original table additionally contains information about the logical structure of the detector array.
In this case it is divided into modules, segments, counters, wires, and strips.
Instead of using `scipp.bin` with the `edges` keyword argument, we can use the `groups` keyword argument to bin based on discrete values.
The result is 5-D:

In [None]:
groups = {
    dim: sc.arange(dim, 1, da.coords[dim].max().value + 1, unit=None, dtype='int64')
    for dim in ['module', 'segment', 'counter', 'wire', 'strip']
}
binned = sc.bin(da, groups=list(groups.values()))
binned
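
Conceptually, grouping sorts events by exact label values rather than by value ranges. A minimal plain-Python sketch, assuming made-up events with only two of the five labels, might look like this:

```python
from collections import defaultdict

# Made-up events: each carries discrete labels for its detector element.
events = [
    {'module': 1, 'strip': 2, 'lambda': 1.1},
    {'module': 1, 'strip': 2, 'lambda': 2.3},
    {'module': 2, 'strip': 1, 'lambda': 1.7},
]

# Group by the exact (module, strip) label pair; no bin edges are involved.
grouped = defaultdict(list)
for event in events:
    grouped[(event['module'], event['strip'])].append(event)

counts = {labels: len(members) for labels, members in grouped.items()}
print(counts)
```

Unlike this dictionary, which only holds label combinations that actually occur, the scipp result above is a dense 5-D array spanning every requested combination of labels.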

We can select and plot slices as usual:

In [None]:
binned['module', 4]['segment', 3]['counter', 1].bins.sum().plot()

Each of the bins in the above data array corresponds to a detector voxel.

In [None]:
binned.bins.sum().data.max()

In [None]:
voxel = binned['module', 4]['segment', 3]['counter', 1]['strip', 152]['wire', 4]
sc.bin(voxel, edges=[wavelength]).bins.sum().plot()

Our data contains, for each detected neutron, the position of the associated voxel.
It is more practical to store this once per bin (voxel) instead of for every event.
We can also combine the x, y, and z components into a single array of position vectors:

In [None]:
pos = sc.zeros(sizes=binned.sizes, dtype=sc.DType.vector3, unit='mm')
pos.fields.x = binned.bins.coords['voxel_x'].bins.mean()
pos.fields.y = binned.bins.coords['voxel_y'].bins.mean()
pos.fields.z = binned.bins.coords['voxel_z'].bins.mean()
binned.coords['position'] = pos
binned
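
The use of the mean here relies on all events in a voxel carrying the same voxel position, so averaging simply recovers that value. A plain-Python sketch with made-up events and a hypothetical `voxel` label illustrates the idea:

```python
from collections import defaultdict
from statistics import mean

# Made-up events: every event repeats the position (in mm) of its voxel.
events = [
    {'voxel': 0, 'voxel_x': 10.0, 'voxel_y': 0.0, 'voxel_z': 5.0},
    {'voxel': 0, 'voxel_x': 10.0, 'voxel_y': 0.0, 'voxel_z': 5.0},
    {'voxel': 1, 'voxel_x': 12.0, 'voxel_y': 1.0, 'voxel_z': 5.0},
]

by_voxel = defaultdict(list)
for event in events:
    by_voxel[event['voxel']].append(event)

# One position vector per voxel: the mean over identical values is that value.
positions = {
    voxel: tuple(mean(e[axis] for e in members)
                 for axis in ('voxel_x', 'voxel_y', 'voxel_z'))
    for voxel, members in by_voxel.items()
}
print(positions)
```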

This can be used to create a 3-D scatter plot:

In [None]:
binned.plot(projection='3d', positions='position', pixel_size=10)

We can also inspect an individual component such as a strip:

In [None]:
binned['strip', 200].plot(projection='3d', positions='position', pixel_size=10)

It is also possible to "group" and "bin" at the same time.
Since strips roughly correspond to scattering angle, a plot against wavelength and strip may be useful:

In [None]:
sc.bin(da, groups=[groups['strip']], edges=[wavelength]).bins.sum().plot(norm='log')
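
As a rough plain-Python analogue of combining both operations, the sketch below counts made-up events per (strip, wavelength bin) pair. The half-open bin convention used here is an assumption of the sketch:

```python
from bisect import bisect_right
from collections import Counter

# Made-up events: (strip label, wavelength in Angstrom).
events = [(1, 1.2), (1, 1.8), (2, 1.5), (2, 2.5), (2, 2.7)]
edges = [1.0, 2.0, 3.0]

counts = Counter()
for strip, wavelength in events:
    # Half-open bins [edges[i], edges[i+1]); events outside the range are dropped.
    if edges[0] <= wavelength < edges[-1]:
        wavelength_bin = bisect_right(edges, wavelength) - 1
        # The key combines a discrete group (strip) with a continuous bin index.
        counts[(strip, wavelength_bin)] += 1

print(dict(counts))
```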