# Unaligned and Realigned Data

## Introduction

Scipp supports features for *realigning* "unaligned" data.
Unaligned data in this context refers to data values irregularly placed in, e.g., space or time.
Realignment lets us:

- Map a table of position-based data to an X-Y-Z grid.
- Map a table of position-based data to an angle such as $\theta$.
- Map event time stamps to time bins.

The key feature here is that *realignment does not actually histogram or resample data*.
Data is kept in its original form.
The realignment just adds a wrapper with a coordinate system better suited to working with the scientific data.
Where possible, operations with the realigned wrapper are supported "as if" working with dense histogrammed data.

## From unaligned to realigned data

We outline the underlying concepts based on a simple example.

In [None]:
import scipp as sc
import numpy as np
from scipp.plot import plot
import matplotlib.pyplot as plt

np.random.seed(1) # Fixed for reproducibility

Consider a list of measurements at various "points" in space.
Here we restrict ourselves to the X-Y plane for visualization purposes:

In [None]:
N = 50
values = 10*np.random.rand(N)
data = sc.DataArray(
    data=sc.Variable(dims=['position'], unit=sc.units.counts, values=values, variances=values),
    coords={
        'position':sc.Variable(dims=['position'], values=['site-{}'.format(i) for i in range(N)]),
        'x':sc.Variable(dims=['position'], unit=sc.units.m, values=np.random.rand(N)),
        'y':sc.Variable(dims=['position'], unit=sc.units.m, values=np.random.rand(N))})
data

For every point we measured at, the auxiliary coordinates `'x'` and `'y'` give the position in the X-Y plane.
These are *not* dimension-coordinates, since our measurements are *not* on a 2-D grid, but rather points with an irregular distribution.
`data` is essentially a 1-D table of measurements.
We can plot this data:

In [None]:
plot(data)

The `'position'` dimension is not a continuous dimension but essentially just a row index in our table.
Such a figure, and this representation of the data in general, is therefore often of limited use in practice.

As an alternative view of our data we can create a scatter plot.
We do this explicitly here to demonstrate how the content of `data` is connected to elements of the figure:

In [None]:
fig, ax = plt.subplots()
scatter = ax.scatter(
    x=data.coords['x'].values,
    y=data.coords['y'].values,
    c=data.values)
ax.set_xlabel('x [{}]'.format(data.coords['x'].unit))
ax.set_ylabel('y [{}]'.format(data.coords['y'].unit))
cbar = plt.colorbar(scatter)
cbar.set_label("[{}]".format(data.unit))
fig.show()

This shows the distribution in space, but for real datasets with millions of points a scatter plot may become impractical.
Furthermore, operating on scattered data is often inconvenient and may require knowledge of the underlying representation.

We can now use `scipp.realign` to provide a more accessible wrapper for our data:

In [None]:
xbins = sc.Variable(dims=['x'], unit=sc.units.m, values=[0.1,0.5,0.9])
ybins = sc.Variable(dims=['y'], unit=sc.units.m, values=[0.1,0.3,0.5,0.7,0.9])
realigned = sc.realign(data, {'y':ybins,'x':xbins})
realigned

`realigned` is a 2-D data array, but it contains the original "unaligned" data, accessible through the `unaligned` property:

In [None]:
realigned.unaligned

The "realignment" procedure based on bin edges for `'x'` and `'y'` does *not* perform the actual histogramming step.
However, since the dimensions of `realigned` are defined by the bin-edge coordinates for `'x'` and `'y'`, we will see below that it behaves much like normal dense data for operations such as slicing.

We create another figure to better illustrate the structure of `realigned`:

In [None]:
fig, ax = plt.subplots()
scatter = ax.scatter(
    x=realigned.unaligned.coords['x'].values,
    y=realigned.unaligned.coords['y'].values,
    c=realigned.unaligned.values)
ax.set_xlabel('x [{}]'.format(realigned.coords['x'].unit))
ax.set_ylabel('y [{}]'.format(realigned.coords['y'].unit))
ax.set_xticks(realigned.coords['x'].values)
ax.set_yticks(realigned.coords['y'].values)
ax.grid()
cbar = fig.colorbar(scatter)
cbar.set_label("[{}]".format(data.unit))
fig.show()

This is essentially the same figure as the scatter plot for the original `data`.
The differences are:

- A "grid" (the bin edges) that is stored alongside the data.
- All points outside the limits of the specified bin edges have been dropped.
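
To make the second point concrete, the following NumPy-only sketch selects the points that lie within the outermost bin edges. It is not how scipp implements `realign`; the random data is a hypothetical stand-in for the table above, and half-open bins `[low, high)` are an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)  # hypothetical stand-in for the data above
x = rng.random(50)
y = rng.random(50)

# Same edge values as `xbins` and `ybins` above
x_edges = np.array([0.1, 0.5, 0.9])
y_edges = np.array([0.1, 0.3, 0.5, 0.7, 0.9])

# A point is kept only if it lies within the outermost edges in *both* dimensions.
# Assumption: bins are half-open, [low, high).
inside = (
    (x >= x_edges[0]) & (x < x_edges[-1]) &
    (y >= y_edges[0]) & (y < y_edges[-1])
)
print('kept {} of {} points'.format(inside.sum(), len(x)))
```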

`realigned` can now be histogrammed directly, without the need for specifying bin boundaries:

In [None]:
plot(sc.histogram(realigned))

Here `histogram` performs histogramming for all "realigned" dimensions, in this case `x` and `y`.
The resulting values in the X-Y bins are the counts accumulated from measurements at all points falling in a given bin.
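
As a rough illustration of this accumulation, the sketch below computes a weighted 2-D histogram with plain NumPy. This is an analogy rather than scipp's implementation, and the random arrays are hypothetical stand-ins for the values and coordinates of `data`.

```python
import numpy as np

rng = np.random.default_rng(1)  # hypothetical stand-in for the data above
n = 50
values = 10 * rng.random(n)
x = rng.random(n)
y = rng.random(n)

# Same edge values as `xbins` and `ybins` above
x_edges = np.array([0.1, 0.5, 0.9])
y_edges = np.array([0.1, 0.3, 0.5, 0.7, 0.9])

# Weighted histogram: each bin holds the sum of `values` of all points inside it.
# Points outside the outermost edges contribute nothing.
counts, _, _ = np.histogram2d(y, x, bins=[y_edges, x_edges], weights=values)

# One entry per (y, x) bin, i.e. len(edges) - 1 along each dimension
print(counts.shape)
```

The argument order `(y, x)` is chosen here so that the result has the same dimension order as `realigned`.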

Note also that since `realigned` contains the bin edges for the underlying unaligned data, the plotting function can perform the histogramming automatically and on the fly.
Hence, the call to `sc.histogram` above is redundant, and the same figure is obtained by calling:

In [None]:
plot(realigned)

## Working with realigned data

### Slicing

The realigned data can be sliced as usual, e.g., to create plots of subregions:

In [None]:
plot(realigned['x', 0])

Copying a slice of realigned data drops all unaligned data falling into areas outside the slice:

In [None]:
s = realigned['x', 0].copy()
print('before: {}'.format(len(realigned.unaligned.values)))
print('after:  {}'.format(len(s.unaligned.values)))

This can provide an intuitive way of "filtering" lists of data based on some property of the list items.
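
Conceptually, such a slice-and-copy acts like a boolean filter on the rows of the table. A minimal NumPy sketch of that idea follows; the `table` dictionary and its contents are hypothetical, only the `'x'` dimension is considered, and half-open bins are assumed.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
# Hypothetical table: one entry per measurement
table = {
    'site': np.array(['site-{}'.format(i) for i in range(n)]),
    'x': rng.random(n),
    'value': 10 * rng.random(n),
}
x_edges = np.array([0.1, 0.5, 0.9])

# Bin 0 along x spans [x_edges[0], x_edges[1]); keep only rows falling into it.
in_bin0 = (table['x'] >= x_edges[0]) & (table['x'] < x_edges[1])
filtered = {name: column[in_bin0] for name, column in table.items()}

print('before: {}'.format(n))
print('after:  {}'.format(len(filtered['value'])))
```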

### Masking

Masks can be defined for the unaligned data array, as well as the realigned wrapper.
This gives fine-grained and intuitive control: we can, for example, mask invalid list entries on the one hand and exclude regions in space on the other, without having to determine manually which list entries fall into the exclusion zone.

We define two masks, one for positions, and one in the X-Y plane:

In [None]:
# In general npos != N since positions out of bounds are dropped by `realign`
npos = len(realigned.unaligned.coords['position'].values)
position_mask = sc.Variable(
    dims=['position'],
    values=[i <= npos/4 for i in range(npos)]
)
x_y_mask = sc.Variable(
    dims=realigned.dims,
    values=np.array([[True, False], [True, False], [False, False], [False, False]])
)

Then we add the masks to `realigned`.
The position mask has to be added to the underlying unaligned data array:

In [None]:
realigned.unaligned.masks['broken_sensor'] = position_mask
realigned.masks['exclude'] = x_y_mask

As usual, more masks can be added if required, and masks can be removed as long as no reduction operation such as summing or histogramming has taken place.

We can then plot the result.
The mask of the underlying unaligned data is applied during the histogram step, i.e., masked positions are excluded.
The mask of the realigned wrapper is indicated in the plot and carried through the histogram step.
Make sure to compare this figure with the one we obtained earlier, before masking, and note how the values of the un-masked X-Y bins have changed due to masked positions of the underlying unaligned data:

In [None]:
plot(realigned)
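
The two kinds of mask act at different stages, which the following NumPy-only sketch mimics. It is an analogy under stated assumptions, not scipp's implementation: the random data is hypothetical, and the row mask is defined over all `n` rows rather than only the in-bounds ones.

```python
import numpy as np

rng = np.random.default_rng(1)  # hypothetical stand-in for the data above
n = 50
values = 10 * rng.random(n)
x = rng.random(n)
y = rng.random(n)
x_edges = np.array([0.1, 0.5, 0.9])
y_edges = np.array([0.1, 0.3, 0.5, 0.7, 0.9])

# Row mask (True = masked): masked rows are excluded *before* histogramming.
row_mask = np.arange(n) <= n / 4
keep = ~row_mask
counts, _, _ = np.histogram2d(
    y[keep], x[keep], bins=[y_edges, x_edges], weights=values[keep])

# Bin mask with the shape of the (y, x) grid: it is carried along with the result.
bin_mask = np.array([[True, False], [True, False], [False, False], [False, False]])
masked_counts = np.ma.array(counts, mask=bin_mask)

# Sum over the unmasked bins only
print(masked_counts.sum())
```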

### Plotting higher dimensions

On-the-fly histogramming is also supported for plotting realigned data with more than 2 dimensions:

In [None]:
N = 50
values = 10*np.random.rand(N)
data3d = sc.DataArray(
    data=sc.Variable(dims=['position'], unit=sc.units.counts, values=values, variances=values),
    coords={
        'position':sc.Variable(dims=['position'], values=['site-{}'.format(i) for i in range(N)]),
        'x':sc.Variable(dims=['position'], unit=sc.units.m, values=np.random.rand(N)),
        'y':sc.Variable(dims=['position'], unit=sc.units.m, values=np.random.rand(N)),
        'z':sc.Variable(dims=['position'], unit=sc.units.m, values=np.random.rand(N))})
zbins = sc.Variable(dims=['z'], unit=sc.units.m, values=np.linspace(0.1, 0.9, 20))
realigned = sc.realign(data3d, {'z':zbins,'y':ybins,'x':xbins})
plot(realigned)

<div class="alert alert-info">

**Note**
    
In this case, since the histogramming is performed on-the-fly for every slice through the data cube, the colorscale limits cannot be known in advance. The limits therefore grow automatically as we navigate through the cube, but do not shrink if the range of displayed values gets smaller again, to give a better feel of the relative values contained in different slices.

</div>
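
The behavior described in the note can be pictured as limits that only ever widen. A minimal sketch, assuming a hypothetical pre-histogrammed cube with dimensions `(z, y, x)`; this is not the plotting code itself:

```python
import numpy as np

rng = np.random.default_rng(1)
cube = rng.random((19, 4, 2))  # hypothetical cube with dims (z, y, x)

vmin, vmax = np.inf, -np.inf
history = []
for z in [3, 10, 0]:  # order in which slices are visited
    image = cube[z]
    # Limits are widened to include the current slice but never narrowed
    vmin = min(vmin, image.min())
    vmax = max(vmax, image.max())
    history.append((vmin, vmax))

for limits in history:
    print(limits)
```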

The automatic histogramming also works in a 1-dimensional projection: 

In [None]:
plot(realigned, projection="1d")

### Arithmetic operations

Arithmetic operations for realigned data arrays are currently only supported for realigned [Event data](../event-data/overview.ipynb).