<a href="https://scipp.github.io"><img src="https://scipp.github.io/_static/logo-2022.svg" width="600" /></a>

# Multi-dimensional arrays with labeled dimensions and physical units

## [scipp.github.io](https://scipp.github.io)

<br>

<table style="margin-left:0px;">
    <tr>
        <td>
            <h1>About me</h1><br>
            <h2>Neil Vaytet</h2>
            <h3>
            <ul>
                <li>Scientific software developer @ <a href="https://europeanspallationsource.se/">European Spallation Source (DK/SE)</a></li>
                <li>Python for scientific data analysis</li>
                <li>Data visualization</li>
                <li><a href="mailto:neil.vaytet@ess.eu">neil.vaytet@ess.eu</a></li>
            </ul>
            </h3>
        </td>
        <td>
            <img src="neil.png" width="200" /> &nbsp;
            <img src="https://europeanspallationsource.se/themes/custom/ess/logo.svg" width="300" />
        </td>
    </tr>
</table>


<h3>
    <img src="simon.png" width="60" /> Simon Heybrock &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
    <img src="janlukas.png" width="60" />Jan-Lukas Wynen &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
    <img src="sunyoung.png" width="60" />Sunyoung Yoo
</h3>

<br><br><br><br>
<br><br><br><br>
<br><br><br><br>

In [None]:
%matplotlib inline
import numpy as np
import scipp as sc
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=1234)

In [None]:
def plot(*x):
    """
    Useful plot function for 1d and 2d data
    """
    fig, ax = plt.subplots()
    for a in x:
        if a.ndim == 1:
            ax.plot(np.arange(len(a)), a)
        elif a.ndim == 2:
            ax.imshow(a, origin="lower")

def scatter(x, y):
    """
    Simple scatter plot
    """
    fig, ax = plt.subplots()
    ax.scatter(x, y, marker=".", s=1)
    ax.set_aspect("equal")
    ax.set_xlim(x.min(), x.max())
    ax.set_ylim(y.min(), y.max())
    return ax

<br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br>

## 1. Introduction to labeled dimensions: why do we need them?

In [None]:
ny, nx = 10, 20
a = np.sin(np.arange(ny) / (ny / 4)).reshape((-1, 1)) * np.cos(np.arange(nx) / (ny / 4))
a.shape

In [None]:
plot(a)

In [None]:
# Slice out row number 4
plot(a[4, :])

### We can't always deduce from the shape

In [None]:
ny, nx = 20, 20
a = np.sin(np.arange(ny) / (ny / 4)).reshape((-1, 1)) * np.cos(np.arange(nx) / (ny / 4))
a.shape

In [None]:
plot(a)

In [None]:
# Not always obvious which dimension is which
plot(a[:, 4], a[4, :])
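
The ambiguity can be made concrete with a small check. The sketch below rebuilds the same square array (the names `swapped`, `same_shape` and `same_content` are only illustrative) and shows that the array and its transpose have identical shapes but different contents, so the shape alone cannot tell us which axis is which.

```python
import numpy as np

ny, nx = 20, 20
a = np.sin(np.arange(ny) / (ny / 4)).reshape((-1, 1)) * np.cos(np.arange(nx) / (ny / 4))
swapped = a.T  # same data with the two axes exchanged

same_shape = a.shape == swapped.shape    # the shapes are indistinguishable
same_content = np.allclose(a, swapped)   # the contents are not
print(same_shape, same_content)
```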

### The situation gets worse with more dimensions

Say I now have an array that has 4 dimensions: `x, y, z, time` (in that order, maybe?)

In [None]:
a = rng.random((20,) * 4)  # use the seeded generator so the values are reproducible
a.shape

I want to get the first `z` slice...

Which one was it again?

In [None]:
z_slice = a[:, :, 0, :]  # if the order is x, y, z, t
z_slice = a[0, :, :, :]  # if the order is z, y, x, t
z_slice = a[:, :, :, 0]  # if the order is t, x, y, z

<br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br>


### Introducing labeled dimensions

<img src="https://docs.xarray.dev/en/stable/_static/dataset-diagram-logo.png" width="220" /> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <img src="https://scipp.github.io/_static/logo-2022.svg" width="220" />

[Xarray](https://docs.xarray.dev/en/stable/index.html) (https://docs.xarray.dev) introduced labels to multi-dimensional NumPy arrays.

"*real-world datasets are usually more than just raw numbers; they have labels which encode information about how the array values map to locations in space, time, etc.*"

We have embraced, and to a large extent copied, the Xarray mechanism.

In [None]:
var = sc.array(dims=["x", "y", "z", "time"], values=a)
var

Getting the `z` slice is now easy and **readable**:

In [None]:
var["z", 0]

<br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br>

### Adding coordinates

- Coordinates can be specified for each dimension.
- They describe the extent of each axis, as well as how far each data point is from its neighbours.

In [None]:
data = sc.array(dims=["altitude", "year"], values=rng.random((5, 9)))
sc.show(data)

In Scipp and Xarray, coordinates are added in a data structure called `DataArray`:

In [None]:
da = sc.DataArray(
    data=data,
    coords={
        "year": sc.arange("year", 2015, 2024),
        "altitude": sc.linspace("altitude", 0, 8000, 5),
    },
)
sc.show(da)

In [None]:
da

In [None]:
da.coords['year-since-2000'] = da.coords['year'] - 2000
da

<br><br><br><br><br><br><br><br>
<br><br><br><br><br><br><br><br>

## 2. Going further

<img src="https://scipp.github.io/_static/logo-2022.svg" width="220" />

### 2.1 Physical units

Every data variable and coordinate in Scipp has physical units.
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
(see also [pint](https://pint.readthedocs.io/en/stable/), [astropy.units](https://docs.astropy.org/en/stable/units/index.html), [pint-xarray](https://pint-xarray.readthedocs.io/en/stable/), ...)

In [None]:
x = sc.array(dims=['row'], values=rng.normal(size=10000), unit='cm')
y = sc.array(dims=['row'], values=rng.normal(size=10000), unit='cm')
recording = sc.DataArray(data=sc.ones(sizes=x.sizes, unit='counts'),
                         coords={'x': x, 'y': y})
image = recording.hist(y=100, x=100)
image

In [None]:
image.plot(aspect="equal")

In [None]:
integration_time = sc.scalar(300.0, unit="s")
image /= integration_time
print(image.unit)

image.plot(aspect="equal")

<br><br><br><br><br><br><br><br><br><br><br><br><br><br><br>

### Units also provide protection

Say I now have a background image (dark frame) that I want to subtract from the signal image above,
but I forgot to first normalize it by its integration time:

In [None]:
background = sc.array(dims=["y", "x"], values=rng.random((100, 100)), unit="counts")

# This raises a unit error: `image` is in counts/s but `background` is still in counts
image - background

In [None]:
background_integration_time = sc.scalar(60.0, unit="s")
background /= background_integration_time

background_subtracted = image - background

- Units catch difficult-to-spot bugs early in a workflow.
- They save **hours** of debugging time, free up mental capacity, and let the user focus on the important thing: **doing science**.

<br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br>

### Using units for label-based indexing

We also use units to distinguish between positional indexing and label-based indexing:

In [None]:
image['x', -0.5 * sc.Unit('cm')].plot()

<br><br><br><br><br><br><br><br>
<br><br><br><br><br><br><br><br>
<br><br><br><br><br><br><br><br>

### 2.2 Bin-edge coordinates

- It is sometimes necessary to have coordinates that represent a range for each data value.
- E.g. "the temperature was 310 K in the time span between 10 and 20 seconds".
- This also arises every time we histogram data, as in the image above.
- Scipp supports this with **bin-edge coordinates**: a coordinate whose length is one more than the length of its dimension.

In [None]:
image = recording.hist(y=8, x=8)
sc.show(image)

In [None]:
image

In [None]:
image.plot(aspect='equal')

- NumPy and Matplotlib return the bin edges and the data counts separately.
- Scipp stores everything inside a single data structure.

<br><br><br><br><br><br><br><br><br><br>
<br><br><br><br><br><br><br><br><br><br>

## 3. Binned data

Scipp distinguishes **histogrammed** data from **binned** data:

- Histogrammed data refers to regular dense arrays of, e.g., floating-point values with an associated bin-edge coordinate.
- Binned data refers to the precursor of histogrammed data, i.e., each bin contains a “list” of contributing events or values. Binned data can be converted into a histogram by computing the sum over all events or values in a bin.

<img src="binned_drawing.svg" />

<br><br><br><br>

This is conceptually similar to a multi-dimensional <a href="https://awkward-array.org/doc/main/"><img src="https://iris-hep.org/assets/logos/awkward.svg" alt="Awkward Array" width="200" /></a>.

It is best illustrated with an example of data analysis.
For this, we will use one of the NYC taxi datasets.

<br><br><br><br><br><br><br><br><br><br>
<br><br><br><br><br><br><br><br><br><br>
<br><br><br><br><br><br><br><br><br><br>

### NYC yellow taxi dataset

<img src="https://vaex.readthedocs.io/en/latest/_images/datasets_2_1.png" /> <img src="https://cdn-images-1.medium.com/v2/resize:fit:2680/1*fqrY2h4uLD3eKEvJ6hlI2g.png" width="600" />

(https://vaex.readthedocs.io/en/latest/datasets.html; dataset from 2015, obtained as an HDF5 file from the Vaex docs,
and subsequently cleaned of outliers).

In [None]:
%matplotlib widget

da = sc.io.load_hdf5('nyc_taxi_data_2015_small.h5')
da

In [None]:
n = 100
x = da.coords["dropoff_longitude"].values[::n]
y = da.coords["dropoff_latitude"].values[::n]
scatter(x, y)

### Binning the data records

- Working with binned data is most efficient when the number of bins is kept relatively low.
- Binning is essentially like overlaying a grid of bin edges onto our data.

In [None]:
ax = scatter(x, y)
for lon in np.linspace(*ax.get_xlim(), 9):
    ax.axvline(lon, color="gray")
for lat in np.linspace(*ax.get_ylim(), 9):
    ax.axhline(lat, color="gray")

In [None]:
# Bin into 8 longitude & latitude bins
binned = da.bin(dropoff_latitude=8, dropoff_longitude=8)
binned

In [None]:
# Histogramming is summing all the counts in each bin
binned.hist().plot(aspect="equal", norm="log")

<br><br><br><br><br><br><br><br>
<br><br><br><br><br><br><br><br>


### Selecting/slicing bins

- Binning *groups* the data into bins, but keeps the underlying table of records.
- **No information is lost, it is simply re-ordered.**
- The bins can then be used for slicing the data, providing extremely efficient data selection and filtering

In [None]:
manh = binned["dropoff_longitude", 1]["dropoff_latitude", 4]
manh

In [None]:
# We can now histogram this with a much finer resolution

manh.hist(dropoff_latitude=300, dropoff_longitude=300).plot(norm="log", aspect="equal")

In [None]:
# We select another bin, which contains JFK airport

jfk = binned["dropoff_longitude", 6]["dropoff_latitude", 1]
jfk.hist(dropoff_latitude=300, dropoff_longitude=300).plot(norm="log", aspect="equal")

![jfk](https://upload.wikimedia.org/wikipedia/commons/thumb/5/5a/JFK_airport_terminal_map.png/640px-JFK_airport_terminal_map.png)

(https://commons.wikimedia.org/wiki/File:JFK_airport_terminal_map.png)

<br><br><br><br><br><br><br><br>
<br><br><br><br><br><br><br><br>

### Binning into a new dimension

- Data that has already been binned can also be binned further into new dimensions.

In [None]:
manh

- We look at the trip distances inside the Manhattan and JFK bins we selected above.

In [None]:
# Use 100 distance bins
manh_dist = manh.bin(trip_distance=100)
manh_dist

In [None]:
manh_dist.hist().plot()

In [None]:
jfk_dist = jfk.bin(trip_distance=100)
jfk_dist.hist().plot()

<br><br><br><br><br><br><br><br>
<br><br><br><br><br><br><br><br>


### Other operations on bins: what is the fare amount as a function of distance?

- In addition to summing/histogramming, bins can be used for other reduction operations: `min()`, `max()`, and `mean()`.

In [None]:
manh_dist

- To get the minimum and maximum fares for all trips that ended inside our Manhattan area, we can do:

In [None]:
manh_dist.bins.coords['fare_amount'].min(), manh_dist.bins.coords['fare_amount'].max()

- These values are somewhat strange, indicative of bad data in the table.
- We restrict our fare range from \\$0 to \\$200.

In [None]:
# Make 100 bins between 0 and 200 dollars
nbins = 100
fare_bins = sc.linspace('fare_amount', 0, 200, nbins + 1, unit='dollar')

# Bin & plot our data
manh_dist.bin(fare_amount=fare_bins).hist().transpose().plot(norm="log")

Some things we can say about the data:

- there appears to be a (somewhat expected) correlation between fare amount and trip distance: the further you go, the more you'll have to pay
- for a given trip distance, clients usually pay above the diagonal line, rarely below
- there appears to be a magic fare amount of \\$52 that will take you anywhere from 0 to 60 miles!

<br><br><br><br><br><br>
<br><br><br><br><br><br>
<br><br><br><br><br><br><br><br>
<br><br><br><br><br><br><br><br>


## 4. Plopp: interactive data visualization tools

<img src="https://scipp.github.io/plopp/_static/logo.svg" width="200" />

https://scipp.github.io/plopp 

In [None]:
import plopp as pp

fare_lat_lon = da.hist(fare_amount=fare_bins, dropoff_latitude=300, dropoff_longitude=300)
fare_lat_lon

In [None]:
pp.inspector(fare_lat_lon, dim='fare_amount', norm='log')

<br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br>

<br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br>

<a href="https://scipp.github.io"><img src="https://scipp.github.io/_static/logo-2022.svg" width="600" /></a>

# Thank you for listening! &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <img src="https://img.icons8.com/?size=512&id=24511&format=png" width="50" /> [scipp.github.io](https://scipp.github.io) &nbsp;&nbsp;&nbsp;&nbsp; <img src="https://cdn-icons-png.flaticon.com/512/25/25231.png" width="50" /> [github.com/scipp](https://github.com/scipp)

<br>

<h1 style="color:#C70039;">We are hiring! Permanent position as a software engineer for science tools</h1> 

<br>

## Neil Vaytet &nbsp;&nbsp;&nbsp;&nbsp; <a href="mailto:neil.vaytet@ess.eu">neil.vaytet@ess.eu</a> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <img src="neil.png" width="100" /> &nbsp; <img src="https://europeanspallationsource.se/themes/custom/ess/logo.svg" width="200" />

<br>

<h3>
    <img src="simon.png" width="60" /> Simon Heybrock &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
    <img src="janlukas.png" width="60" />Jan-Lukas Wynen &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
    <img src="sunyoung.png" width="60" />Sunyoung Yoo
</h3>


<!-- <br>

# Links:

## &bull; Docs: [scipp.github.io](https://scipp.github.io)

## &bull; Github: [github.com/scipp/scipp](https://github.com/scipp/scipp)

## &bull; Plopp: [github.com/scipp/plopp](https://github.com/scipp/plopp) -->

<br>



### Differences with Xarray

- automatic alignment of coords

### Awkward array

- more generic and flexible
- nested levels of binning
- they expose the inner dimensions at the top level, whereas Scipp hides them and makes the data look like a normal array
- attempts at putting Awkward Array in Xarray, e.g. having a shape of None
- we say the inner dims don't exist at the top level

### Other

- Say it is Multi-threaded by default?

<br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br>
<br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br>

## 4. Plopp: building interactive visualizations

<img src="https://scipp.github.io/plopp/_static/logo.svg" width="600" />

https://scipp.github.io/plopp

In [None]:
import plopp as pp
from plopp import widgets
import ipywidgets as ipw
from scipp.scipy.ndimage import gaussian_filter

In [None]:
data = da.group("hour").hist(latitude=500, longitude=500)
data

### Goal: make an interactive visualization with 3 panels and a slider

![plopp_visu](plopp.png)

In [None]:
slider = ipw.IntSlider(description="Hour:", min=0, max=23)
slider_node = pp.widget_node(slider)

slice_node = pp.Node(lambda da, ind: da["hour", ind], da=data, ind=slider_node)

fig2d = pp.figure2d(slice_node, norm="log", cbar=False)

sum_lat = pp.Node(sc.sum, slice_node, dim="latitude")
sum_lon = pp.Node(sc.sum, slice_node, dim="longitude")

smooth = pp.Node(gaussian_filter, sum_lat, sigma=5)

fig_lon = pp.figure1d(sum_lon, norm="log")
fig_lat = pp.figure1d(sum_lat, smooth, norm="log")

widgets.Box([slider, [fig2d, fig_lon], fig_lat])

In [None]:
pp.show_graph(fig_lat)

### Adding a second widget for the Gaussian smoothing kernel size

In [None]:
slider = ipw.IntSlider(description="Hour:", min=0, max=23)
slider_node = pp.widget_node(slider)

slice_node = pp.Node(lambda da, ind: da["hour", ind], da=data, ind=slider_node)

fig2d = pp.figure2d(slice_node, norm="log", cbar=False)

sum_lat = pp.Node(sc.sum, slice_node, dim="latitude")
sum_lon = pp.Node(sc.sum, slice_node, dim="longitude")


# Add a new slider that will act as input to the Gaussian smoothing node
smooth_slider = ipw.IntSlider(description="kernel:", min=1, max=25)
smooth_slider_node = pp.widget_node(smooth_slider)

# Use slider as input node for smoothing kernel size
smooth = pp.Node(gaussian_filter, sum_lat, sigma=smooth_slider_node)


fig_lon = pp.figure1d(sum_lon, norm="log")
fig_lat = pp.figure1d(sum_lat, smooth, norm="log")

widgets.Box([[slider, smooth_slider], [fig2d, fig_lon], fig_lat])  # Container box

In [None]:
pp.show_graph(fig_lat)