In [None]:
%matplotlib widget
import plopp
plopp.patch_scipp()

## Quick Reference

- [scipp.hist](../../generated/functions/scipp.hist.rst)
- [scipp.bin](../../generated/functions/scipp.bin.rst)
- [scipp.group](../../generated/functions/scipp.group.rst)
- [scipp.transform_coords](../../generated/functions/scipp.transform_coords.rst)
- [scipp.lookup](../../generated/functions/scipp.lookup.rst)

### Extract events matching parameter value

Use [label-based indexing on the `bins` property](../../generated/classes/scipp.Bins.rst#scipp.Bins.__getitem__).
This works similarly to regular [label-based indexing](../slicing.rst#Label-based-indexing) but operates on the unordered bin contents.
Example:

```python
param_value = sc.scalar(1.2, unit='m')
filtered = da.bins['param', param_value]
```

- The output data array has the same dimensions as the input `da`.
- `filtered` contains a *copy* of the filtered events.

### Extract events falling into a parameter interval

Use [label-based indexing on the `bins` property](../../generated/classes/scipp.Bins.rst#scipp.Bins.__getitem__).
This works similarly to regular [label-based indexing](../slicing.rst#Label-based-indexing) but operates on the unordered bin contents.
Example:

```python
start = sc.scalar(1.2, unit='m')
stop = sc.scalar(1.3, unit='m')
filtered = da.bins['param', start:stop]
```

- The output data array has the same dimensions as the input `da`.
- `filtered` contains a *copy* of the filtered events.
- Note that as usual the upper bound of the interval (here $1.3~\text{m}$) is *not* included.

### Split into bins based on discrete event parameter

Use [scipp.group](../../generated/functions/scipp.group.rst).
Example:

```python
split = da.group('param')
```

- The output data array has a new dimension `'param'` in addition to the dimensions of the input.
- `split` contains a *copy* of the reordered events.
- Pass an explicit variable to `group` listing desired groups to limit what is included in the output.

### Split into bins based on contiguous event parameter

Use [scipp.bin](../../generated/functions/scipp.bin.rst).
Example:

```python
split = da.bin(param=10)
```

- The output data array has a new dimension `'param'` in addition to the dimensions of the input.
- `split` contains a *copy* of the reordered events.
- Provide an explicit variable to `bin` to limit the parameter interval that is included in the output, or for fine-grained control over the sub-intervals.

### Compute derived event parameters for subsequent extracting or splitting

Use [scipp.transform_coords](../../generated/functions/scipp.transform_coords.rst).
Example:

```python
da2 = da.transform_coords(derived_param=lambda p1, p2: p1 + p2)
```

`da2` can now be used with any of the methods for extracting or splitting data described above.
The intermediate variable can also be omitted, and we can directly extract or split the result:

```python
filtered = da.transform_coords(derived_param=lambda p1, p2: p1 + p2) \
             .bin(derived_param=10)
```

### Compute derived event parameters from time-series or other metadata

In practice, events are often tagged with a timestamp, which can be used to look up parameter values from, e.g., a time-series log.
Use [scipp.lookup](../../generated/functions/scipp.lookup.rst) with [scipp.transform_coords](../../generated/functions/scipp.transform_coords.rst). Example:

```python
# Data array, dims=('time',), values are temperatures measured at given time
temperature = da.attrs['sample_temperature'].value
interp_temperature = sc.lookup(temperature, mode='previous')
filtered = da.transform_coords(temperature=interp_temperature) \
             .bin(temperature=10)
```

what | use | dims
---|---|---
extract events matching param value|`da.bins['param', param_value]`| same as `da`
extract events in param interval|`da.bins['param', start:stop]`| same as `da`
split into bins based on discrete event param|`da.group('param')`| additional new dim `'param'`
split into bins based on contiguous<br> event param|`da.bin(param=100)`| additional new dim `'param'`


In [None]:
import scipp as sc

# Stainless steel tensile bar for NX school
da = sc.io.open_hdf5('scipp-filtering-docs-data.h5')
da

In [None]:
da.hist().plot()

## Extract time interval

The strain log drops off at some point, for a reason that is not known here. We use filtering to remove the events recorded after the strain reached its maximum.

In [None]:
strain = da.attrs['loadframe.strain'].value
strain.plot()

In [None]:
import numpy as np

start = strain.coords['time'][0]
stop = strain.coords['time'][np.argmax(strain.values)]
da = da.bins['time', start:stop]

<div class="alert alert-info">

**Note**
    
The above is just a concise way of binning into a single time interval and squeezing the time dimension from the result.

If *multiple* intervals are to be extracted then the mechanism based on `start` and `stop` values becomes highly inefficient, as every time `da.bins['param', start:stop]` is called *all* of the events have to be processed.
Instead prefer using `da.bin(param=param_bin_edges)` and slice the result using regular positional (or label-based) indexing.
Similarly, prefer using `da.group('param')` to extract based on multiple discrete values.
    
</div>
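
The cost argument in the note above can be pictured with a small plain-Python sketch. This is not Scipp's implementation. The helper names `extract_interval` and `bin_once` are hypothetical, events are reduced to bare parameter values, and all intervals are assumed to be half-open.

```python
from bisect import bisect_right


def extract_interval(events, start, stop):
    """Scan *all* events and keep those with start <= value < stop."""
    return [e for e in events if start <= e < stop]


def bin_once(events, edges):
    """Single pass: put each event into the bin [edges[i], edges[i+1])."""
    bins = [[] for _ in range(len(edges) - 1)]
    for e in events:
        i = bisect_right(edges, e) - 1
        if 0 <= i < len(bins):  # events outside the edges are dropped
            bins[i].append(e)
    return bins


events = [0.5, 2.7, 1.1, 3.9, 2.0, 0.1, 4.0]
edges = [0.0, 1.0, 2.0, 3.0, 4.0]

# One full scan of the events per interval.
repeated = [
    extract_interval(events, lo, hi) for lo, hi in zip(edges[:-1], edges[1:])
]
# One scan in total; individual intervals are then obtained by indexing.
binned = bin_once(events, edges)
```

Both approaches yield the same per-interval contents, but the first one touches every event once per interval, whereas the second touches every event only once.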

## Filter bad pulses

In the previous example we directly used an existing event-coordinate (`da.bins.coords['time']`) for selecting the desired subset of data.
In many practical cases, such a coordinate may not be available yet and needs to be computed as a preparatory step.
Scipp facilitates this using `transform_coords` and `lookup`.
When the desired event-coordinate can be computed directly from existing coordinates then `transform_coords` can do the job on its own.
In other cases, such as the following example, we combine it with `lookup` to, e.g., map timestamps to corresponding sensor readings.

Our data stores the so-called *proton charge*, the total charge of protons per pulse (which produced the neutrons scattered off the sample):

In [None]:
proton_charge = da.attrs['proton_charge'].value
proton_charge.plot()

Some pulses have a very low proton charge which may indicate a problem with the source, so we may want to remove events that were produced from these pulses.
We can use `lookup` to define the following "interpolation function", marking any pulse as "good" if it has more than 90% of the mean proton charge:

In [None]:
good_pulse = sc.lookup(proton_charge > 0.9 * proton_charge.mean(), mode='previous')

`transform_coords` can utilize this interpolation function to compute a new coordinate (`good_pulse`, with `True` and `False` values) from the `da.bins.coords['time']` coordinate.
We used `mode='previous'` above, so an event's `good_pulse` value will be defined by the *previous* pulse, i.e., the one that produced the neutron event.
See the documentation of `lookup` for a full list of available options.
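
As a rough illustration of what a "previous" lookup means, the following plain-Python sketch maps event times to the flag of the most recent pulse. It is not how Scipp implements `lookup`. The function `lookup_previous` and the sample values are hypothetical, and the sketch assumes that an event recorded exactly at a pulse time belongs to that pulse.

```python
from bisect import bisect_right


def lookup_previous(log_times, log_values, t):
    """Return the value of the latest log entry with time <= t."""
    i = bisect_right(log_times, t) - 1
    if i < 0:
        return None  # event precedes the first log entry
    return log_values[i]


# Hypothetical pulse log: one flag per pulse, sorted by time.
pulse_times = [0, 10, 20, 30]
pulse_is_good = [True, False, True, True]

event_times = [3, 10, 19, 25, 31]
flags = [lookup_previous(pulse_times, pulse_is_good, t) for t in event_times]
good_events = [t for t, ok in zip(event_times, flags) if ok]
```

Here the events at times 10 and 19 inherit the flag of the pulse at time 10 and are therefore removed.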

The return value of `transform_coords` can then be used to index the `bins` property, here to extract only the events that have `good_pulse=True`, i.e., were created by a proton pulse that fulfilled the above criterion:

In [None]:
filtered = da.transform_coords(good_pulse=good_pulse) \
             .bins['good_pulse', sc.index(True)]
filtered

## Strain

We use `bin` here to split the events by strain; `group` could be used instead for discrete strain values.

In [None]:
#tmp.bins.coords['time'] = tmp.bins.coords.pop('pulse_time')

filtered = da.transform_coords(strain=sc.lookup(strain, mode='previous')).bin(strain=100)
#filtered = da.transform_coords(strain=sc.lookup(strain, mode='previous')).group('strain')
filtered.hist().transpose().plot()

In [None]:
proton_charge = da.attrs['proton_charge'].value
charge_per_strain_value = proton_charge.bin(time=strain.coords['time'])
charge_per_strain_value.coords['strain'] = strain.data[:-1]
#charge_per_strain_value.coords['strain'] = strain.data.rename(time='pulse_time')[:-1]
#norm = charge_per_strain_value.group('strain').hist()

In [None]:
norm = charge_per_strain_value.bin(strain=100).hist()
normalized = (filtered/norm)
normalized.hist(dspacing=300, strain=30).transpose().plot(norm='log')

In [None]:
lines = sc.collapse(normalized.hist(dspacing=200, strain=6), keep='dspacing')
sc.plot(lines, norm='log')

- select based on event params (field values of records) in bin
  - select value or range
  - add new dim with param range
- prep 1: create event labels from others (such as timestamp), using `transform_coords`
- prep 2: pre-process metadata

1. Preprocess the metadata used for filtering.
   For example, a noisy time series of temperature values needs to be converted into a series of time intervals with a fixed temperature value within the interval.
   This process might involve defining thresholds and tolerances or interpolation methods between measured temperature values.
2. Map event timestamps to temperature values.
3. Filter data based on temperature values.

In [None]:
# 2022-07-12T18:40:27 to 2022-07-12T19:45:39
start = sc.datetime('2022-07-12T18:45:00', unit='ns')
stop = sc.datetime('2022-07-12T18:50:00', unit='ns')
da.bins['time', start:stop]
da.bins['time', :stop]

# Rearranging and Filtering Binned Data

Event filtering refers to the process of removing or extracting a subset of events based on some criterion such as the temperature of the measured sample at the time an event was detected.
Scipp's binned data can be used for this purpose.

Below, we describe two cases.
In the simple case the data contains the required coordinate and [scipp.bin](../../generated/functions/scipp.bin.rst) can be used directly.
In the more complex case metadata requires preprocessing, and generally there are three steps to take:

1. Preprocess the metadata used for filtering.
   For example, a noisy time series of temperature values needs to be converted into a series of time intervals with a fixed temperature value within the interval.
   This process might involve defining thresholds and tolerances or interpolation methods between measured temperature values.
2. Map event timestamps to temperature values.
3. Filter data based on temperature values.
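
Before turning to Scipp, the three steps can be sketched in plain Python on a handful of made-up numbers. This is only meant to convey the idea. The helpers `block_average` and `map_to_temperature`, the sample values, and the choice of closing the last interval one time unit after the final sample are all assumptions made for this illustration.

```python
from bisect import bisect_right
from statistics import mean


def block_average(times, values, block):
    """Step 1: average `block` consecutive samples into one interval each."""
    n = len(values) // block
    edges = [times[i * block] for i in range(n)]
    edges.append(times[n * block - 1] + 1)  # assumed closing edge
    means = [mean(values[i * block:(i + 1) * block]) for i in range(n)]
    return edges, means


def map_to_temperature(edges, means, t):
    """Step 2: temperature of the interval [edges[i], edges[i+1]) containing t."""
    i = bisect_right(edges, t) - 1
    return means[i] if 0 <= i < len(means) else None


# Hypothetical noisy temperature log.
times = [0, 10, 20, 30, 40, 50]
temps = [100.0, 102.0, 110.0, 112.0, 120.0, 122.0]
edges, means = block_average(times, temps, block=2)

event_times = [5, 25, 45, 19, 60]
event_temps = [map_to_temperature(edges, means, t) for t in event_times]

# Step 3: keep events whose temperature lies in [105 K, 115 K).
selected = [
    t for t, T in zip(event_times, event_temps)
    if T is not None and 105.0 <= T < 115.0
]
```

The event at time 60 lies outside the logged range and receives no temperature, so it cannot pass any temperature filter.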

## Preparation

We create some fake data for illustration purposes.

<div class="alert alert-info">

**Note**

In practice, data to be filtered would be based on a loaded file. Details of this subsection can safely be skipped, as long as all cells are executed.

</div>

In [None]:
import numpy as np
import scipp as sc

In [None]:
np.random.seed(1) # Fixed for reproducibility
end_time = 100000
tof_max = 10000
width = tof_max/20
sizes = 4*np.array([7000, 3333, 3000, 5000])
size = np.sum(sizes)
data = sc.ones(dims=['event'], unit='counts', shape=[size], with_variances=True)
time = sc.zeros(dims=['event'], unit='s', dtype='datetime64', shape=[size])
# time-of-flight in a neutron-scattering experiment
tof = sc.zeros(dims=['event'], unit='us', dtype='float64', shape=[size])
table = sc.DataArray(data=data, coords={'time':time, 'tof':tof})
table

In [None]:
ntemp = 100
sample_temperature = sc.DataArray(
    data=sc.array(dims=['time'], unit='K',
                  values=5*np.random.rand(ntemp)+np.linspace(100, 120, num=ntemp)),
    coords={'time':sc.Variable(dims=['time'], unit='s',
                               values=np.linspace(0, end_time, num=ntemp).astype('datetime64[s]'))})
x = sc.linspace(dim='x', unit='m', start=0, stop=1, num=4)

end = sc.array(dims=['x'], values=np.cumsum(sizes), unit=None)
begin = end.copy()
begin.values -= sizes
events = sc.DataArray(
    data=sc.bins(begin=begin, end=end, dim='event', data=table),
    coords={'x': x},
    attrs={'sample_temperature': sc.scalar(value=sample_temperature)})
for size, bucket in zip(sizes, events.values):
    bucket.coords['time'].values = np.linspace(0, end_time, num=size).astype('datetime64[s]')
    bucket.coords['tof'].values = np.concatenate(
        (np.concatenate(
            (7*width + width*np.random.randn(size//4),
             13*width + width*np.random.randn(size//4))),
         10*width + width*np.random.randn(size//2)))
events

## Filtering based on existing coords

### Extracting data based on an interval

We can use [scipp.bin](../../generated/functions/scipp.bin.rst) with the desired bounds to extract all data points (events) that have coord values falling within an interval:

In [None]:
tof_interval = sc.array(dims=['tof'], values=[2000.0, 3000.0], unit='us')
filtered = events.bin(tof=tof_interval)
filtered

### Extracting/splitting data based on multiple intervals

In the same manner, we can extract data with a list of (adjacent) intervals:

In [None]:
tof_intervals = sc.linspace(dim='tof', start=2000, stop=3000, num=4, unit='us')
filtered = events.bin(tof=tof_intervals)
filtered

Events in each of the subintervals can then be accessed using the usual slicing syntax:

In [None]:
filtered['tof',2]

## Filtering based on arbitrary metadata
### Step 1: Preprocess metadata

Our data contains an attribute with metadata related to the temperature of the measured sample:

In [None]:
timeseries = events.attrs['sample_temperature'].value
timeseries.plot()

This is a timeseries with noisy measurements, as could be obtained, e.g., from a temperature sensor.
For event filtering we require intervals with a fixed temperature.
This can be obtained in many ways.
In this example we do so by taking the mean over subintervals:

In [None]:
average=4
temperature = sc.fold(timeseries, dim='time', sizes={'time': ntemp//average, 'dummy': average})
time_coord = temperature.coords['time']['dummy', 0]
temperature.coords['time'] = sc.concat([time_coord, time_coord['time', -1] + 1*sc.units.s], 'time')
temperature = temperature.mean('dummy')
temperature.plot()

### Step 2: Map time stamps

The `temperature` data array computed above can be seen as a discretized functional dependence of temperature on time.
This "function" can now be used to map the `time` of each event to the `temperature` of each event:

In [None]:
event_temp = sc.lookup(temperature, 'time')[events.bins.coords['time']]
events.bins.coords['temperature'] = event_temp

The event lists with temperature values created by `scipp.lookup` have been added as a new coordinate:

In [None]:
events.values[0]

### Step 3: Filter with `scipp.bin`

The temperature coordinate created in the previous step can now be used for the actual filtering step.
With a `temperature` coordinate stored as part of `events` it is possible to use `scipp.bin` with temperature bins:

In [None]:
temp_bins = sc.linspace('temperature', 100.0, 130.0, num=6, unit='K')
binned_events = events.bin(temperature=temp_bins, tof=100)
binned_events

Filtering is then performed by slicing and, if desired, copying:

In [None]:
filtered_view = binned_events['temperature', 0:3] # view containing only relevant events
filtered = binned_events['temperature', 0:3].copy() # extract only relevant events by copying

Slicing combined with histogramming also performs a filter operation since all events outside the histogram bounds are dropped:

In [None]:
binned_events['temperature', 1].hist().plot()

In [None]:
binned_events['temperature', 3].hist().plot()

Results from filter operations can also be inserted into a dataset for convenient handling of further operations such as histogramming, summing, or plotting:

In [None]:
d = sc.Dataset()
d['below_T_c'] = binned_events['temperature', 1]
d['above_T_c'] = binned_events['temperature', 3]
d.bins.sum().sum('x').plot()

We can also bin without the time-of-flight coordinate to obtain the temperature dependence of the total event count, e.g., for normalization purposes:

In [None]:
binned_events = events.bin(temperature=temp_bins)
binned_events.hist(temperature=50).plot()