Conversation

@nvaytet (Member) commented Sep 17, 2025

This is an alternative to #166

Instead of trying to create a master workflow with the new groupby mechanism from Cyclebane/Sciline, we go back to making a class that holds a collection of workflows.

Reasons for this approach:

  • All parameter tables and computed results were in Pandas DataFrames. While this seemed like a good idea to start with (nice notebook displays, powerful manipulation opportunities), I felt it mostly got in the way, especially since having types as column identifiers instead of strings is not well supported by Pandas, and we had to implement workarounds using iloc in many places.
  • Having a mapped/grouped workflow prevents us from caching intermediate results (see issue comment), which was required here to avoid re-computing the raw loaded data multiple times.
  • Using groupby for the rotation angle is not really supported out of the box, because the rotation angle is loaded from file, not supplied by the user in the parameter table.
  • groupby does not work out of the box with Scipp variables, so we had to convert them to float in a new column before being able to group.

I still believe that groupby is a great and essential addition to Sciline.
However, all the reasons above led to an implementation that got very messy, and still did not support all the intended behaviour for the BatchProcessor.
The desired behaviour is roughly:

workflow = AmorWorkflow()

runs = {
    '608': {
        Filename[SampleRun]: amor.data.amor_run(608)
    },
    '609': {
        SampleRotationOffset[SampleRun]: sc.scalar(0.05, unit='deg'),
        Filename[SampleRun]: amor.data.amor_run(609),
    },
    '610': {
        SampleRotationOffset[SampleRun]: sc.scalar(0.05, unit='deg'),
        Filename[SampleRun]: amor.data.amor_run(610),
    },
    '611': {
        # If a parameter is missing here compared to other entries (e.g. SampleRotationOffset),
        # the workflow default value is added to the param table
        Filename[SampleRun]: amor.data.amor_run(611),
    },
}

batch = batch_processor(workflow, runs)

# Compute R(Q) for all runs (hiding calls like `sciline.compute_mapped()`)
reflectivity = batch.compute(ReflectivityOverQ)

Caching intermediate results:

batch[UnscaledReducibleData[SampleRun]] = batch.compute(
    UnscaledReducibleData[SampleRun]
)

Concatenating event lists from multiple runs is done by supplying a list (or tuple) of filenames for one entry (this is what groupby did in the aforementioned PR):

runs = {
    '608': {
        SampleRotationOffset[SampleRun]: sc.scalar(0.05, unit='deg'),
        Filename[SampleRun]: amor.data.amor_run(608),
    },
    '611+612': {
        SampleRotationOffset[SampleRun]: sc.scalar(0.05, unit='deg'),
        Filename[SampleRun]: (amor.data.amor_run(611), amor.data.amor_run(612)),
    },
}

batch = batch_processor(workflow, runs)

Update:

In addition to the initial changes here, we included the helper introduced in #155 to compute targets in one go, without having to handle a multi-workflow BatchProcessor object; the function from_measurements has been renamed to batch_compute.
The two approaches complement each other nicely; batch_compute is aimed at less experienced users of the reduction workflow.
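
For illustration, a rough sketch of how the helper might be called (the exact signature of batch_compute is not spelled out in this thread, so the argument layout and result keying below are assumptions):

# Hypothetical usage: `workflow` and `runs` are as in the examples above.
results = batch_compute(workflow, runs, ReflectivityOverQ)

# Assumed layout: one entry per run, keyed like the `runs` mapping.
reflectivity_608 = results['608']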

-def scale_reflectivity_curves_to_overlap(
-    curves: Sequence[sc.DataArray],
+def scale_for_reflectivity_overlap(
+    reflectivities: sc.DataArray | Mapping[str, sc.DataArray] | sc.DataGroup,

Contributor:

Allowing DataGroup is a nice improvement of the interface.
So is only returning the scaling factors instead of both the scaled curves and the factors.
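
For illustration, a rough usage sketch of the new interface (it assumes the returned scaling factors are keyed like the input mapping, which is not confirmed by the diff above):

# `reflectivities` is a dict or sc.DataGroup of R(Q) curves, e.g. one per run.
factors = scale_for_reflectivity_overlap(reflectivities)

# Applying the scaling is left to the caller (assumed layout of `factors`).
scaled = sc.DataGroup(
    {name: curve * factors[name] for name, curve in reflectivities.items()}
)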

@nvaytet (Member, Author) commented Sep 26, 2025

The latest update, after in-person discussion, is that we will merge the changes from #155 in this PR.
We see the value of having a helper that computes everything (and optionally applies scaling) in one go, to avoid errors and bookkeeping.
We also see the advantages of having the BatchProcessor for easier data/intermediate results exploration, as well as caching intermediate results.

The helper is seen as a function most users should call.
The BatchProcessor should be for more advanced users who wish to inspect steps in the workflow and possibly fiddle with it.

I added a simpler notebook to the docs for Amor that uses the batch_compute helper; it does not go into as much detail as the other notebook.

        Filename[SampleRun]: amor.data.amor_run(610),
    },
    '611': {
        SampleRotationOffset[SampleRun]: sc.scalar(0.05, unit='deg'),

@jokasimr (Contributor) commented Sep 26, 2025

It's probably good if the docs mention that you can pass a list of values to Filename[SampleRun].

return datasets
for fname in set(reference_filenames.values()):
    wf[Filename[ReferenceRun]] = fname
    reference_results[fname] = wf.compute(ReducedReference)

@jokasimr (Contributor) commented Sep 26, 2025

Two ReducedReference results are not guaranteed to be the same even if they have the same file name. Other parameters also affect the reference, for example the wavelength interval and the detector region of interest.

@jokasimr (Contributor) commented Sep 26, 2025

But it's true that those usually stay constant across all runs that use the same reference file, so this will likely be correct most of the time. Since it's not guaranteed, though, it's probably better to let the user make the decision to cache or not.

Member Author:

> But since it's not guaranteed it's probably better to let the user make that decision.

How do we achieve that? Just let the user cache the reference on the workflow manually as is currently done in the notebook?

Contributor:

> How do we achieve that? Just let the user cache the reference on the workflow manually as is currently done in the notebook?

Yes that's what I had in mind.
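
For reference, a minimal sketch of the manual caching pattern discussed above, following the same set-then-compute style used for UnscaledReducibleData in the description (whether to cache is deliberately left to the user):

# Assumes the reference filename and other reference parameters are already set
# on the workflow; pin the reduced reference so it is not recomputed per run.
workflow[ReducedReference] = workflow.compute(ReducedReference)

batch = batch_processor(workflow, runs)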

@nvaytet changed the title from "Add BatchProcessor for batch reduction" to "Add batch_processor and batch_compute for batch reduction" on Sep 27, 2025
for name, wf in self.workflows.items():
    try:
        out[t][name] = wf.compute(t, **kwargs)
    except sl.UnsatisfiedRequirement as e:

Contributor:

When does this exception occur?

Member Author:

This happens, for example, if you are trying to compute DetectorData[SampleRun] but your workflow has been mapped over multiple files. The single DetectorData node is no longer in the workflow; only the mapped nodes are (their names are obtained using sciline.get_mapped_node_names).

Member Author:

Basically, I think that in the old version of the helper (in #155), if you gave a tuple of filenames and then computed DetectorData[SampleRun] (or any target upstream of where the event lists are concatenated), it would fail with sl.UnsatisfiedRequirement, saying it can't find the type in the pipeline.
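
A minimal sketch of the fallback being described, based on the excerpt above and the sciline.compute_mapped helper mentioned in the description (the actual BatchProcessor internals may differ):

import sciline as sl

try:
    # The plain, unmapped node still exists for this target.
    result = wf.compute(DetectorData[SampleRun])
except sl.UnsatisfiedRequirement:
    # The target was mapped over several files, so the single node is gone;
    # gather the mapped copies instead (one result per file).
    result = list(sl.compute_mapped(wf, DetectorData[SampleRun]))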

Contributor:

Aha I see 👍

Contributor:

So if the target is upstream of the reduction operation (for example DetectorData[SampleRun]) does it now return a list of the result, one for each file?

Member Author:

Yes, it returns a list.

(screenshot omitted)

Contributor:

LGTM 👌

    reflectivity curve for each of the provided runs.
    scale_to_overlap: bool
    | tuple[sc.Variable, sc.Variable]
    | list[sc.Variable, sc.Variable] = False,

@jokasimr (Contributor) commented Sep 29, 2025

(minor) I don't think list[T, T] is a valid Python type annotation.
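
A possible correction, sketched here as an assumption (the fix actually applied in the PR is not shown); a homogeneous sequence or an exact 2-tuple expresses the intent with valid typing syntax:

from collections.abc import Sequence

import scipp as sc

# Hypothetical signature fragment illustrating the corrected annotation.
def _sketch(
    scale_to_overlap: bool
    | tuple[sc.Variable, sc.Variable]
    | Sequence[sc.Variable] = False,
) -> None: ...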

@nvaytet merged commit 0d9ce1c into main on Sep 29, 2025
4 checks passed
@nvaytet deleted the workflow-collection branch on September 29, 2025