Conversation

@nvaytet (Member) commented Sep 17, 2025

This is an alternative to #166

Instead of trying to create a master workflow with the new groupby mechanism from Cyclebane/Sciline, we go back to making a class that holds a collection of workflows.

Reasons for this approach:

  • All parameter tables and computed results were in Pandas DataFrames. While this seemed like a good idea to start with (nice notebook displays, powerful manipulation opportunities), I felt it mostly got in the way, especially since having types as column identifiers instead of strings is not well supported by Pandas, and we had to implement workarounds using iloc in many places.
  • Having a mapped/grouped workflow prevents us from caching intermediate results (see issue comment), which was required here to avoid re-computing the raw loaded data multiple times.
  • Using groupby for the rotation angle is not really supported out of the box, because the rotation angle is loaded from file, not supplied by the user in the parameter table.
  • groupby does not work out of the box with Scipp variables, so we had to convert them to float in a new column before being able to group.

I still believe that groupby is a great and essential addition to Sciline.
However, all the reasons above led to an implementation that got very messy, and still did not support all the intended behaviour for the BatchProcessor.
The desired behaviour is roughly:

workflow = AmorWorkflow()

runs = {
    '608': {
        Filename[SampleRun]: amor.data.amor_run(608)
    },
    '609': {
        SampleRotationOffset[SampleRun]: sc.scalar(0.05, unit='deg'),
        Filename[SampleRun]: amor.data.amor_run(609),
    },
    '610': {
        SampleRotationOffset[SampleRun]: sc.scalar(0.05, unit='deg'),
        Filename[SampleRun]: amor.data.amor_run(610),
    },
    '611': {
        # If a parameter is missing here compared to other entries (e.g. SampleRotationOffset),
        # the workflow default value is added to the param table
        Filename[SampleRun]: amor.data.amor_run(611),
    },
}

batch = batch_processor(workflow, runs)

# Compute R(Q) for all runs (hiding calls like `sciline.compute_mapped()`)
reflectivity = batch.compute(ReflectivityOverQ)

Caching intermediate results:

batch[UnscaledReducibleData[SampleRun]] = batch.compute(
    UnscaledReducibleData[SampleRun]
)

Concatenating event lists from multiple runs is done by supplying a list (or tuple) of filenames for one entry (this is what groupby did in the aforementioned PR):

runs = {
    '608': {
        SampleRotationOffset[SampleRun]: sc.scalar(0.05, unit='deg'),
        Filename[SampleRun]: amor.data.amor_run(608),
    },
    '611+612': {
        SampleRotationOffset[SampleRun]: sc.scalar(0.05, unit='deg'),
        Filename[SampleRun]: (amor.data.amor_run(611), amor.data.amor_run(612)),
    },
}

batch = batch_processor(workflow, runs)

Update:

In addition to the initial changes here, we included the helper introduced in #155 to compute targets in one go, without having to handle a multi-workflow BatchProcessor object; the function from_measurements has been renamed to batch_compute.
The two approaches complement each other nicely; batch_compute is aimed at less experienced users of the reduction workflow.
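
For illustration, a rough sketch of how the helper might be called (the exact signature of batch_compute is not spelled out in this thread, so the argument layout and result keying below are assumptions):

# Hypothetical usage: `workflow` and `runs` are as in the examples above.
results = batch_compute(workflow, runs, ReflectivityOverQ)

# Assumed layout: one entry per run, keyed like the `runs` mapping.
reflectivity_608 = results['608']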

-def scale_reflectivity_curves_to_overlap(
-    curves: Sequence[sc.DataArray],
+def scale_for_reflectivity_overlap(
+    reflectivities: sc.DataArray | Mapping[str, sc.DataArray] | sc.DataGroup,

Contributor:

Allowing DataGroup is a nice improvement of the interface.
So is only returning the scaling factors instead of both the scaled curves and the factors.
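
For illustration, a rough usage sketch of the new interface (it assumes the returned scaling factors are keyed like the input mapping, which is not confirmed by the diff above):

# `reflectivities` is a dict or sc.DataGroup of R(Q) curves, e.g. one per run.
factors = scale_for_reflectivity_overlap(reflectivities)

# Applying the scaling is left to the caller (assumed layout of `factors`).
scaled = sc.DataGroup(
    {name: curve * factors[name] for name, curve in reflectivities.items()}
)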

@nvaytet (Member, Author) commented Sep 26, 2025

The latest update, after in-person discussion, is that we will merge the changes from #155 in this PR.
We see the value of having a helper that computes everything (and optionally applies scaling) in one go, to avoid errors and bookkeeping.
We also see the advantages of having the BatchProcessor for easier data/intermediate results exploration, as well as caching intermediate results.

The helper is seen as a function most users should call.
The BatchProcessor should be for more advanced users who wish to inspect steps in the workflow and possibly fiddle with it.

I added a simpler notebook to the docs for Amor that uses the batch_compute helper; it does not go into as much detail as the other notebook.

        Filename[SampleRun]: amor.data.amor_run(610),
    },
    '611': {
        SampleRotationOffset[SampleRun]: sc.scalar(0.05, unit='deg'),

@jokasimr (Contributor) commented Sep 26, 2025

It's probably good if the docs mention that you can pass a list of values to Filename[SampleRun].

return datasets
for fname in set(reference_filenames.values()):
    wf[Filename[ReferenceRun]] = fname
    reference_results[fname] = wf.compute(ReducedReference)

@jokasimr (Contributor) commented Sep 26, 2025

Two ReducedReference results are not guaranteed to be the same even if they have the same file name. Other parameters also affect the reference, for example the wavelength interval and the detector region of interest.

@jokasimr (Contributor) commented Sep 26, 2025

But it's true that those usually stay constant across all runs that use the same reference file, so this will likely be correct most of the time. Since it's not guaranteed, though, it's probably better to let the user make the decision to cache or not.

Member Author:

> But since it's not guaranteed it's probably better to let the user make that decision.

How do we achieve that? Just let the user cache the reference on the workflow manually as is currently done in the notebook?

Contributor:

> How do we achieve that? Just let the user cache the reference on the workflow manually as is currently done in the notebook?

Yes that's what I had in mind.
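
For reference, a minimal sketch of the manual caching pattern discussed above, following the same set-then-compute style used for UnscaledReducibleData in the description (whether to cache is deliberately left to the user):

# Assumes the reference filename and other reference parameters are already set
# on the workflow; pin the reduced reference so it is not recomputed per run.
workflow[ReducedReference] = workflow.compute(ReducedReference)

batch = batch_processor(workflow, runs)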

@nvaytet changed the title from "Add BatchProcessor for batch reduction" to "Add batch_processor and batch_compute for batch reduction" on Sep 27, 2025
for name, wf in self.workflows.items():
    try:
        out[t][name] = wf.compute(t, **kwargs)
    except sl.UnsatisfiedRequirement as e:

Contributor:

When does this exception occur?

Member Author:

This happens, for example, if you are trying to compute DetectorData[SampleRun] but your workflow has been mapped over multiple files. The single DetectorData node is no longer in the workflow; only the mapped nodes are (their names are obtained using sciline.get_mapped_node_names).

Member Author:

Basically, I think that in the old version of the helper (in #155), if you gave a tuple of filenames and then computed DetectorData[SampleRun] (or any target upstream of where the event lists are concatenated), it would fail with sl.UnsatisfiedRequirement, saying it can't find the type in the pipeline.
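
A minimal sketch of the fallback being described, based on the excerpt above and the sciline.compute_mapped helper mentioned in the description (the actual BatchProcessor internals may differ):

import sciline as sl

try:
    # The plain, unmapped node still exists for this target.
    result = wf.compute(DetectorData[SampleRun])
except sl.UnsatisfiedRequirement:
    # The target was mapped over several files, so the single node is gone;
    # gather the mapped copies instead (one result per file).
    result = list(sl.compute_mapped(wf, DetectorData[SampleRun]))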

Contributor:

Aha I see 👍

Contributor:

So if the target is upstream of the reduction operation (for example DetectorData[SampleRun]) does it now return a list of the result, one for each file?

Member Author:

Yes, it returns a list.

(screenshot omitted)

Contributor:

LGTM 👌

    reflectivity curve for each of the provided runs.
    scale_to_overlap: bool
    | tuple[sc.Variable, sc.Variable]
    | list[sc.Variable, sc.Variable] = False,

@jokasimr (Contributor) commented Sep 29, 2025

(minor) I don't think list[T, T] is a valid Python type annotation.
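
A possible correction, sketched here as an assumption (the fix actually applied in the PR is not shown); a homogeneous sequence or an exact 2-tuple expresses the intent with valid typing syntax:

from collections.abc import Sequence

import scipp as sc

# Hypothetical signature fragment illustrating the corrected annotation.
def _sketch(
    scale_to_overlap: bool
    | tuple[sc.Variable, sc.Variable]
    | Sequence[sc.Variable] = False,
) -> None: ...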

@nvaytet merged commit 0d9ce1c into main on Sep 29, 2025
4 checks passed
@nvaytet deleted the workflow-collection branch on September 29, 2025