Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UX: Improve readability for ISIS workflows #136

Merged
merged 21 commits into from
May 7, 2024
Merged

Conversation

SimonHeybrock
Copy link
Member

Implements changes as suggested for guideline addition in scipp/essreduce#22.

This goes one step further and also improves the currently very confusing way of switching between tutorial data from pooch and "local" data.

"workflow[DirectBeamFilename] = 'DIRECT_SANS2D_REAR_34327_4m_8mm_16Feb16.dat.h5'\n",
"workflow[Filename[SampleRun]] = 'SANS2D00063114.nxs.h5'\n",
"workflow[Filename[BackgroundRun]] = 'SANS2D00063159.nxs.h5'\n",
"workflow[Filename[EmptyBeamRun]] = 'SANS2D00063091.nxs.h5'"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced by this approach. It is nice that get_sans2d_tutorial_data_folder places all files in the same folder and that we can then simply list the file names. But this pretends that the mechanism is flexible while in reality, the filenames are determined by the get_*_data_folder functions. So there is no need to list them separately.

If you want to use this as an example for users, this doesn't quite work either. Users would typically (I think) not provide a folder with all their files and list Filenames. Instead, they would, if they use local files directly, list FilePaths.

With SciCat, it would also not be possible to list file names after downloading because the names are required for the download itself.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to use this as an example for users, this doesn't quite work either. Users would typically (I think) not provide a folder with all their files and list Filenames. Instead, they would, if they use local files directly, list FilePaths.

Yes, this was intended as an example for users. But this mechanism is what we have previously implemented, using the DataFolder mechanism. They can just set the to '.', no?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With SciCat, it would also not be possible to list file names after downloading because the names are required for the download itself.

Isn't this a different purpose?

  • For SciCat download, the user needs to list which files they need.
  • For the workflow, they need to specify which file has which meaning (sample, background, ...).

So it is not clear to me if that duplication can be avoided? Even though one could imagine that download all files from an experiment, such that there is no need to list them individually for download?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this was intended as an example for users. But this mechanism is what we have previously implemented, using the DataFolder mechanism. They can just set the to '.', no?

Only if file paths are specified relative to CWD. And then it still goes through the path resolution Filename -> FilePath. I think it is cleanest to specify a FilePath directly to avoid any potential issues with having a bad resolver.

Isn't this a different purpose?

* For SciCat download, the user needs to list which files they need.

* For the workflow, they need to specify which file has which meaning (sample, background, ...).

So it is not clear to me if that duplication can be avoided? Even though one could imagine that download all files from an experiment, such that there is no need to list them individually for download?

The datasets should, at least partially, indicate the meaning of a file. E.g., a sample dataset should link to a background dataset, explicitly stating that this is a background. See https://github.com/scipp/essreduce/blob/5f23eaca66727f2507eb0d89896a947d106029d0/src/ess/reduce/scicat.py#L29 as proposed in scipp/essreduce#18.

And conceptually, we also have a (Filename, DatasetId) -> FilePath resolver. So it is pretty much the same as in the pooch case except that it needs an additional input.

Copy link
Member Author

@SimonHeybrock SimonHeybrock May 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conclusion from discussion:

We are considering to remove DataFolder and FilePath. These were previously added to "hide" differences between local files and files obtained via Pooch or Scicat. However, they are confusing to users, and complicate the visualized task graph.
Instead, use something like:

workflow[Filename[SampleRun]] = download_tutorial_sample_data()  # returns path after pooch download
workflow[Filename[SampleRun]] = download_scicat_dataset(12345)

There are variations to this where a SciCat helper finds, e.g., background runs related to the specified sample run, such that those can be inserted into the workflow automatically.

There are some related questions around other meta data in the SciCat dataset that may be needed by the workflow. There are several solutions to this, but none of them appears to interfere with the decision to remove DataFolder and FilePath.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nvaytet Thoughts?

Copy link
Member

@nvaytet nvaytet May 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's great if we can remove DataFolder and FilePath.
I was wondering if the above could still be a bit confusing. If you read it twice, or are used to it, then it makes sense that download_tutorial_sample_data() would download the file somewhere, and then return the path to that file.
But the first time I read it, putting myself in the mind of a scientist/user, I sort of expected download_tutorial_sample_data() to return the actual data, not the Filename.
Can a slighty different (but not too lengthy) name help?

Comment on lines +86 to +96
def Sans2dTutorialWorkflow() -> sciline.Pipeline:
"""
Create Sans2d tutorial workflow.

Equivalent to :func:`Sans2dWorkflow`, but with loaders for tutorial data instead
of Mantid-based loaders.
"""
workflow = Sans2dWorkflow()
workflow.insert(load_tutorial_run)
workflow.insert(load_tutorial_direct_beam)
return workflow
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it really make sense to hide the insertions in this function? It is convenient for the tutorials but I see 2 issues with it:

  • It increases the number of functions that a user needs to know about.
  • It is unclear what is different about this tutorial workflow compared to the regular one. Users may assume that there are more fundamental differences than just how files are loaded.

Copy link
Member Author

@SimonHeybrock SimonHeybrock May 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Why does it increase the number? It decreases it from 2 (load_*) to 1 (Sans2dTutorialWorkflow).
  • Why is it unclear? It is in the docstring as well as mentioned in the notebook (even though that could be extended).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Why does it increase the number? It decreases it from 2 (load_*) to 1 (Sans2dTutorialWorkflow).

ok

  • Why is it unclear? It is in the docstring as well as mentioned in the notebook (even though that could be extended).

The explanation in the notebook is not completely clear. And many users won't find the docstring. I would like the explanation to be clear about the fact that the workflows are identical except for how they read files and that the files are entirely equivalent and just use a different format.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed to add an extra sentence in notebook to address this.

"We define the reduction parameters, with keys and types given by aliases or types defined in `ess.sans.types`:"
"We begin by creating the Sans2d workflow object (this is a [sciline.Pipeline](https://scipp.github.io/sciline/generated/classes/sciline.Pipeline.html) which can be consulted for advanced usage).\n",
"The Sans2d workflow uses Mantid to load files.\n",
"This tutorial comes with files that do not require Mantid, so we use slightly modified a workflow that does not require Mantid:"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"This tutorial comes with files that do not require Mantid, so we use slightly modified a workflow that does not require Mantid:"
"This tutorial comes with files that do not require Mantid, so we use a slightly modified workflow that does not require Mantid:"

"metadata": {},
"outputs": [],
"source": [
"workflow.insert(isis.data.transmission_from_background_run)\n",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add comments explaining a little more what they are doing? e.g. Use the background run as the transmission run for background, instead of a separate file.
etc...

"id": "0977a3dc-eb5f-4067-80c5-cc19137cc915",
"metadata": {},
"source": [
"## Loading local files\n",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I still think scientists would like to have an example using local files spelled out, as it may not be completely obvious from the explanation about DataFolder above? What do you think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above, I think we will remove DataFolder and FilePath, as it is not needed any more and too confusing.

"\n",
"We begin by creating the Zoom workflow object (this is a [sciline.Pipeline](https://scipp.github.io/sciline/generated/classes/sciline.Pipeline.html) which can be consulted for advanced usage).\n",
"The Zoom workflow uses Mantid to load files.\n",
"This tutorial comes with files that do not require Mantid, so we use slightly modified a workflow that does not require Mantid:"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"This tutorial comes with files that do not require Mantid, so we use slightly modified a workflow that does not require Mantid:"
"This tutorial comes with files that do not require Mantid, so we use a slightly modified workflow that does not require Mantid:"

"]"
"workflow = isis.zoom.ZoomTutorialWorkflow()\n",
"# For real data use:\n",
"# workflow = isis.zoom.ZoomWorkflow()"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does Zoom need to be repeated in the name of the workflow? It's already in the zoom module?
How about naming it isis.zoom.SANSWorkflow(), to handle cases here some instruments can do different kinds of workflow (e.g. dream can do powder and sans)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see nothing wrong with the repetition, maybe we will find it useful in the name. Also, otherwise you have multiple identically named functions or classes in different submodules of the same package.

"params[CorrectForGravity] = True\n",
"params[UncertaintyBroadcastMode] = UncertaintyBroadcastMode.upper_bound\n",
"params[ReturnEvents] = False"
"workflow.insert(isis.data.transmission_from_background_run)\n",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add comments as suggested above for the sans2d notebook?

'SANS2D00063159.nxs.h5',
):
path = get_path(filename)
return DataFolder(str(Path(path).parent))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this just returning the path of the last file? Maybe that was intentional? Should we check the paths are all the same for all files?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we change as discussed above this would become irrelevant.

@SimonHeybrock
Copy link
Member Author

I think I address all comments now. FilePath and DataFolder are gone. Ready for another review!

"metadata": {},
"outputs": [],
"source": [
"# left-right layout works better for this graph\n",
"pipeline.visualize(BackgroundSubtractedIofQ, graph_attr={'rankdir': 'LR'})"
"workflow[DirectBeamFilename] = isis.data.sans2d_tutorial_direct_beam()\n",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still not entirely sure about this change. I think it may be making the notebook less clear.
Our initial users will be ISs and IDSs, and I think they liked to be able to see the name of the file, because it's usually a file they gave us, and they can then know exactly what is being loaded (even if the file names were not the best).
In addition, it won't necessarily be clear to them what to do if they just want to load a local disk file.

It is also a bit annoying for us that we have to write a function for each file.

Could we make something in-between, like an object or a function that returns file paths? e.g.

finder = TutorialFileFinder()
workflow[DirectBeamFilename] = finder('DIRECT_SANS2D_REAR_34327_4m_8mm_16Feb16.dat')
workflow[Filename[SampleRun]] = finder('SANS2D00063114.nxs')
...

You could then have a ScicatFileFinder or something with a better name?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we make something in-between, like an object or a function that returns file paths?

That worked out badly so far:

  • We had to have special filename mappings (adding suffix, special handling of zip files)
  • It is confusing for users, since they may thing that they can use the finder for their own files.

In addition, it won't necessarily be clear to them what to do if they just want to load a local disk file.

I would argue against that. It think it is much clearer now than before, and having finder would imho just add more confusion.

It is also a bit annoying for us that we have to write a function for each file.

I think this is actually a positive change. It avoid hard-coding filenames in notebooks and unit tests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since they may thing that they can use the finder for their own files.

True

It avoid hard-coding filenames in notebooks and unit tests.

I agree that it means if we want to change the file used in tests and the notebooks, we only have one place to change.
However, does this mean we should avoid putting file numbers in the name of the function then? e.g. loki_tutorial_sample_run_60339 as this would mean if we change that file we have to go and change multiple places in the codebase?
If so, we may struggle to come up with names for all these functions?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as this would mean if we change that file we have to go and change multiple places in the codebase?

A modern IDE can easily rename such functions (easier than finding all occurrences of a filename)?

I chose to include the run numbers in the Loki case (and only there), since the scientists are actually working with those files, and there are multiple. We can name them sample_run_1 and sample_run_2 if you prefer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really mind either way.
Maybe it's not really that we need to change multiple places, but it would prevent having a function named loki_tutorial_sample_run_60339() whose contents get edited and it now returns the file 60777 if one forgets to rename the function.
But once again, I'm ok with either solution.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll leave it then.

"workflow[Filename[EmptyBeamRun]] = isis.data.zoom_tutorial_empty_beam_run()\n",
"workflow[isis.SampleOffset] = sc.vector([0.0, 0.0, 0.11], unit='m')\n",
"workflow[isis.DetectorBankOffset] = sc.vector([0.0, 0.0, 0.5], unit='m')\n",
"masks = isis.data.zoom_tutorial_mask_filenames()\n",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the same reason as above, I'm not sure I like hiding the file names.

)


def loki_tutorial_run_60392() -> Filename[TransmissionRun[BackgroundRun]]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def loki_tutorial_run_60392() -> Filename[TransmissionRun[BackgroundRun]]:
def loki_tutorial_background_transmission_run() -> Filename[TransmissionRun[BackgroundRun]]:

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering about that... it is actually also being used as empty beam run, so I left the name unspecific. Should it be?

Copy link
Member

@nvaytet nvaytet May 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can just also have a loki_tutorial_empty_beam_run() function that returns the same file name, which could then be changed to return something else in the future if we need that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then the user-visible code is "lying", since it will be invisible that the same run is used. I can change it if you prefer?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's fine, you can leave it as is.

@SimonHeybrock SimonHeybrock merged commit 53f868a into main May 7, 2024
3 checks passed
@SimonHeybrock SimonHeybrock deleted the ux-readability branch May 7, 2024 11:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants