Transparent file access with pooch or local override #264

jl-wynen · 2025-09-29T10:56:23Z

This allows overriding the get_path function for data files so that it uses a local file instead of downloading with pooch. Define the environment variable SCIPP_OVERRIDE_DATA_DIR=/path/to/data and get_path returns paths in that folder instead of downloading anything. Downstream packages only need to be updated to use the new make_registry function.

The override folder must have the same layout as the http server. I achieved that by symlinking /dmsc/codeshelf/ci/ess to /nfs/www/html/groups/scipp/ess. So any files that we use in any ess.* package should be accessible automatically.

This is a binary setting: either all files are downloaded or all files are accessed locally. I did this so that we notice when we forget to provide a file locally. Otherwise the file would be downloaded silently, which can lead to (flaky) timeouts.

See https://git.esss.dk/dram/code-shelf/code-shelf-template/-/merge_requests/7 for how it can be used on GitLab.

I tested this on a GitLab runner. Unfortunately, I could not get it to work automatically with a dev build of essreduce because I had conflicts between conda and pip packages. But I could get it to work in an interactive session with proper pip install hacks.

nvaytet · 2025-10-02T14:21:16Z

src/ess/reduce/data/__init__.py

-
-_bifrost_registry = Registry(
-    instrument='bifrost',
+_bifrost_registry = make_registry(


I'm sort of confused as to why we have these registries for specific instruments in here.
I did not actually know we had them.

As far as I can see, they are only used for the unit tests?
Should we just make something in the test folder?
I think each technique sub-package has its own data registry which does not need the registries here?

These will be moved to the tests.

nvaytet · 2025-10-02T14:24:12Z

src/ess/reduce/data/_registry.py

+
+def _import_pooch() -> Any:
+    try:
+        import pooch


Could lazy_loader help us avoid these try/except blocks? We would just import pooch at the top, and it would only actually be imported when needed.

It can. But then you don't get a custom error message.
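The trade-off can be sketched as follows: a deferred import wrapped in a helper keeps the import lazy and still allows a tailored error message. This is an illustration, not the code under review. `import_optional` is a hypothetical helper and the wording of the message is made up; only the name `_import_pooch` appears in the diff.

```python
import importlib
from typing import Any


def import_optional(module_name: str, hint: str) -> Any:
    """Import a module on first use and raise a tailored error if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        # Re-raise with a message that tells the user what to do next.
        raise ImportError(
            f"This feature requires the optional package '{module_name}'. {hint}"
        ) from err


def _import_pooch() -> Any:
    # The install hint is an example message, not the one used in the PR.
    return import_optional("pooch", "Install it with `pip install pooch`.")
```

Nothing is imported until `_import_pooch()` is called, so code paths that never download a file do not need pooch to be installed.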

nvaytet · 2025-10-02T14:28:48Z

src/ess/reduce/data/_registry.py

+    :
+        Either a :class:`PoochRegistry` or :class:`LocalRegistry`.
+    """
+    if (override := os.environ.get(_LOCAL_REGISTRY_ENV_VAR)) is not None:


As you wrote in the PR description, this is all or nothing. Either all files are local, or all files are from pooch.
I'm wondering if this will be impractical. I had imagined there would be some sort of priority: first check if a file is found locally, if not, try pooch?

In addition, does the current approach mean that the environment variable needs to be set before we import packages? If so, it would be nice if order did not matter; that you could set the variable at any time and it would pick it up.

After an in-person conversation: this env variable would only be used for tests, not by users.
The all or nothing approach is a way to ensure that we are controlling precisely whether we are getting files from local disk or pooch. This is what we want.

nvaytet · 2025-10-09T11:58:18Z

tests/conftest.py

+
+
+@pytest.fixture(scope='session')
+def bifrost_registry() -> Registry:


Can I suggest we make a single registry for all files? I think name clashes are very unlikely?

Clashes are unlikely. But most file names don't tell us anything about the file. We can rely on comments but I think it is good to have the instrument name in the code. Is there a benefit to merging them?

Is there a benefit to merging them?

I thought it was a bit overkill to create 3 registries for the tests, but yes I don't think it matters.
Additionally, because the files are in different folders on the http storage (ess/bifrost, ess/loki), I think we cannot use only one without moving the files around?

We can by including the instrument name in the file name:

make_registry(
    'ess',
    files={
        "bifrost/BIFROST_20240914T053723.h5": "md5:0f2fa5c9a851f8e3a4fa61defaa3752e",
    },
    version='1',
)

This would also help with my comment about documenting what instrument the files belong to.

Maybe just one isn't so bad then?

Ahh, I forgot about the version. The actual full path is {base_url}/{prefix}/{version}/{name}. So we can't just move the instrument name from the prefix to the file name. Plus, the different version numbers get in the way.
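A small illustration of why the two spellings point at different locations, using the path template quoted above. The base URL is a placeholder, `file_url` is a hypothetical helper, and the per-instrument prefix `ess/bifrost` is assumed from the folder names mentioned earlier in this thread.

```python
def file_url(base_url: str, prefix: str, version: str, name: str) -> str:
    # Template from the comment above: {base_url}/{prefix}/{version}/{name}
    return f"{base_url}/{prefix}/{version}/{name}"


base = "https://example.org"  # placeholder, not the real server
name = "BIFROST_20240914T053723.h5"

# Instrument in the prefix: the version directory sits below the instrument.
per_instrument = file_url(base, "ess/bifrost", "1", name)

# Instrument in the file name: the version directory sits above the instrument.
merged = file_url(base, "ess", "1", f"bifrost/{name}")

print(per_instrument)
print(merged)
```

The first URL ends in `ess/bifrost/1/BIFROST_20240914T053723.h5` and the second in `ess/1/bifrost/BIFROST_20240914T053723.h5`, so a single merged registry would only work if the files were moved on the server.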

nvaytet · 2025-10-09T12:00:42Z

src/ess/reduce/data/_registry.py

+        self._files = _to_file_entries(files)
+
+    @abstractmethod
+    def get_path(self, name: str) -> Path:


Suggestion: can we make this into __call__ instead, so we can do dream_registry('tutorial_dream_file')?
The get_path is kind of the only method on the class, and it would be quite obvious that we want to get a file path?

I would rather keep a named method. I would only use __call__ for types that are conceptually functions. But registries seem more like highly specialised containers to me.

Just based on naming, registry(name) seems wrong. It should be a verb or phrase, not a noun. E.g.,

get_path = make_registry(...)
filename = get_path(name)

But that is a bit odd here, too, IMHO.

Would a __getitem__ be better suited then?

I think that is worse. I would expect that to be a cheap operation that gets an element from a container, not download something from the internet.

I don't mind, I just thought we could save some typing by using one or the other.
Just pick what you think is best.

jl-wynen added 2 commits September 26, 2025 17:11

Implement data registry override

80557eb

Unzip files from local source

59fa877

jl-wynen requested a review from nvaytet September 29, 2025 10:56

nvaytet reviewed Oct 2, 2025

Move data registries into tests

a187443

jl-wynen enabled auto-merge October 9, 2025 11:35

nvaytet reviewed Oct 9, 2025

jl-wynen mentioned this pull request Oct 10, 2025

Roll out transparent local file access #265

Open

6 tasks

nvaytet approved these changes Oct 10, 2025

jl-wynen merged commit 48dc3e4 into main Oct 10, 2025
4 checks passed

jl-wynen deleted the data-retriever branch October 10, 2025 12:27




