Skip to content

Conversation

@jl-wynen
Copy link
Member

@jl-wynen jl-wynen commented Sep 29, 2025

This allows overriding the get_path function for data files to use a local file instead of downloading from pooch. Simply define the env var SCIPP_OVERRIDE_DATA_DIR=/path/to/data and it will return paths in that folder instead of downloading anything. The code in downstream packages only needs to be updated and use the new make_registry function.

The override folder must have the same layout as the http server. I achieved that by symlinking /dmsc/codeshelf/ci/ess to /nfs/www/html/groups/scipp/ess. So any files that we use in any ess.* package should be accessible automatically.

This is a binary setting, either all files are downloaded or all files are accessed locally. I did this to indicate to us if we forget to provide a file locally and it downloads the file instead which can lead to (flaky) timeouts.

See https://git.esss.dk/dram/code-shelf/code-shelf-template/-/merge_requests/7 for how it can be used on gitlab.

I tested this on a GitLab runner. Unfortunately, I could not get it to work automatically with a dev build of essreduce because I had conflicts between conda and pip packages. But I could get it to work in an interactive session with proper pip install hacks.

@jl-wynen jl-wynen requested a review from nvaytet September 29, 2025 10:56

_bifrost_registry = Registry(
instrument='bifrost',
_bifrost_registry = make_registry(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sort of confused as to why we have these registries for specific instruments in here.
I did not actually know we had them.

As far as I can see, they are only used for the unit tests?
Should we just make something in the test folder?
I think each technique sub-package has its own data registry which does not need the registries here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These will be moved to the tests.


def _import_pooch() -> Any:
try:
import pooch
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can using lazy_loader help not having to have these try/except blocks, just import pooch at the top, and only import it when needed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can. But then you don't get a custom error message.

:
Either a :class:`PoochRegistry` or :class:`LocalRegistry`.
"""
if (override := os.environ.get(_LOCAL_REGISTRY_ENV_VAR)) is not None:
Copy link
Member

@nvaytet nvaytet Oct 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you wrote in the PR description, this is all or nothing. Either all files are local, or all files are from pooch.
I'm wondering if this will be impractical. I had imagined there would be some sort of priority: first check if a file is found locally, if not, try pooch?

In addition, does the current approach mean that the environment variable needs to be set before we import packages? If so, it would be nice if order did not matter; that you could set the variable at any time and it would pick it up.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After in-person converstion, this env variable would only be used for tests, not by users.
The all or nothing approach is a way to ensure that we are controlling precisely whether we are getting files from local disk or pooch. This is what we want.

@jl-wynen jl-wynen enabled auto-merge October 9, 2025 11:35


@pytest.fixture(scope='session')
def bifrost_registry() -> Registry:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can I suggest we make a single registry for all files? I think name clashes are very unlikely?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clashes are unlikely. But most file names don't tell us anything about the file. We can rely on comments but I think it is good to have the instrument name in the code. Is there a benefit to merging them?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a benefit to merging them?

I thought it was a bit overkill to create 3 registries for the tests, but yes I don't think it matters.
Additionally, because the files are in different folders on the http storage (ess/bifrost, ess/loki), I think we cannot use only one without moving the files around?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can by including the instrument name in the file name:

make_registry(
        'ess',
        files={
            "bifrost/BIFROST_20240914T053723.h5": "md5:0f2fa5c9a851f8e3a4fa61defaa3752e",
        },
        version='1',
    )

This would also help with my comment about documenting what instrument the files belong to.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just one isn't so bad then?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, I forgot about the version. The actual full path is {base_url}/{prefix}/{version}/{name}. So we can't just move the instrument name from the prefix to the file name. Plus, the different version numbers get in the way.

self._files = _to_file_entries(files)

@abstractmethod
def get_path(self, name: str) -> Path:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: can we make this into __call__ instead, so we can do dream_registry('tutorial_dream_file')?
The get_path is kind of the only method on the class, and it would be quite obvious that we want to get a file path?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather keep a named method. I would only use __call__ for types that are conceptually functions. But registries seem more like highly specialised containers to me.

Just based on naming, registry(name) seems wrong. It should be a verb or phrase, not a noun. E.g.,

get_path = make_registry(...)
filename = get_path(name)

But that is a bit odd here, too, IMHO.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would a __getitem__ be better suited then?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that is worse. I would expect that to be a cheap operation that gets an element from a container, not download something from the internet.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind, I just thought we could save some typing by using one or the other.
Just pick what you think is best.

@jl-wynen jl-wynen merged commit 48dc3e4 into main Oct 10, 2025
4 checks passed
@jl-wynen jl-wynen deleted the data-retriever branch October 10, 2025 12:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants