-
Notifications
You must be signed in to change notification settings - Fork 1
Transparent file access with pooch or local override #264
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
src/ess/reduce/data/__init__.py
Outdated
|
|
||
| _bifrost_registry = Registry( | ||
| instrument='bifrost', | ||
| _bifrost_registry = make_registry( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sort of confused as to why we have these registries for specific instruments in here.
I did not actually know we had them.
As far as I can see, they are only used for the unit tests?
Should we just make something in the test folder?
I think each technique sub-package has its own data registry which does not need the registries here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These will be moved to the tests.
|
|
||
| def _import_pooch() -> Any: | ||
| try: | ||
| import pooch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can using lazy_loader help not having to have these try/except blocks, just import pooch at the top, and only import it when needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can. But then you don't get a custom error message.
| : | ||
| Either a :class:`PoochRegistry` or :class:`LocalRegistry`. | ||
| """ | ||
| if (override := os.environ.get(_LOCAL_REGISTRY_ENV_VAR)) is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you wrote in the PR description, this is all or nothing. Either all files are local, or all files are from pooch.
I'm wondering if this will be impractical. I had imagined there would be some sort of priority: first check if a file is found locally, if not, try pooch?
In addition, does the current approach mean that the environment variable needs to be set before we import packages? If so, it would be nice if order did not matter; that you could set the variable at any time and it would pick it up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After in-person converstion, this env variable would only be used for tests, not by users.
The all or nothing approach is a way to ensure that we are controlling precisely whether we are getting files from local disk or pooch. This is what we want.
|
|
||
|
|
||
| @pytest.fixture(scope='session') | ||
| def bifrost_registry() -> Registry: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can I suggest we make a single registry for all files? I think name clashes are very unlikely?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clashes are unlikely. But most file names don't tell us anything about the file. We can rely on comments but I think it is good to have the instrument name in the code. Is there a benefit to merging them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a benefit to merging them?
I thought it was a bit overkill to create 3 registries for the tests, but yes I don't think it matters.
Additionally, because the files are in different folders on the http storage (ess/bifrost, ess/loki), I think we cannot use only one without moving the files around?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can by including the instrument name in the file name:
make_registry(
'ess',
files={
"bifrost/BIFROST_20240914T053723.h5": "md5:0f2fa5c9a851f8e3a4fa61defaa3752e",
},
version='1',
)This would also help with my comment about documenting what instrument the files belong to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe just one isn't so bad then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh, I forgot about the version. The actual full path is {base_url}/{prefix}/{version}/{name}. So we can't just move the instrument name from the prefix to the file name. Plus, the different version numbers get in the way.
| self._files = _to_file_entries(files) | ||
|
|
||
| @abstractmethod | ||
| def get_path(self, name: str) -> Path: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: can we make this into __call__ instead, so we can do dream_registry('tutorial_dream_file')?
The get_path is kind of the only method on the class, and it would be quite obvious that we want to get a file path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather keep a named method. I would only use __call__ for types that are conceptually functions. But registries seem more like highly specialised containers to me.
Just based on naming, registry(name) seems wrong. It should be a verb or phrase, not a noun. E.g.,
get_path = make_registry(...)
filename = get_path(name)But that is a bit odd here, too, IMHO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would a __getitem__ be better suited then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that is worse. I would expect that to be a cheap operation that gets an element from a container, not download something from the internet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't mind, I just thought we could save some typing by using one or the other.
Just pick what you think is best.
This allows overriding the
get_pathfunction for data files to use a local file instead of downloading from pooch. Simply define the env varSCIPP_OVERRIDE_DATA_DIR=/path/to/dataand it will return paths in that folder instead of downloading anything. The code in downstream packages only needs to be updated and use the newmake_registryfunction.The override folder must have the same layout as the http server. I achieved that by symlinking
/dmsc/codeshelf/ci/essto/nfs/www/html/groups/scipp/ess. So any files that we use in anyess.*package should be accessible automatically.This is a binary setting, either all files are downloaded or all files are accessed locally. I did this to indicate to us if we forget to provide a file locally and it downloads the file instead which can lead to (flaky) timeouts.
See https://git.esss.dk/dram/code-shelf/code-shelf-template/-/merge_requests/7 for how it can be used on gitlab.
I tested this on a GitLab runner. Unfortunately, I could not get it to work automatically with a dev build of essreduce because I had conflicts between conda and pip packages. But I could get it to work in an interactive session with proper
pip installhacks.