Allow readers to filter the available datasets configured in YAML #434
Comments
Or maybe the method in the file handler should get the original dictionary from the reader (instead of being called multiple times with the individual DatasetID/info arguments)?
So I double checked this to make sure I wasn't crazy after discussing things in #437 with @sfinkens. I think this may have pointed out an overall limitation of the base reader that I need to address. I would like @mraspaud's opinion on this too (@pnuu and @adybbroe too if you guys care). Currently, satpy prefers to produce as many of the products requested by the user as possible; if some requested datasets can't be produced, the rest still are. Right now readers have a list of all known dataset IDs created from two sources:
The question that is coming up in #437 is what the difference is between "available" and "exists" in the readers/file handlers. Right now the readers don't necessarily have the concept of "this is a dataset I know about but it is not available for loading", except at the file type level where we can say "we don't have the M15 files so M15 isn't available for loading". The point of this issue is to allow the file handlers to say "this dataset that was configured in YAML isn't available in this file handler right now". In short, I think I would like the yaml reader to know the difference between possible (all) datasets and available datasets. Thoughts? (damn this got long)
Thank you for breaking this down very clearly @djhoese! Maybe examples of the two different scenarios might be helpful as well. The current behaviour is:

a) Dataset not available (known from the YAML but not present in the file): a warning is logged and the dataset is skipped, with no KeyError reaching the user.
b) Dataset unknown:

```
>>> scene.load(['foo'])
KeyError: 'Unknown datasets: foo'
```

The implementation in #437 throws a KeyError of type b) if a known dataset is not available in the file, which erroneously suggests that the dataset is unknown.
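As a self-contained sketch of the two behaviours (the names `KNOWN`, `IN_FILE`, and `load` are illustrative stand-ins, not satpy's actual API):

```python
import warnings

# Hypothetical mini-reader illustrating the two cases discussed above.
KNOWN = {'IR_108', 'VIS006'}   # dataset names configured in YAML
IN_FILE = {'VIS006'}           # dataset names actually present in the file

def load(names):
    unknown = set(names) - KNOWN
    if unknown:
        # case b) completely unknown dataset -> hard KeyError
        raise KeyError('Unknown datasets: %s' % ', '.join(sorted(unknown)))
    loaded = {}
    for name in names:
        if name not in IN_FILE:
            # case a) known but not in this file -> warning, dataset skipped
            warnings.warn('Could not load dataset %r' % name)
            continue
        loaded[name] = 'data'
    return loaded
```

The problem described above is that case a) currently falls through to the case b) error path in some readers.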
@sfinkens Yes, exactly, though I get something different for one of those cases.
What reader are you using for that?
For the native_msg reader we determine what actual data is present in the file when we read the header, and if the user tries to load a channel that isn't present they will get a KeyError. It shouldn't be difficult to then do as you suggest, i.e. have the available_datasets method return what is actually present and not all possible datasets. Or have I misunderstood?
@ColinDuff The file handler may raise a KeyError, but this is only logged as a warning by the base/parent Reader object. This issue was made to address that with a new file handler method.
OK, so I can have a look at how the native_msg (and netcdf msg) readers could only show what channels are actually available in the file, if you wish.
That's what this issue is meant to discuss: the right way to do that. I understand why the native_msg reader is the way it is. I'm hoping to get some feedback on whether there is some situation I'm not considering. I think the way to support the readers as I've described is to have the main reader know all of the datasets that could be loaded by this reader and also ask the file handlers what is available. This would mean that two lists would have to be maintained, I think. This would preserve the functionality of warning for existing-but-unavailable datasets while improving the ability of file handlers to filter/limit what datasets are considered available.
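A rough sketch of that two-list idea, with hypothetical class names (`Reader`, `DummyHandler`) standing in for satpy's real objects:

```python
# The Reader keeps the full list of YAML-configured dataset names and
# separately asks its file handlers which of those are actually available.
class Reader:
    def __init__(self, yaml_datasets, file_handlers):
        self.all_ids = list(yaml_datasets)   # everything configured in YAML
        self.file_handlers = file_handlers

    def available_dataset_ids(self):
        avail = set()
        for fh in self.file_handlers:
            avail.update(fh.available_datasets(self.all_ids))
        # preserve YAML order while filtering to what the handlers reported
        return [ds for ds in self.all_ids if ds in avail]

class DummyHandler:
    """Stand-in file handler that knows what is present in 'its' file."""
    def __init__(self, present):
        self.present = set(present)

    def available_datasets(self, known):
        return [ds for ds in known if ds in self.present]
```

Because `all_ids` is kept separately, asking for an existing-but-unavailable dataset can still be distinguished from asking for an unknown one.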
This issue supersedes an older one #263 |
@djhoese Sorry, I pasted an additional line
Talking with Sauli and just thinking out loud: available_datasets could be renamed to known_datasets (or whatever), listing what channels are listed in the yaml file. available_datasets would then be returned from the file reader, listing what has actually been read. available_composites would use this to determine what composites are currently available and only return those. For readers that work on a channel-per-file basis, or those that don't need to carry out a channels-available check, it would just set available_datasets to known_datasets after a successful read.
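To illustrate the last part, a minimal sketch of how `available_composites` could filter on the channels actually read (the composite names and their dependencies here are made up for the example):

```python
# Composite -> channels it requires; in satpy these come from YAML config.
COMPOSITES = {
    'natural_color': ['VIS006', 'HRV'],
    'overview': ['VIS006', 'IR_108'],
}

def available_composites(available_datasets):
    # only composites whose inputs were actually read are available
    avail = set(available_datasets)
    return sorted(name for name, deps in COMPOSITES.items()
                  if all(ch in avail for ch in deps))
```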
Interesting discussion. Two things:
First, I think we all need to refer to full class.method combinations. There is a single Reader object that uses multiple file handler objects. I'll try to keep my answers below clear (pointing out which methods come from which objects). This is also kind of a brainstorm. You've been warned...

@mraspaud For your second point, so you are thinking the full list of YAML datasets gets passed to the file handler and then the reader asks the file handlers what is available. This available list could include the YAML datasets and any dynamically discovered datasets from the file. Right? My only complaint about this is that it complicates the simple cases. Making this functionality available through an optional method means that most people won't have to worry about it. The previously mentioned PR does a good job of solving both problems in one method when it should probably solve it in two. So I propose as a little change for existing readers/handlers:
The above would require some "caching" of the available datasets versus all/known datasets, which would however speed up repeated queries.

For a "perfect" brand new solution I propose:
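One cheap way to get that caching, sketched with `functools.lru_cache` (the file-scanning function is a made-up stand-in for reading a real file header):

```python
import functools

@functools.lru_cache(maxsize=None)
def scan_file_for_channels(filename):
    # stand-in for an expensive header read (e.g. native_msg);
    # lru_cache guarantees it runs only once per file
    fake_headers = {'header.nat': ('VIS006', 'IR_108')}
    return frozenset(fake_headers.get(filename, ()))

def available_dataset_names(filename, known_names):
    present = scan_file_for_channels(filename)
    return [n for n in known_names if n in present]
```

Repeated availability queries then reuse the cached scan instead of re-reading the file.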
I'm wondering how you would avoid the KeyError in that case.
Sounds good, looking forward to reviewing the PR for this :)
@sfinkens What happens is that the reader calls the file handler's loading method, catches the KeyError there, and only logs a warning.
I need some more opinions if anyone is on their computer this weekend. I was about to implement the "perfect" solution above and can't decide if it is overly complicated or not. Having a single method of the file handlers be responsible for updating yaml-configured dataset info and producing any dynamically discovered datasets isn't too bad. The complexity comes from this method telling whether or not these are available or if they are just "known". I see two options:

**Option 1 - All In One**

```python
def available_datasets(self, configured_datasets=None):
    for ds_info in (configured_datasets or []):
        # update ds_info with things like file resolution, etc if needed
        yield self._dataset_is_available(ds_info), ds_info
    for var_name in dynamic_var_names:
        yield True, {'name': var_name, ... other metadata ...}
```

**Option 2 - Split**

```python
def all_datasets(self, configured_datasets=None):
    for ds_info in (configured_datasets or []):
        # update ds_info with things like file resolution, etc if needed
        yield ds_info
    for var_name in dynamic_var_names:
        yield {'name': var_name, ... other metadata ...}

def available_dataset(self, ds_info):
    return ds_info['name'] in self
```
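For context, a runnable sketch of how Option 1 could play out: a hypothetical file handler yields `(is_available, ds_info)` pairs and the reader filters on the flag (all names here are illustrative, not satpy's actual classes):

```python
class FileHandler:
    """Hypothetical handler implementing the 'all in one' option."""
    def __init__(self, file_contents, dynamic_var_names=()):
        self.file_contents = set(file_contents)
        self.dynamic_var_names = dynamic_var_names

    def __contains__(self, name):
        return name in self.file_contents

    def available_datasets(self, configured_datasets=None):
        for ds_info in (configured_datasets or []):
            # a real handler could also update ds_info (resolution, etc.)
            yield ds_info['name'] in self, ds_info
        for var_name in self.dynamic_var_names:
            # dynamically discovered variables are always available
            yield True, {'name': var_name}

def collect_available(handler, configured):
    return [info['name']
            for is_avail, info in handler.available_datasets(configured)
            if is_avail]
```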
Eh, the second option is redundant. I'm still not a huge fan of the first, but it's not terrible.
I like option one! What would the default behaviour of `available_datasets` be?
@sfinkens The latter. So the base file handler class would do:

```python
for ds_info in (configured_datasets or []):
    yield True, ds_info
```
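Put together, a sketch of that default plus a subclass that actually checks the file (both class names are hypothetical):

```python
class BaseFileHandler:
    # default: every YAML-configured dataset is assumed available
    def available_datasets(self, configured_datasets=None):
        for ds_info in (configured_datasets or []):
            yield True, ds_info

class HeaderAwareHandler(BaseFileHandler):
    """Hypothetical native_msg-like handler that knows the file contents."""
    def __init__(self, channels_in_file):
        self.channels_in_file = set(channels_in_file)

    def available_datasets(self, configured_datasets=None):
        for ds_info in (configured_datasets or []):
            yield ds_info['name'] in self.channels_in_file, ds_info
```

Existing readers would inherit the permissive default unchanged; only readers that can inspect their files need to override.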
**Is your feature request related to a problem? Please describe.**
The `native_msg` reader only accepts one file at a time, but these files can differ in what datasets/bands are actually stored in them. There are currently 3 ways to specify/modify what datasets are available from a reader:

1. If the `file_type` for a dataset has been loaded then we assume this dataset can be loaded.
2. An `available_datasets` method on the file handlers of a reader that produces a dictionary of ds_id -> ds_info. This is good when the files being read have a long list of dynamically loaded variables.
3. A `resolution` property on the specific file handler to modify the DatasetID for that specific dataset to specify the exact resolution available from this file.

**Describe the solution you'd like**
Add another method to file handlers, `filter_available_datasets`, that is called before `available_datasets`. This new method allows a file handler to indicate "I know this dataset is configured in the YAML but I don't actually have that data in this file".

**Describe any changes to existing user workflow**
Shouldn't be a problem if things are done properly.
**Additional python or other dependencies**
None.
**Describe any changes required to the build process**
None.
**Describe alternatives you've considered**
Move the YAML config datasets to the `available_datasets` method. This makes it really hard to configure things like bands, where wavelengths are probably best not hardcoded (unless satpy stops using yaml to configure datasets/readers).
Could also just deal with it the way it is, meaning datasets are listed as available but aren't actually available.
CC'ing @sjoro @adybbroe @ColinDuff