Refactor available datasets logic to be more flexible #739

djhoese · 2019-04-28T02:01:45Z

This is an attempt at resolving the discussion brought up in #434.

Problem Summary

Currently in Satpy we have a couple of difficult-to-code use cases with readers:

Some readers don't have their datasets pre-configured, like generic readers or readers that parse the output of algorithm software (clavrx, geocat, etc). So far we've gotten by with an available_datasets method on file handlers that provides the dataset info for new "dynamic" datasets that the file reader knows about but the YAML didn't.
Certain datasets have information that can only be known from the file contents, like resolution. Since it can't be configured in YAML, we get around this by giving file handlers a resolution property that the base YAML class checks for and uses to update the DatasetIDs configured in the YAML.
Some datasets are configured in YAML but may or may not be in the input file depending on how it was created. There is therefore no way to know if it is "available" or not until it is actually requested by the user.
If a dataset is available in multiple resolutions but the highest ("best") resolution is not available due to some files not being provided, we see that the dataset is not available via available_dataset_ids but even asking for the generic "name" of the dataset will still try to load the highest resolution version and fail (see MODIS L1B reader).

Solution Summary

I've updated the available_datasets method of the file handlers to take a series of (is_available, dataset_info) pairs via a generator. I chain these calls together for the first file handler of each loaded file type. The is_available "boolean" can be:

True: There is a file handler and it has this dataset and can load it
False: There is a file handler and it knows it should be able to load this, but doesn't have it.
None: There isn't a file handler that knows how to load this dataset.

This method can do three things:

Yield the dataset info provided to it, as is.
Update the dataset info with new identifying information (resolution, etc) and possibly other metadata that would affect loading (coordinates, etc). Then yield the dataset info.
Generate new dataset info for dynamic datasets that haven't been configured in the YAML, but that the file handler knows about.
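
As a rough illustration of the three behaviors and of the chaining, here is a minimal, self-contained sketch. It does not import Satpy; the classes `YamlConfiguredHandler` and `DynamicHandler`, the helper `chain_available_datasets`, and the example dataset names are all hypothetical stand-ins, and only the `(is_available, dataset_info)` convention comes from the description above.

```python
class YamlConfiguredHandler:
    """Hypothetical handler for datasets that are pre-configured in YAML."""

    def __init__(self, file_type, variables_in_file, resolution):
        self.file_type = file_type
        self.variables_in_file = set(variables_in_file)
        self.resolution = resolution  # only knowable from the file contents

    def available_datasets(self, configured_datasets=None):
        for is_avail, ds_info in (configured_datasets or []):
            if is_avail is not None:
                # 1. An earlier handler already decided: yield it as is.
                yield is_avail, ds_info
            elif self.file_type in ds_info.get("file_type", []):
                # 2. This handler is responsible: add file-derived info and
                #    report whether the variable is really in the file.
                new_info = dict(ds_info, resolution=self.resolution)
                yield ds_info["name"] in self.variables_in_file, new_info
            else:
                # Not ours: keep it going down the chain unchanged.
                yield is_avail, ds_info


class DynamicHandler(YamlConfiguredHandler):
    """Hypothetical handler that also reports datasets missing from the YAML."""

    def available_datasets(self, configured_datasets=None):
        seen = set()
        for is_avail, ds_info in super().available_datasets(configured_datasets):
            seen.add(ds_info["name"])
            yield is_avail, ds_info
        # 3. Generate new dataset info for variables the YAML did not list.
        for name in sorted(self.variables_in_file - seen):
            yield True, {"name": name, "file_type": [self.file_type],
                         "resolution": self.resolution}


def chain_available_datasets(handlers, yaml_datasets):
    """Feed each handler's generator into the next one, starting from None."""
    stream = ((None, dict(ds_info)) for ds_info in yaml_datasets)
    for handler in handlers:
        stream = handler.available_datasets(stream)
    return list(stream)


yaml_datasets = [
    {"name": "cloud_mask", "file_type": ["level2"]},
    {"name": "cloud_height", "file_type": ["level2"]},
    {"name": "radiance", "file_type": ["level1"]},
]
handlers = [DynamicHandler("level2", ["cloud_mask", "cloud_phase"], resolution=1000)]
result = chain_available_datasets(handlers, yaml_datasets)
for is_avail, ds_info in result:
    print(is_avail, ds_info["name"], ds_info.get("resolution"))
```

In this sketch, cloud_mask comes out as True, cloud_height as False (configured for this file type but missing from the file), radiance stays None because no handler for its file type was in the chain, and cloud_phase is added as a dynamic dataset.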

I've changed the base class to have an idea of "all datasets" and "available datasets", but there are some flaws. Right now, as of this first commit, this only reproduces the current behaviors (maybe with a little extra flair).

Doubts and Questions

I used the "chaining" idea when calling available_datasets because it seemed the most flexible, ended up being kind of elegant, should be performant because of the way the generators are used, and lets file handlers "communicate" among themselves about what is available. However, it may make the actual use of the generator inside available_datasets overly complex and hard to use. The simple cases are still relatively simple, but the semi-complex ones become very hard to walk through.
The dependency tree used by the Scene and .load of the readers have a strange way of handling a "known" versus "available" dataset. It is very obvious now that I've been able to take a step back and work on this.

The dependency tree needs to know what's possible regardless of whether or not a dataset exists. If we know that it exists then we know what is possible. However, we also use the dependency tree when we actually load datasets to get the DatasetID that we need to load. The "best" dataset to load (highest resolution) may not be the "best available", but we currently only load "best". We would probably be OK building the dependency tree with only available datasets, but I'm a little nervous about the lack of future-proofing this would cause. Maybe building two dependency trees would be best/most flexible: one of all known datasets and one of available datasets.

The main thing to decide is what the base reader's get_dataset_key should return. Should we split it into two or more methods that do similar things? Is a keyword argument, as I've laid out in this PR so far, good enough? This is compounded by the problem in the previous paragraph: which components need "known" datasets and which ones need "available"? Right now I have made it possible to do "prefer available but give me known if it isn't available".
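
To make the "prefer available but give me known" idea concrete, here is a small sketch under simplifying assumptions. The `DatasetID` namedtuple is a reduced stand-in for the real class, and `best_match`, `choose_dataset_id`, and the `available_only` keyword are hypothetical names used only for illustration, not necessarily what this PR implements. "Best" here means the smallest resolution value, i.e. the highest resolution.

```python
from collections import namedtuple

# Simplified stand-in: the real DatasetID has more identifying fields.
DatasetID = namedtuple("DatasetID", ["name", "resolution"])


def best_match(name, ids):
    """Return the highest-resolution ID with this name, or raise KeyError."""
    matches = [ds_id for ds_id in ids if ds_id.name == name]
    if not matches:
        raise KeyError(name)
    return min(matches, key=lambda ds_id: ds_id.resolution)


def choose_dataset_id(name, all_ids, available_ids, available_only=False):
    """Prefer the best available ID; optionally fall back to the best known."""
    try:
        return best_match(name, available_ids)
    except KeyError:
        if available_only:
            raise
        return best_match(name, all_ids)


# A MODIS-like case: band "1" is known at three resolutions, but only the
# 1000 m file was provided. Band "20" is known but no file provides it.
all_ids = [DatasetID("1", 250), DatasetID("1", 500),
           DatasetID("1", 1000), DatasetID("20", 1000)]
available_ids = [DatasetID("1", 1000)]

picked = choose_dataset_id("1", all_ids, available_ids)
fallback = choose_dataset_id("20", all_ids, available_ids)
print(picked, fallback)
```

With a single keyword argument, components that only care about loadable data can pass `available_only=True`, while something like the dependency tree could keep the fallback to known datasets.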

I'm not sure what's best here and wouldn't mind some opinions. I've been thinking about this too much so I'm taking a break for a little while.

TODO

Move the "resolution" property logic to available_datasets functionality.
Determine best way to handle all/available differences in current dependency tree creation and use.

Closes #434 (Allow readers to filter the available datasets configured in YAML)
Tests added
Tests passed
Passes git diff origin/master -- "*py" | flake8 --diff
Fully documented

sfinkens

From what I can tell, this is excellent so far!

Originally, I preferred two separate methods for adding new datasets and complementing existing ones. But the chaining approach is (once I got it) much more elegant and flexible. I also like that there is now one well-documented base class method to look up (not the barely documented mix of available_datasets and resolution). I found the comments in the clavrx reader even more instructive than the rather abstract docstring in BaseFileHandler.available_datasets. Maybe those could be added as an implementation guide:
```
# some other file handler knows how to load this
if is_avail is not None:
    yield is_avail, ds_info

# we can confidently say that we can provide this dataset and can
# provide more info
yield True, new_info

# if we didn't know how to handle this dataset and no one else did
# then we should keep it going down the chain
yield is_avail, ds_info
```
What do you think of replacing (True, False, None) with constants (AVAILABLE, KNOWN_BUT_UNAVAILABLE, UNKNOWN)? That would make the logic even more clear. Probably makes the docstrings less readable, though.
I know too little about the dependency tree to give advice here. But you can't do much if the dataset isn't available, right? So "prefer available but give me known if it isn't available" seems like a good default to me. Can you give an example of why you are worried about the lack of future-proofing when building the dependency tree with only the available datasets?

djhoese · 2019-04-30T13:37:19Z

Good idea on the implementation guide. I actually had an example in the base class's docstring when I first wrote it, back when I wanted users to use the base class's method regardless.

I'm not completely against the constants, but I would almost rather wait for Python 2 support to end so we could use an Enum.
Right now, it probably doesn't matter much. My thought was something like a composite being listed as "X" resolution in the future even though there are other resolutions available if the user provided that resolution. So I'm worried that all_dataset_ids would become less accurate or less useful. Maybe I do the preference for available and see what breaks (and what gets fixed).
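
For reference, the Enum idea might look roughly like the following once Python 3 only is an option. This is just a sketch: the member names come from the suggestion above, while the class name `Availability` and the helper `describe` are hypothetical. Keeping True/False/None as the member values means existing handlers that yield the plain values could still be interpreted.

```python
from enum import Enum


class Availability(Enum):
    """Hypothetical named states for the is_available flag."""

    AVAILABLE = True               # a handler has this dataset and can load it
    KNOWN_BUT_UNAVAILABLE = False  # a handler should have it, but the file doesn't
    UNKNOWN = None                 # no handler knows how to load it


def describe(is_avail):
    """Map a raw True/False/None flag to a readable state name."""
    return Availability(is_avail).name


states = [describe(flag) for flag in (True, False, None)]
print(states)
```

One thing to watch for with this approach: plain Enum members are always truthy, so code would need to compare against the members explicitly instead of writing `if is_avail:`.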

djhoese · 2019-05-01T01:52:53Z

@sfinkens So I've changed the default behavior to do as discussed: return the best available dataset before falling back to the best known dataset. From my basic test this looks to have fixed a lot of the issues with the MODIS L1B reader that @TAlonglong was running into.

I need to add tests and more documentation but I think this may work. One thing I realized I was worried about @sfinkens is what the Scene.all/available_composite_ids methods would return. Looking at the code it may actually work as expected because of how I had it coded originally. Turns out past Dave was smart this time.

sfinkens · 2019-05-02T08:08:37Z

@djhoese Thanks for the explanation. Looks good to me! And I agree to wait for Enums in Python 3. We'll see, maybe the current approach is already clear enough and there will be no need to change it.

djhoese · 2019-05-02T20:19:43Z

I think this is ready to go. @sfinkens I did what you suggested and added a couple of examples to the available_datasets docstring. It is nice and long now.

mraspaud · 2019-05-02T20:20:34Z

Let me have a look :)

coveralls · 2019-05-02T20:35:32Z

Coverage increased (+0.008%) to 80.715% when pulling e256d2c on djhoese:feature-available-datasets-update into 9f7dced on pytroll:master.

codecov · 2019-05-02T20:36:24Z

Codecov Report

Merging #739 into master will increase coverage by <.01%.
The diff coverage is 93.69%.

@@            Coverage Diff             @@
##           master     #739      +/-   ##
==========================================
+ Coverage   80.71%   80.71%   +<.01%     
==========================================
  Files         149      149              
  Lines       21677    21836     +159     
==========================================
+ Hits        17496    17626     +130     
- Misses       4181     4210      +29

| Impacted Files | Coverage Δ |
| --- | --- |
| `satpy/writers/__init__.py` | 85.21% <ø> (ø) |
| `satpy/readers/clavrx.py` | 92.61% <100%> (+0.84%) |
| `satpy/readers/file_handlers.py` | 93.22% <100%> (+1.38%) |
| `satpy/tests/test_scene.py` | 99.44% <100%> (+0.01%) |
| `satpy/readers/nucaps.py` | 94.11% <100%> (ø) |
| `satpy/tests/utils.py` | 97.29% <100%> (+0.32%) |
| `satpy/readers/viirs_sdr.py` | 85.46% <100%> (+0.43%) |
| `satpy/tests/reader_tests/test_viirs_sdr.py` | 92.33% <100%> (+0.02%) |
| `satpy/tests/reader_tests/test_clavrx.py` | 98.37% <100%> (+0.28%) |
| `satpy/readers/geocat.py` | 87.5% <63.63%> (-4.23%) |

... and 5 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

mraspaud

First quick review.

mraspaud · 2019-05-02T20:29:38Z