Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add auxiliary data download API #1513

Merged
merged 16 commits into from Feb 12, 2021
Merged

Conversation

djhoese
Copy link
Member

@djhoese djhoese commented Jan 26, 2021

The Problem

There are a couple cases in Satpy where additional files are needed to completely function. Examples include the StaticImageCompositor, "crefl" rayleigh correction which uses data files for elevation values, or the MiRS reader in #1511 that needs LUTs to properly perform some limb correction. It is a pain for users to have to download these files and set an environment variable (previously SATPY_ANCPATH) just to use something that is builtin to Satpy. This should be automatic.

The Solution

This pull request utilizes the pooch library to allow Satpy components to download the files that they need to operate. Pooch allows us to cache the files and check hashes. It has some versioning support but we aren't using it. It also has some typical use cases that we aren't using because of the dynamic nature of Satpy's components (ex. a user customizing things in YAML).

There are a lot of TODOS left for this PR, but the general idea is sketched out here and I'm looking for feedback.

Example 1 - Satpy Component retrieves data file

from satpy.data_download import register_file, retrieve

cache_key = register_file('http://example.com/some_file.txt', 'some_file.txt', component_type='composites', component_name='StaticImageCompositor', known_hash="sha256:abcdefghij")
# ... later on ...
local_filename = retrieve(cache_key)
# use local_filename...

Example 2 - User wants to download all files for offline use

from satpy.data_download import retrieve_all
retrieve_all()

Example 3 - User wants to customize data location

import satpy

scn = satpy.Scene(...)
with satpy.config.set(data_dir='/tmp'):
    scn.load(['_night_background'])

@djhoese djhoese added enhancement code enhancements, features, improvements component:readers component:compositors labels Jan 26, 2021
@djhoese djhoese self-assigned this Jan 26, 2021
@ghost
Copy link

ghost commented Jan 26, 2021

DeepCode failed to analyze this pull request

Something went wrong despite trying multiple times, sorry about that.
Please comment this pull request with "Retry DeepCode" to manually retry, or contact us so that a human can look into the issue.

@simonrp84
Copy link
Member

This seems like a really good idea, the MetPy library uses pooch and I've not had any problems with it there so hopefully it's equally straightfroward in satpy :)

@mraspaud
Copy link
Member

mraspaud commented Feb 2, 2021

Thinking of how our operational environment doesn't have access to internet by default, I was just wondering if there is a location that I can set the data in that will automatically search by satpy, outside of the home directory? So that I can install the static files there from eg an rpm package

@djhoese
Copy link
Member Author

djhoese commented Feb 2, 2021

@mraspaud Yes, that's the whole point. Just combine example 2 and 3 from above. Or set the environment variable SATPY_DATA_DIR. The retrieve_all would do all this downloading for you if you needed it.

@djhoese
Copy link
Member Author

djhoese commented Feb 3, 2021

@mraspaud To clarify, there is no location outside of the home directory that is searched by default for data. The data directory needs to be one directory (not a series of directories) and by default we assume the user doesn't have permission to write to a system directory (since Satpy will assume it has to write at some point).

Depending on how your operational processing works (docker containers?) there may be a way to soft link or mount a directory to the default location that Satpy will look.

@mraspaud
Copy link
Member

mraspaud commented Feb 3, 2021

Thanks for clarifying. I guess I'll use the SATPY_DATA_DIR then

@codecov
Copy link

codecov bot commented Feb 4, 2021

Codecov Report

Merging #1513 (567b27d) into master (f35c9d9) will increase coverage by 0.42%.
The diff coverage is 96.61%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1513      +/-   ##
==========================================
+ Coverage   91.65%   92.08%   +0.42%     
==========================================
  Files         248      251       +3     
  Lines       36167    36660     +493     
==========================================
+ Hits        33150    33759     +609     
+ Misses       3017     2901     -116     
Flag Coverage Δ
behaviourtests 4.45% <9.86%> (+0.09%) ⬆️
unittests 92.62% <96.61%> (+0.42%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
satpy/dataset/data_dict.py 90.09% <ø> (ø)
satpy/dataset/dataid.py 93.71% <ø> (ø)
satpy/readers/ahi_hsd.py 96.86% <ø> (ø)
satpy/readers/ahi_l1b_gridded_bin.py 99.05% <ø> (ø)
satpy/readers/ami_l1b.py 96.99% <ø> (ø)
satpy/readers/fci_l2_nc.py 92.80% <ø> (ø)
satpy/readers/goes_imager_nc.py 65.46% <0.00%> (ø)
satpy/readers/iasi_l2_so2_bufr.py 94.62% <ø> (ø)
satpy/readers/mimic_TPW2_nc.py 80.95% <ø> (ø)
satpy/readers/nwcsaf_nc.py 60.40% <0.00%> (ø)
... and 58 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f35c9d9...f7eb5fb. Read the comment docs.

@djhoese
Copy link
Member Author

djhoese commented Feb 5, 2021

@mraspaud and others, I'm seeking re-review. I have solutions for everything I wanted to do, but I'm not overjoyed with how they all have to work. The main one is data downloads in a reader. Do to this you have to:

  1. Create a custom Reader class (not a file handler, a reader):

    class MyReader(FileYAMLReader, DataDownloadMixin):
        def _init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.data_files = []
            self.register_data_files()
    
  2. Configure the files you want to download in the YAML:

    reader:
      name: "fake"
      reader: !!python/name:satpy.tests.test_data_download.FakeMixedReader
      data_files:
        - url: "https://raw.githubusercontent.com/pytroll/satpy/master/README.rst"
          known_hash: null
        - url: "https://raw.githubusercontent.com/pytroll/satpy/master/README.rst"
          filename: "README2.rst"
          known_hash: null
    
  3. Then when you need to use these files in your file handlers you have to do (this is the part I really don't like):

    from satpy.data_download import retrieve
    ...
    # inside your file handler
    local_filename = retrieve('readers/FakeMixedReader/README2.rst')
    # use the file
    

In summary, the file handler ends up having to know the class name of the Reader to be able to get at (download or cached) the file.

Edit: The only remaining thing is a dedicated documentation page on this in the dev guide and a section in the custom reader page.

@pnuu
Copy link
Member

pnuu commented Feb 5, 2021

Without looking at the code (late night, ...):

  • why complicate simple download by creating a reader and all that?
  • the compositors should know what they need (expected and, optionally, online file locations listed in the composite YAML). Just point them to the right (base) directory with SATPY_DATA_DIR, and if the data are not there, call download_stuff(online_source, target_dir=SATPY_DATA_DIR). The compositor should handle the reading and such
  • for offline use, have an utility function download_aux(list_of_composites) that finds the relevant configs from PPP_CONFIG_DIR/SATPY_CONFIG_DIR and saves them to SATPY_DATA_DIR

@djhoese
Copy link
Member Author

djhoese commented Feb 6, 2021

@pnuu here are a few points to clarify some of the design here and how things work in general and to address your points here and some you mentioned on slack.

  1. The "register" type function never download or communicate with a remote data source. They are only letting Satpy (and pooch) know what files might be downloaded. So files will never be downloaded until the .load step of the Scene (either in the reader's file handlers or the compositor's __call__).
  2. The "register" and "retrieve" steps are NEVER opening the file. The retrieve is the only thing that will ever download and when it does it writes it to disk and returns the local path. The Satpy component requesting the file is still in charge of how and when the file is opened.
  3. "simple download": Downloading and caching is not simple. This should not be something left up to each developer in Satpy. Each component in Satpy (readers, writers, and compositors...so far) should be downloading and caching things in the same way and in the same place. Pooch and the way I'm using it here not only downloads the files if they aren't present, it also checks the hashes of the files to make sure the download was successful, handles different hashing algorithms, and different remote resource protocols with interfaces for additional ones (non-HTTP or non-FTP).
  4. "why complicate simple download by creating a reader and all that?": I'm not sure what you mean by this. There is no new reader just for downloads. The main purpose of the "Mixin" class is to avoid throwing anymore logic into the base reader class. Additionally the same mixin class can be used by writers. And now that I'm thinking of it, I should make the StaticImageCompositor use it too. I am open to other ideas besides a Mixin class, but I wanted to try it this way. I could do it with a simple self.download_helper = DataDownloaderHelper(data_file_configs), but it didn't seem to fit perfectly.
  5. As for your last point about specifying composites: I'd be OK allowing the user to limit composites and readers and writers that are including in the "retrieve_all" utility function, but this won't work for all cases. Mainly it requires a user to know which composites they will want to load and possibly what ones actually need data (although I guess they could provide all the ones they are interested in). Second, loading a composite config (at least with the CompositeLoader right now) involves initializing all composites in each config that is parsed. Otherwise, this isn't much different than what retrieve_all is doing now. It finds all readers, writers, and composites.

@djhoese djhoese changed the title Add ancillary data download API Add auxiliary data download API Feb 10, 2021
@djhoese djhoese marked this pull request as ready for review February 10, 2021 03:15
@djhoese
Copy link
Member Author

djhoese commented Feb 10, 2021

This has now been updated based on the comments by @mraspaud on slack. It is now ready for re-review.

@djhoese
Copy link
Member Author

djhoese commented Feb 10, 2021

@mraspaud after our conversation at the status meeting, I think I've implemented the big important things:

  1. Console script for easier access including limiting the readers, writers, and composites that are used and specifying the data directory to write files to.
  2. Added a 'download_aux' configuration setting that will skip pooch and try to use a local file directly.

If you're OK with it I'd like to avoid the complications of the safe YAML loading and downloading files. This is a bit more complex and would probably require changes to a lot of lower-level functions (the ones that get the reader configs and writer configs and stuff).

Copy link
Member

@mraspaud mraspaud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, I just have some minor review items inline.

doc/source/dev_guide/aux_data.rst Show resolved Hide resolved
doc/source/dev_guide/aux_data.rst Outdated Show resolved Hide resolved
satpy/tests/test_data_download.py Outdated Show resolved Hide resolved
satpy/tests/test_data_download.py Outdated Show resolved Hide resolved
satpy/data_download.py Outdated Show resolved Hide resolved
satpy/data_download.py Outdated Show resolved Hide resolved
@djhoese
Copy link
Member Author

djhoese commented Feb 12, 2021

Ok I think I've addressed all of your suggestions/concerns.

@djhoese
Copy link
Member Author

djhoese commented Feb 12, 2021

Actually, I didn't rename the module. Maybe I will. One sec...

Copy link
Member

@mraspaud mraspaud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@djhoese djhoese merged commit edcceb1 into pytroll:master Feb 12, 2021
@djhoese djhoese deleted the feature-data-download branch February 12, 2021 16:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component:compositors component:readers enhancement code enhancements, features, improvements
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature request: download tif's if needed in a composite
4 participants