Handling Remote Data Requirements #1939

Closed
Cadair opened this issue Nov 4, 2016 · 13 comments · Fixed by #3124
Labels
Effort High (Requires a large time investment) · Feature Request (New feature wanted!) · Package Intermediate (Requires some knowledge of the internal structure of SunPy) · Priority High (Rapid action required)

Comments

@Cadair
Member

Cadair commented Nov 4, 2016

This is an issue we have been kicking down the road for a while, but #1897 is pushing us to fix it.

Some functions in SunPy are going to need data files to work, be this instrument-calibration-type data such as AIA response functions, or other remote data requirements such as tutorial data (see #1809), and no doubt other use cases I can't think of.

The requirements of this to my mind are:

  1. Data is downloaded and cached to $HOME/sunpy/data/... when first needed.
  2. There is some validation that the data has been transferred correctly (SHA hashes, etc.).
  3. There is a mechanism by which we can allow users to re-download data.
  4. The download code supports multiple mirrors.

The second requirement here poses an interesting question. If the data on the remote server has changed (deliberately, due to a calibration change, etc.) and we store a SHA hash in the code, our downloads will error, and the only way to fix this is to do a new SunPy release with a bug fix. The alternative is to not store hashes, assume the data is what we anticipated, and roll with it. To my mind, I like pinning the version of the data to the code: it means that, in theory, as long as the remote data is available, one version of SunPy will always give the same answers (same code, same data). I prefer the error-hard-and-early approach here, but I do think we should provide a mechanism to override this behaviour (i.e. skip download verification if the user knows what they are doing).

@dpshelio and I had a conversation about this, and for many cases the ideal solution to this problem is working closely with the data providers. For the moment, however, I am assuming that we are working with random data on the internet, where there is no way to persuade the provider to version their data properly by putting it on something like Zenodo or Figshare.

Proposal

I suggest we add a data manager, which maintains a record of the cache, and can provide various features. I propose the following user code:

Define a function that needs some data:

@remote_data_manager.require(name='stuartsfile',
                             urls=('https://server1/file1.fits', 'http://server2/file1.fits'),
                             shasum='1245645343545343')
def myfunction():
    # The manager downloads (and verifies) the file the first time it is
    # needed, then returns the local path from the cache.
    filename = remote_data_manager.get('stuartsfile')

This adds the function name and its files to the cache record. When the code is run, the downloader goes out and gets the file, verifies that it matches the provided hash, puts it in a folder (which probably has the function name in it), and then gives it to the function on request.

Skip hashsum check:

with remote_data_manager.skip_hash_check():
    myfunction()

Download a different file (if the user knows there is a newer version):

with remote_data_manager.replace_file(name='stuartsfile',
                                      shasum='1245645343545343',
                                      url='http://myserver/file1.fits'):
    myfunction()

The functionality of the remote_data_manager will be reasonably complex: it needs to maintain a cache on disk somewhere (probably in a JSON file or similar) and do the verification checking, etc., but I don't want to describe all of that here; it's implementation details.
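
For concreteness, though, here is a minimal sketch of one way the manager could work, assuming a JSON index file under the cache directory and SHA-256 hashes. The class name, index layout, and method bodies are illustrative only, not a committed design:

import contextlib
import hashlib
import json
import urllib.request
from pathlib import Path


class RemoteDataManager:
    """Keep a record of required remote files and a local cache of them (sketch only)."""

    def __init__(self, cache_dir):
        self.cache_dir = Path(cache_dir)
        self.index_file = self.cache_dir / "index.json"
        self._records = {}            # name -> {"urls": [...], "shasum": "..."}
        self._skip_hash_check = False

    def require(self, name, urls, shasum):
        """Register the file a function needs; the download happens lazily in get()."""
        def decorator(func):
            self._records[name] = {"urls": list(urls), "shasum": shasum}
            return func
        return decorator

    @contextlib.contextmanager
    def skip_hash_check(self):
        """Temporarily disable hash verification, as in the examples above."""
        self._skip_hash_check = True
        try:
            yield
        finally:
            self._skip_hash_check = False

    def get(self, name):
        """Return the local path for `name`, downloading and verifying it first if needed."""
        record = self._records[name]
        target = self.cache_dir / name / record["urls"][0].rsplit("/", 1)[-1]
        if not target.exists():
            target.parent.mkdir(parents=True, exist_ok=True)
            for url in record["urls"]:                    # try mirrors in order
                try:
                    urllib.request.urlretrieve(url, target)
                    break
                except OSError:
                    continue
            else:
                raise RuntimeError(f"Could not download {name} from any mirror")
            if not self._skip_hash_check:
                digest = hashlib.sha256(target.read_bytes()).hexdigest()
                if digest != record["shasum"]:
                    target.unlink()
                    raise RuntimeError(f"Hash mismatch for {name}")
            # Persist the cache record to disk so later sessions can reuse it.
            self.index_file.write_text(json.dumps(self._records, indent=2))
        return str(target)


remote_data_manager = RemoteDataManager(Path.home() / "sunpy" / "data")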

ping @sunpy/sunpy-developers @wtbarnes

@Cadair Cadair added Feature Request New feature wanted! Priority High Rapid action required Package Intermediate Requires some knowledge of the internal structure of SunPy labels Nov 4, 2016

@nabobalis nabobalis added the Effort High Requires a large time investment label Jan 22, 2019

@kakirastern

As per @Cadair's request on riot.im, please see some IRISpy use cases at sunpy/sunraster#108 (comment).

@DanRyanIrish
Member

Functionality like this will also be needed for sunxspex, a Python package for X-ray spectroscopy I'm developing.

  • The use case is that there are online IDL save and .geny files containing data derived from CHIANTI that need to be downloaded.
  • My understanding is that different versions of the files are given different filenames, so in this case hash conflicts are perhaps less of an issue... in theory anyway. But it may still be good to have the option to employ hashes.
  • The files are rarely changed; however, users may want to derive their own files with their own CHIANTI parameters. There are also older versions that users may want to use to compare with older studies. So user override is key.
  • The file is currently read by a non-user-facing function, although this design may change. Nonetheless, will this API be usable by a user-facing function that simply passes the filename on to other functions to be read? (A sketch of that is below this list.)
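
A minimal sketch of how that last point could look, assuming the decorator API proposed at the top of this issue. The names load_chianti_lines and _read_geny, and the URL and hash, are made up for illustration:

# Hypothetical user-facing wrapper: accept an explicit filename from the user,
# otherwise fall back to the file registered with the data manager, and simply
# pass the path on to the existing (non-user-facing) reader.
@remote_data_manager.require(name='chianti_lines',
                             urls=('https://example.org/chianti_lines.geny',),
                             shasum='abc123...')
def load_chianti_lines(filename=None):
    if filename is None:
        filename = remote_data_manager.get('chianti_lines')
    return _read_geny(filename)   # existing low-level reader, unchanged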

@Cadair
Member Author

Cadair commented May 17, 2019

however users may want to derive their own files with their own CHIANTI parameters

Does this mean they might want to override the defaults with a local file? I don't think I envisaged that in the API designs at the top of the issue.
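
If that is the requirement, one way it could be covered is a context manager analogous to replace_file that points a named requirement at a user-supplied local path instead of a URL. override_file here is a hypothetical addition, not part of the API sketched above:

# Hypothetical extension: use a local file instead of the registered remote one,
# bypassing the download (and, presumably, hash verification).
with remote_data_manager.override_file(name='chianti_lines',
                                       path='/home/user/my_chianti_lines.geny'):
    load_chianti_lines()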

@Cadair
Member Author

Cadair commented May 17, 2019

Also, I had a chat with @dstansby today about using this to get SPICE kernels for HelioPy, etc.

@Cadair
Member Author

Cadair commented May 28, 2019

From #3144 we could use this machinery to prevent multiple file downloads in map.Map when used with URLs.

Also from #3144: could we add some kind of expiration date on data in the cache? And could we have an API to expire data from the cache manually?
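
As a hedged sketch only: expiry could be declared when the file is registered and also triggered manually. expire_after and expire are hypothetical names, not part of the proposal above (and the URL and hash are made up):

from datetime import timedelta

# Hypothetical: declare a lifetime when registering the file...
@remote_data_manager.require(name='goes_lookup',
                             urls=('https://example.org/goes_lookup.csv',),
                             shasum='deadbeef...',
                             expire_after=timedelta(days=7))
def my_goes_function():
    filename = remote_data_manager.get('goes_lookup')

# ...and allow the cache entry to be expired manually as well.
remote_data_manager.expire('goes_lookup')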

@Cadair
Member Author

Cadair commented May 29, 2019

We also have goes in SunPy: https://github.com/sunpy/sunpy/blob/master/sunpy/instr/goes.py#L85

@Cadair
Member Author

Cadair commented May 29, 2019

and apparently lyra as well:

check_download_file(dbname, LYTAF_REMOTE_PATH, lytaf_path)
