GSoC 2019 vnki
- Name: Vishnunarayan K I
- Organisation: SunPy
- Time zone: UTC+05:30
- GitHub handle: vn-ki
- University: Indian Institute of Technology, Indore
- Major: Computer Science
- Current Year: 2nd year
- Programming Languages: Python, Go, C++, C, JavaScript/TypeScript, Dart (Flutter), Java (basic knowledge)
- Contributions to SunPy:
- GSoC 2018 student with SunPy.
- I have 24 merged PRs on the SunPy repo.
- Contributions to other repos:
- Co-maintainer of mps-youtube, pafy, and spotify-downloader, all Python projects.
My personal Python projects are:
- anime-downloader (Python)
- codechef-cli (Python)
- YoutubePlayer (Python)
I have enough experience with Python development to produce high-quality, testable code, so I believe I am a good fit for this project.
The requirements mentioned in the issue are:
- Data is downloaded and cached to `$HOME/sunpy/data/...` when first needed.
- There is some validation that the data has been transferred correctly (SHA hashes etc.).
- There is a mechanism by which we can allow users to re-download data.
- The download code supports multiple mirrors.
`remote_data_manager` will be a singleton. This makes sense because it is not meant to be customized after it is initialized. `remote_data_manager` will handle these requirements in the methods described below:
- Data download
`remote_data_manager` will be downloader-agnostic. parfive is a good candidate for a downloader; `astropy.utils.data.download_file` is also a perfectly fine candidate. In the proposed design, the downloader is provided using the constructor-injection pattern, which keeps the manager completely downloader-agnostic.
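A minimal sketch of what constructor injection could look like here. The class and method names are illustrative assumptions, not the actual SunPy API; any callable with the right shape can stand in for the downloader.

```python
import urllib.request
from typing import Callable


class RemoteDataManager:
    """Hypothetical sketch: the downloader is injected through the
    constructor, so the manager never depends on a concrete library."""

    def __init__(self, downloader: Callable[[str, str], None]):
        # downloader(url, path) fetches `url` and writes it to `path`.
        self._downloader = downloader

    def download(self, url: str, path: str):
        self._downloader(url, path)


# Any compatible callable can be plugged in, e.g. plain urllib:
manager = RemoteDataManager(
    lambda url, path: urllib.request.urlretrieve(url, path)
)
```

Swapping parfive for `astropy.utils.data.download_file` would then only require passing a different callable; the manager itself never changes.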
- Validation
Validation should use a hash calculated from the file. SHA-1 is a good candidate because it is fast.
A decorator called `require` is proposed to notify the data manager that a certain function requires a file.

```python
def require(
    name: str,
    urls: List[str],
    shasum: str,
)
```

This function can take `**kwargs`, which can be used to specify arguments specific to the downloader.
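Usage of the decorator might look like the following. The `require` body here is a hypothetical stand-in (it only records the requirement); the real implementation would register the file with the manager and ensure it is downloaded and verified before the function runs. The file name, URLs, and hash are made up for illustration.

```python
from typing import List


def require(name: str, urls: List[str], shasum: str, **kwargs):
    """Hypothetical stand-in for the proposed decorator."""
    def decorator(func):
        # The real manager would register this requirement and make
        # sure the file is downloaded and verified before `func` runs.
        func._required_files = {"name": name, "urls": urls, "shasum": shasum}
        return func
    return decorator


@require(
    name="test_file",
    urls=["http://example.com/test_file.dat",
          "http://mirror.example.com/test_file.dat"],
    shasum="da39a3ee5e6b4b0d3255bfef95601890afd80709",
)
def my_analysis():
    ...
```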
The flow of `require` would be:

1. The function checks if the file exists locally and the hash matches.
2. If it exists, proceed normally.
3. If the file does not exist, it is downloaded from the URL.
4. The hash of the downloaded file is computed and checked against the provided hash.
   - If the hash matches, proceed normally.
   - If it does not match, retry from step 3 with the next mirror URL.
   - If there are no mirror URLs left, raise an `Exception`.

If the file has been updated on the server and the hash does not match the recorded hash, an `Exception` is thrown.
- Redownload
There will be two functions to re-download the data.
  - `skip_hash_check`: as the name suggests, this skips the hash check and downloads the latest file. It can be a context manager, so the usage will look like:

```python
with remote_data_manager.skip_hash_check():
    myfunction()
```

  - `replace_file(name, url, shasum=None)`: `replace_file` will be a context manager that can be used to replace a file used by a function with a newer file. `shasum` is an optional parameter; if it is not provided, the latest file on the server is used. This differs from `skip_hash_check` because, when a function uses more than one file, the user can selectively replace files and provide new mirror URLs.
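The two context managers could be sketched like this; the class, attribute names, and bookkeeping are assumptions for illustration, showing only how the temporary state would be set and restored:

```python
from contextlib import contextmanager


class RemoteDataManager:
    """Hypothetical sketch of the two re-download context managers."""

    def __init__(self):
        self._skip_hash_check = False
        self._replacements = {}  # name -> (url, shasum)

    @contextmanager
    def skip_hash_check(self):
        # While this block is active, the manager downloads the
        # latest file and skips hash verification.
        self._skip_hash_check = True
        try:
            yield
        finally:
            self._skip_hash_check = False

    @contextmanager
    def replace_file(self, name, url, shasum=None):
        # Selectively replace one named file; shasum=None means
        # "use the latest file on the server".
        self._replacements[name] = (url, shasum)
        try:
            yield
        finally:
            del self._replacements[name]
```

Using `try`/`finally` guarantees the override is cleared even if the wrapped function raises, so the singleton never stays in a modified state.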
The cache metadata will be stored using SQLite. This is inspired by the implementation of requests-cache. requests-cache supports a number of backends for cache storage (SQLite, Redis, MongoDB, DynamoDB, etc.). Even though the remote data manager does not need multiple backends, following a similar design will give us more extensibility in the future and a backend that is easier to mock for testing purposes.
The content of the metadata will be:
- hash: hash of the file (primary key).
- filepath: path to the file (ideally not the absolute path, so that the sunpy data directory can be changed without conflicts).
- last_modified: modified datetime on the server (not of the downloaded file). This can be useful when using `skip_hash_check`: the modified date of the file on the server can be compared with the `last_modified` of the hashed file. This is easily achieved by reading `Last-Modified` from the response headers, which is possible without downloading the entire file.
- functions: the names of the functions using the file.

The file can be stored under a directory whose name depends on the function (perhaps the function name).
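The metadata fields above could map onto an SQLite schema like the following; the table name, column types, and sample row are assumptions for illustration:

```python
import sqlite3

# Hypothetical schema for the cache metadata described above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE cache (
        hash TEXT PRIMARY KEY,   -- hash of the file
        filepath TEXT NOT NULL,  -- path relative to the sunpy data dir
        last_modified TEXT,      -- Last-Modified reported by the server
        functions TEXT           -- names of functions using the file
    )
""")
conn.execute(
    "INSERT INTO cache VALUES (?, ?, ?, ?)",
    ("da39a3ee5e6b4b0d3255bfef95601890afd80709",
     "func_name/test_file.dat",
     "Wed, 21 Oct 2015 07:28:00 GMT",
     "func_name"),
)
row = conn.execute(
    "SELECT filepath FROM cache WHERE hash = ?",
    ("da39a3ee5e6b4b0d3255bfef95601890afd80709",),
).fetchone()
```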
One thing to investigate here is whether the cache metadata can be handled without sqlite at all. If we are OK with re-downloading a file when multiple functions use the same file, we could use a JSON file per function instead of sqlite.
parfive is an async parallel downloader built for SunPy, primarily used for downloading SunPy's sample data. parfive is a good candidate downloader for the remote data manager. Currently, parfive is only performant when downloading multiple files in parallel. The additional task is to implement download-by-parts for parfive, so that downloads can also be accelerated when downloading a single file. Downloading a file in parts and recombining them afterwards is how most download accelerators work. I already have experience doing this in a personal project.
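The split-and-recombine idea can be sketched independently of any HTTP library. Here `fetch_range` stands in for an HTTP Range request (`bytes=start-end`); the function name and interface are assumptions, not parfive's API:

```python
from concurrent.futures import ThreadPoolExecutor


def download_by_parts(fetch_range, size, parts=4):
    """Sketch of download acceleration: split [0, size) into ranges,
    fetch them in parallel, and recombine the pieces in order.
    Assumes the server supports Range requests and reports the size."""
    chunk = size // parts
    ranges = []
    for i in range(parts):
        start = i * chunk
        end = size if i == parts - 1 else start + chunk
        ranges.append((start, end))
    with ThreadPoolExecutor(max_workers=parts) as pool:
        pieces = pool.map(lambda r: fetch_range(*r), ranges)
    return b"".join(pieces)


# Example with an in-memory "server" standing in for HTTP:
data = bytes(range(256)) * 4
assert download_by_parts(lambda s, e: data[s:e], len(data)) == data
```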
My vacation starts on May 3rd, so I would like to code at least a small part of the project during this period.
I am familiar with most of the libraries that will be used and have a good understanding of the Python features involved, such as context managers and decorators. I also have experience with pytest and general testing practices like mocking, from my past GSoC and from personal projects.
The first task here would be to finalize the API design. One of the important decisions is the method of storing the cache. The next task is to start coding the skeleton of the remote data manager.
I like to write tests alongside the code I write, so some tests will be completed with the tasks themselves.
- Implement the caching mechanism.
- Implement the functions of remote data manager mentioned above.
- Integrate the data manager with the cache.
- API review. Investigate additions to the API and feasibility of the design.
- Document and test the tasks until now.
- Implement parfive download by parts complete with tests and documentation.
- Finish up any remaining testing and documentation.
- Example for the gallery.
- Complete written documentation
- Add necessary developer documentation
- Work on possible improvements to parfive
Yes, I participated in 2018 under SunPy.
Yes, I am also applying for Prometheus under CNCF.
I don't have any other internship or work during this summer. I have no plans for any vacations either.
Yes, I am eligible to receive payments from Google.