Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Custom download method #2960

Open
AmitMY opened this issue Jan 19, 2021 · 4 comments
Open

Custom download method #2960

AmitMY opened this issue Jan 19, 2021 · 4 comments
Labels

Comments

@AmitMY
Copy link

AmitMY commented Jan 19, 2021

Is your feature request related to a problem? Please describe.
I have a dataset that requires a bit more complicated download method than usual (for example, add some headers)

Describe the solution you'd like
I would like to have a method: dl_manager.download_custom that is given:

  1. a URL or list of URLs
  2. a custom download method that receives:
    a. a single URL
    b. local file destination path

So I could implement custom downloads.

Full code I want to write:

def my_custom_download(url: str, local_path: str):
  opener = urllib.request.build_opener()
  opener.addheaders = {...my headers...}
  urllib.request.install_opener(opener)
  urllib.request.urlretrieve(url, local_path)

URLs = ['url1', 'url2', 'url3']
dl_manager.download_custom(URLs, my_custom_download)

Describe alternatives you've considered
Doing my download without the download manager, but then I'll just hack around where to save the files. the dl_manager seems like the correct place to do this.

Additional context
This method exists in huggingface/datasets, and I think is well motivated.
This is not just for headers, but also for other download methods (for example, download over scp)

@AmitMY AmitMY added the enhancement New feature or request label Jan 19, 2021
@Conchylicultor
Copy link
Member

Conchylicultor commented Jan 19, 2021

Thank you for the suggestion. I think it would be best to add directly those features inside the DownloadManager.download method. It would ensure downloads are correctly tracked by the rest of the system (e.g. download size is recorded and exposed in our documentation, checksums ensure determinism,...).

How about adding parameters to the Resource object directly ?

# Arguments forwarded to `requests.Request(**request_kwargs)`
request_kwargs = {
    'headers': {...},
}

URLs = ['url1', 'url2', 'url3']
dl_manager.download([tfds.download.Resource(url, request_kwargs=request_kwargs) for url in URLs])

Another advantage is that downloading both urls with custom headers and default headers could be done in a single function call, so TFDS will automatically parallelise those downloads (while 2 function calls would be sequential)

This method exists in huggingface/datasets, and I think is well motivated.

Looking at their source code, only a single dataset pg19 seem to use this feature and the usage could easily be avoided.

So I would be in favor of expending the current DownloadManager to support custom headers, authentification,... and extend the existing method as much as possible.
Don't hesitate if someone want to send a PR

@AmitMY
Copy link
Author

AmitMY commented Jan 19, 2021

Thanks @Conchylicultor
I think it is still important to have a custom download method,
One extra example I have that I don't see how to adapt would be downloading a youtube video

I'm adding datasets for sign language which include videos. Some datasets reference youtube links directly, like Microsoft's MSASL dataset. My custom download function would be:

def download_youtube(url, dst_path):
  import youtube_dl  # Required for YouTube downloads

  ydl_opts = {"format": "bestvideo", "outtmpl": dst_path}
  with youtube_dl.YoutubeDL(ydl_opts) as ydl:
      ydl.download([url])

I don't mind if the solution would be something like:

dl_manager.download([tfds.download.Resource(url, download_function=download_youtube) for url in URLs])

I just can't think of a better solution for this kind of use

(I will eventually add these datasets directly in tfds, I'm first getting them organized and defining some shared characteristics before making PRs here)

@AmitMY
Copy link
Author

AmitMY commented Feb 12, 2021

Hi @Conchylicultor, can you please verify the following steps are correct to have a solution:

  1. Add an attribute to the Resource class called download_fn=None
  2. Send it down the download_manager to the download method
  3. In the sync_download method, add an if download_fn is not None branch, and use that download function to download

@martenlienen
Copy link

@AmitMY These three steps would work for my use case 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants