Custom download method #2960
Comments
Thank you for the suggestion. I think it would be best to add those features directly inside the [...]. How about adding parameters to the Resource object directly?

    # Arguments forwarded to `requests.Request(**request_kwargs)`
    request_kwargs = {
        'headers': {...},
    }
    URLs = ['url1', 'url2', 'url3']
    dl_manager.download([tfds.download.Resource(url, request_kwargs=request_kwargs) for url in URLs])

Another advantage is that URLs with custom headers and URLs with default headers could be downloaded in a single function call, so TFDS would automatically parallelise those downloads (whereas two function calls would run sequentially).
Looking at their source code, only a single dataset, pg19, seems to use this feature, and the usage could easily be avoided. So I would be in favor of extending the current [...]
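The parallelism argument above can be illustrated with a minimal, standard-library sketch. Everything in it is an assumption made for illustration: `SketchResource` is a hypothetical stand-in for `tfds.download.Resource` carrying the proposed `request_kwargs` field, and `download_all` and the injected `fetch` callable are not TFDS APIs.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class SketchResource:
    """Hypothetical stand-in for a resource with per-URL request arguments."""
    url: str
    # Proposed field: extra keyword arguments applied to this URL's request.
    request_kwargs: Dict[str, Any] = field(default_factory=dict)


def download_all(
    resources: List[SketchResource],
    fetch: Callable[..., Any],
    max_workers: int = 4,
) -> List[Any]:
    """Fetch all resources in a single call so they can run concurrently.

    `fetch(url, **request_kwargs)` is an injected callable (an assumption of
    this sketch). Results are returned in the same order as `resources`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(
            pool.map(lambda r: fetch(r.url, **r.request_kwargs), resources)
        )
```

Because each resource carries its own arguments, URLs with custom headers and URLs with default settings can share one call and one thread pool, which is the behaviour the comment argues for.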
Thanks @Conchylicultor. I'm adding datasets for sign language, which include videos. Some datasets reference YouTube links directly, like Microsoft's MSASL dataset. My custom download function would be:

    def download_youtube(url, dst_path):
        import youtube_dl  # Required for YouTube downloads
        ydl_opts = {"format": "bestvideo", "outtmpl": dst_path}
        with youtube_dl.YoutubeDL(ydl_opts) as ydl:
            ydl.download([url])

I don't mind if the solution would be something like:

    dl_manager.download([tfds.download.Resource(url, download_function=download_youtube) for url in URLs])

I just can't think of a better solution for this kind of use. (I will eventually add these datasets directly in tfds; I'm first getting them organized and defining some shared characteristics before making PRs here.)
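One way a per-resource `download_function` could be dispatched is sketched below. This is not how TFDS implements downloads: `CustomResource`, `fetch_resource`, and the hash-based file naming are all hypothetical choices made for this example.

```python
import hashlib
import os
from dataclasses import dataclass
from typing import Callable, Optional

# A download function receives the source URL and the destination path.
DownloadFn = Callable[[str, str], None]


@dataclass(frozen=True)
class CustomResource:
    """Hypothetical resource that may carry its own download function."""
    url: str
    download_function: Optional[DownloadFn] = None


def fetch_resource(
    resource: CustomResource,
    download_dir: str,
    default_download: DownloadFn,
) -> str:
    """Download `resource` into `download_dir` and return the local path.

    The manager, not the caller, picks the destination path (here a hash of
    the URL, an assumption of this sketch) and skips files already present.
    """
    os.makedirs(download_dir, exist_ok=True)
    name = hashlib.sha256(resource.url.encode("utf-8")).hexdigest()
    dst_path = os.path.join(download_dir, name)
    if not os.path.exists(dst_path):
        # Use the resource's own function if given, else the default one.
        downloader = resource.download_function or default_download
        downloader(resource.url, dst_path)
    return dst_path
```

With this shape, a function such as `download_youtube` above only has to write to the path it is handed, while caching and file placement stay inside the manager.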
Hi @Conchylicultor, can you please verify the following steps are correct to have a solution:
@AmitMY These three steps would work for my use case 👍
Is your feature request related to a problem? Please describe.
I have a dataset that requires a somewhat more complicated download method than usual (for example, adding some headers).
Describe the solution you'd like
I would like to have a method, dl_manager.download_custom, that is given:
a. a single URL
b. a local file destination path
So I could implement custom downloads.
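As an illustration of the kind of function such a method could accept, here is a minimal sketch of a header-aware downloader built only on the standard library. The names `download_with_headers` and `make_header_downloader` are hypothetical, and the example assumes the callback signature is `(url, dst_path)` as described in the list above.

```python
import shutil
import urllib.request
from functools import partial
from typing import Callable, Dict, Optional


def download_with_headers(
    url: str,
    dst_path: str,
    headers: Optional[Dict[str, str]] = None,
) -> None:
    """Stream `url` to `dst_path`, sending the given request headers."""
    request = urllib.request.Request(url, headers=headers or {})
    with urllib.request.urlopen(request) as response, open(dst_path, "wb") as out:
        # Copy in chunks so large files are not held in memory.
        shutil.copyfileobj(response, out)


def make_header_downloader(headers: Dict[str, str]) -> Callable[[str, str], None]:
    """Bind headers so the result takes only (url, dst_path)."""
    return partial(download_with_headers, headers=headers)
```

Binding the headers up front keeps the callback to the two arguments listed above, so the same hook could also wrap other transports (for example scp) without changing its signature.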
Full code I want to write:
Describe alternatives you've considered
Doing my download without the download manager, but then I would just be hacking around where to save the files. The dl_manager seems like the correct place to do this.
Additional context
This method exists in huggingface/datasets, and I think it is well motivated. This is not just for headers, but also for other download methods (for example, download over scp).