DownloadAndPrepareHuggingFaceDatasetJob #253
How do you plan to use the datasets later on in the pipeline (in RETURNN)? There is also
Also, maybe, to make sure it does not modify anything, we could make everything read-only at the end of this job?
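The read-only idea could be sketched roughly as follows. This is only an illustration using the Python standard library, not code from this PR; the helper name `make_read_only` is hypothetical, and it assumes a POSIX-style permission model.

```python
import os
import stat


def make_read_only(root):
    """Clear the write permission bits on every file and directory under root.

    Hypothetical helper for illustration; not part of the job in this PR.
    """
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    # Walk bottom-up so children are handled before their parent directory.
    for dirpath, _dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                # Skip symlinks: chmod would follow them to the link target.
                continue
            os.chmod(path, os.stat(path).st_mode & ~write_bits)
        os.chmod(dirpath, os.stat(dirpath).st_mode & ~write_bits)
```

Calling this on the job's output directory as the last step would make any later attempt to write into it fail with a permission error for non-root users.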
Hm, I'm not really familiar with Hugging Face datasets. Are you? Is this better?
The issue with the cache dir is that I'm not sure we can guarantee that nothing is re-downloaded, or that later operations on the dataset also use this cache directory. Using I'm using hf datasets in some of my setups. For RETURNN I have a custom dataset implementation that wraps the library. There I directly call
Ok, maybe this is a better idea using As cache dir, we just use
Yes, this way we avoid anything being reused, and the temporary files should get deleted.
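A minimal sketch of the approach discussed here, under stated assumptions: the dataset is loaded with a temporary directory as cache and then written out with `save_to_disk`, so nothing from a shared cache is reused and the temporary files are removed afterwards. The function name `download_and_prepare` and the `load_fn` parameter are hypothetical and exist only to keep the example testable without network access; whether the job in this PR is implemented exactly this way is not confirmed by the conversation.

```python
import tempfile


def download_and_prepare(out_dir, path, name=None, load_fn=None):
    """Load a dataset using a throwaway cache dir and save it to out_dir.

    Hypothetical sketch; `load_fn` defaults to `datasets.load_dataset`.
    """
    if load_fn is None:
        # Imported lazily so the sketch can be exercised without the package.
        from datasets import load_dataset

        load_fn = load_dataset
    # The cache directory only lives for the duration of this block.
    with tempfile.TemporaryDirectory() as cache_dir:
        ds = load_fn(path, name, cache_dir=cache_dir)
        # Write a standalone copy of the prepared dataset to the output dir.
        ds.save_to_disk(out_dir)
```

A later pipeline step could then read the result with `datasets.load_from_disk(out_dir)` instead of calling `load_dataset` again.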
I would propose to use the Sisyphus parameter
I adapted it with
Co-authored-by: David Thulke <thulke@hltpr.rwth-aachen.de>
Hugging Face datasets
https://huggingface.co/docs/datasets/
https://huggingface.co/datasets