
DownloadAndPrepareHuggingFaceDatasetJob #253

Merged: albertz merged 7 commits into main from albert-huggingface-dataset on Apr 19, 2022
Conversation

@albertz (Member) commented Apr 19, 2022

@albertz requested a review from dthulke on April 19, 2022 10:41
@dthulke (Member) commented Apr 19, 2022

How do you plan to use the datasets later on in the pipeline (in RETURNN)? There is also Dataset.save_to_disk (https://huggingface.co/docs/datasets/v2.1.0/en/package_reference/main_classes#datasets.Dataset.save_to_disk), which might be more suitable for using the dataset later on than relying on the cache directory.

@albertz (Member, Author) commented Apr 19, 2022

> How do you plan to use the datasets later on in the pipeline (in RETURNN)?

Also via datasets.load_dataset, using the same cache dir, and further maybe disallowing redownload (I haven't checked in detail yet whether there is a good option for this).

Maybe, to make sure it does not modify anything, we could also make everything read-only at the end of this job?
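For reference, load_dataset accepts a download_mode argument that controls re-download behaviour. A minimal sketch, assuming datasets v2.x; the dataset name, cache path, and the chmod walk at the end are illustrative, not part of this PR:

```python
import os

from datasets import load_dataset, DownloadMode

# Reuse the prepared dataset if it is already in the cache dir;
# this is the default mode, spelled out here for clarity.
ds = load_dataset(
    "glue",
    "mrpc",
    cache_dir="/path/to/cache",
    download_mode=DownloadMode.REUSE_DATASET_IF_EXISTS,
)

# Make the cache read-only afterwards so later jobs cannot modify it.
for root, dirs, files in os.walk("/path/to/cache"):
    for d in dirs:
        os.chmod(os.path.join(root, d), 0o555)
    for f in files:
        os.chmod(os.path.join(root, f), 0o444)
```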

> There is also Dataset.save_to_disk (https://huggingface.co/docs/datasets/v2.1.0/en/package_reference/main_classes#datasets.Dataset.save_to_disk), which might be more suitable for using the dataset later on than relying on the cache directory.

Hm, I'm not really familiar with the HuggingFace datasets. Are you? Is this better?

@dthulke (Member) commented Apr 19, 2022

The issue with the cache dir is that I'm not sure we can guarantee that there is no redownload, nor that other operations on the dataset later on will use this cache directory.

Using save_to_disk and later load_from_disk properly serialises the dataset and thus should avoid these issues.
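A minimal round trip with these two calls might look like this (dataset name and paths are placeholders):

```python
from datasets import load_dataset, load_from_disk

# Download and prepare once, then serialise to a self-contained directory.
ds = load_dataset("glue", "mrpc", cache_dir="/tmp/hf_cache")
ds.save_to_disk("/path/to/prepared_dataset")

# Later consumers reload directly from that directory and do not
# depend on the original cache dir at all.
ds = load_from_disk("/path/to/prepared_dataset")
```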

I'm using HF datasets in some of my setups. For RETURNN, I have a custom dataset implementation that wraps the library. There I directly call load_dataset in initialize and use the global cache directory (so the dataset is downloaded and prepared when starting the training). But I have been planning for some time already to move this out into separate Sisyphus jobs and remove the dependency on the global cache folder.

@albertz (Member, Author) commented Apr 19, 2022

Ok, maybe using save_to_disk is the better idea.

As cache dir, we just use /var/tmp or so?

@dthulke (Member) commented Apr 19, 2022

> As cache dir, we just use /var/tmp or so?

Yes, this way we avoid anything being reused, and the temporary files should get deleted.

@michelwi (Contributor) commented Apr 19, 2022

> As cache dir, we just use /var/tmp or so?

I would propose to use the sisyphus parameter gs.TMP_PREFIX:
https://github.com/rwth-i6/sisyphus/blob/f6cba3d36153b449ed3ff93b4cbdc327e69f91b7/sisyphus/global_settings.py#L271-L272
This way it is configurable for different locations.
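A sketch of how this would be used, assuming the usual Sisyphus pattern of overriding defaults in settings.py and creating scratch space with tempfile:

```python
# In settings.py (read by Sisyphus at startup), override the default:
# TMP_PREFIX = "/var/tmp/"

import tempfile

from sisyphus import gs

# Scratch directory under the configured prefix; it is cleaned up
# automatically when the context manager exits.
with tempfile.TemporaryDirectory(prefix=gs.TMP_PREFIX) as tmp_dir:
    ...  # download / prepare into tmp_dir
```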

@albertz (Member, Author) commented Apr 19, 2022

I adapted it to use save_to_disk, and I'm using gs.TMP_PREFIX now.
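Putting the thread together, the core of the job's run step then presumably looks roughly like the sketch below; the attribute names (self.path, self.config_name, self.out_dir) are illustrative, not necessarily what was merged:

```python
import tempfile

from datasets import load_dataset
from sisyphus import gs

def run(self):
    # Throwaway cache under gs.TMP_PREFIX, so nothing global is reused
    # and the temporary files are deleted afterwards.
    with tempfile.TemporaryDirectory(prefix=gs.TMP_PREFIX) as cache_dir:
        ds = load_dataset(
            self.path,         # e.g. "glue"
            self.config_name,  # e.g. "mrpc", may be None
            cache_dir=cache_dir,
        )
        # Serialise into the job output for later load_from_disk.
        ds.save_to_disk(self.out_dir.get_path())
```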

albertz and others added 3 commits April 19, 2022 14:56
Co-authored-by: David Thulke <thulke@hltpr.rwth-aachen.de>
@dthulke self-requested a review on April 19, 2022 14:01
@albertz merged commit f9cb5dc into main on Apr 19, 2022
@albertz deleted the albert-huggingface-dataset branch on April 19, 2022 14:53