
DownloadAndPrepareHuggingFaceDatasetJob #253

Merged: albertz merged 7 commits into main from albert-huggingface-dataset on Apr 19, 2022
Conversation

@albertz (Member) commented Apr 19, 2022

@albertz requested a review from dthulke on April 19, 2022 10:41
@dthulke (Member) commented Apr 19, 2022

How do you plan to use the datasets later on in the pipeline (in RETURNN)? There is also Dataset.save_to_disk (https://huggingface.co/docs/datasets/v2.1.0/en/package_reference/main_classes#datasets.Dataset.save_to_disk), which might be more suitable for using the dataset later on than relying on the cache directory.

@albertz (Member, Author) commented Apr 19, 2022

> How do you plan to use the datasets later on in the pipeline (in RETURNN)?

Also via datasets.load_dataset, using the same cache dir, and further maybe disallowing redownload (I haven't checked in detail yet whether there is a good option for this).

Maybe, to make sure it does not modify anything, we could also make everything read-only at the end of this job?
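For reference, load_dataset accepts a download_mode argument that controls re-download behaviour. A minimal sketch, assuming datasets v2.x; the dataset name, cache path, and the chmod walk at the end are illustrative, not part of this PR:

```python
import os

from datasets import load_dataset, DownloadMode

# Reuse the prepared dataset if it is already in the cache dir;
# this is the default mode, spelled out here for clarity.
ds = load_dataset(
    "glue",
    "mrpc",
    cache_dir="/path/to/cache",
    download_mode=DownloadMode.REUSE_DATASET_IF_EXISTS,
)

# Make the cache read-only afterwards so later jobs cannot modify it.
for root, dirs, files in os.walk("/path/to/cache"):
    for d in dirs:
        os.chmod(os.path.join(root, d), 0o555)
    for f in files:
        os.chmod(os.path.join(root, f), 0o444)
```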

> There is also Dataset.save_to_disk (https://huggingface.co/docs/datasets/v2.1.0/en/package_reference/main_classes#datasets.Dataset.save_to_disk), which might be more suitable for using the dataset later on than relying on the cache directory.

Hm, I'm not really familiar with the HuggingFace datasets. Are you? Is this better?

@dthulke (Member) commented Apr 19, 2022

The issue with the cache dir is that I'm not sure we can guarantee that there is no redownload, nor that other operations on the dataset later on will use this cache directory.

Using save_to_disk and later load_from_disk properly serialises the dataset and thus should avoid these issues.
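A minimal round trip with these two calls might look like this (dataset name and paths are placeholders):

```python
from datasets import load_dataset, load_from_disk

# Download and prepare once, then serialise to a self-contained directory.
ds = load_dataset("glue", "mrpc", cache_dir="/tmp/hf_cache")
ds.save_to_disk("/path/to/prepared_dataset")

# Later consumers reload directly from that directory and do not
# depend on the original cache dir at all.
ds = load_from_disk("/path/to/prepared_dataset")
```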

I'm using HF datasets in some of my setups. For RETURNN, I have a custom dataset implementation that wraps the library. There I directly call load_dataset in initialize and use the global cache directory (so the dataset is downloaded and prepared when starting the training). But I have been planning for some time already to move this out into separate Sisyphus jobs and remove the dependency on the global cache folder.

@albertz (Member, Author) commented Apr 19, 2022

Ok, maybe using save_to_disk is the better idea.

As cache dir, we just use /var/tmp or so?

@dthulke (Member) commented Apr 19, 2022

> As cache dir, we just use /var/tmp or so?

Yes, this way we avoid anything being reused, and the temporary files should get deleted.

@michelwi (Contributor) commented Apr 19, 2022

> As cache dir, we just use /var/tmp or so?

I would propose to use the sisyphus parameter gs.TMP_PREFIX:
https://github.com/rwth-i6/sisyphus/blob/f6cba3d36153b449ed3ff93b4cbdc327e69f91b7/sisyphus/global_settings.py#L271-L272
This way it is configurable for different locations.
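A sketch of how this would be used, assuming the usual Sisyphus pattern of overriding defaults in settings.py and creating scratch space with tempfile:

```python
# In settings.py (read by Sisyphus at startup), override the default:
# TMP_PREFIX = "/var/tmp/"

import tempfile

from sisyphus import gs

# Scratch directory under the configured prefix; it is cleaned up
# automatically when the context manager exits.
with tempfile.TemporaryDirectory(prefix=gs.TMP_PREFIX) as tmp_dir:
    ...  # download / prepare into tmp_dir
```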

@albertz (Member, Author) commented Apr 19, 2022

I adapted it to use save_to_disk, and I'm using gs.TMP_PREFIX now.
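Putting the thread together, the core of the job's run step then presumably looks roughly like the sketch below; the attribute names (self.path, self.config_name, self.out_dir) are illustrative, not necessarily what was merged:

```python
import tempfile

from datasets import load_dataset
from sisyphus import gs

def run(self):
    # Throwaway cache under gs.TMP_PREFIX, so nothing global is reused
    # and the temporary files are deleted afterwards.
    with tempfile.TemporaryDirectory(prefix=gs.TMP_PREFIX) as cache_dir:
        ds = load_dataset(
            self.path,         # e.g. "glue"
            self.config_name,  # e.g. "mrpc", may be None
            cache_dir=cache_dir,
        )
        # Serialise into the job output for later load_from_disk.
        ds.save_to_disk(self.out_dir.get_path())
```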

albertz and others added 3 commits April 19, 2022 14:56
Co-authored-by: David Thulke <thulke@hltpr.rwth-aachen.de>
@dthulke self-requested a review on April 19, 2022 14:01
@albertz merged commit f9cb5dc into main on Apr 19, 2022
@albertz deleted the albert-huggingface-dataset branch on April 19, 2022 14:53