HuggingFace Dataset Integration #257

dthulke · 2022-04-21T14:40:51Z

This continues the discussion from huggingface/datasets#4184 on how to integrate hugging face datasets in our workflow (@albertz , also @patrickvonplaten if you're interested).

Things to discuss:

Do we want to use load_dataset or save_to_disk / load_from_disk to save / load the dataset?
When and how do we load / process the audio data?
How to handle caching when using the dataset?

Do we want to use load_dataset or save_to_disk / load_from_disk to save / load the dataset?

In general, both should work. I see a few advantages for load_from_disk:

The output of save_to_disk defines the full dataset, i.e. to load it we just need to call load_from_disk(path) and don't need to respecify the dataset name, config and cache dir location (btw. errors here may cause that datasets get downloaded into wrong cache folders).
We don't need to make the cache_dir read-only to avoid that any files are written/modified
If we have any jobs that do any processing steps on the dataset via e.g. Dataset.map or Dataset.filter (via a separate Map- or FilterHuggingFaceDatasetJob), the output can also be stored and loaded with save_to_disk / load_from_disk. With load_dataset we afaik would need to do all these steps also in later jobs (e.g. RETURNN).

When and how do we load / process the audio data?

Here, I would suggest to wait until the change mentioned here huggingface/datasets#4184 (comment) (option b) ) is implemented, i.e. the bytes of the audio data are directly loaded into the arrow dataset (alternatively, we do this manually until the change is implemented / to supported older datasets versions).
Then, we also don't need to keep the cache directory.

How to handle caching when using the dataset?

Using save_to_disk all data is contained in a separate .arrow file per split. I.e. we just need to cache these files using the cache manager.

The text was updated successfully, but these errors were encountered:

albertz · 2022-04-21T14:52:29Z

In any case, I think we do not want to have each audio file as single audio file on disk. This does not work well with the cache manager. We want to have only a few big files which can be handled well by the cache manager.

I think this already excludes load_dataset because this would rely on the cache dir and having lots of separate audio files.

It also sounds like currently save_to_disk does not automatically embed the content of all the files. We can wait for them to implement this (as I understand it, this is what they plan to do now) or use some of their code snippets to prepare the dataset in a way that it has the audio content embedded.

I don't really know whether there is a good generic way to do this which works for every dataset. E.g. other datasets don't have any audio and such code would fail then? (@patrickvonplaten?)

In any case, when we have the raw data embedded in the dataset, save_to_disk sounds like it would be the best solution.

So basically as we already have it, except that we miss the raw audio data content. When they change the save_to_disk implementation, it already works as it is. It probably also already works for other pure text datasets.

dthulke · 2022-04-21T16:41:28Z

I don't really know whether there is a good generic way to do this which works for every dataset. E.g. other datasets don't have any audio and such code would fail then? (@patrickvonplaten?)

Something like the following should work:

def embed_audio_data(ex):
  output = {}
  for name, column_type in ex.features.items():
    if column_type != Audio or ex.data[name]["bytes"] is not None:
      continue
    with open(ex.data[name]["path"], "rb") as f:
      audio_data = f.read()
      output[name] = {'path': None, 'bytes': audio_data}
  return output

if any(`ds.features contains audio column`):
  ds = ds.map(embed_audio_data)

patrickvonplaten · 2022-04-21T16:56:02Z

We're working on fixing the save_to_disk() function so this will soon work out of the box: huggingface/datasets#4196

patrickvonplaten · 2022-04-21T17:25:21Z

Note that audio datasets are the only datasets that are not self-contained by default. Text datasets are always self-contained. Image datasets are self-contained by default, but can be made not self-contained, audio datasets are not self-contained by default, but can be made self-contained.

One thing I want to mention also is that there is not necessarily only one Librispeech downloading dataset. You can create your own that never relies on single audio files and that is always self-contained.
This is very simple, e.g. I've done just done it here:

If you use this dataset loading script:

from dataset import load_dataset
ds = load_dataset("patrickvonplaten/librispeech_asr_self_contained")

you can see that everything is available in memory and nothing relies on a local path (the local path is never extracted).

You could do:

ds = ds.cast_column("audio", datasets.Audio(decode=False))  # disable decoding on the fly
ds["train.clean.100"][0]["audio"]["bytes"]  <- retrieve bytes

Also save_to_disk(...) would directly work for this dataset.

In short, what I'm trying to say is that the easiest for you guys would probably be to use all datasets that exist and if there is an audio dataset that is by default not self-contained you could just copy-paset the script, adapt it and put it under a i6RWTH namespace on the Hub.

Happy to set this up for you if you want to try it out :-)

patrickvonplaten · 2022-04-21T17:25:44Z

See the dataset script that I just added here: https://huggingface.co/datasets/patrickvonplaten/librispeech_asr_self_contained

albertz · 2022-04-24T19:29:21Z

You could do:

ds = ds.cast_column("audio", datasets.Audio(decode=False))  # disable decoding on the fly
ds["train.clean.100"][0]["audio"]["bytes"]  <- retrieve bytes

Also save_to_disk(...) would directly work for this dataset.

The cast_column here, does it have any influence on save_to_disk?

I assume I should use cast_column after I called load_from_disk, right? But not before save_to_disk (where it probably does not have any influence)?

albertz · 2022-04-25T09:25:05Z

@patrickvonplaten Another question: When can we expect a new public release which has the new save_to_disk behavior?

I wonder if it is worth it to come up with our custom implementation or whether we should just wait for it.

Btw, in general, it's nice that it is easy to come up with own dataset implementations. But the main idea of having the HuggingFace Dataset integration is that we want to be able to easily use all the existing datasets without the need to further invest any effort on it.

patrickvonplaten · 2022-04-25T14:23:56Z

Hey,

Yes I think cast_column should be used before save_to_disk. We'll surely have a new public release about the save_to_disk behavior

cc @mariosasko @polinaeterna @lhoestq here as well - I think you guys can answer better here. Also do you already have an idea about when the save_to_disk fix will be releasead?

lhoestq · 2022-04-27T10:50:51Z

save_to_disk will be updated in a few weeks I think. I'll keep you posted :)

In the meantime, feel free to decode the audio files with

>>> def read_audio_file(example):
...     with open(example["audio"]["path"], "rb") as f:
...         return {"audio": {"bytes": f.read()}}
>>> ds = ds.map(read_audio_file, num_proc=8)
>>> ds.save_to_disk("path/to/dataset/dir")

JackTemaki · 2023-09-07T08:11:30Z

@dthulke any further activity here or can this be closed?

dthulke assigned albertz and dthulke Apr 21, 2022

dthulke mentioned this issue Apr 21, 2022

[Librispeech] Add 'all' config huggingface/datasets#4184

Merged

patrickvonplaten mentioned this issue Apr 21, 2022

[Audio] Path of Common Voice cannot be used for audio loading anymore huggingface/datasets#3663

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HuggingFace Dataset Integration #257

HuggingFace Dataset Integration #257

dthulke commented Apr 21, 2022

albertz commented Apr 21, 2022 •

edited

Loading

dthulke commented Apr 21, 2022

patrickvonplaten commented Apr 21, 2022 •

edited

Loading

patrickvonplaten commented Apr 21, 2022 •

edited

Loading

patrickvonplaten commented Apr 21, 2022

albertz commented Apr 24, 2022

albertz commented Apr 25, 2022

patrickvonplaten commented Apr 25, 2022

lhoestq commented Apr 27, 2022

JackTemaki commented Sep 7, 2023

HuggingFace Dataset Integration #257

HuggingFace Dataset Integration #257

Comments

dthulke commented Apr 21, 2022

albertz commented Apr 21, 2022 • edited Loading

dthulke commented Apr 21, 2022

patrickvonplaten commented Apr 21, 2022 • edited Loading

patrickvonplaten commented Apr 21, 2022 • edited Loading

patrickvonplaten commented Apr 21, 2022

albertz commented Apr 24, 2022

albertz commented Apr 25, 2022

patrickvonplaten commented Apr 25, 2022

lhoestq commented Apr 27, 2022

JackTemaki commented Sep 7, 2023

albertz commented Apr 21, 2022 •

edited

Loading

patrickvonplaten commented Apr 21, 2022 •

edited

Loading

patrickvonplaten commented Apr 21, 2022 •

edited

Loading