
HuggingFace Dataset Integration #257

Open
dthulke opened this issue Apr 21, 2022 · 10 comments
dthulke (Member) commented Apr 21, 2022

This continues the discussion from huggingface/datasets#4184 on how to integrate Hugging Face datasets into our workflow (@albertz, also @patrickvonplaten if you're interested).

Things to discuss:

  • Do we want to use load_dataset or save_to_disk / load_from_disk to save / load the dataset?
  • When and how do we load / process the audio data?
  • How to handle caching when using the dataset?

Do we want to use load_dataset or save_to_disk / load_from_disk to save / load the dataset?

In general, both should work. I see a few advantages for load_from_disk:

  1. The output of save_to_disk defines the full dataset, i.e. to load it we just need to call load_from_disk(path) and don't need to re-specify the dataset name, config, and cache dir location (btw. errors here may cause datasets to be downloaded into the wrong cache folders). See the sketch after this list.
  2. We don't need to make the cache_dir read-only to avoid any files being written or modified.
  3. If we have any jobs that do processing steps on the dataset via e.g. Dataset.map or Dataset.filter (via a separate Map- or FilterHuggingFaceDatasetJob), the output can also be stored and loaded with save_to_disk / load_from_disk. With load_dataset we would, afaik, need to repeat all these steps in later jobs (e.g. RETURNN).
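
A minimal sketch of that round trip (dataset name, config, and all paths are placeholders for illustration):

from datasets import load_dataset, load_from_disk

# Download/prepare job: the save_to_disk output folder fully describes the dataset.
ds = load_dataset("librispeech_asr", "clean", cache_dir="/tmp/hf_cache")
ds.save_to_disk("/path/to/job/output/dataset")

# Any later job only needs the path, no dataset name, config, or cache dir.
ds = load_from_disk("/path/to/job/output/dataset")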

When and how do we load / process the audio data?

Here, I would suggest waiting until the change mentioned in huggingface/datasets#4184 (comment) (option b) ) is implemented, i.e. the bytes of the audio data are directly loaded into the arrow dataset (alternatively, we do this manually until the change is implemented / to support older datasets versions).
Then we also don't need to keep the cache directory.

How to handle caching when using the dataset?

With save_to_disk, all data is contained in a separate .arrow file per split, i.e. we just need to cache these files using the cache manager.
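
Roughly, something like the following, where cache_file is only a placeholder for whatever the actual cache manager call is in our setup:

import glob
import os

dataset_dir = "/path/to/job/output/dataset"  # output of save_to_disk (path is a placeholder)

# save_to_disk stores the data as .arrow file(s) per split below this directory.
arrow_files = glob.glob(os.path.join(dataset_dir, "**", "*.arrow"), recursive=True)
for arrow_file in arrow_files:
    cache_file(arrow_file)  # placeholder: hand each file to the cache manager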

albertz (Member) commented Apr 21, 2022

In any case, I think we do not want to have each audio recording as a single file on disk. This does not work well with the cache manager. We want to have only a few big files, which can be handled well by the cache manager.

I think this already excludes load_dataset because this would rely on the cache dir and having lots of separate audio files.

It also sounds like currently save_to_disk does not automatically embed the content of all the files. We can wait for them to implement this (as I understand it, this is what they plan to do now) or use some of their code snippets to prepare the dataset in a way that it has the audio content embedded.

I don't really know whether there is a good generic way to do this which works for every dataset. E.g. other datasets don't have any audio and such code would fail then? (@patrickvonplaten?)

In any case, when we have the raw data embedded in the dataset, save_to_disk sounds like it would be the best solution.

So basically as we already have it, except that we are missing the raw audio content. Once they change the save_to_disk implementation, it will already work as it is. It probably also already works for pure text datasets.
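
For example, a quick check with a pure text dataset (dataset name and path are just examples):

from datasets import load_dataset, load_from_disk

ds = load_dataset("glue", "sst2")        # text-only dataset, already self-contained
ds.save_to_disk("/path/to/output")

ds = load_from_disk("/path/to/output")   # no cache dir or original files needed
print(ds["train"][0]["sentence"])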

dthulke (Member Author) commented Apr 21, 2022

I don't really know whether there is a good generic way to do this which works for every dataset. E.g. other datasets don't have any audio and such code would fail then? (@patrickvonplaten?)

Something like the following should work:

from datasets import Audio

# Features live on the dataset, not on the individual example.
audio_columns = [name for name, feature in ds.features.items() if isinstance(feature, Audio)]

if audio_columns:
    # Disable on-the-fly decoding so the examples expose the raw {"path", "bytes"} dict.
    for name in audio_columns:
        ds = ds.cast_column(name, Audio(decode=False))

    def embed_audio_data(ex):
        output = {}
        for name in audio_columns:
            if ex[name]["bytes"] is not None:
                continue  # already embedded
            with open(ex[name]["path"], "rb") as f:
                output[name] = {"path": None, "bytes": f.read()}
        return output

    ds = ds.map(embed_audio_data)

patrickvonplaten commented Apr 21, 2022

We're working on fixing the save_to_disk() function so this will soon work out of the box: huggingface/datasets#4196

patrickvonplaten commented Apr 21, 2022

Note that audio datasets are the only datasets that are not self-contained by default. Text datasets are always self-contained. Image datasets are self-contained by default but can be made not self-contained; audio datasets are not self-contained by default but can be made self-contained.

One thing I also want to mention is that there is not necessarily only one LibriSpeech dataset loading script. You can create your own that never relies on single audio files and that is always self-contained.
This is very simple, e.g. I've just done it here:

If you use this dataset loading script:

from datasets import load_dataset
ds = load_dataset("patrickvonplaten/librispeech_asr_self_contained")

you can see that everything is available in memory and nothing relies on a local path (the local path is never extracted).

You could do:

import datasets

ds = ds.cast_column("audio", datasets.Audio(decode=False))  # disable decoding on the fly
raw_bytes = ds["train.clean.100"][0]["audio"]["bytes"]      # retrieve the raw bytes

Also save_to_disk(...) would directly work for this dataset.

In short, what I'm trying to say is that the easiest option for you guys would probably be to use all the datasets that already exist, and if there is an audio dataset that is not self-contained by default, you could just copy-paste the script, adapt it, and put it under an i6RWTH namespace on the Hub.

Happy to set this up for you if you want to try it out :-)

patrickvonplaten commented

See the dataset script that I just added here: https://huggingface.co/datasets/patrickvonplaten/librispeech_asr_self_contained

albertz (Member) commented Apr 24, 2022

You could do:

import datasets

ds = ds.cast_column("audio", datasets.Audio(decode=False))  # disable decoding on the fly
raw_bytes = ds["train.clean.100"][0]["audio"]["bytes"]      # retrieve the raw bytes

Also save_to_disk(...) would directly work for this dataset.

Does the cast_column here have any influence on save_to_disk?

I assume I should use cast_column after calling load_from_disk, right? But not before save_to_disk (where it probably does not have any influence)?

albertz (Member) commented Apr 25, 2022

@patrickvonplaten Another question: When can we expect a new public release which has the new save_to_disk behavior?

I wonder if it is worth coming up with our own custom implementation or whether we should just wait for it.

Btw, in general, it's nice that it is easy to come up with our own dataset implementations. But the main idea of the HuggingFace Dataset integration is that we want to be able to easily use all the existing datasets without having to invest further effort into them.

patrickvonplaten commented

Hey,

Yes, I think cast_column should be used before save_to_disk. We'll surely have a new public release with the new save_to_disk behavior.
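
A minimal sketch of that order (the path is a placeholder):

import datasets
from datasets import load_dataset, load_from_disk

ds = load_dataset("patrickvonplaten/librispeech_asr_self_contained")
ds = ds.cast_column("audio", datasets.Audio(decode=False))  # cast before saving
ds.save_to_disk("/path/to/dataset/dir")

ds = load_from_disk("/path/to/dataset/dir")
raw_bytes = ds["train.clean.100"][0]["audio"]["bytes"]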

cc @mariosasko @polinaeterna @lhoestq here as well - I think you guys can answer better here. Also, do you already have an idea about when the save_to_disk fix will be released?

lhoestq commented Apr 27, 2022

save_to_disk will be updated in a few weeks I think. I'll keep you posted :)

In the meantime, feel free to decode the audio files with

>>> def read_audio_file(example):
...     with open(example["audio"]["path"], "rb") as f:
...         return {"audio": {"bytes": f.read()}}
>>> ds = ds.map(read_audio_file, num_proc=8)
>>> ds.save_to_disk("path/to/dataset/dir")

JackTemaki (Contributor) commented

@dthulke any further activity here or can this be closed?
