HuggingFace Dataset Integration #257
In any case, I think we do not want to have each audio file as a single file on disk. This does not work well with the cache manager. We want to have only a few big files, which the cache manager can handle well. I think this already excludes … It also sounds like … I don't really know whether there is a good generic way to do this which works for every dataset. E.g. other datasets don't have any audio, and such code would fail then? (@patrickvonplaten?) In any case, when we have the raw data embedded in the dataset, … So basically as we already have it, except that we miss the raw audio data content. When they change the …
Something like the following should work:

from datasets import Audio

def embed_audio_data(ex, features):
    # For every audio column that still points to a file on disk,
    # read the file and embed its raw bytes into the example.
    output = {}
    for name, column_type in features.items():
        if not isinstance(column_type, Audio) or ex[name]["bytes"] is not None:
            continue
        with open(ex[name]["path"], "rb") as f:
            audio_data = f.read()
        output[name] = {"path": None, "bytes": audio_data}
    return output

if any(isinstance(t, Audio) for t in ds.features.values()):
    ds = ds.map(embed_audio_data, fn_kwargs={"features": ds.features})
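One caveat (my note, not part of the original comment): for the mapped examples to expose {"path", "bytes"} dicts at all, on-the-fly decoding has to be disabled first, e.g.:

from datasets import Audio

# Without this, examples yield decoded {"array", "sampling_rate", "path"} dicts
# and accessing ["bytes"] would fail.
ds = ds.cast_column("audio", Audio(decode=False))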
We're working on fixing the …
Note that audio datasets are the only datasets that are not self-contained by default. Text datasets are always self-contained. Image datasets are self-contained by default but can be made not self-contained; audio datasets are not self-contained by default but can be made self-contained. One thing I also want to mention is that there is not necessarily only one Librispeech downloading dataset. You can create your own that never relies on single audio files and that is always self-contained. If you use this dataset loading script:

from datasets import load_dataset, Audio

ds = load_dataset("patrickvonplaten/librispeech_asr_self_contained")

you can see that everything is available in memory and nothing relies on a local path (the local path is never extracted). You could do:

ds = ds.cast_column("audio", Audio(decode=False))  # disable decoding on the fly
ds["train.clean.100"][0]["audio"]["bytes"]  # <- retrieve the raw bytes

Also … In short, what I'm trying to say is that the easiest for you guys would probably be to use all datasets that exist, and if there is an audio dataset that is by default not self-contained, you could just copy-paste the script, adapt it, and put it under an i6RWTH namespace on the Hub. Happy to set this up for you if you want to try it out :-)
See the dataset script that I just added here: https://huggingface.co/datasets/patrickvonplaten/librispeech_asr_self_contained
The … I assume I should use …
@patrickvonplaten Another question: when can we expect a new public release which has the new …? I wonder whether it is worth coming up with our own custom implementation, or whether we should just wait for it. Btw, in general, it's nice that it is easy to come up with one's own dataset implementations. But the main idea of the HuggingFace Dataset integration is that we want to be able to easily use all the existing datasets without investing any further effort.
Hey, yes I think … cc @mariosasko @polinaeterna @lhoestq here as well - I think you guys can answer better here. Also, do you already have an idea about when the …
In the meantime, feel free to decode the audio files with map:

>>> def read_audio_file(example):
...     with open(example["audio"]["path"], "rb") as f:
...         return {"audio": {"bytes": f.read()}}
>>> ds = ds.map(read_audio_file, num_proc=8)
>>> ds.save_to_disk("path/to/dataset/dir")
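As a follow-up (my addition, not part of the original comment): the dataset saved this way can later be reloaded directly from that directory, without going through load_dataset or the download cache again:

>>> from datasets import load_from_disk
>>> ds = load_from_disk("path/to/dataset/dir")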
@dthulke any further activity here or can this be closed?
This continues the discussion from huggingface/datasets#4184 on how to integrate Hugging Face datasets in our workflow (@albertz, also @patrickvonplaten if you're interested).
Things to discuss:

- load_dataset or save_to_disk / load_from_disk to save / load the dataset? In general, both should work. I see a few advantages for load_from_disk (see also the sketch at the end):
  - We only call load_from_disk(path) and don't need to respecify the dataset name, config and cache dir location (btw. errors here may cause that datasets get downloaded into the wrong cache folders).
  - After Map- or Filter-steps (HuggingFaceDatasetJob), the output can also be stored and loaded with save_to_disk / load_from_disk. With load_dataset we afaik would need to redo all these steps in later jobs (e.g. RETURNN).
- Here, I would suggest to wait until the change mentioned in huggingface/datasets#4184 (comment) (option b) is implemented, i.e. the bytes of the audio data are directly loaded into the Arrow dataset (alternatively, we do this manually until the change is implemented / to support older datasets versions). Then we also don't need to keep the cache directory.
- Using save_to_disk, all data is contained in a separate .arrow file per split, i.e. we just need to cache these files using the cache manager.
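To make this concrete, here is a minimal sketch of the suggested save_to_disk / load_from_disk workflow (my addition; the dataset name, config and split follow the Librispeech example from the thread, while the path and the filter condition are hypothetical):

from datasets import load_dataset, load_from_disk

# One-time preparation job: download, optionally Map/Filter, then write
# self-contained Arrow files.
ds = load_dataset("librispeech_asr", "clean", split="train.100")
ds = ds.filter(lambda ex: len(ex["text"]) > 0)  # hypothetical Filter step
ds.save_to_disk("/cache/librispeech_clean_100")

# Later jobs (e.g. RETURNN) only need this path; no dataset name, config,
# or cache dir has to be respecified.
ds = load_from_disk("/cache/librispeech_clean_100")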