Multi-dataset recipes #2368
Replies: 3 comments
-
Hey @pplantinga and @Gastron, what do you think about this? Thanks! :)
-
In SpeechBrain we currently don't have a specific format for datasets, which makes dataset operations tricky. I think it's simply a matter of us not forcing any specific filenames or sample keys. Cases where some datasets do use the same conventions can probably already be handled with something like a ConcatDataset. On the other hand, SpeechBrain is compatible with any normal torch dataset implementation. For these kinds of cases, users could maybe look at Lhotse or other dataset providers that do standardize the datasets.
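To make the ConcatDataset idea concrete, here is a minimal, dependency-free sketch of what it could look like once sample keys are harmonized. Everything in it is an assumption for illustration: the `RenameKeys` and `Concat` classes, the toy corpora, and the `wav`/`wrd` key names are hypothetical and are not part of SpeechBrain.

```python
from bisect import bisect_right
from itertools import accumulate


class RenameKeys:
    """Hypothetical map-style wrapper that renames sample keys to a shared convention."""

    def __init__(self, samples, key_map):
        self.samples = samples  # any sequence of dict samples
        self.key_map = key_map  # e.g. {"text": "wrd"}; unmapped keys are kept as-is

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        sample = self.samples[index]
        return {self.key_map.get(key, key): value for key, value in sample.items()}


class Concat:
    """Hypothetical stand-in for the idea behind torch.utils.data.ConcatDataset."""

    def __init__(self, datasets):
        self.datasets = list(datasets)
        # Running totals of dataset lengths, e.g. [2, 3] for lengths 2 and 1.
        self.cumulative = list(accumulate(len(d) for d in self.datasets))

    def __len__(self):
        return self.cumulative[-1] if self.cumulative else 0

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        # Find which dataset the global index falls into, then its local offset.
        ds_idx = bisect_right(self.cumulative, index)
        offset = self.cumulative[ds_idx - 1] if ds_idx else 0
        return self.datasets[ds_idx][index - offset]


# Toy data (made up): two corpora that use different key names.
corpus_a = [{"wav": "a1.wav", "wrd": "hello"}, {"wav": "a2.wav", "wrd": "world"}]
corpus_b = [{"audio": "b1.flac", "text": "good morning"}]

combined = Concat([
    RenameKeys(corpus_a, {}),
    RenameKeys(corpus_b, {"audio": "wav", "text": "wrd"}),
])

for i in range(len(combined)):
    print(combined[i])
```

In a real setup, `torch.utils.data.ConcatDataset` could take the place of the `Concat` class, since it does the same cumulative-length lookup over map-style datasets. The key renaming is the part that each corpus would still need on its own.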
-
Hey @Gastron, thanks for your comment. We are planning to train some large ASR models on multiple datasets with SpeechBrain. Do you think you could contribute to this project by adding this new feature of multi-dataset recipes? Thanks.
-
🚀 The feature
It could be interesting to have a recipe and tools that help combine multiple training datasets. Some other toolkits already do this; k2/icefall is one example.
To my knowledge, we do not have such a recipe in SpeechBrain nor do we provide documentation on how to do it.
This issue is for tracking progress and centralizing discussion. If someone wants to implement it, it should probably get more discussion beforehand.
Solution outline
Things to investigate:
Additional context
No response