[data] Support large-scale HuggingFace datasets #37591
Comments
Just to add more context: currently, `from_huggingface()` loads the Hugging Face dataset in memory on a single node. This works for small datasets but not for large ones. For example, a user hit an OOM error when trying to use https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T. We should change the API so the read happens in a streaming way across multiple tasks. Hugging Face dataset streaming: https://huggingface.co/docs/datasets/stream
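For reference, a minimal sketch of what Hugging Face streaming mode looks like on its own (the dataset name comes from the link above; the split and the `take(5)` call are illustrative):

```python
from datasets import load_dataset

# streaming=True returns an IterableDataset that yields examples lazily
# instead of downloading and materializing the whole corpus up front.
ds = load_dataset(
    "togethercomputer/RedPajama-Data-1T",
    split="train",
    streaming=True,
)

# Inspect a few rows without pulling the full dataset.
for example in ds.take(5):
    print(example)
```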
Given the performance of #38432, I don't think we should mark this as done yet :)
@pcmoritz I cannot find any context regarding the performance issue. Could you help explain it? Also, what's the recommended way to consume large Hugging Face datasets now? I'm not sure the way I'm doing it is correct.
If it can help, you can stream a dataset and shard it across nodes using `split_dataset_by_node`:

```python
from datasets.distributed import split_dataset_by_node

ds = split_dataset_by_node(ds, rank=rank, world_size=world_size)
```

The documentation is here.
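To make that concrete, here is a hedged sketch of per-worker sharding; the dataset name, the worker count, and the `read_shard` helper are hypothetical, only `split_dataset_by_node` itself comes from the comment above:

```python
from datasets import load_dataset
from datasets.distributed import split_dataset_by_node

WORLD_SIZE = 4  # total number of workers (assumed)

def read_shard(rank: int):
    # Each worker opens the dataset in streaming mode and keeps only the
    # shard assigned to its rank, so no single node holds the full corpus.
    full = load_dataset(
        "togethercomputer/RedPajama-Data-1T",  # placeholder dataset
        split="train",
        streaming=True,
    )
    shard = split_dataset_by_node(full, rank=rank, world_size=WORLD_SIZE)
    for example in shard:
        ...  # process the example on this worker
```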
Currently, `ray.data.from_huggingface` needs to materialize all data first. We need to make it more scalable.

Potential solutions:
- `ray.data.read_parquet`
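A rough sketch of the `read_parquet` direction, assuming the dataset is available (or has been exported) as Parquet files at some path; the path below is a placeholder, not the real location of any Hugging Face dataset:

```python
import ray

# read_parquet builds a lazy, distributed Dataset: files are split across
# read tasks and rows are pulled in a streaming fashion by downstream
# operations, so nothing needs to be materialized on a single node.
ds = ray.data.read_parquet("s3://my-bucket/redpajama-parquet/")  # hypothetical path

for batch in ds.iter_batches(batch_size=1024):
    ...  # process each batch
```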