[data] Support large-scale HuggingFace datasets #37591

Open
raulchen opened this issue Jul 20, 2023 · 5 comments · Fixed by #38432
Labels: data (Ray Data-related issues), enhancement (Request for new feature and/or capability), P1 (Issue that should be fixed within a few weeks)

Comments

@raulchen
Contributor

Currently, ray.data.from_huggingface needs to materialize the entire dataset first. We need to make it more scalable.
Potential solutions:

  1. Support IterableDataset
  2. List parquet files and use ray.data.read_parquet
@raulchen raulchen added this to the Ray Data Benchmarks milestone Jul 20, 2023
@raulchen raulchen added data Ray Data-related issues Ray 2.7 labels Jul 20, 2023
@c21
Contributor

c21 commented Aug 1, 2023

Just to add more context:

Currently, from_huggingface() loads the Hugging Face dataset into memory on a single node. This works for small datasets but not for large ones. For example, a user hit an OOM error when trying to use https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T . We should change the API so that the data is read in a streaming fashion across multiple tasks.

Hugging Face dataset streaming - https://huggingface.co/docs/datasets/stream
split_dataset_by_node - https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.distributed.split_dataset_by_node

Use case
Read large-scale Hugging Face datasets efficiently for fine-tuning LLMs

@c21 c21 added the P1 Issue that should be fixed within a few weeks label Aug 1, 2023
@pcmoritz
Contributor

pcmoritz commented Aug 24, 2023

Given the performance of #38432, I don't think we should mark this as done yet :)

@pcmoritz pcmoritz reopened this Aug 24, 2023
@anyscalesam anyscalesam assigned pcmoritz and c21 and unassigned pcmoritz and bveeramani Nov 2, 2023
@anyscalesam
Collaborator

@c21 please review with @pcmoritz whether the performance gap has been addressed

@anyscalesam anyscalesam added ray 2.9 Issues targeting Ray 2.9 release (~Q4 CY2023) and removed Ray 2.7 labels Nov 2, 2023
@anyscalesam anyscalesam added ray 2.10 and removed ray 2.9 Issues targeting Ray 2.9 release (~Q4 CY2023) labels Nov 13, 2023
@anyscalesam anyscalesam added enhancement Request for new feature and/or capability and removed ray 2.10 labels Dec 8, 2023
@npuichigo
Contributor

@pcmoritz I cannot find any context regarding the performance issue. Could you help explain it?

Besides, what's the recommended way to consume a large Hugging Face dataset now? I'm not sure whether my approach is correct:

  1. Use to_parquet to save the Hugging Face dataset in Parquet format.
  2. Use Ray to read the Parquet files in parallel.
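
For reference, the two steps above could look roughly like the sketch below. It assumes a map-style `datasets.Dataset` and an `out_dir` on storage that every Ray node can reach. The helper names `shard_paths`, `export_shards`, and `read_shards` are hypothetical, and writing several shard files instead of a single one is a choice made here so that Ray has multiple files to read in parallel. It is not a confirmed recommendation from the maintainers.

```python
import os


def shard_paths(out_dir, num_shards):
    """Return one Parquet file path per shard (hypothetical naming scheme)."""
    return [os.path.join(out_dir, f"part-{i:05d}.parquet") for i in range(num_shards)]


def export_shards(hf_dataset, out_dir, num_shards):
    """Step 1: write a Hugging Face `datasets.Dataset` as several Parquet files."""
    os.makedirs(out_dir, exist_ok=True)
    paths = shard_paths(out_dir, num_shards)
    for index, path in enumerate(paths):
        # Dataset.shard selects one of `num_shards` pieces; to_parquet writes it out.
        hf_dataset.shard(num_shards=num_shards, index=index).to_parquet(path)
    return paths


def read_shards(paths):
    """Step 2: let Ray Data read the Parquet files in parallel."""
    import ray  # imported lazily so the helpers above work without Ray installed

    return ray.data.read_parquet(paths)
```

Usage would be along the lines of `read_shards(export_shards(ds, "/shared/my_dataset", 64))`, where the path and shard count are placeholders. Note that step 1 still runs on a single machine, so it only moves the bottleneck out of Ray rather than removing it.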

@lhoestq

lhoestq commented Mar 22, 2024

If it helps, you can stream a dataset as an IterableDataset and split the stream across workers in a distributed setup using

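from datasets import load_dataset
from datasets.distributed import split_dataset_by_node

# `rank` and `world_size` come from your distributed launcher; "<dataset_name>" is a placeholder.
ds = load_dataset("<dataset_name>", split="train", streaming=True)
ds = split_dataset_by_node(ds, rank=rank, world_size=world_size)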

The documentation is here
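
To make the effect of the split concrete, here is a pure-Python illustration of how examples end up divided when each rank keeps one example out of every `world_size`. This is a simplified model of the behavior the `datasets` documentation describes for streaming datasets whose shard count does not divide evenly across nodes; it is not the library's implementation, and `take_for_rank` is a hypothetical name.

```python
from itertools import islice


def take_for_rank(examples, rank, world_size):
    """Return an iterator over every `world_size`-th example, starting at offset `rank`."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    return islice(examples, rank, None, world_size)


stream = range(10)  # stand-in for a streamed dataset
parts = [list(take_for_rank(stream, rank, world_size=3)) for rank in range(3)]
print(parts)  # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```

Every example is seen by exactly one rank, but each rank still iterates over the full stream and discards what it does not keep, which is why having a shard count divisible by `world_size` is the cheaper case.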

@anyscalesam anyscalesam removed this from the Ray Data Benchmarks milestone Jun 15, 2024
7 participants