[data] Support large-scale HuggingFace datasets #37591

raulchen · 2023-07-20T00:07:52Z

Currently, ray.data.from_huggingface needs to materialize all data first. We need to make it more scalable.
Potential solutions:

Support IterableDataset
List parquet files and use ray.data.read_parquet
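
The second option could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not necessarily how Ray implements it: it assumes the dataset has auto-converted Parquet files listed by the Hugging Face dataset viewer API at `https://datasets-server.huggingface.co/parquet`, and that `fsspec` is installed for HTTP reads. The helper names `select_parquet_urls`, `fetch_parquet_listing`, and `read_hf_parquet` are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the dataset viewer API lists auto-converted Parquet files here.
PARQUET_API = "https://datasets-server.huggingface.co/parquet"


def select_parquet_urls(listing, split=None, config=None):
    """Pick Parquet file URLs out of a listing response, optionally filtered."""
    urls = []
    for entry in listing.get("parquet_files", []):
        if split is not None and entry.get("split") != split:
            continue
        if config is not None and entry.get("config") != config:
            continue
        urls.append(entry["url"])
    return urls


def fetch_parquet_listing(dataset_name):
    """Download the Parquet file listing for a dataset (needs network access)."""
    query = urllib.parse.urlencode({"dataset": dataset_name})
    with urllib.request.urlopen(f"{PARQUET_API}?{query}") as resp:
        return json.load(resp)


def read_hf_parquet(dataset_name, split="train"):
    """Read the listed Parquet files with Ray Data instead of materializing them."""
    # Imported lazily so the pure helpers above work without Ray installed.
    import fsspec.implementations.http
    import ray.data

    urls = select_parquet_urls(fetch_parquet_listing(dataset_name), split=split)
    http_fs = fsspec.implementations.http.HTTPFileSystem()
    # The files are read by distributed tasks rather than loaded on one node.
    return ray.data.read_parquet(urls, filesystem=http_fs)
```

Only `select_parquet_urls` is pure; the other two functions need network access and a running Ray installation, so treat them as a starting point to adapt.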


c21 · 2023-08-01T21:39:27Z

Just to add more context:

Currently, from_huggingface() loads the Hugging Face dataset into memory on a single node. This works for small datasets but not for large ones. For example, a user hit an OOM error when trying to use https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T . We should change the API so that reading happens in a streaming fashion across multiple tasks.

Hugging Face dataset streaming - https://huggingface.co/docs/datasets/stream
split_dataset_by_node - https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.distributed.split_dataset_by_node

Use case
Read large-scale Hugging Face datasets efficiently for fine-tuning LLMs

pcmoritz · 2023-08-24T06:40:35Z

Given the performance of #38432, I don't think we should mark this as done yet :)

anyscalesam · 2023-11-02T21:48:06Z

@c21 please review with @pcmoritz whether the performance gap has been addressed

npuichigo · 2023-12-18T08:30:59Z

@pcmoritz I cannot find any context regarding the performance issue. Could you help explain it?

Besides, what's the recommended way to consume a large Hugging Face dataset now? I'm not sure whether my approach is correct:

Use to_parquet to save the Hugging Face dataset in Parquet format.
Use Ray to read the Parquet files in parallel.
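
For reference, the two steps above might look roughly like this. It is a minimal sketch, assuming the dataset is a regular (non-streaming) `datasets.Dataset` and that `out_dir` is a path every Ray node can read, such as shared storage. The function names and the `part-...` file naming are hypothetical. Writing several files instead of one is a design choice here: multiple files generally give Ray more units to parallelize over.

```python
import os


def shard_path(out_dir, index, num_shards):
    """File name for one shard, zero-padded so the files sort in order."""
    return os.path.join(out_dir, f"part-{index:05d}-of-{num_shards:05d}.parquet")


def export_to_parquet(hf_dataset, out_dir, num_shards=64):
    """Write a Hugging Face `datasets.Dataset` as several Parquet files."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for index in range(num_shards):
        # contiguous=True keeps each shard a contiguous slice of the rows.
        shard = hf_dataset.shard(num_shards=num_shards, index=index, contiguous=True)
        path = shard_path(out_dir, index, num_shards)
        shard.to_parquet(path)
        paths.append(path)
    return paths


def read_with_ray(out_dir):
    """Read every Parquet file under out_dir as one Ray Dataset."""
    # Imported lazily so the export helpers work without Ray installed.
    import ray.data

    return ray.data.read_parquet(out_dir)
```

A typical call sequence would be `export_to_parquet(ds, "/shared/my_dataset")` once, followed by `read_with_ray("/shared/my_dataset")` in the Ray job; the path shown is a placeholder.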

lhoestq · 2024-03-22T15:07:20Z

If it helps, you can stream a dataset as an IterableDataset and shard the stream in a distributed setup using

from datasets.distributed import split_dataset_by_node

ds = split_dataset_by_node(ds, rank=rank, world_size=world_size)

The documentation is here
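
The snippet above leaves `ds`, `rank`, and `world_size` undefined. One way to fill them in is sketched below, assuming the launcher exports `RANK` and `WORLD_SIZE` environment variables (as torchrun does) and that the dataset supports `streaming=True`. The helper names `rank_and_world_size` and `stream_shard` are hypothetical.

```python
def rank_and_world_size(env):
    """Read this worker's rank and the worker count from an environment mapping.

    Assumption: the launcher sets RANK and WORLD_SIZE; fall back to a
    single worker when they are absent.
    """
    rank = int(env.get("RANK", 0))
    world_size = int(env.get("WORLD_SIZE", 1))
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is out of range for world_size {world_size}")
    return rank, world_size


def stream_shard(dataset_name, env, split="train"):
    """Return the part of a streamed dataset that belongs to this worker."""
    # Imported lazily so the helper above works without `datasets` installed.
    from datasets import load_dataset
    from datasets.distributed import split_dataset_by_node

    rank, world_size = rank_and_world_size(env)
    # streaming=True returns an IterableDataset that is not downloaded up front.
    ds = load_dataset(dataset_name, split=split, streaming=True)
    return split_dataset_by_node(ds, rank=rank, world_size=world_size)
```

Each worker would call `stream_shard(name, os.environ)` and iterate over the result; some datasets also need a configuration name passed to `load_dataset`, which this sketch omits.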

raulchen added this to the Ray Data Benchmarks milestone Jul 20, 2023

raulchen added data Ray Data-related issues Ray 2.7 labels Jul 20, 2023

matthewdeng mentioned this issue Aug 1, 2023

[Data] Support reading Hugging Face dataset in distributed streaming way #37990

Closed

c21 added the P1 Issue that should be fixed within a few weeks label Aug 1, 2023

c21 assigned bveeramani Aug 1, 2023

scottjlee mentioned this issue Aug 14, 2023

[Data] Implement streamed read from Hugging Face Datasets #38432

Merged


zhe-thoughts closed this as completed in #38432 Aug 24, 2023

pcmoritz reopened this Aug 24, 2023

anyscalesam assigned pcmoritz and c21 and unassigned pcmoritz and bveeramani Nov 2, 2023

anyscalesam added ray 2.9 Issues targeting Ray 2.9 release (~Q4 CY2023) and removed Ray 2.7 labels Nov 2, 2023

anyscalesam added ray 2.10 and removed ray 2.9 Issues targeting Ray 2.9 release (~Q4 CY2023) labels Nov 13, 2023

anyscalesam added enhancement Request for new feature and/or capability and removed ray 2.10 labels Dec 8, 2023

omatthew98 mentioned this issue Jan 23, 2024

[Data] Distributed reads for from_huggingface #42599

Merged


anyscalesam removed this from the Ray Data Benchmarks milestone Jun 15, 2024
