[Data] Implement streamed read from Hugging Face Datasets #38432
Conversation
    self,
    parallelism: int,
) -> List[ReadTask]:
    # Note that `parallelism` arg is currently not used for HuggingFaceDatasource.
I thought we wanted to use `split_dataset_by_node` so that we can have distributed reads for the streaming case?
Yeah, that was the initial intent of the PR. We did further investigation into `split_dataset_by_node`, and it turned out that it doesn't actually shard the dataset: it reads the same dataset on each node and selects a subset to emulate sharding, so it provides no efficiency gains. That's why this PR uses only a single read task; we cannot distribute the dataset read.
We plan on opening an issue with the HF Datasets team to see whether they already have an existing way to accomplish this, or to request it as a new feature.
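For reference, a minimal sketch of what we tried (the dataset name is illustrative, not the one we benchmarked). When the shard count doesn't divide evenly across ranks, `split_dataset_by_node` makes every rank consume the full underlying stream and keep only its subset of examples:

import datasets
from datasets.distributed import split_dataset_by_node

# Illustrative dataset; any streaming dataset shows the same behavior.
hf_ds = datasets.load_dataset("allenai/c4", "en", split="train", streaming=True)

world_size = 4
per_node = [
    split_dataset_by_node(hf_ds, rank=rank, world_size=world_size)
    for rank in range(world_size)
]

# If n_shards is not evenly divisible by world_size, each rank iterates the
# full stream and keeps only every world_size-th example, so every node does
# the same amount of read work.
print(next(iter(per_node[0])).keys())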
Looking at the code, it seems there is logic to shard the base IterableDataset if it contains multiple files:
https://sourcegraph.com/github.com/huggingface/datasets/-/blob/src/datasets/iterable_dataset.py?L1231-1243
huggingface/datasets#5984
Only if the base dataset contains a single file does it not get sharded.
Example:
>>> hf_ds = datasets.load_dataset("openclimatefix/gfs-surface-pressure-2.0deg", split='train', streaming=True)
Using custom data configuration openclimatefix--gfs-surface-pressure-2.0deg-e3bd919c6fc2ba90
>>> print(hf_ds.n_shards)
39
I believe if `n_shards` is >= `parallelism`, then we should be able to do a proper distributed read.
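A rough sketch of that idea (illustrative, not the merged implementation): cap the task count at the shard count, then give each read task its own split of the stream. Note that `split_dataset_by_node` only assigns whole shards when the shard count divides evenly by the task count; otherwise it falls back to example-level filtering.

from datasets.distributed import split_dataset_by_node

def split_for_read_tasks(hf_iterable_ds, parallelism: int):
    # Never create more tasks than there are shards to hand out.
    num_tasks = min(parallelism, hf_iterable_ds.n_shards)
    # Each task would then iterate only its assigned split of the stream.
    return [
        split_dataset_by_node(hf_iterable_ds, rank=rank, world_size=num_tasks)
        for rank in range(num_tasks)
    ]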
@@ -2167,14 +2185,14 @@ def from_huggingface(dataset: "datasets.Dataset") -> MaterializedDataset:
    hf_ds_arrow = dataset.with_format("arrow")
    ray_ds = from_arrow(hf_ds_arrow[:])
Let's also add distributed reads for this case: create read tasks, and have each read task read a portion of `hf_ds_arrow`.
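Something like this sketch (a hypothetical helper, not actual PR code): compute disjoint row ranges up front, and have each read task materialize only its slice of the Arrow-formatted dataset.

def row_ranges(total_rows: int, parallelism: int):
    # Evenly sized, disjoint [start, end) row ranges covering the dataset.
    chunk = max(1, -(-total_rows // parallelism))  # ceiling division
    return [
        (start, min(start + chunk, total_rows))
        for start in range(0, total_rows, chunk)
    ]

# Each read task would then materialize only its slice, e.g.:
#   block = hf_ds_arrow[start:end]  # a pyarrow.Table under "arrow" format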
Wow, that sounds painful (both the dependencies and the speed are pretty abysmal). Do the Hugging Face datasets have a well-defined data format under the hood (like Parquet)? If yes, could we use that format directly to read them (maybe combined with some metadata we get from the huggingface library)? Just putting this idea out there; maybe it is a bad idea :D
Yeah, under the hood they use a memory-mapped Arrow table, so ideally we would be able to do distributed reads from that directly. When we looked into the datasets API for our implementation, though, the publicly available sharding methods didn't seem to split the dataset across nodes as intended. Amog dug around and potentially found some private APIs which may be helpful, but we decided to leave that for a future PR, since we'll need to ask the datasets developers some questions about them. This PR at the very least allows streaming reads for datasets which won't fit in memory, like the RedPajama dataset, so I figured it would be worthwhile to get in.
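For anyone curious, roughly how a single streaming read task can emit bounded-size Arrow blocks without materializing the whole dataset. This is a sketch under the assumption that examples are plain dicts, not the exact PR code; the batch size is arbitrary:

import pyarrow as pa

def iter_arrow_blocks(hf_iterable_ds, batch_size: int = 4096):
    # Accumulate examples from the stream and emit fixed-size Arrow blocks,
    # so peak memory is bounded by batch_size rather than dataset size.
    batch = []
    for example in hf_iterable_ds:
        batch.append(example)
        if len(batch) >= batch_size:
            yield pa.Table.from_pylist(batch)
            batch = []
    if batch:
        yield pa.Table.from_pylist(batch)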
# Due to HF Dataset's dynamic module system, we need to dynamically import the
# datasets_modules module on every actor when training.
# We accomplish this by simply running the following bit of code directly
# in module you are currently viewing. This ensures that when we
Suggested change:
- # in module you are currently viewing. This ensures that when we
+ # in the module you are currently viewing. This ensures that when we
# datasets_modules module on every actor when training.
# We accomplish this by simply running the following bit of code directly
# in module you are currently viewing. This ensures that when we
# unpickle the Dataset, it will be ran before pickle tries to
Suggested change:
- # unpickle the Dataset, it will be ran before pickle tries to
+ # unpickle the Dataset, it runs before pickle tries to
# in module you are currently viewing. This ensures that when we
# unpickle the Dataset, it will be ran before pickle tries to
# import datasets_modules and prevents an exception from being thrown.
# Same logic is present inside ray's TransformersTrainer and HF Transformers Ray
Suggested change:
- # Same logic is present inside ray's TransformersTrainer and HF Transformers Ray
+ # Same logic is present inside Ray's TransformersTrainer and HF Transformers Ray
python/ray/data/read_api.py (Outdated)
        return read_datasource(
            HuggingFaceDatasource(),
            dataset=dataset,
        )
    if isinstance(dataset, datasets.Dataset):
        # To get the resulting Arrow table from a Hugging Face Dataset after
        # applying transformations (e.g. train_test_split(), shard(), select()),
Suggested change:
- # applying transformations (e.g. train_test_split(), shard(), select()),
+ # applying transformations (e.g., train_test_split(), shard(), select()),
python/ray/data/read_api.py (Outdated)
"Dataset. To convert just a single Hugging Face Dataset to a " | ||
"Ray Dataset, specify a split. For example, " | ||
"`ray.data.from_huggingface(my_dataset_dictionary" | ||
"You provided a Hugging Face DatasetDict or IterableDatasetDict " |
"You provided a Hugging Face DatasetDict or IterableDatasetDict " | |
"You provided a Hugging Face DatasetDict or IterableDatasetDict, " |
Just some copy edit nits.
CI all looks good here, cc @zhe-thoughts for approval.
OK to merge
Why are these changes needed?
The current implementation of ray.data.from_huggingface materializes all data in memory. This PR implements a streaming (but not distributed) read to support efficient reads for large datasets. The implementation in this PR uses a single read task to stream data from the Hugging Face Dataset into Ray Data.
Related issue number
Closes #37591, Closes #37990
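A minimal usage sketch of the new streaming path (the dataset name is illustrative; any large streaming dataset works):

import datasets
import ray

hf_ds = datasets.load_dataset("allenai/c4", "en", split="train", streaming=True)
ray_ds = ray.data.from_huggingface(hf_ds)  # streams via a single read task
print(ray_ds.take(1))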
Checks
- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.