
[data] Write each block to a separate file #37986

Merged

Conversation

stephanie-wang
Contributor

Why are these changes needed?

During file-based write tasks, write each block to a separate file so that we avoid needing to keep all blocks in memory at the same time.
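The approach can be sketched outside Ray: stream blocks to per-block files as they are produced, so at most one block is materialized at a time. This is a minimal illustration only; `write_blocks_separately`, `open_file`, and the in-memory `files` dict are stand-ins, not the actual Ray internals.

```python
import io

files = {}  # in-memory stand-in for a filesystem, for demonstration

def open_file(block_index):
    # Returns a fresh writable "file" for one block; its contents are
    # captured into `files` when the file is closed.
    class Sink(io.BytesIO):
        def close(self):
            files[block_index] = self.getvalue()
            super().close()
    return Sink()

def write_blocks_separately(blocks, open_file):
    # One file per block: each block can be dropped from memory as soon
    # as its file is closed, instead of buffering all blocks until the
    # end of the write task.
    num_written = 0
    for block_index, block in enumerate(blocks):
        with open_file(block_index) as f:
            f.write(block)
        num_written += 1
    return num_written

# A generator keeps only the current block alive at any point.
n = write_blocks_separately((c * 4 for c in (b"a", b"b", b"c")), open_file)
```

Because `blocks` is consumed lazily, peak memory is bounded by the largest single block rather than the sum of all blocks in the task.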

Related issue number

Closes #37948.

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
Member

@bveeramani bveeramani left a comment

Overall LGTM. Just had a few questions

@@ -159,13 +159,12 @@ def _get_write_path_for_block(
*,
Member

Nit: (I know this isn't directly related to this PR) test_block_write_path_provider is a confusing name for a fixture because it sounds like a unit test. Something like mock_block_write_path_provider and MockBlockWritePathProvider might be clearer.

python/ray/data/_internal/stats.py
Comment on lines +141 to +144
if block_index is not None:
suffix = f"{dataset_uuid}_{task_index:06}_{block_index:06}.{file_format}"
else:
suffix = f"{dataset_uuid}_{task_index:06}.{file_format}"
Member

Maybe dumb question, but why do we encode information like task_index at all? Like, why don't we do something like suffix = f"{random_uuid}.{file_format}"?

Contributor Author

We have to do this to make sure each task writes to different files. That's also why I added the additional block_index in this PR. I'll add a comment.

Member

Wouldn't a random file name also guarantee that we write to different files?

Contributor

You don't want duplicate data on retries. Also, it's harder to debug with random IDs.
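To make the retry argument concrete, here is a standalone sketch that mirrors the quoted suffix logic above; the helper name `block_write_path` is illustrative, not Ray's API.

```python
def block_write_path(dataset_uuid, task_index, file_format, block_index=None):
    # Deterministic, per-task (and per-block) names: concurrent write
    # tasks never collide, and a retried task regenerates the exact same
    # path, overwriting its earlier partial output instead of leaving a
    # duplicate file behind (as a random UUID per attempt would).
    if block_index is not None:
        return f"{dataset_uuid}_{task_index:06}_{block_index:06}.{file_format}"
    return f"{dataset_uuid}_{task_index:06}.{file_format}"

first_attempt = block_write_path("ds", 3, "parquet", block_index=7)
retry = block_write_path("ds", 3, "parquet", block_index=7)
```

Since `first_attempt == retry`, a re-executed task is idempotent at the filesystem level, and the structured name also makes it easy to trace a file back to the task and block that produced it.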

@@ -369,6 +379,91 @@ def foo(batch):
assert ds.count() == num_blocks_per_task


def _test_write_large_data(
Member

Rather than writing several distinct unit tests, would it make sense to parametrize this function?

Contributor Author

I think it's pretty much the same thing; is it okay if I keep it like this?

Member

IMO explicitly parametrized tests with @pytest.mark.parametrize are easier to read, but I won't block on it.
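For reference, a minimal sketch of the @pytest.mark.parametrize style being suggested; the parameter names and the placeholder assertion are illustrative, not the actual test body from this PR.

```python
import pytest

# Each tuple becomes its own test case, reported separately by pytest,
# instead of several near-identical hand-written test functions.
@pytest.mark.parametrize(
    "num_rows_per_block, num_blocks_per_task",
    [(1, 1), (1_000, 10), (100_000, 100)],
)
def test_write_large_data(num_rows_per_block, num_blocks_per_task):
    # Placeholder check standing in for the real write-and-verify logic.
    total_rows = num_rows_per_block * num_blocks_per_task
    assert total_rows >= num_blocks_per_task
```

A failing case then shows up in the report with its parameters, e.g. `test_write_large_data[1000-10]`, which keeps each scenario individually visible.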

@stephanie-wang stephanie-wang added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Aug 3, 2023
Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
@stephanie-wang stephanie-wang merged commit d09e7c5 into ray-project:master Aug 4, 2023
57 of 63 checks passed
@stephanie-wang stephanie-wang deleted the write-multiple-files branch August 4, 2023 15:16
bveeramani added a commit that referenced this pull request Aug 10, 2023
In FileBasedDatasource.write, we wrap then immediately unwrap the filesystem object. Since there's no point in wrapping the filesystem object, this PR removes the line. For context, this logic is likely from legacy code that was refactored by #37986.

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Labels
@author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer.

Successfully merging this pull request may close these issues.

[data] OOM and churning workers when scaling ray.data range(N).map_batches().write_parquet()
4 participants