-
Notifications
You must be signed in to change notification settings - Fork 5.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[data] Write each block to a separate file #37986
[data] Write each block to a separate file #37986
Conversation
Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM. Just had a few questions
@@ -159,13 +159,12 @@ def _get_write_path_for_block( | |||
*, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: (I know this isn't directly related to this PR) test_block_write_path_provider
is a confusing name for a fixture because it sounds like a unit test. Something like mock_block_write_path_provider
and MockBlockWritePathProvider
might clearer.
if block_index is not None: | ||
suffix = f"{dataset_uuid}_{task_index:06}_{block_index:06}.{file_format}" | ||
else: | ||
suffix = f"{dataset_uuid}_{task_index:06}.{file_format}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe dumb question, but why do we encode information like task_index
at all? Like, why don't we do something like suffix = "{random_uuid}.{file_format}
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have to do this to make sure each task writes to different files. That's also why I added the additional block_index in this PR. I'll add a comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't a random file name also guarantee that we write to different files?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't want duplicate data on retries. Also, it's harder to debug with random IDs.
@@ -369,6 +379,91 @@ def foo(batch): | |||
assert ds.count() == num_blocks_per_task | |||
|
|||
|
|||
def _test_write_large_data( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than writing several distinct unit tests, would it make sense to parametrize this function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's pretty much the same thing; is it okay if I keep it like this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO explicitly parametrized tests with @pytest.mark.parametrize
are easier to read, but I won't block on it.
In FileBasedDatasource.write, we wrap then immediately unwrap the filesystem object. Since there's no point in wrapping the filesystem object, this PR removes the line. For context, this logic is likely from legacy code that was refactored by #37986. Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
In FileBasedDatasource.write, we wrap then immediately unwrap the filesystem object. Since there's no point in wrapping the filesystem object, this PR removes the line. For context, this logic is likely from legacy code that was refactored by ray-project#37986. Signed-off-by: Balaji Veeramani <balaji@anyscale.com> Signed-off-by: Shreyas Krishnaswamy <shrekris@anyscale.com>
During file-based write tasks, write each block to a separate file so that we avoid needing to keep all blocks in memory at the same time. Related issue number Closes ray-project#37948. --------- Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu> Signed-off-by: NripeshN <nn2012@hw.ac.uk>
In FileBasedDatasource.write, we wrap then immediately unwrap the filesystem object. Since there's no point in wrapping the filesystem object, this PR removes the line. For context, this logic is likely from legacy code that was refactored by ray-project#37986. Signed-off-by: Balaji Veeramani <balaji@anyscale.com> Signed-off-by: NripeshN <nn2012@hw.ac.uk>
During file-based write tasks, write each block to a separate file so that we avoid needing to keep all blocks in memory at the same time. Related issue number Closes ray-project#37948. --------- Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu> Signed-off-by: harborn <gangsheng.wu@intel.com>
In FileBasedDatasource.write, we wrap then immediately unwrap the filesystem object. Since there's no point in wrapping the filesystem object, this PR removes the line. For context, this logic is likely from legacy code that was refactored by ray-project#37986. Signed-off-by: Balaji Veeramani <balaji@anyscale.com> Signed-off-by: harborn <gangsheng.wu@intel.com>
During file-based write tasks, write each block to a separate file so that we avoid needing to keep all blocks in memory at the same time. Related issue number Closes ray-project#37948. --------- Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
In FileBasedDatasource.write, we wrap then immediately unwrap the filesystem object. Since there's no point in wrapping the filesystem object, this PR removes the line. For context, this logic is likely from legacy code that was refactored by ray-project#37986. Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
During file-based write tasks, write each block to a separate file so that we avoid needing to keep all blocks in memory at the same time. Related issue number Closes ray-project#37948. --------- Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu> Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
In FileBasedDatasource.write, we wrap then immediately unwrap the filesystem object. Since there's no point in wrapping the filesystem object, this PR removes the line. For context, this logic is likely from legacy code that was refactored by ray-project#37986. Signed-off-by: Balaji Veeramani <balaji@anyscale.com> Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
During file-based write tasks, write each block to a separate file so that we avoid needing to keep all blocks in memory at the same time. Related issue number Closes ray-project#37948. --------- Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu> Signed-off-by: Victor <vctr.y.m@example.com>
In FileBasedDatasource.write, we wrap then immediately unwrap the filesystem object. Since there's no point in wrapping the filesystem object, this PR removes the line. For context, this logic is likely from legacy code that was refactored by ray-project#37986. Signed-off-by: Balaji Veeramani <balaji@anyscale.com> Signed-off-by: Victor <vctr.y.m@example.com>
Why are these changes needed?
During file-based write tasks, write each block to a separate file so that we avoid needing to keep all blocks in memory at the same time.
Related issue number
Closes #37948.
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.