[data] Normalize upstream blocks for zip/map/reduce/etc. operations using inferred block accessors #39960
base: master
Signed-off-by: Michael Huang <michaelhly@gmail.com>
this will make our outputs much more consistent, thanks!
Also left a related comment here, we could even move the implementation from that PR into this one if it's easier (or wait for this one to merge, then update the other PR).
@@ -95,7 +97,7 @@ def sample_boundaries(
 samples = sample_bar.fetch_until_complete(sample_results)
 sample_bar.close()
 del sample_results
-samples = [s for s in samples if len(s) > 0]
+samples = normalize_blocks([s for s in samples if len(s) > 0])
normalize here b/c the later to_numpy call can be ambiguous
@@ -82,7 +83,7 @@ def reduce(
 # TODO: Support fusion with other downstream operators.
 stats = BlockExecStats.builder()
 builder = DelegatingBlockBuilder()
-for block in mapper_outputs:
+for block in normalize_blocks(mapper_outputs):
normalize here b/c the later random_shuffle call can be ambiguous
@@ -80,7 +81,7 @@ def reduce(
 ) -> (Block, BlockMetadata):
 stats = BlockExecStats.builder()
 builder = DelegatingBlockBuilder()
-for block in mapper_outputs:
+for block in normalize_blocks(mapper_outputs):
normalize here b/c the later random_shuffle call can be ambiguous
@@ -165,7 +167,7 @@ def sample_boundaries(
 if should_close_bar:
     sample_bar.close()
 del sample_results
-samples = [s for s in samples if len(s) > 0]
+samples = normalize_blocks([s for s in samples if len(s) > 0])
normalize here b/c the later to_numpy call can be ambiguous
Hi @c21. Would you mind taking a look at this?
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
Signed-off-by: Michael Huang <michaelhly@gmail.com>
Why are these changes needed?
When reducing mapped outputs (or zipping datasets), the BlockAccessor is inferred from the first block of mapper_outputs. However, mapper_outputs can inadvertently contain heterogeneous block types due to:
a.) ray/python/ray/data/block.py, lines 366 to 373 in 3135323
b.) ray/python/ray/data/_internal/delegating_block_builder.py, lines 20 to 28 in 08aa138
We want to normalize heterogeneous blocks in mapper_outputs to Arrow blocks before using the BlockAccessor for the first block to compute reduced results.
Related issue number
Closes #39155
Closes #39206
Closes #39291
Closes #31550
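The problem described under "Why are these changes needed?" can be illustrated with a small, self-contained sketch. This is not the code from this PR and does not use Ray: the two block formats below (a list of row dicts and a dict of column lists) are hypothetical stand-ins for heterogeneous block types, and `normalize_blocks_sketch`, `to_column_block`, and `total_rows` are illustrative names only. The sketch shows how logic chosen from the first block alone can silently misread a later block of a different format, and how converting everything to one format first avoids that.

```python
from typing import Dict, List, Union

# Hypothetical stand-ins for two block formats (NOT Ray's real block types).
RowBlock = List[Dict[str, int]]     # row-oriented: one dict per row
ColumnBlock = Dict[str, List[int]]  # column-oriented: one list per column
Block = Union[RowBlock, ColumnBlock]


def to_column_block(block: Block) -> ColumnBlock:
    """Convert either toy format to the column-oriented one."""
    if isinstance(block, dict):
        return block  # already column-oriented
    columns: ColumnBlock = {}
    for row in block:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def normalize_blocks_sketch(blocks: List[Block]) -> List[Block]:
    """Leave homogeneous input untouched; otherwise convert every block
    to the column-oriented format so all blocks share one type."""
    if len({type(b) for b in blocks}) <= 1:
        return blocks
    return [to_column_block(b) for b in blocks]


def total_rows(blocks: List[Block]) -> int:
    """Count rows using the access pattern inferred from the FIRST block only,
    mirroring the 'infer the accessor from the first block' behaviour."""
    if not blocks:
        return 0
    if isinstance(blocks[0], dict):
        # Column-oriented: row count is the length of any one column.
        return sum(len(next(iter(b.values()), [])) for b in blocks)
    # Row-oriented: row count is the length of the list.
    return sum(len(b) for b in blocks)


# Two rows in a row block, three rows in a column block: five rows in total.
mixed: List[Block] = [[{"x": 1}, {"x": 2}], {"x": [3, 4, 5]}]

wrong = total_rows(mixed)                           # second block is misread
right = total_rows(normalize_blocks_sketch(mixed))  # all blocks share a format
```

In the mixed case the row-oriented logic takes `len()` of the column block and counts its columns rather than its rows, so the total is wrong without any error being raised. Normalizing up front keeps the "infer from the first block" shortcut valid for every block that follows.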
Checks
I've signed off every commit (git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
If I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.