[Data] Preserve block format on `map_batches` over empty blocks #38161

amogkam · 2023-08-07T02:53:51Z

Preserve the original block format when calling map_batches over empty blocks. Previously, map_batches would default to always outputting an Arrow block regardless of the format of the input empty blocks. Changing the underlying block format for empty blocks can lead to the dataset having multiple block formats, which does not work for sort or aggregation operations.
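To make the behavior change concrete, here is a minimal pure-Python sketch. It does not use Ray's actual classes: `ToyBlock`, `toy_map_batches`, the `preserve_format` flag, and `DEFAULT_FMT` are hypothetical stand-ins invented for illustration. It only models the idea that an empty input block used to yield an output in the default (Arrow) format, and now keeps the format of its input.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, List

# Assumed default output format, mirroring the "always Arrow" behavior described above.
DEFAULT_FMT = "arrow"


@dataclass
class ToyBlock:
    """Hypothetical stand-in for a block: a format tag plus its rows."""
    fmt: str
    rows: List[dict] = field(default_factory=list)


def toy_map_batches(
    blocks: Iterable[ToyBlock],
    fn: Callable[[dict], dict],
    preserve_format: bool = True,
) -> Iterator[ToyBlock]:
    """Apply fn to every row; choose the output format per block."""
    for block in blocks:
        out_rows = [fn(row) for row in block.rows]
        if out_rows or preserve_format:
            # New behavior: the output keeps the input block's format,
            # even when there are no rows to infer a format from.
            out_fmt = block.fmt
        else:
            # Old behavior: an empty block falls back to the default format.
            out_fmt = DEFAULT_FMT
        yield ToyBlock(out_fmt, out_rows)


def double(row: dict) -> dict:
    return {"x": row["x"] * 2}


blocks = [ToyBlock("pandas", [{"x": 1}]), ToyBlock("pandas", [])]
old_formats = {b.fmt for b in toy_map_batches(blocks, double, preserve_format=False)}
new_formats = {b.fmt for b in toy_map_batches(blocks, double, preserve_format=True)}
```

In this toy model the old path leaves the dataset with two formats, while the new path leaves it with one. The mixed-format case is the condition the description says sort and aggregation cannot handle.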

Why are these changes needed?

Related issue number

Closes #37963

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

Signed-off-by: amogkam <amogkamsetty@yahoo.com>

c21 · 2023-08-07T16:55:49Z

python/ray/data/_internal/planner/map_batches.py

@@ -105,6 +106,12 @@ def process_next_batch(batch: DataBatch) -> Iterator[Block]:
                else:
                    raise e from None

+        try:
+            first_block = next(blocks)


This would hold an extra block in memory during execution, right? To avoid the increased memory overhead, can we create the corresponding output block builder here?

No I don't think it will hold any extra memory? We're not copying the block anywhere. There's always just one reference to the block.

Hmm, but in this case the first block always lives in object store memory until map_batches finishes, right? Because we hold the reference here.
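The pattern under discussion (peek at the first block with `next(blocks)`, then continue iterating) can be sketched in plain Python. This is not the code from the PR: `peek_first` and `map_preserving_type` are hypothetical helpers, and built-in `tuple`/`list` values stand in for block formats. The sketch keeps only the type of the first block and drops the local reference to the block itself, which is one way to read the reviewer's suggestion; whether holding the reference matters for Ray's object store is the open question in the thread, and this example does not settle it.

```python
import itertools
from typing import Callable, Iterable, Iterator, Optional, Tuple, TypeVar

T = TypeVar("T")


def peek_first(items: Iterable[T]) -> Tuple[Optional[T], Iterator[T]]:
    """Return the first item and an iterator that still yields every item."""
    it = iter(items)
    try:
        first = next(it)
    except StopIteration:
        # Empty input: nothing to peek at, nothing to iterate.
        return None, iter(())
    # Put the peeked item back in front of the remaining items.
    return first, itertools.chain([first], it)


def map_preserving_type(blocks: Iterable, fn: Callable) -> Iterator:
    """Apply fn to each block and convert results to the input block type."""
    first, blocks = peek_first(blocks)
    if first is None:
        return
    # Remember only the lightweight type information, then drop the
    # local reference to the first block.
    block_type = type(first)
    del first
    for block in blocks:
        # fn returns a list (the "default" format in this toy model);
        # converting back keeps empty outputs in the input format too.
        yield block_type(fn(block))


def double_all(block):
    return [x * 2 for x in block]


results = list(map_preserving_type([(), (1, 2), ()], double_all))
empty_results = list(map_preserving_type([], double_all))
```

Here every output, including the ones produced from empty inputs, comes back as a `tuple`, matching the type of the first input block.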

python/ray/data/_internal/delegating_block_builder.py

python/ray/data/_internal/output_buffer.py


c21

LG

…project#38161)

Preserve the original block format when calling map_batches over empty blocks. Previously, map_batches would default to always outputting an Arrow block regardless of the format of the input empty blocks. Changing the underlying block format for empty blocks can lead to the dataset having multiple block formats, which does not work for sort or aggregation operations.

Signed-off-by: amogkam <amogkamsetty@yahoo.com>

This commit message appears five times in the timeline. Four of the copies carry one additional sign-off each:

Signed-off-by: NripeshN <nn2012@hw.ac.uk>
Signed-off-by: harborn <gangsheng.wu@intel.com>
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
Signed-off-by: Victor <vctr.y.m@example.com>

fix

1225304

Signed-off-by: amogkam <amogkamsetty@yahoo.com>

amogkam requested review from ericl, scv119, c21, scottjlee, bveeramani and raulchen as code owners August 7, 2023 02:53

amogkam assigned c21 Aug 7, 2023

c21 reviewed Aug 7, 2023


amogkam added 5 commits August 7, 2023 13:48

fix

dce4415

Signed-off-by: amogkam <amogkamsetty@yahoo.com>

fix

f02290b

Signed-off-by: amogkam <amogkamsetty@yahoo.com>

update

5e3e4c6

Signed-off-by: amogkam <amogkamsetty@yahoo.com>

update

2da0f41

Signed-off-by: amogkam <amogkamsetty@yahoo.com>

update

f4ba867

Signed-off-by: amogkam <amogkamsetty@yahoo.com>

c21 approved these changes Aug 7, 2023


Merge branch 'master' into map-batches-empty-block

e34f2ac

amogkam merged commit 18bf299 into ray-project:master Aug 8, 2023
58 of 63 checks passed

amogkam deleted the map-batches-empty-block branch August 8, 2023 03:14

scottjlee mentioned this pull request Aug 25, 2023

[Data] Aggregating on dataset with mixture of Block types causes pyarrow AttributeError #37963

Closed


amogkam commented Aug 7, 2023 (edited by scottjlee)
