[Data] Preserve block format on map_batches over empty blocks #38161
@@ -105,6 +106,12 @@ def process_next_batch(batch: DataBatch) -> Iterator[Block]:
                else:
                    raise e from None

        try:
            first_block = next(blocks)
This would hold an extra block in memory during execution, right? To avoid the increased memory overhead, can we create the corresponding output block builder here?
No, I don't think it will hold any extra memory. We're not copying the block anywhere; there's always just one reference to the block.
Hmm, but in this case the first block always lives in object store memory until map_batches is finished, right? Because we hold the reference here.
LG
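The concern in this thread is whether peeking at the first block keeps it alive longer than necessary. As a minimal sketch of the general pattern (not the code in this PR), the helper below reads the first item only to learn its type and then hands it straight back through a wrapper iterator, so the helper keeps no lasting reference of its own. The name `peek_first_type` is hypothetical, and plain Python objects stand in for blocks.

```python
from itertools import chain
from typing import Iterable, Iterator, Optional, Tuple, TypeVar

T = TypeVar("T")


def peek_first_type(items: Iterable[T]) -> Tuple[Optional[type], Iterator[T]]:
    """Return the type of the first item and an iterator over all items.

    Hypothetical helper: the first item is consumed only to inspect its
    type, then re-attached to the front of the returned iterator.
    """
    it = iter(items)
    try:
        first = next(it)
    except StopIteration:
        # No items at all: nothing to inspect, return an empty iterator.
        return None, iter(())
    # `first` goes out of scope when this function returns; only the
    # one-element tuple inside `chain` still refers to it.
    return type(first), chain((first,), it)


block_type, blocks = peek_first_type(iter([[1, 2], [3], []]))
collected = list(blocks)
```

With this shape the caller still sees every item in the original order, and the only remaining reference to the first item is the one the caller receives when it iterates.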
…project#38161) Preserve the original block format when calling map_batches over empty blocks. Previously, map_batches would default to always outputting an Arrow block regardless of the format of the input empty blocks. Changing the underlying block format for empty blocks can lead to the dataset having multiple block formats, which does not work for sort or aggregation operations. --------- Signed-off-by: amogkam <amogkamsetty@yahoo.com> Signed-off-by: NripeshN <nn2012@hw.ac.uk>
Why are these changes needed?
Preserve the original block format when calling map_batches over empty blocks. Previously, map_batches would default to always outputting an Arrow block regardless of the format of the input empty blocks. Changing the underlying block format for empty blocks can lead to the dataset having multiple block formats, which does not work for sort or aggregation operations.
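To illustrate the behavior described above, here is a small self-contained sketch that does not use Ray. `ArrowLikeBlock` and `PandasLikeBlock` are hypothetical stand-ins for two block formats, and both mapping functions are assumptions written for illustration rather than the library's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Iterator, List, Union

Row = Dict[str, int]


@dataclass
class ArrowLikeBlock:
    """Hypothetical stand-in for an Arrow-format block."""
    rows: List[Row] = field(default_factory=list)


@dataclass
class PandasLikeBlock:
    """Hypothetical stand-in for a pandas-format block."""
    rows: List[Row] = field(default_factory=list)


Block = Union[ArrowLikeBlock, PandasLikeBlock]


def map_default_arrow(blocks: Iterable[Block], fn: Callable[[Row], Row]) -> Iterator[Block]:
    """Mimics the previous behavior: empty input always yields an Arrow-like block."""
    for block in blocks:
        if not block.rows:
            yield ArrowLikeBlock()
        else:
            yield type(block)([fn(row) for row in block.rows])


def map_preserving_format(blocks: Iterable[Block], fn: Callable[[Row], Row]) -> Iterator[Block]:
    """Mimics the intended behavior: output format follows the input block."""
    for block in blocks:
        yield type(block)([fn(row) for row in block.rows])


def double(row: Row) -> Row:
    return {key: value * 2 for key, value in row.items()}


blocks = [PandasLikeBlock(), PandasLikeBlock([{"x": 1}])]
old_formats = {type(b).__name__ for b in map_default_arrow(blocks, double)}
new_formats = {type(b).__name__ for b in map_preserving_format(blocks, double)}
```

In this sketch `old_formats` ends up containing both format names, which is the mixed-format situation that breaks sort and aggregation, while `new_formats` contains only the input format.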
Related issue number
Closes #37963
Checks
I've signed off every commit (git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
If I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.