Rebatch undersized batches when rebuilding partitions #2583
Merged
Conversation
This is the next step to passively optimizing VAST deployments in the background. We've measured that partitions containing undersized batches are significantly slower to query than those with optimally sized batches. This commit fixes that problem by making rebuild recreate undersized batches so that all output slices (except for potentially the last one) are optimally sized.

Implementation-wise, this turned out to be a bit more complicated than I had anticipated, for two reasons:
1. Arrow's API to append record batches does not support extension types; this was neither documented, nor does an extension point for it exist.
2. We do not have a column-major builder API for table slices.

I ended up having to implement three different ways to append columns:
- For record types, we need to iterate over all fields and recursively call the outer loop that adds the current slice.
- For basic types that are not modeled as extension types, we can use the existing Arrow API.
- For all other types, i.e., basic types that are modeled as extension types and complex types, we need to slice and append ourselves.
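The rebatching policy itself — coalescing undersized batches so that every output slice except possibly the last has the optimal size — can be sketched as follows. This is a hypothetical illustration in Python over plain row lists, not the actual C++ implementation; `target` stands in for the configured optimal batch size.

```python
def rebatch(batches, target):
    """Re-slice a stream of batches into batches of exactly `target` rows.

    Only the final batch may be undersized, mirroring the guarantee that
    rebuild gives for output slices.
    """
    buffer = []
    for batch in batches:
        buffer.extend(batch)
        # Emit full-sized slices as soon as enough rows have accumulated.
        while len(buffer) >= target:
            yield buffer[:target]
            buffer = buffer[target:]
    if buffer:
        # The trailing slice may legitimately stay undersized.
        yield buffer
```

For example, rebatching input batches of sizes 3, 2, and 4 with a target of 4 yields output batches of sizes 4, 4, and 1.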
dominiklohmann added the performance (Improvements or regressions of performance) label and removed the feature (New functionality) label on Sep 18, 2022
dominiklohmann force-pushed the story/sc-37173/rebatch branch from b0810dd to ddd9dbb on September 18, 2022 16:52
dominiklohmann force-pushed the story/sc-37173/rebatch branch from ddd9dbb to ab800d7 on September 18, 2022 16:59
dominiklohmann commented on Sep 20, 2022
Notes from a review in a call with @tobim.
tobim approved these changes on Sep 23, 2022
Reviewed in a pairing session and tested on the Tenzir testbed. Re-batching appears to work as expected.
As of writing this description, I still need to add tests and document the change, but the change to rebuilding can already be reviewed. In terms of documentation, I think a changelog entry suffices. It doesn't change any existing behavior and is not configurable, so there's no need to document it further, as it's purely internal for now.
For testing, I added a small Python script that prints the batch size of every record batch exported via `vast export arrow`, and used it before and after rebuilding so we can see the impact of this PR.
📝 Reviewer Checklist

Review this pull request by ensuring the following items: