Rebatch undersized batches when rebuilding partitions #2583

Merged
merged 10 commits into from Sep 23, 2022

Conversation

dominiklohmann
Member

@dominiklohmann dominiklohmann commented Sep 18, 2022

This is the next step towards passively optimizing VAST deployments in the background. We've measured that partitions containing undersized batches are significantly slower to query than those with optimally sized batches. This commit fixes that problem by making rebuild recreate undersized batches so that all output slices (except potentially the last one) are optimally sized.
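The re-batching idea can be illustrated with a minimal sketch. This is not the code from this PR (which is C++ and operates on Arrow record batches); it models a batch as a plain Python list of rows, and the names `rebatch` and `target_size` are hypothetical.

```python
from typing import Iterable, Iterator, List, Sequence


def rebatch(batches: Iterable[Sequence], target_size: int) -> Iterator[List]:
    """Yield batches of exactly target_size rows, except possibly the last.

    Hypothetical helper: a batch is modeled as a sequence of rows.
    """
    if target_size <= 0:
        raise ValueError("target_size must be positive")
    buffer: List = []
    for batch in batches:
        buffer.extend(batch)
        # Emit as many full batches as the buffered rows allow.
        while len(buffer) >= target_size:
            yield buffer[:target_size]
            buffer = buffer[target_size:]
    # Only the final batch may be undersized.
    if buffer:
        yield buffer


undersized = [[1, 2], [3], [4, 5, 6], [7]]
print([len(b) for b in rebatch(undersized, target_size=3)])  # prints [3, 3, 1]
```

Row order is preserved, and every output batch except the trailing remainder has the target size.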

Implementation-wise, this turned out to be a bit more complicated than I had anticipated for two reasons:

  1. Arrow's API to append record batches does not support extension types. This is not documented, and there is no extension point for it.
  2. We do not have a column-major builder API for table slices.

I ended up having to implement three different ways to append columns:

  • For record types we need to iterate over all fields and recursively call the outer loop that appends the current slice.
  • For basic types that are not modeled as extension types we can use the existing Arrow API.
  • For all other types, i.e., basic types that are modeled as extension types and complex types, we need to slice and append ourselves.
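The three cases above can be sketched as a recursive dispatch. This is a toy model under stated assumptions, not the implementation in rebuild.cpp: `Column`, its `kind` tags, and `append_slice` are hypothetical, and the "basic" branch merely stands in for the bulk append that the Arrow API provides.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Column:
    """Hypothetical stand-in for a column builder."""
    kind: str  # "record", "basic", or "other" (extension-backed or complex)
    values: List = field(default_factory=list)
    fields: Dict[str, "Column"] = field(default_factory=dict)


def append_slice(dst: Column, src: Column, begin: int, end: int) -> None:
    """Append rows [begin, end) of src to dst, dispatching on the column kind."""
    if dst.kind != src.kind:
        raise TypeError("column kinds do not match")
    if src.kind == "record":
        # Case 1: recurse into every field of the record.
        for name, child in src.fields.items():
            append_slice(dst.fields[name], child, begin, end)
    elif src.kind == "basic":
        # Case 2: a bulk append, standing in for the existing Arrow API.
        dst.values.extend(src.values[begin:end])
    else:
        # Case 3: slice and append element by element ourselves.
        for i in range(begin, end):
            dst.values.append(src.values[i])


src = Column("record", fields={
    "count": Column("basic", values=[1, 2, 3]),
    "addr": Column("other", values=["10.0.0.1", "10.0.0.2", "10.0.0.3"]),
})
dst = Column("record", fields={"count": Column("basic"), "addr": Column("other")})
append_slice(dst, src, 1, 3)
print(dst.fields["count"].values, dst.fields["addr"].values)
```

The recursion bottoms out at leaf columns, so nested records need no special handling beyond the first branch.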

As of writing this description I still need to add tests and document the change, but the change to rebuilding can already be reviewed.

In terms of documentation, I think a changelog entry suffices. This doesn't change anything about the existing behavior and is not configurable, so there's no need to document it further; it's purely an internal change for now.

For testing, I added a small Python script that prints the size of every record batch exported by vast export arrow, and ran it before and after rebuilding so we can see the impact of this PR.
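The script itself is not shown in this description. A rough sketch of what such a script could look like follows, assuming that the export arrives on stdin as a single Arrow IPC stream and that the third-party pyarrow package is installed; the names `batch_sizes` and `main` are hypothetical.

```python
import sys
from typing import Iterable, Iterator


def batch_sizes(batches: Iterable) -> Iterator[int]:
    """Yield the row count of each record batch (any object with num_rows)."""
    for batch in batches:
        yield batch.num_rows


def main() -> None:
    # Imported lazily so that batch_sizes stays usable without pyarrow.
    import pyarrow as pa

    # Assumption: stdin carries exactly one Arrow IPC stream.
    reader = pa.ipc.open_stream(sys.stdin.buffer)
    for size in batch_sizes(reader):
        print(size)
```

Calling `main()` at the end of the file and piping the export into the script would print one number per batch. If the export concatenates several IPC streams, the reader would need to be reopened until stdin is exhausted.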

📝 Reviewer Checklist

Review this pull request by ensuring the following items:

  • All user-facing changes have changelog entries
  • User-facing changes are reflected on vast.io

@dominiklohmann dominiklohmann added the feature New functionality label Sep 18, 2022
@dominiklohmann dominiklohmann added performance Improvements or regressions of performance and removed feature New functionality labels Sep 18, 2022
Member Author

@dominiklohmann dominiklohmann left a comment

Notes from a review in a call with @tobim.

libvast/native-plugins/rebuild.cpp (resolved)
libvast/native-plugins/rebuild.cpp (outdated, resolved)
@dominiklohmann dominiklohmann removed the request for review from dispanser September 20, 2022 15:23
Member

@tobim tobim left a comment

Reviewed in pairing session and tested on the Tenzir testbed. Apparently re-batching is working as expected.

@dominiklohmann dominiklohmann merged commit c571174 into master Sep 23, 2022
@dominiklohmann dominiklohmann deleted the story/sc-37173/rebatch branch September 23, 2022 14:54
Labels
performance Improvements or regressions of performance
2 participants