Rebatch undersized batches when rebuilding partitions #2583
Merged
Conversation
This is the next step to passively optimizing VAST deployments in the background. We've measured that partitions containing undersized batches are significantly slower to query than those with optimally sized batches. This commit fixes that problem by making rebuild recreate undersized batches so that all output slices (except for potentially the last one) are optimally sized.

Implementation-wise, this turned out to be a bit more complicated than I had anticipated, for two reasons:
1. Arrow's API to append record batches does not support extension types; this was neither documented, nor does an extension point for it exist.
2. We do not have a column-major builder API for table slices.

I ended up having to implement three different ways to append columns:
- For record types, we need to iterate over all fields and recursively call the outer loop that adds the current slice.
- For basic types that are not modeled as extension types, we can use the existing Arrow API.
- For all other types, i.e., basic types that are modeled as extension types and complex types, we need to slice and append ourselves.
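The rebatching policy itself — coalescing undersized batches so that every output slice except possibly the last has the optimal size — can be sketched as follows. This is a hypothetical illustration in Python over plain row lists, not the actual C++ implementation; `target` stands in for the configured optimal batch size.

```python
def rebatch(batches, target):
    """Re-slice a stream of batches into batches of exactly `target` rows.

    Only the final batch may be undersized, mirroring the guarantee that
    rebuild gives for output slices.
    """
    buffer = []
    for batch in batches:
        buffer.extend(batch)
        # Emit full-sized slices as soon as enough rows have accumulated.
        while len(buffer) >= target:
            yield buffer[:target]
            buffer = buffer[target:]
    if buffer:
        # The trailing slice may legitimately stay undersized.
        yield buffer
```

For example, rebatching input batches of sizes 3, 2, and 4 with a target of 4 yields output batches of sizes 4, 4, and 1.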
dominiklohmann added the performance (Improvements or regressions of performance) label and removed the feature (New functionality) label on Sep 18, 2022
dominiklohmann force-pushed the story/sc-37173/rebatch branch from b0810dd to ddd9dbb on September 18, 2022 16:52
dominiklohmann force-pushed the story/sc-37173/rebatch branch from ddd9dbb to ab800d7 on September 18, 2022 16:59
dominiklohmann commented on Sep 20, 2022
Notes from a review in a call with @tobim.
tobim approved these changes on Sep 23, 2022
Reviewed in a pairing session and tested on the Tenzir testbed. Re-batching appears to work as expected.
As of writing this description, I still need to add tests and document the change, but the change to rebuilding can already be reviewed. In terms of documentation, I think a changelog entry suffices. It doesn't change any existing behavior and is not configurable, so there's no need to document it further, as it's purely internal for now.
For testing, I added a small Python script that prints the batch size of every record batch exported via `vast export arrow`, and used it before and after rebuilding so we can see the impact of this PR.
📝 Reviewer Checklist

Review this pull request by ensuring the following items: