New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make data predicate evaluation column-major #2730
Merged
dominiklohmann
merged 7 commits into
master
from
story/sc-35030/columnar-data-predicate-evaluation
Dec 6, 2022
Merged
Make data predicate evaluation column-major #2730
dominiklohmann
merged 7 commits into
master
from
story/sc-35030/columnar-data-predicate-evaluation
Dec 6, 2022
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
dominiklohmann
added a commit
that referenced
this pull request
Nov 24, 2022
I noticed this when working on #2730; I doubt this makes a big difference overall, but we still shouldn't do unnecessary work to infer a schema when we already have it.
mavam
approved these changes
Nov 24, 2022
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good at a high level. I only left some minor questions/comments. If CI passes, this would work for me.
dominiklohmann
added a commit
that referenced
this pull request
Nov 24, 2022
I noticed this when working on #2730; I doubt this makes a big difference overall, but we still shouldn't do unnecessary work to infer a schema when we already have it.
mavam
reviewed
Nov 24, 2022
mavam
reviewed
Nov 24, 2022
dominiklohmann
added a commit
that referenced
this pull request
Nov 24, 2022
I noticed this when working on #2730; I doubt this makes a big difference overall, but we still shouldn't do unnecessary work to infer a schema when we already have it.
dominiklohmann
force-pushed
the
story/sc-35030/columnar-data-predicate-evaluation
branch
from
November 25, 2022 17:17
c584a14
to
dd4d0b5
Compare
dominiklohmann
added a commit
that referenced
this pull request
Nov 25, 2022
I noticed this when working on #2730; I doubt this makes a big difference overall, but we still shouldn't do unnecessary work to infer a schema when we already have it.
dominiklohmann
added a commit
that referenced
this pull request
Dec 6, 2022
I noticed this when working on #2730; I doubt this makes a big difference overall, but we still shouldn't do unnecessary work to infer a schema when we already have it.
This is a major performance optimization for expression evaluation that essentially fills in one of the last TODOs from #2440: Making data predicate evaluation itself column-major, i.e., moving as much as possible of the evaluation out of the hot loop. Here's how the new expression algorithm works, roughly, with (3b) and (4) being the new and improved steps: 1. Normalize the selection bitmap from the dense index result to the length of the batch + offset. 2. Determine whether the expression is empty, a connective of some sort, or a predicate. For connectives, resolve them recursively and combine the resulting bitmaps accordingly. 3. Evaluate predicates: a) If it's a meta extractor, operate on the batch metadata. In case of a match, the selection bitmap is the result directly. b) If it's a data predicate, access the desired array, and lift the resolved types for both sides of the predicate into a compile time context for the column evaluator. 4. The column evaluator has specialization based on the three-tuple of lhs type, relational operator, and rhs view. The generic fall back case iterates over all fields per the selection bitmap to do the evaluation using the cell evaluator, which can be specialized per relational operator. We've seen some pretty great results. When using a query on a 6M event database that's doing multiple substring searches (for which we have neither sparse nor dense index support), the full scan of the database is now over 2x quicker than before. Concretely, we measured the following numbers using Hyperfine with three warmup runs and ten measured runs for this commit compared to its merge-base with master `de86af3fca`. New: Time (mean ± σ): 11.038 s ± 0.360 s [User: 28.600 s, System: 4.430 s] Range (min … max): 10.808 s … 11.569 s 10 runs Old: Time (mean ± σ): 22.805 s ± 0.616 s [User: 45.060 s, System: 4.826 s] Range (min … max): 22.042 s … 23.568 s 10 runs This was a Hackathon group effort with @patszt, @dispanser, and @dominiklohmann. Co-authored-by: Patryk Sztyglic <patryk.sztyglic@gmail.com> Co-authored-by: Thomas Peiselt <pi@kulturguerilla.org> Co-authored-by: Dominik Lohmann <mail@dominiklohmann.de>
Co-authored-by: Matthias Vallentin <matthias@vallentin.net>
dominiklohmann
force-pushed
the
story/sc-35030/columnar-data-predicate-evaluation
branch
from
December 6, 2022 17:30
f4c5bd4
to
98add3e
Compare
dominiklohmann
deleted the
story/sc-35030/columnar-data-predicate-evaluation
branch
December 6, 2022 19:05
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This is a major performance optimization for expression evaluation that essentially fills in one of the last TODOs from #2440: Making data predicate evaluation itself column-major, i.e., moving as much as possible of the evaluation out of the hot loop.
Here's how the new expression algorithm works, roughly, with (3.ii) and (4) being the new and improved steps:
We've seen some pretty great results. When using a query on a 6M event database that's doing multiple substring searches (for which we have neither sparse nor dense index support), the full scan of the database is now over 2x quicker than before.
Concretely, we measured the following numbers using Hyperfine with three warmup runs and ten measured runs for this commit compared to its merge-base with master
de86af3fca
.New:
Old:
This was a Hackathon group effort with @patszt, @dispanser, and @dominiklohmann.