Speed up chunk search by restriction clauses #4492
Conversation
For reviewing purposes, it would be good to understand what change introduced the regression.
Codecov Report

@@           Coverage Diff            @@
##             main    #4492    +/-   ##
==========================================
+ Coverage   90.82%   90.85%   +0.03%
==========================================
  Files         224      224
  Lines       42344    41654     -690
==========================================
- Hits        38457    37843     -614
+ Misses       3887     3811      -76

Continue to review full report at Codecov.
I updated the description.
Looks like this does improve some queries, although not all of them.
src/hypertable_restrict_info.c
Outdated

@@ -524,6 +549,18 @@ gather_restriction_dimension_vectors(const HypertableRestrictInfo *hri)
				open->lower_strategy,
				open->lower_bound);

/*
 * If we have a nontrivial condition for the second index column
What does nontrivial mean exactly?
Just any restrictions, actually. We represent the restrictions as intervals, and "no restriction" corresponds to the (-inf, +inf) interval, which is what I meant by "trivial". I'll reword the comment to make it clearer.
Hmm, that is weird. Why do you need a backwards scan for constraints on the 2nd index column? Scan direction shouldn't matter for that.
If you have <= for the 2nd column and the scan direction is forward, you can't use the 2nd column to find the starting point for the scan. So you use only the first column and go through the rest of the values sequentially. For a backwards scan, you know where the first record is that matches both the conditions on the 1st and the 2nd columns.
More about this in postgres code: https://github.com/postgres/postgres/blob/master/src/backend/access/nbtree/nbtsearch.c#L928
We have this optimization in several other places, so I decided to add it here as well.
src/hypertable_restrict_info.c
Outdated

 * starting point.
 * If not, prefer forward direction, because backwards scan is
 * slightly slower for some reason.
 * Ideally we need some other index type than btree for this.
Why do you need a non-btree index here? What would we gain from another index type?
The btree index is not so suitable for queries like "find the interval that contains the given point" and "find the intervals that intersect with the given interval". I don't know what it should be; we also have this dimension id column on which we should filter first.
I mean, in all the queries we issue, the third column (range_end) is only compared during a sequential scan over index entries, not used to find the starting/ending point.
Looks ok, but I left some comments you might want to review.
src/hypertable_restrict_info.c
Outdated

 * Filter out the dimensions for which we don't have a nontrivial
 * restriction.
It would be good, for the reader's understanding, to add why this is useful in addition to what is being done. Also, what is meant by "filter out"? Are the restrictions that are "filtered out" those we'd like to use or those we'd like to ignore?
For example:

- * Filter out the dimensions for which we don't have a nontrivial
- * restriction.
+ * Remove restrictions that "select" everything in their dimension, since
+ * they don't further restrict the result and instead increase scan time.
Btw, do we need to include "trivial" restrictions in HypertableRestrictInfo to begin with? Why not exclude them from the beginning so that we need not filter them out later?
Btw, do we need to include "trivial" restrictions in HypertableRestrictInfo to begin with? Why not exclude them from the beginning so that we need not filter them out later?
Originally I just didn't want to change the algorithm too much, because it would require different code for initializing the dimension restrict infos on demand.
But now I think this approach also has value for queries where there are multiple conditions on a dimension that end up adding up to the entire dimension. This isn't likely to arise in simple hand-written queries, but it is more likely in auto-generated queries or more complex queries that use views. So we can consider this approach of "removing the empty ones at the end" to be an additional optimization.
Force-pushed a932008 to 09fea64 (compare)
We don't have to look up the dimension slices for dimensions for which we don't have restrictions. Also sort chunks by ids before looking up the metadata, because this gives more favorable table access patterns (closer to sequential). This fixes a planning time regression introduced in 2.7.
Force-pushed 09fea64 to 87dfc27 (compare)
We don't have to look up the dimension slices for dimensions for which
we don't have restrictions.
This also fixes a planning time regression introduced in 2.7, specifically in commit 37190e8a8#diff-e488f45c83647e5d9415d438f2e79eb2780639b7888c7533890b6c577e02d03dL852-L857
There, we stopped using the fast path for the case where there are no restrictions, and the normal path turned out to be slower. This commit speeds it back up to the previous level.
The lookup logic this PR introduces is basically the same as the one introduced for chunk_point_find_chunk_id by #4390. The difference is that here the conditions may be underspecified, so multiple chunks match.