
Speed up chunk search by restriction clauses #4492

Merged: 1 commit merged into timescale:main on Sep 12, 2022

Conversation

@akuzm (Member) commented Jul 5, 2022

We don't have to look up the dimension slices for dimensions for which
we don't have restrictions.

This also fixes a planning time regression introduced in 2.7 by commit 37190e8a8 (see 37190e8a8#diff-e488f45c83647e5d9415d438f2e79eb2780639b7888c7533890b6c577e02d03dL852-L857): we stopped using the fast path for the case where there are no restrictions, and the normal path turned out to be slower. This commit speeds it up back to the previous level.

The lookup logic this PR introduces is basically the same as the one introduced for chunk_point_find_chunk_id by #4390. The difference is that here the conditions may be underspecified, so multiple chunks can match.
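A minimal sketch of the idea, assuming illustrative types and function names rather than the PR's actual internals: an unrestricted dimension matches all of its slices anyway, so scanning the catalog for them is pure overhead.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical, simplified stand-in for a hypertable dimension;
 * not the actual TimescaleDB struct. */
typedef struct
{
    const char *name;
    bool has_restriction; /* does the query restrict this dimension? */
} Dimension;

static void
scan_dimension_slices(const Dimension *dim)
{
    /* Placeholder for the catalog scan that finds matching slices. */
    printf("scanning slices for dimension %s\n", dim->name);
}

int
main(void)
{
    Dimension dims[] = {
        { "time", true },       /* e.g. WHERE time > '2022-01-01' */
        { "device_id", false }, /* no clause on device_id */
    };

    for (int i = 0; i < 2; i++)
    {
        /* Skip the lookup entirely for unrestricted dimensions. */
        if (!dims[i].has_restriction)
            continue;
        scan_dimension_slices(&dims[i]);
    }
    return 0;
}
```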

@akuzm akuzm self-assigned this Jul 5, 2022
@erimatnor (Contributor) commented:

For reviewing purposes, it would be good to understand what change introduced the regression.

@codecov codecov bot commented Aug 2, 2022

Codecov Report

Merging #4492 (87dfc27) into main (a26a597) will increase coverage by 0.03%.
The diff coverage is 96.59%.


@@            Coverage Diff             @@
##             main    #4492      +/-   ##
==========================================
+ Coverage   90.82%   90.85%   +0.03%     
==========================================
  Files         224      224              
  Lines       42344    41654     -690     
==========================================
- Hits        38457    37843     -614     
+ Misses       3887     3811      -76     
Impacted Files                              Coverage Δ
src/chunk.h                                 100.00% <ø> (ø)
src/compat/compat.h                         94.59% <87.50%> (-0.53%) ⬇️
src/hypertable_restrict_info.c              94.50% <91.89%> (-1.35%) ⬇️
src/chunk_scan.c                            98.47% <98.73%> (+2.61%) ⬆️
src/chunk.c                                 88.37% <100.00%> (-6.73%) ⬇️
src/planner/expand_hypertable.c             93.94% <100.00%> (-0.03%) ⬇️
tsl/src/nodes/compress_dml/compress_dml.c   90.47% <0.00%> (-9.53%) ⬇️
src/chunk_constraint.c                      90.07% <0.00%> (-3.88%) ⬇️
tsl/src/partialize_finalize.c               92.44% <0.00%> (-3.73%) ⬇️
... and 25 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@akuzm (Member, Author) commented Aug 2, 2022

> For reviewing purposes, it would be good to understand what change introduced the regression.

I updated the description.

@akuzm akuzm marked this pull request as ready for review August 2, 2022 15:21
@akuzm (Member, Author) commented Aug 9, 2022

@@ -524,6 +549,18 @@ gather_restriction_dimension_vectors(const HypertableRestrictInfo *hri)
open->lower_strategy,
open->lower_bound);

/*
* If we have a nontrivial condition for the second index column
Member

What does nontrivial mean exactly?

Member Author

Just any restrictions, actually. We represent the restrictions as intervals, and "no restriction" corresponds to the (-inf, +inf) interval, which is what I meant by "trivial". I'll reword the comment to make it clearer.
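For illustration, a self-contained sketch of that representation, using a hypothetical struct rather than TimescaleDB's actual types; "trivial" here means the interval is (-inf, +inf), i.e. the condition matches the entire dimension:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* A hypothetical, simplified model of a dimension restriction as an
 * interval; not the actual TimescaleDB structs. */
typedef struct
{
    int64_t lower; /* INT64_MIN stands for -inf */
    int64_t upper; /* INT64_MAX stands for +inf */
} Restriction;

/* "Trivial" in the sense of the code comment: (-inf, +inf),
 * i.e. the restriction matches the whole dimension. */
static bool
restriction_is_trivial(const Restriction *r)
{
    return r->lower == INT64_MIN && r->upper == INT64_MAX;
}

int
main(void)
{
    Restriction no_clause = { INT64_MIN, INT64_MAX };
    Restriction time_clause = { 1000, INT64_MAX }; /* e.g. time >= 1000 */

    printf("no_clause trivial: %d\n", restriction_is_trivial(&no_clause));
    printf("time_clause trivial: %d\n", restriction_is_trivial(&time_clause));
    return 0;
}
```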

Member

Hmm, that is weird. Why do you need a backwards scan for constraints on the 2nd index column? Scan direction shouldn't matter for that.

Member Author

If you have <= for the 2nd column and the scan direction is forward, you can't use the 2nd column to find the starting point for the scan. So you use only the first column and go through the rest of the values sequentially. For a backwards scan, you know where the first record is that matches both the conditions on the 1st and the 2nd columns.

More about this in postgres code: https://github.com/postgres/postgres/blob/master/src/backend/access/nbtree/nbtsearch.c#L928

We have this optimization in several other places, so I decided to add it here as well.
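To make the direction argument concrete, here is a small self-contained simulation (purely illustrative, not TimescaleDB code) that models a btree on (a, b) as a sorted array and evaluates a = 5 AND b <= 9. Scanning backward, both columns can position the starting point, and only the first column needs checking afterwards; scanning forward, the scan would have to start at the first (5, *) entry and filter every row on b:

```c
#include <stdio.h>

typedef struct { int a, b; } Entry;

/* "Index" on (a, b), modeled as an array sorted by (a, b). */
static const Entry idx[] = { {4, 7}, {5, 1}, {5, 4}, {5, 9}, {5, 12}, {6, 2} };
static const int n = sizeof(idx) / sizeof(idx[0]);

int
main(void)
{
    /* Backward scan: position the starting point using both columns,
     * i.e. find the last entry <= (5, 9). */
    int start = n - 1;
    while (start >= 0
           && (idx[start].a > 5 || (idx[start].a == 5 && idx[start].b > 9)))
        start--;

    /* Walk backward; only the first column needs checking now. */
    for (int i = start; i >= 0 && idx[i].a == 5; i--)
        printf("match: (%d, %d)\n", idx[i].a, idx[i].b);

    return 0;
}
```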

src/chunk.c (outdated review thread, resolved)
* starting point.
* If not, prefer forward direction, because backwards scan is
* slightly slower for some reason.
* Ideally we need some other index type than btree for this.
Member

Why do you need a non-btree index here? What would we gain from another index type?

Member Author

The btree index is not so suitable for queries like "find the interval that contains the given point" or "find the intervals that intersect with the given interval". I don't know what it should be; we also have this dimension id column on which we should filter first.

Member Author

I mean, in all the queries we make, the third column (range_end) is only compared during a sequential scan over the index entries; it is not used to find the starting/ending point.

@erimatnor (Contributor) left a comment

Looks ok, but I left some comments you might want to review.

Comment on lines 630 to 631
* Filter out the dimensions for which we don't have a nontrivial
* restriction.
Contributor

It would be good, for the reader's understanding, to add why this is useful in addition to what is being done. Also, what is meant by "filter out"? Are the restrictions being "filtered out" the ones we'd like to use or the ones we'd like to ignore?

For example:

Suggested change
* Filter out the dimensions for which we don't have a nontrivial
* restriction.
* Remove restrictions that "select" everything in their dimension, since they don't further restrict the result and instead increase scan time.

Contributor

Btw, do we need to include "trivial" restrictions in HypertableRestrictInfo to begin with? Why not exclude them from the beginning so that we need not filter them out later?

Member Author

> Btw, do we need to include "trivial" restrictions in HypertableRestrictInfo to begin with? Why not exclude them from the beginning so that we need not filter them out later?

Originally I just didn't want to change the algorithm too much, because it would require different code for initializing the dimension restrict infos on demand.

But now I think this approach also has value for queries where there are multiple conditions on a dimension that end up adding up to the entire dimension. This isn't likely to arise in simple hand-written queries, but it is more likely in auto-generated queries or in more complex queries that use views. So we can consider this approach of "removing the empty ones at the end" to be an additional optimization.
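As a hypothetical example of such a case, two OR'd clauses like time < 1000 OR time >= 1000 produce intervals whose union covers the whole dimension; the merged restriction is then trivial, and the final filtering step can drop it. A minimal sketch with illustrative types, not the actual code:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct { int64_t lo, hi; } Interval; /* [lo, hi) */

/* Merge two intervals into their union; assumes they touch or overlap. */
static Interval
interval_union(Interval a, Interval b)
{
    Interval r = { a.lo < b.lo ? a.lo : b.lo, a.hi > b.hi ? a.hi : b.hi };
    return r;
}

int
main(void)
{
    Interval left = { INT64_MIN, 1000 };  /* time < 1000 */
    Interval right = { 1000, INT64_MAX }; /* time >= 1000 */
    Interval merged = interval_union(left, right);

    /* The merged restriction covers the whole dimension, so it is
     * trivial and can be removed at the end. */
    bool trivial = merged.lo == INT64_MIN && merged.hi == INT64_MAX;
    printf("merged restriction is trivial: %d\n", trivial);
    return 0;
}
```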

Commit: Speed up chunk search by restriction clauses

We don't have to look up the dimension slices for dimensions for which
we don't have restrictions.

Also sort chunks by ids before looking up the metadata, because this
gives more favorable table access patterns (closer to sequential).

This fixes a planning time regression introduced in 2.7.
@akuzm akuzm merged commit 4e47302 into timescale:main Sep 12, 2022