feat: support more TPC-H queries #791

Merged
merged 34 commits into from
Aug 11, 2023

wangrunji0408 (Member) commented Jul 31, 2023

This PR adds support for all remaining TPC-H queries that do not contain a correlated subquery: Q7, Q8, Q12, Q13, Q14, and Q19.

Q7 and Q8 have one table appearing twice in the FROM clause.

from
    nation n1,
    nation n2
...

Previously this caused an error because columns from both occurrences of the table had the same identifier, although they must be distinct within the query. This PR adds a new field table_occurence to ColumnRefId so that columns from n1 and n2 are treated as distinct columns in the planner.
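To illustrate why the extra field resolves the clash, here is a minimal sketch in Python (not the project's actual code). Only the names ColumnRefId and table_occurence come from this PR; the other fields and the numeric ids are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnRefId:
    """Hypothetical stand-in for the planner's column identifier."""
    table_id: int
    table_occurence: int  # 0 for the first appearance in FROM, 1 for the second, ...
    column_id: int


NATION = 3  # made-up table id for `nation`
N_NAME = 1  # made-up column id for `n_name`

n1_name = ColumnRefId(NATION, 0, N_NAME)  # n1.n_name
n2_name = ColumnRefId(NATION, 1, N_NAME)  # n2.n_name

# Without the occurrence counter both references collapse into one identifier.
without_occurrence = {(c.table_id, c.column_id) for c in (n1_name, n2_name)}
# With it, the two references stay distinct.
with_occurrence = {n1_name, n2_name}

print(len(without_occurrence), len(with_occurrence))
```

The frozen dataclass makes the identifier hashable, so two references compare equal only when table, occurrence, and column all match.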

Q13 contains an as table (column...) alias on a subquery. This PR adds support for it in the binder.
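As a rough sketch of what binding such an alias involves, the hypothetical helper below renames the output columns of a subquery and exposes them under the alias. The function name, the returned mapping, and the rule that columns without an alias keep their original names are assumptions of this example, not a description of the binder in this PR.

```python
def bind_subquery_alias(subquery_columns, table_alias, column_aliases):
    """Map `alias.column` names to output positions of the subquery.

    Hypothetical helper: columns beyond the alias list keep their names.
    """
    if len(column_aliases) > len(subquery_columns):
        raise ValueError("more column aliases than subquery columns")
    # Renamed columns first, then any remaining columns under their old names.
    names = list(column_aliases) + list(subquery_columns[len(column_aliases):])
    return {f"{table_alias}.{name}": index for index, name in enumerate(names)}


# Shape of the alias in Q13: (...) as c_orders (c_custkey, c_count)
scope = bind_subquery_alias(
    ["c_custkey", "count(o_orderkey)"],
    "c_orders",
    ["c_custkey", "c_count"],
)
print(scope)
```

The outer query can then resolve c_count against this scope even though the subquery itself never named that column.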

Q12, Q14, and Q19 are simple queries that depend on IN (...) and CASE expressions. This PR adds in and if nodes to the planner and supports them in the binder and evaluator. Note that this PR supports only IN with a value list. The IN operator may also be followed by a subquery, as in Q18 and other queries; that form will be supported in a follow-up PR.
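The following toy evaluator sketches how in and if nodes could be evaluated over a single row. The tuple-based expression encoding and the evaluate function are assumptions for illustration only, and NULL handling is deliberately ignored; the real evaluator in this PR may work quite differently.

```python
def evaluate(expr, row):
    """Evaluate a tuple-encoded expression against one row (a dict)."""
    op = expr[0]
    if op == "col":
        return row[expr[1]]
    if op == "const":
        return expr[1]
    if op == "in":
        # IN with a value list: true if the value equals any listed item.
        value = evaluate(expr[1], row)
        return any(value == evaluate(item, row) for item in expr[2])
    if op == "if":
        # CASE WHEN cond THEN a ELSE b END lowered to if(cond, a, b).
        _, cond, then_branch, else_branch = expr
        chosen = then_branch if evaluate(cond, row) else else_branch
        return evaluate(chosen, row)
    raise ValueError(f"unknown node: {op}")


# CASE WHEN l_shipmode IN ('MAIL', 'SHIP') THEN 1 ELSE 0 END
expr = (
    "if",
    ("in", ("col", "l_shipmode"), [("const", "MAIL"), ("const", "SHIP")]),
    ("const", 1),
    ("const", 0),
)
flags = [evaluate(expr, {"l_shipmode": mode}) for mode in ("MAIL", "AIR", "SHIP")]
print(flags)
```

Only the selected branch of the if node is evaluated, which mirrors the short-circuit behavior expected of CASE.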

This PR also has several improvements in the optimizer:

  1. The optimizer now uses statistics from storage, including table row counts, to better estimate row counts.
  2. The estimated number of rows is also shown in the explain output.
  3. Planner tests can mock row counts with: SET mock_rowcount_<table_name> = <count>;
  4. Added a join-swap rule and made the join-rotate rule unconditional.
  5. The second optimization stage, which handles join reordering and hash joins, now iterates for multiple rounds.
  6. The cost function is fine-tuned.
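As an illustration of the mocking statement in item 3, the sketch below collects mocked row counts from a test script and uses them as scan estimates. Only the SET mock_rowcount_<table_name> = <count>; syntax comes from this PR; the regular expression, the function names, and the fallback default are hypothetical.

```python
import re

# Matches lines such as: SET mock_rowcount_nation = 25;
MOCK_ROWCOUNT = re.compile(
    r"^\s*SET\s+mock_rowcount_(\w+)\s*=\s*(\d+)\s*;\s*$", re.IGNORECASE
)


def parse_mock_rowcounts(script):
    """Return {table_name: row_count} for every mock statement in the script."""
    counts = {}
    for line in script.splitlines():
        match = MOCK_ROWCOUNT.match(line)
        if match:
            counts[match.group(1).lower()] = int(match.group(2))
    return counts


def estimate_scan_rows(table, counts, default=1000):
    """Use the mocked count when present, else an arbitrary default guess."""
    return counts.get(table, default)


script = """
SET mock_rowcount_nation = 25;
SET mock_rowcount_lineitem = 6000000;
select * from nation;
"""
counts = parse_mock_rowcounts(script)
print(counts, estimate_scan_rows("orders", counts))
```

Mocking row counts this way lets plan snapshots stay stable without loading real data into the test database.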

These changes aim to make join reordering effective, which is critical for queries that join many tables. As a result, the running time of Q9 has dropped from ~60s to ~2s (#764). You can check out the plan snapshots in tests/planner_test/tpch.planner.sql to see it in action.
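To see why join order matters so much, consider this toy cost model, which is not the optimizer's actual cost function or search strategy. It enumerates left-deep join orders of three tables and sums the estimated sizes of the intermediate results. The row counts and selectivities are illustrative assumptions, and a pair of tables with no join predicate is treated as a cross product.

```python
from fractions import Fraction
from itertools import permutations

# Illustrative row counts and join selectivities (assumptions for this sketch).
ROWS = {"nation": 25, "supplier": 10_000, "lineitem": 6_000_000}
SELECTIVITY = {
    frozenset(("nation", "supplier")): Fraction(1, 25),
    frozenset(("supplier", "lineitem")): Fraction(1, 10_000),
}


def plan_cost(order):
    """Sum of estimated intermediate result sizes for a left-deep join order."""
    joined = [order[0]]
    rows = Fraction(ROWS[order[0]])
    cost = Fraction(0)
    for table in order[1:]:
        rows *= ROWS[table]
        # Apply every predicate linking the new table to already-joined ones.
        for other in joined:
            rows *= SELECTIVITY.get(frozenset((other, table)), 1)
        joined.append(table)
        cost += rows
    return cost


costs = {order: plan_cost(order) for order in permutations(ROWS)}
best = min(costs, key=costs.get)
worst = max(costs, key=costs.get)
print(best, int(costs[best]))
print(worst, int(costs[worst]))
```

In this toy setup the best order joins the two small tables first, while the worst order starts with a cross product against the largest table and is roughly 26 times more expensive, which is why accurate row estimates are a prerequisite for useful join reordering.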

Benchmark result:

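| Query | Before (s) | After (s) | DuckDB (s) |
|-------|------------|-----------|------------|
| q1    | 1.643      | 1.750     | 0.053      |
| q3    | 0.410      | 0.423     | 0.021      |
| q5    | 1.087      | 1.012     | 0.020      |
| q6    | 0.129      | 0.136     | 0.007      |
| q7    |            | 1.530     | 0.045      |
| q8    |            | 0.845     | 0.022      |
| q9    | 56.748     | 1.739     | 0.067      |
| q10   | 0.461      | 0.455     | 0.061      |
| q12   |            | 0.346     | 0.017      |
| q13   |            | 0.816     | 0.051      |
| q14   |            | 0.188     | 0.014      |
| q19   |            | 0.547     | 0.034      |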

Signed-off-by: Runji Wang <wangrunji0408@163.com>
but the result is empty, test data needs update

@wangrunji0408 wangrunji0408 requested a review from skyzh July 31, 2023 05:46
Signed-off-by: Alex Chi <iskyzh@gmail.com>
@skyzh skyzh enabled auto-merge August 11, 2023 02:58
@skyzh skyzh added this pull request to the merge queue Aug 11, 2023
Merged via the queue into main with commit 3ea87f9 Aug 11, 2023
4 checks passed
@skyzh skyzh deleted the wrj/subquery branch August 11, 2023 03:19
skyzh (Member) commented Aug 11, 2023

It seems that TPC-H plans are slower to generate than before (make apply_planner_test), but we can improve that later.

wangrunji0408 (Member, Author) commented

Yes. The current CBO has not been fine-tuned. It can be very slow, especially for queries that join many tables. We should rethink the strategy for applying rules, and we may need to explore how to perform RBO in the egg framework. 🥵
