Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize queries with similar subplans (CTE) #5878

Closed
sopel39 opened this issue Nov 9, 2020 · 7 comments
Closed

Optimize queries with similar subplans (CTE) #5878

sopel39 opened this issue Nov 9, 2020 · 7 comments
Labels
enhancement New feature or request

Comments

@sopel39
Copy link
Member

sopel39 commented Nov 9, 2020

Many queries also have similar subqueries. For instance:

  • tpch/q21 share similar subqueries in two NOT EXIST/EXIST correlated subqueries.
  • tpcds/q95 has a redundant IN predicate which can be removed

This would also enable us to optimize queries where:

  • some subquery predicates are redundant (see tpcds/q95)
  • same subquery can be used to evaluate different IN/EXISTS/NOT EXISTS/ALL predicates (see tpch/q21)

Initially, we could reuse similar subplans only when we are sure that it won't cause performance degradation (without CBO).

An example of such optimization based on tpch/q21 query:

	and exists (select * from lineitem l2
		where
			l2.l_orderkey = l1.l_orderkey
			and l2.l_suppkey <> l1.l_suppkey)
	and not exists (
		select * from lineitem l3
		where
			l3.l_orderkey = l1.l_orderkey
			and l3.l_suppkey <> l1.l_suppkey
			and l3.l_receiptdate > l3.l_commitdate)

this translates to plan:

                                    ApplyNode[not_exists]
                                   /                     \
                            ApplyNode[exists]           Subquery_A
                           /            \
                     Input           Subquery_B

when reusing similar subplans this plan can be conceptually rewritten to:

                     ApplyNode[not_exists(A), exists(B)]
                    /                       \
                 Input             Project A predicates, project B predicates
                                               \
                                       Filter predicates that neither A and B could satisfy
                                                  \
                                           Common_subplan_for_subquery_A_and_B                                                  
                                                                                   

We could base the similar subplans extraction on paper: http://www.dbis.informatik.hu-berlin.de/fileadmin/lectures/SS2008/Seminar_MatViews/p533-zhou.pdf (Efficient Exploitation of Similar Subexpressions for Query Processing).

The paper describes a method for detecting similar subplans and constructing a reusable common subplan. The method works for S(elect)P(roject)J(oin)G(roup By) type of plans.

The optimization could be introduced in following steps:

  • Introduce PlanSignatureEvaluator, which can compute plan signature according to normalization rules (SPJG plan into TableScan -> Join -> Filter -> GroupBy plan).
  • Introduce CommonSubplanExtractor that takes a list of plans as an input. The result would be a list of extracted common subplans with a mapping to original subplans (with detailed info which predicates and aggregates needs to be applied in order to have the same result as the original subplan). A conceptual output example:
...
common_subplan_X  =>
   [{original_plan_node: Y,
     missing_predicates: p_y,
     missing_aggregates: aggr_y
    },
    {original_plan_node: Z,
     missing_predicates: p_z,
     missing_aggregaes: aggr_z
    }
   ]
...

common_subplan_X with predicates p_y and aggregates aggr_y applied should produce the same result as Y.

We could then add following rules:

  • Add transformations to pushdown ApplyNodes
  • Add transformations to merge/eliminate ApplyNodes with similar subqueries.

Original issue: prestodb/presto#6944

@sopel39 sopel39 added the enhancement New feature or request label Nov 9, 2020
@sopel39
Copy link
Member Author

sopel39 commented Nov 9, 2020

Initially we could focus on eliminating redundant predicates.

@sopel39 sopel39 changed the title Optimize queries with similar subplans Optimize queries with similar subplans (CTE) Nov 9, 2020
@gravitys169
Copy link

gravitys169 commented Nov 13, 2020

@sopel39 , In my mind, one way to eliminate the redundant subquery is materializing the subquery to a temp table, then query from the temp table, so there is some work:

  1. extract the CTE
  2. create a temp table
  3. query from the temp table
  4. delete the temp table
    every step seems need quite a lot work to make it is programable, how do you think?

@sopel39
Copy link
Member Author

sopel39 commented Nov 13, 2020

In my mind, one way to eliminate the redundant subquery is materializing the subquery to a temp table

Either temp table or keep the data in memory. First step is to identify and unify CTEs. Just identifying common CTE could lead to improvements as we can execute certain subplans just once.

@djsstarburst
Copy link
Member

I note from http://c.raqsoft.com/article/1571214718416:

The main query of this problem is relatively simple, that is, the join between the main sub-table and the foreign key table introduced earlier. The main problem here is two sub-queries with exists. Careful study of these two sub-queries reveals that they are all operations against lineitem records under the same l_orderkey.

We know that lineitem has been sorted by l_orderkey, and subtable records can be regarded as the set fields of the main table. If we group the join result set of orders and lineitem according to orderkey (but not aggregate first), we can get a small set of lineitem records with the same l_orderkey value, and then calculate the two existing conditions in this small set. It will be simpler this way.

It seems like the big payoff in q21 is pushing down the l2.l_orderkey = l1.l_orderkey and l3.l_orderkey = l1.l_orderkey filters into their respective TableScans. Conversely, once you've limited the subqueries by the l_orderkey constraint, I'm not sure how much benefit there is in sharing the scans for the aggregations.

@sopel39
Copy link
Member Author

sopel39 commented Jan 19, 2021

Pushing l2.l_orderkey = l1.l_orderkey won't help since every l1 has corresponding l2 row. There could be some benefit in pushing l3.l_orderkey = l1.l_orderkey, but that is already covered by dynamic filtering (assuming number of rows is small)

@hackeryang
Copy link
Member

hackeryang commented Oct 30, 2023

HUAWEI OpenLookeng(based on Trino 350) has implemented CTE subquery result reuse feature, maybe we can refer to the relevant design in the future: https://gitee.com/openlookeng/hetu-core/issues/I38Q5L
https://openlookeng.slite.page/p/0czaZoIp8h/CTE-Caching-and-Optimized-Execution
https://openlookeng.io/zh/information/blog/20220426/

@sopel39
Copy link
Member Author

sopel39 commented May 27, 2024

Superseded by #22167

@sopel39 sopel39 closed this as completed May 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Development

No branches or pull requests

4 participants