Optimize queries with similar subplans (CTE) #5878

sopel39 · 2020-11-09T11:40:09Z

Many queries also have similar subqueries. For instance:

tpch/q21 share similar subqueries in two NOT EXIST/EXIST correlated subqueries.
tpcds/q95 has a redundant IN predicate which can be removed

This would also enable us to optimize queries where:

some subquery predicates are redundant (see tpcds/q95)
same subquery can be used to evaluate different IN/EXISTS/NOT EXISTS/ALL predicates (see tpch/q21)

Initially, we could reuse similar subplans only when we are sure that it won't cause performance degradation (without CBO).

An example of such optimization based on tpch/q21 query:

	and exists (select * from lineitem l2
		where
			l2.l_orderkey = l1.l_orderkey
			and l2.l_suppkey <> l1.l_suppkey)
	and not exists (
		select * from lineitem l3
		where
			l3.l_orderkey = l1.l_orderkey
			and l3.l_suppkey <> l1.l_suppkey
			and l3.l_receiptdate > l3.l_commitdate)

this translates to plan:

                                    ApplyNode[not_exists]
                                   /                     \
                            ApplyNode[exists]           Subquery_A
                           /            \
                     Input           Subquery_B

when reusing similar subplans this plan can be conceptually rewritten to:

                     ApplyNode[not_exists(A), exists(B)]
                    /                       \
                 Input             Project A predicates, project B predicates
                                               \
                                       Filter predicates that neither A and B could satisfy
                                                  \
                                           Common_subplan_for_subquery_A_and_B

We could base the similar subplans extraction on paper: http://www.dbis.informatik.hu-berlin.de/fileadmin/lectures/SS2008/Seminar_MatViews/p533-zhou.pdf (Efficient Exploitation of Similar Subexpressions for Query Processing).

The paper describes a method for detecting similar subplans and constructing a reusable common subplan. The method works for S(elect)P(roject)J(oin)G(roup By) type of plans.

The optimization could be introduced in following steps:

Introduce PlanSignatureEvaluator, which can compute plan signature according to normalization rules (SPJG plan into TableScan -> Join -> Filter -> GroupBy plan).
Introduce CommonSubplanExtractor that takes a list of plans as an input. The result would be a list of extracted common subplans with a mapping to original subplans (with detailed info which predicates and aggregates needs to be applied in order to have the same result as the original subplan). A conceptual output example:

...
common_subplan_X  =>
   [{original_plan_node: Y,
     missing_predicates: p_y,
     missing_aggregates: aggr_y
    },
    {original_plan_node: Z,
     missing_predicates: p_z,
     missing_aggregaes: aggr_z
    }
   ]
...

common_subplan_X with predicates p_y and aggregates aggr_y applied should produce the same result as Y.

We could then add following rules:

Add transformations to pushdown ApplyNodes
Add transformations to merge/eliminate ApplyNodes with similar subqueries.

Original issue: prestodb/presto#6944

The text was updated successfully, but these errors were encountered:

sopel39 · 2020-11-09T11:46:08Z

Initially we could focus on eliminating redundant predicates.

gravitys169 · 2020-11-13T03:30:51Z

@sopel39 , In my mind, one way to eliminate the redundant subquery is materializing the subquery to a temp table, then query from the temp table, so there is some work:

extract the CTE
create a temp table
query from the temp table
delete the temp table
every step seems need quite a lot work to make it is programable, how do you think?

sopel39 · 2020-11-13T09:00:50Z

In my mind, one way to eliminate the redundant subquery is materializing the subquery to a temp table

Either temp table or keep the data in memory. First step is to identify and unify CTEs. Just identifying common CTE could lead to improvements as we can execute certain subplans just once.

djsstarburst · 2021-01-18T21:07:30Z

I note from http://c.raqsoft.com/article/1571214718416:

The main query of this problem is relatively simple, that is, the join between the main sub-table and the foreign key table introduced earlier. The main problem here is two sub-queries with exists. Careful study of these two sub-queries reveals that they are all operations against lineitem records under the same l_orderkey.

We know that lineitem has been sorted by l_orderkey, and subtable records can be regarded as the set fields of the main table. If we group the join result set of orders and lineitem according to orderkey (but not aggregate first), we can get a small set of lineitem records with the same l_orderkey value, and then calculate the two existing conditions in this small set. It will be simpler this way.

It seems like the big payoff in q21 is pushing down the l2.l_orderkey = l1.l_orderkey and l3.l_orderkey = l1.l_orderkey filters into their respective TableScans. Conversely, once you've limited the subqueries by the l_orderkey constraint, I'm not sure how much benefit there is in sharing the scans for the aggregations.

sopel39 · 2021-01-19T10:21:12Z

Pushing l2.l_orderkey = l1.l_orderkey won't help since every l1 has corresponding l2 row. There could be some benefit in pushing l3.l_orderkey = l1.l_orderkey, but that is already covered by dynamic filtering (assuming number of rows is small)

hackeryang · 2023-10-30T07:22:07Z

HUAWEI OpenLookeng(based on Trino 350) has implemented CTE subquery result reuse feature, maybe we can refer to the relevant design in the future: https://gitee.com/openlookeng/hetu-core/issues/I38Q5L
https://openlookeng.slite.page/p/0czaZoIp8h/CTE-Caching-and-Optimized-Execution
https://openlookeng.io/zh/information/blog/20220426/

sopel39 · 2024-05-27T20:27:28Z

Superseded by #22167

sopel39 added the enhancement New feature or request label Nov 9, 2020

sopel39 changed the title ~~Optimize queries with similar subplans~~ Optimize queries with similar subplans (CTE) Nov 9, 2020

sopel39 mentioned this issue Nov 9, 2020

Reuse table scan results when the same table is used in different parts of query #5880

Closed

sopel39 assigned djsstarburst Dec 23, 2020

sopel39 unassigned djsstarburst Mar 2, 2021

sopel39 mentioned this issue Jul 29, 2022

Could we have an optimizer to deduplicate joins? #13409

Closed

jaystarshot mentioned this issue May 26, 2023

Optimize queries with similar subplans (CTE) prestodb/presto#19744

Open

purcell mentioned this issue Jul 20, 2023

Support for TABLE keyword in queries #18361

Closed

sopel39 closed this as completed May 27, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize queries with similar subplans (CTE) #5878

Optimize queries with similar subplans (CTE) #5878

sopel39 commented Nov 9, 2020 •

edited

Loading

sopel39 commented Nov 9, 2020

gravitys169 commented Nov 13, 2020 •

edited

Loading

sopel39 commented Nov 13, 2020

djsstarburst commented Jan 18, 2021

sopel39 commented Jan 19, 2021

hackeryang commented Oct 30, 2023 •

edited

Loading

sopel39 commented May 27, 2024

Optimize queries with similar subplans (CTE) #5878

Optimize queries with similar subplans (CTE) #5878

Comments

sopel39 commented Nov 9, 2020 • edited Loading

sopel39 commented Nov 9, 2020

gravitys169 commented Nov 13, 2020 • edited Loading

sopel39 commented Nov 13, 2020

djsstarburst commented Jan 18, 2021

sopel39 commented Jan 19, 2021

hackeryang commented Oct 30, 2023 • edited Loading

sopel39 commented May 27, 2024

sopel39 commented Nov 9, 2020 •

edited

Loading

gravitys169 commented Nov 13, 2020 •

edited

Loading

hackeryang commented Oct 30, 2023 •

edited

Loading