Prevent redundant collection rewriting in AggregationAnalyzer #15292

skrzypo987 · 2022-12-05T11:58:31Z

Description

See commit message

Additional context and related issues

Replaces #14983

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Section
* Improve performance of analysis of queries with a lot of GROUP BY clauses

skrzypo987 · 2022-12-05T11:59:31Z

@findepi @martint
This replaces #14983

findepi · 2022-12-05T12:01:47Z

core/trino-main/src/main/java/io/trino/sql/analyzer/AggregationAnalyzer.java

-        this.columnReferences = analysis.getColumnReferenceFields()
-                .entrySet().stream()
-                .collect(toImmutableMap(Map.Entry::getKey, entry -> entry.getValue().getFieldId()));
+        this.columnReferences = analysis.getColumnReferenceFields();


this.columnReferences is (indirectly) mutable now

BTW looks like this place is the most important change in the commit.
How can we prevent it from being undone by an innocent refactor?

analysis.getColumnReferenceFields() returns an immutable map

How can we prevent it from being undone by an innocent refactor?

I don't quite follow why anybody would want to change that. I see no gain in it.

How can we prevent it from being undone by an innocent refactor?

An automated benchmark for planner?

That's a good idea long-term. But this exact problem emerges only for quite specific queries. Mine had 80k columns referenced and 556 group by clauses.

I added a comment.

What I don't like is that this now relies on getColumnReferenceFields returning an immutable map. If the implementation of that method changes, it could indirectly affect this usage without anyone noticing.

Maybe we can just make a sanity check, that the map is indeed immutable? If anyone changes the implementation it will fail fast

Maybe we can just make a sanity check, that the map is indeed immutable?

but it's not (it's just an unmodifiable wrapper)

also, you can't verify that it's what it is, since unmodifiableMap has no public class you can test for

@findepi Unfortunately you are right.
So what are the options?
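
To illustrate the distinction raised in this thread, here is a minimal sketch using only JDK collections (not Trino code; the class and method names are hypothetical). It shows that an unmodifiable wrapper is a read-only view that still reflects changes made to the backing map, that an immutable copy is a snapshot, and that probing with a write cannot tell the two apart because both reject it.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class UnmodifiableVsImmutable
{
    // Size observed through an unmodifiable *view* after the backing map grows.
    static int sizeSeenThroughView()
    {
        Map<String, Integer> backing = new HashMap<>();
        backing.put("a", 1);
        Map<String, Integer> view = Collections.unmodifiableMap(backing);
        backing.put("b", 2); // the owner of the backing map can still mutate it
        return view.size(); // the view reflects the new entry
    }

    // Size observed through an immutable *copy* after the original map grows.
    static int sizeSeenThroughCopy()
    {
        Map<String, Integer> backing = new HashMap<>();
        backing.put("a", 1);
        Map<String, Integer> copy = Map.copyOf(backing); // snapshot, costs O(n)
        backing.put("b", 2);
        return copy.size(); // the snapshot is unaffected
    }

    // A write probe: true if the map refuses modification.
    // Note that on a mutable map the probe really inserts the entry.
    static boolean rejectsWrites(Map<String, Integer> map)
    {
        try {
            map.put("probe", 0);
            return false;
        }
        catch (UnsupportedOperationException e) {
            return true;
        }
    }

    public static void main(String[] args)
    {
        System.out.println(sizeSeenThroughView());
        System.out.println(sizeSeenThroughCopy());
    }
}
```

A defensive copy would restore isolation, but it is the same kind of per-construction O(n) work this PR sets out to remove, so it is probably not an attractive option here.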

skrzypo987 · 2022-12-05T12:35:31Z

@findepi AC
Thanks for the quick look

skrzypo987 · 2022-12-06T07:54:44Z

AC again.
Added a proper benchmark to BenchmarkPlanner class

findepi · 2022-12-06T16:23:33Z

Added a proper benchmark to BenchmarkPlanner class

thank you!

do you have the results you can share?

skrzypo987 · 2022-12-06T17:54:20Z

Added a proper benchmark to BenchmarkPlanner class

thank you!

do you have the results you can share?

See commit message

findepi · 2022-12-08T09:05:58Z

core/trino-main/src/test/java/io/trino/sql/planner/BenchmarkPlanner.java

+                        JOIN lineitem b ON a.l_orderkey = b.l_orderkey
+                        JOIN lineitem c ON a.l_orderkey = c.l_orderkey
+                        JOIN lineitem d ON a.l_orderkey = d.l_orderkey
+                        JOIN lineitem e ON a.l_orderkey = e.l_orderkey


That's a perfect case for the optimizer to eliminate redundant reads from the lineitem table and read it only once.
To prevent such a potential future optimization from breaking your benchmark, use e.g. lineitem tables coming from different schemas
(from a planning-time perspective, the scale factor (sf) does not matter).
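
A rough sketch of the suggestion above, not the code that ended up in BenchmarkPlanner: a helper that builds the self-join so that every lineitem reference comes from a different schema. The class name, method name, aliases t0, t1, ... and the SELECT list are hypothetical; the schema names in the usage note are assumed to be TPCH scale-factor schemas available in the benchmark's catalog.

```java
import java.util.List;

public class DistinctSchemaJoins
{
    // Builds a query that joins lineitem from each given schema on l_orderkey.
    // Expects at least one schema name.
    static String buildQuery(List<String> schemas)
    {
        StringBuilder sql = new StringBuilder();
        sql.append("SELECT count(*) FROM ").append(schemas.get(0)).append(".lineitem t0");
        for (int i = 1; i < schemas.size(); i++) {
            sql.append("\nJOIN ").append(schemas.get(i)).append(".lineitem t").append(i)
                    .append(" ON t0.l_orderkey = t").append(i).append(".l_orderkey");
        }
        return sql.toString();
    }

    public static void main(String[] args)
    {
        // Schema names are an assumption; substitute whatever the catalog exposes.
        System.out.println(buildQuery(List.of("tiny", "sf1", "sf100")));
    }
}
```

Because each table reference is distinct, a future rule that deduplicates identical scans would have nothing to collapse, so the benchmark would keep measuring the same planning work.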

findepi · 2022-12-08T09:08:20Z

core/trino-main/src/main/java/io/trino/sql/analyzer/AggregationAnalyzer.java

                                    .collect(toImmutableList());
                            for (Expression sortKey : sortKeys) {
-                                if (!node.getArguments().contains(sortKey) && !fieldIds.contains(columnReferences.get(NodeRef.of(sortKey)))) {
+                                ResolvedField field = columnReferences.get(NodeRef.of(sortKey));


Why extracted?
Previously the lookup was lazy, performed only when !node.getArguments().contains(sortKey); now it always happens.

Now, instead of X.contains(map.get(A)), we need to do

V v = map.get(A); return (v == null) || !X.contains(v.getB());

since the value type of the map changed.
But you are right, I replaced that with a double if.
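
The "double if" mentioned above might look roughly like the following sketch. It is not the merged code: FieldStub is a hypothetical stand-in for ResolvedField, plain strings replace Expression and NodeRef, and isRejected is an invented name. The point is only that the map lookup happens inside the outer branch, so it stays lazy, and that a missing entry is handled before its field id is read.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SortKeyCheck
{
    // Hypothetical stand-in for ResolvedField; only the field id is modelled.
    static final class FieldStub
    {
        final int fieldId;

        FieldStub(int fieldId)
        {
            this.fieldId = fieldId;
        }
    }

    // Returns true when the sort key is neither an argument nor a known grouping field.
    static boolean isRejected(
            String sortKey,
            List<String> arguments,
            Set<Integer> fieldIds,
            Map<String, FieldStub> columnReferences)
    {
        if (!arguments.contains(sortKey)) {
            // The lookup happens only when the cheaper check did not already accept the key.
            FieldStub field = columnReferences.get(sortKey);
            if (field == null || !fieldIds.contains(field.fieldId)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args)
    {
        Map<String, FieldStub> references = Map.of("b", new FieldStub(1));
        System.out.println(isRejected("a", List.of("a"), Set.of(1), references));
        System.out.println(isRejected("c", List.of("a"), Set.of(1), references));
    }
}
```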

skrzypo987 · 2022-12-08T12:35:19Z

@findepi AC

Make queries to be executed a parameter

Add a test case with a huge number of columns referenced in a query (86k) and a lot of GROUP BY clauses. This benchmark highlights the quadratic complexity of the AggregationAnalyzer constructor, which will be fixed in a subsequent commit.

Every time the `AggregationAnalyzer` class was initialized, all fields present in analysis.getColumnReferenceFields() were rewritten into the `columnReferences` map, which only trivially mapped values: entry -> entry.getValue().getFieldId(). This happened for every aggregation in the query. The problem is that the scope of analysis.referencedFields is wider than that of AggregationAnalyzer, and the number of fields kept rising with time. The complexity of this is O(n^2), which led to long analysis times once the number of referenced fields reached tens of thousands. This commit removes the problematic eager remapping of all values in favor of lazy mapping when needed.

before:
BenchmarkPlanner.plan GROUP_BY_WITH_MANY_REFERENCED_COLUMNS CREATED avgt 20 4835,674 ± 233,587 ms/op

after:
BenchmarkPlanner.plan GROUP_BY_WITH_MANY_REFERENCED_COLUMNS CREATED avgt 20 863,968 ± 5,566 ms/op
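
The cost difference described in this commit message can be modelled with a small, self-contained sketch. It is a toy model under stated assumptions, not Trino code: Field stands in for ResolvedField, integer keys stand in for expression references, and each "aggregation" is assumed to add a fixed number of fields to a shared map before its analyzer is constructed. The eager variant copies the whole shared map per analyzer; the lazy variant keeps a reference and maps a value only when it is looked up.

```java
import java.util.HashMap;
import java.util.Map;

public class EagerVsLazyRemapping
{
    // Hypothetical stand-in for ResolvedField; only the field id is modelled.
    static final class Field
    {
        final int fieldId;

        Field(int fieldId)
        {
            this.fieldId = fieldId;
        }
    }

    // Simulates analysis registering more referenced fields in the shared map.
    static void addFields(Map<Integer, Field> shared, int count)
    {
        for (int i = 0; i < count; i++) {
            int id = shared.size();
            shared.put(id, new Field(id));
        }
    }

    // Eager: each analyzer rewrites the whole shared map. Returns entries copied in total.
    static long eagerCopies(int aggregations, int fieldsPerAggregation)
    {
        Map<Integer, Field> shared = new HashMap<>();
        long copied = 0;
        for (int i = 0; i < aggregations; i++) {
            addFields(shared, fieldsPerAggregation);
            Map<Integer, Integer> remapped = new HashMap<>();
            for (Map.Entry<Integer, Field> entry : shared.entrySet()) {
                remapped.put(entry.getKey(), entry.getValue().fieldId);
                copied++;
            }
        }
        return copied;
    }

    // Lazy: each analyzer shares the map and resolves only the fields it needs.
    // Returns the number of lookups performed in total.
    static long lazyLookups(int aggregations, int fieldsPerAggregation)
    {
        Map<Integer, Field> shared = new HashMap<>();
        long lookups = 0;
        for (int i = 0; i < aggregations; i++) {
            int firstNew = shared.size();
            addFields(shared, fieldsPerAggregation);
            for (int id = firstNew; id < shared.size(); id++) {
                int fieldId = shared.get(id).fieldId; // mapped on demand
                lookups++;
            }
        }
        return lookups;
    }

    public static void main(String[] args)
    {
        System.out.println(eagerCopies(4, 10));
        System.out.println(lazyLookups(4, 10));
    }
}
```

In this model, with a aggregations and f fields each, the eager variant copies f * a * (a + 1) / 2 entries while the lazy one performs a * f lookups, which is the quadratic versus linear gap the benchmark numbers above reflect.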

cla-bot bot added the cla-signed label Dec 5, 2022

skrzypo987 mentioned this pull request Dec 5, 2022

Use IdentityHashMap in AggregationAnalyzer #14983

Closed

skrzypo987 requested review from martint and findepi December 5, 2022 11:59

findepi reviewed Dec 5, 2022

skrzypo987 force-pushed the skrzypo/132-optimize-aggregation-analysis branch from a33efec to 4b61bba Compare December 5, 2022 12:34

Replace String type with enum in BenchmarkPlanner

94daca0

skrzypo987 force-pushed the skrzypo/132-optimize-aggregation-analysis branch from 4b61bba to 2505e40 Compare December 6, 2022 07:53

skrzypo987 force-pushed the skrzypo/132-optimize-aggregation-analysis branch 3 times, most recently from 2368c85 to a482c1c Compare December 6, 2022 15:25

findepi approved these changes Dec 8, 2022

skrzypo987 force-pushed the skrzypo/132-optimize-aggregation-analysis branch from a482c1c to 99d63b0 Compare December 8, 2022 12:34

findepi reviewed Dec 8, 2022

core/trino-main/src/test/java/io/trino/sql/planner/BenchmarkPlanner.java

findepi reviewed Dec 8, 2022

core/trino-main/src/main/java/io/trino/sql/analyzer/AggregationAnalyzer.java

skrzypo987 force-pushed the skrzypo/132-optimize-aggregation-analysis branch from 99d63b0 to 0fbcdef Compare December 8, 2022 17:19

skrzypo987 added 3 commits December 9, 2022 10:18

Parametrize BenchmarkPlanner

17e8def

Make queries to be executed a parameter

Add test case to BenchmarkPlanner

17f0ae0

Add a test case with a huge number of columns referenced in a query (86k) and a lot of GROUP BY clauses. This benchmark highlights the quadratic complexity of the AggregationAnalyzer constructor, which will be fixed in a subsequent commit.

skrzypo987 force-pushed the skrzypo/132-optimize-aggregation-analysis branch from 0fbcdef to 572cadb Compare December 9, 2022 08:51

findepi merged commit c68bc26 into trinodb:master Dec 9, 2022

findepi mentioned this pull request Dec 9, 2022

Release notes for 405 #15058

Closed

github-actions bot added this to the 404 milestone Dec 9, 2022

colebow mentioned this pull request Dec 14, 2022

Add Trino 405 release notes #15139

Merged

skrzypo987 mentioned this pull request Feb 21, 2023

Optimize aggregation analyzer #16198

Closed
