Optimize distinct aggregation on multiple columns #624

kaka11chen · 2019-04-11T17:00:54Z

Fixes #613 by optimizing distinct aggregation on multiple columns.

cla-bot · 2019-04-11T17:00:57Z

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: chenqi.
This is most likely caused by a git client misconfiguration; please make sure to:

Check if your git client is configured with an email to sign commits: git config --list | grep email
If not, set it up using: git config --global user.email email@example.com
Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

cla-bot · 2019-04-11T17:21:51Z

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please submit the signed CLA to cla@prestosql.io. For more information, see https://github.com/prestosql/cla.

sopel39 · 2019-04-15T09:52:47Z

.../src/main/java/io/prestosql/sql/planner/optimizations/OptimizeMixedDistinctAggregations.java

-            Set<Symbol> uniqueMasks = ImmutableSet.copyOf(masks);
-            if (uniqueMasks.size() != 1 || masks.size() == node.getAggregations().size()) {
+
+            if (masks.size() == 0 || !masks.stream().allMatch(mask -> mask.getName().endsWith("$distinct"))) {


Could you extract "$distinct" into a static final constant? Which code produces symbols with the $distinct suffix?

OK, thanks. MultipleDistinctAggregationToMarkDistinct produces this.

martint · 2019-04-13T00:26:46Z

.../src/main/java/io/prestosql/sql/planner/optimizations/OptimizeMixedDistinctAggregations.java

-            Set<Symbol> uniqueMasks = ImmutableSet.copyOf(masks);
-            if (uniqueMasks.size() != 1 || masks.size() == node.getAggregations().size()) {
+
+            if (masks.size() == 0 || !masks.stream().allMatch(mask -> mask.getName().endsWith("$distinct"))) {


There's no guarantee that the mask will be named xxx$distinct, so this check is brittle. We need a better way to identify which fields correspond to the distinct mask (by matching the field they originate from).
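To make the difference concrete, here is a minimal sketch contrasting the two checks. It is not Presto code: masks are plain strings rather than Symbol objects, and `markerSymbols` is a hypothetical set standing in for the marker symbols collected from the MarkDistinct nodes found in the plan.

```java
import java.util.List;
import java.util.Set;

final class DistinctMaskCheck {
    // Suffix extracted into a constant, as suggested earlier in the review.
    static final String DISTINCT_SUFFIX = "$distinct";

    // Name-based check: brittle, because nothing guarantees the suffix.
    static boolean allDistinctByName(List<String> masks) {
        return !masks.isEmpty()
                && masks.stream().allMatch(mask -> mask.endsWith(DISTINCT_SUFFIX));
    }

    // Origin-based check: every mask must be one of the marker symbols
    // produced upstream (hypothetical input, gathered from the plan).
    static boolean allDistinctByOrigin(List<String> masks, Set<String> markerSymbols) {
        return !masks.isEmpty() && markerSymbols.containsAll(masks);
    }
}
```

With this shape, a mask named `a_marker` passes the origin-based check as long as an upstream node produced it, while a symbol that merely happens to end in `$distinct` does not.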

wenleix · 2019-04-17T22:48:11Z

How does this optimization perform compared with setting use_mark_distinct=false? :)

kaka11chen · 2019-04-18T15:36:25Z

@wenleix

select count(distinct ss_item_sk), count(distinct ss_store_sk) from tpcds_bin_partitioned_orc_1000.store_sales;

Result: It is very slow. It takes 60 seconds in our perf-test env, regardless of use-mark-distinct.

If I change it to

select
    count(case when grouping_id = 1 and ss_item_sk is not null then 1 else null end) as c0,
    count(case when grouping_id = 2 and ss_store_sk is not null then 1 else null end) as c1
from (
    select grouping(ss_item_sk, ss_store_sk) as grouping_id, ss_item_sk, ss_store_sk
    from tpcds_bin_partitioned_orc_1000.store_sales
    group by grouping sets (ss_item_sk, ss_store_sk)
)

Result: It takes only 20 seconds in our perf-test env (3 nodes, each with 48 cores and 96 GB of memory).
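For readers who want to check why the two queries return the same counts, here is a small in-memory sketch of the rewrite. It is only an illustration, not engine code: rows are two-element Integer arrays (item, store), and the class and method names are made up for this example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

final class GroupingSetsDistinct {
    // Direct form: count(distinct col). Nulls are ignored, as in SQL.
    static long countDistinct(List<Integer[]> rows, int col) {
        return rows.stream().map(r -> r[col]).filter(Objects::nonNull).distinct().count();
    }

    // Rewritten form: GROUP BY GROUPING SETS ((a), (b)), then conditional counts.
    // Each grouped row is [groupingId, a, b]; the column outside the
    // grouping set is null, and the set deduplicates the groups.
    static long[] countViaGroupingSets(List<Integer[]> rows) {
        Set<List<Integer>> grouped = new HashSet<>();
        for (Integer[] r : rows) {
            grouped.add(Arrays.asList(1, r[0], null)); // grouping set (a): grouping(a, b) = 1
            grouped.add(Arrays.asList(2, null, r[1])); // grouping set (b): grouping(a, b) = 2
        }
        long c0 = grouped.stream().filter(g -> g.get(0) == 1 && g.get(1) != null).count();
        long c1 = grouped.stream().filter(g -> g.get(0) == 2 && g.get(2) != null).count();
        return new long[] {c0, c1};
    }
}
```

The grouping step removes duplicates once per column, so the outer counts only need to skip the null group, which is what the `is not null` conditions in the SQL do.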

kaka11chen · 2019-04-22T06:11:14Z

@sopel39 @martint Any other suggestions?

martint

Some high-level comments:

OptimizeMixedDistinctAggregations is a legacy (Visitor-based) optimizer, which we're trying to move away from. Ideally, we'd implement this as a Rule instead.
The optimization only works when use_mark_distinct is set to true. That causes a GROUP BY with mixed distincts to be turned into a sequence of MarkDistinct nodes. OptimizeMixedDistinctAggregations tries to reverse-engineer that transformation and turn it into the grouping-sets based version.

I think a better approach than extending this optimizer is to implement a Rule that turns a GROUP BY + mixed distincts into a grouping sets-based GROUP BY. It will make the code much simpler as we wouldn't need to dig through the plan to find the MarkDistinct nodes that produce the corresponding "distinct" inputs to the aggregation. Once we do that, we can also get rid of this legacy optimizer.

Let me know if you need help doing this or if you need additional pointers and examples of where to look.
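As a rough illustration of the shape such a Rule could take, here is a toy sketch. None of these types are Presto's: `Agg` and `apply` are hypothetical stand-ins, and the applicability condition (at least two distinct input columns) is an assumption made for this example. The point is that the rule inspects the aggregation directly and either declines or returns the grouping sets to aggregate over, without searching the plan for MarkDistinct nodes.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

final class MixedDistinctRuleSketch {
    // Toy aggregation call: function name, input column, DISTINCT flag.
    static final class Agg {
        final String function;
        final String column;
        final boolean distinct;

        Agg(String function, String column, boolean distinct) {
            this.function = function;
            this.column = column;
            this.distinct = distinct;
        }
    }

    // Returns the grouping sets for the rewritten aggregation, or empty
    // when the rule does not apply (assumed: fewer than two distinct columns).
    static Optional<List<Set<String>>> apply(List<String> groupByKeys, List<Agg> aggregations) {
        Set<String> distinctColumns = new LinkedHashSet<>();
        for (Agg agg : aggregations) {
            if (agg.distinct) {
                distinctColumns.add(agg.column);
            }
        }
        if (distinctColumns.size() < 2) {
            return Optional.empty();
        }
        List<Set<String>> groupingSets = new ArrayList<>();
        if (aggregations.stream().anyMatch(agg -> !agg.distinct)) {
            // Non-distinct aggregations get a grouping set with only the original keys.
            groupingSets.add(new LinkedHashSet<>(groupByKeys));
        }
        for (String column : distinctColumns) {
            // One grouping set per distinct column: original keys plus that column.
            Set<String> groupingSet = new LinkedHashSet<>(groupByKeys);
            groupingSet.add(column);
            groupingSets.add(groupingSet);
        }
        return Optional.of(groupingSets);
    }
}
```

For two count(distinct ...) calls over ss_item_sk and ss_store_sk with no GROUP BY keys, this sketch yields the grouping sets (ss_item_sk) and (ss_store_sk), matching the hand-written query earlier in the thread.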

martint · 2019-04-22T22:49:05Z

.../test/java/io/prestosql/sql/planner/optimizations/TestOptimizeMixedDistinctAggregations.java

+        Map<Optional<String>, ExpectedValueProvider<FunctionCall>> aggregationsFirst = ImmutableMap.of(
+                Optional.of("COUNT"), functionCall("count", ImmutableList.of("ORDERSTATUS")));
+
+        PlanMatchPattern tableScan = tableScan("orders", ImmutableMap.of("TOTALPRICE", "totalprice", "CUSTKEY", "custkey", "ORDERDATE", "orderdate",


Use a map builder for readability when there are many keys and values:

PlanMatchPattern tableScan = tableScan("orders", ImmutableMap.<String, String>builder()
        .put("TOTALPRICE", "totalprice")
        .put("CUSTKEY", "custkey")
        .put("ORDERDATE", "orderdate")
        .put("ORDERSTATUS", "orderstatus")
        .build());

kaka11chen · 2019-04-23T15:37:03Z

@martint Got it, thanks. I will implement it as a Rule instead.

kaka11chen · 2019-05-05T15:44:24Z

@martint @sopel39 I have reworked this to implement it as a Rule. Could you please review it? Thanks.

martint · 2019-05-13T18:15:10Z

presto-main/src/main/java/io/prestosql/sql/planner/PlanOptimizers.java

@@ -275,6 +275,7 @@ public PlanOptimizers(
                        statsCalculator,
                        estimatedExchangesCostCalculator,
                        new CanonicalizeExpressions().rules()),
+                new UnaliasSymbolReferences(),


Why is this needed?

martint · 2019-05-13T18:15:41Z

presto-main/src/main/java/io/prestosql/sql/planner/PlanOptimizers.java

@@ -433,7 +435,6 @@ public PlanOptimizers(
                        estimatedExchangesCostCalculator,
                        ImmutableSet.of(new ReorderJoins(costComparator))));

-        builder.add(new OptimizeMixedDistinctAggregations(metadata));


We can remove the OptimizeMixedDistinctAggregations class, too.

martint · 2019-05-13T18:22:21Z