Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize distinct aggregation on multiple columns #624

Conversation

kaka11chen
Copy link
Contributor

Fix #613 to optimize distinct aggregation on multi columns.

@cla-bot
Copy link

cla-bot bot commented Apr 11, 2019

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: chenqi.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email email@example.com
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

@kaka11chen kaka11chen force-pushed the optimize-distinct-aggr-on-multi-cols#613 branch from 40a54e8 to 4928c60 Compare April 11, 2019 17:21
@cla-bot
Copy link

cla-bot bot commented Apr 11, 2019

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please submit the signed CLA to cla@prestosql.io. For more information, see https://github.com/prestosql/cla.

@findepi findepi changed the title Optimize distinct aggregation on multi columns#613. Optimize distinct aggregation on multiple columns Apr 12, 2019
@martint martint self-assigned this Apr 12, 2019
Set<Symbol> uniqueMasks = ImmutableSet.copyOf(masks);
if (uniqueMasks.size() != 1 || masks.size() == node.getAggregations().size()) {

if (masks.size() == 0 || !masks.stream().allMatch(mask -> mask.getName().endsWith("$distinct"))) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you extract "$distinct" to static final. Which code produces symbols with $distinct suffix?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, thanks. MultipleDistinctAggregationToMarkDistinct produces this.

Set<Symbol> uniqueMasks = ImmutableSet.copyOf(masks);
if (uniqueMasks.size() != 1 || masks.size() == node.getAggregations().size()) {

if (masks.size() == 0 || !masks.stream().allMatch(mask -> mask.getName().endsWith("$distinct"))) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no guarantee that the name of the mask will be named xxx$distinct, so this check is brittle. We need a better way to identify which fields correspond to the distinct mask (by matching the field they originate from)

@wenleix
Copy link
Contributor

wenleix commented Apr 17, 2019

What's the performance of this optimization compared with setting use_mark_distinct=false ? :)

@kaka11chen
Copy link
Contributor Author

kaka11chen commented Apr 18, 2019

@wenleix

select count(distinct ss_item_sk), count(distinct ss_store_sk) from tpcds_bin_partitioned_orc_1000.store_sales;

Result: It is very slow, cost 60 seconds in our perf-test env, regardless of use-mark-distinct.

If I change it to

select count(case when grouping_id=1 and ss_item_sk is not null then 1 else null end) as c0, count(case when grouping_id=2 and ss_store_sk is not null then 1 else null end) as c1 from (select grouping(ss_item_sk,ss_store_sk) AS grouping_id, ss_item_sk, ss_store_sk from tpcds_bin_partitioned_orc_1000.store_sales group by grouping sets (ss_item_sk, ss_store_sk))

Result: It only cost 20 seconds in our perf-test env(3 nodes, each is 48 cores, 96G memory).

@kaka11chen
Copy link
Contributor Author

@sopel39 @martint Any other suggestions?

Copy link
Member

@martint martint left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some high-level comments:

  • The OptimizeMixedDistinctAggregations is a legacy (Visitor-based optimizer), which we're trying to move away from. Ideally, we'd implement this as a Rule, instead.
  • The optimization only works when use_mark_distinct is set to true. That causes a GROUP BY with mixed distincts to be turned into a sequence of MarkDistinct. OptimizeMixedDistinctAggregations tries to reverse-engineer that transformation and turn it into the grouping-sets based version.

I think a better approach than extending this optimizer is to implement a Rule that turns a GROUP BY + mixed distincts into a grouping sets-based GROUP BY. It will make the code much simpler as we wouldn't need to dig through the plan to find the MarkDistinct nodes that produce the corresponding "distinct" inputs to the aggregation. Once we do that, we can also get rid of this legacy optimizer.

Let me know if you need help doing this or if you need additional pointers and examples of where to look.

Map<Optional<String>, ExpectedValueProvider<FunctionCall>> aggregationsFirst = ImmutableMap.of(
Optional.of("COUNT"), functionCall("count", ImmutableList.of("ORDERSTATUS")));

PlanMatchPattern tableScan = tableScan("orders", ImmutableMap.of("TOTALPRICE", "totalprice", "CUSTKEY", "custkey", "ORDERDATE", "orderdate",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use a map builder for readability when there are many keys and values:

PlanMatchPattern tableScan = tableScan("orders", ImmutableMap.<String, String>builder()
        .put("TOTALPRICE", "totalprice")
        .put("CUSTKEY", "custkey")
        .put("ORDERDATE", "orderdate")
        .put("ORDERSTATUS", "orderstatus")
        .build());

@kaka11chen
Copy link
Contributor Author

@martint Got it, thanks. I will implement as a Rule instead.

@martint martint assigned kaka11chen and unassigned martint Apr 23, 2019
@kaka11chen kaka11chen force-pushed the optimize-distinct-aggr-on-multi-cols#613 branch from 4928c60 to 81f00c2 Compare May 5, 2019 12:48
@cla-bot cla-bot bot added the cla-signed label May 5, 2019
@kaka11chen kaka11chen force-pushed the optimize-distinct-aggr-on-multi-cols#613 branch from 81f00c2 to eae95d3 Compare May 5, 2019 14:28
@kaka11chen
Copy link
Contributor Author

@martint @sopel39 I have modified to implement as a Rule. Could you please review it? Thanks.

@martint martint self-requested a review May 9, 2019 01:53
@@ -275,6 +275,7 @@ public PlanOptimizers(
statsCalculator,
estimatedExchangesCostCalculator,
new CanonicalizeExpressions().rules()),
new UnaliasSymbolReferences(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this needed?

@@ -433,7 +435,6 @@ public PlanOptimizers(
estimatedExchangesCostCalculator,
ImmutableSet.of(new ReorderJoins(costComparator))));

builder.add(new OptimizeMixedDistinctAggregations(metadata));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can remove the OptimizeMixedDistinctAggregations class, too.

.map(Aggregation::getCall)
.filter(FunctionCall::isDistinct)
.map(FunctionCall::getArguments)
.<Set<Expression>>map(HashSet::new)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

<Set<Expression>> is not required. The compiler should be able to infer the correct type.


private static boolean hasDistinctInput(AggregationNode aggregation)
{
return aggregation.getAggregations()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the goal is to find out whether the aggregation has any calls with DISTINCT, this can be simplified to:

return aggregation.getAggregations()
        .values().stream()
        .map(Aggregation::getCall)
        .anyMatch(FunctionCall::isDistinct);

.count() > 0;
}

private static boolean hasSingleDistinctArgument(AggregationNode aggregation)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name of this method is misleading. It returns true if the the aggregation has DISTINCT with a single argument or no aggregations with DISTINCT.

.noneMatch(e -> e.getMask().isPresent());
}

private static boolean noOrdering(AggregationNode aggregation)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is trivial enough that I would just inline it where it's used.

}
}

List<Symbol> nonDistinctAggregateSymbols = nonDistinctAggregateSymbolsBuilder.build().stream().distinct().collect(Collectors.toList());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you make nonDistinctAggregateSymbolsBuilder a Set builder, you can skip the calls to remove duplicates.

*
* INTO
*
* SELECT a1, a2,..., an, arbitrary(if(group = 0, f1)),...., arbitrary(if(group = 0, fm)), F(if(group = 1, c1)), ...., F(if(group = m, cm)) FROM
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This transformation is actually incorrect. Not all aggregations are insensitive to nulls, so the terms F(if(group = 1, c1)), etc. will call the aggregation with null values for other groups:

presto:default> select a, array_agg(b), array_agg(distinct c) from (values (1,1,1)) t(a,b,c) group by a;
 a | _col1 |   _col2
---+-------+-----------
 1 | [1]   | [null, 1]
(1 row)

This needs to be rewritten to F(c1) FILTER (WHERE group = 1). Also, please add a test for this case.

}

@Override
public Result apply(AggregationNode node, Captures captures, Context context)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this implementation hard to read. I think it would help if you structured it such that it builds things from the bottom up and return objects that contain the node that was created + anything that's needed to build the node above it. However, I haven't tried playing this out, so not sure how much cleaner it will make it. You'll have to give it a try and see how it goes.

@martint
Copy link
Member

martint commented Oct 9, 2019

@kaka11chen, any updates on this?

@Donderia
Copy link

Donderia commented Apr 12, 2021

@martint @kaka11chen - is it still not available in in trino ? I had the same observation with multiple distincts with presto version 340

@colebow
Copy link
Member

colebow commented Oct 6, 2022

👋 @kaka11chen - this PR is inactive and doesn't seem to be under development. If you'd like to continue work on this at any point in the future, feel free to re-open.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

Optimize distinct aggregation on multi column
6 participants