-
Notifications
You must be signed in to change notification settings - Fork 2.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize distinct aggregation on multiple columns #624
Optimize distinct aggregation on multiple columns #624
Conversation
Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: chenqi.
|
40a54e8
to
4928c60
Compare
Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please submit the signed CLA to cla@prestosql.io. For more information, see https://github.com/prestosql/cla. |
Set<Symbol> uniqueMasks = ImmutableSet.copyOf(masks); | ||
if (uniqueMasks.size() != 1 || masks.size() == node.getAggregations().size()) { | ||
|
||
if (masks.size() == 0 || !masks.stream().allMatch(mask -> mask.getName().endsWith("$distinct"))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you extract "$distinct"
to static final. Which code produces symbols with $distinct
suffix?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, thanks. MultipleDistinctAggregationToMarkDistinct produces this.
Set<Symbol> uniqueMasks = ImmutableSet.copyOf(masks); | ||
if (uniqueMasks.size() != 1 || masks.size() == node.getAggregations().size()) { | ||
|
||
if (masks.size() == 0 || !masks.stream().allMatch(mask -> mask.getName().endsWith("$distinct"))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no guarantee that the name of the mask will be named xxx$distinct
, so this check is brittle. We need a better way to identify which fields correspond to the distinct mask (by matching the field they originate from)
What's the performance of this optimization compared with setting |
Result: It is very slow, cost 60 seconds in our perf-test env, regardless of use-mark-distinct. If I change it to
Result: It only cost 20 seconds in our perf-test env(3 nodes, each is 48 cores, 96G memory). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some high-level comments:
- The
OptimizeMixedDistinctAggregations
is a legacy (Visitor-based optimizer), which we're trying to move away from. Ideally, we'd implement this as aRule
, instead. - The optimization only works when
use_mark_distinct
is set totrue
. That causes a GROUP BY with mixed distincts to be turned into a sequence ofMarkDistinct
.OptimizeMixedDistinctAggregations
tries to reverse-engineer that transformation and turn it into the grouping-sets based version.
I think a better approach than extending this optimizer is to implement a Rule that turns a GROUP BY + mixed distincts into a grouping sets-based GROUP BY. It will make the code much simpler as we wouldn't need to dig through the plan to find the MarkDistinct nodes that produce the corresponding "distinct" inputs to the aggregation. Once we do that, we can also get rid of this legacy optimizer.
Let me know if you need help doing this or if you need additional pointers and examples of where to look.
Map<Optional<String>, ExpectedValueProvider<FunctionCall>> aggregationsFirst = ImmutableMap.of( | ||
Optional.of("COUNT"), functionCall("count", ImmutableList.of("ORDERSTATUS"))); | ||
|
||
PlanMatchPattern tableScan = tableScan("orders", ImmutableMap.of("TOTALPRICE", "totalprice", "CUSTKEY", "custkey", "ORDERDATE", "orderdate", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use a map builder for readability when there are many keys and values:
PlanMatchPattern tableScan = tableScan("orders", ImmutableMap.<String, String>builder()
.put("TOTALPRICE", "totalprice")
.put("CUSTKEY", "custkey")
.put("ORDERDATE", "orderdate")
.put("ORDERSTATUS", "orderstatus")
.build());
@martint Got it, thanks. I will implement as a |
4928c60
to
81f00c2
Compare
81f00c2
to
eae95d3
Compare
@@ -275,6 +275,7 @@ public PlanOptimizers( | |||
statsCalculator, | |||
estimatedExchangesCostCalculator, | |||
new CanonicalizeExpressions().rules()), | |||
new UnaliasSymbolReferences(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this needed?
@@ -433,7 +435,6 @@ public PlanOptimizers( | |||
estimatedExchangesCostCalculator, | |||
ImmutableSet.of(new ReorderJoins(costComparator)))); | |||
|
|||
builder.add(new OptimizeMixedDistinctAggregations(metadata)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can remove the OptimizeMixedDistinctAggregations class, too.
.map(Aggregation::getCall) | ||
.filter(FunctionCall::isDistinct) | ||
.map(FunctionCall::getArguments) | ||
.<Set<Expression>>map(HashSet::new) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
<Set<Expression>>
is not required. The compiler should be able to infer the correct type.
|
||
private static boolean hasDistinctInput(AggregationNode aggregation) | ||
{ | ||
return aggregation.getAggregations() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the goal is to find out whether the aggregation has any calls with DISTINCT, this can be simplified to:
return aggregation.getAggregations()
.values().stream()
.map(Aggregation::getCall)
.anyMatch(FunctionCall::isDistinct);
.count() > 0; | ||
} | ||
|
||
private static boolean hasSingleDistinctArgument(AggregationNode aggregation) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name of this method is misleading. It returns true if the the aggregation has DISTINCT with a single argument or no aggregations with DISTINCT.
.noneMatch(e -> e.getMask().isPresent()); | ||
} | ||
|
||
private static boolean noOrdering(AggregationNode aggregation) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is trivial enough that I would just inline it where it's used.
} | ||
} | ||
|
||
List<Symbol> nonDistinctAggregateSymbols = nonDistinctAggregateSymbolsBuilder.build().stream().distinct().collect(Collectors.toList()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you make nonDistinctAggregateSymbolsBuilder
a Set builder, you can skip the calls to remove duplicates.
* | ||
* INTO | ||
* | ||
* SELECT a1, a2,..., an, arbitrary(if(group = 0, f1)),...., arbitrary(if(group = 0, fm)), F(if(group = 1, c1)), ...., F(if(group = m, cm)) FROM |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This transformation is actually incorrect. Not all aggregations are insensitive to nulls, so the terms F(if(group = 1, c1))
, etc. will call the aggregation with null values for other groups:
presto:default> select a, array_agg(b), array_agg(distinct c) from (values (1,1,1)) t(a,b,c) group by a;
a | _col1 | _col2
---+-------+-----------
1 | [1] | [null, 1]
(1 row)
This needs to be rewritten to F(c1) FILTER (WHERE group = 1)
. Also, please add a test for this case.
} | ||
|
||
@Override | ||
public Result apply(AggregationNode node, Captures captures, Context context) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this implementation hard to read. I think it would help if you structured it such that it builds things from the bottom up and return objects that contain the node that was created + anything that's needed to build the node above it. However, I haven't tried playing this out, so not sure how much cleaner it will make it. You'll have to give it a try and see how it goes.
@kaka11chen, any updates on this? |
@martint @kaka11chen - is it still not available in in trino ? I had the same observation with multiple distincts with presto version 340 |
👋 @kaka11chen - this PR is inactive and doesn't seem to be under development. If you'd like to continue work on this at any point in the future, feel free to re-open. |
Fix #613 to optimize distinct aggregation on multi columns.