Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] support min and max group by aggregations and reductions on lists of structs and strings #10408

Closed
revans2 opened this issue Mar 10, 2022 · 10 comments · Fixed by #13676
Closed
Assignees
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.

Comments

@revans2
Copy link
Contributor

revans2 commented Mar 10, 2022

Is your feature request related to a problem? Please describe.
This is specifically for NVIDIA/spark-rapids#4929 for a customer query.

Describe the solution you'd like
Group by and reduction aggregations for min/max that support lists of structs and lists of strings. Ideally we could solve ordering in general like #5890 is trying to do. This would then be a follow on PR to reuse the row comparison code for min/max in this case.

If more information is needed like null ordering etc I can provide it.

Describe alternatives you've considered
Just ugly hacks that we should not do.

@revans2 revans2 added feature request New feature or request Needs Triage Need team to review and classify labels Mar 10, 2022
@github-actions github-actions bot added this to Needs prioritizing in Feature Planning Mar 10, 2022
@revans2 revans2 changed the title [FEA] support min and max group by aggregations and reductions on lists of structs [FEA] support min and max group by aggregations and reductions on lists of structs and strings Mar 10, 2022
@devavret
Copy link
Contributor

Can you explain what ordering would look like on a list of struct? Say I have

[{1, b}, {1, b}]
[{1, a}, {2, b}]

Would we compare the first child and then the second child? meaning the order of comparisons is 1, 1, a, b vs 1, 2, b, b and so idx 0 < idx 1

Or would we compare in this order 1, b, 1, b vs 1, a, 2, b meaning idx 1 < idx 0.

@revans2
Copy link
Contributor Author

revans2 commented Mar 10, 2022

It is the second one 1, b, 1, b vs 1, a, 2, b. But more generally sorting a list in Spark looks at each element in the list starting at the first and only if they are the same does it look at the next elements to break the tie.

null == null in sorting elements, but null < non-null

If one list is longer than another list, then it shorter list is less than the longer list if and only if all of the elements in the shorter list match those in the longer list up to the length of the shorter list.

The code here is for sorting list/arrays in Spark

https://github.com/apache/spark/blob/a26c01d85035b487e2048f1c106b14b89455e2d9/sql/catalyst/src/main/scala/org/apache/spark/sql/types/ArrayType.scala#L116-L148

If you want some examples I can some up with a few.

@ttnghia ttnghia self-assigned this Mar 10, 2022
@ttnghia
Copy link
Contributor

ttnghia commented Mar 10, 2022

@devavret I suppose to work on this, if nobody has been assigned.

@jrhemstad
Copy link
Contributor

@ttnghia please hold off. This requires further discussion.

@github-actions
Copy link

github-actions bot commented Apr 9, 2022

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@sameerz
Copy link
Contributor

sameerz commented May 3, 2022

Sill needed

@github-actions
Copy link

github-actions bot commented Jun 2, 2022

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@sameerz
Copy link
Contributor

sameerz commented Jun 2, 2022

Still needed.

@ttnghia
Copy link
Contributor

ttnghia commented Jun 3, 2022

Depends on #11129.

@GregoryKimball GregoryKimball added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Jun 28, 2022
@github-actions
Copy link

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

rapids-bot bot pushed a commit that referenced this issue Jul 12, 2023
…uction (#13676)

This adds support for list type in `min` and `max` aggregations in groupby and reduction contexts.

Closes #13667 and closes #10408.

Status:
 * [X] Implementation.
 * [X] Unit tests.
 * [X] Run `compute-sanitizer`.
 * [X] Test with spark-rapids (NVIDIA/spark-rapids#8689).

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Divye Gala (https://github.com/divyegala)
  - MithunR (https://github.com/mythrocks)

URL: #13676
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.
Projects
No open projects
Feature Planning
Needs prioritizing
Development

Successfully merging a pull request may close this issue.

6 participants