Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] support max and min aggregations for nested structs #8974

Closed
5 of 6 tasks
revans2 opened this issue Aug 5, 2021 · 17 comments
Closed
5 of 6 tasks

[FEA] support max and min aggregations for nested structs #8974

revans2 opened this issue Aug 5, 2021 · 17 comments
Assignees
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@revans2
Copy link
Contributor

revans2 commented Aug 5, 2021

Is your feature request related to a problem? Please describe.
In the Spark plugin we have a push to try and support as much of structs as we can. Cudf supports sorting of nested structs (no lists just basic types and other structs). It would really be great if we could support max and min aggregations on these as well. There are a lot of different types of max/min aggregations and if we cannot get them all at once we can take it a bit at a time too. We would like to be able to support this for.

  • Sort-based group by aggregations
  • Hash-based group by aggregations
  • Group by scans
  • reductions
  • scans
  • window operations

like described in #8964 null child column values would come before non-null child column values.

Describe the solution you'd like
Just what I asked for min/max aggregations that can work on structs.

Describe alternatives you've considered
I'm not sure there is an alternative that we can support.

Additional context
Add any other context, code examples, or references to existing implementations about the feature request here.

@revans2 revans2 added feature request New feature or request Needs Triage Need team to review and classify Spark Functionality that helps Spark RAPIDS labels Aug 5, 2021
@beckernick beckernick added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Aug 23, 2021
@ttnghia
Copy link
Contributor

ttnghia commented Oct 15, 2021

This issue should cover the previous one: #7995

@ttnghia ttnghia self-assigned this Oct 18, 2021
rapids-bot bot pushed a commit that referenced this issue Nov 10, 2021
… aggregate() and scan() (#9545)

This PR adds struct support for `min`, `max`, `argmin` and `argmax` in groupby aggregation, and `min`, `max` in groupby scan. Although these operations require implementation for both hash-based and sort-based approaches, this PR only adds struct support into the sort-based approach due to the complexity of the hash-based implementation. Struct support for the hash-based approach will be future work.

Partially addresses #8974 and #7995.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Jake Hemstad (https://github.com/jrhemstad)

URL: #9545
rapids-bot bot pushed a commit that referenced this issue Nov 18, 2021
This PR continues to address #8974, adding support for structs in `min` and `max` reduction.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Mark Harris (https://github.com/harrism)
  - https://github.com/nvdbaranec

URL: #9697
rapids-bot bot pushed a commit that referenced this issue Nov 30, 2021
This PR continues to address #8974, adding support for structs in `min` and `max` inclusive scan. Exclusive scan support is not needed in the near future.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - https://github.com/nvdbaranec

URL: #9725
@ttnghia
Copy link
Contributor

ttnghia commented Nov 30, 2021

Currently, there are two more places we need to support struct min and max ops:

  • Hash-based groupby aggregates
  • Window ops

In these places, currently the internal functions call a kernel very early, which prohibits injecting any additional host function call. As such, supporting structs in those functions requires refactoring the entire computation pipeline. Thus, supporting structs in them will be postponed until such refactoring is done.

@res-life

This comment has been minimized.

@ttnghia

This comment has been minimized.

@ttnghia

This comment has been minimized.

@res-life

This comment has been minimized.

@ttnghia

This comment has been minimized.

@res-life

This comment has been minimized.

@ttnghia

This comment has been minimized.

@ttnghia

This comment has been minimized.

@res-life

This comment has been minimized.

@res-life

This comment has been minimized.

rapids-bot bot pushed a commit that referenced this issue Mar 7, 2022
)

This PR adds support for `min` and `max` operations in rolling window for STRUCT type. It also does some minor modifications to the existing code, such as renaming some variables and refining some comments.

Partially addresses #8974.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Jake Hemstad (https://github.com/jrhemstad)
  - David Wendt (https://github.com/davidwendt)

URL: #10332
@GregoryKimball
Copy link
Contributor

@revans2 is this issue solved?

@ttnghia
Copy link
Contributor

ttnghia commented Nov 4, 2024

No this is not resolved, pending the unfinished task above.

@revans2
Copy link
Contributor Author

revans2 commented Nov 4, 2024

Yes it looks like we have full support now.

@revans2 revans2 closed this as completed Nov 4, 2024
@revans2
Copy link
Contributor Author

revans2 commented Nov 4, 2024

@ttnghia I will file a follow on issue for the hash based min/max group by.

@revans2
Copy link
Contributor Author

revans2 commented Nov 4, 2024

#17241 is what I filed as a follow on issue for performance. From a usability standpoint this what we have works today.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

No branches or pull requests

6 participants