[FEA] Simplify the pylibcudf groupby-aggregation API #15130

Open
vyasr opened this issue Feb 24, 2024 · 0 comments
Labels
feature request New feature or request pylibcudf Issues specific to the pylibcudf package

vyasr commented Feb 24, 2024

Is your feature request related to a problem? Please describe.
The current pylibcudf groupby-aggregation API mirrors libcudf's. It is very expressive, allowing the specification of an arbitrary number of aggregations for every column to be aggregated, and an arbitrary number of aggregation columns per groupby table. However, this API is also fairly verbose and cumbersome to work with. Writing a groupby-aggregation in pylibcudf currently requires ~10 lines of code, compared with the concise single-line version in the pandas API. Ideally we would like to offer the same level of convenience via a simpler API without sacrificing the flexibility and performance of the more general API where necessary.

Describe the solution you'd like
We should consider making the following changes:

  1. Every aggregation should be default-constructible. That essentially means that every parameter should have a default value. This is already true of most, but not all, aggregations. Where appropriate, we may also want to push some of these defaults down to libcudf, but I would be OK with the small deviation of different default values if necessary.
  2. We should add an API roughly like GroupBy.aggregate_simple (name TBD) that accepts a List[Tuple[Column, List[str]]] and handles the construction of the GroupByRequest objects under the hood. This only works once every agg is default-constructible. It may not be terribly useful for the heavily parametrized aggregations, but it will simplify working with the most common unparametrized aggregations (sum, prod, min, max, etc).
  3. We should consider adding a functional API groupby_agg(data: Table, group_columns: List[int], aggs: List[Tuple[Column, List[str]]]) that effectively functions as a simple one-line wrapper around step 2.
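
To make steps 1 and 2 concrete, here is a minimal pure-Python sketch of how string names could be turned into request objects once every aggregation is default-constructible. It does not use pylibcudf: `Column`, `Aggregation`, `GroupByRequest`, `AGG_FACTORIES`, and `build_requests` are all hypothetical stand-ins defined here for illustration, and the `q=0.5` default for `quantile` is an assumed value, not one taken from libcudf.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Stand-in for a device column; the real type would be a pylibcudf Column.
Column = List[float]


@dataclass(frozen=True)
class Aggregation:
    """Hypothetical stand-in for an aggregation object."""
    kind: str
    params: Tuple[Tuple[str, object], ...] = ()


@dataclass
class GroupByRequest:
    """Hypothetical stand-in pairing a value column with its aggregations."""
    values: Column
    aggregations: List[Aggregation]


def _make_factory(kind: str, **defaults: object) -> Callable[..., Aggregation]:
    """Return a factory that can be called with no arguments (step 1)."""
    def factory(**overrides: object) -> Aggregation:
        params = {**defaults, **overrides}
        return Aggregation(kind, tuple(sorted(params.items())))
    return factory


# Hypothetical name -> factory registry. Every entry is default-constructible.
AGG_FACTORIES: Dict[str, Callable[..., Aggregation]] = {
    "sum": _make_factory("sum"),
    "min": _make_factory("min"),
    "max": _make_factory("max"),
    "quantile": _make_factory("quantile", q=0.5),  # assumed default
}


def build_requests(
    aggs: List[Tuple[Column, List[str]]],
) -> List[GroupByRequest]:
    """Build one request per (column, names) pair, as step 2 describes."""
    requests = []
    for column, names in aggs:
        unknown = [name for name in names if name not in AGG_FACTORIES]
        if unknown:
            raise ValueError(f"unknown aggregation(s): {unknown}")
        requests.append(
            GroupByRequest(column, [AGG_FACTORIES[name]() for name in names])
        )
    return requests
```

With something along these lines, a call such as `build_requests([(col_a, ["sum", "max"]), (col_b, ["min"])])` would replace the manual construction of each request, and the functional API in step 3 could forward its `aggs` argument to it directly.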

Describe alternatives you've considered
None yet. I am not yet fully sold on the exact APIs that I proposed above, but I do think that the current API is too cumbersome for direct use most of the time, except by other library developers, and needs improvement. I do not want to lose the purity of mirroring libcudf APIs, especially not by replacing those APIs with higher-overhead alternatives, so the current API does need to exist. I've listed the three steps above in decreasing order of importance/quality, and we may go in a completely different direction with 3. This issue is largely intended as a starting point for a discussion capturing the problems with the current API.

Projects
Status: UX