[FEA] Simplify the pylibcudf groupby-aggregation API #15130

vyasr · 2024-02-24T00:05:41Z

Is your feature request related to a problem? Please describe.
The current pylibcudf groupby-aggregation API mirrors libcudf's. It is very expressive, allowing an arbitrary number of aggregations for every column to be aggregated, and an arbitrary number of aggregation columns per groupby table. However, this API is also fairly verbose and cumbersome to work with. Writing a groupby-aggregation in pylibcudf currently requires ~10 lines of code, compared with the concise single-line version in the pandas API. Ideally we would like to offer the same level of convenience via a simpler API without sacrificing the flexibility and performance of the more general API where necessary.

Describe the solution you'd like
We should consider making the following changes:

1. Every aggregation should be default-constructible. That essentially means that every parameter should have a default value. This is already true of most, but not all, aggregations. Where appropriate, we may also want to push some of these defaults down to libcudf, but I would be OK with the small deviation of different default values if necessary.
2. We should add an API roughly like GroupBy.aggregate_simple (name TBD) that accepts a List[Tuple[Column, List[str]]] and handles the construction of the GroupByRequest objects under the hood. This only works once every aggregation is default-constructible. It may not be terribly useful for the heavily parametrized aggregations, but it will simplify working with the most common unparametrized aggregations (sum, prod, min, max, etc).
3. We should consider adding a functional API groupby_agg(data: Table, group_columns: List[int], aggs: List[Tuple[Column, List[str]]]) that effectively functions as a simple one-line wrapper around step 2.
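
To make the proposal concrete, here is a rough pure-Python sketch of how step 2 could turn aggregation names into request objects. Everything in it is an assumption for illustration: Aggregation and GroupByRequest are stand-in dataclasses rather than the real pylibcudf classes, columns are plain Python lists, and the names AGG_FACTORIES and build_requests are hypothetical. The point is only that a name-based lookup works once every aggregation can be constructed with no arguments.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class Aggregation:
    """Stand-in for an aggregation object (hypothetical, not pylibcudf's class)."""

    kind: str
    ddof: int = 1  # example of a parameter that must have a default value


@dataclass(frozen=True)
class GroupByRequest:
    """Stand-in for a request: one values column plus its aggregations."""

    values: Sequence
    aggregations: List[Aggregation]


# Zero-argument factories keyed by name. This lookup is only possible
# when every aggregation is default-constructible (step 1).
AGG_FACTORIES: Dict[str, Callable[[], Aggregation]] = {
    "sum": lambda: Aggregation("sum"),
    "prod": lambda: Aggregation("prod"),
    "min": lambda: Aggregation("min"),
    "max": lambda: Aggregation("max"),
    "std": lambda: Aggregation("std"),  # relies on the default ddof
}


def build_requests(
    aggs: List[Tuple[Sequence, List[str]]],
) -> List[GroupByRequest]:
    """Build one request per (column, aggregation names) pair."""
    requests = []
    for column, names in aggs:
        unknown = [name for name in names if name not in AGG_FACTORIES]
        if unknown:
            raise ValueError(f"Unknown aggregation name(s): {unknown}")
        requests.append(
            GroupByRequest(column, [AGG_FACTORIES[name]() for name in names])
        )
    return requests


# Usage: two value columns, each with its own list of aggregation names.
requests = build_requests(
    [([1, 2, 3], ["sum", "max"]), ([4.0, 5.0, 6.0], ["std"])]
)
```

A GroupBy.aggregate_simple method could then pass the resulting requests to the existing general aggregate call, so the simple path would add only a dictionary lookup per aggregation on top of the current API.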

Describe alternatives you've considered
None yet. I am not yet fully sold on the exact APIs that I proposed above, but I do think that the current API is too cumbersome for direct use most of the time, except by other library developers, and needs improvement. I do not want to lose the purity of mirroring libcudf APIs, especially not by replacing those APIs with higher-overhead alternatives, so the current API does need to exist. I've listed the three steps above in decreasing order of importance/quality, and we may go in a completely different direction with step 3. This issue is largely intended as a starting point for a discussion capturing the problems with the current API.

vyasr added the feature request New feature or request label Feb 24, 2024

vyasr mentioned this issue Mar 4, 2024

[FEA] Be consistent in handling of default parameter values in pylibcudf #15198

Open

vyasr added the pylibcudf Issues specific to the pylibcudf package label May 28, 2024

vyasr mentioned this issue Jun 26, 2024

[BUG] Segfault in pylibcudf to_arrow interop when passing nested list and metadata #16069

Closed
