Is your feature request related to a problem? Please describe.
The current pylibcudf groupby-aggregation API maps to libcudf's. It is very expressive, allowing the specification of an arbitrary number of aggregations for every column to be aggregated, and an arbitrary number of aggregation columns per groupby table. However, this API is also fairly verbose and cumbersome to work with. Writing a groupby-aggregation in pylibcudf currently requires ~10 lines of code, compared to the concise single-line version in the pandas API. Ideally we would like to offer the same level of convenience via a simpler API without sacrificing the flexibility and performance of the more general API where necessary.
Describe the solution you'd like
We should consider making the following changes:
1. Every aggregation should be default-constructible. That essentially means that every parameter should have a default value. This is true of most but not all aggregations already. Where appropriate, we may also want to push some of these defaults down to libcudf, but I would be OK with the small deviation of different default values if necessary.
2. We should add an API roughly like `GroupBy.aggregate_simple` (name TBD) that accepts a `List[Tuple[Column, List[str]]]` and handles the construction of the `GroupByRequest` objects under the hood. This only works once every agg is default-constructible. It may not be terribly useful for the heavily parametrized aggregations, but it will simplify working with the most common unparametrized aggregations (sum, prod, min, max, etc).
3. We should consider adding a functional API `groupby_agg(data: Table, group_columns: List[int], aggs: List[Tuple[Column, List[str]]])` that effectively functions as a simple one-line wrapper around step 2.
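To make step 2 more concrete, here is a rough, library-free sketch of how the string names could be turned into request objects once every aggregation is default-constructible. Everything in it is a hypothetical stand-in written for illustration: `Aggregation`, `GroupByRequest`, `AGG_FACTORIES`, and `build_requests` are not the pylibcudf types or functions, and the `ddof=1` default is only an assumed example of a parameter with a default value.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical stand-in for a column; the real API would take a pylibcudf Column.
Column = List[float]


@dataclass(frozen=True)
class Aggregation:
    """Stand-in for an aggregation object: a kind plus its parameters."""
    kind: str
    params: Tuple[Tuple[str, object], ...] = ()


def make_sum() -> Aggregation:
    return Aggregation("sum")


def make_max() -> Aggregation:
    return Aggregation("max")


def make_std(ddof: int = 1) -> Aggregation:
    # Parametrized, but still default-constructible (step 1).
    return Aggregation("std", (("ddof", ddof),))


# Name -> factory lookup. This only works if every factory can be
# called with no arguments.
AGG_FACTORIES: Dict[str, Callable[[], Aggregation]] = {
    "sum": make_sum,
    "max": make_max,
    "std": make_std,
}


@dataclass
class GroupByRequest:
    """Stand-in for a request: one value column and its aggregations."""
    values: Column
    aggregations: List[Aggregation]


def build_requests(
    aggs: Sequence[Tuple[Column, Sequence[str]]],
) -> List[GroupByRequest]:
    """Build one request per (column, names) pair using default parameters."""
    requests = []
    for column, names in aggs:
        unknown = [name for name in names if name not in AGG_FACTORIES]
        if unknown:
            raise ValueError(f"unknown aggregation name(s): {unknown}")
        requests.append(
            GroupByRequest(column, [AGG_FACTORIES[name]() for name in names])
        )
    return requests
```

With something like this in place, an `aggregate_simple`-style method would only need to call the helper and forward the resulting requests to the existing general-purpose aggregation entry point, so the more flexible API stays untouched for callers that need non-default parameters.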
Describe alternatives you've considered
None yet. I am not yet fully sold on the exact APIs that I proposed above, but I do think that the current API is too cumbersome for direct use most of the time except by other library developers and needs improvement. I do not want to lose the purity of mirroring libcudf APIs, especially not by replacing those APIs with higher-overhead alternatives, so the current API does need to exist. I've listed the three steps above in decreasing order of importance/quality, and we may go a completely different direction with 3. This issue is largely intended as a starting point for a discussion capturing the problems with the current API.