[FEA] SeriesGroupBy doesn't support std aggregation #3429

Closed
Nanthini10 opened this issue Nov 20, 2019 · 6 comments · Fixed by #4346
Labels: feature request (New feature or request), Python (Affects Python cuDF API)

Comments

@Nanthini10

Is your feature request related to a problem? Please describe.
SeriesGroupBy does not implement std() yet.

With the mean aggregation, the following works:

df = cudf.DataFrame({'a': [1.,3,4,1.],'b': [4.,5,6,-10], 'c': [6., 7., 5., 10.]})
df.groupby('a').b.mean()

But it is not yet implemented with standard deviation.

Describe the solution you'd like
It would be useful to select a single Series for aggregation instead of selecting a subset of columns as a DataFrame each time. With larger datasets and several Series to aggregate, it would be better for SeriesGroupBy to have this method.

Describe alternatives you've considered
For now, I'm selecting a subset of the dataframe to get the results for a specific series.

df[['a','b']].groupby('a').std()
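
As a reference for the requested behaviour, here is a minimal sketch written against pandas, whose API cuDF aims to mirror. It assumes the Series route should return the same values as the subset-DataFrame workaround; it does not show cuDF's actual output.

```python
import pandas as pd

df = pd.DataFrame({'a': [1., 3, 4, 1.], 'b': [4., 5, 6, -10], 'c': [6., 7., 5., 10.]})

# Requested form: select the Series from the groupby, then aggregate.
series_std = df.groupby('a').b.std()

# Workaround form: subset the frame first, aggregate, then pick the column.
frame_std = df[['a', 'b']].groupby('a').std()['b']

# Both routes use the sample standard deviation (ddof=1), so groups with a
# single row come back as NaN in either case.
print(series_std)
print(series_std.equals(frame_std))
```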
@Nanthini10 Nanthini10 added Needs Triage Need team to review and classify feature request New feature or request labels Nov 20, 2019
@jangorecki

jangorecki commented Jan 9, 2020

Shouldn't this issue be resolved by #2791 in 0.11.0?
I am on 0.11.0, but it still doesn't seem to work.

ans = x.groupby(['id4','id5'],as_index=False).agg({'v3':'std'})
#Traceback (most recent call last):
#  File "<stdin>", line 1, in <module>
#  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/core/groupby/groupby.py", line 46, in agg
#    return self._apply_aggregation(func)
#  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/core/groupby/groupby.py", line 132, in _apply_aggregation
#    result = self._groupby.compute_result(agg)
#  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/core/groupby/groupby.py", line 370, in compute_result
#    self.dropna,
#  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/core/groupby/groupby.py", line 551, in _groupby_engine
#    key_columns, value_columns, aggs, dropna=dropna
#  File "cudf/_lib/groupby.pyx", line 81, in cudf._lib.groupby.groupby
#KeyError: 'std'

v3 is float64 if that matters
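
While `'std'` is rejected, one possible stopgap is to rebuild the sample standard deviation from aggregations that are more widely supported (count and sum). The sketch below is written against pandas and has not been checked on cuDF 0.11; the helper column `b_sq` is a name made up for this example, and the sum-of-squares formula is less numerically stable than a dedicated std kernel.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1., 3, 4, 1.], 'b': [4., 5, 6, -10]})
df['b_sq'] = df['b'] ** 2  # hypothetical helper column holding x^2

g = df.groupby('a')
n = g['b'].count()
s = g['b'].sum()
ss = g['b_sq'].sum()

# Sample variance (ddof=1): (sum(x^2) - sum(x)^2 / n) / (n - 1).
# Single-row groups evaluate to 0/0, i.e. NaN, like pandas' own std().
var = (ss - s * s / n) / (n - 1)

# Clip tiny negative values caused by rounding before taking the root.
manual_std = np.sqrt(var.clip(lower=0))
print(manual_std)
```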

@kkraus14 kkraus14 added Python Affects Python cuDF API. and removed Needs Triage Need team to review and classify labels Jan 9, 2020
@beckernick
Member

beckernick commented Jan 13, 2020

We'll need more than just the above notation. We would explicitly want to be able to do something like this, too:

import cudf

df = cudf.DataFrame({'a': [1.,3,4,1.],'b': [4.,5,6,-10], 'c': [6., 7., 5., 10.]})
df.groupby('a').agg({'b':['sum', 'std']})
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-9-7c4e9424589f> in <module>
      2 
      3 df = cudf.DataFrame({'a': [1.,3,4,1.],'b': [4.,5,6,-10], 'c': [6., 7., 5., 10.]})
----> 4 df.groupby('a').agg({'b':['sum', 'std']})

/opt/conda/envs/rapids/lib/python3.7/site-packages/cudf/core/groupby/groupby.py in agg(self, func)
     44 
     45     def agg(self, func):
---> 46         return self._apply_aggregation(func)
     47 
     48     def size(self):

/opt/conda/envs/rapids/lib/python3.7/site-packages/cudf/core/groupby/groupby.py in _apply_aggregation(self, agg)
    130         Applies the aggregation function(s) ``agg`` on all columns
    131         """
--> 132         result = self._groupby.compute_result(agg)
    133         libcudf.nvtx.nvtx_range_pop()
    134         return result

/opt/conda/envs/rapids/lib/python3.7/site-packages/cudf/core/groupby/groupby.py in compute_result(self, agg)
    368             aggs_as_list,
    369             self.sort,
--> 370             self.dropna,
    371         )
    372 

/opt/conda/envs/rapids/lib/python3.7/site-packages/cudf/core/groupby/groupby.py in _groupby_engine(key_columns, value_columns, aggs, sort, dropna)
    549     """
    550     out_key_columns, out_value_columns = libcudf.groupby.groupby(
--> 551         key_columns, value_columns, aggs, dropna=dropna
    552     )
    553 

cudf/_lib/groupby.pyx in cudf._lib.groupby.groupby()

KeyError: 'std'

That way, we could do multiple groupby aggregations in a single call to libcudf.groupby.groupby.
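
For comparison, this is roughly what the same dict-of-lists call looks like in pandas, which is the behaviour being asked for. It is a pandas sketch only; at the time of this comment the cuDF call raised the KeyError shown above.

```python
import pandas as pd

df = pd.DataFrame({'a': [1., 3, 4, 1.], 'b': [4., 5, 6, -10], 'c': [6., 7., 5., 10.]})

# One call, several aggregations on the same column.
result = df.groupby('a').agg({'b': ['sum', 'std']})

# The columns form a two-level index: ('b', 'sum') and ('b', 'std').
print(result.columns.tolist())
print(result)
```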

@jangorecki

@beckernick your comment seems to be a different feature request; it was already requested in #3737.

@beckernick
Member

Thanks for linking that, @jangorecki. The error in that issue is caused by as_index=False, which we don't currently support (at least in 0.12). If you use the default as_index=True, you can do cdf.groupby(['a']).agg({'c': ['sum','mean']}). I'll comment on that issue and update it to reflect the specific nature of the missing feature.

What you can't currently do is cdf.groupby(['a']).agg({'c': ['sum','std']}), due to the current implementation of groupby.std.

However, now that I think about it more, I wonder whether we would actually avoid two libcudf calls unless all of the requested aggs are hash-based or all of them are sort-based.
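
To make that concern concrete, here is a small pure-Python sketch of how a mixed request might split into separate backend calls. The `HASH_BASED` and `SORT_BASED` sets and the `plan_groupby_calls` helper are hypothetical names for illustration; which aggregations libcudf actually serves from which path is an assumption here, not something confirmed in this thread.

```python
# Assumed classification, for illustration only; the real split may differ.
HASH_BASED = {"sum", "min", "max", "count", "mean"}
SORT_BASED = {"std", "var", "median", "quantile"}


def plan_groupby_calls(aggs):
    """Bucket aggregation names by the backend assumed to serve them."""
    plan = {"hash": [], "sort": []}
    for name in aggs:
        if name in HASH_BASED:
            plan["hash"].append(name)
        elif name in SORT_BASED:
            plan["sort"].append(name)
        else:
            raise KeyError(name)
    # Drop empty buckets: each remaining bucket stands for one backend call.
    return {kind: names for kind, names in plan.items() if names}


print(plan_groupby_calls(["sum", "mean"]))  # one bucket, so one call
print(plan_groupby_calls(["sum", "std"]))   # two buckets, so two calls
```

Under these assumptions, a single Python-level `agg` call only maps to a single backend call when every requested aggregation lands in the same bucket.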

@harrism
Member

harrism commented Jan 28, 2020

@devavret added groupby std in #2791, as mentioned. However, it was never exposed through the Python bindings. @shwina, I believe you assigned yourself to port groupby.pyx to the new APIs. Can you look into whether support for std will fall out of that?

@shwina
Contributor

shwina commented Jan 29, 2020

Thanks @harrism. This will be tackled in the upcoming groupby libcudf++ port.
