DOC: Dict of Dicts for renaming Groupby Aggregations #9052

Closed
TomAugspurger opened this issue Dec 10, 2014 · 12 comments · Fixed by #11603

Comments

@TomAugspurger
Contributor

I didn't realize this was possible, and didn't see it in the docs.

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'a', 'b'], 'C': [3, 4, 5]})
df.groupby('B').agg({'A': {'mean1': 'mean', 'med1': 'median'}, 'C': {'mean2': 'mean', 'med2': 'median'}})
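
For readers on a newer pandas, a rough equivalent of the renaming above can be sketched with named aggregation. This assumes pandas 0.25 or later; the nested-dict form was deprecated in 0.20 and removed in 1.0. Note that the result has flat column names rather than the two-level columns the dict-of-dicts form produced.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'a', 'b'], 'C': [3, 4, 5]})

# Named aggregation: each keyword is the output column name,
# each value is a (source column, aggregation function) pair.
result = df.groupby('B').agg(
    mean1=('A', 'mean'),
    med1=('A', 'median'),
    mean2=('C', 'mean'),
    med2=('C', 'median'),
)
print(result)
```

The output names (`mean1`, `med1`, `mean2`, `med2`) are taken from the example above; only the calling convention differs.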

@jreback
Contributor

jreback commented Dec 10, 2014

xref is #8593 (which would replace / enhance this)

@aimboden

Thanks for the tip. Didn't realize this was possible either, this will save me from building my multicolumns "by hand".

@jreback are you planning any API change for 0.16.0 on this? #8593 does not seem to interfere with this behaviour, but maybe a deeper change is planned?

I'd rather not rely on this if it's not tested atm. Or would you accept a test for this?

@jreback
Contributor

jreback commented Dec 11, 2014

@Gimli510 this IS implemented. It's basically the same as the following (except that the name determination is slightly different).

In [5]: df.groupby('B').agg({'A': ['mean','median'], 'C': ['mean','median']})
Out[5]: 
     A           C       
  mean median mean median
B                        
a  1.5    1.5  3.5    3.5
b  3.0    3.0  5.0    5.0

I haven't looked through carefully, but I suspect there is at least one test. I would certainly accept a PR that makes these tests more prominent (e.g. test_agg_api or something).

pd.Summary will enhance this API; the existing behaviour will remain.
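
To make the difference in name determination concrete, here is a small sketch that starts from the list form and flattens the resulting two-level columns by hand. The `'_'.join` naming scheme is just an illustrative choice, not something pandas does on its own.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'a', 'b'], 'C': [3, 4, 5]})

# List form: the columns come back as a two-level MultiIndex of
# (source column, function name) pairs.
res = df.groupby('B').agg({'A': ['mean', 'median'], 'C': ['mean', 'median']})

# Flatten the pairs into single names such as 'A_mean' (illustrative scheme).
flat = res.copy()
flat.columns = ['_'.join(pair) for pair in res.columns]
print(flat)
```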

@jreback
Contributor

jreback commented Nov 12, 2015

from mailing list

In [2]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
   ...:                           'foo', 'bar', 'foo', 'foo'],
   ...:                    'B' : ['one', 'one', 'two', 'three',
   ...:                           'two', 'two', 'one', 'three'],
   ...:                    'C' : np.random.randn(8),
   ...:                    'D' : np.random.randn(8)})

In [3]: grouped = df.groupby(['A', 'B'])

In [4]: grouped[['D','C']].agg({'r':np.sum, 'r2':np.mean})
Out[4]: 
                    r        r2
A   B                          
bar one   D -0.460078 -0.460078
          C  0.798220  0.798220
    three D  1.599986  1.599986
          C -0.554798 -0.554798
    two   D  0.124900  0.124900
          C  0.084758  0.084758
foo one   D -0.466082 -0.233041
          C -0.585512 -0.292756
    three D -0.184726 -0.184726
          C  0.130756  0.130756
    two   D -1.985586 -0.992793
          C  1.275138  0.637569

In [5]: grouped[['D','C']].agg({'r': { 'C' : np.sum }, 'r2' : { 'D' : np.mean }})
Out[5]: 
                    r        r2
                    C         D
A   B                          
bar one   D -0.460078 -0.460078
          C  0.798220  0.798220
    three D  1.599986  1.599986
          C -0.554798 -0.554798
    two   D  0.124900  0.124900
          C  0.084758  0.084758
foo one   D -0.466082 -0.233041
          C -0.585512 -0.292756
    three D -0.184726 -0.184726
          C  0.130756  0.130756
    two   D -1.985586 -0.992793
          C  1.275138  0.637569

In [6]: grouped[['D','C']].agg([np.sum, np.mean])
Out[6]: 
                  D                   C          
                sum      mean       sum      mean
A   B                                            
bar one   -0.460078 -0.460078  0.798220  0.798220
    three  1.599986  1.599986 -0.554798 -0.554798
    two    0.124900  0.124900  0.084758  0.084758
foo one   -0.466082 -0.233041 -0.585512 -0.292756
    three -0.184726 -0.184726  0.130756  0.130756
    two   -1.985586 -0.992793  1.275138  0.637569

with a trivial patch

diff --git a/pandas/core/groupby.py b/pandas/core/groupby.py
index add5080..b885b6f 100644
--- a/pandas/core/groupby.py
+++ b/pandas/core/groupby.py
@@ -2837,9 +2837,6 @@ class NDFrameGroupBy(GroupBy):
             keys = []
             if self._selection is not None:
                 subset = obj
-                if isinstance(subset, DataFrame):
-                    raise NotImplementedError("Aggregating on a DataFrame is "
-                                              "not supported")

                 for fname, agg_how in compat.iteritems(arg):
                     colg = SeriesGroupBy(subset, selection=self._selection,

of course need some tests......

@jreback
Contributor

jreback commented Dec 19, 2015

actually not closing this

jreback reopened this Dec 19, 2015
jreback added a commit to jreback/pandas that referenced this issue Feb 2, 2016
jreback closed this as completed in 1dc49f5 Feb 2, 2016

@xflr6
Contributor

xflr6 commented Feb 14, 2016

The following raises SpecificationError in 0.18.0, although there is no ambiguity (SeriesGroupby):

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.arange(8)})

grouped = df.groupby(['A', 'B'])

grouped['D'].agg({'D': np.sum, 'result2': np.mean})

Is this intended or a bug (I'd prefer to be able to reuse the series column name)?

@jorisvandenbossche
Member

This should work (it is also a regression, as it worked before).
I think it should work because, for a SeriesGroupBy, the dict keys can and should always be interpreted as new column names, not as a selection of existing column names.

@jreback
Contributor

jreback commented Feb 15, 2016

@xflr6

this is fixed in #12329

In [3]: grouped['D'].agg({'D': np.sum, 'result2': np.mean})
Out[3]: 
           result2  D
A   B                
bar one          1  1
    three        3  3
    two          5  5
foo one          3  6
    three        7  7
    two          3  6

Note that this works as well, though maybe not as the user intends (the C here is purely a label; it has nothing to do with the actual aggregation columns).

In [4]: grouped['D'].agg({'D': np.sum, 'C': np.mean})
Out[4]: 
           C  D
A   B          
bar one    1  1
    three  3  3
    two    5  5
foo one    3  6
    three  7  7
    two    3  6
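
For readers on a newer pandas (assumed here: 0.25 or later, where dict-based renaming on a SeriesGroupBy is no longer accepted), the same "keys are output names" behaviour can be sketched with keyword arguments. The column C is left out of the frame because it is not used.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'D': np.arange(8)})

grouped = df.groupby(['A', 'B'])

# Each keyword is purely an output label; reusing the name 'D' is fine
# because a SeriesGroupBy has only one column to aggregate.
result = grouped['D'].agg(D='sum', result2='mean')
print(result)
```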

@arita37

arita37 commented Jan 28, 2017

To add a reference case for complex groupby:
We sometimes have two-dimensional data like
date, user_id, val1, val2, val3

and need to transform it into a per-group ('groupby') form:
user_id, mycol1, mycol2, ...

Usually, this is done by

for x in user_id_list:
    dfi = df[df.user_id == x]  # rows belonging to this user
    user_dict[x]['mycol1'] = myfun(dfi)
    user_dict[x]['mycol2'] = myfun2(dfi)

Is there a way to do this kind of complex, generic grouping with groupby in pandas?
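
One possible approach, offered only as a sketch: let `groupby(...).apply` call a function on each per-user sub-frame and return a Series, so that each key becomes an output column. The sample data and the bodies of `myfun` and `myfun2` are hypothetical stand-ins, since the thread does not say what they compute.

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2017-01-01', '2017-01-02',
                            '2017-01-01', '2017-01-02']),
    'user_id': [1, 1, 2, 2],
    'val1': [10.0, 20.0, 5.0, 7.0],
    'val2': [1.0, 3.0, 2.0, 2.0],
})

def myfun(dfi):
    # Hypothetical metric: total of val1 for this user.
    return dfi['val1'].sum()

def myfun2(dfi):
    # Hypothetical metric: mean of val1 weighted by val2.
    return (dfi['val1'] * dfi['val2']).sum() / dfi['val2'].sum()

def summarize(dfi):
    # One Series per group; its keys become the output columns.
    return pd.Series({'mycol1': myfun(dfi), 'mycol2': myfun2(dfi)})

# Selecting the value columns first keeps the grouping key out of each sub-frame.
result = df.groupby('user_id')[['val1', 'val2']].apply(summarize)
print(result)
```

The result is indexed by `user_id` with one column per metric, which replaces the explicit loop and the `user_dict` bookkeeping.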

@jreback
Contributor

jreback commented Jan 28, 2017
