Using numpy median with groupby aggregate #1989

Closed
hayd opened this issue Sep 29, 2012 · 3 comments
Labels
Groupby Ideas Long-Term Enhancement Discussions

hayd commented Sep 29, 2012

Migrated from http://stackoverflow.com/questions/12651618/inconsisitency-in-results-of-aggregating-pandas-groupby-object-using-numpy-media

import numpy as np
import pandas as pd

test = pd.DataFrame({'A': [10, 11, 12, 13, 15, 25, 43, 70],
                     'B': [1, 2, 3, 4, 5, 6, 7, 8],
                     'C': [1, 1, 1, 1, 2, 2, 2, 2]})
    A  B  C
0  10  1  1
1  11  2  1
2  12  3  1
3  13  4  1
4  15  5  2
5  25  6  2
6  43  7  2
7  70  8  2
test_g = test.groupby('C')

Aggregating with np.median (unexpectedly) takes the median over all of a group's values, across columns A and B together, and broadcasts that single number to every column:

test_g.aggregate(np.median) 
      A     B
C            
1   7.0   7.0
2  11.5  11.5

It works perfectly when axis=0 is passed in:

test_g.aggregate(np.median, axis=0)
      A    B
C           
1  11.5  2.5
2  34.0  6.5
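
The two results above can be reproduced with NumPy alone. This is only a minimal sketch of the underlying behaviour, not of what pandas does internally: with no axis argument np.median reduces over the flattened array, while axis=0 reduces each column separately. The array `group1` is a name made up for this illustration; it holds the A and B values of the rows where C == 1, copied from the example.

```python
import numpy as np

# Columns A and B of the rows where C == 1, copied from the example above.
# (`group1` is an illustrative name, not something pandas exposes.)
group1 = np.array([[10, 1],
                   [11, 2],
                   [12, 3],
                   [13, 4]])

# Default axis=None: the array is flattened first, so A and B values are mixed.
flat_median = np.median(group1)

# axis=0: one median per column (A first, then B).
column_medians = np.median(group1, axis=0)

print(flat_median)              # 7.0
print(column_medians.tolist())  # [11.5, 2.5]
```

The flattened value 7.0 matches the number broadcast to both columns for group 1, and the axis=0 values match the expected per-column medians.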

With np.mean (and likewise sum, min, max) the aggregation is not taken over the entire array; each column is reduced separately:

test_g.aggregate(np.mean) 
       A    B
C            
1  11.50  2.5
2  38.25  6.5

Perhaps worth noting that this behaviour is not seen when the function is passed in as a list:

test_g.agg([np.median])
       A      B
  median median
C
1   11.5    2.5
2   34.0    6.5
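
For comparison, a sketch of getting per-column medians without depending on how np.median handles its axis argument. It assumes a reasonably recent pandas, where the groupby object has its own `median` method and `agg` accepts the string name `'median'`; the frame is the one from the example.

```python
import pandas as pd

test = pd.DataFrame({'A': [10, 11, 12, 13, 15, 25, 43, 70],
                     'B': [1, 2, 3, 4, 5, 6, 7, 8],
                     'C': [1, 1, 1, 1, 2, 2, 2, 2]})
test_g = test.groupby('C')

# The built-in groupby reduction: one median per column, per group.
by_method = test_g.median()

# The same reduction requested by name through agg.
by_name = test_g.agg('median')

print(by_method)
```

Both calls reduce each column within each group, which is the result that passing axis=0 produced above.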

seberg commented Sep 29, 2012

Never used Pandas, so I am not sure how it's seen, but as a suggestion: maybe pandas could throw a warning (or even an Exception) if the result is broadcast to all columns? It seems a bit dangerous that numpy functions behave very differently depending on whether or not they are also ndarray attributes (due to a funny back and forth when trying to execute the corresponding pandas attribute with axis=None), and more generally that there is silent switching between passing 2D and 1D array-likes.


hayd commented Sep 29, 2012

You're right, the issue is that axis=0 isn't passed in by default (and perhaps it should be... although it may be difficult to know what to pass in as default for arbitrary functions?).

As median flattens the array (and produces a number), and since:

test_g.aggregate(lambda x: 8)

makes everything 8, this behaviour is "expected" in some sense (and sometimes might be what you want) so we probably don't want an exception...?

changhiskhan commented

test_g.aggregate(np.median) should now give the correct result. np.mean behaved differently originally because certain numpy functions are special-cased in the pandas groupby machinery for speed, which also changed their default behavior to be pandas-like (df.mean()) rather than numpy-like (np.mean(arr)).
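
The pandas-like versus numpy-like distinction can be seen outside of groupby. The sketch below only illustrates the two default behaviours on a small frame (the group C == 1 values from the example); it says nothing about how the special-casing is implemented. The names `per_column` and `overall` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [10, 11, 12, 13], 'B': [1, 2, 3, 4]})

# pandas-like: DataFrame.mean() reduces along the rows, giving one mean per column.
per_column = df.mean()

# numpy-like: np.mean on the underlying 2D array defaults to axis=None,
# giving a single mean over every value.
overall = np.mean(df.to_numpy())

print(per_column.tolist())  # [11.5, 2.5]
print(overall)              # 7.0
```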
