
pd.Series.groupby is said to have only optional parameters #8015

Closed
tknuth opened this issue Aug 13, 2014 · 17 comments
Labels: Error Reporting (incorrect or improved errors from pandas), Groupby


tknuth commented Aug 13, 2014

s = pd.Series(x)
m = s.groupby().count() # does not work
m = s.groupby(s.values).count() # works

In https://github.com/pydata/pandas/blob/master/pandas/core/generic.py, by=None should be corrected to just by, since by=None always leads to an error.

def groupby(self, by=None, axis=0, level=None, as_index=True, sort=True,
            group_keys=True, squeeze=False):
    from pandas.core.groupby import groupby
    axis = self._get_axis_number(axis)
    return groupby(self, by, axis=axis, level=level, as_index=as_index,
                   sort=sort, group_keys=group_keys, squeeze=squeeze)

See this Stack Overflow question: http://stackoverflow.com/questions/17929426/groupby-for-pandas-series-not-working
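For reference, a runnable version of the failure and the workaround (using a small example Series, since x isn't defined above; the exact error message varies by pandas version):

```python
import pandas as pd

s = pd.Series([1, 1, 2, 2, 3])

# Calling groupby() with no arguments raises a TypeError
# (the exact message depends on the pandas version):
try:
    s.groupby().count()
except TypeError as exc:
    print("groupby() failed:", exc)

# Passing a grouper explicitly works:
print(s.groupby(s.values).count())
```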

@jorisvandenbossche
Member

by=None is possible, e.g. if you provide level you don't need to specify by.

But we could probably provide a better error message here, something like "You have to specify at least one of 'by' or 'level'".

Current error: TypeError: 'NoneType' object is not callable

@jorisvandenbossche
Member

Thinking about this again, I am a little bit confused:

The signature is: DataFrame.groupby(by=None, axis=0, level=None, ..), so there is a default of axis=0, and the explanation of level states "If the axis is a MultiIndex (hierarchical), group by a particular level or levels".
So when you don't have a MultiIndex, but a single Index, I would assume that you don't need to specify the level kwarg. So in that reasoning, df.groupby(axis=0) should just group by the row index?

In [39]: df = pd.DataFrame(np.random.randn(5,5), index=[0,0,1,1,1])

In [40]: df.groupby(axis=0, level=0).sum()
Out[40]:
          0         1         2         3         4
0 -2.219825  0.787332 -0.863655  0.877530  3.664685
1  0.766315 -2.294101 -1.636917 -0.205543  1.535448

In [41]: df.groupby(axis=0).sum()
-> TypeError: 'NoneType' object is not callable

So either this should actually group by the index, or the documentation should be updated.

@hayd @jreback

@hayd
Contributor

hayd commented Oct 30, 2014

Hmmmm. To me this looks much less readable than (something like):

df.groupby(df.index).sum()

I don't think this being the default is sensible (I think it would lead to confusion/bugs as well as not usually being a regular use case) ... ?

That said, the error message should definitely be fixed! Perhaps say: group by a column(s), df.index, or ... just (perish the thought) see the groupby docs!

@jorisvandenbossche
Member

@hayd yep, I can agree with that, being more explicit is better.

But then, I would argue that axis should not have a default of 0, so that df.groupby(df.index) would be equivalent to df.groupby(axis=0) when you have a single index.
The problem with that is that you then can't specify only the level in df.groupby(level=1) with a MultiIndex; you would always also have to specify axis, as in df.groupby(axis=0, level=1), since axis would no longer be 0 by default.

A use case is to remove duplicates in your index, but indeed, not the most standard thing to use groupby for.

@hayd
Contributor

hayd commented Oct 30, 2014

I would argue that this minor inconsistency (requiring level=0 to be passed, though I'm not sure I even agree it's inconsistent!) is a good compromise compared to allowing .groupby().

drop_duplicates is a bad example (there's a function for that), but I know I've done this before for something and it can be useful; it's just infrequently required, which I think is a good reason not to make it the "default": a. you rarely use it, b. you'd have to look up what it's doing when you do see it.

There's a reason to have axis=0 by default: we are grouping row-wise (be it on the index or a column). It doesn't mean group by the index!

That is to say, I like the current behaviour (both are explicit in what they are doing):

df.groupby(level=0) == df.groupby(df.index)

and leaving groupby blank should definitely raise

df.groupby() # reader has no idea what this should do
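The claimed equivalence of the two explicit spellings can be checked directly (small illustrative frame):

```python
import numpy as np
import pandas as pd

# A frame with duplicate index labels, like the earlier examples:
df = pd.DataFrame(np.arange(25).reshape(5, 5), index=[0, 0, 1, 1, 1])

# Both spellings group rows by the index and produce the same result:
by_level = df.groupby(level=0).sum()
by_index = df.groupby(df.index).sum()
assert by_level.equals(by_index)
print(by_level)
```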

@jorisvandenbossche
Member

@hayd Ah yes, forgot about that meaning of axis=0!

So let's just say then: the docstring of level should clarify that it is not only for a MultiIndex, and an explanation of the axis kwarg should be added (there is none now).

And to further nitpick a bit :-) drop_duplicates is not fully a bad example: the drop_duplicates method looks at duplicate rows (which I always forget at first) and not the index. There is also a drop_duplicates method for Index, but that is a bit less straightforward to use (as you have to re-index with the new index). Also, with drop_duplicates you can only drop duplicates, while with groupby you can, for example, also take the mean of the duplicates.
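The distinction can be illustrated with a small made-up frame: drop_duplicates compares rows, while groupby on the index lets you aggregate duplicate index entries:

```python
import pandas as pd

# Duplicate index labels, but distinct row values:
df = pd.DataFrame({"a": [1.0, 3.0, 10.0]}, index=[0, 0, 1])

# drop_duplicates looks at duplicate *rows*; all three rows differ,
# so nothing is dropped even though the index has duplicates:
deduped = df.drop_duplicates()
assert len(deduped) == 3

# groupby on the index lets you choose how to combine the duplicates,
# e.g. take their mean:
means = df.groupby(level=0).mean()
print(means)  # index 0 -> mean of 1.0 and 3.0; index 1 -> 10.0
```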

@jreback
Contributor

jreback commented Oct 30, 2014

axis is just a modifier of the level argument. It doesn't have meaning unless that is specified.

So in reality, you have to specify:

by or level/axis

would not allow just a bare axis=0, because then it ONLY works if the index is an Index; better to be explicit and have the user pass the actual axis

@springcoil
Contributor

This all feels like turtles all the way down :)

@jorisvandenbossche
Member

@jreback I think @hayd is right that axis has indeed another meaning as well: on which axis the grouping is applied, since you can also combine axis with by (and not only with level):

In [61]: df = pd.DataFrame(np.random.randn(5,5))

In [62]: df.groupby(by=pd.Series([1,1,1,2,2]), axis=0).sum()
Out[62]:
          0         1         2         3         4
1  3.685961  0.176227  0.422497  0.688798  0.341932
2  0.397360 -1.918284 -0.498714  3.177563  0.379342

In [63]: df.groupby(by=pd.Series([1,1,1,2,2]), axis=1).sum()
Out[63]:
          1         2
0  1.749975 -0.624686
1 -0.862650  1.331078
2  3.397360  0.324337
3 -1.941983  1.222833
4 -0.077655  2.334072

@jreback
Contributor

jreback commented Oct 30, 2014

@jorisvandenbossche that looks right, so axis is a modifier to by or level (which are the required params).

@jorisvandenbossche
Member

Yep, indeed, the docstring should be clarified a bit on that account.
So, @springcoil, that doesn't change anything for your PR; at least one of by or level should be provided.

@jreback
Contributor

jreback commented Dec 3, 2014

closed by #8950

@jreback jreback closed this as completed Dec 3, 2014
@ffranchina

Hello!

Sorry for bringing this old issue back to life, but I find the current syntax of the groupby() method very ugly when applied to a pandas.Series. It took me quite a few minutes (ok, I am a pandas newbie) to figure out how to do it. Python focuses on simplicity, and I think that this

s = pd.Series(x)
s.groupby(s.values).count()

is not straightforward at all.

A pandas.Series object is a bit more than just an array; isn't what I mean implicit by just typing s.groupby().count()? Since it's just an array of objects, it's obvious I want to group according to them!

I'd like to propose this little enhancement; would you accept a PR of this kind? Or, at least, can you explain the design choice behind this syntax, please?

Thank you and have a nice day!

@jreback
Contributor

jreback commented Nov 26, 2017

so which shall be the answer?

explicit is much better than implicit

In [1]: s = pd.Series([1, 1, 2, 2, 3])

In [2]: s.groupby(s.index).count()
Out[2]: 
0    1
1    1
2    1
3    1
4    1
dtype: int64

In [3]: s.groupby(s.values).count()
Out[3]: 
1    2
2    2
3    1
dtype: int64

# the same as [2]
In [5]: s.groupby(level=0).count()
Out[5]: 
0    1
1    1
2    1
3    1
4    1
dtype: int64

@ffranchina

I totally agree that explicit is better than implicit, but from my point of view it should not have a different meaning from

In [1]: s = pd.Series([1, 1, 2, 2, 3])

In [3]: s.groupby(s.values).count()
Out[3]: 
1    2
2    2
3    1
dtype: int64

unless specified otherwise. After all,

>>> s = pd.Series([1, 1, 2, 2, 3])
>>> s.groupby(s).count()
1    2
2    2
3    1
dtype: int64

does exactly that; I would simply propose to make this the default behavior.

But I'm just speaking from a user's point of view; I hardly know anything about the details of the implementation, I just wanted to highlight the problem :)

If you think it may be a reasonable point of view, I can work on a small PR.

@jreback
Contributor

jreback commented Nov 26, 2017

@ferdas but you are missing the point: we could group by the index or by the values (in the Series case) by default, so which shall it be? And if you pick one, why? It makes no sense to have a default here when it's not clear which is the preferred operation. So -1 on any change to this.

Furthermore, for what you are actually using this for, .value_counts is more idiomatic.

In [35]: pd.Series([1, 1, 2, 2, 3]).value_counts()
Out[35]: 
2    2
1    2
3    1
dtype: int64

@ffranchina

Since the actual behavior (s.groupby(s).count()) works this way, I was deducing that it should behave the same way by default when no arguments are passed.

I see your point, but I don't get how it's possible to think that by calling s.groupby().count() the user is confused about what he wants: since a pandas.Series is meant to store values, he definitely wants to group by the values; if he wants something else (I want to group by the index), he would still have a way to make that explicit.

Due to the values-oriented nature of pandas.Series, I think the API should ease operations on values rather than treating values and indexes the same way (which is not user-friendly).

Well, if I'm the only one who sees the good in this proposal, I'll leave it here.

Thank you for the suggestion of .value_counts; I will use it for sure since it is far more straightforward!

Good evening!
