
pd.Series.groupby is said to have only optional parameters #8015

Closed
tknuth opened this issue Aug 13, 2014 · 17 comments
Labels: Error Reporting (incorrect or improved errors from pandas), Groupby


tknuth commented Aug 13, 2014

s = pd.Series(x)
m = s.groupby().count() # does not work
m = s.groupby(s.values).count() # works

In https://github.com/pydata/pandas/blob/master/pandas/core/generic.py, by=None should be corrected to just by, since by=None always leads to an error.

def groupby(self, by=None, axis=0, level=None, as_index=True, sort=True,
            group_keys=True, squeeze=False):
    from pandas.core.groupby import groupby
    axis = self._get_axis_number(axis)
    return groupby(self, by, axis=axis, level=level, as_index=as_index,
                   sort=sort, group_keys=group_keys, squeeze=squeeze)

See this Stack Overflow question: http://stackoverflow.com/questions/17929426/groupby-for-pandas-series-not-working
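For reference, a runnable version of the failure and the workaround (using a small example Series, since x isn't defined above; the exact error message varies by pandas version):

```python
import pandas as pd

s = pd.Series([1, 1, 2, 2, 3])

# Calling groupby() with no arguments raises a TypeError
# (the exact message depends on the pandas version):
try:
    s.groupby().count()
except TypeError as exc:
    print("groupby() failed:", exc)

# Passing a grouper explicitly works:
print(s.groupby(s.values).count())
```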

@jorisvandenbossche
Member

by=None is possible, e.g. if you provide level you don't need to specify by.

But we could probably provide a better error message here, something like "You have to specify at least one of 'by' or 'level'".

Current error: TypeError: 'NoneType' object is not callable

@jorisvandenbossche
Member

Thinking about this again, I am a little bit confused:

The signature is: DataFrame.groupby(by=None, axis=0, level=None, ..), so there is a default of axis=0, and the explanation of level states "If the axis is a MultiIndex (hierarchical), group by a particular level or levels".
So when you don't have a MultiIndex, but a single Index, I would assume that you don't need to specify the level kwarg. So in that reasoning, df.groupby(axis=0) should just group by the row index?

In [39]: df = pd.DataFrame(np.random.randn(5,5), index=[0,0,1,1,1])

In [40]: df.groupby(axis=0, level=0).sum()
Out[40]:
          0         1         2         3         4
0 -2.219825  0.787332 -0.863655  0.877530  3.664685
1  0.766315 -2.294101 -1.636917 -0.205543  1.535448

In [41]: df.groupby(axis=0).sum()
-> TypeError: 'NoneType' object is not callable

So either this should actually group by the index, or the documentation should be updated.

@hayd @jreback

@hayd
Contributor

hayd commented Oct 30, 2014

Hmmmm. To me this looks much less readable than (something like):

df.groupby(df.index).sum()

I don't think this being the default is sensible (I think it would lead to confusion/bugs as well as not usually being a regular use case) ... ?

That said, the error message should definitely be fixed! Perhaps say: group by a column(s), df.index, or ... just (perish the thought) see the groupby docs!

@jorisvandenbossche
Member

@hayd yep, I can agree with that, being more explicit is better.

But then, I would argue that axis should not have a default of 0, so that df.groupby(df.index) would be equivalent to df.groupby(axis=0) when you have a single index.
The problem with that is that you then can't specify only the level in df.groupby(level=1) with a MultiIndex; you would always also have to specify axis, as in df.groupby(axis=0, level=1), since axis would no longer be 0 by default.

A use case is to remove duplicates in your index, but indeed, not the most standard thing to use groupby for.

@hayd
Contributor

hayd commented Oct 30, 2014

I would argue that this minor inconsistency (requiring level=0 to be passed, though I'm not sure I even agree it's inconsistent!) is a good compromise compared to allowing .groupby().

drop_duplicates is a bad example (there's a function for that), but I know I've done this before for something and it can be useful; it's just infrequently required, which I think is a good reason not to make it the "default": a. you rarely use it, b. you'd have to look up what it's doing when you do see it.

There's a reason to have axis=0 by default: we are grouping row-wise (be it on the index or a column). It doesn't mean group by the index!

That is to say, I like the current behaviour (both are explicit in what they are doing):

df.groupby(level=0) == df.groupby(df.index)

and leaving groupby blank should definitely raise

df.groupby() # reader has no idea what this should do
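The claimed equivalence of the two explicit spellings can be checked directly (small illustrative frame):

```python
import numpy as np
import pandas as pd

# A frame with duplicate index labels, like the earlier examples:
df = pd.DataFrame(np.arange(25).reshape(5, 5), index=[0, 0, 1, 1, 1])

# Both spellings group rows by the index and produce the same result:
by_level = df.groupby(level=0).sum()
by_index = df.groupby(df.index).sum()
assert by_level.equals(by_index)
print(by_level)
```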

@jorisvandenbossche
Member

@hayd Ah yes, forgot about that meaning of axis=0!

So let's just say then: the docstring of level should clarify that it is not only for a MultiIndex, and an explanation of the axis kwarg should be added (there is none now).

And to further nitpick a bit :-) drop_duplicates is not fully a bad example: the drop_duplicates method looks at duplicate rows (which I always forget at first) and not the index. There is also a drop_duplicates method for Index, but that is a bit less straightforward to use (as you have to re-index with the new index). Also, with drop_duplicates you can only drop duplicates, while with groupby you can, for example, also take the mean of the duplicates.
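The distinction can be illustrated with a small made-up frame: drop_duplicates compares rows, while groupby on the index lets you aggregate duplicate index entries:

```python
import pandas as pd

# Duplicate index labels, but distinct row values:
df = pd.DataFrame({"a": [1.0, 3.0, 10.0]}, index=[0, 0, 1])

# drop_duplicates looks at duplicate *rows*; all three rows differ,
# so nothing is dropped even though the index has duplicates:
deduped = df.drop_duplicates()
assert len(deduped) == 3

# groupby on the index lets you choose how to combine the duplicates,
# e.g. take their mean:
means = df.groupby(level=0).mean()
print(means)  # index 0 -> mean of 1.0 and 3.0; index 1 -> 10.0
```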

@jreback
Contributor

jreback commented Oct 30, 2014

axis is just a modifier of the level argument. It doesn't have meaning unless that is specified.

So in reality, you have to specify:

by or level/axis

would not allow just a bare axis=0, because then it ONLY works if the index is an Index; better to be explicit and have the user pass the actual axis

@springcoil
Contributor

This all feels like turtles all the way down :)

@jorisvandenbossche
Member

@jreback I think @hayd is right that axis has indeed another meaning as well: on which axis the grouping is applied, since you can also combine axis with by (and not only with level):

In [61]: df = pd.DataFrame(np.random.randn(5,5))

In [62]: df.groupby(by=pd.Series([1,1,1,2,2]), axis=0).sum()
Out[62]:
          0         1         2         3         4
1  3.685961  0.176227  0.422497  0.688798  0.341932
2  0.397360 -1.918284 -0.498714  3.177563  0.379342

In [63]: df.groupby(by=pd.Series([1,1,1,2,2]), axis=1).sum()
Out[63]:
          1         2
0  1.749975 -0.624686
1 -0.862650  1.331078
2  3.397360  0.324337
3 -1.941983  1.222833
4 -0.077655  2.334072

@jreback
Contributor

jreback commented Oct 30, 2014

@jorisvandenbossche that looks right, so axis is a modifier to by or level (which are the required params).

@jorisvandenbossche
Member

Yep, indeed, the docstring should be clarified a bit on that account.
So, @springcoil, that doesn't change anything for your PR; at least one of by or level should be provided.

@jreback
Contributor

jreback commented Dec 3, 2014

closed by #8950

@jreback jreback closed this as completed Dec 3, 2014
@ffranchina

Hello!

Sorry for bringing this old issue back to life, but I find the current syntax of the groupby() method very ugly when applied to a pandas.Series. It took me quite a few minutes (ok, I am a pandas newbie) to figure out how to do it. Python focuses on simplicity, and I think that this

s = pd.Series(x)
s.groupby(s.values).count()

is not straightforward at all.

A pandas.Series object is a bit more than just an array; isn't what I mean implicit by just typing s.groupby().count()? Since it's just an array of objects, it's obvious I want to group according to them!

I'd like to propose this little enhancement; would you accept a PR of this kind? Or, at least, can you explain the design choice behind this syntax, please?

Thank you and have a nice day!

@jreback
Contributor

jreback commented Nov 26, 2017

so which shall be the answer?

explicit is much better than implicit

In [1]: s = pd.Series([1, 1, 2, 2, 3])

In [2]: s.groupby(s.index).count()
Out[2]: 
0    1
1    1
2    1
3    1
4    1
dtype: int64

In [3]: s.groupby(s.values).count()
Out[3]: 
1    2
2    2
3    1
dtype: int64

# the same as [2]
In [5]: s.groupby(level=0).count()
Out[5]: 
0    1
1    1
2    1
3    1
4    1
dtype: int64

@ffranchina

I totally agree that explicit is better than implicit, but from my point of view it should not have a different meaning from

In [1]: s = pd.Series([1, 1, 2, 2, 3])

In [3]: s.groupby(s.values).count()
Out[3]: 
1    2
2    2
3    1
dtype: int64

unless specified otherwise. After all,

>>> s = pd.Series([1, 1, 2, 2, 3])
>>> s.groupby(s).count()
1    2
2    2
3    1
dtype: int64

does exactly that; I would simply propose to make this the default behavior.

But I'm just speaking from a user's point of view; I hardly know anything about the details of the implementation, I just wanted to highlight the problem :)

If you think it may be a reasonable point of view, I can work on a small PR.

@jreback
Contributor

jreback commented Nov 26, 2017

@ferdas but you are missing the point: we could group by the index or by the values (in the Series case) by default, so which shall it be? And if you pick one, why? It makes no sense to have a default here when it's not clear which is the preferred operation. So -1 on any change to this.

Furthermore, for what you are actually using this for, .value_counts is more idiomatic.

In [35]: pd.Series([1, 1, 2, 2, 3]).value_counts()
Out[35]: 
2    2
1    2
3    1
dtype: int64

@ffranchina

Since the actual behavior (s.groupby(s).count()) works this way, I was deducing that it should behave the same way by default when no arguments are passed.

I see your point, but I don't get how it's possible to think that by calling s.groupby().count() the user is confused about what he wants: since a pandas.Series is meant to store values, he definitely wants to group by the values; if he wants something else (I want to group by the index), he would still have a way to make that explicit.

Due to the values-oriented nature of pandas.Series, I think the API should ease operations on values rather than treating values and indexes the same way (which is not user-friendly).

Well, if I'm the only one who sees the good in this proposal, I'll leave it here.

Thank you for the suggestion of .value_counts; I will use it for sure since it is far more straightforward!

Good evening!
