BUG: groupby sub-selection ignored with some methods #5264

TomAugspurger · 2013-10-19T14:59:21Z

Update from @hayd I think these should reference _selected_obj rather than obj.

Looking through some others, looks these also ignore the selection

Aggregation functions like (they already kind of do, but they allow bad selections ie column names not in columns, may be sep issue?):

sum/max/min/median/mean/var/std/.. (not tested)
agg (not tested)
(these "work" with the described bug)

Atm selecting a column not in df doesn't raise:

it should raise a Key Error, FIX raise when groupby selecting cols not in frame #6578

what about iloc/loc/ix (current all disabled)?

iloc (very similar to head/tail)
loc/ix (maybe push off for now, this is pretty tricky)
iterate over all (whitelisted) functions to check they adhere to this

The column selection on a groupby object is being ignored when .quantile() is called. So it computes the quantile on all the (numeric) columns and returns the full DataFrame.

In [92]: t = pd.DataFrame(np.random.randn(10, 4)); t[0] = np.hstack([np.ones(5), np.zeros(5)])

In [95]: t.groupby(0)[[1, 2]].quantile()  # shows other cols
Out[95]: 
   0         1         2         3
0                                 
0  0  0.127152  0.108908  0.369601
1  1 -0.321279  0.265550 -0.382398

In [96]: t[[1, 2]].groupby(t[0]).quantile()  # Should be equivalent to:
Out[96]: 
          1         2
0                    
0  0.127152  0.108908
1 -0.321279  0.265550

Seeing all these, I'm wondering if this is a bug or just how some of the methods are implementer. The docs don't mention anything about only supporting some methods though.

version: '0.12.0-883-g988d4be'

The text was updated successfully, but these errors were encountered:

jreback · 2013-10-19T18:18:44Z

the cython routines use _iterate_slices which is where _selection is used; however _python_apply_general I think needs to be passed the _selection so that the Splitter can get access (and use) the selection. A bit tricky...but just step thru the example and you can see where it can be fixed.

I would say this is a bug / not implemented behavior

TomAugspurger · 2013-12-13T15:20:25Z

I'm taking a look at this today (the groupby code is a tad intimidating).

Would this be considered a bug as well? That the column being grouped on, 0, is included in the results and has had the function applied to it?

In [18]: t
Out[18]: 
   0         1         2         3
0  1 -0.138264  0.647689  1.523030
1  1 -0.234137  1.579213  0.767435
2  1  0.542560 -0.463418 -0.465730
3  1 -1.913280 -1.724918 -0.562288
4  1  0.314247 -0.908024 -1.412304
5  0 -0.225776  0.067528 -1.424748
6  0  0.110923 -1.150994  0.375698
7  0 -0.291694 -0.601707  1.852278
8  0 -1.057711  0.822545 -1.220844
9  0 -1.959670 -1.328186  0.196861

In [19]: t.groupby(0).apply(lambda x: x + 1)
Out[19]: 
   0         1         2         3
0  2  0.861736  1.647689  2.523030
1  2  0.765863  2.579213  1.767435
2  2  1.542560  0.536582  0.534270
3  2 -0.913280 -0.724918  0.437712
4  2  1.314247  0.091976 -0.412304
5  1  0.774224  1.067528 -0.424748
6  1  1.110923 -0.150994  1.375698
7  1  0.708306  0.398293  2.852278
8  1 -0.057711  1.822545 -0.220844
9  1 -0.959670 -0.328186  1.196861

If 0 were a column of strs then the entire apply would fail.

And just to make sure I follow this, GroupBy.__getitem__ raises a NotImplementedError, but that's just so that subclasses implement __getitem__ right?

TomAugspurger · 2013-12-14T13:58:23Z

OK, wow that was a lot of digging for a (roughly) one line change. All it comes down to is in Groupby._python_apply_general, self.grouper.apply() needs to be called with self._obj_with_exclusions instead of self.obj.

I said roughly because that would mess up the group_keys arg to groupby, so you have to check for that.

Could I get a ruling on whether the column being grouped on should be included as part of the output? My fix on the selection bug will break that. I can hack around it, but I'm not sure including the grouping column is desired. For example, from test_groupby.test_apply_multiindex_fail:

        index = MultiIndex.from_arrays([[0, 0, 0, 1, 1, 1],
                                        [1, 2, 3, 1, 2, 3]])
        df = DataFrame({'d': [1., 1., 1., 2., 2., 2.],
                        'c': np.tile(['a', 'b', 'c'], 2),
                        'v': np.arange(1., 7.)}, index=index)

        def f(group):
            v = group['v']
            group['v2'] = (v - v.min()) / (v.max() - v.min())
            return group

        result = df.groupby('d').apply(f)

        expected = df.copy()
        expected['v2'] = np.tile([0., 0.5, 1], 2)

        assert_frame_equal(result, expected)

So when I run this with my fix I get:

In [11]: df.groupby('d').apply(f)
Out[11]: 
     c  v   v2
0 1  a  1  0.0
  2  b  2  0.5
  3  c  3  1.0
1 1  a  4  0.0
  2  b  5  0.5
  3  c  6  1.0

[6 rows x 3 columns]

In [12]: expected
Out[12]: 
     c  d  v   v2
0 1  a  1  1  0.0
  2  b  1  2  0.5
  3  c  1  3  1.0
1 1  a  2  4  0.0
  2  b  2  5  0.5
  3  c  2  6  1.0

[6 rows x 4 columns]

expected is the result from before my fix. The difference is that I exclude the column being grouped on. I could see this being a pretty disruptive change. But like I said in my previous post, something like df.groupby('c').apply(lambda x: np.max(x) / 2) would fail if we don't exclude the groupby column, since its a column of strs and the applied function is looking for ints.

jreback · 2013-12-14T14:08:02Z

can u put a link to the branch?

TomAugspurger · 2013-12-14T15:07:09Z

https://github.com/TomAugspurger/pandas/tree/groupby-sugar

EDIT: forgot to push my changes. They're up now.

TomAugspurger · 2013-12-14T15:46:21Z

I guess for applys we do need to keep the column (or row) of group keys around since that's the only identifier of what values are in which group (I'm still not sure that they should have the function applied to them though).

TomAugspurger · 2014-01-09T15:20:57Z

For what it's worth, it look like R's plyr library apples the function to the grouping column:

> library("plyr", lib.loc="/usr/local/Cellar/r/3.0.2/R.framework/Versions/3.0/Resources/library")
> f <- function(x) {return(x + 1)}
> dt <- data.frame(age=rchisq(20,10),group=sample(1:2,20,rep=T))
> dt
         age group
1  11.137641     1
2   8.796264     2
3  13.530428     1
4  13.647518     2
5   5.647735     2
6   7.277148     1
7   6.479024     2
8   9.289560     1
9   5.211502     1
10 12.575817     2
11 11.291723     2
12 11.295667     1
13 13.752943     2
14  9.731945     1
15  4.490958     2
16 16.340906     2
17 10.620146     2
18 11.168694     1
19  3.015077     2
20 13.511891     2
> ddply(dt, ~group, f)
         age group
1  12.137641     2
2  14.530428     2
3   8.277148     2
4  10.289560     2
5   6.211502     2
6  12.295667     2
7  10.731945     2
8  12.168694     2
9   9.796264     3
10 14.647518     3
11  6.647735     3
12  7.479024     3
13 13.575817     3
14 12.291723     3
15 14.752943     3
16  5.490958     3
17 17.340906     3
18 11.620146     3
19  4.015077     3
20 14.511891     3

TomAugspurger · 2014-01-11T16:27:49Z

See this post on SO. I think this user was confused about the grouper having the function applied to it. My answer there goes through what I think happened to them, and it's not at all obvious.

Can anyone provide a reason why the apply or transform or agg function should be applied to the grouping column? I don't see any reason, but it seems to be deliberate in the code.

jreback · 2014-01-11T17:19:45Z

@TomAugspurger ok. I believe this is a bug.

Here is what happens. If the grouper (e.g. 'year' in this case) would have been a string, then then the aggregation usually will fall to the except part as the functions cannot aggregate strings, so item-by-item will work properly.
(I say usually because as an example, if the function just returns 0 it doesn't care what it is aggregating and would NEVER fail; similary, if the aggregator function had been np.sum then it would work on strings as well!)

In this can since year is a number the function works and thus the grouping column IS included.

So I think that you could exclude the self.grouper.names e.g. the named columns when passing the data to the function that aggregates. Thus not triggering the exception more often.

I think to nail this down you need a simple example and basically a bunch of cases which need to be checked:

the product of these combinatons:

groupby fields: int64, datetime, string, function (e.g. what we are actually grouping by whether a field or a passed function)
functions: a) lambda x: 0, b), return lambda x: np.mean(x), c) lambda x: np.sum(x)
dtypes of the object: a) pure floats, b) strings (that are not grouping columns), and floats, c) datetimes and floats

so 4 x 3 x 3 = 36 combinations! need to find a nice easy way to check these (this may be overkill though)

make sense?

TomAugspurger · 2014-01-11T17:43:54Z

Agreed completely. I'll make a new issue.

jreback · 2014-01-11T17:47:35Z

u have a PR for this issue (the sub selection) I believe? (or is it just the private branch?) go ahead and make a PR and can resolve this one

TomAugspurger · 2014-01-11T21:52:24Z

Do you want the "apply to grouper" issue in a separate PR? Or one PR with two commits? The PR for this issue (the sub selection being ignored) will probably depend on the PR for the apply to grouper.

jreback · 2014-01-11T21:55:19Z

maybe best to close at the same time

if not possible then can do separately

whatever works

jreback · 2014-03-03T14:22:14Z

@TomAugspurger I think all of these merely need to replace obj with _obj_with_exclusions (but tests have to be carefully looked at), the as_index=False are going to break (as they I think are wrong)

TomAugspurger · 2014-03-03T14:29:15Z

Great, that's the approach I was taking before getting muddled up with bigger API issues. You said you have a fix in the other issue right? I'd be happy look that over (especially the tests).

jreback · 2014-03-03T14:32:03Z

jreback@92e5c50

trivial fix... (but a couple of other tests will break)...

hayd · 2014-03-04T00:00:21Z

apparently self._obj_with_exclusion isn't quite the right object to use (it excludes the grouped by column, regardless of as_index) Eeep.

I think good opportunity to change head and tail... see PR :S I created new method _selected_obj but maybe there is one already?

jreback · 2014-03-07T00:07:35Z

add to the list corr;

http://stackoverflow.com/questions/22238802/python-pandas-groupby-correlation-includes-all-columns-instead-of-selected-colum

@TomAugspurger maybe make a check list or something at the top?

I guess need to verify all of the tmethod by a smoke tests (these are methods on the whitelist).

perhaps change is really easy if its done with the whitelist generating function then....

hayd · 2014-03-07T01:00:25Z

I put this into a checklist and ticked off head and tail (which I fixed yest) added in a few but haven't properly gone through it. Like I say, I implemented a _selected_obj which is subtley different from other methods. One thing to be wary of is the as_index selection (good time to check whether these make sense, like we changed head)...

hayd · 2014-03-07T01:01:42Z

@jreback yea re whitelist way to do it, makes sense

jreback · 2014-03-07T20:38:14Z

pls tick the boxes when you have a chance (and put the ref PR #)

hayd · 2014-03-07T20:47:03Z

added issue number, will tick off once merged, looks like my PR fixes most of these, if not all. needs some tests for a few of the others... may need a swathe of tests for these as I think coverage is low...

hayd · 2014-03-08T02:21:48Z

I added a load to this list. Their is some weird behaviour in agg functions, where if you select a col not in the DataFrame it doesn't care:

In [60]: df = pd.DataFrame([[1, 2], [1, 3], [5, 6]], columns=['A', 'B'])

In [61]: g = df.groupby('A')

In [62]: g[['C']]  # should raise here (maybe this will solve)
Out[62]: <pandas.core.groupby.DataFrameGroupBy object at 0x106ec6710>

In [63]: g[['C']].max()  # raises KeyError here in my PR
Out[63]:
    C
A
1 NaN
5 NaN

feature/edge case here is that A isn't displayed unless you select it

jreback · 2014-03-08T11:54:05Z

I agree this should raise KeyError

hayd · 2014-03-09T21:47:49Z

@TomAugspurger fix for a lot of this is in master now. Would you mind taking a look at what you think about the rest, is the behaviour correct? (esp. the aggs?)

TomAugspurger · 2014-03-10T19:25:32Z

Thanks for taking care of this. I may have actually lost the branch that I started to fix this, so I guess there wasn't really duplicated effort :)

I ticked the boxes for the aggs (since #6578 closes it the unrelated KeyError issue).

I haven't tested the filter and ohlc methods (haven't used either extensively, but I'll try to come up with some tests for them).

For a groupby.plot, I'm getting back multiple (identical) figures. I'll try to figure when this behavior started.

EDIT: I'm not very smart today. Of course the groupby.plot() returns multiple figures (and they aren't identical either). It correctly gets the subselected columns. I ticked the box for plot as well.

hayd · 2014-03-10T22:43:36Z

I can take care of filter, it's a one line change (was just waiting for the above to merge). I think ohlc already passes I just needs to be added to the first list in my PR.

@TomAugspurger Do you think the aggs is all good once that 6578 is fixed? Does the behaviour look right? If so, we should add tests for them. (I unchecked them 'til we add tests)

hayd · 2014-03-11T00:53:17Z

actually I'm confused if ohlc was meant to be a method, see #6594...

TomAugspurger · 2014-04-11T14:49:09Z

Any objections to closing this?

The remaining issues (ohlc and not raising a KeyError on nonexistent columns) aren't related to the original subselection being ignored and have their own issues.

jreback · 2014-04-11T14:52:53Z

no...let's close (if necessary move anything remaining to a new issue)

TomAugspurger mentioned this issue Jan 11, 2014

BUG: Don't apply function to grouper #5908

Closed

TomAugspurger mentioned this issue Mar 3, 2014

Grouby select multiple columns #6524

Closed

hayd mentioned this issue Mar 4, 2014

API change in groupby head and tail #6533

Merged

This was referenced Mar 7, 2014

ENH/BUG groupby nth now filters, works with DataFrames #6569

Merged

FIX use selected_obj rather the obj throughout groupby #6570

Merged

hayd mentioned this issue Mar 9, 2014

FIX raise when groupby selecting cols not in frame #6578

Merged

jreback mentioned this issue Mar 10, 2014

groupby - transform not working with multiple columns #6585

Closed

hayd mentioned this issue Mar 11, 2014

FIX filter selects selected columns #6593

Merged

TomAugspurger closed this as completed Apr 11, 2014

nbonnotte mentioned this issue Nov 18, 2015

BUG AttributeError: 'DataFrameGroupBy' object has no attribute '_obj_with_exclusions' #11640

Closed

maurosilber mentioned this issue Dec 8, 2021

BUG: Inconsistent behaviour in DataFrameGroupBy when selecting a subset of columns #44821

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: groupby sub-selection ignored with some methods #5264

BUG: groupby sub-selection ignored with some methods #5264

TomAugspurger commented Oct 19, 2013

jreback commented Oct 19, 2013

TomAugspurger commented Dec 13, 2013

TomAugspurger commented Dec 14, 2013

jreback commented Dec 14, 2013

TomAugspurger commented Dec 14, 2013

TomAugspurger commented Dec 14, 2013

TomAugspurger commented Jan 9, 2014

TomAugspurger commented Jan 11, 2014

jreback commented Jan 11, 2014

TomAugspurger commented Jan 11, 2014

jreback commented Jan 11, 2014

TomAugspurger commented Jan 11, 2014

jreback commented Jan 11, 2014

jreback commented Mar 3, 2014

TomAugspurger commented Mar 3, 2014

jreback commented Mar 3, 2014

hayd commented Mar 4, 2014

jreback commented Mar 7, 2014

hayd commented Mar 7, 2014

hayd commented Mar 7, 2014

jreback commented Mar 7, 2014

hayd commented Mar 7, 2014

hayd commented Mar 8, 2014

jreback commented Mar 8, 2014

hayd commented Mar 9, 2014

TomAugspurger commented Mar 10, 2014

hayd commented Mar 10, 2014

hayd commented Mar 11, 2014

TomAugspurger commented Apr 11, 2014

jreback commented Apr 11, 2014

BUG: groupby sub-selection ignored with some methods #5264

BUG: groupby sub-selection ignored with some methods #5264

Comments

TomAugspurger commented Oct 19, 2013

jreback commented Oct 19, 2013

TomAugspurger commented Dec 13, 2013

TomAugspurger commented Dec 14, 2013

jreback commented Dec 14, 2013

TomAugspurger commented Dec 14, 2013

TomAugspurger commented Dec 14, 2013

TomAugspurger commented Jan 9, 2014

TomAugspurger commented Jan 11, 2014

jreback commented Jan 11, 2014

TomAugspurger commented Jan 11, 2014

jreback commented Jan 11, 2014

TomAugspurger commented Jan 11, 2014

jreback commented Jan 11, 2014

jreback commented Mar 3, 2014

TomAugspurger commented Mar 3, 2014

jreback commented Mar 3, 2014

hayd commented Mar 4, 2014

jreback commented Mar 7, 2014

hayd commented Mar 7, 2014

hayd commented Mar 7, 2014

jreback commented Mar 7, 2014

hayd commented Mar 7, 2014

hayd commented Mar 8, 2014

jreback commented Mar 8, 2014

hayd commented Mar 9, 2014

TomAugspurger commented Mar 10, 2014

hayd commented Mar 10, 2014

hayd commented Mar 11, 2014

TomAugspurger commented Apr 11, 2014

jreback commented Apr 11, 2014