Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: groupby sub-selection ignored with some methods #5264

Closed
27 of 31 tasks
TomAugspurger opened this issue Oct 19, 2013 · 30 comments
Closed
27 of 31 tasks

BUG: groupby sub-selection ignored with some methods #5264

TomAugspurger opened this issue Oct 19, 2013 · 30 comments

Comments

@TomAugspurger
Copy link
Contributor

related #6512, #6524, #6346

Update from @hayd I think these should reference _selected_obj rather than obj.

Looking through some others, looks these also ignore the selection

Aggregation functions like (they already kind of do, but they allow bad selections ie column names not in columns, may be sep issue?):

  • sum/max/min/median/mean/var/std/.. (not tested)
  • agg (not tested)
    (these "work" with the described bug)

Atm selecting a column not in df doesn't raise:

what about iloc/loc/ix (current all disabled)?

  • iloc (very similar to head/tail)
  • loc/ix (maybe push off for now, this is pretty tricky)
  • iterate over all (whitelisted) functions to check they adhere to this

The column selection on a groupby object is being ignored when .quantile() is called. So it computes the quantile on all the (numeric) columns and returns the full DataFrame.

In [92]: t = pd.DataFrame(np.random.randn(10, 4)); t[0] = np.hstack([np.ones(5), np.zeros(5)])

In [95]: t.groupby(0)[[1, 2]].quantile()  # shows other cols
Out[95]: 
   0         1         2         3
0                                 
0  0  0.127152  0.108908  0.369601
1  1 -0.321279  0.265550 -0.382398

In [96]: t[[1, 2]].groupby(t[0]).quantile()  # Should be equivalent to:
Out[96]: 
          1         2
0                    
0  0.127152  0.108908
1 -0.321279  0.265550

Seeing all these, I'm wondering if this is a bug or just how some of the methods are implementer. The docs don't mention anything about only supporting some methods though.

version: '0.12.0-883-g988d4be'

@jreback
Copy link
Contributor

jreback commented Oct 19, 2013

the cython routines use _iterate_slices which is where _selection is used; however _python_apply_general I think needs to be passed the _selection so that the Splitter can get access (and use) the selection. A bit tricky...but just step thru the example and you can see where it can be fixed.

I would say this is a bug / not implemented behavior

@TomAugspurger
Copy link
Contributor Author

I'm taking a look at this today (the groupby code is a tad intimidating).

Would this be considered a bug as well? That the column being grouped on, 0, is included in the results and has had the function applied to it?

In [18]: t
Out[18]: 
   0         1         2         3
0  1 -0.138264  0.647689  1.523030
1  1 -0.234137  1.579213  0.767435
2  1  0.542560 -0.463418 -0.465730
3  1 -1.913280 -1.724918 -0.562288
4  1  0.314247 -0.908024 -1.412304
5  0 -0.225776  0.067528 -1.424748
6  0  0.110923 -1.150994  0.375698
7  0 -0.291694 -0.601707  1.852278
8  0 -1.057711  0.822545 -1.220844
9  0 -1.959670 -1.328186  0.196861

In [19]: t.groupby(0).apply(lambda x: x + 1)
Out[19]: 
   0         1         2         3
0  2  0.861736  1.647689  2.523030
1  2  0.765863  2.579213  1.767435
2  2  1.542560  0.536582  0.534270
3  2 -0.913280 -0.724918  0.437712
4  2  1.314247  0.091976 -0.412304
5  1  0.774224  1.067528 -0.424748
6  1  1.110923 -0.150994  1.375698
7  1  0.708306  0.398293  2.852278
8  1 -0.057711  1.822545 -0.220844
9  1 -0.959670 -0.328186  1.196861

If 0 were a column of strs then the entire apply would fail.


And just to make sure I follow this, GroupBy.__getitem__ raises a NotImplementedError, but that's just so that subclasses implement __getitem__ right?

@TomAugspurger
Copy link
Contributor Author

OK, wow that was a lot of digging for a (roughly) one line change. All it comes down to is in Groupby._python_apply_general, self.grouper.apply() needs to be called with self._obj_with_exclusions instead of self.obj.

I said roughly because that would mess up the group_keys arg to groupby, so you have to check for that.


Could I get a ruling on whether the column being grouped on should be included as part of the output? My fix on the selection bug will break that. I can hack around it, but I'm not sure including the grouping column is desired. For example, from test_groupby.test_apply_multiindex_fail:

        index = MultiIndex.from_arrays([[0, 0, 0, 1, 1, 1],
                                        [1, 2, 3, 1, 2, 3]])
        df = DataFrame({'d': [1., 1., 1., 2., 2., 2.],
                        'c': np.tile(['a', 'b', 'c'], 2),
                        'v': np.arange(1., 7.)}, index=index)

        def f(group):
            v = group['v']
            group['v2'] = (v - v.min()) / (v.max() - v.min())
            return group

        result = df.groupby('d').apply(f)

        expected = df.copy()
        expected['v2'] = np.tile([0., 0.5, 1], 2)

        assert_frame_equal(result, expected)

So when I run this with my fix I get:

In [11]: df.groupby('d').apply(f)
Out[11]: 
     c  v   v2
0 1  a  1  0.0
  2  b  2  0.5
  3  c  3  1.0
1 1  a  4  0.0
  2  b  5  0.5
  3  c  6  1.0

[6 rows x 3 columns]

In [12]: expected
Out[12]: 
     c  d  v   v2
0 1  a  1  1  0.0
  2  b  1  2  0.5
  3  c  1  3  1.0
1 1  a  2  4  0.0
  2  b  2  5  0.5
  3  c  2  6  1.0

[6 rows x 4 columns]

expected is the result from before my fix. The difference is that I exclude the column being grouped on. I could see this being a pretty disruptive change. But like I said in my previous post, something like df.groupby('c').apply(lambda x: np.max(x) / 2) would fail if we don't exclude the groupby column, since its a column of strs and the applied function is looking for ints.

@jreback
Copy link
Contributor

jreback commented Dec 14, 2013

can u put a link to the branch?

@TomAugspurger
Copy link
Contributor Author

https://github.com/TomAugspurger/pandas/tree/groupby-sugar

EDIT: forgot to push my changes. They're up now.

@TomAugspurger
Copy link
Contributor Author

I guess for applys we do need to keep the column (or row) of group keys around since that's the only identifier of what values are in which group (I'm still not sure that they should have the function applied to them though).

@TomAugspurger
Copy link
Contributor Author

For what it's worth, it look like R's plyr library apples the function to the grouping column:

> library("plyr", lib.loc="/usr/local/Cellar/r/3.0.2/R.framework/Versions/3.0/Resources/library")
> f <- function(x) {return(x + 1)}
> dt <- data.frame(age=rchisq(20,10),group=sample(1:2,20,rep=T))
> dt
         age group
1  11.137641     1
2   8.796264     2
3  13.530428     1
4  13.647518     2
5   5.647735     2
6   7.277148     1
7   6.479024     2
8   9.289560     1
9   5.211502     1
10 12.575817     2
11 11.291723     2
12 11.295667     1
13 13.752943     2
14  9.731945     1
15  4.490958     2
16 16.340906     2
17 10.620146     2
18 11.168694     1
19  3.015077     2
20 13.511891     2
> ddply(dt, ~group, f)
         age group
1  12.137641     2
2  14.530428     2
3   8.277148     2
4  10.289560     2
5   6.211502     2
6  12.295667     2
7  10.731945     2
8  12.168694     2
9   9.796264     3
10 14.647518     3
11  6.647735     3
12  7.479024     3
13 13.575817     3
14 12.291723     3
15 14.752943     3
16  5.490958     3
17 17.340906     3
18 11.620146     3
19  4.015077     3
20 14.511891     3

@TomAugspurger
Copy link
Contributor Author

See this post on SO. I think this user was confused about the grouper having the function applied to it. My answer there goes through what I think happened to them, and it's not at all obvious.

Can anyone provide a reason why the apply or transform or agg function should be applied to the grouping column? I don't see any reason, but it seems to be deliberate in the code.

@jreback
Copy link
Contributor

jreback commented Jan 11, 2014

@TomAugspurger ok. I believe this is a bug.

Here is what happens. If the grouper (e.g. 'year' in this case) would have been a string, then then the aggregation usually will fall to the except part as the functions cannot aggregate strings, so item-by-item will work properly.
(I say usually because as an example, if the function just returns 0 it doesn't care what it is aggregating and would NEVER fail; similary, if the aggregator function had been np.sum then it would work on strings as well!)

In this can since year is a number the function works and thus the grouping column IS included.

So I think that you could exclude the self.grouper.names e.g. the named columns when passing the data to the function that aggregates. Thus not triggering the exception more often.

I think to nail this down you need a simple example and basically a bunch of cases which need to be checked:

the product of these combinatons:

  • groupby fields: int64, datetime, string, function (e.g. what we are actually grouping by whether a field or a passed function)
  • functions: a) lambda x: 0, b), return lambda x: np.mean(x), c) lambda x: np.sum(x)
  • dtypes of the object: a) pure floats, b) strings (that are not grouping columns), and floats, c) datetimes and floats

so 4 x 3 x 3 = 36 combinations! need to find a nice easy way to check these (this may be overkill though)

make sense?

@TomAugspurger
Copy link
Contributor Author

Agreed completely. I'll make a new issue.

@jreback
Copy link
Contributor

jreback commented Jan 11, 2014

u have a PR for this issue (the sub selection) I believe? (or is it just the private branch?) go ahead and make a PR and can resolve this one

@TomAugspurger
Copy link
Contributor Author

Do you want the "apply to grouper" issue in a separate PR? Or one PR with two commits? The PR for this issue (the sub selection being ignored) will probably depend on the PR for the apply to grouper.

@jreback
Copy link
Contributor

jreback commented Jan 11, 2014

maybe best to close at the same time

if not possible then can do separately

whatever works

@jreback
Copy link
Contributor

jreback commented Mar 3, 2014

@TomAugspurger I think all of these merely need to replace obj with _obj_with_exclusions (but tests have to be carefully looked at), the as_index=False are going to break (as they I think are wrong)

@TomAugspurger
Copy link
Contributor Author

Great, that's the approach I was taking before getting muddled up with bigger API issues. You said you have a fix in the other issue right? I'd be happy look that over (especially the tests).

@jreback
Copy link
Contributor

jreback commented Mar 3, 2014

jreback@92e5c50

trivial fix... (but a couple of other tests will break)...

@hayd
Copy link
Contributor

hayd commented Mar 4, 2014

apparently self._obj_with_exclusion isn't quite the right object to use (it excludes the grouped by column, regardless of as_index) Eeep.

I think good opportunity to change head and tail... see PR :S I created new method _selected_obj but maybe there is one already?

@jreback
Copy link
Contributor

jreback commented Mar 7, 2014

add to the list corr;

http://stackoverflow.com/questions/22238802/python-pandas-groupby-correlation-includes-all-columns-instead-of-selected-colum

@TomAugspurger maybe make a check list or something at the top?

I guess need to verify all of the tmethod by a smoke tests (these are methods on the whitelist).

perhaps change is really easy if its done with the whitelist generating function then....

@hayd
Copy link
Contributor

hayd commented Mar 7, 2014

I put this into a checklist and ticked off head and tail (which I fixed yest) added in a few but haven't properly gone through it. Like I say, I implemented a _selected_obj which is subtley different from other methods. One thing to be wary of is the as_index selection (good time to check whether these make sense, like we changed head)...

@hayd
Copy link
Contributor

hayd commented Mar 7, 2014

@jreback yea re whitelist way to do it, makes sense

@jreback
Copy link
Contributor

jreback commented Mar 7, 2014

pls tick the boxes when you have a chance (and put the ref PR #)

@hayd
Copy link
Contributor

hayd commented Mar 7, 2014

added issue number, will tick off once merged, looks like my PR fixes most of these, if not all. needs some tests for a few of the others... may need a swathe of tests for these as I think coverage is low...

@hayd
Copy link
Contributor

hayd commented Mar 8, 2014

I added a load to this list. Their is some weird behaviour in agg functions, where if you select a col not in the DataFrame it doesn't care:

In [60]: df = pd.DataFrame([[1, 2], [1, 3], [5, 6]], columns=['A', 'B'])

In [61]: g = df.groupby('A')

In [62]: g[['C']]  # should raise here (maybe this will solve)
Out[62]: <pandas.core.groupby.DataFrameGroupBy object at 0x106ec6710>

In [63]: g[['C']].max()  # raises KeyError here in my PR
Out[63]:
    C
A
1 NaN
5 NaN

feature/edge case here is that A isn't displayed unless you select it

@jreback
Copy link
Contributor

jreback commented Mar 8, 2014

I agree this should raise KeyError

@hayd
Copy link
Contributor

hayd commented Mar 9, 2014

@TomAugspurger fix for a lot of this is in master now. Would you mind taking a look at what you think about the rest, is the behaviour correct? (esp. the aggs?)

@TomAugspurger
Copy link
Contributor Author

Thanks for taking care of this. I may have actually lost the branch that I started to fix this, so I guess there wasn't really duplicated effort :)

I ticked the boxes for the aggs (since #6578 closes it the unrelated KeyError issue).

I haven't tested the filter and ohlc methods (haven't used either extensively, but I'll try to come up with some tests for them).

For a groupby.plot, I'm getting back multiple (identical) figures. I'll try to figure when this behavior started.

EDIT: I'm not very smart today. Of course the groupby.plot() returns multiple figures (and they aren't identical either). It correctly gets the subselected columns. I ticked the box for plot as well.

@hayd
Copy link
Contributor

hayd commented Mar 10, 2014

I can take care of filter, it's a one line change (was just waiting for the above to merge). I think ohlc already passes I just needs to be added to the first list in my PR.

@TomAugspurger Do you think the aggs is all good once that 6578 is fixed? Does the behaviour look right? If so, we should add tests for them. (I unchecked them 'til we add tests)

@hayd
Copy link
Contributor

hayd commented Mar 11, 2014

actually I'm confused if ohlc was meant to be a method, see #6594...

@TomAugspurger
Copy link
Contributor Author

Any objections to closing this?

The remaining issues (ohlc and not raising a KeyError on nonexistent columns) aren't related to the original subselection being ignored and have their own issues.

@jreback
Copy link
Contributor

jreback commented Apr 11, 2014

no...let's close (if necessary move anything remaining to a new issue)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants