BUG: groupby.transform length mismatch under certain specifications #9697

Closed
nickeubank opened this issue Mar 22, 2015 · 10 comments · Fixed by #9699

@nickeubank
Contributor

Replicating Example (pandas 0.15.2 and 0.16.0rc1-32-g5a417ec):

import numpy as np
import pandas as pd
df = pd.DataFrame({'col1':[1,1,2,2], 'col2':[1,2,3,np.nan]})

# Works fine
df.groupby('col1').transform(sum)['col2']

# Throws error
df.groupby('col1')['col2'].transform(sum)

Error:
Traceback (most recent call last):

  File "<ipython-input-7-f969e26273d4>", line 8, in <module>
    df.groupby('col1')['col2'].transform(sum)

  File "/Users/Nick/GitHub/pandas/pandas/core/groupby.py", line 2418, in transform
    return self._transform_fast(cyfunc)

  File "/Users/Nick/GitHub/pandas/pandas/core/groupby.py", line 2459, in _transform_fast
    return self._set_result_index_ordered(Series(values))

  File "/Users/Nick/GitHub/pandas/pandas/core/groupby.py", line 497, in _set_result_index_ordered
    result.index = self.obj.index

  File "/Users/Nick/GitHub/pandas/pandas/core/generic.py", line 2061, in __setattr__
    return object.__setattr__(self, name, value)

  File "pandas/src/properties.pyx", line 65, in pandas.lib.AxisProperty.__set__ (pandas/lib.c:41404)
    obj._set_axis(self.axis, value)

  File "/Users/Nick/GitHub/pandas/pandas/core/series.py", line 268, in _set_axis
    self._data.set_axis(axis, labels)

  File "/Users/Nick/GitHub/pandas/pandas/core/internals.py", line 2211, in set_axis
    'new values have %d elements' % (old_len, new_len))

ValueError: Length mismatch: Expected axis has 3 elements, new values have 4 elements
@dsm054
Contributor

dsm054 commented Mar 22, 2015

Note that

>>> df.groupby("col1")["col2"].transform(lambda x: np.nansum(x))
0    3
1    3
2    3
3    3
Name: col2, dtype: float64

works, so the bug must be in the fastpath.

@dsm054
Contributor

dsm054 commented Mar 22, 2015

More information: the use of count in _transform_fast doesn't seem right to me (that'll pick up the number of non-nan values in the grouped series, IIUC, whereas we want the size of the groups.)
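To illustrate the count/size distinction on the frame from the original report, here is a minimal sketch. It only uses the public `count()` and `size()` methods and does not reproduce the internal `_transform_fast` code.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 2, 2], 'col2': [1, 2, 3, np.nan]})
grouped = df.groupby('col1')['col2']

counts = grouped.count()  # non-NaN values per group
sizes = grouped.size()    # rows per group, NaN included

print(counts.tolist())  # [2, 1]
print(sizes.tolist())   # [2, 2]

# The totals line up with the numbers in the error message:
# "Expected axis has 3 elements, new values have 4 elements"
print(counts.sum(), sizes.sum(), len(df))  # 3 4 4
```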

We're also seeing things like

>>> df = pd.DataFrame({"c1": [1], "c2": [2]})
>>> df
   c1  c2
0   1   2
>>> df.groupby("c1").transform(sum)
   c2
0 NaN
1   2

which makes me think there's definitely been a confusion about the index.

@jreback
Contributor

jreback commented Mar 22, 2015

yep, this https://github.com/pydata/pandas/blob/master/pandas/core/groupby.py#L2456
should be self.size().values. It's the group size here that matters.
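A rough sketch of why the group size is the right repeat count. This is an illustration of the broadcasting idea only, not the actual pandas implementation, and it assumes the groups are already contiguous and sorted as they are in the example frame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 2, 2], 'col2': [1, 2, 3, np.nan]})
grouped = df.groupby('col1')['col2']

agg = grouped.sum()  # one aggregated value per group

# Repeating by count() drops one slot for the NaN row.
broadcast_bad = np.repeat(agg.values, grouped.count().values)
# Repeating by size() yields one slot per original row.
broadcast_ok = np.repeat(agg.values, grouped.size().values)

print(len(broadcast_bad), len(broadcast_ok))  # 3 4

# Only the size-based result can take the original index.
result = pd.Series(broadcast_ok, index=df.index, name='col2')
print(result.tolist())  # [3.0, 3.0, 3.0, 3.0]
```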

@jreback jreback added this to the 0.16.1 milestone Mar 22, 2015
@dsm054
Contributor

dsm054 commented Mar 22, 2015

Switching count -> size seems to fix the original problem. Doesn't pass the tests I made which revealed the index problem above. Shall I submit a PR with the count/size switch here and open a new issue for that one?

@jreback
Contributor

jreback commented Mar 22, 2015

@dsm054 that would be gr8 thxs

@dylkot

dylkot commented May 17, 2015

I may be having the same problem in Pandas 0.16.1. I describe it here:

http://stackoverflow.com/questions/30290482/normalizing-columns-of-multiindex-dataframe-in-pandas-maybe-a-bug

I have a Pandas Dataframe with hierarchically indexed columns. Eg:

cols = pd.MultiIndex.from_tuples([('syn', 'A'), ('mis', 'A'), ('non', 'A'), ('syn', 'C'), ('mis', 'C'), ('non', 'C'),
                                  ('syn', 'T'), ('mis', 'T'), ('non', 'T'), ('syn', 'G'), ('mis', 'G'), ('non', 'G')])
sample = pd.DataFrame(np.random.randint(1, 10, (4,12)), columns=cols, index=['A', 'C', 'G', 'T'])
sample.head()

Giving:

    syn mis non syn mis non syn mis non syn mis non
    A   A   A   C   C   C   T   T   T   G   G   G
A   7   3   9   5   4   8   6   4   3   6   4   2
C   5   2   2   4   9   6   3   3   9   6   2   1
G   2   4   5   2   8   3   8   3   6   1   2   4
T   9   4   8   9   8   5   8   8   2   2   6   5

However, when I attempt to normalize on level 1 of the columns as follows:

sample.groupby(axis=1, level=1).transform(lambda z: z.div(z.sum(axis=1), axis=0))

I get an error:

ValueError: Length mismatch: Expected axis has 4 elements, new values have 12 elements

Weirdly, it works fine if I use up to 3 of the values of level 1 of axis 1, for example:

ind = sample.columns.get_level_values(1).isin(['A', 'C', 'G'])
subsample = sample.loc[:, ind]
subsample.head()

which just takes the first 3 sets of values from the sample:

    syn mis non syn mis non syn mis non
    A   A   A   C   C   C   G   G   G
A   7   3   9   5   4   8   6   4   2
C   5   2   2   4   9   6   6   2   1
G   2   4   5   2   8   3   1   2   4
T   9   4   8   9   8   5   2   6   5

Then:

subsample.groupby(axis=1, level=1).transform(lambda z: z.div(z.sum(axis=1), axis=0))

correctly returns:

    syn     mis     non     syn     mis     non     syn     mis     non
    A       A       A       C       C       C       G       G       G
A   0.37    0.16    0.47    0.29    0.24    0.47    0.50    0.33    0.17
C   0.56    0.22    0.22    0.21    0.47    0.32    0.67    0.22    0.11
G   0.18    0.36    0.45    0.15    0.62    0.23    0.14    0.29    0.57
T   0.43    0.19    0.38    0.41    0.36    0.23    0.15    0.46    0.38

Any idea why the latter works and the former doesn't? I'm using Pandas version 0.16.1

@jreback
Contributor

jreback commented May 18, 2015

.transform must return a scalar value for each group. You are returning a frame. Try this.

In [44]: sample.groupby(axis=1, level=0).apply(lambda z: z.div(z.sum(axis=1), axis=0))
Out[44]: 
        syn       mis       non       syn       mis       non       syn       mis       non       syn       mis       non
          A         A         A         C         C         C         T         T         T         G         G         G
A  0.125000  0.090909  0.333333  0.375000  0.181818  0.133333  0.250000  0.090909  0.200000  0.250000  0.636364  0.333333
C  0.200000  0.240000  0.230769  0.133333  0.320000  0.307692  0.133333  0.320000  0.115385  0.533333  0.120000  0.346154
G  0.350000  0.318182  0.461538  0.300000  0.136364  0.230769  0.100000  0.136364  0.230769  0.250000  0.409091  0.076923
T  0.052632  0.200000  0.071429  0.368421  0.150000  0.285714  0.368421  0.350000  0.142857  0.210526  0.300000  0.500000
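For readers on a pandas version where `groupby(axis=1)` is deprecated or unavailable, the same kind of per-group column normalization can be sketched by transposing first. This is one possible approach under that assumption, not the method used in this thread; the seed and the variable names `totals`, `normalized` and `check` are illustrative.

```python
import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_tuples(
    [(kind, base) for base in ['A', 'C', 'T', 'G'] for kind in ['syn', 'mis', 'non']]
)
rng = np.random.RandomState(0)  # fixed seed, chosen arbitrarily
sample = pd.DataFrame(rng.randint(1, 10, (4, 12)), columns=cols,
                      index=['A', 'C', 'G', 'T'])

# Group the transposed frame by level 1 of the (former column) index and
# broadcast each group's sum back to the original shape.
totals = sample.T.groupby(level=1).transform('sum').T

# Each value divided by the total of its level-1 group in the same row.
normalized = sample / totals

# Within every row, each level-1 group should now sum to 1.
check = normalized.T.groupby(level=1).sum().T
print(np.allclose(check.values, 1.0))
```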

@dylkot

dylkot commented May 18, 2015

I see, that does indeed work, thanks so much for the help. I am a bit confused by the documentation then. The docstring for transform says:

Call function producing a like-indexed DataFrame on each group and
return a DataFrame having the same indexes as the original object
filled with the transformed values

So I assumed that transform would return a "transformed" dataframe for each group. In any case, now I think I have a better understanding of apply. Thanks again!

@jreback
Contributor

jreback commented May 18, 2015

yes, that's the result of a transform. a user-defined function needs to return a single (scalar) value. hmm, it may be better to have a nicer error message / doc. going to make an issue.
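A small sketch of the scalar case described here: the function returns one value per group, and `transform` broadcasts it back to the shape of the original object. The frame and column names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [1.0, 2.0, 4.0]})

# The lambda returns a single scalar per group; transform repeats it
# for every row of that group and keeps the original index.
out = df.groupby('key')['val'].transform(lambda s: s.mean())

print(out.tolist())                # [1.5, 1.5, 4.0]
print(out.index.equals(df.index))  # True
```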

@jreback
Contributor

jreback commented May 18, 2015

@dylkot see #10165
if you want to have a go would be gr8!
