BUG: groupby.transform length mismatch under certain specifications #9697

nickeubank · 2015-03-22T00:31:06Z

Replicating example (pandas 0.15.2 and 0.16.0rc1-32-g5a417ec):

import numpy as np
import pandas as pd
df = pd.DataFrame({'col1':[1,1,2,2], 'col2':[1,2,3,np.nan]})

# Works fine
df.groupby('col1').transform(sum)['col2']

# Throws error
df.groupby('col1')['col2'].transform(sum)

Error:
Traceback (most recent call last):

  File "<ipython-input-7-f969e26273d4>", line 8, in <module>
    df.groupby('col1')['col2'].transform(sum)

  File "/Users/Nick/GitHub/pandas/pandas/core/groupby.py", line 2418, in transform
    return self._transform_fast(cyfunc)

  File "/Users/Nick/GitHub/pandas/pandas/core/groupby.py", line 2459, in _transform_fast
    return self._set_result_index_ordered(Series(values))

  File "/Users/Nick/GitHub/pandas/pandas/core/groupby.py", line 497, in _set_result_index_ordered
    result.index = self.obj.index

  File "/Users/Nick/GitHub/pandas/pandas/core/generic.py", line 2061, in __setattr__
    return object.__setattr__(self, name, value)

  File "pandas/src/properties.pyx", line 65, in pandas.lib.AxisProperty.__set__ (pandas/lib.c:41404)
    obj._set_axis(self.axis, value)

  File "/Users/Nick/GitHub/pandas/pandas/core/series.py", line 268, in _set_axis
    self._data.set_axis(axis, labels)

  File "/Users/Nick/GitHub/pandas/pandas/core/internals.py", line 2211, in set_axis
    'new values have %d elements' % (old_len, new_len))

ValueError: Length mismatch: Expected axis has 3 elements, new values have 4 elements


dsm054 · 2015-03-22T02:00:31Z

Note that

>>> df.groupby("col1")["col2"].transform(lambda x: np.nansum(x))
0    3
1    3
2    3
3    3
Name: col2, dtype: float64

works, so the bug must be in the fastpath.

dsm054 · 2015-03-22T02:22:15Z

More information: the use of count in _transform_fast doesn't seem right to me (that'll pick up the number of non-nan values in the grouped series, IIUC, whereas we want the size of the groups.)
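
A minimal sketch of why count and size disagree on the original frame. It assumes the fast path expands one aggregated value per group by a per-group repeat count; np.repeat stands in for that step here and is an illustration, not a copy of the pandas source.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 2, 2], 'col2': [1, 2, 3, np.nan]})
grouped = df.groupby('col1')['col2']

counts = grouped.count()  # non-NaN values per group
sizes = grouped.size()    # rows per group, NaN included

print(counts.tolist())  # [2, 1]
print(sizes.tolist())   # [2, 2]

# One aggregated value per group, expanded back to row length.
# np.repeat is a stand-in for the expansion step (assumption).
agg = grouped.sum()
by_count = np.repeat(agg.values, counts.values)
by_size = np.repeat(agg.values, sizes.values)

print(len(by_count), len(by_size), len(df))  # 3 4 4
```

Expanding by count gives 3 values for a 4-row index, which matches the "Expected axis has 3 elements, new values have 4 elements" message in the traceback; expanding by size gives 4.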

We're also seeing things like

>>> df = pd.DataFrame({"c1": [1], "c2": [2]})
>>> df
   c1  c2
0   1   2
>>> df.groupby("c1").transform(sum)
   c2
0 NaN
1   2

which makes me think there's definitely been a confusion about the index.

jreback · 2015-03-22T14:42:40Z

yep, this https://github.com/pydata/pandas/blob/master/pandas/core/groupby.py#L2456
should be self.size().values. It's the group size here that matters.

dsm054 · 2015-03-22T15:22:12Z

Switching count -> size seems to fix the original problem. Doesn't pass the tests I made which revealed the index problem above. Shall I submit a PR with the count/size switch here and open a new issue for that one?

jreback · 2015-03-22T16:25:02Z

@dsm054 that would be gr8 thxs

dylkot · 2015-05-17T23:19:19Z

I may be having the same problem in Pandas 0.16.1. I describe it here:

http://stackoverflow.com/questions/30290482/normalizing-columns-of-multiindex-dataframe-in-pandas-maybe-a-bug

I have a Pandas Dataframe with hierarchically indexed columns. Eg:

cols = pd.MultiIndex.from_tuples([('syn', 'A'), ('mis', 'A'), ('non', 'A'),
                                  ('syn', 'C'), ('mis', 'C'), ('non', 'C'),
                                  ('syn', 'T'), ('mis', 'T'), ('non', 'T'),
                                  ('syn', 'G'), ('mis', 'G'), ('non', 'G')])
sample = pd.DataFrame(np.random.randint(1, 10, (4, 12)), columns=cols, index=['A', 'C', 'G', 'T'])
sample.head()

Giving:

    syn mis non syn mis non syn mis non syn mis non
    A   A   A   C   C   C   T   T   T   G   G   G
A   7   3   9   5   4   8   6   4   3   6   4   2
C   5   2   2   4   9   6   3   3   9   6   2   1
G   2   4   5   2   8   3   8   3   6   1   2   4
T   9   4   8   9   8   5   8   8   2   2   6   5

However, when I attempt to normalize on level 1 of the columns as follows:

sample.groupby(axis=1, level=1).transform(lambda z: z.div(z.sum(axis=1), axis=0))

I get an error:

ValueError: Length mismatch: Expected axis has 4 elements, new values have 12 elements

Weirdly, it works fine if I use up to 3 of the values of level 1 of axis 1, for example:

ind = sample.columns.get_level_values(1).isin(['A', 'C', 'G'])
subsample = sample.loc[:, ind]
subsample.head()

which just takes the first 3 sets of values from the sample:

    syn mis non syn mis non syn mis non
    A   A   A   C   C   C   G   G   G
A   7   3   9   5   4   8   6   4   2
C   5   2   2   4   9   6   6   2   1
G   2   4   5   2   8   3   1   2   4
T   9   4   8   9   8   5   2   6   5

Then:

subsample.groupby(axis=1, level=1).transform(lambda z: z.div(z.sum(axis=1), axis=0))

correctly returns:

     syn   mis   non   syn   mis   non   syn   mis   non
       A     A     A     C     C     C     G     G     G
A   0.37  0.16  0.47  0.29  0.24  0.47  0.50  0.33  0.17
C   0.56  0.22  0.22  0.21  0.47  0.32  0.67  0.22  0.11
G   0.18  0.36  0.45  0.15  0.62  0.23  0.14  0.29  0.57
T   0.43  0.19  0.38  0.41  0.36  0.23  0.15  0.46  0.38

Any idea why the latter works and the former doesn't? I'm using Pandas version 0.16.1

jreback · 2015-05-18T11:50:54Z

.transform must return a scalar value for each group. You are returning a frame. Try this.

In [44]: sample.groupby(axis=1, level=0).apply(lambda z: z.div(z.sum(axis=1), axis=0))
Out[44]: 
        syn       mis       non       syn       mis       non       syn       mis       non       syn       mis       non
          A         A         A         C         C         C         T         T         T         G         G         G
A  0.125000  0.090909  0.333333  0.375000  0.181818  0.133333  0.250000  0.090909  0.200000  0.250000  0.636364  0.333333
C  0.200000  0.240000  0.230769  0.133333  0.320000  0.307692  0.133333  0.320000  0.115385  0.533333  0.120000  0.346154
G  0.350000  0.318182  0.461538  0.300000  0.136364  0.230769  0.100000  0.136364  0.230769  0.250000  0.409091  0.076923
T  0.052632  0.200000  0.071429  0.368421  0.150000  0.285714  0.368421  0.350000  0.142857  0.210526  0.300000  0.500000
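
For readers on newer pandas, where grouping with axis=1 is deprecated, the per-group normalization can also be sketched by transposing first. This is one possible approach under stated assumptions, not the fix discussed in this issue: it groups on level 1 of the columns as in the original attempt, and the fixed seed and the names totals, normalized and check are only for illustration.

```python
import numpy as np
import pandas as pd

# Same column layout as the example above: (syn, mis, non) for each of A, C, T, G.
labels = [(kind, base) for base in 'ACTG' for kind in ('syn', 'mis', 'non')]
cols = pd.MultiIndex.from_tuples(labels)
rng = np.random.RandomState(0)  # fixed seed, illustrative data only
sample = pd.DataFrame(rng.randint(1, 10, (4, 12)), columns=cols,
                      index=['A', 'C', 'G', 'T'])

# Transpose so the column groups become row groups, broadcast each group's
# sum back to every member with transform('sum'), then transpose back.
totals = sample.T.groupby(level=1).transform('sum').T

# Element-wise division: each value over the sum of its level-1 group in that row.
normalized = sample / totals

# Sanity check: within every level-1 group, each row should now sum to 1.
check = normalized.T.groupby(level=1).sum().T
print(check.round(6))
```

Here transform receives the name of an aggregation, so each group yields one value per column that pandas broadcasts back to the group's rows.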

dylkot · 2015-05-18T13:29:00Z

I see, that does indeed work, thanks so much for the help. I am a bit confused by the documentation then. The docstring for transform says:

Call function producing a like-indexed DataFrame on each group and
return a DataFrame having the same indexes as the original object
filled with the transformed values

So I assumed that transform would return a "transformed" dataframe for each group. In any case, now I think I have a better understanding of apply. Thanks again!
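
For a single grouped column, the two return shapes under discussion can be compared directly. A small sketch, assuming a recent pandas release rather than 0.16.1 (where behaviour may differ); it does not cover the axis=1 frame case above, and the names broadcast and demeaned are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 2, 2], 'col2': [1, 2, 3, np.nan]})
grouped = df.groupby('col1')['col2']

# UDF returning one scalar per group: the value is repeated for every row of the group.
broadcast = grouped.transform(lambda x: x.sum())

# UDF returning a like-indexed Series per group: values are placed back row by row.
demeaned = grouped.transform(lambda x: x - x.mean())

print(broadcast.tolist())        # [3.0, 3.0, 3.0, 3.0]
print(len(demeaned) == len(df))  # True
```

In both cases the result carries the index of the original column, which is the property the docstring describes.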

jreback · 2015-05-18T15:17:21Z

yes, that's the result of a transform. a user-defined function needs to return a single (scalar) value. hmm, it might be better to have a nicer error message / doc. going to make an issue.

jreback · 2015-05-18T15:20:38Z

@dylkot see #10165
if you want to have a go would be gr8!

jreback added Bug Groupby labels Mar 22, 2015

jreback added this to the 0.16.1 milestone Mar 22, 2015

This was referenced Mar 22, 2015

BUG: groupby.transform confused about index #9700

Closed

BUG: ensure we use group sizes, not group counts, in transform (GH9697) #9699

Merged

jreback closed this as completed in #9699 Mar 23, 2015

jreback mentioned this issue May 18, 2015

ERR: better error reporting with .transform and an invalid output from a UDF #10165

Closed

briangerke mentioned this issue Aug 28, 2015

groupby.transform inconsistent behavior when grouping by columns containing NaN #10923

Closed

chbrandt mentioned this issue Jul 27, 2017

Ambiguous behaviour when transform groupby with NaNs #17093

Closed
