
Unexpected transform behavior on grouped dataset #3740

Closed
fonnesbeck opened this issue Jun 2, 2013 · 20 comments · Fixed by #3743
Labels: Bug, Dtype Conversions, Groupby

Comments

@fonnesbeck

I have a simple longitudinal biomedical dataset that I am grouping according to the patient on which measurements are taken. Here are the first couple of groups:

1
   patient  obs  week  site  id  treat  age sex  twstrs  treatment
0        1    1     0     1   1  5000U   65   F      32          1
1        1    2     2     1   1  5000U   65   F      30          1
2        1    3     4     1   1  5000U   65   F      24          1
3        1    4     8     1   1  5000U   65   F      37          1
4        1    5    12     1   1  5000U   65   F      39          1
5        1    6    16     1   1  5000U   65   F      36          1

2
    patient  obs  week  site  id   treat  age sex  twstrs  treatment
6         2    1     0     1   2  10000U   70   F      60          2
7         2    2     2     1   2  10000U   70   F      26          2
8         2    3     4     1   2  10000U   70   F      27          2
9         2    4     8     1   2  10000U   70   F      41          2
10        2    5    12     1   2  10000U   70   F      65          2
11        2    6    16     1   2  10000U   70   F      67          2

However, when I try to transform these data, say by normalization, I get nonsensical results:

normalize = lambda x: (x - x.mean())/x.std()
normed = cdystonia_grouped.transform(normalize)
normed.head(10)

               patient  obs  week                 site                   id  \
0 -9223372036854775808   -1    -1 -9223372036854775808 -9223372036854775808   
1 -9223372036854775808    0     0 -9223372036854775808 -9223372036854775808   
2 -9223372036854775808    0     0 -9223372036854775808 -9223372036854775808   
3 -9223372036854775808    0     0 -9223372036854775808 -9223372036854775808   
4 -9223372036854775808    0     0 -9223372036854775808 -9223372036854775808   

                   age  twstrs            treatment  
0 -9223372036854775808       0 -9223372036854775808  
1 -9223372036854775808       0 -9223372036854775808  
2 -9223372036854775808      -1 -9223372036854775808  
3 -9223372036854775808       0 -9223372036854775808  
4 -9223372036854775808       1 -9223372036854775808  

The normalize function is straightforward, and works fine when applied to manually subsetted data:

normalize(cdystonia.twstrs[cdystonia.patient==1])

0   -0.181369
1   -0.544107
2   -1.632322
3    0.725476
4    1.088214
5    0.544107
Name: twstrs, dtype: float64

Any guidance here much appreciated. I'm hoping it's something obvious.

@jreback
Contributor

jreback commented Jun 2, 2013

looks like u have some uint dtypes
can u show df.info()

also try this on master, I just fixed the cause of this
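
For reference, a minimal sketch of the kind of check being asked for here, on a hypothetical toy frame rather than the original data: a single unsigned column is enough to give the frame mixed dtypes.

import pandas as pd

df = pd.DataFrame({'a': pd.Series([1, 2, 3], dtype='uint64'),
                   'b': [4, 5, 6]})

print(df.dtypes)  # a    uint64
                  # b     int64
df.info()         # same dtype summary, plus index and null counts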

@jreback
Contributor

jreback commented Jun 2, 2013

Here is an example. This is in 0.11.0

In [16]: df = DataFrame(dict(A = Series([1]*5,dtype='uint8'), B = Series([0]*5,dtype='uint64'),C=np.random.randint(-10,10,size=5)))

In [17]: df
Out[17]: 
   A  B  C
0  1  0  8
1  1  0 -3
2  1  0  9
3  1  0  8
4  1  0  6

In [18]: df.values
Out[18]: 
array([[                   1,                    0,                    8],
       [                   1,                    0, 18446744073709551613],
       [                   1,                    0,                    9],
       [                   1,                    0,                    8],
       [                   1,                    0,                    6]], dtype=uint64)
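
The 18446744073709551613 above is just -3 wrapped around modulo 2**64. A minimal sketch of the same wraparound in plain NumPy (this is what typical platforms do for a negative-to-unsigned cast):

import numpy as np

print(np.array(-3, dtype=np.int64).astype(np.uint64))  # 18446744073709551613
print(2**64 - 3)                                       # 18446744073709551613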

In 0.11.1 this works:

In [1]: df = DataFrame(dict(A = Series([1]*5,dtype='uint8'), B = Series([0]*5,dtype='uint64'),C=np.random.randint(-10,10,size=5)))

In [2]: df.values
Out[2]: 
array([[ 1,  0, -8],
       [ 1,  0,  2],
       [ 1,  0, -3],
       [ 1,  0,  5],
       [ 1,  0,  5]])

@fonnesbeck
Author

My DataFrame info is as follows:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 631 entries, 0 to 630
Data columns (total 10 columns):
patient      631  non-null values
obs          631  non-null values
week         631  non-null values
site         631  non-null values
id           631  non-null values
treat        631  non-null values
age          631  non-null values
sex          631  non-null values
twstrs       631  non-null values
treatment    631  non-null values
dtypes: int64(8), object(2)

So, sex and treat are string variables, but the others are valid int64, which ought to normalize properly. I updated to the current master, but the result is the same.

@jreback
Contributor

jreback commented Jun 3, 2013

can u give me a Dropbox link to the frame as a csv?

@fonnesbeck
Author

Here it is

@jreback
Contributor

jreback commented Jun 3, 2013

@fonnesbeck can you give this PR a try? should fix it; this was a pretty esoteric bug

@jreback
Contributor

jreback commented Jun 3, 2013

This is a reproduction:

0.11.0

In [3]: df = DataFrame(dict(A = [1,1,1,2,2,2], B = 1, C = [1,2,3,1,2,3], D = 'foo'))

In [4]: df.groupby('A').transform(lambda x: (x-x.mean())/x.std())
Out[4]: 
                     B  C
0 -9223372036854775808 -1
1 -9223372036854775808  0
2 -9223372036854775808  1
3 -9223372036854775808 -1
4 -9223372036854775808  0
5 -9223372036854775808  1
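
The -9223372036854775808 values are np.iinfo(np.int64).min, which is what NaN typically becomes when force-cast back to an int64 column. A quick check (casting NaN to int is undefined behavior; the minimum value is what most platforms produce):

import numpy as np

print(np.iinfo(np.int64).min)               # -9223372036854775808
print(np.array([np.nan]).astype(np.int64))  # [-9223372036854775808]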

0.11.1

In [1]: df = DataFrame(dict(A = [1,1,1,2,2,2], B = 1, C = [1,2,3,1,2,3], D = 'foo'))

In [2]: df.groupby('A').transform(lambda x: (x-x.mean())/x.std())
Out[2]: 
    B  C
0 NaN -1
1 NaN  0
2 NaN  1
3 NaN -1
4 NaN  0
5 NaN  1

@fonnesbeck
Author

Right, however the values for C are wrong. They should be upcast to floats, since they are z-scores:

array([-1.22474487, 0., 1.22474487, -1.22474487, 0., 1.22474487])

@jreback
Contributor

jreback commented Jun 3, 2013

what is your calculation? this seems correct (it's by group)

In [11]: x
Out[11]: 
0    1
1    2
2    3
dtype: float64

In [12]: (x-x.mean())/x.std()
Out[12]: 
0   -1
1    0
2    1
dtype: float64
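
The mismatch with the expected array above likely comes down to the degrees-of-freedom convention: NumPy's std defaults to the population form (ddof=0), while pandas' Series.std defaults to the sample form (ddof=1). A quick check:

import numpy as np

c = np.array([1.0, 2.0, 3.0])
print((c - c.mean()) / c.std())        # ddof=0: [-1.2247  0.  1.2247]
print((c - c.mean()) / c.std(ddof=1))  # ddof=1 (pandas default): [-1.  0.  1.]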

@fonnesbeck
Author

It works when your data is [1,2,3], but try it for the values in C from your df example above, or even [1,2,3,4]. Also, as I reported originally, the function works stand-alone but not as the argument to transform.

@jreback
Contributor

jreback commented Jun 3, 2013

I think my bug-fix works; but in your lambda you need to be sure to use floats (otherwise the integer result is actually correct)

e.g.

lambda x: (x-x.mean())/(x.astype(float).std())

I think that would work; you are getting integer division, and with a lambda like this pandas cannot infer what you actually want

so either astype the data on the way in, or use the lambda like above
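
A sketch of that workaround on the reproduction frame from earlier in the thread, selecting the numeric columns explicitly since the lambda is meaningless for the string column D:

import pandas as pd

df = pd.DataFrame(dict(A=[1, 1, 1, 2, 2, 2], B=1, C=[1, 2, 3, 1, 2, 3], D='foo'))

# Cast to float before taking the std so the division happens in floats.
normed = df.groupby('A')[['B', 'C']].transform(
    lambda x: (x - x.mean()) / x.astype(float).std())
print(normed)  # C is [-1, 0, 1] per group; B is constant per group,
               # so its std is 0 and it comes out NaN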

@jreback
Contributor

jreback commented Jun 3, 2013

actually...hold on

@fonnesbeck
Author

If you take the standard deviation of a series of integers, you should not get an integer back, since we are taking the square root of a sum. It's not clear why the explicit casting should be required.
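
For what it's worth, Series.std already returns a float for integer input, so the truncation presumably happens when transform casts the result back to the source column's dtype. A quick check with stand-alone pandas:

import pandas as pd

s = pd.Series([1, 2, 3, 4])              # int64
print(s.std())                           # 1.2909944487358056, a float
print(((s - s.mean()) / s.std()).dtype)  # float64 when run outside transform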

@jreback
Contributor

jreback commented Jun 3, 2013

ok...think I got it; I put another commit up, pls give it a shot

here's your original data set

(Pdb) df.groupby('week').transform(lambda x: (x-x.mean())/x.std()).head()
    patient  obs      site        id       age    twstrs  treatment
0 -1.708342  NaN -1.548845 -1.410553  0.780821 -1.405490  -0.011160
1 -1.683606  NaN -1.522794 -1.398507  0.773854 -0.594184  -0.035435
2 -1.692057  NaN -1.527510 -1.400460  0.759730 -0.989923  -0.011473
3 -1.691585  NaN -1.532686 -1.385268  0.761087 -0.207029   0.011749
4 -1.708176  NaN -1.540303 -1.392138  0.764912 -0.322402   0.000000

@fonnesbeck
Author

That works. Thanks!

@jreback
Contributor

jreback commented Jun 3, 2013

great! in time for 0.11.1

@fonnesbeck
Author

This fix appears to work for some numeric columns in the sample DataFrame that I sent, but not others:

normalize = lambda x: (x - x.mean())/x.std()
cdystonia_grouped.transform(normalize).head()

   patient       obs      week  site  id  age    twstrs  treatment
0      NaN -1.336306 -1.135550   NaN NaN  NaN -0.181369        NaN
1      NaN -0.801784 -0.811107   NaN NaN  NaN -0.544107        NaN
2      NaN -0.267261 -0.486664   NaN NaN  NaN -1.632322        NaN
3      NaN  0.267261  0.162221   NaN NaN  NaN  0.725476        NaN
4      NaN  0.801784  0.811107   NaN NaN  NaN  1.088214        NaN

With the exception of the two string variables (sex, treat), the columns appear to be valid int64s with no missing data:

cdystonia.site.value_counts()

8    106
6     87
2     82
7     72
3     72
1     70
9     61
4     48
5     33
dtype: int64

It's not clear why they are coming up NaN.

@jreback
Copy link
Contributor

jreback commented Jun 3, 2013

the particular groups that I looked at were all the same value
so std is 0 and hence a nan

can u show a particular group where that is not the case and they still come up nan?

show your groupby as well

thxs
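
A minimal sketch of that failure mode: a column that is constant within each group has a group std of 0, so the z-score is 0/0, i.e. NaN.

import pandas as pd

df = pd.DataFrame({'patient': [1, 1, 1, 2, 2, 2],
                   'site':    [1, 1, 1, 2, 2, 2]})

normed = df.groupby('patient')['site'].transform(
    lambda x: (x - x.mean()) / x.std())
print(normed)  # all NaN: each group's std is 0, and 0/0 is NaN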

@fonnesbeck
Author

Yes, of course. My mistake, sorry.

@jreback
Contributor

jreback commented Jun 3, 2013

np
