
Unexpected transform behavior on grouped dataset #3740

Closed
fonnesbeck opened this issue Jun 2, 2013 · 20 comments · Fixed by #3743
Labels: Bug, Dtype Conversions, Groupby

Comments

@fonnesbeck

I have a simple longitudinal biomedical dataset that I am grouping according to the patient on which measurements are taken. Here are the first couple of groups:

1
   patient  obs  week  site  id  treat  age sex  twstrs  treatment
0        1    1     0     1   1  5000U   65   F      32          1
1        1    2     2     1   1  5000U   65   F      30          1
2        1    3     4     1   1  5000U   65   F      24          1
3        1    4     8     1   1  5000U   65   F      37          1
4        1    5    12     1   1  5000U   65   F      39          1
5        1    6    16     1   1  5000U   65   F      36          1

2
    patient  obs  week  site  id   treat  age sex  twstrs  treatment
6         2    1     0     1   2  10000U   70   F      60          2
7         2    2     2     1   2  10000U   70   F      26          2
8         2    3     4     1   2  10000U   70   F      27          2
9         2    4     8     1   2  10000U   70   F      41          2
10        2    5    12     1   2  10000U   70   F      65          2
11        2    6    16     1   2  10000U   70   F      67          2

However, when I try to transform these data, say by normalization, I get nonsensical results:

normalize = lambda x: (x - x.mean())/x.std()
normed = cdystonia_grouped.transform(normalize)
normed.head(10)

               patient  obs  week                 site                   id  \
0 -9223372036854775808   -1    -1 -9223372036854775808 -9223372036854775808   
1 -9223372036854775808    0     0 -9223372036854775808 -9223372036854775808   
2 -9223372036854775808    0     0 -9223372036854775808 -9223372036854775808   
3 -9223372036854775808    0     0 -9223372036854775808 -9223372036854775808   
4 -9223372036854775808    0     0 -9223372036854775808 -9223372036854775808   

                   age  twstrs            treatment  
0 -9223372036854775808       0 -9223372036854775808  
1 -9223372036854775808       0 -9223372036854775808  
2 -9223372036854775808      -1 -9223372036854775808  
3 -9223372036854775808       0 -9223372036854775808  
4 -9223372036854775808       1 -9223372036854775808  

The normalize function is straightforward, and works fine when applied to manually subsetted data:

normalize(cdystonia.twstrs[cdystonia.patient==1])

0   -0.181369
1   -0.544107
2   -1.632322
3    0.725476
4    1.088214
5    0.544107
Name: twstrs, dtype: float64

Any guidance here much appreciated. I'm hoping it's something obvious.

@jreback
Contributor

jreback commented Jun 2, 2013

looks like u have some uint dtypes
can u show df.info()

also try this on master, I just fixed the cause of this
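
For reference, a minimal sketch of the kind of check being asked for here, on a hypothetical toy frame rather than the original data: a single unsigned column is enough to give the frame mixed dtypes.

import pandas as pd

df = pd.DataFrame({'a': pd.Series([1, 2, 3], dtype='uint64'),
                   'b': [4, 5, 6]})

print(df.dtypes)  # a    uint64
                  # b     int64
df.info()         # same dtype summary, plus index and null counts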

@jreback
Contributor

jreback commented Jun 2, 2013

Here is an example. This is in 0.11.0

In [16]: df = DataFrame(dict(A = Series([1]*5,dtype='uint8'), B = Series([0]*5,dtype='uint64'),C=np.random.randint(-10,10,size=5)))

In [17]: df
Out[17]: 
   A  B  C
0  1  0  8
1  1  0 -3
2  1  0  9
3  1  0  8
4  1  0  6

In [18]: df.values
Out[18]: 
array([[                   1,                    0,                    8],
       [                   1,                    0, 18446744073709551613],
       [                   1,                    0,                    9],
       [                   1,                    0,                    8],
       [                   1,                    0,                    6]], dtype=uint64)
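
The 18446744073709551613 above is just -3 wrapped around modulo 2**64. A minimal sketch of the same wraparound in plain NumPy (this is what typical platforms do for a negative-to-unsigned cast):

import numpy as np

print(np.array(-3, dtype=np.int64).astype(np.uint64))  # 18446744073709551613
print(2**64 - 3)                                       # 18446744073709551613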

In 0.11.1 this works:

In [1]: df = DataFrame(dict(A = Series([1]*5,dtype='uint8'), B = Series([0]*5,dtype='uint64'),C=np.random.randint(-10,10,size=5)))

In [2]: df.values
Out[2]: 
array([[ 1,  0, -8],
       [ 1,  0,  2],
       [ 1,  0, -3],
       [ 1,  0,  5],
       [ 1,  0,  5]])

@fonnesbeck
Author

My DataFrame info is as follows:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 631 entries, 0 to 630
Data columns (total 10 columns):
patient      631  non-null values
obs          631  non-null values
week         631  non-null values
site         631  non-null values
id           631  non-null values
treat        631  non-null values
age          631  non-null values
sex          631  non-null values
twstrs       631  non-null values
treatment    631  non-null values
dtypes: int64(8), object(2)

So, sex and treat are string variables, but the others are valid int64, which ought to normalize properly. I updated to the current master, but the result is the same.

@jreback
Contributor

jreback commented Jun 3, 2013

can u give me a Dropbox link to the frame as a csv?

@fonnesbeck
Author

Here it is

@jreback
Contributor

jreback commented Jun 3, 2013

@fonnesbeck can you give this PR a try? should fix it; this was a pretty esoteric bug

@jreback
Contributor

jreback commented Jun 3, 2013

This is a reproduction:

0.11.0

In [3]: df = DataFrame(dict(A = [1,1,1,2,2,2], B = 1, C = [1,2,3,1,2,3], D = 'foo'))

In [4]: df.groupby('A').transform(lambda x: (x-x.mean())/x.std())
Out[4]: 
                     B  C
0 -9223372036854775808 -1
1 -9223372036854775808  0
2 -9223372036854775808  1
3 -9223372036854775808 -1
4 -9223372036854775808  0
5 -9223372036854775808  1
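
The -9223372036854775808 values are np.iinfo(np.int64).min, which is what NaN typically becomes when force-cast back to an int64 column. A quick check (casting NaN to int is undefined behavior; the minimum value is what most platforms produce):

import numpy as np

print(np.iinfo(np.int64).min)               # -9223372036854775808
print(np.array([np.nan]).astype(np.int64))  # [-9223372036854775808]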

0.11.1

In [1]: df = DataFrame(dict(A = [1,1,1,2,2,2], B = 1, C = [1,2,3,1,2,3], D = 'foo'))

In [2]: df.groupby('A').transform(lambda x: (x-x.mean())/x.std())
Out[2]: 
    B  C
0 NaN -1
1 NaN  0
2 NaN  1
3 NaN -1
4 NaN  0
5 NaN  1

@fonnesbeck
Author

Right, however the values for C are wrong. They should be upcast to floats, since they are z-scores:

array([-1.22474487, 0., 1.22474487, -1.22474487, 0., 1.22474487])

@jreback
Contributor

jreback commented Jun 3, 2013

what is your calculation? this seems correct (it's by group)

In [11]: x
Out[11]: 
0    1
1    2
2    3
dtype: float64

In [12]: (x-x.mean())/x.std()
Out[12]: 
0   -1
1    0
2    1
dtype: float64
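
The mismatch with the expected array above likely comes down to the degrees-of-freedom convention: NumPy's std defaults to the population form (ddof=0), while pandas' Series.std defaults to the sample form (ddof=1). A quick check:

import numpy as np

c = np.array([1.0, 2.0, 3.0])
print((c - c.mean()) / c.std())        # ddof=0: [-1.2247  0.  1.2247]
print((c - c.mean()) / c.std(ddof=1))  # ddof=1 (pandas default): [-1.  0.  1.]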

@fonnesbeck
Author

It works when your data is [1,2,3], but try it for the values in C from your df example above, or even [1,2,3,4]. Also, as I reported originally, the function works stand-alone but not as the argument to transform.

@jreback
Contributor

jreback commented Jun 3, 2013

I think my bug-fix works; but in your lambda you need to be sure to use floats (otherwise the integer result is actually correct)

e.g.

lambda x: (x-x.mean())/(x.astype(float).std())

I think that would work; you are getting integer division, and with a lambda like this pandas cannot infer what you actually want

so either astype the data on the way in, or use the lambda like above
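
A sketch of that workaround on the reproduction frame from earlier in the thread, selecting the numeric columns explicitly since the lambda is meaningless for the string column D:

import pandas as pd

df = pd.DataFrame(dict(A=[1, 1, 1, 2, 2, 2], B=1, C=[1, 2, 3, 1, 2, 3], D='foo'))

# Cast to float before taking the std so the division happens in floats.
normed = df.groupby('A')[['B', 'C']].transform(
    lambda x: (x - x.mean()) / x.astype(float).std())
print(normed)  # C is [-1, 0, 1] per group; B is constant per group,
               # so its std is 0 and it comes out NaN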

@jreback
Contributor

jreback commented Jun 3, 2013

actually...hold on

@fonnesbeck
Author

If you take the standard deviation of a series of integers, you should not get an integer back, since we are taking the square root of a sum. It's not clear why the explicit casting should be required.
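
For what it's worth, Series.std already returns a float for integer input, so the truncation presumably happens when transform casts the result back to the source column's dtype. A quick check with stand-alone pandas:

import pandas as pd

s = pd.Series([1, 2, 3, 4])              # int64
print(s.std())                           # 1.2909944487358056, a float
print(((s - s.mean()) / s.std()).dtype)  # float64 when run outside transform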

@jreback
Contributor

jreback commented Jun 3, 2013

ok...think I got it; I put another commit up, pls give it a shot

here's your original data set

(Pdb) df.groupby('week').transform(lambda x: (x-x.mean())/x.std()).head()
    patient  obs      site        id       age    twstrs  treatment
0 -1.708342  NaN -1.548845 -1.410553  0.780821 -1.405490  -0.011160
1 -1.683606  NaN -1.522794 -1.398507  0.773854 -0.594184  -0.035435
2 -1.692057  NaN -1.527510 -1.400460  0.759730 -0.989923  -0.011473
3 -1.691585  NaN -1.532686 -1.385268  0.761087 -0.207029   0.011749
4 -1.708176  NaN -1.540303 -1.392138  0.764912 -0.322402   0.000000

@fonnesbeck
Author

That works. Thanks!

@jreback
Contributor

jreback commented Jun 3, 2013

great! in time for 0.11.1

@fonnesbeck
Author

This fix appears to work for some numeric columns in the sample DataFrame that I sent, but not others:

normalize = lambda x: (x - x.mean())/x.std()
cdystonia_grouped.transform(normalize).head()

   patient       obs      week  site  id  age    twstrs  treatment
0      NaN -1.336306 -1.135550   NaN NaN  NaN -0.181369        NaN
1      NaN -0.801784 -0.811107   NaN NaN  NaN -0.544107        NaN
2      NaN -0.267261 -0.486664   NaN NaN  NaN -1.632322        NaN
3      NaN  0.267261  0.162221   NaN NaN  NaN  0.725476        NaN
4      NaN  0.801784  0.811107   NaN NaN  NaN  1.088214        NaN

With the exception of the two string variables (sex, treat), the columns appear to be valid int64s with no missing data:

cdystonia.site.value_counts()

8    106
6     87
2     82
7     72
3     72
1     70
9     61
4     48
5     33
dtype: int64

It's not clear why they are coming up NaN.

@jreback
Copy link
Contributor

jreback commented Jun 3, 2013

the particular groups that I looked at were all the same value
so std is 0 and hence a nan

can u show a particular group where that is not the case and they still come up nan?

show your groupby as well

thxs
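
A minimal sketch of that failure mode: a column that is constant within each group has a group std of 0, so the z-score is 0/0, i.e. NaN.

import pandas as pd

df = pd.DataFrame({'patient': [1, 1, 1, 2, 2, 2],
                   'site':    [1, 1, 1, 2, 2, 2]})

normed = df.groupby('patient')['site'].transform(
    lambda x: (x - x.mean()) / x.std())
print(normed)  # all NaN: each group's std is 0, and 0/0 is NaN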

@fonnesbeck
Author

Yes, of course. My mistake, sorry.

@jreback
Contributor

jreback commented Jun 3, 2013

np
