Lots of unexpected behavior using resample after groupby #12923

Closed
BreitA opened this issue Apr 19, 2016 · 5 comments
Labels
Duplicate Report Duplicate issue or pull request Groupby Resample resample method


BreitA commented Apr 19, 2016

Code Sample, a copy-pastable example if possible

PANDAS 0.18 code :

import numpy as np
import pandas as pd

df=pd.DataFrame(np.ones((150,4)),columns=['A','B','C','D'],
                index=pd.date_range('2014-01-01',freq='D',periods=150))
df2=pd.DataFrame(np.zeros((150,4)),columns=['A','B','C','D'],
                 index=pd.date_range('2014-01-01',freq='D',periods=150))

df=pd.concat([df,df2])

print df.groupby('B').mean()
print df.groupby('B').resample('MS').mean().head()
print 'shape : ',df.groupby('B').resample('MS').mean().shape
print df.groupby('B').apply(lambda x:x.resample('MS').mean()).head()
print 'shape : ',df.groupby('B').apply(lambda x:x.resample('MS').mean()).shape
print df.groupby('B').mean()
print df.groupby('B').resample('H').mean().head()
print 'shape : ',df.groupby('B').resample('H').mean().shape
print df.groupby('B').apply(lambda x:x.resample('H').mean()).head()
print 'shape : ',df.groupby('B').apply(lambda x:x.resample('H').mean()).shape
print 'pd version', pd.__version__

PANDAS 0.17 equivalent code:

df=pd.DataFrame(np.ones((150,4)),columns=['A','B','C','D'],index=pd.date_range('2014-01-01',freq='D',periods=150))
df2=pd.DataFrame(np.zeros((150,4)),columns=['A','B','C','D'],index=pd.date_range('2014-01-01',freq='D',periods=150))

df=pd.concat([df,df2])

print df.groupby('B').mean()
print df.groupby('B').resample('MS').head()
print 'shape : ',df.groupby('B').resample('MS').shape
print df.groupby('B').apply(lambda x:x.resample('MS')).head()
print 'shape : ',df.groupby('B').apply(lambda x:x.resample('MS')).shape
print df.groupby('B').mean()
print df.groupby('B').resample('H').head()
print 'shape : ',df.groupby('B').resample('H').shape
print df.groupby('B').apply(lambda x:x.resample('H')).head()
print 'shape : ',df.groupby('B').apply(lambda x:x.resample('H')).shape
print 'pd version', pd.__version__

Expected Output

Pandas 0.18 code Output :

       A    C    D
B
0.0  0.0  0.0  0.0
1.0  1.0  1.0  1.0
                  A    B    C    D
B
0.0 2014-01-01  0.0  0.0  0.0  0.0
    2014-02-01  0.0  0.0  0.0  0.0
    2014-03-01  0.0  0.0  0.0  0.0
    2014-04-01  0.0  0.0  0.0  0.0
    2014-05-01  0.0  0.0  0.0  0.0
shape : (10, 4)
                  A    B    C    D
B
0.0 2014-01-01  0.0  0.0  0.0  0.0
    2014-02-01  0.0  0.0  0.0  0.0
    2014-03-01  0.0  0.0  0.0  0.0
    2014-04-01  0.0  0.0  0.0  0.0
    2014-05-01  0.0  0.0  0.0  0.0
shape : (10, 4)
       A    C    D
B
0.0  0.0  0.0  0.0
1.0  1.0  1.0  1.0
                  A    B    C    D
B
0.0 2014-01-01  0.0  0.0  0.0  0.0
    2014-01-02  0.0  0.0  0.0  0.0
    2014-01-03  0.0  0.0  0.0  0.0
    2014-01-04  0.0  0.0  0.0  0.0
    2014-01-05  0.0  0.0  0.0  0.0
shape : (300, 4)
                           A    B    C    D
B
0.0 2014-01-01 00:00:00  0.0  0.0  0.0  0.0
    2014-01-01 01:00:00  NaN  NaN  NaN  NaN
    2014-01-01 02:00:00  NaN  NaN  NaN  NaN
    2014-01-01 03:00:00  NaN  NaN  NaN  NaN
    2014-01-01 04:00:00  NaN  NaN  NaN  NaN
shape : (7154, 4)
pd version 0.18.0

Pandas 0.17 equivalent code Output :

   A  C  D
B
0  0  0  0
1  1  1  1
              A  C  D
B
0 2014-01-01  0  0  0
  2014-02-01  0  0  0
  2014-03-01  0  0  0
  2014-04-01  0  0  0
  2014-05-01  0  0  0
shape : (10, 3)
              A  B  C  D
B
0 2014-01-01  0  0  0  0
  2014-02-01  0  0  0  0
  2014-03-01  0  0  0  0
  2014-04-01  0  0  0  0
  2014-05-01  0  0  0  0
shape : (10, 4)
   A  C  D
B
0  0  0  0
1  1  1  1
                         A    C    D
B
0 2014-01-01 00:00:00    0    0    0
  2014-01-01 01:00:00  NaN  NaN  NaN
  2014-01-01 02:00:00  NaN  NaN  NaN
  2014-01-01 03:00:00  NaN  NaN  NaN
  2014-01-01 04:00:00  NaN  NaN  NaN
shape : (7154, 3)
                         A    B    C    D
B
0 2014-01-01 00:00:00    0    0    0    0
  2014-01-01 01:00:00  NaN  NaN  NaN  NaN
  2014-01-01 02:00:00  NaN  NaN  NaN  NaN
  2014-01-01 03:00:00  NaN  NaN  NaN  NaN
  2014-01-01 04:00:00  NaN  NaN  NaN  NaN
shape : (7154, 4)
pd version 0.17.1

ISSUES :

In pandas 0.18.0, column B is not dropped when resample is applied after groupby. It should be dropped and moved into the index, as in the simple example using .mean() after groupby.
In pandas 0.18.0, the behavior is correct when downsampling (example with 'MS') but wrong when upsampling (example with 'H'): the DataFrame is not upsampled in that case and stays at freq='D'.

A workaround is to use df.groupby('B').apply(lambda x: x.resample('H').mean()), but it is inelegant to say the least and does not solve the issue of B not being dropped from the columns.
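As a sanity check on the shapes reported above, the row count of a correctly upsampled result can be worked out without calling resample at all. This is a minimal sketch in Python 3 syntax. It assumes only that each of the two groups spans the same 150 daily timestamps, and it uses the lowercase 'h' alias that newer pandas versions prefer over the 'H' used in this report.

```python
import pandas as pd

# The daily index used for both halves of the example frame.
days = pd.date_range("2014-01-01", freq="D", periods=150)

# Hourly grid from the first to the last daily timestamp, inclusive.
hours = pd.date_range(days[0], days[-1], freq="h")

rows_per_group = len(hours)   # (150 - 1) * 24 + 1
n_groups = 2                  # B == 0.0 and B == 1.0

print(rows_per_group, rows_per_group * n_groups)  # 3577 7154
```

The total of 7154 rows matches the upsampled shapes in the output above, while a result of 300 rows means the data is still at daily frequency.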

@TomAugspurger
Contributor

Can you check out #12743, which is closing a bunch of these issues, and confirm that it gives you the expected answers? I'm having a bit of trouble understanding your output since the formatting is off, but it looks correct on that branch.

Let's move the discussion there if there are any issues.

@TomAugspurger TomAugspurger added Groupby Duplicate Report Duplicate issue or pull request Resample resample method labels Apr 19, 2016
@jreback
Contributor

jreback commented Apr 19, 2016

In [1]: df=pd.DataFrame(np.ones((150,4)),columns=['A','B','C','D'],
   ...: index=pd.date_range('2014-01-01',freq='D',periods=150))

In [2]: df2=pd.DataFrame(np.zeros((150,4)),columns=['A','B','C','D'],
   ...: index=pd.date_range('2014-01-01',freq='D',periods=150))

In [3]: df=pd.concat([df,df2])

In [4]: print df.groupby('B').mean()
       A    C    D
B                 
0.0  0.0  0.0  0.0
1.0  1.0  1.0  1.0

In [5]: print df.groupby('B').resample('MS').mean().head()
                  A    B    C    D
B                                 
0.0 2014-01-01  0.0  0.0  0.0  0.0
    2014-02-01  0.0  0.0  0.0  0.0
    2014-03-01  0.0  0.0  0.0  0.0
    2014-04-01  0.0  0.0  0.0  0.0
    2014-05-01  0.0  0.0  0.0  0.0

In [6]: print 'shape : ',df.groupby('B').resample('MS').mean().shape
shape :  (10, 4)

In [7]: print df.groupby('B').apply(lambda x:x.resample('MS').mean()).head()
                  A    B    C    D
B                                 
0.0 2014-01-01  0.0  0.0  0.0  0.0
    2014-02-01  0.0  0.0  0.0  0.0
    2014-03-01  0.0  0.0  0.0  0.0
    2014-04-01  0.0  0.0  0.0  0.0
    2014-05-01  0.0  0.0  0.0  0.0

In [8]: print 'shape : ',df.groupby('B').apply(lambda x:x.resample('MS').mean()).shape
shape :  (10, 4)
In [9]: print df.groupby('B').mean()
       A    C    D
B                 
0.0  0.0  0.0  0.0
1.0  1.0  1.0  1.0

In [10]: print df.groupby('B').resample('H').mean().head()
                           A    B    C    D
B                                          
0.0 2014-01-01 00:00:00  0.0  0.0  0.0  0.0
    2014-01-01 01:00:00  NaN  NaN  NaN  NaN
    2014-01-01 02:00:00  NaN  NaN  NaN  NaN
    2014-01-01 03:00:00  NaN  NaN  NaN  NaN
    2014-01-01 04:00:00  NaN  NaN  NaN  NaN

In [11]: print 'shape : ',df.groupby('B').resample('H').mean().shape
shape :  (7154, 4)

In [12]: print df.groupby('B').apply(lambda x:x.resample('H').mean()).head()
                           A    B    C    D
B                                          
0.0 2014-01-01 00:00:00  0.0  0.0  0.0  0.0
    2014-01-01 01:00:00  NaN  NaN  NaN  NaN
    2014-01-01 02:00:00  NaN  NaN  NaN  NaN
    2014-01-01 03:00:00  NaN  NaN  NaN  NaN
    2014-01-01 04:00:00  NaN  NaN  NaN  NaN

In [13]: print 'shape : ',df.groupby('B').apply(lambda x:x.resample('H').mean()).shape
shape :  (7154, 4)

In [14]: print 'pd version', pd.__version__
pd version 0.18.0+129.g928a8b4

So these all look correct to me. As @TomAugspurger says, #12743 will resolve any remaining issues here. In essence, df.groupby(...).resample(...) is doing df.groupby(...).apply(lambda x: x.resample(...)) under the hood.
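The equivalence described here can be checked on the example frame. This is a sketch in Python 3 syntax, not a statement about pandas internals. Whether the grouping column B also appears among the result columns has varied between pandas versions, so the comparison below is restricted to A, C and D (an assumption made to keep the check version-independent).

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2014-01-01", freq="D", periods=150)
cols = ["A", "B", "C", "D"]
df = pd.concat([
    pd.DataFrame(np.ones((150, 4)), columns=cols, index=idx),
    pd.DataFrame(np.zeros((150, 4)), columns=cols, index=idx),
])

# Direct form: resample each group to month-start bins.
direct = df.groupby("B").resample("MS").mean()

# Explicit form: apply the same resample to each group's sub-frame.
explicit = df.groupby("B").apply(lambda g: g.resample("MS").mean())

# Compare only the non-grouping columns, which are present in both results.
value_cols = ["A", "C", "D"]
same = np.allclose(direct[value_cols].to_numpy(),
                   explicit[value_cols].to_numpy())

print(direct.shape[0], explicit.shape[0], same)
```

Both results have one row per (group, month) pair, which is 2 groups times 5 months for this data.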

@BreitA
Author

BreitA commented Apr 19, 2016

So here is the difference in behavior between your output and mine:

YOU

In [10]: print df.groupby('B').resample('H').mean().head()
                           A    B    C    D
B                                          
0.0 2014-01-01 00:00:00  0.0  0.0  0.0  0.0
    2014-01-01 01:00:00  NaN  NaN  NaN  NaN
    2014-01-01 02:00:00  NaN  NaN  NaN  NaN
    2014-01-01 03:00:00  NaN  NaN  NaN  NaN
    2014-01-01 04:00:00  NaN  NaN  NaN  NaN

ME

In [10]: print df.groupby('B').resample('H').mean().head()
                  A    B    C    D
B                                 
0.0 2014-01-01  0.0  0.0  0.0  0.0
    2014-01-02  0.0  0.0  0.0  0.0
    2014-01-03  0.0  0.0  0.0  0.0
    2014-01-04  0.0  0.0  0.0  0.0
    2014-01-05  0.0  0.0  0.0  0.0
shape :  (225, 4)

Will this be fixed in the next build?

Also, is it normal that B isn't dropped anymore? It seems weird that it is dropped for simple functions such as .mean() but not for resampling.
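On the question of keeping B out of the columns: for the downsampling case, one alternative is to group by the key and a time-based pd.Grouper together, so that both act as group keys. This is a sketch of a different technique, not what groupby(...).resample(...) does, and it does not upsample, because only bins that contain data appear in the result.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2014-01-01", freq="D", periods=150)
cols = ["A", "B", "C", "D"]
df = pd.concat([
    pd.DataFrame(np.ones((150, 4)), columns=cols, index=idx),
    pd.DataFrame(np.zeros((150, 4)), columns=cols, index=idx),
])

# Group by column B and by month-start bins of the DatetimeIndex at once.
monthly = df.groupby(["B", pd.Grouper(freq="MS")]).mean()

print(list(monthly.columns))  # ['A', 'C', 'D']
print(monthly.shape)          # (10, 3)
```

Here B ends up only in the index, as it does with a plain df.groupby('B').mean().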

@jreback
Contributor

jreback commented Apr 19, 2016

@BreitA you are probably looking to do this:

In [6]: df.groupby('B').resample('H').ffill()
Out[6]: 
                           A    B    C    D
B                                          
0.0 2014-01-01 00:00:00  0.0  0.0  0.0  0.0
    2014-01-01 01:00:00  0.0  0.0  0.0  0.0
    2014-01-01 02:00:00  0.0  0.0  0.0  0.0
    2014-01-01 03:00:00  0.0  0.0  0.0  0.0
    2014-01-01 04:00:00  0.0  0.0  0.0  0.0
    2014-01-01 05:00:00  0.0  0.0  0.0  0.0
    2014-01-01 06:00:00  0.0  0.0  0.0  0.0
    2014-01-01 07:00:00  0.0  0.0  0.0  0.0
    2014-01-01 08:00:00  0.0  0.0  0.0  0.0

.mean() is a downsampling operation and doesn't make sense here (it works, but is probably not what you want).

The implementation is exactly this. Yes, you are doing an operation on the entire frame, so it makes sense to keep all columns.

In [9]: df.groupby('B').apply(lambda x: x.resample('H').ffill())
Out[9]: 
                           A    B    C    D
B                                          
0.0 2014-01-01 00:00:00  0.0  0.0  0.0  0.0
    2014-01-01 01:00:00  0.0  0.0  0.0  0.0
    2014-01-01 02:00:00  0.0  0.0  0.0  0.0
    2014-01-01 03:00:00  0.0  0.0  0.0  0.0
    2014-01-01 04:00:00  0.0  0.0  0.0  0.0
    2014-01-01 05:00:00  0.0  0.0  0.0  0.0
    2014-01-01 06:00:00  0.0  0.0  0.0  0.0
    2014-01-01 07:00:00  0.0  0.0  0.0  0.0

@BreitA
Author

BreitA commented Apr 19, 2016

Yeah, I know the example is kind of silly (using .mean() for upsampling). The point was that the behavior was not the same when using apply(lambda x: x.resample('H').mean()) instead of .resample('H').mean().
