
groupby.mean, etc, doesn't recognize timedelta64 #5724

Closed
hsharrison opened this issue Dec 17, 2013 · 12 comments

@hsharrison

See http://stackoverflow.com/questions/20625982/split-apply-combine-on-pandas-timedelta-column
related as well: http://stackoverflow.com/questions/20789976/python-pandas-dataframe-1st-line-issue-with-datetime-timedelta/20802902#20802902

I have a DataFrame with a column of timedeltas (upon inspection, the dtype is timedelta64[ns], i.e. '<m8[ns]'), and I'd like to do a split-apply-combine, but the timedelta column is being dropped:

import pandas as pd
import numpy as np

pd.__version__
Out[3]: '0.13.0rc1'
np.__version__
Out[4]: '1.8.0'

data = pd.DataFrame(np.random.rand(10, 3), columns=['f1', 'f2', 'td'])
data['td'] *= 10000000
data['td'] = pd.Series(data['td'], dtype='<m8[ns]')
data
Out[8]: 
         f1        f2              td
0  0.990140  0.948313 00:00:00.003066
1  0.277125  0.993549 00:00:00.001443
2  0.016427  0.581129 00:00:00.009257
3  0.048662  0.512215 00:00:00.000702
4  0.846301  0.179160 00:00:00.000396
5  0.568323  0.419887 00:00:00.000266
6  0.328182  0.919897 00:00:00.006138
7  0.292882  0.213219 00:00:00.008876
8  0.623332  0.003409 00:00:00.000322
9  0.650436  0.844180 00:00:00.006873

[10 rows x 3 columns]

data.groupby(data.index < 5).mean()
Out[9]: 
             f1        f2
False  0.492631  0.480118
True   0.435731  0.642873

[2 rows x 2 columns]

Or, forcing pandas to try the operation on the 'td' column:

data.groupby(data.index < 5)['td'].mean()
---------------------------------------------------------------------------
DataError                                 Traceback (most recent call last)
<ipython-input-12-88cc94e534b7> in <module>()
----> 1 data.groupby(data.index < 5)['td'].mean()

/path/to/lib/python3.3/site-packages/pandas-0.13.0rc1-py3.3-linux-x86_64.egg/pandas/core/groupby.py in mean(self)
    417         """
    418         try:
--> 419             return self._cython_agg_general('mean')
    420         except GroupByError:
    421             raise

/path/to/lib/python3.3/site-packages/pandas-0.13.0rc1-py3.3-linux-x86_64.egg/pandas/core/groupby.py in _cython_agg_general(self, how, numeric_only)
    669 
    670         if len(output) == 0:
--> 671             raise DataError('No numeric types to aggregate')
    672 
    673         return self._wrap_aggregated_output(output, names)

DataError: No numeric types to aggregate

However, taking the mean of the column works fine, so numeric operations should be possible:

data['td'].mean()
Out[11]: 
0   00:00:00.003734
dtype: timedelta64[ns]
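(Before this was fixed, a common workaround was to take the mean on integer nanoseconds and convert back. A minimal sketch, with illustrative column and group names that are not from the report:)

```python
import pandas as pd

# Hypothetical example data; 'group' and 'td' are illustrative names.
df = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b'],
    'td': pd.to_timedelta([1, 3, 2, 4], unit='D'),
})

# Workaround: cast the timedeltas to integer nanoseconds, take the mean,
# then convert the result back to timedelta64[ns].
mean_ns = df.groupby('group')['td'].apply(lambda s: s.astype('int64').mean())
mean_td = pd.to_timedelta(mean_ns, unit='ns')
```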
@hayd
Contributor

hayd commented Dec 18, 2013

Worth mentioning that DataFrame's mean doesn't handle this either; I think it should (i.e. convert columns that were originally datelike back to a datelike type):

In [11]: data.mean()
Out[11]: 
f1          0.528609
f2          0.583264
td    5975598.700000
dtype: float64

Compare to data.mean(1) which ignores date columns (correctly imo as you're going across dtypes).

@cancan101
Contributor

Interestingly, mean works on Series but not DataFrame (this is using the new timedelta formatting code #5701):

In [10]: pd.to_timedelta(list(range(5)), unit='D')
Out[10]: 
0   0 days
1   1 days
2   2 days
3   3 days
4   4 days
dtype: timedelta64[ns]
In [9]: pd.to_timedelta(list(range(5)), unit='D').mean()
Out[9]: 
0   2 days
dtype: timedelta64[ns]
In [6]: pd.DataFrame(pd.to_timedelta(list(range(5)), unit='D')).mean()
Out[6]: 
0   55785 days, 14:53:21.659060
dtype: timedelta64[ns]
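(In other words, the Series reduction is correct while the DataFrame reduction goes through a path that mangles timedeltas; selecting the column as a Series sidesteps the problem. A minimal sketch:)

```python
import pandas as pd

df = pd.DataFrame({'td': pd.to_timedelta(list(range(5)), unit='D')})

# Reducing the column as a Series gives the expected 2-day mean,
# whereas (in the affected versions) df.mean() produced a nonsense
# value for timedelta columns.
series_mean = df['td'].mean()
```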

@jreback
Contributor

jreback commented Dec 18, 2013

@hayd actually the results of data.mean() are correct. The result for td is in nanoseconds. It IS, however, possible to return an object array that is correct, e.g.:

In [34]: Series([0.1,0.2,timedelta(10)])
Out[34]: 
0                 0.1
1                 0.2
2    10 days, 0:00:00
dtype: object

so @hayd maybe create a separate issue for this type of inference (it's just a bit of inference detection in nanops.py/_reduce)

@jreback
Contributor

jreback commented Dec 18, 2013

@hsharrison as far as the groupby goes: this is just not implemented ATM in groupby.py. It's not that difficult; it basically just needs to follow what the datetime64 handling does.

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Feb 18, 2014
@danielballan
Contributor

See #6884.

@jreback jreback modified the milestones: 0.14.0, 0.15.0 Apr 16, 2014
@jreback jreback modified the milestones: 0.14.1, 0.14.0 May 5, 2014
@jreback jreback modified the milestones: 0.15.0, 0.14.1, 0.15.1 Jul 1, 2014
@jreback jreback modified the milestones: 0.15.1, 0.15.0 Sep 9, 2014
@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@tomfitzhenry

Workaround: use .describe() (which includes the mean) rather than .mean().

See https://gist.github.com/tomfitzhenry/d36ebba697a1f3eeefcb for demo.

@danielballan
Contributor

Wow. Any insight into why that works? I would not have expected a convenience method to take a different code path.

@zaxliu

zaxliu commented Nov 3, 2015

I also have the same problem. @tomfitzhenry's solution works.

@kjam

kjam commented Feb 3, 2016

Seeing as this keeps getting kicked to the next release, you can also perform sum and count. Referencing @hsharrison's original dataframe from the bug report:

data = pd.DataFrame(np.random.rand(10, 3), columns=['f1', 'f2', 'td'])
data['td'] *= 10000000
data['td'] = pd.Series(data['td'], dtype='<m8[ns]')
grouped_df = data.groupby(data.index < 5)
mean_df = grouped_df.mean()

In [14]: mean_df
Out[14]: 
             f1        f2
False  0.271488  0.614299
True   0.535522  0.476918

mean_df['td'] = grouped_df['td'].sum() / grouped_df['td'].count()

In [16]: mean_df
Out[16]: 
             f1        f2              td
False  0.271488  0.614299 00:00:00.004187
True   0.535522  0.476918 00:00:00.003278
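(The same sum/count trick can be written as a single .agg call; a small sketch with illustrative names, not the thread's data:)

```python
import pandas as pd

df = pd.DataFrame({
    'g': [True, True, False, False],
    'td': pd.to_timedelta([2, 4, 6, 8], unit='h'),
})

# sum() on a timedelta column returns a Timedelta, and dividing a
# Timedelta by the integer count yields the mean as a Timedelta.
mean_td = df.groupby('g')['td'].agg(lambda s: s.sum() / s.count())
```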

@jreback jreback modified the milestones: 0.18.1, Next Major Release Feb 24, 2016
@jreback jreback modified the milestones: 0.18.1, 0.18.2 Apr 25, 2016
@jreback jreback modified the milestones: 0.18.2, 0.19.0 Jul 6, 2016
jreback added a commit to jreback/pandas that referenced this issue Jan 4, 2017
BUG: allow timedelta64 to work in groupby with numeric_only=False

closes pandas-dev#5724
AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this issue Mar 21, 2017
BUG: allow timedelta64 to work in groupby with numeric_only=False
closes pandas-dev#5724

Author: Jeff Reback <jeff@reback.net>

Closes pandas-dev#15054 from jreback/groupby_arg and squashes the following commits:

768fce1 [Jeff Reback] BUG: make sure that we are passing thru kwargs to groupby BUG: allow timedelta64 to work in groupby with numeric_only=False
@sylvanosullivan

This is still an issue in version 0.24.2 with both datetime64- and timedelta64-type columns. If I should truncate any part of the below very long post, please let me know.
Calling np.mean via apply on the timedelta column works, but calling mean through the aggregation function fails:

In [11]: pd.__version__
Out[11]: '0.24.2'

In [12]: np.__version__
Out[12]: '1.16.2'

In [1]: import numpy as np
   ...: import pandas as pd
   ...: import datetime

In [2]: start_dates = pd.to_datetime([datetime.datetime(2019,1,i) for i in range(1,5)])

In [3]: end_dates = pd.to_datetime([datetime.datetime(2019,1,i) for i in np.random.randint(4,17,4)])

In [4]: df = pd.DataFrame({'start': start_dates,'end':end_dates,'ID':[1,1,2,2]})

In [5]: df['delta'] = df['end'] - df['start']

In [6]: df                                                                                             
Out[6]: 
       start        end  ID   delta
0 2019-01-01 2019-01-15   1 14 days
1 2019-01-02 2019-01-04   1  2 days
2 2019-01-03 2019-01-04   2  1 days
3 2019-01-04 2019-01-06   2  2 days

In [7]: df.groupby('ID').agg({'delta':'mean'}) 
   ...:                                                                                                
---------------------------------------------------------------------------
DataError                                 Traceback (most recent call last)
<ipython-input-7-4b906de9b470> in <module>
----> 1 df.groupby('ID').agg({'delta':'mean'})

/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/generic.py in aggregate(self, arg, *args, **kwargs)
   1313     @Appender(_shared_docs['aggregate'])
   1314     def aggregate(self, arg, *args, **kwargs):
-> 1315         return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)
   1316 
   1317     agg = aggregate

/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/generic.py in aggregate(self, arg, *args, **kwargs)
    184 
    185         _level = kwargs.pop('_level', None)
--> 186         result, how = self._aggregate(arg, _level=_level, *args, **kwargs)
    187         if how is None:
    188             return result

/anaconda3/lib/python3.7/site-packages/pandas/core/base.py in _aggregate(self, arg, *args, **kwargs)
    496 
    497                 try:
--> 498                     result = _agg(arg, _agg_1dim)
    499                 except SpecificationError:
    500 

/anaconda3/lib/python3.7/site-packages/pandas/core/base.py in _agg(arg, func)
    447                 result = compat.OrderedDict()
    448                 for fname, agg_how in compat.iteritems(arg):
--> 449                     result[fname] = func(fname, agg_how)
    450                 return result
    451 

/anaconda3/lib/python3.7/site-packages/pandas/core/base.py in _agg_1dim(name, how, subset)
    430                     raise SpecificationError("nested dictionary is ambiguous "
    431                                              "in aggregation")
--> 432                 return colg.aggregate(how, _level=(_level or 0) + 1)
    433 
    434             def _agg_2dim(name, how):

/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/generic.py in aggregate(self, func_or_funcs, *args, **kwargs)
    758         _level = kwargs.pop('_level', None)
    759         if isinstance(func_or_funcs, compat.string_types):
--> 760             return getattr(self, func_or_funcs)(*args, **kwargs)
    761 
    762         if isinstance(func_or_funcs, compat.Iterable):

/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in mean(self, *args, **kwargs)
   1130         nv.validate_groupby_func('mean', args, kwargs, ['numeric_only'])
   1131         try:
-> 1132             return self._cython_agg_general('mean', **kwargs)
   1133         except GroupByError:
   1134             raise

/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in _cython_agg_general(self, how, alt, numeric_only, min_count)
    836 
    837         if len(output) == 0:
--> 838             raise DataError('No numeric types to aggregate')
    839 
    840         return self._wrap_aggregated_output(output, names)

DataError: No numeric types to aggregate

And with datetime64 series:

In [8]: df.groupby('ID').agg({'start':'mean'})
---------------------------------------------------------------------------
DataError                                 Traceback (most recent call last)
<ipython-input-8-81a3957f4666> in <module>
----> 1 df.groupby('ID').agg({'start':'mean'})

[... identical traceback frames as above ...]

DataError: No numeric types to aggregate

Dtypes:

In [14]: df.dtypes                                                                                     
Out[14]: 
start     datetime64[ns]
end       datetime64[ns]
ID                 int64
delta    timedelta64[ns]
dtype: object

As mentioned above, chaining with .apply and np.mean() works for timedelta only:

In [17]: df.groupby('ID')['delta'].apply(np.mean)                                                      
Out[17]: 
ID
1   8 days 00:00:00
2   1 days 12:00:00
Name: delta, dtype: timedelta64[ns]

This same method fails with my 'start' and 'end' columns, with the error:

In [simulated]: df.groupby('ID')['start'].apply(np.mean)
Out[truncated]: DatetimeIndex cannot perform the operation mean

(The error trace for the above call is extremely long, therefore omitted.)

@JCZuurmond

@sylvanosullivan You should be able to do the aggregate by setting numeric_only=False in the group by.

df = pd.DataFrame({
    'td': pd.Series([pd.Timedelta(days=i) for i in range(5)]),
    'group': ['a', 'a', 'a', 'b', 'b']
})

(
    df
    .groupby('group')
    .mean(numeric_only=False)
)

@jung-benjamin

I still had some issues with this today.
Setting numeric_only=False works for groupby().mean(), but not for .std().
It raises an error:

TypeError: std() got an unexpected keyword argument 'numeric_only'

This error occurs only with DataFrame.groupby().std, not with DataFrame.std.

If I don't try setting numeric_only=False, groupby().std() ignores columns with timedeltas.
Is there a purpose to this behaviour?

Using .describe() still outputs a standard deviation, which can be used as a workaround.
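(The nanosecond-integer trick from earlier in this thread also covers std: compute the standard deviation on int64 nanoseconds and convert back. A sketch under illustrative names, not an official API:)

```python
import pandas as pd

df = pd.DataFrame({
    'group': ['a', 'a', 'a', 'b', 'b'],
    'td': pd.to_timedelta([1, 2, 3, 4, 6], unit='D'),
})

# Cast to integer nanoseconds, take the (sample) std per group,
# round to whole nanoseconds, and convert back to timedelta64[ns].
std_ns = df.groupby('group')['td'].agg(lambda s: s.astype('int64').std())
std_td = pd.to_timedelta(std_ns.round().astype('int64'), unit='ns')
```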
