
groupby.mean, etc, doesn't recognize timedelta64 #5724

Closed
hsharrison opened this issue Dec 17, 2013 · 12 comments

@hsharrison

See http://stackoverflow.com/questions/20625982/split-apply-combine-on-pandas-timedelta-column
related as well: http://stackoverflow.com/questions/20789976/python-pandas-dataframe-1st-line-issue-with-datetime-timedelta/20802902#20802902

I have a DataFrame with a column of timedeltas (upon inspection, the dtype is timedelta64[ns], i.e. '<m8[ns]'), and I'd like to do a split-apply-combine, but the timedelta column is being dropped:

import pandas as pd
import numpy as np

pd.__version__
Out[3]: '0.13.0rc1'
np.__version__
Out[4]: '1.8.0'

data = pd.DataFrame(np.random.rand(10, 3), columns=['f1', 'f2', 'td'])
data['td'] *= 10000000
data['td'] = pd.Series(data['td'], dtype='<m8[ns]')
data
Out[8]: 
         f1        f2              td
0  0.990140  0.948313 00:00:00.003066
1  0.277125  0.993549 00:00:00.001443
2  0.016427  0.581129 00:00:00.009257
3  0.048662  0.512215 00:00:00.000702
4  0.846301  0.179160 00:00:00.000396
5  0.568323  0.419887 00:00:00.000266
6  0.328182  0.919897 00:00:00.006138
7  0.292882  0.213219 00:00:00.008876
8  0.623332  0.003409 00:00:00.000322
9  0.650436  0.844180 00:00:00.006873

[10 rows x 3 columns]

data.groupby(data.index < 5).mean()
Out[9]: 
             f1        f2
False  0.492631  0.480118
True   0.435731  0.642873

[2 rows x 2 columns]

Or, forcing pandas to try the operation on the 'td' column:

data.groupby(data.index < 5)['td'].mean()
---------------------------------------------------------------------------
DataError                                 Traceback (most recent call last)
<ipython-input-12-88cc94e534b7> in <module>()
----> 1 data.groupby(data.index < 5)['td'].mean()

/path/to/lib/python3.3/site-packages/pandas-0.13.0rc1-py3.3-linux-x86_64.egg/pandas/core/groupby.py in mean(self)
    417         """
    418         try:
--> 419             return self._cython_agg_general('mean')
    420         except GroupByError:
    421             raise

/path/to/lib/python3.3/site-packages/pandas-0.13.0rc1-py3.3-linux-x86_64.egg/pandas/core/groupby.py in _cython_agg_general(self, how, numeric_only)
    669 
    670         if len(output) == 0:
--> 671             raise DataError('No numeric types to aggregate')
    672 
    673         return self._wrap_aggregated_output(output, names)

DataError: No numeric types to aggregate

However, taking the mean of the column works fine, so numeric operations should be possible:

data['td'].mean()
Out[11]: 
0   00:00:00.003734
dtype: timedelta64[ns]
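(Before this was fixed, a common workaround was to take the mean on integer nanoseconds and convert back. A minimal sketch, with illustrative column and group names that are not from the report:)

```python
import pandas as pd

# Hypothetical example data; 'group' and 'td' are illustrative names.
df = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b'],
    'td': pd.to_timedelta([1, 3, 2, 4], unit='D'),
})

# Workaround: cast the timedeltas to integer nanoseconds, take the mean,
# then convert the result back to timedelta64[ns].
mean_ns = df.groupby('group')['td'].apply(lambda s: s.astype('int64').mean())
mean_td = pd.to_timedelta(mean_ns, unit='ns')
```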
@hayd
Contributor

hayd commented Dec 18, 2013

Worth mentioning that DataFrame's mean doesn't handle this either; I think it should (i.e. convert columns that were originally datelike back to a datelike type):

In [11]: data.mean()
Out[11]: 
f1          0.528609
f2          0.583264
td    5975598.700000
dtype: float64

Compare to data.mean(1) which ignores date columns (correctly imo as you're going across dtypes).

@cancan101
Contributor

Interestingly, mean works on Series but not DataFrame (this is using the new timedelta formatting code #5701):

In [10]: pd.to_timedelta(list(range(5)), unit='D')
Out[10]: 
0   0 days
1   1 days
2   2 days
3   3 days
4   4 days
dtype: timedelta64[ns]
In [9]: pd.to_timedelta(list(range(5)), unit='D').mean()
Out[9]: 
0   2 days
dtype: timedelta64[ns]
In [6]: pd.DataFrame(pd.to_timedelta(list(range(5)), unit='D')).mean()
Out[6]: 
0   55785 days, 14:53:21.659060
dtype: timedelta64[ns]
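(In other words, the Series reduction is correct while the DataFrame reduction goes through a path that mangles timedeltas; selecting the column as a Series sidesteps the problem. A minimal sketch:)

```python
import pandas as pd

df = pd.DataFrame({'td': pd.to_timedelta(list(range(5)), unit='D')})

# Reducing the column as a Series gives the expected 2-day mean,
# whereas (in the affected versions) df.mean() produced a nonsense
# value for timedelta columns.
series_mean = df['td'].mean()
```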

@jreback
Contributor

jreback commented Dec 18, 2013

@hayd actually the results of data.mean() are correct. The result for td is in nanoseconds. It IS, however, possible to return an object array that is correct, e.g.:

In [34]: Series([0.1,0.2,timedelta(10)])
Out[34]: 
0                 0.1
1                 0.2
2    10 days, 0:00:00
dtype: object

so @hayd maybe create a separate issue for this type of inference (it's just a bit of inference detection in nanops.py/_reduce)

@jreback
Contributor

jreback commented Dec 18, 2013

@hsharrison as far as the groupby goes: this is just not implemented ATM in groupby.py. It's not that difficult; it basically just needs to follow what the datetime64 handling does.

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Feb 18, 2014
@danielballan
Contributor

See #6884.

@jreback jreback modified the milestones: 0.14.0, 0.15.0 Apr 16, 2014
@jreback jreback modified the milestones: 0.14.1, 0.14.0 May 5, 2014
@jreback jreback modified the milestones: 0.15.0, 0.14.1, 0.15.1 Jul 1, 2014
@jreback jreback modified the milestones: 0.15.1, 0.15.0 Sep 9, 2014
@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@tomfitzhenry

Workaround: use .describe() (which includes the mean) rather than .mean().

See https://gist.github.com/tomfitzhenry/d36ebba697a1f3eeefcb for demo.

@danielballan
Contributor

Wow. Any insight into why that works? I would not have expected a convenience method to take a different code path.

@zaxliu

zaxliu commented Nov 3, 2015

I also have the same problem. @tomfitzhenry's solution works.

@kjam

kjam commented Feb 3, 2016

Seeing as this keeps getting kicked to the next release, you can also perform sum and count. Referencing @hsharrison's original dataframe from the bug report:

data = pd.DataFrame(np.random.rand(10, 3), columns=['f1', 'f2', 'td'])
data['td'] *= 10000000
data['td'] = pd.Series(data['td'], dtype='<m8[ns]')
grouped_df = data.groupby(data.index < 5)
mean_df = grouped_df.mean()

In [14]: mean_df
Out[14]: 
             f1        f2
False  0.271488  0.614299
True   0.535522  0.476918

mean_df['td'] = grouped_df['td'].sum() / grouped_df['td'].count()

In [16]: mean_df
Out[16]: 
             f1        f2              td
False  0.271488  0.614299 00:00:00.004187
True   0.535522  0.476918 00:00:00.003278
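(The same sum/count trick can be written as a single .agg call; a small sketch with illustrative names, not the thread's data:)

```python
import pandas as pd

df = pd.DataFrame({
    'g': [True, True, False, False],
    'td': pd.to_timedelta([2, 4, 6, 8], unit='h'),
})

# sum() on a timedelta column returns a Timedelta, and dividing a
# Timedelta by the integer count yields the mean as a Timedelta.
mean_td = df.groupby('g')['td'].agg(lambda s: s.sum() / s.count())
```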

@jreback jreback modified the milestones: 0.18.1, Next Major Release Feb 24, 2016
@jreback jreback modified the milestones: 0.18.1, 0.18.2 Apr 25, 2016
@jreback jreback modified the milestones: 0.18.2, 0.19.0 Jul 6, 2016
jreback added a commit to jreback/pandas that referenced this issue Jan 4, 2017
BUG: allow timedelta64 to work in groupby with numeric_only=False

closes pandas-dev#5724
AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this issue Mar 21, 2017
BUG: allow timedelta64 to work in groupby with numeric_only=False
closes pandas-dev#5724

Author: Jeff Reback <jeff@reback.net>

Closes pandas-dev#15054 from jreback/groupby_arg and squashes the following commits:

768fce1 [Jeff Reback] BUG: make sure that we are passing thru kwargs to groupby BUG: allow timedelta64 to work in groupby with numeric_only=False
@sylvanosullivan

This is still an issue in version 0.24.2 with both datetime64- and timedelta64-type columns. If I should truncate any part of the below very long post, please let me know.
Calling np.mean via apply on the timedelta column works, but calling mean through the aggregation function fails:

In [11]: pd.__version__
Out[11]: '0.24.2'

In [12]: np.__version__
Out[12]: '1.16.2'

In [1]: import numpy as np
   ...: import pandas as pd
   ...: import datetime

In [2]: start_dates = pd.to_datetime([datetime.datetime(2019,1,i) for i in range(1,5)])

In [3]: end_dates = pd.to_datetime([datetime.datetime(2019,1,i) for i in np.random.randint(4,17,4)])

In [4]: df = pd.DataFrame({'start': start_dates,'end':end_dates,'ID':[1,1,2,2]})

In [5]: df['delta'] = df['end'] - df['start']

In [6]: df                                                                                             
Out[6]: 
       start        end  ID   delta
0 2019-01-01 2019-01-15   1 14 days
1 2019-01-02 2019-01-04   1  2 days
2 2019-01-03 2019-01-04   2  1 days
3 2019-01-04 2019-01-06   2  2 days

In [7]: df.groupby('ID').agg({'delta':'mean'}) 
   ...:                                                                                                
---------------------------------------------------------------------------
DataError                                 Traceback (most recent call last)
<ipython-input-7-4b906de9b470> in <module>
----> 1 df.groupby('ID').agg({'delta':'mean'})

/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/generic.py in aggregate(self, arg, *args, **kwargs)
   1313     @Appender(_shared_docs['aggregate'])
   1314     def aggregate(self, arg, *args, **kwargs):
-> 1315         return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)
   1316 
   1317     agg = aggregate

/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/generic.py in aggregate(self, arg, *args, **kwargs)
    184 
    185         _level = kwargs.pop('_level', None)
--> 186         result, how = self._aggregate(arg, _level=_level, *args, **kwargs)
    187         if how is None:
    188             return result

/anaconda3/lib/python3.7/site-packages/pandas/core/base.py in _aggregate(self, arg, *args, **kwargs)
    496 
    497                 try:
--> 498                     result = _agg(arg, _agg_1dim)
    499                 except SpecificationError:
    500 

/anaconda3/lib/python3.7/site-packages/pandas/core/base.py in _agg(arg, func)
    447                 result = compat.OrderedDict()
    448                 for fname, agg_how in compat.iteritems(arg):
--> 449                     result[fname] = func(fname, agg_how)
    450                 return result
    451 

/anaconda3/lib/python3.7/site-packages/pandas/core/base.py in _agg_1dim(name, how, subset)
    430                     raise SpecificationError("nested dictionary is ambiguous "
    431                                              "in aggregation")
--> 432                 return colg.aggregate(how, _level=(_level or 0) + 1)
    433 
    434             def _agg_2dim(name, how):

/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/generic.py in aggregate(self, func_or_funcs, *args, **kwargs)
    758         _level = kwargs.pop('_level', None)
    759         if isinstance(func_or_funcs, compat.string_types):
--> 760             return getattr(self, func_or_funcs)(*args, **kwargs)
    761 
    762         if isinstance(func_or_funcs, compat.Iterable):

/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in mean(self, *args, **kwargs)
   1130         nv.validate_groupby_func('mean', args, kwargs, ['numeric_only'])
   1131         try:
-> 1132             return self._cython_agg_general('mean', **kwargs)
   1133         except GroupByError:
   1134             raise

/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in _cython_agg_general(self, how, alt, numeric_only, min_count)
    836 
    837         if len(output) == 0:
--> 838             raise DataError('No numeric types to aggregate')
    839 
    840         return self._wrap_aggregated_output(output, names)

DataError: No numeric types to aggregate

And with datetime64 series:

In [8]: df.groupby('ID').agg({'start':'mean'})
---------------------------------------------------------------------------
DataError                                 Traceback (most recent call last)
<ipython-input-8-81a3957f4666> in <module>
----> 1 df.groupby('ID').agg({'start':'mean'})

[... identical traceback frames as above ...]

DataError: No numeric types to aggregate

Dtypes:

In [14]: df.dtypes                                                                                     
Out[14]: 
start     datetime64[ns]
end       datetime64[ns]
ID                 int64
delta    timedelta64[ns]
dtype: object

As mentioned above, chaining with .apply and np.mean() works for timedelta only:

In [17]: df.groupby('ID')['delta'].apply(np.mean)                                                      
Out[17]: 
ID
1   8 days 00:00:00
2   1 days 12:00:00
Name: delta, dtype: timedelta64[ns]

This same method fails with my 'start' and 'end' columns, with the error:

In [simulated]: df.groupby('ID')['start'].apply(np.mean)
Out[truncated]: DatetimeIndex cannot perform the operation mean

(The error trace for the above call is extremely long, therefore omitted.)

@JCZuurmond

@sylvanosullivan You should be able to do the aggregate by setting numeric_only=False in the group by.

df = pd.DataFrame({
    'td': pd.Series([pd.Timedelta(days=i) for i in range(5)]),
    'group': ['a', 'a', 'a', 'b', 'b']
})

(
    df
    .groupby('group')
    .mean(numeric_only=False)
)

@jung-benjamin

I still had some issues with this today.
Setting numeric_only=False works for groupby().mean(), but not for .std().
It raises an error:

TypeError: std() got an unexpected keyword argument 'numeric_only'

This error occurs only with DataFrame.groupby().std, not with DataFrame.std.

If I don't try setting numeric_only=False, groupby().std() ignores columns with timedeltas.
Is there a purpose to this behaviour?

Using .describe() still outputs a standard deviation, which can be used as a workaround.
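(The nanosecond-integer trick from earlier in this thread also covers std: compute the standard deviation on int64 nanoseconds and convert back. A sketch under illustrative names, not an official API:)

```python
import pandas as pd

df = pd.DataFrame({
    'group': ['a', 'a', 'a', 'b', 'b'],
    'td': pd.to_timedelta([1, 2, 3, 4, 6], unit='D'),
})

# Cast to integer nanoseconds, take the (sample) std per group,
# round to whole nanoseconds, and convert back to timedelta64[ns].
std_ns = df.groupby('group')['td'].agg(lambda s: s.astype('int64').std())
std_td = pd.to_timedelta(std_ns.round().astype('int64'), unit='ns')
```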
