Resample not working when time coordinate is timezone aware #1490

Open
benoit-fuentes opened this issue Jul 26, 2017 · 4 comments
Comments

benoit-fuentes commented Jul 26, 2017

Hi all,
here is the code to reproduce the bug:

import numpy as np
import pandas as pd
import xarray as xr

time1 = pd.date_range('2000-01-01', freq='H', periods=365 * 24)  # timezone naïve
time2 = pd.date_range('2000-01-01', freq='H', periods=365 * 24, tz='UTC')  # timezone aware
ds1 = xr.Dataset({'foo': ('time', np.arange(365 * 24)), 'time': time1})
ds2 = xr.Dataset({'foo': ('time', np.arange(365 * 24)), 'time': time2})
ds1.resample('3H', 'time', how='mean')  # works fine
ds2.resample('3H', 'time', how='mean')  # raises an error

This last line returns the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-17-0de4b0d703bd> in <module>()
      4 ds2 = xr.Dataset({'foo': ('time', np.arange(365 * 24)), 'time': time2})
      5 ds1.resample('3H', 'time', how='mean')
----> 6 ds.resample('3H', 'time', how='mean')

~/.virtualenvs/planck3/lib/python3.5/site-packages/xarray/core/common.py in resample(self, freq, dim, how, skipna, closed, label, base, keep_attrs)
    546         time_grouper = pd.TimeGrouper(freq=freq, how=how, closed=closed,
    547                                       label=label, base=base)
--> 548         gb = self._groupby_cls(self, group, grouper=time_grouper)
    549         if isinstance(how, basestring):
    550             f = getattr(gb, how)

~/.virtualenvs/planck3/lib/python3.5/site-packages/xarray/core/groupby.py in __init__(self, obj, group, squeeze, grouper, bins, cut_kwargs)
    243                 raise ValueError('index must be monotonic for resampling')
    244             s = pd.Series(np.arange(index.size), index)
--> 245             first_items = s.groupby(grouper).first()
    246             if first_items.isnull().any():
    247                 full_index = first_items.index

~/.virtualenvs/planck3/lib/python3.5/site-packages/pandas/core/generic.py in groupby(self, by, axis, level, as_index, sort, group_keys, squeeze, **kwargs)
   4414         return groupby(self, by=by, axis=axis, level=level, as_index=as_index,
   4415                        sort=sort, group_keys=group_keys, squeeze=squeeze,
-> 4416                        **kwargs)
   4417 
   4418     def asfreq(self, freq, method=None, how=None, normalize=False,

~/.virtualenvs/planck3/lib/python3.5/site-packages/pandas/core/groupby.py in groupby(obj, by, **kwds)
   1697         raise TypeError('invalid type: %s' % type(obj))
   1698 
-> 1699     return klass(obj, by, **kwds)
   1700 
   1701 

~/.virtualenvs/planck3/lib/python3.5/site-packages/pandas/core/groupby.py in __init__(self, obj, keys, axis, level, grouper, exclusions, selection, as_index, sort, group_keys, squeeze, **kwargs)
    390                                                     level=level,
    391                                                     sort=sort,
--> 392                                                     mutated=self.mutated)
    393 
    394         self.obj = obj

~/.virtualenvs/planck3/lib/python3.5/site-packages/pandas/core/groupby.py in _get_grouper(obj, key, axis, level, sort, mutated)
   2605     # a passed-in Grouper, directly convert
   2606     if isinstance(key, Grouper):
-> 2607         binner, grouper, obj = key._get_grouper(obj)
   2608         if key.key is None:
   2609             return grouper, [], obj

~/.virtualenvs/planck3/lib/python3.5/site-packages/pandas/core/resample.py in _get_grouper(self, obj)
   1093     def _get_grouper(self, obj):
   1094         # create the resampler and return our binner
-> 1095         r = self._get_resampler(obj)
   1096         r._set_binner()
   1097         return r.binner, r.grouper, r.obj

~/.virtualenvs/planck3/lib/python3.5/site-packages/pandas/core/resample.py in _get_resampler(self, obj, kind)
   1089         raise TypeError("Only valid with DatetimeIndex, "
   1090                         "TimedeltaIndex or PeriodIndex, "
-> 1091                         "but got an instance of %r" % type(ax).__name__)
   1092 
   1093     def _get_grouper(self, obj):

TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Index'

My config:
xarray==0.9.6
pandas==0.20.3
numpy==1.13.1
python-dateutil==2.6.1
six==1.10.0
pytz==2017.2

Tested on Python 2.7 and Python 3.5.2.

benoit-fuentes changed the title from "Resample not working when time coordonate is timezone aware" to "Resample not working when time coordinate is timezone aware" Jul 26, 2017
darothen commented

Did some digging.

Note here that the dtypes of time1 and time2 are different; the first is a datetime64[ns] but the second is a datetime64[ns, UTC]. For the sake of illustration, I'm going to change the timezone to EST. If we print time2, we get something that looks like this:

>>> time2
DatetimeIndex(['2000-01-01 00:00:00-05:00', '2000-01-01 01:00:00-05:00',
               '2000-01-01 02:00:00-05:00', '2000-01-01 03:00:00-05:00',
               '2000-01-01 04:00:00-05:00', '2000-01-01 05:00:00-05:00',
               '2000-01-01 06:00:00-05:00', '2000-01-01 07:00:00-05:00',
               '2000-01-01 08:00:00-05:00', '2000-01-01 09:00:00-05:00',
               ...
               '2000-12-30 14:00:00-05:00', '2000-12-30 15:00:00-05:00',
               '2000-12-30 16:00:00-05:00', '2000-12-30 17:00:00-05:00',
               '2000-12-30 18:00:00-05:00', '2000-12-30 19:00:00-05:00',
               '2000-12-30 20:00:00-05:00', '2000-12-30 21:00:00-05:00',
               '2000-12-30 22:00:00-05:00', '2000-12-30 23:00:00-05:00'],
              dtype='datetime64[ns, EST]', length=8760, freq='H')

But, if we directly print its values, we get something slightly different:

>>> time2.values
array(['2000-01-01T05:00:00.000000000', '2000-01-01T06:00:00.000000000',
       '2000-01-01T07:00:00.000000000', ...,
       '2000-12-31T02:00:00.000000000', '2000-12-31T03:00:00.000000000',
       '2000-12-31T04:00:00.000000000'], dtype='datetime64[ns]')

The difference is that the timezone offset has been applied automatically: each value in time2 is stored as its UTC equivalent, five hours ahead of the EST wall time. This brings up something to note: if you construct your Dataset using time1.values and time2.values, there is no problem:

import numpy as np
import pandas as pd
import xarray as xr

time1 = pd.date_range('2000-01-01', freq='H', periods=365 * 24)  # timezone naïve
time2 = pd.date_range('2000-01-01', freq='H', periods=365 * 24, tz='UTC')  # timezone aware
ds1 = xr.Dataset({'foo': ('time', np.arange(365 * 24)), 'time': time1.values})
ds2 = xr.Dataset({'foo': ('time', np.arange(365 * 24)), 'time': time2.values})
ds1.resample('3H', 'time', how='mean')  # works fine
ds2.resample('3H', 'time', how='mean')  # works fine

Both time1 and time2 are instances of pd.DatetimeIndex, which is a subclass of pd.Index. When xarray turns them into Variables, it ultimately uses a PandasIndexAdapter to decode the contents of time1 and time2, and this is where the trouble happens. The PandasIndexAdapter tries to safely cast the dtype of the array it is passed, which works just fine for time1. But NumPy does not recognize datetime dtypes that carry timezone information. That is, this will work:

>>> np.dtype('datetime64[ns]')
dtype('<M8[ns]')

But this won't:

>>> np.dtype('datetime64[ns, UTC]')
TypeError: Invalid datetime unit in metadata string "[ns, UTC]"

Moreover, the type of time2.dtype is a pandas.types.dtypes.DatetimeTZDtype, which NumPy doesn't know what to do with: it has no way to map that type onto its own datetime64.
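
For illustration, here is a minimal check of that mismatch. This is only a sketch; is_datetime64tz_dtype from pandas.api.types is used instead of an isinstance check because the module path of DatetimeTZDtype has moved between pandas versions, and the exact wording of the error varies by NumPy version:

import numpy as np
import pandas as pd
from pandas.api.types import is_datetime64tz_dtype

time2 = pd.date_range('2000-01-01', freq='H', periods=24, tz='UTC')

print(time2.dtype)                         # datetime64[ns, UTC], a pandas extension dtype
print(is_datetime64tz_dtype(time2.dtype))  # True
try:
    np.dtype(time2.dtype)                  # NumPy cannot interpret the tz-aware dtype
except TypeError as err:
    print('NumPy rejects it:', err)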

So the resulting Variable that defines the time coordinate on your ds2 holds an array with the correct values, but it is explicitly given the dtype object. When that array is later decoded for resampling, it fails with the TypeError shown above.
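
Following that reasoning, the symptom should be visible directly on the dataset from the original report. This is a sketch assuming the xarray 0.9.x behaviour described above; on versions that coerce tz-aware indexes differently, the printed dtype may differ:

import numpy as np
import pandas as pd
import xarray as xr

time2 = pd.date_range('2000-01-01', freq='H', periods=365 * 24, tz='UTC')
ds2 = xr.Dataset({'foo': ('time', np.arange(365 * 24)), 'time': time2})

# Per the analysis above, the tz-aware index ends up stored with dtype object,
# which is what later trips up the pandas resampling machinery.
print(ds2['time'].dtype)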

One solution would be to catch this potential glitch in either is_valid_numpy_dtype() or the PandasIndexAdapter constructor. Alternatively, we could eagerly coerce arrays with type pandas.types.dtypes.DatetimeTZDtype into numpy-compliant types at some earlier point.
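
Until something like that lands, a user-side workaround in the same spirit is to strip the timezone from the index before building the Dataset. This is only a sketch against the xarray 0.9.x resample API used above; tz_convert('UTC').tz_localize(None) keeps the UTC wall times, which matches what passing time2.values does:

import numpy as np
import pandas as pd
import xarray as xr

time2 = pd.date_range('2000-01-01', freq='H', periods=365 * 24, tz='UTC')

# Convert to UTC and drop the timezone so the coordinate becomes plain datetime64[ns].
naive_utc = time2.tz_convert('UTC').tz_localize(None)

ds2 = xr.Dataset({'foo': ('time', np.arange(365 * 24)), 'time': naive_utc})
ds2.resample('3H', 'time', how='mean')  # resamples without the TypeError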

shoyer commented Jul 26, 2017

NumPy doesn't support timezones, but pandas does. This puts things in a slightly tricky position for xarray.

We do manage to get things to work for pandas dtypes stored in indexes, in most cases. Given that our resampling behavior also relies on pandas, I think we should be able to get this to work, probably by tweaking our PandasIndexAdapter, as @darothen notes.

It's borderline whether this is a bug or a new feature, but it would certainly be nice to fix if possible, so I'm marking this as "Contributions welcome".

stale bot commented Jun 26, 2019

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity.

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically.

stale bot added the stale label Jun 26, 2019
stale bot closed this as completed Jul 26, 2019
jhamman reopened this Jul 26, 2019
stale bot removed the stale label Jul 26, 2019
stale bot commented Jul 2, 2021

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity.

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically.

stale bot added the stale label Jul 2, 2021
dcherian removed the stale label Jul 2, 2021
dcherian added this to To do in Explicit Indexes via automation Jul 13, 2021
dcherian moved this from To do to Would enable this in Explicit Indexes Jul 13, 2021
erialC-P added a commit to GeoscienceAustralia/dea-intertidal that referenced this issue Dec 23, 2022
Progress to end of 2022:  Trying to incorporate temporal filtering into exposure tidal monitoring. Prototype workflow used pandas. Current approach using xarray.  See the following issue for discussion of timezone aware datetimes in xarray
pydata/xarray#1490
Projects: Explicit Indexes (Would enable this)
Development: No branches or pull requests
5 participants