
Adding resample functionality to CFTimeIndex #2191

Closed
spencerahill opened this issue May 28, 2018 · 22 comments

@spencerahill (Contributor) commented May 28, 2018

Now that CFTimeIndex has been implemented (#1252), one thing that remains to implement is resampling. @shoyer provided a sketch of how to implement it: #1252 (comment). In the interim, @spencerkclark provided a sketch of a workaround for some use-cases using groupby: #1270 (comment).

I thought it would be useful to have a new Issue specifically on this topic from which future conversation can continue. @shoyer, does that sketch you provided still seem like a good starting point?

@shoyer (Member) commented May 28, 2018

Yes, I think so. The main thing we need is a function to map from datetime -> datetime at start of frequency.

@naomi-henderson commented Jun 5, 2018

I am trying to combine the monthly CMIP5 rcp85 ts datasets (which go past 2064 AD) with their myriad calendars, so I love the new CFTimeIndex! But I need resample(time='MS') in order to force them all to start on the first of each month.
Thanks!

@spencerkclark (Member) commented Jun 5, 2018

@naomi-henderson thanks! In the meantime here's a possible workaround, in case you haven't figured one out already:

import numpy as np
import xarray as xr

from cftime import num2date, DatetimeNoLeap


times = num2date(np.arange(730), calendar='noleap', units='days since 0001-01-01')
da = xr.DataArray(np.arange(730), coords=[times], dims=['time'])

month_start = [DatetimeNoLeap(date.dt.year, date.dt.month, 1) for date in da.time]
da['MS'] = xr.DataArray(month_start, coords=da.time.coords)
resampled = da.groupby('MS').mean('time').rename({'MS': 'time'})

@naomi-henderson commented Jun 5, 2018

@spencerkclark thanks! I hadn't figured out that particular workaround, but it works, albeit quite slowly. For now it will get me to the next step, but just changing to first-of-the-month takes longer than regridding all the models to a common grid!

@spencerkclark (Member) commented Jun 6, 2018

Indeed what I had above is quite slow!

In [6]: %%timeit
   ...: month_start = [DatetimeNoLeap(date.dt.year, date.dt.month, 1) for date in da.time]
   ...:
1 loop, best of 3: 588 ms per loop

Iterating over the contents of da.time generates DataArray instances encapsulating single dates. We can iterate over the dates themselves directly, which is much (over 1000x) faster:

In [7]: %%timeit
   ...: month_start = [DatetimeNoLeap(date.year, date.month, 1) for date in da.time.values]
   ...:
1000 loops, best of 3: 302 µs per loop

@naomi-henderson commented Jun 6, 2018

Yes, when open_mfdataset decides to convert to CFTime, this is much faster. But when time is in datetime64, I get:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-72-a96fa0263d3e> in <module>()
      9     dss = xr.open_mfdataset(files,decode_times=True,autoclose=True)
     10     #month_start = [DatetimeNoLeap(date.dt.year, date.dt.month, 1) for date in dss.time]
---> 11     month_start = [DatetimeNoLeap(date.year, date.month, 1) for date in dss.time.values]
     12     #month_start = [DatetimeNoLeap(yr, mon, 1) for yr,mon in zip(dss.time.dt.year,dss.time.dt.month)]
     13     #break

<ipython-input-72-a96fa0263d3e> in <listcomp>(.0)
      9     dss = xr.open_mfdataset(files,decode_times=True,autoclose=True)
     10     #month_start = [DatetimeNoLeap(date.dt.year, date.dt.month, 1) for date in dss.time]
---> 11     month_start = [DatetimeNoLeap(date.year, date.month, 1) for date in dss.time.values]
     12     #month_start = [DatetimeNoLeap(yr, mon, 1) for yr,mon in zip(dss.time.dt.year,dss.time.dt.month)]
     13     #break

AttributeError: 'numpy.datetime64' object has no attribute 'year'

You can see I made a feeble attempt to fix it to work for all the CMIP5 calendars, but it is just as slow. Any suggestions?

@spencerkclark (Member) commented Jun 6, 2018

When the time coordinate contains np.datetime64 objects I recommend using resample directly, because the underlying index will be a pandas DatetimeIndex (so you just need some logic to detect if that's the case).

I think the most general workaround for right now would probably look something like the example below. This has the property that it preserves the underlying calendar type of the time index.

import pandas as pd
import xarray as xr

def resample_ms_freq(ds, dim='time'):
    """Resample the dataset to 'MS' frequency regardless of the
    calendar used.
    
    Parameters
    ----------
    ds : Dataset
        Dataset to be resampled
    dim : str
        Dimension name associated with the time index
        
    Returns
    -------
    Dataset
    """
    index = ds.indexes[dim]
    if isinstance(index, pd.DatetimeIndex):
        return ds.resample(**{dim: 'MS'}).mean(dim)
    elif isinstance(index, xr.CFTimeIndex):
        date_type = index.date_type
        month_start = [date_type(date.year, date.month, 1) for date in ds[dim].values]
        ms = xr.DataArray(month_start, coords=ds[dim].coords)
        ds = ds.assign_coords(MS=ms)
        return ds.groupby('MS').mean(dim).rename({'MS': dim})
    else:
        raise TypeError(
            'Resampling to month start frequency requires using a time index of either '
            'type pd.DatetimeIndex or xr.CFTimeIndex.')

with xr.set_options(enable_cftimeindex=True):
    ds = xr.open_mfdataset(files)
resampled = resample_ms_freq(ds)

@aidanheerdegen (Contributor) commented Jun 22, 2018

I'm not sure if my issue belongs in here, but I didn't want to create a new Issue (there are already 455 open ones).

I am experimenting with the new CFTimeIndex functionality (thanks heaps BTW! That was a mammoth effort if the PR thread is anything to go by).

I am trying to shift a time index as I need to align datasets to a common start point. So using the example code above,

da.time.get_index('time').shift(1,'D')
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-71-db48b2fbb340> in <module>()
----> 1 da.time.get_index('time').shift(1,'D')

/g/data3/hh5/public/apps/miniconda3/envs/analysis27-18.04/lib/python2.7/site-packages/pandas/core/indexes/base.pyc in shift(self, periods, freq)
   2627         """
   2628         raise NotImplementedError("Not supported for type %s" %
-> 2629                                   type(self).__name__)
   2630 
   2631     def argsort(self, *args, **kwargs):

NotImplementedError: Not supported for type CFTimeIndex

Is this not implemented because it might require resampling?

I ask because this works:

times[0] + pd.Timedelta('365 days')
cftime.DatetimeNoLeap(2, 1, 1, 0, 0, 0, 0, -1, 1)

I guess I am asking: if I want to shift a time index, is the best (only?) way currently to loop over all the individual elements of the index and add a time offset to each?

@shoyer (Member) commented Jun 22, 2018

@aidanheerdegen (Contributor) commented Jun 22, 2018

Does this need its own issue then, so it doesn't get lost?

@shoyer (Member) commented Jun 22, 2018

@huard (Contributor) commented Oct 1, 2018

I'm trying to wrap my head around what is needed to get the resample method to work, but I must say I'm confused. Would it be possible/practical to create a branch with stubs in the code for the methods that need to be written (with a #2191 comment), so newbies can help fill in the gaps?

@shoyer (Member) commented Oct 2, 2018

Take a look at #2458 for a very basic version of this.

@spencerkclark (Member) commented Oct 2, 2018

Thanks @shoyer for getting things started! @huard your help would be very much appreciated in implementing this. As mentioned in #2437 (comment), this is one of the biggest remaining gaps in functionality between xarray objects indexed by a CFTimeIndex and xarray objects indexed by a DatetimeIndex.

@spencerkclark (Member) commented Feb 3, 2019

This has been implemented in #2593 🎉.

@zzheng93 commented Feb 18, 2019

Hi folks,
I have some data like 2000-01-01 00:00:00, 2000-01-01 12:00:00, 2000-01-02 00:00:00, 2000-01-02 12:00:00, where the index is a CFTimeIndex, and I want to take the average within each date and save the results.
I am wondering if it is possible to resample them at a daily level (e.g., the results would be 2000-01-01 00:00:00 and 2000-01-02 00:00:00)?

@spencerkclark (Member) commented Feb 18, 2019

@zzheng93 this will be possible in the next release of xarray, so not quite yet, but soon. If you're in a hurry you could install the development version.

@zzheng93 commented Feb 18, 2019

> @zzheng93 this will be possible in the next release of xarray, so not quite yet, but soon. If you're in a hurry you could install the development version.

@spencerkclark Thank you very much :)
I am new to the xarray community. I am wondering if there are any instructions for installing the latest development version, and for how to use the daily resampling function.

@spencerkclark (Member) commented Feb 19, 2019

@zzheng93 welcome! One way to install the development version is to clone this repo, and do an editable install:

$ git clone https://github.com/pydata/xarray.git
$ cd xarray
$ pip install -e .

Then using resample with a daily frequency would look something like:

In [1]: import xarray as xr

In [2]: times = xr.cftime_range('2000', periods=4, freq='12H')

In [3]: times
Out[3]:
CFTimeIndex([2000-01-01 00:00:00, 2000-01-01 12:00:00, 2000-01-02 00:00:00,
             2000-01-02 12:00:00],
            dtype='object')

In [4]: da = xr.DataArray(range(4), [('time', times)])

In [5]: da.resample(time='D').mean()
Out[5]:
<xarray.DataArray (time: 2)>
array([0.5, 2.5])
Coordinates:
  * time     (time) object 2000-01-01 00:00:00 2000-01-02 00:00:00

@zzheng93 commented Feb 19, 2019

@spencerkclark Thank you very much for your help! I will install the development version on my local machine.
Currently I am using NCAR Cheyenne to manipulate the climate data. What I am doing on Cheyenne as a detour is:

ds = ds.assign_coords(time=ds.indexes['time'].to_datetimeindex())
ds = ds.resample(time="D").mean("time")

I hope NCAR will support the next release of xarray.
A follow-up question: when we use xarray to manipulate a large dataset such as <xarray.DataArray (time: 14600, lat: 192, lon: 288)> and want to save the results for further machine learning applications (e.g., with sklearn or XGBoost, or even deep learning), what would be a good format for storing the data on a server or local machine so that it can easily be used by sklearn or XGBoost?

@spencerkclark (Member) commented Feb 19, 2019

@zzheng93 sure thing!

> I hope NCAR will support the next release of xarray.

I know you didn't ask for help with this, but I can't resist :) -- I recommend you set up your own Python environment on Cheyenne. This is nice because it gives you full control over the packages you install (so you don't need to wait until someone else installs them for you). A good place to start on how to do this is the "Getting started with Pangeo on HPC" page on the Pangeo website.

> A follow-up question: when we use xarray to manipulate a large dataset such as <xarray.DataArray (time: 14600, lat: 192, lon: 288)> and want to save the results for further machine learning applications (e.g., with sklearn or XGBoost, or even deep learning), what would be a good format for storing the data on a server or local machine so that it can easily be used by sklearn or XGBoost?

I think with some more specific details regarding what you are looking to do, this could potentially be a good question to ask in the (relatively new) pangeo-data/ml-workflow-examples repo, where they are discussing machine learning workflows connected to xarray.
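As a purely illustrative aside (not a recommendation from this thread, and the names are mine): one common pattern is to stack the spatial dimensions into a single feature axis before handing the array to sklearn, and to persist the labeled data as netCDF (via to_netcdf) or Zarr so its structure survives round trips:

```python
import numpy as np
import xarray as xr

# Toy stand-in for a (time, lat, lon) DataArray; the shapes are illustrative.
da = xr.DataArray(np.random.rand(4, 3, 2), dims=['time', 'lat', 'lon'])

# Collapse lat/lon into one "space" axis: rows are samples, columns features.
X = da.stack(space=('lat', 'lon')).values  # shape (4, 6)
```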

@zzheng93 commented Feb 19, 2019

@spencerkclark
Very helpful!!! Thanks a million! :)
