New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding resample functionality to CFTimeIndex #2191

Open
spencerahill opened this Issue May 28, 2018 · 14 comments

Comments

Projects
None yet
7 participants
@spencerahill
Copy link
Contributor

spencerahill commented May 28, 2018

Now that CFTimeIndex has been implemented (#1252), one thing that remains to implement is resampling. @shoyer provided a sketch of how to implement it: #1252 (comment). In the interim, @spencerkclark provided a sketch of a workaround for some use-cases using groupby: #1270 (comment).

I thought it would be useful to have a new Issue specifically on this topic from which future conversation can continue. @shoyer, does that sketch you provided still seem like a good starting point?

@shoyer

This comment has been minimized.

Copy link
Member

shoyer commented May 28, 2018

Yes, I think so. The main thing we need is a function to map from datetime -> datetime at start of frequency.

@naomi-henderson

This comment has been minimized.

Copy link

naomi-henderson commented Jun 5, 2018

I am trying to combine the monthly CMIP5 rcp85 ts datasets (go past 2064AD) with the myriad calendars, so I love the new CFTimeIndex! But I need resample(time='MS') in order to force them all to start on the first of each month
thanks!

@spencerkclark

This comment has been minimized.

Copy link
Member

spencerkclark commented Jun 5, 2018

@naomi-henderson thanks! In the meantime here's a possible workaround, in case you haven't figured one out already:

import numpy as np
import xarray as xr

from cftime import num2date, DatetimeNoLeap


times = num2date(np.arange(730), calendar='noleap', units='days since 0001-01-01')
da = xr.DataArray(np.arange(730), coords=[times], dims=['time'])

month_start = [DatetimeNoLeap(date.dt.year, date.dt.month, 1) for date in da.time]
da['MS'] = xr.DataArray(month_start, coords=da.time.coords)
resampled = da.groupby('MS').mean('time').rename({'MS': 'time'})
@naomi-henderson

This comment has been minimized.

Copy link

naomi-henderson commented Jun 5, 2018

@spencerkclark thanks! I hadn't figured out that particular workaround, but it works, albeit quite slow. For now it will get me to the next step, but just changing to first-of-the-month takes longer than regridding all models to a common grid!

@spencerkclark

This comment has been minimized.

Copy link
Member

spencerkclark commented Jun 6, 2018

Indeed what I had above is quite slow!

In [6]: %%timeit
   ...: month_start = [DatetimeNoLeap(date.dt.year, date.dt.month, 1) for date in da.time]
   ...:
1 loop, best of 3: 588 ms per loop

Iterating over the contents of da.time generates DataArray instances encapsulating single dates. We can iterate over the dates themselves directly, which is much (over 1000x) faster:

In [7]: %%timeit
   ...: month_start = [DatetimeNoLeap(date.year, date.month, 1) for date in da.time.values]
   ...:
1000 loops, best of 3: 302 µs per loop
@naomi-henderson

This comment has been minimized.

Copy link

naomi-henderson commented Jun 6, 2018

Yes, when open_mfdataset decides to convert to CFTime this is much faster. When time is in datetime64, I get:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-72-a96fa0263d3e> in <module>()
      9     dss = xr.open_mfdataset(files,decode_times=True,autoclose=True)
     10     #month_start = [DatetimeNoLeap(date.dt.year, date.dt.month, 1) for date in dss.time]
---> 11     month_start = [DatetimeNoLeap(date.year, date.month, 1) for date in dss.time.values]
     12     #month_start = [DatetimeNoLeap(yr, mon, 1) for yr,mon in zip(dss.time.dt.year,dss.time.dt.month)]
     13     #break

<ipython-input-72-a96fa0263d3e> in <listcomp>(.0)
      9     dss = xr.open_mfdataset(files,decode_times=True,autoclose=True)
     10     #month_start = [DatetimeNoLeap(date.dt.year, date.dt.month, 1) for date in dss.time]
---> 11     month_start = [DatetimeNoLeap(date.year, date.month, 1) for date in dss.time.values]
     12     #month_start = [DatetimeNoLeap(yr, mon, 1) for yr,mon in zip(dss.time.dt.year,dss.time.dt.month)]
     13     #break

AttributeError: 'numpy.datetime64' object has no attribute 'year'

You can see I made a feeble attempt to fix it to work for all the CMIP5 calendars, but is just as slow. Any suggestions?

@spencerkclark

This comment has been minimized.

Copy link
Member

spencerkclark commented Jun 6, 2018

When the time coordinate contains np.datetime64 objects I recommend using resample directly, because the underlying index will be a pandas DatetimeIndex (so you just need some logic to detect if that's the case).

I think the most general workaround for right now would probably look something like the example below. This has the property that it preserves the underlying calendar type of the time index.

import pandas as pd
import xarray as xr

def resample_ms_freq(ds, dim='time'):
    """Resample the dataset to 'MS' frequency regardless of the
    calendar used.
    
    Parameters
    ----------
    ds : Dataset
        Dataset to be resampled
    dim : str
        Dimension name associated with the time index
        
    Returns
    -------
    Dataset
    """
    index = ds.indexes[dim]
    if isinstance(index, pd.DatetimeIndex):
        return ds.resample(**{dim: 'MS'}).mean(dim)
    elif isinstance(index, xr.CFTimeIndex):
        date_type = index.date_type
        month_start = [date_type(date.year, date.month, 1) for date in ds[dim].values]
        ms = xr.DataArray(month_start, coords=ds[dim].coords)
        ds = ds.assign_coords(MS=ms)
        return ds.groupby('MS').mean(dim).rename({'MS': dim})
    else:
        raise TypeError(
            'Resampling to month start frequency requires using a time index of either '
            'type pd.DatetimeIndex or xr.CFTimeIndex.')

with xr.set_options(enable_cftimeindex=True):
    ds = xr.open_mfdataset(files)
resampled = resample_ms_freq(ds)
@aidanheerdegen

This comment has been minimized.

Copy link

aidanheerdegen commented Jun 22, 2018

I'm not sure if my issue belongs in here, but I didn't want to create a new Issue (there are already 455 open ones).

I am experimenting with the new CFTimeIndex functionality (thanks heaps BTW! That was a mammoth effort if the PR thread is anything to go by).

I am trying to shift a time index as I need to align datasets to a common start point. So using the example code above,

da.time.get_index('time').shift(1,'D')
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-71-db48b2fbb340> in <module>()
----> 1 da.time.get_index('time').shift(1,'D')

/g/data3/hh5/public/apps/miniconda3/envs/analysis27-18.04/lib/python2.7/site-packages/pandas/core/indexes/base.pyc in shift(self, periods, freq)
   2627         """
   2628         raise NotImplementedError("Not supported for type %s" %
-> 2629                                   type(self).__name__)
   2630 
   2631     def argsort(self, *args, **kwargs):

NotImplementedError: Not supported for type CFTimeIndex

Is this not implemented because it might require resampling?

I ask because this works:

times[0] + pd.Timedelta('365 days')
cftime.DatetimeNoLeap(2, 1, 1, 0, 0, 0, 0, -1, 1)

I guess I am asking, if I want to shift a time index is the best (only?) way currently is to loop over all the individual elements of the index and add a time offset to each?

@shoyer

This comment has been minimized.

Copy link
Member

shoyer commented Jun 22, 2018

@aidanheerdegen

This comment has been minimized.

Copy link

aidanheerdegen commented Jun 22, 2018

Does this need it's own issue then, so it doesn't get lost?

@shoyer

This comment has been minimized.

Copy link
Member

shoyer commented Jun 22, 2018

@huard

This comment has been minimized.

Copy link

huard commented Oct 1, 2018

I'm trying to wrap my head around what is needed to get the resample method to work but I must say I'm confused. Would it be possible/practical to create a branch with stubs in the code for the methods that need to be written (with a #2191 comment) so newbies can help fill-in the gaps?

@shoyer

This comment has been minimized.

Copy link
Member

shoyer commented Oct 2, 2018

Take a look at #2458 for a very basic version of this.

@spencerkclark

This comment has been minimized.

Copy link
Member

spencerkclark commented Oct 2, 2018

Thanks @shoyer for getting things started! @huard your help would be very much appreciated in implementing this. As mentioned in #2437 (comment), this is one of the biggest remaining gaps in functionality between xarray objects indexed by a CFTimeIndex and xarray objects indexed by a DatetimeIndex.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment