
Adding resample functionality to CFTimeIndex #2191

Closed
spencerahill opened this issue May 28, 2018 · 22 comments

@spencerahill (Contributor) commented May 28, 2018

Now that CFTimeIndex has been implemented (#1252), one thing that remains to implement is resampling. @shoyer provided a sketch of how to implement it: #1252 (comment). In the interim, @spencerkclark provided a sketch of a workaround for some use-cases using groupby: #1270 (comment).

I thought it would be useful to have a new Issue specifically on this topic from which future conversation can continue. @shoyer, does that sketch you provided still seem like a good starting point?

@shoyer (Member) commented May 28, 2018

Yes, I think so. The main thing we need is a function to map from datetime -> datetime at start of frequency.

@naomi-henderson commented Jun 5, 2018

I am trying to combine the monthly CMIP5 rcp85 ts datasets (which go past 2064 AD) with their myriad calendars, so I love the new CFTimeIndex! But I need resample(time='MS') in order to force them all to start on the first of each month.
Thanks!

@spencerkclark (Member) commented Jun 5, 2018

@naomi-henderson thanks! In the meantime here's a possible workaround, in case you haven't figured one out already:

import numpy as np
import xarray as xr

from cftime import num2date, DatetimeNoLeap


times = num2date(np.arange(730), calendar='noleap', units='days since 0001-01-01')
da = xr.DataArray(np.arange(730), coords=[times], dims=['time'])

month_start = [DatetimeNoLeap(date.dt.year, date.dt.month, 1) for date in da.time]
da['MS'] = xr.DataArray(month_start, coords=da.time.coords)
resampled = da.groupby('MS').mean('time').rename({'MS': 'time'})

@naomi-henderson commented Jun 5, 2018

@spencerkclark thanks! I hadn't figured out that particular workaround, but it works, albeit quite slowly. For now it will get me to the next step, but just changing to first-of-the-month takes longer than regridding all the models to a common grid!

@spencerkclark (Member) commented Jun 6, 2018

Indeed what I had above is quite slow!

In [6]: %%timeit
   ...: month_start = [DatetimeNoLeap(date.dt.year, date.dt.month, 1) for date in da.time]
   ...:
1 loop, best of 3: 588 ms per loop

Iterating over the contents of da.time generates DataArray instances encapsulating single dates. We can iterate over the dates themselves directly, which is much (over 1000x) faster:

In [7]: %%timeit
   ...: month_start = [DatetimeNoLeap(date.year, date.month, 1) for date in da.time.values]
   ...:
1000 loops, best of 3: 302 µs per loop

@naomi-henderson commented Jun 6, 2018

Yes, when open_mfdataset decides to convert to CFTime, this is much faster. But when time is in datetime64, I get:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-72-a96fa0263d3e> in <module>()
      9     dss = xr.open_mfdataset(files,decode_times=True,autoclose=True)
     10     #month_start = [DatetimeNoLeap(date.dt.year, date.dt.month, 1) for date in dss.time]
---> 11     month_start = [DatetimeNoLeap(date.year, date.month, 1) for date in dss.time.values]
     12     #month_start = [DatetimeNoLeap(yr, mon, 1) for yr,mon in zip(dss.time.dt.year,dss.time.dt.month)]
     13     #break

<ipython-input-72-a96fa0263d3e> in <listcomp>(.0)
      9     dss = xr.open_mfdataset(files,decode_times=True,autoclose=True)
     10     #month_start = [DatetimeNoLeap(date.dt.year, date.dt.month, 1) for date in dss.time]
---> 11     month_start = [DatetimeNoLeap(date.year, date.month, 1) for date in dss.time.values]
     12     #month_start = [DatetimeNoLeap(yr, mon, 1) for yr,mon in zip(dss.time.dt.year,dss.time.dt.month)]
     13     #break

AttributeError: 'numpy.datetime64' object has no attribute 'year'

You can see I made a feeble attempt to fix it to work for all the CMIP5 calendars, but it is just as slow. Any suggestions?

@spencerkclark (Member) commented Jun 6, 2018

When the time coordinate contains np.datetime64 objects I recommend using resample directly, because the underlying index will be a pandas DatetimeIndex (so you just need some logic to detect if that's the case).

I think the most general workaround for right now would probably look something like the example below. This has the property that it preserves the underlying calendar type of the time index.

import pandas as pd
import xarray as xr

def resample_ms_freq(ds, dim='time'):
    """Resample the dataset to 'MS' frequency regardless of the
    calendar used.
    
    Parameters
    ----------
    ds : Dataset
        Dataset to be resampled
    dim : str
        Dimension name associated with the time index
        
    Returns
    -------
    Dataset
    """
    index = ds.indexes[dim]
    if isinstance(index, pd.DatetimeIndex):
        return ds.resample(**{dim: 'MS'}).mean(dim)
    elif isinstance(index, xr.CFTimeIndex):
        date_type = index.date_type
        month_start = [date_type(date.year, date.month, 1) for date in ds[dim].values]
        ms = xr.DataArray(month_start, coords=ds[dim].coords)
        ds = ds.assign_coords(MS=ms)
        return ds.groupby('MS').mean(dim).rename({'MS': dim})
    else:
        raise TypeError(
            'Resampling to month start frequency requires using a time index of either '
            'type pd.DatetimeIndex or xr.CFTimeIndex.')

with xr.set_options(enable_cftimeindex=True):
    ds = xr.open_mfdataset(files)
resampled = resample_ms_freq(ds)

@aidanheerdegen (Contributor) commented Jun 22, 2018

I'm not sure if my issue belongs in here, but I didn't want to create a new Issue (there are already 455 open ones).

I am experimenting with the new CFTimeIndex functionality (thanks heaps BTW! That was a mammoth effort if the PR thread is anything to go by).

I am trying to shift a time index as I need to align datasets to a common start point. So using the example code above,

da.time.get_index('time').shift(1,'D')
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-71-db48b2fbb340> in <module>()
----> 1 da.time.get_index('time').shift(1,'D')

/g/data3/hh5/public/apps/miniconda3/envs/analysis27-18.04/lib/python2.7/site-packages/pandas/core/indexes/base.pyc in shift(self, periods, freq)
   2627         """
   2628         raise NotImplementedError("Not supported for type %s" %
-> 2629                                   type(self).__name__)
   2630 
   2631     def argsort(self, *args, **kwargs):

NotImplementedError: Not supported for type CFTimeIndex

Is this not implemented because it might require resampling?

I ask because this works:

times[0] + pd.Timedelta('365 days')
cftime.DatetimeNoLeap(2, 1, 1, 0, 0, 0, 0, -1, 1)

I guess I am asking: if I want to shift a time index, is the best (only?) way currently to loop over all the individual elements of the index and add a time offset to each?

@shoyer (Member) commented Jun 22, 2018

@aidanheerdegen (Contributor) commented Jun 22, 2018

Does this need its own issue then, so it doesn't get lost?

@shoyer (Member) commented Jun 22, 2018

@huard (Contributor) commented Oct 1, 2018

I'm trying to wrap my head around what is needed to get the resample method to work, but I must say I'm confused. Would it be possible/practical to create a branch with stubs in the code for the methods that need to be written (with a #2191 comment), so newbies can help fill in the gaps?

@shoyer (Member) commented Oct 2, 2018

Take a look at #2458 for a very basic version of this.

@spencerkclark (Member) commented Oct 2, 2018

Thanks @shoyer for getting things started! @huard your help would be very much appreciated in implementing this. As mentioned in #2437 (comment), this is one of the biggest remaining gaps in functionality between xarray objects indexed by a CFTimeIndex and xarray objects indexed by a DatetimeIndex.

@spencerkclark (Member) commented Feb 3, 2019

This has been implemented in #2593 🎉.

@zzheng93 commented Feb 18, 2019

Hi folks,
I have some data like 2000-01-01 00:00:00, 2000-01-01 12:00:00, 2000-01-02 00:00:00, 2000-01-02 12:00:00, where the index is a CFTimeIndex, and I want to take the average within each date and save the results.
I am wondering if it is possible to resample them at a daily level (e.g., the results would be 2000-01-01 00:00:00 and 2000-01-02 00:00:00)?

@spencerkclark (Member) commented Feb 18, 2019

@zzheng93 this will be possible in the next release of xarray, so not quite yet, but soon. If you're in a hurry you could install the development version.

@zzheng93 commented Feb 18, 2019

> @zzheng93 this will be possible in the next release of xarray, so not quite yet, but soon. If you're in a hurry you could install the development version.

@spencerkclark Thank you very much :)
I am new to the xarray community. I am wondering if there are any instructions for installing the latest development version, and for how to use the daily resampling function.

@spencerkclark (Member) commented Feb 19, 2019

@zzheng93 welcome! One way to install the development version is to clone this repo, and do an editable install:

$ git clone https://github.com/pydata/xarray.git
$ cd xarray
$ pip install -e .

Then using resample with a daily frequency would look something like:

In [1]: import xarray as xr

In [2]: times = xr.cftime_range('2000', periods=4, freq='12H')

In [3]: times
Out[3]:
CFTimeIndex([2000-01-01 00:00:00, 2000-01-01 12:00:00, 2000-01-02 00:00:00,
             2000-01-02 12:00:00],
            dtype='object')

In [4]: da = xr.DataArray(range(4), [('time', times)])

In [5]: da.resample(time='D').mean()
Out[5]:
<xarray.DataArray (time: 2)>
array([0.5, 2.5])
Coordinates:
  * time     (time) object 2000-01-01 00:00:00 2000-01-02 00:00:00

@zzheng93 commented Feb 19, 2019

@spencerkclark Thank you very much for your help! I will install the development version on my local machine.
Currently I am using NCAR Cheyenne to manipulate the climate data. What I am doing on Cheyenne as a detour is:

ds = ds.assign_coords(time=ds.indexes['time'].to_datetimeindex())
ds = ds.resample(time="D").mean("time")

I hope NCAR will support the next release of xarray.
A follow-up question: when we use xarray to manipulate a large dataset such as <xarray.DataArray (time: 14600, lat: 192, lon: 288)> and want to save the results for further machine learning applications (e.g., with sklearn or XGBoost, or even deep learning), what would be a good format for storing the data on a server or local machine so that it can easily be used by sklearn or XGBoost?

@spencerkclark (Member) commented Feb 19, 2019

@zzheng93 sure thing!

> I hope NCAR will support the next release of xarray.

I know you didn't ask for help with this, but I can't resist :) -- I recommend you set up your own Python environment on Cheyenne. This is nice because it gives you full control over the packages you install (so you don't need to wait until someone else installs them for you). A good place to start on how to do this is the "Getting started with Pangeo on HPC" page on the Pangeo website.

> A follow-up question: when we use xarray to manipulate a large dataset such as <xarray.DataArray (time: 14600, lat: 192, lon: 288)> and want to save the results for further machine learning applications (e.g., with sklearn or XGBoost, or even deep learning), what would be a good format for storing the data on a server or local machine so that it can easily be used by sklearn or XGBoost?

I think with some more specific details regarding what you are looking to do, this could potentially be a good question to ask in the (relatively new) pangeo-data/ml-workflow-examples repo, where they are discussing machine learning workflows connected to xarray.
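As a purely illustrative aside (not a recommendation from this thread, and the names are mine): one common pattern is to stack the spatial dimensions into a single feature axis before handing the array to sklearn, and to persist the labeled data as netCDF (via to_netcdf) or Zarr so its structure survives round trips:

```python
import numpy as np
import xarray as xr

# Toy stand-in for a (time, lat, lon) DataArray; the shapes are illustrative.
da = xr.DataArray(np.random.rand(4, 3, 2), dims=['time', 'lat', 'lon'])

# Collapse lat/lon into one "space" axis: rows are samples, columns features.
X = da.stack(space=('lat', 'lon')).values  # shape (4, 6)
```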

@zzheng93 commented Feb 19, 2019

@spencerkclark
Very helpful!!! Thanks a million! :)
