WIP: sketch of resample support for CFTimeIndex #2458

Open
wants to merge 2 commits into
base: master
from

4 participants
@shoyer
Member

shoyer commented Oct 2, 2018

Example usage:

>>> import xarray
>>> times = xarray.cftime_range('2000', periods=30, freq='MS')
>>> da = xarray.DataArray(range(30), [('time', times)])
>>> da.resample(time='1AS').mean()
<xarray.DataArray (time: 3)>
array([ 5.5, 17.5, 26.5])
Coordinates:
  * time     (time) object 2001-01-01 00:00:00 ... 2003-01-01 00:00:00
if isinstance(self.indexes[dim], CFTimeIndex):
    # TODO: handle closed, label and base arguments, and the case where
    # frequency is specified without an integer count.
    grouper = self.indexes[dim].shift(n=int(freq[0]), freq=freq[1:])

@spencerkclark

spencerkclark Oct 2, 2018

Member

Conveniently, I think this could be written as self.indexes[dim].shift(1, freq), which would handle frequencies specified both with and without an integer multiple. Internally, shift converts the frequency string to an offset that carries the appropriate multiple, so n can always be 1 in this case:

In [1]: import xarray as xr

In [2]: times = xr.cftime_range('2000', periods=5, freq='D')

In [3]: times
Out[3]:
CFTimeIndex([2000-01-01 00:00:00, 2000-01-02 00:00:00, 2000-01-03 00:00:00,
             2000-01-04 00:00:00, 2000-01-05 00:00:00],
            dtype='object')

In [4]: times.shift(1, 'D')
Out[4]:
CFTimeIndex([2000-01-02 00:00:00, 2000-01-03 00:00:00, 2000-01-04 00:00:00,
             2000-01-05 00:00:00, 2000-01-06 00:00:00],
            dtype='object')

In [5]: times.shift(1, '2D')
Out[5]:
CFTimeIndex([2000-01-03 00:00:00, 2000-01-04 00:00:00, 2000-01-05 00:00:00,
             2000-01-06 00:00:00, 2000-01-07 00:00:00],
            dtype='object')
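
To illustrate why slicing the string as in the diff is fragile: int("12MS"[0]) gives 1 with "2MS" left over, and int("AS"[0]) raises a ValueError. Below is a minimal sketch of splitting a frequency string into its multiple and base. The helper name split_freq and its regular expression are hypothetical, not part of xarray; this is only meant to show the idea, not how shift does it internally.

```python
import re

# Hypothetical helper, not part of xarray: split "12MS" into (12, "MS").
# Assumes frequency strings look like an optional integer followed by
# letters, with an optional "-SUFFIX" part (e.g. "A-DEC").
_FREQ_PATTERN = re.compile(r"(\d*)([A-Za-z]+(?:-[A-Za-z]+)?)")


def split_freq(freq):
    """Return (multiple, base frequency); the multiple defaults to 1."""
    match = _FREQ_PATTERN.fullmatch(freq)
    if match is None:
        raise ValueError(f"could not parse frequency string {freq!r}")
    count, base = match.groups()
    return (int(count) if count else 1), base


print(split_freq("1AS"))   # (1, 'AS')
print(split_freq("AS"))    # (1, 'AS')
print(split_freq("12MS"))  # (12, 'MS')
```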

@spencerkclark

spencerkclark Oct 3, 2018

Member

In general though (e.g. for frequency multiples greater than 1), something as simple as shift will not suffice. I think one will need some sort of binning mechanism like pandas.cut to bin the dates in regular intervals starting from the start date, with a result from a call to xr.cftime_range forming the bin edges.

@huard that's all I have for now -- I may add some notes over the weekend if I have time. I have looked into doing this some in the past, but I didn't arrive at something I was totally pleased with.
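
The binning idea above can be sketched with the standard library alone. This is only an illustration under stated assumptions: datetime.date stands in for cftime dates, the hypothetical month_starts helper stands in for a call to xr.cftime_range, and bisect stands in for pandas.cut. Bins are left-closed and labelled by their left edge.

```python
from bisect import bisect_right
from datetime import date


def month_starts(start, periods, step):
    """Hypothetical stand-in for a month-start range: `periods` dates, `step` months apart."""
    edges = []
    for k in range(periods):
        months = (start.month - 1) + k * step
        edges.append(date(start.year + months // 12, months % 12 + 1, 1))
    return edges


def assign_bins(dates, edges):
    """Label each date with the last edge that is <= the date.

    Assumes `edges` is sorted and every date is >= edges[0].
    """
    return [edges[bisect_right(edges, d) - 1] for d in dates]


# 30 monthly dates starting 2000-01-01, binned into 6-month intervals.
dates = month_starts(date(2000, 1, 1), periods=30, step=1)
edges = month_starts(date(2000, 1, 1), periods=5, step=6)
labels = assign_bins(dates, edges)

print(labels[0], labels[5], labels[6], labels[-1])
# 2000-01-01 2000-01-01 2000-07-01 2002-01-01
```

Each date then carries a bin label, so grouping by label gives five 6-month groups for this input.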

@huard

huard Oct 3, 2018

@spencerkclark Thanks! I'm confused with how groupby on grouper actually works in the example above. I'll take a look at cut.

@shoyer

shoyer Oct 3, 2018

Member

Here's how the magic of grouper works:

if grouper is not None:
    index = safe_cast_to_index(group)
    if not index.is_monotonic:
        # TODO: sort instead of raising an error
        raise ValueError('index must be monotonic for resampling')
    s = pd.Series(np.arange(index.size), index)
    first_items = s.groupby(grouper).first()
    full_index = first_items.index
    if first_items.isnull().any():
        first_items = first_items.dropna()
    sbins = first_items.values.astype(np.int64)
    group_indices = ([slice(i, j)
                      for i, j in zip(sbins[:-1], sbins[1:])] +
                     [slice(sbins[-1], None)])
    unique_coord = IndexVariable(group.name, first_items.index)

Basically, it's just used as the variable over which the groupby operation is done.
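
A plain-Python sketch of what the snippet above does may help. It assumes the grouper is array-like with one label per position (as in this PR's sketch), so that positions sharing a label form a group. The function name group_slices is hypothetical, and the dictionary stands in for s.groupby(grouper).first().

```python
def group_slices(labels):
    """Given one label per (sorted) position, return unique labels and a slice per group."""
    first_positions = {}
    for position, label in enumerate(labels):
        # Keep only the first position seen for each label.
        first_positions.setdefault(label, position)
    unique = sorted(first_positions)
    sbins = [first_positions[label] for label in unique]
    # Each group runs from its first position to the next group's first position.
    slices = [slice(i, j) for i, j in zip(sbins[:-1], sbins[1:])]
    slices.append(slice(sbins[-1], None))
    return unique, slices


# One label per month: 12 in 2000, 12 in 2001, 6 in 2002.
labels = [2000] * 12 + [2001] * 12 + [2002] * 6
unique, slices = group_slices(labels)
print(unique)  # [2000, 2001, 2002]
print(slices)  # [slice(0, 12, None), slice(12, 24, None), slice(24, None, None)]

data = list(range(30))
print([sum(data[s]) / len(data[s]) for s in slices])  # [5.5, 17.5, 26.5]
```

The groups come entirely from which positions share a label; nothing in this step knows about years or months.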

@huard

huard Oct 3, 2018

It's the line first_items = s.groupby(grouper).first() that feels magic to me. How does groupby know to split into years, when grouper is just a shifted monthly index?

@shoyer

shoyer Oct 3, 2018

Member

Well, it clearly doesn't work with the shifted monthly index, as Spencer pointed out :).

@shoyer

Member

shoyer commented Oct 3, 2018

Inspired by @spencerkclark's suggestion, I tried another version based on cftime_range and reindex with method='pad'. This one seems to be working in more cases:

In [7]: times = xarray.cftime_range('2000', periods=30, freq='MS')

In [8]: da = xarray.DataArray(range(30), [('time', times)])

In [9]: times
Out[9]:
CFTimeIndex([2000-01-01 00:00:00, 2000-02-01 00:00:00, 2000-03-01 00:00:00,
             2000-04-01 00:00:00, 2000-05-01 00:00:00, 2000-06-01 00:00:00,
             2000-07-01 00:00:00, 2000-08-01 00:00:00, 2000-09-01 00:00:00,
             2000-10-01 00:00:00, 2000-11-01 00:00:00, 2000-12-01 00:00:00,
             2001-01-01 00:00:00, 2001-02-01 00:00:00, 2001-03-01 00:00:00,
             2001-04-01 00:00:00, 2001-05-01 00:00:00, 2001-06-01 00:00:00,
             2001-07-01 00:00:00, 2001-08-01 00:00:00, 2001-09-01 00:00:00,
             2001-10-01 00:00:00, 2001-11-01 00:00:00, 2001-12-01 00:00:00,
             2002-01-01 00:00:00, 2002-02-01 00:00:00, 2002-03-01 00:00:00,
             2002-04-01 00:00:00, 2002-05-01 00:00:00, 2002-06-01 00:00:00],
            dtype='object')

In [10]: da.resample(time='12MS').mean()
Out[10]:
<xarray.DataArray (time: 3)>
array([ 5.5, 17.5, 26.5])
Coordinates:
  * time     (time) object 2000-01-01 00:00:00 ... 2002-01-01 00:00:00

In [11]: da.resample(time='6MS').mean()
Out[11]:
<xarray.DataArray (time: 5)>
array([ 2.5,  8.5, 14.5, 20.5, 26.5])
Coordinates:
  * time     (time) object 2000-01-01 00:00:00 ... 2002-01-01 00:00:00

In [12]: da.resample(time='3MS').mean()
Out[12]:
<xarray.DataArray (time: 10)>
array([ 1.,  4.,  7., 10., 13., 16., 19., 22., 25., 28.])
Coordinates:
  * time     (time) object 2000-01-01 00:00:00 ... 2002-04-01 00:00:00
@huard

huard commented Oct 4, 2018

Do you think there would be a benefit to implementing a TimeGrouper class based on pandas'?

@spencerkclark

Member

spencerkclark commented Oct 4, 2018

Do you think there would be a benefit to implementing a TimeGrouper class based on pandas'?

My instinct would be to first pursue the simple approach that @shoyer has started here. If it turns out that passing a pandas.Series rather than a pandas.Grouper instance in line 236 of groupby.py prevents us from replicating some important behavior of resample, then it might be something to think about.

So far, a few details still need to be added to Stephan's implementation: as he notes in the to-do comment, proper handling of the closed, label, and base arguments, plus some other complexity such as how to handle gaps in the time series. Still, I do not (yet) see any reason why these couldn't be handled with some modifications to the current approach. The logic in TimeGrouper is definitely a good reference for how to handle the different arguments to resample, but if we can, I think it would be nice to avoid the complexity of defining a new Grouper class.

@shoyer

Member

shoyer commented Oct 4, 2018

jwenfai referenced this pull request Nov 27, 2018

Merged

Resample v2 clean #1