
WIP: sketch of resample support for CFTimeIndex #2458

Closed · wants to merge 2 commits

Conversation

@shoyer commented Oct 2, 2018

Example usage:

>>> import xarray
>>> times = xarray.cftime_range('2000', periods=30, freq='MS')
>>> da = xarray.DataArray(range(30), [('time', times)])
>>> da.resample(time='1AS').mean()
<xarray.DataArray (time: 3)>
array([ 5.5, 17.5, 26.5])
Coordinates:
  * time     (time) object 2001-01-01 00:00:00 ... 2003-01-01 00:00:00

The inline review comments below refer to this snippet from the diff:

if isinstance(self.indexes[dim], CFTimeIndex):
    # TODO: handle closed, label and base arguments, and the case where
    # frequency is specified without an integer count.
    grouper = self.indexes[dim].shift(n=int(freq[0]), freq=freq[1:])
@spencerkclark commented:

Conveniently, I think this could be written as self.indexes[dim].shift(1, freq), which would handle both the case where the frequency is specified with an integer multiple and the case where it is not (internally, shift converts the frequency string to an offset carrying the appropriate multiple, so n can always be 1 here):

In [1]: import xarray as xr

In [2]: times = xr.cftime_range('2000', periods=5, freq='D')

In [3]: times
Out[3]:
CFTimeIndex([2000-01-01 00:00:00, 2000-01-02 00:00:00, 2000-01-03 00:00:00,
             2000-01-04 00:00:00, 2000-01-05 00:00:00],
            dtype='object')

In [4]: times.shift(1, 'D')
Out[4]:
CFTimeIndex([2000-01-02 00:00:00, 2000-01-03 00:00:00, 2000-01-04 00:00:00,
             2000-01-05 00:00:00, 2000-01-06 00:00:00],
            dtype='object')

In [5]: times.shift(1, '2D')
Out[5]:
CFTimeIndex([2000-01-03 00:00:00, 2000-01-04 00:00:00, 2000-01-05 00:00:00,
             2000-01-06 00:00:00, 2000-01-07 00:00:00],
            dtype='object')

@spencerkclark commented:

In general though (e.g. for frequency multiples greater than 1), something as simple as shift will not suffice. I think one will need some sort of binning mechanism like pandas.cut to bin the dates into regular intervals starting from the start date, with the result of a call to xr.cftime_range forming the bin edges; a rough sketch follows.
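
A minimal sketch of that binning idea (an editor's illustration, not code from this PR; np.searchsorted stands in for pandas.cut, since cftime dates are plain comparable objects):

import numpy as np
import pandas as pd
import xarray as xr

times = xr.cftime_range('2000', periods=30, freq='MS')
# Bin edges at the target frequency, spanning the full index:
edges = xr.cftime_range(times[0], times[-1], freq='6MS')
# side='right' assigns each date to the bin whose left edge is the
# most recent edge at or before it:
bin_ids = np.searchsorted(list(edges), list(times), side='right') - 1
labels = pd.Series([edges[i] for i in bin_ids], index=list(times))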

@huard that's all I have for now -- I may add some notes over the weekend if I have time. I have looked into doing this some in the past, but I didn't arrive at something I was totally pleased with.

@huard commented:

@spencerkclark Thanks! I'm confused about how groupby on grouper actually works in the example above. I'll take a look at cut.

@shoyer commented:

Here's how the magic of grouper works:

if grouper is not None:
    index = safe_cast_to_index(group)
    if not index.is_monotonic:
        # TODO: sort instead of raising an error
        raise ValueError('index must be monotonic for resampling')
    s = pd.Series(np.arange(index.size), index)
    first_items = s.groupby(grouper).first()
    full_index = first_items.index
    if first_items.isnull().any():
        first_items = first_items.dropna()
    sbins = first_items.values.astype(np.int64)
    group_indices = ([slice(i, j)
                      for i, j in zip(sbins[:-1], sbins[1:])] +
                     [slice(sbins[-1], None)])
    unique_coord = IndexVariable(group.name, first_items.index)

Basically, it's just used as the variable over which the groupby operation is done.
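
To make that concrete, here is a toy illustration (made-up labels, not from the PR) of how s.groupby(grouper).first() yields the integer positions that become the slice boundaries:

import numpy as np
import pandas as pd

index = pd.Index(['jan', 'feb', 'mar', 'apr', 'may'])  # stand-in time axis
# grouper holds one bin label per element of the index:
grouper = pd.Series(['2000', '2000', '2000', '2001', '2001'], index=index)
s = pd.Series(np.arange(index.size), index=index)      # positions 0..4
print(s.groupby(grouper).first())
# 2000    0
# 2001    3
# Those first positions become slice(0, 3) and slice(3, None) in
# group_indices above.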

@huard commented:

It's the line first_items = s.groupby(grouper).first() that feels like magic to me. How does groupby know to split into years, when grouper is just a shifted monthly index?

@shoyer commented:

Well, it clearly doesn't work with the shifted monthly index, as Spencer pointed out :).

@shoyer commented Oct 3, 2018

Inspired by @spencerkclark's suggestion, I tried another version based on cftime_range and reindex with method='pad'. This one seems to be working in more cases:

In [7]: times = xarray.cftime_range('2000', periods=30, freq='MS')

In [8]: da = xarray.DataArray(range(30), [('time', times)])

In [9]: times
Out[9]:
CFTimeIndex([2000-01-01 00:00:00, 2000-02-01 00:00:00, 2000-03-01 00:00:00,
             2000-04-01 00:00:00, 2000-05-01 00:00:00, 2000-06-01 00:00:00,
             2000-07-01 00:00:00, 2000-08-01 00:00:00, 2000-09-01 00:00:00,
             2000-10-01 00:00:00, 2000-11-01 00:00:00, 2000-12-01 00:00:00,
             2001-01-01 00:00:00, 2001-02-01 00:00:00, 2001-03-01 00:00:00,
             2001-04-01 00:00:00, 2001-05-01 00:00:00, 2001-06-01 00:00:00,
             2001-07-01 00:00:00, 2001-08-01 00:00:00, 2001-09-01 00:00:00,
             2001-10-01 00:00:00, 2001-11-01 00:00:00, 2001-12-01 00:00:00,
             2002-01-01 00:00:00, 2002-02-01 00:00:00, 2002-03-01 00:00:00,
             2002-04-01 00:00:00, 2002-05-01 00:00:00, 2002-06-01 00:00:00],
            dtype='object')

In [10]: da.resample(time='12MS').mean()
Out[10]:
<xarray.DataArray (time: 3)>
array([ 5.5, 17.5, 26.5])
Coordinates:
  * time     (time) object 2000-01-01 00:00:00 ... 2002-01-01 00:00:00

In [11]: da.resample(time='6MS').mean()
Out[11]:
<xarray.DataArray (time: 5)>
array([ 2.5,  8.5, 14.5, 20.5, 26.5])
Coordinates:
  * time     (time) object 2000-01-01 00:00:00 ... 2002-01-01 00:00:00

In [12]: da.resample(time='3MS').mean()
Out[12]:
<xarray.DataArray (time: 10)>
array([ 1.,  4.,  7., 10., 13., 16., 19., 22., 25., 28.])
Coordinates:
  * time     (time) object 2000-01-01 00:00:00 ... 2002-04-01 00:00:00
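
For reference, a rough sketch of the reindex-with-pad idea described above (an editor's reading, not the PR's exact code; it assumes CFTimeIndex, as a pandas Index subclass, supports reindex with method='pad'):

import pandas as pd
import xarray

times = xarray.cftime_range('2000', periods=30, freq='MS')
# Bin starts at the target frequency, spanning the index:
starts = xarray.cftime_range(times[0], times[-1], freq='6MS')
# Map each bin start to itself, then pad (forward-fill) onto the full
# index, labelling every timestamp with its containing bin's start:
labels = pd.Series(list(starts), index=starts)
grouper = labels.reindex(times, method='pad')

That grouper series can then drive the same groupby machinery shown earlier in the thread.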

@huard commented Oct 4, 2018

Do you think there would be a benefit to implementing a TimeGrouper class based on pandas's?

@spencerkclark commented:

> Do you think there would be a benefit to implementing a TimeGrouper class based on pandas's?

My instinct would be to first pursue the simple approach that @shoyer has started here. If it turns out that passing a pandas.Series rather than a pandas.Grouper instance in line 236 of groupby.py prevents us from replicating some important behavior of resample, then it might be something to think about.

While there are a few details still to be added to Stephan's implementation (e.g., as he notes in the to-do comment, proper handling of the closed, label, and base arguments, plus some extra complexity around gaps in the time series), I do not yet see any reason these couldn't be handled with modifications to the current approach. The logic in TimeGrouper is definitely a good reference for how to handle the different arguments to resample, but if we can, I think it would be nice to avoid the complexity of defining a new Grouper class.

@shoyer commented Oct 4, 2018 via email

@shoyer commented Feb 3, 2019

Implemented in #2593

@shoyer closed this Feb 3, 2019