WIP: sketch of resample support for CFTimeIndex #2458

shoyer · 2018-10-02T15:44:36Z

Example usage:

>>> import xarray
>>> times = xarray.cftime_range('2000', periods=30, freq='MS')
>>> da = xarray.DataArray(range(30), [('time', times)])
>>> da.resample(time='1AS').mean()
<xarray.DataArray (time: 3)>
array([ 5.5, 17.5, 26.5])
Coordinates:
  * time     (time) object 2001-01-01 00:00:00 ... 2003-01-01 00:00:00


spencerkclark · 2018-10-02T16:03:44Z

xarray/core/common.py

+        if isinstance(self.indexes[dim], CFTimeIndex):
+            # TODO: handle closed, label and base arguments, and the case where
+            # frequency is specified without an integer count.
+            grouper = self.indexes[dim].shift(n=int(freq[0]), freq=freq[1:])


Conveniently, I think this could be written as self.indexes[dim].shift(1, freq), which would handle frequencies specified both with and without an integer multiple. Internally, shift converts the frequency string to an offset that already carries the appropriate multiple, so n can always be 1 in this case:

In [1]: import xarray as xr

In [2]: times = xr.cftime_range('2000', periods=5, freq='D')

In [3]: times
Out[3]:
CFTimeIndex([2000-01-01 00:00:00, 2000-01-02 00:00:00, 2000-01-03 00:00:00,
             2000-01-04 00:00:00, 2000-01-05 00:00:00],
            dtype='object')

In [4]: times.shift(1, 'D')
Out[4]:
CFTimeIndex([2000-01-02 00:00:00, 2000-01-03 00:00:00, 2000-01-04 00:00:00,
             2000-01-05 00:00:00, 2000-01-06 00:00:00],
            dtype='object')

In [5]: times.shift(1, '2D')
Out[5]:
CFTimeIndex([2000-01-03 00:00:00, 2000-01-04 00:00:00, 2000-01-05 00:00:00,
             2000-01-06 00:00:00, 2000-01-07 00:00:00],
            dtype='object')

In general though (e.g. for frequency multiples greater than 1), something as simple as shift will not suffice. I think one will need some sort of binning mechanism like pandas.cut to bin the dates in regular intervals starting from the start date, with a result from a call to xr.cftime_range forming the bin edges.
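The binning idea can be illustrated without xarray at all. The following is only a minimal sketch under stated assumptions: plain `datetime` objects stand in for cftime dates, and the helpers `month_starts` and `assign_bins` are hypothetical names invented for this illustration (they are not part of xarray or pandas). Each time is labeled with the most recent bin edge at or before it, which corresponds to left-closed, left-labeled bins.

```python
from bisect import bisect_right
from datetime import datetime


def month_starts(year, month, periods, step=1):
    """Hypothetical helper: `periods` month-start dates, `step` months apart."""
    out = []
    for i in range(periods):
        total = (month - 1) + i * step
        out.append(datetime(year + total // 12, total % 12 + 1, 1))
    return out


def assign_bins(times, edges):
    """Hypothetical helper: label each time with the last edge <= that time."""
    labels = []
    for t in times:
        pos = bisect_right(edges, t) - 1
        if pos < 0:
            raise ValueError("time falls before the first bin edge")
        labels.append(edges[pos])
    return labels


# 30 monthly samples starting in 2000-01, binned into yearly intervals.
times = month_starts(2000, 1, periods=30)
edges = month_starts(2000, 1, periods=3, step=12)
labels = assign_bins(times, edges)

# Group the sample values 0..29 by bin label and average each group.
groups = {}
for value, label in enumerate(labels):
    groups.setdefault(label, []).append(value)
means = {label: sum(vals) / len(vals) for label, vals in groups.items()}
```

Here `means` maps the three yearly bin starts to 5.5, 17.5 and 26.5, the same values as in the example at the top of the thread. In a real implementation the edges would presumably come from xr.cftime_range rather than a hand-written helper.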

@huard that's all I have for now -- I may add some notes over the weekend if I have time. I have looked into doing this some in the past, but I didn't arrive at something I was totally pleased with.

@spencerkclark Thanks! I'm confused about how groupby on grouper actually works in the example above. I'll take a look at cut.

Here's how the magic of grouper works:

xarray/xarray/core/groupby.py

Lines 230 to 244 in 0f70a87

if grouper is not None:
    index = safe_cast_to_index(group)
    if not index.is_monotonic:
        # TODO: sort instead of raising an error
        raise ValueError('index must be monotonic for resampling')
    s = pd.Series(np.arange(index.size), index)
    first_items = s.groupby(grouper).first()
    full_index = first_items.index
    if first_items.isnull().any():
        first_items = first_items.dropna()
    sbins = first_items.values.astype(np.int64)
    group_indices = ([slice(i, j)
                      for i, j in zip(sbins[:-1], sbins[1:])] +
                     [slice(sbins[-1], None)])
    unique_coord = IndexVariable(group.name, first_items.index)

Basically, it's just used as the variable over which the groupby operation is done.
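To make the slicing step in the excerpt concrete, here is a small standalone sketch using only the standard library. It assumes the group labels are already known and sorted (plain integers stand in for the grouped time labels), and `first_positions` and `slices_from_first_positions` are hypothetical names for this illustration, not functions in xarray. It mirrors how the position of the first item in each group (sbins) is turned into one slice per group.

```python
def first_positions(labels):
    """Position of the first item in each run of equal labels (sorted input)."""
    firsts = []
    previous = object()  # sentinel that compares unequal to any label
    for pos, label in enumerate(labels):
        if label != previous:
            firsts.append(pos)
            previous = label
    return firsts


def slices_from_first_positions(sbins):
    """One slice per group; the last group runs to the end of the data."""
    return ([slice(i, j) for i, j in zip(sbins[:-1], sbins[1:])] +
            [slice(sbins[-1], None)])


# 30 monthly samples labeled by year: 12 + 12 + 6 items.
labels = [2000] * 12 + [2001] * 12 + [2002] * 6
sbins = first_positions(labels)
group_indices = slices_from_first_positions(sbins)

# Applying each slice to the data and averaging gives the resampled values.
values = list(range(30))
means = [sum(values[s]) / len(values[s]) for s in group_indices]
```

With these inputs `sbins` is `[0, 12, 24]` and `means` is `[5.5, 17.5, 26.5]`. The sketch skips the dropna step from the excerpt, so it does not cover empty bins.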

It's the line first_items = s.groupby(grouper).first() that feels magic to me. How does groupby know to split into years, when grouper is just a shifted monthly index?

Well, it clearly doesn't work with the shifted monthly index, as Spencer pointed out :).

shoyer · 2018-10-03T15:30:45Z

Inspired by @spencerkclark's suggestion, I tried another version based on cftime_range and reindex with method='pad'. This one seems to be working in more cases:

In [7]: times = xarray.cftime_range('2000', periods=30, freq='MS')

In [8]: da = xarray.DataArray(range(30), [('time', times)])

In [9]: times
Out[9]:
CFTimeIndex([2000-01-01 00:00:00, 2000-02-01 00:00:00, 2000-03-01 00:00:00,
             2000-04-01 00:00:00, 2000-05-01 00:00:00, 2000-06-01 00:00:00,
             2000-07-01 00:00:00, 2000-08-01 00:00:00, 2000-09-01 00:00:00,
             2000-10-01 00:00:00, 2000-11-01 00:00:00, 2000-12-01 00:00:00,
             2001-01-01 00:00:00, 2001-02-01 00:00:00, 2001-03-01 00:00:00,
             2001-04-01 00:00:00, 2001-05-01 00:00:00, 2001-06-01 00:00:00,
             2001-07-01 00:00:00, 2001-08-01 00:00:00, 2001-09-01 00:00:00,
             2001-10-01 00:00:00, 2001-11-01 00:00:00, 2001-12-01 00:00:00,
             2002-01-01 00:00:00, 2002-02-01 00:00:00, 2002-03-01 00:00:00,
             2002-04-01 00:00:00, 2002-05-01 00:00:00, 2002-06-01 00:00:00],
            dtype='object')

In [10]: da.resample(time='12MS').mean()
Out[10]:
<xarray.DataArray (time: 3)>
array([ 5.5, 17.5, 26.5])
Coordinates:
  * time     (time) object 2000-01-01 00:00:00 ... 2002-01-01 00:00:00

In [11]: da.resample(time='6MS').mean()
Out[11]:
<xarray.DataArray (time: 5)>
array([ 2.5,  8.5, 14.5, 20.5, 26.5])
Coordinates:
  * time     (time) object 2000-01-01 00:00:00 ... 2002-01-01 00:00:00

In [12]: da.resample(time='3MS').mean()
Out[12]:
<xarray.DataArray (time: 10)>
array([ 1.,  4.,  7., 10., 13., 16., 19., 22., 25., 28.])
Coordinates:
  * time     (time) object 2000-01-01 00:00:00 ... 2002-04-01 00:00:00
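The pad-based labeling can be sketched with plain pandas. This is a rough illustration under stated assumptions, not the code in this PR: a pandas DatetimeIndex built with pd.date_range stands in for the CFTimeIndex returned by cftime_range, and the variable names are made up for the example. Reindexing a Series of bin starts onto the original times with method='pad' gives each time the most recent bin start, which can then be used as the groupby key.

```python
import pandas as pd

# 30 monthly timestamps, standing in for the CFTimeIndex in the example above.
times = pd.date_range('2000-01-01', periods=30, freq='MS')

# Bin starts every 6 months, covering the full range of `times`.
edges = pd.date_range('2000-01-01', periods=5, freq='6MS')

# Forward-fill ("pad") each time with the most recent bin start.
labels = pd.Series(edges, index=edges).reindex(times, method='pad')

# Group the data by those labels and take the mean of each bin.
s = pd.Series(range(30), index=times)
means = s.groupby(labels.values).mean()
```

The resulting means are 2.5, 8.5, 14.5, 20.5 and 26.5, matching the '6MS' output above. The closed, label and base arguments are not handled in this sketch.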

huard · 2018-10-04T12:51:21Z

Do you think there would be a benefit to implementing a TimeGrouper class based on pandas'?

spencerkclark · 2018-10-04T15:05:20Z

Do you think there would be a benefit to implementing a TimeGrouper class based on pandas'?

My instinct would be to first pursue the simple approach that @shoyer has started here. If it turns out that passing a pandas.Series rather than a pandas.Grouper instance in line 236 of groupby.py prevents us from replicating some important behavior of resample, then it might be something to think about.

So far, a few details still need to be added to Stephan's implementation: as he notes in the to-do comment, proper handling of the closed, label, and base arguments, plus some extra complexity around how to handle gaps in the time series. I do not (yet) see any reason why these couldn't be handled with some modifications to the current approach. The logic in TimeGrouper is definitely a good reference for how to handle the different arguments to resample, but if we can, I think it would be nice to avoid the complexity of defining a new Grouper class.

shoyer · 2018-10-04T15:12:07Z

I never saw clear use-cases for TimeGrouper but I could be convinced.


shoyer · 2019-02-03T03:21:52Z

Implemented in #2593

shoyer mentioned this pull request Oct 2, 2018

Adding resample functionality to CFTimeIndex #2191

Closed

spencerkclark reviewed Oct 2, 2018


New implementation using cftime_range

b488148

jhamman added the topic-CF conventions label Oct 10, 2018

jwenfai mentioned this pull request Nov 27, 2018

Resample v2 clean Ouranosinc/xarray#1

Merged

shoyer closed this Feb 3, 2019
