Groupby-like API for resampling #1272

Merged
merged 55 commits into from Sep 22, 2017
Changes from 52 commits
0767397
Implement basic functionality by adding "DataArrayResample" and "Data…
Feb 16, 2017
fce727f
Re-factor old resample logic to separate method
Feb 16, 2017
afa31fc
Adding test cases for new api
Feb 16, 2017
829b4c1
Adding more DataArray resample tests
Feb 16, 2017
9b742c4
Adding test_dataset test cases
Feb 16, 2017
0ec8b71
Update to use proxy __resample_dim__ on resampling ops
Feb 20, 2017
09a6989
Re-factor proxy resampling dimension and catch rare error if used as …
Feb 20, 2017
afa14e4
BUG: Fixed loss of attrs on DatasetResample even when keep_attrs=True
Feb 20, 2017
1fec1f9
Update docs to add information about new resampling api
Feb 20, 2017
4f70131
Adding 'Whats new' entry
Feb 20, 2017
3a05c50
Tweak auxiliary groupby apply and reduce methods
Apr 3, 2017
464a067
Squash bugs from py.test following rebase to master
Jul 19, 2017
c213de9
Fixing typo in groupby.py
Jul 26, 2017
db500d2
Add a test for resampling which includes timezones in the datetime st…
Aug 3, 2017
4f29932
Rolling back the timezone tests... this will need to be tackled separ…
Aug 3, 2017
5f4d6a5
Re-factored resample into it's own module for easier maintenance
Aug 3, 2017
3e2cc45
Adding support for 'count' via old api
Aug 3, 2017
a98bb2e
Re-organizing new vs old resampling api tests
Aug 3, 2017
2664b8e
Expanded old-vs-new api tests for Dataset to replace deprecated tests
Aug 3, 2017
ee4b2ef
Consolidated old reasmpling api tests into new-vs-old for dataarray
Aug 3, 2017
07e6fb1
Wrapping old api test invocations with pytest.warns
Aug 3, 2017
304e250
Added stub tests for upsampling
Aug 4, 2017
949291a
Update documentation with upsampling stub
Aug 4, 2017
c898b23
Factor out a Resample object and add initial up-sampling methods - bf…
Aug 4, 2017
37c8e8b
Add interpolation up-sampling to DataArray
Aug 4, 2017
6949c06
Refine DataArray upsampling interpolation and extend to Dataset
Aug 4, 2017
d85fa81
Fix wrong time dimension length on test cases for upsampling
Aug 4, 2017
4177d79
First initial revisions to @shoyer's comments; before modifying imple…
Aug 19, 2017
a7bd1fd
Tweaks to resample.py to lean on super-methods
Aug 19, 2017
0f071ee
Implementing interpolation test cases
Aug 19, 2017
985600e
BUG: Fix asfreq only returning 1D data in nd case
Aug 19, 2017
2e985c6
Add pad/asfreq upsampling tests
Aug 19, 2017
bc58b05
Add a check if old/new api is mixed; remove old api details from resa…
Aug 19, 2017
529406f
Fix an old bug in datetime components of timeseries doc
Aug 19, 2017
b6cf938
Tweaking time-series doc
Aug 19, 2017
406f4e2
Added what's new entry
Aug 19, 2017
ce97f3a
Drop existing non-dimension coordinates along resample dimension
Sep 1, 2017
2a7efee
Update seaborn to v0.8 to fix issues with default plot styles
Sep 11, 2017
829d292
nearest-neighbor up-sampling now relies on re-index instead of interp…
Sep 11, 2017
38f6d86
Adding nearest upsampling test; tweaked inference of re-indexing dime…
Sep 11, 2017
b2307d0
Move what's new entry to breaking changes
Sep 11, 2017
85ed5ba
Updating docs in breaking changes with example and link to timeseries…
Sep 11, 2017
5082040
BUG: Fixing creating merged coordinates for Dataset upsampling case
Sep 11, 2017
ed8d5c9
Remove old notice about resampling api updates
Sep 11, 2017
5d23a99
Applying shoyer's clean-up of figuring out valid coords on Dataset up…
Sep 11, 2017
8c7d6cf
Add note about monotonicity assumption before interpolation
Sep 11, 2017
31e5510
fix some pep8 and comments
Sep 11, 2017
9b43d00
More informative error message when resampling/interpolating dask arrays
Sep 13, 2017
2839107
Merge branch 'master' into refactor-resample-api
Sep 13, 2017
9a92211
Fix flake8
Sep 20, 2017
6df6dde
Fixing issues with test cases, including adding one for upsampling da…
Sep 20, 2017
af1ab3d
Merge branch 'master' into refactor-resample-api
Sep 20, 2017
d03b25f
Clean up scipy imports
Sep 20, 2017
dd11565
Adding additional tweaks to cover scipy/numpy dependencies in tests
Sep 21, 2017
5cfba57
Merge branch 'master' into refactor-resample-api
Sep 22, 2017
2 changes: 1 addition & 1 deletion doc/environment.yml
@@ -8,7 +8,7 @@ dependencies:
- pandas=0.20.1
- numpydoc=0.6.0
- matplotlib=2.0.0
- seaborn=0.7.1
- seaborn=0.8
- dask=0.12.0
- ipython=5.1.0
- sphinx=1.5
2 changes: 1 addition & 1 deletion doc/examples/weather-data.rst
@@ -68,7 +68,7 @@ Monthly averaging

.. ipython:: python

monthly_avg = ds.resample('1MS', dim='time', how='mean')
monthly_avg = ds.resample(time='1MS').mean()

@savefig examples_tmin_tmax_plot_mean.png
monthly_avg.sel(location='IA').to_dataframe().plot(style='s-')
40 changes: 32 additions & 8 deletions doc/time-series.rst
@@ -15,6 +15,7 @@ core functionality.
import numpy as np
import pandas as pd
import xarray as xr

np.random.seed(123456)

Creating datetime64 data
@@ -95,8 +96,8 @@ given ``DataArray`` can be quickly computed using a special ``.dt`` accessor.

.. ipython:: python

time = time = pd.date_range('2000-01-01', freq='6H', periods=365 * 4)
ds = xr.Dataset({'foo': ('time', np.arange(365 * 24)), 'time': time})
time = pd.date_range('2000-01-01', freq='6H', periods=365 * 4)
ds = xr.Dataset({'foo': ('time', np.arange(365 * 4)), 'time': time})
ds.time.dt.hour
ds.time.dt.dayofweek
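As a plain-Python illustration of what the ``.dt`` accessor extracts, the same components can be pulled from stdlib ``datetime`` objects (a sketch, not xarray's implementation; the ``times`` list stands in for the 6-hourly ``time`` coordinate built above):

```python
from datetime import datetime, timedelta

# Stand-in for the 6-hourly 'time' coordinate constructed above
times = [datetime(2000, 1, 1) + timedelta(hours=6 * i) for i in range(4)]

hours = [t.hour for t in times]           # like ds.time.dt.hour
dayofweek = [t.weekday() for t in times]  # Monday == 0, matching pandas

# 2000-01-01 was a Saturday, so every sample falls on weekday 5
```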

@@ -128,6 +129,8 @@ the first letters of the corresponding months.

You can use these shortcuts with both Datasets and DataArray coordinates.

.. _resampling:

Resampling and grouped operations
---------------------------------

@@ -150,17 +153,38 @@ For example, we can downsample our dataset from hourly to 6-hourly:

.. ipython:: python

ds.resample(time='6H')

This will create a specialized ``Resample`` object which saves the information
necessary for resampling. All of the standard reduction methods can then be
called on the ``Resample`` object to carry out the resampling:

.. ipython:: python

ds.resample(time='6H').mean()

You can also supply an arbitrary reduction function to aggregate over each
resampling group:

.. ipython:: python

ds.resample(time='6H').reduce(np.mean)
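The downsampling semantics above can be sketched in plain Python. This is a toy model, not xarray's implementation (``resample_mean`` is a hypothetical helper): samples are bucketed into fixed-width bins along the time dimension and each bin is reduced by its mean.

```python
from datetime import datetime, timedelta

def resample_mean(times, values, bin_hours):
    """Toy downsampler: group samples into bin_hours-wide bins anchored
    at midnight, then average each bin."""
    bins = {}
    for t, v in zip(times, values):
        # Bin label = start of the interval containing t
        label = t.replace(hour=(t.hour // bin_hours) * bin_hours,
                          minute=0, second=0, microsecond=0)
        bins.setdefault(label, []).append(v)
    return {label: sum(vs) / len(vs) for label, vs in bins.items()}

hourly = [datetime(2000, 1, 1) + timedelta(hours=i) for i in range(12)]
six_hourly = resample_mean(hourly, list(range(12)), 6)
# Bin starting 00:00 averages hours 0-5; bin starting 06:00 averages 6-11
```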

For upsampling, xarray provides four methods: ``asfreq``, ``ffill``, ``bfill``,
and ``interpolate``. ``interpolate`` extends ``scipy.interpolate.interp1d`` and
supports all of its schemes. All of these resampling operations work on both
Dataset and DataArray objects with an arbitrary number of dimensions.
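The forward-fill flavour of upsampling described above can be sketched with the stdlib (``ffill_upsample`` is a hypothetical helper; real xarray implements this via reindexing, not this loop):

```python
import bisect

def ffill_upsample(src_times, src_values, new_times):
    """Toy forward-fill upsampler: each new time takes the most recent
    source value at or before it (None if it precedes the first sample)."""
    out = []
    for t in new_times:
        i = bisect.bisect_right(src_times, t) - 1
        out.append(src_values[i] if i >= 0 else None)
    return out

# Source samples every 2 steps, upsampled to every step:
upsampled = ffill_upsample([0, 2, 4], ['a', 'b', 'c'], [0, 1, 2, 3, 4])
```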

.. note::

The ``resample`` API was updated in version 0.10.0 to reflect similar
updates in the pandas ``resample`` API, making it more groupby-like. Old
style calls to ``resample`` will still be supported for a short period:

.. ipython:: python

ds.resample('6H', dim='time', how='mean')

Of course, all of these resampling and groupby operations work on both Dataset
and DataArray objects with any number of additional dimensions.

For more examples of using grouped operations on a time dimension, see
:ref:`toy weather data`.
30 changes: 30 additions & 0 deletions doc/whats-new.rst
@@ -27,6 +27,36 @@ Breaking changes
(:issue:`727`).
By `Joe Hamman <https://github.com/jhamman>`_.

- A new resampling interface to match pandas' group-by-like API was added to
:py:meth:`~xarray.Dataset.resample` and :py:meth:`~xarray.DataArray.resample`
(:issue:`1272`). :ref:`Timeseries resampling <resampling>` is fully
supported for data with arbitrary dimensions, as are both downsampling
and upsampling (including linear, quadratic, cubic, and spline interpolation).

Old syntax:

.. ipython::
:verbatim:

In [1]: ds.resample('24H', dim='time', how='max')
Out[1]:
<xarray.Dataset>
[...]

New syntax:

.. ipython::
:verbatim:

In [1]: ds.resample(time='24H').max()
Out[1]:
<xarray.Dataset>
[...]

Note that both versions are currently supported, but using the old syntax will
produce a warning encouraging users to adopt the new syntax. By `Daniel
Rothenberg <https://github.com/darothen>`_.
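The deferred, groupby-like behaviour this entry describes can be sketched with a minimal stand-in class (illustrative names only, not xarray's actual classes): calling ``resample`` only builds the groups, and the reduction runs when a method such as ``.max()`` is invoked.

```python
class ToyResample:
    """Intermediate object returned by resample(); holds the groups and
    defers any reduction until a method is called."""
    def __init__(self, groups):
        self._groups = groups  # {bin_label: [values]}

    def max(self):
        return {k: max(v) for k, v in self._groups.items()}

    def mean(self):
        return {k: sum(v) / len(v) for k, v in self._groups.items()}

def toy_resample(values, bin_size):
    """Bucket consecutive values into bins of bin_size, like a fixed-width
    time grouping, and return the deferred object."""
    groups = {}
    for i, v in enumerate(values):
        groups.setdefault(i // bin_size, []).append(v)
    return ToyResample(groups)

r = toy_resample([1, 5, 2, 8], 2)  # analogous to ds.resample(time='24H')
maxes = r.max()                    # reduction happens only now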

Backward Incompatible Changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

90 changes: 59 additions & 31 deletions xarray/core/common.py
@@ -3,6 +3,7 @@
from __future__ import print_function
import numpy as np
import pandas as pd
import warnings

from .pycompat import basestring, suppress, dask_array_type, OrderedDict
from . import dtypes
@@ -479,55 +480,33 @@ def rolling(self, min_periods=None, center=False, **windows):
return self._rolling_cls(self, min_periods=min_periods,
center=center, **windows)

def resample(self, freq, dim, how='mean', skipna=None, closed=None,
label=None, base=0, keep_attrs=False):
"""Resample this object to a new temporal resolution.
def resample(self, freq=None, dim=None, how=None, skipna=None,
closed=None, label=None, base=0, keep_attrs=False, **indexer):
"""Returns a Resample object for performing resampling operations.

Handles both downsampling and upsampling. Upsampling with filling is
not yet supported; if any intervals contain no values in the original
not supported; if any intervals contain no values from the original
object, they will be given the value ``NaN``.

Parameters
----------
freq : str
String in the '#offset' to specify the step-size along the
resampled dimension, where '#' is an (optional) integer multipler
(default 1) and 'offset' is any pandas date offset alias. Examples
of valid offsets include:

* 'AS': year start
* 'QS-DEC': quarterly, starting on December 1
* 'MS': month start
* 'D': day
* 'H': hour
* 'Min': minute

The full list of these offset aliases is documented in pandas [1]_.
dim : str
Name of the dimension to resample along (e.g., 'time').
how : str or func, optional
Used for downsampling. If a string, ``how`` must be a valid
aggregation operation supported by xarray. Otherwise, ``how`` must be
a function that can be called like ``how(values, axis)`` to reduce
ndarray values along the given axis. Valid choices that can be
provided as a string include all the usual Dataset/DataArray
aggregations (``all``, ``any``, ``argmax``, ``argmin``, ``max``,
``mean``, ``median``, ``min``, ``prod``, ``sum``, ``std`` and
``var``), as well as ``first`` and ``last``.
skipna : bool, optional
Whether to skip missing values when aggregating in downsampling.
closed : 'left' or 'right', optional
Side of each interval to treat as closed.
label : 'left or 'right', optional
Side of each interval to use for labeling.
base : int, optionalt
base : int, optional
For frequencies that evenly subdivide 1 day, the "origin" of the
aggregated intervals. For example, for '24H' frequency, base could
range from 0 through 23.
keep_attrs : bool, optional
If True, the object's attributes (`attrs`) will be copied from
the original object to the new one. If False (default), the new
object will be returned without attributes.
**indexer : {dim: freq}
Dictionary with a key indicating the dimension name to resample
over and a value corresponding to the resampling frequency.

Returns
-------
@@ -540,18 +519,67 @@ def resample(self, freq, dim, how='mean', skipna=None, closed=None,
.. [1] http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
"""
from .dataarray import DataArray
from .resample import RESAMPLE_DIM

if dim is not None:
if how is None:
how = 'mean'
return self._resample_immediately(freq, dim, how, skipna, closed,
label, base, keep_attrs)

Review comment (Member):

Should we explicitly disallow the specification of ``how`` when an indexer is
provided? I don't think we need to support that functionality since we know it
will be deprecated.

Review comment (Member):

+1, that should be a ``TypeError``.

if (how is not None) and indexer:
raise TypeError("If passing an 'indexer' then 'dim' "
"and 'how' should not be used")

# More than one indexer is ambiguous, but we do in fact need one if
# "dim" was not provided, until the old API is fully deprecated
if len(indexer) != 1:
raise ValueError(
"Resampling only supported along single dimensions."
)
dim, freq = indexer.popitem()

if isinstance(dim, basestring):
dim_name = dim
dim = self[dim]
else:
raise TypeError("Dimension name should be a string; "
"was passed %r" % dim)
group = DataArray(dim, [(dim.dims, dim)], name=RESAMPLE_DIM)
time_grouper = pd.TimeGrouper(freq=freq, closed=closed,
label=label, base=base)
resampler = self._resample_cls(self, group=group, dim=dim_name,
grouper=time_grouper,
resample_dim=RESAMPLE_DIM)

return resampler
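The keyword-indexer handling above boils down to a small validation step, sketched here in isolation (``parse_resample_indexer`` is a hypothetical helper, not part of xarray): exactly one ``dim=freq`` keyword selects the new API, mixing it with the deprecated ``how`` is a ``TypeError``, and more than one indexer is ambiguous.

```python
def parse_resample_indexer(how=None, **indexer):
    """Validate the new-style resample arguments and return (dim, freq)."""
    if how is not None and indexer:
        # Mixing old-style 'how' with a new-style indexer is disallowed
        raise TypeError("If passing an 'indexer' then 'dim' "
                        "and 'how' should not be used")
    if len(indexer) != 1:
        # Resampling along several dimensions at once is ambiguous
        raise ValueError("Resampling only supported along single dimensions.")
    return indexer.popitem()  # -> (dim_name, freq)

dim_name, freq = parse_resample_indexer(time='6H')
```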

def _resample_immediately(self, freq, dim, how, skipna,
closed, label, base, keep_attrs):
"""Implement the original version of .resample() which immediately
executes the desired resampling operation. """
from .dataarray import DataArray
RESAMPLE_DIM = '__resample_dim__'
Review comment (Member):

There might be a better way to handle this, but currently using RESAMPLE_DIM is
actually pretty critical here, and removing it is probably the source of some
of the bugs you are seeing with your new implementation. The problem is that we
need to keep track of the resampled time variable in a dataset along with the
original time variable; if we give the resampled group variable the same name,
xarray gets them mixed up.


warnings.warn("\n.resample() has been modified to defer "
"calculations. Instead of passing 'dim' and "
"how=\"{how}\", consider using "
".resample({dim}=\"{freq}\").{how}()".format(
dim=dim, freq=freq, how=how
), DeprecationWarning, stacklevel=3)

if isinstance(dim, basestring):
dim = self[dim]
group = DataArray(dim, [(RESAMPLE_DIM, dim)], name=RESAMPLE_DIM)
group = DataArray(dim, [(dim.dims, dim)], name=RESAMPLE_DIM)
time_grouper = pd.TimeGrouper(freq=freq, how=how, closed=closed,
label=label, base=base)
gb = self._groupby_cls(self, group, grouper=time_grouper)
if isinstance(how, basestring):
f = getattr(gb, how)
if how in ['first', 'last']:
result = f(skipna=skipna, keep_attrs=keep_attrs)
elif how == 'count':
result = f(dim=dim.name, keep_attrs=keep_attrs)
else:
result = f(dim=dim.name, skipna=skipna, keep_attrs=keep_attrs)
else:
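The deprecation path in ``_resample_immediately`` above follows a standard shim pattern, sketched here with the stdlib (``old_api_resample`` is a hypothetical name): emit a ``DeprecationWarning`` steering callers to the new syntax, then fall back to the old behaviour. A ``stacklevel`` above 1 attributes the warning to the caller's line rather than to the shim itself.

```python
import warnings

def old_api_resample(freq, dim, how='mean'):
    """Deprecation shim: warn, then perform the old-style computation."""
    warnings.warn(
        ".resample(%r, dim=%r, how=%r) is deprecated; use "
        ".resample(%s=%r).%s() instead" % (freq, dim, how, dim, freq, how),
        DeprecationWarning, stacklevel=2)
    return (dim, freq, how)  # stand-in for the immediate computation

# DeprecationWarning is hidden by default, so capture it to demonstrate:
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = old_api_resample('6H', 'time')
```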
14 changes: 12 additions & 2 deletions xarray/core/dataarray.py
@@ -12,6 +12,7 @@
from . import duck_array_ops
from . import indexing
from . import groupby
from . import resample
from . import rolling
from . import ops
from . import utils
@@ -34,7 +35,7 @@ def _infer_coords_and_dims(shape, coords, dims):
"""All the logic for creating a new DataArray"""

if (coords is not None and not utils.is_dict_like(coords) and
len(coords) != len(shape)):
len(coords) != len(shape)):
raise ValueError('coords is not dict-like, but it has %s items, '
'which does not match the %s dimensions of the '
'data' % (len(coords), len(shape)))
@@ -115,6 +116,7 @@ class _ThisArray(object):
"""An instance of this object is used as the key corresponding to the
variable when converting arbitrary DataArray objects to datasets
"""

def __repr__(self):
return '<this-array>'

@@ -159,6 +161,8 @@ class DataArray(AbstractArray, BaseDataObject):
"""
_groupby_cls = groupby.DataArrayGroupBy
_rolling_cls = rolling.DataArrayRolling
_resample_cls = resample.DataArrayResample

dt = property(DatetimeAccessor)

def __init__(self, data, coords=None, dims=None, name=None,
@@ -1490,8 +1494,10 @@ def from_cdms2(cls, variable):

def _all_compat(self, other, compat_str):
"""Helper function for equals and identical"""

def compat(x, y):
return getattr(x.variable, compat_str)(y.variable)

return (utils.dict_equiv(self.coords, other.coords, compat=compat) and
compat(self, other))

@@ -1565,6 +1571,7 @@ def _unary_op(f):
@functools.wraps(f)
def func(self, *args, **kwargs):
return self.__array_wrap__(f(self.variable.data, *args, **kwargs))

return func

@staticmethod
@@ -1574,7 +1581,8 @@ def func(self, other):
if isinstance(other, (Dataset, groupby.GroupBy)):
return NotImplemented
if hasattr(other, 'indexes'):
align_type = OPTIONS['arithmetic_join'] if join is None else join
align_type = (OPTIONS['arithmetic_join']
if join is None else join)
self, other = align(self, other, join=align_type, copy=False)
other_variable = getattr(other, 'variable', other)
other_coords = getattr(other, 'coords', None)
@@ -1586,6 +1594,7 @@ def func(self, other):
name = self._result_name(other)

return self._replace(variable, coords, name)

return func

@staticmethod
@@ -1604,6 +1613,7 @@ def func(self, other):
with self.coords._merge_inplace(other_coords):
f(self.variable, other_variable)
return self

return func

def _copy_attrs_from(self, other):
2 changes: 2 additions & 0 deletions xarray/core/dataset.py
@@ -14,6 +14,7 @@
from . import ops
from . import utils
from . import groupby
from . import resample
from . import rolling
from . import indexing
from . import alignment
@@ -305,6 +306,7 @@ class Dataset(Mapping, ImplementsDatasetReduce, BaseDataObject,
"""
_groupby_cls = groupby.DatasetGroupBy
_rolling_cls = rolling.DatasetRolling
_resample_cls = resample.DatasetResample

def __init__(self, data_vars=None, coords=None, attrs=None,
compat='broadcast_equals'):