
Groupby-like API for resampling #1272

Merged: 55 commits, Sep 22, 2017

Conversation

@darothen commented Feb 16, 2017

This is a work-in-progress to resolve #1269.

  • Basic functionality
  • Cleanly deprecate old API
  • New test cases
  • Documentation / examples
  • "What's new"

I openly welcome feedback/critiques on how I approached this. Subclassing Data{Array/set}GroupBy may not be the best way, but it would be easy enough to re-write the necessary helper functions (just apply(), I think) so that we do not need to inherit from them directly. Additional issues I'm working to resolve:

  • I tried to make sure that calls using the old API won't break by refactoring the old logic into _resample_immediately(). This may not be the best approach!
  • Similarly, I copied all the original test cases and added the suffix ..._old_api; these could trivially be placed into their related test cases for the new API.
  • BUG: keep_attrs is ignored when you call it on methods chained to Dataset.resample(). Oddly enough, if I hard-code keep_attrs=True inside reduce_array() in DatasetResample::reduce it works just fine. I haven't figured out where the kwarg is getting lost.
  • BUG: Some of the test cases (for instance, test_resample_old_vs_new_api) fail because resampling by calling self.groupby_cls ends up not working - it crashes because the group sizes that get computed are not what it expects. This occurs with both the new and old APIs.

".resample({dim}=\"{freq}\").{how}() ".format(
dim=dim, freq=freq, how=how
), DeprecationWarning, stacklevel=3)

if isinstance(dim, basestring):
dim = self[dim]
group = DataArray(dim, [(RESAMPLE_DIM, dim)], name=RESAMPLE_DIM)
Member:

Note: I think this should probably be group = DataArray(dim, [(dim.dims, dim)], name=RESAMPLE_DIM) but for some reason this works.

                              closed, label, base, keep_attrs):
        """Implement the original version of .resample() which immediately
        executes the desired resampling operation."""
        from .dataarray import DataArray
        RESAMPLE_DIM = '__resample_dim__'
Member:

There might be a better way to handle this, but currently using RESAMPLE_DIM is actually pretty critical here, and removing it is probably the source of some of the bugs you are seeing with your new implementation. The problem is that we need to keep track of the resampled time variable in a dataset along with the original time variable. If we give the resampled group variable the same name, then xarray gets them mixed up.

@darothen (Author):

Smoothed out most of the problems and missing details from earlier. Still not sure if it's wise to refactor most of the resampling logic into a new resample.py, like what was done with rolling, but it still makes some sense to keep things in groupby.py because we're just subclassing existing machinery from there.

The only issue now is the signature for __init__() in Data{set,Array}Resample, where we have to add two keyword arguments. Python 2.x doesn't allow named arguments after *args. There are a few options here, mostly just playing with **kwargs as in this StackOverflow thread.

@shoyer (Member) commented Feb 20, 2017

The only issue now is the signature for __init__() in Data{set,Array}Resample, where we have to add two keyword arguments. Python 2.x doesn't allow named arguments after *args. There are a few options here, mostly just playing with **kwargs as in this StackOverflow thread.

Yes, use pop, e.g., dim = kwargs.pop('dim', None). pop removes the arguments from kwargs, so you can pass on the remaining ones unchanged to the super class method.
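A minimal sketch of that pattern, using generic placeholder class names rather than the PR's actual classes:

class Base(object):
    def __init__(self, obj, group=None):
        self.obj = obj
        self.group = group


class Resample(Base):
    def __init__(self, *args, **kwargs):
        # Python 2 forbids keyword-only arguments after *args, so pop the
        # extra keywords out of **kwargs before delegating to the parent.
        self.dim = kwargs.pop('dim', None)
        self.resample_dim = kwargs.pop('resample_dim', None)
        super(Resample, self).__init__(*args, **kwargs)

Resample(data, group=group, dim='time', resample_dim='__resample_dim__') then works identically under Python 2 and 3, and the parent class never sees the extra keywords.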

@shoyer (Member) left a review comment:

It would be good to think about valid uses (and add some test coverage) for groupby-like resampling that doesn't use aggregation. For example, what happens if you want to use resample with arithmetic, .apply or to iterate over groups? This means that the ugly '__resample_dim__' name will leak out into the external API.

We still need a way to deal with redundant dimension/coordinate names, but maybe we can give a slightly friendlier name for the coordinate, e.g., resampled_time if time is the name of the resampled coordinate.

                          **kwargs)
        result = self.apply(reduce_array, shortcut=shortcut)

        return result.rename({self._resample_dim: self._dim})
Member:

Maybe put this .rename({self._resample_dim: self._dim}) business at the end of .apply instead? That would give a slightly better result for non-reduce uses of the new resample (e.g., for arithmetic).
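A rough sketch of that suggestion, assuming the _dim/_resample_dim attributes used elsewhere in this PR and xarray's DataArrayGroupBy as the base class (not the final implementation):

from xarray.core.groupby import DataArrayGroupBy


class DataArrayResample(DataArrayGroupBy):
    def __init__(self, *args, **kwargs):
        self._dim = kwargs.pop('dim', None)
        self._resample_dim = kwargs.pop('resample_dim', None)
        super(DataArrayResample, self).__init__(*args, **kwargs)

    def apply(self, func, shortcut=False, **kwargs):
        combined = super(DataArrayResample, self).apply(
            func, shortcut=shortcut, **kwargs)
        # Renaming here, rather than only in reduce(), means arithmetic and
        # other non-reduce uses also get the original dimension name back.
        if self._resample_dim in combined.dims:
            combined = combined.rename({self._resample_dim: self._dim})
        return combined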

        with self.assertRaisesRegexp(ValueError, 'index must be monotonic'):
            array[[2, 0, 1]].resample(time='1D')

    def test_resample_old_api(self):
Member:

It would probably be better to remove some of these old API tests in preference to tests using the new API. We don't really need all of them to be sure the old API still works.

array = DataArray(np.ones(10), [('time', times)])

# Simple mean
old_mean = array.resample('1D', 'time', how='mean')
Member:

Use pytest.warns around each use of the old API to verify that the right warning is raised (and also to ensure the warnings don't get issued when we run the test suite).
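For instance, a small illustration of the requested wrapper (the array and times here are placeholders):

import numpy as np
import pandas as pd
import pytest
from xarray import DataArray

times = pd.date_range('2000-01-01', freq='6H', periods=10)
array = DataArray(np.ones(10), [('time', times)])

# The deprecated signature should still work but must warn; wrapping the call
# in pytest.warns both asserts the warning and keeps it out of the test output.
with pytest.warns(DeprecationWarning):
    old_mean = array.resample('1D', 'time', how='mean')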


ds.resample(time='6H').reduce(np.mean)

Resampling does not yet work for upsampling.
Member:

I think we need to fix this before merging this PR, since it suggests the existing functionality would only exist in deprecated form. Pandas does this with a method called .asfreq, though this is basically pure sugar since in practice I think it works exactly the same as .first (or .mean if only doing pure upsampling).

@darothen (Author) commented Feb 20, 2017 via email

@darothen (Author) commented Mar 1, 2017

Should .apply() really work on non-aggregation functions? Based on the pandas documentation it seems like "resample" is truly just a synonym for a transformation of the time dimension. I can't really find many examples of people using this as a substitute for time group-bys... it seems that's what the pd.TimeGrouper is for, in conjunction with a normal .groupby().

As written, non-aggregation ("transformation"?) doesn't work because the call in _combine() to _maybe_reorder() messes things up (it drops all of the data along the resampled dimension). It shouldn't be too hard to fix this, although I'm leaning more and more to making stand-alone Data{Array,set}Resample classes in a separate file which only loosely inherit from their Data{Array,set}GroupBy cousins, since they need to re-write some really critical parts of the underlying machinery.

@shoyer (Member) commented Mar 1, 2017

I can't really find many examples of people using this as a substitute for time group-bys... it seems that's what the pd.TimeGrouper is for, in conjunction with a normal .groupby().

I think this is mostly because TimeGrouper has been around far longer than non-aggregating resample.

@jhamman (Member) commented Jul 13, 2017

@darothen - can you give a summary of what's left to do here?

@darothen (Author):

I think a pull against the new releases is critical to see what breaks. Beyond that, just code clean up and testing. I can try to bump this higher on my priority list.

@darothen (Author):

I did my best to re-base everything to master... plan on spending an hour or so figuring out what's broken and at least restoring the status quo.

Un-do _combine temporary debugging output
@darothen (Author) commented Jul 19, 2017

TODO

  • ensure that count() works on Data{Array,set}Resample objects
  • refactor Data{Array,set}Resample objects into a stand-alone file core/resample.py alongside core/groupby.py
  • wrap pytest.warns around tests targeting old API
  • move old API tests into stand-alone
  • Crude up-sampling. Copy/pasting Stephan's earlier comment from Feb 20:

I think we need to fix this before merging this PR, since it suggests the existing functionality would only exist in deprecated form. Pandas does this with a method called .asfreq, though this is basically pure sugar since in practice I think it works exactly the same as .first (or .mean if only doing pure upsampling).


Alright @jhamman, here's the complete list of work left here. I'll tackle some of it during my commutes this week.

        for how in ['mean', 'sum', 'first', 'last', ]:
            method = getattr(actual, how)
            result = method()
            self.assertDatasetEqual(expected, result)
Collaborator:
No need to change, but in future you can use pytest parameters for these sorts of loops, and assert_equal rather than the self... methods throughout
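For example, a hypothetical parametrized version of the loop above (names and data are placeholders; it cross-checks against pandas rather than the fixtures used in the real test):

import numpy as np
import pandas as pd
import pytest
from xarray import DataArray
from xarray.testing import assert_equal


@pytest.mark.parametrize('how', ['mean', 'sum', 'first', 'last'])
def test_resample_reductions(how):
    times = pd.date_range('2000-01-01', freq='6H', periods=8)
    array = DataArray(np.arange(8.0), [('time', times)])
    result = getattr(array.resample(time='1D'), how)()
    # Build the expected values with pandas as an independent cross-check.
    expected_values = getattr(array.to_series().resample('1D'), how)().values
    expected = DataArray(expected_values, [('time', result['time'].values)])
    assert_equal(expected, result)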

@darothen (Author):

@shoyer fixed.

@shoyer (Member) left a review comment:

I think this is good to go! I'll merge this in a day or two, once others have had a chance to look it over.

"variable '{}', but it is a dask array; dask arrays not "
"yet supprted in resample.interpolate(). Load into "
"memory with Dataset.load() before resampling."
.format(name)
Member:
love it -- thanks!

----------
kind : str {'linear', 'nearest', 'zero', 'slinear',
'quadratic', 'cubic'}
Interpolation scheme to use
Member:

add a period

@shoyer changed the title from "WIP: Groupby-like API for resampling" to "Groupby-like API for resampling" on Sep 13, 2017
@jhamman (Member) left a review comment:

I gave this another pass through. I found a number of pep8 violations (run git diff upstream/master | flake8 --diff) that we should fix. One comment of substance: we probably need better test coverage on the new errors.

  • Passes git diff upstream/master | flake8 --diff

label=label, base=base)
resampler = self._resample_cls(self, group=group, dim=dim_name,
grouper=time_grouper,
resample_dim=resample_dim)
Member:
fix indentation

if isinstance(dim, basestring):
dim_name = dim
Member:
dim_name is never used

else:
result = f(dim=dim.name, skipna=skipna, keep_attrs=keep_attrs)
else:
result = gb.reduce(how, dim=dim.name, keep_attrs=keep_attrs)
result = result.rename({RESAMPLE_DIM: dim.name})
return result


Member:

PEP8: remove extra blank line

@@ -34,7 +35,7 @@ def _infer_coords_and_dims(shape, coords, dims):
"""All the logic for creating a new DataArray"""

if (coords is not None and not utils.is_dict_like(coords) and
len(coords) != len(shape)):
len(coords) != len(shape)):
Member:
fix indentation

"yet supprted in resample.interpolate(). Load into "
"memory with Dataset.load() before resampling."
.format(name)
)
Member:
Can we add a unit test that covers this TypeError? Right now, I think you'll actually get a NameError because name is not defined.
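Something along these lines could cover it (a hypothetical test; it assumes dask and scipy are installed and that the check lives in the Dataset interpolate path):

import numpy as np
import pandas as pd
import pytest
from xarray import Dataset


def test_upsample_interpolate_dask_raises():
    times = pd.date_range('2000-01-01', freq='6H', periods=5)
    ds = Dataset({'data': ('time', np.arange(5.0))}, {'time': times})
    ds = ds.chunk({'time': 1})  # back the variable with a dask array
    with pytest.raises(TypeError):
        ds.resample(time='1H').interpolate('linear')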

            return DataArray(f(new_x), coords, dims, name=dummy.name,
                             attrs=dummy.attrs)

ops.inject_reduce_methods(DataArrayResample)
Member:

PEP8: add a second empty line after class definition

# convention from old api
new_api = getattr(resampler, method)(keep_attrs=False)
with pytest.warns(DeprecationWarning):
old_api = array.resample('1D', dim='time', how=method)
Member:

indentation

def test_resample_mean_keep_attrs(self):
array = DataArray(np.arange(10), [('__resample_dim__', times)])
actual = array.resample('1D', dim='__resample_dim__', how='first')
self.assertRaisesRegexp(ValueError, 'Proxy resampling dimension')
Member:

I think this needs to be:

with self.assertRaisesRegexp(ValueError, 'Proxy resampling dimension'):
    actual = array.resample('1D', dim='__resample_dim__', how='first')

I'm not sure why this is passing actually.

Author:

This was butchered in more ways than one; fixing the test revealed some underlying flaws, which are now also fixed.

ycoord = DataArray(yy.T, {'x': xs, 'y': ys}, ('x', 'y'))
tcoord = DataArray(tt, {'time': times}, ('time', ))
ds = Dataset({'data': array, 'xc': xcoord,
'yc': ycoord, 'tc': tcoord})
Member:
indentation

ycoord = DataArray(yy.T, {'x': xs, 'y': ys}, ('x', 'y'))
tcoord = DataArray(tt, {'time': times}, ('time', ))
ds = Dataset({'data': array, 'xc': xcoord,
'yc': ycoord, 'tc': tcoord})
Member:
indentation

@darothen (Author):

@jhamman Gotcha, I'll clean everything up by the end of the week. If that's going to block 0.10.0, let me know and I'll shuffle some things around to prioritize this.

@jhamman (Member) commented Sep 19, 2017

@darothen - we have a few other PRs to wrap up for 0.10 so end of the week is okay.

@darothen (Author):

@jhamman Think we're good. I deferred 4 small PEP8 issues because they're in parts of the codebase I don't think I ever touched, and I'm worried they're going to screw up the merge.

@jhamman (Member) commented Sep 20, 2017

Thanks @darothen - can you resolve the merge conflicts?

@darothen (Author):

@jhamman done - caught me right while I was compiling GEOS-Chem, and the merge conflicts were very simple.

@jhamman (Member) left a review comment:

Two more changes need to be made... these were caught by Travis.

                              ('x', 'y', 'time'))
        self.assertDataArrayIdentical(expected, actual)

    def test_upsample_interpolate(self):
Member:

this needs a @requires_scipy decorator or put

interp1d = pytest.importorskip('scipy.interpolate.interp1d')

in the first line of the test (instead of the import)

    def _interpolate(self, kind='linear'):
        """Apply scipy.interpolate.interp1d along resampling dimension."""
        from .dataarray import DataArray

Member:
let's import interp1d here, instead of at the module level. scipy is still an optional dependency.

Author:

Gotcha - it needed to be imported under DatasetResample, too.
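A minimal sketch of the deferred-import pattern (class name follows the PR, the method body is elided):

class DataArrayResample(object):
    def _interpolate(self, kind='linear'):
        # Importing inside the method means merely importing xarray never
        # requires scipy to be installed.
        from scipy.interpolate import interp1d  # noqa: F401
        raise NotImplementedError('interpolation body elided in this sketch')

As noted above, DatasetResample._interpolate needs the same local import.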

@jhamman (Member) commented Sep 21, 2017

@darothen - almost there. Two or three more dependency conflicts in the tests.

@darothen (Author) commented Sep 21, 2017

@jhamman Ohhh, I totally misunderstood the last readout from travis-ci. Dealing with the scipy dependency is easy enough. However, another test fails because it uses np.flip(), which wasn't added to numpy until v1.12.0. Do we want to bump the numpy version in the dependencies? Or is there another approach to take here?

Never mind, the easy solution is just to use other axis-reversal methods :)

@shoyer (Member) commented Sep 21, 2017

We just bumped numpy to 1.11, but 1.12 would be too new.

Let's just add a np.flip backport to core/npcompat.py. The whole function is only a couple of lines.
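For reference, a backport along these lines would do it (a sketch mirroring NumPy 1.12's np.flip; the npcompat.py placement is only what shoyer suggests above):

import numpy as np

try:
    from numpy import flip
except ImportError:
    # np.flip was added in NumPy 1.12; reverse along the given axis instead.
    def flip(m, axis):
        m = np.asanyarray(m)
        indexer = [slice(None)] * m.ndim
        indexer[axis] = slice(None, None, -1)
        return m[tuple(indexer)]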

@jhamman (Member) commented Sep 22, 2017

In it goes. Thanks @darothen!


Successfully merging this pull request may close these issues.

GroupBy like API for resample
4 participants