
Groupby-like API for resampling #1272

Merged: 55 commits, Sep 22, 2017

Conversation

@darothen commented Feb 16, 2017

This is a work-in-progress to resolve #1269.

  • Basic functionality
  • Cleanly deprecate old API
  • New test cases
  • Documentation / examples
  • "What's new"

I openly welcome feedback/critiques on how I approached this. Subclassing Data{Array/set}GroupBy may not be the best way, but it would be easy enough to re-write the necessary helper functions (just apply(), I think) so that we do not need to inherit from them directly. Additional issues I'm working to resolve:

  • I tried to make sure that calls using the old API won't break by refactoring the old logic into _resample_immediately(). This may not be the best approach!
  • Similarly, I copied all the original test cases and added the suffix ..._old_api; these could trivially be placed into their related test cases for the new API.
  • BUG: keep_attrs is ignored when you call it on methods chained to Dataset.resample(). Oddly enough, if I hard-code keep_attrs=True inside reduce_array() in DatasetResample::reduce it works just fine. I haven't figured out where the kwarg is getting lost.
  • BUG: Some of the test cases (for instance, test_resample_old_vs_new_api) fail because resampling by calling self.groupby_cls ends up not working - it crashes because the group sizes that get computed are not what it expects. This occurs with both the new and old APIs.

".resample({dim}=\"{freq}\").{how}() ".format(
dim=dim, freq=freq, how=how
), DeprecationWarning, stacklevel=3)

if isinstance(dim, basestring):
dim = self[dim]
group = DataArray(dim, [(RESAMPLE_DIM, dim)], name=RESAMPLE_DIM)
Member:

Note: I think this should probably be group = DataArray(dim, [(dim.dims, dim)], name=RESAMPLE_DIM) but for some reason this works.

                              closed, label, base, keep_attrs):
        """Implement the original version of .resample() which immediately
        executes the desired resampling operation."""
        from .dataarray import DataArray
        RESAMPLE_DIM = '__resample_dim__'
Member:

There might be a better way to handle this, but currently using RESAMPLE_DIM is actually pretty critical here, and removing it is probably the source of some of the bugs you are seeing with your new implementation. The problem is that we need to keep track of the resampled time variable in a dataset along with the original time variable. If we give the resampled group variable the same name, then xarray gets them mixed up.

@darothen (Author):

Smoothed out most of the problems and missing details from earlier. Still not sure if it's wise to refactor most of the resampling logic into a new resample.py, like what was done with rolling, but it still makes some sense to keep things in groupby.py because we're just subclassing existing machinery from there.

The only issue now is the signature for __init__() in Data{set,Array}Resample, where we have to add two keyword arguments. Python 2.x doesn't allow named arguments after *args. There are a few options here, mostly just playing with **kwargs as in this StackOverflow thread.

@shoyer (Member) commented Feb 20, 2017

The only issue now is the signature for __init__() in Data{set,Array}Resample, where we have to add two keyword arguments. Python 2.x doesn't allow named arguments after *args. There are a few options here, mostly just playing with **kwargs as in this StackOverflow thread.

Yes, use pop, e.g., dim = kwargs.pop('dim', None). pop removes the arguments from kwargs, so you can pass on the remaining ones unchanged to the super class method.
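A minimal sketch of that pattern, using generic placeholder class names rather than the PR's actual classes:

class Base(object):
    def __init__(self, obj, group=None):
        self.obj = obj
        self.group = group


class Resample(Base):
    def __init__(self, *args, **kwargs):
        # Python 2 forbids keyword-only arguments after *args, so pop the
        # extra keywords out of **kwargs before delegating to the parent.
        self.dim = kwargs.pop('dim', None)
        self.resample_dim = kwargs.pop('resample_dim', None)
        super(Resample, self).__init__(*args, **kwargs)

Resample(data, group=group, dim='time', resample_dim='__resample_dim__') then works identically under Python 2 and 3, and the parent class never sees the extra keywords.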

@shoyer (Member) left a review comment:

It would be good to think about valid uses (and add some test coverage) for groupby-like resampling that doesn't use aggregation. For example, what happens if you want to use resample with arithmetic, .apply or to iterate over groups? This means that the ugly '__resample_dim__' name will leak out into the external API.

We still need a way to deal with redundant dimension/coordinate names, but maybe we can give a slightly friendlier name for the coordinate, e.g., resampled_time if time is the name of the resampled coordinate.

                          **kwargs)
        result = self.apply(reduce_array, shortcut=shortcut)

        return result.rename({self._resample_dim: self._dim})
Member:

Maybe put this .rename({self._resample_dim: self._dim}) business at the end of .apply instead? That would give a slightly better result for non-reduce uses of the new resample (e.g., for arithmetic).
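A rough sketch of that suggestion, assuming the _dim/_resample_dim attributes used elsewhere in this PR and xarray's DataArrayGroupBy as the base class (not the final implementation):

from xarray.core.groupby import DataArrayGroupBy


class DataArrayResample(DataArrayGroupBy):
    def __init__(self, *args, **kwargs):
        self._dim = kwargs.pop('dim', None)
        self._resample_dim = kwargs.pop('resample_dim', None)
        super(DataArrayResample, self).__init__(*args, **kwargs)

    def apply(self, func, shortcut=False, **kwargs):
        combined = super(DataArrayResample, self).apply(
            func, shortcut=shortcut, **kwargs)
        # Renaming here, rather than only in reduce(), means arithmetic and
        # other non-reduce uses also get the original dimension name back.
        if self._resample_dim in combined.dims:
            combined = combined.rename({self._resample_dim: self._dim})
        return combined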

        with self.assertRaisesRegexp(ValueError, 'index must be monotonic'):
            array[[2, 0, 1]].resample(time='1D')

    def test_resample_old_api(self):
Member:

It would probably be better to remove some of these old API tests in preference to tests using the new API. We don't really need all of them to be sure the old API still works.

array = DataArray(np.ones(10), [('time', times)])

# Simple mean
old_mean = array.resample('1D', 'time', how='mean')
Member:

Use pytest.warns around each use of the old API to verify that the right warning is raised (and also to ensure the warnings don't get issued when we run the test suite).
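For instance, a small illustration of the requested wrapper (the array and times here are placeholders):

import numpy as np
import pandas as pd
import pytest
from xarray import DataArray

times = pd.date_range('2000-01-01', freq='6H', periods=10)
array = DataArray(np.ones(10), [('time', times)])

# The deprecated signature should still work but must warn; wrapping the call
# in pytest.warns both asserts the warning and keeps it out of the test output.
with pytest.warns(DeprecationWarning):
    old_mean = array.resample('1D', 'time', how='mean')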


ds.resample(time='6H').reduce(np.mean)

Resampling does not yet work for upsampling.
Member:

I think we need to fix this before merging this PR, since it suggests the existing functionality would only exist in deprecated form. Pandas does this with a method called .asfreq, though this is basically pure sugar since in practice I think it works exactly the same as .first (or .mean if only doing pure upsampling).

@darothen (Author) commented Feb 20, 2017 via email

@darothen (Author) commented Mar 1, 2017

Should .apply() really work on non-aggregation functions? Based on the pandas documentation it seems like "resample" is truly just a synonym for a transformation of the time dimension. I can't really find many examples of people using this as a substitute for time group-bys... it seems that's what the pd.TimeGrouper is for, in conjunction with a normal .groupby().

As written, non-aggregation ("transformation"?) doesn't work because the call in _combine() to _maybe_reorder() messes things up (it drops all of the data along the resampled dimension). It shouldn't be too hard to fix this, although I'm leaning more and more to making stand-alone Data{Array,set}Resample classes in a separate file which only loosely inherit from their Data{Array,set}GroupBy cousins, since they need to re-write some really critical parts of the underlying machinery.

@shoyer (Member) commented Mar 1, 2017

I can't really find many examples of people using this as a substitute for time group-bys... it seems that's what the pd.TimeGrouper is for, in conjunction with a normal .groupby().

I think this is mostly because TimeGrouper has been around far longer than non-aggregating resample.

@jhamman (Member) commented Jul 13, 2017

@darothen - can you give a summary of what's left to do here?

@darothen (Author):

I think a pull against the new releases is critical to see what breaks. Beyond that, just code clean up and testing. I can try to bump this higher on my priority list.

@darothen (Author):

I did my best to re-base everything to master... plan on spending an hour or so figuring out what's broken and at least restoring the status quo.

Un-do _combine temporary debugging output
@darothen (Author) commented Jul 19, 2017

TODO

  • ensure that count() works on Data{Array,set}Resample objects
  • refactor Data{Array,set}Resample objects into a stand-alone file core/resample.py alongside core/groupby.py
  • wrap pytest.warns around tests targeting old API
  • move old API tests into stand-alone
  • Crude up-sampling. Copy/pasting Stephan's earlier comment from Feb 20:

I think we need to fix this before merging this PR, since it suggests the existing functionality would only exist in deprecated form. Pandas does this with a method called .asfreq, though this is basically pure sugar since in practice I think it works exactly the same as .first (or .mean if only doing pure upsampling).


Alright @jhamman, here's the complete list of work left here. I'll tackle some of it during my commutes this week.

        for how in ['mean', 'sum', 'first', 'last', ]:
            method = getattr(actual, how)
            result = method()
            self.assertDatasetEqual(expected, result)
Collaborator:
No need to change, but in future you can use pytest parameters for these sorts of loops, and assert_equal rather than the self... methods throughout
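For example, a hypothetical parametrized version of the loop above (names and data are placeholders; it cross-checks against pandas rather than the fixtures used in the real test):

import numpy as np
import pandas as pd
import pytest
from xarray import DataArray
from xarray.testing import assert_equal


@pytest.mark.parametrize('how', ['mean', 'sum', 'first', 'last'])
def test_resample_reductions(how):
    times = pd.date_range('2000-01-01', freq='6H', periods=8)
    array = DataArray(np.arange(8.0), [('time', times)])
    result = getattr(array.resample(time='1D'), how)()
    # Build the expected values with pandas as an independent cross-check.
    expected_values = getattr(array.to_series().resample('1D'), how)().values
    expected = DataArray(expected_values, [('time', result['time'].values)])
    assert_equal(expected, result)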

@darothen (Author):

@shoyer fixed.

@shoyer (Member) left a review comment:

I think this is good to go! I'll merge this in a day or two, once others have had a chance to look it over.

"variable '{}', but it is a dask array; dask arrays not "
"yet supprted in resample.interpolate(). Load into "
"memory with Dataset.load() before resampling."
.format(name)
Member:
love it -- thanks!

----------
kind : str {'linear', 'nearest', 'zero', 'slinear',
'quadratic', 'cubic'}
Interpolation scheme to use
Member:

add a period

@shoyer changed the title from "WIP: Groupby-like API for resampling" to "Groupby-like API for resampling" on Sep 13, 2017
@jhamman (Member) left a review comment:

I gave this another pass through. I found a number of pep8 violations (run git diff upstream/master | flake8 --diff) that we should fix. One comment of substance: we probably need better test coverage on the new errors.

  • Passes git diff upstream/master | flake8 --diff

label=label, base=base)
resampler = self._resample_cls(self, group=group, dim=dim_name,
grouper=time_grouper,
resample_dim=resample_dim)
Member:
fix indentation

if isinstance(dim, basestring):
dim_name = dim
Member:
dim_name is never used

else:
result = f(dim=dim.name, skipna=skipna, keep_attrs=keep_attrs)
else:
result = gb.reduce(how, dim=dim.name, keep_attrs=keep_attrs)
result = result.rename({RESAMPLE_DIM: dim.name})
return result


Member:

PEP8: remove extra blank line

@@ -34,7 +35,7 @@ def _infer_coords_and_dims(shape, coords, dims):
"""All the logic for creating a new DataArray"""

if (coords is not None and not utils.is_dict_like(coords) and
len(coords) != len(shape)):
len(coords) != len(shape)):
Member:
fix indentation

"yet supprted in resample.interpolate(). Load into "
"memory with Dataset.load() before resampling."
.format(name)
)
Member:
Can we add a unit test that covers this TypeError? Right now, I think you'll actually get a NameError because name is not defined.
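Something along these lines could cover it (a hypothetical test; it assumes dask and scipy are installed and that the check lives in the Dataset interpolate path):

import numpy as np
import pandas as pd
import pytest
from xarray import Dataset


def test_upsample_interpolate_dask_raises():
    times = pd.date_range('2000-01-01', freq='6H', periods=5)
    ds = Dataset({'data': ('time', np.arange(5.0))}, {'time': times})
    ds = ds.chunk({'time': 1})  # back the variable with a dask array
    with pytest.raises(TypeError):
        ds.resample(time='1H').interpolate('linear')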

            return DataArray(f(new_x), coords, dims, name=dummy.name,
                             attrs=dummy.attrs)

ops.inject_reduce_methods(DataArrayResample)
Member:

PEP8: add a second empty line after class definition

# convention from old api
new_api = getattr(resampler, method)(keep_attrs=False)
with pytest.warns(DeprecationWarning):
old_api = array.resample('1D', dim='time', how=method)
Member:

indentation

def test_resample_mean_keep_attrs(self):
array = DataArray(np.arange(10), [('__resample_dim__', times)])
actual = array.resample('1D', dim='__resample_dim__', how='first')
self.assertRaisesRegexp(ValueError, 'Proxy resampling dimension')
Member:

I think this needs to be:

with self.assertRaisesRegexp(ValueError, 'Proxy resampling dimension'):
    actual = array.resample('1D', dim='__resample_dim__', how='first')

I'm not sure why this is passing actually.

Author:

This was butchered in more ways than one; fixing the test revealed some underlying flaws, which are now also fixed.

ycoord = DataArray(yy.T, {'x': xs, 'y': ys}, ('x', 'y'))
tcoord = DataArray(tt, {'time': times}, ('time', ))
ds = Dataset({'data': array, 'xc': xcoord,
'yc': ycoord, 'tc': tcoord})
Member:
indentation

ycoord = DataArray(yy.T, {'x': xs, 'y': ys}, ('x', 'y'))
tcoord = DataArray(tt, {'time': times}, ('time', ))
ds = Dataset({'data': array, 'xc': xcoord,
'yc': ycoord, 'tc': tcoord})
Member:
indentation

@darothen (Author):

@jhamman Gotcha, I'll clean everything up by the end of the week. If that's going to block 0.10.0, let me know and I'll shuffle some things around to prioritize this.

@jhamman (Member) commented Sep 19, 2017

@darothen - we have a few other PRs to wrap up for 0.10 so end of the week is okay.

@darothen (Author):

@jhamman Think we're good. I deferred 4 small PEP8 issues because they're in parts of the codebase I don't think I ever touched, and I'm worried they're going to screw up the merge.

@jhamman (Member) commented Sep 20, 2017

Thanks @darothen - can you resolve the merge conflicts?

@darothen (Author):

@jhamman done - caught me right while I was compiling GEOS-Chem, and the merge conflicts were very simple.

@jhamman (Member) left a review comment:

Two more changes need to be made... these were caught by Travis.

                              ('x', 'y', 'time'))
        self.assertDataArrayIdentical(expected, actual)

    def test_upsample_interpolate(self):
Member:

this needs a @requires_scipy decorator or put

interp1d = pytest.importorskip('scipy.interpolate.interp1d')

in the first line of the test (instead of the import)

    def _interpolate(self, kind='linear'):
        """Apply scipy.interpolate.interp1d along resampling dimension."""
        from .dataarray import DataArray

Member:
let's import interp1d here, instead of at the module level. scipy is still an optional dependency.

Author:

Gotcha - it needed to be imported under DatasetResample, too.
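A minimal sketch of the deferred-import pattern (class name follows the PR, the method body is elided):

class DataArrayResample(object):
    def _interpolate(self, kind='linear'):
        # Importing inside the method means merely importing xarray never
        # requires scipy to be installed.
        from scipy.interpolate import interp1d  # noqa: F401
        raise NotImplementedError('interpolation body elided in this sketch')

As noted above, DatasetResample._interpolate needs the same local import.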

@jhamman (Member) commented Sep 21, 2017

@darothen - almost there. Two or three more dependency conflicts in the tests.

@darothen (Author) commented Sep 21, 2017

@jhamman Ohhh, I totally misunderstood the last readout from travis-ci. Dealing with the scipy dependency is easy enough. However, another test fails because it uses np.flip(), which wasn't added to numpy until v1.12.0. Do we want to bump the numpy version in the dependencies? Or is there another approach to take here?

Never mind, the easy solution is just to use other axis-reversal methods :)

@shoyer (Member) commented Sep 21, 2017

We just bumped numpy to 1.11, but 1.12 would be too new.

Let's just add a np.flip backport to core/npcompat.py. The whole function is only a couple of lines.
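For reference, a backport along these lines would do it (a sketch mirroring NumPy 1.12's np.flip; the npcompat.py placement is only what shoyer suggests above):

import numpy as np

try:
    from numpy import flip
except ImportError:
    # np.flip was added in NumPy 1.12; reverse along the given axis instead.
    def flip(m, axis):
        m = np.asanyarray(m)
        indexer = [slice(None)] * m.ndim
        indexer[axis] = slice(None, None, -1)
        return m[tuple(indexer)]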

@jhamman (Member) commented Sep 22, 2017

In it goes. Thanks @darothen!


Successfully merging this pull request may close these issues.

GroupBy like API for resample
4 participants