combine_first by using apply_ufunc in ops.fillna #1204

Merged: 18 commits into pydata:master on Jan 23, 2017

Conversation

@chunweiyuan (Contributor) commented Jan 12, 2017

This PR implements ops.fillna using apply_ufunc, with a join kwarg. The new ops.fillna is then used to implement combine_first on both DataArray and Dataset objects.

def _fillna(data, other):
    left, right = np.broadcast_arrays(data, other)
    result = left.copy()  # the broadcast view must be copied before being written
    result[isnull(result)] = right[isnull(result)]
    return result
Member:

An advantage of using where(isnull(data), other, data) instead of all the logic in this function is that it works on dask arrays, too.
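A minimal sketch of the suggested approach, using plain NumPy stand-ins for this module's helpers (np.isnan for isnull, np.where for where); since np.where broadcasts its arguments and allocates a fresh output array, the explicit broadcast-and-copy step disappears:

```python
import numpy as np

def fillna_via_where(data, other):
    # wherever `data` is missing, take the value from `other`;
    # np.where broadcasts, so no np.broadcast_arrays / .copy() is needed
    return np.where(np.isnan(data), other, data)

a = np.array([1.0, np.nan, np.nan])
b = np.array([10.0, 20.0, 30.0])
print(fillna_via_where(a, b))  # [ 1. 20. 30.]
```

The same expression also works unchanged on dask arrays when the module's own where/isnull wrappers are used, which is the advantage noted above.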

@@ -382,7 +382,15 @@ def fillna(self, value):
Dataset.fillna
DataArray.fillna
"""
return self._fillna(value)
def _yield_applied(this, other):
Member:

I think apply_ufunc should even work on groupby objects, so you shouldn't need this helper function. I would be interested to know if I'm wrong about that, though.

Contributor Author:

So, if I replace the whole method with just return apply_ufunc(ops.fillna, self, value), line 2507 of test_dataset.py fails: self.assertEqual(actual.attrs, ds.attrs) breaks because the attrs aren't copied over.

Contributor Author:

One solution is perhaps to add something into apply_groupby_ufunc to make sure the attrs are copied over, similar to what's in _copy_attrs_from in dataset.py/dataarray.py.

Member:

> One solution is perhaps to add something into apply_groupby_ufunc to make sure the attrs are copied over, similar to what's in _copy_attrs_from in dataset.py/dataarray.py.

I like this -- it would be useful for other functions in apply_ufunc, too. For consistency, I would call the new argument keep_attrs.

Or if you don't want to bother with that, could just use _copy_attrs_from in this method:

out = ops.fillna(self, other)
out._copy_attrs_from(self._obj)
return out

Contributor Author:

I copped out and took your simple approach. It's mainly because I'm not sure how to deal with more than 2 objects in apply_ufunc when keep_attrs=True. Do I only take the first object's attributes? Maybe best left for another PR.

    result = left.copy()  # view must be copied before being written
    result[isnull(result)] = right[isnull(result)]
    return result
return apply_ufunc(_fillna, data, other, join=join)
Member:

should have dask_array='allowed'

Member:

It might be helpful to separate the join argument to apply_ufunc into two parts:

  1. The join used for aligning indexes.
  2. The join used for aligning data variables between datasets.

For fillna, I think we would always want data_vars_join='left', though I guess data_vars_join='inner' is equivalent when we constrain value not to contain new keys.
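A rough sketch of what the data-variables half of this split might look like (a hypothetical helper, loosely modeled on the join_dict_keys function that appears later in this diff): aligning indexes and aligning variable names are independent choices, each with its own join mode.

```python
def join_keys(key_sets, how="inner"):
    # hypothetical sketch: join data-variable names across datasets;
    # index alignment (the `join` argument) is a separate, earlier step
    if how == "inner":
        out = set.intersection(*key_sets)
    elif how == "outer":
        out = set.union(*key_sets)
    elif how == "left":
        out = key_sets[0]
    elif how == "right":
        out = key_sets[-1]
    else:
        raise ValueError("invalid join: %r" % how)
    return sorted(out)

ds_keys = [{"a", "b"}, {"b", "c"}]
print(join_keys(ds_keys, "inner"))  # ['b']
print(join_keys(ds_keys, "left"))   # ['a', 'b']
```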

of ``join='outer'``. Vacant cells in the expanded coordinates are
filled with np.nan.

Renders the same result as xr.merge([self, other]).
Member:

This is not quite right -- combine_first/fillna are OK with conflicting values in self and other, but merge will raise an error.
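To illustrate that difference with plain NumPy (a hypothetical stand-alone example, not code from this PR): fillna/combine_first semantics silently resolve conflicts in favor of the calling object, whereas merge-style logic treats conflicting non-missing values as an error.

```python
import numpy as np

a = np.array([1.0, np.nan, 3.0])
b = np.array([9.0, 2.0, 9.0])   # disagrees with `a` at indexes 0 and 2

# combine_first/fillna semantics: keep `a` wherever it already has data
combined = np.where(np.isnan(a), b, a)
print(combined)  # [1. 2. 3.]

# merge-style semantics: conflicting non-NaN values are an error
both = ~np.isnan(a) & ~np.isnan(b)
if np.any(a[both] != b[both]):
    print("merge-style logic would raise: conflicting values")
```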

out.attrs = self.attrs
return out

def combine_first(self, other):
Member:

I'm not sure it's worth making a separate method here to save a few characters.

Contributor Author:

I think either we have combine_first for both Dataset and DataArray, or we lump everything into fillna. The case for combine_first is just that there's a concrete, direct analogy from pandas. What's the decision here?

Member:

The fact that data_vars_join could sensibly differ for combine_first (as you have implemented it) suggests that a second method is reasonable. Consistency with pandas is also a nice factor. So I guess I've come around!

'must be contained in the original dataset')
out = ops.fillna(self, value, join="left")
out._copy_attrs_from(self)
return out
Member:

This would be a good time to remove all the fillna-specific logic from Dataset._calculate_binary_op that is no longer used.

@@ -1955,7 +1955,35 @@ def fillna(self, value):
-------
Dataset
"""
out = self._fillna(value)
if utils.is_dict_like(value):
value_keys = value.data_vars.keys() if isinstance(value, Dataset)\
Member:

maybe slightly cleaner to use duck typing, e.g., value_keys = getattr(value, 'data_vars', value).keys()

if not set(value_keys) <= set(self.data_vars.keys()):
raise ValueError('all variables in the argument to `fillna` '
'must be contained in the original dataset')
out = ops.fillna(self, value, join="left")
Member:

join should be a parameter for Dataset.fillna, too.

@@ -106,6 +106,13 @@ Deprecations

Enhancements
~~~~~~~~~~~~

- Added the xarray equivalent of `pandas.DataFrame.combine_first` as an instance
Member:

just a note -- this will need to go in a new section for 0.9.1, since 0.9.0 will probably be released first.


- Added the xarray equivalent of `pandas.DataFrame.combine_first` as an instance
method to DataArray/Dataset objects, facilitated by the new `ops.fillna` that
uses `apply_ufunc` for different `join` options.
Member:

I would drop implementation details like the fact that this uses apply_ufunc from the user facing release notes.

@chunweiyuan (Contributor Author):

So I addressed some of the comments, but not all. Please let me know what you think.

@shoyer (Member) left a comment:

This is coming together nicely.


def _fillna(data, other):
left, right = np.broadcast_arrays(data, other)
return np.where(isnull(left), right, left)
Member:

This should literally be:

def _fillna(data, other):
    return where(isnull(data), other, data)

The where and isnull functions defined in this module already do broadcasting, and work for both dask and numpy arrays.

result_vars = apply_dict_of_variables_ufunc(
func, *args, signature=signature, join=join, fill_value=fill_value)
func, *args, signature=signature, join=data_vars_join,
Member:

It could be nice to have a unit test that explicitly covers data_vars_join in test_computation.py.

Contributor Author:

What is the expected behavior of the following?

ds0 = xr.Dataset({'a': ('x', [1, 2]), 'x': [0, 1]})
ds1 = xr.Dataset({'b': ('x', [99, 3]), 'x': [1, 2]})

def add(a, b, join, data_vars_join, **kwargs):
    return apply_ufunc(operator.add, a, b, join=join,
                       data_vars_join=data_vars_join, **kwargs)

out = add(ds0, ds1, 'outer', 'outer')

Currently, it raises a TypeError at result_vars[name] = func(*variable_args) of apply_dict_of_variables_ufunc.

On the other hand, if you run

out = add(ds0, ds2, 'outer', 'outer', dataset_fill_value=np.nan)

you get

>>> out
<xarray.Dataset>
Dimensions:  (x: 3)
Coordinates:
  * x        (x) int64 0 1 2
Data variables:
    a        (x) float64 nan nan nan
    b        (x) float64 nan nan nan

While I can understand why the code renders such a result, as a user, I was hoping for something more like

<xarray.Dataset>
Dimensions:  (x: 3)
Coordinates:
  * x        (x) int64 0 1 2
Data variables:
    a        (x) float64 1.0 2.0 nan
    b        (x) float64 nan 99.0 3.0

which would be a little more useful.

Would we want to make changes for this?

Member:

You can make that work with dataset_fill_value=0:

In [11]: add(ds0, ds1, 'outer', 'outer', dataset_fill_value=0)
Out[11]:
<xarray.Dataset>
Dimensions:  (x: 3)
Coordinates:
  * x        (x) int64 0 1 2
Data variables:
    a        (x) float64 1.0 2.0 nan
    b        (x) float64 nan 99.0 3.0

Are you suggesting changing the default value to 0? It's not obvious to me what the best choice for the default dataset_fill_value is. Maybe we should remove the default value of None and raise a more descriptive error if this happens.

Contributor Author:

A couple of things:

1.) What do you think about removing the join kwarg from the fillna signature in both dataarray.py and dataset.py? The writing of ds0.fillna(ds1) already implies a directionality, so adding a join option can lead to confusion.

2.) I've decided not to touch the default value of dataset_fill_value=None. It leads to errors in many other places once you change that. I'll look into raising some descriptive errors now that we have a data_vars_join option. But if we want to make

out = add(ds0, ds2, 'outer', 'outer', dataset_fill_value=np.nan)

render

<xarray.Dataset>
Dimensions:  (x: 3)
Coordinates:
  * x        (x) int64 0 1 2
Data variables:
    a        (x) float64 1.0 2.0 nan
    b        (x) float64 nan 99.0 3.0

I think we'll have to do another PR.

Member:

> 1.) What do you think about removing the join kwarg from the fillna signature in both dataarray.py and dataset.py? The writing of ds0.fillna(ds1) already implies a directionality, so adding a join option can lead to confusion.

Agreed, there isn't really a need for the join argument once we have combine_first.

For your example add(ds0, ds2, 'outer', 'outer', dataset_fill_value=np.nan), can you clarify what the values of ds0 and ds2 are?

Contributor Author:

Sorry, my mistake. That ds2 is supposed to be ds1, as the one above:

ds0 = xr.Dataset({'a': ('x', [1, 2]), 'x': [0, 1]})
ds1 = xr.Dataset({'b': ('x', [99, 3]), 'x': [1, 2]})

@@ -702,7 +713,8 @@ def stack(objects, dim, new_coord):
elif any(is_dict_like(a) for a in args):
return apply_dataset_ufunc(variables_ufunc, *args, signature=signature,
join=join, exclude_dims=exclude_dims,
fill_value=dataset_fill_value)
fill_value=dataset_fill_value,
data_vars_join=data_vars_join)
Member:

Add this argument to the functools.partial for GroupBy objects in the clause above this one.


self.assertDatasetEqual(actual, expected)
self.assertDatasetEqual(actual, xr.merge([dsx0, dsx1]))

dsy2 = DataArray([2, 2, 2], [('x', ['b', 'c', 'd'])]).\
Member:

Per PEP8, please avoid writing \. Either split this on two lines or use parentheses for the implicit line continuation.
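For instance (hypothetical values, not the test's actual code), either of these alternatives avoids the trailing backslash:

```python
# implicit continuation: parentheses let the statement span lines
total = (1 + 2 + 3
         + 4 + 5)

# or simply split the work across two statements
partial_sum = 1 + 2 + 3
total = partial_sum + 4 + 5
print(total)  # 15
```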

expected = DataArray([[0, 0], [0, 0], [2, 2]],
[('x', ['a', 'b', 'd']), ('y', [-1, 0])])
self.assertDataArrayEqual(actual, expected)

Member:

nit: two line breaks should separate code blocks, not three

ar1.combine_first(ar0)
ar0.combine_first(ar2)

For datasets, ``ds0.combine_first(ds1)`` works just like ``xr.merge([ds0, ds1])``
Member:

Again, this isn't quite true -- it succeeds in cases where merge will raise an error. It would be nice to highlight the difference here.

@@ -2324,11 +2356,6 @@ def _calculate_binary_op(self, f, other, join='inner',
inplace=False, fillna=False):

def apply_over_both(lhs_data_vars, rhs_data_vars, lhs_vars, rhs_vars):
if fillna and join != 'left':
Member:

you can remove the fillna argument from the function signature entirely (and also from Dataset._binary_op)

@@ -13,6 +13,7 @@ Combining data

* For combining datasets or data arrays along a dimension, see concatenate_.
* For combining datasets with different variables, see merge_.
# For combining datasets or data arrays with outer-join alignment, see combine_.
Member:

Maybe: For combining datasets with different coordinates or missing values

@shoyer (Member) commented Jan 14, 2017 via email

@chunweiyuan (Contributor Author):

I'm not sure why the continuous-integration check is taking forever...

@shoyer (Member) commented Jan 19, 2017:

Appveyor is acting up. I tried cancelling and restarting the build -- we'll see if that works. But the Travis-CI tests pass, so I wouldn't worry about it.

@shoyer (Member) commented Jan 19, 2017:

Looks like the build is starting now.

@@ -664,8 +676,10 @@ def stack(objects, dim, new_coord):

signature = kwargs.pop('signature', None)
join = kwargs.pop('join', 'inner')
data_vars_join = kwargs.pop('data_vars_join', 'inner')
keep_attrs = kwargs.pop('keep_attrs', True)
Member:

This should default to False, as it does for arithmetic.

join=join, exclude_dims=exclude_dims,
fill_value=dataset_fill_value,
data_vars_join=data_vars_join)
if keep_attrs and (type(first_obj) is type(out)):
Member:

I would move keep_attrs one level up into apply_dataset_ufunc and apply_dataarray_ufunc.

exclude_dims = kwargs.pop('exclude_dims', frozenset())
dataset_fill_value = kwargs.pop('dataset_fill_value', None)
dataset_fill_value = kwargs.pop('dataset_fill_value', np.nan)
Member:

I would actually rather force dataset_fill_value to be set explicitly.

One option is to create a top level constant for the default:

_DEFAULT_FILL_VALUE = object()

...
dataset_fill_value = kwargs.pop('dataset_fill_value', _DEFAULT_FILL_VALUE)

Then, in collect_dict_values raise an error if you need to use the fill value and fill_value is _DEFAULT_FILL_VALUE.
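A minimal sketch of the sentinel-default pattern described above (simplified; the real collect_dict_values also accepts non-mapping arguments, per its type comment later in the diff). The sentinel is a fresh object(), so it can never collide with a legitimate user-supplied fill value, including None:

```python
_DEFAULT_FILL_VALUE = object()  # sentinel: distinct from any user value, even None

def collect_dict_values(objects, keys, fill_value=_DEFAULT_FILL_VALUE):
    # gather each key's value from every mapping, filling gaps with
    # fill_value; raise if filling is needed but no value was supplied
    collected = []
    for key in keys:
        row = []
        for obj in objects:
            if key in obj:
                row.append(obj[key])
            elif fill_value is _DEFAULT_FILL_VALUE:
                raise TypeError("a fill value is required but was not supplied")
            else:
                row.append(fill_value)
        collected.append(row)
    return collected

print(collect_dict_values([{"a": 1}, {"a": 2, "b": 3}], ["a", "b"], fill_value=0))
# [[1, 2], [0, 3]]
```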

@chunweiyuan (Contributor Author):

Time to merge this guy?

@chunweiyuan (Contributor Author):

Checking in on this PR...

@@ -258,6 +270,9 @@ def join_dict_keys(objects, how='inner'):

def collect_dict_values(objects, keys, fill_value=None):
# type: (Iterable[Union[Mapping, Any]], Iterable, Any) -> List[list]
if fill_value is _DEFAULT_FILL_VALUE:
    raise ValueError('Inappropriate fill value for Dataset: {}'
Member:

The default fill value doesn't have an informative repr, so I wouldn't print it in the error message. Instead, say something like: "To apply an operation to datasets with different data variables, you must supply the dataset_fill_value argument"

Member:

Also: we should only raise this error if filling actually needs to be done, or better yet exactly when data_vars_join != 'inner' (so the function behavior does not depend on data values). In the current version, you always need to supply this argument when applying an operation to datasets, which is too strict.

If we switch to raising when data_vars_join != 'inner', then this should be a TypeError and be moved up into apply_dataset_ufunc.

@@ -202,6 +206,8 @@ def apply_dataarray_ufunc(func, *args, **kwargs):
signature = kwargs.pop('signature')
join = kwargs.pop('join', 'inner')
exclude_dims = kwargs.pop('exclude_dims', _DEFAULT_FROZEN_SET)
keep_attrs = kwargs.pop('keep_attrs', False)
first_obj = args[0] # we'll copy attrs from this in case keep_attrs=True
Member:

Minor style suggestion: move this line to just above the if keep_attrs below; then you don't need the comment to explain why it exists.

@@ -89,7 +89,8 @@ def test_apply_identity():
data_array = xr.DataArray(variable, [('x', -array)])
dataset = xr.Dataset({'y': variable}, {'x': -array})

identity = functools.partial(apply_ufunc, lambda x: x)
identity = functools.partial(apply_ufunc, lambda x: x,
                             dataset_fill_value=np.nan)
Member:

as mentioned above, we shouldn't need to set dataset_fill_value unless data_vars_join has been modified from the default.

exclude_dims = kwargs.pop('exclude_dims', frozenset())
dataset_fill_value = kwargs.pop('dataset_fill_value', None)
dataset_fill_value = kwargs.pop('dataset_fill_value', _DEFAULT_FILL_VALUE)
Member:

Maybe we should rename this argument to data_vars_fill_value, to help clarify that it is paired with data_vars_join? I would put them next to each other in the docstring, too.

Or maybe the pair should be dataset_fill_value/dataset_join? dataset_data_vars_fill_value/dataset_data_vars_join is probably too long :).

'b': ('x', [np.nan, np.nan, np.nan]),
'x': [0, 1, 2]})
assert_identical(actual, expected)

Member:

Add a test to verify that an appropriate error is raised (use pytest.raises) if dataset_fill_value is not provided when it is necessary.

@@ -473,6 +480,37 @@ def test_broadcast_compat_data_2d():
broadcast_compat_data(var, ('w', 'y', 'x', 'z'), ()))

Member:

Would be good to add a test in here for keep_attrs.

@@ -13,6 +13,8 @@ Combining data

* For combining datasets or data arrays along a dimension, see concatenate_.
* For combining datasets with different variables, see merge_.
# For combining datasets with different coordinates or missing values,
Member:

use * at the start for consistency

Maybe say "datasets or data arrays"?

Also, maybe "different indexes" instead of "different coordinates"?

@@ -13,6 +13,8 @@ Combining data

* For combining datasets or data arrays along a dimension, see concatenate_.
* For combining datasets with different variables, see merge_.
# For combining datasets with different coordinates or missing values,
or data arrays with outer-join alignment, see combine_.
Member:

I would drop "or data arrays with outer-join alignment" -- it's a little confusing.

@@ -13,8 +13,7 @@ Combining data

* For combining datasets or data arrays along a dimension, see concatenate_.
* For combining datasets with different variables, see merge_.
# For combining datasets with different coordinates or missing values,
or data arrays with outer-join alignment, see combine_.
* For combining datasets or data arrays with different indexes or missing values.
Member:

you lost "see combine" here


if dataset_join != 'inner' and fill_value is _DEFAULT_FILL_VALUE:
    raise TypeError('To apply an operation to datasets with different ',
                    'data variables, you must supply the ',
Member:

add if dataset_join != 'inner' to the docstring

dataset_fill_value : optional
Value used in place of missing variables on Dataset inputs when the
datasets do not share the exact same ``data_vars``. Only relevant if
``dataset_join != 'inner'``.
Member:

Maybe switch the last sentence to:

Required if dataset_join != 'inner', otherwise ignored.

@@ -607,23 +609,23 @@ def apply_ufunc(func, *args, **kwargs):
- 'inner': use the intersection of object indexes
- 'left': use indexes from the first object with each dimension
- 'right': use indexes from the last object with each dimension
data_vars_join : {'outer', 'inner', 'left', 'right'}, optional
dataset_join : {'outer', 'inner', 'left', 'right'}, optional
Member:

Can you fix the function signature at the top of the docstring, also to add in the new arguments?

Enhancements
~~~~~~~~~~~~

- Added the xarray equivalent of `pandas.DataFrame.combine_first` as an instance
Member:

Actually let's move this into v0.9.0. I think this change is pretty safe, and it would be nice to get it in promptly.

@shoyer shoyer merged commit c5146e8 into pydata:master Jan 23, 2017
@shoyer (Member) commented Jan 23, 2017:

Thanks for persevering on this, @chunweiyuan !

@chunweiyuan (Contributor Author):

tears in my eyes.......
