
Support multiple dimensions in DataArray.argmin() and DataArray.argmax() methods #3936

Merged
46 commits merged into pydata:master on Jun 29, 2020

Conversation

johnomotani
Contributor

@johnomotani johnomotani commented Apr 5, 2020

These return dicts of the indices of the minimum or maximum of a DataArray over several dimensions. Inspired by @fujiisoup's work in #1469. With #3871, replaces #1469. Provides a simpler solution to #3160.

Implemented so that

da.isel(da.argmin(list_of_dim)) == da.min(list_of_dim)
da.isel(da.argmax(list_of_dim)) == da.max(list_of_dim)
  • Closes Argmin indexes #1469
  • Tests added
  • Passes isort -rc . && black . && mypy . && flake8
  • Fully documented, including whats-new.rst for all changes and api.rst for new API
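The defining identity above can be mimicked in plain NumPy (a standalone sketch, not the xarray implementation): a flattened `argmin` plus `np.unravel_index` yields one integer index per dimension, which is the analogue of the dict these methods return.

```python
import numpy as np

# Toy stand-in for a 2-D DataArray with dims ('x', 'y')
a = np.random.default_rng(0).random((3, 4))

# Flattened argmin, unravelled into one index per dimension --
# the analogue of the dict returned by da.argmin(['x', 'y'])
ix, iy = np.unravel_index(a.argmin(), a.shape)
indices = {"x": ix, "y": iy}

# NumPy analogue of da.isel(da.argmin(list_of_dim)) == da.min(list_of_dim)
assert a[indices["x"], indices["y"]] == a.min()
```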

These return dicts of the indices of the minimum or maximum of a
DataArray over several dimensions.
@shoyer
Member

shoyer commented Apr 5, 2020

This looks really comprehensive, thank you!

Before doing a really careful review here, I'd like to try to work out the full API design we want. I'll write out some of my thoughts here, but your thoughts would also be very welcome!

Here's my summary of the current situation:

  1. We have two pairs of methods, argmin/argmax and the new idxmin/idxmax for calculating integer positions or coordinate labels of the minimum/maximum along one dimension.
  2. We would like to have a way to do a similar calculation in multiple dimensions, which should return a dictionary of DataArray objects suitable for feeding into isel/sel.

This PR implements the multidimensional equivalent of argmin/argmax. It's fine to save the multidimensional equivalent of idxmin/idxmax for later, but we'll want to copy whatever design we pick here.

My first concern is about the name: it isn't obvious to me whether indices_min refers to integer positions or coordinate labels. I think a name based on argmin/argmax could help, e.g., perhaps argmin_nd or argmin_dict.

Another option would be to overload argmin/argmax to also support returning a dictionary, depending on the type of the arguments. If you pass a list of dimensions over which to calculate the min or max locations, you would get a dictionary back:

  • array.argmin('x') would return a single DataArray
  • array.argmin(['x', 'y']) would return a dict of DataArray objects with keys {'x', 'y'}
  • array.argmin(['x']) would also return a dict {'x': ...}

I think I like this last option best but I would be curious what others think! @pydata/xarray any thoughts on this?
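A minimal sketch of that dispatch in plain NumPy (hypothetical helper, not the PR's actual code): a single dimension name returns one index array, a sequence returns a dict, and the one-element-list case also returns a dict, as in the bullets above.

```python
import numpy as np

def argmin(a, dims, dim):
    """Hypothetical sketch: one index array for a single dim, a dict for a list."""
    if isinstance(dim, str):
        return a.argmin(axis=dims.index(dim))
    # joint reduction over several dims: move them to the end, flatten, unravel
    axes = tuple(dims.index(d) for d in dim)
    keep = tuple(ax for ax in range(a.ndim) if ax not in axes)
    moved = a.transpose(keep + axes)
    flat = moved.reshape(moved.shape[: len(keep)] + (-1,)).argmin(axis=-1)
    return dict(zip(dim, np.unravel_index(flat, moved.shape[len(keep):])))

a = np.arange(24).reshape(2, 3, 4)                      # dims ('t', 'x', 'y')
assert argmin(a, ["t", "x", "y"], "x").shape == (2, 4)  # single DataArray analogue
assert set(argmin(a, ["t", "x", "y"], ["x", "y"])) == {"x", "y"}
assert set(argmin(a, ["t", "x", "y"], ["x"])) == {"x"}  # one-element list -> dict
```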

@johnomotani
Contributor Author

johnomotani commented Apr 5, 2020

@shoyer I think your last option sounds good. Questions:

  • What should da.argmin() with no arguments do?
    • Currently returns the flattened index of the global minimum.
    • I think returning a dict of indices would be much more useful, but it does change existing behaviour (more useful because you can then do da.isel(da.argmin())).
    • Could anyway do da.argmin(da.dims) to get the dict result, but that's a bit ugly
    • Could have something like da.argmin(...) - maybe as a temporary workaround while we deprecate the current behaviour of da.argmin()?
  • If we overload argmin, what's the cleanest way to get the existing argmin() method within the new one? Would something like from .duck_array_ops import argmin as argmin_1d work?
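One way the `dim=...` sentinel and the existing behaviour could coexist, as a toy dispatch sketch (the class and its internals are invented purely for illustration):

```python
import numpy as np

class Toy:
    """Hypothetical stand-in, only to illustrate the dispatch being discussed."""

    def __init__(self, data, dims):
        self.data, self.dims = data, dims

    def argmin(self, dim=None):
        if dim is ...:
            dim = list(self.dims)              # dict over every dimension
        if dim is None:
            return self.data.argmin()          # current: flattened global index
        if isinstance(dim, (list, tuple)):
            # sketch covers only the reduce-over-all-dims case, for brevity
            idx = np.unravel_index(self.data.argmin(), self.data.shape)
            return dict(zip(self.dims, idx))
        return self.data.argmin(axis=self.dims.index(dim))

t = Toy(np.arange(6).reshape(2, 3), ["x", "y"])
assert t.argmin() == 0                         # old behaviour preserved
assert t.argmin(...) == {"x": 0, "y": 0}       # dict without spelling out dims
```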

@johnomotani
Contributor Author

Maybe worth noting, at the moment if you try to call argmin(dim=("x", "y")) with multiple dimensions, there's a not-very-helpful exception

>>> array.argmin(dim=("x", "y"))
Traceback (most recent call last):
  File "/home/john/anaconda3/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 61, in _wrapfunc
    return bound(*args, **kwds)
TypeError: 'tuple' object cannot be interpreted as an integer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/<...>/xarray/xarray/core/common.py", line 46, in wrapped_func
    return self.reduce(func, dim, axis, skipna=skipna, **kwargs)
  File "/<...>/xarray/xarray/core/dataarray.py", line 2288, in reduce
    var = self.variable.reduce(func, dim, axis, keep_attrs, keepdims, **kwargs)
  File "/<...>/xarray/xarray/core/variable.py", line 1579, in reduce
    data = func(input_data, axis=axis, **kwargs)
  File "/<...>/xarray/xarray/core/duck_array_ops.py", line 304, in f
    return func(values, axis=axis, **kwargs)
  File "/<...>/xarray/xarray/core/duck_array_ops.py", line 47, in f
    return wrapped(*args, **kwargs)
  File "<__array_function__ internals>", line 6, in argmin
  File "/<...>/anaconda3/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 1267, in argmin
    return _wrapfunc(a, 'argmin', axis=axis, out=out)
  File "/<...>/anaconda3/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 70, in _wrapfunc
    return _wrapit(obj, method, *args, **kwds)
  File "/<...>/anaconda3/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 47, in _wrapit
    result = getattr(asarray(obj), method)(*args, **kwds)
TypeError: 'tuple' object cannot be interpreted as an integer

@TomNicholas
Contributor

Another option would be to overload argmin

+1 for overloading argmin (and later idxmin). IMO we should never have one function for a 1D operation and another for an N-D operation if we can avoid it; everything should be N-dimensional.

I also really like how neat this resulting property is

da.isel(da.argmin(list_of_dim)) == da.min(list_of_dim)

we could even use a hypothesis test to check it...

I think returning a dict of indices would be much more useful, but it does change existing behaviour (more useful because you can then do da.isel(da.argmin())).

Although it's breaking and would require a deprecation cycle, I think this is what we should aim for.

there's a not-very-helpful exception

Yes, let's take the time to make that clearer for users; this will be a commonly used function.

@kmuehlbauer
Contributor

kmuehlbauer commented Apr 6, 2020

+1 for making argmin/argmax (and idxmin/idxmax) work, when given multiple dimensions.

According to the current docstring it should already work that way for argmin/argmax:

Reduce this DataArray’s data by applying argmin along some dimension(s).

Returns: New DataArray/Dataset object with argmin applied to its data and the indicated dimension(s) removed

But this behaviour is broken currently (works only for one given dim).

My main concern with changing the API as suggested above is how we should discern (at least for argmin/argmax) whether the user wants to:

  1. reduce over all given dimensions? (current behaviour according to docstring)
  2. reduce along all given dimensions? (suggested behaviour by @johnomotani [please correct if I'm wrong])

@TomNicholas
Contributor

how should we discern (at least for argmin/argmax), if the user wants to...

That's a good question @kmuehlbauer, and the distinction probably needs to be clearer in the docs in general.

reduce over all given dimensions?

By this do you mean find the minimum as if the array were first (partially or totally) flattened along the given dims somehow? I'm not sure we provide that kind of behaviour anywhere in the current API.

@kmuehlbauer
Contributor

reduce over all given dimensions?

By this do you mean find the minimum as if the array were first (partially or totally) flattened along the given dims somehow? I'm not sure we provide that kind of behaviour anywhere in the current API.

@TomNicholas I was probably a bit confused by the current docstring. I think I understand now and there should be no problem at all.

@kmuehlbauer
Contributor

@johnomotani FYI: For #3871 (merged) there is #3922 (yet unmerged) to fix dask-handling.

When argmin or argmax are called with a sequence for 'dim', they now
return a dict with the indices for each dimension in dim.
If a single dim is passed to Dataset.argmin() or Dataset.argmax(), then
pass through to _argmin_base or _argmax_base. If a sequence is passed
for dim, raise an exception, because the result for each DataArray would
be a dict, which cannot be stored in a Dataset.
The basic numpy-style argmin() and argmax() methods were renamed when
adding support for handling multiple dimensions in DataArray.argmin()
and DataArray.argmax(). Variable.argmin() and Variable.argmax() are
therefore renamed as Variable._argmin_base() and
Variable._argmax_base().
@johnomotani johnomotani changed the title DataArray.indices_min() and DataArray.indices_max() methods Support multiple dimensions in DataArray.argmin() and DataArray.argmax() methods Apr 6, 2020
@johnomotani
Contributor Author

I've updated so that the new functionality is provided by argmin() and argmax() when they are passed multiple dimensions.

  • I've added Dataset.argmin() and Dataset.argmax(), which only work for a single dimension, but give informative exceptions for unsupported cases. If anyone has some behaviour that they would like for Dataset.argmin() with multiple dimensions, I'm happy to implement it, but returning something like a dict of Datasets didn't seem nice, and as far as I understand a Dataset cannot contain something like a dict of DataArrays.
  • I renamed argmin and argmax in ops.py and duck_array_ops.py to _argmin_base and _argmax_base so that I could still use them inside the argmin and argmax methods. Any alternatives, or better suggestions for naming them?
  • Variable no longer has argmin or argmax methods. Is that a problem? I just updated test_variable.py and test_dask.py to use _argmin_base and _argmax_base.

@johnomotani
Contributor Author

The rename to _argmin_base is also causing test_aggregation (which is a test of units with pint) to fail, because the numpy function and the Variable method do not have the same name. Any suggestions how to fix this? It seems tricky because I can't pass either argmin (no Variable method) or _argmin_base (no numpy method) to the test.

@johnomotani
Contributor Author

These test failures seem to have uncovered a larger issue: overriding a method injected by ops.inject_all_ops_and_reduce_methods(DataArray, priority=60) is tricky. I think there may be a way around it by mangling the name of the injected operation if that method is already defined on the class; I tried this and it nearly works, but I think I need to implement an override for argmin/argmax on Variable, because DataArray.__array_wrap__ uses the Variable version of the method... Work in progress!

@shoyer
Member

shoyer commented Apr 7, 2020

Also, feel free to rewrite to avoid using inject_all_ops_and_reduce_methods() for argmin/argmax at all. Method injection is pretty hacky, and generally not worthwhile, e.g., see this note about removing it.

If a method (such as 'argmin') has been explicitly defined on a class
(so that hasattr(cls, "argmin")==True), then do not inject that method,
as it would override the explicitly defined one. Instead inject a
private method, prefixed by "_injected_" (such as '_injected_argmin'), so
that the injected method is available to the explicitly defined one.

Do not perform the hasattr check on binary ops, because this breaks
some operations (e.g. addition between DataArray and int in
test_dask.py).
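The guarded injection described in the commit message above can be sketched like this (function and class names are hypothetical, not the actual ops.py code):

```python
def inject_method(cls, name, func):
    # Sketch of the guarded injection: do not clobber explicit definitions
    if hasattr(cls, name):
        # an explicit definition exists: expose the injected implementation
        # under a private alias so the explicit method can delegate to it
        name = "_injected_" + name
    setattr(cls, name, func)

class Var:
    def argmin(self):                      # explicitly defined, must survive
        return "explicit"

inject_method(Var, "argmin", lambda self: "injected")
inject_method(Var, "argmax", lambda self: "injected")

v = Var()
assert v.argmin() == "explicit"            # not clobbered
assert v._injected_argmin() == "injected"  # still reachable for delegation
assert v.argmax() == "injected"            # no explicit definition, injected as-is
```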
Now not needed because of change to injection in ops.py.
@johnomotani
Contributor Author

Merge conflicts fixed, this PR should be ready to review/merge.


@shoyer shoyer left a comment


This is beautiful work!

I am so excited about getting this into xarray. We are now one short step away from getting a canonical answer to the most frequently asked question about xarray on StackOverflow! https://stackoverflow.com/questions/40179593/how-to-get-the-coordinates-of-the-maximum-in-xarray

Comment on lines 3822 to 3828
def argmin(
self,
dim: Union[Hashable, Sequence[Hashable]] = None,
axis: Union[int, None] = None,
keep_attrs: bool = None,
skipna: bool = None,
) -> Union["DataArray", Dict[Hashable, "DataArray"]]:


In theory, it would be nice to use either TypeVar or @overload so tools like mypy can find out the type of the return value from argmin() based on the type of dim.

But definitely don't worry about that now, we can save that for a follow-up :)
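For reference, the @overload idea could look roughly like this (a sketch only; note that a str is both Hashable and a Sequence, so the real signatures would need care with overlap):

```python
from typing import Dict, Hashable, Sequence, overload

class DataArray:
    # Sketch: the str-typed overload is listed first because a str is both
    # Hashable and a Sequence[Hashable]
    @overload
    def argmin(self, dim: str) -> "DataArray": ...
    @overload
    def argmin(self, dim: Sequence[Hashable]) -> Dict[Hashable, "DataArray"]: ...
    def argmin(self, dim=None):
        raise NotImplementedError("typing sketch only")
```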

@shoyer
Member

shoyer commented Jun 25, 2020

I'm going to wait a little while before merging in case anyone else has comments, but otherwise will merge in a day or two (definitely before the next release)


@dcherian dcherian left a comment


Thanks @johnomotani This is great!

I have some minor comments below that should be easy to fix.

johnomotani and others added 3 commits June 26, 2020 11:43
Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
Pass an explicit axis or dim argument instead to avoid the warning.
Prefer to pass reduce_dims=None when possible, including for variables
with only one dimension. Avoids an error if an 'axis' keyword was
passed.
@keewis
Collaborator

keewis commented Jun 26, 2020

I noticed a few issues while debugging:

  • calling on an array with missing values raises ValueError: All-NaN slice encountered (try ds = xr.tutorial.open_dataset("rasm"); ds.argmin(dim="y"))
  • ds.argmin(dim=...) returns a single index (the same result as ds.argmin()) but ds.Tair.argmin(dim=...) returns something different from ds.Tair.argmin(). Is that intentional?

@johnomotani
Contributor Author

ds.argmin(dim=...) returns a single index (the same result as ds.argmin()) but ds.Tair.argmin(dim=...) returns something different from ds.Tair.argmin(). Is that intentional?

I think this is a bug. dim=... to Dataset.argmin should be an error because dim=... returns a dict of results, and it's not clear how to do that consistently for a Dataset that might have several members with different combinations of dimensions.

@keewis
Collaborator

keewis commented Jun 26, 2020

well, we'd need to somehow be able to use sel to index every variable in a Dataset differently. While that would probably be possible it would make sel way too complicated, so I agree that an error would be good here.

I think no further changes to the pint tests should be required (except refactoring, but I can do that in a new PR after this one has been merged).

@johnomotani
Contributor Author

calling on a array with missing values raises ValueError: All-NaN slice encountered (try ds = xr.tutorial.open_dataset("rasm"); ds.argmin(dim="y"))

Some missing values should be OK, but in your example:

In [27]: ds.Tair.isel(time=0, x=0).values
Out[27]: array([nan, nan, nan, ..., nan, nan, nan])

So there are no non-NaN values in the array slice that argmin is applied over, which is an error.

I guess we could add some special handling for this (not sure what though, because we can't set a variable with type int to nan or None), but I think that is a separate issue that would need a new PR.
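The same ValueError can be reproduced directly with NumPy's nanargmin, which is the kind of call that raises on an all-NaN slice:

```python
import numpy as np

col = np.full(10, np.nan)      # like the all-NaN Tair slice above
try:
    np.nanargmin(col)          # no non-NaN value exists, so no index can be returned
except ValueError as err:
    message = str(err)

assert "All-NaN" in message
```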

@johnomotani
Contributor Author

Thanks @keewis! I think we've addressed all the review comments now.

Co-authored-by: keewis <keewis@users.noreply.github.com>
@dcherian dcherian merged commit bdcfab5 into pydata:master Jun 29, 2020
@dcherian
Contributor

Thanks @johnomotani . This is a significant contribution.

@johnomotani
Contributor Author

Thanks @dcherian 😄 You're very welcome!

dcherian added a commit to raphaeldussin/xarray that referenced this pull request Jul 1, 2020
* upstream/master: (21 commits)
  fix typo in error message in plot.py (pydata#4188)
  Support multiple dimensions in DataArray.argmin() and DataArray.argmax() methods (pydata#3936)
  Show data by default in HTML repr for DataArray (pydata#4182)
  Blackdoc (pydata#4177)
  Add CONTRIBUTING.md for the benefit of GitHub
  Correct dask handling for 1D idxmax/min on ND data (pydata#4135)
  use assert_allclose in the aggregation-with-units tests (pydata#4174)
  Remove old auto combine (pydata#3926)
  Fix 4009 (pydata#4173)
  Limit length of dataarray reprs (pydata#3905)
  Remove <pre> from nested HTML repr (pydata#4171)
  Proposal for better error message about in-place operation (pydata#3976)
  use builtin python types instead of the numpy alias (pydata#4170)
  Revise pull request template (pydata#4039)
  pint support for Dataset (pydata#3975)
  drop eccodes in docs (pydata#4162)
  Update issue templates inspired/based on dask (pydata#4154)
  Fix failing upstream-dev build & remove docs build (pydata#4160)
  Improve typehints of xr.Dataset.__getitem__ (pydata#4144)
  provide a error summary for assert_allclose (pydata#3847)
  ...