isin: better docs, support older numpy and use dask.array.isin. #2038

Merged
merged 4 commits into from
Apr 6, 2018
Changes from 1 commit
4 changes: 2 additions & 2 deletions doc/api.rst
@@ -132,6 +132,7 @@ Missing value handling
    Dataset.bfill
    Dataset.interpolate_na
    Dataset.where
+   Dataset.isin

 Computation
 -----------
@@ -174,7 +175,6 @@ Computation
    :py:attr:`~Dataset.cumsum`
    :py:attr:`~Dataset.cumprod`
    :py:attr:`~Dataset.rank`
-   :py:attr:`~Dataset.isin`

 **Grouped operations**:
    :py:attr:`~core.groupby.DatasetGroupBy.assign`
@@ -285,6 +285,7 @@ Missing value handling
    DataArray.bfill
    DataArray.interpolate_na
    DataArray.where
+   DataArray.isin

 Comparisons
 -----------
@@ -340,7 +341,6 @@ Computation
    :py:attr:`~DataArray.cumsum`
    :py:attr:`~DataArray.cumprod`
    :py:attr:`~DataArray.rank`
-   :py:attr:`~DataArray.isin`

 **Grouped operations**:
    :py:attr:`~core.groupby.DataArrayGroupBy.assign_coords`
25 changes: 25 additions & 0 deletions doc/indexing.rst
@@ -265,6 +265,31 @@ elements that are fully masked:

    arr2.where(arr2.y < 2, drop=True)

.. _selecting values with isin:

Selecting values with ``isin``
------------------------------

To check whether the elements of an xarray object equal a single value, you can
compare with the equality operator ``==`` (e.g., ``arr == 3``). To check
against multiple values at once, use :py:meth:`~xarray.DataArray.isin`:

.. ipython:: python

    arr = xr.DataArray([1, 2, 3, 4, 5], dims=['x'])
    arr.isin([2, 4])

:py:meth:`~xarray.DataArray.isin` works particularly well with
:py:meth:`~xarray.DataArray.where` to support indexing by arrays that are not
already labels of an array:

.. ipython:: python

    lookup = xr.DataArray([-1, -2, -3, -4, -5], dims=['x'])
    arr.where(lookup.isin([-2, -4]), drop=True)

However, some caution is in order: when done repeatedly, this type of indexing
is significantly slower than using :py:meth:`~xarray.DataArray.sel`.
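
As a rough illustration of the ``isin`` plus ``where`` pattern documented above, the following NumPy-only sketch builds a boolean mask from a lookup array and keeps the matching elements. It is a simplified stand-in that works on plain arrays without labels or coordinates, not xarray's implementation; the names ``values``, ``lookup`` and ``wanted`` are assumptions made for the example.

```python
import numpy as np

# Plain-NumPy stand-ins for the `arr` and `lookup` objects in the docs above.
values = np.array([1, 2, 3, 4, 5])
lookup = np.array([-1, -2, -3, -4, -5])
wanted = [-2, -4]  # hypothetical list of values to select by

# Membership test: True where an element of `lookup` appears in `wanted`.
mask = np.isin(lookup, wanted)

# Boolean indexing keeps only the positions where the mask is True,
# loosely analogous to `where(..., drop=True)` on a 1-D array.
selected = values[mask]

print(mask.tolist())      # [False, True, False, True, False]
print(selected.tolist())  # [2, 4]
```

The mask is recomputed on every call, which is one plausible reason why repeated selections of this kind are slower than label-based indexing.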

.. _vectorized_indexing:

22 changes: 12 additions & 10 deletions doc/whats-new.rst
@@ -37,18 +37,20 @@ Documentation
 Enhancements
 ~~~~~~~~~~~~

-- `~xarray.DataArray.isin` and `~xarray.Dataset.isin` methods, which test each value
-  in the array for whether it is contained in the supplied list, returning a bool array.
-  Similar to the ``np.isin`` function. Requires NumPy >= 1.13
-  By `Maximilian Roos <https://github.com/maxim-lian>`
+- :py:meth:`~xarray.DataArray.isin` and :py:meth:`~xarray.Dataset.isin` methods,
+  which test each value in the array for whether it is contained in the
+  supplied list, returning a bool array. See :ref:`selecting values with isin`
+  for full details. Similar to the ``np.isin`` function.
+  By `Maximilian Roos <https://github.com/maxim-lian>`_.

Review comment (Collaborator): Feel free to change! Thank you for adding the docs

 - Some speed improvement to construct :py:class:`~xarray.DataArrayRolling`
-  object (:issue:`1993`)
-  By `Keisuke Fujii <https://github.com/fujiisoup>`_.
-- Handle variables with different values for ``missing_value`` and
-  ``_FillValue`` by masking values for both attributes; previously this
-  resulted in a ``ValueError``. (:issue:`2016`)
-  By `Ryan May <https://github.com/dopplershift>`_.
+  object (:issue:`1993`)
+  By `Keisuke Fujii <https://github.com/fujiisoup>`_.
+
+- Handle variables with different values for ``missing_value`` and
+  ``_FillValue`` by masking values for both attributes; previously this
+  resulted in a ``ValueError``. (:issue:`2016`)
+  By `Ryan May <https://github.com/dopplershift>`_.

 Bug fixes
 ~~~~~~~~~
28 changes: 28 additions & 0 deletions licenses/DASK_LICENSE
@@ -0,0 +1,28 @@
Copyright (c) 2014-2018, Anaconda, Inc. and contributors
All rights reserved.

Redistribution and use in source and binary forms, with or without modification,
are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright notice,
this list of conditions and the following disclaimer.

Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

Neither the name of Anaconda nor the names of any contributors may be used to
endorse or promote products derived from this software without specific prior
written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF
THE POSSIBILITY OF SUCH DAMAGE.
42 changes: 31 additions & 11 deletions xarray/core/common.py
@@ -6,7 +6,7 @@
 import numpy as np
 import pandas as pd

-from . import dtypes, formatting, ops
+from . import duck_array_ops, dtypes, formatting, ops
 from .arithmetic import SupportsArithmetic
 from .pycompat import OrderedDict, basestring, dask_array_type, suppress
 from .utils import Frozen, SortedKeysDict
@@ -746,32 +746,52 @@ def close(self):
         self._file_obj = None

     def isin(self, test_elements):
-        """Tests each value in the array for whether it is in the supplied list
-        Requires NumPy >= 1.13
+        """Tests each value in the array for whether it is in the supplied list.

         Parameters
         ----------
-        element : array_like
-            Input array.
         test_elements : array_like
             The values against which to test each value of `element`.
             This argument is flattened if an array or array_like.
             See numpy notes for behavior with non-array-like parameters.

         Returns
         -------
         isin : same as object, bool
-            Has the same shape as object
+            Has the same shape as this object.
+
+        Examples
+        --------
+
+        >>> array = xr.DataArray([1, 2, 3], dims='x')
+        >>> array.isin([1, 3])
+        <xarray.DataArray (x: 3)>
+        array([ True, False,  True])
+        Dimensions without coordinates: x
+
+        See also
+        --------
+        numpy.isin
         """
-        if LooseVersion(np.__version__) < LooseVersion('1.13.0'):
-            raise ImportError('isin requires numpy version 1.13.0 or later')
         from .computation import apply_ufunc
+        from .dataset import Dataset
+        from .dataarray import DataArray
+        from .variable import Variable
+
+        if isinstance(test_elements, Dataset):
+            raise TypeError(
+                'isin() argument must be convertible to an array: {}'
+                .format(test_elements))
+        elif isinstance(test_elements, (Variable, DataArray)):
+            # need to explicitly pull out data to support dask arrays as the
+            # second argument
+            test_elements = test_elements.data

         return apply_ufunc(
-            np.isin,
+            duck_array_ops.isin,
             self,
             kwargs=dict(test_elements=test_elements),
-            dask='parallelized',
-            output_dtypes=[np.bool_],
+            dask='allowed',
         )

     def __enter__(self):
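
The argument handling in the new ``isin`` body can be pictured with a small standalone sketch: reject a multi-variable container, unwrap a labelled array to its underlying data, and pass everything else through unchanged. ``FakeDataset``, ``FakeArray`` and ``normalize_test_elements`` are hypothetical names invented for this illustration; they only mimic the roles of ``Dataset``, ``Variable``/``DataArray`` and the inline checks in the diff, and are not part of xarray.

```python
import numpy as np


class FakeDataset(dict):
    """Hypothetical stand-in for a container of several variables."""


class FakeArray:
    """Hypothetical stand-in for a labelled array exposing `.data`."""

    def __init__(self, data):
        self.data = np.asarray(data)


def normalize_test_elements(test_elements):
    # A container of several variables has no single array form: reject it.
    if isinstance(test_elements, FakeDataset):
        raise TypeError(
            'isin() argument must be convertible to an array: {}'
            .format(test_elements))
    # Labelled wrappers are reduced to their underlying array.
    if isinstance(test_elements, FakeArray):
        return test_elements.data
    # Lists, tuples and plain arrays pass through untouched.
    return test_elements


wrapped = FakeArray([1, 3])
print(np.isin([1, 2, 3], normalize_test_elements(wrapped)).tolist())
# [True, False, True]
```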
33 changes: 33 additions & 0 deletions xarray/core/dask_array_compat.py
@@ -0,0 +1,33 @@
from __future__ import absolute_import, division, print_function

from functools import wraps
import numpy as np
import dask.array as da

try:
    from dask.array import isin
except ImportError:  # pragma: no cover
    # Copied from dask v0.17.3.
    # Used under the terms of Dask's license, see licenses/DASK_LICENSE.

    def _isin_kernel(element, test_elements, assume_unique=False):
        values = np.in1d(element.ravel(), test_elements,
                         assume_unique=assume_unique)
        return values.reshape(element.shape + (1,) * test_elements.ndim)

    @wraps(getattr(np, 'isin', None))
    def isin(element, test_elements, assume_unique=False, invert=False):
        element = da.asarray(element)
        test_elements = da.asarray(test_elements)
        element_axes = tuple(range(element.ndim))
        test_axes = tuple(i + element.ndim for i in range(test_elements.ndim))
        mapped = da.atop(_isin_kernel, element_axes + test_axes,
                         element, element_axes,
                         test_elements, test_axes,
                         adjust_chunks={axis: lambda _: 1 for axis in test_axes},

Review comment (Contributor): E501 line too long (81 > 79 characters)

                         dtype=bool,
                         assume_unique=assume_unique)
        result = mapped.any(axis=test_axes)
        if invert:
            result = ~result
        return result
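
The fallback above tests every block of ``element`` against every block of ``test_elements`` and then combines the partial answers with ``any``. The sketch below imitates that idea with plain NumPy on 1-D inputs. It is an illustration under stated assumptions, not the dask code: ``blockwise_isin`` and ``n_chunks`` are hypothetical names, ``np.array_split`` stands in for dask chunking, and ``np.isin`` is used in place of ``np.in1d``.

```python
import numpy as np


def blockwise_isin(element, test_elements, n_chunks=2, invert=False):
    """Chunked membership test on 1-D inputs (illustrative only)."""
    element = np.asarray(element)
    test_elements = np.asarray(test_elements)
    out_blocks = []
    for elem_block in np.array_split(element, n_chunks):
        # One partial answer per block of test_elements ...
        partial = [np.isin(elem_block, test_block)
                   for test_block in np.array_split(test_elements, n_chunks)]
        # ... combined with a logical OR, mirroring `mapped.any(...)`.
        out_blocks.append(np.any(partial, axis=0))
    result = np.concatenate(out_blocks)
    return ~result if invert else result


x = np.arange(6)
print(blockwise_isin(x, [1, 4, 9]).tolist())
# [False, True, False, False, True, False]
```

Because each output block depends only on one block of ``element``, the per-block work can in principle run independently, which is what makes the approach suitable for chunked arrays.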
24 changes: 13 additions & 11 deletions xarray/core/duck_array_ops.py
@@ -26,23 +26,24 @@
     has_bottleneck = False

 try:
-    import dask.array as da
-    has_dask = True
+    import dask.array as dask_array
+    from . import dask_array_compat
 except ImportError:
-    has_dask = False
+    dask_array = None
+    dask_array_compat = None


-def _dask_or_eager_func(name, eager_module=np, list_of_args=False,
-                        n_array_args=1):
+def _dask_or_eager_func(name, eager_module=np, dask_module=dask_array,
+                        list_of_args=False, n_array_args=1):
     """Create a function that dispatches to dask for dask array inputs."""
-    if has_dask:
+    if dask_module is not None:
         def f(*args, **kwargs):
             if list_of_args:
                 dispatch_args = args[0]
             else:
                 dispatch_args = args[:n_array_args]
-            if any(isinstance(a, da.Array) for a in dispatch_args):
-                module = da
+            if any(isinstance(a, dask_array.Array) for a in dispatch_args):
+                module = dask_module
             else:
                 module = eager_module
             return getattr(module, name)(*args, **kwargs)
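
The dispatch pattern in this hunk, a factory that returns a function choosing its backend from the types of its leading arguments, can be sketched with the standard library alone. Everything here is hypothetical and chosen only for illustration: ``LazyNumber`` plays the role of a dask array, ``lazy_module`` the role of the dask module, and ``math`` the role of the eager module.

```python
import math


class LazyNumber:
    """Hypothetical stand-in for a lazily evaluated (dask-like) value."""

    def __init__(self, value):
        self.value = value


class lazy_module:
    """Hypothetical namespace playing the role of the dask module."""

    @staticmethod
    def sqrt(x):
        return LazyNumber(math.sqrt(x.value))


def make_dispatcher(name, eager_module=math, lazy=lazy_module,
                    n_array_args=1):
    """Build a function that picks its implementation from argument types."""
    def f(*args, **kwargs):
        # Only the first `n_array_args` arguments decide the backend.
        dispatch_args = args[:n_array_args]
        if any(isinstance(a, LazyNumber) for a in dispatch_args):
            module = lazy
        else:
            module = eager_module
        # Look the function up by name on whichever module was selected.
        return getattr(module, name)(*args, **kwargs)
    return f


sqrt = make_dispatcher('sqrt')
print(sqrt(9.0))                              # 3.0 (eager path)
print(type(sqrt(LazyNumber(16.0))).__name__)  # LazyNumber (lazy path)
```

Making the lazy module a parameter, as the diff does with ``dask_module``, lets a single factory serve functions whose lazy implementation lives in a compatibility module rather than in the main library.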
@@ -63,8 +64,8 @@ def fail_on_dask_array_input(values, msg=None, func_name=None):

 around = _dask_or_eager_func('around')
 isclose = _dask_or_eager_func('isclose')
-notnull = _dask_or_eager_func('notnull', pd)
-_isnull = _dask_or_eager_func('isnull', pd)
+notnull = _dask_or_eager_func('notnull', eager_module=pd)
+_isnull = _dask_or_eager_func('isnull', eager_module=pd)


 def isnull(data):
@@ -80,7 +81,8 @@ def isnull(data):

 transpose = _dask_or_eager_func('transpose')
 _where = _dask_or_eager_func('where', n_array_args=3)
-insert = _dask_or_eager_func('insert')
+isin = _dask_or_eager_func('isin', eager_module=npcompat,
+                           dask_module=dask_array_compat, n_array_args=2)
 take = _dask_or_eager_func('take')
 broadcast_to = _dask_or_eager_func('broadcast_to')
93 changes: 93 additions & 0 deletions xarray/core/npcompat.py
@@ -255,3 +255,96 @@ def flip(m, axis):
        raise ValueError("axis=%i is invalid for the %i-dimensional "
                         "input array" % (axis, m.ndim))
    return m[tuple(indexer)]

try:
    from numpy import isin
except ImportError:

    def isin(element, test_elements, assume_unique=False, invert=False):
        """
        Calculates `element in test_elements`, broadcasting over `element` only.

Review comment (Contributor): E501 line too long (80 > 79 characters)

        Returns a boolean array of the same shape as `element` that is True
        where an element of `element` is in `test_elements` and False otherwise.

Review comment (Contributor): E501 line too long (80 > 79 characters)

        Parameters
        ----------
        element : array_like
            Input array.
        test_elements : array_like
            The values against which to test each value of `element`.
            This argument is flattened if it is an array or array_like.
            See notes for behavior with non-array-like parameters.
        assume_unique : bool, optional
            If True, the input arrays are both assumed to be unique, which
            can speed up the calculation. Default is False.
        invert : bool, optional
            If True, the values in the returned array are inverted, as if
            calculating `element not in test_elements`. Default is False.
            ``np.isin(a, b, invert=True)`` is equivalent to (but faster
            than) ``np.invert(np.isin(a, b))``.

        Returns
        -------
        isin : ndarray, bool
            Has the same shape as `element`. The values `element[isin]`
            are in `test_elements`.

        See Also
        --------
        in1d : Flattened version of this function.
        numpy.lib.arraysetops : Module with a number of other functions for
                                performing set operations on arrays.

        Notes
        -----

        `isin` is an element-wise function version of the python keyword `in`.
        ``isin(a, b)`` is roughly equivalent to
        ``np.array([item in b for item in a])`` if `a` and `b` are 1-D sequences.

Review comment (Contributor): E501 line too long (81 > 79 characters)

        `element` and `test_elements` are converted to arrays if they are not
        already. If `test_elements` is a set (or other non-sequence collection)
        it will be converted to an object array with one element, rather than an

Review comment (Contributor): E501 line too long (80 > 79 characters)

        array of the values contained in `test_elements`. This is a consequence
        of the `array` constructor's way of handling non-sequence collections.
        Converting the set to a list usually gives the desired behavior.

        .. versionadded:: 1.13.0

        Examples
        --------
        >>> element = 2*np.arange(4).reshape((2, 2))
        >>> element
        array([[0, 2],
               [4, 6]])
        >>> test_elements = [1, 2, 4, 8]
        >>> mask = np.isin(element, test_elements)
        >>> mask
        array([[ False, True],
               [ True, False]])
        >>> element[mask]
        array([2, 4])
        >>> mask = np.isin(element, test_elements, invert=True)
        >>> mask
        array([[ True, False],
               [ False, True]])
        >>> element[mask]
        array([0, 6])

        Because of how `array` handles sets, the following does not
        work as expected:

        >>> test_set = {1, 2, 4, 8}
        >>> np.isin(element, test_set)
        array([[ False, False],
               [ False, False]])

        Casting the set to a list gives the expected result:

        >>> np.isin(element, list(test_set))
        array([[ False, True],
               [ True, False]])
        """
        element = np.asarray(element)
        return np.in1d(element, test_elements, assume_unique=assume_unique,
                       invert=invert).reshape(element.shape)

Review comment (Contributor): W292 no newline at end of file
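
The docstring's notes describe ``isin`` as roughly equivalent to a list comprehension over ``element``, and the fallback body follows a flatten, test, reshape recipe. The sketch below spells that recipe out as a slow pure-Python reference and compares it with ``np.isin`` on the docstring's own data. ``isin_reference`` is a hypothetical helper written for this illustration, not part of NumPy or xarray, and the set is converted to a list first, as the docstring recommends.

```python
import numpy as np


def isin_reference(element, test_elements):
    """Slow reference: flatten, test membership one by one, reshape."""
    element = np.asarray(element)
    lookup = list(test_elements)  # sets and other iterables become a list
    flat = [item in lookup for item in element.ravel().tolist()]
    return np.array(flat, dtype=bool).reshape(element.shape)


element = 2 * np.arange(4).reshape((2, 2))
test_set = {1, 2, 4, 8}

# Convert the set to a list before handing it to NumPy.
mask = np.isin(element, list(test_set))

print(mask.tolist())                                            # [[False, True], [True, False]]
print(np.array_equal(mask, isin_reference(element, test_set)))  # True
```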