Support Everything that XArray Expects #1

Open
14 of 16 tasks
mrocklin opened this issue Apr 17, 2017 · 49 comments
Labels
enhancement

Comments

@mrocklin
Contributor

mrocklin commented Apr 17, 2017

bolt-project/bolt#58

  • single-argument ufuncs (sin, exp, etc.) and ufunc-like functions (pd.isnull, notnull, astype, around, isclose)
  • broadcasting binary operations (e.g., ndarray arithmetic)
  • three argument version of where (preferably with broadcasting)
  • aggregation methods
    • max/min/sum/prod
    • argmax/argmin/std/var
    • nan-skipping aggregations (e.g., nanmin, nanmax, nanprod, nansum)
  • indexing with integer arrays, booleans, slices, None
  • transpose
  • indexing:
    • basic indexing (int/slice)
    • outer indexing for a single array
    • outer indexing (int, slice and 1d integer arrays separately applied to each axis)
    • vectorized indexing (integer arrays with broadcasting, like NumPy)
  • broadcast_to (NumPy 1.10)
  • concatenate and stack (NumPy 1.10)
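
A minimal smoke test covering a few of the items above, assuming the current sparse API (sparse.random, sparse.concatenate, and sparse.stack are assumed here; treat this as an illustrative sketch, not part of the checklist):

import numpy as np
import sparse

x = sparse.random((3, 4), density=0.5)   # random COO array

np.sin(x)                                # single-argument ufunc
x + x                                    # broadcasting binary operation
x.sum(axis=0)                            # aggregation
x.max()                                  # aggregation
x.transpose((1, 0))                      # transpose
sparse.concatenate([x, x], axis=0)       # concatenate
sparse.stack([x, x])                     # stack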
@hameerabbasi added the enhancement label Dec 28, 2017
@shoyer
Member

shoyer commented Jan 30, 2018

I updated the check-list (apparently I'm an "Owner" for pydata) to drop "insert/fancy indexing" and add a three-item checklist for indexing instead.

@hameerabbasi
Collaborator

Do we have partial XArray compatibility or does XArray fail with sparse arrays outright?

@shoyer
Member

shoyer commented Jan 30, 2018

@hameerabbasi Not yet, see pydata/xarray#1375. But sparse is close enough that it might be worth trying soon.

From a usability perspective, the most helpful additional feature would be nan-skipping aggregations. I know sparse will probably rarely be used with NaNs, but xarray's aggregation methods default to skipna=True so this would be valuable for consistency.

In addition to the direct uses, xarray uses where and outer/vectorized indexing internally primarily for alignment with join='outer' (e.g., in concat or merge). This sort of alignment would require making sparse arrays dense, unless they have a fill value of NaN rather than 0. So these operations are less essential than they might otherwise seem.

@hameerabbasi
Collaborator

I just tested with your example code: NaN-skipping aggregations use ufuncs with reduce, so they should already be supported in theory.

>>> class A(np.ndarray):
...     def __array_ufunc__(self, *args, **kwargs):
...         print('ufunc', args, kwargs)
...         
>>> np.nansum(np.arange(5).view(A))
ufunc (<ufunc 'add'>, 'reduce', A([0, 1, 2, 3, 4])) {'axis': None, 'dtype': None, 'keepdims': False}

Although I have absolutely no idea how to discern from the args/kwargs that this was NaN-skipping.

@hameerabbasi
Collaborator

hameerabbasi commented Jan 30, 2018

Ah, it was a dtype issue.

>>> np.nansum(np.arange(5, dtype=np.float).view(A))
ufunc (<ufunc 'isnan'>, '__call__', A([0., 1., 2., 3., 4.])) {}
ufunc (<ufunc 'add'>, 'reduce', A([0., 1., 2., 3., 4.])) {'axis': None, 'dtype': None, 'keepdims': False}

So we're good to go on those already. :-)

Edit: Concrete example:

>>> ar = sparse.DOK((2, 2))
>>> ar[1, 1] = np.nan
>>> s = sparse.COO(ar)
>>> np.nansum(s)
0.0

@mrocklin
Contributor Author

mrocklin commented Jan 30, 2018

FWIW this is a great demonstration of the value of NumPy's use of protocols. It would be great if we could get an alternative ndarray implementation, sparse, to work with a complex downstream user of NumPy code, XArray, without either library having to explicitly know about the other. cc @njsmith

@hameerabbasi
Collaborator

I was also considering sparse arrays with different fill values. However, some operations (such as dot) blow up in complexity, while others are fairly easy to support.

@shoyer
Member

shoyer commented Jan 30, 2018

NumPy's nan-aggregations work, but only by converting sparse arrays into numpy arrays, inside np.lib.nanfunctions._replace_nan:

import numpy as np
import sparse

class NoncoercibleCOO(sparse.COO):
  def __array__(self, *args, **kwargs):
    raise Exception('cannot convert to numpy')
  
ar = sparse.DOK((2, 2))
ar[1, 1] = np.nan
s = NoncoercibleCOO(ar)
np.nansum(s)

This results in:

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-13-481c3bc38f6d> in <module>()
      9 ar[1, 1] = np.nan
     10 s = NoncoercibleCOO(ar)
---> 11 np.nansum(s)

/usr/local/lib/python3.6/dist-packages/numpy/lib/nanfunctions.py in nansum(a, axis, dtype, out, keepdims)
    579 
    580     """
--> 581     a, mask = _replace_nan(a, 0)
    582     return np.sum(a, axis=axis, dtype=dtype, out=out, keepdims=keepdims)
    583 

/usr/local/lib/python3.6/dist-packages/numpy/lib/nanfunctions.py in _replace_nan(a, val)
     62 
     63     """
---> 64     a = np.array(a, subok=True, copy=True)
     65 
     66     if a.dtype == np.object_:

<ipython-input-13-481c3bc38f6d> in __array__(self, *args, **kwargs)
      4 class NoncoercibleCOO(sparse.COO):
      5   def __array__(self, *args, **kwargs):
----> 6     raise Exception('cannot convert to numpy')
      7 
      8 ar = sparse.DOK((2, 2))

Exception: cannot convert to numpy

So this really isn't a good solution yet :).

You might actually view this as the flip side of implementing protocols (like __array__): you get default implementations, but they aren't always good ones and it's not always obvious when that's the case.

@mrocklin
Contributor Author

Is this already reported upstream in NumPy?

@njsmith
Member

njsmith commented Jan 30, 2018

Hmm. So for this what we really want is something like: implement nansum using which=, and make reductions support which=?

@shoyer
Member

shoyer commented Jan 30, 2018

I was also considering sparse arrays with different fill values. However, some operations (such as dot) blow up in complexity, while others are fairly easy to support.

I think this could be quite interesting indeed. NaN (which pandas's SparseArray uses) is the most obvious other default value to support. Even dot could be made to work if you make a nan-skipping version (nandot, nanmatmul?).

Is this already reported upstream in NumPy?

I don't know if this really qualifies as a bug so much as an unintended consequence. I think the easiest fix would be to replace np.copyto() in _replace_nan with a protocol-dispatching version of the three-argument variant of np.where().
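
A rough sketch of that fix, assuming a three-argument where() that dispatches through the array protocols (NumPy's actual _replace_nan differs; this is illustrative only):

import numpy as np

def _replace_nan_sketch(a, val):
    # isnan dispatches through __array_ufunc__, so duck arrays stay in control
    mask = np.isnan(a)
    # a protocol-dispatching where() would avoid the np.array(a, copy=True) coercion
    return np.where(mask, val, a), mask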

@hameerabbasi
Collaborator

hameerabbasi commented Jan 30, 2018

A simpler option would be to add skipnan= to the ufunc reduce and accumulate methods.

Edit: And reduceat.

Edit: Or maybe in ufuncs themselves: ufunc(NaN, NaN) = NaN; otherwise, return the non-NaN item.

@hameerabbasi
Collaborator

I think this could be quite interesting indeed. NaN (which pandas's SparseArray uses) is the most obvious other default value to support. Even dot could be made to work if you make a nan-skipping version (nandot, nanmatmul?).

I meant supporting arbitrary fill values. This would make many dense operations automatically possible, e.g. y = x + 5; y would have a fill value of 5.

You might, for example, do x[x == 0] = np.nan (note: multidimensional boolean indexing is not currently supported) and it would just work.
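
A conceptual sketch of that idea (elemwise_with_fill is a made-up name, not the sparse API): an elemwise operation maps the stored entries and the fill value alike.

def elemwise_with_fill(func, x):
    out_data = func(x.data)        # operate on the stored entries
    out_fill = func(x.fill_value)  # the new background value, e.g. 0 + 5 == 5
    return out_data, out_fill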

@shoyer
Member

shoyer commented Jan 30, 2018

I meant supporting arbitrary fill values. This would make many dense operations automatically possible, e.g. y = x + 5; y would have a fill value of 5.

Yes, I see the elegance in that.

NaN has a nice property (like zero): NaN * anything -> NaN. This could make some operations easier to implement if a general fill value is not viable.

@hameerabbasi
Collaborator

NaN has a nice property (like zero): NaN * anything -> NaN. This could make some operations easier to implement if a general fill value is not viable.

The fill value is viable, and pretty easy to implement. I would just need to make a decorator that filters out operations requiring a zero fill value, used like this:

@zero_fill_value
def dot(a, b):
    ...

@fujiisoup
Member

I recently raised an issue about nan-aggregations in numpy/numpy#10456, but it looks a little hard to implement them without a copy.

@hameerabbasi
Collaborator

hameerabbasi commented Feb 2, 2018

The main issue I see with the three-argument version of where is three-way broadcasting of sparse arrays. You have to match 7 combinations of "Is it in this one's dimensions but not the other one's?"

If someone can come up with an easier way to do N-way broadcasting of sparse arrays rather than just repeating them over and over... I'd be all for it. Broadcasts in NumPy are views; here they're separate objects (repeats), with all the memory and performance losses that come with that.

Edit: Also, the performance speedups when an operator returns zero whenever one side is zero (like logical-and and multiply) are significant.

Edit 2: Currently, I've optimised _elemwise_binary so that it doesn't have to repeat them. That's hard to do for N-way broadcasts, which is the main issue here.
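
To make the combinatorics concrete, the matches correspond to the nonempty zero/nonzero patterns over the inputs (a throwaway illustration):

from itertools import product

inputs = ['x', 'y', 'z']
# every pattern of "nonzero here, zero there", excluding the all-zero case
patterns = [p for p in product((True, False), repeat=len(inputs)) if any(p)]
print(len(patterns))  # 7, i.e. 2**3 - 1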

@hameerabbasi
Collaborator

I just realized there is a complicated (yet, hiding in plain sight) way to recursively reduce an N-way broadcast down into a 2-way/1-way (without losing out on any performance benefits). It will also simplify the code hugely. I'll try it today and tomorrow, maybe I'll come up with something genius.

@shoyer
Member

shoyer commented Feb 2, 2018

The main issue I see with the three-argument version of where is three-way broadcasting of sparse arrays. You have to match 7 combinations of "Is it in this one's dimensions but not the other one's?"

In practice, broadcasting for where is often trivial, with one or both of the second and third arguments as scalars, e.g., where(x < 0, np.nan, x). But it's nice to support broadcasting for completeness, even if it will be slower.

@hameerabbasi
Collaborator

hameerabbasi commented Feb 5, 2018

The broadcasting is turning out to be more complex than I anticipated (it took most of last weekend).

  • On the plus side:
    • I've got an algorithm down for general N-ary broadcasting.
    • It's no worse in big-O for any case (multiplication for N arrays will sidestep most of the computation)
    • The code is generally more elegant.
  • On the down side:
    • It has to do an exponential number of matches (exponential in the number of inputs, concretely 2^N - 1 for N inputs), but that's unavoidable for a general elemwise function, with or without broadcasting.
      • For this reason, it's better to do x * y * z rather than sparse.elemwise(lambda x, y, z: x*y*z, x, y, z)
      • where will require 7 matches when all 3 inputs are COO, but none in the trivial case. However, it'll benefit from skips. In the case you described, it'll act exactly as a binary function, requiring only 3 matches. (Scalars are not counted in the total inputs and are functools.partial-d away.)
    • I had to give up some performance on mixed ndarray-COO combinations. I'm working on it; it might be possible to put this back in.
      • Is it worth supporting this? I'm considering dropping it entirely and requiring a conversion to COO. It will also put roadblocks in the way of fill values.
    • A bit of caching is still required to get this optimal.

@shoyer
Member

shoyer commented Feb 5, 2018

@hameerabbasi what exactly is involved in the "match" you're proposing?

My initial thought is that an algorithm for this could simultaneously iterate over all input arguments exactly once each. Assuming sorted coordinates, this looks something like:

# warning: untested!
iter_coords = [iter(inp.coords) for inp in inputs]
iter_datas = [iter(inp.data) for inp in inputs]
indices = [next(it) for it in iter_coords]
result_coords = []
result_data = []
num_finished = 0
while num_finished < len(inputs):
    # exhausted inputs are marked with None and never win the min
    current_index = min(ix for ix in indices if ix is not None)
    current_data = [next(it_data) if current_index == index else inp.fill_value
                    for it_data, index, inp in zip(iter_datas, indices, inputs)]
    result_coords.append(current_index)

    # in reality, we probably save an explicit indexer array in each input's data,
    # using -1 for fill-values, and then compute the result data once by applying
    # it to 1D vectors
    result_data.append(func(*current_data))

    next_indices = []
    for it, index in zip(iter_coords, indices):
        if current_index == index:
            index = next(it, None)  # None once this input is exhausted
            if index is None:
                num_finished += 1
        next_indices.append(index)
    indices = next_indices

This would run in O(NK) time, where N = sum(input.coords.size for input in inputs) and K = len(inputs). Of course, it would need to be implemented in something fast (e.g., Cython, Numba, C++) to be viable.

@hameerabbasi
Collaborator

hameerabbasi commented Feb 6, 2018

Since my last comment was a mishmash of incoherent half-formed thoughts, I'm re-commenting.

This is much simpler than what I had in mind, and can be easily extended to a broadcasting scenario. The main problem I see with something like this is the following:

Consider two arrays a and b. Here, a.shape = (10**5,), and b.shape = (10**5, 10**5). Also, a.nnz = b.nnz = 1. Now suppose I'm computing a * b. We currently optimize this by noticing that func(0, b.data) = 0 and func(a.data, 0) = 0. Therefore, we compute the "full matches", and don't compute the "partial matches". If you were to calculate a * b and a + b in this situation, you would see the speed difference.

If this optimization can somehow be ported to your method (I don't see how, unfortunately, while keeping broadcasting), I'd be perfectly willing to write up a Cython equivalent of it.

To answer your original question: a "match" is a calculation of where input 1 is nonzero, input 2 is zero, and so on, for every possible combination.

@shoyer
Member

shoyer commented Feb 6, 2018

Consider two arrays a and b. Here, a.shape = (10**5,), and b.shape = (10**5, 10**5). Also, a.nnz = b.nnz = 1. Now suppose I'm computing a * b. We currently optimize this by noticing that func(0, b.data) = 0 and func(a.data, 0) = 0. Therefore, we compute the "full matches", and don't compute the "partial matches". If you were to calculate a * b and a + b in this situation, you would see the speed difference.

It seems like the solution we need here is a "sparse broadcast" function, which only repeats non-zero indices in the result. The algorithm would look something like:

  • Find all indices that are non-zero for each dimension on one or more arrays.
  • When broadcasting, only fill in these indices.

"sparse" vs "regular" broadcasting could be chosen based on the nature of the operator being applied (e.g., * vs +)

@hameerabbasi
Collaborator

hameerabbasi commented Feb 6, 2018

That's (roughly) what I'm doing with my matching setup. I do the following (in pseudocode):

iterate through all possible combinations of zero/nonzero inputs:
    match nonzero inputs' coordinates (find coordinates that match)
    find func(corresponding_data, zeros)
    filter zeros from func (and corresponding coordinates) (this does the optimization I describe)
    match matched coordinates with every zero-input and filter those out
    Broadcast leftover coords/data
    add coords, data to a list

concatenate all coords and data lists

Edit: The matching setup is on a branch on my fork, it isn't in the main codebase, and it isn't in working condition yet. It needs more work.

Edit 2: Choosing it based on the operator is still troublesome for some edge cases involving np.inf or np.nan.

@shoyer
Member

shoyer commented Feb 6, 2018

I don't think it's necessary to iterate through every possible combination of zero/nonzero inputs. Maybe something like:

  • Based on the operator, pick a version of broadcasting to do ("dense" broadcasting like NumPy for + or "sparse" broadcasting that only repeats non-zero elements for *)
  • Figure out the target shape and sparsity structure across all the inputs
  • "Broadcast" each input to the target shape / sparsity. This requires copying data, but should only add a constant factor overhead for the operation.
  • All broadcast inputs now have the same sparsity structure, so applying an element-wise NumPy function is as simple as applying it to the data on each array, e.g., func(*[inp.data for inp in broadcast_inputs]).

If you are concerned about the (constant factor) overhead for explicit broadcasting in cases like 2 * x, you could add a shortcut that converts these operators into "unary" operators that are only applied to a single sparse array.
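
A sketch of those four steps; union_coords, intersect_coords, and broadcast_to_coords are hypothetical helpers, and np.broadcast_shapes assumes NumPy >= 1.20:

import numpy as np
import sparse

def union_coords(inputs, shape):
    # hypothetical helper: coordinates non-zero in ANY input ("dense" mode)
    raise NotImplementedError

def intersect_coords(inputs, shape):
    # hypothetical helper: coordinates non-zero in EVERY input ("sparse" mode)
    raise NotImplementedError

def broadcast_to_coords(inp, coords, shape):
    # hypothetical helper: inp's data (or fill value) at each target coordinate
    raise NotImplementedError

def sparse_elemwise(func, *inputs, mode='dense'):
    shape = np.broadcast_shapes(*(inp.shape for inp in inputs))
    # steps 1-2: pick one target sparsity structure for every input
    coords = (intersect_coords if mode == 'sparse' else union_coords)(inputs, shape)
    # step 3: copy each input's data onto the shared structure
    datas = [broadcast_to_coords(inp, coords, shape) for inp in inputs]
    # step 4: identical structure everywhere, so func applies to the data vectors directly
    return sparse.COO(coords, func(*datas), shape=shape)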

@hameerabbasi
Collaborator

hameerabbasi commented Feb 7, 2018

What I'm more concerned about is:

  • The maintenance overhead of this approach. For example, fill-values will be affected by this (it won't generalize to arbitrary fill values).
  • If there's a nan or inf in there, the values won't match NumPy (because np.nan * 0 = np.nan and np.inf * 0 = np.nan, thus we need "dense" broadcasting in this case). To fix this, we would have to make the "switch" between the two approaches more complex.
  • It is guaranteed to not kick in for user-defined functions. For example, sparse.elemwise(lambda x,y:x*y, x, y).
  • It won't kick in for all cases, even for us, such as np.where(x < 0, 0, x).

Some things that could be made independent of the implementation:

  • Constants or 0-d arrays are already functools.partial-d away, by a custom wrapped function in my implementation. Even np.where(x < 0, np.nan, x) would be treated as a binary function equivalent to lambda a, b: np.where(a, np.nan, b) with a = x < 0 and b = x in my implementation, and the cost would be the same as a binary function. Although, of course, we could further bring down the cost by doing sparse.elemwise(lambda x: np.where(x < 0, np.nan, x), x) (which would be treated as unary). Of course, this could be ported to any implementation, so is not a downside/upside.
  • np.where(x < 0, np.nan, 5) or similar would already be treated as a unary function (zero matches!)

To be perfectly clear, I would prefer your approach much more (it is undeniably more performant in the "dense" broadcasting case) if these edge-cases etc. could be nicely handled automatically.

@shoyer
Member

shoyer commented Feb 7, 2018

For example, fill-values will be affected by this (it won't generalize to arbitrary fill values).

The property you need here is op(fill_value, x) == fill_value for all x (or op(x, fill_value) == fill_value). In general this only holds for particular combinations of op and fill_value, e.g.,

  • 0 * x -> 0
  • x * 0 -> 0
  • 0 / x -> 0
  • 1 ** x -> 1
  • Almost any operation with NaN -> NaN
  • Various other identities with inf and -inf

So sure, the way this works is fill-value specific, but you could store the list of these identities and operations in a table.
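
For instance, the table could be a plain dict keyed by (op, fill_value); the names below are illustrative only:

import math
import operator

# op(fill_value, x) == result for all x (left-hand identities only)
LEFT_IDENTITIES = {
    (operator.mul, 0): 0,               # 0 * x -> 0
    (operator.truediv, 0): 0,           # 0 / x -> 0
    (operator.pow, 1): 1,               # 1 ** x -> 1
    (operator.mul, math.nan): math.nan, # NaN * x -> NaN
}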

If there's a nan or inf in there, the values won't match NumPy (because np.nan * 0 = np.nan and np.inf * 0 = np.nan, thus we need "dense" broadcasting in this case). To fix this, we would have to make the "switch" between the two approaches more complex.

This feels like a separate issue to me. How do you handle this in your current approach?

One option is to keep track of the finiteness of sparse arrays, e.g., by adding a (possibly cached) all_finite property.

It is guaranteed to not kick in for user-defined functions. For example, sparse.elemwise(lambda x,y:x*y, x, y).

I think I finally understand your approach: you are actually computing func(inputs[0].data, 0), func(0, inputs[1].data), etc., alongside non-zero elements in both arrays. This feels a little crazy to me if you can know it statically :).

I would suggest adding an Operation class for wrapping functions applied to sparse arrays, which could carry information like the types of identities the function obeys. Builtin operations and NumPy ufuncs could automatically be mapped to the appropriate Operation. You basically need this anyway if you want to let users write their own versions of functions like np.sin().
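
A minimal sketch of such a wrapper, assuming nothing about sparse's internals (Operation here is hypothetical):

import operator
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Operation:
    func: Callable
    # argument positions where a zero input forces a zero output
    zero_annihilating: frozenset = frozenset()

MUL = Operation(operator.mul, frozenset({0, 1}))  # a zero on either side wins
ADD = Operation(operator.add)                     # no zero shortcut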

@hameerabbasi
Collaborator

hameerabbasi commented Feb 8, 2018

If you're asking how we "match" for a particular combination of zero/nonzero inputs, we:

  • Find out which axes are common (non-broadcasted) for all nonzero inputs.
  • Perform the equivalent of an SQL "outer join" on the coordinates for just those axes.
  • Repeat data accordingly.
  • Calculate f(data_for_nonzeros, zero_for_zeros)
  • Filter out zeros/fill values and corresponding coordinates
  • Broadcast result to output shape.
  • For each zero input
    • Filter out matches with that zero input.

At the end, we simply concatenate all possible data/coords and return the result.

@hameerabbasi
Collaborator

I've implemented N-ary broadcasting following the algorithm here over in #98. Feedback welcome. :)

@mrocklin
Contributor Author

OK, I'm hearing the following thoughts (paraphrasing):

  • @shoyer: "we should take advantage of statically known information about our function, like that zeros in certain inputs always yield zeros in the output"
  • @hameerabbasi : "forcing users to create Operation classes seems a bit heavy, we should support vanilla functions without much fore-knowledge"

These both seem like reasonable positions to me. I wonder if we can accomplish both by having optional parameters to elemwise that specify which inputs propagate zeros:

elemwise(operation.mul, x, y, z, zero_prop=[0, 1, 2])
or
elemwise(operation.add, x, y, z, zero_prop=[])

With perhaps values like True -> range(len(args)) and False -> [] (also, please don't use the name zero_prop, there are certainly better options)

This is only an example of such an argument. There are many more complex situations that might arise, but I suspect that this handles the common case.

I do think that some static information like this is likely to be desirable when we start profiling. Also, I suspect that most use of elemwise will be by expert users (like us when we're using elemwise internally), and so providing this information might be easy.
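
Normalizing such a flag could look like this (zero_prop is the placeholder name from above):

def normalize_zero_prop(zero_prop, nargs):
    # True -> every argument propagates zeros; False/None -> none do
    if zero_prop is True:
        return list(range(nargs))
    if not zero_prop:
        return []
    return list(zero_prop)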

@mrocklin
Contributor Author

Alternatively, what is our current support for operations that return non-zero values when one of the inputs is zero? My guess is that we raise an error. If that's the case, can we just ignore the situation where we produce values on zero-valued inputs? Am I understanding things correctly?

@hameerabbasi
Collaborator

I do think that some static information like this is likely to be desirable when we start profiling. Also, I suspect that most use of elemwise will be by expert users (like us when we're using elemwise internally), and so providing this information might be easy.

I agree 100% with this. But (at the risk of repeating myself) this ignores the NaN/inf problem with mul. While doing this, we would have to filter/check for those, or risk inconsistency with NumPy in these cases. Also, it doesn't work for arbitrary fill values, and the refactor I did in #84 would need to be partially reversed.

Alternatively, what is our current support for operations that return non-zero values when one of the inputs is zero? My guess is that we raise an error. If that's the case, can we just ignore the situation where we produce values on zero-valued inputs? Am I understanding things correctly?

Currently, we raise an error iff func(all_zeros) would produce a nonzero. If zeros/nonzeros are mixed (like in add), we don't error but compute func(zero, other.data), etc. Here, I've just generalized that behavior to multiple operands. The plan is to make it so that func(all_fillvalues) becomes the fill value of the returned array.

@mrocklin
Contributor Author

@shoyer I don't suppose you happen to have a TestCase for what an XArray container class should support, do you? If so, it would be good to import it into our test suite to track progress between the projects.

@shoyer
Member

shoyer commented Feb 21, 2018 via email

@hameerabbasi
Collaborator

Feel free to ping me or involve me in the project. :-)

@hameerabbasi
Collaborator

Three-argument version of where and NaN-skipping aggregations are in!

@mrocklin
Contributor Author

mrocklin commented Mar 4, 2018

@shoyer what is the right thing for folks to push on here if they have time? The multipledispatch VarArgs conversation? Internal fixes to XArray? A test suite?

@hameerabbasi
Collaborator

My intuition would be the multipledispatch VarArgs conversation. We should do things properly. 😄 I don't mind it taking time; it'll save time and headache down the road. Of course, if the consensus is on another solution, I'd happily work on that, too.

@shoyer
Member

shoyer commented Mar 4, 2018

VarArgs will be needed for finishing this in xarray, but really only for a few functions (concatenate and stack). So I think we could simultaneously work on:

  1. Getting VarArgs of some form into multipledispatch.
  2. Porting as much of the existing xarray dispatch machinery for NumPy and Dask to use multipledispatch. This has at least two components: the wrappers in xarray/core/duck_array_ops.py, and implementing __array_ufunc__ (see pydata/xarray#1617).

@hameerabbasi
Collaborator

Since the final review deadline for the papers in SciPy 2018 is June 18, I thought I'd revive this thread. I'd love to show off Dask/Sparse and XArray/Sparse integration in the paper and at SciPy 2018.

cc @mrocklin @shoyer If you need anything done on my end, or anything at all you think I'm capable of that'd speed this along (including contributing to XArray or Dask; I'd love to dip my feet in), don't hesitate to say so.

I looked into multipledispatch, but VarArgs seems like something better done by someone else. arrayish is already up; if any changes are required there, let me know.

@mrocklin
Contributor Author

Experiments with dask.array and sparse would be welcome. I think that this should work today (we do include some tests in dask.array's CI). It might be interesting to profile a few workflows; I'll think a bit about what these might be.

For XArray, my guess is that the right thing to do is to follow @shoyer's two paths listed just above. Looking at duck_array_ops.py and how that functionality gets used within XArray seems like a sensible place to start, though @shoyer may have other suggestions (I don't know this codebase as well). My guess is that there is some work to get through here, but probably nothing too controversial.

@mrocklin
Contributor Author

It also looks like the conversation in mrocklin/multipledispatch#72 ended in a nice place. Again, it seems like consensus was reached and now this requires some hopefully straightforward work.

@hameerabbasi
Collaborator

I’ve implemented outer indexing for a single array, but not for multiple arrays. When NumPy implements oindex and vindex, we’ll follow suit. I somehow don’t feel like implementing every single fancy-indexing case with its weird behaviour.

@khaeru

khaeru commented Mar 7, 2021

Not sure if this is the right place to mention this, but it doesn't appear elsewhere in issues/PRs on the repo.

I ran into the following trying to call .shift() on a sparse.COO-backed DataArray:

../../.local/lib/python3.8/site-packages/xarray/core/dataarray.py:3077: in shift
    variable = self.variable.shift(
../../.local/lib/python3.8/site-packages/xarray/core/variable.py:1205: in shift
    result = result._shift_one_dim(dim, count, fill_value=fill_value)
../../.local/lib/python3.8/site-packages/xarray/core/variable.py:1166: in _shift_one_dim
    data = duck_array_ops.pad(
../../.local/lib/python3.8/site-packages/xarray/core/duck_array_ops.py:56: in f
    return wrapped(*args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

args = (<COO: shape=(2, 5), dtype=float64, nnz=4, fill_value=nan>, [(1, 0), (0, 0)]), kwargs = {'constant_values': nan, 'mode': 'constant'}
relevant_args = (<COO: shape=(2, 5), dtype=float64, nnz=4, fill_value=nan>,)

>   ???
E   TypeError: no implementation found for 'numpy.pad' on types that implement __array_function__: [<class 'sparse._coo.core.COO'>]

<__array_function__ internals>:5: TypeError

There were other similar cases. So far I've noticed:

  • numpy.flip for DataArray.bfill
  • numpy.nancumprod for DataArray.cumprod
  • numpy.pad for DataArray.shift

@hameerabbasi
Collaborator

@khaeru Do you have use-cases for these for sparse arrays? If so, please raise separate issues for each missing function.

In a lot of cases, one can’t achieve better performance on these than with dense arrays.

@khaeru

khaeru commented Mar 7, 2021

@hameerabbasi the concern is less performance than being able to use the DataArray API as it is defined, without a bunch of extra wrapper code, which is what I understood to be the topic of this issue, given the title.

My current mitigation is a subclass that allows some of these calls by converting to a dense (numpy.ndarray-backed) DataArray, performing the operation, then converting back to sparse.COO-backed. I know that subclassing the xarray classes is not encouraged; I would be very happy not to have to do that :)

I'll do as you suggest and open an issue for one example.
