Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Interop with Dask #22

Open
saulshanabrook opened this issue Jul 10, 2018 · 20 comments
Open

Interop with Dask #22

saulshanabrook opened this issue Jul 10, 2018 · 20 comments

Comments

@saulshanabrook
Copy link
Member

saulshanabrook commented Jul 10, 2018

I am opening this issue to track how xnd/gumath could work with Dask based on talking with @mrocklin.

We can use the dask.array.from_array on an xnd object, to have Dask chunk up that array and execute operations in parallel. The requirements dask has for an array like object are listed here.

We were going back and forth on whether the numpy-ish API should be implemented on the xnd object directly or if we should create a wrapper class, like this, that adds numpy methods.

The most basic requirement is to have shape and dtype attributes on the object. We can forward those from the type attribute:

In [1]: from xnd import xnd

In [3]: x = xnd(list(range(10)))

In [4]: import dask.array as da

In [7]: x.shape = x.type.shape

In [15]: x.dtype = str(x.type.hidden_dtype)

In [17]: d = da.from_array(x, chunks=(5,))

In [19]: d.sum()
Out[19]: dask.array<sum-aggregate, shape=(), dtype=int64, chunksize=()>

In [21]: d.sum().compute()
Out[21]: 45

In [23]: np.exp(d).sum().compute()
Out[23]: 12818.308050524603

In [24]: d[::3]
Out[24]: dask.array<getitem, shape=(4,), dtype=int64, chunksize=(2,)>

In [25]: d[::3].compute()
Out[25]: array([0, 3, 6, 9])

We could also implement __array_ufunc__ so that when we call np.sin on an xnd array, we could get back an xnd array by using the gumath.sin function. This is how cupy implements it.

We can also create a concat and register it with dask.array.core.concatenate_lookup. I don't think a concat gufunc exists yet.

@hameerabbasi
Copy link
Contributor

There is an issue with this: Dask casts to a NumPy array by default in da.from_array unless you pass the asarray=False kwarg. So internally, all calculations are happening as a NumPy array.

@teoliphant
Copy link
Member

teoliphant commented Oct 29, 2018 via email

@skrah
Copy link
Member

skrah commented Mar 26, 2019

I've implemented __array_ufunc__ and __array__. Unfortunately dask calls np.dtype(obj), which does not use __array_typestr__.

I'm not sure if __array_typestr__ is deprecated, but it is exactly what is needed here.

@saulshanabrook's example works now if xnd.array.dtype returns a PEP-3118 string, but that is a bit of a waste, because many ndt types cannot be expressed as PEP-3118.

By "works", I mean that d.sum().compute() also returns an xnd.array again.

So it is only __array_typestr__ or a similar protocol that is missing.

@skrah
Copy link
Member

skrah commented Mar 26, 2019

I think I can try the __array_interface__, though that seems like a lot of code just for the typestring.

@teoliphant
Copy link
Member

teoliphant commented Mar 26, 2019 via email

@skrah
Copy link
Member

skrah commented Mar 27, 2019

Yes, I've added __array_function__ now. It seems to work with NumPy.

EDIT: deleted the CuPy example, which was wrong.

@mrocklin
Copy link

cc @pentschev for visibility. He has been doing the Dask+CuPy work and might have suggestions for you all

@skrah
Copy link
Member

skrah commented Mar 27, 2019

Thanks @mrocklin @pentschev for suggestions. For xnd the problem is (but perhaps I'm doing something silly):

The following results in array(15.0, type='float32') <class 'xnd.array'>, which is what I want:

from xnd import array
import dask.array as da

x = array([1,2,3,4,5], dtype="float32")
d = da.from_array(x, chunks=(5,), asarray=False)
ans = d.sum().compute()
print(ans, type(ans))

This gives 15.0 <class 'numpy.float32'>:

from xnd import array
import dask.array as da

x = array([1,2,3,4,5], dtype="float32")
d = da.from_array(x, chunks=(2,), asarray=False)
ans = d.sum().compute()
print(ans, type(ans))

So the question is essentially if conversion to ndarray is expected somewhere along the way if the chunks are not trivial.

@skrah
Copy link
Member

skrah commented Mar 27, 2019

Here are hopefully correct examples from the website. For xnd the problem is the same, the cupy example has a ValueError.

Cupy and dask are the latest versions from PyPI, installed today.

$ cat xnd_numpy.py 
import os
os.environ["NUMPY_EXPERIMENTAL_ARRAY_FUNCTION"] = "1"
import numpy as np
import dask.array as da
import dask
import xnd

x = np.random.random((10000, 1000))
y = xnd.array.from_buffer(x)
print(type(y))

u, s, v = np.linalg.svd(y)
print(type(u))
print(type(s))
print(type(v))
$ 
$ /home/stefan/rel/bin/python3 xnd_numpy.py 
<class 'xnd.array'>
<class 'xnd.array'>
<class 'xnd.array'>
<class 'xnd.array'>
$ cat xnd_dask.py 
import os
os.environ["NUMPY_EXPERIMENTAL_ARRAY_FUNCTION"] = "1"
import numpy as np
import dask.array as da
import dask
import xnd

x = np.random.random((10000, 1000))
y = xnd.array.from_buffer(x)
print(type(y))

dy = da.from_array(y, chunks=(5000, 1000), asarray=False)
u, s, v = np.linalg.svd(dy)
u, s, v = dask.compute(u, s, v)
print(type(u))
print(type(s))
print(type(v))
$ 
$ /home/stefan/rel/bin/python3 xnd_dask.py 
<class 'xnd.array'>
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
$ cat cupy_dask.py 
import os
os.environ["NUMPY_EXPERIMENTAL_ARRAY_FUNCTION"] = "1"
import numpy as np
import dask.array as da
import dask
import cupy

x = np.random.random((10000, 1000))
y = cupy.array(x)
print(type(y))

dy = da.from_array(y, chunks=(5000, 1000), asarray=False)
u, s, v = np.linalg.svd(dy)
u, s, v = dask.compute(u, s, v)
print(type(u))
print(type(s))
print(type(v))
$ 
$ /home/stefan/rel/bin/python3 cupy_dask.py 
<class 'cupy.core.core.ndarray'>
Traceback (most recent call last):
  File "cupy_dask.py", line 13, in <module>
    u, s, v = np.linalg.svd(dy)
  File "/home/stefan/rel/lib/python3.6/site-packages/numpy/core/overrides.py", line 165, in public_api
    implementation, public_api, relevant_args, args, kwargs)
  File "/home/stefan/rel/lib/python3.6/site-packages/numpy/linalg/linalg.py", line 1592, in svd
    a, wrap = _makearray(a)
  File "/home/stefan/rel/lib/python3.6/site-packages/numpy/linalg/linalg.py", line 117, in _makearray
    new = asarray(a)
  File "/home/stefan/rel/lib/python3.6/site-packages/numpy/core/numeric.py", line 538, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/home/stefan/rel/lib/python3.6/site-packages/dask/array/core.py", line 1002, in __array__
    x = np.array(x)
ValueError: object __array__ method not producing an array

@pentschev
Copy link

For __array_function__ to work with CuPy you need at least version 6.0.0b3, which is not in PyPI yet, I think the only way to get that is to install it from source. And that last code on CuPy should work as you are expecting it to once you have CuPy 6.0.0b3 and Dask 1.1.4.

I don't see any issues with your xnd example, the issue must be somewhere in the implementation.

@pentschev
Copy link

Ah, sorry, you're also gonna need Dask master. Release 1.1.4 doesn't yet include __array_function__ support. It was only merged a couple of weeks ago.

dask/dask#4567

@skrah
Copy link
Member

skrah commented Mar 27, 2019

The cupy example works here now with numpy, dask and cupy master branches. I had tried dask and cupy from source before, but numpy was from PyPI (1.16.2).

@skrah
Copy link
Member

skrah commented Mar 29, 2019

@pentschev The issue for xnd.array is that explicit registering of backends is required:

backends.py:8:    concatenate_lookup.register(cupy.ndarray, cupy.concatenate)

If concatenate is not registered, numpy.empty() is called here (the traceback is not an error but an explicit instrumentation of some functions in the example):

  File "/home/stefan/rel/lib/python3.8/site-packages/dask-1.1.4+22.g96c3381-py3.8.egg/dask/array/core.py", line 3520, in concatenate3
    result = np.empty(shape=shape, dtype=dtype(deepfirst(arrays)))

One way for unregistered backends would be to try to call numpy.concatenate directly, which works on xnd.array.

@pentschev
Copy link

Ah, right. I think in a post-__array_function__ world, calling numpy.concatenate is the preferred/expected way.

@mrocklin correct me if I'm wrong, but I think once __array_function__ is consolidated, we want to get rid of these lookups, right?

@mrocklin
Copy link

mrocklin commented Mar 29, 2019 via email

@hameerabbasi
Copy link
Contributor

Can we add a fallback to __array_function__ when something isn't registered via hasattr?

@pentschev
Copy link

If I understand what you're suggesting @hameerabbasi, I think we had a long discussion about this in numpy/numpy#12974, but no clear solution yet.

@hameerabbasi
Copy link
Contributor

I'm suggesting something different... Where some chunk types have __array_function__, use that to do the dispatch rather than Dask's custom-rolled concatenate implementation. Same for other functions Dask looks for the overrides to.

@pentschev
Copy link

Thanks for clarifying @hameerabbasi. That would probably work, I don't know if that's something that Dask would be ok with doing though. @skrah would you mind opening an issue on Dask's GitHub repository, perhaps also add @hameerabbasi's suggestion so we can involve and get suggestions from other Dask developers as well?

@skrah
Copy link
Member

skrah commented Apr 2, 2019

@pentschev Thanks, done in dask/dask#4662. If that approach looks good, I can also open a real PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants