Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support out-of-core computation using dask #328

Closed
15 tasks done
shoyer opened this issue Feb 20, 2015 · 7 comments
Closed
15 tasks done

Support out-of-core computation using dask #328

shoyer opened this issue Feb 20, 2015 · 7 comments

Comments

@shoyer
Copy link
Member

shoyer commented Feb 20, 2015

Dask is a library for out of core computation somewhat similar to biggus in conception, but with slightly grander aspirations. For examples of how Dask could be applied to weather data, see this blog post by @mrocklin: http://matthewrocklin.com/blog/work/2015/02/13/Towards-OOC-Slicing-and-Stacking/

It would be interesting to explore using dask internally in xray, so that we can implement lazy/out-of-core aggregations, concat and groupby to complement the existing lazy indexing. This functionality would be quite useful for xray, and even more so than merely supporting datasets-on-disk (#199).

A related issue is #79: we can easily imagine using Dask with groupby/apply to power out-of-core and multi-threaded computation.

Todos for xray:

  • refactor Variable.concat to make use of functions like concatenate and stack instead of in-place array modification (Dask arrays do not support mutation, for good reasons)
  • refactor reindex_variables to not make direct use of mutation (e.g., by using da.insert below)
  • add some sort of internal abstraction to represent "computable" arrays that are not necessarily numpy.ndarray objects (done: this is the data attribute)
  • expose reblock in the public API
  • load datasets into dask arrays from disk
  • load dataset from multiple files into dask
  • some sort of API for user controlled lazy apply on dask arrays (using groupby, mostly likely) (not necessary for initial release)
  • save from dask arrays
  • an API for lazy ufuncs like sin and sqrt
  • robustly handle indexing along orthogonal dimensions if dask can't handle it directly.

Todos for dask (to be clear, none of these are blockers for a proof of concept):

  • support for NaN skipping aggregations
  • support for interleaved concatenation (necessary for transformations by group, which are quite common) (turns out to be a one-liner with concatenate and take, see below)
  • support for something like take_nd from pandas: like np.take, but with -1 as a sentinel value for "missing" (necessary for many alignment operations) da.insert, modeled after np.insert would solve this problem.
  • support "orthogonal" MATLAB-like array-based indexing along multiple dimensions (taking along one axis at a time is close enough)
  • broadcast_to: see ENH: add np.broadcast_to and reimplement np.broadcast_arrays numpy/numpy#5371
@mrocklin
Copy link
Contributor

  • support for NaN skipping aggregations

    Presumably we could drop in numbagg here. The reductions are generally pretty straightforward to extend. I can do this relatively soon. See https://github.com/ContinuumIO/dask/blob/master/dask/array/reductions.py#L43-L111

  • support for interleaved concatenation (necessary for transformations by group, which are quite common)

    Do we have this already? Or rather can you point me to how you would do this with NumPy.

  • support super-imposing array values inter-leaved on top of a constant array of NaN (necessary for many alignment operations)

    Would this be solved by an elementwise ifelse operation? ifelse(condition, x, y)

  • support "orthogonal" MATLAB-like array-based indexing along multiple dimensions

    You can do this now by repeated slicing x[[1, 2, 3], :][:, [4, 5, 6]] and get a fully efficient solution. I can roll this in to the normal syntax though. I might pause for a bit as I think about the break that this causes with NumPy but I'll probably go ahead anyway.

@mrocklin
Copy link
Contributor

support super-imposing array values inter-leaved on top of a constant array of NaN (necessary for many alignment operations)

@shoyer can you clarify this one? Would the np.choose interface satisfy this?

In [1]: import numpy as np

In [2]: a = np.arange(4).reshape(2, 2)

In [3]: a
Out[3]: 
array([[0, 1],
       [2, 3]])

In [4]: x = np.array([[True, False], [True, True]])

In [5]: np.choose(x, [-10, a])
Out[5]: 
array([[  0, -10],
       [  2,   3]])

@shoyer
Copy link
Member Author

shoyer commented Feb 23, 2015

support for interleaved concatenation (necessary for transformations by group, which are quite common)

Turns out what I was thinking of here can be written as a one liner in terms of concatenate and take:

def interleaved_concatenate(arrays, indices, axis=0):
    return np.take(np.concatenate(arrays, axis), np.concatenate(indices))

So I've crossed that one off the line.

support super-imposing array values inter-leaved on top of a constant array of NaN (necessary for many alignment operations)

What I need here is something similar to the private take_nd functions that pandas defines that works like np.take, but that uses -1 as a sentinel value for "missing":

In [1]: import pandas

In [2]: import numpy as np

In [3]: x = np.arange(5)

In [4]: pandas.core.common.take_nd(x, [0, -1, 1, -1, 2])
Out[4]: array([  0.,  nan,   1.,  nan,   2.])

(In xray, I implement this a little differently so that I can take along all multiple axes simultaneously using array indexing, but this version would suffice.)

@mrocklin
Copy link
Contributor

Am I right in thinking that this is almost equivalent to fancy indexing with a list of indices?

@shoyer
Copy link
Member Author

shoyer commented Feb 23, 2015

Yes, take_nd is very similar to fancy indexing but only non-negative indices are valid (-1 means insert NaN).

@shoyer
Copy link
Member Author

shoyer commented Mar 30, 2015

@mrocklin It occurs to me now that a much simpler version of the functionality I'm looking for with take_nd would be dask.array.insert modeled after np.insert, which we could combine with array indexing. For the purposes of xray, we would only need support for insert with a scalar value, e.g., like da.insert(x, [1, 5, 6], np.nan, axis=1).

@shoyer
Copy link
Member Author

shoyer commented Apr 17, 2015

Basic support for dask.array is merged on master.

Continued in #394

@shoyer shoyer closed this as completed Apr 17, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants