Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dask Persist #1344

Closed
mrocklin opened this issue Mar 30, 2017 · 5 comments · Fixed by #1349
Closed

Dask Persist #1344

mrocklin opened this issue Mar 30, 2017 · 5 comments · Fixed by #1349

Comments

@mrocklin
Copy link
Contributor

It would be convenient to load constituent dask.arrays into memory as dask.arrays rather than as numpy arrays. This would help with distributed computations where we want to load a large amount of data into distributed memory once and then iterate on the full xarray dataset repeatedly without reloading from disk every time.

We can probably solve this from either side:

  1. XArray could make a .persist method that replaced all of its dask.arrays with a persisted version of that array
import dask

dset.x, dset.y, dset.z = dask.persist(dset.x, dset.y, dset.z)
  1. We could look into the Dask duck type solution again Expected interface for dask objects dask/dask#1068

cc @shoyer @jcrist @rabernat @pwolfram

@shoyer
Copy link
Member

shoyer commented Mar 30, 2017

I'm happy with either or both of these solutions.

@rabernat
Copy link
Contributor

since we already have .load(), I think adding .persist() directly to xarray is the simplest way to go. Ideally xarray users should be able to avoid interacting directly with the dask api. (It's a beautiful api, don't get me wrong! but new users are easily overwhelmed by multiple interacting packages.)

@mrocklin
Copy link
Contributor Author

We do eventually want to support some sort of duck typing. This way people can do things like the following while still benefiting from shared intermediates:

x, y, z = dask.persist(my_df, my_arrray, my_xarray)

But this may not happen as quickly as a .persist method. Also we've added .persist to the Base dask collections class, so it's part of the standard dask API at this point.

@rabernat
Copy link
Contributor

In this example, are x, y and z dask objects or xarray objects?

@shoyer
Copy link
Member

shoyer commented Mar 31, 2017

In this example, are x, y and z dask objects or xarray objects?

I think the idea is that they could be mixed types, e.g., a dask-dataframe, a dask-array and an xarray Dataset or DataArray.

mrocklin added a commit to mrocklin/xarray that referenced this issue Apr 3, 2017
mrocklin added a commit to mrocklin/xarray that referenced this issue Apr 3, 2017
shoyer pushed a commit that referenced this issue Apr 4, 2017
* Add persist method to DataSet

Fixes #1344

* add persist method to DataArray

* add whats new entry

* add doc section on persist

* doc: dask array now supports automatic chunk alignment

(at least on most operations)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants