Use deterministic names for dask arrays from open_dataset #555

Merged: shoyer merged 2 commits into pydata:master from deterministic-names on Sep 14, 2015

Conversation

@shoyer (Member) commented Aug 31, 2015

This will allow xray users to take advantage of dask's nascent support for
caching intermediate results (dask/dask#502).

For example:

    In [1]: import xray

    In [2]: from dask.diagnostics.cache import Cache

    In [3]: c = Cache(5e7)

    In [4]: c.register()

    In [5]: ds = xray.open_mfdataset('/Users/shoyer/data/era-interim/2t/2014-*.nc', engine='scipy')

    In [6]: %time ds.sum().load()
    CPU times: user 2.72 s, sys: 2.7 s, total: 5.41 s
    Wall time: 3.85 s
    Out[6]:
    <xray.Dataset>
    Dimensions:  ()
    Coordinates:
        *empty*
    Data variables:
        t2m      float64 5.338e+10

    In [7]: %time ds.mean().load()
    CPU times: user 5.31 s, sys: 1.86 s, total: 7.17 s
    Wall time: 1.81 s
    Out[7]:
    <xray.Dataset>
    Dimensions:  ()
    Coordinates:
        *empty*
    Data variables:
        t2m      float64 279.0

    In [8]: %time ds.mean().load()
    CPU times: user 7.73 ms, sys: 2.73 ms, total: 10.5 ms
    Wall time: 8.45 ms
    Out[8]:
    <xray.Dataset>
    Dimensions:  ()
    Coordinates:
        *empty*
    Data variables:
        t2m      float64 279.0

Still needs docs (probably in the dask section) and a what's new item.

Also, this will update the minimum required version of dask to 0.7 (which should be called out in docs).
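
For anyone wondering why deterministic names matter for caching, here is a rough, stdlib-only sketch of the idea. It is not the code in this PR: `deterministic_name`, `load`, and the choice of filename, variable name, and modification time as hash inputs are all hypothetical, picked only to illustrate the point. A key-based cache can only return a stored result if the same inputs always produce the same key, so a random name would make every lookup miss.

```python
import hashlib

def deterministic_name(filename, variable, mtime):
    """Hypothetical helper: build a stable key from the inputs.

    The same (filename, variable, mtime) always hashes to the same name,
    so a cache keyed on that name can recognize repeated work.
    """
    payload = repr((filename, variable, mtime)).encode("utf-8")
    return "open_dataset-" + hashlib.md5(payload).hexdigest()

cache = {}   # stand-in for a real cache of intermediate results
reads = []   # records every simulated read from disk

def load(filename, variable, mtime):
    """Return the cached value if the key is known, else 'read' and store it."""
    key = deterministic_name(filename, variable, mtime)
    if key not in cache:
        reads.append(key)
        cache[key] = "data for " + variable  # placeholder for real array data
    return cache[key]

# Assumed example inputs; the file name and timestamps are made up.
first = deterministic_name("2014-01.nc", "t2m", 1441000000.0)
second = deterministic_name("2014-01.nc", "t2m", 1441000000.0)
changed = deterministic_name("2014-01.nc", "t2m", 1441000001.0)

print(first == second)   # True: identical inputs give an identical name
print(first == changed)  # False: a modified file gets a new name

load("2014-01.nc", "t2m", 1441000000.0)
load("2014-01.nc", "t2m", 1441000000.0)
print(len(reads))        # 1: the second call was served from the cache
```

Including something like the modification time in the hash is one way to keep a stale cached result from being reused after the file changes on disk.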

shoyer added a commit that referenced this pull request Sep 14, 2015
Use deterministic names for dask arrays from open_dataset
@shoyer shoyer merged commit df044e2 into pydata:master Sep 14, 2015
@shoyer shoyer deleted the deterministic-names branch September 14, 2015 20:33
@shoyer shoyer modified the milestone: 0.6.1 Oct 21, 2015