Implement dask-specific methods #196

darothen · 2023-01-13T23:11:19Z

This is an initial implementation of the feature requested in #97.

The first implementation here very closely follows the implementation of these methods by xarray.Dataset. For the majority of the methods, this should work fine; we iterate over all the nodes in our tree, starting at the root, and perform the necessary dask.collections API operation. However, __dask_post{compute,persist}__ is a bit more complicated; some additional testing is required to ensure that we're appropriately applying the available support utilities to re-construct our final DataTree without any superfluous work.
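The node-by-node pattern described above can be sketched without dask or xarray. This is only a toy model under stated assumptions: `ToyDataset` and `ToyTree` are hypothetical names, the "graph" is a plain dict, and nothing here is the actual datatree code. It shows how per-node graphs might be merged into one mapping while keys stay nested per node, in iteration order.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, int]


class ToyDataset:
    """Hypothetical stand-in for a dask-backed dataset (not xarray's API)."""

    def __init__(self, name: str, values: List[int]) -> None:
        self.name = name
        self.values = values

    def graph(self) -> Dict[Key, int]:
        # One entry per value, keyed by (dataset name, index).
        return {(self.name, i): v for i, v in enumerate(self.values)}

    def keys(self) -> List[Key]:
        return [(self.name, i) for i in range(len(self.values))]


class ToyTree:
    """Hypothetical tree: maps node paths to datasets, root first."""

    def __init__(self, nodes: Dict[str, ToyDataset]) -> None:
        self.nodes = nodes  # dicts preserve insertion order

    def graph(self) -> Dict[Key, int]:
        # Merge every node's graph into a single flat mapping.
        merged: Dict[Key, int] = {}
        for ds in self.nodes.values():
            merged.update(ds.graph())
        return merged

    def keys(self) -> List[List[Key]]:
        # Keep keys grouped per node, in the same order the nodes are visited.
        return [ds.keys() for ds in self.nodes.values()]


tree = ToyTree({"/": ToyDataset("root", [1, 2]), "/child": ToyDataset("child", [3])})
merged_graph = tree.graph()
nested_keys = tree.keys()
```

Because the keys are grouped per node, whatever consumes the computed results can, in principle, split them back out node by node.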

Closes Dask-specific methods #97
Tests added
Passes pre-commit run --all-files
New functions/methods are listed in api.rst
Changes are summarized in docs/source/whats-new.rst


darothen · 2023-01-13T23:24:21Z

Tag @TomNicholas, will work on testing this over the coming days as I have time.

darothen · 2023-01-14T20:09:09Z

Here's a gist based on @jbusecke's CMIP6 demo showing the top-level integration of load and compute (you can just as easily modify it to show that persist works).

Still to do: write some test cases and dig deeper to confirm that the dask collections API methods added here are actually being used.

TomNicholas

This looks great @darothen ! Thank you for taking this on. There's just a couple of places that look a little off to me which I've highlighted.

TomNicholas · 2023-01-15T04:22:02Z

datatree/datatree.py

+                # darothen: Are we sure that results_iter is ordered the same as
+                # self.subtree?
+                # self.subtree?


Where does the iterable of DatasetView come from? Presumably you are looping over the nodes, but where are you doing that? Like I don't understand where results is passed in from.
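One way to sidestep the ordering question raised in the code comment is to capture the node paths once and hand them to the rebuild function, so results are paired with paths explicitly. The sketch below is an assumption-laden illustration, not the PR's code: `rebuild` and `postcompute` are hypothetical helpers. Only the `(finalize, extra_args)` return shape mirrors what dask expects from `__dask_postcompute__`.

```python
from typing import Any, Callable, Dict, Sequence, Tuple


def rebuild(results: Sequence[Any], paths: Sequence[str]) -> Dict[str, Any]:
    """Pair each computed result with the path captured when keys were built."""
    if len(results) != len(paths):
        raise ValueError("expected one result per node path")
    return dict(zip(paths, results))


def postcompute(node_paths: Sequence[str]) -> Tuple[Callable[..., Dict[str, Any]], tuple]:
    # Freeze the path order now; finalize is later called as
    # finalize(results, *extra_args).
    paths = tuple(node_paths)
    return rebuild, (paths,)


finalize, extra_args = postcompute(["/", "/a", "/a/b"])
results = ["root-data", "a-data", "b-data"]  # stand-ins for computed datasets
rebuilt = finalize(results, *extra_args)
```

This only works if the keys handed to the scheduler were generated from the same frozen path order.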

TomNicholas · 2023-01-15T04:26:22Z

datatree/datatree.py

+        new_datatree_dict = {node.path: node.ds.load(**kwargs) for node in self.subtree}
+        return DataTree.from_dict(new_datatree_dict)


I'm not sure if this will have the behavior you intend: DataTree.from_dict will construct a completely new tree object, and then you are inserting whatever you get when you call DatasetView.load(). It's not altering self in-place.

Also I think it would be worth double-checking that DatasetView.load() does what you expect too with regard to new objects / copying - I never really thought about that case when I wrote DatasetView.

If you want to return the same tree but with all the data loaded I think you need to alter the current tree in-place instead of creating a new one, i.e. something like

    for node in self.subtree:
        self[node.path] = node.ds.load()

though this might not fail gracefully...
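The distinction between the two approaches can be illustrated with a toy tree that has no dependency on datatree or xarray. `ToyNode`, `load_as_new_tree`, and `load_in_place` are hypothetical names used only for this sketch. The first function mimics rebuilding through something like `from_dict`, the second mimics mutating `self`.

```python
from typing import Dict


class ToyNode:
    """Hypothetical node holding data and a flag for whether it is in memory."""

    def __init__(self, data: str, loaded: bool = False) -> None:
        self.data = data
        self.loaded = loaded


def load_as_new_tree(tree: Dict[str, ToyNode]) -> Dict[str, ToyNode]:
    # Builds a fresh mapping with fresh nodes; the input tree is untouched.
    return {path: ToyNode(node.data, loaded=True) for path, node in tree.items()}


def load_in_place(tree: Dict[str, ToyNode]) -> Dict[str, ToyNode]:
    # Mutates each existing node and hands back the very same tree object.
    for node in tree.values():
        node.loaded = True
    return tree


original = {"/": ToyNode("root"), "/child": ToyNode("leaf")}

copied = load_as_new_tree(original)
copy_is_new_object = copied is not original
original_still_lazy = not any(node.loaded for node in original.values())

same = load_in_place(original)
in_place_returns_self = same is original
original_now_loaded = all(node.loaded for node in original.values())
```

With the copying variant, a caller who ignores the return value is left holding a tree that was never loaded, which is the behavior the comment above warns about.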

TomNicholas · 2023-01-15T04:30:38Z

datatree/datatree.py

+        new_datatree_dict = {
+            node.path: node.ds.persist(**kwargs) for node in self.subtree
+        }
+        return DataTree.from_dict(new_datatree_dict)


The comment on .load() applies to this too I think.

TomNicholas · 2023-01-15T04:31:54Z

datatree/datatree.py

+        ds_tokens = {node.path: node.ds.__dask_tokenize__() for node in self.subtree}
+        ds_tokens = {node.path: node.ds.__dask_tokenize__() for node in self.subtree}
+
+        return normalize_token((type(self), ds_tokens))
+
+        return normalize_token((type(self), ds_tokens))


Unintentional repetition of lines? The double return shouldn't even be valid syntax should it??
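On the syntax question: in Python a second `return` after the first is syntactically valid, it is simply unreachable. A quick check using only the standard library (the function name `tokenize_twice` is made up for this sketch):

```python
import ast

source = """
def tokenize_twice():
    return 1
    return 1
"""

# Parsing succeeds: duplicated return statements are legal, just dead code.
tree = ast.parse(source)
return_count = sum(isinstance(node, ast.Return) for node in ast.walk(tree))

namespace = {}
exec(compile(tree, "<sketch>", "exec"), namespace)
result = namespace["tokenize_twice"]()  # only the first return ever runs
```

So the duplicated lines would not raise an error, which is probably why the bad merge went unnoticed.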

TomNicholas · 2023-01-15T04:32:25Z

datatree/datatree.py

+        graphs = {node.path: node.ds.__dask_graph__() for node in self.subtree}
+        graphs = {node.path: node.ds.__dask_graph__() for node in self.subtree}


More unintentional repetition?

TomNicholas · 2023-01-15T04:32:40Z

datatree/datatree.py

+        return [node.ds.__dask_keys__() for node in self.subtree]
+        return [node.ds.__dask_keys__() for node in self.subtree]


darothen · 2023-01-15T18:30:16Z

Thanks for the quick review @TomNicholas, hoping to address later today or tomorrow. Note on the line repetition - looks like I screwed up a merge somewhere, will need to fix that separately.

TomNicholas · 2023-03-06T20:08:15Z

@darothen wondering if you had any time soon to revisit this PR? Would be great to get it in soon because Julius and I are writing another blog post about using datatree with dask on CMIP6 data.

darothen · 2023-03-25T20:56:45Z

@TomNicholas I'm hacking on some projects this weekend, let me see if I can wrap things up. Apologies for the delay... it became very hectic at work shortly after the hackathon and I haven't had much time for side projects.

darothen and others added 8 commits January 13, 2023 14:54

Adds basic implementation of load, compute, and persist.

68eabcf

Adds stubs for dask collections API methods.

c8a16c5

Implements dask dunder methods except for post*

71154da

Implements the postcompute/persist dask ops.

721fa8d

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

29acc55

Removes stub / sample code.

c9c83d9

Applies fixes from pre-commit

a2145dd

Applies additional pre-commit fixes

cf84ed1

TomNicholas requested changes Jan 15, 2023

TomNicholas mentioned this pull request Aug 1, 2023

Parallelize map_over_subtree #252

Open

slevang mentioned this pull request Nov 15, 2023

Dask-specific methods #97

Open

TomNicholas mentioned this pull request Dec 22, 2023

Track merging datatree into xarray pydata/xarray#8572

Open

27 tasks
