Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement dask-specific methods #196

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

darothen
Copy link

@darothen darothen commented Jan 13, 2023

This is an initial implementation of the feature requested in #97.

The first implementation here very closely follows the implementation of these methods by xarray.Dataset. For the majority of the methods, this should work fine; we iterate over all the nodes in our tree, starting at the root, and perform the necessary dask.collections API operation. However, __dask_post{compute,persist}__ is a bit more complicated; some additional testing is required to ensure that we're appropriately applying the available support utilities to re-construct our final DataTree without any superfluous work.

  • Closes Dask-specific methods #97
  • Tests added
  • Passes pre-commit run --all-files
  • New functions/methods are listed in api.rst
  • Changes are summarized in docs/source/whats-new.rst

@darothen
Copy link
Author

Tag @TomNicholas, will work on testing this over the coming days as I have time.

@darothen
Copy link
Author

Here's a gist based on @jbusecke's CMIP6 demo showing the top-level integration of load and compute (you can just easily modify it to show that persist works.

Still left to do are writing some test cases and further deep-diving to make sure that the dask collections API functions we provided here are used.

Copy link
Collaborator

@TomNicholas TomNicholas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great @darothen ! Thank you for taking this on. There's just a couple of places that look a little off to me which I've highlighted.

Comment on lines +1483 to +1485
# darothen: Are we sure that results_iter is ordered the same as
# self.subtree?
# self.subtree?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does the iterable of DatasetView come from? Presumably you are looping over the nodes, but where are you doing that? Like I don't understand where results is passed in from.

Comment on lines +1367 to +1368
new_datatree_dict = {node.path: node.ds.load(**kwargs) for node in self.subtree}
return DataTree.from_dict(new_datatree_dict)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this will have the behavior you intend: DataTree.from_dict will construct a completely new tree object, and then you are inserting whatever you get when you call DatasetView.load(). It's not altering self in-place.

Also I think it would be worth double-checking that DatasetView.load() does what you expect too with regard to new objects / copying - I never really thought about that case when I wrote DatasetView.

If you want to return the same tree but with all the data loaded I think you need to alter the current tree in-place instead of creating a new one, i.e. something like

for node in self.subtree:
    self[node.path] = node.ds.load

though this might not fail gracefully...

Comment on lines +1411 to +1414
new_datatree_dict = {
node.path: node.ds.persist(**kwargs) for node in self.subtree
}
return DataTree.from_dict(new_datatree_dict)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment on .load() applies to this too I think.

Comment on lines +1427 to +1432
ds_tokens = {node.path: node.ds.__dask_tokenize__() for node in self.subtree}
ds_tokens = {node.path: node.ds.__dask_tokenize__() for node in self.subtree}

return normalize_token((type(self), ds_tokens))

return normalize_token((type(self), ds_tokens))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unintentional repetition of lines? The double return shouldn't even be valid syntax should it??

Comment on lines +1435 to +1436
graphs = {node.path: node.ds.__dask_graph__() for node in self.subtree}
graphs = {node.path: node.ds.__dask_graph__() for node in self.subtree}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More unintentional repetition?

Comment on lines +1452 to +1453
return [node.ds.__dask_keys__() for node in self.subtree]
return [node.ds.__dask_keys__() for node in self.subtree]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And here

@darothen
Copy link
Author

Thanks for the quick review @TomNicholas, hoping to address later today or tomorrow. Note on the line repetition - looks like I screwed up a merge somewhere, will need to fix that separately.

@TomNicholas
Copy link
Collaborator

@darothen wondering if you had any time soon to revisit this PR? Would be great to get it in soon because Julius and I are writing another blog post about using datatree with dask on CMIP6 data.

@darothen
Copy link
Author

@TomNicholas I'm hacking on some projects this weekend, let me see if I can wrap things up. Apologies for the delay... it became very hectic at work shortly after the hackathon and I haven't had much time for side projects.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Dask-specific methods
2 participants