Intake, catalogs, and datatree #134

Open
TomNicholas opened this issue Jul 30, 2022 · 3 comments
Labels
design question · question (Further information is requested) · usage question (User questions about usage)

Comments

@TomNicholas
Collaborator

Thanks @TomNicholas and sorry for creating issue noise. I guess I got a bit carried away with these comments in the readme:

  • Has functions for mapping user-supplied functions over every node in the tree,
  • Automatically dispatches some of xarray.Dataset's API over every node in the tree (such as .isel),
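To make those two readme bullets concrete, here is a minimal sketch in plain Python. It is a toy stand-in, not datatree's actual API: the `Node` class, `map_over_nodes`, and this list-based `isel` are all hypothetical names invented for illustration, and each node holds a list of numbers where a real tree would hold an `xarray.Dataset`.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterator, List, Optional, Tuple


@dataclass
class Node:
    """Toy tree node (hypothetical, not datatree's class)."""

    name: str
    data: Optional[List[float]] = None  # stands in for a Dataset
    children: Dict[str, "Node"] = field(default_factory=dict)

    def walk(self, prefix: str = "") -> Iterator[Tuple[str, "Node"]]:
        """Yield (path, node) pairs for this node and all descendants."""
        path = f"{prefix}/{self.name}" if self.name else prefix
        yield path or "/", self
        for child in self.children.values():
            yield from child.walk(path)

    def map_over_nodes(
        self, func: Callable[[List[float]], List[float]]
    ) -> "Node":
        """Return a new tree with func applied to every node that has data."""
        new_data = func(self.data) if self.data is not None else None
        new_children = {
            key: child.map_over_nodes(func)
            for key, child in self.children.items()
        }
        return Node(self.name, new_data, new_children)

    def isel(self, index: int) -> "Node":
        """Dispatch one per-node operation (positional selection) tree-wide."""
        return self.map_over_nodes(lambda values: [values[index]])


# Hypothetical two-leaf tree; the root carries no data of its own.
tree = Node(
    "",
    children={
        "model_a": Node("model_a", [1.0, 2.0, 3.0]),
        "model_b": Node("model_b", [10.0, 20.0]),
    },
)

doubled = tree.map_over_nodes(lambda values: [2 * v for v in values])
first = tree.isel(0)
paths = [path for path, _ in tree.walk()]
```

The point of the sketch is only that a method like `isel` can be written once in terms of a generic map over nodes, which is roughly what "automatically dispatches" describes. Both operations return a new tree and leave the original untouched.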

I was thinking that maybe the datatree abstraction could be a more formalised and ultimately 'xarray native' approach to the problems that have been tackled by e.g. intake-esm and intake-thredds. Leaves in the tree could be compositions over netcdf files, which may be aggregated JSON indexes. I guess I was thinking that some sort of formalism over a nested data structure could help in composing dask computational graphs. I have run into issues where the scheduler gets overloaded, or just takes forever to start, for calculations across large datasets composed with e.g. open_mfdataset.

I wonder if @andersy005, @mdurant or @rsignell have any experience or thoughts about whether it makes sense to interface this library with intake?

Originally posted by @pbranson in #97 (comment)

@TomNicholas added the question, design question, and usage question labels on Jul 30, 2022
@TomNicholas
Collaborator Author

@pbranson thanks for your ideas about integrating datatree with the intake ecosystem. This is definitely something I'm really interested in, and a potential use case I had in mind when originally creating this package.

I was thinking that maybe the datatree abstraction could be a more formalised and ultimately 'xarray native' approach to the problems that have been tackled by e.g. intake-esm and intake-thredds.

I think this makes sense. Datatree is almost like an in-memory catalog of datasets.
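As a rough illustration of that correspondence, the sketch below turns a flat catalog with dotted keys into a nested tree of groups. Everything in it is an assumption made for the example: the `nest_catalog` helper, the key layout, and the file names are hypothetical, and plain dicts and strings stand in for tree nodes and opened datasets.

```python
from typing import Any, Dict


def nest_catalog(flat: Dict[str, Any], sep: str = ".") -> Dict[str, Any]:
    """Group flat catalog entries into a nested dict by splitting keys on sep.

    Assumes no key is both a leaf and a group prefix of another key.
    """
    tree: Dict[str, Any] = {}
    for key, entry in flat.items():
        *groups, leaf = key.split(sep)
        node = tree
        for group in groups:
            # Create intermediate groups on demand.
            node = node.setdefault(group, {})
        node[leaf] = entry
    return tree


# Hypothetical catalog: keys and file names are invented for illustration.
catalog = {
    "model_a.historical": "model_a_hist.nc",
    "model_a.ssp585": "model_a_ssp585.nc",
    "model_b.historical": "model_b_hist.nc",
}

nested = nest_catalog(catalog)
```

Here `nested["model_a"]` plays the role of a subtree holding every experiment for one model. As far as I know, datatree's own `DataTree.from_dict` accepts path-like keys in a similar spirit, with datasets rather than file names as the values.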

Leaves in the tree could be compositions over netcdf files, which may be aggregated JSON indexes.

Yep. There are probably lots of cool possibilities. My priority would be to build datatree in such a way that other packages can easily understand the model and experiment with interfacing in ways they think are sensible.

I have run into issues where the scheduler gets overloaded, or just takes forever to start, for calculations across large datasets composed with e.g. open_mfdataset.

I think this poor performance could be a bunch of different problems, and I'm not sure if datatree actually solves any of the dask-side issues. Datatree just makes it easier to express the complex operation which behaves poorly when run via dask.

cc @rabernat who has also pointed out the correspondence between datatree and intake catalogs to me before.

@pbranson

@TomNicholas Thanks for breaking this out of #97!

I should have guessed that this would have been part of your discussions! I just scanned back over the issues that prompted the creation of datatree.

I think this poor performance could be a bunch of different problems, and I'm not sure if datatree actually solves any of the dask-side issues.

The dask-side challenges could be entirely due to the details of my naïve usage! :-)

@TomNicholas
Collaborator Author

The ability to open a set of intake catalogs as a DataTree was actually added to intake-esm in intake/intake-esm#512 by @mgrover!

Separately it's also been suggested that we might want to write a plugin for intake proper.


2 participants