Intake, catalogs, and datatree #134

TomNicholas · 2022-07-30T21:30:27Z

Thanks @TomNicholas and sorry for creating issue noise. I guess I got a bit carried away with these comments in the readme:

Has functions for mapping user-supplied functions over every node in the tree,

Automatically dispatches some of xarray.Dataset's API over every node in the tree (such as .isel),
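As a rough illustration of what "mapping a function over every node" means, here is a minimal stdlib-only sketch. The `Node` class and `map_over_nodes` helper are hypothetical stand-ins invented for this example, not the datatree API; they only show the idea of applying one function to each node while keeping the tree's shape.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Dict, List


# Toy stand-in for a tree node holding some data (hypothetical, NOT datatree's class).
@dataclass
class Node:
    data: List[float]
    children: Dict[str, "Node"] = field(default_factory=dict)


def map_over_nodes(func: Callable[[List[float]], List[float]], node: Node) -> Node:
    """Apply func to the data of every node and return a new tree of the same shape."""
    return Node(
        data=func(node.data),
        children={
            name: map_over_nodes(func, child)
            for name, child in node.children.items()
        },
    )


# A small three-level tree: root -> {a, b}, b -> {c}.
tree = Node([1.0, 2.0], {"a": Node([3.0]), "b": Node([4.0, 5.0], {"c": Node([6.0])})})

# Double every value at every node; the original tree is left untouched.
doubled = map_over_nodes(lambda xs: [2 * x for x in xs], tree)
```

In the real library each node would hold dataset-like contents rather than a list of floats, but the recursion pattern is the same.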

I was thinking that maybe the datatree abstraction could be a more formalised and ultimately 'xarray native' approach to the problems that have been tackled by e.g. intake-esm and intake-thredds. Leaves in the tree could be compositions over netcdf files, which may be aggregated JSON indexes. I guess I was thinking that some sort of formalism over a nested datastructure could help in dask computational graph composition. I have run into issues where the scheduler gets overloaded, or just takes forever to start, for calculations across large datasets composed with e.g. open_mfdataset.

I wonder if @andersy005, @mdurant or @rsignell have any experience or thoughts on whether an interface between this library and intake makes sense?

Originally posted by @pbranson in #97 (comment)

TomNicholas · 2022-07-30T21:36:32Z

@pbranson thanks for your ideas about integration of datatree with the intake ecosystem, this is definitely something I'm really interested in, and a potential use case I had in mind when originally creating this package.

I was thinking that maybe the datatree abstraction could be a more formalised and ultimately 'xarray native' approach to the problems that have been tackled by e.g. intake-esm and intake-thredds.

I think this makes sense. Datatree is almost like an in-memory catalog of datasets.
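To make the "in-memory catalog" analogy concrete, here is a small stdlib-only sketch. It assumes a nested dict as a stand-in for a tree, and the model names and file names are hypothetical placeholders. Flattening the tree into path-like keys gives something that looks a lot like catalog entries.

```python
from typing import Any, Dict, Iterator, Tuple


def walk(tree: Dict[str, Any], prefix: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (path, leaf) pairs for every leaf in a nested dict."""
    for name, value in tree.items():
        path = f"{prefix}/{name}"
        if isinstance(value, dict):
            # Interior node: descend and keep extending the path.
            yield from walk(value, path)
        else:
            # Leaf: in a real tree this would be a dataset, here it is a file name.
            yield path, value


# Hypothetical hierarchy: model -> experiment -> file.
catalog_tree = {
    "model_a": {"historical": "model_a_historical.nc", "ssp585": "model_a_ssp585.nc"},
    "model_b": {"historical": "model_b_historical.nc"},
}

# Flat, catalog-like view keyed by path.
entries = dict(walk(catalog_tree))
```

Looking up `entries["/model_b/historical"]` then plays the role of selecting one catalog entry by its position in the hierarchy.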

Leaves in the tree could be compositions over netcdf files, which may be aggregated JSON indexes.

Yep. There are probably lots of cool possibilities. My priority would be to build datatree in such a way that other packages can easily understand the model and experiment with interfacing in ways they think are sensible.

I have run into issues where the scheduler gets overloaded, or just takes forever to start, for calculations across large datasets composed with e.g. open_mfdataset.

I think this poor performance could be a bunch of different problems, and I'm not sure if datatree actually solves any of the dask-side issues. Datatree just makes it easier to express the complex operation which behaves poorly when run via dask.

cc @rabernat who has also pointed out the correspondence between datatree and intake catalogs to me before.

pbranson · 2022-07-30T22:09:05Z

@TomNicholas Thanks for breaking this out of #97!

I should have guessed that this would have been part of your discussions! I just scanned back over the issues that prompted the creation of datatree.

I think this poor performance could be a bunch of different problems, and I'm not sure if datatree actually solves any of the dask-side issues.

The dask-side challenges could be entirely due to the details of my naïve usage! :-)

TomNicholas · 2022-08-31T21:31:35Z

The ability to open a set of intake catalogs as a DataTree was actually added to intake-esm in intake/intake-esm#512 by @mgrover!

Separately it's also been suggested that we might want to write a plugin for intake proper.

TomNicholas added the "question", "design question", and "usage question" labels Jul 30, 2022

TomNicholas mentioned this issue Jul 30, 2022

Dask-specific methods #97

Open

TomNicholas mentioned this issue Jan 5, 2023

use datatree instead of dictionary of datasets jbusecke/xMIP#278

Open

TomNicholas mentioned this issue Feb 14, 2024

Datatree design discussions - weekly meeting pydata/xarray#8747

Open

11 tasks
