Intake, catalogs, and datatree #134
@pbranson thanks for your ideas about integration of datatree with the intake ecosystem, this is definitely something I'm really interested in, and a potential use case I had in mind when originally creating this package.
I think this makes sense. Datatree is almost like an in-memory catalog of datasets.
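To illustrate that analogy (this is a toy sketch, not the actual datatree or intake API): a tree of named datasets behaves like an in-memory catalog, where a slash-separated path plays the role of a catalog entry lookup. The `Node` class and `get` helper below are hypothetical stand-ins for `datatree.DataTree`, and the dict payloads stand in for real `xarray.Dataset` objects.

```python
# Illustrative sketch only: a toy tree-of-datasets showing how datatree
# resembles an in-memory catalog. `Node` is a hypothetical stand-in for
# datatree.DataTree; real leaves would hold xarray.Dataset objects.
from dataclasses import dataclass, field


@dataclass
class Node:
    dataset: object = None               # placeholder for an xarray.Dataset
    children: dict = field(default_factory=dict)

    def get(self, path):
        """Look up a node by a '/'-separated path, like a catalog query."""
        node = self
        for part in path.strip("/").split("/"):
            node = node.children[part]
        return node


# Build a small "catalog": model output grouped by experiment.
root = Node(children={
    "cmip6": Node(children={
        "historical": Node(dataset={"tas": "..."}),
        "ssp585": Node(dataset={"tas": "..."}),
    }),
})

print(root.get("cmip6/historical").dataset)  # {'tas': '...'}
```

The point is just that hierarchical path access over in-memory datasets looks a lot like querying a catalog, which is what makes the interfacing question interesting.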
Yep. There are probably lots of cool possibilities. My priority would be to build datatree in such a way that other packages can easily understand the model and experiment with interfacing in ways they think are sensible.
I think this poor performance could be a bunch of different problems, and I'm not sure if datatree actually solves any of the dask-side issues. Datatree just makes it easier to express the complex operation which behaves poorly when run via dask. cc @rabernat who has also pointed out the correspondence between datatree and intake catalogs to me before.
@TomNicholas Thanks for breaking this out of #97! I should have guessed that this would have been part of your discussions! I just scanned back over the issues prompting the creation of datatree.
The dask-side challenges could entirely be due to details of my naïve usage! :-)
The ability to open a set of intake catalogs as a datatree would be useful. Separately, it's also been suggested that we might want to write a plugin for intake proper.
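One way to picture that mapping (a sketch under assumptions, not an implemented interface): a nested catalog-like mapping can be flattened into the `{"/path/to/node": dataset}` dict shape that datatree's `DataTree.from_dict` constructor accepts. The `flatten` helper and the string "datasets" below are illustrative inventions; a real version would walk actual intake catalog entries and load `xarray.Dataset` objects.

```python
# Sketch: flatten a nested, catalog-like mapping into the
# {"/path/to/node": dataset} form that DataTree.from_dict-style
# constructors take. Leaf values here are stand-in strings,
# not real xarray or intake objects.
def flatten(catalog, prefix=""):
    flat = {}
    for name, entry in catalog.items():
        path = f"{prefix}/{name}"
        if isinstance(entry, dict):        # sub-catalog: recurse
            flat.update(flatten(entry, path))
        else:                              # leaf: a loadable dataset
            flat[path] = entry
    return flat


nested = {"cmip6": {"historical": "ds1", "ssp585": "ds2"}}
print(flatten(nested))
# {'/cmip6/historical': 'ds1', '/cmip6/ssp585': 'ds2'}
```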
Thanks @TomNicholas and sorry for creating issue noise. I guess I got a bit carried away with these comments in the readme:
I was thinking that maybe the datatree abstraction could be a more formalised and ultimately 'xarray native' approach to the problems that have been tackled by e.g. intake-esm and intake-thredds. Leaves in the tree could be compositions over netCDF files, which may be aggregated via JSON indexes. I guess I was thinking that some sort of formalism over a nested datastructure could help with dask computational graph composition. I have run into issues where the scheduler gets overloaded, or just takes forever to start, for calculations across large datasets composed with e.g. open_mfdataset.
I wonder if @andersy005, @mdurant or @rsignell have any experience or thoughts on whether it makes sense to build an interface between this library and intake?
Originally posted by @pbranson in #97 (comment)