Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improving performance of open_datatree #8994

Open
4 tasks
TomNicholas opened this issue May 2, 2024 · 4 comments · May be fixed by #9014
Open
4 tasks

Improving performance of open_datatree #8994

TomNicholas opened this issue May 2, 2024 · 4 comments · May be fixed by #9014
Labels
topic-backends topic-DataTree Related to the implementation of a DataTree class topic-performance

Comments

@TomNicholas
Copy link
Contributor

What is your issue?

The implementation of open_datatree works, but is inefficient, because it calls open_dataset once for every group in the file. We should refactor this to improve the performance, which would fix issues like xarray-contrib/datatree#330.

We discussed this in the datatree meeting, and my understanding is that concretely we need to:

  • Create an asv benchmark for open_datatree, probably involving first writing then benchmarking the opening of a special netCDF file that has no data but lots of groups.
  • Refactor the NetCDFDatastore class to only create one CachingFileManager object per file, not one per group, see
    manager = CachingFileManager(
    .
  • Refactor NetCDF4BackendEntrypoint.open_datatree to use an implementation that goes through NetCDFDatastore without calling the top-level xr.open_dataset again.
  • Check the performance of calling xr.open_datatree on a netCDF file has actually improved.

It would be great to get this done soon as part of the datatree integration project. @kmuehlbauer I know you were interested - are you willing / do you have time to take this task on?

@TomNicholas TomNicholas added topic-backends topic-performance topic-DataTree Related to the implementation of a DataTree class labels May 2, 2024
@TomNicholas TomNicholas added this to To do in DataTree integration via automation May 2, 2024
@TomNicholas
Copy link
Contributor Author

cc also @mgrover1 , @aladinor, @flamingbear, @owenlittlejohns, @eni-awowale in case I missed anything

@kmuehlbauer
Copy link
Contributor

Thanks @TomNicholas for adding more traction here. Unfortunately I'm unable to dedicate as much time as needed here in the upcoming 4 weeks. IIUC @aladinor is already working towards a prototype based on #7437. Please correct me if I'm wrong.

I've myself played with that branch a bit to get familiar with the code, too. I was trying rebasing/refactoring to recent main, fixing some immediate issues to make it work, but did not come far. Too much has changed in that part of the codebase, which makes rebasing a bit of a pain. I'll see if I can at least get something to work over the weekend.

@aladinor
Copy link

aladinor commented May 3, 2024

Thanks, @TomNicholas, for putting this together. Indeed, I've been working on the aforementioned steps, and I'd be happy to share some results with you at our next dtree meeting. BTW, When is the next meeting?

@keewis
Copy link
Collaborator

keewis commented May 3, 2024

see #8747. As a summary, we have a weekly meeting every Tuesday at 11:30 EST.

Too much has changed in that part of the codebase, which makes rebasing a bit of a pain.

Indeed, it may be easier to start again from the current state. The plugin mechanism basically works, but a lot of the details (like the chunk handling) are still missing, and are currently done by the call to open_dataset.

@aladinor aladinor linked a pull request May 7, 2024 that will close this issue
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-backends topic-DataTree Related to the implementation of a DataTree class topic-performance
Projects
Development

Successfully merging a pull request may close this issue.

4 participants