Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add open_datatree to xarray #8697

Merged
merged 31 commits into from Feb 14, 2024
Merged

Conversation

flamingbear
Copy link
Contributor

@flamingbear flamingbear commented Feb 2, 2024

Draft: I'd like to open this up and start a discussion on a couple of things.

Here is a first stab at adding open_datatree to the backend of xarray.

  1. I'm not sure which tests to migrate/write with this change. The only datatree tests that have open_datatree are the ones in tests/test_io.py. While those (and all of the datatree tests) run and pass, they are still located in _datatree, and they don't seem to fit in the test_backends.py. I do see that Tom did a good job of naming the tests so that they would fit with the existing tests in many places. Migrated tests over into xarray/tests/datatree once approved and merged can be added into the existing tests where appropriate (from Tom).
  2. I was able to open datatrees with each of the engines, netcdf4, h5netcdf and zarr.
  3. I haven't moved any documentation. I remember hearing that we could add it and mark it as experimental? Is that the correct way forward?

No check boxes checked yet.

  • Closes part of first bullet of Track merging datatree into xarray #8572
  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst

Adds additional ignore to mypy

Adds additional ignore to doctests

Excludes xarray/datatree_ from all pre-commmit.ci
First stab. Will need to add/move tests.
I mistakenly thought we wanted to use the hidden version of datatree_ and we do not.
We do not want to expose open_datatree at top level until all of the code is migrated.
@TomNicholas TomNicholas added the topic-DataTree Related to the implementation of a DataTree class label Feb 2, 2024
engine : str, optional
Xarray backend engine to us. Valid options include `{"netcdf4", "h5netcdf", "zarr"}`.
kwargs :
Additional keyword arguments passed to :py:meth:`~xarray.open_dataset` for each group.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method doesn't exist yet. at that location.

Copy link
Collaborator

@keewis keewis Feb 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you explain why? This should exist, if you use :py:func: instead of :py:meth: (you can use sphobjinv to find the right role)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it doesn't exist at the top level because we wanted to migrate over the code before allowing a direct import.

from xarray import open_datatree

Until all of the code was merged, we said it would be imported from xarray.core.datatree. From #8572 First comment "EDIT: We decided it should return an xarray.DataTree object, or even xarray.core.datatree.DataTree object. So we can start by just copying the basic version in datatree/io.py right now which just calls open_dataset many times."

So I had thought xarray.core.datatree.DataTree made more sense for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the second part of your comment, makes me think I might not understand what you were asking.

Copy link
Collaborator

@keewis keewis Feb 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess what I'm asking is: open_dataset definitely exists (and it should be possible to link to), so I suppose you meant to write xarray.open_datatree?

Suggested change
Additional keyword arguments passed to :py:meth:`~xarray.open_dataset` for each group.
Additional keyword arguments passed to :py:func:`~xarray.open_datatree` for each group.

Also, once that's fixed, does this cause the docs build to fail? If not I believe it would be fine to leave as-is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On reflection, I think this should still be open_dataset. That is what is in the original docs. And this is describing open_datatree, where it's calling open_dataset with these **kwargs each time. So I think this commit should be reverted. As to :py:meth: vs :py:func: I don't know and will look at the difference unless you tell me before I figure it out.

@TomNicholas TomNicholas added this to In progress in DataTree integration via automation Feb 2, 2024
xarray/backends/api.py Outdated Show resolved Hide resolved
xarray/backends/common.py Outdated Show resolved Hide resolved
@flamingbear
Copy link
Contributor Author

flamingbear commented Feb 2, 2024

Apparently I misunderstood this for excluding mypy from datatree_?

Well I understood it, but it does did not help me when we import from there.

Edit: I was able to ignore type errors for the imported datatree_ modules 9f89256

Copy link
Contributor

@TomNicholas TomNicholas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure which tests to migrate/write with this change.

I think we can start by moving test_io.py to a new folder xarray/tests/datatree/test_io.py. This keeps a distinction between testing functionality we have collectively "okayed" and what we haven't got to yet. We can think about potentially integrating those tests into xarray/tests/test_backends.py when we actually integrate the internals of open_datatree with the backends internals.

I was able to open datatrees with each of the engines, netcdf4, h5netcdf and zarr.

Great! These all have tests right?

I haven't moved any documentation. I remember hearing that we could add it and mark it as experimental? Is that the correct way forward?

Yeah I'm not sure what to do for this. We could put it in the main docs but with a big experimental warning on it, but then we haven't got to the rest of the datatree functionality yet. I think I would favour moving the code first, then exposing the documentation after. Perhaps the datatree docs should be moved piecemeal to a dedicated section in the documentation (under "For developers/contributors" maybe), then moved into the main docs only once everything is done?

@flamingbear
Copy link
Contributor Author

I was able to open datatrees with each of the engines, netcdf4, h5netcdf and zarr.

Great! These all have tests right?

I was thinking no, but when I went to write them. I think these are the tests I would write. They make me slightly uncomfortable only in that they use the to_zarr, to_netcdf backends.

@flamingbear
Copy link
Contributor Author

I think I would favour moving the code first, then exposing the documentation after

Future us problem, I like that. But it also makes sense to me

Add some typing for mygrated tests.
Adds display_expand_groups to core options.
This is cargo-cult.  I wonder if there's a different CI test that wanted these
and since this is now excluded at the top level.  I'm putting them back until
migration into main codebase.
@flamingbear flamingbear marked this pull request as ready for review February 5, 2024 22:31
@flamingbear
Copy link
Contributor Author

Marking as ready for review, but mostly to start comments coming.

pyproject.toml Show resolved Hide resolved
xarray/datatree_/datatree/datatree.py Show resolved Hide resolved
Copy link
Collaborator

@keewis keewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems fine to me, at least for now: we definitely need to refactor this at a later point (for example, there should be a function that converts a given group to a dataset, which then can be called from both open_datatree and open_dataset). I'd probably have started by migrating the datatree code to xarray.core, though.

I realize that this PR was created on top of #8656, but before merging it might be good to remove the commits from that (this will save us some trouble down the line).

Strings and Path objects are interpreted as a path to a netCDF file or Zarr store.
engine : str, optional
Xarray backend engine to use. Valid options include `{"netcdf4", "h5netcdf", "zarr"}`.
kwargs :
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
kwargs :
**kwargs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

engine : str, optional
Xarray backend engine to us. Valid options include `{"netcdf4", "h5netcdf", "zarr"}`.
kwargs :
Additional keyword arguments passed to :py:meth:`~xarray.open_dataset` for each group.
Copy link
Collaborator

@keewis keewis Feb 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you explain why? This should exist, if you use :py:func: instead of :py:meth: (you can use sphobjinv to find the right role)

xarray/backends/api.py Outdated Show resolved Hide resolved
@flamingbear
Copy link
Contributor Author

I realize that this PR was created on top of #8656, but before merging it might be good to remove the commits from that (this will save us some trouble down the line).

I didn't see that originally. Will they still cause problems when this is squashed merged? I assume you'd want me to get rid of the 5 commits before Jan 31st (when you merged #8688)? do you want to give me a git-wizard hint?

@keewis
Copy link
Collaborator

keewis commented Feb 12, 2024

The only way that I know of is interactive rebasing, but that would mean that you'd have to remove all merge commits and maybe even resolve merge conflicts again. So if you checked that github doesn't show any weird things in this PR it maybe doesn't matter.

flamingbear and others added 2 commits February 12, 2024 11:02
Co-authored-by: Justus Magin <keewis@users.noreply.github.com>
@flamingbear
Copy link
Contributor Author

flamingbear commented Feb 12, 2024

So if you checked that github doesn't show any weird things in this PR it maybe doesn't matter.

I don't see anything weird, no conflicts or anything, just those old commits with different lineage.

edit: I think it's fine if it's squash merged. If not, I did rebase this onto the current pydata/main and the resulting PR would be clean, but I don't actually want to push that on top of this.

@TomNicholas
Copy link
Contributor

@flamingbear bear is there a comment thread for the mypy issue you raised earlier? I have no idea why mypy needs to be explicitly told about the types here, but mypy seems to pass CI on this branch... 🤷‍♂️

@TomNicholas
Copy link
Contributor

Also @flamingbear this now has two approvals - is it ready to merge from your end?

@flamingbear
Copy link
Contributor Author

but mypy seems to pass CI on this branch... 🤷‍♂️

I think that's because it's excluded in this PR. I've moved the that test module in the PR you linked to.

Also @flamingbear this now has two approvals - is it ready to merge from your end?

It is ready on my end. Merge away. 🙇

@TomNicholas
Copy link
Contributor

Let's start a habit of adding a whats-new.rst entry (under "internal changes") for each of these PRs, then I will merge.

@flamingbear
Copy link
Contributor Author

Let's start a habit of adding a whats-new.rst entry (under "internal changes") for each of these PRs, then I will merge.

Did some pattern matching because I'm not building the docs without error anymore locally.

@TomNicholas TomNicholas enabled auto-merge (squash) February 14, 2024 19:57
auto-merge was automatically disabled February 14, 2024 21:15

Head branch was pushed to by a user without write access

@flamingbear
Copy link
Contributor Author

@TomNicholas I was hoping to kick off the CI again by fixing the rst formating since the macos 3.11 had timed out and was hanging. But I also turned off automerge.

@TomNicholas TomNicholas merged commit fffb03c into pydata:main Feb 14, 2024
29 checks passed
DataTree integration automation moved this from In progress to Done Feb 14, 2024
@keewis
Copy link
Collaborator

keewis commented Feb 14, 2024

I'm kinda late with this, but one concern is that when merging the datatree repo we effectively removed xarray/datatree_ from packaging. With the code in xr.open_datatree we now use that code, but won't put it into build artifacts.

So my question is: should we remove that exclude and instead put the whole repo into build artifacts, or should we revert this and wait until we have done the series of PRs copying over the datatree library to its final place? To be clear, I'm mostly interested about the timing of things.

@TomNicholas
Copy link
Contributor

If we are not advertising any of this to be imported by users until the move is complete then does the distinction matter?

@keewis
Copy link
Collaborator

keewis commented Feb 14, 2024

if we're fine with it being in a broken state (because datatree_ would not be in the installed library), then no. We'd just have to make sure that the import of datatree_ is not at the top level, so open_dataset would still work. Looking at the code, it appears that the imports are either local or behind TYPE_CHECKING guards, so at least at runtime this would be fine. Now we'd only need to see whether this breaks type checking, but I'm not an expert on that.

@TomNicholas
Copy link
Contributor

I personally think that's fine - we're likely going to have DataTree be in a broken state anyway whilst we have like treenode.py moved but not datatree.py moved or whatever. As long as it doesn't break advertised functionality for users by crapping out, or break code quality tools we use (like mypy) then I don't see a problem.

@flamingbear flamingbear deleted the mhs/open_datatree branch February 14, 2024 23:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-DataTree Related to the implementation of a DataTree class
Development

Successfully merging this pull request may close these issues.

None yet

3 participants