Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate datatree.py module into xarray.core. #8789

Merged

Conversation

owenlittlejohns
Copy link
Contributor

This PR migrates the datatree.py module to xarray/core/datatree.py, as part of the on-going effort to merge xarray-datatree into xarray itself.

Most of the changes are import path changes, and type-hints, but there is one minor change to the methods available on the DataTree class: to_array has been converted to to_dataarray, to align with the method on Dataset. (See conversation here)

This PR was initially published as a draft here.

  • Completes migration step for datatree/datatree.py Track merging datatree into xarray #8572
  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst

Copy link

welcome bot commented Feb 27, 2024

Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient.
If you have questions, some answers may be found in our contributing guidelines.

@@ -77,7 +73,7 @@
# """


T_Path = Union[str, NodePath]
T_Path = str | NodePath
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like mypy and the Python 3.9 tests don't like this change. I can revert it if that sounds best?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that syntax won't work on 3.9 so we can't use it for now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can add the pragma at the top? from __future__ import annotations. Maybe it's technically a pragma.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it was already there, so I'm guessing the custom type here is not covered by __future__.annotations in the same way that type hints are?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay - that looks better but still not perfect.

I'm working on the mypy issues still present, including things like annotations in the test files. I see some other things that I will try to debug tomorrow.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's what I get for guessing without looking.

@TomNicholas TomNicholas added the topic-DataTree Related to the implementation of a DataTree class label Feb 28, 2024
@TomNicholas TomNicholas added this to In progress in DataTree integration via automation Feb 28, 2024
Copy link
Contributor

@flamingbear flamingbear left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just two random things I forgot to mention earlier.

This comment was marked as resolved.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - I added these changes in: 2c5e54c

Copy link
Contributor Author

@owenlittlejohns owenlittlejohns left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are still some things to work out, but I wanted to update the PR with some changes, to get some feedback and guidance.

@@ -169,33 +166,32 @@ def update(self, other) -> NoReturn:
)

# FIXME https://github.com/python/mypy/issues/7328
@overload
def __getitem__(self, key: Mapping) -> Dataset: # type: ignore[misc]
@overload # type: ignore[override]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a few new type: ignore[override] annotations on DatasetView methods. @flamingbear and I spent a while today trying to disentangle things here.

Essentially, the issue comes down to DatasetView inheriting from Dataset, which uses typing.Self as a return type in a few places. This would mean DatasetView would have a return type of DatasetView. But, instead, these return types are explicitly a Dataset. (mypy has some examples mentioning that subclasses shouldn't have more generalised return types than their parents, which seems related - thanks @flamingbear for the reference)

I'm not sure that just ignoring the error is the best thing to do, but I don't think we had a better idea for an implementation. I think @TomNicholas has mentioned a future step of reworking the class inheritance, but I'm not sure if that would also cover this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. So I guess I have violated the Liskov substitution principle here... Does that represent an actual possible failure mode? Like are there any places where "if you can use a Dataset you can use a DatasetView" fails? I suppose passing a DatasetView into any function that attempts to call __setitem__ would violate this.

Are there alternative designs that would give us immutability without DatasetView being a subclass of Dataset? One alternative might be something like

class FrozenDataset:
    _dataset: Dataset

    def __init__(self, ds):
        self._dataset = ds

    def __getattr__(self, name):
        # Forward the method call to the Dataset class
        if name in ['__setitem__', ...]:
            raise AttributeError("Mutation of the DatasetView is not allowed, ...")
        else:
            return getattr(self._ds, name)

but then that would not pass an isinstance check, i.e. isinstance(dt.ds, Dataset) would return False.

@TomNicholas has mentioned a future step of reworking the class inheritance, but I'm not sure if that would also cover this.

I was just talking about having both Dataset and DataTree inherit from a common DataMapping class that holds variables. But I don't think that would cover this, as that DataMapping should also be returning Self.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is much more like the Frozen class that xarray sometimes uses.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So looking at this in Owen's absence. It seems like there are two paths.

  • Use a frozenDataset type thing, that fails the isinstance check?
  • Overriding the mypy errors understanding the the DatasetView is an library internal implementation and we know not to misuse it?

The first seems like the right thing to do, but I don't know what breaks with the isinstance check or why that would be necessary.
Is there some ABC that both Dataset and DatasetView could implement to pass the isinstance check and is that a big change? (edit: I'm thinking no. Also, this may show that I'm not afraid to ask the dumb questions in public.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to make it a wrapper. and isinstance(dt.ds, Dataset) will return False.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Coming back to this with things that have been tried (and sadly not succeeded to date):

  • First up, I tried to implement the suggested FrozenDataset wrapper. There was an issue with using getattr - as it can't be used to intercept magic methods on Python classes.
  • Next: Trying to implement a Metaclass to allow interception of magic methods (initially inspired by this snippet). This proved tricky to do (I didn't quite get it fully working) and felt very much like it was an overly complicated solution for the problem we were trying to solve.
  • Next: Trying a mix-in to overwrite the affected methods. I put a time-box on this attempt as we want to unblock the migration. I did not get this working in the allocated time.
  • Lastly: Conceding defeat and adding the type ignore statements to cover this.

Not ideal, but the DatasetView and Datatree.ds property both have usage over the last couple of years without significant issues. I opened #8855 to capture that we need to work on a better fix at a later date.

@@ -636,31 +629,31 @@ def __array__(self, dtype=None):
"invoking the `to_array()` method."
)

def __repr__(self) -> str:
return formatting.datatree_repr(self)
def __repr__(self) -> str: # type: ignore[override]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another annotation I wanted to call out. I think it makes sense here. NamedNode.__repr__ has an optional level kwarg, which isn't an argument in repr_datatree. repr_datatree is using RenderTree, which does have a maxLevel, but I don't think that's quite the same thing. (If it is, though, we could edit DataTree.__repr__ to match the signature of NamedNode.__repr__, and then pass it on through)

I don't think properly maps to the maxLevel


if d:
# Populate tree with children determined from data_objects mapping
for path, data in d.items():
# Create and set new node
node_name = NodePath(path).name
if isinstance(data, cls):
if isinstance(data, DataTree):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this was largely to make mypy happy, but I think it makes sense. Just using cls, I don't think mypy was realising that objects that were DataTree instances couldn't get to the else statement.

@@ -1064,14 +1059,18 @@ def from_dict(

# First create the root node
root_data = d.pop("/", None)
obj = cls(name=name, data=root_data, parent=None, children=None)
if isinstance(root_data, DataTree):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The extra code here is a duplication of what is happening for the child nodes. mypy was unhappy with the types for the root_data, in case it was a DataTree instance.

assert dt.name is None

def test_bad_names(self):
with pytest.raises(TypeError):
DataTree(name=5)
DataTree(name=5) # type: ignore[arg-type]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added a few mypy annotations like this. I think they make sense, because the tests are explicitly checking what happens when you use the wrong type of argument.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes you always have to do this within tests that check for TypeErrors

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool beans. I didn't want people to think I'd just thrown in type: ignore statements in just to quieten down mypy. (I'd thought about it a bit, and then thrown in type: ignore statements to quieten down mypy 😉)

results = DataTree(name="results", data=data)
xrt.assert_identical(results[["temp", "p"]], data[["temp", "p"]])
results: DataTree = DataTree(name="results", data=data)
xrt.assert_identical(results[["temp", "p"]], data[["temp", "p"]]) # type: ignore[index]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is reasonable - this and test_getitem_dict_like_selection_access_to_dataset are using Iterable types (a list and a dictionary), which currently would raise exceptions (and the tests are marked as xfail). Other options I considered included writing tests that would make sure the expected assertions were raised, but it feels like the actual desired behaviour is TBD, and that the best thing to do was to leave things as they were.

@@ -446,7 +440,7 @@ def ds(self) -> DatasetView:
return DatasetView._from_node(self)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really a comment for L430:

def ds(self) -> DatasetView:

This makes sense (because of wanting an immutable version of the Dataset), but it's causing the bulk of the remaining mypy issues in the tests, where Dataset objects are being directly assigned to DataTree.ds. That said, I think I'm a little surprised this is causing an issue, given the signature below for the @ds.setter. I'd love some guidance on resolving those mypy errors!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also surprised that causes a problem... Is this pattern valid typing in general for properties?

class A:
    ...


class B(A):
    ...


class C:
    @property
    def foo(self) -> B:
        ...

    @property.setter
    def foo(self, value: A):
        ...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's fine until you set C.foo with A.

class A:
    ...

class B(A):
    ...

class C:
    @property
    def foo(self) -> B:
        return B()

    @foo.setter
    def foo(self, value: A):
        pass


nope = C()
nope.foo = A()

output:

> mypy demo/demo7.py
demo/demo7.py:20: error: Incompatible types in assignment (expression has type "A", variable has type "B")  [assignment]

And I think this is the open issue for fixing/ignoring this issue
python/mypy#3004

But it seems like it's not high priority since it's been open for 7 years.

def update(self, other: Dataset | Mapping[str, DataTree | DataArray]) -> None:
def update(
self, other: Dataset | Mapping[str, DataTree | DataArray | Variable]
) -> None:
"""
Update this node's children and / or variables.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a comment for L937 (can't add it directly)...

This is another mypy issue because the type of k can be a Hashable (from iterating through a Dataset) or a str, but DataTree.name needs a str.

There seem to be a couple of options:

  1. Cast k as a string.
  2. Check that k is a string, and raise an exception if it isn't.

Happy to do either (or something else). I couldn't think of an immediate issue with casting a Hashable to a string, but wanted to check (in case there might be some chance of a weird collision between e.g. 1 and "1").

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is symptomatic of a more general issue that technically all the keys in DataTree should be Hashable to match what xarray "mostly" supports (see #8294 (comment)).

@headtr1ck do you have any thoughts on supporting Hashable for names of child nodes in DataTree?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we probably have to support Hashable at some point, otherwise any operation that combines a DataTree with a Dataset or DataArray will be a nightmare to type.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay thanks. In that case I defer to Owen on whether it would be easier to do that in this PR or a follow-up one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm helping out while Owen is at a conference. I was going to see if it were easy to add this change to handle Hashable and quickly ran into the PurePosixPath's pathsegments needing to be PathLike.

I think if we want to use NodePath for traversing and altering the tree, the paths are going to need to be coerced into strings.

DataTree's _name can be changed to Hashable, but when it's calling NamedNode's __init__ I think that property setter will have to coerce to a string.
But I'm not thinking of what kind of problems is that going to cause? Beyond say someone using (7,) as a key/path as well as "(7,)" and having collisions.

I think we can handle that by checking collisions. but I'm not 100% sure yet. Am I missing something obvious?

I think I was missing something obvious (just making the DataTree's _name Hashable is not going to help here, it will also get converted by the NamedNode pieces.)

This is going to be a question for tomorrow's meeting.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that meeting we discussed how having Hashables in datatree is possibly more trouble than it is worth, because (a) you can't form a valid path out of a list of hashables (but a list of strings is always fine), and (b) names of groups can't be hashable-like in netCDF or definitely not in Zarr anyway, so there doesn't seem like much of a use case for hashable-named groups at least.

For now it was decided to see if we could type: ignore our way to having an initial implementation that does not support hashables in datatree (which as we can explicitly forbid hashables at tree creation time hopefully isn't a ridiculous idea). #8836 was made to track the intent to revisit this.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not really follow all the discussions about this PR.
But for me, not accepting Hashables sounds reasonable (even though just the thought of a tree-like datastructure sounds like this should be possible for any Hashables names. E.g. a node object with a Hashables name and a number of node children). But agreed, it makes things like getitem unnecessarily complicated.

Probably the best for now is some type ignores or casts.
In the future I anyway have plans to make Dataset a generic class in variable names (and dimension names for that matter). Then this problem can be solved by returning, e.g. Dataset[str].

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not really follow all the discussions about this PR.

Yeah sorry - a lot of them happened over zoom (see the March 19th meeting notes here).

sounds like this should be possible for any Hashables names

The problem is not the tree construction, it's serialization (because I can't guarantee being able to make a single valid unix path out of .join'ing all those hashables).

Probably the best for now is some type ignores or casts.

Cool, that's what we've done.

In the future I anyway have plans to make Dataset a generic class in variable names (and dimension names for that matter). Then this problem can be solved by returning, e.g. Dataset[str].

That would be awesome!

xarray/core/datatree.py Outdated Show resolved Hide resolved
@@ -1449,7 +1448,7 @@ def merge_child_nodes(self, *paths, new_path: T_Path) -> DataTree:

# TODO some kind of .collapse() or .flatten() method to merge a subtree

def as_array(self) -> DataArray:
def as_dataarray(self) -> DataArray:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't forget a quick update to whats-new.rst for this change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch.

@flamingbear flamingbear force-pushed the DAS-2062-migrate-datatree-module-pr branch from 94b17f9 to c45c56a Compare March 4, 2024 17:40
@@ -32,6 +32,10 @@ New Features
Breaking changes
~~~~~~~~~~~~~~~~

- ``Datatree``'s ``as_array`` renamed ``to_dataarray`` to align with ``Dataset``. (:pull:`8789`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomNicholas should this have been kept out of breaking changes? mostly because it's not actually released? Wasn't sure where I should keep it. reference

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah we never had a need before for "breaking changes that aren't breaking yet but will be, but only relevant for previous users of another package" 😅

These datatree breaking changes really only need to be written down somewhere, even a GH issue, so that we can point to them all at once when it comes time to do the grand reveal.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#8807 and reverted. 😬

So this is where we are moving forward with the assumption that DataTree nodes
are alway named with a string. In this section of `update` even though we know
the key is a str, mypy refuses.

I chose explicit recast over mypy ignores, tell me why that's wrong?
@owenlittlejohns
Copy link
Contributor Author

@TomNicholas - I think this PR is at a point now where it can be reviewed again in earnest. 🤞

Copy link
Contributor

@TomNicholas TomNicholas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should get this merged. Most of the changes are small and/or typing (I'm impressed you got mypy to pass!). I only have one substantive comment (about mixins).

Comment on lines +34 to +49
from xarray.datatree_.datatree.common import TreeAttrAccessMixin
from xarray.datatree_.datatree.formatting import datatree_repr
from xarray.datatree_.datatree.formatting_html import (
datatree_repr as datatree_repr_html,
)
from xarray.datatree_.datatree.mapping import (
TreeIsomorphismError,
check_isomorphic,
map_over_subtree,
)
from xarray.datatree_.datatree.ops import (
DataTreeArithmeticMixin,
MappedDatasetMethodsMixin,
MappedDataWithCoords,
)
from .render import RenderTree
from xarray.core.treenode import NamedNode, NodePath, Tree
from xarray.datatree_.datatree.render import RenderTree
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might it make sense to not actually import and use any of these from xarray.datatree._datatree imports in this PR? For example the DataTree object should still pass 99% of its tests without inheriting from the TreeAttrAccessMixin. That way we are still being explicit about what has and has not been "merged and approved".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are losing some of the testing if you do that. Right now we're still collecting and running all of the tests in datatree_. If you pull out all of the pieces that aren't migrated you'll lose that testing. I think we're explicit about what is merged and approved by what is not in datatree_ anymore. And none of this is visible to a user yet. Let's talk in today's meeting and you can convince me otherwise.

@TomNicholas
Copy link
Contributor

With @shoyer 's blessing (in the meeting) I hereby merge this PR

@TomNicholas TomNicholas merged commit 473b87f into pydata:main Mar 26, 2024
29 checks passed
DataTree integration automation moved this from In progress to Done Mar 26, 2024
Copy link

welcome bot commented Mar 26, 2024

Congratulations on completing your first pull request! Welcome to Xarray! We are proud of you, and hope to see you again! celebration gif

@owenlittlejohns owenlittlejohns deleted the DAS-2062-migrate-datatree-module-pr branch March 26, 2024 17:15
dcherian added a commit to dcherian/xarray that referenced this pull request Apr 2, 2024
* main: (26 commits)
  [pre-commit.ci] pre-commit autoupdate (pydata#8900)
  Bump the actions group with 1 update (pydata#8896)
  New empty whatsnew entry (pydata#8899)
  Update reference to 'Weighted quantile estimators' (pydata#8898)
  2024.03.0: Add whats-new (pydata#8891)
  Add typing to test_groupby.py (pydata#8890)
  Avoid in-place multiplication of a large value to an array with small integer dtype (pydata#8867)
  Check for aligned chunks when writing to existing variables (pydata#8459)
  Add dt.date to plottable types (pydata#8873)
  Optimize writes to existing Zarr stores. (pydata#8875)
  Allow multidimensional variable with same name as dim when constructing dataset via coords (pydata#8886)
  Don't allow overwriting indexes with region writes (pydata#8877)
  Migrate datatree.py module into xarray.core. (pydata#8789)
  warn and return bytes undecoded in case of UnicodeDecodeError in h5netcdf-backend (pydata#8874)
  groupby: Dispatch quantile to flox. (pydata#8720)
  Opt out of auto creating index variables (pydata#8711)
  Update docs on view / copies (pydata#8744)
  Handle .oindex and .vindex for the PandasMultiIndexingAdapter and PandasIndexingAdapter (pydata#8869)
  numpy 2.0 copy-keyword and trapz vs trapezoid (pydata#8865)
  upstream-dev CI: Fix interp and cumtrapz (pydata#8861)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-DataTree Related to the implementation of a DataTree class
Development

Successfully merging this pull request may close these issues.

None yet

5 participants