Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use a bound TypeVar for DataArray and Dataset methods #8208

Closed
wants to merge 6 commits into from

Conversation

max-sixty
Copy link
Collaborator

@max-sixty max-sixty commented Sep 19, 2023

  • User visible changes (including notable bug fixes) are documented in whats-new.rst

Edit: I added a comment outlining a problem with this


I think we should be using a TypeVar(..., bound=...) for our typing, since otherwise subtypes aren't allowed. I don't see a compelling reason to have all of T_DataWithCoords and T_DataArrayOrSet and T_Xarray, though I might be missing something.

So this unifies all T_DataWithCoords, T_DataArrayOrSet and T_Xarray to T_Xarray, and changes that type to be bound on a union of DataArray & Dataset (similar to the existing T_DataArrayOrSet). This covers the bulk of our API — functions which transform either a DataArray or Dataset and return the input type.

This does require some manual casts; I think because when there's a concrete path for both DataArray & Dataset, mypy doesn't unify them back together. It's also possible I'm missing something.

One alternative — a minor change — would be to bound on DataWithCoords, like the existing T_DataWithCoords, rather than the union. A quick comparison showed each required some additional casts / ignores over the other.

I _think_ we should be using a `TypeVar(..., bound=...)` for our typing, since otherwise subtypes aren't allowed. I don't see a compelling reason to have both, though I might be missing something.

So this changes all `T_DataWithCoords` to `T_Xarray`, and changes that type to bound on a union of `DataArray` & `Dataset`. This covers the bulk of the API which offer transformations of either a `DataArray` or `Dataset` and return that object.

This does require some manual casts; I think because when there's a concrete path for both `DataArray` & `Dataset`, mypy doesn't unify them back together. It's also possible I'm missing something.

One alternative — a minor change — would be to bound on `DataWithCoords`, rather than the union. A quick comparison showed both required some casts / ignores.
@max-sixty
Copy link
Collaborator Author

Here's a partial description of why we need casts from GPT-4 — not completely satisfying from a mypy perspective though: https://chat.openai.com/share/5c1e3d71-ebf8-4bc1-a861-288d6dc83f55

xarray/core/common.py Outdated Show resolved Hide resolved
window: Mapping[Any, int] | None = None,
window_type: str = "span",
**window_kwargs,
) -> RollingExp[T_DataWithCoords]:
) -> RollingExp[T_Xarray]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
) -> RollingExp[T_Xarray]:
) -> RollingExp[Self]:

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately this doesn't work, similar to the reason outlined below — T_Xarray isn't compatible with T_DataWithCoords...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately this doesn't work, similar to the reason outlined below — T_Xarray isn't compatible with T_DataWithCoords...

I think thats why originally I was typing RollingExp using T_DataWithCoords

@@ -800,7 +801,10 @@ def __getitem__(self, key: GroupKey) -> T_Xarray:
"""
Get DataArray or Dataset corresponding to a particular group label.
"""
return self._obj.isel({self._group_dim: self.groups[key]})
return cast(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's weird that these casts are now necessary...

@max-sixty
Copy link
Collaborator Author

max-sixty commented Sep 19, 2023

On reflection, a problem with this came up in #7780 (comment):

  • If we use TypeVar('foo', bound=Union["DataArray", "Dataset"]), then a function that calls separate methods (e.g. .reindex, which is not a method on parent class of DataArray & Dataset) won't be unified.

  • So if we have a function like:

    def foo(obj: T_DataArrayOrSet) -> T_DataArrayOrSet:
        return x**2

    ...this will fail, because the return type is expected as Union["DataArray", "Dataset"]. That's why we need all these casts, and user code would have the same problem.

  • But when we use TypeVar('foo', 'DataArray', 'Dataset'), then subclasses don't work. And we have three quite similar type definitions, which aren't fully compatible with each other.

  • I'm not sure it's possible to resolve these. I've had a search, but not much came up. Possibly we'd need to write a Protocol...?

Co-authored-by: Michael Niklas  <mick.niklas@gmail.com>
@max-sixty max-sixty marked this pull request as draft September 19, 2023 08:45
@headtr1ck
Copy link
Collaborator

headtr1ck commented Sep 19, 2023

Just some insight between a TypeVar with bound to union and simply both classes as args:

T1 = TypeVar("T1", int, str)
T2 = TypeVar("T2", bound=Union[int, str])

def f1(x: T1, y: T1) -> T1:
    return x + y

def f2(x: T2, y: T2) -> T2:
    return x + y  # mypy complains here

mypy will complain twice, once for unsupported operand + between str and int and once for the return type (like your issue above). If you require that the types of x and y are the same, you need T1 (whjich is T_Xarray).
Now, the error about the return type might actually be an mypy bug, possibly it is this bug: python/mypy#10817

Now of course in xarray DataArrays and Datasets behave quite nicely among themselves and you can basically do any operation on any combination of them. So it is quite natural to type them similar to T2 (which is T_DataArrayOrSet)

But subclasses should work with the first approach

class FStr(str):
    pass

fstr = FStr("asd")

f1(fstr, fstr)  # is allowed

@max-sixty
Copy link
Collaborator Author

But subclasses should work with the first approach

OK, I'm not sure how I came to such a misunderstanding.

Thanks a lot for your explanation, very clear and helpful.


Let me see whether I can resurrect anything useful out of this PR; at the very least I can add some comments on when to use which T_foo

max-sixty added a commit to max-sixty/xarray that referenced this pull request Sep 19, 2023
@max-sixty max-sixty closed this Sep 19, 2023
max-sixty added a commit to max-sixty/xarray that referenced this pull request Sep 19, 2023
Also from the chars of pydata#8208 -- this uses the TypeVars we define, which hopefully sets a standard and removes ambiguity for how to type functions. It also allows subclasses.

Where possible, it uses `T_Xarray`, otherwise `T_DataArrayOrSet` where it's not possible for `mypy` to narrow the concrete type down.
max-sixty added a commit that referenced this pull request Sep 20, 2023
* Add comments on when to use which `TypeVar`

From the chars of #8208
@max-sixty max-sixty deleted the bound-for-main-typing branch September 28, 2023 16:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants