Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use shape and dtype as typevars in NamedArray #8294

Merged
merged 173 commits into from
Oct 18, 2023

Conversation

Illviljan
Copy link
Contributor

@Illviljan Illviljan commented Oct 11, 2023

Using a different TypeVar strategy compared to #8281. The idea here is to typevar shape and dtype instead, just like numpy does.

Previously I tried to use the _data array as the TypeVar but that causes all kinds of issues since TypeVar is usually invariant and can't be updated to a new type. Since the dtype changes very frequently when doing array operations it quickly gets difficult to pass along the correct typing.

  • This PR adds a from_array function. The intention is to use that function to create NamedArrays when you are passing around ArrayLikes. The init for NamedArray will now just assume the input data is correct. At runtime at least, mypy will catch any non-supported array types. There's some precedent to this:
    • numpy.array_api.Array forces to use xp.asarray.
    • Cubed assumes the inputs are correct. Has a xp.asarray and a from_array function.
    • The ugly fastpath argument is therefore not needed.
  • Adds a bunch of type hint classes, duckarray[ShapeType, DType] (corresponding to np.ndarray) or DuckArray[ScalarType] (corresponding to np.typing.NDArray) are the recommended ones.
    • It's better to use these kinds of classes over creating is_duck_array functions with typeguards because isinstance also works on the else clause.
  • This PR adds some array_api functions, the idea here is that NamedArray could also be array_api compliant.

References:
https://github.com/tomwhite/cubed/blob/ea885193dd37d27917a24878b51bb086aaef5fb1/cubed/core/ops.py#L34
https://stackoverflow.com/questions/74633074/how-to-type-hint-a-generic-numpy-array
https://numpy.org/doc/stable/reference/arrays.scalars.html#scalars
https://github.com/numpy/numpy/blob/040ed2dc9847265c581a342301dd87d2b518a3c2/numpy/__init__.pyi#L1423
https://github.com/numpy/numpy/blob/040ed2dc9847265c581a342301dd87d2b518a3c2/numpy/_typing/_array_like.py#L32
https://stackoverflow.com/questions/69186176/determine-if-subclass-has-a-base-classs-method-implemented-in-python

Illviljan and others added 30 commits October 7, 2023 14:18
Illviljan and others added 3 commits October 17, 2023 13:44
@Illviljan Illviljan added the plan to merge Final call for comments label Oct 17, 2023
@Illviljan Illviljan marked this pull request as draft October 18, 2023 05:45
@Illviljan Illviljan removed the plan to merge Final call for comments label Oct 18, 2023
@Illviljan
Copy link
Contributor Author

And back to the drawing board.

@headtr1ck
Copy link
Collaborator

And back to the drawing board.

Yeah, all these NamedArray PRs will have heavy merge conflicts:/

@Illviljan Illviljan marked this pull request as ready for review October 18, 2023 06:07
@Illviljan
Copy link
Contributor Author

Went easier than expected getting tests green. That's suspicious! I'll dig around regarding that in a follow up PR.

@Illviljan Illviljan enabled auto-merge (squash) October 18, 2023 06:11
@Illviljan Illviljan merged commit 087fe45 into pydata:main Oct 18, 2023
28 checks passed
Comment on lines -52 to -57
def as_compatible_data(
data: T_DuckArray | np.typing.ArrayLike, fastpath: bool = False
) -> T_DuckArray:
if fastpath and getattr(data, "ndim", 0) > 0:
# can't use fastpath (yet) for scalars
return cast(T_DuckArray, data)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Illviljan, i've been reviewing the latest changes on the main branch and i've noticed that this pull request removed the as_compatible_data function as well as fastpath in NamedArray's constructor. i'm curious if this was intentional or if there was some discussion about it that I may have missed.

Copy link
Member

@andersy005 andersy005 Oct 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

never mind, i just saw this line in the PR description:

init for NamedArray will now just assume the input data is correct. At runtime at least, mypy will catch any non-supported array types. There's some precedent to this

The ugly fastpath argument is therefore not needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should invert this.

Internally it's OK to use from_array but a user should be able to do NamedArray('x', [1, 2, 3]) without issues. I like the idea of a classmethod NamedArray.from_array for the fastpath usecase

Copy link
Contributor Author

@Illviljan Illviljan Oct 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Who is the user of NamedArray again? Aren't we doing a lightweight Variable now?

The modern array packages I've looked at (Cubed, np.array_api) either doesn't allow an init or just assumes it's correct. They rather recommend you to use asarray or from_array functions.

  • NamedArray(('x',), np.array([1, 2, 3])) is not that bad
  • xp.asarray([1, 2, 3], dims="x") -> Namedarray(dims=("x",), data=np.array([1, 2, 3])) is quick too.

I think it's better to start strict (and fast) and see if users actually thinks it's a problem.

Copy link
Member

@andersy005 andersy005 Oct 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to start strict (and fast) and see if users actually thinks it's a problem.

i'm pro being strict in Namedarray(). xarray still gets to keep its as_compatible_data() check.

@maresb
Copy link
Contributor

maresb commented Dec 12, 2023

Hey, I see that this was fairly recently merged. I have a question, and I was hoping it'd be appropriate to post here.

Why is this:

_Dim = Hashable

not

_Dim = str

?

I'm trying to write code like

da_dims: tuple[str] = da.dims

which doesn't work since Hashable isn't a subtype of str.

On the other hand,

>>> da = xr.DataArray(data=[1,2,3], dims=[7])
TypeError: dimension 7 is not a string

so it seems like it can't be any hashable other than str.

@Illviljan
Copy link
Contributor Author

Because dim can be anything Hashable: 1, "x", ("x", 23), None.
So it is the opposite, str is a subtype of Hashable!

from typing import Hashable


dims_str: tuple[Hashable, ...] = ("x",)
dims_int: tuple[Hashable, ...] = (654, 23)
dims_tuple: tuple[Hashable, ...] = (("#", "sdf"), ("s",))

# mypy --strict:
# Success: no issues found in 1 source file

# pyright 1.1.280
# 0 errors, 0 warnings, 0 informations
# Completed in 0.865sec

This PR deals with the typing of namedarray specifically, so I'm not promising everything is perfect typing wise at the DataArray level.
But DataArray is the odd one out currently:

import numpy as np
import xarray as xr

a = xr.namedarray.core.NamedArray(data=np.array([1, 2, 3]), dims=(7,))
b = xr.Variable(data=np.array([1, 2, 3]), dims=(7,))
c = xr.Dataset({"b": b})
d = xr.DataArray(data=[1, 2, 3], dims=(7,))  # error

@headtr1ck
Copy link
Collaborator

Let me elaborate on this a bit...

Why is this:

_Dim = Hashable

not

_Dim = str

?

Because we mostly support non-string types for dimension and variable names.
The support has become better and better in the last year.
You could use custom classes like enums or tuples as dimension names.

I'm trying to write code like

da_dims: tuple[str] = da.dims

which doesn't work since Hashable isn't a subtype of str.

That is a limitation on how things currently work. In this case you will have to use cast(tuple[str, ...], da.dims)
There is some effort in the new NamedArray class to make the dimensions Generic as well, see #8276
Then the dims type will depend on how you construct the object.

On the other hand,

>>> da = xr.DataArray(data=[1,2,3], dims=[7])
TypeError: dimension 7 is not a string

so it seems like it can't be any hashable other than str.

The constructors are still a weak spot of typing so far (and error messages as well as it seems) because they allow many different combinations of how to create a DataArray (Dataset) and are therefore highly dynamic and difficult to statically type.
Actually, I would consider this example a bug because we clearly state that hashables are supported. Feel free to open an issue about that!

@maresb maresb mentioned this pull request Dec 12, 2023
5 tasks
@maresb
Copy link
Contributor

maresb commented Dec 12, 2023

Thanks so much @Illviljan and @headtr1ck for the fast and detailed response!!!

As per @headtr1ck's suggestion I opened #8546.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-NamedArray Lightweight version of Variable
Projects
None yet
Development

Successfully merging this pull request may close these issues.

NamedArray.shape does not support unknown dimensions
5 participants