Use shape and dtype as typevars in NamedArray #8294

Illviljan · 2023-10-11T05:08:32Z

Using a different TypeVar strategy compared to #8281. The idea here is to typevar shape and dtype instead, just like numpy does.

Previously I tried to use the _data array as the TypeVar but that causes all kinds of issues since TypeVar is usually invariant and can't be updated to a new type. Since the dtype changes very frequently when doing array operations it quickly gets difficult to pass along the correct typing.
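A minimal sketch of why parametrizing on the dtype is easier to carry through operations, using only the standard library. `TinyNamedArray` and its `astype` signature are hypothetical stand-ins for illustration, not xarray's classes: the container is generic over the element type, so a dtype-changing method simply returns a differently parametrized instance.

```python
from dataclasses import dataclass
from typing import Any, Callable, Generic, Tuple, TypeVar

# Hypothetical names for illustration only; not xarray's definitions.
_DType = TypeVar("_DType")
_NewDType = TypeVar("_NewDType")


@dataclass(frozen=True)
class TinyNamedArray(Generic[_DType]):
    """Toy container generic over the element type, not the array object."""

    dims: Tuple[str, ...]
    data: Tuple[_DType, ...]

    def astype(
        self, dtype: Callable[[Any], _NewDType]
    ) -> "TinyNamedArray[_NewDType]":
        # The return annotation carries the new element type, so a type
        # checker can follow the dtype change without "updating" a TypeVar.
        return TinyNamedArray(self.dims, tuple(dtype(v) for v in self.data))


ints = TinyNamedArray(("x",), (1, 2, 3))  # element type: int
floats = ints.astype(float)  # element type: float
```

If the whole array object were the type parameter instead, every such method would need a way to express "the same array class but with another dtype", which is the difficulty described above.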

This PR adds a from_array function. The intention is to use that function to create NamedArrays when you are passing around ArrayLikes. The init for NamedArray will now just assume the input data is correct, at runtime at least; mypy will catch any unsupported array types. There is some precedent for this:
- numpy.array_api.Array forces you to use xp.asarray.
- Cubed assumes the inputs are correct and has both an xp.asarray and a from_array function.
- The ugly fastpath argument is therefore not needed.

Adds a bunch of type hint classes; duckarray[ShapeType, DType] (corresponding to np.ndarray) and DuckArray[ScalarType] (corresponding to np.typing.NDArray) are the recommended ones.
- It's better to use these kinds of classes than to create is_duck_array functions with typeguards, because isinstance narrowing also works in the else clause.

This PR adds some array_api functions; the idea here is that NamedArray could also be array_api compliant.
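The point about isinstance versus typeguard helpers can be sketched as follows. This is an illustration under assumptions: `DuckArrayLike` and `FakeArray` are hypothetical names, and the protocol only requires `shape` and `dtype`, which is far less than the real duck array protocols. A `TypeGuard` function (PEP 647) narrows only the branch where it returns True, whereas an `isinstance` check against a runtime-checkable protocol lets a type checker narrow both branches.

```python
from typing import Protocol, Tuple, TypeVar, runtime_checkable

_ShapeType_co = TypeVar("_ShapeType_co", covariant=True)
_DType_co = TypeVar("_DType_co", covariant=True)


@runtime_checkable
class DuckArrayLike(Protocol[_ShapeType_co, _DType_co]):
    """Hypothetical protocol generic over shape and dtype."""

    @property
    def shape(self) -> _ShapeType_co: ...

    @property
    def dtype(self) -> _DType_co: ...


class FakeArray:
    """Minimal object that satisfies the protocol structurally."""

    def __init__(self, values: list) -> None:
        self._values = values

    @property
    def shape(self) -> Tuple[int, ...]:
        return (len(self._values),)

    @property
    def dtype(self) -> type:
        return type(self._values[0]) if self._values else object


def describe(data: object) -> str:
    if isinstance(data, DuckArrayLike):
        # Narrowed to DuckArrayLike: .shape is known to exist here.
        return f"duck array with shape {data.shape}"
    else:
        # With isinstance the checker also knows this is NOT a DuckArrayLike.
        return f"not a duck array: {type(data).__name__}"


print(describe(FakeArray([1, 2, 3])))
print(describe([1, 2, 3]))
```

At runtime the isinstance check only verifies that the attributes exist; the shape and dtype parameters are purely static information.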

Tests added
Closes NamedArray.shape does not support unknown dimensions #8291

References:
https://github.com/tomwhite/cubed/blob/ea885193dd37d27917a24878b51bb086aaef5fb1/cubed/core/ops.py#L34
https://stackoverflow.com/questions/74633074/how-to-type-hint-a-generic-numpy-array
https://numpy.org/doc/stable/reference/arrays.scalars.html#scalars
https://github.com/numpy/numpy/blob/040ed2dc9847265c581a342301dd87d2b518a3c2/numpy/__init__.pyi#L1423
https://github.com/numpy/numpy/blob/040ed2dc9847265c581a342301dd87d2b518a3c2/numpy/_typing/_array_like.py#L32
https://stackoverflow.com/questions/69186176/determine-if-subclass-has-a-base-classs-method-implemented-in-python

Illviljan · 2023-10-18T05:45:53Z

And back to the drawing board.

headtr1ck · 2023-10-18T06:03:01Z

> And back to the drawing board.

Yeah, all these NamedArray PRs will have heavy merge conflicts :/

Illviljan · 2023-10-18T06:10:18Z

Getting the tests green went easier than expected. That's suspicious! I'll dig into that in a follow-up PR.

andersy005 · 2023-10-19T16:44:49Z

xarray/namedarray/core.py

-def as_compatible_data(
-    data: T_DuckArray | np.typing.ArrayLike, fastpath: bool = False
-) -> T_DuckArray:
-    if fastpath and getattr(data, "ndim", 0) > 0:
-        # can't use fastpath (yet) for scalars
-        return cast(T_DuckArray, data)


@Illviljan, i've been reviewing the latest changes on the main branch and i've noticed that this pull request removed the as_compatible_data function as well as fastpath in NamedArray's constructor. i'm curious if this was intentional or if there was some discussion about it that I may have missed.

never mind, i just saw this line in the PR description:

> init for NamedArray will now just assume the input data is correct. At runtime at least, mypy will catch any non-supported array types. There's some precedent to this
>
> The ugly fastpath argument is therefore not needed.

dcherian · 2023-10-19

I think we should invert this.

Internally it's OK to use from_array, but a user should be able to do NamedArray('x', [1, 2, 3]) without issues. I like the idea of a classmethod NamedArray.from_array for the fastpath use case.

Illviljan · 2023-10-19

Who is the user of NamedArray again? Aren't we doing a lightweight Variable now?

The modern array packages I've looked at (Cubed, np.array_api) either don't allow an init or just assume the input is correct. Instead they recommend using asarray or from_array functions.

NamedArray(('x',), np.array([1, 2, 3])) is not that bad

xp.asarray([1, 2, 3], dims="x") -> Namedarray(dims=("x",), data=np.array([1, 2, 3])) is quick too.

I think it's better to start strict (and fast) and see if users actually think it's a problem.

andersy005 · 2023-10-19

> I think it's better to start strict (and fast) and see if users actually think it's a problem.

i'm pro being strict in Namedarray(). xarray still gets to keep its as_compatible_data() check.
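The design discussed in this thread (a strict constructor plus a separate converting entry point) can be sketched roughly as below. Everything here is a stand-in under stated assumptions: `StrictNamedArray` and this `from_array` are hypothetical, a tuple of floats plays the role of the array, and the sketch is limited to one dimension; it is not the signature of the function added in this PR.

```python
from typing import Hashable, Iterable, Tuple, Union


class StrictNamedArray:
    """Toy class: the constructor trusts its inputs and does no conversion."""

    def __init__(
        self, dims: Tuple[Hashable, ...], data: Tuple[float, ...]
    ) -> None:
        # No checks and no copies: this is the "strict (and fast)" path.
        self.dims = dims
        self.data = data


def from_array(
    dims: Union[str, Iterable[Hashable]], data: Iterable[float]
) -> StrictNamedArray:
    """Hypothetical converting entry point for array-like input."""
    # A bare string is a single dimension name, not an iterable of characters.
    dims_tuple = (dims,) if isinstance(dims, str) else tuple(dims)
    data_tuple = tuple(float(v) for v in data)
    if len(dims_tuple) != 1:
        raise ValueError("this 1-D sketch expects exactly one dimension name")
    return StrictNamedArray(dims_tuple, data_tuple)


fast = StrictNamedArray(("x",), (1.0, 2.0, 3.0))  # caller guarantees the types
converted = from_array("x", [1, 2, 3])  # conversion happens here
```

The trade-off is that all coercion cost lives in one function, while internal code that already holds valid data pays nothing for validation.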

maresb · 2023-12-12T16:46:44Z

Hey, I see that this was fairly recently merged. I have a question, and I was hoping it'd be appropriate to post here.

Why is this:

xarray/xarray/namedarray/_typing.py

Line 68 in 562f2f8

_Dim = Hashable

not

_Dim = str

?

I'm trying to write code like

da_dims: tuple[str] = da.dims

which doesn't work since Hashable isn't a subtype of str.

On the other hand,

>>> da = xr.DataArray(data=[1,2,3], dims=[7])
TypeError: dimension 7 is not a string

so it seems like it can't be any hashable other than str.

Illviljan · 2023-12-12T19:51:19Z

Because dim can be anything Hashable: 1, "x", ("x", 23), None.
So it is the opposite, str is a subtype of Hashable!

from typing import Hashable


dims_str: tuple[Hashable, ...] = ("x",)
dims_int: tuple[Hashable, ...] = (654, 23)
dims_tuple: tuple[Hashable, ...] = (("#", "sdf"), ("s",))

# mypy --strict:
# Success: no issues found in 1 source file

# pyright 1.1.280
# 0 errors, 0 warnings, 0 informations
# Completed in 0.865sec

This PR deals specifically with the typing of namedarray, so I'm not promising everything is perfect typing-wise at the DataArray level.
But DataArray is the odd one out currently:

import numpy as np
import xarray as xr

a = xr.namedarray.core.NamedArray(data=np.array([1, 2, 3]), dims=(7,))
b = xr.Variable(data=np.array([1, 2, 3]), dims=(7,))
c = xr.Dataset({"b": b})
d = xr.DataArray(data=[1, 2, 3], dims=(7,))  # error

headtr1ck · 2023-12-12T19:58:14Z

Let me elaborate on this a bit...

> Why is this:
>
> xarray/xarray/namedarray/_typing.py
>
> Line 68 in 562f2f8
>
> _Dim = Hashable
>
> not
> _Dim = str
> ?

Because we mostly support non-string types for dimension and variable names.
The support has become better and better in the last year.
You could use custom classes like enums or tuples as dimension names.

> I'm trying to write code like
> da_dims: tuple[str] = da.dims
> which doesn't work since Hashable isn't a subtype of str.

That is a limitation of how things currently work. In this case you will have to use cast(tuple[str, ...], da.dims).
There is some effort in the new NamedArray class to make the dimensions Generic as well, see #8276
Then the dims type will depend on how you construct the object.
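A small standard-library sketch of the two points above: non-string dimension names such as enum members, and the suggested cast. `get_dims` and `Axis` are hypothetical; `get_dims` merely imitates an attribute annotated like `da.dims` so that the example runs without xarray.

```python
from enum import Enum
from typing import Hashable, Tuple, cast


class Axis(Enum):
    """Hypothetical enum used as dimension names."""

    TIME = "time"
    SPACE = "space"


def get_dims() -> Tuple[Hashable, ...]:
    """Stand-in for an attribute typed as a tuple of Hashable, like da.dims."""
    return ("time", "space")


# cast() only informs the type checker; it performs no runtime conversion.
str_dims = cast(Tuple[str, ...], get_dims())

# An explicit runtime check is a separate, optional safety net.
if not all(isinstance(d, str) for d in str_dims):
    raise TypeError("expected only string dimension names")

# Enum members are hashable, so they fit the Hashable annotation as well.
enum_dims: Tuple[Hashable, ...] = (Axis.TIME, Axis.SPACE)
```

Because the cast is unchecked, it is only safe when the code constructing the object is known to use string dimension names.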

> On the other hand,
> >>> da = xr.DataArray(data=[1,2,3], dims=[7])
> TypeError: dimension 7 is not a string
> so it seems like it can't be any hashable other than str.

The constructors are still a weak spot of typing so far (and of the error messages as well, it seems) because they allow many different combinations of how to create a DataArray (Dataset) and are therefore highly dynamic and difficult to type statically.
Actually, I would consider this example a bug because we clearly state that hashables are supported. Feel free to open an issue about that!

maresb · 2023-12-12T20:41:24Z

Thanks so much @Illviljan and @headtr1ck for the fast and detailed response!!!

As per @headtr1ck's suggestion I opened #8546.

Illviljan and others added 30 commits October 7, 2023 14:18

Add from_array function

5e706c4

Update core.py

adab7c9

some fixes

790336c

Update test_namedarray.py

46482fa

[pre-commit.ci] auto fixes from pre-commit.com hooks

8823dab

for more information, see https://pre-commit.ci

fixes

534a040

Merge branch 'namedarray_from_array' of https://github.com/Illviljan/xarray into namedarray_from_array

f64c914

[pre-commit.ci] auto fixes from pre-commit.com hooks

166b647

fixes

2f6af70

fixes

9dc72a1

fixes

db96f17

[pre-commit.ci] auto fixes from pre-commit.com hooks

446e2a5

Update utils.py

68a9d9c

Merge branch 'namedarray_from_array' of https://github.com/Illviljan/xarray into namedarray_from_array

40fac21

more

fd7bb9f

Update core.py

4322eaf

[pre-commit.ci] auto fixes from pre-commit.com hooks

6d2f0ba

Update test_namedarray.py

91f06c0

Merge branch 'namedarray_from_array' of https://github.com/Illviljan/xarray into namedarray_from_array

13d232c

[pre-commit.ci] auto fixes from pre-commit.com hooks

60481e0

Update core.py

df13e1c

Merge branch 'namedarray_from_array' of https://github.com/Illviljan/xarray into namedarray_from_array

5ce29c8

fixes

955f3c4

fkxes

ebdd69e

[pre-commit.ci] auto fixes from pre-commit.com hooks

1232c6a

more

1302979

[pre-commit.ci] auto fixes from pre-commit.com hooks

c6613a4

Update test_namedarray.py

d199c21

Update test_namedarray.py

be9cbf0

Update test_namedarray.py

051c065

Illviljan and others added 3 commits October 17, 2023 13:44

Apply suggestions from code review

ab7b8ad

Co-authored-by: Michael Niklas <mick.niklas@gmail.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

2f7be4d

Update core.py

21c323f

headtr1ck approved these changes Oct 17, 2023

Illviljan added the plan to merge Final call for comments label Oct 17, 2023

Illviljan and others added 6 commits October 18, 2023 07:08

Merge branch 'main' into namedarray_scalartype

71e3d0c

[pre-commit.ci] auto fixes from pre-commit.com hooks

55034cc

Update core.py

e435e15

Merge branch 'namedarray_scalartype' of https://github.com/Illviljan/xarray into namedarray_scalartype

c86d8b2

Update core.py

1bce439

[pre-commit.ci] auto fixes from pre-commit.com hooks

eaf8ade

Illviljan marked this pull request as draft October 18, 2023 05:45

Illviljan removed the plan to merge Final call for comments label Oct 18, 2023

Illviljan added 2 commits October 18, 2023 07:58

Update core.py

e807272

Merge branch 'namedarray_scalartype' of https://github.com/Illviljan/xarray into namedarray_scalartype

9bf347f

Illviljan marked this pull request as ready for review October 18, 2023 06:07

Illviljan enabled auto-merge (squash) October 18, 2023 06:11

Illviljan merged commit 087fe45 into pydata:main Oct 18, 2023
28 checks passed

andersy005 reviewed Oct 19, 2023

maresb mentioned this pull request Dec 12, 2023

DataArray types must be strings #8546

Closed

5 tasks

Illviljan mentioned this pull request Jan 17, 2024

run CI on python=3.12 #8605

Merged

1 task

TomNicholas mentioned this pull request Mar 1, 2024

Migrate datatree.py module into xarray.core. #8789

Merged

4 tasks

Illviljan mentioned this pull request Jul 11, 2024

Type chunkmanagers #9227

Draft
