
Allow for more expressive Array signatures #12

Closed
ramonhagenaars opened this issue Feb 6, 2020 · 10 comments · Fixed by #60
Labels
feature A new feature WIP

Comments

@ramonhagenaars
Owner

ramonhagenaars commented Feb 6, 2020

See also issues #9, #10 and #11.

There have been several requests to extend the expressiveness of Array. I'm not keen on a sudden change to the signature of Array. Rather, I'd like to introduce a new type NDArray (a name I like more than Array anyway) that will "slowly" replace Array.

I have the following signature in mind:

Signature design
NDArray                                 any dimension of any size of any type
NDArray[...]                            1 dimension of any size of any type
NDArray[3]                              1 dimension of size 3 of any type
NDArray[(3, 3, 5)]                      3 dimensions (3 x 3 x 5) of any type
NDArray[(3, ..., 5)]                    3 dimensions (3 x ? x 5) of any type
NDArray[(D1, 3, D1)]                    3 dimensions (D1 x 3 x D1) of any type, where D1 is an
                                        nptyping constant that can be imported to express a
                                        dimension variable (see #9 and #11)

NDArray[int]                            any dimension of any size of type int
NDArray[..., int]                       1 dimension of any size of type int
NDArray[(3, 3, 5), int]                 3 dimensions (3 x 3 x 5) of type int
NDArray[(3, 3, 5), np.dtype('int16')]   3 dimensions (3 x 3 x 5) of type int16
NDArray[(3, 3), np.dtype([('f1', np.int16), ('f2', np.int16)])]
                                        2 dimensions (3 x 3) with structured types
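To make the intended matching rules concrete, here is a minimal sketch in plain Python. It is not nptyping's implementation: shape_matches is a hypothetical helper that compares a shape tuple against a spec, reading ... as "one axis of any size" as in the list above. It leaves out dtype checks and dimension variables such as D1.

```python
def shape_matches(shape, spec=None):
    """Return True if `shape` (a tuple of ints) satisfies `spec`.

    Hypothetical helper for illustration only; not part of nptyping.

    spec is None          -> any number of dimensions, any sizes
    spec is ... or an int -> exactly one dimension
    spec is a tuple       -> one entry per dimension; ... matches any size
    """
    if spec is None:
        return True
    if not isinstance(spec, tuple):
        # NDArray[3] and NDArray[...] both describe a single dimension.
        spec = (spec,)
    if len(shape) != len(spec):
        return False
    return all(want is Ellipsis or want == got for want, got in zip(spec, shape))


assert shape_matches((3, 7, 5), (3, ..., 5))      # middle axis may have any size
assert not shape_matches((3, 7), (3, ..., 5))     # wrong number of dimensions
assert shape_matches((4,), ...)                   # 1 dimension of any size
```

Under these rules a spec such as (..., ...) fixes the number of dimensions at 2 without fixing either size.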

Process
The new NDArray is to replace the current Array. Once NDArray is introduced, the original Array will be deprecated and then removed in the next minor release.

Before I start investing time into this, I'd love to hear your opinion on this. Please leave any feedback, any comments, any suggestions.

@ramonhagenaars ramonhagenaars added help wanted Extra attention is needed feature A new feature labels Feb 6, 2020
This was referenced Feb 6, 2020
@nannau

nannau commented Feb 7, 2020

Thank you!

I think the proposed signature design sounds great, and incorporates clearly how to embed the dimension/rank of arrays - my inquiry in #10. Renaming to NDArray is also more in the spirit of numpy arrays themselves. Having the sizes defined in a tuple (ndim > 1), and the dimensions/rank defined implicitly by the size of the tuple makes a lot of sense, too.

It would extend many of the efficiencies of type hinting into the science stack realm, which is an area that could greatly benefit from this!

I'd love to help, where I can.

@jameshiebert

jameshiebert commented Feb 7, 2020

This seems like a solid proposal to me. It would cover the majority of our potential use cases, and I don't envision many downsides.

The only thing that I see that could be missing is possibly declaring rank without specifying the exact dimension sizes. Not sure how hard it would be to implement or even how useful it is from a typing perspective. But I know that we occasionally have arrays of fixed rank and length that might have their dimensions resized. E.g. a basic transpose operation would fit under this use case.

Edit: Never mind paragraph two. A colleague pointed out to me that something like NDArray[(..., ...)] could cover exactly this use case.

@alimanfoo

This sounds great. Would there be a way to name your own dimension variables? It could be very helpful for documenting the intent of each dimension, e.g. NDArray[(LAT, LON), float] or NDArray[(VARIANTS, SAMPLES, PLOIDY), int].

@ramonhagenaars
Owner Author

ramonhagenaars commented Feb 8, 2020

@alimanfoo, I think you would want to explicitly name the columns of a dimension rather than the dimension itself. Correct me if I'm wrong.

This is what the signature of an array of coordinates would look like:
NDArray[(..., 2), float] indefinite number of rows, with 2 columns (lat, lon).

So in your case, you want to further elaborate on that 2. With the current design, you could declare the constants LAT = 1 and LON = 1. Then you could write:
NDArray[(..., LAT + LON), float]


We could take this one step further though, by introducing something that allows you to be more precise on what a column value should be:

from nptyping import NamedColumn

# NamedColumn takes a name and an optional predicate to validate a value.
lat = NamedColumn('latitude', lambda x: x >= 0)
lon = NamedColumn('longitude', lambda x: x >= 0)

NDArray[(..., (lat, lon)), float]

The optional predicate of a NamedColumn would allow the isinstance check of NDArray to validate the correctness of the values of those columns.

With this, you could also write:

from nptyping import NamedColumn

lat = NamedColumn('latitude', lambda x: isinstance(x, float) and x >= 0)
lon = NamedColumn('longitude', lambda x: isinstance(x, float) and x >= 0)

NDArray[(..., (lat, lon))]  # indefinite number of coordinates
NDArray[(5, (lat, lon))]    # 5 coordinates

Or even something like this:

somewhere_in_europe = NamedColumn('coordinate somewhere in Europe', lambda x: is_in_polygon(x, EU))
somewhere_in_usa = NamedColumn('coordinate somewhere in USA', lambda x: is_in_polygon(x, USA))

NDArray[((somewhere_in_europe, somewhere_in_usa), (lat, lon))]    # 2 coordinates

Keep in mind that instance checks get more expensive as the typings become more precise. I would recommend type checking only during development anyway, not in a production environment.
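For illustration only, here is a minimal sketch of how a predicate-carrying column could be validated. Everything in it is an assumption: NamedColumn below is a hypothetical dataclass standing in for the proposed type (no such class is confirmed to exist in nptyping), rows_match is an invented helper, and it works on plain sequences of rows rather than numpy arrays. Each predicate runs once per value, which is where the cost of more precise typings would come from.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence


@dataclass(frozen=True)
class NamedColumn:
    """Hypothetical stand-in for the proposed NamedColumn; not an nptyping API."""
    name: str
    # Optional validator; by default every value is accepted.
    predicate: Callable[[Any], bool] = lambda value: True


def rows_match(rows: Sequence[Sequence[Any]], columns: Sequence[NamedColumn]) -> bool:
    """Check that every row has one value per column and passes each predicate."""
    return all(
        len(row) == len(columns)
        and all(col.predicate(value) for col, value in zip(columns, row))
        for row in rows
    )


lat = NamedColumn("latitude", lambda x: isinstance(x, float) and x >= 0)
lon = NamedColumn("longitude", lambda x: isinstance(x, float) and x >= 0)

print(rows_match([(52.1, 4.3), (48.9, 2.4)], (lat, lon)))  # True
print(rows_match([(52.1, -4.3)], (lat, lon)))              # False
```

An isinstance check on an NDArray with named columns would have to do comparable work for every row, so the check scales with the size of the array.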


Does this extension with NamedColumn make sense? It may be introduced in a following stage after releasing the NDArray.

@ramonhagenaars ramonhagenaars added WIP and removed help wanted Extra attention is needed labels Apr 4, 2020
@ramonhagenaars
Owner Author

The major part of this issue has been addressed and released in v1.0.0. Next in line are the dimension variables and the named columns.

@alimanfoo

Great news!

@nannau

nannau commented Apr 8, 2020

Awesome news, and work!

@petered

petered commented Jan 7, 2022

This is great and very useful. Since we use arrays everywhere, it would be really nice to have a less "brackety" syntax that allows you to name your dimensions to signify that you expect consistency, like:

def compute_image_mask(image: NDArray['H,W,3', np.uint8]) -> NDArray['H,W', bool]: ...
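A minimal sketch of what "expecting consistency" could mean for such string specs, assuming a spec is a comma-separated list in which integer tokens are fixed sizes and any other token is a dimension name. The helper bind_dims is hypothetical and works on plain shape tuples; it is not an nptyping function.

```python
def bind_dims(spec: str, shape: tuple, bindings: dict) -> bool:
    """Match `shape` against a spec like 'H,W,3', recording named sizes.

    Integer tokens must equal the axis size. Any other token is treated as a
    dimension name that must keep the same size wherever it appears.
    `bindings` is updated in place and may be partially filled on failure.
    """
    tokens = [token.strip() for token in spec.split(",")]
    if len(tokens) != len(shape):
        return False
    for token, size in zip(tokens, shape):
        if token.isdigit():
            if int(token) != size:
                return False
        elif bindings.setdefault(token, size) != size:
            # The name was bound earlier to a different size.
            return False
    return True


dims = {}
assert bind_dims("H,W,3", (480, 640, 3), dims)   # binds H=480, W=640
assert bind_dims("H,W", (480, 640), dims)        # consistent with the image
assert not bind_dims("H,W", (640, 480), dims)    # H and W swapped
```

Sharing one bindings dict between the argument and the return value is what ties the H and W of the mask to the H and W of the image.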

@petered

petered commented Jan 7, 2022

Second thing:

NDArray[..., int] 1 dimension of any size of type int

I think this conflicts with the way Ellipsis (...) is used in numpy and could cause confusion. In numpy, ... means "all remaining axes", e.g.

>>> arr = np.random.randn(5, 4, 3)
>>> arr[..., 0].shape
(5, 4)
>>> arr[..., 0, 0].shape
(5,)
>>> arr[..., 0, 0, 0].shape
()

Whereas : means "an axis":

>>> arr[:, 0].shape
(5, 3)
>>> arr[:, 0, 0].shape
(5,)
>>> arr[:, 0, 0, 0].shape
IndexError: too many indices for array: array is 3-dimensional, but 4 were indexed

Since : cannot be used as an object in numpy, could we just use typing.Any instead? E.g. NDArray[Any, int] to represent an arbitrarily sized 1-dimensional array (or a named dimension like NDArray['N_points', int] as suggested above).


I hope it's not too late to change, but I would propose that ... be used to signify "zero or more axes", as this is also a very useful thing to be able to type. Trivial example:

def take_xy_locations(points: NDArray[(..., 3), float]) -> NDArray[(..., 2), float]:
    return points[..., :2]

assert take_xy_locations(np.random.randn(4, 3)).shape == (4, 2)
assert take_xy_locations(np.random.randn(5, 4, 3)).shape == (5, 4, 2)
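A minimal sketch of the matching rules proposed here, on plain shape tuples: Any stands for exactly one axis of any size and ... for zero or more axes. The helper matches_numpy_style is hypothetical and does not describe nptyping's current behaviour; allowing at most one ... per spec is an assumption borrowed from numpy indexing.

```python
from typing import Any


def matches_numpy_style(shape: tuple, spec: tuple) -> bool:
    """Match `shape` against `spec` with numpy-like Ellipsis semantics.

    int -> axis of exactly that size
    Any -> exactly one axis of any size
    ... -> zero or more axes (at most one ... per spec)
    """
    if spec.count(Ellipsis) > 1:
        raise ValueError("spec may contain at most one Ellipsis")

    def axis_ok(want, got):
        return want is Any or want == got

    if Ellipsis not in spec:
        return len(shape) == len(spec) and all(map(axis_ok, spec, shape))

    split = spec.index(Ellipsis)
    head, tail = spec[:split], spec[split + 1:]
    if len(shape) < len(head) + len(tail):
        return False
    # Compare the fixed axes before and after the Ellipsis; the axes in
    # between are absorbed by it.
    tail_shape = shape[len(shape) - len(tail):]
    return all(map(axis_ok, head, shape)) and all(map(axis_ok, tail, tail_shape))


assert matches_numpy_style((4, 3), (..., 3))
assert matches_numpy_style((5, 4, 3), (..., 3))
assert matches_numpy_style((3,), (..., 3))        # zero leading axes
assert not matches_numpy_style((7, 2), (Any,))    # Any is exactly one axis
```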

@kevinsuedmersen

kevinsuedmersen commented Mar 20, 2023

Hi @ramonhagenaars

Thanks for this really cool repo.

I'm really looking forward to the NamedColumn feature as you described it in this comment. Do you know when it can be released, or can you recommend a workaround in the meantime?


6 participants