Use plugin to support numpy #3540

Closed
JukkaL opened this issue Jun 14, 2017 · 16 comments
Labels
needs discussion priority-1-normal topic-plugins The plugin API and ideas for new plugins

Comments

@JukkaL
Collaborator

JukkaL commented Jun 14, 2017

This was originally discussed at #1240. It's unclear how this would work exactly -- we may first need to design a static type system extension for numpy.

@JukkaL JukkaL added needs discussion priority-2-low topic-plugins The plugin API and ideas for new plugins labels Jun 14, 2017
@ilevkivskyi
Member

we may first need to design a static type system extension for numpy.

I think we would only need some dependent type to describe a fixed-size array, Array[np.float32, n, k], where n and k are integers known statically. Related: #3062 and #3345.

@kmike

kmike commented Jul 11, 2017

This project looks somewhat related: http://datashape.readthedocs.io, https://github.com/blaze/datashape

@mrocklin

There is some conversation about type annotations in this NumPy issue: numpy/numpy#7370

@shoyer

shoyer commented Oct 20, 2017

Here's my brain dump on static typing for arrays:

TensorFlow (which copies the design of NumPy in many ways) has a form of static typing for array shapes that is a useful precedent here. Notably, it distinguishes between three cases (copied from that page):

  • Fully-known shape: has a known number of dimensions and a known size for each dimension. e.g. TensorShape([16, 256])
  • Partially-known shape: has a known number of dimensions, and an unknown size for one or more dimension. e.g. TensorShape([None, 256])
  • Unknown shape: has an unknown number of dimensions, and an unknown size in all dimensions. e.g. TensorShape(None)
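The three cases can be modelled in plain Python without TensorFlow. The following is only a sketch: the `Shape` alias and the `describe` helper are hypothetical names for illustration, not TensorFlow API. As in the `TensorShape` examples above, `None` stands for "unknown".

```python
from typing import Optional, Tuple

# Hypothetical alias: a shape of None means the rank is unknown;
# a None entry inside the tuple means that one dimension's size is unknown.
Shape = Optional[Tuple[Optional[int], ...]]


def describe(shape: Shape) -> str:
    """Classify a shape as fully-known, partially-known, or unknown."""
    if shape is None:
        return "unknown"
    if any(dim is None for dim in shape):
        return "partially-known"
    return "fully-known"
```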

TensorFlow sets shapes using C++ or Python, mostly with SetShapeFn. Looking through how it's used internally by TensorFlow operations could give hints about the necessary use cases.

In particular, one important use case supported by most NumPy functions is broadcasting, where output shapes are propagated via broadcasting rules. This could allow for identifying broadcasting errors (one of the most frequent user-errors in NumPy) at compile-time instead of run-time.
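As a rough illustration of what a checker would have to compute, here is a minimal pure-Python sketch of shape propagation under NumPy's broadcasting rule (trailing dimensions are aligned, and a pair is compatible when the sizes are equal or either is 1). The function name `broadcast_shape` is made up for this sketch, and it handles fully known shapes only.

```python
from typing import List, Tuple


def broadcast_shape(a: Tuple[int, ...], b: Tuple[int, ...]) -> Tuple[int, ...]:
    """Return the broadcast result shape, or raise ValueError on a mismatch."""
    result: List[int] = []
    # Walk the dimensions from the last to the first; a missing leading
    # dimension behaves as if it had size 1.
    for i in range(1, max(len(a), len(b)) + 1):
        da = a[-i] if i <= len(a) else 1
        db = b[-i] if i <= len(b) else 1
        if da != db and da != 1 and db != 1:
            raise ValueError(f"cannot broadcast {a} with {b}")
        # A size-1 dimension stretches to match the other operand.
        result.append(db if da == 1 else da)
    return tuple(reversed(result))
```

A static checker could run the same logic over annotated shapes and report the `ValueError` case as a type error before the program runs.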

Finally, a concept like TypeVar for array dimensions (call it DimensionVar?) could be extremely powerful. In NumPy, an operation like x + y for inputs x: Vector[N] and y: Vector[M] can only succeed if N=1, M=1 or N=M. Obviously, this is error-prone: N and M can coincide purely by chance, even if they represent semantically distinct axes. For static checking, it could be useful to assume that N and M are always distinct unless they are explicitly cast to use the same dimension variable. This would also catch many errors.
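To make the "always distinct" idea concrete, here is a small sketch. The `Dim` class and `check_add` function are hypothetical, and the size-1 broadcasting case is deliberately left out. Dimensions are compared by identity rather than by runtime size, so two axes that merely happen to have the same length are still rejected.

```python
class Dim:
    """Hypothetical named dimension; distinct objects never unify."""

    def __init__(self, name: str) -> None:
        self.name = name

    def __repr__(self) -> str:
        return self.name


def check_add(x_dim: Dim, y_dim: Dim) -> Dim:
    """Accept x + y for two vectors only if they share one dimension variable."""
    if x_dim is not y_dim:
        raise TypeError(f"dimension mismatch: {x_dim!r} vs {y_dim!r}")
    return x_dim


N = Dim("N")
M = Dim("M")
```

Here `check_add(N, N)` is accepted, while `check_add(N, M)` raises `TypeError` even if both vectors would have the same length at run time.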

@mitar

mitar commented Oct 20, 2017

I think we would only need some dependent type

I think dependent types are needed only in very rare cases. What we probably need more is something like static constraints. A dependent type means that some part of the type is based on the value of a function argument. For example, if the function were:

def foo(array, array_dtype):
    ...

Where array has dtype of array_dtype, then you would need a dependent type to express that array argument has numpy type with array_dtype as its dtype.

But what we need first is a way to express the "numpy array with a given dtype" type at all. I see this as an extra constraint on the numpy type. The issue is that support for static constraints alone is not enough: one also needs a language for describing relations between the input arguments of a function (for example, that one argument array has to be twice as wide as another). It gets pretty tricky quickly. See the contracts project for an attempt at this.

@petered

petered commented Mar 23, 2018

Working on a large codebase of scientific python code, a generic array-type annotation would be incredibly useful. An example that would be nice to type-check:

def compute_costs(
    predictions,  # type: Array(n_samples, n_outputs)[float]   # Array of probabilities
    labels,  # type: Array(n_samples)[int,>=0,<n_outputs]  # Correct labels in range 0 to n_outputs-1
):  # type: (...) -> Array(n_samples)[float]  # The cost per-sample

This documents that:

  • The predictions and the labels must have the same number of samples
  • The returned costs have the same number of samples as predictions
  • The elements in labels should not exceed the dimensions of predictions.
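For comparison, the same constraints can be written today as a runtime check. The sketch below is only an illustration, using nested Python sequences instead of NumPy arrays and a hypothetical helper name `check_compute_costs_inputs`. A static checker would aim to report these errors without running the code.

```python
from typing import Sequence


def check_compute_costs_inputs(
    predictions: Sequence[Sequence[float]],
    labels: Sequence[int],
) -> None:
    """Raise ValueError if the inputs violate the documented constraints."""
    n_samples = len(predictions)
    if len(labels) != n_samples:
        raise ValueError("predictions and labels must have the same number of samples")
    # Every row of predictions must have the same width, n_outputs.
    n_outputs = len(predictions[0]) if n_samples else 0
    for row, label in zip(predictions, labels):
        if len(row) != n_outputs:
            raise ValueError("predictions must be a rectangular 2-D array")
        if not 0 <= label < n_outputs:
            raise ValueError(f"label {label} is outside the range [0, {n_outputs})")
```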

It would be good to add more flexibility to this example. For example it may be the case that the data points are not flattened along one dimension, e.g.

def compute_costs(
    predictions,  # type: Array(n_samples, *data_dims, n_outputs)[float]   # Array of probabilities
    labels,  # type: Array(n_samples, *data_dims)[int,>=0,<n_outputs]  # Correct labels in range 0 to n_outputs-1
):  # type: (...) -> Array(n_samples)[float]  # The cost per-sample

This would indicate that the predictions and labels can have any shapes as long as they match.

It would also be great to have some kind of support for generic types. For example:

def minibatch_iterator(
    data_iterator,  # type: Generator[Array(*dims)[<dtype>]]
    minibatch_size,  # type: int
):  # type: (...) -> Generator[Array(minibatch_size, *dims)[<dtype>]]

This annotation would show that

  • The first dimension of the yielded arrays would match the minibatch size, and the remaining dimensions would match whatever comes out of the data iterator
  • The type would match whatever comes out of the data iterator.
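Part of this can already be expressed with the standard typing module. In the sketch below, which uses plain lists rather than arrays (an assumption made for illustration), a `TypeVar` records that the element type of the output matches the input. What it cannot express is that each yielded batch has length `minibatch_size`, which is exactly the gap the annotation above is meant to fill. Dropping a short final batch is a choice made for this sketch.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def minibatch_iterator(
    data_iterator: Iterable[T],
    minibatch_size: int,
) -> Iterator[List[T]]:
    """Yield lists of exactly minibatch_size items; a short final batch is dropped."""
    if minibatch_size <= 0:
        raise ValueError("minibatch_size must be positive")
    batch: List[T] = []
    for item in data_iterator:
        batch.append(item)
        if len(batch) == minibatch_size:
            yield batch
            batch = []
```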

@JelleZijlstra
Member

FYI, the https://github.com/numpy/numpy_stubs project is now working on support for typing numpy-using code.

@shoyer

shoyer commented Mar 23, 2018

Yes, and we are very much interested in collaboration. See here for ongoing discussion about typing shapes: https://github.com/numpy/numpy_stubs/issues/5

@gvanrossum
Member

gvanrossum commented Mar 23, 2018 via email

@ilevkivskyi
Member

@gvanrossum
I guess Stephan means collaboration between the numpy team and the mypy team. There is a (very preliminary) draft proposal here; discussion about it is also tracked in python/typing#513. Coincidentally, Jukka and I discussed possible plans for supporting numpy just yesterday. I am personally very interested in this (given my extensive experience with numpy), and it looks like Jukka is interested as well (mostly due to the impact this will have). However, neither of us has the bandwidth for this right now. In our discussion we agreed that it would be great if either or both of us could work on this in late Q3 or Q4.

@shoyer

shoyer commented Mar 26, 2018

@gvanrossum I suppose by "We" I was referring to fellow NumPy developers, and others in the broader numerical Python community. (But to be honest, I wasn't being very careful with my post.)

There are a few others in the mypy community who have also expressed interest here. I met up with @JelleZijlstra and @ethanhs along with @njsmith (from the NumPy side) a few months ago to discuss the issues in python/typing#513. This prompted the creation of numpy-stubs, but there hasn't been significant progress on the major issues since then (which of course is totally fine, this being open source I have no expectations on anyone else's time).

Is there a complete proposal for the syntax yet? It would have to aim at becoming a PEP (a sibling to PEP 484 and PEP 526) and someone should attempt an implementation for mypy, perhaps in the form of a plugin.

My preliminary partial proposal (which @ilevkivskyi linked to above) takes a stab at syntax, and also outlines some of the broader issues for typing. I agree that a PEP will be necessary eventually, but I'm not sure we're quite ready for that yet. A proof of concept implementation would certainly help clarify many of these issues.

@rmcgibbo wrote an experimental mypy plugin for NumPy in https://github.com/rmcgibbo/numpy-mypy, but it only supports checks on the number of array dimensions (ndim), not shapes or dimension identity. I think dimension identity checks could be very powerful (along the same lines as what @petered writes above), but certainly support for ndim would be better than nothing.

@gvanrossum
Member

gvanrossum commented Mar 26, 2018 via email

@shoyer

shoyer commented Mar 29, 2018

@gvanrossum unfortunately, I don't think I'll be able to make it to PyCon this year (though it is definitely tempting!). But you could still have a good meeting with @njsmith and @mrocklin from the numpy/scientific Python side. I would also be happy to do a video conference or call with anyone interested in a higher-bandwidth discussion.

@JukkaL
Collaborator Author

JukkaL commented Jan 29, 2020

This topic has been subsequently discussed in various other forums (including the Bay Area typing meetups -- and PyCon 2019, I think). Keeping this issue open doesn't seem worthwhile, since it's missing a lot of context.

@JukkaL JukkaL closed this as completed Jan 29, 2020
@olegsinavski

@JukkaL Is there a link to the discussions you mentioned? How can we get more context?

@JukkaL
Collaborator Author

JukkaL commented Jun 19, 2020

There has been some recent discussion in the typing-sig@ mailing list. That seems like the best place to find up-to-date information.
