fix: support scalars in tuple (and list) arguments provided to __array_function__
#2045
Conversation
Since `__array_function__` overloads functions with a variety of different signatures, are you sure it's a good idea to regularize the types centrally?
This is for the fallback case in which we haven't defined an overload for a given function (`implemented.get(func) is None`), and need to turn any Awkward Arrays into NumPy arrays, so that the NumPy function can work with it.
Before this PR, tuples were recognized as non-array sequences; with this PR, lists are now treated the same way as tuples. You're also wrestling with what should be done here:

> Nearly always, a tuple indicates a "collection of arrays", whereas lists are themselves often "array-like", but seemingly this isn't formalised anywhere.
Maybe instead of trying to enumerate all of the Sequence (Iterable?) types that aren't arrays, we should try to enumerate all of the Sequence types that are arrays. Maybe `_to_rectilinear` should be:
```python
from collections.abc import Iterable

import numpy as np

import awkward as ak


def _to_rectilinear(arg):
    if isinstance(arg, np.ndarray):
        return arg
    elif isinstance(arg, (ak.highlevel.Array, ak.contents.Content)):
        nplike = ak._nplikes.nplike_of(arg)
        return nplike.to_rectilinear(arg)
    elif isinstance(arg, (str, bytes)):
        return arg
    elif isinstance(arg, Iterable):
        return [_to_rectilinear(x) for x in arg]
    else:
        return arg
```
Since this is recursive, I suppose the Record types should be included in the second predicate.
The first predicate would have to include array types for the other arrays that we support, `cp.ndarray`, `jax.DeviceArray`, by checking `type(arg).__module__` names so that this function doesn't import those modules. Or better yet, by checking for `hasattr(arg, "__array__")` to accept anything that NumPy would consider array-convertible. (That includes `ak.highlevel.Array` and `ak.contents.Content`, by the way.)
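A minimal sketch of that check, assuming illustrative module-name prefixes for CuPy and JAX (these names and the `is_array_like` helper are not the actual Awkward implementation):

```python
# Hypothetical sketch: recognize foreign array types without importing
# their libraries, by inspecting the type's module name, and fall back to
# duck-typing NumPy's __array__ hook.

def is_array_like(arg):
    module = type(arg).__module__ or ""
    if module.split(".")[0] in {"cupy", "jax", "jaxlib"}:  # assumed prefixes
        return True
    # Anything NumPy would consider array-convertible:
    return hasattr(arg, "__array__")


class ArrayConvertible:
    """Stand-in for an object implementing NumPy's __array__ hook."""

    def __array__(self):
        raise NotImplementedError  # a real implementation returns an ndarray
```

Note that `hasattr` never calls `__array__`, so this check stays cheap even for large arrays.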
The idea of inverting the check, essentially turning a whitelist into a blacklist (or vice-versa: I don't know which word applies here because neither is excluding types, just converting them), is to pass the problem of deciding whether a list is a collection of arrays or one big array to NumPy. That decision depends on the function. For instance, `np.concatenate` (if we didn't have an overload for it) expects to consume an iterable collection of arrays: making that one array before `np.concatenate` decides what to do with it would be a mistake. Other functions expect to consume a single array, and if they see a list, they'll make that list into an array.
Another asterisk about my proposed implementation, above, is that it turns any non-array, non-string `Iterable` into a list, though it might have originally been a tuple or something else. Maybe tuple is a better choice, but we just need some standard Python concrete Sequence as a stand-in for any such thing we receive, as long as NumPy acts on it the same way as it would with the original collection type.
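A sketch of this inverted approach, with a hypothetical `convert` callable and a `FakeArray` class standing in for Awkward's real conversion machinery and array types:

```python
from collections.abc import Iterable


def to_rectilinear_sketch(arg, convert):
    """Recursively hand array-convertible leaves to `convert`; turn every
    other non-string iterable into a plain list as a stand-in container;
    pass scalars and strings through untouched."""
    if isinstance(arg, (str, bytes)):
        return arg
    elif hasattr(arg, "__array__"):  # array-convertible: convert the leaf
        return convert(arg)
    elif isinstance(arg, Iterable):
        return [to_rectilinear_sketch(x, convert) for x in arg]
    else:
        return arg


class FakeArray:
    """Hypothetical stand-in for an Awkward Array."""

    def __array__(self):
        raise NotImplementedError
```

As described above, the container type is deliberately *not* preserved: every non-array iterable comes out as a list, and NumPy decides what to do with it.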
src/awkward/_connect/numpy.py (outdated):

```python
ak.highlevel.Array,
ak.highlevel.Record,
ak.record.Record,
ak.contents.Content,
```
Should this include Records (both types)? Records aren't like NumPy arrays; they're like scalars of a structured array.
I'm replicating the existing logic; I haven't given this particular part a great deal of thought. I would expect, though, that we want to maintain the idea that record arrays map to structured arrays, which are arguments that NumPy `__array_function__` overloads do support.
I think I was overly concerned in the PR description. The safest thing to do (in my view), which is what I've opted for, is to rebuild the argument list and replace any Awkward arrays with their NumPy counterparts. IIRC we need to preserve the tuple vs list separation; some functions distinguish between the two. My implementation is non-recursive; we assume at most one level of nesting.

I think just recursing a single time, and handling only lists and tuples, is a reasonable scope for supporting "unsupported" NumPy functions. The most robust solution would probably be to formally define the array function overloads for all NumPy functions that we can support trivially. This would entail some small shims that manually, explicitly, perform the argument conversion.
The thing that I'm worried about is

```python
class MyAwesomeListType:
    def __init__(self, data):
        self._data = data

    def __len__(self):
        return len(self._data)

    def __getitem__(self, where):
        return self._data[where]


np.stack(MyAwesomeListType([ak.Array([1, 2, 3]), ak.Array([1.1, 2.2, 3.3])]))
```

causing some slow conversion of the two arrays.

The recursion wasn't the essential part. I doubt there are any NumPy functions whose signatures expect non-array collections of non-array collections of arrays. I know about some one-level deep functions (like `np.concatenate`), but none deeper.
I think this is vanishingly unlikely, but I completely agree with the idea that we should consider whether inverting the logic is a better solution. For context, the existing code handles only tuples of Awkward arrays.

My suggestion is that we don't have to make this case work. In short: NumPy might support the above, but there's no reason that we have to. How would you feel about just making this a documentation-level warning?
Yes, this is an improvement (previously, only one non-array type was okay); I was just wondering if we could take it further, all the way (so that any non-array type is okay). And my intention is not to actually handle every case, but to pass it off to NumPy to decide what to do.
Doesn't it, though? In general, the way we define correct behavior is that

```python
f(ak.Array(np.array(x)), ak.Array(np.array(y)), ...)
```

is equal to

```python
f(np.array(x), np.array(y), ...)
```

and that it has the same performance characteristics (doesn't replace a vectorized array operation with Pythonic iteration). That should hold for arbitrary containers of arrays, too.

Moreover, the work isn't on our side to implement all of these cases; we can pass off most of the work to the particular NumPy functions. The one thing that we can do that NumPy can't is convert our special arrays into plain nplike arrays, in whatever non-array collection they might be hiding. That's why I suggested this. However, handling both tuples and lists is better than handling just tuples, so I can accept the PR as-is.
I agree with this sentiment, and my concern is whether we can actually do this unless we make assumptions (e.g. about which sequence types are containers of arrays rather than arrays themselves). Moreover, in some cases, NumPy does care about the types, so a custom sequence that becomes a list or a tuple might change the behaviour w.r.t. NumPy. I don't know; this is guesswork on my part.

What I mean is that the `__array_function__` API itself is just a dispatcher. It doesn't define the semantics for how it should handle arguments, besides stating that optional arguments can be omitted. The NEP does set out how NumPy can programmatically demonstrate which arguments are array-like, using a dispatcher that extracts the relevant arguments.

Based upon the NEP, I feel confident in saying that we could choose to only implement a subset of types. If we can't predict whether an argument needs list semantics or tuple semantics, and we can't re-create custom sequences, then we can't safely convert custom sequences containing Awkward types. Maybe if we encounter a sequence that is neither a list nor a tuple, we should either raise an error or pass it through untouched.

An independent point is whether we should recurse further, which I haven't spoken to strongly yet.
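For context, the NEP 18 hook itself is tiny: NumPy's dispatcher collects the implementing types and calls `__array_function__` with the untouched arguments, but the protocol says nothing about list vs tuple semantics inside `args`. A minimal sketch, where `my_func` is a hypothetical stand-in and the dispatch call is simulated by hand rather than going through NumPy:

```python
class MyArray:
    # NEP 18 hook: called with the overloaded function, the set of types
    # that implement the protocol, and the original args/kwargs untouched.
    def __array_function__(self, func, types, args, kwargs):
        if func is my_func:
            return "handled by MyArray"
        return NotImplemented  # let NumPy try other implementations


def my_func(x):
    """Hypothetical stand-in for a dispatched NumPy function."""
    return x


arr = MyArray()
# Simulate what NumPy's dispatcher would do for my_func(arr):
result = arr.__array_function__(my_func, {MyArray}, (arr,), {})
```

Because `args` arrives untouched, the burden of regularizing containers of Awkward arrays falls entirely on the implementer, which is exactly the problem this PR addresses.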
(Force-pushed from 948c07d to 94fd9ad.)
I'm not assuming that, which is why I suggested that our argument-regularization converts every non-array iterable into a concrete list. I was assuming that NumPy doesn't care about the distinction between a list and a tuple. We don't have to implement all features of the overloaded NumPy functions, but they shouldn't fail silently. Your suggestion of being more up-front about unrecognized patterns (raising errors) is a good one.
I'm following you!
How do you feel about lists? Are we happy to try and iterate through them in search of Awkward arrays?

Custom sequences are trickier. The problem is that we don't know whether the called function expects an array-like or tuple-like argument, and IIRC some functions test for lists and warn in such cases (i.e. where an API changed). I wish I could recall the function. So, I don't think we can have a safe rule in such a case that is always predictable. I'd prefer to just error loudly if we encounter a sequence type that isn't a list or a tuple.
(Force-pushed from 94fd9ad to 8ec84fc.)
It would be pretty common to be given a list of arrays as an argument, e.g.

```python
np.stack([ak.Array([1, 2, 3]), ak.Array([1.1, 2.2, 3.3])])
```

and it would be much better to iterate through the list, turning any Awkward Arrays into NumPy arrays. (The above applies to any subclasses of `list` and `tuple` as well.)

If it's an unrecognized non-string Sequence (or even Iterable) without an `__array__` method, you might pass it on untouched. If it's an unrecognized object with an `__array__` method, you might convert it directly.

Alternatively, you can check a whitelist of known-to-be-okay container types and raise an exception for anything unrecognized. That makes fewer assumptions about what NumPy will do with what we give it. While it handles fewer cases than NumPy likely does, the cases it doesn't handle are noisy exceptions, rather than silent mistakes (including performance mistakes), and that's good. A user might someday point out a NumPy function that can't be executed because of this (dicts?), but then we can just add it at that time. This is a safe option.

I'd be happy with either one.
OK, so we're talking about (in the allow-list sense) supporting lists and tuples (and their subclasses), converting the Awkward arrays inside them, and raising an exception for any other sequence type.

That seems like the safest approach. It's more restrictive than NumPy, but I feel that we're allowed to make those kinds of decisions given that NEP 18 is fairly permissive. There may be some exceptions where we need to support these types (I'm not aware of any), but we can extend the manual overloads in those cases. I'm most in favour of this solution, as it's easier to reason about :)
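A sketch of the allow-list option being agreed on here: recurse into lists and tuples (preserving which one we were given), convert Awkward leaves, pass scalars and strings through, and raise loudly for any other iterable. The `convert` and `is_awkward` callables are hypothetical stand-ins for Awkward's real machinery:

```python
from collections.abc import Iterable


def to_rectilinear_strict(arg, convert, is_awkward):
    """Allow-list regularization: only lists and tuples are treated as
    containers of arrays; anything else iterable is an error."""
    if is_awkward(arg):
        return convert(arg)
    elif isinstance(arg, (list, tuple)):
        # Rebuild with the same concrete type, preserving list vs tuple.
        return type(arg)(
            to_rectilinear_strict(x, convert, is_awkward) for x in arg
        )
    elif isinstance(arg, (str, bytes)) or not isinstance(arg, Iterable):
        return arg  # scalars, strings, and other non-iterables pass through
    else:
        raise TypeError(f"unsupported sequence type: {type(arg).__name__}")


class FakeAwkward:
    """Hypothetical stand-in for an Awkward Array."""
```

The `TypeError` branch is the "noisy exception" trade-off discussed above: a set or custom sequence fails loudly instead of being silently mishandled.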
```python
raise ak._errors.wrap_error(
    TypeError("to_rectilinear argument must be iterable")
)
return ak.operations.ak_to_numpy.to_numpy(array, *args, **kwargs)
```
With this change, `to_rectilinear` is just a dispatcher for `to_numpy` etc. In the near future, we should remove `to_rectilinear` altogether, and replace the `ak._util.to_arraylib` mechanism with an equivalent `ak._util.to_backend_array`. I suggest `to_backend_array` because this function is responsible for converting the Awkward layout types to an array, so it operates at the layout level, for which "backend" is a better abstraction than "nplike".
@jpivarski any additional comments before I merge this? :)
Re-reading the code, it looks like this PR extends our special handling of collection-of-array arguments in `__array_function__`. I thought I'd try it (not in the PR; I'm checking the old behavior) with `np.stack`:

```python
>>> np.stack([np.array([1, 2, 3]), np.array([1.1, 2.2, 3.3])])
array([[1. , 2. , 3. ],
       [1.1, 2.2, 3.3]])
>>> np.stack((ak.Array([1, 2, 3]), ak.Array([1.1, 2.2, 3.3])))
<Array [[1, 2, 3], [1.1, 2.2, 3.3]] type='2 * 3 * float64'>
>>> np.stack([ak.Array([1, 2, 3]), ak.Array([1.1, 2.2, 3.3])])
<Array [[1, 2, 3], [1.1, 2.2, 3.3]] type='2 * 3 * float64'>
```

although I think what's happening here is that the list case is unnecessarily being turned into a single array first. Except maybe not. (The documentation of that function says that its argument must be a tuple.) Well, I'm going to get out of the weeds on this one. This PR preserves more information from the original arguments than the old behavior did. So yes, this is an improvement and we should merge it. Cases of other collection types beyond tuple and list are pretty rare.
Right, and notably it changes this function to be recursive, so long lists of lists will incur a penalty. I've decided that's negligible because list-of-lists is not a high-performance data type as far as Python's concerned ;)
If there's a long list, then somebody has to iterate over it, even if that iteration happens inside Awkward's own conversion.
This fixes #1318.
It's not clear to me whether NumPy has a specification for the semantic meaning of its argument types. Nearly always, a tuple indicates a "collection of arrays", whereas lists are themselves often "array-like", but seemingly this isn't formalised anywhere. To err on the side of caution, the conversion logic in this PR now treats both cases as "may contain an array", and produces a list/tuple of the nplike's raw arrays in such a case. Other types, e.g. scalars, are not converted.
This logic now relaxes the requirement that each array have the same nplike. I think this is OK — now it's NumPy / CuPy / JAX's responsibility to deal with such a case; Awkward just converts the arrays to their underlying array-library form.
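A sketch of that relaxed rule, with a hypothetical duck-typed check (`layout` attribute) and a `to_raw` callable standing in for Awkward's per-backend conversion; mixed-backend incompatibilities are left for the called function to report:

```python
def regularize_args(args, to_raw):
    """Convert each Awkward-like element with `to_raw`; leave scalars and
    other objects untouched. No same-backend check here: if the converted
    arrays don't mix, the NumPy/CuPy/JAX function raises, not us."""
    return tuple(
        to_raw(a) if hasattr(a, "layout") else a  # duck-typed "Awkward-like"
        for a in args
    )


class FakeAkArray:
    """Hypothetical stand-in: a `layout` attribute marks Awkward arrays."""

    layout = object()
```

Each element is converted independently, which is what allows a single argument tuple to carry arrays from different backends.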