
fix: support scalars in tuple (and list) arguments provided to __array_function__ #2045

Merged (7 commits) on Jan 3, 2023

Conversation

agoose77
Collaborator

@agoose77 agoose77 commented Dec 30, 2022

This fixes #1318.

It's not clear to me whether NumPy has a specification for the semantic meaning of its argument types. Nearly always, a tuple indicates a "collection of arrays", whereas lists are themselves often "array-like", but seemingly this isn't formalised anywhere. To err on the side of caution, the conversion logic in this PR now treats both cases as "may contain an array", and produces a list/tuple of the nplike's raw arrays in such a case. Other types, e.g. scalars, are not converted.

This logic now relaxes the requirement that each array have the same nplike. I think this is OK: now it's NumPy / CuPy / JAX's responsibility to deal with such a case; Awkward just converts the arrays to their underlying array-library form.
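As a rough illustration of the conversion described above, here is a minimal, self-contained sketch; `FakeArray`, `_convert`, and `convert_argument` are hypothetical stand-ins for Awkward's internals, not the PR's actual code:

```python
# Hypothetical stand-in for an Awkward Array wrapping a raw nplike buffer.
class FakeArray:
    def __init__(self, raw):
        self.raw = raw

def _convert(x):
    # Arrays become the nplike's raw array; scalars etc. pass through.
    return x.raw if isinstance(x, FakeArray) else x

def convert_argument(arg):
    if isinstance(arg, (tuple, list)):
        # "may contain an array": rebuild, preserving tuple-vs-list
        return type(arg)(_convert(x) for x in arg)
    return _convert(arg)

print(convert_argument((FakeArray([1, 2, 3]), 0)))  # ([1, 2, 3], 0)
```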

@codecov

codecov bot commented Dec 30, 2022

Codecov Report

Merging #2045 (9f1fb56) into main (3220532) will increase coverage by 0.03%.
The diff coverage is 78.57%.

Additional details and impacted files
Impacted Files Coverage Δ
src/awkward/_nplikes.py 64.24% <50.00%> (+0.71%) ⬆️
src/awkward/_connect/numpy.py 67.12% <90.00%> (+1.25%) ⬆️

Member

@jpivarski jpivarski left a comment


Since __array_function__ overloads functions with a variety of different signatures, are you sure it's a good idea to regularize the types centrally?

This is for the fallback case in which we haven't defined an overload for a given function (implemented.get(func) is None), and need to turn any Awkward Arrays into NumPy arrays, so that the NumPy function can work with it.

Before this PR, tuples were recognized as non-array sequences; with this PR, lists are now treated the same way as tuples. You're also wrestling with what should be done here:

Nearly-always a tuple indicates a "collection of arays", whereas lists are themselves often "array-like", but seemingly this isn't formalised anywhere.

Maybe instead of trying to enumerate all of the Sequence (Iterable?) types that aren't arrays, we should try to enumerate all of the Sequence types that are arrays. Maybe _to_rectilinear should be:

from collections.abc import Iterable

import numpy as np
import awkward as ak

def _to_rectilinear(arg):
    if isinstance(arg, np.ndarray):
        return arg
    elif isinstance(arg, (ak.highlevel.Array, ak.contents.Content)):
        nplike = ak._nplikes.nplike_of(arg)
        return nplike.to_rectilinear(arg)
    elif isinstance(arg, (str, bytes)):
        return arg
    elif isinstance(arg, Iterable):
        return [_to_rectilinear(x) for x in arg]
    else:
        return arg

Since this is recursive, I suppose the Record types should be included in the second predicate.

The first predicate would have to include array types for the other arrays that we support, cp.ndarray, jax.DeviceArray, by checking type(arg).__module__ names so that this function doesn't import those modules. Or better yet, by checking for hasattr(arg, "__array__") to accept anything that NumPy would consider array-convertible. (That includes ak.highlevel.Array and ak.contents.Content, by the way.)

The idea of inverting the check, essentially turning a whitelist into a blacklist (or vice-versa: I don't know which word applies here because neither is excluding types, just converting them), is to pass the problem of deciding whether a list is a collection of arrays or one big array to NumPy. That decision depends on the function. For instance, np.concatenate (if we didn't have an overload for it) expects to consume an iterable collection of arrays: making that one array before np.concatenate decides what to do with it would be a mistake. Other functions expect to consume a single array, and if they see a list, they'll make that list into an array.
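For example, whether a list means "collection of arrays" or "one big array" really does depend on the function, and NumPy resolves it itself:

```python
import numpy as np

parts = [np.array([1, 2]), np.array([3, 4])]

# np.concatenate treats the list as a collection of separate arrays...
print(np.concatenate(parts))          # [1 2 3 4]

# ...while a ufunc expecting a single array treats a nested list as
# one array-like argument and converts it itself.
print(np.negative([[1, 2], [3, 4]]))  # [[-1 -2] [-3 -4]]
```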

Another asterisk about my proposed implementation, above, is that it turns any non-array, non-string Iterable into a list, though it might have originally been a tuple or something else. Maybe tuple is a better choice, but we just need some standard Python concrete Sequence as a stand-in for any such thing we receive, as long as NumPy acts on it the same way as it would with the original collection type.

Comment on lines 31 to 34
ak.highlevel.Array,
ak.highlevel.Record,
ak.record.Record,
ak.contents.Content,
Member


Should this include Records (both types)? Records aren't like NumPy arrays; they're like scalars of a structured array.

Collaborator Author


I'm replicating the existing logic; I haven't given this particular part a great deal of thought. I would expect, though, that we want to maintain the idea that record arrays map to structured arrays, which are arguments that NumPy __array_function__ overloads do support.
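For reference, the NumPy counterpart being mapped to here is a structured array, where indexing a single element yields a scalar-like record:

```python
import numpy as np

# A structured array: each element is a record with named fields.
rec = np.array([(1, 2.0), (3, 4.0)], dtype=[("x", "i8"), ("y", "f8")])

print(rec["x"])  # [1 3]        -- a column of field values
print(rec[0])    # one record, which behaves like a scalar of the array
```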

@agoose77
Collaborator Author

agoose77 commented Dec 30, 2022

I think I was overly concerned in the PR description. The safest thing to do (in my view), which is what I've opted for, is to rebuild the argument list and replace any ak.Array objects with their rectilinear counterparts.

IIRC we need to preserve the tuple vs list separation; some functions distinguish between the two.

My implementation is non-recursive; we assume that an ak.Array can be nested at most one level deep, inside a tuple or list in the argument list. I think this is a pretty reasonable assumption; off the top of my head, I can't think of any array functions for which it is not true. Perhaps a more rigorous implementation would recursively visit the argument list and treat known array objects as leaves, but that brings performance and complexity concerns.

I think just recursing a single time, and handling only lists and tuples is a reasonable scope for supporting "unsupported" NumPy functions. The most robust solution would probably be to formally define the array function overloads for all NumPy functions that we can support trivially. This would entail some small shims that manually, explicitly, perform the argument conversion.

@jpivarski
Member

The thing that I'm worried about is

class MyAwesomeListType:
    def __init__(self, data):
        self._data = data
    def __len__(self):
        return len(self._data)
    def __getitem__(self, where):
        return self._data[where]

np.stack(MyAwesomeListType([ak.Array([1, 2, 3]), ak.Array([1.1, 2.2, 3.3])]))

causing some slow conversion of the two ak.Arrays into a single NumPy array when a function like np.stack really wants separate arrays after all. In the above, MyAwesomeListType is neither a list nor a tuple; it's something we didn't know we needed to check for. That's why I think we want to invert the whitelisty-blacklistiness of it, to make arrays of the things we know we can (inexpensively) make arrays out of and leave anything else as a Python object. (The prototype implementation I wrote above would turn MyAwesomeListType into a list, but that would be fine. Maybe tuple would be safer/less likely to be interpreted by NumPy as something to unnecessarily convert to an array.)

The recursion wasn't the essential part. I doubt there are any NumPy functions whose signatures expect non-array collections of non-array collections of arrays. I know about some one-level deep functions (like np.concatenate and np.stack), but not any deeper. The implementation doesn't have to be recursive; it was just the easiest way to deal with the two levels we know we need to deal with in the same helper function.

@agoose77
Collaborator Author

The thing that I'm worried about is

I think this is vanishingly unlikely, but I completely agree with the idea that we should consider whether inverting the logic is a better solution.

For context, the existing code handles only tuples of ak.Array specially; every other argument is passed directly to nplike.to_rectilinear. This means we don't support scalars, but it also means that your example would probably fail much as it would in this PR: here, NumPy would try to coerce the argument to an array, whereas in main I believe to_rectilinear would do the same thing (and fail for ragged arrays).

My suggestion is that we don't have to make np.some_function work for every possible set of inputs, especially when we want to implement a general solution that doesn't know the details of the individual NumPy functions being overloaded. I would be fine with our documentation explaining that the np. functions may work, but subject to certain constraints. This is already the case (hence this PR), but I'm not sure whether we have a disclaimer anywhere.

In short: NumPy might support the above, but there's no reason that we have to; __array_function__ is not required to behave exactly as the numpy.XXX equivalent (in my understanding, anyway).

How would you feel about just making this a documentation-level warning?

@jpivarski
Member

For context, the existing code handles only tuples of ak.Array specially; every other argument is passed directly to nplike.to_rectilinear.

Yes, this is an improvement (previously, only one non-array type was okay); I was just wondering if we could take it further/all the way (so that any non-array type is okay). And my intention is not to actually handle every case, but to pass it off on NumPy to decide what to do.

NumPy might support the above, but there's no reason that we have to; __array_function__ is not required to behave exactly as the numpy.XXX equivalent (in my understanding, anyway).

Doesn't it, though? In general, the way we define correct behavior for ak.overloaded_numpy_function is to ask, "Does this do what NumPy would do if rectilinear ak.Arrays are replaced by np.ndarrays?" and "Are the non-rectilinear extensions reasonable?" Here, we're overloading arbitrary NumPy functions for only the rectilinear ak.Arrays. The touchstone for correctness is whether

f(ak.Array(np.array(x)), ak.Array(np.array(y)), ...)

is equal to

f(np.array(x), np.array(y), ...)

and that it has the same performance characteristics (doesn't replace a vectorized array operation with Pythonic iteration).

The arbitrary f might pack some of these arrays into tuples, lists, and MyAwesomeListType (which seems unlikely on the face of it, but we can be surprised by other people's concerns—maybe they need to add metadata to their lists...1). It probably doesn't pack them more than one level deep, and I doubt any NumPy functions care whether the non-array collections are tuples, lists, or MyAwesomeListType, so they can all turn into lists (or tuples).

Moreover, the work isn't on our side to implement all of these cases; we can pass off most of the work to the particular NumPy functions. The one thing that we can do that NumPy can't is convert our special arrays into plain nplike arrays, in whatever non-array collection they might be hiding.

That's why I suggested this. However, handling both tuples and lists is better than handling just tuples, so I can accept the PR as-is.

Footnotes

  1. I've replaced built-in types with custom types to debug memory leaks with Pympler, since it summarizes memory use by object type. If I suspect that a million dicts are being created and not destroyed by a particular line of code, I'll replace dict with a NoticeMeeeee MutableMapping, so that it will be separated from all of the other dicts in Pympler's table. If such a substitution caused the code to break or act differently, that would be hard to figure out. Well, this case is not an example of what we're looking for because I would very likely make NoticeMeeeee a subclass of dict and isinstance would work, but it gets close to the kind of surprise I'm thinking of. You know, Uproot's STLVector is a non-list Sequence. If an Uproot user gets an std::vector<std::vector<int>> from ROOT, they might be in this situation.

@agoose77
Collaborator Author

agoose77 commented Dec 30, 2022

Doesn't it, though? In general, the way we define correct behavior for ak.overloaded_numpy_function is to ask, "Does this do what NumPy would do if rectilinear ak.Arrays are replaced by np.ndarrays?" and "Are the non-rectilinear extensions reasonable?" Here, we're overloading arbitrary NumPy functions for only the rectilinear ak.Arrays.

I agree with this sentiment, and my concern is whether we can actually do this. Unless we make assumptions (e.g. that SequenceLike(list_of_values) is a valid constructor), we can't support custom types without potentially changing the result with respect to calling the NumPy function directly with NumPy arrays; there's nothing stopping one NumPy function from treating a list-like object as a "list of arrays", and another from treating the same argument as a "single n-dim array". We can't tell without looking at the particular function in question.

Moreover, in some cases, NumPy does care about the types,1 so a custom sequence that becomes a list or a tuple might change the behaviour w.r.t. NumPy. I don't know; this is guesswork on my part.

What I mean by

__array_function__ is not required to behave exactly as the numpy.XXX equivalent (in my understanding, anyway).

is that the API itself is just a dispatcher. It doesn't define the semantics for how it should handle arguments, besides stating that optional arguments can be omitted. The NEP does set out how NumPy can programmatically demonstrate which arguments are array-like, using a dispatcher that generates the types list, but we can't use that to go in the reverse direction. My impression of the array API is that it's not really designed for generic forwarding dispatch, but rather for hand-rolled overloads.
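To illustrate the "just a dispatcher" point, the NEP 18 protocol hands the overriding class the function object and leaves all semantics to it. A toy example (not Awkward's implementation):

```python
import numpy as np

class Wrapped:
    """Toy NEP 18 participant: the dispatcher hands us the function."""
    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        # We decide per-function what the arguments mean; the protocol
        # itself says nothing about argument semantics.
        if func is np.sum:
            return int(self.data.sum())
        return NotImplemented

print(np.sum(Wrapped([1, 2, 3])))  # 6
```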

Based upon the NEP, I feel confident in saying that we could choose to only implement a subset of types. If we can't predict whether an argument needs list semantics or tuple semantics, and we can't re-create custom sequences, then we can't safely convert custom sequences containing Awkward types.1 Maybe if we encounter a sequence that is neither a list nor a tuple, we should either:

  1. check for Awkward Arrays. It's only if we find one that we would hit this problem anyway. Then we could raise an exception, and require a manual overload to fix it. This wouldn't handle the non-sequence aspects, e.g. if a custom dict-like object had Awkward Arrays as values, but I'm pretty sure no NumPy functions expect dict-like objects.
  2. just throw an exception

An independent point is whether we should recurse further, which I haven't spoken to strongly, yet.

Footnotes

  1. Again, I believe list-vs-tuple matters for NumPy functions, but if not then the point becomes less relevant.

@agoose77 agoose77 force-pushed the agoose77/fix-array-function-arguments branch from 948c07d to 94fd9ad on December 30, 2022 21:07
@jpivarski
Member

Unless we make assumptions (e.g. SequenceLike(list_of_values) is a valid constructor)

I'm not assuming that, which is why I suggested that our argument-regularization converts SequenceLike into list (or maybe tuple). In other words, don't try to reconstruct the SequenceLike.

I was assuming that NumPy doesn't care about the distinction between a SequenceLike and a built-in concrete Sequence. (That's where I have a slight hesitation about turning all of the SequenceLikes into lists; if NumPy cares about list-vs-tuple anywhere, it would be to assume that list might "want" to be an array, whereas tuple doesn't. Turning all of the unrecognized SequenceLikes into tuples seems slightly safer, more hands-off.)

We don't have to implement all features of the overloaded NumPy functions, but they shouldn't fail silently. Your suggestion of being more up-front about unrecognized patterns (raising errors) is a good one.

@agoose77
Collaborator Author

Unless we make assumptions (e.g. SequenceLike(list_of_values) is a valid constructor)

I'm not assuming that, which is why I suggested that our argument-regularization converts SequenceLike into list (or maybe tuple). In other words, don't try to reconstruct the SequenceLike.

I'm following you!

I was assuming that NumPy doesn't care about the distinction between a SequenceLike and a built-in concrete Sequence. (That's where I have a slight hesitation about turning all of the SequenceLikes into lists; if NumPy cares about list-vs-tuple anywhere, it would be to assume that list might "want" to be an array, whereas tuple doesn't. Turning all of the unrecognized SequenceLikes into tuples seems slightly safer, more hands-off.)
We don't have to implement all features of the overloaded NumPy functions, but they shouldn't fail silently. Your suggestion of being more up-front about unrecognized patterns (raising errors) is a good one.

How do you feel about lists? Are we happy to try to iterate through them in search of ak.Array? I was thinking no before, but actually, lists are not high-performance data types; it's not unreasonable that a big list passed to a NumPy function will be visited at least once. If we're happy to do this, then we can drop the depth limit of the conversion and visit all true leaves of sequences (once we resolve custom sequence handling).

Custom sequences are trickier. The problem is that we don't know whether the called function expects an array-like or tuple-like argument, and IIRC some functions test for lists and warn in such cases (i.e., where an API changed); I wish I could recall which function. So I don't think we can have a safe rule here that is always predictable. I'd prefer to error loudly if we encounter a sequence type that isn't a list, tuple, str, or bytes. Does that seem too restrictive?

@agoose77 agoose77 force-pushed the agoose77/fix-array-function-arguments branch from 94fd9ad to 8ec84fc on December 30, 2022 22:08
@jpivarski
Member

It would be pretty common to be given a list of arrays as an argument, e.g.

np.stack([ak.Array([1, 2, 3]), ak.Array([1.1, 2.2, 3.3])])

and it would be much better to iterate through the list, turning any ak.Arrays you find there into np.ndarrays, which is what this PR does (at my first reading of it). Turning that list into an array would be bad, and I think that's what was happening before this PR.

(The above applies to any subclasses of list, too. Same for tuples.)

If it's an unrecognized non-string Sequence (or even Iterable) without an __array__ method, I'd say you can iterate over it, turning any ak.Arrays into np.ndarrays, and give NumPy a list of the results. We lost information about what kind of Iterable it was, but I'm guessing/assuming/asserting that NumPy doesn't care.

If it's an unrecognized object with an __array__ method, pass it on to NumPy. NumPy will call the __array__ method.

Alternatively, you can check a whitelist of known-to-be-okay container types and raise an exception for anything unrecognized. That makes fewer assumptions about what NumPy will do with what we give it. While it handles fewer cases than NumPy likely does, the cases it doesn't handle are noisy exceptions, rather than silent mistakes (including performance mistakes), and that's good. A user might someday point out a NumPy function that can't be executed because of this (dicts?), but then we can just add it at that time. This is a safe option.

I'd be happy with either one.

@agoose77
Collaborator Author

OK, so we're talking about (in the allow-list sense)

  • recursively visit lists, tuples
  • prohibit non-tuple/list iterables

That seems like the safest approach. It's more restrictive than NumPy, but I feel that we're allowed to make those kinds of decisions given NEP18 is fairly permissive. There may be some exceptions where we need to support these types (I'm not aware of any), but we can extend the manual overloads in these cases.

I'm most in favour of this solution, as it's easier to reason about :)
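The agreed approach might look roughly like this; a sketch only, with `FakeArray` as a hypothetical stand-in for Awkward's array types, not the merged code:

```python
from collections.abc import Iterable

class FakeArray:  # hypothetical stand-in for ak.Array
    def __init__(self, raw):
        self.raw = raw

def to_rectilinear(arg):
    if isinstance(arg, FakeArray):
        return arg.raw
    elif isinstance(arg, (list, tuple)):
        # Recursively visit lists and tuples, preserving their type.
        return type(arg)(to_rectilinear(x) for x in arg)
    elif isinstance(arg, Iterable) and not isinstance(arg, (str, bytes)):
        # Fail loudly for unrecognized iterables rather than guessing
        # whether NumPy wants an array or a collection of arrays.
        raise TypeError(f"unsupported iterable type: {type(arg).__name__}")
    else:
        return arg
```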

raise ak._errors.wrap_error(
TypeError("to_rectilinear argument must be iterable")
)
return ak.operations.ak_to_numpy.to_numpy(array, *args, **kwargs)
Collaborator Author


With this change, to_rectilinear is just a dispatcher for to_numpy etc. In the near future, we should remove to_rectilinear altogether, and replace the ak._util.to_arraylib mechanism with an equivalent ak._util.to_backend_array.

I suggest to_backend_array because this function is responsible for converting the Awkward layout types to an array, so it operates at the layout level, for which "backend" is a better abstraction than "nplike".

@agoose77
Collaborator Author

agoose77 commented Jan 3, 2023

@jpivarski any additional comments before I merge this? :)

@jpivarski
Member

Re-reading the code, it looks like this PR extends our special handling of collection-of-array arguments in __array_function__ overloads from "collection must be tuple" to "collection must be tuple or list," which is definitely an improvement.

I thought I'd try it (not in the PR; I'm checking the old behavior) with np.stack, but it appears to work for both tuples and lists:

>>> np.stack([np.array([1, 2, 3]), np.array([1.1, 2.2, 3.3])])
array([[1. , 2. , 3. ],
       [1.1, 2.2, 3.3]])

>>> np.stack((ak.Array([1, 2, 3]), ak.Array([1.1, 2.2, 3.3])))
<Array [[1, 2, 3], [1.1, 2.2, 3.3]] type='2 * 3 * float64'>

>>> np.stack([ak.Array([1, 2, 3]), ak.Array([1.1, 2.2, 3.3])])
<Array [[1, 2, 3], [1.1, 2.2, 3.3]] type='2 * 3 * float64'>

although I think what's happening here is that the list case is unnecessarily being turned into a single ak.Array, which np.stack unpacks into two arrays in the list case but not the tuple case. It would therefore differ in performance, but not the final result, so I can't see the difference from an example. I guess it's a theorem that this is always true: when a NumPy function expects a collection of arrays and gets a single Awkward Array, it treats that Awkward Array as a collection and extracts the items that it needs via __getitem__ or __iter__. Thus, it would always be "just a performance thing."

Except maybe np.ravel_multi_index, which motivated this in issue #1318?

(The documentation of that function says that its argument must be a tuple.)


Well, I'm going to get out of the weeds on this one. This PR preserves more information from the ak.Array.__array_function__ arguments to the NumPy function's arguments: if we get a tuple or list of arrays, we give NumPy a tuple or list of arrays, and let NumPy decide what to do with it.

So yes, this is an improvement and we should merge it. Cases of other collection types beyond tuple and list are pretty rare.

@agoose77
Collaborator Author

agoose77 commented Jan 3, 2023

Right, and notably it changes this function to be recursive, so long lists of lists will incur a penalty. I've decided that's negligible, because a list-of-lists is not a high-performance data type as far as Python's concerned ;)

@jpivarski
Member

If there's a long list, then somebody has to iterate over it, even if that is Awkward's from_iter or NumPy's fromiter to make an array. So, no loss.

@agoose77 agoose77 merged commit 47dc662 into main Jan 3, 2023
@agoose77 agoose77 deleted the agoose77/fix-array-function-arguments branch January 3, 2023 18:28
Successfully merging this pull request may close these issues.

to_rectilinear should support scalars