iterations over awkward arrays #2392

grst · 2023-04-12T13:40:06Z

I have an Awkward Array of ragged lists of records, like this:

import awkward as ak

arr = ak.Array([
    [{"locus": "TRA", "junction_aa": "CADASGT..."}, {"locus": "TRB", "junction_aa": "CTFDD..."}],
    [{"locus": "IGH", "junction_aa": "CDGFFA..."}],
    [],
    [{"locus": "IGH", "junction_aa": "CDGFFA..."}],
    # ... more rows like these
])

I want to apply a function to each of the Records, and I'm currently doing this naively in a double for-loop:

for row in arr:
    for record in row:
        f(record)

which is not exactly fast. In fact, it is even faster to convert to Python lists first:

for row in ak.to_list(arr):
    for record in row:
        f(record)

Is there a better way of doing this?

(I guess the answer is "it depends", and there might be specific solutions using ak.xxx functions, but for the sake of the question let's assume it's not covered by the builtins and needs a custom Python function instead.)

Answered by jpivarski

jpivarski (Maintainer) · 2023-04-12T15:01:20Z

You are right: imperative iteration over Awkward Arrays is considerably slower than iteration over Python builtin types, which is itself considerably slower than a vectorized computation. ("Considerably slower" means orders of magnitude, somewhere between 10× and 1000×.)

We have specialized, streamlined paths for simple ak.Array.__iter__ to make iteration as fast as possible, but it's a function-call stack several deep, terminating on Content._getitem_at, and that can't compete with the short path that Python builtins take to PyIter_Next and PyList_GetItem.

BEGIN details:

ak.Array.__iter__ has a specialized path for NumpyArray, in the hope that np.ndarray.__iter__ is a fast implementation, but for any non-trivial type it goes to Content.__iter__. Content.__iter__ calls Content._getitem_at for each integer index, which skips a lot of the type-checking and regularization that a generic Content.__getitem__ call would perform, since we already know that the argument is an integer. Each Content subclass then has its own _getitem_at implementation (ListOffsetArray._getitem_at, for example), which generally passes down to the next level until it reaches a NumpyArray.
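
To make the per-element cost concrete, here is a minimal pure-Python sketch of that call pattern. It is not Awkward's actual code: the class ToyListOffsetArray and its plain-list content are hypothetical stand-ins, and only the method name _getitem_at is borrowed from the description above. The point is that every element yielded costs at least one extra Python-level method call plus a slice.

```python
class ToyListOffsetArray:
    """Hypothetical stand-in for a list-of-lists layout: offsets plus flat content."""

    def __init__(self, offsets, content):
        self.offsets = offsets  # offsets[i]:offsets[i + 1] delimits row i
        self.content = content  # flat buffer holding all rows back to back

    def __len__(self):
        return len(self.offsets) - 1

    def _getitem_at(self, i):
        # The caller guarantees i is an in-range integer, so no type checks here.
        start, stop = self.offsets[i], self.offsets[i + 1]
        return self.content[start:stop]

    def __iter__(self):
        # One Python method call per row: this is the overhead being discussed.
        for i in range(len(self)):
            yield self._getitem_at(i)


toy = ToyListOffsetArray([0, 2, 3, 3, 4], ["TRA", "TRB", "IGH", "IGH"])
rows_by_iteration = list(toy)
```

In the real library each level of nesting adds another such call, which is why the stack ends up "several deep".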

One thing that you noticed, that

for row in ak.to_list(arr):
    for record in row:
        f(record)

is faster than the same loop without ak.to_list, is only true because ak.to_list is not itself implemented by iteration. Without any overriding behaviors (Content._to_list_custom), to_list builds the lists in a semi-vectorized way: it first descends to the NumpyArray and calls np.ndarray.tolist, then, as it pops back up the recursion, it subdivides the elements of that list into new lists. In effect, it exploits the knowledge that we want all of the data to get it faster than iteration could.
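
The "convert in bulk, then subdivide" idea can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the library's implementation: to_list_sketch is a hypothetical helper, and the flat buffer is an ordinary sequence here, whereas in Awkward it would be a NumPy array converted once with np.ndarray.tolist.

```python
def to_list_sketch(offsets, flat):
    """Hypothetical helper: build nested lists from offsets and a flat buffer."""
    # Step 1: convert the whole flat buffer in one go (the np.ndarray.tolist step).
    flat_list = list(flat)
    # Step 2: cut that single list into rows using consecutive offset pairs.
    return [
        flat_list[offsets[i]:offsets[i + 1]]
        for i in range(len(offsets) - 1)
    ]


rows = to_list_sketch([0, 2, 3, 3, 4], ("TRA", "TRB", "IGH", "IGH"))
```

All leaf values are converted in a single call, so the per-element work left in Python is only the list slicing.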

END details.

As for what you can do about it, assuming that this process can't be addressed with an array-oriented function: use Numba. This comes with caveats about Numba's supported Python features and Numba's supported NumPy features, but any Awkward Array without union types can be iterated over in Numba-compiled functions, which run at the speed of C (that is, considerably faster than the Python loop over builtin types).

For example,

>>> import awkward as ak
>>> import numba as nb
>>> @nb.njit
... def do_something(array):
...     for row in array:
...         for record in row:
...             print(record["locus"])
... 
>>> arr = ak.Array([
...     [{"locus": "TRA", "junction_aa": "CADASGT..."}, {"locus": "TRB", "junction_aa": "CTFDD..."}],
...     [{"locus": "IGH", "junction_aa": "CDGFFA..."}],
...     [],
...     [{"locus": "IGH", "junction_aa": "CDGFFA..."}],
... ])
>>> do_something(arr)
TRA
TRB
IGH
IGH

Iteration over Awkward Arrays in Numba is read-only, but there are several ways to produce output: (1) construct NumPy arrays natively, thanks to implementations in Numba itself; (2) use ak.ArrayBuilder in Numba (the original post links a tutorial and a video); and (3) @ianna is working on implementing LayoutBuilder in Numba, which will be faster because the type is known up front.
