I have an Awkward Array of ragged lists of records like this:

```python
arr = ak.Array([
    [{"locus": "TRA", "junction_aa": "CADASGT..."}, {"locus": "TRB", "junction_aa": "CTFDD..."}],
    [{"locus": "IGH", "junction_aa": "CDGFFA..."}],
    [],
    [{"locus": "IGH", "junction_aa": "CDGFFA..."}],
    ...
])
```

I want to apply a function to each of the records, and I'm currently doing this naively in a double for-loop:

```python
for row in arr:
    for record in row:
        f(record)
```

which is not exactly fast. In fact, it is even faster to do

```python
for row in ak.to_list(arr):
    for record in row:
        f(record)
```

Is there a better way of doing this? (I guess the answer is "it depends" and there might be specific solutions using …
Replies: 1 comment
You are right: imperative iteration over Awkward Arrays is considerably slower than iteration over Python builtin types, which is itself considerably slower than a vectorized computation. ("Considerably slower" means orders of magnitude, somewhere between 10× and 1000×.)

We have specialized, streamlined paths for simple …

BEGIN details:

One thing that you noticed, that

```python
for row in ak.to_list(arr):
    for record in row:
        f(record)
```

is faster than it would be without the …

END details

As for what you can do about it, assuming that this process can't be addressed with an array-oriented function: use Numba. This comes with caveats about Numba's supported Python features and Numba's supported NumPy features, but any Awkward Array without union-types can be iterated over in Numba-compiled functions, which run at the speed of C. (That is, considerably faster than the Python loop over builtin types.) For example,

```python
>>> import awkward as ak
>>> import numba as nb
>>> @nb.njit
... def do_something(array):
...     for row in array:
...         for record in row:
...             print(record["locus"])
...
>>> arr = ak.Array([
...     [{"locus": "TRA", "junction_aa": "CADASGT..."}, {"locus": "TRB", "junction_aa": "CTFDD..."}],
...     [{"locus": "IGH", "junction_aa": "CDGFFA..."}],
...     [],
...     [{"locus": "IGH", "junction_aa": "CDGFFA..."}],
... ])
>>> do_something(arr)
TRA
TRB
IGH
IGH
```

The iteration over Awkward Arrays is read-only, but for output you can (1) construct NumPy arrays naively, thanks to implementations in Numba itself, or (2) use `ak.ArrayBuilder` in Numba (see this tutorial, or this one (video)); in addition, (3) @ianna is working on implementing …
We have specialized, streamlined paths for simple `ak.Array.__iter__` to make iteration as fast as possible, but it's a function-call stack several deep, terminating on `Content._getitem_at`, and that can't compete with the short path that Python builtins take to `PyIter_Next` and `PyList_GetItem`.

BEGIN details:

`ak.Array.__iter__` has a specialized path for NumpyArray (here), in the hope that `np.ndarray.__iter__` is a fast impleme…