# 2020-01-22-numba-demo

## 1. This notebook

This demo of Awkward Array was presented on January 22, 2020, before the first stable version (1.0) was released, so some interfaces may have changed since then. To run this notebook, make sure you have version 0.1.87 ([GitHub](https://github.com/scikit-hep/awkward-1.0/releases/tag/0.1.87), [pip](https://pypi.org/project/awkward1/0.1.87/)) by installing

```bash
pip install 'awkward1==0.1.87'
```

before executing it in Jupyter (or include that release number in the Binder URL).

Depending on where you execute this notebook and how you installed or didn't install Awkward Array, you might need the following.

In [2]:
# The base of the GitHub repo is two levels up from this notebook.
import sys
import os
sys.path.insert(0, os.path.join(os.getcwd(), "..", ".."))

## 2. Introduction to Awkward Array

Awkward Array is a library for manipulating data structures with NumPy-like idioms. For a core set of NumPy features—slicing, broadcasting, array-at-a-time operations, and such—it is a strict generalization from rectilinear arrays of numeric data types to unequal-width and heterogeneous lists and nested objects.

The name arose organically: these kinds of arrays are usually awkward to deal with.

### 2.1 Distinction from NumPy object arrays

Although NumPy arrays can contain arbitrary objects with `dtype('O')` type, such arrays can't be sliced or operated on with NumPy's usual idioms because they're really just pointers to pure Python objects.

In [12]:
import numpy as np
import awkward1 as ak

nparray = np.array([[1, 2, 3], [], [4, None, 5], [{"something": 1, "else": [2, 3]}]])
akarray = ak.Array([[1, 2, 3], [], [4, None, 5], [{"something": 1, "else": [2, 3]}]])

In [13]:
# NumPy can't slice into Python objects
nparray[2:, 0]

IndexError: too many indices for array

In [14]:
# Awkward can
akarray[2:, 0]

<Array [4, {something: 1, else: [2, 3]}] type='2 * ?union[int64, {"something": i...'>

In [15]:
# NumPy can't pass ufuncs down to numerical data
np.sin(nparray)

TypeError: loop of ufunc does not support argument 0 of type list which has no callable sin method

In [16]:
# Awkward can
np.sin(akarray)

<Array [[0.841, 0.909, 0.141], ... 0.141]}]] type='4 * var * ?union[float64, {"s...'>

In [17]:
# Here's a little more detail on the above:
ak.tolist(np.sin(akarray))

[[0.8414709848078965, 0.9092974268256817, 0.1411200080598672],
 [],
 [-0.7568024953079282, None, -0.9589242746631385],
 [{'something': 0.8414709848078965,
   'else': [0.9092974268256817, 0.1411200080598672]}]]

Like NumPy (as well as [Apache Arrow](https://arrow.apache.org/) and [XND](https://xnd.io/)), Awkward Array operates on columnar arrays and prefers _O(1)_ views, rather than _O(n)_ computations (where _n_ is the number of elements in the array) wherever possible.

In [18]:
# Columnar structure of the above array
akarray.layout

<ListOffsetArray64>
    <offsets><Index64 i="[0 3 3 6 7]" offset="0" at="0x55f4d7fda0c0"/></offsets>
    <content><IndexedOptionArray64>
        <index><Index64 i="[0 1 2 3 -1 4 5]" offset="0" at="0x55f4d7fde0e0"/></index>
        <content><UnionArray8_64>
            <content index="0">
                <NumpyArray format="l" shape="5" data="1 2 3 4 5" at="0x55f4d7fdc0d0"/>
            </content>
            <content index="1">
                <RecordArray>
                    <field index="0" key="something">
                        <NumpyArray format="l" shape="1" data="1" at="0x55f4d7fe2100"/>
                    </field>
                    <field index="1" key="else">
                        <ListOffsetArray64>
                            <offsets><Index64 i="[0 2]" offset="0" at="0x55f4d7fe4110"/></offsets>
                            <content><NumpyArray format="l" shape="2" data="2 3" at="0x55f4d7fe6120"/></content>
                        </ListOffsetArray64>
                 

A major goal of this project is to use existing standards wherever possible. The columnar layout above is expressed in XML notation simply because it is a readable, standard way to express nesting (and generalizes from Python's convention of representing objects in `<angle brackets>`).
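To make the `offsets`/`content` pairing in the layout concrete, here is a minimal NumPy-only sketch of a jagged array stored as two flat buffers. It is an illustration under simplifying assumptions, not Awkward Array's implementation: the helper `get_list` and the variable names are hypothetical, and only a single level of nesting is shown.

```python
import numpy as np

# The jagged list [[1, 2, 3], [], [4, 5]] stored as two flat buffers:
# all values back to back, plus the boundaries between the lists.
content = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
offsets = np.array([0, 3, 3, 5])

def get_list(offsets, content, i):
    """Hypothetical helper: list i is a view of content, so nothing is copied."""
    return content[offsets[i]:offsets[i + 1]]

first = get_list(offsets, content, 0)   # values 1, 2, 3
empty = get_list(offsets, content, 1)   # the empty list

# Dropping the first list only slices offsets; content is left untouched,
# so the cost does not depend on how many values the lists hold.
tail_offsets = offsets[1:]
tail_last = get_list(tail_offsets, content, 1)   # values 4, 5

# A ufunc is applied once to the flat content; the same offsets
# describe the nested structure of the result.
sin_content = np.sin(content)
sin_first = get_list(offsets, sin_content, 0)
```

The point of the design is that structure (`offsets`) and data (`content`) are separate, so structural operations can be views and numerical operations can run over one contiguous buffer.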

High-level data types are expressed in [Datashape](https://datashape.readthedocs.io/en/latest/) notation

In [19]:
ak.typeof(akarray)

4 * var * ?union[int64, {"something": int64, "else": var * int64}]

with [extensions where necessary](https://github.com/blaze/datashape/issues/237). Similarly, these arrays will be portable to and from Apache Arrow (and other formats, if requested). The idea is that Awkward Array provides **manipulation** capabilities, not **serialization** or **transport**.

### 2.2 Relevance for Numba

Numba, as you know, provides **computation** capabilities in a way that complements NumPy. Whereas NumPy requires array-at-a-time operations for performance, Numba lets imperative, pure Python code run with equal and often better performance.
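As a small illustration of the two styles on plain NumPy data, the sketch below computes the same quantity array-at-a-time and as an imperative loop. The function names are hypothetical, and the `ImportError` fallback is only an assumption added so that the example still runs (uncompiled) where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit  # shorthand for jit(nopython=True)
except ImportError:
    # Assumption for portability only: without Numba, run the loop uncompiled.
    def njit(func):
        return func

def sum_sq_positive_numpy(x):
    # Array-at-a-time: builds a mask and a temporary array, then reduces.
    return np.sum(x[x > 0] ** 2)

@njit
def sum_sq_positive_loop(x):
    # Imperative: a single pass over the data with no temporary arrays.
    total = 0.0
    for i in range(len(x)):
        if x[i] > 0:
            total += x[i] * x[i]
    return total

x = np.array([-1.5, 2.0, 0.0, 3.0])
result_numpy = sum_sq_positive_numpy(x)
result_loop = sum_sq_positive_loop(x)
```

Both calls give the same answer; which one is faster depends on the array size and on whether the JIT compilation cost has already been paid.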

The analogy with Awkward is one-to-one:

|   | without Numba | with Numba |
|:-:|:-------------:|:----------:|
| **with NumPy** | array-at-a-time processing on numbers | general code on NumPy arrays and Python objects |
| **with Awkward** | array-at-a-time processing on data structures | general code on Awkward data structures |

The Awkward Array library includes Numba extensions with near feature parity: most operations that run outside of JIT-compiled functions run inside them as well.

In [33]:
import numba as nb

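@nb.jit(nopython=True)
def run(array):
    # One output value per record in the array.
    out = np.empty(len(array), np.float64)
    for i in range(len(array)):
        # Start from the record's scalar field "x"...
        out[i] = array[i]["x"]
        # ...then add every element of its variable-length list "y".
        for y in array[i]["y"]:
            out[i] += y
    return out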

run(ak.Array([{"x": 100, "y": [1.1, 2.2]}, {"x": 200, "y": []}, {"x": 300, "y": [3.3]}]).layout)

array([103.3, 200. , 303.3])

Although Numba can take and return built-in Python objects (e.g. tuples, lists, dicts) and can define extensions for class instances with `@jitclass`, these objects need to be boxed and unboxed, which can be a bottleneck.

Data in an Awkward Array are columnar, so the cost of boxing and unboxing scales with the depth of the columnar layout rather than with the number of elements in the array. In this example, 4 nodes get unboxed, even though there can be a million array elements.

![](img/example-hierarchy.png)
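The same idea can be sketched with NumPy alone: the records from the earlier example are turned into a fixed number of flat buffers, and the loop then works on those buffers. This is a hedged illustration, not how Awkward's Numba extension is implemented; `run_columnar` and the buffer names are hypothetical.

```python
import numpy as np

records = [{"x": 100, "y": [1.1, 2.2]}, {"x": 200, "y": []}, {"x": 300, "y": [3.3]}]

# Columnar form: three buffers, however many records there are.
x = np.array([r["x"] for r in records], dtype=np.float64)
counts = np.array([len(r["y"]) for r in records])
y_offsets = np.concatenate(([0], np.cumsum(counts)))
y_content = np.array([v for r in records for v in r["y"]], dtype=np.float64)

def run_columnar(x, y_offsets, y_content):
    """Hypothetical counterpart of run(): x plus the sum of each record's y list."""
    out = x.copy()
    for i in range(len(x)):
        # Record i owns the slice y_content[y_offsets[i]:y_offsets[i + 1]].
        for j in range(y_offsets[i], y_offsets[i + 1]):
            out[i] += y_content[j]
    return out

out = run_columnar(x, y_offsets, y_content)
```

Passing three arrays into a compiled function costs the same whether they describe three records or a million, whereas passing a list of dicts would require converting every element.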

In [59]:
%%timeit -n 1 -r 1

builder = ak.FillableArray()

for i in range(1000000):
    builder.beginrecord()
    builder.field("x")
    builder.integer(np.random.poisson(3) * 100)
    builder.field("y")
    builder.beginlist()
    for j in range(np.random.poisson(3)):
        builder.real(np.random.randint(5) * 1.1)
    builder.endlist()
    builder.endrecord()

akarray = builder.snapshot()
print(akarray)

[{x: 100, y: []}, {x: 200, y: [1.1]}, ... {x: 100, y: [3.3, 0, 2.2, 4.4]}]
24.4 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [57]:
run(akarray.layout)

array([509.9, 403.3, 317.6, ..., 213.2, 409.9, 406.6])

Even dynamically typed data like this `FillableArray` can be filled inside Numba, with a dramatic speedup (roughly 40×, comparing the 24.4 s above with the 571 ms below).

In [60]:
%%timeit -n 1 -r 1

@nb.jit(nopython=True)
def build(builder):
    for i in range(1000000):
        builder.beginrecord()
        builder.field("x")
        builder.integer(np.random.poisson(3) * 100)
        builder.field("y")
        builder.beginlist()
        for j in range(np.random.poisson(3)):
            builder.real(np.random.randint(5) * 1.1)
        builder.endlist()
        builder.endrecord()
    return builder

print(ak.Array(build(ak.layout.FillableArray()).snapshot()))

[{x: 400, y: [3.3, 0, 1.1]}, {x: 300, y: [3.3, ... {x: 300, y: [2.2, 2.2, 1.1]}]
571 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


The equivalent using plain Python lists in Numba is about as fast, though it has to box _O(million)_ lists and numbers when it returns them.

In [64]:
%%timeit -n 1 -r 1

@nb.jit(nopython=True)
def build():
    outx = []
    outy = []
    for i in range(1000000):
        outx.append(np.random.poisson(3) * 100)
        tmp = []
        for j in range(np.random.poisson(3)):
            tmp.append(np.random.randint(5) * 1.1)
        outy.append(tmp)
    return (outx, outy)

outx, outy = build()
print(outx[:5])
print(outy[:5])

[600, 500, 400, 500, 200]
[[2.2, 0.0, 2.2, 2.2, 3.3000000000000003], [4.4, 2.2, 3.3000000000000003], [0.0, 0.0, 4.4, 2.2], [2.2], [4.4]]
684 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


### 2.3 Why did this come from particle physics?

HERE