# 2019-12-20-coffea-demo

This demo of the new Awkward Array was presented on December 20, 2019, before the final 1.0 version was released. Some interfaces may have changed. To run this notebook, install version 0.1.33 ([GitHub](https://github.com/scikit-hep/awkward-1.0/releases/tag/0.1.33), [pip](https://pypi.org/project/awkward1/0.1.33/)):

```bash
pip install 'awkward1==0.1.33'
```

The basic concepts of Awkward arrays are presented in the [old Awkward README](https://github.com/scikit-hep/awkward-array/tree/0.12.17#readme), and the motivation for a 1.0 rewrite is presented in the [new Awkward README](https://github.com/scikit-hep/awkward-1.0/tree/0.1.32#readme).

## High-level array class

The biggest user-facing change is that, instead of mixing NumPy arrays and `JaggedArray` objects, the new Awkward has a single `Array` class.

In [2]:
import numpy as np
import awkward1 as ak

array1 = ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
array1

<Array [[1.1, 2.2, 3.3], [], [4.4, 5.5]] type='3 * var * float64'>

In [3]:
array2 = ak.Array([{"x": 0, "y": []}, {"x": 1, "y": [1.1]}, {"x": 2, "y": [1.1, 2.2]}])
array2

<Array [{x: 0, y: []}, ... y: [1.1, 2.2]}] type='3 * {"x": int64, "y": var * flo...'>

The same `Array` class is used for all data structures, such as the array of lists in `array1` and the array of records in `array2`.

There won't be any user-level functions that apply to some data types and not others. The result of an operation may depend on the data type, but whether the operation is available does not. (At this time, the only existing operations are conversions and descriptions.)

(Incidentally, the width of that string representation is exactly large enough to fit into GitHub and StackOverflow text boxes without scrolling.)

In [4]:
ak.tolist(array1)

[[1.1, 2.2, 3.3], [], [4.4, 5.5]]

In [5]:
ak.tojson(array1)

'[[1.1,2.2,3.3],[],[4.4,5.5]]'

In [6]:
ak.tolist(array2)

[{'x': 0, 'y': []}, {'x': 1, 'y': [1.1]}, {'x': 2, 'y': [1.1, 2.2]}]

In [7]:
ak.tojson(array2)

'[{"x":0,"y":[]},{"x":1,"y":[1.1]},{"x":2,"y":[1.1,2.2]}]'

In [8]:
ak.typeof(array1)

3 * var * float64

In [9]:
ak.typeof(array2)

3 * {"x": int64, "y": var * float64}

(Data types are described using the [datashape language](https://datashape.readthedocs.io/en/latest/). Some Awkward features are [not expressible](https://github.com/blaze/datashape/issues/237) in the current datashape specification, so they're expressed in an extension of the language using the same style of syntax.)
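To make the notation concrete, here is a rough plain-Python sketch of how a type string like `3 * var * float64` could be read off nested lists. The helper `toy_typestr` is hypothetical and is not how `ak.typeof` is implemented; it handles only nested lists of numbers (no records), and `unknown` is just a placeholder for levels with no items to inspect.

```python
def toy_typestr(data):
    """Toy datashape-like type string for nested lists of numbers (hypothetical helper)."""

    def inner(values):
        # Assumes every item at one nesting level is the same kind of thing.
        sublists = [v for v in values if isinstance(v, list)]
        if sublists:
            # Lists at this level may differ in length, hence "var".
            merged = [item for sub in sublists for item in sub]
            return "var * " + inner(merged)
        if not values:
            return "unknown"  # placeholder: no items to inspect
        if all(isinstance(v, int) for v in values):
            return "int64"
        return "float64"

    # Only the outermost length is stated explicitly.
    return "{} * {}".format(len(data), inner(data))


print(toy_typestr([[1.1, 2.2, 3.3], [], [4.4, 5.5]]))  # 3 * var * float64
```

Only the outermost dimension carries a fixed length; every inner list level is `var` because its lengths can differ from one entry to the next.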

The next major interface change is that operations on arrays, such as `ak.tolist` and `ak.typeof` above, are free-standing functions rather than class methods. This leaves the array object itself free for domain-specific (e.g. physics) methods; using free-standing functions for array manipulations avoids name conflicts. For example,

   * `ak.cross(array1, array2)` is an array-manipulation function (the cross-join of `array1` and `array2`)
   * `array1.cross(array2)` could be a user-defined method, such as the 3D cross-product, if `array1` and `array2` represent (arrays of) 3D vectors.
   * `array1.somefield` is a shortcut for `array1["somefield"]`.
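
The name-conflict argument can be illustrated without Awkward at all. In this sketch, `Vec3Array` and the module-level `cross` are hypothetical stand-ins: the method carries the physics meaning (3D cross product) and the free function carries the array-manipulation meaning (all pairings), so the two never compete for the same name.

```python
import itertools


class Vec3Array:
    """Hypothetical domain class: a list of 3D vectors with a physics-style method."""

    def __init__(self, vectors):
        self.vectors = list(vectors)

    def cross(self, other):
        # Domain meaning: element-wise 3D cross product.
        return Vec3Array(
            (ay * bz - az * by, az * bx - ax * bz, ax * by - ay * bx)
            for (ax, ay, az), (bx, by, bz) in zip(self.vectors, other.vectors)
        )


def cross(left, right):
    # Array-manipulation meaning: every pairing of an item from left and right.
    return list(itertools.product(left.vectors, right.vectors))


a = Vec3Array([(1, 0, 0), (0, 1, 0)])
b = Vec3Array([(0, 1, 0), (0, 0, 1)])

products = a.cross(b).vectors  # two cross products, one per position
pairs = cross(a, b)            # four pairings, two items times two items
```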

## Low-level array classes

The old `JaggedArray` and `Table` are still available, but you have to ask for them explicitly with `layout`. They're not "private" or "internal implementations" (there's no underscore in `layout`): they're public for frameworks like Coffea but hidden from data analysts.

As such, their string representations have more low-level detail: the contents of indexes, rather than what they mean as high-level types. (The XML formatting is just an elaboration on Python's angle-bracket convention for `repr` and the fact that we need to denote nesting.)

In [10]:
array1.layout

<ListOffsetArray64>
    <type>var * float64</type>
    <offsets><Index64 i="[0 3 3 5]" offset="0" at="0x55e031af1aa0"/></offsets>
    <content><NumpyArray format="d" shape="5" data="1.1 2.2 3.3 4.4 5.5" at="0x55e031ae9d50">
        <type>float64</type>
    </NumpyArray></content>
</ListOffsetArray64>
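
The `offsets` and `content` shown above are enough to rebuild the nested lists: list `i` is the slice of `content` between `offsets[i]` and `offsets[i + 1]`. The following is a plain-Python sketch of that relationship, with hypothetical helper names, not the C++ implementation.

```python
offsets = [0, 3, 3, 5]
content = [1.1, 2.2, 3.3, 4.4, 5.5]


def lists_from_offsets(offsets, content):
    # List i spans content[offsets[i]:offsets[i + 1]].
    return [content[start:stop] for start, stop in zip(offsets[:-1], offsets[1:])]


def offsets_from_lists(lists):
    # Inverse direction: flatten the lists and record where each one ends.
    new_offsets, flat = [0], []
    for sub in lists:
        flat.extend(sub)
        new_offsets.append(len(flat))
    return new_offsets, flat


rebuilt = lists_from_offsets(offsets, content)
print(rebuilt)  # [[1.1, 2.2, 3.3], [], [4.4, 5.5]]
```

The repeated `3` in `offsets` is what encodes the empty list in the middle.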

In [11]:
array2.layout

<RecordArray>
    <type>{"x": int64, "y": var * float64}</type>
    <field index="0" key="x">
        <NumpyArray format="l" shape="3" data="0 1 2" at="0x55e031ae6eb0">
            <type>int64</type>
        </NumpyArray>
    </field>
    <field index="1" key="y">
        <ListOffsetArray64>
            <type>var * float64</type>
            <offsets><Index64 i="[0 0 1 3]" offset="0" at="0x55e031af5420"/></offsets>
            <content><NumpyArray format="d" shape="3" data="1.1 1.1 2.2" at="0x55e031af7430">
                <type>float64</type>
            </NumpyArray></content>
        </ListOffsetArray64>
    </field>
</RecordArray>
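
The layout above stores each field as its own array, so a record is only assembled when an element is requested. A minimal plain-Python sketch of that columnar idea, using the buffers printed above (`record_at` is a hypothetical helper):

```python
# One buffer (or set of buffers) per field, as in the RecordArray layout above.
x = [0, 1, 2]
y_offsets = [0, 0, 1, 3]
y_content = [1.1, 1.1, 2.2]


def record_at(i):
    # Assemble record i on demand from the per-field buffers.
    return {"x": x[i], "y": y_content[y_offsets[i]:y_offsets[i + 1]]}


records = [record_at(i) for i in range(len(x))]
print(records)
# [{'x': 0, 'y': []}, {'x': 1, 'y': [1.1]}, {'x': 2, 'y': [1.1, 2.2]}]
```

Reading a whole field, such as all `x` values, touches only that field's buffer.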

These classes are defined in C++ and wrapped by pybind11. The `awkward1.Array` class is pure Python. Many of the same operations work for layout classes, though less attention has been paid to their interfaces.

In [12]:
ak.typeof(array1)

3 * var * float64

In [13]:
ak.typeof(array1.layout)

var * float64

In [14]:
ak.tojson(array1)

'[[1.1,2.2,3.3],[],[4.4,5.5]]'

In [15]:
ak.tojson(array1.layout)

'[[1.1,2.2,3.3],[],[4.4,5.5]]'

In [16]:
array1.layout.tojson()

'[[1.1,2.2,3.3],[],[4.4,5.5]]'

## Behavioral mix-ins

The primary use of Awkward arrays so far has been to represent arrays or jagged arrays of physics objects with physics methods on the array objects themselves. In Awkward 0.x, this was implemented with Python multiple inheritance, but that's a Python-only solution that can't be passed into C++ (and it was brittle: easy for an array component to lose its methods).

Now behavioral mix-ins are "first-class citizens," built into Awkward 1.0's type system.

In [23]:
class PointClass(ak.Record):
    def __repr__(self):
        return "<Point({}, {})>".format(self["x"], self["y"])
    
    def mag(self):
        return abs(np.sqrt(self["x"]**2 + self["y"]**2))

ak.namespace["Point"] = PointClass

In [24]:
array3 = ak.Array([{"x": 1, "y": 1.1}, {"x": 2, "y": 2.2}, {"x": 3, "y": 3.3}])
array3

<Array [{x: 1, y: 1.1}, ... {x: 3, y: 3.3}] type='3 * {"x": int64, "y": float64}'>

In [25]:
array3.layout.type

{"x": int64, "y": float64}

In [26]:
pointtype = array3.layout.type
pointtype["__class__"] = "Point"
pointtype["__str__"] = "PointType[{}, {}]".format(pointtype.field("x"), pointtype.field("y"))
pointtype

PointType[int64, float64]

In [27]:
# There will be a better interface for setting types...
array4 = ak.Array(array3.layout, type=ak.ArrayType(pointtype, len(array3.layout)))
array4

<Array [<Point(1, 1.1)>, ... <Point(3, 3.3)>] type='3 * PointType[int64, float64]'>

In [29]:
[x.mag() for x in array4]

[1.4866068747318506, 2.973213749463701, 4.459820624195552]
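
One way to picture the mechanism: the data stays plain, the type carries only a class *name* (a string, which can cross the Python/C++ boundary), and Python looks that name up in a registry when it wraps an element. The sketch below is a toy model of that idea in plain Python; `ToyRecord`, `ToyPoint`, `wrap`, and this `namespace` dict are hypothetical and do not reproduce Awkward's implementation.

```python
import math

namespace = {}  # hypothetical registry: class name -> Python class


class ToyRecord:
    """Plain record wrapper with no domain-specific methods."""

    def __init__(self, fields):
        self.fields = fields

    def __getitem__(self, key):
        return self.fields[key]


class ToyPoint(ToyRecord):
    def mag(self):
        return math.sqrt(self["x"] ** 2 + self["y"] ** 2)


namespace["Point"] = ToyPoint


def wrap(fields, parameters):
    # Look up the class named in the type parameters; fall back to a plain record.
    cls = namespace.get(parameters.get("__class__"), ToyRecord)
    return cls(fields)


data = [{"x": 1, "y": 1.1}, {"x": 2, "y": 2.2}, {"x": 3, "y": 3.3}]
points = [wrap(d, {"__class__": "Point"}) for d in data]
mags = [p.mag() for p in points]

plain = wrap(data[0], {})  # no class name in the type: no mag() method
```

Because the behavior is attached at wrapping time from a name stored in the type, slicing or rearranging the underlying buffers cannot strip the methods the way losing a Python subclass could.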