# 2020-05-01-coffea-demo-2

## 1. Introduction

This demo of the new Awkward Array was presented on May 1, 2020, before the version was named 1.0, but the interface is nearly final. Nevertheless, it is only guaranteed to work in the current version, 0.2.18, so be sure to install that (from [GitHub](https://github.com/scikit-hep/awkward-1.0/releases/tag/0.2.18) or [pip](https://pypi.org/project/awkward1/0.2.18/)) before running this notebook.

```bash
pip install 'awkward1==0.2.18'
```

This demo is also based on [one I presented for the EIC collaboration](https://github.com/jpivarski/2020-04-08-eic-jlab#readme) and it uses the same file.

In [5]:
!wget https://github.com/jpivarski/2020-04-08-eic-jlab/raw/master/open_charm_18x275_10k.root

--2020-04-30 08:33:58--  https://github.com/jpivarski/2020-04-08-eic-jlab/raw/master/open_charm_18x275_10k.root
Resolving github.com (github.com)... 140.82.114.3
Connecting to github.com (github.com)|140.82.114.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/jpivarski/2020-04-08-eic-jlab/master/open_charm_18x275_10k.root [following]
--2020-04-30 08:33:58--  https://raw.githubusercontent.com/jpivarski/2020-04-08-eic-jlab/master/open_charm_18x275_10k.root
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 199.232.28.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|199.232.28.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 51484369 (49M) [application/octet-stream]
Saving to: ‘open_charm_18x275_10k.root’


2020-04-30 08:35:18 (634 KB/s) - ‘open_charm_18x275_10k.root’ saved [51484369/51484369]



In [None]:
# The base of the GitHub repo is one level up from this notebook.
import sys
import os
sys.path.insert(0, os.path.join(os.getcwd(), ".."))

## 2. Awkward 1 is ready for users, Uproot 4 is not

The only hold-up is that Uproot does not yet produce Awkward 1 arrays, so there's an extra step to turn Awkward 0 arrays into Awkward 1. This conversion is zero-copy: it changes names and metadata, but not the array buffers.
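
To illustrate what "zero-copy" means here, the following is a minimal NumPy sketch, not Awkward's actual conversion code. The `OldJagged` and `NewJagged` classes and the `from_old` function are hypothetical stand-ins: the conversion builds a new wrapper object around the same `offsets` and `content` buffers instead of copying them.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class OldJagged:            # hypothetical stand-in for an Awkward 0 jagged array
    offsets: np.ndarray
    content: np.ndarray

@dataclass
class NewJagged:            # hypothetical stand-in for an Awkward 1 layout node
    offsets: np.ndarray
    content: np.ndarray
    parameters: dict

def from_old(old):
    # Only the wrapper and its metadata are new; the buffers are passed by reference.
    return NewJagged(old.offsets, old.content, parameters={})

old = OldJagged(np.array([0, 3, 3, 5]), np.array([1.1, 2.2, 3.3, 4.4, 5.5]))
new = from_old(old)
shares_buffers = np.shares_memory(old.content, new.content)
```

Because the buffers are shared, the cost of the conversion in this sketch does not depend on the size of the data.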

In [None]:
import awkward1 as ak
import uproot

In [None]:
dataset = uproot.open("open_charm_18x275_10k.root")["events/tree"]

In [None]:
# old style
dataset.array("p")

In [None]:
# new style
ak.from_awkward0(dataset.array("p"))

Let's read them all into new-style arrays.

In [None]:
arrays = {name: ak.from_awkward0(array) for name, array in dataset.arrays(namedecode="utf-8").items()}
arrays

In general, it's more useful for data to be combined into a single structure (like NanoEvents), rather than a dict or variables pointing to separate arrays.

There are tools for building structures (and they're zero-copy, as much as possible).

In [None]:
example = ak.zip({"px": arrays["px"], "py": arrays["py"], "pz": arrays["pz"]})
example

In [None]:
example[0, 0]
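
Conceptually, `ak.zip` turns a set of equally shaped columns into one array of records. The pure-Python sketch below shows that correspondence for one level of jaggedness. `zip_jagged` is a hypothetical helper, and unlike Awkward it materializes Python objects instead of sharing buffers.

```python
def zip_jagged(columns):
    """Combine {field: list of lists} into a list of lists of records (dicts)."""
    names = list(columns)
    num_events = len(columns[names[0]])
    events = []
    for i in range(num_events):
        # Every field must have the same number of items in a given event.
        lengths = {len(columns[name][i]) for name in names}
        if len(lengths) != 1:
            raise ValueError("fields disagree on the number of items in event %d" % i)
        events.append([{name: columns[name][i][j] for name in names}
                       for j in range(lengths.pop())])
    return events

px = [[1.0, 2.0], [], [3.0]]
py = [[0.1, 0.2], [], [0.3]]
records = zip_jagged({"px": px, "py": py})
first = records[0][0]
```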

Building such a structure requires some knowledge of what the ROOT branches mean, but this can be done once for NanoAOD (NanoEvents!).

In [None]:
events = ak.zip({"id": arrays["evt_id"],
                 "true": ak.zip({"q2": arrays["evt_true_q2"],
                                 "x": arrays["evt_true_x"],
                                 "y": arrays["evt_true_y"],
                                 "w2": arrays["evt_true_w2"],
                                 "nu": arrays["evt_true_nu"]}),
                 "has_dis_info": arrays["evt_has_dis_info"],
                 "prt_count": arrays["evt_prt_count"],
                 "prt": ak.zip({"id": arrays["id"],
                                "pdg": arrays["pdg"],
                                "trk_id": arrays["trk_id"],
                                "charge": arrays["charge"],
                                "dir": ak.zip({"x": arrays["dir_x"],
                                               "y": arrays["dir_y"],
                                               "z": arrays["dir_z"]}, with_name="point3"),
                                "p": arrays["p"],
                                "px": arrays["px"],
                                "py": arrays["py"],
                                "pz": arrays["pz"],
                                "m": arrays["m"],
                                "time": arrays["time"],
                                "is_beam": arrays["is_beam"],
                                "is_stable": arrays["is_stable"],
                                "gen_code": arrays["gen_code"],
                                "mother": ak.zip({"id": arrays["mother_id"],
                                                  "second_id": arrays["mother_second_id"]}),
                                "pol": ak.zip({"has_info": arrays["has_pol_info"],
                                               "x": arrays["pol_x"],
                                               "y": arrays["pol_y"],
                                               "z": arrays["pol_z"]}, with_name="point3"),
                                "vtx": ak.zip({"has_info": arrays["has_vtx_info"],
                                               "id": arrays["vtx_id"],
                                               "x": arrays["vtx_x"],
                                               "y": arrays["vtx_y"],
                                               "z": arrays["vtx_z"],
                                               "t": arrays["vtx_t"]}, with_name="point3"),
                                "smear": ak.zip({"has_info": arrays["has_smear_info"],
                                                 "has_e": arrays["smear_has_e"],
                                                 "has_p": arrays["smear_has_p"],
                                                 "has_pid": arrays["smear_has_pid"],
                                                 "has_vtx": arrays["smear_has_vtx"],
                                                 "has_any_eppid": arrays["smear_has_any_eppid"],
                                                 "orig": ak.zip({"tot_e": arrays["smear_orig_tot_e"],
                                                                 "p": arrays["smear_orig_p"],
                                                                 "px": arrays["smear_orig_px"],
                                                                 "py": arrays["smear_orig_py"],
                                                                 "pz": arrays["smear_orig_pz"],
                                                                 "vtx": ak.zip({"x": arrays["smear_orig_vtx_x"],
                                                                                "y": arrays["smear_orig_vtx_y"],
                                                                                "z": arrays["smear_orig_vtx_z"]},
                                                                               with_name="point3")})})}, with_name="particle")},
                depth_limit=1)

Conceptually at least, this is now an array of objects.

<img src="../docs-img/diagrams/cartoon-schematic.png" width="600">

In [None]:
# event 0, particle 0
ak.to_list(events[0].prt[0])

<img src="../docs-img/diagrams/how-it-works-muons.png" width="1000">

## 3. What's new?

The most important new features are **robustness** and **uniformity**.

The majority of Awkward 0 issues were NumPy corner cases like `np.max([])`, ChunkedArrays not working like all the other types (to such an extent that I recommended against Uproot lazy arrays), and unimplemented special cases.

For example:

In [None]:
import awkward as old_awkward

old = old_awkward.fromiter([[[0.0, 1.1, 2.2], [], [3.3, 4.4]], [[5.5]], [], [[6.6, 7.7, 8.8, 9.9]]])
old[:, ::-1, ::2]

In [None]:
new = ak.from_iter([[[0.0, 1.1, 2.2], [], [3.3, 4.4]], [[5.5]], [], [[6.6, 7.7, 8.8, 9.9]]])
new[:, ::-1, ::2]

Many slices in many jagged dimensions? No problem!

That's because these functions are now written in C++, where functionality like slicing can be written in a more natural (recursive) way, allowing for generality. The restriction to only NumPy calls in the old library limited implementations to special cases. C++ type-checking also ensures that no methods are missing.
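
The recursive approach can be sketched in a few lines of pure Python. This only illustrates the idea on nested lists and is not the C++ implementation; `nested_slice` is a hypothetical name.

```python
def nested_slice(data, slices):
    """Apply one slice per nesting level, recursing into each selected item."""
    if not slices:
        return data
    head, tail = slices[0], slices[1:]
    return [nested_slice(item, tail) for item in data[head]]

data = [[[0.0, 1.1, 2.2], [], [3.3, 4.4]], [[5.5]], [], [[6.6, 7.7, 8.8, 9.9]]]

# Equivalent in spirit to data[:, ::-1, ::2] on a jagged array.
result = nested_slice(data, (slice(None), slice(None, None, -1), slice(None, None, 2)))
```

Each level handles its own slice and delegates the rest, so the same function works at any depth.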

<img src="../docs-img/diagrams/awkward-1-0-layers.png" width="600">

Beyond uniformity, the main new features are:

   * Single high-level ak.Array class
   * Masking, rather than cutting
   * Easier to override with physics behaviors
   * Everything can be used in Numba
   * Everything can be used in Pandas
   * NumPy conformance and the "axis" parameter
   * Producing and consuming arrays in pure C++
   * Documentation!

## 4. Single high-level ak.Array class

(Its printed representation is exactly wide enough to fit in GitHub and StackOverflow boxes without scrolling. :)

In [None]:
events

In [None]:
ak.type(events)

You can use dots or strings for "column" slices.

In [None]:
events.prt.smear.orig.vtx, events["prt", "smear", "orig", "vtx"]

In [None]:
ak.type(events.prt.smear.orig.vtx)

You can slice them as before (with more generality).

In [None]:
from particle import Particle     # https://github.com/scikit-hep/particle
Particle.from_string("p"), Particle.from_string("pi+"), Particle.from_string("K+")

In [None]:
events.prt[abs(events.prt.pdg) == abs(Particle.from_string("pi+").pdgid)]

And assign new collections to objects (which follow the normal broadcasting rules).

In [None]:
# Assignments have to be through __setitem__ (brackets), not __setattr__ (as an attribute).
# Is that a problem? (Assigning as an attribute would have to be implemented with care, if at all.)

events["protons"] = events.prt[abs(events.prt.pdg) == abs(Particle.from_string("p").pdgid)]
events["pions"] = events.prt[abs(events.prt.pdg) == abs(Particle.from_string("pi+").pdgid)]
events["kaons"] = events.prt[abs(events.prt.pdg) == abs(Particle.from_string("K+").pdgid)]

The nested structures you remember from Awkward 0 (e.g. JaggedArray of JaggedArray) are hidden inside the `layout` property.

In [None]:
events.layout
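
As a rough picture of what a "JaggedArray of JaggedArray" stores, here is a NumPy sketch with two levels of offsets over one flat content buffer. The buffers and the `get_inner` helper are made up for illustration and do not reproduce Awkward's layout classes.

```python
import numpy as np

# Hypothetical buffers for [[[0.0, 1.1, 2.2], [], [3.3, 4.4]], [[5.5]], []]
outer_offsets = np.array([0, 3, 4, 4])        # boundaries of the 3 outer lists
inner_offsets = np.array([0, 3, 3, 5, 6])     # boundaries of the 4 inner lists
content = np.array([0.0, 1.1, 2.2, 3.3, 4.4, 5.5])

def get_inner(i, j):
    """Return inner list j of outer list i as a view of `content`."""
    k = outer_offsets[i] + j
    if not outer_offsets[i] <= k < outer_offsets[i + 1]:
        raise IndexError("inner index out of range")
    return content[inner_offsets[k]:inner_offsets[k + 1]]

third = get_inner(0, 2)
```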

For Nick: there's also a view of this without array data:

In [None]:
events.layout.form

Let's do some bump-hunting...

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import mplhep as hep             # https://github.com/scikit-hep/mplhep
import boost_histogram as bh     # https://github.com/scikit-hep/boost-histogram

def mass(pairs, left_mass, right_mass):
    left, right = ak.unzip(pairs)
    left_energy = np.sqrt(left.p**2 + left_mass**2)
    right_energy = np.sqrt(right.p**2 + right_mass**2)
    return np.sqrt((left_energy + right_energy)**2 -
                   (left.px + right.px)**2 -
                   (left.py + right.py)**2 -
                   (left.pz + right.pz)**2)
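
The function above implements $m = \sqrt{(E_1 + E_2)^2 - |\vec{p}_1 + \vec{p}_2|^2}$ with $E_i = \sqrt{|\vec{p}_i|^2 + m_i^2}$ (natural units). As a sanity check, here is a scalar pure-Python sketch of the same formula. `pair_mass` is a hypothetical helper, momenta are `(px, py, pz)` tuples, and the momentum values are made up for illustration.

```python
import math

def pair_mass(p1, m1, p2, m2):
    """Invariant mass of two particles given momentum 3-vectors and masses."""
    e1 = math.sqrt(sum(c**2 for c in p1) + m1**2)
    e2 = math.sqrt(sum(c**2 for c in p2) + m2**2)
    psum = [a + b for a, b in zip(p1, p2)]
    return math.sqrt((e1 + e2)**2 - sum(c**2 for c in psum))

# Two particles at rest: the invariant mass is the sum of the masses.
at_rest = pair_mass((0.0, 0.0, 0.0), 0.139570, (0.0, 0.0, 0.0), 0.938272)

# Two pions back to back: the momenta cancel, leaving the sum of the energies.
back_to_back = pair_mass((0.0, 0.0, 0.1), 0.139570, (0.0, 0.0, -0.1), 0.139570)
```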

$\Lambda^0 \to p \pi$ requires a Cartesian product of protons in each event with pions in each event.

<img src="../docs-img/diagrams/cartoon-cartesian.png" width="300">

In [None]:
pairs = ak.cartesian([events.pions, events.protons])
pairs
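
For intuition, the per-event Cartesian product can be written with `itertools.product` on plain Python lists. This is a semantic sketch only; `cartesian_per_event` and the particle labels are hypothetical, and Awkward does not loop over Python objects like this.

```python
from itertools import product

def cartesian_per_event(lefts, rights):
    """Pair every item on the left with every item on the right, event by event."""
    return [list(product(left, right)) for left, right in zip(lefts, rights)]

pions = [["pi1", "pi2"], [], ["pi3"]]
protons = [["p1"], ["p2"], ["p3", "p4"]]
pairs = cartesian_per_event(pions, protons)
```

An event with no pions (or no protons) contributes an empty list of pairs, so the number of events is preserved.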

In [None]:
mass(pairs, 0.139570, 0.938272)

In [None]:
hep.histplot(bh.Histogram(bh.axis.Regular(100, 1.115683 - 0.01, 1.115683 + 0.01)).fill(
    ak.flatten(mass(pairs, 0.139570, 0.938272))
))

$K_S \to \pi\pi$ requires unique combinations of pions in each event with themselves.

<img src="../docs-img/diagrams/cartoon-combinations.png" width="300">

In [None]:
pairs = ak.combinations(events.pions, 2, with_name="pair")
pairs
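
The "unique combinations" idea corresponds to `itertools.combinations` applied within each event. Again this is a pure-Python sketch of the semantics with hypothetical names and labels, not how Awkward computes it.

```python
from itertools import combinations

def combinations_per_event(events, n):
    """Choose n distinct items from each event, without regard to order."""
    return [list(combinations(event, n)) for event in events]

pions = [["pi1", "pi2", "pi3"], ["pi4"], []]
pairs = combinations_per_event(pions, 2)
```

Events with fewer than two pions yield no pairs, and no pion is ever paired with itself.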

In [None]:
mass(pairs, 0.139570, 0.139570)

In [None]:
hep.histplot(bh.Histogram(bh.axis.Regular(100, 0.497611 - 0.015, 0.497611 + 0.015)).fill(
    ak.flatten(mass(pairs, 0.139570, 0.139570))
))

## 5. Masking, rather than cutting

One of the problems with using NumPy slicing to cut events is that it changes the lengths of arrays, so the cut arrays no longer line up with the originals.

In [None]:
sample = ak.Array(np.arange(10))
sample

In [None]:
cut = (sample % 2 == 0)
cut

In [None]:
sample[cut]

One of the data types that can be expressed with Awkward Arrays allows for missing data (arrays containing `None`).

`ak.mask` or `array.mask[...]` can make these arrays.

In [None]:
sample.mask[cut]

This still has 10 entries, so we can use it in formulae with other arrays with 10 entries.

In [None]:
sample.mask[cut] - sample
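
NumPy's own masked arrays give a feel for the same idea on flat data. This sketch uses `numpy.ma` instead of Awkward, so it only illustrates the length-preserving behavior, not Awkward's option types.

```python
import numpy as np

sample = np.arange(10)
cut = (sample % 2 == 0)

cut_away = sample[cut]                          # slicing: only 5 entries remain
masked = np.ma.masked_array(sample, mask=~cut)  # masking: still 10 entries

difference = masked - sample                    # lines up with `sample` entry by entry
as_list = masked.tolist()                       # masked entries appear as None
```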

Physics example: apply some quality cuts to $K_S \to \pi\pi$.

In [None]:
pairs = ak.combinations(events.pions, 2, with_name="pair")
pairs

In [None]:
opposite_sign = (pairs.slot0.charge != pairs.slot1.charge)
opposite_sign

In [None]:
def far_enough(vtx, cut):
    return np.sqrt(vtx.x**2 + vtx.y**2 + vtx.z**2) > cut

left, right = ak.unzip(pairs)
displaced_vertex = far_enough(left.vtx, 0.10) & far_enough(right.vtx, 0.10)
displaced_vertex

The cuts can be added sequentially.

In [None]:
good_kaons = pairs.mask[opposite_sign]
good_kaons

In [None]:
better_kaons = good_kaons.mask[displaced_vertex]
better_kaons

Flattening at the default `axis=1` concatenates the first level of nested lists (and would get rid of any missing _lists_).

In [None]:
ak.flatten(better_kaons)

Flattening at `axis=0` gets rid of missing values at the top level.

In [None]:
ak.flatten(ak.flatten(better_kaons), axis=0)

Flattening with `axis=None` eliminates _all_ structure, leaving you with only numbers.

In [None]:
ak.flatten(better_kaons, axis=None)
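
The three kinds of flattening described above can be mimicked on nested Python lists. The helper names are hypothetical and the functions only mirror the behavior as described in this section; they are not Awkward's implementation.

```python
def flatten_one_level(array):
    """Like the axis=1 case: concatenate inner lists, skipping missing lists."""
    out = []
    for inner in array:
        if inner is not None:
            out.extend(inner)
    return out

def drop_missing(array):
    """Like the axis=0 case: remove missing values at the top level."""
    return [x for x in array if x is not None]

def flatten_all(array):
    """Like the axis=None case: remove all structure and all missing values."""
    out = []
    for item in array:
        if item is None:
            continue
        if isinstance(item, list):
            out.extend(flatten_all(item))
        else:
            out.append(item)
    return out

nested = [[1, None, 2], None, [], [3]]
step1 = flatten_one_level(nested)
step2 = drop_missing(step1)
everything = flatten_all(nested)
```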

We don't want to do that to record structures because we would lose the distinction between fields, such as PDG ids and momentum components (`px`, `py`, `pz`).

But we could easily want to do that with a numerical array, like masses.

In [None]:
mass(better_kaons, 0.139570, 0.139570)

In [None]:
ak.flatten(mass(better_kaons, 0.139570, 0.139570), axis=None)

That's pretty much what you always want to do before plotting.

In [None]:
hep.histplot(bh.Histogram(bh.axis.Regular(100, 0.497611 - 0.015, 0.497611 + 0.015)).fill(
    ak.flatten(mass(better_kaons, 0.139570, 0.139570), axis=None)
))

## 6. Easier to override with physics behaviors

Every layout node has JSON-like metadata, and some parameters have special meaning.

The `"__record__"` parameter names data structures.

In [None]:
events.kaons.layout.content.parameters

In [None]:
events.kaons.vtx.layout.content.parameters

Named data structures can be associated with mixins through `ak.behavior`.

In [None]:
class ParticleRecord(ak.Record):
    @property
    def pt(self):
        return np.sqrt(self.px**2 + self.py**2)

ak.behavior["particle"] = ParticleRecord

In [None]:
events.kaons[0, 0]

In [None]:
events.kaons[0, 0].pt
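
The mechanism can be pictured as a registry lookup: when a record is wrapped, its name selects the class that supplies extra methods. The sketch below is a toy model in pure Python; `behavior`, `Record`, and `make_record` are hypothetical stand-ins, not Awkward's classes.

```python
import math

behavior = {}   # hypothetical registry, playing the role of ak.behavior

class Record:
    def __init__(self, fields):
        self._fields = dict(fields)

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails: fall back to the fields.
        try:
            return self.__dict__["_fields"][name]
        except KeyError:
            raise AttributeError(name)

class ParticleRecord(Record):
    @property
    def pt(self):
        return math.sqrt(self.px**2 + self.py**2)

behavior["particle"] = ParticleRecord

def make_record(fields, name=None):
    # The record's name decides which class wraps the data.
    return behavior.get(name, Record)(fields)

kaon = make_record({"px": 3.0, "py": 4.0, "pz": 1.0}, name="particle")
plain = make_record({"px": 3.0, "py": 4.0})
```

Only the named record gains `pt`; the unnamed one has the same fields but no extra behavior.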

Similarly for arrays of these data structures (any number of levels deep).

In [None]:
class ParticleArray(ak.Array):
    @property
    def pt(self):
        return np.sqrt(self.px**2 + self.py**2)

ak.behavior["*", "particle"] = ParticleArray

In [None]:
events.kaons

In [None]:
events.kaons.pt

We can also override the behavior of NumPy ufuncs, when applied to objects of a given name.

In [None]:
def point3_absolute(data):
    return np.sqrt(data.x**2 + data.y**2 + data.z**2)

def point3_distance(left, right):
    return np.sqrt((left.x - right.x)**2 + (left.y - right.y)**2 + (left.z - right.z)**2)

ak.behavior[np.absolute, "point3"] = point3_absolute
ak.behavior[np.subtract, "point3", "point3"] = point3_distance

In [None]:
# using NumPy ufuncs explicitly...
np.absolute(events.kaons.vtx)

In [None]:
# ...or implicitly
abs(events.kaons.vtx)

In [None]:
# subtract the firsts and lasts of each event
events.kaons[:, :1].vtx - events.kaons[:, -1:].vtx
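
NumPy exposes a general hook for this kind of override, the `__array_ufunc__` protocol. The toy class below uses that protocol with a hypothetical `ufunc_overrides` registry to show how a ufunc call can be redirected for a named type; it is a sketch of the idea and does not show how Awkward dispatches internally.

```python
import numpy as np

ufunc_overrides = {}   # hypothetical registry keyed by NumPy ufunc

class Point3:
    def __init__(self, x, y, z):
        self.x, self.y, self.z = float(x), float(y), float(z)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # NumPy calls this instead of its own implementation of the ufunc.
        handler = ufunc_overrides.get(ufunc)
        if method != "__call__" or handler is None:
            return NotImplemented
        return handler(*inputs)

    def __abs__(self):              # abs(p) forwards to np.absolute
        return np.absolute(self)

    def __sub__(self, other):       # p - q forwards to np.subtract
        return np.subtract(self, other)

ufunc_overrides[np.absolute] = lambda p: np.sqrt(p.x**2 + p.y**2 + p.z**2)
ufunc_overrides[np.subtract] = lambda a, b: np.sqrt(
    (a.x - b.x)**2 + (a.y - b.y)**2 + (a.z - b.z)**2)

magnitude = abs(Point3(3, 4, 0))
distance = Point3(1, 2, 3) - Point3(1, 2, 5)
```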

## 7. Everything can be used in Numba

Numba-compiled functions can consume any ak.Array.

In [None]:
import numba as nb

In [None]:
@nb.jit
def lambda_mass(events):
    num_lambdas = 0
    for event in events:
        num_lambdas += len(event.pions) * len(event.protons)

    lambda_masses = np.empty(num_lambdas, np.float64)
    i = 0
    for event in events:
        for pion in event.pions:
            for proton in event.protons:
                pion_energy = np.sqrt(pion.p**2 + 0.139570**2)
                proton_energy = np.sqrt(proton.p**2 + 0.938272**2)
                mass = np.sqrt((pion_energy + proton_energy)**2 -
                               (pion.px + proton.px)**2 -
                               (pion.py + proton.py)**2 -
                               (pion.pz + proton.pz)**2)
                lambda_masses[i] = mass
                i += 1
    
    return lambda_masses

In [None]:
hep.histplot(bh.Histogram(bh.axis.Regular(100, 1.115683 - 0.01, 1.115683 + 0.01)).fill(
    lambda_mass(events)
))

Above, the output array is a NumPy array; we can make complex types with ak.ArrayBuilder (called FillableArray in last December's presentation).

The ak.ArrayBuilder is an append-only structure whose data type is determined by the _order_ in which its methods are called.
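
To show what "determined by the order of method calls" means, here is a toy append-only builder for nested Python lists. The method names mirror the ones used in this notebook (`begin_list`, `end_list`, `append`, `null`, `snapshot`), but the `ListBuilder` class itself is hypothetical and does none of the type inference or columnar storage that ak.ArrayBuilder performs.

```python
import copy

class ListBuilder:
    def __init__(self):
        self._root = []
        self._stack = [self._root]      # innermost open list is last

    def begin_list(self):
        new = []
        self._stack[-1].append(new)
        self._stack.append(new)

    def end_list(self):
        if len(self._stack) == 1:
            raise ValueError("end_list called without a matching begin_list")
        self._stack.pop()

    def append(self, value):
        self._stack[-1].append(value)

    def null(self):
        self._stack[-1].append(None)

    def snapshot(self):
        # Return a copy so that later appends do not change earlier snapshots.
        return copy.deepcopy(self._root)

builder = ListBuilder()
builder.begin_list()
builder.append(1.1)
builder.null()
builder.end_list()
builder.begin_list()
builder.end_list()
result = builder.snapshot()
```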

In [None]:
@nb.jit(nopython=True)
def closest_photon_to_each_electron(events, builder):
    for event in events:
        builder.begin_list()
        for electron in event.electrons:
            best_i = -1
            best_angle = -1.0
            for i in range(len(event.photons)):
                photon = event.photons[i]
                # Dot product of the directions: larger means a smaller opening angle (assuming unit vectors).
                angle = photon.dir.x*electron.dir.x + photon.dir.y*electron.dir.y + photon.dir.z*electron.dir.z
                if angle > best_angle:
                    best_i = i
                    best_angle = angle
            if best_i == -1:
                builder.null()
            else:
                # Append the best match; `photon` would be the last photon visited by the loop.
                builder.append(event.photons[best_i])
        builder.end_list()

In [None]:
events["photons"]   = events.prt[events.prt.pdg == Particle.from_string("gamma").pdgid]
events["electrons"] = events.prt[abs(events.prt.pdg) == abs(Particle.from_string("e-").pdgid)]

builder = ak.ArrayBuilder()
closest_photon_to_each_electron(events, builder)
closest_photons = builder.snapshot()
closest_photons

In [None]:
ak.num(events.photons), ak.num(events.electrons), ak.num(closest_photons)

Limitations:

   * ak.Array and ak.ArrayBuilder cannot be created inside a Numba-compiled function; they can only be passed in and returned.
   * Fancy `__getitem__` is not available.
   * None of the `ak.this` and `ak.that` functions are available.

The bottom line is that you should write imperative, C-style code inside Numba and vectorized, NumPy-style code outside.

## 8. Everything can be used in Pandas

An ak.Array can be a Pandas column:

In [None]:
import pandas as pd

In [None]:
pd.DataFrame({"events": events})

In [None]:
pd.DataFrame({"pions": events.pions, "kaons": events.kaons, "protons": events.protons})

But they'll be more useful in Pandas if broken down to simpler types.

In [None]:
df = pd.DataFrame({"vtx": events.prt.vtx, "smear_vtx": events.prt.smear.orig.vtx})
df

In [None]:
# because we defined subtraction for "point3"
df.vtx - df.smear_vtx

Pandas's own functions are most useful when the cell data are numbers, which we can produce with `ak.pandas.df`.

Jagged lists become `pd.MultiIndex` rows and nested records become `pd.MultiIndex` columns.

In [None]:
ak.pandas.df(events.pions)
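
The mapping from jagged lists to a row index can be sketched without Pandas: each value gets an (event index, item index) pair. `explode` is a hypothetical helper; the resulting tuples are the kind of input that `pd.MultiIndex.from_tuples` accepts.

```python
def explode(jagged):
    """Return ((entry, subentry), value) rows for a list of lists."""
    return [((i, j), value)
            for i, inner in enumerate(jagged)
            for j, value in enumerate(inner)]

pt = [[1.1, 2.2], [], [3.3]]
rows = explode(pt)
index = [key for key, _ in rows]
values = [value for _, value in rows]
```

Empty events produce no rows, which is why the event index in the result can skip numbers.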

## 9. NumPy conformance and the "axis" parameter

Some of the functions in Awkward 0 chose different conventions than NumPy, which is Bad™.

Awkward 1 strictly generalizes NumPy: the same function with the same inputs yields the same outputs.

In particular, most functions in NumPy have an `axis` parameter to specify which dimension you want to apply an operation to.

In [None]:
sample = np.array([[[  0,   1,   2,   3,   4], [  5,   6,   7,   8,   9]],
                   [[ 10,  11,  12,  13,  14], [ 15,  16,  17,  18,  19]],
                   [[100, 101, 102, 103, 104], [105, 106, 107, 108, 109]]])

In [None]:
np.sum(sample, axis=0), ak.to_numpy(ak.sum(sample, axis=0))

In [None]:
np.sum(sample, axis=1), ak.to_numpy(ak.sum(sample, axis=1))

In [None]:
np.sum(sample, axis=-1), ak.to_numpy(ak.sum(sample, axis=-1))

But the Awkward version extends to jagged arrays, missing data, record structures, and all that.

In [None]:
sample = ak.Array([[[  0,   1,   2, None,   4]                            ],
                   [                      None, [ 15,  16,  17, None     ]],
                   [[100, 101, 102,  103, 104], [105, 106, 107           ]]])

In [None]:
ak.to_list(ak.sum(sample, axis=0))

In [None]:
ak.to_list(ak.sum(sample, axis=1))

In [None]:
ak.to_list(ak.sum(sample, axis=-1))
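
For a two-level jagged list, the two extreme axes can be spelled out in pure Python. This is a sketch of the semantics under the assumption that missing values are skipped by the sum; the helper names are hypothetical.

```python
from itertools import zip_longest

def sum_last_axis(jagged):
    """axis=-1: one sum per inner list."""
    return [sum(x for x in inner if x is not None) for inner in jagged]

def sum_first_axis(jagged):
    """axis=0: one sum per position, across inner lists of different lengths."""
    columns = zip_longest(*jagged, fillvalue=None)
    return [sum(x for x in column if x is not None) for column in columns]

jagged = [[1, 2, 3], [], [4, None, 5]]
per_list = sum_last_axis(jagged)
per_column = sum_first_axis(jagged)
```

With `axis=-1` the result has one entry per inner list; with `axis=0` it has one entry per position, as long as the longest inner list.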

So now you can not only find the maximum pT of particles in each event...

In [None]:
ak.max(events.kaons.pt, axis=-1)

You can also find the maximum pT of all particles at index `0`, all particles at index `1`, etc., across events.

In [None]:
ak.to_list(ak.max(events.kaons.pt, axis=0))

## 10. Producing and consuming arrays in pure C++

See [awkward-1.0/dependent-project](https://github.com/scikit-hep/awkward-1.0/tree/master/dependent-project) for an example of a C++ project that produces and consumes Awkward Arrays.

Libraries such as FastJet could take advantage of such an interface to

   * consume a jagged array of tracks-in-events
   * produce a jagged array of jets-in-events

without the inefficiency of creating a Python object for each track/jet (as FastJet's Python interface does) or even a Python object for each event (as pyjet does).

## 11. Documentation!

The [GitHub front page](https://github.com/scikit-hep/awkward-1.0#readme) directs users and developers to the appropriate documentation.

A JupyterBook of "how to" tutorials and "how it works" guides will be written soon.

The [C++ API reference](https://awkward-array.readthedocs.io/en/latest/_static/index.html) (Doxygen) is **complete**.

The [Python API reference](https://awkward-array.readthedocs.io/en/latest/index.html) (Sphinx) is **complete**. This also means that all public functions have docstrings.

We even have a [release history](https://awkward-array.readthedocs.io/en/latest/_auto/changelog.html) (generated using the GitHub API) and [CONTRIBUTING.md](https://github.com/scikit-hep/awkward-1.0/blob/master/CONTRIBUTING.md).

## 12. Bleeding edge: PartitionedArray and VirtualArray

These were the last two types needed for Uproot (for non-broken lazy arrays, specifically).

They are also the entry point for Dask integration.

Nick can write NanoEvents now.  :)

In [None]:
cache = {}

def genx(partition):
    print("x for {}".format(partition))
    return ak.Array(np.arange(5*partition, 5*partition + 5))

def geny(partition):
    print("y for {}".format(partition))
    return ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5], [6.6], [7.7, 8.8, 9.9]])

lazy_array = ak.partitioned(lambda i: ak.zip({"x": ak.virtual(genx, (i,), length=5, cache=cache),
                                              "y": ak.virtual(geny, (i,), length=5, cache=cache)}, depth_limit=1),
                            100)
print(lazy_array)

In [None]:
print(lazy_array.x)

In [None]:
print(lazy_array.x + 1000)

Lazy arrays are less likely to be evaluated the more information you give them:

   * `length` (as above): so that it doesn't have to be evaluated to figure out how big each partition is
   * `form` (not shown): so that it doesn't have to be evaluated to figure out what its type is

In a system like NanoEvents, both the `length` and the `form` should be supplied.
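
The effect of supplying `length` can be demonstrated with a toy lazy column. `VirtualColumn` is a hypothetical class written for this illustration, not Awkward's VirtualArray: it calls its generator only when the answer cannot be obtained from the hints or the cache.

```python
class VirtualColumn:
    def __init__(self, generate, args, length=None, cache=None):
        self._generate = generate
        self._args = tuple(args)
        self._length = length
        self._cache = {} if cache is None else cache

    def materialize(self):
        # Generate at most once per (function, arguments) pair.
        key = (self._generate.__name__, self._args)
        if key not in self._cache:
            self._cache[key] = self._generate(*self._args)
        return self._cache[key]

    def __len__(self):
        if self._length is not None:
            return self._length            # known in advance: no evaluation
        return len(self.materialize())     # unknown: must evaluate

calls = []

def genx(partition):
    calls.append(partition)
    return list(range(5 * partition, 5 * partition + 5))

cache = {}
with_length = VirtualColumn(genx, (0,), length=5, cache=cache)
without_length = VirtualColumn(genx, (1,), cache=cache)

n_known = len(with_length)          # answered from the hint
calls_after_known = list(calls)
n_unknown = len(without_length)     # forces genx(1) to run
```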