# PyHEP 2022 - Using C++ From Numba, Fast and Automatic

## Numba

The scientific community has developed various techniques for writing code easily while also accelerating its execution. Numba is one such library. It translates Python functions to optimized machine code at runtime using the industry-standard LLVM compiler framework. Numba-compiled numerical algorithms in Python can approach the speed of C or Fortran.

### Numba: Tradeoff between flexibility and performance

Python makes it easy to convert between different data types. This is because every Python object is a PyObject, which can hold any data type used in Python.

![Boxing in Python](box_pyobject.svg "Boxing float in PyObject")

When you store a floating-point number in a Python variable, Python first wraps it in a PyObject, which is called boxing, and the variable then stores a pointer to this PyObject. Whenever the native value (the floating-point number in this case) is needed for a calculation, it has to be unboxed from the PyObject first.

![Unboxing in Python](unbox_pyobject.svg "Unboxing float from PyObject")

These boxing and unboxing operations are detrimental to performance but provide the flexibility that Python's duck typing needs.
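
The size difference between a boxed and a native value can be seen from plain Python. The following is a minimal sketch that assumes CPython; the exact size of the boxed object depends on the interpreter build, so only the comparison matters.

```python
import struct
import sys

value = 3.14

# A Python float is a full PyObject: a header (reference count, type pointer)
# followed by the native C double payload.
boxed_size = sys.getsizeof(value)

# The native payload on its own is a C double.
native_size = struct.calcsize("d")

print(f"boxed float  : {boxed_size} bytes")
print(f"native double: {native_size} bytes")

# A variable stores a pointer to the box, so two names can share one PyObject.
a = value
b = a
print(a is b)  # True: both names point to the same box
```

Every arithmetic operation on `value` has to go through this box, which is the overhead that Numba removes inside a compiled function.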

Numba, on the other hand, gives up this flexibility for performance. It unboxes the inputs of the function, and the whole function then runs on native values rather than PyObjects. At the end, the output is boxed so that Python can use it. For this to work, Numba needs to figure out the types not only of the inputs and outputs but of the intermediate variables as well. Once the types are inferred, Numba converts the Python code into LLVM IR using the Python package llvmlite. LLVM IR is the intermediate representation that LLVM uses throughout its compiler framework; it is device independent and can easily be converted into assembly with LLVM tools.

![Numba working](numba.svg "Numba only deals with native values in the LLVM IR")

The drawback of this approach is that if the types in the program cannot be determined, the speedup will be minimal.
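
Because the generated code only works for the types it was inferred for, a JIT compiler has to keep one compiled version per combination of argument types. The toy model below sketches that bookkeeping in pure Python. `ToyDispatcher` is a hypothetical name for illustration and is not Numba's implementation; the "compile" step is a stand-in that simply returns the original function.

```python
class ToyDispatcher:
    """Toy model of a JIT dispatcher: one 'compiled' version per type signature."""

    def __init__(self, py_func):
        self.py_func = py_func
        self.specializations = {}  # type signature -> "compiled" callable
        self.compile_count = 0

    def _compile(self, signature):
        # Stand-in for type inference plus LLVM code generation.
        self.compile_count += 1
        return self.py_func

    def __call__(self, *args):
        signature = tuple(type(arg).__name__ for arg in args)
        if signature not in self.specializations:
            self.specializations[signature] = self._compile(signature)
        return self.specializations[signature](*args)


@ToyDispatcher
def add(x, y):
    return x + y


add(1, 2)      # first (int, int) call triggers a "compilation"
add(3, 4)      # same signature, the cached version is reused
add(1.0, 2.0)  # new (float, float) signature triggers another one

print(sorted(add.specializations))  # [('float', 'float'), ('int', 'int')]
print(add.compile_count)            # 2
```

The cache lookup is cheap, while each new signature pays the full compilation cost once.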

### Performance benefits from Numba

To measure the performance benefit of Numba we use a simple function. It sums the `tanh` of the diagonal elements of a NumPy matrix and adds that sum to every element of the matrix. __The only difference between the two versions is the Numba decorator on line 15 of the cell below.__

In [1]:
from numba import jit
import numpy as np
import time

################ Pure Python ###############
# Function is not compiled and runs in byte code
def go_slow(a):
    trace = 0.0
    for i in range(a.shape[0]):
        trace += np.tanh(a[i, i])
    return a + trace

################ Numba ###############
# Function is compiled and runs in machine code
@jit(nopython=True) # <--------------- Numba decorator
def go_fast(a):
    trace = 0.0
    for i in range(a.shape[0]):
        trace += np.tanh(a[i, i])
    return a + trace

In [2]:
x = np.arange(100).reshape(10, 10)

start = time.perf_counter()
go_slow(x)
end = time.perf_counter()
python_1r = end - start

start = time.perf_counter()
go_slow(x)
end = time.perf_counter()
python_2r = end - start

start = time.perf_counter()
go_fast(x)
end = time.perf_counter()
numba_wc = end - start

start = time.perf_counter()
go_fast(x)
end = time.perf_counter()
numba_ac = end - start

In [3]:
print(f"Python Function: Elapsed (1st run)    = {python_1r:.7f}s")
print(f"Numba Function : Elapsed (with comp)  = {numba_wc:.7f}s")
print()
print(f"Python Function: Elapsed (2nd run)    = {python_2r:.7f}s")
print(f"Numba Function : Elapsed (after comp) = {numba_ac:.7f}s")

Python Function: Elapsed (1st run)    = 0.0001171s
Numba Function : Elapsed (with comp)  = 0.4553034s

Python Function: Elapsed (2nd run)    = 0.0001658s
Numba Function : Elapsed (after comp) = 0.0000494s


These results show that on the first run the Numba function takes far longer than the plain Python function. This is because Numba has to compile the function, through LLVM IR, to machine code on the first call, whereas the plain Python function does not have to go through these extra steps. After compilation is where Numba shines: here the second call is roughly three times faster than pure Python, and depending on the program the speedup can reach one or two orders of magnitude.
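
The measurement pattern used above (time the first call separately, then time later calls) can be wrapped in a small helper. This is a sketch using only the standard library; `time_first_and_steady` and `fake_jit_function` are hypothetical names, and the one-time `sleep` merely imitates a compilation cost so that the example runs without Numba.

```python
import time


def time_first_and_steady(func, *args, repeats=5):
    """Return (first call time, best time of the following calls) in seconds."""
    start = time.perf_counter()
    func(*args)
    first = time.perf_counter() - start

    steady = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        steady.append(time.perf_counter() - start)
    return first, min(steady)


_already_compiled = set()


def fake_jit_function(n):
    """Stand-in for a jitted function: pays a one-time cost on the first call."""
    if "done" not in _already_compiled:
        time.sleep(0.05)  # imitates the compilation step
        _already_compiled.add("done")
    return sum(range(n))


first, steady = time_first_and_steady(fake_jit_function, 1000)
print(f"first call (with one-time cost) = {first:.7f}s")
print(f"later calls (best of 5)         = {steady:.7f}s")
```

Taking the minimum over several later calls reduces the noise that a single timing of a microsecond-scale function would show.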

## Cppyy

Cppyy is an automatic, run-time Python-C++ bindings generator for calling C++ from Python and Python from C++. It is the basis of PyROOT and as such has been a major enabler for many experiments to use their own data models in Python.

### Cppyy: Why should you care about C++?

Most of the codebase used by the scientific community is written in C++. Porting it to a language such as Python would require a huge workforce and would also cost performance, so it is not a feasible option. With cppyy, the C++ data models of the respective HEP experiments are readily available to the user, together with the ease of prototyping that Python provides. It really is the best of both worlds.

### Benefits of using Cppyy with Numba

1) __Numba makes loops fast:__ When using cppyy from Python, loops written in Python are slow compared to languages such as C/C++. Numba alleviates this problem and can make them as fast as C without much code instrumentation.

2) __Code completely in Python:__ This makes debugging easier. To debug Numba-instrumented code you can either comment out the decorator line and debug the code as ordinary Python, or use gdb through `numba.gdb`. Numba also has a variety of flags that can be turned on to see tracebacks and the intermediate steps it takes. _This is easier than debugging code that is set up in Python and uses RDF for hotspots._

3) __No conversions in the IR:__ Cppyy-bound C++ can be lowered to LLVM IR cleanly, so we do not spend any time on type conversions and gain the maximum possible speedup.

4) __Two worlds close together:__ You can switch between C++ and Python whenever you want.
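
Instead of commenting the decorator in and out while debugging, the choice can be made in one place. The sketch below assumes a hypothetical environment variable `DEBUG_PLAIN_PYTHON` and a hypothetical helper `maybe_njit`; neither is part of Numba or cppyy. It also falls back to plain Python when Numba is not installed, so it runs either way.

```python
import math
import os

try:
    import numba
except ImportError:  # keep the sketch runnable without Numba installed
    numba = None

# Hypothetical switch for this sketch, not a Numba setting.
DEBUG_PLAIN_PYTHON = os.environ.get("DEBUG_PLAIN_PYTHON", "0") == "1"


def maybe_njit(func):
    """Return the jitted function, or the untouched Python function when debugging."""
    if numba is None or DEBUG_PLAIN_PYTHON:
        return func
    return numba.njit(func)


@maybe_njit
def pt(px, py):
    return math.sqrt(px ** 2 + py ** 2)


print(pt(3.0, 4.0))  # 5.0 with or without compilation
```

With the switch on, breakpoints and `print` calls inside `pt` behave as in any Python function. Numba's own `NUMBA_DISABLE_JIT` environment variable offers a similar global switch.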

### Performance

As in the `tanh` example used to compare Numba with Python, we now use the C++ `tanh` (accessed as `cppyy.gbl.tanh`) and compare its performance against the Numba-only version. We replace only the `tanh` call; nothing else changes.

In [4]:
import cppyy
import cppyy.numba_ext

################ Cppyy ###############
# Function is compiled and runs in machine code
@jit(nopython=True)
def go_fast_cppyy(a):
    trace = 0.0
    for i in range(a.shape[0]):
        trace += cppyy.gbl.tanh(a[i, i]) # <---------------- Replaces np.tanh
    return a + trace

In [5]:
start = time.perf_counter()
go_fast_cppyy(x)
end = time.perf_counter()
cppyy_wc = end - start

start = time.perf_counter()
go_fast_cppyy(x)
end = time.perf_counter()
cppyy_ac = end - start

In [6]:
print(f"Python Function: Elapsed (1st run)    = {python_1r:.7f}s")
print(f"Numba Function : Elapsed (with comp)  = {numba_wc:.7f}s")
print(f"Cppyy Function : Elapsed (with comp)  = {cppyy_wc:.7f}s")
print()
print(f"Python Function: Elapsed (2nd run)    = {python_2r:.7f}s")
print(f"Numba Function : Elapsed (after comp) = {numba_ac:.7f}s")
print(f"Cppyy Function : Elapsed (after comp) = {cppyy_ac:.7f}s")

Python Function: Elapsed (1st run)    = 0.0001171s
Numba Function : Elapsed (with comp)  = 0.4553034s
Cppyy Function : Elapsed (with comp)  = 0.1789647s

Python Function: Elapsed (2nd run)    = 0.0001658s
Numba Function : Elapsed (after comp) = 0.0000494s
Cppyy Function : Elapsed (after comp) = 0.0000485s


The results show that the overhead of using cppyy inside a Numba function is minimal: the elapsed time is almost the same as for the Numba-only function.

## Features

### 1) Plug and Play

In [7]:
import numba             # Working with numba
import cppyy             # Imports the cppyy library
import cppyy.numba_ext   # Imports the necessary information for numba to work with cppyy
import math



@numba.jit(nopython=True)
def cpp_sqrt(x):
    return cppyy.gbl.sqrt(x)

print("Sqrt of 4: ", cpp_sqrt(4.0))
print("Sqrt of Pi: ", cpp_sqrt(math.pi))

Sqrt of 4:  2.0
Sqrt of Pi:  1.7724538509055159


### 2) Overload selection

In [8]:
cppyy.cppdef("""
int mul(int x) { return x * 2; }
float mul(float x) { return x * 3; }
""")

@numba.jit(nopython=True)
def oversel(a):
    total = type(a[0])(0)
    for i in range(len(a)):
        total += cppyy.gbl.mul(a[i])
    return total

a = np.array(range(10), dtype=np.float32)
print("Array: ", a)
print("Overload selection output: ", oversel(a))

a = np.array(range(10), dtype=np.int32)
print("Array: ", a)
print("Overload selection output: ", oversel(a))

Array:  [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
Overload selection output:  135.0
Array:  [0 1 2 3 4 5 6 7 8 9]
Overload selection output:  90
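
For readers less familiar with C++ overloads, the selection above can be mimicked in pure Python with `functools.singledispatch`. This is only an analogy under the assumption that dispatch on the Python type stands in for the C++ `int` and `float` overloads; it is not how cppyy resolves overloads internally.

```python
from functools import singledispatch


@singledispatch
def mul(x):
    raise TypeError(f"no overload for {type(x).__name__}")


@mul.register
def _(x: int):
    return x * 2  # mirrors: int mul(int x)


@mul.register
def _(x: float):
    return x * 3  # mirrors: float mul(float x)


def oversel_py(values):
    total = type(values[0])(0)
    for v in values:
        total += mul(v)
    return total


print(oversel_py([float(i) for i in range(10)]))  # 135.0
print(oversel_py(list(range(10))))                # 90
```

The two results match the output of the jitted `oversel`: the element type of the array decides which overload is called.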


### 3) Template instantiation

In [9]:
import cppyy
import cppyy.numba_ext
import numba
import numpy as np

cppyy.cppdef("""
template<typename T>
T square(T t) { return t*t; }
""")

@numba.jit(nopython=True)
def tsa(a):
    total = type(a[0])(0)
    for i in range(len(a)):
        total += cppyy.gbl.square(a[i])
    return total

a = np.array(range(10), dtype=np.float32)
print("Array: ", a)
print("Sum of squares: ", tsa(a))

a = np.array(range(10), dtype=np.int32)
print("Array: ", a)
print("Sum of squares: ", tsa(a))

Array:  [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
Sum of squares:  285.0
Array:  [0 1 2 3 4 5 6 7 8 9]
Sum of squares:  285


## Demos

### 1) Numba physics example

Taken from:
https://github.com/numba/numba-examples/blob/master/examples/physics/lennard_jones/numba_scalar_impl.py

In [10]:
import numba
import cppyy

cppyy.cppdef("""
#include <vector>

struct Atom {
    float x;
    float y;
    float z;
};

std::vector<Atom> atoms = {{1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {4, 5, 6}, {5, 6, 7}};
""")

@numba.njit
def lj_numba_scalar(r):
    sr6 = (1./r)**6
    pot = 4.*(sr6*sr6 - sr6)
    return pot


@numba.njit
def distance_numba_scalar(atom1, atom2):
    dx = atom2.x - atom1.x
    dy = atom2.y - atom1.y
    dz = atom2.z - atom1.z

    r = (dx * dx + dy * dy + dz * dz) ** 0.5

    return r



def potential_numba_scalar(cluster):
    energy = 0.0
    for i in range(5 - 1):
        for j in range(i + 1, 5):
            r = distance_numba_scalar(cluster[i], cluster[j])
            e = lj_numba_scalar(r)
            energy += e
            
    return energy

print(potential_numba_scalar(cppyy.gbl.atoms))

-0.5780277345740283
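
The printed energy can be cross-checked without Numba or cppyy. The functions above evaluate the pair potential $V(r) = 4\,(r^{-12} - r^{-6})$ and sum it over all pairs of atoms. The sketch below repeats that sum in plain Python; it hard-codes the same five coordinates and works in double precision, so agreement is expected only up to rounding.

```python
import math
from itertools import combinations

# Same coordinates as the C++ std::vector<Atom> above.
atoms = [(1, 2, 3), (2, 3, 4), (3, 4, 5), (4, 5, 6), (5, 6, 7)]


def lj(r):
    """Pair potential 4 * (r**-12 - r**-6)."""
    sr6 = (1.0 / r) ** 6
    return 4.0 * (sr6 * sr6 - sr6)


# Sum the pair potential over every unordered pair of atoms.
energy = sum(lj(math.dist(a, b)) for a, b in combinations(atoms, 2))
print(f"{energy:.6f}")  # -0.578028
```

Neighbouring atoms are a distance $\sqrt{3}$ apart and dominate the sum; the more distant pairs contribute less than 2% of the total.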


### 2) Using the extension with ROOT

`TLorentzVector` is a ROOT class with four components:
`Px`, `Py`, `Pz` and `E`.

It also provides the transverse momentum, which can be calculated as:

$$Pt = \sqrt{Px^2+Py^2}$$
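
As a quick worked instance of the formula, the sketch below evaluates it for one hand-picked momentum using only the standard library. The helper name `transverse_momentum` is an assumption for illustration and is not part of ROOT.

```python
import math


def transverse_momentum(px, py):
    """Pt = sqrt(Px^2 + Py^2); Pz and E do not enter."""
    return math.hypot(px, py)


px, py = 3.0, 4.0
print(transverse_momentum(px, py))        # 5.0
print(math.sqrt(px ** 2 + py ** 2))       # 5.0, the explicit form of the formula
```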

In [11]:
########################################## Setup Code ###############################
import numba
import math
import ROOT
import cppyy.numba_ext
import time

ROOT.gInterpreter.Declare("""
std::vector<TLorentzVector> vec_lv;

const int no_of_samples = 1000;

void fill() {
  vec_lv.reserve(no_of_samples);
  TRandom3 R(111);
  
  for (int i = 0; i < no_of_samples; ++i) {
    double Px = R.Gaus(0,10);
    double Py = R.Gaus(0,10);
    double Pz = R.Gaus(0,10);
    double E  = R.Gaus(100,10);
    vec_lv.push_back(TLorentzVector(Px, Py, Pz, E));
  }
}
""")
ROOT.gInterpreter.ProcessLine("""
fill();
""")


Welcome to JupyROOT 6.27/01


0

In this example we calculate the same quantity in Python and show how the calculation can be sped up with Numba.
The `calc_pt` function uses pure Python to calculate `Pt`, whereas `numba_calc_pt` uses Numba to do the same. As before, the only __difference between the two is the Numba decorator__, so you do not need to change anything else.

In [12]:

def calc_pt(lv):
    return math.sqrt(lv.Px() ** 2 + lv.Py() ** 2)

def calc_pt_vec(vec_lv):
    pt = []
    for i in range(vec_lv.size()):
        pt.append((calc_pt(vec_lv[i]), vec_lv[i].Pt()))
    return pt


@numba.njit
def numba_calc_pt(lv):
    return math.sqrt(lv.Px() ** 2 + lv.Py() ** 2)

def numba_calc_pt_vec(vec_lv):
    pt = []
    for i in range(vec_lv.size()):
        pt.append((numba_calc_pt(vec_lv[i]), vec_lv[i].Pt()))
    return pt


In [13]:
start = time.perf_counter()
pt = calc_pt_vec(ROOT.vec_lv)
end = time.perf_counter()
python_elapsed = end - start

start = time.perf_counter()
pt = numba_calc_pt(ROOT.vec_lv[0])
end = time.perf_counter()
numba_warmup = end - start

start = time.perf_counter()
pt = numba_calc_pt_vec(ROOT.vec_lv)
end = time.perf_counter()
numba_elapsed = end - start

print(f"Numba'd    : Warmup  = {numba_warmup  :.5f}s")
print()
print(f"Pure Python: Elapsed = {python_elapsed:.5f}s")
print(f"Numba'd    : Elapsed = {numba_elapsed :.5f}s")

print(f"Speedup              = {python_elapsed / numba_elapsed:.5f}x")

no_of_samples = 3
print("\nCalc pT \tActual pT")
print("---------------------------")
print(*(f"{x:2.5f} \t{y:2.5f}" for x,y in pt[:no_of_samples]), sep="\n")

if False in tuple(x==y for x, y in pt):
    print("\nSome values do not match")
else:
    print("\nAll values match")

Numba'd    : Warmup  = 0.03905s

Pure Python: Elapsed = 0.03697s
Numba'd    : Elapsed = 0.00366s
Speedup              = 10.11394x

Calc pT 	Actual pT
---------------------------
8.95222 	8.95222
4.11973 	4.11973
25.97929 	25.97929

All values match


### 3) RDF

You can also use the extension inside RDF through `ROOT.Numba.Declare`. Below is a simple example that uses it to compute a power function.

In [14]:
import numba
import ROOT
import cppyy.numba_ext

ROOT.gInterpreter.Declare("""
double cpppow(double x, int y) { return pow(x, y); }
""")

@ROOT.Numba.Declare(['double', 'int'], 'double')
def pypownd(x, y):
    return ROOT.cpppow(x, y) # <--------- Numba.Declare supports ROOT python due to the numba extension


ROOT.gInterpreter.ProcessLine("""
cout << "2^3 = " << Numba::pypownd(2, 3) << endl
     << "4^5 = " << Numba::pypownd(4, 5) << endl;""")
print()

# Or we can use the callable as well within a RDataFrame workflow.
data = ROOT.RDataFrame(4).Define('x', '(float)rdfentry_')\
                         .Define('x_pow3', 'Numba::pypownd(x, 3)')\
                         .AsNumpy()
 
print('pypownd({}, 3) = {}'.format(data['x'], data['x_pow3']))


pypownd([0. 1. 2. 3.], 3) = [ 0.  1.  8. 27.]
2^3 = 8
4^5 = 1024


## Future work

1) __Complete C++ feature support__:

    - implicit conversions
    - memory management
    - constructor support
    - virtual inheritance

2) __Inlining__:

For the code:
```python
def numba_calc_pt(lv):
    return math.sqrt(lv.Px() ** 2 + lv.Py() ** 2)
```

The equivalent LLVM IR is:
```llvm
define i32 @_ZN8__main__13numba_calc_ptB2v1B38c8tJTIcFHzwl2ILiXkcBV0KBSgP9CGZpAgA_3dE28CppClass_28TLorentzVector_29(double* noalias nocapture %retptr, { i8*, i32, i8* }** noalias nocapture readnone %excinfo, { i32, i32, i32, i32, i32, i32 }* %arg.lv) local_unnamed_addr {
entry:
  %.5 = bitcast { i32, i32, i32, i32, i32, i32 }* %arg.lv to i8*
  %.614 = load double (i8*)*, double (i8*)** bitcast (i8** @numba.dynamic.globals.7fbfd71f7080 to double (i8*)**), align 8
  %.8 = tail call double %.614(i8* %.5)
  %.147.i = fmul double %.8, %.8
  %.3615 = load double (i8*)*, double (i8*)** bitcast (i8** @numba.dynamic.globals.7fbfd6d7b080 to double (i8*)**), align 8
  %.38 = tail call double %.3615(i8* %.5)
  %.147.i9 = fmul double %.38, %.38
  %.63 = fadd double %.147.i, %.147.i9
  store double %.63, double* %retptr, align 8
  ret i32 0
}
```
For best performance, the line ```%.3615 = load double (i8*)*, double (i8*)** bitcast (i8** @numba.dynamic.globals.7fbfd6d7b080 to double (i8*)**), align 8``` should be replaced by a memory access, which will require rebasing llvmlite on top of Cling. The C++ symbols that Cling has parsed will then be available at the LLVM IR level, so the function call can be replaced easily.

3) __GPU support__

4) __Automatic parallelization__: We want to support automatic parallelization using OpenMP.


## Extra Examples

In [15]:
import numba
import cppyy
import numpy as np

cppyy.cppdef("""
float arr1[] = {0.1, 0.2, 0.3, 0.4, 0.5};
""")

@numba.njit
def square_arr(arr):
    ret = []
    for i in range(5):
        x = arr[i] ** 2
        ret.append(x)
            
    return np.array(ret)

print(square_arr(cppyy.gbl.arr1))

[0.01       0.04       0.09       0.16000001 0.25      ]


In [16]:
import cppyy
import numba
import numpy as np

cppyy.cppdef("""\
class MyData {
public:
    MyData(int i, int j) : fField1(i), fField2(j) {}

public:
    int get_field1() { return fField1; }
    int get_field2() { return fField2; }

    MyData copy() { return *this; }

public:
    int fField1;
    int fField2;
};""")

@numba.jit(nopython=True)
def tsdf(a, d):
    total = type(a[0])(0)
    for i in range(len(a)):
        total += a[i] + d.fField1 + d.fField2
    return total

d = cppyy.gbl.MyData(5, 6)
a = np.array(range(10), dtype=np.int32)
print(tsdf(a, d))

# example of method calls
@numba.jit(nopython=True)
def tsdm(a, d):
    total = type(a[0])(0)
    for i in range(len(a)):
        total += a[i] +  d.get_field1() + d.get_field2()
    return total

print(tsdm(a, d))

# example of object return by-value
@numba.jit(nopython=True)
def tsdcm(a, d):
    total = type(a[0])(0)
    for i in range(len(a)):
        total += a[i] + d.copy().fField1 + d.get_field2()
    return total

print(tsdcm(a, d))



155
155
155


In [17]:
import numba
import cppyy

cppyy.cppdef("""
float arr2[] = {0.1, 0.2, 0.3, 0.4, 0.5};
""")

@numba.njit
def sum_arr(arr):
    energy = 0.0
    for i in range(5):
        energy += arr[i]
            
    return energy

print(sum_arr(cppyy.gbl.arr2))

1.5000000223517418
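
The trailing digits in the sum above are not an error in the extension. The C++ array holds single-precision `float` values, and the accumulator `energy` is a double, so each element carries its float32 rounding into the sum. The sketch below reproduces that effect with the standard library only; `to_float32` is a hypothetical helper that rounds a Python float to single precision.

```python
import struct


def to_float32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


values = [0.1, 0.2, 0.3, 0.4, 0.5]
as_float32 = [to_float32(v) for v in values]

# Accumulate the single-precision values in a double, as sum_arr does.
total = 0.0
for v in as_float32:
    total += v

print(total)        # slightly above 1.5, as in the output above
print(sum(values))  # the same sum with double-precision inputs, for comparison
```

Declaring the C++ array as `double` instead of `float` would be one way to avoid the extra rounding.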
