# Python for High Performance Computing
# <span style="font-family: Courier New, Courier, monospace;">numba</span>
<hr style="border: solid 4px green">
<br>
<center> <img src="images/arc_logo.png"; alt="Logo" style="float: center; width: 20%"></center>
<br>
## http://www.arc.ox.ac.uk
## support@arc.ox.ac.uk

## Overview
<hr style="border: solid 4px green">

### Numba
* project sponsored by Continuum Analytics
* its core is a compiler for Python array and numerical functions
* **J**ust **I**n **T**ime (JIT) compilation
* exploits the LLVM compiler to generate optimised machine code
<br><br>

### Features
* on-the-fly code generation (at import time or runtime, according to user choice)
* native code generation for the CPU (default) and GPU
* integration with the Python scientific software stack (thanks to `NumPy`)

## Example
<hr style="border: solid 4px green">

### Fibonacci sequence

In [1]:
import numpy

def fib (n):
    """Compute Fibonacci sequence"""
    # the output is floats to cope with large n
    s = numpy.zeros(n, dtype=numpy.float64)
    if n >= 1:
        s[0] = 1.0

    if n >= 2:
        s[1] = 1.0

    for i in range (2, n):
        s[i] = s[i-1] + s[i-2]

    return s

In [2]:
% timeit fib (1024)

1000 loops, best of 3: 285 µs per loop


In [3]:
import numpy
from numba import jit

@jit
def fibJIT (n):
    """Compute Fibonacci sequence"""
    s = numpy.zeros(n, dtype=numpy.float64)
    if n >= 1:
        s[0] = 1.0

    if n >= 2:
        s[1] = 1.0

    for i in range (2, n):
        s[i] = s[i-1] + s[i-2]

    return s

In [4]:
# run once to generate machine code and cache results
fibJIT (16)
# now, time it
% timeit fibJIT (1024)

100000 loops, best of 3: 3.3 µs per loop


### Nearly 100x speedup (285 µs vs 3.3 µs) with 1 decorator!

## How <span style="font-family: Courier New, Courier, monospace;">numba</span> works
<hr style="border: solid 4px green">

<img src="./images/how-numba-works.jpg"; alt="Logo" style="float: center; width: 80%">

## How <span style="font-family: Courier New, Courier, monospace;">numba</span> works (cont'd)

### Consider the following simple example

In [6]:
# annotate function
@jit
def add_int (a, b):
    return a + b

In [7]:
# run function
add_int (1, 2)

3

### Running the function once
* `numba` examines the Python code and translates this into an 'intermediate representation' (IR)
* LLVM translates the IR into compiled code

### Peek behind the magic: the IR
* the IR can be inspected using the `inspect_types` method

In [8]:
add_int.inspect_types()

add_int (int64, int64)
--------------------------------------------------------------------------------
# File: <ipython-input-6-c54ac082f96d>
# --- LINE 2 --- 

@jit

# --- LINE 3 --- 

def add_int (a, b):

    # --- LINE 4 --- 
    # label 0
    #   a = arg(0, name=a)  :: int64
    #   b = arg(1, name=b)  :: int64
    #   $0.3 = a + b  :: int64
    #   del b
    #   del a
    #   $0.4 = cast(value=$0.3)  :: int64
    #   del $0.3
    #   return $0.4

    return a + b




### Peek behind the magic: LLVM
* the LLVM translation can be inspected using the `inspect_llvm()` method (returns a dictionary)

In [8]:
for k, v in add_int.inspect_llvm().items():
    print(k, v)

((int64, int64), '; ModuleID = \'add_int\'\ntarget datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"\ntarget triple = "x86_64-apple-darwin15.0.0"\n\n@PyExc_RuntimeError = external global i8\n@.const.add_int = internal constant [8 x i8] c"add_int\\00"\n@".const.Fatal error: missing _dynfunc.Closure" = internal constant [38 x i8] c"Fatal error: missing _dynfunc.Closure\\00"\n@".const.missing Environment" = internal constant [20 x i8] c"missing Environment\\00"\n\n; Function Attrs: nounwind\ndefine i32 @"__main__.add_int$2.int64.int64"(i64* noalias nocapture %retptr, { i8*, i32 }** noalias nocapture readnone %excinfo, i8* noalias nocapture readnone %env, i64 %arg.a, i64 %arg.b) #0 {\nentry:\n  %.15 = add nsw i64 %arg.b, %arg.a\n  store i64 %.15, i64* %retptr, align 8\n  ret i32 0\n}\n\ndefine i8* @"cpython.__main__.add_int$2.int64.int64"(i8* %py_closure, i8* %py_args, i8* nocapture readnone %py_kws) {\nentry:\n  %.5 = alloca i8*, align 8\n  %.6 = alloca i8*, align 8\n  %.7 = call i32 (

### JIT compilation = IR translation + LLVM compilation
* powerful type inference
* done once per data type
  * cached for later use: subsequent calls to same function are optimally fast
  * multi-dispatch compilation: a function called with two different types generates two cached versions

## How <span style="font-family: Courier New, Courier, monospace;">numba</span> works (cont'd)
<hr style="border: solid 4px green">

### Type inference recognises simple types (integer, floats)
* optimal code is generated
<br><br>

### However,
* Python objects are used when simple types cannot be inferred
* which is what happens when you try something that is natural in Python, but ambiguous

In [9]:
@jit
def add_strings(a, b):
    return a + b

In [10]:
add_strings ("str", "ing")

'string'

### It worked, but ...
* what does `inspect_types` reveal?

In [11]:
add_strings.inspect_types()

add_strings (str, str)
--------------------------------------------------------------------------------
# File: <ipython-input-9-cae23f949469>
# --- LINE 1 --- 

@jit

# --- LINE 2 --- 

def add_strings(a, b):

    # --- LINE 3 --- 
    # label 0
    #   a = arg(0, name=a)  :: pyobject
    #   b = arg(1, name=b)  :: pyobject
    #   $0.3 = a + b  :: pyobject
    #   del b
    #   del a
    #   $0.4 = cast(value=$0.3)  :: pyobject
    #   del $0.3
    #   return $0.4

    return a + b




### <span style="font-family: Courier New, Courier, monospace;">pyobject</span>: back to the safety of Python objects
* function compiled in "object" mode
* can be faster than regular Python, but not by much
<br><br>

### <span style="font-family: Courier New, Courier, monospace;">nopython</span>: force <span style="font-family: Courier New, Courier, monospace;">numba</span> to avoid objects and complain about problems when it cannot
* performance requires simple types (*e.g.* `int64`)
* to pinpoint throttled performance, set `nopython=True` to force `numba` to
  * compile everything down to low level types or
  * bail out if it runs into trouble
* useful in "debug" mode: when performance is not what is expected
  * force `nopython=True`
  * see if there are problems
  * modify Python code and repeat until right
<br><br>

> *Note*: `jit(nopython=True)` is equivalent to `njit`

In [13]:
@jit (nopython=True)
def add_strings(a, b):
    return a + b

add_strings ("str", "ing")

TypingError: Caused By:
Traceback (most recent call last):
  File "/Users/mihai/anaconda/lib/python2.7/site-packages/numba/compiler.py", line 249, in run
    stage()
  File "/Users/mihai/anaconda/lib/python2.7/site-packages/numba/compiler.py", line 466, in stage_nopython_frontend
    self.locals)
  File "/Users/mihai/anaconda/lib/python2.7/site-packages/numba/compiler.py", line 807, in type_inference_stage
    infer.propagate()
  File "/Users/mihai/anaconda/lib/python2.7/site-packages/numba/typeinfer.py", line 613, in propagate
    raise errors[0]
TypingError: Invalid usage of + with parameters (str, str)
Known signatures:
 * (int64, int64) -> int64
 * (int64, uint64) -> int64
 * (uint64, int64) -> int64
 * (uint64, uint64) -> uint64
 * (float32, float32) -> float32
 * (float64, float64) -> float64
 * (complex64, complex64) -> complex64
 * (complex128, complex128) -> complex128
 * (uint64,) -> uint64
 * (uint16,) -> uint64
 * (uint8,) -> uint64
 * (uint32,) -> uint64
 * (int32,) -> int64
 * (int16,) -> int64
 * (int64,) -> int64
 * (int8,) -> int64
 * (float32,) -> float32
 * (float64,) -> float64
 * (complex64,) -> complex64
 * (complex128,) -> complex128
 * parameterized
File "<ipython-input-13-682ecf218f6d>", line 3

Failed at nopython (nopython frontend)
Invalid usage of + with parameters (str, str)
Known signatures:
 * (int64, int64) -> int64
 * (int64, uint64) -> int64
 * (uint64, int64) -> int64
 * (uint64, uint64) -> uint64
 * (float32, float32) -> float32
 * (float64, float64) -> float64
 * (complex64, complex64) -> complex64
 * (complex128, complex128) -> complex128
 * (uint64,) -> uint64
 * (uint16,) -> uint64
 * (uint8,) -> uint64
 * (uint32,) -> uint64
 * (int32,) -> int64
 * (int16,) -> int64
 * (int64,) -> int64
 * (int8,) -> int64
 * (float32,) -> float32
 * (float64,) -> float64
 * (complex64,) -> complex64
 * (complex128,) -> complex128
 * parameterized
File "<ipython-input-13-682ecf218f6d>", line 3

## Decorators
<hr style="border: solid 4px green">

### <span style="font-family: Courier New, Courier, monospace;">numba</span> JIT works via decorators

| Decorator | Function |
| :--- | :---- |
| `jit` | compile Python function into native code |
| `cfunc` | compile Python function into a C callback (usable by foreign C/C++ libraries from Python) |
| `vectorize` | generate ufuncs operating on `NumPy` arrays of like size |
| `guvectorize` | generate ufuncs operating on `NumPy` arrays of arbitrary sizes |
<br><br>

### *Note*: <span style="font-family: Courier New, Courier, monospace;">vectorize</span> and <span style="font-family: Courier New, Courier, monospace;">guvectorize</span> can be parallelised

## <span style="font-family: Courier New, Courier, monospace;">numba</span> JIT compilation
<hr style="border: solid 4px green">

### Two options
* lazy (this option is already illustrated above!)
* eager

### Lazy (call-time) compilation -- <span style="font-family: Courier New, Courier, monospace;">numba</span> decides when & how to optimise
* compilation deferred until the first function call
* argument types inferred at call time & optimised code generated based on this
* separate (specialised) versions of same function are generated depending on the input type
* *Pros*: very easy to use
* *Cons*: compilation can default to Python objects

### Eager (decoration-time) compilation -- <span style="font-family: Courier New, Courier, monospace;">numba</span> instructed via function signature
* restrict types
  * types of interest
  * avoid Python objects
* type signatures are passed to the decorator
* *Pros*: good practice, leads to controlled (good?) performance
* *Cons*: a little bit more planning than just decoration

## The <span style="font-family: Courier New, Courier, monospace;">jit</span> decorator
<hr style="border: solid 4px green">

### Lazy compilation (once again!)

In [14]:
from numba import jit

@jit
def add (x, y):
    return x + y

In [15]:
print add (1, 2)
print add (1j, 2)

3
(2+1j)


## The <span style="font-family: Courier New, Courier, monospace;">jit</span> decorator (cont'd)
<hr style="border: solid 4px green">

### Eager compilation
* `numba` is given a *function signature*
* defined using types for *input arguments* and *returned values*
<br><br>

### Types
* basic types
  * `void`, `bool`, `uint8`, `uint16`, `uint32`, `uint64`
  * integers: `int32`, `int64`
  * floats: `float32`, `float64`
  * complex: `complex64`, `complex128`
* array types are created by creating "slices" of basic types
  * *e.g.*: `float32[:,:]`
<br><br>

### Function signature
* list of signatures to be compiled
* return type specified first

## The <span style="font-family: Courier New, Courier, monospace;">jit</span> decorator (cont'd)
<hr style="border: solid 4px green">

### Signatures can be defined using objects
```python
from numba import jit, int32
@jit (int32(int32, int32))
```
<br><br>

### ...or using strings (avoiding importing objects)

```python
from numba import jit
@jit ("int32(int32, int32)")
```
<br><br>

### ...and can specify several signatures for the same function (producing several execution paths)

```python
from numba import jit
@jit (["int32(int32, int32)", "int64(int64, int64)"])
```

### Print a few signatures

In [16]:
from numba import void, int32, float32, float64

# function type created from calling a Numba type object
print(float32())                             # no arguments, return 4 byte float
print(void(float32))                         # return nothing
print(float64(int32, float32, float64))      # return a 8 byte float
print(void(float32[:,:], int32[:]))          # array arguments

() -> float32
(float32,) -> none
(int32, float32, float64) -> float64
(array(float32, 2d, A), array(int32, 1d, A)) -> none


### Returning to the previous example

In [17]:
from numba import jit, int32

@jit (int32(int32, int32))
def add (x, y):
    return x + y

In [18]:
# this executes correctly as types are those specified
print add (1, 2)

3


In [19]:
# but this issues an error as the input type is not the one intended
print add (1j, 2)

TypeError: No matching definition for argument type(s) complex128, int64

In [20]:
# redefine function with multiple signatures (note the mixture of imported types and strings)
@jit ([int32(int32, int32), "complex128(complex128, complex128)"], nopython=True)
def add (x, y):
    return x + y

In [21]:
# now both work
print add (1, 2)
print add (1j, 2)

(3+0j)
(2+1j)


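### Ooops!  not quite the intended result...
* the type-based dispatching did not work as expected: `add (1, 2)` returned a complex number
* Python integers are passed as `int64`, which does not fit the `int32` signature, so the call fell through to the `complex128` version
* remember to cover every type of interest and mind the order: most specific signature first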
In [22]:
# the right order: most specific signature first
@jit ([int32(int32, int32), "int64(int64, int64)", "complex128(complex128, complex128)"], nopython=True)
def add (x, y):
    return x + y

In [23]:
# now both work *correctly*
print add (1, 2)
print add (1j, 2)

3
(2+1j)


## Example
<hr style="border: solid 4px green">

### Task
* compute the $p$-norm of a vector
$$||x||_p=\left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}$$
* reference implementation is `numpy.linalg.norm`
<br><br>
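Before writing the compiled version, it can help to evaluate the formula by hand on a tiny vector. The sketch below uses plain `NumPy` only; the helper name `p_norm_formula` is made up for the illustration.

```python
import numpy

def p_norm_formula(x, p):
    """Direct transcription of the formula: (sum_i |x_i|^p)^(1/p)."""
    return (numpy.abs(x) ** p).sum() ** (1.0 / p)

x = numpy.array([3.0, -4.0])

# p = 1: |3| + |-4| = 7
# p = 2: sqrt(9 + 16) = 5
# p = 3: (27 + 64)^(1/3) = 91^(1/3)
for p in (1, 2, 3):
    print(p, p_norm_formula(x, p), numpy.linalg.norm(x, p))
```

The two columns printed for each `p` should agree to within rounding error.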

### Steps
* write Python code, decorate it and execute it once (to cache results)
* compare accuracy and performance against the `NumPy` equivalent

In [24]:
import math
import numpy
from numba import jit

@jit
def p_norm_JIT (u, p):
    n = u.size
    s = 0.0
    for i in range (n):
        s += math.pow (math.fabs(u[i]), p)

    return math.pow (s, 1.0 / float(p))

In [25]:
# test accuracy
n = 1024
u = numpy.random.rand (n)
# compute norm from linalg
nrm0 = numpy.linalg.norm (u, 3)
# compute norms using extension functions
nrm1 = p_norm_JIT (u, 3)
print "relative error =", math.fabs ( (nrm0 - nrm1) / nrm0 )

relative error = 1.41977660939e-16


In [26]:
# test performance
n = 25000000
u = numpy.random.rand (n)
% timeit numpy.linalg.norm (u, 3)
% timeit p_norm_JIT (u, 3)

1 loop, best of 3: 641 ms per loop
10 loops, best of 3: 43.3 ms per loop


### Restrict types

In [27]:
from numba import int32, float64

@jit ("float64 (float64[:], int32)")
def p_norm_JIT (u, p):
    n = u.size
    s = 0.0
    for i in range (n):
        s += math.pow (math.fabs(u[i]), p)

    return math.pow (s, 1.0 / float(p))

In [29]:
# test performance
n = 25000000
u = numpy.random.rand (n)
% timeit numpy.linalg.norm (u, 3)
% timeit p_norm_JIT (u, 3)

1 loop, best of 3: 650 ms per loop
10 loops, best of 3: 46.9 ms per loop


### In this case, adding types does not improve performance!

### Better than an order of magnitude gain compared to <span style="font-family: Courier New, Courier, monospace;">NumPy</span>!
* one import
* one decoration
<br><br>

### <span style="font-family: Courier New, Courier, monospace;">numba</span>  compares well with C/Fortran!
<br><br>

### This can be even better!
* in certain situations...

## What about multi-threaded parallel execution?
<hr style="border: solid 4px green">

### Option 1: using the <span style="font-family: Courier New, Courier, monospace;">threading</span> module
* possible but very fussy
  * release the GIL
  * assign chunks of work to threads
  * start the threads
  * let them execute the work
  * join the threads
* complicated, and all the problems of concurrent programming apply (*e.g.* race conditions)
<br><br>

### Option 2: ufuncs + <span style="font-family: Courier New, Courier, monospace;">vectorize</span>

## ufuncs + <span style="font-family: Courier New, Courier, monospace;">vectorize</span>
<hr style="border: solid 4px green">

### Generate <span style="font-family: Courier New, Courier, monospace;">NumPy</span> ufuncs from scalar functions
* apply the `vectorize` decorator
* resulting function can operate on scalars or `NumPy` arrays
* when used on arrays, the ufunc applies the core scalar function to each group of corresponding elements from its arguments, in an element-wise fashion
<br><br>

### Writing ufuncs usually means programming in C (<span style="font-family: Courier New, Courier, monospace;">ctypes</span>)
* but life is too short...

### <span style="font-family: Courier New, Courier, monospace;">vectorize</span>

```
@vectorize (type_signatures[, target="cpu"])
```
* returns a `NumPy` ufunc
* parameters are
  * *type_signatures* -- an iterable of type signatures
    * function type object or
    * a string describing the function type
  * *target* -- a string for hardware target; e.g. "cpu" (default), "parallel" (multicore), "cuda" (GPU)
    * GPU computing should target isolated parts of the code, a useful option to have
<br><br>

### Example

In [30]:
import numpy
import math

def trig(a, b):
    return math.sin(a**2) * math.exp(b)

In [31]:
trig(1, 1)

2.2873552871788423

### All is good but only works on scalars

In [32]:
a = numpy.ones((5,5))
b = numpy.ones((5,5))
trig(a, b)

TypeError: only length-1 arrays can be converted to Python scalars

### The function can be vectorized

In [35]:
from numba import vectorize
@ vectorize
def trig_vec (a, b):
    return math.sin(a**2) * math.exp(b)

print trig_vec (1, 1)
print trig_vec (a, b)

2.28735528718
[[ 2.28735529  2.28735529  2.28735529  2.28735529  2.28735529]
 [ 2.28735529  2.28735529  2.28735529  2.28735529  2.28735529]
 [ 2.28735529  2.28735529  2.28735529  2.28735529  2.28735529]
 [ 2.28735529  2.28735529  2.28735529  2.28735529  2.28735529]
 [ 2.28735529  2.28735529  2.28735529  2.28735529  2.28735529]]


### <span style="font-family: Courier New, Courier, monospace;">trig_vec</span> is a <span style="font-family: Courier New, Courier, monospace;">NumPy</span> ufunc
<br><br>

### How does it compare to pure <span style="font-family: Courier New, Courier, monospace;">NumPy</span>?

In [36]:
def trig_numpy (a, b):
    return numpy.sin(a**2) * numpy.exp(b)

In [37]:
a = numpy.random.random((1000, 1000))
b = numpy.random.random((1000, 1000))

% timeit trig_numpy (a, b)
% timeit trig_vec (a, b)

100 loops, best of 3: 18.4 ms per loop
100 loops, best of 3: 16.3 ms per loop


### Similar performance...
<br><br>

### What if we specify a signature?  Is there a speed boost?

In [38]:
@ vectorize ("float64(float64, float64)")
def trig_vec (a, b):
    return math.sin(a**2) * math.exp(b)

% timeit trig_vec (a, b)

100 loops, best of 3: 16.3 ms per loop


### Not really.
<br><br>

### But with a signature, we can use the parallel target

In [39]:
@ vectorize ("float64(float64, float64)", target="parallel")
def trig_vec (a, b):
    return math.sin(a**2) * math.exp(b)

%timeit trig_vec (a, b)

100 loops, best of 3: 5.66 ms per loop


### Bingo!  Automatic multicore operations!

## Target performance depends on data size
<hr style="border: solid 4px green">

### Careful: <span style="font-family: Courier New, Courier, monospace;">target="parallel"</span> is not always the best option
* there is an overhead in setting up the threading
* this overhead must then be offset by multithreaded performance
* simple operations on little data -- serial performance is better
* expensive operations on large data -- `parallel` is faster
<br><br>

### Guideline
|Target     | Description         | Optimal data size |
| :---      | :---                | ---:              |
|`cpu`      | single-threaded CPU |             < 1kB |
|`parallel` | multi-core CPU      |             ~ 1MB |
|`cuda`     | CUDA GPU            |             ~ 1GB |

## gufuncs + <span style="font-family: Courier New, Courier, monospace;">guvectorize</span>
<hr style="border: solid 4px green">

### <span style="font-family: Courier New, Courier, monospace;">vectorize</span>
* generate ufuncs from functions that work on one element at a time
* return a value
<br><br>

### <span style="font-family: Courier New, Courier, monospace;">guvectorize</span>
* generate ufuncs that work on an arbitrary number of elements of input arrays, and return arrays of differing dimensions
* no value returned -- the result to be returned is an array argument
* input and output layouts are declared
  * *e.g.* `"(n),()->(n)"` means the function takes an n-element one-dimensional array and a scalar (symbolically denoted by the empty tuple `()`), and returns an n-element one-dimensional array
<br><br>

> *Note*: more at https://docs.scipy.org/doc/numpy/reference/c-api.generalized-ufuncs.html

## Back to the p-norm example
<hr style="border: solid 4px green">

### Can we apply the above to multithread the function?
<br><br>

### Yes, but with mixed success!
* parallelising `u -> numpy.power (numpy.fabs(u), p)` is easy
* the reduction at the end breaks parallelism
* I have no solution!

### Parallelise <span style="font-family: Courier New, Courier, monospace;">u -> numpy.power (numpy.fabs(u), p)</span>

In [6]:
#suppose we wanted to vectorize only the vector part of
#the operations (without the reduction at the end)
#@vectorize parallelises nicely

import math
import numpy
def funcNumpy (u, p):
    return numpy.power (numpy.fabs(u), p)

from numba import vectorize

# target="cpu"
@vectorize (["float64(float64, int64)"])
def funcNumba (u, p):
    return math.pow (math.fabs(u), p)

# target="parallel"
@vectorize (["float64(float64, int64)"], target="parallel")
def funcNumbaPar (u, p):
    return math.pow (math.fabs(u), p)

u = numpy.random.ranf(100)
p = 3
v0 = funcNumpy (u, p)
v1 = funcNumba (u, p)
v2 = funcNumbaPar (u, p)
# check values are computed correctly
print numpy.linalg.norm(v0-v1), numpy.linalg.norm(v0-v2)

# performance test
u = numpy.random.ranf(10**8)
p = 3
% timeit funcNumpy (u, p)
% timeit funcNumba (u, p)
% timeit funcNumbaPar (u, p)

3.0987972677
2.2188367208e-16 2.2188367208e-16
1 loop, best of 3: 3.21 s per loop
1 loop, best of 3: 663 ms per loop
10 loops, best of 3: 211 ms per loop


### Attempting to parallelise everything

In [19]:
#
# --- numpy only function (reference)
#
import math
import numpy
def funcNumpy (u, p):
    return numpy.power (numpy.fabs(u), p)


#
# --- JIT only function -- all fine but not parallel
#
from numba import jit

@jit ("float64(float64[:], int64)")
def p_norm_JIT (u, p):
    n = u.size
    s = 0.0
    for i in range (n):
        s += math.pow (math.fabs(u[i]), p)

    return math.pow (s, 1.0 / float(p))


#
# --- JIT vectorize function with parallelised kernel for u -> math.pow (math.fabs(u), p)
#
from numba import vectorize

@vectorize ("float64(float64, int64)", target="parallel", nopython=True)
def funcNumbaPar (u, p):
    return math.pow (math.fabs(u), p)

@jit ("float64(float64[:], int64)")
def p_norm_JIT_par (u, p):
    v = funcNumbaPar (u, p)
    s = v.sum()
    # can also use the reduce method of the NumPy add ufunc
    # s = numpy.add.reduce (v, axis=0)
    return math.pow (s, 1.0 / float(p))

u = numpy.random.ranf(100)
p = 3
v0 = numpy.linalg.norm (u, p)
v1 = p_norm_JIT (u, p)
v2 = p_norm_JIT_par (u, p)
print v0
print numpy.fabs(v0-v1), numpy.fabs(v0-v2)

u = numpy.random.ranf(10**6)
p = 3
% timeit numpy.linalg.norm (u, p)
% timeit p_norm_JIT (u, p)
% timeit p_norm_JIT_par (u, p)

2.95841079601
0.0 0.0
10 loops, best of 3: 24.9 ms per loop
1000 loops, best of 3: 1.84 ms per loop
The slowest run took 5.00 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 4.75 ms per loop


## There is more!
<hr style="border: solid 4px green">

### Features
* automatic function inlining
* writing fast generators (functions that return a sequence, such as `xrange`)
* targeting GPU execution
<br><br>

### Tutorials
* https://github.com/barbagroup/numba_tutorial_scipy2016/blob/master/notebooks/08.Make.generalized.ufuncs.ipynb
* http://numba.pydata.org/numba-doc/dev/user/vectorize.html

## Limitations
<hr style="border: solid 4px green">

### <span style="font-family: Courier New, Courier, monospace;">numba</span> is new
* under development, not yet mature
* some features are missing, *e.g.* no support for JIT-compiling native Python classes (`jitclass` structure on the way)

### Parallel execution?
* parallel execution is a possibility but
   * the recipe keeps changing, *e.g.* the `prange` construct in version 0.11 is no longer available
   * LLVM has had no support for OpenMP until recently
* best approach at the moment for parallel execution is to combine `numba` with other solutions, such as `multiprocessing`

## Summary
<hr style="border: solid 4px green">

### <span style="font-family: Courier New, Courier, monospace;">numba</span>
* an exciting project with a promising future
* *extremely* easy to use (essentially the API has only one feature; the decorator `@jit`)
* *extremely* easy to install
* as good as C/Fortran or `Cython` (on a single core!)
* easy threaded parallelism (limited functionality)
<br><br>

### ...but
* don't trust it -- test everything!
* treat it with caution when it comes to long term code development plans

<img src="../../images/reusematerial.png"; style="float: center; width: 90"; >
<br>
<br>