High Performance Computing 2018/2019
=======

Lecture 10. High Performance Python. Numba
----------

Heavily based on (forked from):

Scipy2017 tutorial by Gil Forsyth:

https://github.com/gforsyth/numba_tutorial_scipy2017

https://www.youtube.com/watch?v=1AwG0T4gaO0&t=1349s

GTC2018 tutorial by Stan Seibert:

https://github.com/ContinuumIO/gtc2018-numba


High Performance Python
----------------------

* multiprocessing
* mpi4py
* pycuda
* pyopencl
* **numba**

Numba is a Just-In-Time (JIT) compiler:
* generates optimized machine code using LLVM
* integrates well with Scientific Python stack
* **function compiler**: Numba compiles Python functions (not entire applications and not parts of functions). Numba is a Python module.
* **type-specializing**: Numba speeds up your function by generating a specialized implementation for the specific data types you are using. 
* **just-in-time**: Numba translates functions when they are first called, so that the compiler knows the argument types. This also works in Jupyter notebooks.
* **numerically-focused**: "mostly" `int`, `float`, `complex`. Works well with NumPy arrays.


The first step is always to find the bottlenecks in your code, via _profiling_: analyzing your code by measuring the execution time of its parts.


Tools:
------

1. `cProfile`
2. `snakeviz`
3. [`line_profiler`](https://github.com/rkern/line_profiler)
4. `timeit`



```console
pip install line_profiler
```

In [1]:
import numpy
from time import sleep

def sleepy(time2sleep):
    sleep(time2sleep)
    
def supersleepy(time2sleep):
    sleep(time2sleep)
    
def randmatmul(n=1000):
    a = numpy.random.random((n,n))
    b = a @ a
    return b
    
def useless(a):
    if not isinstance(a, int):
        return
    
    randmatmul(a)
    
    ans = 0
    for i in range(a):
        ans += i
        
    sleepy(1.0)
    supersleepy(2.0)
        
    return ans

## using `cProfile`

[`cProfile`](https://docs.python.org/3.4/library/profile.html#module-cProfile) is the built-in profiler in Python (available since Python 2.5). It provides a function-by-function report of execution time. After importing the module, call `cProfile.run()` with the code to profile passed as a string. It prints a list of all the functions that were called, with the number of calls and the time spent in each.


In [6]:
import cProfile

cProfile.run('useless(3000)')

         11 function calls in 3.629 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.460    0.460    0.603    0.603 <ipython-input-1-3d4635a3d65c>:10(randmatmul)
        1    0.017    0.017    3.628    3.628 <ipython-input-1-3d4635a3d65c>:15(useless)
        1    0.000    0.000    1.004    1.004 <ipython-input-1-3d4635a3d65c>:4(sleepy)
        1    0.000    0.000    2.005    2.005 <ipython-input-1-3d4635a3d65c>:7(supersleepy)
        1    0.000    0.000    3.629    3.629 <string>:1(<module>)
        1    0.000    0.000    3.629    3.629 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
        2    3.009    1.505    3.009    1.505 {built-in method time.sleep}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.143    0.143    0.143    0.143 {method 'random_sample' of 'mtrand.RandomState' 
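
The report above is ordered by standard name, so the expensive calls are not at the top. A minimal sketch of sorting by cumulative time with the standard-library `pstats` module follows; the function `work` is a hypothetical stand-in for `useless`, shortened so that the sketch runs quickly.

```python
import cProfile
import io
import pstats
from time import sleep


def sleepy(time2sleep):
    sleep(time2sleep)


def work():
    # hypothetical stand-in for `useless`, kept short on purpose
    total = sum(range(10_000))
    sleepy(0.05)
    return total


# Collect statistics only for the code between enable() and disable().
profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

# Sort by cumulative time and print only the first 5 entries.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```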

## using `snakeviz`

In [2]:
%load_ext snakeviz

In [4]:
%snakeviz useless(3000)

 
*** Profile stats marshalled to file '/var/folders/mc/0c22tdns1s38xgw4sr17cmx80000gn/T/tmpg53pk483'. 
Embedding SnakeViz in the notebook...
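
Outside a notebook, the same visualization can be produced from a saved statistics file. The sketch below uses only the standard library to write such a file; the workload `busy` and the file name `useless.prof` are hypothetical. The resulting file can then be opened by running `snakeviz` with the file path as its argument in a shell.

```python
import cProfile
import os
import pstats
import tempfile


def busy(n=100_000):
    # hypothetical workload standing in for `useless`
    return sum(i * i for i in range(n))


profiler = cProfile.Profile()
profiler.enable()
busy()
profiler.disable()

# Write the raw statistics to a file in the temporary directory.
stats_path = os.path.join(tempfile.gettempdir(), "useless.prof")
profiler.dump_stats(stats_path)

# The file can be reloaded with pstats to check what was recorded.
loaded = pstats.Stats(stats_path)
print(loaded.total_calls, "function calls recorded in", stats_path)
```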


## using `line_profiler`

`line_profiler` offers more granular information than `cProfile`: it will give timing information about each line of code in a profiled function.

### For a pop-up window with results in notebook:

IPython has an `%lprun` magic to profile specific functions within an executed statement. Usage:
`%lprun -f func_to_profile <statement>` (get more help by running `%lprun?` in IPython).

In [11]:
%load_ext line_profiler
%lprun -f sleepy -f supersleepy useless(1000)

The line_profiler extension is already loaded. To reload it, use:
  %reload_ext line_profiler


### Write results to a text file

In [13]:
%lprun -T timings.txt -f sleepy useless(1000)


*** Profile printout saved to text file 'timings.txt'. 


## Profiling on the command line

Open the file, add the `@profile` decorator to every function you want to profile, then run

```console
kernprof -l script_to_profile.py
```

which will generate `script_to_profile.py.lprof` (pickled result).  To view the results, run

```console
python -m line_profiler script_to_profile.py.lprof
```
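
A script decorated this way fails with a `NameError` under plain `python`, because `profile` is only defined when the script is run through `kernprof -l`. A minimal sketch of a script that works in both cases is shown below; the file contents and the function `count` are hypothetical examples.

```python
# script_to_profile.py (hypothetical contents)
import builtins
from time import sleep

# `kernprof -l` injects `profile` into builtins; fall back to a no-op
# decorator so that the script also runs under plain `python`.
profile = getattr(builtins, "profile", lambda func: func)


@profile
def sleepy(time2sleep):
    sleep(time2sleep)


@profile
def count(n):
    total = 0
    for i in range(n):
        total += i
    return total


if __name__ == "__main__":
    sleepy(0.01)
    print(count(1000))
```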

In [18]:
from IPython.display import IFrame
IFrame('http://localhost:8888/terminals/1', width=800, height=700)

## `timeit`

```console
python -m timeit "print(42)"
```


In [19]:
# line magic
%timeit x=10

11.6 ns ± 0.0995 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)


In [22]:
%%timeit 
# cell magic

x=10
a='hello'
d=[1,2,3]

58.4 ns ± 1.85 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
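
The `%timeit` magics only exist inside IPython. In a plain Python script, similar measurements can be taken with the standard-library `timeit` module. The following is a minimal sketch; the statements being timed and the loop counts are arbitrary choices for illustration.

```python
import timeit

# Time a statement given as a string: 5 repeats of 100_000 loops each.
runs = timeit.repeat("x = 10", repeat=5, number=100_000)
best_per_loop = min(runs) / 100_000
print(f"best of 5: {best_per_loop * 1e9:.1f} ns per loop")

# `setup` runs once before the timed loops and is not included in the timing.
timer = timeit.Timer("sorted(data)", setup="data = list(range(1000, 0, -1))")

# autorange() picks a loop count so that the total time is at least 0.2 s.
loops, elapsed = timer.autorange()
print(f"{elapsed / loops * 1e6:.2f} µs per loop over {loops} loops")
```

Taking the minimum over the repeats, rather than the mean, reduces the influence of other processes running on the machine.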


JIT
===

### Array sum

The function below is a naive `sum` function that sums all the elements of a given array.

In [23]:
def sum_array(inp):
    J, I = inp.shape
    
    # this is a bad idea: looping over array elements in pure Python is slow
    mysum = 0
    for j in range(J):
        for i in range(I):
            mysum += inp[j, i]
            
    return mysum

In [24]:
import numpy
arr = numpy.random.random((300, 300))

sum_array(arr)

plain = %timeit -o sum_array(arr)

18.2 ms ± 215 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [28]:
from numba import jit

sum_array_numba = jit()(sum_array)

sum_array_numba(arr)

jitted = %timeit -o sum_array_numba(arr)

plain.best / jitted.best


81.1 µs ± 1.64 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


228.1761970125171

## More commonly as a decorator

In [None]:
@jit
def sum_array(inp):
    I, J = inp.shape
    
    mysum = 0
    for i in range(I):
        for j in range(J):
            mysum += inp[i, j]
            
    return mysum

In [26]:
%timeit arr.sum()

39.1 µs ± 1.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


## [Your turn!](./exercises/02.Intro.to.JIT.exercises.ipynb#JIT-Exercise)