# Making your code run faster

## Table of Contents:

1. [Vectorize with NumPy](#numpy)
2. [Use in-place operations](#in-place)
3. [Maximize locality in memory access](#locality)
4. [Delegate to C](#C)

## 1. Vectorize with [NumPy](https://numpy.org/) <a class="anchor" id="numpy"></a>

### Example 1:

The [discrete signal energy](https://en.wikipedia.org/wiki/Energy_(signal_processing)) is a particular case of the [dot product](https://en.wikipedia.org/wiki/Dot_product) $\langle x(n), y(n)\rangle = \sum_{n}{x(n)y(n)}$ in which both signals are the same ($y = x$, real-valued):
$$ ~\\ E_{s} \ \ = \ \ \langle x(n), x(n)\rangle \ \ = \sum_{n}{x(n)x(n)} = \sum_{n}{|x(n)|^2}$$

In [1]:
import numpy as np

def non_vectorized_dot_product(x, y):
    result = 0
    for i in range(len(x)):
        result += x[i] * y[i]
    return result

signal = np.random.random(100000)

In [2]:
%%time
# %%time (cell magic) times the whole cell; a bare %time line times an empty statement.
non_vectorized_dot_product(signal, signal)


33337.15045903727

Now, using NumPy's array multiplication and sum:

In [3]:
%%time
np.sum(signal*signal)

33337.15045903682
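
The same reduction can also be written as a single dot-product call. The sketch below uses a small hand-made signal (`x` and its values are illustrative, not the random signal from the cells above) and checks that `np.sum(x * x)`, `np.dot` and the `@` operator agree:

```python
import numpy as np

# Illustrative signal (hypothetical values, not the random signal used above).
x = np.array([1.0, 2.0, 3.0, 4.0])

energy_sum = np.sum(x * x)   # element-wise product, then a reduction
energy_dot = np.dot(x, x)    # single dot-product call, no temporary x*x array
energy_matmul = x @ x        # same operation through the @ operator

print(energy_sum, energy_dot, energy_matmul)
```

`np.dot` avoids materializing the intermediate product array, which may matter for large signals.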

### Example 2:

In [4]:
import numpy as np

# https://softwareengineering.stackexchange.com/questions/254475/how-do-i-move-away-from-the-for-loop-school-of-thought
def cleanup(x, missing=-1, value=0):
    result = []
    for i in range(len(x)):
        if x[i] == missing:
            result.append(value)
        else:
            result.append(x[i])
    return np.array(result)

array = np.arange(-10000,10000)

In [5]:
print(array[9995:10005])
%time print(cleanup(array, value=10, missing=0)[9995:10005])

[-5 -4 -3 -2 -1  0  1  2  3  4]
[-5 -4 -3 -2 -1 10  1  2  3  4]
CPU times: user 4.24 ms, sys: 477 µs, total: 4.72 ms
Wall time: 4.69 ms


In [6]:
# https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.where.html
%time print(np.where(array == 0, 10, array)[9995:10005])

[-5 -4 -3 -2 -1 10  1  2  3  4]
CPU times: user 634 µs, sys: 310 µs, total: 944 µs
Wall time: 861 µs
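
`np.where` builds a new array. When a copy is not needed, boolean-mask assignment is a common alternative. The following is a minimal sketch on a small illustrative array (`values`, `replaced` and `masked` are hypothetical names, not part of the cells above):

```python
import numpy as np

values = np.arange(-5, 5)        # small illustrative array

# Out-of-place: np.where returns a new array and leaves `values` untouched.
replaced = np.where(values == 0, 10, values)

# Mask-based alternative: assign through a boolean mask (done here on a copy).
masked = values.copy()
masked[masked == 0] = 10

print(replaced)
print(masked)
```

Both forms avoid the explicit Python loop of `cleanup`; the mask form can also be applied directly to the original array when overwriting it is acceptable.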


### [Example 3](https://github.com/pyHPC/pyhpc-tutorial):

In [7]:
from math import sin
import numpy as np

arr = np.arange(10000000)
%time x = [sin(i)**2 for i in arr]

CPU times: user 1.59 s, sys: 112 ms, total: 1.7 s
Wall time: 1.7 s


In [8]:
%time x = np.sin(arr)**2

CPU times: user 208 ms, sys: 20.6 ms, total: 229 ms
Wall time: 228 ms


## 2. Use in-place operations <a class="anchor" id="in-place"></a>

In [9]:
import numpy as np
a = np.random.random(5000000)

In [10]:
b = np.copy(a)

In [11]:
%%timeit
global a # Required by %%timeit
a = 10*a



7.7 ms ± 19.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [12]:
a = np.copy(b)

In [13]:
%%timeit
global a # Required by %%timeit
a *= 10



3.12 ms ± 86.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
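
One plausible reason for the difference is that `a = 10*a` allocates a new array on every call, while `a *= 10` reuses the existing buffer. The sketch below (variable names are illustrative) checks this by comparing buffer addresses, and also shows the explicit ufunc form with `out=`:

```python
import numpy as np

a = np.ones(5)
buffer_before = a.ctypes.data        # address of the underlying data buffer

a *= 10                              # in-place: reuses the existing buffer
np.multiply(a, 10, out=a)            # explicit ufunc form, also in-place
in_place_kept_buffer = (a.ctypes.data == buffer_before)

c = 10 * a                           # out-of-place: allocates a new array
out_of_place_new_buffer = (c.ctypes.data != buffer_before)

print(in_place_kept_buffer, out_of_place_new_buffer)
```

The `out=` argument is useful when the expression is a ufunc call rather than an augmented assignment.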


## 3. Maximize locality in memory access <a class="anchor" id="locality"></a>

In [14]:
import numpy as np
import numba as nb # Use Numba to compile to machine code

a = np.random.rand(1000, 1000)
b = np.copy(a)

In [21]:
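# The inner loop walks along a row (consecutive column indices j). This matches NumPy's
# default row-major (C) layout, in which the elements of a row are contiguous in RAM.
# Note: despite its name, the function divides each element by val.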
def mult_by_cols(x, val):
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            x[i][j] /= val
    return x
            
@nb.jit(nopython=True)
def JIT__mult_by_cols(x, val):
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            x[i][j] /= val
    return x

# The inner loop walks down a column (consecutive row indices i). In NumPy's default
# row-major (C) layout each step jumps a whole row ahead in RAM, so locality is worse.
def mult_by_rows(x, val):
    for j in range(x.shape[1]):
        for i in range(x.shape[0]):
            x[i][j] /= val
    return x
            
@nb.jit(nopython=True)
def JIT__mult_by_rows(x, val):
    for j in range(x.shape[1]):
        for i in range(x.shape[0]):
            x[i][j] /= val
    return x

In [22]:
%%timeit
mult_by_cols(a, 10)

232 ms ± 4.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [23]:
a[...] = b

In [24]:
%%timeit
mult_by_rows(a, 10)

233 ms ± 4.76 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [25]:
a[...] = b

In [26]:
%%timeit
JIT__mult_by_cols(a, 10)

3.48 ms ± 2.13 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [27]:
a[...] = b

In [28]:
%%timeit
JIT__mult_by_rows(a, 10)

The slowest run took 4.71 times longer than the fastest. This could mean that an intermediate result is being cached.
5.11 ms ± 3.73 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
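
Which traversal order is the cache-friendly one depends on how the array is laid out in memory. As a minimal sketch (the arrays `m` and `f` are illustrative), the layout can be inspected through `flags` and `strides`:

```python
import numpy as np

m = np.zeros((1000, 1000))          # float64, C (row-major) order by default

print(m.flags['C_CONTIGUOUS'])      # True: each row is contiguous in memory
print(m.strides)                    # bytes to step along (rows, columns)

# A Fortran-ordered copy stores each column contiguously instead.
f = np.asfortranarray(m)
print(f.flags['F_CONTIGUOUS'], f.strides)
```

For the row-major array the column stride is one element (8 bytes), so a loop whose innermost index is the column index touches consecutive addresses.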


## 4. Delegate to C <a class="anchor" id="C"></a>
When you want to speed up your code, or simply need to reuse existing C code, you can call C from Python. There are several alternatives:

1. [Cython](http://cython.org/): A superset of Python that lets you call C functions and declare C types for variables.
2. [SWIG (Simplified Wrapper Interface Generator)](http://www.swig.org/): A software development tool that connects C/C++ programs with other languages (including Python).
3. [ctypes](http://python.net/crew/theller/ctypes/): A Python standard-library module for calling functions in shared libraries (`.dll`/`.so`/`.dylib`) from Python.
4. [Python-C-API](https://docs.python.org/3.6/c-api/index.html): A low-level interface between (compiled) C code and Python.
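
Of the alternatives listed above, `ctypes` needs no compilation step on the Python side. As a minimal sketch, assuming a POSIX-like system where the C standard library can be located with `ctypes.util.find_library`, the C function `strlen` can be called like this:

```python
import ctypes
import ctypes.util

# Locate and load the C standard library (assumption: POSIX-like system).
libc_path = ctypes.util.find_library("c")
libc = ctypes.CDLL(libc_path)

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

length = libc.strlen(b"hello")
print(length)
```

Declaring `argtypes` and `restype` explicitly lets `ctypes` convert arguments and return values correctly instead of assuming C `int`.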

A function to optimize:

In [None]:
!cat sum_array_lib.py
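
The contents of `sum_array_lib.py` are not reproduced in this text. Judging only from the call `sum_array(a, len(a))`, a plain-Python version might look like the following sketch; the function body is an assumption, not the actual file:

```python
import array as arr

def sum_array(a, n):
    """Hypothetical stand-in: add up the first n elements of a with an explicit loop."""
    total = 0.0
    for i in range(n):
        total += a[i]
    return total

data = arr.array('d', range(5))     # small illustrative input
print(sum_array(data, len(data)))
```

An explicit element-by-element loop like this is the kind of code that typically benefits from being compiled or delegated to C.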

In [None]:
# Please, restart the kernel to ensure that the module sum_array_lib is re-loaded
!rm -f sum_array_lib.cpython*.so
import sum_array_lib
import array as arr
a = arr.array('d', [i for i in range(100000)])
#a = [1 for i in range(100000)]
%timeit sum_array_lib.sum_array(a, len(a))
total = sum_array_lib.sum_array(a, len(a))  # `total` avoids shadowing the built-in sum()
print(total)

### 4.1 Cython

Cython is essentially [Python with C data types](https://cython.readthedocs.io/en/latest/src/tutorial/cython_tutorial.html). See also this [introductory lesson](https://nyu-cds.github.io/python-cython/).

#### Workflow:
```
      Cython compiler        C compiler
.pyx -----------------> .c --------------> .so
```

#### Installation

In [None]:
pip install Cython

#### Compilation of pure Python code:

In [None]:
!cp sum_array_lib.py sum_array_lib.pyx

In [None]:
!cat sum_array_lib.pyx

In [None]:
!cat Cython/basic/setup.py

In [None]:
!rm -f sum_array_lib.cpython*.so
!python Cython/basic/setup.py build_ext --inplace

In [None]:
# Please, restart the kernel to ensure that the module sum_array_lib is re-loaded
import sum_array_lib
import array as arr
a = arr.array('d', [i for i in range(100000)])
#a = [1.1 for i in range(100000)]
%timeit sum_array_lib.sum_array(a, len(a))
total = sum_array_lib.sum_array(a, len(a))  # `total` avoids shadowing the built-in sum()
print(total)

#### Defining C types:

In [None]:
!cat Cython/cdef/sum_array_lib.pyx

In [None]:
!cat Cython/cdef/setup.py

In [None]:
# Please, restart the kernel to ensure that the module sum_array_lib is re-loaded
!rm -f sum_array_lib.cpython*.so
!python Cython/cdef/setup.py build_ext --inplace

In [None]:
# Please, restart the kernel to ensure that the module sum_array_lib is re-loaded
import array as arr
import sum_array_lib
#import numpy as np
#a = np.arange(100000)
a = arr.array('d', [i for i in range(100000)])
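%timeit sum_array_lib.sum_array(a, len(a))
# After a kernel restart, `sum` is just the built-in function, so compute the result here.
total = sum_array_lib.sum_array(a, len(a))
print(total)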

### 4.2 Python-C-API

The Python-C-API is the most flexible and efficient alternative, but also the hardest to write.

#### The C code to reuse in Python

In [None]:
!cat sum_array_lib.c

In [None]:
!cat sum_array.c

In [None]:
!gcc -O3 sum_array.c -o sum_array
!./sum_array

#### The module

In [None]:
!cat sum_array_module.c

#### Module compilation

In [None]:
!cat setup.py

In [None]:
!python setup.py build_ext --inplace

In [None]:
import sum_array_module
import numpy as np
a = np.arange(100000)
%timeit sum_array_module.sumArray(a)
total = sum_array_module.sumArray(a)  # compute the result explicitly before printing it
print(total)

However, remember: vectorize when possible!

In [None]:
%timeit np.sum(a)
print(np.sum(a))