# Making your code run faster

## 1 Vectorize

[Discrete signal energy](https://en.wikipedia.org/wiki/Energy_(signal_processing)):
$$ E_{s} \ \ = \ \ \langle x(n), x(n)\rangle \ \  =  \sum_{n=-\infty}^{\infty}{|x(n)|^2}$$
can be computed as a particular case of the [dot product](https://en.wikipedia.org/wiki/Dot_product):
$$ \langle x(n), y(n)\rangle \ \  =  \sum_{n=-\infty}^{\infty}{x(n)y(n)}$$
where both signals are the same.

In [None]:
import numpy as np

def non_vectorized_dot_product(x, y):
    """Return the sum of x[i] * y[i] for all indices i.

    Example:
    
        >>> non_vectorized_dot_product(np.arange(20), np.arange(20))
    
    """
    result = 0
    for i in range(len(x)):
        result += x[i] * y[i]
    return result

signal = np.random.random(1000)
print(signal)

In [None]:
%timeit non_vectorized_dot_product(signal, signal)

In [None]:
non_vectorized_dot_product(signal, signal)

Now, using NumPy's array multiplication and sum:

In [None]:
%timeit np.sum(signal*signal)

In [None]:
np.sum(signal*signal)
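
The same energy can also be obtained with `np.dot`, which performs the products and the reduction in a single call. The sketch below (the array `x`, its size and the seed are illustrative, not taken from the cells above) checks that the loop, `np.sum` and `np.dot` agree up to floating-point rounding:

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded generator, only for reproducibility
x = rng.random(1000)

# Explicit Python loop (same logic as non_vectorized_dot_product)
loop_energy = 0.0
for sample in x:
    loop_energy += sample * sample

sum_energy = np.sum(x * x)   # element-wise product, then reduction
dot_energy = np.dot(x, x)    # product and reduction in one call

print(np.isclose(loop_energy, sum_energy), np.isclose(loop_energy, dot_energy))
```

The results are compared with `np.isclose` rather than `==` because the three versions may accumulate rounding errors in a different order.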

Another example, showing that vectorization applies to more than pure computation:

In [None]:
# https://softwareengineering.stackexchange.com/questions/254475/how-do-i-move-away-from-the-for-loop-school-of-thought
def cleanup(x, missing=-1, value=0):
    """Return an array that's the same as x, except that where x ==
    missing, it has value instead.

    >>> cleanup(np.arange(-3, 3), value=10)
    ... # doctest: +NORMALIZE_WHITESPACE
    array([-3, -2, 10, 0, 1, 2])

    """
    result = []
    for i in range(len(x)):
        if x[i] == missing:
            result.append(value)
        else:
            result.append(x[i])
    return np.array(result)

array = np.arange(-8,8)
print(array)
print(cleanup(array, value=10, missing=0))

In [None]:
array = np.arange(-1000,1000)
%timeit cleanup(array, value=10, missing=0)
print(array[995:1006])
print(cleanup(array, value=10, missing=0)[995:1006])

In [None]:
# http://www.secnetix.de/olli/Python/list_comprehensions.hawk
# https://docs.python.org/3/library/functions.html#zip
value = [10]*2000
%timeit [xv if c else yv for (c,xv,yv) in zip(array == 0, value, array)]
print([xv if c else yv for (c,xv,yv) in zip(array == 0, value, array)][995:1006])

In [None]:
# https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.where.html
%timeit np.where(array == 0, 10, array)
print(np.where(array == 0, 10, array)[995:1006])
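
When a new array is not required, boolean-mask assignment is another vectorized way to do the same replacement. A minimal sketch, using the same range of values as above under the illustrative name `data`:

```python
import numpy as np

data = np.arange(-1000, 1000)

# np.where builds a new array and leaves `data` untouched
replaced = np.where(data == 0, 10, data)

# Boolean-mask assignment overwrites the matching elements of a copy in place
in_place = data.copy()
in_place[in_place == 0] = 10

print(replaced[995:1006])
print(np.array_equal(replaced, in_place))
```

Both forms avoid the Python-level loop; which one fits better depends on whether the original array must be preserved.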

## 2 Use in-place operations

In [None]:
a = np.random.random(500000)
print(a[0:10])
b = np.copy(a)
%timeit global a; a = 10*a
a = 10*a
print(a[0:10])

In [None]:
a = np.copy(b)
print(a[0:10])
%timeit global a ; a *= 10
a *= 10
print(a[0:10])
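
The two cells differ in where the result is stored: `a = 10*a` allocates a new array and rebinds the name, whereas `a *= 10` writes into the existing buffer. A small sketch (the names `original`, `rebound` and `view` are illustrative) that checks this with `np.shares_memory`:

```python
import numpy as np

original = np.random.random(500000)

# Out-of-place: the product lives in a freshly allocated array
rebound = 10 * original
print(np.shares_memory(original, rebound))   # False

# In-place: the existing buffer is overwritten, so a view still sees it
view = original[:]
original *= 10
print(np.shares_memory(original, view))      # True

# np.multiply with out= is the explicit form of the same in-place update
np.multiply(original, 10, out=original)
```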

## 3 Maximize locality in memory access

In [None]:
a = np.random.rand(100,50)
b = np.copy(a)

In [None]:
def mult(x, val):
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            x[i][j] /= val
%timeit -n 1 -r 1 mult(a, 10)

In [None]:
a = np.copy(b)

def mult2(x, val):
    for j in range(x.shape[1]):
        for i in range(x.shape[0]):
            x[i][j] /= val
            
%timeit -n 1 -r 1 mult2(a, 10)

In [None]:
# http://www.scipy-lectures.org/advanced/optimizing/
# https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.sum.html

In [None]:
c = np.zeros((1000, 1000), order='C')

In [None]:
%timeit c.sum(axis=0)
c.sum(axis=0).shape

In [None]:
%timeit c.sum(axis=1)
c.sum(axis=1).shape
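
Any timing difference between the traversal orders above is related to how the array is laid out in memory. In C (row-major) order the elements of a row are contiguous, so walking along the last axis touches adjacent addresses. A sketch, with illustrative names, that inspects the layout through `strides` and `flags`:

```python
import numpy as np

c_order = np.zeros((1000, 1000), order='C')   # row-major (NumPy default)
f_order = np.zeros((1000, 1000), order='F')   # column-major (Fortran style)

# strides: bytes to step to reach the next element along each axis
print(c_order.strides)   # (8000, 8): neighbours along axis 1 are adjacent
print(f_order.strides)   # (8, 8000): neighbours along axis 0 are adjacent

print(c_order.flags['C_CONTIGUOUS'], f_order.flags['F_CONTIGUOUS'])
```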

## 4 Delegate to C
When you want to speed up your code, or simply need to reuse existing C code, you can call C from Python. There are several alternatives:

1. [Cython](http://cython.org/): A superset of Python that lets you call C functions and declare Python variables with C types.
2. [SWIG (Simplified Wrapper Interface Generator)](http://www.swig.org/): A software development tool to connect C/C++ programs with other languages (including Python).
3. [Ctypes](http://python.net/crew/theller/ctypes/): A Python package that can be used to call shared libraries (`.dll`/`.so`/`.dylib`) from Python.
4. [Python-C-API](https://docs.python.org/3.6/c-api/index.html): A low-level interface between (compiled) C code and Python.
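
As a flavour of the third alternative, the sketch below uses `ctypes` to call `strlen` from the C standard library. It assumes a POSIX system (Linux or macOS), where `ctypes.CDLL(None)` exposes the symbols already loaded into the Python process; on Windows a library name would have to be given instead.

```python
import ctypes

# Assumption: POSIX platform. CDLL(None) gives access to the symbols already
# loaded into the running process, which include the C standard library.
libc = ctypes.CDLL(None)

# Declare the C signature: size_t strlen(const char *s)
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

length = libc.strlen(b"vectorize")
print(length)   # 9
```

Declaring `argtypes` and `restype` is optional for simple cases, but it lets `ctypes` check and convert the arguments instead of guessing.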

### 4.1 Python-C-API

The Python-C-API is the most flexible and efficient alternative. However, it is also the hardest to code.

#### The C code to reuse in Python

In [1]:
!cat sum_array_lib.c

long int sum_array(double* a, int N) {
  int i;
  double sum = 0;
  for(i=0; i<N; i++) {
    sum += a[i];  /* a[i] is *(a+i); *a+i would add a[0]+i */
  }
  return sum;
}


In [2]:
!cat sum_array.c

#include <stdio.h>
#include <time.h>
#include "sum_array_lib.c"

#define N 100000

int main() {
  double a[N];
  int i;
  clock_t start, end;
  double cpu_time;
  for(i=0; i<N; i++) {
    a[i] = i;
  }
  start = clock();
  double sum = sum_array(a,N);
  end = clock();
  printf("%f ", sum);
  cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;
  cpu_time *= 1000000;
  printf("%f usegs\n", cpu_time);
}


In [3]:
!gcc -O3 sum_array.c -o sum_array
!./sum_array

4999950000.000000 176.000000 usegs


#### The module

In [4]:
!cat sum_array_module.c

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <Python.h>            /* Compulsory in every module */
#include <numpy/arrayobject.h> /* To interact with numpy arrays */
#include "sum_array_lib.c"

static PyObject* sumArray(PyObject* self, PyObject* args) {
  int N;
  long int sum;
  //int* a;
  PyArrayObject *in_array;
  
  clock_t start, end;
  double cpu_time;

  /*  parse the input */
  //if (!PyArg_ParseTuple(args, "i", &N))
  if (!PyArg_ParseTuple(args, "O!", &PyArray_Type, &in_array))
    return NULL;
  /* if the above function returns false (0), an appropriate Python exception
   * will have been set, and the function simply returns NULL
   */

  N = PyArray_DIM(in_array, 0);
  printf("array size %d\n", N);

  npy_double* data  = (npy_double*)PyArray_DATA(in_array);
  //a = (int*)malloc(N*sizeof(int));
  //if (!a) return NULL;
  
  /*for(i=0; i<N; i++) {
    data[i] = i;
    }*/

  start = clock();
  sum = sum_array(data, N);
  en

#### Module compilation

In [5]:
!cat setup.py

from distutils.core import setup, Extension
import numpy.distutils.misc_util

# define the extension module
sum_array_module = Extension(
    'sum_array_module',
    sources=['sum_array_module.c'],
    include_dirs=numpy.distutils.misc_util.get_numpy_include_dirs()
)

# run the setup
setup(
    ext_modules=[sum_array_module],
)


In [6]:
!python setup.py build_ext --inplace

running build_ext
building 'sum_array_module' extension
C compiler: gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC

creating build
creating build/temp.linux-x86_64-3.6
compile options: '-I/home/vruiz/.pyenv/versions/3.6.4/lib/python3.6/site-packages/numpy/core/include -I/home/vruiz/.pyenv/versions/3.6.4/include/python3.6m -c'
gcc: sum_array_module.c
In file included from /home/vruiz/.pyenv/versions/3.6.4/lib/python3.6/site-packages/numpy/core/include/numpy/ndarraytypes.h:1816,
                 from /home/vruiz/.pyenv/versions/3.6.4/lib/python3.6/site-packages/numpy/core/include/numpy/ndarrayobject.h:18,
                 from /home/vruiz/.pyenv/versions/3.6.4/lib/python3.6/site-packages/numpy/core/include/numpy/arrayobject.h:4,
                 from sum_array_module.c:5:
  ^~~~~~~
gcc -pthread -shared -L/home/vruiz/.pyenv/versions/3.6.4/lib -L/home/vruiz/.pyenv/ver

In [7]:
import sum_array_module
import numpy as np
a = np.arange(100000, dtype=np.float64)  # sumArray reads the buffer as C doubles
%timeit sum_array_module.sumArray(a)

189 µs ± 35.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


However, remember: vectorize when possible!

In [8]:
%timeit np.sum(a)

83.6 µs ± 1.17 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


### 4.2 Cython
[Python with C data types.](https://cython.readthedocs.io/en/latest/src/tutorial/cython_tutorial.html)