# Compilation and Speeding Up

Python is an interpreted language.  This means that the code you write is executed step by step by an interpreter rather than being translated into machine code ahead of time (this is an oversimplification, since CPython first compiles the source to bytecode, but it is not far from the truth).  Each time a step is executed, the interpreter has to work out which machine instructions to run on its behalf.  For example, a `for` loop needs a counter to be initialized, an instruction to increment the counter, an instruction to check the limits, and suitable branching statements.
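To make the steps of a `for` loop visible, the standard `dis` module can list the bytecode that CPython executes.  This is a minimal sketch: `count_up` is a made-up function for illustration, and the exact instruction names vary between Python versions.

```python
import dis

def count_up(limit):
    total = 0
    for i in range(limit):
        total += i
    return total

# Print the bytecode instructions that the interpreter executes for count_up
dis.dis(count_up)

# Collect the instruction names to pick out the loop machinery
opnames = [instr.opname for instr in dis.get_instructions(count_up)]
print("FOR_ITER" in opnames)
```

Each line of the listing is one step that the interpreter has to decode and dispatch every time the loop body runs.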

When a program is *compiled*, it is converted into machine language once and for all, and only that code is then run.  This also means that any change in the code requires recompilation.  Compared to Python, this is less interactive and takes longer.

So compiled languages pay a cost at compile time, and reap the benefits at run time.  If you expect your program to run many times, or for a long time, it is usually worth checking whether the compilation cost pays off.
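The trade-off can be written as simple arithmetic.  The sketch below uses a hypothetical helper, `break_even_runs`, and made-up timings; it only illustrates the reasoning and does not measure anything.

```python
import math

def break_even_runs(compile_seconds, interpreted_seconds, compiled_seconds):
    """Smallest number of runs for which compiling first is faster overall."""
    saving_per_run = interpreted_seconds - compiled_seconds
    if saving_per_run <= 0:
        return None  # the compiled version is no faster, so compiling never pays off
    # We need: compile + runs * compiled < runs * interpreted
    return math.floor(compile_seconds / saving_per_run) + 1

# Hypothetical numbers: 3 s to compile, 2 s per interpreted run, 0.5 s per compiled run
print(break_even_runs(3.0, 2.0, 0.5))  # 3
```

With these numbers, three runs take 4.5 s compiled (including compilation) against 6 s interpreted.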

## Speed of Python

Python code is typically slow for a number of reasons:

- Data types are not known ahead of time, and the type of a variable can be dynamically changed.  You can store a string in a variable that previously had an `int` for example, and there will be no conflict.  This makes it hard to optimize variables as you do not know how they will change in future.
- Semantics of certain operations are different in Python than they are in other languages or machine code.  For example, dividing by zero raises a `ZeroDivisionError` in Python.  In C, integer division by zero is undefined behavior and typically crashes the program.  C++ does not raise an exception for it either: any check has to be written by hand, so it is optional and a crash remains possible.  Python performs the check on every division, and such checks add extra code and slow the program down.
- Accessing an index that is beyond the bounds of a list raises an `IndexError` in Python.  In C the access is not checked: it is undefined behavior, and may silently read or overwrite unrelated memory or crash the program with a Segmentation Fault.

Similarly, there are other situations where the semantics of the Python code differ from a similar C or machine language representation.  Whenever this happens, there is a chance that the Python code will be slower than the equivalent C or machine code.
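The behaviours listed above are easy to observe directly.  In this short sketch, `safe_divide` and `safe_get` are hypothetical helper names used only for illustration.

```python
x = 42            # x refers to an int
x = "forty-two"   # the same name can later refer to a str, with no conflict

def safe_divide(a, b):
    """Return a / b, or None if Python detects a division by zero."""
    try:
        return a / b
    except ZeroDivisionError:
        return None

def safe_get(items, index):
    """Return items[index], or None if the index is out of bounds."""
    try:
        return items[index]
    except IndexError:
        return None

print(type(x).__name__)            # str
print(safe_divide(1, 0))           # None
print(safe_get([10, 20, 30], 5))   # None
```

The exceptions can only be caught because the interpreter checks the divisor and the index on every operation, and that is the run-time cost being described.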

## Improving Speed

The simplest approach for speeding things up is to try and convert the Python code to a lower level language like C, compile it, and then run the compiled code.  However, because of the differences in semantics described above, this has to be done with care, to avoid changing the meaning of the program.

## Cython

*Cython* is a superset of the Python language: it adds several new syntactic elements to address the issues with types and compilation.  The usual way of running it is to compile the code into an extension module (a dynamic library), and then import this into Python.  However, in Jupyter notebooks, there is an easier approach, which uses the Cython extension and the `%%cython` *magic command*.

# Timing and Optimization

We first measure the time taken for a simple function.  Then we can look at optimizing this using Cython.

In [None]:
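def isPrime(n):
    if n < 2:
        return False   # 0, 1 and negative numbers are not prime
    # Trial division: a composite n has a divisor no larger than sqrt(n)
    for i in range(2, int(n**0.5) + 1):
        if n % i == 0:
            return False

    return True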

In [None]:
%timeit isPrime(999999937)
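`%timeit` only works inside IPython or Jupyter.  As a rough equivalent for a plain Python script, the standard `timeit` module can be used.  The sketch below repeats a copy of `isPrime` so that it runs on its own; the repeat counts are arbitrary choices.

```python
import timeit

def isPrime(n):
    if n < 2:
        return False
    for i in range(2, int(n**0.5) + 1):
        if n % i == 0:
            return False
    return True

# Run the call 5 times per measurement, and repeat the measurement 3 times
times = timeit.repeat(lambda: isPrime(999999937), number=5, repeat=3)
best = min(times) / 5
print(f"best of 3: {best * 1e3:.3f} ms per call")
```

Taking the minimum over the repeats reduces the influence of other processes running on the machine.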

## Cython

First we just apply Cython without any optimizations.  Later we will see the effect of adding the optimizations to it.

In [None]:
%load_ext Cython

In [None]:
%%cython --annotate

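def cbasic_isPrime(n):
    if n < 2:
        return False   # 0, 1 and negative numbers are not prime
    # Same trial division as isPrime, with no type declarations yet
    for i in range(2, int(n**0.5) + 1):
        if n % i == 0:
            return False

    return True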

In [None]:
%timeit cbasic_isPrime(999999937)

### Optimized

Now apply several optimizations.  In the code below, the actual optimizations are commented out.  Try uncommenting them one by one to try and see which has the biggest impact on the result.

In general, you need to look for the following:

- Where is the bulk of the time being spent?  Most likely it is inside loops.  Here it is the `for` loop.
    - To handle this, we explicitly declare `i` as a C integer with `cdef int i`.  Try uncommenting that line to see how it changes the result.
- What kind of data types are being used?  C prefers data types close to what the hardware provides: for example 32-bit `int`, 32-bit single precision `float`, etc.  These are highly optimized.  Python integers, on the other hand, grow to accommodate larger values when needed, and the cost of that is extra checking on every operation.  If you force Cython to use C data types, it will remove some of these checks.
    - Which lines here still work with Python objects?  The `--annotate` output highlights them.
- What kind of operations are being used?  For example, C division and modulo work only within the range of the C types being used: there is no check for division by zero, and results for negative operands follow C rules rather than Python's.  This makes severe errors possible, but if you know your values cannot fall in the error regions, you can turn the checks off with `cython.cdivision(True)`.
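The differences between Python and C arithmetic mentioned in the list can be imitated from plain Python.  This is only an illustration: `ctypes` and `math.fmod` are used here to mimic C behaviour and are not what Cython generates.

```python
import ctypes
import math

# Python integers grow as needed
big = 2**31
print(big)                           # 2147483648

# A 32-bit C int cannot hold that value; ctypes wraps around without an error
wrapped = ctypes.c_int32(big).value
print(wrapped)                       # -2147483648

# Python floors the quotient, and the remainder takes the sign of the divisor
print(-7 // 2, -7 % 2)               # -4 1

# C truncates towards zero, and the remainder takes the sign of the dividend
c_quotient = int(-7 / 2)             # truncation, as C integer division does
c_remainder = int(math.fmod(-7, 2))  # fmod follows the C sign convention
print(c_quotient, c_remainder)       # -3 -1
```

If the values in your program never reach these cases, the C behaviour gives the same answers with less work.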

In [None]:
%%cython --annotate

import cython

# @cython.cdivision(True)
def c_isPrime(int n):
    if n < 2:
        return False   # 0, 1 and negative numbers are not prime
    # cdef int i
    # cdef float sqrtn = (n**0.5)
    # cdef int lim = int(sqrtn)+1
    # Note: if you uncomment the two lines above (sqrtn and lim), then comment out the line below
    lim = int(n**0.5) + 1
    for i in range(2,lim):
        if n%i==0:
            return False
        
    return True

In [None]:
# %timeit c_isPrime(999999999)
%timeit c_isPrime(999999937)

# Matrix multiplication


In [None]:
import numpy as np
def matrix_multiply(u, v):
    m, n = u.shape
    n2, p = v.shape
    if n != n2:
        raise ValueError("inner dimensions of u and v must match")
    res = np.zeros((m, p))
    for i in range(m):
        for j in range(p):
            res[i,j] = 0
            for k in range(n):
                res[i,j] += u[i,k] * v[k,j]
    return res


In [None]:
u = np.random.random((100,100))
v = np.random.random((100,100))
# %timeit -n 100 -r 3 matrix_multiply(u,v)
# %timeit matrix_multiply(u, v)
# %timeit np.matmul(u, v)

## Optimized Matrix Multiply

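Can we apply the same techniques to speed up matrix multiply in Python?  Consider all the places where changes would make sense.  In addition, there are a couple of decorators that also help to speed things up: `@cython.boundscheck(False)` turns off index bounds checks, and `@cython.wraparound(False)` turns off support for negative indices.  The function takes typed memoryviews (`float[:,:]`), so the arrays passed in must have dtype `float32`.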

In [None]:
%load_ext Cython

In [None]:
%%cython -a

import numpy as np
import cython

@cython.boundscheck(False)
@cython.wraparound(False)
def cy_matmul(float[:,:] u, float[:,:] v, float[:,:] res):
# def cy_matmul(u, v, res):
    cdef int m, n, p
    cdef int i, j, k
    m = u.shape[0]
    n = u.shape[1]
    p = v.shape[1]
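    # The caller supplies res, so check the shapes here: bounds checks are disabled below
    if v.shape[0] != n or res.shape[0] != m or res.shape[1] != p:
        raise ValueError("shape mismatch between u, v and res")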
    for i in range(m):
        for j in range(p):
            res[i,j] = 0
            for k in range(n):
                res[i,j] += u[i,k] * v[k,j]
    return res


In [None]:
# u = np.float32(np.random.random((100,1000)))
# v = np.float32(np.random.random((1000,100)))
# res = np.zeros((100, 100), dtype=np.float32)
# %timeit cy_matmul(u, v, res)

## Performance testing

Try iterating this across different combinations of matrix sizes to see how the time varies.

In [None]:
M, N, P = 100, 100, 100
u = np.float32(np.random.random((M, N)))
v = np.float32(np.random.random((N, P)))
res = np.zeros((M, P), dtype=np.float32)

In [None]:
%timeit matrix_multiply(u, v)

In [None]:
%timeit cy_matmul(u, v, res)

In [None]:
%timeit np.matmul(u, v)
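The suggestion to iterate across matrix sizes can be sketched as a small timing loop.  To keep the sketch runnable without NumPy or Cython, it uses a pure-Python stand-in (`list_matmul`) on nested lists.  The names `list_matmul`, `random_matrix` and `sweep` are hypothetical; in the notebook you would generate NumPy arrays instead and pass `matrix_multiply`, `np.matmul`, or a small wrapper around `cy_matmul`.

```python
import random
import timeit

def list_matmul(u, v):
    """Pure-Python stand-in for the functions being compared."""
    m, n, p = len(u), len(v), len(v[0])
    return [[sum(u[i][k] * v[k][j] for k in range(n)) for j in range(p)]
            for i in range(m)]

def random_matrix(rows, cols):
    return [[random.random() for _ in range(cols)] for _ in range(rows)]

def sweep(func, sizes, number=3):
    """Return {(M, N, P): best seconds per call} for each size triple."""
    results = {}
    for (M, N, P) in sizes:
        u = random_matrix(M, N)
        v = random_matrix(N, P)
        timer = timeit.Timer(lambda: func(u, v))
        results[(M, N, P)] = min(timer.repeat(repeat=3, number=number)) / number
    return results

sizes = [(10, 10, 10), (20, 20, 20), (40, 40, 40)]
timings = sweep(list_matmul, sizes)
for shape, seconds in timings.items():
    print(shape, f"{seconds * 1e3:.3f} ms")
```

For the triple-loop versions, the time should grow roughly in proportion to M × N × P, so doubling every dimension costs about eight times as much.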