# Computational Efficiency with NumPy, PyTorch, and JIT

This notebook illustrates the computational efficiency gained by running linear algebra with the proper tools, such as NumPy.

In [None]:
from matplotlib import pyplot as plt

def plot_times(labels, times):
    x = list(range(len(times)))
    fig, ax = plt.subplots()
    ax.grid(alpha=0.5, ls='--', which='both')
    ax.bar(x, times, log=True)
    ax.set_xticks(x, labels)
    ax.set_axisbelow(True)

Let's compute an array dot product in Python:

In [None]:
def array_dot_product(v1, v2):
    result = 0
    for v1_i, v2_i in zip(v1, v2):
        result += v1_i * v2_i
    return result

v1 = list(range(100))
v2 = [1]*100

print("v1 = {}".format(v1))
print("v2 = {}\n".format(v2))

print("v1 dot v2: {}".format(array_dot_product(v1, v2)))
print("1+2+...+99:", 99*100//2)

Okay, it works, but how long does it take?

In [None]:
%timeit array_dot_product(v1, v2)

## Enter NumPy

Now let's try NumPy, which runs its array operations in optimized C code and so avoids the per-element overhead of the Python interpreter.

In [None]:
import numpy as np

In [None]:
v1_np = np.arange(100)
v2_np = np.ones(100)
print("v1 dot v2: {}".format(v1_np.dot(v2_np)))

Nice, aligned with our raw Python version. Now let's check the running time.

In [None]:
%timeit v1_np.dot(v2_np)

We can already see the difference. NumPy was roughly 6x faster than raw Python for a very small array. Now let's check with matrices.
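To see how the gap between the two dot products changes with vector length, here is a minimal sketch that uses the standard-library `timeit` module instead of the `%timeit` magic, so it also runs outside IPython. The helper name `compare` and the chosen sizes are assumptions made for illustration, and the measured ratios will vary by machine.

```python
import timeit
import numpy as np

def array_dot_product(v1, v2):
    # Same pure-Python loop as earlier in the notebook
    result = 0
    for v1_i, v2_i in zip(v1, v2):
        result += v1_i * v2_i
    return result

def compare(n, repeats=20):
    """Return (python_seconds, numpy_seconds) per call for vectors of length n."""
    v1, v2 = list(range(n)), [1] * n
    v1_np, v2_np = np.arange(n), np.ones(n)
    t_py = timeit.timeit(lambda: array_dot_product(v1, v2), number=repeats) / repeats
    t_np = timeit.timeit(lambda: v1_np.dot(v2_np), number=repeats) / repeats
    return t_py, t_np

for n in (100, 10_000, 100_000):
    t_py, t_np = compare(n)
    print("n={:>7,}: python {:.2e}s, numpy {:.2e}s, ratio ~{:.0f}x".format(
        n, t_py, t_np, t_py / t_np))
```

For short vectors the fixed cost of calling into NumPy is a large share of the total, so the ratio typically grows as `n` increases.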

In [None]:
def matrix_mul(m1, m2):
    num_rows = len(m1)
    num_columns = len(m2[0])
    internal_dim = len(m1[0])
    result = []
    for i in range(num_rows):
        new_row = []
        for j in range(num_columns):
            total = 0
            for k in range(internal_dim):
                total += m1[i][k] * m2[k][j]
            new_row.append(total)
        result.append(new_row)
    return result

In [None]:
m1_np = np.random.randn(100, 200)
m2_np = np.random.randn(200, 100)
m1_list = m1_np.tolist()
m2_list = m2_np.tolist()

result_raw = matrix_mul(m1_list, m2_list)
result_np = m1_np.dot(m2_np)

Checking the results...

In [None]:
eps = np.abs(result_raw - result_np).sum()
print('{} up to {}'.format(np.allclose(result_raw, result_np), eps))
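The two results are compared with `np.allclose` rather than `==` because floating-point addition is not associative: an optimized routine may accumulate the products in a different order than the Python loop, so the last few bits can differ. The sketch below illustrates this on a single dot product. The seed and vector length are arbitrary choices, and the exact difference depends on the NumPy build.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(200)
b = rng.standard_normal(200)

# Accumulate strictly left to right, as the pure-Python loops do
loop_sum = 0.0
for a_k, b_k in zip(a.tolist(), b.tolist()):
    loop_sum += a_k * b_k

# NumPy is free to accumulate in a different order
np_sum = float(a.dot(b))

diff = abs(loop_sum - np_sum)
print("absolute difference:", diff)
print("allclose:", np.allclose(loop_sum, np_sum))
```

With its default tolerances (`rtol=1e-05`, `atol=1e-08`), `np.allclose` accepts rounding-level discrepancies like this while still catching real bugs.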

Okay. Now let's time it again.

In [None]:
time_raw = %timeit -o matrix_mul(m1_list, m2_list) 

In [None]:
time_np = %timeit -o m1_np.dot(m2_np)

In [None]:
time_ratio = time_raw.average / time_np.average
print('NumPy is ~{:.0f}x faster than standard Python'.format(time_ratio))
print('Something that runs in 1h in NumPy would need to run for {:.0f} days in raw Python'.format(time_ratio / 24))

In [None]:
plot_times(['python', 'numpy'], [time_raw.average, time_np.average])

## Enter PyTorch

Now let's try PyTorch. Like NumPy, PyTorch implements its linear algebra routines in a compiled C/C++ backend. However, it can also run those operations on GPUs. Let's try both variants and compare them.

In [None]:
import torch

In [None]:
m1_pt = torch.from_numpy(m1_np)
m2_pt = torch.from_numpy(m2_np)
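A short sketch of what `torch.from_numpy` does, using throwaway names (`arr`, `t`) so the notebook's variables are untouched: the tensor keeps the array's dtype and shares its memory instead of copying it.

```python
import numpy as np
import torch

arr = np.zeros(3)           # float64 by default
t = torch.from_numpy(arr)   # no copy: the tensor views the same buffer

print(t.dtype)              # torch.float64
arr[0] = 1.0                # modify the NumPy array in place...
print(t[0].item())          # 1.0  (...and the tensor sees the change)
```

Because the dtype is preserved, the PyTorch CPU timing that follows works on the same float64 data as the NumPy one.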

In [None]:
time_pt = %timeit -o m1_pt @ m2_pt

In [None]:
plot_times(['python', 'numpy', 'pytorch'], 
           [time_raw.average, time_np.average, time_pt.average])

Seems about the same... Now let's try to use a GPU:

In [None]:
m1_pt = m1_pt.to('cuda' if torch.cuda.is_available() else 'cpu')
m2_pt = m2_pt.to('cuda' if torch.cuda.is_available() else 'cpu')
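# CUDA kernels are launched asynchronously, so wait for them to finish inside the timed statement.
time_pt_gpu = %timeit -o (m1_pt @ m2_pt, torch.cuda.synchronize() if m1_pt.is_cuda else None)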

In [None]:
plot_times(['numpy', 'pytorch (cpu)', 'pytorch (gpu)'], 
           [time_np.average, time_pt.average, time_pt_gpu.average])
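A caveat when timing GPU code: CUDA kernels are launched asynchronously, so a timer that stops right after `m1 @ m2` returns may only measure the launch, not the computation. A minimal sketch of a manual timing helper that accounts for this is shown below. The name `time_matmul`, the warm-up count, and the matrix sizes are assumptions for illustration, and the helper falls back to plain CPU timing when no GPU is present.

```python
import time
import torch

def time_matmul(a, b, repeats=100):
    """Average seconds per `a @ b`, waiting for any pending CUDA work before reading the clock."""
    use_cuda = a.is_cuda
    for _ in range(3):                 # warm-up runs, excluded from the measurement
        a @ b
    if use_cuda:
        torch.cuda.synchronize()       # make sure warm-up work has finished
    start = time.perf_counter()
    for _ in range(repeats):
        a @ b
    if use_cuda:
        torch.cuda.synchronize()       # wait for all queued kernels to complete
    return (time.perf_counter() - start) / repeats

device = 'cuda' if torch.cuda.is_available() else 'cpu'
x = torch.randn(100, 200, dtype=torch.float64, device=device)
y = torch.randn(200, 100, dtype=torch.float64, device=device)
print('{}: {:.2e}s per matmul'.format(device, time_matmul(x, y)))
```

For matrices this small, the GPU may not come out ahead, since launch overhead can dominate the actual arithmetic.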

## Enter JIT

Now suppose we have a more complicated function that contains control flow (such as if-else statements). Normally, each branch is evaluated by the Python interpreter, which is slow. To circumvent that, we can "compile" our function or module into a fixed intermediate representation (TorchScript) that runs without the interpreter.

https://pytorch.org/docs/stable/jit.html

In [None]:
@torch.jit.script
def jit_mm(m1, m2):
    return m1 @ m2

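# As with the plain GPU timing, wait for any CUDA work to finish inside the timed statement.
time_pt_jit = %timeit -o (jit_mm(m1_pt, m2_pt), torch.cuda.synchronize() if m1_pt.is_cuda else None)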

plot_times(['numpy', 'pt (cpu)', 'pt (gpu)', 'pt (gpu+jit)'], 
           [time_np.average, time_pt.average, time_pt_gpu.average, time_pt_jit.average])
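The `jit_mm` function above contains no branches, so it does not exercise the control-flow case mentioned earlier. As a minimal sketch of that case, the hypothetical function `larger_peak` below has a data-dependent `if`/`else` that `torch.jit.script` keeps in the compiled representation.

```python
import torch

@torch.jit.script
def larger_peak(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # The branch depends on tensor values, so it is compiled as a real conditional
    if x.max() > y.max():
        r = x
    else:
        r = y
    return r

a = torch.tensor([1.0, 2.0])
b = torch.tensor([0.0, 5.0])

print(larger_peak(a, b))   # tensor([0., 5.])
print(larger_peak.code)    # TorchScript source recovered from the compiled function
```

Printing `larger_peak.code` is a quick way to confirm that the conditional survived compilation rather than being fixed to one branch.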

For more optimizations, check this blog post by Horace He:
[Making Deep Learning Go Brrrr From First Principles](https://horace.io/brrr_intro.html)