# From NumPy to PyTorch

NumPy provides comprehensive functionality for mathematical operations on arrays. It is therefore a good starting point, but it is by no means tailored to Machine/Deep Learning.

There are various big packages (also known as frameworks) that offer functionality specialized in Machine/Deep Learning.

The two best-known ones are **PyTorch** from Meta and **TensorFlow** from Google. In this course, we will work with PyTorch, as it provides a good tradeoff between ease of use for academic purposes and scalability to large projects in further research.

Here, we will examine two of the main features of a framework such as PyTorch: utilization of the GPU for faster processing, and [automatic differentiation](https://en.wikipedia.org/wiki/Automatic_differentiation).

The core operation of Deep Learning is matrix multiplication. Making this operation as fast as possible would therefore save us a lot of time. Let's test how much.

We start by initializing two 2D arrays in NumPy with shapes `10000,1024` and `5000,1024`. We want to multiply them along their shared last dimension of size 1024. Since matrix multiplication requires the inner dimensions to match, we need to transpose the second array, which gives a result of shape `10000,5000`. We will use the command `%timeit` for benchmarking.

**Exercise**: Measure the runtime of the matrix multiplication $a @ b.T$ with the `%timeit` command. 

In [1]:
import numpy as np
import torch

# Initialize the arrays in NumPy. To make the comparison fair, we use PyTorch's default precision, float32
a = np.random.rand(10000,1024).astype(np.float32)
b = np.random.rand(5000,1024).astype(np.float32)

### START CODE HERE ### (≈ 1 line of code)
# Perform matrix multiplication and store result to variable c
c = np.matmul(a, b.T)
# alternatively
d = a @ b.T

# check the resulting shape
assert list(c.shape) == [10000,5000]

# repeat the matrix multiplication with %timeit in front of the command
# note: this works only in IPython notebooks
%timeit c = a @ b.T
### END CODE HERE ###

del a
del b

280 ms ± 3.33 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


**Expected Output**: `~454 ms ± 19.7 ms per loop`
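Since `%timeit` is only available in IPython, the same kind of measurement can be approximated in a plain Python script with the standard-library `timeit` module. The following is a minimal sketch, not an exact replica of what `%timeit` does internally. The smaller array shapes and the loop counts are assumptions chosen to keep the run short.

```python
import timeit

import numpy as np

# Smaller shapes than in the exercise (an assumption, to keep the run short)
a = np.random.rand(1000, 256).astype(np.float32)
b = np.random.rand(500, 256).astype(np.float32)

loops = 10
# 7 repeats of `loops` executions each; every entry is the total time of one repeat
runs = timeit.repeat(lambda: a @ b.T, repeat=7, number=loops)

# Convert to milliseconds per single matrix multiplication
per_loop_ms = [1000 * t / loops for t in runs]
mean_ms = float(np.mean(per_loop_ms))
std_ms = float(np.std(per_loop_ms))
print(f"{mean_ms:.3f} ms ± {std_ms:.3f} ms per loop "
      f"(mean ± std. dev. of 7 runs, {loops} loops each)")
```

The absolute numbers depend on the hardware and on the shapes, so they are not comparable to the outputs shown in this notebook.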

# Matrix multiplication in PyTorch

Now we will execute the same matrix multiplication in PyTorch. We will still use the CPU.

**Exercise**: Measure the runtime of the same matrix multiplication in PyTorch with the `%timeit` command. 

In [2]:
# Initialize the tensors in PyTorch
a = torch.rand(10000,1024)
b = torch.rand(5000,1024)

### START CODE HERE ### (≈ 1 line of code)
# Perform matrix multiplication and store result to variable c
c = torch.matmul(a,b.T)
d = a@b.T

assert list(c.shape) == [10000,5000]
print(d.shape, c.shape, torch.sum(c-d))

%timeit c = torch.matmul(a,b.T)
### END CODE HERE ###

torch.Size([10000, 5000]) torch.Size([10000, 5000]) tensor(0.)
670 ms ± 11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


**Expected Output**: `~703 ms ± 27.6 ms per loop`

# Matrix multiplication on the GPU with PyTorch

Now we repeat the same process on the GPU. You can move a tensor to the GPU by calling `a = a.cuda()` or `a = a.to("cuda")`. Similarly, you can move a tensor back to the CPU with `a = a.cpu()`.

The resulting tensor needs to be copied back to the CPU inside the timed statement. CUDA operations are launched asynchronously, so without this transfer `%timeit` would not wait for the multiplication to finish and the measurement would be unreliable.

In [3]:
### START CODE HERE ### (≈ 3 lines of code)
a = a.cuda()
b = b.cuda()
%timeit c = (a @ b.T).cpu()
### END CODE HERE ###

37.6 ms ± 584 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)


**Expected Output**: `~41.3 ms ± 195 µs per loop`
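The timing above includes the copy of the result back to the CPU. If only the multiplication itself should be measured, one option is to wait for the GPU explicitly with `torch.cuda.synchronize()` instead. The sketch below assumes smaller shapes than the exercise and falls back to the CPU when no CUDA device is available. `timed_matmul` is a hypothetical helper introduced only for this illustration.

```python
import time

import torch

# Fall back to the CPU when no CUDA device is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Smaller shapes than in the exercise (an assumption, to keep the run short)
a = torch.rand(1000, 256, device=device)
b = torch.rand(500, 256, device=device)


def timed_matmul(x, y, loops=10):
    """Hypothetical helper: average seconds per x @ y.T over `loops` runs."""
    if x.is_cuda:
        torch.cuda.synchronize()  # finish pending GPU work before starting the clock
    start = time.perf_counter()
    for _ in range(loops):
        out = x @ y.T
    if x.is_cuda:
        torch.cuda.synchronize()  # wait until all queued kernels have finished
    return (time.perf_counter() - start) / loops, out


seconds, c = timed_matmul(a, b)
print(f"{device}: {1000 * seconds:.3f} ms per loop, result shape {tuple(c.shape)}")
```

Creating the tensors directly with `device=device` avoids a separate `.cuda()` call. The result `c` stays on the device it was computed on.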

# Final notes

The results may vary slightly depending on the hardware. You should see a significant speedup of the matrix multiplication. On Google Colab, I got a speedup of ~7.4 times.

In practice, this means that a deep learning model that takes one week to train on the CPU with NumPy could be trained in about a single day on the GPU with PyTorch.
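The arithmetic behind these figures can be checked with the timings printed in the cells above. This is only a back-of-the-envelope sketch: it assumes, optimistically, that an entire training run speeds up by the same factor as this single matrix multiplication.

```python
# Timings printed in the cells above, in milliseconds
numpy_cpu_ms = 280.0
torch_cpu_ms = 670.0
torch_gpu_ms = 37.6

speedup_vs_numpy = numpy_cpu_ms / torch_gpu_ms
speedup_vs_torch_cpu = torch_cpu_ms / torch_gpu_ms

# Projected duration of a job that takes 7 days with NumPy on the CPU,
# assuming the whole job scales like this one matrix multiplication
days_on_gpu = 7 / speedup_vs_numpy

print(f"GPU vs NumPy on CPU:   {speedup_vs_numpy:.1f}x")      # 7.4x
print(f"GPU vs PyTorch on CPU: {speedup_vs_torch_cpu:.1f}x")  # 17.8x
print(f"7 days on the CPU -> about {days_on_gpu:.2f} days on the GPU")  # 0.94
```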

