In [1]:
# default_exp feedforward

In [50]:
# export
from fastai.datasets import *
import pathlib
import gzip
import pickle
import torch

# Load the data 

In [23]:
MNIST_URL = "http://deeplearning.net/data/mnist/mnist.pkl"

In [39]:
data_path = download_data(MNIST_URL, ext='.gz')

In [40]:
data_path

PosixPath('/home/paperspace/.fastai/data/mnist.pkl.gz')

In [135]:
with gzip.open(data_path, "rb") as f: 
    ((x_train, y_train), (x_valid, y_valid), _) = pickle.load(f, encoding='latin-1')
    (x_train, y_train, x_valid, y_valid) = map(torch.tensor, (x_train, y_train, x_valid, y_valid))

In [136]:
x_train.shape

torch.Size([50000, 784])

In [137]:
y_train.shape

torch.Size([50000])

In [138]:
x_valid.shape

torch.Size([10000, 784])

In [139]:
y_valid.shape

torch.Size([10000])

In [140]:
import math
math.sqrt(784)

28.0

So we have the flattened pixels from 60k 28x28 images, with 784 values per image. They come already split into a training set of 50k examples and a validation set of 10k examples; the third element of the pickle is discarded.
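To make the flattening concrete, here's a small sketch that uses a made-up stand-in tensor rather than a real MNIST row (the name `fake_row` is hypothetical). It shows that a 784-element row can be viewed as a 28x28 grid and flattened back without losing anything.

```python
import torch

# Stand-in for one flattened image; the values are made up, not real MNIST pixels.
fake_row = torch.arange(784, dtype=torch.float32)

# View the 784 values as a 28x28 grid (row-major, so grid row 1 starts at index 28).
image = fake_row.view(28, 28)
print(image.shape)

# Flattening the grid recovers the original row exactly.
restored = image.reshape(-1)
print(torch.equal(restored, fake_row))
```

With the real data, the same `view(28, 28)` call on a single row of `x_train` should give an image that can be plotted.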

# Basic Matrix Multiplication

Here's our first basic stab at matrix multiplication. It's going to be very slow, because we're doing the whole thing in python. It's nice for understanding exactly what we're working with, but we'll see very soon that there are some tricks we can use to speed this up a lot.

In [95]:
# export
def basic_matmul(a, b):
    assert a.shape[1] == b.shape[0]
    output = torch.zeros(a.shape[0], b.shape[1]).float()
    for i in range(a.shape[0]):
        for j in range(b.shape[1]):
            for k in range(a.shape[1]):
                output[i,j] += a[i,k] * b[k,j]
    return output                

In [96]:
mat1 = torch.tensor([[1,2],
                     [3,4]])

mat2 = torch.tensor([[5,6], 
                     [7,8], 
                     [9,10]])
try:
    basic_matmul(mat1, mat2)
    raise ValueError # if it doesn't throw an assertion error, we get here
except AssertionError:
    pass

In [97]:
mat3 = torch.tensor([[5,6], 
                     [7,8]])
expected = torch.tensor([
    [19,22],
    [43,50]]).float()

In [98]:
basic_matmul(mat1, mat3)

tensor([[19., 22.],
        [43., 50.]])

In [102]:
# export
def allclose(a, b, tol=1e-3): return (a - b).abs().max() < tol  # abs() so large negative differences also fail

In [101]:
assert allclose(basic_matmul(mat1, mat3), expected)

In [120]:
NUM_HIDDEN = 10

In [121]:
weights = torch.randn(x_train.shape[1], NUM_HIDDEN)
biases = torch.randn(1)

In [122]:
%time t1=basic_matmul(x_train[:5], weights)

CPU times: user 924 ms, sys: 0 ns, total: 924 ms
Wall time: 925 ms


Ok, so this is terribly slow. It takes us 924ms to multiply just 5 rows of x_train by a 784x10 weight matrix. That's only 5 x 10 x 784 = 39,200 multiply-adds, but each one runs as a separate python-level operation, so it still takes almost a full second. That's definitely not gonna get it done! We need to speed things along.

# PyTorch Operations

First things first, we can use pytorch's built-in operations to get rid of the innermost loop. Numerical operations built into pytorch are typically implemented in ATen, pytorch's C++ tensor library, which makes them super fast.

In [141]:
# export
def dot_prod_matmul(a, b):
    assert a.shape[1] == b.shape[0]
    output = torch.zeros(a.shape[0], b.shape[1]).float()
    for i in range(a.shape[0]):
        for j in range(b.shape[1]):
            output[i,j] = (a[i,:] * b[:,j]).sum()
    return output                

In [142]:
assert allclose(dot_prod_matmul(mat1, mat3), expected)

In [143]:
%time t1=dot_prod_matmul(x_train[:5], weights)

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 2.53 ms


Wow! So we're down from 924ms to 2.53ms. That means the basic version is...

In [144]:
924/2.53

365.21739130434787

...365 times slower than the version that hands the inner loop to pytorch. Not bad!

# Broadcasting

The reason the pure-python version is so slow is that all of the work is being done in actual python, and python itself is super slow. Any highly performant operation that runs from python typically delegates to a faster language. In the case of pytorch, vectorized operations are delegated to ATen, a tensor library written in C++, which is what speeds them up. Our dot product version still has two loops running in python, and broadcasting lets us hand more of that work over as a single vectorized operation on tensors of different shapes.


Here are the rules of broadcasting ([source](https://jakevdp.github.io/PythonDataScienceHandbook/02.05-computation-on-arrays-broadcasting.html)):
* Rule 1: If the two arrays differ in their number of dimensions, the shape of the one with fewer dimensions is padded with ones on its leading (left) side.
* Rule 2: If the shape of the two arrays does not match in any dimension, the array with shape equal to 1 in that dimension is stretched to match the other shape.
* Rule 3: If in any dimension the sizes disagree and neither is equal to 1, an error is raised.
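The three rules can be sketched as a small pure-python function that works only on shapes. This is an illustration of the rules as stated above, not how pytorch implements them, and the name `broadcast_shape` is made up for this example.

```python
from itertools import zip_longest

def broadcast_shape(shape_a, shape_b):
    """Return the shape two arrays broadcast to, or raise ValueError (Rule 3)."""
    result = []
    # Walk both shapes from the trailing dimension. A missing leading
    # dimension is filled in with size 1 (Rule 1).
    pairs = zip_longest(reversed(shape_a), reversed(shape_b), fillvalue=1)
    for dim_a, dim_b in pairs:
        if dim_a == dim_b:
            result.append(dim_a)
        elif dim_a == 1:
            result.append(dim_b)  # Rule 2: stretch a along this dimension
        elif dim_b == 1:
            result.append(dim_a)  # Rule 2: stretch b along this dimension
        else:
            # Rule 3: sizes disagree and neither is 1
            raise ValueError(f"cannot broadcast {shape_a} with {shape_b}")
    return tuple(reversed(result))

print(broadcast_shape((3, 1), (1, 3)))    # (3, 3)
print(broadcast_shape((), (3,)))          # (3,)
print(broadcast_shape((5, 784), (784,)))  # (5, 784)
```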

Here are some examples.

In [123]:
a = 1; b = torch.tensor([1,2,3])

In [124]:
a + b

tensor([2, 3, 4])

In this case, the scalar is treated as a 0-dim tensor, and then an axis of length 1 is prepended onto it, making it a 1-dim tensor of length 1. Then it is stretched along that axis to match the length of the second tensor, and the two are added together.
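Here's a sketch that spells those steps out by hand. Pytorch does this implicitly, so the intermediate names (`scalar`, `padded`, `stretched`) exist only for illustration.

```python
import torch

scalar = torch.tensor(1)        # 0-dim tensor, shape ()
b = torch.tensor([1, 2, 3])     # shape (3,)

padded = scalar[None]           # Rule 1: prepend an axis, shape () -> (1,)
stretched = padded.expand(3)    # Rule 2: stretch the length-1 axis, shape (1,) -> (3,)

print(scalar.shape, padded.shape, stretched.shape)

# Adding the manually stretched tensor matches the broadcast result.
print(torch.equal(stretched + b, 1 + b))
```

Note that `expand` returns a view rather than copying the value three times, which is part of why broadcasting is cheap.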

In [125]:
a = torch.tensor(
    [[1],
     [2],
     [3]]
)
b = torch.tensor([[1,2,3]])

In [126]:
a + b

tensor([[2, 3, 4],
        [3, 4, 5],
        [4, 5, 6]])

In this case, broadcasting happens in both directions. First we look at the vertical dimension. `a` has size 3 while `b` has size 1, so `b` is repeated three times to make them match. Then we look at the horizontal dimension. `b` has size 3 but `a` has size 1, so `a` is repeated three times to match. Then we add them together elementwise to get our final answer.
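To see the two stretched tensors explicitly, we can expand each one to the full 3x3 shape by hand. This is only a sketch for building intuition; the names `a_wide` and `b_tall` are made up here.

```python
import torch

a = torch.tensor([[1], [2], [3]])   # shape (3, 1)
b = torch.tensor([[1, 2, 3]])       # shape (1, 3)

a_wide = a.expand(3, 3)   # each row repeats its single value across 3 columns
b_tall = b.expand(3, 3)   # the single row is repeated down 3 rows

print(a_wide)
print(b_tall)

# Adding the expanded tensors gives the same result as letting a + b broadcast.
print(torch.equal(a_wide + b_tall, a + b))
```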

We can use broadcasting to remove the innermost remaining loop of our matmul function (the loop over `j` in `dot_prod_matmul`).

In [None]:
# TODO
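One way the cell above might be filled in is sketched below. This is an assumption about where the notebook is headed, not the author's implementation, and the name `broadcast_matmul` is hypothetical. The idea is to turn row `i` of `a` into a column of shape (k, 1), let it broadcast against `b` of shape (k, n), and sum over the first dimension to get a whole output row at once.

```python
import torch

def broadcast_matmul(a, b):
    assert a.shape[1] == b.shape[0]
    output = torch.zeros(a.shape[0], b.shape[1])
    for i in range(a.shape[0]):
        # a[i] has shape (k,); a[i][:, None] has shape (k, 1) and is
        # broadcast against b of shape (k, n). Summing over dim 0
        # collapses k and leaves row i of the product, shape (n,).
        output[i] = (a[i][:, None] * b).sum(dim=0)
    return output

mat1 = torch.tensor([[1., 2.], [3., 4.]])
mat3 = torch.tensor([[5., 6.], [7., 8.]])
print(broadcast_matmul(mat1, mat3))
```

Only the loop over rows of `a` is left in python, so the number of python-level operations no longer depends on the number of columns in `b`.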