# Goal

Recreate fastai, and much of PyTorch, from the foundations.

# Rules of the game

We can use:
* pure Python
* anything in the Python standard library
* any non-data-science modules
* PyTorch for array creation, indexing into arrays, generating random numbers
* fastai.datasets
* matplotlib
* once we've created something ourselves, we can subsequently use the "real" version

# Things we are assumed to already know:
* Affine functions and non-linearities
* Parameters and activations
* Random initialization and transfer learning
* SGD, momentum, and Adam
* Convolutions
* Batch Norm
* Image classification and regression
* Embeddings
* Weight decay
* Res/dense blocks
* Continuous and categorical variables
* Collaborative filtering
* Language models; NLP classification
* Segmentation; U-net; GANs

# First steps:
* matrix multiplication
* ReLU
* Initialization
* FC forward

In [1]:
import operator

Write a unit test to test equality of two things

In [2]:
def test
    
    
def test_eq


# Import allowable things

In [3]:
from pathlib import Path
from IPython.core.debugger import set_trace
from fastai import datasets
import pickle, gzip, math, matplotlib as mpl
import torch # only for array creation!
import matplotlib.pyplot as plt
from torch import tensor

MNIST_URL='http://deeplearning.net/data/mnist/mnist.pkl'

Use datasets.download_data to grab MNIST and set its local location as the path:

Open the file with gzip and grab the training and validation sets

The tuples we get contain NumPy arrays, which are not allowed, so we need to convert them into tensors (use the tensor function).

Set n and c to be the number of examples and the number of columns, respectively

Look inside our training and validation sets

Add some tests using our equality test function:
* check that we have 50,000 training examples and that the relevant dimensions for x and y match this
* check that we have 784 columns (28*28)
* check that the values in y make sense

Grab an example from the training set, use the view method to get a 28x28 version of it, and check its type

Plot the example using plt.imshow

In [12]:
mpl.rcParams['image.cmap'] = 'gray'

# Moving on to our linear model

Create a weight matrix and bias vector that will allow us to calculate xW+b for a bunch of examples x.  Use torch.randn and torch.zeros

In [None]:
weights = 
bias = 
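
A minimal sketch of one way to fill in this cell, assuming 784 input pixels (28*28) and 10 output classes. The seed and the names `n_in` and `n_out` are assumptions added only to make the illustration reproducible.

```python
import torch

torch.manual_seed(0)   # assumption: seeded only for reproducibility
n_in, n_out = 784, 10  # 28*28 pixels in, 10 digit classes out

weights = torch.randn(n_in, n_out)  # one column of weights per class
bias = torch.zeros(n_out)           # one bias per class

print(weights.shape, bias.shape)
```

With these shapes, a batch `x` of shape (batch, 784) gives `x @ weights + bias` of shape (batch, 10).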

Create a function to multiply two matrices.  For now, use three loops.

In [10]:
def matmul
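
One possible three-loop version, as a sketch: it walks every row of `a`, every column of `b`, and accumulates the products along the shared dimension. The small matrices at the end are made-up values for a quick sanity check.

```python
import torch

def matmul(a, b):
    ar, ac = a.shape  # rows and columns of a
    br, bc = b.shape  # rows and columns of b
    assert ac == br, "inner dimensions must match"
    c = torch.zeros(ar, bc)
    for i in range(ar):          # each row of a
        for j in range(bc):      # each column of b
            for k in range(ac):  # accumulate along the shared dimension
                c[i, j] += a[i, k] * b[k, j]
    return c

a = torch.tensor([[1., 2.], [3., 4.]])
b = torch.tensor([[5., 6.], [7., 8.]])
print(matmul(a, b))  # values: [[19, 22], [43, 50]]
```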


Test whether the multiplication works by multiplying a few training examples by our weight matrix and time the result:

In [12]:
m1 = x_train[:5]
m2 = weights
%time t1 = matmul(m1, m2)

CPU times: user 512 ms, sys: 0 ns, total: 512 ms
Wall time: 510 ms


Check the shape of t1 to see whether it makes sense

# Element-wise operations

We want to speed up our multiplication, so we're going to use element-wise operations whenever possible.  Practice using element-wise operations to calculate the Frobenius norm:

$$\|A\|_F = \sqrt{\sum_{i,j=1}^{n} |a_{ij}|^{2}}$$

In [4]:
m1 = tensor([[1., 2, 3], [4, 5, 6], [7, 8, 9]])
m1

tensor([[1., 2., 3.],
        [4., 5., 6.],
        [7., 8., 9.]])

Find the Frobenius norm of m1.

The result should be `tensor(16.8819)`.
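
A short sketch of the element-wise approach: square every entry, sum them all, and take the square root. For this matrix the sum of squares is 285, whose square root is about 16.8819.

```python
import torch

m1 = torch.tensor([[1., 2, 3], [4, 5, 6], [7, 8, 9]])

# Element-wise square, then reduce to a scalar, then square root
frob = (m1 * m1).sum().sqrt()
print(frob)  # tensor(16.8819)
```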

Now use element-wise operations to replace the innermost loop of matmul.

In [72]:
def matmul
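
One way this step might look, as a sketch: the loop over `k` disappears because a row of `a` and a column of `b` are multiplied element-wise and summed in a single expression.

```python
import torch

def matmul(a, b):
    ar, ac = a.shape
    br, bc = b.shape
    assert ac == br, "inner dimensions must match"
    c = torch.zeros(ar, bc)
    for i in range(ar):
        for j in range(bc):
            # Row i of a times column j of b, element-wise, then summed
            c[i, j] = (a[i, :] * b[:, j]).sum()
    return c

a = torch.tensor([[1., 2.], [3., 4.]])
b = torch.tensor([[5., 6.], [7., 8.]])
print(matmul(a, b))  # values: [[19, 22], [43, 50]]
```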

Now again test the multiplication of m1 and m2:

In [73]:
%timeit -n 10 t2 = matmul(m1, m2)

824 µs ± 31.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


We want to check whether the answer is right, but floating-point results can differ slightly depending on the order of operations, so we might not get exact equality even with a correct implementation.

Use torch.allclose to define a function near(a, b) that tests whether a and b are close to each other (i.e. equal with some tolerance).

In [15]:
def near
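
A possible definition, sketched with `torch.allclose`. The tolerances `rtol=1e-3` and `atol=1e-5` are assumptions chosen for illustration; tighter or looser values may suit other checks.

```python
import torch

def near(a, b):
    # True when |a - b| <= atol + rtol * |b| holds for every element
    return torch.allclose(a, b, rtol=1e-3, atol=1e-5)

a = torch.tensor([1.0, 2.0])
print(near(a, a + 1e-6))  # True: difference is within tolerance
print(near(a, a + 0.1))   # False: difference is too large
```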

In [77]:
near(t1, matmul(m1, m2))

True

# Broadcasting

Broadcasting is going to let us get rid of even more loops.  Useful things to investigate broadcasting:

* .expand_as()
* .storage()
* .stride()
* .unsqueeze()
* indexing into a slice with None

Use broadcasting to remove the next-innermost loop of matmul (so there's only one loop left)

In [20]:
def matmul
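
A sketch of one way to do this with broadcasting: each row of `a` is turned into a column vector so that it broadcasts across all columns of `b` at once, leaving only the loop over rows.

```python
import torch

def matmul(a, b):
    ar, ac = a.shape
    br, bc = b.shape
    assert ac == br, "inner dimensions must match"
    c = torch.zeros(ar, bc)
    for i in range(ar):
        # a[i] has shape (ac,); unsqueeze(-1) makes it (ac, 1), which
        # broadcasts against b of shape (ac, bc). Summing over dim 0
        # collapses the shared dimension. a[i, :, None] is equivalent.
        c[i] = (a[i].unsqueeze(-1) * b).sum(dim=0)
    return c

a = torch.tensor([[1., 2.], [3., 4.]])
b = torch.tensor([[5., 6.], [7., 8.]])
print(matmul(a, b))  # values: [[19, 22], [43, 50]]
```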


In [142]:
%timeit -n 10 t3 = matmul(m1, m2)

330 µs ± 117 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [143]:
near(t1, matmul(m1, m2))

True

# Now using Einstein summation

Use torch.einsum to get rid of all the loops in matmul

In [21]:
def matmul
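
A minimal sketch using `torch.einsum`. In the subscript string `'ik,kj->ij'`, the index `k` appears in both inputs but not in the output, so it is the dimension that gets summed over.

```python
import torch

def matmul(a, b):
    # i indexes rows of a, j indexes columns of b, k is summed away
    return torch.einsum('ik,kj->ij', a, b)

a = torch.tensor([[1., 2.], [3., 4.]])
b = torch.tensor([[5., 6.], [7., 8.]])
print(matmul(a, b))  # values: [[19, 22], [43, 50]]
```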


In [149]:
%timeit -n 10 t3 = matmul(m1, m2)

92 µs ± 28 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [24]:
near(t1, matmul(m1, m2))

True

# Now using PyTorch

Now that we've written our own fast matmul, the rules of the game let us use the one built into PyTorch

In [25]:
%timeit -n 10 t4 = # fill in with the pytorch way

The slowest run took 5.27 times longer than the fastest. This could mean that an intermediate result is being cached.
23 µs ± 20.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [26]:
near(t1, m1.matmul(m2))

True

# Now we need ReLU and random initialization

First create a normalization function that takes a tensor, its mean, and its standard deviation and returns the normalized version

Use .mean() and .std() to get the mean and standard deviation of the training set, check that they're correct (tensor(0.1304), tensor(0.3073)), and then use them to normalize the training set AND the validation set
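
A sketch of the normalization step. The random tensors are synthetic stand-ins for the MNIST data (an assumption, so the example runs without a download); the point to notice is that the validation set is normalized with the training set's statistics.

```python
import torch

def normalize(x, m, s):
    # Shift to zero mean and scale to unit standard deviation
    return (x - m) / s

torch.manual_seed(0)
x_train = torch.rand(1000, 784)  # synthetic stand-in for the training set
x_valid = torch.rand(200, 784)   # synthetic stand-in for the validation set

train_mean, train_std = x_train.mean(), x_train.std()
x_train = normalize(x_train, train_mean, train_std)
# Training statistics on purpose, so both sets live on the same scale
x_valid = normalize(x_valid, train_mean, train_std)

print(x_train.mean(), x_train.std())
```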

Use .abs() to define a test to see whether something is close to 0 (within a specified tolerance)

In [23]:
def test_near_zero

Test whether the training set's mean is near 0, and whether its standard deviation - 1 is near 0

In [24]:
test_near_zero(x_train.mean())
test_near_zero(x_train.std() - 1)

Set the following variables to the correct values:
* n = number of items in the training set
* m = length of each item in the training set
* c = number of possible classes

n, m, c should be (50000, 784, tensor(10))

# Now creating a simple model

It's going to have one hidden layer and one output activation.  Instead of using cross-entropy loss, we are going to use MSE (which doesn't make real-world sense, but it's ok for now)

In [25]:
nh = # size of the hidden layer 

Initialize two weight matrices w1 and w2 using He initialization, and two bias vectors b1 and b2 initialized to zeros.  Relevant functions are torch.randn, math.sqrt, and torch.zeros

Use test_near_zero to see whether the mean of the weight matrices is near zero, and whether their standard deviations are close to $\sqrt{2/m}$
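
A sketch of He initialization under stated assumptions: `nh = 50` is an illustrative hidden size (the worksheet leaves it for the reader to choose), and the second layer has a single output because the loss is MSE. Scaling standard normal samples by $\sqrt{2/\text{fan\_in}}$ gives weights with that standard deviation.

```python
import math
import torch

torch.manual_seed(0)  # assumption: seeded only for reproducibility
m, nh = 784, 50       # assumption: 784 inputs, illustrative hidden size 50

w1 = torch.randn(m, nh) * math.sqrt(2 / m)   # fan-in of layer 1 is m
b1 = torch.zeros(nh)
w2 = torch.randn(nh, 1) * math.sqrt(2 / nh)  # fan-in of layer 2 is nh
b2 = torch.zeros(1)

# Mean should be near 0 and std near sqrt(2/m)
print(w1.mean(), w1.std(), math.sqrt(2 / m))
```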

Use .clamp_min to define ReLU

Use @ to define a linear layer -- it should take inputs x, a weight matrix w, and a bias vector b and return x @ w + b

(Recall that the output of our first layer actually passes through a ReLU before it reaches the second layer.)

Now get the output when ReLU is applied to the output of this first linear layer, applied to the validation set.
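
A sketch of the two building blocks and their composition. The random input is a stand-in for the normalized validation set, and the sizes 784 and 50 are assumptions for illustration.

```python
import math
import torch

def relu(x):
    # Replace every negative value with zero
    return x.clamp_min(0.)

def lin(x, w, b):
    # Affine function: matrix product plus bias (broadcast over the batch)
    return x @ w + b

torch.manual_seed(0)
x = torch.randn(100, 784)                     # stand-in for normalized inputs
w1 = torch.randn(784, 50) * math.sqrt(2 / 784)
b1 = torch.zeros(50)

t = relu(lin(x, w1, b1))
print(t.mean(), t.std())
```

Because ReLU discards the negative half of the activations, the mean of `t` is no longer zero, which is worth keeping in mind when checking the statistics.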

Now that we have recreated He initialization (with ReLU) from scratch, we can use PyTorch's version, torch.nn.init.kaiming_normal_().  Recall that we have to do something unintuitive with the mode argument, because our weight matrix is shaped (inputs, outputs) while PyTorch's linear layer stores its weights as (outputs, inputs).

In [43]:
from torch.nn import init

Re-initialize w1 using the built-in function
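
A sketch of the built-in call, assuming the same illustrative sizes of 784 inputs and 50 hidden units. `kaiming_normal_` treats dimension 1 of a 2-D tensor as the fan-in and dimension 0 as the fan-out, so for a matrix shaped (inputs, outputs) it is `mode='fan_out'` that selects the number of inputs.

```python
import math
import torch
from torch.nn import init

torch.manual_seed(0)
m, nh = 784, 50  # assumption: illustrative sizes

w1 = torch.zeros(m, nh)
# For our (inputs, outputs) layout, 'fan_out' refers to dimension 0, i.e. m
init.kaiming_normal_(w1, mode='fan_out')

print(w1.std(), math.sqrt(2 / m))  # the two values should be close
```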

Now use w1 to get the output of the linear layer (followed by ReLU)

Check the mean and std of the output and of w1

# Forward pass

Now create a model (remember it can just be a function) that takes a batch of inputs, applies linear -> ReLU -> linear, and returns the output
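
One way the model function might look, as a sketch. The weights are created inline with illustrative sizes (784 inputs, 50 hidden units, 1 output) and the input batch is random stand-in data, so the example runs on its own.

```python
import math
import torch

def relu(x): return x.clamp_min(0.)
def lin(x, w, b): return x @ w + b

torch.manual_seed(0)
m, nh = 784, 50  # assumption: illustrative sizes
w1 = torch.randn(m, nh) * math.sqrt(2 / m)
b1 = torch.zeros(nh)
w2 = torch.randn(nh, 1) * math.sqrt(2 / nh)
b2 = torch.zeros(1)

def model(xb):
    l1 = lin(xb, w1, b1)    # first affine layer
    l2 = relu(l1)           # non-linearity
    return lin(l2, w2, b2)  # second affine layer, one output per example

x_valid = torch.randn(64, m)  # stand-in batch
preds = model(x_valid)
assert preds.shape == torch.Size([x_valid.shape[0], 1])
```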

In [65]:
%timeit -n 10 _=model(x_valid)

3.2 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


Add an assert to make sure the shape of the output matches what we expect it to be.

Now we need a loss function.  We're going to use MSE for the moment (not realistic).

Define a function that takes two vectors (predictions and ground truth) and returns the MSE.  Remember that .squeeze() might be useful here. 
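
A sketch of the loss, with a small made-up example showing why the squeeze matters: the model output has shape (n, 1) while the targets have shape (n,), and subtracting them directly would broadcast to an (n, n) matrix.

```python
import torch

def mse(output, targ):
    # Drop the trailing unit dimension so shapes match: (n, 1) -> (n,)
    return (output.squeeze(-1) - targ).pow(2).mean()

output = torch.tensor([[1.], [2.], [3.]])  # shape (3, 1)
targ = torch.tensor([1., 2., 5.])          # shape (3,)

print((output - targ).shape)  # torch.Size([3, 3]): silent broadcasting
print(mse(output, targ))      # tensor(1.3333), i.e. (0 + 0 + 4) / 3
```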

Use .float() with our y vectors so we can use them with MSE

Create a set of predictions using our model and calculate their MSE

tensor(29.2302)

# Now for the backward pass

First we need to define functions to get gradients of the functions we used in the forward pass.

In [117]:
def mse_grad
    # gradient with respect to the outputs of the previous layer


In [131]:
def relu_grad
    # gradient of ReLU wrt input activations


In [134]:
def lin_grad
    # gradient of matmul wrt inputs


Now incorporate everything into a forward and backward pass
* takes a set of x and y inputs
* runs a forward pass
* calculates the loss (not technically needed at this step)
* executes a backward pass

In [135]:
def forward_and_backward
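
A sketch of how the gradient helpers and the combined pass could fit together. Storing each gradient in a `.g` attribute on the tensor is a convention assumed for this illustration, and the sizes and random data are stand-ins.

```python
import math
import torch

def relu(x): return x.clamp_min(0.)
def lin(x, w, b): return x @ w + b
def mse(output, targ): return (output.squeeze(-1) - targ).pow(2).mean()

def mse_grad(inp, targ):
    # Derivative of mean((inp - targ)^2) with respect to inp
    inp.g = 2. * (inp.squeeze(-1) - targ).unsqueeze(-1) / inp.shape[0]

def relu_grad(inp, out):
    # Gradient flows through only where the input was positive
    inp.g = (inp > 0).float() * out.g

def lin_grad(inp, out, w, b):
    inp.g = out.g @ w.t()
    # Outer product per example, summed over the batch
    w.g = (inp.unsqueeze(-1) * out.g.unsqueeze(1)).sum(0)
    b.g = out.g.sum(0)

def forward_and_backward(inp, targ):
    # Forward pass
    l1 = lin(inp, w1, b1)
    l2 = relu(l1)
    out = lin(l2, w2, b2)
    loss = mse(out, targ)  # not needed for the gradients themselves
    # Backward pass, in reverse order of the forward pass
    mse_grad(out, targ)
    lin_grad(l2, out, w2, b2)
    relu_grad(l1, l2)
    lin_grad(inp, l1, w1, b1)
    return loss

torch.manual_seed(0)
n, m, nh = 20, 784, 50  # assumption: illustrative sizes
x, y = torch.randn(n, m), torch.randn(n)
w1 = torch.randn(m, nh) * math.sqrt(2 / m)
b1 = torch.zeros(nh)
w2 = torch.randn(nh, 1) * math.sqrt(2 / nh)
b2 = torch.zeros(1)

loss = forward_and_backward(x, y)
print(loss, w1.g.shape, b1.g.shape, w2.g.shape, b2.g.shape)
```

A useful check is to compare `w1.g`, `b1.g`, `w2.g`, and `b2.g` against the gradients autograd produces for the same weights.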



# Refactor the layers as classes
* use `__call__` for the forward
* save the output of each layer
* also return the output of each layer
* define backward like we did earlier

In [37]:
class relu

In [38]:
class linear

In [39]:
class mse


Now refactor `Model` as a class
* initialize with the weights and biases
* use `__call__` to do a forward pass and return the loss
* define `backward` to run a backward pass

In [73]:
class model
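
A sketch of the class-based refactor. The capitalized class names, the `.g` gradient attribute, and the sizes are assumptions made for this illustration. Each layer remembers its input and output in `__call__`, so `backward` needs no arguments.

```python
import math
import torch

class Relu:
    def __call__(self, inp):
        self.inp = inp
        self.out = inp.clamp_min(0.)
        return self.out
    def backward(self):
        self.inp.g = (self.inp > 0).float() * self.out.g

class Lin:
    def __init__(self, w, b):
        self.w, self.b = w, b
    def __call__(self, inp):
        self.inp = inp
        self.out = inp @ self.w + self.b
        return self.out
    def backward(self):
        self.inp.g = self.out.g @ self.w.t()
        self.w.g = (self.inp.unsqueeze(-1) * self.out.g.unsqueeze(1)).sum(0)
        self.b.g = self.out.g.sum(0)

class Mse:
    def __call__(self, inp, targ):
        self.inp, self.targ = inp, targ
        self.out = (inp.squeeze(-1) - targ).pow(2).mean()
        return self.out
    def backward(self):
        diff = self.inp.squeeze(-1) - self.targ
        self.inp.g = 2. * diff.unsqueeze(-1) / self.inp.shape[0]

class Model:
    def __init__(self, w1, b1, w2, b2):
        self.layers = [Lin(w1, b1), Relu(), Lin(w2, b2)]
        self.loss = Mse()
    def __call__(self, x, targ):
        for layer in self.layers:
            x = layer(x)
        return self.loss(x, targ)
    def backward(self):
        self.loss.backward()
        for layer in reversed(self.layers):
            layer.backward()

torch.manual_seed(0)
n, m, nh = 20, 784, 50  # assumption: illustrative sizes
x, y = torch.randn(n, m), torch.randn(n)
w1 = torch.randn(m, nh) * math.sqrt(2 / m)
b1 = torch.zeros(nh)
w2 = torch.randn(nh, 1) * math.sqrt(2 / nh)
b2 = torch.zeros(1)

model = Model(w1, b1, w2, b2)
loss = model(x, y)
model.backward()
print(loss, w1.g.shape)
```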


Now refactor again to get rid of duplicate code: make a Module class that deals with initialization, and set defaults for forward and backward

In [6]:
class Module

In [160]:
class Lin

In [173]:
class Relu

In [174]:
class Mse

In [182]:
class Model
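
A sketch of how a shared base class could remove the duplicated bookkeeping. The method name `bwd` and the `.g` attribute are hypothetical conventions for this illustration: `Module.__call__` stores the inputs and output once, and subclasses only supply `forward` and `bwd`. Only `Relu` and `Mse` are shown; the other classes would follow the same pattern.

```python
import torch

class Module:
    def __call__(self, *args):
        # Remember inputs and output so backward can reuse them
        self.args = args
        self.out = self.forward(*args)
        return self.out
    def forward(self, *args):
        raise NotImplementedError("subclasses define forward")
    def backward(self):
        self.bwd(self.out, *self.args)
    def bwd(self, out, *args):
        raise NotImplementedError("subclasses define bwd")

class Relu(Module):
    def forward(self, inp):
        return inp.clamp_min(0.)
    def bwd(self, out, inp):
        inp.g = (inp > 0).float() * out.g

class Mse(Module):
    def forward(self, inp, targ):
        return (inp.squeeze(-1) - targ).pow(2).mean()
    def bwd(self, out, inp, targ):
        inp.g = 2. * (inp.squeeze(-1) - targ).unsqueeze(-1) / targ.shape[0]

x = torch.tensor([[-1.0], [2.0]])
y = torch.tensor([0.0, 1.0])
act, loss_fn = Relu(), Mse()

loss = loss_fn(act(x), y)
loss_fn.backward()  # sets .g on the ReLU output
act.backward()      # propagates it back to x
print(loss, x.g)    # loss is 0.5; x.g is [[0.], [1.]]
```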

Now adjust how we get the gradient of the linear layer to be more efficient

In [7]:
class Lin
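
A sketch of the idea behind the more efficient weight gradient: summing per-example outer products over the batch is the same computation as a single matrix product of the transposed input with the output gradient, without materializing the large intermediate tensor. The shapes here are made up for illustration.

```python
import torch

torch.manual_seed(0)
inp = torch.randn(8, 5)    # batch of 8 examples, 5 input features
out_g = torch.randn(8, 3)  # gradient of the loss w.r.t. 3 outputs

# Outer product per example: an (8, 5, 3) intermediate, then summed
slow = (inp.unsqueeze(-1) * out_g.unsqueeze(1)).sum(0)

# Same result as one matrix product, shape (5, 3)
fast = inp.t() @ out_g

print(torch.allclose(slow, fast, rtol=1e-4, atol=1e-5))
```

In a `Lin` layer's backward step, this would mean computing the weight gradient as `inp.t() @ out.g`.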