# Coding up a Simple Neural Network from Scratch

In this demo we code up a simple multi-layer neural network with a single input feature, a single output, and a single neuron per layer, (i) using PyTorch and (ii) from scratch. We then demonstrate that, starting from the same initial parameters and training on the same data, both approaches produce _identical_ results.

![title](simple-nn.png)

## A simple DNN in PyTorch
We begin by implementing a simple neural network with $n$ layers in PyTorch. The model uses the built-in **Linear** and **Tanh** layers from the _torch.nn_ namespace.

Note: PyTorch computes the gradients for backpropagation automatically, so we only need to define the forward function for our model.

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class PyDNN(torch.nn.Module):
    
    def __init__(self, n):
        super(PyDNN, self).__init__()
        self.layers = []
        for i in range(n):                          # define a sequence of linear and non-linear layers
            self.layers.append(nn.Linear(1, 1))
            self.layers.append(nn.Tanh())
        self.model = nn.Sequential(*self.layers)    # wrapping it for easier registration and invocation

    def forward(self, x):
        return self.model(x)
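The note above says that PyTorch derives the gradients for us. The following minimal sketch illustrates this for a single Linear + Tanh pair by comparing the gradients filled in by autograd with the chain rule applied by hand. The input value `0.3` and the variable names are assumptions made for illustration only.

```python
import torch

# Hypothetical input value, chosen only for illustration.
x = torch.tensor([[0.3]])
linear = torch.nn.Linear(1, 1)

out = torch.tanh(linear(x))     # forward pass: y = tanh(w * x + b)
out.sum().backward()            # autograd fills in .grad for weight and bias

# Chain rule by hand: dy/db = 1 - tanh^2(w*x + b), dy/dw = dy/db * x
manual_grad_b = 1 - out.item() ** 2
manual_grad_w = manual_grad_b * x.item()

autograd_grad_w = linear.weight.grad.item()
autograd_grad_b = linear.bias.grad.item()
print(abs(autograd_grad_w - manual_grad_w) < 1e-5,
      abs(autograd_grad_b - manual_grad_b) < 1e-5)
```

The two sets of gradients should agree up to float32 rounding, which is why the comparison uses a small tolerance rather than exact equality.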

## A simple DNN from scratch
Next, we re-implement the same model _without_ using PyTorch. Note that we now need to define both the forward and the backward methods. The forward method mirrors the one in the PyTorch model defined above.

In [2]:
class OurDNN:

    def __init__(self, n):
        self.layers = []
        for i in range(n):                              # define a sequence of linear and non-linear layers
            self.layers.append(OurLinear())
            self.layers.append(OurTanh())
        
    def forward(self, x):
        for layer in self.layers:                       # run layers sequentially
            x = layer.forward(x)
        return x
    
    def backward(self, grad_out, lr):
        for layer in reversed(self.layers):
            grad_out = layer.backward(grad_out, lr)     # backpropagate gradients through the layers
        return grad_out

Our DNN model makes use of two classes, **OurLinear** and **OurTanh**, which we implement next.

### A simple linear layer from scratch
Our linear layer computes the following function in the forward pass:

<img src="https://latex.codecogs.com/svg.latex?\Large&space;y=w.x+b"/>

The corresponding gradients for backpropagation are computed as follows:

<img src="https://latex.codecogs.com/svg.latex?\Large&space;\frac{\delta \ell}{\delta w}=\frac{\delta \ell}{\delta y}\times\frac{\delta y}{\delta w} = \frac{\delta \ell}{\delta y}\times x"/>
<img src="https://latex.codecogs.com/svg.latex?\Large&space;\frac{\delta \ell}{\delta b}=\frac{\delta \ell}{\delta y}\times\frac{\delta y}{\delta b} = \frac{\delta \ell}{\delta y}\times 1"/>
<img src="https://latex.codecogs.com/svg.latex?\Large&space;\frac{\delta \ell}{\delta x}=\frac{\delta \ell}{\delta y}\times\frac{\delta y}{\delta x} = \frac{\delta \ell}{\delta y}\times w"/>

In [3]:
import numpy as np
import random

class OurLinear:

    def __init__(self):
        self.weight = random.uniform(-1, 1)     # randomly initialize learnable weight
        self.bias = random.uniform(-1, 1)       # randomly initialize learnable bias
        
    def forward(self, x):
        self.x = x                              # save the input for gradient computation later
        out = x * self.weight + self.bias       # compute output
        return out
    
    def backward(self, grad_out, lr):
        grad_w = grad_out * self.x              # compute gradients w.r.t. weight
        grad_b = grad_out                       # compute gradients w.r.t. bias
        grad_in = grad_out * self.weight        # compute gradients w.r.t. input
        self.weight -= lr * grad_w              # update weight
        self.bias -= lr * grad_b                # update bias
        return grad_in
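As a worked example of the formulas above, the following sketch carries out one backward step of the linear layer by hand. The values of `x`, `w`, `b`, `grad_out`, and `lr` are made up for illustration and do not come from the training run.

```python
# Hypothetical values, chosen only for illustration.
x, w, b = 2.0, 0.5, 0.1
grad_out = 1.0              # gradient of the loss w.r.t. the layer output y
lr = 0.01

grad_w = grad_out * x       # dl/dw = dl/dy * x
grad_b = grad_out           # dl/db = dl/dy * 1
grad_in = grad_out * w      # dl/dx = dl/dy * w, using the weight from before the update

new_w = w - lr * grad_w     # 0.5 - 0.01 * 2.0
new_b = b - lr * grad_b     # 0.1 - 0.01 * 1.0

print(grad_w, grad_b, grad_in)   # prints: 2.0 1.0 0.5
```

Note that `grad_in` is computed before the weight is updated, as in `OurLinear.backward`, so the gradient passed to the previous layer is based on the same weight that was used in the forward pass.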

### A simple tanh layer from scratch
Our tanh layer computes the following function in the forward pass:

<img src="https://latex.codecogs.com/svg.latex?\Large&space;y=tanh(x)"/>

The corresponding gradients for backpropagation are computed as follows:

<img src="https://latex.codecogs.com/svg.latex?\Large&space;\frac{\delta \ell}{\delta x}=\frac{\delta \ell}{\delta y}\times\frac{\delta y}{\delta x} = \frac{\delta \ell}{\delta y}\times \big(1 - tanh^2(x)\big)"/>

In [4]:
class OurTanh:
        
    def forward(self, x):
        self.y = np.tanh(x)                     # compute and save output for gradient computation later
        return self.y
    
    def backward(self, grad_out, lr):
        grad_in = grad_out * (1 - self.y**2)    # compute gradients w.r.t. input
        return grad_in
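One way to gain confidence in the derivative used above is a finite-difference check. The sketch below compares the analytic expression `1 - tanh(x)**2` against a central difference at a few arbitrary points. The helper names `tanh_grad` and `numeric_grad` are hypothetical and not part of the notebook.

```python
import math

def tanh_grad(x):
    """Analytic derivative of tanh, written in terms of the output y = tanh(x)."""
    y = math.tanh(x)
    return 1 - y ** 2

def numeric_grad(f, x, eps=1e-6):
    """Central finite-difference approximation of df/dx (hypothetical helper)."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# Arbitrary test points, picked only for illustration.
max_error = max(abs(tanh_grad(x) - numeric_grad(math.tanh, x))
                for x in (-2.0, -0.5, 0.0, 0.5, 2.0))
print(max_error < 1e-8)
```

Because the derivative depends only on the output `y`, the layer can save `self.y` in the forward pass and does not need to keep the input around.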

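## A simple mean squared error loss from scratch
We also need to implement our own mean squared error (MSE) loss. PyTorch provides an MSE loss under torch.nn (`nn.MSELoss`), which we will use for the PyTorch model. Because we later train on one sample at a time and the model has a single output, the mean is taken over a single element, so our loss reduces to a plain squared error.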

Our MSE loss computes the following function in the forward pass, where $x$ is the model output and $y$ is the target:

<img src="https://latex.codecogs.com/svg.latex?\Large&space;\ell=(y-x)^2"/>

The corresponding gradients for backpropagation are computed as follows:

<img src="https://latex.codecogs.com/svg.latex?\Large&space;\frac{\delta \ell}{\delta x}=-2 \times (y - x)"/>

In [5]:
class OurMSELoss:

    def forward(self, x, y):
        self.z = (y - x)        # save the intermediate for gradient computation later
        loss = self.z**2        # compute output
        return loss
    
    def backward(self):
        grad_in = -2 * self.z   # compute gradients w.r.t. input
        return grad_in
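To make the sign of this gradient concrete, the sketch below evaluates the loss and its gradient for one made-up prediction and target, then takes a single gradient-descent step directly on the prediction. The values and the step size are assumptions chosen so that the arithmetic is exact in floating point.

```python
# Hypothetical prediction and target, chosen only for illustration.
pred, target = 0.25, 0.75
lr = 0.125

z = target - pred                   # 0.5
loss = z ** 2                       # 0.25
grad = -2 * z                       # -1.0: the loss falls as the prediction rises

new_pred = pred - lr * grad         # step against the gradient: 0.25 + 0.125
new_loss = (target - new_pred) ** 2

print(loss, grad, new_loss)         # prints: 0.25 -1.0 0.140625
```

The gradient is negative because the prediction is below the target, so stepping against it moves the prediction toward the target and lowers the loss.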

## Define the function we want to learn
Next, we define an arbitrary function $f(x)$ that we want to model. This is the ground-truth.

In [6]:
def some_function_we_want_to_learn(x):
    return np.tanh(-0.5 * x + 1)

## A simple dataset
We define a synthetic dataset by sampling $x$ uniformly at random from $[-1, 1]$ and computing $y=f(x)$.

In [7]:
from torch.utils import data

class SimpleDataset(data.Dataset):

    def __init__(self, func, num_samples=50):
        super(SimpleDataset, self).__init__()
        self.num_samples = num_samples
        xs = [random.uniform(-1, 1) for i in range(self.num_samples)]   # create random values for x
        xys = [([x], [func(x)]) for x in xs]                            # compute y for each x
        xys = [(np.asarray(x, dtype=np.float32),
                np.asarray(y, dtype=np.float32)) for (x, y) in xys]     # convert to numpy array
        self.samples = [(torch.from_numpy(x),            
                         torch.from_numpy(y)) for (x, y) in xys]        # convert to torch tensor

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]

## Methods to get and set DNN parameters
We define some utility methods for getting, copying, and displaying model parameters.

In [8]:
def get_params(dnn):
    params = {}
    params['ws'] = []
    params['bs'] = []
    for layer in dnn.layers:
        if isinstance(layer, OurLinear):
            params['ws'].append(layer.weight)
            params['bs'].append(layer.bias)
        elif isinstance(layer, nn.Linear):
            params['ws'].append(layer.weight.item())
            params['bs'].append(layer.bias.item())
    return params

def copy_params(py_dnn, our_dnn):
    n = len(py_dnn.layers)
    for i in range(n):
        if isinstance(py_dnn.layers[i], nn.Linear):
            our_dnn.layers[i].weight = py_dnn.layers[i].weight.item()
            our_dnn.layers[i].bias = py_dnn.layers[i].bias.item()

def format_params(params):
    return 'params = {}'.format({k : [round(x, 3) for x in v] for k, v in params.items()})

## Initialize data loader and models
We instantiate the data loader, the PyTorch DNN model, and our DNN model implemented from scratch. Shuffling is disabled so that both models see the samples in the same order.

In [9]:
from torch.utils.data import DataLoader

n = 1               # number of layers
lr = 0.01           # learning rate
epoch_size = 50     # number of samples per epoch
num_epochs = 10     # number of epochs

dataset = SimpleDataset(some_function_we_want_to_learn, num_samples = epoch_size * num_epochs)
dataloader = DataLoader(dataset, shuffle=False, batch_size=1)

py_dnn = PyDNN(n)
our_dnn = OurDNN(n)

## Initialize both models to the same random start state
Copying the PyTorch model's parameters into our model makes both models start from the same state. Comment this cell out if you want the two models to keep their independent random initializations.

In [10]:
copy_params(py_dnn, our_dnn)

## Train PyDNN model
Finally, we train the DNN model implemented using PyTorch...

In [11]:
py_params = get_params(py_dnn)
py_loss = nn.MSELoss()
py_opt = optim.SGD(py_dnn.parameters(), lr=lr)
py_dnn.train()
batch_idx = 0

print('[before training]\t{}'.format(format_params(py_params)))
for _, (x, y) in enumerate(dataloader):
    py_opt.zero_grad()              # PyTorch accumulates gradients, so we zero them before each batch
    out = py_dnn(x)                 # forward pass over the network
    loss = py_loss(out, y)          # compute the loss
    loss.backward()                 # compute the gradients
    py_opt.step()                   # update model parameters
    batch_idx += 1
    py_params = get_params(py_dnn)
    if batch_idx % epoch_size == 0:
        print('[after batch {}] loss = {:.3f} {}'.format(batch_idx, loss.item(), format_params(py_params)))

[before training]	params = {'ws': [0.547], 'bs': [-0.399]}
[after batch 50] loss = 0.021 params = {'ws': [0.513], 'bs': [-0.58]}
[after batch 100] loss = 0.019 params = {'ws': [0.501], 'bs': [-0.677]}
[after batch 150] loss = 0.005 params = {'ws': [0.488], 'bs': [-0.736]}
[after batch 200] loss = 0.003 params = {'ws': [0.487], 'bs': [-0.775]}
[after batch 250] loss = 0.001 params = {'ws': [0.487], 'bs': [-0.799]}
[after batch 300] loss = 0.002 params = {'ws': [0.49], 'bs': [-0.816]}
[after batch 350] loss = 0.000 params = {'ws': [0.494], 'bs': [-0.826]}
[after batch 400] loss = 0.000 params = {'ws': [0.498], 'bs': [-0.835]}
[after batch 450] loss = 0.001 params = {'ws': [0.503], 'bs': [-0.841]}
[after batch 500] loss = 0.001 params = {'ws': [0.509], 'bs': [-0.845]}


## Train OurDNN model
...and train the DNN model we implemented from scratch.

In [12]:
our_params = get_params(our_dnn)
our_loss = OurMSELoss()
batch_idx = 0

print('[before training]\t{}'.format(format_params(our_params)))
for _, (x, y) in enumerate(dataloader):
    out = our_dnn.forward(x.item())         # forward pass over the network
    loss = our_loss.forward(out, y.item())  # compute the loss
    grad = our_loss.backward()              # compute the gradients with respect to the model output
    our_dnn.backward(grad, lr)              # compute the gradients and update model parameters
    batch_idx += 1
    our_params = get_params(our_dnn)
    if batch_idx % epoch_size == 0:
        print('[after batch {}] loss = {:.3f} {}'.format(batch_idx, loss, format_params(our_params)))

[before training]	params = {'ws': [0.547], 'bs': [-0.399]}
[after batch 50] loss = 0.021 params = {'ws': [0.513], 'bs': [-0.58]}
[after batch 100] loss = 0.019 params = {'ws': [0.501], 'bs': [-0.677]}
[after batch 150] loss = 0.005 params = {'ws': [0.488], 'bs': [-0.736]}
[after batch 200] loss = 0.003 params = {'ws': [0.487], 'bs': [-0.775]}
[after batch 250] loss = 0.001 params = {'ws': [0.487], 'bs': [-0.799]}
[after batch 300] loss = 0.002 params = {'ws': [0.49], 'bs': [-0.816]}
[after batch 350] loss = 0.000 params = {'ws': [0.494], 'bs': [-0.826]}
[after batch 400] loss = 0.000 params = {'ws': [0.498], 'bs': [-0.835]}
[after batch 450] loss = 0.001 params = {'ws': [0.503], 'bs': [-0.841]}
[after batch 500] loss = 0.001 params = {'ws': [0.509], 'bs': [-0.845]}
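The two logs match line by line when read by eye. A programmatic comparison could look like the sketch below, which takes two dictionaries in the format returned by `get_params`. The helper name `params_match` and the tolerance are assumptions. A tolerance is used because the PyTorch model stores its parameters as 32-bit floats while the from-scratch model works with Python floats, so the values may agree closely without being bit-identical.

```python
import math

def params_match(params_a, params_b, abs_tol=1e-3):
    """Return True if two parameter dicts agree entry by entry within abs_tol."""
    if params_a.keys() != params_b.keys():
        return False
    for key in params_a:
        if len(params_a[key]) != len(params_b[key]):
            return False
        for a, b in zip(params_a[key], params_b[key]):
            if not math.isclose(a, b, rel_tol=0.0, abs_tol=abs_tol):
                return False
    return True

# Final parameters as printed (rounded to 3 decimals) in the two logs above.
py_final = {'ws': [0.509], 'bs': [-0.845]}
our_final = {'ws': [0.509], 'bs': [-0.845]}
print(params_match(py_final, our_final))    # prints: True
```

In the notebook itself, the same check could be run on the live models with `params_match(get_params(py_dnn), get_params(our_dnn))`.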
