## Implementation of a simple MLP using only NumPy

This notebook implements a simple multilayer perceptron (MLP) from scratch using only NumPy. It is written for educational purposes, and the implementations are based on the relevant research papers: no "cheating" by looking at example code or using ChatGPT-like tools was involved. This also means there may be some errors; the classes will be updated as issues come up.

The design of the classes is modular to allow for easy modification and building of more complex networks using the same building blocks in the future.

The training example we will start with is still unknown :D

In [5]:
import sys
sys.path.append("c:/Users/sebu1/OneDrive/Github Projects/forwardforward/forwardforward")
import CellularAutomata
import numpy as np
import math

Let's start building the neural network by defining a single neuron. We want to use the ReLU non-linearity ($f(x) = \max(0, x)$), so we will use the weight initialization method introduced in https://arxiv.org/pdf/1502.01852.pdf.

The article mentions that using either eq. 10 or eq. 14 alone is sufficient, so let's opt for eq. 10: $\frac{1}{2}n_{l}Var[w_{l}] = 1, \forall{l}$. This leads to a zero-mean Gaussian with a standard deviation of $\sqrt{2/n_{l}}$. Here $w_{l}$ refers to the weights and $n_{l}$ to the number of input connections.

The bias term is initialized to zero.
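As a quick sanity check of the initialization described above, the sketch below samples weights with standard deviation $\sqrt{2/n_{l}}$ and verifies empirically that $\frac{1}{2}n_{l}Var[w_{l}]$ lands close to 1. The helper name `he_normal`, the layer sizes and the fixed seed are assumptions made for illustration only; they are not part of the notebook's classes.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the check is repeatable

def he_normal(n_in, n_out, rng):
    """Sample an (n_out, n_in) weight matrix from a zero-mean Gaussian
    with standard deviation sqrt(2 / n_in)."""
    std = np.sqrt(2.0 / n_in)
    return rng.normal(loc=0.0, scale=std, size=(n_out, n_in))

n_in = 256                      # hypothetical number of input connections
W = he_normal(n_in, 1000, rng)  # 1000 neurons, each with n_in weights

# Eq. 10 asks for (1/2) * n_l * Var[w_l] = 1, so this should be close to 1.
ratio = 0.5 * n_in * W.var()
print(ratio)
```

The more samples are drawn, the closer the printed value gets to 1; with a single neuron and few inputs the estimate would be much noisier.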

The implementation of backpropagation relies heavily on the amazing content of 3blue1brown: https://www.3blue1brown.com/lessons/backpropagation-calculus. In essence, backpropagation is one method for calculating the gradient vector of the cost function of the ANN, and it is based on the chain rule. Each neuron calculates three different gradients of the cost function using the chain rule: the gradient with respect to its weights, its bias, and its input activations. It then passes the gradient with respect to the input activations back to the previous layer of neurons, and the cycle continues.
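To make the three chain-rule gradients concrete, here is a minimal standalone sketch for a single ReLU neuron, compared against a finite-difference estimate. The squared-error cost, the example numbers and the function names `neuron_forward`, `neuron_backward` and `cost` are assumptions chosen for illustration; the notebook itself does not fix a cost function at this point.

```python
import numpy as np

def neuron_forward(x, w, b):
    """Single ReLU neuron: z = w . x + b, out = max(0, z)."""
    z = np.dot(x, w) + b
    return z, np.maximum(0.0, z)

def neuron_backward(x, w, z, dC_dout):
    """Chain rule: returns dC/dw, dC/db and dC/dx for one neuron."""
    dout_dz = 1.0 if z > 0 else 0.0   # derivative of ReLU
    dC_dz = dC_dout * dout_dz
    return dC_dz * x, dC_dz * 1.0, dC_dz * w

# Hypothetical inputs, parameters and target.
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.3, 0.2, 0.4])
b = 0.1
target = 0.0

def cost(x, w, b):
    """Assumed cost for this sketch: C = (out - target)^2."""
    return (neuron_forward(x, w, b)[1] - target) ** 2

z, out = neuron_forward(x, w, b)
dC_dout = 2.0 * (out - target)
grad_w, grad_b, grad_x = neuron_backward(x, w, z, dC_dout)

# Central finite differences on the weights as an independent check.
eps = 1e-6
num_grad_w = np.array([
    (cost(x, w + eps * e, b) - cost(x, w - eps * e, b)) / (2 * eps)
    for e in np.eye(len(w))
])
print(np.allclose(grad_w, num_grad_w, atol=1e-5))
```

The same finite-difference comparison can be repeated for `grad_b` and `grad_x`; `grad_x` is the quantity a neuron would hand back to the layer before it.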

In [4]:
class Neuron():
    def __init__(self, input_connections=1, activation_func=lambda x: np.maximum(0,x)):
        self.weights = np.random.normal(loc=0, scale=math.sqrt(2/input_connections), size=(input_connections))
        self.b = 0
        self.activation_f = activation_func
        self.activations = None
        self.z = None
    
    def forward(self, inputs):
        assert inputs.shape == self.weights.shape

        self.activations = inputs
        # Weighted sum of the inputs plus the bias (stored as self.b)
        z = np.dot(inputs, self.weights) + self.b
        self.z = z
        output = self.activation_f(z)
        return output
    
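    def backward(self, dC_dout):
        # Local derivatives of z = w . inputs + b
        dz_din = self.weights
        dz_dw = self.activations
        dz_db = 1
        # Derivative of the activation; this assumes ReLU is used
        dout_dz = 1.0 if self.z > 0 else 0.0
        # Chain rule: dC/dz = dC/dout * dout/dz
        dC_dz = dC_dout * dout_dz
        weight_grad = dC_dz * dz_dw
        bias_grad = dC_dz * dz_db
        # Gradient with respect to the inputs, passed back to the previous layer
        act_grad = dC_dz * dz_din
        return (weight_grad, bias_grad, act_grad)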

    # Functions for setting the weights and biases from outside the class
    def set_weights(self, new_weights):
        self.weights = new_weights

    def set_biases(self, new_b):
        self.b = new_b




Let's continue by implementing the Adam optimization algorithm, a refinement of Stochastic Gradient Descent, based on the Adam paper https://arxiv.org/pdf/1412.6980.pdf. Adam is an algorithm for first-order gradient-based optimization of stochastic objective functions. The name comes from adaptive moment estimation. The authors propose the following default values for the hyperparameters:
$$
\begin{aligned}
\alpha &= 0.001\\
\beta_1 &= 0.9\\
\beta_2 &= 0.999\\
\epsilon &= 10^{-8}
\end{aligned}
$$

The `step` method follows Algorithm 1 of the paper.
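Before wiring the optimizer to a network, it can help to see one iteration of Algorithm 1 in isolation. The sketch below is a standalone function, not the class used in this notebook; the name `adam_step` and the toy objective $f(\theta) = \sum_i \theta_i^2$ are assumptions made for illustration. It also shows a useful property of the bias correction: with zero-initialized moments, the very first update has magnitude close to $\alpha$ for every parameter, regardless of the gradient scale.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One iteration of Adam; returns the updated (theta, m, v, t)."""
    t += 1
    m = b1 * m + (1 - b1) * grad          # biased first moment estimate
    v = b2 * v + (1 - b2) * grad ** 2     # biased second moment estimate
    m_hat = m / (1 - b1 ** t)             # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)             # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v, t

# Toy objective f(theta) = sum(theta^2), whose gradient is 2 * theta.
theta0 = np.array([1.0, -3.0])
m = np.zeros_like(theta0)
v = np.zeros_like(theta0)
t = 0

# First step: the update is lr * g / (|g| + eps), i.e. about lr * sign(g).
theta, m, v, t = adam_step(theta0, 2 * theta0, m, v, t)
first_update = theta0 - theta
print(first_update)

# A few hundred more steps move both parameters toward the minimum at 0.
for _ in range(500):
    theta, m, v, t = adam_step(theta, 2 * theta, m, v, t)
print(theta)
```

Note that the moment estimates `m` and `v` are returned and fed back in on every call; the running averages only work if that state is carried from one step to the next.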

In [None]:
class Adam():
    # Note: 1e-8 is used because ^ is bitwise XOR in Python, not exponentiation
    def __init__(self, network, lr = 0.001, b1 = 0.9, b2 = 0.999, epsilon = 1e-8):
        self.num_weights = len(network.weights)
        self.net = network
        self.lr = lr
        self.b1 = b1
        self.b2 = b2
        self.e = epsilon
        self.m = 0
        self.v = 0
        self.t = 0
    
    def step(self, weight_g, bias_g):
        weights = self.net.weights
        biases = self.net.biases
        theta = np.append(weights, biases)

        # One step of while loop in Adam algorithm following the paper:
        self.t += 1
        gt = np.append(weight_g,bias_g)
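        # Store the moment estimates so they carry over to the next step
        self.m = self.b1*self.m + (1-self.b1)*gt
        self.v = self.b2*self.v + (1-self.b2)*gt**2
        # Bias-corrected moment estimates
        mt_hat = self.m/(1-self.b1**self.t)
        vt_hat = self.v/(1-self.b2**self.t)
        thetat = theta - self.lr*mt_hat/(np.sqrt(vt_hat) + self.e)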

        # Update network
        weight_t, bias_t = np.split(thetat, [self.num_weights])
        self.net.set_weights(weight_t)
        self.net.set_biases(bias_t)