In [41]:
import numpy as np

Here we are attempting to create a Neural Network for a Random Number Generator by scartch using only Numpy and Panda

We define these two very popular activation functions which can otherwise be found in the PyTorch library

These are both differentiable in  general ; ReLU is indifferentiable near 0 though

In [42]:
def tanh(x, derivative=False):
    if derivative:
        return 1.0 - tanh(x)**2
    else:
        return np.tanh(x)

def ReLU(x, derivative=False):
    if derivative:
        return 1 * (x > 0)  #returns 1 for any x > 0, and 0 otherwise

    return np.maximum(0, x)


def mse(target, actual, derivative=False):
    try:
        assert(target.shape == actual.shape)
    except AssertionError:
        print(f"Shape of target vector: {target.shape} does not match shape of actual vector: {actual.shape}")
        raise

    if derivative:
        error = (actual - target)

    else:
        error = np.sum(0.5 * np.sum((target-actual)**2, axis=1, keepdims=True))

    return error

Let us try to understand what particular terms mean ; the weight and bias matrices are obvious

Topology refers to the structure of the neural network ; if topology is a vector as [2,6,4,7] ; then we are essentially being told that the neural network has 4 layers : the first one with 2 neurons , the second with 6 neurons , the third with 4 neurons, and the fourth with 7 neuronns

Momentum is a term used for optimization purposes : without Momentum the weights might oscillate and take longer to converge whereas with Momentum the updates are smoother, and the weights converge faster because the momentum term helps to maintain a direction and dampens oscillations

It will be really helpful if you ask ChatGPT to give you an illustraton of how momentum works for a quadratic loss function

To really understand how netIns and netOuts work, let us consider a working example of what is really going on behind the scenes in the neural network

Suppose we have the topology of a neural network as [2,3,1] ; this implies we have three layers: an input layer, a hidden layer, and an output layer ; the weights matrix from the input layer to the hidden layer will be a 2x3 matrix while the weights matrix from the hidden layer to the output layer will be a 3x1 matrix

The bias vector will of course be a 3x1 vector for the first pass and a 1x1 vector for the second pass

Suppose the input is some x ; then netIn will be a 3x1 vector which will be the result of application of the weights and biases from the input layer to the hidden layer

When we apply the appropriate activation function on the netIn , we get the netOut

This netOut then is sent forward to the next layer as input and the process continues


The initialization of weights and biases method is just to randomly generate matrices and vectors of appropriate size required as per the topology of the neural network

As for the Xavier Initialization : The primary goal of Xavier initialization is to keep the scale of the gradients roughly the same in all layers of the network. This helps to mitigate the problem of vanishing or exploding gradients, which can occur when training deep neural networks.

So what we have with Xavier initialization simply is a uniform distribution between negative sqrt of 6/(inputDimensions + outputDimensions)

However that formula can be varied as per requirements

The feedforward step is pretty straightforward ; we take a dot product between the input and the weight matrix for that layer , add it to the bias vector and proceed to fetcht the ouput by applying the activation function on that vector

And we return that output


Moving on to the gradient descent section, we have, the following parameters :
layer_idx is the index of the layer whose weights and biases are being updated ;
gradient_mat is the gradient of the loss with respect to the weights of the current layer ; bias_gradient is the gradient of the loss with respect to the biases of the current layer

The term (self.momentum * self.last_change[layer_idx]) helps smooth the updates by incorporating a fraction of the previous weight change, which helps in reducing oscillations and speeds up convergence

The term -(self.learning_rate * gradient_mat[layer_idx]) is the standard gradient descent step, where the learning rate controls the size of the step in the direction of the negative gradient

The same calculation is applied to the biases using self.last_bias_change[layer_idx] and bias_gradient

In backpropagation , we begin from the output layer, which is the last layer, to work back on updating the weights and biases so as to optimize our neural network to give better results

We calculate the derivative of the activation function, then the derivative of the error function , and multiply those two terms in order to get the delta (change) we wish to make to our parameters

This process is then carried forward in the training function, which takes place for a number of epochs , each epoch further improving (hopefully) our model and seeking to minimize the loss

This is what really takes place behind the scenes of a neural network

In [43]:
class NeuralNetwork(object):
    randomNumberGenerator = np.random.default_rng()

    def __init__(self,
                 topology:list[int] = [],
                 learning_rate = 0.01,
                 momentum      = 0.1,
                 hidden_activation_func=ReLU,
                 output_activation_func=tanh,
                 init_method='random'):

        self.topology    = topology
        self.weight_mats = []
        self.bias_mats   = []

        self.learning_rate = learning_rate
        self.momentum      = momentum

        self.hidden_activation = hidden_activation_func
        self.output_activation = output_activation_func


        self._init_weights_and_biases(init_method)
        self.size             = len(self.weight_mats)
        self.netIns           = [None] * self.size
        self.netOuts          = [None] * self.size

        self.last_change = [np.zeros(mat.shape) for mat in self.weight_mats]
        self.last_bias_change      = [np.zeros(mat.shape) for mat in self.bias_mats]

    def _init_weights_and_biases(self, method='random'):
        if method.lower() == 'random':
            _init_func = lambda num_rows, num_cols: self.randomNumberGenerator.random(size=(num_rows, num_cols))

        elif method.lower() == 'xavier':
            _init_func = self._xavier_weight_initialization

        else:
            print(f"\t-> initialization method {method} not recognized. Defaulting to 'random'")
            _init_func = lambda num_rows, num_cols: self.randomNumberGenerator.random(size=(num_rows, num_cols))

        #-- set up matrices
        if len(self.topology) > 1:
            j = 1
            for i in range(len(self.topology)-1):
                num_rows = self.topology[i]
                num_cols = self.topology[j]

                mat         = _init_func(num_rows, num_cols)  #the +1 accounts for the bias weights
                bias_vector = _init_func(1, num_cols)

                self.weight_mats.append(mat)
                self.bias_mats.append(bias_vector)

                j += 1


    def _xavier_weight_initialization(self, num_rows, num_cols):
        '''A type of weight initialization that seems to be tailored to sigmoidal activation functions.
        Here is a reference: https://machinelearningmastery.com/weight-initialization-for-deep-learning-neural-networks/'''
        num_inputs = self.topology[0]

        lower_bound = -1 / np.sqrt(num_inputs)
        upper_bound = 1 / np.sqrt(num_inputs)

        mat = self.randomNumberGenerator.uniform(lower_bound, upper_bound, (num_rows, num_cols))
        return mat

    @property
    def shape(self):
        return tuple(self.topology)

    @property
    def n_trainable_params(self):
        n_params = 0
        for weight_mat, bias_mat in zip(self.weight_mats, self.bias_mats):
            n_params += weight_mat.size + bias_mat.size

        return n_params



    def feedforward(self, input_vector):

        self.netIns.clear()
        self.netOuts.clear()

        I = input_vector

        for idx, W in enumerate(self.weight_mats):

            bias_vector = self.bias_mats[idx]

            self.netOuts.append(I)
            I = np.dot(I, W) + bias_vector
            self.netIns.append(I)

            #-- apply activation function
            if idx == len(self.weight_mats) - 1:
                out_vector = self.output_activation(I)
            else:
                I          = self.hidden_activation(I)

        return out_vector


    def _gradient_descent(self, layer_idx, gradient_mat, bias_gradient):

        delta_weight = (self.momentum * self.last_change[layer_idx]) - (self.learning_rate * gradient_mat[layer_idx])

        delta_bias_weights =  (self.momentum * self.last_bias_change[layer_idx]) \
                            - (self.learning_rate * bias_gradient)

        self.weight_mats[layer_idx] += delta_weight
        self.bias_mats[layer_idx]   += delta_bias_weights

        self.last_change[layer_idx]      = 1 * delta_weight
        self.last_bias_change[layer_idx] = 1 * delta_bias_weights

    def backprop(self,
                 target,
                 output,
                 error_func,):

        for i in range(self.size):
            back_index =self.size-1 -i

            if i == 0:
                d_activ = self.output_activation(self.netIns[back_index], derivative=True)
                d_error = error_func(target,output,derivative=True)
                delta = d_error * d_activ


                gradient_mat  = np.dot(self.netOuts[back_index].T , delta)
                bias_grad_mat = 1 * delta

                self._gradient_descent(layer_idx=back_index, gradient_mat=gradient_mat, bias_gradient=bias_grad_mat)

            else:
                W_trans = self.weight_mats[back_index+1].T
                d_activ = self.hidden_activation(self.netIns[back_index],derivative=True)
                d_error = np.dot(delta, W_trans)
                delta = d_error * d_activ

                gradient_mat = np.dot(self.netOuts[back_index].T , delta)
                bias_grad_mat = 1 * delta

                self._gradient_descent(layer_idx=back_index, gradient_mat=gradient_mat, bias_gradient=bias_grad_mat)


    def train(self, input_set, target_set, epochs=1000, batch_size=0, error_threshold=1E-10, error_func=mse, verbose=True):

        if batch_size == 0:

            for epoch in range(epochs):
                error = 0

                for i in range(len(input_set)):
                    inputs = input_set[i:i+1]
                    targets = target_set[i:i+1]

                    error += self._train_helper(inputs, targets, error_func)

                if verbose and (epoch % 20 == 0):
                    self._print_training_info(epoch, epochs, error, error_threshold)

                if error <= error_threshold:
                    print(f"\t-> error {error} is lower than threshold {error_threshold}\n\tStopped at epoch {epoch}")
                    break

        elif batch_size == -1:

            for epoch in range(epochs):
                error = 0

                inputs  = input_set
                targets = target_set

                error += self._train_helper(inputs, targets, error_func)


                if verbose and (epoch % 20 == 0):
                    self._print_training_info(epoch, epochs, error, error_threshold)

                if error <= error_threshold:
                        print(f"\t-> error {error} is lower than threshold {error_threshold}\n\tStopped at epoch {epoch}")
                        break

        else:
            print("\t-> PROBLEM: mini-batches not supported yet. Choose batch_size 0 or -1")

        return error

    def _print_training_info(self, curr_epoch, total_epochs, curr_error, error_threshold):
        text = f"""{'-'*45}\n\t-> training step: :{curr_epoch}/{total_epochs}\n\t\t* current error: {curr_error}, threshold: {error_threshold}\n"""
        print(text)

    def _train_helper(self, input_set, target_set, error_func):
        nnet_output = self.feedforward(input_set)
        error       = error_func(target_set, nnet_output)

        self.backprop(target=target_set, output=nnet_output, error_func=error_func,)
        return error

Now we begin designing our logic gates

We have four gates here and we train on the NAND gates target

We can similarly train on the targets of other gates and get all the gates

In [44]:

inputs  = np.array([[0,0],
                    [0,1],
                    [1,0],
                    [1,1]])

#outputs of different logic gates when given the inputs above
training_targets = {
    'XOR'  : np.array([[0],
                       [1],
                       [1],
                       [0]]),

    'OR'   : np.array([[0],
                       [1],
                       [1],
                       [1]]),

    'AND'  : np.array([[0],
                       [0],
                       [0],
                       [1]]),

    'NAND' : np.array([[1],
                       [1],
                       [1],
                       [0]]),
}

In [45]:
# Create a neural network for the XOR logic gate, so two inputs and one output neuron.
nnet = NeuralNetwork([2, 3, 1], hidden_activation_func=ReLU, output_activation_func=tanh, init_method='random')

nnet.train(inputs, training_targets['NAND'], epochs=1000, batch_size=0)

---------------------------------------------
	-> training step: :0/1000
		* current error: 0.5008516419361585, threshold: 1e-10

---------------------------------------------
	-> training step: :20/1000
		* current error: 0.5008190858226067, threshold: 1e-10

---------------------------------------------
	-> training step: :40/1000
		* current error: 0.5007875130369377, threshold: 1e-10

---------------------------------------------
	-> training step: :60/1000
		* current error: 0.5007567438525375, threshold: 1e-10

---------------------------------------------
	-> training step: :80/1000
		* current error: 0.5007267329189224, threshold: 1e-10

---------------------------------------------
	-> training step: :100/1000
		* current error: 0.5006974376065778, threshold: 1e-10

---------------------------------------------
	-> training step: :120/1000
		* current error: 0.5006688177886365, threshold: 1e-10

---------------------------------------------
	-> training step: :140/1000
		* cur

0.4997088470083615