<a href="https://colab.research.google.com/github/rhythmd18/Neural-Nets-From-Scratch/blob/main/NeuralNetsFromScratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Let's build a Neural Network Architecture from absolute scratch!

## 1. Importing the desired libraries
All we need here is the NumPy library.

In [1]:
import numpy as np

## 2. Formulating the required activation functions.
We'll use the ReLU activation function for the hidden layers and the sigmoid activation function for the output layer.

> The ReLU activation function is given by:
$$g(z) = \max(0, z)\tag{1}$$

> The sigmoid activation function is given by:
$$g(z) = \frac{1}{1 + e^{-z}}\tag{2}$$





In [2]:
# The activation functions

def relu(Z):
  """
  Implements the relu activation in numpy

  Arguments:
  Z: a numpy array

  Returns:
  A: the output of the relu activation function
  cache: returns Z as well. useful during backpropagation
  """
  A = np.maximum(0, Z)
  cache = Z
  return A, cache


def sigmoid(Z):
  """
  Implements the sigmoid activation in numpy

  Arguments:
  Z: a numpy array

  Returns:
  A: the output of the sigmoid activation function
  cache: returns Z as well. useful during backpropagation
  """
  A = 1 / (1 + np.exp(-Z))
  cache = Z
  return A, cache


def relu_backward(dA, cache):
  """
  Implements the backward propagation for a single ReLU unit

  Arguments:
  dA: post activation gradient
  cache: Z, stored during the forward pass so that backward propagation can be computed efficiently

  Returns:
  dZ: Gradient of the cost with respect to Z
  """
  Z = cache
  dZ = np.array(dA, copy=True)
  dZ[Z <= 0] = 0 # When Z <= 0, dZ should be set to 0 as well
  return dZ


def sigmoid_backward(dA, cache):
  """
  Implements the backward propagation for a single SIGMOID unit

  Arguments:
  dA: post activation gradient
  cache: Z, stored during the forward pass so that backward propagation can be computed efficiently

  Returns:
  dZ: Gradient of the cost with respect to Z
  """
  Z = cache

  s = 1 / (1 + np.exp(-Z))
  dZ = dA * s * (1 - s)
  return dZ
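
As a quick sanity check on the derivative used in `sigmoid_backward`, the closed form $g'(z) = g(z)(1 - g(z))$ can be compared against a central finite difference. The sketch below is self-contained, so it defines its own small helper `sigmoid_fn` (a hypothetical name, not part of this notebook) instead of relying on the cell above.

```python
import numpy as np

def sigmoid_fn(z):
  # Same formula as equation (2); hypothetical helper used only for this check
  return 1 / (1 + np.exp(-z))

z = np.linspace(-3, 3, 7)
eps = 1e-6

# Closed-form derivative, as used inside sigmoid_backward: s * (1 - s)
s = sigmoid_fn(z)
analytic = s * (1 - s)

# Central finite-difference approximation of the same derivative
numeric = (sigmoid_fn(z + eps) - sigmoid_fn(z - eps)) / (2 * eps)

max_abs_diff = np.max(np.abs(analytic - numeric))
print(max_abs_diff)
```

The printed difference should be tiny (limited only by floating-point round-off), which suggests the closed form and the numerical estimate agree.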

## 3. Initializing the parameter matrices of adequate shapes

In [3]:
def initialize_parameters(layer_dims):
  """
  Arguments:
  layer_dims: A Python list containing the number of units in each layer, starting with the input layer

  Returns:
  parameters: A Python dictionary containing all the parameter matrices i.e., W1, b1, ..., WL, bL:
              Wl: Weight matrix of shape (layer_dims[l], layer_dims[l-1])
              bl: Bias matrix of shape (layer_dims[l], 1)
  """
  np.random.seed(3)
  parameters = {}
  L = len(layer_dims) # Number of entries in layer_dims (the input layer is included)

  for l in range(1, L):
    parameters[f"W{str(l)}"] = np.random.randn(layer_dims[l], layer_dims[l-1]) * 0.01
    parameters[f"b{str(l)}"] = np.zeros((layer_dims[l], 1))

  return parameters
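
To make the shape convention concrete, the sketch below lists the shapes that the rule in the docstring produces for a hypothetical `layer_dims = [5, 4, 3, 1]` (5 input features, two hidden layers, one output unit). It recomputes the shapes directly rather than calling `initialize_parameters`, so it runs on its own.

```python
layer_dims = [5, 4, 3, 1]  # hypothetical sizes: input, hidden, hidden, output

shapes = {}
for l in range(1, len(layer_dims)):
  shapes[f"W{l}"] = (layer_dims[l], layer_dims[l - 1])  # (current layer, previous layer)
  shapes[f"b{l}"] = (layer_dims[l], 1)                  # one bias per unit

for name, shape in shapes.items():
  print(name, shape)
```

Note that `layer_dims` has four entries but only three layers carry parameters, because the input layer has no weights or biases of its own.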

## 4. Forward Propagation

### 4.1 Linear Forward
Here we'll implement the linear part of the forward propagation i.e.,
$$Z^{[l]}=W^{[l]}A^{[l-1]}+b^{[l]}\tag{3}$$
Here, $A^{[0]}=X$, where $X$ is the input matrix with one column per example.

In [4]:
def linear_forward(A, W, b):
  """
  Implements the linear part of a layer's forward propagation

  Arguments:
  A: activations from the previous layer of shape (size of previous layer, number of examples)
  W: weights matrix: numpy array of shape (size of current layer, size of previous layer)
  b: bias vector: numpy array of shape (size of current layer, 1)

  Returns:
  Z: the linear part or the input for the activation function
  cache: a python tuple containing "A", "W", and "b"; stored for computing backward pass efficiently
  """
  Z = np.dot(W, A) + b
  cache = (A, W, b)

  return Z, cache
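
Equation (3) relies on NumPy broadcasting: `b` has shape (size of current layer, 1) and is added to every column of `np.dot(W, A)`. The small example below uses made-up values for `W`, `A` and `b` to show the shapes involved.

```python
import numpy as np

W = np.array([[1.0, 2.0],
              [0.0, -1.0],
              [3.0, 1.0]])             # shape (3, 2): 3 units, 2 inputs
A = np.array([[1.0, 0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0, -1.0]])  # shape (2, 4): 4 examples
b = np.array([[0.5], [0.0], [-1.0]])   # shape (3, 1)

Z = np.dot(W, A) + b  # b is broadcast across the 4 example columns
print(Z.shape)
print(Z)
```

Each column of `Z` holds the pre-activations of one example, so the same bias vector is applied to all examples.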

### 4.2 Linear Activation Forward
Now we'll pass the linear part $Z$ to the activation function (sigmoid or ReLU) to compute the layer's activations.

In [5]:
def linear_activation_forward(A_prev, W, b, activation):
  """
  Implements the forward propagation for the LINEAR->ACTIVATION layer

  Arguments:
  A_prev: activations from the previous layer of shape (size of the previous layer, number of examples)
  W: weights matrix: numpy array of shape (size of current layer, size of previous layer)
  b: bias vector: numpy array of shape (size of current layer, 1)
  activation: the activation to be used in the current layer, stored as a string ("sigmoid" or "relu")

  Returns:
  A: the output of the activation function, which is then used to compute the linear part Z of the next layer
  cache: a python tuple containing "linear_cache" and "activation_cache"; stored for computing backward pass efficiently
  """
  # The linear step is identical for both activations
  Z, linear_cache = linear_forward(A_prev, W, b)

  if activation == "sigmoid":
    A, activation_cache = sigmoid(Z)
  elif activation == "relu":
    A, activation_cache = relu(Z)
  else:
    raise ValueError(f"Unsupported activation: {activation}")

  cache = (linear_cache, activation_cache)

  return A, cache

### 4.3 L-Layer Model
Here we'll build the neural network architecture with L number of layers. L-1 hidden layers will use the <b>ReLU</b> activation and the output layer will use the <b>sigmoid</b> activation.

In [6]:
def l_layer_model(X, parameters):
  """
  Implements the forward propagation for [LINEAR->RELU]*(L-1)->LINEAR->SIGMOID computation

  Arguments:
  X: input vector of shape (input size, number of examples)
  parameters: output of the function initialize_parameters()

  Returns:
  AL: activation value of the output layer
  caches: list of caches containing:
            every cache of the linear_activation_forward()
  """
  caches = []
  A = X
  L = len(parameters) // 2 # number of layers in the neural network

  for l in range(1, L):
    A_prev = A
    A, cache = linear_activation_forward(A_prev, parameters[f'W{str(l)}'], parameters[f'b{str(l)}'], 'relu')
    caches.append(cache)

  AL, cache = linear_activation_forward(A, parameters[f'W{str(L)}'], parameters[f'b{str(L)}'], 'sigmoid')
  caches.append(cache)

  return AL, caches
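
The loop structure above can be traced on a small example. The sketch below is a standalone approximation of the same forward pass: it inlines the ReLU and sigmoid formulas and draws its own random weights with `np.random.default_rng`, so the layer sizes, the seed and the variable names are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
layer_dims = [4, 3, 2, 1]          # hypothetical sizes: input, hidden, hidden, output
X = rng.standard_normal((4, 6))    # 4 features, 6 examples

A = X
num_layers = len(layer_dims) - 1   # layers that carry parameters
activation_shapes = []
for l in range(1, num_layers + 1):
  W = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.01
  b = np.zeros((layer_dims[l], 1))
  Z = np.dot(W, A) + b
  if l < num_layers:
    A = np.maximum(0, Z)           # ReLU for the hidden layers
  else:
    A = 1 / (1 + np.exp(-Z))       # sigmoid for the output layer
  activation_shapes.append(A.shape)

AL = A
print(activation_shapes)
print(AL)
```

The number of rows shrinks from layer to layer while the number of columns (examples) stays fixed. With weights scaled by 0.01, the untrained outputs stay very close to 0.5.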

### 4.4 Cost Function
We use the binary cross-entropy cost, averaged over the $m$ training examples:
$$J = -\frac{1}{m}\sum_{i=1}^{m} \left(y^{(i)}\log\left(a^{[L](i)}\right)+\left(1-y^{(i)}\right)\log\left(1-a^{[L](i)}\right)\right)\tag{4}$$

In [7]:
def compute_cost(AL, Y):
  """
  Implements the cost defined by the equation (4)

  Arguments:
  AL: the probability vector corresponding to the label predictions of shape (1, number of examples)
  Y: true labels vector of shape (1, number of examples)

  Returns:
  cost: the cross-entropy cost
  """
  m = Y.shape[1]

  cost = -(np.sum(np.dot(Y, np.log(AL).T) + np.dot(1 - Y, np.log(1 - AL).T))) / m
  cost = np.squeeze(cost)

  return cost
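
The dot-product form used in `compute_cost` is equivalent to summing equation (4) term by term. The sketch below checks this on three made-up predictions and labels; the values are purely illustrative.

```python
import numpy as np

AL = np.array([[0.9, 0.2, 0.6]])  # hypothetical predicted probabilities
Y = np.array([[1, 0, 1]])         # hypothetical true labels
m = Y.shape[1]

# Element-wise form, term by term as in equation (4)
cost_elementwise = -np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)) / m

# Dot-product form, as used in compute_cost
cost_dot = -(np.dot(Y, np.log(AL).T) + np.dot(1 - Y, np.log(1 - AL).T)) / m
cost_dot = float(np.squeeze(cost_dot))

print(cost_elementwise, cost_dot)
```

Both expressions give the same value. Note that either form breaks down if a prediction is exactly 0 or 1, because `np.log(0)` is not finite.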

## 5. Backward Propagation

### 5.1 Linear Backward
For layer $l$, the linear part is $Z^{[l]}=W^{[l]}A^{[l-1]}+b^{[l]}$.
Suppose we have already computed the gradient of the loss with respect to the linear output, i.e. $dZ^{[l]} = \frac{\partial \mathcal{L}}{\partial Z^{[l]}}$. From it we need to obtain $dW^{[l]}$, $db^{[l]}$ and $dA^{[l-1]}$.

The formulas are as follows:

$$dW^{[l]}=\frac{1}{m}dZ^{[l]}A^{[l-1]T}\tag{5}$$

$$db^{[l]}=\frac{1}{m}\sum_{i=1}^{m}dZ^{[l](i)}\tag{6}$$

$$dA^{[l-1]}=W^{[l]T}dZ^{[l]}\tag{7}$$

In [8]:
def linear_backward(dZ, cache):
  """
  Implements the linear portion of backward propagation

  Arguments:
  dZ: gradient of the cost with respect to the linear output of the current layer
  cache: tuple of values (A_prev, W, b) coming from the forward propagation in the current layer

  Returns:
  dA_prev: gradient of the cost with respect to the activation of the previous layer (l-1)
  dW: gradient of the cost with respect to W of the current layer l
  db: gradient of the cost with respect to b of the current layer l
  """
  A_prev, W, b = cache
  m = A_prev.shape[1]

  dW = np.dot(dZ, A_prev.T) / m
  db = np.sum(dZ, axis=1, keepdims=True) / m
  dA_prev = np.dot(W.T, dZ)

  return dA_prev, dW, db
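
Equations (5) and (6) can be verified numerically with a finite-difference check. The sketch below assumes a toy cost $J = \frac{1}{m}\sum \frac{1}{2}Z^2$, chosen only because its per-example gradient is simply $dZ = Z$; the cost, the random data and the name `toy_cost` are not part of this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
A_prev = rng.standard_normal((3, 5))  # 3 inputs, 5 examples
W = rng.standard_normal((2, 3))
b = rng.standard_normal((2, 1))
m = A_prev.shape[1]

def toy_cost(W, b):
  # Hypothetical cost for checking only: J = (1/m) * sum(0.5 * Z**2)
  Z = np.dot(W, A_prev) + b
  return np.sum(0.5 * Z ** 2) / m

# For this toy cost the per-example gradient dZ is simply Z
dZ = np.dot(W, A_prev) + b
dW = np.dot(dZ, A_prev.T) / m               # equation (5)
db = np.sum(dZ, axis=1, keepdims=True) / m  # equation (6)

# Central finite differences for one entry of W and one entry of b
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[0, 1] += eps
W_minus[0, 1] -= eps
dW_numeric = (toy_cost(W_plus, b) - toy_cost(W_minus, b)) / (2 * eps)

b_plus, b_minus = b.copy(), b.copy()
b_plus[1, 0] += eps
b_minus[1, 0] -= eps
db_numeric = (toy_cost(W, b_plus) - toy_cost(W, b_minus)) / (2 * eps)

print(abs(dW[0, 1] - dW_numeric), abs(db[1, 0] - db_numeric))
```

Both differences should be close to zero. The same idea extends to a full network by perturbing one parameter at a time and comparing against the analytic gradients.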

### 5.2 Linear Activation Backward
If $g(\cdot)$ is the activation function, then `sigmoid_backward` and `relu_backward` compute the following, where $*$ denotes element-wise multiplication:
$$dZ^{[l]}=dA^{[l]}*g'(Z^{[l]})\tag{8}$$
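
For the two activations used in this notebook, $g'$ has the standard closed forms below. The ReLU case takes $g'(0) = 0$ by convention, which matches the `Z <= 0` mask in `relu_backward`.

$$g(z) = \frac{1}{1 + e^{-z}} \;\Rightarrow\; g'(z) = \frac{e^{-z}}{(1 + e^{-z})^{2}} = g(z)\,(1 - g(z))$$

$$g(z) = \max(0, z) \;\Rightarrow\; g'(z) = \begin{cases} 1 & z > 0 \\ 0 & z \le 0 \end{cases}$$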




In [9]:
def linear_activation_backward(dA, cache, activation):
  """
  Implements the backward propagation for a single LINEAR-> ACTIVATION layer

  Arguments:
  dA: post activation gradient of the current layer l
  cache: tuple of values(linear_cache, activation_cache) we store to compute backward pass efficiently
  activation: string that stores the name of the activation function that specifies which one to use.

  Returns:
  dA_prev: gradient of the cost with respect to the activation of the previous layer (l - 1)
  dW: gradient of the cost with respect to W (current layer l)
  db: gradient of the cost with respect to b (current layer l)
  """
  linear_cache, activation_cache = cache

  if activation == 'relu':
    dZ = relu_backward(dA, activation_cache)
  elif activation == 'sigmoid':
    dZ = sigmoid_backward(dA, activation_cache)
  else:
    raise ValueError(f"Unsupported activation: {activation}")

  dA_prev, dW, db = linear_backward(dZ, linear_cache)

  return dA_prev, dW, db

### 5.3 L-Model Backward
Here we'll run the full backward pass, computing the gradients layer by layer from the output layer back to the first layer.

In [10]:
def L_model_backward(AL, Y, caches):
  """
  Implements backward propagation for the entire [LINEAR->RELU]*(L-1)->LINEAR->SIGMOID group

  Arguments:
  AL: probability vector, the final output of the forward propagation (l_layer_model())
  Y: the true labels vector
  caches: list of caches for all the layers. For each layer it contains the tuple (linear_cache, activation_cache)


  Returns:
  grads: dictionary with the gradients, keyed as "dA{l}", "dW{l}" and "db{l}"
  """
  grads = {}
  L = len(caches) # number of layers
  m = AL.shape[1]
  Y = Y.reshape(AL.shape) # Y should be of the same shape as AL

  # Derivative of the cross-entropy loss with respect to AL
  dAL = -np.divide(Y, AL) + np.divide(1 - Y, 1 - AL)

  current_cache = caches[L - 1]
  dA_prev_temp, dW_temp, db_temp = linear_activation_backward(dAL, current_cache, 'sigmoid')
  grads[f"dA{str(L - 1)}"] = dA_prev_temp
  grads[f"dW{str(L)}"] = dW_temp
  grads[f"db{str(L)}"] = db_temp

  for l in reversed(range(L - 1)):
    current_cache = caches[l]
    dA_prev_temp, dW_temp, db_temp = linear_activation_backward(grads[f"dA{str(l + 1)}"], current_cache, 'relu')
    grads[f"dA{str(l)}"] = dA_prev_temp
    grads[f"dW{str(l + 1)}"] = dW_temp
    grads[f"db{str(l + 1)}"] = db_temp

  return grads
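
For the output layer, multiplying `dAL` by the sigmoid derivative simplifies algebraically to $dZ^{[L]} = A^{[L]} - Y$. The sketch below checks that identity on made-up values; it inlines the sigmoid formula so that it runs independently of the cells above.

```python
import numpy as np

Z = np.array([[-1.5, 0.3, 2.0]])  # hypothetical output-layer pre-activations
Y = np.array([[0, 1, 1]])         # hypothetical true labels
AL = 1 / (1 + np.exp(-Z))

# Step-by-step route used in the notebook: dAL first, then the sigmoid derivative
dAL = -np.divide(Y, AL) + np.divide(1 - Y, 1 - AL)
dZ_chain = dAL * AL * (1 - AL)

# Simplified closed form for sigmoid combined with cross-entropy
dZ_direct = AL - Y

print(dZ_chain)
print(dZ_direct)
```

The two results match. The closed form is a common alternative because it avoids dividing by `AL` or `1 - AL`, which can be very small.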

## 6. Update Parameters
In this section, we'll update all of the parameters according to the gradient descent rule:
$$W^{[l]}=W^{[l]}-\alpha\, dW^{[l]}\tag{9}$$
$$b^{[l]}=b^{[l]}-\alpha\, db^{[l]}\tag{10}$$

In [11]:
def update_parameters(params, grads, learning_rate):
  """
  Implements the process of gradient descent by updating the parameters

  Arguments:
  params: dictionary of the current parameters (initially the output of initialize_parameters())
  grads: the dictionary containing the gradients
  learning_rate: alpha, the step size of each update

  Returns:
  parameters: the dictionary of the updated parameters
  """
  parameters = params.copy()
  L = len(parameters) // 2 # number of layers

  for l in range(L):
    parameters[f"W{str(l + 1)}"] = params[f"W{str(l + 1)}"] - learning_rate * grads[f"dW{str(l + 1)}"]
    parameters[f"b{str(l + 1)}"] = params[f"b{str(l + 1)}"] - learning_rate * grads[f"db{str(l + 1)}"]

  return parameters
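
To see equations (9) and (10) in action, the sketch below applies them repeatedly to the simplest possible case: a single sigmoid unit (logistic regression) on synthetic data. The data, the learning rate of 0.1, the 100 iterations and the helper `cost_and_grads` are all assumptions chosen for illustration, not part of this notebook.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((2, 50))               # 2 features, 50 examples
Y = (X[0:1, :] + X[1:2, :] > 0).astype(float)  # hypothetical labels
m = X.shape[1]

params = {"W1": np.zeros((1, 2)), "b1": np.zeros((1, 1))}
learning_rate = 0.1

def cost_and_grads(params):
  # Forward pass through a single sigmoid unit
  Z = np.dot(params["W1"], X) + params["b1"]
  A = 1 / (1 + np.exp(-Z))
  cost = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
  # Backward pass: dZ = A - Y for sigmoid with cross-entropy
  dZ = A - Y
  grads = {"dW1": np.dot(dZ, X.T) / m,
           "db1": np.sum(dZ, axis=1, keepdims=True) / m}
  return cost, grads

initial_cost, _ = cost_and_grads(params)
for step in range(100):
  _, grads = cost_and_grads(params)
  params["W1"] = params["W1"] - learning_rate * grads["dW1"]  # equation (9)
  params["b1"] = params["b1"] - learning_rate * grads["db1"]  # equation (10)
final_cost, _ = cost_and_grads(params)

print(initial_cost, final_cost)
```

The cost starts at $\log 2$ (all predictions equal 0.5) and decreases as the updates are applied. A full training loop for the L-layer network follows the same pattern: forward pass, cost, backward pass, parameter update.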