# Building Neural Networks and Backpropagation from Scratch

In this project, I implemented a **feed-forward neural network** completely from scratch in Python to understand how **backpropagation** works at the mathematical and coding level.

Rather than using frameworks like TensorFlow or PyTorch, I derived and coded every step manually (forward pass, loss computation, and gradient backpropagation) to gain an intuition for how neural networks actually learn.

**Highlights:**
- Built a multi-layer neural network using only NumPy  
- Implemented gradient descent and backpropagation manually  
- Verified correctness by tracking loss reduction over time  
- Wrote modular, well-documented code suitable for reuse  



## Step 1: Setup and Data Preparation

Import libraries and prepare the data that the network will learn from.

In [None]:
# Numerical arrays and plotting.
import numpy as np
import matplotlib.pyplot as plt
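
The data-preparation code is not shown in this write-up. As a stand-in, the sketch below builds a hypothetical training set with the shapes the rest of the code expects: one input value per example (a `1 × N` row) and two target values per example (a `2 × N` array). The function name `make_training_data` and the circular target curve are assumptions chosen for illustration, not the data used in the project.

```python
import numpy as np

def make_training_data(n_examples=100):
    """Hypothetical dataset: map a scalar t in [0, 1] to a point on a circle.

    The targets are scaled to lie strictly inside (0, 1) so that a sigmoid
    output layer can reach them.
    """
    # Inputs: one feature per example, stored column-wise -> shape (1, N).
    x = np.linspace(0, 1, n_examples).reshape(1, n_examples)
    # Targets: two values per example -> shape (2, N).
    y = np.vstack([
        0.5 + 0.4 * np.cos(2 * np.pi * x),
        0.5 + 0.4 * np.sin(2 * np.pi * x),
    ])
    return x, y

x, y = make_training_data(100)
print(x.shape, y.shape)
```

Storing examples as columns means `x.size` equals the number of training examples, which is what the cost and Jacobian functions divide by.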

## Step 2: Defining the Neural Network Architecture

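Here I define the structure of the neural network: one input neuron, two hidden layers (6 and 7 neurons by default), and two output neurons, with a sigmoid activation at every layer.

The weight matrices and bias vectors are created by `reset_network` in Step 3 and updated during training. The cells in this section hold the Jacobian functions that drive those updates; they rely on `network_function` and `d_sigma`, which are also defined in Step 3.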

In [None]:

def J_W2 (x, y) :
    # The first two lines are identical to those in J_W3.
    a0, z1, a1, z2, a2, z3, a3 = network_function(x)
    J = 2 * (a3 - y)
    # The next two lines implement da3/da2: first σ', then W3.
    J = J * d_sigma(z3)
    J = (J.T @ W3).T
    # The final lines are the same as in J_W3, but with the layer number bumped down.
    J = J * d_sigma(z2)
    J = J @ a1.T / x.size
    return J

# Jacobian for the second-layer bias: the same chain as J_W2, but dz2/db2 = 1.
def J_b2 (x, y) :
    a0, z1, a1, z2, a2, z3, a3 = network_function(x)
    J = 2 * (a3 - y)
    J = J * d_sigma(z3)
    J = (J.T @ W3).T
    J = J * d_sigma(z2)
    J = np.sum(J, axis=1, keepdims=True) / x.size
    return J

In [None]:

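# Jacobian for the first-layer weights: the same chain, extended one more layer back.
def J_W1 (x, y) :
    a0, z1, a1, z2, a2, z3, a3 = network_function(x)
    # dC/da3, then back through layer 3 (σ' and W3).
    J = 2 * (a3 - y)
    J = J * d_sigma(z3)
    J = (J.T @ W3).T
    # Back through layer 2 (σ' and W2).
    J = J * d_sigma(z2)
    J = (J.T @ W2).T
    # Then σ' at z1 and dz1/dW1 = a0, averaged over the training examples.
    J = J * d_sigma(z1)
    J = J @ a0.T / x.size
    return J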

# Jacobian for the first-layer bias: the same chain as J_W1, but dz1/db1 = 1.
def J_b1 (x, y) :
    a0, z1, a1, z2, a2, z3, a3 = network_function(x)
    J = 2 * (a3 - y)
    J = J * d_sigma(z3)
    J = (J.T @ W3).T 
    J = J * d_sigma(z2)
    J = (J.T @ W2).T
    J = J * d_sigma(z1)
    J = np.sum(J, axis=1, keepdims=True) / x.size
    return J

## Step 3: Forward Propagation

The forward pass computes the output of the network given an input by applying linear transformations followed by nonlinear activation functions layer by layer.

This is where predictions are made before any learning happens.
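
Written out for this three-layer network, the forward pass is the recursion below. The superscript notation is introduced here for convenience: $a^{(0)}$ is the input, $\sigma$ is the sigmoid, and $a^{(n)}$, $z^{(n)}$, $W^{(n)}$, $b^{(n)}$ correspond to `a1 … a3`, `z1 … z3`, `W1 … W3`, and `b1 … b3` in the code.

$$
z^{(n)} = W^{(n)} a^{(n-1)} + b^{(n)}, \qquad a^{(n)} = \sigma\!\left(z^{(n)}\right), \qquad n = 1, 2, 3
$$

The cost averages the squared error of the final activation over the $N$ training examples $(x_k, y_k)$:

$$
C = \frac{1}{N} \sum_{k=1}^{N} \left\lVert a^{(3)}_k - y_k \right\rVert^2
$$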

In [None]:
# The sigmoid activation function and its derivative.
# d_sigma uses the identity σ'(z) = 1 / (4 cosh²(z/2)), which equals σ(z)(1 - σ(z)).
sigma = lambda z : 1 / (1 + np.exp(-z))
d_sigma = lambda z : np.cosh(z/2)**(-2) / 4

# This function initialises the network with its structure; it also resets any training already done.
# The layer sizes are 1 -> n1 -> n2 -> 2.
def reset_network (n1 = 6, n2 = 7, random=np.random) :
    global W1, W2, W3, b1, b2, b3
    W1 = random.randn(n1, 1) / 2
    W2 = random.randn(n2, n1) / 2
    W3 = random.randn(2, n2) / 2
    b1 = random.randn(n1, 1) / 2
    b2 = random.randn(n2, 1) / 2
    b3 = random.randn(2, 1) / 2

# This function feeds forward each activation to the next layer. It returns all weighted sums and activations.
def network_function(a0) :
    z1 = W1 @ a0 + b1
    a1 = sigma(z1)
    z2 = W2 @ a1 + b2
    a2 = sigma(z2)
    z3 = W3 @ a2 + b3
    a3 = sigma(z3)
    return a0, z1, a1, z2, a2, z3, a3

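# This is the cost function of the network with respect to a training set: the squared error
# averaged over the training examples (x.size is the number of examples, since each input is a single number).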
def cost(x, y) :
    return np.linalg.norm(network_function(x)[-1] - y)**2 / x.size
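
The `cosh` form of `d_sigma` is less familiar than the usual σ(z)(1 − σ(z)). A quick check, sketched below, compares the two forms with each other and with a central finite difference. The helper name `d_sigma_textbook`, the step size, and the tolerance are illustrative choices.

```python
import numpy as np

sigma = lambda z: 1 / (1 + np.exp(-z))
d_sigma = lambda z: np.cosh(z / 2) ** (-2) / 4          # form used in this project
d_sigma_textbook = lambda z: sigma(z) * (1 - sigma(z))  # hypothetical helper for comparison

z = np.linspace(-5, 5, 101)
h = 1e-5
numeric = (sigma(z + h) - sigma(z - h)) / (2 * h)       # central finite difference

print(np.allclose(d_sigma(z), d_sigma_textbook(z)))     # do the two closed forms agree?
print(np.allclose(d_sigma(z), numeric, atol=1e-8))      # does the slope match numerically?
```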

## Step 4: Backpropagation

Here I implement **backpropagation**.

I compute the gradient of the loss with respect to each parameter using the chain rule, moving backward through the network. This tells each weight and bias how to adjust to reduce the error.
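
The chain rule that the Jacobian functions follow can be written schematically as below; the code handles the transposes and the average over training examples. The notation is introduced here for convenience: $a^{(n)}$, $z^{(n)}$, $W^{(n)}$, $b^{(n)}$ stand for the code's `a3`, `z3`, `W3`, `b3` and so on, and $C$ is the cost.

$$
\begin{aligned}
\frac{\partial C}{\partial W^{(3)}} &= \frac{\partial C}{\partial a^{(3)}}\,\frac{\partial a^{(3)}}{\partial z^{(3)}}\,\frac{\partial z^{(3)}}{\partial W^{(3)}}, \\
\frac{\partial C}{\partial W^{(2)}} &= \frac{\partial C}{\partial a^{(3)}}\,\frac{\partial a^{(3)}}{\partial a^{(2)}}\,\frac{\partial a^{(2)}}{\partial z^{(2)}}\,\frac{\partial z^{(2)}}{\partial W^{(2)}}
\end{aligned}
$$

The individual factors are

$$
\frac{\partial C}{\partial a^{(3)}} = 2\,\bigl(a^{(3)} - y\bigr), \quad
\frac{\partial a^{(n)}}{\partial z^{(n)}} = \sigma'\!\bigl(z^{(n)}\bigr), \quad
\frac{\partial z^{(n)}}{\partial W^{(n)}} = a^{(n-1)}, \quad
\frac{\partial z^{(n)}}{\partial b^{(n)}} = 1, \quad
\frac{\partial a^{(n)}}{\partial a^{(n-1)}} = \sigma'\!\bigl(z^{(n)}\bigr)\,W^{(n)}
$$

Each step further back inserts one more $\partial a^{(n)} / \partial a^{(n-1)}$ factor, which is why the first-layer Jacobians multiply by both $W^{(3)}$ and $W^{(2)}$.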

In [None]:

def J_W3 (x, y) :
    # First get all the activations and weighted sums at each layer of the network.
    a0, z1, a1, z2, a2, z3, a3 = network_function(x)
    # We'll use the variable J to store parts of our result as we go along, updating it in each line.
    # Firstly, we calculate dC/da3, using the expressions above.
    J = 2 * (a3 - y)
    # Next multiply the result we've calculated by the derivative of sigma, evaluated at z3.
    J = J * d_sigma(z3)
    # Then we take the dot product (along the axis that holds the training examples) with the final partial derivative,
    # i.e. dz3/dW3 = a2
    # and divide by the number of training examples, for the average over all training examples.
    J = J @ a2.T / x.size
    # Finally return the result out of the function.
    return J

def J_b3 (x, y) :
    # As before, we first set up the activations.
    a0, z1, a1, z2, a2, z3, a3 = network_function(x)
    J = 2 * (a3 - y)
    J = J * d_sigma(z3)
    # For the final line, we don't need to multiply by dz3/db3, because that is multiplying by 1.
    # We still need to sum over all training examples however.
    J = np.sum(J, axis=1, keepdims=True) / x.size
    return J
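
One way to gain confidence in a hand-written Jacobian, beyond watching the loss fall, is to compare it with a finite-difference estimate. The sketch below does this for `J_W3` only. It redefines a small copy of the network so that it runs on its own; the helper `numerical_J_W3`, the seed, the random targets, and the step size `h` are assumptions made for illustration.

```python
import numpy as np

sigma = lambda z: 1 / (1 + np.exp(-z))
d_sigma = lambda z: np.cosh(z / 2) ** (-2) / 4

# A small network with the same layout as in this project (1 -> 6 -> 7 -> 2).
rng = np.random.RandomState(0)
W1, b1 = rng.randn(6, 1) / 2, rng.randn(6, 1) / 2
W2, b2 = rng.randn(7, 6) / 2, rng.randn(7, 1) / 2
W3, b3 = rng.randn(2, 7) / 2, rng.randn(2, 1) / 2

def network_function(a0):
    z1 = W1 @ a0 + b1
    a1 = sigma(z1)
    z2 = W2 @ a1 + b2
    a2 = sigma(z2)
    z3 = W3 @ a2 + b3
    a3 = sigma(z3)
    return a0, z1, a1, z2, a2, z3, a3

def cost(x, y):
    return np.linalg.norm(network_function(x)[-1] - y) ** 2 / x.size

def J_W3(x, y):
    a0, z1, a1, z2, a2, z3, a3 = network_function(x)
    J = 2 * (a3 - y) * d_sigma(z3)
    return J @ a2.T / x.size

def numerical_J_W3(x, y, h=1e-6):
    """Central-difference estimate of dC/dW3 (hypothetical helper)."""
    grad = np.zeros_like(W3)
    for idx in np.ndindex(*W3.shape):
        original = W3[idx]
        W3[idx] = original + h      # nudge one weight up
        c_plus = cost(x, y)
        W3[idx] = original - h      # and down
        c_minus = cost(x, y)
        W3[idx] = original          # restore the weight
        grad[idx] = (c_plus - c_minus) / (2 * h)
    return grad

# Made-up inputs and targets, only for the check.
x = np.linspace(0, 1, 20).reshape(1, 20)
y = rng.rand(2, 20)

max_diff = np.max(np.abs(J_W3(x, y) - numerical_J_W3(x, y)))
print(max_diff)
```

A small `max_diff` suggests the analytic and numerical gradients agree; the same pattern can be applied to the other weights and biases.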

## Step 5: Training Loop

Now I bring everything together (forward pass, loss computation, backpropagation, and parameter updates) and repeat the cycle for multiple epochs to train the network.

The loss should gradually decrease, showing that learning is happening.
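
The training code itself is not included in this write-up, so the following is only a minimal sketch of plain full-batch gradient descent under stated assumptions. To keep it self-contained, the parameters live in a dictionary `p` and the six Jacobians are condensed into one `gradients` function (the same chain rule as `J_W1 … J_b3`, computed from a single forward pass). The names `init_params`, `forward`, `gradients`, and `train`, the circular toy data, the learning rate, and the epoch count are all hypothetical.

```python
import numpy as np

sigma = lambda z: 1 / (1 + np.exp(-z))
d_sigma = lambda z: np.cosh(z / 2) ** (-2) / 4

def init_params(n1=6, n2=7, seed=0):
    """Random weights and biases for a 1 -> n1 -> n2 -> 2 network."""
    rng = np.random.RandomState(seed)
    return {
        "W1": rng.randn(n1, 1) / 2, "b1": rng.randn(n1, 1) / 2,
        "W2": rng.randn(n2, n1) / 2, "b2": rng.randn(n2, 1) / 2,
        "W3": rng.randn(2, n2) / 2, "b3": rng.randn(2, 1) / 2,
    }

def forward(p, a0):
    z1 = p["W1"] @ a0 + p["b1"]
    a1 = sigma(z1)
    z2 = p["W2"] @ a1 + p["b2"]
    a2 = sigma(z2)
    z3 = p["W3"] @ a2 + p["b3"]
    a3 = sigma(z3)
    return a0, z1, a1, z2, a2, z3, a3

def cost(p, x, y):
    return np.linalg.norm(forward(p, x)[-1] - y) ** 2 / x.size

def gradients(p, x, y):
    """All six Jacobians from one forward pass."""
    a0, z1, a1, z2, a2, z3, a3 = forward(p, x)
    n = x.size                                   # number of training examples
    d3 = 2 * (a3 - y) * d_sigma(z3)              # dC/dz3
    d2 = (p["W3"].T @ d3) * d_sigma(z2)          # dC/dz2
    d1 = (p["W2"].T @ d2) * d_sigma(z1)          # dC/dz1
    return {
        "W3": d3 @ a2.T / n, "b3": d3.sum(axis=1, keepdims=True) / n,
        "W2": d2 @ a1.T / n, "b2": d2.sum(axis=1, keepdims=True) / n,
        "W1": d1 @ a0.T / n, "b1": d1.sum(axis=1, keepdims=True) / n,
    }

def train(p, x, y, epochs=2000, learning_rate=0.5):
    """Gradient descent: step every parameter against its gradient each epoch."""
    history = []
    for _ in range(epochs):
        grads = gradients(p, x, y)
        for name in p:
            p[name] -= learning_rate * grads[name]
        history.append(cost(p, x, y))            # record the loss after the update
    return history

# Toy data (an assumption): scalar inputs mapped to points on a circle inside (0, 1).
x = np.linspace(0, 1, 100).reshape(1, 100)
y = np.vstack([0.5 + 0.4 * np.cos(2 * np.pi * x), 0.5 + 0.4 * np.sin(2 * np.pi * x)])

params = init_params()
initial_cost = cost(params, x, y)
history = train(params, x, y)
print(initial_cost, history[-1])
```

Recording the cost every epoch gives the loss curve used to judge whether learning is happening.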

## Step 6: Evaluating and Visualizing the Results

After training, I test the model’s accuracy and visualize its learning progress.
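
As a rough illustration of the evaluation step, the sketch below measures the squared error of a predictor on held-out points. The function `evaluate` and the constant stand-in predictor are hypothetical; for the network in this project, `predict` could be `lambda x: network_function(x)[-1]`.

```python
import numpy as np

def evaluate(predict, x, y):
    """Return the mean and the worst per-example squared error.

    `predict` is any callable mapping inputs of shape (1, N) to outputs of
    shape (2, N).
    """
    residual = predict(x) - y
    per_example = np.sum(residual ** 2, axis=0)   # squared error of each example
    return per_example.mean(), per_example.max()

# Illustration with a stand-in predictor that always outputs 0.5.
x_test = np.linspace(0, 1, 5).reshape(1, 5)
y_test = np.full((2, 5), 0.25)
constant_predict = lambda x: np.full((2, x.shape[1]), 0.5)

mse, worst = evaluate(constant_predict, x_test, y_test)
print(mse, worst)  # prints 0.125 0.125
```

The mean returned here matches the definition of `cost`, so it can be compared directly with the training loss. Learning progress can be visualized by passing the per-epoch cost values to `plt.plot`.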

## Reflection

Writing backpropagation from scratch was a fantastic exercise in both linear algebra and programming logic.

**Key takeaways:**
- I gained an intuitive understanding of how gradients propagate through a network.  
- I learned how small implementation errors (e.g., in matrix shapes) can dramatically affect learning.  
- I now have a reusable base for experimenting with other architectures (like convolutional or recurrent networks).

