# Vanilla Neural Network from Scratch
***
## Table of Contents
1. [Introduction](#1-introduction)
***

In [1]:
import numpy as np
import pandas as pd
from typing import Tuple, List, Dict
from numpy.typing import NDArray
from matplotlib import pyplot as plt
from math import cos, sin, atan

## 1. Introduction
A Vanilla Neural Network (VNN), also known as a feedforward neural network, is the most basic form of artificial neural network and serves as the foundation for more advanced architectures. It is called *vanilla* because it lacks the additional features or complexities found in specialised neural networks, such as Convolutional Neural Networks (CNN) or Recurrent Neural Networks (RNN).


### Layers
A vanilla neural network is composed of one input layer, one or more hidden layer(s) and one output layer.

- **Input Layer**: Receives the data input.
- **Hidden Layer(s)**: Layer(s) where data is processed through weighted connections. These layers allow the network to learn complex patterns.
- **Output Layer**: Produces the final output (e.g., a class label or a predicted value).

### Neurons
Each layer consists of units called **neurons**. Each neuron receives input, processes it, and passes the output to the next layer.

### Weights and Biases
- Weights ($W$) determine the strength of connections between neurons.
- Biases ($b$) allow the model to shift the activation function.

### Activation Functions
Activation functions are non-linear functions that allow models to learn complex patterns.
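To illustrate why the non-linearity matters, here is a minimal sketch (all array names, shapes and the seed are illustrative assumptions, not part of the original notebook). It shows that two stacked linear layers without an activation collapse into a single linear layer, while inserting a non-linear function such as `np.tanh` breaks that equivalence:

```python
import numpy as np

rng = np.random.default_rng(0)            # seed chosen only for reproducibility
X_demo = rng.normal(size=(4, 2))          # 4 samples, 2 features
W_a = rng.normal(size=(2, 3))             # weights of a first linear layer
W_b = rng.normal(size=(3, 1))             # weights of a second linear layer

# Two stacked linear layers with no activation in between ...
two_layers = (X_demo @ W_a) @ W_b
# ... equal one linear layer whose weight matrix is the product W_a @ W_b.
one_layer = X_demo @ (W_a @ W_b)
print(np.allclose(two_layers, one_layer))   # True (matrix products are associative)

# With a non-linear activation between the layers, the collapse no longer holds.
with_tanh = np.tanh(X_demo @ W_a) @ W_b
print(np.allclose(with_tanh, one_layer))
```

Without an activation function, adding layers therefore adds no expressive power beyond a single linear map.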

## 2. Loading Data
The XOR dataset is a simple dataset based on the exclusive OR (XOR) logical operation. It involves two binary inputs (either 0 or 1) and one binary output. The output is $1$ if exactly one of the inputs is $1$, and $0$ otherwise. 

| Input A | Input B | Output (A XOR B) |
|---------|---------|------------------|
|    0    |    0    |        0         |
|    0    |    1    |        1         |
|    1    |    0    |        1         |
|    1    |    1    |        0         |

The XOR problem is not linearly separable: no single straight line can divide the two classes in the input space. This exercise will demonstrate how to solve this problem by introducing hidden layers and non-linear activation functions.
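As a quick illustration of the linear-separability claim, the sketch below fits the best linear model (in the least-squares sense) to the four XOR points. The names `X_xor`, `y_xor` and `X_aug` are assumptions introduced for this example, and least squares is used here only as a convenient stand-in for "a linear function":

```python
import numpy as np

X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_xor = np.array([0, 1, 1, 0], dtype=float)

# Append a column of ones so the bias is fitted together with the weights.
X_aug = np.hstack([X_xor, np.ones((4, 1))])

# Least-squares fit of y ~ w1*x1 + w2*x2 + b.
coef, *_ = np.linalg.lstsq(X_aug, y_xor, rcond=None)
linear_pred = X_aug @ coef

print(np.round(coef, 3))          # weights close to 0, bias close to 0.5
print(np.round(linear_pred, 3))   # every prediction is 0.5
```

The best linear fit predicts 0.5 for all four inputs, so it carries no information about the class.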

In [2]:
# XOR dataset (inputs and outputs)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

## 3. Parameter Initialisation
Each neuron in a neural network involves two main steps: a weighted sum (linear combination) of the inputs plus a bias, followed by an activation function that introduces non-linearity. A single neuron is expressed as:

\begin{align*}
    y = \sigma \left( \sum_{i=1}^{n} w_{i}x_{i} + b \right)
\end{align*}

where:
- $x_{i}$: Inputs to the neuron.
- $w_{i}$: Corresponding weights.
- $b$: Bias.
- $\sigma$: Activation function (e.g., sigmoid, ReLU, tanh).


Or, in vector notation:

\begin{align*}
    y = \sigma \left(Z \right)
\end{align*}

where:
- $Z = W^{T}X + b$
- $X$: Input vector.
- $W$: Weight vector.
- $b$: Bias scalar.
- $\sigma$: Activation function.
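A minimal sketch of the single-neuron formula above, assuming a sigmoid activation. The function name `neuron` and the example values are hypothetical and chosen only for illustration:

```python
import numpy as np

def neuron(x, w, b):
    """Weighted sum of the inputs plus a bias, passed through a sigmoid."""
    z = np.dot(w, x) + b
    return 1.0 / (1.0 + np.exp(-z))

x_example = np.array([1.0, 0.0])    # illustrative inputs
w_example = np.array([0.5, -0.5])   # illustrative weights
b_example = 0.0                     # illustrative bias

# z = 0.5 * 1.0 + (-0.5) * 0.0 + 0.0 = 0.5
neuron_output = neuron(x_example, w_example, b_example)
print(round(float(neuron_output), 4))   # sigmoid(0.5), approximately 0.6225
```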

Firstly, we need to initialise $W$ and $b$ for each neuron. In this example, the weights take random values (drawn from a standard Gaussian distribution) to break symmetry, and the biases are set to 0. Note that `W1` and `b1` belong to the hidden layer, and `W2` and `b2` to the output layer. For a layer with $m$ inputs and $n$ neurons, the pre-activations are:

\begin{align*}
    \begin{cases}
    z_1 = x_1 w_{11} + x_2 w_{12} + \cdots + x_m w_{1m} + b_1 \\
    z_2 = x_1 w_{21} + x_2 w_{22} + \cdots + x_m w_{2m} + b_2 \\
    \vdots \\
    z_n = x_1 w_{n1} + x_2 w_{n2} + \cdots + x_m w_{nm} + b_n \\
    \end{cases}
\end{align*}

In matrix form:

\begin{align*}
    \begin{bmatrix}
        z_1 \\
        z_2 \\
        \vdots \\
        z_n
        \end{bmatrix}_{n \times 1}
        =
        \begin{bmatrix}
        w_{11} & w_{12} & \cdots & w_{1m} \\
        w_{21} & w_{22} & \cdots & w_{2m} \\
        \vdots & \vdots & \ddots & \vdots \\
        w_{n1} & w_{n2} & \cdots & w_{nm}
        \end{bmatrix}_{n \times m}
        \begin{bmatrix}
        x_1 \\
        x_2 \\
        \vdots \\
        x_m
        \end{bmatrix}_{m \times 1}
        +
        \begin{bmatrix}
        b_1 \\
        b_2 \\
        \vdots \\
        b_n
    \end{bmatrix}_{n \times 1}
\end{align*}
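The equivalence between the system of equations and the matrix product above can be checked numerically. This is a sketch with illustrative sizes and names (`W_layer`, `x_vec`, `b_vec` are assumptions). It also shows that the row-vector form `X @ W1 + b1` used in the notebook's code cells is the transpose of the same product:

```python
import numpy as np

rng = np.random.default_rng(1)           # arbitrary seed
m, n = 3, 2                              # m inputs, n neurons (illustrative)
W_layer = rng.normal(size=(n, m))        # row i holds the weights of neuron i
x_vec = rng.normal(size=(m, 1))          # column vector of inputs
b_vec = rng.normal(size=(n, 1))          # one bias per neuron

# Neuron by neuron: z_i = sum_j w_ij * x_j + b_i
z_loop = np.array([
    [sum(W_layer[i, j] * x_vec[j, 0] for j in range(m)) + b_vec[i, 0]]
    for i in range(n)
])

# All neurons at once: z = W x + b
z_matrix = W_layer @ x_vec + b_vec
print(np.allclose(z_loop, z_matrix))     # True

# Row-vector convention: samples are rows and the weight matrix is transposed.
z_rows = x_vec.T @ W_layer.T + b_vec.T   # shape (1, n)
print(np.allclose(z_rows.T, z_matrix))   # True
```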



With an input $X = \left[x_1, x_2\right]$, the hidden layer consists of:

\begin{align*}
    h_1 = \sigma\left(x_1 w_{11} + x_2 w_{12} + b_{1}\right) \\
    h_2 = \sigma\left(x_1 w_{21} + x_2 w_{22} + b_{2}\right)
\end{align*}

and the output layer is:

\begin{align*}
    \hat y = \sigma\left(h_1 w_{1} + h_2 w_{2} + b\right)
\end{align*}

In [3]:
input_neurons, hidden_neurons, output_neurons = 2, 2, 1
W1 = np.random.randn(input_neurons, hidden_neurons)
b1 = np.zeros((1, hidden_neurons))
W2 = np.random.randn(hidden_neurons, output_neurons)
b2 = np.zeros((1, output_neurons))

In [4]:
print(f'W1: \n{W1}')
print(f'b1: \n{b1}')
print(f'W2: \n{W2}')
print(f'b2: \n{b2}')

W1: 
[[ 0.92849333  0.16699119]
 [ 1.15186567 -0.26194181]]
b1: 
[[0. 0.]]
W2: 
[[ 1.05437164]
 [-0.47252378]]
b2: 
[[0.]]
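The values printed above differ from run to run because `np.random.randn` draws from NumPy's global random state. A minimal sketch of a reproducible alternative, assuming a seeded `np.random.default_rng` generator (the seed `42` and the `_seeded` names are illustrative and not part of the original notebook):

```python
import numpy as np

rng = np.random.default_rng(42)   # any fixed seed gives repeatable draws
input_neurons, hidden_neurons, output_neurons = 2, 2, 1

# Same shapes as W1, b1, W2, b2 above, but drawn from the seeded generator.
W1_seeded = rng.standard_normal((input_neurons, hidden_neurons))
b1_seeded = np.zeros((1, hidden_neurons))
W2_seeded = rng.standard_normal((hidden_neurons, output_neurons))
b2_seeded = np.zeros((1, output_neurons))

print(W1_seeded.shape, b1_seeded.shape, W2_seeded.shape, b2_seeded.shape)
# (2, 2) (1, 2) (2, 1) (1, 1)
```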


## 4. Activation Functions
Activation functions introduce non-linearity to neural networks, which enables them to learn complex patterns. Considering $Z = W^{T}X + b$, some popular activation functions and their derivatives are as follows:

### Sigmoid Function

\begin{align*}
\sigma(Z) = \dfrac{1}{1+e^{-Z}}
\end{align*}

- Output range: (0, 1).
- Smooth gradient (avoiding abrupt jumps).
- Popular for binary classification.

### Sigmoid Derivative

\begin{align*}
\dfrac{d}{dZ}\sigma(Z) = \sigma(Z) \cdot (1 - \sigma(Z))
\end{align*}

The implementation of the derivative takes the output of the sigmoid function, $\sigma(Z)$, as its argument, not the raw input $Z$.

In [5]:
def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))


def sigmoid_derivative(a):
    # `a` is the sigmoid output, a = sigmoid(Z), not the raw input Z
    return a * (1 - a)
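Because the derivative function expects the sigmoid output rather than the raw input, it is easy to call it with the wrong argument. The sketch below compares both usages against a central finite difference. The test points `Z_test` and the step `eps` are illustrative assumptions, and the two functions are redefined so that the block runs on its own:

```python
import numpy as np

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def sigmoid_derivative(a):
    # Expects a = sigmoid(Z)
    return a * (1 - a)

Z_test = np.linspace(-3, 3, 7)    # illustrative test points
eps = 1e-6

# Central finite difference approximation of d(sigmoid)/dZ
numeric = (sigmoid(Z_test + eps) - sigmoid(Z_test - eps)) / (2 * eps)

# Correct usage: pass the activation
analytic = sigmoid_derivative(sigmoid(Z_test))
print(np.allclose(numeric, analytic, atol=1e-6))       # True

# Incorrect usage: passing the raw input gives a different result
wrong_usage = sigmoid_derivative(Z_test)
print(np.allclose(numeric, wrong_usage, atol=1e-6))    # False
```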

### Rectified Linear Unit (ReLU)
ReLU is computationally efficient and mitigates the vanishing-gradient problem, but it can suffer from the 'dying ReLU' problem, where neurons get stuck outputting 0.

\begin{align*}
\sigma(Z) = \text{max}(0, Z)
\end{align*}


### ReLU Derivative

\begin{align*}
    \sigma'(Z) =
    \begin{cases}
    1 & \text{if } Z > 0 \\
    0 & \text{otherwise}
    \end{cases}
\end{align*}

In [6]:
def relu(Z):
    return np.maximum(0, Z)


def relu_derivative(Z):
    return (Z > 0).astype(float)

### Leaky ReLU
Leaky ReLU addresses the 'dying ReLU' problem by allowing small negative outputs.

\begin{align*}
    \sigma(Z) =
    \begin{cases}
    Z & \text{if } Z > 0 \\
    \alpha Z & \text{otherwise}
    \end{cases}
\end{align*}

where $\alpha$ is typically 0.01.

### Leaky ReLU Derivative

\begin{align*}
    \sigma'(Z) =
    \begin{cases}
    1 & \text{if } Z > 0 \\
    \alpha & \text{otherwise}
    \end{cases}
\end{align*}

In [7]:
def leaky_relu(Z, alpha=0.01):
    return np.where(Z > 0, Z, alpha * Z)


def leaky_relu_derivative(Z, alpha=0.01):
    return np.where(Z > 0, 1, alpha)

### Hyperbolic Tangent (tanh)
The hyperbolic tangent produces outputs centred at 0 (ranging from -1 to +1) and has a steeper gradient around 0 than the sigmoid.

\begin{align*}
    \text{tanh}(Z) =
    \dfrac{e^{Z}-e^{-Z}}{e^{Z}+e^{-Z}}
\end{align*}

### Tanh Derivative

\begin{align*}
    \dfrac{d}{dZ} \text{tanh}(Z) =
    1 - \text{tanh}^2(Z)
\end{align*}

In [8]:
def tanh(x):
    return np.tanh(x)


def tanh_derivative(x):
    return 1 - np.tanh(x)**2

### Softmax
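Softmax is used for multi-class classification: it converts a vector of $K$ raw scores into non-negative values that sum to 1 and can be read as class probabilities.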

\begin{align*}
    \text{softmax}(Z_{i}) =
    \dfrac{e^{Z_{i}}}{\sum_{j=1}^{K}e^{Z_{j}}}
\end{align*}

### Softmax Derivative

\begin{align*}
    \dfrac{\partial \text{softmax}(Z_{i})}{\partial Z_{j}} = 
    \text{softmax}(Z_{i}) (\delta_{ij} - \text{softmax}(Z_{j}))
\end{align*}

where $\delta_{ij}$ is the Kronecker delta ($1$ if $i = j$, $0$ otherwise).

In [9]:
def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)
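The softmax derivative is a full Jacobian matrix rather than an elementwise function. Below is a minimal sketch for a single score vector, assuming the standard identity $J_{ij} = s_i(\delta_{ij} - s_j)$ with $s = \text{softmax}(Z)$. The helper names `softmax_1d` and `softmax_jacobian` and the example scores are hypothetical:

```python
import numpy as np

def softmax_1d(z):
    """Softmax of one score vector (max subtracted for numerical stability)."""
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

def softmax_jacobian(z):
    """J[i, j] = s_i * (delta_ij - s_j), i.e. diag(s) - outer(s, s)."""
    s = softmax_1d(z)
    return np.diag(s) - np.outer(s, s)

z_scores = np.array([1.0, 2.0, 0.5])    # illustrative scores
J = softmax_jacobian(z_scores)

# Check against central finite differences, one input coordinate at a time.
eps = 1e-6
J_numeric = np.zeros((3, 3))
for j in range(3):
    step = np.zeros(3)
    step[j] = eps
    J_numeric[:, j] = (softmax_1d(z_scores + step) - softmax_1d(z_scores - step)) / (2 * eps)

print(np.allclose(J, J_numeric, atol=1e-6))   # True
print(np.allclose(J.sum(axis=0), 0.0))        # True: the outputs always sum to 1
```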

| Function   | Best For                               | Gradient Behavior            |
|------------|----------------------------------------|------------------------------|
| Sigmoid    | Output layer (binary classification)   | Vanishes at extremes         |
| ReLU       | Hidden layers (most cases)             | Simple, fast computation     |
| Leaky ReLU | Deep networks (prevents dead neurons)  | Avoids zero gradients        |
| tanh       | Hidden layers (stronger gradient)      | Vanishes less than sigmoid   |
| Softmax    | Output layer (multi-class)             | Normalizes probabilities     |

In [10]:
Z1 = X @ W1 + b1            # [4x2] = [4x2] @ [2x2] + [1x2] (bias broadcast over the 4 samples)
h = sigmoid(Z1)             # [4x2]
Z2 = h @ W2 + b2            # [4x1] = [4x2] @ [2x1] + [1x1]
y_hat = sigmoid(Z2)         # [4x1]

## 5. Forward Propagation
Forward propagation is the process where input data flows through the neural network to produce predictions.

The hidden layer first computes the pre-activation $Z_{1} = XW_{1} + b_{1}$. Applying the activation function (sigmoid, for instance) to it, we get:

\begin{align*}
    h = \sigma(Z_{1})
\end{align*}

Then we compute the output layer:

\begin{align*}
    Z_{2} = hW_{2}+b_{2}
\end{align*}

Finally, we obtain the output:

\begin{align*}
    \hat y = \sigma(Z_{2})
\end{align*}


In [11]:
def forward(X, W1, b1, W2, b2):
    # Hidden layer computation
    Z1 = np.dot(X, W1) + b1    # Matrix multiplication + bias
    h = sigmoid(Z1)             # Apply activation

    # Output layer computation
    Z2 = np.dot(h, W2) + b2     # Matrix multiplication + bias
    y_hat = sigmoid(Z2)         # Apply activation

    return h, y_hat

In [12]:
h, y_hat = forward(X, W1, b1, W2, b2)
print(y_hat)

[[0.57222231]
 [0.64466544]
 [0.62241196]
 [0.67090057]]


## 6. Loss Functions
A loss function (or error function) is a mathematical formula that quantifies how far off a neural network’s predictions are from the actual target values. It provides a single scalar value representing the aggregate difference between predicted and true outputs. The central goal of training a neural network is to minimise this loss, which directly improves the predictive accuracy of the model. 

### Mean Squared Error (MSE)
Mean Squared Error penalises larger errors more heavily because of its squared term.
\begin{align*}
    MSE = \dfrac{1}{N} \sum_{i=1}^{N} (y_{i} - \hat y_{i})^2
\end{align*}

where:
- $y_{i}$: True value.
- $\hat y_{i}$: Predicted value.
- $N$: Number of samples.

In [13]:
def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

### Mean Absolute Error (MAE)
Mean Absolute Error measures the average magnitude of errors. It is less sensitive to outliers than MSE.

\begin{align*}
    MAE = \dfrac{1}{N} \sum_{i=1}^{N} |y_{i} - \hat y_{i}|
\end{align*}

In [14]:
def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

### Root Mean Squared Error (RMSE)
Root Mean Squared Error is expressed in the same units as the target variable, which generally makes it easier to interpret than MSE.

\begin{align*}
    RMSE = \sqrt{\dfrac{1}{N} \sum_{i=1}^{N} (y_{i} - \hat y_{i})^2}
\end{align*}

In [15]:
def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))
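A small worked example comparing the three losses on the same predictions. The values in `y_pred` are made up for illustration. The largest error (0.4) accounts for 0.16 of the 0.22 total squared error, but only 0.4 of the 0.8 total absolute error, which shows how MSE weights large errors more heavily than MAE:

```python
import numpy as np

y_true = np.array([0.0, 1.0, 1.0, 0.0])
y_pred = np.array([0.1, 0.9, 0.8, 0.4])    # illustrative predictions

errors = y_true - y_pred                    # [-0.1, 0.1, 0.2, -0.4]

mse_value = np.mean(errors ** 2)            # (0.01 + 0.01 + 0.04 + 0.16) / 4 = 0.055
mae_value = np.mean(np.abs(errors))         # (0.1 + 0.1 + 0.2 + 0.4) / 4 = 0.2
rmse_value = np.sqrt(mse_value)             # sqrt(0.055), approximately 0.2345

print(f"MSE:  {mse_value:.4f}")             # MSE:  0.0550
print(f"MAE:  {mae_value:.4f}")             # MAE:  0.2000
print(f"RMSE: {rmse_value:.4f}")            # RMSE: 0.2345
```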

Many other loss functions exist. The choice of loss function depends on the specific task (regression, classification, etc.), and understanding its properties is crucial for building effective neural networks.

## 7. Backward Propagation
Backward Propagation (also called Backpropagation) is the algorithm used to train neural networks by minimising the loss function. It calculates gradients of the loss function with respect to weights and biases by propagating prediction errors backward through the network, then updates them using an optimisation algorithm such as Gradient Descent. The steps are as follows:

1. Forward Pass
   - Compute the activations and outputs for each layer.
   - Use a loss function to quantify the error.
2. Backward Pass
   - For each layer, compute gradients of the loss function with respect to activations, pre-activation values, weights and biases by applying the chain rule.
3. Update Parameters
   - Update weights and biases using Gradient Descent.

Using the notation from the forward pass:

- $Z_1 = XW_1 + b_1$ (hidden layer input)
- $h = \sigma(Z_1)$ (hidden layer activation)
- $Z_2 = hW_2 + b_2$ (output layer input)
- $\hat{y} = \sigma(Z_2)$ (predicted output)

We will derive backpropagation for the squared-error loss $C = \frac{1}{2}(\hat{y} - y)^2$, where the factor $\frac{1}{2}$ cancels the $2$ produced by differentiating the square.

### 1. Error Propagation
- Output Layer Error ($\delta_2$):
\begin{align*}
    \delta_2 = \dfrac{\partial C}{\partial Z_2} = \dfrac{\partial C}{\partial \hat y} \cdot \dfrac{\partial \hat y}{\partial Z_2} = (\hat y - y) \odot \hat y(1 - \hat y) = \text{(Loss gradient)} \odot \text{(Activation derivative)}
\end{align*}

- Hidden Layer Error ($\delta_1$):
\begin{align*}
    \delta_1 = \dfrac{\partial C}{\partial Z_1} = \dfrac{\partial C}{\partial h} \cdot \dfrac{\partial h}{\partial Z_1} = (\delta_2 W_{2}^{T}) \odot h(1 - h) = \text{(From output layer)} \odot \text{(Activation derivative)}
\end{align*}
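The two error terms above can be sketched in NumPy as follows. The setup (XOR data, seeded random weights, and all `_demo` / `_xor` names) is an assumption made so that the block runs on its own. The `*` operator plays the role of the elementwise product $\odot$:

```python
import numpy as np

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

# Illustrative setup: XOR data and seeded random weights (the seed is arbitrary).
rng = np.random.default_rng(0)
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_xor = np.array([[0], [1], [1], [0]], dtype=float)
W1_demo, b1_demo = rng.standard_normal((2, 2)), np.zeros((1, 2))
W2_demo, b2_demo = rng.standard_normal((2, 1)), np.zeros((1, 1))

# Forward pass
h_demo = sigmoid(X_xor @ W1_demo + b1_demo)         # shape (4, 2)
y_hat_demo = sigmoid(h_demo @ W2_demo + b2_demo)    # shape (4, 1)

# Output layer error: (loss gradient) * (sigmoid derivative)
delta2 = (y_hat_demo - y_xor) * y_hat_demo * (1 - y_hat_demo)    # shape (4, 1)

# Hidden layer error: (error from output layer) * (sigmoid derivative)
delta1 = (delta2 @ W2_demo.T) * h_demo * (1 - h_demo)            # shape (4, 2)

print(delta2.shape, delta1.shape)   # (4, 1) (4, 2)
```

Each row of `delta2` and `delta1` corresponds to one training sample, so the errors have the same shapes as $Z_2$ and $Z_1$.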


### 2. Output Layer Gradients
With respect to $W_{2}$:

\begin{align*}
    \dfrac{\partial C}{\partial W_{2}} = \dfrac{\partial C}{\partial \hat y} \cdot \dfrac{\partial \hat y}{\partial Z_{2}} \cdot \dfrac{\partial Z_{2}}{\partial W_{2}}
\end{align*}

where:

- $\dfrac{\partial C}{\partial \hat y} = \hat y - y \text{ (Loss Derivative)}$
- $\dfrac{\partial \hat y}{\partial Z_{2}} = \sigma'(Z_{2}) = \hat y(1-\hat y) \text{ (Sigmoid Derivative)}$
- $\dfrac{\partial Z_{2}}{\partial W_{2}} = h \text{ (Linear operation derivative)}$

Thus:

\begin{align*}
    \dfrac{\partial C}{\partial W_{2}} = (\hat y - y) \cdot \hat y(1 - \hat y) \cdot h
\end{align*}

Matrix form:
\begin{align*}
    \nabla_{W_2} = h^T \left[ (\hat{y} - y) \odot \hat{y}(1 - \hat{y}) \right]
\end{align*}

With respect to $b_{2}$:

\begin{align*}
    \dfrac{\partial C}{\partial b_{2}} &= \dfrac{\partial C}{\partial \hat y} \cdot \dfrac{\partial \hat y}{\partial Z_{2}} \cdot \dfrac{\partial Z_{2}}{\partial b_{2}} \\
    &= (\hat y - y) \cdot \hat y(1 - \hat y) \cdot 1
\end{align*}

Once $\delta_2$ is calculated, gradients can be written as:
\begin{align*}
    \nabla_{W_{2}} = h^{T} \delta_2 \\
    \nabla_{b_{2}} = \sum \delta_2
\end{align*}
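In the matrix form above, the product $h^{T}\delta_2$ and the sum $\sum \delta_2$ both accumulate the contributions of all samples in the batch. A minimal sketch with stand-in values (the random `h_demo` and `delta2_demo` are illustrative assumptions, not outputs of a real forward pass) confirms that the matrix product equals the sum of the per-sample gradients:

```python
import numpy as np

rng = np.random.default_rng(0)                 # arbitrary seed
h_demo = rng.uniform(size=(4, 2))              # stand-in hidden activations, 4 samples
delta2_demo = rng.normal(size=(4, 1))          # stand-in output-layer errors

# Batch gradients as written in the matrix form
grad_W2 = h_demo.T @ delta2_demo                          # shape (2, 1)
grad_b2 = np.sum(delta2_demo, axis=0, keepdims=True)      # shape (1, 1)

# The same W2 gradient accumulated one sample at a time: h_n^T * delta2_n
grad_W2_per_sample = sum(
    np.outer(h_demo[n], delta2_demo[n]) for n in range(h_demo.shape[0])
)

print(np.allclose(grad_W2, grad_W2_per_sample))   # True
print(grad_W2.shape, grad_b2.shape)               # (2, 1) (1, 1)
```

The gradient shapes match the shapes of $W_2$ and $b_2$, which is required for the update step.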

### 3. Hidden Layer Gradients
With respect to $W_{1}$:

\begin{align*}
    \dfrac{\partial C}{\partial W_{1}} = \dfrac{\partial C}{\partial h} \cdot \dfrac{\partial h}{\partial Z_{1}} \cdot \dfrac{\partial Z_{1}}{\partial W_{1}}
\end{align*}

where:

- $\dfrac{\partial C}{\partial h} = \dfrac{\partial C}{\partial Z_{2}} \cdot \dfrac{\partial Z_{2}}{\partial h}$
- $\dfrac{\partial C}{\partial Z_{2}} = (\hat y - y) \cdot \hat y(1 - \hat y) \text{ (From output layer)}$
- $\dfrac{\partial Z_{2}}{\partial h} = W_{2}$
- $\dfrac{\partial h}{\partial Z_{1}} = \sigma'(Z_{1}) = h(1-h)$
- $\dfrac{\partial Z_{1}}{\partial W_{1}} = X$

Thus:
\begin{align*}
    \dfrac{\partial C}{\partial W_{1}} = [(\hat y - y) \cdot \hat y(1 - \hat y) \cdot W^{T}_{2}] \odot h(1-h) \cdot X
\end{align*}

Matrix form:
\begin{align*}
    \nabla_{W_1} = X^T \left[ \left( (\hat{y} - y) \odot \hat{y}(1 - \hat{y}) \right) W_2^T \odot h(1 - h) \right]
\end{align*}

With respect to $b_{1}$:

\begin{align*}
    \dfrac{\partial C}{\partial b_{1}} &= \dfrac{\partial C}{\partial h} \cdot \dfrac{\partial h}{\partial Z_{1}} \cdot \dfrac{\partial Z_{1}}{\partial b_{1}} \\
    &= [(\hat y - y) \cdot \hat y(1 - \hat y) \cdot W^{T}_{2}] \odot h(1-h) \cdot 1
\end{align*}

Once $\delta_1$ is calculated, gradients can be written as:
\begin{align*}
    \nabla_{W_{1}} = X^{T} \delta_1 \\
    \nabla_{b_{1}} = \sum \delta_1
\end{align*}
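A common way to validate a backpropagation derivation is a numerical gradient check. The sketch below compares $X^{T}\delta_1$ and $\sum \delta_1$ with central finite differences of the summed loss $C = \frac{1}{2}\sum(\hat{y} - y)^2$. The helper `total_loss`, the seed and all `_demo` names are assumptions made for this example:

```python
import numpy as np

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def total_loss(W1, b1, W2, b2, X, y):
    """Summed squared-error loss C = 1/2 * sum((y_hat - y)^2)."""
    h = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(h @ W2 + b2)
    return 0.5 * np.sum((y_hat - y) ** 2)

rng = np.random.default_rng(0)                     # arbitrary seed
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_xor = np.array([[0], [1], [1], [0]], dtype=float)
W1_demo, b1_demo = rng.standard_normal((2, 2)), np.zeros((1, 2))
W2_demo, b2_demo = rng.standard_normal((2, 1)), np.zeros((1, 1))

# Analytic gradients from the derivation
h_demo = sigmoid(X_xor @ W1_demo + b1_demo)
y_hat_demo = sigmoid(h_demo @ W2_demo + b2_demo)
delta2 = (y_hat_demo - y_xor) * y_hat_demo * (1 - y_hat_demo)
delta1 = (delta2 @ W2_demo.T) * h_demo * (1 - h_demo)
grad_W1 = X_xor.T @ delta1                             # shape (2, 2)
grad_b1 = np.sum(delta1, axis=0, keepdims=True)        # shape (1, 2)

# Numerical gradients: perturb one parameter at a time
eps = 1e-6
num_W1 = np.zeros_like(W1_demo)
for i in range(W1_demo.shape[0]):
    for j in range(W1_demo.shape[1]):
        step = np.zeros_like(W1_demo)
        step[i, j] = eps
        num_W1[i, j] = (
            total_loss(W1_demo + step, b1_demo, W2_demo, b2_demo, X_xor, y_xor)
            - total_loss(W1_demo - step, b1_demo, W2_demo, b2_demo, X_xor, y_xor)
        ) / (2 * eps)

num_b1 = np.zeros_like(b1_demo)
for j in range(b1_demo.shape[1]):
    step = np.zeros_like(b1_demo)
    step[0, j] = eps
    num_b1[0, j] = (
        total_loss(W1_demo, b1_demo + step, W2_demo, b2_demo, X_xor, y_xor)
        - total_loss(W1_demo, b1_demo - step, W2_demo, b2_demo, X_xor, y_xor)
    ) / (2 * eps)

print(np.allclose(grad_W1, num_W1, atol=1e-6))   # True
print(np.allclose(grad_b1, num_b1, atol=1e-6))   # True
```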

### 4. Update Rules
For learning rate $\eta$:
\begin{align*}
    W_2 &\leftarrow W_2 - \eta \cdot \nabla_{W_{2}} \\
    b_2 &\leftarrow b_2 - \eta \cdot \nabla_{b_{2}} \\
    W_1 &\leftarrow W_1 - \eta \cdot \nabla_{W_{1}} \\
    b_1 &\leftarrow b_1 - \eta \cdot \nabla_{b_{1}}
\end{align*}
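The update rules can be sketched as a single gradient-descent step. The helper `loss_and_grads`, the learning rate `eta = 0.1`, the seed and the `_demo` / `_new` names are illustrative assumptions. The loss is the summed form $C = \frac{1}{2}\sum(\hat{y} - y)^2$, which matches the gradients derived above:

```python
import numpy as np

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def loss_and_grads(W1, b1, W2, b2, X, y):
    """Return the summed squared-error loss and the four gradients."""
    h = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(h @ W2 + b2)
    delta2 = (y_hat - y) * y_hat * (1 - y_hat)
    delta1 = (delta2 @ W2.T) * h * (1 - h)
    loss = 0.5 * np.sum((y_hat - y) ** 2)
    grads = (
        X.T @ delta1,                              # gradient for W1
        np.sum(delta1, axis=0, keepdims=True),     # gradient for b1
        h.T @ delta2,                              # gradient for W2
        np.sum(delta2, axis=0, keepdims=True),     # gradient for b2
    )
    return loss, grads

rng = np.random.default_rng(0)                     # arbitrary seed
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_xor = np.array([[0], [1], [1], [0]], dtype=float)
W1_demo, b1_demo = rng.standard_normal((2, 2)), np.zeros((1, 2))
W2_demo, b2_demo = rng.standard_normal((2, 1)), np.zeros((1, 1))

eta = 0.1                                          # learning rate
loss_before, (gW1, gb1, gW2, gb2) = loss_and_grads(
    W1_demo, b1_demo, W2_demo, b2_demo, X_xor, y_xor)

# One step against the gradient for every parameter
W1_new = W1_demo - eta * gW1
b1_new = b1_demo - eta * gb1
W2_new = W2_demo - eta * gW2
b2_new = b2_demo - eta * gb2

loss_after, _ = loss_and_grads(W1_new, b1_new, W2_new, b2_new, X_xor, y_xor)
print(f"Loss before: {loss_before:.6f}, after one step: {loss_after:.6f}")
```

Repeating this step many times is what the training loop in the next section does.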


## 8. Encapsulation
In each epoch, one forward pass and one backward pass are executed, so the weights and biases are updated repeatedly to minimise the loss.

In [16]:
class VanillaNeuralNetwork:
    """
    A two-layer neural network implementation.

    This neural network consists of:
    - Input layer.
    - One hidden layer with sigmoid activation.
    - Output layer with sigmoid activation.
    - Trained using backpropagation with MSE loss.

    Attributes:
        W1 (NDArray[np.float64]): Weight matrix between input and hidden layer.
        b1 (NDArray[np.float64]): Bias vector for hidden layer.
        W2 (NDArray[np.float64]): Weight matrix between hidden and output layer.
        b2 (NDArray[np.float64]): Bias vector for output layer.
    """

    def __init__(self, input_size: int, hidden_size: int, output_size: int) -> None:
        """
        Initialise weights and biases for the neural network.

        Args:
            input_size: Number of input features.
            hidden_size: Number of neurons in hidden layer.
            output_size: Number of neurons in output layer.

        Initialisation:
            Weights: Random values from standard normal distribution.
            Biases: Initialised to zeros.
        """
        self.W1 = np.random.randn(input_size, hidden_size)
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size)
        self.b2 = np.zeros((1, output_size))

    def sigmoid(self, Z: NDArray[np.float64]) -> NDArray[np.float64]:
        """
        Compute sigmoid activation function.

        Args:
            Z: Input tensor (linear transformation).

        Returns:
            Sigmoid activation of input, range (0, 1).
        """
        return 1 / (1 + np.exp(-Z))

    def sigmoid_derivative(self, a: NDArray[np.float64]) -> NDArray[np.float64]:
        """
        Compute derivative of sigmoid function.

        Note: This function expects the activation output (a), not the raw input (Z).

        Args:
            a: Output from sigmoid activation (a = σ(Z)).

        Returns:
            Derivative of sigmoid: a * (1 - a).
        """

        return a * (1 - a)

    def forward(self, X: NDArray[np.int64]) -> NDArray[np.float64]:
        """
        Perform forward propagation through the network.

        Computes:
            hidden_input = X·W1 + b1
            hidden_output = σ(hidden_input)
            output_input = hidden_output·W2 + b2
            prediction = σ(output_input)

        Stores intermediate values for use in backpropagation.

        Args:
            X: Input data, shape (n_samples, input_size)

        Returns:
            Final predictions, shape (n_samples, output_size)
        """
        self.hidden_input = np.dot(X, self.W1) + self.b1
        self.hidden_output = self.sigmoid(self.hidden_input)
        self.output_input = np.dot(self.hidden_output, self.W2) + self.b2
        self.prediction = self.sigmoid(self.output_input)
        return self.prediction

    def backward(self, X: NDArray[np.int64], y: NDArray[np.float64], lr: float = 0.01) -> None:
        """
        Perform backpropagation and update weights & biases.

        Computes gradients using chain rule and updates parameters:
            1. Calculate output layer error.
            2. Calculate hidden layer error.
            3. Update weights and biases using gradient descent.

        Args:
            X: Input data, shape (n_samples, input_size).
            y: Target values, shape (n_samples, output_size).
            lr: Learning rate for gradient descent.
        """

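        # Output layer error. Note the sign: (y - prediction) is the negative of
        # the loss derivative (prediction - y), so the updates below use `+=`.
        output_error = y - self.prediction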
        output_delta = output_error * self.sigmoid_derivative(self.prediction)

        # Hidden layer error
        hidden_error = output_delta.dot(self.W2.T)
        hidden_delta = hidden_error * \
            self.sigmoid_derivative(self.hidden_output)

        # Update weights and biases
        self.W2 += self.hidden_output.T.dot(output_delta) * lr
        self.b2 += np.sum(output_delta, axis=0, keepdims=True) * lr
        self.W1 += X.T.dot(hidden_delta) * lr
        self.b1 += np.sum(hidden_delta, axis=0, keepdims=True) * lr

    def mse(self, y: NDArray[np.float64], y_hat: NDArray[np.float64]) -> float:
        """
        Compute mean squared error loss.

        Args:
            y: Ground truth values.
            y_hat: Predicted values.

        Returns:
            Mean squared error
        """

        return np.mean((y - y_hat) ** 2)

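    def train(self, X: NDArray[np.int64], y: NDArray[np.float64], epochs: int, lr: float) -> None:
        """Run `epochs` full-batch forward/backward passes, logging the loss about ten times."""
        log_every = max(1, epochs // 10)  # avoids a modulo by zero when epochs < 10
        for epoch in range(epochs):
            y_pred = self.forward(X)
            self.backward(X, y, lr)
            loss = self.mse(y, y_pred)  # loss of the prediction made before this update
            if epoch == 0 or (epoch + 1) % log_every == 0 or epoch + 1 == epochs:
                print(f"Epoch {epoch + 1}/{epochs}, Loss: {loss:.4f}")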

In [17]:
# Initialise and train network
nn = VanillaNeuralNetwork(input_size=2, hidden_size=2, output_size=1)
nn.train(X, y, epochs=10000, lr=0.1)

# Test predictions
print("\nFinal Predictions:")
for i in range(len(X)):
    probability = round(nn.forward(X[i:i+1]).item(), 4)
    output = 0 if probability <= 0.5 else 1
    print(
        f"Input: {X[i]} -> Output: {output} (Probability: {probability}) | Expected: {y[i][0]}")

Epoch 1/10000, Loss: 0.2652
Epoch 1000/10000, Loss: 0.2265
Epoch 2000/10000, Loss: 0.1825
Epoch 3000/10000, Loss: 0.1335
Epoch 4000/10000, Loss: 0.0398
Epoch 5000/10000, Loss: 0.0149
Epoch 6000/10000, Loss: 0.0083
Epoch 7000/10000, Loss: 0.0056
Epoch 8000/10000, Loss: 0.0041
Epoch 9000/10000, Loss: 0.0033
Epoch 10000/10000, Loss: 0.0027

Final Predictions:
Input: [0 0] -> Output: 0 (Probability: 0.044) | Expected: 0
Input: [0 1] -> Output: 1 (Probability: 0.9507) | Expected: 1
Input: [1 0] -> Output: 1 (Probability: 0.9509) | Expected: 1
Input: [1 1] -> Output: 0 (Probability: 0.0629) | Expected: 0
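
The thresholding done in the loop above can also be written in vectorised form. A minimal sketch, assuming a hypothetical helper `to_labels` and reusing the probabilities printed above as input:

```python
import numpy as np

def to_labels(probabilities, threshold=0.5):
    """Map probabilities to class labels: 1 if strictly above the threshold, else 0."""
    return (np.asarray(probabilities) > threshold).astype(int)

# Probabilities copied from the printed output above
printed_probabilities = np.array([[0.044], [0.9507], [0.9509], [0.0629]])
expected = np.array([[0], [1], [1], [0]])

labels = to_labels(printed_probabilities)
print(labels.ravel())                    # [0 1 1 0]
print(np.array_equal(labels, expected))  # True
```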
