# Vanilla Neural Network from Scratch
***
## Table of Contents
1. [Introduction](#1-introduction)
***

In [115]:
import numpy as np
import pandas as pd
from typing import Tuple, List, Dict
from numpy.typing import NDArray
from matplotlib import pyplot as plt
from math import cos, sin, atan

## 1. Introduction
A Vanilla Neural Network (VNN), also known as a feedforward neural network, is the most basic form of artificial neural network and serves as the foundation for more advanced architectures. It is called *vanilla* because it lacks the additional features or complexities found in specialised neural networks, such as Convolutional Neural Networks (CNN) or Recurrent Neural Networks (RNN).


### Layers
A vanilla neural network is composed of one input layer, one or more hidden layer(s) and one output layer.

- **Input Layer**: Receives the data input.
- **Hidden Layer(s)**: Layer(s) where data is processed through weighted connections. These layers allow the network to learn complex patterns.
- **Output Layer**: Produces the final output (e.g., a class label or a predicted value).

### Neurons
Each layer consists of units called **neurons**. Each neuron receives input, processes it, and passes the output to the next layer.

### Weights and Biases
- Weights ($W$) determine the strength of the connections between neurons.
- Biases ($b$) shift the input to the activation function, so a neuron can activate even when all of its inputs are 0.

### Activation Functions
Activation functions are non-linear functions that allow models to learn complex patterns.
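
To see why the non-linearity matters, the sketch below stacks two layers with no activation function and checks that they collapse into a single linear layer. The layer sizes, the seed, and names such as `W_a` and `W_single` are illustrative assumptions, not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)            # arbitrary seed, for repeatability only
x = rng.standard_normal((5, 3))           # 5 samples, 3 features

# Two stacked layers with NO activation function (weights are arbitrary).
W_a, b_a = rng.standard_normal((3, 4)), rng.standard_normal((1, 4))
W_b, b_b = rng.standard_normal((4, 2)), rng.standard_normal((1, 2))
two_layers = (x @ W_a + b_a) @ W_b + b_b

# The same mapping expressed as ONE linear layer.
W_single = W_a @ W_b
b_single = b_a @ W_b + b_b
one_layer = x @ W_single + b_single

print(np.allclose(two_layers, one_layer))  # True
```

Without an activation between them, extra layers add no expressive power, which is why a purely linear network cannot solve problems such as XOR.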

## 2. Loading Data
The XOR dataset is a simple dataset based on the exclusive OR (XOR) logical operation. It involves two binary inputs (either 0 or 1) and one binary output. The output is $1$ if exactly one of the inputs is $1$, and $0$ otherwise. 

| Input A | Input B | Output (A XOR B) |
|---------|---------|------------------|
|    0    |    0    |        0         |
|    0    |    1    |        1         |
|    1    |    0    |        1         |
|    1    |    1    |        0         |

The XOR problem is not linearly separable: no single straight line can divide the two classes in the input space. This exercise demonstrates how hidden layers and non-linear activation functions solve the problem.
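
As a rough illustration (not a proof), the sketch below brute-forces a grid of linear classifiers of the form `w1*x1 + w2*x2 + b > 0` and records the best number of XOR points any of them classifies correctly. The grid range and resolution, and the names `X_xor`, `y_xor` and `best_correct`, are assumptions made for this example.

```python
import itertools
import numpy as np

X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])

grid = np.linspace(-2, 2, 21)             # candidate values for w1, w2 and b
best_correct = 0
for w1, w2, b in itertools.product(grid, repeat=3):
    # Linear decision rule: predict 1 when the weighted sum is positive.
    pred = (X_xor @ np.array([w1, w2]) + b > 0).astype(int)
    best_correct = max(best_correct, int((pred == y_xor).sum()))

print(best_correct)  # 3: no linear rule gets all 4 points right
```

The best any linear rule achieves is 3 out of 4, which matches the claim that a single linear boundary cannot separate XOR.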

In [116]:
# XOR dataset (inputs and outputs)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

## 3. Parameter Initialisation
Each neuron in a neural network involves two main steps: a weighted sum (linear combination) of the inputs plus a bias, followed by an activation function that introduces non-linearity. A single neuron is expressed as:

\begin{align*}
    y = \sigma \left( \sum_{i=1}^{n} w_{i}x_{i} + b \right)
\end{align*}

where:
- $x_{i}$: Inputs to the neuron.
- $w_{i}$: Corresponding weights.
- $b$: Bias.
- $\sigma$: Activation function (e.g., sigmoid, ReLU, tanh).


Or, in vector notation:

\begin{align*}
    y = \sigma \left(Z \right)
\end{align*}

where:
- $Z = W^{T}X + b$
- $X$: Input vector.
- $W$: Weight vector.
- $b$: Bias scalar.
- $\sigma$: Activation function.
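
The two notations describe the same computation. A minimal sketch with made-up numbers (the values of `x`, `w` and `b` are illustrative only) evaluates one neuron both ways:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 0.0])      # inputs (illustrative values)
w = np.array([0.5, -0.5])     # weights (illustrative values)
b = 0.1                       # bias

# Summation form: sum_i w_i * x_i + b
z_sum = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
# Vector form: W^T X + b
z_vec = w @ x + b

y_out = sigmoid(z_vec)        # z = 0.6, so the output is roughly 0.646
print(z_sum, z_vec, y_out)
```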

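First, we need to initialise $W$ and $b$ for each layer. In this example, the weights are drawn at random from a Gaussian distribution to break symmetry (if all weights started equal, every hidden neuron would compute the same output and receive the same update), and the biases are set to 0. In the code, `W1` and `b1` belong to the hidden layer, and `W2` and `b2` to the output layer. For a layer with $m$ inputs and $n$ neurons, the pre-activations are: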

\begin{align*}
    \begin{cases}
    z_1 = x_1 w_{11} + x_2 w_{12} + \cdots + x_m w_{1m} + b_1 \\
    z_2 = x_1 w_{21} + x_2 w_{22} + \cdots + x_m w_{2m} + b_2 \\
    \vdots \\
    z_n = x_1 w_{n1} + x_2 w_{n2} + \cdots + x_m w_{nm} + b_n \\
    \end{cases}
\end{align*}

In matrix form:

\begin{align*}
    \begin{bmatrix}
        z_1 \\
        z_2 \\
        \vdots \\
        z_n
        \end{bmatrix}_{n \times 1}
        =
        \begin{bmatrix}
        w_{11} & w_{12} & \cdots & w_{1m} \\
        w_{21} & w_{22} & \cdots & w_{2m} \\
        \vdots & \vdots & \ddots & \vdots \\
        w_{n1} & w_{n2} & \cdots & w_{nm}
        \end{bmatrix}_{n \times m}
        \begin{bmatrix}
        x_1 \\
        x_2 \\
        \vdots \\
        x_m
        \end{bmatrix}_{m \times 1}
        +
        \begin{bmatrix}
        b_1 \\
        b_2 \\
        \vdots \\
        b_n
    \end{bmatrix}_{n \times 1}
\end{align*}
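
The sketch below checks that the system of equations and the matrix form agree, and also shows the transposed (row-vector) layout that the notebook's code uses, where samples are rows and the product is written `X @ W`. The sizes and random values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)    # arbitrary seed
m, n = 3, 2                       # m inputs, n neurons (illustrative sizes)
W = rng.standard_normal((n, m))   # row i holds the weights of neuron i
x = rng.standard_normal((m, 1))   # column vector of inputs
b = rng.standard_normal((n, 1))   # one bias per neuron

# Equation-by-equation form: z_i = sum_j w_ij * x_j + b_i
z_loop = np.array([[sum(W[i, j] * x[j, 0] for j in range(m)) + b[i, 0]]
                   for i in range(n)])

# Matrix form: Z = W x + b, shape (n, 1)
z_mat = W @ x + b

# Row-vector layout: Z^T = x^T W^T + b^T, shape (1, n)
z_row = x.T @ W.T + b.T

print(np.allclose(z_loop, z_mat), np.allclose(z_mat.T, z_row))  # True True
```

In the row-vector layout the weight matrix has shape (inputs, neurons), which is why `W1` in the code is created with shape `(input_neurons, hidden_neurons)`.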



With an input $X = \left[x_1, x_2\right]$, the hidden layer consists of:

\begin{align*}
    h_1 = \sigma\left(x_1 w_{11} + x_2 w_{12} + b_{1}\right) \\
    h_2 = \sigma\left(x_1 w_{21} + x_2 w_{22} + b_{2}\right)
\end{align*}

and the output layer is:

\begin{align*}
    \hat y = \sigma\left(h_1 w_{1} + h_2 w_{2} + b\right)
\end{align*}
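
To show that this 2-2-1 architecture is able to represent XOR, the sketch below plugs in hand-picked (not learned) parameters: one hidden neuron behaves like OR, the other like NAND, and the output neuron combines them like AND. All numeric values and the `*_demo` names are assumptions chosen for illustration; training would typically find different weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hand-picked parameters; column k of W1_demo feeds hidden neuron k.
W1_demo = np.array([[20.0, -20.0],
                    [20.0, -20.0]])
b1_demo = np.array([[-10.0, 30.0]])   # h1 acts like OR, h2 like NAND
W2_demo = np.array([[20.0], [20.0]])
b2_demo = np.array([[-30.0]])         # output acts like AND of h1 and h2

h_demo = sigmoid(X_xor @ W1_demo + b1_demo)    # hidden activations, shape (4, 2)
y_demo = sigmoid(h_demo @ W2_demo + b2_demo)   # network output, shape (4, 1)

print(np.round(y_demo).astype(int).ravel())    # [0 1 1 0]
```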

In [117]:
input_neurons, hidden_neurons, output_neurons = 2, 2, 1
W1 = np.random.randn(input_neurons, hidden_neurons)
b1 = np.zeros((1, hidden_neurons))
W2 = np.random.randn(hidden_neurons, output_neurons)
b2 = np.zeros((1, output_neurons))

In [118]:
print(f'W1: \n{W1}')
print(f'b1: \n{b1}')
print(f'W2: \n{W2}')
print(f'b2: \n{b2}')

W1: 
[[-0.54438272  0.11092259]
 [-1.15099358  0.37569802]]
b1: 
[[0. 0.]]
W2: 
[[-0.60063869]
 [-0.29169375]]
b2: 
[[0.]]
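
The printed values change on every run because no random seed is set, and `np.random.randn` draws from a standard normal distribution (unit variance) rather than a scaled-down one. A possible variant, assuming a fixed seed and a scale factor of 0.1 (both illustrative choices, not taken from the notebook), might look like this:

```python
import numpy as np

rng = np.random.default_rng(42)   # fixed seed: an assumption, for reproducibility
scale = 0.1                       # illustrative scaling factor for "small" weights

n_in, n_hidden, n_out = 2, 2, 1
W1_small = rng.standard_normal((n_in, n_hidden)) * scale   # hidden-layer weights
b1_zero = np.zeros((1, n_hidden))                          # hidden-layer biases
W2_small = rng.standard_normal((n_hidden, n_out)) * scale  # output-layer weights
b2_zero = np.zeros((1, n_out))                             # output-layer bias

print(W1_small.shape, b1_zero.shape, W2_small.shape, b2_zero.shape)
# (2, 2) (1, 2) (2, 1) (1, 1)
```

The rest of this notebook keeps the unscaled, unseeded initialisation shown above, so its printed numbers will not match this variant.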


## 4. Activation Functions
Activation functions introduce non-linearity to neural networks, which enables them to learn complex patterns. Considering $Z = W^{T}X + b$, some popular activation functions and their derivatives are as follows:

### Sigmoid Function

\begin{align*}
\sigma(Z) = \dfrac{1}{1+e^{-Z}}
\end{align*}

- Output range: (0, 1).
- Smooth gradient (avoiding abrupt jumps).
- Popular for binary classification.

### Sigmoid Derivative

\begin{align*}
\dfrac{d}{dZ}\sigma(Z) = \sigma(Z) \cdot (1 - \sigma(Z))
\end{align*}

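Note that the implementation below expects the *output* of the sigmoid function, $\sigma(Z)$, as its argument, not the raw input $Z$. This avoids recomputing the exponential during backpropagation.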

In [119]:
def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def sigmoid_derivative(A):
    # A is the sigmoid output, A = sigmoid(Z), not the raw pre-activation Z
    return A * (1 - A)
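
A quick way to check a derivative like this is to compare it against a central finite difference. The sketch below redefines the functions so that it runs on its own; the helper name `sigmoid_derivative_from_output`, the test points and the step size `eps` are assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative_from_output(a):
    # Expects a = sigmoid(z), following the convention described above.
    return a * (1.0 - a)

z = np.linspace(-4.0, 4.0, 9)
eps = 1e-6

# Central finite difference approximation of d(sigmoid)/dz
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
analytic = sigmoid_derivative_from_output(sigmoid(z))
wrong = sigmoid_derivative_from_output(z)   # passing raw z is a common mistake

print(np.allclose(numeric, analytic, atol=1e-8))  # True
print(np.allclose(numeric, wrong))                # False
```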

### Rectified Linear Unit (ReLU)
ReLU is computationally efficient and mitigates the vanishing gradient problem, but it can suffer from the 'dying ReLU' problem, where a neuron gets stuck outputting 0 and stops receiving gradient updates.

\begin{align*}
\sigma(Z) = \text{max}(0, Z)
\end{align*}


### ReLU Derivative

\begin{align*}
    \sigma'(Z) =
    \begin{cases}
    1 & \text{if } Z > 0 \\
    0 & \text{otherwise}
    \end{cases}
\end{align*}

In [120]:
def relu(Z):
    return np.maximum(0, Z)

def relu_derivative(Z):
    return (Z > 0).astype(float)

### Leaky ReLU
Leaky ReLU mitigates the 'dying ReLU' problem by allowing small negative outputs, so the gradient for negative inputs is small but non-zero.

\begin{align*}
    \sigma(Z) =
    \begin{cases}
    Z & \text{if } Z > 0 \\
    \alpha Z & \text{otherwise}
    \end{cases}
\end{align*}

where $\alpha$ is typically 0.01.

### Leaky ReLU Derivative

\begin{align*}
    \sigma'(Z) =
    \begin{cases}
    1 & \text{if } Z > 0 \\
    \alpha & \text{otherwise}
    \end{cases}
\end{align*}

In [121]:
def leaky_relu(Z, alpha=0.01):
    return np.where(Z > 0, Z, alpha * Z)

def leaky_relu_derivative(Z, alpha=0.01):
    return np.where(Z > 0, 1, alpha)
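
The difference between the two variants is easiest to see on a few sample values. The sketch below repeats the definitions so that it runs on its own; the input array is an arbitrary choice.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def leaky_relu_derivative(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)

z = np.array([-2.0, -0.5, 0.0, 1.5])

print(relu(z))                    # negative inputs are clipped to 0
print(leaky_relu(z))              # negative inputs are scaled by alpha
print(leaky_relu_derivative(z))   # gradient is alpha (not 0) for z <= 0
```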

### Hyperbolic Tangent (tanh)
With the hyperbolic tangent, the output is centred at 0 (ranging from -1 to +1), and the gradient is stronger than the sigmoid's: its peak value is 1, compared with 0.25 for sigmoid.

\begin{align*}
    \text{tanh}(Z) =
    \dfrac{e^{Z}-e^{-Z}}{e^{Z}+e^{-Z}}
\end{align*}

### Tanh Derivative

\begin{align*}
    \dfrac{d}{dZ} \text{tanh}(Z) =
    1 - \text{tanh}^2(Z)
\end{align*}

In [122]:
def tanh(x):
    return np.tanh(x)

def tanh_derivative(x):
    # Unlike sigmoid_derivative, this expects the raw input x, not tanh(x)
    return 1 - np.tanh(x)**2
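
The "stronger gradient" claim can be checked numerically. The sketch below compares the peak gradients of the two functions at 0 and also verifies the standard identity $\text{tanh}(Z) = 2\sigma(2Z) - 1$, which shows that tanh is a rescaled, shifted sigmoid. The sample points are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-3.0, 3.0, 7)

# Identity: tanh(z) = 2 * sigmoid(2z) - 1
print(np.allclose(np.tanh(z), 2 * sigmoid(2 * z) - 1))   # True

# Peak gradients, both attained at z = 0
sigmoid_grad_at_0 = sigmoid(0.0) * (1 - sigmoid(0.0))
tanh_grad_at_0 = 1 - np.tanh(0.0) ** 2
print(sigmoid_grad_at_0, tanh_grad_at_0)                  # 0.25 1.0
```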

### Softmax
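Softmax converts a vector of $K$ raw scores into a probability distribution over $K$ classes (all outputs are positive and sum to 1), so it is typically used in the output layer for multi-class classification.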

\begin{align*}
    \text{softmax}(Z_{i}) =
    \dfrac{e^{Z_{i}}}{\sum_{j=1}^{K}e^{Z_{j}}}
\end{align*}

### Softmax Derivative

\begin{align*}
    \dfrac{\partial \text{softmax}(Z_{i})}{\partial Z_{j}} = 
    \text{softmax}(Z_{i}) (\delta_{ij} - \text{softmax}(Z_{j}))
\end{align*}

where $\delta_{ij}$ is the Kronecker delta ($1$ if $i = j$, and $0$ otherwise).

In [123]:
def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)
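
The notebook implements only the softmax function itself. As a sketch of the derivative for a single input vector, the Jacobian $\partial\, \text{softmax}(Z_{i}) / \partial Z_{j} = s_i(\delta_{ij} - s_j)$, with $s = \text{softmax}(Z)$, can be built as `diag(s) - outer(s, s)` and checked against finite differences. The helper names `softmax_1d` and `softmax_jacobian` and the test vector are assumptions made for this example.

```python
import numpy as np

def softmax_1d(z):
    e = np.exp(z - np.max(z))     # subtract the max for numerical stability
    return e / e.sum()

def softmax_jacobian(z):
    s = softmax_1d(z)
    # J[i, j] = s_i * (delta_ij - s_j)
    return np.diag(s) - np.outer(s, s)

z = np.array([1.0, 2.0, 3.0])
J = softmax_jacobian(z)

# Finite-difference check: column j approximates d softmax / d z_j
eps = 1e-6
J_num = np.column_stack([
    (softmax_1d(z + eps * np.eye(3)[j]) - softmax_1d(z - eps * np.eye(3)[j])) / (2 * eps)
    for j in range(3)
])

print(np.allclose(J, J_num, atol=1e-8))    # True
print(np.allclose(J.sum(axis=0), 0.0))     # True: outputs always sum to 1
```

Each column sums to zero because the softmax outputs always sum to 1, so raising one probability must lower the others by the same total amount.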

| Function   | Best For                              | Gradient Behavior          |
|------------|---------------------------------------|----------------------------|
| Sigmoid    | Output layer (binary class)           | Vanishes at extremes       |
| ReLU       | Hidden layers (most cases)            | Simple, fast computation   |
| Leaky ReLU | Deep networks (prevents dead neurons) | Avoids zero gradients      |
| tanh       | Hidden layers (stronger gradient)     | Vanishes less than sigmoid |
| Softmax    | Output layer (multi-class)            | Normalizes probabilities   |

In [124]:
Z1 = X @ W1 + b1            # [4x2] = [4x2] @ [2x2] + [1x2] (bias broadcast over the 4 samples)
h = sigmoid(Z1)             # [4x2]
Z2 = h @ W2 + b2            # [4x1] = [4x2] @ [2x1] + [1x1]
y_hat = sigmoid(Z2)         # [4x1]

## 5. Forward Propagation
Forward propagation is the process where input data flows through the neural network to produce predictions.

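The hidden layer's pre-activation is $Z_{1} = XW_{1} + b_{1}$. Here each row of $X$ is one sample, so the product is written $XW_{1}$ rather than $W^{T}X$. Applying the activation function (sigmoid in this example) gives: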

\begin{align*}
    h = \sigma(Z_{1})
\end{align*}

Then we compute the output layer:

\begin{align*}
    Z_{2} = hW_{2}+b_{2}
\end{align*}

Finally, we obtain the output:

\begin{align*}
    \hat y = \sigma(Z_{2})
\end{align*}


In [125]:
def forward(X, W1, b1, W2, b2):
    # Hidden layer computation
    Z1 = np.dot(X, W1) + b1    # Matrix multiplication + bias
    h = sigmoid(Z1)             # Apply activation
    
    # Output layer computation
    Z2 = np.dot(h, W2) + b2     # Matrix multiplication + bias
    y_hat = sigmoid(Z2)         # Apply activation
    
    return h, y_hat

In [126]:
h, y_hat = forward(X, W1, b1, W2, b2)
print(y_hat)

[[0.39027267]
 [0.42134259]
 [0.40746301]
 [0.4319769 ]]
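
These outputs all sit close to 0.4 because the weights are still random and the network has not been trained. As a small sketch, the printed values can be turned into class labels with a 0.5 threshold (an assumed convention for a sigmoid output) and compared with the XOR targets:

```python
import numpy as np

y_true = np.array([[0], [1], [1], [0]])
# Values copied from the printed output above (untrained, random weights).
y_hat_printed = np.array([[0.39027267], [0.42134259], [0.40746301], [0.4319769]])

y_pred = (y_hat_printed > 0.5).astype(int)   # 0.5 threshold is an assumption
accuracy = (y_pred == y_true).mean()

print(y_pred.ravel(), accuracy)              # [0 0 0 0] 0.5
```

Every sample is predicted as class 0, so the untrained network is right only on the two samples whose target is 0.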
