# Backpropagation

## Lecture 9

### GRA 4160
### Predictive modelling with machine learning

#### Lecturer: Vegard H. Larsen

Backpropagation is the algorithm used to compute gradients when training feedforward artificial neural networks.
It is used in supervised learning, where training means minimizing a loss function with respect to the weights and biases of the network.
The algorithm is based on the chain rule of calculus and calculates the gradient of the loss function with respect to each weight and bias by propagating the error signal backward through the network.
An optimizer then uses these gradients to adjust the weights and biases so that the predicted output moves closer to the actual output, reducing the error between them.

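As a minimal illustration of the chain rule at work, consider a single weight $w$ feeding an output neuron. The symbols $z$ (the neuron's pre-activation) and $f$ (its activation function, so that $\hat{y} = f(z)$) are notation assumed here for the sketch. The gradient then factors into three local derivatives:

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w}$$
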
In the first step of backpropagation, the input is passed through the network in a forward pass to generate the predicted output.
This output is then compared to the actual target output to calculate the error using a predefined loss function.
The error serves as an indicator of how well the network has performed on that particular input.
The gradient of the loss function with respect to the output layer's activation is then calculated. It points in the direction in which the loss increases fastest, so the parameters are later moved in the opposite direction.

In the second step, the error signal is propagated backward through the network from the output layer to the input layer.
The chain rule of calculus is applied to compute the gradients of the loss function with respect to the weights and biases in each layer.
These gradients are used to update the weights and biases in the network using an optimization algorithm, such as gradient descent or a variant thereof.
This process is repeated for multiple epochs or iterations, with the weights and biases being adjusted incrementally to minimize the loss function and improve the network's performance on the training data.

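The two steps can be sketched for the smallest possible case: a single sigmoid neuron trained on one data point. The input, target, initial parameters, and learning rate are arbitrary illustrative values, not taken from the lecture's example.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Illustrative values (assumptions for this sketch)
x, y = 1.5, 1.0          # one input and its target
w, b = 0.5, 0.0          # initial weight and bias
learning_rate = 0.5

losses = []
for step in range(20):
    # Step 1: forward pass and squared-error loss
    z = w * x + b
    y_hat = sigmoid(z)
    losses.append((y_hat - y) ** 2)

    # Step 2: backward pass via the chain rule
    dL_dyhat = 2 * (y_hat - y)        # derivative of the loss w.r.t. the prediction
    dyhat_dz = y_hat * (1 - y_hat)    # derivative of the sigmoid w.r.t. z
    dL_dz = dL_dyhat * dyhat_dz
    dL_dw = dL_dz * x                 # dz/dw = x
    dL_db = dL_dz                     # dz/db = 1

    # Gradient descent update
    w -= learning_rate * dL_dw
    b -= learning_rate * dL_db

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```
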
## Python example of backpropagation

This example uses a small neural network with one hidden layer to perform binary classification on a synthetic dataset.

We create a synthetic dataset with 500 data points for binary classification. The neural network has one input layer with two neurons (corresponding to the two dimensions of the data), one hidden layer with four neurons, and one output layer with a single neuron.

The forward pass calculates the output of the network using the current weights and biases. Then, the loss is calculated using mean squared error (MSE) as the cost function. The backpropagation algorithm computes the gradients of the loss with respect to the weights and biases of the network. The weights and biases are then updated using gradient descent with a learning rate of 0.1.

The training loop runs for 1,000 epochs, and the loss is printed every 100 epochs. Finally, the trained model is used to make predictions on a test dataset.

## Model architecture
Mathematically, the model can be represented as follows:

Hidden layer calculation:

$$h = \sigma(W1 \cdot x + b1)$$

Output layer calculation:

$$\hat{y} = \sigma(W2 \cdot h + b2)$$

Where $x$ is the input vector, $h$ is the hidden layer output, $\hat{y}$ is the predicted output, and $\sigma$ is the sigmoid activation function given by:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

The loss function is given by:

$$L = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \text{mean}\left((y_i - \hat{y}_i)^2\right)$$

Where $y_i$ is the actual output for data point $i$ and $n$ is the number of data points.

The derivative of the loss function with respect to the prediction for data point $i$ is given by:

$$\frac{\partial L}{\partial \hat{y}_i} = \frac{2}{n} (\hat{y}_i - y_i)$$

The constant factor $\frac{2}{n}$ only rescales the gradient and can be absorbed into the learning rate. Dropping it, as the code below does, leaves the error signal for a single data point:

$$\frac{\partial L}{\partial \hat{y}_i} \propto (\hat{y}_i - y_i)$$

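A quick way to sanity-check a loss derivative is to compare it with central finite differences. The sketch below does this for the per-prediction MSE derivative $\frac{2}{n}(\hat{y}_i - y_i)$; the targets and predictions are hypothetical values chosen only for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

# Hypothetical targets and predictions (assumptions for this sketch)
y_true = np.array([1.0, 0.0, 1.0, 0.0])
y_pred = np.array([0.8, 0.3, 0.6, 0.1])
n = y_true.size

# Analytic gradient, including the 2/n factor
analytic = 2 * (y_pred - y_true) / n

# Central finite differences, perturbing one prediction at a time
eps = 1e-6
numeric = np.zeros(n)
for i in range(n):
    step = np.zeros(n)
    step[i] = eps
    numeric[i] = (mse(y_true, y_pred + step) - mse(y_true, y_pred - step)) / (2 * eps)

print("analytic:", analytic)
print("numeric: ", numeric)
```
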
The derivative of the sigmoid activation function is given by:

$$\frac{\partial \sigma(z)}{\partial z} = \sigma(z) (1 - \sigma(z))$$

Let's say we have a value $a$ that represents the output of the sigmoid function, $a=\sigma(z)$. Then the derivative can be computed from the output alone, without knowing $z$:

$$\frac{\partial \sigma(z)}{\partial z} = a (1 - a)$$

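The identity can be verified numerically. The following sketch compares $\sigma(z)(1 - \sigma(z))$ with a central finite-difference estimate on an arbitrary grid of $z$ values (the grid and step size are assumptions for illustration).

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-5, 5, 11)   # arbitrary grid of inputs
a = sigmoid(z)               # sigmoid outputs

# Derivative computed from the output alone
analytic = a * (1 - a)

# Central finite-difference estimate of the same derivative
eps = 1e-5
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)

max_abs_diff = np.max(np.abs(analytic - numeric))
print("max abs difference:", max_abs_diff)
```
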


In [None]:
import numpy as np

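# Generate a synthetic dataset: 500 points drawn from a 2-D standard normal distribution
np.random.seed(42)  # Fix the seed so the results are reproducible
data = np.random.randn(500, 2)
# A point is labelled True when its two coordinates sum to a positive number
labels = (data[:, 0] + data[:, 1]) > 0

# Inspect the first five points and their labels
print(data[0:5])
print(labels[0:5])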

In [None]:
# Define the sigmoid activation function
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

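# Define the derivative of the sigmoid activation function.
# Note: the argument is the sigmoid *output* a = sigmoid(z), not z itself,
# because sigmoid'(z) = a * (1 - a).
def sigmoid_derivative(a):
    return a * (1 - a)

# Define the mean squared error (MSE) loss function
def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

# Define the derivative of the MSE loss with respect to the predictions.
# The constant factor 2/n is dropped; it only rescales the gradient and is
# absorbed into the learning rate.
def mse_derivative(y_true, y_pred):
    return y_pred - y_true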

# Set the size of the input, hidden, and output layers
input_size = 2
hidden_size = 4
output_size = 1

# Initialize the weights and biases between the input and hidden layers
W1 = np.random.randn(input_size, hidden_size)  # Weight matrix connecting input and hidden layers
b1 = np.zeros(hidden_size)  # Bias vector for the hidden layer

# Initialize the weights and biases between the hidden and output layers
W2 = np.random.randn(hidden_size, output_size)  # Weight matrix connecting hidden and output layers
b2 = np.zeros(output_size)  # Bias vector for the output layer

# Next, train the network with backpropagation

In [None]:
# Training parameters
epochs = 1000
learning_rate = 0.1

# Training loop
for epoch in range(epochs):
    # Forward pass
    hidden = sigmoid(np.dot(data, W1) + b1)
    output = sigmoid(np.dot(hidden, W2) + b2)

    # Calculate loss
    loss = mse(labels.reshape(-1, 1), output)

    # Backpropagation: error signals at the output and hidden layers
    d_output = mse_derivative(labels.reshape(-1, 1), output) * sigmoid_derivative(output)
    d_hidden = np.dot(d_output, W2.T) * sigmoid_derivative(hidden)

    # Update weights and biases with gradient descent.
    # The gradients are summed (not averaged) over all data points.
    W2 -= learning_rate * np.dot(hidden.T, d_output)
    b2 -= learning_rate * np.sum(d_output, axis=0)
    W1 -= learning_rate * np.dot(data.T, d_hidden)
    b1 -= learning_rate * np.sum(d_hidden, axis=0)

    if epoch % 100 == 0:
        print(f"Epoch {epoch}, Loss: {loss:.4f}")

**d_output:**
This line calculates the error gradient at the output layer.
The first part, `mse_derivative(labels.reshape(-1, 1), output)`, computes the derivative of the mean squared error (MSE) loss function with respect to the predicted output, up to the constant factor $\frac{2}{n}$.
The second part, `sigmoid_derivative(output)`, calculates the derivative of the sigmoid activation function with respect to its input. It is evaluated from the sigmoid's output, which here is the output of the neural network.
The two gradients are multiplied element-wise to give the gradient of the loss with respect to the output layer's pre-activation.

Mathematically, `d_output` represents the following computation, where $\odot$ denotes element-wise multiplication:

$$\delta_{\text{out}} = (\hat{y} - y) \odot \hat{y} \odot (1 - \hat{y})$$

**d_hidden:**
This line calculates the error gradient at the hidden layer.
The first part, `np.dot(d_output, W2.T)`, calculates the gradient of the loss with respect to the hidden layer's output by backpropagating the output layer's gradient through the weights (W2) connecting the hidden layer and the output layer.
The second part, `sigmoid_derivative(hidden)`, computes the derivative of the sigmoid activation function with respect to its input, evaluated from the output of the hidden layer.
The two gradients are multiplied element-wise to give the gradient of the loss with respect to the hidden layer's pre-activation.

Mathematically, `d_hidden` represents the following computation:

$$\delta_{h} = (\delta_{\text{out}} \cdot W2^\top) \odot h \odot (1 - h)$$

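One way to gain confidence in these formulas is a gradient check against finite differences. The sketch below assumes a tiny random dataset and network (sizes, seed, and the helper names `forward`, `half_sse`, and `numerical_grad` are illustrative choices, not part of the lecture code). Because the factor $\frac{2}{n}$ is dropped and the updates sum over data points, the backpropagated quantities are the gradients of half the sum of squared errors, so that is the loss differentiated numerically here.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    hidden = sigmoid(X @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)
    return hidden, output

def half_sse(X, y, W1, b1, W2, b2):
    # Half the sum of squared errors; its derivative w.r.t. the output is (output - y)
    _, output = forward(X, W1, b1, W2, b2)
    return 0.5 * np.sum((output - y) ** 2)

# Tiny random problem (assumed sizes: 5 points, 2 inputs, 4 hidden units, 1 output)
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
y = (X.sum(axis=1) > 0).astype(float).reshape(-1, 1)
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

# Backpropagation, using the same formulas as d_output and d_hidden
hidden, output = forward(X, W1, b1, W2, b2)
d_output = (output - y) * output * (1 - output)
d_hidden = (d_output @ W2.T) * hidden * (1 - hidden)
grad_W2 = hidden.T @ d_output
grad_W1 = X.T @ d_hidden

def numerical_grad(param, eps=1e-5):
    # Central differences; perturbs `param` in place and restores each entry
    grad = np.zeros_like(param)
    for idx in np.ndindex(param.shape):
        original = param[idx]
        param[idx] = original + eps
        loss_plus = half_sse(X, y, W1, b1, W2, b2)
        param[idx] = original - eps
        loss_minus = half_sse(X, y, W1, b1, W2, b2)
        param[idx] = original
        grad[idx] = (loss_plus - loss_minus) / (2 * eps)
    return grad

max_err_W1 = np.max(np.abs(grad_W1 - numerical_grad(W1)))
max_err_W2 = np.max(np.abs(grad_W2 - numerical_grad(W2)))
print("max error W1:", max_err_W1)
print("max error W2:", max_err_W2)
```
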
In [None]:
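# Testing: four hand-picked points.
# The first two lie exactly on the decision boundary (x1 + x2 = 0), so their
# predicted class may go either way; the true labels of the last two are 1 and 0.
test_data = np.array([[0.5, -0.5],
                      [-0.5, 0.5],
                      [0.5, 0.5],
                      [-0.5, -0.5]])

# Forward pass with the trained weights
hidden = sigmoid(np.dot(test_data, W1) + b1)
test_output = sigmoid(np.dot(hidden, W2) + b2)
# Threshold the predicted probabilities at 0.5 to obtain class labels
test_predictions = (test_output > 0.5).astype(int).reshape(-1)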

print("Test predictions:", test_predictions)
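
The thresholding step can also be turned into a simple accuracy score. The helper `accuracy` and the example outputs and targets below are hypothetical, introduced only to illustrate the idea; they are not results from the trained network.

```python
import numpy as np

def accuracy(probabilities, targets, threshold=0.5):
    """Fraction of thresholded predictions that match the targets."""
    predictions = (np.asarray(probabilities) > threshold).astype(int).reshape(-1)
    return np.mean(predictions == np.asarray(targets).reshape(-1))

# Hypothetical network outputs and true labels, for illustration only
example_output = np.array([[0.9], [0.2], [0.6], [0.4]])
example_targets = np.array([1, 0, 0, 0])

print("Accuracy:", accuracy(example_output, example_targets))
```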