## Lab 06: The Multi-Layer Perceptron (MLP) and Backpropagation


In this lab we will go from single neurons to feedforward networks by implementing a simple Multi-Layer Perceptron (MLP) and the famous backpropagation algorithm, which we use to train the MLP from labeled data.

The MLP extends perceptrons to multiple layers with one caveat: we are going to switch to continuous activation functions instead of the Heaviside 0/1 step activation first analyzed by Rosenblatt. In our case, we'll use the sigmoid activation function:

$$\sigma(x) = \frac{1}{1+e^{-x}}$$

The architecture we will implement is simple, going from input layer to hidden layer to output layer:

```3 inputs -> 2 hidden units (sigmoid-activation) -> 1 output unit (sigmoid-activation)```

As we go from neurons to networks, all weight vectors of neurons in one layer are collected in a matrix. For example, the equation for the hidden layer is:

$$ \mathbf{h} = \sigma( W^{(1)} \mathbf{x} + \mathbf{b}^{(1)} ) $$

where $ W^{(1)} \in \mathbb{R}^{m \times n}$, $ \mathbf{x} \in \mathbb{R}^n$ and $\mathbf{h},\mathbf{b}^{(1)} \in \mathbb{R}^m $ (for our architecture, $n = 3$ and $m = 2$). The original weight vectors of the individual hidden-layer perceptrons can be found in the weight matrix as row vectors:

$$ W^{(1)} = \begin{bmatrix}
- & \mathbf{w_1} & - \\
- & \mathbf{w_2} & - \end{bmatrix} $$

and $\mathbf{w_1}, \mathbf{w_2} \in \mathbb{R}^n$. Correspondingly, the output layer is:

$$ \hat{y} = \sigma( W^{(2)} \mathbf{h} + \mathbf{b}^{(2)} ) $$

Notice that we are now explicitly tracking biases.
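
As a minimal sketch of the hidden-layer equation, the snippet below checks that the matrix form gives the same result as evaluating each hidden neuron separately with its own row vector. All values (`x`, `W1_demo`, `b1_demo`) are made up for illustration and are not the lab's parameters.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Illustrative values only (assumed for this sketch): n = 3 inputs, m = 2 hidden units
x = np.array([0.5, -1.0, 2.0])
W1_demo = np.array([[0.1, 0.2, 0.3],     # w_1: weights of hidden neuron 1
                    [-0.3, 0.0, 0.4]])   # w_2: weights of hidden neuron 2
b1_demo = np.array([0.05, -0.05])

# Whole layer at once: h = sigma(W x + b)
h = sigmoid(W1_demo @ x + b1_demo)

# One neuron at a time, using the rows of the weight matrix
h_per_neuron = np.array(
    [sigmoid(W1_demo[i] @ x + b1_demo[i]) for i in range(W1_demo.shape[0])]
)

print(h.shape)                        # one activation per hidden unit
print(np.allclose(h, h_per_neuron))   # both routes agree
```

The output layer has the same structure, only with the hidden activations as its input.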

To train the network, we need to quantify its performance with a loss function, which we then minimize. We will use the (mean) squared error (MSE):

$$L(\hat{y},y) = \frac{1}{2} (\hat{y} - y)^2$$

for a single data point, and

$$L = \frac{1}{2N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2$$

averaged over a dataset of $N$ data points, where $\hat{y}$ is the prediction, i.e. the output of our network, and $y$ is the target variable.
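
To make the relation between the per-sample loss and the dataset loss concrete, here is a small sketch with made-up predictions and targets (the arrays `y_hat` and `y` are assumptions for illustration only). Averaging the per-sample losses gives the dataset-level value.

```python
import numpy as np

# Made-up predictions and targets, for illustration only
y_hat = np.array([0.2, 0.7, 0.9])
y = np.array([0.0, 1.0, 1.0])

# Per-sample squared loss: 1/2 * (y_hat - y)^2, one value per data point
per_sample = 0.5 * (y_hat - y) ** 2

# Dataset loss: 1/(2N) * sum of squared errors, i.e. the mean of the per-sample losses
dataset_loss = per_sample.mean()

print(per_sample)
print(dataset_loss)
```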



## Learning objectives
1. Practise the mechanics of the forward pass through linear layers + activation
1. Translate analytical gradients into NumPy code
1. See how gradient descent gradually reduces the loss

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import urllib.request
np.random.seed(42)

# the sigmoid function and its derivative
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1 - sigmoid(z))

In [2]:
# set up dummy data
n_samples = 200
inputs = np.random.uniform(-1, 1, size=(n_samples, 3))

true_w = np.array([1.5, -2.0, 0.5])
true_b = -0.1
targets = sigmoid(inputs @ true_w + true_b)

# set up initial network parameters (all weights and biases)
n_hidden_units = 2
W1 = 0.1 * np.random.randn(n_hidden_units,3)
b1 = np.zeros(n_hidden_units)

n_output_hidden = 1
W2 = 0.1 * np.random.randn(n_output_hidden,2)
b2 = np.zeros(n_output_hidden)

### Task 1

Your first task is to write a function that implements the forward pass of the network.
Use a scatterplot to visualise the predictions of the untrained (randomly initialised) network against the true targets, and calculate the MSE across the dataset.

In [3]:
# returns the prediction of your network.
def forward_pass(inp, W1, b1, W2, b2):
    pass

# returns the squared loss for one data point
def mse_loss(prediction, target):
    pass


# write a loop over the data that collects all predictions in a list and sums the individual losses.
# Then, take the mean of the loss and plot predictions against targets.

### Task 2

Now train the network! To do so:

1. Calculate (analytically) the gradients of weights and biases in your network using the chain rule.
2. Implement a forward and backward pass function that calculates the prediction, the loss, and all gradients for the weights and biases using your analytic solution for one data point.
3. Train your network for multiple epochs (passes over the dataset) by updating the parameters with step size $\eta$ after every single data point. (This is stochastic gradient descent with batch size 1, where the gradients are computed by backpropagation.)
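
Before training, it can help to compare an analytic gradient with a numerical estimate. The sketch below does this for a single sigmoid neuron without bias, not for the full network; `numerical_grad` is a hypothetical helper, and the values of `x`, `y` and `w` are made up. The same comparison could be applied to each of `dW1`, `db1`, `dW2` and `db2` once you have derived them.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1 - sigmoid(z))

def numerical_grad(f, theta, eps=1e-6):
    """Central-difference estimate of the gradient of a scalar function f
    at theta (hypothetical helper, not part of the lab code)."""
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for idx in np.ndindex(theta.shape):
        step = np.zeros_like(theta)
        step[idx] = eps
        grad[idx] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad

# Made-up data point and weights for a single neuron
x = np.array([0.5, -1.0, 2.0])
y = 0.3
w = np.array([0.1, -0.2, 0.05])

def loss(w_):
    # squared loss of one sigmoid neuron as a function of its weights
    return 0.5 * (sigmoid(w_ @ x) - y) ** 2

# Chain rule: dL/dw = (y_hat - y) * sigma'(z) * x, with z = w . x
z = w @ x
analytic = (sigmoid(z) - y) * sigmoid_prime(z) * x
numeric = numerical_grad(loss, w)

print(np.allclose(analytic, numeric, atol=1e-7))
```

If the two gradients disagree by more than a small tolerance, the analytic derivation or its translation into code most likely contains an error.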

In [4]:
# returns prediction, loss, dW1, db1, dW2, db2
def forward_backward_pass(inp, target, W1, b1, W2, b2):
    pass


In [5]:
losses = []
eta = 0.5
epochs = 20
for epoch in range(epochs):
    pass

# plot the loss over time (as measured in epochs)

### Task 3 (optional)

Try it out on "real" data: look up the famous Iris dataset for more details, and download it as per the code.

As per good machine learning practice, split the data into a training set and a test set, and track both training and test error of your model.
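
One possible way to do the split with plain NumPy is sketched below. The arrays `X` and `t` are random placeholders standing in for the real inputs and targets, and the 80/20 ratio is an assumption, not a requirement of the lab. Shuffling first matters here because the rows of the Iris file are ordered by species.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder arrays standing in for the dataset (150 samples, 3 features)
X = rng.normal(size=(150, 3))
t = rng.normal(size=150)

# Shuffle the indices once, then cut at 80 % (the ratio is an assumption)
perm = rng.permutation(len(X))
n_train = int(0.8 * len(X))
train_idx, test_idx = perm[:n_train], perm[n_train:]

X_train, t_train = X[train_idx], t[train_idx]
X_test, t_test = X[test_idx], t[test_idx]

print(X_train.shape, X_test.shape)
```

During training, update the parameters using the training set only and evaluate the loss on both sets after each epoch.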


In [6]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
raw = urllib.request.urlopen(url).read().decode("utf-8").strip().split("\n")

rows = [r.split(",") for r in raw if r]            # skip empty lines
data = np.array(rows)
features = data[:, :4].astype(float)

inputs = features[:,:3] # sepal length, width, petal length
targets = features[:,3] # petal width

# reset weights
n_hidden_units = 2
W1 = 0.1 * np.random.randn(n_hidden_units,3)
b1 = np.zeros(n_hidden_units)

n_output_hidden = 1
W2 = 0.1 * np.random.randn(n_output_hidden,2)
b2 = np.zeros(n_output_hidden)

In [7]:
# repeat the training steps from above.

In [8]:
np.var(inputs)

np.float64(2.7226293333333333)