# 1. Backpropagation for bias vectors (1 point)

In class, we discussed a multilayer perceptron (neural network) whose layers were all "dense", i.e. the output $a^m \in \mathbb{R}^{N^m}$ of the $m$th layer is computed as 
\begin{align*}
z^m &= W^m a^{m - 1} + b^m \\
a^m &= \sigma^m(z^m)
\end{align*}
where $W^m \in \mathbb{R}^{N^m \times N^{m - 1}}$ is the weight matrix, $b^m \in \mathbb{R}^{N^m}$ is the bias vector, and $\sigma^m$ is the nonlinearity. We showed that 
$$\frac{\partial C}{\partial W^m} = \frac{\partial C}{\partial z^m} a^{m - 1 \top}$$
Show that
$$\frac{\partial C}{\partial b^m} = \frac{\partial C}{\partial z^m}$$
Hint: The derivation is almost the same as for $W$.



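*   Written component-wise, $z^m_i = \sum_k W^m_{ik} a^{m - 1}_k + b^m_i$, so the bias entry $b^m_j$ appears only in $z^m_j$, with coefficient 1. The Jacobian of $z^m$ with respect to $b^m$ is therefore the identity matrix:

$$\frac{\partial z^m_i}{\partial b^m_j} = \begin{cases} 1 & i = j \\ 0 & i \neq j \end{cases} \qquad \text{i.e.} \qquad \frac{\partial z^m}{\partial b^m} = I$$

*   By the chain rule, the gradient of the cost function $C$ with respect to the bias vector $b^m$ is then:

$$\frac{\partial C}{\partial b^m_j} = \sum_i \frac{\partial C}{\partial z^m_i} \frac{\partial z^m_i}{\partial b^m_j} = \frac{\partial C}{\partial z^m_j} \qquad \Longrightarrow \qquad \frac{\partial C}{\partial b^m} = \frac{\partial C}{\partial z^m}$$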
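
The result can also be checked numerically. The following is a minimal sketch, assuming a single dense sigmoid layer with a summed binary cross-entropy cost; the helper names `sigmoid` and `cost`, the layer sizes, and the random values are illustrative assumptions, not part of the assignment. It compares a finite-difference estimate of $\partial C / \partial b$ against $\partial C / \partial z$.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cost(W, b, a_prev, y):
    # One dense sigmoid layer followed by binary cross-entropy (summed over units)
    a = sigmoid(W @ a_prev + b)
    return float(np.sum((y - 1) * np.log(1 - a) - y * np.log(a)))

# Hypothetical layer sizes and values, chosen only for illustration
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 2))
b = rng.standard_normal((3, 1))
a_prev = rng.standard_normal((2, 1))
y = np.array([[0.0], [1.0], [1.0]])

# Analytic gradient: dC/dz = a - y for sigmoid + cross-entropy, and dC/db = dC/dz
dC_dz = sigmoid(W @ a_prev + b) - y

# Central finite differences, perturbing one bias entry at a time
eps = 1e-6
dC_db_numeric = np.zeros_like(b)
for i in range(b.shape[0]):
    step = np.zeros_like(b)
    step[i, 0] = eps
    dC_db_numeric[i, 0] = (cost(W, b + step, a_prev, y)
                           - cost(W, b - step, a_prev, y)) / (2 * eps)

bias_gradient_matches = np.allclose(dC_db_numeric, dC_dz, atol=1e-6)
print(bias_gradient_matches)
```

If the derivation holds, the two gradients should agree up to finite-difference error.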





# 2. MLP from scratch (3 points)

Using numpy only, implement the backward pass for a sigmoid MLP. Specifically, you will need to implement this functionality in the `train` function in the `SigmoidMLP` class below. You should write numpy code to populate the two lists `weight_gradients` and `bias_gradients`, where each entry in each list is the gradient for the weight matrix or bias vector of the corresponding layer. Then, when you run the code cell at the bottom of this notebook, the trained MLP should output (approximately) 0, 1, 1, 0, having learned the [XOR function](https://en.wikipedia.org/wiki/Exclusive_or). Please use a binary cross-entropy loss, i.e.
$$C(a^L, y) = (y - 1)\log(1 - a^L) - y\log(a^L)$$

*Note 1*: All layers in your model, including the last layer, will use the sigmoid nonlinearity. Remember that 
$$
\frac{\partial}{\partial x}\mathrm{sigmoid}(x) = \mathrm{sigmoid}(x)(1 - \mathrm{sigmoid}(x))$$
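
As a quick sanity check of this identity, here is a small sketch that compares the closed-form derivative with a central finite difference. The grid of test points and the step size `eps` are arbitrary choices made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Arbitrary test points for illustration
x = np.linspace(-5.0, 5.0, 11)

# Closed-form derivative: sigmoid(x) * (1 - sigmoid(x))
analytic = sigmoid(x) * (1.0 - sigmoid(x))

# Central finite-difference estimate of the derivative
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)

sigmoid_derivative_matches = np.allclose(analytic, numeric, atol=1e-8)
print(sigmoid_derivative_matches)
```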

*Note 2*: As we mentioned in class,
$$
\frac{\partial C}{\partial z^L} = a^L - y
$$
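
For reference, this identity follows from the cross-entropy loss and the sigmoid derivative given above. A short derivation, treating $a^L = \mathrm{sigmoid}(z^L)$ as a scalar:
\begin{align*}
\frac{\partial C}{\partial a^L} &= \frac{1 - y}{1 - a^L} - \frac{y}{a^L} = \frac{a^L - y}{a^L(1 - a^L)} \\
\frac{\partial C}{\partial z^L} &= \frac{\partial C}{\partial a^L} \frac{\partial a^L}{\partial z^L} = \frac{a^L - y}{a^L(1 - a^L)} \, a^L(1 - a^L) = a^L - y
\end{align*}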

In [1]:
import numpy as np

class Layer:
    def __init__(self, inputs, outputs):
        # Initialize weight matrix and bias vector
        # Getting the initialization right can be tricky, but for this problem
        # simply drawing from a standard normal distribution should work.
        self.weights = np.random.randn(outputs, inputs)
        self.biases = np.random.randn(outputs, 1)
    def __call__(self, X):
        # Compute \sigmoid(Wx + b)
        return 1/(1 + np.exp(-(self.weights.dot(X) + self.biases)))

In [2]:
class SigmoidMLP:

    def __init__(self, layer_widths):
        self.layers = []
        for inputs, outputs in zip(layer_widths[:-1], layer_widths[1:]):
            self.layers.append(Layer(inputs, outputs))
    
    def train(self, inputs, targets, learning_rate):
        # Forward pass - compute each layer's output and store it for later use
        layer_outputs = [inputs]
        for layer in self.layers:
            layer_outputs.append(layer(layer_outputs[-1]))
        
        # Backward pass to compute gradients
        weight_gradients = []
        bias_gradients = []
        # Error term for the output layer: dC/dz^L = a^L - y (see Note 2)
        delta = layer_outputs[-1] - targets
        # Iterate over the layers in reverse order (from output layer to input layer)
        for i in range(len(self.layers), 0, -1):
            # Weight gradient: dC/dW = dC/dz a^T; the matrix product sums over the batch
            weight_gradients.append(np.dot(delta, layer_outputs[i-1].T))

            # Bias gradient: dC/db = dC/dz, summed over the batch to match the weight gradient
            bias_gradients.append(np.sum(delta, axis=1, keepdims=True))

            # Propagate the error term to the previous layer (not needed for the input)
            if i > 1:
                delta = np.dot(self.layers[i-1].weights.T, delta) * layer_outputs[i-1] * (1 - layer_outputs[i-1])

        # Reverse the weight_gradients and bias_gradients lists to match the order of the layers
        weight_gradients.reverse()
        bias_gradients.reverse()

        # Perform gradient descent by applying updates
        for weight_gradient, bias_gradient, layer in zip(weight_gradients, bias_gradients, self.layers):
            layer.weights -= weight_gradient * learning_rate
            layer.biases -= bias_gradient * learning_rate

    def __call__(self, inputs):
        a = inputs
        for layer in self.layers:
            a = layer(a)
        return a

In [3]:
def train_mlp(n_iterations, learning_rate):
    mlp = SigmoidMLP([2, 2, 1])
    inputs = np.array([[0, 1, 0, 1], 
                       [0, 0, 1, 1]])
    targets = np.array([[0, 1, 1, 0]])
    for _ in range(n_iterations):
        mlp.train(inputs, targets, learning_rate)
    return mlp

In [4]:
# You may need to change the n_iterations and learning_rate values
# but these worked for me
mlp = train_mlp(1000, 1.)
# The following calls should result in (approximately) 0, 1, 1, 0
# If the outputs are somewhat close, your training has succeeded!
print(mlp(np.array([0, 0]).reshape(-1, 1)))
print(mlp(np.array([0, 1]).reshape(-1, 1)))
print(mlp(np.array([1, 0]).reshape(-1, 1)))
print(mlp(np.array([1, 1]).reshape(-1, 1)))

[[0.00278562]]
[[0.99777872]]
[[0.99368235]]
[[0.00169365]]
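
One way to gain confidence in a backward pass like the one above is a finite-difference gradient check. The sketch below is self-contained and does not use the `SigmoidMLP` class; the helpers `forward`, `bce`, and `backprop` are hypothetical names introduced here. It assumes the cost is summed over the four XOR examples, so both weight and bias gradients are summed over the batch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(params, X):
    # params is a list of (W, b) pairs; returns the activations of every layer
    activations = [X]
    for W, b in params:
        activations.append(sigmoid(W @ activations[-1] + b))
    return activations

def bce(params, X, Y):
    # Binary cross-entropy, summed over all examples (an assumption of this sketch)
    a = forward(params, X)[-1]
    return float(np.sum((Y - 1) * np.log(1 - a) - Y * np.log(a)))

def backprop(params, X, Y):
    activations = forward(params, X)
    delta = activations[-1] - Y  # dC/dz for the output layer
    grads = []
    for m in range(len(params) - 1, -1, -1):
        a_prev = activations[m]
        # dC/dW = dC/dz a_prev^T and dC/db = dC/dz, both summed over the batch
        grads.append((delta @ a_prev.T, delta.sum(axis=1, keepdims=True)))
        if m > 0:
            # Propagate through W and the sigmoid of the previous layer
            delta = (params[m][0].T @ delta) * a_prev * (1 - a_prev)
    grads.reverse()
    return grads

rng = np.random.default_rng(0)
params = [(rng.standard_normal((2, 2)), rng.standard_normal((2, 1))),
          (rng.standard_normal((1, 2)), rng.standard_normal((1, 1)))]
X = np.array([[0.0, 1.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 1.0]])
Y = np.array([[0.0, 1.0, 1.0, 0.0]])

grads = backprop(params, X, Y)
eps = 1e-6
max_error = 0.0
for (W, b), (dW, db) in zip(params, grads):
    for arr, grad in ((W, dW), (b, db)):
        for idx in np.ndindex(arr.shape):
            original = arr[idx]
            arr[idx] = original + eps   # perturb one parameter in place
            c_plus = bce(params, X, Y)
            arr[idx] = original - eps
            c_minus = bce(params, X, Y)
            arr[idx] = original         # restore the parameter
            numeric = (c_plus - c_minus) / (2 * eps)
            max_error = max(max_error, abs(numeric - grad[idx]))

gradients_match = max_error < 1e-6
print(gradients_match)
```

A large mismatch in such a check usually points to an indexing or transposition error in the backward pass rather than to a poorly chosen learning rate.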
