# Motivation

In my research, I have been using existing libraries to build neural network models. For example, [Keras](https://keras.io/) provides an extremely modular and clear Python API which sits on top of [TensorFlow](https://www.tensorflow.org/). It is very easy to train and experiment with neural networks without needing to understand all the details under the hood. This post was created using the content of chapter 6 of the [Deep Learning Book](http://www.deeplearningbook.org/).

Sometimes it's good to understand all of the details. I would like to better understand the backpropagation algorithm. As such, I am implementing a Multi Layer Perceptron from scratch using only Python and NumPy.

# Theory

Neural networks can be thought of as a sequence of transformations applied to input data.

$$y = f(x)$$

Deep neural networks compose many such transformations, which creates complex representations of the data.

$$y = f(f...f(f(x)))$$


Given examples of known $x$ and $y$, we want to define a mapping between them which reliably takes $x$ values as input and outputs the corresponding $y$ values. Further, we want to capture some underlying structure that relates $x$ and $y$, meaning that if another example $x^*$ is sampled in the same way as $x$, and the model has not seen it during training, we can still predict what the corresponding $y^*$ would be.

What form does this transformation $f$ take?

We take:

$$ f(x) = g(w^Tx + c) $$

where $w$ is a weight vector and $c$ is a bias term (for a layer with several units, $w$ becomes a matrix $W$ and $c$ a vector). $w^Tx + c$ constitutes what is known as an [affine map](https://en.wikipedia.org/wiki/Affine_transformation). Affine maps are transformations that preserve points, lines and planes.

$g(z) = \max(0,z)$ is known as the [ReLU](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)) operator. This simple non-linear operation allows the series of functions $y = f(f...f(f(x)))$ to become non-linear. Note that if each $f$ only comprised the affine map $w^Tx + c$, then the resulting composition $y = f(f...f(f(x)))$ would itself always be an affine map, no matter how many layers were stacked.
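The collapse of stacked affine maps can be checked numerically. The following is a minimal sketch using hypothetical weights `W1`, `c1`, `W2`, `c2` chosen only for illustration: without a non-linearity the two layers equal a single affine map, while inserting a ReLU between them changes the result.

```python
import numpy as np

# Hypothetical parameters for two layers: R^2 -> R^2 -> R^1
W1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])
c1 = np.array([0.5, -0.5])
W2 = np.array([[1.0, 1.0]])
c2 = np.array([1.0])

def relu(z):
    return np.maximum(0, z)

x = np.array([2.0, 1.0])

# Two affine layers applied one after the other
stacked = W2 @ (W1 @ x + c1) + c2

# The same computation written as one affine map: (W2 W1) x + (W2 c1 + c2)
collapsed = (W2 @ W1) @ x + (W2 @ c1 + c2)
affine_collapses = np.allclose(stacked, collapsed)

# With a ReLU between the layers the single affine map no longer matches
with_relu = W2 @ relu(W1 @ x + c1) + c2
relu_differs = not np.allclose(with_relu, collapsed)

print(affine_collapses, relu_differs)
```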

Note that this is similar to the formulation of logistic regression. In logistic regression we find vectors $w$ and $c$ such that:

$$y = \sigma(w^T x + c)$$

minimises the cross-entropy of the true and predicted distributions. Minimising this cross entropy defines the cost function. Cross entropy is defined as:

$$H(p, q) = - \sum_i p_i(x) \log q_i(x) = \mathop{{}\mathbb{E}}[-\log q] = H(p) + D_{KL}(p\|q) $$ 

where $H(p) = -\sum_i p_i\log p_i$ is the [entropy](https://en.wikipedia.org/wiki/Entropy_(information_theory)) of $p$ and $D_{KL}(p\|q) = \sum_i p_i\log \frac{p_{i}}{q_{i}}$ is the [Kullback-Leibler divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence) of $p$ from $q$.

For our real values $y$ and our predicted values $y'$, we are predicting probabilities, so $p \in \{y, 1-y\}$ and $q \in \{y', 1-y'\}$, and the cross-entropy loss function is defined by:

$$L(y,y') = - \sum_i p_i(x) \log q_i(x) = -\sum_i \bigl(y_i \log y_i' + (1-y_i) \log (1-y_i')\bigr)$$
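As a small numeric check of these definitions, the sketch below verifies the decomposition of cross-entropy into entropy plus Kullback-Leibler divergence, and then evaluates the binary cross-entropy loss on three examples. The distributions `p`, `q` and the labels and predictions `y_true`, `y_hat` are made-up values for illustration.

```python
import numpy as np

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

def entropy(p):
    return -np.sum(p * np.log(p))

def kl_divergence(p, q):
    return np.sum(p * np.log(p / q))

# Hypothetical true and predicted distributions over two classes
p = np.array([0.8, 0.2])
q = np.array([0.6, 0.4])

# H(p, q) should equal H(p) + D_KL(p || q)
decomposition_holds = np.isclose(cross_entropy(p, q),
                                 entropy(p) + kl_divergence(p, q))

# Binary labels and predicted probabilities of the positive class
y_true = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.9, 0.2, 0.6])

# Each example contributes -log of the probability assigned to its true class
loss = -np.sum(y_true * np.log(y_hat) + (1 - y_true) * np.log(1 - y_hat))

print(decomposition_holds, loss)
```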

# The Chain Rule

In calculus, the chain rule is used to define the derivative of composite functions. If $y = g(x)$ and $z = f(g(x)) = f(y)$, then the chain rule states that:

$$ \frac{dz}{dx} = \frac{df(g(x))}{dx} = \frac{df(y)}{dx} =  \frac{df}{dy}\frac{dy}{dx} = \frac{dz}{dy}\frac{dy}{dx} $$


This generalizes beyond the scalar case. Suppose that $x \in \mathbb{R}^m$, $y \in \mathbb{R}^n$. Let $g$ be a mapping from $\mathbb{R}^m$ to $\mathbb{R}^n$, and let $f$ map from $\mathbb{R}^n$ to $\mathbb{R}$, with $y = g(x)$ and $z = f(y)$.

Then,

$$ \frac{\partial z}{\partial x_i} = \sum_{j}\frac{\partial z}{\partial y_j}\frac{\partial y_j}{\partial x_i}$$

In vector notation this is represented by:

$$\nabla_\mathbf{x} z = \Bigl( \frac{\partial \mathbf{y}}{\partial \mathbf{x}} \Bigr)^T \nabla_{\mathbf{y}} z$$

where $\frac{\partial \mathbf{y}}{\partial \mathbf{x}}$ is the $n \times m$ Jacobian matrix of $g$. We read this as the gradient of $z$ with respect to the vector $\mathbf{x}$.


We see from this that we obtain the gradient with respect to a variable $\mathbf{x}$ by multiplying the transpose of the Jacobian matrix $\frac{\partial \mathbf{y}}{\partial \mathbf{x}}$ by the gradient $\nabla_{\mathbf{y}} z$.
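To make the vector form concrete, here is a minimal sketch that computes the gradient with respect to $\mathbf{x}$ as the transposed Jacobian times the gradient with respect to $\mathbf{y}$, and compares it with central finite differences. The maps `g` (a linear map given by a hypothetical matrix `A`) and `f` (a sum of squares) are assumptions chosen so that the Jacobian is easy to write down.

```python
import numpy as np

# Hypothetical maps: g: R^3 -> R^2 is linear, f: R^2 -> R is a sum of squares
A = np.array([[1.0, 2.0, 0.0],
              [0.0, -1.0, 3.0]])

def g(x):
    return A @ x

def f(y):
    return np.sum(y ** 2)

x = np.array([0.5, -1.0, 2.0])
y = g(x)

jacobian = A                    # dy/dx, shape (n, m) = (2, 3)
grad_y = 2 * y                  # gradient of z with respect to y
grad_x = jacobian.T @ grad_y    # chain rule: (dy/dx)^T grad_y

# Central finite differences as an independent check
eps = 1e-6
numeric = np.array([
    (f(g(x + eps * e)) - f(g(x - eps * e))) / (2 * eps)
    for e in np.eye(3)
])

print(grad_x, numeric)
```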

In the context of deep learning we do not just operate on vectors, but also on tensors. We can compute the gradient of $z$ with respect to a tensor $X$ as:

$$\nabla_{\mathbf{X}} z = \sum_j (\nabla_{\mathbf{X}} Y_j) \frac{\partial z}{\partial Y_j}$$

Here, $(\nabla_{\mathbf{X}} z)_i$ corresponds to $\frac{\partial z}{\partial X_i}$. For a vector, the index $i$ enumerates all elements of the vector. For tensors, we just flatten across all the axes to obtain one long vector, and $i$ indexes into that vector.
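A minimal sketch of this flattening view, assuming a hypothetical matrix-valued input `X`, a fixed matrix `B`, and $z = \sum_j Y_j^2$ with $Y = XB$. The backpropagated gradient has the same shape as `X`, and flattening it gives the same numbers as differentiating $z$ with respect to each entry of the flattened `X` in turn.

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])      # hypothetical input tensor, shape (3, 2)
B = np.array([[1.0, -1.0],
              [0.5, 2.0]])      # hypothetical fixed matrix, shape (2, 2)

def z_of(X):
    Y = X @ B
    return np.sum(Y ** 2)

Y = X @ B
grad_Y = 2 * Y                  # dz/dY, same shape as Y
grad_X = grad_Y @ B.T           # backpropagated gradient, same shape as X

# Flatten X into one long vector and differentiate each entry numerically
eps = 1e-6
flat = X.ravel()
numeric = np.zeros_like(flat)
for i in range(flat.size):
    step = np.zeros_like(flat)
    step[i] = eps
    plus = z_of((flat + step).reshape(X.shape))
    minus = z_of((flat - step).reshape(X.shape))
    numeric[i] = (plus - minus) / (2 * eps)

print(grad_X.ravel())
print(numeric)
```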

# Forward propagation

We can now describe the forward propagation algorithm:

**Algorithm**

$l$ network depth <br>
$\mathbf{W}^{(i)}, i \in \{1,...,l\}$ weight parameters <br>
$\mathbf{b}^{(i)}, i \in \{1,...,l\}$ bias parameters <br>
$\mathbf{x}$ input <br>
$\mathbf{y}$ target output <br>

* $\mathbf{h}^{(0)} = \mathbf{x}$
* **for** $k = 1, ..., l$ **do**
    * $\mathbf{a}^{(k)} = \mathbf{b}^{(k)} + \mathbf{W}^{(k)}\mathbf{h}^{(k-1)}$
    * $\mathbf{h}^{(k)} = f(\mathbf{a}^{(k)})$
* **end for**
* $\mathbf{y'} = \mathbf{h}^{(l)}$
* $J = L(\mathbf{y'},\mathbf{y}) + \lambda \Omega(\theta)$

Here $L$ is the loss and $\lambda \Omega(\theta)$ is an optional regularisation term on the parameters $\theta$.
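The loop in the algorithm can be sketched for an arbitrary depth as follows. This is an illustrative version only: the function name `forward`, the list-based storage and the example weights are assumptions, each layer computes $\mathbf{a}^{(k)} = \mathbf{b}^{(k)} + \mathbf{W}^{(k)}\mathbf{h}^{(k-1)}$, and the cost $J$ is left out.

```python
import numpy as np

def relu(a):
    return np.maximum(0, a)

def forward(x, weights, biases, activation=relu):
    """Return the activations h^(0), ..., h^(l) for a single input x."""
    h = [x]
    for W, b in zip(weights, biases):
        a = b + W @ h[-1]          # affine map of the previous layer's output
        h.append(activation(a))    # element-wise non-linearity
    return h

# Hypothetical two-layer network: 3 inputs -> 2 hidden units -> 1 output
weights = [np.array([[1.0, 0.0, -1.0],
                     [0.5, 0.5, 0.5]]),
           np.array([[1.0, 2.0]])]
biases = [np.array([0.0, -1.0]), np.array([0.5])]

activations = forward(np.array([1.0, 2.0, 3.0]), weights, biases)
y_hat = activations[-1]
print(y_hat)
```

Keeping every $\mathbf{h}^{(k)}$ rather than only the final output is deliberate, since the backward pass needs the intermediate activations.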
    


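We will implement a Multi Layer Perceptron with a single hidden layer. As such, there are two weight matrices and two bias terms: $\mathbf{W}^{(1)}$ maps the inputs to the $h$ hidden units and $\mathbf{W}^{(2)}$ maps the hidden units to the single output, where $h$ is the number of hidden units of the model.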

We will use the breast cancer dataset. This defines a set of 30 predictors that are used to classify malignant vs benign breast cancer tumours.

In [138]:
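import numpy as np
import sklearn.datasets

cancer_data = sklearn.datasets.load_breast_cancer()
y = cancer_data.target  # binary class labels
X = cancer_data.data    # 30 predictors per sample
# Standardise each predictor to zero mean and unit variance
X = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
print((X.shape, y.shape))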

((569, 30), (569,))


We randomly initialise the weight and bias parameters.

In [180]:
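import numpy as np

def rand_init(s, h):
    # Uniform random values in [-0.5, 0.5) with shape (s, h)
    return np.random.random(size=(s, h)) - 0.5

W = {1: rand_init(30, 5), 2: rand_init(5, 1)}  # 30 inputs -> 5 hidden units -> 1 output
h = {}  # activations of each layer, filled in by the forward pass
b = {1: rand_init(1, 1), 2: rand_init(1, 1)}  # one shared bias per layer, broadcast across units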

We compute the forward pass, where the network transforms the input $X$ into the output $y'$. We convert the final output of the neural network to a predicted probability using the $\sigma$ function.

$$ \sigma(x) = \frac{1}{1 + \exp(-x)} $$

In [181]:
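def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def forward_pass(X):
    a = {}  # pre-activations of each layer
    h[0] = X
    for k in [1, 2]:
        a[k] = b[k] + np.dot(h[k-1], W[k])  # affine map
        h[k] = np.maximum(0, a[k])          # ReLU
    # ReLU is also applied to layer 2, so the sigmoid input is non-negative
    # and every prediction is at least 0.5
    y_pred = sigmoid(h[2])
    return y_pred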

In [182]:
y_pred = forward_pass(X)

# Backward pass 

For the backward pass, we need to define the cost function, $J$, as the cross-entropy between $y$ and $y'$.

In [144]:
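def cost(y_real, y_pred):
    # Match the shape of the predictions so the product is element-wise
    # rather than broadcast to an (n, n) matrix
    y_real = np.reshape(y_real, y_pred.shape)
    return -np.sum(y_real*np.log(y_pred) + (1-y_real)*np.log(1-y_pred))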

The derivative of the $\sigma$ function is:

$$ \frac{d \sigma}{dx} = \sigma(x)(1 - \sigma(x))$$
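This identity is easy to confirm numerically. The sketch below compares the closed form with a central finite-difference estimate on a few sample points; the grid `x` and the step `eps` are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def dsigmoid(x):
    s = sigmoid(x)
    return s * (1 - s)

x = np.linspace(-4, 4, 9)
eps = 1e-6

# Central finite-difference estimate of the derivative at each point
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
max_error = np.max(np.abs(numeric - dsigmoid(x)))

print(max_error)
```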

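We can now compute the backward pass, applying the chain rule layer by layer from the output back to the input to obtain the gradient of $J$ with respect to each weight matrix.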

In [257]:
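def dsigmoid(x):
    return sigmoid(x)*(1 - sigmoid(x))

learning_rate = 0.005  # step size for gradient descent

def backward_pass(y_pred):
    y_col = y.reshape(-1, 1)  # match the (n, 1) shape of y_pred
    # For the cross-entropy cost, dJ/dy' times the sigmoid derivative
    # simplifies to (y' - y)
    grad_a2 = y_pred - y_col  # the last layer's error
    grad_a2[h[2] <= 0] = 0  # the derivative of the output layer's ReLU
    grad_w2 = h[1].T.dot(grad_a2)
    grad_h1 = grad_a2.dot(W[2].T)  # the hidden layer's error
    grad_a1 = grad_h1.copy()
    grad_a1[h[1] <= 0] = 0  # the derivative of the hidden layer's ReLU
    grad_w1 = h[0].T.dot(grad_a1)

    # Gradient descent step; each bias is a single value shared by its layer,
    # so its gradient is the sum over all entries
    W[1] -= learning_rate * grad_w1
    W[2] -= learning_rate * grad_w2
    b[1] -= learning_rate * grad_a1.sum()
    b[2] -= learning_rate * grad_a2.sum()
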
In [258]:
backward_pass(y_pred)
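A common way to gain confidence in a hand-written backward pass is a gradient check against finite differences. The following is a minimal, self-contained sketch on a tiny hypothetical network, not the network trained in this post: it has no biases, and it applies the sigmoid directly to the output layer's linear map, which is an assumption of this sketch and differs from `forward_pass` above. The names `loss_fn`, `analytic_grads` and `numeric_grad` and all data values are made up for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def loss_fn(W1, W2, X, y):
    h1 = np.maximum(0, X @ W1)          # ReLU hidden layer
    y_hat = sigmoid(h1 @ W2)            # sigmoid output
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def analytic_grads(W1, W2, X, y):
    a1 = X @ W1
    h1 = np.maximum(0, a1)
    y_hat = sigmoid(h1 @ W2)
    grad_a2 = y_hat - y                 # sigmoid combined with cross-entropy
    grad_W2 = h1.T @ grad_a2
    grad_h1 = grad_a2 @ W2.T
    grad_a1 = grad_h1 * (a1 > 0)        # derivative of ReLU
    grad_W1 = X.T @ grad_a1
    return grad_W1, grad_W2

def numeric_grad(fn, W, eps=1e-6):
    """Central finite differences of fn() with respect to each entry of W."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        original = W[idx]
        W[idx] = original + eps
        plus = fn()
        W[idx] = original - eps
        minus = fn()
        W[idx] = original               # restore the entry
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

# Hypothetical data: 4 samples, 2 features, 2 hidden units, 1 output
X = np.array([[1.0, 2.0], [-1.0, 0.5], [0.5, -1.5], [2.0, 1.0]])
y = np.array([[1.0], [0.0], [1.0], [0.0]])
W1 = np.array([[0.5, -1.0], [0.25, 0.75]])
W2 = np.array([[1.0], [-0.5]])

grad_W1, grad_W2 = analytic_grads(W1, W2, X, y)
num_W1 = numeric_grad(lambda: loss_fn(W1, W2, X, y), W1)
num_W2 = numeric_grad(lambda: loss_fn(W1, W2, X, y), W2)

print(np.max(np.abs(grad_W1 - num_W1)), np.max(np.abs(grad_W2 - num_W2)))
```

The example values are chosen so that no hidden pre-activation sits near zero, where the ReLU is not differentiable and a finite-difference estimate would be unreliable.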



In [276]:
learning_rate = 0.005
W = {1: rand_init(30,5), 2: rand_init(5,1)}
h = {}
b = {1: rand_init(1,1), 2: rand_init(1,1)}
def train():
    for i in range(100):
        y_pred = forward_pass(X)
        loss = cost(y,y_pred)
        if i%20 == 0:
            print (loss)
        backward_pass(y_pred)

In [277]:
train()

456.719292981
197.024215512
190.130295586
189.869291004
189.725359574


