# Deep L-Layer Neural Network

Here, we call deep neural network, a network that has at least 2 layers. A "shallow" network such as logistic regression contains a single layer as illustred in the image below:

<img src="images/logistic_regression.svg" width="30%" align="center"/>

Consider the deep neural network with 4 layers illustred in the image below:

<img src="images/deep_neural_network.svg" width="50%" align="center"/>

In this deep network, we have 4 layers (identified as $L=4$) and the number of neurons in each layer is identified as $n^{[l]}$, where $l$ is the layer. We index the input of the network as layer zero ($l=0$), the first hidden layer ($l=1$), the second hidden layer ($l=2$), the third hidden layer ($l=3$), and the output ($l=4$). Thus, we have $n^{[1]}=5$ since we have 5 units in layer 1, $n^{[2]}=5$ since we have 5 units in layer 2, $n^{[3]}=3$ since we have 3 units in layer 3, and $n^{[4]}=n^{[L]}=1$ since we have 1 units in the last layer. For the input, we have that $n^{[0]}=3$ since we have 3 features in the input. 

We also use $a^{[l]}$ to denote the activation in layer $l$. Thus, in forward, for example, we have that $a^{[l]}=g^{[l]}(z^{[l]})$. We use $w^{[l]}$ to denote the weights in layer $l$ and $b^{[l]}$ to denote the bias. Finally, we denote $X=a^{[0]}$ and $\hat{y}=a^{[L]}$.


# Forward Propagation in a Deep Network

Considering the deep network illustred above, we can compute its forward propagation as:

$
Z^{[1]} = W^{[1]}X + b^{[1]} \\
A^{[1]} = g^{[1]}(Z^{[1]}) \\
Z^{[2]} = W^{[2]}A^{[1]} + b^{[2]} \\
A^{[2]} = g^{[2]}(Z^{[2]}) \\
Z^{[3]} = W^{[3]}A^{[2]} + b^{[3]} \\
A^{[3]} = g^{[3]}(Z^{[3]}) \\
Z^{[4]} = W^{[4]}A^{[3]} + b^{[4]} \\
A^{[4]} = g^{[4]}(Z^{[4]}) \\
$

Considering that $X=A^{[0]}$, we can generalize the equation to:

$$
Z^{[l]} = W^{[l]}A^{[l-1]} + b^{[l]} \\
A^{[l]} = g^{[l]}(Z^{[l]}) \\
$$

Where the prediction is computed as:

$$
\hat{Y} = A^{[L]} = g^{[L]}(Z^{[L]})
$$

# Getting your matrix dimensions right

Consider the 5-layer neural network illustred below:

<img src="images/deep_neural_network_5layers.svg" width="60%" align="center"/>

In this network, we have the number of units as $n^{[0]} = n_x = 2$, $n^{[1]} = 3$, $n^{[2]} = 5$, $n^{[3]} = 4$, $n^{[4]} = 2$, and $n^{[5]} = 1$. The dimensions for each layer using a single example are defined as:

$
z^{[1]} = W^{[1]} X + b^{[1]} \\
(3, 1) = (3, 2) (2, 1) + (3, 1) \\
(n^{[1]}, 1) = (n^{[1]}, n^{[0]}) (n^{[0]}, 1) + (n^{[1]}, 1) \\
$

Using this example, we can see that $W^{[1]} : (n^{[1]}, n^{[0]})$ and in more general terms, we have $W^{[l]} : (n^{[l]}, n^{[l-1]})$. Considering the second layer as example, we can see that:

$
z^{[2]} = W^{[2]}a^{[1]} + b^{[2]} \\
(5, 1) = (5, 3) (3, 1) + (5, 1) \\
(n^{[2]}, 1) = (n^{[2]}, n^{[1]}) (n^{[1]}, 1) + (n^{[2]}, 1) \\
$

As we can see for bias, $b^{[1]} = (n^{[1]}, 1)$, $b^{[2]} = (n^{[2]}, 1)$. In the general case, $b^{[l]} = (n^{[l]}, 1)$. When considering a vectorized implementation, our matrices $z$ and $a$ become $Z$ and $A$ with dimensions:

$
Z^{[1]} = W^{[1]} X + b^{[1]} \\
(n^{[1]}, m) = (n^{[1]}, n^{[0]}) (n^{[0]}, m) + (n^{[1]}, 1) \\
$

where $b^{[1]}$ contains a single column but is broadcasted to $m$ examples, becoming $(n^{[1]}, m)$ automatically.

# Why deep representations?

Consider the problem of recognizing or detecting faces using a deep neural network. In this problem, the input of the network could be a picture of a face. From this input, the first layer of the neural network would be a feature detector or an edge detector, as illustred in the image below, where each little visualization represents a hidden unit that's trying to figure out where the edges of that orientation are in the image. The next layer of the neural network will group these edges in more complex forms and maybe identify parts of faces. For example, it might have a neuron trying to find an eye, and a different neuron trying to find a nose. By putting together lots of edges, it can start to detect different parts of faces. Finally, in deeper layers, by putting together different parts of faces, like an eye or a nose or an ear or a chin, it can recognize or detect different types of faces. 

<img src="images/deep_network.svg" width="80%" align="center"/>

From circuit theory, there is also the intuition about why deep networks seem to work well. The intuition says that you can compute a function with a relatively small (*i.e.*, the number of hidden units is relatively small) but deep neural network. On the other hand, if you try to compute the same function with a shallow network, then you might require exponentially more hidden units to compute the same function.

# Building blocks of deep neural networks

Consider a layer $l$ with $W^{[l]}$ and $b^{[l]}$, we compute the forward as:

Input: $\ \ \ a^{[l-1]}$<br>
Output:$\ \ \ a^{[l]}$<br>
Compute:$\ \ \ z^{[l]} = W^{[l]}a^{[l-1]} + b^{[l]}\ \ \ $#cache$\ \ \ z^{[l]}$<br>
$\hspace{40pt} a^{[l]} = g^{[l]}(z^{[l]})$<br>

And the backward as:

Input:$\ \ \ da^{[l]} \ \ \ \text{and}\ \ \ z^{[l]}$<br>
Output:$\ \ \ da^{[l-1]}, dW^{[l]}, db^{[l]}$<br>

Considering the network presented previously, we select a layer $l$ with some parameters $w^{[l]}$ and $b^{[l]}$ as illustrated in the image below. For the forward propagation, we input the activations $a^{[l-1]}$ from your previous layer and output $a^{[l]}$. To do so, we compute $z^{[l]}$ and then $a^{[l]}$. It turns out that for later use it is useful to also cache the value $z^{[l]}$. For the backward step focusing on computation for this layer $l$, we implement a function that inputs $da^{[l]}$ and outputs $da^{[l-1]}$.

<img src="images/building_blocks.svg" width="30%" align="center"/>

If you want to implement these functions, then the basic computation of the neural network will be as illustred in the image below. First, you have to take the input features $a^{[0]}$ and compute the activations of the first layer ($a^{[1]}$). To do that, you need the $w^{[1]}$ and $b^{[1]}$, and for future use, cache away $z^{[1]}$. Now, you feed that to the second layer and use $w^{[2]}$ and $b^{[2]}$ to compute the activations in the next layer $a^{[2]}$, and so on. Repeat this process until you end up outputting a $l$ which is equal to $\hat{y}$.

<img src="images/building_blocks_all_layers.svg" width="80%" align="center"/>

For the back propagation step, you have to perform a backward sequence of iterations in which you are going backwards and computing gradients like so. In order to do that, you feed $da^{[l]}$ and outputs $da^{[l-1]}$, and so on until we get $da^{[2]}$ and $da^{[1]}$. Along the way, back propagation also ends up outputting the derivatives for weights and bias ($dw^{[l]}$ and $db^{[l]}$ for all layers).

# Forward and Backward Propagation 

In order to compute the forward and backward propagation, we use the following equations:

**Forward Propagation** 

$
Z^{[l]} = W^{[l]}A^{[l-1]} + b^{[l]} \\
A^{[l]} = g^{[l]}(Z^{[l]})
$

**Backward Propagation:**

$
dZ^{[l]} = A^{[l]} - Y \\
dZ^{[l-1]} = W^{[l]^T}dZ^{[l]}*g'^{[l-1]}(Z^{[l-1]}) \\
dA^{[l]} = W^{[l]^T}dZ^{[l]} \\
dZ^{[l-1]} = dA^{[l]} * g'^{[l-1]}(Z^{[l-1]}) \\
dW^{[l]} = \frac{1}{m} dZ^{[l]}A^{[l-1]^T} \\
db^{[l]} = \frac{1}{m} np.sum(dZ^{[l]}, \text{axis}=1, \text{keepdims=True}) \\
$

where * is a element-wise multiplication. In the last layer of the network, we have in the forward (and for the case of logistic regression):

$$\mathcal{L}(\hat{y}, y) = -y \log a - (1 - y) \log (1 - a)$$

And in the backward propagation:

$$dA^{[l]} = \left [ \left (-\frac{y^{[1]}}{a^{[1]}} + \frac{1-y^{[1]}}{1-a^{[1]}}\right ) \ldots \left (-\frac{y^{[m]}}{a^{[m]}} + \frac{1-y^{[m]}}{1-a^{[m]}} \right) \right ]$$

# Example of Deep Neural Network

Below we illustrate an example of a 3-layer deep neural network. In the image below, $W$ are the weights, $Z$ represents the computation of $W^TA + b$ and $A$ is the activation function. Symbols represented with $d$ are the derivations of each symbol.

<img src="images/example_neural_network.svg" width="50%" align="center"/>

In this example, we first perform the forward propagation by first computing $Z$ and then apply the activation function $A$ as:

**Forward propagation**:

$
Z^{[1]} = W^{[1]}X + b^{[1]} \\
A^{[1]} = g(Z^{[1]}) \ \ \ \rightarrow \ \ \ ReLU(Z^{[1]}) \\
Z^{[2]} = W^{[2]}A^{[1]} + b^{[2]} \\
A^{[2]} = g(Z^{[2]}) \ \ \ \rightarrow \ \ \ ReLU(Z^{[2]}) \\
Z^{[3]} = W^{[3]}A^{[2]} + b^{[3]} \\
A^{[3]} = g(Z^{[3]}) \ \ \ \rightarrow \ \ \ Sigmoid(Z^{[3]}) \\
$

Calculating the forward propagation, we obtain $\hat{Y}$ as $A^{[3]}$. Having the prediction $\hat{Y}$ we can calculate the loss (or the error) using:

$
J = \frac{1}{m} (-Y \log(\hat{Y})^T + (1 - Y) \log(1 - \hat{Y})^T)
$

where $m$ is the number of examples we have in our dataset. As we have the true labels in $Y$, we can also calculate the derivative for $Z^{[3]}$, which is the same as calculating the derivative for $\hat{Y}$. Having the derivative of $Z^{[3]}$, we can calulate the derivative of the weights and bias of the last layer. To perform such computation, we use:

$
dZ^{[3]} = -\frac{Y}{\hat{Y}} + \frac{1-Y}{1-\hat{Y}} \\
dW^{[3]} = \frac{1}{m} dZ^{[3]}A^{[2]^T} \\
db^{[3]} = \frac{1}{m} \sum_{cols}dZ^{[3]}
$

As we calculated $dZ^{[3]}$, we can continue the backpropagation to the previous layer using:

$
dA^{[3]} = W^{[3]^T} dZ^{[3]} \\
dZ^{[2]} = dA^{[3]} g'^{[2]}(Z^{[2]}) \\
dW^{[2]} = \frac{1}{m} dZ^{[2]}A^{[1]^T} \\
db^{[2]} = \frac{1}{m} \sum_{cols}dZ^{[2]}
$

And finally to the first layer using:

$
dA^{[2]} = W^{[2]^T} dZ^{[2]} \\
dZ^{[1]} = dA^{[2]} g'^{[1]}(Z^{[1]}) \\
dW^{[1]} = \frac{1}{m} dZ^{[1]}X^T \\
db^{[1]} = \frac{1}{m} \sum_{cols}dZ^{[1]}
$

Now that we calculated all the derivatives, we can update all the weights and bias using a learning rate $\alpha$ with the following:

$
W^{[1]} = W^{[1]} - \alpha * dW^{[1]} \\
b^{[1]} = b^{[1]} - \alpha * db^{[1]} \\
W^{[2]} = W^{[2]} - \alpha * dW^{[2]} \\
b^{[2]} = b^{[2]} - \alpha * db^{[2]} \\
W^{[3]} = W^{[3]} - \alpha * dW^{[3]} \\
b^{[3]} = b^{[3]} - \alpha * db^{[3]}
$

A Python code performing all these computations for a toy example is presented below.

In [166]:
# Example of deep neural network
import numpy as np

# Activation functions
def relu(z, grad=True):
    a = np.maximum(0, z)
    if grad:
        da = np.where(z <= 0, 0, 1)
        return a, da
    return a


def sigmoid(z, grad=True):
    a = 1./(1. + np.exp(-1.*z))
    if grad:
        da = a*(1 - a)
        return a, da
    return a


class NeuralNetwork(object):
    def __init__(self, X, Y, dims):
        self.nn = {}
        self.dv = {}
        self.cache = {}
        self.X = X
        self.Y = Y
        self.nb_layers = len(dims)
        self.loss = float('inf')
        n_prev = X.shape[0]
        for l, n in enumerate(dims, start=1):
            Wl = np.random.randn(n, n_prev)
            dWl = np.zeros(Wl.shape)
            bl = np.ones((n, 1))
            dbl = np.zeros(bl.shape)
            self.nn[l] = {'W': Wl, 'b': bl}
            self.dv[l] = {'dW': dWl, 'db': dbl}
            n_prev = n
            
    def g(self, Z, func='relu', grad=True):
        # Activation functions
        if func == 'relu':
            return relu(Z, grad)
        return sigmoid(Z, grad)
            
    def forward(self):
        for l in sorted(self.nn):
            Wl = self.nn[l]['W']
            bl = self.nn[l]['b']
            if l == 1:
                #print 'W[1]X + b[1] : ', 
                #print '({},{})({},{}) + ({},{})'.format(Wl.shape[0],Wl.shape[1],self.X.shape[0],self.X.shape[1],bl.shape[0],bl.shape[1])
                Zl = np.dot(Wl, self.X) + bl
            else:
                A_prev = self.cache[l-1]['A']
                #print 'W[{}]A[{}] + b[{}] : '.format(l, l-1, l),
                #print '({},{})({},{}) + ({},{})'.format(Wl.shape[0],Wl.shape[1],A_prev.shape[0],A_prev.shape[1],bl.shape[0],bl.shape[1])
                Zl = np.dot(Wl, A_prev) + bl
            if l == len(self.nn):
                #print 'Sigmoid Z[{}] : '.format(l),
                #print '({}, {})'.format(Zl.shape[0], Zl.shape[1])
                Al, dAl = self.g(Zl, func='sigmoid')
            else:
                #print 'ReLU Z[{}] : '.format(l),
                #print '({}, {})'.format(Zl.shape[0], Zl.shape[1])
                Al, dAl = self.g(Zl, func='relu')
            self.cache[l] = {'Z': Zl, 'A': Al, 'dg': dAl}
        loss = 1./X.shape[1] * (- np.dot(Y, np.log(Al).T) + np.dot((1 - Y), np.log(1 - Al).T))
        self.loss = loss[0][0]
        return Al
        
    def backward(self):
        drv = {}
        for l in range(self.nb_layers, 0, -1):
            #print 'Layer: ', l
            if l == self.nb_layers:
                Yhat = self.cache[self.nb_layers]['A']
                dZl = -Y/Yhat + ((1 - Y)/(1 - Yhat))
                #print 'dZ[{}] : ({}, {})'.format(l, dZl.shape[0], dZl.shape[1])
                #print '({},{})({},{}) + ({},{})'.format(Wl.shape[0],Wl.shape[1],A_prev.shape[0],A_prev.shape[1],bl.shape[0],bl.shape[1])
            else:
                Wlt = self.nn[l+1]['W'].T
                #print 'dA[{}] = W[{}].T dZ[{}] : '.format(l+1, l+1, l+1),
                #print '({}, {})({}, {})'.format(Wlt.shape[0], Wlt.shape[1], dZl.shape[0], dZl.shape[1])
                dAl = np.dot(Wlt, dZl)
                gZl = self.cache[l]['dg']
                #print "dZ[{}] = dA[{}] * g'(dZ[{}]) : ".format(l, l+1, l),
                #print '({}, {})({}, {})'.format(dAl.shape[0], dAl.shape[1], gZl.shape[0], gZl.shape[1])
                dZl = dAl * gZl
            if l == 1:
                dWl = 1./X.shape[1] * np.dot(dZl, self.X.T)
            else:
                dWl = 1./X.shape[1] * np.dot(dZl, self.cache[l-1]['A'].T)
            dbl = 1./X.shape[1] * np.sum(dZl, axis=1, keepdims=True)
            self.dv[l] = {'dW': dWl, 'db': dbl}
            
    def update(self, lr):
     dZ^{[3]} = -\frac{Y}{\hat{Y}} + \frac{1-Y}{1-\hat{Y}} \\dW^{[3]} = \frac{1}{m} dZ^{[3]}A^{[2]}^T \\ 
   for l in sorted(self.nn):
            self.nn[l]['W'] -= lr * self.dv[l]['dW'] 
            self.nn[l]['b'] -= lr * self.dv[l]['db']
        
    def summary(self):
        total = 0
        print 'Layer name\tShape\t\tParam #'
        print '========================================'
        print 'Input:\t\t({}, {})'.format(self.X.shape[0], self.X.shape[1])
        print '----------------------------------------'
        for l in sorted(self.nn):
            W = self.nn[l]['W']
            b = self.nn[l]['b']
            params_W = W.shape[0] * W.shape[1]
            params_b = b.shape[0] * b.shape[1]
            total += params_W + params_b
            print 'W{}\t\t({}, {})\t\t{}'.format(l, W.shape[0], W.shape[1], params_W)
            print 'b{}\t\t({}, {})\t\t{}'.format(l, b.shape[0], b.shape[1], params_b)
            if l != len(self.nn):
                print '----------------------------------------'
        print '========================================'
        print 'Total params: {}'.format(total)

In [177]:
nb_epochs = 10
dims = [3, 4, 1]

# 5 examples, 3 features
X = np.array([
    [0.1, 0.2, 0.1, 0.3, 0.09],
    [0.1, 0.2, 0.1, 0.3, 0.09],
    [0.1, 0.2, 0.1, 0.3, 0.09]
])
Y = np.array([1, 2, 1, 3, 4]).reshape(1,5)

nn = NeuralNetwork(X, Y, dims)
nn.summary()    

Layer name	Shape		Param #
Input:		(3, 5)
----------------------------------------
W1		(3, 3)		9
b1		(3, 1)		3
----------------------------------------
W2		(4, 3)		12
b2		(4, 1)		4
----------------------------------------
W3		(1, 4)		4
b3		(1, 1)		1
Total params: 33


In [178]:
nn.forward()
print 'Epoch: 0 :: Loss: {}'.format(nn.loss)
for i in range(1, nb_epochs+1):
    nn.forward()
    nn.backward()
    nn.update(0.000001)
    print 'Epoch: {} :: Loss: {}'.format(i, nn.loss)

Epoch: 0 :: Loss: 8.19672592762
Epoch: 1 :: Loss: 8.19672592762
Epoch: 2 :: Loss: 8.18898000801
Epoch: 3 :: Loss: 8.18126601948
Epoch: 4 :: Loss: 8.17358370352
Epoch: 5 :: Loss: 8.16593280471
Epoch: 6 :: Loss: 8.15831307073
Epoch: 7 :: Loss: 8.15072425229
Epoch: 8 :: Loss: 8.14316610308
Epoch: 9 :: Loss: 8.1356383797
Epoch: 10 :: Loss: 8.12814084166
