# Neural Networks Overview

In the previous lesson, we implemented logistic regression. In that model, we compute $z$ from the features $x$ and the parameters $w$ and $b$. $z$ is then used to compute $a$, which serves as the prediction $\hat{y}$. From there, we can compute the loss function $\mathcal{L}$.
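The logistic regression steps recapped here can be sketched in NumPy as follows. This is a minimal illustration rather than code from the lesson: the feature values, the parameter values, the label `y`, and the choice of the cross-entropy loss for $\mathcal{L}$ are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical example values (not taken from the lesson)
x = np.array([[0.1], [0.2], [0.3]])   # features, shape (3, 1)
w = np.array([[0.5], [-0.5], [1.0]])  # weights, shape (3, 1)
b = 0.0                               # bias
y = 1                                 # true label

z = (w.T @ x).item() + b              # z = w^T x + b
a = sigmoid(z)                        # a = sigma(z), used as y_hat

# Cross-entropy loss, assumed here as the choice of L
loss = -(y * np.log(a) + (1 - y) * np.log(1 - a))
print(z, a, loss)
```

Each line mirrors one arrow in the computation: parameters and features give $z$, $z$ gives $a$, and $a$ together with the label gives the loss.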

<img src="https://cdn.rawgit.com/rogergranada/MOOCs/master/Coursera/Deeplearning.ai/Neural%20Networks%20and%20Deep%20Learning/Week%203/images/logistic_regression.svg" width="40%" align="center"/>

A neural network is similar to logistic regression, but it stacks together many small sigmoid units. Each node in the stack performs both the $z$ calculation and the $a$ calculation, so every node has its own $z$ value and its own activation. As in logistic regression, the inputs are the features $x$ and some parameters $w$ and $b$. To identify the stack of nodes (also called a layer) that a quantity belongs to, we use a superscript in square brackets. Thus, $W^{[1]}$ denotes the weights of the first layer, $W^{[2]}$ the weights of the second layer, and so on. The image below illustrates the architecture of a neural network.

<img src="https://cdn.rawgit.com/rogergranada/MOOCs/master/Coursera/Deeplearning.ai/Neural%20Networks%20and%20Deep%20Learning/Week%203/images/neural_network.svg" width="40%" align="center"/>

# Neural Network Representation

To introduce neural networks, we start with a network that has a single hidden layer, as illustrated in the image below.

<img src="https://cdn.rawgit.com/rogergranada/MOOCs/master/Coursera/Deeplearning.ai/Neural%20Networks%20and%20Deep%20Learning/Week%203/images/neural_network_explanation.svg" width="40%" align="center"/>

This network contains input features ($x_1, x_2, x_3$) stacked vertically in the so-called `input layer`. In our image, the input is an array $X$ with three rows and one column.

$$
X = \begin{bmatrix}
x_1 \\
x_2 \\
x_3 \\
\end{bmatrix}
$$

We use the vector $X$ to denote the input features, but an alternative notation for them is $a^{[0]}$. The term `a` stands for activations: the values that each layer of the neural network passes on to the next layer. Thus, the input layer passes the values of $X$ to the hidden layer, and the activations of the input layer are called $a^{[0]}$.

The next layer, containing four nodes, is called the `hidden layer` of the neural network. It is called hidden because the true values for these middle nodes are not observed in the training set: the inputs and the outputs can be seen, but the values in this layer cannot. This layer generates a set of activations called $a^{[1]}$. In particular, the first unit (or node) generates a value $a^{[1]}_1$, the second generates $a^{[1]}_2$, and so on. Since this hidden layer has four hidden units, $a^{[1]}$ is a four-dimensional column vector (a 4 by 1 matrix):

$$
a^{[1]} = \begin{bmatrix}
a^{[1]}_1 \\
a^{[1]}_2 \\
a^{[1]}_3 \\
a^{[1]}_4 \\
\end{bmatrix}
$$

Finally, the last layer is a single-node layer called the `output layer`. It is responsible for generating the predicted value $\hat{y}$. In a neural network trained with supervised learning, the training set contains values of the inputs $X$ as well as the target outputs $Y$. The output layer generates a value $a^{[2]}$, which is just a real number. Thus, $\hat{y}$ takes on the value of $a^{[2]}$.

The kind of neural network illustrated in the image is called a *two-layer neural network* because, by convention, we count all layers except the input layer. Hence, the hidden layer is layer one and the output layer is layer two. In our notational convention, the input layer is layer zero, so technically there are three layers in this network: the input layer, the hidden layer, and the output layer.

Finally, the hidden layer and the output layer have parameters associated with them: each layer has its own $w$ and $b$. We write $w^{[1]}$ and $b^{[1]}$ to indicate the parameters associated with layer one (the hidden layer). In this example, $w^{[1]}$ is a 4 by 3 matrix and $b^{[1]}$ is a 4 by 1 vector:

$$
w^{[1]} = \begin{bmatrix}
w_{11} & w_{12} & w_{13} \\
w_{21} & w_{22} & w_{23} \\
w_{31} & w_{32} & w_{33} \\
w_{41} & w_{42} & w_{43} \\
\end{bmatrix} \ \ \ \ \ \ b^{[1]} = \begin{bmatrix}
b_1 \\
b_2 \\
b_3 \\
b_4 \\ 
\end{bmatrix}
$$

Here the first dimension (4) comes from the fact that we have four hidden units, and the second dimension (3) comes from the fact that we have three input features. Similarly, the output-layer parameters $w^{[2]}$ and $b^{[2]}$ are 1 by 4 and 1 by 1 matrices, respectively. The 1 by 4 shape comes from the fact that the output layer has just one unit and the hidden layer has four hidden units.
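The shapes described above can be checked with a short NumPy sketch. The names `n_x`, `n_h`, and `n_y` are hypothetical labels for the layer sizes, and the small random weights with zero biases are an assumed initialization chosen only to produce arrays of the right shape.

```python
import numpy as np

# Layer sizes: input features, hidden units, output units
n_x, n_h, n_y = 3, 4, 1

# Assumed initialization: small random weights, zero biases
rng = np.random.default_rng(0)
W1 = rng.standard_normal((n_h, n_x)) * 0.01  # hidden layer weights
b1 = np.zeros((n_h, 1))                      # hidden layer biases
W2 = rng.standard_normal((n_y, n_h)) * 0.01  # output layer weights
b2 = np.zeros((n_y, 1))                      # output layer bias

for name, p in [("W1", W1), ("b1", b1), ("W2", W2), ("b2", b2)]:
    print(name, p.shape)
# W1 (4, 3)
# b1 (4, 1)
# W2 (1, 4)
# b2 (1, 1)
```

In each weight matrix, the number of rows equals the number of units in that layer and the number of columns equals the number of units in the layer that feeds it.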

# Computing a Neural Network's Output

Using the two-layer neural network presented above, we have to compute the values of `z` and `a` for each node (*e.g.*, $z^{[1]}_1$, $z^{[1]}_2$, $z^{[1]}_3$ and $z^{[1]}_4$). For the first node of layer one, we compute $z^{[1]}_1$ and $a^{[1]}_1$ as:

$$
z^{[1]}_1 = w^{[1]T}_1 x + b^{[1]}_1 \\
a^{[1]}_1 = \sigma(z^{[1]}_1)
$$

Extending the computation for all nodes of the hidden layer, we have:

$$
z^{[1]}_1 = w^{[1]T}_1 x + b^{[1]}_1,\ \ \ \ a^{[1]}_1 = \sigma(z^{[1]}_1) \\
z^{[1]}_2 = w^{[1]T}_2 x + b^{[1]}_2,\ \ \ \ a^{[1]}_2 = \sigma(z^{[1]}_2) \\
z^{[1]}_3 = w^{[1]T}_3 x + b^{[1]}_3,\ \ \ \ a^{[1]}_3 = \sigma(z^{[1]}_3) \\
z^{[1]}_4 = w^{[1]T}_4 x + b^{[1]}_4,\ \ \ \ a^{[1]}_4 = \sigma(z^{[1]}_4) \\
$$

Computing them one at a time (for example, in a loop) would be very inefficient, so we stack the row vectors $w^{[1]T}_i$ into a matrix and compute all four values at once:

$$
z^{[1]} = W^{[1]} x + b^{[1]} \\
\begin{bmatrix}
z^{[1]}_1 \\
z^{[1]}_2 \\
z^{[1]}_3 \\
z^{[1]}_4 \\
\end{bmatrix} = \begin{bmatrix}
- w^{[1]T}_1 - \\
- w^{[1]T}_2 - \\
- w^{[1]T}_3 - \\
- w^{[1]T}_4 - \\
\end{bmatrix} \begin{bmatrix}
x_1 \\
x_2 \\
x_3 \\
\end{bmatrix} + \begin{bmatrix}
b^{[1]}_1 \\
b^{[1]}_2 \\
b^{[1]}_3 \\
b^{[1]}_4 \\
\end{bmatrix} = \begin{bmatrix}
w^{[1]T}_1 x + b^{[1]}_1 \\
w^{[1]T}_2 x + b^{[1]}_2 \\
w^{[1]T}_3 x + b^{[1]}_3 \\
w^{[1]T}_4 x + b^{[1]}_4 \\
\end{bmatrix}
$$
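To make the equivalence concrete, the sketch below computes the hidden layer's $z$ values node by node and then with a single matrix product, and checks that the two results agree. The input values, the random parameters, and the seed are arbitrary assumptions; `W1` is taken to hold the row vectors $w^{[1]T}_i$, so it has shape (4, 3).

```python
import numpy as np

rng = np.random.default_rng(0)        # arbitrary seed for reproducibility
x = np.array([[0.1], [0.2], [0.3]])   # one example, shape (3, 1)
W1 = rng.standard_normal((4, 3))      # row i holds w_i^T
b1 = rng.standard_normal((4, 1))

# Node by node: z_i = w_i^T x + b_i
z_loop = np.zeros((4, 1))
for i in range(4):
    z_loop[i, 0] = W1[i, :] @ x[:, 0] + b1[i, 0]

# Vectorized: one matrix-vector product for the whole layer
z_vec = W1 @ x + b1

print(np.allclose(z_loop, z_vec))  # True
```

The vectorized form replaces the explicit Python loop with one call into optimized linear algebra code, which is where the efficiency gain comes from.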

And then, we compute $a^{[1]}$ as:

$$
a^{[1]} = \sigma(z^{[1]}) \\
\begin{bmatrix}
a^{[1]}_1 \\
a^{[1]}_2 \\
a^{[1]}_3 \\
a^{[1]}_4 \\
\end{bmatrix} = \sigma \left ( \begin{bmatrix}
z^{[1]}_1 \\
z^{[1]}_2 \\
z^{[1]}_3 \\
z^{[1]}_4
\end{bmatrix}
\right )
$$

Extending the computation to all layers of our network, we obtain the following equations, shown with the dimensions of each term:

$$
\begin{matrix}
z^{[1]} = W^{[1]}x + b^{[1]} & (4,1) = (4,3)(3,1) + (4,1) \\
a^{[1]} = \sigma(z^{[1]})    & (4,1) = (4,1) \\
z^{[2]} = W^{[2]}a^{[1]} + b^{[2]} & (1,1) = (1,4)(4,1) + (1,1) \\
a^{[2]} = \hat{y} = \sigma(z^{[2]})    & (1,1) = (1,1) \\
\end{matrix}
$$

A Python implementation of this computation is presented below.

In [6]:
import numpy as np

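def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

# Input: a single example with 3 features, shape (3, 1)
X = np.array([
    [0.1],
    [0.2],
    [0.3]
])

# Initialize parameters with the shapes given in the text:
# W1 is (4, 3), b1 is (4, 1), W2 is (1, 4), b2 is (1, 1)
W1 = np.random.randn(4, 3)
W2 = np.random.randn(1, 4)
b1 = np.ones((4, 1))
b2 = np.ones((1, 1))

# Forward pass
Z1 = np.dot(W1, X) + b1   # (4, 1)
a1 = sigmoid(Z1)          # (4, 1)
Z2 = np.dot(W2, a1) + b2  # (1, 1)
a2 = sigmoid(Z2)          # (1, 1)
yhat = a2
# The weights are random, so the printed value changes from run to run
print('Prediction: {}'.format(yhat))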

Prediction: [[0.32177144]]

