In [1]:
import numpy as np

%matplotlib inline

# Notations

\[1\]: First layer

\[2\]: Second layer

...

\[n\]: n-th layer

From the computation graph below, we can deduce the associated notation of each layer of our neural network. This is what we will dive into for this week's videos. 

![Neural Network Computation Graph](images/nn-graph-small.png)

# Neural Network Representation

We start by giving names to the parts of our neural network representation for easier reference.

1. Input layer: takes in the inputs directly
2. Hidden layer: layers/nodes between input and output layers
3. Output layer: the final activation before output $\hat{y}$

![Neural Network Representation](images/neural-network-representation-small.png)

An alternative notation to represent the inputs to our neural network is to use $a^{[0]}$ - the activations in the zero-th layer.

Subsequently, the hidden layer will output $a^{[1]}$ etc. The $i$-th node of the hidden layer will generate $a^{[1]}_i$.

Finally, our $\hat{y}$ can be denoted as $a^{[2]}$.

What we see above is known as a **2 layer neural network**. By convention we do not count the input layer when naming neural networks.

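The parameters of each layer are denoted $W^{[1]}$ and $b^{[1]}$, $W^{[2]}$ and $b^{[2]}$, etc. Here $W^{[1]}$ and $b^{[1]}$ have shapes `(4, 3)` and `(4, 1)` respectively, since the hidden layer has 4 nodes and there are 3 inputs.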
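As a rough sketch of these shapes, assume the network pictured above has 3 input features, 4 hidden nodes and a single output node. The output-layer shapes `(1, 4)` and `(1, 1)` are inferred from that assumption rather than stated above, and the names `W1`, `b1`, `W2`, `b2` and the small random initialization are only illustrative.

```python
import numpy as np

# Assumed layer sizes: 3 inputs, 4 hidden nodes, 1 output node
n_x, n_h, n_y = 3, 4, 1

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

# Layer 1 maps the 3 inputs to 4 hidden activations
W1 = rng.standard_normal((n_h, n_x)) * 0.01
b1 = np.zeros((n_h, 1))

# Layer 2 maps the 4 hidden activations to 1 output (inferred shapes)
W2 = rng.standard_normal((n_y, n_h)) * 0.01
b2 = np.zeros((n_y, 1))

print(W1.shape, b1.shape, W2.shape, b2.shape)
```

Each $W^{[l]}$ has one row per node in its own layer and one column per value coming in from the previous layer.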

# Computing a Neural Network's Output

Recall that each node in the neural network computes $w^T x + b$ followed by an activation function $\sigma$. Suppose that this node is now in hidden layer 1. The computation for the **first node** in hidden layer 1 would be

$$z_1^{[1]} = w_1^{[1]T}x + b_1^{[1]}$$

$$a_1^{[1]} = \sigma(z_1^{[1]})$$

Similarly, the second node would compute:

$$z_2^{[1]} = w_2^{[1]T}x + b_2^{[1]}$$

$$a_2^{[1]} = \sigma(z_2^{[1]})$$

The same is done for the 3rd and 4th nodes. We could do this in a neural network by running each neuron in the layer through a for loop. But as we learnt in previous videos, that is highly inefficient. So we try to **vectorize** this operation.
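For reference, the node-by-node approach might look like the sketch below. It assumes 3 input features and 4 hidden nodes, with random values standing in for real parameters. The variable names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(1)
x = rng.random((3, 1))    # one example with 3 features (assumed size)
W1 = rng.random((4, 3))   # row i holds the weights of node i
b1 = rng.random((4, 1))   # entry i holds the bias of node i

a1 = np.zeros((4, 1))
for i in range(4):
    w_i = W1[i, :].reshape(3, 1)           # column vector of weights for node i
    z_i = (w_i.T @ x).item() + b1[i, 0]    # z_i = w_i^T x + b_i
    a1[i, 0] = sigmoid(z_i)                # a_i = sigma(z_i)

print(a1.shape)
```

The loop touches one node at a time, which is exactly the pattern that vectorization removes.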

## Vectorization

First we stack our 4 $w^{[1]T}$ vectors together as rows to get a matrix $W^{[1]}$ of shape `(4, 3)`. Then we multiply this matrix by our $x$ and add the bias vector $b^{[1]}$ to it. The result is a vector whose $i$-th entry is:

$$w_i^{[1]T}x + b_i^{[1]}$$

![Vectorization](images/vectorize-nn-small.png)

We call this result $z^{[1]}$. To find the output values for the layer, we just have to apply an activation function to each of the values in $z^{[1]}$. In this case it will be the sigmoid function. We call this result $a^{[1]}$.

To calculate the output value $\hat{y}$, we perform similar operations on the output layer as we did for the hidden layer by taking in $a^{[1]}$ as input for the last output layer neuron. In order to compute the final result, we perform the following:

$$z^{[2]} = W^{[2]}a^{[1]} + b^{[2]}$$

Then we apply the sigmoid function to obtain $a^{[2]}$.
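A minimal sketch of this vectorized forward pass for a single example is shown below. It assumes the 3-4-1 layer sizes used so far and random parameter values. The names `W1`, `b1`, `W2`, `b2` are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(2)
x = rng.random((3, 1))                            # a single input example
W1, b1 = rng.random((4, 3)), rng.random((4, 1))   # hidden layer parameters
W2, b2 = rng.random((1, 4)), rng.random((1, 1))   # output layer parameters

z1 = W1 @ x + b1    # shape (4, 1): one entry per hidden node
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2   # shape (1, 1): the single output node
a2 = sigmoid(z2)    # this is y_hat

print(z1.shape, a1.shape, z2.shape, a2.shape)
```

The whole forward pass is four lines of computation, with no loop over nodes.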

# Vectorizing across multiple examples

The previous example of vectorization is only for one training example. What if we want to extend this to $m$ training examples? i.e. $x^{(1)}, x^{(2)}, ..., x^{(m)}$ to produce outputs $\hat{y}^{(1)}, \hat{y}^{(2)}, ..., \hat{y}^{(m)}$.

We introduce some new notation. $a^{[1](i)}$ will be the output from layer $1$ for training example $i$.

Our algorithm will be as follows:

![Vectorize Across Examples](./images/vectorize-across-examples-small.png)

Recall that our training examples are stacked up **column-wise**, $X \in \mathbb{R}^{n_x \times m}$.

Therefore to compute our results in a vectorized manner,
$$Z^{[1]} = W^{[1]}X + b^{[1]}$$
$$A^{[1]} = \sigma (Z^{[1]})$$
$$Z^{[2]} = W^{[2]}A^{[1]} + b^{[2]}$$
$$A^{[2]} = \sigma (Z^{[2]})$$

We can view $Z^{[1]}$ as stacking $Z^{[1](1)}, Z^{[1](2)}, ..., Z^{[1](m)}$ **column-wise**. The same applies for $Z^{[2]}$, $A^{[1]}$ and $A^{[2]}$.

Moving horizontally across one of these matrices, each column corresponds to a single training example. Moving vertically, each row corresponds to a single node in the layer.
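The sketch below illustrates this layout with $m = 5$ examples and the same assumed 3-4-1 layer sizes. It relies on NumPy broadcasting to add the `(4, 1)` bias to every column, and it checks each column of the result against a forward pass on that example alone. All names and sizes are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(3)
n_x, n_h, m = 3, 4, 5                 # assumed sizes for illustration
X = rng.random((n_x, m))              # examples stacked column-wise
W1, b1 = rng.random((n_h, n_x)), rng.random((n_h, 1))
W2, b2 = rng.random((1, n_h)), rng.random((1, 1))

# Vectorized forward pass over all m examples at once
Z1 = W1 @ X + b1      # b1 is broadcast across the m columns
A1 = sigmoid(Z1)
Z2 = W2 @ A1 + b2
A2 = sigmoid(Z2)

# Column i of A2 should equal the output for example i processed alone
for i in range(m):
    x_i = X[:, i].reshape(n_x, 1)
    a1_i = sigmoid(W1 @ x_i + b1)
    a2_i = sigmoid(W2 @ a1_i + b2)
    assert np.allclose(A2[:, i:i + 1], a2_i)

print(A1.shape, A2.shape)
```

Here `A1` has one row per hidden node and one column per example, while `A2` has a single row because the output layer has one node.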

# Justification of vectorized implementation

Why do the previous two methods of vectorization work? To gain some intuition and simplify our example, let's suppose that all our $b$ vectors are zero vectors.

To calculate our $Z^{[1]}$ matrix, what we have is:
$$Z^{[1](1)} = W^{[1]} x^{(1)}$$
$$Z^{[1](2)} = W^{[1]} x^{(2)}$$
$$Z^{[1](3)} = W^{[1]} x^{(3)}$$
... and so on.

We know that $W^{[1]} \in \mathbb{R}^{4 \times n_x}$ (assuming 4 nodes in the hidden layer), and $x^{(i)} \in \mathbb{R}^{n_x \times 1}$. Therefore each $Z^{[1](i)} \in \mathbb{R}^{4 \times 1}$.

By stacking them together **column-wise**, we get a matrix $Z^{[1]} \in \mathbb{R}^{4 \times m}$.

But instead of stacking them together **after** we multiply $W^{[1]}$ by each $x^{(i)}$, we can perform a matrix multiplication between $W^{[1]}$ and $X$, to get our $Z^{[1]}$ matrix in a single matrix multiply operation.
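Written out, this step relies on the fact that a matrix product acts on the right-hand matrix one column at a time (still assuming the $b$ vectors are zero):

$$W^{[1]} X = W^{[1]} \begin{bmatrix} x^{(1)} & x^{(2)} & \cdots & x^{(m)} \end{bmatrix} = \begin{bmatrix} W^{[1]} x^{(1)} & W^{[1]} x^{(2)} & \cdots & W^{[1]} x^{(m)} \end{bmatrix} = \begin{bmatrix} Z^{[1](1)} & Z^{[1](2)} & \cdots & Z^{[1](m)} \end{bmatrix} = Z^{[1]}$$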

The same logic can be used to justify the vectorization of the other layers. Our algorithm can now be expressed as:

![Multiple Examples Vectorized](./images/multiple-examples-vectorized-small.png)

# Code samples of the concept above

To make what I've learnt concrete, I will code up some examples of vectorizing forward propagation across training examples.

Suppose that we have $m = 10$ inputs, each with $n_x = 5$ features.

In [2]:
m = 10
n_x = 5

# Our input matrix, 10 training examples and 5 features
X = np.random.rand(n_x, m)
print(X.shape)

(5, 10)


Suppose our first hidden layer in the neural network has $4$ neurons. 

In [3]:
neurons_1 = 4

W_1 = np.random.rand(neurons_1, n_x)
print(W_1.shape)

(4, 5)


Now let's perform the forward propagation computation **one training example at a time** in an explicit for loop.

In [4]:
# Our z matrix should have the same number of rows as the number of neurons
# and the same number of columns as training examples
Z = np.zeros((neurons_1, m))

for i in range(m):
    # i-th training example
    x_i = X[:,i].reshape(n_x, 1)
    
    # Good assertion practice as suggested by Andrew
    assert x_i.shape == (n_x, 1)
    
    Z[:,i] = np.dot(W_1, x_i).reshape(neurons_1)

print(Z.shape)
print(Z)

(4, 10)
[[ 1.05428913  0.91206421  0.72065878  0.75871072  0.99171047  1.25936009
   0.84978966  1.32790531  0.80486291  1.30610292]
 [ 1.36851992  1.29523428  1.34127692  1.34803191  1.2420436   1.52586708
   1.28892585  1.45701761  1.56523266  2.13654698]
 [ 1.24108459  1.14252411  1.33359829  1.2656931   0.98641853  1.42200879
   1.16020329  1.19209331  1.45478729  2.00669077]
 [ 2.14739187  1.99111566  1.7567464   1.82524098  1.78567228  2.07381764
   1.97287862  2.18875457  2.16366318  2.76867655]]


But we can vectorize this and get the exact same result in one single operation.

In [5]:
Z_vectorized = np.dot(W_1, X)
print(Z_vectorized.shape)
print(Z_vectorized)

(4, 10)
[[ 1.05428913  0.91206421  0.72065878  0.75871072  0.99171047  1.25936009
   0.84978966  1.32790531  0.80486291  1.30610292]
 [ 1.36851992  1.29523428  1.34127692  1.34803191  1.2420436   1.52586708
   1.28892585  1.45701761  1.56523266  2.13654698]
 [ 1.24108459  1.14252411  1.33359829  1.2656931   0.98641853  1.42200879
   1.16020329  1.19209331  1.45478729  2.00669077]
 [ 2.14739187  1.99111566  1.7567464   1.82524098  1.78567228  2.07381764
   1.97287862  2.18875457  2.16366318  2.76867655]]


We can then apply the sigmoid activation function to the output.

In [6]:
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

A = sigmoid(Z)
print(A)

[[ 0.74159768  0.71342238  0.67275207  0.68107375  0.72942564  0.77891593
   0.70052302  0.79049394  0.69101374  0.7868603 ]
 [ 0.79714092  0.78503183  0.79269985  0.79380768  0.77591953  0.82140081
   0.78396532  0.8110761   0.82710292  0.89440493]
 [ 0.77575275  0.75814277  0.79143521  0.78000459  0.72837993  0.80565314
   0.76136965  0.76711524  0.81073411  0.88149778]
 [ 0.89542481  0.87986112  0.8528017   0.86119382  0.85639587  0.88833223
   0.87791997  0.89923511  0.89693866  0.94095951]]


The above is an example showing that these two operations give the same result. The detailed justification comes from examining the matrix multiplication carefully, as Andrew explained in the video.
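Comparing printed arrays by eye is error-prone, so a programmatic check is one way to confirm that the loop and the single matrix product agree. The sketch below rebuilds the same setup with its own random data and compares the two results with `np.allclose`. The seed and the name `Z_loop` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)   # illustrative seed for repeatability
m, n_x, neurons_1 = 10, 5, 4

X = rng.random((n_x, m))
W_1 = rng.random((neurons_1, n_x))

# Loop version: one training example (column of X) at a time
Z_loop = np.zeros((neurons_1, m))
for i in range(m):
    x_i = X[:, i].reshape(n_x, 1)
    Z_loop[:, i] = np.dot(W_1, x_i).reshape(neurons_1)

# Vectorized version: a single matrix multiplication
Z_vectorized = np.dot(W_1, X)

# allclose tolerates tiny floating-point differences between the two paths
same = np.allclose(Z_loop, Z_vectorized)
print(same)
```

`np.allclose` is used instead of `==` because the two code paths may accumulate floating-point rounding differently.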