In [2]:
import numpy as np
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense



## Model Representations for a Single Training Example

In a neural network, the weight matrix $W^{\ell}$ for a layer $\ell$ has the following dimensions:

\begin{align*}
\underbrace{W^{\ell}}_{n^{\ell} \times n^{\ell - 1}}
\end{align*}

where 

* $n^{\ell}$ is the number of neurons or units in the current layer $\ell$ 
* $n^{\ell - 1}$ is the number of neurons or units in the previous layer $\ell - 1$, whose outputs are the inputs to the current layer $\ell$ (for $\ell = 1$, this is the input layer with $n^{0}$ features)

The bias vector $\vec{b}^{\ell}$ for a layer $\ell$ has the following dimensions:

\begin{align*}
\underbrace{\vec{b}^{\ell}}_{n^{\ell} \times 1}
\end{align*}

Then, for any input training example $\vec{a}^{0}$ (a vector of the training data with $n^{0}$ features) or input activation vector $\vec{a}^{\ell - 1}$ with $n^{\ell - 1}$ neurons or units, we have the following matrix equation for the vector $\vec{z}^{\ell}$ computed by layer $\ell$ before its activation function is applied:

\begin{align*}
\underbrace{\vec{z}^{\ell}}_{n^{\ell} \times 1} &= \underbrace{W^{\ell}}_{n^{\ell} \times n^{\ell - 1}} \hspace{2mm} \underbrace{\vec{a}^{\ell - 1}}_{n^{\ell - 1} \times 1} + \underbrace{\vec{b}^{\ell}}_{n^{\ell} \times 1} \\
\\
\underbrace{\begin{bmatrix}
z^{\ell}_{1} \\
\\
z^{\ell}_{2} \\
\\
\vdots \\
\\
z^{\ell}_{n^{\ell}} \end{bmatrix}}_{n^{\ell} \times 1}&=\underbrace{\begin{bmatrix} w^{\ell}_{1,1} & w^{\ell}_{1, 2} & w^{\ell}_{1, 3} & \dots & w^{\ell}_{1, n^{\ell - 1}} \\ 
\\
w^{\ell}_{2,1} & w^{\ell}_{2, 2} & w^{\ell}_{2, 3} & \dots & w^{\ell}_{2, n^{\ell - 1}} \\ 
\\
\vdots & \vdots & \vdots & \ddots &  \vdots \\
\\
w^{\ell}_{n^{\ell},1} & w^{\ell}_{n^{\ell}, 2} & w^{\ell}_{n^{\ell}, 3} & \dots & w^{\ell}_{n^{\ell}, n^{\ell - 1}} \end{bmatrix}}_{n^{\ell} \times n^{\ell - 1}} 
\underbrace{\begin{bmatrix}
a^{\ell - 1}_{1} \\
\\
a^{\ell - 1}_{2} \\
\\
\vdots \\
\\
a^{\ell - 1}_{n^{\ell - 1}} \end{bmatrix}}_{n^{\ell - 1} \times 1} + 
\underbrace{\begin{bmatrix}
b^{\ell}_{1} \\
\\
b^{\ell}_{2} \\
\\
\vdots \\
\\
b^{\ell}_{n^{\ell}} \end{bmatrix}}_{n^{\ell} \times 1}
\end{align*}

where 

* The $i \in 1, 2, 3, ..., n^{\ell}$ index of the elements $w^{\ell}_{i,j}$ indexes the $i^{th}$ neuron or unit of layer $\ell$

* The $j \in 1, 2, 3, ..., n^{\ell - 1}$ index of the elements $w^{\ell}_{i,j}$ indexes the $j^{th}$ activation value of the vector $\vec{a}^{\ell - 1}$ outputted by layer $\ell - 1$

Then, an activation function $g$ (for example linear, ReLU, sigmoid, or tanh) can be applied element-wise to the vector $\vec{z}^{\ell}$ above to obtain the final activation vector $\vec{a}^{\ell} = g(\vec{z}^{\ell})$.
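
As a small numeric illustration of the equation above, the following sketch computes $\vec{z}^{\ell}$ and $\vec{a}^{\ell}$ for a single example with NumPy. The layer sizes, the weight and bias values, and the choice of ReLU are all made up for illustration; they do not come from any trained model.

```python
import numpy as np

# Hypothetical layer: 3 inputs (n^{l-1} = 3) feeding 2 neurons (n^{l} = 2).
W = np.array([[0.5, -1.0,  2.0],
              [1.5,  0.0, -0.5]])   # shape (n^l, n^{l-1}) = (2, 3); row i holds neuron i's weights
b = np.array([[0.1],
              [-0.2]])              # shape (n^l, 1)
a_prev = np.array([[1.0],
                   [2.0],
                   [3.0]])          # shape (n^{l-1}, 1)

z = W @ a_prev + b                  # shape (n^l, 1)
a = np.maximum(z, 0.0)              # ReLU applied element-wise

print(z.shape)                      # (2, 1)
print(a.shape)                      # (2, 1)
```

Each row of `W` is multiplied with `a_prev` to give one entry of `z`, which matches the role of the index $i$ described above.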

## Vectorized Model Representations Using Matrices

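For a vectorized implementation of the neural network using matrix multiplication, we now have three matrices for the first layer: the training data matrix $\underbrace{A^{0}}_{n^{0} \times m}=\underbrace{X}_{n^{0} \times m}$, the weight matrix $W^{1}$, and the bias matrix $B^{1}$, which is the bias vector $\vec{b}^{1}$ broadcast across all $m$ training examples.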

\begin{align*}
\underbrace{Z^{1}}_{n^{1} \times m}=\underbrace{W^{1}}_{n^{1} \times n^{0}} \hspace{2mm} \underbrace{X}_{n^{0} \times m} + \underbrace{B^{1}}_{n^{1} \times m}
\end{align*}

where 

* $n^{1}$ is the number of neurons or units in the first hidden layer

* $Z^{1}$ is the matrix of values outputted by the first layer (after receiving the input training data matrix $A^{0}$ as inputs)

* $A^{0}=X$ is the training data matrix **stacked horizontally** with $n^{0}$ rows (representing the features or variables) and $m$ columns (representing the training examples); in other words, each of the $m$ column vectors is an $n^{0}$-dimensional vector representing a single training example with $n^{0}$ features

    - $A^{0}=X=\begin{bmatrix} a^{0}_{1,1} & a^{0}_{1, 2} & a^{0}_{1, 3} & \dots & a^{0}_{1, m} \\ 
        \\
        a^{0}_{2,1} & a^{0}_{2, 2} & a^{0}_{2, 3} & \dots & a^{0}_{2, m} \\ 
        \\
        \vdots & \vdots & \vdots & \ddots &  \vdots \\
        \\
        a^{0}_{n^{0},1} & a^{0}_{n^{0}, 2} & a^{0}_{n^{0}, 3} & \dots & a^{0}_{n^{0}, m} 
        \end{bmatrix}$

    - The $i \in 1, 2, 3, ..., n^{0}$ index of the elements $a^{0}_{i,j}$ indexes the $i^{th}$ **feature** of the training example among $n^{0}$ features

    - The $j \in 1, 2, 3, ..., m$ index of the elements $a^{0}_{i,j}$ indexes the $j^{th}$ training example among $m$ training examples



* $B^{1}$ has column vectors that are duplicated for each of the $m$ training examples as follows:

    - $\underbrace{B^{1}}_{n^{1} \times m}=
        \begin{bmatrix}
        \vert & \vert &  & \vert \\
        \vec{b}^{1}_{1} & \vec{b}^{1}_{2} & \dots & \vec{b}^{1}_{m} \\
        \vert & \vert &  & \vert 
        \end{bmatrix}$

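    - $\vec{b}^{1}_{1}=\vec{b}^{1}_{2}=...=\vec{b}^{1}_{m}$ are identical copies of the $(n^{1} \times 1)$ bias column vector $\vec{b}^{1}$, broadcast once for each of the $m$ training examples (here the subscript indexes the training example, not an element of the vector)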
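
The duplicated columns of $B^{1}$ can be sketched in NumPy as follows. The sizes and random values are assumptions chosen only for illustration. The sketch builds $B^{1}$ explicitly with `np.tile` and compares the result with NumPy's implicit broadcasting of a single $(n^{1} \times 1)$ bias vector.

```python
import numpy as np

rng = np.random.default_rng(0)
n0, n1, m = 4, 3, 5                      # illustrative sizes, not taken from the notebook

X = rng.normal(size=(n0, m))             # one training example per column
W1 = rng.normal(size=(n1, n0))
b1 = rng.normal(size=(n1, 1))            # a single bias column vector

B1 = np.tile(b1, (1, m))                 # explicit (n1, m) matrix with m identical columns
Z1_explicit = W1 @ X + B1
Z1_broadcast = W1 @ X + b1               # NumPy broadcasts b1 across the m columns

print(B1.shape)                                   # (3, 5)
print(np.allclose(Z1_explicit, Z1_broadcast))     # True
```

In practice the explicit matrix is rarely materialized, since broadcasting gives the same result without storing $m$ copies of the bias vector.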

In general, for any matrix $Z^{\ell}$ outputted by layer $\ell$, we can express the computations using the following matrix equation:

\begin{align*}
\underbrace{Z^{\ell}}_{n^{\ell} \times m}&=\underbrace{W^{\ell}}_{n^{\ell} \times n^{\ell - 1}} \hspace{2mm} \underbrace{A^{\ell - 1}}_{n^{\ell - 1} \times m} + \underbrace{B^{\ell}}_{n^{\ell} \times m} \\
\\
\underbrace{\begin{bmatrix} z^{\ell}_{1,1} & z^{\ell}_{1, 2} & z^{\ell}_{1, 3} & \dots & z^{\ell}_{1, m} \\ 
\\
z^{\ell}_{2,1} & z^{\ell}_{2, 2} & z^{\ell}_{2, 3} & \dots & z^{\ell}_{2, m} \\ 
\\
\vdots & \vdots & \vdots & \ddots &  \vdots \\
\\
z^{\ell}_{n^{\ell},1} & z^{\ell}_{n^{\ell}, 2} & z^{\ell}_{n^{\ell}, 3} & \dots & z^{\ell}_{n^{\ell}, m} \end{bmatrix}}_{n^{\ell}  \times m}
&=\underbrace{\begin{bmatrix} w^{\ell}_{1,1} & w^{\ell}_{1, 2} & w^{\ell}_{1, 3} & \dots & w^{\ell}_{1, n^{\ell - 1}} \\ 
\\
w^{\ell}_{2,1} & w^{\ell}_{2, 2} & w^{\ell}_{2, 3} & \dots & w^{\ell}_{2, n^{\ell - 1}} \\ 
\\
\vdots & \vdots & \vdots & \ddots &  \vdots \\
\\
w^{\ell}_{n^{\ell},1} & w^{\ell}_{n^{\ell}, 2} & w^{\ell}_{n^{\ell}, 3} & \dots & w^{\ell}_{n^{\ell}, n^{\ell - 1}} \end{bmatrix}}_{n^{\ell} \times n^{\ell - 1}} 
\underbrace{\begin{bmatrix} a^{\ell - 1}_{1,1} & a^{\ell - 1}_{1, 2} & a^{\ell - 1}_{1, 3} & \dots & a^{\ell - 1}_{1, m} \\ 
\\
a^{\ell - 1}_{2,1} & a^{\ell - 1}_{2, 2} & a^{\ell - 1}_{2, 3} & \dots & a^{\ell - 1}_{2, m} \\ 
\\
\vdots & \vdots & \vdots & \ddots &  \vdots \\
\\
a^{\ell - 1}_{n^{\ell - 1},1} & a^{\ell - 1}_{n^{\ell - 1}, 2} & a^{\ell - 1}_{n^{\ell - 1}, 3} & \dots & a^{\ell - 1}_{n^{\ell - 1}, m} \end{bmatrix}}_{n^{\ell - 1}  \times m} + 
\underbrace{\begin{bmatrix} b^{\ell}_{1,1} & b^{\ell}_{1, 2} & b^{\ell}_{1, 3} & \dots & b^{\ell}_{1, m} \\ 
\\
b^{\ell}_{2,1} & b^{\ell}_{2, 2} & b^{\ell}_{2, 3} & \dots & b^{\ell}_{2, m} \\ 
\\
\vdots & \vdots & \vdots & \ddots &  \vdots \\
\\
b^{\ell}_{n^{\ell},1} & b^{\ell}_{n^{\ell}, 2} & b^{\ell}_{n^{\ell}, 3} & \dots & b^{\ell}_{n^{\ell}, m} \end{bmatrix}}_{n^{\ell}  \times m}
\end{align*}

Finally, we can again apply an activation function $g$ to the matrix $Z^{\ell}$ element-wise to obtain the final activation matrix $A^{\ell}$ for all $m$ training examples. The columns of the matrix $A^{\ell}$ are therefore as follows:

* the first column represents the activation vector outputted by layer $\ell$ with $n^{\ell}$ neurons or units for the **first** training example 

\begin{align*}
\underbrace{\begin{bmatrix}
a^{\ell}_{1, 1} \\
\\
a^{\ell}_{2, 1} \\
\\
\vdots \\
\\
a^{\ell}_{n^{\ell}, 1} \end{bmatrix}}_{n^{\ell} \times 1}
\end{align*}

* the second column represents the activation vector outputted by layer $\ell$ with $n^{\ell}$ neurons or units for the **second** training example 

\begin{align*}
\underbrace{\begin{bmatrix}
a^{\ell}_{1, 2} \\
\\
a^{\ell}_{2, 2} \\
\\
\vdots \\
\\
a^{\ell}_{n^{\ell}, 2} \end{bmatrix}}_{n^{\ell} \times 1}
\end{align*}

$\hspace{11cm} \vdots$

* the $\mathbf{m}^{\mathbf{th}}$ column represents the activation vector outputted by layer $\ell$ with $n^{\ell}$ neurons or units for the $\mathbf{m}^{\mathbf{th}}$ training example 

\begin{align*}
\underbrace{\begin{bmatrix}
a^{\ell}_{1, m} \\
\\
a^{\ell}_{2, m} \\
\\
\vdots \\
\\
a^{\ell}_{n^{\ell}, m} \end{bmatrix}}_{n^{\ell} \times 1}
\end{align*}
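
To check the column interpretation above, the following sketch compares the vectorized computation with a loop over individual training examples. The sizes, the random values, and the use of a sigmoid activation are assumptions made only for this illustration.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n_prev, n_curr, m = 4, 3, 6                  # illustrative sizes

A_prev = rng.normal(size=(n_prev, m))        # activations of layer l-1, one example per column
W = rng.normal(size=(n_curr, n_prev))
b = rng.normal(size=(n_curr, 1))

# Vectorized: all m examples at once.
A = sigmoid(W @ A_prev + b)                  # shape (n_curr, m)

# Per example: column j of A should equal the single-example computation.
columns_match = all(
    np.allclose(A[:, [j]], sigmoid(W @ A_prev[:, [j]] + b))
    for j in range(m)
)

print(A.shape)          # (3, 6)
print(columns_match)    # True
```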

---

**Note** that this is not the only model representation. Some textbooks and tutorials present the weight matrix as follows:

\begin{align*}
\underbrace{W^{\ell}}_{n^{\ell - 1} \times n^{\ell}}
\end{align*}

This is the transpose of our representation at the very top. But the two model representations are equivalent as long as we multiply the relevant matrices in the correct order. Had we used this representation of the weight matrix, then the other matrices $A^{\ell}$, $A^{0}=X$, and $B^{\ell}$ would have had different dimensions as well. One key difference to note is as follows:

>In this representation, each column of the weight matrix $\underbrace{W^{\ell}}_{n^{\ell - 1} \times n^{\ell}}$ represents a neuron or unit of the layer; in our representation of the model, the rows of the matrix $\underbrace{W^{\ell}}_{n^{\ell} \times n^{\ell - 1}}$ represent the neurons or units of the layer. 

This second representation of the weight matrix also matches how we implemented the vectorized version of the neural network in the `neural_network_vectorized.ipynb` notebook.
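
The equivalence of the two representations can be sketched numerically. In the sketch below, all sizes and values are assumed for illustration. With the second representation the training examples are stacked as rows, so the product becomes $(A^{\ell - 1})^{T} (W^{\ell})^{T}$ plus a row-wise broadcast bias, and the result is the transpose of our $Z^{\ell}$.

```python
import numpy as np

rng = np.random.default_rng(2)
n_prev, n_curr, m = 4, 3, 5                  # illustrative sizes

# Our representation: neurons are rows of W, examples are columns of A.
W_rows = rng.normal(size=(n_curr, n_prev))
A_prev = rng.normal(size=(n_prev, m))
b = rng.normal(size=(n_curr, 1))
Z_cols = W_rows @ A_prev + b                 # shape (n_curr, m)

# Alternative representation: neurons are columns of W, examples are rows of A.
W_cols = W_rows.T                            # shape (n_prev, n_curr)
Z_rows = A_prev.T @ W_cols + b.T             # shape (m, n_curr); b.T broadcasts over the m rows

print(Z_cols.shape, Z_rows.shape)            # (3, 5) (5, 3)
print(np.allclose(Z_cols.T, Z_rows))         # True
```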

## TensorFlow Model Representation

The TensorFlow model representation of the weight matrix $W$ is as follows:

* The weight matrix $W^{\ell}$ we defined above has dimensions $n^{\ell} \times n^{\ell - 1}$

* The weight matrix for layer $\ell$ in TensorFlow has dimensions $n^{\ell - 1} \times n^{\ell}$

The bias vector $\vec{b}^{\ell}$ still has $n^{\ell}$ elements, but TensorFlow represents it as a 1-d array with shape $(n^{\ell}, )$ rather than a 2-d array with shape $(n^{\ell}, 1)$. Suppose we have the following neural network:

<center> <img src="images/C2_W1_Assign1.PNG" width="450" height="350"> </center>

In [3]:
model = Sequential(
    [               
        Input(shape=(400,)),    # Each of the m training examples is a vector of n^(0) = 400 features
        Dense(units=25, activation="relu", use_bias=True, name='layer_1'), # 25 neurons in layer 1 so n^(1) = 25
        Dense(units=15, activation="relu", use_bias=True, name='layer_2'), # 15 neurons in layer 2 so n^(2) = 15
        Dense(units=1, activation="sigmoid", use_bias=True, name='output_layer') # 1 neuron in output layer so n^(3) = 1
    ], name = "my_model" 
)     

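Because the model starts with an `Input` layer, which instantiates a `keras.engine.keras_tensor.KerasTensor` of known size, Keras can build every layer immediately, so we can view the shapes of the weight matrices and bias vectors for each layer in our network. Note that we have not trained the model yet, so the parameters hold only their initial values: by default, a `Dense` layer initializes its weight matrix randomly and its bias vector to zeros.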

In [4]:
model.summary()

Model: "my_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 layer_1 (Dense)             (None, 25)                10025     
                                                                 
 layer_2 (Dense)             (None, 15)                390       
                                                                 
 output_layer (Dense)        (None, 1)                 16        
                                                                 
Total params: 10,431
Trainable params: 10,431
Non-trainable params: 0
_________________________________________________________________
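
The `Param #` column in the summary above can be reproduced by hand: a dense layer with $n^{\ell - 1}$ inputs and $n^{\ell}$ units has $n^{\ell - 1} \times n^{\ell}$ weights plus $n^{\ell}$ biases. A minimal sketch of this arithmetic, where the list name `layer_sizes` is introduced only for this illustration:

```python
# Layer sizes n^0, n^1, n^2, n^3 of the model defined above.
layer_sizes = [400, 25, 15, 1]

# Each dense layer has n_in * n_out weights plus n_out biases.
param_counts = [
    n_in * n_out + n_out
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]

print(param_counts)        # [10025, 390, 16]
print(sum(param_counts))   # 10431
```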


The weight matrices and bias vectors for these three layers have the following shapes:

In [5]:
[layer1, layer2, layer3] = model.layers

W1, b1 = layer1.get_weights()
W2, b2 = layer2.get_weights()
W3, b3 = layer3.get_weights()
print(f"W^1 shape = {W1.shape}, b^1 shape = {b1.shape}")
print(f"W^2 shape = {W2.shape}, b^2 shape = {b2.shape}")
print(f"W^3 shape = {W3.shape}, b^3 shape = {b3.shape}")

W^1 shape = (400, 25), b^1 shape = (25,)
W^2 shape = (25, 15), b^2 shape = (15,)
W^3 shape = (15, 1), b^3 shape = (1,)


The shapes are as expected:

* The weight matrix $W^{1}$ in TensorFlow has shape $n^{0} \times n^{1} = 400 \times 25$ and $\vec{b}^{1}$ has shape $(n^{1},)=(25,)$

* The weight matrix $W^{2}$ in TensorFlow has shape $n^{1} \times n^{2} = 25 \times 15$ and $\vec{b}^{2}$ has shape $(n^{2},)=(15,)$

* The weight matrix $W^{3}$ in TensorFlow has shape $n^{2} \times n^{3} = 15 \times 1$ and $\vec{b}^{3}$ has shape $(n^{3},)=(1,)$
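
With these shapes, a forward pass multiplies the inputs on the left, with one training example per row, since a Keras `Dense` layer computes `activation(dot(input, kernel) + bias)`. The following NumPy sketch mimics that computation for the three layers. The weights here are random stand-ins rather than the values stored in `model`, and the batch size `m = 8` and the helper names are assumptions made for this illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
m = 8                                          # hypothetical number of examples
X = rng.normal(size=(m, 400))                  # one training example per ROW

# Stand-in parameters with the same shapes as the TensorFlow layers above.
shapes = [(400, 25), (25, 15), (15, 1)]
params = [(0.01 * rng.normal(size=s), np.zeros(s[1])) for s in shapes]
activations = [relu, relu, sigmoid]

A = X
for (W, b), g in zip(params, activations):
    # W has shape (n_prev, n_curr); b has shape (n_curr,) and broadcasts over the m rows.
    A = g(A @ W + b)

print(A.shape)                                 # (8, 1)
```

Transposing each `W` and stacking the examples as columns instead would recover the $n^{\ell} \times n^{\ell - 1}$ representation used in the earlier sections.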