# **Module 4: Deep Neural Network**

#### 1. Deep neural network representation

##### (1) Deep *L*-layer neural network

![layers](resource%20database%20for%20MD%20notes/Week4/comparing_DL_layers.png)
*<p style="color:grey" align="center">comparing neural networks with different numbers of layers</p>*
- **shallow vs. deep**: determined by the *number of layers* (more hidden layers means a deeper network)


##### (2) Matrix dimensions of deep neural network


![4layer NN](resource%20database%20for%20MD%20notes/Week4/4layer_NN.png)
*<p style="color:grey" align="center">a 4-layer neural network</p>*
- **Number of layers**: $L = 4$
- **Number of nodes**
    - The input ($0^{th}$) layer (input *features*): $n^{[0]} = n_x = 3$
    - Hidden layers:
        - The $1^{st}$ layer: $n^{[1]} = 5$
        - The $2^{nd}$ layer: $n^{[2]} = 5$
        - The $3^{rd}$ layer: $n^{[3]} = 3$
    - The output ($4^{th}$, i.e., $L$) layer: $n^{[4]} = n^{[L]} = 1$
- **Activation function**
    - $\textbf{a}^{[l]} = g^{[l]}(\textbf{z}^{[l]}), \textbf{z}^{[l]} = \textbf{w}^{[l]}\textbf{a}^{[l-1]}+\textbf{b}^{[l]}$
    - **Note**: $\textbf{a}^{[0]} = x, \textbf{a}^{[L]} = \hat{y}$
- **Shapes of activations, gradients, and parameters**
    - For a *single* sample:
        - $\textbf{z}^{[l]}$.shape =$\textbf{a}^{[l]}$.shape = $(n^{[l]},1)$
        - $\mathrm{d}\textbf{z}^{[l]}$.shape =$\mathrm{d}\textbf{a}^{[l]}$.shape = $(n^{[l]},1)$
        - $\textbf{w}^{[l]}$.shape = $(n^{[l]},n^{[l-1]})$
        - $\textbf{b}^{[l]}$.shape = $(n^{[l]},1)$
    - For *m* samples (vectorized):
        - $\textbf{Z}^{[l]}$.shape =$\textbf{A}^{[l]}$.shape = $(n^{[l]},m)$
        - $\mathrm{d}\textbf{Z}^{[l]}$.shape =$\mathrm{d}\textbf{A}^{[l]}$.shape = $(n^{[l]},m)$
        - $\textbf{w}^{[l]}$.shape = $(n^{[l]},n^{[l-1]})$
        - $\textbf{b}^{[l]}$.shape = $(n^{[l]},1)$, broadcast to $(n^{[l]},m)$ when added to $\textbf{w}^{[l]}\textbf{A}^{[l-1]}$
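
The shape rules above can be checked with a short NumPy sketch. The helper name `initialize_parameters`, the `0.01` scaling, and the dictionary keys `W1`, `b1`, ... are illustrative assumptions, not part of the notes; only the layer sizes come from the 4-layer example.

```python
import numpy as np

def initialize_parameters(layer_dims, seed=0):
    """Return {"W1", "b1", ..., "WL", "bL"} for layer sizes [n0, n1, ..., nL]."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        # W[l] maps n[l-1] activations to n[l] units; small random values break symmetry.
        params[f"W{l}"] = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.01
        # b[l] is stored as a column and broadcast across the m samples on addition.
        params[f"b{l}"] = np.zeros((layer_dims[l], 1))
    return params

layer_dims = [3, 5, 5, 3, 1]  # n[0..4] of the 4-layer network above
params = initialize_parameters(layer_dims)
shapes = {name: value.shape for name, value in params.items()}
print(shapes)
```

Each `W{l}` should come out as $(n^{[l]},n^{[l-1]})$ and each `b{l}` as $(n^{[l]},1)$, independent of the number of samples *m*.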


##### (3) Rationale of using deep neural network

- **Intuition about deep representation**
![Intuition of DL](resource%20database%20for%20MD%20notes/Week4/intuition_DL.png)
*<p style="color:grey" align="center">Consider a neural network for identifying figures</p>*
    - Early *hidden layers*: identify *low-level* features (e.g., direction of the edges)
    - Middle *hidden layers*: compose *low-level* features and identify *mid-level* features (e.g., basic units of the face like eyes, chins, etc.)
    - Late *hidden layers*: compose *mid-level* features and identify *high-level* features (e.g., different types of faces)
- **Circuit theory**
    - *Informal definition*: there are functions you can compute with a small but deep *l*-layer (<u>fewer hidden units, more layers</u>) neural network that shallower networks (<u>limited layers</u>) require exponentially more hidden units to compute
    - *Elaboration*:
        - To compute the *XOR* (parity) of *n* inputs $x_1, x_2, ..., x_n$:
            - Using multiple *hidden layers* (a tree of pairwise XORs): the number of layers is $O(\log n)$, and the widest layer (*layer* 1) has $n/2$ XOR units
            - Using a *single hidden layer*: the number of hidden units required is on the order of $2^{n-1}$
        - When $n$ is very large, adding *layers* is far cheaper than adding *hidden units*
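
The depth-versus-width trade-off for XOR can be illustrated with a small sketch. It counts levels of pairwise XOR operations rather than building an actual neural network (each XOR would itself need a small sub-network), so the numbers are only indicative. The function name `xor_tree` and the sample input are assumptions made for the example.

```python
from itertools import product

def xor_tree(bits):
    """XOR n bits with a balanced tree of pairwise XORs; return (result, depth)."""
    level, depth = list(bits), 0
    while len(level) > 1:
        # Pair up neighbours; an unpaired last element is carried to the next level.
        nxt = [level[i] ^ level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])
        level, depth = nxt, depth + 1
    return level[0], depth

n = 8
result, depth = xor_tree([1, 0, 1, 1, 0, 0, 1, 0])
# Input patterns with odd parity: a single hidden layer has to single these out.
odd_patterns = sum(1 for bits in product([0, 1], repeat=n) if sum(bits) % 2)
print(result, depth, odd_patterns)
```

For $n = 8$ the tree needs $\log_2 8 = 3$ levels, while the number of odd-parity input patterns is $2^{n-1} = 128$.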

#### 2. Forward and backward propagation of deep neural network

##### (1) Forward and backward functions

- For each layer *l* for a *single* sample:
    - **Given parameters**: $\textbf{w}^{[l]}, \textbf{b}^{[l]}$
    - **Forward propagation**:
        - **Input**: $\textbf{a}^{[l-1]}$
        - **Output**:$\textbf{a}^{[l]}$
        - **Computation**: $\textbf{a}^{[l]} = g^{[l]}(\textbf{z}^{[l]}), \textbf{z}^{[l]} = \textbf{w}^{[l]}\textbf{a}^{[l-1]}+\textbf{b}^{[l]}$
    - **Backward propagation**:
        - **Input**: $\mathrm{d}\textbf{a}^{[l]}$
        - **Output**:$\mathrm{d}\textbf{a}^{[l-1]},\mathrm{d}\textbf{w}^{[l]},\mathrm{d}\textbf{b}^{[l]}$
        - **Computation**: $\mathrm{d}\textbf{z}^{[l]}=\mathrm{d}\textbf{a}^{[l]}\odot g^{[l]'}(\textbf{z}^{[l]}),\ \mathrm{d}\textbf{w}^{[l]} =\mathrm{d}\textbf{z}^{[l]}\textbf{a}^{[l-1]T},\ \mathrm{d}\textbf{b}^{[l]}=\mathrm{d}\textbf{z}^{[l]},\ \mathrm{d}\textbf{a}^{[l-1]}=\textbf{w}^{[l]T}\mathrm{d}\textbf{z}^{[l]}$
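
The backward formulas follow from the chain rule. As a brief sketch, write $\mathcal{L}$ for the loss on the single sample (a symbol assumed here, not used elsewhere in these notes) and read $\mathrm{d}\textbf{v}$ as $\partial\mathcal{L}/\partial\textbf{v}$. Component by component:

$$
\begin{aligned}
a_i^{[l]} = g^{[l]}(z_i^{[l]}) &\Rightarrow \mathrm{d}z_i^{[l]} = \mathrm{d}a_i^{[l]}\, g^{[l]'}(z_i^{[l]})\\
z_i^{[l]} = \sum_j w_{ij}^{[l]} a_j^{[l-1]} + b_i^{[l]} &\Rightarrow \mathrm{d}w_{ij}^{[l]} = \mathrm{d}z_i^{[l]}\, a_j^{[l-1]},\quad \mathrm{d}b_i^{[l]} = \mathrm{d}z_i^{[l]},\quad \mathrm{d}a_j^{[l-1]} = \sum_i w_{ij}^{[l]}\, \mathrm{d}z_i^{[l]}
\end{aligned}
$$

Stacking the components gives the matrix forms: an elementwise product for $\mathrm{d}\textbf{z}^{[l]}$, the outer product $\mathrm{d}\textbf{z}^{[l]}\textbf{a}^{[l-1]T}$ for $\mathrm{d}\textbf{w}^{[l]}$, and $\textbf{w}^{[l]T}\mathrm{d}\textbf{z}^{[l]}$ for $\mathrm{d}\textbf{a}^{[l-1]}$.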


![4layer NN](resource%20database%20for%20MD%20notes/Week4/4layer_NN.png)
*<p style="color:grey" align="center">using this 4-layer neural network as an example</p>*

##### (2) Forward propagation: Computing a deep neural network's output on a *single* sample

- For *layer 1*:

$$\boxed{\left\{\begin{array}{c}
z_1^{[1]} = \textbf{w}_1^{[1]}\textbf{x} + b_1^{[1]}\\
z_2^{[1]} = \textbf{w}_2^{[1]}\textbf{x} + b_2^{[1]}\\
z_3^{[1]} = \textbf{w}_3^{[1]}\textbf{x} + b_3^{[1]}\\
z_4^{[1]} = \textbf{w}_4^{[1]}\textbf{x} + b_4^{[1]}\\
z_5^{[1]} = \textbf{w}_5^{[1]}\textbf{x} + b_5^{[1]}
\end{array}\right\}\Rightarrow \left\{\begin{array}{c}
a_1^{[1]} = g^{[1]}(z_1^{[1]})\\
a_2^{[1]} = g^{[1]}(z_2^{[1]})\\
a_3^{[1]} = g^{[1]}(z_3^{[1]})\\
a_4^{[1]} = g^{[1]}(z_4^{[1]})\\
a_5^{[1]} = g^{[1]}(z_5^{[1]})
\end{array}\right\}}$$

$$\boxed{\textbf{z}^{[1]} = \textbf{w}^{[1]}\textbf{x} + \textbf{b}^{[1]},\textbf{a}^{[1]} = g^{[1]}({\textbf{z}^{[1]}})}$$
where: $\textbf{z}^{[1]}$.shape = (5,1),$\textbf{a}^{[1]}$.shape = (5,1),$\textbf{w}^{[1]}$.shape = (5,3),$\textbf{x}$.shape = (3,1),$\textbf{b}^{[1]}$.shape = (5,1)

- For *layer 2*:
$$\boxed{\textbf{z}^{[2]} = \textbf{w}^{[2]}\textbf{a}^{[1]} + \textbf{b}^{[2]},\textbf{a}^{[2]} = g^{[2]}({\textbf{z}^{[2]}})}$$
where: $\textbf{z}^{[2]}$.shape = (5,1),$\textbf{a}^{[2]}$.shape = (5,1),$\textbf{w}^{[2]}$.shape = (5,5),$\textbf{b}^{[2]}$.shape = (5,1)

- For *layer 3*:
$$\boxed{\textbf{z}^{[3]} = \textbf{w}^{[3]}\textbf{a}^{[2]} + \textbf{b}^{[3]},\textbf{a}^{[3]} = g^{[3]}({\textbf{z}^{[3]}})}$$
where: $\textbf{z}^{[3]}$.shape = (3,1),$\textbf{a}^{[3]}$.shape = (3,1),$\textbf{w}^{[3]}$.shape = (3,5),$\textbf{b}^{[3]}$.shape = (3,1)


- For *layer 4* (*output layer*):
$$\boxed{z^{[L]} = \textbf{w}^{[4]}\textbf{a}^{[3]} + \textbf{b}^{[4]},\hat{y}=a^{[L]} = g^{[4]}({z^{[L]}})}$$
where: $z^{[L]}$.shape = (1,1),$a^{[L]}$.shape = (1,1),$\textbf{w}^{[4]}$.shape = (1,3),$\textbf{b}^{[4]}$.shape = (1,1)


##### (3) Forward propagation: Vectorizing across *m* examples

$$\boxed{
    \!\begin{aligned}
    \textbf{Z}^{[1]} = \textbf{w}^{[1]}\textbf{X} + \textbf{b}^{[1]},\textbf{A}^{[1]} = g^{[1]}({\textbf{Z}^{[1]}})\\
    \textbf{Z}^{[2]} = \textbf{w}^{[2]}\textbf{A}^{[1]} + \textbf{b}^{[2]},\textbf{A}^{[2]} = g^{[2]}({\textbf{Z}^{[2]}})\\
    \textbf{Z}^{[3]} = \textbf{w}^{[3]}\textbf{A}^{[2]} + \textbf{b}^{[3]},\textbf{A}^{[3]} = g^{[3]}({\textbf{Z}^{[3]}})\\
    \textbf{Z}^{[L]} = \textbf{w}^{[4]}\textbf{A}^{[3]} + \textbf{b}^{[4]},\hat{\textbf{Y}}=\textbf{A}^{[L]} = g^{[4]}({\textbf{Z}^{[L]}})
    \end{aligned}
    }
$$
where: $\textbf{Z}^{[l]}$.shape = $(n^{[l]},m)$, $\textbf{A}^{[l]}$.shape = $(n^{[l]},m)$, $\textbf{w}^{[l]}$.shape = $(n^{[l]},n^{[l-1]})$, $\textbf{b}^{[l]}$.shape = $(n^{[l]},1)$ (broadcast to $(n^{[l]},m)$ in the addition)
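
The vectorized equations above can be written as a single loop over layers. The sketch below assumes ReLU for the hidden layers and sigmoid for the output layer (the notes leave $g^{[l]}$ unspecified), and the names `forward`, `params`, and `caches` are illustrative. It also checks that the vectorized pass matches processing each sample column on its own.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(X, params, num_layers):
    """Forward pass: ReLU hidden layers, sigmoid output. Returns A[L] and the cached Z's."""
    A, caches = X, []
    for l in range(1, num_layers + 1):
        # b has shape (n[l], 1) and is broadcast across the m columns.
        Z = params[f"W{l}"] @ A + params[f"b{l}"]
        A = sigmoid(Z) if l == num_layers else relu(Z)
        caches.append(Z)
    return A, caches

rng = np.random.default_rng(0)
layer_dims = [3, 5, 5, 3, 1]
params = {}
for l in range(1, len(layer_dims)):
    params[f"W{l}"] = rng.standard_normal((layer_dims[l], layer_dims[l - 1]))
    params[f"b{l}"] = rng.standard_normal((layer_dims[l], 1))

m = 4
X = rng.standard_normal((3, m))
Y_hat, caches = forward(X, params, num_layers=4)

# The vectorized result should match running each sample (column) separately.
per_sample = np.hstack([forward(X[:, [i]], params, 4)[0] for i in range(m)])
print(Y_hat.shape, np.allclose(Y_hat, per_sample))
```

Caching each $\textbf{Z}^{[l]}$ during the forward pass is what lets the backward pass evaluate $g^{[l]'}(\textbf{Z}^{[l]})$ without recomputing it.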


##### (4) Forward propagation: for *layer l*

- **Input**: $\textbf{A}^{[l-1]}$
- **Output**: $\textbf{A}^{[l]}$ (cache: $\textbf{Z}^{[l]}$)
- **Computation**: $\textbf{A}^{[l]} = g^{[l]}(\textbf{Z}^{[l]}), \textbf{Z}^{[l]} = \textbf{w}^{[l]}\textbf{A}^{[l-1]}+\textbf{b}^{[l]}$

##### (5) Backward propagation: for layer *l*

- **Input**: $\mathrm{d}\textbf{A}^{[l]}$
- **Output**: $\mathrm{d}\textbf{A}^{[l-1]},\mathrm{d}\textbf{w}^{[l]},\mathrm{d}\textbf{b}^{[l]}$
- **Computation**: $\mathrm{d}\textbf{Z}^{[l]} = \mathrm{d}\textbf{A}^{[l]}\odot g^{[l]'}(\textbf{Z}^{[l]}),\ \mathrm{d}\textbf{w}^{[l]} = \frac{1}{m}\mathrm{d}\textbf{Z}^{[l]}\textbf{A}^{[l-1]T},\ \mathrm{d}\textbf{b}^{[l]} = \frac{1}{m}\sum\mathrm{d}\textbf{Z}^{[l]}$ (summed over the *m* sample columns)$,\ \mathrm{d}\textbf{A}^{[l-1]}=\textbf{w}^{[l]T}\mathrm{d}\textbf{Z}^{[l]}$
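
A minimal sketch of this backward step for one layer, assuming a tanh activation so that $g'(\textbf{Z}) = 1 - \tanh^2(\textbf{Z})$. The function names and the toy cost $J = \frac{1}{m}\sum \textbf{A}^{[l]}$ are hypothetical choices made only so the analytic gradient can be compared against a finite-difference estimate.

```python
import numpy as np

def layer_forward(A_prev, W, b):
    Z = W @ A_prev + b
    return np.tanh(Z), Z

def layer_backward(dA, Z, A_prev, W):
    """Backward step for one tanh layer over m samples."""
    m = A_prev.shape[1]
    dZ = dA * (1 - np.tanh(Z) ** 2)          # elementwise product dA * g'(Z)
    dW = dZ @ A_prev.T / m                   # uses A[l-1], not dA[l-1]
    db = dZ.sum(axis=1, keepdims=True) / m   # sum over the m sample columns
    dA_prev = W.T @ dZ
    return dA_prev, dW, db

rng = np.random.default_rng(1)
n_prev, n_l, m = 3, 5, 4
A_prev = rng.standard_normal((n_prev, m))
W = rng.standard_normal((n_l, n_prev))
b = rng.standard_normal((n_l, 1))

# Toy cost J = (1/m) * sum(A), so the per-sample dA (before the 1/m factor) is all ones.
def cost(W_):
    return layer_forward(A_prev, W_, b)[0].sum() / m

A, Z = layer_forward(A_prev, W, b)
dA_prev, dW, db = layer_backward(np.ones_like(A), Z, A_prev, W)

# Central-difference estimate of dJ/dW[0, 0] to compare with the analytic dW.
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[0, 0] += eps
W_minus[0, 0] -= eps
numeric = (cost(W_plus) - cost(W_minus)) / (2 * eps)
print(dW.shape, db.shape, dA_prev.shape, np.isclose(dW[0, 0], numeric))
```

The gradients have the same shapes as the quantities they correspond to: `dW` matches `W`, `db` matches `b`, and `dA_prev` matches `A_prev`.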

##### (6) General workflow for a neural network

![workflow](resource%20database%20for%20MD%20notes/Week4/FP_BP.JPG)
*<p style="color:grey" align="center">workflow for a neural network</p>*

#### 3. Parameters and hyperparameters

- **Parameters of neural network**
    - $\textbf{w}$ and $\textbf{b}$
- **Hyperparameters of neural network**
    - *Learning rate*: $\alpha$
        - Determines how fast the parameters are updated at each step of gradient descent
    - *Number of iterations* for *gradient descent*
    - *Number of hidden layers*: *L*
    - *Number of nodes of each layer*: $n^{[l]}$
    - *Choice of activation function*: $g^{[l]}$ (e.g., tanh, sigmoid, ReLU)
- **Function of hyperparameters**
    - Control the *final values* of the *parameters* and how they *evolve* during training
- **Other hyperparameters (*future course*)**
    - Momentum term
    - Mini batch size
    - Regularization parameters
- **How to tune hyperparameters**
    - Applying DL is highly empirical and relies on an iterative "idea-code-experiment" cycle
    - The final choice is based on the performance of the model
        - Convergence rate, and whether training avoids diverging
        - Lowest cost function *J* reached at a reasonable learning speed
    - The best hyperparameters may vary across *tasks* and even change over the *course of a single task*
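
To illustrate how the learning rate affects convergence and divergence, here is a toy sketch that runs gradient descent on the one-parameter cost $J(w) = w^2$. The cost function, the starting point, and the three values of $\alpha$ are assumptions chosen for illustration. A real network's cost surface is far more complex, so suitable values have to be found experimentally.

```python
def gradient_descent(alpha, num_iterations, w0=1.0):
    """Minimize the toy cost J(w) = w**2 (gradient 2w); return the final w."""
    w = w0
    for _ in range(num_iterations):
        w -= alpha * 2 * w
    return w

for alpha in (0.01, 0.1, 1.1):
    w = gradient_descent(alpha, num_iterations=50)
    print(f"alpha={alpha}: w={w:.3e}, J={w * w:.3e}")
```

With the small rate the cost decreases slowly, with the moderate rate it approaches zero quickly, and with the large rate each update overshoots the minimum so the iterates grow in magnitude instead of converging.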