# **Module 3: Shallow Neural Network**

#### 1. Neural network representation

##### (1) One hidden layer neural network

![Single Hidden Layer](resource%20database%20for%20MD%20notes/Week3/SingleHiddenLayer.jpg)
- **Definition**: A neural network with exactly one hidden layer between the input layer and the output layer (also called a **2-layer network**, because the input layer is not counted)
- **Terms**:
    - Input layer (*layer 0*): $x_1, x_2, x_3 \Rightarrow X = a^{[0]}$
    - Hidden layer (*layer 1*): $a^{[1]}_1,a^{[1]}_2,a^{[1]}_3,a^{[1]}_4 \Rightarrow a^{[1]}$
        - *Hidden* indicates that the true values of these nodes are **not observed**.
    - Output layer (*layer 2*): $\hat{y} = a^{[2]}$

##### (2) Computing a neural network's output on a *single* sample

- For each node of *layer 1* $a_i^{[1]}$:
$$\boxed{z_i^{[1]} = \textbf{w}_i^{[1]T}\textbf{x} + b_i^{[1]},\quad a_i^{[1]} = \sigma(z_i^{[1]})}$$
where: the *superscript* $^{[1]}$ is the index of the *layer* and the *subscript* $_i$ is the index of the *node* within that *layer*
- For the full *layer 1*:

$$\boxed{\left\{\begin{array}{c}
z_1^{[1]} = \textbf{w}_1^{[1]T}\textbf{x} + b_1^{[1]}\\
z_2^{[1]} = \textbf{w}_2^{[1]T}\textbf{x} + b_2^{[1]}\\
z_3^{[1]} = \textbf{w}_3^{[1]T}\textbf{x} + b_3^{[1]}\\
z_4^{[1]} = \textbf{w}_4^{[1]T}\textbf{x} + b_4^{[1]}
\end{array}\right\}\Rightarrow \left\{\begin{array}{c}
a_1^{[1]} = \sigma(z_1^{[1]})\\
a_2^{[1]} = \sigma(z_2^{[1]})\\
a_3^{[1]} = \sigma(z_3^{[1]})\\
a_4^{[1]} = \sigma(z_4^{[1]})
\end{array}\right\}}$$
- **Vectorization of the calculation**:
$$\boxed{\left[\begin{array}{c}
z_1^{[1]}\\
z_2^{[1]}\\
z_3^{[1]}\\
z_4^{[1]}
\end{array}\right] =  
\left[\begin{array}{ccc}
(w_1^{[1]})_{x_1} & (w_1^{[1]})_{x_2} & (w_1^{[1]})_{x_3}\\
(w_2^{[1]})_{x_1} & (w_2^{[1]})_{x_2} & (w_2^{[1]})_{x_3}\\
(w_3^{[1]})_{x_1} & (w_3^{[1]})_{x_2} & (w_3^{[1]})_{x_3}\\
(w_4^{[1]})_{x_1} & (w_4^{[1]})_{x_2} & (w_4^{[1]})_{x_3}\\
\end{array}\right]
\left[\begin{array}{c}
x_1\\
x_2\\
x_3
\end{array}\right] + \left[\begin{array}{c}
b_1^{[1]}\\
b_2^{[1]}\\
b_3^{[1]}\\
b_4^{[1]}
\end{array}\right]}$$
$$\boxed{\textbf{z}^{[1]} = \textbf{w}^{[1]T}\textbf{x} + \textbf{b}^{[1]}}$$
where: $\textbf{z}^{[1]}\mathrm{.shape} = (4,1), \textbf{w}^{[1]T}\mathrm{.shape} = (4,3), \textbf{x}\mathrm{.shape} = (3,1), \textbf{b}^{[1]}\mathrm{.shape} = (4,1)$  
$$\boxed{\left[\begin{array}{c}
a_1^{[1]}\\
a_2^{[1]}\\
a_3^{[1]}\\
a_4^{[1]}
\end{array}\right] = \sigma\left[\begin{array}{c}
z_1^{[1]}\\
z_2^{[1]}\\
z_3^{[1]}\\
z_4^{[1]}
\end{array}\right]}$$
$$\boxed{\textbf{a}^{[1]} = \sigma(\textbf{z}^{[1]})}$$
where: $\textbf{a}^{[1]}\mathrm{.shape} = \textbf{z}^{[1]}\mathrm{.shape} = (4,1)$
- For the full *layer 2*:
$$\boxed{z^{[2]} = \textbf{w}^{[2]T}\textbf{a}^{[1]} + b^{[2]}\Rightarrow a^{[2]} = \sigma(z^{[2]})}$$
where: $\textbf{w}^{[2]T}\mathrm{.shape} = (1,4), \textbf{a}^{[1]}\mathrm{.shape} = (4,1)$
- **Vectorization of the calculation**:
$$\boxed{z^{[2]} = \left[\begin{array}{cccc}
w_1^{[2]} & w_2^{[2]} & w_3^{[2]} & w_4^{[2]}\end{array}\right]
\left[\begin{array}{c}
a_1^{[1]}\\
a_2^{[1]}\\
a_3^{[1]}\\
a_4^{[1]}
\end{array}\right]+b^{[2]}}$$
- For the sample:
$$\textbf{z}^{[1]} = \textbf{w}^{[1]T}\textbf{x} + \textbf{b}^{[1]}$$
$$\textbf{a}^{[1]} = \sigma(\textbf{z}^{[1]})$$
$$z^{[2]} = \textbf{w}^{[2]T}\textbf{a}^{[1]} + b^{[2]}$$
$$a^{[2]} = \sigma(z^{[2]})$$
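
The four equations above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `forward_single`, the random seed, and the parameter values are assumptions made only to exercise the shapes (3 inputs, 4 hidden nodes, 1 output). As in the notes, `w1` is stored with shape (3, 4) so that its transpose has shape (4, 3).

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def forward_single(x, w1, b1, w2, b2):
    """Forward pass for one sample x of shape (3, 1)."""
    z1 = w1.T @ x + b1      # (4, 3) @ (3, 1) + (4, 1) -> (4, 1)
    a1 = sigmoid(z1)        # (4, 1)
    z2 = w2.T @ a1 + b2     # (1, 4) @ (4, 1) + (1, 1) -> (1, 1)
    a2 = sigmoid(z2)        # (1, 1), the prediction y_hat
    return z1, a1, z2, a2

# Hypothetical values, chosen only to check the shapes.
rng = np.random.default_rng(0)
x = rng.standard_normal((3, 1))
w1 = 0.01 * rng.standard_normal((3, 4))
b1 = np.zeros((4, 1))
w2 = 0.01 * rng.standard_normal((4, 1))
b2 = np.zeros((1, 1))

z1, a1, z2, a2 = forward_single(x, w1, b1, w2, b2)
print(z1.shape, a1.shape, z2.shape, a2.shape)  # (4, 1) (4, 1) (1, 1) (1, 1)
```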

##### (3) Vectorizing across *multiple* examples

- For *m* samples:
    - Notation for a *node*: $a_k^{[l](i)}$  
where: *l* is the index of the *layer*, *i* is the index of the *sample*, and *k* is the index of the *node* in this layer
    - *For* loop for all samples:
        - *for i = 1 to m*:
            - $z^{[1](i)} = w^{[1]T}x^{(i)}+b^{[1]}$
            - $a^{[1](i)} = \sigma(z^{[1](i)})$
            - $z^{[2](i)} = w^{[2]T}a^{[1](i)}+b^{[2]}$
            - $a^{[2](i)} = \sigma(z^{[2](i)})$
- **Vectorization of the calculation**:
$$\boxed{\textbf{X} = \left[\begin{array}{cccc}
x_1^{(1)} & x_1^{(2)} & ... & x_1^{(m)}\\
x_2^{(1)} & x_2^{(2)} & ... & x_2^{(m)}\\
\vdots & \vdots & \vdots & \vdots \\
x_{n_x}^{(1)} & x_{n_x}^{(2)} & ... & x_{n_x}^{(m)}
\end{array}\right], \textbf{X}.\mathrm{shape} = (n_x,m)}$$
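$$\boxed{\textbf{w}^{[1]} = \left[\begin{array}{cccc}
w_1^{[1](1)} & w_1^{[1](2)} & \cdots & w_1^{[1](n_1)}\\
w_2^{[1](1)} & w_2^{[1](2)} & \cdots & w_2^{[1](n_1)}\\
\vdots & \vdots & \ddots & \vdots \\
w_{n_x}^{[1](1)} & w_{n_x}^{[1](2)} & \cdots & w_{n_x}^{[1](n_1)}
\end{array}\right], \textbf{w}^{[1]}.\mathrm{shape} = (n_x,n_1)}$$
where: column $k$ of $\textbf{w}^{[1]}$ is the weight vector of node $k$ in *layer* 1 (in this matrix the superscript $(k)$ indexes the *node*, not the *sample*)  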
By stacking the samples $x^{(1)}, \dots, x^{(m)}$ as the columns of $\textbf{X}$, the equations can be computed for all *m* samples at once:
$$\textbf{Z}^{[1]} = \textbf{w}^{[1]T}\textbf{X} + \textbf{b}^{[1]}$$
$$\textbf{A}^{[1]} = \sigma(\textbf{Z}^{[1]})$$
where: $\textbf{A}^{[1]}.\mathrm{shape} = \textbf{Z}^{[1]}.\mathrm{shape} = (n_1,m), \textbf{w}^{[1]T}.\mathrm{shape} = (n_1,n_x),\textbf{X}.\mathrm{shape} = (n_x,m),\textbf{b}^{[1]}.\mathrm{shape} = (n_1,m)( \mathrm{broadcasted}\ \mathrm{from}\ (n_1,1))$, $n_1$ is the total number of *nodes* in *layer* 1 (i.e., 4 in this example).

$$\textbf{Z}^{[2]} = \textbf{w}^{[2]T}\textbf{A}^{[1]} + b^{[2]}$$
$$\textbf{A}^{[2]} = \sigma(\textbf{Z}^{[2]})$$
where: $\textbf{A}^{[2]}.\mathrm{shape} = \textbf{Z}^{[2]}.\mathrm{shape} = (1,m), \textbf{w}^{[2]T}.\mathrm{shape} = (1,n_1),\textbf{A}^{[1]}.\mathrm{shape} = (n_1,m),\textbf{b}^{[2]}.\mathrm{shape} = (1,m)( \mathrm{broadcasted}\ \mathrm{from}\ (1,1))$, assuming *layer* 2 is the *output layer* where $\textbf{A}^{[2]} = \hat{\textbf{Y}}$
- **Further explanation for vectorized implementation**:
$$\textbf{w}^{[1]T}\textbf{X} = \left[\begin{array}{cccc}
w_1^{[1](1)} & w_2^{[1](1)} & \cdots & w_{n_x}^{[1](1)}\\
w_1^{[1](2)} & w_2^{[1](2)} & \cdots & w_{n_x}^{[1](2)}\\
\vdots & \vdots & \ddots & \vdots \\
w_1^{[1](n_1)} & w_2^{[1](n_1)} & \cdots & w_{n_x}^{[1](n_1)}
\end{array}\right]\left[\begin{array}{cccc}
x_1^{(1)} & x_1^{(2)} & \cdots & x_1^{(m)}\\
x_2^{(1)} & x_2^{(2)} & \cdots & x_2^{(m)}\\
\vdots & \vdots & \ddots & \vdots \\
x_{n_x}^{(1)} & x_{n_x}^{(2)} & \cdots & x_{n_x}^{(m)}
\end{array}\right] = \left[\begin{array}{cccc}
\sum\limits_{i=1}^{n_x}w_i^{[1](1)}x_i^{(1)} & \sum\limits_{i=1}^{n_x}w_i^{[1](1)}x_i^{(2)} & \cdots & \sum\limits_{i=1}^{n_x}w_i^{[1](1)}x_i^{(m)}\\
\sum\limits_{i=1}^{n_x}w_i^{[1](2)}x_i^{(1)} & \sum\limits_{i=1}^{n_x}w_i^{[1](2)}x_i^{(2)} & \cdots & \sum\limits_{i=1}^{n_x}w_i^{[1](2)}x_i^{(m)}\\
\vdots & \vdots & \ddots & \vdots\\
\sum\limits_{i=1}^{n_x}w_i^{[1](n_1)}x_i^{(1)} & \sum\limits_{i=1}^{n_x}w_i^{[1](n_1)}x_i^{(2)} & \cdots & \sum\limits_{i=1}^{n_x}w_i^{[1](n_1)}x_i^{(m)}\\
\end{array}\right]$$
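
To make the equivalence between the per-sample loop and the matrix form concrete, the sketch below computes the output both ways and compares them. The helper names `forward_loop` and `forward_vectorized`, the sizes, and the random values are assumptions for illustration only; `w1` has shape $(n_x, n_1)$ and `w2` has shape $(n_1, 1)$, as in the notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_loop(X, w1, b1, w2, b2):
    """Process the m columns of X one sample at a time."""
    m = X.shape[1]
    A2 = np.zeros((1, m))
    for i in range(m):
        x_i = X[:, i:i + 1]                   # keep the (n_x, 1) column shape
        a1 = sigmoid(w1.T @ x_i + b1)         # (n_1, 1)
        A2[:, i:i + 1] = sigmoid(w2.T @ a1 + b2)
    return A2

def forward_vectorized(X, w1, b1, w2, b2):
    """Process all m samples at once; b1 and b2 are broadcast over columns."""
    A1 = sigmoid(w1.T @ X + b1)               # (n_1, m)
    return sigmoid(w2.T @ A1 + b2)            # (1, m)

n_x, n_1, m = 3, 4, 5                         # hypothetical sizes
rng = np.random.default_rng(1)
X = rng.standard_normal((n_x, m))
w1 = rng.standard_normal((n_x, n_1))
b1 = rng.standard_normal((n_1, 1))
w2 = rng.standard_normal((n_1, 1))
b2 = rng.standard_normal((1, 1))

A2_loop = forward_loop(X, w1, b1, w2, b2)
A2_vec = forward_vectorized(X, w1, b1, w2, b2)
print(np.allclose(A2_loop, A2_vec))
```

Column $i$ of the vectorized result corresponds to sample $i$, which is why the two versions agree.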

#### 2. Activation Function

##### (1) General activation functions

- **Definition**: an *activation function* of a *node* (*g*) calculates the <u>output</u> $a$ of the *node* from its <u>input</u> $z$
    $$\boxed{a^{[i]} = g^{[i]}(z^{[i]}), z^{[i]} = W^{[i]}a^{[i-1]} + b^{[i]}}$$
    - Can be different for different layers

- **General activation functions**
    - *sigmoid* function
        - Usually the weakest choice for *hidden layers* among the *activation functions* listed here
        - Generally only used for the *output layer* of binary classification models (at the $\hat{y}$ output)
        <div style="text-align: center;">
            <img src="resource%20database%20for%20MD%20notes/Week2/1280px-Logistic-curve.svg.png" alt="sigmoid">
        </div>

        $$
        \boxed{a = \sigma(z) = \frac{1}{1+e^{-z}}}
        $$

    - *hyperbolic tangent (tanh)* function
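        - <u>Mathematically a shifted and rescaled</u> *sigmoid* function: $\tanh(z) = 2\sigma(2z) - 1$
        - usually works better than *sigmoid* in *hidden layers* because its outputs lie in $(-1, 1)$ and are centered on 0, so the *mean* of the activations passed to the next layer is closer to 0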
        <div style="text-align: center;">
            <img src="resource%20database%20for%20MD%20notes/Week3/TanhReal.gif" alt="hyperbolic tangent">
        </div>
        
        $$
        \boxed{a = \tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}}
        $$
        - Drawback of *sigmoid* and *tanh* function
            - When *z* is very **large** (positive) or very **negative**, the slope (derivative) of the function is very **small**, which slows down *gradient descent*


    - *rectified linear unit (ReLU)* function
        - the derivative is:
            - 1 for $z > 0$ and 0 for $z < 0$
            - mathematically undefined at $z = 0$; in practice $z$ is almost never exactly 0, and implementations simply use 0 or 1 there
        - increasingly the <u>default</u> choice of *activation function* for *hidden layers*
        <div style="text-align: center;">
            <img src="resource%20database%20for%20MD%20notes/Week1/ezgif-3-e2770704ae.jpg" alt="ReLU">
        </div>

        $$
        \boxed{a = \max(0,z)}
        $$
        - Drawback of *ReLU* function
            - the slope of the function in the negative zone is 0, so a unit receives no gradient (and its weights are not updated) for samples where $z < 0$
    - *Leaky ReLU* function
        <div style="text-align: center;">
            <img src="resource%20database%20for%20MD%20notes/Week3/leakyrelu.png" alt="leaky relu">
        </div>

        $$
        \boxed{a = \max(0.01z,z)}
        $$
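        - Advantage of *ReLU* and *Leaky ReLU* function
            - the derivative does not shrink toward 0 as $z$ grows: it stays at 1 for every $z > 0$, so learning is usually much faster than with *sigmoid* or *tanh*; for $z < 0$, *Leaky ReLU* keeps a small slope of 0.01, whereas plain *ReLU* has slope 0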
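
The four activation functions above translate directly into NumPy. The sketch below is only an illustration: the function names and the test vector `z` are assumptions, and the Leaky ReLU slope of 0.01 is taken from the formula above. The last line also checks numerically that *tanh* is a shifted and rescaled *sigmoid*.

```python
import numpy as np

def sigmoid(z):
    """a = 1 / (1 + exp(-z)), output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """a = max(0, z), applied element-wise."""
    return np.maximum(0.0, z)

def leaky_relu(z, slope=0.01):
    """a = max(slope * z, z), applied element-wise."""
    return np.maximum(slope * z, z)

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])   # hypothetical inputs

print(sigmoid(z))
print(np.tanh(z))        # NumPy already provides tanh
print(relu(z))
print(leaky_relu(z))

# tanh(z) = 2 * sigmoid(2z) - 1
print(np.allclose(np.tanh(z), 2.0 * sigmoid(2.0 * z) - 1.0))
```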


##### (2) Rationale of using non-linear activation functions for neural network

- **Why does a neural network need a non-linear activation function?**
- **Identity activation function**:
    $$
    \boxed{a^{[i]} = z^{[i]}, z^{[i]} = W^{[i]}a^{[i-1]} + b^{[i]}}
    $$
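    - Composing all $l$ layers gives:
    $$
    \boxed{\hat{y} = W^{[l]}(W^{[l-1]}(\dots(W^{[1]}x + b^{[1]}) + \dots) + b^{[l-1]}) + b^{[l]} = \left(W^{[l]}W^{[l-1]}\cdots W^{[1]}\right)x + \sum\limits_{i=1}^{l}\left(W^{[l]}W^{[l-1]}\cdots W^{[i+1]}\right)b^{[i]} = W'x + b'}
    $$
    - Here $W' = W^{[l]}W^{[l-1]}\cdots W^{[1]}$ and $b'$ is the bias sum; the product in front of $b^{[l]}$ is empty (the identity), and the matrices are multiplied in decreasing layer order because matrix products do not commute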
    - Result: the predicted output $\hat{y}$ is always a linear function of input features no matter how many *layers* we use
    - If the *identity activation function* is applied in all *hidden layers* while the output layer uses another *activation function*:
        - The model is equivalent to a model that applies that last *activation function* directly to a linear function of the input, without any *hidden layers* (e.g., with a *sigmoid* output it reduces to logistic regression)
    - The *identity activation function* is normally useful only at the *output layer*, for real-valued predictions (e.g., housing price vs. features)
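
The collapse of stacked identity-activation layers into a single linear map can be checked numerically. The following is a minimal sketch for two layers; the layer sizes, seed, and variable names are assumptions for illustration, and `W1`, `W2` follow the $W^{[i]}a^{[i-1]} + b^{[i]}$ convention of this subsection (no transpose).

```python
import numpy as np

rng = np.random.default_rng(2)
n_x, n_1, n_2, m = 3, 4, 1, 6        # hypothetical layer sizes and sample count
X = rng.standard_normal((n_x, m))
W1, b1 = rng.standard_normal((n_1, n_x)), rng.standard_normal((n_1, 1))
W2, b2 = rng.standard_normal((n_2, n_1)), rng.standard_normal((n_2, 1))

# Two layers with the identity activation: a = z at every layer.
A1 = W1 @ X + b1
Y_two_layers = W2 @ A1 + b2

# The same map written as one linear layer W'x + b'.
W_combined = W2 @ W1                 # note the order: last layer first
b_combined = W2 @ b1 + b2
Y_one_layer = W_combined @ X + b_combined

print(np.allclose(Y_two_layers, Y_one_layer))
```

Adding more identity layers only extends the product and the bias sum, so the result stays linear in `X`.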


#### 3. Gradient Descent for Neural Networks

##### (1) Derivatives of activation functions

- For *sigmoid* function:
    $$
    \boxed{
        \!\begin{aligned}
        &\qquad\qquad\qquad g(z) = \frac{1}{1+e^{-z}}\\
        &\frac{\mathrm{d}g(z)}{dz} = \frac{1}{1+e^{-z}}(1-\frac{1}{1+e^{-z}}) = g(z) (1-g(z))
        \end{aligned}
        }
    $$
    - When $z$ is very **large** (say $z = 10$): $g(z)\approx 1 \Rightarrow \frac{\mathrm{d}g(z)}{dz} \approx 0 $ 
    - When $z$ is very **negative** (say $z = -10$): $g(z)\approx 0 \Rightarrow \frac{\mathrm{d}g(z)}{dz} \approx 0 $ 
    - When $z$ is **close to 0** (say $z = 0$): $g(z) = \frac{1}{2} \Rightarrow \frac{\mathrm{d}g(z)}{dz} = \frac{1}{4} $ 

- For *tanh* function:
    $$
    \boxed{
        \!\begin{aligned}
        &\qquad g(z) = \tanh (z) = \frac{e^z-e^{-z}}{e^z+e^{-z}}\\
        &\frac{\mathrm{d}g(z)}{dz} = 1 - (\frac{e^z-e^{-z}}{e^z+e^{-z}})^2 = 1 - {g(z)}^2
        \end{aligned}
    }
    $$
    
    - When $z$ is very **large** (say $z = 10$): $g(z)\approx 1 \Rightarrow \frac{\mathrm{d}g(z)}{dz} \approx 0 $ 
    - When $z$ is very **negative** (say $z = -10$): $g(z)\approx -1 \Rightarrow \frac{\mathrm{d}g(z)}{dz} \approx 0 $ 
    - When $z$ is **close to 0** (say $z = 0$): $g(z) = 0 \Rightarrow \frac{\mathrm{d}g(z)}{dz} = 1$ 

- For *ReLU* and *Leaky ReLU* function:
    - For *ReLU*:
        $$
        \boxed{
            \!\begin{aligned}
            &\quad g(z) = \max(0,z) \\
            &\frac{\mathrm{d}g(z)}{dz} = \begin{cases}
            0, & z \le 0\\
            1, & z > 0
            \end{cases}
            \end{aligned}
        }
        $$
    - For *leaky ReLU*:
        $$
        \boxed{
            \!\begin{aligned}
            &\quad g(z) = \max(0.01z,z) \\
            &\frac{\mathrm{d}g(z)}{dz} = \begin{cases}
            0.01, & z \le 0\\
            1, & z > 0
            \end{cases}
            \end{aligned}
        }
        $$

        - **Note**: the derivative is mathematically undefined at $z = 0$; assigning it $0$ (*ReLU*) or $0.01$ (*Leaky ReLU*) there is an implementation convention with no practical effect.
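
The derivative formulas in this subsection can be sanity-checked against a central finite difference. This is a sketch under stated assumptions: the helper names (`d_sigmoid`, `numeric_derivative`, and so on), the step `eps`, and the test points are all illustrative, and the test points deliberately avoid $z = 0$, where *ReLU* and *Leaky ReLU* have a kink.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, slope=0.01):
    return np.maximum(slope * z, z)

def d_sigmoid(z):
    g = sigmoid(z)
    return g * (1.0 - g)                  # g(z) (1 - g(z))

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2          # 1 - g(z)^2

def d_relu(z):
    return np.where(z > 0, 1.0, 0.0)      # convention: 0 at z = 0

def d_leaky_relu(z, slope=0.01):
    return np.where(z > 0, 1.0, slope)    # convention: slope at z = 0

def numeric_derivative(g, z, eps=1e-6):
    """Central finite-difference approximation of dg/dz."""
    return (g(z + eps) - g(z - eps)) / (2.0 * eps)

z = np.array([-10.0, -1.0, 0.5, 10.0])    # hypothetical test points, none at 0

for name, g, dg in [("sigmoid", sigmoid, d_sigmoid),
                    ("tanh", np.tanh, d_tanh),
                    ("relu", relu, d_relu),
                    ("leaky_relu", leaky_relu, d_leaky_relu)]:
    print(name, np.allclose(dg(z), numeric_derivative(g, z), atol=1e-6))
```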
    



##### (2) Gradient descent for neural networks

- **Parameters**
    - $w^{[i]}$, shape = $(n^{[i]}, n^{[i-1]})$ (when $i = 1$: $n^{[i-1]}$ becomes $n_x$, i.e., number of *features*)
    - $b^{[i]}$, shape = $(n^{[i]}, 1)$
- **Cost function**:
    - $J(w^{[1]}, b^{[1]}, ..., w^{[l]}, b^{[l]}) = \frac{1}{m}\sum\limits_{i=1}^{m} L(\hat{y}^{(i)},y^{(i)})$ (*l* layers in total)
- **Gradient descent**:
    - Repeat until the parameters converge {
    - compute predictions $\hat{y}^{(i)}, i = 1, \dots, m$ (*m* samples)
    - $\mathrm{d}w^{[i]} = \frac{\partial J}{\partial w^{[i]}}, \mathrm{d}b^{[i]} = \frac{\partial J}{\partial b^{[i]}}$ (for every layer $i$)
    - $w^{[i]} := w^{[i]} - \alpha\,\mathrm{d}w^{[i]}, b^{[i]} := b^{[i]} - \alpha\,\mathrm{d}b^{[i]}$ ($\alpha$ is the learning rate)
    - }
- **For binary classification with one hidden layer on $n_x$ features and *m* samples**
    - **Number of layers**: 
        - $l = 2$
    - **Parameters**:
        - $w^{[1]}$, shape =$(n^{[1]}, n_x)$
        - $b^{[1]}$, shape = $(n^{[1]}, 1)$
        - $w^{[2]}$, shape =$(1, n^{[1]})$
        - $b^{[2]}$, shape = $(1, 1)$
    - **Nodes of each layer**:
        - *layer 0 (input layer)*: $X = a^{[0]}$, shape = $(n_x, m)$
        - *layer 1 (hidden layer)*: $a^{[1]}$, shape = $(n^{[1]},m)$
        - *layer 2 (output layer)*: $a^{[2]} = \hat{Y}$, shape = $(1,m)$
    - **Cost function**:
        - $J(w^{[1]}, b^{[1]}, w^{[2]}, b^{[2]}) = \frac{1}{m}\sum\limits_{i=1}^{m} L(a^{[2](i)},y^{(i)})$
    - **Forward propagation**:
        - $\textbf{Z}^{[1]} = \textbf{w}^{[1]}\textbf{X} + \textbf{b}^{[1]}$ (no transpose here: $\textbf{w}^{[1]}$ is already stored with shape $(n^{[1]}, n_x)$)
        - $\textbf{A}^{[1]} = g^{[1]}(\textbf{Z}^{[1]})$
        - $\textbf{Z}^{[2]} = \textbf{w}^{[2]}\textbf{A}^{[1]} + \textbf{b}^{[2]}$
        - $\textbf{A}^{[2]} = \sigma(\textbf{Z}^{[2]})$

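    - **Back propagation**:
        - $\mathrm{d}\textbf{Z}^{[2]} = \textbf{A}^{[2]} - \textbf{Y}$ (holds for a *sigmoid* output with the cross-entropy loss)
        - $\mathrm{d}\textbf{w}^{[2]} = \frac{1}{m}\mathrm{d}\textbf{Z}^{[2]}\textbf{A}^{[1]T}$
        - $\mathrm{d}\textbf{b}^{[2]} = \frac{1}{m}\mathrm{np.sum}(\mathrm{d}\textbf{Z}^{[2]},\ \mathrm{axis}=1,\ \mathrm{keepdims}=\mathrm{True})$
        - $\mathrm{d}\textbf{Z}^{[1]} = \textbf{w}^{[2]T}\mathrm{d}\textbf{Z}^{[2]}\odot \frac{\mathrm{d}g^{[1]}(\textbf{Z}^{[1]})}{\mathrm{d}\textbf{Z}^{[1]}}$ ($\odot$ is the element-wise product; both factors have shape $(n^{[1]}, m)$)
        - $\mathrm{d}\textbf{w}^{[1]} = \frac{1}{m}\mathrm{d}\textbf{Z}^{[1]}\textbf{X}^{T}$
        - $\mathrm{d}\textbf{b}^{[1]} = \frac{1}{m}\mathrm{np.sum}(\mathrm{d}\textbf{Z}^{[1]},\ \mathrm{axis}=1,\ \mathrm{keepdims}=\mathrm{True})$ (summing over the sample axis keeps the shape $(n^{[1]}, 1)$)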
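
The forward and backward steps above can be combined into a small training loop. The sketch below rests on several assumptions that are not in the notes: a *tanh* hidden activation for $g^{[1]}$, the cross-entropy cost, a learning rate `alpha`, a toy dataset, and all function names. The weights are stored with shape $(n^{[i]}, n^{[i-1]})$ as in the parameter list, so they multiply the data without a transpose, and the bias gradients are summed over the sample axis so they keep their column shape.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, params):
    W1, b1, W2, b2 = params
    A1 = np.tanh(W1 @ X + b1)             # (n_1, m); g^[1] = tanh is an assumption
    A2 = sigmoid(W2 @ A1 + b2)            # (1, m)
    return A1, A2

def cost(A2, Y):
    """Cross-entropy cost averaged over the m samples (assumed loss)."""
    m = Y.shape[1]
    return float(-np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)) / m)

def backward(X, Y, params, A1, A2):
    W1, b1, W2, b2 = params
    m = X.shape[1]
    dZ2 = A2 - Y                                       # (1, m)
    dW2 = dZ2 @ A1.T / m                               # (1, n_1)
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m       # (1, 1)
    dZ1 = (W2.T @ dZ2) * (1.0 - A1 ** 2)               # tanh'(Z1) = 1 - A1^2
    dW1 = dZ1 @ X.T / m                                # (n_1, n_x)
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m       # (n_1, 1)
    return dW1, db1, dW2, db2

def train(X, Y, n_1=4, alpha=0.1, steps=300, seed=0):
    rng = np.random.default_rng(seed)
    n_x = X.shape[0]
    params = [0.01 * rng.standard_normal((n_1, n_x)), np.zeros((n_1, 1)),
              0.01 * rng.standard_normal((1, n_1)), np.zeros((1, 1))]
    costs = []
    for _ in range(steps):
        A1, A2 = forward(X, params)
        costs.append(cost(A2, Y))
        grads = backward(X, Y, params, A1, A2)
        params = [p - alpha * g for p, g in zip(params, grads)]
    return params, costs

# Hypothetical toy data: label is 1 when x1 + x2 > 0.
rng = np.random.default_rng(42)
X = rng.standard_normal((2, 200))
Y = (X[0:1, :] + X[1:2, :] > 0).astype(float)

params, costs = train(X, Y)
print(costs[0], costs[-1])   # the cost should go down over the iterations
```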

##### (3) Random initialization

- **What happens if you initialize weights $w$ to zero?**
    - It causes the ***symmetry breaking problem***: all hidden units start out identical, and gradient descent can never make them differ.
    - Say that for the *neural network* model with two *nodes* in one *hidden layer*
        - the initial parameters are:
            - $w^{[1]} = \left[\begin{array}{c}
            0 & 0 \\
            0 & 0
            \end{array}\right]$
          - $b^{[1]} = \left[\begin{array}{c}
            0 \\
             0
            \end{array}\right]$
    - During iteration, the hidden units $a_1^{[1]}$ and $a_2^{[1]}$ compute the same function of **X** and receive identical gradient updates, so their $w$ and $b$ values stay equal after every step
        - because they share the <u>same</u> *propagation process* on the <u>same</u> set of *parameters*

- **Solution**:
    - Set *parameters* to random values
        - $w^{[1]} = 0.01* \textrm{np.random.randn(2,2)}$ # generates small *Gaussian* random values in a (2,2) matrix
        - $b^{[1]} = \textrm{np.zeros((2,1))}$ # zeros are fine for $b$: the random $w$ already breaks the symmetry
        - $w^{[2]} = 0.01* \textrm{np.random.randn(1,2)}$
        - $b^{[2]} = \textrm{np.zeros((1,1))}$
    - As long as $w$ is initialized randomly, the *hidden units* start from different weights and therefore learn different functions

- **What happens if $w$ is initialized with very large values?**
    - the pre-activation value $z$ will be large in magnitude, so it is likely to fall in the saturated zone of the activation function
    - for *sigmoid* and *tanh* functions, the *slope* in this zone is close to 0
    - the gradients are therefore tiny, resulting in a slow *learning process* (hence the factor 0.01 above)
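
The effect of zero versus random initialization can be observed on a tiny network with two inputs, two hidden units and one output. This is an illustrative sketch only: the helper `train_hidden_weights`, the *sigmoid* hidden activation, the learning rate, the step count, and the toy data are all assumptions. With zero initialization the two rows of the hidden weight matrix (one row per hidden unit) stay identical after training; with small random initialization they do not.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_hidden_weights(W1, W2, X, Y, alpha=0.5, steps=50):
    """Run gradient descent on a 2-2-1 sigmoid network and return the final W1."""
    W1, W2 = W1.copy(), W2.copy()
    b1, b2 = np.zeros((2, 1)), np.zeros((1, 1))
    m = X.shape[1]
    for _ in range(steps):
        A1 = sigmoid(W1 @ X + b1)                      # (2, m)
        A2 = sigmoid(W2 @ A1 + b2)                     # (1, m)
        dZ2 = A2 - Y
        dZ1 = (W2.T @ dZ2) * A1 * (1.0 - A1)           # uses W2 before its update
        W2 -= alpha * (dZ2 @ A1.T) / m
        b2 -= alpha * np.sum(dZ2, axis=1, keepdims=True) / m
        W1 -= alpha * (dZ1 @ X.T) / m
        b1 -= alpha * np.sum(dZ1, axis=1, keepdims=True) / m
    return W1

# Hypothetical toy data: label is 1 when x1 + x2 > 0.
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 100))
Y = (X[0:1, :] + X[1:2, :] > 0).astype(float)

# Zero initialization: both hidden units receive identical updates.
W1_zero = train_hidden_weights(np.zeros((2, 2)), np.zeros((1, 2)), X, Y)

# Small random initialization, as in the notes.
np.random.seed(0)
W1_rand = train_hidden_weights(0.01 * np.random.randn(2, 2),
                               0.01 * np.random.randn(1, 2), X, Y)

print(np.allclose(W1_zero[0], W1_zero[1]))   # rows identical: symmetry not broken
print(np.allclose(W1_rand[0], W1_rand[1]))   # rows differ: symmetry broken
```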