# Neural Networks and Deep Learning

## Supervised learning

Examples
- Home features -> price: **NN**
- Ad, user info -> click on ad? (0,1): **NN**
- Image -> object class $(1 \dots 1000)$: **CNN**
- Audio -> text transcript: **RNN**
- English -> Chinese: **RNN**
- Image, radar info -> position of other cars: **Hybrid**

Data
- Structured data
- Unstructured data

## Binary classification

- $m$ training examples: $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}) \dots (x^{(m)}, y^{(m)})\}$

$X = 
\begin{bmatrix}
    \vdots & \vdots & \vdots \\
    \vdots & \vdots & \vdots \\ 
    x^{(1)} & x^{(2)} \ldots & x^{(m)} \\
    \vdots & \vdots & \vdots \\
    \vdots & \vdots & \vdots \\ 
\end{bmatrix}$

- $X \in {\rm I\!R^{n_{x}, m}}$
- $X$.shape = $(n_{x}, m)$

$Y = 
\begin{bmatrix}
    y^{(1)} & y^{(2)} \ldots & y^{(m)} \\
\end{bmatrix}$

- $Y \in {\rm I\!R^{1, m}}$
- $Y$.shape = $(1, m)$
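
As a rough illustration of the shapes above, the following sketch stacks a few made-up training examples into $X$ and $Y$ with NumPy. The example values and the variable name `examples` are assumptions for illustration only.

```python
import numpy as np

# Three hypothetical training examples, each with n_x = 2 features.
examples = [
    (np.array([0.5, 1.2]), 1),
    (np.array([1.5, -0.3]), 0),
    (np.array([-0.7, 0.8]), 1),
]

# Stack the feature vectors as columns so that X has shape (n_x, m).
X = np.column_stack([x for x, _ in examples])
# Put the labels in a single row so that Y has shape (1, m).
Y = np.array([[y for _, y in examples]])

print(X.shape)  # (2, 3)
print(Y.shape)  # (1, 3)
```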

## Logistic regression

- Given $x$, want $\hat{y} = P(y=1|x)$ (where $x \in {\rm I\!R^{n_{x}}}$ and $0 \le \hat{y} \le 1$)
- Parameters: $w \in {\rm I\!R^{n_{x}}}$, $b \in {\rm I\!R}$
- Output $\hat{y} = \sigma{(w^{T}x + b)}$

$\sigma{(z)} = \dfrac{1}{1+e^{-z}}$
- If $z$ large positive, $\sigma{(z)} \approx 1$
- If $z$ large negative, $\sigma{(z)} \approx 0$
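
A minimal NumPy sketch of the sigmoid and the resulting prediction for one example. The function name `predict_proba` is a hypothetical choice, not something defined in these notes.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid 1 / (1 + e^{-z}), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, b, x):
    """y_hat = sigmoid(w^T x + b) for one example x of shape (n_x,)."""
    return sigmoid(np.dot(w, x) + b)

print(sigmoid(0.0))    # 0.5
print(sigmoid(10.0))   # close to 1
print(sigmoid(-10.0))  # close to 0
```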

## Loss function

- $L(\hat{y}, y) = -(y\log\hat{y} + (1-y)\log(1-\hat{y}))$
- If $y = 1$, $L(\hat{y}, y) = -\log\hat{y}$ => want $\hat{y}$ as large as possible ($\hat{y} \approx 1$)
- If $y = 0$, $L(\hat{y}, y) = -\log(1-\hat{y})$ => want $\hat{y}$ as small as possible ($\hat{y} \approx 0$)

## Cost function

$J(w,b) = \dfrac{1}{m}\displaystyle\sum_{i=1}^{m}L(\hat{y}^{(i)}, y^{(i)}) = -\dfrac{1}{m}\displaystyle\sum_{i=1}^{m}\left(y^{(i)}\log\hat{y}^{(i)} + (1-y^{(i)})\log(1-\hat{y}^{(i)})\right)$
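
The cost can be sketched in NumPy as the mean of the per-example losses. The clipping constant `eps` is an assumption added here to avoid evaluating $\log(0)$; it is not part of the formula above.

```python
import numpy as np

def cross_entropy_cost(A, Y, eps=1e-12):
    """Mean cross-entropy loss; A (predictions) and Y (labels) have shape (1, m)."""
    # Assumption: clip predictions away from 0 and 1 so the logs stay finite.
    A = np.clip(A, eps, 1.0 - eps)
    losses = -(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    return float(np.mean(losses))

Y = np.array([[1, 0, 1]])
good = np.array([[0.9, 0.1, 0.8]])  # predictions close to the labels
bad = np.array([[0.1, 0.9, 0.2]])   # predictions far from the labels
print(cross_entropy_cost(good, Y) < cross_entropy_cost(bad, Y))  # True
```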

## Gradient descent

- Want $w,b$ that minimize $J(w,b)$
- $w := w - \alpha\dfrac{\partial J(w,b)}{\partial w}$
- $b := b - \alpha\dfrac{\partial J(w,b)}{\partial b}$
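
To make the update rule concrete, here is a small sketch on a one-dimensional stand-in cost $J(w) = (w-3)^2$. This toy function and the name `gradient_descent` are assumptions for illustration; the logistic regression cost is handled later in the notes.

```python
def gradient_descent(grad, w0, alpha=0.1, num_iters=100):
    """Repeatedly apply w := w - alpha * dJ/dw, starting from w0."""
    w = w0
    for _ in range(num_iters):
        w = w - alpha * grad(w)
    return w

# Toy cost J(w) = (w - 3)^2 has derivative dJ/dw = 2 (w - 3) and its minimum at w = 3.
w_min = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(round(w_min, 4))  # 3.0
```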

## Logistic regression gradient descent

- $z = w^{T}x + b$
- $\hat{y} = a = \sigma(z)$
- $L(a,y) = -(y\log(a) + (1-y)\log(1-a))$

Example: a single training example with 2 features

- Inputs: $x_{1}, x_{2}$; parameters: $w_{1}, w_{2}, b$
- $z = w_{1}x_{1} + w_{2}x_{2} + b$
- $a = \sigma(z)$
- $L(a,y)$

Derivatives

- $da = \dfrac{\partial L(a,y)}{\partial a} = -\dfrac{y}{a} + \dfrac{1-y}{1-a}$
- $dz = \dfrac{\partial L(a,y)}{\partial z} = \dfrac{\partial L(a,y)}{\partial a}\dfrac{\partial a}{\partial z} = \left(-\dfrac{y}{a} + \dfrac{1-y}{1-a}\right)a(1-a) = a - y$
- $dw_{1} = x_{1}dz$
- $dw_{2} = x_{2}dz$
- $db = dz$

Gradient updates

- $w_{1} := w_{1} - \alpha dw_{1}$
- $w_{2} := w_{2} - \alpha dw_{2}$
- $b := b - \alpha db$
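
One way to sanity-check these derivatives is to compare them against finite differences. The sketch below does this for a single made-up example, using $dz = a - y$, $dw_{1} = x_{1}dz$, $dw_{2} = x_{2}dz$ and $db = dz$. All numeric values are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w1, w2, b, x1, x2, y):
    """Logistic loss L(a, y) for one example with two features."""
    a = sigmoid(w1 * x1 + w2 * x2 + b)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

# Hypothetical values for one training example and the parameters.
x1, x2, y = 1.5, -2.0, 1.0
w1, w2, b = 0.3, 0.1, -0.2

# Analytic gradients.
a = sigmoid(w1 * x1 + w2 * x2 + b)
dz = a - y
dw1, dw2, db = x1 * dz, x2 * dz, dz

# Central finite differences as an independent check.
h = 1e-6
dw1_num = (loss(w1 + h, w2, b, x1, x2, y) - loss(w1 - h, w2, b, x1, x2, y)) / (2 * h)
dw2_num = (loss(w1, w2 + h, b, x1, x2, y) - loss(w1, w2 - h, b, x1, x2, y)) / (2 * h)
db_num = (loss(w1, w2, b + h, x1, x2, y) - loss(w1, w2, b - h, x1, x2, y)) / (2 * h)

print(np.allclose([dw1, dw2, db], [dw1_num, dw2_num, db_num], atol=1e-6))  # True
```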

## Logistic regression on m examples with 2 features

- $J=0, dw_{1} = 0, dw_{2} = 0, db = 0$
- For $i = 1$ to $m$
    - $z^{(i)} = w^{T}x^{(i)} + b$
    - $a^{(i)} = \sigma(z^{(i)})$
    - $J += -[y^{(i)}\log a^{(i)} + (1-y^{(i)})\log(1-a^{(i)})]$
    - $dz^{(i)} = a^{(i)} - y^{(i)}$
    - $dw_{1} += x_{1}^{(i)}dz^{(i)}$
    - $dw_{2} += x_{2}^{(i)}dz^{(i)}$
    - $db += dz^{(i)}$
- $J = J/m$
- $dw_{1} = dw_{1}/m$
- $dw_{2} = dw_{2}/m$
- $db = db/m$
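
The loop above translates almost line for line into Python. This is a sketch for the two-feature case only; the function name `gradients_loop` and the toy data are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradients_loop(w, b, X, Y):
    """One pass over m examples; X is (2, m), Y is (1, m), w is (2,)."""
    m = X.shape[1]
    J, dw1, dw2, db = 0.0, 0.0, 0.0, 0.0
    for i in range(m):
        x1, x2, y = X[0, i], X[1, i], Y[0, i]
        z = w[0] * x1 + w[1] * x2 + b
        a = sigmoid(z)
        J += -(y * np.log(a) + (1 - y) * np.log(1 - a))
        dz = a - y
        dw1 += x1 * dz
        dw2 += x2 * dz
        db += dz
    # Average the accumulated cost and gradients over the m examples.
    return J / m, dw1 / m, dw2 / m, db / m

# Hypothetical toy data: 3 examples with 2 features each.
X = np.array([[1.0, 2.0, -1.0], [0.5, -1.0, 2.0]])
Y = np.array([[1, 0, 1]])
J, dw1, dw2, db = gradients_loop(np.zeros(2), 0.0, X, Y)
print(J)  # equals log(2), about 0.6931, because every a is 0.5 when w = 0 and b = 0
```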

## Vectorizing logistic regression

$X = 
\begin{bmatrix}
    \vdots & \vdots & \vdots \\
    \vdots & \vdots & \vdots \\ 
    x^{(1)} & x^{(2)} \ldots & x^{(m)} \\
    \vdots & \vdots & \vdots \\
    \vdots & \vdots & \vdots \\ 
\end{bmatrix}$

$Z = 
\begin{bmatrix}
    z^{(1)} & z^{(2)} \ldots & z^{(m)} \\
\end{bmatrix} = w^{T}X + \begin{bmatrix}
    b & b \ldots & b \\
\end{bmatrix} = \begin{bmatrix}
    w^{T}x^{(1)}+b & w^{T}x^{(2)}+b \ldots & w^{T}x^{(m)}+b \\
\end{bmatrix}$ 

$dZ = 
\begin{bmatrix}
    dz^{(1)} & dz^{(2)} \ldots & dz^{(m)} \\
\end{bmatrix}$

$A = 
\begin{bmatrix}
    a^{(1)} & a^{(2)} \ldots & a^{(m)} \\
\end{bmatrix}$

$Y = 
\begin{bmatrix}
    y^{(1)} & y^{(2)} \ldots & y^{(m)} \\
\end{bmatrix}$

$dZ = A - Y =
\begin{bmatrix}
    a^{(1)} - y^{(1)} & a^{(2)} - y^{(2)} \ldots & a^{(m)} - y^{(m)} \\
\end{bmatrix}$

In summary

- $Z$ = np.dot$(w.T, X) + b$
- $A = \sigma(Z)$
- $dZ = A - Y$
- $dw = \dfrac{1}{m}XdZ^{T}$
- $db = \dfrac{1}{m}$np.sum$(dZ)$ 
- $w := w - \alpha dw$
- $b := b - \alpha db$
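
Put together, the summary might look like the following NumPy sketch. The function name `train_logistic`, the default hyperparameters, and the toy data are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, Y, alpha=0.1, num_iters=1000):
    """Vectorized gradient descent; X is (n_x, m), Y is (1, m)."""
    n_x, m = X.shape
    w = np.zeros((n_x, 1))
    b = 0.0
    for _ in range(num_iters):
        Z = np.dot(w.T, X) + b       # (1, m); the scalar b is broadcast
        A = sigmoid(Z)
        dZ = A - Y
        dw = np.dot(X, dZ.T) / m     # (n_x, 1)
        db = np.sum(dZ) / m
        w = w - alpha * dw
        b = b - alpha * db
    return w, b

# Hypothetical, linearly separable toy data: the label is 1 when x1 > x2.
X = np.array([[2.0, 0.0, 3.0, 1.0], [0.0, 2.0, 1.0, 3.0]])
Y = np.array([[1, 0, 1, 0]])
w, b = train_logistic(X, Y)
predictions = (sigmoid(np.dot(w.T, X) + b) > 0.5).astype(int)
print(predictions)  # matches Y on this toy set
```

No explicit loop over the $m$ examples remains; only the loop over gradient descent iterations is left.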

## Neural network

Example: 2-layer NN
- For $i = 1$ to $m$
    - $z^{[1](i)} = w^{[1]}x^{(i)} + b^{[1]}$
    - $a^{[1](i)} = \sigma(z^{[1](i)})$
    - $z^{[2](i)} = w^{[2]}a^{[1](i)} + b^{[2]}$
    - $a^{[2](i)} = \sigma(z^{[2](i)})$
    
Vectorizing
- $Z^{[1]} = w^{[1]}X + b^{[1]}$
- $A^{[1]} = \sigma(Z^{[1]})$
- $Z^{[2]} = w^{[2]}A^{[1]} + b^{[2]}$
- $A^{[2]} = \sigma(Z^{[2]})$

Where 
- $X$: $(n_{x}, m)$ matrix
- $Z^{[1]}$, $A^{[1]}$: $($number of hidden units, $m)$ matrices
- $Z^{[2]}$, $A^{[2]}$: $(1, m)$ matrices
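
A short sketch of the vectorized forward pass, checked against the per-example loop. The layer sizes and random inputs are arbitrary assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    """Vectorized forward pass over all m examples at once."""
    Z1 = np.dot(W1, X) + b1     # (n_h, m); b1 broadcasts across columns
    A1 = sigmoid(Z1)
    Z2 = np.dot(W2, A1) + b2    # (1, m)
    A2 = sigmoid(Z2)
    return Z1, A1, Z2, A2

# Arbitrary sizes: 3 features, 4 hidden units, 5 examples.
rng = np.random.default_rng(0)
n_x, n_h, m = 3, 4, 5
X = rng.standard_normal((n_x, m))
W1, b1 = rng.standard_normal((n_h, n_x)), np.zeros((n_h, 1))
W2, b2 = rng.standard_normal((1, n_h)), np.zeros((1, 1))

Z1, A1, Z2, A2 = forward(X, W1, b1, W2, b2)

# Per-example loop for comparison.
A2_loop = np.zeros((1, m))
for i in range(m):
    x_i = X[:, i:i + 1]                      # column vector (n_x, 1)
    a1 = sigmoid(np.dot(W1, x_i) + b1)
    A2_loop[:, i:i + 1] = sigmoid(np.dot(W2, a1) + b2)

print(A1.shape, A2.shape)        # (4, 5) (1, 5)
print(np.allclose(A2, A2_loop))  # True
```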

## Activation function

- Binary classification (output layer)? Use sigmoid
- All other cases? Use ReLU (rectified linear unit)

Why use a non-linear activation function?
- With a linear activation function, stacking layers becomes meaningless, because a composition of linear functions reduces to a single linear function
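
The collapse of stacked linear layers can be checked numerically. In this sketch the identity activation $g(z) = z$ stands in for "linear activation", and the shapes and random values are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 5))
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((1, 4)), rng.standard_normal((1, 1))

# Two layers with the identity (linear) activation g(z) = z.
two_layer = np.dot(W2, np.dot(W1, X) + b1) + b2

# The same mapping written as one linear layer:
# W2 (W1 X + b1) + b2 = (W2 W1) X + (W2 b1 + b2)
W_combined = np.dot(W2, W1)
b_combined = np.dot(W2, b1) + b2
one_layer = np.dot(W_combined, X) + b_combined

print(np.allclose(two_layer, one_layer))  # True
```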

## Gradient descent for neural networks

Parameters
- $n_{x} = n^{[0]}$: number of features
- $n^{[1]}$: number of hidden units
- $n^{[2]} = 1$: number of output units
- $w^{[1]}$: $(n^{[1]}, n^{[0]})$ matrix
- $b^{[1]}$: $(n^{[1]}, 1)$ matrix
- $w^{[2]}$: $(n^{[2]}, n^{[1]})$ matrix
- $b^{[2]}$: $(n^{[2]}, 1)$ matrix

Cost function
- $J(w^{[1]}, b^{[1]}, w^{[2]}, b^{[2]}) = \dfrac{1}{m}\displaystyle\sum_{i=1}^{m}L(\hat{y}^{(i)}, y^{(i)})$ where $\hat{y}^{(i)} = a^{[2](i)}$

Gradient descent
- Repeat
    - Compute prediction $\hat{y}^{(i)}$ for $i = 1 \dots m$
    - $dw^{[1]} = \dfrac{\partial J}{\partial w^{[1]}}$, $db^{[1]} = \dfrac{\partial J}{\partial b^{[1]}}$, $dw^{[2]} = \dfrac{\partial J}{\partial w^{[2]}}$, $db^{[2]} = \dfrac{\partial J}{\partial b^{[2]}}$
    - $w^{[1]} := w^{[1]} - \alpha dw^{[1]}$, $b^{[1]} := b^{[1]} - \alpha db^{[1]}$, $w^{[2]} := w^{[2]} - \alpha dw^{[2]}$, $b^{[2]} := b^{[2]} - \alpha db^{[2]}$
    
Backward propagation
- $dZ^{[2]} = A^{[2]} - Y$
- $dw^{[2]} = \dfrac{1}{m}dZ^{[2]}A^{[1]^{T}}$
- $db^{[2]} = \dfrac{1}{m}$np.sum$(dZ^{[2]}$, axis=1, keepdims=True$)$
- $dZ^{[1]} = w^{[2]^{T}}dZ^{[2]} * g^{[1]^{'}}(Z^{[1]})$ (where $*$ is the element-wise product)
- $dw^{[1]} = \dfrac{1}{m}dZ^{[1]}X^{T}$
- $db^{[1]} = \dfrac{1}{m}$np.sum$(dZ^{[1]}$, axis=1, keepdims=True$)$
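
A sketch of these backward-propagation equations in NumPy, with a finite-difference check on one weight. It assumes a sigmoid hidden activation, so that $g^{[1]'}(Z^{[1]}) = A^{[1]}(1 - A^{[1]})$; the function names, sizes and random data are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    A1 = sigmoid(np.dot(W1, X) + b1)
    A2 = sigmoid(np.dot(W2, A1) + b2)
    return A1, A2

def cost(A2, Y):
    return float(np.mean(-(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))))

def backward(X, Y, W2, A1, A2):
    m = X.shape[1]
    dZ2 = A2 - Y
    dW2 = np.dot(dZ2, A1.T) / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    # Assumption: sigmoid hidden layer, so g'(Z1) = A1 * (1 - A1).
    dZ1 = np.dot(W2.T, dZ2) * A1 * (1 - A1)
    dW1 = np.dot(dZ1, X.T) / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    return dW1, db1, dW2, db2

rng = np.random.default_rng(1)
X = rng.standard_normal((3, 6))
Y = (rng.random((1, 6)) > 0.5).astype(float)
W1, b1 = rng.standard_normal((4, 3)), np.zeros((4, 1))
W2, b2 = rng.standard_normal((1, 4)), np.zeros((1, 1))

A1, A2 = forward(X, W1, b1, W2, b2)
dW1, db1, dW2, db2 = backward(X, Y, W2, A1, A2)

# Finite-difference check on a single entry of W1.
h = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += h
W1_minus[0, 0] -= h
numeric = (cost(forward(X, W1_plus, b1, W2, b2)[1], Y)
           - cost(forward(X, W1_minus, b1, W2, b2)[1], Y)) / (2 * h)
print(abs(numeric - dW1[0, 0]) < 1e-6)  # True
```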

## Random initialization

- If weights are initialized to $0$, all hidden units compute the same function due to symmetry
- To break symmetry,
    - $w^{[1]}$ = np.random.randn$(n^{[1]}, n^{[0]})$ * 0.01
    - $b^{[1]}$ = np.zeros$((n^{[1]}, 1))$
    - $w^{[2]}$ = np.random.randn$(n^{[2]}, n^{[1]})$ * 0.01
    - $b^{[2]}$ = np.zeros$((n^{[2]}, 1))$
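
A possible initialization helper for a 2-layer network is sketched below. The function name `initialize_parameters`, the dictionary keys, and the use of NumPy's `default_rng` generator to draw small Gaussian values are assumptions, not something prescribed by these notes.

```python
import numpy as np

def initialize_parameters(n_x, n_h, n_y, seed=0):
    """Small random weights (to break symmetry) and zero biases."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.standard_normal((n_h, n_x)) * 0.01,
        "b1": np.zeros((n_h, 1)),
        "W2": rng.standard_normal((n_y, n_h)) * 0.01,
        "b2": np.zeros((n_y, 1)),
    }

params = initialize_parameters(n_x=3, n_h=4, n_y=1)
print({name: value.shape for name, value in params.items()})
# {'W1': (4, 3), 'b1': (4, 1), 'W2': (1, 4), 'b2': (1, 1)}
```

The biases can stay at zero because the random weights already make the hidden units differ from one another.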

## Deep L-layer neural network

Parameters
- $n^{[l]}$: number of units in layer $l$
- $A^{[l]} = g^{[l]}(Z^{[l]})$: activation in layer $l$
- $Z^{[l]} = w^{[l]}A^{[l-1]}+b^{[l]}$ 
- $w^{[l]}, b^{[l]}$: weights and bias used to compute $Z^{[l]}$

Dimensions
- $w^{[l]}$: $(n^{[l]}, n^{[l-1]})$
- $b^{[l]}$: $(n^{[l]}, 1)$
- $dw^{[l]}$: $(n^{[l]}, n^{[l-1]})$
- $db^{[l]}$: $(n^{[l]}, 1)$
- $Z^{[l]}$: $(n^{[l]}, m)$
- $A^{[l]}$: $(n^{[l]}, m)$
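
The dimension rules can be checked with a quick sketch that builds parameters for an arbitrary list of layer sizes and records the shape of each $Z^{[l]}$. The helper name `initialize_deep`, the layer sizes, and the choice of ReLU for every layer are assumptions for illustration.

```python
import numpy as np

def initialize_deep(layer_dims, seed=0):
    """layer_dims = [n0, n1, ..., nL]; returns W and b for layers 1..L."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        params[f"W{l}"] = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.01
        params[f"b{l}"] = np.zeros((layer_dims[l], 1))
    return params

layer_dims = [5, 4, 3, 1]   # hypothetical sizes n0..n3
m = 7                       # number of examples
params = initialize_deep(layer_dims)

A = np.ones((layer_dims[0], m))   # stand-in for A0 = X
z_shapes = []
for l in range(1, len(layer_dims)):
    Z = np.dot(params[f"W{l}"], A) + params[f"b{l}"]
    A = np.maximum(0, Z)          # ReLU, used here only to propagate shapes
    z_shapes.append(Z.shape)

print(z_shapes)  # [(4, 7), (3, 7), (1, 7)]
```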

## Building blocks of deep neural network

Forward and backward propagation

$a^{[0]} \rightarrow \boxed{w^{[1]},b^{[1]}} \xrightarrow{a^{[1]}} \boxed{w^{[2]},b^{[2]}} \xrightarrow{a^{[2]}} \dots \xrightarrow{a^{[L-1]}} \boxed{w^{[L]},b^{[L]}} \xrightarrow{a^{[L]}} \hat{y} \rightarrow L(\hat{y},y)$

Each block (layer) caches $z^{[l]}$, together with $w^{[l]}$ and $b^{[l]}$, for use in the backward pass

$\boxed{w^{[1]},b^{[1]},dz^{[1]}} \xleftarrow{da^{[1]}} \boxed{w^{[2]},b^{[2]},dz^{[2]}} \xleftarrow{da^{[2]}} \dots \xleftarrow{da^{[L-1]}} \boxed{w^{[L]},b^{[L]},dz^{[L]}} \leftarrow da^{[L]} = -\dfrac{y}{a}+\dfrac{1-y}{1-a}$

Each block (layer) computes $dw, db$

- $w^{[l]} := w^{[l]} - \alpha dw^{[l]}$
- $b^{[l]} := b^{[l]} - \alpha db^{[l]}$

$dA^{[L]} = \begin{bmatrix} -\dfrac{y^{(1)}}{a^{(1)}}+\dfrac{1-y^{(1)}}{1-a^{(1)}} & \dots & -\dfrac{y^{(m)}}{a^{(m)}}+\dfrac{1-y^{(m)}}{1-a^{(m)}} \end{bmatrix}$
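
The forward and backward blocks can be sketched as two small functions that pass a cache between them. The names `forward_block` and `backward_block`, the choice of ReLU for the hidden layer, and the toy sizes are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_block(A_prev, W, b, activation):
    """Compute A for one layer and cache what the backward block needs."""
    Z = np.dot(W, A_prev) + b
    A = sigmoid(Z) if activation == "sigmoid" else np.maximum(0, Z)
    return A, (A_prev, W, Z)

def backward_block(dA, cache, activation):
    """Given dA for this layer, return dA for the previous layer, dW and db."""
    A_prev, W, Z = cache
    m = A_prev.shape[1]
    if activation == "sigmoid":
        s = sigmoid(Z)
        dZ = dA * s * (1 - s)
    else:
        dZ = dA * (Z > 0)          # ReLU derivative
    dW = np.dot(dZ, A_prev.T) / m
    db = np.sum(dZ, axis=1, keepdims=True) / m
    return np.dot(W.T, dZ), dW, db

rng = np.random.default_rng(0)
layer_dims = [3, 4, 1]
activations = ["relu", "sigmoid"]
X = rng.standard_normal((3, 5))
Y = np.array([[1.0, 0.0, 1.0, 1.0, 0.0]])
params = [(rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.1,
           np.zeros((layer_dims[l], 1))) for l in range(1, len(layer_dims))]

# Forward pass: chain the blocks and keep each cache.
A, caches = X, []
for (W, b), act in zip(params, activations):
    A, cache = forward_block(A, W, b, act)
    caches.append(cache)

# Backward pass: start from dA of the last layer and walk the blocks in reverse.
dA = -(Y / A) + (1 - Y) / (1 - A)
grads = []
for cache, act in zip(reversed(caches), reversed(activations)):
    dA, dW, db = backward_block(dA, cache, act)
    grads.append((dW, db))
grads.reverse()

print([dW.shape for dW, _ in grads])  # [(4, 3), (1, 4)]
```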

## Hyperparameters

- Learning rate
- Number of iterations
- Number of layers
- Number of hidden units in each layer
- Choice of activation function
- Momentum
- Mini-batch size
- Regularization