## Deep Neural Networks

- Composed of multiple hidden layers
- Can learn functions that shallower neural networks cannot

### Notation

- $L$ denotes the number of layers (excluding the input layer, but including the output layer)
- $n^{[l]}$ denotes the number of units in layer $l$
    - Input layer is layer 0
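
The notation above can be made concrete with a small sketch. The layer sizes, the `init_params` helper, and the `0.01` scaling are illustrative assumptions, not part of these notes; the only facts relied on are that $W^{[l]}$ has shape $(n^{[l]}, n^{[l-1]})$ and $b^{[l]}$ has shape $(n^{[l]}, 1)$.

```python
import numpy as np

def init_params(layer_dims, seed=0):
    """layer_dims[l] holds n^[l]; layer_dims[0] is the input size n^[0]."""
    rng = np.random.default_rng(seed)
    params = {}
    num_layers = len(layer_dims) - 1  # L: the input layer is not counted
    for l in range(1, num_layers + 1):
        # W^[l] maps n^[l-1] activations to n^[l] units
        params[f"W{l}"] = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.01
        params[f"b{l}"] = np.zeros((layer_dims[l], 1))
    return params

# Hypothetical network: n^[0]=4 inputs, hidden layers of 5 and 3 units, 1 output unit
layer_dims = [4, 5, 3, 1]
L = len(layer_dims) - 1
params = init_params(layer_dims)

for l in range(1, L + 1):
    print(f"layer {l}: W{params[f'W{l}'].shape}, b{params[f'b{l}'].shape}")
```

Here `L` is 3 even though the list has four entries, because layer 0 (the input) is excluded from the count.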
    
    
### Forward Propagation

- For layer 1:
    - $z^{[1]} = W^{[1]}x + b^{[1]} = W^{[1]}a^{[0]} + b^{[1]}$
    - $a^{[1]} = g^{[1]}(z^{[1]})$
- For layer 2:
    - $z^{[2]} = W^{[2]}a^{[1]} + b^{[2]}$
    - $a^{[2]} = g^{[2]}(z^{[2]})$
- and so forth, where $W^{[l]}$ is the weight matrix and $b^{[l]}$ is the bias vector of layer $l$.
- For layer $l$: the input is $a^{[l-1]}$, the output is $a^{[l]}$, and $z^{[l]}$ is cached for use in backward propagation


#### Vectorized Implementation

- In general:
    - $Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$
    - $A^{[l]} = g^{[l]} (Z^{[l]})$
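
A minimal sketch of the vectorized forward pass follows. It assumes ReLU for hidden layers and sigmoid for the output layer, which is one common choice rather than something these notes fix; the `forward` function, the parameter dictionary layout, and the layer sizes are hypothetical.

```python
import numpy as np

def relu(Z):
    return np.maximum(0, Z)

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def forward(X, params, num_layers):
    """Apply Z^[l] = W^[l] A^[l-1] + b^[l], A^[l] = g^[l](Z^[l]) for l = 1..L."""
    A = X  # A^[0] is the input, one example per column
    caches = []
    for l in range(1, num_layers + 1):
        W, b = params[f"W{l}"], params[f"b{l}"]
        Z = W @ A + b            # b has shape (n^[l], 1) and broadcasts over the m columns
        caches.append((A, Z))    # keep A^[l-1] and Z^[l] for backward propagation
        # Assumed activations: sigmoid on the output layer, ReLU elsewhere
        A = sigmoid(Z) if l == num_layers else relu(Z)
    return A, caches

# Hypothetical sizes: 4 inputs, hidden layers of 5 and 3 units, 1 output, m = 10 examples
rng = np.random.default_rng(0)
dims = [4, 5, 3, 1]
params = {}
for l in range(1, len(dims)):
    params[f"W{l}"] = rng.standard_normal((dims[l], dims[l - 1])) * 0.01
    params[f"b{l}"] = np.zeros((dims[l], 1))

X = rng.standard_normal((4, 10))
AL, caches = forward(X, params, num_layers=3)
print(AL.shape)  # (1, 10): one prediction per example
```

The explicit loop over layers is expected here: the computation is vectorized across the $m$ examples, not across layers.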
    
    
### Backward Propagation

- Input $da^{[l]}$, outputs $da^{[l-1]}, dW^{[l]}, db^{[l]}$
    - $dz^{[l]} = da^{[l]} * g^{[l]'} (z^{[l]})$ (element-wise product)
    - $dW^{[l]} = dz^{[l]} \cdot a^{[l-1]T}$
    - $db^{[l]} = dz^{[l]}$
    - $da^{[l-1]} = W^{[l]T} \cdot dz^{[l]}$

#### Vectorized Implementation

- In general:
    - $dZ^{[l]} = dA^{[l]} * g^{[l]'} (Z^{[l]})$ (element-wise product)
    - $dW^{[l]} = \dfrac{1}{m} dZ^{[l]} \cdot A^{[l-1]T}$
    - $db^{[l]} = \dfrac{1}{m} \text{np.sum}(dZ^{[l]}, \text{axis}=1, \text{keepdims}=\text{True})$
    - $dA^{[l-1]} = W^{[l]T} \cdot dZ^{[l]}$
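
The backward step for a single layer can be sketched as below. The `layer_backward` function, the use of tanh as the activation, and the all-ones upstream gradient are assumptions chosen to keep the example small; any activation with a known derivative could be passed in instead.

```python
import numpy as np

def layer_backward(dA, Z, A_prev, W, g_prime):
    """Given dA^[l], return dA^[l-1], dW^[l], db^[l] for one layer."""
    m = A_prev.shape[1]                          # number of examples
    dZ = dA * g_prime(Z)                         # element-wise product
    dW = (dZ @ A_prev.T) / m                     # same shape as W
    db = np.sum(dZ, axis=1, keepdims=True) / m   # same shape as b
    dA_prev = W.T @ dZ                           # passed on to layer l-1
    return dA_prev, dW, db

def tanh_prime(Z):
    # Derivative of tanh, the activation assumed for this example
    return 1.0 - np.tanh(Z) ** 2

# Hypothetical layer: 3 inputs, 2 units, m = 5 examples
rng = np.random.default_rng(1)
n_prev, n, m = 3, 2, 5
A_prev = rng.standard_normal((n_prev, m))
W = rng.standard_normal((n, n_prev))
b = rng.standard_normal((n, 1))
Z = W @ A_prev + b               # cached during forward propagation

dA = np.ones((n, m))             # toy upstream gradient
dA_prev, dW, db = layer_backward(dA, Z, A_prev, W, tanh_prime)
print(dW.shape, db.shape, dA_prev.shape)  # (2, 3) (2, 1) (3, 5)
```

Each gradient has the same shape as the quantity it corresponds to, which is a quick sanity check when wiring layers together.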
    
    
### Hyperparameters

- Parameters are $W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}, W^{[3]}, b^{[3]}, \cdots$
- Hyperparameters are settings that control how the parameters $W$ and $b$ are learned, such as:
    - Learning rate, $\alpha$
    - Number of iterations
    - Number of hidden layers (which determines $L$)
    - Number of hidden units $n^{[1]}, n^{[2]}$, etc
    - Choice of activation function (ReLU, sigmoid, tanh)
    
- Deep learning is a very empirical process: a lot of trial and error is needed to create a good model
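
To illustrate the split between the two, the sketch below groups the hyperparameters into one object and shows how the architecture choices fix the number of learned parameters. The `Hyperparameters` class, its default values, and `count_parameters` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

@dataclass
class Hyperparameters:
    # Illustrative defaults only; good values are found by experimentation
    learning_rate: float = 0.01
    num_iterations: int = 1000
    hidden_units: tuple = (5, 3)   # n^[1], n^[2]
    activation: str = "relu"       # or "sigmoid", "tanh"

def count_parameters(n_input, n_output, hp):
    """Total number of entries across all W^[l] and b^[l]."""
    dims = [n_input, *hp.hidden_units, n_output]
    return sum(dims[l] * dims[l - 1] + dims[l] for l in range(1, len(dims)))

hp = Hyperparameters()
total = count_parameters(n_input=4, n_output=1, hp=hp)
print(total)  # 47 = (5*4 + 5) + (3*5 + 3) + (1*3 + 1)

# Trying a different architecture changes what has to be learned
wider = Hyperparameters(hidden_units=(10,))
print(count_parameters(4, 1, wider))  # 61 = (10*4 + 10) + (1*10 + 1)
```

The hyperparameters are set before training, while the values inside $W$ and $b$ are what gradient descent then adjusts.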