## Neural Networks

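- A neural network stacks multiple groups of sigmoid (logistic) units; each group is known as a layer
    - $z^{[i]} = W^{[i]}a^{[i-1]} + b^{[i]}$, where $i$ is the layer index and $a^{[i-1]}$ is the output of the previous layer (with $a^{[0]} = x$)
    - Superscript square brackets refer to layers; round brackets refer to training examples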
    
### Representation

- Input layer, hidden layer, output layer
- The training set contains only the inputs $x$ and outputs $y$; the middle layer is called "hidden" because the values of its nodes do not appear in the training set
- Each layer will generate a set of activations, denoted by $a^{[i]}$
    - The hidden layer has its own parameters, $W^{[i]}$ and $b^{[i]}$
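
To make the parameter shapes concrete, here is a minimal sketch of the parameters for a network with one hidden layer. The layer sizes (3 inputs, 4 hidden units, 1 output), the variable names, and the small random initialization are assumptions chosen for illustration; they are not specified in these notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes: 3 input features, 4 hidden units, 1 output unit
n_x, n_h, n_y = 3, 4, 1

# Layer 1 (hidden): one row of W1 per hidden unit, one bias per hidden unit
W1 = rng.standard_normal((n_h, n_x)) * 0.01
b1 = np.zeros((n_h, 1))

# Layer 2 (output): one row of W2 per output unit
W2 = rng.standard_normal((n_y, n_h)) * 0.01
b2 = np.zeros((n_y, 1))

for name, p in [("W1", W1), ("b1", b1), ("W2", W2), ("b2", b2)]:
    print(name, p.shape)
```

Each $W^{[i]}$ has as many rows as there are units in layer $i$ and as many columns as there are units in layer $i-1$.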
    

### Computation

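- Stack the transposed parameter vectors $w^{[i]}_{j}$ of the nodes in layer $i$ as the rows of a matrix $W^{[i]}$, multiply it by the layer's input and add the bias to obtain the vector $z^{[i]}$, then apply the activation function element-wise to obtain the vector of activations $a^{[i]}$
    - $z^{[i]} = W^{[i]}a^{[i-1]} + b^{[i]}$, where $a^{[0]} = x$
    - $W^{[i]} = \begin{bmatrix} w^{[i]T}_{1} \\ w^{[i]T}_{2} \\ w^{[i]T}_{3} \end{bmatrix}$ (shown for a layer with three nodes)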
    - $a^{[i]} = \sigma(z^{[i]})$
- The above applies to a single training example
- For multiple training examples, we apply the same neural network computation to each example, i.e. $x^{(1)} \longrightarrow a^{[2](1)} = \hat{y}^{(1)}$, $x^{(2)} \longrightarrow a^{[2](2)} = \hat{y}^{(2)}$, etc.
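
The computation above can be sketched in NumPy as follows. This is an illustrative sketch, not a reference implementation: the function name `forward`, the layer sizes, and the random parameters are assumptions. It also shows one common way to handle several examples at once, by stacking them as the columns of a matrix `X`; the notes only describe the per-example form.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Forward pass of a 2-layer network; x has shape (n_x, m), one column per example."""
    z1 = W1 @ x + b1   # shape (n_h, m); b1 is broadcast across columns
    a1 = sigmoid(z1)   # hidden-layer activations
    z2 = W2 @ a1 + b2  # shape (n_y, m)
    a2 = sigmoid(z2)   # output-layer activations, i.e. the predictions
    return a2

rng = np.random.default_rng(0)
n_x, n_h, n_y, m = 3, 4, 1, 5  # hypothetical sizes: m is the number of examples

W1, b1 = rng.standard_normal((n_h, n_x)), np.zeros((n_h, 1))
W2, b2 = rng.standard_normal((n_y, n_h)), np.zeros((n_y, 1))
X = rng.standard_normal((n_x, m))  # columns are x^(1), ..., x^(m)

# One example at a time: x^(j) -> a^[2](j)
per_example = np.hstack([forward(X[:, [j]], W1, b1, W2, b2) for j in range(m)])

# All examples at once, with the examples stacked as columns
batched = forward(X, W1, b1, W2, b2)

# The two results should agree up to floating-point rounding
print(np.allclose(per_example, batched))
```

Stacking examples as columns avoids the explicit Python loop, because the matrix product applies the same weights to every column.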

### Activation Functions

- Some of the possible activations that can be used:

    - Sigmoid function ($\sigma$)
        - $\sigma(z) = \dfrac{1}{1+e^{-z}}$
        
    - Hyperbolic tangent function, $\tanh$
        - $\tanh(z) = \dfrac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$
        - Almost always works better than the sigmoid function in hidden layers, since its outputs lie in $(-1, 1)$ with a mean close to 0, so the data passed to the next layer is more centered
        
    - ReLU (Rectified Linear Unit) function
        - $a = \max(0, z)$
        - Commonly used as the default activation function for hidden layers
        
    - Leaky ReLU function
        - $a = \max(0.01z, z)$
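
The four activation functions listed above can be written in NumPy roughly as follows. The function names and the `slope` parameter are illustrative choices rather than part of any library API; `np.tanh` and `np.maximum` are standard NumPy functions.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: 1 / (1 + e^{-z}), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """Hyperbolic tangent, applied element-wise."""
    return np.tanh(z)

def relu(z):
    """Rectified Linear Unit: max(0, z), applied element-wise."""
    return np.maximum(0.0, z)

def leaky_relu(z, slope=0.01):
    """Leaky ReLU: max(slope * z, z); assumes 0 < slope < 1."""
    return np.maximum(slope * z, z)

z = np.array([-2.0, 0.0, 2.0])
for f in (sigmoid, tanh, relu, leaky_relu):
    print(f.__name__, f(z))
```

Note that `np.maximum` compares element-wise, unlike `np.max`, which reduces an array to its largest entry.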

```python
import numpy as np

A = np.random.randn(4, 3)
# Sum along axis 1 (across columns); keepdims=True keeps the result as a (4, 1) column vector
B = np.sum(A, axis=1, keepdims=True)

print(B.shape)  # (4, 1)
```
