# Neural Networks
Src: http://neuralnetworksanddeeplearning.com/chap1.html

## Why Now?
Scale drives deep learning progress:
- Data
- Computation
- Algorithms

## One-layer neural network
$\hat{y} = \sigma(w^Tx + b)$, where $\sigma(z) = \frac{1}{1 + e^{-z}}$

Given $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})\}$, we want $\hat{y}^{(i)} \approx y^{(i)}$.
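As a minimal sketch of the formula $\hat{y} = \sigma(w^Tx + b)$, the snippet below computes the prediction for a single example in plain Python. The weights, input, and bias are hypothetical values chosen only so the arithmetic is easy to follow.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x, b):
    """Compute y_hat = sigmoid(w^T x + b) for one example."""
    z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    return sigmoid(z)

# Hypothetical parameters and input (not from the notes).
w = [0.5, -0.25]
x = [2.0, 4.0]
b = 0.0
y_hat = predict(w, x, b)  # z = 1.0 - 1.0 + 0.0 = 0, so y_hat = 0.5
```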

### Loss function
"How well are we doing on a single training example?"

First idea: $L(\hat{y}, y) = \frac{1}{2}(\hat{y}-y)^2$. However, we don't usually do this: combined with the sigmoid, squared error makes the optimization problem over the parameters non-convex, so gradient descent may get stuck in a local minimum.

We use instead: $L(\hat{y},y) = -(y\log(\hat{y}) + (1-y)\log(1-\hat{y}))$.

If $y=1$, the loss function becomes $L(\hat{y},y) = -\log(\hat{y})$.
- $\hat{y} = 1$: $L(\hat{y},y) = 0$
- $\hat{y} \approx 0$: $L(\hat{y},y)$ is very large
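The behaviour of the cross-entropy loss for $y=1$ can be checked numerically. This is a small sketch; the function name `cross_entropy_loss` and the sample predictions are illustrative assumptions, and $\hat{y}$ must lie strictly between 0 and 1 to keep the logarithms finite.

```python
import math

def cross_entropy_loss(y_hat, y):
    """L(y_hat, y) = -(y log(y_hat) + (1 - y) log(1 - y_hat)), with 0 < y_hat < 1."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# With y = 1 the loss reduces to -log(y_hat).
loss_good = cross_entropy_loss(0.99, 1)  # confident and correct: close to 0
loss_bad = cross_entropy_loss(0.01, 1)   # confident and wrong: large
print(loss_good, loss_bad)
```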

### Cost function
"How well are we doing on the whole training set?"

$J(w,b) = \frac{1}{m}\sum_{i=1}^mL(\hat{y}^{(i)},y^{(i)})$

## Gradient Descent
J is a convex function. Any local minimum of a convex function is also a global minimum. A strictly convex function will have at most one global minimum.

repeat {
    $w := w - \alpha\frac{dJ(w)}{dw}$
}
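The update rule can be sketched on a toy one-dimensional problem. The cost $J(w) = (w-3)^2$, the learning rate, and the step count below are assumptions picked for illustration, not values from these notes.

```python
def gradient_descent(grad, w0, alpha=0.1, steps=100):
    """Repeat w := w - alpha * dJ/dw, starting from w0."""
    w = w0
    for _ in range(steps):
        w = w - alpha * grad(w)
    return w

# Toy convex cost J(w) = (w - 3)^2 with dJ/dw = 2 (w - 3); its minimum is at w = 3.
w_star = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_star)  # approaches 3
```

Because the toy cost is convex, the iterates move toward the single global minimum for any sufficiently small $\alpha$.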

$L(a,y) = -(y\log(a) + (1-y)\log(1-a))$, where $a = \sigma(z)$

$\frac{dL(a,y)}{da} = -\frac{y}{a} + \frac{1-y}{1-a}$

$\frac{dL(a,y)}{dz} = (-\frac{y}{a} + \frac{1-y}{1-a}) * \frac{da}{dz}$, and we have that:

$\frac{da}{dz} = \frac{d}{dz}\frac{1}{1+e^{-z}} = \frac{e^{-z}}{(1+e^{-z})^{2}} = \frac{1 + e^{-z} - 1}{(1+e^{-z})^{2}} = \frac{1}{1 + e^{-z}}(1 - \frac{1}{1 + e^{-z}}) = a(1 - a)$

So: $\frac{dL(a,y)}{dz} = (-\frac{y}{a} + \frac{1-y}{1-a}) * a(1-a) = a - y$
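The result $\frac{dL}{dz} = a - y$ is what makes the gradient of the cost cheap to compute. The sketch below uses it in batch gradient descent for the one-layer network. The toy data set, learning rate, and iteration count are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, X, y):
    """J(w, b): mean cross-entropy loss over the m examples (rows of X)."""
    a = sigmoid(X @ w + b)
    return -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))

def train(X, y, alpha=0.5, steps=200):
    """Batch gradient descent using dL/dz = a - y."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(steps):
        a = sigmoid(X @ w + b)
        dz = a - y                    # dL/dz for every example
        w -= alpha * (X.T @ dz) / m   # dJ/dw = (1/m) * sum_i dz_i * x_i
        b -= alpha * dz.mean()        # dJ/db = (1/m) * sum_i dz_i
    return w, b

# Hypothetical toy data: the label is 1 when the single feature is positive.
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w, b = train(X, y)
print(cost(w, b, X, y))  # lower than the initial cost log(2)
```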


## Perceptron
A perceptron takes several binary inputs, $x_1,x_2,...$, and produces a single binary output. 

To compute the output, we use the concept of **weights**, $w_1,w_2,...$, real numbers expressing the **importance** of the respective inputs to the output.

The neuron's output, 0 or 1, is determined by whether the weighted sum $\sum_jw_jx_j$ is less than or greater than some threshold value:

$output = \left \{ \begin{matrix} 0, & \mbox{if }\sum_jw_jx_j \leq threshold \\ 1, & \mbox{if }\sum_jw_jx_j \gt threshold \end{matrix} \right.$

Now, instead of writing $\sum_jw_jx_j$, we will simply write the dot product $w \cdot x$. Also, we will move the threshold to the other side of the inequality and define the bias $b = -threshold$.

$output = \left \{ \begin{matrix} 0, & \mbox{if } w \cdot x + b \leq 0 \\ 1, & \mbox{if } w \cdot x + b \gt 0 \end{matrix} \right.$

You can think of the bias as a measure of **how easy it is to get the perceptron to output a 1**.

If the bias is very negative, the threshold is very high, which means it's difficult for the perceptron to output a 1.
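A perceptron is short enough to write out directly. In this sketch the weights $(-2, -2)$ and bias $3$ are illustrative values that happen to make the unit behave like a NAND gate; they are an assumption for the example, not something stated in these notes.

```python
def perceptron(w, x, b):
    """Return 1 if w . x + b > 0, else 0."""
    total = sum(w_j * x_j for w_j, x_j in zip(w, x))
    return 1 if total + b > 0 else 0

# Illustrative weights and bias: the unit computes NAND of its two binary inputs.
w, b = [-2, -2], 3
outputs = [perceptron(w, [x1, x2], b) for x1 in (0, 1) for x2 in (0, 1)]
print(outputs)  # inputs (0,0), (0,1), (1,0), (1,1) give [1, 1, 1, 0]

# A very negative bias makes it hard to output 1, whatever the inputs are.
always_off = [perceptron(w, [x1, x2], -10) for x1 in (0, 1) for x2 in (0, 1)]
```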

### Input Layer

The input layer is the first layer; it encodes the inputs. We can think of the input perceptrons as not really being perceptrons at all, but rather special units which are simply defined to output the desired values.

## Sigmoid Neurons
### How to train a network?
What we want is that when we make a small change in the weights, we can observe a small change in the output. Then, we can use this fact to modify the weights and bias in order to make the network better.

However, if the network contains perceptrons, a small change in the weights or bias of any perceptron can sometimes cause the output of that perceptron to completely flip, from 0 to 1 or from 1 to 0.

### What is a Sigmoid neuron?
Sigmoid neurons are similar to perceptrons, but modified so that small changes in their weights and bias cause only a small change in their output. Now, instead of binary inputs, we will have continuous inputs between 0 and 1. The output is now $\sigma(w \cdot x + b)$, where

$\sigma(z)=\frac{1}{1 + e^{-z}}$

If $z$ is large and positive, the output will be approximately 1. If $z$ is very negative, the output will be close to 0.

If instead of a sigmoid function we had used a step function, that neuron would be a perceptron.

### Calculating $\Delta$output

$\Delta output \approx \sum_j\frac{\partial output}{\partial w_j}\Delta w_j + \frac{\partial output}{\partial b}\Delta b$

That means that $\Delta output$ is a linear function of the changes $\Delta w_j$ and $\Delta b$. This linearity makes it easy to choose small changes in the weights and bias that move the output in the desired direction.
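The linear approximation can be checked numerically for a single sigmoid neuron. The weights, input, and perturbations below are hypothetical; the partial derivatives use the fact that, for a sigmoid neuron, $\frac{\partial output}{\partial w_j} = output(1 - output)x_j$ and $\frac{\partial output}{\partial b} = output(1 - output)$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(w, x, b):
    """Output of a sigmoid neuron: sigmoid(w . x + b)."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Hypothetical weights, input, and small perturbations.
w, x, b = [0.4, -0.6], [1.0, 0.5], 0.1
dw, db = [0.001, 0.002], 0.0005

out = neuron(w, x, b)
slope = out * (1 - out)  # sigma'(z) at the current z

# Linear prediction: sum_j (d out / d w_j) * dw_j + (d out / d b) * db
predicted = sum(slope * xj * dwj for xj, dwj in zip(x, dw)) + slope * db

# Actual change after applying the perturbations.
w_new = [wj + dwj for wj, dwj in zip(w, dw)]
actual = neuron(w_new, x, b + db) - out
print(predicted, actual)  # the two values agree closely
```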

## Activation functions

### Step function
Use cases and problems:

- Binary classifier ("yes" or "no", activate or not activate): a step function can do that.

- Multi-class classifier (class1, class2, class3, etc.): if more than one neuron is "activated", they all output 1 and there is no way to tell which class is more likely.

### Sigmoid function
$\sigma(z)=\frac{1}{1 + e^{-z}}$

Advantages:
- The output of the activation function is always going to be in range (0,1).
- It is nonlinear in nature.
- Combinations of this function are also nonlinear! Great!!

Problems:
- Towards either end of the sigmoid function, $\sigma(x)$ responds much less to changes in $x$ (the function saturates).

- The problem of "vanishing gradients": in those regions the gradient is extremely small, so the weights barely change and learning stalls.

### Tanh function
$tanh(x) = \frac{2}{1 + e^{-2x}} - 1$

Advantages:
- The output of the activation function is always going to be in range (-1,1).
- It is nonlinear in nature.
- Combinations of this function are also nonlinear! Great!!

Problems:
- Also has the "vanishing gradients" problems
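To make the vanishing-gradient problem concrete, this sketch evaluates the derivatives of sigmoid and tanh at a few points. It relies on the standard identities $\sigma'(x) = \sigma(x)(1-\sigma(x))$ and $tanh'(x) = 1 - tanh^2(x)$; the sample points are arbitrary.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """sigma'(x) = sigma(x) * (1 - sigma(x)); its maximum is 0.25 at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    """tanh'(x) = 1 - tanh(x)^2; its maximum is 1 at x = 0."""
    return 1.0 - math.tanh(x) ** 2

# Both gradients shrink quickly as |x| grows.
for x in (0.0, 2.0, 10.0):
    print(x, sigmoid_grad(x), tanh_grad(x))
```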

### ReLU function
$ReLU(x) = max(0,x)$

If you don’t know the nature of the function you are trying to learn, start with ReLU.

Advantages:
- It gives an output $x$ if $x$ is positive and 0 otherwise. The range is $[0, \infty)$.
- It is nonlinear in nature. Combinations of this function are also nonlinear!
- Sparsity of the activation!

Problems:
- Because ReLU is a horizontal line for negative $x$, the gradient there is exactly 0.
- "Dying ReLU problem": neurons stuck in that region stop updating and no longer respond, which can make a substantial part of the network passive.

### Leaky ReLU function
$LeakyReLU(x) = \left \{ \begin{matrix} x, & \mbox{if } x > 0 \\ 0.01x, & \mbox{otherwise }  \end{matrix} \right.$

Advantages:
- It gives an output $x$ if $x$ is positive and $0.01x$ otherwise, so the gradient is never exactly 0. The range is $(-\infty, \infty)$.
- (Leaky) ReLU is less computationally expensive than tanh and sigmoid because it involves simpler mathematical operations.
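Both functions and their gradients fit in a few lines. This is a sketch; the slope of 0.01 follows the definition of Leaky ReLU used here, and taking the derivative at $x = 0$ to be the negative-side value is a common convention rather than a mathematical fact.

```python
def relu(x):
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x

def relu_grad(x):
    # The derivative is undefined at x = 0; using 0 there is a convention.
    return 1.0 if x > 0 else 0.0

def leaky_relu_grad(x, slope=0.01):
    # Never exactly 0, so a unit with negative input can still be updated.
    return 1.0 if x > 0 else slope

for x in (-3.0, 0.0, 2.5):
    print(x, relu(x), relu_grad(x), leaky_relu(x), leaky_relu_grad(x))
```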

## Architecture
Layer 1 = Input layer
- $x_1, x_2, x_3$ : inputs
- $x_0$ : bias

Layer 2 = Hidden layer
- $a_1^{(2)}, a_2^{(2)}, a_3^{(2)}$ : activation of unit i in layer 2

Layer 3 = Output layer
- $a_1^{(3)}$

### Weights
$\Theta^{(j)}$: matrix of weights from layer $j$ to layer $j+1$

### Forward Propagation
$a_1^{(2)} = g(\Theta_{10}^{(1)}x_0 + \Theta_{11}^{(1)}x_1 + \Theta_{12}^{(1)}x_2 + \Theta_{13}^{(1)}x_3)$

$a_2^{(2)} = g(\Theta_{20}^{(1)}x_0 + \Theta_{21}^{(1)}x_1 + \Theta_{22}^{(1)}x_2 + \Theta_{23}^{(1)}x_3)$

$a_3^{(2)} = g(\Theta_{30}^{(1)}x_0 + \Theta_{31}^{(1)}x_1 + \Theta_{32}^{(1)}x_2 + \Theta_{33}^{(1)}x_3)$

$h_{\Theta}(x) = a_1^{(3)} = g(\Theta_{10}^{(2)}a_0^{(2)} + \Theta_{11}^{(2)}a_1^{(2)} + \Theta_{12}^{(2)}a_2^{(2)} + \Theta_{13}^{(2)}a_3^{(2)})$
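The four equations above are two matrix-vector products. The sketch below assumes the usual convention that the bias units are fixed to 1 ($x_0 = a_0^{(2)} = 1$), takes $g$ to be the sigmoid, and uses random weights purely as placeholders.

```python
import numpy as np

def g(z):
    """Sigmoid activation (assumed here; any activation could be substituted)."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Theta1, Theta2):
    """Theta1 is 3x4 (layer 1 -> 2) and Theta2 is 1x4 (layer 2 -> 3)."""
    a1 = np.concatenate(([1.0], x))    # prepend bias unit x_0 = 1
    a2 = g(Theta1 @ a1)                # a_1^(2), a_2^(2), a_3^(2)
    a2 = np.concatenate(([1.0], a2))   # prepend bias unit a_0^(2) = 1
    return g(Theta2 @ a2)              # h_Theta(x) = a_1^(3)

# Placeholder weights and input.
rng = np.random.default_rng(0)
Theta1 = rng.normal(size=(3, 4))
Theta2 = rng.normal(size=(1, 4))
h = forward(np.array([0.5, -1.0, 2.0]), Theta1, Theta2)
print(h)  # a single value in (0, 1)
```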

### Selecting the Architecture
- Input units: dimensionality of the problem (features x)
- Output units: Number of classes
- Hidden units (per layer):
    - Usually, the more the better, at the price of more computation
    - Good start: a number **close to the number of inputs**
    - Default: 1 hidden layer. If you have >1 hidden layer, a reasonable default is to use the **same number of units in every hidden layer**

## Training a Neural Network
### The training process
1. Initialize the parameters

2. Choose an optimization algorithm

3. Repeat these steps:
    - Forward propagate an input
    - Compute the cost function
    - Compute the gradients of the cost with respect to parameters using backpropagation
    - Update each parameter using the gradients, according to the optimization algorithm


### Step 1 - Random Initialization
Src: https://www.deeplearning.ai/ai-notes/initialization/

The first step of training a network is initializing the weights. 

- Initializing all the weights **with zeros** leads the neurons to **learn the same features** during training.

"Consider a neural network with two hidden units, and assume we initialize all the biases to 0 and the weights with some constant $\alpha$. If we forward propagate an input $(x_1,x_2)$ in this network, the output of both hidden units will be $relu(\alpha x_1 + \alpha x_2)$. Thus, both hidden units will have identical influence on the cost, which will lead to identical gradients. Thus, **both neurons will evolve symmetrically** throughout training, effectively preventing different neurons from learning different things."

- Initializing the weights with values that are too small leads to slow learning, and values that are too big lead to divergence. (read: vanishing/exploding gradients)
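The symmetry problem with constant initialization can be seen in a few lines. This sketch assumes a tiny network with two inputs, two ReLU hidden units, a linear output, and squared-error loss; the output layer, loss, and numeric values are assumptions chosen to keep the arithmetic simple.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

alpha = 0.5
W1 = np.full((2, 2), alpha)   # hidden layer: every weight equals alpha
b1 = np.zeros(2)
w2 = np.full(2, alpha)        # output weights, also constant
x, y = np.array([1.0, 2.0]), 1.0

hidden = relu(W1 @ x + b1)    # both units output relu(alpha*x1 + alpha*x2)
y_hat = w2 @ hidden           # linear output unit (assumption)

# Backpropagate the squared error 0.5 * (y_hat - y)^2 to the hidden weights.
delta_hidden = (y_hat - y) * w2 * (hidden > 0)
grad_W1 = np.outer(delta_hidden, x)
print(hidden)    # identical activations
print(grad_W1)   # identical rows: both units receive the same update
```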

#### Rules of thumb
- The mean of the activations should be zero.
- The variance of the activations should stay the same across every layer.

Considering a layer $l$, the forward propagation equations can be written as:

$a^{[l-1]} = g^{[l-1]}(z^{[l-1]})$

$z^{[l]} = W^{[l]}a^{[l-1]} + b^{[l]}$

$a^{[l]} = g^{[l]}(z^{[l]})$

What we want is: $E[a^{[l-1]}] = E[a^{[l]}]$ and $var[a^{[l-1]}] = var[a^{[l]}]$


#### Xavier initialization
$W^{[l]} \sim \mathcal{N}(\mu = 0, \sigma^2 = \frac{1}{n^{[l-1]}})$, where $n^{[l-1]}$ is the number of neurons in layer $l-1$

$b^{[l]} = 0$
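A possible NumPy sketch of this initialization follows. The helper name `xavier_init`, the layer sizes, and the seed are assumptions made for the example.

```python
import numpy as np

def xavier_init(layer_sizes, seed=0):
    """Return (weights, biases) with W[l] ~ N(0, 1 / n[l-1]) and b[l] = 0."""
    rng = np.random.default_rng(seed)
    weights, biases = [], []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        std = np.sqrt(1.0 / n_in)  # standard deviation = sqrt(variance)
        weights.append(rng.normal(0.0, std, size=(n_out, n_in)))
        biases.append(np.zeros((n_out, 1)))
    return weights, biases

# Hypothetical layer sizes: 784 inputs, 256 hidden units, 10 outputs.
weights, biases = xavier_init([784, 256, 10])
print(weights[0].var() * 784)  # close to 1: the sample variance is about 1/784
```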

### Step 2  - Feed Forward
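Now, we need to forward propagate our inputs in order to get the outputs.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

# Method of a network class that stores self.biases and self.weights.
def feedforward(self, a):
    """Return the output of the network if "a" is input."""
    # Apply each layer in turn: a <- sigmoid(w a + b).
    for b, w in zip(self.biases, self.weights):
        a = sigmoid(np.dot(w, a) + b)
    return a
```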
        
Given one training example $(x,y)$, forward propagation computes:

$a^{(1)} = x$

$z^{(2)} = \Theta^{(1)}a^{(1)}$

$a^{(2)} = g(z^{(2)})$ (add bias $a_0^{(2)}$)

$z^{(3)} = \Theta^{(2)}a^{(2)}$

$a^{(3)} = g(z^{(3)})$ (add bias $a_0^{(3)}$)

$z^{(4)} = \Theta^{(3)}a^{(3)}$

$a^{(4)} = g(z^{(4)}) = h_{\Theta}(x)$ 

### Step 3 - Calculate Loss Function
We compare the outputs obtained in the previous step to the desired outputs by calculating the loss function.

### Step 4 - Calculate the Derivative of the Error
$\delta_j^{(l)}$ = error of node $j$ in layer $l$.

For each output unit (layer 4 in this example):

$\delta_j^{(4)} = a_j^{(4)} - y_j$ (hypothesis output - real value of $y$)

$\delta^{(4)} = a^{(4)} - y$

### Step 5 - Backpropagate
For each hidden unit:

$\delta^{(3)} = (\Theta^{(3)})^T\delta^{(4)}.*g'(z^{(3)})$

$\delta^{(2)} = (\Theta^{(2)})^T\delta^{(3)}.*g'(z^{(2)})$

Note: for the sigmoid activation, $g'(z) = g(z)(1 - g(z))$, and therefore $g'(z^{(3)}) = a^{(3)}.*(1-a^{(3)})$

We can show that, for a single training example and ignoring regularization, $\frac{\partial}{\partial\Theta_{ij}^{(l)}}J(\Theta) = a_j^{(l)}\delta_i^{(l+1)}$

#### Backpropagation algorithm
Given training set $(x^{(1)}, y^{(1)}) ... (x^{(m)}, y^{(m)})$.

- Set $\Delta^{(l)}_{i,j} := 0$ for all $(l,i,j)$ (so each $\Delta^{(l)}$ starts as a matrix full of zeros)

For training example $t = 1$ to $m$:
1. Set $a^{(1)} := x^{(t)}$

2. Perform forward propagation to compute $a^{(l)}$ for $l=2,3,\dots,L$

3. Using $y^{(t)}$, compute $\delta^{(L)} = a^{(L)} - y^{(t)}$

4. Compute $\delta^{(L-1)}, \delta^{(L-2)},...,\delta^{(2)}$ using $\delta^{(l)} = ((\Theta^{(l)})^T \delta^{(l+1)})\ .*\ a^{(l)}\ .*\ (1 - a^{(l)})$

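5. $\Delta_{i,j}^{(l)} := \Delta_{i,j}^{(l)} + a_j^{(l)}\delta_i^{(l+1)}$ or, with vectorization, $\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)}(a^{(l)})^T$

After the loop, the unregularized gradient is $\frac{\partial}{\partial\Theta_{ij}^{(l)}}J(\Theta) = \frac{1}{m}\Delta_{ij}^{(l)}$.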
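The five steps can be sketched in NumPy as follows. The sketch assumes sigmoid activations, a cross-entropy cost (so that $\delta^{(L)} = a^{(L)} - y$), bias units fixed to 1 and stored as the first entry of each non-output activation, and no regularization. The layer sizes and data are placeholders, and dividing the accumulators by $m$ at the end gives the averaged gradient.

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, thetas):
    """Return all activations; every layer except the last gets a leading bias 1."""
    activations = [np.concatenate(([1.0], x))]
    for i, theta in enumerate(thetas):
        a = g(theta @ activations[-1])
        if i < len(thetas) - 1:
            a = np.concatenate(([1.0], a))
        activations.append(a)
    return activations

def backprop(X, Y, thetas):
    """Accumulate Delta^(l) over all examples and return the averaged gradients."""
    Deltas = [np.zeros_like(t) for t in thetas]
    for x, y in zip(X, Y):
        acts = forward(x, thetas)
        delta = acts[-1] - y                       # delta^(L) = a^(L) - y
        for l in range(len(thetas) - 1, -1, -1):
            Deltas[l] += np.outer(delta, acts[l])  # Delta^(l) += delta^(l+1) (a^(l))^T
            if l > 0:
                d = (thetas[l].T @ delta) * acts[l] * (1 - acts[l])
                delta = d[1:]                      # drop the bias unit's entry
    return [D / len(X) for D in Deltas]

# Placeholder network (2 inputs, 3 hidden units, 1 output) and data.
rng = np.random.default_rng(0)
thetas = [rng.normal(size=(3, 3)), rng.normal(size=(1, 4))]
X = np.array([[0.0, 1.0], [1.0, 0.0]])
Y = np.array([[1.0], [0.0]])
grads = backprop(X, Y, thetas)
print([grad.shape for grad in grads])  # same shapes as the weight matrices
```

Here the list index `l` is zero-based, so `thetas[0]` plays the role of $\Theta^{(1)}$.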

### How many iterations to converge?
1. It depends on the hyperparameters of the network (how many layers, how complex the nonlinear functions are).
2. It depends on the learning rate.
3. It depends on the optimization method.
4. It depends on the random initialization of the network.
5. It depends on the quality of the training set.

https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/

### Softmax classification
TODO

https://www.freecodecamp.org/news/building-a-neural-network-from-scratch/