# Neural Networks

### Backpropagation

Given a training set $\{(x^{(1)},y^{(1)}),\dots,(x^{(m)},y^{(m)})\}$:    

Set $\Delta_{i,j}^{(l)} := 0$ for all $(l,i,j)$, so that every $\Delta^{(l)}$ starts as a matrix of zeros.    

For training example t = 1 to m:

1. Set $a^{(1)} := x^{(t)}$  
<br>
2. Perform forward propagation to compute $a^{(l)}$ for l=2,3,...,L  
<br>
3. Using $y^{(t)}$, compute $\delta^{(L)} = a^{(L)}-y^{(t)}$

    Where L is our total number of layers and $a^{(L)}$ is the vector of outputs of the activation units for the last layer. So our "error values" for the last layer are simply the differences of our actual results in the last layer and the correct outputs in y. To get the delta values of the layers before the last layer, we can use an equation that steps us back from right to left.  
<br>
4. Compute $\delta^{(L-1)},...,\delta^{(2)}$ using $\delta^{(l)}=((\Theta^{(l)})^{T}\delta^{(l+1)})\ .*\ a^{(l)}\ .*\ (1-a^{(l)})$

    The delta values of layer *l* are calculated by multiplying the delta values of the next layer by the transpose of the theta matrix of layer *l*. We then multiply the result element-wise by *g'*, or g-prime, which is the derivative of the activation function *g* evaluated at the input values $z^{(l)}$.

    For the sigmoid activation, the g-prime derivative term can be written out as: $g'(z^{(l)})=a^{(l)}\ .*\ (1-a^{(l)})$  
<br>
5. $\Delta_{i,j}^{(l)} := \Delta_{i,j}^{(l)}+a_{j}^{(l)}\delta_{i}^{(l+1)}$ or with vectorization, $\Delta^{(l)} := \Delta^{(l)}+\delta^{(l+1)}(a^{(l)})^{T}$

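    This accumulates each example's contribution into the $\Delta$ matrices.

After all $m$ examples have been processed, compute:

- $D_{i,j}^{(l)} := \frac{1}{m}(\Delta_{i,j}^{(l)}+\lambda\Theta_{i,j}^{(l)})$, if $j\neq0$.
- $D_{i,j}^{(l)} := \frac{1}{m}\Delta_{i,j}^{(l)}$, if $j=0$.

The matrices $D^{(l)}$ hold the partial derivatives of the cost function: $\frac{\partial}{\partial\Theta_{i,j}^{(l)}}J(\Theta)=D_{i,j}^{(l)}$.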
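The procedure above can be sketched in NumPy roughly as follows. This is a minimal illustration, not a reference implementation: it assumes sigmoid activations in every layer, a bias unit stored at index 0 of each activation vector (so column 0 of each $\Theta^{(l)}$ is the unregularized $j=0$ column), and the names `sigmoid` and `backprop_gradients` are hypothetical. Python lists are 0-indexed, so `thetas[0]` plays the role of $\Theta^{(1)}$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_gradients(thetas, X, Y, lam=0.0):
    """Return the list of D matrices, one per Theta matrix.

    thetas: list of arrays; thetas[l] has shape (s_next, s_current + 1).
    X: (m, n) inputs. Y: (m, K) targets. lam: regularization strength.
    """
    m = X.shape[0]
    Deltas = [np.zeros_like(T) for T in thetas]      # all Delta := 0

    for t in range(m):
        # Steps 1-2: forward propagation, keeping every activation vector.
        a = np.concatenate(([1.0], X[t]))            # a^(1) with bias unit
        activations = [a]
        for i, T in enumerate(thetas):
            out = sigmoid(T @ a)
            # Hidden layers get a bias unit; the output layer does not.
            a = np.concatenate(([1.0], out)) if i < len(thetas) - 1 else out
            activations.append(a)

        # Step 3: error of the output layer.
        delta = activations[-1] - Y[t]

        # Steps 4-5: walk backwards, accumulating Delta and stepping delta.
        for l in range(len(thetas) - 1, -1, -1):
            a_l = activations[l]
            Deltas[l] += np.outer(delta, a_l)
            if l > 0:
                # Drop the bias entry: it has no incoming weights.
                delta = ((thetas[l].T @ delta) * a_l * (1.0 - a_l))[1:]

    # After the loop: average and regularize every column except j = 0.
    grads = []
    for T, Delta in zip(thetas, Deltas):
        D = Delta / m
        D[:, 1:] += (lam / m) * T[:, 1:]
        grads.append(D)
    return grads
```

Each returned matrix has the same shape as the corresponding weight matrix, so it can be passed directly to a gradient-based optimizer.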

### Gradient Checking

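To check that the backpropagation implementation is correct, compare the gradient vector it computes with a numerical approximation, ``gradApprox``, obtained from the cost function using a small $\epsilon$ (for example $10^{-4}$):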

$$gradApprox = \frac{J(\theta + \epsilon) - J(\theta - \epsilon)}{2\epsilon}$$
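For a parameter vector with more than one entry, the approximation is usually computed one component at a time, perturbing only $\theta_i$ while holding the others fixed. A minimal sketch, assuming the parameters have been unrolled into a 1-D NumPy array and that `J` is any callable returning the scalar cost (the name `grad_approx` is hypothetical):

```python
import numpy as np

def grad_approx(J, theta, eps=1e-4):
    """Two-sided difference estimate of the gradient of J at theta."""
    theta = np.asarray(theta, dtype=float)
    approx = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps                      # perturb only component i
        approx[i] = (J(theta + step) - J(theta - step)) / (2.0 * eps)
    return approx

# Usage: check against a gradient that is known analytically.
theta = np.array([1.0, 2.0, -0.5])
numeric = grad_approx(lambda th: np.sum(th ** 3), theta)
analytic = 3.0 * theta ** 2
print(np.max(np.abs(numeric - analytic)))  # should be very small
```

Because every component requires two evaluations of the cost function, this check is slow and is normally run only on a small network or a handful of parameters.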

### Random Initialization (Symmetry Breaking)

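If all weights start at the same value (for example zero), every unit in a hidden layer computes the same function and receives the same update, so the units never differentiate. To break this symmetry, initialize each $\Theta_{i, j}^{(l)}$ to a random value in $[-\epsilon_{init}, \epsilon_{init}]$, that is, $-\epsilon_{init}\leq\Theta_{i, j}^{(l)}\leq\epsilon_{init}$. This $\epsilon_{init}$ is unrelated to the $\epsilon$ used in gradient checking.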

One effective strategy for choosing $\epsilon_{init}$ is to base it on the 
number of units in the network. A good choice of $\epsilon_{init}$ is:

$$\epsilon_{init} = \frac{\sqrt{6}}{\sqrt{L_{in} + L_{out}}}$$

where $L_{in} = s_{l}$ and $L_{out} = s_{l+1}$ are the number of units in the layers adjacent to $\Theta^{(l)}$.
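As an illustration, the initialization for one weight matrix might look like the sketch below. It assumes the matrix has an extra column for the bias unit (shape $s_{l+1} \times (s_{l}+1)$); the function name `rand_init_weights` and the optional `rng` argument are hypothetical.

```python
import numpy as np

def rand_init_weights(l_in, l_out, rng=None):
    """Draw a (l_out, l_in + 1) weight matrix uniformly from [-eps_init, eps_init)."""
    rng = np.random.default_rng() if rng is None else rng
    eps_init = np.sqrt(6.0) / np.sqrt(l_in + l_out)
    return rng.uniform(-eps_init, eps_init, size=(l_out, l_in + 1))

# Usage: weights between a 3-unit layer and a 4-unit layer.
theta1 = rand_init_weights(3, 4, rng=np.random.default_rng(0))
print(theta1.shape)  # (4, 4)
```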

### Training a neural network

1. Randomly initialize weights.
2. Implement forward propagation to get $h_{\Theta}(x^{(i)})$ for any $x^{(i)}$.
3. Implement code to compute cost function $J(\Theta)$.
4. Implement backpropagation to compute partial derivatives $\frac{\partial}{\partial\Theta_{j,k}^{(l)}}J(\Theta)$.
5. Use gradient checking to compare $\frac{\partial}{\partial\Theta_{j,k}^{(l)}}J(\Theta)$ computed using backpropagation vs using numerical estimate of gradient of $J(\Theta)$. Then disable gradient checking code.
6. Use gradient descent or other advanced optimization methods with backpropagation to try to minimize $J(\Theta)$ as a function of parameters $\Theta$.
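The steps above can be tied together in a compact sketch. It makes several assumptions that are not fixed by the list: a single hidden layer, sigmoid activations, the regularized cross-entropy cost, full-batch gradient descent with a fixed learning rate `alpha`, and a vectorized form of backpropagation that processes all examples at once. The names `cost_and_grads` and `train` are hypothetical, and the gradient-checking step is left out for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def add_bias(A):
    # Prepend a column of ones (the bias unit) to every row.
    return np.hstack([np.ones((A.shape[0], 1)), A])

def cost_and_grads(theta1, theta2, X, Y, lam):
    m = X.shape[0]
    # Forward propagation (step 2).
    a1 = add_bias(X)
    a2 = add_bias(sigmoid(a1 @ theta1.T))
    a3 = sigmoid(a2 @ theta2.T)
    # Regularized cross-entropy cost (step 3); bias columns are not penalized.
    J = -np.sum(Y * np.log(a3) + (1 - Y) * np.log(1 - a3)) / m
    J += lam / (2 * m) * (np.sum(theta1[:, 1:] ** 2) + np.sum(theta2[:, 1:] ** 2))
    # Backpropagation (step 4), vectorized over all examples.
    d3 = a3 - Y
    d2 = ((d3 @ theta2) * a2 * (1 - a2))[:, 1:]   # drop the bias column
    D1 = d2.T @ a1 / m
    D2 = d3.T @ a2 / m
    D1[:, 1:] += lam / m * theta1[:, 1:]
    D2[:, 1:] += lam / m * theta2[:, 1:]
    return J, D1, D2

def train(X, Y, hidden=4, lam=0.0, alpha=0.3, iters=500, seed=0):
    rng = np.random.default_rng(seed)
    n, K = X.shape[1], Y.shape[1]
    # Random initialization (step 1).
    e1 = np.sqrt(6.0) / np.sqrt(n + hidden)
    e2 = np.sqrt(6.0) / np.sqrt(hidden + K)
    theta1 = rng.uniform(-e1, e1, (hidden, n + 1))
    theta2 = rng.uniform(-e2, e2, (K, hidden + 1))
    history = []
    # Gradient descent (step 6).
    for _ in range(iters):
        J, D1, D2 = cost_and_grads(theta1, theta2, X, Y, lam)
        history.append(J)
        theta1 -= alpha * D1
        theta2 -= alpha * D2
    return theta1, theta2, history

# Usage: the XOR truth table as a toy data set.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
Y = np.array([[0.0], [1.0], [1.0], [0.0]])
theta1, theta2, history = train(X, Y)
print(history[0], history[-1])  # the cost should go down over training
```

In practice the fixed-step loop would often be replaced by an off-the-shelf optimizer, with `cost_and_grads` supplying the cost and the unrolled gradient.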
