## Gradient Checking

Gradient checking helps verify that our backpropagation implementation works as intended. We can approximate the derivative of our cost function with a two-sided difference:

$$
\frac{\partial}{\partial\Theta}J(\Theta) \approx 
\frac{J(\Theta + \epsilon) - J(\Theta - \epsilon)}{2\epsilon}
$$

With multiple theta matrices, we can approximate the derivative **with respect to $\Theta_j$** as follows:

$$
\frac{\partial}{\partial\Theta_j}J(\Theta) \approx
\frac{J(\Theta_1, \dots, \Theta_j + \epsilon, \dots, \Theta_n) - J(\Theta_1, \dots, \Theta_j - \epsilon, \dots, \Theta_n)}{2\epsilon}
$$

A small value for $\epsilon$ (epsilon), such as $\epsilon = 10^{-4}$, makes the approximation accurate. If the value for $\epsilon$ is too small, however, we can run into numerical problems such as floating-point round-off error.

Note that we add or subtract epsilon only in $\Theta_j$; all other $\Theta$ matrices are left unchanged.

Once you have verified **once** that your backpropagation algorithm is correct, you don't need to compute the numerical approximation again, which is worth avoiding because it is slow to compute.
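
As a rough illustration of the two-sided difference above, here is a minimal Python sketch. It assumes the parameters have been unrolled into one flat list, and it uses a toy cost with a known gradient instead of a real network. The names `numerical_gradient`, `toy_cost`, and `toy_gradient` are hypothetical and exist only for this example.

```python
def numerical_gradient(cost, theta, epsilon=1e-4):
    """Two-sided difference approximation of d(cost)/d(theta[j]) for every j."""
    grad = []
    for j in range(len(theta)):
        plus = list(theta)
        minus = list(theta)
        plus[j] += epsilon   # perturb only the j-th parameter
        minus[j] -= epsilon
        grad.append((cost(plus) - cost(minus)) / (2 * epsilon))
    return grad


# Toy cost with a known gradient: J(theta) = sum(theta_j ** 2), so dJ/dtheta_j = 2 * theta_j.
def toy_cost(theta):
    return sum(t ** 2 for t in theta)


def toy_gradient(theta):
    return [2 * t for t in theta]


theta = [0.5, -1.2, 3.0]
approx = numerical_gradient(toy_cost, theta)
exact = toy_gradient(theta)

# The largest difference should be tiny if the analytic gradient is correct.
max_diff = max(abs(a - e) for a, e in zip(approx, exact))
print(max_diff)
```

In a real check, `toy_gradient` would be replaced by the gradient that backpropagation returns, and a large `max_diff` would point to a bug in the backpropagation code.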

## Random Initialization

Initializing all theta weights to zero doesn't work with neural networks. When we backpropagate, all nodes will update to the same value repeatedly. Instead, we can randomly initialize the weights of our $\Theta$ matrices using the following method:

![image.png](attachment:image.png)

Hence, we initialize each $\Theta^{(l)}_{ij}$ to a random value in the interval $[-\epsilon, \epsilon]$.

Using the above formula guarantees that we get the desired bound. The same procedure applies to all the $\Theta$s. Note that this $\epsilon$ is unrelated to the $\epsilon$ used in gradient checking.
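
A minimal Python sketch of this kind of initialization follows. It assumes the formula scales a uniform sample from $[0, 1)$ by $2\epsilon$ and then shifts it by $-\epsilon$. The function name `random_init`, the parameter `init_epsilon`, and the value `0.12` are illustrative assumptions, not taken from the text.

```python
import random


def random_init(rows, cols, init_epsilon=0.12, seed=None):
    """Return a rows x cols matrix (list of lists) with entries in [-init_epsilon, init_epsilon)."""
    rng = random.Random(seed)
    return [[rng.random() * 2 * init_epsilon - init_epsilon for _ in range(cols)]
            for _ in range(rows)]


# Example: a hypothetical 10 x 11 weight matrix.
theta1 = random_init(10, 11, seed=0)
print(min(min(row) for row in theta1), max(max(row) for row in theta1))
```

Because the entries differ from one another, the hidden units no longer receive identical updates during backpropagation.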

## Putting it Together

First, pick a network architecture: choose the layout of your neural network, including how many hidden units each layer has and how many layers there are in total.

- Number of input units = dimension of features $x^{(i)}$
- Number of output units = number of classes
- Number of hidden units per layer = usually the more the better (must balance with cost of computation, as it increases with more hidden units)
- Defaults: 1 hidden layer. If you have more than 1 hidden layer, it is recommended that you have the same number of units in every hidden layer.
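
To make the effect of an architecture choice concrete, the sketch below lists the dimensions of each weight matrix for a given layout. It assumes the usual convention in which every layer gets one extra bias unit, so the matrix mapping layer $l$ to layer $l+1$ has $s_{l+1}$ rows and $s_l + 1$ columns. The helper `theta_shapes` and the layer sizes are hypothetical.

```python
def theta_shapes(layer_sizes):
    """Return (rows, cols) for each weight matrix; the +1 column is for the bias unit."""
    return [(layer_sizes[l + 1], layer_sizes[l] + 1)
            for l in range(len(layer_sizes) - 1)]


# Hypothetical layout: 400 input features, one hidden layer of 25 units, 10 classes.
shapes = theta_shapes([400, 25, 10])
print(shapes)  # [(25, 401), (10, 26)]
```

Adding hidden units grows these matrices, which is where the extra computational cost comes from.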

##### Training a Neural Network

1. Randomly initialize the weights
2. Implement forward propagation to get $h_{\Theta}(x^{(i)})$ for any $x^{(i)}$
3. Implement the cost function
4. Implement backpropagation to compute partial derivatives
5. Use gradient checking to confirm that your backpropagation works. Then disable gradient checking.
6. Use gradient descent or a built-in optimization function to minimize the cost function with respect to the weights in theta.

When we perform forward and back propagation, we loop over every training example:

```
for i = 1:m,
    % Perform forward propagation and backpropagation using example (x(i), y(i))
    % (Get activations a(l) and delta terms for l = 2, ..., L)
end
```
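
The six training steps and the per-example loop can be sketched end to end in plain Python. This is an illustrative sketch under several assumptions that are not in the text: sigmoid activations, an unregularized cross-entropy cost, a tiny 2-2-1 network, a toy logical-OR data set, and a learning rate of `0.5`. All function names are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(thetas, x):
    """Return the activations of every layer for one example x."""
    activations = [list(x)]
    for theta in thetas:
        a_with_bias = [1.0] + activations[-1]  # prepend the bias unit
        activations.append([sigmoid(sum(w * a for w, a in zip(row, a_with_bias)))
                            for row in theta])
    return activations


def cost(thetas, X, Y):
    """Unregularized cross-entropy cost, averaged over all examples."""
    total = 0.0
    for x, y in zip(X, Y):
        h = forward(thetas, x)[-1]
        total -= sum(yk * math.log(hk) + (1 - yk) * math.log(1 - hk)
                     for yk, hk in zip(y, h))
    return total / len(X)


def backprop_gradients(thetas, X, Y):
    """Loop over the examples and accumulate the partial derivatives."""
    grads = [[[0.0] * len(row) for row in theta] for theta in thetas]
    for x, y in zip(X, Y):
        acts = forward(thetas, x)
        delta = [hk - yk for hk, yk in zip(acts[-1], y)]  # output-layer error
        for l in range(len(thetas) - 1, -1, -1):
            a_with_bias = [1.0] + acts[l]
            for i, d in enumerate(delta):
                for j, a in enumerate(a_with_bias):
                    grads[l][i][j] += d * a
            if l > 0:
                # Propagate the error one layer back, skipping the bias column.
                delta = [sum(thetas[l][i][j + 1] * delta[i] for i in range(len(delta)))
                         * acts[l][j] * (1 - acts[l][j])
                         for j in range(len(acts[l]))]
    return [[[g / len(X) for g in row] for row in grad] for grad in grads]


def numerical_gradients(thetas, X, Y, epsilon=1e-4):
    """Two-sided difference approximation, perturbing one weight at a time."""
    approx = [[[0.0] * len(row) for row in theta] for theta in thetas]
    for l, theta in enumerate(thetas):
        for i, row in enumerate(theta):
            for j in range(len(row)):
                original = row[j]
                row[j] = original + epsilon
                cost_plus = cost(thetas, X, Y)
                row[j] = original - epsilon
                cost_minus = cost(thetas, X, Y)
                row[j] = original  # restore the weight
                approx[l][i][j] = (cost_plus - cost_minus) / (2 * epsilon)
    return approx


def flatten(nested):
    return [w for theta in nested for row in theta for w in row]


# Step 1: random initialization for a hypothetical 2-2-1 network.
rng = random.Random(0)
sizes = [2, 2, 1]
thetas = [[[rng.uniform(-0.5, 0.5) for _ in range(sizes[l] + 1)]
           for _ in range(sizes[l + 1])] for l in range(len(sizes) - 1)]

# Toy data set (logical OR), chosen only for illustration.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
Y = [[0.0], [1.0], [1.0], [1.0]]

# Steps 2-5: forward propagation, cost, backpropagation, gradient check.
max_diff = max(abs(b - n) for b, n in zip(flatten(backprop_gradients(thetas, X, Y)),
                                          flatten(numerical_gradients(thetas, X, Y))))
print("largest backprop vs. numerical difference:", max_diff)

# Step 6: plain gradient descent, with gradient checking no longer running.
initial_cost = cost(thetas, X, Y)
alpha = 0.5  # learning rate, an arbitrary choice
for _ in range(1000):
    grads = backprop_gradients(thetas, X, Y)
    thetas = [[[w - alpha * g for w, g in zip(row, grow)]
               for row, grow in zip(theta, grad)]
              for theta, grad in zip(thetas, grads)]
print("cost before:", initial_cost, "cost after:", cost(thetas, X, Y))
```

The gradient check runs once before training and is left out of the descent loop, since it needs two full cost evaluations for every weight.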

The following image gives us an intuition for what is happening as we implement our neural network:

![image.png](attachment:image.png)

Ideally, you want $h_{\Theta}(x^{(i)}) \approx y^{(i)}$, which minimizes our cost function. However, keep in mind that $J(\Theta)$ is not convex, so gradient descent can end up in a local minimum instead of the global one.