#Cost Function
Let's first define a few variables that we will need to use:

- L = total number of layers in the network

- $s_l$ = number of units (not counting the bias unit) in layer $l$

- K = number of output units/classes

For binary classification, y = 0 or 1, and the network has 1 output unit;


For multi-class (*K*-class) classification, $y \in \mathbb{R}^K$ (i.e., $h_\Theta(x)\in \mathbb{R}^K$, and $h_\Theta(x)_i$ is the $i^{th}$ output), and the network has *K* output units.
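
As a small illustration of the *K*-dimensional label vector, the sketch below builds $y$ for a single example. It assumes classes are numbered 1 to *K* and that exactly one entry of $y$ is 1; the helper name `one_hot` is hypothetical and not part of these notes.

```python
def one_hot(label, num_classes):
    """Return the K-dimensional label vector for a class label in 1..K.

    Assumption: labels are numbered from 1, so label k sets y_k = 1.
    """
    return [1 if k == label else 0 for k in range(1, num_classes + 1)]


# Label 3 out of K = 4 classes: only the third component is set.
y = one_hot(3, 4)
```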

---


For a **neural network**, the cost function is

\begin{equation*} J(\Theta) = - \frac{1}{m} \sum_{i=1}^m \sum_{k=1}^K \left[y^{(i)}_k \log ((h_\Theta (x^{(i)}))_k) + (1 - y^{(i)}_k)\log (1 - (h_\Theta(x^{(i)}))_k)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} ( \Theta_{j,i}^{(l)})^2\end{equation*}

Note:
- The double sum adds up the logistic regression costs calculated for each unit in the output layer.
- The triple sum adds up the squares of all the individual $\Theta$ values in the entire network (bias weights are excluded, since $i$ starts at 1).
- The $i$ in the triple sum does **not** refer to training example $i$.
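
The two sums can be sketched directly in code. This is a minimal illustration rather than a reference implementation: it assumes the network outputs $h_\Theta(x^{(i)})$ have already been computed and lie strictly between 0 and 1, that each $\Theta^{(l)}$ is stored as a nested list whose column 0 holds the bias weights, and the names `cost`, `h`, `thetas` and `lam` are hypothetical.

```python
import math


def cost(h, y, thetas, lam):
    """Regularized cost J(Theta).

    h      -- m network outputs, each a list of K values strictly in (0, 1)
    y      -- m label vectors, each a list of K values in {0, 1}
    thetas -- L-1 weight matrices; thetas[l][j][i], column i = 0 is the bias
    lam    -- regularization strength lambda
    """
    m = len(y)
    # Double sum: logistic-regression cost of every output unit on every example.
    data_term = 0.0
    for h_i, y_i in zip(h, y):
        for h_k, y_k in zip(h_i, y_i):
            data_term += y_k * math.log(h_k) + (1 - y_k) * math.log(1 - h_k)
    # Triple sum: squares of all non-bias weights (column 0 is skipped).
    reg_term = sum(w * w for theta in thetas for row in theta for w in row[1:])
    return -data_term / m + lam / (2 * m) * reg_term


# Made-up values: m = 2 examples, K = 2 classes, one 2x3 weight matrix.
h = [[0.9, 0.1], [0.2, 0.8]]
y = [[1, 0], [0, 1]]
thetas = [[[0.5, 1.0, -1.0], [0.5, 2.0, 0.0]]]
J = cost(h, y, thetas, lam=1.0)
```

Skipping `row[0]` mirrors the inner index $i$ starting at 1 in the regularization term.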



#Backpropagation Algorithm
"Backpropagation" is the neural-network term for the algorithm that computes the gradient of our cost function, which we then use to minimize it, just as we did with gradient descent in logistic and linear regression.

We want $\min_\Theta J(\Theta)$, which requires computing the partial derivatives:
- $\dfrac{\partial}{\partial \Theta_{i,j}^{(l)}}J(\Theta)$

**Backpropagation Algorithm:**

Given training set {($x^{(1)}, y^{(1)}$),...,($x^{(m)}, y^{(m)}$)}

1. Set $\Delta_{ij}^{(l)} = 0$ (for all $l, i, j$ )

2. For *i* = 1 to m (steps 2 to 5 are repeated for each training example):
    - set $a^{(1)} = x^{(i)}$
    - perform **forward propagation** to compute $a^{(l)}$ for $l$ = 2,3,...,L
    
    ![alt text](https://raw.githubusercontent.com/wongchihaul/Coursera-ML/master/Pics/forward%20propogation.png)
    
3. Using $y^{(i)}$, compute $\delta^{(L)} = a^{(L)} - y^{(i)}$
    
    Where L is our total number of layers and $a^{(L)}$ is the vector of outputs of the activation units for the last layer. So our "error values" for the last layer are simply the differences between our actual results in the last layer and the correct outputs in y.

4. Compute $\delta^{(L-1)}, \delta^{(L-2)}, ..., \delta^{(2)}$ by using \begin{equation*}\delta^{(l)} = ((\Theta^{(l)})^T \delta^{(l+1)})\ .*\ a^{(l)}\ .*\ (1 - a^{(l)})\end{equation*}
    
    [Remember there is no $\delta^{(1)}$]

    The g-prime derivative term can also be written out as:\begin{equation}g'(z^{(l)}) = a^{(l)}\ .*\ (1 - a^{(l)})\end{equation}

5. $\Delta_{ij}^{(l)} := \Delta_{ij}^{(l)} + a^{(l)}_j\delta^{(l+1)}_i$, or in vectorized form \begin{equation}\Delta^{(l)} :=  \Delta^{(l)} + \delta^{(l+1)}(a^{(l)})^T\end{equation}

    Hence we update our new $\Delta$ matrix.

After all m training examples have been processed, compute:

- $D^{(l)}_{ij} := \dfrac{1}{m}\left(\Delta^{(l)}_{ij} + \lambda\Theta^{(l)}_{i,j}\right)$, if $j \neq 0$
- $D^{(l)}_{ij} := \dfrac{1}{m}\Delta^{(l)}_{ij}$, if $j = 0$

    ![alt text](https://raw.githubusercontent.com/wongchihaul/Coursera-ML/master/Pics/backpropogation.png)
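
The g-prime identity used in step 4 can be checked in a few lines. This short derivation assumes the activation is the sigmoid $g(z) = \frac{1}{1+e^{-z}}$, which these notes do not state explicitly:

\begin{equation*} g'(z) = \frac{e^{-z}}{(1+e^{-z})^2} = \frac{1}{1+e^{-z}}\cdot\frac{e^{-z}}{1+e^{-z}} = g(z)\,(1 - g(z)) \end{equation*}

Applying this element-wise with $a^{(l)} = g(z^{(l)})$ gives $g'(z^{(l)}) = a^{(l)}\ .*\ (1 - a^{(l)})$.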

---

The capital-delta matrix $\Delta$ is used as an "accumulator" to add up our values as we go along; dividing by m and adding the regularization term gives $D$, which is our partial derivative. Thus we get \begin{equation}\dfrac{\partial}{\partial \Theta_{i,j}^{(l)}}J(\Theta) = D^{(l)}_{ij}\end{equation}
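
The whole procedure can be sketched in plain Python to make the indexing concrete. This is a minimal illustration under stated assumptions, not a definitive implementation: it assumes sigmoid activations, stores each $\Theta^{(l)}$ as a nested list whose column 0 holds the bias weights, and uses plain lists instead of a matrix library to stay dependency-free. The names `sigmoid`, `forward` and `backprop` are hypothetical.

```python
import math


def sigmoid(z):
    """Sigmoid activation g(z); assumed here, consistent with g'(z) = a .* (1 - a)."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(thetas, x):
    """Forward propagation for one input x: returns [a1, a2, ..., aL].

    Every layer except the output gets a leading bias unit a_0 = 1.
    """
    activations = []
    a = list(x)
    for theta in thetas:
        a = [1.0] + a                      # prepend the bias unit
        activations.append(a)
        a = [sigmoid(sum(w * v for w, v in zip(row, a))) for row in theta]
    activations.append(a)                  # output layer, no bias unit
    return activations


def backprop(thetas, X, Y, lam):
    """Return the gradient matrices D, one per weight matrix in thetas."""
    m = len(X)
    # Step 1: Delta^(l) = 0, with the same shape as Theta^(l).
    Delta = [[[0.0] * len(row) for row in theta] for theta in thetas]
    for x, y in zip(X, Y):
        a = forward(thetas, x)                               # step 2
        delta = [a_k - y_k for a_k, y_k in zip(a[-1], y)]    # step 3
        for l in range(len(thetas) - 1, -1, -1):
            # Step 5: Delta^(l) += delta^(l+1) (a^(l))^T
            for i, d_i in enumerate(delta):
                for j, a_j in enumerate(a[l]):
                    Delta[l][i][j] += d_i * a_j
            if l > 0:  # there is no delta for the input layer
                # Step 4: delta^(l) = (Theta^T delta^(l+1)) .* a .* (1 - a),
                # skipping j = 0 because the bias unit has no error term.
                delta_next = delta
                delta = [
                    sum(thetas[l][i][j] * delta_next[i] for i in range(len(delta_next)))
                    * a[l][j] * (1 - a[l][j])
                    for j in range(1, len(a[l]))
                ]
    # Final step: average, and regularize every column except the bias (j = 0).
    return [
        [[(Delta[l][i][j] + (lam * thetas[l][i][j] if j != 0 else 0.0)) / m
          for j in range(len(thetas[l][i]))]
         for i in range(len(thetas[l]))]
        for l in range(len(thetas))
    ]


# Hypothetical 2-2-1 network with two made-up training examples.
thetas = [[[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]], [[0.7, -0.8, 0.9]]]
X = [[1.0, 2.0], [0.5, -1.5]]
Y = [[1], [0]]
D = backprop(thetas, X, Y, lam=0.5)
```

Each `D[l]` has the same shape as `thetas[l]`, so it can be used directly in a gradient-descent update. Comparing `D` against a finite-difference estimate of $J(\Theta)$ is a useful sanity check on a sketch like this.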