### Cost Function

* L =  total number of layers in the network   
* $s_l$ = number of units (not counting bias unit) in layer l  
* K = number of output units/classes  

$J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}[y^{(i)}_k\log((h_\Theta(x^{(i)}))_k) + 
(1 - y^{(i)}_k)\log(1 - (h_\Theta(x^{(i)}))_k)] +
\frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}(\Theta_{j,i}^{(l)})^2$

Note:

* the double sum adds up the logistic regression costs calculated for each unit in the output layer
* the triple sum adds up the squares of all the individual Θs in the entire network, excluding the weights applied to the bias units
* the i in the triple sum does not refer to training example i
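
As a rough illustration of the formula above, here is a minimal sketch in plain Python. The function names (`sigmoid`, `forward`, `cost`) and the weight layout are assumptions made for this example: each $\Theta^{(l)}$ is a list of rows, one row per unit in layer $l+1$, with column 0 holding the bias weight.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(thetas, x):
    """Return h_Theta(x) as a list with one entry per output unit."""
    a = list(x)
    for theta in thetas:
        a_with_bias = [1.0] + a  # prepend the bias unit a_0 = 1
        a = [sigmoid(sum(w * v for w, v in zip(row, a_with_bias)))
             for row in theta]
    return a

def cost(thetas, X, Y, lam):
    """Regularized cost J(Theta) for inputs X and 0/1 label vectors Y."""
    m = len(X)
    # Double sum: over training examples i and output units k.
    data_term = 0.0
    for x, y in zip(X, Y):
        h = forward(thetas, x)
        for h_k, y_k in zip(h, y):
            data_term += y_k * math.log(h_k) + (1 - y_k) * math.log(1 - h_k)
    # Triple sum: squares of all weights except column 0 (the bias weights).
    reg = sum(w ** 2 for theta in thetas for row in theta for w in row[1:])
    return -data_term / m + lam / (2 * m) * reg
```

With all weights set to zero every output is 0.5, so the unregularized cost reduces to $K \log 2$, which is a handy sanity check for an implementation like this one.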

### Backpropagation Algorithm
#### Gradient computation
$\min\limits_{\Theta} J(\Theta)$ 

We need to compute:
* $J(\Theta)$
* $\frac{\partial }{\partial \Theta_{ij}^{(l)}} J(\Theta)$


Let's start with a single training example $(x, y)$ and a four-layer network. Forward propagation gives:

$a^{(1)} = x$   
$z^{(2)} = \Theta^{(1)}a^{(1)}$    
$a^{(2)} = g(z^{(2)})$ (add $a_0^{(2)}$)        
$z^{(3)} = \Theta^{(2)}a^{(2)}$    
$a^{(3)} = g(z^{(3)})$ (add $a_0^{(3)}$)        
$z^{(4)} = \Theta^{(3)}a^{(3)}$    
$a^{(4)} = h_\Theta(x) = g(z^{(4)})$    
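
The forward pass above can be sketched in plain Python as follows. This is only an illustration: `forward_propagate` is a hypothetical helper, and it assumes each $\Theta^{(l)}$ is stored as a list of rows with the bias weight in column 0, so the bias unit is prepended to every layer except the output.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward_propagate(thetas, x):
    """Return (zs, activations) for one example.

    activations[0] is a^(1) with the bias unit prepended; hidden-layer
    activations also carry a_0 = 1; the last entry is h_Theta(x).
    zs[k] holds the pre-activations z^(k+2).
    """
    a = [1.0] + list(x)          # a^(1) = x, plus bias unit
    activations = [a]
    zs = []
    for idx, theta in enumerate(thetas):
        z = [sum(w * v for w, v in zip(row, a)) for row in theta]
        g = [sigmoid(v) for v in z]
        is_output = idx == len(thetas) - 1
        a = g if is_output else [1.0] + g   # add a_0 on hidden layers only
        zs.append(z)
        activations.append(a)
    return zs, activations
```

For the four-layer example, `thetas` would hold three matrices, and `activations[-1]` plays the role of $a^{(4)} = h_\Theta(x)$.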


Backpropagation: algorithm to compute the derivative   

Intuition: $\delta_j^{(l)}$ = "error" of node $j$ in layer $l$.  


For each output unit (layer L = 4)  
$\delta_j^{(4)} = a_j^{(4)} - y_j$

$\delta^{(3)} = (\Theta^{(3)})^T\delta^{(4)}$ .* $g'(z^{(3)})$    
$\delta^{(2)} = (\Theta^{(2)})^T\delta^{(3)}$ .* $g'(z^{(2)})$    

There is no $\delta^{(1)}$ because layer 1 is the input layer, which has no error associated with it.  
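
A minimal sketch of this backward recursion in plain Python is shown below. `backward_deltas` is a hypothetical helper; it assumes a sigmoid activation (so $g'(z) = a(1 - a)$), weight matrices stored as lists of rows with the bias weight in column 0, and activations that already include the bias unit on every layer except the output.

```python
def backward_deltas(thetas, activations, y):
    """Return [delta^(2), ..., delta^(L)] for a single example.

    thetas[k] is Theta^(k+1); activations[k] is a^(k+1), with a leading
    bias unit on every layer except the last.
    """
    # Output layer: delta^(L) = a^(L) - y
    delta = [a - t for a, t in zip(activations[-1], y)]
    deltas = [delta]
    # Walk backwards; stop before the input layer (there is no delta^(1)).
    for k in range(len(thetas) - 1, 0, -1):
        theta = thetas[k]
        a = activations[k]
        # (Theta^T delta) .* a .* (1 - a), computed per unit j
        back = [
            sum(theta[i][j] * delta[i] for i in range(len(delta)))
            * a[j] * (1 - a[j])
            for j in range(len(a))
        ]
        delta = back[1:]  # drop the entry that belongs to the bias unit
        deltas.insert(0, delta)
    return deltas
```

The bias entry is dropped at each step because the bias unit has no incoming weights, so it has no error to pass further back.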

#### Backpropagation algorithm

Training set $\{(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)}) \} $    

Set $\Delta_{ij}^{(l)} = 0 $ ( for all $l, i, j $ ).    

For $i = 1$ to $m$       
&nbsp;&nbsp;&nbsp;&nbsp; 1. Set $a^{(1)} = x^{(i)}$    

&nbsp;&nbsp;&nbsp;&nbsp; 2. Perform forward propagation to compute $a^{(l)}$ for $l = 2, 3, \dots, L$    

&nbsp;&nbsp;&nbsp;&nbsp; 3. Using $y^{(i)}$, compute $\delta^{(L)} = a^{(L)} - y^{(i)}$   
Here L is the total number of layers and $a^{(L)}$ is the vector of outputs of the activation
units in the last layer. So the "error values" for the last layer are simply the differences 
between the network's outputs in the last layer and the correct outputs in y. To get the delta values
of the layers before the last layer, we can use an equation that steps us back from right 
to left:


&nbsp;&nbsp;&nbsp;&nbsp; 4. Compute $\delta^{(L-1)}, \delta^{(L-2)}, \dots, \delta^{(2)}$  
The delta values of layer l are calculated by multiplying the delta values of the next layer by the transpose of the theta matrix of layer l. We then element-wise multiply the result by g', or g-prime, which is the derivative of the activation function g evaluated at the input values given by $z^{(l)}$:

$\delta^{(l)} = (\Theta^{(l)})^T\delta^{(l+1)}$ .* $g'(z^{(l)})$   

For the sigmoid activation, the g-prime term can also be written out as:


$g'(z^{(l)}) = a^{(l)}$ .* $(1- a^{(l)})$   


&nbsp;&nbsp;&nbsp;&nbsp; 5. $\Delta_{ij}^{(l)} := \Delta_{ij}^{(l)} + a_{j}^{(l)}\delta_{i}^{(l + 1)}$  or with vectorization $  \Delta^{(l)} := \Delta^{(l)} + \delta^{(l + 1)}(a^{(l)})^T$ 

After the loop over all $m$ examples, compute:   

$D_{ij}^{(l)} := \frac{1}{m}\Delta_{ij}^{(l)} + \frac{\lambda}{m}\Theta_{ij}^{(l)}$ &nbsp;&nbsp;&nbsp;&nbsp; if $j \neq 0$    
$D_{ij}^{(l)} := \frac{1}{m}\Delta_{ij}^{(l)} $ &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if $j = 0$    


$\frac{\partial }{\partial \Theta_{ij}^{(l)}} J(\Theta) = D_{ij}^{(l)}$
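
Putting the steps together, the whole procedure might look like the sketch below. It is an illustration under stated assumptions, not a reference implementation: `backprop_gradients` is a hypothetical name, the activation is the sigmoid, each $\Theta^{(l)}$ is a list of rows with the bias weight in column 0, and the regularization term is scaled by $\lambda/m$ so that it matches the $\frac{\lambda}{2m}$ factor in $J(\Theta)$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def backprop_gradients(thetas, X, Y, lam):
    """Return D, a list of matrices shaped like thetas, with
    D[l][i][j] = dJ/dTheta_ij for the (l+1)-th weight matrix."""
    m = len(X)
    # Set Delta = 0 for all l, i, j.
    Deltas = [[[0.0] * len(row) for row in theta] for theta in thetas]
    for x, y in zip(X, Y):
        # Steps 1-2: forward propagation, keeping every activation vector.
        activations = [[1.0] + list(x)]
        for idx, theta in enumerate(thetas):
            g = [sigmoid(sum(w * v for w, v in zip(row, activations[-1])))
                 for row in theta]
            last = idx == len(thetas) - 1
            activations.append(g if last else [1.0] + g)
        # Step 3: delta for the output layer.
        delta = [a - t for a, t in zip(activations[-1], y)]
        # Steps 4-5: accumulate Delta, then step delta back one layer.
        for l in range(len(thetas) - 1, -1, -1):
            a = activations[l]
            for i, d in enumerate(delta):
                for j, a_j in enumerate(a):
                    Deltas[l][i][j] += a_j * d
            if l > 0:  # no delta is needed for the input layer
                theta = thetas[l]
                back = [
                    sum(theta[i][j] * delta[i] for i in range(len(delta)))
                    * a[j] * (1 - a[j])
                    for j in range(len(a))
                ]
                delta = back[1:]  # drop the bias entry
    # Average, and regularize every column except the bias column j = 0.
    return [
        [[Delta[i][j] / m + (lam / m * theta[i][j] if j != 0 else 0.0)
          for j in range(len(theta[i]))]
         for i in range(len(theta))]
        for theta, Delta in zip(thetas, Deltas)
    ]
```

The nested loops mirror the element-wise update of $\Delta_{ij}^{(l)}$; with a matrix library the inner two loops would collapse into the vectorized outer product $\delta^{(l+1)}(a^{(l)})^T$.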

### Backpropagation Intuition

$
\begin{bmatrix} +1 \\ x_1^{(i)} \\ x_2^{(i)} \end{bmatrix}
\rightarrow \begin{bmatrix} +1 \\ z_1^{(2)} \rightarrow a_1^{(2)} \\ z_2^{(2)} \rightarrow a_2^{(2)} \end{bmatrix}
\rightarrow \begin{bmatrix} +1 \\ z_1^{(3)} \rightarrow a_1^{(3)} \\ z_2^{(3)} \rightarrow a_2^{(3)} \end{bmatrix}
\rightarrow \begin{bmatrix} z_1^{(4)} \rightarrow a_1^{(4)} \end{bmatrix}
$

E.g., forward propagation:    
$z_1^{(3)} = \Theta_{10}^{(2)}1 + \Theta_{11}^{(2)}a_1^{(2)} + \Theta_{12}^{(2)}a_2^{(2)}$


#### What is backpropagation doing?

Consider the cost function for a network with a single output unit ($K = 1$):   

$J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}[y^{(i)}\log(h_\Theta(x^{(i)})) + 
(1 - y^{(i)})\log(1 - h_\Theta(x^{(i)}))] +
\frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}(\Theta_{j,i}^{(l)})^2$


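Let's ignore regularization (set $\lambda = 0$) and focus on a single training example $(x^{(i)},  y^{(i)})$. Its contribution to the cost is:

cost(i) = $-[y^{(i)}\log(h_\Theta(x^{(i)})) + (1 - y^{(i)})\log(1 - h_\Theta(x^{(i)}))]$

Intuitively, this measures how far the prediction is from the label, much like a squared error:

cost(i) $ \approx (h_\Theta(x^{(i)}) - y^{(i)})^2$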

Backpropagation computes $\delta_j^{(l)}$ = "error" of cost for $a_j^{(l)}$ (unit $j$ in layer $l$).   
Formally, $\delta_j^{(l)} = \frac{\partial }{\partial z_j^{(l)}} cost(i) $
 (for $j \geq 0$), where   
 cost(i) = $-[y^{(i)}\log(h_\Theta(x^{(i)})) + (1 - y^{(i)})\log(1 - h_\Theta(x^{(i)}))]$
 
 For example, ignoring the $g'(z)$ factors to keep the intuition simple:       
 $\delta_1^{(4)} = a_1^{(4)} - y^{(i)}$, and we back propagate the error:           
 $\delta_2^{(3)} = \Theta_{12}^{(3)} \delta_1^{(4)}$               
 $\delta_2^{(2)} = \Theta_{12}^{(2)} \delta_1^{(3)} + \Theta_{22}^{(2)} \delta_2^{(3)}$          
 and so on for the remaining units.
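
The formal definition of $\delta$ as a partial derivative can be checked numerically for the output unit. The sketch below assumes a sigmoid output and takes cost(i) with the leading minus sign it inherits from $J(\Theta)$; `cost_from_z` and `numeric_delta` are hypothetical helper names, and the values of `z` and `y` are arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost_from_z(z, y):
    """Cost of one example as a function of the output pre-activation z."""
    h = sigmoid(z)
    return -(y * math.log(h) + (1 - y) * math.log(1 - h))

def numeric_delta(z, y, eps=1e-6):
    """Central-difference estimate of d cost / d z."""
    return (cost_from_z(z + eps, y) - cost_from_z(z - eps, y)) / (2 * eps)

z, y = 0.7, 1.0                  # arbitrary pre-activation and label
analytic = sigmoid(z) - y        # delta^(4) = a^(4) - y
numeric = numeric_delta(z, y)
print(analytic, numeric)         # the two values should agree closely
```

The agreement between the two numbers is what justifies reading $a^{(4)} - y$ as the derivative of the cost with respect to $z^{(4)}$.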