# Neural Networks - Cost Function
This section focuses on applying neural networks (NNs) to classification problems
- Here's the setup
  - Training set is {$(x_1, y_1), (x_2, y_2), (x_3, y_3) ... (x_m, y_m)$}
  - $L$ = number of layers in the network
    - In our example below $L = 4$
  - $s_l$ = number of units (not counting bias unit) in layer $l$
    - So here: $L = 4, s_1 = 3, s_2 = 5, s_3 = 5, s_4 = 4$
  - $k$ is the number of units in the output layer
    - So here: $k = 4$
  <img src="images/NN-4-layers.png">
  - <font color="blue">Binary classification</font>
    - 1 output (0 or 1)
    - $k = 1$
    - $s_L = 1$
  <p>
  - <font color="blue">Multi-class classification</font>
    - $k$ distinct classifications
    - Typically $k$ is greater than or equal to 3
      - If only two just go for binary
    - $s_L = k$
    - So $y$ is a $k$-dimensional vector, with a 1 in the position of the true class and 0 elsewhere
    <img src="images/NN-4-layers-output.png">
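
To make the multi-class label format concrete, here is a minimal sketch of turning integer class labels into $k$-dimensional target vectors. The helper name `one_hot` and the convention that classes are numbered from 0 are assumptions made for illustration; they are not part of the notes above.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer labels (assumed to be 0 .. num_classes-1) into one-hot rows."""
    labels = np.asarray(labels)
    Y = np.zeros((labels.size, num_classes))
    # For row i, set the column given by labels[i] to 1
    Y[np.arange(labels.size), labels] = 1.0
    return Y

# Four training examples, k = 4 classes (matching the 4-unit output layer above)
Y = one_hot([0, 2, 3, 1], num_classes=4)
print(Y)
```

Each row of `Y` plays the role of one $y$ vector, so the network's $k$ outputs can be compared against it unit by unit.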

## Cost function for neural networks
- The (regularized) logistic regression cost function is as follows:
<img src="images/logistic-regression-cost-function.png">
<p>
- For neural networks, our cost function is a generalization of the equation above
  - Instead of one output, we generate $k$ outputs
<img src="images/neural-network-cost-function.png">
<p>
- Our <font color="blue">hypothesis now outputs a $k$-dimensional vector</font> (the cost $J(Ɵ)$ itself is still a single number)
  - <font color="blue">$h_Ɵ(x)$ is a $k$-dimensional vector</font>, 
    so $h_Ɵ(x)_i$ refers to the i<sup>th</sup> value in that vector

<p>
- Cost function $J(Ɵ)$ is
  - $-1/m$ times a sum of a term similar to the one we had for logistic regression
  - But now there is also a sum from $k = 1$ through to $K$ ($K$ is the number of output nodes, written $k$ in the setup above)
    - Sum over the $K$ output units - i.e. for each of the possible classes
    - E.g., if there are 4 output units then the sum is $k = 1$ to 4
  - NB: In the regularization term we don't sum over the bias terms (hence the summations over the weights start at 1 rather than 0)
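
For reference, the cost function described above can be written out in LaTeX as follows. This is the standard form of the regularized neural network cost; the superscript $(i)$ for the $i$-th training example and $\Theta^{(l)}$ for the weight matrix between layers $l$ and $l+1$ are notational assumptions, since the notes show the formula only as an image.

$$
J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[ y_k^{(i)} \log \left( h_\Theta(x^{(i)}) \right)_k + \left( 1 - y_k^{(i)} \right) \log \left( 1 - \left( h_\Theta(x^{(i)}) \right)_k \right) \right] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left( \Theta_{ji}^{(l)} \right)^2
$$

The first term is the double sum over training examples and output units; the second is the regularization term, whose index $i$ starts at 1 so that the bias weights ($i = 0$) are left out.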


This looks really complicated, but it's not so difficult
<font color="red" size="4em">Let's take a second to try and understand this!</font>

- There are basically two halves to the neural network cost function
  - **First half**
    <img src="images/logistic-regression-cost-function - first half.png">
    - This is just saying:
      - For <font color="blue">each training data example</font> (i.e. 1 to m - the first summation)
         - Sum <font color="blue">over all output units</font>
    - This is the <font color="blue">logistic regression cost summed over all output units and averaged over the $m$ training examples</font>

<p>
  - **Second half**
    <img src="images/logistic-regression-cost-function - second half.png">
    - This is a massive regularization summation term, which I'm not going to walk through, 
      <br>but it's a fairly straightforward triple nested summation
      <br>Intuition: It penalizes all the $Ɵ$<sub>ij</sub> weights for all layers
    - This is also called a *<font color="blue">weight decay term</font>*
    - As before, the $\lambda$ value determines the relative importance of the two halves
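
The two halves can be sketched in NumPy as below. This is a minimal illustration under stated assumptions, not a definitive implementation: the function names (`sigmoid`, `forward`, `nn_cost`), the choice to store each layer's weights as a matrix of shape $(s_{l+1}, s_l + 1)$ with the bias weights in column 0, and the one-hot label matrix `Y` are all hypothetical conventions chosen for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, thetas):
    """Forward pass. X is (m, s_1); each theta is (s_{l+1}, s_l + 1)."""
    a = X
    for theta in thetas:
        a = np.hstack([np.ones((a.shape[0], 1)), a])  # prepend the bias unit
        a = sigmoid(a @ theta.T)
    return a  # shape (m, K): one K-dimensional hypothesis per example

def nn_cost(X, Y, thetas, lam):
    """Regularized cost J. Y is an (m, K) matrix of one-hot targets."""
    m = X.shape[0]
    H = forward(X, thetas)
    # First half: sum over examples and output units, then average over m
    data_term = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m
    # Second half: squared weights of every layer, skipping bias column 0
    reg_term = lam / (2 * m) * sum(np.sum(t[:, 1:] ** 2) for t in thetas)
    return data_term + reg_term

# Layer sizes from the example network: s_1 = 3, s_2 = 5, s_3 = 5, s_4 = 4
sizes = [3, 5, 5, 4]
X = np.zeros((4, 3))   # 4 dummy training examples
Y = np.eye(4)          # one example per class, one-hot encoded
zero_thetas = [np.zeros((sizes[i + 1], sizes[i] + 1)) for i in range(3)]
cost = nn_cost(X, Y, zero_thetas, lam=1.0)
print(cost)
```

With all-zero weights every output unit predicts 0.5, so the data term reduces to $K \log 2$ and the regularization term vanishes, which makes a handy sanity check. Note that this sketch takes `np.log` directly and would need clipping if an output saturated to exactly 0 or 1.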
 
 <p>
 <font color="red" size="3em">So, we have a cost function, but how do we minimize this bad boy?!</font>
 ### Next: Backpropagation ...