The cost function:

$Cost\Rightarrow C_0(\ldots)=\left(a^{(L)}-y\right)^2$

$C_0$ : The cost for a single training example (index 0). The $\ldots$ stands for the parameters the cost depends on.

$a^{(L)}$ : The activation of the last layer $L$ of the neural network for a given input. In simpler terms, it is the network's output for that input. The superscript $(L)$ marks it as the activation of the last layer, the one that produces the final prediction.

$y$ : The actual target or true value that the model is trying to predict. In supervised learning, each input has an associated true value $y$ that the model aims to approximate.

$\left(a^{(L)}-y\right)^2$ : The squared difference between the predicted value $a^{(L)}$ and the actual value $y$. Squaring serves two purposes: it makes the result non-negative, and it penalizes large errors more than small ones, because the penalty grows quadratically (doubling the error quadruples the cost).

In summary, this cost function measures the squared error between the neural network's predictions and the actual values. It's often used in regression tasks where the goal is to predict continuous values. The training process of the neural network involves adjusting its weights and biases to minimize this cost function, thereby making the predictions as close as possible to the actual values.
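As a quick illustration of the squared error above, here is a minimal Python sketch. The function name `squared_error` and the sample numbers are assumptions made for this example only.

```python
def squared_error(a_L, y):
    """Cost C_0 = (a_L - y)^2 for a single training example."""
    return (a_L - y) ** 2

# A prediction close to the target gives a small cost ...
print(squared_error(0.8, 1.0))
# ... while an error four times larger gives a cost sixteen times larger.
print(squared_error(0.2, 1.0))
```

The second call shows the quadratic penalty: the error goes from 0.2 to 0.8, and the cost grows by a factor of 16.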



$z^{(L)} = w^{(L)}a^{(L-1)}+b^{(L)}$

$a^{(L)}=\sigma\left(z^{(L)}\right)$
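The two equations above can be sketched in Python for a single neuron per layer. The text does not say which $\sigma$ is used, so the logistic sigmoid is an assumption here, as are the names `sigmoid` and `forward`.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, assumed here as the activation function sigma."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, a_prev, b):
    """Return (z, a) for one layer: z = w * a_prev + b, a = sigma(z)."""
    z = w * a_prev + b
    a = sigmoid(z)
    return z, a

# Example values chosen only for illustration.
z_L, a_L = forward(w=0.5, a_prev=0.9, b=-0.3)
print(z_L, a_L)
```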

<p style="font-family: Arial;">
    <span style="display: inline-block; width: 100px; height: 2px; background-color: black; vertical-align: middle;"></span>
    <span style="vertical-align: middle;">a<sup>(L-2)</sup></span>
    <span style="display: inline-block; width: 100px; height: 2px; background-color: black; vertical-align: middle;"></span>
    <span style="display: inline-block; width: 100px; height: 2px; background-color: black; vertical-align: middle;"></span>
    <span style="vertical-align: middle;">a<sup>(L-1)</sup></span>
    <span style="display: inline-block; width: 100px; height: 2px; background-color: black; vertical-align: middle;"></span>
    <span style="display: inline-block; width: 100px; height: 2px; background-color: black; vertical-align: middle;"></span>
    <span style="vertical-align: middle;">a<sup>(L)</sup></span>
    <span style="display: inline-block; width: 100px; height: 2px; background-color: black; vertical-align: middle;"></span>
    
</p>



We want to know how sensitive the cost is to the last-layer weight. By the chain rule:

$\frac{\partial C_0}{\partial w^{(L)}}=\frac{\partial z^{(L)}}{\partial w^{(L)}}\frac{\partial a^{(L)}}{\partial z^{(L)}}\frac{\partial C_0}{\partial a^{(L)}}$

The three factors are:

$\frac{\partial z^{(L)}}{\partial w^{(L)}} = a^{(L-1)}$

$\frac{\partial a^{(L)}}{\partial z^{(L)}} = \sigma'(z^{(L)})$

$\frac{\partial C_0}{\partial a^{(L)}} = 2\left(a^{(L)}-y\right)$

Multiplying them together gives:

$\frac{\partial C_0}{\partial w^{(L)}}=\frac{\partial z^{(L)}}{\partial w^{(L)}}\frac{\partial a^{(L)}}{\partial z^{(L)}}\frac{\partial C_0}{\partial a^{(L)}} =2 a^{(L-1)} \sigma'\left(z^{(L)}\right)\left(a^{(L)}-y\right)$
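One way to sanity-check the closed form above is to compare it with a numerical derivative. The sketch below assumes a logistic sigmoid for $\sigma$ (so $\sigma'(z)=\sigma(z)(1-\sigma(z))$); all function names and sample values are chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Derivative of the logistic sigmoid."""
    s = sigmoid(z)
    return s * (1.0 - s)

def cost(w, b, a_prev, y):
    """C_0 = (sigma(w * a_prev + b) - y)^2."""
    return (sigmoid(w * a_prev + b) - y) ** 2

def dC_dw(w, b, a_prev, y):
    """Chain rule: 2 * a_prev * sigma'(z) * (a - y)."""
    z = w * a_prev + b
    a = sigmoid(z)
    return 2.0 * a_prev * sigmoid_prime(z) * (a - y)

w, b, a_prev, y = 0.5, -0.3, 0.9, 1.0
eps = 1e-6

analytic = dC_dw(w, b, a_prev, y)
# Central finite difference as an independent check.
numeric = (cost(w + eps, b, a_prev, y) - cost(w - eps, b, a_prev, y)) / (2 * eps)
print(analytic, numeric)
```

The two printed values should agree to many decimal places. A mismatch usually points to a mistake in one of the three factors.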

The full cost $C$ is the average over all $n$ training examples, so its derivative is the average of the per-example derivatives:

$\frac{\partial C}{\partial w^{(L)}} = \frac{1}{n}\sum^{n-1}_{k=0}\frac{\partial C_{k}}{\partial w^{(L)}}$
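The averaging step can be sketched as follows, again assuming a logistic sigmoid and a single weight and bias. The list `examples` of `(a_prev, y)` pairs is made-up data for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dCk_dw(w, b, a_prev, y):
    """Per-example derivative: 2 * a_prev * sigma'(z) * (a - y)."""
    a = sigmoid(w * a_prev + b)
    return 2.0 * a_prev * a * (1.0 - a) * (a - y)

def dC_dw(w, b, examples):
    """Average the per-example derivatives over all n examples."""
    grads = [dCk_dw(w, b, a_prev, y) for a_prev, y in examples]
    return sum(grads) / len(grads)

# Hypothetical (a_prev, y) pairs, one per training example.
examples = [(0.9, 1.0), (0.1, 0.0), (0.5, 1.0)]
print(dC_dw(0.5, -0.3, examples))
```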


We want the gradient of $C$ with respect to every weight and bias in the network:

$\nabla C = \begin{bmatrix}\frac{\partial C}{\partial w^{(1)}}\\ \\ \frac{\partial C}{\partial b^{(1)}} \\ \\ \vdots  \\ \\ \frac{\partial C}{\partial w^{(L)}} \\ \\ \frac{\partial C}{\partial b^{(L)}}\end{bmatrix}$

For the derivative of the cost with respect to the bias, only the first factor changes, and since $\frac{\partial z^{(L)}}{\partial b^{(L)}} = 1$ we get:

$$\frac{\partial C_0}{\partial b^{(L)}}=\frac{\partial z^{(L)}}{\partial b^{(L)}}\frac{\partial a^{(L)}}{\partial z^{(L)}}\frac{\partial C_0}{\partial a^{(L)}} = 2  \sigma'\left(z^{(L)}\right)\left(a^{(L)}-y\right)$$

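For backpropagation, we also need the derivative with respect to the previous layer's activation, where $\frac{\partial z^{(L)}}{\partial a^{(L-1)}} = w^{(L)}$: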

$\frac{\partial C_0}{\partial a^{(L-1)}}=\frac{\partial z^{(L)}}{\partial a^{(L-1)}}\frac{\partial a^{(L)}}{\partial z^{(L)}}\frac{\partial C_0}{\partial a^{(L)}} =2 w^{(L)}\sigma'\left(z^{(L)}\right)\left(a^{(L)}-y\right)$
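The bias and previous-activation derivatives share the factor $2\sigma'\left(z^{(L)}\right)\left(a^{(L)}-y\right)$ and differ only in what multiplies it. A small sketch makes that explicit, assuming a logistic sigmoid; the helper name `common_factor` is hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(w, b, a_prev, y):
    return (sigmoid(w * a_prev + b) - y) ** 2

def common_factor(w, b, a_prev, y):
    """sigma'(z) * 2 * (a - y), shared by all last-layer derivatives."""
    a = sigmoid(w * a_prev + b)
    return a * (1.0 - a) * 2.0 * (a - y)

def dC_db(w, b, a_prev, y):
    # dz/db = 1, so the bias derivative is just the shared factor.
    return common_factor(w, b, a_prev, y)

def dC_da_prev(w, b, a_prev, y):
    # dz/da_prev = w, so the shared factor is scaled by the weight.
    return w * common_factor(w, b, a_prev, y)

w, b, a_prev, y = 0.5, -0.3, 0.9, 1.0
print(dC_db(w, b, a_prev, y), dC_da_prev(w, b, a_prev, y))
```

The value of `dC_da_prev` is what gets passed back to layer $L-1$, where the same chain-rule pattern repeats.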

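With several neurons per layer, the cost for one example sums over the $n_L$ neurons of the output layer:

$C_0=\sum^{n_L-1}_{j=0}\left(a_j^{(L)}-y_j\right)^2$

and the weighted sum for neuron $j$ becomes (here with three neurons in layer $L-1$):

$z_j^{(L)}=w_{j,0}^{(L)}a_0^{(L-1)}+w_{j,1}^{(L)}a_1^{(L-1)}+w_{j,2}^{(L)}a_2^{(L-1)}+b_j^{(L)}$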


The activation of neuron $j$ is then:

$a_j^{(L)}=\sigma \left(z_j^{(L)}\right)$

and the sensitivity of the cost to a single weight is:

$\frac{\partial C_0}{\partial w^{(L)}_{j,k}}=\frac{\partial z^{(L)}_j}{\partial w^{(L)}_{j,k}}\frac{\partial a^{(L)}_j}{\partial z^{(L)}_j}\frac{\partial C_0}{\partial a^{(L)}_j}$

For backpropagation, only the first factor changes, and we sum over $j$ because $a_k^{(L-1)}$ feeds into every neuron of layer $L$:

$\frac{\partial C_0}{\partial a^{(L-1)}_{k}}=\sum_{j=0}^{n_L-1}\frac{\partial z^{(L)}_j}{\partial a^{(L-1)}_{k}}\frac{\partial a^{(L)}_j}{\partial z^{(L)}_j}\frac{\partial C_0}{\partial a^{(L)}_j}$
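The indexed formulas above can be sketched with plain Python lists. This assumes a logistic sigmoid, a layer with two output neurons fed by three activations, and made-up numbers; `layer_forward` and `layer_gradients` are names chosen for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(W, a_prev, b):
    """z_j = sum_k W[j][k] * a_prev[k] + b[j], a_j = sigma(z_j)."""
    z = [sum(W[j][k] * a_prev[k] for k in range(len(a_prev))) + b[j]
         for j in range(len(W))]
    a = [sigmoid(zj) for zj in z]
    return z, a

def cost(a, y):
    """C_0 = sum_j (a_j - y_j)^2."""
    return sum((aj - yj) ** 2 for aj, yj in zip(a, y))

def layer_gradients(W, a_prev, b, y):
    """Return (dC/dW[j][k], dC/da_prev[k]) for one training example."""
    _, a = layer_forward(W, a_prev, b)
    # delta_j = sigma'(z_j) * dC/da_j, with sigma'(z_j) = a_j * (1 - a_j).
    delta = [a[j] * (1.0 - a[j]) * 2.0 * (a[j] - y[j]) for j in range(len(W))]
    # dz_j/dW[j][k] = a_prev[k]
    dW = [[a_prev[k] * delta[j] for k in range(len(a_prev))]
          for j in range(len(W))]
    # dz_j/da_prev[k] = W[j][k], summed over every neuron j it feeds.
    da_prev = [sum(W[j][k] * delta[j] for j in range(len(W)))
               for k in range(len(a_prev))]
    return dW, da_prev

W = [[0.2, -0.4, 0.1], [0.7, 0.3, -0.5]]
b = [0.1, -0.2]
a_prev = [0.9, 0.5, 0.3]
y = [1.0, 0.0]

dW, da_prev = layer_gradients(W, a_prev, b, y)
print(dW)
print(da_prev)
```

Note that `dW[j][k]` involves only neuron `j`, while each entry of `da_prev` adds up contributions from all output neurons, which is the one structural difference between the two formulas.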