# Logistic Regression Cost Function Derivation

> **Cost Function:** a measure of how well an ML model's predictions, for a given set of parameters, fit a set of training data.
### Example:
![eg_data](logistic_reg_eg_data.png)

### Review of Linear Regression Cost Function (MSE) for Logistic Regression
![lin_cost](cost_function_dont_use_lin_reg.png)
- The mean squared error (MSE) cost function for linear regression produces a convex J (bottom left), because J is quadratic in the parameters when f is linear.
- When the sigmoid function is used as f in J for logistic regression, J is no longer quadratic and the curve is non-convex (bottom right).
- A non-convex J can have many local minima that gradient descent can get stuck in.
- We need to change the cost function J so that it is convex for logistic regression.

We define the loss function L for a single training example as:
$$L(f_{\bar{w},b}(\bar{x}^{(i)}), y^{(i)})$$
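
The convexity claim in the review above can be checked numerically. The sketch below is only an illustration under simplifying assumptions: a single hypothetical training example (x = 1, y = 1), a model with one weight `w` and no bias, and the natural logarithm. It compares the squared error with the log-based loss for y = 1 by looking at discrete second differences on an evenly spaced grid; a convex curve never has a negative second difference.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical single training example and a one-weight model (no bias).
x, y = 1.0, 1.0
w = np.linspace(-10.0, 10.0, 201)   # evenly spaced candidate weights
f = sigmoid(w * x)                  # model prediction for each candidate w

squared_error = (f - y) ** 2        # MSE-style loss with a sigmoid inside
log_loss = -np.log(f)               # log-based loss for the y = 1 case

# Second differences approximate curvature on the grid.
sq_curvature = np.diff(squared_error, n=2)
log_curvature = np.diff(log_loss, n=2)

print("squared error has negative curvature somewhere:", bool(sq_curvature.min() < 0))
print("log loss has negative curvature somewhere:", bool(log_curvature.min() < 0))
```

A negative second difference anywhere on the grid is enough to show that a curve is not convex, which is what separates the two losses here.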

## Logistic Regression Loss Function Explained:
![loss](loss_function.png)
The value of f is the output of the sigmoid function for logistic regression. **This is always between 0 and 1.**

### When the true value is y=1 (in the training data)
![loss_1](loss_function_equal_1.png)
When y = 1:  
$L = -\log(f)$

- We see in the pink box the possible values for -log(f), where f is between 0 and 1.

- If the value of f tends to 1, we see the Loss function -log(f) tends to 0. (on the left)
- For example, if f = 1 means a 100% predicted chance that a tumour is malignant, then the loss is 0 (good in this context, as the model is completely sure and correct).
- If f = 0.1, the model predicts a 10% chance of the tumour being malignant when we know it is malignant (y = 1). The loss is then much larger, so the model is penalised accordingly.

- The loss is lowest when the prediction f is closest to the true value y = 1, so minimising it pushes the algorithm towards more accurate predictions. (The residual is smallest there, so the contribution to the cost J is close to 0.)

### When the true value is y=0
![loss_0](loss_function_equal_0.png)
When y = 0:  
$L = -\log(1-f)$

- We see in the pink box the possible values for -log(1-f), where f is between 0 and 1.
- As f tends to 0, the loss L tends to 0 (can see on the left): f is close to the true value y = 0, so this example adds very little to the total cost.
- The larger f gets, the bigger the loss, as the prediction moves further away from the true value of y = 0.
- As f tends to 1, the loss tends to infinity. So if the model predicts a value very close to 1 but the true value is y = 0, this loss function penalises the model by adding a very large loss to the cost function.

For both cases of y, the further the prediction f is from the true target y, the higher the loss, and thus the higher the cost function and the worse the model's predictions.
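
To make the two cases concrete, here is a minimal sketch that evaluates the loss for a few predictions. The function name `piecewise_loss` is made up for this example, and the natural logarithm is assumed.

```python
import numpy as np

def piecewise_loss(f, y):
    """Loss for one example: -log(f) if y == 1, -log(1 - f) if y == 0."""
    if y == 1:
        return -np.log(f)
    return -np.log(1.0 - f)

# Compare the loss for both possible targets across several predictions f.
for f in (0.1, 0.5, 0.9, 0.99):
    print(f"f={f:<5} loss if y=1: {piecewise_loss(f, 1):.3f}   "
          f"loss if y=0: {piecewise_loss(f, 0):.3f}")
```

The printed values should shrink as f approaches the true target and grow quickly as f approaches the wrong end of the interval.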

**With the loss function L defined above for logistic regression, the cost function is convex, and thus gradient descent can reach the global minimum.**

# Implementation of the Cost Function

The loss function above can be rewritten to be easier to implement.
$$loss(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i)}) = -y^{(i)} \log\left(f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) - \left( 1 - y^{(i)}\right) \log \left( 1 - f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right)$$

When $y^{(i)} = 0$, the left-hand term is eliminated:
$$
\begin{align}
loss(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), 0) &= -(0) \log\left(f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) - \left( 1 - 0\right) \log \left( 1 - f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) \\
&= -\log \left( 1 - f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right)
\end{align}
$$
and when $y^{(i)} = 1$, the right-hand term is eliminated:
$$
\begin{align}
  loss(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), 1) &= -(1) \log\left(f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) - \left( 1 - 1\right) \log \left( 1 - f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right)\\
  &=  -\log\left(f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right)
\end{align}
$$
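
The single-expression form maps directly onto vectorised code. The sketch below is one possible NumPy version, not a definitive implementation. The names `logistic_loss` and `logistic_cost` and the toy data are hypothetical, the natural logarithm is assumed, and the cost is assumed to be the average of the per-example losses over the training set, which is the usual convention.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(f, y):
    """Per-example loss: -y*log(f) - (1 - y)*log(1 - f).

    Assumes every prediction f lies strictly between 0 and 1.
    """
    return -y * np.log(f) - (1.0 - y) * np.log(1.0 - f)

def logistic_cost(X, y, w, b):
    """Average loss over all examples (assumed definition of the cost J)."""
    f = sigmoid(X @ w + b)          # predictions, one per row of X
    return np.mean(logistic_loss(f, y))

# Hypothetical toy data: 6 examples, 2 features each.
X = np.array([[0.5, 1.5], [1.0, 1.0], [1.5, 0.5],
              [3.0, 0.5], [2.0, 2.0], [1.0, 2.5]])
y = np.array([0, 0, 0, 1, 1, 1])

w = np.array([1.0, 1.0])            # illustrative parameters, not fitted values
b = -3.0
print("cost:", logistic_cost(X, y, w, b))
```

Because y is always 0 or 1, one of the two terms vanishes for every example, so no branching on the label is needed.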


![cost](simplified_cost_function.png)
This loss function is derived from the statistical principle of **Maximum Likelihood Estimation**, which is used to efficiently find parameters for different models.
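
As a brief sketch of that connection (the standard Bernoulli argument, assuming the training examples are independent and that $f_{\mathbf{w},b}(\mathbf{x}^{(i)})$ is read as the probability that $y^{(i)} = 1$; here $m$ denotes the number of training examples), the probability of one observed label can be written as:

$$P(y^{(i)} \mid \mathbf{x}^{(i)}; \mathbf{w}, b) = f_{\mathbf{w},b}(\mathbf{x}^{(i)})^{y^{(i)}} \left( 1 - f_{\mathbf{w},b}(\mathbf{x}^{(i)}) \right)^{1 - y^{(i)}}$$

Taking the log of the product over all $m$ examples gives the log-likelihood:

$$\log \prod_{i=1}^{m} P(y^{(i)} \mid \mathbf{x}^{(i)}; \mathbf{w}, b) = \sum_{i=1}^{m} \left[ y^{(i)} \log\left(f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) + \left( 1 - y^{(i)}\right) \log \left( 1 - f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) \right]$$

Each term in the sum is the negative of the loss for example $i$, so maximising the log-likelihood is the same as minimising the sum (or average) of the losses.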