In [1]:
import numpy as np

## Loss Functions

This notebook derives the forward and backward (derivative) formulas for several loss functions used in machine learning.
We then implement each formula in Python.

### Binary Cross Entropy
Also known as: "Log Loss" or "negative log-likelihood"
$$
- \frac{1}{N} \sum_{i=1}^N \left[ y_i \log{\hat{y_i}} + (1 - y_i) \log{(1 - \hat{y_i})} \right]
$$

Where 
- $N$ is the total number of samples
- $y_i$ is the true label (0 or 1) for the $i$-th sample
- $\hat{y_i}$ is the predicted probability for the $i$-th sample

The formulation of this loss function actually comes from how we model the probability mass function (PMF) of a Bernoulli-distributed random variable $Y$, which represents a binary outcome of 0 or 1.
The likelihood of $Y$ taking on a specific value of $y$ (either 0 or 1) is:

$$
P(Y = y) = p^y (1 - p)^{1-y}
$$

Taking the logarithm of both sides gives the log-likelihood:
$$
\log P(Y = y) = y \log (p) + (1-y) \log (1 - p)
$$
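
As a quick sanity check, the sketch below evaluates the Bernoulli PMF and its logarithm for both outcomes and confirms that the log of the PMF equals $y \log(p) + (1-y) \log(1-p)$. The function names `bernoulli_pmf` and `bernoulli_log_pmf` and the value `p = 0.8` are illustrative choices, not part of the original notebook.

```python
import numpy as np

def bernoulli_pmf(y, p):
    # P(Y = y) = p^y * (1 - p)^(1 - y), for y in {0, 1}
    return p**y * (1 - p) ** (1 - y)

def bernoulli_log_pmf(y, p):
    # log P(Y = y) = y * log(p) + (1 - y) * log(1 - p)
    return y * np.log(p) + (1 - y) * np.log(1 - p)

p = 0.8  # assumed probability of the outcome y = 1
for y in (0, 1):
    pmf = bernoulli_pmf(y, p)
    log_pmf = bernoulli_log_pmf(y, p)
    # The two columns after y should agree: log(pmf) and the expanded form
    print(y, np.log(pmf), log_pmf)
```

For $y = 1$ only the $\log(p)$ term survives, and for $y = 0$ only the $\log(1 - p)$ term does, which is why the exponent form of the PMF turns into a sum of two terms.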

In machine learning, we want to maximize this likelihood so that the model's predicted probability $\hat{y}$ (which plays the role of $p$) matches the data. Because gradient descent minimizes a loss function, we convert the problem from **maximize log likelihood** to **minimize negative log likelihood**. Negating the log-likelihood gives the per-sample form of the first formula, known as the "Binary Cross Entropy" loss:
$$
- \left[ y \log{\hat{y}} + (1 - y) \log{(1 - \hat{y})} \right]
$$

To recover the first formula, we sum this loss over the samples in a batch and divide by the number of samples $N$. I will omit the averaging below, since the per-sample derivation is otherwise exactly the same.

#### Forward

In [9]:
y_true = np.ones((4,))
y_pred = np.zeros((4,))
y_pred.fill(np.random.rand())
y_pred[0] = 0.99 # make the first prediction close to true label of 1
print(y_true, y_pred)

def binary_cross_entropy(y_true, y_pred):
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

binary_cross_entropy(y_true, y_pred) # the first sample's loss should be close to 0, while the others are higher

[1. 1. 1. 1.] [0.99       0.59220362 0.59220362 0.59220362]


array([0.01005034, 0.52390475, 0.52390475, 0.52390475])
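
The per-sample losses can be averaged over the batch to match the first formula. The sketch below is one way to do that; the name `mean_binary_cross_entropy` and the clipping constant `eps` are assumptions added here. The clipping is not part of the derivation: it only keeps `np.log` away from `log(0)` when a prediction is exactly 0 or 1.

```python
import numpy as np

def mean_binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # Assumption: clip predictions into [eps, 1 - eps] so the logs stay finite
    y_pred = np.clip(y_pred, eps, 1 - eps)
    per_sample = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    # Average over the N samples, as in the first formula
    return per_sample.mean()

# Illustrative labels and predictions (not taken from the cells above)
labels = np.array([1.0, 1.0, 0.0, 0.0])
probs = np.array([0.99, 0.6, 0.1, 0.4])
print(mean_binary_cross_entropy(labels, probs))
```

A prediction of 0.5 for every sample gives a mean loss of $\log 2$, which is a handy baseline when reading loss values.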

#### Backward

Recall
$$
\text{BCE} = - \left[ y \log{\hat{y}} + (1 - y) \log{(1 - \hat{y})} \right] \\
$$

Compute the derivative of the two terms separately:
$$
\frac{\partial y \log \hat{y}}{\partial \hat{y}} = y \cdot \frac{1}{\hat{y}} \\
\frac{\partial (1 - y) \log{(1 - \hat{y})}}{\partial \hat{y}} = (1 - y) \cdot \frac{-1}{1-\hat{y}} \\
$$

Combine terms
$$
\begin{aligned}
\frac{\partial \text{BCE}}{\partial \hat{y}} &= - \left( y \cdot \frac{1}{\hat{y}} + (1 - y) \cdot \frac{-1}{1-\hat{y}} \right) \\
&= - \frac{y}{\hat{y}} + \frac{1 - y}{1-\hat{y}} \\
&= \frac{(1 - y)\hat{y} - y(1 - \hat{y})}{\hat{y} (1 - \hat{y})} \\
&= \frac{\hat{y} - y}{\hat{y} (1 - \hat{y})}
\end{aligned}
$$
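
One way to gain confidence in this result is to compare it against a numerical derivative. The sketch below uses a central finite difference; the helper names `bce` and `bce_grad`, the sample values, and the step size `h` are illustrative assumptions.

```python
import numpy as np

def bce(y, y_hat):
    # Per-sample binary cross entropy
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def bce_grad(y, y_hat):
    # Analytic derivative derived above: (y_hat - y) / (y_hat * (1 - y_hat))
    return (y_hat - y) / (y_hat * (1 - y_hat))

y = np.array([1.0, 0.0, 1.0, 0.0])
y_hat = np.array([0.9, 0.2, 0.3, 0.7])

h = 1e-6  # assumed finite-difference step
numeric = (bce(y, y_hat + h) - bce(y, y_hat - h)) / (2 * h)
analytic = bce_grad(y, y_hat)

# The two gradients should agree up to finite-difference error
print(np.allclose(numeric, analytic, atol=1e-6))
```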

Now we can see that:

- when $y = 1$, the gradient simplifies to $-\frac{1}{\hat{y}}$, which is very negative when $\hat{y}$ is small, pushing $\hat{y}$ upwards toward $y = 1$
- when $y = 0$, the gradient simplifies to $\frac{1}{1 - \hat{y}}$, which is very positive when $\hat{y}$ is large, pushing $\hat{y}$ downwards toward $y = 0$

This push on $\hat{y}$ follows from the gradient descent update rule:
$$
\hat{y}_{\text{new}} = \hat{y}_{\text{old}} - \text{learning rate} \cdot \frac{\partial \text{BCE}}{\partial \hat{y}}
$$

- When the gradient is very positive, the update pushes $\hat{y}$ downwards, closer to $y = 0$.
- When the gradient is very negative, the update pushes $\hat{y}$ upwards, closer to $y = 1$.
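
The update rule can be simulated directly on $\hat{y}$ to watch the push happen. This is a simplified sketch: in a real model the gradient flows back to the model's parameters rather than updating $\hat{y}$ itself. The function `descend`, the learning rate, the step count, and the clipping bounds are all assumptions made for illustration.

```python
import numpy as np

def bce_grad(y, y_hat):
    # Derivative of BCE with respect to the prediction y_hat
    return (y_hat - y) / (y_hat * (1 - y_hat))

def descend(y, y_hat, learning_rate=0.01, steps=20):
    # Repeatedly apply: y_hat_new = y_hat_old - learning_rate * dBCE/dy_hat
    history = [y_hat]
    for _ in range(steps):
        y_hat = y_hat - learning_rate * bce_grad(y, y_hat)
        # Assumption: keep y_hat strictly inside (0, 1) so the gradient stays finite
        y_hat = float(np.clip(y_hat, 1e-6, 1 - 1e-6))
        history.append(y_hat)
    return history

up = descend(y=1.0, y_hat=0.2)    # negative gradient, so y_hat should rise
down = descend(y=0.0, y_hat=0.8)  # positive gradient, so y_hat should fall
print(up[0], up[-1])
print(down[0], down[-1])
```

With a small learning rate, the first trajectory rises monotonically toward 1 and the second falls monotonically toward 0, matching the two bullet points above.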



In [12]:
def bce_backward(y_true, y_pred):
    return (y_pred - y_true) / (y_pred * (1 - y_pred))

print(y_true, y_pred)
bce_backward(y_true, y_pred) # every element except the first should have a more negative gradient, pushing y_pred closer to y = 1

[1. 1. 1. 1.] [0.99       0.59220362 0.59220362 0.59220362]


array([-1.01010101, -1.68860838, -1.68860838, -1.68860838])

### Cross Entropy

#### Forward

### Mean Squared Error

### Hinge Loss

### Quantile Loss