In [26]:
import numpy as np

## Loss Functions

This notebook derives the forward and backward (derivative) formulas for several loss functions used in machine learning.
Then we will implement the math in Python.

### Binary Cross Entropy
Also known as: "Log Loss" or "negative log-likelihood"
$$
- \frac{1}{N} \sum_{i=1}^N \left[ y_i \log{\hat{y_i}} + (1 - y_i) \log{(1 - \hat{y_i})} \right]
$$

Where 
- $N$ is the total number of samples
- $y_i$ is the true label (0 or 1) for the $i$-th sample
- $\hat{y_i}$ is the predicted probability for the $i$-th sample

The formulation of this loss function actually comes from how we model the probability mass function (PMF) of a Bernoulli-distributed random variable $Y$, which represents a binary outcome of 0 or 1.
The probability that $Y$ takes a specific value $y$ (either 0 or 1), where $p = P(Y = 1)$, is:

$$
P(Y = y) = p^y (1 - p)^{1-y}
$$

Taking the logarithm gives the log-likelihood:
$$
\log P(Y = y) = y \log (p) + (1-y) \log (1 - p)
$$

In our machine learning world, we want to maximize this log-likelihood so that the model matches the distribution, with the predicted probability $\hat{y}$ playing the role of $p$. Gradient descent minimizes a loss, so we convert the problem from **maximize log likelihood** to **minimize negative log likelihood** by negating the expression. This gives the per-sample form of the first formula, known as the "Binary Cross Entropy" loss:
$$
- \left[ y \log{\hat{y}} + (1 - y) \log{(1 - \hat{y})} \right]
$$

Then we can sum over the samples and divide by the number of samples $N$ in a batch to get the averaged form. I will omit that factor here since the derivation is otherwise exactly the same.
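
As a quick sanity check of the derivation above, the following sketch evaluates the Bernoulli PMF directly and confirms that its negative log equals the per-sample BCE expression. The helper names `bernoulli_pmf` and `bce_per_sample` and the sample values are illustrative assumptions, not part of the notebook's own code.

```python
import numpy as np

def bernoulli_pmf(y, p):
    # P(Y = y) = p^y * (1 - p)^(1 - y) for y in {0, 1}
    return p ** y * (1 - p) ** (1 - y)

def bce_per_sample(y, y_hat):
    # Negative log-likelihood of a single Bernoulli outcome
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = np.array([1.0, 0.0, 1.0, 0.0])      # illustrative labels
y_hat = np.array([0.9, 0.2, 0.4, 0.7])  # illustrative predicted probabilities

nll = -np.log(bernoulli_pmf(y, y_hat))
print(nll)
print(np.allclose(nll, bce_per_sample(y, y_hat)))
```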

#### Forward

In [27]:
y_true = np.ones((4,))
y_pred = np.zeros((4,))
y_pred.fill(np.random.rand())
y_pred[0] = 0.99 # make the first prediction close to true label of 1
print(y_true, y_pred)

def binary_cross_entropy(y_true, y_pred):
    # Negate the whole sum; without the parentheses the (1 - y_true) term has the wrong sign
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

binary_cross_entropy(y_true, y_pred) # the first sample's loss should be close to 0, while the others are higher

[1. 1. 1. 1.] [0.99       0.68619332 0.68619332 0.68619332]


array([0.01005034, 0.37659589, 0.37659589, 0.37659589])
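
One practical caveat: `np.log` returns `-inf` at 0, so the forward pass above produces `inf` or `nan` when a prediction is exactly 0 or 1. A common guard is to clip predictions into $[\epsilon, 1 - \epsilon]$ before taking the log. The sketch below assumes a hypothetical helper `binary_cross_entropy_clipped` and an arbitrary `eps = 1e-12`; both are illustrative choices rather than part of the notebook.

```python
import numpy as np

def binary_cross_entropy_clipped(y_true, y_pred, eps=1e-12):
    # Keep predictions strictly inside (0, 1) so both logs stay finite
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

labels = np.array([1.0, 0.0, 1.0])
preds = np.array([0.0, 1.0, 0.99])  # the first two are exactly 0 / 1 and wrong

losses = binary_cross_entropy_clipped(labels, preds)
print(losses)  # large but finite for the first two entries
```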

#### Backward

Recall
$$
\text{BCE} = - \left[ y \log{\hat{y}} + (1 - y) \log{(1 - \hat{y})} \right] \\
$$

Compute the derivative of the two terms separately:
$$
\frac{\partial y \log \hat{y}}{\partial \hat{y}} = y \cdot \frac{1}{\hat{y}} \\
\frac{\partial (1 - y) \log{(1 - \hat{y})}}{\partial \hat{y}} = (1 - y) \cdot \frac{-1}{1-\hat{y}} \\
$$

Combine terms
$$
\frac{\partial \text{BCE}}{\partial \hat{y}} = - \left( y \cdot \frac{1}{\hat{y}} + (1 - y) \cdot \frac{-1}{1-\hat{y}} \right) \\
= - \frac{y}{\hat{y}} + \frac{1 - y}{1-\hat{y}} \\
= \frac{(1 - y)(\hat{y}) - y(1 - \hat{y})}{\hat{y} (1 - \hat{y})} \\
= \frac{\hat{y} - y}{\hat{y} (1 - \hat{y})}
$$

Now we can see that:

- when $y = 1$, the gradient simplifies to $-\frac{1}{\hat{y}}$, which is very negative when $\hat{y}$ is small, so a gradient step pushes $\hat{y}$ upwards, closer to $y = 1$
- when $y = 0$, the gradient simplifies to $\frac{1}{1 - \hat{y}}$, which is very positive when $\hat{y}$ is large, so a gradient step pushes $\hat{y}$ downwards, closer to $y = 0$

The pushing of $\hat{y}$ follows from the gradient descent update rule:
$$
\hat{y}_{\text{new}} = \hat{y}_{\text{old}} - \text{learning rate} \cdot \frac{\partial \text{BCE}}{\partial \hat{y}}
$$

- When the gradient is very positive, the update pushes $\hat{y}$ downwards, closer to $y = 0$.
- When the gradient is very negative, the update pushes $\hat{y}$ upwards, closer to $y = 1$.
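
To make the update rule concrete, here is a small sketch that applies it directly to a single prediction $\hat{y}$ with true label $y = 1$. The starting value, learning rate, and step count are arbitrary assumptions for illustration. In a real model the update is applied to the parameters that produce $\hat{y}$, not to $\hat{y}$ itself.

```python
def bce_grad(y, y_hat):
    # dBCE/dy_hat = (y_hat - y) / (y_hat * (1 - y_hat))
    return (y_hat - y) / (y_hat * (1 - y_hat))

y = 1.0
y_hat = 0.2           # assumed starting prediction, far from the label
learning_rate = 0.01  # assumed step size

history = [y_hat]
for _ in range(5):
    y_hat = y_hat - learning_rate * bce_grad(y, y_hat)
    history.append(y_hat)

print(history)  # each step moves y_hat upwards, toward y = 1
```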



In [28]:
def bce_backward(y_true, y_pred):
    return (y_pred - y_true) / (y_pred * (1 - y_pred))

print(y_true, y_pred)
bce_backward(y_true, y_pred) # every gradient is negative, pushing y_pred up toward y = 1; the less confident predictions get the larger magnitude

[1. 1. 1. 1.] [0.99       0.68619332 0.68619332 0.68619332]


array([-1.01010101, -1.45731527, -1.45731527, -1.45731527])
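
A finite-difference check is a cheap way to confirm the analytic gradient. The sketch below compares the formula against a central difference of the forward pass. The function names, the sample values, and the step size `h` are assumptions chosen for illustration.

```python
import numpy as np

def bce(y_true, y_pred):
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def bce_grad(y_true, y_pred):
    return (y_pred - y_true) / (y_pred * (1 - y_pred))

labels = np.array([1.0, 1.0, 0.0, 0.0])
preds = np.array([0.99, 0.6, 0.3, 0.8])

h = 1e-6  # assumed finite-difference step
numeric = (bce(labels, preds + h) - bce(labels, preds - h)) / (2 * h)
analytic = bce_grad(labels, preds)

print(analytic)
print(np.allclose(numeric, analytic, rtol=1e-5))
```

The gradient is negative where the label is 1 and positive where the label is 0, matching the two cases discussed above.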

### Cross Entropy

Let's generalize to the cross entropy loss, which measures the difference between two probability distributions $p_{true} = y_{true}$ and $p_{pred} = y_{pred} = \hat{y}$.

#### Forward

Let's recall the binary cross entropy formula, which is the special case of $C = 2$ classes:

$$
y_1 = y \; \text{and} \; y_0 = 1-y \\
\hat{y}_1 = \hat{y} \; \text{and} \; \hat{y}_0 = 1-\hat{y} \\
\text{Cross Entropy} = \text{Binary Cross Entropy} = -\Bigl[y \log(\hat{y}) + (1-y) \log(1-\hat{y})\Bigr]
$$

When we have more than 2 classes, let's denote the number of classes as $C$:

$$
\text{Cross Entropy (CE)}
= -\sum_{c=1}^{C} y_{\text{true}, c} \,\log \bigl(y_{\text{pred}, c}\bigr)
$$

There are a couple of ways to compute the cross entropy loss, depending on the type of inputs we pass in. We will discuss each one separately.

##### 1. probabilities for $y_{pred}$ and one-hot encoding for $y_{true}$

Inputs:
- $y_{true}$: A **one-hot** encoded vector representing the true class (e.g., [0, 1, 0] for class 2).
- $y_{pred}$: A probability distribution over classes (typically obtained by applying softmax to logits).

What It Does:
- The formula  $-\sum_{c} y_{\text{true},c} \log(y_{\text{pred},c})$  is the general definition of cross entropy.
- In a **one-hot** scenario, only the term corresponding to the correct class contributes to the loss, so it simplifies to  $-\log(\hat{y}_{\text{correct}})$.

Loss:
$$
\text{CE} = -\sum_{c} y_{\text{true}, c} \,\log \bigl(y_{\text{pred}, c}\bigr).
$$

With a one-hot $y_{true}$, this reduces to

$$
-\log \bigl(y_{\text{pred}, t}\bigr)
$$

where $t$ is the target class index.
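
A minimal numeric illustration of that simplification, using made-up values (the three-class vectors below are assumptions, and the index is 0-based as in NumPy):

```python
import numpy as np

y_true_onehot = np.array([0.0, 1.0, 0.0])  # correct class is index 1
y_pred_probs = np.array([0.2, 0.7, 0.1])   # assumed model output, sums to 1

full_sum = -np.sum(y_true_onehot * np.log(y_pred_probs))
target_only = -np.log(y_pred_probs[1])

print(full_sum, target_only)
print(np.isclose(full_sum, target_only))
```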

In [29]:
def cross_entropy(y_true, y_pred):
    return - np.sum(y_true * np.log(y_pred))

ce = cross_entropy(y_true, y_pred)
bce = binary_cross_entropy(y_true, y_pred)
print(ce, bce)
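# y_true is all ones here, so the (1 - y_true) term of BCE vanishes and the per-sample BCE values sum to ce
np.isclose(np.sum(bce), ce)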

1.1398379941581172 [0.01005034 0.37659589 0.37659589 0.37659589]


np.True_

##### 2. logits for $y_{pred}$ and target index for $y_{true}$

Inputs:
- $\mathbf{z} = (z_1, \dots, z_C)$ are the logits (raw scores) from the model.
- $t$ (an integer) is the index of the correct class.

What happens internally:
- Numerical stability: the implementation subtracts the maximum logit from all logits before exponentiating. This is the same shift used in the log-sum-exp trick; it leaves the softmax output unchanged and helps prevent numerical overflow.
- Softmax: it then converts the logits into a probability distribution $\hat{y}_j = \frac{e^{z_j}}{\sum_{k} e^{z_k}}$



Since the true label is given as an index, the loss is computed as:

$$
\text{Cross Entropy} = -\log(y_{pred, t}) = -\log \bigl(\text{softmax}(\mathbf{z})_t\bigr) = -\log\left(\frac{e^{z_t}}{\sum_j e^{z_j}}\right)
$$

where $t$ is the target class index and $j$ ranges over all the classes.

This matches the generic cross entropy definition in the one-hot case (only the target class term contributes).

In [30]:
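def cross_entropy(logits, target_index):
    # Shift by the max logit so np.exp cannot overflow; softmax is unchanged by the shift
    max_logit = np.max(logits)
    stable_logits = logits - max_logit

    exp_logits = np.exp(stable_logits)
    softmax = exp_logits / np.sum(exp_logits)

    # Only the target class contributes to the loss
    loss = - np.log(softmax[target_index])
    return loss, softmax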

loss, softmax = cross_entropy(np.array([1,2,2]), 1)

print(loss, softmax)

0.8619948040582511 [0.1553624 0.4223188 0.4223188]
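
Expanding the logarithm gives an equivalent form, $\text{CE} = \log \sum_j e^{z_j} - z_t$, which avoids building the softmax vector when only the loss is needed. The sketch below is one way to write it; the function name `cross_entropy_lse` and the shifted test logits are assumptions. It also shows why the max subtraction matters: with logits near 1000, a naive `np.exp` would overflow to `inf`.

```python
import numpy as np

def cross_entropy_lse(logits, target_index):
    # log(sum_j exp(z_j)) computed stably as m + log(sum_j exp(z_j - m))
    logits = np.asarray(logits, dtype=float)
    m = np.max(logits)
    log_sum_exp = m + np.log(np.sum(np.exp(logits - m)))
    return log_sum_exp - logits[target_index]

small = cross_entropy_lse([1, 2, 2], 1)
large = cross_entropy_lse([1001, 1002, 1002], 1)  # same logits shifted by 1000

print(small, large)  # the loss is invariant to shifting all logits
```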


#### Backward

We first consider the derivative of the cross entropy with respect to the predicted probability $y_{\text{pred}, t}$ for the target class $t$:

$$
\frac{\partial \text{CE}}{\partial y_{\text{pred}, t}}
= - \frac{\partial}{\partial y_{\text{pred}, t}} \bigl[\log(y_{\text{pred}, t})\bigr]
= -\frac{1}{y_{\text{pred}, t}}.
$$

For any other class $j \neq t$, we have $y_{\text{true}, j} = 0$, so the loss does not directly depend on $y_{\text{pred}, j}$:

$$
\frac{\partial \text{CE}}{\partial y_{\text{pred}, j}}
= 0 \quad \text{(for } j \neq t\text{)}.
$$

**Gradient with Respect to Logits**

Recall that

$$
\text{CE} = -\log\!\Bigl(\frac{e^{z_t}}{\sum_{j} e^{z_j}}\Bigr)
$$

and

$$
y_{\text{pred}, i} = \frac{e^{z_i}}{\sum_{k} e^{z_k}}
$$

By applying the chain rule (and using the known derivative of softmax), we arrive at the well-known result for all classes i:

$$
\frac{\partial \text{CE}}{\partial z_i} = y_{\text{pred}, i} - y_{\text{true}, i}
$$

**In more detail:**

1.	Softmax Derivative:

$$
\frac{\partial y_{\text{pred}, i}}{\partial z_j} = y_{\text{pred}, i}\,\bigl(\delta_{ij} - y_{\text{pred}, j}\bigr)
$$


where $\delta_{ij}$ is the Kronecker delta (1 if $i=j$, 0 otherwise).

2.	Chain Rule:

$$
\frac{\partial \text{CE}}{\partial z_i}
= \sum_{j} \frac{\partial \text{CE}}{\partial y_{\text{pred}, j}}
\cdot \frac{\partial y_{\text{pred}, j}}{\partial z_i}
$$

Only the term for $j = t$ is nonzero in $\frac{\partial \text{CE}}{\partial y_{\text{pred}, j}}$. Substituting and simplifying leads to

$$
\begin{align}
\frac{\partial \text{CE}}{\partial z_i} &= -\frac{1}{y_{\text{pred}, t}} \cdot y_{\text{pred}, t}\,\bigl(\delta_{ti} - y_{\text{pred}, i}\bigr) \\
&= y_{\text{pred}, i} - \delta_{it} \\
&= y_{\text{pred}, i} - y_{\text{true}, i}
\end{align}
$$
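
The two steps above can be checked numerically. The sketch below builds the softmax Jacobian explicitly, applies the chain rule as a matrix-vector product, and compares the result with $y_{\text{pred}} - y_{\text{true}}$. The variable names and the example logits are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

z = np.array([1.0, 2.0, 2.0])  # example logits
t = 1                          # target class index
y_pred = softmax(z)

# Softmax Jacobian: J[i, j] = d y_pred[i] / d z[j] = y_pred[i] * (delta_ij - y_pred[j])
jacobian = np.diag(y_pred) - np.outer(y_pred, y_pred)

# dCE/dy_pred is -1 / y_pred[t] at the target index and 0 elsewhere
dce_dy = np.zeros_like(y_pred)
dce_dy[t] = -1.0 / y_pred[t]

# Chain rule: dCE/dz[i] = sum_j dCE/dy_pred[j] * d y_pred[j] / d z[i]
grad_chain = jacobian.T @ dce_dy

y_true = np.zeros_like(y_pred)
y_true[t] = 1.0
print(grad_chain)
print(np.allclose(grad_chain, y_pred - y_true))
```

In practice the Jacobian is never materialized, since the closed form $y_{\text{pred}} - y_{\text{true}}$ gives the same vector directly.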

**Interpretation**

1.	Target Class ($i = t$):
    $$
    \frac{\partial \text{CE}}{\partial z_t} = y_{\text{pred}, t} - 1
    $$

    If $y_{\text{pred}, t} < 1$, this gradient is negative, indicating $z_t$ should increase (raising the probability of the correct class).

2.	Non-Target Classes ($i \neq t$):
    $$
    \frac{\partial \text{CE}}{\partial z_i} = y_{\text{pred}, i}
    $$

    A positive gradient means $z_i$ should decrease to lower the probability for incorrect classes.

In [31]:
def grad_cross_entropy(softmax, target_index):
    grad = softmax.copy()
    grad[target_index] -= 1  # Subtract 1 for the target index
    return grad

grad = grad_cross_entropy(softmax, target_index=1)
print(grad, softmax)

[ 0.1553624 -0.5776812  0.4223188] [0.1553624 0.4223188 0.4223188]
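
As with the binary case, a central finite difference on the loss gives an independent check of this gradient. The helper `ce_from_logits` and the step size `h` are assumptions for illustration; the logits and target index match the example above.

```python
import numpy as np

def ce_from_logits(z, t):
    # -log(softmax(z)[t]) with the max subtraction for stability
    shifted = z - np.max(z)
    return -(shifted[t] - np.log(np.sum(np.exp(shifted))))

z = np.array([1.0, 2.0, 2.0])
t = 1
h = 1e-6  # assumed finite-difference step

numeric_grad = np.zeros_like(z)
for i in range(len(z)):
    step = np.zeros_like(z)
    step[i] = h
    numeric_grad[i] = (ce_from_logits(z + step, t) - ce_from_logits(z - step, t)) / (2 * h)

print(numeric_grad)  # should be close to softmax - one_hot from the cell above
```

The entries sum to approximately zero, which reflects that shifting every logit by the same amount leaves the loss unchanged.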


### Mean Squared Error

### Hinge Loss

### Quantile Loss