# Cross-Entropy

### Theoretical Basics

- We previously went through the ideas of Shannon Entropy and Kullback-Leibler (KL) Divergence. The content below assumes full knowledge of those notebooks, so if anything doesn't make sense, go read the KL divergence notebook and/or the Shannon Entropy notebook for a deeper explanation.

- We know that, for a probability distribution $p(x)$, its Shannon Entropy (the expected number of bits needed to encode an outcome drawn from the distribution) is defined as
$$\begin{aligned}
    H(p) &= - \sum_i p(x_i) \cdot \log_2(p(x_i))
\end{aligned}$$
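
The definition above can be checked with a few lines of plain Python. This is only a minimal sketch: the helper name `shannon_entropy` and the two example distributions are chosen here for illustration and do not come from the earlier notebooks.

```python
import math

def shannon_entropy(p):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    # Outcomes with zero probability are skipped, following the convention 0 * log(0) = 0
    return -sum(p_i * math.log2(p_i) for p_i in p if p_i > 0)

fair_coin = [0.5, 0.5]
three_outcomes = [0.5, 0.25, 0.25]

print(shannon_entropy(fair_coin))       # 1.0
print(shannon_entropy(three_outcomes))  # 1.5
```

A fair coin needs exactly one bit per outcome, while the three-outcome distribution needs 1.5 bits on average.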

- We further know that, if we encode $p(x)$ using a code designed for a different distribution $q(x)$, the "uncertainty" gained from this mismatch (i.e. the increase in the expected number of bits needed to express $p(x)$) is measured by the KL divergence
$$\begin{aligned}
    D_{KL}(p || q) &= \sum_i p(x_i) \cdot \log_2(p(x_i)) - \sum_i p(x_i) \cdot \log_2(q(x_i)) \\
    &= - H(p) - \sum_i p(x_i) \cdot \log_2(q(x_i))
\end{aligned}$$

- Cross Entropy between distributions $p$ and $q$ is then simply defined as
$$\begin{aligned}
    H(p,q) &= - \sum_i p(x_i) \cdot \log_2(q(x_i))
\end{aligned}$$

- Taken together, the relationship between the KL divergence, Shannon Entropy, and Cross Entropy can be written as

$$\begin{aligned}
    D_{KL}(p || q) &= - H(p) + H(p,q) \\
    \text{KL Divergence} &= - \text{Shannon Entropy} + \text{Cross Entropy}
\end{aligned}$$
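
The identity can be verified numerically. The sketch below assumes two small example distributions `p` and `q` (picked here so that the arithmetic is easy to follow by hand) and uses hypothetical helper names.

```python
import math

def entropy(p):
    # H(p) = -sum_i p_i * log2(p_i)
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    # H(p, q) = -sum_i p_i * log2(q_i)
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    # D_KL(p || q) = sum_i p_i * log2(p_i / q_i)
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

h_p = entropy(p)            # 1.5 bits
h_pq = cross_entropy(p, q)  # 1.75 bits
kl = kl_divergence(p, q)    # 0.25 bits

print(kl, h_pq - h_p)
```

Encoding `p` with a code built for `q` costs 1.75 bits on average instead of 1.5, and the 0.25-bit difference is exactly the KL divergence.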

### Why is cross entropy important?

- Cross entropy is particularly important in deep learning as a loss function, because it gives an easily computed measure of how far a predicted probability distribution is from a target distribution
    - In classification problems, it is typically used together with softmax to compare the network's predicted class probabilities with the actual labels. We will see why in the subsection below on `Differentiability of Cross Entropy`
    - This can be used whether the labels are one-hot encoded or are themselves a probability distribution
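
To illustrate the last point, here is a small sketch that applies the same cross entropy function to a one-hot label and to a "soft" label. The prediction and both label vectors are made-up values, and the natural logarithm is used, as is common for loss functions.

```python
import math

def cross_entropy(labels, predicted):
    # -sum_i y_i * log(p_i); classes with y_i == 0 contribute nothing
    return -sum(y * math.log(p) for y, p in zip(labels, predicted) if y > 0)

predicted = [0.25, 0.35, 0.4]  # assumed model output (already sums to 1)
one_hot = [0.0, 0.0, 1.0]      # hard label: class 2 is correct
soft = [0.1, 0.1, 0.8]         # soft label: a distribution over the classes

loss_hard = cross_entropy(one_hot, predicted)  # equals -log(0.4)
loss_soft = cross_entropy(soft, predicted)

print(loss_hard, loss_soft)
```

With the one-hot label only the true class contributes to the sum, whereas with the soft label every class with non-zero label probability does.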

### Differentiability of Cross Entropy

- Loss functions must be differentiable. Let's see how Cross Entropy can be differentiated

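- Note that when the labels are one-hot encoded, with $y_i = 1$ for the true class $i$ and $y_k = 0$ for every other class $k$, cross entropy reduces to the **negative log likelihood** of the true class. Here $\log$ is the natural logarithm, which only rescales the loss by a constant factor relative to $\log_2$

$$\begin{aligned}
    \mathcal{L} &= - \sum_k y_k \log(p_k) \\
    &= -\log(p_i)
\end{aligned}$$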

- Therefore

$$\begin{aligned}
    \frac{\partial \mathcal{L}}{\partial p_i} &= \frac{\partial}{\partial p_i} (-\log(p_i)) \\
    &= -\frac{1}{p_i}
\end{aligned}$$
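
One way to sanity-check this derivative is to compare it with a central finite difference. The probability `p = 0.4` and the step size `eps` below are arbitrary choices for this sketch.

```python
import math

def nll(p):
    # Negative log likelihood of the true class, whose predicted probability is p
    return -math.log(p)

p = 0.4
eps = 1e-6

# Central finite difference approximation of d(nll)/dp
numeric = (nll(p + eps) - nll(p - eps)) / (2 * eps)
# Closed form derived above: -1 / p
analytic = -1.0 / p

print(numeric, analytic)
```

The two values should agree to several decimal places. The magnitude of the gradient grows as `p` approaches 0, so confident wrong predictions are penalised heavily.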

#### Why Cross Entropy is almost always used alongside Softmax

- Recall from the notebook `softmax.ipynb` that the derivative of softmax is given by
$$\begin{aligned}
    \frac{\partial x_i}{\partial z_j} &= \begin{cases}
        x_i \cdot (1 - x_i) & \text{if } i = j \\
        -x_i \cdot x_j & \text{if } i \neq j \\
    \end{cases}
\end{aligned}$$

- Remember our notation here; $x_i$ is the $i$-th softmax value, $z_i$ is the $i$-th input value 

- Now, suppose we have this basic network output: `layer output --> softmax --> cross entropy`. How should we do backpropagation?

- This is effectively asking how the loss changes for a given change in our layer output $z_j$. With one-hot labels the loss depends only on the softmax value $x_i$ of the true class $i$, so the chain rule has a single term;
$$\begin{aligned}
    \frac{\partial \mathcal{L}}{\partial z_j} &= \frac{\partial \mathcal{L}}{\partial x_i} \cdot \frac{\partial x_i}{\partial z_j} \\
    &= \begin{cases}
        -\frac{1}{x_i} \cdot x_i \cdot (1 - x_i) & \text{if } i = j \\
        -\frac{1}{x_i} \cdot (-x_i \cdot x_j) & \text{if } i \neq j \\
    \end{cases} \\
    &= \begin{cases}
        x_i - 1 & \text{if } i = j \\
        x_j & \text{if } i \neq j \\
    \end{cases}
\end{aligned}$$

- The $x_i$ terms cancel out, which means the gradient is simply the softmax output minus the one-hot label: the softmax output - 1 for the true class, and the softmax output itself for every other class!
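
The result can be checked numerically by comparing "softmax output minus one-hot label" against finite differences of the loss. This is a sketch under assumed inputs: the layer outputs `z` and the true class index `true_idx` are arbitrary example values.

```python
import math

def softmax(z):
    # Subtract the maximum before exponentiating for numerical stability
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def loss(z, true_idx):
    # Cross entropy with a one-hot label: -log of the true class's softmax value
    return -math.log(softmax(z)[true_idx])

z = [1.0, 2.0, 0.5]  # assumed layer outputs
true_idx = 1         # assumed index of the true class

x = softmax(z)
one_hot = [1.0 if j == true_idx else 0.0 for j in range(len(z))]

# Analytic gradient from the derivation: softmax output minus one-hot label
analytic = [x_j - y_j for x_j, y_j in zip(x, one_hot)]

# Numerical gradient via central finite differences on each z_j
eps = 1e-6
numeric = []
for j in range(len(z)):
    z_plus = list(z)
    z_minus = list(z)
    z_plus[j] += eps
    z_minus[j] -= eps
    numeric.append((loss(z_plus, true_idx) - loss(z_minus, true_idx)) / (2 * eps))

print(analytic)
print(numeric)
```

The two gradients should match closely, and the entries of the analytic gradient sum to zero because the softmax outputs and the one-hot label each sum to one.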

### Implementation

In [12]:
import numpy as np
import torch
from torch import nn

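def my_cross_entropy(input_arr: np.ndarray, labels: np.ndarray) -> float:
    # Clip the probabilities away from 0 so that np.log never returns -inf
    # (0 * -inf would otherwise turn the whole sum into nan)
    clipped_input_arr = np.clip(input_arr, 1e-6, 1.0)

    # Cross entropy: -sum_i y_i * log(p_i)
    return -np.sum(np.log(clipped_input_arr) * labels)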

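# These values already sum to 1, so they are probabilities rather than raw logits
probs = torch.tensor([[0.25, 0.35, 0.4]], dtype=torch.float32)
labels = torch.tensor([2])  # index of positive label
one_hot = (labels == torch.arange(3)).float()

# nn.CrossEntropyLoss expects raw logits and applies log-softmax internally,
# so for probabilities we use nn.NLLLoss on their logarithm instead
loss_fn = nn.NLLLoss()
loss = loss_fn(torch.log(probs), labels)

print(loss, my_cross_entropy(probs.numpy(), one_hot.numpy()))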

tensor(0.9163) 0.9162907
