# Computer Vision (911.908)

## <font color='crimson'>Cross Entropy</font>

**Changelog**:

- *Sep. 2020*: initial version (using PyTorch v1.6) 
- *Sep. 2021*: adaptations to PyTorch v1.9
- *Jan. 2022*: adaptations to PyTorch v1.10.1
- *Dec. 2022*: adaptations to PyTorch v1.13.1

---

In this lecture, we learn about the **cross-entropy** loss, i.e., the prevalent loss for training multi-class classifiers, implemented as neural networks.

---

## Contents

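- [Mean-squared-error (MSE) for classification problems?](#Mean-squared-error-(MSE)-for-classification-problems?)
- [Cross Entropy (CE)](#Cross-Entropy-(CE))
- [Practical example](#Practical-example)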


## Mean-squared-error (MSE) for classification problems?

Say you have a three-class **classification problem** where the class labels are **one-hot** encoded. Assume the correct class is class $1$, i.e., one-hot encoded as

$$\mathbf{y} = [1,0,0]^\top$$

Now, say your model predicts a **score vector** and we take the index (starting at $1$) of the maximum score as our prediction. For a score vector

$$\mathbf{y}' = [0,1.1,1]^\top$$

the MSE to $\mathbf{y}$ would be $3.21/3 = 1.07$. Similarly, if your model predicted

$$\mathbf{y}'' = [2,-1,-1.1]^\top$$

we would also get the **same** MSE. However, only $\mathbf{y}''$ yields the correct prediction (class $1$), whereas $\mathbf{y}'$ predicts class $2$.

**MSE makes a Gaussian noise assumption** (which is fine for regression) around the targets; this is most likely not satisfied in the context of classification. Below is code for the example above:

In [1]:
import torch.nn.functional as F
import torch

u = torch.tensor([1,0,0], dtype=torch.float32) # a 1 at the position of the class index
y = torch.tensor([0,1.1,1], dtype=torch.float32)
z = torch.tensor([2,-1,-1.1], dtype=torch.float32)

print(F.mse_loss(u,y, reduction='mean'))
print(F.mse_loss(u,z, reduction='mean'))

tensor(1.0700)
tensor(1.0700)
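
To make the point explicit, the following minimal sketch applies the argmax rule described above to both score vectors. The variable names `pred_y`, `pred_z`, `mse_y` and `mse_z` are introduced here for illustration only.

```python
import torch
import torch.nn.functional as F

u = torch.tensor([1.0, 0.0, 0.0])     # one-hot target (class 1)
y = torch.tensor([0.0, 1.1, 1.0])     # first score vector
z = torch.tensor([2.0, -1.0, -1.1])   # second score vector

# Predicted class = index of the maximum score (+1, as classes start at 1)
pred_y = torch.argmax(y).item() + 1
pred_z = torch.argmax(z).item() + 1

mse_y = F.mse_loss(y, u)
mse_z = F.mse_loss(z, u)

print('prediction from y:', pred_y, '| MSE:', mse_y.item())
print('prediction from z:', pred_z, '| MSE:', mse_z.item())
```

Both score vectors have the same MSE to the target, yet only the second one leads to the correct class.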


## Cross Entropy (CE)


Let's motivate the **cross-entropy loss** from the viewpoint of maximizing the posterior probability of the labels given the data.


In the following, assume $y$ identifies the label and $\mathbf{x} \in \mathbb{R}^d$ is some input sample. Also, we assume that tuples of the form $(\mathbf{x},y)$ are iid and labels ($y$) take values in $\{1,\ldots,C\}$ (so, a $C$-class problem). According to Bayes' rule, we have (for the posterior probability of class $i$)

$$p(y=i | \mathbf{x}) = \frac{p(\mathbf{x}|y=i)p(y=i)}{\sum_c p(\mathbf{x}|y=c)p(y=c)}$$

Upon defining

$$ a_i = \log(p(\mathbf{x}|y=i)p(y=i))$$

we can write

$$p(y=i | \mathbf{x}) = \frac{e^{a_i}}{\sum_c e^{a_c}}$$
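
As a quick sanity check of this identity, the following minimal sketch uses made-up likelihood and prior values for a 3-class problem (the numbers are purely illustrative assumptions) and compares the posterior from Bayes' rule with the one obtained by exponentiating and normalizing the $a_i$.

```python
import math

# Hypothetical values for a 3-class problem (illustrative only)
likelihood = [0.20, 0.05, 0.10]   # p(x | y=i), evaluated at one fixed x
prior = [0.50, 0.30, 0.20]        # p(y=i)

# Posterior via Bayes' rule
joint = [l * p for l, p in zip(likelihood, prior)]
posterior_bayes = [j / sum(joint) for j in joint]

# Posterior via a_i = log(p(x|y=i) p(y=i)), then exponentiate and normalize
a = [math.log(j) for j in joint]
exp_a = [math.exp(v) for v in a]
posterior_softmax = [e / sum(exp_a) for e in exp_a]

print(posterior_bayes)
print(posterior_softmax)
```

Both lists agree up to floating-point error.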

Now, let's make the assumption that the class-conditional probabilities, $p(\mathbf{x}|y=i)$, are Gaussian with mean $\boldsymbol{\mu}_i$ and
identity covariance $\boldsymbol{\Sigma} = \mathbf{I}$. 
Formally,

$$p(\mathbf{x}|y=i) = \frac{1}{(2\pi)^{d/2}} e^{-0.5 \| \mathbf{x}-\boldsymbol{\mu}_i \|^2}$$

Writing out $a_i$ from above gives

$$a_i = -0.5\mathbf{x}^\top\mathbf{x} + \boldsymbol{\mu}_i^\top\mathbf{x} - 0.5 \boldsymbol{\mu}_i^\top\boldsymbol{\mu}_i + \log(p(y=i)) + \log\left(\frac{1}{(2\pi)^{d/2}} \right)$$

Upon setting $\mathbf{w}_i = \boldsymbol{\mu}_i$, we get 

$$a_i = -0.5\mathbf{x}^\top\mathbf{x} + \mathbf{w}_i^\top\mathbf{x} \underbrace{- 0.5 \mathbf{w}_i^\top\mathbf{w}_i + \log(p(y=i)) + \log\left(\frac{1}{(2\pi)^{d/2}} \right)}_{b_i}$$

or (with terms aggregated) 

$$a_i = -0.5\mathbf{x}^\top\mathbf{x} + \mathbf{w}_i^\top\mathbf{x}  + b_i$$

where we have subsumed most of the terms into $b_i$. Now, 

$$p(y=i | \mathbf{x}) = \frac{e^{a_i}}{\sum_c e^{a_c}} = \frac{e^{-0.5\mathbf{x}^\top\mathbf{x}} e^{\mathbf{w}_i^\top\mathbf{x} + b_i}} {\sum_c e^{-0.5\mathbf{x}^\top\mathbf{x}} e^{\mathbf{w}_c^\top\mathbf{x} + b_c}} = \frac{e^{\mathbf{w}_i^\top\mathbf{x} + b_i}} {\sum_c e^{\mathbf{w}_c^\top\mathbf{x} + b_c}}$$

since the factor $e^{-0.5\mathbf{x}^\top\mathbf{x}}$ does not depend on the class and cancels. This is the output of an affine transformation of $\mathbf{x}$ pushed through a so-called **softmax** function. In particular, the output of the softmax at the $i$-th coordinate is

$$\text{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_c e^{z_c}}$$
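
The derivation can be checked numerically. The sketch below assumes randomly drawn class means, a made-up prior and a random input (all hypothetical). It compares the Bayes posterior under Gaussian class-conditionals with identity covariance against the softmax of the scores $\mathbf{w}_i^\top\mathbf{x} + b_i$, with $\mathbf{w}_i = \boldsymbol{\mu}_i$ and only the class-dependent part of $b_i$ kept.

```python
import math
import torch

torch.manual_seed(0)
C, d = 3, 4                                    # illustrative sizes
mu = torch.randn(C, d, dtype=torch.float64)    # hypothetical class means
prior = torch.tensor([0.5, 0.3, 0.2], dtype=torch.float64)
x = torch.randn(d, dtype=torch.float64)

# Posterior via Bayes' rule with N(mu_i, I) class-conditionals
norm_const = (2 * math.pi) ** (-d / 2)
likelihood = norm_const * torch.exp(-0.5 * ((x - mu) ** 2).sum(dim=1))
posterior_bayes = likelihood * prior / (likelihood * prior).sum()

# Same posterior as a softmax over affine scores w_i^T x + b_i
W = mu
b = -0.5 * (mu ** 2).sum(dim=1) + torch.log(prior)  # class-independent terms cancel
posterior_softmax = torch.softmax(W @ x + b, dim=0)

print(posterior_bayes)
print(posterior_softmax)
```

The normalization constant of the Gaussian is the same for every class, so leaving it out of `b` does not change the softmax output.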

If we now have $N$ (iid, independent and identically distributed) samples $(\mathbf{x}_1,y_1),\ldots,(\mathbf{x}_N,y_N)$ with $\forall i: y_i \in \{1,\ldots,C\}$, we can follow the goal of **maximizing the likelihood** (or, equivalently, minimizing the **negative log-likelihood**) to obtain the following loss function

$$-\frac{1}{N} \sum_{i=1}^N \log(p(y=y_i|\mathbf{x}_i))$$
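
A minimal sketch of this loss, assuming the log-posteriors come from a log-softmax over some hypothetical score matrix (one row per sample): we pick $\log p(y=y_i|\mathbf{x}_i)$ for each sample, average, and negate. Note that PyTorch uses 0-based class indices, i.e., labels in $\{0,\ldots,C-1\}$.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, C = 4, 3
scores = torch.randn(N, C)             # hypothetical model outputs
labels = torch.tensor([0, 2, 1, 2])    # 0-based class indices

log_post = F.log_softmax(scores, dim=1)       # log p(y=c | x_i) for all c
picked = log_post[torch.arange(N), labels]    # log p(y=y_i | x_i)
nll_manual = -picked.mean()                   # negative log-likelihood, averaged over N

# Reference: PyTorch's built-in NLL on the same log-probabilities
nll_torch = F.nll_loss(log_post, labels, reduction='mean')

print(nll_manual.item(), nll_torch.item())
```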

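In particular, let's say we have a model $f_{\mathbf{W}}$ which linearly maps inputs $\mathbf{x} \in \mathbb{R}^d$ to $\mathbb{R}^C$ (for brevity, we drop the bias terms $b_i$; they can be absorbed into the weights by appending a constant $1$ to $\mathbf{x}$). This corresponds to one layer of neurons whose output can be written as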

$$f_{\mathbf{W}}(\mathbf{x}) = \mathbf{W}\mathbf{x} = [\mathbf{w}_1^\top\mathbf{x},\ldots,\mathbf{w}_C^\top\mathbf{x}]^\top$$

where we have aggregated all weights into a matrix $\mathbf{W}$ which parametrizes this layer of neurons.
For a single training instance $(\mathbf{x},y)$ and assuming that $y=k$, we can write the aforementioned loss function (without the $1/N$ factor and the summation) as

$$l(\mathbf{W},(\mathbf{x},y)) = -\log\left(
\frac{e^{[f_\mathbf{W}(\mathbf{x})]_k}}{\sum_c e^{[f_\mathbf{W}(\mathbf{x})]_c} }\right) = -\log\left(\text{softmax}(f_{\mathbf{W}}(\mathbf{x}))_k\right)
$$

Equivalently, we can write

$$
\begin{split}
l(\mathbf{W},(\mathbf{x},y)) & = - \sum_k \delta_{y}(k) \log\left(
\frac{e^{[f_\mathbf{W}(\mathbf{x})]_k}}{\sum_c e^{[f_\mathbf{W}(\mathbf{x})]_c} }\right) \\ 
& = - \sum_k \delta_{y}(k) \log(p(y=k|\mathbf{x})) \\
& = \mathbb{H}(\delta_y, p(y=\cdot|\mathbf{x}))
\end{split}
$$

where 

$$\delta_y(k) = \begin{cases}
1 & \text{if}~y=k \\
0 & \text{else}
\end{cases}
$$

and

$$\mathbb{H}(p,q) = -\sum_k p(k)\log(q(k))$$

is the **cross-entropy** between two discrete distributions $p$ and $q$.

In other words, we compute the **cross-entropy** between the true posterior $\delta_y$ and the estimated posterior $p(y=\cdot|\mathbf{x})$, noting that the true posterior is a vector of all zeros, except for a 1 at the $k$-th position (as the true label of $\mathbf{x}$ is $k$).
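
The equivalence can be verified on a single sample. The sketch below assumes a hypothetical score vector standing in for $f_{\mathbf{W}}(\mathbf{x})$ and computes the loss in three ways: as the explicit cross-entropy sum, by picking $-\log$ of the softmax at the true label, and via PyTorch's built-in function.

```python
import torch
import torch.nn.functional as F

scores = torch.tensor([[0.5, -0.2, -0.4]])   # hypothetical f_W(x) for one sample
y = torch.tensor([1])                        # true label (0-based index)

q = F.softmax(scores, dim=1)                       # estimated posterior
delta = F.one_hot(y, num_classes=3).to(q.dtype)    # one-hot "true posterior"

ce_sum = -(delta * torch.log(q)).sum(dim=1)    # H(delta_y, q), explicit sum
ce_pick = -torch.log(q[0, y[0]])               # -log softmax(.)_k
ce_torch = F.cross_entropy(scores, y)          # built-in, operates on raw scores

print(ce_sum.item(), ce_pick.item(), ce_torch.item())
```

All three values coincide, since the one-hot vector zeroes out every term of the sum except the one at the true label.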

**Implementation**

From an implementation point of view, the discussed example corresponds to an ``nn.Linear`` layer, followed by an ``nn.LogSoftmax`` layer, with ``nn.NLLLoss`` as the loss function. For convenience, PyTorch implements the same functionality in ``nn.CrossEntropyLoss``, which sits directly on top of the linear layer and applies the log-softmax internally (hence, no need for a separate softmax module).

## Practical example

In [8]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net0(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(5,3) # e.g. for a 3-class problem
        
    def forward(self, x):
        x = self.fc(x)
        x = F.log_softmax(x,dim=1) # log(softmax(.)) -> use NLL (negative log-likelihood)
        return x

class Net1(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(5,3)   # no log(softmax(.)) -> use PyTorch's CE directly
        
    def forward(self, x):
        x = self.fc(x)
        return x  

In [11]:
torch.manual_seed(1234)

x = torch.randn(1,5)
y = torch.tensor([1], dtype=torch.long)

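W = torch.rand(3,5) # manually set the weights
b = torch.rand(1,1) # manually set the bias: one value of shape (1,1), shared by all 3 outputs via broadcasting (a per-class bias would have shape (3,))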

# create two networks
net0 = Net0() 
net1 = Net1()

# init weights - we do this here manually, just to make sure
# we have the same f_W(.)
net0.fc.weight.data = W
net0.fc.bias.data = b

net1.fc.weight.data = W
net1.fc.bias.data = b

In [25]:
loss_fn0 = nn.NLLLoss()          # use for Net0
loss_fn1 = nn.CrossEntropyLoss() # use for Net1

o0 = net0(x)
o1 = net1(x)

print('Output of Net0 (with LogSoftmax)\n')
print(o0.detach().numpy())
print('\nOutput of Net1 (without LogSoftmax)\n')
print(o1.detach().numpy())

print('\nLoss terms')
print('----------')

l0 = loss_fn0(o0, y)
l1 = loss_fn1(o1, y)
print('using NLLLoss:         {:.3f}'.format(l0.item()))
print('using CrossEntropyLoss {:.3f}'.format(l1.item()))

Output of Net0 (with LogSoftmax)

[[-0.6394741 -1.36645   -1.5259264]]

Output of Net1 (without LogSoftmax)

[[ 0.5333383  -0.19363755 -0.3531139 ]]

Loss terms
----------
using NLLLoss:         1.366
using CrossEntropyLoss 1.366


In [22]:
T = 0.001 # temperature: dividing the scores by a small T pushes the softmax towards a one-hot vector
a = torch.tensor([2.1/T, 5.5/T, 3.3/T])
F.softmax(a,dim=0)

tensor([0., 1., 0.])
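
To see how the temperature shapes the output, the following sketch reuses the scores from the cell above and sweeps over a few temperature values. The chosen values and the dictionary `probs` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

scores = torch.tensor([2.1, 5.5, 3.3])

probs = {}
for T in [10.0, 1.0, 0.1, 0.001]:
    # Large T flattens the distribution, small T concentrates it on the maximum score
    probs[T] = F.softmax(scores / T, dim=0)
    print('T = {:<6} -> {}'.format(T, probs[T]))
```

As $T$ decreases, the probability mass concentrates on the index of the largest score.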