# Loss Function
---

## Contents

1. [Cross Entropy Loss with Softmax](#1.-Cross-Entropy-Loss-with-Softmax)

---

In [1]:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

## 1. Cross Entropy Loss with Softmax
In classification problems, we usually use cross entropy as the loss function. But why do we use **cross entropy loss**?

Before answering, let's talk about entropy and cross entropy.

### Entropy & Cross Entropy

In information theory, [entropy](https://en.wikipedia.org/wiki/Entropy_(information_theory)) measures the amount of **"uncertainty information"** in the data. Low entropy means there is little uncertainty in the data; high entropy means there is a lot.

$$H(x) = -\sum_x p(x) \log_2 p(x) $$

For example, suppose we toss a coin with known, not necessarily fair, probabilities of coming up heads or tails. When the probability of heads equals 0.5, the entropy is 1 bit. This is the maximum for this problem: the distribution carries as much "uncertainty information" as it can. As the probability of heads moves from 0.5 toward 1, the entropy becomes lower and lower. In the extreme case, when the probability equals 1, the entropy becomes 0, which means there is no "uncertainty information" left, because the coin will always show heads.

In [2]:
def coin_entropy(p):
    q = 1 - p
    if p == 0 or q == 0:
        return 0.0
    return - p*np.log2(p) - q*np.log2(q)

In [3]:
print('when head prob = 0.5, entropy =', coin_entropy(p=0.5).round(3))
print('when head prob = 0.7, entropy =', coin_entropy(p=0.7).round(3))
print('when head prob = 1.0, entropy =', coin_entropy(p=1.0))

when head prob = 0.5, entropy = 1.0
when head prob = 0.7, entropy = 0.881
when head prob = 1.0, entropy = 0.0
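The coin is the two-outcome case of the entropy formula. As a minimal sketch, the same formula can be applied to any discrete distribution; the helper name `entropy_bits` and the two dice distributions below are made up for illustration.

```python
import numpy as np

def entropy_bits(probs):
    """Shannon entropy in bits of a discrete distribution given as a 1-D array."""
    probs = np.asarray(probs, dtype=float)
    nonzero = probs[probs > 0]  # treat 0 * log2(0) as 0
    return float(-np.sum(nonzero * np.log2(nonzero)))

# Hypothetical example distributions over six outcomes.
fair_die = np.full(6, 1 / 6)
loaded_die = np.array([0.5, 0.1, 0.1, 0.1, 0.1, 0.1])

print('fair die entropy   =', round(entropy_bits(fair_die), 3))
print('loaded die entropy =', round(entropy_bits(loaded_die), 3))
```

The uniform distribution should give the larger value, matching the coin example where the fair coin has maximum entropy.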


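We also define the [cross entropy](https://en.wikipedia.org/wiki/Cross_entropy) between two probability distributions $P$ and $Q$. It measures the average cost of describing samples from the true (usually unknown) distribution $P$ when we use the distribution $Q$ as our model of it.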

$$H(P, Q) = -\sum_x p(x) \log q(x)$$
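A small numerical sketch of this definition, using the natural logarithm. The function name `cross_entropy`, the `eps` guard against $\log 0$, and the three distributions are assumptions made for illustration only.

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """H(P, Q) = -sum_x p(x) log q(x), in nats. eps guards against log(0)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(-np.sum(p * np.log(q + eps)))

p = np.array([0.7, 0.2, 0.1])        # "true" distribution
q_close = np.array([0.6, 0.3, 0.1])  # a reasonable approximation of p
q_far = np.array([0.1, 0.2, 0.7])    # a poor approximation of p

print('H(P, P)       =', round(cross_entropy(p, p), 3))
print('H(P, Q_close) =', round(cross_entropy(p, q_close), 3))
print('H(P, Q_far)   =', round(cross_entropy(p, q_far), 3))
```

The value is smallest when the second argument equals `p` itself, and it grows as the approximation gets worse.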

Cross entropy should not be confused with the joint entropy $H(X, Y)$ of two random variables $X$ and $Y$, which is written in a similar way. From the definition of entropy, joint entropy satisfies the following chain rule.

$$\begin{aligned} H(X, Y) &= -\sum_x \sum_y p(x, y) \log_2 p(x, y) \\
&= -\sum_x \sum_y p(x, y) \log_2 \big( p(x \vert y) p(y) \big) \\
&= -\sum_x \sum_y p(x, y) \big( \log_2 p(x \vert y) + \log_2 p(y) \big)\\
&= -\sum_x \sum_y p(x, y) \log_2 p(x \vert y) - \sum_x \sum_y p(x, y) \log_2 p(y) \\
&= -\sum_x \sum_y p(x, y) \log_2 p(x \vert y) - \sum_y p(y) \log_2 p(y) \\
&= H(X \vert Y) + H(Y)
\end{aligned}$$

where $H(X \vert Y)$ is the [conditional entropy](https://en.wikipedia.org/wiki/Conditional_entropy).

![fig](./figs/0719fig_entropy.png)

figure from: https://en.wikipedia.org/wiki/Conditional_entropy

The figure and equations show how the entropy of two variables decomposes. Cross entropy behaves differently: for a fixed $P$, $H(P, Q) \geq H(P)$, with equality only when $Q = P$. So the two distributions become more similar as the cross entropy gets lower.
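The decomposition above can be checked numerically on a small joint distribution $p(x, y)$. The table below is made up for illustration; the sketch computes the entropy of the joint table, the conditional entropy of $x$ given $y$, and the entropy of the marginal $p(y)$.

```python
import numpy as np

# Hypothetical joint distribution p(x, y): rows index x, columns index y.
joint = np.array([[0.30, 0.10],
                  [0.20, 0.40]])

p_y = joint.sum(axis=0)       # marginal p(y), one value per column
cond_x_given_y = joint / p_y  # p(x | y) = p(x, y) / p(y), broadcast over columns

joint_entropy = -np.sum(joint * np.log2(joint))
conditional_entropy = -np.sum(joint * np.log2(cond_x_given_y))
marginal_entropy = -np.sum(p_y * np.log2(p_y))

print('joint entropy         =', round(joint_entropy, 4))
print('conditional + marginal =', round(conditional_entropy + marginal_entropy, 4))
```

Both printed values should agree, since the last line of the derivation is an identity.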

* A helpful article that you can read(korean): http://sanghyukchun.github.io/62/

### So, why do we use cross entropy loss?

Let's look at the following example with 3 classes. Assume we have already fed a single input (a batch of size 1) into a model and got an output vector.

The target class is **1**, which we encode as a one-hot vector. We can think of this one-hot target vector as the real-world distribution, in which the class equals 1 with certainty. Let's call it $p(x)$.

In [4]:
target = np.array([1, 0, 0])
output = np.array([2.7, -8.9, 5.5])

We will use the **softmax** function to transform this array into a probability distribution. Let's call this **output distribution** $q(x)$.

In [5]:
def softmax(x):
    x = x - np.max(x)  # to avoid overflow
    return np.exp(x) / np.sum(np.exp(x))

In [6]:
output_prob = softmax(output).round(3)
output_prob

array([0.057, 0.   , 0.943])
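The `np.max` subtraction in `softmax` does not change the result, because softmax is invariant to adding a constant to every input, but it keeps `np.exp` from overflowing. A small sketch of the difference follows; `softmax_naive`, `softmax_stable`, and the large inputs are names and values chosen here for illustration.

```python
import numpy as np

def softmax_naive(x):
    # No max subtraction: np.exp overflows to inf for large inputs.
    return np.exp(x) / np.sum(np.exp(x))

def softmax_stable(x):
    x = x - np.max(x)  # largest entry becomes 0, so np.exp cannot overflow
    return np.exp(x) / np.sum(np.exp(x))

large_logits = np.array([1000.0, 1001.0, 1002.0])

# Silence the overflow / invalid-value warnings raised by the naive version.
with np.errstate(over='ignore', invalid='ignore'):
    naive = softmax_naive(large_logits)

stable = softmax_stable(large_logits)

print('naive :', naive)            # inf / inf produces nan
print('stable:', stable.round(3))  # a valid distribution
```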

Our goal is to make the output distribution $q(x)$ as similar as possible to the real-world distribution $p(x)$, so that when we feed an input to our model, it returns the right class distribution. How can we do this?

Here we bring in a concept called [**Kullback-Leibler divergence**](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence) (also called **relative entropy**). It measures how one probability distribution diverges from a second, expected probability distribution. It looks like a distance metric, but it is not one, since it is asymmetric: in general $D_{KL}(P \vert\vert Q) \neq D_{KL}(Q \vert\vert P)$.

$$D_{KL}(P \vert\vert Q) = - \sum_x p(x) \log \dfrac{q(x)}{p(x)} = \sum_x p(x) \log \dfrac{p(x)}{q(x)}$$

The **KL-divergence** can be expanded as follows.

$$\begin{aligned} D_{KL}(P \vert\vert Q) &= \sum_x p(x) \log \dfrac{p(x)}{q(x)} \\
&= \sum_x p(x) \big( \log p(x) - \log q(x) \big) \\
&= - \sum_x p(x) \log q(x) + \sum_x p(x) \log p(x)  \\ 
&= H(P, Q) - H(P)\\ 
\end{aligned}$$
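The identity $D_{KL}(P \vert\vert Q) = H(P, Q) - H(P)$ and the asymmetry of the KL-divergence can both be checked numerically. This is a minimal sketch that assumes strictly positive distributions so that no $\log 0$ appears; the helper names and the values of `p` and `q` are illustrative.

```python
import numpy as np

def entropy(p):
    """H(P) in nats; assumes every entry of p is strictly positive."""
    return float(-np.sum(p * np.log(p)))

def cross_entropy(p, q):
    """H(P, Q) in nats; assumes every entry of q is strictly positive."""
    return float(-np.sum(p * np.log(q)))

def kl_divergence(p, q):
    """D_KL(P || Q) in nats; assumes strictly positive p and q."""
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.3, 0.6])

print('D_KL(P || Q)     =', round(kl_divergence(p, q), 4))
print('H(P, Q) - H(P)   =', round(cross_entropy(p, q) - entropy(p), 4))
print('D_KL(Q || P)     =', round(kl_divergence(q, p), 4))
```

The first two values should match, while swapping the arguments gives a different number.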

So we use KL-divergence as our loss function. To move the predicted distribution $Q$ closer to the target distribution $P$, we need to lower the KL-divergence. Since $H(P)$ does not depend on the model, this is equivalent to lowering the cross entropy $H(P, Q)$.

In this example, the model predicts class 3 (distribution=$[0.057, 0.000, 0.943]$), whereas the target distribution tells us the right answer is class 1 (distribution=$[1, 0, 0]$), so the cross entropy will be high. Using the convention $0 \log 0 = 0$:

$$\begin{aligned} D_{KL} &=- \big(1 \times \log(0.057) + 0 \times \log(0.0) + 0 \times \log(0.943) \big) + \big(1 \times \log(1) + 0 \times \log(0) + 0 \times \log(0) \big) \\
&= -\log(0.057)\end{aligned}$$

As we can see, if the target vector is one-hot encoded, then $H(P)=0$ and the KL-divergence loss reduces to the cross entropy.

In [7]:
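def cross_entropy_error_with_softmax(y, t):
    """Mean cross entropy; `y` holds softmax probabilities, `t` is one-hot or class indices."""
    if y.ndim == 1:  # promote a single sample to a batch of size 1
        t = t.reshape(1, t.size)
        y = y.reshape(1, y.size)

    if t.size == y.size:  # one-hot target: convert to class indices via argmax
        t = t.argmax(axis=1)

    batch_size = y.shape[0]
    # pick the predicted probability of the target class; 1e-99 guards against log(0)
    return -np.sum(np.log(y[np.arange(batch_size), t] + 1e-99)) / batch_size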

In [8]:
cross_entropy_error_with_softmax(output_prob, target)  

2.864704011147587

In [9]:
cross_entropy_error_with_softmax(output_prob, np.array([0]))  # same as above: target index 0 corresponds to target vector [1, 0, 0]

2.864704011147587
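The same function also handles a batch of more than one sample. The sketch below repeats the function so the block is self-contained and feeds it a made-up batch of two probability vectors, once with one-hot targets and once with class indices.

```python
import numpy as np

def cross_entropy_error_with_softmax(y, t):
    # Repeated from the cell above so that this block runs on its own.
    if y.ndim == 1:
        t = t.reshape(1, t.size)
        y = y.reshape(1, y.size)
    if t.size == y.size:
        t = t.argmax(axis=1)
    batch_size = y.shape[0]
    return -np.sum(np.log(y[np.arange(batch_size), t] + 1e-99)) / batch_size

# Hypothetical batch of two softmax outputs (each row sums to 1).
batch_prob = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.8, 0.1]])
onehot_targets = np.array([[1, 0, 0],
                           [0, 1, 0]])
index_targets = np.array([0, 1])

loss_onehot = cross_entropy_error_with_softmax(batch_prob, onehot_targets)
loss_index = cross_entropy_error_with_softmax(batch_prob, index_targets)
print(loss_onehot, loss_index)  # both are the mean of -log(0.7) and -log(0.8)
```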

### Difference between KL-divergence & Cross Entropy

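Note that the reason we can use cross entropy as our loss function is that we encode the target as a one-hot vector, which makes $H(P)=0$ in the **KL-divergence**. More generally, $H(P)$ does not depend on the model, so minimizing either quantity leads to the same solution.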

The difference: $D_{KL}(P \vert \vert Q)$ measures the average number of **extra** bits per message, whereas $H(P,Q)$ measures the average number of **total** bits per message.

* Reference: https://stats.stackexchange.com/questions/265966/why-do-we-use-kullback-leibler-divergence-rather-than-cross-entropy-in-the-t-sne
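The "extra bits" versus "total bits" reading can be made concrete with base-2 logarithms. In this sketch the two distributions are chosen so that the code lengths $-\log_2 q(x)$ are whole numbers; the values are illustrative, not taken from the reference.

```python
import numpy as np

p = np.array([0.5, 0.25, 0.25])   # true distribution of the messages
q = np.array([0.25, 0.25, 0.5])   # distribution the code was designed for

code_lengths = -np.log2(q)                     # 2, 2 and 1 bits per message
total_bits = float(np.sum(p * code_lengths))   # cross entropy H(P, Q)
optimal_bits = float(-np.sum(p * np.log2(p)))  # entropy H(P)
extra_bits = float(np.sum(p * np.log2(p / q))) # D_KL(P || Q)

print(total_bits)    # 1.75
print(optimal_bits)  # 1.5
print(extra_bits)    # 0.25
```

Using the code built for `q` costs 1.75 bits per message on average, which is the 1.5 bits of the best possible code plus 0.25 extra bits.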

### PyTorch Cross Entropy Loss

In PyTorch we don't need to calculate the probabilities with softmax ourselves: `CrossEntropyLoss` combines log-softmax and the negative log-likelihood loss, so it takes the raw outputs (logits) directly. We also don't need to one-hot encode the target; pass the class indices as a `LongTensor` instead.

In [10]:
output = torch.FloatTensor(output).view(1, -1)
target = torch.LongTensor([0])
loss_function = nn.CrossEntropyLoss()
loss_function(output, target)

tensor(2.8590)
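The PyTorch value (2.8590) differs slightly from the NumPy value (2.8647) because the NumPy cells rounded the probabilities to 3 decimals before taking the log. The sketch below, which assumes the same logits as the example, checks this and also shows that `nn.CrossEntropyLoss` matches `F.log_softmax` followed by `F.nll_loss`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.tensor([[2.7, -8.9, 5.5]])  # shape (batch=1, classes=3)
target = torch.tensor([0])                 # class index, dtype int64

# One step: softmax and cross entropy combined.
loss_combined = nn.CrossEntropyLoss()(logits, target)

# Two steps: log-softmax, then negative log-likelihood.
log_probs = F.log_softmax(logits, dim=1)
loss_two_step = F.nll_loss(log_probs, target)

# Rounding the probabilities to 3 decimals, as the NumPy cells did.
rounded_prob = torch.round(F.softmax(logits, dim=1) * 1000) / 1000
loss_rounded = -torch.log(rounded_prob[0, 0])

print(loss_combined, loss_two_step, loss_rounded)
```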

### Backpropagation in cross entropy loss with softmax

$$\begin{cases} \hat{y}_i=\dfrac{\exp(z_i)}{\sum_k \exp(z_k)} \\ L=H(y, \hat{y})=-\sum_j y_j \log \hat{y}_j \end{cases}$$

where $z_i$ is the linear output (logit) before the softmax layer and $y$ is the one-hot encoded target.

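The loss depends on each softmax output through

$$\dfrac{\partial L}{\partial \hat{y}_j} = -y_j \dfrac{1}{\hat{y}_j}$$

Every softmax output depends on $z_i$ through the shared denominator, so there are two cases. For the output with the same index:

$$\begin{aligned} \dfrac{\partial \hat{y}_i}{\partial z_i} &= \dfrac{e^{z_i} \cdot \sum_k e^{z_k} - e^{z_i} \cdot \dfrac{\partial}{\partial z_i}\sum_k e^{z_k}}{(\sum_k e^{z_k})^2} \\
&= \dfrac{e^{z_i}}{\sum_k e^{z_k}} - \dfrac{e^{z_i} \cdot e^{z_i}}{(\sum_k e^{z_k})^2} \\
&= \dfrac{e^{z_i}}{\sum_k e^{z_k}} \big(1- \dfrac{e^{z_i}}{\sum_k e^{z_k}} \big) \\
&= \hat{y}_i ( 1- \hat{y}_i)
\end{aligned}$$

and for $j \neq i$:

$$\dfrac{\partial \hat{y}_j}{\partial z_i} = -\dfrac{e^{z_j} \cdot e^{z_i}}{(\sum_k e^{z_k})^2} = -\hat{y}_j \hat{y}_i$$

so, summing over all outputs,

$$\begin{aligned}\dfrac{\partial L}{\partial z_i} &= \sum_j \dfrac{\partial L}{\partial \hat{y}_j}\dfrac{\partial \hat{y}_j}{\partial z_i} \\
&= -y_i \dfrac{1}{\hat{y}_i} \times \hat{y}_i ( 1- \hat{y}_i) + \sum_{j \neq i} y_j \dfrac{1}{\hat{y}_j} \times \hat{y}_j \hat{y}_i \\
&= -y_i + \hat{y}_i \sum_j y_j \\
&= \hat{y}_i - y_i
\end{aligned}$$

where the last step uses $\sum_j y_j = 1$. For the target class ($y_i = 1$) this gives $\hat{y}_i - 1$, and for every other class ($y_i = 0$) it gives $\hat{y}_i$.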
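The gradient can be sanity-checked numerically. The sketch below compares the closed form $\hat{y} - y$ (which equals $\hat{y}_i - 1$ at the target index) against central finite differences of the loss; the step size `eps` and the helper names are choices made here for illustration.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)  # to avoid overflow
    return np.exp(z) / np.sum(np.exp(z))

def loss(z, y):
    # Cross entropy between one-hot target y and softmax(z).
    return -np.sum(y * np.log(softmax(z)))

z = np.array([2.7, -8.9, 5.5])  # logits from the running example
y = np.array([1.0, 0.0, 0.0])   # one-hot target

analytic_grad = softmax(z) - y

# Central finite differences, one logit at a time.
eps = 1e-6
numeric_grad = np.zeros_like(z)
for i in range(z.size):
    step = np.zeros_like(z)
    step[i] = eps
    numeric_grad[i] = (loss(z + step, y) - loss(z - step, y)) / (2 * eps)

print('analytic:', analytic_grad)
print('numeric :', numeric_grad)
```

The two gradients should agree up to finite-difference error.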