
## Explain Cross Entropy and Kullback-Leibler Divergence
### Cross Entropy
- Measures how well one probability distribution matches another; a lower value means a closer match. It is commonly used to **evaluate how well the predicted probability distribution matches the ground truth labels.**

### KL Divergence
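- Measures how much one probability distribution differs from another. It is typically used to **quantify the information lost when a given distribution is used to approximate a target distribution.** It is never negative, and it is zero only when the two distributions are identical.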

### 1. Problem Definition
- Suppose there are N classes. For a given sample, the true distribution (ground truth) is denoted $$ P = [P(1), P(2), \dots, P(i), \dots, P(N)] $$
- A model with parameters θ produces the prediction $$ Q = [Q(1), Q(2), \dots, Q(i), \dots, Q(N)] $$
- H(P) is the entropy of the true distribution. It does not depend on the model, so it is a constant with respect to θ.
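
To make the notation concrete, here is a minimal sketch of one common case: a one-hot ground truth P and a prediction Q obtained by applying softmax to model outputs. The number of classes, the true class index, and the logits are made-up values for illustration only.

```python
import torch

num_classes = 4   # N, an arbitrary illustrative choice
true_class = 2    # hypothetical ground-truth class index

# P: one-hot ground truth, all probability mass on the true class
P = torch.zeros(num_classes)
P[true_class] = 1.0

# Q: model prediction, softmax over hypothetical logits
logits = torch.tensor([0.5, -1.0, 2.0, 0.1])
Q = torch.softmax(logits, dim=0)

# Both P and Q are valid probability distributions over the N classes
print(P, P.sum())
print(Q, Q.sum())
```

P does not have to be one-hot; any non-negative vector that sums to 1 is a valid target distribution.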

### 2. Cross Entropy
- The formula for cross entropy is:

$$
H(P, Q) = -\sum_i P(i) \log Q(i)
$$

In [24]:
import torch
p = torch.tensor([0.2, 0.3, 0.5])   # true distribution P
q1 = torch.tensor([0.1, 0.4, 0.5])  # prediction Q1
q2 = torch.tensor([0.1, 0.1, 0.8])  # prediction Q2

In [25]:
# cross entropy H(P, Q) = -sum_i P(i) * log Q(i)
CE1 = -torch.sum(p * torch.log(q1))
CE2 = -torch.sum(p * torch.log(q2))
print(CE1)
print(CE2)

tensor(1.0820)
tensor(1.2629)


### 3. KL Divergence
- KL divergence measures the relative entropy between two probability distributions:
$$
D_{KL}(P||Q)=\sum_i P(i) \log \frac{P(i)}{Q(i)}
$$

- Expanding the logarithm, and using the entropy $H(P) = -\sum_i P(i) \log P(i)$, we get:
$$
D_{KL}(P||Q)=\sum_i P(i) \log \frac{P(i)}{Q(i)} = \sum_i P(i) \log P(i) - \sum_i P(i) \log Q(i) \\ = H(P, Q) - H(P)
$$
- Since H(P) is constant (it does not depend on Q), minimizing cross entropy is equivalent to minimizing KL divergence.

In [26]:
KLD1 = torch.sum(p * torch.log(p / q1))
KLD2 = torch.sum(p * torch.log(p / q2))
print(KLD1)
print(KLD2)

tensor(0.0523)
tensor(0.2332)


In [27]:
# calculate H(P) = -sum_i P(i) * log P(i)
H_P = -torch.sum(p * torch.log(p))
print(H_P)

tensor(1.0297)


- Calculate H(P, Q) - H(P); the result matches the KL divergence values above.

In [28]:
print(CE1 - H_P)
print(CE2 - H_P)

tensor(0.0523)
tensor(0.2332)
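
As a further cross-check, the KL divergence values can be reproduced with PyTorch's built-in `torch.nn.functional.kl_div`. This is a minimal sketch; the variable names are illustrative. Note that `kl_div` takes the prediction as log-probabilities in its first argument and the target as probabilities in its second.

```python
import torch
import torch.nn.functional as F

p = torch.tensor([0.2, 0.3, 0.5])   # true distribution P
q1 = torch.tensor([0.1, 0.4, 0.5])  # prediction Q1
q2 = torch.tensor([0.1, 0.1, 0.8])  # prediction Q2

# Built-in: input = log Q (log-probabilities), target = P (probabilities)
kld1_builtin = F.kl_div(torch.log(q1), p, reduction="sum")
kld2_builtin = F.kl_div(torch.log(q2), p, reduction="sum")

# Manual: sum_i P(i) * log(P(i) / Q(i))
kld1_manual = torch.sum(p * torch.log(p / q1))
kld2_manual = torch.sum(p * torch.log(p / q2))

# Both comparisons should agree up to floating-point error
print(torch.allclose(kld1_builtin, kld1_manual, atol=1e-6))
print(torch.allclose(kld2_builtin, kld2_manual, atol=1e-6))
```

The argument order is easy to get wrong: swapping the two arguments computes a different quantity, because KL divergence is not symmetric.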


### 4. What situations do we use Cross Entropy or KL Divergence?
- As mentioned at the beginning, cross entropy scores **how well Q matches P**, while KL divergence measures **how much Q differs from P**.

- In deep learning classifiers, softmax turns the model outputs into a probability distribution, and cross entropy between that distribution and the labels is used as the loss.
- Take knowledge distillation as an example: KL divergence can be used to train a smaller model (student) from a larger model (teacher), so that the student's predicted distribution approximates the teacher's.
- In most classification tasks the labels are one-hot, so H(P) = 0 and cross entropy equals KL divergence; minimizing cross entropy is the usual choice. KL divergence is more often used when the target is itself a full probability distribution, as in distillation.
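
The two use cases above can be sketched as follows. This is an illustrative example, not a full training loop: the logits are random, the labels are made up, and the temperature `T` with its `T * T` scaling is a common convention in distillation that is assumed here rather than taken from this notebook.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical logits for a batch of 4 samples and 3 classes
student_logits = torch.randn(4, 3)
teacher_logits = torch.randn(4, 3)
labels = torch.tensor([0, 2, 1, 2])  # made-up hard labels (class indices)

# Classification: F.cross_entropy applies log-softmax to the logits internally
ce_loss = F.cross_entropy(student_logits, labels)

# The same loss by hand: -log of the softmax probability of the true class
probs = F.softmax(student_logits, dim=1)
ce_manual = -torch.log(probs[torch.arange(4), labels]).mean()

# Distillation: KL(teacher || student) on temperature-softened distributions
T = 2.0  # temperature, an assumed illustrative value
kd_loss = F.kl_div(
    F.log_softmax(student_logits / T, dim=1),  # student log-probabilities
    F.softmax(teacher_logits / T, dim=1),      # teacher probabilities
    reduction="batchmean",
) * (T * T)

print(ce_loss, ce_manual, kd_loss)
```

In practice the two losses are often combined with a weighting factor, so the student learns from both the hard labels and the teacher's soft predictions.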