# Cross-Entropy Loss

Cross-Entropy Loss is typically used as the loss function in multi-class classification problems.

The output of a neural network doing classification is a set of probabilities (a probability distribution in which every class is associated with a probability).  We adapt the weights so that the resulting probabilities match the ground truth as closely as possible.  To iteratively adapt the weights and improve the prediction, a loss function is needed.  For multi-class classification, Cross-Entropy Loss is commonly used.

## Shannon Information

The occurrence of an unlikely event gives more information than the occurrence of a very likely event.  Shannon came up with a way to quantify how unpredictable a series of events is: a measure of the "disorder" of a system that quantifies the uncertainty of a probability distribution.

Let's do a thought experiment with two people: person A and person B.  They can agree upfront on the meaning of a series of bits thrown back and forth over a wall (like some mapping function saying "0101" means event "abc" happened).  Beyond the bits used for communicating, they cannot exchange any other information.  Let's now assume a number of different scenarios.

### A fair coin flip

Assume person A does a fair coin flip where the probability of getting heads is the same as that of getting tails, each being 50 percent:  $P(H)=0.5$ and $P(T)=0.5$

Both A and B can agree to exchange the outcome of the coin flip using a single bit of information where 0 means heads and 1 means tails.  When B receives the bit, he will know exactly what the outcome of the coin flip was.  We can say the entropy of this probability distribution is 1 bit.

### Winning team out of 8

Assume that on one side of the wall person A observes one team out of 8 winning a tournament.  Each team has a probability of 1/8 or 0.125 of winning, so this is again a uniform probability distribution: $P(A)=0.125$, $P(B)=0.125$, $P(C)=0.125$, ..., $P(H)=0.125$

A and B can agree to communicate the winning team using 3 bits of information. 3 bits give them $2^3=8$ classes, one for each team.  Let's say "000" means team A; "001" team B; "010" team C and so on.  We can say the entropy for this probability distribution is 3 bits.
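The agreement between A and B can be sketched as a small lookup table.  This is only an illustration under assumptions: the helper name `build_codebook` and the team labels `A` to `H` are made up for this example, and any other one-to-one mapping of teams to bit strings would work just as well.

```python
def build_codebook(outcomes):
    """Assign every outcome a fixed-length bit string (hypothetical helper)."""
    # Smallest number of bits B such that 2**B >= number of outcomes.
    n_bits = max(1, (len(outcomes) - 1).bit_length())
    return {
        outcome: format(index, f"0{n_bits}b")
        for index, outcome in enumerate(outcomes)
    }


teams = list("ABCDEFGH")                 # the 8 teams, labelled A to H
codebook = build_codebook(teams)         # what A uses to encode
decoder = {bits: team for team, bits in codebook.items()}  # what B uses to decode

print(codebook["A"], codebook["B"], codebook["C"])  # 000 001 010
print(decoder["111"])                               # H
```

Every team gets a distinct 3-bit message, so B can always recover the winner from exactly 3 bits.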

If we generalize this: for a uniform distribution of M equally likely outcomes, the entropy is: $log_2\,M$

This also holds for distributions where the number of outcomes is not exactly a power of 2, unlike the examples above.  Let's have a look at this in the next example.

### 10 outcomes

If A observes one outcome from a uniform distribution over 10 possible outcomes (each with a probability of 0.1), then these can all be encoded using 4 bits.  4 bits allow for representing $2^4=16$ states, which is more than needed for our 10 possible outcomes.  There are 6 "unused" states.

We can group outcomes in groups of 3.  There are 1000 unique triplets possible.  If we encode our data per 3 outcomes, then every triplet can be encoded using 10 bits, giving us a total of $2^{10}=1024$ states.  That's still more than needed, but we're already much more efficient in encoding our information, as we can represent on average 1 outcome (1/3 of a triplet) using $\frac{10}{3}=3.333...$ bits.  This is better but not perfect yet.

We grouped our information by 3 outcomes at a time, which gave us $10^3$ possible groups.  Let's call the number of outcomes per group G instead of 3.  The number of states we can represent with B bits is $2^B$.  The most efficient encoding is one where $2^B = 10^G$, where G is the number of grouped observations and B is the number of bits.  No whole numbers B and G satisfy this exactly, so it describes the ideal that the encoding approaches as G grows.

$$2^B = 10^G$$

Let's take the $log_2$ of both sides:

$$B = log_2 (10^G)$$
$$B = G\, log_2 10$$
$$\frac{B}{G} = log_2 10$$

$\frac{B}{G}$ is our entropy and $log_2 10$ is approximately 3.322...
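The grouping argument can be checked numerically.  The sketch below assumes a fixed-length code per group and uses a made-up helper, `bits_per_outcome`, that picks the smallest whole number of bits able to cover all $10^G$ groups and then divides by G.

```python
from math import log2


def bits_per_outcome(num_outcomes, group_size):
    """Average bits per outcome when group_size outcomes share one code word."""
    num_states = num_outcomes ** group_size
    # Smallest B such that 2**B >= num_states, computed with exact integers.
    bits_for_group = (num_states - 1).bit_length()
    return bits_for_group / group_size


for group_size in (1, 3, 10, 100, 1000):
    print(group_size, bits_per_outcome(10, group_size))

print("log2(10) =", log2(10))
```

The cost per outcome does not shrink at every step, because the number of bits is rounded up to a whole number for each group, but it always stays within $\frac{1}{G}$ of $log_2 10$, so larger groups get arbitrarily close to it.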

So for a uniform distribution of M possible outcomes ($U(M)$) in which every probability of an outcome is $p_{1..M} = \frac{1}{M}$, the entropy is: $$H(U(M))=log_2\,M$$

### Non-uniform distributions

As we've seen before, in a uniform distribution each outcome has probability $p = \frac{1}{M}$ and needs $log_2\,M$ bits to encode, which equals $log_2\,\frac{1}{p}$ or $-log_2\,p$.  Applying the same cost of $-log_2\,p_{i}$ bits to an outcome with probability $p_{i}$, weighting each outcome's cost by its probability and summing over the entire distribution gives us the entropy for a non-uniform distribution: $-\sum_{i=1}^{M}\,p_{i}\,log_2\,p_{i}$

This describes how much information, on average, is needed to describe the outcome for a distribution.
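As a small worked example, take a distribution that is not part of the scenarios above and is chosen only for illustration: three outcomes with probabilities $\frac{1}{2}$, $\frac{1}{4}$ and $\frac{1}{4}$.  The individual costs are $-log_2\,\frac{1}{2} = 1$ bit and $-log_2\,\frac{1}{4} = 2$ bits, so the weighted sum is:

$$-\sum_{i=1}^{3}\,p_{i}\,log_2\,p_{i} = \frac{1}{2} \cdot 1 + \frac{1}{4} \cdot 2 + \frac{1}{4} \cdot 2 = 1.5$$

That is less than the $log_2\,3 \approx 1.585$ bits of a uniform distribution over 3 outcomes: the more predictable distribution needs fewer bits on average.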

### Shannon Entropy Formula

Shannon Entropy is defined as: $$H=-\sum_{i=1}^{M}P(x_{i}) \, log_2 \, P(x_{i})$$
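The formula translates almost directly into code.  The following is a minimal sketch, not a reference implementation: the function name `shannon_entropy` and the tolerance used to check that the probabilities sum to 1 are choices made for this example.

```python
from math import log2


def shannon_entropy(probabilities):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Outcomes with p == 0 are skipped: p * log2(p) tends to 0 as p goes to 0.
    return -sum(p * log2(p) for p in probabilities if p > 0)


print(shannon_entropy([0.5, 0.5]))    # fair coin flip: 1.0
print(shannon_entropy([0.125] * 8))   # winning team out of 8: 3.0
print(shannon_entropy([0.1] * 10))    # 10 outcomes: about 3.322
print(shannon_entropy([0.9, 0.1]))    # biased coin: less than 1 bit
```

The first three calls reproduce the uniform cases discussed earlier; the last one uses an assumed biased coin to show that a more predictable distribution has lower entropy.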

## KL Divergence

TODO

## Cross-Entropy Loss

TODO

## References

### Articles

- https://machinelearningmastery.com/cross-entropy-for-machine-learning/
- https://towardsdatascience.com/cross-entropy-for-dummies-5189303c7735

### Videos

- [Intuitively understanding Shannon Entropy](https://www.youtube.com/watch?v=0GCGaw0QOhA)