## Shannon Entropy

- Shannon entropy is the building block for many distance measures, so it deserves its own notebook. Other measures (KL divergence, Jensen-Shannon divergence, etc.) are derived from the definition of Shannon entropy given here, which makes this an important section

### Deriving Shannon Entropy from Information Theory

- In the digital domain, we often use `bits` as our building blocks of information

- A `bit` is just a single on-off switch. It has the value of 1 when on, and 0 when off. By putting bits together, we can express more complex pieces of information.
    - The most canonical example of using bits to represent information is using them to represent numbers (i.e. binary)

- Bit strings of longer lengths can encode more information
    - length 1 --> 2 states, 0 or 1
    - length 2 --> 4 states, 00 or 10 or 01 or 11
    - ...
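As a quick check of the pattern above, the sketch below enumerates every bit string of a given length with the standard library's `itertools.product`. The helper name `bit_strings` is our own and exists only for illustration.

```python
from itertools import product


def bit_strings(length):
    """Return every bit string of the given length, e.g. ['00', '01', '10', '11']."""
    return ["".join(bits) for bits in product("01", repeat=length)]


for length in range(1, 4):
    states = bit_strings(length)
    # A string of `length` bits can take 2**length distinct values
    print(length, len(states), states)
```

Each extra bit doubles the number of states, which is why the number of bits needed grows with the logarithm of the number of states.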

#### Representing information with bits

- Let's consider a coin flip
    - If we wish to communicate the result of a coin flip, the simplest idea is to use 1 bit of information. That is, let 1 represent heads, and 0 represent tails.
    - Then for $N$ flips, we require $N$ bits of information

- Let's now suppose the same scenario, but with a biased coin that produces heads 99% of the time, and tails 1% of the time
    - We can use the same encoding scheme of 1 for heads and 0 for tails to encode the result of $N$ tosses, which will again give us $N$ bits of information
    - **We can do better!**
        - Let's suppose $N = 1,000,000$ 
        - In a million tosses, we expect 10,000 tails, and 990,000 heads
        - Since there are only 2 outcomes, and most of them are heads, we can choose to transmit only the information about the 10,000 tails!
        - Since we need to encode **where** the tails occur, we need approximately 20 bits of information for each tail ($\log_2(1,000,000) \approx 19.93$), which will encode the index (between 0 and 1 million) where the tail occurs
        - This lets us express 10,000 tails in 20 * 10,000 = 200,000 bits!
    - **We can do even better!**
        - Instead of transmitting the raw indices, we can choose to transmit the distance between consecutive tails!
        - Since there are 10,000 tails, on average there must be around 100 flips between each tail
        - So if we only encode the distance between each tail, we require around 7 bits of information per tail on average ($\log_2(100) \approx 6.64$). Let's call it 10 bits, to account for long stretches of heads
        - Therefore, information about the 10,000 tails can be expressed using 10,000 * 10 = 100,000 bits!
        - Therefore, 100,000 / 1,000,000 = 0.1 bits needed per flip on average
    - Since we used a fixed number of bits to represent each distance between tails, we call this **fixed length encoding**
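The arithmetic behind the three schemes can be reproduced in a few lines. This is only a sketch of the expected-case bookkeeping, not a working encoder; the variable names are ours, and the 10-bit gap width is the padded figure chosen in the text.

```python
import math

n_flips = 1_000_000
tails_every = 100                        # tails occur 1% of the time
expected_tails = n_flips // tails_every  # 10,000 tails expected

# Scheme 1: one bit per flip
naive_bits = n_flips

# Scheme 2: send the index of every tail
index_bits_per_tail = math.ceil(math.log2(n_flips))  # log2(1e6) is about 19.93
index_bits = index_bits_per_tail * expected_tails

# Scheme 3: send the gap between consecutive tails
min_gap_bits = math.ceil(math.log2(tails_every))     # log2(100) is about 6.64
gap_bits_per_tail = 10                               # padded; covers gaps up to 2**10 - 1
gap_bits = gap_bits_per_tail * expected_tails

print("naive:", naive_bits)
print("index:", index_bits)
print("gaps :", gap_bits, "->", gap_bits / n_flips, "bits per flip")
```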

- TLDR;
    1. We can represent "information" (e.g. the results from a flip of a coin) using bits
    2. We don't always need a full bit to represent one outcome, if we encode things smartly!
        - For instance, in this case, we represent the same information (the count and position of "tails") in a different way (representing the distance between tails instead of their positions directly)
        - This lets us encode what was originally $N$ bits, into something much less
        - Since each representation of distance takes up the same number of bits, we call it **fixed length encoding**

#### Variable Length Encoding

- Let's expand our scenario to something with more than 2 states. Let's imagine we have an 8-sided die, instead of a coin. 
    - Now, there are 8 outcomes instead of 2.

- Since there are 8 outcomes, we can represent each outcome with $\log_2(8) = 3$ bits of information. 
    - Imagine we toss this die 1000 times. Since each outcome is represented with a binary string of length 3, this requires 3*1000 = 3000 bits
    - We cannot apply our earlier trick of representing only the distances between some selected outcome, because there are now 8 different scenarios to encode!!

- Nonetheless, **we can do better**, assuming the die is not fair!
    - Let's suppose the probabilities of the 8 outcomes are as follows
        - $\frac{1}{2}, \frac{1}{4}, \frac{1}{8}, \frac{1}{16}, \frac{1}{32}, \frac{1}{64}, \frac{1}{128}, \frac{1}{128}$
    - Up to this point, we have used 3 bits to encode each outcome (000 for outcome 1, 001 for outcome 2, etc.), giving us 3000 bits needed for 1000 rolls of the die
        - Outcome 1 to Outcome 8: 000, 001, 010, 100, 011, 101, 110, 111
    - What if we exploit the difference in the outcome probabilities? 
        - Idea: Encode outcomes with larger probabilities with shorter strings, so we use fewer bits overall

    - Let's try this instead: 
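        - Outcome 1 to Outcome 8: 0, 10, 110, 1110, 11110, 111110, 1111110, 1111111
            - The last codeword needs no trailing 0, since there is no longer codeword it has to be told apart from
    
    - How many bits does it take to encode 1000 rolls now?
        - $\frac{1}{2} + \frac{2}{4} + \frac{3}{8} + \dots + \frac{7}{128} + \frac{7}{128} \approx 1.98$ bits per roll, or about 1984 bits for 1000 rolls! 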
        - This cuts the number of bits needed by more than one third!
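The expected code length can be checked exactly with `fractions.Fraction`. The sketch below assumes a codebook in which outcome $k$ is written as $k-1$ ones followed by a zero, with the eighth outcome written as seven ones (so the two rarest outcomes both use 7 bits, matching the $\frac{7}{128}$ terms). The `encode` and `decode` helpers are hypothetical names, added to show that a bit stream built from this codebook can be read back without separators.

```python
from fractions import Fraction

# Probabilities 1/2, 1/4, ..., 1/128, 1/128 for outcomes 1..8
probs = [Fraction(1, 2**k) for k in range(1, 8)] + [Fraction(1, 128)]

# Assumed codebook: '0', '10', '110', ..., '1111110', '1111111'
codes = ["1" * k + "0" for k in range(7)] + ["1" * 7]

# Expected bits per roll = sum of P(outcome) * len(codeword)
expected_len = sum(p * len(c) for p, c in zip(probs, codes))
print(expected_len, float(expected_len))


def encode(outcomes):
    """Concatenate the codewords for a sequence of outcomes (1-based)."""
    return "".join(codes[o - 1] for o in outcomes)


def decode(bits):
    """Read bits left to right, emitting an outcome whenever a codeword completes."""
    outcomes, current = [], ""
    for b in bits:
        current += b
        if current in codes:
            outcomes.append(codes.index(current) + 1)
            current = ""
    return outcomes


rolls = [1, 3, 8, 2, 1, 7]
assert decode(encode(rolls)) == rolls
```

Decoding works greedily because no codeword is a prefix of another, so the end of each codeword is unambiguous.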

#### Generalising Variable Length Encoding

- We've seen how representing more frequent outcomes with fewer bits can let us encode information more efficiently

- Intuitively, for a given number of outcomes $N$, we will always need $\log_2(N)$ bits to encode each outcome, assuming each outcome is equally likely. 
    - This quantity has a name: $\log_2(N)$ is known as the **entropy** of a uniform distribution with $N$ outcomes

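- When the outcomes are NOT equally likely, the expected number of bits becomes the probability-weighted sum of the information content of each outcome, as we did in the example above!
    - Recall that we looked at $\log_2(N)$ as the number of bits needed when all $N$ outcomes are equally likely
    - This can be rewritten as $-\log_2(1/N) = -\log_2(P(X))$, where $P(X)$ is the probability of observing some outcome $X$
    - The same expression applies outcome by outcome when the probabilities differ: in the die example, the outcome with probability $\frac{1}{4}$ was given a codeword of length $-\log_2(\frac{1}{4}) = 2$ bits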
    
- Then, the expected number of bits needed is simply the probability-weighted average of the number of bits needed for each outcome 
$$\begin{aligned}
    \text{Bits needed to encode each roll} &= - \sum_X P(X) \log_2(P(X)) \\
    &= \text{Uncertainty} \\
    &= \text{Shannon Entropy}
\end{aligned}$$

- This derivation gives us the exact definition of **Shannon Entropy**
    - The more bits are needed to encode the distribution, the more uncertain the outcome is 
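The formula translates directly into code. The following is a minimal sketch using only the standard library; the function name `shannon_entropy` and the convention of skipping zero-probability outcomes (treating $0 \log_2 0$ as 0) are our assumptions for illustration.

```python
import math


def shannon_entropy(probs):
    """Entropy in bits of a distribution given as a list of probabilities."""
    # Outcomes with probability 0 contribute nothing to the sum
    return -sum(p * math.log2(p) for p in probs if p > 0)


fair_coin = [0.5, 0.5]
fair_d8 = [1 / 8] * 8
skewed_d8 = [1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128, 1/128]

print(shannon_entropy(fair_coin))   # one bit per flip
print(shannon_entropy(fair_d8))     # log2(8) bits per roll
print(shannon_entropy(skewed_d8))   # same as the variable length code's average
```

For the skewed 8-sided die the entropy equals the average codeword length found earlier, because every probability is a power of $\frac{1}{2}$ and so every $-\log_2(P(X))$ is a whole number of bits.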

#### Shannon Entropy maximisation

- To build intuition, we study when Shannon Entropy is maximised and when it is minimised, using a six-sided die whose outcome probabilities become increasingly skewed

- The fair die attains the maximum, $\log_2(6) \approx 2.585$ bits, and we get progressively less uncertain (lower entropy) as the outcomes become more skewed!

In [None]:
import numpy as np

def skewed_die(power):
    """Six-sided die whose face probabilities are proportional to (face/6)**power."""
    weights = [((x+1)/6)**power for x in range(6)]
    return {x+1: weights[x] / np.sum(weights) for x in range(6)}


def die_entropy(die):
    """Shannon entropy, in bits, of a die given as {face: probability}."""
    entropy = 0
    for face in die:
        entropy += -die[face] * np.log2(die[face])
    return entropy


# A fair die, followed by dice that increasingly favour the high faces
unbiased_die = {x+1: 1/6 for x in range(6)}
biased_die = skewed_die(2)
biased_die2 = skewed_die(3)
biased_die3 = skewed_die(4)

unbiased_die_shannon_entropy = die_entropy(unbiased_die)
biased_die_shannon_entropy = die_entropy(biased_die)
biased_die2_shannon_entropy = die_entropy(biased_die2)
biased_die3_shannon_entropy = die_entropy(biased_die3)

print(unbiased_die_shannon_entropy, biased_die_shannon_entropy, biased_die2_shannon_entropy, biased_die3_shannon_entropy)

2.584962500721156 2.082047059877024 1.7956083181006413 1.5556975591204392
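The same trend is easiest to see with only two outcomes. The sketch below is a simplification we introduce for illustration (a coin rather than the six-sided die above): it sweeps the probability of heads and shows entropy peaking at 1 bit for a fair coin and falling to 0 when the outcome is certain. The helper name `binary_entropy` is hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a coin that lands heads with probability p."""
    # Skip zero-probability outcomes, treating 0 * log2(0) as 0
    return sum(-q * math.log2(q) for q in (p, 1 - p) if q > 0)


for p in [0.5, 0.7, 0.9, 0.99, 1.0]:
    print(f"P(heads) = {p:<4}  entropy = {binary_entropy(p):.4f} bits")
```

Entropy is at its maximum when every outcome is equally likely, and at its minimum (zero) when one outcome has probability 1, since there is then nothing left to communicate.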


### Proof that Shannon Entropy is the lower bound on the expected number of bits

- We've covered a few ways to reduce the number of bits used when representing information, e.g. fixed and variable length encoding. However, we have claimed above that Shannon Entropy represents the best we can do; that is, it is the lower bound on the expected number of bits per outcome needed to encode outcomes drawn from a distribution

$$\begin{aligned}
    \text{Shannon Entropy} &= - \sum_X P(X) \log_2(P(X))
\end{aligned}$$

- This is a foundational idea in modern AI/machine learning development, so it pays to dig a little deeper to understand it. How do we know that this is the best we can do? Restating with jargon, how do we know that Shannon Entropy represents the **information-theoretic lower bound** on the expected number of bits per symbol for any uniquely decodable code?

- A non-formal proof is provided in `kl_divergence.ipynb`