# Entropy

##### Keywords: entropy,  maxent, binomial, normal distribution, statistical mechanics

## Contents
{:.no_toc}
* 
{: toc}

## Information Theory
Entropy, and the information theory underlying it, is a powerful reframing of topics in statistics and machine learning. Entropy allows quantification of how surprising a result is, how difficult something is to guess, how much information a fact contains, and how wrong a given model is. And it turns out that most of these things are the same.

In particular, information theory sets up an equivalence between how hard something is to guess, how much information it reveals, and how surprising it would be if that event occurred. We'll start with a measure of "difficulty of guessing" and reveal the other connections as we go along.

For nomenclature: "Entropy" will be reserved for an entire process or distribution (rolling a die). "Surprise" and "Difficulty of guessing" will be applied to both events and to processes/distributions, and will coincide with Entropy when discussing a process or distribution.

## Measuring difficulty of guessing
We'd like to measure how hard it is to guess something. We have a clear idea that the smaller the probability of an event, the harder it is to guess that the event would occur. Equivalently, the smaller the probability, the more surprised we are if that event comes to pass.

But what should be the units of "difficult to guess"? It's definitely not newtons or meters. Let's actually just side-step that issue. Like defining a kilogram as "the mass of this hunk of platinum-iridium alloy" and other masses as "3 hunks of platinum-iridium", we'll use one coin flip (termed one 'bit') as the baseline and "as hard to guess as N coin flips" as our measure.

#### Log of a probability measures difficulty of guessing
It turns out that simply taking the (negative) log base 2 of an event's probability can re-express that event in terms of coin flips (a.k.a. bits). For example: how hard is it to guess a 50-50 event? $-log_2(1/2)=1$. It's as hard as guessing one coin flip. What about a 1-in-8 event, like correctly predicting that a coin will land heads, heads, tails? $-log_2(1/8)=3$, or three coin flips. Guessing a die will come up 5? $-log_2(1/6)\approx 2.58$, or between two and three coin flips.

In fact, we aren't stuck with coin flips as the yardstick: we can choose log base 6 to measure in terms of die rolls, and so on.
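The examples above are easy to check numerically. Here is a minimal sketch; the function name `surprise` and its `base` parameter are illustrative choices, not anything defined in these notes.

```python
import math

def surprise(p, base=2):
    """Difficulty of guessing an event of probability p.

    With base=2 the unit is coin flips (bits); with base=6 it is die rolls.
    The name `surprise` is just an illustrative choice.
    """
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -math.log(p) / math.log(base)

print(surprise(1/2))          # a 50-50 event: one coin flip
print(surprise(1/8))          # heads, heads, tails: three coin flips
print(surprise(1/6))          # a die showing 5: about 2.58 coin flips
print(surprise(1/6, base=6))  # the same event measured in die rolls
```

Changing the base only rescales the answer by a constant factor, which is why the choice of yardstick is a matter of convenience.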

What about the difficulty of guessing a 3-in-8 event? Well, we could just do the log calculation (it'll work), but check this out:

$$log_2(3/8)=log_2(\frac{1/8}{1/3})=log_2(1/8) - log_2(1/3)$$
(a factor of -1 omitted throughout for clarity)

The final expression says that the difficulty of guessing a 3-in-8 is the difficulty of guessing a 1-in-8, but take away the difficulty of guessing a 1-in-3. Literally: you have to guess a 1-in-8, but not which of the three 1-in-8s.

The appearance of logarithms here shouldn't be too surprising. If we consider the difficulty of guessing that a coin is heads and that an independent 1-in-3 spinner is green, the probabilities multiply, but the difficulty of guessing should add: the whole guess should be as hard as guessing one and then the other, separately. Logs are simply the classic tool for turning multiplications into additions.
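A quick numerical check of both claims, the additivity for independent events and the 3-in-8 decomposition, might look like the sketch below. The helper `surprise_bits` and the variable names are illustrative.

```python
import math

def surprise_bits(p):
    """Difficulty of guessing an event of probability p, in coin flips."""
    return -math.log2(p)

p_heads = 1/2
p_green = 1/3
joint = p_heads * p_green  # independent events: probabilities multiply

# Difficulties add: guessing both is as hard as guessing each in turn.
print(surprise_bits(joint))
print(surprise_bits(p_heads) + surprise_bits(p_green))

# A 3-in-8 is a 1-in-8 minus the difficulty of a 1-in-3.
print(surprise_bits(3/8))
print(surprise_bits(1/8) - surprise_bits(1/3))
```

Each pair of printed values should agree up to floating-point rounding.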

#### Connection to maximum likelihood
When we write a likelihood, often the very next step is to take the log and make noises about how taking logs doesn't change where the maximum is. If we follow the logic through, though, maximizing the likelihood is the same as minimizing the (negative) log-likelihood. Maximum likelihood (i.e. "Pick the parameters that make the data as likely as possible") is equivalent to minimum surprise ("Pick the parameters that make the observed data least surprising").
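To see the equivalence concretely, here is a small sketch that fits the bias of a coin two ways over a grid of candidate values. The data in `flips` is made up for illustration, and the grid search is only a stand-in for a proper optimizer.

```python
import math

# Made-up data for illustration: 1 = heads, 0 = tails (6 heads, 2 tails).
flips = [1, 1, 0, 1, 0, 1, 1, 1]

def likelihood(theta, data):
    """Probability of the observed flips if P(heads) = theta."""
    result = 1.0
    for x in data:
        result *= theta if x == 1 else 1 - theta
    return result

def total_surprise(theta, data):
    """Sum of per-flip surprises in bits, i.e. the negative log2-likelihood."""
    return sum(-math.log2(theta if x == 1 else 1 - theta) for x in data)

# Candidate biases 0.01, 0.02, ..., 0.99 (endpoints excluded to avoid log(0)).
grid = [i / 100 for i in range(1, 100)]

best_by_likelihood = max(grid, key=lambda t: likelihood(t, flips))
best_by_surprise = min(grid, key=lambda t: total_surprise(t, flips))

print(best_by_likelihood, best_by_surprise)  # 0.75 0.75
```

Both criteria land on the observed fraction of heads, as expected: the total surprise is just the negative log of the likelihood, so one is maximized exactly where the other is minimized.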

Maybe taking the log isn't just about mathematical convenience after all!

##  Entropy: Difficulty of guessing a process/distribution/random variable
Up till now we've been talking about the difficulty of guessing or surprise at an event. What about the difficulty of guessing a process, like guessing the result of a die roll, or a spinner where each color has a different probability?

We simply average surprise or difficulty over all outcomes. We define **Entropy** for a probability distribution/random variable P as:

$$H(P) = - E_P[log(p)] = - \int p(x) log(p(x))dx \,\,\,OR\, - \sum_i p_i log(p_i) $$

Where $H(P)$ can be read as "hardness of guessing $P$'s outcome". Notice the distinction between $P$ and $p$. Capital $P$ is a random variable/probability distribution; lowercase $p$ is the probability of one particular outcome, like $P=7$ or $P=red$. The formula just says: visit every outcome, measure how surprising that outcome is, and weight that surprise by how often the outcome occurs. Entropy is the (weighted) average surprise over all possible outcomes.

Let's do an example: what's the entropy (difficulty of guessing) of a spinner that's red half the time, green 1/3 of the time, and blue 1/6 of the time?

$$H(spinner)= -[1/2\cdot log_2(1/2)+1/3\cdot log_2(1/3)+1/6\cdot log_2(1/6)] = 1.46$$

So guessing the spinner's outcome is (on average) as hard as guessing about 1 and a half coin flips. This should make some sense: 1/2 the time it's as hard as guessing 1 coin flip, 1/3 of the time it's as hard as 1.58 coin flips, and so on.
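The spinner calculation can be reproduced with a few lines of code. This is a minimal sketch for discrete distributions only; the function name `entropy` and the dictionary layout are illustrative choices.

```python
import math

def entropy(probs, base=2):
    """Weighted average surprise of a discrete distribution.

    Outcomes with probability 0 are skipped, which matches the usual
    convention that 0 * log(0) counts as 0.
    """
    return -sum(p * math.log(p) for p in probs if p > 0) / math.log(base)

spinner = {"red": 1/2, "green": 1/3, "blue": 1/6}
h_spinner = entropy(spinner.values())

print(round(h_spinner, 2))             # 1.46
print(round(entropy([1/6] * 6), 2))    # a fair die: 2.58
```

Note that the fair die's entropy equals the surprise of any single face, because every outcome is equally surprising and the average changes nothing.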


## Entropy as information
We can measure the information content of some fact by measuring how much it reduces our uncertainty. For instance, how much information is present in "the spinner is not blue"?

$$H(spinner|not\ blue)=-[6/10\cdot log_2(6/10)+4/10\cdot log_2(4/10)+0\cdot log_2(0)] = .97$$

(Notice that the probabilities of red and green have changed since we've ruled out blue and they still need to sum to 1. The $0\cdot log_2(0)$ term is taken to be 0, by the usual convention.)

So, guessing is now a little bit easier than guessing a coin flip: the spinner has above 50% chance of being red, and if it's not red it's green. The spinner has been reduced to a biased coin. Moreover, we lost about .49 of a coin flip of difficulty: that's the informational value of knowing that the spinner isn't blue.
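The same bookkeeping can be sketched in code: rule out an outcome, rescale what remains, and compare entropies before and after. The helper names `entropy_bits` and `rule_out` are illustrative.

```python
import math

def entropy_bits(probs):
    """Entropy in coin flips; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def rule_out(dist, outcome):
    """Drop one outcome and rescale the rest so they sum to 1 again."""
    kept = {k: p for k, p in dist.items() if k != outcome}
    total = sum(kept.values())
    return {k: p / total for k, p in kept.items()}

spinner = {"red": 1/2, "green": 1/3, "blue": 1/6}
not_blue = rule_out(spinner, "blue")   # red about 0.6, green about 0.4

h_before = entropy_bits(spinner.values())
h_after = entropy_bits(not_blue.values())
info_gained = h_before - h_after

print(round(h_after, 2), round(info_gained, 2))  # 0.97 0.49
```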

Moreover, if we know the spinner IS blue, there's no entropy left: $0\cdot log(0)+0\cdot log(0)+1\cdot log(1)=0$, so the informational value of knowing the spinner is blue is 1.46 coin flips.

Finally, if someone says "Hey, I'll tell you whether or not the spinner is blue" they're offering us $1/6\cdot 1.46 + 5/6\cdot .49 = .65$ bits of information (sometimes they tell us the spinner landed on blue and we get a lot of uncertainty reduction, more often we only get a little reduction). Moreover, .65 is the entropy of a coin with bias 5/6, that is, the average of the surprise at a 1-in-6 event (2.58 bits) and the surprise at a 5-in-6 event (.26 bits), weighted by how often each occurs. Thus, the information value of learning whether or not the spinner is blue is the same as the average surprise at the answer, and the same as the entropy of a process that yields "blue" 1/6th of the time and "not blue" 5/6ths of the time.
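As a sanity check, the expected reduction in entropy from hearing "blue" or "not blue" can be compared directly with the entropy of the blue/not-blue answer itself. The sketch below uses illustrative names throughout.

```python
import math

def entropy_bits(probs):
    """Entropy in coin flips; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

spinner = {"red": 1/2, "green": 1/3, "blue": 1/6}
h_spinner = entropy_bits(spinner.values())
p_blue = spinner["blue"]

# If we are told "blue", no uncertainty remains.
gain_if_blue = h_spinner - 0.0

# If we are told "not blue", red and green are rescaled to sum to 1.
rest = [p / (1 - p_blue) for k, p in spinner.items() if k != "blue"]
gain_if_not_blue = h_spinner - entropy_bits(rest)

# Average the two gains by how often each answer is given.
expected_gain = p_blue * gain_if_blue + (1 - p_blue) * gain_if_not_blue

# Entropy of the answer itself: a coin that says "blue" 1/6 of the time.
h_answer = entropy_bits([p_blue, 1 - p_blue])

print(round(expected_gain, 2), round(h_answer, 2))  # 0.65 0.65
```

The two numbers agree to floating-point precision, not just after rounding.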

The entropy-surprise-information metaphor runs really, really deep. Other entropy measures like joint entropy and mutual information can be defined to measure the difficulty of guessing two [possibly dependent] outcomes simultaneously, and how much learning one variable's value tells us about another variable's value.