%%html

<style>
.boxed {
    margin: 10px 30px;
    padding: 10px;
    border: 1px solid black;
}
</style>

# Information Theory

Much like mass and energy, *information* is a fundamental quantity of the universe.  It may seem difficult at first to quantify information; we know that textbooks are dense with it and political speeches nearly devoid of it, but how can we formalize our intuition into something more useful?

Claude Shannon (a Michigan undergrad!) set out to tackle this challenge, laying out the foundations of what we know today as **Information Theory** in his 1948 paper, "*A Mathematical Theory of Communication*".

## Signals and Communication

In machine learning, we will mainly use information theory as a tool to manipulate probability distributions.  However, much of the intuition behind the theory comes from signal processing--indeed, this was Shannon's original purpose--so it is worthwhile to establish some basic terminology.

<div class="boxed">
*"The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point."*  

<div style="text-align:right">&ndash; Claude E. Shannon</div>
</div>

### Communication Systems

Shannon outlines five key components of a communication system **[Shannon 1951]**:
1. The **information source**, responsible for producing messages.  
2. An **encoder**, which operates on the message in some way to produce a signal suitable for transmission.  
3. A **channel**, along which the signal is transmitted.  The signal may be partially or fully corrupted by *noise* at this stage.
4. A **decoder**, which attempts to reconstruct the original message from the received signal.
5. The **destination**, which is the intended recipient of the original message.
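
To make the five components concrete, here is a minimal sketch of the pipeline in Python. Everything in it is an illustrative assumption rather than part of Shannon's description: the four-symbol alphabet, the fixed-width two-bit code, the bit-flipping noise model, and the function names `source`, `encode`, `channel`, and `decode`.

```python
import random

ALPHABET = "ABCD"  # hypothetical 4-symbol source alphabet
BITS = 2           # assumed fixed-width code: 2 bits per symbol


def source(n, rng):
    """Information source: selects n symbols from the alphabet."""
    return [rng.choice(ALPHABET) for _ in range(n)]


def encode(message):
    """Encoder: maps each symbol to a fixed-width binary codeword."""
    return [int(b) for s in message
            for b in format(ALPHABET.index(s), f"0{BITS}b")]


def channel(signal, flip_prob, rng):
    """Channel: flips each bit independently with probability flip_prob."""
    return [bit ^ (rng.random() < flip_prob) for bit in signal]


def decode(received):
    """Decoder: reads BITS bits at a time and maps them back to symbols."""
    symbols = []
    for i in range(0, len(received), BITS):
        index = int("".join(str(b) for b in received[i:i + BITS]), 2)
        symbols.append(ALPHABET[index])
    return symbols


rng = random.Random(0)
message = source(10, rng)

# The destination simply receives whatever the decoder produces.
noiseless = decode(channel(encode(message), 0.0, rng))
noisy = decode(channel(encode(message), 0.2, rng))

print("sent:     ", "".join(message))
print("noiseless:", "".join(noiseless))
print("noisy:    ", "".join(noisy))
```

With `flip_prob=0.0` the channel leaves the signal untouched, so the decoder recovers the message exactly. With a nonzero flip probability some symbols may arrive corrupted, since this simple code carries no redundancy.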

### Discrete Noiseless Systems

In these notes, we will focus on **discrete noiseless systems**, in which both the message and signal are discrete, and where the channel introduces no noise.

## Information

### Information & Surprise

The information we gain from receiving a message depends only on the probability $p$ of that message.  In other words, the *meaning* of the message does not matter to us, only the fact that it was *selected from a set* of possible messages **[Shannon 1951]**.

### Axiomatic Derivation

For information to be a useful quantity, it should satisfy the following axioms:

1. Information is nonnegative, $I(p) \geq 0$.  We can never *lose* information.
2. A sure event provides no information, $I(1) = 0$.
3. The information gained from observing two independent events is the sum of the information gained from observing each individually, $$I(p_1 \cdot p_2) = I(p_1) + I(p_2).$$
4. Information should be **continuous** and **monotonic** in $p$: the less probable an event, the more information it provides.
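
The axioms above do not yet name a function. A standard choice that satisfies them, assumed here rather than derived in these notes, is $I(p) = -\log_2 p$, which measures information in bits. The sketch below checks each axiom numerically on a handful of probabilities; the function name `information` and the test values are illustrative.

```python
import math


def information(p):
    """Information in bits of an event with probability p, for 0 < p <= 1.

    Written as log2(1/p), which equals -log2(p).
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    return math.log2(1.0 / p)


probs = [1.0, 0.5, 0.25, 0.1, 0.01]  # listed in decreasing order

# Axiom 1: information is nonnegative.
assert all(information(p) >= 0 for p in probs)

# Axiom 2: a sure event provides no information.
assert information(1.0) == 0

# Axiom 3: information of independent events adds.
for p1 in probs:
    for p2 in probs:
        assert math.isclose(information(p1 * p2),
                            information(p1) + information(p2))

# Axiom 4 (monotonic part): less probable events carry more information.
values = [information(p) for p in probs]
assert all(a < b for a, b in zip(values, values[1:]))

for p in probs:
    print(f"p = {p:<5} I(p) = {information(p):.3f} bits")
```

Under this choice a fair coin flip, with $p = 1/2$, carries exactly one bit. A numerical check on a few points is only a sanity test, not a proof that the axioms hold for all $p$.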

### Mutual Information

### Relative Entropy

- Distance between distributions
- Signal interpretation
- Example:  Morse code tailored to German text

# References

- **[Shannon 1951]**
- **[Pierce 1980]**