# setup
from IPython.display import display, HTML
display(HTML('<style>.prompt{width: 0px; min-width: 0px; visibility: collapse}</style>'))
display(HTML(open('rise.css').read()))

# imports
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set(style="whitegrid", font_scale=1.5, rc={'figure.figsize':(12, 6)})


# CMPS 2200
# Introduction to Algorithms

## Data Compression


Today's agenda:

- Huffman Coding

Suppose we are given a document $D$ written over the alphabet $\Sigma$. Our goal is to map each character of $\Sigma$ to a binary codeword so that $D$ can be represented with as few bits as possible. Of course, the encoding must represent the characters of $\Sigma$ distinctly, so that $D$ can be recovered from its encoding.

Example: Suppose $\Sigma=\{A, B, C, D\}$, and document $D = \langle A, A, A, A, A, A, A, A, A, B, C, D\rangle$.

The naive encoding could be $e(A)=00, e(B)=01, e(C)=10, e(D)=11$. This is a *fixed-length* encoding of $\Sigma$. The length of the document with this encoding is $2\cdot 12 = 24$ bits. The encoding is:

$e(D) = "000000000000000000011011"$

But this doesn't account for redundancy in the document. Let $f: \Sigma \rightarrow [0,1]$ give the relative frequency of each character in $D$; this is easily computed in $O(|D|)$ work. What about the span?
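As a minimal sketch of the frequency computation, the snippet below counts characters in one sequential pass, which is $O(|D|)$ work. The helper name `frequencies` is an illustrative choice rather than part of the notes, and this version makes no attempt to reduce the span.

```python
from collections import Counter

def frequencies(document):
    """Return the relative frequency f(c) of each character c in `document`."""
    counts = Counter(document)   # one sequential pass over D
    total = len(document)
    return {ch: n / total for ch, n in counts.items()}

doc = "AAAAAAAAABCD"             # the example document: nine A's, then B, C, D
f = frequencies(doc)
print(f["A"], f["B"])            # A makes up 9/12 of the document, B 1/12
```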

Intuitively, we should let the frequencies guide the encoding: the more often a character appears in $D$, the shorter its codeword should be.

Suppose we used $e'(A) = 0$, what does this mean for the encodings of the other characters? 

No other character can be encoded with a leading $0$. So we could use $e'(B) = 10, e'(C) = 110, e'(D) = 111.$ This leads to an encoding of:

$e'(D) = "00000000010110111"$

This has length $1\cdot 9 + 2\cdot 1 + 3\cdot 1 + 3\cdot 1 = 17$ bits, compared with $24$ bits for the fixed-length encoding. So this is a bit better.
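The two encodings of the example can be checked with a short sketch. The `encode` helper and the dictionary names are assumptions made for illustration; the codewords themselves are the ones given in the text.

```python
def encode(document, code):
    """Concatenate the codeword of every character in `document`."""
    return "".join(code[ch] for ch in document)

doc = "AAAAAAAAABCD"
fixed = {"A": "00", "B": "01", "C": "10", "D": "11"}      # e
variable = {"A": "0", "B": "10", "C": "110", "D": "111"}  # e'

fixed_bits = encode(doc, fixed)
variable_bits = encode(doc, variable)
print(len(fixed_bits), fixed_bits)
print(len(variable_bits), variable_bits)
```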

In general, the cost of a given encoding $e$ is 

$$C(e) = \sum_{i=0}^{|D|-1} |e(D[i])| = |D| \sum_{\sigma\in\Sigma} f(\sigma)\cdot |e(\sigma)|.$$

Over all possible valid encodings $e: \Sigma \rightarrow \{0,1\}^*$, we want to find a variable-length encoding $e_*$ so that $C(e_*)$ is minimized.
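To see that the two forms of the cost agree, here is a small sketch that evaluates both on the running example. The function names are hypothetical, and `Fraction` is used only so that the frequency-weighted sum is exact rather than subject to floating-point rounding.

```python
from collections import Counter
from fractions import Fraction

def cost_by_position(document, code):
    """Sum of codeword lengths over every position of the document."""
    return sum(len(code[ch]) for ch in document)

def cost_by_frequency(document, code):
    """|D| times the frequency-weighted average codeword length."""
    counts = Counter(document)
    n = len(document)
    return n * sum(Fraction(counts[ch], n) * len(code[ch]) for ch in counts)

doc = "AAAAAAAAABCD"
variable = {"A": "0", "B": "10", "C": "110", "D": "111"}
print(cost_by_position(doc, variable), cost_by_frequency(doc, variable))
```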



### Encodings as Trees

How do we ensure that a variable-length encoding is valid? In other words, how do we only consider variable-length encodings that are *prefix-free*, meaning that no codeword is a prefix of another?

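We can think of an encoding as representing a binary tree, with characters from $\Sigma$ as leaves. Each edge is labeled $0$ or $1$, and the codeword of a character is the sequence of labels on the path from the root to its leaf. Note that a fixed-length encoding has all leaves at the same level.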

For the two encodings we gave, we'd have:

<img src = "encoding_trees.jpg" width="60%">
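The tree view can be sketched in code as follows. This is an illustrative representation, not the one used in the course: internal nodes are dictionaries keyed by `"0"` and `"1"`, leaves are characters, and all codewords are assumed to be non-empty. Building the tree fails exactly when one codeword is a prefix of another, and decoding walks from the root and restarts at each leaf.

```python
def build_tree(code):
    """Build the encoding tree; raise ValueError if `code` is not prefix-free."""
    root = {}
    for ch, word in code.items():
        node = root
        for bit in word[:-1]:
            node = node.setdefault(bit, {})
            if not isinstance(node, dict):   # ran into an existing leaf
                raise ValueError("not prefix-free")
        if word[-1] in node:                 # would overwrite a leaf or subtree
            raise ValueError("not prefix-free")
        node[word[-1]] = ch
    return root

def decode(bits, root):
    """Walk the tree bit by bit, emitting a character at every leaf."""
    out, node = [], root
    for bit in bits:
        node = node[bit]
        if not isinstance(node, dict):       # reached a leaf
            out.append(node)
            node = root
    return "".join(out)

variable = {"A": "0", "B": "10", "C": "110", "D": "111"}
tree = build_tree(variable)
print(decode("00000000010110111", tree))
```

Because every codeword ends at a leaf, the decoder never needs a separator between codewords.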

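Every prefix-free encoding $e$ can be represented by a tree $T_e$, so the optimal compression of $D$ can be achieved by identifying the encoding tree $T$ that minimizes the following cost, where $d_T(\sigma)$ is the depth of the leaf for $\sigma$ in $T$ (the constant factor $|D|$ is dropped, since it does not affect which tree is best):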

$$C(T) = \sum_{\sigma\in\Sigma} f(\sigma)\cdot d_T(\sigma)$$

We will come up with a greedy algorithm for constructing $T$ and show that it is optimal.


### Huffman Coding


Intuitively we know we should ensure that when constructing an encoding tree, the higher the frequency, the shorter the path length.

How about if we sort the frequencies in descending order and then assign tree positions in this order? But how do we guarantee the highest frequency characters have a short depth? 

We could split the characters into two sets of roughly equal total frequency and recurse on each set, so that the more frequent characters end up at lower depth. This top-down, divide-and-conquer approach is known as Shannon-Fano coding.

Unfortunately, Shannon-Fano coding does not always produce an optimal encoding. David Huffman (as a graduate student in Robert Fano's class) came up with a *bottom-up* greedy algorithm as a class project and was able to prove that it was optimal.

The main idea of this algorithm is to choose the two **least** frequent characters $x$ and $y$ and create a subtree with $x$ and $y$ as sibling leaves for the final encoding. We then remove $x$ and $y$ from $\Sigma$ and add a *new* character $z$ with frequency $f(x)+f(y)$, and recurse to compute a tree $T'$. The final tree $T$ is just $T'$ with $z$ replaced by the subtree with $x, y$ as siblings. 

<img src="huffman_example.jpg" width="60%">

We can use a priority queue to construct $T$, so the work required is $O(n\log n)$ where $n=|\Sigma |.$ 
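The priority-queue construction can be sketched with Python's `heapq`. This is one possible implementation under stated assumptions, not the course's reference code: characters are assumed to be strings (so tuples can stand for merged subtrees), and a running counter breaks frequency ties so the heap never has to compare subtrees. Each of the $n-1$ merges costs $O(\log n)$ heap work.

```python
import heapq
from itertools import count

def huffman_code(freq):
    """Return a dict mapping each character in `freq` to a Huffman codeword."""
    if not freq:
        return {}
    if len(freq) == 1:                       # a lone character still needs one bit
        return {ch: "0" for ch in freq}
    tiebreak = count()
    heap = [(f, next(tiebreak), ch) for ch, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        fx, _, x = heapq.heappop(heap)       # least frequent
        fy, _, y = heapq.heappop(heap)       # second least frequent
        # The merged node plays the role of the new character z.
        heapq.heappush(heap, (fx + fy, next(tiebreak), (x, y)))
    tree = heap[0][2]

    code = {}
    def assign(node, prefix):
        if isinstance(node, tuple):          # internal node: (left, right)
            assign(node[0], prefix + "0")
            assign(node[1], prefix + "1")
        else:                                # leaf: a character
            code[node] = prefix
    assign(tree, "")
    return code

freq = {"A": 9, "B": 1, "C": 1, "D": 1}      # counts from the example document
code = huffman_code(freq)
cost = sum(freq[ch] * len(code[ch]) for ch in freq)
print(code, cost)
```

The exact codewords depend on how ties are broken, but on the example document the total comes to 17 bits, matching the hand-built encoding $e'$.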

### Proof of Optimality

**Greedy Choice**: Let $x, y\in\Sigma$ have the two smallest frequencies. Then there is an optimal encoding $T$ for $\Sigma$ with $x, y$ as sibling leaves at maximum depth. 

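**Proof**: Take any optimal tree $T$, and let $a, b$ be two sibling leaves at maximum depth (these exist because every internal node of an optimal tree has two children). Without loss of generality, $f(x) \le f(a)$ and $f(y) \le f(b)$. Swapping $x$ with $a$ changes the cost by $(f(x) - f(a))\cdot(d_T(a) - d_T(x)) \le 0$, since $f(x) \le f(a)$ and $d_T(x) \le d_T(a)$; the same holds for swapping $y$ with $b$. The resulting tree costs no more than $T$, so it is also optimal, and it has $x, y$ as sibling leaves at maximum depth.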

**Optimal Substructure**: Let $\Sigma' = (\Sigma \setminus \{x, y\}) \cup \{z\}$, where $z\not\in \Sigma$ is a character with frequency $f(z) = f(x) + f(y)$. If $T'$ is an optimal encoding for $\Sigma'$, then an optimal encoding $T$ for $\Sigma$ can be constructed from $T'$ by replacing the leaf representing $z$ with an internal node that has $x, y$ as children.

**Proof**: Suppose that there were some other tree $Z$ for $\Sigma$ with $C(Z) < C(T)$. By the greedy choice property, we can assume that $Z$ has $x$ and $y$ as sibling leaves at maximum depth.

From $Z$ we can then construct $Z'$, a tree for $\Sigma'$, by removing $x, y$ and turning their parent into a leaf for $z$ with frequency $f(x)+f(y)$. Since $x$ and $y$ sit one level below $z$, we have $C(T) = C(T') + f(x) + f(y)$ and likewise $C(Z) = C(Z') + f(x) + f(y)$. Therefore $C(Z') = C(Z) - f(x) - f(y) < C(T) - f(x) - f(y) = C(T')$, which contradicts the optimality of $T'$.
