# Huffman codes


Given an alphabet $A$, a *code* replaces each letter $x$ of $A$ by a variable-length binary string $c(x)$.
A code is a *prefix code* if, for any two distinct letters $x$ and $y$ in $A$, the string $c(x)$ is not a prefix of $c(y)$.
A prefix code can be decoded unambiguously by scanning the encoded message from left to right.
Given a text $T$, let $f_x$ be the relative frequency of letter $x$ in $T$, that is, the number of occurrences of $x$ divided by the length of $T$. The average number of bits required
per letter is the quantity

$C = \sum_{x \in A} f_x |c(x)|$

where $|c(x)|$ is the length of the string $c(x)$. A prefix code is *optimal* if $C$ is minimal among all
prefix codes.
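
To make the two definitions concrete, here is a minimal sketch that checks the prefix property and evaluates $C$ for a given code. The function names `is_prefix_code` and `average_length` and the toy code over $\{a, b, c\}$ are illustrative assumptions, not part of the notebook's own cells.

```python
from collections import Counter

def is_prefix_code(code):
    """Return True if no codeword is a prefix of another letter's codeword."""
    words = sorted(code.values())
    # In lexicographic order, every extension of a word directly follows it,
    # so it is enough to compare neighbouring codewords.
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))

def average_length(code, text):
    """Average bits per letter: sum over x of f_x * |c(x)|, f_x relative frequency."""
    counts = Counter(text)
    return sum(n * len(code[x]) for x, n in counts.items()) / len(text)

# Hypothetical toy code, used only for illustration
toy = {"a": "0", "b": "10", "c": "11"}
print(is_prefix_code(toy))            # True
print(average_length(toy, "aaabbc"))  # (3*1 + 2*2 + 1*2) / 6 = 1.5
```

A code such as `{"a": "0", "b": "01"}` fails the check, because `"0"` is a prefix of `"01"`.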

**Tasks:** 

1. Design an algorithm that, given an input text $T$, constructs an optimal prefix code for $T$.
The *size* of the input is the number of characters in $T$.

2. Design an algorithm that, given a prefix code and the encoding of a text $T$ under that code, outputs $T$.

## Task 1 
- Design an algorithm that, given an input text $T$, constructs an optimal prefix code for $T$.
The *size* of the input is the number of characters in $T$.

In [1]:
# Node of the Huffman tree. A leaf stores a letter in `char`; an internal node
# has `char = None` and stores the summed frequency of its two subtrees.
class Tree:
    def __init__(self, freq, char=None, left=None, right=None):
        self.char = char
        self.freq = freq
        self.left  = left
        self.right = right
        
# alpha_count counts how often each letter occurs in the text and returns a list of
# leaf nodes, one per distinct letter, sorted by increasing frequency.
def alpha_count(text):
    alpha = set(text)
    count = [(char, text.count(char)) for char in alpha]
    count = sorted(count, key=lambda x:x[1])
    return [Tree(f, char) for char,f in count]

# huffman_tree builds a Huffman tree from a list of leaf nodes sorted by frequency.
# It is a greedy algorithm: it repeatedly replaces the two lowest-frequency nodes
# by a parent node whose frequency is their sum, until a single root remains.
def huffman_tree(T):
    while len(T)>1:
        left, right = T[:2]
        frequency = left.freq + right.freq
        T = T[2:]
        T.append(Tree(frequency, left = left, right = right))
        T.sort(key=lambda x: x.freq)
    return T[0]

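# huffman_code generates the encoding for a text. It traverses the Huffman tree
# breadth-first (the list T doubles as the BFS queue), appending '0' for a left
# branch and '1' for a right branch, and returns (codeword, letter) pairs for the leaves.
# Note: if the text has a single distinct letter, the root is a leaf and gets the
# empty codeword ''.
def huffman_code(text):
    T = alpha_count(text)
    T = [(huffman_tree(T),'')]
    for node, enc in T:
        if node.left:
            T.append((node.left, enc + '0'))
        if node.right:
            T.append((node.right, enc + '1'))
    return [(enc, node.char) for node, enc in T if node.char is not None]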

# huffman_encode encodes the text as a string of '0' and '1' characters: it looks up
# the codeword of each letter in the given encoding and concatenates the results.
def huffman_encode(encoding, text):
    code = {letter:coding for coding,letter in encoding}
    return ''.join(code[letter] for letter in text)

In [2]:
huffman_code("reid, ari and evelyn")

[('000', 'a'),
 ('001', 'i'),
 ('010', 'd'),
 ('101', 'e'),
 ('110', ' '),
 ('0110', 'y'),
 ('0111', 'v'),
 ('1000', ','),
 ('1001', 'l'),
 ('1110', 'n'),
 ('1111', 'r')]

In [3]:
huffman_encode(huffman_code("reid, ari and evelyn"),"reid, ari and evelyn")

'11111010010101000110000111100111000011100101101010111101100101101110'
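
The `huffman_tree` function above re-sorts the whole node list after every merge. A common alternative keeps the nodes in a binary heap, so that each of the $k - 1$ merges for $k$ distinct letters costs $O(\log k)$. The sketch below is one way to do this with the standard `heapq` module. The name `huffman_code_heap`, the dictionary return type, and the convention of giving a one-letter alphabet the codeword `'0'` are assumptions made for illustration. Ties between equal frequencies may be broken differently than in the cells above, so individual codewords can differ, but the total encoded length is the same for any optimal code.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_code_heap(text):
    """Return {letter: codeword} for an optimal prefix code of `text`."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # Assumption: a single distinct letter gets the 1-bit codeword '0'.
        return {next(iter(freq)): "0"}
    tiebreak = count()  # unique counter so the heap never compares trees
    # Heap entries are (frequency, tiebreak, tree); a tree is a letter or a (left, right) pair.
    heap = [(f, next(tiebreak), ch) for ch, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)  # two lowest-frequency trees
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (t1, t2)))
    code = {}
    stack = [(heap[0][2], "")]
    while stack:  # walk the tree, '0' for left and '1' for right
        tree, prefix = stack.pop()
        if isinstance(tree, tuple):
            stack.append((tree[0], prefix + "0"))
            stack.append((tree[1], prefix + "1"))
        else:
            code[tree] = prefix
    return code

text = "reid, ari and evelyn"
code = huffman_code_heap(text)
print(sum(len(code[ch]) for ch in text))  # 68, the length of the encoded string above
```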

## Task 2
- Design an algorithm that, given a prefix code and the encoding of a text $T$ under that code, outputs $T$.

In [6]:
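# The huffman_decode function decodes an encoded string with a given encoding.
# It scans left to right, growing a window until the window matches a codeword;
# because the code is a prefix code, the first match is the only possible one.
def huffman_decode(encoding, text):
    code = {coding: letter for coding, letter in encoding}
    letters = []
    start, end = 0, 1
    while start < len(text):
        if end > len(text):
            # Guard against looping forever on trailing bits that match no codeword
            raise ValueError('encoded text ends in the middle of a codeword')
        word = text[start:end]
        if word in code:
            letters.append(code[word])
            start, end = end, end + 1
        else:
            end += 1
    return ''.join(letters)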

In [7]:
huffman_decode(huffman_code("reid, ari and evelyn"),huffman_encode(huffman_code("reid, ari and evelyn"),"reid, ari and evelyn"))

'reid, ari and evelyn'
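
The left-to-right scan can also be written as a single pass over the bits with a small buffer, which avoids slicing the encoded string. The following is a minimal sketch under that idea; `decode_stream` is a hypothetical name, and `sample` is the encoding printed in the output of `In [2]` above.

```python
def decode_stream(encoding, bits):
    """Decode `bits` using `encoding`, a list of (codeword, letter) pairs."""
    lookup = dict(encoding)
    out, buffer = [], ""
    for bit in bits:
        buffer += bit
        # In a prefix code, the first codeword the buffer matches is the right one.
        if buffer in lookup:
            out.append(lookup[buffer])
            buffer = ""
    if buffer:
        raise ValueError("trailing bits do not form a complete codeword")
    return "".join(out)

sample = [('000', 'a'), ('001', 'i'), ('010', 'd'), ('101', 'e'), ('110', ' '),
          ('0110', 'y'), ('0111', 'v'), ('1000', ','), ('1001', 'l'),
          ('1110', 'n'), ('1111', 'r')]
text = "reid, ari and evelyn"
to_bits = {letter: word for word, letter in sample}
bits = "".join(to_bits[letter] for letter in text)
print(decode_stream(sample, bits))  # reid, ari and evelyn
```

The buffer never grows beyond the longest codeword, so the work per input bit is bounded by that length.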

## Data

We now run our algorithms on the following texts to produce an optimal prefix code for each. Blanks, dots, question marks, etc. are part of the alphabet. Upper- and lower-case letters are considered the same letter. Write the encoding function $c(x)$ explicitly as a table.
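
One possible way to present $c(x)$ as a table is to format the `(codeword, letter)` pairs as Markdown rows. The helper `code_table` below is a hypothetical sketch, shown on the small encoding from Task 1 rather than on the full texts. It prints letters with `repr` so that the blank stays visible.

```python
from collections import Counter

def code_table(encoding, text):
    """Return a Markdown table for `encoding`, a list of (codeword, letter) pairs."""
    counts = Counter(text)
    lines = ["| letter | count | c(x) | length |", "|---|---|---|---|"]
    # Most frequent letters first; ties are ordered by codeword.
    for word, ch in sorted(encoding, key=lambda p: (-counts[p[1]], p[0])):
        lines.append(f"| {ch!r} | {counts[ch]} | {word} | {len(word)} |")
    return "\n".join(lines)

sample = [('000', 'a'), ('001', 'i'), ('010', 'd'), ('101', 'e'), ('110', ' '),
          ('0110', 'y'), ('0111', 'v'), ('1000', ','), ('1001', 'l'),
          ('1110', 'n'), ('1111', 'r')]
table = code_table(sample, "reid, ari and evelyn")
print(table)
```

The same call should work with the list-of-pairs format that `huffman_code` returns for the longer texts.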

### Hamlet

In [8]:
# use `lower()` to convert all the text to lower case
Hamlet = u'O all you host of heaven! O earth! What else? And shall I couple hell? Oh, fie! Hold, hold, my heart, And you, my sinews, grow not instant old, But bear me stiffly up. Remember thee! Ay, thou poor ghost, whiles memory holds a seat In this distracted globe. Remember thee! Yea, from the table of my memory I’ll wipe away all trivial fond records, All saws of books, all forms, all pressures past That youth and observation copied there, And thy commandment all alone shall live Within the book and volume of my brain, Unmixed with baser matter. Yes, by heaven! O most pernicious woman! O villain, villain, smiling, damned villain! My tables! Meet it is I set it down That one may smile, and smile, and be a  villain. At least I’m sure it may be so in Denmark. So, uncle, there you are. Now to my word.'.lower()

In [9]:
%%time
huffman_code(Hamlet)

CPU times: user 237 µs, sys: 0 ns, total: 237 µs
Wall time: 259 µs


[('00', ' '),
 ('0101', 'i'),
 ('0110', 't'),
 ('1000', 'l'),
 ('1001', 'o'),
 ('1011', 'a'),
 ('1110', 'e'),
 ('01001', 'y'),
 ('01110', ','),
 ('01111', 'd'),
 ('11000', 'h'),
 ('11001', 'r'),
 ('11011', 'm'),
 ('11110', 'n'),
 ('11111', 's'),
 ('010000', 'v'),
 ('010001', 'f'),
 ('101001', 'w'),
 ('101010', 'u'),
 ('110100', 'b'),
 ('1010001', '.'),
 ('1010110', 'c'),
 ('1101010', 'p'),
 ('1101011', '!'),
 ('10100000', '?'),
 ('10100001', 'k'),
 ('10101111', 'g'),
 ('101011100', 'x'),
 ('101011101', '’')]

In [10]:
%%time
huffman_encode(huffman_code(Hamlet),Hamlet)

CPU times: user 647 µs, sys: 1e+03 ns, total: 648 µs
Wall time: 658 µs


'100100101110001000000100110011010100011000100111111011000100101000100110001110101101000011101111011010110010010011101011110010110110001101011001010011100010110110001110100011111111010100000001011111100111100111111100010111000100000010100101011010011010101101010100011100011000111010001000101000000010011100001110000100010101111011010110011000100110000111101110001100010011000011110111000110110100100110001110101111001011001110001011111100111100010011001101010011100011011010010011111010111110111010100111111011100010101111110011001101001001111010010110000101111101111101101011111100110001001100001111011100011010010101001100011010011101011110010011011111000111110110010101000101000110000100100101010110101010100010011001111011011111011011110100111011001000110110001110111011010110010110100101110000110110001001101010001101010100110011100100101011111100010011111101100111000101001110000101100011101111100110111110110111001110010100100110001001100001111111110010110011111111010110110000101111100001101

In [11]:
%%time
huffman_decode(huffman_code(Hamlet),huffman_encode(huffman_code(Hamlet),Hamlet))

CPU times: user 2.69 ms, sys: 6 µs, total: 2.7 ms
Wall time: 2.7 ms


'o all you host of heaven! o earth! what else? and shall i couple hell? oh, fie! hold, hold, my heart, and you, my sinews, grow not instant old, but bear me stiffly up. remember thee! ay, thou poor ghost, whiles memory holds a seat in this distracted globe. remember thee! yea, from the table of my memory i’ll wipe away all trivial fond records, all saws of books, all forms, all pressures past that youth and observation copied there, and thy commandment all alone shall live within the book and volume of my brain, unmixed with baser matter. yes, by heaven! o most pernicious woman! o villain, villain, smiling, damned villain! my tables! meet it is i set it down that one may smile, and smile, and be a  villain. at least i’m sure it may be so in denmark. so, uncle, there you are. now to my word.'

### Goethe 

In [12]:
# use `lower()` to convert all the text to lower case
Goethe = u'Habe nun, ach! Philosophie, Juristerei und Medizin, Und leider auch Theologie Durchaus studiert, mit heissem Bemühn. Da steh ich nun, ich armer Tor! Und bin so klug als wie zuvor; Heisse Magister, heisse Doktor gar Und ziehe schon an die zehen Jahr Herauf, herab und quer und krumm Meine Schüler an der Nase herum Und sehe, dass wir nichts wissen können! Das will mir schier das Herz verbrennen. Zwar bin ich gescheiter als all die Laffen, Doktoren, Magister, Schreiber und Pfaffen; Mich plagen keine Skrupel noch Zweifel, Fürchte mich weder vor Hölle noch Teufel Dafür ist mir auch alle Freud entrissen, Bilde mir nicht ein, was Rechts zu wissen, Bilde mir nicht ein, ich könnte was lehren, Die Menschen zu bessern und zu bekehren. Auch hab ich weder Gut noch Geld, Noch Ehr und Herrlichkeit der Welt; Es möchte kein Hund so länger leben! Drum hab ich mich der Magie ergeben, Ob mir durch Geistes Kraft und Mund Nicht manch Geheimnis würde kund; Dass ich nicht mehr mit saurem Schweiss Zu sagen brauche, was ich nicht weiss; Dass ich erkenne, was die Welt Im Innersten zusammenhält, Schau alle Wirkenskraft und Samen, Und tu nicht mehr in Worten kramen.'.lower()

In [13]:
%%time
huffman_code(Goethe)

CPU times: user 360 µs, sys: 1e+03 ns, total: 361 µs
Wall time: 367 µs


[('010', 'e'),
 ('111', ' '),
 ('0000', 'u'),
 ('0001', 'a'),
 ('0011', 's'),
 ('0111', 'r'),
 ('1000', 'h'),
 ('1010', 'i'),
 ('1011', 'n'),
 ('01100', 'l'),
 ('10010', 'm'),
 ('10011', 't'),
 ('11001', 'c'),
 ('11011', 'd'),
 ('001000', 'z'),
 ('001001', 'f'),
 ('001010', 'g'),
 ('011010', 'k'),
 ('011011', 'b'),
 ('110000', 'o'),
 ('110001', 'w'),
 ('110101', ','),
 ('00101101', '!'),
 ('00101110', '.'),
 ('00101111', 'ö'),
 ('11010000', ';'),
 ('11010001', 'ü'),
 ('11010010', 'p'),
 ('001011000', 'q'),
 ('001011001', 'ä'),
 ('110100110', 'j'),
 ('110100111', 'v')]

In [14]:
%%time
huffman_encode(huffman_code(Goethe),Goethe)

CPU times: user 675 µs, sys: 1e+03 ns, total: 676 µs
Wall time: 684 µs


'100000010110110101111011000010111101011110001110011000001011011111101001010001010011001100000011110000110100101000101001011010111111010011000000111101000111001101001110101010111000010111101111110010010110111010001000101010111101011110000101111011111011000101010110110100111111000100001100110001111001110000101100000110011000000101010100101111101100000111110011000000100000011111001110011000011011101001001111001111010111110010101010011111100001010100011001101010010111011011010100101101000110001011001011101111101100011110011100110101000111101011001100011110110000101111010111110101100110001110001011110010010011111110011110000011100101101111000010111101111101101110101011111001111000011101101001100000000101011100010110000111111100011010010111001000000011010011111000001111101000011110000101010001100110101111001000010010101010001110011010011111010111110000101010001100110101111101111000001101010011110000011111100101000010111111000010111101111100100010100101000010111001111001100011000010111110001

In [15]:
%%time
huffman_decode(huffman_code(Goethe),huffman_encode(huffman_code(Goethe),Goethe))

CPU times: user 4.91 ms, sys: 550 µs, total: 5.46 ms
Wall time: 5.07 ms


'habe nun, ach! philosophie, juristerei und medizin, und leider auch theologie durchaus studiert, mit heissem bemühn. da steh ich nun, ich armer tor! und bin so klug als wie zuvor; heisse magister, heisse doktor gar und ziehe schon an die zehen jahr herauf, herab und quer und krumm meine schüler an der nase herum und sehe, dass wir nichts wissen können! das will mir schier das herz verbrennen. zwar bin ich gescheiter als all die laffen, doktoren, magister, schreiber und pfaffen; mich plagen keine skrupel noch zweifel, fürchte mich weder vor hölle noch teufel dafür ist mir auch alle freud entrissen, bilde mir nicht ein, was rechts zu wissen, bilde mir nicht ein, ich könnte was lehren, die menschen zu bessern und zu bekehren. auch hab ich weder gut noch geld, noch ehr und herrlichkeit der welt; es möchte kein hund so länger leben! drum hab ich mich der magie ergeben, ob mir durch geistes kraft und mund nicht manch geheimnis würde kund; dass ich nicht mehr mit saurem schweiss zu sagen bra

**END**