# CS460 Algorithms and Their Analysis
## Programming Assignment 5: Huffman codes -- An example of greedy algorithms

**Author:** Yang Xu, Assistant Professor of Computer Science, San Diego State University

**Total points: 15**

In [3]:
import heapq
from collections import Counter

## Task 1. Define the TreeNode class

**Points: 2**

The `TreeNode` class implements a node in a binary tree. Each node stores the target **character** to be encoded, the **frequency** of that character, and its **left** and **right** child nodes.

You need to implement the following:
- In `__init__()`, initialize the attributes using the parameters. The left and right children should be `None`.
- In `__lt__()`, return whether `self.freq` is less than `other.freq`.
- In `__eq__()`, return whether `self.freq` equals `other.freq`.

In [6]:
class TreeNode:
    def __init__(self, char, freq):
        ### START YOUR CODE ###
        self.char = char
        self.freq = freq
        self.left = None
        self.right = None
        ### END YOUR CODE ###

    # defining comparators less than and equal to
    def __lt__(self, other):
        ### START YOUR CODE ###
        return self.freq < other.freq
        ### END YOUR CODE ###

    def __eq__(self, other):
        if other is None:
            return False
        if not isinstance(other, TreeNode):
            return False
        ### START YOUR CODE ###
        return self.freq == other.freq
        ### END YOUR CODE ###

    def __repr__(self):
        return f'Node({self.char}, {self.freq})'

In [7]:
# Do not change the test code here
node1 = TreeNode('a', 45)
node2 = TreeNode('b', 13)

print(node1 == node2)
print(node1 > node2)

False
True
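
The comparison methods matter because `heapq` orders items with `<` alone. The following is a minimal sketch of that behaviour, using a hypothetical `Item` class (not part of the assignment) in place of `TreeNode`:

```python
import heapq


class Item:
    """Hypothetical stand-in for TreeNode: ordered by `weight` only."""

    def __init__(self, name, weight):
        self.name = name
        self.weight = weight

    def __lt__(self, other):
        return self.weight < other.weight


heap = []
for name, weight in [("a", 45), ("b", 13), ("c", 12)]:
    heapq.heappush(heap, Item(name, weight))

# heappop always returns the smallest remaining item according to __lt__
pop_order = [heapq.heappop(heap).name for _ in range(3)]
print(pop_order)  # ['c', 'b', 'a']

# `x > y` works without __gt__: Python falls back to the reflected y.__lt__(x)
print(Item("a", 45) > Item("b", 13))  # True
```

The last line also suggests why the test above can evaluate `node1 > node2` even though only `__lt__` is defined.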


## Task 2. Count character frequencies in a string
**Points: 2**

Implement a function that returns the frequency of every character in the input string, including whitespace and punctuation.

The return type should be a `dict` or a `collections.Counter` object.

In [12]:
def create_frequency_dict(text):
    ### START YOUR CODE ###
    frequency = {}
    for character in text:
        if character in frequency: # Update the frequency (You can use if statement)
            frequency[character] += 1 # if in dict already, increase by 1
        else:
            frequency[character] = 1 # if not in dict, set to 1
    ### END YOUR CODE ###
    return frequency

In [13]:
# Do not change the test code here
sample_text = 'No, it is a word. What matters is the connection the word implies.'
freq = create_frequency_dict(sample_text)

print(freq['a'], freq['e'], freq['i'], freq['o'])

3 5 6 5


**Expected output**:

3 5 6 5
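
Since the task also accepts a `collections.Counter` return value, the same tally can be sketched in a single call. This is an alternative illustration under that assumption, not the required solution:

```python
from collections import Counter

sample_text = 'No, it is a word. What matters is the connection the word implies.'

# Counter iterates over the string and tallies every character,
# including spaces and punctuation.
freq = Counter(sample_text)

print(freq['a'], freq['e'], freq['i'], freq['o'])  # 3 5 6 5
print(freq['z'])  # 0: a Counter returns 0 for a missing key instead of raising KeyError
```

One difference from a plain `dict` is visible in the last line: looking up a character that never occurs returns 0 rather than raising an error.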

---

## Task 3. Create tree from frequency dict
**Points: 2**

The tree is represented by a list that is managed with the `heapq` module.
- First, create a node from each (key, value) pair of the frequency dict.
- Then, use `heapq.heappush()` to insert every node into the `tree` list while maintaining the min-heap invariant, that is, `heap[k] <= heap[2*k+1]` and `heap[k] <= heap[2*k+2]`.

See the documentation for more information: <https://docs.python.org/3/library/heapq.html>


In [20]:
def create_tree(frequency):
    tree = []
    for key, val in frequency.items():
        ### START YOUR CODE ###
        node = TreeNode(key, val) # Create a node
        heapq.heappush(tree, node) # Insert the node to tree
        ### END YOUR CODE ###
    return tree

In [21]:
# Do not change the test code here
sample_text = 'No, it is a word. What matters is the connection the word implies.'
freq = create_frequency_dict(sample_text)
tree = create_tree(freq)

print(sorted(tree, key=lambda x: x.freq)[:5])

[Node(N, 1), Node(p, 1), Node(,, 1), Node(l, 1), Node(W, 1)]


**Expected output**:

[Node(N, 1), Node(p, 1), Node(,, 1), Node(l, 1), Node(W, 1)]
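
The test sorts `tree` before printing because a heap list is only partially ordered. A small sketch with plain integers (illustrative values, and a hypothetical helper `is_min_heap`) checks the invariant `heap[k] <= heap[2*k+1]` and `heap[k] <= heap[2*k+2]` directly:

```python
import heapq


def is_min_heap(items):
    """Return True if every parent is <= both of its children."""
    n = len(items)
    for k in range(n):
        for child in (2 * k + 1, 2 * k + 2):
            if child < n and items[k] > items[child]:
                return False
    return True


heap = []
for value in [45, 13, 12, 16, 9, 5]:
    heapq.heappush(heap, value)

print(is_min_heap(heap))     # True
print(heap[0])               # 5: the minimum is always at index 0
print(heap == sorted(heap))  # False: the invariant does not imply full sorting
```

Only `heap[0]` is guaranteed to be the minimum; the remaining positions satisfy the parent/child relation but need not appear in ascending order.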

---

## Task 4. Merge nodes in tree

**Points: 4**

Implement the following function, in which a `while` loop repeatedly removes the two minimum elements from the tree by calling `heapq.heappop()`, merges them into a new node, and inserts the new node back into the tree. The loop stops when only one node, the root, remains.

Note that:
- The new node does not represent a character, so set its `char` attribute to `None`.
- Remember to specify the left and right child nodes correctly for the new node.
- The function changes `tree` in place, so you don't need to return it.

In [30]:
def merge_nodes(tree):
    ### START YOUR CODE ###
    while len(tree) > 1: # Specify the loop condition (you can use while or for)
        node1 = heapq.heappop(tree)
        node2 = heapq.heappop(tree)

        merged = TreeNode(None, node1.freq + node2.freq) # Create a new node by merging the two popped nodes
        # Remember to specify the left and right children nodes for merged
        merged.left = node1
        merged.right = node2

        heapq.heappush(tree, merged) # Insert the new node to the tree
    ### END YOUR CODE ###

In [31]:
# Do not change the test code here
sample_text = 'No, it is a word. What matters is the connection the word implies.'
freq = create_frequency_dict(sample_text)
tree = create_tree(freq)
print('Before merge, len(tree) = ', len(tree))

print()
merge_nodes(tree)
print('After merge, len(tree) = ', len(tree))

Before merge, len(tree) =  20

After merge, len(tree) =  1


**Expected output**:

Before merge, len(tree) =  20\
After merge, len(tree) =  1
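
To see the order in which the greedy loop combines nodes, the sketch below runs the same pop-pop-push cycle on bare integers instead of `TreeNode` objects. The frequencies are the six values used in the second test example of Task 5, and `merged_sums` is a name introduced here for illustration:

```python
import heapq

# Frequencies of the six-character example: a=45, b=13, c=12, d=16, e=9, f=5
heap = [45, 13, 12, 16, 9, 5]
heapq.heapify(heap)  # rearrange the list into a min-heap in place

merged_sums = []
while len(heap) > 1:
    first = heapq.heappop(heap)   # smallest remaining frequency
    second = heapq.heappop(heap)  # second smallest
    merged_sums.append(first + second)
    heapq.heappush(heap, first + second)

print(merged_sums)       # [14, 25, 30, 55, 100]
print(sum(merged_sums))  # 224
```

Six leaves need five merges. The sum of the merged frequencies (224) equals the total number of bits used when each character is written with its Huffman code, since every character is counted once for each internal node above it.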

---

## Task 5. Get codes from the tree
**Points: 5**

Obtain the Huffman code for each character stored in the leaf nodes of the merged tree. The codes are returned in a dict object `codes`, whose keys (`str`) and values (`str`) are the characters and their codes, respectively.

`make_codes_helper()` is a recursive function that takes a tree node, `codes`, and `current_code` as inputs. `current_code` is a `str` that records the code accumulated on the path from the root to the current node (which can be an internal node). The function must be called recursively on the left and right child nodes. For the left child call, extend `current_code` by appending a "0", because that is what the left branch means; for the right child call, append a "1".
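
Because characters sit only at leaf nodes, no code is a prefix of another, so a bit string can be decoded by reading bits until they match a code. The sketch below assumes the code table from the Example 2 output shown later in this task; `encode` and `decode` are hypothetical helpers, not part of the assignment:

```python
# Code table copied from the Example 2 output (assumed here for illustration)
codes = {'a': '0', 'c': '100', 'b': '101', 'd': '111', 'f': '1100', 'e': '1101'}


def encode(text, codes):
    """Concatenate the code of every character in text."""
    return ''.join(codes[ch] for ch in text)


def decode(bits, codes):
    """Read bits left to right, emitting a character whenever a code matches."""
    reverse = {code: ch for ch, code in codes.items()}
    decoded = []
    current = ''
    for bit in bits:
        current += bit
        if current in reverse:
            decoded.append(reverse[current])
            current = ''
    if current:
        raise ValueError('trailing bits do not form a complete code')
    return ''.join(decoded)


bits = encode('face', codes)
print(bits)                 # 110001001101
print(decode(bits, codes))  # face
```

The greedy match in `decode` is safe only because the table is prefix-free; with an arbitrary table the first match could be the wrong one.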

In [36]:
def make_codes(tree):
    codes = {} # key (str) and value (str) are the character and code
    ### START YOUR CODE ###
    root = heapq.heappop(tree) # Get the root node
    # current_code is a string, so initially should be empty
    current_code = "" # Initialize the current code
    make_codes_helper(root, codes, current_code) # initial call on the root node
    ### END YOUR CODE ###
    return codes

def make_codes_helper(node, codes, current_code):
    if node is None:
        ### START YOUR CODE ###
        return # What should you return if the node is empty?
        ### END YOUR CODE ###
    if node.char is not None:
        ### START YOUR CODE ###
        # For leaf node, copy the current code to the correct position in codes
        codes[node.char] = current_code
        ### END YOUR CODE ###

    ### START YOUR CODE ###
    # Make a recursive call to the left child node, with the updated current code
    make_codes_helper(node.left, codes, current_code + '0')
    # Make a recursive call to the right child node, with the updated current code
    make_codes_helper(node.right, codes, current_code + '1')
    ### END YOUR CODE ###

def print_codes(codes):
    codes_sorted = sorted(codes.items(), key=lambda x: len(x[1]))
    for k, v in codes_sorted:
        print(f'"{k}" -> {v}')

In [37]:
# Do not change the test code here
sample_text = 'No, it is a word. What matters is the connection the word implies.'
freq = create_frequency_dict(sample_text)
tree = create_tree(freq)
merge_nodes(tree)
codes = make_codes(tree)
print('Example 1:')
print_codes(codes)

print()
freq2 = {'a': 45, 'b': 13, 'c': 12, 'd': 16, 'e': 9, 'f': 5}
tree2 = create_tree(freq2)
merge_nodes(tree2)
code2 = make_codes(tree2)
print('Example 2:')
print_codes(code2)

Example 1:
"i" -> 001
"t" -> 010
" " -> 111
"h" -> 0000
"n" -> 0001
"s" -> 0111
"e" -> 1011
"o" -> 1100
"l" -> 01100
"m" -> 01101
"w" -> 10000
"c" -> 10001
"d" -> 10010
"." -> 10100
"r" -> 11010
"a" -> 11011
"N" -> 100110
"," -> 100111
"W" -> 101010
"p" -> 101011

Example 2:
"a" -> 0
"c" -> 100
"b" -> 101
"d" -> 111
"f" -> 1100
"e" -> 1101


**Expected output**:

Example 1:\
"i" -> 001\
"t" -> 010\
" " -> 111\
"h" -> 0000\
"n" -> 0001\
"s" -> 0111\
"e" -> 1011\
"o" -> 1100\
"l" -> 01100\
"m" -> 01101\
"w" -> 10000\
"c" -> 10001\
"d" -> 10010\
"." -> 10100\
"r" -> 11010\
"a" -> 11011\
"N" -> 100110\
"," -> 100111\
"W" -> 101010\
"p" -> 101011

Example 2:\
"a" -> 0\
"c" -> 100\
"b" -> 101\
"d" -> 111\
"f" -> 1100\
"e" -> 1101