## Task 3 - Overview - Data Compression

In general, a data compression algorithm reduces the amount of memory (bits) required to represent a message (data). The compressed data, in turn, helps to reduce the transmission time from a sender to receiver. The sender encodes the data, and the receiver decodes the encoded data. As part of this problem, you have to implement the logic for both encoding and decoding.

A data compression algorithm can be either *lossy* or *lossless*, meaning that compressing the data either loses some information (lossy) or loses none (lossless). **Huffman Coding** is a lossless data compression algorithm. Let us understand its two phases, encoding and decoding, with the help of an example.

### A. Huffman Encoding

Assume that we have a string message `AAAAAAABBBCCCCCCCDDEEEEEE` comprising 25 characters to be encoded. The characters of the message do not need to be sorted. Encoding has two phases: building the Huffman tree (a binary tree), and generating the encoded data. The following steps illustrate the Huffman encoding:
#### Phase I - Build the Huffman Tree

A Huffman tree is built in a bottom-up approach.

1. First, determine the frequency of each character in the message. In our example, the following table presents the frequency of each character.

   | (Unique) Character | Frequency |
   |---|---|
   | A | 7 |
   | B | 3 |
   | C | 7 |
   | D | 2 |
   | E | 6 |

2. Each row in the table above can be represented as a node having a character, frequency, left child, and right child. In the next step, we will repeatedly need to pop the node with the lowest frequency. Therefore, build a list of nodes sorted from lowest to highest frequency. Remember that a list preserves the order in which its elements are appended.

   We need our list to work as a priority queue, where a node with a lower frequency has a higher priority to be popped. The following snapshot will help you visualize the example considered above:

   Can you come up with other data structures to create a priority queue? How about using a min-heap instead of a list? You are free to choose either one.

3. Pop the two nodes with the minimum frequency from the priority queue created in the above step.

4. Create a new node with a frequency equal to the sum of the frequencies of the two nodes picked in the above step. This new node becomes an internal node in the Huffman tree, and the two nodes become its children. The lower-frequency node becomes the left child, and the higher-frequency node becomes the right child. Reinsert the newly created node into the priority queue.

   Do you think that this reinsertion requires sorting the priority queue again? If yes, then a min-heap could be a better choice: inserting into a min-heap takes O(log n) time, whereas re-sorting a list after every insertion can take up to O(n log n).

5. Repeat steps #3 and #4 until there is a single element left in the priority queue. The snapshots below present the building of a Huffman tree.

6. For each node in the Huffman tree, assign a bit 0 for the left child and a 1 for the right child. See the final Huffman tree for our example:
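The steps of Phase I can be sketched with a min-heap as follows. This is a minimal illustration, not the notebook's own implementation: the function names `build_huffman_tree` and `assign_codes`, the nested-tuple node layout, and the running counter used to break frequency ties are all assumptions made for this example. It also assumes a non-empty message.

```python
import heapq
from collections import Counter
from itertools import count


def build_huffman_tree(message):
    """Return the root of a Huffman tree built from a non-empty message.

    Assumed layout: a leaf is (freq, char); an internal node is (freq, left, right).
    """
    tie_breaker = count()  # unique increasing integers, so nodes themselves are never compared
    heap = [(freq, next(tie_breaker), (freq, char))
            for char, freq in Counter(message).items()]
    heapq.heapify(heap)

    while len(heap) > 1:
        # Steps 3 and 4: pop the two lowest-frequency nodes and merge them.
        freq_1, _, left = heapq.heappop(heap)
        freq_2, _, right = heapq.heappop(heap)
        merged = (freq_1 + freq_2, left, right)
        heapq.heappush(heap, (merged[0], next(tie_breaker), merged))

    return heap[0][2]


def assign_codes(node, prefix="", codes=None):
    """Step 6: walk the tree, appending 0 for a left move and 1 for a right move."""
    if codes is None:
        codes = {}
    if len(node) == 2:                 # leaf: (freq, char)
        codes[node[1]] = prefix or "0"  # a one-symbol message still gets a 1-bit code
    else:                              # internal node: (freq, left, right)
        assign_codes(node[1], prefix + "0", codes)
        assign_codes(node[2], prefix + "1", codes)
    return codes


root = build_huffman_tree("AAAAAAABBBCCCCCCCDDEEEEEE")
codes = assign_codes(root)
print(root[0])  # total frequency stored at the root
print(codes)
```

The counter matters because `A` and `C` both have frequency 7: without it, two heap entries with equal frequency would fall back to comparing the node objects. A different tie-breaking rule can produce different, equally valid codes.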

#### Phase II - Generate the Encoded Data

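Based on the Huffman tree, generate a unique binary code for each character of our string message. To do so, traverse the path from the root to each leaf node, recording a 0 for every move to a left child and a 1 for every move to a right child.

| (Unique) Character | Frequency | Huffman Code |
|---|---|---|
| D | 2 | 000 |
| B | 3 | 001 |
| E | 6 | 01 |
| A | 7 | 10 |
| C | 7 | 11 |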

**Points to Notice**

- Notice that the whole code for any character is not a prefix of any other code. Hence, the Huffman code is called a prefix code.
- Notice that the binary code is shorter for the more frequent characters, and vice versa.
- The Huffman code is generated in such a way that the entire string message requires much less memory in binary form.
- Notice that each node present in the original priority queue has become a leaf node in the final Huffman tree.

This way, our encoded data would be 1010101010101000100100111111111111111000000010101010101
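The prefix property and the size of the encoded data can be checked with a few lines of arithmetic. The sketch below copies the codes and frequencies from the table above; the variable names are illustrative, and the comparison assumes 8 bits per character for the unencoded message.

```python
from itertools import permutations

# Codes and frequencies copied from the table above.
codes = {"D": "000", "B": "001", "E": "01", "A": "10", "C": "11"}
freqs = {"D": 2, "B": 3, "E": 6, "A": 7, "C": 7}

# Prefix property: no code is the beginning of a different code.
is_prefix_free = not any(a.startswith(b)
                         for a, b in permutations(codes.values(), 2))

# Encoded length: sum over characters of (frequency x code length).
encoded_bits = sum(freqs[ch] * len(codes[ch]) for ch in codes)

# Assumption: the plain message uses 8 bits per character.
plain_bits = 8 * sum(freqs.values())

print(is_prefix_free, encoded_bits, plain_bits)
```

For this example the encoded message needs 2·3 + 3·3 + 6·2 + 7·2 + 7·2 = 55 bits, compared with 25·8 = 200 bits under the 8-bit assumption.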

### B. Huffman Decoding

Once we have the encoded data, and the (pointer to the root of) Huffman tree, we can easily decode the encoded data using the following steps:

1. Declare a blank decoded string.
2. Pick a bit from the encoded data, traversing from left to right.
3. Start traversing the Huffman tree from the root.
   - If the current bit of the encoded data is 0, move to the left child; if it is 1, move to the right child.
   - If a leaf node is encountered, append the (alphabetical) character of the leaf node to the decoded string and return to the root for the next character.
4. Repeat steps #2 and #3 until the encoded data is completely traversed.
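The decoding steps above can be sketched as a single loop. This is an illustration under stated assumptions: the tree is written by hand as nested pairs that match the example codes (a leaf is a character, an internal node is a `(left, right)` pair), and `huffman_decode` is a hypothetical name, not the template's `huffman_decoding`.

```python
# Hand-built tree for the example codes D=000, B=001, E=01, A=10, C=11.
tree = ((("D", "B"), "E"), ("A", "C"))


def huffman_decode(bits, root):
    decoded = []                 # step 1: start with an empty result
    node = root
    for bit in bits:             # step 2: read the bits from left to right
        # step 3: 0 moves to the left child, 1 moves to the right child
        node = node[0] if bit == "0" else node[1]
        if isinstance(node, str):    # a leaf was reached
            decoded.append(node)
            node = root              # restart from the root for the next character
    return "".join(decoded)


print(huffman_decode("10" "001" "11" "000" "01", tree))
```

Because no code is a prefix of another, the loop never needs a separator between characters: reaching a leaf is the only signal that a character is complete.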

You will have to implement the logic for both encoding and decoding in the following template. Also, 

In [1]:
import sys

# count the frequency of each character

# create a node per character and sort the nodes from lowest to highest frequency

# loop - pop the 2 nodes with the lowest frequency
#    -> merge them into an internal node (sum of frequencies) with the 2 nodes as children

def huffman_encoding(data):
    pass

def huffman_decoding(data,tree):
    pass

if __name__ == "__main__":
    codes = {}

    a_great_sentence = "The bird is the word"

    print ("The size of the data is: {}\n".format(sys.getsizeof(a_great_sentence)))
    print ("The content of the data is: {}\n".format(a_great_sentence))

    encoded_data, tree = huffman_encoding(a_great_sentence)

    print ("The size of the encoded data is: {}\n".format(sys.getsizeof(int(encoded_data, base=2))))
    print ("The content of the encoded data is: {}\n".format(encoded_data))

    decoded_data = huffman_decoding(encoded_data, tree)

    print ("The size of the decoded data is: {}\n".format(sys.getsizeof(decoded_data)))
    print ("The content of the decoded data is: {}\n".format(decoded_data))


The size of the data is: 69

The content of the data is: The bird is the word



TypeError: cannot unpack non-iterable NoneType object
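The `TypeError` above arises because the template's `huffman_encoding` still returns `None`. Separately, the template measures the encoded size with `sys.getsizeof(int(encoded_data, base=2))`. The sketch below shows, with a made-up bit string, why the bit string is converted to an integer before measuring. The exact byte counts reported by `sys.getsizeof` are CPython implementation details and may differ between versions and platforms.

```python
import sys

bits = "000" + "10" * 3                     # made-up bit string that starts with zeros

as_text = sys.getsizeof(bits)               # a str stores one character per bit
as_int = sys.getsizeof(int(bits, base=2))   # an int packs the bits together

print(as_text, as_int)

# Caveat: the integer form drops leading zeros, so it is only useful for
# measuring size; decoding still needs the original bit string.
print(format(int(bits, base=2), "b"))
```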

In [149]:
# this code makes the tree that we'll traverse

class Node(object):
        
    def __init__(self,char = None, freq = None, left = None, right = None, code = None):
        self.char = char
        self.freq = freq
        self.left = left
        self.right = right
        self.code = code
        
    def set_char(self,char):
        self.char = char
        
    def get_char(self):
        return self.char
        
    def set_left_child(self,left):
        self.left = left
        
    def set_right_child(self, right):
        self.right = right
        
    def get_left_child(self):
        return self.left
    
    def get_right_child(self):
        return self.right

    def has_left_child(self):
        return self.left is not None
    
    def has_right_child(self):
        return self.right is not None
    
    # define __repr__ to decide what a print statement displays for a Node object
    def __repr__(self):
        return f"Node(char: {self.char} freq: {self.freq} code: {self.code} \n left: {self.left} \n right: {self.right})"
    
    # __str__ reuses __repr__ so print() and the notebook echo show the same text
    def __str__(self):
        return repr(self)
    
    
class Tree():
    def __init__(self, value=None):
        self.root = Node(value)
        
    def get_root(self):
        return self.root

In [150]:
tree = Tree("5")

In [151]:
tree.get_root()

Node(char: 5 freq: None code: None 
 left: None 
 right: None)

### count frequencies

In [152]:
string = "AAAAAAABBBCCCCCCCDDEEEEEE"

In [153]:
def count_char(string):
    
    char_freq = {}
    
    for char in string: 
        if char in char_freq: 
            char_freq[char] += 1
        else: 
            char_freq[char] = 1
            
    return char_freq

char_freq = count_char(string)
# Show Output
print ("Per char frequency in '{}' is :\n {}".format(string, str(char_freq)))

Per char frequency in 'AAAAAAABBBCCCCCCCDDEEEEEE' is :
 {'A': 7, 'B': 3, 'C': 7, 'D': 2, 'E': 6}
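As an alternative to the hand-written loop, the standard library's `collections.Counter` produces the same counts. This is only a sketch of that option; the variable name `counts` is illustrative.

```python
from collections import Counter

counts = Counter("AAAAAAABBBCCCCCCCDDEEEEEE")

# Counter is a dict subclass, so it can be used wherever char_freq is used.
print(dict(counts))

# most_common() lists (char, count) pairs from highest to lowest count,
# so the last entry is the least frequent character.
print(counts.most_common()[-1])
```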


### priority queue

In [6]:
student = []
student.append((5, 'Nick'))
student.append((1, 'Rohan'))
student.append((3, 'Jack'))
student.sort()

In [7]:
student

[(1, 'Rohan'), (3, 'Jack'), (5, 'Nick')]

In [8]:
# from https://www.pythonpool.com/python-priority-queue/
import heapq

pqueue = []

# heappush adds an element while keeping the smallest one at index 0
heapq.heappush(pqueue, (3, 'A'))
heapq.heappush(pqueue, (1, 'B'))
heapq.heappush(pqueue, (2, 'C'))

# heappop removes and returns the smallest element
while pqueue:
    next_item = heapq.heappop(pqueue)
    print(next_item)

(1, 'B')
(2, 'C')
(3, 'A')


In [9]:
pqueue = []

for key, value in char_freq.items():
    node = Node(char=key, freq=value)
    print(node)
    heapq.heappush(pqueue, (node.freq, node))

Node(char: A 
 freq: 7 
 left: None 
 right: None)
Node(char: B 
 freq: 3 
 left: None 
 right: None)
Node(char: C 
 freq: 7 
 left: None 
 right: None)
Node(char: D 
 freq: 2 
 left: None 
 right: None)
Node(char: E 
 freq: 6 
 left: None 
 right: None)


In [None]:
# -------------------------------------- #

In [34]:
freq, char = pqueue.pop()  # list.pop() removes the last element, which in a heap is not necessarily the smallest
print(freq, char)

7 Node(char: C 
 freq: 7 
 left: None 
 right: None)


In [30]:
char

'D'

In [31]:
heapq.heappush(pqueue, (4, 'Z'))

In [34]:
##

In [10]:
# pop the 2 entries with the lowest frequency from pqueue
freq_1, node_1 = heapq.heappop(pqueue)
freq_2, node_2 = heapq.heappop(pqueue)

In [11]:
node_1

Node(char: D 
 freq: 2 
 left: None 
 right: None)

In [12]:
merged_node = Node(freq = freq_1 + freq_2, left = node_1, right = node_2)

In [13]:
merged_node

Node(char: None 
 freq: 5 
 left: Node(char: D 
 freq: 2 
 left: None 
 right: None) 
 right: Node(char: B 
 freq: 3 
 left: None 
 right: None))

In [16]:
merged_node.freq

5

In [17]:
# add merged node back to pqueue
heapq.heappush(pqueue, (merged_node.freq, merged_node))

In [18]:
# pop the 2 entries with the lowest frequency from pqueue
x1 = heapq.heappop(pqueue)
x2 = heapq.heappop(pqueue)

In [19]:
x1

(5,
 Node(char: None 
  freq: 5 
  left: Node(char: D 
  freq: 2 
  left: None 
  right: None) 
  right: Node(char: B 
  freq: 3 
  left: None 
  right: None)))

In [20]:
x2

(5,
 Node(char: None 
  freq: 5 
  left: Node(char: D 
  freq: 2 
  left: None 
  right: None) 
  right: Node(char: B 
  freq: 3 
  left: None 
  right: None)))

In [11]:
def merge_nodes(pqueue):
    # pop the 2 lowest-frequency entries from the priority queue (needs at least 2 entries)
    freq_1, node_1 = heapq.heappop(pqueue)
    freq_2, node_2 = heapq.heappop(pqueue)
    # create internal merged node with the sum of frequencies
    merged_node = Node(freq = freq_1 + freq_2, left = node_1, right = node_2)
    # add merged node back to pqueue
    heapq.heappush(pqueue, (merged_node.freq, merged_node))
    return pqueue

In [15]:
pqueue = merge_nodes(pqueue)
pqueue

IndexError: index out of range
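An `IndexError` like the one above is what `heapq.heappop` raises on an empty list, which can happen when `merge_nodes` is called with fewer than two entries left. A second hazard with `(freq, node)` tuples is that two entries with the same frequency make Python compare the `Node` objects, which fails unless the class defines an ordering. The sketch below shows one possible way around both. `HeapNode` is a hypothetical stand-in for the notebook's `Node`, and ordering nodes by frequency alone is an assumption of this example.

```python
import heapq


class HeapNode:
    """Hypothetical stand-in for Node that heapq can order directly."""

    def __init__(self, char=None, freq=0, left=None, right=None):
        self.char = char
        self.freq = freq
        self.left = left
        self.right = right

    def __lt__(self, other):
        # heapq only needs "<"; order nodes by frequency alone.
        return self.freq < other.freq


def merge_nodes(pqueue):
    """Merge the two lowest-frequency nodes; do nothing if fewer than two remain."""
    if len(pqueue) < 2:
        return pqueue
    node_1 = heapq.heappop(pqueue)
    node_2 = heapq.heappop(pqueue)
    merged = HeapNode(freq=node_1.freq + node_2.freq, left=node_1, right=node_2)
    heapq.heappush(pqueue, merged)
    return pqueue


pqueue = [HeapNode(char, freq)
          for char, freq in {"A": 7, "B": 3, "C": 7, "D": 2, "E": 6}.items()]
heapq.heapify(pqueue)

while len(pqueue) > 1:
    merge_nodes(pqueue)

root = pqueue[0]
print(root.freq)
```

With the ordering defined on the class, the heap can hold the nodes themselves instead of `(freq, node)` tuples, and the length check makes an extra call to `merge_nodes` harmless.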

In [None]:
# ------------------------------------------------- #

### create tree

In [191]:
pqueue = []

for key, value in char_freq.items():
    node = Node(char=key, freq=value)
    pqueue.append((node.freq, node))
pqueue = sorted(pqueue, key=lambda x: x[0], reverse=True)

In [192]:
def merge_nodes(pqueue):
    # pqueue is sorted from highest to lowest frequency, so list.pop() returns the lowest-frequency entry
    freq_1, node_1 = pqueue.pop()
    freq_2, node_2 = pqueue.pop()
    # create internal merged node with the sum of frequencies
    merged_node = Node(freq = freq_1 + freq_2, left = node_1, right = node_2)
    # add merged node back to pqueue
    pqueue.append((merged_node.freq, merged_node))
    pqueue = sorted(pqueue, key=lambda x: x[0], reverse=True)
    return pqueue

In [193]:
while len(pqueue) > 1:
    pqueue = merge_nodes(pqueue)

In [194]:
pqueue

[(25,
  Node(char: None freq: 25 code: None 
   left: Node(char: None freq: 11 code: None 
   left: Node(char: None freq: 5 code: None 
   left: Node(char: D freq: 2 code: None 
   left: None 
   right: None) 
   right: Node(char: B freq: 3 code: None 
   left: None 
   right: None)) 
   right: Node(char: E freq: 6 code: None 
   left: None 
   right: None)) 
   right: Node(char: None freq: 14 code: None 
   left: Node(char: C freq: 7 code: None 
   left: None 
   right: None) 
   right: Node(char: A freq: 7 code: None 
   left: None 
   right: None))))]

In [195]:
# create 0|1 code

In [196]:
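def add_code(node):
    # Recursively label each child with its parent's code plus '0' (left) or '1' (right).
    # Note: this also overwrites each child's freq with its branch bit (0 or 1),
    # so the original frequencies are lost after this call.
    if node.left is None and node.right is None:
        return node
    if node.left is not None:
        if node.code is None:
            node.left.code = '0'
        else:
            node.left.code = node.code + '0'
        node.left.freq = 0
        node.left = add_code(node.left)
    if node.right is not None:
        if node.code is None:
            node.right.code = '1'
        else:
            node.right.code = node.code + '1'
        node.right.freq = 1
        node.right = add_code(node.right)
    return node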

In [197]:
root = pqueue[0][1]

In [198]:
add_code(root)

Node(char: None freq: 25 code: None 
 left: Node(char: None freq: 0 code: 0 
 left: Node(char: None freq: 0 code: 00 
 left: Node(char: D freq: 0 code: 000 
 left: None 
 right: None) 
 right: Node(char: B freq: 1 code: 001 
 left: None 
 right: None)) 
 right: Node(char: E freq: 1 code: 01 
 left: None 
 right: None)) 
 right: Node(char: None freq: 1 code: 1 
 left: Node(char: C freq: 0 code: 10 
 left: None 
 right: None) 
 right: Node(char: A freq: 1 code: 11 
 left: None 
 right: None)))

In [199]:
root

Node(char: None freq: 25 code: None 
 left: Node(char: None freq: 0 code: 0 
 left: Node(char: None freq: 0 code: 00 
 left: Node(char: D freq: 0 code: 000 
 left: None 
 right: None) 
 right: Node(char: B freq: 1 code: 001 
 left: None 
 right: None)) 
 right: Node(char: E freq: 1 code: 01 
 left: None 
 right: None)) 
 right: Node(char: None freq: 1 code: 1 
 left: Node(char: C freq: 0 code: 10 
 left: None 
 right: None) 
 right: Node(char: A freq: 1 code: 11 
 left: None 
 right: None)))

In [200]:
def create_dict(root):
    code_dict = dict()
    

In [201]:
def retrieve_code(node, dic):
    # Collect {char: code} from every leaf; always return the dictionary.
    if node.left is None and node.right is None:
        dic[node.char] = node.code
        return dic
    if node.left is not None:
        retrieve_code(node.left, dic)
    if node.right is not None:
        retrieve_code(node.right, dic)
    return dic

In [202]:
code_dict = dict()
code_dict = retrieve_code(root, code_dict)

In [203]:
code_dict

{'D': '000', 'B': '001', 'E': '01', 'C': '10', 'A': '11'}

In [204]:
string

'AAAAAAABBBCCCCCCCDDEEEEEE'

In [205]:
encoded = "".join(code_dict[c] for c in string)

In [206]:
encoded

'1111111111111100100100110101010101010000000010101010101'

In [207]:
# decoding

In [None]:
def decode(string):
    # unfinished stub; the working decoding loop is in a later cell
    raise NotImplementedError

In [210]:
len(encoded)

55

In [227]:
decoded = ""
current_node = root

# loop until the encoded message ends
for bit in encoded:
    # if the bit is 0 move to the LEFT child node, otherwise (bit is 1) move to the RIGHT child node
    if bit == '0':
        current_node = current_node.left
    else:
        current_node = current_node.right

    # if the current node is a leaf, append its character and restart from the root
    if current_node.char is not None:
        decoded += current_node.char
        current_node = root

In [228]:
decoded

'AAAAAAABBBCCCCCCCDDEEEEEE'