## Exercise 20.10 2021/ACJC/P2/Q2 H2 Computing (Modified)
A file compression algorithm reduces file sizes so that files can be sent more quickly. One such algorithm is the Huffman algorithm for text files, which will be implemented in this task.

Unlike ASCII, which assigns a fixed size of 8 bits for each character, the Huffman algorithm assigns fewer bits to more common characters and more bits to less common characters. For example, in a long text written in English, characters such as `e` and `t` will have fewer bits assigned to them than characters such as `q` and `z`. If the text is long enough, this will use fewer bits in total to encode the text compared to ASCII.

To know which sequence of bits to encode for each character, the **frequency** of each character, which is the number of times each character appears in the text file, is tabulated.

The characters are put into a tree. A node is created for each character. The following steps are then repeated until there is only one node without a parent:

1. Identify the two nodes, without parents, which have the lowest frequency.
2. Create a new node whose left and right children are the two nodes identified in Step 1. The frequency of the new node is the total of the frequency of its children.
3. Add the new node to the collection of nodes without parents.

The diagram under Task 60 below shows how the tree is created for a file with only five distinct characters (`A`, `E`, `I`, `O` and `U`), in five stages.

The bit sequence assigned to a character will be the path from the root to the node corresponding to that character, where going left corresponds to `0` and going right corresponds to `1`. For example, `A` is encoded as `10` and `O` is encoded as `011`.
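As a rough check of the example codes, the steps above can be sketched on the five-character example. This is a minimal sketch, not the required solution: the frequencies are the ones in this exercise's frequency table (`A` 15, `E` 16, `I` 12, `O` 9, `U` 5), the name `build_codes` and the tuple-based tree are illustrative assumptions, and the lower-frequency node of each pair is assumed to become the left child.

```python
def build_codes(freqs):
    """Build a Huffman tree from {char: frequency} and return {char: bit string}."""
    # Each entry is (frequency, tree); a tree is a character (leaf)
    # or a (left, right) tuple (internal node).
    nodes = sorted(((f, ch) for ch, f in freqs.items()), key=lambda pair: pair[0])

    while len(nodes) >= 2:
        f1, t1 = nodes.pop(0)            # lowest frequency: left child (assumption)
        f2, t2 = nodes.pop(0)            # second lowest: right child
        nodes.append((f1 + f2, (t1, t2)))
        nodes.sort(key=lambda pair: pair[0])

    codes = {}

    def walk(tree, path):
        if isinstance(tree, str):        # leaf: record the path from the root
            codes[tree] = path
        else:
            walk(tree[0], path + "0")    # going left appends 0
            walk(tree[1], path + "1")    # going right appends 1

    walk(nodes[0][1], "")
    return codes


example_freqs = {"A": 15, "E": 16, "I": 12, "O": 9, "U": 5}
example_codes = build_codes(example_freqs)
total_bits = sum(example_freqs[ch] * len(example_codes[ch]) for ch in example_freqs)
print(example_codes)
print(total_bits)
```

With these frequencies the sketch gives `A` as `10` and `O` as `011`, matching the example above, and the encoded text needs 128 bits instead of the 8 × 57 = 456 bits used by 8-bit ASCII.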

### Task 59
Create a `Node` class that has the following attributes:
- `data`, which is set when the node is initialised
- `left`, a pointer to another node
- `right`, a pointer to another node

When the node is initialised, `left` and `right` do not point to anything.
The class also has setter methods for `left` and `right`, and getter methods for all three attributes.<div style="text-align: right">[3]</div>

In [None]:
# Task 59 : Jia Le

class Node:
    def __init__(self,data) -> None:
        self.__data = data
        self.__left = None
        self.__right = None
    def setleft(self,left):
        self.__left = left
    def getleft(self):
        return self.__left
    def setright(self,right):
        self.__right = right
    def getright(self):
        return self.__right
    def getdata(self):
        return self.__data
    def __repr__(self) -> str:
        return f"{self.__data}"

### Task 60

Write code that takes an `input.txt` file and creates a dictionary whose keys are the characters in the file, including spaces, punctuation and line breaks (`\n`), and the value of a key is its frequency in the file. Uppercase and lowercase letters should be considered as different characters.

Create a node for each character in the file, and put the nodes into a list in ascending order of frequency.	<div style="text-align: right">[11]</div>
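One possible shortcut for the counting step is `collections.Counter` from the standard library. The sketch below is only an illustration under stated assumptions: `sample_text` is a made-up string standing in for the contents of `input.txt`, and the nodes themselves are left out so that the block runs on its own.

```python
from collections import Counter

# Hypothetical sample standing in for the contents of input.txt
sample_text = "To be, or not to be\n"

# Counter counts every character, including spaces, punctuation and "\n";
# uppercase and lowercase letters are counted separately.
frequencies = dict(Counter(sample_text))

# (character, frequency) pairs in ascending order of frequency
ordered = sorted(frequencies.items(), key=lambda pair: pair[1])
print(ordered)
```

Each pair in `ordered` could then be wrapped in a node, which gives the list of nodes in ascending order of frequency that the task asks for.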

Diagram showing how the tree is created based on the frequency of each character:

<center>

|   Character | Frequency  |
|-|-|
|`A`|15|
|`E`|16|
|`I`|12|
|`O`|9 |
|`U`|5 |

</center>

<br>

<center>
<img src="exercise20-pic1.png" width="600" align="center"/>
</center>

In [None]:
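## Task 60 : Zi Zhuo
def frequency_finder(file_name):
    """Return a dictionary mapping each character in the file to its frequency."""
    char_dict = {}
    with open(file_name, "r") as f:
        for line in f:
            # Iterating over a line yields single characters,
            # including spaces, punctuation and "\n"
            for char in line:
                if char in char_dict:
                    char_dict[char] += 1
                else:
                    char_dict[char] = 1
    return char_dict

characters = frequency_finder("HAMLET.txt")

# One node per character; the node data is a (character, frequency) tuple
char_list = []
for key, value in characters.items():
    new_node = Node((key, value))
    char_list.append(new_node)

# The task asks for the nodes in ascending order of frequency
char_list.sort(key=lambda node: node.getdata()[1])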

In [None]:
characters

### Task 61
Create a tree using the algorithm described above.	<div style="text-align: right">[5]</div>

In [None]:
# Task 61: create node_list of Nodes
node_list = []
for k in characters.keys():
    # Node.getdata() returns a tuple: tuple[0] holds the letters, tuple[1] the frequency count
    node_list.append(Node((k, characters[k])))

# Sort the nodes in ascending order of frequency
node_list.sort(key=lambda x: x.getdata()[1])

def insert_sort(node_list, node):
    for i in range(len(node_list)):
        if node.getdata()[1] <= node_list[i].getdata()[1]:
            node_list.insert(i, node)
            break
    else:
        node_list.append(node)

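# Repeatedly merge the two lowest-frequency nodes until only the root remains
while len(node_list) >= 2:
    child_1 = node_list.pop(0)
    child_2 = node_list.pop(0)
    # Parent data: (all characters below this node, combined frequency)
    parent = Node((child_1.getdata()[0] + child_2.getdata()[0],
                   child_1.getdata()[1] + child_2.getdata()[1]))
    parent.setleft(child_1)
    parent.setright(child_2)
    insert_sort(node_list, parent)
# node_list[0] is now the root of the binary tree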

### Task 62
Create a dictionary whose keys are the characters, and the value of a key is the bit sequence of that character, expressed as a string of `0`s and `1`s.

Carry out Tasks 60 to 62 on the file `HAMLET.txt`. Compress the file by replacing each character with its bit sequence and writing the output to a new file, `HAMLET_compressed.txt`. <div style="text-align: right">[8]</div>

In [None]:
# Task 62: Create the code cipher dictionary by traversing the tree to reach each leaf node
code_cipher={}
for letter in characters.keys():
    cur = node_list[0]
    letter_code=""
    while cur.getdata()[0] != letter: # Traverse tree to get the Huffman code for the letter
        if cur.getleft() and letter in cur.getleft().getdata()[0]:
            #left
            letter_code += "0"
            cur = cur.getleft()
        elif cur.getright() and letter in cur.getright().getdata()[0]:
            # right
            letter_code +="1"
            cur = cur.getright()
        else:
            ## run time error when tree is not created correctly
            raise Exception("Tree is incorrect")

    code_cipher[letter] = letter_code


In [None]:
code_cipher

In [None]:
# Task 62: each "0" or "1" is written as a one-byte text character, so this will actually expand the file size
with open("HAMLET.txt", "r") as f:
    letters_str = f.read()
with open("HAMLET_compressed.txt", "w", encoding="UTF-8") as f:
    for letter in letters_str:
        f.write(code_cipher[letter])

___
Question 2: Why is the compressed file bigger than the original file?

In [17]:
!dir hamlet*

 Volume in drive G is Google Drive
 Volume Serial Number is 1983-1116

 Directory of g:\My Drive\Classroom\2022_cz2a\Classwork\11_DS2

01/03/2023  05:15 PM           182,335 HAMLET.txt
03/03/2023  12:57 PM           852,667 HAMLET_compressed.txt
03/03/2023  01:09 PM                 0 HAMLET_decompressed.txt
03/03/2023  01:12 PM           106,584 HAMLET.bin
               4 File(s)      1,141,586 bytes
               0 Dir(s)  715,054,075,904 bytes free
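
The sizes in the listing above can be checked with a little arithmetic. This is a rough sketch: it assumes that every `0` or `1` character in `HAMLET_compressed.txt` occupies exactly one byte on disk, which holds for these characters in UTF-8, and the variable names are illustrative only.

```python
import math

original_bytes = 182_335     # HAMLET.txt, from the listing
text_file_bytes = 852_667    # HAMLET_compressed.txt, from the listing

# Each "0" or "1" character carries one bit of information but takes up
# one whole byte in the text file, so the byte count equals the bit count.
huffman_bits = text_file_bytes

# Packing 8 bits into each byte gives the size a true binary file would need.
packed_bytes = math.ceil(huffman_bits / 8)
print(packed_bytes)  # → 106584

# Ratio of the packed size to the original size
print(round(packed_bytes / original_bytes, 2))
```

The packed size of 106,584 bytes agrees with the size of `HAMLET.bin` in the listing, and it is smaller than the original 182,335 bytes.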


___
Question 3 : Decode the compressed file

In [None]:
f = open("HAMLET_compressed.txt","r")
reverse_dict =  dict(zip(code_cipher.values(), code_cipher.keys()))
code =""
for c in f.read():
    code += c
    if code in reverse_dict:
        print( reverse_dict[code], end="")
        code = ""
f.close()

In [None]:
reverse_dict
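
Reading bits one at a time and emitting a character as soon as the accumulated bits match a code works because no code is a prefix of another code. The sketch below illustrates this on the five-character example. The helper names `decode_bits` and `is_prefix_free` are hypothetical. The codes for `A` and `O` come from the exercise text, while those for `E`, `I` and `U` are worked out from the frequency table on the assumption that the lower-frequency node becomes the left child.

```python
def is_prefix_free(code_table):
    """Return True if no code is a proper prefix of another code."""
    codes = list(code_table.values())
    return not any(a != b and b.startswith(a) for a in codes for b in codes)


def decode_bits(bits, code_table):
    """Decode a string of 0s and 1s using a {char: code} table."""
    reverse = {code: ch for ch, code in code_table.items()}
    decoded = []
    code = ""
    for bit in bits:
        code += bit
        if code in reverse:          # a complete code has been read
            decoded.append(reverse[code])
            code = ""
    if code:
        raise ValueError("leftover bits do not form a complete code")
    return "".join(decoded)


example_codes = {"A": "10", "E": "11", "I": "00", "O": "011", "U": "010"}
print(is_prefix_free(example_codes))
print(decode_bits("10011", example_codes))   # "10" then "011"
```

A table such as `{"A": "1", "B": "10"}` is not prefix-free, and decoding bit by bit with it would never produce `B`.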

___
Encode/Decode into binary file

In [16]:
## cipher is already built
cipher_dict = code_cipher

orig_file = "HAMLET.txt"
comp_file = "HAMLET.bin"
decp_file = "HAMLET_decompressed.txt"

import binary_encoding
f1 = open(orig_file,"r")
f2 = open(comp_file, "wb") 
f2.write(binary_encoding.bin_encode(f1.read(),cipher_dict))
f1.close()
f2.close()


In [None]:
## decoding back into bit string
import binary_encoding
f = open(comp_file,"rb")
bits_str = binary_encoding.bin_decode(f.read())
f.close()

In [None]:
## decoding bytes into a bit string
## updated to deal with the last-byte issue

f = open(comp_file,"rb")
bytes_arr = list(f.read())
f.close()
# every byte except the last two holds a full 8 bits of data
bits_str = "".join(f"{b:08b}" for b in bytes_arr[:-2])


## last-byte issue
pad_bits = bytes_arr[-1]                 # final byte: number of padding bits (int)
remaining_bits = f"{bytes_arr[-2]:08b}"  # second-last byte: last data bits, padded at the front

bits_str += remaining_bits[pad_bits:]
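
The `binary_encoding` module itself is not shown in this notebook. The sketch below is a guess at how such a pair of functions might work, inferred only from the decoding cell above: it assumes the last byte of the file stores the number of padding bits and the second-last byte stores the final data bits, padded with zeros at the front. The names `bin_encode_sketch` and `bin_decode_sketch` are hypothetical stand-ins, not the module's actual functions.

```python
def bin_encode_sketch(text, cipher_dict):
    """Pack the Huffman bit string for text into bytes (assumed layout)."""
    bits = "".join(cipher_dict[ch] for ch in text)

    # Split into full 8-bit chunks (head) and a final chunk of 1 to 8 bits (tail)
    split = (len(bits) - 1) // 8 * 8 if bits else 0
    head, tail = bits[:split], bits[split:]
    pad_bits = 8 - len(tail)

    out = bytearray(int(head[i:i + 8], 2) for i in range(0, len(head), 8))
    out.append(int(tail, 2) if tail else 0)   # int() pads with zeros at the front
    out.append(pad_bits)                      # last byte: how many bits to discard
    return bytes(out)


def bin_decode_sketch(data):
    """Recover the bit string from bytes produced by bin_encode_sketch."""
    *body, last, pad_bits = data              # iterating over bytes yields ints
    bits = "".join(f"{b:08b}" for b in body)
    return bits + f"{last:08b}"[pad_bits:]


example_cipher = {"A": "10", "O": "011"}      # codes from the five-character example
packed = bin_encode_sketch("AO", example_cipher)
print(list(packed))
print(bin_decode_sketch(packed))
```

Storing the padding count in its own byte costs one extra byte per file, but it lets the decoder discard the padding without knowing the length of the original text.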


In [None]:
f = open(decp_file,"w")
reverse_dict = dict(zip(cipher_dict.values(),cipher_dict.keys()))
code = ""
for char in bits_str:
    code+=char
    if code in reverse_dict:
        f.write(reverse_dict[code])
        code = ""
f.close()