----
### Exercise 2: 2024 A-Level Paper 1 Q5

[Question Paper](2024_A_P1Q5.pdf) 

## Exercise 1 2021/ACJC/P2/Q2 H2 Computing (Modified)
A file compression algorithm reduces file sizes so that files can be sent more quickly. One such algorithm is the Huffman algorithm for text files, which will be implemented in this task.

Unlike ASCII, which assigns a fixed size of 8 bits for each character, the Huffman algorithm assigns fewer bits to more common characters and more bits to less common characters. For example, in a long text written in English, characters such as `e` and `t` will have fewer bits assigned to them than characters such as `q` and `z`. If the text is long enough, this will use fewer bits in total to encode the text compared to ASCII.

To decide which bit sequence to assign to each character, the **frequency** of each character, which is the number of times that character appears in the text file, is tabulated.

The characters are put into a tree. A node is created for each character. The following steps are then repeated until there is only one node without a parent:

1. Identify the two nodes, without parents, which have the lowest frequency.
2. Create a new node whose left and right children are the two nodes identified in Step 1. The frequency of the new node is the total of the frequency of its children.
3. Add the new node to the pool of nodes without parents, so that it can be selected in later repetitions.
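
The three steps above can be sketched with Python's `heapq` module acting as the pool of parentless nodes. This is only an illustrative sketch, not the required solution: the function name `build_tree`, the nested-tuple tree representation and the tie-breaking counter are all assumptions made for this example, and it assumes the frequency table is not empty.

```python
import heapq
import itertools

def build_tree(freqs):
    """Return (total_frequency, tree); a tree is a character or a (left, right) pair."""
    tie = itertools.count()  # tie-breaker so heapq never has to compare two trees
    heap = [(f, next(tie), ch) for ch, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:                       # repeat until one parentless node remains
        f1, _, left = heapq.heappop(heap)      # Step 1: lowest frequency
        f2, _, right = heapq.heappop(heap)     # Step 1: second lowest frequency
        parent = (f1 + f2, next(tie), (left, right))  # Step 2: new parent node
        heapq.heappush(heap, parent)           # Step 3: return it to the pool
    total, _, tree = heap[0]
    return total, tree

demo_total, demo_tree = build_tree({"A": 15, "E": 16, "I": 12, "O": 9, "U": 5})
print(demo_total)  # 57
print(demo_tree)   # (('I', ('U', 'O')), ('A', 'E'))
```

A heap makes each repetition cost O(log n) instead of rescanning the whole pool, although for a handful of distinct characters a sorted list works just as well.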

The diagram below shows the process of creation of a tree for a file with only five distinct characters (`A`, `E`, `I`, `O` and `U`), in five stages.

The bit sequence assigned to a character will be the path from the root to the node corresponding to that character, where going left corresponds to `0` and going right corresponds to `1`. For example, `A` is encoded as `10` and `O` is encoded as `011`.
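
To see the saving in numbers, the short sketch below tallies the bits needed for the five-character example. The codes for `A` and `O` are the ones quoted above. The codes for `E`, `I` and `U` are assumptions, obtained by applying the three steps by hand to the frequency table for this example. The names `example_freq` and `example_codes` are illustrative only.

```python
# Frequencies from the example table; codes for E, I and U are assumed (see lead-in).
example_freq = {"A": 15, "E": 16, "I": 12, "O": 9, "U": 5}
example_codes = {"A": "10", "E": "11", "I": "00", "O": "011", "U": "010"}

# Each character costs (frequency x code length) bits under Huffman coding.
huffman_bits = sum(example_freq[ch] * len(example_codes[ch]) for ch in example_freq)
# Fixed-width encoding costs 8 bits for every character.
ascii_bits = 8 * sum(example_freq.values())

print(huffman_bits, ascii_bits)  # 128 456
```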


In [None]:
sentence = "bobby blames blue balloons"

print(f"size = {len(sentence)*8} bits")
count = {}
for c in sentence:
    if c in count:
        count[c]+=1
    else:
        count[c] = 1
print(sorted(count.items(), key = lambda x: x[1],reverse=True))

Diagram showing how the tree is created based on the frequency of each character:

<center>

|   Character | Frequency  |
|-|-|
|`A`|15|
|`E`|16|
|`I`|12|
|`O`|9 |
|`U`|5 |

</center>

<br>

<center>
<img src="exercise20-pic1.png" width="600" align="center"/>
</center>

### Task 1.1
Create a `Node` class that has the following attributes:
- `data`, which is determined when the node is initialised
- `left`, a pointer to another node
- `right`, a pointer to another node

When the node is initialised, `left` and `right` do not point to anything.
The class also has setter methods for `left` and `right`, and getter methods for all three attributes.<div style="text-align: right">[3]</div>

In [None]:
# Task 1.1
class Node:
    def __init__(self, data=None):
        self.data = data
        self.left = None
        self.right = None

    def get_data(self):
        return self.data
    def get_left(self):
        return self.left
    def get_right(self):
        return self.right

    def set_left(self, left):
        self.left = left
    def set_right(self, right):
        self.right = right
        
    def __repr__(self):
        return f"<{self.data}>"


### Task 1.2

Write code that takes an input .txt file and creates a dictionary whose keys are the characters in the file, including spaces, punctuation and line breaks (`\n`), and the value of a key is its frequency in the file. Uppercase and lowercase letters should be considered as different characters.

Create a node for each character in the file, and put the nodes into a list in ascending order of frequency.	<div style="text-align: right">[11]</div>

##### Planning

dict -> list of Nodes -> sorted list of Nodes

- Node data must contain character and frequency
- To get the list into ascending order of frequency, either add everything to the list and sort once, or insert each node at its sorted position
- Use a small test case before using `HAMLET.txt`
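
One way to try the "add everything, then sort" option on a small test case is sketched below. It uses `collections.Counter` from the standard library, which is an alternative to counting by hand. The function name `sorted_frequencies` and the test string are assumptions for this example, and wrapping each pair in a `Node` is left out.

```python
from collections import Counter

def sorted_frequencies(text):
    """Return (char, freq) pairs in ascending order of frequency."""
    counts = Counter(text)  # maps each character to the number of times it occurs
    # sorted() is stable, so characters with equal frequency keep first-seen order
    return sorted(counts.items(), key=lambda pair: pair[1])

pairs = sorted_frequencies("AAEEEIO")
print(pairs)  # [('I', 1), ('O', 1), ('A', 2), ('E', 3)]
```

Sorting once is simplest for the initial list. Inserting in order is still useful later, when new parent nodes have to be placed back among the remaining nodes.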


In [None]:
## Task 1.2 : 
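def create_dict(input_file):
    """Return a dictionary mapping each character in the file to its frequency."""
    char_count = {}
    with open(input_file) as f:  # the with block closes the file automatically
        for char in f.read():
            char_count[char] = char_count.get(char, 0) + 1
    return char_count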

# bin = {}
# for char in open("HAMLET.TXT").read():
#     if char in bin:
#         bin[char] +=1 
#     else:
#         bin[char] = 1


def insert_in_order(node_list, node):
    for i in range(len(node_list)): # insertion sort
        if node.data[1] <= node_list[i].data[1]:
            node_list.insert(i, node)
            break
    else:
        node_list.append(node)


node_list = [] ## list of Node, what data is in the Node?->(chars, freq), sorted in ascending order of freq
char_dict = create_dict("HAMLET.txt")
## Use this for testing first
# char_dict = {
#     "A": 15,
#     "E": 16,
#     "I": 12,
#     "O": 9,
#     "U": 5
# }
for key in char_dict:
    node = Node((key, char_dict[key]))  # data is a (char, freq) tuple
    insert_in_order(node_list, node)


### Task 1.3
Create a tree using the algorithm described above.	<div style="text-align: right">[5]</div>

##### Planning
- Initial condition : ?
- Terminating condition: ?
- what must the new node have ?
- where to insert the new node ?

In [None]:
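# 1.3
while len(node_list) > 1:
    # the list is sorted, so the two lowest-frequency nodes are at the front
    node_1 = node_list.pop(0)
    node_2 = node_list.pop(0)
    # the parent stores the combined characters and the total frequency
    root = Node(
        (
            node_1.data[0] + node_2.data[0],
            node_1.data[1] + node_2.data[1]
        )
    )
    root.set_left(node_1)
    root.set_right(node_2)
    insert_in_order(node_list, root)
tree = node_list[0]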

In [None]:
# from TreeUtils2 import print_tree
# print_tree( tree)

### Task 1.4
Create a dictionary whose keys are the characters, and the value of a key is the bit sequence of that character, expressed as a string of `0`s and `1`s.

Carry out Tasks 1.1 to 1.4 on the file `HAMLET.txt`. Compress the file by replacing each character with its bit sequence and writing the output to a new file, `HAMLET_compressed.txt`. <div style="text-align: right">[8]</div>

##### Planning
- Part 1 :
    - create the encoding dictionary
- Part 2 :
    - use the dictionary to create the encoding file
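
For Part 1, an alternative to searching the tree once per character is a single recursive traversal that records the path to every leaf. The sketch below uses nested `(left, right)` tuples as a stand-in for the `Node` class so that it runs on its own. The names `assign_codes` and `demo_tree` are hypothetical, and giving a lone character the code `0` is a design choice of this sketch.

```python
def assign_codes(tree, prefix="", table=None):
    """Walk a tree of nested (left, right) pairs; leaves are single characters."""
    if table is None:
        table = {}
    if isinstance(tree, str):
        # leaf: the path taken so far is this character's bit sequence
        table[tree] = prefix or "0"  # a one-character tree still gets one bit
    else:
        left, right = tree
        assign_codes(left, prefix + "0", table)   # going left appends 0
        assign_codes(right, prefix + "1", table)  # going right appends 1
    return table

demo_tree = (("I", ("U", "O")), ("A", "E"))
demo_codes = assign_codes(demo_tree)
print(demo_codes)  # {'I': '00', 'U': '010', 'O': '011', 'A': '10', 'E': '11'}
```

The same idea carries over to `Node` objects by testing whether a node has no children instead of checking for a string.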

In [None]:
# Task 1.4: Create the encoding dictionary by traversing the tree to reach a leaf node
char_encoder = {}
for char in char_dict:
    value = ""
    cur = tree
    # each internal node stores every character beneath it, so a membership
    # test on the left child decides which way to descend
    while char != cur.get_data()[0]:
        if char in cur.get_left().data[0]:
            value += "0"
            cur = cur.get_left()
        else:
            value += "1"
            cur = cur.get_right()
    char_encoder[char] = value



In [None]:
char_encoder

In [None]:
with open("HAMLET.txt") as src, open("HAMLET_compressed.txt", "w") as dst:
    for char in src.read():
        dst.write(char_encoder[char])


In [None]:
## bit level encoding
## file output must be in units of bytes

from bitstring import BitArray  # third-party package
with open("HAMLET.txt", "r") as f:
    raw = f.read()

bits_list = []
for letter in raw:
    bits_list.append(char_encoder[letter])
bits_str = "".join(bits_list)

leading = 8 - (len(bits_str) % 8)  # number of leading 0s to pad (8 if already aligned)
enc_int = leading.to_bytes(1, byteorder="little")

bits_str = "0" * leading + bits_str  ## pad leading 0s to reach a byte boundary
raw_bits = BitArray(bin=bits_str)
bytes_output = raw_bits.tobytes()
with open("encoded.dat", "wb") as f:
    f.write(enc_int)  ## 8-bit integer preamble: number of leading bits to strip off
    f.write(bytes_output)


### 1.5 Decoding
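
Before the bit string can be decoded, a byte-level file such as `encoded.dat` has to be turned back into a string of `0`s and `1`s. The sketch below assumes the layout used by the bit-level cell above: one byte holding the number of leading pad bits, followed by the padded payload. It uses only the standard library rather than `bitstring`, and the helper names `pack_bits` and `unpack_bits` are hypothetical.

```python
def pack_bits(bits_str):
    """Prefix a pad-count byte, then pack the zero-padded bit string into bytes."""
    pad = (8 - len(bits_str) % 8) % 8          # 0 when already byte-aligned
    padded = "0" * pad + bits_str
    payload = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))
    return bytes([pad]) + payload

def unpack_bits(data):
    """Inverse of pack_bits: drop the pad-count byte and the leading pad bits."""
    pad = data[0]
    bits = "".join(f"{byte:08b}" for byte in data[1:])
    return bits[pad:]

packed = pack_bits("1011001")
print(packed)               # b'\x01Y'
print(unpack_bits(packed))  # 1011001
```

`unpack_bits` also accepts a pad count of 8, so it should read files whose preamble was written with the `8 - (len % 8)` formula.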

In [None]:
## Using a reverse dictionary
char_decoder = {}
for key, value in char_encoder.items():
    char_decoder[value] = key

bit_str = ""
for bit in open("HAMLET_compressed.txt").read():
    bit_str += bit
    if bit_str in char_decoder:
        print(char_decoder[bit_str], end="")
        bit_str=""


In [None]:
## By tree traversal
with open("HAMLET_compressed.txt") as f:
    bits = f.read()

cur = tree
decoded = []
for bit in bits:  # iterate instead of list.pop(0), which is slow on long inputs
    cur = cur.get_left() if bit == "0" else cur.get_right()
    if cur.get_left() is None and cur.get_right() is None:
        # reached a leaf: emit its character and restart from the root
        decoded.append(cur.get_data()[0])
        cur = tree

with open("decoded.txt", "w") as f:
    f.write("".join(decoded))