<a href="https://colab.research.google.com/github/walkerjian/DailyCode/blob/main/HuffmanTree.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Huffman coding is a method of encoding characters based on their frequency. Each letter is assigned a variable-length binary string, such as 0101 or 111110, where shorter lengths correspond to more common letters. To accomplish this, a binary tree is built such that the path from the root to any leaf uniquely maps to a character. When traversing the path, descending to a left child corresponds to a 0 in the prefix, while descending right corresponds to 1.

Here is an example tree (note that only the leaf nodes have letters):
````
        *
      /   \
    *       *
   / \     / \
  *   a   t   *
 /             \
c               s
````
With this encoding, cats would be represented as 0000110111.
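
As a quick check of the example, the sketch below rebuilds the diagrammed tree by hand and derives each code from its root-to-leaf path. The `Node` class and the `build_codes` and `decode` helpers are hypothetical names introduced only for this illustration; they are not part of the implementations that follow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical tree node: leaves carry a character, internal nodes do not."""
    char: Optional[str] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None


# The example tree from the diagram, built by hand.
example_root = Node(
    left=Node(left=Node(left=Node(char="c")), right=Node(char="a")),
    right=Node(left=Node(char="t"), right=Node(right=Node(char="s"))),
)


def build_codes(node, prefix="", codes=None):
    """Collect the root-to-leaf path of every character (left = 0, right = 1)."""
    if codes is None:
        codes = {}
    if node is None:
        return codes
    if node.char is not None:
        codes[node.char] = prefix
        return codes
    build_codes(node.left, prefix + "0", codes)
    build_codes(node.right, prefix + "1", codes)
    return codes


def decode(bits, root):
    """Walk the tree bit by bit, emitting a character each time a leaf is reached."""
    out, node = [], root
    for bit in bits:
        node = node.left if bit == "0" else node.right
        if node.char is not None:
            out.append(node.char)
            node = root  # restart at the root for the next character
    return "".join(out)


codes = build_codes(example_root)
encoded = "".join(codes[ch] for ch in "cats")
print(codes)                          # {'c': '000', 'a': '01', 't': '10', 's': '111'}
print(encoded)                        # 0000110111
print(decode(encoded, example_root))  # cats
```

Because every character sits at a leaf, no code is a prefix of another, which is why the decoder can split the bit string without separators.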

Given a dictionary of character frequencies, build a Huffman tree, and use it to determine a mapping between characters and their encoded binary strings.

We will break the problem down into a few distinct steps:

1. **Model**:
    - **CharacterFrequency**: A simple data structure that holds a character and its frequency.
    - **HuffmanNode**: Represents a node in the Huffman tree. It can either be a leaf node (with a character) or an internal node (with left and right children).
    - **HuffmanTree**: Given a dictionary of character frequencies, this class will build the Huffman tree and provide a method to encode a given string.
  
2. **View**:
    - Display the Huffman encoding for a given string.
  
3. **Controller**:
    - Handles user input, utilizes the model to process data, and updates the view accordingly.
  
4. **Test Harness**:
    - A function to test the Huffman encoding using various test cases.

Let's begin by defining the Model components.

### Model

#### CharacterFrequency

We'll start by defining the `CharacterFrequency` class. This class will represent a character and its frequency. It will be used to sort characters based on frequency when building the Huffman tree.

#### HuffmanNode

The `HuffmanNode` class will represent a node in the Huffman tree. This node can either hold a character (for leaf nodes) or have left and right children (for internal nodes).

#### HuffmanTree

The `HuffmanTree` class will be responsible for building the Huffman tree given a dictionary of character frequencies. It will also provide a method to encode a string using the Huffman encoding.

Let's start by defining these classes.

Run against the sample character frequencies, the `HuffmanTree` class produces a valid Huffman encoding for the word "cats".

Next, we'll define the **View** and **Controller** components.

### View

For the purpose of this exercise, the **View** will be simple. It will take care of displaying the Huffman encoding for a given string.

### Controller

The **Controller** will handle user input, utilize the model to process data, and update the view accordingly.

In [1]:
from queue import PriorityQueue

# Model

class HuffmanNode:
    """
    A node in the Huffman encoding tree.
    """
    def __init__(self, char, freq):
        self.char = char
        self.freq = freq
        self.left = None
        self.right = None

    def __lt__(self, other):
        """
        Allows the nodes to be sorted based on their frequency.
        """
        return self.freq < other.freq

class HuffmanTree:
    """
    Represents the Huffman encoding tree.
    """
    def __init__(self, char_freqs):
        self.root = self.build_tree(char_freqs)

    def build_tree(self, char_freqs):
        """
        Builds the Huffman encoding tree given a dictionary of character frequencies.
        """
        pq = PriorityQueue()
        for char, freq in char_freqs.items():
            pq.put(HuffmanNode(char, freq))

        while pq.qsize() > 1:
            left = pq.get()
            right = pq.get()
            merged = HuffmanNode(None, left.freq + right.freq)
            merged.left = left
            merged.right = right
            pq.put(merged)

        return pq.get()

    def encode(self, s):
        """
        Encodes a string using the Huffman tree.
        """
        def encode_char(c, node, code=''):
            if not node:
                return ""
            if node.char == c:
                return code
            return encode_char(c, node.left, code + '0') or encode_char(c, node.right, code + '1')

        return ''.join(encode_char(c, self.root) for c in s)


# Controller

class HuffmanController:
    """
    Controls the flow of Huffman encoding.
    """
    def __init__(self, char_freqs):
        self.tree = HuffmanTree(char_freqs)
        self.view = HuffmanView()

    def encode(self, s):
        """
        Encodes a string and displays the results using the view.
        """
        encoded_string = self.tree.encode(s)
        self.view.display_encoded_string(s, encoded_string)


# View

class HuffmanView:
    """
    Represents the view for Huffman encoding.
    """
    def display_encoded_string(self, original, encoded):
        """
        Displays the original and encoded strings.
        """
        print(f"Original: {original}")
        print(f"Encoded: {encoded}")


def test_huffman_encoding():
    """
    Tests the Huffman encoding.
    """
    # Test 1: Example provided
    char_freqs = {'c': 1, 'a': 1, 't': 1, 's': 1}
    controller = HuffmanController(char_freqs)
    controller.encode("cats")

    # Other tests
    test_cases = [
        ({"a": 5, "b": 9, "c": 12, "d": 13, "e": 16, "f": 45}, "abcdef"),
        ({"a": 5, "b": 9, "c": 12, "d": 13, "e": 16, "f": 45}, "fedcba"),
        ({"a": 3, "b": 2, "c": 6, "d": 8, "e": 2, "f": 6}, "abcdef"),
        ({"a": 3, "b": 2, "c": 6, "d": 8, "e": 2, "f": 6}, "fedcba"),
        ({"a": 5, "b": 1, "c": 2, "d": 4, "e": 5, "f": 9}, "abcdef"),
        ({"a": 5, "b": 1, "c": 2, "d": 4, "e": 5, "f": 9}, "fedcba"),
        ({"a": 7, "b": 2, "c": 6, "d": 6, "e": 5}, "abcde"),
        ({"a": 7, "b": 2, "c": 6, "d": 6, "e": 5}, "edcba"),
        ({"a": 3, "b": 3, "c": 3}, "abc"),
        ({"a": 3, "b": 3, "c": 3}, "cba")
    ]

    for char_freqs, s in test_cases:
        controller = HuffmanController(char_freqs)
        controller.encode(s)

test_huffman_encoding()


Original: cats
Encoded: 00100111
Original: abcdef
Encoded: 110011011001011110
Original: fedcba
Encoded: 011110110011011100
Original: abcdef
Encoded: 10010100011101101
Original: fedcba
Encoded: 01101111001010100
Original: abcdef
Encoded: 01100010011010011
Original: fedcba
Encoded: 11001011001100001
Original: abcde
Encoded: 111000001101
Original: edcba
Encoded: 101010010011
Original: abc
Encoded: 10110
Original: cba
Encoded: 01110


### Another Implementation

In [2]:
class CharacterFrequency:
    """
    This class represents a character and its frequency. It will be used
    to sort characters based on frequency when building the Huffman tree.
    """
    def __init__(self, char, freq):
        self.char = char
        self.freq = freq

    def __lt__(self, other):
        return self.freq < other.freq

    def __eq__(self, other):
        return self.freq == other.freq

    def __repr__(self):
        return f"CharacterFrequency(char={self.char}, freq={self.freq})"


class HuffmanNode:
    """
    This class represents a node in the Huffman tree. A node can either
    hold a character (for leaf nodes) or have left and right children
    (for internal nodes).
    """
    def __init__(self, char_freq=None, left=None, right=None):
        self.char_freq = char_freq
        self.left = left
        self.right = right

    def is_leaf(self):
        return self.left is None and self.right is None

    def __lt__(self, other):
        return self.char_freq < other.char_freq

    def __eq__(self, other):
        return self.char_freq == other.char_freq

    def __repr__(self):
        return f"HuffmanNode(char_freq={self.char_freq}, left={self.left}, right={self.right})"


class HuffmanTree:
    """
    This class is responsible for building the Huffman tree given a
    dictionary of character frequencies. It also provides a method
    to encode a string using the Huffman encoding.
    """
    def __init__(self, char_freqs):
        self.root = self.build_tree(char_freqs)
        self.char_to_code = {}
        self._generate_char_to_code_map(self.root, "")

    def build_tree(self, char_freqs):
        # Convert the dictionary into a list of HuffmanNode
        nodes = [HuffmanNode(CharacterFrequency(char, freq)) for char, freq in char_freqs.items()]
        nodes.sort()

        while len(nodes) > 1:
            # Pop two nodes with the lowest frequency
            left = nodes.pop(0)
            right = nodes.pop(0)

            # Create a new internal node with these two nodes as children
            merged_freq = CharacterFrequency(None, left.char_freq.freq + right.char_freq.freq)
            new_node = HuffmanNode(merged_freq, left, right)

            # Insert the new node back into the list
            nodes.append(new_node)
            nodes.sort()

        return nodes[0]

    def _generate_char_to_code_map(self, node, current_code):
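        if node is None:
            return
        if node.is_leaf():
            # A single-symbol alphabet makes the root a leaf; fall back to "0"
            # so that symbol does not receive an empty code.
            self.char_to_code[node.char_freq.char] = current_code or "0"
            return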
        self._generate_char_to_code_map(node.left, current_code + "0")
        self._generate_char_to_code_map(node.right, current_code + "1")

    def encode(self, text):
        return ''.join([self.char_to_code[char] for char in text])


# Let's test the HuffmanTree with the given example
sample_char_freqs = {
    'c': 1,
    'a': 1,
    't': 1,
    's': 1
}
huffman_tree = HuffmanTree(sample_char_freqs)
encoded_text = huffman_tree.encode("cats")
encoded_text


'00011011'

In [3]:
class HuffmanView:
    """
    This class represents the view component in the MVC architecture.
    It is responsible for displaying the Huffman encoding for a given string.
    """
    @staticmethod
    def display_encoded_text(text, encoded_text):
        print(f"Original Text: {text}")
        print(f"Encoded Text: {encoded_text}")


class HuffmanController:
    """
    This class represents the controller component in the MVC architecture.
    It handles user input, utilizes the model to process data, and updates
    the view accordingly.
    """
    def __init__(self, char_freqs):
        self.huffman_tree = HuffmanTree(char_freqs)

    def encode_text(self, text):
        encoded_text = self.huffman_tree.encode(text)
        HuffmanView.display_encoded_text(text, encoded_text)


# Let's test the controller with the sample character frequencies and the word "cats"
controller = HuffmanController(sample_char_freqs)
controller.encode_text("cats")


Original Text: cats
Encoded Text: 00011011


In [5]:
class HuffmanController:
    """
    This class represents the controller component in the MVC architecture.
    It handles user input, utilizes the model to process data, and updates
    the view accordingly.
    """
    def __init__(self, char_freqs):
        self.huffman_tree = HuffmanTree(char_freqs)

    def encode_text(self, text):
        # Check if all characters in text are present in char_freqs
        missing_chars = set(text) - set(self.huffman_tree.char_to_code.keys())
        if missing_chars:
            print(f"Error: Missing character frequencies for: {', '.join(missing_chars)}")
            return

        encoded_text = self.huffman_tree.encode(text)
        HuffmanView.display_encoded_text(text, encoded_text)

# Let's run the test harness again
test_huffman_encoding()


--------------------------------------------------
Original Text: abcdef
Encoded Text: 110011011001011110
--------------------------------------------------
Original Text: abcdef
Encoded Text: 1001011101110001
--------------------------------------------------
Original Text: abcde
Encoded Text: 0001011011100
--------------------------------------------------
Original Text: aabbccddeeff
Encoded Text: 10010010110111011011111100000101
--------------------------------------------------
Original Text: hello
Encoded Text: 0001111110
--------------------------------------------------
Original Text: python
Encoded Text: 1001011101110001
--------------------------------------------------
Original Text: zxcvbn
Encoded Text: 0001011011100101
--------------------------------------------------
Error: Missing character frequencies for: u
--------------------------------------------------
Original Text: world
Encoded Text: 110111000110
--------------------------------------------------
Original Text:

### Maximally Efficient Method
Huffman tree construction can be made more efficient with a priority queue. Instead of keeping a list and re-sorting it after every merge (as done in the previous implementation), we can use a binary heap, which removes the smallest-frequency node in $O(\log n)$ time and accepts a newly merged node in $O(\log n)$ time.

I'll use the `heapq` module from Python's standard library for this purpose. The heap ensures that the nodes with the lowest frequencies are processed first, making the construction of the Huffman tree more efficient.


In [6]:
import heapq

class OptimizedHuffmanNode:
    """
    Represents a node in the Huffman tree.
    """
    def __init__(self, char, freq, left=None, right=None):
        self.char = char
        self.freq = freq
        self.left = left
        self.right = right

    def is_leaf(self):
        return self.left is None and self.right is None

    def __lt__(self, other):
        return self.freq < other.freq

    def __eq__(self, other):
        return self.freq == other.freq


def build_optimized_huffman_tree(char_freqs):
    """
    Builds the Huffman tree in an optimized manner using a priority queue.
    """
    # Convert the dictionary into a list of OptimizedHuffmanNode
    nodes = [OptimizedHuffmanNode(char, freq) for char, freq in char_freqs.items()]
    heapq.heapify(nodes)

    while len(nodes) > 1:
        # Pop two nodes with the lowest frequency from the heap
        left = heapq.heappop(nodes)
        right = heapq.heappop(nodes)

        # Create a new internal node with these two nodes as children
        merged_freq = left.freq + right.freq
        new_node = OptimizedHuffmanNode(None, merged_freq, left, right)

        # Push the new node back into the heap
        heapq.heappush(nodes, new_node)

    return nodes[0]

def generate_char_to_code_map_optimized(node, current_code, char_to_code):
    """
    Recursive helper function to generate the character to code mapping from the Huffman tree.
    """
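    if node is None:
        return
    if node.is_leaf():
        # A single-symbol alphabet makes the root a leaf; fall back to "0"
        # so that symbol does not receive an empty code.
        char_to_code[node.char] = current_code or "0"
        return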
    generate_char_to_code_map_optimized(node.left, current_code + "0", char_to_code)
    generate_char_to_code_map_optimized(node.right, current_code + "1", char_to_code)

def huffman_encode_optimized(text, char_to_code):
    """
    Encodes a given text using the Huffman encoding.
    """
    return ''.join([char_to_code[char] for char in text])


# Testing the optimized Huffman coding
def test_optimized_huffman():
    sample_char_freqs = {
        'c': 1,
        'a': 1,
        't': 1,
        's': 1
    }
    root = build_optimized_huffman_tree(sample_char_freqs)
    char_to_code = {}
    generate_char_to_code_map_optimized(root, "", char_to_code)
    encoded_text = huffman_encode_optimized("cats", char_to_code)
    return encoded_text

test_optimized_huffman()


'01110010'

In [7]:
def test_optimized_huffman_encoding():
    """
    Test function for the optimized Huffman encoding. This function will test the Huffman encoding
    using various test cases.
    """
    test_cases = [
        ({"a": 5, "b": 9, "c": 12, "d": 13, "e": 16, "f": 45}, "abcdef"),
        ({"a": 5, "b": 5, "c": 5, "d": 5, "e": 5, "f": 5}, "abcdef"),
        ({"a": 10, "b": 5, "c": 20, "d": 5, "e": 5, "f": 5}, "abcde"),
        ({"a": 5, "b": 5, "c": 5, "d": 5, "e": 5, "f": 5}, "aabbccddeeff"),
        ({"h": 1, "e": 1, "l": 2, "o": 1}, "hello"),
        ({"p": 1, "y": 1, "t": 1, "h": 1, "o": 1, "n": 1}, "python"),
        ({"z": 10, "x": 5, "c": 20, "v": 5, "b": 5, "n": 5}, "zxcvbn"),
        ({"i": 5, "l": 9, "o": 12, "v": 13, "e": 16, "y": 45}, "iloveyou"),
        ({"w": 5, "o": 5, "r": 5, "l": 5, "d": 5}, "world"),
        ({"j": 5, "a": 5, "v": 5, "s": 5, "c": 5, "r": 5, "i": 5, "p": 5, "t": 5}, "javascript")
    ]

    for char_freqs, text in test_cases:
        print("-" * 50)
        root = build_optimized_huffman_tree(char_freqs)
        char_to_code = {}
        generate_char_to_code_map_optimized(root, "", char_to_code)

        # Check if all characters in text are present in char_freqs
        missing_chars = set(text) - set(char_to_code.keys())
        if missing_chars:
            print(f"Error: Missing character frequencies for: {', '.join(missing_chars)}")
            continue

        encoded_text = huffman_encode_optimized(text, char_to_code)
        print(f"Original Text: {text}")
        print(f"Encoded Text: {encoded_text}")

# Let's run the test harness
test_optimized_huffman_encoding()


--------------------------------------------------
Original Text: abcdef
Encoded Text: 110011011001011110
--------------------------------------------------
Original Text: abcdef
Encoded Text: 1001011110100110
--------------------------------------------------
Original Text: abcde
Encoded Text: 0010011101011
--------------------------------------------------
Original Text: aabbccddeeff
Encoded Text: 10010010110111111101010000110110
--------------------------------------------------
Original Text: hello
Encoded Text: 1001111100
--------------------------------------------------
Original Text: python
Encoded Text: 1001011110100110
--------------------------------------------------
Original Text: zxcvbn
Encoded Text: 0010011101011010
--------------------------------------------------
Error: Missing character frequencies for: u
--------------------------------------------------
Original Text: world
Encoded Text: 111001100110
--------------------------------------------------
Original Text:

### What's in a heap?
The optimized version builds the tree with a binary heap (via `heapq`) instead of a repeatedly sorted list. Each merge then costs $O(\log n)$ rather than work proportional to the number of remaining nodes, so building a tree over $n$ distinct characters takes $O(n \log n)$ time overall.

The encoded text for "cats" is '01110010', which differs from the earlier encodings. This is expected: when frequencies tie, the order in which nodes are merged depends on the data structure, so a given set of character frequencies can have several valid Huffman trees and encodings, all with the same total encoded length.
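
To make the "multiple valid encodings" point concrete, here is a minimal sketch comparing the code tables implied by two of the "cats" outputs recorded in this notebook ('00100111' and '01110010'). The helper names `is_prefix_free` and `encoded_length` are assumptions made for this illustration only.

```python
def is_prefix_free(codes):
    """Return True if no code word is a prefix of another code word."""
    words = sorted(codes.values())
    # After sorting, any prefix relationship shows up between neighbours.
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))


def encoded_length(codes, freqs):
    """Total number of bits needed when each character occurs freqs[ch] times."""
    return sum(freqs[ch] * len(codes[ch]) for ch in freqs)


def encode(text, codes):
    return "".join(codes[ch] for ch in text)


freqs = {"c": 1, "a": 1, "t": 1, "s": 1}

# Code tables read off the two recorded outputs for "cats".
table_a = {"c": "00", "a": "10", "t": "01", "s": "11"}
table_b = {"c": "01", "a": "11", "t": "00", "s": "10"}

for table in (table_a, table_b):
    print(encode("cats", table), is_prefix_free(table), encoded_length(table, freqs))
# 00100111 True 8
# 01110010 True 8
```

Both tables are prefix-free and cost the same 8 bits, so neither is "more correct"; they only differ in how ties between equal frequencies were broken.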

The optimized Huffman coding has been tested using the modified test harness. As with the previous implementation, the test case for the text "iloveyou" encountered an error due to missing character frequencies for 'u'. All other test cases were processed correctly.

Although the encoded bit strings can differ between the earlier and optimized versions, both are valid encodings. The heap-based approach should perform better as the character set grows, so it is the one to prefer for larger inputs or when performance is a concern.
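
One rough way to compare the two strategies is sketched below. It tracks only the merged frequencies (not full tree nodes), and the function names `cost_with_sorted_list` and `cost_with_heap` are hypothetical. The sum of all merged frequencies equals the total encoded length, so both strategies must agree on it even when they break ties differently. Timings depend on the machine, so none are quoted here.

```python
import heapq
import random
import timeit


def cost_with_sorted_list(freqs):
    """Total encoded length, re-sorting a plain list after every merge."""
    nodes = sorted(freqs)
    total = 0
    while len(nodes) > 1:
        merged = nodes.pop(0) + nodes.pop(0)  # two smallest frequencies
        total += merged
        nodes.append(merged)
        nodes.sort()
    return total


def cost_with_heap(freqs):
    """Total encoded length, using a binary heap as the priority queue."""
    nodes = list(freqs)
    heapq.heapify(nodes)
    total = 0
    while len(nodes) > 1:
        merged = heapq.heappop(nodes) + heapq.heappop(nodes)
        total += merged
        heapq.heappush(nodes, merged)
    return total


small = [5, 9, 12, 13, 16, 45]
print(cost_with_sorted_list(small), cost_with_heap(small))  # 224 224

# Larger synthetic input (an assumption, chosen only to make timing visible).
random.seed(0)
big = [random.randint(1, 1000) for _ in range(2000)]
list_time = timeit.timeit(lambda: cost_with_sorted_list(big), number=1)
heap_time = timeit.timeit(lambda: cost_with_heap(big), number=1)
print(f"sorted list: {list_time:.4f}s, heap: {heap_time:.4f}s")
```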

### More on: A Heap ...

A **heap** (specifically, a **min-heap** in this context) is a specialized tree-based data structure that satisfies the heap property. The heap property for a min-heap dictates that for any given node $I$, the value of $I$ is less than or equal to the values of its children. This property ensures that the smallest element is always at the root of the heap.
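
The heap property can be checked directly on the array layout that `heapq` uses, where the children of index `i` live at indices `2*i + 1` and `2*i + 2`. The `is_min_heap` helper below is a hypothetical name written for this illustration.

```python
import heapq


def is_min_heap(items):
    """Check that every parent is <= each of its children in array layout."""
    n = len(items)
    return all(
        items[i] <= items[child]
        for i in range(n)
        for child in (2 * i + 1, 2 * i + 2)
        if child < n
    )


freqs = [45, 13, 16, 9, 12, 5]
print(is_min_heap(freqs))  # False: 45 sits above smaller values

heapq.heapify(freqs)       # rearranges the list in place
print(is_min_heap(freqs))  # True
print(freqs[0])            # 5, the smallest element is at the root
```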

In the context of Huffman coding, we frequently need to select the nodes with the smallest frequencies to merge them and build the Huffman tree. Here's why a heap data structure is particularly efficient for this purpose:

1. **Efficient Minimum Element Retrieval**: The root of a min-heap always contains the smallest element. This property is ideal for Huffman coding, where we repeatedly need to pick the nodes with the smallest frequencies. Reading the smallest element of a heap takes $O(1)$ time.

2. **Efficient Insertions**: After merging the two nodes with the smallest frequencies, we create a new node and insert it back into the set of nodes. Inserting an element into a heap while maintaining the heap property takes $O(\log n)$ time, where $n$ is the number of nodes.

3. **Efficient Deletions**: Removing the smallest element (i.e., the root) from a heap and restoring the heap property also takes $O(\log n)$ time.

4. **Space Efficiency**: A heap can be stored in a plain array with no per-node pointers, which keeps its memory overhead low.

5. **Comparison with Other Data Structures**:
   - **List or Array**: With an unsorted list or array, finding the minimum element takes $O(n)$ time. Appending to the end of an array is $O(1)$, but keeping the array sorted costs $O(n)$ per insertion.
   - **Balanced Binary Search Trees (e.g., AVL Tree)**: These can maintain an ordered set of nodes and achieve $O(\log n)$ for insertions, deletions, and minimum element retrieval, but the constant factors and the overhead of rebalancing make a heap the simpler and more direct choice for this specific problem.
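
The operations listed above map onto `heapq` calls as in the following sketch. It stores `(freq, order, label)` tuples rather than node objects; the `order` counter is an assumption added here so that ties in frequency are broken by insertion order instead of by comparing labels, and the concatenated label for the merged entry is purely illustrative.

```python
import heapq
from itertools import count

char_freqs = {"a": 5, "b": 9, "c": 12, "d": 13, "e": 16, "f": 45}

order = count()  # tie-breaker: earlier entries win when frequencies are equal
heap = [(freq, next(order), char) for char, freq in char_freqs.items()]
heapq.heapify(heap)            # build the heap in O(n)

smallest = heap[0]             # peek at the minimum in O(1)
first = heapq.heappop(heap)    # remove the minimum in O(log n)
second = heapq.heappop(heap)

# Merge the two smallest entries and push the result back in O(log n).
merged = (first[0] + second[0], next(order), first[2] + second[2])
heapq.heappush(heap, merged)

print(first, second, merged)   # (5, 0, 'a') (9, 1, 'b') (14, 6, 'ab')
print(heap[0])                 # (12, 2, 'c')
```

Repeating the pop, pop, push step until one entry remains is exactly the merge loop used when building the Huffman tree.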

In summary, the heap data structure, with its efficient operations for insertion, deletion, and minimum element retrieval, is particularly suited for the process of building the Huffman tree, where we repeatedly need to work with the nodes having the smallest frequencies.