# Abstract

**Huffman Encoding** is one of the most basic and elegant applications of the **Greedy Algorithm Design Paradigm**. It provides an optimal method of **lossless data compression** by assigning shorter binary codes to frequently occurring symbols and longer codes to rarely occurring ones.
This project implements Huffman Encoding and Decoding in **Python 3.14.0**, constructs frequency tables and Huffman Trees, and evaluates performance through compression ratio analysis.  
The project presents both the **proof of correctness** and **complexity analysis** of the algorithm and compares the bit cost with that of **fixed-length encoding**.
Experimental results show that Huffman encoding achieves **significant space reduction** when symbol distributions are **non-uniform**, validating its theoretical optimality in practice.


# 1. Introduction and Motivation

In the digital era, where vast quantities of data are produced every second, efficient storage and transmission have become crucial. **Data compression** seeks to represent information using fewer bits than the original form, thereby saving both storage space and communication bandwidth.

Traditionally, according to the number of unique characters, we assign a fixed-length bit string to represent each symbol, as done in encoding schemes such as ASCII. However, characters in real-world data do not occur uniformly—some appear far more frequently than others. To save space, the characters that occur more often should be assigned shorter bit strings, while those that occur less frequently can be assigned longer ones. The challenge, therefore, is to determine an optimal way of assigning these variable-length codes to minimize the overall number of bits required.

Among several compression methods, **Huffman Encoding** stands out as a classic example of a **Greedy Algorithm** that achieves **lossless compression**. It assigns variable-length binary codes to symbols such that frequently occurring symbols receive shorter codes, while rare symbols are given longer codes. This must be done carefully, however, because a variable-length code can be ambiguous to decode. Morse code illustrates the problem:
![morse code](image_1.png)

Morse code was designed so that the time needed to send a letter is roughly inversely related to how often that letter occurs in ordinary English text. This is generally a good strategy, but the code is ambiguous: a **J** (`.---`) could also be read as **AM** (`.-` followed by `--`). This is not a problem for Morse code in practice, since letters are sent with a short pause between them. For storage, however, spending extra symbols on a separator is inefficient. Hence we need a variable-length code that cannot be interpreted in more than one way. The property we need is the following.

**Prefix-free code:** a code in which no codeword is a prefix (initial segment) of any other codeword.



Morse code, in a way, follows a similar intuition by assigning simpler symbols or shorter actions to frequently occurring letters. However, it does not satisfy the **prefix-free property**; instead, it relies on fixed time intervals to mark the beginning and end of each letter.
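The ambiguity can be checked mechanically. The following is a minimal sketch and not part of the project code: `is_prefix_free` is a hypothetical helper name, and the Morse fragment uses the standard codes for A, M and J.

```python
def is_prefix_free(codewords):
    """Return True if no codeword is a prefix of a different codeword."""
    words = sorted(codewords)
    # After sorting, a codeword that is a prefix of some other codeword is
    # immediately followed by a word that starts with it, so it is enough
    # to compare neighbours.
    return not any(b.startswith(a) for a, b in zip(words, words[1:]))


morse = {"A": ".-", "M": "--", "J": ".---"}
print(morse["A"] + morse["M"] == morse["J"])   # True: "AM" and "J" collide
print(is_prefix_free(morse.values()))          # False
print(is_prefix_free(["0", "100", "101", "110", "111"]))  # True
```

The last call uses a small prefix-free code of the kind Huffman's algorithm produces, for which no separator is needed.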

**Huffman Encoding** effectively addresses this limitation by constructing a prefix-free variable-length code that minimizes the total expected code length, thereby achieving optimal lossless compression.

---

## Objectives
The goals of this project are to:
1. Implement Huffman Encoding and Decoding in Python.  
2. Construct frequency tables and Huffman Trees for textual data.  
3. Compute and analyze **compression ratios** against fixed-length encoding.  
4. Verify the **prefix-free property** and correctness of the algorithm.  
5. Analyze time and space complexities and demonstrate optimality through experiment.

---


### **Why Huffman Algorithm is a Greedy Algorithm**

The **Huffman Algorithm** is a *greedy algorithm* as it satisfies both the **greedy-choice property** and the **optimal substructure property**.  
It works by making the best decision at each step and then proceeding to the next, again choosing the best possible option at that moment — that is, it makes **locally optimal choices** at every stage.

- **Greedy-choice property:** A global optimum can be achieved by choosing a local optimum at each step.  
- **Optimal substructure:** An optimal solution to the problem contains optimal solutions to its subproblems.

Together, these ensure that Huffman’s locally optimal merges of the smallest frequencies lead to a globally optimal prefix-free code.


## **Applications**
Huffman Encoding (or, more broadly, the idea behind it) has numerous applications:
- **File Compression:** Used in ZIP, GZIP, and DEFLATE formats for efficient lossless compression.  
- **Image Compression:** Serves as the entropy coding stage in JPEG.  
- **Audio Compression:** Applied in MP3 and AAC to encode frequency coefficients efficiently.  
- **Data Transmission:** Reduces bandwidth usage in communication systems and embedded devices.  

# 2. Problem Definition and Objectives


## Problem Statement


### Formalized Description

**Input:**

1. Alphabet $A = (a_1, a_2, \dots, a_n)$, which is the symbol alphabet of size $n$. 
2. Weights $W = (w_1, w_2, \dots, w_n)$, the tuple of positive symbol weights (usually proportional to probabilities), i.e.
$$
w_i = \text{weight}(a_i), \quad i \in \{1, \dots, n\}
$$

**Output:**

Code $C(W) = (c_1, c_2, \dots, c_n)$, the tuple of binary codewords, where $c_i$ is the codeword assigned to $a_i$ for each $i \in \{1, \dots, n\}$, such that $C(W)$ has the following properties:

  1. $C(W)$ is a prefix-free code.
  2. Let $L(C(W)) = \sum_{i=1}^{n} w_i \times \text{length}(c_i)$ be the weighted path length of the code $C$. Then $L(C(W)) \leq L(T(W))$ for any other code $T$ satisfying property 1.
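To make the objective concrete, the weighted path length can be evaluated directly from its definition. This is an illustrative sketch only: the function name `weighted_path_length`, the three-symbol alphabet and its weights are assumptions chosen for the example.

```python
def weighted_path_length(weights, code):
    """L(C) = sum over symbols of weight * codeword length."""
    return sum(weights[s] * len(code[s]) for s in weights)


# Hypothetical alphabet and weights, for illustration only.
weights = {"a": 0.5, "b": 0.25, "c": 0.25}
variable = {"a": "0", "b": "10", "c": "11"}   # prefix-free, variable length
fixed = {"a": "00", "b": "01", "c": "10"}     # prefix-free, fixed length

print(weighted_path_length(weights, variable))  # 1.5
print(weighted_path_length(weights, fixed))     # 2.0
```

Both codes are prefix-free, but the variable-length one has the smaller weighted path length, which is exactly the quantity the problem asks us to minimize.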



---

## Informal Description

**Given:**  
A tuple of symbols $A=(a_i)_{i=1}^n$ and, for each symbol $a_i \in A$, the frequency $w_i$, i.e. the fraction of symbols in the text that are equal to $a_i$.

**Find:**  
A prefix-free binary code (a set of codewords) with minimum expected codeword length.

---

*Source: Adapted and summarized from the [“Huffman coding” article on Wikipedia](https://en.wikipedia.org/wiki/Huffman_coding).*


## Assumptions and Constraints

- Each input character has a positive frequency (symbols that never occur are not part of the alphabet).  
- All symbols are independent and identically distributed within the input.  
- The output encoding must be prefix-free for correctness.

---


# 3. Algorithm Design

## Algorithmic Overview

The Huffman algorithm is **greedy** in nature: at each step it makes the locally best choice without considering future consequences.  
It does so by constructing a **prefix-free binary tree**, built iteratively by combining the two least frequent nodes into a single new node and assigning the two original nodes as its left and right children.

In other words, a queue (or priority list) of nodes is maintained, from which the two nodes with the smallest frequencies are repeatedly removed, merged into a new combined node, and then reinserted into the queue. This process continues until only one node remains, which becomes the root of the Huffman tree.

The algorithm can be divided into the following stages:
1. Build a **frequency table** for all unique characters in the text.  
2. Create a **min-heap (priority queue)** of nodes sorted by frequency.  
3. Repeatedly remove the two nodes with the smallest frequencies and merge them into a new node whose frequency is their sum.  
4. Insert the new node back into the heap until only one node remains — the root of the Huffman Tree.  
5. Traverse the tree to assign binary codes: left edge adds a `0` and right edge adds a `1`.


---

## Data Structures Used
- **Priority Queue (Min-Heap):** Ensures efficient extraction of nodes with smallest frequencies, implemented via Python’s `heapq`.  
- **Binary Tree:** Represents hierarchical structure of merged nodes.  
- **Dictionary:** Stores mapping from characters to their Huffman codes.  

---

## Pseudocode

The following pseudocode describes the construction of the Huffman Tree and the generation of the Huffman Codes.

### Algorithm 1: Build Huffman Tree

Input: Set of symbols C = {c1, c2, ..., cn} with corresponding frequencies f(c)
Output: Root node of the Huffman Tree
```
Create a min-heap Q and insert all characters with their frequencies.
while size(Q) > 1 do
    x ← Extract-Min(Q)          // Node with smallest frequency
    y ← Extract-Min(Q)          // Node with second smallest frequency
    z ← New node with frequency f(z) = f(x) + f(y)
    z.left ← x
    z.right ← y
    Insert(Q, z)
end while
return Extract-Min(Q)           // The remaining node is the root of the Huffman Tree
 ```
---

### Algorithm 2: Generate Codes

Input: Root node of Huffman Tree, current code = ""
Output: Code dictionary for each symbol
```
GenerateCodes(node, code):
    if node is leaf then
        Assign code to symbol(node)
    else
        GenerateCodes(node.left, code + "0")
        GenerateCodes(node.right, code + "1")
    end if
```
---

### Explanation

- The **Build Huffman Tree** algorithm constructs the binary tree by repeatedly combining the two least frequent nodes into a new parent node, whose frequency equals their sum.  
- The **Generate Codes** algorithm traverses the final tree recursively to assign binary codes:
  - A left edge adds a `0`
  - A right edge adds a `1`
- The resulting codes are **prefix-free**, meaning no codeword is a prefix of another — ensuring unique and unambiguous decoding.


**Example: "abracadabra"**

Consider the input string `"abracadabra"`. The frequencies of the characters are:

| Character | Frequency |
|:----------:|:----------:|
| a | 5 |
| b | 2 |
| r | 2 |
| c | 1 |
| d | 1 |

The Huffman algorithm proceeds as follows:
- Merge **c(1)** and **d(1)** → new node with frequency **2**  
- Merge **b(2)** and **r(2)** → new node with frequency **4**  
- Merge node(2) [from c,d] with node(4) [from b,r] → new node with frequency **6**  
- Merge **a(5)** and node(6) → root node with frequency **11**
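The merge sequence above can be reproduced with a short trace. This is a sketch under one stated assumption: ties between equal frequencies are broken by creation order (via a counter), which is one valid choice among several and is not necessarily how the project's `Node` class breaks ties. The name `merge_trace` is hypothetical.

```python
import heapq
from collections import Counter
from itertools import count


def merge_trace(text):
    """Return the merges performed, in order, as (left, right, merged_weight)."""
    order = count()  # assumed tie-breaker: earlier-created entries pop first
    heap = [(freq, next(order), sym) for sym, freq in Counter(text).items()]
    heapq.heapify(heap)
    trace = []
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        trace.append((s1, s2, w1 + w2))
        heapq.heappush(heap, (w1 + w2, next(order), s1 + s2))
    return trace


for left, right, weight in merge_trace("abracadabra"):
    print(f"merge {left!r} and {right!r} -> {weight}")
```

With this tie-breaking rule the four merges come out in the same order as listed: `c` with `d`, `b` with `r`, the two merged nodes, and finally `a` with the rest.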

**Resulting Huffman Codes:**

| Character | Huffman Code |
|:----------:|:-------------:|
| a | 0 |
| r | 100 |
| b | 101 |
| c | 110 |
| d | 111 |

---

## **Encoded Output Length**

Encoding the string `"abracadabra"` with the codes above gives `01011000110011101011000`, which consists of **23 bits** in total.  

In contrast, the fixed-length encoding version would require:

$$
(\text{length of "abracadabra"}) \times \lceil \log_2(5) \rceil = 11 \times 3 = 33
$$

Hence, the fixed-length encoded version would be **33 bits**, while the Huffman-encoded version uses **23 bits**, giving a savings of **10 bits (~30.3% reduction)**.
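The arithmetic can be checked with a few lines of Python. This is a small sketch that hard-codes the code table from the example rather than building the tree; the variable names are illustrative.

```python
import math

# Code table copied from the example above.
codes = {"a": "0", "r": "100", "b": "101", "c": "110", "d": "111"}
text = "abracadabra"

encoded = "".join(codes[ch] for ch in text)
huffman_bits = len(encoded)
# Fixed-length coding needs ceil(log2(alphabet size)) bits per symbol.
fixed_bits = len(text) * math.ceil(math.log2(len(codes)))
saving = 100 * (fixed_bits - huffman_bits) / fixed_bits

print(encoded)                   # 01011000110011101011000
print(huffman_bits, fixed_bits)  # 23 33
print(f"{saving:.1f}% saved")    # 30.3% saved
```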








# 4. Correctness

### Key Definitions

**Definition (Prefix-Free Code):** A set of binary codewords is prefix-free if no codeword is a prefix of another.

**Definition (Expected Code Length):** For probabilities $p_i$ and code lengths $\ell_i$:
$$
L(C) = \sum_i p_i \ell_i
$$

The goal is to minimize $L(C)$.

**Definition (Full Binary Tree):** A binary tree such that each node has 2 children or is a leaf node. 

**Definition (Optimal Binary Tree):** An optimal binary tree $T$ for an alphabet $A=(a_1,\dots,a_n)$ with corresponding weights $W=(w_1,\dots,w_n)$ is a full binary tree whose leaf nodes are labelled with the elements of $A$ and for which the quantity
$$
L(T) = \sum_{a_i\text{ leaf of } T} w_i \times \text{depth}(a_i)
$$
is minimized.

---



## Proof of Correctness and Optimality

**Lemma 4.1:** At the end of each iteration of Huffman's algorithm, every element of the heap is either a labelled node or the root of a full binary tree.

**Proof:** We prove this by induction on the iteration count.

**Base Case:** Before the first iteration the heap contains only labelled nodes. The first iteration removes the two least frequent elements $L_1,L_2$ from the heap and inserts an unlabelled root whose left child is $L_1$ and whose right child is $L_2$, which is a full binary tree. All other elements are still labelled nodes.

**Induction Step:** 
Assume that the statement holds after the $(k-1)$-th iteration. Then before the $k$-th iteration every element of the heap is either a labelled node or the root of a full binary tree. We take the two least frequent elements $L_1,L_2$ and create a node $R$ whose left child is $L_1$ and whose right child is $L_2$. We only need to check that $R$ is the root of a full binary tree. The node $R$ has exactly 2 children, and by the induction hypothesis each child is either a labelled node (a leaf) or the root of a full binary tree. Hence every node below $R$ has 2 children or is a leaf, so the tree rooted at $R$ is again a full binary tree. $\blacksquare$

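**Lemma 4.2:** Any code read off a full binary tree whose leaf nodes are the symbols is a prefix-free code.

**Proof:** The codeword of a symbol is the sequence of edge labels on the path from the root to its leaf. If one codeword were a prefix of another, the first leaf would lie on the path to the second and would therefore be its ancestor, which is impossible for a leaf. $\blacksquare$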



**Lemma 4.3:** Let $T$ be a binary tree whose leaf nodes are labelled by $A$ with corresponding weights $W$, and let $x$ and $y$ be two leaves of $T$ with weights $w_x,w_y$. If $T'$ is the tree obtained by swapping $x$ and $y$, then
$$
L(T') - L(T) = (w_y - w_x) (\text{depth}(x,T)-\text{depth}(y,T))
$$
**Proof:**
$$
\begin{align*}
L(T') - L(T) &= w_y \text{depth}(x, T) + w_x \text{depth}(y, T) - w_x \text{depth}(x, T) - w_y \text{depth}(y, T) \\
&= w_y(\text{depth}(x, T) - \text{depth}(y, T)) + w_x(\text{depth}(y, T) - \text{depth}(x, T)) \\
&= (w_y - w_x)(\text{depth}(x, T) - \text{depth}(y, T))
\end{align*}
$$
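As a numeric sanity check of the swap identity, the sketch below evaluates both sides on a small hand-made tree. The leaf names, weights and depths are hypothetical; the depths were chosen so that they describe a valid full binary tree.

```python
def cost(weights, depth):
    """Weighted path length L(T) from leaf weights and leaf depths."""
    return sum(weights[s] * depth[s] for s in weights)


# Hypothetical full binary tree, described only by its leaf depths.
weights = {"x": 1, "y": 4, "u": 2, "v": 3}
depth = {"x": 1, "y": 3, "u": 3, "v": 2}

# Swapping the leaves x and y exchanges their depths.
swapped = dict(depth, x=depth["y"], y=depth["x"])

lhs = cost(weights, swapped) - cost(weights, depth)
rhs = (weights["y"] - weights["x"]) * (depth["x"] - depth["y"])
print(lhs, rhs)  # -6 -6
```

Here the heavier leaf `y` moves closer to the root, so the cost drops, as the sign of the formula predicts.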

**Lemma 4.4:** There exists an optimal binary tree in which the two symbols with the least weights are siblings.

**Proof:** Let $T$ be any optimal tree and let $x,y$ be the two symbols with the least weights. By definition, they are leaf nodes. If more than two symbols share the least weight, take the ones with the greatest depth in the tree.

If $x,y$ are already siblings there is nothing to do. Otherwise, assume W.L.O.G. that $\text{depth}(x) \geq \text{depth}(y)$; there are two cases.
**Case 1:** $x$ has a sibling leaf node $z$.
We create $T'$ by swapping $y$ and $z$ in $T$. By **Lemma 4.3** we get $L(T') - L(T) = (w_y - w_z)(\text{depth}(z)-\text{depth}(y)) \leq 0$, since $w_y \leq w_z$ and $\text{depth}(z) = \text{depth}(x) \geq \text{depth}(y)$. As $T$ is optimal, $T'$ is also optimal, and $x,y$ are siblings in $T'$.

**Case 2:** $x$ does not have a sibling leaf node. Since $T$ is full, the sibling of $x$ is an internal node, and hence there is a leaf node $z$ with depth greater than $\text{depth}(x)$.
We swap $x$ and $z$ to get $T'$. By our choice of $x,y$ as the least-weighted symbols of greatest depth, we have $w_x < w_z$, and hence by **Lemma 4.3** $L(T') - L(T) = (w_z - w_x)(\text{depth}(x)-\text{depth}(z)) < 0$, contradicting the optimality of $T$. So this case is not possible. $\blacksquare$


**Theorem (Correctness of Huffman Coding):** Huffman’s algorithm produces a prefix-free code of minimum expected length.

**Proof (By Induction on alphabet size $|A|$):**
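We use the correspondence between prefix-free binary codes and binary trees whose leaves are the symbols: the codeword of a symbol is its root-to-leaf path, so $\text{length}(c_i) = \text{depth}(a_i)$ and minimizing the expected code length is the same as minimizing $L(T)$.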

**Base Case:** For $|A| = 1$ the algorithm terminates without entering the loop and returns a tree with a single node labelled $a_1$. This tree has $L(T)=0$, which is already the minimum.

**Inductive Step:** By the induction hypothesis, Huffman's algorithm gives an optimal tree for every alphabet of size $n-1$. We now prove the claim for alphabets of size $n$. Let $A = (a_1,\dots,a_{n-1},a_n)$ and $W=(w_1,\dots,w_{n-1},w_n)$, and assume W.L.O.G. that $a_{n-1},a_n$ are the two symbols of least weight. We create a new alphabet $A' = (a_1,\dots,a_{n-2},z)$ with weights $W' = (w_1,\dots,w_{n-2},w_z =w_{n-1}+w_{n})$. In the first iteration of `HuffmanTree(A,W)`, the 2 least frequent elements $a_{n-1},a_n$ are made into children of a new node with frequency equal to their sum. The remaining iterations therefore run on the modified alphabet $A'$ and weights $W'$, so the algorithm proceeds exactly as `HuffmanTree(A',W')`. By our induction hypothesis, `HuffmanTree(A',W')` gives an optimal tree $T'$. Hence the tree $T$ from `HuffmanTree(A,W)` is $T'$ with the leaf $z$ replaced by an internal node whose children are $a_{n-1}$ and $a_n$.

Writing $x = a_{n-1}$ and $y = a_n$, and noting that $x$ and $y$ sit one level below $z$, we get

$$
\begin{align*}
L(T) & = \sum_{a \in A' \setminus \{z\}} w_a \text{depth}(a,T) + w_x \text{depth}(x,T) + w_y \text{depth}(y,T)\\
& = \sum_{a \in A' \setminus \{z\}} w_a \text{depth}(a,T') + (w_x + w_y)\left(\text{depth}(z,T') + 1\right) \\
& = \sum_{a \in A'} w_a \text{depth}(a,T') + w_z \\
& = L(T') + w_x + w_y
\end{align*}
$$

Now assume for contradiction that $T$ is not optimal. By **Lemma 4.4** there is an optimal tree $Z$ for $A$ that contains $a_n,a_{n-1}$ as siblings, and $L(Z) < L(T)$. Removing $a_{n},a_{n-1}$ from $Z$ and labelling their parent $z$ with weight equal to their sum gives another full binary tree $Z'$ for $A'$. Repeating the calculation above with $x = a_{n-1}$ and $y = a_n$ gives $L(Z) = L(Z') + w_x + w_y$. Hence $L(T') =L(T) - w_x -w_y > L(Z) - w_x - w_y = L(Z')$, which is a contradiction because $T'$ is optimal for $A'$. $\blacksquare$
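Unrolling the identity $L(T) = L(T') + w_x + w_y$ over every merge shows that the weighted path length of the Huffman tree equals the sum of all merged weights. The sketch below uses this to compute the optimal cost without building the tree; `huffman_cost` is a hypothetical helper name, and the weights are the character counts of `"abracadabra"`.

```python
import heapq


def huffman_cost(weights):
    """Weighted path length of a Huffman tree, as the sum of merged weights."""
    heap = list(weights)
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged  # each merge contributes w_x + w_y to L(T)
        heapq.heappush(heap, merged)
    return total


print(huffman_cost([5, 2, 2, 1, 1]))  # 23
```

The result agrees with the 23 bits obtained for the worked example.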


# 5. Complexity Analysis

We now discuss the complexity of each of the algorithms involved in this process.
1. Huffman Tree
2. Encoding
3. Decoding


### Huffman tree
We first need to initialize a heap from the weights of the alphabet. Building it with a single heapify call takes $O(|A|)$ time; inserting the symbols one at a time, as our implementation does, takes $O(|A| \log |A|)$. In each iteration of the loop, 2 elements are popped from the heap and one element is pushed, so the size of the heap decreases by 1. Hence the loop runs exactly $|A|-1$ times.

At the start of the $i$-th iteration the heap has $|A|-i+1$ elements, so each push and pop operation on the min-heap takes $O(\log (|A|-i+1))$ time. The total time complexity is therefore
$$
\sum_{i=1}^{|A|-1} 3 \cdot O(\log (|A|-i+1)) = O(|A|\log |A|)
$$
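The iteration count can be confirmed empirically. The following sketch runs only the merge loop on $n$ unit weights and counts the heap operations; the function name and the choice of unit weights are assumptions made for illustration.

```python
import heapq


def count_heap_operations(n):
    """Run the merge loop on n unit weights and count the heap operations."""
    heap = [1] * n
    heapq.heapify(heap)
    iterations = pops = pushes = 0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        heapq.heappush(heap, merged)
        iterations += 1
        pops += 2
        pushes += 1
    return iterations, pops, pushes


print(count_heap_operations(5))     # (4, 8, 4)
print(count_heap_operations(1000))  # (999, 1998, 999)
```

Each run performs $|A|-1$ iterations with three logarithmic-time heap operations apiece, matching the sum above.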

## **Encoding and Decoding**

Once the tree is ready:

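- **Encoding:** $O(M)$ time for a fixed alphabet, where $M$ is the message length: we simply look up each symbol's code and append its bits.  
- **Decoding:** also $O(M)$ for a fixed alphabet, since the decoder processes the encoded bits one at a time, moving from the root to a leaf for each symbol.

Both phases are linear in the size of the message and of its encoding.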
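A decoder that walks the tree one bit at a time could look like the sketch below. It is an illustration under stated assumptions, not the project's decoder: the tree is rebuilt from a code table as nested dictionaries, the helper names `build_trie` and `decode_with_tree` are hypothetical, and every codeword is assumed to be non-empty.

```python
def build_trie(codes):
    """Rebuild the code tree: internal nodes are dicts keyed by '0'/'1',
    leaves are the symbols themselves."""
    root = {}
    for symbol, word in codes.items():
        node = root
        for bit in word[:-1]:
            node = node.setdefault(bit, {})
        node[word[-1]] = symbol
    return root


def decode_with_tree(bits, root):
    """Follow one edge per bit; emit a symbol and restart at each leaf."""
    out = []
    node = root
    for bit in bits:
        node = node[bit]
        if isinstance(node, str):  # reached a leaf
            out.append(node)
            node = root
    return "".join(out)


codes = {"a": "0", "r": "100", "b": "101", "c": "110", "d": "111"}
print(decode_with_tree("01011000110011101011000", build_trie(codes)))  # abracadabra
```

Each bit causes exactly one dictionary lookup, so the running time is linear in the number of encoded bits.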

---

## **Space Complexity**

Required data structures:

- Heap (≤ $n$ elements)  
- Huffman tree ($2n-1$ nodes)  
- Code lookup table (1 entry per symbol)

Hence $O(n)$ total space.

---

## **Complexity Summary**

| **Phase**           | **Time**         | **Space** |
|---------------------|------------------|-----------|
| Tree construction   | $O(n \log n)$    | $O(n)$    |
| Encoding            | $O(M)$           | $O(M)$    |
| Decoding            | $O(M)$           | $O(M)$    |

---


# **6. Implementation and Experimental Results**

## **Programming Environment**
This project was implemented in **Python 3.14.0**. The core encoder and decoder use standard library modules only; the plots in the experiments additionally use `matplotlib` and `numpy`.  
The following standard library modules were used:
- `heapq` for implementing the min-heap (priority queue) used in tree construction.  
- `math` for logarithmic calculations in compression metrics.  
- `os` and `sys` for basic file handling and program control.

All code was written and tested in a **Jupyter Notebook** and a standard **Python environment**.
The implementation focuses on readability and simplicity, with detailed comments explaining every step of the Huffman Encoding and Decoding process.

---

## **Input and Output Interface**
The program allows two ways to provide input:
1. **Manual Text Input:**  
   The user can enter a string directly into the console or notebook cell.
2. **File Input:**  
   A text file can be read from the system for encoding.

The output includes:
- A **frequency table** of characters.  
- A **Huffman tree** constructed from those frequencies.  
- The **encoded bitstring** (saved as `encoded_output.txt`).  
- The **decoded text** (saved as `decoded_output.txt`).  
- Compression statistics such as total bits, fixed-length bits, and percentage of space saved.

---


```python
import heapq


class Node:
    """A Huffman tree node. Leaves carry a character; internal nodes carry ""."""

    def __init__(self, char, freq, left, right):
        self.char = char
        self.freq = freq
        self.left = left
        self.right = right

    def __lt__(self, other):
        # heapq compares nodes with <, so order them by frequency.
        return self.freq < other.freq

    def __str__(self):
        if self.char:
            return self.char
        return f"[{self.left},{self.right}]"


def get_frequency_heap(text):
    """Count each character of text and return a min-heap of leaf nodes."""
    freq_dict = {}
    heap = []
    for char in text:
        freq_dict[char] = freq_dict.get(char, 0) + 1
    for char, freq in freq_dict.items():
        heapq.heappush(heap, Node(char, freq, None, None))
    return heap


def huffman_tree(text):
    """Build the Huffman tree for text; returns None for empty input."""
    min_heap = get_frequency_heap(text)
    if not min_heap:
        return None
    while len(min_heap) > 1:
        # Merge the two least frequent nodes under a new internal node.
        l1 = heapq.heappop(min_heap)
        l2 = heapq.heappop(min_heap)
        heapq.heappush(min_heap, Node("", l1.freq + l2.freq, l1, l2))
    return min_heap[0]


def huffman_dict(node, prefix=""):
    """Map each character to its codeword (left edge = 0, right edge = 1)."""
    if node is None:
        return {}
    if node.char:
        # A tree with a single leaf would give an empty codeword; use "0".
        return {node.char: prefix or "0"}
    return {
        **huffman_dict(node.left, prefix + "0"),
        **huffman_dict(node.right, prefix + "1"),
    }


def huffman_encode(text, codes):
    """Concatenate the codeword of every character in text."""
    return "".join(codes[char] for char in text)


def huffman_decode(code, codes):
    """Decode a bitstring by matching the buffered bits against the codewords."""
    invert_dict = {value: key for key, value in codes.items()}
    decoded = []
    buffer = ""
    for bit in code:
        buffer += bit
        if buffer in invert_dict:
            decoded.append(invert_dict[buffer])
            buffer = ""
    return "".join(decoded)
```


# **7. Experiments, Datasets and Observations**


## **Experimental Datasets**

| Dataset | Description | Size (bytes) |
| :--- | :--- | :--- |
| E.coli | Complete genome of the E. Coli bacterium | 4638690 |
| bible | The King James version of the bible | 4047392 |
| world | The CIA world fact book | 2473400 |

<br>

| Dataset | Description | Size (bytes) |
| :--- | :--- | :--- |
| bib | Bibliography (refer format) | 111261 |
| book1 | Fiction book | 768771 |
| book2 | Non-fiction book (troff format) | 610856 |
| geo | Geophysical data | 102400 |
| news | USENET batch file | 377109 |
| obj1 | Object code for VAX | 21504 |
| obj2 | Object code for Apple Mac | 246814 |
| paper1 | Technical paper | 53161 |
| paper2 | Technical paper | 82199 |
| pic | Black and white fax picture | 513216 |
| progc | Source code in "C" | 39611 |
| progl | Source code in LISP | 71646 |
| progp | Source code in PASCAL | 49379 |
| trans | Transcript of terminal session | 93695 |

---

## **Compression Analysis**

First, we load our datasets into Python.


```python
large_corpus_files = [
    'E.coli',
    'bible.txt',
    'world192.txt',
]

calgary_corpus_files = [
    'bib', 'book1', 'book2', 'news',
    'paper1', 'paper2', 'paper3', 'paper4', 'paper5', 'paper6',
    'progc', 'progl', 'progp',
]

LARGE_CORPUS_DIR = "./large"
CALGARY_CORPUS_DIR = "./calgary"

# Read every file of the large corpus into memory as text.
datasets = {}
for file in large_corpus_files:
    with open(f"{LARGE_CORPUS_DIR}/{file}", "r", encoding="ascii") as f:
        datasets[file] = f.read()

# for file in calgary_corpus_files:
#     with open(f"{CALGARY_CORPUS_DIR}/{file}", "r", encoding="ascii") as f:
#         datasets[file] = f.read()
```





```python
import math

import matplotlib.pyplot as plt
import numpy as np

BITS_PER_KB = 8000  # 1 KB = 1000 bytes = 8000 bits

asciisize = []
huffmansize = []
fixedlengthsize = []
for name, text in datasets.items():
    tree = huffman_tree(text)
    code = huffman_dict(tree)
    encoded = huffman_encode(text, code)
    print(name, len(code))  # number of distinct symbols
    # Fixed-length coding needs ceil(log2(alphabet size)) bits per symbol.
    fixed_bits = math.ceil(math.log2(len(code))) * len(text)
    # Convert every size from bits to KB so the three bars are comparable.
    fixedlengthsize.append(fixed_bits // BITS_PER_KB)
    asciisize.append(len(text) * 8 // BITS_PER_KB)
    huffmansize.append(len(encoded) // BITS_PER_KB)

datasetnames = list(datasets.keys())
datasize = {
    'ASCII': asciisize,
    'Fixed Length': fixedlengthsize,
    'Huffman': huffmansize,
}

x = np.arange(len(datasetnames))  # the label locations
width = 0.25  # the width of the bars
multiplier = 0

fig, ax = plt.subplots(layout='constrained')

for attribute, measurement in datasize.items():
    offset = width * multiplier
    rects = ax.bar(x + offset, measurement, width, label=attribute)
    ax.bar_label(rects, padding=3)
    multiplier += 1

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('Size in KB')
ax.set_title('Compression')
ax.set_xticks(x + width, datasetnames)
ax.legend(loc='upper right', ncols=3)

plt.show()
```





---

## **Verification**
The decoded output perfectly matched the original input in every test, confirming that the algorithm works correctly.  
The amount of space saved depended on how uneven the character frequencies were — when all characters appeared with similar frequency, compression was minimal, but when certain characters appeared much more often than others, the algorithm achieved a significant reduction in size.

---



# **8. Potential Issues**

Although the Huffman Encoding implementation works well for moderate inputs, a few refinements can make it more robust.

---

## **Memory Management and Large File Handling**

The current implementation loads the entire file into memory before computing frequencies and building the tree.  
For large datasets, this is inefficient and may exceed memory limits.

**Improvement:** Use a *streaming approach* — read data in small chunks and update frequency counts incrementally.  
This allows handling larger files efficiently without exhausting memory.
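A chunked frequency count might look like the following sketch. It counts bytes rather than characters, and the function name `count_frequencies` and the default chunk size are assumptions; an in-memory `io.BytesIO` stands in for a real file opened in binary mode.

```python
import io
from collections import Counter


def count_frequencies(stream, chunk_size=64 * 1024):
    """Count byte frequencies from a binary stream, one chunk at a time."""
    freq = Counter()
    while chunk := stream.read(chunk_size):
        freq.update(chunk)  # iterating over bytes yields integer byte values
    return freq


demo = io.BytesIO(b"abracadabra")  # stand-in for open(path, "rb")
freq = count_frequencies(demo, chunk_size=4)
print(freq[ord("a")], freq[ord("b")])  # 5 2
```

Only the counter and one chunk are held in memory at a time. Note that encoding would still need a second pass over the file once the code table is known.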

---

## **Binary Output Format: Bits vs ASCII Representation**

Encoded data are currently stored as ASCII characters (`'0'` and `'1'`), wasting space since each bit uses an entire byte.

**Improvement:** Output should be written as a true *bitstream*.  
The third-party `bitarray` package, or manual packing with the built-in `int.to_bytes`, can pack bits into bytes efficiently.  
During decoding, bytes can be unpacked bit by bit for traversal, ensuring that compression gains accurately reflect theoretical efficiency.
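One possible way to pack the `'0'`/`'1'` string into real bytes with the standard library alone is sketched below. The helper names `pack_bits` and `unpack_bits` are hypothetical, and storing the bit count next to the payload is an assumed convention for removing the padding again.

```python
def pack_bits(bits):
    """Pack a string of '0'/'1' characters into bytes.

    Returns (payload, bit_count); bit_count is needed to strip the padding.
    """
    padded = bits + "0" * (-len(bits) % 8)  # pad to a whole number of bytes
    payload = int(padded, 2).to_bytes(len(padded) // 8, "big") if padded else b""
    return payload, len(bits)


def unpack_bits(payload, bit_count):
    """Inverse of pack_bits: expand bytes to a bit string and drop the padding."""
    bits = "".join(f"{byte:08b}" for byte in payload)
    return bits[:bit_count]


encoded = "01011000110011101011000"  # 23 bits from the "abracadabra" example
payload, bit_count = pack_bits(encoded)
print(len(payload))                               # 3
print(unpack_bits(payload, bit_count) == encoded)  # True
```

Stored this way the 23-bit example occupies 3 bytes instead of 23 ASCII characters.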

---

## **Visualization and Interpretability**

ASCII trees are easy to inspect for small examples but become unreadable for larger datasets.

**Improvement:** Use visualization tools such as `Graphviz` or `NetworkX` to render trees graphically.  
Color-coding leaves (symbols) and internal nodes (merged frequencies) would make the construction process more interpretable and pedagogically useful.
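As a rough sketch of the Graphviz route, the function below emits DOT text without importing any library. It assumes a simplified tree representation (a leaf is a one-character string, an internal node is a `(left, right)` tuple) rather than the project's `Node` class, and it does not escape special characters in labels.

```python
def to_dot(tree):
    """Return Graphviz DOT text for a nested-tuple tree.

    Assumed representation: a leaf is a one-character string and an
    internal node is a (left, right) tuple.
    """
    lines = ["digraph huffman {"]
    next_id = 0

    def visit(node):
        nonlocal next_id
        name = f"n{next_id}"
        next_id += 1
        if isinstance(node, str):
            lines.append(f'  {name} [label="{node}", shape=box];')
        else:
            lines.append(f'  {name} [label="", shape=circle];')
            for bit, child in zip("01", node):
                lines.append(f'  {name} -> {visit(child)} [label="{bit}"];')
        return name

    visit(tree)
    lines.append("}")
    return "\n".join(lines)


example = ("a", (("r", "b"), ("c", "d")))  # tree for the "abracadabra" codes
print(to_dot(example))
```

The printed text can be saved to a file and rendered with the Graphviz `dot` command-line tool, for example `dot -Tpng tree.dot -o tree.png`.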

---

## **Error Handling and Robustness**

The current decoder assumes perfect input; a single corrupted bit can desynchronize it and garble the rest of the output without raising any error.

**Improvement:** Introduce input validation and fault tolerance.  
Adding checksums or structured exception handling can help detect corrupted files and prevent runtime errors, making the implementation more reliable.
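A minimal sketch of both ideas follows: a decoder that raises on malformed input, and a CRC-32 checksum of the original text (via the standard `zlib.crc32`) to catch corruption that still decodes to something. The names `strict_decode` and `checksum` are hypothetical, and storing the checksum alongside the encoded data is an assumed file layout.

```python
import zlib


def checksum(text):
    """CRC-32 of the UTF-8 bytes of text."""
    return zlib.crc32(text.encode("utf-8"))


def strict_decode(bits, codes):
    """Decode bits, raising ValueError instead of silently dropping data."""
    inverse = {word: sym for sym, word in codes.items()}
    longest = max(map(len, inverse))
    out, buffer = [], ""
    for bit in bits:
        if bit not in "01":
            raise ValueError(f"invalid character {bit!r} in bitstream")
        buffer += bit
        if buffer in inverse:
            out.append(inverse[buffer])
            buffer = ""
        elif len(buffer) >= longest:
            raise ValueError("no codeword matches the input")
    if buffer:
        raise ValueError("input ends in the middle of a codeword")
    return "".join(out)


codes = {"a": "0", "r": "100", "b": "101", "c": "110", "d": "111"}
bits = "01011000110011101011000"
expected_crc = checksum("abracadabra")

corrupted = "1" + bits[1:]  # flip the first bit
try:
    ok = checksum(strict_decode(corrupted, codes)) == expected_crc
except ValueError:
    ok = False
print(ok)  # False
```

In this example the corrupted stream still decodes to a string of valid symbols, so only the checksum comparison reveals the damage.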

---


# **9. Challenges Faced and Conclusion**

---

## **Challenges Faced**

1. One of the main challenges we faced was setting up the programming environment. Installing Python, configuring the right libraries, and getting Jupyter and Git Bash to work properly took much longer than expected.

2. Even though we understood the Huffman algorithm conceptually, turning it into working code was far from straightforward. It took time to make the computer do exactly what we wanted — especially while handling text files, file paths, and encoding issues. Debugging small mistakes like queue ordering or missing cases often took hours.

<!-- 3. What surprised us most was how much effort it takes to correctly implement something that seems so basic on paper. Huffman’s algorithm is a simple algorithm, but even small modifications, misunderstandings, or blind spots make it tricky to get right in practice. -->

In the end, the experience made us appreciate how theory and implementation are two very different skills — and how much precision and patience real-world coding actually requires.

---

## **Improvements**

Generalizations and future enhancements include:

- Implementing binary-level compression to measure true disk savings.  
- Developing adaptive Huffman and arithmetic coding variants.  
- Applying Huffman coding to multimedia (images, audio).  
- Creating a graphical interface to visualize the encoding process.  
- Integrating entropy-based analysis to compare performance with the Shannon limit.  
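The entropy comparison in the last item could start from a sketch like this one, which computes the empirical Shannon entropy of the `"abracadabra"` example and compares it with the 23-bit Huffman encoding. The function name `shannon_entropy` is an assumption.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Empirical entropy of text in bits per symbol."""
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in Counter(text).values())


text = "abracadabra"
entropy = shannon_entropy(text)
huffman_avg = 23 / len(text)  # 23 bits for 11 symbols, from the worked example
print(f"{entropy:.3f} {huffman_avg:.3f}")  # 2.040 2.091
```

The Huffman average lies between the entropy and the entropy plus one bit, as expected for an optimal prefix-free code.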

---

## **Conclusion**

This project successfully implemented and analyzed **Huffman Encoding**, demonstrating its role as a classic example of a **greedy optimization algorithm**.  
The results verified that Huffman’s approach minimizes the expected number of bits per symbol while preserving exact reconstructability.  

The algorithm was found to be:

- **Correct:** Decoding perfectly reconstructs the original text.  
- **Efficient:** Exhibits near-linear runtime for practical datasets.  
- **Optimal:** Achieves minimal weighted path length among all prefix-free codes.  


---


# **Appendix**

---

## **A. Complete Python Code (Final Submission Version)**






---

## **B. How to Use the Code**
