# Problem Set 1 — Coding Part

**Lecture:** Data Compression With Deep Probabilistic Models (Prof. Bamler at University of Tuebingen)

- This notebook constitutes the coding part of Problem Set 1, published on 20 April 2021 and discussed on 26 April 2021.
- Download the full problem set from the [course website](https://robamler.github.io/teaching/compress21/).

## Problem 1.2: Naive Symbol Code Implementation

In this exercise, we'll implement a very naive but correct encoder and decoder for prefix-free symbol codes.
We only care about correctness for now, not about computational efficiency.

We represent bit strings (code words and the concatenated encoded message) as lists of boolean values, where `True` represents a "one" bit and `False` represents a "zero" bit.
Please be aware that this would be an extremely inefficient representation for a real application.
We represent code books as dictionaries from symbols to bit strings (i.e., to lists of boolean values).
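
Lists of booleans are tedious to read when debugging. The following is a minimal sketch of two helpers that convert between this representation and a string of `0` and `1` characters. The names `bits_to_str` and `str_to_bits` are hypothetical and not part of the problem set.

```python
def bits_to_str(bits):
    """Render a list of bools as a string of '0' and '1' characters."""
    return ''.join('1' if bit else '0' for bit in bits)


def str_to_bits(text):
    """Parse a string of '0' and '1' characters into a list of bools."""
    for char in text:
        if char not in '01':
            raise ValueError(f'Expected only "0" and "1", got {char!r}.')
    return [char == '1' for char in text]


codeword = [False, True, True]
print(bits_to_str(codeword))

# Concatenating code words is plain list concatenation.
message_bits = [True, False] + codeword
print(bits_to_str(message_bits))
```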

### Sample Code Books

Our decoding algorithm will only work with prefix codes.
Let's define some sample prefix codes for our unit tests.

In [None]:
# our example C^{(4)} from Problem 1.1
SAMPLE_CODEBOOK_MONOPOLY_C4 = {
    2: [False, True, False],
    3: [True, False],
    4: [False, False],
    5: [True, True],
    6: [False, True, True]
}

# additional example (exercise: verify that this is a prefix code)
SAMPLE_CODEBOOK2 = {
    'a': [True, False],
    'b': [False],
    'c': [True, True, False, False],
    'd': [True, True, False, True],
    'e': [True, True, True],
}
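
One way to check the prefix property mechanically is to compare every pair of code words. The sketch below assumes the dictionary representation described above; the function name `is_prefix_free` and the two small code books are made up for illustration.

```python
def is_prefix_free(codebook):
    """Return True if no code word is a prefix of another symbol's code word.

    Two symbols that share the same code word also violate the property.
    """
    codewords = list(codebook.values())
    for i, first in enumerate(codewords):
        for j, second in enumerate(codewords):
            # `second[:len(first)] == first` holds exactly if `second`
            # starts with `first` (slicing never raises on short lists).
            if i != j and second[:len(first)] == first:
                return False
    return True


# Hypothetical code books for illustration only.
valid_codebook = {'x': [False], 'y': [True, False], 'z': [True, True]}
invalid_codebook = {'x': [False], 'y': [False, True], 'z': [True, True]}

print(is_prefix_free(valid_codebook))
print(is_prefix_free(invalid_codebook))
```

This pairwise comparison takes quadratic time in the number of symbols, which is fine for the small code books used here.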

### Encoder

The encoder is very simple.
Fill in the blank where it says "TODO" (a single line of code will do).

In [None]:
def encode(message, codebook):
    """Encodes a sequence of symbols using a prefix-free symbol code.

    This is a very inefficient implementation for teaching purposes only.

    Args:
        message (list): The message you want to encode, as a list of symbols.
        codebook (dict): A codebook for a prefix-free symbol code. Must be a
            dictionary whose keys contain all symbols that appear in `message`
            (and may contain additional keys). Each key must map to a list of
            booleans, representing the code word as a sequence of bits. Must
            specify a prefix-free code, i.e., no code word may be the prefix
            of the code word for a different symbol.

    Returns:
        list: The encoded bit string as a list of bools.
    """
    
    encoded = []
    
    for symbol in message:
        # TODO: look up code word for `symbol` in the `codebook` and append
        # it to `encoded`
    
    return encoded

Now run these unit tests to verify your implementation:

In [None]:
assert encode([], SAMPLE_CODEBOOK_MONOPOLY_C4) == []
assert encode([], SAMPLE_CODEBOOK2) == []
assert (
    encode([4, 3, 6, 4, 2], SAMPLE_CODEBOOK_MONOPOLY_C4)
    == [False, False, True, False, False, True, True, False, False, False, True, False]
)
assert (
    encode(['c', 'b', 'a', 'd', 'b', 'b', 'd', 'e'], SAMPLE_CODEBOOK2)
    == [True, True, False, False, False, True, False, True, True, False, 
        True, False, False, True, True, False, True, True, True, True]
)

### Decoder

The decoder is more complicated because it has to infer the boundaries between concatenated code words.
To do this, we will use the assumption that the code book defines a *prefix-free* symbol code.

We use a kind of brute-force implementation here.
It is correct but very inefficient.
We'll implement a more efficient method on the next problem set.

Fill in the blanks where it says "TODO".

In [None]:
def decode(encoded, codebook):
    """Decodes a bitstring into a sequence of symbols using a prefix-free symbol code.

    This is a very inefficient implementation for teaching purposes only.

    Args:
        encoded (list): The compressed bit string as a list of bools.
        codebook (dict): A codebook for a prefix-free symbol code.

    Returns:
        list: The decoded message as a list of symbols.
    """
    
    def is_prefix_of(prefix_candidate, codeword):
        # TODO: Both `prefix_candidate` and `codeword` are lists of bools. Return
        # `True` if `codeword` is at least as long as `prefix_candidate` and if
        # `codeword` starts with `prefix_candidate`. Otherwise, return `False`.
    
    decoded = []
    partial_codeword = []
    candidate_symbols = list(codebook.keys())
    
    for bit in encoded:
        # Start reading in a new code word:
        # - Set `partial_codeword` to the empty list. We will accumulate the bits
        #   of the code word in this list.
        # - Set `candidate_symbols` to all symbols in the code book. We'll narrow
        #   down this list as we read more bits until it contains only a single
        #   symbol whose codeword equals `partial_codeword`.
        
        # append the current bit to `partial_codeword`:
        partial_codeword.append(bit)

        # TODO: apply a filter to `candidate_symbols`: only retain the ones
        # whose code words start with `partial_codeword`.

        if len(candidate_symbols) == 0:
            raise ValueError('Encountered invalid code word.')
        elif len(candidate_symbols) == 1 and partial_codeword == codebook[candidate_symbols[0]]:
            # TODO:
            # - Append the decoded symbol to `decoded`.
            # - Then reset `partial_codeword` and `candidate_symbols` to their initial values
            #   so that we can start decoding the next code word.

    assert partial_codeword == [], 'The compressed message ended in the middle of a code word.'
    return decoded

Now run these unit tests to verify your implementation:

In [None]:
assert decode([], SAMPLE_CODEBOOK_MONOPOLY_C4) == []
assert decode([], SAMPLE_CODEBOOK2) == []
assert decode(
    [False, False, True, False, False, True, True, False, False, False, True, False],
    SAMPLE_CODEBOOK_MONOPOLY_C4) == [4, 3, 6, 4, 2]
assert decode(
    [True, True, False, False, False, True, False, True, True, False, 
     True, False, False, True, True, False, True, True, True, True],
    SAMPLE_CODEBOOK2) == ['c', 'b', 'a', 'd', 'b', 'b', 'd', 'e']

### Round-Trip Tests

Entropy coding algorithms can contain very subtle errors that wouldn't show up in the minimal unit tests we've run so far.
It is generally a good idea to implement more elaborate tests.
This is easy to do, now that you have both an encoder and a decoder: generate some long-ish sequence of random symbols.
Then encode and decode them and verify that the decoder reconstructs the original message.
Always record the random number seed so that, if you find an error, you can reproduce it and start debugging.
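
A round-trip test along these lines could look like the following sketch. To keep the block self-contained, it uses stand-in implementations `encode_reference` and `decode_reference` (hypothetical names, and only one possible way to write them); in the notebook you would pass your own `encode` and `decode` instead. It uses Python's `random` module with an explicit seed, which is an arbitrary choice.

```python
import random


def encode_reference(message, codebook):
    """Stand-in encoder: concatenate the code words of all symbols."""
    encoded = []
    for symbol in message:
        encoded.extend(codebook[symbol])
    return encoded


def decode_reference(encoded, codebook):
    """Stand-in decoder: emit a symbol as soon as a full code word is read."""
    inverse = {tuple(codeword): symbol for symbol, codeword in codebook.items()}
    decoded = []
    partial = []
    for bit in encoded:
        partial.append(bit)
        key = tuple(partial)
        if key in inverse:
            decoded.append(inverse[key])
            partial = []
    assert partial == [], 'The message ended in the middle of a code word.'
    return decoded


def round_trip_test(encode, decode, codebook, seed, message_length=1000):
    """Encode and decode a random message and compare it to the original."""
    rng = random.Random(seed)  # seeded so that failures are reproducible
    symbols = list(codebook.keys())
    message = [rng.choice(symbols) for _ in range(message_length)]
    reconstructed = decode(encode(message, codebook), codebook)
    assert reconstructed == message, f'Round trip failed for seed {seed}.'


# Hypothetical prefix-free code book for illustration only.
codebook = {'a': [True, False], 'b': [False], 'c': [True, True]}
for seed in range(20):
    round_trip_test(encode_reference, decode_reference, codebook, seed)
```

Because the failing seed appears in the assertion message, a failed run can be replayed with exactly the same random message.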

## Problem 1.3: Binary Heap

This exercise is a preparation for the next problem set, where we will implement the Huffman coding algorithm for constructing optimal symbol codes.
Our implementation will use a *binary heap*, a data structure that is commonly used to implement a *priority queue* (depending on the ordering, it is called a *min-heap* or a *max-heap*).

- (Re-)familiarize yourself with the concept of a binary heap (e.g., skim the [Wikipedia article](https://en.wikipedia.org/wiki/Binary_heap)).
  How the heap is implemented is not so important for now; just make sure you understand what the `insert` and `pop` (or `extract`) operations do.
- The following code plays around with the binary heap implementation in the Python standard library.
  Run it, read it, and make sure you understand what it does (this code has no particular purpose apart from verifying that we understand how the API works).

In [None]:
import numpy as np
import heapq

In [None]:
np.random.seed(123)
test_data = np.random.choice(10, size=20)
test_data

In [None]:
heap = []
for item in test_data:
    heapq.heappush(heap, item)
heap

In [None]:
sorted_test_data = []
while heap:
    sorted_test_data.append(heapq.heappop(heap))
sorted_test_data # Should print the items from `test_data` in sorted order.

In [None]:
assert set(test_data) == set(sorted_test_data)
assert sorted(sorted_test_data) == sorted_test_data
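
To see why the raw `heap` list printed above is not fully sorted, it can help to check the heap invariant directly. The sketch below uses a small hand-picked list (an assumption for illustration, not the random `test_data` from above) and relies on the fact that `heapq` maintains a min-heap in which `heap[k] <= heap[2*k+1]` and `heap[k] <= heap[2*k+2]`.

```python
import heapq

items = [5, 1, 8, 3, 9, 2]  # hand-picked example data
heap = []
for item in items:
    heapq.heappush(heap, item)

# The smallest item always sits at index 0 ...
assert heap[0] == min(items)

# ... and every parent is no larger than its children. The list as a
# whole need not be sorted.
for k in range(len(heap)):
    for child in (2 * k + 1, 2 * k + 2):
        if child < len(heap):
            assert heap[k] <= heap[child]

# `heappop` removes and returns the smallest item and restores the invariant.
smallest = heapq.heappop(heap)
print(smallest, heap[0])
```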