# LZW Encoding

This notebook shows the use of the `lzw_encode` and `lzw_decode` methods in `slz`. There is both a functional implementation and an object-oriented (stateful) implementation.


### Encoding
The encoding function `lzw_encode` has the following signature:


```python
def lzw_encode(input: Union[str, bytes]) -> str:
```

Although the return type is annotated as `str`, the value is really just the byte stream produced by the encoder. It almost always makes sense to do `b = lzw_encode(data).encode("utf-8")`. The return type is `str` rather than `bytes` because of how bytes work in Python and what is possible with `pybind11`. See for example <https://github.com/pybind/pybind11/issues/1236>.

The signature in the underlying implementation is 

```c++
std::stringstream lzw_encode(const std::string_view input);
```

## Encoding format

The output of the encoder is a byte stream containing the compressed bytes. To recover the stream correctly when decoding, some header information is prepended to the data. The data stream consists of symbols that vary in size from 2 to 4 bytes per symbol. At the start of the stream all symbols are 2 bytes. As the prefix tree grows, more compression can be gained by using a larger symbol that represents a longer prefix. At some point the stream switches to symbols that are all 3 bytes long, and later still to symbols that are all 4 bytes long. When decoding the stream we need to know at which points the symbol size changes. We encode this information in a 12-byte header at the start of the stream.

The header format is 

- (4 bytes) - offset in stream where the first 24-bit code is. If this is zero there are no 24-bit codes.
- (4 bytes) - offset in stream where the first 32-bit code is. If this is zero there are no 32-bit codes.
- (4 bytes) - total number of codes in the stream.

The byte at offset 12 is the first byte in the data stream and is part of a 2-byte symbol.
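
The header layout above can be sketched in Python with the standard `struct` module. This is an illustration rather than the `slz` decoder: the names `LzwHeader`, `parse_header` and `symbol_width_at` are hypothetical, and the sketch assumes that each 4-byte field is an unsigned little-endian integer and that the two offsets are byte offsets measured from the start of the stream (header included). Neither assumption is stated above, so check them against the C++ implementation before relying on this.

```python
import struct
from typing import NamedTuple

HEADER_SIZE = 12  # three 4-byte fields, as described above


class LzwHeader(NamedTuple):
    first_24bit_offset: int  # 0 means the stream has no 24-bit codes
    first_32bit_offset: int  # 0 means the stream has no 32-bit codes
    code_count: int          # total number of codes in the stream


def parse_header(stream: bytes) -> LzwHeader:
    """Read the 12-byte header from the front of an encoded stream."""
    if len(stream) < HEADER_SIZE:
        raise ValueError("stream is shorter than the 12-byte header")
    # ASSUMPTION: unsigned little-endian 32-bit fields ("<III").
    return LzwHeader(*struct.unpack_from("<III", stream, 0))


def symbol_width_at(header: LzwHeader, offset: int) -> int:
    """Return the symbol size in bytes (2, 3 or 4) at a byte offset.

    ASSUMPTION: header offsets are byte offsets from the start of the
    stream, so the first data byte sits at offset 12.
    """
    if offset < HEADER_SIZE:
        raise ValueError("offset points into the header, not the data")
    if header.first_32bit_offset and offset >= header.first_32bit_offset:
        return 4
    if header.first_24bit_offset and offset >= header.first_24bit_offset:
        return 3
    return 2


# Synthetic header for illustration only (not real encoder output):
# 24-bit codes start at byte 40, there are no 32-bit codes, 14 codes total.
example = struct.pack("<III", 40, 0, 14) + bytes(28)
header = parse_header(example)
print(header)
print(symbol_width_at(header, 12), symbol_width_at(header, 40))
```

Because `lzw_encode` returns a `str`, the value would first need to be converted to `bytes` as described earlier before being passed to a parser like this one.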

In [2]:
from slz import lzw_encode      # import the functional encoder

# We start with the example from the unit test 


### Encoding a stream from a file 

To sidestep any copyright issues, we use the collected works of Shakespeare from Project Gutenberg.

In [None]:
input_filename = "test/shakespear.txt"

with open(input_filename, "r") as fp:
    text = fp.read()
    
print(text[:128])