# Text compression

## Why?

* After the digitization of any signal we get a sequence $s[]$
  of samples that represent the signal $s$ with more or less fidelity.
  
* $s[]$ is encoded using PCM (Pulse Code Modulation), in which
  every sample is represented with the same number of bits. For
  example, an audio CD (two channels of 16 bits each) has a data-rate of
  
  $$
    (16+16)\frac{\text{bits}}{\text{sample}}\times
    44{,}100\frac{\text{samples}}{\text{second}}=
    1{,}411{,}200\frac{\text{bits}}{\text{second}}.
  $$
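
The same arithmetic can be checked with a few lines of Python. This is only a sketch; the variable names are chosen here for illustration.

```python
# PCM data-rate of an audio CD: two channels of 16 bits, 44,100 samples/second.
bits_per_sample = 16 + 16
samples_per_second = 44_100

bits_per_second = bits_per_sample * samples_per_second
print(bits_per_second)       # 1411200
print(bits_per_second // 8)  # 176400 (bytes per second)
```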

## Sources of redundancy in signals

* In general, signals have different types of redundancy:

    + **Statistical redundancy**: It can be removed using
    probabilistic models of the signal, which produces lossless codecs.
    These codecs are also known as *text codecs*.
    
    + **Spatial/temporal redundancy**: It can be removed using
    spatial/temporal models of the signal, which also produces lossless
    codecs.
    
    + **Psychological redundancy**: Some of the information that
    signals carry cannot be perceived by humans. This kind of
    pseudo-redundancy is normally removed by means of quantization,
    producing lossy codecs.

## Symbols, runs, strings, code-words and code-streams

* In the context of statistical coding, each sample of $s[]$ is
  called a *symbol*.
  
* Depending on the type of statistical relationship among
  symbols, we will also speak of *strings* when we process
  more than one symbol and of *runs* when all the symbols in
  a string are the same.
  
* In any case, the output of the encoder is a sequence of
  code-words that together form a *code-stream*.

## A. Run-length encoding

* RLE (Run Length Encoding) is a technique that removes the data
  redundancy produced by the repetition of symbols. Example:
  ```
  aaaaa <-> 5a
  ```
  
* There are several versions of RLE codecs, which differ in
  the size of the source alphabet or in the minimal/maximal length that
  the runs can take.

### A.1 N-ary run length encoding

RLE for $N$-ary alphabets (alphabets of size $N$), where typically $N=256$.

#### Encoder

1. While there are symbols to encode:
    1. Let $s$ be the next symbol.
    2. Read the run of $n$ consecutive symbols equal to $s$ (including $s$ itself).
    3. Write the pair $ns$.

#### Decoder

1. While there are $ns$ pairs to decode:
    1. Write the symbol $s$ $n$ times.
    
#### Example

Runs:
```
aaaabbbbbaaaaaabbbbbbbcccccc
```
are encoded as:
```
4a5b6a7b6c
```
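
A minimal Python sketch of this codec follows. It assumes the input is a string and represents each pair $ns$ as a tuple `(n, s)`; the function names are illustrative, and no maximal run length is enforced.

```python
from itertools import groupby

def rle_encode(symbols):
    """Return a list of (n, s) pairs, one per run of equal symbols."""
    return [(len(list(run)), s) for s, run in groupby(symbols)]

def rle_decode(pairs):
    """Write each symbol s n times."""
    return "".join(s * n for n, s in pairs)

pairs = rle_encode("aaaabbbbbaaaaaabbbbbbbcccccc")
print(pairs)  # [(4, 'a'), (5, 'b'), (6, 'a'), (7, 'b'), (6, 'c')]
print(rle_decode(pairs))
```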

### A.2 Binary run length encoding

* It is not necessary to indicate the symbol of each run
  (only its length) because, when a run ends, the next run must
  consist of the other symbol.
  
#### Encoder

1. Let $s\leftarrow 0$.
2. While there are bits to encode:
    1. Read the next $n$ consecutive bits equal to $s$.
    2. Write $n$.
    3. $s\leftarrow (s+1) \bmod 2$.
    
#### Decoder

1. Let $s\leftarrow 0$.
2. While there are items $n$ to decode:
    1. Write $n$ bits equal to $s$.
    2. $s\leftarrow (s+1) \bmod 2$.

#### Example

Runs:
```
0000111110000001111111000000
```
are encoded as:
```
4 5 6 7 6
```
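
The binary codec can be sketched in Python as follows. The sketch assumes the input is a string that contains only the characters `0` and `1`; the function names are illustrative. Note that an input starting with `1` produces a first run of length 0.

```python
def binary_rle_encode(bits):
    """Return the lengths of the alternating runs, starting with a run of 0s."""
    lengths = []
    s = "0"
    pos = 0
    while pos < len(bits):
        n = 0
        while pos < len(bits) and bits[pos] == s:
            n += 1
            pos += 1
        lengths.append(n)
        s = "1" if s == "0" else "0"  # s <- (s + 1) mod 2
    return lengths

def binary_rle_decode(lengths):
    """Write n bits equal to s for each length n, alternating s."""
    out = []
    s = "0"
    for n in lengths:
        out.append(s * n)
        s = "1" if s == "0" else "0"
    return "".join(out)

print(binary_rle_encode("0000111110000001111111000000"))  # [4, 5, 6, 7, 6]
print(binary_rle_encode("110"))                           # [0, 2, 1]
```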

### A.3 [MNP-5 run length encoding](https://scholar.google.es/scholar?hl=es&as_sdt=0%2C5&q=held+data+compression+techniques+applications&btnG=)

* Created by Microcom Inc. for the MNP (Microcom Networking
  Protocol) 5.

#### Codec

* The behavior of the codec can be easily defined with the
  following examples:
  
```
Input     Output
--------- ---------
ab        ab
aab       aab
aaab      aaa0b
aaaab     aaa1b
aaaaab    aaa2b
:         :
a^nb      aaa(n-3)b
```

#### Example

Runs:
```
aaaabbbbbaaaaaabbbbbbbcccccc
```
are encoded as:
```
aaa1bbb2aaa3bbb4ccc3
```
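
The rule shown in the table (runs shorter than 3 are copied, longer runs are written as three symbols plus the number of extra repetitions) can be sketched in Python. This is an illustration under assumptions, not the MNP 5 specification: counts are kept as integers in a list so that they cannot be confused with symbols, and no maximal run length is enforced. The function names are hypothetical.

```python
from itertools import groupby

def mnp5_rle_encode(symbols):
    """Runs of n >= 3 symbols become [s, s, s, n - 3]; shorter runs are copied."""
    out = []
    for s, run in groupby(symbols):
        n = len(list(run))
        if n < 3:
            out.extend([s] * n)
        else:
            out.extend([s, s, s, n - 3])
    return out

def mnp5_rle_decode(items):
    """Three equal consecutive symbols are always followed by a count."""
    out = []
    i = 0
    while i < len(items):
        s = items[i]
        if i + 2 < len(items) and items[i + 1] == s and items[i + 2] == s:
            out.append(s * (3 + items[i + 3]))
            i += 4
        else:
            out.append(s)
            i += 1
    return "".join(out)

encoded = mnp5_rle_encode("aaaabbbbbaaaaaabbbbbbbcccccc")
print("".join(map(str, encoded)))  # aaa1bbb2aaa3bbb4ccc3
```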

#### Lab

<img src="text_coding/lena-gray.png" style="width: 400px;"/>
<img src="text_coding/peppers-gray.png" style="width: 400px;"/>
<img src="text_coding/boats.png" style="width: 400px;"/>
<img src="text_coding/zelda.png" style="width: 400px;"/>

Using `rle`, compute the compression ratio of each image as

$$
\gamma = \frac{X}{Y}
$$

where $X$ is the size of the input (the sequence of symbols) and $Y$
the size of the output (the code-stream), and populate:
```
Codec | lena boats peppers zelda Average
------+---------------------------------
  rle | ....  ....    ....  ....    ....
```

## A.4 [Burrows-Wheeler transform](https://scholar.google.es/scholar?hl=es&as_sdt=0%2C5&q=Burrows+M%2C+Wheeler+DJ%3A+A+Block+Sorting+Lossless+Data+Compression+Algorithm.&btnG=)

* BWT (Burrows-Wheeler Transform) is an algorithm that takes
  a string as input and outputs:
  1. A string with the same symbols as the input, but in a
    different order.
  2. An index.
  
  
* There is an inverse transform that, using the output of the
  forward transform, recovers the original string.
  
* The transformed string tends to have longer runs.

* The length of the runs grows with the correlation
  between the symbols and with the length of the input.
  
### Forward transform

Let $B$ be the block-size in symbols:

1. Read $B$ symbols.
2. Build a square matrix of size $B\times B$ where the first row is
  the original sequence, the second one is the same sequence
  cyclically shifted one symbol to the left, and so on.
3. Sort the rows of the matrix lexicographically.
4. In the last column, find the row that holds the first symbol of
  the original sequence (the row whose content is the original
  sequence shifted one symbol to the left). The position of this row
  is the index $i$.
5. Output $i$ and the last column $O[]$.

#### Encoding example
<img src="text_coding/BWT_example.svg" style="width: 800px;"/>

### Inverse transform

1. Sort $O[]$ to obtain $S[]$.
2. Compute $T[]$ where, if $S[j]=O[l]$ ($l$ being the first position
  of $O[]$ not yet assigned that matches this condition), then $T[j]=l$.
  Notice that all the elements of $T[]$ have to be different.
3. Let $k\leftarrow i$.
4. Execute $B$ times:
    1. Output $O[k]$.
    2. $k\leftarrow T[k]$.
    
#### Decoding example
<img src="text_coding/BWT_example_decod.svg" style="width: 400px;"/>
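
The forward and inverse steps above can be sketched in Python as follows. The sketch assumes a non-empty string as input and takes $i$ as the row that holds the input shifted one symbol to the left, so that $O[i]$ is the first symbol of the input. The function names are illustrative. The stable sort used to build $T[]$ guarantees that equal symbols are matched in order of appearance.

```python
def bwt_forward(s):
    """Return (i, O): the index and the last column of the sorted matrix."""
    B = len(s)
    rotations = [s[r:] + s[:r] for r in range(B)]          # rows of the matrix
    order = sorted(range(B), key=lambda r: rotations[r])   # lexicographic sort
    O = "".join(rotations[r][-1] for r in order)           # last column
    i = order.index(1 % B)  # row with the input shifted one symbol to the left
    return i, O

def bwt_inverse(i, O):
    """Recover the original string from the index i and the last column O."""
    B = len(O)
    # T[j] = l, where S[j] = O[l]; sorted() is stable, so ties keep their order.
    T = sorted(range(B), key=lambda l: O[l])
    out = []
    k = i
    for _ in range(B):
        out.append(O[k])
        k = T[k]
    return "".join(out)

i, O = bwt_forward("banana")
print(i, O)               # 2 nnbaaa
print(bwt_inverse(i, O))  # banana
```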

### Lab

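```python
# https://gist.github.com/dmckean/9723bc06254809e9068f

from operator import itemgetter

def bwt_encode(s):
    n = len(s)
    # All the cyclic rotations of s, sorted lexicographically.
    m = sorted([s[i:n]+s[0:i] for i in range(n)])
    # Here the index is the row that holds the original string.
    I = m.index(s)
    # Last column of the sorted matrix.
    L = ''.join([q[-1] for q in m])
    return (I, L)

def bwt_decode(I, L):
    n = len(L)
    # Stable sort of the (position in L, symbol) pairs: the first column.
    X = sorted([(i, x) for i, x in enumerate(L)], key=itemgetter(1))

    # T maps a position in the last column to its position in the first one.
    T = [None for i in range(n)]
    for i, y in enumerate(X):
        j, _ = y
        T[j] = i

    # Follow T starting at row I; the symbols come out in reverse order.
    Tx = [I]
    for i in range(1, n):
        Tx.append(T[Tx[i-1]])

    S = [L[i] for i in Tx]
    S.reverse()
    return ''.join(S)

index, encoded = bwt_encode('ababcbababaaaaaaa')
print(index, encoded)
decoded = bwt_decode(index, encoded)
print(decoded)
```

```
9 baaaaaabbabaacaab
ababcbababaaaaaaa
```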


## B. String encoding

### How does it work?

* We replace strings by shorter code-words.
* Strings are searched for in a dictionary, and the sequence of positions of the strings in the dictionary forms the code-stream.

### B.1 LZ77 [[J. Ziv and A. Lempel, 1977]](https://scholar.google.es/scholar?hl=es&as_sdt=0%2C5&q=Ziv+Lempel+universal+sequential+data+compression+1977&btnG=)

* In 1977, Jacob Ziv and Abraham Lempel proposed the LZ77 algorithm.
* In the eighties, a variant of LZ77 known as LZSS was
  implemented by Haruyasu Yoshizaki in the program LHARC, revealing
  the possibilities of the LZ77 encoding.
* After that, a large number of text compressors have been based
  on the LZ77 idea (or a variation of it). Some of the most famous
  are: `ARJ`, `RAR`, `gzip` and `7z`.
* LZ77 processes a sequence of symbols using the structure:

<img src="text_coding/LZ77.svg" style="width: 600px;"/>

* The dictionary and the look-ahead buffer have a fixed size and
  can be considered as a sliding window, where the input of a new
  symbol generates the output of the oldest one, which becomes the
  newest symbol of the dictionary.
  
#### Encoder

1. Let $I$ be the length of the dictionary and $J$ the length of the
  buffer.
2. Input the first $J$ symbols into the buffer.
3. While the input is not exhausted:
    1. Let $i$ be the position in the dictionary of the longest match
    for the first $j$ symbols of the buffer, and $k$ the symbol that
    follows them in the buffer (the symbol that prevents $j$ from
    being larger).
    2. Output $ijk$.
    3. Input the next $j+1$ symbols into the buffer.
    
#### Decoder

1. While the code-words $ijk$ are not exhausted:
    1. Output the $j$ symbols extracted from the position $i$ in the
    dictionary.
    2. Output $k$.
    3. Insert all the decoded symbols into the dictionary.

#### Example

<img src="text_coding/LZ77_encoding_example.svg" style="width: 600px;"/>

<img src="text_coding/LZ77_decoding_example.svg" style="width: 600px;"/>

* Parameters $I$ and $J$ control the performance
  of the algorithm. They should be large enough to guarantee the
  matching of long strings, but should be kept small in order to reduce
  the number of bits of the code-words $ijk$. Typical sizes are:
  $\log_2(I)=12$ and $\log_2(J)=4$.
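
A minimal Python sketch of the LZ77 codec follows. Several choices are assumptions made for illustration and are not fixed by the description above: $i$ is stored as the distance from the current position back to the start of the match, matches must lie entirely inside the dictionary, the search is a brute-force scan, and $i=j=0$ means that no match was found. The names `dict_size` and `buf_size` stand for $I$ and $J$.

```python
def lz77_encode(data, dict_size=4096, buf_size=16):
    """Return a list of (i, j, k) code-words."""
    out = []
    pos = 0
    while pos < len(data):
        start = max(0, pos - dict_size)
        best_i, best_j = 0, 0
        # Leave at least one symbol of the buffer for k.
        max_len = min(buf_size - 1, len(data) - pos - 1)
        for cand in range(start, pos):
            j = 0
            while (j < max_len and cand + j < pos
                   and data[cand + j] == data[pos + j]):
                j += 1
            if j > best_j:
                best_i, best_j = pos - cand, j
        k = data[pos + best_j]
        out.append((best_i, best_j, k))
        pos += best_j + 1
    return out

def lz77_decode(codewords):
    """Copy j symbols found i positions back, then output k."""
    out = []
    for i, j, k in codewords:
        start = len(out) - i
        out.extend(out[start:start + j])
        out.append(k)
    return "".join(out)

codewords = lz77_encode("abracadabra abracadabra")
print(codewords)
print(lz77_decode(codewords))
```

With $\log_2(I)=12$ and $\log_2(J)=4$, each code-word of this sketch would need 12 bits for $i$, 4 bits for $j$ and one symbol for $k$.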

#### Lab
To-do.

### B.2 LZ78 [[J. Ziv and A. Lempel, 1978]](https://scholar.google.es/scholar?hl=es&as_sdt=0%2C5&q=Ziv+Lempel+1978&btnG=)

* In 1978, Ziv and Lempel published the LZ78 algorithm.

* LZ78 represents the dictionary in a recursive way with the idea
  of speeding up the search for strings in the dictionary. Now, each
  entry in the dictionary is a pair $wk$, where $w$ is a pointer to
  another entry of the dictionary and $k$ is a symbol. Each entry $wk$
  represents the string that results from the concatenation of the
  string that $w$ represents and $k$, where the string that $w$
  represents is computed recursively in the same way.
  
* We will denote by $\text{string}(w)$ the string that $w$
  represents.
  
* The empty string is obtained by $\text{string}(0)$.

#### Encoder

1. $w\leftarrow 0$.
2. While the input is not exhausted:
    1. $k\leftarrow$ next input symbol.
    2. If $wk$ is found in the dictionary, then:
        1. $w\leftarrow$ address of $wk$ in the dictionary.
    3. Else:
        1. Output $wk$.
        2. Insert $wk$ in the dictionary.
        3. $w\leftarrow 0$.
        
#### Decoder

1. While the input is not exhausted:
    1. Input $wk$.
    2. Output $\text{string}(w)$.
    3. Output $k$.
    4. Insert $wk$ in the dictionary.
    
#### Example

<img src="text_coding/LZ78_encoding_example.svg" style="width: 600px;"/>

<img src="text_coding/LZ78_decoding_example.svg" style="width: 600px;"/>
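
A minimal Python sketch of the LZ78 codec follows. It represents each code-word $wk$ as a tuple `(w, k)`. One detail is an assumption added here: if the input ends in the middle of a match, the pending $w$ is flushed as the pair `(w, "")`, with an empty symbol. The function names are illustrative.

```python
def lz78_encode(data):
    """Return a list of (w, k) pairs; address 0 is the empty string."""
    dictionary = {}  # maps (w, k) to the address of the entry wk
    out = []
    w = 0
    for k in data:
        if (w, k) in dictionary:
            w = dictionary[(w, k)]
        else:
            out.append((w, k))
            dictionary[(w, k)] = len(dictionary) + 1
            w = 0
    if w != 0:
        out.append((w, ""))  # assumption: flush the pending match
    return out

def lz78_decode(pairs):
    """Output string(w) followed by k and insert wk in the dictionary."""
    strings = [""]  # strings[w] is string(w)
    out = []
    for w, k in pairs:
        entry = strings[w] + k
        out.append(entry)
        strings.append(entry)
    return "".join(out)

pairs = lz78_encode("abababab")
print(pairs)  # [(0, 'a'), (0, 'b'), (1, 'b'), (3, 'a'), (2, '')]
print(lz78_decode(pairs))
```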




### B.3 LZW [[T.A. Welch, 1984]](https://scholar.google.es/scholar?hl=es&as_sdt=0%2C5&q=Terry+Welch+1984&btnG=)

* In 1984, Terry A. Welch proposed the LZW algorithm,
  which is an improved version of the LZ78 algorithm that does not
  write raw symbols ($k$) to the code-stream.

* LZW was selected as the encoding engine for GIF (Graphics
  Interchange Format), and for the compressor `compress`.
  
* With 8-bit symbols, the dictionary is initially filled with the
  $2^8=256$ possible symbols (*roots*), which are stored in entries
  $0\cdots255$.
  
#### Encoder

1. $w\leftarrow$ next input symbol.
2. While the input is not exhausted:
    1. $k\leftarrow$ next input symbol.
    2. If $wk$ is found in the dictionary, then:
        1. $w\leftarrow$ address of $wk$ in the dictionary.
    3. Else:
        1. Output $w$.
        2. Insert $wk$ in the dictionary.
        3. $w\leftarrow k$.
3. Output $w$.

#### Decoder

1. $code\leftarrow$ first input code-word.
2. Output string$(code)$.
3. $k\leftarrow$ first symbol of string$(code)$.
4. $old\_code\leftarrow code$.
5. While the input is not exhausted:
    1. $code\leftarrow$ next input code-word.
    2. $w\leftarrow old\_code$.
    3. If $code$ is found in the dictionary, then:
        1. Output string$(code)$.
    4. Else:
        1. Output string$(w)$.
        2. Output $k$.
    5. $k\leftarrow$ first symbol of the string output in this iteration.
    6. Insert $wk$ in the dictionary.
    7. $old\_code\leftarrow code$.
    
#### Example

<img src="text_coding/LZW_encoding_example.svg" style="width: 600px;"/>

<img src="text_coding/LZW_decoding_example.svg" style="width: 600px;"/>
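
A minimal Python sketch of the LZW codec follows. It assumes 8-bit symbols (a `bytes` input), lets the dictionary grow without limit, and returns the code-words as a list of integers instead of packing them into bits. The function names are illustrative. The `else` branch of the decoder handles the case in which a code-word refers to the entry that the encoder had just created.

```python
def lzw_encode(data):
    """Return the list of code-words for a bytes object."""
    dictionary = {bytes([b]): b for b in range(256)}  # the 256 roots
    out = []
    w = b""
    for byte in data:
        wk = w + bytes([byte])
        if wk in dictionary:
            w = wk
        else:
            out.append(dictionary[w])
            dictionary[wk] = len(dictionary)
            w = bytes([byte])
    if w:
        out.append(dictionary[w])  # flush the last pending string
    return out

def lzw_decode(codes):
    """Rebuild the dictionary on the fly and return the decoded bytes."""
    if not codes:
        return b""
    strings = [bytes([b]) for b in range(256)]
    old = strings[codes[0]]
    out = [old]
    for code in codes[1:]:
        if code < len(strings):
            entry = strings[code]
        else:
            entry = old + old[:1]  # code-word not yet in the dictionary
        out.append(entry)
        strings.append(old + entry[:1])
        old = entry
    return b"".join(out)

codes = lzw_encode(b"abababab")
print(codes)  # [97, 98, 256, 258, 98]
print(lzw_decode(codes))
```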




## C. Symbol encoding

### How does it work?

* We can compress if each symbol is translated into a code-word and,
  on average, the lengths of the code-words are smaller than the
  lengths of the symbols.
  
* The encoder and the decoder have a probabilistic model $M$ which
  tells the variable-length encoder ($C$)/decoder ($C^{-1}$) the
  probability $p(s)$ of each symbol $s$.
  
<img src="text_coding/compresion_entropica.svg" style="width: 600px;"/>

* The most probable symbols are represented by the shortest
  code-words and vice versa.
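
As an illustration, the following Python sketch uses a hypothetical model and a hypothetical prefix code table (neither comes from a real codec) for a 3-symbol alphabet. A fixed-length code would need 2 bits per symbol, while this variable-length code needs 1.5 bits per symbol on average.

```python
# Hypothetical model M and code table, chosen only for illustration.
p = {"a": 0.5, "b": 0.25, "c": 0.25}
code = {"a": "0", "b": "10", "c": "11"}

def encode(symbols, code):
    """Concatenate the code-word of each symbol."""
    return "".join(code[s] for s in symbols)

def decode(bits, code):
    """Decode a prefix code by accumulating bits until a code-word matches."""
    inverse = {word: s for s, word in code.items()}
    out = []
    word = ""
    for bit in bits:
        word += bit
        if word in inverse:
            out.append(inverse[word])
            word = ""
    return "".join(out)

average_length = sum(p[s] * len(code[s]) for s in p)
print(average_length)          # 1.5
print(encode("aabac", code))   # 0010011
print(decode("0010011", code)) # aabac
```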
  
### Bits

* Data is the representation of the information.

* Lossless data compression uses a shorter representation of the
  information.
  
* By definition, a bit of data stores a bit of information if and
  only if it represents the occurrence of an equiprobable event (an
  event that can be true or false with the same probability).
  
* By definition, a symbol $s$ with probability $p(s)$ stores

\begin{equation}
  I(s)=-\log_2 p(s) \tag{Eq:symbol_information}
\end{equation}
  
  bits of information.

* The length of the code-word depends on the probability as:

<img src="text_coding/prob_vs_long.svg" style="width: 600px;"/>
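
A short Python sketch of $I(s)=-\log_2 p(s)$ shows how the information grows as the probability decreases. The function name is illustrative.

```python
import math

def information(p):
    """Bits of information of a symbol with probability p (0 < p <= 1)."""
    return -math.log2(p)

print(information(0.5))    # 1.0
print(information(0.25))   # 2.0
print(information(0.125))  # 3.0
```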

### Entropy

* The entropy $H(S)$ measures the amount of information per
  symbol that a source of information $S$ produces, on average, i.e.
  
\begin{equation}
  H(S) = \sum_{s=1}^{N} p(s)\times I(s) = -\sum_{s=1}^{N} p(s)\log_2 p(s)
\end{equation}

  bits-of-information/symbol, where $N$ is the size of the source
  alphabet (number of different symbols).
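
The entropy of a sequence can be estimated in Python by weighting the information of each symbol by its probability, that is, by computing $-\sum_s p(s)\log_2 p(s)$. The sketch below assumes that $p(s)$ is the relative frequency of $s$ in a non-empty sequence; the function name is illustrative.

```python
import math
from collections import Counter

def entropy(symbols):
    """Estimate the entropy, in bits/symbol, from relative frequencies."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(entropy("aabb"))  # 1.0
print(entropy("abcd"))  # 2.0
```

A sequence with a single repeated symbol has zero entropy, which is the case that run-length encoding exploits best.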

### C.1 Universal coding

