# Text compression

## Why?

* After digitizing a signal $s$, we get a sequence $s[]$ of samples
  that represents $s$ with more or less fidelity.
  
* $s[]$ is encoded using PCM (Pulse Code Modulation), in which
  every sample is represented with the same number of bits. For
  example, a CD (two channels of 16 bits each) has a data rate of
  
  $$
    (16+16)\frac{\text{bits}}{\text{sample}}\times
    44{,}100\frac{\text{samples}}{\text{second}}=
    1{,}411{,}200\frac{\text{bits}}{\text{second}}.
  $$

## Sources of redundancy in signals

* In general, signals have different types of redundancy:

    + **Statistical redundancy**: It can be removed using
    probabilistic models of the signal, which produces lossless
    codecs. These codecs are also known as *text codecs*.
    
    + **Spatial/temporal redundancy**: It can be removed using
    spatial/temporal models of the signal, which also produces
    lossless codecs.
    
    + **Psychological redundancy**: Some of the information that a
    signal carries cannot be perceived by humans. This kind of
    pseudo-redundancy is normally removed by means of quantization,
    which produces lossy codecs.

## Symbols, runs, strings, code-words and code-streams

* In the context of statistical coding, each sample of $s[]$ is
  called a *symbol*.
  
* Depending on the type of statistical relationship among
  symbols, we will also speak of *strings* when we process
  more than one symbol at a time, and of *runs* when all the
  symbols in a string are the same.
  
* In any case, the output of the encoder is a sequence of
  code-words that together form a *code-stream*.

## A. Run-length encoding

* RLE (Run Length Encoding) is a technique that removes the data
  redundancy produced by the repetition of symbols. Example:
  ```
  aaaaa <-> 5a
  ```
  
* There are several versions of RLE codecs, which differ in
  the size of the source alphabet or in the maximal/minimal length
  that the runs can take.

### A.1 N-ary run length encoding

RLE for $N$-ary alphabets (alphabets of size $N$), where typically $N=256$.

#### Encoder

1. While there are symbols to encode:
    1. Let $s$ be the next symbol.
    2. Read the run of $n$ consecutive symbols equal to $s$ (counting $s$ itself).
    3. Write the pair $ns$.

#### Decoder

1. While there are $ns$ pairs to decode:
    1. Write the symbol $s$ $n$ times.
    
#### Example

Runs:
```
aaaabbbbbaaaaaabbbbbbbcccccc
```
are encoded as:
```
4a5b6a7b6c
```
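
The encoder and decoder above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `rle_encode` and `rle_decode` are made up for this example, the input is assumed to be a string, and the code-stream is kept as a list of `(n, s)` pairs so that run lengths of 10 or more stay unambiguous.

```python
from itertools import groupby

def rle_encode(symbols):
    """Return one (n, s) pair per run of n consecutive symbols equal to s."""
    return [(len(list(run)), s) for s, run in groupby(symbols)]

def rle_decode(pairs):
    """Rebuild the original string by writing each symbol s n times."""
    return "".join(s * n for n, s in pairs)

pairs = rle_encode("aaaabbbbbaaaaaabbbbbbbcccccc")
print("".join(f"{n}{s}" for n, s in pairs))  # 4a5b6a7b6c
```

A practical codec would also bound $n$ (for example, so that it fits in a fixed number of bits) and split longer runs; that limit is left out of this sketch.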

### A.2 Binary run length encoding

* It is not necessary to indicate the next symbol
  (only the length) because, when a run ends, the next run is
  necessarily made of the other symbol. The first run is assumed to
  be a run of zeros, whose length can be 0.
  
#### Encoder

1. Let $s\leftarrow 0$.
2. While there are bits to encode:
    1. Read the next $n$ consecutive bits equal to $s$.
    2. Write $n$.
    3. $s\leftarrow (s+1) \bmod 2$.
    
#### Decoder

1. Let $s\leftarrow 0$.
2. While there are items $n$ to decode:
    1. Write $n$ bits equal to $s$.
    2. $s\leftarrow (s+1) \bmod 2$.

#### Example

Runs:
```
0000111110000001111111000000
```
are encoded as:
```
4 5 6 7 6
```
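
As a rough sketch of the binary codec, the following Python code follows the encoder and decoder steps literally. The names `binary_rle_encode` and `binary_rle_decode` are hypothetical, and the input is assumed to be a string made only of the characters `0` and `1`.

```python
def binary_rle_encode(bits):
    """Encode a string of '0'/'1' as a list of run lengths.

    The first length always refers to a run of '0' (it may be 0).
    """
    if set(bits) - {"0", "1"}:
        raise ValueError("input must contain only '0' and '1'")
    lengths = []
    s = "0"                      # symbol of the current run
    i = 0
    while i < len(bits):
        n = 0
        while i < len(bits) and bits[i] == s:
            n += 1
            i += 1
        lengths.append(n)
        s = "1" if s == "0" else "0"   # switch to the other symbol
    return lengths

def binary_rle_decode(lengths):
    """Rebuild the bit string from the run lengths."""
    out = []
    s = "0"
    for n in lengths:
        out.append(s * n)
        s = "1" if s == "0" else "0"
    return "".join(out)

print(binary_rle_encode("0000111110000001111111000000"))  # [4, 5, 6, 7, 6]
```

Note that an input starting with `1` produces a first run of length 0, for example `110` is encoded as `[0, 2, 1]`.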

### A.3 [MNP-5 run length encoding](https://scholar.google.es/scholar?hl=es&as_sdt=0%2C5&q=held+data+compression+techniques+applications&btnG=)

* Created by Microcom Inc. for the MNP (Microcom Networking
  Protocol) 5.

#### Codec

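* Runs of fewer than 3 symbols are copied as they are. A run of
  $n\ge 3$ symbols is written as 3 copies of the symbol followed by
  the count $n-3$, as the following examples show: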
  
```
Input     Output
--------- ---------
ab        ab
aab       aab
aaab      aaa0b
aaaab     aaa1b
aaaaab    aaa2b
:         :
a^nb      aaa(n-3)b
```

#### Example

Runs:
```
aaaabbbbbaaaaaabbbbbbbcccccc
```
are encoded as:
```
aaa1bbb2aaa3bbb4ccc3
```
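
The rule shown in the table can be sketched in Python as follows. This is only an illustration under some assumptions: the names `mnp5_rle_encode` and `mnp5_rle_decode` are made up here, the code-stream is a list that mixes symbols (strings) and counts (integers) so that counts are never confused with symbols, and no upper limit is imposed on the count, which a real implementation would need.

```python
from itertools import groupby

def mnp5_rle_encode(symbols):
    """Copy short runs verbatim; write runs of n >= 3 as s, s, s, n - 3."""
    out = []
    for s, run in groupby(symbols):
        n = len(list(run))
        if n < 3:
            out.extend([s] * n)
        else:
            out.extend([s, s, s, n - 3])
    return out

def mnp5_rle_decode(items):
    """After 3 equal symbols in a row, the next item is a repeat count."""
    out = []
    it = iter(items)
    prev, repeat = None, 0       # last symbol seen and how often in a row
    for item in it:
        out.append(item)
        if item == prev:
            repeat += 1
        else:
            prev, repeat = item, 1
        if repeat == 3:
            out.extend([prev] * next(it))  # consume the count
            prev, repeat = None, 0
    return "".join(out)

print("".join(str(x) for x in mnp5_rle_encode("aaaab")))  # aaa1b
```

The decoder does not need any escape symbol: seeing the same symbol 3 times in a row is what signals that a count follows.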

#### Lab