# Step 3: Train an RNN

From the previous steps we have corpora containing corpus entries with the following contents:

1. audio data sampled to 16kHz (mono)
2. segmentation of the audio into speech- and pause-segments detected by WebRTC
3. segmentation of the audio into speech- and pause-segments derived from the raw data
4. transcriptions for the raw speech segments in original and normalized form
5. spectrograms of the audio to use as training data
6. labels for the spectrogram frames to use as training labels

We can now train an RNN that will learn the relationship between an audio signal and its textual representation using the [CTC loss function](https://www.cs.toronto.edu/~graves/icml_2006.pdf). 

The raw speech segments (3.) and their transcriptions (4.) are called the _labelled data_. This data consists of the input data `X` and the labels `Y`. To train the network we use only a part of this labelled data, referred to as the **training set**. To detect overfitting on this set we use another part of the labelled data as a **validation/dev set**.

If the training is successful, the RNN will be able to output a transcription for unknown instances that is roughly equivalent to the actual transcription. We can evaluate the RNN's performance by comparing its output with the actual transcription on a **test-set**.
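How the labelled data is actually partitioned is not specified here. As a minimal sketch, assuming a random 80/10/10 split (the fractions, the seed and the function name `split_labelled_data` are illustrative assumptions, not part of the notebook's helper modules), the three sets could be formed like this:

```python
import random


def split_labelled_data(samples, train_frac=0.8, dev_frac=0.1, seed=42):
    """Split labelled samples into train, dev and test sets.

    The fractions are assumptions for illustration only; the remainder
    after train and dev becomes the test set.
    """
    samples = list(samples)
    # Shuffle a copy with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(samples)
    n_train = int(len(samples) * train_frac)
    n_dev = int(len(samples) * dev_frac)
    train = samples[:n_train]
    dev = samples[n_train:n_train + n_dev]
    test = samples[n_train + n_dev:]
    return train, dev, test


# Hypothetical usage with 100 dummy (X, Y) pairs.
dummy = [(f'x{i}', f'y{i}') for i in range(100)]
train_set, dev_set, test_set = split_labelled_data(dummy)
print(len(train_set), len(dev_set), len(test_set))
```

The sets are disjoint by construction, so the test set contains only instances the network has never seen during training.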

In [None]:
corpus_root = r'E:/'

As usual, let's do the imports and some helper functions before we start.

In [None]:
import os

from corpus_util import *
from audio_util import *
from data_util import *

rl_corpus_root = os.path.join(corpus_root, 'readylingua-corpus')
ls_corpus_root = os.path.join(corpus_root, 'librispeech-corpus')

rl_corpus_path = os.path.join(rl_corpus_root, 'readylingua.corpus')
ls_corpus_path = os.path.join(ls_corpus_root, 'librispeech.corpus')

Since the pronunciation of a given piece of text is highly dependent on the language, we train the RNN only on a specific language of the corpus. Therefore, let's load the corpus and extract the entries in German.

In [None]:
rl_corpus_de = load_corpus(rl_corpus_path)(languages='de')
rl_corpus_de.summary()

## Normalizing the transcript

In order to limit the number of target classes to the characters of the alphabet we need to normalize the transcripts. Normalizing involves the following steps:

1. removing leading and trailing whitespace (trimming)
2. removing repeated consecutive occurrences of whitespace within the transcript
3. replacing accented characters with characters from the alphabet (e.g. _é_/_è_/_ê_/... -> e, _ß_ -> ss, etc.)
4. removing non-alphanumeric characters (e.g. punctuation)
5. converting everything to lowercase

You can edit/execute the cell below with your own examples to see the result of normalization.

In [None]:
from string_utils import *
samples = [ 'Crème-brûlée', 'Außerirdische', ' foo    bar   ']
for sample in samples:
    print(f'{sample} ==> {normalize(sample)}')
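The implementation of `normalize` in `string_utils` is not shown in this notebook. As a rough illustration of the five steps listed above, here is a minimal sketch using only the standard library. The name `normalize_sketch` is made up, and details such as dropping hyphens without replacement are assumptions that may differ from the real helper. The steps are applied in a slightly different order so that spaces left over after removing punctuation are collapsed as well.

```python
import re
import unicodedata


def normalize_sketch(text):
    """Illustrative normalization; not the notebook's actual `normalize`."""
    # ß has no Unicode decomposition, so it is replaced explicitly.
    text = text.replace('ß', 'ss')
    # Decompose accented characters (é -> e + combining accent) and
    # drop the combining marks, leaving the base letters.
    text = unicodedata.normalize('NFKD', text)
    text = ''.join(c for c in text if not unicodedata.combining(c))
    # Remove everything except ASCII letters, digits and spaces.
    text = re.sub(r'[^a-zA-Z0-9 ]', '', text)
    # Collapse runs of spaces, trim, and lowercase.
    text = re.sub(r' +', ' ', text).strip()
    return text.lower()


for sample in ['Crème-brûlée', 'Außerirdische', ' foo    bar   ']:
    print(f'{sample!r} ==> {normalize_sketch(sample)!r}')
```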

## Tokenizing the transcript

In order to use the transcripts as training labels, they need to be tokenized first. By tokenizing we mean splitting the transcription into words and then into characters. The tokens are the characters of the transcription, with a special token `<space>` marking the boundary between two words.

The mapping of audio to text is actually a classification problem: each part of the audio signal is mapped to a specific character (i.e. _token_). Since RNNs (and TensorFlow) work with numeric data, we need to encode the tokens as integers. The following table shows how the encoding is done:

| **Token**    | `<space>` | `a` | `b` | `c` | ... | `z`  |
|--------------|:---------:|:---:|:---:|:---:|:---:|:----:|
| **Encoding** | `0`       | `1` | `2` | `3` | ... | `26` |

The following table shows how a transcript is converted to its encoded form:

| **Original transcript** | The quick, brown fox jumps over the lazy dog!  |
|-------------------------|------------------------------------------------|
| **Normalized transcript** | the quick brown fox jumps over the lazy dog |
| **Tokenized transcript** | `['t', 'h', 'e', '<space>', 'q', 'u', 'i', 'c', 'k', '<space>', 'b', 'r', 'o', 'w', 'n', ...]` |
| **Encoded transcript** | `[20, 8, 5, 0, 17, 21, 9, 3, 11, 0, 2, 18, 15, 23, 14, ...]` |
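The two tables can be reproduced with a few lines of Python. This is a minimal sketch of the scheme described above (`<space>` maps to `0`, `a` to `z` map to `1` to `26`); the function names carry a `_sketch` suffix because they are illustrative and not the helpers from `rnn_utils`.

```python
SPACE_TOKEN = '<space>'


def tokenize_sketch(text):
    """Split a normalized transcript into character tokens."""
    return [SPACE_TOKEN if c == ' ' else c for c in text]


def encode_sketch(tokens):
    """Map tokens to integers: <space> -> 0, a..z -> 1..26."""
    encoded = []
    for token in tokens:
        if token == SPACE_TOKEN:
            encoded.append(0)
        elif len(token) == 1 and 'a' <= token <= 'z':
            encoded.append(ord(token) - ord('a') + 1)
        else:
            # Digits and other characters are not covered by the table.
            raise ValueError(f'token not in alphabet: {token!r}')
    return encoded


def decode_sketch(values):
    """Inverse of encode_sketch, returning the plain text."""
    return ''.join(' ' if v == 0 else chr(v - 1 + ord('a')) for v in values)


tokens = tokenize_sketch('the quick brown')
print(tokens)
print(encode_sketch(tokens))
```

Decoding the integer sequence with `decode_sketch` returns the normalized transcript, which is how the output of the network can be turned back into readable text.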

## The issue with numbers

Numbers in transcripts pose a special problem, since their pronunciation differs fundamentally from their textual representation if they are written with digits (which is usually the case). Consider the number `8`, which is represented textually by the digit `'8'` and is pronounced as `eight`. In this case, the actual sequence of characters (`'e', 'i', 'g', 'h', 't'`) is replaced by a single character `'8'` and can therefore not be approximated like ordinary words.

The problem becomes even harder since compound numbers are sometimes pronounced differently than their individual digits would be. Consider the number `13`, which is pronounced `'thirteen'` (and not `'onethree'`!). This becomes especially important in languages like German, which swap tens and units (e.g. `'21'` is pronounced `'einundzwanzig'`, literally `'one-and-twenty'`).

Since numbers are a problem of their own, we want to limit their influence on the training process by training the RNN only on transcripts without numbers. We can filter those out by calling the corpus entry like a function and passing in the `numeric=False` argument to get only those speech segments whose transcripts do not contain numbers:

In [None]:
corpus_entry = rl_corpus_de['edznachrichten180111']
corpus_entry.summary()

corpus_entry_nonnumeric = corpus_entry(numeric=False)
corpus_entry_nonnumeric.summary()
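How the `numeric=False` filter is implemented inside the corpus classes is not shown here. Conceptually it amounts to dropping every segment whose transcript contains a digit. The following sketch illustrates that idea on plain `(segment_id, transcript)` tuples; the data structure, the function names and the example transcripts are all made up for illustration.

```python
import re


def contains_number(transcript):
    """Return True if the transcript contains at least one digit."""
    return re.search(r'\d', transcript) is not None


def filter_non_numeric(segments):
    """Keep only (segment_id, transcript) pairs without digits."""
    return [seg for seg in segments if not contains_number(seg[1])]


# Made-up example segments, not taken from the corpus.
segments = [
    ('seg-1', 'guten morgen'),
    ('seg-2', 'es ist 8 uhr'),
    ('seg-3', 'am 21 januar'),
    ('seg-4', 'auf wiedersehen'),
]
for segment_id, transcript in filter_non_numeric(segments):
    print(segment_id, transcript)
```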

## RNN architecture

We train an RNN with the following architecture (inspired by [this repository](https://github.com/philipperemy/tensorflow-ctc-speech-recognition)):

* number of features: 13
* number of hidden layers: 1
* RNN-cell type: LSTM

### RNN cost
We measure the cost in two ways:

* CTC-cost
* Label Error Rate (LER)

#### CTC cost
The CTC cost is calculated as follows:

tbd...

#### Label Error Rate (LER)
The LER is defined as the [edit distance](https://www.tensorflow.org/api_docs/python/tf/edit_distance) between prediction (_hypothesis_) and actual labels (_truth_), also called the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance), divided by the length of the truth. For example, the hypothesis `hello` and the truth `hallo` differ in 1 character and the truth has 5 characters, therefore the LER is `0.2`.

The LER can also be computed for strings of different lengths. The following table gives an overview of a few samples.

| Hypothesis | Truth | LER |
|---|---|---|
| 'hello' | 'hallo' | 0.2 |
| 'hell' | 'hallo' | 0.4 |
| 'hel' | 'hallo' | 0.6 |
| 'helloo' | 'hallo' | 0.4 |
| 'hellooo' | 'hallo' | 0.6 |
| 'helo' | 'hallo' | 0.4 |
| 'heloo' | 'hallo' | 0.4 |
| 'helooo' | 'hallo' | 0.6 |
| 'allo' | 'hallo' | 0.2 |
| 'elo' | 'hallo' | 0.6 |

You can also execute the cell below to see how the values are calculated with TensorFlow.

In [None]:
samples = [
    ('hello', 'hallo'),
    ('hell', 'hallo'),
    ('hel', 'hallo'),
    ('helloo', 'hallo'),
    ('hellooo', 'hallo'),
    ('helo', 'hallo'),
    ('heloo', 'hallo'),
    ('helooo', 'hallo'),
    ('allo', 'hallo'),
    ('elo', 'hallo'),
]

import tensorflow as tf
from rnn_utils import *

print('hypothesis'.ljust(15) + 'truth'.ljust(15) + 'ler'.ljust(10))
print('-' * 40)
for hypothesis, truth in samples:
    h_values = encode(hypothesis)
    h_indices = [[0, i] for i in range(len(h_values))]
    h_shape = [1, len(h_values)]
    h_tensor = tf.SparseTensor(indices=h_indices, values=h_values, dense_shape=h_shape)
    
    t_values = encode(truth)
    t_indices = [[0, i] for i in range(len(t_values))]
    t_shape = [1, len(t_values)]  # shape must match the truth, not the hypothesis
    t_tensor = tf.SparseTensor(indices=t_indices, values=t_values, dense_shape=t_shape)
    
    with tf.Session() as sess:
        ler = tf.edit_distance(h_tensor, t_tensor)
        edit_distance = sess.run(ler)
        print(f'{hypothesis.ljust(15)}{truth.ljust(15)}{str(edit_distance[0]).ljust(10)}')
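The values in the table can also be reproduced without TensorFlow. The following is a minimal pure-Python sketch, assuming the LER is the Levenshtein distance divided by the length of the truth (which matches the values listed above). The function names are made up for this illustration.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if ca == cb else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def label_error_rate(hypothesis, truth):
    """Edit distance normalized by the length of the truth."""
    return levenshtein(hypothesis, truth) / len(truth)


for hypothesis in ['hello', 'hell', 'hel', 'helloo', 'allo', 'elo']:
    print(hypothesis.ljust(15), label_error_rate(hypothesis, 'hallo'))
```

Because the distance is normalized by the length of the truth rather than the hypothesis, a hypothesis that is much longer than the truth can yield an LER above `1.0`.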

## Proof of Concept

The RNN is supposed to learn the relationship between an audio signal and its transcription, i.e. if trained properly it should be able to generate a transcription for any unseen audio signal afterwards. To see whether the RNN learns something useful we train the RNN on only a single corpus entry from which we only use the first five speech segments. For simplicity, only speech segments that do not contain numbers are considered.

For the _ReadyLingua_ corpus the first corpus entry is the poem _"An die Freude"_ by F. Schiller. Its first five segments have the following transcripts (normalized):

    an die freude von friedrich schiller
    freude schoner gotterfunken
    tochter aus elysium
    wir betreten feuertrunken himmlische dein heiligtum
    deine zauber binden wieder
    
We can now train the RNN with just these segments using the CTC- and LER-cost explained above. We train on 300 variants of the segments. In each variant the audio signal of the segment is cropped by 2000 frames (with a sampling rate of 16kHz this corresponds to 125ms). By comparing the actual transcription (_ground truth_) with the decoded output of the RNN (_prediction_) we can see that the RNN is indeed learning how to recognize speech.


In the beginning the generated transcript does not make much sense: it consists mainly of the letter `e`, which happens to be the most frequent character in German text. As training proceeds, the generated transcripts become clearer until they match the actual transcripts almost perfectly, as the following log excerpts show. This is also reflected in the curves for the CTC and LER cost shown after the log:

    Iteration 4:
    Ground truth (train-set):     wir betreten feuertrunken himmlische dein heiligtum
    Prediction (train-set):       e ee e e e e
    
    Iteration 26:
    Ground truth (train-set):     wir betreten feuertrunken himmlische dein heiligtum
    Prediction (train-set):       ie enienuenienheieiu
    
    Iteration 112:
    Ground truth (train-set):     wir betreten feuertrunken himmlische dein heiligtum
    Prediction (train-set):       i betreterteueruner himmlische en heitum
    
    Iteration 178:
    Ground truth (train-set):     wir betreten feuertrunken himmlische dein heiligtum
    Prediction (train-set):       wi betreteneuertrunkeni himmlische dein heiligtum

    Iteration 298:
    Ground truth (train-set):     wir betreten feuertrunken himmlische dein heiligtum
    Prediction (train-set):       wir betreten feuertrunken himmlische dein heigtum

<img src="./assets/cost_ctc_sample.png" alt="CTC cost" style="width: 450px; float: left;"/>
<img src="./assets/cost_ler_sample.png" alt="LER cost" style="width: 450px;"/>

Of course, by re-using the same samples 300 times the RNN has hopelessly overfitted to them. The RNN will not generalize well, meaning it will not perform well on unseen examples, so the result is not representative. However, the PoC shows that the RNN is generally able to learn something useful given enough data to train on.

## Measuring the performance
To compare the transcription produced by the RNN with the actual transcription we need a way to measure the similarity between the two texts. One way of doing this is the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance), which measures similarity as the edit distance (the number of single-character insertions, deletions or substitutions required to change one text into the other). A lower value means a higher degree of similarity.