# Step 3: Train an RNN

From the previous steps we have corpora with the following contents:

1. audio data sampled at 16 kHz (mono)
2. segmentation of the audio into speech- and pause-segments detected by WebRTC
3. segmentation of the audio into speech- and pause-segments derived from the raw data
4. transcriptions for the raw speech segments in original and normalized form
5. spectrograms of the audio to use as training data
6. labels for the spectrogram frames to use as training labels

We can now train an RNN that will learn the relationship between an audio signal and its textual representation using the [CTC loss function](https://www.cs.toronto.edu/~graves/icml_2006.pdf). 

The raw speech segments (3.) and their transcriptions (4.) are called the _labelled data_. For training, we use part of this labelled data as input data `X` and target labels `Y` in a **training set**. To detect overfitting on this set (and stop training before it gets worse), we hold out another part of the labelled data as a **validation/dev set**.

If the training is successful, the RNN will be able to output a transcription for unseen instances that is roughly equivalent to the actual transcription. We can evaluate the RNN's performance by comparing its output with the actual transcription on a **test set**.
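
The split into the three sets can be sketched as follows. This is only an illustration: the function name `split_labelled_data`, the 80/10/10 ratio, the fixed seed and the dummy samples are assumptions made for this example and are not taken from the corpus utilities used in this notebook.

```python
import random


def split_labelled_data(samples, train_frac=0.8, dev_frac=0.1, seed=42):
    """Shuffle the samples and split them into train, dev and test lists.

    The fractions are illustrative defaults (an assumption, not a
    recommendation from this notebook). Whatever is left after the
    train and dev parts becomes the test set.
    """
    if not 0 < train_frac + dev_frac < 1:
        raise ValueError('train_frac + dev_frac must be between 0 and 1')

    shuffled = list(samples)
    # a seeded generator keeps the split reproducible between runs
    random.Random(seed).shuffle(shuffled)

    n_train = int(len(shuffled) * train_frac)
    n_dev = int(len(shuffled) * dev_frac)

    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test


# hypothetical (speech segment, transcript) pairs standing in for real data
samples = [(f'segment_{i}', f'transcript {i}') for i in range(10)]
train, dev, test = split_labelled_data(samples)
print(len(train), len(dev), len(test))
```

Each sample ends up in exactly one of the three sets, so the test set contains only data the RNN has never seen during training or validation.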

```python
# root directory that contains the corpora created in the previous steps
corpus_root = r'E:/'
```

As usual, let's do the imports and set up the paths to the corpora before we start.

```python
import os

from corpus_util import *
from audio_util import *
from data_util import *

# directories of the ReadyLingua and LibriSpeech corpora
rl_corpus_root = os.path.join(corpus_root, 'readylingua-corpus')
ls_corpus_root = os.path.join(corpus_root, 'librispeech-corpus')

# paths to the serialized corpus files
rl_corpus_path = os.path.join(rl_corpus_root, 'readylingua.corpus')
ls_corpus_path = os.path.join(ls_corpus_root, 'librispeech.corpus')
```

Since the pronunciation of a given piece of text is highly dependent on the language, we train the RNN only on a specific language of the corpus. Therefore, let's load the corpus and extract the entries in German.

```python
# load the ReadyLingua corpus and keep only the German entries
rl_corpus_de = load_corpus(rl_corpus_path)('de')
rl_corpus_de.summary()
```

## Tokenizing the transcript

In order to use the transcripts as training labels, they need to be tokenized first. By tokenizing we mean splitting the transcription into words and then into characters. The tokens are the characters of the transcription, and a special token `<space>` marks the boundary between two words.

The mapping of audio to text is actually a classification problem: each part of the audio signal is mapped to a specific character (i.e. _token_). Since RNNs work on numeric data, we need to encode each token as an integer. The following table shows how the encoding is done:

| **Token**    | `<space>` | `a` | `b` | `c` | ... | `z`  |
|--------------|:---------:|:---:|:---:|:---:|:---:|:----:|
| **Encoding** | `0`       | `1` | `2` | `3` | ... | `26` |

The following table shows how a transcript is converted to its encoded form:

| **Original transcript** | The quick, brown fox jumps over the lazy dog!  |
|-------------------------|------------------------------------------------|
| **Normalized transcript** | the quick brown fox jumps over the lazy dog |
| **Tokenized transcript** | `['t', 'h', 'e', '<space>', 'q', 'u', 'i', 'c', 'k', '<space>', 'b', 'r', 'o', 'w', 'n', ...]` |
| **Encoded transcript** | `[20, 8, 5, 0, 17, 21, 9, 3, 11, 0, 2, 18, 15, 23, 14, ...]` |
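
The two tables above can be reproduced with a few lines of Python. This is a minimal sketch and not the implementation used by the utility modules of this notebook: the names `tokenize`, `encode` and `decode` are made up for illustration, the input is assumed to be already normalized, and only the characters `a` to `z` from the encoding table are covered (German umlauts would need additional entries).

```python
import string

SPACE_TOKEN = '<space>'

# '<space>' -> 0, 'a' -> 1, ..., 'z' -> 26 (same mapping as the encoding table)
CHAR_TO_INT = {SPACE_TOKEN: 0}
CHAR_TO_INT.update({c: i for i, c in enumerate(string.ascii_lowercase, start=1)})
INT_TO_CHAR = {i: c for c, i in CHAR_TO_INT.items()}


def tokenize(normalized_transcript):
    """Split a normalized transcript into character tokens.

    Words are separated by the special '<space>' token.
    """
    tokens = []
    for i, word in enumerate(normalized_transcript.split()):
        if i > 0:
            tokens.append(SPACE_TOKEN)
        tokens.extend(word)  # a string extends the list character by character
    return tokens


def encode(tokens):
    """Map each token to its integer label."""
    return [CHAR_TO_INT[token] for token in tokens]


def decode(encoded):
    """Map integer labels back to text (0 becomes a blank between words)."""
    return ''.join(' ' if i == 0 else INT_TO_CHAR[i] for i in encoded)


tokens = tokenize('the quick brown')
encoded = encode(tokens)
print(tokens)
print(encoded)
print(decode(encoded))
```

Keeping the inverse mapping around is useful later on, because the output of the RNN has to be decoded back into text before it can be compared with the actual transcription.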

## The issue with numbers

Numbers in transcripts pose a special problem, since their pronunciation differs fundamentally from their textual representation if they are written with digits (which is usually the case). Consider the number `8`, which is represented textually by the digit `'8'` and is pronounced as `eight`. In this case, the actual sequence of characters (`'e', 'i', 'g', 'h', 't'`) is replaced by a single character `'8'` and can therefore not be approximated like ordinary words.

The problem becomes even harder since compound numbers are sometimes pronounced differently than their individual digits would be. Consider the number `13`, which is pronounced `'thirteen'` (and not `'onethree'`!). This becomes especially important in languages like German, which say the units before the tens (e.g. `'21'` is pronounced as `'one-and-twenty'`).

Since numbers are a problem of their own, we want to limit their influence on the training process by training the RNN only on transcripts without numbers. We can filter those out by calling the corpus entry like a function and passing the `numeric=False` argument to get only those speech segments whose transcripts do not contain numbers:

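```python
# pick a single corpus entry and show its summary
corpus_entry = rl_corpus_de['edznachrichten180111']
corpus_entry.summary()

# keep only the speech segments whose transcripts contain no numbers
corpus_entry_nonnumeric = corpus_entry(numeric=False)
corpus_entry_nonnumeric.summary()
```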
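
How `numeric=False` is implemented inside the corpus classes is not shown here. Conceptually, such a filter only has to check whether a transcript contains a digit. The following sketch illustrates the idea on plain strings; the helper names `contains_number` and `filter_non_numeric` and the sample transcripts are hypothetical.

```python
import re

# matches any single digit character
DIGIT_PATTERN = re.compile(r'\d')


def contains_number(transcript):
    """Return True if the transcript contains at least one digit."""
    return DIGIT_PATTERN.search(transcript) is not None


def filter_non_numeric(transcripts):
    """Keep only the transcripts that contain no digits."""
    return [t for t in transcripts if not contains_number(t)]


# made-up transcripts for illustration
transcripts = ['es ist 8 uhr', 'guten morgen', 'am 21 januar']
non_numeric = filter_non_numeric(transcripts)
print(non_numeric)
```

Note that this only catches numbers written with digits. Numbers that are spelled out (e.g. `'eight'`) pass the filter, which is fine because their spelling matches their pronunciation.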

## Measuring the performance

To compare the transcription produced by the RNN with the actual transcription, we need a way to measure the similarity between these two texts. One way of doing this is the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance), which measures similarity as the edit distance: the minimum number of single-character edits (insertions, deletions or substitutions) required to change text 1 into text 2. A lower value means a higher degree of similarity.
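
To make the definition concrete, here is a small sketch of the Levenshtein distance using the classic dynamic programming approach. It is meant as an illustration only; the function name `levenshtein` is an assumption, and a tested library implementation may be preferable in practice.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character insertions,
    deletions and substitutions needed to turn string a into string b."""
    # keep the shorter string in the inner loop to save memory
    if len(a) < len(b):
        a, b = b, a

    # distances between the empty prefix of a and every prefix of b
    previous = list(range(len(b) + 1))

    for i, char_a in enumerate(a, start=1):
        # turning the first i characters of a into '' takes i deletions
        current = [i]
        for j, char_b in enumerate(b, start=1):
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(insertion, deletion, substitution))
        previous = current

    return previous[-1]


actual = 'the quick brown fox'
predicted = 'the quik brown fox'
distance = levenshtein(actual, predicted)
print(distance)
```

In this example the prediction is missing a single `c`, so one insertion is enough to recover the actual transcription and the distance is 1.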