# RNN prototype

From the previous notebooks the following data was created:

1. audio signal resampled to 16kHz (mono)
2. segmentation of the audio signal into speech- and pause-segments derived from the raw data or through WebRTC
3. spectrograms or MFCC of the audio signal to use as training data
4. transcriptions for the raw speech segments in original and normalized form

We can now train an RNN that will learn the relationship between an audio signal and its textual representation using the [CTC loss function](https://www.cs.toronto.edu/~graves/icml_2006.pdf) as described in the [introduction](00_introduction.ipynb).

The spectrograms of speech segments (3.) and their transcriptions (4.) are called the _labelled data_. This data consists of the input data `X` and the labels `Y`. To train the network we use only a part of it, the **training set**. To detect and limit overfitting on this set we hold out another part of the labelled data as a **validation/dev set**.

This notebook describes the training of an RNN prototype. The prototype does not represent a valid RNN that can be used for speech recognition. Instead, it is trained on various inputs. The results are then compared in order to gain knowledge about how different properties of the input influence the result.

The goal of this process is to gather the information needed to train a real RNN later. If the training is successful, the RNN will be able to output a transcription for unknown instances that is roughly equivalent to the actual transcription. We can evaluate the RNN's performance by comparing its output with the actual transcription on a **test set**.
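The split into training, validation/dev and test set can be sketched as follows. This is a minimal illustration, not the code used in this notebook: the function name `split_labelled_data`, the fractions and the placeholder samples are assumptions made for the example.

```python
import random


def split_labelled_data(samples, train_frac=0.8, dev_frac=0.1, seed=42):
    """Shuffle (x, y) pairs and split them into train, dev and test sets."""
    if not 0 < train_frac + dev_frac < 1:
        raise ValueError('train_frac + dev_frac must be in (0, 1)')
    shuffled = list(samples)
    # a fixed seed keeps the split reproducible between runs
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_dev = int(len(shuffled) * dev_frac)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # whatever remains is the test set
    return train, dev, test


# placeholder labelled data: (input X, label Y) pairs
labelled = [(f'spectrogram_{i}', f'transcript_{i}') for i in range(20)]
train_set, dev_set, test_set = split_labelled_data(labelled)
print(len(train_set), len(dev_set), len(test_set))
```

The three sets are disjoint, so the test set contains only instances the RNN has not seen during training or validation.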

In [None]:
corpus_root = r'E:/' # define the path to where the corpus files are located!

As usual, some imports and some helper functions need to be defined before we start.

In [None]:
import os

from util.corpus_util import *
from util.audio_util import *

rl_corpus_root = os.path.join(corpus_root, 'readylingua-corpus')
ls_corpus_root = os.path.join(corpus_root, 'librispeech-corpus')

## Procedure

Since the pronunciation of a given piece of text is highly dependent on the language, we train the RNN on only one language of the corpus at a time. Therefore a separate RNN needs to be trained for each language.

To show the impact of various properties of the training set, the following Proofs of Concept (PoC) are created:

* **PoC #1 (benchmark)**: We start with a very simple RNN that is only trained on five speech segments from a single corpus entry in German and observe the training progress.
* **PoC #2 (convergence for German)**: We then compare the results by extending this example. The same RNN is trained on the same corpus entry, but this time with all speech segments. This should give us some hints as to how additional input from the same distribution will influence convergence during the learning progress.
* **PoC #3 (language, sequence length)**: The same RNN is trained again with only five speech segments, but those are taken from a corpus entry in a different language (English instead of German). From the result we hope to infer some information about how robust the RNN architecture is to various languages.
* **PoC #4 (convergence for English)**: The same RNN is trained again with all speech segments of the English corpus entry. Comparison with PoC #2 should give us some information about the influence of language on convergence.

The RNN is trained with both spectrograms and MFCC as features. Comparing the performance between both types of features should give some information about which features are better suited.

In [None]:
rl_corpus_de = load_corpus(rl_corpus_root)(languages='de')
rl_corpus_de.summary()

In [None]:
corpus_entry = rl_corpus_de['edznachrichten180111']
corpus_entry.summary()

corpus_entry_nonnumeric = corpus_entry(numeric=False)
corpus_entry_nonnumeric.summary()

## RNN architecture

We train an RNN with the following architecture (inspired by [this repository](https://github.com/philipperemy/tensorflow-ctc-speech-recognition)):

* number of features: 13
* number of hidden layers: 1
* RNN-cell type: LSTM

### RNN cost
We measure the cost in two ways:

* CTC-cost
* Label Error Rate (LER)

#### CTC cost
The CTC cost is calculated as follows:

tbd...
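As a rough sketch of the definition in the CTC paper linked at the top of this notebook: the CTC cost is the negative log of the total probability of all frame-wise paths that collapse to the target label sequence, where collapsing means merging repeated labels and then removing blanks. The brute-force version below only illustrates this definition on a tiny made-up example and is not the code used in this notebook. The names `BLANK`, `collapse` and `ctc_cost` are assumptions made for the example.

```python
import itertools
import math

BLANK = '_'  # assumption: symbol used here for the CTC blank label


def collapse(path):
    """Map a frame-wise path to a label sequence: merge repeats, then drop blanks."""
    merged = [label for label, _ in itertools.groupby(path)]
    return ''.join(label for label in merged if label != BLANK)


def ctc_cost(probs, alphabet, target):
    """Brute-force CTC cost -ln p(target | x) for a tiny example.

    probs[t][k] is the output probability of alphabet[k] at frame t.
    """
    total = 0.0
    # enumerate every possible path (one label index per frame)
    for path in itertools.product(range(len(alphabet)), repeat=len(probs)):
        if collapse(alphabet[k] for k in path) == target:
            total += math.prod(probs[t][k] for t, k in enumerate(path))
    return -math.log(total) if total > 0 else math.inf


alphabet = ['a', 'b', BLANK]
# two frames with a uniform output distribution (made-up numbers)
probs = [[1 / 3, 1 / 3, 1 / 3]] * 2
print(collapse('hh_ee_ll_l_oo'))
print(ctc_cost(probs, alphabet, 'a'))
```

Enumerating all paths grows exponentially with the number of frames. The paper therefore computes the same sum with a forward-backward dynamic programming scheme, which is also what library implementations such as `tf.nn.ctc_loss` are based on.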

#### Label Error Rate (LER)
The LER is defined as the [edit distance](https://www.tensorflow.org/api_docs/python/tf/edit_distance) between prediction (_hypothesis_) and actual labels (_truth_), also called the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance), normalized by the length of the truth. Consider for example the LER for the hypothesis `hello` and the truth `hallo`: the two strings differ in 1 of 5 characters, therefore the LER is `0.2`.

The LER can also be computed for strings of different lengths. The following table gives an overview of a few samples.

| Hypothesis | Truth | LER |
|---|---|---|
| 'hello' | 'hallo' | 0.2 |
| 'hell' | 'hallo' | 0.4 |
| 'hel' | 'hallo' | 0.6 |
| 'helloo' | 'hallo' | 0.4 |
| 'hellooo' | 'hallo' | 0.6 |
| 'helo' | 'hallo' | 0.4 |
| 'heloo' | 'hallo' | 0.4 |
| 'helooo' | 'hallo' | 0.6 |
| 'allo' | 'hallo' | 0.2 |
| 'elo' | 'hallo' | 0.6 |
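The values in the table can be reproduced without TensorFlow. The following is a minimal pure-Python sketch of the Levenshtein distance, divided by the length of the truth. The function names `levenshtein` and `label_error_rate` are chosen for this example only.

```python
def levenshtein(hypothesis, truth):
    """Number of single-character insertions, deletions and substitutions."""
    # previous[j] = distance between the processed hypothesis prefix and truth[:j]
    previous = list(range(len(truth) + 1))
    for i, h_char in enumerate(hypothesis, start=1):
        current = [i]
        for j, t_char in enumerate(truth, start=1):
            current.append(min(
                previous[j] + 1,                       # delete h_char
                current[j - 1] + 1,                    # insert t_char
                previous[j - 1] + (h_char != t_char),  # substitute or match
            ))
        previous = current
    return previous[-1]


def label_error_rate(hypothesis, truth):
    """Edit distance normalized by the length of the truth."""
    return levenshtein(hypothesis, truth) / len(truth)


for hypothesis in ['hello', 'hell', 'helooo', 'allo']:
    print(hypothesis.ljust(10), label_error_rate(hypothesis, 'hallo'))
```

Because the distance is divided by the length of the truth, the LER can exceed `1.0` when the hypothesis is much longer than the truth.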

You can also execute the cell below to see how the values are calculated with TensorFlow.

In [None]:
samples = [
    ('hello', 'hallo'),
    ('hell', 'hallo'),
    ('hel', 'hallo'),
    ('helloo', 'hallo'),
    ('hellooo', 'hallo'),
    ('helo', 'hallo'),
    ('heloo', 'hallo'),
    ('helooo', 'hallo'),
    ('allo', 'hallo'),
    ('elo', 'hallo'),
]

import tensorflow as tf
from rnn_utils import *

# print a header that lines up with the 15/15/10-character columns below
print('hypothesis'.ljust(15) + 'truth'.ljust(15) + 'ler'.ljust(10))
print('-' * 40)

def to_sparse_tensor(text):
    # encode a string as a SparseTensor holding a batch of one label sequence
    values = encode(text)
    indices = [[0, i] for i in range(len(values))]
    return tf.SparseTensor(indices=indices, values=values, dense_shape=[1, len(values)])

# tf.Session is part of the TensorFlow 1.x API
with tf.Session() as sess:
    for hypothesis, truth in samples:
        # tf.edit_distance normalizes by the length of the truth by default
        ler = tf.edit_distance(to_sparse_tensor(hypothesis), to_sparse_tensor(truth))
        edit_distance = sess.run(ler)
        print(f'{hypothesis.ljust(15)}{truth.ljust(15)}{str(edit_distance[0]).ljust(10)}')

## Proof of Concept

The RNN is supposed to learn the relationship between an audio signal and its transcription, i.e. if trained properly it should be able to generate a transcription for any unseen audio signal afterwards. To see whether the RNN learns something useful we train the RNN on only a single corpus entry. For simplicity, only speech segments that do not contain numbers are considered. 

To get an intuition for how sensitive the RNN is to different properties of the input data, we examine two aspects more closely: the language and the length of the speech segments. To do this we use corpus entries in different languages, where the average sequence length in one entry is considerably longer than in the other. To minimize the influence of pitch, only corpus entries with female speakers are considered. For the sake of simplicity we assume that the speaking rate is similar across the samples. Consequently, the term _length_ refers to both the length of the audio segment (in seconds) and the length of the transcript (number of characters).

This leaves us with the RNN being trained in three variations:

* **PoC #1**: Train on a corpus entry in German
* **PoC #2**: Train on a corpus entry in English with an average segment length similar to the sample used for PoC #1
* **PoC #3**: Train on a corpus entry in English with an average segment length that is longer than in the sample used for PoC #2

By comparing PoC #1 and #2 we get a feeling for how much the training depends on the language of the corpus entries. By comparing PoC #2 and #3 we get a feeling for how much the training depends on the length of the speech segments. 

### PoC #1: German sample (short segments)
For the _ReadyLingua_ corpus the first corpus entry in German is the poem _"An die Freude"_ by F. Schiller. For the first PoC we will train the RNN exclusively on this sample.

#### Partial sample

To get fast feedback on the learning progress we will train the RNN only on the first five speech segments of the training sample. These segments have the following transcripts (normalized):

    1. an die freude von friedrich schiller
    2. freude schoner gotterfunken
    3. tochter aus elysium
    4. wir betreten feuertrunken himmlische dein heiligtum
    5. deine zauber binden wieder
    
We can now train the RNN with just these segments using the CTC and LER cost explained above. We train on 300 variants of the segments. In each variant the audio signal of the segment is cropped by a random number of frames between 1 and 2000. At a sampling rate of 16kHz this corresponds to at most 125ms of audio. By comparing the actual transcription (_ground truth_) with the decoded output of the RNN (_prediction_) we can see that the RNN is indeed learning how to recognize speech.

We can see that in the beginning the generated transcript does not make much sense: it consists mainly of the letter `e`, which happens to be the most frequent character in German text. As training proceeds, the generated transcripts become clearer until they match the actual transcripts almost perfectly. This is also reflected in the curves for the CTC and LER cost:

    Iteration 4:
    Ground truth (train-set):     wir betreten feuertrunken himmlische dein heiligtum
    Prediction (train-set):       e ee e e e e
    
    Iteration 26:
    Ground truth (train-set):     wir betreten feuertrunken himmlische dein heiligtum
    Prediction (train-set):       ie enienuenienheieiu
    
    Iteration 112:
    Ground truth (train-set):     wir betreten feuertrunken himmlische dein heiligtum
    Prediction (train-set):       i betreterteueruner himmlische en heitum
    
    Iteration 178:
    Ground truth (train-set):     wir betreten feuertrunken himmlische dein heiligtum
    Prediction (train-set):       wi betreteneuertrunkeni himmlische dein heiligtum

    Iteration 298:
    Ground truth (train-set):     wir betreten feuertrunken himmlische dein heiligtum
    Prediction (train-set):       wir betreten feuertrunken himmlische dein heigtum

<img src="../assets/cost_ctc_sample.png" alt="CTC cost" style="width: 450px; float: left;"/>
<img src="../assets/cost_ler_sample.png" alt="LER cost" style="width: 450px;"/>

Of course by re-using the same sample 300 times the RNN has hopelessly overfitted to that sample. The RNN will not generalize well, meaning it is not expected to perform well on unseen examples. The result is therefore not representative. However, the PoC shows that the RNN is generally able to learn something useful.
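The random cropping used to create the variants can be sketched as follows. This is a hedged illustration only: it assumes that the frames are removed from the start of the signal and that all variants are created from one segment, neither of which is stated above. The names `random_crop` and `make_variants` and the placeholder signal are made up for the example.

```python
import random

SAMPLE_RATE = 16000  # Hz
MAX_CROP = 2000      # frames, i.e. 2000 / 16000 s = 125 ms


def random_crop(signal, rng, max_crop=MAX_CROP):
    """Drop between 1 and max_crop frames from the start of the signal."""
    n_crop = rng.randint(1, max_crop)  # both bounds are inclusive
    return signal[n_crop:]


def make_variants(signal, n_variants=300, seed=0):
    """Create n_variants randomly cropped copies of one audio segment."""
    rng = random.Random(seed)
    return [random_crop(signal, rng) for _ in range(n_variants)]


# placeholder segment: half a second of silence instead of real audio
segment = [0.0] * (SAMPLE_RATE // 2)
variants = make_variants(segment)
print(len(variants), 'variants, max crop:', 1000 * MAX_CROP / SAMPLE_RATE, 'ms')
```

Each variant differs from the original only by a small shift in time, which is why training on them does not prevent overfitting to the five segments.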

#### Whole sample

To get a better intuition of how fast the RNN learns the relationship, we train it on the same corpus entry, but this time we use all speech segments. We can see that the RNN still learns the relationship between audio signal and transcription, but requires a considerably larger number of training epochs to get comparable results (i.e. costs similar to those reached when training on only five speech segments):

    tbd.    

### PoC #2: English sample (short segments)

For the second PoC we use a corpus entry with properties similar to those of the entry used in PoC #1. The first corpus entry in the _ReadyLingua_ corpus is ... . Its first five speech segments have the following transcripts:

    tbd...
    
Training the RNN the same way as shown above gives us the following progress:

    tbd...
    

### PoC #3: English sample (long segments)

For the third PoC we use the chapter _"Tom, the Piper's Son"_ from _"Mother Goose in Prose"_ in the _LibriSpeech_ corpus (`corpus_entry.id=121669`). The first five speech segments are:

    1. tom the pipers son
    2. the pig was eat and tom was beat and tom ran crying down the street
    3. he never did any work except to play the pipes and he played so badly that few pennies ever found their way into his pouch it was whispered around that old barney was not very honest
    4. but he was so sly and cautious that no one had ever caught him in the act of stealing although a good many things had been missed after they had fallen into the old mans way barney had one son named tom
    5. and they lived all alone in a little hut away at the end of the village street for toms mother had died when he was a baby you may not suppose that tom was a very good boy since he had such a queer father but neither was he very bad



## Measuring the performance
To compare the transcription produced by the RNN with the actual transcription we need a way to measure the similarity between these two texts. One way of doing this is the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance), which measures similarity as the edit distance (the number of single-character edits required to change text 1 into text 2), where a lower value means a higher degree of similarity.