# RNN prototype

From the previous notebooks the following data was created:

1. audio signal resampled to 16kHz (mono)
1. segmentation of the audio signal into speech- and pause-segments derived from the raw data or through WebRTC
1. spectrograms or MFCC of the audio signal to use as training data
1. transcriptions for the raw speech segments in original and normalized form

We can now train an RNN that will learn the relationship between an audio signal and its textual representation using the [CTC loss function](https://www.cs.toronto.edu/~graves/icml_2006.pdf) as described in the [introduction](00_introduction.ipynb).

The spectrograms of speech segments (3.) and their transcriptions (4.) are called the _labelled data_. This data consists of the input data `X` and the labels `Y`. To train the network we use only a part of it, referred to as the **training set**. To detect overfitting on this set we use another part of the labelled data to form a **validation/dev set**.

This notebook describes the training of an RNN prototype. The prototype is not a full RNN that can be used for speech recognition. Instead, it is trained on various inputs. The results are then compared in order to gain knowledge about the influence of different properties of the input on the result.

The goal of this process is to get valuable information to train a real RNN later. If the training is successful, the RNN will be able to output a transcription for unknown instances that is roughly equivalent to the actual transcription. We can evaluate the RNN's performance by comparing its output with the actual transcription on a **test-set**.
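As a rough illustration of the split into training, validation and test sets mentioned above, the following sketch partitions a list of labelled samples. The function name `split_labelled_data` and the 80/10/10 ratio are assumptions made for this example; the notebook itself does not prescribe a ratio.

```python
import random


def split_labelled_data(samples, train_frac=0.8, dev_frac=0.1, seed=42):
    """Shuffle (x, y) pairs and split them into train, dev and test sets.

    The fractions are illustrative assumptions, not values from this project.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)

    n_train = int(len(shuffled) * train_frac)
    n_dev = int(len(shuffled) * dev_frac)

    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # whatever remains
    return train, dev, test


# placeholder data: (features, transcript) pairs
labelled = [(f'spectrogram_{i}', f'transcript_{i}') for i in range(20)]
train, dev, test = split_labelled_data(labelled)
print(len(train), len(dev), len(test))
```

Shuffling before splitting keeps the three sets drawn from the same distribution; the fixed seed only makes the example reproducible.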

In [None]:
corpus_root = r'E:/' # define the path to where the corpus files are located!

As usual, some imports and some helper functions need to be defined before we start.

In [None]:
import os

from util.corpus_util import *
from util.audio_util import *

rl_corpus_root = os.path.join(corpus_root, 'readylingua-corpus')
ls_corpus_root = os.path.join(corpus_root, 'librispeech-corpus')

## Procedure

Since the pronunciation of a given piece of text is highly dependent on the language, we train the RNN on only one language of the corpus at a time. Therefore different RNNs need to be trained for different languages.

To show the impact of various properties of the training set, the following Proofs of Concept (PoC) are created:

* **PoC #1 (benchmark)**: We start with a very simple RNN that is only trained on five speech segments from a single corpus entry in German and observe the training progress.
* **PoC #2 (convergence for German)**: We then compare the results by extending this example. The same RNN is trained on the same corpus entry, but this time with all speech segments. This should give us some hints as to how additional input from the same distribution will influence convergence during the learning progress.
* **PoC #3 (language, sequence length)**: The same RNN is trained again with only five speech segments, but those are taken from a corpus entry in a different language (English instead of German). From the result we hope to infer some information about how robust the RNN architecture is to various languages.
* **PoC #4 (convergence for English)**: The same RNN is trained again with all speech segments of the English corpus entry. Comparison with PoC #2 should give us some information about the influence of language on convergence.

The RNN is trained with both spectrograms and MFCC as features. Comparing the performance between both types of features should give some information about which features are better suited.

In [None]:
rl_corpus_de = load_corpus(rl_corpus_root)(languages='de')
rl_corpus_de.summary()

In [None]:
corpus_entry = rl_corpus_de['edznachrichten180111']
corpus_entry.summary()

corpus_entry_nonnumeric = corpus_entry(numeric=False)
corpus_entry_nonnumeric.summary()

## RNN architecture

We train a simple RNN with the following architecture (inspired by [this repository](https://github.com/philipperemy/tensorflow-ctc-speech-recognition)):

* number of features: 13 for MFCC, 161 for spectrograms (see [notebook 3](03_feature_extraction.ipynb#From-raw-waves-to-spectrograms) on how to calculate those values)
* number of hidden layers: 1
* RNN-cell type: LSTM

### RNN cost
The RNN performance is measured with two metrics:

* CTC-loss
* Label Error Rate (LER)

The calculation of the CTC-loss has been described as part of the description of CTC in [the introduction](00_Introduction.ipynb#Calculating-the-CTC-loss-by-creating-valid-alignments-with-dynamic-programming). The LER is calculated as follows.

#### Label Error Rate (LER)
The LER is defined as the [edit distance](https://www.tensorflow.org/api_docs/python/tf/edit_distance) between prediction (_hypothesis_) and actual labels (_ground truth_), also called the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance), normalized by the length of the ground truth. Consider for example the LER for the hypothesis `hello` and the truth `hallo`: the two strings differ in 1 of 5 characters, therefore the LER is `0.2`.

The LER can also be computed for strings of different lengths. The following table gives an overview of a few samples.

| Hypothesis | Truth | LER |
|---|---|---|
| 'hello' | 'hallo' | 0.2 |
| 'hell' | 'hallo' | 0.4 |
| 'hel' | 'hallo' | 0.6 |
| 'helloo' | 'hallo' | 0.4 |
| 'hellooo' | 'hallo' | 0.6 |
| 'helo' | 'hallo' | 0.4 |
| 'heloo' | 'hallo' | 0.4 |
| 'helooo' | 'hallo' | 0.6 |
| 'allo' | 'hallo' | 0.2 |
| 'elo' | 'hallo' | 0.6 |

You can also execute the cell below to see how the values are calculated with TensorFlow.

In [None]:
samples = [
    ('hello', 'hallo'),
    ('hell', 'hallo'),
    ('hel', 'hallo'),
    ('helloo', 'hallo'),
    ('hellooo', 'hallo'),
    ('helo', 'hallo'),
    ('heloo', 'hallo'),
    ('helooo', 'hallo'),
    ('allo', 'hallo'),
    ('elo', 'hallo'),
]

import tensorflow as tf
from util.rnn_util import *

print('hypothesis'.ljust(15) + 'truth'.ljust(15) + 'LER'.ljust(10))
print('-'.join('' for _ in range(40)))
for hypothesis, truth in samples:
    # encode the hypothesis as a sparse tensor of shape [batch_size=1, length]
    h_values = encode(hypothesis)
    h_indices = [[0, i] for i in range(len(h_values))]
    h_shape = [1, len(h_values)]
    h_tensor = tf.SparseTensor(indices=h_indices, values=h_values, dense_shape=h_shape)

    # encode the ground truth the same way (its shape depends on its own length)
    t_values = encode(truth)
    t_indices = [[0, i] for i in range(len(t_values))]
    t_shape = [1, len(t_values)]
    t_tensor = tf.SparseTensor(indices=t_indices, values=t_values, dense_shape=t_shape)

    with tf.Session() as sess:
        # tf.edit_distance normalizes by the length of the truth by default
        ler = tf.edit_distance(h_tensor, t_tensor)
        edit_distance = sess.run(ler)
        print(f'{hypothesis.ljust(15)}{truth.ljust(15)}{str(edit_distance[0]).ljust(10)}')
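The values in the table can also be reproduced without TensorFlow. The following is a minimal sketch in plain Python; the function names `edit_distance` and `label_error_rate` are chosen for this example and are not part of the project's utilities. It assumes that the LER is the Levenshtein distance divided by the length of the ground truth, which matches the default normalization of `tf.edit_distance`.

```python
def edit_distance(hypothesis, truth):
    """Levenshtein distance (insertion, deletion and substitution cost 1 each)."""
    # previous[j] holds the distance between the already processed prefix of
    # `hypothesis` and the first j characters of `truth`
    previous = list(range(len(truth) + 1))
    for i, h_char in enumerate(hypothesis, start=1):
        current = [i]
        for j, t_char in enumerate(truth, start=1):
            substitution = previous[j - 1] + (h_char != t_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def label_error_rate(hypothesis, truth):
    """Edit distance normalized by the length of the ground truth."""
    return edit_distance(hypothesis, truth) / len(truth)


for hypothesis in ['hello', 'hell', 'helooo', 'allo']:
    print(hypothesis.ljust(15), label_error_rate(hypothesis, 'hallo'))
```

Because the distance is divided by the length of the truth, a hypothesis that is much longer than the truth can yield an LER above 1.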

## Training profiles

In order to use the project time efficiently, the RNN was trained several times with different profiles. A profile is a combination of values for the following configuration items:

| configuration item | possible values |
|---|---|
| language | German or English |
| feature type | MFCC, Mel-spectrogram or power spectrogram |
| data type | original or synthesized data (see below) |

An iterative approach was taken to train the RNNs. In each step of the iteration the value of a single configuration item was changed, and the resulting profile was used to train a PoC.

For each profile, the convergence behavior of the CTC- and LER-cost was assessed by inspecting their plots. This served as a basis for further decisions.

This approach should ensure that no time is wasted on creating a sophisticated setup that may or may not lead to better results. By changing only a single configuration item in each step, the impact of each change could be analyzed in isolation and conclusions for the following steps could be drawn. The following table shows the mapping of PoCs to their profiles:

| PoC id | language | feature type | data type |
|---|---|---|---|
| PoC #1 | German | MFCC | original |
| PoC #2 | German | MFCC | synthesized |
| PoC #3 | German | Mel-Spectrogram | original |
| PoC #4 | German | Mel-Spectrogram | synthesized |
| PoC #5 | German | Power-Spectrogram | original |
| PoC #6 | German | Power-Spectrogram | synthesized |
| PoC #7 | English | MFCC | original |
| PoC #8 | English | MFCC | synthesized |
| PoC #9 | English | Mel-Spectrogram | original |
| PoC #10 | English | Mel-Spectrogram | synthesized |
| PoC #11 | English | Power-Spectrogram | original |
| PoC #12 | English | Power-Spectrogram | synthesized |

### Synthesized training data

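Audio data lends itself particularly well to synthesis: new samples can be created by distorting the original data (change of tempo/loudness/pitch, adding echo/reverb or background noise etc.). To artificially reduce overfitting, the original audio can be distorted before training. The following cell plays the original audio followed by two distorted variants.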

In [None]:
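from IPython.display import Audio, display

# original audio of the corpus entry
audio, rate = corpus_entry.audio, corpus_entry.rate
display(Audio(data=audio, rate=rate))

# same audio with changed tempo (factor 2.0)
distorted = distort(audio, rate, tempo=2.0)
display(Audio(data=distorted, rate=rate))

# same audio with changed pitch
distorted = distort(audio, rate, pitch=True)
display(Audio(data=distorted, rate=rate))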

### Definition of convergence

Observing the trend of the cost is one way to gain information about the training process. Another piece of information that can be derived from the cost is the time needed until the training process has converged. This requires a definition of _convergence_. Here, training is considered converged when the LER-cost meets two criteria:

1. The mean LER-cost over the last 10 epochs must be below 0.05. This ensures that the predicted transcriptions deviate from the actual transcriptions by at most 5% (in terms of edit distance). The value of 5% is chosen somewhat arbitrarily, but it lies within the range of LERs reported for the best STT systems in research papers.
1. The change rate of the average LER-cost over the last 10 averages must be below 0.01, i.e. the last 10 average LER-costs must not change by more than 1% on average. This ensures that training is not stopped prematurely, because it only ends once the average cost has flattened and reached a plateau.
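The two criteria can be sketched in a few lines of plain Python. This is only one possible reading of the definition, not the project's implementation: the helper names `moving_averages` and `has_converged` are hypothetical, and the "change rate" is interpreted here as the mean relative change between consecutive 10-epoch averages.

```python
def moving_averages(values, window):
    """Mean of each trailing window of `window` consecutive values."""
    return [sum(values[i - window:i]) / window
            for i in range(window, len(values) + 1)]


def has_converged(ler_per_epoch, window=10, max_ler=0.05, max_change=0.01):
    """Check both convergence criteria on a list of per-epoch LER values."""
    averages = moving_averages(ler_per_epoch, window)
    if len(averages) < window:
        return False  # not enough epochs yet to judge the trend

    recent = averages[-window:]
    # criterion 1: the mean LER over the last `window` epochs is low enough
    low_enough = recent[-1] < max_ler
    # criterion 2: consecutive averages change by less than `max_change`
    # on average (relative change; an assumption about the intended meaning)
    changes = [abs(b - a) / a for a, b in zip(recent, recent[1:]) if a > 0]
    mean_change = sum(changes) / len(changes) if changes else 0.0
    return low_enough and mean_change < max_change


print(has_converged([0.5] * 30))   # plateau, but LER too high
print(has_converged([0.04] * 30))  # low LER on a plateau
```

With a window of 10, at least 19 epochs are needed before ten averages exist, so the check cannot report convergence earlier than that.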

## Proof of Concept

The RNN is supposed to learn the relationship between an audio signal and its transcription, i.e. if trained properly it should be able to generate a transcription for any unseen audio signal afterwards. To see whether the RNN is even able to learn something useful with the given architecture we train the RNN on only a single corpus entry. For simplicity, only speech segments that do not contain numbers are considered. 

To get an intuition for how sensitive the RNN is to different properties of the input data, we examine two aspects more closely: the language and the sequence length. To do this we use two corpus entries in different languages (German and English), where the average speech sequence length in one entry (English) is considerably longer than in the other. To minimize the influence of acoustic properties like the pitch of the voice, only corpus entries with female speakers were considered. For the sake of simplicity it is assumed that the speaking rate of the samples is similar. As a consequence the term _length_ refers to both the length of the audio segment (in seconds) and the transcript length (number of characters).

This leaves us with the RNN being trained in four variations:

* **PoC #1 (benchmark)**: Train the RNN on a German corpus entry. Only five speech segments are used. Observe the training progress. This should give us a benchmark to compare to.
* **PoC #2 (convergence for German)**: Train the same RNN on the same corpus entry, but this time with all speech segments.
* **PoC #3 (language, sequence length)**: Train the same RNN again, but with a corpus entry in English. Only five speech segments are considered. Compare the result with PoC#1.
* **PoC #4 (convergence for English)**: Train the same RNN again with all speech segments of the English corpus entry.

Comparing PoC #1 and #2 should give us some hints as to how additional input from the same distribution will impact convergence during the learning progress. Comparing PoC #1 and #3 should allow for inferring how robust the RNN architecture is to various languages and average sequence lengths. By comparing PoC #2 and #4 we should get a feeling for the influence of language on convergence speed.

### PoC #1
For the _ReadyLingua_ corpus the first corpus entry in German is the poem _"An die Freude"_ by F. Schiller. The first PoC is trained exclusively on the first five speech segments. These segments have the following transcripts (normalized):

    1. an die freude von friedrich schiller
    2. freude schoner gotterfunken
    3. tochter aus elysium
    4. wir betreten feuertrunken himmlische dein heiligtum
    5. deine zauber binden wieder
    
The RNN is trained by repeatedly feeding it these training samples. One iteration over the whole training set is called an _epoch_. The RNN is trained until convergence as defined above. The LER- and CTC-cost are measured after each epoch. Additionally, an artificial validation sample is created by randomly choosing a sample from the training set and shifting its audio signal to the left. Shifting is done by cropping a random number (between 1 and 2000) of samples from the beginning. With a sampling rate of 16kHz this corresponds to an audio segment of 125ms at most.
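The shifting step can be sketched with NumPy as follows. The function name `shift_left` and the placeholder noise signal are assumptions for illustration only; in the notebook the audio would come from a speech segment of the corpus entry.

```python
import numpy as np


def shift_left(audio, rng, max_shift=2000):
    """Crop between 1 and `max_shift` samples from the start of the signal."""
    shift = int(rng.integers(1, max_shift + 1))  # upper bound is exclusive
    return audio[shift:], shift


rate = 16000
rng = np.random.default_rng(seed=0)
audio = rng.standard_normal(3 * rate)  # 3 seconds of placeholder noise

shifted, shift = shift_left(audio, rng)
print(f'cropped {shift} samples = {1000 * shift / rate:.1f} ms')
```

The maximum shift of 2000 samples at 16000 samples per second gives the 125ms stated above.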

Of course, by training on such a small training set the RNN will hopelessly overfit to those 5 samples. Also, the validation set consists of variants of the training set and is therefore not representative, because the RNN will have already seen the samples in some other form.

However, although the results are not representative, this PoC is a quick and cheap way to validate whether the RNN is able to learn the relationship between audio signal and transcription at all. The results serve as a baseline to compare to.

#### Results

By comparing the actual transcription (_ground truth_) with the decoded output of the RNN (_prediction_) we can see that the RNN indeed learns how to recognize speech. The following log extract shows the predictions at various stages of the training.

    Epoch 1:
    Ground truth (train-set):     wir betreten feuertrunken himmlische dein heiligtum
    Prediction (train-set):       e    
    
    Epoch 5:
    Ground truth (train-set):     wir betreten feuertrunken himmlische dein heiligtum
    Prediction (train-set):       irie e eiei e e e e ei 
    
    Epoch 25:
    Ground truth (train-set):     wir betreten feuertrunken himmlische dein heiligtum
    Prediction (train-set):       tirberete euernke uishe en heitum
    
    Epoch 38:
    Ground truth (train-set):     wir betreten feuertrunken himmlische dein heiligtum
    Prediction (train-set):       wir betreten feuernkenfmimische ein heiligtum

    Epoch 63:
    Ground truth (train-set):     wir betreten feuertrunken himmlische dein heiligtum
    Prediction (train-set):       wir betreten feuertrunken himlische dein heiligum
    
It is evident that in the first few epochs the generated transcript does not make much sense. It consists mainly of the letter `e`, which happens to be the most frequent letter in German text. With further iterations, however, the generated transcripts become clearer until they match up almost perfectly with the actual transcripts. After just 63 epochs the RNN produces transcripts that are very similar to the original ones. Note that between the last two epochs shown above the RNN has actually unlearned how to recognize the word _heiligtum_. Such regressions are possible because every epoch updates all weights of the network, including those of the forget gate $\Gamma_f$ of the LSTM cells, which controls how much of the cell value of the previous time step is used to calculate the cell value in the current time step. For more information about LSTM cells see [Christopher Olah's blog about understanding LSTM cells](http://colah.github.io/posts/2015-08-Understanding-LSTMs/).
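To make the role of the forget gate more concrete, the following NumPy sketch computes a single time step of a standard LSTM cell. It is not the implementation used for the PoC (which relies on TensorFlow). The parameter names (`W_f`, `b_f` etc.), the hidden size of 4 and the random weights are assumptions for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x_t, h_prev, c_prev, params):
    """One forward step of a standard LSTM cell."""
    z = np.concatenate([h_prev, x_t])  # gates see previous output and input
    gamma_f = sigmoid(params['W_f'] @ z + params['b_f'])  # forget gate
    gamma_u = sigmoid(params['W_u'] @ z + params['b_u'])  # update (input) gate
    gamma_o = sigmoid(params['W_o'] @ z + params['b_o'])  # output gate
    c_tilde = np.tanh(params['W_c'] @ z + params['b_c'])  # candidate cell value

    # gamma_f scales how much of the previous cell value is kept
    c_t = gamma_f * c_prev + gamma_u * c_tilde
    h_t = gamma_o * np.tanh(c_t)
    return h_t, c_t, gamma_f


n_features, n_hidden = 13, 4  # 13 MFCC features; the hidden size is arbitrary
rng = np.random.default_rng(0)
params = {}
for gate in 'fuoc':
    params[f'W_{gate}'] = rng.standard_normal((n_hidden, n_hidden + n_features)) * 0.1
    params[f'b_{gate}'] = np.zeros(n_hidden)

h, c = np.zeros(n_hidden), np.ones(n_hidden)
h, c, gamma_f = lstm_step(rng.standard_normal(n_features), h, c, params)
print(gamma_f)  # values between 0 (forget everything) and 1 (keep everything)
```

The gate values are not parameters themselves but are computed from the trainable weights `W_f` and `b_f`, so they change whenever training updates those weights.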

The learning progress is also reflected in the plots for the CTC- and LER-cost:    

<table>
    <tr>
        <td>
            <img src="../assets/poc1_ctc.png" alt="CTC cost" style="width: 450px; "/>
        </td>
        <td>
            <img src="../assets/poc1_ler.png" alt="LER cost" style="width: 450px;"/>
        </td>
    </tr>
</table>

#### Interpretation
The PoC shows that the RNN is generally able to learn something useful. However, the results are not representative as the RNN will not generalize well, meaning it is not expected to perform well on unseen examples. The results can still be used as a reference value.

### PoC #2

To get a better intuition on how fast the RNN learns, it was trained on the same data again using MFCC as features. However, the original data was only included in the first epoch. For all following epochs, the audio was distorted by slightly changing the tempo by a random factor between 0.8 and 1.2. This resulted in synthesized data which should simulate slower and faster speakers.

#### Results

Because the RNN did not converge, training was aborted after more than 13,000 epochs. By that time the LER oscillated between 0.07 and 0.08 with average change rates below 0.1%, which is close to the convergence criteria.

<table>
    <tr>
        <td>
            <img src="../assets/poc2_ctc.png" alt="CTC cost" style="width: 450px; "/>
        </td>
        <td>
            <img src="../assets/poc2_ler.png" alt="LER cost" style="width: 450px;"/>
        </td>
    </tr>
</table>


This is also reflected in the results of the last epochs:

    Ground truth (train-set):     an die freude von friedrich schiller
    Prediction (train-set):       an die freude von f friedrich schiler            
    Ground truth (train-set):     freude schoner gotterfunken
    Prediction (train-set):       freude schoner goterfunken                       
    Ground truth (train-set):     tochter aus elysium
    Prediction (train-set):       tochter aus elysium                              
    Ground truth (train-set):     wir betreten feuertrunken himmlische dein heiligtum
    Prediction (train-set):       wr betren feuertrunken himlische dein h heiligtum
    Ground truth (train-set):     deine zauber binden wieder
    Prediction (train-set):       deine zauber binde wieder    
  
#### Interpretation

We can see that the RNN still learns the relationship between audio signal and transcription when trained on synthesized data. In contrast to PoC #1 the training batches were similar, but no two training samples were exactly the same. This reduced overfitting and the risk of the RNN learning the mapping by heart, but required a considerably larger number of training epochs to get comparable results (more than 13k epochs compared to only a few dozen before).

### PoC #3

For the third PoC we use a corpus entry with properties similar to those in PoC #1. The first corpus entry in the _LibriSpeech_ corpus is ... . Five speech segments with lengths similar to those in PoC #1 have the following transcripts:

<table>
    <tr>
        <td>
            <img src="../assets/poc3_ctc.png" alt="CTC cost" style="width: 450px; "/>
        </td>
        <td>
            <img src="../assets/poc3_ler.png" alt="LER cost" style="width: 450px;"/>
        </td>
    </tr>
</table>

#### Results
Training the RNN the same way as shown above gives us the following progress:

    tbd...   

### PoC #4

#### Results

<table>
    <tr>
        <td>
            <img src="../assets/poc4_ctc.png" alt="CTC cost" style="width: 450px; "/>
        </td>
        <td>
            <img src="../assets/poc4_ler.png" alt="LER cost" style="width: 450px;"/>
        </td>
    </tr>
</table>

#### Interpretation


### PoC #5

#### Results

<table>
    <tr>
        <td>
            <img src="../assets/poc5_ctc.png" alt="CTC cost" style="width: 450px; "/>
        </td>
        <td>
            <img src="../assets/poc5_ler.png" alt="LER cost" style="width: 450px;"/>
        </td>
    </tr>
</table>

#### Interpretation


### PoC #6

#### Results

<table>
    <tr>
        <td>
            <img src="../assets/poc6_ctc.png" alt="CTC cost" style="width: 450px; "/>
        </td>
        <td>
            <img src="../assets/poc6_ler.png" alt="LER cost" style="width: 450px;"/>
        </td>
    </tr>
</table>

#### Interpretation


### PoC #7

#### Results

<table>
    <tr>
        <td>
            <img src="../assets/poc7_ctc.png" alt="CTC cost" style="width: 450px; "/>
        </td>
        <td>
            <img src="../assets/poc7_ler.png" alt="LER cost" style="width: 450px;"/>
        </td>
    </tr>
</table>

#### Interpretation


### PoC #8

#### Results

<table>
    <tr>
        <td>
            <img src="../assets/poc8_ctc.png" alt="CTC cost" style="width: 450px; "/>
        </td>
        <td>
            <img src="../assets/poc8_ler.png" alt="LER cost" style="width: 450px;"/>
        </td>
    </tr>
</table>

#### Interpretation


### PoC #: English sample (long segments)

For the third PoC the chapter _"Tom, the Piper's Son"_ of _"Mother Goose in Prose"_ from the _LibriSpeech_ corpus was used (`corpus_entry.id=121669`). The first five speech segments are:

    1. tom the pipers son
    2. the pig was eat and tom was beat and tom ran crying down the street
    3. he never did any work except to play the pipes and he played so badly that few pennies ever found their way into his pouch it was whispered around that old barney was not very honest
    4. but he was so sly and cautious that no one had ever caught him in the act of stealing although a good many things had been missed after they had fallen into the old mans way barney had one son named tom
    5. and they lived all alone in a little hut away at the end of the village street for toms mother had died when he was a baby you may not suppose that tom was a very good boy since he had such a queer father but neither was he very bad