# ASR stage

From the previous notebooks the following data was created:

1. **audio signal**: resampled to 16kHz (mono)
1. **segmentation**: parts of the audio signal containing speech, derived from the raw data or through VAD
1. **features**: Power-spectrograms, Mel-spectrograms or MFCC of the audio signal to use as training data
1. **labels**: transcriptions for the raw speech segments in original and normalized form

The set of features and labels is called the _labelled data_. The input data is often denoted as `X` and the labels as `Y`. To train the network, only a part of the labelled data is used. This part is referred to as the **training set**. By training on the training set, an RNN for ASR will learn the relationship between an audio signal and its textual representation. The RNN will use the [CTC loss function](https://www.cs.toronto.edu/~graves/icml_2006.pdf) as described in the [introduction](00_introduction.ipynb).

To prevent overfitting on the training set, another part of the labelled data is used to form a **validation/dev set**. The trained RNN can be validated by feeding it previously unseen instances from the dev set. By observing the results, decisions can be made about what to optimize next. Because conclusions are drawn from the validation, the validation set is indirectly used for training and should not be used for evaluation.

Finally, there is the **test set**, which contains completely unseen instances. The RNN can be evaluated with instances from this set. Because evaluation should be the last step when creating a neural network, instances from this set should be held back until training is finished.
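The division into the three sets can be sketched as follows. This is only a minimal illustration: the function name `split_labelled_data`, the 80/10/10 ratio and the placeholder data are assumptions and are not taken from the previous notebooks.

```python
import random


def split_labelled_data(samples, train_frac=0.8, dev_frac=0.1, seed=42):
    """Shuffle (X, Y) pairs and split them into train, dev and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_dev = int(len(shuffled) * dev_frac)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # held back until training is finished
    return train, dev, test


# Hypothetical labelled data: (features, label) pairs
labelled_data = [(f'features_{i}', f'label_{i}') for i in range(100)]
train_set, dev_set, test_set = split_labelled_data(labelled_data)
print(len(train_set), len(dev_set), len(test_set))
```

Shuffling before splitting keeps the three sets disjoint while letting each of them draw from the whole labelled data.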

An RNN trained for ASR is supposed to learn the relationship between an audio signal and its transcription, i.e. if trained properly it should be able to generate a transcription for any unseen audio signal (_prediction_) that is roughly equivalent to the actual transcription (_ground truth_). However, this notebook describes the training of an RNN prototype (_Proof of Concept_ or _PoC_). The purpose of this prototype is not to be trained into a fully-fledged ASR model but rather to examine the influence of different properties of the data on the training process. By comparing the learning progress with different inputs, valuable information should be gained that can be used to train the actual RNN for ASR later.

Execute the following cell to import some modules used in this notebook.

In [None]:
import os

from util.corpus_util import *
from util.audio_util import *

from IPython.display import HTML, Audio

rl_corpus = get_corpus('rl')
ls_corpus = get_corpus('ls')

## Data synthetization

In Deep Learning, additional training data can often help to get better results. Additional data can be created artificially through synthetization. Audio data is particularly easy to synthesize by altering the original signal, e.g. by applying the following changes:

* change of tempo
* change of loudness
* change of pitch
* adding echo/reverb
* adding background noise

Data synthetization can help improve the performance of an RNN, especially if labelled data is scarce. However, this can only be done to a certain extent, because the distribution of the training data should still reflect the distribution of the data the trained RNN is later used for. Synthetization must also be done carefully, especially when adding background noise. A certain background noise may be applied only once, otherwise the RNN will learn how to subtract it from the given signal.
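Two of the changes listed above (loudness and background noise) can be sketched in a few lines. This is not the `distort` helper from `util.audio_util`, whose implementation is not shown here. The functions `change_loudness` and `add_noise` and the sine tone used as input are assumptions for illustration only. Plain Python lists are used to keep the sketch free of dependencies; for real audio, NumPy arrays would be the more natural choice.

```python
import math
import random


def change_loudness(signal, gain):
    """Scale the amplitude of every sample by a constant factor."""
    return [gain * s for s in signal]


def add_noise(signal, noise_level, seed=None):
    """Add Gaussian noise with standard deviation `noise_level` to each sample."""
    rng = random.Random(seed)
    return [s + rng.gauss(0.0, noise_level) for s in signal]


# Hypothetical stand-in for a speech signal: one second of a 440 Hz tone at 16 kHz
rate = 16000
original = [math.sin(2 * math.pi * 440 * t / rate) for t in range(rate)]

louder = change_loudness(original, gain=1.5)
noisy = add_noise(original, noise_level=0.01, seed=1)
```

Both functions leave the length of the signal unchanged, so the transcript of the original segment can be reused as the label for the synthesized variant.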

For this notebook, synthetic data is produced by adding some distortion to the original signal. Different speaking rates can be simulated by changing the tempo. Different voices can be simulated by changing the pitch. Execute the following cell to hear some examples:

In [None]:
corpus_entry = rl_corpus['edznachrichten180111']

original, rate = corpus_entry.audio, corpus_entry.rate
display(HTML('original signal:'))
display(Audio(data=original, rate=rate))

fast_reader = distort(original, rate, tempo=1.5)
display(HTML('fast reader:'))
display(Audio(data=fast_reader, rate=rate))

high_pitched = distort(original, rate, pitch=True)
display(HTML('high pitched voice:'))
display(Audio(data=high_pitched, rate=rate))

## Procedure

The pronunciation of a given piece of text is highly dependent on the language. ASR is therefore highly sensitive to the language. An RNN for ASR is usually only valid for the language it was trained on and will not be able to recognize words from a different language. For that reason, the PoC in this notebook is only trained on one language at a time. For each language the RNN is trained using the different feature types. Each combination of language and feature type is trained using either only the original data or the original data augmented with synthesized variants. Different combinations of language, feature type and data type yield different profiles for training.

### Training profiles

The properties _language_, _feature type_ and _data type_ have the following value ranges:

| configuration item | possible values |
|---|---|
| language | German or English |
| feature type | MFCC, Mel- or Power-Spectrograms |
| data type | original or synthesized data |

This gives us the following matrix of training profiles:

| Profile | language | feature type | data type |
|---|---|---|---|
| PoC#1 | German | MFCC | original |
| PoC#2 | German | MFCC | synthesized |
| PoC#3 | German | Mel-Spectrogram | original |
| PoC#4 | German | Mel-Spectrogram | synthesized |
| PoC#5 | German | Power-Spectrogram | original |
| PoC#6 | German | Power-Spectrogram | synthesized |
| PoC#7 | English | MFCC | original |
| PoC#8 | English | MFCC | synthesized |
| PoC#9 | English | Mel-Spectrogram | original |
| PoC#10 | English | Mel-Spectrogram | synthesized |
| PoC#11 | English | Power-Spectrogram | original |
| PoC#12 | English | Power-Spectrogram | synthesized |

The profiles are arranged so that every profile has counterparts that differ from it in exactly one configuration item. Comparing the results of two PoCs trained on such a pair of profiles should give some insight into how that particular property affects the learning progress. For example, by comparing the results of PoC#1 and PoC#2 we get an intuition for how synthesized data will impact the speed of learning with features and language being identical. By comparing PoC#1 with PoC#3 the efficiency of the learning progress can be assessed when using MFCC or Mel-Spectrograms as features. By comparing PoC#1 with PoC#7 the impact of a change in language can be estimated.
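The matrix of profiles and the comparisons mentioned above can be reproduced with a short sketch. The dictionary `profiles` and the helper `differing_items` are hypothetical names introduced only for this illustration.

```python
from itertools import product

languages = ['German', 'English']
feature_types = ['MFCC', 'Mel-Spectrogram', 'Power-Spectrogram']
data_types = ['original', 'synthesized']

# Enumerate all combinations in the same order as the table above
profiles = {
    f'PoC#{i}': combination
    for i, combination in enumerate(product(languages, feature_types, data_types), start=1)
}


def differing_items(profile_a, profile_b):
    """Count the configuration items in which two profiles differ."""
    return sum(a != b for a, b in zip(profiles[profile_a], profiles[profile_b]))


for name, (language, feature_type, data_type) in profiles.items():
    print(f'{name.ljust(8)}{language.ljust(10)}{feature_type.ljust(20)}{data_type}')
```

A pair of profiles is suited for a direct comparison if `differing_items` returns `1` for it, as is the case for the three pairs named above.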

The results of the PoCs serve as a basis for further decisions. By changing only a single configuration item in each iteration, the impact of each change can be analyzed in isolation and conclusions for the following steps can be drawn. This follows the principle of incremental changes and should prevent spending too much project time on a highly sophisticated setup that may or may not work.

## Proof of Concept (PoC)

To see whether the PoC is even able to learn something useful, the following simplifications were made:

* the RNN is only trained on five speech segments of a single corpus entry
* only speech segments that do not contain numbers are considered

The German and English training samples were both drawn from the _ReadyLingua_ corpus. You can listen to the corpus entries by executing the cell below. The two recordings exhibit similar quality and are both read by a female speaker, which should limit the impact of recording quality and gender on the training progress.

In [None]:
corpus_entry_de = rl_corpus['andiefreudehokohnerauschenrein']
segments_de = corpus_entry_de.speech_segments_not_numeric
av_seg_len_de = sum(s.audio_length for s in segments_de) / len(segments_de)
av_trn_len_de = sum(len(s.text) for s in segments_de) / len(segments_de)

corpus_entry_en = rl_corpus['sunday22ohnerauschen']
segments_en = corpus_entry_en.speech_segments_not_numeric
av_seg_len_en = sum(s.audio_length for s in segments_en) / len(segments_en)
av_trn_len_en = sum(len(s.text) for s in segments_en) / len(segments_en)

display(HTML(f'German corpus entry ({len(segments_de)} segments):'))
display(HTML(f'Average speech segment length: {av_seg_len_de:.3f} seconds'))
display(HTML(f'Average transcript length: {av_trn_len_de:.3f} characters'))
display(Audio(data=corpus_entry_de.audio, rate=corpus_entry_de.rate))
for s in corpus_entry_de.speech_segments_not_numeric[:5]:
    print(s.text)
    
display(HTML(f'English corpus entry ({len(segments_en)} segments):'))
display(HTML(f'Average speech segment length: {av_seg_len_en:.3f} seconds'))
display(HTML(f'Average transcript length: {av_trn_len_en:.3f} characters'))
display(Audio(data=corpus_entry_en.audio, rate=corpus_entry_en.rate))
for s in corpus_entry_en.speech_segments_not_numeric[:5]:
    print(s.text)


### Architecture

The PoC is an RNN with a simple architecture. It only has one layer which is time-distributed over $T_x$ time steps. The value for $T_x$ depends on the length of the input signal. The number of units $n$ depends on the number of features. For MFCC features this value is $13$, for Mel-Spectrograms $40$ and for Power-Spectrograms $161$ (see [notebook 3](03_feature_extraction.ipynb#From-raw-waves-to-spectrograms) on how to calculate those values). The RNN uses [LSTM cells](http://colah.github.io/posts/2015-08-Understanding-LSTMs). LSTM cells are not only able to learn from previous time steps by adding information, but can also remove (_"forget"_) information from previous steps. They do so by having trainable parameters that control how much information flows from one time step to the next (_forget gate_).
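The effect of the forget gate can be illustrated with a single LSTM unit that has scalar input and state. This is a toy sketch of the standard LSTM equations and not the PoC implementation. The function `lstm_step`, the layout of the parameter dictionary and all parameter values are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, a_prev, c_prev, p):
    """One time step of a single LSTM unit with scalar input and state.

    `p` maps each gate name to a tuple (w_a, w_x, b) of scalar parameters.
    """
    def pre_activation(name):
        w_a, w_x, b = p[name]
        return w_a * a_prev + w_x * x + b

    gamma_f = sigmoid(pre_activation('forget'))  # how much of c_prev is kept
    gamma_u = sigmoid(pre_activation('update'))  # how much new information is added
    gamma_o = sigmoid(pre_activation('output'))  # how much of the cell value is exposed
    c_candidate = math.tanh(pre_activation('candidate'))

    c = gamma_f * c_prev + gamma_u * c_candidate
    a = gamma_o * math.tanh(c)
    return a, c


# Hypothetical parameters: the update gate is almost closed, so the new cell value
# is determined almost entirely by the forget gate and the previous cell value
params = {'update': (0.0, 0.0, -20.0), 'output': (0.0, 0.0, 0.0), 'candidate': (0.0, 1.0, 0.0)}
_, c_forget = lstm_step(x=0.5, a_prev=0.0, c_prev=1.0, p={**params, 'forget': (0.0, 0.0, -20.0)})
_, c_remember = lstm_step(x=0.5, a_prev=0.0, c_prev=1.0, p={**params, 'forget': (0.0, 0.0, 20.0)})
print(f'cell value with forget gate near 0: {c_forget:.3f}')
print(f'cell value with forget gate near 1: {c_remember:.3f}')
```

With a forget gate value near 0 the previous cell value is discarded, with a value near 1 it is carried over to the current time step. During training, the parameters of the gate are adjusted so that the unit learns when to do which.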

![PoC architecture](../assets/poc_architecture.jpg)

### Training and validation

For profiles that do not use synthesized data, the training data simply consists of the audio signal and labels of the five speech segments. For profiles that use synthesized data, this set was augmented by adding another five speech segments that were artificially created by adding some distortion as described above. Because the distortion was made randomly, this resulted in a slightly different training set for each epoch.

The validation set was generated from randomly shifted and distorted audio signals of the training set.

### Performance
The RNN performance is measured with two metrics:

* CTC-loss
* Label Error Rate (LER)

The calculation of the CTC-loss has been described as part of the description of CTC in [the introduction](00_Introduction.ipynb#Calculating-the-CTC-loss-by-creating-valid-alignments-with-dynamic-programming). The LER is calculated as follows.

#### Label Error Rate (LER)
The LER is defined as the [edit distance](https://www.tensorflow.org/api_docs/python/tf/edit_distance) between prediction (_hypothesis_) and actual labels (_ground truth_), also called the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance), normalized by the length of the ground truth. Observe for example the LER for the hypothesis `hello` and the truth `hallo`: one substitution is needed to turn one string into the other and the truth has 5 characters, therefore the LER is `0.2`.

The LER can also be computed for strings of different lengths. The following table gives an overview of a few samples.

| Hypothesis | Truth | LER |
|---|---|---|
| 'hello' | 'hallo' | 0.2 |
| 'hell' | 'hallo' | 0.4 |
| 'hel' | 'hallo' | 0.6 |
| 'helloo' | 'hallo' | 0.4 |
| 'hellooo' | 'hallo' | 0.6 |
| 'helo' | 'hallo' | 0.4 |
| 'heloo' | 'hallo' | 0.4 |
| 'helooo' | 'hallo' | 0.6 |
| 'allo' | 'hallo' | 0.2 |
| 'elo' | 'hallo' | 0.6 |
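The values in the table can be reproduced without TensorFlow. The following sketch computes the Levenshtein distance with dynamic programming and divides it by the length of the ground truth. The function names `levenshtein` and `label_error_rate` are chosen for this illustration and are not part of the project's utilities.

```python
def levenshtein(hypothesis, truth):
    """Minimum number of insertions, deletions and substitutions to turn hypothesis into truth."""
    # previous[j] holds the distance between the processed hypothesis prefix and truth[:j]
    previous = list(range(len(truth) + 1))
    for i, h_char in enumerate(hypothesis, start=1):
        current = [i]
        for j, t_char in enumerate(truth, start=1):
            substitution = previous[j - 1] + (h_char != t_char)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def label_error_rate(hypothesis, truth):
    """Edit distance normalized by the length of the ground truth."""
    return levenshtein(hypothesis, truth) / len(truth)


for hypothesis in ['hello', 'hell', 'hel', 'helloo', 'allo', 'elo']:
    print(hypothesis.ljust(10), label_error_rate(hypothesis, 'hallo'))
```

Because the distance is divided by the length of the truth, the LER can exceed `1.0` if the hypothesis is much longer than the truth.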

You can also execute the cell below to see how the values are calculated with TensorFlow.

In [None]:
samples = [
    ('hello', 'hallo'),
    ('hell', 'hallo'),
    ('hel', 'hallo'),
    ('helloo', 'hallo'),
    ('hellooo', 'hallo'),
    ('helo', 'hallo'),
    ('heloo', 'hallo'),
    ('helooo', 'hallo'),
    ('allo', 'hallo'),
    ('elo', 'hallo'),
]

import tensorflow as tf
from util.rnn_util import *

print('hypothesis'.ljust(15) + 'truth'.ljust(15) + 'LER'.ljust(10))
print('-' * 40)
for hypothesis, truth in samples:
    # represent the encoded hypothesis as a sparse tensor with a batch size of 1
    h_values = encode(hypothesis)
    h_indices = [[0, i] for i in range(len(h_values))]
    h_shape = [1, len(h_values)]
    h_tensor = tf.SparseTensor(indices=h_indices, values=h_values, dense_shape=h_shape)

    # represent the encoded truth the same way, using its own length for the shape
    t_values = encode(truth)
    t_indices = [[0, i] for i in range(len(t_values))]
    t_shape = [1, len(t_values)]
    t_tensor = tf.SparseTensor(indices=t_indices, values=t_values, dense_shape=t_shape)

    # tf.Session is the TensorFlow 1.x API
    # by default, tf.edit_distance normalizes the distance by the length of the truth
    with tf.Session() as sess:
        ler = tf.edit_distance(h_tensor, t_tensor)
        edit_distance = sess.run(ler)
        print(f'{hypothesis.ljust(15)}{truth.ljust(15)}{str(edit_distance[0]).ljust(10)}')

### Definition of convergence

The RNN is trained until convergence. To decide when to stop, the term _convergence_ needs to be defined first. This can be done by observing the trend of the cost curve. For the PoCs in this notebook, _convergence_ was considered reached when either the LER cost met both of the following criteria or 10,000 epochs had passed:

1. **Criterion 1**: the prediction must be accurate enough.
1. **Criterion 2**: the average LER cost must have plateaued, i.e. not have changed by more than 1% over the last 10 epochs.

The first criterion is measured by calculating the mean LER cost over the last 10 epochs. It must be below 0.05 to fulfill the criterion. This ensures that the predicted transcriptions diverge from the actual transcriptions by 5% at most on average (in terms of edit distance). The value of 5% is chosen somewhat arbitrarily. However, it lies within the range of the LERs reported for the best STT systems in research papers.

The second criterion ensures that training is not stopped too soon. Learning should still continue if the gradient of the average LER cost is still negative.
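The stopping rule described above could be sketched as follows. How exactly the change of the average LER was measured is not specified here, so this sketch makes an assumption: it compares the mean LER of the last 10 epochs with the mean of the 10 epochs before that. The function name `has_converged` and its parameters are hypothetical.

```python
def has_converged(ler_history, epoch_limit=10000, window=10, max_ler=0.05, max_change=0.01):
    """Decide whether training should stop, given the LER measured after each epoch."""
    if len(ler_history) >= epoch_limit:
        return True  # stop after the maximum number of epochs
    if len(ler_history) < 2 * window:
        return False  # not enough epochs to compare two windows

    recent = sum(ler_history[-window:]) / window
    before = sum(ler_history[-2 * window:-window]) / window

    # criterion 1: the mean LER over the last epochs is low enough
    accurate_enough = recent < max_ler
    # criterion 2 (assumed interpretation): the mean LER changed by at most 1%
    if before == 0:
        plateaued = recent == 0
    else:
        plateaued = abs(recent - before) / before <= max_change
    return accurate_enough and plateaued


still_improving = [0.5 - 0.02 * i for i in range(25)]
plateau_too_high = [0.08] * 30
plateau_low = [0.5, 0.3, 0.1] + [0.04] * 20

print(has_converged(still_improving))
print(has_converged(plateau_too_high))
print(has_converged(plateau_low))
```

Only the last history fulfills both criteria: the first one is still improving and not yet accurate enough, the second one has plateaued at an LER above the threshold.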

## Results and interpretation

### PoC #1: MFCC, German, original data
For the _ReadyLingua_ corpus the first corpus entry in German is the poem _"An die Freude"_ by F. Schiller. The first PoC is trained exclusively on the first five speech segments. These segments have the following transcript (normalized):

    1. an die freude von friedrich schiller
    2. freude schoner gotterfunken
    3. tochter aus elysium
    4. wir betreten feuertrunken himmlische dein heiligtum
    5. deine zauber binden wieder
    
The RNN is trained by repeatedly feeding it these training samples. One iteration over the whole training set is called an _epoch_. The RNN is trained until convergence as defined above.

After each epoch, the training progress is validated by calculating the LER and CTC cost. For this, an artificial validation set is created by randomly shifting the audio signals from the training set by some milliseconds. Shifting is done by cropping a random number of samples between 1 and 2000 from the beginning. By doing so, the audio signal is translated to the left by 125ms at most (with a sampling rate of 16kHz).
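The shifting described above can be sketched in a few lines. The function `random_shift` is a hypothetical helper and the list of integers only stands in for a real audio signal. The slicing works the same way for NumPy arrays.

```python
import random


def random_shift(signal, max_samples=2000, rng=random):
    """Crop between 1 and `max_samples` samples from the beginning of the signal."""
    offset = rng.randint(1, max_samples)  # both bounds are inclusive
    return signal[offset:]


rate = 16000
signal = list(range(3 * rate))  # hypothetical stand-in for three seconds of audio

shifted = random_shift(signal, rng=random.Random(0))
shift_ms = (len(signal) - len(shifted)) / rate * 1000
print(f'signal shifted by {shift_ms:.1f} ms')
```

At a sampling rate of 16kHz, the upper bound of 2000 samples corresponds to 2000 / 16000 = 0.125 seconds, which is the maximum shift of 125ms mentioned above.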

It is evident that by training on such a small training set the RNN will hopelessly overfit to those 5 samples. Also, the validation set consists of variants of the training set and is therefore not representative, because the RNN will have already seen the samples in some similar form. However, although the results are not representative, this PoC is a quick and cheap way to validate whether the RNN is able to learn anything at all. The results can serve as a baseline for comparison.

#### Results

By comparing the actual transcription (_ground truth_) with the decoded output of the RNN (_prediction_) we can see that the RNN indeed learns how to recognize speech. The following log extract shows the predictions at various stages of the training.

    Epoch 1:
    Ground truth (train-set):     wir betreten feuertrunken himmlische dein heiligtum
    Prediction (train-set):       e    
    
    Epoch 5:
    Ground truth (train-set):     wir betreten feuertrunken himmlische dein heiligtum
    Prediction (train-set):       irie e eiei e e e e ei 
    
    Epoch 25:
    Ground truth (train-set):     wir betreten feuertrunken himmlische dein heiligtum
    Prediction (train-set):       tirberete euernke uishe en heitum
    
    Epoch 38:
    Ground truth (train-set):     wir betreten feuertrunken himmlische dein heiligtum
    Prediction (train-set):       wir betreten feuernkenfmimische ein heiligtum

    Epoch 63:
    Ground truth (train-set):     wir betreten feuertrunken himmlische dein heiligtum
    Prediction (train-set):       wir betreten feuertrunken himlische dein heiligum
    
It is evident that in the first few epochs the generated transcript does not make much sense. It consists mainly of the letter `e`, which happens to be the most frequent character in a German text. With further iterations however, the generated transcripts become clearer until they match up almost perfectly with the actual transcripts. After just 63 epochs the RNN produces transcripts that are very similar to the original ones. Note that between the last two epochs shown above the RNN has actually unlearned how to recognize the word _heiligtum_. This is possible in LSTM networks, because LSTM cells include a trainable parameter $\Gamma_f$ that controls how much of the cell value of the previous time step is used to calculate the cell value in the current time step (forget gate). For more information about LSTM cells see [Christopher Olah's blog about understanding LSTM cells](http://colah.github.io/posts/2015-08-Understanding-LSTMs/).

The learning progress is also reflected in the plots for the CTC- and LER-cost:    

<table>
    <tr>
        <td>
            <img src="../assets/poc1_ctc.png" alt="CTC cost" style="width: 450px; "/>
        </td>
        <td>
            <img src="../assets/poc1_ler.png" alt="LER cost" style="width: 450px;"/>
        </td>
    </tr>
</table>

### PoC #2: MFCC, German, synthesized data

To get a better intuition on how fast the RNN learns, it was trained on the same data again using MFCC-features. However, the original training data was augmented by adding synthesized speech segments that were created by distorting the audio signal as described above. The transcript remained the same.

#### Results

Because the RNN did not converge, training was aborted after more than 13,000 epochs. By that time the LER oscillated between 0.07 and 0.08 with average change rates below 0.1%, which is close to the convergence criteria.

<table>
    <tr>
        <td>
            <img src="../assets/poc2_ctc.png" alt="CTC cost" style="width: 450px; "/>
        </td>
        <td>
            <img src="../assets/poc2_ler.png" alt="LER cost" style="width: 450px;"/>
        </td>
    </tr>
</table>


This is also reflected in the results of the last epochs:

    Ground truth (train-set):     an die freude von friedrich schiller
    Prediction (train-set):       an die freude von f friedrich schiler            
    Ground truth (train-set):     freude schoner gotterfunken
    Prediction (train-set):       freude schoner goterfunken                       
    Ground truth (train-set):     tochter aus elysium
    Prediction (train-set):       tochter aus elysium                              
    Ground truth (train-set):     wir betreten feuertrunken himmlische dein heiligtum
    Prediction (train-set):       wr betren feuertrunken himlische dein h heiligtum
    Ground truth (train-set):     deine zauber binden wieder
    Prediction (train-set):       deine zauber binde wieder    
  
#### Interpretation

We can see that the RNN still learns the relationship between audio signal and transcription when trained on synthesized data. In contrast to PoC#1 the training batches were similar, but no two training samples were exactly the same. This reduced overfitting and the risk of the RNN learning the mapping by heart, but required a considerably larger number of training epochs to get comparable results (more than 13k epochs compared to only a few dozen before).

### PoC #3: German, Mel-Spectrograms, original data

For the third PoC we use a corpus entry with properties similar to those in PoC #1. The first corpus entry in the _LibriSpeech_ corpus is ... . Five speech segments with lengths similar to those in PoC #1 have the following transcripts:

<table>
    <tr>
        <td>
            <img src="../assets/poc3_ctc.png" alt="CTC cost" style="width: 450px; "/>
        </td>
        <td>
            <img src="../assets/poc3_ler.png" alt="LER cost" style="width: 450px;"/>
        </td>
    </tr>
</table>

#### Results
Training the RNN in the same way as shown above gives us the following progress:

    tbd...   

### PoC #4: German, Mel-Spectrograms, synthesized data

#### Results

<table>
    <tr>
        <td>
            <img src="../assets/poc4_ctc.png" alt="CTC cost" style="width: 450px; "/>
        </td>
        <td>
            <img src="../assets/poc4_ler.png" alt="LER cost" style="width: 450px;"/>
        </td>
    </tr>
</table>

#### Interpretation


### PoC #5: German, Power-Spectrograms, original data

#### Results

<table>
    <tr>
        <td>
            <img src="../assets/poc5_ctc.png" alt="CTC cost" style="width: 450px; "/>
        </td>
        <td>
            <img src="../assets/poc5_ler.png" alt="LER cost" style="width: 450px;"/>
        </td>
    </tr>
</table>

#### Interpretation


### PoC #6: German, Power-Spectrograms, synthesized data

#### Results

<table>
    <tr>
        <td>
            <img src="../assets/poc6_ctc.png" alt="CTC cost" style="width: 450px; "/>
        </td>
        <td>
            <img src="../assets/poc6_ler.png" alt="LER cost" style="width: 450px;"/>
        </td>
    </tr>
</table>

#### Interpretation


### PoC #7: English, MFCC, original data

#### Results

<table>
    <tr>
        <td>
            <img src="../assets/poc7_ctc.png" alt="CTC cost" style="width: 450px; "/>
        </td>
        <td>
            <img src="../assets/poc7_ler.png" alt="LER cost" style="width: 450px;"/>
        </td>
    </tr>
</table>

#### Interpretation


### PoC #8: English, MFCC, synthesized data

#### Results

<table>
    <tr>
        <td>
            <img src="../assets/poc8_ctc.png" alt="CTC cost" style="width: 450px; "/>
        </td>
        <td>
            <img src="../assets/poc8_ler.png" alt="LER cost" style="width: 450px;"/>
        </td>
    </tr>
</table>


### PoC #9: English, Mel-Spectrograms, original data

#### Results

<table>
    <tr>
        <td>
            <img src="../assets/poc9_ctc.png" alt="CTC cost" style="width: 450px; "/>
        </td>
        <td>
            <img src="../assets/poc9_ler.png" alt="LER cost" style="width: 450px;"/>
        </td>
    </tr>
</table>


### PoC #10: English, Mel-Spectrograms, synthesized data

#### Results

<table>
    <tr>
        <td>
            <img src="../assets/poc10_ctc.png" alt="CTC cost" style="width: 450px; "/>
        </td>
        <td>
            <img src="../assets/poc10_ler.png" alt="LER cost" style="width: 450px;"/>
        </td>
    </tr>
</table>


### PoC #11: English, Power-Spectrograms, original data

#### Results

<table>
    <tr>
        <td>
            <img src="../assets/poc11_ctc.png" alt="CTC cost" style="width: 450px; "/>
        </td>
        <td>
            <img src="../assets/poc11_ler.png" alt="LER cost" style="width: 450px;"/>
        </td>
    </tr>
</table>


### PoC #12: English, Power-Spectrograms, synthesized data

#### Results

<table>
    <tr>
        <td>
            <img src="../assets/poc12_ctc.png" alt="CTC cost" style="width: 450px; "/>
        </td>
        <td>
            <img src="../assets/poc12_ler.png" alt="LER cost" style="width: 450px;"/>
        </td>
    </tr>
</table>


## Summary

This notebook showed how a PoC was trained in order to gain insight into the usefulness of the different feature types (MFCC, Mel-Spectrograms and Power-Spectrograms). It also tried to assess the influence of language and synthesized data on the training process.

The results showed that the PoC was able to learn from MFCC features for German samples without synthesized data. For this combination, the RNN converged quickly and was able to make good predictions for audio data that was created using shifted and distorted versions of the training data.

For other combinations the PoC did not converge and in some cases produced somewhat random results. Generally, the training curves exhibited high variance, meaning that there was a gap between training and validation loss. This is an indication of severe overfitting that is probably a result of the lack of training data. This is also true for feature types with a small number of features (MFCC).