# ASR-Stage

This notebook presents a state-of-the-art model for ASR. A simplified version of this model was implemented for use in the ASR stage of the pipeline.

## Deep Speech
In 2014 <cite data-cite="undefined"></cite> presented an RNN model called _Deep Speech_ to recognize speech in possibly noisy environments. Since the model used CTC, it did not require pre-alignment, e.g. by supplying a phonetic transcript. The model was trained on various corpora (_WSJ_, _Switchboard_, _Fisher_, _Baidu_) containing both conversational and read speech. This data was augmented by artificially adding background noise (_data synthesis_). Transcripts for sequences of audio frames were learned using a target alphabet that consisted of 29 characters (`a..z`, `space`, apostrophe and _blank_). Performance was measured with _Label Error Rate_ and _Word Error Rate_. The model architecture was remarkably simple, consisting of only 5 layers, one of which was an RNN layer:

<figure>
    <img src="../assets/deep_speech_architecture.png" />
    <caption>DeepSpeech architecture (source: <cite data-cite="undefined"></cite>)</caption>
</figure>

### Model layers
The first three layers are _convolutional_ layers and not recurrent. The _clipped ReLU_ function was used as the activation function, which is defined as:

$$
g(z) = \min(\max(0,z), 20)
$$
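The formula above can be written in a few lines of NumPy. This is only a minimal sketch: the function name `clipped_relu` and the parameter `cap` are chosen here for illustration and are not taken from the paper or the project code.

```python
import numpy as np

def clipped_relu(z, cap=20.0):
    """Clipped ReLU g(z) = min(max(0, z), cap), applied element-wise."""
    return np.minimum(np.maximum(0.0, z), cap)

z = np.array([-3.0, 0.5, 20.0, 42.0])
# Negative inputs map to 0, inputs above the cap map to the cap
print(clipped_relu(z))
```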

The fourth layer is recurrent with forward and backward recurrence (_bi-directional_). Hence the units in this layer share weights over time and depend on the time frames both before and after each time step. Note that no LSTM cells were used, in favor of a simpler model that requires less data. _Clipped ReLU_ was also used as the activation function for this layer.

The last layer is again non-recurrent, consisting of densely connected units (_fully-connected_). Softmax was used as the activation function, yielding a probability for each character of the target alphabet at each time step. From the resulting matrix of probabilities (shape $T_x \times 29$), the loss was calculated by measuring the prediction error with CTC as described before.
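To make the shape of the output concrete, the following sketch applies a row-wise softmax to random scores for a toy sequence. The sequence length of 5 and the random input are arbitrary assumptions for illustration. The result is a $T_x \times 29$ matrix in which each row is a probability distribution over the target alphabet.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

T_x, num_chars = 5, 29  # toy sequence length (assumed), 29 characters as in the paper
rng = np.random.default_rng(0)
probs = softmax(rng.normal(size=(T_x, num_chars)))

print(probs.shape)        # one row per time step, one column per character
print(probs.sum(axis=1))  # each row sums to 1 (up to floating point error)
```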

### Regularization

To prevent overfitting, dropout with rates between 5% and 10% was applied to the first three (non-recurrent) layers. The values of the input signals were centered by subtracting the global mean and then scaled through division by the standard deviation. Additionally, each audio signal was shifted 5 ms to the left and right to calculate two additional values per time frame. The probabilities were then calculated by averaging over all three values. Finally, an ensemble of several RNNs was used at test time.
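The normalization and the averaging over shifted inputs could look roughly like the sketch below. The helper names `normalize` and `average_predictions` are hypothetical, and the statistics are computed from a toy array here, whereas the description above suggests they would come from the whole data set.

```python
import numpy as np

def normalize(features, mean, std):
    """Center by the (global) mean and scale by the standard deviation."""
    return (features - mean) / std

def average_predictions(*prob_matrices):
    """Average the probability matrices predicted for the shifted and unshifted input."""
    return np.mean(np.stack(prob_matrices), axis=0)

# Toy features; in practice mean and std would be computed over the training data
x = np.array([[1.0, 2.0], [3.0, 4.0]])
normalized = normalize(x, x.mean(), x.std())

# Toy predictions for the signal shifted left, unshifted, and shifted right
p_left = np.array([[0.1, 0.9]])
p_center = np.array([[0.2, 0.8]])
p_right = np.array([[0.3, 0.7]])
averaged = average_predictions(p_left, p_center, p_right)
```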

### Model features and performance

Spectrograms were used as features. Despite its simple architecture, the model outperformed previously published systems of that time. This was also possible because a language model (LM) was used to model the probabilities of sequences of letters and words with _n-grams_. This language model fixed errors in transcripts that resulted from the RNN learning plausible renderings of words that were nonetheless incorrect, as in the following examples taken from the original paper:

| RNN output | Actual transcription |
|---|---|
| bostin | Boston |
| arther n tickets | are there any tickets |

Additionally, performance was improved by introducing a novel approach to parallelize calculations on multiple GPUs. Furthermore, computational effort was reduced by halving the number of time steps, using only every second time step in the bidirectional layer.
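Halving the time steps amounts to a stride of 2 along the time axis. The toy example below only illustrates the idea on an array of assumed shape $(T_x, \text{units})$. How exactly the striding was applied in the original model is not reproduced here.

```python
import numpy as np

T_x, units = 6, 4  # toy dimensions, chosen arbitrarily
activations = np.arange(T_x * units, dtype=float).reshape(T_x, units)

# Keep time steps 0, 2, 4, ... and drop every second one
strided = activations[::2]

print(activations.shape, "->", strided.shape)
```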

## Model implementation for this project

Because training a state-of-the-art model was not required (and also not feasible) for this project, a simpler model was trained. A simpler model also allowed for shorter training times and therefore faster feedback cycles. This was crucial for this project as only one GPU was available and the project time available for this stage of the pipeline was very limited. The idea was to find out whether the pipelined approach would still work when using a less capable model in the ASR stage. The ASR model used in this project is therefore a simplified version of the Deep Speech model as presented in the original paper. Simplifications were made in the following aspects:

* no LM was used
* no data synthesis was done, i.e. no audio translation, distortion or superposition of background noise
* the first three layers are not convolutional, i.e. no striding in the spectrograms was applied to halve the time steps. This also means that no context frames were used to calculate the features for each frame. Instead, simple FC layers with a dropout of `0.1` were used.
* the apostrophe was not part of the target alphabet, thus the target alphabet consisted of 28 characters (`a..z`, `space`, `blank`)
* no ensembling was used
* less training data was available (some hundred hours compared to some thousand for _DeepSpeech_)
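The reduced target alphabet from the list above can be illustrated with a small encoder that maps transcripts to label indices. The index order and the choice to put the blank token last are assumptions made for this sketch, and the names `encode` and `decode` are hypothetical.

```python
import string

ALPHABET = list(string.ascii_lowercase) + [' ']  # 27 real characters: a..z and space
BLANK = len(ALPHABET)                            # index 27 is reserved for the CTC blank
CHAR_TO_INDEX = {c: i for i, c in enumerate(ALPHABET)}

def encode(text):
    """Map a transcript to label indices, dropping characters outside the alphabet."""
    return [CHAR_TO_INDEX[c] for c in text.lower() if c in CHAR_TO_INDEX]

def decode(labels):
    """Map label indices back to text, skipping the blank token."""
    return ''.join(ALPHABET[i] for i in labels if i != BLANK)

labels = encode("It's a test")  # the apostrophe is not in the alphabet and is dropped
print(labels)
print(decode(labels))
```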

The model was implemented in Python using [Keras](https://keras.io) with a [TensorFlow](http://tensorflow.org) backend. The model was trained on both corpora (ReadyLingua and LibriSpeech) using different types of features (MFCC, Mel-Spectrograms, Power-Spectrograms).

### Model architecture

Similar to the original DeepSpeech model, the simplified model consists of 5 layers, of which the first three are fully connected (in contrast to convolutional in the original model). This architecture was inspired by code in [this repository](https://github.com/igormq/asr-study).

#### Input layer
The input layer has shape $(N, T_x, f)$, where $N$ is the batch size, $T_x$ the sequence length and $f$ the number of features. A batch size of $N=32$ was arbitrarily chosen for training. Since the ASR model is trained on speech segments of different lengths, $T_x$ is determined by the longest segment in each batch. Shorter segments are zero-padded to match $T_x$. The value of $T_x$ may therefore vary between batches. The number of features $f$ depends on the type of features used for training and was set to default values of $f=13$ for MFCC features, $f=40$ for Mel-Spectrograms and $f=161$ for power-spectrograms.
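The zero-padding of a batch can be sketched as follows. The helper `pad_batch` is hypothetical and not part of the project code. It only illustrates how segments of different lengths end up in a single array of shape $(N, T_x, f)$.

```python
import numpy as np

def pad_batch(segments):
    """Zero-pad a list of (T_i, f) feature arrays to a single (N, T_x, f) array."""
    T_x = max(seg.shape[0] for seg in segments)  # longest segment in this batch
    f = segments[0].shape[1]
    batch = np.zeros((len(segments), T_x, f), dtype=segments[0].dtype)
    for i, seg in enumerate(segments):
        batch[i, :seg.shape[0]] = seg            # remaining frames stay zero
    return batch

# Three toy segments with 5, 8 and 3 frames of 13 MFCC features each
segments = [np.ones((t, 13)) for t in (5, 8, 3)]
batch = pad_batch(segments)
print(batch.shape)
```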

Because calculating the features is time-consuming, they were pre-calculated and stored in an [HDF5](https://h5py.org) file to speed up the training process. This is a typical [space-time tradeoff](https://en.wikipedia.org/wiki/Space%E2%80%93time_tradeoff): training time is saved at the cost of disk space, and these files can become quite big.

#### Other layers

The other layers correspond more or less to the layers of the _DeepSpeech_ model with the simplifications described above. Execute the following cell to get the Keras summary describing the architecture for a model that can be trained on MFCC-features.


In [None]:
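# Import the model factory explicitly instead of using a wildcard import
from util.brnn_util import deep_speech_model

# Build the simplified model for 13 MFCC features per frame and print its layers
model = deep_speech_model(num_features=13)
model.summary()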

## Results

