# Feature extraction

In the [first notebook](01_corpus_creation.ipynb) corpora containing standardized corpus entries were created. Each corpus entry corresponds to a recording for which the following two pieces of information are available:

1. **Segmentation information**: Information _where_ in the recording something is being said (speech segments)
1. **Partial transcripts**: one transcript for each speech segment

The RNN will be trained on features of the speech segments. Although it is possible to directly transcribe raw speech waveforms with RNNs <cite data-cite="6174726/DDKDDT5P"></cite> this is rarely done as computational cost is high and performance is somewhat limited. Instead, features are extracted from the raw waveform and training is done on those features. Two popular methods that are frequently found in academic research papers are:

- spectrograms (portion of each frequency within a short time frame of the waveform)
- Mel-Frequency Cepstral Coefficients (MFCC)

Both methods are explained below. Whatever method is used, feature extraction is computationally expensive and time-consuming. To speed up the iterations when training the RNN and get feedback faster, the input data (the spectrograms) are pre-computed and stored on disk so they do not need to be calculated at training time. Also, the labels (the information about speech pauses) need to be encoded in a suitable format. This notebook describes how both are done.
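The idea of pre-computing features once and reading them from disk afterwards can be sketched as follows. This is only an illustration: the helper `load_or_compute`, the `.npy` file format and the stand-in function `fake_features` are assumptions and not part of the project code.

```python
import tempfile
from pathlib import Path

import numpy as np


def load_or_compute(cache_file, compute_fn):
    """Return cached features if present, otherwise compute and store them."""
    cache_file = Path(cache_file)
    if cache_file.exists():
        return np.load(cache_file)
    features = compute_fn()
    np.save(cache_file, features)
    return features


def fake_features():
    """Hypothetical stand-in for an expensive feature extraction step."""
    return np.arange(12, dtype=np.float32).reshape(3, 4)


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_file = Path(tmp_dir) / 'segment_0001.npy'
    first = load_or_compute(cache_file, fake_features)   # computed and saved
    second = load_or_compute(cache_file, fake_features)  # loaded from disk
    cache_hit = np.array_equal(first, second)
```

The second call never invokes `fake_features`, which is the effect that saves time during training.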

Execute the following cell to import modules and define functions used in this notebook.

In [None]:
%matplotlib inline
# %matplotlib notebook

from util.corpus_util import *
from util.audio_util import *

import random
import numpy as np
import os
from os.path import isdir, join
from pathlib import Path
import pandas as pd

# Math
from scipy.fftpack import fft
from scipy import signal
from scipy.io import wavfile
import librosa

from sklearn.decomposition import PCA

# Visualization
from IPython.display import HTML, Audio
import ipywidgets as widgets
import plotly.graph_objs as go
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
import librosa.display

import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.tools as tls

default_figsize = (12,5)
default_facecolor = 'white'
default_font = {'family': 'serif', 
                'weight': 'normal', 
#                 'size': 12
               }
plt.rc('font', **default_font)

def show_labelled_data(corpus_entry):
    display(HTML(f'<h3>{corpus_entry.name} (id={corpus_entry.id})</h3>'))
    display(HTML(f'{len(corpus_entry.speech_segments)} speech segments, {len(corpus_entry.pause_segments)} pause segments'))
    
    # audio data
    audio, rate = corpus_entry.audio, corpus_entry.rate
    display(Audio(data=audio, rate=rate))
    
    fig = plt.figure(figsize=default_figsize, facecolor=default_facecolor)

    # plot raw wave
    ax_wave = fig.add_subplot(212)
    title = f'Raw wave of {corpus_entry.audio_file} with speech pauses'
    ax_wave = show_wave(audio, rate, ax_wave, title)
    
    # plot spectrogram
    window_size_ms, step_size_ms = 20, 10
    window_size, step_size = ms_to_frames(window_size_ms, rate), ms_to_frames(step_size_ms, rate)
    ax_spec = fig.add_subplot(211)
    title = f'Spectrogram of {corpus_entry.audio_file}'
    spec = corpus_entry.spectrogram(window_size=window_size_ms, step_size=step_size_ms)
    ax_spec = show_spectrogram(spec, rate, step_size, ax_spec, title=title, scale=None)
    
    # overlay speech and pause segments
    speech_boundaries = calculate_boundaries(corpus_entry.speech_segments)
    speech_boundaries_u = calculate_boundaries(corpus_entry.speech_segments_unaligned)
    pause_boundaries = calculate_boundaries(corpus_entry.pause_segments)
    
    # rescale boundaries from frames to seconds
    speech_boundaries = speech_boundaries / corpus_entry.rate
    speech_boundaries_u = speech_boundaries_u / corpus_entry.rate
    pause_boundaries = pause_boundaries / corpus_entry.rate
    
#     show_segments(ax_spec, speech_boundaries, color='green')
    show_segments(ax_wave, speech_boundaries, color='green')
#     show_segments(ax_spec, speech_boundaries_u, color='yellow')
    show_segments(ax_wave, speech_boundaries_u, color='yellow')
#     show_segments(ax_spec, pause_boundaries, color='red')
    show_segments(ax_wave, pause_boundaries, color='red')
    
    speech_segments = mpatches.Patch(color='green', alpha=0.6, label='speech segments')
    speech_segments_u = mpatches.Patch(color='yellow', alpha=0.6, label='speech segments unaligned')
    pause_segments = mpatches.Patch(color='red', alpha=0.6, label='pause segments')
    ax_wave.legend(handles=[speech_segments, speech_segments_u, pause_segments], bbox_to_anchor=(0, -0.5, 1., -0.4), loc=3, mode='expand', borderaxespad=0, ncol=3)
    
    return ax_spec, ax_wave

def show_wave(audio, sample_rate, ax=None, title=None):
    if not ax:
        plt.figure(figsize=default_figsize, facecolor=default_facecolor)
        
    p = librosa.display.waveplot(audio.astype(float), sample_rate)
    ax = p.axes
    ax.set_ylabel('Amplitude')
    if title:
        plt.title(title)
    plt.tight_layout()
    return ax

def show_spectrogram(spec, sample_rate, step_size, ax=None, title=None, scale='db'):
    if not ax:
        plt.figure(figsize=default_figsize, facecolor=default_facecolor)

    ax = librosa.display.specshow(spec, sr=sample_rate, hop_length=step_size, 
                                  x_axis='time', y_axis='hz', cmap='viridis')
    if scale == 'db':
        plt.colorbar(format='%+2.0f dB')
    if title:
        plt.title(title)
    plt.tight_layout()    
    return ax
    
def show_spectrogram_3d(spec, title=None):
    data = [go.Surface(z=spec)]
    layout = go.Layout(
        title=title,
        scene = dict(
            xaxis = dict(title='Time'),
            yaxis = dict(title='Frequencies'),
            zaxis = dict(title='Log amplitude'),
            ),
    )
    fig = go.Figure(data=data, layout=layout)
    py.iplot(fig)      


def show_segments(ax, boundaries, ymin=0, ymax=1, color='red'):
    for i, (start_frame, end_frame) in enumerate(boundaries):
        rect = ax.axvspan(start_frame, end_frame, ymin=ymin, ymax=ymax, color=color, alpha=0.5)
        y_0, y_1 = ax.get_ylim()
        x = start_frame + (end_frame - start_frame)/2
        y = y_0 + 0.01*(y_1-y_0) if ymin==0 else y_1 - 0.05*(y_1-y_0)
        ax.text(x, y, str(i+1), horizontalalignment='center', fontdict={'family': 'sans-serif', 'size': 15, 'color': 'white'})

def calculate_boundaries(segments):
    start_frames = (seg.start_frame for seg in segments)
    end_frames = (seg.end_frame for seg in segments)
    return np.array(list(zip(start_frames, end_frames)))

def on_create_data_rl_button_click(sender):
    print(f'tbd: create RL-features in HDF5')
    
def on_create_data_ls_button_click(sender):
    print(f'tbd: create LS-features in HDF5')
    
# UI elements
layout = widgets.Layout(width='250px', height='50px')
create_data_rl_btn = widgets.Button(description="Create labelled data for ReadyLingua", button_style='info', layout=layout, icon='download')
create_data_rl_btn.on_click(on_create_data_rl_button_click)
create_data_ls_btn = widgets.Button(description="Create labelled data for LibriSpeech", button_style='info', layout=layout, icon='download')
create_data_ls_btn.on_click(on_create_data_ls_button_click)

All data (features and labels) is stored as numpy arrays whose dimensions partially depend on the proposed network architecture and the kind of features used (spectrograms or MFCC). Regardless of the kind of features used, the raw wave (as a sequence of sampling points) will be converted to a sequence of audio frames.

The RNN is trained on audio data (sequence of frames) and will output whether a specific section in the audio signal is speech or pause (sequence of labels). Because both input and the output are sequences, this is a sequence-to-sequence model with a **many-to-many** architecture. The most important parameters for this kind of model are:

* $T_x$: Number of sequence tokens in an individual sample. This value may vary between samples!
* $T_y$: Number of sequence tokens in the output. This value may be different from $T_x$ and also vary between samples!
* $N$: Number of training samples
* $K$: Number of different output labels (i.e. the output alphabet)

In the following sections the following variables are used to denote the two components of the labelled data:

| Formal symbol | Variable name | Type | Shape | Description |
|---|---|---|---|---|
| $X$ | `X` | 2D-Array | $(N \times T_x)$ | The spectrograms of a speech segment |
| $Y$ | `Y` | 2D-Array | $(T_y \times K)$ | The one-hot encoded transcript of a speech segment | 
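To make the shape $(T_y \times K)$ of `Y` more concrete, here is a minimal sketch of one-hot encoding. It assumes an alphabet of $K = 27$ labels (a space token plus the letters `a` to `z`) that are already encoded as integers; these values are chosen for illustration only.

```python
import numpy as np

K = 27                 # assumed alphabet size: <space> plus 'a'..'z'
encoded = [20, 8, 5]   # 't', 'h', 'e' with <space>=0, a=1, ..., z=26

# One row per output token, one column per possible label.
Y = np.zeros((len(encoded), K), dtype=np.int8)
Y[np.arange(len(encoded)), encoded] = 1
```

Each row of `Y` contains exactly one `1`, at the column index of the corresponding label.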

Let's load the created corpora to make them available to this notebook.

In [None]:
rl_corpus = get_corpus('rl')
ls_corpus = get_corpus('ls')

## Train/Dev/Test split
The labelled data is split into subsets for training (_train-set_), parameter tuning (_dev-set_) and model evaluation (_test-set_). Because the raw data is distributed differently for each source (number of languages, homogeneity of the recording quality, ratio of male vs. female speakers, presence of distortions like reverb or overdrive, and many more), it must be ensured that this distribution is represented in each subset. For the _LibriSpeech_ corpus this is already done, whereas a good split needs to be determined first for the _ReadyLingua_ corpus. The corpora also differ greatly in size (24,997 speech segments/~22h audio for _ReadyLingua_ vs. 334,345 entries/~1,140h audio for _LibriSpeech_).

### ReadyLingua corpus
The raw data exhibits high variance with respect to relevant features (recording quality, length of samples, presence of distortion, ...). Since the corpus is rather small, there may be only one sample for a specific feature value (e.g. only one recording with reverb). Therefore, to keep things simple, the split into train-, dev- and test-set was done with an 80/10/10 rule without closer examination of the underlying data. This will probably not result in an optimal split, since it is possible, for example, that all female speakers end up in one subset. However, as a first attempt this fact is ignored.
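An 80/10/10 split of this kind can be sketched as follows. The function name `split_80_10_10` is hypothetical, and the random shuffling is an assumption; the corpus class may determine its split differently.

```python
import random


def split_80_10_10(entries, seed=42):
    """Shuffle the entries and cut them into train/dev/test lists."""
    shuffled = list(entries)
    random.Random(seed).shuffle(shuffled)   # fixed seed for reproducibility
    n_train = len(shuffled) * 8 // 10
    n_dev = len(shuffled) // 10
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]       # the remainder goes to the test set
    return train, dev, test


train, dev, test = split_80_10_10(range(100))
```

Because the assignment is purely random, nothing prevents a rare property (such as the only recording with reverb) from landing in a single subset, which is exactly the limitation described above.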

Improvements could be made to the training process by manually assigning each sample to a specific set after carefully inspecting the relevant features. The corpus could also be extended by creating synthesized data, e.g. creating samples with reverb from the original samples. Because the LibriSpeech corpus looked much more promising for a proof of concept, this time was not invested.

### LibriSpeech corpus
The LibriSpeech raw data is already split into train-, dev- and test-set. Each chapter is read by a different speaker. Each speaker is only contained in one of the subsets. Efforts have been made to keep the different sets within the same probability distributions (with regard to accents, ratio of male/female speakers, ...). The information about what subset a particular entry belongs to was preserved during corpus creation. To leverage the efforts made by the LibriSpeech project, this train/dev/test-set split was not changed.

---

You can explore the size of the subsets for each corpus by executing the cell below to see the number of samples (corpus entries) in each subset.

In [None]:
rl_train, rl_dev, rl_test = rl_corpus.train_dev_test_split()
print(f'ReadyLingua corpus ({len(rl_train + rl_dev + rl_test)} samples): #train-samples: {len(rl_train)}, #dev-samples: {len(rl_dev)}, #test-samples: {len(rl_test)}')

ls_train, ls_dev, ls_test = ls_corpus.train_dev_test_split()
print(f'LibriSpeech corpus ({len(ls_train + ls_dev + ls_test)} samples): #train-samples: {len(ls_train)}, #dev-samples: {len(ls_dev)}, #test-samples: {len(ls_test)}')

##  Feature extraction from audio signal

In order to train an RNN, each training sample needs to be converted into some sort of sequence of features. In this case the samples are the audio files from the corpus entries that were converted to WAV files (`*.wav`) and downsampled to 16kHz (mono). This chapter describes different ways of extracting features from the audio signal that can be used for training.

### Raw waves
As the file extension suggests, the wave files contain the audio signal as a raw wave, which is just a series of discrete sample values. Because a sampling rate of 16kHz was used we get 16,000 sample values per second. A sample value corresponds to the _amplitude_ of the waveform at the given time step. These values can be stored in a 1-dimensional NumPy array and plotted in two dimensions (time vs. amplitude).
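As a small illustration of this representation, the following sketch creates a synthetic raw wave instead of reading a file. The 440 Hz tone and the duration of half a second are arbitrary choices for the example.

```python
import numpy as np

sample_rate = 16000   # sample values per second
duration_s = 0.5

# Time stamp of every sample value, then the amplitude at each time stamp.
t = np.arange(int(sample_rate * duration_s)) / sample_rate
wave = 0.5 * np.sin(2 * np.pi * 440 * t)
```

The result is a 1-dimensional array with `sample_rate * duration_s` entries, one amplitude per time step.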

For example, consider the raw wave for a specific speech segment. Feel free to change the first line to visualize the raw wave for a random corpus entry.

In [None]:
corpus_entry = rl_corpus['20161124weeklyaddressthanksgiving']
speech_segment = corpus_entry.speech_segments[49]

# uncomment the following lines for random corpus entry and/or speech segment
# corpus_entry = random.choice(rl_corpus)
# speech_segment = random.choice(corpus_entry.speech_segments)

print(f'number of sampling points: {speech_segment.audio.shape[0]}, sampling rate: {speech_segment.rate}')
print(f'transcript: {speech_segment.transcript}')

display(Audio(data=speech_segment.audio, rate=speech_segment.rate))
title = f'Raw wave of speech segments in {corpus_entry.id}.wav'

fig = plt.figure(figsize=default_figsize, facecolor=default_facecolor)
show_wave(speech_segment.audio, speech_segment.rate, title=title)

### From raw waves to spectrograms

Although already a sequence, training on the raw wave would not be very useful since we would only have one feature (the amplitude) per time step. However, an audio signal is just a superposition of frequencies with different phases and amplitudes. For a given time slot (_window_), the raw signal can be decomposed into its underlying frequencies using the _Fourier transform_, yielding the amplitude of each frequency. 
These values can be stored in a 1-D array of shape $(f \times 1)$, where $f$ denotes the number of frequencies.
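The decomposition of a single window can be sketched with NumPy's FFT routines. The window below is synthetic (a strong 500 Hz tone plus a weaker 2000 Hz tone) and the window length of 320 sample values is an assumption that corresponds to 20 ms at 16 kHz.

```python
import numpy as np

sample_rate = 16000
window_size = 320   # 20 ms at 16 kHz
t = np.arange(window_size) / sample_rate

# Synthetic window: a strong 500 Hz tone plus a weaker 2000 Hz tone.
window = 1.0 * np.sin(2 * np.pi * 500 * t) + 0.3 * np.sin(2 * np.pi * 2000 * t)

# rfft returns the complex spectrum for the non-negative frequencies only;
# its absolute value is the magnitude of each frequency.
magnitudes = np.abs(np.fft.rfft(window))
freqs = np.fft.rfftfreq(window_size, d=1 / sample_rate)

strongest = freqs[np.argmax(magnitudes)]   # frequency with the largest magnitude
```

The array `magnitudes` is one column of a spectrogram: one value per frequency, with the largest value at the frequency of the dominant tone.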

Since we will be using spectrograms as input values `X` to train an RNN, $T_x$ denotes the number of windows that can be calculated from the audio signal. All the windows together form the **spectrogram**, which can be represented as a matrix of shape ($f \times T_x$) where each entry corresponds to the _magnitude_ of one frequency in one window. A spectrogram can be visualized by color-coding the values. Consider the following spectrogram derived from the raw wave above.

In [None]:
mag_spec = speech_segment.mag_specgram()
print(mag_spec.shape)
title = f'Amplitude-spectrogram of speech segment in {corpus_entry.id}.wav'
show_spectrogram(mag_spec, speech_segment.rate, ms_to_frames(10, speech_segment.rate), title=title, scale=None)

Such a spectrogram can now be calculated for every speech segment. The following table contains all relevant parameters:

| Symbol | Variable in code | Value | Description |
|---|---|---|---|
| $n$ | `num_values` | - | number of discrete sampling values in audio signal |
| $r$ | `sample_rate` | - | sampling rate of audio signal |
| $w_{ms}$ | `window_size_ms` | 20 | Window length in ms |
| $w$ | `window_size` | 320 | Window length in frames |
| $s_{ms}$ | `step_size_ms` | 10 | Step length in ms |
| $s$ | `step_size` | 160 | Step length in frames |

Note that the window and step length in frames can be derived from their values in milliseconds by calculating $w = \frac{r \cdot w_{ms}}{1000}$ and $s = \frac{r \cdot s_{ms}}{1000}$ respectively.
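The conversion from milliseconds to frames is a one-liner. The function below is a hypothetical stand-in written for this explanation; the `ms_to_frames` helper imported from `util.audio_util` may be implemented differently.

```python
def ms_to_frames_sketch(duration_ms, sample_rate):
    """Convert a duration in milliseconds to a number of sample values."""
    return int(sample_rate * duration_ms / 1000)


window_size = ms_to_frames_sketch(20, 16000)   # w
step_size = ms_to_frames_sketch(10, 16000)     # s
```

With $r = 16000$ this reproduces the values from the table: 320 frames per window and 160 frames per step.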

To calculate the spectrogram for an audio signal, a sliding window of size $w$ is moved over the sample values with step size $s$. Note that the step size is usually smaller than the window size, which means the windows overlap to a certain degree. For any given audio signal $x$ the number of windows $T_x$ is determined by the step size (assuming the signal is padded so that a window is centered on every $s$-th sample value):

$$
T_x = \left\lfloor \frac{n}{s} + 1 \right\rfloor
$$

With the values used here $w = 2s$, so the overlap $w - s$ is equal to the step size $s$.

The flooring is needed because the number of sample values is generally not a multiple of the step size, which would result in fractional values for $T_x$. Since the windows of the spectrogram will be used as input to an RNN, $T_x$ corresponds to the number of time steps in the input sequence. Therefore only whole numbers make sense. Alternatively, the signal could be padded at the end so that the division yields a whole number.
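The number of windows can be checked by explicitly cutting a signal into windows. The sketch below assumes that the signal is zero-padded with half a window on both sides so that a window is centered on every $s$-th sample value; whether the project's spectrogram code pads in exactly this way is an assumption. The function name `frame_signal` is hypothetical.

```python
import numpy as np


def frame_signal(signal, window_size, step_size):
    """Cut a 1-D signal into overlapping windows (one window per row)."""
    # Zero-pad half a window on both sides so windows are centered on samples.
    padded = np.pad(signal, window_size // 2, mode='constant')
    starts = range(0, len(padded) - window_size + 1, step_size)
    return np.stack([padded[i:i + window_size] for i in starts])


n, w, s = 9760, 320, 160
frames = frame_signal(np.zeros(n), w, s)
T_x = frames.shape[0]   # number of windows
```

For a signal with $n = 9760$ sample values this yields 62 windows of 320 sample values each.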

According to the [Nyquist theorem](https://en.wikipedia.org/wiki/Nyquist_rate) the sampling frequency must be at least twice the highest frequency contained in a signal to be able to reconstruct the original signal from the discrete sampling values. Since our sampling rate is 16kHz this means the maximum frequency that can be reproduced is 8kHz. Therefore the frequencies in our spectrogram are all in the range $0..\frac{r}{2} = 0..8000$ Hz. This interval can be divided into equally sized sections. Including the borders of these sections this gives us $f$ equidistant sampling frequencies. The value for $f$ can be calculated as follows:

$$
f = \frac{w}{2} + 1
$$

Note that we add 1 at the end because the borders (lowest and the highest frequency) are both included.

Since the frequency band of the spectrogram will be spaced equally, the distance between two sample frequencies is 

$$
\Delta f = \frac{r}{2\cdot (f - 1)}
$$

This means that the $i$-th sample frequency $F_i$ in the frequency band can be calculated as follows:

$$
F_i = i \cdot \Delta f
$$
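The three formulas above can be evaluated directly and compared with the frequencies NumPy reports for a real-valued FFT of the same window length. The values $r = 16000$ and $w = 320$ are the ones used in this project.

```python
import numpy as np

r, w = 16000, 320

f = w // 2 + 1                # number of sample frequencies
delta_f = r / (2 * (f - 1))   # distance between two sample frequencies in Hz
F = np.arange(f) * delta_f    # F_i = i * delta_f

# Frequencies of the FFT bins for a window of w real-valued samples.
fft_freqs = np.fft.rfftfreq(w, d=1 / r)
same_bins = np.allclose(F, fft_freqs)
```

Both ways of computing the frequency axis agree, from 0 Hz up to $\frac{r}{2}$.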

**Example**:

For this project all audio signals were re-sampled with a sampling rate $r=16000$. To calculate the spectrogram we use a sliding window of $w_{ms}=20ms$ length and a step size of $s_{ms}=10ms$. In frame units this gives us the values $w=\frac{16000 \cdot 20 ms}{1000 ms} = 320$ frames per window and $s=\frac{16000 \cdot 10 ms}{1000 ms} = 160$ frames per step.

As stated above the frequencies all lie in the interval $[0..8000]$. This band is now divided into sections giving us $f = \frac{320}{2} + 1 = 161$ sample frequencies, where the distance between two neighbouring frequencies is $\Delta f = \frac{16000}{2 \cdot (161 - 1)} = \frac{8000}{160} = 50 Hz$. The $i$-th sample frequency can therefore be calculated as $F_i = i \cdot 50$. The frequencies in the spectrogram are then:

    [0, 50, 100, 150, ... , 7950, 8000]

The raw wave for the example speech segment above consists of $n = 9760$ sample values. Using a window size of $w=320$ frames and a step size of $s=160$ frames we arrive at a value of $T_x = \left\lfloor \frac{9760}{160} + 1 \right\rfloor = 62$ frames in the spectrogram.

We can verify this for the above spectrogram:

In [None]:
window_size_ms, step_size_ms, num_vals = 20, 10, speech_segment.audio.shape[0]
window_size = ms_to_frames(window_size_ms, speech_segment.rate)
step_size = ms_to_frames(step_size_ms, speech_segment.rate)

print(f'n = {num_vals}\t(number of sample values)')
print(f'r = {speech_segment.rate}\t(sample rate)')
print(f'w_ms = {window_size_ms}\t(window size in ms)\t\t ==> w = {window_size}\t\t(window size in frames)')
print(f's_ms = {step_size_ms}\t(step size in ms)\t\t ==> s = {step_size}\t\t(step size in frames)')
print()

f, T_x = mag_spec.shape
print(f'spec.shape = (f, T_x) = ({f}, {T_x})')
print()

delta_f = int(speech_segment.rate / (2 * (f - 1)))
print(f'delta_f = {delta_f}\t(difference between sample frequencies in Hz)')
print()

freqs = np.array(range(0, f*delta_f, delta_f))
print('Frequencies (y-Axis):')
print(freqs)

#### Power-Spectrograms

The spectrogram above visualizes the amplitude of the frequencies. Because human hearing perceives loudness roughly logarithmically, we convert the values to decibel (dB) units. To do this we square the amplitudes, which yields the so-called power spectrogram, and take the logarithm. The logarithmic decibel scale corresponds to how humans perceive loudness: to double the perceived volume of a sound you need to put roughly 8 times more energy into it.
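The conversion from amplitudes to decibels can be sketched in a few lines. The three amplitude values are made up for illustration, and the loudest value is used as the 0 dB reference, which is also what `ref=np.max` requests from `librosa.power_to_db` in the cells of this notebook (apart from librosa's clipping of very small values).

```python
import numpy as np

amplitudes = np.array([1.0, 0.1, 0.01])

power = amplitudes ** 2                         # power spectrum values
power_db = 10 * np.log10(power / power.max())   # dB relative to the maximum
```

Every reduction of the amplitude by a factor of 10 lowers the value by 20 dB.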

We can visualize the results by plotting the log-scaled dB-values along the two axes (time and frequency):

In [None]:
pow_spec = speech_segment.pow_specgram()
print(pow_spec.shape)
title = f'3D Power-Spectrogram of speech segments in {corpus_entry.id}.wav'
show_spectrogram_3d(librosa.power_to_db(pow_spec, ref=np.max), title=title)

We can further reduce the above 3D-plot by one dimension by flattening it along the z-axis (log amplitude). We don't lose any information because the third dimension (dB value) is color-coded. From this plot we clearly see that most of the high-valued entries of the spectrogram lie within a frequency range of approximately 300-3400 Hz, which corresponds to the usable voice frequency band [used e.g. in telephony](https://en.wikipedia.org/wiki/Voice_frequency).

In [None]:
title=f'Power-spectrogram of speech segment in {corpus_entry.id}.wav'
show_spectrogram(librosa.power_to_db(pow_spec, ref=np.max), speech_segment.rate, step_size=step_size, title=title)

#### Mel-Spectrograms

In the power spectrogram above the features were extracted by putting the values on the Hertz scale. Pitch, however, is perceived logarithmically on this scale: a tone of 2000Hz is heard one octave above a tone of 1000Hz, which in turn is one octave above a tone of 500Hz. In other words: doubling a Hertz value corresponds to setting the pitch of a tone an octave higher. Consider the following examples for reference:

| 500Hz | 1000Hz | 2000Hz
|---|---|---
|<audio src="../assets/500Hz.wav" style="width: 150px;" controls preload></audio>|<audio src="../assets/1000Hz.wav" style="width: 150px;" controls preload></audio>|<audio src="../assets/2000Hz.wav" style="width: 150px;" controls preload></audio>

Alternatively the values could be put on the [Mel scale](https://en.wikipedia.org/wiki/Mel_scale), which is based on psycho-acoustic findings about how pitches of equal distance are perceived by humans. It turns out that up to approximately 500Hz the Mel scale corresponds roughly with the Hertz scale, whereas for sounds above 500Hz the intervals between two sounds must increase in order to be perceived as equally far apart. In other words: lower frequencies are perceptually more important than higher frequencies. Thus the Mel scale is more discriminative for sounds at low frequencies and less discriminative for sounds at high frequencies. As a result, four octaves on the Hertz scale above 500Hz are judged to comprise about two octaves on the Mel scale. The reference point is set at 1000Hz, which corresponds to 1000MEL.

Consider the following plot which maps values on the Hertz-scale to their counterparts on the Mel-scale:

![Mel scale](../assets/mel-scale.svg)
_Source: Wikipedia_

There are various formulas to convert Hz to MEL. One possible choice is the one from Douglas O'Shaughnessy <cite data-cite="6174726/NQC4HAR8"></cite>:

$$
m = 2595 \log_{10}\left( 1 + \frac{f}{700} \right) = 1127 \ln \left( 1 + \frac{f}{700} \right)
$$

Using this formula we arrive at the following values for the above values on the Hertz scale:

| ~607MEL = 500Hz | 1000MEL = 1000Hz| ~1521MEL = 2000Hz
|---|---|---
|<audio src="../assets/607Hz.wav" style="width: 150px;" controls preload></audio>|<audio src="../assets/1000Hz.wav" style="width: 150px;" controls preload></audio>|<audio src="../assets/1521Hz.wav" style="width: 150px;" controls preload></audio>
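The Mel values in the table can be reproduced with a few lines of Python. The sketch uses the constants 2595 and 700 of O'Shaughnessy's formula; the function name `hz_to_mel` is chosen for this example.

```python
import math


def hz_to_mel(f_hz):
    """Convert a frequency in Hz to mel (O'Shaughnessy's formula)."""
    return 2595 * math.log10(1 + f_hz / 700)


# Rounded Mel values for 500 Hz, 1000 Hz and 2000 Hz.
mels = [round(hz_to_mel(f)) for f in (500, 1000, 2000)]
```

The octave from 500Hz to 1000Hz spans about 393 mel, whereas the octave from 1000Hz to 2000Hz spans about 521 mel, although the second octave is twice as wide in Hertz.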

We can now use the Mel scale instead of the Hertz scale to create a **Mel power spectrogram** as an alternative to the "normal" spectrogram. To do this, each bin in the spectrogram is divided into chunks of different sizes. For lower frequencies, the chunks are smaller because the human ear is able to discern more subtle changes in frequency in low-frequency areas (i.e. the Mel scale is more discriminative here). For higher frequencies the chunks become larger. The values in each chunk are averaged.

Note that the number of features is usually smaller than when calculating the spectrograms because the frequencies have been binned. The number of bins determines the number of frequencies.
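The binning can be sketched as follows for a single column of a power spectrogram. This is a simplified illustration that averages the values within non-overlapping, Mel-spaced bands, as described above. Libraries such as librosa typically use overlapping triangular filters instead, so the numbers will not match theirs exactly. All function names and the choice of 10 bands are assumptions made for this example.

```python
import numpy as np


def hz_to_mel(f_hz):
    return 2595 * np.log10(1 + f_hz / 700)


def mel_to_hz(m):
    return 700 * (10 ** (m / 2595) - 1)


def mel_band_average(power_spectrum, sample_rate, n_bands):
    """Average the bins of one power spectrum within Mel-spaced bands."""
    power_spectrum = np.asarray(power_spectrum, dtype=float)
    bin_freqs = np.linspace(0, sample_rate / 2, len(power_spectrum))
    # Band edges are equally spaced on the Mel scale, hence narrow at low
    # frequencies and wide at high frequencies on the Hertz scale.
    edges_mel = np.linspace(0, hz_to_mel(sample_rate / 2), n_bands + 1)
    edges_hz = mel_to_hz(edges_mel)
    # Index of the band every frequency bin falls into.
    band_index = np.searchsorted(edges_hz, bin_freqs, side='right') - 1
    band_index = np.clip(band_index, 0, n_bands - 1)
    return np.array([power_spectrum[band_index == b].mean()
                     for b in range(n_bands)])


spectrum = np.arange(161, dtype=float)   # stand-in for one spectrogram column
mel_column = mel_band_average(spectrum, sample_rate=16000, n_bands=10)
```

The 161 frequency bins are reduced to 10 values. With many bands and few frequency bins the lowest bands can end up empty, which is one reason why real implementations use overlapping filters.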

In [None]:
mel_spec = speech_segment.mel_specgram()
print(mel_spec.shape)
title=f'Mel-spectrogram of speech segment in {corpus_entry.id}.wav'
show_spectrogram(librosa.power_to_db(mel_spec, ref=np.max), sample_rate=speech_segment.rate, step_size=step_size, title=title)

#### MFCC

As an alternative to Spectrograms we could use Mel Frequency Cepstral Coefficients (MFCC) as features. Because MFCC allow for a more compact representation of speech features than spectrograms, MFCC have a long-standing tradition in speech recognition and are mentioned often in papers. MFCC use the same psychoacoustic findings as described above for the Mel-Spectrograms and can be computed by performing the following steps (see <cite data-cite="6174726/YWWQ86H5"></cite>):

1. Divide the signal into frames: usually some small time frame like 20ms is used
1. Obtain the amplitude spectrum: done using Fourier Transformation
1. Take the logarithm: because the perceived loudness of a signal has been found to be approximately logarithmic
1. Convert to Mel spectrum: The frequencies are put into mel-spaced bins, whereas the bins for lower frequencies are smaller and become larger for higher frequencies. The frequency components in each bin are averaged, which leads to a smoothed spectrum of frequencies from the original spectrogram.
1. The number of features is reduced. PCA could be used for this transformation, but usually Discrete Cosine Transform (DCT) is used to approximate it.

Note that the first 4 steps are the same steps that are needed to calculate the Mel-Spectrogram. Therefore MFCC can easily be calculated from the Mel-Spectrogram by performing the feature reduction (step 5). The number of coefficients to keep can be chosen. The resulting feature matrix usually contains far fewer features than the power-spectrograms.
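Step 5 can be sketched for a single frame as follows. The input is a made-up vector of 40 log-Mel energies, and the DCT-II is written out with NumPy so that the example has no further dependencies. Library implementations usually apply an additional scaling, so the absolute values are only comparable up to a constant factor.

```python
import numpy as np


def dct_ii(x, n_coeffs):
    """Unnormalized DCT-II of a 1-D array, keeping the first n_coeffs terms."""
    n = len(x)
    k = np.arange(n_coeffs)[:, None]   # index of the output coefficient
    i = np.arange(n)[None, :]          # index of the input value
    basis = np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    return basis @ x


# Stand-in for one frame of 40 log-Mel energies (the output of step 4).
log_mel_frame = np.log(np.linspace(1.0, 2.0, 40))
mfcc_frame = dct_ii(log_mel_frame, n_coeffs=13)
```

Keeping only the first 13 coefficients reduces the 40 Mel features of a frame to 13 MFCC features.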

In [None]:
mfccs = speech_segment.mfcc().T
print(mfccs.shape)
title=f'MFCC of speech segment in {corpus_entry.id}.wav'
show_spectrogram(mfccs, sample_rate=speech_segment.rate, step_size=step_size, title=title)

### Comparison of features

The following table compares the number of features produced for each feature type (Power-Spectrograms, Mel-Spectrograms and MFCC) with the default values used in this project:

| Feature Type | #Features |
|---|---|
| Power-Spectrograms | 161 |
| Mel-Spectrograms | 40 |
| MFCC | 13 |

## Label extraction from transcripts

Because our approach is to train an RNN that can produce a transcript for a given audio signal (ASR), we need to derive the target labels for each speech segment. We can derive those labels from the transcript, which is included for each speech segment in the corpora. However, because an RNN only works with numerical values, we must find a way to convert the representation from alphanumeric to numeric. Additionally, since transcripts often contain unwanted characters and to reduce the number of possible target labels, some preprocessing needs to be done. This chapter documents how both are done.

### Normalizing the transcript

In order to facilitate learning, the alphabet of target characters in the transcript is limited to the 26 lowercase characters of the alphabet (`[a..z]`) plus the space character. Since this limitation results in fewer classes to compute probabilities for, it is expected to speed up the learning process and improve the quality of the result.

The limitation of target classes is done by normalizing the transcript. Normalization involves the following steps:

1. remove leading and trailing whitespace (trimming)
2. collapse multiple consecutive whitespace characters within the transcript into one
3. replace accented characters with their base character from the alphabet (e.g. _é_/_è_/_ê_/... -> e, _ß_ -> ss, etc.)
4. remove non-alphanumeric characters (removes punctuation)
5. make everything lowercase
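The five steps above can be sketched with the standard library alone. The function `normalize_sketch` is a hypothetical illustration and not the `normalize` function from `util.string_util`, which may treat some characters (e.g. hyphens) differently.

```python
import re
import unicodedata


def normalize_sketch(text):
    """Reduce a transcript to lowercase ASCII letters, digits and spaces."""
    text = text.strip()                           # 1. trim
    text = re.sub(r'\s+', ' ', text)              # 2. collapse whitespace
    text = text.replace('ß', 'ss')                # 3. ß has no base character
    text = unicodedata.normalize('NFKD', text)    # 3. split accents from letters
    text = ''.join(c for c in text if not unicodedata.combining(c))
    text = re.sub(r'[^A-Za-z0-9 ]', '', text)     # 4. drop punctuation
    return text.lower()                           # 5. lowercase


examples = ['Außerirdische', ' foo    bar   ', 'The quick, brown fox!']
normalized = [normalize_sketch(e) for e in examples]
```

Note that punctuation is removed without replacement in this sketch, so a hyphenated word like `Crème-brûlée` becomes the single token sequence `cremebrulee`.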

You can edit/execute the cell below with your own examples to see the result of normalization.

In [None]:
from util.string_util import *
samples = [ 'Crème-brûlée', 'Außerirdische', ' foo    bar   ']
for sample in samples:
    print(f'{sample} ==> {normalize(sample)}')

### Tokenizing the transcript

In order to arrive at the target labels for a speech segment, its normalized transcript needs to be tokenized first. Tokenizing involves splitting the transcript first into words and then into characters. The tokens are the characters of the transcript, where a special token `<space>` is used between the characters of two words.

The mapping of audio to text is actually a classification problem: each part of the audio signal is mapped to a specific character (i.e. _token_). Since an RNN (and TensorFlow) only works with numeric data, the tokens need to be numerically encoded as integers. The following table shows how the encoding is done:

| **Token**    | `<space>` | `a` | `b` | `c` | ... | `z`  |
|--------------|:---------:|:---:|:---:|:---:|:---:|:----:|
| **Encoding** | `0`       | `1` | `2` | `3` | ... | `26` |

The following table shows how an example transcript is converted to its encoded form:

| **Original transcript** | The quick, brown fox jumps over the lazy dog!  |
|-------------------------|------------------------------------------------|
| **Normalized transcript** | the quick brown fox jumps over the lazy dog |
| **Tokenized transcript** | `['t', 'h', 'e', '<space>', 'q', 'u', 'i', 'c', 'k', '<space>', 'b', 'r', 'o', 'w', 'n', ...]` |
| **Encoded transcript** | `[ 20, 8, 5, 0, 17, 21, 9, 3, 11, 0, 2, 18, 15, 23, 14, ...]` |
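Tokenizing and encoding as shown in the tables can be sketched as follows. The function names `tokenize`, `encode` and `decode` are chosen for this example and are not taken from the project code.

```python
SPACE = '<space>'


def tokenize(normalized):
    """Split a normalized transcript into character tokens."""
    return [SPACE if c == ' ' else c for c in normalized]


def encode(tokens):
    """Map <space> to 0 and the letters 'a'..'z' to 1..26."""
    return [0 if t == SPACE else ord(t) - ord('a') + 1 for t in tokens]


def decode(ids):
    """Inverse of encode followed by joining the characters."""
    return ''.join(' ' if i == 0 else chr(i + ord('a') - 1) for i in ids)


tokens = tokenize('the quick brown')
ids = encode(tokens)
```

The list `ids` starts with `20, 8, 5, 0, 17, ...`, which matches the encoded transcript in the table above, and `decode(ids)` restores the normalized text.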

#### The issue with numbers

Numbers in transcripts pose a special problem, since their pronunciation differs fundamentally from their lexical representation if written with digits (which is usually the case). Consider the number `8`, which is represented textually by the digit `'8'` and is pronounced as _'eight'_. In this case, the actual sequence of characters (`'e', 'i', 'g', 'h', 't'`) is replaced by a single character `'8'` and can therefore not be approximated like ordinary words.

The problem becomes even harder since compound numbers are sometimes pronounced differently than their individual digits. Consider the number `13`, which is pronounced `'thirteen'` (and not `'onethree'`!). This becomes especially important in languages like German, which swap the order of tens and units (e.g. `'21'` is pronounced as `'one-and-twenty'`).

Since numbers are a problem of their own we want to limit their influence on the training process by training the RNN only on transcripts without numbers. We can filter those out by calling the corpus entry like a function and passing in the `numeric=False` argument, which returns only those speech segments whose transcripts do not contain numbers.
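The filtering idea itself can be sketched independently of the corpus classes. The helper `contains_number` and the example transcripts are made up for illustration; how the corpus entry implements the `numeric` argument internally is not shown here.

```python
import re


def contains_number(transcript):
    """Return True if the transcript contains at least one digit."""
    return re.search(r'\d', transcript) is not None


transcripts = [
    'the quick brown fox',
    'chapter 13',
    'it was 8 in the morning',
    'jumps over the lazy dog',
]
without_numbers = [t for t in transcripts if not contains_number(t)]
```

Only the first and the last transcript remain. Numbers that are already spelled out (e.g. `thirteen`) are not affected by such a filter.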

## Summary

This chapter showed how to extract features from standardized corpus entries that can be used to train an RNN. Two types of spectrograms (Power and Mel-Power spectrograms) and MFCC were identified as possible features. The number of features is generally smallest for MFCC and highest for Power-Spectrograms. This might be important because training on more features also means more training data is required.

A feature matrix `X` can be created for each speech segment, together with the target vector `Y`, which is just a numerically encoded form of the normalized transcript. Normalization is needed to limit the influence of various side effects caused e.g. by the specialized textual representation of numbers and also to improve the learning process.

Both `X` and `Y` make up the labelled data which can be further subdivided into three different sets (train-, dev- and test-set) used to train, validate and evaluate an RNN. For the _LibriSpeech_ corpus this is easy because the subdivision is already made by the creators of the raw data. For _ReadyLingua_, a split of 80/10/10% was made for each language without deeper analysis of the properties of the recordings (speaker gender, recording quality, etc.).

## References

<div class="cite2c-biblio"></div>