# IP8: Creation of Labelled Data
As usual, let's define the imports and some helper functions before we start.

In [None]:
import random
import matplotlib.pyplot as plt
import numpy as np
from IPython.display import HTML, Audio, display
import ipywidgets as widgets

from corpus_util import *
from audio_util import *

def show_audio(corpus_entry):
    entry_title = HTML(f"""
    <h3>Sample corpus entry: {corpus_entry.name}</h3>
    <p><strong>Path to raw data</strong>: {corpus_entry.original_path}</p>
    """)
    entry_audio = Audio(corpus_entry.audio_file)
    entry_text = widgets.Accordion(children=[widgets.HTML(f'<pre>{corpus_entry.transcript}</pre>')], selected_index=None)
    entry_text.set_title(0, 'Transcript')
    
    display(entry_title)
    display(entry_audio)
    display(entry_text)

def show_spectrogram(audio_file):
    NFFT = 200  # Length of each window segment
    Fs = 8000  # Sampling frequency [Hz]
    noverlap = 120  # Overlap between windows

    freqs, times, spec = calculate_spectrogram(audio_file, nfft=NFFT, fs=Fs, noverlap=noverlap)

    pad_xextent = (NFFT - noverlap) / Fs / 2
    xextent = np.min(times) - pad_xextent, np.max(times) + pad_xextent
    xmin, xmax = xextent
    extent = xmin, xmax, freqs[0], freqs[-1]

    im = plt.imshow(spec, extent=extent, aspect='auto')
    plt.ylabel('Frequency [Hz]')
    plt.xlabel('Time [steps]')  

After creating the corpora, we can start creating labelled data to train an RNN. The sections below use the following variable names to denote the parts of this data:

* `X`: The training data, i.e. the spectrograms (one spectrogram per corpus entry)
* `Y`: The training labels, i.e. sequences of zeroes when text is being spoken and sequences of ones when nothing is being spoken (i.e. silence or only background noise)
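
To make the label convention concrete, here is a minimal sketch of how a label sequence could be derived for one corpus entry. It assumes that the speech segments are available as `(start, end)` pairs in seconds and that the number of time steps of the spectrogram is known. The function `make_labels` and its parameters are hypothetical and not part of `corpus_util`.

```python
def make_labels(n_steps, duration, speech_segments):
    """Return one label per time step: 0 = text is spoken, 1 = nothing is spoken.

    n_steps, duration and speech_segments are illustrative assumptions:
    the number of spectrogram time steps, the audio length in seconds and
    a list of (start, end) tuples in seconds.
    """
    step_len = duration / n_steps  # seconds covered by one time step
    labels = []
    for i in range(n_steps):
        t = (i + 0.5) * step_len  # centre of the i-th time step
        spoken = any(start <= t < end for start, end in speech_segments)
        labels.append(0 if spoken else 1)
    return labels


# 5 seconds of audio, 10 time steps, speech from 1.0-2.0s and 3.5-4.5s
y = make_labels(n_steps=10, duration=5.0, speech_segments=[(1.0, 2.0), (3.5, 4.5)])
print(y)
```

Deciding by the centre of each time step is only one possible choice; a step that partially overlaps a speech segment could also be labelled by majority vote.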

Let's load the created corpora to make them available to this notebook.

In [None]:
ls_corpus = load_corpus(r'E:\librispeech-corpus\librispeech.corpus')
rl_corpus = load_corpus(r'E:\readylingua-corpus\readylingua.corpus')

## Train/Dev/Test split
The labelled data is split into subsets for training (_train-set_), validation (_dev-set_) and model evaluation (_test-set_). Since the corpora were constructed from different amounts of raw data, they vary in size and probability distribution (number of languages, homogeneity of the recording quality, ratio of male vs. female speakers, presence of distortions like reverb, echo or overdrive, and many more). Because the raw data behind the two corpora differs this much, a different approach was taken for each corpus to split it into train-, dev- and test-set.

#### ReadyLingua corpus
tbd.

#### LibriSpeech corpus
The LibriSpeech raw data is already split into train-, dev- and test-set. Each chapter is read by a different speaker, and each speaker is contained in only one of the subsets. Efforts have been made to keep the different sets within the same probability distribution (regarding accents, ratio of male/female speakers, ...). To leverage these efforts, the corpus entries created from the raw data are kept in the same sets.
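
The property that no speaker appears in more than one subset can be checked with a few lines. The sketch below uses a hypothetical `Entry` stand-in with a `speaker_id` attribute; the actual corpus entries may expose speaker information under a different name or not at all.

```python
from collections import namedtuple

# Hypothetical stand-in for a corpus entry; only speaker_id matters here.
Entry = namedtuple('Entry', ['name', 'speaker_id'])


def shared_speakers(*subsets):
    """Return the speaker ids that occur in more than one subset."""
    seen = set()
    shared = set()
    for subset in subsets:
        speakers = {entry.speaker_id for entry in subset}
        shared |= speakers & seen  # speakers already found in an earlier subset
        seen |= speakers
    return shared


train = [Entry('chapter-1', 101), Entry('chapter-2', 102)]
dev = [Entry('chapter-3', 201)]
test = [Entry('chapter-4', 301)]
leaky_dev = [Entry('chapter-5', 102)]  # speaker 102 is also in train

print(shared_speakers(train, dev, test))
print(shared_speakers(train, leaky_dev, test))
```

An empty result means the subsets are speaker-disjoint.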

---

You can explore the subsets by executing the cell below to see the number of samples (corpus entries) in each subset.

In [None]:
ls_train, ls_dev, ls_test = ls_corpus.train_dev_test_split()
print(f'LibriSpeech: #train-samples: {len(ls_train)}, #dev-samples: {len(ls_dev)}, #test-samples: {len(ls_test)}')

rl_train, rl_dev, rl_test = rl_corpus.train_dev_test_split()
print(f'ReadyLingua: #train-samples: {len(rl_train)}, #dev-samples: {len(rl_dev)}, #test-samples: {len(rl_test)}')

## From corpus entries to spectrograms
In order to train an RNN, each sample needs to be converted into some sort of sequence. In this case the samples are the audio files from the corpus entries and the sequences are their spectrograms. You can explore a random sample together with its spectrogram by executing the cell below.
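
As a rough illustration of how an audio signal becomes a sequence of spectrogram frames, the following sketch computes a power spectrogram with plain NumPy. It is not the implementation of `calculate_spectrogram` from `audio_util`; the function `simple_spectrogram`, the Hann window and the synthetic test tone are assumptions made for this example. The default parameters mirror the values used in `show_spectrogram`.

```python
import numpy as np


def simple_spectrogram(samples, fs=8000, nfft=200, noverlap=120):
    """Illustrative short-time Fourier power spectrogram (not audio_util's)."""
    samples = np.asarray(samples, dtype=float)
    step = nfft - noverlap                       # hop between window starts
    n_frames = 1 + (len(samples) - nfft) // step
    window = np.hanning(nfft)                    # assumed window function
    frames = np.stack([samples[i * step:i * step + nfft] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # (n_frames, nfft//2 + 1)
    freqs = np.fft.rfftfreq(nfft, d=1 / fs)           # frequency of each bin [Hz]
    times = (np.arange(n_frames) * step + nfft / 2) / fs  # frame centres [s]
    return freqs, times, spec.T                  # rows: frequencies, columns: time


# One second of a synthetic 1000 Hz tone sampled at 8000 Hz
fs = 8000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 1000 * t)

freqs, times, spec = simple_spectrogram(tone, fs=fs)
peak_freq = freqs[spec.mean(axis=1).argmax()]
print(spec.shape, peak_freq)
```

Each column of `spec` is one time step of the sequence fed to the RNN, so the sequence length grows with the duration of the audio file.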

In [None]:
random_entry = random.choice(ls_corpus)
show_spectrogram(random_entry.audio_file)
show_audio(random_entry)