# IP8: Creation of Labelled Data
Define a path to an empty directory with enough free storage where the labelled data can be stored:

In [None]:
target_root = r'E:/'

As usual, let's define the imports and some helper functions before we start.

In [None]:
import os
import random
import matplotlib.pyplot as plt
import numpy as np
from IPython.display import HTML, Audio, display
import ipywidgets as widgets

from create_labelled_data import create_subsets
from corpus_util import *
from audio_util import *
from data_util import *

def show_audio(corpus_entry):
    entry_title = HTML(f"""
    <h3>Sample corpus entry: {corpus_entry.name}</h3>
    <p><strong>Path to raw data</strong>: {corpus_entry.original_path}</p>
    """)
    entry_audio = Audio(corpus_entry.audio_file)
    entry_text = widgets.Accordion(children=[widgets.HTML(f'<pre>{corpus_entry.transcription}</pre>')], selected_index=None)
    entry_text.set_title(0, 'Transcription')
    
    display(entry_title)
    display(entry_audio)
    display(entry_text)

def show_spectrogram(audio_file):
    NFFT = 200  # Length of each window segment
    Fs = 8000  # Sampling frequency in Hz
    noverlap = 120  # Overlap between windows

    freqs, times, spec = calculate_spectrogram(audio_file, nfft=NFFT, fs=Fs, noverlap=noverlap)
    
    pad_xextent = (NFFT - noverlap) / Fs / 2
    xextent = np.min(times) - pad_xextent, np.max(times) + pad_xextent
    xmin, xmax = xextent
    extent = xmin, xmax, freqs[0], freqs[-1]

    im = plt.imshow(spec, extent=extent, aspect='auto')
    plt.ylabel('Frequency [Hz]')
    plt.xlabel('Time [steps]')  

    
def on_create_data_rl_button_click(sender):
    rl_target_root = os.path.join(target_root, 'readylingua-data')
    create_subsets(rl_corpus, rl_target_root)
    
def on_create_data_ls_button_click(sender):
    ls_target_root = os.path.join(target_root, 'librispeech-data')
    create_subsets(ls_corpus, ls_target_root)
    
# UI elements
layout = widgets.Layout(width='250px', height='50px')
create_data_rl_btn = widgets.Button(description="Create labelled data for ReadyLingua", button_style='info', layout=layout, icon='download')
create_data_rl_btn.on_click(on_create_data_rl_button_click)
create_data_ls_btn = widgets.Button(description="Create labelled data for LibriSpeech", button_style='info', layout=layout, icon='download')
create_data_ls_btn.on_click(on_create_data_ls_button_click)

After creating the corpora we can start creating labelled data to train an RNN. The following sections use these variable names to denote the parts of this data:

* `X`: The training data, i.e. the spectrograms (one spectrogram per corpus entry)
* `Y`: The training labels, i.e. sequences of zeroes when text is being spoken and sequences of ones when nothing is being spoken (i.e. silence or only background noise)

Let's load the created corpora to make them available to this notebook.

In [None]:
ls_corpus = load_corpus(r'E:\librispeech-corpus\librispeech.corpus')
rl_corpus = load_corpus(r'E:\readylingua-corpus\readylingua.corpus')

## Train/Dev/Test split
The labelled data is split into subsets for training (_train-set_), validation (_dev-set_) and model evaluation (_test-set_). Since the corpora were constructed from different amounts of raw data, they vary in size and probability distribution (number of languages, homogeneity of the recording quality, ratio of male vs. female speakers, presence of distortions like reverb, echo or overdrive, and many more). Because the starting points for the two corpora differed so much, a different approach was taken for each corpus to split it into train-, dev- and test-set.

#### ReadyLingua corpus
tbd.

#### LibriSpeech corpus
The LibriSpeech raw data is already split into train-, dev- and test-set. Each chapter is read by a different speaker, and each speaker is contained in only one of the subsets. Efforts have been made to keep the sets within the same probability distribution (with regard to accents, ratio of male/female speakers, ...). To leverage these efforts, the corpus entries created from the raw data are kept in the same sets.
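
The idea of keeping entries in their original subsets can be sketched as follows. This is only an illustration: the `Entry` class and its `speaker` and `subset` fields are hypothetical stand-ins, not the actual corpus entry class, and `split_by_subset` is not the implementation behind `train_dev_test_split`.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical stand-in for a corpus entry."""
    name: str
    speaker: str
    subset: str  # 'train', 'dev' or 'test', as given by the raw data


def split_by_subset(entries):
    """Group entries by the subset they had in the raw data."""
    subsets = defaultdict(list)
    for entry in entries:
        subsets[entry.subset].append(entry)
    return subsets['train'], subsets['dev'], subsets['test']


entries = [
    Entry('chapter-1', 'speaker-a', 'train'),
    Entry('chapter-2', 'speaker-b', 'train'),
    Entry('chapter-3', 'speaker-c', 'dev'),
    Entry('chapter-4', 'speaker-d', 'test'),
]
train, dev, test = split_by_subset(entries)

# Each speaker should appear in exactly one subset.
train_speakers = {e.speaker for e in train}
dev_speakers = {e.speaker for e in dev}
test_speakers = {e.speaker for e in test}
print(len(train), len(dev), len(test))
```

Because the grouping only reads the subset recorded with each entry, speakers that were disjoint across subsets in the raw data stay disjoint in the split.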

---

You can explore the subsets by executing the cell below to see the number of samples (corpus entries) in each subset.

In [None]:
ls_train, ls_dev, ls_test = ls_corpus.train_dev_test_split()
print(f'#train-samples: {len(ls_train)}, #dev-samples: {len(ls_dev)}, #test-samples: {len(ls_test)}')

rl_train, rl_dev, rl_test = rl_corpus.train_dev_test_split()
print(f'#train-samples: {len(rl_train)}, #dev-samples: {len(rl_dev)}, #test-samples: {len(rl_test)}')

## From corpus entries to spectrograms
In order to train an RNN, each sample needs to be converted into some sort of sequence. In this case the samples are the audio files from the corpus entries and the sequences are their spectrograms. You can explore a random sample together with its spectrogram by executing the cell below.

In [None]:
random_entry = random.choice(ls_corpus)
show_spectrogram(random_entry.audio_file)
show_audio(random_entry)

A spectrogram is now created as a matrix `x` for every single corpus entry. All the `x` matrices are then collected to form `X`. For each corpus this gives us three separate files (`X_train.ls`, `X_dev.ls` and `X_test.ls` for LibriSpeech data and `X_train.rl`, `X_dev.rl` and `X_test.rl` for ReadyLingua data).
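
To give an impression of what a single matrix `x` looks like, here is a minimal NumPy-only sketch of a power spectrogram. It is not the implementation of the `calculate_spectrogram` helper, which is not shown in this notebook. The function name `sketch_spectrogram` and the choice of a Hann window are assumptions. The default parameters mirror the values used in `show_spectrogram` (`NFFT = 200`, `Fs = 8000`, `noverlap = 120`).

```python
import numpy as np


def sketch_spectrogram(signal, fs=8000, nfft=200, noverlap=120):
    """Return (freqs, times, spec) for a 1-D signal (assumed Hann-windowed STFT)."""
    step = nfft - noverlap                       # hop size between windows in samples
    n_frames = 1 + (len(signal) - nfft) // step  # number of complete windows
    window = np.hanning(nfft)
    frames = np.stack([signal[i * step:i * step + nfft] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2        # power per frame and bin
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)              # bin frequencies in Hz
    times = (np.arange(n_frames) * step + nfft / 2) / fs   # window centres in seconds
    return freqs, times, spec.T                  # spec has shape (n_freqs, n_frames)


# One second of a 1 kHz tone sampled at 8 kHz.
t = np.arange(8000) / 8000
freqs, times, x = sketch_spectrogram(np.sin(2 * np.pi * 1000 * t))
peak_bin = int(x[:, 0].argmax())
print(x.shape, freqs[peak_bin])
```

With these parameters each column of `x` covers a step of 80 samples (10 ms), so the number of columns grows with the length of the audio while the number of rows (frequency bins) stays fixed.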

Accordingly, the segmentation information (speech- and pause-segments) is obtained from each corpus entry to form a label vector `y` for each sample. All label vectors are collected into the label matrix `Y`. Like the data, the labels are kept in three separate files per corpus (`Y_train.ls`, `Y_dev.ls` and `Y_test.ls`, and `Y_train.rl`, `Y_dev.rl` and `Y_test.rl` respectively).
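
How a label vector `y` could be derived from speech segments is sketched below. The function name `sketch_label_vector`, the representation of segments as `(start, end)` sample offsets and the step size of 80 samples are assumptions made for illustration. The actual logic lives in `create_subsets` and may differ. The label convention follows the definition of `Y` above: 0 while text is being spoken, 1 otherwise.

```python
def sketch_label_vector(speech_segments, n_steps, hop=80):
    """Return a list of n_steps labels: 0 = speech, 1 = pause.

    speech_segments: iterable of (start, end) sample offsets, end exclusive (assumed format)
    hop: number of samples covered by one spectrogram step (assumed value)
    """
    y = [1] * n_steps                      # start with "nothing is being spoken"
    for start, end in speech_segments:
        first = start // hop               # first step touched by the segment
        last = min(n_steps, -(-end // hop))  # ceiling division, clipped to n_steps
        for i in range(first, last):
            y[i] = 0                       # mark the step as speech
    return y


# A single speech segment from sample 160 up to sample 400, eight steps in total.
y = sketch_label_vector([(160, 400)], n_steps=8)
print(y)  # [1, 1, 0, 0, 0, 1, 1, 1]
```

Using integer sample offsets instead of seconds avoids rounding problems when mapping segment boundaries to steps.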

## Creating the spectrograms and labels

Click one of the buttons below to start processing the corresponding corpus.

In [None]:
display(widgets.HBox([create_data_rl_btn, create_data_ls_btn]))

## Exploring the labelled data
After the data has been processed, we can visualize a sample by comparing a spectrogram with its corresponding label vector.

In [None]:
from data_util import *
import random
import os

def visualize_sample(root_path, subset_name, ix_sample=None):
    """Return the (X, Y) pair at index ix_sample, or a random pair if no index is given."""
    subset_entries = load_subset(subset_name, root_path)
    subset_entries = list(subset_entries)  # materialize so we can index or sample
    X, Y = subset_entries[ix_sample] if ix_sample is not None else random.choice(subset_entries)
    return X, Y

ls_root = os.path.join(target_root, 'librispeech-data')
visualize_sample(ls_root, 'train')