# IP8: Creation of Labelled Data
Define a path to an empty directory with enough free storage where the labelled data can be stored:

In [None]:
target_root = r'E:/'

As usual, let's define the imports and some helper functions before we start.

In [None]:
import os
import random
from os.path import isdir, join
from pathlib import Path

import ipywidgets as widgets
import pandas as pd

# Math
import numpy as np
from scipy.fftpack import fft
from scipy import signal
from scipy.io import wavfile
import librosa

from sklearn.decomposition import PCA

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import IPython.display as ipd
from IPython.display import HTML, Audio
import librosa.display

import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls

%matplotlib inline

# Project helpers
from create_labelled_data import create_subsets
from corpus_util import *
from audio_util import *
from data_util import *

rl_corpus_root = os.path.join(target_root, 'readylingua-corpus')
ls_corpus_root = os.path.join(target_root, 'librispeech-corpus')

rl_data_root = os.path.join(target_root, 'readylingua-data')
ls_data_root = os.path.join(target_root, 'librispeech-data')

rl_corpus_path = os.path.join(rl_corpus_root, 'readylingua.corpus')
ls_corpus_path = os.path.join(ls_corpus_root, 'librispeech.corpus')


def show_labelled_data(corpus_entry):
    rate, audio = corpus_entry.audio

    display(Audio(data=audio, rate=rate))

    fig = plt.figure(figsize=(14, 8))
    ax_wave = show_wave(fig, audio, corpus_entry.audio_file)

    freqs, times, spectrogram = log_specgram(audio, rate)
    ax_spec, extent = show_spectrogram(fig, freqs, times, spectrogram, corpus_entry.audio_file)
    left, right, bottom, top = extent

    # use the data root defined above instead of a hard-coded path
    x, y, _ = load_labelled_data(corpus_entry, rl_data_root)

    # mark the pause segments in both plots
    boundaries = calculate_pause_boundaries(y)
    show_pause_segments(ax_wave, boundaries, len(audio))
    show_pause_segments(ax_spec, boundaries, right-left)

def show_wave(fig, audio, audio_file):
    ax1 = fig.add_subplot(211)
    ax1.set_title('Raw wave of ' + audio_file)
    ax1.set_ylabel('Amplitude')
    ax1.plot(np.linspace(0, len(audio), len(audio)), audio)
    return ax1

def show_spectrogram(fig, freqs, times, spectrogram, audio_file=None):
    # audio_file is optional and only used for the title (no dependency on a global variable)
    ax2 = fig.add_subplot(212)
    extent = [times.min(), times.max(), freqs.min(), freqs.max()]
    ax2.imshow(spectrogram.T, aspect='auto', origin='lower', extent=extent)
    ax2.set_yticks(freqs[::16])
    # guard against a step of 0 for very short recordings
    ax2.set_xticks(times[::max(1, int(len(times)/10))])
    ax2.set_title('Spectrogram of ' + audio_file if audio_file else 'Spectrogram')
    ax2.set_ylabel('Freqs in Hz')
    ax2.set_xlabel('Seconds')
    return ax2, extent

def show_pause_segments(ax, boundaries, x_width):
    for pause_start, pause_end in boundaries:
        ax.axvspan(pause_start*x_width, pause_end*x_width, color='red', alpha=0.5)
    
def calculate_pause_boundaries(y):
    boundaries = np.flatnonzero(np.diff(np.r_[0,y,0]) != 0).reshape(-1,2) - [0,1]
    return [tuple(elem) for elem in boundaries / len(y)]

    
def on_create_data_rl_button_click(sender):
    # create the labelled data from the ReadyLingua corpus
    create_subsets(rl_corpus, rl_data_root)

def on_create_data_ls_button_click(sender):
    # create the labelled data from the LibriSpeech corpus
    create_subsets(ls_corpus, ls_data_root)
    
# UI elements
layout = widgets.Layout(width='250px', height='50px')
create_data_rl_btn = widgets.Button(description="Create labelled data for ReadyLingua", button_style='info', layout=layout, icon='download')
create_data_rl_btn.on_click(on_create_data_rl_button_click)
create_data_ls_btn = widgets.Button(description="Create labelled data for LibriSpeech", button_style='info', layout=layout, icon='download')
create_data_ls_btn.on_click(on_create_data_ls_button_click)

After creating the corpora we can start creating labelled data to train an RNN. The following sections use these variable names to denote the parts of this data:

* `X`: The training data, i.e. the spectrograms (one spectrogram per corpus entry)
* `Y`: The training labels, i.e. sequences of zeroes when text is being spoken and sequences of ones when nothing is being spoken (i.e. silence or only background noise)

Let's load the created corpora to make them available to this notebook.

In [None]:
rl_corpus = load_corpus(rl_corpus_path)
ls_corpus = load_corpus(ls_corpus_path)

## Train/Dev/Test split
The labelled data is split into subsets for training (_train-set_), validation (_dev-set_) and model evaluation (_test-set_). Since the corpora were constructed from different amounts of raw data, they differ in size and in their probability distribution (number of languages, homogeneity of the recording quality, ratio of male to female speakers, presence of distortions like reverb, echo or overdrive, and many more). Because the starting points were so different, each corpus is split into train-, dev- and test-set with its own approach.

#### ReadyLingua corpus
tbd.

#### LibriSpeech corpus
The LibriSpeech raw data is already split into train-, dev- and test-set. Each chapter is read by a different speaker, and each speaker is contained in only one of the subsets. Efforts have been made to keep the subsets within the same probability distribution (regarding accents, ratio of male to female speakers, ...). To leverage these efforts, the corpus entries created from the raw data are kept in the same subsets.
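
The property that no speaker appears in more than one subset can be checked with a few set operations. The following is only a minimal sketch: the speaker IDs are made-up placeholders, and how the IDs are read from the corpus entries is not shown here.

```python
# Hypothetical speaker IDs per subset; real values would come from the corpus entries.
subsets = {
    'train': {'s01', 's02', 's03'},
    'dev': {'s04', 's05'},
    'test': {'s06', 's07'},
}

# Intersect every pair of subsets; a non-empty result would mean a leaked speaker.
names = list(subsets)
overlaps = {
    (a, b): subsets[a] & subsets[b]
    for i, a in enumerate(names)
    for b in names[i + 1:]
}
is_disjoint = all(len(shared) == 0 for shared in overlaps.values())
print(is_disjoint)
```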

---

You can explore the subsets by executing the cell below to see the number of samples (corpus entries) in each subset.

In [None]:
ls_train, ls_dev, ls_test = ls_corpus.train_dev_test_split()
print(f'#train-samples: {len(ls_train)}, #dev-samples: {len(ls_dev)}, #test-samples: {len(ls_test)}')

rl_train, rl_dev, rl_test = rl_corpus.train_dev_test_split()
print(f'#train-samples: {len(rl_train)}, #dev-samples: {len(rl_dev)}, #test-samples: {len(rl_test)}')

## From corpus entries to spectrograms
In order to train an RNN, each sample needs to be converted into some sort of sequence. In this case the samples are the audio files from the corpus entries and the sequences are their spectrograms. You can explore a random sample together with its spectrogram by executing the cell below.

In [None]:
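corpus_entry = random.choice(ls_corpus)
rate, audio = corpus_entry.audio
display(Audio(data=audio, rate=rate))
fig = plt.figure(figsize=(14, 8))
freqs, times, spectrogram = log_specgram(audio, rate)
show_spectrogram(fig, freqs, times, spectrogram)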

A spectrogram is now created as a matrix `x` for every single corpus entry. All these matrices are then collected to form `X`. For each corpus this gives us three separate files (`X_train.ls`, `X_dev.ls` and `X_test.ls` for LibriSpeech data and `X_train.rl`, `X_dev.rl` and `X_test.rl` for ReadyLingua data).
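
To give an idea of what such a matrix `x` looks like, here is a minimal sketch of a log-spectrogram computed with `scipy.signal.spectrogram`. It is not necessarily what `log_specgram` from `audio_util` does: the function name `log_spectrogram_sketch`, the window length of 20 ms and the step of 10 ms are assumptions made for illustration, and a synthetic tone stands in for a real corpus entry.

```python
import numpy as np
from scipy import signal


def log_spectrogram_sketch(audio, sample_rate, window_ms=20, step_ms=10, eps=1e-10):
    """Return (freqs, times, log_spec) where log_spec has shape (n_frames, n_freqs)."""
    nperseg = int(round(window_ms * sample_rate / 1000))  # samples per window
    step = int(round(step_ms * sample_rate / 1000))       # hop between windows
    freqs, times, spec = signal.spectrogram(
        audio, fs=sample_rate, window='hann',
        nperseg=nperseg, noverlap=nperseg - step, detrend=False)
    # Transpose so that time is the first axis (one row per frame), then take the log.
    return freqs, times, np.log(spec.T.astype(np.float32) + eps)


# One second of a synthetic 440 Hz tone at 16 kHz (assumed values).
rate = 16000
t = np.arange(rate) / rate
audio = np.sin(2 * np.pi * 440 * t)

freqs, times, x = log_spectrogram_sketch(audio, rate)
print(x.shape)  # (number of frames, number of frequency bins)
```

Each row of `x` is one time step of the sequence that is later fed to the RNN, which is why the time axis comes first in this sketch.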

Accordingly, the segmentation information (speech and pause segments) is obtained from each corpus entry to form a label vector `y` for each sample. All label vectors are collected into the label matrix `Y`. As with the data, the labels are kept in three separate files per corpus (`Y_train.ls`, `Y_dev.ls` and `Y_test.ls` for LibriSpeech data and `Y_train.rl`, `Y_dev.rl` and `Y_test.rl` for ReadyLingua data).
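
The mapping from pause segments to a label vector can be sketched as follows. This is an illustration under stated assumptions, not the code used by `create_subsets`: the function `label_vector_sketch`, the representation of pauses as `(start, end)` tuples in seconds and the rounding to frame indices are all hypothetical.

```python
import numpy as np


def label_vector_sketch(pause_segments, duration, n_frames):
    """Build y with 1 for frames inside a pause and 0 for frames with speech.

    pause_segments: list of (start, end) tuples in seconds (assumed format)
    duration:       length of the audio in seconds
    n_frames:       number of spectrogram frames, i.e. rows of x
    """
    y = np.zeros(n_frames, dtype=np.int8)
    for start, end in pause_segments:
        first = int(round(start / duration * n_frames))
        last = int(round(end / duration * n_frames))
        y[first:last] = 1
    return y


# Two pauses in a recording of one second, mapped onto eight frames.
y = label_vector_sketch([(0.0, 0.25), (0.5, 0.75)], duration=1.0, n_frames=8)
print(y)  # [1 1 0 0 1 1 0 0]
```

The label vector has one entry per spectrogram frame, so `x` and `y` of the same sample always have the same length along the time axis.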

## Creating the spectrograms and labels

Click one of the buttons below to start creating the labelled data for the respective corpus.

In [None]:
display(widgets.HBox([create_data_rl_btn, create_data_ls_btn]))

## Exploring the labelled data
After the data has been processed, we can visualize a sample by comparing a spectrogram with its corresponding label vector.

In [None]:
corpus_entry = random.choice(rl_corpus)
show_labelled_data(corpus_entry)
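
The helper `calculate_pause_boundaries` recovers the pause segments from a label vector by looking for positions where the label changes. The steps of that one-liner can be traced on a small, made-up label vector:

```python
import numpy as np

# Made-up label vector: 1 = pause, 0 = speech.
y = np.array([0, 1, 1, 0, 0, 1, 0, 0])

# Pad with zeros so that pauses touching either end are detected too.
padded = np.r_[0, y, 0]
# Non-zero differences mark the positions where the label changes.
changes = np.flatnonzero(np.diff(padded) != 0)
# Consecutive pairs are (start, end + 1) of each run of ones; make the end inclusive.
runs = changes.reshape(-1, 2) - [0, 1]
print(runs.tolist())  # [[1, 2], [5, 5]]

# Dividing by the length gives positions relative to the whole sample.
relative = runs / len(y)
print(relative.tolist())  # [[0.125, 0.25], [0.625, 0.625]]
```

Because the boundaries are relative, they can be scaled to the x-axis of the wave plot and to that of the spectrogram alike. Note that with an inclusive end index, a pause of a single frame has the same start and end and is therefore drawn with zero width.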