# IP8: Creation of Corpora
This IPython notebook documents and visualizes some crucial steps taken over the course of the project. It should help the reader understand how and why decisions were made, and it illustrates some important concepts with examples.

## Prerequisites
This project was built using Python 3.6 and Anaconda 3. Please install the packages listed in `requirements.txt`. Additionally, you need the following tools and resources:

* [FFMPEG](http://www.ffmpeg.org/): for the conversion and/or resampling of audio files
* ReadyLingua raw data: the raw files are not publicly available, so you need to obtain them yourself and store them on your machine.

### Training data
Every Neural Network needs training data. The RNN used in this project is no exception. Since this project is about Forced Alignment (FA), training data consisted of pre-aligned audio and transcript data. This training data was derived from the following resources:

* ReadyLingua
* Migros PodClub
* LibriSpeech
* ... (additional Corpora tbd.)

Data from ReadyLingua and PodClub is not publicly available, so you must specify the path to the directory where those files are stored. You must use an absolute path.

Data from LibriSpeech is publicly available. You can specify a folder where those files are (or will be) stored. You must use an absolute path. If the directory is empty, the LibriSpeech data will be downloaded automatically. If the directory is not empty, it is assumed that the LibriSpeech data has already been downloaded and extracted into this directory. In this case the directory structure must match the expected structure.
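
The rules above (absolute path, empty vs. non-empty directory) can be checked before anything is downloaded. The following is only a minimal sketch: `check_source_dir` is a hypothetical helper that is not part of the project code, and it does not verify the expected directory structure.

```python
import os
import tempfile


def check_source_dir(path):
    """Return 'missing', 'empty' or 'populated' for an absolute directory path.

    Hypothetical helper for illustration only.
    """
    if not os.path.isabs(path):
        raise ValueError(f'{path!r} is not an absolute path')
    if not os.path.isdir(path):
        return 'missing'
    return 'populated' if os.listdir(path) else 'empty'


# usage with a temporary directory standing in for the LibriSpeech folder
with tempfile.TemporaryDirectory() as tmp_dir:
    print(check_source_dir(tmp_dir))
```

A result of `'empty'` or `'missing'` corresponds to the case in which the data still has to be downloaded.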


### Set directories
This project uses several corpora as training data. The corpora need to be created before they can be used for training, which requires approximately 350GB of free disk space with the currently included corpora. Note: the final disk usage might be lower since some of the space is only used temporarily.
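
Whether the target drive has enough room can be checked up front with the standard library. This is a minimal sketch, assuming the figure of 350GB stated above; the function names are made up for illustration and are not part of the project code.

```python
import shutil

REQUIRED_GB = 350  # approximate requirement stated above


def free_space_gb(path):
    """Return the free space (in GB, base 1000) of the volume containing path."""
    return shutil.disk_usage(path).free / 1000 ** 3


def has_enough_space(path, required_gb=REQUIRED_GB):
    """Return True if the volume containing path has at least required_gb free."""
    return free_space_gb(path) >= required_gb


# usage: the current directory stands in for the target directory
print(f'Free space: {free_space_gb("."):.1f} GB')
print(f'Enough for the corpora: {has_enough_space(".")}')
```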

**Don't forget to execute the cell to apply the changes!**

In [None]:
rl_source_root = r'D:\corpus\readylingua-raw'   # path to directory where raw ReadyLingua data is stored
ls_source_root = r'D:\corpus\librispeech-raw'   # path to directory where LibriSpeech files are/will be downloaded (will be changed if files are downloaded)

target_root = r'E:/'                            # path to the directory where the corpora will be created (must have at least 350GB of free storage)

### Imports and helper functions
Execute the cell below to import modules and helper functions

In [None]:
"""
Imports and some helper functions. You don't need to change anything in here!
"""
import tarfile
import random
from os import listdir, rmdir, remove, makedirs
from random import randint
from shutil import move

import ipywidgets as widgets
import matplotlib.pyplot as plt
import os.path
import requests
from os.path import exists
from tqdm import tqdm

import librispeech_corpus
import readylingua_corpus
from audio_util import *
from corpus_util import *
from IPython.display import HTML, Audio, display

%matplotlib inline

# name of target directory for ReadyLingua corpus files (default value)
rl_target_root = os.path.join(target_root, 'readylingua-corpus')
# name of target directory for LibriSpeech corpus files (default value)
ls_target_root = os.path.join(target_root, 'librispeech-corpus')
# path to ReadyLingua corpus file (will be set after corpus has been created)
rl_corpus_file = os.path.join(rl_target_root, 'readylingua.corpus')
# path to LibriSpeech corpus file (will be set after corpus has been created)
ls_corpus_file = os.path.join(ls_target_root, 'librispeech.corpus')

def show_corpus_entry(corpus, ix_entry=None, ix_speech=None, ix_pause=None):
    corpus_entry = corpus[ix_entry] if ix_entry is not None else random.choice(corpus)
    speech = corpus_entry.speech_segments[ix_speech] if ix_speech is not None else random.choice(corpus_entry.speech_segments)
    pause = corpus_entry.pause_segments[ix_pause] if ix_pause is not None else random.choice(corpus_entry.pause_segments)

    show_audio(corpus_entry)
    show_segment(speech)
    show_segment(pause)


def show_audio(corpus_entry):
    entry_title = HTML(f"""
    <h3>Sample corpus entry: {corpus_entry.name}</h3>
    <p><strong>Path to raw data</strong>: {corpus_entry.original_path}</p>
    """)
    entry_audio = Audio(corpus_entry.audio_file)
    entry_text = widgets.Accordion(children=[widgets.HTML(f'<pre>{corpus_entry.transcription}</pre>')], selected_index=None)
    entry_text.set_title(0, 'Transcription')
    
    display(entry_title)
    display(entry_audio)
    display(entry_text)
    
def show_segment(segment):
    segment_title = HTML(f'<strong>Sample alignment</strong> (start_frame={segment.start_frame}, end_frame={segment.end_frame})')
    segment_audio = Audio(data=segment.audio, rate=16000.0)

    display(segment_title)
    display(segment_audio)
    if segment.text:
        segment_text = HTML(f'<pre>{segment.text}</pre>')
        display(segment_text)


def download_file(url, target_dir):
    """Download a .tar.gz archive from url and extract it into target_dir."""
    r = requests.get(url, stream=True)
    total_size = int(r.headers.get('content-length', 0))
    block_size = 1024
    wrote = 0
    tmp_file = os.path.join(target_dir, 'download.tmp')
    if not exists(target_dir):
        makedirs(target_dir)

    with open(tmp_file, 'wb') as f:
        # the progress bar is updated manually with the number of bytes written
        with tqdm(total=total_size, unit='B', unit_divisor=block_size, unit_scale=True) as pbar:
            for data in r.iter_content(32 * block_size):
                wrote = wrote + len(data)
                f.write(data)
                pbar.update(len(data))

    if total_size != 0 and wrote != total_size:
        print(f'ERROR: expected {total_size} bytes but received {wrote} bytes')

    print('Extracting data...')
    with tarfile.open(tmp_file, 'r:gz') as tar:
        tar.extractall(target_dir)

    remove(tmp_file)


def move_files(src_dir, target_dir):
    for filename in listdir(src_dir):
        move(os.path.join(src_dir, filename), os.path.join(target_dir, filename))
    rmdir(src_dir)


def on_download_ls_button_click(sender):
    global ls_source_root
    print('Downloading LibriSpeech data... Get lunch or something!')
    print('Download 1/2: Audio data')
    download_dir = os.path.join(ls_source_root, 'audio')
    if exists(download_dir) and listdir(download_dir):
        print(f'Directory {download_dir} exists and is not empty. Assuming data was already downloaded there.')
    else:
        download_file('http://www.openslr.org/resources/12/original-mp3.tar.gz', download_dir)
        print('Done! Moving files...')
        move_files(os.path.join(download_dir, 'LibriSpeech'), download_dir)

    print('Download 2/2: Text data')
    download_dir = os.path.join(ls_source_root, 'books')
    if exists(download_dir) and listdir(download_dir):
        print(f'Directory {download_dir} exists and is not empty. Assuming data was already downloaded there.')
    else:
        download_file('http://www.openslr.org/resources/12/original-books.tar.gz', download_dir)
        move_files(os.path.join(download_dir, 'LibriSpeech'), download_dir)
        makedirs(os.path.join(download_dir, 'utf-8'))
        move_files(os.path.join(download_dir, 'books', 'utf-8'), os.path.join(download_dir, 'utf-8'))
        delete_directory = os.path.join(download_dir, 'books')
        print(f'Done! Please delete {delete_directory} manually (not needed)')

    print(f'Files downloaded and extracted to: {ls_source_root}')


def on_create_rl_button_click(sender):
    global rl_corpus_file
    print('Creating ReadyLingua corpus... Get a coffee or something!')
    rl_corpus, rl_corpus_file = readylingua_corpus.create_corpus(source_root=rl_source_root, target_root=rl_target_root)
    print(f'Done! Corpus with {len(rl_corpus)} entries saved to {rl_corpus_file}')


def on_create_ls_button_click(sender):
    global ls_corpus_file
    print('Creating LibriSpeech corpus... Go to bed or something!')
    ls_corpus, ls_corpus_file = librispeech_corpus.create_corpus(source_root=ls_source_root, target_root=ls_target_root)
    print(f'Done! Corpus with {len(ls_corpus)} entries saved to {ls_corpus_file}')

# UI elements
layout = widgets.Layout(width='250px', height='50px')
download_ls_button = widgets.Button(description="Download LibriSpeech Data", button_style='info', layout=layout, icon='download')
download_ls_button.on_click(on_download_ls_button_click)
create_rl_button = widgets.Button(description="Create ReadyLingua Corpus", button_style='warning', layout=layout, icon="book", tooltip='~5 minutes')
create_rl_button.on_click(on_create_rl_button_click)
create_ls_button = widgets.Button(description="Create LibriSpeech Corpus", button_style='warning', layout=layout, icon="book", tooltip='~5 hours')
create_ls_button.on_click(on_create_ls_button_click)

## Creation of Corpora
The alignment information is extracted from the raw files and stored in **corpus_entries**. Those corpus entries can be created in this notebook.

### Corpus entries
To use data from all sources for training, it had to be converted to a common format. Since (to my knowledge) there is no standardized format for FA, I had to define one myself. I went for the following structure for a single corpus entry:

```JSON
// definition of the corpus
corpus = [corpus_entry]

// definition of an individual corpus entry
corpus_entry = 
{
    'audio': [byte],                 // bytes from the audio file
    'transcription': string,         // raw (unaligned) text 
    'speech-pauses': [speech_pause], // segmentation of the audio file into speech and pause segments
    'alignment': [alignment]         // alignment of bits of the unaligned text with the audio
}

// definition of a speech or pause segment
speech_pause = 
{
    'id': string,                    // some unique identifier
    'start': int,                    // start frame of the segment
    'end': int,                      // end frame of the segment
    'class': string                  // 'speech' for a speech segment, 'pause' for a pause segment
}

// definition of an alignment
alignment = 
{
    'text': string,                  // text that is being spoken in the audio
    'start': int,                    // start frame in the audio file (when the text starts)
    'end': int                       // end frame in the audio file (when the text stops)
}
```
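
The structure above can be mirrored in Python, for example with named tuples. This is only an illustrative sketch of the format as defined here: the class and field names (`SpeechPause`, `Alignment`, `CorpusEntry`, `kind`) are assumptions and do not necessarily match the classes used by the project code.

```python
from typing import List, NamedTuple


class SpeechPause(NamedTuple):
    id: str
    start: int  # start frame of the segment
    end: int    # end frame of the segment
    kind: str   # 'speech' or 'pause' ('class' is a reserved word in Python)


class Alignment(NamedTuple):
    text: str   # text that is being spoken
    start: int  # start frame in the audio file
    end: int    # end frame in the audio file


class CorpusEntry(NamedTuple):
    audio: bytes
    transcription: str
    speech_pauses: List[SpeechPause]
    alignments: List[Alignment]


# a made-up entry: one second of speech followed by half a second of pause at 16 kHz
entry = CorpusEntry(
    audio=b'',  # placeholder, real entries hold the bytes of the audio file
    transcription='hello world',
    speech_pauses=[SpeechPause('1', 0, 16000, 'speech'), SpeechPause('2', 16000, 24000, 'pause')],
    alignments=[Alignment('hello world', 0, 16000)],
)
corpus = [entry]
print(len(corpus), entry.speech_pauses[0].kind, entry.alignments[0].text)
```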



### Create ReadyLingua Corpus
ReadyLingua (RL) provides alignment data distributed over several files:

* `*.wav`: Audio file containing the speech
* `*.txt`: UTF-8 encoded (unaligned) transcription
* `* - Segmentation.xml`: file containing the definition of speech and pause segments
```XML
<Segmentation>
    <SelectionExtension>0</SelectionExtension>
    <Segments>
	<Segment id="1" start="83790" end="122598" class="Speech" uid="5" />
	...
    </Segments>
    <Segmenter SegmenterType="SICore.AudioSegmentation.EnergyThresholding">
        <MaxSpeechSegmentExtension>50</MaxSpeechSegmentExtension>
        <Length>-1</Length>
        <Energies>
            <Value id="1" value="0" />
            ...
        </Energies>
        <OriginalSegments>
            <Segment id="1" start="83790" end="100548" class="Speech" uid="2" />
            ...
        </OriginalSegments>
        <EnergyPeak>3569753</EnergyPeak>
        <StepSize>441</StepSize>
        <ITL>146139</ITL>
        <ITU>730695</ITU>
        <LastUid>2048</LastUid>
        <MinPauseDuration>200</MinPauseDuration>
        <MinSpeechDuration>150</MinSpeechDuration>
        <BeginOfSilence>1546255</BeginOfSilence>
        <SilenceLength>100</SilenceLength>
        <ThresholdCorrectionFactor>1</ThresholdCorrectionFactor>
    </Segmenter>
</Segmentation>
```
* `* - Index.xml`: file containing the actual alignments of text to audio
```XML
<XMLIndexFile>
    <Version>2.0.0</Version>
    <SamplingRate>44100</SamplingRate>
    <NumberOfIndices>91</NumberOfIndices>
    <TextAudioIndex>
        <TextStartPos>0</TextStartPos>
        <TextEndPos>36</TextEndPos>
        <AudioStartPos>952101</AudioStartPos>
        <AudioEndPos>1062000</AudioEndPos>
        <SpeakerKey>-1</SpeakerKey>
    </TextAudioIndex>
    ...
</XMLIndexFile>    
```
* `* - Project.xml`: Project file binding the different files together for a corpus entry (note: this file is optional, i.e. there may be no project file for a corpus entry)
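
The two XML formats can be read with the standard library. The following is a minimal sketch based only on the excerpts shown above; the function names are hypothetical, and the actual parsing in `readylingua_corpus` may differ.

```python
import xml.etree.ElementTree as ET

# shortened versions of the excerpts shown above
SEGMENTATION_XML = """
<Segmentation>
    <Segments>
        <Segment id="1" start="83790" end="122598" class="Speech" uid="5" />
    </Segments>
</Segmentation>
"""

INDEX_XML = """
<XMLIndexFile>
    <SamplingRate>44100</SamplingRate>
    <TextAudioIndex>
        <TextStartPos>0</TextStartPos>
        <TextEndPos>36</TextEndPos>
        <AudioStartPos>952101</AudioStartPos>
        <AudioEndPos>1062000</AudioEndPos>
    </TextAudioIndex>
</XMLIndexFile>
"""


def parse_segments(xml_string):
    """Return the entries of <Segments> as dicts with integer frame positions."""
    root = ET.fromstring(xml_string)
    # only children of <Segments> are read, not those of <OriginalSegments>
    return [{'id': s.get('id'),
             'start': int(s.get('start')),
             'end': int(s.get('end')),
             'class': s.get('class').lower()}
            for s in root.findall('./Segments/Segment')]


def parse_index(xml_string):
    """Return (text_start, text_end, audio_start, audio_end) per <TextAudioIndex>."""
    root = ET.fromstring(xml_string)
    fields = ('TextStartPos', 'TextEndPos', 'AudioStartPos', 'AudioEndPos')
    return [tuple(int(index.findtext(field)) for field in fields)
            for index in root.findall('TextAudioIndex')]


print(parse_segments(SEGMENTATION_XML))
print(parse_index(INDEX_XML))
```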

Corpus entries are organized in a folder hierarchy. There is a fileset for each corpus entry. Usually, the files for a specific corpus entry reside in a leaf directory (i.e. a directory without further subdirectories). If there is a project file, this file is used to locate the files needed to create the corpus entry.
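
Leaf directories can be found by walking the folder hierarchy. This is a minimal sketch with a hypothetical helper and made-up directory names; it ignores project files.

```python
import os
import tempfile


def find_leaf_directories(root):
    """Yield every directory below root that has no subdirectories."""
    for dirpath, dirnames, _filenames in os.walk(root):
        if not dirnames:
            yield dirpath


# usage with a small throw-away hierarchy (directory names are made up)
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, 'German', 'Story 1'))
    os.makedirs(os.path.join(root, 'German', 'Story 2'))
    os.makedirs(os.path.join(root, 'English'))
    leaves = sorted(os.path.relpath(d, root) for d in find_leaf_directories(root))
    print(leaves)
```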

Audio data is provided as WAV files with a sampling rate of 44.1 kHz (stereo). Because most ASR corpora provide their recordings with a sampling rate of 16 kHz, the files were downsampled and the alignment information was adjusted accordingly. The raw transcription is integrated as-is. The XML files are parsed to extract the alignment data. Alignment data, text and downsampled audio are then merged into a corpus entry as described above.
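
Adjusting the alignment information after downsampling amounts to rescaling the frame positions. A minimal sketch, assuming that the frame positions in the XML files refer to the original 44.1 kHz audio; `resample_frame` is a hypothetical helper.

```python
def resample_frame(frame, src_rate=44100, target_rate=16000):
    """Map a frame position from the source sampling rate to the target rate."""
    return int(round(frame * target_rate / src_rate))


# the segment from the Segmentation.xml excerpt above
start, end = resample_frame(83790), resample_frame(122598)
print(start, end)
```

The start position 83790 corresponds to 1.9 seconds at 44.1 kHz and therefore maps to frame 30400 at 16 kHz.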

#### Create corpus entries
We need to extract the alignments from the segmentation information of the raw data. For this, the raw data needs to be converted to corpus entries. This process takes a few minutes, so this is a good time for a coffee break.

In [None]:
display(create_rl_button)

#### Explore corpus
To see if everything worked as expected let's check out a sample alignment. You can execute the cell below to show a random alignment from a random corpus entry. You can execute the cell several times to see different samples.

In [None]:
rl_corpus = load_corpus(rl_corpus_file)
show_corpus_entry(rl_corpus)

### Create LibriSpeech Corpus
[LibriSpeech](http://www.openslr.org/12/) is an open-source corpus for Automatic Speech Recognition (ASR). It contains recordings of LibriVox's public domain audio books and their transcriptions made by volunteers. The data is evenly distributed in terms of gender, recording length, accent, etc. The corpus is split into training, dev and test sets (`train-*.tar.gz`, `dev-*.tar.gz` and `test-*.tar.gz`). However, those sets only contain the transcript as a set of segments and an audio file for each segment. They do not contain any temporal information, which is needed for alignment.

Luckily, the archive `original-mp3.tar.gz` is also available for download. It contains the original LibriVox mp3 files (from which the corpus was created) along with the alignment information. Alignment is made on utterance level, i.e. the transcript is split up into segments where each segment corresponds to an utterance. Segments were derived by allowing splitting on every silence interval longer than 300ms.

The data is organized into subdirectories of the following format:

    ./LibriSpeech/mp3/{speaker_id}/{chapter_id}/

There is one subdirectory containing all the information about a recording. For this project the following files are important:

- `{chapter_id}.mp3`: The audio file containing the recording. The audio is mono with a bitrate of 128 kbit/s and a sampling rate of 44.1 kHz.
- `{speaker_id}-{chapter_id}.seg.txt`: Text file containing temporal information about the segments (one segment per line). The time is indicated in seconds.
Example:
```
14-208_0000 25.16 40.51
```
- `{speaker_id}-{chapter_id}.trans.txt`: Text file containing the transcriptions of the segments (one segment per line). The transcription is all uppercase and does not contain any punctuation.
Example:
```
14-208_0000 CHAPTER ELEVEN THE MORROW BROUGHT A VERY SOBER LOOKING MORNING THE SUN MAKING ONLY A FEW EFFORTS...
```
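
Both text files share the segment ID as the first column, so they can be joined line by line. The following is a minimal sketch with hypothetical function names; it assumes that times are converted to frame positions of the 16 kHz target audio and is not the parser used by `librispeech_corpus`.

```python
def parse_segment_line(line, sample_rate=16000):
    """Parse '<segment_id> <start_sec> <end_sec>' into the ID and frame positions."""
    segment_id, start, end = line.split()
    start_frame = int(round(float(start) * sample_rate))
    end_frame = int(round(float(end) * sample_rate))
    return segment_id, start_frame, end_frame


def parse_transcript_line(line):
    """Parse '<segment_id> <TEXT>' into the ID and the transcription."""
    segment_id, text = line.strip().split(' ', 1)
    return segment_id, text


def merge_segments(segment_lines, transcript_lines):
    """Join timing and text by segment ID into a list of alignment dicts."""
    texts = dict(parse_transcript_line(line) for line in transcript_lines)
    alignments = []
    for line in segment_lines:
        segment_id, start_frame, end_frame = parse_segment_line(line)
        alignments.append({'text': texts.get(segment_id, ''),
                           'start': start_frame,
                           'end': end_frame})
    return alignments


# usage with the example lines from above (transcription shortened)
print(merge_segments(['14-208_0000 25.16 40.51'],
                     ['14-208_0000 CHAPTER ELEVEN THE MORROW BROUGHT A VERY SOBER LOOKING MORNING']))
```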

In order to create the corpus, these files had to be parsed and the audio had to be converted and downsampled to a 16 kHz WAV file.
Information about the speakers, chapters and books was extracted from the respective files (`SPEAKERS.TXT`, `CHAPTERS.TXT` and `BOOKS.TXT`).

#### Download raw data
To create the LibriSpeech corpus you first need to download the raw data. The files are over 80GB and need to be extracted, so this might take a while...

In [None]:
display(download_ls_button)

#### Create corpus
We need to extract the alignments from the segmentation information of the raw data. For this, the downloaded data needs to be converted to corpus entries. **This process takes several hours, so you might want to do this just before knocking-off time!**

In [None]:
display(create_ls_button)

#### Explore corpus
To see if everything worked as expected let's check out a sample alignment. You can execute the cell below to show a random alignment from a random corpus entry. You can execute the cell several times to see different samples.

In [None]:
ls_corpus = load_corpus(ls_corpus_file)
show_corpus_entry(ls_corpus)