# Corpus creation
This IPython notebook documents the creation of the corpora from raw data. The corpora can also be created interactively.

## Background

Machine learning tasks in the field of _Natural Language Processing (NLP)_ often rely on corpora. The ASR stage of this project is no exception. Raw data is available from many sources. For this project, two sources (_ReadyLingua_ and _LibriSpeech_) were considered, although other sources are conceivable, and the final solution should be able to train on data from arbitrary sources. However, since the properties and format of the raw data are usually not standardized between sources, some preprocessing is required to bring the raw data into a format that the final system can use.

Since each data source makes its own assumptions about how data should be represented, a separate preprocessing step is required for each data source. The processed data is then stored in _corpora_, which contain the actual data (audio signals and transcripts) as well as metadata (audio segmentation information, audio length, sampling rate, language, speaker gender, etc.). The data from the sources used in this project comes from different distributions (e.g. number of languages, number of speakers per gender). Therefore, the processed data has been stored in separate corpora.

## Prerequisites
This project was built using Python 3.6 and Anaconda 3. Please install the packages listed in `requirements.txt`. Additionally, you need the following tools and resources:

* [FFMPEG](http://www.ffmpeg.org/): for the conversion and/or resampling of audio files
* _ReadyLingua_ raw data: The aligned data from _ReadyLingua_ is not in the public domain. You need to get permission from the owner and store the data on your machine.

All other data is publicly available and will be downloaded as needed by this notebook.

### Source directory
Since data from ReadyLingua and PodClub is not open to the public, you must specify the path to the directory where those files are stored in the following cell. You must use an absolute path.

LibriSpeech data is available under a [Creative Commons](https://en.wikipedia.org/wiki/Creative_Commons) license. You can download the files yourself and specify an absolute path to the folder where they are stored. If the directory is empty, LibriSpeech data will automatically be downloaded and extracted there. If the directory is not empty, it is assumed that the LibriSpeech data was already manually downloaded and extracted in this directory. In this case, the directory structure must match the expected structure.

In [None]:
rl_source_root = r'D:\corpus\readylingua-raw'   # path to directory where raw ReadyLingua data is stored
ls_source_root = r'D:\corpus\librispeech-raw'   # path to directory where LibriSpeech files are or will be downloaded

### Target directory
This notebook will create various corpora that need to be persisted somewhere. Specify the path to a directory that provides enough storage. Approximately 350 GB of free storage is required. Note: final storage use might be lower, since some of the space is only used temporarily.

**Don't forget to execute the cell to apply the changes!**

In [None]:
target_root = r'E:/'                            # path to the directory where the corpora will be created (must have at least 350GB of free storage)
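
Before starting a long run, it can be worth checking that the target volume really has the required free space. The following is a minimal sketch using the standard library's `shutil.disk_usage`; the helper names `free_gigabytes` and `has_enough_space` are hypothetical and not part of this project.

```python
import shutil

REQUIRED_GB = 350  # figure quoted in the text above


def free_gigabytes(path):
    """Return the free space of the volume containing `path` in GB (10^9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def has_enough_space(path, required_gb=REQUIRED_GB):
    """Return True if at least `required_gb` GB are free at `path`."""
    return free_gigabytes(path) >= required_gb
```

For example, `has_enough_space(target_root)` could be evaluated right after setting `target_root` to get an early warning.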

### Imports and helper functions
Execute the cell below to import modules and helper functions.

In [None]:
"""
Imports and some helper functions. You don't need to change anything in here!
"""
import tarfile
import random
from os import listdir, rmdir, remove, makedirs
from os.path import exists
from random import randint
from shutil import move

import ipywidgets as widgets
import matplotlib.pyplot as plt
import os.path
import requests
from tqdm import tqdm

import create_ls_corpus
import create_rl_corpus
from util.audio_util import *
from util.corpus_util import *
from IPython.display import HTML, Audio, display

%matplotlib inline

# path to target directory for ReadyLingua corpus files (default value)
rl_target_root = os.path.join(target_root, 'readylingua-corpus')
# path to target directory for LibriSpeech corpus files (default value)
ls_target_root = os.path.join(target_root, 'librispeech-corpus')

def show_corpus_entry(corpus_entry, speech=None, speech_unaligned=None, pause=None):
    speech = speech if speech else random.choice(corpus_entry.speech_segments)
    speech_unaligned = speech_unaligned if speech_unaligned \
                        else random.choice(corpus_entry.speech_segments_unaligned) if corpus_entry.speech_segments_unaligned \
                        else None
    pause = pause if pause else random.choice(corpus_entry.pause_segments)

    show_audio(corpus_entry)
    show_segment(speech)
    if speech_unaligned:
        show_segment(speech_unaligned)
    show_segment(pause)


def show_audio(corpus_entry):
    title = HTML(f"""
    <h3>Sample corpus entry: {corpus_entry.name}</h3>
    <p><strong>Path to raw data</strong>: {corpus_entry.raw_path}</p>
    <p>{len(corpus_entry.speech_segments)} speech segments, {len(corpus_entry.pause_segments)} pause segments</p>
    """)
    audio = Audio(data=corpus_entry.audio, rate=corpus_entry.rate)
    transcript = widgets.Accordion(children=[widgets.HTML(f'<pre>{corpus_entry.transcript}</pre>')], selected_index=None)
    transcript.set_title(0, 'Transcript')
    
    display(title)
    display(audio)
    display(transcript)
    
def show_segment(segment):
    title = HTML(f'<strong>Sample {segment.segment_type}</strong> (start_frame={segment.start_frame}, end_frame={segment.end_frame})')
    audio = Audio(data=segment.audio, rate=segment.rate)

    display(title)
    display(audio)
    if segment.text:
        transcript = HTML(f'<pre>{segment.text}</pre>')
        display(transcript)


def download_file(url, target_dir):
    """Download a .tar.gz archive from `url` and extract it into `target_dir`."""
    r = requests.get(url, stream=True)
    total_size = int(r.headers.get('content-length', 0))
    block_size = 1024
    wrote = 0
    tmp_file = os.path.join(target_dir, 'download.tmp')
    if not os.path.exists(target_dir):
        makedirs(target_dir)

    # stream the download in 32 KB chunks and advance the progress bar manually
    with open(tmp_file, 'wb') as f:
        with tqdm(total=total_size, unit='B', unit_divisor=block_size, unit_scale=True) as pbar:
            for data in r.iter_content(32 * block_size):
                wrote = wrote + len(data)
                f.write(data)
                pbar.update(len(data))

    if total_size != 0 and wrote != total_size:
        print("ERROR, something went wrong")

    print('Extracting data...')
    with tarfile.open(tmp_file, "r:gz") as tar:
        tar.extractall(target_dir)

    remove(tmp_file)


def move_files(src_dir, target_dir):
    for filename in listdir(src_dir):
        move(os.path.join(src_dir, filename), os.path.join(target_dir, filename))
    rmdir(src_dir)


def on_download_ls_button_click(sender):
    global ls_source_root
    print('Downloading LibriSpeech data... Get lunch or something!')
    print('Download 1/2: Audio data')
    download_dir = os.path.join(ls_source_root, 'audio')
    if exists(download_dir) and listdir(download_dir):
        print(f'Directory {download_dir} exists and is not empty. Assuming data was already downloaded there.')
    else:
        download_file('http://www.openslr.org/resources/12/original-mp3.tar.gz', download_dir)
        print('Done! Moving files...')
        move_files(os.path.join(download_dir, 'LibriSpeech'), download_dir)

    print('Download 2/2: Text data')
    download_dir = os.path.join(ls_source_root, 'books')
    if exists(download_dir) and listdir(download_dir):
        print(f'Directory {download_dir} exists and is not empty. Assuming data was already downloaded there.')
    else:
        download_file('http://www.openslr.org/resources/12/original-books.tar.gz', download_dir)
        move_files(os.path.join(download_dir, 'LibriSpeech'), download_dir)
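        # make sure both target directories exist before moving files into them
        makedirs(os.path.join(download_dir, 'utf-8'), exist_ok=True)
        makedirs(os.path.join(download_dir, 'ascii'), exist_ok=True)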
        move_files(os.path.join(download_dir, 'books', 'utf-8'), os.path.join(download_dir, 'utf-8'))
        move_files(os.path.join(download_dir, 'books', 'ascii'), os.path.join(download_dir, 'ascii'))
        delete_directory = os.path.join(download_dir, 'books')
        print(f'Done! Please delete {delete_directory} manually (not needed)')

    print(f'Files downloaded and extracted to: {ls_source_root}')


def on_create_rl_button_click(sender):
    global rl_corpus_file
    print('Creating ReadyLingua corpus... Get a coffee or something!')
    rl_corpus, rl_corpus_file = readylingua_corpus.create_corpus(source_root=rl_source_root, target_root=rl_target_root)
    print(f'Done! Corpus with {len(rl_corpus)} entries saved to {rl_corpus_file}')


def on_create_ls_button_click(sender):
    global ls_corpus_file
    print('Creating LibriSpeech corpus... Go to bed or something!')
    ls_corpus, ls_corpus_file = librispeech_corpus.create_corpus(source_root=ls_source_root, target_root=ls_target_root)
    print(f'Done! Corpus with {len(ls_corpus)} entries saved to {ls_corpus_file}')

# UI elements
layout = widgets.Layout(width='250px', height='50px')
download_ls_button = widgets.Button(description="Download LibriSpeech Data", button_style='info', layout=layout, icon='download')
download_ls_button.on_click(on_download_ls_button_click)
create_rl_button = widgets.Button(description="Create ReadyLingua Corpus", button_style='warning', layout=layout, icon="book", tooltip='~5 minutes')
create_rl_button.on_click(on_create_rl_button_click)
create_ls_button = widgets.Button(description="Create LibriSpeech Corpus", button_style='warning', layout=layout,icon="book", tooltip='~5 hours')
create_ls_button.on_click(on_create_ls_button_click)

## Corpus structure
The alignment information is extracted from the raw data and stored as a **corpus** containing **corpus_entries**. A corpus entry reflects a single instance for training, validation or evaluation. It contains all the information about the audio and its segmentation.

### Preprocessing
The raw data was integrated as-is applying only the following preprocessing steps:

* **Resampling**: Audio data was resampled to 16kHz (mono) WAV files
* **Cropping**: Some of the audio files (especially in the LibriSpeech data) contained some preliminary information about LibriVox and the book being read before the actual recording. This speech data was not aligned. The audio was therefore cropped at the beginning to the frame where the first alignment information (speech or pause segment) begins. Likewise, the audio was cropped at the end to the frame where the last alignment information ends.
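
The two steps above can be sketched as follows. This is only an illustration under stated assumptions: `ffmpeg_resample_command` and `crop_to_alignment` are hypothetical helpers (not the project's own functions), segments are simplified to `(start_frame, end_frame)` tuples, and the FFmpeg options shown (`-ar` for the sampling rate, `-ac` for the number of channels) may differ from the invocation the project actually uses.

```python
def ffmpeg_resample_command(src_path, dst_path, rate=16000, channels=1):
    """Build an FFmpeg command line that writes a resampled mono WAV file."""
    return ['ffmpeg', '-y', '-i', src_path,
            '-ar', str(rate), '-ac', str(channels), dst_path]


def crop_to_alignment(audio, segments):
    """Crop `audio` to the span covered by `segments` and shift the segments.

    `audio` is any sliceable sequence of samples, `segments` a list of
    (start_frame, end_frame) tuples referring to positions in `audio`.
    """
    crop_start = min(start for start, _ in segments)
    crop_end = max(end for _, end in segments)
    cropped_audio = audio[crop_start:crop_end]
    # after cropping, frame 0 corresponds to the old frame `crop_start`
    shifted = [(start - crop_start, end - crop_start) for start, end in segments]
    return cropped_audio, shifted


audio = list(range(100))                    # 100 dummy samples
segments = [(10, 40), (40, 55), (55, 90)]   # aligned part covers frames 10..90
cropped, shifted = crop_to_alignment(audio, segments)
```

Note that the segment boundaries have to be shifted together with the audio, otherwise the alignment would point to the wrong frames after cropping.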

### Corpus entries
To allow training on data from all sources, the data had to be converted to a common format. Since (to my knowledge) there is no standardized format for FA, I had to define one myself. I went for the following structure for a corpus, its corpus entries and their segments:

```JSON
// corpus is iterable over its corpus_entries
Corpus = {
    'name': string,                    // display name
    'root_path': string,               // absolute path to the directory containing the corpus files
    'corpus_entries': [CorpusEntry]    // the entries of the corpus
}

// corpus_entry is iterable over its segments
CorpusEntry = 
{
    'corpus': Corpus,                  // reference to the corpus
    'audio_file': string,              // absolute path to the preprocessed audio file
    'transcript': string,              // transcription of the audio as raw (unaligned) text   
    'segments': [Segment],             // speech- and pause-segments of the audio
    'original_path': string,           // absolute path to the directory containing the raw files
    'name': string,                    // display name
    'id': string,                      // unique identifier
    'language': string,                // 'de'/'fr'/'it'/'en'/'es'/'unknown'
    'chapter_id': string,              // identifier of the chapter of the book if available, else 'unknown'
    'speaker_id': string,              // identifier of the speaker if available, else 'unknown'
    'original_sampling_rate': string,  // sampling rate of the raw audio file
    'original_channels': string,       // number of channels in the raw audio file
    'subset': string,                  // membership to a subset ('train'/'dev'/'test'/'unknown')
    'media_info': dict,                // PyDub information about the converted audio file
    'speech_segments': [Segment],      // segments filtered for type=='speech' (at runtime)
    'pause_segments': [Segment],       // segments filtered for type=='pause' (at runtime)
    'alignment': ([byte], [Segment]),  // audio and segmentation information
    'alignment_cropped': ([byte], [Segment]) // audio and segmentation information with start and end cropped
}

// definition of a speech or pause segment
Segment = 
{
    'corpus_entry': CorpusEntry,       // reference to the corresponding CorpusEntry
    'start_frame': int,                // index of the start frame of the segment within the audio
    'end_frame': int,                  // index of the end frame of the segment  within the audio
    'start_text': int,                 // index of first character of the segment in the transcription
    'end_text': int,                   // index of the last character of the segment in the transcription
    'segment_type': string,            // 'speech' for a speech segment, 'pause' for a pause segment
    'audio': [byte],                   // part of the audio of the corpus entry which belongs to this segment
    'text': string                     // part of the transcription of the corpus entry which belongs to this segment
}
```
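
To make the relationship between entries and segments more concrete, here is a heavily reduced sketch in Python. It is an assumption about how such a structure could be modelled, not the project's actual classes: only a few of the fields listed above are included, and the classes are named `SegmentSketch` and `CorpusEntrySketch` to make clear that they are illustrative.

```python
class SegmentSketch:
    """A speech or pause segment (reduced to a few fields)."""

    def __init__(self, start_frame, end_frame, segment_type, text=''):
        self.start_frame = start_frame
        self.end_frame = end_frame
        self.segment_type = segment_type   # 'speech' or 'pause'
        self.text = text


class CorpusEntrySketch:
    """A corpus entry that is iterable over its segments."""

    def __init__(self, entry_id, language, segments):
        self.id = entry_id
        self.language = language
        self.segments = segments

    def __iter__(self):
        return iter(self.segments)

    @property
    def speech_segments(self):
        # filtered at runtime, not stored separately
        return [s for s in self.segments if s.segment_type == 'speech']

    @property
    def pause_segments(self):
        return [s for s in self.segments if s.segment_type == 'pause']


entry = CorpusEntrySketch('demo', 'en', [
    SegmentSketch(0, 100, 'speech', 'HELLO'),
    SegmentSketch(100, 150, 'pause'),
    SegmentSketch(150, 300, 'speech', 'WORLD'),
])
```

Deriving `speech_segments` and `pause_segments` from `segments` on access keeps a single source of truth for the segmentation.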

### Create ReadyLingua Corpus
ReadyLingua (RL) provides alignment data distributed over several files:

* `*.wav`: Audio file containing the speech
* `*.txt`: UTF-8 encoded (unaligned) transcription
* `* - Segmentation.xml`: file containing the definition of speech and pause segments
```XML
<Segmentation>
    <SelectionExtension>0</SelectionExtension>
    <Segments>
	<Segment id="1" start="83790" end="122598" class="Speech" uid="5" />
	...
    </Segments>
    <Segmenter SegmenterType="SICore.AudioSegmentation.EnergyThresholding">
        <MaxSpeechSegmentExtension>50</MaxSpeechSegmentExtension>
        <Length>-1</Length>
        <Energies>
            <Value id="1" value="0" />
            ...
        </Energies>
        <OriginalSegments>
            <Segment id="1" start="83790" end="100548" class="Speech" uid="2" />
            ...
        </OriginalSegments>
        <EnergyPeak>3569753</EnergyPeak>
        <StepSize>441</StepSize>
        <ITL>146139</ITL>
        <ITU>730695</ITU>
        <LastUid>2048</LastUid>
        <MinPauseDuration>200</MinPauseDuration>
        <MinSpeechDuration>150</MinSpeechDuration>
        <BeginOfSilence>1546255</BeginOfSilence>
        <SilenceLength>100</SilenceLength>
        <ThresholdCorrectionFactor>1</ThresholdCorrectionFactor>
    </Segmenter>
</Segmentation>
```
* `* - Index.xml`: file containing the actual alignments of text to audio
```XML
<XMLIndexFile>
    <Version>2.0.0</Version>
    <SamplingRate>44100</SamplingRate>
    <NumberOfIndices>91</NumberOfIndices>
    <TextAudioIndex>
        <TextStartPos>0</TextStartPos>
        <TextEndPos>36</TextEndPos>
        <AudioStartPos>952101</AudioStartPos>
        <AudioEndPos>1062000</AudioEndPos>
        <SpeakerKey>-1</SpeakerKey>
    </TextAudioIndex>
    ...
</XMLIndexFile>    
```
* `* - Project.xml`: Project file binding the different files together for a corpus entry (note: this file is optional, i.e. there may be no project file for a corpus entry)
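
As an illustration of how the alignment information in an `* - Index.xml` file could be read, the following sketch parses the sample shown above with the standard library's `xml.etree.ElementTree`. The function `parse_index` and the dictionary keys are assumptions made for this example and do not necessarily match the project's parser.

```python
import xml.etree.ElementTree as ET

INDEX_XML = """
<XMLIndexFile>
    <Version>2.0.0</Version>
    <SamplingRate>44100</SamplingRate>
    <NumberOfIndices>1</NumberOfIndices>
    <TextAudioIndex>
        <TextStartPos>0</TextStartPos>
        <TextEndPos>36</TextEndPos>
        <AudioStartPos>952101</AudioStartPos>
        <AudioEndPos>1062000</AudioEndPos>
        <SpeakerKey>-1</SpeakerKey>
    </TextAudioIndex>
</XMLIndexFile>
"""


def parse_index(xml_text):
    """Return the sampling rate and a list of text-to-audio alignments."""
    root = ET.fromstring(xml_text.strip())
    rate = int(root.findtext('SamplingRate'))
    alignments = []
    for node in root.iter('TextAudioIndex'):
        alignments.append({
            'start_text': int(node.findtext('TextStartPos')),
            'end_text': int(node.findtext('TextEndPos')),
            'start_frame': int(node.findtext('AudioStartPos')),
            'end_frame': int(node.findtext('AudioEndPos')),
        })
    return rate, alignments


rate, alignments = parse_index(INDEX_XML)
```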

Corpus entries are organized in a folder hierarchy. There is a fileset for each corpus entry. Usually, the files for a specific corpus entry reside in a leaf directory (i.e. a directory without further subdirectories). If there is a project file, this file is used to locate the files needed to 

Audio data is provided as WAV files with a sampling rate of 44.1 kHz (stereo). Because most ASR corpora provide their recordings with a sampling rate of 16 kHz, the files were downsampled and the alignment information adjusted accordingly. The raw transcription is integrated as-is. The XML files are parsed to extract the alignment data. Alignment data, text and downsampled audio are then merged into a corpus entry as described above.
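
Adjusting the alignment information after downsampling amounts to rescaling each frame index by the ratio of the two sampling rates. A minimal sketch, assuming frame indices are plain integers and rounding to the nearest frame is acceptable (`rescale_frame` is a hypothetical helper name):

```python
def rescale_frame(frame, source_rate=44100, target_rate=16000):
    """Map a frame index at `source_rate` to the nearest frame at `target_rate`."""
    return round(frame * target_rate / source_rate)


# segment boundaries taken from the Segmentation.xml sample above
start, end = rescale_frame(83790), rescale_frame(122598)
```

The position in seconds stays the same (here 1.9 s and 2.78 s), only the number of frames per second changes.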

#### Create corpus entries
We need to extract the alignments from the segmentation information of the raw data. For this, the downloaded data needs to be converted to corpus entries. This process takes a few minutes, so this is a good time to have a coffee break.

In [None]:
display(create_rl_button)

#### Explore corpus
Let's load the newly created corpus (needs to be done only once) and print some stats:

In [None]:
rl_corpus = load_corpus(rl_target_root)
rl_corpus.summary()

You can access each corpus entry either by a numerical index or by its ID (string).

In [None]:
# access by index
first_entry = rl_corpus[0]
first_entry.summary()

# access by ID
other_entry = rl_corpus['news170524']
other_entry.summary()

# get a list of IDs
rl_corpus.keys

You can also filter the corpus by language to get only the corpus entries with the specified language(s):

In [None]:
rl_corpus_de = rl_corpus(languages='de')
rl_corpus_de.summary()

rl_corpus_fr = rl_corpus(languages='fr')
rl_corpus_fr.summary()

rl_corpus_de_fr = rl_corpus(languages=['de', 'fr'])
rl_corpus_de_fr.summary()

To see if everything worked as expected let's check out a sample alignment. You can execute the cell below to show a random alignment from a random corpus entry. You can execute the cell several times to see different samples.

In [None]:
corpus_entry = random.choice(rl_corpus_de)
# corpus_entry = rl_corpus['edznachrichten180201']
show_corpus_entry(corpus_entry)

### Create LibriSpeech Corpus
The _LibriSpeech_ raw data is split into training-, dev- and test-set (`train-*.tar.gz`, `dev-*.tar.gz` and `test-*.tar.gz`). However, those sets only contain the transcript as a set of segments and an audio file for each segment. They do not contain any temporal information which is needed for alignment.

Luckily, there is also the `original-mp3.tar.gz` archive for download, which contains the original LibriVox mp3 files (from which the corpus was created) along with the alignment information. Alignment is made on utterance level, i.e. the transcript is split up into segments where each segment corresponds to an utterance. Segments were derived by allowing splitting on every silence interval longer than 300ms.

The data is organized into subdirectories of the following path format:

    ./LibriSpeech/mp3/{speaker_id}/{chapter_id}/

There is one directory per entry containing all the information about a recording. For this project the following files are important:

- **Audio recording** `{chapter_id}.mp3`: The audio file containing the recording. The audio is mono with a bitrate of 128 kbit/s and a sampling rate of 44.1 kHz and needs to be converted/resampled to the target format.
- **Transcription file** `{speaker_id}-{chapter_id}.trans.txt`: Text file containing the transcriptions of the segments (one segment per line). Each line is prefixed with the transcription ID. The transcription is all uppercase and does not contain any punctuation.
```
14-208_0000 CHAPTER ELEVEN THE MORROW BROUGHT A VERY SOBER LOOKING MORNING THE SUN MAKING ONLY A FEW EFFORTS...
```
- **Segmentation file** `{speaker_id}-{chapter_id}.seg.txt`: Text file containing temporal information about the segments (one segment per line). Each line is prefixed with the ID of the transcription for which the information is valid. The time is indicated in seconds. Example:
```
14-208_0000 25.16 40.51
```
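
Both file formats are simple enough to parse line by line. The sketch below shows one way to do this and to convert the times in seconds to frame indices at the target sampling rate of 16 kHz. The function names are hypothetical and the parsing assumes exactly the whitespace-separated layout shown in the examples above.

```python
def parse_transcription_line(line):
    """Split a line of a *.trans.txt file into (transcription_id, text)."""
    segment_id, text = line.strip().split(' ', 1)
    return segment_id, text


def parse_segmentation_line(line):
    """Split a line of a *.seg.txt file into (transcription_id, start, end) in seconds."""
    segment_id, start, end = line.split()
    return segment_id, float(start), float(end)


def seconds_to_frame(seconds, rate=16000):
    """Convert a position in seconds to the nearest frame index at `rate`."""
    return int(round(seconds * rate))


segment_id, start, end = parse_segmentation_line('14-208_0000 25.16 40.51')
start_frame, end_frame = seconds_to_frame(start), seconds_to_frame(end)
```

Since both files share the transcription ID as a prefix, the text and the temporal information of a segment can be joined on that ID.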

To create the corpus, these files had to be parsed and the audio converted and downsampled to a 16 kHz WAV file.
Information about the speakers, chapters and books was extracted from the respective files (`SPEAKERS.TXT`, `CHAPTERS.TXT` and `BOOKS.TXT`).

#### Missing speech segments
The original idea was to treat the aligned utterances as speech segments and the gaps between them as pause segments. Unfortunately it turned out that only part of the original text was aligned. As a result, the segmentation file contained some "gaps" between aligned audio, which corresponded to utterances of text passages which were then being falsely treated as pauses.
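
The original (naive) idea can be sketched as follows: every gap between two consecutive aligned utterances becomes a pause segment. This is an illustration only, with segments simplified to `(start, end)` tuples in seconds and a hypothetical function name. As described above, this approach mislabels gaps that actually contain unaligned speech.

```python
def fill_gaps_with_pauses(speech_segments):
    """Insert a pause segment into every gap between consecutive speech segments.

    `speech_segments` is a list of (start, end) tuples sorted by start time.
    Returns a list of (segment_type, start, end) tuples.
    """
    segments = []
    previous_end = None
    for start, end in speech_segments:
        if previous_end is not None and start > previous_end:
            segments.append(('pause', previous_end, start))
        segments.append(('speech', start, end))
        previous_end = end
    return segments


result = fill_gaps_with_pauses([(0.0, 1.0), (1.5, 2.0), (2.0, 3.0)])
```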

To detect whether an assumed pause segment was actually a speech segment, the speech segments before and after it had to be examined and compared with the original passage from the book. If the original passage from the book contained text between the transcripts of the two speech segments, the pause segment was converted to a speech segment. Since no alignment information is available for this missing speech segment, this leads to three subsequent speech segments, i.e. there are no pause segments in between. To compare aligned text from the segmentation file with the original passage from the book, all text was normalized (uppercase, removal of non-ASCII characters, removal of punctuation).
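
The check described above could look roughly like this. It is a simplified sketch, not the project's implementation: `normalize` and `gap_contains_speech` are hypothetical names, whitespace is additionally collapsed for easier comparison, and the transcripts are located with a plain substring search, which assumes that the normalized transcripts occur verbatim in the normalized book text.

```python
import re
import string


def normalize(text):
    """Uppercase, drop non-ASCII characters and punctuation, collapse whitespace."""
    text = text.upper()
    text = text.encode('ascii', 'ignore').decode('ascii')
    text = text.translate(str.maketrans('', '', string.punctuation))
    return re.sub(r'\s+', ' ', text).strip()


def gap_contains_speech(book_text, transcript_before, transcript_after):
    """Return True if the book has text between the two aligned transcripts."""
    book = normalize(book_text)
    before = normalize(transcript_before)
    after = normalize(transcript_after)
    start = book.find(before)
    if start < 0:
        return False                      # cannot locate the first transcript
    gap_start = start + len(before)
    gap_end = book.find(after, gap_start)
    if gap_end < 0:
        return False                      # cannot locate the second transcript
    return bool(book[gap_start:gap_end].strip())


book = 'The sun rose. It was a cold morning! The birds sang.'
missing = gap_contains_speech(book, 'THE SUN ROSE', 'THE BIRDS SANG')
adjacent = gap_contains_speech(book, 'THE SUN ROSE', 'IT WAS A COLD MORNING')
```

In the first call the passage "It was a cold morning" lies between the two transcripts, so the assumed pause would be converted to a speech segment. In the second call the transcripts are adjacent and the pause is kept.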

#### Download raw data
To create the LibriSpeech corpus you first need to download the raw data. The files are over 80GB and need to be extracted, so this might take a while...

In [None]:
display(download_ls_button)

#### Create corpus
We need to extract the alignments from the segmentation information of the raw data. For this, the downloaded data needs to be converted to corpus entries. **This process takes several hours, so you might want to do this just before knocking-off time!**

In [None]:
display(create_ls_button)

#### Explore corpus
Again, let's load the newly created corpus:

In [None]:
ls_corpus = load_corpus(ls_target_root)
ls_corpus.summary()

To see if everything worked as expected let's check out a sample alignment. You can execute the cell below to show a random alignment from a random corpus entry. You can execute the cell several times to see different samples.

In [None]:
corpus_entry = random.choice(ls_corpus)
# corpus_entry = ls_corpus[0]
corpus_entry.summary()
show_corpus_entry(corpus_entry)