# IP8
This IPython notebook documents and visualizes some crucial steps taken over the course of the project. It should help the reader understand how and why decisions were made, and it illustrates some important concepts with examples.

## Corpora
Every Neural Network needs training data. The RNN used in this project is no exception. Since this project is about Forced Alignment (FA), training data consisted of pre-aligned audio and transcript data. This training data was derived from the following resources:

* **ReadyLingua**: Aligned data in various languages and by various speakers provided by ReadyLingua.
* **LibriSpeech**: Open-source ASR corpus (http://openslr.org/12/) containing roughly 1000 hours of aligned speech data.
* ... (additional Corpora tbd.)

In order to use all data for training, it had to be converted to a common format. Since (to my knowledge) there is no standardized format for FA, I had to define one myself. I went for the following structure for a single corpus entry:

```JSON
// definition of the corpus
corpus = [corpus_entry]

// definition of an individual corpus entry
corpus_entry = 
{
    'audio': [byte],                 // bytes from the audio file
    'transcript': string,            // raw (unaligned) text 
    'speech-pauses': [speech_pause], // segmentation of the audio file into speech and pause segments
    'alignment': [alignment]         // alignment of bits of the unaligned text with the audio
}

// definition of a speech or pause segment
speech_pause = 
{
    'id': string,                    // some unique identifier
    'start': int,                    // start frame of the segment
    'end': int,                      // end frame of the segment
    'class': string                  // 'speech' for a speech segment, 'pause' for a pause segment
}

// definition of an alignment
alignment = 
{
    'text': string,                  // text that is being spoken in the audio
    'start': int,                    // start frame in the audio file (when the text starts)
    'end': int                       // end frame in the audio file (when the text stops)
}
```
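
To make the structure above more concrete, the following is a minimal sketch of a single corpus entry written as a plain Python dict, together with a small consistency check. The values are invented for illustration, and `check_entry` is a hypothetical helper that is not part of the project code.

```python
# Hypothetical example values; only the keys follow the structure defined above.
corpus_entry = {
    'audio': b'\x00\x01' * 8000,  # placeholder bytes, not real audio
    'transcript': 'hello world',
    'speech-pauses': [
        {'id': '1', 'start': 0, 'end': 4000, 'class': 'speech'},
        {'id': '2', 'start': 4000, 'end': 8000, 'class': 'pause'},
    ],
    'alignment': [
        {'text': 'hello world', 'start': 0, 'end': 4000},
    ],
}


def check_entry(entry):
    """Return a list of problems found in a corpus entry (empty if none)."""
    problems = []
    for key in ('audio', 'transcript', 'speech-pauses', 'alignment'):
        if key not in entry:
            problems.append(f'missing key: {key}')
    for segment in entry.get('speech-pauses', []):
        if segment['class'] not in ('speech', 'pause'):
            problems.append(f"unknown class: {segment['class']}")
        if segment['start'] > segment['end']:
            problems.append(f"segment {segment['id']} ends before it starts")
    for alignment in entry.get('alignment', []):
        if alignment['start'] > alignment['end']:
            problems.append(f"alignment '{alignment['text']}' ends before it starts")
        if alignment['text'] not in entry.get('transcript', ''):
            problems.append(f"alignment text not in transcript: {alignment['text']}")
    return problems


print(check_entry(corpus_entry))
```

A check like this is only a convenience for spotting malformed entries early; it does not verify that the alignment is actually correct.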

### ReadyLingua Corpus
ReadyLingua (RL) provides alignment data distributed over several files:

* `*.wav`: Audio file containing the speech
* `*.txt`: UTF-8 encoded (unaligned) transcript
* `* - Segmentation.xml`: file containing the definition of speech and pause segments
```XML
<Segmentation>
    <SelectionExtension>0</SelectionExtension>
    <Segments>
        <Segment id="1" start="83790" end="122598" class="Speech" uid="5" />
	...
    </Segments>
    <Segmenter SegmenterType="SICore.AudioSegmentation.EnergyThresholding">
        <MaxSpeechSegmentExtension>50</MaxSpeechSegmentExtension>
        <Length>-1</Length>
        <Energies>
            <Value id="1" value="0" />
            ...
        </Energies>
        <OriginalSegments>
            <Segment id="1" start="83790" end="100548" class="Speech" uid="2" />
            ...
        </OriginalSegments>
        <EnergyPeak>3569753</EnergyPeak>
        <StepSize>441</StepSize>
        <ITL>146139</ITL>
        <ITU>730695</ITU>
        <LastUid>2048</LastUid>
        <MinPauseDuration>200</MinPauseDuration>
        <MinSpeechDuration>150</MinSpeechDuration>
        <BeginOfSilence>1546255</BeginOfSilence>
        <SilenceLength>100</SilenceLength>
        <ThresholdCorrectionFactor>1</ThresholdCorrectionFactor>
    </Segmenter>
</Segmentation>
```
* `* - Index.xml`: file containing the actual alignments of text to audio
```XML
<XMLIndexFile>
    <Version>2.0.0</Version>
    <SamplingRate>44100</SamplingRate>
    <NumberOfIndices>91</NumberOfIndices>
    <TextAudioIndex>
        <TextStartPos>0</TextStartPos>
        <TextEndPos>36</TextEndPos>
        <AudioStartPos>952101</AudioStartPos>
        <AudioEndPos>1062000</AudioEndPos>
        <SpeakerKey>-1</SpeakerKey>
    </TextAudioIndex>
    ...
</XMLIndexFile>    
```
* `* - Project.xml`: Project file binding the different files together for a corpus entry (note: this file is optional, i.e. there may be no project file for a corpus entry)
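
The two XML formats shown above could be read with Python's standard `xml.etree.ElementTree` module. The following is a minimal sketch, not the project's actual parser. It assumes that `TextStartPos` and `TextEndPos` are character offsets into the transcript (with an exclusive end) and that the `class` attribute should be lowercased to match the common format; both are assumptions. The function names `parse_segments` and `parse_index` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened versions of the samples shown above
SEGMENTATION_XML = """<Segmentation>
    <Segments>
        <Segment id="1" start="83790" end="122598" class="Speech" uid="5" />
    </Segments>
</Segmentation>"""

INDEX_XML = """<XMLIndexFile>
    <SamplingRate>44100</SamplingRate>
    <TextAudioIndex>
        <TextStartPos>0</TextStartPos>
        <TextEndPos>36</TextEndPos>
        <AudioStartPos>952101</AudioStartPos>
        <AudioEndPos>1062000</AudioEndPos>
    </TextAudioIndex>
</XMLIndexFile>"""


def parse_segments(xml_text):
    """Extract speech/pause segments from a Segmentation XML string."""
    root = ET.fromstring(xml_text)
    # Only <Segments> directly below the root is read;
    # <OriginalSegments> inside <Segmenter> is ignored.
    return [
        {
            'id': node.get('id'),
            'start': int(node.get('start')),
            'end': int(node.get('end')),
            'class': node.get('class').lower(),  # assumption: 'Speech' -> 'speech'
        }
        for node in root.findall('./Segments/Segment')
    ]


def parse_index(xml_text, transcript):
    """Extract alignments from an Index XML string."""
    root = ET.fromstring(xml_text)
    alignments = []
    for node in root.findall('./TextAudioIndex'):
        text_start = int(node.findtext('TextStartPos'))
        text_end = int(node.findtext('TextEndPos'))
        alignments.append({
            # assumption: positions are character offsets, end exclusive
            'text': transcript[text_start:text_end],
            'start': int(node.findtext('AudioStartPos')),
            'end': int(node.findtext('AudioEndPos')),
        })
    return alignments


segments = parse_segments(SEGMENTATION_XML)
alignments = parse_index(INDEX_XML, 'x' * 40)  # dummy transcript
```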

Corpus entries are organized in a folder hierarchy. There is a fileset for each corpus entry. Usually, the files for a specific corpus entry reside in a leaf directory (i.e. a directory without further subdirectories). If there is a project file, this file is used to locate the files needed to 
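
Leaf directories can be found with `os.walk` from the standard library. The sketch below is one possible way to collect the fileset of each leaf directory based on the file name patterns listed above. The helper names `find_leaf_directories` and `find_fileset` and the sample file names are hypothetical, and the handling of project files is left out.

```python
import os
import tempfile


def find_leaf_directories(root_dir):
    """Yield every directory at or below root_dir that has no subdirectories."""
    for dirpath, dirnames, _filenames in os.walk(root_dir):
        if not dirnames:
            yield dirpath


def find_fileset(directory):
    """Map each file role to a file name, based on the naming patterns."""
    fileset = {}
    for name in sorted(os.listdir(directory)):
        if name.endswith(' - Segmentation.xml'):
            fileset['segmentation'] = name
        elif name.endswith(' - Index.xml'):
            fileset['index'] = name
        elif name.endswith(' - Project.xml'):
            fileset['project'] = name
        elif name.endswith('.wav'):
            fileset['audio'] = name
        elif name.endswith('.txt'):
            fileset['transcript'] = name
    return fileset


# Usage on a throwaway directory tree (hypothetical file names)
with tempfile.TemporaryDirectory() as root:
    leaf = os.path.join(root, 'german', 'entry1')
    os.makedirs(leaf)
    for name in ('a.wav', 'a.txt', 'a - Segmentation.xml', 'a - Index.xml'):
        open(os.path.join(leaf, name), 'w').close()
    leaves = list(find_leaf_directories(root))
    fileset = find_fileset(leaves[0])
```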

Audio data is provided as WAV files with a sampling rate of 44.1 kHz (stereo). Because most ASR corpora provide their recordings at a sampling rate of 16 kHz, the files were downsampled and the alignment information was adjusted accordingly. The raw transcript is integrated as-is. The XML files are parsed to extract the alignment data. Alignment data, transcript and downsampled audio are then merged into a corpus entry as described above.
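
Adjusting the alignment information after downsampling amounts to rescaling the frame positions by the ratio of the two sampling rates. The sketch below illustrates this under the assumption that `start` and `end` are sample indices at the original rate. The helper names are hypothetical, the alignment text is a placeholder, and the resampling of the audio itself is not shown.

```python
SOURCE_RATE = 44100  # sampling rate of the original recordings
TARGET_RATE = 16000  # sampling rate after downsampling


def rescale_frame(frame, source_rate=SOURCE_RATE, target_rate=TARGET_RATE):
    """Map a sample index at source_rate to the nearest index at target_rate."""
    return round(frame * target_rate / source_rate)


def rescale_alignment(alignment, source_rate=SOURCE_RATE, target_rate=TARGET_RATE):
    """Return a copy of an alignment with start/end rescaled to target_rate."""
    return {
        **alignment,
        'start': rescale_frame(alignment['start'], source_rate, target_rate),
        'end': rescale_frame(alignment['end'], source_rate, target_rate),
    }


# Audio positions taken from the Index.xml sample above; the text is a placeholder
original = {'text': 'placeholder text', 'start': 952101, 'end': 1062000}
rescaled = rescale_alignment(original)
print(rescaled)
```

Because positions are rounded to the nearest sample, a rescaled boundary can be off by at most half a sample at 16 kHz, which is negligible for alignment purposes.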

In [None]:
from corpus_util import *

# load corpora
print('loading ReadyLingua corpus...')
rl_corpus = load_corpus('readylingua/readylingua.corpus')
print(f'...done! Loaded {len(rl_corpus)} corpus entries')


loading ReadyLingua corpus...
...done! Loaded 1 corpus entries


In [5]:
from random import randint

import IPython.display

def select_entry(corpus, ix=None):
    # 'is not None' keeps index 0 selectable; randint is inclusive on both ends
    return corpus[ix] if ix is not None else corpus[randint(0, len(corpus) - 1)]

def select_alignment(audio, alignments, ix=None):
    # 'is not None' keeps index 0 selectable
    alignment_ix = ix if ix is not None else randint(0, len(alignments) - 1)
    
    alignment = alignments[alignment_ix]
    start = alignment['start']
    end = alignment['end']
    alignment_audio = audio[start:end]
    alignment_text = alignment['text']
    return alignment_audio, alignment_text

# display 5 random audio segments together with their aligned text
for i in range(5):
    corpus_entry = select_entry(rl_corpus)
    alignment_audio, alignment_text = select_alignment(corpus_entry['audio'], corpus_entry['alignments'])
    print(alignment_audio)
    #IPython.display.Audio(filename=alignment_audio)
    # inside a loop the HTML object must be passed to display() explicitly, otherwise nothing is rendered
    IPython.display.display(IPython.display.HTML(f'<p>{alignment_text}</p>'))






