# Putting it all together

This notebook shows how Forced Alignment of an unknown audio/transcript pair can be performed by combining the stages from the previous notebooks:

- **VAD-Stage**: The speech parts are extracted from the audio signal using WebRTC
- **ASR-Stage**: The speech parts are transcribed using an RNN. Because only PoCs were trained for this stage, a ceiling analysis is performed by substituting a state-of-the-art model. We will use [Google's Speech-to-Text API](https://cloud.google.com/speech-to-text/) for this.
- **LSA-Stage**: The partial transcripts are aligned with the original transcript.

All these stages are applied to a single corpus entry for demonstration purposes.

In [None]:
corpus_root = r'E:/'

In [None]:
from IPython.display import HTML, Audio, display
import ipywidgets as widgets

def show_audio(corpus_entry):
    title = HTML(f"""
    <h3>Sample corpus entry: {corpus_entry.name}</h3>
    <p><strong>Path to raw data</strong>: {corpus_entry.original_path}</p>
    <p>{len(corpus_entry.speech_segments)} speech segments, {len(corpus_entry.pause_segments)} pause segments</p>
    """)
    audio = Audio(data=corpus_entry.audio, rate=corpus_entry.rate)
    transcript = widgets.Accordion(children=[widgets.HTML(f'<pre>{corpus_entry.transcript}</pre>')], selected_index=None)
    transcript.set_title(0, 'Transcript')
    
    display(title)
    display(audio)
    display(transcript)

## The candidate
We will use an English text read by a female speaker in US English. The following entry from the LibriSpeech corpus has been randomly selected.

In [None]:
import os
from util.corpus_util import load_corpus

rl_corpus_root = os.path.join(corpus_root, 'librispeech-corpus')
rl_corpus = load_corpus(rl_corpus_root)
corpus_entry = rl_corpus['171001']
print(f'id: {corpus_entry.id}')
print(f'path to raw data: {corpus_entry.original_path}')
print(f'name: {corpus_entry.name}')
print(f'language: {corpus_entry.language}')
print(f'speaker id: {corpus_entry.speaker_id}')
print(f'chapter id: {corpus_entry.chapter_id}')

## VAD Stage
The audio signal of the candidate is approximately 30 minutes long and can be split into 548 segments:

In [None]:
from util.vad_util import *

audio, rate = corpus_entry.audio, corpus_entry.rate
voice_activities = extract_voice_activities(audio, rate)
print(f'got {len(voice_activities)} voice_activities')
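
The internals of `extract_voice_activities` are not shown in this notebook. WebRTC's VAD classifies short frames (10, 20 or 30 ms) as speech or non-speech, so some grouping step is needed to turn those frame-level decisions into segments. The following is a minimal sketch of such a grouping step. The names `Segment` and `frames_to_segments` as well as the `max_gap_frames` parameter are hypothetical and only serve illustration; the actual implementation may work differently.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Segment:
    start_ms: int  # inclusive start of the speech segment in milliseconds
    end_ms: int    # exclusive end of the speech segment in milliseconds


def frames_to_segments(is_speech: Sequence[bool], frame_ms: int = 30,
                       max_gap_frames: int = 2) -> List[Segment]:
    """Merge per-frame speech flags into segments, bridging short pauses.

    Hypothetical helper: a pause of up to `max_gap_frames` non-speech frames
    does not split a segment (assumed behaviour, not taken from vad_util).
    """
    segments: List[Segment] = []
    start = None  # index of the first frame of the currently open segment
    gap = 0       # consecutive non-speech frames since the last speech frame
    for i, flag in enumerate(is_speech):
        if flag:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap > max_gap_frames:
                # close the segment right after the last speech frame
                segments.append(Segment(start * frame_ms, (i - gap + 1) * frame_ms))
                start, gap = None, 0
    if start is not None:
        # the signal ended while a segment was still open
        segments.append(Segment(start * frame_ms, (len(is_speech) - gap) * frame_ms))
    return segments
```

With 30 ms frames and the default `max_gap_frames=2`, a single non-speech frame between two speech frames stays inside one segment, whereas three consecutive non-speech frames start a new one.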

## ASR stage
Each segment can now be transcribed with the Google-STT API. Because processing all segments takes some time, only the first 10 speech segments are transcribed here for demonstration purposes. **Also note that free usage of the API is subject to a time and/or call limit. Executing the following cell excessively will therefore deplete this limit!**

In [None]:
from util.asr_util import *

print('transcripts of first 10 speech segments (generated by ASR)')
print()
partial_transcripts = []
for i, va in enumerate(voice_activities[:10], 1):    
    partial_transcript = transcribe_audio(va.audio, va.rate)
    partial_transcripts.append(partial_transcript)
    print(i, partial_transcript)
    display(Audio(data=va.audio, rate=va.rate))
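
One way to avoid spending the API quota on repeated runs is to cache each transcript on disk. The following is only a sketch under stated assumptions: `cached_transcribe` and the `asr_cache` directory are hypothetical names, the `transcribe` argument stands in for a function such as `transcribe_audio`, and the audio is assumed to be available as raw bytes (a NumPy array could be converted with `.tobytes()`).

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def cached_transcribe(audio_bytes: bytes, rate: int,
                      transcribe: Callable[[bytes, int], str],
                      cache_dir: str = 'asr_cache') -> str:
    """Return the transcript for a segment, calling `transcribe` only on a cache miss.

    Hypothetical helper: the cache key is a hash of the raw audio and the sample rate.
    """
    key = hashlib.sha256(audio_bytes + str(rate).encode()).hexdigest()
    cache_file = Path(cache_dir) / f'{key}.json'
    if cache_file.exists():
        # cache hit: no API call is made
        return json.loads(cache_file.read_text(encoding='utf-8'))['transcript']
    transcript = transcribe(audio_bytes, rate)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps({'transcript': transcript}), encoding='utf-8')
    return transcript
```

Re-running a cell that uses such a wrapper would then only hit the API for segments that have not been transcribed before.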

Compare these transcripts with the original transcript:

In [None]:
show_audio(corpus_entry)

The generated transcripts are clearly not perfect. Comparing them with the original transcript immediately reveals some errors in the transcripts generated by the RNN (e.g. _stubbly ratchet_ instead of _doubly wretched_, _traps a prima_ instead of _trap's abramuh_). In addition, the ASR model was unable to generate a transcript for the last speech segment.

On the other hand, even though the generated transcripts contain some errors, their pronunciation is actually very close to that of the original transcript, and they could indeed be valid transcriptions in other situations (e.g. _LC_ instead of _Elsie_). The words and sentences of the transcripts are also orthographically and grammatically correct, a clear indication that a language model has been used to improve the raw results from the first pass.

## LSA-Stage

To see whether the output quality of the ASR stage is high enough to align the individual partial transcripts with the original transcript, the Smith-Waterman algorithm from the LSA stage can be applied. The resulting (textual) alignment and the temporal information from the speech segments can then be combined to obtain an alignment between the original audio/transcript pair.
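
To illustrate the idea of local sequence alignment, the following minimal sketch applies Smith-Waterman to lists of words and returns the best-matching region of the reference. It is not the implementation from `util.lsa_util`: the function name `smith_waterman`, the word-level granularity and the scoring values (`match`, `mismatch`, `gap`) are assumptions chosen for illustration.

```python
from typing import List, Tuple


def smith_waterman(query: List[str], reference: List[str],
                   match: int = 2, mismatch: int = -1, gap: int = -1
                   ) -> Tuple[int, int, int]:
    """Locally align `query` to `reference` on word level.

    Returns (best_score, start, end) so that reference[start:end] is the
    region of the reference covered by the best local alignment.
    """
    rows, cols = len(query) + 1, len(reference) + 1
    H = [[0] * cols for _ in range(rows)]  # scoring matrix, first row/column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if query[i - 1] == reference[j - 1] else mismatch)
            # scores never drop below 0, which is what makes the alignment local
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # trace back from the best cell until a cell with score 0 is reached
    i, j = best_pos
    end = j
    while i > 0 and j > 0 and H[i][j] > 0:
        score = match if query[i - 1] == reference[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + score:
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, j, end


reference = 'it was the best of times it was the worst of times'.split()
query = 'the worst off times'.split()  # contains one recognition error
score, start, end = smith_waterman(query, reference)
print(score, ' '.join(reference[start:end]))
```

Despite the misrecognized word, the partial transcript is located in the second half of the reference, because the surrounding matches outweigh the single mismatch penalty.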

For simplicity, the 548 detected speech segments have been transcribed with the Google-STT engine and the result has been saved [to a file](../assets/171001.txt). The following code calculates the alignments for the 10 partial transcripts above. Again, for simplicity, the alignments for all partial transcripts have been pre-calculated for demonstration purposes.

In [None]:
from util.lsa_util import *
from tabulate import tabulate

alignments = align_transcripts(partial_transcripts, corpus_entry.transcript)
print(tabulate(alignments, headers=['partial', 'original', 'b']))

The sequence alignment information retrieved this way can be combined with the temporal information to obtain the alignment between audio signal and transcript: