# Putting it all together

The purpose of this notebook is to show how Forced Alignment on an unknown audio/transcript pair could be done by combining all the stages from the previous notebooks:

- **VAD-Stage**: The speech parts are extracted from the audio signal using WebRTC
- **ASR-Stage**: The speech parts are transcribed using an RNN. Because only proofs of concept (PoCs) were trained for this stage, a ceiling analysis is performed by substituting a state-of-the-art model. We will use [Google's Speech-to-Text API](https://cloud.google.com/speech-to-text/) for this.
- **LSA-Stage**: The partial transcripts are aligned with the original transcript using the Smith-Waterman algorithm and the Levenshtein Similarity.

All of these stages are applied to individual audio/transcript pairs for demonstration purposes. The alignments are visualized in an HTML page containing an audio player and the transcript, where the aligned passages are highlighted as the audio plays. To view these pages, you need to start a server by running the following command from the source directory of this project:

    python ./demos/server.py
    
A list of already prepared alignments can then be found under [http://localhost:8000](http://localhost:8000).

In [None]:
from util.audio_util import *
from util.corpus_util import *
from util.vad_util import *
from util.asr_util import *
from util.lsa_util import *

import librosa
from pattern3.metrics import levenshtein_similarity
from tabulate import tabulate
from IPython.display import HTML, Audio
import ipywidgets as widgets
from pathlib import Path
from os.path import join

demo_dir = join('..', 'assets', 'demo_files')

def show_link(url, text):
    display(HTML(f"<a href='{url}'>{text}</a>"))

def vad(audio, rate):
    voice_segments = extract_voice(audio, rate)
    print(f'got {len(voice_segments)} voice_segments')
    return voice_segments

def asr(voice_segments, max_segments=10, language='en'):
    voice_segments = transcribe(voice_segments[:max_segments], language)

    print(f'ASR-transcripts of first {max_segments} voiced segments:')
    print()
    for i, voice in enumerate(voice_segments, 1):    
        print(i, voice.transcript)
        display(Audio(data=voice.audio, rate=voice.rate))
        
    return voice_segments

def lsa(voice_segments, transcript):
    alignments = align(voice_segments, transcript)
    for al in alignments:
        partial_transcript = al.transcript
        alignment_text = al.alignment_text
        # levenshtein_similarity returns a similarity score, not an edit distance
        similarity = levenshtein_similarity(partial_transcript.upper(), alignment_text.upper())
        print(f'similarity: {similarity}, transcript: «{partial_transcript}», aligned text: «{alignment_text}»')
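
The similarity score printed by `lsa` comes from `levenshtein_similarity` in `pattern3`. As a rough, self-contained illustration of the idea (not the library's code), the sketch below normalizes the edit distance by the length of the longer string, so that identical strings score 1.0. The names `levenshtein_distance` and `normalized_similarity` are made up for this example, and the exact normalization used by `pattern3` is an assumption.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))  # distances from '' to each prefix of b
    for i, char_a in enumerate(a, 1):
        current = [i]  # distance from a[:i] to ''
        for j, char_b in enumerate(b, 1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1.0 for identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein_distance(a, b) / longest


print(normalized_similarity('AND LAWMAKERS JOINED ME', 'HELLO AMERICANS JOIN ME'))
```

Uppercasing both strings first, as `lsa` does, keeps differences in capitalization from lowering the score.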

## A simple example

We will use a recording and a transcript of Donald Trump's weekly address made on February 11, 2018. Audio and text were downloaded [here](https://www.whitehouse.gov/briefings-statements/president-donald-j-trumps-weekly-address-27/). Apart from extracting the audio from the video as MP3, no processing was done. The example has also not been used in any way before.

You can see the unaligned audio and transcript below:

In [None]:
example_audio_path = join(demo_dir, 'address.mp3')
example_audio_transcript = join(demo_dir, 'address.txt')

audio, rate = read_audio(example_audio_path)
transcript = Path(example_audio_transcript).read_text(encoding='utf-8')

display(Audio(data=audio, rate=rate))
display(HTML(transcript))

### VAD Stage
The audio signal of this example is approximately 2:30 minutes long and can be split into 41 voiced segments:

In [None]:
voice_segments = vad(audio, rate)
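
WebRTC's VAD classifies short, fixed-length frames as speech or non-speech, so voiced segments have to be assembled from those per-frame decisions. How `extract_voice` does this is not shown in this notebook. The following is only a minimal sketch of one plausible way: the helper `frames_to_segments` and its parameters `frame_ms` and `max_gap_frames` are hypothetical, and the per-frame flags are given directly instead of being computed from audio.

```python
from typing import List, Tuple


def frames_to_segments(flags: List[bool], frame_ms: int = 30,
                       max_gap_frames: int = 3) -> List[Tuple[float, float]]:
    """Merge per-frame speech flags into (start, end) segments in seconds.

    Voiced frames separated by at most `max_gap_frames` unvoiced frames
    are merged into the same segment.
    """
    frame_ranges = []   # (first voiced frame, index after last voiced frame)
    start = None        # first voiced frame of the segment being built
    last_voiced = None  # most recent voiced frame seen so far
    for i, voiced in enumerate(flags):
        if not voiced:
            continue
        if start is None:
            start = i
        elif i - last_voiced - 1 > max_gap_frames:
            # the pause was too long: close the segment and open a new one
            frame_ranges.append((start, last_voiced + 1))
            start = i
        last_voiced = i
    if start is not None:
        frame_ranges.append((start, last_voiced + 1))
    return [(s * frame_ms / 1000, e * frame_ms / 1000) for s, e in frame_ranges]


demo_flags = [False, True, True, False, True, False, False, False, False, True]
print(frames_to_segments(demo_flags))
```

A larger `max_gap_frames` tolerates longer pauses within a segment and therefore yields fewer, longer segments.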

### ASR stage
Each segment can now be transcribed using the Google STT API. Because processing all segments may take some time, only the first 10 voiced segments are transcribed here for demonstration purposes. **Also note that free usage of the API is subject to a time and/or call limit. Executing the following cell too often will therefore deplete the usage limit!**

In [None]:
voice_segments = asr(voice_segments)

We can see that the STT API provides very good transcriptions, except for segment `3`, where the phrase _and lawmakers joined me_ was transcribed as _hello Americans join me_.

### LSA-Stage

To see whether the output quality of the ASR stage is high enough to align the individual partial transcripts with the original transcript, the Smith-Waterman algorithm from the LSA stage can be applied. The resulting (textual) alignment and the temporal information from the speech segments can then be combined to obtain an alignment between the original audio/transcript pair.
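
To make the idea of local alignment concrete, here is a minimal character-level sketch of the Smith-Waterman algorithm. It is not the implementation behind `align`: the function name `smith_waterman`, the scoring values and the choice to align characters rather than words are all assumptions made for illustration.

```python
def smith_waterman(query: str, text: str, match: int = 2,
                   mismatch: int = -1, gap: int = -1):
    """Return (score, start, end) so that text[start:end] is the part of
    `text` covered by the best-scoring local alignment with `query`."""
    rows, cols = len(query) + 1, len(text) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    def diagonal(i, j):
        # score of aligning query[i-1] with text[j-1]
        return score[i - 1][j - 1] + (match if query[i - 1] == text[j - 1] else mismatch)

    # fill the scoring matrix; negative scores are clipped to 0 (local alignment)
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(0, diagonal(i, j),
                              score[i - 1][j] + gap,   # skip a query character
                              score[i][j - 1] + gap)   # skip a text character
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # trace back from the best cell until the score drops to 0
    i, j = best_pos
    end = j
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == diagonal(i, j):
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, j, end


demo_text = 'the quick brown fox jumps'
demo_score, demo_start, demo_end = smith_waterman('quick brown', demo_text)
print(demo_score, repr(demo_text[demo_start:demo_end]))
```

Because scores never drop below zero, text before and after the best-matching region does not penalize the alignment, which is why a short partial transcript can be located inside a much longer transcript.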

The following code calculates the alignments for the 10 partial transcripts above.

In [None]:
lsa(voice_segments, transcript)

The sequence alignment information retrieved this way can be combined with the temporal information to obtain the alignment between the audio signal and the transcript. This has been done for the whole sample (all 41 voiced segments). While doing so, the following intermediate results have been saved:

* [file containing the results of the ASR stage (partial transcripts)](../demos/htdocs/address/transcript_asr.txt): blank lines mean no transcript could be generated
* [file containing the results of the LSA stage (alignments)](../demos/htdocs/address/alignment.txt): This includes the score for the Levenshtein Similarity

The end result can be viewed [here](http://localhost:8000/address)
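
Combining the two kinds of information amounts to attaching the start and end time of each voiced segment to the character range it was aligned with. The sketch below shows this under assumed data structures: `AlignedSegment` and `to_alignment_entries` are hypothetical names, the times and offsets are made-up values, and the JSON layout is an illustration, not the format used by the demo pages.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class AlignedSegment:
    start_time: float  # seconds, from the VAD stage
    end_time: float    # seconds, from the VAD stage
    text_start: int    # character offset in the transcript, from the LSA stage
    text_end: int      # end offset (exclusive), from the LSA stage


def to_alignment_entries(segments: List[AlignedSegment], transcript: str) -> List[dict]:
    """Pair each segment's time span with the transcript text it covers."""
    return [{
        'audio_start': seg.start_time,
        'audio_end': seg.end_time,
        'text_start': seg.text_start,
        'text_end': seg.text_end,
        'text': transcript[seg.text_start:seg.text_end],
    } for seg in segments]


demo_transcript = 'My fellow Americans, this week we made progress.'
demo_segments = [AlignedSegment(0.5, 1.9, 0, 19),
                 AlignedSegment(2.2, 4.0, 21, 47)]
demo_entries = to_alignment_entries(demo_segments, demo_transcript)
print(json.dumps(demo_entries, indent=2))
```

With entries of this kind, a player can highlight the text range of whichever entry contains the current playback position.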

## A more difficult example

To see how the pipeline performs on more challenging examples, we will use a sample from the Librivox corpus (ID=171001). This sample exhibits the following difficulties:

* the recording is much longer (approximately 30 minutes)
* the recording does not always match the transcript (e.g. there is the Librivox preamble about licenses, which is not included in the transcript)
* the recording contains passages with a fair amount of slang or old English, using words that are no longer in use today. Those passages might be hard to recognize in the ASR stage and have been transcribed using non-standard spelling.

In [None]:
example_audio_path = join(demo_dir, '171001.mp3')
example_audio_transcript = join(demo_dir, '171001.txt')

# only load first 64 seconds from audio
audio, rate = librosa.load(example_audio_path, duration=64)
# read first 24 lines from transcript
with open(example_audio_transcript, 'r', encoding='utf-8') as myfile:
    transcript = '\n'.join([next(myfile) for x in range(24)])

display(Audio(data=audio, rate=rate))
display(HTML(transcript))

### VAD stage

The VAD stage yields many more voiced segments than are present in the corpus metadata.

In [None]:
voice_segments = vad(audio, rate)

### ASR stage

Again, the first 10 voiced segments are transcribed.

In [None]:
voice_segments = asr(voice_segments)

Comparing these transcripts with the original transcript, it becomes clear that the generated transcripts are less accurate than in the simple example above. Some errors in the ASR-generated transcripts can be spotted immediately (e.g. _stubbly ratchet_ instead of _doubly wretched_, _traps a prima_ instead of _trap's abramuh_ etc.). Also, although not visible in the first 10 segments, the STT API was unable to generate a transcript for some segments.

On the other hand, we can see that even though some of the generated transcripts contain errors, their pronunciation is actually very close to the original transcript, and they could indeed be valid transcriptions in other situations (e.g. _LC_ or _else be_ instead of _Elsie_). Also, the words and sentences of the transcripts are orthographically and grammatically correct, a clear indication that a language model has been used to improve the raw results from the first pass.

### LSA stage

Finally, the transcribed 10 segments are aligned with the original transcript.

In [None]:
lsa(voice_segments, transcript)

As above, the whole pipeline was run on the complete example (all 521 voiced segments) and the results have been saved to:

* [file containing the results of the ASR stage (partial transcripts)](../demos/htdocs/171001/transcript_asr.txt): blank lines mean no transcript could be generated
* [file containing the results of the LSA stage (alignments)](../demos/htdocs/171001/alignment.txt): This includes the score for the Levenshtein Similarity

The end result can be viewed [here](http://localhost:8000/171001)

## Example in German

For the sake of completeness, an audio/transcript pair in a language other than English is aligned as well. The poem _An die Freude_ by Friedrich Schiller is used for this with the following audio and transcript:

In [None]:
example_audio_path = join(demo_dir, 'andiefreude.mp3')
example_audio_transcript = join(demo_dir, 'andiefreude.txt')

audio, rate = read_audio(example_audio_path)
transcript = Path(example_audio_transcript).read_text(encoding='utf-8')

display(Audio(data=audio, rate=rate))
display(HTML(transcript))

### VAD + ASR + LSA stage

The following cell contains the code to put the first 10 speech segments through the pipeline. As before, the alignment for the whole poem can be viewed [here](http://localhost:8000/andiefreude).

In [None]:
voice_segments = vad(audio, rate)
voice_segments = asr(voice_segments, language='de')
lsa(voice_segments, transcript)

## Try your own example

To see how the pipeline works with your own examples, provide absolute paths to an audio file and its transcript. For example, you can record yourself reading a [random article on Wikipedia](https://en.wikipedia.org/wiki/Special:Random) and save the recording and the text as `yourname.mp3` and `yourname.txt`. MP3 and WAV are the supported recording formats. The quality of the recording will affect the alignment result. The transcript must be a plain UTF-8-encoded text file.
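
Before starting the pipeline, it can be useful to check both files up front. The following sketch shows one way to do that. `check_inputs` and `SUPPORTED_AUDIO_SUFFIXES` are hypothetical names introduced here and are not part of the project's `util` modules.

```python
from pathlib import Path

# audio formats named above as supported
SUPPORTED_AUDIO_SUFFIXES = {'.mp3', '.wav'}


def check_inputs(audio_path, transcript_path):
    """Validate both paths and return the transcript text."""
    audio_file, transcript_file = Path(audio_path), Path(transcript_path)
    if audio_file.suffix.lower() not in SUPPORTED_AUDIO_SUFFIXES:
        raise ValueError(f'unsupported audio format: {audio_file.suffix!r}')
    for file in (audio_file, transcript_file):
        if not file.is_file():
            raise FileNotFoundError(f'file not found: {file}')
    # read_text raises UnicodeDecodeError if the file is not valid UTF-8
    return transcript_file.read_text(encoding='utf-8')
```

Calling `check_inputs(audio_path, transcript_path)` with the paths defined in the next cell would raise an error before any audio is processed or any API call is made.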

In [None]:
audio_path = r'D:\code\ip8\assets\demo_files\daniel.mp3' # enter path to audio file here (MP3 or WAV)
transcript_path = r'D:\code\ip8\assets\demo_files\daniel.txt' # enter path to transcript here (.txt file)

In [None]:
from util.e2e_util import *

url = create_demo(audio_path, transcript_path)
show_link(url, 'Click here to go to your alignment')

## Summary

In this notebook, the pipeline approach proposed by this project was evaluated on different audio/transcript pairs. A ceiling analysis was performed by using the API of Google's STT engine instead of a custom ASR implementation. With the most critical component replaced by a state-of-the-art ASR model, the pipeline was able to produce fairly good, although not perfect, alignments. Alignments were particularly difficult to generate if the audio contained slang or if the recording quality was poor.

Although the pipeline depends heavily on the quality of the ASR stage, the hypothesis [formulated at the beginning](00_Introduction.ipynb#Plan-and-Hypothesis) could be verified. The pipeline can be said to work in general, provided the quality of the partial transcriptions is high enough.