# Putting it all together

The purpose of this notebook is to check whether the pipeline works as a whole. To do so, we choose an audio/transcript pair and run it through all stages described in the previous notebooks:

- **VAD-Stage**: The speech parts are extracted from the audio signal using WebRTC
- **ASR-Stage**: The speech parts are transcribed using an RNN. Because only PoCs were trained for this stage, a ceiling analysis is performed by using a state-of-the-art model instead. We will use [Google's Speech-to-Text API](https://cloud.google.com/speech-to-text/) for this.
- **LSA-Stage**: The partial transcripts are aligned with the original transcript using the Smith-Waterman algorithm and the Levenshtein Similarity.
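The Levenshtein Similarity mentioned for the LSA stage can be illustrated with a minimal sketch. It assumes the similarity is the edit distance normalized by the length of the longer string, which is one common definition; the `levenshtein_similarity` function from `pattern3` used in this notebook may differ in details. The names `levenshtein_distance` and `normalized_similarity` are hypothetical and only serve this illustration.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Assumed definition: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein_distance(a, b) / longest
```

A value of 1.0 means both strings are identical, while values close to 0 mean that almost every character would have to be edited.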

The alignments are visualized in an HTML page containing an audio player and the transcript, where the alignments are highlighted as the audio plays. If you checked out the code yourself, you need to start a server in order to be able to see these pages. Run the following command from the source directory of this project:

    python ./demos/server.py
    
The server is then available under [http://localhost:8000](http://localhost:8888).
    
If you read this page as part of the official project documentation ([http://ip8.tiefenauer.info](http://ip8.tiefenauer.info)) you don't have to do anything. A list of already prepared alignments can then be found under [http://ip8.tiefenauer.info:8888](http://ip8.tiefenauer.info:8888).

In [None]:
from util.audio_util import *
from util.corpus_util import *
from util.vad_util import *
from util.asr_util import *
from util.lsa_util import *

import librosa
from pattern3.metrics import levenshtein_similarity
from tabulate import tabulate
from IPython.display import HTML, Audio
import ipywidgets as widgets
from pathlib import Path
from os.path import join

demo_dir = join('..', 'assets', 'demo_files')

def show_link(url, text):
    display(HTML(f"<a href='{url}'>{text}</a>"))

def vad(audio, rate):
    voice_segments = extract_voice(audio, rate)
    print(f'got {len(voice_segments)} voice_segments')
    return voice_segments

def asr(voice_segments, max_segments=10, language='en'):
    voice_segments = transcribe(voice_segments[:max_segments], language)

    print(f'ASR-transcripts of first {max_segments} voiced segments:')
    print()
    for i, voice in enumerate(voice_segments, 1):    
        print(i, voice.transcript)
        display(Audio(data=voice.audio, rate=voice.rate))
        
    return voice_segments

def lsa(voice_segments, transcript):
    alignments = align(voice_segments, transcript)
    for al in alignments:
        partial_transcript = al.transcript
        alignment_text = al.alignment_text
        similarity = levenshtein_similarity(partial_transcript.upper(), alignment_text.upper())
        print(f'similarity: {similarity}, transcript: «{partial_transcript}», aligned text: «{alignment_text}»')

## Example in English

We will use a recording and a transcript of Donald Trump's weekly address made on February 11, 2018. Audio and text were downloaded [here](https://www.whitehouse.gov/briefings-statements/president-donald-j-trumps-weekly-address-27/). Apart from extracting the audio from the video as MP3, no processing was done. Also the example has not been used in any way before.

You can see the unaligned audio and transcript below:

In [None]:
example_audio_path = join(demo_dir, 'address.mp3')
example_audio_transcript = join(demo_dir, 'address.txt')

audio, rate = read_audio(example_audio_path)
transcript = Path(example_audio_transcript).read_text(encoding='utf-8')

display(Audio(data=audio, rate=rate))
display(HTML(transcript))

### VAD Stage
The audio signal of the candidate is approximately 2:30 minutes long and can be split into 41 voiced segments:

In [None]:
voice_segments = vad(audio, rate)

### ASR stage
Each segment can now be transcribed by using the Google-STT API. Note that it may take some time to process all segments. Therefore only the first 10 voiced segments are transcribed here for demonstration purposes. **Also note that free usage of the API is constrained by a time and/or call limit. Executing the following cell excessively will therefore deplete the usage limit!**

In [None]:
voice_segments = asr(voice_segments)

We can see that the STT-API provides very good transcriptions, except for segment number `3`, where the phrase _and lawmakers joined me_ was transcribed as _hello Americans join me_.
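Since every call counts against the API usage limit, it can help to cache transcripts on disk so that re-running a cell does not trigger new requests. The following is a minimal sketch of that idea, not part of this project's code: `cached_transcribe`, `transcribe_fn` and `cache_dir` are hypothetical names, and `transcribe_fn` stands in for whatever function actually sends the audio bytes to the STT API.

```python
import hashlib
import json
from pathlib import Path


def cached_transcribe(audio_bytes: bytes, language: str, transcribe_fn, cache_dir: Path) -> str:
    """Return a cached transcript if one exists, otherwise call transcribe_fn once.

    transcribe_fn is a hypothetical callable (audio_bytes, language) -> str.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    # The cache key depends on both the language and the raw audio content
    key = hashlib.sha256(language.encode('utf-8') + b'\0' + audio_bytes).hexdigest()
    cache_file = cache_dir / f'{key}.json'
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding='utf-8'))['transcript']
    transcript = transcribe_fn(audio_bytes, language)
    cache_file.write_text(json.dumps({'transcript': transcript}), encoding='utf-8')
    return transcript
```

With such a wrapper, only the first execution for a given segment and language would consume quota; later executions read the stored result.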

### LSA-Stage

To see whether the output quality of the ASR stage is high enough to align the individual partial transcripts with the original transcript, the Smith-Waterman algorithm from the LSA stage can be applied. The resulting (textual) alignment and the temporal information from the speech segments can then be combined to obtain an alignment between the original audio/transcript pair.
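The idea behind the local alignment can be sketched in a few lines. The following is a simplified, character-based Smith-Waterman implementation under assumed scoring parameters (`match=2`, `mismatch=-1`, `gap=-1`); it is not the `align` function from `util.lsa_util`, whose scoring and tokenization may differ. The function name `smith_waterman` and its return format are hypothetical.

```python
def smith_waterman(query: str, reference: str, match=2, mismatch=-1, gap=-1):
    """Return (best_score, start, end) such that reference[start:end] is the
    best local match for query under the given (assumed) scoring scheme."""
    rows, cols = len(query) + 1, len(reference) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    def diagonal(i, j):
        # Score when aligning query[i-1] with reference[j-1]
        return score[i - 1][j - 1] + (match if query[i - 1] == reference[j - 1] else mismatch)

    # Fill the scoring matrix; scores never drop below 0 (local alignment)
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(0, diagonal(i, j), score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached
    i, j = best_pos
    end = j
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == diagonal(i, j):
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, j, end
```

For a partial transcript that occurs verbatim in the original transcript, the returned span covers exactly that occurrence. For imperfect ASR output, the span marks the region of the transcript that matches best locally.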

The following code calculates the alignments for the 10 partial transcripts above.

In [None]:
lsa(voice_segments, transcript)

The sequence alignment information retrieved in this way can be combined with the temporal information to get the alignment between audio signal and transcript. This has been done for the whole sample (all 41 voiced segments). The following intermediary results have been saved:

* [file containing the results of the ASR stage (partial transcripts)](../demos/htdocs/address/transcript_asr.txt): blank lines mean no transcript could be generated
* [file containing the results of the LSA stage (alignments)](../demos/htdocs/address/alignment.txt): This includes the score for the Levenshtein Similarity

The end result can be viewed [here](http://ip8.tiefenauer.info:8888/address)
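How textual and temporal information fit together can be illustrated with a small sketch. It assumes that each voiced segment carries a start and end time in seconds and that the LSA stage yields character offsets into the original transcript. The class `AlignedSegment` and the function `active_span` are hypothetical and do not reflect the data structures actually used by the demo pages.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class AlignedSegment:
    audio_start: float  # start of the voiced segment in seconds (from the VAD stage)
    audio_end: float    # end of the voiced segment in seconds
    text_start: int     # character offset of the aligned text in the transcript (from the LSA stage)
    text_end: int       # exclusive end offset of the aligned text


def active_span(segments: List[AlignedSegment], playback_time: float) -> Optional[Tuple[int, int]]:
    """Return the transcript span to highlight at playback_time, or None during silence."""
    for segment in segments:
        if segment.audio_start <= playback_time < segment.audio_end:
            return segment.text_start, segment.text_end
    return None
```

An audio player could call such a lookup while playing and highlight `transcript[text_start:text_end]` whenever a span is returned.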

## Example in German

For the sake of completeness, an audio/transcription pair in a language other than English shall be aligned. The poem _An die Freude_ by Friedrich Schiller is used for this with the following audio and transcript: 

In [None]:
example_audio_path = join(demo_dir, 'andiefreude.mp3')
example_audio_transcript = join(demo_dir, 'andiefreude.txt')

audio, rate = read_audio(example_audio_path)
transcript = Path(example_audio_transcript).read_text(encoding='utf-8')

display(Audio(data=audio, rate=rate))
display(HTML(transcript))

### VAD + ASR + LSA stage

The following cell contains the code to put the first 10 speech segments through the pipeline. As before, the alignment for the whole poem can be viewed [here](http://ip8.tiefenauer.info:8888/andiefreude)

In [None]:
voice_segments = vad(audio, rate)
voice_segments = asr(voice_segments, language='de')
lsa(voice_segments, transcript)

## Try your own example

To see how the pipeline works with your own examples, go to the [assets folder](/tree/assets/demo_files) and upload the following two files:

* an audio file (MP3 or WAV)
* a transcription file (TXT, UTF-8 encoded text)

For example, you can record yourself reading a [random article on Wikipedia](https://en.wikipedia.org/wiki/Special:Random). The quality of the recording will have an impact on the alignment result.

When you have uploaded both files, provide the file names below.

In [None]:
audio_file = r'myexample.mp3' # enter name of audio file here (MP3 or WAV)
trans_file = r'myexample.txt' # enter name of transcript here (.txt file)

import os
from os.path import join, exists
audio_path = join(demo_dir, audio_file)
trans_path = join(demo_dir, trans_file)

if not exists(audio_path):
    print(f'error: audio file does not exist: {audio_path}')
else:
    print(f'using audio file: {audio_path}')
    
if not exists(trans_path):
    print(f'error: transcription file does not exist: {trans_path}')
else:
    print(f'using transcription file: {trans_path}')

Now execute the following cell to start the pipeline. When the process has finished, you will see a link pointing to `https://ip8.tiefenauer.info`. If you are running the Jupyter Notebook server on your own machine, you have to change this part of the URL to your local loopback address (`http://localhost...`).

In [None]:
from util.e2e_util import *

url = create_demo(audio_path, trans_path)
show_link(url, 'Click here to go to your alignment')

## Summary

In this notebook the pipeline approach proposed by this project was evaluated on different combinations of audio/transcriptions. Ceiling analysis on the ASR stage was performed by using Google's STT API instead of the project's own ASR implementation. By replacing the most critical component with a state-of-the-art model for ASR, the pipeline was able to produce fairly good, although not perfect, alignments. Alignments were particularly difficult to generate if the audio contained slang or if the recording quality was bad.

Despite the pipeline being highly dependent on the quality of the ASR stage, the hypothesis [formulated at the beginning](00_Introduction.ipynb#Plan-and-Hypothesis) could be verified. The pipeline can be said to generally work provided the quality of the partial transcriptions is high enough.