# VAD stage

Segmentation is the process of splitting the audio into voiced and silent parts. Because the training data consists of aligned text, this information is available at training time: for a given corpus entry, the voiced segments and their transcriptions can easily be derived from the metadata and used for training in the ASR stage of the pipeline. However, no segmentation information will be available at test time or in production. The only things known then will be the entire audio signal and its transcription.

What is needed is a way to automatically extract the voiced parts, i.e. the speech segments, from an audio signal. These speech segments can then be fed to the trained RNN, whose potentially faulty output is aligned with the transcript of the whole recording later in the pipeline.

As stated in [the first notebook](00_introduction.ipynb), the original idea was to perform this chunking with another RNN, which would learn how to detect speech pauses. However, instead of relying on such an RNN, speech pauses can be detected with an existing VAD (_Voice Activity Detection_) algorithm. A VAD algorithm that detects speech pauses with reasonable accuracy would make it unnecessary to train and tune a custom system and would therefore save time.

This chapter will compare one state-of-the-art implementation for VAD against the segmentation information from the corpus data that was acquired through manual labelling. 

In [None]:
corpus_root = r'E:/' # define the path to where the corpus files are located!

In [None]:
%matplotlib inline
from util.corpus_util import *
from util.vad_util import *
from webrtc_comparison import *

import numpy as np

# Visualization
from IPython.display import HTML, Audio
import ipywidgets as widgets
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import librosa.display

default_figsize = (12,5)
default_facecolor = 'white'

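def show_wave(audio, sample_rate, ax=None, title=None):
    """Plot the waveform and return the axes it was drawn on."""
    if ax is None:
        # no axes given: create a new figure
        plt.figure(figsize=default_figsize, facecolor=default_facecolor)
    else:
        # draw onto the axes passed in (previously they were ignored)
        plt.sca(ax)

    # note: librosa >= 0.10 replaces waveplot with waveshow
    p = librosa.display.waveplot(audio.astype(float), sr=sample_rate)
    ax = p.axes
    ax.set_ylabel('Amplitude')
    if title:
        ax.set_title(title)
    plt.tight_layout()
    return ax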


def show_segments(ax, boundaries, ymin=0, ymax=1, color='red'):
    for i, (start_s, end_s) in enumerate(boundaries):
        rect = ax.axvspan(start_s, end_s, ymin=ymin, ymax=ymax, color=color, alpha=0.5)
        y_0, y_1 = ax.get_ylim()
        x = start_s + (end_s - start_s)/2
        y = y_0 + 0.01*(y_1-y_0) if ymin==0 else y_1 - 0.05*(y_1-y_0)
        ax.text(x, y, str(i+1), horizontalalignment='center', fontdict={'family': 'sans-serif', 'size': 15, 'color': 'white'})          

In [None]:
rl_corpus = get_corpus('rl')
ls_corpus = get_corpus('ls')

## WebRTC

[WebRTC](https://webrtc.org/) is a free, open project that provides browsers and mobile applications with Real-Time Communications (RTC) capabilities via simple APIs. The WebRTC components have been optimized to best serve this purpose. One of them is a VAD component, whose functionality has been [ported to Python by John Wiseman](https://github.com/wiseman/py-webrtcvad). It uses C code under the hood and is therefore fast. Unfortunately, its inner workings are not documented. Judging from the [source files](https://webrtc.googlesource.com/), however, I suspect that a Gaussian Mixture Model (GMM) is used to model the probability of a frame being speech or not.

Execute the cell below to compare the speech segments detected by WebRTC with the speech segments from the metadata.

In [None]:
# corpus_entry = rl_corpus['news170524']
# corpus_entry = random.choice(rl_corpus)
corpus_entry = rl_corpus[0]

audio, rate = corpus_entry.audio, corpus_entry.rate
display(Audio(data=audio, rate=rate))

# speech segment boundaries from raw data (manual segmentation)
original_boundaries = calculate_boundaries(corpus_entry)

# speech segment boundaries detected by WebRTC
webrtc_boundaries = calculate_boundaries_webrtc(corpus_entry)

# convert frames to seconds
original_boundaries = original_boundaries / rate
webrtc_boundaries = webrtc_boundaries / rate

title = f'Raw wave of {corpus_entry.audio_file}'
ax_wave = show_wave(audio, rate, title=title)
show_segments(ax_wave, original_boundaries, ymax=0.5, color='green')
show_segments(ax_wave, webrtc_boundaries, ymin=0.5, color='blue')

pause_segments_original = mpatches.Patch(color='green', alpha=0.6, label=f'original speech segments ({len(original_boundaries)})')
pause_segments_webrtc = mpatches.Patch(color='blue', alpha=0.6, label=f'speech segments detected by WebRTC ({len(webrtc_boundaries)})')
ax_wave.legend(handles=[pause_segments_original, pause_segments_webrtc], bbox_to_anchor=(0, -0.2, 1., -0.1), loc=3, mode='expand', borderaxespad=0, ncol=2)

You can also listen to speech segments detected by WebRTC:

In [None]:
voiced_segments = webrtc_voice(corpus_entry.audio, corpus_entry.rate)    
for voice in list(voiced_segments)[:10]:
    display(Audio(data=voice.audio, rate=voice.rate))
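The helper `webrtc_voice` used above lives in `util.vad_util` and is not shown in this notebook. As a rough illustration of the general idea only, the following sketch merges frame-wise speech/non-speech decisions into speech segments. All names (`split_frames`, `speech_segments`, `energy_vad`) and the parameter `min_pause_frames` are hypothetical, and the energy threshold is merely a stand-in classifier, not what WebRTC does. With py-webrtcvad installed, the decision function could instead wrap `webrtcvad.Vad(mode).is_speech(frame_bytes, sample_rate)`, which expects 16-bit mono PCM frames of 10, 20 or 30 ms.

```python
def split_frames(samples, rate, frame_ms=30):
    """Cut the signal into consecutive frames of frame_ms milliseconds (an incomplete tail is dropped)."""
    frame_len = rate * frame_ms // 1000
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def speech_segments(samples, rate, is_speech, frame_ms=30, min_pause_frames=3):
    """Return (start_sample, end_sample) tuples of voiced regions.

    A segment is only closed after `min_pause_frames` consecutive
    non-speech frames, so very short gaps do not split a segment.
    """
    frame_len = rate * frame_ms // 1000
    frames = split_frames(samples, rate, frame_ms)
    segments = []
    start = None      # first sample of the currently open segment
    silent_run = 0    # consecutive non-speech frames inside an open segment

    for idx, frame in enumerate(frames):
        if is_speech(frame):
            if start is None:
                start = idx * frame_len
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_pause_frames:
                # close the segment where the pause began
                segments.append((start, (idx - silent_run + 1) * frame_len))
                start, silent_run = None, 0

    if start is not None:
        # the signal ended inside a segment: cut off trailing silent frames
        segments.append((start, (len(frames) - silent_run) * frame_len))
    return segments


def energy_vad(threshold):
    """Stand-in classifier (assumption): a frame is speech if its mean absolute amplitude exceeds threshold."""
    def is_speech(frame):
        return sum(abs(s) for s in frame) / len(frame) > threshold
    return is_speech


# toy signal: silence, speech, pause, speech, silence (sample rate 1000 Hz, 10 ms frames)
signal = [0] * 50 + [1000] * 100 + [0] * 50 + [1000] * 50 + [0] * 20
print(speech_segments(signal, 1000, energy_vad(100), frame_ms=10))
```

Dividing the resulting sample indices by the sample rate yields boundaries in seconds, as is done for the plot above.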

## WebRTC vs. manual segmentation

By comparing the speech segments produced by WebRTC with the manually defined speech segments, we can now quantify how well the two segmentations coincide. To do this, we compare the following metrics:

* **Precision**: Percentage of audio frames classified as "speech" by WebRTC that were also classified as "speech" by a human
* **Recall**: Percentage of manually classified "speech" frames that were also detected by WebRTC
* **Difference**: Difference between the number of speech segments detected by WebRTC and the number produced by manual segmentation. A negative value means WebRTC detected fewer speech segments. A positive value means WebRTC detected more speech segments. A value of zero means both methods produced the same number of (but not necessarily the same) speech segments.

These metrics can be calculated for a single corpus entry or for the whole corpus. Precision and Recall can be further combined into a single value by calculating their **F-Score**:

$$ F = 2 \cdot \frac{P \cdot R}{P+R} $$
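The notebook's own `precision_recall` helper is imported and not shown here. The following is only a minimal sketch of how the metrics defined above could be computed on frame level, assuming that both segmentations are given as lists of `(start, end)` frame indices with an exclusive end. The function names and the toy segments are hypothetical.

```python
def frame_mask(segments, n_frames):
    """Mark every frame covered by a (start, end) segment as speech (end exclusive)."""
    mask = [False] * n_frames
    for start, end in segments:
        for i in range(max(start, 0), min(end, n_frames)):
            mask[i] = True
    return mask


def segmentation_metrics(manual, detected, n_frames):
    """Return (precision, recall, f_score, difference) of `detected` against `manual`."""
    ref = frame_mask(manual, n_frames)
    hyp = frame_mask(detected, n_frames)

    true_pos = sum(r and h for r, h in zip(ref, hyp))  # speech in both segmentations
    n_hyp = sum(hyp)                                   # frames labelled speech automatically
    n_ref = sum(ref)                                   # frames labelled speech manually

    precision = true_pos / n_hyp if n_hyp else 0.0
    recall = true_pos / n_ref if n_ref else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    difference = len(detected) - len(manual)           # negative: fewer segments detected
    return precision, recall, f_score, difference


# toy example: two manual segments, one detected segment spanning part of the pause
manual = [(0, 40), (60, 100)]
detected = [(0, 50)]
p, r, f, diff = segmentation_metrics(manual, detected, n_frames=100)
print(f'precision: {p:.3f}, recall: {r:.3f}, F-score: {f:.3f}, difference: {diff}')
# precision: 0.800, recall: 0.500, F-score: 0.615, difference: -1
```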

The first two metrics have to be taken with a grain of salt though, because they depend on the definition of a speech pause, which is highly subjective. WebRTC provides a parameter which controls the "aggressiveness" of speech detection (integer values between 0 and 3). According to the py-webrtcvad documentation, a higher value means that non-speech is filtered out more aggressively, so a frame is less likely to be classified as "speech". More pauses are detected as a result, which tends to split the audio into more (and shorter) speech segments.

In [None]:
for aggressiveness in 0, 1, 2, 3:
    print(f'measuring precision/recall for WebRTC-VAD with aggressiveness={aggressiveness}')
    p, r, f, n_orig, n_webrtc = precision_recall(corpus_entry, aggressiveness)
    print(f'# speech segments (manual): {n_orig}')
    print(f'# speech segments (WebRTC): {n_webrtc}')
    print(f'precision: {p:.3f}, recall: {r:.3f}, F-score: {f:.3f}')
    print(f'------------------------------------------------------------------------------')

We can further examine to what degree the segmentation produced by WebRTC overlaps with the manual segmentation for a whole corpus. We do this by iterating over the whole corpus and performing the above calculations for each corpus entry. The results for precision and recall can be averaged to get an idea of how well WebRTC generally performs. The results for the difference must be inspected more closely, because negative and positive values might cancel each other out. This could yield an overall difference of zero, which would be misleading, since we are interested in how far the number of produced speech segments deviates on average. We therefore differentiate three values for the difference:

* **Negative Difference**: Average difference between the number of speech segments produced by WebRTC and the number of manually defined speech segments for cases where WebRTC produced **fewer** speech segments than a human.
* **Positive Difference**: Average difference between the number of speech segments produced by WebRTC and the number of manually defined speech segments for cases where WebRTC produced **more** speech segments than a human.
* **Average Difference**: Average absolute difference between the number of speech segments produced by WebRTC and the number of manually defined speech segments. **All** corpus entries were considered.
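The three values can be sketched as follows. The function `difference_stats` and the example differences are hypothetical and only illustrate why a plain average would be misleading. The actual aggregation happens in `create_corpus_stats`, which is not shown in this notebook.

```python
def mean(values):
    """Arithmetic mean, 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


def difference_stats(differences):
    """Aggregate per-entry differences (#segments WebRTC minus #segments manual)."""
    negatives = [d for d in differences if d < 0]   # WebRTC produced fewer segments
    positives = [d for d in differences if d > 0]   # WebRTC produced more segments
    return {
        'avg_negative': mean(negatives),
        'avg_positive': mean(positives),
        'avg_absolute': mean([abs(d) for d in differences]),  # all entries considered
    }


# made-up differences for five corpus entries
differences = [-4, -2, 3, 0, 3]
print(mean(differences))              # 0.0: positive and negative values cancel out
print(difference_stats(differences))
# {'avg_negative': -3.0, 'avg_positive': 3.0, 'avg_absolute': 2.4}
```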

In [None]:
title = f'Comparison of automatic/manual VAD for {rl_corpus.name} corpus'
rl_stats = create_corpus_stats(rl_corpus)
plot_stats(rl_stats, title)

In [None]:
title = f'Comparison of automatic/manual VAD for {ls_corpus.name} corpus'
ls_stats = create_corpus_stats(ls_corpus)
plot_stats(ls_stats, title)

### Results and interpretation

The cells above calculate the metrics for both the _ReadyLingua_ and the _LibriSpeech_ corpus. They do so by splitting the audio signal of each corpus entry into speech parts with WebRTC, using every possible value for the aggressiveness. Average precision, average recall and the differences in the number of speech segments can then be calculated by comparing the results with the manually defined speech segments.

Since this process can take a lot of time (especially for the LibriSpeech corpus, which contains roughly 1000 hours of audio), the values have been calculated beforehand. The following charts show how they develop with varying values for the aggressiveness. The values of each metric are listed in the tables further below, with the best value highlighted. The value for Recall has to be taken with a pinch of salt though: a high value can easily be achieved by treating the whole audio signal as a single speech segment. In addition, the classes are heavily imbalanced: most audio frames contain speech, with only a few non-speech frames between the speech parts.
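To make the caveat about Recall concrete, the following sketch computes the scores of a trivial "detector" that labels every frame as speech. The speech ratios used are made-up values for illustration, not measurements from the corpora, and `all_speech_baseline` is a hypothetical helper.

```python
def all_speech_baseline(speech_ratio):
    """Scores of a 'detector' that treats the whole signal as one speech segment.

    speech_ratio: share of frames that really contain speech (0 < ratio <= 1).
    """
    precision = speech_ratio   # only the truly voiced share of the frames is correct
    recall = 1.0               # no voiced frame is ever missed
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


for ratio in (0.5, 0.8, 0.9):
    p, r, f = all_speech_baseline(ratio)
    print(f'speech ratio {ratio:.1f}: precision={p:.3f}, recall={r:.3f}, F-score={f:.3f}')
# speech ratio 0.5: precision=0.500, recall=1.000, F-score=0.667
# speech ratio 0.8: precision=0.800, recall=1.000, F-score=0.889
# speech ratio 0.9: precision=0.900, recall=1.000, F-score=0.947
```

The more the frames are skewed towards speech, the better this baseline scores, so precision and the number of detected segments should be considered alongside Recall.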

#### ReadyLingua corpus

![WebRTC VAD vs. manual speech segmentation for ReadyLingua corpus](../assets/webrtc_vs_manual_rl.png)

There is a clear trend towards higher precision and recall with increasing aggressiveness. The best F-score (.876) is reached with WebRTC-VAD set to its highest aggressiveness, corresponding to a precision of .846.

The plot also shows that the average difference in the number of speech segments decreases with increasing aggressiveness. For aggressiveness values 0 to 2, WebRTC generally produces fewer speech segments than a human: the average negative difference is considerably larger in magnitude than the average positive difference. Only at the highest aggressiveness are the two roughly balanced.

#### LibriSpeech corpus

![WebRTC VAD vs. manual speech segmentation for LibriSpeech corpus](../assets/webrtc_vs_manual_ls.png)

The F-score decreases slightly with increasing aggressiveness. This is a direct result of precision remaining virtually constant (about .83) while recall drops from .967 to .950. However, both precision and F-score stay between roughly 83% and 89%, which are rather high values. At its highest aggressiveness, WebRTC produces significantly more speech segments than with the next lower value.

#### Avg. Precision
<table>
  <tr>
    <th>Corpus</th>
    <th colspan="4">Aggressiveness</th>
  </tr>
  <tr>
    <th></th>
    <th>0</th>
    <th>1</th>
    <th>2</th>
    <th>3</th>
  </tr>
  <tr>
    <td>ReadyLingua</td>
    <td>.768</td>
    <td>.787</td>
    <td>.816</td>
    <td style="background-color: lightgreen;">.846</td>
  </tr>
  <tr>
    <td>LibriSpeech</td>
    <td>.830</td>
    <td>.831</td>
    <td>.831</td>
    <td style="background-color: lightgreen;">.831</td>
  </tr>
</table>

#### Avg. Recall
<table>
  <tr>
    <th>Corpus</th>
    <th colspan="4">Aggressiveness</th>
  </tr>
  <tr>
    <th></th>
    <th>0</th>
    <th>1</th>
    <th>2</th>
    <th>3</th>
  </tr>
  <tr>
    <td>ReadyLingua</td>
    <td>.877</td>
    <td>.892</td>
    <td>.908</td>
    <td style="background-color: lightgreen;">.932</td>
  </tr>
  <tr>
    <td>LibriSpeech</td>
    <td style="background-color: lightgreen;">.967</td>
    <td>.967</td>
    <td>.964</td>
    <td>.950</td>    
  </tr>
</table>

#### F-Score
<table>
  <tr>
    <th>Corpus</th>
    <th colspan="4">Aggressiveness</th>
  </tr>
  <tr>
    <th></th>
    <th>0</th>
    <th>1</th>
    <th>2</th>
    <th>3</th>
  </tr>
  <tr>
    <td>ReadyLingua</td>
    <td>.808</td>
    <td>.824</td>
    <td>.847</td>
    <td style="background-color: lightgreen;">.876</td>
  </tr>
  <tr>
    <td>LibriSpeech</td>
    <td>.886</td>
    <td style="background-color: lightgreen;">.886</td>
    <td>.885</td>
    <td>.879</td>
  </tr>
</table>

#### Differences in number of speech segments
<table>
  <tr>
    <th>Corpus</th>
    <th colspan="4">Avg. Difference</th>
    <th colspan="4">Avg. Difference (neg)</th>
    <th colspan="4">Avg. Difference (pos)</th>
  </tr>
  <tr>
    <th></th>
    <th>0</th>
    <th>1</th>
    <th>2</th>
    <th>3</th>
    <th>0</th>
    <th>1</th>
    <th>2</th>
    <th>3</th>
    <th>0</th>
    <th>1</th>
    <th>2</th>
    <th>3</th>
  </tr>
  <tr>
    <td>ReadyLingua</td>
    <td>30.204</td>
    <td>29.359</td>
    <td>20.081</td>
    <td style="background-color: lightgreen;">16.645</td>
    <td>-34.756</td>
    <td>-34.312</td>
    <td>-27.068</td>
    <td style="background-color: lightgreen;">-15.677</td>
    <td>6.270</td>
    <td style="background-color: lightgreen;">6.211</td>
    <td>10.052</td>
    <td>17.330</td>
  </tr>
  <tr>
    <td>LibriSpeech</td>
    <td>23.708</td>
    <td style="background-color: lightgreen;">23.268</td>
    <td>24.779</td>
    <td>43.467</td>
    <td>-25.317</td>
    <td>-24.083</td>
    <td>-20.4</td>
    <td style="background-color: lightgreen;">-12.870</td>
    <td style="background-color: lightgreen;">21.538</td>
    <td>22.774</td>
    <td>28.850</td>
    <td>46.203</td>
  </tr>
</table>

## Summary

In this notebook, the state-of-the-art automatic VAD system from WebRTC was compared against manually defined segmentations from different sources. Even though its inner workings remain unclear, the automatically detected speech segments showed a very good perceived quality for randomly selected samples.

The subjectively assessed high similarity to manual segmentation could be verified by measuring precision and recall for the corpora used in this project. With a suitable aggressiveness, average precision reached about 83 to 85% and average recall about 93 to 97%. Given the highly subjective nature of speech segmentation, this is a very good result, which makes WebRTC a valid candidate for the VAD stage of the pipeline.

The performance of WebRTC-VAD can be considered very good, yielding results close to human performance when set to the highest aggressiveness. The conclusion is to leave the aggressiveness of WebRTC-VAD at its highest setting (`3`).