# VAD stage

Segmentation information describes when in an audio signal someone is speaking. Because the training data consists of aligned text, this information is given at training time: for a given corpus entry, the speech segments and their transcriptions can easily be derived from the metadata and used for training in the ASR-stage of the pipeline. However, no segmentation information will be available at test time or in production. The only things known then will be the entire audio signal and its transcription.

What is needed is the audio signal split into chunks, i.e. speech segments. These speech segments can then be fed to the trained RNN, which outputs a potentially faulty transcript that can be aligned with the transcript of the whole recording in the LSA-stage.

As stated in [the first notebook](00_introduction.ipynb), the original idea was to perform this chunking with another RNN that would learn to detect speech pauses. However, instead of relying on such an RNN, we could try detecting speech pauses with a VAD (_Voice Activity Detection_) algorithm. A VAD algorithm that detects speech pauses with reasonable accuracy would free us from the task of detecting them ourselves (e.g. by training an RNN).

This chapter will compare one state-of-the-art implementation for VAD against the segmentation information from the corpus data that was acquired through manual labelling. 

In [None]:
corpus_root = r'E:/' # define the path to where the corpus files are located!

In [None]:
import os

from util.corpus_util import *
from util.webrtc_util import *

import numpy as np

# Visualization
from IPython.display import HTML, Audio
import ipywidgets as widgets
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import librosa.display

rl_corpus_root = os.path.join(corpus_root, 'readylingua-corpus')
ls_corpus_root = os.path.join(corpus_root, 'librispeech-corpus')

default_figsize = (12,5)
default_facecolor = 'white'

def show_wave(audio, sample_rate, ax=None, title=None):
    if ax is None:
        # no axes given: draw into a new figure
        plt.figure(figsize=default_figsize, facecolor=default_facecolor)
    else:
        # make the given axes current so the waveform is drawn onto them
        plt.sca(ax)

    p = librosa.display.waveplot(audio.astype(float), sr=sample_rate)
    ax = p.axes
    ax.set_ylabel('Amplitude')
    if title:
        ax.set_title(title)
    plt.tight_layout()
    return ax


def show_segments(ax, boundaries, ymin=0, ymax=1, color='red'):
    for i, (start_frame, end_frame) in enumerate(boundaries):
        rect = ax.axvspan(start_frame, end_frame, ymin=ymin, ymax=ymax, color=color, alpha=0.5)
        y_0, y_1 = ax.get_ylim()
        x = start_frame + (end_frame - start_frame)/2
        y = y_0 + 0.01*(y_1-y_0) if ymin==0 else y_1 - 0.05*(y_1-y_0)
        ax.text(x, y, str(i+1), horizontalalignment='center', fontdict={'family': 'sans-serif', 'size': 15, 'color': 'white'})          

In [None]:
rl_corpus = load_corpus(rl_corpus_root)
ls_corpus = load_corpus(ls_corpus_root)

## WebRTC

[WebRTC](https://webrtc.org/) is a free, open project that provides browsers and mobile applications with Real-Time Communications (RTC) capabilities via simple APIs. The WebRTC components have been optimized to best serve this purpose. One of them is a VAD component, whose functionality has been [ported to Python by John Wiseman](https://github.com/wiseman/py-webrtcvad). It uses C code under the hood and is therefore fast. Unfortunately, no documentation about its inner workings is available. Judging from the [source files](https://webrtc.googlesource.com/), however, I suspect that a Gaussian Mixture Model (GMM) is used to model the probability of a frame being speech or not.
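
For orientation, the following sketch shows how the py-webrtcvad package can be called directly. It assumes 16-bit mono PCM audio at one of the sample rates the package accepts (8000, 16000, 32000 or 48000 Hz) and frames of 10, 20 or 30 ms. The helper names `split_into_frames` and `classify_frames` are made up for this illustration and are not part of `util.webrtc_util`, whose implementation is not shown in this notebook.

```python
def split_into_frames(pcm_bytes, sample_rate, frame_ms=30):
    """Split 16-bit mono PCM bytes into consecutive frames of frame_ms milliseconds."""
    if frame_ms not in (10, 20, 30):
        raise ValueError('WebRTC VAD only accepts frames of 10, 20 or 30 ms')
    # 2 bytes per 16-bit sample
    bytes_per_frame = int(sample_rate * frame_ms / 1000) * 2
    # an incomplete trailing frame is dropped
    n_frames = len(pcm_bytes) // bytes_per_frame
    return [pcm_bytes[i * bytes_per_frame:(i + 1) * bytes_per_frame]
            for i in range(n_frames)]


def classify_frames(pcm_bytes, sample_rate, aggressiveness=3, frame_ms=30):
    """Return one boolean per frame: True if WebRTC classifies the frame as speech."""
    import webrtcvad  # third-party package: pip install webrtcvad

    vad = webrtcvad.Vad(aggressiveness)
    frames = split_into_frames(pcm_bytes, sample_rate, frame_ms)
    return [vad.is_speech(frame, sample_rate) for frame in frames]
```

Splitting is kept separate from classification so that the framing logic can be checked without the C extension being installed.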

Execute the cell below to compare the speech segments detected by WebRTC with the speech segments from the metadata.

In [None]:
corpus_entry = rl_corpus['news170524']
# corpus_entry = random.choice(rl_corpus)
# corpus_entry = rl_corpus[0]

audio, rate = corpus_entry.audio, corpus_entry.rate
display(Audio(data=audio, rate=rate))

# speech segment boundaries from raw data
original_boundaries = calculate_boundaries(corpus_entry.speech_segments)
# divide by the sample rate so both sets of boundaries share the time axis of the plot
original_boundaries = original_boundaries / rate

# speech segment boundaries from WebRTC
webrtc_boundaries, voiced_segments = calculate_boundaries_webrtc(corpus_entry)

title = f'Raw wave of {corpus_entry.audio_file}'
ax_wave = show_wave(audio, rate, title=title)
show_segments(ax_wave, original_boundaries, ymax=0.5, color='green')
show_segments(ax_wave, webrtc_boundaries, ymin=0.5, color='blue')

pause_segments_original = mpatches.Patch(color='green', alpha=0.6, label=f'original speech segments ({len(original_boundaries)})')
pause_segments_webrtc = mpatches.Patch(color='blue', alpha=0.6, label=f'speech segments detected by WebRTC ({len(webrtc_boundaries)})')
ax_wave.legend(handles=[pause_segments_original, pause_segments_webrtc], bbox_to_anchor=(0, -0.2, 1., -0.1), loc=3, mode='expand', borderaxespad=0, ncol=2)
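
The helper `calculate_boundaries_webrtc` lives in `util.webrtc_util` and is not shown in this notebook. As a rough idea of what such a helper has to do, the following sketch merges per-frame speech decisions into segment boundaries in seconds. The function name `frames_to_boundaries` and the fixed frame duration are assumptions made for this illustration. The real helper may additionally smooth the decisions, as the example script shipped with py-webrtcvad does with a sliding window.

```python
def frames_to_boundaries(is_speech, frame_ms=30):
    """Merge consecutive speech frames into (start, end) boundaries in seconds."""
    frame_s = frame_ms / 1000
    boundaries = []
    start = None  # index of the first frame of the currently open segment
    for i, voiced in enumerate(is_speech):
        if voiced and start is None:
            # a new speech segment begins
            start = i
        elif not voiced and start is not None:
            # the open speech segment ends right before this frame
            boundaries.append((start * frame_s, i * frame_s))
            start = None
    if start is not None:
        # the audio ended in the middle of a speech segment
        boundaries.append((start * frame_s, len(is_speech) * frame_s))
    return boundaries
```

Without smoothing, a single misclassified frame splits a segment in two, which is one reason why the number of segments is sensitive to the aggressiveness setting.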

You can also listen to speech segments detected by WebRTC:

In [None]:
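import itertools

def play_webrtc_sample(webrtc_sample):
    # join the audio of all frames in the segment and play it at the corpus entry's sample rate
    audio = np.concatenate([frame.audio for frame in webrtc_sample])
    display(Audio(data=audio, rate=rate))

# play at most the first 10 segments (fewer if WebRTC detected fewer)
for sample in itertools.islice(voiced_segments, 10):
    play_webrtc_sample(sample)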

## WebRTC vs. manual segmentation

By comparing the speech segments produced by WebRTC with the manually defined speech segments we can now calculate how much the speech pauses detected by WebRTC coincide with the speech pauses from raw data. To do this we can compare different metrics of the two results:

* **Precision**: Percentage of audio frames classified as "speech" by WebRTC that were also classified as "speech" by a human
* **Recall**: Percentage of manually classified "speech" frames that were also detected by WebRTC
* **Difference**: Difference between the number of speech segments detected by WebRTC and by manual segmentation. A negative value means WebRTC detected fewer speech segments. A positive value means WebRTC detected more speech segments. A value of zero means both methods produced the same number of (but not necessarily the same) speech segments.

These metrics can be calculated for a single corpus entry or for the whole corpus. Precision and Recall can be further combined into a single value by calculating their **F-Score**:

$$ F = 2 \cdot \frac{P \cdot R}{P+R} $$
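
To make the definitions concrete, here is a minimal sketch that computes precision, recall and F-score from two frame-level masks of equal length, where `True` marks a frame as speech. It is not the implementation of `precision_recall` from `util.webrtc_util` (which is not shown here); the function name and the mask representation are assumptions.

```python
def precision_recall_f(predicted, reference):
    """Frame-level precision, recall and F-score of predicted vs. reference speech."""
    if len(predicted) != len(reference):
        raise ValueError('both masks must cover the same frames')
    # frames marked as speech by both WebRTC (predicted) and the human (reference)
    true_positives = sum(1 for p, r in zip(predicted, reference) if p and r)
    n_predicted = sum(1 for p in predicted if p)
    n_reference = sum(1 for r in reference if r)
    precision = true_positives / n_predicted if n_predicted else 0.0
    recall = true_positives / n_reference if n_reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# WebRTC marks four frames as speech, the human only two of them
p, r, f = precision_recall_f([True, True, True, True, False],
                             [False, True, True, False, False])
print(f'precision: {p:.3f}, recall: {r:.3f}, F-score: {f:.3f}')
```

In this toy example every manually labelled speech frame is found (recall 1.0), but only half of the frames WebRTC marks as speech are confirmed by the human (precision 0.5).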

The first two metrics have to be taken with a grain of salt, though, because they depend on the definition of a speech pause, which is highly subjective. WebRTC provides a parameter that controls the "aggressiveness" of the VAD (integer values between 0 and 3). A higher value means the VAD filters out non-speech more aggressively, so a frame is less likely to be classified as "speech". As a consequence more pauses are detected, which tends to split the audio into more speech segments.

In [None]:
for aggressiveness in 0, 1, 2, 3:
    print(f'measuring precision/recall for WebRTC-VAD with aggressiveness={aggressiveness}')
    p, r, f, d = precision_recall(corpus_entry, aggressiveness)
    print(f'precision: {p:.3f}, recall: {r:.3f}, F-score: {f:.3f}, difference: {d:.3f}')

We can further examine to what degree the speech pauses detected by WebRTC overlap with the speech pauses from the raw data for a whole corpus. We do this by iterating over the whole corpus and performing the above calculations for each corpus entry. The results for precision and recall can be averaged to get an idea of how well WebRTC performs in general. The results for the difference must be inspected more closely, because negative and positive values can cancel each other out. This could yield an overall difference of zero, which would be misleading since we are interested in how far the number of produced speech segments deviates on average. We therefore differentiate three values for the difference:

* **Negative Difference**: Average difference between the number of speech segments produced by WebRTC and the number of manually defined speech segments. Only corpus entries where WebRTC produced **fewer** speech segments than a human are considered.
* **Positive Difference**: Average difference between the number of speech segments produced by WebRTC and the number of manually defined speech segments. Only corpus entries where WebRTC produced **more** speech segments than a human are considered.
* **Average Difference**: Average difference between the number of speech segments produced by WebRTC and the number of manually defined speech segments. **All** corpus entries are considered. A negative value means WebRTC generally produced fewer speech segments than a human would. A positive value means WebRTC produced more speech segments than a human. A value of zero means WebRTC produced exactly the same number of speech segments **or the positive and negative differences cancelled each other out**.
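
The three values can be sketched as follows, assuming the per-entry differences (WebRTC count minus manual count) have already been collected in a list. The function name `difference_stats` is hypothetical, and the statistics behind `create_corpus_stats` may be computed differently.

```python
def difference_stats(differences):
    """Average negative, positive and overall difference in segment counts."""
    def mean(values):
        # an empty selection yields 0.0 instead of a division by zero
        return sum(values) / len(values) if values else 0.0

    negative = [d for d in differences if d < 0]  # WebRTC produced fewer segments
    positive = [d for d in differences if d > 0]  # WebRTC produced more segments
    return mean(negative), mean(positive), mean(differences)


# one entry with 4 segments too few, one with 4 too many, one exact match
avg_neg, avg_pos, avg_all = difference_stats([-4, 4, 0])
print(avg_neg, avg_pos, avg_all)
```

Here the overall average is zero although only one of the three entries matches the manual segment count, which is exactly the cancellation effect that the separate negative and positive averages expose.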

In [None]:
title = f'Comparison of automatic/manual VAD for {rl_corpus.name} corpus'
plot_stats(create_corpus_stats(rl_corpus), title=title)

In [None]:
title = f'Comparison of automatic/manual VAD for {ls_corpus.name} corpus'
plot_stats(create_corpus_stats(ls_corpus), title=title)

### Results and interpretation

The cells above compare the manual and automatic segmentation by calculating the average precision, the average recall and the average difference in the number of speech segments created. The comparison is made for each corpus and for all levels of aggressiveness. Since this process takes some time, the following figures and tables show the results of a previous run. The best results are marked green.

#### Avg. Precision
<table>
  <tr>
    <th>Corpus</th>
    <th colspan="4">Aggressiveness</th>
  </tr>
  <tr>
    <th></th>
    <th>0</th>
    <th>1</th>
    <th>2</th>
    <th>3</th>
  </tr>
  <tr>
    <td>ReadyLingua</td>
    <td>.849</td>
    <td>.850</td>
    <td>.873</td>
    <td style="background-color: lightgreen;">.901</td>
  </tr>
  <tr>
    <td>LibriSpeech</td>
    <td>tbd</td>
    <td>tbd</td>
    <td>tbd</td>
    <td>tbd</td>
  </tr>
</table>

#### Avg. Recall
<table>
  <tr>
    <th>Corpus</th>
    <th colspan="4">Aggressiveness</th>
  </tr>
  <tr>
    <th></th>
    <th>0</th>
    <th>1</th>
    <th>2</th>
    <th>3</th>
  </tr>
  <tr>
    <td>ReadyLingua</td>
    <td style="background-color: lightgreen;">.988</td>
    <td>.987</td>
    <td>.982</td>
    <td>.970</td>
  </tr>
  <tr>
    <td>LibriSpeech</td>
    <td>tbd</td>
    <td>tbd</td>
    <td>tbd</td>
    <td>tbd</td>    
  </tr>
</table>

#### F-Score
<table>
  <tr>
    <th>Corpus</th>
    <th colspan="4">Aggressiveness</th>
  </tr>
  <tr>
    <th></th>
    <th>0</th>
    <th>1</th>
    <th>2</th>
    <th>3</th>
  </tr>
  <tr>
    <td>ReadyLingua</td>
    <td>.910</td>
    <td>.911</td>
    <td>.422</td>
    <td style="background-color: lightgreen;">.931</td>
  </tr>
  <tr>
    <td>LibriSpeech</td>
    <td>tbd</td>
    <td>tbd</td>
    <td>tbd</td>
    <td>tbd</td>
  </tr>
</table>

#### Differences in number of speech segments
<table>
  <tr>
    <th>Corpus</th>
    <th colspan="4">Avg. Difference</th>
    <th colspan="4">Avg. Difference (neg)</th>
    <th colspan="4">Avg. Difference (pos)</th>
  </tr>
  <tr>
    <th></th>
    <th>0</th>
    <th>1</th>
    <th>2</th>
    <th>3</th>
    <th>0</th>
    <th>1</th>
    <th>2</th>
    <th>3</th>
    <th>0</th>
    <th>1</th>
    <th>2</th>
    <th>3</th>
  </tr>
  <tr>
    <td>ReadyLingua</td>
    <td>30.204</td>
    <td>29.359</td>
    <td>20.081</td>
    <td style="background-color: lightgreen;">16.645</td>
    <td>-34.756</td>
    <td>-34.312</td>
    <td>-27.068</td>
    <td style="background-color: lightgreen;">-15.677</td>
    <td>6.270</td>
    <td  style="background-color: lightgreen;">6.211</td>
    <td>10.052</td>
    <td>17.330</td>
  </tr>
  <tr>
    <td>LibriSpeech</td>
    <td>tbd</td>
    <td>tbd</td>
    <td>tbd</td>
    <td>tbd</td>
    <td>tbd</td>
    <td>tbd</td>
    <td>tbd</td>
    <td>tbd</td>
    <td>tbd</td>
    <td>tbd</td>
    <td>tbd</td>
    <td>tbd</td>
  </tr>
</table>

#### ReadyLingua corpus

The following plot visualizes the results for the ReadyLingua corpus. We can clearly observe that precision increases with increasing aggressiveness (from .849 to .901). At the same time recall decreases, but not at the same rate (from .988 to .970). At its highest aggressiveness setting WebRTC is able to detect speech segments with an F-Score of 0.931, which corresponds to values for Precision and Recall of over 90%.

The average difference in the number of speech segments approaches zero with increasing aggressiveness. The positive and negative differences show that WebRTC produces more speech segments as aggressiveness increases. For corpus entries where WebRTC produces more speech segments than a human, the difference is only about +6 at the two lowest settings, which is marginal, and grows to about +17 at the highest setting. For corpus entries where WebRTC produces fewer speech segments than a human, the difference is larger in magnitude: about -35 at the lowest setting, shrinking to about -16 at the highest.

Generally speaking, the performance of WebRTC-VAD can be considered very good, yielding results close to human performance when set to the highest aggressiveness. The conclusion is to leave the aggressiveness of WebRTC-VAD at its highest setting (`3`).

![WebRTC VAD vs. manual speech segmentation](../assets/webrtc_vs_manual_rl.png)

#### LibriSpeech corpus

tbd.

## Conclusion

In this chapter the state-of-the-art automatic VAD system from WebRTC was compared against manually defined segmentations from different sources. Even though its inner workings remain unclear, the automatically detected speech segments were very similar to the manual segmentation, and the perceived quality of randomly selected samples was very good.

These findings were confirmed by measuring precision and recall over a whole corpus. For the ReadyLingua corpus both were above 90% at the highest aggressiveness setting, while the results for the LibriSpeech corpus are still pending. Given the highly subjective nature of speech segmentation, this is a very good result, which makes WebRTC a valid candidate for the VAD-stage of the pipeline.