# Voice Activity Detection

The plots above show the segmentation of the audio signal into speech and pause segments, using the ground truth derived from the metadata provided with the raw data. Instead of relying on such metadata being present, we could detect speech pauses automatically with a Voice Activity Detection (VAD) algorithm. A VAD algorithm that detects speech pauses with reasonable accuracy would free us from having to detect them ourselves (e.g. by training an RNN).

## WebRTC

[WebRTC](https://webrtc.org/) is a free, open project that provides browsers and mobile applications with Real-Time Communications (RTC) capabilities via simple APIs. The WebRTC components have been optimized to best serve this purpose. There is also a VAD component, whose functionality has been [ported to Python by John Wiseman](https://github.com/wiseman/py-webrtcvad). It uses C code under the hood and is therefore very performant.
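
As a rough illustration of how a frame-based VAD yields segments: py-webrtcvad classifies short frames of 16-bit mono PCM audio (10, 20 or 30 ms long) as speech or non-speech via `Vad.is_speech(frame, sample_rate)`. The sketch below splits raw PCM bytes into such frames and merges consecutive speech frames into time boundaries. The helper names `frame_generator` and `voiced_boundaries` and the toy classifier are assumptions made for this example; they are not the `split_segments` helper used in this notebook.

```python
from typing import Callable, Iterator, List, Tuple


def frame_generator(pcm: bytes, rate: int, frame_ms: int = 30) -> Iterator[Tuple[float, bytes]]:
    """Yield (start_time_in_seconds, frame_bytes) for 16-bit mono PCM audio."""
    bytes_per_frame = int(rate * frame_ms / 1000) * 2  # 2 bytes per 16-bit sample
    duration = frame_ms / 1000
    offsets = range(0, len(pcm) - bytes_per_frame + 1, bytes_per_frame)
    for i, offset in enumerate(offsets):
        yield i * duration, pcm[offset:offset + bytes_per_frame]


def voiced_boundaries(pcm: bytes, rate: int, is_speech: Callable[[bytes], bool],
                      frame_ms: int = 30) -> List[Tuple[float, float]]:
    """Merge consecutive speech frames into (start, end) boundaries in seconds."""
    boundaries = []
    start = None
    end = 0.0
    for t, frame in frame_generator(pcm, rate, frame_ms):
        if is_speech(frame):
            if start is None:
                start = t
            end = t + frame_ms / 1000
        elif start is not None:
            boundaries.append((start, end))
            start = None
    if start is not None:
        boundaries.append((start, end))
    return boundaries


# Toy stand-in classifier (an assumption for this example): a frame counts as
# speech if it contains any non-zero byte. With py-webrtcvad one would pass e.g.
#     vad = webrtcvad.Vad(3); is_speech = lambda frame: vad.is_speech(frame, rate)
rate = 16000
silence = bytes(960)       # 30 ms of silence at 16 kHz, 16-bit mono
tone = b'\x10\x00' * 480   # 30 ms of a constant non-zero signal
pcm = silence + tone * 2 + silence + tone
segments = voiced_boundaries(pcm, rate, lambda frame: any(frame))
print(segments)
```

The classifier is passed in as a callable so that the grouping logic can be tried out without the WebRTC dependency.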

Execute the cell below to compare the speech segments detected by WebRTC with the speech segments from the metadata.

In [None]:
def calculate_boundaries_webrtc(corpus_entry, aggressiveness=3):
    voiced_segments, _ = split_segments(corpus_entry, aggressiveness=aggressiveness)
    boundaries = []
    for frames in voiced_segments:
        start_time = frames[0].timestamp
        end_time = (frames[-1].timestamp + frames[-1].duration)
        boundaries.append((start_time, end_time))
    return 2*np.array(boundaries), voiced_segments

# corpus_entry = random.choice(rl_corpus)
corpus_entry = rl_corpus['news170524']
# corpus_entry = rl_corpus[0]

audio, rate = corpus_entry.audio, corpus_entry.rate
display(Audio(data=audio, rate=rate))

# speech segment boundaries from raw data (converted from samples to seconds)
original_boundaries = calculate_boundaries(corpus_entry.speech_segments)
original_boundaries = original_boundaries / rate

# speech segment boundaries from WebRTC
webrtc_boundaries, voiced_segments = calculate_boundaries_webrtc(corpus_entry)

title = f'Raw wave of {corpus_entry.audio_file}'
ax_wave = show_wave(audio, rate, title=title)
show_segments(ax_wave, original_boundaries, ymax=0.5, color='green')
show_segments(ax_wave, webrtc_boundaries, ymin=0.5, color='blue')

pause_segments_original = mpatches.Patch(color='green', alpha=0.6, label=f'original speech segments ({len(original_boundaries)})')
pause_segments_webrtc = mpatches.Patch(color='blue', alpha=0.6, label=f'speech segments detected by WebRTC ({len(webrtc_boundaries)})')
ax_wave.legend(handles=[pause_segments_original, pause_segments_webrtc], bbox_to_anchor=(0, -0.2, 1., -0.1), loc=3, mode='expand', borderaxespad=0, ncol=2)

You can also listen to speech segments detected by WebRTC:

In [1]:
import itertools

def play_webrtc_sample(webrtc_sample):
    # join the audio of all frames of one voiced segment and play it
    audio = np.concatenate([frame.audio for frame in webrtc_sample])
    display(Audio(data=audio, rate=rate))

# play at most the first ten voiced segments
for sample in itertools.islice(voiced_segments, 10):
    play_webrtc_sample(sample)


## WebRTC vs. manual segmentation

We can calculate how well the speech pauses detected automatically by WebRTC match the manually defined speech pauses from the raw data. To do this, we compare the two results using the following metrics:

* **Precision**: Percentage of audio frames classified as "speech" by WebRTC that were also manually classified as "speech"
* **Recall**: Percentage of manually classified "speech" frames that were also detected by WebRTC
* **Difference**: Difference between the number of speech segments detected by WebRTC and the number produced by manual segmentation. A negative value means WebRTC detected fewer speech segments. A positive value means WebRTC detected more speech segments. A value of zero means both methods produced the same number of (but not necessarily the same) speech segments.
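
The three metrics can be sketched for two lists of segment boundaries as follows. This is a simplified illustration, not the implementation used in the cell further below: it assumes half-open `(start, end)` intervals measured in samples that do not overlap within a list, and the function names and sample values are made up for the example.

```python
def overlap(a, b):
    """Length of the overlap between two half-open intervals (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def total_length(intervals):
    """Total number of samples covered by a list of non-overlapping intervals."""
    return sum(end - start for start, end in intervals)


def compare_segmentations(manual, detected):
    """Return (precision, recall, difference) for two lists of intervals."""
    shared = sum(overlap(m, d) for m in manual for d in detected)
    precision = shared / total_length(detected) if detected else 0.0
    recall = shared / total_length(manual) if manual else 0.0
    difference = len(detected) - len(manual)
    return precision, recall, difference


# hypothetical boundaries in samples
manual = [(0, 100), (200, 300)]
detected = [(10, 90), (210, 250), (260, 320)]

p, r, d = compare_segmentations(manual, detected)
print(f'precision={p:.3f}, recall={r:.3f}, difference={d}')
```

In this example 160 of the 180 detected samples are also speech in the manual segmentation (precision), 160 of the 200 manually labelled samples were detected (recall), and the detector produced one segment more than the manual segmentation (difference).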

These metrics can be calculated for a single corpus entry or for the whole corpus. Precision ($P$) and Recall ($R$) can be further combined into a single value, the **F-Score**:

$$ F = 2 \cdot \frac{P \cdot R}{P+R} $$
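
A small worked example of this formula, using made-up values for precision and recall (the function name `f_score` is chosen for this sketch):

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# illustrative values, not measured results
print(round(f_score(0.9, 0.6), 2))
print(round(f_score(0.9, 0.9), 2))
```

Because the F-Score is a harmonic mean, it always lies between precision and recall and is pulled towards the lower of the two. If both are equal, the F-Score has the same value.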

The first two metrics have to be taken with a grain of salt, though, because they depend on the definition of a speech pause, which is highly subjective. WebRTC provides a parameter that controls the "aggressiveness" of the VAD (integer values between 0 and 3). A higher value makes WebRTC more aggressive about filtering out non-speech, so a frame is less likely to be classified as "speech". Short pauses inside a longer utterance are then more likely to be detected, which splits the audio into more speech segments.
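
py-webrtcvad documents the mode as how aggressively non-speech is filtered out. The effect of a stricter frame classifier on the number of segments can be illustrated with a toy model. Note that this is not how WebRTC works internally: the per-frame scores and the threshold standing in for the aggressiveness are assumptions made purely for this sketch.

```python
def speech_segments(scores, threshold):
    """Group consecutive frames scoring above the threshold into (first, last) index pairs."""
    segments = []
    start = None
    for i, score in enumerate(scores):
        if score > threshold:
            if start is None:
                start = i
        elif start is not None:
            segments.append((start, i - 1))
            start = None
    if start is not None:
        segments.append((start, len(scores) - 1))
    return segments


# hypothetical per-frame "speech likelihood" scores, made up for illustration
scores = [0.1, 0.8, 0.9, 0.5, 0.9, 0.7, 0.3, 0.6, 0.9, 0.1]

for threshold in (0.2, 0.4, 0.55):
    segs = speech_segments(scores, threshold)
    n_speech = sum(last - first + 1 for first, last in segs)
    print(f'threshold={threshold}: {n_speech} speech frames in {len(segs)} segments')
```

With the lowest threshold all eight inner frames form a single speech segment. Raising the threshold removes the weakest frames first, which breaks the same audio into two and then three shorter segments.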

In [2]:
from operator import itemgetter

def getOverlap(a, b):
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def calc_intersection(a, b):
    a = sorted(a, key=itemgetter(0))
    b = sorted(b, key=itemgetter(0))
    for start_a, end_a in a:
        x = set(range(start_a, end_a + 1))
        for start_b, end_b in ((s, e) for (s, e) in b if getOverlap((s, e), (start_a, end_a))):
            y = range(start_b, end_b + 1)
            intersection = x.intersection(y)
            if intersection:
                yield min(intersection), max(intersection)

def precision_recall(corpus_entry, aggressiveness):
    boundaries_original = calculate_boundaries(corpus_entry.speech_segments)
    boundaries_webrtc, _ = calculate_boundaries_webrtc(corpus_entry, aggressiveness=aggressiveness)
    boundaries_webrtc = boundaries_webrtc * corpus_entry.rate # convert to frames
    boundaries_webrtc = boundaries_webrtc.astype(int)
    
    intersections = calc_intersection(boundaries_original, boundaries_webrtc)
    n_frames_intersection = sum(len(range(start, end + 1)) for start, end in intersections)
    n_frames_original = sum(len(range(start, end + 1)) for start, end in boundaries_original)
    n_frames_webrtc = sum(len(range(start, end + 1)) for start, end in boundaries_webrtc)
    
    p = n_frames_intersection / (n_frames_webrtc + 1e-3)
    r = n_frames_intersection / (n_frames_original + 1e-3)
    f = 2.0 * p * r / (p + r + 1e-3)
    d = len(boundaries_webrtc) - len(boundaries_original)
    
    return p, r, f, d

for aggressiveness in 0,1,2,3:
    print(f'measuring precision/recall for WebRTC-VAD with aggressiveness={aggressiveness}')
    p, r, f, d = precision_recall(corpus_entry, aggressiveness)
    print(f'precision is: {p}')
    print(f'recall is: {r}')
    print(f'F-score is: {f}')
    print(f'difference: {d}')


We can further examine to what degree the speech pauses detected by WebRTC overlap with the speech pauses from the raw data for a whole corpus. We do this by iterating over the whole corpus and performing the above calculations for each corpus entry. The results for precision and recall can be averaged to get an idea of how well WebRTC performs in general. The differences must be inspected more closely, because negative and positive values can cancel each other out: the plain average could be close to zero even though the number of speech segments differs considerably for individual entries. We therefore differentiate between three values for the difference:

* **Absolute Difference**: Average of the absolute values of the differences over all corpus entries
* **Negative Difference**: Average of the negative values of the differences over all corpus entries (corpus entries where WebRTC produced fewer speech segments than a human)
* **Positive Difference**: Average of the positive values of the differences over all corpus entries (corpus entries where WebRTC produced more speech segments than a human)
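
The following sketch shows why the plain average is misleading and how the three values are obtained. The list of differences is hypothetical and the function name `difference_stats` is chosen for this example.

```python
from statistics import mean


def difference_stats(differences):
    """Return (plain mean, mean of absolute values, mean of negatives, mean of positives)."""
    negatives = [d for d in differences if d < 0]
    positives = [d for d in differences if d > 0]
    return (
        mean(differences),
        mean([abs(d) for d in differences]),
        mean(negatives) if negatives else None,  # None if no entry has a negative difference
        mean(positives) if positives else None,  # None if no entry has a positive difference
    )


# hypothetical per-entry differences (WebRTC segments minus manual segments)
differences = [-6, 6, -2, 2, 0]
plain, absolute, negative, positive = difference_stats(differences)
print(plain, absolute, negative, positive)
```

Here the plain mean is zero although four of the five entries differ from the manual segmentation. The absolute, negative and positive averages (3.2, -4 and 4) make these deviations visible.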

In [3]:
from tqdm import tqdm
from tabulate import tabulate
from util.log_util import print_to_file_and_console

def compare_corpus(corpus, corpus_root, aggressiveness):
#     np.seterr(all='raise')
    # evaluate every corpus entry (not only the first one) so that the averages cover the whole corpus
    p_r_f_d = list(tqdm((precision_recall(corpus_entry, aggressiveness) for corpus_entry in corpus), total=len(corpus)))
    p_r_f_d = np.asarray(p_r_f_d)
    avg_p, avg_r, avg_f, avg_d = np.abs(p_r_f_d).mean(axis=0)
    ds = p_r_f_d[:,3]
    avg_d_neg = np.extract(ds < 0, ds).mean()
    avg_d_pos = np.extract(ds > 0, ds).mean()

    return avg_p, avg_r, avg_f, avg_d, avg_d_neg, avg_d_pos

def create_corpus_stats(corpus):
    print(f'Comparing automatic/manual VAD for {corpus.name} corpus')
    stats = {'Aggressiveness': [0,1,2,3], 'Precision': [], 'Recall': [], 'F-Score': [], 'Difference (absolute)': [], 'Difference (negative)': [], 'Difference (positive)': []}
    for aggressiveness in stats['Aggressiveness']:
        print(f'precision/recall with aggressiveness={aggressiveness}\n')
        # use the corpus passed to this function instead of the global rl_corpus
        avg_p, avg_r, avg_f, avg_d, avg_d_neg, avg_d_pos = compare_corpus(corpus, corpus.root_path, aggressiveness)
        stats['Precision'].append(avg_p)
        stats['Recall'].append(avg_r)
        stats['F-Score'].append(avg_f)
        stats['Difference (absolute)'].append(avg_d)
        stats['Difference (negative)'].append(avg_d_neg)
        stats['Difference (positive)'].append(avg_d_pos)

    stats_file = os.path.join(corpus.root_path, 'corpus.stats')
    if os.path.exists(stats_file):
        os.remove(stats_file)
    print(f'Writing results to {stats_file}')
    f = print_to_file_and_console(stats_file)        
    print(tabulate(stats, headers='keys'))
    f.close()
    return stats

def plot_stats(stats, title=None):
    x = stats['Aggressiveness']
    
    fig, ax1 = plt.subplots(figsize=default_figsize, facecolor=default_facecolor)
    if title:
        ax1.set_title(title)
    ax1.set_xticks(x)
    ax1.set_xlabel('aggressiveness')
    ax1.set_ylabel('precision/recall/F-score')
    p, = ax1.plot(x, np.array(stats['Precision']), color='r', label='Precision')
    r, = ax1.plot(x, np.array(stats['Recall']), color='g', label='Recall')
    f, = ax1.plot(x, np.array(stats['F-Score']), color='b', label='F-Score')
    
    ax2 = ax1.twinx()
    ax2.set_ylabel('difference')
    d_abs, = ax2.plot(x, np.array(stats['Difference (absolute)']), color='c', label='Difference (absolute)')
    d_neg, = ax2.plot(x, np.array(stats['Difference (negative)']), color='m', label='Difference (negative)')
    d_pos, = ax2.plot(x, np.array(stats['Difference (positive)']), color='y', label='Difference (positive)')
    
    # include the F-Score line in the legend as well
    plt.legend(handles=[p, r, f, d_abs, d_neg, d_pos], bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
    fig.tight_layout()
    plt.show()
    
title = f'Comparison of automatic/manual VAD for {rl_corpus.name} corpus'
plot_stats(create_corpus_stats(rl_corpus), title=title)

# title = f'Comparison of automatic/manual VAD for {ls_corpus.name} corpus'
# plot_stats(create_corpus_stats(ls_corpus))


##### Results and interpretation

The above cell compares the manual and the automatic segmentation by calculating the average precision, the average recall and the average difference in the number of speech segments created. The comparison is made per corpus and for all levels of aggressiveness; results for the LibriSpeech corpus are still to be determined (tbd). Since this process takes some time, the following tables and figure show the results of a previous run. The best results are marked green.

###### Avg. Precision
<table>
  <tr>
    <th>Corpus</th>
    <th colspan="4">Aggressiveness</th>
  </tr>
  <tr>
    <th></th>
    <th>0</th>
    <th>1</th>
    <th>2</th>
    <th>3</th>
  </tr>
  <tr>
    <td>ReadyLingua</td>
    <td>.849</td>
    <td>.850</td>
    <td>.873</td>
    <td style="background-color: lightgreen;">.901</td>
  </tr>
  <tr>
    <td>LibriSpeech</td>
    <td>tbd</td>
    <td>tbd</td>
    <td>tbd</td>
    <td>tbd</td>
  </tr>
</table>

###### Avg. Recall
<table>
  <tr>
    <th>Corpus</th>
    <th colspan="4">Aggressiveness</th>
  </tr>
  <tr>
    <th></th>
    <th>0</th>
    <th>1</th>
    <th>2</th>
    <th>3</th>
  </tr>
  <tr>
    <td>ReadyLingua</td>
    <td style="background-color: lightgreen;">.988</td>
    <td>.987</td>
    <td>.982</td>
    <td>.970</td>
  </tr>
  <tr>
    <td>LibriSpeech</td>
    <td>tbd</td>
    <td>tbd</td>
    <td>tbd</td>
    <td>tbd</td>    
  </tr>
</table>

###### F-Score
<table>
  <tr>
    <th>Corpus</th>
    <th colspan="4">Aggressiveness</th>
  </tr>
  <tr>
    <th></th>
    <th>0</th>
    <th>1</th>
    <th>2</th>
    <th>3</th>
  </tr>
  <tr>
    <td>ReadyLingua</td>
    <td>.456</td>
    <td>.457</td>
    <td>.462</td>
    <td style="background-color: lightgreen;">.467</td>
  </tr>
  <tr>
    <td>LibriSpeech</td>
    <td>tbd</td>
    <td>tbd</td>
    <td>tbd</td>
    <td>tbd</td>
  </tr>
</table>

###### Differences in number of speech segments
<table>
  <tr>
    <th>Corpus</th>
    <th colspan="4">Avg. Difference (abs)</th>
    <th colspan="4">Avg. Difference (neg)</th>
    <th colspan="4">Avg. Difference (pos)</th>
  </tr>
  <tr>
    <th></th>
    <th>0</th>
    <th>1</th>
    <th>2</th>
    <th>3</th>
    <th>0</th>
    <th>1</th>
    <th>2</th>
    <th>3</th>
    <th>0</th>
    <th>1</th>
    <th>2</th>
    <th>3</th>
  </tr>
  <tr>
    <td>ReadyLingua</td>
    <td>30</td>
    <td>29</td>
    <td>20</td>
    <td style="background-color: lightgreen;">17</td>
    <td>-29</td>
    <td>-18</td>
    <td>-16</td>
    <td style="background-color: lightgreen;">-6</td>
    <td style="background-color: lightgreen;">1</td>
    <td style="background-color: lightgreen;">1</td>
    <td>4</td>
    <td>11</td>
  </tr>
  <tr>
    <td>LibriSpeech</td>
    <td>tbd</td>
    <td>tbd</td>
    <td>tbd</td>
    <td>tbd</td>
    <td>tbd</td>
    <td>tbd</td>
    <td>tbd</td>
    <td>tbd</td>
    <td>tbd</td>
    <td>tbd</td>
    <td>tbd</td>
    <td>tbd</td>
  </tr>
</table>

###### ReadyLingua corpus

The following plot visualizes the results for the ReadyLingua corpus. Precision increases steadily with aggressiveness, from .849 at aggressiveness 0 to .901 at aggressiveness 3. Recall decreases at the same time, but to a lesser degree (from .988 to .970). At its highest aggressiveness setting, WebRTC detects speech segments with a precision of over 90% and a recall of 97%. The tabulated F-Score of 0.467 equals $P \cdot R / (P + R)$ for these values, so it appears to have been computed without the factor 2 of the formula above; with that factor, the F-Score would be about 0.934.

The average absolute difference in the number of speech segments also decreases with increasing aggressiveness, from 30 to 17. For corpus entries where WebRTC produces fewer speech segments than a human, the average difference shrinks to only -6 at the highest aggressiveness, meaning that WebRTC then produces on average 6 segments fewer than the manual segmentation. For corpus entries where WebRTC produces more speech segments than a human, on the other hand, the average difference grows with increasing aggressiveness (from 1 to 11). Nevertheless, the average absolute difference is lowest at an aggressiveness of 3. We can conclude that WebRTC generally produces more speech segments with increasing aggressiveness.

Generally speaking, the performance of WebRTC-VAD can be considered very good, yielding results close to human performance when set to the highest aggressiveness. We therefore leave the aggressiveness of WebRTC-VAD at its highest setting (`3`).

![WebRTC VAD vs. manual speech segmentation](../assets/webrtc_vs_manual_rl.png)

###### LibriSpeech corpus

tbd.