In [None]:
# TODO: switch to AMI
PROTOCOL = 'Debug.SpeakerDiarization.Debug'

# Voice activity detection with `pyannote.audio`

Voice activity detection (VAD) is the task of detecting speech regions in a given audio stream or recording.  
In this notebook, we will train and evaluate a VAD pipeline on the Debug database.

In [None]:
from pyannote.database import get_protocol, FileFinder
protocol = get_protocol(PROTOCOL, preprocessors={"audio": FileFinder()})

`pyannote.database` *protocols* usually define:
* a training set: `for training_file in protocol.train(): ...`
* a validation set: `for validation_file in protocol.development(): ...`
* an evaluation set: `for evaluation_file in protocol.test(): ...`

Let's listen to the first training file and visualize its reference annotation:

In [None]:
first_training_file = next(protocol.train())

In [None]:
from pyannote.audio.utils.preview import listen
listen(first_training_file)

In [None]:
first_training_file['annotation']

The expected output of a perfect voice activity detection pipeline would look like this:

In [None]:
from pyannote.audio.pipelines.voice_activity_detection import OracleVoiceActivityDetection
oracle_vad = OracleVoiceActivityDetection()

oracle_vad(first_training_file).get_timeline()

## Training

We initialize a VAD *task* that describes how the model will be trained:

* `protocol` indicates that we will use files available in `protocol.train()`.
* `duration=2.` and `batch_size=16` indicate that the model will ingest batches of 16 two-second audio chunks.
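
As a back-of-the-envelope sketch of what one such batch looks like, the snippet below computes its shape. The 16 kHz sample rate and the mono `(batch, channel, sample)` layout are assumptions made for illustration; neither is stated in this notebook, and both depend on the data and model.

```python
# Assumed values for illustration only: the sample rate and channel count
# are not given in this notebook.
sample_rate = 16000   # samples per second (assumption)
num_channels = 1      # mono audio (assumption)

duration = 2.0        # seconds per chunk, as passed to the task
batch_size = 16       # chunks per batch, as passed to the task

samples_per_chunk = int(duration * sample_rate)
batch_shape = (batch_size, num_channels, samples_per_chunk)
print(batch_shape)
```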

In [None]:
from pyannote.audio.tasks import VoiceActivityDetection
vad = VoiceActivityDetection(protocol, duration=2., batch_size=16)

We initialize the *model*: it needs to know about the task for which it is being trained (`task=vad`):

In [None]:
from pyannote.audio.models.segmentation.debug import SimpleSegmentationModel
model = SimpleSegmentationModel(task=vad)

Now that everything is ready, let's train with `pytorch-lightning`!

In [None]:
import pytorch_lightning as pl
trainer = pl.Trainer(max_epochs=10)
trainer.fit(model, vad)

## Inference

Once the model is trained, we can apply it to a test file:

In [None]:
test_file = next(protocol.test())
# here we use a test file provided by the protocol, but it could be any audio file
# e.g. test_file = "/path/to/test.wav".

Because the model was trained on 2s audio chunks and test files are likely to be much longer than that, we wrap the `model` in an `Inference` instance: it takes care of sliding a 2s window over the whole file and aggregating the output of the model.
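
The sliding-window idea can be sketched in plain Python. This is a simplified illustration, not the actual `Inference` implementation: the helper `aggregate`, the toy `mean_energy_model`, and the plain averaging of overlapping windows are all assumptions made for the example, and the real class may weight or align windows differently.

```python
def aggregate(signal, window, step, model):
    """Slide a fixed-size window over `signal` and average overlapping outputs.

    `model` is a hypothetical callable that maps a chunk to one score per sample.
    """
    total = [0.0] * len(signal)
    count = [0] * len(signal)
    for start in range(0, len(signal) - window + 1, step):
        chunk = signal[start:start + window]
        for k, score in enumerate(model(chunk)):
            total[start + k] += score
            count[start + k] += 1
    # Samples not covered by any window (possible at the tail) default to 0.0
    return [t / c if c else 0.0 for t, c in zip(total, count)]


def mean_energy_model(chunk):
    """Toy stand-in for a trained model: same score for every sample of the chunk."""
    energy = sum(abs(x) for x in chunk) / len(chunk)
    return [energy] * len(chunk)


signal = [0, 0, 0, 0, 1, 1, 1, 1]
aggregated = aggregate(signal, window=4, step=2, model=mean_energy_model)
print(aggregated)
```

Samples covered by two windows receive the mean of both predictions, which smooths the transition between the silent and active halves of the toy signal.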

In [None]:
from pyannote.audio import Inference
inference = Inference(model)
vad_probability = inference(test_file)

In [None]:
vad_probability

## Pipeline

Almost there! To obtain the final speech regions, we need to apply a detection threshold.  
For that, we rely on the voice activity detection pipeline whose hyper-parameters are set manually:
- `onset=0.5`: mark region as `active` when probability goes above 0.5
- `offset=0.5`: switch back to `inactive` when probability goes below 0.5
- `min_duration_on=0.1`: remove `active` regions shorter than 100ms
- `min_duration_off=0.1`: fill `inactive` regions shorter than 100ms.
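
To make the role of these four hyper-parameters concrete, here is a minimal sketch of the thresholding logic described above. The function `binarize`, its `frame_step` argument, and the order of the two post-processing steps (fill short gaps first, then drop short regions) are assumptions of this sketch rather than the pipeline's actual code.

```python
def binarize(scores, frame_step, onset=0.5, offset=0.5,
             min_duration_on=0.0, min_duration_off=0.0):
    """Turn per-frame speech probabilities into (start, end) regions in seconds."""
    regions = []
    active = False
    start = 0.0
    for i, probability in enumerate(scores):
        t = i * frame_step
        if not active and probability > onset:
            # probability went above `onset`: a speech region starts here
            active, start = True, t
        elif active and probability < offset:
            # probability went below `offset`: the speech region ends here
            active = False
            regions.append((start, t))
    if active:
        # close a region that is still open at the end of the file
        regions.append((start, len(scores) * frame_step))

    # fill inactive gaps shorter than `min_duration_off`
    merged = []
    for region_start, region_end in regions:
        if merged and region_start - merged[-1][1] < min_duration_off:
            merged[-1] = (merged[-1][0], region_end)
        else:
            merged.append((region_start, region_end))

    # remove active regions shorter than `min_duration_on`
    return [(s, e) for s, e in merged if e - s >= min_duration_on]


scores = [0.1, 0.9, 0.9, 0.2, 0.8, 0.8, 0.8, 0.1, 0.1, 0.1, 0.1, 0.7, 0.1]
raw_regions = binarize(scores, frame_step=0.25)
clean_regions = binarize(scores, frame_step=0.25,
                         min_duration_on=0.5, min_duration_off=0.5)
print(raw_regions)
print(clean_regions)
```

In this toy example, the 0.25s gap between the first two regions is filled and the isolated 0.25s region at the end is removed.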

In [None]:
from pyannote.audio.pipelines import VoiceActivityDetection as VoiceActivityDetectionPipeline
pipeline = VoiceActivityDetectionPipeline(scores=inference).instantiate(
    {"onset": 0.5, "offset": 0.5, "min_duration_on": 0.1, "min_duration_off": 0.1})

Here we go:

In [None]:
pipeline(test_file).get_timeline()

## Optimizing pipeline hyper-parameters

The hyper-parameters that we chose manually are good enough, but we can try to optimize them on the validation set to get even better performance.
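
Conceptually, this amounts to searching for the hyper-parameter values that minimize an error metric on the validation files. The sketch below illustrates the idea with a plain random search over a single threshold and a frame-level error rate. It is not how `pyannote.pipeline` performs the search; `detection_error`, `random_search`, and the toy validation data are hypothetical and exist only for this example.

```python
import random


def detection_error(scores, reference, threshold):
    """Fraction of frames whose thresholded decision disagrees with the reference."""
    errors = sum((p > threshold) != bool(r) for p, r in zip(scores, reference))
    return errors / len(reference)


def random_search(validation, n_iterations=200, seed=0):
    """Try random thresholds and keep the one with the lowest average error."""
    rng = random.Random(seed)
    best_threshold, best_error = None, float("inf")
    for _ in range(n_iterations):
        threshold = rng.uniform(0.0, 1.0)
        error = sum(detection_error(scores, reference, threshold)
                    for scores, reference in validation) / len(validation)
        if error < best_error:
            best_threshold, best_error = threshold, error
    return best_threshold, best_error


# Toy validation set: (per-frame scores, per-frame reference labels)
validation = [
    ([0.1, 0.3, 0.7, 0.9, 0.8, 0.2], [0, 0, 1, 1, 1, 0]),
    ([0.4, 0.6, 0.65, 0.35, 0.1, 0.9], [0, 1, 1, 0, 0, 1]),
]
best_threshold, best_error = random_search(validation)
print(best_threshold, best_error)
```

On this toy data, any threshold between 0.4 and 0.6 separates speech from non-speech frames perfectly, so the search ends with an error of zero.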

In [None]:
# to make things faster, we run the inference once and for all... 
validation_files = list(protocol.development())
for file in validation_files:
    file['vad_scores'] = inference(file)
# ... and tell the pipeline to load VAD scores directly from files
pipeline = VoiceActivityDetectionPipeline(scores="vad_scores")

In [None]:
from pyannote.pipeline import Optimizer
optimizer = Optimizer(pipeline)
optimizer.tune(validation_files, n_iterations=200, show_progress=False)

There you go: better hyper-parameters that should lead to better results!

In [None]:
optimized_pipeline = VoiceActivityDetectionPipeline(scores=inference).instantiate(optimizer.best_params)
optimized_pipeline(test_file).get_timeline()