# Diarize single-channel audio files

This notebook instantiates a `pyannote-audio` pipeline and diarizes the single-channel left and right audio files. The diarization results are stored in `.rttm` files.

In [20]:
from pathlib import Path
import diarize_utils as utils
from pyannote.audio import Pipeline

## Define the project

The source audio files are stored in subdirectories named `audio/left` and `audio/right` in the project root. The left and right `.rttm` outputs will be written to the `left` and `right` subdirectories of `diarized/rttm`.

In [21]:
projroot = Path('/global/scratch/users/rsprouse/yidcorp/')
wavleft = projroot / 'audio' / 'left'
wavright = projroot / 'audio' / 'right'
rttmleft = projroot / 'diarized' / 'rttm' / 'left'
rttmright = projroot / 'diarized' / 'rttm' / 'right'

## Mirror the source audio directories

The left and right audio channels will be diarized separately and the results stored in subdirectories of `diarized/rttm` named `left` and `right`. As a first step we use `mirror_dir` to re-create the directory structures of `wavleft` and `wavright`. The output `.rttm` files will be placed in these parallel directory structures.

In [22]:
utils.mirror_dir(wavleft, rttmleft)
utils.mirror_dir(wavright, rttmright)
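
The source of `diarize_utils.mirror_dir` is not shown in this notebook. As a rough illustration of the mirroring step described above, a function along these lines would re-create a directory tree without copying any files. The name `mirror_dir_sketch` and its return value are assumptions made for this example, not the actual helper.

```python
from pathlib import Path


def mirror_dir_sketch(src, dst):
    """Re-create the subdirectory tree of `src` under `dst`.

    Hypothetical stand-in for `utils.mirror_dir`; only directories are
    created, no files are copied.
    """
    src, dst = Path(src), Path(dst)
    # Make sure the top-level output directory exists.
    dst.mkdir(parents=True, exist_ok=True)
    created = []
    for path in sorted(src.rglob('*')):
        if path.is_dir():
            # Same relative location, but rooted at `dst`.
            target = dst / path.relative_to(src)
            target.mkdir(parents=True, exist_ok=True)
            created.append(target)
    return created
```

Because `exist_ok=True` is used throughout, a function like this can be re-run safely after new source subdirectories are added.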

## Instantiate the pipeline

TODO: more on auth tokens
TODO: more on setting params.

In [23]:
# Store the token as the first line of `tokenfile`. This file should not be
# readable by other users on the system and should not be added to a git
# repository.
tokenfile = '/global/home/users/rsprouse/pyannote-auth-token'
with open(tokenfile, 'r') as tf:
    auth_token = tf.readline().strip()
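
The comment above asks that the token file not be readable by other users. One way to verify that on a POSIX system is to inspect the file's permission bits before reading it. This is a minimal sketch; `is_private_file` is a hypothetical helper and not part of `diarize_utils`.

```python
import os
import stat


def is_private_file(path):
    """Return True if neither group nor others have any access to `path`.

    Hypothetical helper; it relies on POSIX permission bits.
    """
    mode = os.stat(path).st_mode
    # S_IRWXG and S_IRWXO cover read/write/execute for group and others.
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0
```

A check such as `is_private_file(tokenfile)` could be run before opening the file. Running `chmod 600` on the token file is the usual way to restrict it to its owner.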

In [24]:
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization",
    use_auth_token=auth_token
)
parameters = {
    "segmentation": {
        "min_duration_off": 0.3,
    },
}

pipeline.instantiate(parameters)

<pyannote.audio.pipelines.speaker_diarization.SpeakerDiarization at 0x2b53734aeb50>

## Diarize the left channels

The `compare_dirs` function finds `left` `.wav` files that do not have a corresponding output file. The `ext1` and `ext2` values ensure that `compare_dirs` only looks for `.wav` and `.rttm` files in their corresponding directories.

In [25]:
todoleft = utils.compare_dirs(
    dir1=wavleft, ext1='.wav',
    dir2=rttmleft, ext2='.rttm'
)
todoleft

Unnamed: 0,relpath,fname,barename
0,.,Abraham_Slucki_Tape2.wav,Abraham_Slucki_Tape2
1,.,Adalbert_Fried_Tape1.wav,Adalbert_Fried_Tape1
2,.,Adalbert_Fried_Tape2.wav,Adalbert_Fried_Tape2
3,.,Adalbert_Fried_Tape3.wav,Adalbert_Fried_Tape3
4,.,Aizik_Dimantstein_Tape1.wav,Aizik_Dimantstein_Tape1
...,...,...,...
187,.,Wolf_Scheinberg_Tape2.wav,Wolf_Scheinberg_Tape2
188,.,Yokheved_Ayberman_Tape1.wav,Yokheved_Ayberman_Tape1
189,.,Yokheved_Ayberman_Tape2.wav,Yokheved_Ayberman_Tape2
190,.,Zigmund_Neufeld_Tape1.wav,Zigmund_Neufeld_Tape1


`todoleft` is a dataframe in which the rows represent input audio files that require processing of the left channel.
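
The comparison that `compare_dirs` performs could be sketched roughly as follows. This is not the actual `diarize_utils` code, which is not shown here: `compare_dirs_sketch` is a hypothetical name, and it returns a plain list of dictionaries rather than a dataframe so that the example needs only the standard library. The keys mirror the `relpath`, `fname`, and `barename` columns in the output above.

```python
from pathlib import Path


def compare_dirs_sketch(dir1, ext1, dir2, ext2):
    """List files under `dir1` ending in `ext1` that lack a counterpart
    with extension `ext2` at the same relative location under `dir2`.

    Hypothetical stand-in for `utils.compare_dirs`.
    """
    dir1, dir2 = Path(dir1), Path(dir2)
    todo = []
    for src in sorted(dir1.rglob(f'*{ext1}')):
        # Directory of the source file relative to the top of `dir1`.
        rel = src.parent.relative_to(dir1)
        expected = dir2 / rel / (src.stem + ext2)
        if not expected.exists():
            todo.append({
                'relpath': str(rel),
                'fname': src.name,
                'barename': src.stem,
            })
    return todo
```

Passing the resulting list to `pandas.DataFrame` would give a table with the same three columns.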

The `diarize_df` function iterates over the rows of `todoleft`, uses the pipeline to diarize each input audio file, and writes a corresponding `.rttm` file.

In [None]:
# NOTE: num_spkr (presumably the number of speakers) is not defined above; set it before running.
utils.diarize_df(todoleft, pipeline, num_spkr, wavleft, rttmleft)
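
The loop that `diarize_df` runs might look something like the sketch below. It is an assumption about the helper's behavior, not its actual source. The hypothetical `diarize_records_sketch` takes a list of row dictionaries instead of a dataframe, assumes the pipeline can be called with an audio path and a `num_speakers` keyword, and assumes the returned annotation has a `write_rttm` method, as `pyannote` annotations do.

```python
from pathlib import Path


def diarize_records_sketch(records, pipeline, num_speakers, wavdir, rttmdir):
    """Diarize each audio file described in `records` and write an .rttm.

    Hypothetical stand-in for `utils.diarize_df`. Each record needs the
    keys 'relpath', 'fname', and 'barename'.
    """
    wavdir, rttmdir = Path(wavdir), Path(rttmdir)
    written = []
    for row in records:
        wavfile = wavdir / row['relpath'] / row['fname']
        rttmfile = rttmdir / row['relpath'] / (row['barename'] + '.rttm')
        # Assumed call style: pipeline(path, num_speakers=...) returns an
        # annotation object.
        annotation = pipeline(str(wavfile), num_speakers=num_speakers)
        with open(rttmfile, 'w') as out:
            annotation.write_rttm(out)
        written.append(rttmfile)
    return written
```

With a dataframe such as `todoleft`, the rows could be passed in as `todoleft.to_dict('records')`. Because the output directories were mirrored earlier, the sketch does not create any directories itself.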

## Diarize the right channels

In [26]:
todoright = utils.compare_dirs(
    dir1=wavright, ext1='.wav',
    dir2=rttmright, ext2='.rttm'
)
todoright

Unnamed: 0,relpath,fname,barename
0,.,Abraham_Slucki_Tape2.wav,Abraham_Slucki_Tape2
1,.,Adalbert_Fried_Tape1.wav,Adalbert_Fried_Tape1
2,.,Adalbert_Fried_Tape2.wav,Adalbert_Fried_Tape2
3,.,Adalbert_Fried_Tape3.wav,Adalbert_Fried_Tape3
4,.,Aizik_Dimantstein_Tape1.wav,Aizik_Dimantstein_Tape1
...,...,...,...
189,.,Wolf_Scheinberg_Tape2.wav,Wolf_Scheinberg_Tape2
190,.,Yokheved_Ayberman_Tape1.wav,Yokheved_Ayberman_Tape1
191,.,Yokheved_Ayberman_Tape2.wav,Yokheved_Ayberman_Tape2
192,.,Zigmund_Neufeld_Tape1.wav,Zigmund_Neufeld_Tape1


In [None]:
# NOTE: num_spkr must be defined before this cell runs; no cell above sets it.
utils.diarize_df(todoright, pipeline, num_spkr, wavright, rttmright)
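
Once the `.rttm` files exist, each `SPEAKER` line can be read back with a few lines of Python. The sketch below assumes the standard space-delimited RTTM layout, in which the file identifier, channel, onset, duration, and speaker label occupy fields 2, 3, 4, 5, and 8. `parse_rttm_line` is a hypothetical helper, and the sample line is made up for illustration rather than taken from this project's output.

```python
def parse_rttm_line(line):
    """Parse one RTTM `SPEAKER` line into a dict; return None otherwise.

    Hypothetical helper for inspecting diarization output.
    """
    fields = line.split()
    if len(fields) < 8 or fields[0] != 'SPEAKER':
        return None
    return {
        'file': fields[1],
        'channel': fields[2],
        'onset': float(fields[3]),      # segment start, in seconds
        'duration': float(fields[4]),   # segment length, in seconds
        'speaker': fields[7],
    }


# Illustrative line only; real values come from the pipeline output.
example = 'SPEAKER example_file 1 0.500 2.250 <NA> <NA> SPEAKER_00 <NA> <NA>'
segment = parse_rttm_line(example)
```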