In [1]:
%%bash
# Cell to run prior to running following cells in notebook during development.
# Delete this cell prior to publishing.
stereodir=/global/scratch/users/rsprouse/yidcorp/audio/stereo
tgtdir=/global/home/groups/fc_phonlab/spkrcorpus
dur=45

# Clear out existing files.
rm -rf ${tgtdir}/audio
rm -rf ${tgtdir}/diarized

# Speaker 1
spkrdir=${tgtdir}/audio/stereo/speaker_1
mkdir -p ${spkrdir}
sox ${stereodir}/Moishe_Gorelik_Tape4.wav ${spkrdir}/interview_a.wav trim 0 ${dur}
sox ${stereodir}/Moishe_Gorelik_Tape5.wav ${spkrdir}/interview_b.wav trim 0 ${dur}

# Speaker 2
spkrdir=${tgtdir}/audio/stereo/speaker_2
mkdir -p ${spkrdir}
sox ${stereodir}/Zhenya_Raykhman_Tape2.wav ${spkrdir}/interview_a.wav trim 0 ${dur}
sox ${stereodir}/Zhenya_Raykhman_Tape3.wav ${spkrdir}/interview_b.wav trim 0 ${dur}

In [2]:
from pathlib import Path
import diarize_utils as utils
from pyannote.audio import Pipeline
from phonlab.utils import dir2df
from audiolabel import df2tg

# Prepare audio files and diarize them

This notebook outlines a sample workflow that can be used to prepare a set of audio files for diarization and then run a diarization pipeline on them to produce annotation files (`.eaf` for use with [ELAN](https://archive.mpi.nl/tla/elan); `.TextGrid` for use with [Praat](https://www.fon.hum.uva.nl/praat/)). The workflow produces speaker tiers in the annotation files that contain intervals that mark locations in the audio where the speaker is talking.

TODO: image of sample textgrid/eaf output + waveform

### The overall workflow

The workflow consists of three steps:

1. [Preparation of the audio files for input to the diarization process](#step-1).
1. [Diarization of the pre-processed audio files](#step-2).
1. [Combination of the diarized outputs into a single annotation file per recording](#step-3).

Each step is designed to be easily repeated whenever new inputs to the step are added to the project. For the first step this is when new source audio files are added. For the second step this is when the first step has pre-processed new files. For the third step this is when the second step has produced new annotation files.

## Project organization

The input and output files in the project corpus are stored in subdirectories of the project root, which is defined by `projroot`.

In [3]:
projroot = Path('/global/home/groups/fc_phonlab/spkrcorpus')

### Input audio files

The source audio files are stereo files in the `audio/stereo` directory that are organized by speaker. The [`dir2df` function](https://github.com/rsprouse/phonlab/blob/master/doc/Retrieving%20filenames%20in%20a%20directory%20tree%20with%20%60dir2df()%60.ipynb) produces a dataframe of the files found in the source audio directory. The speaker subdirectories are shown by the value of the `relpath` column, and the filenames are stored in `fname`. The `barename` column contains the filename without its extension.

In [4]:
dir2df(
    projroot/'audio'/'stereo',
    addcols=['barename']
)

| | relpath | fname | barename |
|---|---|---|---|
| 0 | speaker_1 | interview_a.wav | interview_a |
| 1 | speaker_1 | interview_b.wav | interview_b |
| 2 | speaker_2 | interview_a.wav | interview_a |
| 3 | speaker_2 | interview_b.wav | interview_b |
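
To make the meaning of the three columns concrete, here is a minimal standard-library sketch that builds the same `relpath`, `fname`, and `barename` values for a throwaway directory tree. The `list_audio` helper is hypothetical and is not part of `phonlab`; it only illustrates what the columns contain and makes no claim about how `dir2df` is implemented.

```python
from pathlib import Path
import tempfile

def list_audio(root, ext='.wav'):
    """Return one dict per matching file under `root` (hypothetical helper)."""
    root = Path(root)
    rows = []
    for p in sorted(root.rglob(f'*{ext}')):
        rows.append({
            'relpath': p.parent.relative_to(root).as_posix(),  # speaker subdirectory
            'fname': p.name,                                   # filename with extension
            'barename': p.stem,                                # filename without extension
        })
    return rows

# Build a temporary tree that mirrors the speaker layout shown above.
with tempfile.TemporaryDirectory() as tmp:
    for spkr in ('speaker_1', 'speaker_2'):
        for name in ('interview_a.wav', 'interview_b.wav'):
            f = Path(tmp)/spkr/name
            f.parent.mkdir(parents=True, exist_ok=True)
            f.touch()
    rows = list_audio(tmp)

for row in rows:
    print(row)
```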


### Processed audio files

The first step in the workflow extracts a channel from the stereo `.wav` files and downsamples it. Since the input files are stereo, the left and right channels are extracted separately before downsampling.

`channelmap` defines a mapping of input audio channel to output directory. Here channel `1` maps to the `audio/left` subdirectory, and channel `2` maps to `audio/right`.

For mono input files a simpler `channelmap` can be created that maps channel `1` to a subdirectory name.

In [5]:
channelmap = {
    1: 'left',
    2: 'right'
}

#channelmap = {1: 'downsampled'} # A sample `channelmap` for mono input audio files.
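
As a quick illustration of how a `channelmap` translates into output directories, the following sketch expands a map into paths under `audio/`. The `channel_dirs` helper and the `/path/to/project` root are placeholders for illustration only.

```python
from pathlib import Path

def channel_dirs(projroot, channelmap):
    """Map each channel number to its output directory under audio/ (hypothetical helper)."""
    return {num: Path(projroot)/'audio'/name for num, name in channelmap.items()}

# Stereo input: one output directory per channel.
stereo_dirs = channel_dirs('/path/to/project', {1: 'left', 2: 'right'})

# Mono input: a single output directory.
mono_dirs = channel_dirs('/path/to/project', {1: 'downsampled'})

for num, d in stereo_dirs.items():
    print(num, d)
```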

### Diarized outputs

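In the second step of the workflow the pre-processed files are diarized. The `output_type` variable selects `TextGrid` or `eaf` output files. The other settings in this cell (`num_speakers`, `buffer`, and `speech_label`) are passed to the `diarize` function in Step 2.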

In [6]:
output_type = 'TextGrid' # Desired output type: 'eaf' or 'TextGrid'
num_speakers = 2  # Per channel
buffer = 0.250 # In seconds
# Text to include in labelled speech regions.
speech_label = '*' if output_type == 'TextGrid' else ''

TODO: more on auth tokens

In [7]:
tokenfile = projroot/'pyannote-auth-token'
with open(tokenfile, 'r') as tf:
    auth_token = tf.readline().strip()
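
If the token file is missing or empty, the cell above fails late or yields an empty string. One possible guard is sketched below; `read_token` is a hypothetical helper, not part of `diarize_utils`, and the token value used in the demonstration is a placeholder.

```python
from pathlib import Path
import tempfile

def read_token(tokenfile):
    """Return the first line of `tokenfile`, stripped, or raise a clear error."""
    tokenfile = Path(tokenfile)
    if not tokenfile.is_file():
        raise FileNotFoundError(f'Token file not found: {tokenfile}')
    lines = tokenfile.read_text().splitlines()
    token = lines[0].strip() if lines else ''
    if not token:
        raise ValueError(f'Token file is empty: {tokenfile}')
    return token

# Demonstration with a temporary file holding a placeholder token.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp)/'pyannote-auth-token'
    demo_file.write_text('  example-token  \n')
    demo_token = read_token(demo_file)
```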

## <a id="step-1">Step 1: Extract the channels and downsample audio</a>

The output audio files will consist of a single channel from an input audio file that has been downsampled to 16000 Hz, which matches the sample rate used to train the diarization model. (TODO: verify)

The `compare_dirs` function finds `stereo` files that do not yet have corresponding `left` or `right` output files. The `ext1` and `ext2` values ensure that `compare_dirs` only looks for `.wav` files in the corresponding directories. `compare_dirs` returns a dataframe in which each row contains a file that requires processing.
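
The idea behind this comparison can be sketched with the standard library alone. The `missing_outputs` function below is a hypothetical stand-in, written under the assumption that two files correspond when they share a relative directory and a bare filename; the real `compare_dirs` may differ in its details and returns a dataframe rather than a list.

```python
from pathlib import Path
import tempfile

def missing_outputs(dir1, ext1, dir2, ext2):
    """Return (relpath, fname, barename) for dir1 files with no match in dir2.

    Files match when they share a relative directory and a bare name,
    regardless of extension.
    """
    dir1, dir2 = Path(dir1), Path(dir2)

    def index(root, ext):
        # Map (relative directory, bare name) -> filename.
        if not root.is_dir():
            return {}
        return {
            (p.parent.relative_to(root).as_posix(), p.stem): p.name
            for p in root.rglob(f'*{ext}')
        }

    src, done = index(dir1, ext1), index(dir2, ext2)
    return sorted(
        (relpath, fname, barename)
        for (relpath, barename), fname in src.items()
        if (relpath, barename) not in done
    )

# Demonstration: one of two stereo files already has a `left` output.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for rel in ('stereo/speaker_1/interview_a.wav',
                'stereo/speaker_1/interview_b.wav',
                'left/speaker_1/interview_a.wav'):
        (root/rel).parent.mkdir(parents=True, exist_ok=True)
        (root/rel).touch()
    todo_left = missing_outputs(root/'stereo', '.wav', root/'left', '.wav')
    todo_right = missing_outputs(root/'stereo', '.wav', root/'right', '.wav')
```

Because only missing outputs are returned, rerunning the step after adding new source files processes just the new files.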

We iterate over the rows of the `todo` dataframe and use `prep_audio` to extract one channel of audio and downsample. The resulting `.wav` file is stored in a `left` or `right` subdirectory. The inclusion of `relpath` in the output filepath also ensures that the `speaker` directory structure is replicated in the output directory.

In [8]:
verbose = True   # Set to False to suppress progress messages

# Loop over the channels defined in `channelmap`.
for chan_num, chan_name in channelmap.items():
    srcdir = projroot/'audio'/'stereo'
    chandir = projroot/'audio'/chan_name

    # Find input stereo files that don't have a corresponding
    # left|right pre-processed file.
    todo = utils.compare_dirs(
        dir1=srcdir, ext1='.wav',
        dir2=chandir, ext2='.wav'
    )

    # Loop over the files that require processing.
    for row in todo.itertuples():
        infile = srcdir/row.relpath/row.fname
        outfile = chandir/row.relpath/row.fname

        # Create pre-processed output file for left|right channel.
        if verbose:
            print(f'prep_audio: {outfile}')
        utils.prep_audio(infile, outfile, chan_num)

prep_audio: /global/home/groups/fc_phonlab/spkrcorpus/audio/left/speaker_1/interview_a.wav
prep_audio: /global/home/groups/fc_phonlab/spkrcorpus/audio/left/speaker_1/interview_b.wav
prep_audio: /global/home/groups/fc_phonlab/spkrcorpus/audio/left/speaker_2/interview_a.wav
prep_audio: /global/home/groups/fc_phonlab/spkrcorpus/audio/left/speaker_2/interview_b.wav
prep_audio: /global/home/groups/fc_phonlab/spkrcorpus/audio/right/speaker_1/interview_a.wav
prep_audio: /global/home/groups/fc_phonlab/spkrcorpus/audio/right/speaker_1/interview_b.wav
prep_audio: /global/home/groups/fc_phonlab/spkrcorpus/audio/right/speaker_2/interview_a.wav
prep_audio: /global/home/groups/fc_phonlab/spkrcorpus/audio/right/speaker_2/interview_b.wav
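
The implementation of `prep_audio` is not shown in this notebook. As one way to perform the same kind of operation, the sketch below builds a `sox` command that selects a single channel with the `remix` effect and resamples with the `rate` effect. The `sox_prep_command` helper is hypothetical, and it assumes `sox` is installed and on the `PATH` when the command is eventually run.

```python
from pathlib import Path

def sox_prep_command(infile, outfile, chan_num, rate=16000):
    """Build a sox command that extracts one channel and resamples it.

    This only constructs the argument list; it does not run sox.
    """
    return [
        'sox', str(infile), str(outfile),
        'remix', str(chan_num),   # keep only the requested input channel
        'rate', str(rate),        # resample to the target sample rate
    ]

cmd = sox_prep_command(
    Path('audio')/'stereo'/'speaker_1'/'interview_a.wav',
    Path('audio')/'right'/'speaker_1'/'interview_a.wav',
    chan_num=2,
)
print(' '.join(cmd))
```

The command could then be executed with `subprocess.run(cmd, check=True)` after creating the output file's parent directory.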


## <a id="step-2">Step 2: Diarize the pre-processed files</a>

### Instantiate the pipeline

TODO: more on setting params.

Note that `pipeline` only needs to be instantiated once. The cell that performs diarization can be executed repeatedly without re-instantiating `pipeline`.

In [9]:
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization",
    use_auth_token=auth_token
)
parameters = {
    "segmentation": {
        "min_duration_off": 0.3,
    },
}

pipeline.instantiate(parameters)

<pyannote.audio.pipelines.speaker_diarization.SpeakerDiarization at 0x2b83aabc5f70>
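
As we understand it, `min_duration_off` controls how short a silent gap must be before it is filled in, so that two speech regions separated by a brief pause become one region. The sketch below illustrates that idea on plain `(start, end)` tuples; it is an illustration of the concept under that assumption, not the code `pyannote.audio` runs internally, and `fill_short_gaps` is a hypothetical name.

```python
def fill_short_gaps(intervals, min_duration_off):
    """Merge (start, end) intervals separated by gaps shorter than min_duration_off."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] < min_duration_off:
            # Gap is short: extend the previous region instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# A gap of about 0.2 s is filled; a gap of 0.5 s is kept.
regions = [(0.0, 1.0), (1.2, 2.0), (2.5, 3.0)]
filled = fill_short_gaps(regions, min_duration_off=0.3)
print(filled)
```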

### Diarize the pre-processed audio

The outputs of diarization are annotation files for each of the pre-processed audio files.

First, the `compare_dirs` function finds pre-processed left|right `.wav` files that do not have a corresponding annotation file of the selected `output_type` (`TextGrid` in this run). The `ext1` and `ext2` values ensure that `compare_dirs` only looks for `.wav` and `.TextGrid` files in their corresponding directories.

Each annotation file is created by the `diarize` function while iterating over `todo`. Note that the output filename is constructed from `barename` (the input file's filename without its extension) with `output_type` as the extension.

In [10]:
# Loop over the channels defined in `channelmap`.
for chan_num, chan_name in channelmap.items():
    wavdir = projroot/'audio'/chan_name
    outdir = projroot/'diarized'/output_type/chan_name

    # Find input pre-processed files that don't have a corresponding
    # left|right TextGrid.
    todo = utils.compare_dirs(
        dir1=wavdir, ext1='.wav',
        dir2=outdir, ext2=f'.{output_type}'
    )

    # Loop over the files that require processing.
    for row in todo.itertuples():
        wavfile = wavdir/row.relpath/row.fname
        outfile = outdir/row.relpath/f'{row.barename}.{output_type}'

        # Create TextGrid for left|right pre-processed audio file.
        if verbose:
            print(f'diarize: {outfile}')
        diarization = utils.diarize(
            wavfile, pipeline, outfile, num_speakers, buffer, speech_label
        )

diarize: /global/home/groups/fc_phonlab/spkrcorpus/diarized/TextGrid/left/speaker_1/interview_a.TextGrid


[W NNPACK.cpp:51] Could not initialize NNPACK! Reason: Unsupported hardware.


diarize: /global/home/groups/fc_phonlab/spkrcorpus/diarized/TextGrid/left/speaker_1/interview_b.TextGrid
diarize: /global/home/groups/fc_phonlab/spkrcorpus/diarized/TextGrid/left/speaker_2/interview_a.TextGrid
diarize: /global/home/groups/fc_phonlab/spkrcorpus/diarized/TextGrid/left/speaker_2/interview_b.TextGrid
diarize: /global/home/groups/fc_phonlab/spkrcorpus/diarized/TextGrid/right/speaker_1/interview_a.TextGrid
diarize: /global/home/groups/fc_phonlab/spkrcorpus/diarized/TextGrid/right/speaker_1/interview_b.TextGrid
diarize: /global/home/groups/fc_phonlab/spkrcorpus/diarized/TextGrid/right/speaker_2/interview_a.TextGrid
diarize: /global/home/groups/fc_phonlab/spkrcorpus/diarized/TextGrid/right/speaker_2/interview_b.TextGrid


## <a id="step-3">Step 3: Combine diarized outputs</a>

In this step we combine the tiers in the `left` and `right` output files into a single `TextGrid` or `eaf` file.

We also use a sorting algorithm to assign labels to the combined tiers. For the interviews in our corpus we expect the person doing the most speaking to be the person being interviewed. We also expect the average intensity of each person's speech to be greater in the channel corresponding to the closer microphone. On the basis of this sorting we assign `Subject probable`, `Subject unlikely`, `Interviewer probable`, and `Interviewer unlikely` to the annotation tiers.

If the sorting algorithm fails to cleanly find the subject and interviewer on separate channels, then `unknown` names are assigned.
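
The "most speaking" criterion can be illustrated with a small sketch that totals the duration of each tier's intervals and picks the largest. This is not the `sort_tiers` implementation, which also weighs per-channel intensity; the `total_duration` and `most_talkative` helpers and the tier names are hypothetical.

```python
def total_duration(intervals):
    """Sum the durations of (t1, t2) speech intervals, in seconds."""
    return sum(t2 - t1 for t1, t2 in intervals)

def most_talkative(tiers):
    """Return the name of the tier with the greatest total speaking time."""
    return max(tiers, key=lambda name: total_duration(tiers[name]))

# Hypothetical tiers from one channel of one interview.
tiers = {
    'SPEAKER_00': [(0.0, 4.0), (6.0, 11.0)],    # 9.0 s of speech
    'SPEAKER_01': [(4.5, 5.5), (11.5, 12.5)],   # 2.0 s of speech
}
subject_candidate = most_talkative(tiers)
print(subject_candidate)
```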

In [11]:
# Find existing left/right label files that need to be combined.
annodir = projroot/'diarized'/output_type
(annodir/'combined').mkdir(parents=True, exist_ok=True)
todo = utils.compare_dirs(
    dir1=annodir/'left', ext1=f'.{output_type}',
    dir2=annodir/'combined', ext2=f'.{output_type}'
)

# Loop over label files to be combined and sort the tiers.
for row in todo.itertuples():
    tierdfs, tiernames = utils.sort_tiers(
        annodir,
        list(channelmap.values()),
        projroot/'audio'/'stereo',
        row.relpath,
        row.fname
    )
    outfile = annodir/'combined'/row.relpath/f'{row.barename}.{output_type}'
    outfile.parent.mkdir(parents=True, exist_ok=True)
    if verbose:
        print(f'sorted: {outfile}')
    if output_type == 'TextGrid':
        df2tg(
            tierdfs,
            tnames=tiernames,
            lbl='label',
            ftype='praat_short',
            outfile=outfile
        )
    elif output_type == 'eaf':
        utils.write_eaf(tierdfs, tiernames, outfile, speech_label, 't1', 't2')

sorted: /global/home/groups/fc_phonlab/spkrcorpus/diarized/TextGrid/combined/speaker_1/interview_a.TextGrid
sorted: /global/home/groups/fc_phonlab/spkrcorpus/diarized/TextGrid/combined/speaker_1/interview_b.TextGrid
sorted: /global/home/groups/fc_phonlab/spkrcorpus/diarized/TextGrid/combined/speaker_2/interview_a.TextGrid
sorted: /global/home/groups/fc_phonlab/spkrcorpus/diarized/TextGrid/combined/speaker_2/interview_b.TextGrid
