# Agentic Pipeline for Speaker Diarization and Quality Check

This notebook implements a Python pipeline for generating subtitles from audio files, featuring speaker diarization and a quality check agent. It's useful for analyzing short clips with multiple speakers, producing SRT files with labeled dialogues.

## Features
- Speaker diarization with consistent labeling.
- Dialogue-level transcription.
- Confidence-based quality evaluation per segment.

## Approach
1. **Audio Preparation**: Standardize audio to mono PCM WAV.
2. **Transcription**: Leverage Whisper for accurate segmenting.
3. **Diarization**: Use pyannote embeddings and clustering for speaker assignment—more reliable than basic methods.
4. **Merging**: Combine segments for natural flow.
5. **Output**: SRT with labels.
6. **Quality Agent**: Rule-based confidence scoring and feedback.

## Pipeline Flow
Input audio → Convert → Transcribe → Embed & Cluster → Merge → SRT & Quality Report.

## Limitations and Improvements
- Limitations: Fixed speaker count; best on clean audio.
- Improvements: Add overlap detection; LLM feedback; auto-speaker estimation.

In [None]:
!apt install ffmpeg -y
!pip install -q git+https://github.com/openai/whisper.git
!pip install -q git+https://github.com/pyannote/pyannote-audio
!pip install -q speechbrain
!pip install -q scikit-learn pandas

import whisper
import datetime
import subprocess
import torch
import pyannote.audio
from pyannote.audio.pipelines.speaker_verification import PretrainedSpeakerEmbedding
from pyannote.audio import Audio
from pyannote.core import Segment
import wave
import contextlib
from sklearn.cluster import AgglomerativeClustering
import numpy as np
from sklearn.metrics.pairwise import cosine_distances
import math
import pandas as pd
import os
from google.colab import files

In [2]:
# Upload audio file
uploaded = files.upload()
path = next(iter(uploaded))
print(f"Uploaded audio: {path}")

Saving audio1.wav to audio1.wav
Uploaded audio: audio1.wav


In [3]:
converted_path = 'audio_mono_pcm.wav'
result = subprocess.run(['ffmpeg', '-i', path, '-f', 'wav', '-acodec', 'pcm_s16le', '-ar', '16000', '-ac', '1', converted_path, '-y'], capture_output=True, text=True)
print("ffmpeg stdout:", result.stdout)
print("ffmpeg stderr:", result.stderr)
if result.returncode != 0:
    print("Conversion failed! Check stderr above (e.g., input file invalid or ffmpeg error).")
else:
    if os.path.exists(converted_path):
        print("Converted successfully: audio_mono_pcm.wav")
        print(f"File size: {os.path.getsize(converted_path)} bytes")
    else:
        print("Conversion completed but audio_mono_pcm.wav not found!")
path = converted_path

ffmpeg stdout: 
ffmpeg stderr: ffmpeg version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2000-2021 the FFmpeg developers
  built with gcc 11 (Ubuntu 11.2.0-19ubuntu1)
  configuration: --prefix=/usr --extra-version=0ubuntu0.22.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-l

In [4]:
# Parameters
num_speakers = 2 #@param {type:"integer"}

language = 'English' #@param ['any', 'English']

model_size = 'large' #@param ['tiny', 'base', 'small', 'medium', 'large']


In [5]:
# Load Whisper model
model = whisper.load_model(model_size)

# Load embedding model
embedding_model = PretrainedSpeakerEmbedding(
    "speechbrain/spkrec-ecapa-voxceleb",
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"))

100%|█████████████████████████████████████| 2.88G/2.88G [01:13<00:00, 42.3MiB/s]
INFO:speechbrain.utils.fetching:Fetch hyperparams.yaml: Fetching from HuggingFace Hub 'speechbrain/spkrec-ecapa-voxceleb' if not cached
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


hyperparams.yaml: 0.00B [00:00, ?B/s]

DEBUG:speechbrain.utils.fetching:Fetch: Local file found, creating symlink '/root/.cache/huggingface/hub/models--speechbrain--spkrec-ecapa-voxceleb/snapshots/0f99f2d0ebe89ac095bcc5903c4dd8f72b367286/hyperparams.yaml' -> '/root/.cache/torch/pyannote/speechbrain/hyperparams.yaml'
INFO:speechbrain.utils.fetching:Fetch custom.py: Fetching from HuggingFace Hub 'speechbrain/spkrec-ecapa-voxceleb' if not cached
DEBUG:speechbrain.utils.checkpoints:Registered checkpoint save hook for _save
DEBUG:speechbrain.utils.checkpoints:Registered checkpoint load hook for _load
DEBUG:speechbrain.utils.checkpoints:Registered parameter transfer hook for _load
  wrapped_fwd = torch.cuda.amp.custom_fwd(fwd, cast_inputs=cast_inputs)
DEBUG:speechbrain.utils.checkpoints:Registered checkpoint save hook for save
DEBUG:speechbrain.utils.checkpoints:Registered checkpoint load hook for load_if_possible
DEBUG:speechbrain.utils.parameter_transfer:Collecting files (or symlinks) for pretraining in /root/.cache/torch/pyann

embedding_model.ckpt:   0%|          | 0.00/83.3M [00:00<?, ?B/s]

DEBUG:speechbrain.utils.fetching:Fetch: Local file found, creating symlink '/root/.cache/huggingface/hub/models--speechbrain--spkrec-ecapa-voxceleb/snapshots/0f99f2d0ebe89ac095bcc5903c4dd8f72b367286/embedding_model.ckpt' -> '/root/.cache/torch/pyannote/speechbrain/embedding_model.ckpt'
DEBUG:speechbrain.utils.parameter_transfer:Set local path in self.paths["embedding_model"] = /root/.cache/torch/pyannote/speechbrain/embedding_model.ckpt
INFO:speechbrain.utils.fetching:Fetch mean_var_norm_emb.ckpt: Fetching from HuggingFace Hub 'speechbrain/spkrec-ecapa-voxceleb' if not cached


mean_var_norm_emb.ckpt:   0%|          | 0.00/1.92k [00:00<?, ?B/s]

DEBUG:speechbrain.utils.fetching:Fetch: Local file found, creating symlink '/root/.cache/huggingface/hub/models--speechbrain--spkrec-ecapa-voxceleb/snapshots/0f99f2d0ebe89ac095bcc5903c4dd8f72b367286/mean_var_norm_emb.ckpt' -> '/root/.cache/torch/pyannote/speechbrain/mean_var_norm_emb.ckpt'
DEBUG:speechbrain.utils.parameter_transfer:Set local path in self.paths["mean_var_norm_emb"] = /root/.cache/torch/pyannote/speechbrain/mean_var_norm_emb.ckpt
INFO:speechbrain.utils.fetching:Fetch classifier.ckpt: Fetching from HuggingFace Hub 'speechbrain/spkrec-ecapa-voxceleb' if not cached


classifier.ckpt:   0%|          | 0.00/5.53M [00:00<?, ?B/s]

DEBUG:speechbrain.utils.fetching:Fetch: Local file found, creating symlink '/root/.cache/huggingface/hub/models--speechbrain--spkrec-ecapa-voxceleb/snapshots/0f99f2d0ebe89ac095bcc5903c4dd8f72b367286/classifier.ckpt' -> '/root/.cache/torch/pyannote/speechbrain/classifier.ckpt'
DEBUG:speechbrain.utils.parameter_transfer:Set local path in self.paths["classifier"] = /root/.cache/torch/pyannote/speechbrain/classifier.ckpt
INFO:speechbrain.utils.fetching:Fetch label_encoder.txt: Fetching from HuggingFace Hub 'speechbrain/spkrec-ecapa-voxceleb' if not cached


label_encoder.txt: 0.00B [00:00, ?B/s]

DEBUG:speechbrain.utils.fetching:Fetch: Local file found, creating symlink '/root/.cache/huggingface/hub/models--speechbrain--spkrec-ecapa-voxceleb/snapshots/0f99f2d0ebe89ac095bcc5903c4dd8f72b367286/label_encoder.txt' -> '/root/.cache/torch/pyannote/speechbrain/label_encoder.ckpt'
DEBUG:speechbrain.utils.parameter_transfer:Set local path in self.paths["label_encoder"] = /root/.cache/torch/pyannote/speechbrain/label_encoder.ckpt
INFO:speechbrain.utils.parameter_transfer:Loading pretrained files for: embedding_model, mean_var_norm_emb, classifier, label_encoder
DEBUG:speechbrain.utils.parameter_transfer:Redirecting (loading from local path): embedding_model -> /root/.cache/torch/pyannote/speechbrain/embedding_model.ckpt
DEBUG:speechbrain.utils.parameter_transfer:Redirecting (loading from local path): mean_var_norm_emb -> /root/.cache/torch/pyannote/speechbrain/mean_var_norm_emb.ckpt
DEBUG:speechbrain.utils.parameter_transfer:Redirecting (loading from local path): classifier -> /root/.cac

In [6]:
with contextlib.closing(wave.open(path, 'r')) as f:
    frames = f.getnframes()
    rate = f.getframerate()
    duration = frames / float(rate)
print(f"Audio duration: {duration} seconds")

Audio duration: 87.8643125 seconds


In [7]:
# Transcribe
result = model.transcribe(path, verbose=False)  # verbose=False to get avg_logprob
segments = result["segments"]

Detected language: English


100%|██████████| 8786/8786 [01:10<00:00, 124.73frames/s]


In [8]:
audio = Audio(sample_rate=16000, mono=True)  # pyannote requires 16kHz mono

def segment_embedding(segment):
    start = segment["start"]
    end = min(duration, segment["end"])
    clip = Segment(start, end)
    waveform, sample_rate = audio.crop(path, clip)
    return embedding_model(waveform.unsqueeze(0))

In [9]:
embeddings = np.zeros(shape=(len(segments), 192))
for i, segment in enumerate(segments):
    embeddings[i] = segment_embedding(segment)

embeddings = np.nan_to_num(embeddings)

In [10]:
clustering = AgglomerativeClustering(num_speakers).fit(embeddings)
labels = clustering.labels_
for i in range(len(segments)):
    segments[i]["speaker"] = 'SPEAKER ' + str(labels[i] + 1)

In [11]:
# Merge consecutive segments from the same speaker if within 1 second
merged_segments = []
current_segment = None

for segment in segments:
    if current_segment is None:
        current_segment = segment
    elif (current_segment['speaker'] == segment['speaker']) and (segment['start'] - current_segment['end'] < 1):
        # Merge
        current_segment['text'] += ' ' + segment['text'].strip()
        current_segment['end'] = segment['end']
        # Average avg_logprob for merged (approximate)
        current_segment['avg_logprob'] = (current_segment['avg_logprob'] + segment['avg_logprob']) / 2
    else:
        merged_segments.append(current_segment)
        current_segment = segment

if current_segment:
    merged_segments.append(current_segment)

segments = merged_segments  # Replace with merged

In [12]:
def time_format(secs):
    td = datetime.timedelta(seconds=round(secs))
    return str(td).replace(':', ',', 2)[:-3] + ',' + str(int((secs % 1) * 1000)).zfill(3)

with open("output.srt", "w") as f:
    for i, segment in enumerate(segments, 1):
        start = time_format(segment["start"])
        end = time_format(segment["end"])
        f.write(f"{i}\n{start} --> {end}\n[{segment['speaker']}] {segment['text'].strip()}\n\n")
print("SRT generated: output.srt")

SRT generated: output.srt


In [13]:
# Compute cluster centers
cluster_centers = {}
for label in set(labels):
    cluster_embeds = embeddings[labels == label]
    cluster_centers[label] = np.mean(cluster_embeds, axis=0)

# Evaluate each segment
data = []
for i, segment in enumerate(segments):
    embed = embeddings[i]
    label_idx = labels[i]  # 0-based
    center = cluster_centers[label_idx]

    # Diarization confidence: 1 - cosine distance (higher = better fit)
    dist = cosine_distances([embed], [center])[0][0]
    diar_conf = (1 - dist) * 100

    # Transcription confidence: exp(avg_logprob) * 100 (geometric mean probability)
    trans_conf = math.exp(segment['avg_logprob']) * 100

    # Overall confidence: average
    overall_conf = (diar_conf + trans_conf) / 2

    # Feedback based on thresholds
    if overall_conf > 90:
        feedback = "Clear and consistent."
    elif overall_conf > 80:
        feedback = "Possible clarity issues."
    else:
        feedback = "Needs review."

    data.append({
        'chunk': segment['text'].strip(),
        'speaker': segment['speaker'],
        'trans_conf': round(trans_conf, 2),
        'diar_conf': round(diar_conf, 2),
        'overall_conf': round(overall_conf, 2),
        'feedback': f"Confidence: {overall_conf:.1f}. Feedback: {feedback}"
    })

df = pd.DataFrame(data)
display(df)

Unnamed: 0,chunk,speaker,trans_conf,diar_conf,overall_conf,feedback
0,Excuse me. Excuse me. Sorry. Do you speak Engl...,SPEAKER 1,83.0,37.63,60.32,Confidence: 60.3. Feedback: Needs review.
1,"No, I don't. Sorry.",SPEAKER 2,83.0,67.1,75.05,Confidence: 75.1. Feedback: Needs review.
2,Oh. My car's broken down and I wondered if you...,SPEAKER 1,83.0,59.01,71.0,Confidence: 71.0. Feedback: Needs review.
3,"Well, you know, that's wasted on me. I don't u...",SPEAKER 2,83.0,74.65,78.83,Confidence: 78.8. Feedback: Needs review.
4,You don't speak any English at all?,SPEAKER 1,83.0,44.75,63.88,Confidence: 63.9. Feedback: Needs review.
5,"Not a word, no. It's one of those things where...",SPEAKER 2,84.77,72.94,78.85,Confidence: 78.9. Feedback: Needs review.
6,Hi. My car's broken down and I need to find a ...,SPEAKER 1,85.36,68.81,77.09,Confidence: 77.1. Feedback: Needs review.
7,No. I'm sorry. I didn't understand that at all.,SPEAKER 1,85.36,60.31,72.84,Confidence: 72.8. Feedback: Needs review.
8,"All right. Well, thanks.",SPEAKER 1,85.36,57.65,71.51,Confidence: 71.5. Feedback: Needs review.
9,"Tell you what, if you go down that way, about ...",SPEAKER 2,85.36,36.06,60.71,Confidence: 60.7. Feedback: Needs review.
