# Task 1: Speech-to-Text Benchmarking using Word Error Rate (WER)

This notebook benchmarks three open-source Speech-to-Text engines:
1. OpenAI Whisper  
2. faster-whisper  
3. Vosk  

We evaluate them using **Word Error Rate (WER)** on a small test set of audio files.

Report Link: https://drive.google.com/file/d/1PVfo9WylX4lqOrTO2MT0cnDdks1DNGbn/view?usp=drive_link

## What is Word Error Rate (WER)?

WER is a standard metric for evaluating speech recognition systems.

WER = (Substitutions + Deletions + Insertions) / Number of words in reference

Lower WER = better performance.
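
To make the formula concrete, here is a minimal sketch that computes WER as a word-level edit distance divided by the reference length. `word_error_rate` is a hypothetical helper written for illustration; the notebook itself relies on `jiwer.wer`.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edits needed to turn the reference words seen so far
    # into the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) over 6 reference words
example = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(round(example, 3))
```

Because insertions are counted against the reference length, WER can exceed 1.0 when the hypothesis is much longer than the reference.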

In [None]:
%pip install openai-whisper faster-whisper vosk jiwer soundfile librosa

## Dataset

We use 5 short audio samples (25 - 40 sec) with known ground-truth transcripts.
Each audio file has a corresponding `.txt` file containing the correct transcription.

Directory structure:

/dataset
    /audio-1.wav
    /audio-2.wav
    /audio-3.wav
    /audio-4.wav
    /audio-5.wav

## Cell 2: Imports

In [None]:
import os
import whisper
from faster_whisper import WhisperModel
from vosk import Model, KaldiRecognizer
import json
import wave
import soundfile as sf
from jiwer import wer

## Cell 3: Loading Test Audio

In [None]:
DATASET_PATH = "/dataset"

def load_test_data(dataset_path):
    data = []
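    for file in sorted(os.listdir(dataset_path)):  # sorted for a stable order
        if file.endswith(".wav"):
            audio_path = os.path.join(dataset_path, file)
            # Swap only the extension; str.replace could also hit ".wav" elsewhere in the path
            txt_path = os.path.splitext(audio_path)[0] + ".txt"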
            if os.path.exists(txt_path):
                with open(txt_path, "r", encoding="utf-8") as f:
                    reference = f.read().strip()
                data.append((audio_path, reference))
    return data

test_data = load_test_data(DATASET_PATH)

print(f"Loaded {len(test_data)} audio samples")

## Model 1: OpenAI Whisper
High accuracy multilingual transformer-based model.

In [None]:
whisper_model = whisper.load_model("base")

## Cell 4: Whisper Transcription + WER

In [None]:
def evaluate_whisper(model, test_data):
    wers = []
    for audio_path, reference in test_data:
        result = model.transcribe(audio_path)
        prediction = result["text"].strip()
        error = wer(reference.lower(), prediction.lower())
        wers.append(error)
        print(f"Whisper WER for {os.path.basename(audio_path)}: {error}")
    return sum(wers) / len(wers)

whisper_avg_wer = evaluate_whisper(whisper_model, test_data)
print(f"\nAverage Whisper WER: {whisper_avg_wer}")

## Model 2: faster-whisper
Optimized implementation of Whisper using CTranslate2 for faster inference.

In [None]:
faster_model = WhisperModel("base", device="cpu", compute_type="int8")

## Cell 5: Faster-whisper Transcription + WER

In [None]:
def evaluate_faster_whisper(model, test_data):
    wers = []
    for audio_path, reference in test_data:
        segments, info = model.transcribe(audio_path)
        prediction = " ".join([segment.text for segment in segments]).strip()
        error = wer(reference.lower(), prediction.lower())
        wers.append(error)
        print(f"faster-whisper WER for {os.path.basename(audio_path)}: {error}")
    return sum(wers) / len(wers)

faster_whisper_avg_wer = evaluate_faster_whisper(faster_model, test_data)
print(f"\nAverage faster-whisper WER: {faster_whisper_avg_wer}")

## Model 3: Vosk
Lightweight offline speech recognition engine.

In [None]:
!wget https://alphacephei.com/vosk/models/vosk-model-small-en-us-0.15.zip
!unzip vosk-model-small-en-us-0.15.zip

In [None]:
vosk_model = Model("vosk-model-small-en-us-0.15")

## Cell 6: Vosk Transcription

In [None]:
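def transcribe_vosk(audio_path, model):
    # Vosk expects mono 16-bit PCM WAV input
    result_text = ""

    # The context manager closes the file even if recognition fails
    with wave.open(audio_path, "rb") as wf:
        rec = KaldiRecognizer(model, wf.getframerate())
        rec.SetWords(True)

        while True:
            data = wf.readframes(4000)
            if len(data) == 0:
                break
            # AcceptWaveform returns True when an utterance is complete
            if rec.AcceptWaveform(data):
                res = json.loads(rec.Result())
                result_text += " " + res.get("text", "")

    # Flush whatever is still buffered in the recognizer
    final_res = json.loads(rec.FinalResult())
    result_text += " " + final_res.get("text", "")

    return result_text.strip()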
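
Vosk reads raw PCM frames, so a stereo or non-16-bit WAV file tends to produce poor or empty output rather than an error. The following is a minimal sketch of a pre-flight check using only the standard `wave` module; `is_mono_pcm16` is a hypothetical helper and not part of the original notebook.

```python
import wave


def is_mono_pcm16(audio_path: str) -> bool:
    """Return True if the WAV file is mono, 16-bit, uncompressed PCM."""
    with wave.open(audio_path, "rb") as wf:
        return (
            wf.getnchannels() == 1
            and wf.getsampwidth() == 2      # 2 bytes per sample = 16-bit
            and wf.getcomptype() == "NONE"  # uncompressed PCM
        )
```

Files that fail the check could be converted first, for example with the `ffmpeg -ac 1 -ar 16000` command used in Task 2.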

## Cell 7: Vosk Transcription + WER

In [None]:
def evaluate_vosk(model, test_data):
    wers = []
    for audio_path, reference in test_data:
        prediction = transcribe_vosk(audio_path, model)
        error = wer(reference.lower(), prediction.lower())
        wers.append(error)
        print(f"Vosk WER for {os.path.basename(audio_path)}: {error}")
    return sum(wers) / len(wers)

vosk_avg_wer = evaluate_vosk(vosk_model, test_data)
print(f"\nAverage Vosk WER: {vosk_avg_wer}")
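
The three evaluators lowercase both strings but leave punctuation in place. Whisper-family output usually contains punctuation while Vosk output does not, so a token such as `mat.` can count as an error against `mat` for one engine only. Below is a minimal sketch of a shared normalizer built on the standard library; `normalize_for_wer` is a hypothetical helper, and the exact rules (for example, keeping apostrophes) are an assumption to adapt to the reference transcripts.

```python
import re


def normalize_for_wer(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = text.lower()
    # Replace everything except word characters, whitespace and apostrophes
    text = re.sub(r"[^\w\s']", " ", text)
    text = text.replace("_", " ")  # \w also matches underscores
    return " ".join(text.split())


print(normalize_for_wer("Hello, World!  It's 5 o'clock."))
```

Each evaluator could then call `wer(normalize_for_wer(reference), normalize_for_wer(prediction))` so that all engines are scored on the same footing.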

## Final WER Comparison
Lower is better.

In [None]:
import pandas as pd

results = {
    "Model": ["Whisper", "faster-whisper", "Vosk"],
    "Average WER": [whisper_avg_wer, faster_whisper_avg_wer, vosk_avg_wer]
}

df = pd.DataFrame(results)
df

# Task 2: Transcription using faster-whisper 

Report Link: https://drive.google.com/file/d/1PVfo9WylX4lqOrTO2MT0cnDdks1DNGbn/view?usp=drive_link

### Cell 1: Install Dependencies

In [None]:
%pip install faster-whisper soundfile

### Cell 2: Importing Required Libraries

In [None]:
import soundfile as sf
import math
from faster_whisper import WhisperModel

### Cell 3: Loading Podcast Audio File 

Here we convert the MP3 audio file to a mono, 16 kHz WAV file, which is the input used for transcription in the rest of this task.

In [None]:
!ffmpeg -y -i /podcast-audio/807931c237e75122fd4f0bb4ec9f7d1b.mp3 -ac 1 -ar 16000 clean_audio.wav

### Cell 4: Loading the WAV file and checking its properties

In [None]:
import soundfile as sf

audio, sr = sf.read("clean_audio.wav")
print("Sample rate:", sr)
print("Shape:", audio.shape)
print("Duration (sec):", len(audio) / sr)

### Cell 5: Splitting Audio into 45-Second Chunks

In [None]:
import numpy as np
import math
import soundfile as sf

def split_audio_correct(audio_path, chunk_duration=45):
    audio, sr = sf.read(audio_path)

    if len(audio.shape) > 1:
        audio = audio.mean(axis=1)  # force mono

    total_samples = len(audio)
    samples_per_chunk = int(chunk_duration * sr)

    chunks = []

    for start in range(0, total_samples, samples_per_chunk):
        end = start + samples_per_chunk
        chunk = audio[start:end]

        if len(chunk) == 0:
            continue

        chunks.append(chunk.astype(np.float32))

    print(f"Total chunks created: {len(chunks)}")
    return chunks, sr
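
Fixed-size chunking leaves a final chunk that is usually shorter than 45 seconds, which matters whenever chunk timestamps are written out. As a minimal sketch of the arithmetic, the hypothetical helper `chunk_spans` (not part of the original notebook) lists the start and end time of each chunk for a given total duration:

```python
import math


def chunk_spans(total_seconds: float, chunk_duration: float = 45):
    """Return (start, end) times in seconds for fixed-size chunks."""
    n_chunks = math.ceil(total_seconds / chunk_duration)
    return [
        # The last chunk is clipped to the end of the audio
        (i * chunk_duration, min((i + 1) * chunk_duration, total_seconds))
        for i in range(n_chunks)
    ]


print(chunk_spans(100))  # [(0, 45), (45, 90), (90, 100)]
```

Note that cutting at fixed offsets can split a word or sentence across two chunks; the sketch only covers the time bookkeeping.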

### Cell 6: Loading faster-whisper Model

In [None]:
from faster_whisper import WhisperModel

model = WhisperModel(
    "base",
    device="cpu",
    compute_type="int8"
)

### Cell 7: Timeline formatting

In [None]:
def format_timestamp(seconds):
    hrs = int(seconds // 3600)
    mins = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    return f"{hrs:02d}:{mins:02d}:{secs:02d}"

### Cell 8: Transcribing Long Audio using Chunking

In [None]:
def transcribe_long_audio_with_timestamps(model, audio_path, chunk_duration=45):
    chunks, sr = split_audio_correct(audio_path, chunk_duration)
    full_text = []

    print("\n================= STARTING TRANSCRIPTION =================\n")

    for idx, chunk in enumerate(chunks):
        start_time = idx * chunk_duration
        end_time = start_time + (len(chunk) / sr)

        start_ts = format_timestamp(start_time)
        end_ts = format_timestamp(end_time)

        print(f"\n[{start_ts} - {end_ts}]")
        print("-" * 60)

        segments, info = model.transcribe(
            chunk,
            language="en",
            beam_size=5,
            temperature=0.0,
            vad_filter=True,
            vad_parameters=dict(min_silence_duration_ms=500)
        )

        chunk_text = ""
        for segment in segments:
            chunk_text += segment.text + " "

        chunk_text = chunk_text.strip()

        # Print chunk transcription
        print(chunk_text)

        # Store
        full_text.append(chunk_text)

    print("\n================= TRANSCRIPTION COMPLETE =================\n")

    final_transcript = " ".join(full_text)

    print("\n============= FULL TRANSCRIPT (COMBINED) =============\n")
    print(final_transcript)

    return full_text, final_transcript

### Cell 9: Running Transcription on Podcast Audio

In [None]:
chunk_texts, final_transcript = transcribe_long_audio_with_timestamps(
    model,
    "clean_audio.wav",
    chunk_duration=45
)

### Cell 10: Saving Final Transcript

In [None]:
OUTPUT_PATH = "/output/podcast_transcript_with_timestamps.txt"

with open(OUTPUT_PATH, "w", encoding="utf-8") as f:
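    for idx, text in enumerate(chunk_texts):
        start_time = idx * 45
        # Clip the end to the audio length (audio, sr come from Cell 4)
        # so the last, shorter chunk gets its real end time
        end_time = min(start_time + 45, len(audio) / sr)

        start_ts = format_timestamp(start_time)
        end_ts = format_timestamp(end_time)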

        f.write(f"[{start_ts} - {end_ts}]\n")
        f.write(text + "\n\n")

    f.write("\n================ FULL TRANSCRIPT ================\n\n")
    f.write(final_transcript)

print("✅ Transcript saved to:", OUTPUT_PATH)

# Task 3: Research Topic Segmentation methods

This task surveyed the available methods for topic segmentation. The linked document is a concise report of the findings.

Link: https://drive.google.com/file/d/1PVfo9WylX4lqOrTO2MT0cnDdks1DNGbn/view?usp=drive_link

Transformer-based deep learning approaches were selected for segmenting topics.

# Task 4: Topic Segmentation using Transformer-Based Deep Learning Approaches (along with validation)

Report: https://drive.google.com/file/d/1DNxTWfeJeloGkiwoQOCv5WbXB9FCC7oo/view?usp=drive_link

### Cell 1: Install Dependencies

In [None]:
!pip install transformers sentencepiece nltk torch scikit-learn

### Cell 2: Importing Dependencies

In [None]:
import nltk
import torch
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import LongformerTokenizer, LongformerModel
from transformers import pipeline
import numpy as np

nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer data required by newer NLTK releases

### Cell 4: Load Transcript

In [None]:
TRANSCRIPT_PATH = "/podcast_transcription/podcast_transcript_with_timestamps.txt"

with open(TRANSCRIPT_PATH, "r", encoding="utf-8") as f:
    transcript = f.read()

print(transcript[:1000])

### Cell 5: Sentence Tokenization

In [None]:
sentences = nltk.sent_tokenize(transcript)
print(f"Total sentences: {len(sentences)}")
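
The transcript file is read as-is, so any `[HH:MM:SS - HH:MM:SS]` header lines and any combined copy of the text are tokenized along with the speech. The sketch below assumes the file has the layout written by the save cell in Task 2 (timestamp line, chunk text, then a `FULL TRANSCRIPT` banner followed by the combined text). `strip_transcript_markup` is a hypothetical helper, not part of the original notebook.

```python
import re

TIMESTAMP_LINE = re.compile(r"^\[\d{2}:\d{2}:\d{2} - \d{2}:\d{2}:\d{2}\]\s*$")
FULL_BANNER = re.compile(r"^=+ FULL TRANSCRIPT =+\s*$")


def strip_transcript_markup(raw: str) -> str:
    """Keep only the per-chunk text: drop timestamp lines and the combined copy."""
    kept = []
    for line in raw.splitlines():
        if FULL_BANNER.match(line):
            break  # everything after the banner repeats the chunk text
        if TIMESTAMP_LINE.match(line) or not line.strip():
            continue
        kept.append(line.strip())
    return " ".join(kept)


sample = (
    "[00:00:00 - 00:00:45]\nHello and welcome.\n\n"
    "[00:00:45 - 00:01:30]\nToday we talk about sleep.\n\n"
    "\n================ FULL TRANSCRIPT ================\n\n"
    "Hello and welcome. Today we talk about sleep."
)
print(strip_transcript_markup(sample))
```

Passing the cleaned string to `nltk.sent_tokenize` would then yield each spoken sentence once, without timestamp fragments.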

### Cell 6: Helper Cosine Similarity

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

## Approach 1: BERT Based segmentation

**Logic:**

We use **BERT Embeddings** -> Compute similarity -> Drop = Topic Boundary
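
The idea can be sketched with toy two-dimensional vectors standing in for sentence embeddings: compute the cosine similarity of each adjacent pair and mark a boundary wherever it falls below a threshold. The helper names and toy vectors are illustrative assumptions; real BERT embeddings have 768 dimensions and typically need a different threshold.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def boundaries_from_embeddings(embeddings, threshold):
    """Return indices i where sentence i and sentence i+1 are dissimilar."""
    sims = [
        cosine(embeddings[i], embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]
    return [i for i, sim in enumerate(sims) if sim < threshold]


toy = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],  # topic shift happens here
    [0.1, 0.9],
]
print(boundaries_from_embeddings(toy, threshold=0.6))  # [1]
```

A boundary index `i` means the segment ends after sentence `i`, which matches how `build_segments` slices the sentence list.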

### Cell 1: Load BERT model

In [None]:
from transformers import BertModel

bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert_model = BertModel.from_pretrained("bert-base-uncased")
bert_model.eval()

### Cell 2: BERT Embedding Function 

In [None]:
def get_bert_embedding(text):
    inputs = bert_tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = bert_model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).numpy()

### Cell 3: Compute Similarities (BERT)

In [None]:
bert_embeddings = [get_bert_embedding(s) for s in sentences]

bert_similarities = []
for i in range(len(bert_embeddings) - 1):
    sim = cosine_similarity(bert_embeddings[i], bert_embeddings[i + 1])[0][0]
    bert_similarities.append(sim)

### Cell 4: Detect Boundaries (BERT)

In [None]:
bert_threshold = 0.6
bert_boundaries = [i for i, sim in enumerate(bert_similarities) if sim < bert_threshold]

print("BERT topic boundaries at sentence indices:", bert_boundaries)

### Cell 5: Build Segments (BERT)

In [None]:
def build_segments(sentences, boundaries):
    segments = []
    start = 0
    for boundary in boundaries:
        segment = " ".join(sentences[start:boundary+1])
        segments.append(segment)
        start = boundary + 1
    segments.append(" ".join(sentences[start:]))
    return segments

bert_segments = build_segments(sentences, bert_boundaries)

print(f"Total BERT Segments: {len(bert_segments)}")

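## Approach 2: GPT Based Segmentation

Note: the pipeline loaded below is `facebook/bart-large-mnli`, a BART model fine-tuned on MNLI and used here for zero-shot classification, not a GPT model. The `gpt_` names are kept only as labels.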

### Cell 1: Load GPT Pipeline

In [None]:
gpt_classifier = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli"
)

### Cell 2: GPT Topic change detection

In [None]:
def gpt_topic_change(sent1, sent2):
    # Classify the sentence pair as one sequence against two candidate labels
    result = gpt_classifier(
        sequences=sent1 + " " + sent2,
        candidate_labels=["same topic", "different topic"]
    )
    # Labels come back sorted by score, highest first
    return result["labels"][0] == "different topic"

### Cell 3: Detect Boundaries (GPT)

In [None]:
gpt_boundaries = []

for i in range(len(sentences) - 1):
    if gpt_topic_change(sentences[i], sentences[i + 1]):
        gpt_boundaries.append(i)

print("GPT topic boundaries:", gpt_boundaries)

### Cell 4: Build Segments (GPT)

In [None]:
gpt_segments = build_segments(sentences, gpt_boundaries)
print(f"Total GPT Segments: {len(gpt_segments)}")

## Approach 3: Longformer Based Segmentation

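**Longformer is built for long documents (up to 4,096 tokens here), which makes it a natural candidate for podcast transcripts.** In the cells that follow it embeds one sentence at a time, so its long-context capacity is not actually exercised.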

### Cell 1: Load Longformer

In [None]:
long_tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
long_model = LongformerModel.from_pretrained("allenai/longformer-base-4096")
long_model.eval()

### Cell 2: Longformer Embedding Function

In [None]:
def get_longformer_embedding(text):
    inputs = long_tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
    with torch.no_grad():
        outputs = long_model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).numpy()

### Cell 3: Compute Similarities (Longformer)

In [None]:
long_embeddings = [get_longformer_embedding(s) for s in sentences]

long_similarities = []
for i in range(len(long_embeddings) - 1):
    sim = cosine_similarity(long_embeddings[i], long_embeddings[i + 1])[0][0]
    long_similarities.append(sim)

### Cell 4: Detect Boundaries (Longformer)

In [None]:
long_threshold = 0.65
long_boundaries = [i for i, sim in enumerate(long_similarities) if sim < long_threshold]

print("Longformer topic boundaries:", long_boundaries)

### Cell 5: Build Segments (Longformer)

In [None]:
long_segments = build_segments(sentences, long_boundaries)
print(f"Total Longformer Segments: {len(long_segments)}")

## Comparison and Best model selection

### Cell 1: Compare Segment Counts

In [None]:
print("BERT Segments:", len(bert_segments))
print("GPT Segments:", len(gpt_segments))
print("Longformer Segments:", len(long_segments))

### Cell 2: Print Sample Segments

In [None]:
print("\n=== BERT Sample Segment ===\n", bert_segments[0][:500])
print("\n=== GPT Sample Segment ===\n", gpt_segments[0][:500])
print("\n=== Longformer Sample Segment ===\n", long_segments[0][:500])

### Cell 3: Simple Evaluation Logic

In [None]:
results = {
    "BERT": len(bert_segments),
    "GPT": len(gpt_segments),
    "Longformer": len(long_segments)
}

best_model = min(results, key=lambda x: abs(results[x] - 14))  # 14 = expected segments
print("Best performing model:", best_model)

## Saving the Segments

### Cell 1: Saving the segments

In [None]:
def save_all_segments_to_file(segments, model_name):
    file_path = f"/kaggle/working/{model_name.lower()}_all_segments.txt"
    
    with open(file_path, "w", encoding="utf-8") as f:
        for i, segment in enumerate(segments):
            f.write(f"--- Segment {i+1} ---\n")
            f.write(segment.strip() + "\n\n")
    
    print(f"✅ Saved ALL {len(segments)} segments to:", file_path)

### Cell 2: Running the function to save

In [None]:
save_all_segments_to_file(bert_segments, "BERT")
save_all_segments_to_file(gpt_segments, "GPT")
save_all_segments_to_file(long_segments, "Longformer")

## Approach 4: Topic Segmentation using LLM 

### Cell 1: Initial setup + Importing Libraries

In [None]:
%pip install openai tiktoken

In [None]:
import os
import tiktoken
from openai import OpenAI

### Cell 2: Setting up the LLM (GPT-4o via GitHub Models)

In [None]:
client = OpenAI(
    base_url="https://models.github.ai/inference",
    api_key="removed for privacy",
)

### Cell 3: System Prompt

In [None]:
SYSTEM_PROMPT = """
You are an expert system for topic segmentation of long podcast transcripts.

Your task:
- Segment the transcript into meaningful topical sections.
- Detect natural discourse shifts (not paragraph breaks).
- Assign a short, clear title to each segment.

Rules:
1. Each segment must have:
   - A title (3–7 words)
   - Start and end timestamps
   - Original transcript text (no paraphrasing)
2. Do NOT summarize.
3. Do NOT remove or rewrite text.
4. Preserve chronological order.
5. Output format MUST be:

Segment N:
Title: <title>
Time: <start timestamp> - <end timestamp>
Text:
<original transcript text>

6. Ensure all content is covered.
"""

### Cell 4: Loading the transcription

In [None]:
TRANSCRIPT_PATH = "/podcast_transcription/podcast_transcript_with_timestamps.txt"

with open(TRANSCRIPT_PATH, "r", encoding="utf-8") as f:
    full_transcript = f.read()

### Cell 5: Splitting the transcript into chunks for the LLM

In [None]:
tokenizer = tiktoken.get_encoding("cl100k_base")

MAX_TOKENS = 3500  # safe margin
OVERLAP = 200

def chunk_text(text):
    tokens = tokenizer.encode(text)
    chunks = []

    start = 0
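    while start < len(tokens):
        end = start + MAX_TOKENS
        chunk = tokenizer.decode(tokens[start:end])
        chunks.append(chunk)
        if end >= len(tokens):
            break  # last chunk reached; avoid a trailing overlap-only chunk
        start = end - OVERLAP  # step back so neighbouring chunks share context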

    return chunks


transcript_chunks = chunk_text(full_transcript)
print("Total LLM chunks:", len(transcript_chunks))

### Cell 6: Running the LLM on each chunk

In [None]:
llm_outputs = []

for i, chunk in enumerate(transcript_chunks):
    print(f"Processing chunk {i+1}/{len(transcript_chunks)}")

    response = client.chat.completions.create(
        model="openai/gpt-4o",
        temperature=0.0,
        max_tokens=4096,
        top_p=1,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": chunk}
        ]
    )

    llm_outputs.append(response.choices[0].message.content)

### Cell 7: Viewing LLM outputs

In [None]:
final_llm_segmentation = "\n\n".join(llm_outputs)

print(final_llm_segmentation[:1500])
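
Because the system prompt fixes the output layout (`Segment N:`, `Title:`, `Time:`, `Text:`), the combined string can be turned into structured records. The following is a minimal sketch that assumes the model followed that layout exactly; `parse_segments` and the sample text are hypothetical and not taken from an actual run.

```python
import re

SEGMENT_PATTERN = re.compile(
    r"Segment\s+(?P<number>\d+):\s*\n"
    r"Title:\s*(?P<title>[^\n]+?)\s*\n"
    r"Time:\s*(?P<start>\S+)\s*-\s*(?P<end>\S+)\s*\n"
    r"Text:\s*\n"
    r"(?P<text>.*?)(?=\n\s*Segment\s+\d+:|\Z)",
    re.DOTALL,
)


def parse_segments(llm_text: str):
    """Extract one dict per segment from text in the prompt's output format."""
    return [
        {
            "number": int(m["number"]),
            "title": m["title"],
            "start": m["start"],
            "end": m["end"],
            "text": m["text"].strip(),
        }
        for m in SEGMENT_PATTERN.finditer(llm_text)
    ]


sample = """Segment 1:
Title: Welcome and Introductions
Time: 00:00:00 - 00:01:30
Text:
Hello and welcome to the show.

Segment 2:
Title: Why Sleep Matters
Time: 00:01:30 - 00:03:00
Text:
Today we talk about sleep.
"""
for seg in parse_segments(sample):
    print(seg["number"], seg["title"], seg["start"], seg["end"])
```

Each chunk is sent as a separate request and neighbouring chunks overlap by `OVERLAP` tokens, so segment numbers may restart per chunk and text near chunk borders may appear twice. Any downstream use of the parsed records would need to account for that.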

### Cell 8: Saving LLM Outputs

In [None]:
LLM_OUTPUT_PATH = "/output/llm_topic_segments.txt"

with open(LLM_OUTPUT_PATH, "w", encoding="utf-8") as f:
    f.write(final_llm_segmentation)

print("✅ LLM topic segmentation saved to:", LLM_OUTPUT_PATH)