# Comparative Baseline: Whisper, Wav2Vec2, and Conformer Models
### Zane A. Graper | MSAI 699 Capstone | November 2025

This notebook evaluates three pretrained automatic speech recognition (ASR) models on child-speech datasets.

**Models Tested:**
1. **Whisper-Small (OpenAI)** – a multilingual encoder-decoder transformer trained on 680k hours of weakly supervised (web-transcribed) audio.  
2. **Wav2Vec2-Large (Facebook)** – a self-supervised model that learns speech representations from raw audio.  
3. **Conformer-CTC (Facebook)** – a convolution-enhanced transformer model optimized for speech recognition.

**Goal:**  
Measure zero-shot performance of each model on two child-speech datasets:
- **TomRoma corpus** (`audiodata/`)  
- **CSLU Kids’ Speech corpus** (`kids_speech_wav/`)

This establishes a comparative baseline before any domain-specific fine-tuning or phoneme-aware adaptation.

### Notebook Setup:

In [None]:
# ==========================================
# GOOGLE DRIVE & PATH SETUP
# ==========================================
from google.colab import drive
import os
import torch
import pandas as pd
from tqdm import tqdm

drive.mount('/content/drive')

BASE_DIR = "/content/drive/MyDrive/Capstone"
os.environ["HF_HOME"] = f"{BASE_DIR}/hf_cache"

# Dataset dictionary
DATASETS = {
    "tomroma": {
        "csv": f"{BASE_DIR}/Baseline/transcriptions_cleaned.csv",
        "audio": f"{BASE_DIR}/audiodata"
    },
    "cslu": {
        "csv": f"{BASE_DIR}/Baseline/child_speech_cleaned.csv",
        "audio": f"{BASE_DIR}/kids_speech_wav"
    }
}

device = 0 if torch.cuda.is_available() else -1
print("Device index for pipelines:", device)

## Test 1 – Whisper-Small

The Whisper-Small model is a 244M-parameter encoder-decoder transformer trained on large multilingual data.  
It performs direct end-to-end transcription from waveform to text.

This test measures Whisper’s ability to generalize to child voices, which differ acoustically from adult speech.  
We expect higher word error rates than on adult speech because of differences in pitch and articulation.

In [None]:
# ==========================================
# WHISPER-SMALL
# ==========================================
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    device=device
)

for label, paths in DATASETS.items():
    csv_path, audio_dir = paths["csv"], paths["audio"]
    print(f"\n=== Running Whisper-Small on {label.upper()} ===")

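    df = pd.read_csv(csv_path)
    if "whisper_small" not in df.columns:
        df["whisper_small"] = ""
    # An empty column reloads from CSV as float NaN; cast to object so
    # string transcriptions can be assigned without dtype conflicts.
    df["whisper_small"] = df["whisper_small"].astype("object")

    for i, row in tqdm(df.iterrows(), total=len(df)):
        # Resume support: skip rows that already hold a non-empty transcription.
        if isinstance(row["whisper_small"], str) and row["whisper_small"].strip():
            continue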
        try:
            path = os.path.join(audio_dir, row["filename"])
            result = asr(path)
            df.at[i, "whisper_small"] = result["text"]
        except Exception as e:
            df.at[i, "whisper_small"] = f"[error: {e}]"

    df.to_csv(csv_path, index=False)
    print(f"✅ Saved results for {label} to {csv_path}")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Device set to use cuda:0
100%|██████████| 625/625 [17:09<00:00,  1.65s/it]


## Test 2 – Wav2Vec2-Large

The Wav2Vec2-Large-960h-lv60 model is **pretrained with self-supervision** on about 60k hours of unlabelled speech and then fine-tuned on 960 hours of labelled LibriSpeech audio.  
Unlike Whisper, which generates text with an autoregressive decoder, Wav2Vec2 uses a **CTC (Connectionist Temporal Classification)** head that predicts a character for each audio frame.

This experiment evaluates how well a self-supervised model trained on adult data transcribes children’s speech.
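
As a rough illustration of what a CTC head implies at inference time, greedy CTC decoding takes the most likely token for each frame, collapses consecutive repeats, and drops the blank token. The sketch below uses a toy vocabulary; the function name `ctc_greedy_decode`, the token ids, and `blank_id=0` are illustrative assumptions, not the Hugging Face implementation.

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse per-frame argmax ids into text using the CTC rule."""
    chars = []
    prev = None
    for idx in frame_ids:
        # Emit a token only when it differs from the previous frame
        # and is not the blank symbol.
        if idx != prev and idx != blank_id:
            chars.append(id_to_char[idx])
        prev = idx
    return "".join(chars)


# Toy vocabulary (hypothetical ids): 0 is the CTC blank.
vocab = {0: "<blank>", 1: "h", 2: "e", 3: "l", 4: "o"}

# The blank between the two runs of "l" keeps the double letter.
print(ctc_greedy_decode([1, 1, 2, 3, 3, 0, 3, 4, 4], vocab))  # hello

# Without the blank, the repeated "l" frames collapse into one letter.
print(ctc_greedy_decode([1, 2, 3, 3, 4], vocab))  # helo
```

The `torch.argmax` followed by `processor.batch_decode` in the Conformer cell later in this notebook follows the same idea on real logits.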

In [None]:
# ==========================================
# WAV2VEC2-LARGE
# ==========================================
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="facebook/wav2vec2-large-960h-lv60",
    device=device
)

for label, paths in DATASETS.items():
    csv_path, audio_dir = paths["csv"], paths["audio"]
    print(f"\n=== Running Wav2Vec2-Large on {label.upper()} ===")

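    df = pd.read_csv(csv_path)
    if "wav2vec2" not in df.columns:
        df["wav2vec2"] = ""
    # An empty column reloads from CSV as float NaN; cast to object so
    # string transcriptions can be assigned without dtype conflicts.
    df["wav2vec2"] = df["wav2vec2"].astype("object")

    for i, row in tqdm(df.iterrows(), total=len(df)):
        # Resume support: skip rows that already hold a non-empty transcription.
        if isinstance(row["wav2vec2"], str) and row["wav2vec2"].strip():
            continue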
        try:
            path = os.path.join(audio_dir, row["filename"])
            result = asr(path)
            df.at[i, "wav2vec2"] = result["text"]
        except Exception as e:
            df.at[i, "wav2vec2"] = f"[error: {e}]"

    df.to_csv(csv_path, index=False)
    print(f"✅ Saved results for {label} to {csv_path}")


Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-large-960h-lv60 and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cuda:0
100%|██████████| 625/625 [02:18<00:00,  4.52it/s]


## Test 3 – Conformer-CTC

The Wav2Vec2-Conformer model integrates **convolutional modules** into a Transformer backbone to capture both local and global context.  
The “rope-large-960h-ft” checkpoint uses rotary position embeddings (RoPE) and has been fine-tuned on 960 hours of LibriSpeech for English speech recognition.

This test examines whether the architectural changes in Conformer offer better robustness to children’s speech compared to Whisper or Wav2Vec2.

In [None]:
# ==========================================
# CONFORMER-CTC
# ==========================================
!pip install torchcodec > /dev/null

import torchaudio
from transformers import Wav2Vec2Processor, Wav2Vec2ConformerForCTC

model_name = "facebook/wav2vec2-conformer-rope-large-960h-ft"
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ConformerForCTC.from_pretrained(model_name).to("cuda" if torch.cuda.is_available() else "cpu").eval()
target_sr = 16000

for label, paths in DATASETS.items():
    csv_path, audio_dir = paths["csv"], paths["audio"]
    print(f"\n=== Running Conformer-CTC on {label.upper()} ===")

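    df = pd.read_csv(csv_path)
    if "conformer" not in df.columns:
        df["conformer"] = ""
    # An empty column reloads from CSV as float NaN; cast to object so
    # string transcriptions can be assigned without dtype conflicts.
    df["conformer"] = df["conformer"].astype("object")

    for i, row in tqdm(df.iterrows(), total=len(df)):
        # Resume support: skip rows that already hold a non-empty transcription.
        if isinstance(row["conformer"], str) and row["conformer"].strip():
            continue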
        try:
            path = os.path.join(audio_dir, row["filename"])
            waveform, sr = torchaudio.load(path, backend="ffmpeg")

            # Convert to mono
            if waveform.shape[0] > 1:
                waveform = torch.mean(waveform, dim=0, keepdim=True)
            # Resample if needed
            if sr != target_sr:
                waveform = torchaudio.transforms.Resample(sr, target_sr)(waveform)

            inputs = processor(waveform.squeeze().numpy(), sampling_rate=target_sr, return_tensors="pt", padding=True)
            input_values = inputs.input_values.to(model.device)

            with torch.inference_mode():
                logits = model(input_values).logits
                predicted_ids = torch.argmax(logits, dim=-1)
                transcription = processor.batch_decode(predicted_ids)[0]

            df.at[i, "conformer"] = transcription
        except Exception as e:
            df.at[i, "conformer"] = f"[error: {e}]"

    df.to_csv(csv_path, index=False)
    print(f"✅ Saved results for {label} to {csv_path}")

## Results Summary and Next Steps

After processing both the **TomRoma** and **CSLU** datasets with all three models,  
the outputs will be evaluated using standard metrics (WER, CER, BLEU).

**Expected Observations:**
- Whisper may outperform others on longer utterances due to its multilingual pretraining.  
- Wav2Vec2 may capture clean acoustic segments but struggle with high-pitch child speech.  
- Conformer may yield slightly improved intelligibility due to its convolutional refinement of spectral features.

## Evaluation – Compute File-Level Word Error Rates (WER)

After transcribing both datasets with all three ASR models, this step evaluates **how accurately each model reproduced the reference transcripts**.

For each audio file:
- The ground-truth transcript (`transcription`) is compared against the model’s prediction.  
- We compute **Word Error Rate (WER)** using the `jiwer` library.  
- Punctuation is removed and text is normalized to lowercase for fairness.  
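- WER can exceed 1.0 when the number of word errors is larger than the number of reference words (for example, when a prediction contains many inserted words), so values are **capped at 1.0** to keep such outliers from skewing averages.

The results are written to a new CSV (`*_wer.csv`) with one `*_wer` column per model for later analysis.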
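
To make the capping step concrete, the sketch below computes WER as (substitutions + deletions + insertions) / reference words using a word-level edit distance, then applies the cap. It is a dependency-free illustration only; the notebook itself uses `jiwer.wer`. The names `word_errors` and `capped_wer`, and the handling of an empty reference, are assumptions made for this example.

```python
def word_errors(reference, hypothesis):
    """Return (word-level edit distance, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1], len(ref)


def capped_wer(reference, hypothesis, cap=1.0):
    """WER = errors / reference words, limited to `cap`."""
    errors, n_ref = word_errors(reference, hypothesis)
    if n_ref == 0:
        # Assumption for this sketch: an empty reference scores 0.0 only
        # when the hypothesis is empty too.
        return 0.0 if errors == 0 else cap
    return min(cap, errors / n_ref)


# Three inserted words against a two-word reference: raw WER is 1.5.
print(word_errors("the cat", "the big fat black cat"))  # (3, 2)
print(capped_wer("the cat", "the big fat black cat"))   # 1.0
```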

In [None]:
# ==========================================
# COMPUTE FILE-LEVEL WER FOR ALL MODELS
# ==========================================

# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

import pandas as pd
from jiwer import wer
import re, os

# ==========================================
# CONFIGURATION
# ==========================================
BASE_DIR = "/content/drive/MyDrive/Capstone"
FILES = {
    "tomroma": f"{BASE_DIR}/Baseline/transcriptions_cleaned.csv",
    "cslu": f"{BASE_DIR}/Baseline/child_speech_cleaned.csv"
}

def normalize(s):
    """Lowercase, strip punctuation, and collapse whitespace for fair WER comparison."""
    s = re.sub(r"[^\w\s]", "", str(s).lower())
    return re.sub(r"\s+", " ", s).strip()

# ==========================================
# WER COMPUTATION LOOP
# ==========================================
for label, path in FILES.items():
    print(f"\n=== Computing WER for {label.upper()} ===")
    df = pd.read_csv(path)

    if "transcription" not in df.columns:
        raise ValueError(f"No transcription column found in {path}")

    for model in ["whisper_small", "wav2vec2", "conformer"]:
        if model not in df.columns:
            print(f"⚠️ Column {model} not found in {label}, skipping.")
            continue

        wer_col = f"{model}_wer"
        print(f"  → Calculating {wer_col}")

        df[wer_col] = df.apply(
            lambda row: min(
                1.0,
                wer(normalize(row["transcription"]), normalize(row[model]))
            ),
            axis=1
        )

    # Save WER results to a new file
    out_path = path.replace(".csv", "_wer.csv")
    df.to_csv(out_path, index=False)
    print(f"✅ Saved WER results → {out_path}")

### WER Output Interpretation

Because of the cap, values range from **0.0 (perfect match)** to **1.0 (at least as many word errors as reference words)**.  
Because individual utterances can vary widely in difficulty, per-file WER offers fine-grained insight into where models fail.  

The average Word Error Rates (WER) for each model and dataset are summarized below (lower is better; bold marks the best model per dataset):

| Dataset | Whisper-Small | Wav2Vec2-Large | Conformer-CTC |
|:---------|:--------------:|:---------------:|:---------------:|
| **CSLU Kids’ Speech** | **0.3938** | 0.7329 | 0.6450 |
| **TomRoma Child Speech** | **0.2321** | 0.3858 | 0.3805 |

**Interpretation:**  
- Whisper-Small achieved the lowest WER on both datasets, consistent with its large-scale, multilingual pretraining.  
- Wav2Vec2 and Conformer exhibited higher errors, trailing Whisper by roughly 0.25 to 0.34 absolute WER on the CSLU corpus, which contains longer and more articulated utterances, versus about 0.15 on TomRoma.  
- The TomRoma dataset yielded generally lower WERs, likely due to shorter, simpler phrases and cleaner background conditions.
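
When reading these averages, note that a mean of per-file WER weights every file equally, whereas a pooled (corpus-level) WER weights every reference word equally, so short utterances influence the former more. The sketch below assumes the table reports the mean of the capped per-file `*_wer` columns; the function names and the error counts are hypothetical and serve only to show how the two aggregates can differ.

```python
def mean_file_wer(counts, cap=1.0):
    """Average the capped WER of each file; every file counts equally."""
    rates = [min(cap, errors / n_ref) for errors, n_ref in counts]
    return sum(rates) / len(rates)


def pooled_wer(counts):
    """Total errors over total reference words; every word counts equally."""
    total_errors = sum(errors for errors, _ in counts)
    total_words = sum(n_ref for _, n_ref in counts)
    return total_errors / total_words


# Hypothetical (errors, reference_words) pairs: one short and one long file.
counts = [(1, 2), (1, 10)]

print(round(mean_file_wer(counts), 4))  # 0.3
print(round(pooled_wer(counts), 4))     # 0.1667
```

Reporting both aggregates alongside the table would show whether short utterances are driving the averages.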