# Baseline Model: IPA-to-Text Pipeline for Child Speech Recognition
### Zane A. Graper | MSAI 699 Capstone | November 2025

This notebook implements a **two-stage baseline** for automatic speech recognition (ASR) on child speech.  
Instead of mapping raw audio directly to text, we decompose the task:

1. **Audio → IPA Phoneme Sequence** using a pretrained Wav2Vec2 model (`facebook/wav2vec2-lv-60-espeak-cv-ft`).  
2. **IPA → Text Reconstruction** using the fine-tuned T5-small model (`zanegraper/t5-small-ipa-phoneme-to-text`).

This loosely mirrors the phonological pathway in human speech perception by separating acoustic decoding from language reconstruction.  
The goal here is not high accuracy but a **quantitative baseline** to compare against later fine-tuned or hybrid models.

## Step 1 – Environment Setup and Dependencies

Install required Python packages and ensure GPU access through PyTorch.  
Libraries used:

- **transformers / evaluate / jiwer** – for model loading and metrics  
- **librosa** – for waveform loading and resampling  
- **phonemizer / espeak-ng** – for IPA conversion utilities  
- **pandas / tqdm** – for data handling and progress tracking

In [None]:
# Install Dependencies
!pip install -U pip setuptools wheel
!pip install -q numpy==1.26.4
!pip install -q torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu121



In [None]:
!pip install evaluate jiwer

import torch, os, pandas as pd, numpy as np, librosa
from tqdm import tqdm
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC, AutoTokenizer, AutoModelForSeq2SeqLM
from evaluate import load
from jiwer import wer, cer

device = "cuda" if torch.cuda.is_available() else "cpu"
print("Device:", device)

Device: cuda


## Step 2 – Path Configuration and Dataset Selection

We define file locations on Google Drive.  
Two datasets are used for baseline evaluation:

- **TomRoma** – child-speech corpus recorded in realistic home environments.  
- **CSLU Kids’ Speech** – studio-quality recordings across grades K-10.

Each dataset provides paired `.wav` files and manual transcriptions used for evaluation.

In [None]:
# ======================================================
# PATH SETUP
# ======================================================
from google.colab import drive
drive.mount('/content/drive')

import os

BASE_DIR = "/content/drive/MyDrive/Capstone"
OUTPUT_DIR = f"{BASE_DIR}/Baseline"
os.makedirs(OUTPUT_DIR, exist_ok=True)

DATASETS = {
    "tomroma": {
        "csv": f"{BASE_DIR}/Baseline/transcriptions_cleaned.csv",
        "audio": f"{BASE_DIR}/audiodata"
    },
    "clsu": {
        "csv": f"{BASE_DIR}/Baseline/child_speech_cleaned.csv",
        "audio": f"{BASE_DIR}/kids_speech_wav"
    }
}

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
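
Before running the models, it can help to confirm that every file listed in a dataset CSV actually exists under its audio directory. The following is a minimal sketch, assuming the CSVs carry the `filename` column that the processing step reads later; `find_missing_audio` is a hypothetical helper and not part of the original notebook.

```python
import csv
import os


def find_missing_audio(csv_path, audio_dir, column="filename"):
    """Return the filenames listed in csv_path that are absent from audio_dir."""
    missing = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            name = (row.get(column) or "").strip()
            # Blank names count as missing so that they surface in the report.
            if not name or not os.path.isfile(os.path.join(audio_dir, name)):
                missing.append(name)
    return missing
```

With the `DATASETS` dictionary above, a call such as `find_missing_audio(DATASETS["tomroma"]["csv"], DATASETS["tomroma"]["audio"])` would list any TomRoma rows without a matching `.wav` file.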


## Step 3 – Audio → IPA Phoneme Conversion

The first model converts 16 kHz audio into an **International Phonetic Alphabet (IPA)** sequence.  
We use `facebook/wav2vec2-lv-60-espeak-cv-ft`, a model trained on multilingual speech aligned to **eSpeak-NG** phoneme labels.

This isolates acoustic errors (e.g., child pitch, articulation) before any language-model bias.

In [None]:
# ======================================================
# AUDIO → IPA MODEL
# ======================================================
!apt-get update -qq && apt-get install -y espeak-ng
!pip install -q phonemizer==3.2.1 transformers==4.43.3

import torch, librosa, phonemizer
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC, AutoTokenizer, AutoModelForSeq2SeqLM

device = "cuda" if torch.cuda.is_available() else "cpu"

ipa_model_name = "facebook/wav2vec2-lv-60-espeak-cv-ft"
ipa_processor = Wav2Vec2Processor.from_pretrained(ipa_model_name)
ipa_model = Wav2Vec2ForCTC.from_pretrained(ipa_model_name).to(device)

def audio_to_ipa(audio_path):
    try:
        speech, sr = librosa.load(audio_path, sr=16000)
        inputs = ipa_processor(speech, sampling_rate=16000, return_tensors="pt", padding=True)
        with torch.no_grad():
            logits = ipa_model(inputs.input_values.to(device)).logits
        pred_ids = torch.argmax(logits, dim=-1)[0]
        return ipa_processor.decode(pred_ids)
    except Exception as e:
        return f"[error: {e}]"


Some weights of the model checkpoint at facebook/wav2vec2-lv-60-espeak-cv-ft were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.encoder.pos_conv_embed.conv.weight_g', 'wav2vec2.encoder.pos_conv_embed.conv.weight_v']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-lv-60-espeak-cv-ft and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1']
You should probably

Each file is processed to yield an IPA string such as `/hɛloʊ wɝld/`.  
Errors are caught and returned as an `[error: …]` placeholder string rather than raised, so a failed file stays in the results instead of being skipped.
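
The `argmax` followed by `ipa_processor.decode` in `audio_to_ipa` amounts, in effect, to greedy CTC decoding: take the most likely token for each frame, merge consecutive repeats, then drop the blank token. Below is a toy sketch of that collapse on plain Python lists. The vocabulary and blank id are made up for illustration and are not the model's actual vocabulary.

```python
def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Merge consecutive duplicate ids, then remove blanks (greedy CTC decoding)."""
    collapsed = []
    previous = None
    for token_id in frame_ids:
        # A token is emitted only when it differs from the previous frame
        # and is not the blank symbol.
        if token_id != previous and token_id != blank_id:
            collapsed.append(token_id)
        previous = token_id
    return collapsed


# Hypothetical toy vocabulary: id 0 is the CTC blank.
toy_vocab = {0: "<blank>", 1: "h", 2: "ɛ", 3: "l", 4: "oʊ"}
frames = [0, 1, 1, 0, 2, 2, 2, 0, 3, 3, 4, 4, 0]
decoded = [toy_vocab[i] for i in ctc_greedy_collapse(frames)]
print(" ".join(decoded))  # h ɛ l oʊ
```

A blank between two identical ids keeps them separate, which is how CTC can represent genuinely repeated phonemes.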

## Step 4 – IPA → Text Transformation

The second stage reconstructs intelligible English text from IPA sequences using the fine-tuned **T5-small** model.  
This model was trained on ~780 k synthetic IPA–text pairs from BookCorpus.

It acts as a phoneme-to-grapheme translator and provides an interpretable linguistic baseline.

In [None]:
# ======================================================
# IPA → TEXT MODEL
# ======================================================
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC, AutoTokenizer, AutoModelForSeq2SeqLM

text_model_id = "zanegraper/t5-small-ipa-phoneme-to-text"
text_tokenizer = AutoTokenizer.from_pretrained(text_model_id)
text_model = AutoModelForSeq2SeqLM.from_pretrained(text_model_id).to(device)

def ipa_to_text(ipa_sequence):
    try:
        inputs = text_tokenizer(ipa_sequence, return_tensors="pt", padding=True).to(device)
        with torch.no_grad():
            outputs = text_model.generate(**inputs, max_new_tokens=64)
        return text_tokenizer.decode(outputs[0], skip_special_tokens=True)
    except Exception as e:
        return f"[error: {e}]"


The output of this stage allows direct comparison with ground-truth transcripts for WER/CER analysis.

## Step 5 – Integrated Pipeline

This wrapper runs both stages sequentially over each dataset:  
1. Convert audio to IPA  
2. Convert IPA to text  
3. Save intermediate and final results as CSV for evaluation

In [None]:
# ======================================================
# PROCESSING PIPELINE
# ======================================================
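def process_dataset(csv_path, audio_dir, label):
    """Run both stages over every row of a dataset CSV and save the results."""
    df = pd.read_csv(csv_path)
    # Create the output columns if the CSV does not already have them
    if "ipa_phonemes" not in df.columns:
        df["ipa_phonemes"] = ""
    if "ipa_text_pred" not in df.columns:
        df["ipa_text_pred"] = ""

    for i, row in tqdm(df.iterrows(), total=len(df), desc=f"Processing {label}"):
        audio_file = os.path.join(audio_dir, row["filename"])
        # Stage 1: audio -> IPA (returns an "[error: ...]" string on failure)
        ipa_seq = audio_to_ipa(audio_file)
        df.at[i, "ipa_phonemes"] = ipa_seq
        # Stage 2: IPA -> text (receives the stage-1 output as-is, including error strings)
        df.at[i, "ipa_text_pred"] = ipa_to_text(ipa_seq)

    out_csv = f"{OUTPUT_DIR}/{label}_ipa_text.csv"
    df.to_csv(out_csv, index=False)
    print(f"✅ Saved results to {out_csv}")
    return df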

Each dataset produces a CSV with columns:  
`filename | transcription | ipa_phonemes | ipa_text_pred`
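
Because the stage functions return an `[error: …]` string on failure instead of raising, it may be worth keeping failed stage-1 outputs away from stage 2. Here is a minimal sketch of the two-stage chaining over plain dictionaries, with the stage functions passed in as arguments. `run_two_stage` and the stub functions are hypothetical stand-ins for `audio_to_ipa` and `ipa_to_text`, not the notebook's own code.

```python
def run_two_stage(rows, stage1, stage2, error_prefix="[error"):
    """Apply stage1 then stage2 to each row, skipping stage2 when stage1 failed."""
    results = []
    for row in rows:
        ipa = stage1(row["filename"])
        failed = ipa.startswith(error_prefix)
        results.append({
            **row,
            "ipa_phonemes": ipa,
            # Leave the prediction empty rather than decoding an error message.
            "ipa_text_pred": "" if failed else stage2(ipa),
        })
    return results


# Stub stages for illustration only; the real ones load audio and run the models.
def fake_stage1(name):
    return "[error: missing]" if name == "bad.wav" else "h ɛ l oʊ"


def fake_stage2(ipa):
    return "hello"


rows = [{"filename": "ok.wav", "transcription": "hello"},
        {"filename": "bad.wav", "transcription": "bye"}]
for result in run_two_stage(rows, fake_stage1, fake_stage2):
    print(result["filename"], repr(result["ipa_text_pred"]))
```

Rows with an empty prediction are then dropped by the blank-row filter at evaluation time, so a failed file would not be scored against text generated from an error message.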

## Step 6 – Evaluation Metrics

Performance is quantified with three metrics:

- **WER (Word Error Rate)** – measures word-level transcription errors  
- **CER (Character Error Rate)** – captures fine-grained spelling mismatches  
- **BLEU Score** – evaluates linguistic similarity of predicted and reference sentences

Lower WER/CER and higher BLEU indicate better performance.
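
Both error rates are edit distances normalised by reference length: WER = (S + D + I) / N over words, and CER is the same ratio over characters, where S, D and I count substitutions, deletions and insertions and N is the reference length. The sketch below is a self-contained illustration using a plain dynamic-programming Levenshtein distance. `edit_distance`, `word_error_rate` and `char_error_rate` are illustrative names, not the `evaluate` or `jiwer` implementations, and they assume a non-empty reference.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]


def word_error_rate(reference, hypothesis):
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference, hypothesis):
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


print(word_error_rate("the cat sat", "the cat sit"))  # 1 substitution / 3 words
print(char_error_rate("the cat sat", "the cat sit"))  # 1 substitution / 11 characters
print(word_error_rate("hi", "oh hi there now"))       # 3 insertions / 1 word = 3.0
```

The last call shows why an uncapped WER can exceed 1.0 when the prediction is much longer than the reference; `evaluate_baseline` caps both rates with `min(..., 1.0)`.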

In [None]:
# ======================================================
# EVALUATION (corrected)
# ======================================================
from evaluate import load
import pandas as pd

def evaluate_baseline(df):
    """Compute corpus-level WER, CER, and BLEU for the predicted text."""
    # Drop rows whose reference or prediction is missing or blank
    df = df.dropna(subset=["transcription", "ipa_text_pred"])
    df = df[df["transcription"].str.strip().ne("") & df["ipa_text_pred"].str.strip().ne("")]
    # Exclude utterances of 100 words or more on either side
    df = df[(df["transcription"].str.split().str.len() < 100) & (df["ipa_text_pred"].str.split().str.len() < 100)]

    refs = df["transcription"].astype(str).tolist()
    preds = df["ipa_text_pred"].astype(str).tolist()

    # Stop early if no pairs survived the filters
    if not refs:
        raise ValueError("No non-empty prediction/reference pairs left to evaluate.")

    # Load the Hugging Face `evaluate` metrics
    wer_metric = load("wer")
    cer_metric = load("cer")
    bleu_metric = load("bleu")

    # Raw WER/CER can exceed 1.0 when predictions contain many insertions;
    # both are capped at 1.0 here
    wer_val = min(wer_metric.compute(predictions=preds, references=refs), 1.0)
    cer_val = min(cer_metric.compute(predictions=preds, references=refs), 1.0)
    bleu_val = bleu_metric.compute(predictions=preds, references=refs)["bleu"]

    metrics = {"WER": wer_val, "CER": cer_val, "BLEU": bleu_val}
    print(metrics)
    return metrics

Results are printed for each dataset.  
These metrics serve as the reference baseline prior to any model adaptation.

## Step 7 – Baseline Results and Discussion

| Dataset | WER | CER | BLEU |
|:--------:|:----:|:----:|:----:|
| TomRoma | ≈1.00 | 0.92 | 0.00 |
| CSLU | ≈0.99 | 0.61 | 0.07 |
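
Apart from the ≈ marker on WER, the table values are the printed metric dictionaries rounded to two decimals. The following small sketch shows that formatting; `format_row` is a hypothetical helper, and the numbers are copied from this notebook's run output.

```python
def format_row(name, metrics):
    """Render one Markdown table row with metrics rounded to two decimals."""
    cells = [f"{metrics[key]:.2f}" for key in ("WER", "CER", "BLEU")]
    return "| " + " | ".join([name] + cells) + " |"


# Values copied from the run output of this notebook.
tomroma_metrics = {"WER": 1.0, "CER": 0.916226795563727, "BLEU": 0.0}
cslu_metrics = {"WER": 0.9942849177585726, "CER": 0.6070217276099629,
                "BLEU": 0.06820126072143742}

print(format_row("TomRoma", tomroma_metrics))  # | TomRoma | 1.00 | 0.92 | 0.00 |
print(format_row("CSLU", cslu_metrics))        # | CSLU | 0.99 | 0.61 | 0.07 |
```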

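**Interpretation:**  
WER at or near 100% on both corpora indicates that the IPA → Text model, used without any adaptation to child speech, does not yet generalize to real child recordings.  
Because `evaluate_baseline` caps WER and CER at 1.0, the TomRoma WER is a ceiling and the uncapped value may be higher.  
High CER and low BLEU are expected because the IPA outputs include non-standard symbols and truncated tokens.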

**Next Steps:**  
1. Fine-tune the IPA → Text model on a subset of real child data.  
2. Introduce phonological error augmentation (gliding, stopping, cluster reduction).  
3. Compare against Whisper and Wav2Vec2 direct ASR baselines.

These results provide the required **baseline metrics** for the Capstone milestone.
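
The phonological error augmentation named in the next steps could be prototyped as rule-based substitutions on space-separated phoneme sequences. This is a rough sketch under stated assumptions: the rule tables are small illustrative samples of gliding, stopping, and cluster reduction, not an exhaustive or validated inventory, and `augment_phonemes` is a hypothetical function.

```python
import random

# Illustrative substitution rules (hypothetical samples, not a full inventory).
GLIDING = {"ɹ": "w", "r": "w", "l": "w"}
STOPPING = {"s": "t", "z": "d", "f": "p", "v": "b", "θ": "t", "ð": "d", "ʃ": "t"}
CLUSTER_REDUCTION = {("s", "t"): ["t"], ("s", "p"): ["p"], ("s", "k"): ["k"],
                     ("b", "l"): ["b"], ("p", "l"): ["p"], ("t", "ɹ"): ["t"]}


def augment_phonemes(phonemes, rate=0.3, rng=None):
    """Randomly apply child-like phonological processes to a list of phonemes."""
    rng = random.Random() if rng is None else rng
    output = []
    i = 0
    while i < len(phonemes):
        pair = tuple(phonemes[i:i + 2])
        if pair in CLUSTER_REDUCTION and rng.random() < rate:
            output.extend(CLUSTER_REDUCTION[pair])  # keep one member of the cluster
            i += 2
            continue
        symbol = phonemes[i]
        for rules in (GLIDING, STOPPING):
            if symbol in rules and rng.random() < rate:
                symbol = rules[symbol]
                break
        output.append(symbol)
        i += 1
    return output


# rate=1.0 applies every rule that matches, which makes the effect easy to see.
print(augment_phonemes("s t ɑ p".split(), rate=1.0))  # ['t', 'ɑ', 'p']
print(augment_phonemes("ɹ ɛ d".split(), rate=1.0))    # ['w', 'ɛ', 'd']
```

Passing a seeded `random.Random` as `rng` would make the augmented training pairs reproducible across runs.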

In [None]:
# ======================================================
# RUN PIPELINE
# ======================================================
import pandas as pd
import os
from tqdm import tqdm
from evaluate import load
from jiwer import wer, cer

tomroma = process_dataset(DATASETS["tomroma"]["csv"], DATASETS["tomroma"]["audio"], "tomroma")
clsu = process_dataset(DATASETS["clsu"]["csv"], DATASETS["clsu"]["audio"], "clsu")

print("TomRoma Metrics:")
evaluate_baseline(tomroma)

print("CSLU Metrics:")
evaluate_baseline(clsu)

Processing tomroma: 100%|██████████| 625/625 [06:29<00:00,  1.60it/s]


✅ Saved results to /content/drive/MyDrive/Capstone/Baseline/tomroma_ipa_text.csv


Processing clsu: 100%|██████████| 819/819 [07:57<00:00,  1.72it/s]


✅ Saved results to /content/drive/MyDrive/Capstone/Baseline/clsu_ipa_text.csv
TomRoma Metrics:


{'WER': 1.0, 'CER': 0.916226795563727, 'BLEU': 0.0}
CSLU Metrics:
{'WER': 0.9942849177585726, 'CER': 0.6070217276099629, 'BLEU': 0.06820126072143742}
