# **A/B Testing Pipeline for Phoneme-to-Text Models**

Zane Graper

Capstone - Week 6

This notebook performs a controlled A/B evaluation of two IPA→Text models: (A) the collapsed version of the CHILDES-fine-tuned T5 model and (B) the retrained, stabilized version designed to prevent decoder repetition loops. Using precomputed IPA transcriptions from earlier baseline experiments, the notebook applies both models to identical input sequences, computes standard text-generation metrics (WER, CER, BLEU), and summarizes their performance. In addition to numerical evaluation, the notebook provides collapse-detection tools and qualitative comparison tables that highlight linguistic differences between the model outputs.

---


### Install Dependencies

In [None]:
!pip install -U pip setuptools wheel
!pip install -q numpy==1.26.4
!pip install -q torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu121
!pip install -q evaluate jiwer transformers datasets

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
opencv-python-headless 4.12.0.88 requires numpy<2.3.0,>=2; python_version >= "3.9", but you have numpy 1.26.4 which is incompatible.
jax 0.7.2 requires numpy>=2.0, but you have numpy 1.26.4 which is incompatible.
opencv-contrib-python 4.12.0.88 requires numpy<2.3.0,>=2; python_version >= "3.9", but you have numpy 1.26.4 which is incompatible.
opencv-python 4.12.0.88 requires numpy<2.3.0,>=2; python_version >= "3.9", but you have numpy 1.26.4 which is incompatible.
jaxlib 0.7.2 requires numpy>=2.0, but you have numpy 1.26.4 which is incompatible.
shap 0.50.0 requires numpy>=2, but you have numpy 1.26.4 which is incompatible.
pytensor 2.35.1 requires numpy>=2.0, but you have numpy 1.26.4 which is incompatible.

In [None]:
!pip install -q evaluate
!pip install -q jiwer
!pip install -q transformers
!pip install -q datasets

### Mount Google Drive and Set the Paths

In [None]:
import os

from google.colab import drive
drive.mount('/content/drive')

BASE_DIR = "/content/drive/MyDrive/Capstone"
OUTPUT_DIR = f"{BASE_DIR}/AB_Testing"
os.makedirs(OUTPUT_DIR, exist_ok=True)

IPA_RESULTS = {
    "tomroma": f"{BASE_DIR}/Baseline/tomroma_ipa_text.csv",
    "clsu": f"{BASE_DIR}/Baseline/clsu_ipa_text.csv"
}

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Load the IPA-to-Text Models

This section loads both Hugging Face models into memory: the unstable collapsed model and the newly retrained, stable model. Each model is loaded with its own tokenizer and moved to the GPU when one is available, falling back to the CPU otherwise.

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")  # a bare expression mid-cell would not be displayed

# Model A - Collapsed Model
model_A_id = "zanegraper/t5-ipa-childes-finetuned"  # COLLAPSED VERSION
tokenizer_A = AutoTokenizer.from_pretrained(model_A_id)
model_A = AutoModelForSeq2SeqLM.from_pretrained(model_A_id).to(device)

# Model B - Retrained (Stable)
model_B_id = "zanegraper/t5-small-ipa-childes-phoneme-to-text"  # RETRAINED VERSION
tokenizer_B = AutoTokenizer.from_pretrained(model_B_id)
model_B = AutoModelForSeq2SeqLM.from_pretrained(model_B_id).to(device)

### Modular Decoding Function

A reusable function that takes an IPA string and generates its decoded text using a specified model and tokenizer. Beam search and repetition penalties are included to reduce looping and increase decoding stability.

In [None]:
def run_ipa_to_text(model, tokenizer, ipa_seq, max_new_tokens=64):
    try:
        inputs = tokenizer(ipa_seq, return_tensors="pt", padding=True).to(device)
        with torch.no_grad():
            output = model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                num_beams=4,
                early_stopping=True,
                repetition_penalty=1.2
            )
        return tokenizer.decode(output[0], skip_special_tokens=True)
    except Exception as e:
        return f"[decode_error: {e}]"
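
To make the effect of `repetition_penalty=1.2` more concrete, the sketch below reproduces the usual rescaling rule in plain Python: the logit of every token that has already been generated is divided by the penalty when it is positive and multiplied by it when it is negative, so a seen token always becomes less likely. The helper name `apply_repetition_penalty` and the toy logits are assumptions made for illustration; this is not the `transformers` implementation itself, only a minimal model of the rule it is understood to apply.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Return a copy of `logits` with already-generated token ids penalized.

    Hypothetical helper for illustration only; `logits` is a plain list
    indexed by token id, `generated_ids` the ids emitted so far.
    """
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        # Positive logits shrink, negative logits move further below zero,
        # so the token's probability drops in both cases.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


# Toy vocabulary of three tokens; ids 0 and 1 were already generated.
toy_logits = [2.4, -1.0, 0.6]
print(apply_repetition_penalty(toy_logits, generated_ids=[0, 1, 0]))
```

A penalty of 1.2 is mild: it reduces the score of a repeated token but cannot forbid it, which is consistent with loops such as `ah ah ah ...` still appearing in the outputs of both models.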

### A/B Model Evaluation Driver

A structured procedure that applies Model A or Model B across the entire dataset, storing each model's predictions in uniquely named output columns. Results are saved to CSV files for subsequent metric evaluation.

In [None]:
import pandas as pd
from tqdm import tqdm

def evaluate_model_on_dataset(model, tokenizer, ipa_df, label, model_label):
    df = ipa_df.copy()
    pred_col = f"text_pred_{model_label}"
    df[pred_col] = ""

    for i, row in tqdm(df.iterrows(), total=len(df), desc=f"{label} – {model_label}"):
        ipa_seq = row["ipa_phonemes"]
        df.at[i, pred_col] = run_ipa_to_text(model, tokenizer, ipa_seq)

    out_csv = f"{OUTPUT_DIR}/{label}_{model_label}.csv"
    df.to_csv(out_csv, index=False)
    print(f"Saved predictions → {out_csv}")
    return df

### Metric Function

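A lightweight wrapper around the standard evaluation metrics WER, CER, and BLEU, scoring predictions against ground-truth reference transcripts. WER and CER are error rates (lower is better) and are capped at 1.0 in the returned dictionary; BLEU is a similarity score (higher is better).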

In [None]:
from evaluate import load

def compute_metrics(preds, refs):
    wer_metric = load("wer")
    cer_metric = load("cer")
    bleu_metric = load("bleu")

    wer_val = wer_metric.compute(predictions=preds, references=refs)
    cer_val = cer_metric.compute(predictions=preds, references=refs)
    bleu_val = bleu_metric.compute(predictions=preds, references=refs)["bleu"]

    return {
        "WER": min(wer_val, 1.0),
        "CER": min(cer_val, 1.0),
        "BLEU": bleu_val
    }
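
Because WER divides the word-level edit distance by the number of reference words, a looping output that is much longer than the reference can score well above 1.0, and the `min(wer_val, 1.0)` cap then makes very different failures look identical. The sketch below illustrates this for a single sentence pair. It is a plain-Python approximation written for this explanation: the function name `word_error_rate` is hypothetical, the hypothesis string is a toy example, and the `evaluate` metric aggregates over the whole corpus rather than per sentence.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length.

    Assumes a non-empty reference; illustrative only.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] is the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


reference = "then more people use them"
looping = " ".join(["ah"] * 12)          # toy collapsed output
raw = word_error_rate(reference, looping)
print(f"raw WER = {raw:.2f}, capped WER = {min(raw, 1.0):.2f}")
```

Reporting the uncapped value alongside the capped one would make it easier to tell whether one model over-generates less than the other.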

### Repetition Collapse Detection

Collapsed T5-style decoders often produce looping output in which the same token sequence repeats until the generation limit is reached. The following function quantifies repetition by computing (1) the proportion of repeated tokens, (2) the first repeated n-gram found, searching from `min_ngram` tokens upward, and (3) the output length in tokens. An output is flagged as collapsed when the repetition ratio exceeds `collapse_threshold` or when any n-gram of `min_ngram` or more tokens repeats. This allows the notebook to explicitly measure decoder instability for reporting.

In [None]:
import re
from collections import Counter

def detect_collapse(output_text, min_ngram=3, collapse_threshold=0.30):
    """
    Detects repetition loops common in collapsed T5 models.
    Returns a dictionary of collapse indicators.
    """

    tokens = output_text.split()
    if len(tokens) < min_ngram:
        return {"collapse": False, "repeat_ratio": 0, "max_repeat_ngram": "", "token_len": len(tokens)}

    # 1. Token repetition ratio
    token_counts = Counter(tokens)
    repeat_ratio = 1 - (len(token_counts) / len(tokens))

    # 2. First repeated n-gram, trying the shortest length (min_ngram) first
    def find_repeated_ngram(tokens, n):
        seen = {}
        for i in range(len(tokens) - n + 1):
            ngram = tuple(tokens[i:i+n])
            if ngram in seen:
                return " ".join(ngram)
            seen[ngram] = True
        return ""

    max_ngram = ""
    for n in range(min_ngram, min(min_ngram + 5, len(tokens))):
        ngram = find_repeated_ngram(tokens, n)
        if ngram:
            max_ngram = ngram
            break

    collapse_flag = repeat_ratio > collapse_threshold or bool(max_ngram)

    return {
        "collapse": collapse_flag,
        "repeat_ratio": round(repeat_ratio, 3),
        "max_repeat_ngram": max_ngram,
        "token_len": len(tokens)
    }
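
The function above returns `token_len` but does not itself compare the output length with the reference. One possible way to add that check is sketched below, together with a standalone copy of the repetition ratio so the example runs on its own. The helper names `is_overlong` and `repeat_ratio` and the `max_ratio=2.0` default are assumptions chosen for illustration, not values taken from the notebook.

```python
def repeat_ratio(output_text):
    """Same ratio as in detect_collapse: 1 - unique tokens / total tokens."""
    tokens = output_text.split()
    return 1 - len(set(tokens)) / len(tokens) if tokens else 0.0


def is_overlong(output_text, reference_text, max_ratio=2.0):
    """Flag outputs with more than `max_ratio` times the reference's tokens.

    Hypothetical helper; the threshold is an assumption, not a tuned value.
    """
    out_len = len(output_text.split())
    ref_len = len(reference_text.split())
    if ref_len == 0:
        # With an empty reference, any non-empty output counts as overlong.
        return out_len > 0
    return out_len / ref_len > max_ratio


looping = "tin " + "ah " * 14            # toy output shaped like a collapsed decode
reference = "Tornadoes bring rain and big winds"
print(round(repeat_ratio(looping), 3), is_overlong(looping, reference))
```

In the evaluation loop, such a check could be applied row by row with the prediction and `transcription` columns, then combined with `collapse_flag` if a stricter definition of collapse is wanted.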

### A/B Evaluation Loop

The main execution block iterates over the datasets (TomRoma and CSLU, keyed as `tomroma` and `clsu` in `IPA_RESULTS`), runs both models, and collects metrics. Outputs include printed model performance and serialized prediction files.

In [None]:
all_results = {}
qualitative_samples_all = {}

for label, ipa_csv in IPA_RESULTS.items():
    print(f"Loading IPA CSV for: {label}")
    df = pd.read_csv(ipa_csv)

    # Clean missing rows
    df = df.dropna(subset=["ipa_phonemes", "transcription"]).reset_index(drop=True)

    # ======================================================
    # A: Collapsed Model Predictions
    # ======================================================
    df_A = evaluate_model_on_dataset(model_A, tokenizer_A, df, label, "collapsed")

    # Apply collapse detection
    df_A["collapse_info"] = df_A["text_pred_collapsed"].apply(detect_collapse)
    df_A["collapse_flag"] = df_A["collapse_info"].apply(lambda x: x["collapse"])

    # ======================================================
    # B: Retrained Model Predictions
    # ======================================================
    df_B = evaluate_model_on_dataset(model_B, tokenizer_B, df, label, "retrained")

    # Apply collapse detection
    df_B["collapse_info"] = df_B["text_pred_retrained"].apply(detect_collapse)
    df_B["collapse_flag"] = df_B["collapse_info"].apply(lambda x: x["collapse"])

    # ======================================================
    # Compute Metrics
    # ======================================================
    preds_A = df_A["text_pred_collapsed"].tolist()
    preds_B = df_B["text_pred_retrained"].tolist()
    refs = df["transcription"].tolist()

    metrics_A = compute_metrics(preds_A, refs)
    metrics_B = compute_metrics(preds_B, refs)

    # Collapse stats
    collapse_rate_A = df_A["collapse_flag"].mean()
    collapse_rate_B = df_B["collapse_flag"].mean()

    # Add collapse stats to results dictionary
    metrics_A["CollapseRate"] = collapse_rate_A
    metrics_B["CollapseRate"] = collapse_rate_B

    # ======================================================
    # Store results for summary tables
    # ======================================================
    all_results[label] = {
        "collapsed": metrics_A,
        "retrained": metrics_B
    }

    print(f"\n📊 Results for {label}:")
    print("Collapsed:", metrics_A)
    print("Retrained:", metrics_B)

Loading IPA CSV for: tomroma


tomroma – collapsed: 100%|██████████| 624/624 [23:57<00:00,  2.30s/it]


Saved predictions → /content/drive/MyDrive/Capstone/AB_Testing/tomroma_collapsed.csv


tomroma – retrained: 100%|██████████| 624/624 [22:29<00:00,  2.16s/it]


Saved predictions → /content/drive/MyDrive/Capstone/AB_Testing/tomroma_retrained.csv

📊 Results for tomroma:
Collapsed: {'WER': 1.0, 'CER': 1.0, 'BLEU': 0.0, 'CollapseRate': np.float64(0.7788461538461539)}
Retrained: {'WER': 1.0, 'CER': 1.0, 'BLEU': 0.0, 'CollapseRate': np.float64(0.6282051282051282)}
Loading IPA CSV for: clsu


clsu – collapsed: 100%|██████████| 819/819 [39:29<00:00,  2.89s/it]


Saved predictions → /content/drive/MyDrive/Capstone/AB_Testing/clsu_collapsed.csv


clsu – retrained: 100%|██████████| 819/819 [36:26<00:00,  2.67s/it]


Saved predictions → /content/drive/MyDrive/Capstone/AB_Testing/clsu_retrained.csv

📊 Results for clsu:
Collapsed: {'WER': 1.0, 'CER': 1.0, 'BLEU': 0.0, 'CollapseRate': np.float64(0.9511599511599511)}
Retrained: {'WER': 1.0, 'CER': 1.0, 'BLEU': 0.007527232863881872, 'CollapseRate': np.float64(0.8327228327228328)}
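
The collapse rates differ between the two models (about 0.78 vs 0.63 on `tomroma` and 0.95 vs 0.83 on `clsu`), while the capped WER and CER do not. Since both models decode exactly the same inputs, a paired test is one reasonable way to check whether such a gap exceeds what chance would produce. The sketch below is an exact McNemar test on the per-utterance collapse flags; the function name `mcnemar_exact` and the toy flag lists are illustrative assumptions, and the notebook itself does not run any significance test.

```python
from math import comb


def mcnemar_exact(flags_a, flags_b):
    """Two-sided exact McNemar test on paired boolean flags.

    Returns (only_a, only_b, p_value), where only_a counts items flagged
    for model A but not B, and only_b counts the reverse.
    """
    only_a = sum(bool(a) and not bool(b) for a, b in zip(flags_a, flags_b))
    only_b = sum(bool(b) and not bool(a) for a, b in zip(flags_a, flags_b))
    n = only_a + only_b
    if n == 0:
        # No discordant pairs: the data carry no evidence of a difference.
        return only_a, only_b, 1.0
    k = min(only_a, only_b)
    # Under the null hypothesis the discordant pairs follow Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return only_a, only_b, min(1.0, 2 * tail)


# Toy flags for ten utterances (made up for the example).
toy_a = [True] * 8 + [False] * 2
toy_b = [True] * 3 + [False] * 5 + [True] + [False]
print(mcnemar_exact(toy_a, toy_b))
```

Inside the loop, the same call could take `df_A["collapse_flag"]` and `df_B["collapse_flag"]`, which are aligned row by row because both frames are copies of the same cleaned `df`.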


### Produce A/B Results Table

This section aggregates all metrics into a single comparison table, making differences between the collapsed and retrained models easy to interpret and cite in reports.

In [None]:
import pandas as pd

rows = []
for dataset, result in all_results.items():
    row = {
        "Dataset": dataset,
        "A_WER": result["collapsed"]["WER"],
        "A_CER": result["collapsed"]["CER"],
        "A_BLEU": result["collapsed"]["BLEU"],
        "B_WER": result["retrained"]["WER"],
        "B_CER": result["retrained"]["CER"],
        "B_BLEU": result["retrained"]["BLEU"],
    }
    rows.append(row)

results_df = pd.DataFrame(rows)
results_df.to_csv(f"{OUTPUT_DIR}/AB_results_summary.csv", index=False)
results_df

| | Dataset | A_WER | A_CER | A_BLEU | B_WER | B_CER | B_BLEU |
|---|---|---|---|---|---|---|---|
| 0 | tomroma | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 1 | clsu | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.007527 |


### Qualitative Comparison Table

Numerical metrics don't fully illustrate model behavior, so qualitative samples help explain specific strengths and weaknesses. This table displays ground truth, collapsed-model prediction, retrained-model prediction, and collapse flags, all aligned for easy inspection and inclusion in the Week-6 report.

In [None]:
def qualitative_table(df_A, df_B, num_samples=10):
    # Align original df, collapsed df, retrained df
    merged = pd.DataFrame({
        "Reference": df_A["transcription"],
        "Collapsed Output": df_A["text_pred_collapsed"],
        "Retrained Output": df_B["text_pred_retrained"],
        "Collapsed_Collapse?": df_A["collapse_info"].apply(lambda x: x["collapse"]),
        "Retrained_Collapse?": df_B["collapse_info"].apply(lambda x: x["collapse"])
    })

    # Sample randomly for inspection
    sample_df = merged.sample(num_samples, random_state=42)
    return sample_df

# df_A and df_B still hold the last dataset processed by the loop above (clsu)
qualitative_samples = qualitative_table(df_A, df_B, num_samples=12)
qualitative_samples

| | Reference | Collapsed Output | Retrained Output | Collapsed_Collapse? | Retrained_Collapse? |
|---|---|---|---|---|---|
| 86 | Tornadoes bring rain and big winds | tin ah ah ah ah ah ah ah ah ah ah ah ah ah ah ... | thomas oh there's bigger a a a a a a a a a a a... | True | True |
| 432 | Sometimes it is hard to see in dust storms | um the tusame's uh uh uh uh uh uh uh uh uh uh ... | um that sam's uh uh uh uh uh uh uh uh uh uh uh... | True | True |
| 799 | A butterfly wing is a two sided thing | a boy or a a a a a a a a a a a a a a a a a a a... | a big thing a fly with a things too s i a a a ... | True | True |
| 417 | People use trash to make new things | p p l yo's choo choo choo choo choo choo. | people use to the rest to the the rest um make... | True | True |
| 678 | Baby dinosaurs hatched from the eggs | baby jennifer swoose hah ah ah ah ah ah ah ah ... | baby jennifer swoosh hatched for a a a a a a a... | True | True |
| 532 | Lightning can hurt people | l i l i a a a a a a a a a a a a a a a a a a a ... | li li a a a a a a a a a a a a a a a a a a a a ... | True | True |
| 598 | If the river gets too deep water goes onto the... | uh uh uh uh uh uh uh uh uh uh uh uh uh uh uh u... | if the other thing of the other thing that's t... | True | True |
| 767 | Animals live along the coast and in nearby oceans | then the animals love l a a a a a a a a a a a ... | then the animals lived longing in the a a a a ... | True | True |
| 192 | Scientists thought all the blue butterflies ha... | i n t s i n t t t u b u r r r r r r r r r r r r. | i n t n s i n t t u u u u u u u u u u u u u u ... | True | True |
| 537 | Then more people use them | ah ah ah ah ah ah ah ah ah ah ah ah ah ah ah a... | ah ah ah ah ah ah ah ah ah ah ah ah ah ah ah a... | True | True |
