# **A/B Testing - Model Comparison**

Zane Graper

MSAI699 Capstone

---

This notebook evaluates two IPA→Text models using a held-out validation corpus derived from the CHILDES dataset. The goal is to compare the previously fine-tuned model, which demonstrated decoder collapse during earlier testing, against a newly retrained version designed with corrective measures to prevent repetition failures. By running both models on identical IPA sequences and comparing Word Error Rate (WER), Character Error Rate (CER), and BLEU, this evaluation provides a controlled, quantitative assessment of each model’s reliability and accuracy. The results presented here form the basis for selecting the model that will be integrated into the final version of the correction-layer pipeline.

---

### Install Dependencies

In [None]:
!pip install -q evaluate jiwer transformers pandas numpy


### Mount Google Drive and Define Paths

This block mounts Google Drive and defines the base directories where the validation corpus is stored and where output predictions will be saved. It ensures that all files used during evaluation are read and written consistently.

In [None]:
import os

from google.colab import drive

drive.mount('/content/drive')

BASE_DIR = "/content/drive/MyDrive/Capstone"
VALID_PATH = f"{BASE_DIR}/Corpus/ipa_childes/child_valid.tsv"
OUTPUT_DIR = f"{BASE_DIR}/Evaluation/ModelEval_Validation"

# Create the output directory if it does not already exist
os.makedirs(OUTPUT_DIR, exist_ok=True)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Load and Prepare Validation Corpus

This block loads the headerless CHILDES validation TSV, assigns column names, drops rows with a missing value in either column, and draws a random subset of 500 samples. A fixed random seed (`random_state=42`) makes the subset reproducible, so both models are scored on exactly the same rows.

In [None]:
import pandas as pd

# Load without header
df_valid = pd.read_csv(VALID_PATH, sep="\t", header=None)

# Rename properly
df_valid = df_valid.rename(columns={
    0: "ipa_phonemes",
    1: "transcription"
})

# Clean missing
df_valid = df_valid.dropna(subset=["ipa_phonemes", "transcription"]).reset_index(drop=True)

# Take a random subset of 500 samples
df_valid_subset = df_valid.sample(n=500, random_state=42).reset_index(drop=True)

print("Subset shape:", df_valid_subset.shape)
df_valid_subset.head()

Subset shape: (500, 2)


| | ipa_phonemes | transcription |
|---|---|---|
| 0 | b ɪ k ʌ z ʃ iː h æ d ð oʊ z æ n d ʃ iː h æ d d... | because she had those and she had dose. |
| 1 | w ɛ l w ʌ t j uː w ʌ t j uː w ʌ t j uː n eɪ m ... | well what you what you what you name daddy? |
| 2 | aɪ s iː ʌ b ɪ ɡ w iː l d ɛ ɹ ə | i see a big wheel dere. |
| 3 | d oʊ n t d ɪ s t ɜː b ʌ m æ n w ɛ n h iː z w ɜ... | don't disturb a man when he's working. |
| 4 | ð ɪ ɑ p ə z ɪ t ʌ v h aɪ ɪ z l oʊ | the opposite of high is low. |
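
The loading cell above only drops rows with missing values. If stricter filtering is wanted, a minimal sketch along the following lines could also skip rows with the wrong number of fields or with blank fields. The function name `load_ipa_pairs` and the sample text are hypothetical and not part of the notebook, and the sketch uses the standard-library `csv` module instead of pandas so that it runs on its own.

```python
import csv
import io

def load_ipa_pairs(tsv_text):
    """Parse headerless TSV text into (ipa, transcription) pairs.

    Hypothetical helper: rows that do not have exactly two non-empty
    fields are counted as skipped instead of being kept.
    """
    pairs = []
    skipped = 0
    # QUOTE_NONE treats quote characters in transcriptions as plain text
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        fields = [field.strip() for field in row]
        if len(fields) != 2 or not all(fields):
            skipped += 1
            continue
        pairs.append((fields[0], fields[1]))
    return pairs, skipped

# Illustrative input: one row lacks IPA, one row has an extra column
sample = "aɪ s iː\ti see\n\tmissing ipa\nh aɪ\thigh\textra\nl oʊ\tlow\n"
pairs, skipped = load_ipa_pairs(sample)
print(len(pairs), "kept,", skipped, "skipped")
```

Counting the skipped rows, not just discarding them, makes it easier to notice when a corpus file is more damaged than expected.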


### Load IPA→Text Models (Collapsed & Retrained)

This section loads both models from the Hugging Face Hub: the collapsed model (Model A) and the retrained, stable version (Model B). Each model is moved onto the GPU when one is available, and onto the CPU otherwise.

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using device:", device)

# Model A - Collapsed
model_A_id = "zanegraper/t5-ipa-childes-finetuned"
tokenizer_A = AutoTokenizer.from_pretrained(model_A_id)
model_A = AutoModelForSeq2SeqLM.from_pretrained(model_A_id).to(device)

# Model B - Retrained
model_B_id = "zanegraper/t5-small-ipa-phoneme-to-text"
tokenizer_B = AutoTokenizer.from_pretrained(model_B_id)
model_B = AutoModelForSeq2SeqLM.from_pretrained(model_B_id).to(device)


### IPA→Text Decoding Function

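This reusable function converts an IPA string into a text prediction using beam search (`num_beams=4`) and a repetition penalty (`repetition_penalty=1.2`), which helps stabilize generation and reduce looping. Both models are decoded with the same settings, so differences in the metrics reflect the models and not the decoding configuration. Any exception raised during decoding is caught and returned as a `[decode_error: ...]` string, so a failed sample is scored as a wrong prediction without stopping the run.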

In [None]:
def run_ipa_to_text(model, tokenizer, ipa_seq, max_new_tokens=64):
    try:
        inputs = tokenizer(ipa_seq, return_tensors="pt", padding=True).to(device)
        with torch.no_grad():
            output = model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                num_beams=4,
                early_stopping=True,
                repetition_penalty=1.2
            )
        return tokenizer.decode(output[0], skip_special_tokens=True)
    except Exception as e:
        return f"[decode_error: {e}]"
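
To make "looping behavior" measurable, one option is to score each prediction by how many of its word n-grams are repeats. The sketch below is an assumption for illustration, not part of the original evaluation: the name `repeated_ngram_ratio` and the choice of trigrams are hypothetical.

```python
from collections import Counter

def repeated_ngram_ratio(text, n=3):
    """Return the fraction of word n-grams that repeat an earlier n-gram.

    Hypothetical helper: 0.0 means no n-gram occurs twice; values near
    1.0 suggest the output is stuck in a loop.
    """
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        # Text shorter than n words has no n-grams to compare
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence beyond the first counts as a repeat
    repeated = sum(count - 1 for count in counts.values())
    return repeated / len(ngrams)

looping = repeated_ngram_ratio("the the the the the the")
natural = repeated_ngram_ratio("well what you what you what you name daddy?")
clean = repeated_ngram_ratio("i see a big wheel dere.")
print(looping, natural, clean)
```

The second example is a reference transcription from the validation subset. Child speech contains genuine repetition, so any threshold for flagging collapse would need to sit well above the ratio seen in the references themselves.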

### Evaluation Driver

This function applies a given model to the entire validation subset and stores predictions in a labeled output column. It also saves a CSV containing all generated predictions for further analysis.

In [None]:
from tqdm import tqdm

def evaluate_on_valid(model, tokenizer, df, model_label):
    df_out = df.copy()
    df_out[f"text_pred_{model_label}"] = ""

    for i, row in tqdm(df_out.iterrows(), total=len(df_out), desc=f"Evaluating {model_label}"):
        ipa_seq = row["ipa_phonemes"]
        df_out.at[i, f"text_pred_{model_label}"] = run_ipa_to_text(model, tokenizer, ipa_seq)

    # Build the path once so the saved file and the printed message always agree
    out_path = f"{OUTPUT_DIR}/valid_{model_label}.csv"
    df_out.to_csv(out_path, index=False)
    print(f"Saved predictions → {out_path}")

    return df_out

### Metric Function

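This block computes WER, CER, and BLEU using Hugging Face's `evaluate` library. WER and CER are error rates (lower is better), while BLEU is an n-gram overlap score (higher is better). Because WER and CER can exceed 1.0 when a prediction contains many inserted tokens, the function caps both at 1.0, so a reported 1.0 should be read as "1.0 or worse".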

In [None]:
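from evaluate import load

# Load each metric once rather than on every call
wer_metric = load("wer")
cer_metric = load("cer")
bleu_metric = load("bleu")

def compute_metrics(preds, refs):
    wer_val = wer_metric.compute(predictions=preds, references=refs)
    cer_val = cer_metric.compute(predictions=preds, references=refs)
    bleu_val = bleu_metric.compute(predictions=preds, references=refs)["bleu"]

    return {
        # WER and CER can exceed 1.0 (e.g. many insertions), so cap them at 1.0
        "WER": float(min(wer_val, 1.0)),
        "CER": float(min(cer_val, 1.0)),
        "BLEU": float(bleu_val)
    }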
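
To show why the cap at 1.0 matters, here is a small from-scratch sketch of corpus-level WER: total word-level edit distance divided by total reference words. It is an illustration, not the `evaluate` implementation, although the corpus-level definition is assumed to match what that library reports. The names `word_edit_distance` and `corpus_wer` and the looping prediction are hypothetical.

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance over words (substitutions, deletions, insertions)."""
    # prev[j] holds the distance between the processed reference prefix
    # and the first j hypothesis words
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def corpus_wer(preds, refs):
    """Total word errors divided by total reference words."""
    errors = sum(word_edit_distance(r.split(), p.split()) for p, r in zip(preds, refs))
    ref_word_count = sum(len(r.split()) for r in refs)
    return errors / ref_word_count

refs = ["the opposite of high is low."]
preds = ["the the the the the the the the the the"]  # made-up looping output

raw_wer = corpus_wer(preds, refs)
print("raw WER:", raw_wer, "capped WER:", min(raw_wer, 1.0))
```

A looping prediction that is longer than its reference accumulates insertions, so its raw WER climbs above 1.0. After capping, it becomes indistinguishable from a prediction that simply gets every word wrong.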

### Run A/B Evaluation

This block runs the evaluation for both models, computes the metrics, and prints the results. It provides a direct numerical comparison between the collapsed and retrained models on the same input data.

In [None]:
df_A = evaluate_on_valid(model_A, tokenizer_A, df_valid_subset, "collapsed")
df_B = evaluate_on_valid(model_B, tokenizer_B, df_valid_subset, "retrained")

metrics_A = compute_metrics(df_A["text_pred_collapsed"].tolist(),
                            df_A["transcription"].tolist())

metrics_B = compute_metrics(df_B["text_pred_retrained"].tolist(),
                            df_B["transcription"].tolist())

print("\n📊 Validation Metrics:")
print("Collapsed Model:", metrics_A)
print("Retrained Model:", metrics_B)

Evaluating collapsed: 100%|██████████| 500/500 [18:19<00:00,  2.20s/it]


Saved predictions → /content/drive/MyDrive/Capstone/Evaluation/ModelEval_Validation/valid_collapsed.csv


Evaluating retrained: 100%|██████████| 500/500 [05:34<00:00,  1.49it/s]


Saved predictions → /content/drive/MyDrive/Capstone/Evaluation/ModelEval_Validation/valid_retrained.csv

📊 Validation Metrics:
Collapsed Model: {'WER': 1.0, 'CER': 1.0, 'BLEU': 0.026991630609680836}
Retrained Model: {'WER': 0.6638992270254795, 'CER': 0.37910174152153986, 'BLEU': 0.21579100465963177}


### Conclusion

The validation results show a clear and substantial performance difference between the two IPA→Text models on the 500-sample subset. The collapsed model scored poorly on every metric: WER and CER both hit the 1.0 cap applied in `compute_metrics` (so the true error rates may be higher), and BLEU was about 0.03. These scores indicate that it regularly generated incorrect or unusable outputs, consistent with the repetition failures seen in earlier testing. In contrast, the retrained model reached a WER of 0.66, a CER of 0.38, and a BLEU of 0.22, roughly eight times the collapsed model's BLEU, which suggests that the corrective training steps stabilized decoding and improved linguistic fidelity. Given these results, the retrained model is the more reliable and accurate choice, and it should serve as the foundation for all subsequent development and integration within the correction-layer system.
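
The size of the gap can be recomputed directly from the metrics printed above. The short sketch below copies those values verbatim; the variable names are illustrative only.

```python
# Metric values copied from the printed validation output
collapsed = {"WER": 1.0, "CER": 1.0, "BLEU": 0.026991630609680836}
retrained = {"WER": 0.6638992270254795, "CER": 0.37910174152153986, "BLEU": 0.21579100465963177}

# Absolute reductions in the error rates (lower is better)
wer_drop = collapsed["WER"] - retrained["WER"]
cer_drop = collapsed["CER"] - retrained["CER"]

# Multiplicative gain in BLEU (higher is better)
bleu_ratio = retrained["BLEU"] / collapsed["BLEU"]

print(f"WER drop:   {wer_drop:.3f}")
print(f"CER drop:   {cer_drop:.3f}")
print(f"BLEU ratio: {bleu_ratio:.2f}x")
```

Because the collapsed model's WER and CER are capped at 1.0, the two drops computed here are lower bounds on the actual improvement.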