# **A/B Evaluation - Heuristic Boundaries**

Zane Graper

MSAI699 Capstone

---

This notebook conducts an A/B evaluation of a boundary-insertion heuristic designed to segment IPA phoneme sequences prior to decoding with the fine-tuned model `zanegraper/t5-small-ipa-phoneme-to-text`. The experiment compares two conditions: the baseline model using raw IPA sequences (A), and a modified version where boundaries are inserted according to a simplified phonological rule set (B). The goal is to measure whether heuristic segmentation improves transcription accuracy when decoding child speech phonemes from the CHILDES IPA corpus. Word Error Rate (WER) serves as the primary metric for determining whether boundaries provide a measurable benefit.

In [None]:
!pip install transformers accelerate datasets sentencepiece --quiet

In [None]:
import pandas as pd
import numpy as np
from transformers import T5Tokenizer, T5ForConditionalGeneration
from tqdm import tqdm
import torch

### Load Model

Loads the tokenizer and T5 model from the Hugging Face Hub and moves the model to the GPU if one is available. This prepares the decoding pipeline used for both the A and B conditions.

In [None]:
model_name = "zanegraper/t5-small-ipa-phoneme-to-text"

tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)
model.to("cuda" if torch.cuda.is_available() else "cpu")


### Boundary Heuristic

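Implements the simplified rule-based boundary insertion function: a `|` separator is inserted after any vowel or nasal token that is immediately followed by a voiceless stop (/p/, /t/, /k/). Tokens are matched exactly against the symbol sets, so long vowels and diphthongs such as `iː` or `oʊ` never trigger a boundary.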

In [None]:
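VOWELS = {"i", "ɪ", "e", "ɛ", "æ", "ɑ", "ɔ", "o", "ʊ", "u", "ə", "ʌ"}
NASALS = {"m", "n", "ŋ"}
VOICELESS_STOPS = {"p", "t", "k"}

# Tokens that trigger a boundary when the next token is a voiceless stop
BOUNDARY_TRIGGERS = VOWELS | NASALS

def approximate_boundaries(ipa_seq: str) -> str:
    """Insert "|" after each vowel/nasal token that precedes a voiceless stop.

    The input is a space-separated phoneme string. Matching is exact per token,
    so multi-character tokens (e.g. "iː", "oʊ") are not treated as vowels.
    """
    tokens = ipa_seq.split()
    if len(tokens) <= 1:
        return ipa_seq

    out = []
    # Walk over adjacent pairs (current token, next token)
    for tok, nxt in zip(tokens, tokens[1:]):
        out.append(tok)
        if tok in BOUNDARY_TRIGGERS and nxt in VOICELESS_STOPS:
            out.append("|")

    out.append(tokens[-1])
    return " ".join(out)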
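
To make the rule concrete, the sketch below restates the heuristic in a self-contained form and applies it to three short, illustrative inputs (the example strings are chosen here for demonstration, not taken from the experiment's outputs). It shows that the separator can land inside a word and that tokens with a length mark or diphthong are skipped.

```python
VOWELS = {"i", "ɪ", "e", "ɛ", "æ", "ɑ", "ɔ", "o", "ʊ", "u", "ə", "ʌ"}
NASALS = {"m", "n", "ŋ"}
VOICELESS_STOPS = {"p", "t", "k"}
TRIGGERS = VOWELS | NASALS


def approximate_boundaries(ipa_seq: str) -> str:
    """Same rule as the notebook: '|' after vowel/nasal before p, t, or k."""
    tokens = ipa_seq.split()
    if len(tokens) <= 1:
        return ipa_seq
    out = []
    for tok, nxt in zip(tokens, tokens[1:]):
        out.append(tok)
        if tok in TRIGGERS and nxt in VOICELESS_STOPS:
            out.append("|")
    out.append(tokens[-1])
    return " ".join(out)


# Illustrative inputs (assumed for this demo)
examples = ["k æ t", "d oʊ n t", "ʃ iː p"]
results = {s: approximate_boundaries(s) for s in examples}

for src, seg in results.items():
    print(f"{src!r} -> {seg!r}")
# 'k æ t'    -> 'k æ | t'     (vowel before /t/)
# 'd oʊ n t' -> 'd oʊ n | t'  (nasal before /t/, inside a single word)
# 'ʃ iː p'   -> 'ʃ iː p'      ('iː' is not in VOWELS, so no boundary)
```

Because the rule looks only at adjacent phoneme classes, the inserted separators are not guaranteed to coincide with word boundaries.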

### Load Evaluation Data

Mounts Google Drive and loads the CHILDES validation set, applying basic cleaning and sampling to create a standardized evaluation subset.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

BASE_DIR = "/content/drive/MyDrive/Capstone"
VALID_PATH = f"{BASE_DIR}/Corpus/ipa_childes/child_valid.tsv"

# Load TSV with *no* header row
df_valid = pd.read_csv(VALID_PATH, sep="\t", header=None)

# Rename columns explicitly
df_valid = df_valid.rename(columns={
    0: "ipa_phonemes",
    1: "transcription"
})

# Drop any bad rows
df_valid = df_valid.dropna(subset=["ipa_phonemes", "transcription"]).reset_index(drop=True)

# Take a random subset of 5000 for the A/B test
df = df_valid.sample(n=5000, random_state=42).reset_index(drop=True)

print("Subset shape:", df.shape)
df.head()

Subset shape: (5000, 2)


| | ipa_phonemes | transcription |
|---|---|---|
| 0 | b ɪ k ʌ z ʃ iː h æ d ð oʊ z æ n d ʃ iː h æ d d... | because she had those and she had dose. |
| 1 | w ɛ l w ʌ t j uː w ʌ t j uː w ʌ t j uː n eɪ m ... | well what you what you what you name daddy? |
| 2 | aɪ s iː ʌ b ɪ ɡ w iː l d ɛ ɹ ə | i see a big wheel dere. |
| 3 | d oʊ n t d ɪ s t ɜː b ʌ m æ n w ɛ n h iː z w ɜ... | don't disturb a man when he's working. |
| 4 | ð ɪ ɑ p ə z ɪ t ʌ v h aɪ ɪ z l oʊ | the opposite of high is low. |


### Inference Helper

Defines a helper function that handles tokenization, model generation, and decoding, enabling consistent transcript prediction for both A and B runs.

In [None]:
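def generate_text(ipa_string: str) -> str:
    """Decode one space-separated IPA string into text with the loaded T5 model."""
    # Tokenize and move the input tensors to the same device as the model
    inputs = tokenizer(ipa_string, return_tensors="pt").to(model.device)
    # Outputs are capped at 100 tokens
    output = model.generate(**inputs, max_length=100)
    return tokenizer.decode(output[0], skip_special_tokens=True)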

### Run A/B Inference

Executes transcription under both experimental conditions, raw IPA (A) and boundary-modified IPA (B), and stores the outputs in the dataframe.

In [None]:
preds_A = []
preds_B = []

for ipa in tqdm(df["ipa_phonemes"]):
    # A
    pred_A = generate_text(ipa)
    preds_A.append(pred_A)

    # B
    ipa_b = approximate_boundaries(ipa)
    pred_B = generate_text(ipa_b)
    preds_B.append(pred_B)

df["pred_no_boundaries"] = preds_A
df["pred_with_boundaries"] = preds_B

100%|██████████| 5000/5000 [1:12:47<00:00,  1.14it/s]
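
The run above decodes one utterance at a time and took roughly 73 minutes for 5,000 utterances (two `generate` calls each). A possible speed-up is to decode in padded batches. The sketch below is an assumption-laden alternative, not the method used for the reported results: `chunks` and `generate_batched` are hypothetical helper names, the batch size of 32 is arbitrary, and batched outputs should be spot-checked against the single-item outputs before being relied on.

```python
from typing import Iterator, List, Sequence


def chunks(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def generate_batched(texts, tokenizer, model, batch_size=32, max_length=100):
    """Decode a list of IPA strings in padded batches (hypothetical helper).

    `tokenizer` and `model` are expected to be the Hugging Face objects
    loaded earlier in the notebook; they are passed in explicitly here so
    the function does not depend on global state.
    """
    preds = []
    for batch in chunks(texts, batch_size):
        # padding=True pads each batch to its longest sequence
        inputs = tokenizer(batch, return_tensors="pt", padding=True).to(model.device)
        outputs = model.generate(**inputs, max_length=max_length)
        preds.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return preds
```

With the notebook's objects in scope, condition A could then be run as `generate_batched(df["ipa_phonemes"].tolist(), tokenizer, model)`, and condition B by first mapping `approximate_boundaries` over the same list.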


### Compute Metrics (WER)

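Calculates a per-utterance Word Error Rate: the word-level edit distance (substitutions, insertions, and deletions) between each prediction and its reference transcript, divided by the number of reference words. No case or punctuation normalization is applied, so `low` and `low.` count as different words. This enables a quantitative comparison of A vs. B.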

In [None]:
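def wer(ref, hyp):
    """Word-level Levenshtein distance between ref and hyp, divided by len(ref)."""
    r = ref.split()
    h = hyp.split()

    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = np.zeros((len(r)+1, len(h)+1))
    for i in range(len(r)+1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(h)+1):
        dp[0][j] = j  # insert all j hypothesis words

    for i in range(1, len(r)+1):
        for j in range(1, len(h)+1):
            cost = 0 if r[i-1] == h[j-1] else 1
            dp[i][j] = min(
                dp[i-1][j] + 1,       # deletion
                dp[i][j-1] + 1,       # insertion
                dp[i-1][j-1] + cost   # match or substitution
            )
    # max(1, ...) avoids division by zero for an empty reference
    return dp[len(r)][len(h)] / max(1, len(r))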

df["wer_A"] = df.apply(lambda x: wer(x["transcription"], x["pred_no_boundaries"]), axis=1)
df["wer_B"] = df.apply(lambda x: wer(x["transcription"], x["pred_with_boundaries"]), axis=1)
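
As a small worked example of the metric, the sketch below restates the same edit-distance recurrence in pure Python and scores one reference against a hypothetical prediction that differs only in the final full stop. The `normalize` helper is an assumption added for illustration (it is not part of the reported experiment) and shows how much of a score can come from punctuation alone.

```python
import string


def wer(ref: str, hyp: str) -> float:
    """Word-level edit distance divided by the reference length."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))  # distances for an empty reference prefix
    for i in range(1, len(r) + 1):
        curr = [i] + [0] * len(h)
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # match or substitution
        prev = curr
    return prev[len(h)] / max(1, len(r))


def normalize(text: str) -> str:
    """Hypothetical normalizer: lowercase and strip ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


ref = "the opposite of high is low."  # reference from the validation sample
hyp = "the opposite of high is low"   # hypothetical prediction, no full stop

raw_score = wer(ref, hyp)                         # 1 substitution / 6 words
norm_score = wer(normalize(ref), normalize(hyp))  # identical after cleanup

print(f"raw WER:        {raw_score:.4f}")   # 0.1667
print(f"normalized WER: {norm_score:.4f}")  # 0.0000
```

Note that this normalizer also removes apostrophes (`don't` becomes `dont`); that is harmless only if it is applied to references and predictions alike.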


### Summary Comparison

Averages the per-utterance WER values across all samples and reports the mean WER for each condition, along with the absolute difference (A minus B). A positive delta means the boundaries lowered WER; a negative delta means they raised it.

In [None]:
mean_A = df["wer_A"].mean()
mean_B = df["wer_B"].mean()

print("=== A/B Boundary Heuristic Results ===")
print(f"A (No boundaries):  {mean_A:.4f}")
print(f"B (With boundaries): {mean_B:.4f}")
print(f"Δ Improvement: {(mean_A - mean_B):.4f}")

=== A/B Boundary Heuristic Results ===
A (No boundaries):  0.6946
B (With boundaries): 0.7453
Δ Improvement: -0.0508
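
A difference in means does not by itself say how stable the gap is across utterances. One way to check, sketched below under stated assumptions, is a paired percentile bootstrap over the per-utterance differences `wer_A - wer_B`. The function name `paired_bootstrap_ci`, the 2,000 resamples, and the toy scores are all illustrative choices; no such interval was computed for the results reported above.

```python
import random


def paired_bootstrap_ci(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of paired differences a[i] - b[i]."""
    if len(a) != len(b) or not a:
        raise ValueError("a and b must be non-empty and the same length")
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)

    # Resample utterances with replacement and record each resample's mean
    means = []
    for _ in range(n_resamples):
        sample = rng.choices(diffs, k=n)
        means.append(sum(sample) / n)
    means.sort()

    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return sum(diffs) / n, means[lo_idx], means[hi_idx]


# Toy per-utterance scores (made up for illustration)
toy_wer_a = [0.5, 1.0, 0.0, 0.75]
toy_wer_b = [0.5, 1.0, 0.25, 1.0]

mean_diff, ci_low, ci_high = paired_bootstrap_ci(toy_wer_a, toy_wer_b)
print(f"mean(A - B) = {mean_diff:.4f}, 95% CI = [{ci_low:.4f}, {ci_high:.4f}]")
```

On the notebook's data the call would be `paired_bootstrap_ci(df["wer_A"].tolist(), df["wer_B"].tolist())`; an interval that excludes zero would support reading the gap as more than sampling noise.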


In [None]:
df.to_csv("boundary_ab_results.csv", index=False)
print("Saved results → boundary_ab_results.csv")

Saved results → boundary_ab_results.csv
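
The saved scores are per-utterance, so the reported means weight a one-word utterance the same as a ten-word one. A common alternative is a pooled (corpus-level) WER: total edit errors divided by total reference words. The sketch below contrasts the two on made-up toy pairs; `corpus_wer` is a hypothetical helper and the pooled figure was not computed in this experiment.

```python
def edit_distance(r, h):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        curr = [i]
        for j, hw in enumerate(h, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (rw != hw)))  # match or substitution
        prev = curr
    return prev[-1]


def corpus_wer(refs, hyps):
    """Pooled WER: summed errors over summed reference lengths."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    words = sum(len(r.split()) for r in refs)
    return errors / max(1, words)


# Toy data (made up): one short utterance that is wrong, one long one that is right
refs = ["no", "i see a big wheel"]
hyps = ["yes", "i see a big wheel"]

per_utt = [edit_distance(r.split(), h.split()) / max(1, len(r.split()))
           for r, h in zip(refs, hyps)]
mean_wer = sum(per_utt) / len(per_utt)
pooled_wer = corpus_wer(refs, hyps)

print(f"mean of per-utterance WER: {mean_wer:.4f}")    # 0.5000
print(f"pooled corpus WER:         {pooled_wer:.4f}")  # 0.1667
```

Child speech contains many very short utterances, so the two aggregates can diverge noticeably; reporting both would make the A/B comparison easier to interpret.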


### Conclusion

The boundary-enhanced condition (B) reached a mean WER of 0.7453, compared with 0.6946 for the baseline without boundaries (A), a delta of −0.0508, or roughly 7% worse in relative terms. On this 5,000-utterance sample of the CHILDES validation data, the simplified segmentation heuristic therefore degraded decoding accuracy rather than improving it. One plausible explanation is that separators inserted only at vowel/nasal → voiceless-stop transitions do not align well with the phonetic structure captured by the model and may distort the token patterns it expects. In this experiment, heuristic boundary insertion provided no measurable benefit, which suggests that segmentation, at least in this form, is not advantageous for this IPA-to-text model.