# **Fine-tuning T5 Seq2Seq Model**

Zane Graper

MSAI699 Capstone

---

This notebook fine-tunes a T5 sequence-to-sequence model to translate IPA phoneme sequences into natural-language text, with an emphasis on improving performance on real child-speech patterns. To achieve this, we combine a large corpus of clean adult speech ("perfect speech") with a CHILDES-derived child-speech dataset at a configurable ratio (70% adult, 30% child by default), then clean, normalize, and filter both sources: explicit or violent content is removed by keyword matching, overly long lines are dropped, and duplicates are discarded. After a 90/10 training/evaluation split, the notebook tokenizes both sides with the T5 tokenizer, trains the model for three epochs, and evaluates it with CER, BLEU, and chrF, metrics commonly used in phoneme-to-text research. The final steps save the trained model and tokenizer, along with a results file, for downstream deployment and reproducibility.

---

### Install Dependencies and GPU Check

This block installs the required Python packages (Transformers, Datasets, Accelerate, Evaluate, and supporting utilities), pinning the core libraries to specific versions for stable training. It also verifies that CUDA is available so that GPU acceleration is enabled.

In [11]:
!pip install --upgrade pip

# Core NLP + Transformers stack
!pip install -q transformers==4.40.2
!pip install -q datasets==2.19.0
!pip install -q accelerate==0.30.1
!pip install -q evaluate==0.4.2
!pip install -q sentencepiece
!pip install -q sacrebleu

# Metrics
!pip install -q jiwer

# Data handling
!pip install -q pandas
!pip install -q numpy

# Optional (helps with large dataset streaming)
!pip install -q pyarrow

# Optional: progress bars
!pip install -q tqdm

!pip install -q hf_transfer

# Safety check: verify GPU availability
import torch
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())


PyTorch version: 2.8.0+cu128
CUDA available: True
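
Because only some of the packages above are pinned, it can help to confirm that the running environment actually matches the pins. The following is a minimal sketch using the standard-library `importlib.metadata`; the `EXPECTED` dictionary and the `check_versions` helper are illustrative names, not part of the original notebook.

```python
from importlib import metadata

# Pins copied from the install cell above (hypothetical helper data).
EXPECTED = {
    "transformers": "4.40.2",
    "datasets": "2.19.0",
    "accelerate": "0.30.1",
    "evaluate": "0.4.2",
}

def check_versions(expected):
    """Return {package: (expected, found)} for each mismatch or missing package."""
    problems = {}
    for name, want in expected.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed at all
        if found != want:
            problems[name] = (want, found)
    return problems

# Report anything that deviates from the pins; prints nothing if all match.
for name, (want, found) in check_versions(EXPECTED).items():
    print(f"{name}: expected {want}, found {found or 'not installed'}")
```

An empty result means every pinned package is installed at the pinned version.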


### Configuration Setup

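This section defines the core configuration values: paths to the perfect-speech and CHILDES data, the perfect-to-child mixing ratio, the train/eval split ratio, maximum line lengths (in characters), and the keyword patterns used for content filtering. The same cell then loads both corpora, cleans them, merges them at the configured ratio, and writes the train and eval CSV files, so the corpus composition is reproducible from these settings.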
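
The content filter in the cell below joins a list of regular expressions into a single case-insensitive pattern. As a rough illustration of how the word-boundary anchors (`\b`) behave, here is a small standalone sketch; it uses only three of the notebook's patterns, and the `is_flagged` helper and example sentences are made up for demonstration.

```python
import re

# A small subset of the notebook's ADULT_WORDS patterns, for illustration only.
patterns = [r"\bkill\b", r"\bmoan", r"\bknife\b"]
pattern = re.compile("|".join(patterns), flags=re.IGNORECASE)

def is_flagged(text):
    """True if any pattern matches anywhere in the text."""
    return pattern.search(text) is not None

examples = [
    "He picked up the KNIFE",      # flagged: case-insensitive whole-word match
    "a skilled painter",           # kept: \bkill\b does not match inside "skilled"
    "she was moaning about rain",  # flagged: \bmoan has no trailing \b
    "the dog chased the ball",     # kept: no pattern matches
]
flags = [is_flagged(t) for t in examples]

for text, flag in zip(examples, flags):
    print(f"{'REMOVE' if flag else 'keep  '}  {text}")
```

Patterns with `\b` on both sides match whole words only, while a pattern such as `\bmoan` also catches inflected forms. The trade-off is that keyword filtering removes any line containing a listed word, regardless of context.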

In [8]:
import pandas as pd
import numpy as np
import re

# ===========================================================
# CONFIGURATION
# ===========================================================
PERFECT_PATH = "bookcorpus_ipa_final.csv"
CHILD_TRAIN_PATH = "child_train.tsv"
CHILD_VALID_PATH = "child_valid.tsv"

# Ratio perfect : child in merged dataset (e.g., 0.7 = 70% perfect, 30% child)
RATIO_PERFECT = 0.70

# Train/eval split
TRAIN_SPLIT = 0.90

# Max allowed length in characters (the length filter uses str.len())
MAX_TEXT_LEN = 300
MAX_IPA_LEN = 300

# ===========================================================
# ADULT CONTENT FILTERS
# ===========================================================
ADULT_WORDS = [
    # sexual content
    r"\bsex\b", r"\bsexual\b", r"\bthrust\b", r"\bmoan", r"\bpleasure\b",
    r"\bnaked\b", r"\bbreasts?\b", r"\bclit\b", r"\bpenis\b", r"\bvagina\b",
    r"\bordered\b", r"\borgasm\b", r"\berotic\b", r"\bcondom\b",

    # graphic or intimate touch
    r"\btongue\b", r"\bwetly\b", r"\bcupping\b", r"\bswollen\b",

    # violence (optional)
    r"\bblood\b", r"\bstab\b", r"\bkill\b", r"\bknife\b",
    r"\bmurder\b", r"\bslit\b"
]

def remove_explicit_text(df, text_col="text"):
    pattern = re.compile("|".join(ADULT_WORDS), flags=re.IGNORECASE)
    mask = ~df[text_col].str.contains(pattern, na=False)
    removed = len(df) - mask.sum()
    print(f"Filtered explicit/violent content: removed {removed:,} lines")
    return df[mask]

# ===========================================================
# LENGTH FILTERS
# ===========================================================
def filter_length(df):
    before = len(df)
    df = df[
        (df["text"].str.len() <= MAX_TEXT_LEN) &
        (df["phonemes"].str.len() <= MAX_IPA_LEN)
    ]
    removed = before - len(df)
    print(f"Filtered long lines: removed {removed:,} lines")
    return df

# ===========================================================
# NORMALIZE TEXT (quotes, stray punctuation)
# ===========================================================
def clean_text(s):
    if not isinstance(s, str):
        return s
    s = s.replace("''", "'")
    s = s.replace("  ", " ")
    return s.strip()

def normalize_df(df):
    df["text"] = df["text"].astype(str).apply(clean_text)
    df["phonemes"] = df["phonemes"].astype(str).str.strip()
    return df

# ===========================================================
# LOAD DATA
# ===========================================================
print("Loading perfect-speech dataset...")
df_perfect = pd.read_csv(PERFECT_PATH)

print("Normalizing perfect corpus...")
df_perfect = normalize_df(df_perfect)

print("Loading child-speech datasets...")
df_child_train = pd.read_csv(CHILD_TRAIN_PATH, sep="\t", header=None, names=["phonemes", "text"])
df_child_valid = pd.read_csv(CHILD_VALID_PATH, sep="\t", header=None, names=["phonemes", "text"])

df_child = pd.concat([df_child_train, df_child_valid], ignore_index=True)
df_child = normalize_df(df_child)

print(f"Perfect speech samples (raw): {len(df_perfect):,}")
print(f"CHILDES samples (raw): {len(df_child):,}")

# ===========================================================
# CLEAN BOTH CORPORA
# ===========================================================
print("\nCleaning perfect corpus...")
df_perfect = remove_explicit_text(df_perfect)
df_perfect = filter_length(df_perfect)
df_perfect = df_perfect.drop_duplicates()

print("\nCleaning CHILDES corpus...")
df_child = remove_explicit_text(df_child)
df_child = filter_length(df_child)
df_child = df_child.drop_duplicates()

print(f"\nPerfect after cleaning: {len(df_perfect):,}")
print(f"CHILDES after cleaning: {len(df_child):,}")

# ===========================================================
# DETERMINE OPTIMAL COUNTS
# ===========================================================
child_n = len(df_child)
perfect_required = int((RATIO_PERFECT / (1 - RATIO_PERFECT)) * child_n)

perfect_available = len(df_perfect)
perfect_n = min(perfect_required, perfect_available)

print(f"\nDesired perfect samples: {perfect_required:,}")
print(f"Using perfect samples: {perfect_n:,}")
print(f"Using child samples: {child_n:,}")

# ===========================================================
# SAMPLE DATASETS
# ===========================================================
df_perfect_sampled = df_perfect.sample(perfect_n, random_state=42)
df_child_sampled = df_child  # keep all child speech

df_merged = pd.concat([df_perfect_sampled, df_child_sampled], ignore_index=True)

# Shuffle
df_merged = df_merged.sample(frac=1.0, random_state=42).reset_index(drop=True)

print(f"Merged dataset size (cleaned): {len(df_merged):,}")

# ===========================================================
# TRAIN / EVAL SPLIT
# ===========================================================
train_size = int(TRAIN_SPLIT * len(df_merged))

df_train = df_merged.iloc[:train_size]
df_eval = df_merged.iloc[train_size:]

print(f"Training set size: {len(df_train):,}")
print(f"Evaluation set size: {len(df_eval):,}")

# ===========================================================
# SAVE OUTPUT
# ===========================================================
df_train.to_csv("train_merged.csv", index=False)
df_eval.to_csv("eval_merged.csv", index=False)

print("\nSaved cleaned corpora:")
print(" - train_merged.csv")
print(" - eval_merged.csv")


Loading perfect-speech dataset...
Normalizing perfect corpus...
Loading child-speech datasets...
Perfect speech samples (raw): 788,370
CHILDES samples (raw): 184,171

Cleaning perfect corpus...
Filtered explicit/violent content: removed 16,427 lines
Filtered long lines: removed 829 lines

Cleaning CHILDES corpus...
Filtered explicit/violent content: removed 562 lines
Filtered long lines: removed 56 lines

Perfect after cleaning: 771,045
CHILDES after cleaning: 183,553

Desired perfect samples: 428,290
Using perfect samples: 428,290
Using child samples: 183,553
Merged dataset size (cleaned): 611,843
Training set size: 550,658
Evaluation set size: 61,185

Saved cleaned corpora:
 - train_merged.csv
 - eval_merged.csv
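
The sample counts in the log above follow directly from the configured ratios. The sketch below reproduces that arithmetic from the logged post-cleaning counts. It is a standalone check, not part of the original pipeline, and it loads no data.

```python
RATIO_PERFECT = 0.70
TRAIN_SPLIT = 0.90

child_n = 183_553            # "CHILDES after cleaning" from the log above
perfect_available = 771_045  # "Perfect after cleaning" from the log above

# Solve p / (p + c) = r for p:  p = r / (1 - r) * c
perfect_required = int((RATIO_PERFECT / (1 - RATIO_PERFECT)) * child_n)
perfect_n = min(perfect_required, perfect_available)

merged_n = perfect_n + child_n
train_n = int(TRAIN_SPLIT * merged_n)
eval_n = merged_n - train_n

print(f"perfect used : {perfect_n:,}")
print(f"merged       : {merged_n:,}")
print(f"train / eval : {train_n:,} / {eval_n:,}")
print(f"perfect share: {perfect_n / merged_n:.4f}")
```

Because all child-speech rows are kept and the perfect-speech corpus is down-sampled to fit, the child corpus size determines the size of the merged dataset.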


### Inspect the Saved Train/Eval Files

This section reloads the train and evaluation CSV files written by the previous cell, checks that the expected `phonemes` and `text` columns are present, and prints a random sample of rows from each file for manual inspection.

In [9]:
import pandas as pd
import numpy as np

TRAIN_PATH = "train_merged.csv"
EVAL_PATH = "eval_merged.csv"

# Number of random rows to preview
N = 50

def inspect_file(path, n=N):
    print(f"\n=== Sampling {n} rows from {path} ===")
    df = pd.read_csv(path)
    
    # Ensure required columns exist
    assert "phonemes" in df.columns, "Missing column: phonemes"
    assert "text" in df.columns, "Missing column: text"
    
    # Sample without replacement
    sample = df.sample(n=min(n, len(df)), random_state=np.random.randint(0, 10**6))

    for i, row in sample.iterrows():
        print(f"\n--- Sample {i} ---")
        print("IPA: ", row["phonemes"])
        print("Text:", row["text"])

# Inspect both files
inspect_file(TRAIN_PATH)
inspect_file(EVAL_PATH)



=== Sampling 50 rows from train_merged.csv ===

--- Sample 538855 ---
IPA:  aɪ m ʌ s t h æ v h ə t h ɪ z l ɛ g b ɪ k ɔ z h i s u n g eɪ v ʌ p ð ʌ ʧ eɪ s
Text: i must have hurt his leg because he soon gave up the chase '

--- Sample 459816 ---
IPA:  aɪ k ʊ d n ɑ t h ɛ l p g l æ n s ɪ ŋ oʊ v ə æ t eɪ n ʤ ʌ l
Text: i could not help glancing over at angel

--- Sample 505200 ---
IPA:  s oʊ f ɑ r ð ɛ r w ɑ z n ʌ θ ɪ ŋ
Text: so far there was nothing

--- Sample 151642 ---
IPA:  g oʊ t u ð ʌ f ɔ l z ʃ i k ʊ d f ɑ l oʊ ð ʌ s aʊ n d
Text: go to the falls ' she could follow the sound

--- Sample 75153 ---
IPA:  aɪ b ɛ t aɪ s aʊ n d ʌ d l aɪ k ɛ v ə i ʌ ð ə t i n eɪ ʤ g ə l ɪ n ʌ m ɛ r ʌ k ʌ d r u l ɪ ŋ oʊ v ə d eɪ ʤ oʊ n z
Text: i bet i sounded like every other teenage girl in america drooling over day jones

--- Sample 512479 ---
IPA:  f aɪ n ɪ t w ɑ z m i h i ʌ d m ɪ t ʌ d h æ ŋ ɪ ŋ h ɪ z h ɛ d
Text: fine it was me ' he admitted hanging his head

--- Sample 135677 ---
IPA:  h i m eɪ h æ v l aɪ

In [12]:
# ===============================================================
# T5 IPA → TEXT TRAINING SCRIPT
# (Aligned with methodology from prior notebook)
# ===============================================================

import pandas as pd
import numpy as np
import torch
from datasets import Dataset
from tqdm import tqdm

import evaluate
cer = evaluate.load("cer")
bleu = evaluate.load("bleu")
chrf = evaluate.load("chrf")

from transformers import (
    T5Tokenizer,
    T5ForConditionalGeneration,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments
)

# ===============================================================
# CONFIGURATION
# ===============================================================
TRAIN_PATH = "train_merged.csv"
EVAL_PATH  = "eval_merged.csv"
OUTPUT_DIR = "./t5_ipa_child_model"

MODEL_NAME = "t5-small"    # change to "t5-base" if wanted and VRAM allows
MAX_LENGTH = 256

# ===============================================================
# LOAD DATA
# ===============================================================
print("Loading cleaned corpora...")
df_train = pd.read_csv(TRAIN_PATH)
df_eval  = pd.read_csv(EVAL_PATH)

dataset_train = Dataset.from_pandas(df_train)
dataset_eval  = Dataset.from_pandas(df_eval)

print(f"Train samples: {len(dataset_train):,}")
print(f"Eval samples : {len(dataset_eval):,}")

# ===============================================================
# LOAD MODEL + TOKENIZER
# ===============================================================
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)

assert tokenizer.pad_token_id is not None, "Tokenizer must define PAD token."

# ===============================================================
# TOKENIZATION
# ===============================================================
def preprocess(batch):
    model_inputs = tokenizer(
        batch["phonemes"],
        max_length=MAX_LENGTH,
        truncation=True,
        padding="max_length",
    )

    labels = tokenizer(
        batch["text"],
        max_length=MAX_LENGTH,
        truncation=True,
        padding="max_length",
    )["input_ids"]

    # Replace padding token with -100 for loss masking
    labels = [
        [(tok if tok != tokenizer.pad_token_id else -100) for tok in seq]
        for seq in labels
    ]

    model_inputs["labels"] = labels
    return model_inputs

print("Tokenizing...")
tokenized_train = dataset_train.map(preprocess, batched=True, remove_columns=dataset_train.column_names)
tokenized_eval  = dataset_eval.map(preprocess, batched=True, remove_columns=dataset_eval.column_names)

# ===============================================================
# METRICS
# ===============================================================
IGNORE_ID = -100

def compute_metrics(eval_pred):
    preds, labels = eval_pred

    preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace masked label IDs (-100) with pad token so decoding works
    labels = [
        [(tok if tok != IGNORE_ID else tokenizer.pad_token_id) for tok in seq]
        for seq in labels
    ]
    labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    return {
        "cer"  : cer.compute(predictions=preds, references=labels),
        "bleu" : bleu.compute(predictions=preds, references=labels)["bleu"],
        "chrf" : chrf.compute(predictions=preds, references=labels)["score"]
    }

# ===============================================================
# DATA COLLATOR
# ===============================================================
collator = DataCollatorForSeq2Seq(tokenizer, model=model)

# ===============================================================
# TRAINING ARGUMENTS (aligned with prior notebook)
# ===============================================================
training_args = Seq2SeqTrainingArguments(
    output_dir=OUTPUT_DIR,
    do_train=True,
    do_eval=True,

    num_train_epochs=3,
    learning_rate=2e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=8,

    fp16=True,
    weight_decay=0.01,
    warmup_ratio=0.1,
    max_grad_norm=1.0,
    lr_scheduler_type="linear",

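    logging_steps=100,
    save_steps=1000,
    # Note: evaluation_strategy is left at its default ("no"), so eval_steps
    # has no effect and evaluation runs only in trainer.evaluate() below.
    eval_steps=1000,
    predict_with_generate=True,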

    push_to_hub=False,
)

# ===============================================================
# TRAINER
# ===============================================================
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    tokenizer=tokenizer,
    data_collator=collator,
    compute_metrics=compute_metrics,
)

# ===============================================================
# TRAIN
# ===============================================================
print("Starting training...")
train_result = trainer.train()

# Save model + tokenizer
trainer.save_model(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)

print(f"Model saved to {OUTPUT_DIR}")

# ===============================================================
# FINAL EVALUATION
# ===============================================================
print("Running final evaluation...")
metrics = trainer.evaluate()
print(metrics)

# ===============================================================
# WRITE RESULTS TO FILE
# ===============================================================
with open("training_results.txt", "w") as f:
    f.write("=== TRAINING & EVALUATION RESULTS ===\n")
    f.write(str(metrics))
    f.write("\n\nTraining Summary:\n")
    f.write(str(train_result))

print("Training results saved to training_results.txt")


Loading cleaned corpora...
Train samples: 550,658
Eval samples : 61,185




You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Tokenizing...



Starting training...


| Step | Training Loss |
|------|---------------|
| 100  | 5.554  |
| 200  | 4.7977 |
| 300  | 4.3878 |
| 400  | 4.2301 |
| 500  | 4.0648 |
| 600  | 4.0061 |
| 700  | 3.9227 |
| 800  | 3.8255 |
| 900  | 3.755  |
| 1000 | 3.6561 |
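
The logged steps are optimizer updates, not individual batches. The sketch below estimates how many such updates the configured run contains. It assumes a single GPU and that updates per epoch are counted as the floor of batches divided by accumulation steps. The variable names are illustrative.

```python
import math

train_samples = 550_658   # "Train samples" from the log
per_device_batch = 8      # per_device_train_batch_size
grad_accum = 8            # gradient_accumulation_steps
epochs = 3                # num_train_epochs
num_devices = 1           # assumption: single GPU

# Samples contributing to one optimizer update.
effective_batch = per_device_batch * grad_accum * num_devices

# Batches the dataloader yields per epoch (last partial batch included).
batches_per_epoch = math.ceil(train_samples / (per_device_batch * num_devices))

# Optimizer updates per epoch and over the whole run.
updates_per_epoch = batches_per_epoch // grad_accum
total_updates = updates_per_epoch * epochs

# Fraction of the data consumed when the last update has been applied.
final_epoch = total_updates * grad_accum / batches_per_epoch

print(f"effective batch size: {effective_batch}")
print(f"updates per epoch   : {updates_per_epoch:,}")
print(f"total updates       : {total_updates:,}")
print(f"final epoch value   : {final_epoch:.10f}")
```

Under these assumptions the computed final epoch value agrees with the `epoch` field reported by the final evaluation, which suggests the run completed all three epochs even though the loss table stops at step 1,000.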


Model saved to ./t5_ipa_child_model
Running final evaluation...




{'eval_loss': 0.544379472732544, 'eval_cer': 0.246944650000786, 'eval_bleu': 0.5811898999604037, 'eval_chrf': 70.53047702197442, 'eval_runtime': 2135.8609, 'eval_samples_per_second': 28.647, 'eval_steps_per_second': 3.581, 'epoch': 2.9999564162538315}
Training results saved to training_results.txt
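
To help interpret `eval_cer`, the following is a minimal pure-Python sketch of character error rate: total character-level edit distance divided by total reference length. It is an illustration only and is not the implementation behind `evaluate.load("cer")`. The function names and example strings are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete a character from a
                curr[j - 1] + 1,     # insert a character into a
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]

def corpus_cer(predictions, references):
    """Total character edits divided by total reference characters."""
    edits = sum(edit_distance(p, r) for p, r in zip(predictions, references))
    ref_chars = sum(len(r) for r in references)
    return edits / ref_chars

# Hypothetical prediction with one misspelled word.
preds = ["so far their was nothing"]
refs = ["so far there was nothing"]
print(f"CER: {corpus_cer(preds, refs):.4f}")
```

Read this way, the reported `eval_cer` of about 0.247 corresponds to roughly 25 character edits per 100 reference characters.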
