# T5-small Fine-tuning for Phoneme → Text Conversion

Zane Graper

Capstone

This notebook trains a T5-small IPA→Text sequence-to-sequence model on a large corpus of IPA–text pairs (up to 250,000 sampled rows). Whereas earlier notebooks focused on dataset construction and evaluation, this file contains the full supervised fine-tuning pipeline: tokenization, data preprocessing, metric computation, training configuration, and post-training validation. The workflow follows common practice for sequence-to-sequence training: padding tokens in the labels are masked so they do not contribute to the loss, beam search is used for decoding, and outputs are scored with CER, BLEU, and chrF, which capture complementary error behaviors (character-level edit distance, word n-gram overlap, and character n-gram F-score). The structure provides a clean, reproducible baseline for comparing variants (e.g., ByT5, different phoneme symbol sets, or child-speech-enhanced models).

---

### Step 1: Install Requirements

Installs all training dependencies, including SentencePiece, Hugging Face Transformers, evaluation libraries, and sacrebleu, and prints the active transformers version for reproducibility.

In [None]:
# ---- REQUIREMENTS ----
!pip install -q sentencepiece pandas tqdm
!pip install -q jiwer
!pip install -q "transformers>=4.38.0" "datasets>=2.16.0" "evaluate>=0.4.1" "accelerate>=0.25.0" "packaging>=23.2"
!pip install -q sacrebleu
import transformers, packaging
print("transformers:", transformers.__version__)

transformers: 4.57.1


### Step 2: Load Evaluation Metrics

Initializes CER, BLEU, and chrF metrics from `evaluate`, ensuring they are available for both training-time evaluation and post-training validation.

In [None]:
import evaluate
cer  = evaluate.load("cer")
bleu = evaluate.load("bleu")
chrf = evaluate.load("chrf")
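
As a rough illustration of what the CER metric measures, the sketch below computes a character error rate as the Levenshtein edit distance divided by the reference length. The function names `edit_distance` and `char_error_rate` are hypothetical, and this is a simplified stand-in for intuition only, not the implementation behind `evaluate.load("cer")`.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions, and substitutions each cost 1."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of `a`
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of `b`
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_error_rate(prediction: str, reference: str) -> float:
    """Character edits needed to turn the prediction into the reference, per reference character."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(prediction, reference) / len(reference)


# Toy strings (not taken from the corpus): one substituted character.
print(char_error_rate("the cat sad", "the cat sat"))
```

Lower is better: 0.0 means an exact character match, and values above 1.0 are possible when the prediction is much longer than the reference.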

### Step 3: Mount Google Drive

In [None]:
# ---- Mount Google Drive ----
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Step 4: Imports

Loads dataset utilities, tokenizer/model classes, data collators, Trainer components, PyTorch, and tqdm to support preprocessing, batching, training, and evaluation.

In [None]:
# ---- IMPORTS ----
import os, pandas as pd
from datasets import Dataset
from transformers import (
    AutoTokenizer, AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq, Seq2SeqTrainer, Seq2SeqTrainingArguments
)
import evaluate
from tqdm import tqdm
import torch

In [None]:
# sanity: columns exist (run this cell after Step 5, which defines `dataset` and `tokenizer`)
assert {"phonemes","text"}.issubset(set(dataset.column_names)), dataset.column_names

# sanity: pad token
assert tokenizer.pad_token_id is not None, "Tokenizer needs a pad_token_id"

### Step 5: Data Loading, Splitting, Tokenization, Metric Function, and Trainer Setup

Loads up to 250k IPA/text pairs, splits them 95/5 into training and validation sets, defines the preprocessing/tokenization function, defines a `compute_metrics` function that reports CER, BLEU, and chrF, sets up the data collator, and initializes a `Seq2SeqTrainer` instance with the chosen hyperparameters and stability settings (weight decay, gradient clipping, and learning-rate warmup).
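
The padding-masking step can be sketched in isolation on toy token IDs. The helper names `mask_padding` and `unmask_padding` are hypothetical, and the token IDs are made up for illustration; the only assumptions about the real setup are that the pad token ID is 0 (as it is for `t5-small`) and that label positions set to -100 are ignored by the loss.

```python
IGNORE_ID = -100  # label value that PyTorch cross-entropy skips by default


def mask_padding(label_ids, pad_token_id):
    """Replace pad tokens with IGNORE_ID so they do not contribute to the loss."""
    return [[tok if tok != pad_token_id else IGNORE_ID for tok in seq]
            for seq in label_ids]


def unmask_padding(label_ids, pad_token_id):
    """Undo the masking so the labels can be decoded back to text for metrics."""
    return [[tok if tok != IGNORE_ID else pad_token_id for tok in seq]
            for seq in label_ids]


PAD_ID = 0  # assumption: matches tokenizer.pad_token_id for t5-small
toy_labels = [[37, 1712, 1, 0, 0],   # made-up IDs, padded to length 5
              [8, 1, 0, 0, 0]]

masked = mask_padding(toy_labels, PAD_ID)
print(masked)  # [[37, 1712, 1, -100, -100], [8, 1, -100, -100, -100]]
print(unmask_padding(masked, PAD_ID) == toy_labels)  # True
```

The same round trip appears in the notebook: `preprocess` applies the mask before training, and `compute_metrics` reverses it before decoding the references.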

In [None]:
# ---- PATHS ----
csv_path = "/content/drive/MyDrive/Capstone/Corpus/bookcorpus_ipa_final.csv"
output_dir = "/content/drive/MyDrive/Capstone/Models/t5_small_ipa_to_text"

# ---- LOAD DATA ----
df = pd.read_csv(csv_path)
# df = df.sample(n=5000, random_state=42)  # Small Test
df = df.sample(n=min(len(df), 250_000), random_state=42)
dataset = Dataset.from_pandas(df)

# ---- TRAIN / VAL SPLIT ----
ds = dataset.train_test_split(test_size=0.05, seed=42)
train_ds, val_ds = ds["train"], ds["test"]

# ---- LOAD MODEL ----
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_name = "t5-small"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# ---- TOKENIZATION FUNCTION ----
MAX_LENGTH = 256

def preprocess(batch):
    inputs = tokenizer(
        batch["phonemes"],
        max_length=MAX_LENGTH,
        truncation=True,
        padding="max_length"
    )

    labels = tokenizer(
        batch["text"],
        max_length=MAX_LENGTH,
        truncation=True,
        padding="max_length"
    )["input_ids"]

    # mask out padding tokens for loss computation
    labels = [
        [(lid if lid != tokenizer.pad_token_id else -100) for lid in l]
        for l in labels
    ]
    inputs["labels"] = labels
    return inputs

tokenized_train = train_ds.map(preprocess, batched=True, remove_columns=dataset.column_names)
tokenized_val   = val_ds.map(preprocess, batched=True, remove_columns=dataset.column_names)

# ---- METRICS ----
import evaluate
cer  = evaluate.load("cer")
bleu = evaluate.load("bleu")
chrf = evaluate.load("chrf")

IGNORE_ID = -100

def compute_metrics(eval_pred):
    preds, labels = eval_pred
    preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # replace -100 with pad_token_id
    labels = [[(tok if tok != IGNORE_ID else tokenizer.pad_token_id) for tok in seq] for seq in labels]
    labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    return {
        "cer": cer.compute(predictions=preds, references=labels),
        "bleu": bleu.compute(predictions=preds, references=labels)["bleu"],
        "chrf": chrf.compute(predictions=preds, references=labels)["score"],
    }

# ---- DATA COLLATOR ----
from transformers import DataCollatorForSeq2Seq
collator = DataCollatorForSeq2Seq(tokenizer, model=model)

# ---- TRAINING ARGS ----
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    do_train=True,
    do_eval=True,

    # short, fast run
    save_steps=500,
    eval_steps=500,                # only used if an evaluation strategy of "steps" is also configured
    logging_steps=100,
    num_train_epochs=3,            # 3 epochs
    learning_rate=2e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=8,

    # stability
    weight_decay=0.01,
    max_grad_norm=1.0,
    warmup_ratio=0.1,

    # generation / misc
    predict_with_generate=True,
    fp16=True,
    lr_scheduler_type="linear",
    push_to_hub=False,
)

# ---- TRAINER ----
from transformers import Seq2SeqTrainer, EarlyStoppingCallback

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer,
    data_collator=collator,
    compute_metrics=compute_metrics,
)



### Step 6: Training and Saving the Model

Runs the supervised fine-tuning loop and saves both the model weights and tokenizer to the specified output directory.
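
To get a feel for how long this run is, the arithmetic below estimates the number of optimizer steps implied by the Step 5 settings. It is a back-of-the-envelope sketch: it assumes a single GPU and a 237,500-example training split (95% of 250,000 rows), and the exact rounding inside `Trainer` may differ slightly between `transformers` versions.

```python
import math

train_examples = 237_500   # assumption: 95% of 250,000 sampled rows
per_device_batch = 8       # per_device_train_batch_size
grad_accum = 8             # gradient_accumulation_steps
num_devices = 1            # assumption: single GPU
epochs = 3                 # num_train_epochs
warmup_ratio = 0.1

# Examples consumed per optimizer update.
effective_batch = per_device_batch * grad_accum * num_devices

# Approximate optimizer updates per epoch and over the whole run.
steps_per_epoch = math.ceil(train_examples / effective_batch)
total_steps = steps_per_epoch * epochs

# Steps during which the learning rate ramps up linearly.
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(f"effective batch size: {effective_batch}")
print(f"steps per epoch     : {steps_per_epoch}")
print(f"total steps         : {total_steps}")
print(f"warmup steps        : {warmup_steps}")
```

Under these assumptions the effective batch size is 64 and the warmup covers roughly the first 1,100 optimizer steps, so the first 1,000 logged steps would fall inside the warmup window, where the learning rate is still ramping up.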

In [None]:
# ---- TRAIN ----
trainer.train()

# ---- SAVE ----
trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)
print(f"✅ Model and tokenizer saved to {output_dir}")


| Step | Training Loss |
|------|---------------|
| 100  | 5.4249 |
| 200  | 4.4830 |
| 300  | 4.2252 |
| 400  | 4.0824 |
| 500  | 3.9408 |
| 600  | 3.8347 |
| 700  | 3.6996 |
| 800  | 3.5225 |
| 900  | 3.3567 |
| 1000 | 3.2044 |


✅ Model and tokenizer saved to /content/drive/MyDrive/Capstone/Models/t5_small_ipa_to_text


### Step 7: Manual Validation Evaluation and Sample Inspection

Iterates over the validation set using beam search decoding to compute CER, BLEU, and chrF outside the Trainer loop, printing both metric scores and sample predictions for qualitative inspection.
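
The loop in this step decodes one example at a time. As a possible variation, the sketch below groups examples into batches before calling `model.generate`. The helper names `chunked` and `generate_in_batches` are hypothetical, the batch size is an arbitrary choice, and the approach assumes every row has already been padded to the same length (as `preprocess` does with `padding="max_length"`). It has not been benchmarked against the notebook's loop.

```python
def chunked(items, size):
    """Yield consecutive slices of a list, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def generate_in_batches(model, tokenizer, rows, batch_size=32, **gen_kwargs):
    """Decode a list of row dicts ("input_ids", "attention_mask") batch by batch.

    `rows` must be a plain list of dicts with equal-length ID lists,
    e.g. list(tokenized_val) in this notebook.
    """
    import torch  # imported lazily so `chunked` can be used without PyTorch

    texts = []
    model.eval()
    for batch in chunked(rows, batch_size):
        input_ids = torch.tensor([r["input_ids"] for r in batch]).to(model.device)
        attention_mask = torch.tensor([r["attention_mask"] for r in batch]).to(model.device)
        with torch.no_grad():
            output_ids = model.generate(
                input_ids=input_ids,
                attention_mask=attention_mask,
                **gen_kwargs,
            )
        texts.extend(tokenizer.batch_decode(output_ids, skip_special_tokens=True))
    return texts


# The chunking helper on its own:
print(list(chunked([1, 2, 3, 4, 5], 2)))  # [[1, 2], [3, 4], [5]]
```

With the objects defined in this notebook, a call might look like `generate_in_batches(model, tokenizer, list(tokenized_val), max_length=128, num_beams=4, early_stopping=True)`, which mirrors the generation settings used in the cell below.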

In [None]:
# ---- EVALUATION AND SAMPLE ANALYSIS ----
import torch
from tqdm import tqdm
import evaluate

# reload metrics to ensure clean state
cer_metric  = evaluate.load("cer")
bleu_metric = evaluate.load("bleu")
chrf_metric = evaluate.load("chrf")

# run evaluation on the validation set
preds = []
refs  = []

model.eval()
for i in tqdm(range(len(tokenized_val))):
    input_ids = torch.tensor(tokenized_val[i]["input_ids"]).unsqueeze(0).to(model.device)
    attention_mask = torch.tensor(tokenized_val[i]["attention_mask"]).unsqueeze(0).to(model.device)

    with torch.no_grad():
        output_ids = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            max_length=128,
            num_beams=4,             # beam search improves quality
            early_stopping=True
        )
    pred_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    label_ids = [tok if tok != -100 else tokenizer.pad_token_id for tok in tokenized_val[i]["labels"]]
    ref_text  = tokenizer.decode(label_ids, skip_special_tokens=True)

    preds.append(pred_text)
    refs.append(ref_text)

# ---- METRICS ----
cer_score  = cer_metric.compute(predictions=preds, references=refs)
bleu_score = bleu_metric.compute(predictions=preds, references=refs)["bleu"]
chrf_score = chrf_metric.compute(predictions=preds, references=refs)["score"]

print("\n=== Validation Metrics ===")
print(f"CER  : {cer_score:.4f}")
print(f"BLEU : {bleu_score:.4f}")
print(f"CHRF : {chrf_score:.4f}")

# ---- SAMPLE OUTPUTS ----
print("\n=== Sample Predictions ===")
for i in range(5):
    phonemes = val_ds[i]["phonemes"]
    ref = refs[i]
    pred = preds[i]
    print(f"Phonemes: {phonemes[:120]}...")
    print(f"Reference: {ref}")
    print(f"Predicted: {pred}")
    print("-"*80)

100%|██████████| 12500/12500 [1:07:27<00:00,  3.09it/s]



=== Validation Metrics ===
CER  : 0.1505
BLEU : 0.6065
CHRF : 76.2944

=== Sample Predictions ===
Phonemes: ɪ t ɪ z n ɑ t h ɪ z s ɑ l ʌ d j ɛ t ʌ n ɛ k s ʌ p ʃ eɪ n j ʌ l g ɪ t ɑ r p l eɪ ɪ ŋ ð æ t t eɪ k s ʌ s b aɪ k ʌ m p l i ...
Reference: it is not his solid yet unexceptional guitar playing that takes us by complete surprise
Predicted: it is not his silver yet inexplicable get or playing that takes us by complete surprise ''
--------------------------------------------------------------------------------
Phonemes: ʃ ɔ n k ʌ m p ɛ r d ð ʌ ɪ n f ə m eɪ ʃ ʌ n r æ z k eɪ m ʌ p w ɪ ð f r ʌ m æ l ʌ k s ɪ z n oʊ t s ʤ ə n ʌ l z ʌ n d d eɪ ...
Reference: sean compared the information raz came up with from alexs notes journals and database with what british intelligence had on eleazar
Predicted: then compording information res came up with from alex is notes angels and date a base with what brought the information had in earlier
-------------------------------------------------------------

This notebook implements a complete fine-tuning pipeline for converting IPA phoneme sequences into English text using T5-small. It combines structured preprocessing, padding masking so that pad positions are excluded from the loss, beam-search generation, and three complementary metrics (CER, BLEU, chrF), so performance can be assessed both quantitatively and qualitatively. On the 12,500-example validation split the model reaches a CER of 0.1505, a BLEU of 0.6065, and a chrF of 76.29, and the sample predictions show word-level substitutions such as "solid" → "silver" and "unexceptional" → "inexplicable". The pipeline is modular, allowing adjustments such as larger datasets, alternative tokenizers, or different hyperparameters without a redesign. The manual validation loop gives a held-out check on model behavior that complements the training loss.

**Takeaway**: This script is the core training engine of the project, producing the fine-tuned IPA→Text model used in downstream evaluation and deployment.