## **T5 Fine-Tuning Pipeline**

Zane Graper

Capstone

This notebook performs full fine-tuning of the T5-small IPA→Text model on the CHILDES corpus prepared earlier.
The corpus provides three IPA “views” (raw, boundary-augmented, rule-based). Each run trains on the single view selected by `IPA_COLUMN`, so the views can be compared to find the most robust mapping from child-speech IPA to text.

The goal of this training run is to produce a final fine-tuned model that can decode child IPA more accurately than the baseline pretrained model.

### **Setup and Mount Drive**

This section:

* Mounts Google Drive for dataset and model access.

* Installs all required libraries:

   * `transformers` (model + trainer)

   * `datasets` (Hugging Face dataset objects)

   * `sentencepiece` (T5 tokenizer)

   * `accelerate` (GPU acceleration)

   * `evaluate` + `jiwer` (WER computation)

This prepares the entire training environment.

In [None]:
# ============================================================
# 1. Setup and Mount Drive
# ============================================================
from google.colab import drive
drive.mount('/content/drive')

!pip install transformers datasets sentencepiece accelerate evaluate jiwer --quiet

Mounted at /content/drive

### **Configuration**

This block defines all the key training hyperparameters:

| Variable                    | Description                                          |
| --------------------------- | ---------------------------------------------------- |
| `MODEL_NAME`                | The starting pretrained model (baseline IPA→Text T5) |
| `MAX_LEN`                   | Max token length for encoder/decoder inputs          |
| `LR`                        | Learning rate                                        |
| `EPOCHS`                    | How long to train                                    |
| `BATCH_SIZE`                | Batch size per device                                |
| `IPA_COLUMN`                | Which IPA representation to use (raw/boundary/rule)  |
| `train_path` / `valid_path` | Locations of the 3-view CHILDES corpus               |
| `output_dir`                | Where to save the final fine-tuned model             |

This allows controlled experiments using different IPA variants simply by changing `IPA_COLUMN`.

In [None]:
# ============================================================
# 2. Configuration
# ============================================================
MODEL_NAME = "zanegraper/t5-small-ipa-phoneme-to-text"
MAX_LEN = 128
LR = 3e-4
EPOCHS = 5
BATCH_SIZE = 16

# Choose which IPA view you want to train on:
IPA_COLUMN = "IPA_boundary"   # options: IPA_raw | IPA_boundary | IPA_rule

input_dir = "/content/drive/MyDrive/Capstone/Corpus/ipa_childes"
train_path = f"{input_dir}/train_3view.tsv"
valid_path = f"{input_dir}/valid_3view.tsv"

output_dir = "/content/drive/MyDrive/Capstone/Models/t5_childes_finetuned"
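The hyperparameters above also fix the length of the run. As a rough sketch, the number of optimizer steps can be estimated from the batch size, the epoch count, and the 100,000-example training cap set in the next section. The helper `optimizer_steps` and the constant `NUM_DEVICES` are hypothetical names introduced here, and the estimate assumes a single device and no gradient accumulation.

```python
import math

# Values copied from the configuration cell. TRAIN_CAP is the training-set cap
# applied later in the notebook. NUM_DEVICES = 1 is an assumption.
BATCH_SIZE = 16
EPOCHS = 5
TRAIN_CAP = 100_000
NUM_DEVICES = 1


def optimizer_steps(num_examples, batch_size, epochs, num_devices=1):
    """Return (steps_per_epoch, total_steps), assuming no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / (batch_size * num_devices))
    return steps_per_epoch, steps_per_epoch * epochs


steps_per_epoch, total_steps = optimizer_steps(TRAIN_CAP, BATCH_SIZE, EPOCHS, NUM_DEVICES)
print(steps_per_epoch, total_steps)  # 6250 31250
```

With `logging_steps=100`, a run of this size would emit roughly 312 loss entries under these assumptions.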

### **Load and Prepare Dataset**

This section loads the 3-view CHILDES dataset:
```
train_3view.tsv
valid_3view.tsv
```

It extracts only the chosen IPA column and the text column:
```
IPA_COLUMN → "IPA"
"Text"      → "Text"
```

This ensures T5 is trained on a consistent input format, while easily allowing comparisons across IPA views.

To keep training time manageable:

* The training set is capped (`TRAIN_CAP` = 100000)

* The validation set is capped (`VALID_CAP` = 10000)

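If a split exceeds its cap, it is randomly subsampled with a fixed seed so that runs are reproducible:
```
train_df = train_df.sample(n=TRAIN_CAP, random_state=42)
valid_df = valid_df.sample(n=VALID_CAP, random_state=42)
```

This ensures experiments can run even with limited GPU/CPU resources.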
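The capping logic can be illustrated without pandas. The sketch below uses Python's `random` module rather than `DataFrame.sample`, so it will not select the same rows as the notebook. It only shows the idea: leave small splits untouched and draw a seeded sample from large ones. `cap_rows` is a hypothetical helper, and the rows are toy stand-ins for corpus entries.

```python
import random


def cap_rows(rows, cap, seed=42):
    """Return at most `cap` rows, sampled without replacement with a fixed seed."""
    if len(rows) <= cap:
        return list(rows)  # already small enough: keep every row
    return random.Random(seed).sample(rows, cap)


rows = [f"utterance_{i}" for i in range(1000)]  # toy stand-in for corpus rows
first = cap_rows(rows, cap=100)
second = cap_rows(rows, cap=100)
print(len(first), first == second)  # 100 True
```

Because the seed is fixed, repeated runs draw the same subset, which keeps comparisons across IPA views on equal footing.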

In [None]:
# ============================================================
# 3. Load and Prepare Dataset
# ============================================================
import pandas as pd
from datasets import Dataset

df_train = pd.read_csv(train_path, sep="\t")
df_valid = pd.read_csv(valid_path, sep="\t")

# Select only needed columns
train_df = df_train[[IPA_COLUMN, "Text"]].rename(columns={IPA_COLUMN: "IPA"})
valid_df = df_valid[[IPA_COLUMN, "Text"]].rename(columns={IPA_COLUMN: "IPA"})

# ============================================================
# CAP THE DATASET SIZES
# ============================================================
TRAIN_CAP = 100000     # adjust if needed
VALID_CAP = 10000

if len(train_df) > TRAIN_CAP:
    train_df = train_df.sample(n=TRAIN_CAP, random_state=42).reset_index(drop=True)

if len(valid_df) > VALID_CAP:
    valid_df = valid_df.sample(n=VALID_CAP, random_state=42).reset_index(drop=True)

print(f"Final capped sizes -> Train: {len(train_df)}, Valid: {len(valid_df)}")

# Convert to HF Dataset AFTER sampling
train_ds = Dataset.from_pandas(train_df)
valid_ds = Dataset.from_pandas(valid_df)

print("Example row:", train_ds[0])
print(f"HF sizes -> Train: {len(train_ds)}, Valid: {len(valid_ds)}")

Final capped sizes -> Train: 100000, Valid: 10000
Example row: {'IPA': 'ɡoʊɪŋɹaʊndsɪɡəɹɹɛt', 'Text': 'going round cigarette.'}
HF sizes -> Train: 100000, Valid: 10000


### **Tokenization**

This step loads the tokenizer and defines the preprocessing function.

What tokenization does:

* Encodes the IPA sequence into integer IDs.

* Encodes the target text into labels.

* Pads/truncates sequences to `MAX_LEN`.

* Produces a dictionary usable by the Trainer.

Both the training and validation sets are processed with:
```
Dataset.map(preprocess, batched=True, remove_columns=["IPA", "Text"])
```

Passing `batched=True` tokenizes many rows per call, which is considerably faster than mapping row by row.
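To make the pad/truncate step concrete, here is a minimal pure-Python sketch of what `truncation=True` with `padding="max_length"` does to a single sequence. The function `pad_or_truncate` and the toy token IDs are hypothetical. The pad ID 0 and end-of-sequence ID 1 follow the standard T5 vocabulary and are assumed to hold for this checkpoint as well.

```python
PAD_ID = 0  # pad token ID in the standard T5 vocabulary (assumed here)
EOS_ID = 1  # end-of-sequence token ID in the standard T5 vocabulary (assumed here)


def pad_or_truncate(ids, max_len, pad_id=PAD_ID, eos_id=EOS_ID):
    """Return (input_ids, attention_mask), each exactly `max_len` long."""
    ids = list(ids) + [eos_id]  # the tokenizer appends EOS to every sequence
    if len(ids) > max_len:
        # Truncate the content but keep EOS as the final token
        ids = ids[: max_len - 1] + [eos_id]
    pad_count = max_len - len(ids)
    attention_mask = [1] * len(ids) + [0] * pad_count  # 0 marks padding
    return ids + [pad_id] * pad_count, attention_mask


print(pad_or_truncate([5, 6, 7], max_len=6))
# ([5, 6, 7, 1, 0, 0], [1, 1, 1, 1, 0, 0])
print(pad_or_truncate([5, 6, 7, 8, 9, 10, 11], max_len=6))
# ([5, 6, 7, 8, 9, 1], [1, 1, 1, 1, 1, 1])
```

Padding every example to `MAX_LEN` keeps tensor shapes fixed at the cost of extra computation on short utterances.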


In [None]:
# ============================================================
# 4. Tokenization
# ============================================================
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

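def preprocess(batch):
    # Encode the IPA strings as fixed-length encoder inputs
    inputs = tokenizer(
        batch["IPA"],
        max_length=MAX_LEN,
        truncation=True,
        padding="max_length"
    )
    # Encode the target text as fixed-length decoder labels
    labels = tokenizer(
        batch["Text"],
        max_length=MAX_LEN,
        truncation=True,
        padding="max_length"
    )["input_ids"]

    # Replace padding in the labels with -100 so the loss ignores those positions
    inputs["labels"] = [
        [tok if tok != tokenizer.pad_token_id else -100 for tok in seq]
        for seq in labels
    ]
    return inputs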

train_tokenized = train_ds.map(preprocess, batched=True, remove_columns=["IPA", "Text"])
valid_tokenized = valid_ds.map(preprocess, batched=True, remove_columns=["IPA", "Text"])


### **Initialize Model and Trainer**

This section loads:

* The pretrained T5 model

* A `DataCollatorForSeq2Seq` for batch collation and padding

* The WER metric via `evaluate`

* A metric function that decodes predictions and labels and computes WER whenever an evaluation pass is run

This sets up all components the Hugging Face Trainer needs.
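For reference, word error rate is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The sketch below is a minimal single-pair implementation for illustration only. The notebook itself relies on the `evaluate` WER metric, which aggregates over the whole evaluation set. `word_error_rate` is a hypothetical helper, the reference string is the example row shown earlier, and the hypothesis is an invented model output.

```python
def word_error_rate(reference, hypothesis):
    """WER for one pair: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = "going round cigarette."     # target text from the example row
hypothesis = "going around cigarette."   # invented prediction for illustration
print(round(word_error_rate(reference, hypothesis), 3))  # 0.333
```

One substituted word out of three reference words gives a WER of one third. Lower is better, and values above 1.0 are possible when the hypothesis contains many insertions.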

In [None]:
# ============================================================
# 5. Initialize Model and Trainer
# ============================================================
from transformers import (
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    TrainingArguments,
    Trainer
)

import evaluate
import numpy as np

model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

metric = evaluate.load("wer")

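def compute_metrics(eval_pred):
    preds, labels = eval_pred

    # The plain Trainer passes logits (possibly wrapped in a tuple) rather than
    # generated token IDs, so reduce them to IDs with an argmax first.
    if isinstance(preds, tuple):
        preds = preds[0]
    if preds.ndim == 3:
        preds = preds.argmax(axis=-1)

    # Replace the -100 ignore index with the pad token so decoding works
    preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    return {"wer": metric.compute(predictions=decoded_preds, references=decoded_labels)}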



### **Training Configuration**

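This block defines:

* Where outputs should go

* Training schedule (epochs, learning rate, batch size)

* Logging frequency (every 100 steps)

* Checkpointing, which is disabled (`save_strategy="no"`), so only the final model is saved

Because the evaluation strategy is commented out, no validation pass (and therefore no WER computation) runs during training.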
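The configuration below does not set a learning-rate scheduler, so the Trainer's default should apply: a linear decay from `LR` to zero with no warmup. The following sketch reproduces that curve in plain Python, under the assumption that those defaults are in effect. `linear_lr` is a hypothetical helper, and the 31,250 total steps assume 100,000 examples, batch size 16, 5 epochs, and a single device.

```python
def linear_lr(step, base_lr=3e-4, total_steps=31_250, warmup_steps=0):
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 1_000, 15_625, 31_250):
    print(step, linear_lr(step))
```

Under these assumptions the rate has halved by the midpoint of training and reaches zero at the last step, which is the shape the learning-rate chart in the training logs would be expected to show.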


In [None]:
# ============================================================
# 6. Training Configuration (NO CHECKPOINT SAVING)
# ============================================================
training_args = TrainingArguments(
    output_dir=output_dir,
    # evaluation_strategy="epoch",
    save_strategy="no",               # <-- IMPORTANT: no checkpoints
    learning_rate=LR,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    num_train_epochs=EPOCHS,
    weight_decay=0.01,
    logging_steps=100,
    # predict_with_generate=False,
    push_to_hub=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_tokenized,
    eval_dataset=valid_tokenized,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)



### **Train the Model**

This section:

* Runs `trainer.train()`

* Displays step-level loss progress

* Saves the final fine-tuned model to:
```
/content/drive/MyDrive/Capstone/Models/t5_childes_finetuned
```

With the default reporting settings, the run is also logged to Weights & Biases (the cell prompts for an API key), enabling visualization of:

* Loss curves

* Learning rate

* Training throughput

The final saved model includes:

* Model weights (`model.safetensors` in recent `transformers` releases, `pytorch_model.bin` in older ones)

* `config.json`

* `tokenizer.json`

* `special_tokens_map.json`

* `tokenizer_config.json`

In [None]:
# ============================================================
# 7. Train the Model
# ============================================================
trainer.train()

# Save final model only
trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)

print(f"✅ Final fine-tuned model saved to: {output_dir}")



Step,Training Loss
100,1.1037
200,0.1521
300,0.1413
400,0.1313
500,0.1229
600,0.1225
700,0.1157
800,0.1113
900,0.1081
1000,0.1083
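The logged values can be summarized with a few lines of standard-library Python. This sketch parses the ten entries shown above (the only ones captured in this output) and reports the lowest loss and the relative reduction between the first and last entries. The variable names are illustrative.

```python
import csv
import io

# The ten log entries captured in the cell output above
LOG = """Step,Training Loss
100,1.1037
200,0.1521
300,0.1413
400,0.1313
500,0.1229
600,0.1225
700,0.1157
800,0.1113
900,0.1081
1000,0.1083
"""

rows = [
    (int(r["Step"]), float(r["Training Loss"]))
    for r in csv.DictReader(io.StringIO(LOG))
]

first_step, first_loss = rows[0]
last_step, last_loss = rows[-1]
best_step, best_loss = min(rows, key=lambda row: row[1])
reduction = 1 - last_loss / first_loss  # relative drop from first to last entry

print(f"best loss {best_loss} at step {best_step}")
print(f"relative reduction over steps {first_step}-{last_step}: {reduction:.1%}")
```

Most of the drop happens within the first 200 steps, after which the loss declines slowly. Training loss alone does not measure decoding quality, so a WER evaluation on the validation set remains the more meaningful check.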


### **Result**

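At the end of this notebook, you obtain a fully fine-tuned T5 model built from:

* CHILDES child-speech IPA

* One IPA view per run, selected by `IPA_COLUMN` (`IPA_boundary` in this configuration) out of the three available (raw, boundary, rule)

* 100k training examples

* 10k validation examples held out for evaluation

The fine-tuned model is intended to handle child IPA variation more robustly than the baseline. This notebook records only training loss, so the size of the improvement should be confirmed with a separate WER evaluation.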

It is now ready for:

* Inference testing

* Rule-based IPA evaluation

* Hugging Face Hub upload

* Integration into the complete correction pipeline