## **CHILDES IPA Corpus Construction**

Zane Graper

Capstone

This notebook builds a high-quality, multi-view IPA corpus for fine-tuning the T5-small IPA-to-Text model.
It uses the phonemetransformers/IPA-CHILDES dataset and generates three aligned versions of each utterance:

* Raw IPA (canonical)

* Boundary-augmented IPA

* Rule-augmented IPA (phonological correction layer)

Each IPA representation is paired with the same canonical text transcript, enabling robust training across different phoneme conditions.

### **1. Environment Setup**

Mounts Google Drive for storage.

Installs the required libraries:

* `datasets` (Hugging Face Datasets)

* `pandas`

The cells below also import `os` and `re` from the standard library.

Creates a dedicated output directory under:
MyDrive/Capstone/Corpus/ipa_childes/.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

!pip install datasets pandas --quiet

import os
from datasets import load_dataset
import pandas as pd

# Define output directory in Drive
output_dir = "/content/drive/MyDrive/Capstone/Corpus/ipa_childes"
os.makedirs(output_dir, exist_ok=True)

Mounted at /content/drive


### **2. Load the IPA-CHILDES Dataset**

The notebook loads:

```
dataset = load_dataset("phonemetransformers/IPA-CHILDES", "EnglishNA", split="train")
```
This dataset contains:

* IPA transcriptions

* Raw gloss text

* Child / adult metadata

* Word and morpheme counts

By default the notebook keeps both child and adult lines to help the model generalize better. A commented-out filter on `is_child` restricts the corpus to child speech if needed.
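The speaker and length filters can be tried on a small frame before running them on the full dataset. The sketch below uses made-up rows that only mimic the `processed_gloss` and `is_child` columns; the real dataset has many more columns.

```python
import pandas as pd

# Toy rows that mimic two columns of the dataset (values are made up).
toy = pd.DataFrame({
    "processed_gloss": ["more juice", "i want more juice", "where did the ball go?"],
    "is_child": [True, True, False],
})

min_words = 3  # same threshold as the notebook cell

# Whitespace-delimited word count, then the minimum-length filter.
toy["word_count"] = toy["processed_gloss"].str.split().str.len()
kept = toy[toy["word_count"] >= min_words]

# Optional child-only view, using the boolean column as a mask.
child_only = kept[kept["is_child"]]

print(kept["processed_gloss"].tolist())
print(child_only["processed_gloss"].tolist())
```

The two-word utterance is dropped by the length filter, and the child-only view additionally drops the adult line.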

In [None]:
# ---------------------------------
# Load CHILDES dataset
# ---------------------------------
dataset = load_dataset("phonemetransformers/IPA-CHILDES", "EnglishNA", split="train")
df = dataset.to_pandas()

# Keep BOTH adult + child speech (better generalization)
# If you want child only, uncomment below:
# df = df[df["is_child"] == True]

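# Minimum utterance length (whitespace-delimited words in the gloss)
min_words = 3
# .str.len() tolerates missing glosses, where apply(len) would raise a TypeError
df["word_count"] = df["processed_gloss"].str.split().str.len()
# .copy() avoids SettingWithCopyWarning when new columns are added later
df = df[df["word_count"] >= min_words].copy()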



### **3. Filtering and Cleaning**

The notebook applies several preprocessing steps:

1. Min utterance length filter - Removes short utterances (configurable via `min_words`; applied in the loading cell above).

2. IPA cleaning - Removes:

   * `WORD_BOUNDARY` tokens

   * excessive whitespace

3. Text cleaning - Converts to lowercase and strips leading and trailing whitespace.

4. Duplicate removal - Ensures one IPA ↔ Text mapping per unique utterance.

The result is a clean paired dataset with columns:

* `IPA_raw`

* `Text`
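The effect of the cleaning and de-duplication steps is easiest to see on a couple of strings. This is a minimal sketch with hypothetical `_demo` helper names and made-up rows, following the same regular expressions as the notebook's cleaning cell.

```python
import re
import pandas as pd

def clean_ipa_demo(text):
    """Drop WORD_BOUNDARY tokens and collapse runs of whitespace."""
    if not isinstance(text, str):
        return ""
    text = re.sub(r"WORD_BOUNDARY", "", text)
    return re.sub(r"\s+", " ", text).strip()

def clean_text_demo(text):
    """Lowercase and strip leading/trailing whitespace."""
    if not isinstance(text, str):
        return ""
    return text.lower().strip()

# Two identical utterances plus one row with missing values (made-up data).
toy = pd.DataFrame({
    "ipa_transcription": ["d æ d i WORD_BOUNDARY n iː d z WORD_BOUNDARY",
                          "d æ d i WORD_BOUNDARY n iː d z WORD_BOUNDARY",
                          None],
    "processed_gloss": ["  Daddy needs ", "daddy needs", None],
})
toy["IPA_raw"] = toy["ipa_transcription"].apply(clean_ipa_demo)
toy["Text"] = toy["processed_gloss"].apply(clean_text_demo)

# Drop empty rows, then keep one row per unique (IPA_raw, Text) pair.
toy = toy[(toy["IPA_raw"] != "") & (toy["Text"] != "")]
toy = toy.drop_duplicates(subset=["IPA_raw", "Text"]).reset_index(drop=True)
print(toy[["IPA_raw", "Text"]])
```

Only one row survives: the missing-value row is emptied and dropped, and the two remaining rows collapse into a single pair after normalization.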

In [None]:
# ---------------------------------
# Cleaning functions
# ---------------------------------
import re

def clean_ipa(text):
    if not isinstance(text, str): return ""
    text = re.sub(r"WORD_BOUNDARY", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

def clean_text(text):
    if not isinstance(text, str): return ""
    return text.lower().strip()

df["IPA_raw"] = df["ipa_transcription"].apply(clean_ipa)
df["Text"] = df["processed_gloss"].apply(clean_text)

df = df[(df["IPA_raw"] != "") & (df["Text"] != "")]
df = df.drop_duplicates(subset=["IPA_raw", "Text"]).reset_index(drop=True)


### **4. Boundary Augmentation**

Generates `IPA_boundary`, a version of the IPA with boundary markers (`|`) inserted by a phonotactic heuristic:

* Inserts a boundary between a vowel or nasal and a following voiceless stop (`p`, `t`, `k`).

This helps the model identify word boundaries that may be missing in child IPA.

The rule used is conservative, designed to avoid over-segmentation.
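As an illustration of the heuristic, the sketch below applies the same vowel/nasal and voiceless-stop sets to two short sequences. The function name `mark_boundaries` and the input strings are chosen for this example only.

```python
# Token sets mirror the ones used in the notebook's boundary rule.
LEFT_CONTEXT = {"a", "e", "i", "o", "u", "æ", "ʌ", "ɪ", "ʊ", "ə", "n", "m"}
VOICELESS_STOPS = {"p", "t", "k"}

def mark_boundaries(ipa_seq):
    """Insert "|" between a vowel/nasal token and a following voiceless stop."""
    tokens = ipa_seq.split()
    if not tokens:
        return ""
    out = []
    for tok, nxt in zip(tokens, tokens[1:]):
        out.append(tok)
        if tok in LEFT_CONTEXT and nxt in VOICELESS_STOPS:
            out.append("|")
    out.append(tokens[-1])
    return " ".join(out)

print(mark_boundaries("ð æ t s ɔ ŋ"))   # boundary after "æ"
print(mark_boundaries("d oʊ n t n oʊ")) # boundary after the first "n"
```

Note that a multi-character token such as `oʊ` is not in the left-context set, so it never triggers a boundary, and that the marker can land inside a word (for example between `n` and `t` in "don't") as well as at a true word edge.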

In [None]:
# ---------------------------------
# Boundary Augmentation
# ---------------------------------
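VOWELS_AND_NASALS = {"a","e","i","o","u","æ","ʌ","ɪ","ʊ","ə","n","m"}
VOICELESS_STOPS = {"p","t","k"}

def approximate_boundaries(ipa_seq):
    """Insert "|" between a vowel/nasal token and a following voiceless stop."""
    tokens = ipa_seq.split()
    if not tokens:  # guard: tokens[-1] would raise IndexError on empty input
        return ""
    out = []
    # Walk over adjacent token pairs; the last token is appended afterwards
    for tok, nxt in zip(tokens, tokens[1:]):
        out.append(tok)
        if tok in VOWELS_AND_NASALS and nxt in VOICELESS_STOPS:
            out.append("|")
    out.append(tokens[-1])
    return " ".join(out)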

df["IPA_boundary"] = df["IPA_raw"].apply(approximate_boundaries)

### **5. Rule-Based IPA Correction**

A refined phonological “correction layer” is applied to produce IPA_rule using:

* Final consonant restoration

* Liquid restoration (ɹ, l)

* Nasal restoration (n, m)

* s-cluster restoration

* Weak vowel (ə) insertion

All rules are lexicon-gated, meaning they only apply when the resulting IPA sequence corresponds to an actual word in the CHILDES-derived lexicon.

This version is intended to help the model:

* recover missing segments

* learn from corrected pronunciations

* normalize child speech IPA
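Lexicon gating can be sketched with a toy lexicon: a candidate repair is accepted only if the joined phoneme string is a known entry. The lexicon contents and helper names below are hypothetical, and only two of the rules (final consonant and s-cluster restoration) are shown.

```python
# Hypothetical lexicon of collapsed IPA word keys (not the real lexicon file).
TOY_LEXICON = {"kæt", "stɑp"}

def lex_ok(tokens, lexicon=TOY_LEXICON):
    """A candidate passes the gate only if its joined form is in the lexicon."""
    return "".join(tokens) in lexicon

def restore_final_consonant(tokens):
    """Try appending a coda after a final vowel; keep the first gated match."""
    vowels = {"æ", "ɪ", "ʌ", "ə", "ɑ", "oʊ", "u"}
    codas = ["t", "d", "k", "n", "s", "m"]
    if tokens and tokens[-1] in vowels:
        for coda in codas:
            candidate = tokens + [coda]
            if lex_ok(candidate):
                return candidate
    return tokens  # no gated repair found: leave the input unchanged

def restore_s_cluster(tokens):
    """Try prepending "s" before an initial voiceless stop."""
    if tokens and tokens[0] in {"p", "t", "k"}:
        candidate = ["s"] + tokens
        if lex_ok(candidate):
            return candidate
    return tokens

print(restore_final_consonant("k æ".split()))  # repaired: "kæt" is in the lexicon
print(restore_final_consonant("b æ".split()))  # unchanged: no candidate is in the lexicon
print(restore_s_cluster("t ɑ p".split()))      # repaired: "stɑp" is in the lexicon
```

Because the gate compares the whole joined sequence against the lexicon, a repair can only fire when that entire sequence (plus the restored segment) is a lexicon entry.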

In [None]:
# ---------------------------------
# Load lexicon (you already generated this)
# ---------------------------------
lexicon_path = "/content/drive/MyDrive/Capstone/Corpus/lexicon/lexicon.txt"
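# One entry per line; UTF-8 is given explicitly because entries may contain IPA characters.
# Blank lines are skipped so the empty string never enters the lexicon.
with open(lexicon_path, encoding="utf-8") as f:
    LEXICON = {line.strip().lower() for line in f if line.strip()}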

def ipa_to_word(ipa_seq):
    """Temporary mapping: collapse phonemes to a 'word key'."""
    return ipa_seq.replace(" ", "").replace("|", "")


### **6. Collapse IPA Forms**

Before training, all IPA forms are collapsed to remove spaces:
```
IPA_raw:       "k æ t" → "kæt"
IPA_boundary:  "k æ | t" → "kæ|t"
IPA_rule:      "k æ t" (corrected) → "kæt"
```

This ensures compatibility with the tokenizer, which was trained on unsegmented phoneme strings.
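The collapse step can be reproduced on a one-row frame. This is a small sketch with made-up values; `regex=False` is passed so that `|` is treated as a literal character regardless of the pandas version.

```python
import pandas as pd

# One spaced utterance in each of the three views (toy values).
views = pd.DataFrame({
    "IPA_raw": ["k æ t"],
    "IPA_boundary": ["k æ | t"],
    "IPA_rule": ["k æ | t"],  # derived from the boundary view, so it still carries "|"
})

collapsed = pd.DataFrame({
    "IPA_raw": views["IPA_raw"].str.replace(" ", "", regex=False),
    "IPA_boundary": views["IPA_boundary"].str.replace(" ", "", regex=False),
    # The rule view drops both the spaces and the boundary marker.
    "IPA_rule": views["IPA_rule"]
        .str.replace(" ", "", regex=False)
        .str.replace("|", "", regex=False),
})

print(collapsed.iloc[0].tolist())
```

Only the boundary view keeps the `|` marker after collapsing.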

In [None]:
# ---------------------------------
# Rule-Based Correction (v1.1 rules)
# ---------------------------------
def correct_phonemes(ipa_seq):
    seq = ipa_seq.split()
    out = seq[:]

    def lex_ok(tokens):
        return ipa_to_word("".join(tokens)) in LEXICON

    # Rule A: Final consonant restoration
    vowels = ["æ","ɪ","ʌ","ə","ɑ","oʊ","u"]
    codas = ["t","d","k","n","s","m"]
    if out and out[-1] in vowels:
        for c in codas:
            cand = out + [c]
            if lex_ok(cand):
                out = cand
                break

    # Rule B: Liquid restoration
    if out and out[-1] in ["ɑ", "ɔ", "oʊ"]:
        for liq in ["ɹ", "l"]:
            cand = out + [liq]
            if lex_ok(cand):
                out = cand
                break

    # Rule C: Nasal restoration
    if out and out[-1] in ["oʊ", "ɑ", "u"]:
        for n in ["n", "m"]:
            cand = out + [n]
            if lex_ok(cand):
                out = cand
                break

    # Rule D: s-cluster restoration
    if out and out[0] in ["p", "t", "k"]:
        cand = ["s"] + out
        if lex_ok(cand):
            out = cand

    # Rule E: Weak vowel restoration
    bad_clusters = [("b","n"), ("n","n"), ("d","m"), ("t","n")]
    for (c1,c2) in bad_clusters:
        for i in range(len(out)-1):
            if out[i] == c1 and out[i+1] == c2:
                cand = out[:i+1] + ["ə"] + out[i+1:]
                if lex_ok(cand):
                    out = cand
                    break

    return " ".join(out)

df["IPA_rule"] = df["IPA_boundary"].apply(correct_phonemes)

### **7. Train/Validation Split**

A randomized 90/10 split is applied to the combined dataset.

Output files saved:
```
train_3view.tsv
valid_3view.tsv
```

Each row contains:

* IPA_raw
* IPA_boundary
* IPA_rule
* Text
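A seeded shuffle followed by a 90/10 cut, and a round trip through a TSV file, can be checked on a toy frame. The sketch below writes to a temporary directory rather than Drive; the row values are placeholders.

```python
import os
import tempfile
import pandas as pd

# Ten placeholder rows standing in for the 3-view corpus.
toy = pd.DataFrame({
    "IPA_raw": [f"ipa{i}" for i in range(10)],
    "Text": [f"text {i}" for i in range(10)],
})

# Shuffle with a fixed seed so the split is reproducible, then cut at 90%.
shuffled = toy.sample(frac=1, random_state=42)
cut = int(0.9 * len(shuffled))
train_df, valid_df = shuffled.iloc[:cut], shuffled.iloc[cut:]

# Round trip: write the training split as TSV and read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train_3view.tsv")
    train_df.to_csv(path, sep="\t", index=False)
    reloaded = pd.read_csv(path, sep="\t")

print(len(train_df), len(valid_df), list(reloaded.columns))
```

Writing with `index=False` keeps the shuffled row index out of the file, so the reloaded frame has only the data columns.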

In [None]:
# ---------------------------------
# Collapse for model input
# ---------------------------------
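# regex=False treats " " and "|" as literal characters in every pandas version
df["IPA_raw"] = df["IPA_raw"].str.replace(" ", "", regex=False)
df["IPA_boundary"] = df["IPA_boundary"].str.replace(" ", "", regex=False)
df["IPA_rule"] = df["IPA_rule"].str.replace(" ", "", regex=False).str.replace("|", "", regex=False)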

### **8. Summary + Preview**

The notebook prints:

* dataset sizes

* random sample rows

* the first 20 corpus entries for manual inspection.

Be sure to verify that:

* IPA strings are valid

* boundaries and corrections are reasonable

* text lines look clean and aligned
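Part of this checklist can be automated. The function below is a hypothetical helper (not part of the notebook) that runs a few structural checks on a collapsed 3-view frame; it cannot judge whether a correction is linguistically reasonable, so manual inspection is still needed.

```python
import pandas as pd

def check_views(frame):
    """Return a dict of boolean structural checks for a collapsed 3-view frame."""
    ipa_cols = ["IPA_raw", "IPA_boundary", "IPA_rule"]
    return {
        # Collapsed IPA should contain no spaces and no leftover boundary tokens.
        "no_spaces": not any(
            frame[c].str.contains(" ", regex=False).any() for c in ipa_cols),
        "no_word_boundary_token": not any(
            frame[c].str.contains("WORD_BOUNDARY", regex=False).any() for c in ipa_cols),
        # The rule view should have had its "|" markers stripped.
        "rule_has_no_pipe": not frame["IPA_rule"].str.contains("|", regex=False).any(),
        # Removing "|" from the boundary view should give back the raw view.
        "boundary_matches_raw": bool(
            (frame["IPA_boundary"].str.replace("|", "", regex=False)
             == frame["IPA_raw"]).all()),
        "text_is_lowercase": bool((frame["Text"] == frame["Text"].str.lower()).all()),
    }

good = pd.DataFrame({"IPA_raw": ["kæt"], "IPA_boundary": ["kæ|t"],
                     "IPA_rule": ["kæt"], "Text": ["cat"]})
bad = pd.DataFrame({"IPA_raw": ["k æt"], "IPA_boundary": ["kæ|t"],
                    "IPA_rule": ["kæ|t"], "Text": ["Cat"]})

print(check_views(good))
print(check_views(bad))
```

Any check that comes back `False` points at the pipeline step to revisit, for example a missed collapse or an unstripped marker.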

In [None]:
# ---------------------------------
# Train/Valid split
# ---------------------------------
df = df.sample(frac=1, random_state=42)
cut = int(0.9 * len(df))
train_df, valid_df = df.iloc[:cut], df.iloc[cut:]

# Save ALL three views for training
train_df.to_csv(os.path.join(output_dir, "train_3view.tsv"), sep="\t", index=False)
valid_df.to_csv(os.path.join(output_dir, "valid_3view.tsv"), sep="\t", index=False)

print("Saved 3-view dataset:")
print("Train:", len(train_df), "rows")
print("Valid:", len(valid_df), "rows")

train_df.head(10)

Saved 3-view dataset:
Train: 1033399 rows
Valid: 114823 rows


Unnamed: 0,processed_gloss,ipa_transcription,character_split_utterance,is_child,id,gloss,stem,type,language,num_morphemes,...,collection_id,corpus_id,speaker_id,target_child_id,transcript_id,word_count,IPA_raw,Text,IPA_boundary,IPA_rule
190240,daddy needs two flashes.,d æ d i WORD_BOUNDARY n iː d z WORD_BOUNDARY t...,d a d d y WORD_BOUNDARY n e e d s WORD_BOUNDAR...,False,743590,Daddy needs two flashes,Daddy need two flash,declarative,eng haw,6,...,2,43,2344,2341,4205,4,dædiniːdztuːflæʃɪz,daddy needs two flashes.,dædiniːdztuːflæʃɪz,dædiniːdztuːflæʃɪz
342890,should we show her that song?,ʃ ʊ d WORD_BOUNDARY w iː WORD_BOUNDARY ʃ oʊ WO...,s h o u l d WORD_BOUNDARY w e WORD_BOUNDARY s ...,False,331820,should we show her that song,should we show her that song,question,eng,6,...,2,35,1704,1702,3686,6,ʃʊdwiːʃoʊhɜːðætsɔŋ,should we show her that song?,ʃʊdwiːʃoʊhɜːðæ|tsɔŋ,ʃʊdwiːʃoʊhɜːðætsɔŋ
293751,lots of presents here.,l ɑ t s WORD_BOUNDARY ʌ v WORD_BOUNDARY p ɹ ɛ ...,l o t s WORD_BOUNDARY o f WORD_BOUNDARY p r e ...,False,835985,lots_of presents here,lots_of present here,declarative,eng,4,...,2,45,2515,2514,4304,4,lɑtsʌvpɹɛzəntshɪɹ,lots of presents here.,lɑtsʌvpɹɛzən|tshɪɹ,lɑtsʌvpɹɛzəntshɪɹ
817748,you don't know anything about broccoli?,j uː WORD_BOUNDARY d oʊ n t WORD_BOUNDARY n oʊ...,y o u WORD_BOUNDARY d o n ' t WORD_BOUNDARY k ...,False,1546021,you don't know anything about broccoli,,question,eng,-2147483648,...,2,52,3048,3047,6828,6,juːdoʊntnoʊɛnɪθɪŋʌbaʊtbɹɑkəli,you don't know anything about broccoli?,juːdoʊn|tnoʊɛnɪθɪŋʌbaʊtbɹɑkəli,juːdoʊntnoʊɛnɪθɪŋʌbaʊtbɹɑkəli
25046,in my ska school.,ɪ n WORD_BOUNDARY m aɪ WORD_BOUNDARY s k ɑ WOR...,i n WORD_BOUNDARY m y WORD_BOUNDARY s k a WORD...,False,2409963,in my ska school,in my school,declarative,eng,3,...,2,71,3927,3918,9459,4,ɪnmaɪskɑskuːl,in my ska school.,ɪnmaɪskɑskuːl,ɪnmaɪskɑskuːl
1089846,just pretend there's a door.,d̠ʒ ʌ s t WORD_BOUNDARY p ɹ ɪ t ɛ n d WORD_BOU...,j u s t WORD_BOUNDARY p r e t e n d WORD_BOUND...,False,1607475,just pretend there's a door,just pretend there a door,missing CA terminator,eng,6,...,2,53,3126,-2147483648,6977,5,d̠ʒʌstpɹɪtɛndðɛɹzʌdɔɹ,just pretend there's a door.,d̠ʒʌstpɹɪ|tɛndðɛɹzʌdɔɹ,d̠ʒʌstpɹɪtɛndðɛɹzʌdɔɹ
809100,yeah what other kind of candy?,j ɛ h WORD_BOUNDARY w ʌ t WORD_BOUNDARY ʌ ð ə ...,y e a h WORD_BOUNDARY w h a t WORD_BOUNDARY o ...,False,1185772,yeah what other kind of candy,yeah what other kind of candy,question,eng,6,...,2,51,2724,2722,5509,6,jɛhwʌtʌðəɹkaɪndʌvkændi,yeah what other kind of candy?,jɛhwʌ|tʌðəɹkaɪndʌvkændi,jɛhwʌtʌðəɹkaɪndʌvkændi
544467,and then we have to make the checkers.,æ n d WORD_BOUNDARY ð ɛ n WORD_BOUNDARY w iː W...,a n d WORD_BOUNDARY t h e n WORD_BOUNDARY w e ...,False,17402128,and then we have to make the checkers,and then we have to make the checker,declarative,eng,9,...,21,336,23488,23487,43309,8,ændðɛnwiːhævtəmeɪkðət̠ʃɛkəɹz,and then we have to make the checkers.,ændðɛnwiːhævtəmeɪkðət̠ʃɛkəɹz,ændðɛnwiːhævtəmeɪkðət̠ʃɛkəɹz
751784,but i guess if you're not throwing up or anyth...,b ʌ t WORD_BOUNDARY aɪ WORD_BOUNDARY ɡ ɛ s WOR...,b u t WORD_BOUNDARY i WORD_BOUNDARY g u e s s ...,False,1110285,but I guess if you're not throwing up or anything,but I guess if you not throw up or anything,declarative,eng,12,...,2,51,2754,2722,5240,10,bʌtaɪɡɛsɪfjʊɹnɑtθɹoʊɪŋʌpɔɹɛnɪθɪŋ,but i guess if you're not throwing up or anyth...,bʌ|taɪɡɛsɪfjʊɹnɑtθɹoʊɪŋʌ|pɔɹɛnɪθɪŋ,bʌtaɪɡɛsɪfjʊɹnɑtθɹoʊɪŋʌpɔɹɛnɪθɪŋ
735153,maybe you should just go see what's wrong.,m eɪ b iː WORD_BOUNDARY j uː WORD_BOUNDARY ʃ ʊ...,m a y b e WORD_BOUNDARY y o u WORD_BOUNDARY s ...,False,1158095,maybe you should just go see what's wrong,maybe you should just go see what wrong,declarative,eng,9,...,2,51,2767,2722,5355,8,meɪbiːjuːʃʊdd̠ʒʌstɡoʊsiːwʌtsɹɔŋ,maybe you should just go see what's wrong.,meɪbiːjuːʃʊdd̠ʒʌstɡoʊsiːwʌ|tsɹɔŋ,meɪbiːjuːʃʊdd̠ʒʌstɡoʊsiːwʌtsɹɔŋ


In [None]:
import re

min_words = 5  # Minimum length for utterances

# ============================================================
# Load and Filter Dataset
# ============================================================
dataset = load_dataset("phonemetransformers/IPA-CHILDES", "EnglishNA", split="train")
df = dataset.to_pandas()

# Keep only target child utterances
if "speaker_role" in df.columns:
    df = df[df["speaker_role"].str.lower().eq("target_child")]
elif "is_child" in df.columns:
    df = df[df["is_child"] == True]

# Keep utterances with at least N words
df["word_count"] = df["processed_gloss"].str.split().apply(len)
df = df[df["word_count"] >= min_words]

# ============================================================
# Text Cleaning and Normalization
# ============================================================
def clean_ipa(text):
    if not isinstance(text, str):
        return ""
    # Remove WORD_BOUNDARY and extra whitespace
    text = re.sub(r"\bWORD_BOUNDARY\b", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

def clean_text(text):
    if not isinstance(text, str):
        return ""
    text = text.lower().strip()
    return text

df["IPA"] = df["ipa_transcription"].apply(clean_ipa)
df["Text"] = df["processed_gloss"].apply(clean_text)

# Drop missing or empty values
df = df[(df["IPA"] != "") & (df["Text"] != "")]
df = df.drop_duplicates(subset=["IPA", "Text"]).reset_index(drop=True)

# ============================================================
# Train / Validation Split
# ============================================================
df = df.sample(frac=1, random_state=42)
cut = int(0.9 * len(df))
train_df, valid_df = df.iloc[:cut], df.iloc[cut:]

# ============================================================
# Save for Fine-tuning
# ============================================================
train_path = os.path.join(output_dir, "child_train.tsv")
valid_path = os.path.join(output_dir, "child_valid.tsv")

train_df[["IPA", "Text"]].to_csv(train_path, sep="\t", index=False, header=False)
valid_df[["IPA", "Text"]].to_csv(valid_path, sep="\t", index=False, header=False)

print(f"✅ Cleaned and saved to {output_dir}")
print(f"Train size: {len(train_df)} | Valid size: {len(valid_df)}")
print(train_df.head(3))

✅ Cleaned and saved to /content/drive/MyDrive/Capstone/Corpus/ipa_childes
Train size: 165753 | Valid size: 18418
                                          processed_gloss  \
72558   i said i saw a little caterpillar a little cat...   
69206                              where is she gonna go?   
123163            mommy make some make some butter on my.   

                                        ipa_transcription  \
72558   aɪ WORD_BOUNDARY s ɛ d WORD_BOUNDARY aɪ WORD_B...   
69206   w ɛ ɹ WORD_BOUNDARY ɪ z WORD_BOUNDARY ʃ iː WOR...   
123163  m ɑ m i WORD_BOUNDARY m eɪ k WORD_BOUNDARY s ʌ...   

                                character_split_utterance  is_child       id  \
72558   i WORD_BOUNDARY s a i d WORD_BOUNDARY i WORD_B...      True   969197   
69206   w h e r e WORD_BOUNDARY i s WORD_BOUNDARY s h ...      True  2095265   
123163  m o m m y WORD_BOUNDARY m a k e WORD_BOUNDARY ...      True   656497   

                                                    gloss  \
72558   I said 

In [None]:
# ============================================================
# Preview 100 Random IPA/Text Pairs
# ============================================================
preview_df = df.sample(n=100, random_state=123).reset_index(drop=True)

# Show in Colab (first 20 rows displayed automatically)
pd.set_option('display.max_colwidth', None)
display(preview_df[["IPA", "Text"]].head(20))

# Optionally, save the preview for manual review
preview_path = os.path.join(output_dir, "child_preview.tsv")
preview_df[["IPA", "Text"]].to_csv(preview_path, sep="\t", index=False)

print(f"✅ Preview of 100 samples saved to: {preview_path}")

|   | IPA | Text |
|---|---|---|
| 0 | w aɪ j uː s ə p oʊ z d t ə p ʊ t ɪ t ɪ n | why you supposed to put it in? |
| 1 | ɑ ɹ j uː t ɹ aɪ ɪ ŋ t ə h ɪ ɹ w ʌ t aɪ s ɛ d | are you trying to hear what i said? |
| 2 | aɪ d ɪ d n t w ɛ ɹ ɪ t ɪ n s k uː l d æ d i | i didn't wear it in school daddy. |
| 3 | j ɛ h m ɑ m i d uː w iː ɡ ɑ t ɛ n i k æ n d i | yeah mommy do we got any candy? |
| 4 | aɪ ɡ ɑ t t ə k ʌ v ə ɹ m aɪ m aɪ b ɪ l d ɪ ŋ | i got to cover my my building. |
| 5 | aɪ ɡ oʊ ɪ ŋ ɡ ɹ oʊ s ə ɹ ɹ i s t ɔ ɹ b aɪ s ʌ m m ɔ ɹ f uː d | i going grocery store buy some more food. |
| 6 | æ n d æ n d m aɪ f aɪ ə ɹ ɹ ɛ n d̠ʒ ɪ n ɪ z s t ʌ k ɔ n d ɛ ɹ ə | and and my fire engine is stuck on dere. |
| 7 | h ɪ ɹ z d ə ʌ ð ə ɹ k ɑ ɹ z k ʌ m ɪ ŋ ɪ n d ə k ɑ ɹ ɹ æ l i | here's de other cars coming in de car rally. |
| 8 | ʌ n oʊ j uː d ɪ d n t iː v ə n t ɛ l m iː | uh no you didn't even tell me. |
| 9 | ð ɪ s aɪ w ɔ n ə aɪ d oʊ n t w ɔ n ə s iː m ɪ s t ə ɹ w oʊ d̠ʒ ə ɹ z | this i wanna i don't wanna see mister wogers. |


✅ Preview of 100 samples saved to: /content/drive/MyDrive/Capstone/Corpus/ipa_childes/child_preview.tsv
