## **CHILDES Lexicon Builder**

This notebook constructs a lexical dictionary from the CHILDES IPA corpus.
The lexicon is used to gate phonological rules, ensuring corrective modifications to IPA strings are only applied when the resulting form corresponds to a real English word.

This prevents false-positive corrections, which matters particularly for child-speech IPA, where pronunciations often deviate from canonical adult forms.

### **1. Environment Setup**

The notebook begins by:

* Loading required libraries (`pandas`, `json`, `os`, `re`, and `Counter` from `collections`)

* Setting up file paths to the cleaned CHILDES corpus:
```
child_train.tsv  
child_valid.tsv
```

* Ensuring a dedicated output directory exists for storing lexicon files.

This creates a controlled workspace for lexicon generation.

In [None]:
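# Imports used by later cells, then mount Drive
import json, os, re
from collections import Counter
import pandas as pd
from google.colab import drive
drive.mount('/content/drive')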

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### **2. Load CHILDES TSV Files (No headers)**

Because CHILDES TSVs generated in earlier preprocessing do not include headers, the notebook loads them using:
```
header=None
sep="\t"
```

Then renames the two columns:
* col0 → `ipa_transcription`: cleaned IPA sequence
* col1 → `gloss`: cleaned text transcript

This produces a clean DataFrame containing IPA → Text pairs.

In [None]:
# ============================================================
# LOAD DATASETS WITH NO HEADERS
# ============================================================
train_path = "/content/drive/MyDrive/Capstone/Corpus/ipa_childes/child_train.tsv"
valid_path = "/content/drive/MyDrive/Capstone/Corpus/ipa_childes/child_valid.tsv"

train_df = pd.read_csv(train_path, sep="\t", header=None)
valid_df = pd.read_csv(valid_path, sep="\t", header=None)

df = pd.concat([train_df, valid_df], ignore_index=True)

# Name the two columns: column 0 is the IPA string, column 1 is the gloss
df = df.rename(columns={0: "ipa_transcription", 1: "gloss"})

print("Loaded utterances:", len(df))
print("Columns:", df.columns.tolist())

Loaded utterances: 184171
Columns: ['ipa_transcription', 'gloss']


### **3. Gloss Preprocessing & Token Extraction**

To build a lexicon, the notebook extracts words from gloss strings using:

* Lowercasing

* Removing every character other than `a`-`z` and the apostrophe (`[^a-z']`), which strips punctuation and digits

* Splitting on whitespace

* Removing empty tokens

This yields a large list of raw tokens representing the vocabulary used in CHILDES transcripts.
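
As a minimal sketch of the steps above, the snippet below runs the same lowercase, filter, and split logic on one made-up gloss. The function name `tokenize_gloss` and the sample sentence are illustrative assumptions, not part of the notebook.

```python
import re

def tokenize_gloss(gloss):
    """Lowercase, split on whitespace, keep only a-z and apostrophes."""
    tokens = []
    for raw in gloss.lower().split():
        word = re.sub(r"[^a-z']", "", raw)  # drop punctuation, digits, etc.
        if word:  # skip tokens that became empty after filtering
            tokens.append(word)
    return tokens

sample = "Don't touch that, Mommy! 2 cookies"  # hypothetical gloss
print(tokenize_gloss(sample))
```

The digit-only token `2` disappears entirely because nothing in it survives the character filter, while the apostrophe in `don't` is preserved.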

These tokens reflect:

* child speech

* adult interaction speech

* frequent, simple English words

* morphologically simplified child forms

This makes the lexicon appropriate for phonological rule gating.

In [None]:
# ============================================================
# CLEAN & TOKENIZE WORDS
# ============================================================
def clean_word(w):
    w = w.lower().strip()
    w = re.sub(r"[^a-z']", "", w)  # keep apostrophes
    return w

all_words = []

for g in df["gloss"].astype(str):
    tokens = re.split(r"\s+", g)
    for t in tokens:
        w = clean_word(t)
        if len(w) > 0:
            all_words.append(w)

print("Total raw tokens extracted:", len(all_words))

Total raw tokens extracted: 1334221


### **4. Word Frequency Counting**

Using `Counter`, the notebook counts total occurrences of each word across both train and validation CHILDES sets.

Then it applies a minimum frequency threshold:
```
MIN_FREQ = 2500
```

This removes rare or noisy words and keeps only those that occur at least 2,500 times in the combined data, which leaves a small set of very frequent words.

The result is a vocabulary of high-confidence English words produced in CHILDES interactions.

In [None]:
# ============================================================
# COUNT WORD FREQUENCIES
# ============================================================
freq = Counter(all_words)

# Keep words with at least MIN_FREQ occurrences
MIN_FREQ = 2500
lexicon_words = [w for w, c in freq.items() if c >= MIN_FREQ]

print("Words kept in lexicon:", len(lexicon_words))


Words kept in lexicon: 105
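
Because the lexicon size depends heavily on the threshold, it can help to check how many words survive at several candidate values before fixing `MIN_FREQ`. The sketch below does this on a toy `Counter`; the helper name `lexicon_size_by_threshold` and the toy sentence are assumptions for illustration, and in the notebook `freq` would take the place of `toy_freq`.

```python
from collections import Counter

def lexicon_size_by_threshold(freq, thresholds):
    """Map each threshold to the number of words with count >= threshold."""
    return {t: sum(1 for c in freq.values() if c >= t) for t in thresholds}

# Toy data (assumption): counts are the=3, cat=2, and 1 for the other words
toy_freq = Counter("the cat saw the dog and the cat ran".split())
sizes = lexicon_size_by_threshold(toy_freq, [1, 2, 3])
print(sizes)
```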


### **5. Final Lexicon Construction**

The notebook prepares the final lexicon in three different formats:

A. lexicon.txt

* One word per line

* Easy to inspect manually

* Good for debugging lexical matches

B. lexicon.json

* List of strings

* Ideal for programmatic loading

* Can be used in other scripts or Hugging Face Spaces

C. lexicon.py

Defines:
```
LEXICON = {
   'cat',
   'book',
   'car',
   ...
}
```

This makes the lexicon importable directly into the rule-based IPA correction script.

Example usage:
```
from lexicon import LEXICON

if ipa_to_word(candidate) in LEXICON:
    apply_rule()
```

This restricts corrections to forms whose corresponding word is attested in the CHILDES glosses.
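
The gating pattern in the example usage can be sketched end to end as follows. Everything here is hypothetical: the toy `LEXICON`, the `TOY_IPA_TO_WORD` lookup standing in for `ipa_to_word`, and the fronting rule are placeholders chosen only to make the control flow runnable.

```python
# Toy stand-ins (assumptions): a tiny lexicon and an IPA-to-word lookup.
LEXICON = {"cat", "book", "car"}
TOY_IPA_TO_WORD = {"kæt": "cat", "bʊk": "book", "kɑɹ": "car"}

def ipa_to_word(ipa):
    """Return the orthographic word for an IPA string, or None if unknown."""
    return TOY_IPA_TO_WORD.get(ipa)

def undo_fronting(ipa):
    """Hypothetical rule: rewrite a word-initial /t/ as /k/."""
    return "k" + ipa[1:] if ipa.startswith("t") else ipa

def gated_correction(ipa, rule, lexicon):
    """Apply `rule` only if the corrected form maps to a word in `lexicon`."""
    candidate = rule(ipa)
    if ipa_to_word(candidate) in lexicon:
        return candidate
    return ipa  # no lexicon match: leave the original string untouched

print(gated_correction("tæt", undo_fronting, LEXICON))  # rule accepted
print(gated_correction("tɔɪ", undo_fronting, LEXICON))  # rule rejected
```

In this sketch the first call is rewritten because the corrected form maps to `cat`, while the second is left alone because the corrected form has no entry in the toy lookup.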

In [None]:
# ============================================================
# SAVE LEXICON
# ============================================================
out_dir = "/content/drive/MyDrive/Capstone/Corpus/lexicon/"
os.makedirs(out_dir, exist_ok=True)

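# Explicit UTF-8 keeps the files identical across platforms
with open(os.path.join(out_dir, "lexicon.txt"), "w", encoding="utf-8") as f:
    for w in sorted(lexicon_words):
        f.write(w + "\n")

with open(os.path.join(out_dir, "lexicon.json"), "w", encoding="utf-8") as f:
    json.dump(sorted(lexicon_words), f, indent=2)

with open(os.path.join(out_dir, "lexicon.py"), "w", encoding="utf-8") as f:
    f.write("LEXICON = {\n")
    for w in sorted(lexicon_words):
        # repr() quotes safely, so words with apostrophes (e.g. "don't") stay valid Python
        f.write(f"    {w!r},\n")
    f.write("}\n")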

print("Lexicon saved to:", out_dir)

Lexicon saved to: /content/drive/MyDrive/Capstone/Corpus/lexicon/
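
A quick round-trip check can confirm that a saved file loads back into the same set of words. The sketch below writes a toy word list to a temporary directory rather than the Drive path; the helper name `save_and_reload` and the toy words are assumptions.

```python
import json
import os
import tempfile

def save_and_reload(words, out_dir):
    """Write words as a sorted JSON list, then read them back as a set."""
    path = os.path.join(out_dir, "lexicon.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sorted(words), f, indent=2)
    with open(path, encoding="utf-8") as f:
        return set(json.load(f))

toy_words = ["the", "don't", "cat"]  # includes an apostrophe on purpose
with tempfile.TemporaryDirectory() as tmp:
    reloaded = save_and_reload(toy_words, tmp)

print(reloaded == set(toy_words))
```

Loading the list into a `set` gives constant-time membership tests, which suits the gating check applied to each candidate correction.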
