# ============================================================
#  BOOKCORPUS  →  PHONEME/TEXT  DATASET  (to Google Drive)
# ============================================================

Zane Graper

Capstone Project

This notebook constructs a large, high-quality phoneme/text paired dataset from BookCorpus to support training an IPA-to-text model for child-speech ASR. The workflow addresses three persistent challenges highlighted in the child-speech literature: scarcity of labeled child data, high acoustic and phonological variability, and the limits of dictionary-based correction approaches. Prior studies show that children exhibit systematic phonological substitutions, omissions, and developmental error patterns (e.g., stopping, gliding, cluster reduction), but these account for only a minority of ASR errors, meaning acoustic mismatch remains a major obstacle. Because child corpora such as MyST and PF-STAR are limited in size and expensive to annotate, many pipelines rely on transfer from adult speech and synthetic augmentation. Creating a large aligned corpus of phonemes and text from clean adult text such as BookCorpus provides a stable foundation for supervised training, enabling downstream models to generalize to the noisier phonological patterns found in children's speech. This notebook implements that foundation. Note that the phoneme side is generated as ARPAbet symbols by `g2p-en` (Step 8) rather than as IPA.

---

### Step 1: Install Prerequisites

In [None]:
# ---- 1.  Install prerequisites ----
!pip install -q phonemizer==2.2.1
!apt-get -qq install espeak-ng
!pip install -q g2p-en pandas tqdm

### Step 2: Mount Google Drive

In [None]:
# ---- 2.  Mount Google Drive ----
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Step 3: Imports & Paths

Initializes project directories, file paths, and shared dependencies to manage the BookCorpus workflow.

In [None]:
# ---- 3. Imports & Paths ----
import os
import tarfile
import urllib.request
import pandas as pd
from tqdm import tqdm
from phonemizer import phonemize
base_dir = "/content/drive/MyDrive/Capstone/Corpus"
os.makedirs(base_dir, exist_ok=True)
archive_path = os.path.join(base_dir, "bookcorpus.tar.bz2")
extract_dir  = os.path.join(base_dir, "bookcorpus_raw")
output_csv   = os.path.join(base_dir, "bookcorpus_phoneme_text_pairs.csv")

### Step 4: Download Archive

Fetches the 1.1 GB BookCorpus dataset directly from Hugging Face storage.

In [None]:
# ---- 4.  Download archive (1.1 GB) ----
import urllib.request
url = "https://storage.googleapis.com/huggingface-nlp/datasets/bookcorpus/bookcorpus.tar.bz2"
print("Downloading BookCorpus…")
urllib.request.urlretrieve(url, archive_path)
print("✅ Downloaded to:", archive_path)

Downloading BookCorpus…
✅ Downloaded to: /content/drive/MyDrive/Capstone/Corpus/bookcorpus.tar.bz2
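
Because the archive is about 1.1 GB, re-running the download cell repeats a long transfer. The sketch below shows one way to skip the download when the file already exists and to avoid leaving a half-written archive behind. It is an illustration only: `download_if_missing` and the `.part` suffix are hypothetical names that do not appear in this notebook, and the demo uses a local `file://` URL so that it runs offline.

```python
import os
import tempfile
import urllib.request
from pathlib import Path

def download_if_missing(url, dest_path):
    """Download url to dest_path unless a non-empty file is already there."""
    if os.path.exists(dest_path) and os.path.getsize(dest_path) > 0:
        return False                      # already downloaded, nothing to do
    part_path = dest_path + ".part"       # hypothetical temporary name
    urllib.request.urlretrieve(url, part_path)
    os.replace(part_path, dest_path)      # rename only after a complete download
    return True

# Offline demo: a local file:// URL stands in for the real archive URL.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "source.bin"
    src.write_bytes(b"example payload")
    dest = os.path.join(tmp, "copy.bin")
    first = download_if_missing(src.as_uri(), dest)    # performs the copy
    second = download_if_missing(src.as_uri(), dest)   # skipped, file exists
    print(first, second)
```

In the notebook, the same call would take `url` and `archive_path` from Step 4 in place of the demo values.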


### Step 5: Extract

Unpacks the downloaded `.tar.bz2` archive into a structured directory of raw text files for processing.

In [None]:
# ---- 5.  Extract ----
import tarfile
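print("Extracting…")
with tarfile.open(archive_path, "r:bz2") as tar:
    # filter="data" opts in to the safe extraction filter (Python 3.12+, also
    # backported to recent security releases of 3.8-3.11).
    tar.extractall(path=extract_dir, filter="data")
print("✅ Extracted to:", extract_dir)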

Extracting…

✅ Extracted to: /content/drive/MyDrive/Capstone/Corpus/bookcorpus_raw


### Step 6: Reservoir Sampling

Randomly selects ~1M sentences across all BookCorpus files using a memory-efficient reservoir-sampling algorithm while discarding extremely short lines.

In [None]:
#  STEP 6 Sampling

import os, random, pandas as pd
from tqdm import tqdm

# ---- Paths ----
base_dir = "/content/drive/MyDrive/Capstone/Corpus"
extract_dir = os.path.join(base_dir, "bookcorpus_raw")
output_csv  = os.path.join(base_dir, "bookcorpus_sample_1m.csv")

# ---- Parameters ----
SAMPLE_SIZE = 1_000_000     # number of sentences to keep
MIN_WORDS   = 2            # skip ultra-short fragments
random.seed(42)

# ---- Gather all text files ----
text_files = []
for root, _, files in os.walk(extract_dir):
    for f in files:
        if f.endswith(".txt"):
            text_files.append(os.path.join(root, f))
print(f"Found {len(text_files)} text files.\n")

# ---- Reservoir sampling (memory-safe random sampling) ----
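sample = []
total_lines = 0   # every line read, including skipped ones
kept_lines  = 0   # lines that pass the MIN_WORDS filter
print(f"Sampling {SAMPLE_SIZE:,} sentences from BookCorpus...")

for path in text_files:
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        for line in f:
            total_lines += 1
            text = line.strip()
            if not text or len(text.split()) < MIN_WORDS:
                continue
            kept_lines += 1

            if len(sample) < SAMPLE_SIZE:
                sample.append(text)
            else:
                # Keep the new line with probability SAMPLE_SIZE / kept_lines.
                # randrange excludes its upper bound, and only eligible lines
                # are counted, so the sample stays uniform.
                j = random.randrange(kept_lines)
                if j < SAMPLE_SIZE:
                    sample[j] = text

print(f"✅ Sampled {len(sample):,} lines from ≈{total_lines:,} total.\n")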

# ---- Save sample to CSV ----
df = pd.DataFrame(sample, columns=["text"])
df.to_csv(output_csv, index=False, encoding="utf-8")
print(f"✅ Saved sample to: {output_csv}")

# ---- Quick sanity check ----
print("\nExample lines:")
for t in sample[:5]:
    print("-", t)

Found 2 text files.

Sampling 1,000,000 sentences from BookCorpus...
✅ Sampled 1,000,000 lines from ≈74,004,228 total.

✅ Saved sample to: /content/drive/MyDrive/Capstone/Corpus/bookcorpus_sample_1m.csv

Example lines:
- `` her mother does n't know it , but anya has already tried shapeshifting .
- i shimmied my skirt down and made a note to find my underwear before we left .
- `` she 's not the neatest roommate in the world . ''
- they tilted their heads , taking in my candy-stripped uniform complete with white visor and shook their heads .
- what kind of a pervert does that ?
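
For reference, the reservoir-sampling idea behind Step 6 can be shown on a small in-memory stream. This is a minimal sketch, assuming the textbook "Algorithm R" formulation. `reservoir_sample` is a hypothetical helper name and is not part of the pipeline above.

```python
import random

def reservoir_sample(stream, k, rng):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    reservoir = []
    for seen, item in enumerate(stream, start=1):
        if len(reservoir) < k:
            reservoir.append(item)          # fill the reservoir first
        else:
            j = rng.randrange(seen)         # uniform integer in [0, seen)
            if j < k:                       # happens with probability k / seen
                reservoir[j] = item
    return reservoir

rng = random.Random(42)
stream = (f"sentence {i}" for i in range(10_000))   # stands in for corpus lines
picked = reservoir_sample(stream, k=5, rng=rng)
print(picked)
```

Only `k` items are held in memory at any time, which is why the approach works for a corpus with tens of millions of lines.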


In [None]:
### TEST PHONEMIZER
import nltk
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('cmudict')
nltk.download('punkt')

from g2p_en import G2p
g2p = G2p()
print(" ".join(g2p("The caterpillar with a shell around it is called a pea.")))

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.
[nltk_data] Downloading package cmudict to /root/nltk_data...
[nltk_data]   Package cmudict is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


DH AH0   K AE1 T AH0 P IH2 L ER0   W IH1 DH   AH0   SH EH1 L   ER0 AW1 N D   IH1 T   IH1 Z   K AO1 L D   AH0   P IY1   .


### Step 7: Clean text prior to G2P

Normalizes punctuation, spacing, and length constraints to remove noisy artifacts and ensure text is well-formed before phonemization.

In [None]:
#  STEP 7  -  CLEAN TEXT PRIOR TO G2P

import os, re, pandas as pd
from tqdm import tqdm

# ---- Paths ----
base_dir  = "/content/drive/MyDrive/Capstone/Corpus"
input_csv = os.path.join(base_dir, "bookcorpus_sample_1m.csv")
clean_csv = os.path.join(base_dir, "bookcorpus_sample_1m_clean.csv")

# ---- Parameters ----
MIN_WORDS = 5          # drop lines with fewer than 5 words
MAX_WORDS = 150        # drop abnormally long sentences

# ---- Helper: normalize punctuation etc. ----
def preclean_text(t: str) -> str:
    t = str(t).strip()
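    t = re.sub(r"[\u201C\u201D]", '"', t)   # curly double quotes -> straight
    t = re.sub(r"[\u2018\u2019]", "'", t)   # curly single quotes -> straight
    t = re.sub(r"[\u2013\u2014]", "-", t)   # en/em dashes -> hyphen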
    t = re.sub(r"[\r\n]+", " ", t)
    t = re.sub(r"\s+", " ", t)
    return t

# ---- Load and clean ----
print(f"Loading {input_csv}...")
df = pd.read_csv(input_csv)

print("Cleaning text and filtering...")
cleaned_rows = []
for line in tqdm(df["text"].astype(str), total=len(df)):
    text = preclean_text(line)
    word_count = len(text.split())
    if MIN_WORDS <= word_count <= MAX_WORDS:
        cleaned_rows.append(text)

# ---- Save cleaned file ----
clean_df = pd.DataFrame(cleaned_rows, columns=["text"])
clean_df.to_csv(clean_csv, index=False, encoding="utf-8")
print(f"✅ Saved cleaned corpus with {len(clean_df):,} rows → {clean_csv}")

# ---- Quick sanity check ----
print("\nExample cleaned lines:")
for t in clean_df.sample(5, random_state=42)["text"]:
    print("-", t)

Loading /content/drive/MyDrive/Capstone/Corpus/bookcorpus_sample_1m.csv...
Cleaning text and filtering...


100%|██████████| 1000000/1000000 [00:13<00:00, 74255.44it/s]


✅ Saved cleaned corpus with 892,719 rows → /content/drive/MyDrive/Capstone/Corpus/bookcorpus_sample_1m_clean.csv

Example cleaned lines:
- `` which they ... love , '' sanya said .
- he had a long braid down his back , just as dragon did and wore his vest open like the other males .
- i follow behind the old woman in a state of half-consciousness .
- i 've got to be heading back . ''
- she felt the engorgement within her tremble ; felt the demon try , at least momentarily , to draw back and regroup .
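
As a worked example of the Step 7 normalization, the sketch below applies the same kinds of substitutions to one made-up string and then checks it against the 5 to 150 word window. `normalize_punctuation` is a hypothetical stand-in for `preclean_text`, and the input sentence is invented for illustration.

```python
import re

def normalize_punctuation(t):
    """Map typographic quotes and dashes to ASCII and collapse whitespace."""
    t = re.sub("[\u201c\u201d]", '"', t)   # curly double quotes -> "
    t = re.sub("[\u2018\u2019]", "'", t)   # curly single quotes -> '
    t = re.sub("[\u2013\u2014]", "-", t)   # en/em dashes -> hyphen
    t = re.sub(r"\s+", " ", t)             # newlines and space runs -> one space
    return t.strip()

raw = "\u201cWait\u201d \u2014  she said,\n it\u2019s   late."
clean = normalize_punctuation(raw)
n_words = len(clean.split())
keep = 5 <= n_words <= 150                 # same window as MIN_WORDS / MAX_WORDS

print(clean)
print(n_words, keep)
```

Word counts here come from whitespace splitting, so a free-standing hyphen or quote mark counts as a word.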


### Step 8: Phonemize BookCorpus (G2P-EN)

Streams through the cleaned corpus, converts each line to ARPAbet phonemes using `g2p-en`, and writes phoneme/text pairs with resume-safe logic.

In [None]:
#  STEP 8 - PHONEMIZE BOOKCORPUS CSV  →  PHONEME/TEXT PAIRS

import csv, os
from tqdm import tqdm
from g2p_en import G2p
g2p = G2p()   # load the G2P model once and reuse it for every line

# ---- Paths ----
base_dir   = "/content/drive/MyDrive/Capstone/Corpus"
input_csv  = os.path.join(base_dir, "bookcorpus_sample_1m_clean.csv")
output_csv = os.path.join(base_dir, "bookcorpus_sample_1m_phonemes.csv")

# ---- Resume support ----
processed_lines = 0
if os.path.exists(output_csv):
    # count existing lines (minus header)
    with open(output_csv, "r", encoding="utf-8") as f:
        processed_lines = sum(1 for _ in f) - 1
    print(f"Resuming after {processed_lines:,} processed lines.")
else:
    # create new output with header
    with open(output_csv, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["phonemes", "text"])
    print("Starting new phonemization output file.")

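# ---- Streaming loop ----
# Every input row produces exactly one output row (empty phonemes on failure),
# so the output row count always equals the number of input rows consumed and
# the resume offset stays aligned. Step 9 drops the empty rows via dropna().
with open(input_csv, "r", encoding="utf-8", newline="") as infile, \
     open(output_csv, "a", encoding="utf-8", newline="") as outfile:

    reader = csv.DictReader(infile)
    writer = csv.writer(outfile)

    for i, row in enumerate(tqdm(reader, desc="Phonemizing (g2p-en)", unit=" lines")):
        if i < processed_lines:
            continue                      # already written in a previous run
        text = (row["text"] or "").strip()
        phon = ""
        if text:
            try:
                phon = " ".join(g2p(text)).strip()
            except Exception as e:
                print(f"[skip] row {i}: {e}")
        writer.writerow([phon, text])
        outfile.flush()                   # keep the file current for resuming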

print("✅ Streaming phonemization complete (g2p-en).")
print(f"Results saved at: {output_csv}")

Starting new phonemization output file.


Phonemizing (g2p-en): 892719 lines [29:58, 496.27 lines/s]

✅ Streaming phonemization complete (g2p-en).
Results saved at: /content/drive/MyDrive/Capstone/Corpus/bookcorpus_sample_1m_phonemes.csv
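
The resume logic in Step 8 relies on knowing how many rows are already in the output file. The sketch below shows one way that count could be taken with `csv.reader`, which counts a quoted field containing a line break as a single record. This is an illustration under assumptions: `count_data_rows` is a hypothetical helper and the demo rows are invented.

```python
import csv
import os
import tempfile

def count_data_rows(csv_path):
    """Count CSV records after the header; return 0 if the file is missing."""
    if not os.path.exists(csv_path):
        return 0
    with open(csv_path, "r", encoding="utf-8", newline="") as f:
        reader = csv.reader(f)
        next(reader, None)                # skip the header row if present
        return sum(1 for _ in reader)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pairs.csv")
    with open(path, "w", encoding="utf-8", newline="") as f:
        w = csv.writer(f)
        w.writerow(["phonemes", "text"])
        w.writerow(["HH AH0 L OW1", "hello"])
        w.writerow(["W ER1 L D", "a field with\na line break"])  # one record, two physical lines
    n_done = count_data_rows(path)
    n_missing = count_data_rows(os.path.join(tmp, "absent.csv"))

print(n_done, n_missing)
```

The cleaned text from Step 7 has no embedded newlines, so the raw line count in the cell above gives the same answer here. The record-based count simply does not depend on that property.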





### Step 9: Clean after G2P

Removes stress markers/digits, normalizes both phonemes and text, filters extreme lengths, and drops duplicates to improve model-readiness.

In [None]:
#  STEP 9 - CLEAN AFTER G2P (PHONEME/TEXT NORMALIZATION)

import os, re, pandas as pd
from tqdm import tqdm

# ---- Paths ----
base_dir   = "/content/drive/MyDrive/Capstone/Corpus"
input_csv  = os.path.join(base_dir, "bookcorpus_sample_1m_phonemes.csv")
output_csv = os.path.join(base_dir, "bookcorpus_1m_final.csv")

# ---- Parameters ----
MIN_WORDS = 5
MAX_WORDS = 150

# ---- Helpers ----
def clean_phonemes(p: str) -> str:
    """Remove stress digits, punctuation, and normalize spacing."""
    p = str(p)
    p = re.sub(r"\d", "", p)          # drop stress markers
    p = re.sub(r"[^A-Z\s]", " ", p)   # keep uppercase letters and spaces only
    p = re.sub(r"\s+", " ", p).strip()
    return p

def clean_text(t: str) -> str:
    """Normalize text to lowercase and remove odd punctuation."""
    t = str(t).lower()
    t = re.sub(r"[^a-z0-9\s']", " ", t)   # keep letters, numbers, apostrophes
    t = re.sub(r"\s+", " ", t).strip()
    return t

# ---- Load ----
print(f"Loading {input_csv}...")
df = pd.read_csv(input_csv)

# ---- Drop rows missing phonemes or text ----
df = df.dropna(subset=["phonemes", "text"])

# ---- Clean ----
print("Cleaning phonemes and text...")
df["phonemes"] = [clean_phonemes(p) for p in tqdm(df["phonemes"], desc="phonemes")]
df["text"]     = [clean_text(t)     for t in tqdm(df["text"], desc="text")]

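# ---- Filter extreme lengths ----
# Note: phon_len counts phoneme tokens while text_len counts words, so the
# 150-token cap on phonemes is the stricter bound for long sentences.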
df["phon_len"] = df["phonemes"].apply(lambda x: len(x.split()))
df["text_len"] = df["text"].apply(lambda x: len(x.split()))
df = df[df["phon_len"].between(MIN_WORDS, MAX_WORDS)]
df = df[df["text_len"].between(MIN_WORDS, MAX_WORDS)]

# ---- Drop duplicates and reset index ----
df = df.drop_duplicates(subset=["phonemes", "text"]).reset_index(drop=True)

# ---- Save ----
df[["phonemes", "text"]].to_csv(output_csv, index=False, encoding="utf-8")
print(f"✅ Cleaned dataset saved with {len(df):,} rows → {output_csv}")

# ---- Quick sanity check ----
print("\nSample rows:")
for i in range(3):
    print(f"{df.iloc[i]['phonemes']}  ||  {df.iloc[i]['text']}")

Loading /content/drive/MyDrive/Capstone/Corpus/bookcorpus_sample_1m_phonemes.csv...
Cleaning phonemes and text...


phonemes: 100%|██████████| 892719/892719 [00:23<00:00, 38675.70it/s]
text: 100%|██████████| 892719/892719 [00:07<00:00, 116126.00it/s]


✅ Cleaned dataset saved with 788,671 rows → /content/drive/MyDrive/Capstone/Corpus/bookcorpus_1m_final.csv

Sample rows:
HH ER M AH DH ER D AH Z EH N T AY N OW IH T B AH T EH N Y AH HH AE Z AO L R EH D IY T R AY D SH EY P SH AH F IH SH T  ||  her mother does n't know it but anya has already tried shapeshifting
AY SH IH M IY D M AY S K ER T D AW N AH N D M EY D AH N OW T T UW F AY N D M AY AH N D ER W EH R B IH F AO R W IY L EH F T  ||  i shimmied my skirt down and made a note to find my underwear before we left
SH IY EH S N AA T DH AH N IY T AH S T R UW M EY T IH N DH AH W ER L D  ||  she 's not the neatest roommate in the world ''
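
To make the phoneme normalization concrete, the sketch below applies the same three substitutions to a fragment of the g2p-en test output shown earlier ("the caterpillar ."). `strip_stress` is a hypothetical name that mirrors `clean_phonemes`.

```python
import re

def strip_stress(phonemes):
    """Drop ARPAbet stress digits and punctuation tokens, then collapse spaces."""
    p = re.sub(r"\d", "", phonemes)        # AH0 -> AH, AE1 -> AE
    p = re.sub(r"[^A-Z\s]", " ", p)        # punctuation such as "." becomes a space
    return re.sub(r"\s+", " ", p).strip()  # word-boundary gaps collapse to one space

raw = "DH AH0   K AE1 T AH0 P IH2 L ER0   ."
cleaned = strip_stress(raw)
n_tokens = len(cleaned.split())

print(cleaned)
print(n_tokens)
```

After this step the word boundaries emitted by g2p-en are gone, so each row is a flat sequence of phoneme tokens. Here two words become ten tokens.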


### Step 10: Sample Check (50 pairs)

Randomly displays 50 phoneme/text pairs (fewer if the file is smaller) to manually verify correctness and phonemic consistency.

In [None]:
#  10. SAMPLE CHECK: View 50 random phoneme/text pairs

import os, pandas as pd
from random import sample

# ---- Paths ----
base_dir   = "/content/drive/MyDrive/Capstone/Corpus"
phoneme_csv = os.path.join(base_dir, "bookcorpus_1m_final.csv")

# ---- Load dataset (only once) ----
df = pd.read_csv(phoneme_csv)

# Guard against small files
n = len(df)
if n == 0:
    raise ValueError("The phoneme CSV appears empty.")
num_samples = min(50, n)

# ---- Randomly sample 50 lines across the file ----
indices = sample(range(n), num_samples)
subset = df.iloc[indices]

# ---- Display neatly in Colab ----
print(f"Showing {num_samples} random phoneme-text pairs from {n:,} total rows:\n")
for i, row in enumerate(subset.itertuples(index=False), 1):
    print(f"🟢 Sample {i}")
    print(f"Text:    {row.text}")
    print(f"Phoneme: {row.phonemes}\n")

Showing 50 random phoneme-text pairs from 788,671 total rows:

🟢 Sample 1
Text:    on the refrigerator there was a small piece of paper with the number of her parents ' hotel in paris
Phoneme: AA N DH AH R AH F R IH JH ER EY T ER DH EH R W AA Z AH S M AO L P IY S AH V P EY P ER W IH DH DH AH N AH M B ER AH V HH ER P EH R AH N T S HH OW T EH L IH N P EH R IH S

🟢 Sample 2
Text:    i breathed out relieved and yet not quite able to feel at ease
Phoneme: AY B R IY DH D AW T R IH L IY V D AH N D Y EH T N AA T K W AY T EY B AH L T UW F IY L AE T IY Z

🟢 Sample 3
Text:    no identification on either man but one looks to be malaysian or indonesian and the other could be viet or laotian or possibly cambodian
Phoneme: N OW AY D EH N T AH F AH K EY SH AH N AA N IY DH ER M AE N B AH T W AH N L UH K S T UW B IY M AH L EY ZH AH N AO R IH N D OW N IY ZH AH N AH N D DH AH AH DH ER K UH D B IY V IY EH T AO R L EY OW SH AH N AO R P AA S AH B L IY K AE M B OW D IY AH N

🟢 Sample 4
Text:    no

This notebook produces a large, clean, and phonemically aligned dataset suitable for training an IPA-to-text or phoneme-to-text model. By standardizing punctuation, normalizing phonemes, and applying strict length filtering, the workflow removes many sources of noise that can destabilize sequence-to-sequence models. The resulting dataset supports robust generalization, which is critical given that child-speech ASR suffers from high acoustic and phonological variability and benefits strongly from abundant adult-speech pretraining before child-specific fine-tuning, as documented across Whisper, wav2vec2, and Conformer studies. This corpus therefore acts as the "perfect" phoneme/text supervision needed to stabilize downstream training before merging with child-speech data. The pipeline is modular, transparent, and reproducible, making it appropriate for inclusion in a research-grade repository.

**Takeaway**: This notebook builds the foundational dataset that enables the IPA-to-text model to learn stable phoneme-text mappings before adapting to the complexities of children's speech.