# Chamorro-English "TOD" Dictionary Parser

**About This Notebook**<br>
This notebook contains the full processing pipeline for parsing and formatting the contents of Chamorro-English Dictionary text files, from the work of Donald M. Topping, Pedro M. Ogo and Bernadita C. Dungca. This dictionary is known colloquially as the "TOD" Dictionary. The text files used in this notebook were accessed through the UH Manoa ScholarSpace digital repository at https://scholarspace.manoa.hawaii.edu/items/1cb6119e-b893-4a30-adae-0d8ed4aaafa2

This notebook includes functions for:

- Loading `.TXT` files from the TOD Dictionary dataset
- Extracting and cleaning the raw text data
- Parsing dictionary entries based on labeled fields (e.g., `.hw`, `df`, `il`)
- Structuring the data into a nested Python dictionary
- Exporting the formatted data to a JSON file for future use

The goal of this notebook is to prepare the dictionary data for easier analysis, transformation, or integration into future linguistic, educational, or machine learning projects.

**Name:** Schyuler Lujan <br>
**Date Started:** 13-May-2025 <br>
**Date Completed:** In Progress

# Import Libraries

In [116]:
import os
import re
import json
import csv
import pandas as pd

# Open Dictionary Files

In this section, we open the 6 text files that contain the contents of the TOD Dictionary and compile them into a single string for later parsing.

In [119]:
# Set folder path
base_path = "C:/Users/schyu/Documents/Python-Projects/"  # ABSOLUTE PATH GOES HERE
folder_path = base_path + "Chamorro-Dictionary-Scraper/inputs/Chamorro-English-Dictionary-Files-TOD"

In [120]:
def open_dictionary_files(path):
    """
    Reads and concatenates the contents of all .TXT dictionary files in the specified directory.

    This function scans the given directory for files ending with '.TXT', reads each file's 
    contents using UTF-8 encoding, and combines them into a single string. It also performs 
    basic text cleanup, such as removing formatting characters (/ < > % *) and replacing 
    '$na' with 'ña' to restore proper spelling.

    Parameters:
    ----------
        path (str): The path to the directory containing dictionary text files.

    Returns:
    ----------
        str: A single string containing the cleaned and combined contents of all .TXT files.
    """
    # Initialize string for storing contents
    text = ""

    # Open all the files in sorted order so the combined text is reproducible
    # (os.listdir() returns entries in arbitrary order)
    for filename in sorted(os.listdir(path)):
        if filename.endswith('.TXT'):
            text_path = os.path.join(path, filename)
            # Open text file, read contents and append contents to a single string
            with open(text_path, 'r', encoding="utf-8") as file:
                content = file.read()
                text += content

    # Cleanup text
    text = re.sub(r'[/<>%*]', '', text) # Remove formatting characters
    text = text.replace("$na", "ña") # Replace $na with ña

    return text

In [121]:
# Get dictionary contents into a single string for parsing
contents = open_dictionary_files(folder_path)
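
To make the cleanup step concrete, the sketch below applies the same two substitutions used in `open_dictionary_files()` to a short made-up string. The sample text is hypothetical and is not taken from the TOD files.

```python
import re

# Hypothetical raw line (an assumption for illustration, not real TOD data)
sample = "df <to> go*, ma$nana/ %walk%"

# Same cleanup as in open_dictionary_files()
cleaned = re.sub(r'[/<>%*]', '', sample)  # Remove the formatting characters / < > % *
cleaned = cleaned.replace("$na", "ña")    # Replace $na with ña

print(cleaned)  # df to go, mañana walk
```

The `$` character is not in the removed set, so the `$na` replacement still finds its target after the formatting characters are stripped.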

# Parse Dictionary Contents

In this section, we parse the string output from `open_dictionary_files()` and format the contents into a structured Python dictionary for easier analysis in future projects. The dictionary contents follow a predictable structure: each line begins with a label, followed by the content for that label. The labels and their meanings are listed below:

* **.hw** - headword (term)
* **df** - definition
* **il** - example sentence, formatted as `Chamorro sentence |English translation`
* **wc** - word class; takes values of 1, 2, or 3. Some words have more than one value
* **cf** - cross-reference or related term
* **zb** - simple gloss

**Parsing Approach**<br>
The approach involves splitting the string into lines based on dictionary labels (e.g., `.hw`, `df`, `cf`, `il`, etc.), then iterating over those lines to extract and organize the data into key-value pairs.

**Splitting the String into Lines**<br>
The full dictionary text is split into labeled lines using a regular expression. The pattern matches a label prefix at the start of a line and then absorbs any following lines that do not begin with a label, so multi-line entries stay together in a single string. Basic string methods such as `splitlines()` cannot do this on their own.
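
As a rough illustration, the sketch below runs the pattern used later in `parse_dictionary_contents()` on a tiny made-up entry. The headword, definition, and sentence are hypothetical placeholders, not real TOD data.

```python
import re

# Hypothetical sample in the labeled format (an assumption, not real TOD data)
sample = (
    ".hw hypo\n"
    "df first line of a definition\n"
    "   that continues on a second line\n"
    "il Chamorro sentence. |English sentence."
)

# Same pattern as in parse_dictionary_contents()
pattern = r'^(?:\.hw|df|il|cf|wc|zb)\s+.*(?:\n(?!\.hw|df|il|cf|wc|zb).*)*'
chunks = re.findall(pattern, sample, flags=re.MULTILINE)

for chunk in chunks:
    print(repr(chunk))
```

The four physical lines come back as three chunks, because the indented continuation line is attached to the `df` chunk that precedes it.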

**Separating Labels from Contents** <br>
Each line is further divided into a label and its associated content. The label (e.g., `df` or `il`) is the first whitespace-delimited token, obtained with `split()`. The content is everything after the first space, obtained with `partition()`, so its full value is preserved.
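
A minimal sketch of this label and content separation, using made-up lines (the values are hypothetical):

```python
# Hypothetical labeled lines (assumptions for illustration, not real TOD data)
lines = [".hw hypo", "df a made-up definition", "wc 2"]

pairs = []
for line in lines:
    label = line.split()[0]                    # First whitespace-delimited token
    content = line.partition(" ")[-1].strip()  # Everything after the first space
    pairs.append((label, content))

print(pairs)
```

Note that `partition(" ")` splits on the first space character specifically, so a label followed by a tab instead of a space would need something like `line.split(maxsplit=1)` instead.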

In [124]:
def parse_dictionary_contents(text):
    """
    Parses the raw dictionary text and organizes it into a structured Python dictionary.

    This function processes the cleaned dictionary text, extracting labeled lines 
    (e.g., .hw for headword, df for definition, il for example sentence, etc.). 
    It groups all related information under their corresponding headwords and stores 
    them as nested dictionaries.

    Labels with multiple entries (e.g., multiple 'il' lines) are concatenated with semicolons.

    Parameters:
    ----------
        text (str): The full text of the dictionary, typically returned by open_dictionary_files().

    Returns:
    ----------
        dict: A dictionary where each key is a headword (string) and the value is 
              another dictionary mapping labels (e.g., 'df', 'il') to their corresponding content.
    """
    # Initialize dictionary (named `entries` to avoid shadowing the built-in `dict`)
    entries = {}

    # Initialize variable for storing the term
    current_word = None

    # Split the text based on labels; multiline entries for a label are contained in a single string
    lines = re.findall(r'^(?:\.hw|df|il|cf|wc|zb)\s+.*(?:\n(?!\.hw|df|il|cf|wc|zb).*)*', text, flags=re.MULTILINE)

    # Iterate through the lines and append the data to entries
    for line in lines:
        label = line.split()[0].strip() # Get label (.hw, df, il, etc.)
        content = line.partition(" ")[-1].strip() # Get the contents attached to label
        # Create a new dictionary entry for each word
        # (a repeated headword overwrites the earlier entry)
        if label == ".hw":
            current_word = content
            entries[current_word] = {}
        elif current_word is None:
            # Skip labeled lines that appear before the first headword
            continue
        elif label in entries[current_word]:
            entries[current_word][label] += f"; {content}"
        else:
            entries[current_word][label] = content

    return entries

In [125]:
# Parse dictionary contents
dictionary = parse_dictionary_contents(contents)
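
One way to sanity-check the parsed structure is to count how many entries carry each label. The sketch below does this on a small hypothetical stand-in for the parser output. The headwords and field values are made up.

```python
from collections import Counter

# Hypothetical stand-in for the output of parse_dictionary_contents()
parsed = {
    "hypo": {
        "df": "a made-up definition",
        "wc": "2",
        "il": "Sentence one. |Translation one.; Sentence two. |Translation two.",
    },
    "thetical": {"df": "another made-up definition"},
}

# Count how many headwords have each label
label_counts = Counter(label for fields in parsed.values() for label in fields)

# List headwords that have no definition
missing_df = [headword for headword, fields in parsed.items() if "df" not in fields]

print(label_counts)
print(missing_df)
```

On the real data, the same two lines could be run against `dictionary` to see which labels are common and which entries lack a definition.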

# Parse Example Sentences

In this section, we parse the example sentences from the dictionary and format them for use in building a Chamorro corpus and in future language analysis. We take the output from `parse_dictionary_contents()`, extract the example sentences under each word, which are labeled with `il`, and clean up the text.

In [128]:
def get_example_sentences(dictionary):
    """
    Extracts the example sentences from the parsed dictionary and formats them as a list of tuples.

    Each tuple comes from splitting one example on "|", so a well-formed example yields
    (Chamorro Sentence, English Translation).

    Parameters:
    ----------
        dictionary (dict): The nested dictionary returned by parse_dictionary_contents().

    Returns:
    ----------
        list: A list of tuples, one per example.
    """
    # Initialize string for extracting examples
    examples = ""

    # Iterate through dictionary and extract sentences
    for fields in dictionary.values():
        if "il" in fields:
            examples += fields["il"] + "; "

    # Remove newline characters
    examples = examples.replace("\n", '')

    # Split examples into their chamorro-english pairs
    # (the final item is empty or whitespace-only because of the trailing separator)
    examples = examples.split(";")

    # Separate the Chamorro and English sentences into tuples
    example_pairs = [tuple(example.split("|")) for example in examples]

    return example_pairs

In [129]:
# Get the example sentences from the dictionary
example_sentences = get_example_sentences(dictionary)

In [130]:
# Count the number of extracted example tuples
len(example_sentences)

2354
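
Because `get_example_sentences()` splits on `;` and `|`, an example without a `|`, or one containing extra separators, yields a tuple that does not have exactly two elements. The sketch below shows one way to check tuple lengths and keep only well-formed pairs. The `pairs` list is a hypothetical stand-in for `example_sentences`.

```python
from collections import Counter

# Hypothetical stand-in for example_sentences (made-up values)
pairs = [
    ("Sentence one. ", "Translation one."),
    ("Sentence without a translation",),
    ("A ", "B", "C"),
    (" ",),
]

# Count how many tuples there are of each length
length_counts = Counter(len(pair) for pair in pairs)

# Keep only (Chamorro, English) pairs and trim surrounding whitespace
well_formed = [(pair[0].strip(), pair[1].strip()) for pair in pairs if len(pair) == 2]

print(length_counts)
print(well_formed)
```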

# Export Data

In [132]:
# Set the export folder path
exports_folder = "Chamorro-Dictionary-Scraper/exports/"

## Export Dictionary to JSON

In [134]:
# Make a shallow copy for export (the parsed result is already a regular dict)
regular_dict = dict(dictionary)

# Set the folder_path and filename for export
folder_path = base_path + exports_folder + "json"
filename = "chamorro_english_dictionary_TOD.json"

# Set the file_path for the export
file_path = os.path.join(folder_path, filename)

# Export the data to JSON
with open(file_path, mode="w", encoding="utf-8") as file:
    json.dump(regular_dict, file, ensure_ascii=False, indent=2)
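
To confirm that the export round-trips cleanly, the file can be read back and compared with the original data. The sketch below does this with a temporary directory and a made-up one-entry dictionary. The file name and contents are assumptions for illustration.

```python
import json
import os
import tempfile

# Hypothetical one-entry dictionary (made-up content)
sample = {"hypo": {"df": "a made-up definition with ñ"}}

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "sample.json")

    # Write with the same settings as the export cell
    with open(file_path, mode="w", encoding="utf-8") as file:
        json.dump(sample, file, ensure_ascii=False, indent=2)

    # Read the raw text back and parse it
    with open(file_path, mode="r", encoding="utf-8") as file:
        raw = file.read()
    loaded = json.loads(raw)

print("ñ" in raw)        # True: ensure_ascii=False keeps ñ readable in the file
print(loaded == sample)  # True: the data survives the round trip
```

With `ensure_ascii=False`, non-ASCII characters such as `ñ` are written as-is rather than as `\u00f1` escapes, which keeps the JSON file human-readable.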

## Export Example Sentences to CSV

In [136]:
# Convert to dataframe before export
df_sentences = pd.DataFrame(example_sentences)

# Set the folder_path and filename for export
folder_path = base_path + exports_folder + "csv"
filename = "chamorro_english_dictionary_TOD_sentences.csv"

# Set the file_path for export
file_path = os.path.join(folder_path, filename)

# Export the data to CSV
df_sentences.to_csv(file_path, index=False)
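
A DataFrame built from a list of tuples gets integer column names by default, so the exported header row reads `0,1`. As an alternative sketch, the standard-library `csv` module can write the pairs under explicit column names. The names `chamorro` and `english` and the sample rows are assumptions for illustration.

```python
import csv
import io

# Hypothetical sentence pairs (made-up values)
pairs = [
    ("Sentence one.", "Translation one."),
    ("Sentence, with a comma.", "Translation two."),
]

# Write to an in-memory buffer; a real export would open a file with newline=""
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["chamorro", "english"])  # Assumed column names
writer.writerows(pairs)
csv_text = buffer.getvalue()

# Read the text back to confirm the rows survive quoting
rows = list(csv.reader(io.StringIO(csv_text, newline="")))
print(rows)
```

The second row contains a comma, which the writer quotes automatically, so the sentence comes back as a single field.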