# Chamorro Lexicon Expander

**Chamorro Lexicon Expander** is a Python project that expands the Chamorro-English dictionary by generating affixed variations of Chamorro root words. It automates the creation of word forms using common Chamorro prefixes, suffixes, and infixes according to linguistic rules. The goal is a more comprehensive representation of Chamorro vocabulary for language learners, linguists, and dictionary development.

**Name:** Schyuler Lujan <br>
**Date Started:** 10-Nov-2024 <br>
**Date Complete:** In Progress <br>

In [20]:
# Import libraries
import re
import pandas as pd
import csv

# Import and Clean Data

**About this data:** For this project, we use the words and part-of-speech tags from the Revised Chamorro-English dictionary, which is available for free at https://natibunmarianas.org/chamorro-dictionary/. We use this data because it is currently the freely available online resource with the most complete and reliably accurate part-of-speech tags for Chamorro words. The part-of-speech tags determine which words can be transformed with which affixes.

In [21]:
# Import files and convert to dataframes
tverbs_df = pd.read_csv("transitive-verbs.csv", encoding="utf-8")
iverbs_df = pd.read_csv("intransitive-verbs.csv", encoding="utf-8")
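
The later cells rely on the columns `Term`, `Definition`, and `Root Word`, so a quick check right after loading can catch a missing or misnamed column early. The sketch below is only an illustration: it uses a small in-memory CSV with a made-up row rather than the real files, and `missing_columns` is a hypothetical helper, not part of the project.

```python
import io
import pandas as pd

# Column names used by the later cells of this notebook
REQUIRED_COLUMNS = ["Term", "Definition", "Root Word"]

def missing_columns(df, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from df."""
    return [col for col in required if col not in df.columns]

# Made-up sample standing in for a real dictionary CSV (placeholder content)
sample_csv = io.StringIO("Term,Definition\nsaga,placeholder definition\n")
sample_df = pd.read_csv(sample_csv)

print(missing_columns(sample_df))  # ['Root Word']
```

The same call could be made on `tverbs_df` and `iverbs_df`; an empty list would mean all expected columns are present.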

In [22]:
# Preview dataframe
#tverbs_df.head()

# Define linguistic rules

To apply the different affixes correctly, we first need to define the following:

* Chamorro vowels
* Chamorro vowel harmony rules
* Man- prefix rules

In [95]:
# Create a list of vowels to search for in the words - to be used for infixes and some prefixes
vowel_list = ['a', 'á', 'å', 'e', 'é', 'i', 'í', 'o', 'ó', 'u']
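
One thing worth checking before matching letters against `vowel_list`: a character such as `å` can be stored either as a single code point or as `a` followed by a combining ring, and the two forms do not compare equal in Python. The sketch below assumes the dictionary export might contain the decomposed form (this has not been verified for this dataset) and uses the standard library's `unicodedata.normalize` to convert text to the composed NFC form. `to_nfc` is a hypothetical helper name.

```python
import unicodedata

def to_nfc(text):
    """Return text in Unicode NFC (composed) form."""
    return unicodedata.normalize("NFC", text)

decomposed = "ka\u030anta"   # 'a' followed by combining ring above (U+030A)
composed = to_nfc(decomposed)

print(len(decomposed), len(composed))  # 6 5
print("\u00e5" in decomposed)          # False
print("\u00e5" in composed)            # True
```

If the data did turn out to be decomposed, the same function could be applied to the `Term` column with `df["Term"].map(to_nfc)` before any affixing.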

In [24]:
# Create a dictionary of vowel harmony rules - to be used for infixes and some prefixes
vowel_harmony_dict = {"å": "a", "o": "e", "u": "i"}

In [96]:
# FIXME Create a dictionary of man- prefix rules

In [97]:
# FIXME Create a list of possessive pronouns - to be used for applying pronoun suffixes

# Apply Infixes

**About Chamorro Infixes:** Infixes are affixes inserted within a word, rather than attached to its beginning or end. In Chamorro, an infix is always inserted immediately before the first vowel of the word. If the word starts with a vowel, the infix is still placed in front of that vowel, which puts it at the start of the word. There are two infixes in Chamorro: -in- and -um-.
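
The placement rule can be sketched on single words, independent of the dataframe code. This is a minimal illustration rather than the project's implementation: `insert_infix` is a hypothetical helper, the vowel set and the fronting map copy the values defined earlier in this notebook, and the sample words are chosen only to show the mechanics.

```python
# Hypothetical helper for illustration; these names are not part of the project code.
VOWELS = set("aáåeéiíoóu")                 # same letters as vowel_list
FRONTING = {"å": "a", "o": "e", "u": "i"}  # same mapping as vowel_harmony_dict

def insert_infix(word, infix, vowels=VOWELS, fronting=None):
    """Insert `infix` before the first vowel of `word`.

    If `fronting` is given and the first vowel has an entry in it, that vowel
    is replaced before the infix is inserted. Words with no vowel are
    returned unchanged.
    """
    for i, letter in enumerate(word):
        if letter in vowels:
            if fronting and letter in fronting:
                word = word[:i] + fronting[letter] + word[i + 1:]
            return word[:i] + infix + word[i:]
    return word

print(insert_infix("saga", "um"))                      # sumaga
print(insert_infix("tuge'", "in", fronting=FRONTING))  # tinige'
print(insert_infix("atan", "in"))                      # inatan
```

Using `enumerate` gives the position of the first vowel directly, so no second lookup of the letter's index is needed.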

In [82]:
def um_infixes(df, vowels, vowel_harmony):
    """
    In this function, we apply the UM Infix to words by finding the first vowel in the word and inserting the string
    "um" in front of that first vowel. The function returns a tuple: a list of (new word, old word, affix name) tuples,
    followed by the original dataframe. The vowel_harmony argument is not used by this infix.
    """
    
    # Get the terms and convert dataframe to a list
    original_word_list = df["Term"].tolist()
    
    # Define the infix and infix name
    infix = "um"
    infix_name = "UM Infix"
    
    # Initialize list to store new words
    infixed_words = []
    
    # Affix words with -um- and append to list, with other metadata
    for word in original_word_list:
        for i, letter in enumerate(word):
            if letter in vowels:
                um_word = word[:i]+infix+word[i:] # Add the infix before the first vowel
                infixed_words.append((um_word, word, infix_name))
                break
    
    return infixed_words, df

In [98]:
def in_infixes(df, vowels, vowel_harmony):
    """
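    In this function, we apply the In Infix to words by finding the first vowel in the word, applying the vowel harmony
    transformation if that vowel has a rule, and then adding the affix. The function returns a tuple: a list of
    (new word, old word, affix name) tuples, followed by the original dataframe.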
    """
    
    # Get the terms and convert dataframe to a list
    word_list = df["Term"].tolist()
    
    # Define the infix and infix name
    infix = "in"
    infix_name = "In Infix"
    
    # Initialize list to store new words
    infixed_words = []
                
    # Affix words with -in- and append it to the list
    for word in word_list:
        for i, letter in enumerate(word):
            if letter in vowel_harmony: # Use the dictionary passed in, not the global
                harmonized_word = word[:i]+vowel_harmony[letter]+word[i+1:] # Apply vowel harmony to the first vowel
                in_word = harmonized_word[:i]+infix+harmonized_word[i:] # Infix vowel harmonized word
                infixed_words.append((in_word, word, infix_name))
                break
            elif letter in vowels:
                in_word = word[:i]+infix+word[i:] # Infix without vowel harmony
                infixed_words.append((in_word, word, infix_name))
                break
    
    return infixed_words, df

In [93]:
# Pass the dataframe, vowel list, and vowel harmony dictionary
um_infixed_words = um_infixes(tverbs_df, vowel_list, vowel_harmony_dict)
in_infixed_words = in_infixes(tverbs_df, vowel_list, vowel_harmony_dict)

# Apply Prefixes

**About Chamorro Prefixes:** 

# Apply Suffixes

**About Chamorro Suffixes:**

# Apply Circumfixes

**About Chamorro Circumfixes:**

# Export to CSV

We will take all the word lists from above and export them to CSV files.
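
The export step attaches the original `Definition` and `Root Word` to each new word with a left merge on `Term`. The sketch below shows that merge on made-up rows and writes to an in-memory buffer instead of a file. The sample words and placeholder definitions are assumptions for illustration, not dictionary data.

```python
import io
import pandas as pd

# Made-up rows standing in for the output of an infix function
new_words = [("sumaga", "saga", "UM Infix"), ("kumånta", "kånta", "UM Infix")]
affixed_df = pd.DataFrame(new_words, columns=["Word", "Term", "Affix"])

# Made-up stand-in for the source dataframe (placeholder definitions)
source_df = pd.DataFrame({
    "Term": ["saga", "kånta"],
    "Definition": ["placeholder definition 1", "placeholder definition 2"],
    "Root Word": ["saga", "kånta"],
})

# Left merge keeps every affixed word and pulls in its metadata by Term
merged = pd.merge(
    affixed_df,
    source_df[["Term", "Definition", "Root Word"]],
    on="Term",
    how="left",
)

# Write to an in-memory buffer rather than to disk
buffer = io.StringIO()
merged.to_csv(buffer, index=False)

print(list(merged.columns))  # ['Word', 'Term', 'Affix', 'Definition', 'Root Word']
print(len(merged))           # 2
```

If a `Term` appears more than once in the source dataframe, a left merge returns one row per match, so the exported file can contain more rows than the list of affixed words.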

In [92]:
def convert_to_dataframe(affixed_words):
    """
    In this function, we convert our newly affixed words to a dataframe, and then export it to a CSV file.
    We will also include metadata from the original word list to our exported CSV file.
    """
    # Unpack the (new words, original dataframe) tuple returned by the infix functions
    new_words, old_words = affixed_words
    
    # Get the affix name from the first tuple (assumes at least one word was affixed)
    affix_name = new_words[0][2]
    
    # Convert list to dataframe
    infixed_words_df = pd.DataFrame(new_words, columns=["Word", "Term", "Affix"])
    # Add the original Definition and Root Word to the infixed words df
    filtered_df = old_words[["Term", "Definition", "Root Word"]]
    infixed_words_df = pd.merge(infixed_words_df, filtered_df, on="Term", how="left")
    
    # Save dataframe as CSV
    infixed_words_df.to_csv(f"{affix_name}_affixed_words.csv", index=False, encoding="utf-8")

In [99]:
# TEST Pass each infix output through the convert_to_dataframe function
infixed_words = [um_infixed_words, in_infixed_words]

for output in infixed_words:
    convert_to_dataframe(output)