# Revised and Updated Chamorro Dictionary Scraper

**About This Notebook**<br>
This notebook contains the full pipeline for scraping, parsing, formatting, and exporting the contents of the Revised and Updated Chamorro-English Dictionary at https://natibunmarianas.org/chamorro-dictionary/. That dictionary is a project to revise and update the Chamorro-English Dictionary by Topping, Ogo, and Dungca (known as the "TOD" Dictionary).

This notebook includes functions for:

* Scraping the contents from the online dictionary
* Parsing and formatting the contents based on HTML tags
* Structuring the data into a nested Python dictionary
* Exporting the formatted data to JSON for future use
* Exporting the example sentences to CSV for use in a Chamorro corpus

The goal of this notebook is to prepare the dictionary data for easier analysis, transformation, or integration into future linguistic, educational, or machine learning projects.

**Name:** Schyuler Lujan <br>
**Date Started:** 14-May-2025 <br>
**Date Completed:** In Progress <br>
**Date Updated:** 15-May-2025

# Import Libraries

In [557]:
# Import libraries
import requests
from bs4 import BeautifulSoup
import time
import json
import csv
import os

# Scrape, Parse and Extract Dictionary Contents

In this section we will use `BeautifulSoup` to scrape, parse, extract, and format the dictionary contents on the website. Each entry is wrapped in a `<p>` tag with `class="EntryParagraph"`. The HTML formatting lets us target the following elements by class:

* **Term:** class="Lexeme"
* **Part of Speech:** class="Partofspeech"
* **Definition:** class="DefinitionE"

**Other Contents (Example Sentences, Synonyms, etc.)**<br>
To extract the other contents associated with a word, such as Chamorro example sentences, their English translations, synonyms, scientific names, and loanword designations, we scrape all of the text in the entry and then clean it up.
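
As a rough illustration of this cleanup step, the sketch below normalizes whitespace and splits an entry's text into sentences without breaking on abbreviations. It is not the notebook's implementation: `clean_text`, `split_sentences`, and `ABBREVIATIONS` are hypothetical names, and the abbreviation set is an assumption drawn only from the sample entry in this section (`n.`, `vt.`, `Sp.`), so it would need extending for the full dictionary.

```python
import re

# Assumed abbreviations that end in a period but do not end a sentence.
# Taken from the sample entry only; extend for real data.
ABBREVIATIONS = {"n", "vt", "Sp"}

def clean_text(text):
    """Replace non-breaking spaces and collapse runs of whitespace."""
    return " ".join(text.replace("\xa0", " ").split())

def split_sentences(text):
    """Split after periods, keeping abbreviations attached to what follows."""
    sentences, current = [], ""
    for chunk in re.split(r"(?<=\.)\s+", clean_text(text)):
        if not chunk:
            continue
        current = f"{current} {chunk}".strip()
        last_word = current.rstrip(".").split(" ")[-1]
        if last_word not in ABBREVIATIONS:
            sentences.append(current)
            current = ""
    if current:
        sentences.append(current)
    return sentences

sample = "Ha abråsa todu\xa0i sitiu.  The fence covered\nall the area. From: Sp. abraza."
print(split_sentences(sample))
```

Splitting on a bare `.` would cut `From: Sp. abraza.` into two fragments; the abbreviation check keeps it whole. A sentence that genuinely ends in one of the listed abbreviations would be merged with the next one, which is the trade-off of this approach.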

**Handling Words With More Than One Part of Speech**<br>
The website also distinguishes terms that have more than one Part of Speech. Visually on the website they appear as follows:

    abråsa n. unit of measurement from fingertip of outstretched arm to opposite shoulder. Un abråsa ha' inanakko'‑ña i tali. The length of the rope is about a yard long.

    — vt. encompass, cover a certain area. Ha abråsa todu i manmåolik na chå'guan i kellat guaka. The fenced-in area for cattle encompassed all the good grass. I kellat ha abråsa todu i sitiun i gima'. The fence covered all the area around the house. From: Sp. abraza.

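In these instances, the second entry does not include a term, and it is wrapped in a separate `<p>` tag with `class="BlockParagraph"`. The scraping function below does not handle these paragraphs yet (see the `FIXME` in `get_dictionary_content`).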
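
One way to attach these extra senses to their headword is to walk both paragraph types in document order and remember the most recent term. The sketch below is a minimal illustration under stated assumptions, not the notebook's method: the HTML snippet is hypothetical and only modeled on the class names described above, `parse_senses` is an invented helper, and it assumes a `BlockParagraph` carries the same `Partofspeech` and `DefinitionE` classes as an `EntryParagraph`, which has not been checked against the live site.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML modeled on the class names described above.
html = """
<p class="EntryParagraph"><span class="Lexeme">abråsa</span>
<span class="Partofspeech">n.</span>
<span class="DefinitionE">unit of measurement.</span></p>
<p class="BlockParagraph"><span class="Partofspeech">vt.</span>
<span class="DefinitionE">encompass, cover a certain area.</span></p>
"""

def parse_senses(html_text):
    """Group each BlockParagraph with the most recent EntryParagraph term."""
    soup = BeautifulSoup(html_text, "html.parser")
    entries = {}
    current_word = None
    # find_all returns matches in document order, so a BlockParagraph
    # always follows the EntryParagraph it belongs to.
    for p in soup.find_all("p", class_=["EntryParagraph", "BlockParagraph"]):
        if "EntryParagraph" in p.get("class", []):
            lexeme = p.find(class_="Lexeme")
            current_word = lexeme.get_text(strip=True) if lexeme else None
        if current_word is None:
            continue  # No term to attach this paragraph to
        pos = p.find(class_="Partofspeech")
        definition = p.find(class_="DefinitionE")
        entries.setdefault(current_word, []).append({
            "PartOfSpeech": pos.get_text(strip=True) if pos else "n/a",
            "Definition": definition.get_text(strip=True) if definition else "n/a",
        })
    return entries

senses = parse_senses(html)
print(senses)
```

Storing a list of senses per term, rather than a single dictionary, keeps both the `n.` and `vt.` readings of `abråsa` under one key.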

## Create URLs

First we create the URLs that we will navigate to for scraping and parsing the dictionary content. The website has a different webpage for each letter of the Chamoru alphabet, and we will construct the URLs according to this structure.

In [537]:
# Create the urls for the website
letters = [
    "a-2", "b", "ch", "d", "e", "f", "g", "h", "i", "k", "l", "m", 
    "n", "n-2", "ng", "o", "p", "r", "s", "t", "u", "y"
]

main_url = "https://natibunmarianas.org/"

# Create list of urls
page_urls = [main_url+letter+"/" for letter in letters]

## Get Dictionary Content From Website

In [543]:
def get_dictionary_content(urls):
    """
    Requests each dictionary page, parses the HTML, and extracts each entry's term,
    part of speech, definition, and remaining text based on the tag classes.
    Returns the results in a dictionary keyed by term.
    """
    ### FIXME: Handle class_="BlockParagraph" ###
    
    # Initialize dictionary for storing contents
    dictionary = {}

    # Set headers to avoid 406 error
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/114.0.0.0 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
    }

    # Go to the url and parse the HTML
    for url in urls:
        # Request webpage
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        response.encoding = response.apparent_encoding

        soup = BeautifulSoup(response.text, "html.parser")
        
        # Iterate through tags, extract the text and store in a Python dictionary
        for entry in soup.find_all("p", class_="EntryParagraph"):
            word = None # Initialize variable to store the word
            lexeme_tag = entry.find(class_="Lexeme") # Find the tag for the word
    
            # Get the word and add to the dictionary
            if lexeme_tag:
                word = lexeme_tag.get_text(strip=True)
                dictionary[word] = {}
    
                ### PART OF SPEECH ###
                # Get part of speech text
                pos_tag = entry.find(class_="Partofspeech")
                dictionary[word]["PartOfSpeech"] = pos_tag.get_text(strip=True) if pos_tag else "n/a"

                ### DEFINITION ###
                # Get Definition tag and text
                def_tag = entry.find(class_="DefinitionE")
                def_text = def_tag.get_text() if def_tag else "n/a"
                
                # Text cleanup: replace line breaks and non-breaking spaces with spaces.
                # Apostrophes are kept so the definition still matches the full text below.
                def_text = def_text.replace("\r", " ").replace("\n", " ")
                def_text = def_text.replace("\xa0", " ")
                
                # Add definition to dictionary
                dictionary[word]["Definition"] = def_text

                ### OTHER CONTENTS ###
                # Get full text -- will make text cleanup easier
                full_text = entry.get_text(separator=" ")

                # Text cleanup: same whitespace normalization as the definition
                full_text = full_text.replace("\r", " ").replace("\n", " ")
                full_text = full_text.replace("\xa0", " ")
                if def_tag:
                    full_text = full_text.replace(def_text, " ") # Remove definition
                full_text = full_text.split(".") # Split on periods to clean up each sentence

                # Strip whitespace from each string and collect in a list
                cleaned_text = [string.strip() for string in full_text]
                cleaned_text.pop(0) # Remove term and pos strings
                
                # Add cleaned text to dictionary
                dictionary[word]["Other"] = cleaned_text

        time.sleep(3)
            
    return dictionary

In [545]:
# Get dictionary contents
contents = get_dictionary_content(page_urls)

# Extract and Format Example Sentences

In [101]:
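def get_example_sentences(dictionary_contents):
    """
    Extracts example sentences from the formatted dictionary contents.
    Not yet implemented; currently returns None.
    """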
    # TODO Extract example sentences from formatted Python dictionary
    return None
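
One possible starting point for this step, assuming the strings under `Other` alternate between a Chamorro sentence and its English translation, as in the `abråsa` sample earlier in this notebook. That alternation, the `From:` filter, and the name `pair_example_sentences` are assumptions for illustration only; they have not been checked against the full dictionary, where synonyms or scientific names may break the pattern.

```python
def pair_example_sentences(dictionary_contents):
    """Return rows of term, Chamorro sentence, and English translation."""
    rows = []
    for term, fields in dictionary_contents.items():
        # Drop empty strings and etymology notes such as "From: Sp".
        strings = [
            s for s in fields.get("Other", [])
            if s and not s.startswith("From:")
        ]
        # Pair items two at a time; an unpaired trailing string is skipped.
        for chamorro, english in zip(strings[0::2], strings[1::2]):
            rows.append({"term": term, "chamorro": chamorro, "english": english})
    return rows

# Sample shaped like the output of get_dictionary_content
sample = {
    "abråsa": {
        "PartOfSpeech": "n.",
        "Definition": "unit of measurement from fingertip of outstretched arm to opposite shoulder.",
        "Other": [
            "Un abråsa ha' inanakko'‑ña i tali",
            "The length of the rope is about a yard long",
            "",
        ],
    }
}
rows = pair_example_sentences(sample)
print(rows)
```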

# Export Data

In this section we will export the dictionary contents and example sentences for future analysis and use in other language projects.

In [555]:
# Set the absolute path for exporting data
base_path = ""  # ABSOLUTE PATH GOES HERE

## Export Dictionary to JSON

In [565]:
# Copy the contents into a new dictionary for export
regular_dict = dict(contents)

# Set folder path and filename
folder_path = os.path.join(base_path, "Chamorro-Dictionary-Scraper", "exports", "json")
filename = "revised_and_updated_chamorro_dictionary.json"

# Set file path for export, creating the folder if it does not exist
os.makedirs(folder_path, exist_ok=True)
file_path = os.path.join(folder_path, filename)

# Export to JSON
with open(file_path, "w", encoding="utf-8") as file:
    json.dump(regular_dict, file, ensure_ascii=False, indent=2)

## Export Example Sentences to CSV

In [42]:
# TODO convert to dataframe

# TODO Set folder path and filename

# TODO Set file path for export

# TODO Export to CSV
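
Until the steps above are filled in, the sketch below shows one way the export could look. It uses the `csv` module imported at the top of the notebook rather than a dataframe, so it is a swapped-in technique and not the planned implementation. The column names, the `export_rows_to_csv` helper, the sample row, and the temporary output folder are all hypothetical.

```python
import csv
import os
import tempfile

# Hypothetical rows; the real ones would come from the extraction step.
example_rows = [
    {
        "term": "abråsa",
        "chamorro": "Un abråsa ha' inanakko'‑ña i tali",
        "english": "The length of the rope is about a yard long",
    },
]

def export_rows_to_csv(rows, file_path):
    """Write rows to a UTF-8 CSV file with a header line."""
    # newline="" lets the csv module control line endings itself.
    with open(file_path, "w", encoding="utf-8", newline="") as file:
        writer = csv.DictWriter(file, fieldnames=["term", "chamorro", "english"])
        writer.writeheader()
        writer.writerows(rows)

# A temporary folder stands in for the real export folder.
csv_path = os.path.join(tempfile.mkdtemp(), "example_sentences.csv")
export_rows_to_csv(example_rows, csv_path)
```

Writing with `encoding="utf-8"` preserves characters such as `å` and the glottal-stop apostrophe in the exported file.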