# Guam PDN Scraper

## About This Project

This notebook contains a script to scrape and extract the news articles written by Peter Onedera in the Chamorro language, which are found on the Guam Pacific Daily News website. The goal is to make the Chamorro text and its English translations accessible for future analysis, research, corpus development, lexicon expansion, and general language learning support. The text will be collected and processed in the following ways:

1. Exported to an HTML file for conversion to reader-friendly formats, such as PDF or EPUB
2. Split into sentences for corpus development
3. Split into words for lexicon expansion

**Name:** Schyuler Lujan <br>
**Date Started:** 12-May-2025 <br>
**Date Completed:** In Progress

# Import Libraries

In [61]:
# Import libraries for web scraping
from bs4 import BeautifulSoup
import requests
import time

# Import libraries for exporting data
import json
import csv
import pandas as pd

# Import libraries for text cleanup and tokenization
#TODO

# Get URLs to Individual News Articles

In this section, we navigate to the search results on the `https://www.guampdn.com/` website for the query `Peter R. Onedera`, scrape the results page for links to individual news articles, and return a list of the URLs for those articles. This list is used in the following section to scrape the news article contents.

The search returns 95 articles in total. Each search result is wrapped in a `<h3 class="tnt-headline ">` tag, and the URL is found in the `href` attribute of the anchor tag (`<a href=...>`) inside it. A full example of this is as follows:

`<h3 class="tnt-headline "><a href="/news/guam-nikkei-association-to-host-panel-discussion-on-a-borrowed-land-on-saturday/article_b0be5d9a-7de1-4614-b5ca-927b8c0eb010.html" class="tnt-asset-link" aria-label="Guam Nikkei Association to host panel discussion on 'A Borrowed Land' on Saturday">Guam Nikkei Association to host panel discussion on 'A Borrowed Land' on Saturday</a></h3>`

Additionally, the scraped URL is relative, so we need to prepend the string `https://www.guampdn.com` to each one to get the full URL that allows us to navigate to each page individually.
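As a minimal sketch of that step, the standard library's `urllib.parse.urljoin` can resolve a relative `href` against the site root. The names `BASE_URL` and `to_full_url` are illustrative and not part of the notebook, which uses plain string concatenation in `get_urls`.

```python
from urllib.parse import urljoin

# Site root used to resolve relative links (hypothetical constant name)
BASE_URL = "https://www.guampdn.com"

def to_full_url(href):
    """Resolve a relative href from the search results against the site root."""
    return urljoin(BASE_URL, href)

# Relative href taken from the example search result above
relative_href = (
    "/news/guam-nikkei-association-to-host-panel-discussion-on-a-borrowed-land-on-saturday/"
    "article_b0be5d9a-7de1-4614-b5ca-927b8c0eb010.html"
)
full_url = to_full_url(relative_href)
print(full_url)
```

One possible advantage over concatenation is that `urljoin` leaves an already absolute URL unchanged, which may matter if the site ever mixes relative and absolute links.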

In [17]:
# Set the URL for the search results we want to scrape
search_results_url = 'https://www.guampdn.com/search/?tncms_csrf_token=8aa59fd64255a8027984365cf4d5940225badd3ae071725448f722b34652fe4f.cfd1ce29e6d3196090bc&f=html&t=article&s=start_time&sd=desc&l=100&nsa=eedition&q=Peter+R.+Onedera'

## Detect Content Written in English

In [63]:
# TODO write a function for detecting English-language titles, which will be used to exclude English content
def detect_english():
    """
    """

    return None
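
One possible starting point for this TODO is a simple function-word heuristic. The sketch below is an assumption, not the notebook's method: the name `looks_english`, the word list, and the `0.2` threshold are all hypothetical and would need tuning against real titles. The list deliberately leaves out very short words, which may also occur in Chamorro text.

```python
import re

# Small set of common English function words (an assumed, non-exhaustive list)
ENGLISH_FUNCTION_WORDS = {
    "the", "of", "and", "to", "is", "for", "on", "with",
    "that", "are", "was", "from", "by", "this", "it", "be",
}

def looks_english(title, threshold=0.2):
    """Return True if the share of English function words meets the threshold."""
    # Keep runs of letters only, so punctuation and digits are ignored
    tokens = re.findall(r"[^\W\d_]+", title.lower())
    if not tokens:
        return False
    hits = sum(1 for token in tokens if token in ENGLISH_FUNCTION_WORDS)
    return hits / len(tokens) >= threshold

english_title = (
    "Guam Nikkei Association to host panel discussion "
    "on 'A Borrowed Land' on Saturday"
)
print(looks_english(english_title))
```

Short titles with few function words can slip through a heuristic like this, so the results are best reviewed by hand before any content is excluded.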

## Scrape News Article URLs

In [103]:
def get_urls(search_url):
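    """
    Scrape a search results page and return the full URLs of the articles it lists.

    search_url: URL of the guampdn.com search results page to scrape.
    Returns a list of unique article URLs, each prefixed with the site root.
    """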
    # Initialize list to store article urls
    article_urls = []

    # Set the string for appending a complete url
    url_head = "https://www.guampdn.com"

    # Set headers to avoid 406 error
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/114.0.0.0 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
    }

    # Request the search results page and raise an error if the request fails
    response = requests.get(search_url, headers=headers, timeout=10)
    response.raise_for_status()
    response.encoding = response.apparent_encoding

    # Parse the html
    soup = BeautifulSoup(response.text, "html.parser")

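    # Find the headline element for each search result
    headlines = soup.find_all("h3", class_="tnt-headline")

    # Extract the relative link from each headline and build the full url
    for headline in headlines:
        link = headline.find("a", href=True)
        if link is None:
            # Skip headlines that do not contain a link
            continue
        article_urls.append(url_head + link["href"])

    # Remove duplicates while preserving the order of the search results
    return list(dict.fromkeys(article_urls))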

In [105]:
# Get individual news article URLs from the search results
news_article_urls = get_urls(search_results_url)

In [109]:
print(len(news_article_urls))

95


### Export News Article URLs to .CSV

In [1]:
# Requires pandas (imported as pd above) and news_article_urls from the previous cells
# Convert list to dataframe
df_article_urls = pd.DataFrame(news_article_urls, columns=["url"])

# Save dataframe to .csv
df_article_urls.to_csv('urls_guam_pdn.csv', index=False)


# Scrape News Article Contents

In this section, we will take the list of URLs returned in the previous section, navigate to each URL, and scrape and process the news article contents.

Each article follows a predictable structure: the Chamorro text is written first, followed by the English translation. The English translation is always preceded by the heading "English translation."
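
That structure suggests a simple way to separate the two languages once the text blocks of an article are available in page order. The following is a sketch under assumptions: the helper name `split_translation` is hypothetical, and the input is assumed to be a list of strings in which the heading appears as its own entry.

```python
def split_translation(paragraphs, marker="English translation"):
    """Split paragraphs into (chamorro, english) at the first marker paragraph."""
    for index, paragraph in enumerate(paragraphs):
        # Compare loosely so trailing punctuation or case does not matter
        if paragraph.strip().rstrip(".:").lower() == marker.lower():
            return paragraphs[:index], paragraphs[index + 1:]
    # No marker found: treat everything as Chamorro text
    return list(paragraphs), []

# Placeholder text blocks standing in for a scraped article
sample = [
    "Chamorro paragraph one.",
    "Chamorro paragraph two.",
    "English translation.",
    "English paragraph one.",
]
chamorro, english = split_translation(sample)
print(len(chamorro), len(english))
```

Articles without the heading come back with an empty English list, which makes them easy to flag for manual review.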

The structure of the article webpages was examined to determine the elements we will extract data from. These elements are as follows:

* **Chamorro Title:** `<h1 itemprop="headline" class="headline">`
* **Author:** `<span itemprop="author" class="tnt-byline">`
* **Date:** `<time datetime="2019-03-20T11:58:12+10:00" class="tnt-date asset-date text-muted">`
* **Chamorro Text:** `<div id="article-body" itemprop="articleBody" class="asset-content  subscriber-premium" false="">` (text is wrapped in `<p>` tags)
* **English Title:** `<h2 class="presto-h2"><span class="exclude-from-newsgate">`
* **English Text:** `<p><span class="exclude-from-newsgate">`
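
The elements listed above could be extracted with BeautifulSoup along these lines. This is a sketch under assumptions: `parse_article` and the dictionary keys it returns are hypothetical names, the sample HTML is placeholder markup modeled on the listed elements, and real pages may differ. It does not yet separate the Chamorro text from the English text.

```python
from bs4 import BeautifulSoup

def parse_article(html):
    """Extract title, author, date, and body paragraphs from one article page."""
    soup = BeautifulSoup(html, "html.parser")

    def text_of(tag):
        # Return stripped text, or None when the element is missing
        return tag.get_text(strip=True) if tag else None

    title = soup.find("h1", class_="headline")
    author = soup.find("span", class_="tnt-byline")
    date = soup.find("time", class_="asset-date")
    body = soup.find("div", id="article-body")

    # Collect the text of every <p> inside the article body, in page order
    paragraphs = []
    if body:
        paragraphs = [p.get_text(" ", strip=True) for p in body.find_all("p")]

    return {
        "title": text_of(title),
        "author": text_of(author),
        "date": date.get("datetime") if date else None,
        "paragraphs": paragraphs,
    }

# Placeholder markup modeled on the elements listed above
sample_html = """
<h1 itemprop="headline" class="headline">Placeholder title</h1>
<span itemprop="author" class="tnt-byline">Placeholder author</span>
<time datetime="2019-03-20T11:58:12+10:00" class="tnt-date asset-date text-muted">Mar 20, 2019</time>
<div id="article-body" itemprop="articleBody" class="asset-content  subscriber-premium">
  <p>First paragraph.</p>
  <p><span class="exclude-from-newsgate">Second paragraph.</span></p>
</div>
"""
article = parse_article(sample_html)
print(article["title"], article["date"])
```

Reading the date from the `datetime` attribute rather than the visible text keeps the timestamp in a machine-readable form.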

## Scrape News Article Contents

In [25]:
#TODO write a function for navigating to each article URL and scraping the content

### Export News Article Contents to .JSON

In [43]:
#TODO export news articles to JSON file
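
A possible shape for this export, assuming the scraped articles are held as a list of dictionaries. The function name `export_articles`, the file name `articles_guam_pdn.json`, and the sample record are hypothetical.

```python
import json
import os
import tempfile

def export_articles(articles, path):
    """Write a list of article dictionaries to a UTF-8 JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps characters such as å and ñ readable in the file
        json.dump(articles, f, ensure_ascii=False, indent=2)

# Placeholder record; written to the temp directory for this demonstration
sample_articles = [{"title": "Håfa Adai", "paragraphs": ["Placeholder text."]}]
out_path = os.path.join(tempfile.gettempdir(), "articles_guam_pdn.json")
export_articles(sample_articles, out_path)
```

Without `ensure_ascii=False`, `json.dump` would write non-ASCII characters as `\uXXXX` escapes, which is valid JSON but harder to read when inspecting Chamorro text by hand.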

# Format and Export Scraped News Articles to HTML

To-Do: Write the description for this section

## Format Article Contents Into HTML String

In [29]:
#TODO write a function for wrapping the news articles into an HTML formatted string
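
One way this function might look, assuming each article has been reduced to a title, author, date, and list of paragraph strings. The name `article_to_html` and the chosen tags are hypothetical; the standard library's `html.escape` is used so that characters such as `&` and `<` in the text do not break the markup.

```python
from html import escape

def article_to_html(title, author, date, paragraphs):
    """Wrap one article's fields in a simple HTML fragment."""
    body = "\n".join(f"  <p>{escape(p)}</p>" for p in paragraphs)
    return (
        "<article>\n"
        f"  <h1>{escape(title)}</h1>\n"
        f'  <p class="byline">{escape(author)}, {escape(date)}</p>\n'
        f"{body}\n"
        "</article>"
    )

# Placeholder values chosen to show the escaping
html_out = article_to_html("A & B", "Placeholder author", "2019-03-20", ["One <two>"])
print(html_out)
```

For conversion to PDF or EPUB, the fragments would likely need to be joined inside a full document that declares UTF-8 encoding.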

### Export String to HTML

In [31]:
#TODO export HTML formatted string to HTML file

# Split Text Into Sentences

## Split Article Text Into Sentences

In [34]:
# TODO write a function for splitting the article text into sentences
# Format should be a list of tuples (Chamorro, English, Article Title, Article Author, Article Date, Article URL)
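
A naive sketch of the tuple format described above. The names `split_sentences` and `pair_sentences` are hypothetical, and two strong assumptions apply: sentences end at `.`, `!`, or `?` followed by whitespace, and the Chamorro and English sentences line up by position. Neither is guaranteed for real articles.

```python
import re
from itertools import zip_longest

# Split after ., !, or ? when followed by whitespace (a naive rule)
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def split_sentences(text):
    """Split text into sentences on terminal punctuation."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]

def pair_sentences(chamorro_text, english_text, title, author, date, url):
    """Pair sentences by position and attach the article metadata."""
    chamorro = split_sentences(chamorro_text)
    english = split_sentences(english_text)
    # zip_longest pads the shorter side with "" so no sentence is dropped
    return [
        (ch, en, title, author, date, url)
        for ch, en in zip_longest(chamorro, english, fillvalue="")
    ]

rows = pair_sentences("A. B.", "C.", "title", "author", "date", "url")
print(rows)
```

This rule also splits after initials and abbreviations, such as the "R." in "Peter R. Onedera", so rows with an empty side or an unusually short sentence are worth checking manually.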

### Export Sentences to .CSV

In [39]:
# TODO export sentences to a .CSV file

# Split Text Into Words

## Split Article Text Into Words

In [37]:
# TODO write a function for splitting article text into words, and return a list of unique words
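
A minimal sketch of this step using a regular expression. The names `WORD_PATTERN` and `unique_words` are hypothetical. The pattern matches runs of Unicode letters, so characters such as å and ñ are kept, and it allows an apostrophe or hyphen between letters.

```python
import re

# Letters (including å, ñ), optionally joined by an apostrophe or hyphen
WORD_PATTERN = re.compile(r"[^\W\d_]+(?:['’\-][^\W\d_]+)*")

def unique_words(text):
    """Return the sorted list of unique lowercase words found in the text."""
    return sorted(set(WORD_PATTERN.findall(text.lower())))

words = unique_words("Håfa adai! Håfa, adai. Well-known")
print(words)
```

A limitation of this pattern is that a word-final apostrophe, used for the glottal stop in Chamorro orthography, is dropped, so the pattern would need adjusting before the list is used for lexicon work.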

### Export Words to .CSV

In [41]:
# TODO export word list to a .CSV file