# North Star Sunday Readings Scraper

This script scrapes, processes, and exports the Sunday Readings in Chamorro from the North Star website, a publication of the Diocese of Chalan Kanoa. The text is 1) exported to an HTML file that can be converted to other reader-friendly formats for learners; 2) split into sentences; and 3) split into words. This makes the text accessible for future analysis, research, corpus development, lexicon expansion, and language learning.

**Name:** Schyuler Lujan <br>
**Date Started:** 02-May-2025 <br>
**Date Completed:** 12-May-2025

## Import Libraries

In [299]:
# Import libraries for web scraping
import requests
import time
from bs4 import BeautifulSoup

# Import libraries for exporting data
import json
import csv
import pandas as pd

# Import libraries for tokenization and text cleanup
from nltk import tokenize
import re
import string

## Get URLs for the Navigation Pages

In this section, we start from the North Star category page that lists the Sunday Readings and build the URL of every navigation page from its sequential page number (`page/2/`, `page/3/`, and so on). The number of the last page is read from the website by hand.

In [5]:
# Set the URL for the main webpage for the Sunday Readings on the North Star website
initial_url = 'https://northstar.website/category/sunday-readings-in-chamorro/'

# Get the last page of the navigation pages by looking at the webpage
last_page = 46
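
The last page number above was read off the website by hand. As a minimal sketch, it could also be derived from the pagination links on the first navigation page. The markup in `sample_nav_html` and the helper `find_last_page` are assumptions made for illustration; the real site's pagination HTML may differ.

```python
import re

# Hypothetical pagination markup; the real site's HTML may look different
sample_nav_html = """
<nav class="pagination">
  <a href="https://northstar.website/category/sunday-readings-in-chamorro/page/2/">2</a>
  <a href="https://northstar.website/category/sunday-readings-in-chamorro/page/3/">3</a>
  <a href="https://northstar.website/category/sunday-readings-in-chamorro/page/46/">46</a>
</nav>
"""

def find_last_page(html, default=1):
    """Return the highest page number found in `/page/N/` links, or `default` if none exist."""
    page_numbers = [int(n) for n in re.findall(r"/page/(\d+)/", html)]
    return max(page_numbers, default=default)

last_page_detected = find_last_page(sample_nav_html)
print(last_page_detected)
```

Detecting the value this way would keep the script current as new readings are published, at the cost of depending on the site's link pattern.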

In [335]:
def get_navigation_page_links(url, last):
    """
    Generate and return a list of URLs for navigating through all pages containing Sunday Readings in Chamorro.

    This function constructs a list of URLs, starting with a given URL and appending additional page links
    based on the provided total number of pages. The URLs in the list point to the pages containing links 
    to individual Sunday Readings.

    Parameters
    ----------
    url : str
        The URL of the first page from which navigation begins.
    last : int
        The total number of pages containing the Sunday Readings. The function will generate URLs for all pages 
        starting from the first up to the specified page number.

    Returns
    -------
    navigation_links: list of str
        A list of URLs, each pointing to a page containing links to the individual Sunday Readings.

    Notes
    -----
    - The function assumes the navigation follows a consistent URL pattern, with pages numbered sequentially (e.g., `page/1/`, `page/2/`, etc.).
    - The first URL is included in the list, followed by all subsequent pages up to the last specified page.
    """
    # Initialize list to store navigation links
    navigation_links = [url]

    # Create the other navigation urls
    for i in range(2,last+1):
        next_url = url + f"page/{i}/"
        navigation_links.append(next_url)

    return navigation_links

In [9]:
# Get the other navigation links
navigation_urls = get_navigation_page_links(initial_url, last_page)

## Get URLs for Sunday Readings

In this section, we will go to each navigation page for the Sunday Readings and use `BeautifulSoup` to get the URL for each individual Sunday Reading.

In [337]:
def get_sunday_readings_links(navigation_pages):
    """
    Scrape and return a list of URLs to individual Sunday Readings in Chamorro.

    This function takes a list of navigation page URLs, scrapes each page for links to individual Sunday Readings,
    and returns a list of those URLs.

    Parameters
    ----------
    navigation_pages : list of str
        A list of URLs pointing to pages containing links to individual Sunday Readings.

    Returns
    -------
    readings_urls: list of str
        A list of URLs, each pointing to an individual Sunday Reading in Chamorro.

    Notes
    -----
    - The function uses `requests` and `BeautifulSoup` to fetch and parse the HTML content of each page.
    - Each navigation page is expected to contain links to individual readings within `<h2>` elements with the class `title`.
    - A delay of 1 second is added between requests to avoid overloading the server.
    - Errors encountered during HTTP requests or HTML parsing are caught and logged, but they do not interrupt the process.
    """
    # Initialize list for storing Sunday Readings links
    readings_urls = []

    # Set headers to avoid 406 error
    headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/114.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
    }

    # Iterate through the navigation urls and use BeautifulSoup to get the urls for each Sunday reading
    for page in navigation_pages:
        try:
            response = requests.get(page, headers=headers, timeout=10)
            response.raise_for_status() # Raise error for bad responses
            response.encoding = response.apparent_encoding

            soup = BeautifulSoup(response.text, "html.parser")

            # Find the links to each individual Sunday reading
            links = soup.find_all("h2", class_="title")

            # Extract the individual links
            for link in links:
                anchor = link.find("a")
                # Skip headings without a link instead of raising a TypeError
                if anchor and anchor.get("href"):
                    readings_urls.append(anchor["href"])

        except requests.RequestException as e:
            print(f"Request failed for {page}: {e}")
        except Exception as e:
            print(f"Error parsing {page}: {e}")

        # Pause between pages
        time.sleep(1)

    return readings_urls

In [15]:
# Get links for each Sunday Reading
readings_urls = get_sunday_readings_links(navigation_urls)

In [17]:
# View url count
print(len(readings_urls))

460
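
Before exporting, it may be worth checking whether the same reading was collected more than once, for example if a post appears on two navigation pages. The sketch below is one way to do that; the URLs are placeholders, not real North Star links, and `deduplicate` is a hypothetical helper.

```python
def deduplicate(urls):
    """Remove repeated URLs while keeping the original order."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(urls))

# Placeholder URLs for illustration only
sample_urls = [
    "https://example.com/reading-a/",
    "https://example.com/reading-b/",
    "https://example.com/reading-a/",
]

unique_urls = deduplicate(sample_urls)
print(f"{len(sample_urls) - len(unique_urls)} duplicate(s) removed")
```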


### Export Links List to CSV

In [20]:
# Export the Sunday Readings urls to a CSV for future reference and use
with open('urls_sunday_readings.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    for url in readings_urls:
        writer.writerow([url])  # wrap each string in a list so it’s one column
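
Saving the URLs means a later session can reload them instead of re-scraping the navigation pages. A minimal round-trip sketch follows; it writes to a temporary directory with placeholder URLs, and `save_urls` and `load_urls` are hypothetical helper names.

```python
import csv
import os
import tempfile

def save_urls(urls, path):
    """Write one URL per row, matching the single-column layout used above."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for url in urls:
            writer.writerow([url])

def load_urls(path):
    """Read the single-column CSV back into a list of strings, skipping empty rows."""
    with open(path, newline="", encoding="utf-8") as f:
        return [row[0] for row in csv.reader(f) if row]

# Placeholder URLs for illustration only
sample_urls = ["https://example.com/reading-a/", "https://example.com/reading-b/"]

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "urls_sample.csv")
    save_urls(sample_urls, csv_path)
    reloaded_urls = load_urls(csv_path)

print(reloaded_urls == sample_urls)
```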

## Get Sunday Readings Content

In this section, we will use `BeautifulSoup` to scrape the reading metadata and text for each Sunday reading and store it in a Python dictionary. After examining the structure of each Sunday Reading page, these are the elements we will extract data from:

* **Title:** `<h1 class="entry-title">V DAMENGGU GI UTDINÅRIU NA TIEMPU</h1>` <br>
* **Date:** `<span class="posted-on">Posted on <a href="https://northstar.website/v-damenggu-gi-utdinariu-na-tiempu-8/" rel="bookmark"><span class="entry-date published" datetime="2025-02-08T17:54:48+10:00">February 8, 2025</span>` <br>
* **Author:** `<span class="author vcard"><a class="url fn n" href="https://northstar.website/author/rita-guerrero/">Rita Guerrero</a></span>` <br>
* **Text Content:** The text of each reading is inside `<div class="page-content">`, with every paragraph in its own `<p>` tag
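
The element lookups listed above can be tried on a small HTML fragment before running them against the live site. The fragment below is a simplified assumption modeled on those elements, and `text_or_na` is a hypothetical helper; real pages contain far more markup.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical fragment modeled on the elements listed above
sample_html = """
<article>
  <h1 class="entry-title">V DAMENGGU GI UTDINÅRIU NA TIEMPU</h1>
  <span class="posted-on">Posted on
    <span class="entry-date published" datetime="2025-02-08T17:54:48+10:00">February 8, 2025</span>
  </span>
  <span class="author vcard"><a class="url fn n" href="#">Rita Guerrero</a></span>
  <div class="page-content"><p>First paragraph.</p><p>Second paragraph.</p></div>
</article>
"""

def text_or_na(element):
    """Return the stripped text of an element, or "N/A" when the element was not found."""
    return element.get_text(strip=True) if element else "N/A"

soup = BeautifulSoup(sample_html, "html.parser")

record = {
    "title": text_or_na(soup.find("h1", class_="entry-title")),
    "date": text_or_na(soup.find("span", class_="entry-date")),
    "author": text_or_na(soup.find(class_="url fn n")),
    # A lookup that matches nothing falls back to the placeholder
    "missing": text_or_na(soup.find("h1", class_="no-such-class")),
}

# Join paragraphs with blank lines, the same separator the scraper uses
body = soup.find("div", class_="page-content")
record["text"] = "\n\n".join(p.get_text() for p in body.find_all("p")) if body else "N/A"

print(record)
```

Routing every lookup through one fallback helper keeps a missing element from raising an `AttributeError` partway through a long scrape.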

In [341]:
def get_content(content_urls):
    """
    Scrape and return a dictionary of text content and metadata for each Sunday Reading in Chamorro.

    This function visits each URL in the input list, parses the HTML to extract the article's title, author,
    date, and main text content, and stores the results in a structured dictionary.

    Parameters
    ----------
    content_urls : list of str
        A list of URLs pointing to individual articles or readings to be scraped.

    Returns
    -------
    sunday_readings: dict
        A dictionary where each key is a string in the format "article_N" (e.g., "article_1"),
        and each value is another dictionary with the following keys:

            - 'title' : str
                The title of the article.
            - 'author' : str
                The author's name.
            - 'date' : str
                The publication date.
            - 'url' : str
                The URL of the article.
            - 'text' : str
                The main body text of the article, with paragraphs separated by double newlines.

    Notes
    -----
    - The function uses `requests` and `BeautifulSoup` to fetch and parse HTML content.
    - If any metadata field is missing, "N/A" is used as a placeholder.
    - A delay of 1 second is introduced between requests to be respectful to the server.
    - Errors during fetching or parsing are caught and printed, but do not stop execution.
    
    """
    # Initialize dictionary for storing content
    sunday_readings = {}

    # Initialize counter for article naming convention
    counter = 0

    # Set headers to avoid 406 error
    headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/114.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
    }

    # Iterate through urls for the Sunday readings, parse the HTML, get the metadata and text
    for url in content_urls:
        try:
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status() # Raise error for bad responses
            response.encoding = response.apparent_encoding

            soup = BeautifulSoup(response.text, "html.parser")

            # Get elements
            title = soup.find("h1", class_="entry-title")
            date = soup.find('span', class_='entry-date published')
            author = soup.find(class_="url fn n")
            text_div = soup.find("div", class_="page-content")

            # If the title, date and author elements are found, extract the text and convert to strings, otherwise return N/A
            title = title.get_text(strip=True) if title else "N/A"
            date = date.get_text() if date else "N/A"
            author = author.get_text() if author else "N/A"

            # Get the text content; default to "N/A" so a missing div never reuses text from a previous article
            content = "N/A"
            if text_div:
                paragraphs = text_div.find_all("p")
                if paragraphs:
                    content = "\n\n".join(p.get_text() for p in paragraphs)
                else:
                    content = text_div.get_text()

            # Add reading metadata and text content to sunday_readings
            counter += 1
            sunday_readings[f"article_{counter}"] = {
                "title":title, 
                "author":author, 
                "date":date, 
                "url":str(url), 
                "text":content
            }
        
        except requests.RequestException as e:
            print(f"Request failed for {url}: {e}")
        except Exception as e:
            print(f"Error parsing {url}: {e}")

        # Pause between urls
        time.sleep(1)
            
    return sunday_readings

In [28]:
# Get Sunday Readings content
contents = get_content(readings_urls)

Request failed for https://northstar.website/damenggun-pinadesi-damenggun-ramus-i-likao-patma-5/: 404 Client Error: Not Found for url: https://northstar.website/damenggun-pinadesi-damenggun-ramus-i-likao-patma-5/


### Export Data to JSON

The data will be exported as a JSON file for future analysis and data projects.

In [32]:
# Export contents to JSON file for future analysis and research
with open('sunday_readings_in_chamorro.json', 'w', encoding="utf-8") as f:
    json.dump(contents, f, ensure_ascii=False, indent=2)
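
The `ensure_ascii=False` argument matters for Chamorro text because it keeps characters such as `Å` readable in the file instead of writing them as escape sequences. The sketch below compares both settings on a single record; the `text` value is a placeholder, not actual reading content.

```python
import json

# One record in the same shape as the scraped data; the text is a placeholder
sample_contents = {
    "article_1": {
        "title": "V DAMENGGU GI UTDINÅRIU NA TIEMPU",
        "author": "Rita Guerrero",
        "date": "February 8, 2025",
        "url": "https://northstar.website/v-damenggu-gi-utdinariu-na-tiempu-8/",
        "text": "Placeholder text.",
    }
}

readable = json.dumps(sample_contents, ensure_ascii=False, indent=2)
escaped = json.dumps(sample_contents, ensure_ascii=True, indent=2)

# Both forms load back to identical data; only the file's readability differs
restored = json.loads(readable)

print("Å" in readable, "Å" in escaped)
```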

## Format and Export Readings to HTML

In this section, we will format the scraped content as HTML and export it to an HTML file, which can then be converted to other reading formats (e.g., PDF, EPUB, DOCX) to make reading and studying this content more accessible.

In [36]:
def format_content_as_html(text_content, work_title="Sunday Readings in Chamorro"):
    """
    Convert a dictionary of article content into a single HTML-formatted string.

    Parameters
    ----------
        text_content: dict
             A dictionary where each key is an article ID and each value is another dictionary
             containing 'title', 'author', 'date', 'url', and 'text' fields representing the article.
        work_title: str, optional
            The title to use for the HTML document's <title> tag (default is "Sunday Readings in Chamorro")

    Returns
    ----------
        combined_html_content: str
            A string containing all article content wrapped in HTML tags, with each article
            formatted as a section including a heading, metadata, and paragraphs for the text body.
    """
    # Initialize the HTML structure for formatting
    combined_html_content = f"""
    <html>
    <head><meta charset = "UTF-8"><title>{work_title}</title></head>
    <body>
    """
    # Get the articles and append metadata and text contents to HTML structure
    for article in text_content:
        # Get the article metadata and text contents
        title = text_content[article]['title']
        author = text_content[article]['author']
        date = text_content[article]['date']
        url = text_content[article]['url']
        text = text_content[article]['text']
        
        # Convert text with \n\n into HTML <p> paragraphs
        paragraphs = ''.join(f"<p>{para.strip()}</p>\n" for para in text.split('\n\n') if para.strip())
        
        # Append the content to the HTML structure
        combined_html_content += f"""
        <section>
        <h1>{title}</h1>
        <p><strong>Date:</strong> {date}</p>
        <p><strong>Author:</strong> {author}</p>
        <p><strong>URL:</strong> {url}</p>
        {paragraphs}
        </section>
        <hr>
        """
    # Close the HTML structure
    combined_html_content += """
    </body>
    </html>
    """
    return combined_html_content

In [38]:
# Get HTML structure
html_structure = format_content_as_html(contents)
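
The formatter inserts scraped strings into the HTML as they are. If a title or paragraph ever contains `&`, `<`, or `>`, the output could render incorrectly. One possible safeguard, sketched here with the standard library's `html.escape`, is to escape each value first. `render_article_html` is a hypothetical helper and the sample values are made up to include special characters.

```python
from html import escape

def render_article_html(title, author, date, url, text):
    """Return one <section> with every interpolated value HTML-escaped."""
    paragraphs = "".join(
        f"<p>{escape(para.strip())}</p>\n"
        for para in text.split("\n\n")
        if para.strip()
    )
    return (
        "<section>\n"
        f"<h1>{escape(title)}</h1>\n"
        f"<p><strong>Date:</strong> {escape(date)}</p>\n"
        f"<p><strong>Author:</strong> {escape(author)}</p>\n"
        f"<p><strong>URL:</strong> {escape(url)}</p>\n"
        f"{paragraphs}"
        "</section>\n<hr>\n"
    )

# Made-up values chosen to contain characters that are special in HTML
section_html = render_article_html(
    title="Faith & Hope <draft>",
    author="N/A",
    date="February 8, 2025",
    url="https://example.com/?a=1&b=2",
    text="First paragraph.\n\nSecond paragraph with 1 < 2.",
)
print(section_html)
```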

### Export Text to HTML File

In [41]:
# Export to HTML file
with open("sunday_readings_in_chamorro.html", "w", encoding="utf-8") as file:
    file.write(html_structure)

## Get Sentences for Corpus Development

In this section we will go through the text content of the Sunday Readings, split the content into individual sentences and export to a .csv file for the development of a Chamorro corpus.

In the .csv export file, each sentence will also include its associated metadata (title, source name, date, url).

In [343]:
def split_into_sentences(data, source_name="North Star Sunday Readings"):
    """
    Split the text content of Chamorro Sunday readings into individual sentences and return them along with their metadata.

    This function processes a JSON dictionary containing the text content of multiple Chamorro Sunday readings, 
    splitting each reading's text into sentences. It also collects the title, source name, publication date, 
    and URL for each reading, and returns them as a list of tuples.

    Parameters
    ----------
    data : dict
        A dictionary where each key is an article identifier and each value contains metadata (title, date, URL) 
        and text content of the article.
    source_name : str, optional
        The name of the source from which the text content originates. Default is "North Star Sunday Readings".

    Returns
    -------
    all_sentences: list of tuples
        A list of tuples, where each tuple contains the following elements:
            - Sentence (str): A cleaned sentence from the article text.
            - Title (str): The title of the article.
            - Source (str): The source of the article (default is "North Star Sunday Readings").
            - Date (str): The publication date of the article.
            - URL (str): The URL of the article.

    Notes
    -----
    - Sentences are tokenized using `sent_tokenize` from `nltk.tokenize`. Each sentence is then cleaned by replacing
      non-breaking spaces (`\\xa0`) and newline characters (`\\n`) with regular spaces and collapsing extra whitespace.
    - The function returns the sentences as a list of tuples, with each tuple containing the sentence and its associated metadata.
    """
    # Initialize list for storing tuples
    all_sentences = []

    # Iterate through data to get article text and metadata
    for article_id, article_data in data.items():
        # Get article text and metadata
        title = article_data.get("title", "")
        date = article_data.get("date", "")
        url = article_data.get("url", "")
        text = article_data.get("text", "")
        
        # Split text into sentences
        sentences = tokenize.sent_tokenize(text)
        
        # Clean each sentence: replace non-breaking spaces and newlines, collapse extra spaces
        for s in sentences:
            s = s.replace('\xa0', ' ')
            s = s.replace('\n', ' ')
            s = re.sub(r'\s+', ' ', s)
            s = s.strip()

            # Append sentence and sentence metadata to all_sentences
            all_sentences.append((s, title, source_name, date, url))
    
    return all_sentences

In [324]:
# Get sentences
sentences = split_into_sentences(contents)
total_sentences = len(sentences)
print(f'{total_sentences:,} total sentences')

20,514 total sentences
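
The per-sentence cleanup can be checked in isolation on a small string. In Python 3, `\s` in a `str` pattern matches Unicode whitespace, including the non-breaking space, so a single substitution covers all three cleanup steps. `clean_sentence` is a hypothetical helper and the sample string is invented. Note also that `sent_tokenize` depends on NLTK's Punkt data, which may need a one-time download, and that its default model is trained on English, so sentence boundaries in Chamorro text should be treated as approximate.

```python
import re

def clean_sentence(sentence):
    """Collapse all whitespace runs (spaces, newlines, non-breaking spaces) into single spaces."""
    return re.sub(r"\s+", " ", sentence).strip()

# Invented sample containing a newline, a non-breaking space, and padding
sample = "  Line one\ncontinues\xa0here.  \n"
cleaned = clean_sentence(sample)
print(repr(cleaned))
```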


### Export Sentences to .CSV File

In [309]:
# Convert to dataframe
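# Column order matches the tuples returned by split_into_sentences: (sentence, title, source, date, url)
df_sentences = pd.DataFrame(sentences, columns=["Sentence", "Title", "Source", "Date", "URL"])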

# Save dataframe to CSV
df_sentences.to_csv('sentences_sunday_readings.csv', index=False)

## Get Words for Lexicon Expansion

In this section we will generate a unique word list from the text content of the Sunday Readings and export the list to a .CSV file for future analysis and Chamorro lexicon expansion efforts.

In [292]:
def split_into_words(data):
    """
    Split the text content of Chamorro Sunday readings into individual words and return them in a unique list.

    This function processes a JSON dictionary containing the text of multiple Chamorro Sunday readings and splits the text for each reading
    into words.

    Parameters
    ----------
    data : dict
        A dictionary where each key is an article identifier and each value contains metadata (title, date, URL) 
        and text content of the article.

    Returns
    -------
    word_list: list of str
        A list of unique words, in arbitrary order.

    Notes
    -----
    - The text content is processed by converting all characters to lowercase, removing punctuation and digits.
    - The function returns a unique word list across all Chamorro Sunday readings
    """
    # Initialize list for storing words
    word_list = []

    # Iterate through data, get text and append each word to word_list
    for article_id, article_data in data.items():
        text = article_data.get("text", "") # Get text

        # Do text cleanup
        text = text.lower() # Convert all characters to lowercase
        remove_chars = string.punctuation + string.digits # To remove common punctuation and numbers
        text = text.translate(str.maketrans('','', remove_chars)) # Clean text
        text = text.replace('”', '').replace('“','').replace('‘','') # To remove special quotes

        # Split into words and append to word_list
        words = text.split()
        word_list.extend(words)

    return list(set(word_list))

In [326]:
# Get unique word list
words = split_into_words(contents)
unique_word_count = len(words)
print(f'{unique_word_count:,} unique words')

8,890 unique words
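
The cleanup steps can be traced on a short sample. Two details are worth knowing: `string.punctuation` includes the hyphen and the straight apostrophe, so both are stripped from inside words, and a `set` has no defined order, so sorting the result makes the exported list reproducible. The sketch below also counts word frequencies with `collections.Counter`, an optional extension rather than part of the original function. The sample string is constructed from the title shown earlier with repeated words added; it is not an actual reading.

```python
import string
from collections import Counter

def tokenize_words(text):
    """Lowercase the text, strip punctuation, digits and curly quotes, then split on whitespace."""
    text = text.lower()
    remove_chars = string.punctuation + string.digits + "“”‘"
    text = text.translate(str.maketrans("", "", remove_chars))
    return text.split()

# Constructed sample for illustration only
sample_text = "V DAMENGGU GI UTDINÅRIU NA TIEMPU. “Damenggu” 2025, gi tiempu!"

tokens = tokenize_words(sample_text)
unique_words = sorted(set(tokens))   # sorted so the output order is stable between runs
frequencies = Counter(tokens)        # how often each word occurs

print(unique_words)
print(frequencies.most_common(3))
```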


### Export Word List to .CSV File

In [305]:
# Convert to dataframe
df_words = pd.DataFrame({'words_sunday_readings': words})

# Save df_words to csv
df_words.to_csv('words_sunday_readings.csv', index=False)