# 1908 Chamorro Bible Scraper

## About This Project

This notebook contains a script that scrapes and extracts the text of the 1908 Chamorro Bible, available at http://chamorrobible.org/download/YSantaBiblia-Chamorro-HTML.htm. The goal is to collect the full Chamorro text for further analysis, lexicon expansion, or natural language processing tasks. The text will also be exported to an HTML file so it can be converted to other formats such as .epub, .pdf, and .docx.

**Name:** Schyuler Lujan <br>
**Date Started:** 01-May-2025 <br>
**Date Completed:** In Progress

## Import Libraries

In [5]:
# Import relevant libraries
import requests
import json
import time
from bs4 import BeautifulSoup
from collections import defaultdict

## Parsing the Chamorro Bible HTML

In this section, we use `BeautifulSoup` to parse the HTML content of the Chamorro Bible page. The goal is to extract the **book titles**, **chapter titles**, and **verses** and organize them into a structured Python dictionary. The resulting dictionary will make it easy to access, analyze or process the text later on.

Based on the structure of the webpage, the key elements are organized as follows:

* **Book titles**: `<h2>` heading tags that carry an `id` attribute
* **Chapter titles**: `<p>` paragraph tags that carry an `id` attribute and sit inside a `<center>` element
* **Verses**: `<p>` paragraph tags with `align="LEFT"`
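
To make these three selectors concrete, here is a minimal sketch run against a hand-written fragment. The fragment is an assumption modeled on the description above: the `id` values and the sample text are invented for illustration and are not copied from the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like the structure described above.
# The id values and text are placeholders, not taken from the real page.
sample_html = """
<body>
<h2 id="book1">SAMPLE BOOK</h2>
<center><p id="ch1">CHAPTER 1</p></center>
<p align="LEFT">1 First sample verse.</p>
<p align="LEFT">2 Second sample verse.</p>
</body>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Book titles: <h2> tags that have an id attribute
book_titles = [h2.get_text(strip=True) for h2 in soup.find_all("h2", id=True)]

# Chapter titles: <p> tags with an id whose direct parent is <center>
chapter_titles = [
    p.get_text(strip=True)
    for p in soup.find_all("p", id=True)
    if p.parent.name == "center"
]

# Verses: <p> tags whose align attribute is exactly "LEFT"
verses = [p.get_text() for p in soup.find_all("p", align="LEFT")]

print(book_titles)
print(chapter_titles)
print(verses)
```

If the real page uses different casing for the `align` value or nests chapter titles differently, these filters would need adjusting.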

In [2]:
# Set the URL for the 1908 Chamorro Bible webpage
url = 'http://chamorrobible.org/download/YSantaBiblia-Chamorro-HTML.htm'

In [18]:
def get_chamorro_bible_text(url):
    """
    Parses the Chamorro Bible HTML content and extracts the book titles, chapter titles, and verses.

    Args:
        url (str): The URL of the Chamorro Bible page.

    Returns:
        dict: A nested dictionary where each book title maps to its chapters, 
              and each chapter maps to a list of verses.
              Example structure:
              {
                  'Genesis': {
                      'Chapter 1': ['In the beginning...', 'Verse 2 text', ...],
                      'Chapter 2': [...],
                      ...
                  },
                  ...
              }
    """
    # Go to the url and parse the html
    response = requests.get(url, timeout=10)
    response.raise_for_status() # Raise error for bad responses
    response.encoding = response.apparent_encoding
    
    soup = BeautifulSoup(response.text, "html.parser")
    
    # Initialize dictionary for storing bible data
    bible_data = defaultdict(lambda: defaultdict(list))
    
    current_book = None
    current_chapter = None
    
    # Find the book titles, chapter titles, verses and add to the dictionary
    for tag in soup.body.descendants:
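        if tag.name == 'h2' and tag.has_attr('id'):
            current_book = tag.get_text(strip=True)
            current_chapter = None # Reset so text before the first chapter is not filed under the previous book's last chapter title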
        elif tag.name == 'p' and tag.has_attr('id') and tag.parent.name == 'center':
            current_chapter = tag.get_text(strip=True)
        elif tag.name == 'p' and tag.get('align') == 'LEFT':
            verse = tag.get_text()
            verse = verse.replace('\n',' ') # Replace interior newlines with spaces (strip() would only trim the ends)
            if current_book and current_chapter:
                bible_data[current_book][current_chapter].append(verse)
    
    return bible_data

In [19]:
# Get the bible data
bible_contents = get_chamorro_bible_text(url)

In [34]:
# View bible data to spot-check for formatting issues
#print(bible_contents)
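
Printing the whole dictionary is hard to read at this size, so a compact summary may be easier for spot-checking. The sketch below is one possible approach, not part of the original notebook: `summarize_bible` is a hypothetical helper, and the `sample` dictionary is placeholder data standing in for `bible_contents`.

```python
def summarize_bible(bible_data, preview_chars=60):
    """Return one (book, chapter_count, verse_count, first_verse_preview) tuple per book.

    Hypothetical helper for spot-checking; expects the nested
    {book: {chapter: [verses]}} structure described above.
    """
    summary = []
    for book, chapters in bible_data.items():
        verse_count = sum(len(verses) for verses in chapters.values())
        # First verse of the first non-empty chapter, or "" if there is none
        first_verse = next((v[0] for v in chapters.values() if v), "")
        summary.append((book, len(chapters), verse_count, first_verse[:preview_chars]))
    return summary


# Placeholder data standing in for `bible_contents`
sample = {
    "SAMPLE BOOK": {
        "CHAPTER 1": ["1 First verse.", "2 Second verse."],
        "CHAPTER 2": ["1 Third verse."],
    }
}

for row in summarize_bible(sample):
    print(row)
```

Calling `summarize_bible(bible_contents)` on the scraped data would show at a glance whether any book has zero chapters or an unexpectedly low verse count.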

### Export Bible Data as JSON

In [31]:
# Convert defaultdict to regular dict before saving
bible_dict = {book: dict(chapters) for book, chapters in bible_contents.items()}

In [33]:
# Export as JSON
with open('1908_chamorro_bible.json', 'w', encoding="utf-8") as f:
    json.dump(bible_dict, f, ensure_ascii=False, indent=2)
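
A quick way to confirm the export is to load the file back and compare it with the original dictionary. The sketch below does this with placeholder data and a temporary file, so the file name and contents are assumptions for illustration only. It also checks that `ensure_ascii=False` keeps non-ASCII letters such as `å` and `ñ` readable in the file rather than writing them as `\u` escapes.

```python
import json
import os
import tempfile

# Placeholder data with non-ASCII letters; not text from the 1908 Bible
sample_dict = {"SAMPLE BOOK": {"CHAPTER 1": ["1 Sample verse with å and ñ."]}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample_bible.json")  # hypothetical file name

    # Same export call as above, pointed at the temporary file
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample_dict, f, ensure_ascii=False, indent=2)

    # Read the raw text to see how the characters were written
    with open(path, "r", encoding="utf-8") as f:
        raw_text = f.read()

    # Parse the file again to confirm nothing was lost
    with open(path, "r", encoding="utf-8") as f:
        reloaded = json.load(f)

print(reloaded == sample_dict)
print("å" in raw_text)
```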

## Export Unique Word List

In [37]:
# FIXME: Add function to get a unique word list from the bible verses
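
One possible starting point for this step is sketched below. It is not the author's implementation: `get_unique_words` is a hypothetical name, the tokenizing rule (runs of letters, optionally joined by an internal apostrophe or hyphen) is an assumption, and the sample strings are placeholders rather than verses from the 1908 text.

```python
import re
import unicodedata

# Runs of Unicode letters, optionally joined by an internal apostrophe or hyphen.
# [^\W\d_] means "a word character that is not a digit or underscore", i.e. a letter.
WORD_PATTERN = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")


def get_unique_words(bible_data):
    """Return a sorted list of unique lowercase words found in all verses.

    Hypothetical helper; expects the nested {book: {chapter: [verses]}} structure.
    """
    words = set()
    for chapters in bible_data.values():
        for verses in chapters.values():
            for verse in verses:
                # NFC keeps accented letters as single characters so they match as letters
                text = unicodedata.normalize("NFC", verse).lower()
                words.update(WORD_PATTERN.findall(text))
    return sorted(words)


# Placeholder data, not text from the 1908 Bible
sample = {"SAMPLE BOOK": {"CHAPTER 1": ["1 Håfa adai, håfa adai!", "2 Si Yu'os."]}}

unique_words = get_unique_words(sample)
print(unique_words)
```

Verse numbers and punctuation are dropped because the pattern only matches letters. A word-final apostrophe is also dropped under this rule, which may matter if the source text marks a final glottal stop that way.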

## Export HTML File

In [None]:
# FIXME: Append the bible contents to HTML structure and export to HTML file
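
As a rough sketch of what this step could look like, the function below builds a plain HTML document from the nested dictionary. It is an assumption about the eventual design, not the author's implementation: `build_bible_html` is a hypothetical name, and the choice of `<h1>` for books, `<h2>` for chapters, and `<p>` for verses is only one reasonable layout for later conversion.

```python
from html import escape


def build_bible_html(bible_data, title="1908 Chamorro Bible"):
    """Return a complete HTML document as a string.

    Hypothetical helper; expects the nested {book: {chapter: [verses]}} structure.
    """
    parts = [
        "<!DOCTYPE html>",
        '<html lang="ch">',  # "ch" is the ISO 639-1 code for Chamorro
        "<head>",
        '<meta charset="utf-8">',
        f"<title>{escape(title)}</title>",
        "</head>",
        "<body>",
    ]
    for book, chapters in bible_data.items():
        parts.append(f"<h1>{escape(book)}</h1>")
        for chapter, verses in chapters.items():
            parts.append(f"<h2>{escape(chapter)}</h2>")
            for verse in verses:
                # escape() protects against stray <, > or & in the scraped text
                parts.append(f"<p>{escape(verse.strip())}</p>")
    parts.extend(["</body>", "</html>"])
    return "\n".join(parts)


# Placeholder data, not text from the 1908 Bible
sample = {"A & B": {"CHAPTER 1": ["1 x < y "]}}
html_text = build_bible_html(sample)
print(html_text)
```

Writing the returned string to disk with `open(path, 'w', encoding='utf-8')` would produce the file for conversion. The output file name is left to the author.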