# Google Search Chamorro Scraper

**About:** This educational project scrapes and exports articles written in the Chamorro language from the Saipan Tribune news website. To narrow the search to those articles, we run a Google search for a common Chamorro stopword in the text body, using the Google Custom Search JSON API to find the content to scrape. The text results will be processed in two ways:

1. The full text will be exported to an HTML file for conversion to other reader-friendly formats, such as .PDF, .DOCX, or .EPUB
2. A unique word list will be exported to .CSV for additional analysis and integration into other learning tools

**Name:** Schyuler Lujan <br>
**Date Started:** 23-April-2025 <br>
**Date Completed:** In Progress <br>

## Import Libraries

In [61]:
# Import relevant libraries
import requests
import json
import time
from bs4 import BeautifulSoup

## Call The API

### Set the API Key, CSE ID, and Query

In [62]:
# Set your API Key and Custom Search Engine ID
API_KEY = ""  # API key goes here
CSE_ID = ""  # Custom search engine ID goes here

In [63]:
# Set search query to a common Chamorro stopword on the target website
QUERY = 'allintext: na site:https://www.saipantribune.com/'

### Calculate the Values for the `start` Parameter

When we call the API, we specify which results it should return by setting the `start` parameter. When choosing its value, we will keep the following in mind:

* A Google search returns 10 results per page
* Setting the `start` parameter to 1 returns the first page of results
* To get the results on subsequent pages, we must increase the start parameter by 10 each time

To get all the results of our search, we will do the following:

1. Get the total number of results that our search returns and store that value in the variable `total_results`
2. Calculate the start values based on `total_results`, and store those values in a list. We will use a range to do this: `list(range(1, total_results + 1, 10))`
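
The start-value calculation described above can be sketched as follows. This is a minimal illustration, not the notebook's final code: the helper name `start_values` and its parameters are hypothetical, and the default cap of 100 assumes the API's documented limit of 100 results per search.

```python
def start_values(total_results, page_size=10, max_results=100):
    """Return the `start` values needed to page through the search results."""
    # Never ask for more results than exist, or than the cap allows
    limit = min(total_results, max_results)
    # `start` is 1-based, so the range must run up to and including `limit`
    return list(range(1, limit + 1, page_size))


print(start_values(1630))  # one start value per page, capped at 100 results
print(start_values(25))    # three pages: results 1-10, 11-20, 21-25
```

The `+ 1` matters: without it, a search with exactly 21 results would skip the page that begins at result 21.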

### Make the Request and Return Total Search Results

In [64]:
def google_search(query, api_key, cse_id, start=1):
    """
    Sends the search query to Google by calling the Google Custom Search JSON API.
    Returns the results in a dictionary (JSON format).
    """
    # Set the endpoint for Google's search API
    url = 'https://www.googleapis.com/customsearch/v1'
    
    # Set the query parameters to send to the API
    params = {
        'key': api_key,
        'cx': cse_id,
        'q': query,
        'start': start
    }
    
    # Make the HTTP GET request to Google (with a timeout so a stalled call cannot hang)
    response = requests.get(url, params=params, timeout=10)
    
    return response.json()

In [65]:
# Make request
search_results = google_search(QUERY, API_KEY, CSE_ID)

In [66]:
# Get the total number of results
total_results = int(search_results.get("searchInformation", {}).get("totalResults", 0))
print("Total Results: ", total_results)

Total Results:  1630


### Get Links to 100 Search Results

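**Free Tier Limits:** We are using the *free tier* of the Custom Search API, which allows up to 100 queries per day. Each query returns at most 10 results, so retrieving 100 links takes 10 queries. To be mindful of the daily limit, we will split our start ranges into lists that return a maximum of 100 results each, and run them on separate days. Keep in mind that the API documentation also caps any single search at 100 results (`start` + `num` may not exceed 100), so links beyond the first 100 would need a different query rather than a higher `start` value.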
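
To make the quota arithmetic concrete, here is a rough sketch. It assumes 10 results per request and a budget of 100 requests per day, as stated above; the constant and function names are hypothetical and do not appear elsewhere in this notebook.

```python
import math

RESULTS_PER_REQUEST = 10   # results returned by one API call
DAILY_REQUEST_QUOTA = 100  # free-tier calls per day, as stated above


def requests_needed(result_count, per_request=RESULTS_PER_REQUEST):
    """Return how many API calls it takes to fetch `result_count` results."""
    return math.ceil(result_count / per_request)


calls = requests_needed(100)
print("Calls for 100 results:", calls)
print("Calls left for the day:", DAILY_REQUEST_QUOTA - calls)
```

Counting calls rather than results makes it easier to see how much of the daily quota a run consumes, including calls spent on test runs.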

In [67]:
def get_links(total_results):
    """
    Gets up to the first 100 links from the Google search results and returns a list of links.
    `total_results` is the total number of results reported by the API.
    """
    # Set list for storing links
    links = []
    
    # Limit to a maximum of 100 results
    max_results = min(total_results, 100)

    # Loop over pages in increments of 10, get results, and store hyperlink for each item in links
    for start in range(1, max_results + 1, 10):
        results = google_search(QUERY, API_KEY, CSE_ID, start=start)
        items = results.get("items", [])

        # Get the link for the content and store in links
        for item in items:
            links.append(str(item["link"]))

        # One-second pause between requests
        time.sleep(1)
    
    return links

In [68]:
# Get links from the google search results
all_links = get_links(total_results)
# DELETE ME TEST CODE Get a slice of the links for testing
test_urls = all_links[2:4]

## Scrape Text Content From URLs

We will use `BeautifulSoup` to scrape the text content from the hyperlinks. After examining the elements of some articles on the website, we will target the following classes for scraping:

* **blog title:** `class_="blog-title"`
* **blog author:** `"div", class_="blog-author"`
* **blog date:** `"div", class_="blog-date"`
* **blog content:** `"div", class_="blog-content"`

Note for blog content: This is the body text of the article. It is sometimes nested under the `div` in paragraph tags, and other times it is not.

In [71]:
def get_contents(urls):
    """
    Iterates through the list of hyperlinks from get_links, scrapes the text content, and returns the results in a dictionary.
    """
    # Initialize dictionary to store scraped data
    contents = {}
    
    # Initialize counter for naming convention in contents
    counter = 0
    
    # Go to each webpage and parse it
    for url in urls:
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status() # Raise error for bad responses
            response.encoding = response.apparent_encoding #"utf-8"
            
            soup = BeautifulSoup(response.text, "html.parser")

            # Find the blog-title, blog-author, blog-date and blog-content
            title = soup.find(class_="blog-title")
            author = soup.find("div", class_="blog-author")
            date = soup.find("div", class_="blog-date")
            content_div = soup.find("div", class_="blog-content")
            
            # If elements are found, convert to strings, otherwise return N/A
            title = title.get_text(strip=True) if title else "N/A"
            author = author.get_text(strip=True) if author else "N/A"
            date = date.get_text(strip=True) if date else "N/A"
            
            # Extract the text content; some content is nested in <p> tags, some isn't
            if content_div:
                paragraphs = content_div.find_all("p")
                if paragraphs:
                    content = "\n\n".join(p.get_text(strip=True) for p in paragraphs)
                else:
                    content = content_div.get_text(strip=True)
            else:
                # No content element found, so avoid reusing a stale or undefined value
                content = "N/A"

            # Add data to contents
            counter += 1 # For naming in the dictionary
            contents[f"article_{counter}"] = {
                "title":title, 
                "author":author, 
                "date":date, 
                "url":str(url), 
                "text":content
            }
    
        except requests.RequestException as e:
            print(f"Request failed for {url}: {e}")
        except Exception as e:
            print(f"Error parsing {url}: {e}")
              
    return contents

In [72]:
### DELETE ME TEST CODE for testing HTML formatting and file export###
contents = get_contents(test_urls)

## Export Full Text To HTML

In [73]:
def format_content(blog_contents, blog_title="Saipan Tribune"):
    """
    Iterates through the scraped blog content, creates an HTML structure for formatting, appends blog content to this
    structure, and returns the combined HTML content.
    """
    # Initialize HTML structure for formatting
    combined_html_content = f"""
    <html>
    <head><meta charset = "UTF-8"><title>{blog_title}</title></head>
    <body>
    """
    for article in blog_contents:
        # Get the article metadata and text contents
        title = blog_contents[article]['title']
        author = blog_contents[article]['author']
        date = blog_contents[article]['date']
        url = blog_contents[article]['url']
        text = blog_contents[article]['text']
        
        # Convert text with \n\n into HTML <p> paragraphs
        paragraphs = ''.join(f"<p>{para.strip()}</p>\n" for para in text.split('\n\n') if para.strip())
        
        # Append the content to the HTML structure
        combined_html_content += f"""
        <section>
        <h1>{title}</h1>
        <p><strong>Date:</strong> {date}</p>
        <p><strong>Author:</strong> {author}</p>
        <p><strong>URL:</strong> {url}</p>
        {paragraphs}
        </section>
        <hr>
        """
    # Close the HTML structure
    combined_html_content += """
    </body>
    </html>
    """
    return combined_html_content

In [74]:
# Export scraped content to an HTML structure
html_content = format_content(contents)

### Export Contents to an HTML File

Export `html_content` to an HTML file, which can then be converted to formats that are easier to read, such as .EPUB, .PDF, and .DOCX.
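
The `format_content` function above interpolates the scraped text directly into the HTML, so a character such as `<` or `&` in an article could break the markup. One possible safeguard, sketched here with the standard library's `html.escape`, is to escape each value before wrapping it in tags. The helper name `to_paragraphs` is hypothetical and is not part of the notebook's code.

```python
from html import escape


def to_paragraphs(text):
    """Convert blank-line-separated text into escaped HTML <p> elements."""
    # Split on blank lines, trim each piece, and drop empty pieces
    parts = (para.strip() for para in text.split("\n\n"))
    # quote=False leaves apostrophes alone, which suits text between tags
    return "".join(f"<p>{escape(part, quote=False)}</p>\n" for part in parts if part)


print(to_paragraphs("Prices rose & fell\n\n1 < 2"))
```

The same `escape` call could be applied to the title, author, and date values before they are placed in the `<section>` template.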

In [76]:
# Export contents to an HTML File
with open("saipantribune_test.html", "w", encoding="utf-8") as file:
    file.write(html_content)

## Create and Export a Unique Word List

The data will also be used to expand the Chamorro lexicon by adding any new or missing words to the online dictionary maintained at https://www.lengguahita.com/. To do this, we will extract the unique words from the text contents of the articles and export a .csv file that contains a list of those words.
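
One way this step might look is sketched below. It is only a starting point under stated assumptions: the function names `unique_words` and `export_words` are hypothetical, words are lowercased, and a word is taken to be a run of letters that may contain internal apostrophes or hyphens (so a word with a glottal stop, written with an apostrophe, stays in one piece).

```python
import csv
import re
import unicodedata

# Letters only (Unicode-aware), allowing internal apostrophes or hyphens
WORD_PATTERN = re.compile(r"[^\W\d_]+(?:['’\-][^\W\d_]+)*")


def unique_words(texts):
    """Return a sorted list of the unique lowercase words found in `texts`."""
    words = set()
    for text in texts:
        # Normalize so accented letters such as å are single code points
        normalized = unicodedata.normalize("NFC", text).lower()
        words.update(WORD_PATTERN.findall(normalized))
    return sorted(words)


def export_words(words, path):
    """Write one word per row to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as file:
        writer = csv.writer(file)
        writer.writerow(["word"])
        writer.writerows([word] for word in words)


print(unique_words(["Na na, yan ya.", "Gi 2025, ti na"]))
```

With the `contents` dictionary returned by `get_contents`, the call might be `export_words(unique_words(a["text"] for a in contents.values()), "unique_words.csv")`, where the file name is only an example.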

In [None]:
### FIXME: Clean text, split text into words, remove duplicates and export to CSV ###

## Test Code

In [8]:
### TEST Calculate the start values for getting all the results ###
#start_values = list(range(1, total_results + 1, 10))

In [None]:
# View search results
# for item in search_results.get("items", []):
#     print(item["link"])

In [None]:
### TEST CODE: Iterating through multiple stopwords when querying the website (not as feasible on free tier) ###
# # Set the address for the website that will be queried
# site = 'site:https://www.saipantribune.com/'

# # List common Chamorro stopwords to be used in the query
# common_stopwords = ['na', 'yan', 'ya', 'nu', 'ni', 'gi', 'ti']

# # Create a list of the queries
# QUERY = [word + " " + site for word in common_stopwords]