# Google Search Chamorro Scraper

**About:** This project is for educational purposes. It scrapes and exports articles written in the Chamorro language from the Saipan Tribune news website. To find those articles, we narrow a Google search to that site and search for a common Chamorro stopword in the text body. We will use the Google Custom Search JSON API to find the articles, then scrape their content. The text results will be processed in two ways:

1. The full text will be exported to an HTML file for conversion to other reader-friendly formats, such as PDF, .DOCX, or .EPUB
2. A unique word list will be exported to .CSV for additional analysis and integration into other learning tools

**Name:** Schyuler Lujan <br>
**Date Started:** 23-April-2025 <br>
**Date Completed:** In Progress <br>

## Import Libraries

In [13]:
# Import relevant libraries
import requests
import json
import time
from bs4 import BeautifulSoup

## Call The API

### Set the API Key, CSE ID, and Query

In [2]:
# Set your API Key and Custom Search Engine ID
API_KEY = 'AIzaSyBv-v2ToEeizr_7l4xSRr5IzpbzWH8Jiz4' # API Key goes here
CSE_ID = '2157f0d44bb5d4139' #custom search engine ID goes here

In [3]:
# Set search query to a common Chamorro stopword on the target website
QUERY = 'allintext: na site:https://www.saipantribune.com/'

### Calculate the Values for the `start` Parameter

When we call the API, we specify which results it should return by setting the `start` parameter. When choosing its value, we will keep the following in mind:

* A Google search returns 10 results per page
* Setting the `start` parameter to 1 returns the first page of results
* To get the results on subsequent pages, we must increase the `start` parameter by 10 each time (11, 21, 31, and so on)

To get all the results of our search, we will do the following:

1. Get the total number of results that our search returns and store that value in the variable `total_results`
2. Calculate the start values based on `total_results`, and store those values in a list. We will use a range to do this: `list(range(1, total_results + 1, 10))`

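**Free Tier Limits:** We are using the *free tier* of the Custom Search API, which allows up to 100 queries per day. Each request counts as one query and returns up to 10 results. To be mindful of this limit, we will split our start values into batches of 10 requests (a maximum of 100 results each) and run them on separate days. Note that Google's documentation for the `start` parameter states that the JSON API never returns more than 100 results for a single search, so results beyond the first 100 may not be reachable by paging alone.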
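
The calculation described above can be sketched as follows. This is a minimal illustration, assuming the total of 1630 results reported later in this notebook and a cap of 100 results per run; the helper name `start_values` and the constant `RESULTS_PER_PAGE` are hypothetical and do not appear elsewhere in the project.

```python
import math

RESULTS_PER_PAGE = 10  # each API request returns up to 10 results


def start_values(total_results, max_results=100):
    """Return the `start` values needed to page through up to max_results results."""
    capped = min(total_results, max_results)
    # Step through result indexes 1, 11, 21, ... up to the capped total
    return list(range(1, capped + 1, RESULTS_PER_PAGE))


starts = start_values(1630)
# One API query is spent per start value
queries_needed = math.ceil(min(1630, 100) / RESULTS_PER_PAGE)

print(starts)          # [1, 11, 21, 31, 41, 51, 61, 71, 81, 91]
print(queries_needed)  # 10
```

Ten start values mean ten queries, which leaves most of the daily free-tier quota available for re-runs.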

### Make the Request and Return Total Search Results

In [5]:
def google_search(query, api_key, cse_id, start=1):
    """
    Sends the search query to Google by calling the Google Custom Search JSON API.
    Returns the results in a dictionary (JSON format).
    """
    # Set the endpoint for Google's search API
    url = 'https://www.googleapis.com/customsearch/v1'
    
    # Set the query parameters to send to the API
    params = {
        'key': api_key,
        'cx': cse_id,
        'q': query,
        'start': start
    }
    
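    # Make the HTTP GET request to Google; time out rather than hang indefinitely
    response = requests.get(url, params=params, timeout=10)
    response.raise_for_status()  # Raise an error for bad responses
    
    return response.json()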

In [6]:
# Make request
search_results = google_search(QUERY, API_KEY, CSE_ID)

In [7]:
# Get the total number of results
total_results = int(search_results.get("searchInformation", {}).get("totalResults", 0))
print("Total Results: ", total_results)

Total Results:  1630
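
The lookups in this notebook depend on the shape of the JSON that the API returns. The sketch below uses a small hand-written dictionary (not real API output) that mimics the fields read here: `searchInformation.totalResults` and `items[].link`. It also assumes that failed calls carry an `error` object with a `message` field, which is the usual layout for Google APIs but is not confirmed by this notebook. The helper name `summarize_response` is hypothetical.

```python
def summarize_response(response_json):
    """Return (total_results, links) from a Custom Search style response dict."""
    # Assumption: failed calls carry an "error" object with a "message" field
    if "error" in response_json:
        message = response_json["error"].get("message", "unknown error")
        raise RuntimeError(f"API call failed: {message}")

    # totalResults arrives as a string, so convert it before doing arithmetic
    info = response_json.get("searchInformation", {})
    total = int(info.get("totalResults", 0))

    # "items" is absent when a page has no results, so default to an empty list
    links = [item["link"] for item in response_json.get("items", [])]
    return total, links


# Hand-written sample for illustration only
sample = {
    "searchInformation": {"totalResults": "2"},
    "items": [
        {"link": "https://example.com/a"},
        {"link": "https://example.com/b"},
    ],
}
print(summarize_response(sample))
```

Raising on an `error` object avoids silently treating a failed call as a search with zero results.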


### Get Links to 100 Search Results

In [9]:
# Limit to a maximum of 100 results
max_results = min(total_results, 100)

In [24]:
# Set list for storing links
links = []

# Loop over pages in increments of 10, get results, and store hyperlink for each item in links
for start in range(1, max_results + 1, 10):
    results = google_search(QUERY, API_KEY, CSE_ID, start=start)
    items = results.get("items", [])
    
    # Get the link for the content and store in links
    for item in items:
        links.append(str(item["link"]))
    
    # One-second pause between requests
    time.sleep(1)

In [28]:
print(len(links))
print(links[2])

100
https://www.saipantribune.com/index.php/bdaa937c-1dfb-11e4-aedf-250bc8c9958e/


## Scrape Text Content From URLs

We will use `BeautifulSoup` to scrape the text content from the hyperlinks. After examining the elements of some articles on the website, we will target the following classes:

* **blog title:** `class_="blog-title"`
* **blog author:** `"div", class_="blog-author"`
* **blog date:** `"div", class_="blog-date"`
* **blog content:** `"div", class_="blog-content"`

Blog content is sometimes nested in paragraph (`<p>`) tags under the `div`, and other times it sits directly in the `div`.

In [68]:
def get_content(urls):
    """
    Iterates through the list of hyperlinks from google_search, scrapes the text content and returns results in a dictionary
    """
    # Initialize dictionary to store scraped data
    contents = {}
    
    # Initialize counter for naming convention in contents
    counter = 0
    
    # Go to each webpage and parse it
    for url in urls:
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status() # Raise error for bad responses
            response.encoding = response.apparent_encoding #"utf-8"
            
            soup = BeautifulSoup(response.text, "html.parser")

            # Find the blog-title, blog-author, blog-date and blog-content
            title = soup.find(class_="blog-title")
            author = soup.find("div", class_="blog-author")
            date = soup.find("div", class_="blog-date")
            content_div = soup.find("div", class_="blog-content")
            
            # If elements are found, convert to strings, otherwise return N/A
            title = title.get_text(strip=True) if title else "N/A"
            author = author.get_text(strip=True) if author else "N/A"
            date = date.get_text(strip=True) if date else "N/A"
            
            # Find text content; some articles have text content nested in <p> tags under <div>
            content = "N/A"  # Default when no blog-content div is found
            if content_div:
                paragraphs = content_div.find_all("p")
                if paragraphs:
                    content = "\n\n".join(p.get_text(strip=True) for p in paragraphs)
                else:
                    content = content_div.get_text(strip=True)

            # Add data to contents
            counter += 1 # For naming
            contents[f"article_{counter}"] = {
                "title":title, 
                "author":author, 
                "date":date, 
                "url":str(url), 
                "text":content
            }
    
        except requests.RequestException as e:
            print(f"Request failed for {url}: {e}")
        except Exception as e:
            print(f"Error parsing {url}: {e}")
              
    return contents

In [70]:
### DELETE ME TEST scraping content on two articles; one with nested content, one without nested content ###
test_url = links[2:4]
get_content(test_url)

{'article_1': {'title': 'Diskalentau Na Kinalamten',
  'author': 'By',
  'date': 'Posted onNov 27 2011',
  'url': 'https://www.saipantribune.com/index.php/bdaa937c-1dfb-11e4-aedf-250bc8c9958e/',
  'text': 'Estaba un’ palaoan gi entalu’ filan estante gi un’ tenda guine gi alacha. Sige ha akompara presiun fektus siha. Ha atan hulu’ ya ilegña, “Magpu’ i nobena, Iku. Mas kahulu’ i presiun nesessidat familia siha ya sige ha’ man-ma’aña hit ni surcharge yan asuterity”. Mamahlauyu’ sa’ sumasaunauyu’ umesalau “let it be”. Pues hu ñañgon i prohima na “tana fan salakumba’ i manpairen Hegsu’ I Deni” gi otru sakan. Konfotme.\n\n—\n\n\n\nMientras ha a’ayeg mas fektus, ilegña, “Ke lau hafa na mampos bahu kinalamten tano’ta Iku na biahe? Hafa adai bidan niniha i man-ma’gasta na no siakassu siha nu i baban linala’ i publiku lau siha timuneru?”\n\nIneppe as Magoo, “Man-yayas ya ni siha ti matuñgu’ hafa para umachogue”. Ginen fuetsau na pinadese, mas ke dos mit na natibu dumiñgu i tanu’. Taya’ alibiu gu

## Export Full Text To HTML

In [None]:
### FIXME: Export the text to a formatted HTML file ###
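
One possible shape for this step is sketched below. It is not the author's implementation. It assumes the input is a dictionary shaped like the output of `get_content` (keys `title`, `author`, `date`, `text`), and the names `build_html`, `export_to_html`, and the default file name are hypothetical.

```python
from html import escape


def build_html(contents, page_title="Chamorro Articles"):
    """Build one HTML document from a dict shaped like get_content()'s output."""
    parts = [
        "<!DOCTYPE html>",
        "<html>",
        "<head>",
        '<meta charset="utf-8">',
        f"<title>{escape(page_title)}</title>",
        "</head>",
        "<body>",
    ]
    for article in contents.values():
        parts.append("<article>")
        # Escape scraped text so characters such as & and < cannot break the markup
        parts.append(f"<h1>{escape(article['title'])}</h1>")
        parts.append(f"<p>{escape(article['author'])} | {escape(article['date'])}</p>")
        # Paragraphs were joined with blank lines during scraping, so split on them
        for paragraph in article["text"].split("\n\n"):
            if paragraph.strip():
                parts.append(f"<p>{escape(paragraph.strip())}</p>")
        parts.append("</article>")
    parts.extend(["</body>", "</html>"])
    return "\n".join(parts)


def export_to_html(contents, path="chamorro_articles.html"):
    """Write the document as UTF-8 so Chamorro characters survive the export."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(build_html(contents))
```

Keeping the string building separate from the file writing makes the markup easy to inspect before anything is saved to disk.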

## Create and Export a Unique Word List

In [None]:
### FIXME: Clean text, split text into words, remove duplicates and export to CSV ###
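
A minimal sketch of this step follows, under stated assumptions rather than as the final implementation. It assumes a word is a run of letters that may contain internal hyphens or apostrophes and may end in an apostrophe, since the sample text uses the apostrophe for the glottal stop (for example `hulu’`). The names `WORD_PATTERN`, `unique_words`, `export_word_list`, and the default file name are hypothetical.

```python
import csv
import re

# Assumption: letters, optionally joined by ' ’ or -, with an optional trailing
# apostrophe for a word-final glottal stop (e.g. hulu’)
WORD_PATTERN = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*['’]?")


def unique_words(contents):
    """Return a sorted list of unique lowercase words across all article texts."""
    words = set()
    for article in contents.values():
        words.update(WORD_PATTERN.findall(article["text"].lower()))
    return sorted(words)


def export_word_list(words, path="chamorro_words.csv"):
    """Write one word per row under a single 'word' header."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["word"])
        writer.writerows([word] for word in words)
```

One known limitation of this pattern is that a trailing apostrophe used as a closing single quotation mark would be kept as part of the word, so the list may need a manual review pass.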

## Test Code

In [8]:
### TEST Calculate the start values for getting all the results ###
#start_values = list(range(1, total_results, 10))

In [None]:
# View search results
# for item in search_results.get("items", []):
#     print(item["link"])

In [None]:
### TEST CODE: Iterating through multiple stopwords when querying the website (not as feasible on free tier) ###
# # Set the address for the website that will be queried
# site = 'site:https://www.saipantribune.com/'

# # List common Chamorro stopwords to be used in the query
# common_stopwords = ['na', 'yan', 'ya', 'nu', 'ni', 'gi', 'ti']

# # Create a list of the queries
# QUERY = [word + " " + site for word in common_stopwords]