# Google Search Chamorro Scraper

**About:** This educational project scrapes and exports articles written in the Chamorro language from the Saipan Tribune news website. To find those articles, we narrow a Google search of the site to pages whose body text contains a common Chamorro stopword. We will use the Google Custom Search JSON API to find the articles and then scrape their content. The text results will be processed in two ways:

1. The full text will be exported to an HTML file for conversion to other reader-friendly formats, such as PDF, .DOCX, or .EPUB
2. A unique word list will be exported to .CSV for additional analysis and integration into other learning tools

**Name:** Schyuler Lujan <br>
**Date Started:** 23-April-2025 <br>
**Date Completed:** In Progress <br>

## Import Libraries

In [28]:
# Import relevant libraries
import requests
import json
from bs4 import BeautifulSoup

## Call The API

### Set the API Key, CSE ID, and Query

In [29]:
# Set your API Key and Custom Search Engine ID
API_KEY = ''  # API Key goes here
CSE_ID = ''  # Custom Search Engine ID goes here

In [54]:
# Set search query to a common Chamorro stopword on the target website
QUERY = 'allintext: na site:https://www.saipantribune.com/'

### Calculate the Values for the `start` Parameter

When we call the API, we specify which results it should return by setting the `start` parameter. When choosing its value, we will keep the following in mind:

* A Google search returns 10 results per page
* Setting the `start` parameter to 1 returns the first page of results
* To get the results on subsequent pages, we must increase the `start` parameter by 10 each time

To get all the results of our search, we will do the following:

1. Read the total number of results that our search reports and store that value in the variable `total_results`
2. Calculate the start values based on `total_results`, and store those values in a list. We will calculate a range to do this: `list(range(1, total_results + 1, 10))`

**Free Tier Limits:** We are using the *free tier* of the Custom Search API, which allows up to 100 queries per day. Each request for one page of 10 results counts as one query, so we will split our start values into lists of at most 100 requests each and run them on separate days.

In [61]:
### FIXME: Get total page results and calculate the start values ###
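
One way this step could look, as a minimal sketch rather than the finished cell. It assumes the total is a count of results (called `total_results` here), that each request returns 10 results, and that the daily limit is read as 100 requests. The helper names `start_values` and `split_into_batches` are hypothetical. The optional `max_results` argument is there because Google's API reference says a single query never returns more than 100 results; that cap is worth confirming against the current documentation before relying on it.

```python
def start_values(total_results, page_size=10, max_results=None):
    """Return the 1-based `start` value for each page of results."""
    if max_results is not None:
        # Optionally respect a cap on how many results one query can return
        total_results = min(total_results, max_results)
    return list(range(1, total_results + 1, page_size))


def split_into_batches(values, batch_size=100):
    """Split the start values into lists of at most `batch_size` requests."""
    return [values[i:i + batch_size] for i in range(0, len(values), batch_size)]


print(start_values(35))  # [1, 11, 21, 31]

# Hypothetical total, used only to show the batching
batches = split_into_batches(start_values(2500))
print([len(batch) for batch in batches])  # [100, 100, 50]
```

In the API response, the count is reported under `searchInformation` as `totalResults`. It arrives as a string and is an estimate, so it would need an `int(...)` conversion and some tolerance for pages that come back empty.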

### Make the Request and Return the Search Results

In [51]:
def google_search(query, api_key, cse_id, start=1):
    """
    Sends the search query to Google by calling the Google Custom Search JSON API.
    Returns the results in a dictionary (JSON format).
    """
    # Set the endpoint for Google's search API
    url = 'https://www.googleapis.com/customsearch/v1'
    
    # Set the query parameters to send to the API
    params = {
        'key': api_key,
        'cx': cse_id,
        'q': query,
        'start': start
    }
    
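    # Make the HTTP GET request to Google (time out rather than hang indefinitely)
    response = requests.get(url, params=params, timeout=30)

    # Raise an exception if the API answers with an HTTP error status (4xx or 5xx)
    response.raise_for_status()

    return response.json()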

In [52]:
# Get the search results
search_results = google_search(QUERY, API_KEY, CSE_ID)

In [53]:
# View search results
for item in search_results.get("items", []):
    print(item["link"])

https://www.saipantribune.com/index.php/taiguini-books-launch-malingu-na-patgon/
https://www.saipantribune.com/index.php/tag/fiet-na-gacho/
https://www.saipantribune.com/news/local/gi-nuebo-na-sakan/article_a0636a6f-d11e-542b-8111-c287f7509be0.html
https://www.saipantribune.com/index.php/mesklau-na-asuntu/
https://www.saipantribune.com/index.php/pipinu-na-sagu-3/
https://www.saipantribune.com/index.php/93a49224-1dfb-11e4-aedf-250bc8c9958e/
https://www.saipantribune.com/index.php/40-anos-na-libetta/
https://www.saipantribune.com/index.php/grasiosu-na-konbetsasion/
https://www.saipantribune.com/index.php/fitme-na-pisun-tinituhun/
https://www.saipantribune.com/index.php/bd8fe9b6-1dfb-11e4-aedf-250bc8c9958e/
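
The call above only returns the first page. To gather links from later pages, the same call can be repeated once per `start` value. The sketch below assumes a `search` callable with the same signature as `google_search`. The names `collect_links` and `fake_search` are hypothetical, and the stub stands in for the real API so the example runs without a key or network access.

```python
def collect_links(search, query, api_key, cse_id, start_values):
    """Call `search` once per start value and return the links in order, without duplicates."""
    links = []
    seen = set()
    for start in start_values:
        results = search(query, api_key, cse_id, start=start)
        # A page with no results has no "items" key, so default to an empty list
        for item in results.get("items", []):
            link = item["link"]
            if link not in seen:
                seen.add(link)
                links.append(link)
    return links


def fake_search(query, api_key, cse_id, start=1):
    """Stand-in for google_search that returns canned pages (illustration only)."""
    pages = {
        1: {"items": [{"link": "https://example.com/a"}, {"link": "https://example.com/b"}]},
        11: {"items": [{"link": "https://example.com/b"}, {"link": "https://example.com/c"}]},
    }
    return pages.get(start, {})


links = collect_links(fake_search, "query", "key", "cse", [1, 11, 21])
print(links)
```

With the real function, the call would be `collect_links(google_search, QUERY, API_KEY, CSE_ID, starts)`, where `starts` is one day's list of start values.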


In [60]:
### DELETE ME: TEST ###
#print(json.dumps(search_results, indent=2)) # To see all search results

## Scrape Text Content From URLs

In [62]:
### FIXME: Iterate through scraped URLs and scrape the written content ###
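
A possible starting point for this step, sketched under assumptions. The article body is taken to live in `<p>` tags, which has not been checked against the Saipan Tribune's markup, and `extract_article_text` is a hypothetical helper. The HTML string would come from something like `requests.get(url, timeout=30).text` for each link.

```python
from bs4 import BeautifulSoup


def extract_article_text(html):
    """Return the text of every non-empty <p> element, separated by blank lines."""
    soup = BeautifulSoup(html, "html.parser")

    # Drop script and style elements so their contents never reach the output
    for tag in soup(["script", "style"]):
        tag.decompose()

    paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
    return "\n\n".join(text for text in paragraphs if text)


# Inline sample so the sketch runs without fetching a page
sample = "<html><body><p>First <b>paragraph</b></p><script>x()</script><p> </p><p>Second</p></body></html>"
print(extract_article_text(sample))
```

Selecting every `<p>` will likely pick up navigation or footer text as well. Narrowing the search to the container that holds the article body would be the natural refinement once the page structure is known.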

## Export Full Text To HTML

In [63]:
### FIXME: Export the text to a formatted HTML file ###
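
The export could be as simple as the sketch below. It assumes the scraped content is available as a list of `(heading, body_text)` pairs with paragraphs separated by blank lines. That data shape, the default title, and the helper names `build_html` and `export_html` are all assumptions made for illustration.

```python
import html


def build_html(articles, title="Chamorro Articles"):
    """Return one HTML document containing every article as a heading plus paragraphs."""
    parts = [
        "<!DOCTYPE html>",
        '<html lang="ch">',  # "ch" is the ISO 639-1 code for Chamorro
        "<head>",
        '<meta charset="utf-8">',
        f"<title>{html.escape(title)}</title>",
        "</head>",
        "<body>",
    ]
    for heading, body_text in articles:
        # Escape the text so characters like & and < do not break the markup
        parts.append(f"<h2>{html.escape(heading)}</h2>")
        for paragraph in body_text.split("\n\n"):
            paragraph = paragraph.strip()
            if paragraph:
                parts.append(f"<p>{html.escape(paragraph)}</p>")
    parts += ["</body>", "</html>"]
    return "\n".join(parts)


def export_html(articles, path):
    """Write the combined document to `path` as UTF-8."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(build_html(articles))


document = build_html([("Sample heading", "First paragraph.\n\nSecond paragraph.")])
print(document)
```

Writing UTF-8 and declaring it in the `<meta>` tag keeps Chamorro characters with diacritics intact when the file is converted to other formats.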

## Create and Export a Unique Word List

In [65]:
### FIXME: Clean text, split text into words, remove duplicates and export to CSV ###

## Test Code

In [59]:
### TEST CODE: Iterating through multiple stopwords when querying the website (not as feasible on free tier) ###
# # Set the address for the website that will be queried
# site = 'site:https://www.saipantribune.com/'

# # List common Chamorro stopwords to be used in the query
# common_stopwords = ['na', 'yan', 'ya', 'nu', 'ni', 'gi', 'ti']

# # Create a list of the queries
# QUERY = [word + " " + site for word in common_stopwords]