# Web Scraper for Saipan Tribune

## About This Project

**Name:** Schyuler Lujan <br>
**Date Started:** 23-April-2025 <br>
**Date Completed:** In Progress

## Install and Import Libraries

In [7]:
# Install libraries (if not already installed)
#!pip install langdetect

In [8]:
# Import relevant libraries
import requests
import json
import time
from bs4 import BeautifulSoup
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

## Get page navigation links for search results

In [10]:
# Set url for initial search on website
initial_search_url = 'https://www.saipantribune.com/index.php/author/johnsdelrosariojr/'

Browsing the website shows that the author page for John S Del Rosario Jr spans 216 pages of results. We will create a list of links that covers every one of these pages.

Future Edits: This function will be modified to dynamically capture the total page results.
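
One possible way to capture the page count dynamically is sketched below. It assumes the pagination links on the first results page follow the same `page/<n>/` pattern used for the search URLs. The helper name `find_last_page` and the sample HTML are hypothetical and are not taken from the website; the real markup should be inspected before relying on this.

```python
import re

def find_last_page(html):
    """Return the highest page number linked from a results page, or 1 if none is found."""
    # Collect every number that appears in a "/page/<n>/" style link
    page_numbers = [int(n) for n in re.findall(r"/page/(\d+)/", html)]
    return max(page_numbers, default=1)

# Hypothetical pagination markup, for illustration only
sample_html = """
<a href="https://www.saipantribune.com/index.php/author/johnsdelrosariojr/page/2/">2</a>
<a href="https://www.saipantribune.com/index.php/author/johnsdelrosariojr/page/216/">216</a>
"""
print(find_last_page(sample_html))  # 216
```

In the notebook, `html` would be the `response.text` of the first search page, and the returned number could replace the hard-coded page count.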

In [12]:
def create_all_search_urls(initial_search):
    """
    Returns a list of all the URLs we will use to search on the Saipan Tribune website.

    Args:
        initial_search (str): The first url we will start our search on

    Returns:
        search_urls (list): A list of all the urls we will search on
    """
    # Initialize list with the first page of results
    search_urls = [initial_search]

    # Pages 2 through 216 follow the "page/<n>/" pattern; page 1 is the initial url itself
    for i in range(2, 217):
        next_page = initial_search + f"page/{i}/"
        search_urls.append(next_page)

    return search_urls

In [13]:
# Generate all search URLs
search_urls = create_all_search_urls(initial_search_url)

## Get Links to Each Article

In this section, we will navigate to each search page and use `BeautifulSoup` to get the links to each article that appears in the search. Examining the structure of the search result pages shows that the article links are contained in `<a href>` tags under `<h2>` tags where `class="blog-title"`.

In [16]:
def get_article_urls(all_search_urls):
    """
    Navigates to all the search urls and uses BeautifulSoup to parse each page and get the individual article links.
    
    Args:
        all_search_urls: A list of the urls, which link to the webpages that hold the individual article links

    Returns:
        article_urls: A list of the urls to each individual news article
    """
    # Initialize list to store article URLs
    article_urls = []

    # Navigate to each search result page and get the URLs for each article on the page
    for url in all_search_urls:
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status() # Raise error for bad responses
            response.encoding = response.apparent_encoding

            soup = BeautifulSoup(response.text, "html.parser")

            # Find the individual article links
            articles = soup.find_all("h2", class_="blog-title")

            # Extract the article links, skipping headings without a usable link
            for article in articles:
                article_link = article.find("a")
                if article_link and article_link.get("href"):
                    article_urls.append(article_link["href"])
        
        except requests.RequestException as e:
            print(f"Request failed for {url}: {e}")
        except Exception as e:
            print(f"Error parsing {url}: {e}")

        # Pause between searches
        time.sleep(1)
            
    return article_urls

In [17]:
# Get article links
article_links = get_article_urls(search_urls)

In [18]:
### FIXME: Remove duplicates in the function
article_links = list(set(article_links))
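
The FIXME above asks for duplicates to be removed inside the function. A minimal sketch of one way to do that is shown here, using `dict.fromkeys`, which keeps the first occurrence of each link in its original order. By contrast, `set` does not guarantee an order, so the `article_<n>` numbering could differ between runs. The helper name `dedupe_preserving_order` and the example URLs are hypothetical.

```python
def dedupe_preserving_order(urls):
    """Return the urls with duplicates removed, keeping first-seen order."""
    # dict keys are unique and remember insertion order (Python 3.7+)
    return list(dict.fromkeys(urls))

example_links = [
    "https://example.com/a/",
    "https://example.com/b/",
    "https://example.com/a/",
]
print(dedupe_preserving_order(example_links))
# ['https://example.com/a/', 'https://example.com/b/']
```

Inside `get_article_urls`, this could be applied to `article_urls` just before the `return` statement.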

In [19]:
# View link count
print(len(article_links))

862


## Detect English Content

In this section, we will write a function that allows us to detect if content is written in English. This will allow us to exclude articles that we do not want to scrape.

In [22]:
### TODO: Implement a function that returns True if the text is in English
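
A minimal sketch of what this function might look like is given below, built on `detect` from `langdetect`, which is already imported above. The name `is_english` and the parameters `detect_fn` and `min_chars` are assumptions made for illustration. The default threshold of 20 characters is arbitrary and would need tuning against real articles.

```python
def is_english(text, detect_fn=None, min_chars=20):
    """Return True if `text` appears to be written in English, otherwise False."""
    # Very short or empty strings give unreliable results, so treat them as non-English
    if not text or len(text.strip()) < min_chars:
        return False

    if detect_fn is None:
        # Imported lazily so a different detector can be injected (e.g. for testing)
        from langdetect import DetectorFactory, detect
        DetectorFactory.seed = 0  # langdetect is non-deterministic unless seeded
        detect_fn = detect

    try:
        return detect_fn(text) == "en"
    except Exception:
        # langdetect raises LangDetectException when no language features are found;
        # any detection failure is treated as "not English"
        return False
```

With `langdetect` installed, calling `is_english(content)` on the scraped body text would let `get_contents` skip articles that are not written in English.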

## Scrape Article Metadata and Content

We will use `BeautifulSoup` to scrape the metadata and text content from each article link. After examining the elements of some articles on the website, we will target the following classes:

* **blog title:** `class_="blog-title"`
* **blog author:** `"div", class_="blog-author"`
* **blog date:** `"div", class_="blog-date"`
* **blog content:** `"div", class_="blog-content"`

Note for blog content: This is the body text of the article. It is sometimes nested in paragraph tags under the `div`, and other times it sits directly in the `div`.

In [25]:
def get_contents(urls):
    """
    Navigates to each article url, scrapes the text content and returns results in a dictionary.

    Args:
        urls: A unique list of URLs for each article

    Returns:
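        contents: A dictionary keyed by "article_<n>", where each value holds the title, author, date, url and text of one article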
    """
    # Initialize dictionary to store scraped data
    contents = {}
    
    # Initialize counter for naming convention in contents
    counter = 0
    
    # Go to each webpage and parse it
    for url in urls:
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status() # Raise error for bad responses
            response.encoding = response.apparent_encoding
            
            soup = BeautifulSoup(response.text, "html.parser")

            # Find the blog-title, blog-author, blog-date and blog-content
            title = soup.find(class_="blog-title")
            author = soup.find("div", class_="blog-author")
            date = soup.find("div", class_="blog-date")
            content_div = soup.find("div", class_="blog-content")
            
            # If elements are found for metadata, convert to strings, otherwise return N/A
            title = title.get_text(strip=True) if title else "N/A"
            author = author.get_text() if author else "N/A"
            date = date.get_text() if date else "N/A"
            
            # Extract the text content; some content is nested in <p> tags, some isn't
            content = "N/A"  # Default when the page has no blog-content div
            if content_div:
                paragraphs = content_div.find_all("p")
                if paragraphs:
                    content = "\n\n".join(p.get_text(strip=True) for p in paragraphs)
                else:
                    content = content_div.get_text(strip=True)

            # Add article metadata and text to contents
            counter += 1
            contents[f"article_{counter}"] = {
                "title":title, 
                "author":author, 
                "date":date, 
                "url":str(url), 
                "text":content
            }
    
        except requests.RequestException as e:
            print(f"Request failed for {url}: {e}")
        except Exception as e:
            print(f"Error parsing {url}: {e}")

        time.sleep(1)
              
    return contents

In [38]:
# Scrape the metadata and text of every article
contents = get_contents(article_links)

Request failed for https://www.saipantribune.com/index.php/%ef%bb%bfmafabrikan-kuttura/: 404 Client Error: Not Found for url: https://www.saipantribune.com/index.php/%EF%BB%BFmafabrikan-kuttura/
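
The failed URL contains `%ef%bb%bf`, which is the percent-encoded UTF-8 byte-order mark (bytes `EF BB BF`). A sketch of one way to strip it from a link before retrying is shown below. The helper name `strip_encoded_bom` is hypothetical, and whether the cleaned URL actually resolves on the website has not been verified here.

```python
import re

# Percent-encoded UTF-8 byte-order mark, matched in either letter case
ENCODED_BOM = re.compile(r"%ef%bb%bf", re.IGNORECASE)

def strip_encoded_bom(url):
    """Return the url without an encoded or literal byte-order mark."""
    return ENCODED_BOM.sub("", url).replace("\ufeff", "")

failed_url = "https://www.saipantribune.com/index.php/%ef%bb%bfmafabrikan-kuttura/"
print(strip_encoded_bom(failed_url))
# https://www.saipantribune.com/index.php/mafabrikan-kuttura/
```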


### Export Data to JSON

In [40]:
with open('saipan_tribune_delrosario_john.json', 'w', encoding="utf-8") as f:
    json.dump(contents, f, ensure_ascii=False, indent=2)

## Export Articles to HTML

TO-DO: Add an "about" for this section.

In [158]:
def format_content(blog_contents, blog_title="Saipan Tribune"):
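    """
    Iterates through the scraped blog content, creates an HTML structure for formatting, appends blog content to this
    structure, and returns the combined HTML content.

    Args:
        blog_contents (dict): The scraped articles, in the structure returned by get_contents
        blog_title (str): The title to use for the HTML document

    Returns:
        combined_html_content (str): A single HTML document containing every article
    """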
    # Initialize HTML structure for formatting
    combined_html_content = f"""
    <html>
    <head><meta charset = "UTF-8"><title>{blog_title}</title></head>
    <body>
    """
    for article in blog_contents:
        # Get the article metadata and text contents
        title = blog_contents[article]['title']
        author = blog_contents[article]['author']
        date = blog_contents[article]['date']
        url = blog_contents[article]['url']
        text = blog_contents[article]['text']
        
        # Convert text with \n\n into HTML <p> paragraphs
        paragraphs = ''.join(f"<p>{para.strip()}</p>\n" for para in text.split('\n\n') if para.strip())
        
        # Append the content to the HTML structure
        combined_html_content += f"""
        <section>
        <h1>{title}</h1>
        <p><strong>Date:</strong> {date}</p>
        <p><strong>Author:</strong> {author}</p>
        <p><strong>URL:</strong> {url}</p>
        {paragraphs}
        </section>
        <hr>
        """
    # Close the HTML structure
    combined_html_content += """
    </body>
    </html>
    """
    return combined_html_content
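
Scraped titles and body text can contain characters such as `<`, `>` and `&`, which would be read as markup once they are placed inside the HTML structure. A minimal sketch of escaping each field with the standard library's `html.escape` follows. The helper name `escape_article` and the sample record are hypothetical, and `format_content` as written above does not do this.

```python
from html import escape

def escape_article(article):
    """Return a copy of one article record with every string field HTML-escaped."""
    return {
        key: escape(value) if isinstance(value, str) else value
        for key, value in article.items()
    }

# Hypothetical record in the same shape that get_contents produces
sample_article = {
    "title": "Q&A: <Budget>",
    "author": "N/A",
    "date": "N/A",
    "url": "https://example.com/?a=1&b=2",
    "text": "First paragraph\n\n5 < 6",
}
safe_article = escape_article(sample_article)
print(safe_article["title"])  # Q&amp;A: &lt;Budget&gt;
```

Escaping leaves the `\n\n` paragraph separators untouched, so the split into `<p>` paragraphs would still work on the escaped text.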

### Export Contents to an HTML File

In [165]:
# Build the combined HTML, then export it to an HTML file
html_content = format_content(contents)
with open("saipantribune_test.html", "w", encoding="utf-8") as file:
    file.write(html_content)