# Achievement 1.4: Web Scraping Project

This notebook scrapes data from two Wikipedia pages, a timeline of the 20th century and a list of countries by continent, and saves the results to text files. Selenium automates a Chrome browser to load each page, and BeautifulSoup parses the resulting HTML.

## Table of Contents
1. [Imports and Setup](#1.-Imports-and-Setup)
2. [Helper Functions](#2.-Helper-Functions)
3. [Scraping Logic](#3.-Scraping-Logic)
4. [Main Execution](#4.-Main-Execution)

## 1. Imports and Setup
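This first code cell imports the libraries used throughout the notebook: Selenium to drive the browser, BeautifulSoup to parse HTML, and `time` to pause after each page load.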

In [10]:
# Import necessary libraries.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
import time
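
The imports above assume that the `selenium` and `beautifulsoup4` packages are already installed. As a minimal sketch, the check below reports any that are missing before the rest of the notebook runs. The helper name `missing_packages` and the `REQUIRED` mapping are illustrative and not part of the original notebook.

```python
from importlib.util import find_spec

# Maps each importable module name to the package name used for installation.
# (Illustrative mapping; extend it if the notebook gains more dependencies.)
REQUIRED = {"selenium": "selenium", "bs4": "beautifulsoup4"}


def missing_packages(required=REQUIRED):
    """Return the install names of modules that cannot be found."""
    return [
        pip_name
        for module, pip_name in required.items()
        if find_spec(module) is None
    ]


missing = missing_packages()
if missing:
    print("Install before running:", ", ".join(missing))
else:
    print("All required packages were found.")
```

`find_spec` only looks a module up without importing it, so the check stays cheap and has no side effects.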

## 2. Helper Functions
This section contains the utility functions for setting up the Selenium WebDriver and for saving the scraped data to local files.

In [11]:
def setup_driver(driver_path):
    """
    Initializes and returns a Selenium WebDriver instance.

    Args:
        driver_path (str): The absolute file path to the chromedriver executable.

    Returns:
        A Selenium WebDriver object.
    """
    options = webdriver.ChromeOptions()
    # The script can be run headlessly by uncommenting the following line.
    # options.add_argument("--headless")
    service = Service(driver_path)
    driver = webdriver.Chrome(service=service, options=options)
    return driver

def save_timeline_file(events, filename="20th_century_events.txt"):
    """
    Saves the timeline events to a text file with Markdown formatting.

    Args:
        events (list): A list of (decade, event) tuples.
        filename (str): The name of the file to save.
    """
    if not events:
        return

    with open(filename, "w", encoding="utf-8") as f:
        f.write("# Key Events of the 20th Century\n")
        last_decade_written = ""
        for decade, event in events:
            # Add a new Markdown header when the decade changes.
            if decade != last_decade_written:
                f.write(f"\n## {decade}\n\n")
                last_decade_written = decade
            # Write each event as a Markdown bullet point.
            f.write(f"- {event}\n")

def save_countries_file(countries, filename="country_list.txt"):
    """
    Saves the list of countries to a simple text file.

    Args:
        countries (list): A list of country names.
        filename (str): The name of the file to save.
    """
    if not countries:
        return

    with open(filename, "w", encoding="utf-8") as f:
        for country in countries:
            f.write(f"{country}\n")
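
To make the file layout produced by `save_timeline_file` easier to see, the sketch below builds the same Markdown text in memory instead of writing it to disk. It swaps the `last_decade_written` bookkeeping for `itertools.groupby`, which is an alternative technique rather than the notebook's own; the function name `render_timeline` and the sample tuples are placeholders.

```python
from itertools import groupby


def render_timeline(events):
    """Build the Markdown text for a list of (decade, event) tuples."""
    lines = ["# Key Events of the 20th Century"]
    # groupby merges consecutive tuples that share a decade, which mirrors
    # the "write a new header when the decade changes" rule.
    for decade, group in groupby(events, key=lambda pair: pair[0]):
        lines.append("")
        lines.append(f"## {decade}")
        lines.append("")
        lines.extend(f"- {event}" for _, event in group)
    return "\n".join(lines) + "\n"


# Placeholder data, not real scraped events.
sample_events = [
    ("1900s", "Placeholder event A"),
    ("1900s", "Placeholder event B"),
    ("1910s", "Placeholder event C"),
]
print(render_timeline(sample_events))
```

Like the original loop, this grouping relies on the events arriving in decade order; tuples for the same decade that are not adjacent would produce a repeated header.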

## 3. Scraping Logic
The functions below contain the core logic for scraping each of the two target webpages.
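
Both functions pause for a fixed number of seconds with `time.sleep` after `driver.get`. A fixed pause either wastes time or, on a slow connection, may still be too short. One alternative is to poll for a condition until it holds, which is the idea behind Selenium's `WebDriverWait`. The sketch below is a library-agnostic stand-in for that idea, not Selenium's API; `wait_until` and `page_ready` are hypothetical names.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Raises TimeoutError if the condition is still falsy once `timeout`
    seconds have elapsed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in condition that succeeds on the third check. In the notebook, the
# condition could instead test whether the content container is present.
attempts = []


def page_ready():
    attempts.append(1)
    return len(attempts) >= 3


wait_until(page_ready, timeout=2.0, poll_interval=0.01)
print(f"Ready after {len(attempts)} checks")
```

The polling version returns as soon as the condition holds, so a generous timeout costs nothing when the page loads quickly.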

In [12]:
def scrape_timeline_events(driver):
    """
    Navigates to the 20th-century timeline page and scrapes event data.

    Args:
        driver: An active Selenium WebDriver instance.

    Returns:
        A list of tuples, where each tuple contains a decade and an event description.
    """
    # Navigate to the target URL and allow time for dynamic content to load.
    url = "https://en.wikipedia.org/wiki/Timeline_of_the_20th_century"
    driver.get(url)
    time.sleep(5)

    # Parse the page's HTML to find the main content container.
    soup = BeautifulSoup(driver.page_source, "html.parser")
    content_div = soup.find("div", class_="mw-parser-output")
    
    events = []
    if not content_div:
        return events

    # Isolate section headers (H2 tags) that have a decade-formatted ID.
    all_h2s = content_div.find_all("h2", id=True)
    decade_headers = []
    for h2 in all_h2s:
        # A valid decade ID is numeric and ends with 's' (e.g., "1900s").
        if h2['id'].endswith('s') and h2['id'][:-1].isdigit():
            decade_headers.append(h2)

    # Process each decade section to extract associated events.
    for header in decade_headers:
        current_decade = header['id']
        
        # Start from the heading's wrapper div; the events are its following siblings.
        parent_div = header.find_parent('div', class_='mw-heading')
        if parent_div is None:
            # Skip headings that are not wrapped in a 'mw-heading' div.
            continue

        for sibling in parent_div.find_next_siblings():
            # Stop at the next H2 section (the next decade or a trailing section).
            if sibling.name == 'div' and sibling.find('h2'):
                break
            
            # Extract text from all list items within any unordered list (ul).
            if sibling.name == 'ul':
                for li in sibling.find_all('li'):
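                    # Collapse runs of whitespace; get_text(strip=True) would join
                    # the text of nested tags (such as links) without spaces.
                    event_text = " ".join(li.get_text().split())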
                    if event_text:
                        events.append((current_decade, event_text))
    return events

def scrape_country_list(driver):
    """
    Navigates to the list of countries page and scrapes the country names.

    Args:
        driver: An active Selenium WebDriver instance.

    Returns:
        A sorted list of unique country names.
    """
    # Navigate to the target URL for the bonus task.
    country_url = "https://en.wikipedia.org/wiki/List_of_countries_by_continent"
    driver.get(country_url)
    time.sleep(3)

    # Parse the new page's HTML.
    country_soup = BeautifulSoup(driver.page_source, "html.parser")
    country_content_div = country_soup.find("div", class_="mw-parser-output")
    
    countries = []
    if not country_content_div:
        return countries

    # Find all list items, as countries are contained within them.
    list_items = country_content_div.find_all('li')
    
    for item in list_items:
        # A valid country link has a 'title' attribute and is not a link to a file.
        link = item.find('a')
        if link and link.has_attr('title') and not link.get('href', '').startswith('/wiki/File:'):
            country_name = link.get_text(strip=True)
            # Keep names that contain a space or are longer than three characters;
            # this drops very short link texts (e.g., single letters).
            if ' ' in country_name or len(country_name) > 3:
                countries.append(country_name)

    # Return a sorted list of unique country names.
    return sorted(set(countries))
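
The two filtering rules used above, the decade-ID check and the country-name length check, can be isolated as pure functions and tried on sample strings without opening a browser. This is a minimal sketch: the function names are hypothetical, the regular expression is a slightly stricter substitute for the `endswith`/`isdigit` test, and the sample strings are made up for illustration.

```python
import re

# One or more ASCII digits followed by a single "s", e.g. "1900s".
DECADE_ID = re.compile(r"[0-9]+s")


def is_decade_id(heading_id):
    """Return True if a heading ID looks like a decade label."""
    return DECADE_ID.fullmatch(heading_id) is not None


def looks_like_country_name(text):
    """Same heuristic as the notebook: contains a space or has more than 3 characters."""
    return " " in text or len(text) > 3


for sample in ["1900s", "1990s", "See_also", "s"]:
    print(f"{sample!r}: decade ID? {is_decade_id(sample)}")

for sample in ["Chad", "New Zealand", "A", "edit"]:
    print(f"{sample!r}: kept? {looks_like_country_name(sample)}")
```

The last sample shows the limit of the length heuristic: any link text longer than three characters passes, so the resulting list may still contain entries that are not countries.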

## 4. Main Execution
This final block of code runs the main logic of the script by calling the functions defined in the cells above. It sets up the driver, executes the scraping tasks, saves the files, and prints a final summary.

In [13]:
def main():
    """
    Main function to execute the scraping tasks.
    """
    chromedriver_path = r'C:\Users\rewha\Ryan_Wick_Data Vis w-Python_Ach-01.00_CODE\chromedriver-win64\chromedriver.exe'
    
    driver = setup_driver(chromedriver_path)

    try:
        # --- Execute Main Task ---
        timeline_events = scrape_timeline_events(driver)
        save_timeline_file(timeline_events)

        # --- Execute Bonus Task ---
        country_list = scrape_country_list(driver)
        save_countries_file(country_list)
    finally:
        # --- Final Cleanup ---
        # Close the browser even if a scraping step raises an exception.
        driver.quit()

    # --- Summary ---
    print("--- Scraping Complete ---")
    print(f"Saved {len(timeline_events)} events to 20th_century_events.txt")
    print(f"Saved {len(country_list)} countries to country_list.txt")

# This ensures the main function runs only when the script is executed directly.
if __name__ == "__main__":
    main()

--- Scraping Complete ---
Saved 1176 events to 20th_century_events.txt
Saved 189 countries to country_list.txt
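
The summary reports the lengths of the in-memory lists, not what was written to disk. As a rough sanity check, the saved timeline file can be read back and its bullet lines counted. The sketch below demonstrates the idea on a throwaway file with placeholder content; `count_bullets` is a hypothetical helper, and it assumes each event occupies exactly one line.

```python
import tempfile
from pathlib import Path


def count_bullets(path):
    """Count the lines that start with '- ' in a Markdown file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.startswith("- "))


# Demonstration on a temporary file so that no real output is overwritten.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample_events.txt"
    sample.write_text(
        "# Title\n\n## 1900s\n\n- first\n- second\n",
        encoding="utf-8",
    )
    bullet_count = count_bullets(sample)

print(bullet_count)
```

Calling `count_bullets("20th_century_events.txt")` after a run should then match the number of events reported in the summary, provided no event text contains a line break.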
