# Task 7.4: Web Scraping (Revised)

This notebook scrapes the Wikipedia page **“Key events of the 20th century”** and saves:

1. A **clean text** version of the page content (focused on the main article text rather than site navigation).
2. A **structured dataset** of events with a best-effort **year** extraction, saved as CSV for easier downstream analysis.

The goal is not only to make the code run, but to produce output that is cleaner and more analysis-ready.


In [1]:
import sys, os
print("PYTHON:", sys.executable)
print("WORKING DIRECTORY:", os.getcwd())

PYTHON: /opt/anaconda3/envs/20th_century/bin/python
WORKING DIRECTORY: /Users/stephenhelvig/Documents/Python Projects/20th-century


## Install dependencies


In [5]:
import sys
!"{sys.executable}" -m pip install -r requirements.txt

Collecting selenium>=4.6.0 (from -r requirements.txt (line 4))
  Using cached selenium-4.40.0-py3-none-any.whl.metadata (7.7 kB)
INFO: pip is looking at multiple versions of selenium to determine which version is compatible with other requirements. This could take a while.
  Downloading selenium-4.39.0-py3-none-any.whl.metadata (7.5 kB)
  Downloading selenium-4.38.0-py3-none-any.whl.metadata (7.5 kB)
  Downloading selenium-4.37.0-py3-none-any.whl.metadata (7.5 kB)
  Downloading selenium-4.36.0-py3-none-any.whl.metadata (7.5 kB)
  Downloading selenium-4.35.0-py3-none-any.whl.metadata (7.4 kB)
  Downloading selenium-4.34.2-py3-none-any.whl.metadata (7.5 kB)
  Downloading selenium-4.34.1-py3-none-any.whl.metadata (7.5 kB)
INFO: pip is still looking at multiple versions of selenium to determine which version is compatible with other requirements. This could take a while.
  Downloading selenium-4.34.0-py3-none-any.whl.metadata (7.5 kB)
  Downloading selenium-4.33.0-py3-none-any.whl.metadata

## Imports

In [1]:
import re
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Optional (bonus): Selenium for DOM inspection on rendered pages
try:
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.chrome.options import Options
    from webdriver_manager.chrome import ChromeDriverManager
    from selenium.webdriver.chrome.service import Service
    SELENIUM_AVAILABLE = True
except ImportError as e:
    SELENIUM_AVAILABLE = False
    print("Selenium not available in this environment:", e)

print("All core imports OK")

All core imports OK


In [2]:
import sys, selenium
print("Python:", sys.executable)
print("Selenium:", selenium.__version__)


Python: /opt/anaconda3/envs/20th_century/bin/python
Selenium: 4.32.0


## 1) Scrape the page with Requests + BeautifulSoup (cleaned)

Instead of calling `soup.get_text()` on the entire document (which pulls in menus, headers, footers, and UI labels),
this step targets the **main article container** and extracts only the relevant text elements.

Approach:
- Scope to the parsed article body inside `div#mw-content-text` (its `div.mw-parser-output` child)
- Remove common noisy elements (tables, navboxes, references, scripts/styles)
- Extract paragraph and list-item text
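
The three steps above can be sketched on a tiny, self-contained page. The HTML string and the helper name `extract_article_lines` are invented for illustration, and the selector list is shortened; this is a minimal sketch of the idea, not the exact cell used on the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page, invented for illustration only
SAMPLE_HTML = """
<html><body>
  <div id="mw-navigation"><ul><li>Main page navigation link text</li></ul></div>
  <div id="mw-content-text">
    <p>The First World War lasted from 1914 to 1918.<sup class="reference">[1]</sup></p>
    <table class="infobox"><tr><td>Infobox text that should be dropped</td></tr></table>
    <ul><li>1914 saw the completion of the Panama Canal.</li></ul>
    <div class="navbox"><ul><li>Navbox link that should be dropped</li></ul></div>
  </div>
</body></html>
"""

def extract_article_lines(html: str) -> list[str]:
    soup = BeautifulSoup(html, "html.parser")
    # Step 1: scope to the article container; fall back to the whole document
    content = soup.select_one("div#mw-content-text")
    if content is None:
        content = soup
    # Step 2: remove noisy elements before reading any text
    for tag in content.select("script, style, table, .navbox, .reference"):
        tag.decompose()
    # Step 3: collect paragraph text first, then list-item text
    lines = [p.get_text(" ", strip=True) for p in content.find_all("p")]
    lines += [li.get_text(" ", strip=True) for li in content.find_all("li")]
    return [line for line in lines if line]

sample_lines = extract_article_lines(SAMPLE_HTML)
for line in sample_lines:
    print(line)
```

Only the paragraph and the list item inside the content container survive. The navigation item sits outside the scoped container, and the infobox, navbox, and citation marker are removed before text extraction.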


In [3]:
# 1) Define the URL to scrape
url = "https://en.wikipedia.org/wiki/Key_events_of_the_20th_century"

# 2) Download the page
headers = {"User-Agent": "Mozilla/5.0"}
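response = requests.get(url, headers=headers, timeout=30)
response.raise_for_status()  # fail early on HTTP errors instead of parsing an error page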
print("Status code:", response.status_code)

# 3) Parse the page HTML
soup = BeautifulSoup(response.text, "html.parser")

# 4) Target the main content area
content = soup.select_one("#mw-content-text > div.mw-content-ltr.mw-parser-output")
if content is None:
    # Fallback: if selector changes, use full soup
    content = soup

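# 5) Stop at non-content sections (cuts off "See also", "References", etc.)
stop_headings = {"See also", "References", "External links", "Further reading", "Sources"}

for heading in content.find_all(["h2", "h3"]):
    # Older markup nests the title in span.mw-headline; newer markup puts it directly in the heading
    span = heading.find("span", class_="mw-headline")
    title = (span if span is not None else heading).get_text(" ", strip=True)
    if title in stop_headings:
        # Newer markup wraps headings in div.mw-heading, so cut from that wrapper when present
        parent = heading.parent
        in_wrapper = parent is not None and "mw-heading" in (parent.get("class") or [])
        start = parent if in_wrapper else heading
        # Remove the heading and everything after it inside the content area
        for sib in list(start.next_siblings):
            sib.extract()
        start.decompose()
        break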

# 6) Remove common noisy elements inside the content area
noisy_selectors = [
    "script", "style", "noscript", "table",
    ".navbox", ".vertical-navbox", ".infobox", ".metadata",
    ".mw-editsection", ".reference", "sup", ".reflist",
    "ol.references", "li[id^='cite_note']", ".citation", ".hatnote",
]
for tag in content.select(", ".join(noisy_selectors)):
    tag.decompose()

# 7) Extract clean text from paragraphs and list items
paragraphs = [p.get_text(" ", strip=True) for p in content.find_all("p")]
list_items = [li.get_text(" ", strip=True) for li in content.find_all("li")]

# Keep only non-empty lines, and avoid extremely short UI-like fragments
def _is_reasonable_line(s: str) -> bool:
    s = s.strip()
    if not s:
        return False
    if len(s) < 20:
        return False
    # filter out common Wikipedia boilerplate strings if they appear
    boilerplate_patterns = [
        r"^Jump to navigation", r"^Jump to search", r"^Contents$",
        r"^This article", r"^From Wikipedia"
    ]
    return not any(re.search(pat, s, flags=re.IGNORECASE) for pat in boilerplate_patterns)

clean_lines = [t for t in (paragraphs + list_items) if _is_reasonable_line(t)]
clean_text = "\n".join(clean_lines)

print("Clean text preview (first 25 lines):")
print("\n".join(clean_text.splitlines()[:25]))

Status code: 200
Clean text preview (first 25 lines):
The 20th century changed the world in unprecedented ways. The World Wars sparked tension between countries and led to the creation of atomic bombs , the Cold War led to the space race and the creation of space-based rockets, and the World Wide Web was created. These advancements have played a significant role in citizens' lives and shaped the 21st century into what it is today.
The new beginning of the 20th century marked significant changes. The 1900s saw the decade herald a series of inventions, including the automobile, airplane and radio broadcasting. 1914 saw the completion of the Panama Canal .
The scramble for Africa continued in the 1900s and resulted in wars and genocide across the continent. The atrocities in the Congo Free State shocked the civilized world.
From 1914 to 1918 the First World War, and its aftermath, caused major changes in the power balance of the world, destroying or transforming some of the most powerful 

## 2) Create a structured dataset of events (year + description)

In [4]:
# Build a structured dataset: one row per (year, sentence that mentions that year)

# Match years likely relevant to the 20th century timeline
year_re = re.compile(r"\b(18\d{2}|19\d{2}|20\d{2})\b")  # 1800–2099

def split_sentences(text: str) -> list[str]:
    # Simple sentence splitter
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

records = []

for line in clean_lines:
    # Break each paragraph/list line into sentences
    for sent in split_sentences(line):
        if len(sent) < 30:
            continue

        years = year_re.findall(sent)
        if not years:
            continue

        # Create one record per year found in the sentence
        for y in years:
            records.append({"year": int(y), "event": sent})

events_df = pd.DataFrame(records)

# Normalize whitespace and drop duplicate (year, event) rows
if not events_df.empty:
    events_df["event"] = (
        events_df["event"]
        .str.replace(r"\s+", " ", regex=True)
        .str.strip()
    )
    events_df = (
        events_df.drop_duplicates()
        .sort_values("year")
        .reset_index(drop=True)
    )

print("Structured rows:", len(events_df))
display(events_df.head(20))


Structured rows: 106


|   | year | event |
|---|------|-------|
| 0 | 1910 | The Korean Peninsula was a Japanese colony bet... |
| 1 | 1914 | 1914 saw the completion of the Panama Canal . |
| 2 | 1914 | From 1914 to 1918 the First World War, and its... |
| 3 | 1914 | The First World War (or simply WWI), termed "T... |
| 4 | 1914 | After a period of diplomatic and military esca... |
| 5 | 1917 | In 1917 Russia ended hostile actions against t... |
| 6 | 1917 | The Russian Revolution of 1917 (ending in the ... |
| 7 | 1918 | From 1914 to 1918 the First World War, and its... |
| 8 | 1918 | The First World War (or simply WWI), termed "T... |
| 9 | 1918 | Although Germany shifted huge forces from the ... |
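
The table above lists some sentences more than once (for example under both 1914 and 1918) because the extraction emits one row per year mentioned in a sentence. The sketch below reproduces that behavior with the same regular expressions as cell `In [4]`; the helper name `rows_for` and the sample sentences are invented for illustration. It also shows why the year column is only best-effort: any four-digit number in the 1800 to 2099 range is treated as a year.

```python
import re

# Same patterns as in the extraction cell
YEAR_RE = re.compile(r"\b(18\d{2}|19\d{2}|20\d{2})\b")

def split_sentences(text: str) -> list[str]:
    # Split after ., ! or ? when followed by whitespace
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def rows_for(text: str) -> list[dict]:
    # Hypothetical helper: one row per (year, sentence) pair
    rows = []
    for sent in split_sentences(text):
        for y in YEAR_RE.findall(sent):
            rows.append({"year": int(y), "event": sent})
    return rows

# Invented sample text: one sentence with two years, one with none
sample = "From 1914 to 1918 the First World War reshaped Europe. Radio spread quickly afterwards."
sample_rows = rows_for(sample)
for row in sample_rows:
    print(row)

# A plain quantity that happens to look like a year is also captured
false_positive = rows_for("The army fielded 1900 soldiers at the border.")
print(false_positive)
```

The first sentence yields two rows with identical `event` text, the second sentence yields none, and the last call shows a number being misread as the year 1900.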


## 3) Save outputs

- `20th_century_key_events_clean.txt`: cleaned text lines
- `20th_century_key_events.csv`: structured event table (year + event)

These files are intended to be analysis-ready inputs for later steps.
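
One way to check that a saved CSV is usable downstream is to re-read it and validate its shape. The sketch below uses the standard library `csv` module rather than pandas so it runs without extra dependencies; the function name `validate_events_csv`, the demo file name, and the demo rows are assumptions made for illustration.

```python
import csv
import tempfile
from pathlib import Path

def validate_events_csv(path: Path) -> int:
    """Return the row count after checking the header and each row."""
    with path.open(newline="", encoding="utf-8") as fh:
        reader = csv.DictReader(fh)
        if reader.fieldnames != ["year", "event"]:
            raise ValueError(f"unexpected columns: {reader.fieldnames}")
        count = 0
        for row in reader:
            int(row["year"])  # raises ValueError on a non-numeric year
            if not row["event"].strip():
                raise ValueError("empty event text")
            count += 1
    return count

# Demo on a throwaway file with invented rows (commas and quotes included)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_events.csv"
    with demo_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["year", "event"])
        writer.writerow([1914, "1914 saw the completion of the Panama Canal ."])
        writer.writerow([1917, 'Termed "The Great War", it reshaped Europe, and more.'])
    demo_count = validate_events_csv(demo_path)

print("Valid rows:", demo_count)
```

The same function could be pointed at `20th_century_key_events.csv` after the save cell has run, assuming the file was written with `index=False` so that its columns are exactly `year` and `event`.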


In [5]:
from pathlib import Path

txt_path = Path("20th_century_key_events_clean.txt")
csv_path = Path("20th_century_key_events.csv")

txt_path.write_text(clean_text, encoding="utf-8")
events_df.to_csv(csv_path, index=False, encoding="utf-8")

print("Saved text to:", txt_path.resolve())
print("Saved CSV to:", csv_path.resolve())

Saved text to: /Users/stephenhelvig/Documents/Python Projects/20th-century/20th_century_key_events_clean.txt
Saved CSV to: /Users/stephenhelvig/Documents/Python Projects/20th-century/20th_century_key_events.csv


In [6]:
events_df = pd.read_csv("20th_century_key_events.csv")

# Keep only 20th-century years (strict definition used here: 1901 to 2000 inclusive)
events_df = events_df[(events_df["year"] >= 1901) & (events_df["year"] <= 2000)].copy()

# Optional extra filter: drop obvious “archive/link” junk if it ever sneaks in
junk_pat = re.compile(r"\barchived\b|^time archives\b|research project", flags=re.IGNORECASE)
events_df = events_df[~events_df["event"].str.contains(junk_pat, na=False)]

events_df.to_csv("20th_century_key_events.csv", index=False)
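
The year bounds in the filter above depend on which definition of the century is used. The sketch below, in plain Python with invented sample rows, compares the strict definition (1901 to 2000) with the popular one (1900 to 1999) and applies the same junk pattern. The function name `keep_event` and its parameters are assumptions for illustration.

```python
import re

# Same junk pattern as in the filter cell
JUNK_RE = re.compile(r"\barchived\b|^time archives\b|research project", flags=re.IGNORECASE)

def keep_event(year: int, event: str, first_year: int = 1901, last_year: int = 2000) -> bool:
    # Keep a row only if its year is in range and its text is not link/archive residue
    return first_year <= year <= last_year and not JUNK_RE.search(event)

# Invented sample rows for illustration
sample_events = [
    (1899, "The Second Boer War began in 1899."),
    (1900, "The Boxer Rebellion escalated in 1900."),
    (1969, "Archived from the original on 2 May 1969."),
    (1989, "The Berlin Wall fell in 1989."),
    (2000, "The dot-com bubble peaked in 2000."),
]

strict_years = [y for y, e in sample_events if keep_event(y, e)]
popular_years = [y for y, e in sample_events if keep_event(y, e, first_year=1900, last_year=1999)]

print("Strict (1901 to 2000): ", strict_years)
print("Popular (1900 to 1999):", popular_years)
```

The two definitions differ only at the edges (1900 and 2000), and the archive-style line is dropped under both.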

## Bonus: Selenium DOM inspection

In [17]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import pandas as pd

CHROMEDRIVER_PATH = r"/Users/stephenhelvig/Documents/Python Projects/20th-century/chromedriver-mac-arm64/chromedriver"

driver = webdriver.Chrome(service=Service(CHROMEDRIVER_PATH), options=Options())

try:
    driver.get("https://simple.wikipedia.org/wiki/List_of_countries")

    # Main content container
    content = driver.find_element(By.CSS_SELECTOR, "#mw-content-text > div.mw-parser-output")

    # Countries section
    countries_section = content.find_element(By.CSS_SELECTOR, 'section[aria-labelledby="Countries"]')

    # Letter subsections inside Countries (A, B, C...)
    letter_sections = countries_section.find_elements(By.CSS_SELECTOR, "section[aria-labelledby]")

    rows = []

    for sec in letter_sections:
        label = sec.get_attribute("aria-labelledby")  # often "A", "B", etc.
        if not label or len(label) != 1 or not label.isalpha():
            continue

        letter = label.upper()

        # Countries are links inside paragraphs; skip flag image links (<a class="image">)
        links = sec.find_elements(By.CSS_SELECTOR, "p a:not(.image)")

        for a in links:
            name = a.text.strip()
            href = a.get_attribute("href") or ""
            title = (a.get_attribute("title") or "").strip()

            # Filter: keep actual wiki article links with visible country text
            if not name:
                continue
            if "/wiki/" not in href:
                continue
            if title.startswith(("File:", "Category:", "Special:")):
                continue

            rows.append({"letter": letter, "country": name, "url": href})

    countries_df = (
        pd.DataFrame(rows, columns=["letter", "country", "url"])  # explicit columns keep an empty result valid
        .drop_duplicates(subset=["country"])
        .sort_values(["letter", "country"])
        .reset_index(drop=True)
    )

    print("Countries captured:", len(countries_df))
    display(countries_df.head(25))

    countries_df.to_csv("simplewiki_countries.csv", index=False)
    print("Saved: simplewiki_countries.csv")

finally:
    driver.quit()

Countries captured: 195


|   | letter | country | url |
|---|--------|---------|-----|
| 0 | A | Afghanistan | https://simple.wikipedia.org/wiki/Afghanistan |
| 1 | A | Albania | https://simple.wikipedia.org/wiki/Albania |
| 2 | A | Algeria | https://simple.wikipedia.org/wiki/Algeria |
| 3 | A | Andorra | https://simple.wikipedia.org/wiki/Andorra |
| 4 | A | Angola | https://simple.wikipedia.org/wiki/Angola |
| 5 | A | Antigua and Barbuda | https://simple.wikipedia.org/wiki/Antigua_and_... |
| 6 | A | Argentina | https://simple.wikipedia.org/wiki/Argentina |
| 7 | A | Armenia | https://simple.wikipedia.org/wiki/Armenia |
| 8 | A | Australia | https://simple.wikipedia.org/wiki/Australia |
| 9 | A | Austria | https://simple.wikipedia.org/wiki/Austria |


Saved: simplewiki_countries.csv
