<a href="https://colab.research.google.com/github/swsewon3-ship-it/python-for-public-policy_2025-Fall/blob/main/Session1_APIs_WebScraping_Workbook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Session 1 Workbook — Data Collection with APIs & Web Scraping
### Text Analysis in Python for Public Policy / International Affairs


**Structure (2.5h live coding → 30m break → 60m student work)**
1. Token‑less API warm‑up (JSON & endpoints)
2. NewsAPI.org — query 'New York' → JSON → pandas DataFrame → quick clean
3. Web scraping with BeautifulSoup using the URL column
4. Ethics & troubleshooting notes

> Tip: Run cells top‑to‑bottom. If you hit network/rate‑limit issues, use the *Offline Fallback* cells provided.


## Learning Objectives
By the end of this session, you will be able to:
- Explain what an API is (base URL, endpoints, query parameters) and read JSON.
- Make a request to a public API and parse JSON into Python objects.
- Convert nested JSON into a tidy pandas `DataFrame` and do light cleaning.
- Use article URLs from API results to scrape page text with BeautifulSoup.
- Follow basic ethical guidelines for scraping (robots.txt, rate limiting, attribution).
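
To make the first objective concrete, here is a minimal sketch of how a request URL is assembled from a base URL, an endpoint, and query parameters. The host `api.example.com`, the `/v1/articles` path, and the parameter names are hypothetical values chosen only for illustration; they are not part of any API used in this workbook.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical example values, chosen only to illustrate the three parts.
base_url = "https://api.example.com"         # where the API lives
endpoint = "/v1/articles"                    # which resource we want
query_params = {"q": "New York", "page": 2}  # how we filter the request

# urlencode turns the dict into "q=New+York&page=2" (spaces are escaped).
full_url = f"{base_url}{endpoint}?{urlencode(query_params)}"
print(full_url)

# Going the other way: split a full URL back into its parts.
parts = urlsplit(full_url)
print("Host:    ", parts.netloc)
print("Endpoint:", parts.path)
print("Params:  ", parse_qs(parts.query))
```

In the cells that follow, `requests.get(url, params=...)` does this encoding for you; building the URL by hand is mainly useful for seeing what is actually sent.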


## 0) Environment Setup (Colab‑friendly)

In [None]:
# If you're in Colab, uncomment the next lines to install packages:
# !pip install requests pandas beautifulsoup4 lxml python-dotenv

# Imports used throughout the workbook
import requests            # for HTTP requests to APIs / web pages
import json                # for pretty-printing JSON
import pandas as pd        # for DataFrame work
from bs4 import BeautifulSoup  # for HTML parsing
import time                # for polite sleeping / rate-limiting
import os                  # for reading environment variables


## 1) Warm‑up: Token‑less API (Cat Facts API)
**Concepts:** base URL, endpoint, JSON, key–value pairs.

We’ll call a simple, no‑auth endpoint so we can focus on the shape of the response.
Endpoint: `https://catfact.ninja/fact` (returns one random cat fact as JSON).
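
Before calling the live endpoint, it may help to see how JSON text maps onto Python objects. The sketch below parses a hand-written string (a hypothetical payload shaped like a Cat Facts response, with two extra made-up keys) rather than a real API reply.

```python
import json

# A JSON document as it travels over the network: just text.
# (Hypothetical payload; 'verified' and 'source' are invented for this demo.)
raw_text = '{"fact": "Cats sleep a lot.", "length": 17, "verified": true, "source": null}'

# json.loads parses the text into Python objects.
data = json.loads(raw_text)

print(type(data))        # a dict
print(data["fact"])      # look up a value by its key
print(data["verified"])  # JSON true -> Python True
print(data["source"])    # JSON null -> Python None

# .get() returns a default instead of raising KeyError for a missing key.
print(data.get("author", "unknown"))

# json.dumps goes the other way: Python objects back to JSON text.
print(json.dumps(data, indent=2))
```

Note the spelling difference: JSON uses `null`, `true`, and `false`, while Python code uses `None`, `True`, and `False`.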


In [None]:
# Define the base URL (the main address of the API).
base_url = "https://catfact.ninja"
# Define the endpoint (the specific resource we want under the base URL).
endpoint = "/fact"
# Combine base URL and endpoint into a full URL.
url = f"{base_url}{endpoint}"

# Send a GET request to the server to retrieve data.
response = requests.get(url)

# Check the HTTP status code (200 means OK/success).
print("Status code:", response.status_code)

# Convert the response body from JSON text into a Python dict.
data = response.json()

# Pretty-print the JSON so we can see its structure (keys and values).
print("Raw JSON:")
print(json.dumps(data, indent=2))

# Access a value by key from the JSON (dictionary).
print("Just the fact value:")
print(data["fact"])


### Offline Fallback (if the API is down or blocked)

In [None]:
# This cell simulates the same JSON structure returned by the API.
# Use this if you're offline or the API rate-limits during class.
offline_json = {
    "fact": "Cats can rotate their ears 180 degrees.",
    "length": 39  # character count of the fact string
}
print(json.dumps(offline_json, indent=2))
print("Access 'fact':", offline_json["fact"])


## 2) Real‑world API: NewsAPI.org — `everything` endpoint
We’ll search for the term **“New York”**, retrieve results in JSON, then convert to a pandas `DataFrame`.

> **Setup:** You need a free API key from https://newsapi.org/  
> **Security tip:** Store your key as an environment variable (e.g., `NEWSAPI_KEY`) or use `python-dotenv`.
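
One way to follow the security tip is sketched below. `load_api_key` is a hypothetical helper name, and the demo uses a throwaway variable called `DEMO_API_KEY` with a fake value so that it runs without a real key.

```python
import os

def load_api_key(env_var="NEWSAPI_KEY"):
    """Return the API key stored in an environment variable, or raise a clear error."""
    key = os.getenv(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Environment variable {env_var} is not set. "
            "Set it before running the request cells."
        )
    return key

# Demo only: a fake value under a throwaway name, so no real key is touched.
os.environ["DEMO_API_KEY"] = "fake-key-for-demo"
print("Key loaded, length:", len(load_api_key("DEMO_API_KEY")))
```

If you keep the key in a `.env` file instead, calling `load_dotenv()` from `python-dotenv` before `load_api_key()` copies the file's entries into the environment.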


In [None]:
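# Store your API key in a variable. Reading it from an environment variable
# keeps the key out of the notebook; the placeholder is used only if it is unset.
NEWSAPI_KEY = os.getenv("NEWSAPI_KEY", "YOUR_API_KEY_HERE")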

# Define the endpoint and query parameters.
news_url = "https://newsapi.org/v2/everything"
params = {
    "q": "New York",   # search query
    "language": "en",  # restrict to English
    "pageSize": 25,    # number of results per page (max 100)
    "sortBy": "relevancy",  # or 'publishedAt' for recency
    "apiKey": NEWSAPI_KEY   # your API key
}

# Make the request with parameters to the NewsAPI endpoint.
news_resp = requests.get(news_url, params=params)

# Inspect status code to ensure the request worked.
print("Status code:", news_resp.status_code)

# Convert to Python objects (dict) from JSON.
news_data = news_resp.json()

# Sanity check: print the top-level keys.
print("Top-level keys:", list(news_data.keys()))

# Inspect the first article's keys to understand the structure.
if news_data.get("articles"):
    print("Article keys:", list(news_data["articles"][0].keys()))
else:
    print("No articles returned. Check your API key or params.")
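
When a request fails, the JSON body usually says why. The sketch below assumes the error payload carries `status`, `code`, and `message` keys, which is how NewsAPI errors are commonly shaped; treat those key names as an assumption and check them against a real response. `extract_articles` is a hypothetical helper, and both payloads are hand-written.

```python
def extract_articles(payload):
    """Return the article list from a NewsAPI-style payload, or raise with the reported reason."""
    if payload.get("status") != "ok":
        code = payload.get("code", "unknown")
        message = payload.get("message", "no message provided")
        raise RuntimeError(f"API error ({code}): {message}")
    return payload.get("articles", [])

# Hypothetical payloads, hand-written for illustration.
ok_payload = {"status": "ok", "totalResults": 1, "articles": [{"title": "Example"}]}
error_payload = {"status": "error", "code": "apiKeyInvalid", "message": "Your API key is invalid."}

print(len(extract_articles(ok_payload)), "article(s)")

try:
    extract_articles(error_payload)
except RuntimeError as err:
    print(err)
```

Calling `extract_articles(news_data)` right after `news_resp.json()` would surface a bad key or a rate limit immediately, instead of producing an empty DataFrame later.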


### Convert JSON → pandas DataFrame & Quick Cleaning

In [None]:
# Convert the list of articles (list of dicts) into a DataFrame.
articles = news_data.get("articles", [])
df = pd.DataFrame(articles)

# Show the first few rows to confirm shape and columns.
print("DataFrame shape:", df.shape)
df.head()


In [None]:
# Select a subset of useful columns for our analysis.
keep_cols = ["source", "author", "title", "description", "url", "publishedAt"]
df = df[keep_cols].copy()  # .copy() avoids a SettingWithCopyWarning when adding columns below

# 'source' is a nested dict; flatten it to a simple string (source name).
# We create a new column 'source_name' with the inner 'name' field.
df["source_name"] = df["source"].apply(lambda d: d.get("name") if isinstance(d, dict) else None)

# Drop the original nested 'source' column now that we've extracted the name.
df = df.drop(columns=["source"])

# Convert 'publishedAt' to a proper datetime type for easier filtering/sorting.
df["publishedAt"] = pd.to_datetime(df["publishedAt"], errors="coerce")

# Drop rows with missing URLs or titles (these are critical for scraping/analysis).
df = df.dropna(subset=["url", "title"]).reset_index(drop=True)

# Sort by recency to bring the newest items to the top.
df = df.sort_values("publishedAt", ascending=False).reset_index(drop=True)

# Display a tidy preview
df.head(10)
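
As an alternative to flattening `source` by hand, `pd.json_normalize` can expand nested dicts into columns in one step. This is a minimal sketch on two hand-written articles (hypothetical data shaped like the NewsAPI response), not a replacement for the cleaning cell above.

```python
import pandas as pd

# Hypothetical articles shaped like the NewsAPI response (nested 'source' dict).
sample_articles = [
    {"source": {"id": None, "name": "Example Times"}, "title": "Ferry service expands",
     "url": "https://www.example.com/a", "publishedAt": "2025-10-20T10:00:00Z"},
    {"source": {"id": None, "name": "Policy Daily"}, "title": "Zoning reform debate",
     "url": "https://www.example.com/b", "publishedAt": "2025-10-19T09:30:00Z"},
]

# json_normalize expands nested dicts into columns; sep="_" yields 'source_name'.
flat = pd.json_normalize(sample_articles, sep="_")
flat["publishedAt"] = pd.to_datetime(flat["publishedAt"], errors="coerce")

print(flat.columns.tolist())
print(flat[["source_name", "title", "publishedAt"]])
```

The hand-written `apply` version is easier to read when only one nested field matters; `json_normalize` tends to pay off when several nested fields need flattening at once.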


### Offline Fallback for NewsAPI (sample payload)

In [None]:
# Use a small, hard-coded sample if the API call fails/limits.
# Note: JSON's `null` must be written as `None` in a Python dict literal.
offline_news = {
  "status": "ok",
  "totalResults": 2,
  "articles": [
    {
      "source": {
        "id": None,
        "name": "Example Times"
      },
      "author": "Jane Doe",
      "title": "New York expands ferry service for commuters",
      "description": "City officials announce new routes and schedules.",
      "url": "https://www.example.com/ny-ferry",
      "publishedAt": "2025-10-20T10:00:00Z"
    },
    {
      "source": {
        "id": None,
        "name": "Policy Daily"
      },
      "author": "John Smith",
      "title": "Housing advocates push for zoning reform in New York",
      "description": "Debate intensifies over upzoning proposals.",
      "url": "https://www.example.com/ny-zoning",
      "publishedAt": "2025-10-19T09:30:00Z"
    }
  ]
}

offline_df = pd.DataFrame(offline_news["articles"])
offline_df["source_name"] = offline_df["source"].apply(lambda d: d.get("name") if isinstance(d, dict) else None)
offline_df = offline_df.drop(columns=["source"])
offline_df["publishedAt"] = pd.to_datetime(offline_df["publishedAt"], errors="coerce")
offline_df = offline_df.dropna(subset=["url", "title"]).sort_values("publishedAt", ascending=False).reset_index(drop=True)
offline_df


## 3) Web Scraping with BeautifulSoup (from the URL column)
**Goal:** Given an article URL, fetch the web page and extract the textual content (paragraphs).

> **Important:** Real news sites often have paywalls or dynamic content loaded by JavaScript. For teaching, start with any URL that returns visible `<p>` text. Otherwise, use the **offline fallback** cell.
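
To see why some pages yield no text, the sketch below runs the same `<p>` extraction on two hand-written HTML strings: one with the article in the markup, and one that only contains a placeholder for JavaScript to fill in. Both snippets and the `extract_paragraphs` helper are hypothetical and need no network access.

```python
from bs4 import BeautifulSoup

# Two hand-written HTML snippets (hypothetical pages, no network needed).
static_html = """
<html><body>
  <nav><p>Home | Politics | Subscribe</p></nav>
  <article>
    <p>City officials announced new ferry routes on Monday.</p>
    <p>Service begins next spring.</p>
  </article>
</body></html>
"""
js_rendered_html = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'

def extract_paragraphs(html):
    """Return the non-empty text of every <p> tag in the HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    return [p.get_text(strip=True) for p in soup.find_all("p") if p.get_text(strip=True)]

print(extract_paragraphs(static_html))       # includes the nav text as well as the article
print(extract_paragraphs(js_rendered_html))  # empty: the content is not in the HTML itself

# Narrowing the search to <article> drops the navigation paragraph.
article = BeautifulSoup(static_html, "html.parser").find("article")
print([p.get_text(strip=True) for p in article.find_all("p")])
```

An empty result from a live URL is therefore a hint that the page is JavaScript-rendered or paywalled, not necessarily that the code is wrong.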


In [None]:
# Choose a URL to scrape: try the live df first, and fall back to offline_df if it is empty.
candidate_df = df if not df.empty else offline_df
article_url = candidate_df.loc[0, "url"]
print("Scraping URL:", article_url)

# Send a GET request to retrieve the raw HTML of the page.
page_resp = requests.get(article_url, timeout=15)

# Create a BeautifulSoup object to parse the HTML document.
soup = BeautifulSoup(page_resp.text, "html.parser")

# Find all paragraph tags <p> and extract text from each.
paragraphs = soup.find_all("p")

# Use a list comprehension to strip whitespace and only keep non-empty paragraphs.
para_text = [p.get_text(strip=True) for p in paragraphs if p.get_text(strip=True)]

# Join paragraphs into a single string for quick inspection (limit output length).
full_text = " ".join(para_text)
print("First 1000 characters of extracted text:\n")
print(full_text[:1000])


### (Optional) More robust scraping: headers + polite delay

In [None]:
# Some sites block default Python requests; set a user-agent header to look like a browser.
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36"
}

# Example loop to scrape first N article URLs with a polite delay.
N = min(3, len(candidate_df))  # limit to 3 for class demo
texts = []

for i in range(N):
    url_i = candidate_df.loc[i, "url"]
    print(f"Fetching ({i+1}/{N}):", url_i)
    try:
        r = requests.get(url_i, headers=headers, timeout=15)
        soup = BeautifulSoup(r.text, "html.parser")
        paras = [p.get_text(strip=True) for p in soup.find_all("p")]
        text_i = " ".join([t for t in paras if t])
        texts.append(text_i)
        # Be polite and avoid hammering servers.
        time.sleep(1.0)
    except Exception as e:
        print("Error fetching:", e)
        texts.append("")

# Add scraped text as a new column aligned to the first N rows.
candidate_df = candidate_df.copy()
candidate_df.loc[:N-1, "scraped_text"] = texts
candidate_df.head(N)


## 4) Save Scraped Text to Google Drive as `.txt` Files (Colab)
This section lets you export each row's `scraped_text` into a separate `.txt` file on Google Drive.

**Workflow**
1. Mount Drive (Colab)
2. Choose (or create) a destination folder in your Drive
3. Iterate through the DataFrame, sanitize filenames, and write `.txt` files

> If you're running **locally** (not in Colab), skip the mount cell and set `base_dir` to a local path (e.g., `./exports`).

In [None]:
# --- Colab-only: Mount Google Drive ---
# If running in Colab, uncomment the two lines below.
# from google.colab import drive
# drive.mount('/content/drive')

# Choose a Drive folder (adjust this path). If running locally, use a local path instead.
# Example for Colab:
base_dir = "/content/drive/MyDrive/text-analysis/session1_articles"
# Example local fallback:
# base_dir = "./exports"

import os
os.makedirs(base_dir, exist_ok=True)
print("Saving .txt files to:", base_dir)

In [None]:
# --- Export each row's scraped_text to a .txt file ---
import os
import re
import pandas as pd

def slugify(text, max_len=80):
    """Create a filesystem-safe slug from any text."""
    if not isinstance(text, str) or not text.strip():
        return "untitled"
    text = text.lower()
    text = re.sub(r"[^a-z0-9]+", "-", text)
    text = re.sub(r"-{2,}", "-", text).strip("-")
    return text[:max_len] if text else "untitled"

# Pick the first non-empty DataFrame among common variables created earlier.
df_source = None
for name in ["df_topic", "candidate_df", "df", "offline_df"]:
    if name in globals():
        _df = globals()[name]
        if isinstance(_df, pd.DataFrame) and not _df.empty:
            df_source = _df
            print(f"Using DataFrame: {name} with shape {_df.shape}")
            break

if df_source is None:
    raise ValueError("No DataFrame available. Run earlier cells to create df_topic/df/candidate_df/offline_df.")

if "scraped_text" not in df_source.columns:
    print("`scraped_text` column not found. You may need to run the scraping cells first.")
else:
    used_names = set()
    saved = 0
    for i, row in df_source.iterrows():
        text = row.get("scraped_text", "")
        if not isinstance(text, str) or not text.strip():
            continue  # skip empty text rows

        # Build a filename using date + title/url slug for disambiguation.
        title = row.get("title") if "title" in df_source.columns else None
        url = row.get("url") if "url" in df_source.columns else None

        # Try to extract a date for filename
        date_str = ""
        if "publishedAt" in df_source.columns and pd.notna(row.get("publishedAt")):
            try:
                date_str = pd.to_datetime(row["publishedAt"]).strftime("%Y%m%d")
            except Exception:
                date_str = ""

        base_name = slugify(title) if title else (slugify(url) if url else f"article-{i}")
        fname = f"{date_str + '_' if date_str else ''}{base_name}.txt"

        # Ensure uniqueness
        original_fname = fname
        k = 2
        while fname in used_names or os.path.exists(os.path.join(base_dir, fname)):
            fname = original_fname.replace(".txt", f"_{k}.txt")
            k += 1
        used_names.add(fname)

        # Write file (UTF-8)
        path = os.path.join(base_dir, fname)
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)

        saved += 1
        if saved <= 5:  # print only the first few to keep output tidy
            print("Saved:", path)
    print(f"Done. Saved {saved} text files to {base_dir}.")

## 5) Ethics, Legality, and Troubleshooting
- **robots.txt**: Check the site’s crawling policy before scraping. The file is advisory rather than technically enforced, so also read and follow the site’s terms of service.
- **Rate limiting**: Sleep between requests; don’t parallelize aggressively.
- **Attribution**: Cite sources when using scraped content in reports.
- **Paywalls / JS‑rendered sites**: Some pages need tools like `selenium` or `requests_html`. Use sparingly and ethically.
- **Stability**: News sites change their HTML; write resilient, minimal selectors (e.g., `find_all("p")` as a start).
- **Alternatives**: Prefer official APIs when available (structured, stable, legally safer).
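
The robots.txt and rate-limiting points can be checked in code with the standard library's `urllib.robotparser`. The sketch below parses a hand-written robots.txt locally; the rules, the `policy-class-bot` user-agent, and the example URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# A hand-written robots.txt (hypothetical rules, parsed locally with no network call).
robots_lines = """
User-agent: *
Disallow: /private/
Crawl-delay: 2
""".strip().splitlines()

parser = RobotFileParser()
parser.parse(robots_lines)

# Hypothetical user-agent string that identifies the scraper honestly.
user_agent = "policy-class-bot"

for url in ["https://www.example.com/news/ferry", "https://www.example.com/private/draft"]:
    print(url, "->", "allowed" if parser.can_fetch(user_agent, url) else "disallowed")

# crawl_delay returns the requested pause in seconds, or None if the file sets none.
delay = parser.crawl_delay(user_agent) or 1.0
print("Seconds to sleep between requests:", delay)
```

For a live site, `parser.set_url(".../robots.txt")` followed by `parser.read()` fetches the real file, and `time.sleep(delay)` inside the scraping loop applies the pause.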


## 6) Mini‑Project (60 min post‑break)
**Choose a topic** (e.g., *housing policy*, *AI regulation*, *public transit*, *Ukraine*) and:

1. Modify the NewsAPI query to fetch ~25 English articles from the last week.
2. Convert to a tidy `DataFrame`, keep: `source_name`, `title`, `description`, `url`, `publishedAt`.
3. Scrape the first 2–3 article URLs and add a `scraped_text` column.
4. Save your work to CSV: `results_<topic>.csv`.
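
For step 1, one possible way to limit results to the last week is to pass a `from` date to the `everything` endpoint. The sketch below only builds the parameter dict and makes no request; the topic and the placeholder key are hypothetical, and the exact date formats accepted should be confirmed in the NewsAPI documentation.

```python
from datetime import datetime, timedelta, timezone

topic = "housing policy"  # hypothetical topic for illustration

# Compute the date seven days ago in UTC, formatted as YYYY-MM-DD.
one_week_ago = (datetime.now(timezone.utc) - timedelta(days=7)).date().isoformat()

params = {
    "q": topic,
    "language": "en",
    "pageSize": 25,
    "sortBy": "publishedAt",
    "from": one_week_ago,           # oldest publication date to include
    "apiKey": "YOUR_API_KEY_HERE",  # placeholder; load your real key from the environment
}
print(params)
```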

> **Stretch goal:** Use `.str.len()` on `scraped_text` to identify the fullest articles; compute basic stats.
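
A minimal sketch of the stretch goal, using a small hand-made DataFrame (hypothetical titles and text) in place of real scraped results:

```python
import pandas as pd

# Hypothetical scraped results; None stands in for a failed scrape.
demo = pd.DataFrame({
    "title": ["Ferry service expands", "Zoning reform debate", "Transit budget vote"],
    "scraped_text": ["Short text.", "A much longer article body " * 20, None],
})

# .str.len() counts characters per row; missing text becomes NaN, so fill with 0.
demo["text_len"] = demo["scraped_text"].str.len().fillna(0).astype(int)

# Rank articles from fullest to emptiest and summarize the lengths.
print(demo.sort_values("text_len", ascending=False)[["title", "text_len"]])
print(demo["text_len"].describe())
```

Character counts are a rough proxy for completeness: a very small value often means a paywall notice or cookie banner was scraped instead of the article.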


In [None]:
# Starter scaffold for the mini-project (students edit this cell).
topic = "New York"  # ← change to your chosen topic
NEWSAPI_KEY = os.getenv("NEWSAPI_KEY", "YOUR_API_KEY_HERE")

news_url = "https://newsapi.org/v2/everything"
params = {
    "q": topic,
    "language": "en",
    "pageSize": 25,
    "sortBy": "publishedAt",
    "apiKey": NEWSAPI_KEY
}
resp = requests.get(news_url, params=params)
data = resp.json()
df_topic = pd.DataFrame(data.get("articles", []))
if not df_topic.empty:
    df_topic["source_name"] = df_topic["source"].apply(lambda d: d.get("name") if isinstance(d, dict) else None)
    df_topic = df_topic.drop(columns=["source"])
    df_topic["publishedAt"] = pd.to_datetime(df_topic["publishedAt"], errors="coerce")
    df_topic = df_topic.dropna(subset=["url", "title"]).sort_values("publishedAt", ascending=False).reset_index(drop=True)

    # Scrape first 3 URLs
    headers = {"User-Agent": "Mozilla/5.0"}
    texts = []
    for i in range(min(3, len(df_topic))):
        u = df_topic.loc[i, "url"]
        try:
            r = requests.get(u, headers=headers, timeout=15)
            s = BeautifulSoup(r.text, "html.parser")
            paras = [p.get_text(strip=True) for p in s.find_all("p")]
            texts.append(" ".join([t for t in paras if t]))
            time.sleep(1.0)
        except Exception as e:
            print("Error:", e)
            texts.append("")
    df_topic.loc[:len(texts)-1, "scraped_text"] = texts

    # Save to CSV
    out_name = f"results_{topic.replace(' ', '_').lower()}.csv"
    df_topic.to_csv(out_name, index=False)
    print("Saved:", out_name)
    # A bare expression inside an if-block is not auto-displayed, so print it.
    print(df_topic.head())
else:
    print("No results — check your API key or query.")


## Appendix: Common Errors & Fixes

In [None]:
# 1) If you get a 401 error from NewsAPI -> invalid/expired API key.
#    Fix: double-check your key, or set it explicitly:
# os.environ['NEWSAPI_KEY'] = 'PASTE_YOUR_KEY_HERE'

# 2) If scraping returns empty text:
#    - Try adding headers with a real user-agent.
#    - Try a different URL (some pages are behind paywalls or JS-rendered).
#    - Verify that <p> tags exist by printing a snippet of soup:
# print(soup.prettify()[:1500])

# 3) If you see Unicode errors when saving CSV:
# df.to_csv("file.csv", index=False, encoding="utf-8")

# 4) If you need only recent articles, filter by date:
# cutoff = pd.Timestamp.now(tz="UTC") - pd.Timedelta(days=7)
# df = df[df['publishedAt'] >= cutoff]
