
# Assignment 6 (4 points) — Web Scraping

In this assignment you will complete **two questions**. The **deadline is posted on Canvas**.


## Assignment Guide (Read Me First)

- This notebook provides an **Install Required Libraries** cell and a **Common Imports & Polite Headers** cell. Run them first.
- Each question includes a **skeleton**. The skeleton is **not** a solution; it is a lightweight scaffold you may reuse.
- Under each skeleton you will find a **“Write your answer here”** code cell. Implement your scraping, cleaning, and saving logic there.
- When your code is complete, run the **Runner** cell to print a Top‑15 preview and save the CSV.
- Expected outputs:
  - **Q1:** `data_q1.csv` + Top‑15 sorted by the specified numeric column.
  - **Q2:** `data_q2.csv` + Top‑15 sorted by `points`.
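
The Runner cell itself is not reproduced here. As a rough sketch of the save-and-preview step it describes, assuming a hypothetical helper named `preview_and_save` (not part of this notebook) and a made-up table in place of scraped data:

```python
import os
import tempfile

import pandas as pd


def preview_and_save(df: pd.DataFrame, sort_col: str, csv_path: str, top: int = 15) -> pd.DataFrame:
    """Hypothetical helper: save the full table to CSV and return a Top-N preview."""
    # Save every cleaned row, not just the preview.
    df.to_csv(csv_path, index=False)
    # Sort descending by the chosen numeric column and keep the first `top` rows.
    return df.sort_values(sort_col, ascending=False).head(top).reset_index(drop=True)


# Tiny made-up table standing in for scraped data.
demo = pd.DataFrame({"title": ["a", "b", "c"], "points": [5, 42, 17]})
demo_path = os.path.join(tempfile.mkdtemp(), "data_demo.csv")
preview = preview_and_save(demo, "points", demo_path, top=2)
print(preview)
```

The CSV holds all rows while the preview holds only the Top-N, which matches the two expected outputs listed above.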


In [2]:
#Install Required Libraries
!pip -q install requests beautifulsoup4 lxml pandas
print("Dependencies installed.")


Dependencies installed.


### 2) Common Imports & Polite Headers

In [3]:
# Common Imports & Polite Headers
import re, sys, pandas as pd, requests
from bs4 import BeautifulSoup
HEADERS = {"User-Agent": (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/122.0 Safari/537.36")}
def fetch_html(url: str, timeout: int = 20) -> str:
    r = requests.get(url, headers=HEADERS, timeout=timeout)
    r.raise_for_status()
    return r.text
def flatten_headers(df: pd.DataFrame) -> pd.DataFrame:
    if isinstance(df.columns, pd.MultiIndex):
        df.columns = [" ".join([str(x) for x in tup if str(x)!="nan"]).strip()
                      for tup in df.columns.values]
    else:
        df.columns = [str(c).strip() for c in df.columns]
    return df
print("Common helpers loaded.")


Common helpers loaded.


## Question 1 — IBAN Country Codes (table)
**URL:** https://www.iban.com/country-codes  
**Extract at least:** `Country`, `Alpha-2`, `Alpha-3`, `Numeric` (≥4 cols; you may add more)  
**Clean:** trim spaces; `Alpha-2/Alpha-3` → **UPPERCASE**; `Numeric` → **int** (nullable OK)  
**Output:** write **`data_q1.csv`** and **print a Top-15** sorted by `Numeric` (desc, no charts)  
**Deliverables:** notebook + `data_q1.csv` + short `README.md` (URL, steps, 1 limitation)
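
A minimal sketch of the cleaning rules on made-up rows, assuming the columns already carry the four target names. The `Nowhere` row is invented to show how an unparseable `Numeric` value is handled:

```python
import pandas as pd

# Illustrative sample, not scraped data.
raw = pd.DataFrame({
    "Country": ["  Chile ", "Fiji", "Nowhere"],
    "Alpha-2": ["cl", "fj", "zz"],
    "Alpha-3": ["chl", "fji", "zzz"],
    "Numeric": ["152", "242", "n/a"],
})

clean = raw.copy()
# Trim whitespace in the text columns.
for col in ["Country", "Alpha-2", "Alpha-3"]:
    clean[col] = clean[col].astype("string").str.strip()
# Uppercase the two alpha codes.
for col in ["Alpha-2", "Alpha-3"]:
    clean[col] = clean[col].str.upper()
# Unparseable values become <NA>; "Int64" (capital I) is pandas' nullable integer dtype.
clean["Numeric"] = pd.to_numeric(clean["Numeric"], errors="coerce").astype("Int64")

print(clean.sort_values("Numeric", ascending=False))
```

The nullable `Int64` dtype is what lets the column stay integer-typed even when a value is missing.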

**Tip:** You can use `pandas.read_html(html)` to read tables and then pick one with ≥3 columns.
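
To illustrate the tip, here is a small sketch that runs `pandas.read_html` on made-up HTML rather than the live page. The header names in the sample are an assumption about the page's layout, and the call needs an HTML parser such as `lxml` installed:

```python
from io import StringIO

import pandas as pd

# Made-up HTML: a one-column layout table followed by the table of interest.
sample_html = """
<table><tr><th>Note</th></tr><tr><td>ignore me</td></tr></table>
<table>
  <tr><th>Country</th><th>Alpha-2 code</th><th>Alpha-3 code</th><th>Numeric</th></tr>
  <tr><td>Chile</td><td>CL</td><td>CHL</td><td>152</td></tr>
  <tr><td>Fiji</td><td>FJ</td><td>FJI</td><td>242</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table>. Wrapping the string
# in StringIO avoids the deprecation warning newer pandas versions emit for literal HTML.
tables = pd.read_html(StringIO(sample_html))
# Take the first table that has at least 3 columns.
wide = next(t for t in tables if t.shape[1] >= 3)
print(wide)
```

Filtering by column count is a simple way to skip small layout tables, but it is only a heuristic; a page with several wide tables would need a stricter check.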


In [None]:
# --- Q1 Skeleton (fill the TODOs) ---
def q1_read_table(html: str) -> pd.DataFrame:
    """Return the first table with >= 3 columns from the HTML.
    TODO: implement with pd.read_html(html), pick a reasonable table, then flatten headers.
    """
    raise NotImplementedError("TODO: implement q1_read_table")

def q1_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Clean columns: strip, UPPER Alpha-2/Alpha-3, cast Numeric to int (nullable), drop invalids.
    TODO: implement cleaning steps.
    """
    raise NotImplementedError("TODO: implement q1_clean")

def q1_sort_top(df: pd.DataFrame, top: int = 15) -> pd.DataFrame:
    """Sort descending by Numeric and return Top-N.
    TODO: implement.
    """
    raise NotImplementedError("TODO: implement q1_sort_top")


In [6]:
# Q1: Write your answer here

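from io import StringIO

def q1_read_table(html: str) -> pd.DataFrame:
    """Return the first table with >= 3 columns from the HTML."""
    # Wrap the markup in StringIO (newer pandas versions deprecate passing literal HTML).
    # keep_default_na=False keeps the Alpha-2 code "NA" (Namibia) as text instead of NaN.
    tables = pd.read_html(StringIO(html), keep_default_na=False)
    if not tables:
        raise ValueError("No tables found.")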

    for t in tables:
        if t.shape[1] >= 3:
            df = t.copy()
            break
    else:
        raise ValueError("No table with >= 3 columns found.")

    return flatten_headers(df)


def q1_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Clean columns: strip, UPPER Alpha-2/Alpha-3, cast Numeric to int (nullable), drop invalids.
    Returns at least Country, Alpha-2, Alpha-3, Numeric."""

    def _norm(s):
        # Lowercase and drop spaces, hyphens, and underscores, e.g. "Alpha-2 code" -> "alpha2code".
        return re.sub(r"[\s_-]+", "", str(s)).lower()

    # Map source headers to the four target names by normalized prefix.
    name_map = {}
    for c in df.columns:
        n = _norm(c)
        if n.startswith("country"):
            name_map[c] = "Country"
        elif n.startswith("alpha2"):
            name_map[c] = "Alpha-2"
        elif n.startswith("alpha3"):
            name_map[c] = "Alpha-3"
        elif n.startswith("numeric"):
            name_map[c] = "Numeric"
    df = df.rename(columns=name_map)

    required = ["Country", "Alpha-2", "Alpha-3", "Numeric"]
    missing = [c for c in required if c not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")

    out = df[required].copy()

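    # Trim text columns; the "string" dtype keeps missing values as <NA> instead of "nan"
    for c in ["Country", "Alpha-2", "Alpha-3"]:
        out[c] = out[c].astype("string").str.strip()

    # Uppercase the alpha codes
    for c in ["Alpha-2", "Alpha-3"]:
        out[c] = out[c].str.upper()

    # Cast Numeric to a nullable integer; unparseable values become <NA>
    out["Numeric"] = pd.to_numeric(out["Numeric"], errors="coerce").astype("Int64")

    # Drop rows that are missing any required text field
    out = out.dropna(subset=["Country", "Alpha-2", "Alpha-3"]).reset_index(drop=True)
    return out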

def q1_sort_top(df: pd.DataFrame, top: int = 15) -> pd.DataFrame:
    """Sort descending by Numeric and return the Top-N rows."""
    if "Numeric" not in df.columns:
        raise ValueError("Numeric column not found.")
    return df.sort_values("Numeric", ascending=False).head(top).reset_index(drop=True)




## Question 2 — Hacker News (front page)
**URL:** https://news.ycombinator.com/  
**Extract at least:** `rank`, `title`, `link`, `points`, `comments` (user optional)  
**Clean:** cast `points`/`comments`/`rank` → **int** (non-digits → 0), fill missing text fields  
**Output:** write **`data_q2.csv`** and **print a Top-15** sorted by `points` (desc, no charts)  
**Tip:** Each story is a `.athing` row; details (points/comments/user) are in the next `<tr>` with `.subtext`.
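
The row pairing described in the tip can be sketched on made-up markup. The HTML below only mimics the structure named above and is not copied from the live page; `html.parser` is used so the example has no dependency beyond `beautifulsoup4`:

```python
from bs4 import BeautifulSoup

# Made-up markup: each story row is followed by a metadata row.
sample_html = """
<table>
  <tr class="athing"><td><span class="rank">1.</span></td>
      <td><span class="titleline"><a href="https://example.com/a">First story</a></span></td></tr>
  <tr><td class="subtext"><span class="score">120 points</span>
      by <a class="hnuser">alice</a> | <a href="item?id=1">45 comments</a></td></tr>
  <tr class="athing"><td><span class="rank">2.</span></td>
      <td><span class="titleline"><a href="item?id=2">Job post</a></span></td></tr>
  <tr><td class="subtext">2 hours ago</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
rows = []
for story in soup.select("tr.athing"):
    title_a = story.select_one(".titleline a")
    # The metadata lives in the <tr> immediately after the story row.
    meta = story.find_next_sibling("tr")
    score = meta.select_one(".subtext .score") if meta else None
    rows.append({
        "title": title_a.get_text(strip=True) if title_a else "",
        # Some rows in this sample have no score element at all.
        "score_text": score.get_text(strip=True) if score else "",
    })
print(rows)
```

Guarding every lookup with an `if ... else` fallback keeps one malformed row from aborting the whole parse.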


In [None]:
# --- Q2 Skeleton (fill the TODOs) ---
def q2_parse_items(html: str) -> pd.DataFrame:
    """Parse front page items into DataFrame columns:
       rank, title, link, points, comments, user (optional).
    TODO: implement with BeautifulSoup on '.athing' and its sibling '.subtext'.
    """
    raise NotImplementedError("TODO: implement q2_parse_items")

def q2_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Clean numeric fields and fill missing values.
    TODO: cast points/comments/rank to int (non-digits -> 0). Fill text fields.
    """
    raise NotImplementedError("TODO: implement q2_clean")

def q2_sort_top(df: pd.DataFrame, top: int = 15) -> pd.DataFrame:
    """Sort by points desc and return Top-N. TODO: implement."""
    raise NotImplementedError("TODO: implement q2_sort_top")
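
The "non-digits -> 0" rule in the skeleton can be sketched on made-up values. The sample rows are illustrative, not scraped:

```python
import pandas as pd

# Made-up scraped values, including text that is not a number.
raw = pd.DataFrame({
    "rank": ["1", "2", "3"],
    "points": ["120", "n/a", "7"],
    "comments": ["45", "discuss", None],
    "title": ["First story", None, "  Third story "],
})

clean = raw.copy()
for col in ["rank", "points", "comments"]:
    # errors="coerce" turns anything unparseable into NaN, which fillna(0) then replaces.
    clean[col] = pd.to_numeric(clean[col], errors="coerce").fillna(0).astype(int)
# Missing text becomes an empty string before trimming.
clean["title"] = clean["title"].fillna("").astype(str).str.strip()

print(clean.sort_values("points", ascending=False))
```

Filling with 0 before the cast matters because a plain `int` column cannot hold NaN.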


In [8]:
# Q2 — Write your answer here

import re
import pandas as pd
from bs4 import BeautifulSoup
from urllib.parse import urljoin

_HN_BASE = "https://news.ycombinator.com/"

def _to_int_safe(text: str) -> int:
    """Extract the last integer in text; return 0 if none."""
    if text is None:
        return 0
    nums = re.findall(r"\d+", str(text))
    return int(nums[-1]) if nums else 0

def q2_parse_items(html: str) -> pd.DataFrame:
    """Parse front page items into DataFrame columns:
       rank, title, link, points, comments, user (optional)."""

    soup = BeautifulSoup(html, "lxml")
    items = []

    for athing in soup.select("tr.athing"):
        rank_el = athing.select_one(".rank")
        rank = _to_int_safe(rank_el.get_text(strip=True) if rank_el else None)

        title_a = athing.select_one(".titleline a")
        title = title_a.get_text(strip=True) if title_a else ""
        link = urljoin(_HN_BASE, title_a["href"]) if title_a and title_a.has_attr("href") else ""

        # subtext row
        sub = athing.find_next_sibling("tr")
        points = 0
        comments = 0
        user = ""

        if sub:
            st = sub.select_one(".subtext")
            if st:
                score = st.select_one(".score")
                points = _to_int_safe(score.get_text(strip=True) if score else None)

                c_text = ""
                for a2 in st.find_all("a"):
                    t = a2.get_text(strip=True).lower()
                    if "comment" in t:
                        c_text = t
                comments = _to_int_safe(c_text)

                u = st.select_one(".hnuser")
                user = u.get_text(strip=True) if u else ""

        items.append({
            "rank": rank,
            "title": title or "",
            "link": link or "",
            "points": points,
            "comments": comments,
            "user": user or "",
        })

    return pd.DataFrame(items)

def q2_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Clean numeric fields and fill missing values."""
    out = df.copy()

    #columns
    for c in ["points", "comments", "rank"]:
        if c not in out.columns:
            out[c] = 0
    for c in ["title", "link", "user"]:
        if c not in out.columns:
            out[c] = ""

    #numerics
    out["rank"] = pd.to_numeric(out["rank"], errors="coerce").fillna(0).astype(int)
    out["points"] = pd.to_numeric(out["points"], errors="coerce").fillna(0).astype(int)
    out["comments"] = pd.to_numeric(out["comments"], errors="coerce").fillna(0).astype(int)

    #fill/strip text fields
    for c in ["title", "link", "user"]:
        out[c] = out[c].fillna("").astype(str).str.strip()

    return out

def q2_sort_top(df: pd.DataFrame, top: int = 15) -> pd.DataFrame:
    """Sort by points desc and return Top-N."""
    if "points" not in df.columns:
        raise ValueError("Column 'points' not found in the DataFrame.")
    return df.sort_values("points", ascending=False).head(top).reset_index(drop=True)