
# Assignment 6 (4 points) — Web Scraping

In this assignment you will complete **two questions**. The **deadline is posted on Canvas**.


## Assignment Guide (Read Me First)

- This notebook provides an **Install Required Libraries** cell and a **Common Imports & Polite Headers** cell. Run them first.
- Each question includes a **skeleton**. The skeleton is **not** a solution; it is a lightweight scaffold you may reuse.
- Under each skeleton you will find a **“Write your answer here”** code cell. Implement your scraping, cleaning, and saving logic there.
- When your code is complete, run the **Runner** cell to print a Top‑15 preview and save the CSV.
- Expected outputs:
  - **Q1:** `data_q1.csv` + Top‑15 sorted by the specified numeric column.
  - **Q2:** `data_q2.csv` + Top‑15 sorted by `points`.
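
The save-and-preview step described above can be sketched as a single helper. This is only an illustration: `save_and_preview`, the demo frame, and the temporary output path are hypothetical names and values, not part of the provided skeletons.

```python
import os
import tempfile

import pandas as pd


def save_and_preview(df, path, sort_col, top=15):
    """Write df to CSV without the index, then return the top rows by sort_col (descending)."""
    df.to_csv(path, index=False)
    return df.sort_values(sort_col, ascending=False).head(top).reset_index(drop=True)


# Tiny made-up frame, used only to show the call pattern.
demo = pd.DataFrame({"title": ["a", "b", "c"], "points": [5, 50, 20]})
out_path = os.path.join(tempfile.mkdtemp(), "data_demo.csv")  # hypothetical file name
preview = save_and_preview(demo, out_path, "points", top=2)
print(preview)
```

In the notebook itself, the same call would take the cleaned frame, the required file name (`data_q1.csv` or `data_q2.csv`), and the sort column named in the question.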


In [None]:
# 1) Install Required Libraries
!pip -q install requests beautifulsoup4 lxml pandas
print("Dependencies installed.")


### 2) Common Imports & Polite Headers

In [None]:
# Common Imports & Polite Headers
import re, sys, pandas as pd, requests
from bs4 import BeautifulSoup
HEADERS = {"User-Agent": (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/122.0 Safari/537.36")}
def fetch_html(url: str, timeout: int = 20) -> str:
    r = requests.get(url, headers=HEADERS, timeout=timeout)
    r.raise_for_status()
    return r.text
def flatten_headers(df: pd.DataFrame) -> pd.DataFrame:
    if isinstance(df.columns, pd.MultiIndex):
        df.columns = [" ".join([str(x) for x in tup if str(x)!="nan"]).strip()
                      for tup in df.columns.values]
    else:
        df.columns = [str(c).strip() for c in df.columns]
    return df
print("Common helpers loaded.")


## Question 1 — IBAN Country Codes (table)
**URL:** https://www.iban.com/country-codes  
**Extract at least:** `Country`, `Alpha-2`, `Alpha-3`, `Numeric` (≥4 cols; you may add more)  
**Clean:** trim spaces; `Alpha-2/Alpha-3` → **UPPERCASE**; `Numeric` → **int** (nullable OK)  
**Output:** write **`data_q1.csv`** and **print a Top-15** sorted by `Numeric` (desc, no charts)  
**Deliverables:** notebook + `data_q1.csv` + short `README.md` (URL, steps, 1 limitation)

**Tip:** You can use `pandas.read_html(html)` to read tables and then pick one with ≥3 columns.
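
The cleaning rules listed above can be tried offline on a tiny hand-made frame before touching the live page. This is a minimal sketch: `clean_codes` is a hypothetical helper, and the third row is a deliberately fake placeholder that exercises the nullable-integer case.

```python
import pandas as pd

# Hand-made rows for illustration only; this is not scraped output.
raw = pd.DataFrame({
    "Country": ["  Zambia ", "Yemen", "Placeholder"],
    "Alpha-2": ["zm", "YE", "xx"],
    "Alpha-3": ["zmb", "yem ", "xxx"],
    "Numeric": ["894", "887", "n/a"],
})


def clean_codes(df):
    """Trim text, uppercase the alpha codes, and cast Numeric to nullable Int64."""
    d = df.copy()
    for col in ["Country", "Alpha-2", "Alpha-3"]:
        d[col] = d[col].str.strip()
    for col in ["Alpha-2", "Alpha-3"]:
        d[col] = d[col].str.upper()
    # Values that are not numbers become <NA> instead of raising an error.
    d["Numeric"] = pd.to_numeric(d["Numeric"], errors="coerce").astype("Int64")
    return d


cleaned = clean_codes(raw)
print(cleaned.sort_values("Numeric", ascending=False, na_position="last"))
```

Using `Int64` (capital I) rather than `int` is what allows a missing code to stay missing while the rest of the column remains integer-typed.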


In [14]:
# --- Q1 Skeleton (filled) ---
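from io import StringIO

import pandas as pd

def q1_read_table(html: str) -> pd.DataFrame:
    """Return the first table with a 'Country' column, else the first with >= 3 columns."""
    # Wrap the HTML in StringIO: passing a literal HTML string to read_html is deprecated.
    # keep_default_na=False keeps the literal code "NA" (Namibia) from being parsed as missing.
    tables = pd.read_html(StringIO(html), flavor="lxml", keep_default_na=False)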
    if not tables:
        raise ValueError("No tables found on page.")
    # Prefer a table that has a 'Country' column or at least 3 columns
    candidate = None
    for t in tables:
        t = flatten_headers(t.copy())
        cols = [c.strip().lower() for c in t.columns.astype(str)]
        if "country" in cols and len(cols) >= 3:
            candidate = t
            break
    if candidate is None:
        # Fallback: first table with >=3 cols
        for t in tables:
            t = flatten_headers(t.copy())
            if t.shape[1] >= 3:
                candidate = t
                break
    if candidate is None:
        raise ValueError("Could not find a suitable table (>=3 columns).")
    return flatten_headers(candidate)

def q1_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Clean columns: trim, UPPER Alpha-2/Alpha-3, cast Numeric to int (nullable), drop invalids."""
    d = df.copy()

    # Normalize column names commonly used on iban.com
    rename_map = {}
    for c in d.columns:
        lc = c.strip().lower()
        if lc.startswith("country"):
            rename_map[c] = "Country"
        elif "alpha-2" in lc or lc == "alpha2" or lc == "alpha 2":
            rename_map[c] = "Alpha-2"
        elif "alpha-3" in lc or lc == "alpha3" or lc == "alpha 3":
            rename_map[c] = "Alpha-3"
        elif "numeric" in lc or "num" in lc:
            rename_map[c] = "Numeric"
    d = d.rename(columns=rename_map)

    # Keep at least the required columns if present
    keep = [c for c in ["Country", "Alpha-2", "Alpha-3", "Numeric"] if c in d.columns]
    # Add any extra columns to meet "≥4 cols" (the table typically has more already)
    if len(keep) < 4:
        # Try to keep up to 4 columns total
        extras = [c for c in d.columns if c not in keep]
        keep = keep + extras[: max(0, 4 - len(keep))]
    d = d[keep].copy()

    # Trim whitespace in text columns; missing values stay missing instead of becoming "nan"
    for c in d.select_dtypes(include=["object"]).columns:
        d[c] = d[c].str.strip()

    # Uppercase Alpha-2 / Alpha-3 if present
    if "Alpha-2" in d.columns:
        d["Alpha-2"] = d["Alpha-2"].str.upper()
    if "Alpha-3" in d.columns:
        d["Alpha-3"] = d["Alpha-3"].str.upper()

    # Numeric → nullable int
    if "Numeric" in d.columns:
        d["Numeric"] = (
            d["Numeric"]
            .astype(str)
            .str.extract(r"(\d+)", expand=False)
            .astype("Int64")
        )

    # Drop obvious junk rows (Country missing or empty)
    if "Country" in d.columns:
        d = d[d["Country"].notna() & (d["Country"].astype(str).str.len() > 0)]
    d = d.reset_index(drop=True)
    return d

def q1_sort_top(df: pd.DataFrame, top: int = 15) -> pd.DataFrame:
    """Sort descending by Numeric and return Top-N."""
    if "Numeric" not in df.columns:
        raise ValueError("Numeric column not found after cleaning.")
    return df.sort_values(by="Numeric", ascending=False, na_position="last").head(top).reset_index(drop=True)


In [16]:
# Q1 — Write your answer here
URL_Q1 = "https://www.iban.com/country-codes"

# Run the "Common Imports & Polite Headers" cell first so fetch_html and flatten_headers are defined.

html_q1 = fetch_html(URL_Q1)  # uses your helper with polite headers
raw_q1 = q1_read_table(html_q1)
clean_q1 = q1_clean(raw_q1)

# Save and show Top-15 by Numeric desc
clean_q1.to_csv("data_q1.csv", index=False)
print("Saved: data_q1.csv\n")
print("Top-15 by Numeric (desc):")
display(q1_sort_top(clean_q1, top=15))

Saved: data_q1.csv

Top-15 by Numeric (desc):

|   | Country | Alpha-2 | Alpha-3 | Numeric |
|---|---|---|---|---|
| 0 | Zambia | ZM | ZMB | 894 |
| 1 | Yemen | YE | YEM | 887 |
| 2 | Samoa | WS | WSM | 882 |
| 3 | Wallis and Futuna | WF | WLF | 876 |
| 4 | Venezuela (Bolivarian Republic of) | VE | VEN | 862 |
| 5 | Uzbekistan | UZ | UZB | 860 |
| 6 | Uruguay | UY | URY | 858 |
| 7 | Burkina Faso | BF | BFA | 854 |
| 8 | Virgin Islands (U.S.) | VI | VIR | 850 |
| 9 | United States of America (the) | US | USA | 840 |


## Question 2 — Hacker News (front page)
**URL:** https://news.ycombinator.com/  
**Extract at least:** `rank`, `title`, `link`, `points`, `comments` (user optional)  
**Clean:** cast `points`/`comments`/`rank` → **int** (non-digits → 0), fill missing text fields  
**Output:** write **`data_q2.csv`** and **print a Top-15** sorted by `points` (desc, no charts)  
**Tip:** Each story is a `.athing` row; details (points/comments/user) are in the next `<tr>` with `.subtext`.
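
The row pairing described in the tip can be checked offline on a small hand-written snippet. This is a sketch under assumptions: `SAMPLE` imitates the structure rather than copying the live page, `first_int` and `parse_sample` are hypothetical helpers, and the built-in `html.parser` is used so the example does not depend on `lxml`.

```python
import re

from bs4 import BeautifulSoup

# Hand-written imitation of two story rows, each followed by its metadata row.
SAMPLE = """
<table>
  <tr class="athing"><td><span class="rank">1.</span></td>
      <td><span class="titleline"><a href="https://example.com/a">First story</a></span></td></tr>
  <tr><td class="subtext"><span class="score">42 points</span>
      <a class="hnuser">alice</a> | <a href="item?id=1">7&nbsp;comments</a></td></tr>
  <tr class="athing"><td><span class="rank">2.</span></td>
      <td><span class="titleline"><a href="https://example.com/b">Job posting</a></span></td></tr>
  <tr><td class="subtext"><a href="item?id=2">discuss</a></td></tr>
</table>
"""


def first_int(text, default=0):
    """Return the first run of digits in text as an int, else default."""
    m = re.search(r"\d+", text or "")
    return int(m.group()) if m else default


def parse_sample(html):
    soup = BeautifulSoup(html, "html.parser")
    stories = []
    for row in soup.select("tr.athing"):
        rank = row.select_one("span.rank")
        link = row.select_one("span.titleline a")
        sub = row.find_next_sibling("tr")  # metadata row paired with this story
        score = sub.select_one("span.score") if sub else None
        comment_text = ""
        for a in (sub.select("a") if sub else []):
            if "comment" in a.get_text():
                comment_text = a.get_text()
        stories.append({
            "rank": first_int(rank.get_text() if rank else ""),
            "title": link.get_text(strip=True) if link else "",
            "link": link.get("href", "") if link else "",
            "points": first_int(score.get_text() if score else ""),
            "comments": first_int(comment_text),
        })
    return stories


stories = parse_sample(SAMPLE)
for s in stories:
    print(s)
```

The second story has no score and only a "discuss" link, so both `points` and `comments` fall back to 0, which matches the "non-digits → 0" rule above.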


In [17]:
# --- Q2 Skeleton (filled) ---
import re  # used by _extract_int below

import pandas as pd
from bs4 import BeautifulSoup

def _extract_int(text, default=0):
    """Return first integer in text, else default."""
    if text is None:
        return default
    m = re.search(r"(\d+)", str(text))
    return int(m.group(1)) if m else default

def q2_parse_items(html: str) -> pd.DataFrame:
    """Parse front page items into DataFrame columns:
       rank, title, link, points, comments, user (optional)."""
    soup = BeautifulSoup(html, "lxml")
    rows = soup.select("tr.athing")
    items = []
    for r in rows:
        # Rank & title/link (HN’s modern markup)
        rank_text = r.select_one("span.rank")
        title_a = r.select_one("span.titleline a")
        title = title_a.get_text(strip=True) if title_a else ""
        link = title_a["href"].strip() if title_a and title_a.has_attr("href") else ""
        rank = _extract_int(rank_text.get_text() if rank_text else "", default=0)

        # The subtext is in the *next* tr (sibling)
        sub = r.find_next_sibling("tr")
        points = 0
        comments = 0
        user = ""
        if sub:
            score = sub.select_one("span.score")
            points = _extract_int(score.get_text() if score else "", default=0)

            user_a = sub.select_one("a.hnuser")
            user = user_a.get_text(strip=True) if user_a else ""

            # Comments are usually the last <a> element containing 'comment'
            sublinks = sub.select("a")
            comment_text = ""
            if sublinks:
                for a in reversed(sublinks):
                    if "comment" in a.get_text(strip=True).lower():
                        comment_text = a.get_text(strip=True)
                        break
            comments = _extract_int(comment_text, default=0)

        items.append(
            {
                "rank": rank,
                "title": title,
                "link": link,
                "points": points,
                "comments": comments,
                "user": user,
            }
        )
    return pd.DataFrame(items)

def q2_clean(df: pd.DataFrame) -> pd.DataFrame:
    d = df.copy()

    # Fill missing text fields
    for c in ["title", "link", "user"]:
        if c in d.columns:
            d[c] = d[c].fillna("").astype(str).str.strip()

    # Coerce numeric fields
    for c in ["rank", "points", "comments"]:
        if c in d.columns:
            d[c] = pd.to_numeric(d[c], errors="coerce").fillna(0).astype(int)

    # Remove rows that are definitely not stories (title/link empty and all zeros)
    d = d[~((d["title"] == "") & (d["link"] == "") & (d["points"] == 0) & (d["comments"] == 0))]
    d = d.reset_index(drop=True)
    return d


In [18]:
# Q2 — Write your answer here
URL_Q2 = "https://news.ycombinator.com/"

html_q2 = fetch_html(URL_Q2)  # polite headers from your helper
raw_q2 = q2_parse_items(html_q2)
clean_q2 = q2_clean(raw_q2)

# Save and show Top-15 by points desc
clean_q2.to_csv("data_q2.csv", index=False)
print("Saved: data_q2.csv\n")
print("Top-15 by points (desc):")
display(clean_q2.sort_values("points", ascending=False).head(15).reset_index(drop=True))


Saved: data_q2.csv

Top-15 by points (desc):


|   | rank | title | link | points | comments | user |
|---|---|---|---|---|---|---|
| 0 | 30 | Ratatui – App Showcase | https://ratatui.rs/showcase/apps/ | 704 | 201 | AbuAssar |
| 1 | 11 | FBI tries to unmask owner of archive.is | https://www.heise.de/en/news/Archive-today-FBI... | 658 | 346 | Projectiboga |
| 2 | 3 | Kimi K2 Thinking, a SOTA open-source trillion-... | https://moonshotai.github.io/Kimi-K2/thinking.... | 550 | 214 | nekofneko |
| 3 | 15 | ICC ditches Microsoft 365 for openDesk | https://www.binnenlandsbestuur.nl/digitaal/int... | 512 | 158 | vincvinc |
| 4 | 8 | Open Source Implementation of Apple's Private ... | https://github.com/openpcc/openpcc | 345 | 68 | adam_gyroscope |
| 5 | 2 | Two billion email addresses were exposed | https://www.troyhunt.com/2-billion-email-addre... | 303 | 216 | esnard |
| 6 | 23 | I may have found a way to spot U.S. at-sea str... | https://old.reddit.com/r/OSINT/comments/1opjjy... | 288 | 410 | hentrep |
| 7 | 29 | IKEA launches new smart home range with 21 Mat... | https://www.ikea.com/global/en/newsroom/retail... | 276 | 203 | lemoine0461 |
| 8 | 18 | Mathematical exploration and discovery at scale | https://terrytao.wordpress.com/2025/11/05/math... | 219 | 105 | nabla9 |
| 9 | 1 | You should write an agent | https://fly.io/blog/everyone-write-an-agent/ | 218 | 105 | tabletcorry |
