
# Web Data Extraction & APIs

**Focus site:** `https://quotes.toscrape.com`  
**Goals:**
- Understand how web pages load: HTTP/HTTPS, requests, responses, HTML/CSS/JS, rendering.
- Use DevTools to inspect requests & responses.
- Reproduce a browser request via cURL and then with Python `requests`.
- Parse HTML with **BeautifulSoup** in multiple ways.
- Extract quotes (text), authors, and tags.
- Save results with **pandas** to a CSV (custom separators).

> You’ll run this notebook top-to-bottom during the lecture. Fill TODOs where asked.  
> Where you see “(paste image here)” you can drag & drop screenshots from DevTools into the notebook.



## Setup

We’ll rely on these packages:
- `requests`
- `beautifulsoup4`
- `pandas`

If you need to install them in your environment, run (uncomment if needed):


In [None]:
# !pip install requests beautifulsoup4 pandas


## 1) How Web Pages Work (HTTP/HTTPS → HTML → Render)

At a high level:
1. Your browser sends an **HTTP/HTTPS GET** request to a URL.  
2. The **server** responds with an **HTTP response**, typically HTML.  
3. The browser parses HTML and often issues **additional requests** for CSS, JS, fonts, and images.  
4. The browser **renders** the page incrementally; JavaScript may fetch more data (**XHR/fetch**) and modify the DOM.

```
You            Internet          Server
 |  GET /page -----> |  ...  | -----> Receives request
 | <-----  HTML      |  ...  | <----- Responds with HTML
 |  GET /style.css   |  ...  | -----> Additional requests (CSS/JS/images)
 |  GET /script.js   |  ...  |
 | <-----  CSS/JS    |  ...  |
[Render DOM + run JS; possibly fetch JSON via XHR/fetch]
```
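
Step 3 can be sketched offline. The standard-library `html.parser` below scans an HTML snippet and lists the URLs a browser would request next. The snippet is a shortened, hand-written sample modeled on the focus site's `<head>`; the `<script>` and `<img>` tags and the class name `SubresourceCollector` are illustrative assumptions, not taken from the site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hand-written sample; only the two stylesheet links mirror the real page.
SAMPLE_HTML = """
<html><head>
  <link rel="stylesheet" href="/static/bootstrap.min.css">
  <link rel="stylesheet" href="/static/main.css">
  <script src="/static/app.js"></script>
</head><body><img src="logo.png"></body></html>
"""

class SubresourceCollector(HTMLParser):
    """Collect absolute URLs of resources referenced by link/script/img tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "link" and attr_map.get("href"):
            self.urls.append(urljoin(self.base_url, attr_map["href"]))
        elif tag in ("script", "img") and attr_map.get("src"):
            self.urls.append(urljoin(self.base_url, attr_map["src"]))

collector = SubresourceCollector("https://quotes.toscrape.com/")
collector.feed(SAMPLE_HTML)
for url in collector.urls:
    print("GET", url)
```

Each printed line corresponds to one of the "additional requests" in the diagram. A real browser also follows fonts, background images referenced from CSS, and anything JavaScript loads, which this sketch ignores.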

**Static vs. dynamic** pages:  
- Static pages contain all key data directly in the HTML.  
- Dynamic pages may **render data via JavaScript** (e.g., JSON APIs). For those, you may:
  - Call the **same JSON endpoints** the page uses, or
  - Use a JS-capable tool (e.g., Playwright/Selenium) if needed.

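To see which case applies, right-click the page and choose **View Page Source** (the raw HTML) or **Inspect** (the live DOM after JavaScript has run), then watch the **Network** tab in DevTools while the page loads.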
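
The View Page Source check can be approximated in code: if text you see in the rendered page also appears in the raw HTML, the page is probably static for that data. This is a minimal sketch; the helper name `appears_in_raw_html` and both sample strings are hypothetical.

```python
def appears_in_raw_html(raw_html: str, visible_text: str) -> bool:
    """Return True if text seen in the browser is present in the unrendered HTML."""

    def normalize(s: str) -> str:
        # Collapse runs of whitespace so line breaks in the source do not matter.
        return " ".join(s.split())

    return normalize(visible_text) in normalize(raw_html)

# Static case: the data is already in the markup.
static_page = '<div class="quote"><span class="text">Hello,\n   world</span></div>'
# Dynamic case: only an empty container plus a script that fills it later.
dynamic_page = '<div id="app"></div><script src="/static/app.js"></script>'

print(appears_in_raw_html(static_page, "Hello, world"))   # True
print(appears_in_raw_html(dynamic_page, "Hello, world"))  # False
```

The check is deliberately crude: text split across several tags or written with HTML entities would not match, so treat a `False` as a hint to look at the Network tab, not as proof.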

Further reading on how HTML, CSS, and JavaScript differ: `https://goldenowl.asia/blog/difference-between-html-css-and-javascript`


## 2) DevTools: Inspect Requests/Responses

Open your browser’s **Developer Tools** → **Network** tab, then reload `https://quotes.toscrape.com/`.

What to look for:
- **Request URL**: `https://quotes.toscrape.com/`
- **Method**: GET
- **Status**: 200 OK
- **Response headers**: content type, encoding, etc.
- **Request headers**: `User-Agent`, `Accept`, `Accept-Language`, etc.
- **Other requests**: CSS, images, possibly additional pages when clicking pagination.
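
Among the response headers, `Content-Type` is worth a closer look because it carries both the media type and the character encoding. As a small offline sketch, the standard-library `email.message.Message` can split such a header value. The values below are typical examples, not headers captured from the focus site, and the helper name `parse_content_type` is hypothetical.

```python
from email.message import Message
from typing import Optional, Tuple

def parse_content_type(header_value: str) -> Tuple[str, Optional[str]]:
    """Split a Content-Type header into (media type, charset or None)."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_type(), msg.get_content_charset()

print(parse_content_type("text/html; charset=utf-8"))  # ('text/html', 'utf-8')
print(parse_content_type("application/json"))          # ('application/json', None)
```

With `requests`, the same information is available on a response object as `resp.headers["Content-Type"]` and `resp.encoding`.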



## 3) “Copy as cURL”

From DevTools, you can right-click the request and **Copy → Copy as cURL**.  
This reproduces the browser’s request on the command line.

### Example
```bash
curl 'https://quotes.toscrape.com/' \
  --compressed \
  -H 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:143.0) Gecko/20100101 Firefox/143.0' \
  -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' \
  -H 'Accept-Language: en-GB,en;q=0.5' \
  -H 'Accept-Encoding: gzip, deflate, br, zstd' \
  -H 'Connection: keep-alive' \
  -H 'Upgrade-Insecure-Requests: 1' \
  -H 'Sec-Fetch-Dest: document' \
  -H 'Sec-Fetch-Mode: navigate' \
  -H 'Sec-Fetch-Site: none' \
  -H 'Sec-Fetch-User: ?1' \
  -H 'Priority: u=0, i' \
  -H 'Pragma: no-cache' \
  -H 'Cache-Control: no-cache'
```

> Try running it in a terminal or a shell-enabled notebook with `!curl ...`.  
> You should receive HTML for the homepage.
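
To carry a copied cURL command over to Python, the `-H` flags can be turned into a headers dictionary. The sketch below uses the standard-library `shlex` to tokenize a shortened version of the command above. The helper name `headers_from_curl` is hypothetical, and it only handles `-H`/`--header`, not cookies, request bodies, or other flags.

```python
import shlex
from typing import Dict

# Shortened copy of the example command; the raw string keeps the backslashes.
CURL_COMMAND = r"""
curl 'https://quotes.toscrape.com/' \
  --compressed \
  -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' \
  -H 'Accept-Language: en-GB,en;q=0.5'
"""

def headers_from_curl(command: str) -> Dict[str, str]:
    """Extract the -H/--header values of a cURL command as a dict."""
    # Join shell line continuations first, then split with shell-style quoting.
    tokens = shlex.split(command.replace("\\\n", " "))
    headers: Dict[str, str] = {}
    for flag, value in zip(tokens, tokens[1:]):
        if flag in ("-H", "--header"):
            name, _, header_value = value.partition(":")
            headers[name.strip()] = header_value.strip()
    return headers

for name, value in headers_from_curl(CURL_COMMAND).items():
    print(f"{name}: {value}")
```

The resulting dictionary has the shape that `requests.get(url, headers=...)` expects, which is what the next section builds by hand.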


In [12]:

# OPTIONAL: Run the cURL command directly from a notebook cell (requires curl to be installed).
# The trailing backslashes continue the command across lines.
# The explicit Accept-Encoding header is left out so that --compressed only
# advertises encodings this curl build can decode.

!curl 'https://quotes.toscrape.com/' \
  --compressed \
  -H 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:143.0) Gecko/20100101 Firefox/143.0' \
  -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' \
  -H 'Accept-Language: en-GB,en;q=0.5' \
  -H 'Connection: keep-alive' \
  -H 'Upgrade-Insecure-Requests: 1' \
  -H 'Sec-Fetch-Dest: document' \
  -H 'Sec-Fetch-Mode: navigate' \
  -H 'Sec-Fetch-Site: none' \
  -H 'Sec-Fetch-User: ?1' \
  -H 'Priority: u=0, i' \
  -H 'Pragma: no-cache' \
  -H 'Cache-Control: no-cache'


<!DOCTYPE html>
<html lang="en">
<head>
	<meta charset="UTF-8">
	<title>Quotes to Scrape</title>
    <link rel="stylesheet" href="/static/bootstrap.min.css">
    <link rel="stylesheet" href="/static/main.css">
    
    
</head>
<body>
    <div class="container">
        <div class="row header-box">
            <div class="col-md-8">
                <h1>
                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>
                </h1>
            </div>
            <div class="col-md-4">
                <p>
                
                    <a href="/login">Login</a>
                
                </p>
            </div>
        </div>
    

<div class="row">
    <div class="col-md-8">

    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
  


## 4) Reproducing the Request in Python (`requests`)

Below we mirror the important headers and fetch the HTML.  
We’ll write a small helper that returns the **HTML as a string**.


In [13]:

from __future__ import annotations
from typing import Optional, Dict

import requests

DEFAULT_HEADERS: Dict[str, str] = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:143.0) Gecko/20100101 Firefox/143.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-GB,en;q=0.5",
    # br and zstd are omitted: requests can only decode them when optional
    # packages (e.g. brotli, zstandard) are installed.
    "Accept-Encoding": "gzip, deflate",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
    "Priority": "u=0, i",
    "Pragma": "no-cache",
    "Cache-Control": "no-cache",
}

def fetch_html(url: str, headers: Optional[Dict[str, str]] = None, timeout_s: float = 15.0) -> str:
    """Fetch a URL and return the response body as text; raises on HTTP errors."""
    # Caller-supplied headers override the defaults.
    merged_headers: Dict[str, str] = {**DEFAULT_HEADERS, **(headers or {})}
    # Send the merged headers (an empty dict here would discard them all).
    resp = requests.get(url, headers=merged_headers, timeout=timeout_s)
    resp.raise_for_status()
    return resp.text

try:
    html_preview: str = fetch_html("https://quotes.toscrape.com/")
    print(html_preview[:500])
except Exception as e:
    print(f"Fetch failed (likely due to offline/no-internet environment): {e}")


<!DOCTYPE html>
<html lang="en">
<head>
	<meta charset="UTF-8">
	<title>Quotes to Scrape</title>
    <link rel="stylesheet" href="/static/bootstrap.min.css">
    <link rel="stylesheet" href="/static/main.css">
    
    
</head>
<body>
    <div class="container">
        <div class="row header-box">
            <div class="col-md-8">
                <h1>
                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>
                </h1>
            </div>
            <div cla



## 5) Parsing with BeautifulSoup — Quotes Only

We’ll show **three approaches** to extract the **quote text** from the homepage:
1. **CSS selectors** (`select` / `select_one`)
2. **Tag & class** via `find_all`
3. **Direct iteration** over containers + element lookup

Each function will return `list[str]` (just the quotes). Make sure you’ve fetched HTML first.


In [14]:

from __future__ import annotations
from typing import List
from bs4 import BeautifulSoup

def quotes_via_css(html: str) -> List[str]:
    soup = BeautifulSoup(html, "html.parser")
    return [el.get_text(strip=True) for el in soup.select("div.quote span.text")]

def quotes_via_find_all(html: str) -> List[str]:
    soup = BeautifulSoup(html, "html.parser")
    quotes: List[str] = []
    for qdiv in soup.find_all("div", class_="quote"):
        text_span = qdiv.find("span", class_="text")
        if text_span:
            quotes.append(text_span.get_text(strip=True))
    return quotes

def quotes_via_iteration(html: str) -> List[str]:
    soup = BeautifulSoup(html, "html.parser")
    quotes: List[str] = []
    for container in soup.select("div.quote"):
        span = container.select_one("span.text")
        if span:
            quotes.append(span.get_text(strip=True))
    return quotes

try:
    page_html: str = fetch_html("https://quotes.toscrape.com/")
    print("CSS:", quotes_via_css(page_html)[:3])
    print("find_all:", quotes_via_find_all(page_html)[:3])
    print("iteration:", quotes_via_iteration(page_html)[:3])
except Exception as e:
    print(f"Skipping demo due to environment: {e}")


CSS: ['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”']
find_all: ['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”']
iteration: ['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', '“There are only two ways to live your life. One is as though nothing is a mi


## 6) More Parsing — Quotes, Authors, Tags

Now we’ll extract a **tuple** for each quote:  
`(quote_text: str, author: str, tags: list[str])`

We’ll also show how to **follow pagination** (clicking “Next”) to get multiple pages.


In [15]:

from __future__ import annotations
from typing import List, Tuple, Optional
from bs4 import BeautifulSoup
from urllib.parse import urljoin

QuoteRow = Tuple[str, str, List[str]]  # (quote, author, tags)

def parse_quotes_page(html: str) -> List[QuoteRow]:
    soup = BeautifulSoup(html, "html.parser")
    rows: List[QuoteRow] = []
    for q in soup.select("div.quote"):
        text_el = q.select_one("span.text")
        author_el = q.select_one("small.author")
        tag_els = q.select("div.tags a.tag")
        if not text_el or not author_el:
            continue
        text: str = text_el.get_text(strip=True)
        author: str = author_el.get_text(strip=True)
        tags: List[str] = [t.get_text(strip=True) for t in tag_els]
        rows.append((text, author, tags))
    return rows

def find_next_page_url(html: str, base_url: str) -> Optional[str]:
    soup = BeautifulSoup(html, "html.parser")
    next_link = soup.select_one("li.next a")
    if next_link and next_link.get("href"):
        return urljoin(base_url, next_link["href"])
    return None

def scrape_all_quotes(start_url: str, max_pages: int = 10) -> List[QuoteRow]:
    all_rows: List[QuoteRow] = []
    url: Optional[str] = start_url
    pages_visited: int = 0

    while url and pages_visited < max_pages:
        html: str = fetch_html(url)
        page_rows: List[QuoteRow] = parse_quotes_page(html)
        all_rows.extend(page_rows)
        url = find_next_page_url(html, base_url=url)
        pages_visited += 1
    return all_rows

try:
    rows = scrape_all_quotes("https://quotes.toscrape.com/", max_pages=3)
    print(f"Collected {len(rows)} rows (first 2 shown):")
    for r in rows[:2]:
        print(r)
except Exception as e:
    print(f"Scrape demo skipped: {e}")


Collected 30 rows (first 2 shown):
('“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'Albert Einstein', ['change', 'deep-thoughts', 'thinking', 'world'])
('“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'J.K. Rowling', ['abilities', 'choices'])
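
Pagination depends on turning the `href` of the “Next” link into an absolute URL. The sketch below shows how `urllib.parse.urljoin` resolves a few link forms against the current page's URL; the `href` values are illustrative samples, not values read from the live site.

```python
from urllib.parse import urljoin

current_page = "https://quotes.toscrape.com/page/2/"

# Root-relative link: replaces the whole path.
root_relative = urljoin(current_page, "/page/3/")
# Path-relative link: resolved against the current directory.
path_relative = urljoin(current_page, "../3/")
# Absolute link: returned unchanged, even if it points to another host.
absolute = urljoin(current_page, "https://example.com/other")

print(root_relative)  # https://quotes.toscrape.com/page/3/
print(path_relative)  # https://quotes.toscrape.com/page/3/
print(absolute)       # https://example.com/other
```

This is why `find_next_page_url` passes the URL of the page it just fetched as `base_url`: path-relative links only resolve correctly against the page they came from.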



## 7) Save Results with `pandas` (CSV)

We’ll convert the tuples to a DataFrame and then **save to CSV** with:
- CSV **separator**: `;`
- **Tags** joined into a single field with commas (`","`), which stays unambiguous because the column separator is `;`

Resulting CSV columns: `quote;author;tags`


In [16]:

from __future__ import annotations
from typing import List, Tuple
import pandas as pd

def rows_to_dataframe(rows: List[Tuple[str, str, List[str]]]) -> pd.DataFrame:
    records = [{"quote": q, "author": a, "tags": ",".join(t)} for (q, a, t) in rows]
    return pd.DataFrame.from_records(records, columns=["quote", "author", "tags"])

def save_quotes_csv(rows: List[Tuple[str, str, List[str]]], path: str) -> None:
    df = rows_to_dataframe(rows)
    df.to_csv(path, sep=";", index=False)

try:
    rows = scrape_all_quotes("https://quotes.toscrape.com/", max_pages=3)
    save_path = "quotes_sample.csv"
    save_quotes_csv(rows, save_path)
    print(f"Saved {len(rows)} rows to {save_path}")
    display(rows_to_dataframe(rows).head(3))
except Exception as e:
    print(f"Save demo skipped: {e}")


Saved 30 rows to quotes_sample.csv


| | quote | author | tags |
|---|---|---|---|
| 0 | “The world as we have created it is a process ... | Albert Einstein | change,deep-thoughts,thinking,world |
| 1 | “It is our choices, Harry, that show what we t... | J.K. Rowling | abilities,choices |
| 2 | “There are only two ways to live your life. On... | Albert Einstein | inspirational,life,live,miracle,miracles |
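
Because tags are joined with commas inside a `;`-separated file, a reader has to use the same separator and then split the tags column again. The sketch below round-trips two sample rows through an in-memory buffer. It uses the standard-library `csv` module instead of pandas so that it runs without extra packages; with pandas, the equivalent read would pass `sep=";"` to `pd.read_csv`. The sample rows are made up for illustration.

```python
import csv
import io

sample_rows = [
    ("“It is our choices, Harry.”", "J.K. Rowling", ["abilities", "choices"]),
    ("“No tags here; just text.”", "Unknown", []),
]

# Write: same layout as the notebook's CSV (quote;author;tags).
buffer = io.StringIO(newline="")
writer = csv.writer(buffer, delimiter=";")
writer.writerow(["quote", "author", "tags"])
for quote, author, tags in sample_rows:
    writer.writerow([quote, author, ",".join(tags)])

# Read back: use the same delimiter, then split the tags field.
buffer.seek(0)
restored = []
for record in csv.DictReader(buffer, delimiter=";"):
    tags = record["tags"].split(",") if record["tags"] else []
    restored.append((record["quote"], record["author"], tags))

print(restored == sample_rows)  # True
```

The second row contains a `;` inside the quote text; the writer wraps that field in double quotes, so it still reads back as one column. Note that pandas, by default, reads an empty tags field back as a missing value (`NaN`) instead of an empty string, so a pandas-based reader would need to handle that case before splitting.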



## 8) Responsible Scraping — Notes & Best Practices

- **Be gentle**: add small delays between requests; don’t hammer servers.
- **Identify yourself**: set a sensible `User-Agent`, and consider including contact information when scraping for research.
- **Limit scope**: only scrape what you need; set **max pages**.
- **Avoid PII** and restricted data.
- **Cache** responses where appropriate to reduce load.
- **Respect rate limits** if using APIs; use API keys securely.
- **Legal & ethical**: laws vary by jurisdiction; when in doubt, consult your organization’s policy.
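
The first, third, and fifth points can be combined in a small wrapper. This is a minimal sketch, assuming a fetch function with the same shape as `fetch_html` above; the name `polite_fetch_all`, its parameters, and the one-second default delay are illustrative choices, not requirements of any site.

```python
import time
from typing import Callable, Dict, Iterable, List, Optional

def polite_fetch_all(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    delay_s: float = 1.0,
    max_pages: int = 10,
    cache: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Fetch pages one by one with a delay, a page limit, and a simple cache."""
    cache = {} if cache is None else cache
    pages: List[str] = []
    network_calls = 0
    for url in urls:
        if len(pages) >= max_pages:
            break  # limit scope
        if url not in cache:
            if network_calls > 0:
                time.sleep(delay_s)  # pause only between real requests
            cache[url] = fetch(url)
            network_calls += 1
        pages.append(cache[url])  # repeated URLs are served from the cache
    return pages

# Offline demo with a stand-in fetch function that records its calls.
calls: List[str] = []

def fake_fetch(url: str) -> str:
    calls.append(url)
    return f"<html>{url}</html>"

demo_urls = [
    "https://quotes.toscrape.com/page/1/",
    "https://quotes.toscrape.com/page/2/",
    "https://quotes.toscrape.com/page/1/",  # duplicate, should hit the cache
]
demo_pages = polite_fetch_all(demo_urls, fake_fetch, delay_s=0.0)
print(len(demo_pages), "pages returned,", len(calls), "fetches made")
```

In the notebook, passing `fetch_html` as the `fetch` argument would apply the same policy to real requests. The in-memory dictionary only lasts for the session; persisting responses to disk is a separate design decision.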
