# Beginner Web Scraping & Parsing (Notebook)

A compact, runnable walkthrough that merges the theory and examples from the beginner edition. You can run each cell step by step to see how requests, BeautifulSoup, validation, and persistence fit together.


## Learning Path (you can jump into any section)
- Foundations: HTTP requests, HTML/CSS structure, respectful scraping
- BeautifulSoup essentials: selecting elements, extracting text/attributes
- Practical example 1: Quotes scraper with pagination + cleaning
- Practical example 2: Books scraper with rating/availability parsing
- Practical example 3: Full pipeline with validation and SQLite storage
- Next steps and exercises to extend the code


## Quick Start
1. Create and activate a virtualenv (Python 3.11+ recommended).
2. Install deps: `pip install requests beautifulsoup4` (optionally `pandas` for tabular views).
3. Run cells from top to bottom. The examples use the legal playground sites `quotes.toscrape.com` and `books.toscrape.com`.


## Responsible Scraping Checklist
- Identify yourself with a User-Agent header.
- Respect robots.txt and site terms.
- Add short delays between requests (rate limiting).
- Handle errors gracefully (timeouts, changed markup).
- Validate data before storing; avoid hammering a site with retries.


## Inspecting HTML Quickly
A small helper to preview HTML and keep responses on disk for debugging. This is useful when a selector suddenly stops matching because the site changed its structure.


In [1]:
from __future__ import annotations

import json
import logging
import sqlite3
import time
from dataclasses import dataclass, asdict
from pathlib import Path
from typing import Dict, List, Optional, Tuple

import requests
from bs4 import BeautifulSoup

logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
logger = logging.getLogger(__name__)


In [2]:
def fetch_html(url: str, *, timeout: int = 10, pause: float = 0.5) -> Optional[str]:
    """Download HTML with a browser-like header and a small delay."""
    headers = {"User-Agent": "Mozilla/5.0 (Educational Bot for web scraping practice)"}
    time.sleep(pause)
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
        response.raise_for_status()
        return response.text
    except requests.RequestException as exc:
        logger.error("Error fetching %s: %s", url, exc)
        return None


def preview_html(html: str, *, keep: Optional[Path] = None, chars: int = 600) -> str:
    """Return a short preview of HTML and optionally persist it for offline inspection."""
    if keep:
        keep.parent.mkdir(parents=True, exist_ok=True)
        keep.write_text(html, encoding="utf-8")
        logger.info("Saved raw HTML to %s", keep)
    return html[:chars] + ("..." if len(html) > chars else "")


## BeautifulSoup Basics Recap
- `soup.find('tag', class_='name')` returns the first match.
- `soup.find_all('div', class_='quote')` returns a list.
- Access attributes with `element['href']` or safer `element.get('href')`.
- Use `.get_text(strip=True)` to trim whitespace.
- Prefer specific selectors; brittle selectors create noisy data.


## Example 1: Quotes Scraper (with validation and pagination)
We treat each quote as a structured record. The scraper follows the "Next" link automatically and cleans duplicates.


In [4]:
@dataclass
class Quote:
    text: str
    author: str
    tags: List[str]
    scraped_at: str

    def to_dict(self) -> Dict:
        return asdict(self)


def validate_quote(quote: Quote) -> Tuple[bool, Optional[str]]:
    if not quote.text or len(quote.text) < 5:
        return False, "Text is missing or too short"
    if not quote.author:
        return False, "Missing author"
    return True, None


class QuoteScraper:
    def __init__(self, base_url: str = "http://quotes.toscrape.com", delay: float = 1.0):
        self.base_url = base_url.rstrip("/")
        self.delay = delay
        self.quotes: List[Quote] = []
        self.errors: List[str] = []

    def _extract_quotes(self, soup: BeautifulSoup) -> List[Quote]:
        items: List[Quote] = []
        for block in soup.find_all("div", class_="quote"):
            try:
                text = block.find("span", class_="text").get_text(strip=True)
                author = block.find("small", class_="author").get_text(strip=True)
                tags = [tag.get_text(strip=True) for tag in block.find_all("a", class_="tag")]
                quote = Quote(text=text, author=author, tags=tags, scraped_at=time.strftime("%Y-%m-%dT%H:%M:%S"))
                is_valid, reason = validate_quote(quote)
                if is_valid:
                    items.append(quote)
                else:
                    self.errors.append(f"Invalid quote: {reason}")
            except AttributeError as exc:
                self.errors.append(f"Parsing error: {exc}")
        return items

    def scrape(self, max_pages: int = 3) -> List[Quote]:
        url = f"{self.base_url}/page/1/"
        seen = set()

        for page in range(1, max_pages + 1):
            logger.info("Scraping page %s", page)
            html = fetch_html(url, pause=self.delay)
            if not html:
                break

            soup = BeautifulSoup(html, "html.parser")
            for quote in self._extract_quotes(soup):
                key = (quote.text.lower(), quote.author.lower())
                if key not in seen:
                    seen.add(key)
                    self.quotes.append(quote)

            next_link = soup.find("li", class_="next")
            if next_link and next_link.find("a"):
                next_href = next_link.find("a").get("href")
                url = f"{self.base_url}{next_href}"
            else:
                break
        return self.quotes

    def to_json(self, path: Path) -> None:
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps([q.to_dict() for q in self.quotes], indent=2, ensure_ascii=False), encoding="utf-8")
        logger.info("Saved %s quotes to %s", len(self.quotes), path)

    def preview(self, limit: int = 3) -> None:
        for quote in self.quotes[:limit]:
            print(f"— {quote.author}: {quote.text}")
            print(f"  tags: {', '.join(quote.tags)}")


### Run the Quotes Scraper
Uncomment the cell below to fetch a couple of pages. Data is deduplicated and saved to `data/quotes.json` by default.


In [5]:
quote_scraper = QuoteScraper()


In [6]:

quotes = quote_scraper.scrape(max_pages=2)


2025-12-20 13:16:11,537 - INFO - Scraping page 1
2025-12-20 13:16:13,073 - INFO - Scraping page 2


In [9]:
quote_scraper.quotes

[Quote(text='“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', author='Albert Einstein', tags=['change', 'deep-thoughts', 'thinking', 'world'], scraped_at='2025-12-20T13:16:13'),
 Quote(text='“It is our choices, Harry, that show what we truly are, far more than our abilities.”', author='J.K. Rowling', tags=['abilities', 'choices'], scraped_at='2025-12-20T13:16:13'),
 Quote(text='“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', author='Albert Einstein', tags=['inspirational', 'life', 'live', 'miracle', 'miracles'], scraped_at='2025-12-20T13:16:13'),
 Quote(text='“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', author='Jane Austen', tags=['aliteracy', 'books', 'classic', 'humor'], scraped_at='2025-12-20T13:16:13'),
 Quote(text="“Imperfection is beauty, madness is genius and it's b

In [11]:

quote_scraper.to_json(Path("data/quotes.json"))


2025-12-20 13:18:02,246 - INFO - Saved 20 quotes to data/quotes.json


In [12]:

quote_scraper.preview(limit=5)


— Albert Einstein: “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
  tags: change, deep-thoughts, thinking, world
— J.K. Rowling: “It is our choices, Harry, that show what we truly are, far more than our abilities.”
  tags: abilities, choices
— Albert Einstein: “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
  tags: inspirational, life, live, miracle, miracles
— Jane Austen: “The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
  tags: aliteracy, books, classic, humor
— Marilyn Monroe: “Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
  tags: be-yourself, inspirational


In [13]:

print(f"Collected {len(quotes)} quotes; errors: {len(quote_scraper.errors)}")


Collected 20 quotes; errors: 0


## Example 2: Books Scraper (price, availability, rating)
Books.toscrape.com exposes pagination and richer fields. We normalize prices and map star ratings to numbers for easier analysis.


In [17]:
RATING_MAP = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


@dataclass
class Book:
    title: str
    price_gbp: float
    availability: str
    rating: int
    scraped_at: str

    def to_dict(self) -> Dict:
        return asdict(self)


def parse_price(raw: str) -> float:
    digits = raw.replace("£", "").strip()
    try:
        return float(digits)
    except ValueError:
        return 0.0


def parse_rating(tag: Optional[str]) -> int:
    if not tag:
        return 0
    return RATING_MAP.get(tag, 0)


class BookScraper:
    def __init__(self, base_url: str = "http://books.toscrape.com", delay: float = 1.0):
        self.base_url = base_url.rstrip("/")
        self.delay = delay
        self.books: List[Book] = []
        self.errors: List[str] = []

    def _extract_books(self, soup: BeautifulSoup) -> List[Book]:
        results: List[Book] = []
        for article in soup.find_all("article", class_="product_pod"):
            try:
                title = article.find("h2").find("a").get("title")
                price = parse_price(article.find("p", class_="price_color").get_text())
                availability = article.find("p", class_="instock availability").get_text(strip=True)
                rating_tag = article.find("p", class_="star-rating").get("class", [None, None])[1]
                book = Book(
                    title=title,
                    price_gbp=price,
                    availability=availability,
                    rating=parse_rating(rating_tag),
                    scraped_at=time.strftime("%Y-%m-%dT%H:%M:%S"),
                )
                results.append(book)
            except (AttributeError, IndexError) as exc:
                self.errors.append(f"Parsing error: {exc}")
        return results

    def scrape(self, max_pages: int = 2) -> List[Book]:
        url = f"{self.base_url}/catalogue/page-1.html"
        seen = set()

        for page in range(1, max_pages + 1):
            logger.info("Scraping books page %s", page)
            html = fetch_html(url, pause=self.delay)
            if not html:
                break
            soup = BeautifulSoup(html, "html.parser")
            for book in self._extract_books(soup):
                if book.title not in seen:
                    seen.add(book.title)
                    self.books.append(book)
            next_link = soup.find("li", class_="next")
            if next_link and next_link.find("a"):
                href = next_link.find("a").get("href")
                url = f"{self.base_url}/catalogue/{href}"
            else:
                break
        return self.books

    def to_json(self, path: Path) -> None:
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps([b.to_dict() for b in self.books], indent=2, ensure_ascii=False), encoding="utf-8")
        logger.info("Saved %s books to %s", len(self.books), path)


### Run the Books Scraper
Uncomment to gather a few pages of data and store them in `data/books.json`.


In [18]:
book_scraper = BookScraper()
books = book_scraper.scrape(max_pages=2)
book_scraper.to_json(Path("data/books.json"))
print(f"Collected {len(books)} books; errors: {len(book_scraper.errors)}")


2025-12-20 13:20:33,675 - INFO - Scraping books page 1
2025-12-20 13:20:35,008 - INFO - Scraping books page 2
2025-12-20 13:20:36,341 - INFO - Saved 0 books to data/books.json


Collected 0 books; errors: 40


## Example 3: End-to-End Pipeline With SQLite
The pipeline mirrors a production workflow: fetch → parse → validate → deduplicate → persist. We use SQLite for simplicity and add basic duplicate prevention via a unique URL constraint.


In [19]:
@dataclass
class NewsArticle:
    title: str
    url: str
    content: Optional[str]
    category: Optional[str]
    published_date: Optional[str]
    scraped_at: str

    def to_dict(self) -> Dict:
        return asdict(self)


class NewsDataValidator:
    MIN_TITLE = 5
    MIN_CONTENT = 10

    @classmethod
    def validate(cls, article: NewsArticle) -> Tuple[bool, Optional[str]]:
        if not article.title or len(article.title) < cls.MIN_TITLE:
            return False, "Title is missing or too short"
        if not article.url:
            return False, "URL is required"
        if article.content and len(article.content) < cls.MIN_CONTENT:
            return False, "Content is too short"
        return True, None

    @classmethod
    def clean(cls, articles: List[NewsArticle]) -> Tuple[List[NewsArticle], List[str]]:
        valid: List[NewsArticle] = []
        errors: List[str] = []
        seen = set()
        for article in articles:
            ok, reason = cls.validate(article)
            if not ok:
                errors.append(f"Invalid: {reason}")
                continue
            key = article.title.lower().strip()
            if key in seen:
                errors.append(f"Duplicate title: {article.title}")
                continue
            seen.add(key)
            valid.append(article)
        return valid, errors


class NewsDatabase:
    def __init__(self, path: Path = Path("data/news_demo.db")):
        self.path = path
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self._init()

    def _init(self) -> None:
        conn = sqlite3.connect(self.path)
        cursor = conn.cursor()
        cursor.execute(
            """
            CREATE TABLE IF NOT EXISTS articles (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                title TEXT NOT NULL,
                url TEXT NOT NULL UNIQUE,
                content TEXT,
                category TEXT,
                published_date TEXT,
                scraped_at TEXT NOT NULL,
                created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
            )
            """
        )
        conn.commit()
        conn.close()

    def save(self, articles: List[NewsArticle]) -> int:
        conn = sqlite3.connect(self.path)
        cursor = conn.cursor()
        saved = 0
        for item in articles:
            try:
                cursor.execute(
                    """
                    INSERT INTO articles (title, url, content, category, published_date, scraped_at)
                    VALUES (?, ?, ?, ?, ?, ?)
                    """,
                    (item.title, item.url, item.content, item.category, item.published_date, item.scraped_at),
                )
                saved += 1
            except sqlite3.IntegrityError:
                logger.warning("Duplicate URL skipped: %s", item.url)
        conn.commit()
        conn.close()
        return saved

    def latest(self, limit: int = 5) -> List[NewsArticle]:
        conn = sqlite3.connect(self.path)
        cursor = conn.cursor()
        cursor.execute(
            """
            SELECT title, url, content, category, published_date, scraped_at
            FROM articles
            ORDER BY created_at DESC
            LIMIT ?
            """,
            (limit,),
        )
        rows = cursor.fetchall()
        conn.close()
        return [
            NewsArticle(
                title=row[0], url=row[1], content=row[2], category=row[3], published_date=row[4], scraped_at=row[5]
            )
            for row in rows
        ]

    def stats(self) -> Dict:
        conn = sqlite3.connect(self.path)
        cursor = conn.cursor()
        cursor.execute("SELECT COUNT(*) FROM articles")
        total = cursor.fetchone()[0]
        cursor.execute("SELECT COUNT(DISTINCT category) FROM articles")
        categories = cursor.fetchone()[0]
        conn.close()
        return {"total": total, "categories": categories}


### Pipeline Runner (Quotes as pseudo-news)
For a quick demo we reuse `quotes.toscrape.com` as the source. Swap the selectors to adapt to a real news site.


In [21]:
class NewsPortalScraper:
    def __init__(self, base_url: str = "http://quotes.toscrape.com", delay: float = 1.0):
        self.base_url = base_url.rstrip("/")
        self.delay = delay

    def scrape(self, pages: int = 2) -> List[NewsArticle]:
        items: List[NewsArticle] = []
        url = f"{self.base_url}/page/1/"
        for page in range(1, pages + 1):
            logger.info("[news] page %s", page)
            html = fetch_html(url, pause=self.delay)
            if not html:
                break
            soup = BeautifulSoup(html, "html.parser")
            for block in soup.find_all("div", class_="quote"):
                text = block.find("span", class_="text").get_text(strip=True)
                author = block.find("small", class_="author").get_text(strip=True)
                article = NewsArticle(
                    title=author,
                    url=url,
                    content=text,
                    category="quotes",
                    published_date=None,
                    scraped_at=time.strftime("%Y-%m-%dT%H:%M:%S"),
                )
                items.append(article)
            next_link = soup.find("li", class_="next")
            if next_link and next_link.find("a"):
                url = f"{self.base_url}{next_link.find('a').get('href')}"
            else:
                break
        return items


def run_pipeline(pages: int = 2):
    scraper = NewsPortalScraper()
    raw_articles = scraper.scrape(pages=pages)
    valid_articles, errors = NewsDataValidator.clean(raw_articles)
    db = NewsDatabase()
    saved = db.save(valid_articles)
    latest = db.latest(limit=3)
    stats = db.stats()
    logger.info("Saved %s articles; %s validation errors", saved, len(errors))
    for article in latest:
        print(f"- {article.title}: {article.content[:80]}...")
    print("Stats:", stats)

# Uncomment to run the pipeline
run_pipeline(pages=2)


2025-12-20 13:21:28,266 - INFO - [news] page 1
2025-12-20 13:21:29,658 - INFO - [news] page 2
2025-12-20 13:21:30,883 - INFO - Saved 2 articles; 5 validation errors


- Albert Einstein: “The world as we have created it is a process of our thinking. It cannot be chan...
- Bob Marley: “You may not be her first, her last, or her only. She loved before she may love ...
Stats: {'total': 2, 'categories': 1}


## Next Steps & Exercises
- Extend the books scraper to export CSV alongside JSON.
- Add search and filtering helpers (e.g., by tag or price range).
- Swap in a real news site: update selectors, add `published_date` parsing, and improve duplicate checks (title + date).
- Introduce proxy rotation and exponential backoff for tougher targets.
- Pair this pipeline with a Django management command to populate a database and render the data in templates.
