topscrape

📖 Landing Page  ·  📦 PyPI  ·  🐛 Issues


Declarative, resilient, and typed web scraping.
Define what you want — topscrape figures out how to get it, even when the site changes.


✨ Why topscrape?

Scrapers break when websites update their HTML. Fixing them means hunting down changed CSS selectors — tedious, repetitive, and always at the worst time.

❌ Standard approach — brittle

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
price = soup.select_one(".price")
if price is None:  # manual fallback when the selector changes
    price = soup.select_one(".cost")

Manual fallback. Manual debugging. Constant maintenance.


✅ The topscrape Approach

from topscrape import ScraperModel, Field

class Product(ScraperModel):
    title: str = Field(selectors=["h1.title", "h1"])
    price: float = Field(
        selectors=[".product-price", "[data-price]", "//span[@itemprop='price']"],
        transform=lambda v: v.replace("$", "").replace(",", ""),
    )
    image: str = Field(selectors=["img.hero"], attr="src", default="")

product = Product.from_url("https://example.com/item/1")
print(product.price)

If .product-price disappears but [data-price] still works, topscrape:

  1. Returns the correct value
  2. Emits a Selector Drift Warning
  3. Keeps your scraper alive

That’s resilience by design.
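The fallback idea itself is easy to picture. A minimal plain-Python sketch (a dict stands in for a parsed DOM; this is an illustration, not topscrape's internals):

```python
import warnings

def first_match(selectors, dom):
    """Return the value for the first selector present; warn if a fallback fired."""
    for i, sel in enumerate(selectors):
        if sel in dom:
            if i > 0:
                warnings.warn(
                    f"[Selector Drift] primary '{selectors[0]}' failed; used '{sel}'"
                )
            return dom[sel]
    return None

dom = {"[data-price]": "$1,299.00"}  # '.product-price' has disappeared
value = first_match([".product-price", "[data-price]"], dom)
# value is "$1,299.00", and a UserWarning was emitted
```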


🚀 Features

  • Declarative models: define fields with Field(selectors=[...])
  • Selector chains: CSS → XPath → Regex fallback
  • Drift detection: warns before total breakage
  • Pydantic validation: strong typing enforced
  • Transforms: clean data before validation
  • Async ready: from_url_async() supported
  • CLI included: quick one-off extraction
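A note on ordering: transform runs on the raw extracted string before type validation. A plain-Python sketch of that pipeline (extract is a hypothetical helper, not the library's API):

```python
def extract(raw, transform, target_type):
    """Apply the transform (if any), then coerce; coercion stands in for Pydantic validation."""
    cleaned = transform(raw) if transform else raw
    return target_type(cleaned)

price = extract("$1,299.00", lambda v: v.replace("$", "").replace(",", ""), float)
# price == 1299.0
```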

📦 Installation

pip install topscrape

Requires Python 3.9+.


⚡ Quick Start


Basic Extraction

from topscrape import ScraperModel, Field

class Article(ScraperModel):
    title: str = Field(selectors=["h1", ".article-title"])
    author: str = Field(selectors=[".byline", "[rel='author']"], default="Unknown")
    content: str = Field(selectors=["article p", ".body-text"])

article = Article.from_html(html_string)
print(article.title)

Fetch From URL

product = Product.from_url("https://example.com/item/1")
print(product.title)

Async Usage

import asyncio

async def main():
    product = await Product.from_url_async("https://example.com/item/1")
    print(product.price)

asyncio.run(main())

Multiple Values

class Page(ScraperModel):
    tags: list[str] = Field(selectors=[".tag"], multiple=True)
    links: list[str] = Field(selectors=["nav a"], multiple=True, attr="href")
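For contrast, here is the boilerplate multiple=True replaces: collecting every match by hand with the stdlib html.parser (the sample HTML is invented):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href of every <a> that appears inside <nav>."""
    def __init__(self):
        super().__init__()
        self.in_nav = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "nav":
            self.in_nav = True
        elif tag == "a" and self.in_nav:
            self.links.append(dict(attrs).get("href"))

    def handle_endtag(self, tag):
        if tag == "nav":
            self.in_nav = False

parser = LinkCollector()
parser.feed('<nav><a href="/home">Home</a><a href="/docs">Docs</a></nav><a href="/x">x</a>')
# parser.links == ["/home", "/docs"]
```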

🛡 Drift Detection

If a fallback selector fires:

UserWarning: [Selector Drift] Field 'price':
primary selector '.product-price' failed;
used fallback '[data-price]'.

Catch programmatically:

import warnings
from topscrape import SelectorDriftWarning

with warnings.catch_warnings(record=True) as w:
    warnings.simplefilter("always")
    product = Product.from_url(url)

drifted = [x for x in w if issubclass(x.category, SelectorDriftWarning)]
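This is the standard warnings pattern and works with any warning category. A self-contained stdlib illustration, with DriftWarning as a stand-in for SelectorDriftWarning:

```python
import warnings

class DriftWarning(UserWarning):
    """Stand-in for topscrape's SelectorDriftWarning."""

def scrape():
    # Pretend scraper that hit a fallback selector
    warnings.warn("[Selector Drift] field 'price' used a fallback", DriftWarning)
    return 9.99

with warnings.catch_warnings(record=True) as w:
    warnings.simplefilter("always")
    value = scrape()

drifted = [x for x in w if issubclass(x.category, DriftWarning)]
# len(drifted) == 1
```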

🖥 CLI Usage

topscrape https://example.com "title"
topscrape https://example.com ".price" "[data-price]"
topscrape https://example.com "a.buy-link" --attr href
topscrape https://example.com "li.feature" --all
topscrape https://example.com "h1" --json

🧩 API Reference

Field

  • selectors: ordered CSS / XPath / Regex list
  • attr: attribute to extract
  • transform: pre-validation function
  • default: fallback value
  • multiple: return all matches

ScraperModel

  • from_html: parse raw HTML
  • from_url: fetch & parse (sync)
  • from_url_async: fetch & parse (async)
  • from_selector: parse an existing selector

👨‍💻 Developer Guide — Run & Contribute via GitHub

Want to run topscrape locally or contribute improvements? Follow this streamlined workflow.


🍴 1. Fork the Repository

  1. Go to: https://github.com/ronaldgosso/topscrape
  2. Click Fork

📥 2. Clone Your Fork

git clone https://github.com/<your-username>/topscrape.git
cd topscrape

Add upstream:

git remote add upstream https://github.com/ronaldgosso/topscrape.git

Sync later with:

git checkout main
git fetch upstream
git merge upstream/main

🐍 3. Create Virtual Environment

python -m venv .venv

Activate:

Mac/Linux

source .venv/bin/activate

Windows

.venv\Scripts\activate

📦 4. Install in Editable Mode

pip install -e ".[dev]"

Editable mode means source changes take effect immediately, without reinstalling.


🧪 5. Run Tests

pytest

No green tests, no merge.


🧹 6. Lint & Type Check

ruff check .
black .
mypy topscrape/

Clean, typed, consistent.


🌿 7. Create Feature Branch

git checkout -b feature/your-feature

Never commit directly to main.


💾 8. Commit Properly

git commit -m "feat: improve fallback logging"

Conventional commit prefixes:

  • feat:
  • fix:
  • docs:
  • refactor:
  • test:

🚀 9. Push & Open Pull Request

git push origin feature/your-feature

Then open a Pull Request against the main branch of ronaldgosso/topscrape.


🧠 Development Principles

topscrape prioritizes:

  • Resilience over cleverness
  • Declarative design
  • Type safety
  • Drift transparency

Every contribution should reduce brittleness.


🏆 Contribution Standards

Pull requests must:

  • Pass CI
  • Include tests (if applicable)
  • Maintain backward compatibility
  • Follow existing style

Quality > Speed.


📄 License

MIT © Ronald Isack Gosso