topscrape

📖 Landing Page  ·  📦 PyPI  ·  🐛 Issues


Declarative, resilient, and typed web scraping.
Define what you want — topscrape figures out how to get it, even when the site changes.


✨ Why topscrape?

Scrapers break when websites update their HTML. Fixing them means hunting down changed CSS selectors — tedious, repetitive, and always at the worst time.

❌ Standard approach — brittle

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
price = soup.select_one(".price")
if price is None:  # manual fallback when the selector changes
    price = soup.select_one(".cost")

Manual fallback. Manual debugging. Constant maintenance.


✅ The topscrape Approach

from topscrape import ScraperModel, Field

class Product(ScraperModel):
    title: str = Field(selectors=["h1.title", "h1"])
    price: float = Field(
        selectors=[".product-price", "[data-price]", "//span[@itemprop='price']"],
        transform=lambda v: v.replace("$", "").replace(",", ""),
    )
    image: str = Field(selectors=["img.hero"], attr="src", default="")

product = Product.from_url("https://example.com/item/1")
print(product.price)

If .product-price disappears but [data-price] still works, topscrape:

  1. Returns the correct value
  2. Emits a Selector Drift Warning
  3. Keeps your scraper alive

That’s resilience by design.
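The fallback idea itself is easy to picture. A minimal plain-Python sketch (a dict stands in for a parsed DOM; this is an illustration, not topscrape's internals):

```python
import warnings

def first_match(selectors, dom):
    """Return the value for the first selector present; warn if a fallback fired."""
    for i, sel in enumerate(selectors):
        if sel in dom:
            if i > 0:
                warnings.warn(
                    f"[Selector Drift] primary '{selectors[0]}' failed; used '{sel}'"
                )
            return dom[sel]
    return None

dom = {"[data-price]": "$1,299.00"}  # '.product-price' has disappeared
value = first_match([".product-price", "[data-price]"], dom)
# value is "$1,299.00", and a UserWarning was emitted
```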


🚀 Features

  • Declarative models: define fields with Field(selectors=[...])
  • Selector chains: CSS → XPath → Regex fallback
  • Drift detection: warns before total breakage
  • Pydantic validation: strong typing enforced
  • Transforms: clean data before validation
  • Async ready: from_url_async() supported
  • CLI included: quick one-off extraction
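A note on ordering: transform runs on the raw extracted string before type validation. A plain-Python sketch of that pipeline (extract is a hypothetical helper, not the library's API):

```python
def extract(raw, transform, target_type):
    """Apply the transform (if any), then coerce; coercion stands in for Pydantic validation."""
    cleaned = transform(raw) if transform else raw
    return target_type(cleaned)

price = extract("$1,299.00", lambda v: v.replace("$", "").replace(",", ""), float)
# price == 1299.0
```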

📦 Installation

pip install topscrape

Requires Python 3.9+.


⚡ Quick Start


Basic Extraction

from topscrape import ScraperModel, Field

class Article(ScraperModel):
    title: str = Field(selectors=["h1", ".article-title"])
    author: str = Field(selectors=[".byline", "[rel='author']"], default="Unknown")
    content: str = Field(selectors=["article p", ".body-text"])

article = Article.from_html(html_string)
print(article.title)

Fetch From URL

product = Product.from_url("https://example.com/item/1")
print(product.title)

Async Usage

import asyncio

async def main():
    product = await Product.from_url_async("https://example.com/item/1")
    print(product.price)

asyncio.run(main())

Multiple Values

class Page(ScraperModel):
    tags: list[str] = Field(selectors=[".tag"], multiple=True)
    links: list[str] = Field(selectors=["nav a"], multiple=True, attr="href")
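For contrast, here is the boilerplate multiple=True replaces: collecting every match by hand with the stdlib html.parser (the sample HTML is invented):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href of every <a> that appears inside <nav>."""
    def __init__(self):
        super().__init__()
        self.in_nav = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "nav":
            self.in_nav = True
        elif tag == "a" and self.in_nav:
            self.links.append(dict(attrs).get("href"))

    def handle_endtag(self, tag):
        if tag == "nav":
            self.in_nav = False

parser = LinkCollector()
parser.feed('<nav><a href="/home">Home</a><a href="/docs">Docs</a></nav><a href="/x">x</a>')
# parser.links == ["/home", "/docs"]
```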

🛡 Drift Detection

If a fallback selector fires:

UserWarning: [Selector Drift] Field 'price':
primary selector '.product-price' failed;
used fallback '[data-price]'.

Catch programmatically:

import warnings
from topscrape import SelectorDriftWarning

with warnings.catch_warnings(record=True) as w:
    warnings.simplefilter("always")
    product = Product.from_url(url)

drifted = [x for x in w if issubclass(x.category, SelectorDriftWarning)]
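This is the standard warnings pattern and works with any warning category. A self-contained stdlib illustration, with DriftWarning as a stand-in for SelectorDriftWarning:

```python
import warnings

class DriftWarning(UserWarning):
    """Stand-in for topscrape's SelectorDriftWarning."""

def scrape():
    # Pretend scraper that hit a fallback selector
    warnings.warn("[Selector Drift] field 'price' used a fallback", DriftWarning)
    return 9.99

with warnings.catch_warnings(record=True) as w:
    warnings.simplefilter("always")
    value = scrape()

drifted = [x for x in w if issubclass(x.category, DriftWarning)]
# len(drifted) == 1
```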

🖥 CLI Usage

topscrape https://example.com "title"
topscrape https://example.com ".price" "[data-price]"
topscrape https://example.com "a.buy-link" --attr href
topscrape https://example.com "li.feature" --all
topscrape https://example.com "h1" --json

🧩 API Reference

Field

  • selectors: ordered CSS / XPath / Regex list
  • attr: attribute to extract
  • transform: pre-validation function
  • default: fallback value
  • multiple: return all matches

ScraperModel

  • from_html: parse raw HTML
  • from_url: fetch & parse (sync)
  • from_url_async: fetch & parse (async)
  • from_selector: parse an existing selector

👨‍💻 Developer Guide — Run & Contribute via GitHub

Want to run topscrape locally or contribute improvements? Follow this streamlined workflow.


🍴 1. Fork the Repository

  1. Go to: https://github.com/ronaldgosso/topscrape
  2. Click Fork

📥 2. Clone Your Fork

git clone https://github.com/<your-username>/topscrape.git
cd topscrape

Add upstream:

git remote add upstream https://github.com/ronaldgosso/topscrape.git

Sync later with:

git checkout main
git fetch upstream
git merge upstream/main

🐍 3. Create Virtual Environment

python -m venv .venv

Activate:

Mac/Linux

source .venv/bin/activate

Windows

.venv\Scripts\activate

📦 4. Install in Editable Mode

pip install -e ".[dev]"

Editable mode means source changes take effect immediately, without reinstalling.


🧪 5. Run Tests

pytest

No green tests, no merge.


🧹 6. Lint & Type Check

ruff check .
black .
mypy topscrape/

Clean, typed, consistent.


🌿 7. Create Feature Branch

git checkout -b feature/your-feature

Never commit directly to main.


💾 8. Commit Properly

git commit -m "feat: improve fallback logging"

Conventional commit prefixes:

  • feat:
  • fix:
  • docs:
  • refactor:
  • test:

🚀 9. Push & Open Pull Request

git push origin feature/your-feature

Then open a Pull Request against the main branch of ronaldgosso/topscrape.


🧠 Development Principles

topscrape prioritizes:

  • Resilience over cleverness
  • Declarative design
  • Type safety
  • Drift transparency

Every contribution should reduce brittleness.


🏆 Contribution Standards

Pull requests must:

  • Pass CI
  • Include tests (if applicable)
  • Maintain backward compatibility
  • Follow existing style

Quality > Speed.


📄 License

MIT © Ronald Isack Gosso