Declarative, resilient, and typed web scraping.
Define what you want — topscrape figures out how to get it, even when the site changes.
Scrapers break when websites update their HTML. Fixing them means hunting down changed CSS selectors — tedious, repetitive, and always at the worst time.

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    price = soup.select_one(".price")
    if not price:
        # Hand-written fallback for when the class name changes
        price = soup.select_one(".cost")

Manual fallback. Manual debugging. Constant maintenance.


    from topscrape import ScraperModel, Field

    class Product(ScraperModel):
        title: str = Field(selectors=["h1.title", "h1"])
        price: float = Field(
            selectors=[".product-price", "[data-price]", "//span[@itemprop='price']"],
            transform=lambda v: v.replace("$", "").replace(",", ""),
        )
        image: str = Field(selectors=["img.hero"], attr="src", default="")

    product = Product.from_url("https://example.com/item/1")
    print(product.price)

If `.product-price` disappears but `[data-price]` still works:
- topscrape returns the correct value
- Emits a Selector Drift Warning
- Keeps your scraper alive
That’s resilience by design.
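
The fallback behavior described above can be pictured with a short sketch. This is not topscrape's implementation: it swaps CSS and XPath selectors for plain regular expressions so that it runs on the standard library alone, and the names `DriftWarning` and `extract_first` are hypothetical.

```python
import re
import warnings


class DriftWarning(UserWarning):
    """Hypothetical stand-in for a selector drift warning."""


def extract_first(html, patterns, field):
    """Try each regex pattern in order and return the first captured group.

    Warns when the primary (first) pattern failed but a later one matched.
    """
    for index, pattern in enumerate(patterns):
        match = re.search(pattern, html)
        if match is None:
            continue
        if index > 0:
            warnings.warn(
                f"[Selector Drift] Field {field!r}: primary pattern "
                f"{patterns[0]!r} failed; used fallback {pattern!r}.",
                DriftWarning,
                stacklevel=2,
            )
        return match.group(1)
    raise LookupError(f"no pattern matched for field {field!r}")


# The page no longer has class="product-price", but data-price is still there.
html = '<span data-price="19.99">$19.99</span>'
patterns = [
    r'class="product-price">\$?([\d.,]+)<',  # primary
    r'data-price="([\d.,]+)"',               # fallback
]

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    price = float(extract_first(html, patterns, "price"))

print(price, len(caught))
```

The value is still returned, and exactly one warning is recorded because the second pattern, not the first, produced the match.
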
| Feature | Description |
|---|---|
| Declarative models | Define fields with Field(selectors=[...]) |
| Selector chains | CSS → XPath → Regex fallback |
| Drift detection | Warns before total breakage |
| Pydantic validation | Strong typing enforced |
| Transforms | Clean data before validation |
| Async ready | from_url_async() supported |
| CLI included | Quick one-off extraction |

    pip install topscrape

Requires Python 3.9+.

Parse raw HTML:

    from topscrape import ScraperModel, Field

    class Article(ScraperModel):
        title: str = Field(selectors=["h1", ".article-title"])
        author: str = Field(selectors=[".byline", "[rel='author']"], default="Unknown")
        content: str = Field(selectors=["article p", ".body-text"])

    article = Article.from_html(html_string)
    print(article.title)

Fetch and parse from a URL (sync):

    product = Product.from_url("https://example.com/item/1")
    print(product.title)

Fetch and parse asynchronously:

    import asyncio

    async def main():
        product = await Product.from_url_async("https://example.com/item/1")
        print(product.price)

    asyncio.run(main())

Extract every match with `multiple=True`:

    class Page(ScraperModel):
        tags: list[str] = Field(selectors=[".tag"], multiple=True)
        links: list[str] = Field(selectors=["nav a"], multiple=True, attr="href")

If a fallback selector fires:

    UserWarning: [Selector Drift] Field 'price':
      primary selector '.product-price' failed;
      used fallback '[data-price]'.

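
Because the drift message is an ordinary Python warning, the standard `warnings` filters apply to it. Here is a minimal sketch of failing fast in a test suite or CI job. It assumes `SelectorDriftWarning` is a `UserWarning` subclass, as the message above suggests. The class and the `scrape` function defined here are local stand-ins so that the snippet runs without topscrape installed.

```python
import warnings


class SelectorDriftWarning(UserWarning):
    """Local stand-in; in real code, import it from topscrape."""


def scrape():
    """Hypothetical scrape whose primary selector has drifted."""
    warnings.warn(
        "[Selector Drift] Field 'price': primary selector failed",
        SelectorDriftWarning,
    )
    return 19.99


def scrape_strict():
    """Run scrape() with drift warnings escalated to exceptions."""
    with warnings.catch_warnings():
        # "error" turns matching warnings into raised exceptions.
        warnings.simplefilter("error", category=SelectorDriftWarning)
        return scrape()


try:
    scrape_strict()
    drift_detected = False
except SelectorDriftWarning as exc:
    drift_detected = True
    print(f"drift detected: {exc}")
```

The filter is scoped to the `with` block, so code outside it keeps the default behavior of warning without raising.
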
Catch programmatically:

    import warnings
    from topscrape import SelectorDriftWarning

    with warnings.catch_warnings(record=True) as w:
        warnings.simplefilter("always")
        product = Product.from_url(url)

    drifted = [x for x in w if issubclass(x.category, SelectorDriftWarning)]

Quick one-off extraction from the command line:

    topscrape https://example.com "title"
    topscrape https://example.com ".price" "[data-price]"
    topscrape https://example.com "a.buy-link" --attr href
    topscrape https://example.com "li.feature" --all
    topscrape https://example.com "h1" --json

`Field` parameters:

| Parameter | Description |
|---|---|
| selectors | Ordered CSS / XPath / Regex list |
| attr | Attribute to extract |
| transform | Pre-validation function |
| default | Fallback value |
| multiple | Return all matches |
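
One plausible way these parameters interact is sketched below. The order of operations (pick matches, apply `transform`, then validate the type, with `default` used only when nothing matched) is an assumption drawn from the descriptions in the table, not a statement about topscrape's internals. The helper `resolve` and its `cast` argument are hypothetical.

```python
_MISSING = object()  # sentinel: distinguishes "no default" from default=None


def resolve(matches, cast, transform=None, default=_MISSING, multiple=False):
    """Turn raw matched strings into a typed field value.

    matches: strings already extracted by a selector (possibly empty).
    cast: type used for validation, e.g. float.
    """
    if not matches:
        if default is _MISSING:
            raise ValueError("no match and no default")
        return default
    selected = matches if multiple else matches[:1]
    # transform runs on the raw string, before type validation
    cleaned = [transform(m) if transform else m for m in selected]
    typed = [cast(c) for c in cleaned]
    return typed if multiple else typed[0]


def strip_currency(value):
    """Same cleanup as the lambda in the Product example."""
    return value.replace("$", "").replace(",", "")


price = resolve(["$1,299.50"], float, transform=strip_currency)
image = resolve([], str, default="")
tags = resolve(["python", "scraping"], str, multiple=True)
print(repr(price), repr(image), repr(tags))
```

Running the transform before the cast is what lets a string such as `"$1,299.50"` validate as a `float`.
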

`ScraperModel` methods:

| Method | Description |
|---|---|
| from_html | Parse raw HTML |
| from_url | Fetch & parse (sync) |
| from_url_async | Fetch & parse (async) |
| from_selector | Parse existing selector |
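
Since `from_url_async` is awaitable, several pages can be fetched concurrently with `asyncio.gather`. The sketch below assumes nothing about topscrape beyond that. `fake_from_url_async` is a hypothetical stand-in that sleeps instead of doing network I/O, so the snippet runs offline. The `limit` parameter is also an illustrative choice.

```python
import asyncio


async def fake_from_url_async(url):
    """Hypothetical stand-in for Product.from_url_async(url)."""
    await asyncio.sleep(0.01)  # simulates network latency
    return {"url": url}


async def fetch_all(urls, limit=5):
    """Fetch all URLs concurrently, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def fetch_one(url):
        async with semaphore:
            return await fake_from_url_async(url)

    # gather preserves the order of the input URLs
    return await asyncio.gather(*(fetch_one(u) for u in urls))


urls = [f"https://example.com/item/{i}" for i in range(1, 4)]
results = asyncio.run(fetch_all(urls))
print([r["url"] for r in results])
```

Capping concurrency with a semaphore is a common courtesy toward the target site. Swap the stand-in for the real coroutine to use it with a model.
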
Want to run topscrape locally or contribute improvements? Follow this streamlined workflow.
- Go to: https://github.com/ronaldgosso/topscrape
- Click Fork

Clone your fork:

    git clone https://github.com/<your-username>/topscrape.git
    cd topscrape

Add upstream:

    git remote add upstream https://github.com/ronaldgosso/topscrape.git

Sync later with:

    git fetch upstream
    git merge upstream/main

Create a virtual environment:

    python -m venv .venv

Activate:

Mac/Linux

    source .venv/bin/activate

Windows

    .venv\Scripts\activate

Install the package with its dev dependencies:

    pip install -e ".[dev]"

Editable mode ensures changes apply instantly.

Run the tests:

    pytest

No green tests, no merge.

Lint, format, and type-check:

    ruff check .
    black .
    mypy topscrape/

Clean, typed, consistent.

Create a feature branch:

    git checkout -b feature/your-feature

Never commit directly to main.

Commit with a conventional message:

    git commit -m "feat: improve fallback logging"

Conventional commit prefixes:
- feat:
- fix:
- docs:
- refactor:
- test:
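
A commit message can be checked against the prefixes listed above before pushing. The following is a minimal sketch of such a check. It is not part of the repository's tooling as far as this README states, and the function name `is_conventional` is hypothetical. The optional `(scope)` part follows the general Conventional Commits convention.

```python
import re

# Prefixes from the list above, with an optional "(scope)" after the prefix.
PREFIXES = ("feat", "fix", "docs", "refactor", "test")
PATTERN = re.compile(r"^(?:" + "|".join(PREFIXES) + r")(?:\([\w-]+\))?: \S")


def is_conventional(message):
    """Return True if the first line of a commit message uses a listed prefix."""
    first_line = message.splitlines()[0] if message else ""
    return PATTERN.match(first_line) is not None


for msg in ("feat: improve fallback logging", "improve fallback logging"):
    print(f"{msg!r}: {is_conventional(msg)}")
```

The pattern requires a colon, a single space, and a non-empty description, so `feat:no space` is rejected along with unprefixed messages.
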

Push the branch:

    git push origin feature/your-feature

Then open a Pull Request against the `main` branch of ronaldgosso/topscrape.

topscrape prioritizes:
- Resilience over cleverness
- Declarative design
- Type safety
- Drift transparency
Every contribution should reduce brittleness.
Pull requests must:
- Pass CI
- Include tests (if applicable)
- Maintain backward compatibility
- Follow existing style
Quality > Speed.
MIT © Ronald Isack Gosso
