An async Python crawling framework for discovering URLs, extracting links, and scraping structured content.
OneCrawler helps you build maintainable crawling and extraction workflows without turning every project into a custom scraping script. It provides a shared configuration model, async execution, sitemap discovery, browser-backed link extraction, heuristic content extraction, and optional GenAI extraction for typed outputs.
Recommended workflow:
- Use sitemaps first whenever possible.
- Fall back to browser link extraction when sitemap coverage is missing or dynamic.
- Scrape the final URL list with heuristic extraction by default.
- Use GenAI extraction when you need structured output in a Pydantic schema.
For example, a minimal two-step pipeline (this assumes `settings` is a `CrawlerSettings` instance, as constructed in the full example below):

```python
async with LinkExtractionEngine(settings) as link_engine:
    links = await link_engine.run("https://example.com")

async with ScraperEngine(settings) as scraper_engine:
    records = await scraper_engine.run(links)
```

| Capability | Details |
|---|---|
| Sitemap discovery | Resolves robots.txt, common sitemap paths, nested indexes, .xml.gz, feeds, and HTML fallback |
| Browser link extraction | Shallow and deep Playwright-backed discovery for JavaScript-rendered or sitemap-poor sites |
| URL filtering | Wildcard path filters with include_link_patterns |
| Async performance | Tunable concurrency, retries, timeouts, and crawl limits |
| Content extraction | Heuristic extraction with trafilatura for fast article-like content |
| GenAI extraction | Optional model-assisted extraction for strongly typed Pydantic outputs |
| Output formats | markdown, json, csv, html, python, txt, xml, xmltei |
| Proxy support | Single proxy or rotating proxy pools for browser and sitemap workflows |
| Browser controls | Viewport, user agent, locale, timezone, storage state, and runtime settings |
| Need | Use | Why |
|---|---|---|
| Fast URL discovery from a public site | UniversalSiteMap | Simplest, fastest, and least expensive way to collect URLs |
| Links from one listing page | Shallow LinkExtractionEngine | Reads direct same-site links from the page |
| Recursive discovery through navigation | Deep LinkExtractionEngine | Follows internal links until your configured limit |
| Bulk article or page text extraction | Heuristic ScraperEngine | Deterministic and avoids model cost |
| Typed fields or semantic normalization | GenAI extraction | Produces schema-shaped output for downstream systems |
```bash
pip install onecrawler
```

Install Playwright browser binaries when you use browser-backed crawling or scraping:

```bash
python -m playwright install chromium
```

Install optional GenAI dependencies when you use model-assisted extraction:

```bash
pip install "onecrawler[genai]"
```

> [!NOTE]
> GenAI extraction requires an API key from your chosen provider (OpenAI, Google) or a running Ollama instance. See GenAI Extraction for details.
For local development:
```bash
git clone https://github.com/sayedshaun/onecrawler.git
cd onecrawler
python -m pip install -e ".[dev]"
python -m playwright install chromium
```

OneCrawler provides an optimized Docker image that includes all necessary browser dependencies. This is the recommended way to run the framework in production or CI/CD environments.
```bash
docker pull sayedshaun/onecrawler:latest
docker run -it --rm -v $(pwd):/app sayedshaun/onecrawler:latest python your_script.py
```

> [!NOTE]
> The script must be located at the root of the mounted volume.
```python
import asyncio
import json

from onecrawler import CrawlerSettings, LinkExtractionEngine, ScraperEngine


async def main():
    settings = CrawlerSettings(
        link_extraction_strategy="deep",
        link_extraction_limit=10,
        concurrency=7,
        scraping_strategy="heuristic",
        scraping_output_format="json",
        enable_human_behaviors=True,
    )

    # Step 1: discover URLs with deep link extraction.
    async with LinkExtractionEngine(settings) as link_engine:
        links = await link_engine.run("https://www.example.com/")

    # Step 2: scrape the discovered URLs with heuristic extraction.
    async with ScraperEngine(settings) as scraper_engine:
        results = await scraper_engine.run(links)

    with open("output.json", "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False, indent=4)


if __name__ == "__main__":
    asyncio.run(main())
```

> [!TIP]
> Always set `link_extraction_limit` when crawling broad sites. Without it, discovery can run indefinitely on large domains.
Use browser extraction when sitemaps are incomplete, unavailable, or unable to expose JavaScript-rendered links.
```python
import asyncio

from onecrawler import CrawlerSettings, LinkExtractionEngine


async def main():
    settings = CrawlerSettings(
        link_extraction_strategy="deep",
        link_extraction_limit=250,
        include_link_patterns=["/news/*"],  # restrict discovery to the /news/ section
        concurrency=5,
    )
    async with LinkExtractionEngine(settings) as engine:
        links = await engine.run("https://example.com/news")
        print(f"Collected {len(links)} links")


if __name__ == "__main__":
    asyncio.run(main())
```

> [!TIP]
> Use `include_link_patterns` to keep discovery focused on relevant paths. For example, `["/blog/*", "/docs/*"]` prevents the crawler from wandering into auth pages, admin routes, or unrelated sections.

> [!NOTE]
> Deep extraction follows internal links recursively. Use the shallow strategy when you only need links visible on a single listing page; it's significantly faster.
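If the shallow route fits, a minimal sketch (assuming the strategy value is the literal string `"shallow"`, mirroring `"deep"` above):

```python
import asyncio

from onecrawler import CrawlerSettings, LinkExtractionEngine


async def main():
    # Shallow: collect only the links visible on this single listing page,
    # without following them recursively.
    settings = CrawlerSettings(link_extraction_strategy="shallow")
    async with LinkExtractionEngine(settings) as engine:
        links = await engine.run("https://example.com/blog")
        print(f"Collected {len(links)} links")


if __name__ == "__main__":
    asyncio.run(main())
```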
Use GenAI extraction when you need a strongly typed response shape instead of plain content.
pip install "onecrawler[genai]"import asyncio
from typing import Optional
from pydantic import BaseModel
from onecrawler import CrawlerSettings, GenerativeAISettings, ScraperEngine
class ArticleSummary(BaseModel):
title: str
author: Optional[str] = None
published_at: Optional[str] = None
summary: str
topics: list[str]
async def main():
settings = CrawlerSettings(
scraping_strategy="genai",
scraping_output_format="json",
genai=GenerativeAISettings(
provider="openai",
model_name="gpt-4o-mini",
api_key="YOUR_API_KEY",
output_schema=ArticleSummary,
),
concurrency=2,
request_timeout=30,
)
async with ScraperEngine(settings) as scraper:
result = await scraper.run("https://example.com/articles/story")
print(result.model_dump() if hasattr(result, "model_dump") else result)
if __name__ == "__main__":
asyncio.run(main())Tip
Keep concurrency low (2–4) for GenAI extraction. Each page triggers a model call; high concurrency can exhaust rate limits quickly and inflate costs.
> [!WARNING]
> Never hardcode your API key in source files. Use environment variables or a secrets manager instead:
>
> ```python
> import os
>
> api_key = os.environ["OPENAI_API_KEY"]
> ```

| Provider | Requires | Models |
|---|---|---|
| OpenAI | `api_key` | GPT-4o, GPT-4o-mini, etc. |
| Google | `api_key` | Gemini models |
| Ollama | `base_url` (no key needed) | Any locally hosted model |
```python
settings = CrawlerSettings(
    scraping_strategy="genai",
    genai=GenerativeAISettings(
        provider="ollama",
        model_name="llama3:8b",
        base_url="http://localhost:11434/",
        output_schema=ArticleSummary,
    ),
)
```

> [!NOTE]
> Ollama requires a running local instance. Install it from ollama.com and pull your model (`ollama pull llama3:8b`) before running.
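For example, assuming a standard local Ollama install:

```bash
# Pull the model referenced in the settings above before the first run.
ollama pull llama3:8b
```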
Attach one proxy or a rotating proxy pool directly to CrawlerSettings.
```python
from onecrawler import CrawlerSettings, ProxySettings

settings = CrawlerSettings(
    proxies=[
        ProxySettings(server="http://proxy-1.example:8080"),
        ProxySettings(
            server="http://proxy-2.example:8080",
            username="user",
            password="pass",
        ),
    ],
    proxy_rotation="round_robin",
)
```

Use `proxy=ProxySettings(...)` for a single proxy, or `proxies=[...]` with `proxy_rotation` for a pool.
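A minimal single-proxy sketch using the `proxy` field mentioned above:

```python
from onecrawler import CrawlerSettings, ProxySettings

# Single proxy: pass one ProxySettings via `proxy` instead of a `proxies` pool.
settings = CrawlerSettings(
    proxy=ProxySettings(server="http://proxy-1.example:8080"),
)
```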
> [!TIP]
> `round_robin` rotation distributes requests evenly across your proxy pool. For rate-limited targets, pair this with a modest concurrency value and a `request_delay` to avoid triggering bans.
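A sketch of that conservative profile. `request_delay` is named in the tip above, but its units aren't specified here, so the value below assumes seconds:

```python
from onecrawler import CrawlerSettings, ProxySettings

# Rotate proxies while keeping parallelism modest and pacing requests.
settings = CrawlerSettings(
    proxies=[
        ProxySettings(server="http://proxy-1.example:8080"),
        ProxySettings(server="http://proxy-2.example:8080"),
    ],
    proxy_rotation="round_robin",
    concurrency=3,      # modest parallelism for rate-limited targets
    request_delay=1.0,  # assumed to take seconds; adjust to the target's limits
)
```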
> [!IMPORTANT]
> Split URL discovery and scraping into separate pipeline steps. Collecting all URLs first gives you a checkpoint to resume from if scraping fails partway through, without re-running discovery.
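A minimal sketch of that checkpoint pattern, persisting discovered links to disk between the two steps (the file name and layout are illustrative, and this assumes `run()` returns a JSON-serializable list of URLs):

```python
import asyncio
import json
import os

from onecrawler import CrawlerSettings, LinkExtractionEngine, ScraperEngine

LINKS_FILE = "links.json"  # illustrative checkpoint location


async def main():
    settings = CrawlerSettings(
        link_extraction_strategy="deep",
        link_extraction_limit=100,
    )

    # Step 1: run discovery once and persist the URL list as a checkpoint.
    if os.path.exists(LINKS_FILE):
        with open(LINKS_FILE, encoding="utf-8") as f:
            links = json.load(f)
    else:
        async with LinkExtractionEngine(settings) as engine:
            links = await engine.run("https://example.com")
        with open(LINKS_FILE, "w", encoding="utf-8") as f:
            json.dump(links, f)

    # Step 2: scraping can now be retried without re-running discovery.
    async with ScraperEngine(settings) as scraper:
        results = await scraper.run(links)


if __name__ == "__main__":
    asyncio.run(main())
```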
> [!TIP]
> Start with `UniversalSiteMap` before reaching for browser extraction. Sitemap-based discovery is faster, cheaper, and more complete on well-maintained sites. Fall back to `LinkExtractionEngine` only when sitemaps are missing or stale.
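The `UniversalSiteMap` interface isn't shown in this document; the sketch below assumes it follows the same async-context-manager and `run(url)` pattern as `LinkExtractionEngine` and `ScraperEngine`:

```python
import asyncio

from onecrawler import CrawlerSettings, UniversalSiteMap  # import path assumed


async def main():
    settings = CrawlerSettings()
    # Assumed interface, mirroring the other engines in this README.
    async with UniversalSiteMap(settings) as sitemap:
        urls = await sitemap.run("https://example.com")
        print(f"Discovered {len(urls)} URLs from sitemaps")


if __name__ == "__main__":
    asyncio.run(main())
```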
> [!TIP]
> Use heuristic scraping (`scraping_strategy="heuristic"`) for bulk content extraction. Reserve GenAI extraction for cases where you genuinely need structured, schema-shaped output; it adds latency and cost at scale.
> [!CAUTION]
> Respect `robots.txt` and a site's terms of service before crawling. OneCrawler does not enforce crawl policies automatically; you are responsible for staying within allowed access patterns.
Released under the MIT License. See LICENSE for full terms.