Prompt-driven web scraping powered by AI. No selectors. No CSS paths. Just describe what you want in plain English.
ScraperAI uses a 3-phase pipeline — Fetch, Understand, Extract — to scrape any website intelligently.
Fetch (ScraperAPI) → Understand (AI) → Extract (AI) → Structured JSON
- You write a prompt describing what to scrape, how to navigate, and what data you want
- ScraperAPI fetches the rendered HTML (handles JS, bot protection, geo-restrictions)
- Phase 2 AI reads the HTML and produces clean markdown — finding all content, links, and images regardless of where they're hidden
- Phase 3 AI extracts structured JSON data from the clean markdown based on your prompt
The AI handles pagination, detail pages, and multi-level crawling automatically based on your prompt.
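As a rough mental model, the three phases can be sketched as plain functions. Everything below (the function names, the canned HTML, the naive tag stripper) is illustrative, not ScraperAI's real internals:

```python
import re

# Illustrative sketch of the 3-phase pipeline; names and logic are
# stand-ins, not the package's actual implementation.
def fetch(url: str) -> str:
    """Phase 1: return rendered HTML (canned sample instead of ScraperAPI)."""
    return "<html><body><div class='card'>Widget - $9.99</div></body></html>"

def understand(html: str) -> str:
    """Phase 2: reduce HTML to clean text/markdown (naive tag stripper)."""
    text = re.sub(r"<[^>]+>", " ", html)
    return " ".join(text.split())

def extract(markdown: str, prompt: str) -> list[dict]:
    """Phase 3: produce structured data per the prompt (stubbed parse)."""
    name, price = markdown.rsplit(" - ", 1)
    return [{"name": name, "price": price}]

markdown = understand(fetch("https://example.com/catalog"))
data = extract(markdown, "Extract product name and price")
print(data)  # [{'name': 'Widget', 'price': '$9.99'}]
```

In the real pipeline, `understand` and `extract` are LLM calls rather than regex and string splits.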
- Prompt-driven — No code changes needed for different sites. Write a prompt, get data.
- Multi-level crawling — Listing pages → Detail pages → Sub-pages. BFS with automatic pagination.
- Dual-model architecture — Free Phase 2 model for page understanding + cloud LLM for precise extraction
- Single-model mode — Use one provider for everything (skip Phase 2)
- Image discovery — Finds images hidden in JavaScript galleries, CSS `background-image` rules, and carousels
- Data merging — Detail-page data automatically merged into parent listing items
- Retry + fallback — Per-chunk retry with exponential backoff; automatic failover to a backup provider
- Crawl cache — Resume interrupted crawls without re-fetching already-processed pages
- Request pacing — Configurable delay between fetches to respect rate limits
- Five providers — Anthropic Claude, OpenAI GPT-4o, Google Gemini, Groq, Ollama (local); mix & match
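The retry-plus-fallback behaviour listed above can be sketched as follows. This is a simplified illustration, not the package's actual retry code; the delays and the generic exception handling are assumptions:

```python
import time

def with_retries(fn, attempts=2, base_delay=1.0, fallback=None):
    """Call fn; on failure retry with exponential backoff, then fail over."""
    for i in range(attempts + 1):
        try:
            return fn()
        except Exception:
            if i < attempts:
                time.sleep(base_delay * (2 ** i))  # 1s, 2s, 4s, ...
            elif fallback is not None:
                return fallback()                  # e.g. a backup provider
            else:
                raise
```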
pip install -e ".[all]"Copy .env.example to .env and add your API keys:
SCRAPER_API_KEY=your_scraper_api_key
# Cloud LLM for extraction (pick one or more)
ANTHROPIC_API_KEY=your_anthropic_key
OPENAI_API_KEY=your_openai_key
# Free Phase 2 processors (pick one)
GEMINI_API_KEY=your_gemini_key # Recommended — free, fast, large context
# OLLAMA_BASE_URL=http://localhost:11434
# OLLAMA_MODEL=qwen2.5:14b
# Defaults
DEFAULT_PROVIDER=anthropic
CLAUDE_MODEL=claude-haiku-4-5-20251001
GEMINI_MODEL=gemini-2.5-flash
# Resilience (optional)
EXTRACTION_RETRIES=2 # retry attempts per chunk (default: 2)
FALLBACK_PROVIDER=openai # try this provider if primary fails
FETCH_DELAY=1.0 # seconds between page fetchesCreate prompts/my_scrape.txt:
```text
You are scraping a product catalog website.

Step 1 - Listing pages:
Go to the catalog page. You will find products displayed as cards.
Each card has a product name, price, and link to its detail page.
Follow all pagination links until the last page.

For each product, extract:
{
  "name": "Example Product",
  "price": "$29.99",
  "detail_url": "https://example.com/product/123"
}

Put pagination links in next_urls.
Put product detail URLs in detail_urls.

Step 2 - Detail pages:
Visit each product's detail page. Extract full specs:
{
  "name": "Example Product",
  "price": "$29.99",
  "detail_url": "https://example.com/product/123",
  "images": ["https://example.com/img1.jpg"],
  "description": "Full product description...",
  "sku": "ABC-123",
  "category": "Electronics"
}
```
```bash
# Single-model mode (cloud LLM handles everything)
scraper-ai "https://example.com/catalog" prompts/my_scrape.txt \
  --provider anthropic -o data/output.json

# Dual-model mode (Gemini understands pages, Anthropic extracts data)
scraper-ai "https://example.com/catalog" prompts/my_scrape.txt \
  --provider anthropic --processor gemini -o data/output.json
```

| Provider | Use as | Key needed | Free tier |
|---|---|---|---|
| Anthropic (Claude Haiku) | Phase 3 extractor | ANTHROPIC_API_KEY | No (pay per token) |
| OpenAI (GPT-4o) | Phase 3 extractor | OPENAI_API_KEY | No (pay per token) |
| Gemini (Flash) | Phase 2 processor | GEMINI_API_KEY | Yes — 250K TPM, 250 RPD |
| Groq (Llama) | Phase 2/3 | GROQ_API_KEY | Yes — 6K TPM (limited) |
| Ollama (any model) | Phase 2 processor | None (local) | Yes (runs on your machine) |
Recommended setup: Gemini Flash for Phase 2 (free, fast, 1M token context) + Claude Haiku for Phase 3 (cheap, accurate).
```
scraper-ai <url> <prompt> [options]

Arguments:
  url                Starting URL to scrape
  prompt             Prompt text or path to a .txt/.md file

Options:
  -p, --provider     AI provider for extraction: anthropic, openai, gemini, groq, ollama
  --processor        AI provider for page understanding (dual-model mode)
  --fallback         Fallback provider if primary extraction fails (e.g. openai)
  --max-pages N      Safety limit on pages to crawl (default: 100)
  --delay SECONDS    Seconds between page fetches (default: 1.0)
  --cache            Cache URL results so interrupted crawls can resume
  --clear-cache      Clear the cache before starting
  --auto-scroll      Enable infinite-scroll handling
  --no-render        Disable JavaScript rendering
  -o, --output FILE  Output file path (default: stdout)
  -v, --verbose      Enable debug logging
```
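The `--cache` resume behaviour boils down to a URL-keyed result store. Below is a hypothetical sketch, not the package's actual `CrawlCache` class; it illustrates the put/get/corruption-handling behaviour the tests describe:

```python
import json
import os
import tempfile

class SimpleCrawlCache:
    """Minimal URL -> result store so interrupted crawls can resume."""
    def __init__(self, path):
        self.path = path
        self.store = {}
        if os.path.exists(path):
            try:
                with open(path) as f:
                    self.store = json.load(f)
            except json.JSONDecodeError:
                self.store = {}            # corrupt cache file: start fresh

    def get(self, url):
        return self.store.get(url)         # None means "not yet fetched"

    def put(self, url, result):
        self.store[url] = result
        with open(self.path, "w") as f:    # persist after every page
            json.dump(self.store, f)

path = os.path.join(tempfile.mkdtemp(), "cache.json")
cache = SimpleCrawlCache(path)
cache.put("https://example.com/p1", {"name": "Widget"})
print(SimpleCrawlCache(path).get("https://example.com/p1"))  # {'name': 'Widget'}
```

On resume, the crawler would check `get(url)` before fetching and skip any page with a cached result.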
ScraperAI can be used as a library — no CLI needed.
pip install -e ".[anthropic,gemini]" # or .[all]from scraper_ai import scrape
result = scrape(
"https://example.com/catalog",
"Extract product name, price, and description",
provider="anthropic",
)
print(f"Scraped {result.pages_crawled} pages")
for item in result.data:
print(item["name"], item["price"])from scraper_ai import scrape
result = scrape(
"https://example.com/catalog",
"prompts/my_scrape.txt", # path to a .txt or .md file
provider="anthropic",
)result = scrape(
"https://example.com/catalog",
"prompts/my_scrape.txt",
provider="anthropic",
processor="gemini",
)Any Settings field can be passed as a keyword argument:
result = scrape(
"https://example.com/catalog",
"prompts/my_scrape.txt",
provider="anthropic",
processor="gemini",
max_pages=20,
fetch_delay=2.0,
cache_enabled=True,
fallback_provider="openai",
)from scraper_ai import scrape
import json
result = scrape("https://example.com/catalog", "Extract all products", provider="anthropic")
# As a list of dicts
print(result.data)
# Serialise to JSON
print(json.dumps(result.model_dump(), indent=2))
# Key metadata
print(result.pages_crawled) # int — number of pages visited
print(result.provider) # str — provider used
print(result.url) # str — starting URLfrom scraper_ai import (
scrape, # convenience function (recommended)
crawl, # lower-level crawl function
CrawlResult, # return type
PageResult, # per-page result type
Settings, # configuration dataclass
get_provider, # instantiate a provider by name
list_providers, # list available provider names
)scraper-ai "https://example.com/catalog" prompts/my_scrape.txt \
--provider anthropic --processor gemini -o data/products.jsonOutput:
```json
{
  "pages_crawled": 4,
  "data": [
    {
      "name": "Wireless Headphones",
      "price": "$79.99",
      "detail_url": "https://example.com/product/42",
      "images": ["https://...img1.jpg", "https://...img2.jpg"],
      "description": "Noise-cancelling over-ear headphones with 30h battery...",
      "sku": "WH-42",
      "category": "Electronics"
    }
  ]
}
```

```bash
scraper-ai "https://ollama.com/search?q=scrapping" prompts/ollama_models.txt \
  --provider anthropic -o data/ollama_models.json
```

The prompt is the brain of ScraperAI. Tips:
- Use few-shot JSON examples — Show the exact field names and formats you want
- Describe each level — Step 1 for listing pages, Step 2 for detail pages
- Be explicit about URLs — "Put pagination in next_urls, detail URLs in detail_urls"
- Describe the page structure — "Cards with basic info", "Gallery with multiple images"
- Specify what NOT to do — "Do NOT follow pagination" for test runs
See prompts/ for examples.
Single model (`--provider` only):

```
Fetch → Clean → LLM (extract) → Structured JSON
```

Dual model (`--provider` + `--processor`):

```
Fetch → Clean → Processor (understand) → LLM (extract) → Structured JSON
                        ↓                      ↓
                 Clean markdown         Structured JSON
                 with ALL images,       per user prompt
                 links, content
```
3-Phase Pipeline:
| Phase | What | Who | Input | Output |
|---|---|---|---|---|
| 1. Fetch | Get rendered HTML | ScraperAPI | URL | Raw HTML |
| 1.5 Clean | Strip boilerplate | Regex | Raw HTML | Cleaned HTML |
| 2. Understand | Read HTML → markdown | Gemini / Ollama | Cleaned HTML | Clean markdown |
| 3. Extract | Follow prompt → JSON | Claude / GPT-4o | Markdown + prompt | Structured data |
See ARCHITECTURE.md for the full technical details.
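The multi-level BFS crawl (listing pages feeding `next_urls` and `detail_urls` into a queue) can be sketched as below. The shape of the per-page result dict is an assumption for illustration; the real crawler interleaves the three pipeline phases inside `get_page`:

```python
from collections import deque

def bfs_crawl(start_url, get_page, max_pages=100):
    """Breadth-first crawl: each page yields data plus next_urls/detail_urls."""
    queue, seen, results = deque([start_url]), {start_url}, []
    while queue and len(results) < max_pages:   # max_pages is the safety limit
        page = get_page(queue.popleft())        # fetch + understand + extract
        results.append(page["data"])
        # Pagination links and detail-page links both feed the same queue.
        for url in page.get("next_urls", []) + page.get("detail_urls", []):
            if url not in seen:                 # never visit a URL twice
                seen.add(url)
                queue.append(url)
    return results
```

Because it is breadth-first, all listing pages at one level are processed before their detail pages, which is what allows detail data to be merged back into already-extracted listing items.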
ScraperAI depends on ScraperAPI for fetching and AI models for extraction. Both can fail in predictable ways.
| Symptom | Cause | Fix |
|---|---|---|
| FetchError / empty HTML | Bot protection (Cloudflare, Akamai) blocking ScraperAPI | ScraperAPI handles most bot protection, but some sites block all proxies. Try adding --auto-scroll or check the ScraperAPI dashboard for errors. |
| HTML returned but content missing | SPA loads data via XHR after initial render | JS rendering is on by default (--no-render disables it). If content still missing, the site may require authentication or specific cookies. |
| Different HTML than browser | Site serves different content to headless browsers | Some sites detect headless Chrome. ScraperAPI rotates user agents, but geo-restricted content may need a specific country proxy (not yet supported). |
| Timeout errors | Page takes too long to render | Increase timeout via SCRAPER_TIMEOUT in .env (default: 60s). Heavy SPAs with many API calls may need 90-120s. |
| Symptom | Cause | Fix |
|---|---|---|
| Empty data array | AI couldn't match your prompt to the page content | Run with -v to see the HTML/markdown being sent. Update your prompt to match the actual page structure. |
| Missing fields | Data exists on page but AI didn't extract it | Add explicit field descriptions and few-shot examples to your prompt. Dual-model mode (--processor gemini) often captures more content. |
| Hallucinated data | AI invented data not on the page | Lower temperature (already 0.0 by default). Use more specific prompts. Check that the fetched HTML actually contains the expected content. |
| ExtractionError: Failed to parse | AI returned malformed JSON | Automatic retry (2 attempts by default). Add --fallback openai to try a second provider. If persistent, simplify your prompt's JSON schema. |
| Symptom | Cause | Fix |
|---|---|---|
| Crawl never stops | AI keeps finding pagination links | Use --max-pages N to set a safety limit. Also add "Do NOT follow pagination" in your prompt for test runs. |
| Detail data not merging | Detail URL doesn't match the detail_url field in listing data | Ensure your prompt extracts detail_url with the exact URL format the site uses (trailing slashes, query params, etc.). |
| Duplicate items | Same item appears on multiple pagination pages | Deduplication by detail_url is automatic for Level 1. If items lack a detail_url, duplicates may appear. |
| 413 / rate limit errors | Provider's token or request limit exceeded | Use Gemini for Phase 2 (250K TPM). Groq free tier (6K TPM) is too small for most HTML pages. Check provider logs with -v. |
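Both the merging and deduplication rows above hinge on exact `detail_url` matching. A sketch of that behaviour (a simplified illustration, not the crawler's actual merge code):

```python
def merge_details(listing_items, detail_pages):
    """Merge detail-page fields into listing items, keyed on exact detail_url."""
    by_url, keyless = {}, []
    for item in listing_items:
        url = item.get("detail_url")
        if url is None:
            keyless.append(dict(item))   # no key: cannot de-dupe or merge
        elif url not in by_url:
            by_url[url] = dict(item)     # de-dupe repeats across pagination pages
    for url, extra in detail_pages.items():
        if url in by_url:                # exact match required: slashes, params, etc.
            by_url[url].update(extra)
    return list(by_url.values()) + keyless
```

A mismatched trailing slash (`.../product/123` vs `.../product/123/`) makes the `url in by_url` check fail, which is exactly the "detail data not merging" symptom.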
- Login-required pages — ScraperAPI doesn't support authenticated sessions. You'd need to pass cookies manually (not yet supported).
- CAPTCHAs — ScraperAPI solves some CAPTCHAs, but interactive ones (drag-to-verify, puzzle) will fail.
- Infinite scroll without pagination URLs — Use --auto-scroll to trigger scroll-based loading. Works for 3 scroll cycles; deeply nested infinite scroll may need multiple runs.
- Iframed content — Content inside <iframe> tags is stripped by the cleaner. Cross-origin iframe content isn't accessible via the main page fetch.
- PDF / non-HTML content — Only HTML pages are supported. PDFs, images, or API endpoints returning raw JSON are not processed.
ScraperAI is designed to be cheap:
| Component | Cost |
|---|---|
| Gemini Flash (Phase 2) | Free (250 requests/day) |
| Ollama (Phase 2) | Free (local) |
| Claude Haiku (Phase 3) | ~$0.005/page |
| ScraperAPI | Free tier: 1000 calls/month |
A full product catalog scrape with 24 detail pages (~27 pages total) costs approximately $0.10-0.15 with dual-model mode (Gemini + Claude).
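The estimate can be sanity-checked against the per-component figures above (the per-page Claude cost is the approximate figure from the table, not an exact price):

```python
# Rough cost check for the example crawl, using the figures quoted above.
pages = 27                 # listing + detail pages in the example
claude_per_page = 0.005    # Phase 3 (Claude Haiku), approx. $0.005/page
gemini_cost = 0.0          # Phase 2 (Gemini Flash) on the free tier
total = pages * claude_per_page + gemini_cost
print(f"${total:.3f}")     # $0.135, inside the quoted $0.10-0.15 range
```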
pip install -e ".[dev,all]"Uses Ruff for linting and import sorting (replaces flake8, isort, pyupgrade).
# Check for issues
ruff check src/ tests/
# Auto-fix what it can
ruff check src/ tests/ --fixRules enabled: pycodestyle, pyflakes, isort, pep8-naming, pyupgrade, flake8-bugbear, flake8-simplify, ruff-specific. Configuration is in pyproject.toml under [tool.ruff].
Uses pytest with 129 tests across 9 modules.
```bash
# Run all tests
pytest

# Verbose output
pytest -v

# Run a specific test file
pytest tests/test_cleaner.py

# Run a specific test
pytest tests/test_providers.py::TestProviderInit::test_gemini_requires_api_key
```

Test coverage:
| Module | Tests |
|---|---|
| test_cleaner.py | HTML cleaning, tag stripping, chunking |
| test_config.py | Settings defaults, env loading, retry/cache/pacing config |
| test_models.py | PageResult, CrawlResult serialization |
| test_providers.py | Provider registry, base class, API key checks |
| test_fetcher.py | ScraperAPI headers, scroll, error handling |
| test_crawler.py | BFS crawl, pagination, detail merge, dual-model, retry/fallback |
| test_cli.py | Argument parsing, file output, prompt loading, resilience flags |
| test_cache.py | CrawlCache put/get, clear, corruption handling |
| test_api.py | scrape() function, exports, prompt file loading |
MIT