ScraperAI

Prompt-driven web scraping powered by AI. No selectors. No CSS paths. Just describe what you want in plain English.

ScraperAI uses a 3-phase pipeline — Fetch, Understand, Extract — to scrape any website intelligently.

Fetch (ScraperAPI)  →  Understand (AI)  →  Extract (AI)  →  Structured JSON

How It Works

  1. You write a prompt describing what to scrape, how to navigate, and what data you want
  2. ScraperAPI fetches the rendered HTML (handles JS, bot protection, geo-restrictions)
  3. Phase 2 AI reads the HTML and produces clean markdown — finding all content, links, and images regardless of where they're hidden
  4. Phase 3 AI extracts structured JSON data from the clean markdown based on your prompt

The AI handles pagination, detail pages, and multi-level crawling automatically based on your prompt.
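
The phases compose cleanly. Below is a minimal sketch of the flow: fetch_html mirrors ScraperAPI's documented GET API, while html_to_markdown and extract_json are hypothetical stand-ins for the two AI phases, not ScraperAI's actual internals.

import os
import requests

def fetch_html(url: str) -> str:
    """Phase 1: fetch rendered HTML via ScraperAPI (render=true enables JS)."""
    resp = requests.get(
        "https://api.scraperapi.com/",
        params={"api_key": os.environ["SCRAPER_API_KEY"], "url": url, "render": "true"},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.text

def html_to_markdown(html: str) -> str:
    """Phase 2: a cheap/free model rewrites cleaned HTML as markdown (stubbed here)."""
    raise NotImplementedError  # provider call goes here

def extract_json(markdown: str, prompt: str) -> list[dict]:
    """Phase 3: a stronger model follows the prompt and returns structured data (stubbed)."""
    raise NotImplementedError  # provider call goes here

def pipeline(url: str, prompt: str) -> list[dict]:
    return extract_json(html_to_markdown(fetch_html(url)), prompt)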

Features

  • Prompt-driven — No code changes needed for different sites. Write a prompt, get data.
  • Multi-level crawling — Listing pages → Detail pages → Sub-pages. BFS with automatic pagination.
  • Dual-model architecture — Free Phase 2 model for page understanding + cloud LLM for precise extraction
  • Single-model mode — Use one provider for everything (skip Phase 2)
  • Image discovery — Finds images hidden in JavaScript galleries, CSS background-image, carousels
  • Data merging — Detail page data automatically merged into parent listing items
  • Retry + fallback — Per-chunk retry with exponential backoff; automatic failover to a backup provider (see the sketch after this list)
  • Crawl cache — Resume interrupted crawls without re-fetching already-processed pages
  • Request pacing — Configurable delay between fetches to respect rate limits
  • 5 providers — Anthropic Claude, OpenAI GPT-4o, Google Gemini, Groq, Ollama (local), mix & match
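
The retry + fallback feature follows a standard pattern. A minimal sketch of that behaviour, assuming provider objects with an extract() method (illustrative, not ScraperAI's internals):

import time

def extract_with_retry(chunk, primary, fallback=None, retries=2):
    """Retry one chunk with exponential backoff, then fail over to a backup provider."""
    providers = [p for p in (primary, fallback) if p is not None]
    for provider in providers:
        for attempt in range(retries + 1):
            try:
                return provider.extract(chunk)  # hypothetical provider interface
            except Exception:
                if attempt < retries:
                    time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s, ...
    raise RuntimeError("All providers failed for this chunk")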

Quick Start

1. Install

pip install -e ".[all]"

2. Configure

Copy .env.example to .env and add your API keys:

SCRAPER_API_KEY=your_scraper_api_key

# Cloud LLM for extraction (pick one or more)
ANTHROPIC_API_KEY=your_anthropic_key
OPENAI_API_KEY=your_openai_key

# Free Phase 2 processors (pick one)
GEMINI_API_KEY=your_gemini_key        # Recommended — free, fast, large context
# OLLAMA_BASE_URL=http://localhost:11434
# OLLAMA_MODEL=qwen2.5:14b

# Defaults
DEFAULT_PROVIDER=anthropic
CLAUDE_MODEL=claude-haiku-4-5-20251001
GEMINI_MODEL=gemini-2.5-flash

# Resilience (optional)
EXTRACTION_RETRIES=2              # retry attempts per chunk (default: 2)
FALLBACK_PROVIDER=openai          # try this provider if primary fails
FETCH_DELAY=1.0                   # seconds between page fetches
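
A quick way to sanity-check the configuration before a long crawl. This snippet assumes python-dotenv is installed (ScraperAI may well load .env itself); the key list matches the setup above:

import os
from dotenv import load_dotenv

load_dotenv()  # pull keys from .env into the environment
for key in ("SCRAPER_API_KEY", "ANTHROPIC_API_KEY"):
    if not os.getenv(key):
        raise SystemExit(f"{key} is missing; add it to .env before running")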

3. Write a prompt

Create prompts/my_scrape.txt:

You are scraping a product catalog website.

Step 1 - Listing pages:
Go to the catalog page. You will find products displayed as cards.
Each card has a product name, price, and link to its detail page.
Follow all pagination links until the last page.

For each product, extract:
{
  "name": "Example Product",
  "price": "$29.99",
  "detail_url": "https://example.com/product/123"
}

Put pagination links in next_urls.
Put product detail URLs in detail_urls.

Step 2 - Detail pages:
Visit each product's detail page. Extract full specs:
{
  "name": "Example Product",
  "price": "$29.99",
  "detail_url": "https://example.com/product/123",
  "images": ["https://example.com/img1.jpg"],
  "description": "Full product description...",
  "sku": "ABC-123",
  "category": "Electronics"
}

4. Run

# Single-model mode (cloud LLM handles everything)
scraper-ai "https://example.com/catalog" prompts/my_scrape.txt \
  --provider anthropic -o data/output.json

# Dual-model mode (Gemini understands pages, Anthropic extracts data)
scraper-ai "https://example.com/catalog" prompts/my_scrape.txt \
  --provider anthropic --processor gemini -o data/output.json

Providers

| Provider | Use as | Key needed | Free tier |
| --- | --- | --- | --- |
| Anthropic (Claude Haiku) | Phase 3 extractor | ANTHROPIC_API_KEY | No (pay per token) |
| OpenAI (GPT-4o) | Phase 3 extractor | OPENAI_API_KEY | No (pay per token) |
| Gemini (Flash) | Phase 2 processor | GEMINI_API_KEY | Yes — 250K TPM, 250 RPD |
| Groq (Llama) | Phase 2/3 | GROQ_API_KEY | Yes — 6K TPM (limited) |
| Ollama (any model) | Phase 2 processor | None (local) | Yes (runs on your machine) |

Recommended setup: Gemini Flash for Phase 2 (free, fast, 1M token context) + Claude Haiku for Phase 3 (cheap, accurate).

CLI Reference

scraper-ai <url> <prompt> [options]

Arguments:
  url                   Starting URL to scrape
  prompt                Prompt text or path to a .txt/.md file

Options:
  -p, --provider        AI provider for extraction: anthropic, openai, gemini, groq, ollama
  --processor           AI provider for page understanding (dual-model mode)
  --fallback            Fallback provider if primary extraction fails (e.g. openai)
  --max-pages N         Safety limit on pages to crawl (default: 100)
  --delay SECONDS       Seconds between page fetches (default: 1.0)
  --cache               Enable URL result caching so interrupted crawls can resume
  --clear-cache         Clear the cache before starting
  --auto-scroll         Enable infinite scroll handling
  --no-render           Disable JavaScript rendering
  -o, --output FILE     Output file path (default: stdout)
  -v, --verbose         Enable debug logging
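
For example, a resilient dual-model run that combines several of these flags (paths illustrative):

scraper-ai "https://example.com/catalog" prompts/my_scrape.txt \
  --provider anthropic --processor gemini --fallback openai \
  --max-pages 20 --delay 2.0 --cache \
  -o data/output.json -v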

Python API

ScraperAI can be used as a library — no CLI needed.

Install

pip install -e ".[anthropic,gemini]"   # or .[all]

Basic usage

from scraper_ai import scrape

result = scrape(
    "https://example.com/catalog",
    "Extract product name, price, and description",
    provider="anthropic",
)

print(f"Scraped {result.pages_crawled} pages")
for item in result.data:
    print(item["name"], item["price"])

Prompt from a file

from scraper_ai import scrape

result = scrape(
    "https://example.com/catalog",
    "prompts/my_scrape.txt",       # path to a .txt or .md file
    provider="anthropic",
)

Dual-model mode (Gemini understands pages, Anthropic extracts)

result = scrape(
    "https://example.com/catalog",
    "prompts/my_scrape.txt",
    provider="anthropic",
    processor="gemini",
)

With options

Any Settings field can be passed as a keyword argument:

result = scrape(
    "https://example.com/catalog",
    "prompts/my_scrape.txt",
    provider="anthropic",
    processor="gemini",
    max_pages=20,
    fetch_delay=2.0,
    cache_enabled=True,
    fallback_provider="openai",
)

Working with results

from scraper_ai import scrape
import json

result = scrape("https://example.com/catalog", "Extract all products", provider="anthropic")

# As a list of dicts
print(result.data)

# Serialise to JSON
print(json.dumps(result.model_dump(), indent=2))

# Key metadata
print(result.pages_crawled)   # int — number of pages visited
print(result.provider)        # str — provider used
print(result.url)             # str — starting URL

Available exports

from scraper_ai import (
    scrape,          # convenience function (recommended)
    crawl,           # lower-level crawl function
    CrawlResult,     # return type
    PageResult,      # per-page result type
    Settings,        # configuration dataclass
    get_provider,    # instantiate a provider by name
    list_providers,  # list available provider names
)
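
A small sketch of the provider helpers. The exact return shapes are an assumption; the comments above only promise "list available provider names" and "instantiate a provider by name":

from scraper_ai import get_provider, list_providers

print(list_providers())               # expected to include: anthropic, openai, gemini, groq, ollama
provider = get_provider("anthropic")  # likely reads ANTHROPIC_API_KEY from the environment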

Examples

Product Catalog (multi-level with detail pages)

scraper-ai "https://example.com/catalog" prompts/my_scrape.txt \
  --provider anthropic --processor gemini -o data/products.json

Output:

{
  "pages_crawled": 4,
  "data": [
    {
      "name": "Wireless Headphones",
      "price": "$79.99",
      "detail_url": "https://example.com/product/42",
      "images": ["https://...img1.jpg", "https://...img2.jpg"],
      "description": "Noise-cancelling over-ear headphones with 30h battery...",
      "sku": "WH-42",
      "category": "Electronics"
    }
  ]
}

Ollama Model Search (single page, no detail pages)

scraper-ai "https://ollama.com/search?q=scrapping" prompts/ollama_models.txt \
  --provider anthropic -o data/ollama_models.json

Writing Good Prompts

The prompt is the brain of ScraperAI. Tips:

  1. Use few-shot JSON examples — Show the exact field names and formats you want
  2. Describe each level — Step 1 for listing pages, Step 2 for detail pages
  3. Be explicit about URLs — "Put pagination in next_urls, detail URLs in detail_urls"
  4. Describe the page structure — "Cards with basic info", "Gallery with multiple images"
  5. Specify what NOT to do — "Do NOT follow pagination" for test runs

See prompts/ for examples.

Architecture

Single model (--provider only):
  Fetch → Clean → LLM (extract) → Structured JSON

Dual model (--provider + --processor):
  Fetch → Clean → Processor (understand) → LLM (extract) → Structured JSON
                       ↓                        ↓
                 Clean markdown           Structured JSON
                 with ALL images,         per user prompt
                 links, content

3-Phase Pipeline:

| Phase | What | Who | Input | Output |
| --- | --- | --- | --- | --- |
| 1. Fetch | Get rendered HTML | ScraperAPI | URL | Raw HTML |
| 1.5 Clean | Strip boilerplate | Regex | Raw HTML | Cleaned HTML |
| 2. Understand | Read HTML → markdown | Gemini / Ollama | Cleaned HTML | Clean markdown |
| 3. Extract | Follow prompt → JSON | Claude / GPT-4o | Markdown + prompt | Structured data |
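
Phase 1.5 is plain regex work, no AI involved. A minimal sketch of the kind of stripping that happens there (illustrative, not ScraperAI's actual cleaner):

import re

def clean_html(html: str) -> str:
    """Strip obvious boilerplate before the AI phases see the page."""
    # Scripts, noscript blocks, and comments carry no extractable content.
    html = re.sub(r"(?is)<(script|noscript)\b.*?</\1>", "", html)
    html = re.sub(r"(?s)<!--.*?-->", "", html)
    # Structural chrome rarely holds data. <style> is kept here: image URLs
    # can hide in CSS background-image, which Phase 2 is expected to find.
    html = re.sub(r"(?is)<(header|footer|nav)\b.*?</\1>", "", html)
    # Collapse the whitespace runs left behind.
    return re.sub(r"[ \t]{2,}", " ", html).strip()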

See ARCHITECTURE.md for the full technical details.

Limitations & Troubleshooting

ScraperAI depends on ScraperAPI for fetching and AI models for extraction. Both can fail in predictable ways.

Fetch failures

| Symptom | Cause | Fix |
| --- | --- | --- |
| FetchError / empty HTML | Bot protection (Cloudflare, Akamai) blocking ScraperAPI | ScraperAPI handles most bot protection, but some sites block all proxies. Try adding --auto-scroll or check the ScraperAPI dashboard for errors. |
| HTML returned but content missing | SPA loads data via XHR after initial render | JS rendering is on by default (--no-render disables it). If content is still missing, the site may require authentication or specific cookies. |
| Different HTML than browser | Site serves different content to headless browsers | Some sites detect headless Chrome. ScraperAPI rotates user agents, but geo-restricted content may need a specific country proxy (not yet supported). |
| Timeout errors | Page takes too long to render | Increase the timeout via SCRAPER_TIMEOUT in .env (default: 60s). Heavy SPAs with many API calls may need 90-120s. |

Extraction failures

| Symptom | Cause | Fix |
| --- | --- | --- |
| Empty data array | AI couldn't match your prompt to the page content | Run with -v to see the HTML/markdown being sent. Update your prompt to match the actual page structure. |
| Missing fields | Data exists on the page but the AI didn't extract it | Add explicit field descriptions and few-shot examples to your prompt. Dual-model mode (--processor gemini) often captures more content. |
| Hallucinated data | AI invented data not on the page | Temperature is already 0.0 by default, so use more specific prompts and check that the fetched HTML actually contains the expected content. |
| ExtractionError: Failed to parse | AI returned malformed JSON | Retried automatically (2 attempts by default). Add --fallback openai to try a second provider. If it persists, simplify your prompt's JSON schema. |

Crawl issues

| Symptom | Cause | Fix |
| --- | --- | --- |
| Crawl never stops | AI keeps finding pagination links | Use --max-pages N to set a safety limit. Also add "Do NOT follow pagination" to your prompt for test runs. |
| Detail data not merging | Detail URL doesn't match the detail_url field in listing data | Ensure your prompt extracts detail_url in the exact URL format the site uses (trailing slashes, query params, etc.). |
| Duplicate items | Same item appears on multiple pagination pages | Deduplication by detail_url is automatic for Level 1. If items lack a detail_url, duplicates may appear. |
| 413 / rate limit errors | Provider's token or request limit exceeded | Use Gemini for Phase 2 (250K TPM). Groq's free tier (6K TPM) is too small for most HTML pages. Check provider logs with -v. |
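
The merge and dedup rows above both key on detail_url. A minimal sketch of that logic (illustrative; the real crawler's internals may differ):

def merge_details(listing_items: list[dict], detail_pages: dict[str, dict]) -> list[dict]:
    """Merge detail-page data into listing items, deduping on detail_url."""
    seen: set[str] = set()
    merged = []
    for item in listing_items:
        url = item.get("detail_url")
        if url is not None:
            if url in seen:
                continue  # same item seen on an earlier pagination page
            seen.add(url)
            item = {**item, **detail_pages.get(url, {})}  # detail fields win on conflict
        merged.append(item)  # items without detail_url pass through (and may duplicate)
    return merged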

Sites that don't work well

  • Login-required pages — ScraperAPI doesn't support authenticated sessions. You'd need to pass cookies manually (not yet supported).
  • CAPTCHAs — ScraperAPI solves some CAPTCHAs, but interactive ones (drag-to-verify, puzzle) will fail.
  • Infinite scroll without pagination URLs — Use --auto-scroll to trigger scroll-based loading. Works for 3 scroll cycles; deeply nested infinite scroll may need multiple runs.
  • Iframed content — Content inside <iframe> tags is stripped by the cleaner. Cross-origin iframe content isn't accessible via the main page fetch.
  • PDF / non-HTML content — Only HTML pages are supported. PDFs, images, or API endpoints returning raw JSON are not processed.

Cost

ScraperAI is designed to be cheap:

| Component | Cost |
| --- | --- |
| Gemini Flash (Phase 2) | Free (250 requests/day) |
| Ollama (Phase 2) | Free (local) |
| Claude Haiku (Phase 3) | ~$0.005/page |
| ScraperAPI | Free tier: 1000 calls/month |

A full product catalog scrape with 24 detail pages (~27 pages total) costs approximately $0.10-0.15 with dual-model mode (Gemini + Claude).
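
The estimate follows from the table: Gemini's and ScraperAPI's free tiers cover Phases 1-2, so Claude Haiku's Phase 3 cost dominates. Roughly:

pages = 27             # ~3 listing/pagination pages + 24 detail pages
cost_per_page = 0.005  # Claude Haiku (Phase 3), from the table above
print(f"~${pages * cost_per_page:.2f}")  # ~$0.14, inside the $0.10-0.15 range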

Development

Install dev dependencies

pip install -e ".[dev,all]"

Linting

Uses Ruff for linting and import sorting (replaces flake8, isort, pyupgrade).

# Check for issues
ruff check src/ tests/

# Auto-fix what it can
ruff check src/ tests/ --fix

Rules enabled: pycodestyle, pyflakes, isort, pep8-naming, pyupgrade, flake8-bugbear, flake8-simplify, ruff-specific. Configuration is in pyproject.toml under [tool.ruff].

Testing

Uses pytest with 129 tests across 9 modules.

# Run all tests
pytest

# Verbose output
pytest -v

# Run a specific test file
pytest tests/test_cleaner.py

# Run a specific test
pytest tests/test_providers.py::TestProviderInit::test_gemini_requires_api_key

Test coverage:

| Module | Covers |
| --- | --- |
| test_cleaner.py | HTML cleaning, tag stripping, chunking |
| test_config.py | Settings defaults, env loading, retry/cache/pacing config |
| test_models.py | PageResult, CrawlResult serialization |
| test_providers.py | Provider registry, base class, API key checks |
| test_fetcher.py | ScraperAPI headers, scroll, error handling |
| test_crawler.py | BFS crawl, pagination, detail merge, dual-model, retry/fallback |
| test_cli.py | Argument parsing, file output, prompt loading, resilience flags |
| test_cache.py | CrawlCache put/get, clear, corruption handling |
| test_api.py | scrape() function, exports, prompt file loading |

License

MIT
