Lightweight, Firecrawl-compatible web scraper and crawler for AI
Installation • API Reference • MCP Integration • JS Rendering • Configuration
English | 中文
CRW is a self-hosted web scraper and web crawler built in Rust — a fast, lightweight Firecrawl alternative designed for LLM extraction, RAG pipelines, and AI agents. It ships as a single binary with ~6 MB idle RAM, built-in MCP server support for Claude, and structured data extraction via Anthropic and OpenAI. Drop-in compatible with Firecrawl's API.
Single binary. No Redis. No Node.js. Drop-in Firecrawl API.
```bash
cargo install crw-server
crw-server
```

- CSS selector & XPath — target specific DOM elements before Markdown conversion (`cssSelector`, `xpath`)
- Chunking strategies — split content into topic, sentence, or regex-delimited chunks for RAG pipelines (`chunkStrategy`)
- BM25 & cosine filtering — rank chunks by relevance to a query and return the top-K results (`filterMode`, `topK`)
- Better Markdown — switched to `htmd` (a Turndown.js port): tables, code block languages, and nested lists all render correctly
- Stealth mode — rotate the User-Agent from a built-in Chrome/Firefox/Safari pool and inject 12 browser-like headers (`stealth: true`)
- Per-request proxy — override the global proxy on a per-request basis (`proxy: "http://..."`)
- Rate limit jitter — randomized delay between requests to avoid uniform traffic fingerprinting
- `crw-server setup` — one-command JS rendering setup: downloads LightPanda, creates `config.local.toml`
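The chunking and filtering options above ride along on a normal scrape request. A minimal sketch of building such a request body — the parameter names come from the list above, but their exact placement in the body is an assumption, and `query` is a hypothetical field name for the relevance target:

```python
def build_scrape_request(url: str, query: str) -> dict:
    """Sketch of a /v1/scrape body using the chunking/filtering knobs.

    Assumptions: top-level placement of these fields, and the "query"
    field name; check the API reference for the exact request shape.
    """
    return {
        "url": url,
        "formats": ["markdown"],
        "chunkStrategy": "topic",  # or "sentence", or regex-delimited
        "filterMode": "bm25",      # or cosine similarity
        "topK": 5,                 # return the 5 most relevant chunks
        "query": query,            # hypothetical: what to rank against
        "stealth": True,           # rotate UA, inject browser-like headers
    }
```

POST the resulting dict to `/v1/scrape` as JSON, exactly as in the curl examples below.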
CRW gives you Firecrawl's API with a fraction of the resource usage. No runtime dependencies, no Redis, no Node.js — just a single binary you can deploy anywhere.
| | CRW | Firecrawl |
|---|---|---|
| Coverage (1K URLs) | 92.0% | 77.2% |
| Avg Latency | 833ms | 4,600ms |
| P50 Latency | 446ms | — |
| Noise Rejection | 88.4% | — |
| Idle RAM | 6.6 MB | ~500 MB+ |
| Cold start | 85 ms | seconds |
| HTTP scrape | ~30 ms | ~200 ms+ |
| Binary size | ~8 MB | Node.js runtime |
| Cost / 1K scrapes | $0 (self-hosted) | $0.83–5.33 |
| Dependencies | single binary | Node + Redis |
| License | AGPL-3.0 | AGPL |
Benchmark: Firecrawl scrape-content-dataset-v1 — 1,000 real-world URLs with JS rendering enabled.
- 🔌 Firecrawl-compatible API — same endpoints, same request/response format, drop-in replacement
- 📄 6 output formats — markdown, HTML, cleaned HTML, raw HTML, plain text, links, structured JSON
- 🤖 LLM structured extraction — send a JSON schema, get validated structured data back (Anthropic tool_use + OpenAI function calling)
- 🌐 JS rendering — auto-detect SPAs with shell heuristics, render via LightPanda, Playwright, or Chrome (CDP)
- 🕷️ BFS crawler — async crawl with rate limiting, robots.txt, sitemap support, concurrent jobs
- 🔧 MCP server — built-in stdio + HTTP transport for Claude Code and Claude Desktop
- 🔒 Security — SSRF protection (private IPs, cloud metadata, IPv6), constant-time auth, dangerous URI filtering
- 🐳 Docker ready — multi-stage build with LightPanda sidecar
- 🎯 CSS selector & XPath — extract specific DOM elements before Markdown conversion
- ✂️ Chunking & filtering — split content into topic/sentence/regex chunks; rank by BM25 or cosine similarity
- 🕵️ Stealth mode — browser-like UA rotation and header injection to reduce bot detection
- 🌐 Per-request proxy — override the global proxy per scrape request
Install from crates.io:
```bash
cargo install crw-server
crw-server
```

Enable JS rendering (optional):

```bash
crw-server setup
```

This downloads LightPanda and creates a `config.local.toml` for JS rendering. See JS Rendering for details.

Docker (pre-built image):

```bash
docker run -p 3000:3000 ghcr.io/us/crw:latest
```

Docker Compose (with JS rendering):

```bash
docker compose up
```

Build from source:

```bash
cargo build --release --bin crw-server
./target/release/crw-server
```

Scrape a page:
```bash
curl -X POST http://localhost:3000/v1/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'
```

```json
{
  "success": true,
  "data": {
    "markdown": "# Example Domain\nThis domain is for use in ...",
    "metadata": {
      "title": "Example Domain",
      "sourceURL": "https://example.com",
      "statusCode": 200,
      "elapsedMs": 32
    }
  }
}
```

- RAG pipelines — crawl websites and extract structured data for vector databases
- AI agents — give Claude Code or Claude Desktop web scraping tools via MCP
- Content monitoring — periodic crawl with LLM extraction to track changes
- Data extraction — combine CSS selectors + LLM to extract any schema from any page
- Web archiving — full-site BFS crawl to markdown
| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/v1/scrape` | Scrape a single URL, optionally with LLM extraction |
| `POST` | `/v1/crawl` | Start async BFS crawl (returns job ID) |
| `GET` | `/v1/crawl/:id` | Check crawl status and retrieve results |
| `POST` | `/v1/map` | Discover all URLs on a site |
| `GET` | `/health` | Health check (no auth required) |
| `POST` | `/mcp` | Streamable HTTP MCP transport |
Send a JSON schema with your scrape request and CRW returns validated structured data using LLM function calling.
```bash
curl -X POST http://localhost:3000/v1/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/product",
    "formats": ["json"],
    "jsonSchema": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "price": { "type": "number" }
      },
      "required": ["name", "price"]
    }
  }'
```

- Anthropic — uses `tool_use` with `input_schema` for extraction
- OpenAI — uses function calling with a `parameters` schema
- Validation — LLM output is validated against your JSON schema before returning

Configure the LLM provider in your config:

```toml
[extraction.llm]
provider = "anthropic" # "anthropic" or "openai"
api_key = "sk-..."     # or CRW_EXTRACTION__LLM__API_KEY env var
model = "claude-sonnet-4-20250514"
```

CRW works as an MCP tool server for any AI assistant that supports MCP. It provides 4 tools: `crw_scrape`, `crw_crawl`, `crw_check_crawl_status`, and `crw_map`.
```bash
claude mcp add --transport http crw http://localhost:3000/mcp
```

For Claude Desktop, edit your config file:
| OS | Path |
|---|---|
| macOS | ~/Library/Application Support/Claude/claude_desktop_config.json |
| Windows | %APPDATA%\Claude\claude_desktop_config.json |
| Linux | ~/.config/Claude/claude_desktop_config.json |
```json
{
  "mcpServers": {
    "crw": {
      "command": "/absolute/path/to/crw-mcp",
      "env": { "CRW_API_URL": "http://localhost:3000" }
    }
  }
}
```

Edit `~/.cursor/mcp.json` (global) or `.cursor/mcp.json` (project):
```json
{
  "mcpServers": {
    "crw": {
      "command": "/absolute/path/to/crw-mcp",
      "env": { "CRW_API_URL": "http://localhost:3000" }
    }
  }
}
```

Edit `~/.codeium/windsurf/mcp_config.json`:
```json
{
  "mcpServers": {
    "crw": {
      "command": "/absolute/path/to/crw-mcp",
      "env": { "CRW_API_URL": "http://localhost:3000" }
    }
  }
}
```

Some clients also accept per-tool permission keys such as `alwaysAllow` and `disabled`:

```json
{
  "mcpServers": {
    "crw": {
      "command": "/absolute/path/to/crw-mcp",
      "env": { "CRW_API_URL": "http://localhost:3000" },
      "alwaysAllow": ["crw_scrape", "crw_map"],
      "disabled": false
    }
  }
}
```

Edit `~/.continue/config.yaml`:
```yaml
mcpServers:
  - name: crw
    command: /absolute/path/to/crw-mcp
    env:
      CRW_API_URL: http://localhost:3000
```

Edit `~/.codex/config.toml`:
```toml
[mcp_servers.crw]
command = "/absolute/path/to/crw-mcp"

[mcp_servers.crw.env]
CRW_API_URL = "http://localhost:3000"
```

Any MCP-compatible client can connect to CRW using the standard JSON format:
```json
{
  "mcpServers": {
    "crw": {
      "command": "/absolute/path/to/crw-mcp",
      "env": { "CRW_API_URL": "http://localhost:3000" }
    }
  }
}
```

> Tip: The stdio binary (`crw-mcp`) works with any client. For clients that support HTTP transport, use `http://localhost:3000/mcp` directly — no binary needed.
See the full MCP setup guide for detailed instructions, auth configuration, and platform comparison.
CRW auto-detects SPAs by analyzing the initial HTML response for shell heuristics (empty body, framework markers). When a SPA is detected, it renders the page via a headless browser.
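For intuition, here is a heuristic in the same spirit — illustrative only, not CRW's actual rules; the marker strings and the 50-character threshold are assumptions:

```python
import re

# Strings commonly left in SPA shells by popular frameworks (assumed
# examples, not CRW's real marker list).
FRAMEWORK_MARKERS = ("__NEXT_DATA__", "ng-version", 'id="root"', 'id="app"')

def looks_like_spa(html: str) -> bool:
    """Shell heuristic: a near-empty <body> or a known framework marker
    suggests the page is rendered client-side and needs a browser."""
    body = re.search(r"<body[^>]*>(.*?)</body>", html, re.S | re.I)
    body_text = re.sub(r"<[^>]+>", "", body.group(1)).strip() if body else ""
    has_marker = any(marker in html for marker in FRAMEWORK_MARKERS)
    return has_marker or len(body_text) < 50
```

When a heuristic like this fires, the request is routed to the configured headless renderer instead of the plain HTTP fetcher.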
Quick setup (recommended):

```bash
crw-server setup
```

This automatically downloads the LightPanda binary to `~/.local/bin/` and creates a `config.local.toml` with the correct renderer settings. Then start LightPanda and CRW:

```bash
lightpanda serve --host 127.0.0.1 --port 9222 &
crw-server
```

Supported renderers:
| Renderer | Protocol | Best for |
|---|---|---|
| LightPanda | CDP over WebSocket | Low-resource environments (default) |
| Playwright | CDP over WebSocket | Full browser compatibility |
| Chrome | CDP over WebSocket | Existing Chrome infrastructure |
Renderer mode is configured via `renderer.mode`: `auto` (default), `lightpanda`, `playwright`, `chrome`, or `none`.
With Docker Compose, LightPanda runs as a sidecar — no extra setup needed:

```bash
docker compose up
```

```text
┌─────────────────────────────────────────────┐
│                 crw-server                  │
│         Axum HTTP API + Auth + MCP          │
├──────────┬───────────┬──────────────────────┤
│ crw-crawl│crw-extract│ crw-renderer         │
│ BFS crawl│ HTML→MD   │ HTTP + CDP(WS)       │
│ robots   │ LLM/JSON  │ LightPanda/Chrome    │
│ sitemap  │clean/read │ auto-detect SPA      │
├──────────┴───────────┴──────────────────────┤
│                  crw-core                   │
│            Types, Config, Errors            │
└─────────────────────────────────────────────┘
```
CRW uses layered TOML configuration with environment variable overrides:
- `config.default.toml` — built-in defaults
- `config.local.toml` — local overrides (or set `CRW_CONFIG=myconfig`)
- Environment variables — `CRW_` prefix with `__` separator (e.g. `CRW_SERVER__PORT=8080`)
```toml
[server]
host = "0.0.0.0"
port = 3000

[renderer]
mode = "auto" # auto | lightpanda | playwright | chrome | none

[crawler]
max_concurrency = 10
requests_per_second = 10.0
respect_robots_txt = true

[auth]
# api_keys = ["fc-key-1234"]
```

See the full configuration reference for all options.
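When `api_keys` is set, clients must authenticate. Since the API is Firecrawl-compatible, an `Authorization: Bearer` header is the assumed scheme; a standard-library-only client sketch:

```python
import json
import urllib.request

def auth_headers(api_key: str) -> dict:
    # Bearer scheme assumed from the Firecrawl-compatible API surface.
    return {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }

def scrape_authed(url: str, api_key: str,
                  base: str = "http://localhost:3000") -> dict:
    """POST /v1/scrape with a Bearer token and return the data payload."""
    req = urllib.request.Request(
        f"{base}/v1/scrape",
        data=json.dumps({"url": url}).encode(),
        headers=auth_headers(api_key),
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["data"]
```

The same header works with `curl -H "Authorization: Bearer fc-key-1234" ...`.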
Python:
```python
import requests

response = requests.post("http://localhost:3000/v1/scrape", json={
    "url": "https://example.com",
    "formats": ["markdown", "links"]
})
data = response.json()["data"]
print(data["markdown"])
```

Node.js:
```javascript
const response = await fetch("http://localhost:3000/v1/scrape", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    url: "https://example.com",
    formats: ["markdown", "links"]
  })
});
const { data } = await response.json();
console.log(data.markdown);
```

LangChain document loader pattern:
```python
import requests

def load_documents(urls):
    documents = []
    for url in urls:
        resp = requests.post("http://localhost:3000/v1/scrape", json={
            "url": url,
            "formats": ["markdown"]
        })
        data = resp.json()["data"]
        documents.append({
            "page_content": data["markdown"],
            "metadata": data["metadata"]
        })
    return documents
```

Pre-built image from GHCR:
```bash
docker pull ghcr.io/us/crw:latest
docker run -p 3000:3000 ghcr.io/us/crw:latest
```

Docker Compose (with JS rendering sidecar):

```bash
docker compose up
```

This starts CRW on port 3000 with LightPanda as a JS rendering sidecar on port 9222. CRW auto-connects to LightPanda for SPA rendering.
Tested on Firecrawl's scrape-content-dataset-v1 (1,000 real-world URLs, JS rendering enabled):
| | CRW | Firecrawl v2.5 |
|---|---|---|
| Coverage | 92.0% | 77.2% |
| Avg Latency | 833ms | 4,600ms |
| P50 Latency | 446ms | — |
| Noise Rejection | 88.4% | — |
| Cost / 1,000 scrapes | $0 (self-hosted) | $0.83–5.33 |
| Idle RAM | 6.6 MB | ~500 MB+ |
Run the benchmark yourself:
```bash
pip install datasets aiohttp
python3 bench/run_bench.py
```

| Crate | Description |
|---|---|
| `crw-core` | Core types, config, and error handling |
| `crw-renderer` | HTTP + CDP browser rendering engine |
| `crw-extract` | HTML → markdown/plaintext extraction |
| `crw-crawl` | Async BFS crawler with robots.txt & sitemap |
| `crw-server` | Axum API server (Firecrawl-compatible) |
| `crw-mcp` | MCP stdio proxy binary |
See docs/crates.md for usage examples and cargo add instructions.
Full documentation: docs/index.md
CRW includes built-in protections against common web scraping attack vectors:
- SSRF protection — all URL inputs (REST API + MCP) are validated against private/internal networks:
  - Loopback (`127.0.0.0/8`, `::1`, `localhost`)
  - Private IPs (`10.0.0.0/8`, `172.16.0.0/12`, `192.168.0.0/16`)
  - Link-local / cloud metadata (`169.254.0.0/16` — blocks AWS/GCP metadata endpoints)
  - IPv6 mapped addresses (`::ffff:127.0.0.1`), link-local (`fe80::`), ULA (`fc00::/7`)
  - Non-HTTP schemes (`file://`, `ftp://`, `gopher://`, `data:`)
- Auth — optional Bearer token with constant-time comparison (no length or key-index leakage)
- robots.txt — respects `Allow`/`Disallow` with wildcard patterns (`*`, `$`) and RFC 9309 specificity
- Rate limiting — configurable per-second request cap
- Resource limits — max body size (1 MB), max crawl depth (10), max pages (1000), max discovered URLs (5000)
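The URL checks above can be approximated with the standard library. An illustrative sketch, not CRW's actual implementation (a real scraper must also resolve DNS names and re-check the resolved IPs to defeat rebinding):

```python
import ipaddress
from urllib.parse import urlparse

def url_is_blocked(url: str) -> bool:
    """Reject non-HTTP schemes and private/link-local/loopback hosts,
    mirroring the SSRF rules listed above (illustrative only)."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return True  # file://, ftp://, gopher://, data:, ...
    host = parsed.hostname or ""
    if host == "localhost":
        return True
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False  # a DNS name; resolve and re-check in real code
    if isinstance(ip, ipaddress.IPv6Address) and ip.ipv4_mapped:
        ip = ip.ipv4_mapped  # unwrap ::ffff:127.0.0.1 and friends
    # is_private covers 10/8, 172.16/12, 192.168/16 and fc00::/7;
    # is_link_local covers 169.254/16 (cloud metadata) and fe80::.
    return ip.is_private or ip.is_loopback or ip.is_link_local
```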
Contributions are welcome! Please open an issue or submit a pull request.
- Fork the repository
- Create your feature branch (`git checkout -b feat/my-feature`)
- Commit your changes (`git commit -m 'feat: add my feature'`)
- Push to the branch (`git push origin feat/my-feature`)
- Open a Pull Request