Lightweight, Firecrawl-compatible web scraper and crawler for AI
Installation • API Reference • MCP Integration • JS Rendering • Configuration
English | 中文
CRW is a self-hosted web scraper and web crawler built in Rust — a fast, lightweight Firecrawl alternative designed for LLM extraction, RAG pipelines, and AI agents. It ships as a single binary with ~6 MB idle RAM, built-in MCP server support for Claude, and structured data extraction via Anthropic and OpenAI. Drop-in compatible with Firecrawl's API.
Single binary. No Redis. No Node.js. Drop-in Firecrawl API.
```bash
cargo install crw-server
crw-server
```

- CSS selector & XPath — target specific DOM elements before Markdown conversion (`cssSelector`, `xpath`)
- Chunking strategies — split content into topic, sentence, or regex-delimited chunks for RAG pipelines (`chunkStrategy`)
- BM25 & cosine filtering — rank chunks by relevance to a query and return the top-K results (`filterMode`, `topK`)
- Better Markdown — switched to `htmd` (a Turndown.js port): tables, code block languages, and nested lists all render correctly
- Stealth mode — rotate the User-Agent from a built-in Chrome/Firefox/Safari pool and inject 12 browser-like headers (`stealth: true`)
- Per-request proxy — override the global proxy on a per-request basis (`proxy: "http://..."`)
- Rate limit jitter — randomized delay between requests to avoid uniform traffic fingerprinting
- `crw-server setup` — one-command JS rendering setup: downloads LightPanda, creates `config.local.toml`
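The chunking and filtering options above ride along on a normal scrape request. A minimal sketch of building such a request body — the parameter names come from the list above, but their exact placement in the body is an assumption, and `query` is a hypothetical field name for the relevance target:

```python
def build_scrape_request(url: str, query: str) -> dict:
    """Sketch of a /v1/scrape body using the chunking/filtering knobs.

    Assumptions: top-level placement of these fields, and the "query"
    field name; check the API reference for the exact request shape.
    """
    return {
        "url": url,
        "formats": ["markdown"],
        "chunkStrategy": "topic",  # or "sentence", or regex-delimited
        "filterMode": "bm25",      # or cosine similarity
        "topK": 5,                 # return the 5 most relevant chunks
        "query": query,            # hypothetical: what to rank against
        "stealth": True,           # rotate UA, inject browser-like headers
    }
```

POST the resulting dict to `/v1/scrape` as JSON, exactly as in the curl examples below.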
CRW gives you Firecrawl's API with a fraction of the resource usage. No runtime dependencies, no Redis, no Node.js — just a single binary you can deploy anywhere.
| | CRW | Firecrawl |
|---|---|---|
| Coverage (1K URLs) | 92.0% | 77.2% |
| Avg Latency | 833ms | 4,600ms |
| P50 Latency | 446ms | — |
| Noise Rejection | 88.4% | — |
| Idle RAM | 6.6 MB | ~500 MB+ |
| Cold start | 85 ms | seconds |
| HTTP scrape | ~30 ms | ~200 ms+ |
| Binary size | ~8 MB | Node.js runtime |
| Cost / 1K scrapes | $0 (self-hosted) | $0.83–5.33 |
| Dependencies | single binary | Node + Redis |
| License | AGPL-3.0 | AGPL |
Benchmark: Firecrawl scrape-content-dataset-v1 — 1,000 real-world URLs with JS rendering enabled.
- 🔌 Firecrawl-compatible API — same endpoints, same request/response format, drop-in replacement
- 📄 6 output formats — markdown, HTML, cleaned HTML, raw HTML, plain text, links, structured JSON
- 🤖 LLM structured extraction — send a JSON schema, get validated structured data back (Anthropic tool_use + OpenAI function calling)
- 🌐 JS rendering — auto-detect SPAs with shell heuristics, render via LightPanda, Playwright, or Chrome (CDP)
- 🕷️ BFS crawler — async crawl with rate limiting, robots.txt, sitemap support, concurrent jobs
- 🔧 MCP server — built-in stdio + HTTP transport for Claude Code and Claude Desktop
- 🔒 Security — SSRF protection (private IPs, cloud metadata, IPv6), constant-time auth, dangerous URI filtering
- 🐳 Docker ready — multi-stage build with LightPanda sidecar
- 🎯 CSS selector & XPath — extract specific DOM elements before Markdown conversion
- ✂️ Chunking & filtering — split content into topic/sentence/regex chunks; rank by BM25 or cosine similarity
- 🕵️ Stealth mode — browser-like UA rotation and header injection to reduce bot detection
- 🌐 Per-request proxy — override the global proxy per scrape request
Install from crates.io:
```bash
cargo install crw-server
crw-server
```

Enable JS rendering (optional):

```bash
crw-server setup
```

This downloads LightPanda and creates a `config.local.toml` for JS rendering. See JS Rendering for details.

Docker (pre-built image):

```bash
docker run -p 3000:3000 ghcr.io/us/crw:latest
```

Docker Compose (with JS rendering):

```bash
docker compose up
```

Build from source:

```bash
cargo build --release --bin crw-server
./target/release/crw-server
```

Scrape a page:
```bash
curl -X POST http://localhost:3000/v1/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'
```

```json
{
  "success": true,
  "data": {
    "markdown": "# Example Domain\nThis domain is for use in ...",
    "metadata": {
      "title": "Example Domain",
      "sourceURL": "https://example.com",
      "statusCode": 200,
      "elapsedMs": 32
    }
  }
}
```

- RAG pipelines — crawl websites and extract structured data for vector databases
- AI agents — give Claude Code or Claude Desktop web scraping tools via MCP
- Content monitoring — periodic crawl with LLM extraction to track changes
- Data extraction — combine CSS selectors + LLM to extract any schema from any page
- Web archiving — full-site BFS crawl to markdown
| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/v1/scrape` | Scrape a single URL, optionally with LLM extraction |
| `POST` | `/v1/crawl` | Start async BFS crawl (returns job ID) |
| `GET` | `/v1/crawl/:id` | Check crawl status and retrieve results |
| `POST` | `/v1/map` | Discover all URLs on a site |
| `GET` | `/health` | Health check (no auth required) |
| `POST` | `/mcp` | Streamable HTTP MCP transport |
Send a JSON schema with your scrape request and CRW returns validated structured data using LLM function calling.
```bash
curl -X POST http://localhost:3000/v1/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/product",
    "formats": ["json"],
    "jsonSchema": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "price": { "type": "number" }
      },
      "required": ["name", "price"]
    }
  }'
```

- Anthropic — uses `tool_use` with `input_schema` for extraction
- OpenAI — uses function calling with a `parameters` schema
- Validation — LLM output is validated against your JSON schema before returning

Configure the LLM provider in your config:

```toml
[extraction.llm]
provider = "anthropic" # "anthropic" or "openai"
api_key = "sk-..."     # or CRW_EXTRACTION__LLM__API_KEY env var
model = "claude-sonnet-4-20250514"
```

CRW works as an MCP tool server for any AI assistant that supports MCP. It provides 4 tools: `crw_scrape`, `crw_crawl`, `crw_check_crawl_status`, and `crw_map`.
```bash
claude mcp add --transport http crw http://localhost:3000/mcp
```

For Claude Desktop, edit your config file:
| OS | Path |
|---|---|
| macOS | ~/Library/Application Support/Claude/claude_desktop_config.json |
| Windows | %APPDATA%\Claude\claude_desktop_config.json |
| Linux | ~/.config/Claude/claude_desktop_config.json |
```json
{
  "mcpServers": {
    "crw": {
      "command": "/absolute/path/to/crw-mcp",
      "env": { "CRW_API_URL": "http://localhost:3000" }
    }
  }
}
```

Edit `~/.cursor/mcp.json` (global) or `.cursor/mcp.json` (project):
```json
{
  "mcpServers": {
    "crw": {
      "command": "/absolute/path/to/crw-mcp",
      "env": { "CRW_API_URL": "http://localhost:3000" }
    }
  }
}
```

Edit `~/.codeium/windsurf/mcp_config.json`:
```json
{
  "mcpServers": {
    "crw": {
      "command": "/absolute/path/to/crw-mcp",
      "env": { "CRW_API_URL": "http://localhost:3000" }
    }
  }
}
```

Some clients also accept per-tool permission keys such as `alwaysAllow` and `disabled`:

```json
{
  "mcpServers": {
    "crw": {
      "command": "/absolute/path/to/crw-mcp",
      "env": { "CRW_API_URL": "http://localhost:3000" },
      "alwaysAllow": ["crw_scrape", "crw_map"],
      "disabled": false
    }
  }
}
```

Edit `~/.continue/config.yaml`:
```yaml
mcpServers:
  - name: crw
    command: /absolute/path/to/crw-mcp
    env:
      CRW_API_URL: http://localhost:3000
```

Edit `~/.codex/config.toml`:
```toml
[mcp_servers.crw]
command = "/absolute/path/to/crw-mcp"

[mcp_servers.crw.env]
CRW_API_URL = "http://localhost:3000"
```

Any MCP-compatible client can connect to CRW using the standard JSON format:
```json
{
  "mcpServers": {
    "crw": {
      "command": "/absolute/path/to/crw-mcp",
      "env": { "CRW_API_URL": "http://localhost:3000" }
    }
  }
}
```

> Tip: The stdio binary (`crw-mcp`) works with any client. For clients that support HTTP transport, use `http://localhost:3000/mcp` directly — no binary needed.
See the full MCP setup guide for detailed instructions, auth configuration, and platform comparison.
CRW auto-detects SPAs by analyzing the initial HTML response for shell heuristics (empty body, framework markers). When a SPA is detected, it renders the page via a headless browser.
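For intuition, here is a heuristic in the same spirit — illustrative only, not CRW's actual rules; the marker strings and the 50-character threshold are assumptions:

```python
import re

# Strings commonly left in SPA shells by popular frameworks (assumed
# examples, not CRW's real marker list).
FRAMEWORK_MARKERS = ("__NEXT_DATA__", "ng-version", 'id="root"', 'id="app"')

def looks_like_spa(html: str) -> bool:
    """Shell heuristic: a near-empty <body> or a known framework marker
    suggests the page is rendered client-side and needs a browser."""
    body = re.search(r"<body[^>]*>(.*?)</body>", html, re.S | re.I)
    body_text = re.sub(r"<[^>]+>", "", body.group(1)).strip() if body else ""
    has_marker = any(marker in html for marker in FRAMEWORK_MARKERS)
    return has_marker or len(body_text) < 50
```

When a heuristic like this fires, the request is routed to the configured headless renderer instead of the plain HTTP fetcher.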
Quick setup (recommended):

```bash
crw-server setup
```

This automatically downloads the LightPanda binary to `~/.local/bin/` and creates a `config.local.toml` with the correct renderer settings. Then start LightPanda and CRW:

```bash
lightpanda serve --host 127.0.0.1 --port 9222 &
crw-server
```

Supported renderers:
| Renderer | Protocol | Best for |
|---|---|---|
| LightPanda | CDP over WebSocket | Low-resource environments (default) |
| Playwright | CDP over WebSocket | Full browser compatibility |
| Chrome | CDP over WebSocket | Existing Chrome infrastructure |
Renderer mode is configured via `renderer.mode`: `auto` (default), `lightpanda`, `playwright`, `chrome`, or `none`.
With Docker Compose, LightPanda runs as a sidecar — no extra setup needed:

```bash
docker compose up
```

```text
┌─────────────────────────────────────────────┐
│                 crw-server                  │
│         Axum HTTP API + Auth + MCP          │
├──────────┬───────────┬──────────────────────┤
│ crw-crawl│crw-extract│ crw-renderer         │
│ BFS crawl│ HTML→MD   │ HTTP + CDP(WS)       │
│ robots   │ LLM/JSON  │ LightPanda/Chrome    │
│ sitemap  │clean/read │ auto-detect SPA      │
├──────────┴───────────┴──────────────────────┤
│                  crw-core                   │
│            Types, Config, Errors            │
└─────────────────────────────────────────────┘
```
CRW uses layered TOML configuration with environment variable overrides:
- `config.default.toml` — built-in defaults
- `config.local.toml` — local overrides (or set `CRW_CONFIG=myconfig`)
- Environment variables — `CRW_` prefix with `__` separator (e.g. `CRW_SERVER__PORT=8080`)
```toml
[server]
host = "0.0.0.0"
port = 3000

[renderer]
mode = "auto" # auto | lightpanda | playwright | chrome | none

[crawler]
max_concurrency = 10
requests_per_second = 10.0
respect_robots_txt = true

[auth]
# api_keys = ["fc-key-1234"]
```

See the full configuration reference for all options.
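When `api_keys` is set, clients must authenticate. Since the API is Firecrawl-compatible, an `Authorization: Bearer` header is the assumed scheme; a standard-library-only client sketch:

```python
import json
import urllib.request

def auth_headers(api_key: str) -> dict:
    # Bearer scheme assumed from the Firecrawl-compatible API surface.
    return {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }

def scrape_authed(url: str, api_key: str,
                  base: str = "http://localhost:3000") -> dict:
    """POST /v1/scrape with a Bearer token and return the data payload."""
    req = urllib.request.Request(
        f"{base}/v1/scrape",
        data=json.dumps({"url": url}).encode(),
        headers=auth_headers(api_key),
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["data"]
```

The same header works with `curl -H "Authorization: Bearer fc-key-1234" ...`.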
Python:
```python
import requests

response = requests.post("http://localhost:3000/v1/scrape", json={
    "url": "https://example.com",
    "formats": ["markdown", "links"]
})
data = response.json()["data"]
print(data["markdown"])
```

Node.js:
```javascript
const response = await fetch("http://localhost:3000/v1/scrape", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    url: "https://example.com",
    formats: ["markdown", "links"]
  })
});
const { data } = await response.json();
console.log(data.markdown);
```

LangChain document loader pattern:
```python
import requests

def load_documents(urls):
    documents = []
    for url in urls:
        resp = requests.post("http://localhost:3000/v1/scrape", json={
            "url": url,
            "formats": ["markdown"]
        })
        data = resp.json()["data"]
        documents.append({
            "page_content": data["markdown"],
            "metadata": data["metadata"]
        })
    return documents
```

Pre-built image from GHCR:
```bash
docker pull ghcr.io/us/crw:latest
docker run -p 3000:3000 ghcr.io/us/crw:latest
```

Docker Compose (with JS rendering sidecar):

```bash
docker compose up
```

This starts CRW on port 3000 with LightPanda as a JS rendering sidecar on port 9222. CRW auto-connects to LightPanda for SPA rendering.
Tested on Firecrawl's scrape-content-dataset-v1 (1,000 real-world URLs, JS rendering enabled):
| | CRW | Firecrawl v2.5 |
|---|---|---|
| Coverage | 92.0% | 77.2% |
| Avg Latency | 833ms | 4,600ms |
| P50 Latency | 446ms | — |
| Noise Rejection | 88.4% | — |
| Cost / 1,000 scrapes | $0 (self-hosted) | $0.83–5.33 |
| Idle RAM | 6.6 MB | ~500 MB+ |
Run the benchmark yourself:
```bash
pip install datasets aiohttp
python3 bench/run_bench.py
```

| Crate | Description |
|---|---|
| `crw-core` | Core types, config, and error handling |
| `crw-renderer` | HTTP + CDP browser rendering engine |
| `crw-extract` | HTML → markdown/plaintext extraction |
| `crw-crawl` | Async BFS crawler with robots.txt & sitemap |
| `crw-server` | Axum API server (Firecrawl-compatible) |
| `crw-mcp` | MCP stdio proxy binary |
See docs/crates.md for usage examples and cargo add instructions.
Full documentation: docs/index.md
CRW includes built-in protections against common web scraping attack vectors:
- SSRF protection — all URL inputs (REST API + MCP) are validated against private/internal networks:
  - Loopback (`127.0.0.0/8`, `::1`, `localhost`)
  - Private IPs (`10.0.0.0/8`, `172.16.0.0/12`, `192.168.0.0/16`)
  - Link-local / cloud metadata (`169.254.0.0/16` — blocks AWS/GCP metadata endpoints)
  - IPv6 mapped addresses (`::ffff:127.0.0.1`), link-local (`fe80::`), ULA (`fc00::/7`)
  - Non-HTTP schemes (`file://`, `ftp://`, `gopher://`, `data:`)
- Auth — optional Bearer token with constant-time comparison (no length or key-index leakage)
- robots.txt — respects `Allow`/`Disallow` with wildcard patterns (`*`, `$`) and RFC 9309 specificity
- Rate limiting — configurable per-second request cap
- Resource limits — max body size (1 MB), max crawl depth (10), max pages (1000), max discovered URLs (5000)
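The URL checks above can be approximated with the standard library. An illustrative sketch, not CRW's actual implementation (a real scraper must also resolve DNS names and re-check the resolved IPs to defeat rebinding):

```python
import ipaddress
from urllib.parse import urlparse

def url_is_blocked(url: str) -> bool:
    """Reject non-HTTP schemes and private/link-local/loopback hosts,
    mirroring the SSRF rules listed above (illustrative only)."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return True  # file://, ftp://, gopher://, data:, ...
    host = parsed.hostname or ""
    if host == "localhost":
        return True
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False  # a DNS name; resolve and re-check in real code
    if isinstance(ip, ipaddress.IPv6Address) and ip.ipv4_mapped:
        ip = ip.ipv4_mapped  # unwrap ::ffff:127.0.0.1 and friends
    # is_private covers 10/8, 172.16/12, 192.168/16 and fc00::/7;
    # is_link_local covers 169.254/16 (cloud metadata) and fe80::.
    return ip.is_private or ip.is_loopback or ip.is_link_local
```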
Contributions are welcome! Please open an issue or submit a pull request.
- Fork the repository
- Create your feature branch (`git checkout -b feat/my-feature`)
- Commit your changes (`git commit -m 'feat: add my feature'`)
- Push to the branch (`git push origin feat/my-feature`)
- Open a Pull Request