+
+
+
+
diff --git a/claudedocs/research_crw_security_assessment_20260525.md b/claudedocs/research_crw_security_assessment_20260525.md
new file mode 100644
index 0000000..c78ed64
--- /dev/null
+++ b/claudedocs/research_crw_security_assessment_20260525.md
@@ -0,0 +1,179 @@
+# CRW / fastCRW Security Assessment
+
+**Date:** 2026-05-25
+**Scope:** Full codebase at `/home/pantinor/data/repo/apps/crw` (v0.10.0)
+**Perspective:** Self-hosted operator concerned about local machine exposure
+**Deployment context:** Backends in rootless Podman; interest in client-side security
+
+---
+
+## Executive Summary
+
+**Overall verdict: Good security posture with a few notable gaps.**
+
+CRW demonstrates deliberate, above-average security engineering for an open-source web scraping tool. The Rust language choice eliminates entire vulnerability classes (buffer overflows, use-after-free). SSRF protections are comprehensive at the API level. There is zero telemetry or phone-home behavior. The dependency surface is clean (rustls-only TLS, no openssl, all crates.io sources). Docker container hardening for SearXNG is exemplary.
+
+The gaps that matter most for your deployment are:
+
+1. **The browse/MCP mode has weaker SSRF protection** than the server API -- it can navigate to internal IPs
+2. **Default configuration is too open** for local-only use (binds 0.0.0.0, no auth, permissive CORS)
+3. **Two Docker images use `:latest` tags**, creating supply chain drift risk
+4. **No binary checksum verification** in `install.sh` or the Python SDK
+
+None of these are showstoppers. All are configurable or patchable. Your choice to run backends in rootless Podman significantly reduces container-escape risk.
+
+---
+
+## Findings by Severity
+
+### CRITICAL: None
+
+No critical vulnerabilities were found. SSRF is blocked at the API level, and the review found no RCE path from the API, no command injection, and no route for exfiltrating data from the host.
+
+---
+
+### HIGH (3 findings)
+
+#### H1. Browse mode missing SSRF protection on `goto`
+
+**Files:** `crates/crw-browse/src/tools/goto.rs`
+**What:** The browse MCP server (used when AI agents need interactive browsing) validates URL schemes (blocks `file://`, `javascript:`, etc.) but does NOT call `validate_safe_url()` to block private/loopback IPs. An AI agent could navigate the browser to `http://169.254.169.254/latest/meta-data/` (cloud metadata), `http://localhost:9222` (other services), or any internal network address.
+**Combined with:** The `evaluate` tool allows arbitrary JS execution on the navigated page, and the `storage` tool can read cookies/localStorage including HttpOnly cookies via CDP.
+**Your risk:** If you run browse mode, a prompt-injected AI agent could scan your local network through the browser. If you don't use browse mode (`crw browse`), this doesn't apply.
+**Mitigation:** You can avoid using the `crw browse` command. The core scraping API (`/v1/scrape`, `/v1/crawl`, `/v1/map`) has proper SSRF protection.
+
+#### H2. Browser-mediated SSRF via JS redirects
+
+**Files:** `crates/crw-renderer/src/cdp.rs`
+**What:** The server-side API validates URLs before fetching, but pages rendered in Chrome/LightPanda can execute JavaScript that redirects to internal addresses (`window.location = 'http://169.254.169.254/...'`). The resulting HTML is returned in the API response. Network capture can also leak responses from XHR requests the page makes to internal endpoints.
+**Your risk:** Moderate for self-hosted. An attacker would need to control a page you scrape and have that page contain a JS redirect to an internal service. Cloud metadata (AWS/GCP) is the main concern; in a Podman deployment without cloud metadata, the risk is lower.
+**Mitigation:** Your rootless Podman containers limit what internal endpoints are reachable from the browser container. Network namespace isolation helps here.
+
+#### H3. GitHub Actions pinned by version tag, not SHA
+
+**Files:** `.github/workflows/release.yml` (14 distinct actions)
+**What:** All CI/CD actions use mutable version tags (`@v4`, `@v6`). The release workflow has access to Cargo, NPM, and PyPI registry tokens.
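+**Your risk:** This is a supply chain concern for the project maintainers, not directly for you as a user. However, a compromised release action could publish malicious binaries that you would download. Pinning each action to a full commit SHA would close this gap, but that is a change for the maintainers to make.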
+
+---
+
+### MEDIUM (11 findings)
+
+| # | Finding | Your Risk | Mitigation |
+|---|---------|-----------|------------|
+| M1 | **Default bind 0.0.0.0** -- server listens on all interfaces | Any device on your LAN can use the scraper | Set `host = "127.0.0.1"` in your config |
+| M2 | **No auth by default** -- API keys commented out in default config | Combined with M1, anyone on your LAN has full access | Set `api_keys = ["your-key"]` in `[auth]` |
+| M3 | **Permissive CORS** -- `Access-Control-Allow-Origin: *` | Any website you visit could make cross-origin requests to your CRW instance | Set API keys (M2); a localhost bind (M1) alone does not help, because your browser runs on the same machine |
+| M4 | **Unauthenticated admin endpoints** -- `/admin/breakers/reset`, `/metrics` bypass auth | Operational state can be manipulated without auth | A localhost bind (M1) keeps LAN hosts out; API keys do not help here because these routes skip auth |
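+| M5 | **DNS rebinding gap** in SSRF protection | Theoretical bypass of URL validation via DNS TOCTOU (the name resolves differently between validation and fetch) | Hard to exploit in practice; Podman's network namespace limits which internal endpoints are reachable |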
+| M6 | **`lightpanda/browser:latest`** unpinned Docker image | Silent supply chain drift | Pin to a specific version/digest |
+| M7 | **`chromedp/headless-shell:latest`** unpinned Docker image | Same as M6 | Pin to a specific Chromium version tag |
+| M8 | **Dockerfile runs as root** -- no USER directive | Container processes run as root inside the container | Rootless Podman maps this to your UID -- **already mitigated** |
+| M9 | **Browser containers lack hardening** -- no `cap_drop`, `security_opt` | Browser processes have unnecessary capabilities | Add `cap_drop: [ALL]`, `security_opt: [no-new-privileges:true]` |
+| M10 | **Chrome `--ignore-certificate-errors`** | Chrome accepts MITM'd certificates for scraped sites | Only affects scraping targets, not your machine |
+| M11 | **Prompt injection via scraped content** to LLM | Attacker-controlled pages could influence extraction/summary output | Only if you use LLM features; validate LLM outputs |
+
+---
+
+### LOW (10 findings)
+
+| # | Finding | Notes |
+|---|---------|-------|
+| L1 | Test-only SSRF bypass env var (`CRW_ALLOW_LOOPBACK_FOR_TESTS`) is runtime, not compile-time | Don't set this in production |
+| L2 | `--llm-key` CLI flag exposes key in process list and shell history | Use env var `CRW_EXTRACTION__LLM__API_KEY` instead |
+| L3 | Proxy credentials logged on parse failure | Only triggers on malformed proxy URLs |
+| L4 | Default secrets in docker-compose (`SEARXNG_SECRET_KEY`, `BROWSERLESS_TOKEN`) | Set proper values in `.env` for non-local deployments |
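+| L5 | Regex DoS in chunking -- user-supplied regex not size-limited | Rust's `regex` crate guarantees linear-time matching, so catastrophic backtracking is ruled out; the residual risk is CPU and memory cost from very large patterns |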
+| L6 | `Debug` derive on structs containing API keys | Only a risk if debug logging is added in the future |
+| L7 | WebSocket connections to renderers are plaintext (`ws://`) | Contained within Docker bridge network |
+| L8 | No `cargo audit` in CI | 496 transitive deps not scanned for advisories |
+| L9 | `cross` installed from git HEAD in release CI | Unpinned build tool |
+| L10 | Screenshot tool in browse mode writes to arbitrary file paths | Only relevant if using browse mode |
+
+---
+
+### Positive Findings (INFO)
+
+These are things the project does **well**:
+
+| Area | Detail |
+|------|--------|
+| **Memory safety** | Rust; near-zero `unsafe` (only 2 justified `killpg` calls, plus `set_var`, which is `unsafe` as of edition 2024) |
+| **TLS** | Exclusive `rustls` -- no openssl/native-tls in entire dependency tree |
+| **No telemetry** | Zero phone-home behavior. Comprehensive grep confirmed no analytics, tracking, or update-check calls |
+| **SSRF at API level** | `validate_safe_url()` applied on all 5 route handlers + every redirect hop |
+| **Constant-time auth** | API key comparison resists timing attacks; all configured keys checked without short-circuit |
+| **Body size limits** | 1 MB max request body across all endpoints |
+| **Input clamping** | URL length (2048 chars), search query (2000 chars), map params (64 max), network capture (30 bodies, 2 MB) |
+| **Stateless** | No database, no persistent storage, no disk-based caches |
+| **Config file security** | Written with `0o600` permissions, symlink-safe, atomic writes |
+| **Container hardening (SearXNG)** | `read_only`, `cap_drop: ALL`, `no-new-privileges`, `tmpfs`, resource limits, healthcheck |
+| **Security headers** | `X-Content-Type-Options: nosniff`, `X-Frame-Options: DENY` |
+| **Security test suite** | Dedicated `tests/security.rs` covering SSRF, auth, information disclosure |
+| **Dependency hygiene** | All 496 deps from crates.io (zero git deps), minimal Python/NPM surfaces |
+| **No build scripts** | No `build.rs` or proc-macro crates in the workspace |
+
+---
+
+## Your Specific Deployment: Rootless Podman
+
+Your choice of rootless Podman is a strong mitigation for several findings:
+
+| Finding | Podman Mitigation |
+|---------|------------------|
+| M8 (runs as root) | **Fully mitigated.** Rootless Podman maps container root to your UID via user namespaces. Even PID 1 inside the container runs as your unprivileged user on the host. |
+| H2 (browser SSRF) | **Partially mitigated.** Network namespace isolation limits what internal endpoints the browser can reach. Podman's `slirp4netns` or `pasta` networking adds a layer of separation. |
+| M9 (browser capabilities) | **Partially mitigated.** Rootless Podman drops many capabilities by default. `CAP_SYS_ADMIN` (needed for Chrome sandbox) is typically not available, which is why `--no-sandbox` is used. |
+| M10 (cert errors) | **Not mitigated.** This is a Chrome flag that affects TLS validation for scraped targets regardless of container runtime. |
+
+---
+
+## Recommended Hardening for Your Setup
+
+### Immediate (config changes only, no code changes):
+
+1. **Bind to localhost**: Set `host = "127.0.0.1"` in your config TOML
+2. **Set API keys**: Uncomment and set `api_keys = ["your-key"]` in `[auth]`
+3. **Pin Docker images**: Change `:latest` to specific versions for `lightpanda/browser` and `chromedp/headless-shell`
+4. **Harden browser containers**: Add to your docker-compose override:
+   ```yaml
+   lightpanda:
+     cap_drop: [ALL]
+     security_opt: [no-new-privileges:true]
+     pids_limit: 128
+   chrome:
+     cap_drop: [ALL]
+     security_opt: [no-new-privileges:true]
+     pids_limit: 256
+   ```
+
+### If using browse mode:
+
+5. **Be aware** that browse mode has weaker SSRF protection than the server API. AI agents connected via MCP can navigate the browser to internal addresses. If this concerns you, avoid using `crw browse` or restrict which AI agents can access it.
+
+### If using LLM features:
+
+6. **Use env vars for API keys**: `CRW_EXTRACTION__LLM__API_KEY=sk-...` instead of the `--llm-key` CLI flag, which exposes the key in the process list and shell history (see L2)
+7. **Validate LLM outputs**: Scraped content may contain prompt injection payloads
+
+---
+
+## Fair Overall Assessment
+
+**For a self-hosted web scraper, CRW shows a stronger security posture than is typical for tools in its class.** The Rust foundation, memory-safe TLS, API-level SSRF protections, zero telemetry, and stateless architecture are all strong positives. The security test suite shows intentional security thinking rather than an afterthought.
+
+**The main concerns for a local deployment are configuration defaults** (0.0.0.0 bind, no auth, permissive CORS), which are easily fixed. The browse mode SSRF gap is the most substantive code-level issue, but it only applies if you use that specific feature.
+
+**Your rootless Podman deployment is a good defense-in-depth choice** that mitigates several container-level concerns. Combined with the localhost bind and API key config changes, your exposure surface becomes quite small: you'd be running a localhost-only, authenticated, stateless Rust binary that fetches web pages with SSRF protection and returns markdown.
+
+**Confidence level for daily use:** HIGH, provided you apply the config hardening above.
+
+---
+
+## Methodology
+
+This assessment was conducted through static analysis of the full codebase (10 crates, ~15,000 lines of Rust), configuration files, Docker infrastructure, CI/CD pipelines, install scripts, and NPM/Python distribution packages. Four parallel analysis streams covered: server-side security, client/MCP security, supply chain/dependencies, and network exposure. No dynamic testing or fuzzing was performed.
+
+---
+
+*Report generated by security analysis of CRW v0.10.0 codebase.*
diff --git a/crates/crw-crawl/Cargo.toml b/crates/crw-crawl/Cargo.toml
index 9d95696..59790ce 100644
--- a/crates/crw-crawl/Cargo.toml
+++ b/crates/crw-crawl/Cargo.toml
@@ -11,7 +11,7 @@ description = "Async BFS web crawler with rate limiting and robots.txt support f
[dependencies]
crw-core = { path = "../crw-core", version = "0.12.0" }
-crw-diff = { path = "../crw-diff", version = "0.11.0" }
+crw-diff = { path = "../crw-diff", version = "0.12.0" }
crw-renderer = { path = "../crw-renderer", version = "0.12.0" }
crw-extract = { path = "../crw-extract", version = "0.12.0" }
reqwest = { workspace = true }
diff --git a/crates/crw-diff/Cargo.toml b/crates/crw-diff/Cargo.toml
index 6d1fe02..f2decd3 100644
--- a/crates/crw-diff/Cargo.toml
+++ b/crates/crw-diff/Cargo.toml
@@ -13,7 +13,7 @@ description = "Stateless change-tracking diff engine for the CRW web scraper"
# Shared types only (ChangeTrackingOptions/Result, DiffAst, etc.). This crate
# MUST NOT depend on crw-extract — judging is injected upstream so the diff
# engine stays pure (no LLM, no HTTP, no I/O).
-crw-core = { path = "../crw-core", version = "0.11.0" }
+crw-core = { path = "../crw-core", version = "0.12.0" }
serde = { workspace = true }
serde_json = { workspace = true }
similar = { workspace = true }
diff --git a/crates/crw-monitor/Cargo.toml b/crates/crw-monitor/Cargo.toml
index 57c9acc..66f2c46 100644
--- a/crates/crw-monitor/Cargo.toml
+++ b/crates/crw-monitor/Cargo.toml
@@ -14,11 +14,11 @@ description = "Optional self-host monitor mode for the CRW web scraper (SQLite-b
# `crw-crawl` provides the scrape/crawl primitives; `crw-extract` provides the
# LLM judge. None of these pull a DB dependency — the SQLite/cron/hmac stack is
# local to this crate and feature-gated, never compiled into the default server.
-crw-core = { path = "../crw-core", version = "0.11.0" }
-crw-diff = { path = "../crw-diff", version = "0.11.0" }
-crw-crawl = { path = "../crw-crawl", version = "0.11.0" }
-crw-extract = { path = "../crw-extract", version = "0.11.0" }
-crw-renderer = { path = "../crw-renderer", version = "0.11.0" }
+crw-core = { path = "../crw-core", version = "0.12.0" }
+crw-diff = { path = "../crw-diff", version = "0.12.0" }
+crw-crawl = { path = "../crw-crawl", version = "0.12.0" }
+crw-extract = { path = "../crw-extract", version = "0.12.0" }
+crw-renderer = { path = "../crw-renderer", version = "0.12.0" }
serde = { workspace = true }
serde_json = { workspace = true }
diff --git a/crates/crw-renderer/Cargo.toml b/crates/crw-renderer/Cargo.toml
index 260eeca..0ba835a 100644
--- a/crates/crw-renderer/Cargo.toml
+++ b/crates/crw-renderer/Cargo.toml
@@ -16,7 +16,7 @@ auto-browser = ["dep:dirs"]
[dependencies]
crw-core = { path = "../crw-core", version = "0.12.0" }
-crw-extract = { path = "../crw-extract", version = "0.11.0" }
+crw-extract = { path = "../crw-extract", version = "0.12.0" }
reqwest = { workspace = true }
tokio = { workspace = true }
tokio-tungstenite = { version = "0.28", features = ["rustls-tls-native-roots"], optional = true }
diff --git a/crates/crw-server/Cargo.toml b/crates/crw-server/Cargo.toml
index 5f16080..9478117 100644
--- a/crates/crw-server/Cargo.toml
+++ b/crates/crw-server/Cargo.toml
@@ -20,13 +20,13 @@ monitor = ["dep:crw-monitor"]
[dependencies]
crw-core = { path = "../crw-core", version = "0.12.0" }
-crw-diff = { path = "../crw-diff", version = "0.11.0" }
+crw-diff = { path = "../crw-diff", version = "0.12.0" }
crw-renderer = { path = "../crw-renderer", version = "0.12.0" }
crw-extract = { path = "../crw-extract", version = "0.12.0" }
crw-crawl = { path = "../crw-crawl", version = "0.12.0" }
crw-search = { path = "../crw-search", version = "0.12.0" }
# Optional self-host monitor mode (default OFF — see the `monitor` feature).
-crw-monitor = { path = "../crw-monitor", version = "0.11.0", optional = true }
+crw-monitor = { path = "../crw-monitor", version = "0.12.0", optional = true }
axum = { workspace = true }
tower = { workspace = true }
tower-http = { workspace = true }