First stable release. kreuzcrawl ships a Rust core with active bindings for
Python, TypeScript/Node, Ruby, PHP, Go, Java/JNI, C#, Elixir, WebAssembly,
Dart, Kotlin/Android, Swift, Zig, and C FFI, plus a CLI, an HTTP API, and an
MCP server.
Added
- Tiered dispatch engine. The crawl engine chains HTTP → Bypass → Browser
  tiers driven by per-attempt signals rather than a single bypass
  short-circuit. Public `kreuzcrawl::types::dispatch` surface: `Tier`,
  `EscalationStrategy`, `EscalationReason`, `AttemptOutcome`, `RetryDirective`,
  `RetryPolicy`, `WafSignal`, `WafClassifier`, `DomainStatePort`,
  `DomainRecommendation`, `EscalationBudget`, and `DispatchProfile` (dispatch
  enums are `#[non_exhaustive]`). `CrawlConfig::builder()` and
  `DispatchProfile::builder()` provide fluent construction.
- WAF detection. A TOML fingerprint corpus (`rules/waf_fingerprints.toml`,
  34 fingerprints) with an Aho-Corasick matcher, `TomlClassifier::watch()`
  hot-reload (debounced, atomic `ArcSwap`, Kubernetes ConfigMap-safe), and
  `EwmaDomainState` for per-domain block-rate tracking that promotes/demotes
  the starting tier.
- SSRF defense. New `kreuzcrawl::net::ssrf` module: `SsrfPolicy`,
  `HostMatcher` (`Exact`/`Suffix`/`Cidr`), `SsrfError`, and async
  `validate_url`. Adds `CrawlConfig::ssrf`, the builder methods
  `allow_private_networks(bool)` and `ssrf_allowlist_host(HostMatcher)`, and
  the `CrawlError::SsrfPolicyViolation` variant. Exposed as a settable DTO
  (`deny_private`, `max_redirects`) across every binding.
- Browser pool injection. `BrowserPool`/`BrowserPoolConfig` and
  `NativeBrowserExecutor`/`NativeBrowserExecutorConfig` are public;
  `CrawlEngineBuilder::with_browser_pool`/`with_native_executor` and
  `CrawlEngineHandle::from_engine` let consumers construct and `warm()` a pool
  once and reuse it across all crawl jobs.
- Public substrate parsers. `kreuzcrawl::robots` and `kreuzcrawl::sitemap`
  are public (`parse_robots_txt`, `is_path_allowed`, `RobotsRules`,
  `parse_sitemap_xml`, `parse_sitemap_index`, `is_sitemap_index`) and usable
  without spinning up the engine.
- Pluggable proxy rotation. `ProxyProvider` trait + `StaticProxyProvider`
  baseline, wired into the reqwest fetch path via
  `CrawlEngineBuilder::with_proxy_provider`; called per request and taking
  precedence over the static `CrawlConfig::proxy` value.
- CLI. `batch-scrape`, `batch-crawl`, `download`, `citations`, and
  `version` subcommands, bringing the CLI to 1:1 with the core and MCP
  surfaces.
- MCP server. Tools are 1:1 with the CLI (`batch_crawl`,
  `generate_citations`, …), each declaring `read_only`/`destructive`/
  `open_world` safety annotations, and are served over both stdio and rmcp
  Streamable HTTP at `/mcp` when the binary is built with the `api` + `mcp`
  features.
- Observability. OpenTelemetry counters
  `kreuzcrawl_waf_fingerprint_matches_total` and
  `kreuzcrawl_escalations_total`, plus property tests, cargo-fuzz targets, and
  Criterion benchmarks covering the WAF subsystem.
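
To illustrate how per-domain block-rate tracking can promote or demote the starting tier, here is a minimal, self-contained sketch. It does not use the kreuzcrawl API: `EwmaBlockRate`, `escalate`, the `Http`/`Bypass`/`Browser` variant names, the smoothing factor, and the 0.4/0.8 thresholds are all assumptions chosen for illustration, and the real `EwmaDomainState` and `Tier` types may differ.

```rust
/// Hypothetical stand-in for the dispatch tiers (HTTP -> Bypass -> Browser).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Tier {
    Http,
    Bypass,
    Browser,
}

/// Next tier to try after a blocked attempt, or `None` when already at the top.
fn escalate(tier: Tier) -> Option<Tier> {
    match tier {
        Tier::Http => Some(Tier::Bypass),
        Tier::Bypass => Some(Tier::Browser),
        Tier::Browser => None,
    }
}

/// Exponentially weighted moving average of a domain's block rate.
/// Illustrative only; not the library's `EwmaDomainState`.
struct EwmaBlockRate {
    alpha: f64, // weight given to the newest observation
    rate: f64,  // current smoothed block rate in [0, 1]
}

impl EwmaBlockRate {
    fn new(alpha: f64) -> Self {
        Self { alpha, rate: 0.0 }
    }

    /// Fold one attempt outcome into the average.
    fn observe(&mut self, blocked: bool) {
        let sample = if blocked { 1.0 } else { 0.0 };
        self.rate = self.alpha * sample + (1.0 - self.alpha) * self.rate;
    }

    /// Pick the starting tier from the smoothed rate (thresholds are assumed).
    fn starting_tier(&self) -> Tier {
        if self.rate >= 0.8 {
            Tier::Browser
        } else if self.rate >= 0.4 {
            Tier::Bypass
        } else {
            Tier::Http
        }
    }
}
```

Because the average decays, a domain that stops blocking drifts back toward the cheaper HTTP tier without any explicit reset.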
Changed
- Memory-bounded streaming crawl. `crawl_stream`/`batch_crawl_stream`
  move each page into its `CrawlEvent::Page` and drop it instead of
  accumulating every page, bounding peak memory on large crawls (≈2.5 GB →
  ≈20 MB working set). `crawl()`'s batch result is unchanged.
- Dispatch model. `CrawlError::WafBlocked` is now a struct variant
  (`{ vendor, message }`); `DomainStatePort` moved to an observation model
  (`recommend`/`observe`); `SimpleRetryPolicy`'s off-by-one is fixed
  (`max_retries=3` yields 3 retries); `#[non_exhaustive]` added to
  `CrawlError`, `NetworkErrorKind`, and the dispatch enums so future variants
  are non-breaking.
- Asset downloads route through `http_fetch`, so every file fetch is
  subject to the SSRF policy.
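
The retry-count semantics can be sketched in a few lines. This is not `SimpleRetryPolicy` itself: `run_with_retries` is a hypothetical helper, and it assumes that "3 retries" means three attempts after the initial one, for four attempts in total.

```rust
/// Run `attempt` until it succeeds or the retry budget is spent.
/// Returns whether it succeeded and how many attempts were made.
/// Hypothetical helper; the closure receives the zero-based attempt index.
fn run_with_retries<F>(max_retries: u32, mut attempt: F) -> (bool, u32)
where
    F: FnMut(u32) -> bool,
{
    let mut attempts = 0;
    // Index 0 is the initial attempt; indices 1..=max_retries are the retries.
    // An exclusive range (0..max_retries) here would lose one retry.
    for n in 0..=max_retries {
        attempts += 1;
        if attempt(n) {
            return (true, attempts);
        }
    }
    (false, attempts)
}
```

Under that assumption, an operation that always fails with `max_retries = 3` is attempted four times before the helper gives up.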
Fixed
- Crawl loop materializes downloaded documents. The `download_documents`
  flag was previously honored only by single-page `scrape()`; the crawl loop
  now builds `CrawlPageResult.downloaded_document` for linked PDFs/DOCX via a
  shared helper instead of fetching, flagging, and discarding the bytes.
- SSRF rollout hardening. Follow-up fixes to the SSRF refactor: redirect
  `final_url` is tracked again (per-hop re-validation moved into
  `follow_redirects`), within-batch URL dedup no longer races, crawl
  child-depth is incremented (restoring `max_depth` and `include_paths`
  semantics), and `CrawlConfig` JSON deserialization honors
  `KREUZCRAWL_ALLOW_PRIVATE_NETWORK` through a `SsrfPolicy::from_env` serde
  default. Each is covered by a regression test.
- MCP server exposed zero tools. The handler was missing rmcp's
  `#[tool_handler]`, so `tools/list`/`tools/call` returned an empty list over
  both stdio and HTTP; it now delegates to the generated tool router.
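
The child-depth fix is easiest to see in a toy breadth-first crawl. The sketch below is an assumption-laden illustration, not the engine's crawl loop: `crawl_depths` is a hypothetical function and the link graph is an in-memory map instead of fetched pages. It shows why children must be enqueued at `depth + 1` for a depth limit to have any effect.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Breadth-first walk over an in-memory link graph, returning each visited
/// URL with the depth at which it was reached. Hypothetical illustration.
fn crawl_depths(
    links: &HashMap<&str, Vec<&str>>,
    seed: &str,
    max_depth: usize,
) -> Vec<(String, usize)> {
    let mut seen: HashSet<String> = HashSet::new();
    let mut queue: VecDeque<(String, usize)> = VecDeque::new();
    let mut visited = Vec::new();

    seen.insert(seed.to_string());
    queue.push_back((seed.to_string(), 0));

    while let Some((url, depth)) = queue.pop_front() {
        visited.push((url.clone(), depth));
        // Pages at the depth limit are visited but their links are not followed.
        if depth == max_depth {
            continue;
        }
        if let Some(children) = links.get(url.as_str()) {
            for child in children {
                // Deduplicate, then enqueue the child one level deeper.
                // Enqueuing at `depth` instead would make the limit unreachable.
                if seen.insert(child.to_string()) {
                    queue.push_back((child.to_string(), depth + 1));
                }
            }
        }
    }
    visited
}
```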
Security
- SSRF defense, enabled by default. `scrape()`, `crawl()`,
  `batch_crawl()`, sitemap fetch, robots.txt fetch, and asset download refuse
  URLs resolving to loopback (127.0.0.0/8), RFC 1918 private networks,
  link-local, which includes cloud metadata endpoints (169.254.0.0/16),
  the unspecified "this network" range (0.0.0.0/8), multicast
  (224.0.0.0/4), IPv6 ULA (fc00::/7), IPv6 link-local (fe80::/10), IPv6
  multicast (ff00::/8), or any non-http(s) scheme. Includes DNS-rebinding
  mitigation (every resolved IP must pass the policy), redirect-chain
  re-validation (bounded by `ssrf.max_redirects`, default 5), and
  link-enqueue validation with bounded concurrency. Opt out via
  `KREUZCRAWL_ALLOW_PRIVATE_NETWORK=1` or
  `CrawlConfig::allow_private_networks(true)`.
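
As a rough illustration of the address checks listed above, the following sketch classifies resolved IPs using only the standard library. It is not the `kreuzcrawl::net::ssrf` implementation: `is_denied` and `all_resolved_allowed` are hypothetical names, and only the ranges named in this entry are covered (a production policy would likely also consider cases such as `::1` and IPv4-mapped IPv6 addresses).

```rust
use std::net::{IpAddr, Ipv4Addr, Ipv6Addr};

/// IPv4 ranges named in the changelog entry.
fn is_denied_v4(ip: Ipv4Addr) -> bool {
    ip.is_loopback()           // 127.0.0.0/8
        || ip.is_private()     // RFC 1918: 10/8, 172.16/12, 192.168/16
        || ip.is_link_local()  // 169.254.0.0/16 (covers 169.254.169.254)
        || ip.octets()[0] == 0 // 0.0.0.0/8 ("this network")
        || ip.is_multicast()   // 224.0.0.0/4
}

/// IPv6 ranges named in the changelog entry.
fn is_denied_v6(ip: Ipv6Addr) -> bool {
    let first = ip.segments()[0];
    (first & 0xfe00) == 0xfc00        // fc00::/7 unique local
        || (first & 0xffc0) == 0xfe80 // fe80::/10 link-local
        || ip.is_multicast()          // ff00::/8
}

/// Hypothetical policy check for a single address.
fn is_denied(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => is_denied_v4(v4),
        IpAddr::V6(v6) => is_denied_v6(v6),
    }
}

/// DNS-rebinding mitigation in miniature: a host is acceptable only if it
/// resolved to at least one address and every resolved address passes.
fn all_resolved_allowed(ips: &[IpAddr]) -> bool {
    !ips.is_empty() && ips.iter().all(|ip| !is_denied(*ip))
}
```

Checking every resolved address, not just the first, matters because a hostname can resolve to a mix of public and internal IPs.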
Build
- Bindings, facades, READMEs, docs, stubs, and e2e suites are generated by
  alef (pinned at 0.26.6) across all 14 language targets.
- Publish-pipeline hardening: a native per-arch Docker matrix that drops QEMU
  emulation, Flutter-free Dart native builds for pub.dev, Swift artifactbundle
  checksum injection and Apple system-framework linking, and
  lockfile-preserving source publishes for the Elixir NIF, PHP extension, and
  Ruby gem.
Zig
Add to your `build.zig.zon`:

```zig
.dependencies = .{
    // A hyphenated field name must be written with the @"..." syntax.
    .@"kreuzcrawl-zig" = .{
        .url = "https://github.com/kreuzberg-dev/kreuzcrawl/releases/download/v0.3.0/kreuzcrawl-zig-v0.3.0.tar.gz",
        .hash = "kreuzcrawl-0.3.0-l-oqNoO5eCDhkMWBHtd-su6btmha_K-hk0D2x5Ooq7B9",
    },
},
```