Releases: xberg-io/crawlberg

v0.3.0

23 Jun 16:09

First stable release. kreuzcrawl ships a Rust core with active bindings for
Python, TypeScript/Node, Ruby, PHP, Go, Java/JNI, C#, Elixir, WebAssembly,
Dart, Kotlin/Android, Swift, Zig, and C FFI, plus a CLI, an HTTP API, and an
MCP server.

Added

  • Tiered dispatch engine. The crawl engine chains HTTP → Bypass → Browser
    tiers driven by per-attempt signals rather than a single bypass
    short-circuit. Public kreuzcrawl::types::dispatch surface: Tier,
    EscalationStrategy, EscalationReason, AttemptOutcome, RetryDirective,
    RetryPolicy, WafSignal, WafClassifier, DomainStatePort,
    DomainRecommendation, EscalationBudget, and DispatchProfile (dispatch
    enums are #[non_exhaustive]). CrawlConfig::builder() and
    DispatchProfile::builder() provide fluent construction.
  • WAF detection. A TOML fingerprint corpus (rules/waf_fingerprints.toml,
    34 fingerprints) with an Aho-Corasick matcher, TomlClassifier::watch()
    hot-reload (debounced, atomic ArcSwap, Kubernetes ConfigMap-safe), and
    EwmaDomainState for per-domain block-rate tracking that promotes/demotes
    the starting tier.
  • SSRF defense. New kreuzcrawl::net::ssrf module — SsrfPolicy,
    HostMatcher (Exact/Suffix/Cidr), SsrfError, and async
    validate_url. CrawlConfig::ssrf plus builder methods
    allow_private_networks(bool) and ssrf_allowlist_host(HostMatcher);
    CrawlError::SsrfPolicyViolation. Exposed as a settable DTO (deny_private,
    max_redirects) across every binding.
  • Browser pool injection. BrowserPool/BrowserPoolConfig and
    NativeBrowserExecutor/NativeBrowserExecutorConfig are public;
    CrawlEngineBuilder::with_browser_pool / with_native_executor and
    CrawlEngineHandle::from_engine let consumers construct and warm() a pool
    once and reuse it across all crawl jobs.
  • Public substrate parsers. kreuzcrawl::robots and kreuzcrawl::sitemap
    are public (parse_robots_txt, is_path_allowed, RobotsRules,
    parse_sitemap_xml, parse_sitemap_index, is_sitemap_index) — usable
    without spinning up the engine.
  • Pluggable proxy rotation. ProxyProvider trait + StaticProxyProvider
    baseline, wired into the reqwest fetch path via
    CrawlEngineBuilder::with_proxy_provider; called per request and taking
    precedence over the static CrawlConfig::proxy value.
  • CLI. batch-scrape, batch-crawl, download, citations, and
    version subcommands, bringing the CLI to 1:1 with the core and MCP
    surfaces.
  • MCP server. Tools are 1:1 with the CLI (batch_crawl,
    generate_citations, …), each declaring read_only/destructive/
    open_world safety annotations, and are served over both stdio and rmcp
    Streamable HTTP at /mcp when the binary is built with the api + mcp
    features.
  • Observability. OpenTelemetry counters
    kreuzcrawl_waf_fingerprint_matches_total and
    kreuzcrawl_escalations_total, plus property tests, cargo-fuzz targets, and
    Criterion benchmarks covering the WAF subsystem.
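
The per-domain block-rate idea behind the WAF bullet can be sketched with a plain exponentially weighted moving average. This is an illustration only, not the crate's EwmaDomainState: the BlockRate type, the alpha parameter, and the 0.3 / 0.7 thresholds are assumptions made for the example. Only the Tier names and the recommend/observe vocabulary come from the notes.

```rust
/// Tiers in escalation order, mirroring the HTTP -> Bypass -> Browser chain.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tier {
    Http,
    Bypass,
    Browser,
}

/// Hypothetical per-domain tracker: an exponentially weighted moving
/// average (EWMA) of how often requests to the domain were blocked.
#[derive(Debug, Clone)]
pub struct BlockRate {
    alpha: f64, // weight of the newest observation, in (0, 1]
    rate: f64,  // current block-rate estimate, in [0, 1]
}

impl BlockRate {
    pub fn new(alpha: f64) -> Self {
        Self { alpha, rate: 0.0 }
    }

    /// Fold one attempt outcome into the running average.
    pub fn observe(&mut self, blocked: bool) {
        let sample = if blocked { 1.0 } else { 0.0 };
        self.rate = self.alpha * sample + (1.0 - self.alpha) * self.rate;
    }

    pub fn rate(&self) -> f64 {
        self.rate
    }

    /// Pick the starting tier for the next request to this domain.
    /// The thresholds are made up for the sketch.
    pub fn recommend(&self) -> Tier {
        if self.rate >= 0.7 {
            Tier::Browser
        } else if self.rate >= 0.3 {
            Tier::Bypass
        } else {
            Tier::Http
        }
    }
}
```

With alpha = 0.5, two consecutive blocks promote a domain from Http to Browser, and two clean responses demote it back to Http. A smaller alpha would make the recommendation react more slowly in both directions.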

Changed

  • Memory-bounded streaming crawl. crawl_stream / batch_crawl_stream
    move each page into its CrawlEvent::Page and drop it instead of
    accumulating every page, bounding peak memory on large crawls (≈2.5 GB →
    ≈20 MB working set). crawl()'s batch result is unchanged.
  • Dispatch model. CrawlError::WafBlocked is now a struct variant
    ({ vendor, message }); DomainStatePort moved to an observation model
    (recommend/observe); SimpleRetryPolicy's off-by-one is fixed
    (max_retries=3 yields 3 retries); #[non_exhaustive] added to
    CrawlError, NetworkErrorKind, and the dispatch enums so future variants
    are non-breaking.
  • Asset downloads route through http_fetch, so every file fetch is
    subject to the SSRF policy.
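
The retry-count semantics in the Dispatch model bullet (max_retries=3 yields 3 retries) can be illustrated with a small loop. This is a sketch, not SimpleRetryPolicy itself: run_with_retries and its signature are hypothetical. It only shows the intended counting, where the first attempt is not a retry.

```rust
/// Run `op` once, then retry up to `max_retries` more times on failure.
/// `op` receives the number of retries used so far (0 on the first attempt).
/// Returns the final result plus the number of retries actually used.
pub fn run_with_retries<T, E>(
    max_retries: u32,
    mut op: impl FnMut(u32) -> Result<T, E>,
) -> (Result<T, E>, u32) {
    let mut retries = 0;
    loop {
        match op(retries) {
            Ok(value) => return (Ok(value), retries),
            // The initial attempt does not count as a retry, so the budget
            // check compares retries used so far against `max_retries`.
            Err(_) if retries < max_retries => retries += 1,
            Err(err) => return (Err(err), retries),
        }
    }
}
```

An operation that always fails is therefore called four times in total with max_retries = 3: one initial attempt plus three retries.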

Fixed

  • Crawl loop materializes downloaded documents. The download_documents
    flag was previously honored only by single-page scrape(); the crawl loop
    now builds CrawlPageResult.downloaded_document for linked PDFs/DOCX via a
    shared helper instead of fetching, flagging, and discarding the bytes.
  • SSRF rollout hardening. Follow-up fixes to the SSRF refactor: redirect
    final_url is tracked again (per-hop re-validation moved into
    follow_redirects), within-batch URL dedup no longer races, crawl
    child-depth is incremented (restoring max_depth and include_paths
    semantics), and CrawlConfig JSON deserialization honors
    KREUZCRAWL_ALLOW_PRIVATE_NETWORK through a SsrfPolicy::from_env serde
    default. Each is covered by a regression test.
  • MCP server exposed zero tools. The handler was missing rmcp's
    #[tool_handler], so tools/list and tools/call returned an empty list
    over both stdio and HTTP; it now delegates to the generated tool router.
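
Two of the SSRF rollout fixes above, URL dedup and child-depth tracking, are easier to see in a toy crawl frontier. The sketch below is not the engine's code: crawl_order and the links_of callback are hypothetical, and the real engine is asynchronous and concurrent. It only shows why the depth increment matters for max_depth.

```rust
use std::collections::{HashSet, VecDeque};

/// Breadth-first walk over a link graph supplied by `links_of`.
/// Returns (url, depth) pairs in visit order.
pub fn crawl_order<F>(seed: &str, max_depth: usize, mut links_of: F) -> Vec<(String, usize)>
where
    F: FnMut(&str) -> Vec<String>,
{
    let mut seen: HashSet<String> = HashSet::new();
    let mut queue: VecDeque<(String, usize)> = VecDeque::new();
    let mut visited = Vec::new();

    seen.insert(seed.to_string());
    queue.push_back((seed.to_string(), 0));

    while let Some((url, depth)) = queue.pop_front() {
        // Pages at the depth limit are visited but their links are not followed.
        let children = if depth < max_depth { links_of(&url) } else { Vec::new() };
        visited.push((url, depth));
        for child in children {
            // `insert` returns false for duplicates, so each URL is queued once.
            if seen.insert(child.clone()) {
                // Children sit one level below their parent. Without the +1
                // every page would report depth 0 and max_depth would never bind.
                queue.push_back((child, depth + 1));
            }
        }
    }
    visited
}
```

In a graph where a links to b and c, b links to c and d, and d links to e, a walk with max_depth = 2 visits a, b, c, and d exactly once and never reaches e.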

Security

  • SSRF defense, enabled by default. scrape(), crawl(),
    batch_crawl(), sitemap fetch, robots.txt fetch, and asset download refuse
    URLs resolving to loopback (127.0.0.0/8), RFC1918 private networks,
    link-local (169.254.0.0/16, which includes the 169.254.169.254 cloud
    metadata endpoint), the "this network" range (0.0.0.0/8), multicast
    (224.0.0.0/4), IPv6 ULA (fc00::/7), IPv6 link-local (fe80::/10), IPv6
    multicast (ff00::/8), or any non-http(s) scheme. Includes DNS-rebinding
    mitigation (every resolved IP must pass the policy), redirect-chain
    re-validation (bounded by ssrf.max_redirects, default 5), and
    link-enqueue validation with bounded concurrency. Opt out via
    KREUZCRAWL_ALLOW_PRIVATE_NETWORK=1 or
    CrawlConfig::allow_private_networks(true).
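
The address ranges listed above map fairly directly onto std::net checks. The following is a minimal sketch under stated assumptions, not the crate's SsrfPolicy or validate_url: is_denied and all_allowed are hypothetical names, and rejecting IPv6 loopback, unwrapping IPv4-mapped IPv6 addresses, and treating an empty resolution as a failure are choices made for the example rather than behavior stated in the notes.

```rust
use std::net::{IpAddr, Ipv4Addr, Ipv6Addr};

fn v4_is_denied(ip: Ipv4Addr) -> bool {
    ip.octets()[0] == 0       // 0.0.0.0/8
        || ip.is_loopback()   // 127.0.0.0/8
        || ip.is_private()    // RFC 1918: 10/8, 172.16/12, 192.168/16
        || ip.is_link_local() // 169.254.0.0/16
        || ip.is_multicast()  // 224.0.0.0/4
}

fn v6_is_denied(ip: Ipv6Addr) -> bool {
    let first = ip.segments()[0];
    ip.is_loopback()                   // ::1 (assumption, not listed in the notes)
        || (first & 0xfe00) == 0xfc00  // unique local, fc00::/7
        || (first & 0xffc0) == 0xfe80  // link-local, fe80::/10
        || ip.is_multicast()           // ff00::/8
}

/// True when the address falls in one of the refused ranges.
pub fn is_denied(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => v4_is_denied(v4),
        // Assumption: treat ::ffff:a.b.c.d like the embedded IPv4 address.
        IpAddr::V6(v6) => match v6.to_ipv4_mapped() {
            Some(v4) => v4_is_denied(v4),
            None => v6_is_denied(v6),
        },
    }
}

/// DNS-rebinding rule from the notes: a host passes only if every
/// resolved address passes. An empty resolution is treated as a failure.
pub fn all_allowed(resolved: &[IpAddr]) -> bool {
    !resolved.is_empty() && resolved.iter().all(|ip| !is_denied(*ip))
}
```

Checking every resolved address, rather than only the first, is what keeps a hostname with one public and one private record from slipping through.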

Build

  • Bindings, facades, READMEs, docs, stubs, and e2e suites are generated by
    alef (pinned at 0.26.6) across all 14 language targets.
  • Publish-pipeline hardening: a native per-arch Docker matrix that drops QEMU
    emulation, Flutter-free Dart native builds for pub.dev, Swift artifactbundle
    checksum injection and Apple system-framework linking, and
    lockfile-preserving source publishes for the Elixir NIF, PHP extension, and
    Ruby gem.

Zig

Add to your build.zig.zon:

.dependencies = .{
    .@"kreuzcrawl-zig" = .{
        .url = "https://github.com/kreuzberg-dev/kreuzcrawl/releases/download/v0.3.0/kreuzcrawl-zig-v0.3.0.tar.gz",
        .hash = "kreuzcrawl-0.3.0-l-oqNoO5eCDhkMWBHtd-su6btmha_K-hk0D2x5Ooq7B9",
    },
},

v0.3.0-rc.88

23 Jun 09:55

v0.3.0-rc.88 Pre-release

Fixed

  • Docker: multi-arch publish no longer times out on the arm64 leg. Publish Docker Images built linux/amd64,linux/arm64 in a single job with the arm64 image compiled under QEMU emulation, which routinely ran the full Rust build right up against the 120-minute job timeout (rc.86 squeaked through at 103 min; rc.87 tipped over and was cancelled, leaving no 0.3.0-rc.87 image on GHCR). The job now builds each architecture natively in a matrix (amd64 on ubuntu-latest, arm64 on ubuntu-24.04-arm), pushes each by digest, and merges them into a single manifest list with docker buildx imagetools create — matching the canonical infra pattern and removing QEMU entirely. Per-arch GHA cache scopes also fix a cache-from/cache-to scope mismatch that previously defeated cache reuse. (.github/workflows/publish-docker.yaml)

Build

  • Regenerated all bindings against alef 0.26.6 (pin bumped from 0.26.3). Folds in the accumulated 0.26.4→0.26.6 codegen changes (including the pyo3 Python trait-callback reliability fix and the Dart mirror/opaque/from_json declaration fix).

Zig

Add to your build.zig.zon:

.dependencies = .{
    .@"kreuzcrawl-zig" = .{
        .url = "https://github.com/kreuzberg-dev/kreuzcrawl/releases/download/v0.3.0-rc.88/kreuzcrawl-zig-v0.3.0-rc.88.tar.gz",
        .hash = "kreuzcrawl-0.3.0-rc.88-l-oqNv3EeCARU_oSw7q0Ey1p26mbnSkJ7NBo160AuQcR",
    },
},

v0.3.0-rc.87

22 Jun 20:17

v0.3.0-rc.87 Pre-release

Added

  • MCP: Streamable HTTP transport at /mcp. When the binary is built with both the api and mcp features (the CLI is), the REST API server now also exposes the MCP server over rmcp's Streamable HTTP transport, mounted outside the REST middleware stack so request-timeout/compression layers don't break MCP SSE. Each tool also declares its safety annotations (read_only/destructive/open_world hints). (crates/kreuzcrawl/src/mcp, crates/kreuzcrawl/src/api/router.rs)

Fixed

  • MCP: the server exposed zero tools. The impl ServerHandler was missing rmcp's #[tool_handler], so tools/list and tools/call silently returned an empty list over both stdio and HTTP, and every MCP client through rc.86 saw a tool-less server. The handler now delegates to the generated tool router. (crates/kreuzcrawl/src/mcp/server.rs)
  • Swift: RustBridgeC.h no longer regresses to the placeholder. rc.86 shipped the typedef-only placeholder header (reverting the populated header from rc.85), so every SwiftPM consumer of the source package failed to compile with thousands of cannot find '__swift_bridge__$…' in scope errors. The alef 0.26.3 bump below preserves an already-populated header across alef all --clean, and the umbrella header (994 swift-bridge C declarations) is repopulated here.

Build

  • Regenerated all bindings against alef 0.26.3 (pin bumped from 0.25.60). Beyond the Swift header preservation fix above, this folds in the accumulated 0.25.60→0.26.3 codegen changes.

Zig

Add to your build.zig.zon:

.dependencies = .{
    .@"kreuzcrawl-zig" = .{
        .url = "https://github.com/kreuzberg-dev/kreuzcrawl/releases/download/v0.3.0-rc.87/kreuzcrawl-zig-v0.3.0-rc.87.tar.gz",
        .hash = "kreuzcrawl-0.3.0-rc.87-l-oqNlnHeSAlSjd6Z9WxFtrTMIzeT11cfH85hL9n9W5q",
    },
},

v0.3.0-rc.86

22 Jun 08:19

v0.3.0-rc.86 Pre-release

Build

  • Regenerated all bindings against alef 0.25.60 (pin bumped from 0.25.59). Notable codegen fixes: the generated Rust e2e/test-app common.rs now resolves the mock-server binary via env!("CARGO_BIN_EXE_mock-server") instead of a hardcoded target/release/mock-server path (so cargo test debug builds spawn it correctly), and the Kotlin Android MockServerListener now parses the MOCK_SERVERS env map on the preset path so per-fixture mockServer.<id> lookups resolve under the registry-mode test runner.

Zig

Add to your build.zig.zon:

.dependencies = .{
    .@"kreuzcrawl-zig" = .{
        .url = "https://github.com/kreuzberg-dev/kreuzcrawl/releases/download/v0.3.0-rc.86/kreuzcrawl-zig-v0.3.0-rc.86.tar.gz",
        .hash = "kreuzcrawl-0.3.0-rc.86-l-oqNnHJeSBVbHIEt2GQxgeXjxKLiAHHXSOKdNd5taQj",
    },
},

v0.3.0-rc.85

21 Jun 20:37

v0.3.0-rc.85 Pre-release

Fixed

  • Dart pub.dev publish no longer requires Flutter to build natives. The per-platform native dylib build used build-dart-package, which installs Flutter via subosito/flutter-action; Flutter ships no Linux ARM64 stable SDK, so the linux-arm64 leg failed ("Unable to determine Flutter version … architecture: arm64") and blocked the entire pub.dev publish. The native build needs only Rust (frb_generated.rs is committed), so the matrix now builds each native with cargo build --locked -p kreuzcrawl-dart --release directly — matching the canonical liter-llm / kreuzberg pattern. (.github/workflows/publish-pubdev.yaml)

Build

  • Regenerated all bindings against alef 0.25.59 (pin bumped from 0.25.55). Notable codegen changes: the Swift RustBridgeC.h is now the full concatenated swift-bridge C header instead of a placeholder, and the JNI NativeLib throws a descriptive ExceptionInInitializerError naming the missing native symbol instead of a bare orElseThrow.
  • Supersedes rc.84, whose Homebrew bottle-merge and release-finalize steps were stranded by a transient crates.io fetch flake in the x86_64_linux bottle build (leaving the tap formula's bottle do block pinned to the rc.83 root URL).

Zig

Add to your build.zig.zon:

.dependencies = .{
    .@"kreuzcrawl-zig" = .{
        .url = "https://github.com/kreuzberg-dev/kreuzcrawl/releases/download/v0.3.0-rc.85/kreuzcrawl-zig-v0.3.0-rc.85.tar.gz",
        .hash = "kreuzcrawl-0.3.0-rc.85-l-oqNr_keSCO9RQ7ifk4XGg6WZ4fMfoVqRf8Ocej5zng",
    },
},

v0.3.0-rc.84

20 Jun 19:24

Added

  • CLI: batch-scrape, batch-crawl, download, citations, and version subcommands. Wire the existing core entry points into the CLI so the CLI, MCP server, and core surfaces are 1:1. download mirrors the MCP download tool (scrape with document download enabled, emitting the downloaded document's metadata); citations converts markdown links to numbered citations; version prints the crate version as JSON. (crates/kreuzcrawl-cli)
  • MCP: batch_crawl and generate_citations tools. batch_crawl crawls multiple seed URLs concurrently (mirroring batch_scrape); generate_citations converts markdown links into numbered citations. (crates/kreuzcrawl/src/mcp)
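
To make "converts markdown links to numbered citations" concrete, here is one possible shape of that transformation. It is a sketch, not the crate's implementation: the to_citations function, the text[n] output format, and the reuse of one number per repeated URL are all assumptions, and the scanner deliberately ignores edge cases such as image links, nested brackets, and parentheses inside URLs.

```rust
/// Replace inline `[text](url)` links with `text[n]` and collect the URLs.
/// Returns the rewritten text and the reference list; reference `n` is
/// element `n - 1` of the list. Repeated URLs reuse their first number.
pub fn to_citations(markdown: &str) -> (String, Vec<String>) {
    let mut out = String::new();
    let mut refs: Vec<String> = Vec::new();
    let mut rest = markdown;

    while let Some(open) = rest.find('[') {
        let after_open = &rest[open + 1..];
        // Try to read `text](url)` right after the opening bracket.
        let parsed = after_open.find("](").and_then(|mid| {
            let text = &after_open[..mid];
            let after_mid = &after_open[mid + 2..];
            after_mid
                .find(')')
                .map(|end| (text, &after_mid[..end], mid + 2 + end + 1))
        });
        match parsed {
            // Only accept simple links whose text contains no further '['.
            Some((text, url, consumed)) if !text.contains('[') => {
                out.push_str(&rest[..open]);
                let n = match refs.iter().position(|r| r.as_str() == url) {
                    Some(i) => i + 1,
                    None => {
                        refs.push(url.to_string());
                        refs.len()
                    }
                };
                out.push_str(&format!("{}[{}]", text, n));
                rest = &after_open[consumed..];
            }
            // Not a link: keep the bracket literally and continue after it.
            _ => {
                out.push_str(&rest[..=open]);
                rest = after_open;
            }
        }
    }
    out.push_str(rest);
    (out, refs)
}
```

A caller would typically append the returned list below the text as "[1] url" lines. Whether the real tool emits that layout is not stated in the notes.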

Changed

  • chore(precommit): drop the conflicting kotlin-android ktlint hook; ktfmt is the sole formatter. ktlint's always-format mode fought ktfmt (blank-line-after-brace) and rewrote alef's /// doc comments to // /, breaking alef verify. detekt remains for static analysis. Also excluded the vendored Gradle wrapper from shellcheck. (.pre-commit-config.yaml)

Removed

  • MCP: dropped the unimplemented screenshot, research, and crawl_status tools. They had no backing core capability and only ever returned "not yet implemented", advertising tools that always failed. Every remaining MCP tool is now 1:1 with a CLI subcommand and a real core function. (crates/kreuzcrawl/src/mcp)

Fixed

  • Release: CLI binaries are now reliably attached to a published release. The GitHub release is created as a draft and was only un-drafted by release-finalize, which is skipped whenever any unrelated publish leg fails — stranding the whole release (CLI binaries included) as an invisible draft. upload-cli-release now publishes the release directly via publish-github-release@v1 with draft: false (un-drafting the existing draft) and runs even on partial CLI-build failures, mirroring the sibling repos. (.github/workflows/publish.yaml)
  • Memory-bounded streaming crawl. crawl_stream / batch_crawl_stream now move each page into its CrawlEvent::Page and drop it instead of accumulating every page, bounding peak memory on large crawls (≈2.5 GB → ≈20 MB working set). crawl()'s batch result is unchanged; Complete.pages_crawled reports the exact post-filter count, and a terminal Complete is emitted on the seed-error path. (crates/kreuzcrawl/src/engine)
  • Dart package ships native libraries. The pub.dev package previously bundled no native libs, so the flutter_rust_bridge loader fell back to a relative framework path that macOS hardened-runtime rejects. The publish pipeline now builds the native on a 5-platform matrix and stages each into lib/src/native/<rid>/ before publishing. (.github/workflows/publish-pubdev.yaml)
  • Swift package links Apple system frameworks. The published root Package.swift now links Security, CoreFoundation, and SystemConfiguration on the RustBridge target so the pre-built static library's SC* symbols (reqwest proxy detection) resolve for remote SwiftPM consumers. (regenerated via alef 0.25.55)
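
The memory-bounded streaming fix comes down to moving each page into its event and letting the consumer drop it. The sketch below shows that ownership pattern with a bounded std channel. It is not the engine's code: the Page struct, the function signatures, and the 1 KiB stand-in body are assumptions, and only the CrawlEvent::Page and Complete.pages_crawled names come from the notes.

```rust
use std::sync::mpsc;
use std::thread;

pub struct Page {
    pub url: String,
    pub body: Vec<u8>,
}

pub enum CrawlEvent {
    Page(Page),
    Complete { pages_crawled: usize },
}

/// Producer: builds one page at a time and moves it into the channel.
/// Nothing is retained after `send`, so the number of live pages is
/// bounded by the channel capacity rather than by the crawl size.
pub fn crawl_stream(urls: Vec<String>, capacity: usize) -> mpsc::Receiver<CrawlEvent> {
    let (tx, rx) = mpsc::sync_channel(capacity);
    thread::spawn(move || {
        let mut pages_crawled = 0;
        for url in urls {
            // Stand-in for a real fetch.
            let page = Page { body: vec![0u8; 1024], url };
            if tx.send(CrawlEvent::Page(page)).is_err() {
                return; // receiver hung up
            }
            pages_crawled += 1;
        }
        let _ = tx.send(CrawlEvent::Complete { pages_crawled });
    });
    rx
}

/// Consumer: handles each page, which is dropped at the end of its match arm.
/// Returns (total body bytes seen, page count reported by Complete).
pub fn total_bytes(rx: mpsc::Receiver<CrawlEvent>) -> (usize, usize) {
    let mut bytes = 0;
    let mut reported = 0;
    for event in rx {
        match event {
            CrawlEvent::Page(page) => bytes += page.body.len(),
            CrawlEvent::Complete { pages_crawled } => reported = pages_crawled,
        }
    }
    (bytes, reported)
}
```

The bounded channel also applies backpressure: a slow consumer stalls the producer instead of letting pages pile up in memory.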

Build

  • Regenerated all bindings against alef 0.25.55 — generic Go FFI provisioning (test-app runner delegates to the binding's own download_ffi; no project-specific download in the generic generator), swift framework linking, zig stale-hash strip, and test-app [crates.e2e.env] export.

Zig

Add to your build.zig.zon:

.dependencies = .{
    .@"kreuzcrawl-zig" = .{
        .url = "https://github.com/kreuzberg-dev/kreuzcrawl/releases/download/v0.3.0-rc.84/kreuzcrawl-zig-v0.3.0-rc.84.tar.gz",
        .hash = "kreuzcrawl-0.3.0-rc.84-l-oqNnXKeSCF_EYtxYkhF8AiaMo2OHBXJdTmZGqYDwDl",
    },
},

v0.3.0-rc.83

20 Jun 12:41

v0.3.0-rc.83 Pre-release

Full Changelog: v0.3.0-rc.74...v0.3.0-rc.83

Zig

Add to your build.zig.zon:

.dependencies = .{
    .@"kreuzcrawl-zig" = .{
        .url = "https://github.com/kreuzberg-dev/kreuzcrawl/releases/download/v0.3.0-rc.83/kreuzcrawl-zig-v0.3.0-rc.83.tar.gz",
        .hash = "kreuzcrawl-0.3.0-rc.83-l-oqNmD2dSCgZ0poTHayw9oY3xhCSBNRHS5zmEo0Drd6",
    },
},

v0.3.0-rc.82

20 Jun 08:03

v0.3.0-rc.82 Pre-release

Changed

  • Bump alef pin 0.25.49 → 0.25.50 and regenerate all bindings.

Fixed

  • Swift: cfg-gated struct fields broke the swift-bridge build. The Swift backend now filters constructor params, getter externs, and wrapper field initializers by #[cfg] against the configured feature set, and recurses inbound bridge-type conversion through Optional/Vec/Map so nested opaque types are JSON-bridged correctly. (alef 0.25.50)
  • Homebrew: formula source checksum mismatch. The release pipeline now computes the formula's source-tarball sha after the Swift-checksum job force-moves the release tag, so the bottle builds verify against the final archive instead of a pre-move hash.

Docs

  • Document the kreuzcrawl agent plugin in the README.

Zig

Add to your build.zig.zon:

.dependencies = .{
    .@"kreuzcrawl-zig" = .{
        .url = "https://github.com/kreuzberg-dev/kreuzcrawl/releases/download/v0.3.0-rc.82/kreuzcrawl-zig-v0.3.0-rc.82.tar.gz",
        .hash = "kreuzcrawl-0.3.0-rc.82-l-oqNh4FdiBzbuEKrLMB5XhC0KEHcta0_XeOIIosbaT3",
    },
},

v0.3.0-rc.81

19 Jun 18:37

v0.3.0-rc.81 Pre-release

Changed

  • Bump alef pin 0.25.47 → 0.25.49 and regenerate all bindings.
  • Kotlin Android: modernized the build toolchain. Gradle 9.6, Android Gradle Plugin 9.2, Kotlin 2.4, compileSdk 36, minSdk 24 (the full-latest triple required under JDK 25 daemons); the Gradle wrapper is now generated from a shared module so the binding and e2e wrappers stay in lockstep. (alef 0.25.49)

Fixed

  • Java (Panama FFM) returned empty response bodies. The handler upcall allocated its response in a per-call arena that could be reclaimed before Rust read the returned pointer; the response is now copied into malloc'd memory that outlives the upcall, so full bodies survive the FFI boundary. (alef 0.25.48)
  • Java: SsrfPolicy.deny_private lost its default. Boxed-Boolean serde-default fields now restore their Rust default (deny_private = true) in the compact constructor when omitted, matching the core and the other bindings. (alef 0.25.49)
  • Swift: hard-coded bridge crate name in the post-build step. The Swift backend now builds the bridge crate via packages/swift/rust/Cargo.toml instead of a fixed package name, keeping generated bindings project-agnostic. (alef 0.25.49)
  • PHP: parse serde_json::Value params as JSON strings, convert BTreeMap params from PHP hash maps, serialize string-mapped data-enum returns, and avoid requiring Default for flat data-enum fallback arms. (alef 0.25.49)
  • NAPI: emit binding-to-core conversions for plain data enums in input DTOs and apply primitive casts on the DTO opaque let-binding call path. (alef 0.25.48–0.25.49)
  • FFI: convert Option<&[u8]> parameters as optional bytes rather than a nullable C string. (alef 0.25.48)
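
The last fix above distinguishes optional bytes from a nullable C string. One common way to carry Option<&[u8]> across a C boundary is a pointer plus an explicit length. The sketch below assumes that representation; the optional_bytes helper is hypothetical and is not taken from alef's generated code.

```rust
use std::slice;

/// Interpret a (pointer, length) pair received from C as optional bytes.
/// A null pointer means "absent"; a non-null pointer with `len == 0` means
/// "present but empty". Because the length is explicit, interior zero
/// bytes survive, which a NUL-terminated C string could not represent.
///
/// # Safety
/// If `ptr` is non-null, it must point to `len` readable bytes that stay
/// valid and unmodified for the lifetime `'a`.
pub unsafe fn optional_bytes<'a>(ptr: *const u8, len: usize) -> Option<&'a [u8]> {
    if ptr.is_null() {
        None
    } else {
        // SAFETY: guaranteed by the caller, see the function contract.
        Some(unsafe { slice::from_raw_parts(ptr, len) })
    }
}
```

A payload such as [0, 1, 0, 2] round-trips intact through this helper, whereas reading it as a C string would stop at the first zero byte.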

Zig

Add to your build.zig.zon:

.dependencies = .{
    .@"kreuzcrawl-zig" = .{
        .url = "https://github.com/kreuzberg-dev/kreuzcrawl/releases/download/v0.3.0-rc.81/kreuzcrawl-zig-v0.3.0-rc.81.tar.gz",
        .hash = "kreuzcrawl-0.3.0-rc.81-l-oqNug2diABJSOQ5k2jm6dlZWWfKsbmaVfK81fv2kZq",
    },
},

v0.3.0-rc.80

19 Jun 08:30

v0.3.0-rc.80 Pre-release

Changed

  • Bump alef pin 0.25.43 → 0.25.47 and regenerate all bindings.

Fixed

  • Java (Panama FFM) crashed on the first request. The register_route downcall descriptor omitted the C context parameter and the upcall adapter mis-read the request pointer; both are corrected, so route registration and request marshalling work end-to-end. (alef 0.25.47)
  • PHP ignored function-path serde defaults. CrawlConfig from JSON now honors field defaults such as SsrfPolicy::from_env() when ssrf is omitted, matching the core and the other bindings instead of falling back to deny. (alef 0.25.47)
  • Kotlin e2e: compare nullable Boolean? assertions explicitly so safe-call field paths compile. (alef 0.25.47)
  • Dart: cfg-gate generated Rust-bridge items that reference feature-gated core types, and fix slice/&Path/&BTreeMap/serde_json::Value parameter handling so the Flutter Rust bridge crate compiles. (alef 0.25.46–0.25.47)
  • JNI: marshal Option<&[u8]> and &BTreeMap parameters to the core's declared types. (alef 0.25.44)
  • SSRF: allow private networks on the wasm32 target and respect KREUZCRAWL_ALLOW_PRIVATE_NETWORK on native.
  • Elixir Hex publish is gated on a complete NIF upload so a flaky NIF leg cannot ship an immutable package with an incomplete checksum file.
  • Swift Package.swift receives the real artifactbundle SHA256 in the release tag instead of a placeholder, unblocking SwiftPM consumers.
  • Test-app harness: resolve the published Go module directory, rebuild the Dart native to the published content hash, and set the SSRF private-network override at process level for C#/Elixir.
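
Several fixes in this release revolve around the KREUZCRAWL_ALLOW_PRIVATE_NETWORK override feeding the policy default. A minimal sketch of an environment-derived default follows. SsrfPolicySketch is a stand-in type rather than the crate's SsrfPolicy, and accepting only the literal value "1" is an assumption. The deny_private = true and max_redirects = 5 defaults are the ones mentioned elsewhere in these notes.

```rust
use std::env;

/// Illustrative stand-in for the policy DTO (not the crate's type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SsrfPolicySketch {
    pub deny_private: bool,
    pub max_redirects: u32,
}

impl SsrfPolicySketch {
    /// Pure decision logic, kept separate so it can be tested without
    /// touching the process environment.
    pub fn from_override(value: Option<&str>) -> Self {
        // Assumption: only "1" opts out; anything else keeps the safe default.
        let allow_private = value.map(str::trim) == Some("1");
        Self {
            deny_private: !allow_private,
            max_redirects: 5,
        }
    }

    /// Read the override from the environment. An unset or non-UTF-8
    /// variable behaves like "no override".
    pub fn from_env() -> Self {
        Self::from_override(env::var("KREUZCRAWL_ALLOW_PRIVATE_NETWORK").ok().as_deref())
    }
}
```

Using a function like from_env as the serde default for an omitted field is what lets a JSON-deserialized config pick up the override instead of silently falling back to deny.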

Zig

Add to your build.zig.zon:

.dependencies = .{
    .@"kreuzcrawl-zig" = .{
        .url = "https://github.com/kreuzberg-dev/kreuzcrawl/releases/download/v0.3.0-rc.80/kreuzcrawl-zig-v0.3.0-rc.80.tar.gz",
        .hash = "kreuzcrawl-0.3.0-rc.80-l-oqNtwpdiCK9M_hosX4K5axm70pN8jMvZf59Dkzwq4L",
    },
},