Releases: xberg-io/crawlberg
v0.3.0
First stable release. kreuzcrawl ships a Rust core with active bindings for
Python, TypeScript/Node, Ruby, PHP, Go, Java/JNI, C#, Elixir, WebAssembly,
Dart, Kotlin/Android, Swift, Zig, and C FFI, plus a CLI, an HTTP API, and an
MCP server.
Added
- Tiered dispatch engine. The crawl engine chains HTTP → Bypass → Browser tiers driven by per-attempt signals rather than a single bypass short-circuit. Public `kreuzcrawl::types::dispatch` surface: `Tier`, `EscalationStrategy`, `EscalationReason`, `AttemptOutcome`, `RetryDirective`, `RetryPolicy`, `WafSignal`, `WafClassifier`, `DomainStatePort`, `DomainRecommendation`, `EscalationBudget`, and `DispatchProfile` (dispatch enums are `#[non_exhaustive]`). `CrawlConfig::builder()` and `DispatchProfile::builder()` provide fluent construction.
- WAF detection. A TOML fingerprint corpus (`rules/waf_fingerprints.toml`, 34 fingerprints) with an Aho-Corasick matcher, `TomlClassifier::watch()` hot-reload (debounced, atomic `ArcSwap`, Kubernetes ConfigMap-safe), and `EwmaDomainState` for per-domain block-rate tracking that promotes/demotes the starting tier.
- SSRF defense. New `kreuzcrawl::net::ssrf` module: `SsrfPolicy`, `HostMatcher` (`Exact`/`Suffix`/`Cidr`), `SsrfError`, and async `validate_url`. `CrawlConfig::ssrf` plus builder methods `allow_private_networks(bool)` and `ssrf_allowlist_host(HostMatcher)`; `CrawlError::SsrfPolicyViolation`. Exposed as a settable DTO (`deny_private`, `max_redirects`) across every binding.
- Browser pool injection. `BrowserPool`/`BrowserPoolConfig` and `NativeBrowserExecutor`/`NativeBrowserExecutorConfig` are public; `CrawlEngineBuilder::with_browser_pool`/`with_native_executor` and `CrawlEngineHandle::from_engine` let consumers construct and `warm()` a pool once and reuse it across all crawl jobs.
- Public substrate parsers. `kreuzcrawl::robots` and `kreuzcrawl::sitemap` are public (`parse_robots_txt`, `is_path_allowed`, `RobotsRules`, `parse_sitemap_xml`, `parse_sitemap_index`, `is_sitemap_index`), usable without spinning up the engine.
- Pluggable proxy rotation. `ProxyProvider` trait + `StaticProxyProvider` baseline, wired into the reqwest fetch path via `CrawlEngineBuilder::with_proxy_provider`; called per request and taking precedence over the static `CrawlConfig::proxy` value.
- CLI. `batch-scrape`, `batch-crawl`, `download`, `citations`, and `version` subcommands, bringing the CLI to 1:1 with the core and MCP surfaces.
- MCP server. Tools are 1:1 with the CLI (`batch_crawl`, `generate_citations`, …), each declaring `read_only`/`destructive`/`open_world` safety annotations, and are served over both stdio and rmcp Streamable HTTP at `/mcp` when the binary is built with the `api` + `mcp` features.
- Observability. OpenTelemetry counters `kreuzcrawl_waf_fingerprint_matches_total` and `kreuzcrawl_escalations_total`, plus property tests, cargo-fuzz targets, and Criterion benchmarks covering the WAF subsystem.
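The per-domain block-rate tracking that promotes or demotes the starting tier can be pictured with a small, std-only sketch. This is not the `EwmaDomainState` API: the `EwmaBlockRate` type, the `alpha` weight, and the 0.4/0.8 thresholds are hypothetical values chosen for illustration only.

```rust
/// Fetch tiers in escalation order (HTTP, then Bypass, then Browser).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Tier {
    Http,
    Bypass,
    Browser,
}

/// Hypothetical per-domain tracker: an exponentially weighted moving
/// average of "was this attempt blocked?" observations.
struct EwmaBlockRate {
    alpha: f64, // weight of the newest observation, 0 < alpha <= 1
    rate: f64,  // current block-rate estimate in [0, 1]
}

impl EwmaBlockRate {
    fn new(alpha: f64) -> Self {
        Self { alpha, rate: 0.0 }
    }

    /// Fold one attempt outcome into the running estimate.
    fn observe(&mut self, blocked: bool) {
        let sample = if blocked { 1.0 } else { 0.0 };
        self.rate = self.alpha * sample + (1.0 - self.alpha) * self.rate;
    }

    /// Pick the starting tier. The thresholds are illustrative assumptions.
    fn recommend(&self) -> Tier {
        if self.rate >= 0.8 {
            Tier::Browser
        } else if self.rate >= 0.4 {
            Tier::Bypass
        } else {
            Tier::Http
        }
    }
}

fn main() {
    let mut state = EwmaBlockRate::new(0.5);
    println!("fresh domain: {:?}", state.recommend());
    for _ in 0..3 {
        state.observe(true); // three blocked attempts in a row
    }
    println!("after blocks: {:?}", state.recommend());
    for _ in 0..2 {
        state.observe(false); // two clean attempts decay the estimate
    }
    println!("after recovery: {:?}", state.recommend());
}
```

Because old observations decay geometrically, a domain that stops blocking drifts back toward the cheaper tier without any explicit reset.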
Changed
- Memory-bounded streaming crawl. `crawl_stream`/`batch_crawl_stream` move each page into its `CrawlEvent::Page` and drop it instead of accumulating every page, bounding peak memory on large crawls (≈2.5 GB → ≈20 MB working set). `crawl()`'s batch result is unchanged.
- Dispatch model. `CrawlError::WafBlocked` is now a struct variant (`{ vendor, message }`); `DomainStatePort` moved to an observation model (`recommend`/`observe`); `SimpleRetryPolicy`'s off-by-one is fixed (`max_retries=3` yields 3 retries); `#[non_exhaustive]` added to `CrawlError`, `NetworkErrorKind`, and the dispatch enums so future variants are non-breaking.
- Asset downloads route through `http_fetch`, so every file fetch is subject to the SSRF policy.
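The retry-count semantics in the dispatch-model entry (`max_retries=3` yields 3 retries) can be illustrated with a minimal sketch. It assumes the initial attempt is not counted as a retry, so three retries mean four calls in total. `run_with_retries` is a hypothetical helper, not `SimpleRetryPolicy` itself.

```rust
/// Hypothetical helper: call `op` once, then retry it up to `max_retries`
/// more times. Returns the last result and the number of retries spent.
fn run_with_retries<T, E>(
    max_retries: u32,
    mut op: impl FnMut(u32) -> Result<T, E>,
) -> (Result<T, E>, u32) {
    let mut retries = 0;
    loop {
        let result = op(retries);
        // Stop on success, or once the retry budget is fully spent.
        // Comparing with `==` after the call (not `<` before it) is what
        // avoids the classic off-by-one.
        if result.is_ok() || retries == max_retries {
            return (result, retries);
        }
        retries += 1;
    }
}

fn main() {
    let mut calls = 0;
    let (result, retries) = run_with_retries(3, |_| -> Result<(), &'static str> {
        calls += 1;
        Err("blocked") // an operation that never succeeds
    });
    // One initial attempt plus three retries.
    println!("calls={calls} retries={retries} ok={}", result.is_ok());
}
```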
Fixed
- Crawl loop materializes downloaded documents. The `download_documents` flag was previously honored only by single-page `scrape()`; the crawl loop now builds `CrawlPageResult.downloaded_document` for linked PDFs/DOCX via a shared helper instead of fetching, flagging, and discarding the bytes.
- SSRF rollout hardening. Follow-up fixes to the SSRF refactor: redirect `final_url` is tracked again (per-hop re-validation moved into `follow_redirects`), within-batch URL dedup no longer races, crawl child-depth is incremented (restoring `max_depth` and `include_paths` semantics), and `CrawlConfig` JSON deserialization honors `KREUZCRAWL_ALLOW_PRIVATE_NETWORK` through a `SsrfPolicy::from_env` serde default. Each is covered by a regression test.
- MCP server exposed zero tools. The handler was missing rmcp's `#[tool_handler]`, so `tools/list`/`tools/call` returned an empty list over both stdio and HTTP; it now delegates to the generated tool router.
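The environment-derived default mentioned in the SSRF hardening entry can be sketched as follows. This is an assumption-laden illustration, not `SsrfPolicy::from_env`: the helper name `deny_private_from` is hypothetical, and only the value `1` is treated as an opt-out because that is the only value these notes show.

```rust
use std::env;

/// Hypothetical stand-in for an env-derived default: private networks stay
/// denied unless the override is exactly "1" (ignoring surrounding spaces).
fn deny_private_from(raw: Option<&str>) -> bool {
    raw.map(str::trim) != Some("1")
}

fn main() {
    // When a config omits its SSRF section, the default would come from here.
    let raw = env::var("KREUZCRAWL_ALLOW_PRIVATE_NETWORK").ok();
    let deny_private = deny_private_from(raw.as_deref());
    println!("deny_private = {deny_private}");
}
```

Keeping the parsing in a pure function that takes the raw value makes the default easy to cover with a regression test without mutating process-wide environment state.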
Security
- SSRF defense, enabled by default. `scrape()`, `crawl()`, `batch_crawl()`, sitemap fetch, robots.txt fetch, and asset download refuse URLs resolving to loopback (127.0.0.0/8), RFC1918 private networks, link-local (169.254.0.0/16, which includes the 169.254.169.254 cloud metadata endpoint), the "this network" block (0.0.0.0/8), multicast (224.0.0.0/4), IPv6 ULA (fc00::/7), IPv6 link-local (fe80::/10), IPv6 multicast (ff00::/8), or any non-http(s) scheme. Includes DNS-rebinding mitigation (every resolved IP must pass the policy), redirect-chain re-validation (bounded by `ssrf.max_redirects`, default 5), and link-enqueue validation with bounded concurrency. Opt out via `KREUZCRAWL_ALLOW_PRIVATE_NETWORK=1` or `CrawlConfig::allow_private_networks(true)`.
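A rough, std-only sketch of the address checks listed above, including the "every resolved IP must pass" rule used against DNS rebinding. The function names are hypothetical and this is not the `kreuzcrawl::net::ssrf` implementation; the IPv6 loopback check is an addition not listed in these notes, and a production policy would also need to consider cases such as IPv4-mapped IPv6 addresses.

```rust
use std::net::{IpAddr, Ipv4Addr, Ipv6Addr};

/// IPv4 ranges named in the release notes.
fn is_denied_v4(ip: Ipv4Addr) -> bool {
    ip.is_loopback()            // 127.0.0.0/8
        || ip.is_private()      // RFC 1918
        || ip.is_link_local()   // 169.254.0.0/16
        || ip.octets()[0] == 0  // 0.0.0.0/8
        || ip.is_multicast()    // 224.0.0.0/4
}

/// IPv6 ranges named in the release notes, checked on the first segment.
fn is_denied_v6(ip: Ipv6Addr) -> bool {
    let first = ip.segments()[0];
    (first & 0xfe00) == 0xfc00         // fc00::/7 (ULA)
        || (first & 0xffc0) == 0xfe80  // fe80::/10 (link-local)
        || ip.is_multicast()           // ff00::/8
        || ip.is_loopback()            // ::1 (assumption, not listed above)
}

fn is_denied(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => is_denied_v4(v4),
        IpAddr::V6(v6) => is_denied_v6(v6),
    }
}

/// A host is acceptable only if it resolved to at least one address and
/// every resolved address passes, so one private record poisons the set.
fn all_resolved_ips_allowed(ips: &[IpAddr]) -> bool {
    !ips.is_empty() && ips.iter().all(|ip| !is_denied(*ip))
}

fn main() {
    let public: IpAddr = "93.184.216.34".parse().unwrap();
    let metadata: IpAddr = "169.254.169.254".parse().unwrap();
    println!("public only: {}", all_resolved_ips_allowed(&[public]));
    println!("public + metadata: {}", all_resolved_ips_allowed(&[public, metadata]));
}
```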
Build
- Bindings, facades, READMEs, docs, stubs, and e2e suites are generated by alef (pinned at 0.26.6) across all 14 language targets.
- Publish-pipeline hardening: a native per-arch Docker matrix that drops QEMU emulation, Flutter-free Dart native builds for pub.dev, Swift artifactbundle checksum injection and Apple system-framework linking, and lockfile-preserving source publishes for the Elixir NIF, PHP extension, and Ruby gem.
Zig
Add to your build.zig.zon:
```zig
.dependencies = .{
    .kreuzcrawl-zig = .{
        .url = "https://github.com/kreuzberg-dev/kreuzcrawl/releases/download/v0.3.0/kreuzcrawl-zig-v0.3.0.tar.gz",
        .hash = "kreuzcrawl-0.3.0-l-oqNoO5eCDhkMWBHtd-su6btmha_K-hk0D2x5Ooq7B9",
    },
},
```
v0.3.0-rc.88
Fixed
- Docker: multi-arch publish no longer times out on the arm64 leg. `Publish Docker Images` built `linux/amd64,linux/arm64` in a single job with the arm64 image compiled under QEMU emulation, which routinely ran the full Rust build right up against the 120-minute job timeout (rc.86 squeaked through at 103 min; rc.87 tipped over and was cancelled, leaving no `0.3.0-rc.87` image on GHCR). The job now builds each architecture natively in a matrix (`amd64` on `ubuntu-latest`, `arm64` on `ubuntu-24.04-arm`), pushes each by digest, and merges them into a single manifest list with `docker buildx imagetools create`, matching the canonical infra pattern and removing QEMU entirely. Per-arch GHA cache scopes also fix a `cache-from`/`cache-to` scope mismatch that previously defeated cache reuse. (`.github/workflows/publish-docker.yaml`)
Build
- Regenerated all bindings against alef 0.26.6 (pin bumped from 0.26.3). Folds in the accumulated 0.26.4→0.26.6 codegen changes (including the pyo3 Python trait-callback reliability fix and the Dart mirror/opaque/from_json declaration fix).
Zig
Add to your build.zig.zon:
```zig
.dependencies = .{
    .kreuzcrawl-zig = .{
        .url = "https://github.com/kreuzberg-dev/kreuzcrawl/releases/download/v0.3.0-rc.88/kreuzcrawl-zig-v0.3.0-rc.88.tar.gz",
        .hash = "kreuzcrawl-0.3.0-rc.88-l-oqNv3EeCARU_oSw7q0Ey1p26mbnSkJ7NBo160AuQcR",
    },
},
```
v0.3.0-rc.87
Added
- MCP: Streamable HTTP transport at `/mcp`. When the binary is built with both the `api` and `mcp` features (the CLI is), the REST API server now also exposes the MCP server over rmcp's Streamable HTTP transport, mounted outside the REST middleware stack so request-timeout/compression layers don't break MCP SSE. Each tool also declares its safety annotations (`read_only`/`destructive`/`open_world` hints). (`crates/kreuzcrawl/src/mcp`, `crates/kreuzcrawl/src/api/router.rs`)
Fixed
- MCP: the server exposed zero tools. The `impl ServerHandler` was missing rmcp's `#[tool_handler]`, so `tools/list`/`tools/call` silently returned an empty list over both stdio and HTTP; every MCP client through rc.86 saw a tool-less server. The handler now delegates to the generated tool router. (`crates/kreuzcrawl/src/mcp/server.rs`)
- Swift: `RustBridgeC.h` no longer regresses to the placeholder. rc.86 shipped the typedef-only placeholder header (reverting the populated header from rc.85), so every SwiftPM consumer of the source package failed to compile with thousands of `cannot find '__swift_bridge__$…' in scope` errors. The alef 0.26.3 bump below preserves an already-populated header across `alef all --clean`, and the umbrella header (994 swift-bridge C declarations) is repopulated here.
Build
- Regenerated all bindings against alef 0.26.3 (pin bumped from 0.25.60). Beyond the Swift header preservation fix above, this folds in the accumulated 0.25.60→0.26.3 codegen changes.
Zig
Add to your build.zig.zon:
```zig
.dependencies = .{
    .kreuzcrawl-zig = .{
        .url = "https://github.com/kreuzberg-dev/kreuzcrawl/releases/download/v0.3.0-rc.87/kreuzcrawl-zig-v0.3.0-rc.87.tar.gz",
        .hash = "kreuzcrawl-0.3.0-rc.87-l-oqNlnHeSAlSjd6Z9WxFtrTMIzeT11cfH85hL9n9W5q",
    },
},
```
v0.3.0-rc.86
Build
- Regenerated all bindings against alef 0.25.60 (pin bumped from 0.25.59). Notable codegen fixes: the generated Rust e2e/test-app `common.rs` now resolves the mock-server binary via `env!("CARGO_BIN_EXE_mock-server")` instead of a hardcoded `target/release/mock-server` path (so `cargo test` debug builds spawn it correctly), and the Kotlin Android `MockServerListener` now parses the `MOCK_SERVERS` env map on the preset path so per-fixture `mockServer.<id>` lookups resolve under the registry-mode test runner.
Zig
Add to your build.zig.zon:
```zig
.dependencies = .{
    .kreuzcrawl-zig = .{
        .url = "https://github.com/kreuzberg-dev/kreuzcrawl/releases/download/v0.3.0-rc.86/kreuzcrawl-zig-v0.3.0-rc.86.tar.gz",
        .hash = "kreuzcrawl-0.3.0-rc.86-l-oqNnHJeSBVbHIEt2GQxgeXjxKLiAHHXSOKdNd5taQj",
    },
},
```
v0.3.0-rc.85
Fixed
- Dart pub.dev publish no longer requires Flutter to build natives. The per-platform native dylib build used `build-dart-package`, which installs Flutter via `subosito/flutter-action`; Flutter ships no Linux ARM64 stable SDK, so the `linux-arm64` leg failed ("Unable to determine Flutter version … architecture: arm64") and blocked the entire pub.dev publish. The native build needs only Rust (`frb_generated.rs` is committed), so the matrix now builds each native with `cargo build --locked -p kreuzcrawl-dart --release` directly, matching the canonical liter-llm / kreuzberg pattern. (`.github/workflows/publish-pubdev.yaml`)
Build
- Regenerated all bindings against alef 0.25.59 (pin bumped from 0.25.55). Notable codegen changes: the Swift `RustBridgeC.h` is now the full concatenated swift-bridge C header instead of a placeholder, and the JNI `NativeLib` throws a descriptive `ExceptionInInitializerError` naming the missing native symbol instead of a bare `orElseThrow`.
- Supersedes rc.84, whose Homebrew bottle-merge and release-finalize steps were stranded by a transient crates.io fetch flake in the `x86_64_linux` bottle build (leaving the tap formula's `bottle do` block pinned to the rc.83 root URL).
Zig
Add to your build.zig.zon:
```zig
.dependencies = .{
    .kreuzcrawl-zig = .{
        .url = "https://github.com/kreuzberg-dev/kreuzcrawl/releases/download/v0.3.0-rc.85/kreuzcrawl-zig-v0.3.0-rc.85.tar.gz",
        .hash = "kreuzcrawl-0.3.0-rc.85-l-oqNr_keSCO9RQ7ifk4XGg6WZ4fMfoVqRf8Ocej5zng",
    },
},
```
v0.3.0-rc.84
Added
- CLI: `batch-scrape`, `batch-crawl`, `download`, `citations`, and `version` subcommands. Wire the existing core entry points into the CLI so the CLI, MCP server, and core surfaces are 1:1. `download` mirrors the MCP download tool (scrape with document download enabled, emitting the downloaded document's metadata); `citations` converts markdown links to numbered citations; `version` prints the crate version as JSON. (`crates/kreuzcrawl-cli`)
- MCP: `batch_crawl` and `generate_citations` tools. `batch_crawl` crawls multiple seed URLs concurrently (mirroring `batch_scrape`); `generate_citations` converts markdown links into numbered citations. (`crates/kreuzcrawl/src/mcp`)
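To make "converts markdown links to numbered citations" concrete, here is a minimal std-only sketch. The `to_citations` function and its `text[n]` output format are assumptions for illustration; the actual `citations` / `generate_citations` output may differ, and this sketch does not special-case images or nested brackets.

```rust
/// Hypothetical converter: rewrites `[text](url)` to `text[n]` and returns
/// the cited URLs in first-seen order (index 0 is citation 1).
fn to_citations(markdown: &str) -> (String, Vec<String>) {
    let mut out = String::new();
    let mut urls: Vec<String> = Vec::new();
    let mut rest = markdown;

    // Each link is located by its `](` separator, then bounded on both sides.
    while let Some(mid) = rest.find("](") {
        let open = rest[..mid].rfind('[');
        let close = rest[mid + 2..].find(')');
        match (open, close) {
            (Some(open), Some(close)) => {
                let text = &rest[open + 1..mid];
                let url = &rest[mid + 2..mid + 2 + close];
                // Reuse the number when the same URL was cited before.
                let existing = urls.iter().position(|u| u.as_str() == url);
                let n = match existing {
                    Some(i) => i + 1,
                    None => {
                        urls.push(url.to_string());
                        urls.len()
                    }
                };
                out.push_str(&rest[..open]);
                out.push_str(&format!("{text}[{n}]"));
                rest = &rest[mid + 2 + close + 1..];
            }
            _ => {
                // Not a complete link: copy through the `](` and keep scanning.
                out.push_str(&rest[..mid + 2]);
                rest = &rest[mid + 2..];
            }
        }
    }
    out.push_str(rest);
    (out, urls)
}

fn main() {
    let input = "See [Rust](https://www.rust-lang.org) and [docs](https://doc.rust-lang.org).";
    let (text, urls) = to_citations(input);
    println!("{text}");
    for (i, url) in urls.iter().enumerate() {
        println!("[{}] {url}", i + 1);
    }
}
```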
Changed
- chore(precommit): drop the conflicting kotlin-android ktlint hook; ktfmt is the sole formatter. ktlint's always-format mode fought ktfmt (blank-line-after-brace) and rewrote alef's `///` doc comments to `// /`, breaking `alef verify`. detekt remains for static analysis. Also excluded the vendored Gradle wrapper from shellcheck. (`.pre-commit-config.yaml`)
Removed
- MCP: dropped the unimplemented `screenshot`, `research`, and `crawl_status` tools. They had no backing core capability and only ever returned "not yet implemented", advertising tools that always failed. Every remaining MCP tool is now 1:1 with a CLI subcommand and a real core function. (`crates/kreuzcrawl/src/mcp`)
Fixed
- Release: CLI binaries are now reliably attached to a published release. The GitHub release is created as a draft and was only un-drafted by `release-finalize`, which is skipped whenever any unrelated publish leg fails, stranding the whole release (CLI binaries included) as an invisible draft. `upload-cli-release` now publishes the release directly via `publish-github-release@v1` with `draft: false` (un-drafting the existing draft) and runs even on partial CLI-build failures, mirroring the sibling repos. (`.github/workflows/publish.yaml`)
- Memory-bounded streaming crawl. `crawl_stream`/`batch_crawl_stream` now move each page into its `CrawlEvent::Page` and drop it instead of accumulating every page, bounding peak memory on large crawls (≈2.5 GB → ≈20 MB working set). `crawl()`'s batch result is unchanged; `Complete.pages_crawled` reports the exact post-filter count, and a terminal `Complete` is emitted on the seed-error path. (`crates/kreuzcrawl/src/engine`)
- Dart package ships native libraries. The pub.dev package previously bundled no native libs, so the flutter_rust_bridge loader fell back to a relative framework path that macOS hardened-runtime rejects. The publish pipeline now builds the native on a 5-platform matrix and stages each into `lib/src/native/<rid>/` before publishing. (`.github/workflows/publish-pubdev.yaml`)
- Swift package links Apple system frameworks. The published root `Package.swift` now links `Security`, `CoreFoundation`, and `SystemConfiguration` on the `RustBridge` target so the pre-built static library's `SC*` symbols (reqwest proxy detection) resolve for remote SwiftPM consumers. (regenerated via alef 0.25.55)
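The memory-bounded streaming idea (move each page into its event, hand it to the consumer, and drop it) can be sketched with a bounded std channel. This is an illustration under assumptions, not the engine's code: the `CrawlEvent` shape here is simplified, the free function `crawl_stream` and its parameters are hypothetical, and pages are stand-in byte buffers.

```rust
use std::sync::mpsc::{sync_channel, Receiver};
use std::thread;

/// Simplified event type; the real `CrawlEvent` carries richer data.
#[derive(Debug)]
enum CrawlEvent {
    Page(Vec<u8>),                      // page body is moved into the event
    Complete { pages_crawled: usize },  // terminal event with the final count
}

/// Hypothetical producer: emits `page_count` pages of `page_size` bytes.
/// The channel holds at most one pending event, so the producer never
/// builds up a backlog of pages in memory.
fn crawl_stream(page_count: usize, page_size: usize) -> Receiver<CrawlEvent> {
    let (tx, rx) = sync_channel(1);
    thread::spawn(move || {
        let mut sent = 0;
        for _ in 0..page_count {
            let body = vec![0u8; page_size];
            // `body` is moved into the event; nothing is retained here.
            if tx.send(CrawlEvent::Page(body)).is_err() {
                return; // consumer went away
            }
            sent += 1;
        }
        let _ = tx.send(CrawlEvent::Complete { pages_crawled: sent });
    });
    rx
}

fn main() {
    let mut bytes_seen = 0;
    for event in crawl_stream(1000, 64 * 1024) {
        match event {
            // The page body is dropped at the end of this arm.
            CrawlEvent::Page(body) => bytes_seen += body.len(),
            CrawlEvent::Complete { pages_crawled } => {
                println!("{pages_crawled} pages, {bytes_seen} bytes streamed");
            }
        }
    }
}
```

In this sketch only a few pages are alive at any moment, whereas collecting the same events into a `Vec` would keep all of them resident, which is the accumulation pattern the entry above describes removing.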
Build
- Regenerated all bindings against alef 0.25.55: generic Go FFI provisioning (test-app runner delegates to the binding's own `download_ffi`; no project-specific download in the generic generator), swift framework linking, zig stale-hash strip, and test-app `[crates.e2e.env]` export.
Zig
Add to your build.zig.zon:
```zig
.dependencies = .{
    .kreuzcrawl-zig = .{
        .url = "https://github.com/kreuzberg-dev/kreuzcrawl/releases/download/v0.3.0-rc.84/kreuzcrawl-zig-v0.3.0-rc.84.tar.gz",
        .hash = "kreuzcrawl-0.3.0-rc.84-l-oqNnXKeSCF_EYtxYkhF8AiaMo2OHBXJdTmZGqYDwDl",
    },
},
```
v0.3.0-rc.83
Full Changelog: v0.3.0-rc.74...v0.3.0-rc.83
Zig
Add to your build.zig.zon:
```zig
.dependencies = .{
    .kreuzcrawl-zig = .{
        .url = "https://github.com/kreuzberg-dev/kreuzcrawl/releases/download/v0.3.0-rc.83/kreuzcrawl-zig-v0.3.0-rc.83.tar.gz",
        .hash = "kreuzcrawl-0.3.0-rc.83-l-oqNmD2dSCgZ0poTHayw9oY3xhCSBNRHS5zmEo0Drd6",
    },
},
```
v0.3.0-rc.82
Changed
- Bump alef pin 0.25.49 → 0.25.50 and regenerate all bindings.
Fixed
- Swift: cfg-gated struct fields broke the swift-bridge build. The Swift backend now filters constructor params, getter externs, and wrapper field initializers by `#[cfg]` against the configured feature set, and recurses inbound bridge-type conversion through `Optional`/`Vec`/`Map` so nested opaque types are JSON-bridged correctly. (alef 0.25.50)
- Homebrew: formula source checksum mismatch. The release pipeline now computes the formula's source-tarball sha after the Swift-checksum job force-moves the release tag, so the bottle builds verify against the final archive instead of a pre-move hash.
Docs
- Document the kreuzcrawl agent plugin in the README.
Zig
Add to your build.zig.zon:
```zig
.dependencies = .{
    .kreuzcrawl-zig = .{
        .url = "https://github.com/kreuzberg-dev/kreuzcrawl/releases/download/v0.3.0-rc.82/kreuzcrawl-zig-v0.3.0-rc.82.tar.gz",
        .hash = "kreuzcrawl-0.3.0-rc.82-l-oqNh4FdiBzbuEKrLMB5XhC0KEHcta0_XeOIIosbaT3",
    },
},
```
v0.3.0-rc.81
Changed
- Bump alef pin 0.25.47 → 0.25.49 and regenerate all bindings.
- Kotlin Android: modernized the build toolchain. Gradle 9.6, Android Gradle Plugin 9.2, Kotlin 2.4, compileSdk 36, minSdk 24 (the full-latest triple required under JDK 25 daemons); the Gradle wrapper is now generated from a shared module so the binding and e2e wrappers stay in lockstep. (alef 0.25.49)
Fixed
- Java (Panama FFM) returned empty response bodies. The handler upcall allocated its response in a per-call arena that could be reclaimed before Rust read the returned pointer; the response is now copied into `malloc`'d memory that outlives the upcall, so full bodies survive the FFI boundary. (alef 0.25.48)
- Java: `SsrfPolicy.deny_private` lost its default. Boxed-`Boolean` serde-default fields now restore their Rust default (`deny_private = true`) in the compact constructor when omitted, matching the core and the other bindings. (alef 0.25.49)
- Swift: hard-coded bridge crate name in the post-build step. The Swift backend now builds the bridge crate via `packages/swift/rust/Cargo.toml` instead of a fixed package name, keeping generated bindings project-agnostic. (alef 0.25.49)
- PHP: parse `serde_json::Value` params as JSON strings, convert `BTreeMap` params from PHP hash maps, serialize string-mapped data-enum returns, and avoid requiring `Default` for flat data-enum fallback arms. (alef 0.25.49)
- NAPI: emit binding-to-core conversions for plain data enums in input DTOs and apply primitive casts on the DTO opaque let-binding call path. (alef 0.25.48–0.25.49)
- FFI: convert `Option<&[u8]>` parameters as optional bytes rather than a nullable C string. (alef 0.25.48)
Zig
Add to your build.zig.zon:
```zig
.dependencies = .{
    .kreuzcrawl-zig = .{
        .url = "https://github.com/kreuzberg-dev/kreuzcrawl/releases/download/v0.3.0-rc.81/kreuzcrawl-zig-v0.3.0-rc.81.tar.gz",
        .hash = "kreuzcrawl-0.3.0-rc.81-l-oqNug2diABJSOQ5k2jm6dlZWWfKsbmaVfK81fv2kZq",
    },
},
```
v0.3.0-rc.80
Changed
- Bump alef pin 0.25.43 → 0.25.47 and regenerate all bindings.
Fixed
- Java (Panama FFM) crashed on the first request. The `register_route` downcall descriptor omitted the C context parameter and the upcall adapter mis-read the request pointer; both are corrected, so route registration and request marshalling work end-to-end. (alef 0.25.47)
- PHP ignored function-path serde defaults. `CrawlConfig` from JSON now honors field defaults such as `SsrfPolicy::from_env()` when `ssrf` is omitted, matching the core and the other bindings instead of falling back to deny. (alef 0.25.47)
- Kotlin e2e: compare nullable `Boolean?` assertions explicitly so safe-call field paths compile. (alef 0.25.47)
- Dart: cfg-gate generated Rust-bridge items that reference feature-gated core types, and fix slice/`&Path`/`&BTreeMap`/`serde_json::Value` parameter handling so the Flutter Rust bridge crate compiles. (alef 0.25.46–0.25.47)
- JNI: marshal `Option<&[u8]>` and `&BTreeMap` parameters to the core's declared types. (alef 0.25.44)
- SSRF: allow private networks on the `wasm32` target and respect `KREUZCRAWL_ALLOW_PRIVATE_NETWORK` on native.
- Elixir Hex publish is gated on a complete NIF upload so a flaky NIF leg cannot ship an immutable package with an incomplete checksum file.
- Swift `Package.swift` receives the real artifactbundle SHA256 in the release tag instead of a placeholder, unblocking SwiftPM consumers.
- Test-app harness: resolve the published Go module directory, rebuild the Dart native to the published content hash, and set the SSRF private-network override at process level for C#/Elixir.
Zig
Add to your build.zig.zon:
```zig
.dependencies = .{
    .kreuzcrawl-zig = .{
        .url = "https://github.com/kreuzberg-dev/kreuzcrawl/releases/download/v0.3.0-rc.80/kreuzcrawl-zig-v0.3.0-rc.80.tar.gz",
        .hash = "kreuzcrawl-0.3.0-rc.80-l-oqNtwpdiCK9M_hosX4K5axm70pN8jMvZf59Dkzwq4L",
    },
},
```