Release v0.2.0 · vysakh0/redhop

The binding-parity + non-English release. Three months of incremental
quality work plus a focused arc on cross-binding consistency: Python, Node,
and Rust all expose the same surface, return the same values for the same
inputs, and drift is now actively prevented in CI. The Rust crate also gains
a pluggable lexical analyzer, closing the structural bug class (BM25 ↔
grounding-scorer disagreement) that 0.1.3–0.1.4 fixed by hand four times.

Breaking changes

Two source-level breaks for Rust callers; ..Default::default() and the
pip/npm consumers are unaffected:

ContextConfig + DocumentConfig grew new required fields
(analyzer: Arc<dyn Analyzer>) for the pluggable lexical analyzer.
Callers constructing those structs via field literals from outside
the crate need to add analyzer: redhop::analyzer::default_english().
ContextConfig::default().token_budget changed from 2048 → 8192
to align with the Python binding's long-standing default (which was
shipping to PyPI users that whole time). Rust callers relying on the
old 2048 default will now get a 4× larger assembled context. Set
token_budget: 2048 explicitly to restore the old behavior. Python +
Node users see no change.

Added

Pluggable lexical analyzer

crate::analyzer::Analyzer trait + SnowballAnalyzer (18
Snowball Porter2 languages). First-class extension point: one analyzer
drives BOTH the BM25 retriever AND the grounding scorer, so the two
layers structurally cannot disagree on what "the same term" means.
Design rationale in docs/design/ANALYZER_PLUGIN.md; usage in
docs/LANGUAGE.md.
Document::with_analyzer(Arc<dyn Analyzer>) — mirrors
with_embedder. Swaps the analyzer for both layers in lockstep.
LoadOptions::language: Option<String> — string-routed access to
the 18 builtins (english, german, french, spanish, italian,
portuguese, dutch, russian, swedish, norwegian, danish,
finnish, romanian, hungarian, turkish, arabic, greek,
tamil). Unknown language names return an error (no silent fallback
to English).
Python language kwarg on every Document.from_* constructor.
Node language field on Options.

Binding parity (Node catches up to Python)

Document.analyze(query) — pure diagnostics, returns the same
Report shape as context().report without paying assembly cost.
Document.nFiles getter — number of source files indexed (1
for single-source ctors, the readable count for fromFolder).
Document.skippedFiles getter — SkippedFile[] ({source, reason} pairs) for files fromFolder couldn't parse. Was a silent
skip with no introspection before.
buildContext / filterContext / analyzeContext /
contextEconomics top-level functions — the low-level "I do my own
retrieval, just want RedHop for assembly" surface. Mirrors Python's
same-named functions; takes ChunkInput[] + ContextOptions.
groundingScore(query, text) + linkStrength(a, b) — the
observability primitives the strategies use internally, exposed so
external code reuses RedHop's exact relevance notion instead of
reimplementing.

Tests + infrastructure

crates/redhop/tests/quality_suite.rs — 45-test behavior-level
suite organized by what a user perceives, not by code structure.
Covers tokenization (T01-T07), multi-field reach (T08-T09), document
structure (T10-T13), context assembly (T14-T20), hybrid contract
(T21-T22), edge cases (T23-T26), Unicode/multilingual (T27-T30),
adversarial queries (T31-T34), nested markdown (T35), cross-format
mixed corpus (T36), non-English pinning (T37-T40), and the analyzer
plugin (T41-T45). Found two real bugs on its first runs (an
empty-query BM25 crash and an accent-folding gap), and a binding bug
via T41-T44 (from_chunks silently dropping language= in Python).
python/tests/test_parity_node.py + nodejs/test/parity_runner.cjs
— cross-binding parity harness. 6 tests hand identical inputs to
Python and Node and diff structured outputs (caught the
analyzeContext / contextEconomics token_budget divergence on
its first run).
crates/cli/tests/cli_smoke.rs — first-ever CLI integration
tests. Asserts --help works on each subcommand + a real
analyze-context - stdin pipe.
Node CI job — .github/workflows/ci.yml now builds the napi
addon and runs npm test on PRs. Previously PRs only exercised
Rust + Python.
ASCII folding (café ↔ cafe, Süßigkeit ↔ Sussigkeit,
naïve ↔ naive) in both BM25 and the grounding scorer (via NFKD).
New tests T27, T28, T39 pin this.

Documentation

docs/LANGUAGE.md — honest scope of non-English support, by
family + the Analyzer plugin's public API (Rust / Python / Node).
docs/design/ANALYZER_PLUGIN.md — rewritten to describe the
shipped surface (was originally a proposal with several deviations).
README "Language support" section + per-package READMEs
(python/README.md, nodejs/README.md) — language= examples.
docs/ARCHITECTURE.md — refreshed against the post-consolidation
workspace (the pre-0.2 split of redhop-{core,context,…} into
separate crates was rolled into one published redhop crate; diagram
and crate-name references updated).
docs/API_STABILITY.md — full Node section added; Python section
updated with language=, n_files, skipped_files; Rust section
updated with the consolidated module paths.

Changed

Python folder walker unified with Rust's read_folder_with —
−429 LOC in python/src/lib.rs (≈25% of the file). Removed the
parallel build_folder_persisted, collect_files, PersistedIndex,
CachedFile, fingerprint, etc. Both bindings now share Rust's
single implementation; on-disk index format is byte-compatible with
the previous Python writer, so existing caches reload cleanly.
strategy_from_str + retrieval_from_str consolidated to a
single source of truth in redhop::load. Python's wrappers now
forward to the Rust functions with map_err instead of duplicating
the match arms.
Document carries n_files() and skipped_files() accessors on
the Rust struct. Single-source constructors default to 1 / empty;
read_folder_with (both simple and persisted paths) now records
(source, reason) for each skipped file instead of silently dropping
them.
MSRV bumped 1.75 → 1.77 across all three workspace declarations
(workspace, python/Cargo.toml, nodejs/Cargo.toml) — the napi-rs
2.x in the Node binding sets the actual floor; the inconsistency
meant a 1.75 user hit a mysterious napi error instead of a clear MSRV
one.

Fixed

All-stopword query no longer crashes BM25. A query the analyzer
pipeline reduces to zero positive terms ("", " ", "the and is of in or") used to surface as a hard Tantivy error (Invalid query: Only excluding terms given). The retriever now traps that error
class (and the empty query class) and returns an empty result.
Caught by quality_suite::t25 on its first run.
Python Document.from_chunks silently dropped language= — the
pyo3 signature accepted the kwarg but the call into doc_config
passed None instead of the user's value. Caught by the new Python
analyzer test suite on its first run.
Node analyzeContext / contextEconomics were honoring the
user's token_budget option; Python's equivalents hardcode
usize::MAX because these are no-budget pure-analysis surfaces.
Caught by the cross-binding parity tests on their first run.
Node index.d.ts was stale — the language field and
minCandidates field were present on the Rust Options struct but
hadn't been regenerated. TypeScript users got "Object literal may
only specify known properties" on perfectly valid options.

Notes

unicode-normalization promoted from transitive (via tantivy) to a
direct dep of redhop. Used for the grounding scorer's NFKD fold.
Workspace test count: 320/320 (Rust) + 81/81 (Python, +1 BGE
fixture skip) + Node smoke + analyzer suites. Was 260 at the v0.1.4
tag.
CI gates: cargo fmt --all -- --check, cargo clippy --workspace --all-targets -- -D warnings, cargo test --workspace, cargo doc --workspace --no-deps --features files,semantic (warning-free), the
cross-binding parity suite, and the Node smoke + analyzer suites.
All six CI jobs green.
[package.metadata.docs.rs] all-features = true added to the
redhop crate so the published doc page on docs.rs shows the
files + semantic items instead of just the lean lexical surface.
21 example files swept clean of hardcoded /Users/vysakh/... paths;
they resolve datasets/models/exports through
redhop_examples::{data_path, exports_path, model_path, bge_small_paths, ms_marco_paths} helpers that honor
REDHOP_{DATA,EXPORTS,MODELS}_DIR env vars.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v0.2.0

Choose a tag to compare

Sorry, something went wrong.