v0.2.0
The binding-parity + non-English release. Three months of incremental
quality work plus a focused arc on cross-binding consistency: Python, Node,
and Rust all expose the same surface, return the same values for the same
inputs, and drift is now actively prevented in CI. The Rust crate also gains
a pluggable lexical analyzer, closing the structural bug class (BM25 ↔
grounding-scorer disagreement) that 0.1.3–0.1.4 fixed by hand four times.
Breaking changes
Two source-level breaks for Rust callers; ..Default::default() and the
pip/npm consumers are unaffected:
ContextConfig+DocumentConfiggrew new required fields
(analyzer: Arc<dyn Analyzer>) for the pluggable lexical analyzer.
Callers constructing those structs via field literals from outside
the crate need to addanalyzer: redhop::analyzer::default_english().ContextConfig::default().token_budgetchanged from 2048 → 8192
to align with the Python binding's long-standing default (which was
shipping to PyPI users that whole time). Rust callers relying on the
old 2048 default will now get a 4× larger assembled context. Set
token_budget: 2048explicitly to restore the old behavior. Python +
Node users see no change.
Added
Pluggable lexical analyzer
crate::analyzer::Analyzertrait +SnowballAnalyzer(18
Snowball Porter2 languages). First-class extension point: one analyzer
drives BOTH the BM25 retriever AND the grounding scorer, so the two
layers structurally cannot disagree on what "the same term" means.
Design rationale indocs/design/ANALYZER_PLUGIN.md; usage in
docs/LANGUAGE.md.Document::with_analyzer(Arc<dyn Analyzer>)— mirrors
with_embedder. Swaps the analyzer for both layers in lockstep.LoadOptions::language: Option<String>— string-routed access to
the 18 builtins (english,german,french,spanish,italian,
portuguese,dutch,russian,swedish,norwegian,danish,
finnish,romanian,hungarian,turkish,arabic,greek,
tamil). Unknown language names return an error (no silent fallback
to English).- Python
languagekwarg on everyDocument.from_*constructor. - Node
languagefield onOptions.
Binding parity (Node catches up to Python)
Document.analyze(query)— pure diagnostics, returns the same
Reportshape ascontext().reportwithout paying assembly cost.Document.nFilesgetter — number of source files indexed (1
for single-source ctors, the readable count forfromFolder).Document.skippedFilesgetter —SkippedFile[]({source, reason}pairs) for filesfromFoldercouldn't parse. Was a silent
skip with no introspection before.buildContext/filterContext/analyzeContext/
contextEconomicstop-level functions — the low-level "I do my own
retrieval, just want RedHop for assembly" surface. Mirrors Python's
same-named functions; takesChunkInput[]+ContextOptions.groundingScore(query, text)+linkStrength(a, b)— the
observability primitives the strategies use internally, exposed so
external code reuses RedHop's exact relevance notion instead of
reimplementing.
Tests + infrastructure
crates/redhop/tests/quality_suite.rs— 45-test behavior-level
suite organized by what a user perceives, not by code structure.
Covers tokenization (T01-T07), multi-field reach (T08-T09), document
structure (T10-T13), context assembly (T14-T20), hybrid contract
(T21-T22), edge cases (T23-T26), Unicode/multilingual (T27-T30),
adversarial queries (T31-T34), nested markdown (T35), cross-format
mixed corpus (T36), non-English pinning (T37-T40), and the analyzer
plugin (T41-T45). Found two real bugs on its first runs (an
empty-query BM25 crash and an accent-folding gap), and a binding bug
via T41-T44 (from_chunkssilently droppinglanguage=in Python).python/tests/test_parity_node.py+nodejs/test/parity_runner.cjs
— cross-binding parity harness. 6 tests hand identical inputs to
Python and Node and diff structured outputs (caught the
analyzeContext/contextEconomicstoken_budgetdivergence on
its first run).crates/cli/tests/cli_smoke.rs— first-ever CLI integration
tests. Asserts--helpworks on each subcommand + a real
analyze-context -stdin pipe.- Node CI job —
.github/workflows/ci.ymlnow builds the napi
addon and runsnpm teston PRs. Previously PRs only exercised
Rust + Python. - ASCII folding (
café↔cafe,Süßigkeit↔Sussigkeit,
naïve↔naive) in both BM25 and the grounding scorer (via NFKD).
New tests T27, T28, T39 pin this.
Documentation
docs/LANGUAGE.md— honest scope of non-English support, by
family + theAnalyzerplugin's public API (Rust / Python / Node).docs/design/ANALYZER_PLUGIN.md— rewritten to describe the
shipped surface (was originally a proposal with several deviations).- README "Language support" section + per-package READMEs
(python/README.md,nodejs/README.md) —language=examples. docs/ARCHITECTURE.md— refreshed against the post-consolidation
workspace (the pre-0.2 split ofredhop-{core,context,…}into
separate crates was rolled into one publishedredhopcrate; diagram
and crate-name references updated).docs/API_STABILITY.md— full Node section added; Python section
updated withlanguage=,n_files,skipped_files; Rust section
updated with the consolidated module paths.
Changed
- Python folder walker unified with Rust's
read_folder_with—
−429 LOC inpython/src/lib.rs(≈25% of the file). Removed the
parallelbuild_folder_persisted,collect_files,PersistedIndex,
CachedFile,fingerprint, etc. Both bindings now share Rust's
single implementation; on-disk index format is byte-compatible with
the previous Python writer, so existing caches reload cleanly. strategy_from_str+retrieval_from_strconsolidated to a
single source of truth inredhop::load. Python's wrappers now
forward to the Rust functions withmap_errinstead of duplicating
the match arms.Documentcarriesn_files()andskipped_files()accessors on
the Rust struct. Single-source constructors default to1/ empty;
read_folder_with(both simple and persisted paths) now records
(source, reason)for each skipped file instead of silently dropping
them.- MSRV bumped 1.75 → 1.77 across all three workspace declarations
(workspace,python/Cargo.toml,nodejs/Cargo.toml) — the napi-rs
2.x in the Node binding sets the actual floor; the inconsistency
meant a 1.75 user hit a mysterious napi error instead of a clear MSRV
one.
Fixed
- All-stopword query no longer crashes BM25. A query the analyzer
pipeline reduces to zero positive terms (""," ","the and is of in or") used to surface as a hard Tantivy error (Invalid query: Only excluding terms given). The retriever now traps that error
class (and theempty queryclass) and returns an empty result.
Caught byquality_suite::t25on its first run. - Python
Document.from_chunkssilently droppedlanguage=— the
pyo3 signature accepted the kwarg but the call intodoc_config
passedNoneinstead of the user's value. Caught by the new Python
analyzer test suite on its first run. - Node
analyzeContext/contextEconomicswere honoring the
user'stoken_budgetoption; Python's equivalents hardcode
usize::MAXbecause these are no-budget pure-analysis surfaces.
Caught by the cross-binding parity tests on their first run. - Node
index.d.tswas stale — thelanguagefield and
minCandidatesfield were present on the RustOptionsstruct but
hadn't been regenerated. TypeScript users got "Object literal may
only specify known properties" on perfectly valid options.
Notes
unicode-normalizationpromoted from transitive (via tantivy) to a
direct dep of redhop. Used for the grounding scorer's NFKD fold.- Workspace test count: 320/320 (Rust) + 81/81 (Python, +1 BGE
fixture skip) + Node smoke + analyzer suites. Was 260 at the v0.1.4
tag. - CI gates:
cargo fmt --all -- --check,cargo clippy --workspace --all-targets -- -D warnings,cargo test --workspace,cargo doc --workspace --no-deps --features files,semantic(warning-free), the
cross-binding parity suite, and the Node smoke + analyzer suites.
All six CI jobs green. [package.metadata.docs.rs] all-features = trueadded to the
redhop crate so the published doc page on docs.rs shows the
files+semanticitems instead of just the lean lexical surface.- 21 example files swept clean of hardcoded
/Users/vysakh/...paths;
they resolve datasets/models/exports through
redhop_examples::{data_path, exports_path, model_path, bge_small_paths, ms_marco_paths}helpers that honor
REDHOP_{DATA,EXPORTS,MODELS}_DIRenv vars.