v0.1.4
Citation ergonomics — for both code and prose. Follow-on to 0.1.3's
BM25-quality theme: retrieval is now precise, but the assembled context
returned to the LLM left the user staring at a def line without the
implementation, or at a deep section paragraph without its parent heading.
Both gaps closed by default; both have explicit opt-out knobs. Plus three
prose-side fixes that surfaced during the audit (setext headings, PDF
heading heuristic, plumbing).
Changed
- Document::context(query) on a code chunk now attaches ±1 neighbor
chunks by default. Code is chunked as fixed-token windows, so a 50-line
function often spans 2-3 chunks; a hit on the chunk containing the def
line would previously cite only the signature, omitting the body in the
next chunk. With DocumentConfig::code_neighbors_default = 1
(the new default), citations on code hits include the surrounding
implementation. Behavior change for code-shaped corpora: set
code_neighbors_default: 0 to restore the old chunk-only behavior. No
effect on prose corpora (fires only on chunks tagged
metadata["kind"] == "code").
- Document::context(query) on a prose chunk with a section heading
now attaches the section's opener chunk by default. A query that
lands deep inside ## Refunds → ### Eligibility previously cited only
the matched chunk, so the LLM lost the section title. With
DocumentConfig::prose_heading_default = true (the new default), the
section's first chunk is attached. Behavior change for hierarchical
prose: set prose_heading_default: false to disable. Only fires on
chunks that carry non-empty metadata["heading"] (markdown, DOCX,
PPTX, XLSX, and, new in this release, PDF).
- Markdown sections now recognize setext headings (Title\n=====
for H1, Title\n----- for H2) in addition to ATX (#/##/…). YAML
frontmatter (--- ... --- at file start) is detected and excluded
from setext scanning so its closing fence doesn't get treated as an
H2 underline. Pandoc output, older docs, and man pages now produce the
same section structure as their ATX equivalents.
- PDF chunks now carry best-effort heading metadata. A per-page
heuristic lifts the first short, non-paragraph-shaped line into
Section::heading (rejecting page-number footers, body lines ending
in sentence punctuation, and lines ending in a digit). This lets the BM25
heading-field search added in 0.1.3 actually reach PDF chunks by
topic; previously metadata["heading"] was always None on PDFs.
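
For illustration, the setext and frontmatter rule described in the list above can be sketched as a small standalone scanner. This is not the crate's implementation: the Heading type and the find_headings, frontmatter_len, and is_underline functions are hypothetical names, and the sketch ignores fenced code blocks and other markdown edge cases.

```rust
/// A heading found in a markdown document (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
pub struct Heading {
    pub level: u8,
    pub title: String,
}

/// True when `line` is a setext underline made only of `ch` ('=' or '-').
fn is_underline(line: &str, ch: char) -> bool {
    let t = line.trim();
    !t.is_empty() && t.chars().all(|c| c == ch)
}

/// Number of leading lines occupied by YAML frontmatter (0 if there is none).
fn frontmatter_len(lines: &[&str]) -> usize {
    if lines.first().map(|l| l.trim_end()) != Some("---") {
        return 0;
    }
    // Look for the closing fence; without one, assume there is no frontmatter.
    for (i, l) in lines.iter().enumerate().skip(1) {
        if l.trim_end() == "---" {
            return i + 1;
        }
    }
    0
}

/// Collect ATX and setext headings, skipping YAML frontmatter so that its
/// closing `---` is never read as an H2 underline.
pub fn find_headings(text: &str) -> Vec<Heading> {
    let lines: Vec<&str> = text.lines().collect();
    let mut out = Vec::new();
    let mut i = frontmatter_len(&lines);
    while i < lines.len() {
        let line = lines[i].trim();
        // ATX: one to six '#' characters followed by a space.
        let hashes = line.chars().take_while(|&c| c == '#').count();
        if (1..=6).contains(&hashes) && line[hashes..].starts_with(' ') {
            out.push(Heading {
                level: hashes as u8,
                title: line[hashes..].trim().to_string(),
            });
            i += 1;
            continue;
        }
        // Setext: a text line (not itself an underline) followed by an underline.
        let is_text = !line.is_empty() && !is_underline(line, '=') && !is_underline(line, '-');
        if is_text && i + 1 < lines.len() {
            let next = lines[i + 1];
            let level = if is_underline(next, '=') {
                Some(1)
            } else if is_underline(next, '-') {
                Some(2)
            } else {
                None
            };
            if let Some(level) = level {
                out.push(Heading { level, title: line.to_string() });
                i += 2;
                continue;
            }
        }
        i += 1;
    }
    out
}
```

Calling find_headings on a file that opens with a frontmatter block yields only the real headings; without the frontmatter_len skip, the last frontmatter key line followed by the closing --- would be reported as an H2.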
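
The PDF heading heuristic from the last item above might look roughly like the following. This is a sketch under stated assumptions, not the shipped code: the 80-character cutoff for "short", the set of sentence punctuation, the reading of "first" as "first line that passes the checks", and the names page_heading and looks_like_heading are all hypothetical.

```rust
/// Maximum heading length in characters (an assumed threshold for this sketch).
const MAX_HEADING_CHARS: usize = 80;

/// True when `line` looks like a heading rather than body text or a footer.
fn looks_like_heading(line: &str) -> bool {
    let t = line.trim();
    if t.is_empty() || t.chars().count() > MAX_HEADING_CHARS {
        return false;
    }
    match t.chars().last() {
        // Body sentences end in sentence punctuation.
        Some('.') | Some('!') | Some('?') => false,
        // Page-number footers such as "Page 12" or "12" end in a digit.
        Some(c) if c.is_ascii_digit() => false,
        _ => true,
    }
}

/// Return the first heading-shaped line of a page's extracted text, if any.
pub fn page_heading(page_text: &str) -> Option<String> {
    page_text
        .lines()
        .find(|l| looks_like_heading(l))
        .map(|l| l.trim().to_string())
}
```

A page whose lines all look like footers or sentences returns None, which matches the "best-effort" framing: chunks from such pages would simply carry no heading.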
Added
- DocumentConfig::code_neighbors_default: usize (default 1).
- DocumentConfig::prose_heading_default: bool (default true).
Both are inherited via the Python / Node bindings' default config; no
binding-surface change for callers who don't override.
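
As a rough illustration of how the two defaults interact with a hit, the sketch below uses a stand-in struct rather than the crate's real DocumentConfig. CitationDefaults and cited_chunks are hypothetical names, and clamping the neighbor window to the bounds of the file is an assumption about the behavior, not something stated above.

```rust
use std::ops::RangeInclusive;

/// Stand-in for the two new `DocumentConfig` fields (not the crate's real type).
#[derive(Debug, Clone, PartialEq)]
pub struct CitationDefaults {
    pub code_neighbors_default: usize,
    pub prose_heading_default: bool,
}

impl Default for CitationDefaults {
    fn default() -> Self {
        // Defaults as documented for 0.1.4.
        Self {
            code_neighbors_default: 1,
            prose_heading_default: true,
        }
    }
}

/// Chunk indices to cite for a hit at index `hit` in a file of `total` chunks.
/// Neighbors are added only for code chunks; the window is clamped to the file.
pub fn cited_chunks(
    hit: usize,
    total: usize,
    is_code: bool,
    cfg: &CitationDefaults,
) -> RangeInclusive<usize> {
    assert!(hit < total, "hit index out of range");
    let n = if is_code { cfg.code_neighbors_default } else { 0 };
    let start = hit.saturating_sub(n);
    let end = hit.saturating_add(n).min(total - 1);
    start..=end
}
```

With the default config, a code hit on the middle chunk of a three-chunk function cites all three chunks. Building the struct with code_neighbors_default: 0 and ..CitationDefaults::default() narrows the citation back to the matched chunk alone, which mirrors the opt-out described in the Changed section.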
Fixed
- (No code-bug fixes — 0.1.3 already closed the BM25 quality gaps. See
the Notes section for the verified-not-broken embedding-persistence
story.)
Notes
- Embedding persistence verified. The 0.1.3 audit suspected that
read_folder_with(persist=true) re-embedded every chunk on reload
(paying ~30-60 sec of bge-small cost per cold start on a 1000-chunk
codebase). The machinery is already correct: embedded_chunks()
populates the Chunk::embedding field from the retriever cache before
writing index.json, Embedding is Serialize/Deserialize, and
LocalRerankRetriever::index short-circuits any chunk that comes back
with an embedding already set. A round-trip test
(crates/redhop/tests/embedding_persistence.rs) now pins this: a
reload triggers exactly 1 embed call (the query), not N+1 (the query +
every chunk). Locked in as a regression guard.
- Eleven new tests across the citation-ergonomics theme: 3 for the code
neighbor default, 3 for the prose heading default, 3 for setext
headings + frontmatter handling, 2 for the PDF heading heuristic.
111/111 tests pass under
cargo test -p redhop --features files.
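
The short-circuit that the embedding-persistence note relies on can be shown with a toy model. The types below are hypothetical stand-ins, not the crate's Chunk or LocalRerankRetriever, and the serialization round trip through index.json is left out; the sketch only demonstrates why a reload with embeddings already set costs a single embed call for the query.

```rust
/// Minimal chunk with an optional cached embedding (hypothetical stand-in type).
pub struct Chunk {
    pub text: String,
    pub embedding: Option<Vec<f32>>,
}

/// Toy embedder that counts how many times it is called.
#[derive(Default)]
pub struct CountingEmbedder {
    pub calls: usize,
}

impl CountingEmbedder {
    pub fn embed(&mut self, text: &str) -> Vec<f32> {
        self.calls += 1;
        // Placeholder vector; a real model would return a dense embedding.
        vec![text.len() as f32]
    }
}

/// Embed only the chunks that do not already carry an embedding.
pub fn index(chunks: &mut [Chunk], embedder: &mut CountingEmbedder) {
    for chunk in chunks.iter_mut() {
        if chunk.embedding.is_none() {
            chunk.embedding = Some(embedder.embed(&chunk.text));
        }
    }
}
```

Indexing N fresh chunks calls embed N times. Indexing the same chunks again with a new embedder calls it zero times, so the only call on a warm start is the one for the query.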