feat(search): normalize semantic query (NFKC + casefold) #2937
lg320531124 wants to merge 1 commit
Semantic find/search queries are passed verbatim to the embedder and retriever, so colloquial or CJK full/half-width variants of the same intent do not match indexed content. Examples from real usage:

- "ＯｐｅｎＶｉｋｉｎｇ" (full-width) vs "OpenViking"
- "OpenVAKING" (mis-spelled case) vs "openvaking"
- "Harms  agent" (double space) vs "hermes agent"

Add _normalize_search_query (NFKC + casefold + whitespace collapse) at the VikingFS.find/search entry points. NFKC folds full/half-width forms to their canonical form; casefold handles Unicode case (stronger than str.lower()); whitespace collapse + strip removes accidental gaps. The helper is idempotent and only widens recall: already-normalized ASCII is unchanged.

Applied only to the semantic path. grep is intentionally left alone: its pattern is a regular expression, and NFKC/casefold would corrupt explicit character classes (e.g. [A-Z]); the existing case_insensitive flag covers case folding there.

Applied at the viking_fs layer (not SearchService) because session/memory/tools.py calls viking_fs.search directly, bypassing SearchService.

Tests: tests/unit/test_search_query_normalization.py (7 cases, all pass) cover CJK full-width, casefold, whitespace, empty/None passthrough, and idempotency.
CI note: the failure is identical across all currently open PRs (#2874, #2936, #2937, #2938) and reproduces on #2874, which predates my other work.

This PR is pure Python. Evidence that maintainers are not blocking on this flake: #2934 was merged today (2026-07-02 01:07 UTC) with the same failure. Local unit tests for this PR pass (see PR body). Happy to help root-cause the maturin/upstream flake separately if useful; just let me know.
Problem
Semantic `find`/`search` queries are passed verbatim to the embedder and retriever, so colloquial or CJK full/half-width variants of the same intent do not match indexed content. Real examples from daily usage:

- `ＯｐｅｎＶｉｋｉｎｇ` (full-width) vs `OpenViking`
- `OpenVAKING` (case variant of a typo) vs `openvaking`
- `Harms  agent` (double space) vs `hermes agent`

`search_service.py` only does `query.strip()`; the hierarchical retriever forwards the query verbatim. The embedder maps `"ＯｐｅｎＶｉｋｉｎｇ"` and `"OpenViking"` to different vectors, so recall silently drops on CJK-keyboard and casual input.

Change
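Add `_normalize_search_query(query)` (NFKC + casefold + whitespace collapse + strip) and apply it at the two semantic entry points in `viking_fs.py`:

- `VikingFS.find` (after `_ensure_non_empty_search_query`)
- `VikingFS.search` (after `_ensure_non_empty_search_query`)

What each step does:

- NFKC folds full/half-width forms to their canonical form.
- casefold handles Unicode case (stronger than `str.lower()`, e.g. `ß` → `ss`).
- Whitespace collapse + strip removes accidental gaps.

Why this layer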
Applied at the `viking_fs` layer (not `SearchService`) because `session/memory/tools.py` calls `viking_fs.search` directly, bypassing `SearchService`. Fixing it at the lowest shared entry point covers all callers.

Why grep is left alone
`grep` takes a regex pattern. NFKC + casefold would corrupt explicit character classes (e.g. `[A-Z]` and `[Ａ-Ｚ]` collapse differently), and `grep` already exposes a `case_insensitive` flag for case folding. Normalizing there would be a behavior change, not a recall fix.

Tests
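`tests/unit/test_search_query_normalization.py`: 7 cases, all pass. They cover:

- CJK full-width (`ＯｐｅｎＶｉｋｉｎｇ` → `openviking`)
- Casefold (`OpenViking`, `OpenVAKING`)
- Whitespace collapse (`Harms  agent` → `harms agent`)
- Leading/trailing whitespace (`\tquery\n`)
- Empty / `None` passthrough
- Idempotency

These are pure-function unit tests (no vectordb/embedder fixture): the helper is side-effect-free, so integration tests would pay the full storage init cost for a string transform.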
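
For reviewers who want to try the transform without checking out the branch, here is a minimal sketch of the normalization and the cases above. It is an assumption-based stand-in, not the code in `viking_fs.py`: the name `normalize_query_sketch`, the regex used for whitespace collapse, and the exact passthrough check are all hypothetical. The last three assertions illustrate why the same transform is not applied to `grep` patterns.

```python
import re
import unicodedata
from typing import Optional

# Matches any run of (Unicode) whitespace.
_WHITESPACE_RUN = re.compile(r"\s+")


def normalize_query_sketch(query: Optional[str]) -> Optional[str]:
    """Hypothetical stand-in for the PR's _normalize_search_query helper."""
    if not query:
        # Empty string and None pass through untouched.
        return query
    # NFKC folds full-width forms to their ASCII counterparts.
    folded = unicodedata.normalize("NFKC", query)
    # casefold is a stronger lower(): for example, 'ß' becomes 'ss'.
    folded = folded.casefold()
    # Collapse whitespace runs to a single space, then trim both ends.
    return _WHITESPACE_RUN.sub(" ", folded).strip()


# "OpenViking" written with full-width letters (U+FF21.. / U+FF41.. block).
FULLWIDTH_OPENVIKING = "\uff2f\uff50\uff45\uff4e\uff36\uff49\uff4b\uff49\uff4e\uff47"

# Cases mirroring the test list.
assert normalize_query_sketch(FULLWIDTH_OPENVIKING) == "openviking"
assert normalize_query_sketch("OpenVAKING") == "openvaking"
assert normalize_query_sketch("Harms  agent") == "harms agent"
assert normalize_query_sketch("\tquery\n") == "query"
assert normalize_query_sketch("") == ""
assert normalize_query_sketch(None) is None
assert normalize_query_sketch("Straße") == "strasse"

# Idempotency on a sample input: a second pass changes nothing.
once = normalize_query_sketch(FULLWIDTH_OPENVIKING + "  Agent ")
assert normalize_query_sketch(once) == once

# Applied to a regex, the transform rewrites the character class,
# so a pattern that matched uppercase letters no longer does.
assert normalize_query_sketch("[A-Z]") == "[a-z]"
assert re.fullmatch("[A-Z]", "Q") is not None
assert re.fullmatch(normalize_query_sketch("[A-Z]"), "Q") is None
```

The idempotency check here covers only the sample input; it does not show that NFKC followed by casefold is stable for every Unicode string.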
Scope
+22 lines in `viking_fs.py`, 1 new test file. No public API change, no config knob, no behavior change for already-normalized queries.

Related
Continues the search-recall robustness line of #2874 (tool-input compaction) and #2900 (grep empty-result fallback). Does not overlap either.