feat(search): normalize semantic query (NFKC + casefold)#2937

Open
lg320531124 wants to merge 1 commit into
volcengine:mainfrom
lg320531124:feat/search-query-normalization


@lg320531124

Contributor

Problem

Semantic find/search queries are passed verbatim to the embedder and retriever, so colloquial or CJK full/half-width variants of the same intent do not match indexed content. Real examples from daily usage:

| Query typed | Indexed as | Match? |
|---|---|---|
| `ＯｐｅｎＶｉｋｉｎｇ` (full-width) | `OpenViking` | no |
| `OpenVAKING` (case variant of a typo) | `openvaking` | no |
| `Harms  agent` (double space) | `hermes agent` | no |

search_service.py only does query.strip(); the hierarchical retriever forwards the query verbatim. The embedder tokenizes "ＯｐｅｎＶｉｋｉｎｇ" and "OpenViking" to different vectors, so recall silently drops on CJK-keyboard and casual input.
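A minimal sketch of the mismatch, using only the standard library. The full-width string is built here for illustration (full-width ASCII variants sit at a fixed offset of 0xFEE0 from their ASCII counterparts); it is not taken from a real index.

```python
import unicodedata

ascii_form = "OpenViking"
# Build the full-width variant a CJK keyboard would produce (illustrative input).
fullwidth = "".join(chr(ord(c) + 0xFEE0) for c in ascii_form)

# The two strings share no code points, so a verbatim comparison fails.
print(fullwidth == ascii_form)                                 # False
print([hex(ord(c)) for c in fullwidth[:2]])                    # ['0xff2f', '0xff50']

# NFKC maps each full-width letter back to its ASCII compatibility equivalent.
print(unicodedata.normalize("NFKC", fullwidth) == ascii_form)  # True
```

How far apart the two forms land in embedding space depends on the embedder's tokenizer, which this sketch does not model.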

Change

Add _normalize_search_query(query) (NFKC + casefold + whitespace collapse + strip) and apply it at the two semantic entry points in viking_fs.py:

  • VikingFS.find (after _ensure_non_empty_search_query)
  • VikingFS.search (after _ensure_non_empty_search_query)

```python
import re
import unicodedata

def _normalize_search_query(query: str) -> str:
    # Empty string or None passes through unchanged.
    if not query:
        return query
    return re.sub(r"\s+", " ", unicodedata.normalize("NFKC", query).casefold()).strip()
```

  • NFKC folds full/half-width Latin/digits to canonical ASCII
  • casefold handles Unicode case folding (stronger than str.lower(), e.g. ß → ss)
  • whitespace collapse + strip removes accidental double spaces / leading-trailing gaps
  • Idempotent and only widens recall — already-normalized ASCII is byte-for-byte unchanged

Why this layer

Applied at the viking_fs layer (not SearchService) because session/memory/tools.py calls viking_fs.search directly, bypassing SearchService. Fixing it at the lowest shared entry point covers all callers.
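
A sketch of the layering argument. `FakeVikingFS`, `FakeSearchService`, and `fake_memory_tool` are hypothetical stand-ins written for this illustration; they do not reproduce the real classes or their signatures.

```python
import re
import unicodedata

def _normalize_search_query(query: str) -> str:
    # Helper as given in the PR description.
    if not query:
        return query
    return re.sub(r"\s+", " ", unicodedata.normalize("NFKC", query).casefold()).strip()

class FakeVikingFS:
    """Hypothetical stand-in for VikingFS; records the query it would embed."""

    def __init__(self):
        self.seen = []

    def search(self, query: str):
        query = _normalize_search_query(query)  # lowest shared entry point
        self.seen.append(query)
        return []

class FakeSearchService:
    """Hypothetical stand-in for SearchService, which delegates to the FS layer."""

    def __init__(self, fs):
        self.fs = fs

    def search(self, query: str):
        return self.fs.search(query.strip())

def fake_memory_tool(fs, query):
    """Hypothetical caller that bypasses the service layer."""
    return fs.search(query)

fs = FakeVikingFS()
FakeSearchService(fs).search("  OpenViking ")
fake_memory_tool(fs, "OPENVIKING")
print(fs.seen)  # ['openviking', 'openviking']
```

Both call paths reach the embedder with the same string, which would not hold if the normalization lived only in the service layer.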

Why grep is left alone

grep takes a regex pattern. NFKC + casefold would corrupt explicit character classes (e.g. [A-Z] and [Ａ-Ｚ] collapse differently), and grep already exposes a case_insensitive flag for case folding. Normalizing there would be a behavior change, not a recall fix.
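
A small sketch of why normalizing a pattern is unsafe. It uses Python's `re` module and `re.IGNORECASE` as a stand-in for grep's case_insensitive flag; how grep actually implements that flag is an assumption here.

```python
import re
import unicodedata

pattern = "[A-Z]+"  # matches runs of uppercase ASCII letters

# Applying the query normalization to the pattern rewrites the class itself.
folded = unicodedata.normalize("NFKC", pattern).casefold()
print(folded)                                     # [a-z]+

text = "id ABC"
print(re.findall(pattern, text))                  # ['ABC']
print(re.findall(folded, text))                   # ['id']

# A case-insensitive flag leaves the pattern intact and widens it explicitly.
print(re.findall(pattern, text, re.IGNORECASE))   # ['id', 'ABC']
```

The folded pattern no longer matches what the user asked for, so the result set changes rather than widens.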

Tests

tests/unit/test_search_query_normalization.py — 7 cases, all pass:

  • CJK full-width → ASCII (ＯｐｅｎＶｉｋｉｎｇ → openviking)
  • mixed-case casefold (OpenViking, OpenVAKING)
  • whitespace collapse (Harms  agent → harms agent)
  • leading/trailing strip (\tquery\n)
  • empty + None passthrough
  • idempotency
======================== 7 passed, 4 warnings in 2.28s =========================

Pure-function unit tests (no vectordb/embedder fixture) — the helper is side-effect-free, so integration tests would pay full storage init cost for a string transform.

Scope

+22 lines in viking_fs.py, 1 new test file. No public API change, no config knob, no behavior change for already-normalized queries.

Related

Continues the search-recall robustness line of #2874 (tool-input compaction) and #2900 (grep empty-result fallback). Does not overlap either.

Semantic find/search queries are passed verbatim to the embedder and
retriever, so colloquial or CJK full/half-width variants of the same
intent do not match indexed content. Examples from real usage:
  - "ＯｐｅｎＶｉｋｉｎｇ" (full-width) vs "OpenViking"
  - "OpenVAKING" (mis-spelled case) vs "openvaking"
  - "Harms  agent" (double space) vs "hermes agent"

Add _normalize_search_query (NFKC + casefold + whitespace collapse) at
the VikingFS.find/search entry points. NFKC folds full/half-width forms
to canonical; casefold handles Unicode case (stronger than str.lower());
whitespace collapse + strip removes accidental gaps. Idempotent and
only widens recall — already-normalized ASCII is unchanged.

Applied only to the semantic path. grep is intentionally left alone: its
pattern is a regular expression, and NFKC/casefold would corrupt explicit
character classes (e.g. [A-Z]); the existing case_insensitive flag covers
case folding there.

Applied at the viking_fs layer (not SearchService) because session/
memory/tools.py calls viking_fs.search directly, bypassing SearchService.

Tests: tests/unit/test_search_query_normalization.py (7 cases, all pass)
cover CJK full-width, casefold, whitespace, empty/None passthrough, and
idempotency.
@lg320531124

Contributor Author

CI note: the API & CLI Integration Tests job is red, but this is a pre-existing CI-infra flake, not introduced by this PR.

The failure is identical across all currently-open PRs (#2874, #2936, #2937, #2938) and reproduces on #2874 which predates my other work:

  1. Build ragfs-python native extension → ❌ ERROR: maturin build produced no wheel (the Rust→Python binding fails to build a wheel in CI)
  2. test_cli_search.py::TestSearchGrep::test_grep_basic / test_grep_case_insensitive → ERROR at the add-resource setup step (AssertionError: add-resource failed after retries)
  3. find/search tests → SKIPPED (Upstream ... marker — the upstream VikingDB service is unreachable from the runner)

This PR is pure-Python: it touches only viking_fs.py and adds one unit-test file. It does not touch crates/, maturin, the ov CLI binary, or any CI workflow, so it cannot be the cause of a Rust-binding build failure or an upstream-service reachability issue.

Evidence that maintainers are not blocking on this flake: #2934 was merged today (2026-07-02 01:07 UTC) with the same API & CLI Integration Tests → FAILURE status, while check-deps (SUCCESS) and the main-branch 02. Main Branch Checks (success) both stay green.

Local unit tests for this PR pass (see PR body). Happy to help root-cause the maturin/upstream flake separately if useful — just let me know.

Labels

None yet

Projects

Status: Backlog
