feat(search): normalize semantic query (NFKC + casefold) by lg320531124 · Pull Request #2937 · volcengine/OpenViking

lg320531124 · 2026-07-01T23:57:25Z

Problem

Semantic find/search queries are passed verbatim to the embedder and retriever, so colloquial or CJK full/half-width variants of the same intent do not match indexed content. Real examples from daily usage:

| Query typed | Indexed as | Match? |
| --- | --- | --- |
| `ＯｐｅｎＶｉｋｉｎｇ` (full-width) | `OpenViking` | ❌ |
| `OpenVAKING` (case variant of a typo) | `openvaking` | ❌ |
| `Harms  agent` (double space) | `hermes agent` | ❌ |

search_service.py only calls query.strip(), and the hierarchical retriever forwards the query verbatim. The embedder maps "ＯｐｅｎＶｉｋｉｎｇ" and "OpenViking" to different vectors, so recall silently drops for CJK-keyboard and casual input.
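To see why the two spellings diverge before any embedding happens, the sketch below compares them at the code-point level. It uses only the standard library and assumes nothing about the embedder or tokenizer OpenViking actually uses.

```python
import unicodedata

full_width = "ＯｐｅｎＶｉｋｉｎｇ"  # as typed with a CJK input method
ascii_form = "OpenViking"

# Same length, but the strings are not equal.
print(len(full_width) == len(ascii_form))  # True
print(full_width == ascii_form)            # False

# The first character of each string, by Unicode name.
print(unicodedata.name(full_width[0]))  # FULLWIDTH LATIN CAPITAL LETTER O
print(unicodedata.name(ascii_form[0]))  # LATIN CAPITAL LETTER O

# Full-width letters sit at a fixed offset (0xFEE0) from their ASCII counterparts.
offsets = {ord(f) - ord(a) for f, a in zip(full_width, ascii_form)}
print(offsets == {0xFEE0})  # True
```

Any component that keys on raw code points (a tokenizer vocabulary, for instance) would therefore treat the two inputs as unrelated strings.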

Change

Add _normalize_search_query(query) (NFKC + casefold + whitespace collapse + strip) and apply it at the two semantic entry points in viking_fs.py:

VikingFS.find (after _ensure_non_empty_search_query)
VikingFS.search (after _ensure_non_empty_search_query)

import re
import unicodedata

def _normalize_search_query(query: str) -> str:
    # Empty or None input passes through unchanged.
    if not query:
        return query
    return re.sub(r"\s+", " ", unicodedata.normalize("NFKC", query).casefold()).strip()

NFKC folds full/half-width Latin/digits to canonical ASCII
casefold handles Unicode case folding (stronger than str.lower(), e.g. ß → ss)
whitespace collapse + strip removes accidental double spaces / leading-trailing gaps
Idempotent and only widens recall — already-normalized ASCII is byte-for-byte unchanged
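The four steps listed above can be traced one at a time. The following is a minimal sketch, not code from the PR: `normalize_stages` is a hypothetical helper that applies the same operations in the same order and returns every intermediate string.

```python
import re
import unicodedata
from typing import List


def normalize_stages(query: str) -> List[str]:
    """Return the query after each normalization stage, in order."""
    nfkc = unicodedata.normalize("NFKC", query)  # full-width forms -> ASCII
    folded = nfkc.casefold()                     # Unicode-aware case folding
    collapsed = re.sub(r"\s+", " ", folded)      # whitespace runs -> one space
    stripped = collapsed.strip()                 # drop leading/trailing space
    return [nfkc, folded, collapsed, stripped]


# Input mixes a tab, full-width letters, a double space,
# an ideographic space (U+3000), a sharp s, and a newline.
stages = normalize_stages("\tＯｐｅｎ  Ｖｉｋｉｎｇ\u3000Straße\n")
for label, value in zip(("NFKC", "casefold", "collapse", "strip"), stages):
    print(f"{label:>8}: {value!r}")

print(stages[-1])  # open viking strasse
```

The intermediate values show which stage is responsible for which change: NFKC alone leaves the case and the stray whitespace in place, and casefold alone would leave the full-width letters untouched.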

Why this layer

Applied at the viking_fs layer (not SearchService) because session/memory/tools.py calls viking_fs.search directly, bypassing SearchService. Fixing it at the lowest shared entry point covers all callers.
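The layering argument can be illustrated with stand-ins. Everything named `Sketch*` or `sketch_*` below is hypothetical and exists only for this illustration; the real VikingFS, SearchService, and memory tools have their own signatures, which this sketch does not attempt to reproduce.

```python
import re
import unicodedata
from typing import List


def _normalize_search_query(query: str) -> str:
    # Same helper as shown in the PR description.
    if not query:
        return query
    return re.sub(r"\s+", " ", unicodedata.normalize("NFKC", query).casefold()).strip()


class SketchFS:
    """Hypothetical stand-in for the shared entry point (not the real VikingFS)."""

    def __init__(self) -> None:
        self.received: List[str] = []  # queries as a retriever would see them

    def search(self, query: str) -> str:
        query = _normalize_search_query(query)  # normalization lives here
        self.received.append(query)
        return query


class SketchSearchService:
    """Hypothetical higher-level caller that only strips the query."""

    def __init__(self, fs: SketchFS) -> None:
        self.fs = fs

    def search(self, query: str) -> str:
        return self.fs.search(query.strip())


def sketch_memory_tool(fs: SketchFS, query: str) -> str:
    """Hypothetical caller that bypasses the service and calls the FS directly."""
    return fs.search(query)


fs = SketchFS()
via_service = SketchSearchService(fs).search("  ＯｐｅｎＶｉｋｉｎｇ ")
via_tool = sketch_memory_tool(fs, "OPENVIKING")
print(via_service, via_tool)  # openviking openviking
```

Had the normalization been placed in the service class instead, the second call path would have reached the FS with the raw query.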

Why grep is left alone

grep takes a regex pattern. NFKC + casefold would corrupt explicit character classes (e.g. [A-Z], [Ａ-Ｚ] collapse differently), and grep already exposes a case_insensitive flag for case folding. Normalizing there would be a behavior change, not a recall fix.
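A short demonstration of the corruption described above, using Python's `re` module as a stand-in for whatever regex engine grep uses in OpenViking (an assumption). The `normalize` function applies the same transform as the PR's helper.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    # Same transform as the semantic-query helper in this PR.
    return re.sub(r"\s+", " ", unicodedata.normalize("NFKC", text).casefold()).strip()


# An explicit uppercase class followed by a non-whitespace escape.
pattern = r"[A-Z]+\S*"
mangled = normalize(pattern)
print(mangled)  # [a-z]+\s*

# casefold flips the class to lowercase AND turns \S (non-space) into \s (space).
original_hit = re.fullmatch(pattern, "API_v2")
mangled_hit = re.fullmatch(mangled, "API_v2")
print(original_hit is not None)  # True
print(mangled_hit is not None)   # False

# A deliberately full-width class is rewritten to an ASCII lowercase class.
wide_pattern = "[Ａ-Ｚ]"
wide_mangled = normalize(wide_pattern)
print(wide_mangled)  # [a-z]
print(re.fullmatch(wide_pattern, "Ｑ") is not None)  # True
print(re.fullmatch(wide_mangled, "Ｑ") is not None)  # False
```

The escape-sequence case is the more dangerous one: `\S`, `\D`, `\W`, and `\B` all invert their meaning when lowercased, so the pattern stays valid but matches something entirely different.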

Tests

tests/unit/test_search_query_normalization.py — 7 cases, all pass:

CJK full-width → ASCII (ＯｐｅｎＶｉｋｉｎｇ → openviking)
mixed-case casefold (OpenViking, OpenVAKING)
whitespace collapse (Harms  agent → harms agent)
leading/trailing strip (\tquery\n)
empty + None passthrough
idempotency

======================== 7 passed, 4 warnings in 2.28s =========================

Pure-function unit tests (no vectordb/embedder fixture) — the helper is side-effect-free, so integration tests would pay full storage init cost for a string transform.

Scope

+22 lines in viking_fs.py, 1 new test file. No public API change, no config knob, no behavior change for already-normalized queries.

Semantic find/search queries are passed verbatim to the embedder and retriever, so colloquial or CJK full/half-width variants of the same intent do not match indexed content. Examples from real usage:

- "ＯｐｅｎＶｉｋｉｎｇ" (full-width) vs "OpenViking"
- "OpenVAKING" (mis-spelled case) vs "openvaking"
- "Harms  agent" (double space) vs "hermes agent"

Add _normalize_search_query (NFKC + casefold + whitespace collapse) at the VikingFS.find/search entry points. NFKC folds full/half-width forms to canonical; casefold handles Unicode case (stronger than str.lower()); whitespace collapse + strip removes accidental gaps. Idempotent and only widens recall: already-normalized ASCII is unchanged.

Applied only to the semantic path. grep is intentionally left alone: its pattern is a regular expression, and NFKC/casefold would corrupt explicit character classes (e.g. [A-Z]); the existing case_insensitive flag covers case folding there.

Applied at the viking_fs layer (not SearchService) because session/memory/tools.py calls viking_fs.search directly, bypassing SearchService.

Tests: tests/unit/test_search_query_normalization.py (7 cases, all pass) cover CJK full-width, casefold, whitespace, empty/None passthrough, and idempotency.

lg320531124 · 2026-07-02T01:35:28Z

CI note: the API & CLI Integration Tests job is red, but this is a pre-existing CI-infra flake, not introduced by this PR.

The failure is identical across all currently open PRs (#2874, #2936, #2937, #2938) and reproduces on #2874, which predates my other work:

Build ragfs-python native extension → ❌ ERROR: maturin build produced no wheel (the Rust→Python binding fails to build a wheel in CI)
test_cli_search.py::TestSearchGrep::test_grep_basic / test_grep_case_insensitive → ERROR at the add-resource setup step (AssertionError: add-resource failed after retries)
find/search tests → SKIPPED (Upstream ... marker — the upstream VikingDB service is unreachable from the runner)

This PR is pure Python: as listed under Scope in the PR body, it changes only viking_fs.py and adds one unit-test file. It does not touch crates/, maturin, the ov CLI binary, or any CI workflow, so it cannot be the cause of a Rust-binding build failure or an upstream-service reachability issue.

Evidence that maintainers are not blocking on this flake: #2934 was merged today (2026-07-02 01:07 UTC) with the same API & CLI Integration Tests → FAILURE status, while check-deps → SUCCESS and the main-branch "02. Main Branch Checks" → success both stay green.

Local unit tests for this PR pass (see PR body). Happy to help root-cause the maturin/upstream flake separately if useful — just let me know.

github-project-automation Bot added this to OpenViking project Jul 1, 2026

github-project-automation Bot moved this to Backlog in OpenViking project Jul 1, 2026

lg320531124 mentioned this pull request Jul 2, 2026

feat(search): grep fallback when semantic recall is empty #2938

Open

This was referenced Jul 2, 2026

fix(session): wire auto-commit threshold into add_messages #2936

Open

feat(memory): opt-in secret-scrub gate before persisting curated memories (#2899) #2941

Open

feat(search): normalize semantic query (NFKC + casefold) #2937
lg320531124 wants to merge 1 commit into volcengine:main from lg320531124:feat/search-query-normalization
