fix(grep): fall back to fs on empty VikingDB recall and query timeout (#2850) #2900
lg320531124 wants to merge 2 commits into
…volcengine#2850)

VikingFS._grep_vikingdb_then_fs treated an empty BM25 candidate list as a definitive 'no matching content', and an indefinitely hanging remote query could stall grep. Both are symptoms of VikingDB unreliability on large/stale corpora (volcengine#2850), not proof of absence.

- Wrap search_by_keywords in asyncio.wait_for (default 10s, configurable via OPENVIKING_GREP_VIKINGDB_TIMEOUT_SEC); on TimeoutError fall back to _grep_fs.
- On empty candidate recall, retry via _grep_fs instead of returning an empty result, so a VikingDB gap no longer masquerades as 'no matches'.

Tests: two new cases (empty-recall fallback, timeout fallback) + updated exclude_uri test to stub _grep_fs. 17/17 pass; ruff clean.

Signed-off-by: lg320531124 <lg320531124@users.noreply.github.com>
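The configurable timeout mentioned in the commit message could be resolved from the environment roughly as follows. This is a minimal sketch, not the PR's code: the helper name `resolve_vikingdb_timeout` is hypothetical, and falling back to the default on an unparsable or non-positive value is an assumption made here for illustration. Only the variable name and the 10 s default come from the commit message.

```python
import os
from typing import Mapping

# Name and default taken from the commit message.
ENV_VAR = "OPENVIKING_GREP_VIKINGDB_TIMEOUT_SEC"
DEFAULT_TIMEOUT_SEC = 10.0


def resolve_vikingdb_timeout(environ: Mapping[str, str] = os.environ) -> float:
    """Hypothetical helper: return the timeout in seconds for the remote query.

    Assumption: invalid or non-positive values fall back to the default
    rather than raising, so a bad setting never disables the timeout.
    """
    raw = environ.get(ENV_VAR)
    if raw is None:
        return DEFAULT_TIMEOUT_SEC
    try:
        value = float(raw)
    except ValueError:
        return DEFAULT_TIMEOUT_SEC
    return value if value > 0 else DEFAULT_TIMEOUT_SEC
```

Passing the environment as a parameter keeps the helper easy to exercise in tests without mutating `os.environ`.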
Following up on the

What's failing
The CLI Compatibility Tests finish with 11

Why I don't think it's this PR
Root cause (source-level)
return await OwnedLockLease.acquire_tree(lock_manager, path, timeout=timeout)
...
raise ResourceBusyError(f"Resource is busy: {uri or path}", ..., retryable=True)

The fixture's

The workflow itself acknowledges the conflict surface; the test step echoes

Cross-check: same failure on an unrelated PR
My sibling PR #2874 (

Ask
I don't have admin rights to rerun failed jobs from a fork. Could someone with access rerun the

(No code changes from my side; happy to act on anything if the failure does point back at the branch.)
Opened #2916 to track the
What & Why
Re-fixes #2850, replacing the rejected #2854 (which sat at the MCP layer). Per @qin-ctx's review there, the fix belongs inside VikingFS._grep_vikingdb_then_fs(), not wrapped around the MCP endpoint, and must cover the empty-result case, not just exceptions/timeouts.

#2850 has two failure modes, both rooted in VikingDB unreliability on large/stale corpora:
- VikingDB returns [] even though matching content exists (index lag, silent timeout). The old code treated an empty candidate list as a definitive "no matching content" and short-circuited with an empty result. The user sees "no matches" when matches exist.
- search_by_keywords can hang indefinitely, stalling grep with no fallback.

Changes (storage layer only: openviking/storage/viking_fs.py)
_grep_vikingdb_then_fs() now:
- When candidate_uris is empty, retry via _grep_fs(...) instead of returning an empty dict. This is the case #2854 missed and @qin-ctx explicitly called out. A warning is logged so the gap is observable, not silent.
- search_by_keywords(...) is wrapped in asyncio.wait_for(..., timeout=vikingdb_timeout). On asyncio.TimeoutError, fall back to _grep_fs(...), the same treatment as a raised exception. Default 10 s, configurable via OPENVIKING_GREP_VIKINGDB_TIMEOUT_SEC (no GrepConfig schema change needed, since GrepConfig is extra=forbid).
- The existing except Exception branch is untouched; timeout is a sibling except, not a replacement.

This matches the three points in @qin-ctx's review: converge on _grep_vikingdb_then_fs(), keep the existing exception fallback, add empty-result fallback at the candidate_uris empty site, and wrap only search_by_keywords (not the whole grep) if a timeout is wanted.

Why not the MCP layer
The MCP endpoint already delegates to service.fs.grep(...); wrapping it again there (as #2854 did) just re-calls the same path without forcing fs, and can't see VikingDB's empty recall at all. The storage layer is where both failure modes are visible.

Tests (tests/storage/test_viking_fs_grep.py)
- test_grep_vikingdb_empty_recall_falls_back_to_fs: _DummyVectorStore returns []; asserts _grep_fs is called and its result surfaces (not the empty VikingDB hit).
- test_grep_vikingdb_timeout_falls_back_to_fs: _SlowVectorStore (30 s sleep) + OPENVIKING_GREP_VIKINGDB_TIMEOUT_SEC=0.1; asserts fs fallback result surfaces.
- test_grep_vikingdb_pushes_exclude_uri_to_filter: updated to stub _grep_fs returning empty, since empty recall now falls through to fs (keeps the exclude_uri filter assertion focused on the remote query path).

Scope
- openviking/storage/viking_fs.py (+59/-7)
- tests/storage/test_viking_fs_grep.py (+92)

No uv.lock churn, no MCP changes, no GrepConfig schema change. Default behavior is unchanged when VikingDB returns non-empty results promptly.

Closes #2850. Supersedes #2854.
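For readers who want the control flow at a glance, the fallback behavior described in this PR can be sketched as below. This is an illustrative stand-in, not the code in openviking/storage/viking_fs.py: the function signature, the returned dict shape, the stub stores, and the sample document names are all hypothetical. The real method goes on to grep the candidate files, which this sketch skips by returning the candidates directly.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


async def grep_vikingdb_then_fs(search_by_keywords, grep_fs, pattern, timeout=10.0):
    """Hypothetical stand-in for the fallback logic described in the PR."""
    try:
        # Only the remote query is bounded by the timeout, not the whole grep.
        candidate_uris = await asyncio.wait_for(search_by_keywords(pattern), timeout=timeout)
    except asyncio.TimeoutError:
        logger.warning("VikingDB query timed out after %ss; falling back to fs", timeout)
        return await grep_fs(pattern)
    except Exception:
        # Pre-existing behavior: any raised error falls back to fs.
        logger.warning("VikingDB query failed; falling back to fs", exc_info=True)
        return await grep_fs(pattern)

    if not candidate_uris:
        # Empty recall is treated as a possible index gap, not proof of absence.
        logger.warning("VikingDB returned no candidates; retrying via fs")
        return await grep_fs(pattern)

    return {"source": "vikingdb", "matches": list(candidate_uris)}


# Hypothetical stubs mirroring the scenarios covered by the tests.
async def empty_store(pattern):
    return []


async def slow_store(pattern):
    await asyncio.sleep(30)
    return ["doc-a"]


async def failing_store(pattern):
    raise RuntimeError("remote error")


async def healthy_store(pattern):
    return ["doc-a", "doc-b"]


async def fs_scan(pattern):
    return {"source": "fs", "matches": ["doc-c"]}


if __name__ == "__main__":
    for store in (empty_store, slow_store, failing_store, healthy_store):
        result = asyncio.run(grep_vikingdb_then_fs(store, fs_scan, "todo", timeout=0.05))
        print(store.__name__, "->", result["source"])
```

Placing the `asyncio.TimeoutError` handler before the generic `except Exception` keeps the two fallbacks distinguishable in logs, even though both end in the same filesystem scan.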