fix(grep): fall back to fs on empty VikingDB recall and query timeout (#2850) by lg320531124 · Pull Request #2900 · volcengine/OpenViking

lg320531124 · 2026-06-30T03:45:31Z

What & Why

Re-fixes #2850, replacing the rejected #2854 (which sat at the MCP layer). Per @qin-ctx's review there, the fix belongs inside VikingFS._grep_vikingdb_then_fs() — not wrapped around the MCP endpoint — and must cover the empty-result case, not just exceptions/timeouts.

#2850 has two failure modes, both rooted in VikingDB unreliability on large/stale corpora:

Silent empty recall. BM25 returns [] even though matching content exists (index lag, silent timeout). The old code treated an empty candidate list as a definitive "no matching content" and short-circuited with an empty result. The user sees "no matches" when matches exist.
Hung query. search_by_keywords can hang indefinitely, stalling grep with no fallback.

Changes (storage layer only — `openviking/storage/viking_fs.py`)

_grep_vikingdb_then_fs() now:

Empty recall → fs fallback. When candidate_uris is empty, retry via _grep_fs(...) instead of returning an empty dict. This is the case fix(mcp): add grep timeout, error logging, and fs fallback on VikingDB failure #2854 missed and @qin-ctx explicitly called out. A warning is logged so the gap is observable, not silent.
Bounded VikingDB query. search_by_keywords(...) is wrapped in asyncio.wait_for(..., timeout=vikingdb_timeout). On asyncio.TimeoutError, fall back to _grep_fs(...) — same treatment as a raised exception. Default 10 s, configurable via OPENVIKING_GREP_VIKINGDB_TIMEOUT_SEC (no GrepConfig schema change needed, since GrepConfig is extra=forbid).
Existing exception fallback preserved. The except Exception branch is untouched; timeout is a sibling except, not a replacement.

This matches the three points in @qin-ctx's review: converge on _grep_vikingdb_then_fs(), keep the existing exception fallback, add empty-result fallback at the candidate_uris empty site, and wrap only search_by_keywords (not the whole grep) if a timeout is wanted.

Why not the MCP layer

The MCP endpoint already delegates to service.fs.grep(...); wrapping it again there (as #2854 did) just re-calls the same path without forcing fs, and can't see VikingDB's empty recall at all. The storage layer is where both failure modes are visible.

Tests (`tests/storage/test_viking_fs_grep.py`)

test_grep_vikingdb_empty_recall_falls_back_to_fs — _DummyVectorStore returns []; asserts _grep_fs is called and its result surfaces (not the empty VikingDB hit).
test_grep_vikingdb_timeout_falls_back_to_fs — _SlowVectorStore (30 s sleep) + OPENVIKING_GREP_VIKINGDB_TIMEOUT_SEC=0.1; asserts fs fallback result surfaces.
test_grep_vikingdb_pushes_exclude_uri_to_filter — updated to stub _grep_fs returning empty, since empty recall now falls through to fs (keeps the exclude_uri filter assertion focused on the remote query path).
Existing exception-fallback and DFS/concurrency tests unchanged.

$ uv run --no-sync python -m pytest tests/storage/test_viking_fs_grep.py -q --no-cov
17 passed in 2.18s

$ ruff check openviking/storage/viking_fs.py tests/storage/test_viking_fs_grep.py
All checks passed!

Scope

openviking/storage/viking_fs.py (+59/-7)
tests/storage/test_viking_fs_grep.py (+92)

No uv.lock churn, no MCP changes, no GrepConfig schema change. Default behavior is unchanged when VikingDB returns non-empty results promptly.

Closes #2850. Supersedes #2854.

…volcengine#2850) VikingFS._grep_vikingdb_then_fs treated an empty BM25 candidate list as a definitive 'no matching content', and an indefinitely hanging remote query could stall grep. Both are symptoms of VikingDB unreliability on large/stale corpora (volcengine#2850), not proof of absence. - Wrap search_by_keywords in asyncio.wait_for (default 10s, configurable via OPENVIKING_GREP_VIKINGDB_TIMEOUT_SEC); on TimeoutError fall back to _grep_fs. - On empty candidate recall, retry via _grep_fs instead of returning an empty result, so a VikingDB gap no longer masquerades as 'no matches'. Tests: two new cases (empty-recall fallback, timeout fallback) + updated exclude_uri test to stub _grep_fs. 17/17 pass; ruff clean. Signed-off-by: lg320531124 <lg320531124@users.noreply.github.com>

lg320531124 · 2026-07-01T00:57:28Z

Following up on the 06. API & CLI Integration Tests failure. I rebased onto latest main (head b00e336) to rule out base-skew, and the job fails the same way — so I dug into the fixture path, since the failing tests touch code this PR doesn't modify. I believe this is a test-infrastructure TreeLock issue, not a regression from this change.

What's failing

The CLI Compatibility Tests finish with 3 failed, 25 passed, 7 skipped, 11 errors (in ~19min), all from one shared setup step:

ERROR test_cli_content.py::TestContentRead::test_read - AssertionError: add-resource failed after retries:
  ov add-resource /tmp/tmpXXX.txt --to viking://resources/cli_test_51cef59c/test_pack --wait --timeout 300 -o json
  │ OpenViking returned an error: CONFLICT: Resource is busy:
  │ viking://resources/cli_test_51cef59c/test_pack.

11 ERROR at setup + 3 FAILED — they all depend on the session-scoped test_pack_uri fixture (tests/cli/conftest.py:647), whose ov add-resource returns CONFLICT: Resource is busy after 15×@10s retries. The tests fail before any of my grep/search assertions run.

Why I don't think it's this PR

The diff only touches openviking/storage/viking_fs.py (grep fs fallback) + tests/storage/test_viking_fs_grep.py. It does not touch add-resource, conftest.py, or any lock code.
The failing tests (test_add_resource_*, test_reindex, TestSearchGrep::test_grep_*) all fail at the shared test_pack_uri setup step, not at grep assertions.

Root cause (source-level)

Resource is busy comes from openviking/utils/resource_processor.py:441:

return await OwnedLockLease.acquire_tree(lock_manager, path, timeout=timeout)
...
raise ResourceBusyError(f"Resource is busy: {uri or path}", ..., retryable=True)

The fixture's ov_add_resource retries 15×@10s (conftest.py:488, "retries for CONFLICT/busy/network errors"), yet the lock stays held across all 15 attempts → retryable=True but the retry never recovers. test_dir_uri uses a random uuid per session (conftest.py:639), so this isn't a fixed-name collision — it looks like the per-resource TreeLock isn't being released within the session-scoped fixture's lifetime, and the dependent tests cascade-fail at setup.

The workflow itself acknowledges the conflict surface — the test step echoes Running CLI integration tests (serial to avoid resource conflicts)... — but serializing didn't prevent it here.

Cross-check: same failure on an unrelated PR

My sibling PR #2874 (feat/tool-input-compaction, touches a completely different module) fails with the identical tally on the same run: 3 failed, 25 passed, 7 skipped, 11 errors, same Resource is busy on its own random cli_test_4f5946d2 uuid. Two PRs touching disjoint code producing byte-identical failure counts is strong evidence of a shared infra flake rather than a code defect in either branch.

Ask

I don't have admin rights to rerun failed jobs from a fork. Could someone with access rerun the 06. API & CLI Integration Tests job on this branch? If it passes on retry, that confirms the flake. If it fails the same way, the TreeLock-release path in acquire_resource_lock may be worth a look for the CI environment.

(No code changes from my side — happy to act on anything if the failure does point back at the branch.)

lg320531124 · 2026-07-01T01:44:33Z

Opened #2916 to track the Resource is busy TreeLock flake at the root-cause level (CI-infra, not this PR). Summary of the evidence chain is there: main schedule runs green in the same window the fork PRs go red, two disjoint-module PRs hit the identical tally, and retryable=True never recovers across 15 retries within the session-scoped fixture. Any maintainer with admin access who can rerun the 06. API & CLI Integration Tests job on this branch, that would confirm the flake.

github-project-automation Bot added this to OpenViking project Jun 30, 2026

github-project-automation Bot moved this to Backlog in OpenViking project Jun 30, 2026

lg320531124 mentioned this pull request Jun 30, 2026

fix(mcp): add grep timeout, error logging, and fs fallback on VikingDB failure #2854

Closed

lg320531124 force-pushed the fix/grep-empty-result-fallback branch from ae5d592 to 2560a6f Compare June 30, 2026 11:10

lg320531124 force-pushed the fix/grep-empty-result-fallback branch from 2560a6f to b00e336 Compare July 1, 2026 00:07

lg320531124 mentioned this pull request Jul 1, 2026

feat: rule-based tool input compaction for auto-capture (~10x storage reduction) #2874

Open

6 tasks

chore: trigger CI retry (TreeLock flake — see prior diagnosis comment)

573ca1a

lg320531124 mentioned this pull request Jul 1, 2026

CI flake: API & CLI Integration Tests fail with constant tally on fork PRs (TreeLock not released) #2916

Closed

This was referenced Jul 1, 2026

feat(search): normalize semantic query (NFKC + casefold) #2937

Open

feat(search): grep fallback when semantic recall is empty #2938

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(grep): fall back to fs on empty VikingDB recall and query timeout (#2850)#2900

fix(grep): fall back to fs on empty VikingDB recall and query timeout (#2850)#2900
lg320531124 wants to merge 2 commits into
volcengine:mainfrom
lg320531124:fix/grep-empty-result-fallback

lg320531124 commented Jun 30, 2026

Uh oh!

lg320531124 commented Jul 1, 2026

Uh oh!

lg320531124 commented Jul 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

lg320531124 commented Jun 30, 2026

What & Why

Changes (storage layer only — openviking/storage/viking_fs.py)

Why not the MCP layer

Tests (tests/storage/test_viking_fs_grep.py)

Scope

Uh oh!

lg320531124 commented Jul 1, 2026

What's failing

Why I don't think it's this PR

Root cause (source-level)

Cross-check: same failure on an unrelated PR

Ask

Uh oh!

lg320531124 commented Jul 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Changes (storage layer only — `openviking/storage/viking_fs.py`)

Tests (`tests/storage/test_viking_fs_grep.py`)