Skip to content

fix(grep): fall back to fs on empty VikingDB recall and query timeout (#2850)#2900

Open
lg320531124 wants to merge 2 commits into
volcengine:mainfrom
lg320531124:fix/grep-empty-result-fallback
Open

fix(grep): fall back to fs on empty VikingDB recall and query timeout (#2850)#2900
lg320531124 wants to merge 2 commits into
volcengine:mainfrom
lg320531124:fix/grep-empty-result-fallback

Conversation

@lg320531124

Copy link
Copy Markdown
Contributor

What & Why

Re-fixes #2850, replacing the rejected #2854 (which sat at the MCP layer). Per @qin-ctx's review there, the fix belongs inside VikingFS._grep_vikingdb_then_fs() — not wrapped around the MCP endpoint — and must cover the empty-result case, not just exceptions/timeouts.

#2850 has two failure modes, both rooted in VikingDB unreliability on large/stale corpora:

  1. Silent empty recall. BM25 returns [] even though matching content exists (index lag, silent timeout). The old code treated an empty candidate list as a definitive "no matching content" and short-circuited with an empty result. The user sees "no matches" when matches exist.
  2. Hung query. search_by_keywords can hang indefinitely, stalling grep with no fallback.

Changes (storage layer only — openviking/storage/viking_fs.py)

_grep_vikingdb_then_fs() now:

  • Empty recall → fs fallback. When candidate_uris is empty, retry via _grep_fs(...) instead of returning an empty dict. This is the case fix(mcp): add grep timeout, error logging, and fs fallback on VikingDB failure #2854 missed and @qin-ctx explicitly called out. A warning is logged so the gap is observable, not silent.
  • Bounded VikingDB query. search_by_keywords(...) is wrapped in asyncio.wait_for(..., timeout=vikingdb_timeout). On asyncio.TimeoutError, fall back to _grep_fs(...) — same treatment as a raised exception. Default 10 s, configurable via OPENVIKING_GREP_VIKINGDB_TIMEOUT_SEC (no GrepConfig schema change needed, since GrepConfig is extra=forbid).
  • Existing exception fallback preserved. The except Exception branch is untouched; timeout is a sibling except, not a replacement.

This matches the three points in @qin-ctx's review: converge on _grep_vikingdb_then_fs(), keep the existing exception fallback, add empty-result fallback at the candidate_uris empty site, and wrap only search_by_keywords (not the whole grep) if a timeout is wanted.

Why not the MCP layer

The MCP endpoint already delegates to service.fs.grep(...); wrapping it again there (as #2854 did) just re-calls the same path without forcing fs, and can't see VikingDB's empty recall at all. The storage layer is where both failure modes are visible.

Tests (tests/storage/test_viking_fs_grep.py)

  • test_grep_vikingdb_empty_recall_falls_back_to_fs_DummyVectorStore returns []; asserts _grep_fs is called and its result surfaces (not the empty VikingDB hit).
  • test_grep_vikingdb_timeout_falls_back_to_fs_SlowVectorStore (30 s sleep) + OPENVIKING_GREP_VIKINGDB_TIMEOUT_SEC=0.1; asserts fs fallback result surfaces.
  • test_grep_vikingdb_pushes_exclude_uri_to_filter — updated to stub _grep_fs returning empty, since empty recall now falls through to fs (keeps the exclude_uri filter assertion focused on the remote query path).
  • Existing exception-fallback and DFS/concurrency tests unchanged.
$ uv run --no-sync python -m pytest tests/storage/test_viking_fs_grep.py -q --no-cov
17 passed in 2.18s

$ ruff check openviking/storage/viking_fs.py tests/storage/test_viking_fs_grep.py
All checks passed!

Scope

  • openviking/storage/viking_fs.py (+59/-7)
  • tests/storage/test_viking_fs_grep.py (+92)

No uv.lock churn, no MCP changes, no GrepConfig schema change. Default behavior is unchanged when VikingDB returns non-empty results promptly.

Closes #2850. Supersedes #2854.

…volcengine#2850)

VikingFS._grep_vikingdb_then_fs treated an empty BM25 candidate list as a
definitive 'no matching content', and an indefinitely hanging remote query
could stall grep. Both are symptoms of VikingDB unreliability on large/stale
corpora (volcengine#2850), not proof of absence.

- Wrap search_by_keywords in asyncio.wait_for (default 10s, configurable via
  OPENVIKING_GREP_VIKINGDB_TIMEOUT_SEC); on TimeoutError fall back to _grep_fs.
- On empty candidate recall, retry via _grep_fs instead of returning an empty
  result, so a VikingDB gap no longer masquerades as 'no matches'.

Tests: two new cases (empty-recall fallback, timeout fallback) + updated
exclude_uri test to stub _grep_fs. 17/17 pass; ruff clean.

Signed-off-by: lg320531124 <lg320531124@users.noreply.github.com>
@lg320531124 lg320531124 force-pushed the fix/grep-empty-result-fallback branch from 2560a6f to b00e336 Compare July 1, 2026 00:07
@lg320531124

Copy link
Copy Markdown
Contributor Author

Following up on the 06. API & CLI Integration Tests failure. I rebased onto latest main (head b00e336) to rule out base-skew, and the job fails the same way — so I dug into the fixture path, since the failing tests touch code this PR doesn't modify. I believe this is a test-infrastructure TreeLock issue, not a regression from this change.

What's failing

The CLI Compatibility Tests finish with 3 failed, 25 passed, 7 skipped, 11 errors (in ~19min), all from one shared setup step:

ERROR test_cli_content.py::TestContentRead::test_read - AssertionError: add-resource failed after retries:
  ov add-resource /tmp/tmpXXX.txt --to viking://resources/cli_test_51cef59c/test_pack --wait --timeout 300 -o json
  │ OpenViking returned an error: CONFLICT: Resource is busy:
  │ viking://resources/cli_test_51cef59c/test_pack.

11 ERROR at setup + 3 FAILED — they all depend on the session-scoped test_pack_uri fixture (tests/cli/conftest.py:647), whose ov add-resource returns CONFLICT: Resource is busy after 15×@10s retries. The tests fail before any of my grep/search assertions run.

Why I don't think it's this PR

  • The diff only touches openviking/storage/viking_fs.py (grep fs fallback) + tests/storage/test_viking_fs_grep.py. It does not touch add-resource, conftest.py, or any lock code.
  • The failing tests (test_add_resource_*, test_reindex, TestSearchGrep::test_grep_*) all fail at the shared test_pack_uri setup step, not at grep assertions.

Root cause (source-level)

Resource is busy comes from openviking/utils/resource_processor.py:441:

return await OwnedLockLease.acquire_tree(lock_manager, path, timeout=timeout)
...
raise ResourceBusyError(f"Resource is busy: {uri or path}", ..., retryable=True)

The fixture's ov_add_resource retries 15×@10s (conftest.py:488, "retries for CONFLICT/busy/network errors"), yet the lock stays held across all 15 attempts → retryable=True but the retry never recovers. test_dir_uri uses a random uuid per session (conftest.py:639), so this isn't a fixed-name collision — it looks like the per-resource TreeLock isn't being released within the session-scoped fixture's lifetime, and the dependent tests cascade-fail at setup.

The workflow itself acknowledges the conflict surface — the test step echoes Running CLI integration tests (serial to avoid resource conflicts)... — but serializing didn't prevent it here.

Cross-check: same failure on an unrelated PR

My sibling PR #2874 (feat/tool-input-compaction, touches a completely different module) fails with the identical tally on the same run: 3 failed, 25 passed, 7 skipped, 11 errors, same Resource is busy on its own random cli_test_4f5946d2 uuid. Two PRs touching disjoint code producing byte-identical failure counts is strong evidence of a shared infra flake rather than a code defect in either branch.

Ask

I don't have admin rights to rerun failed jobs from a fork. Could someone with access rerun the 06. API & CLI Integration Tests job on this branch? If it passes on retry, that confirms the flake. If it fails the same way, the TreeLock-release path in acquire_resource_lock may be worth a look for the CI environment.

(No code changes from my side — happy to act on anything if the failure does point back at the branch.)

@lg320531124

Copy link
Copy Markdown
Contributor Author

Opened #2916 to track the Resource is busy TreeLock flake at the root-cause level (CI-infra, not this PR). Summary of the evidence chain is there: main schedule runs green in the same window the fork PRs go red, two disjoint-module PRs hit the identical tally, and retryable=True never recovers across 15 retries within the session-scoped fixture. Any maintainer with admin access who can rerun the 06. API & CLI Integration Tests job on this branch, that would confirm the flake.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Backlog

Development

Successfully merging this pull request may close these issues.

[Bug] VikingDB BM25 grep returns empty results on large corpus — fallback to fs grep not triggered reliably

1 participant