
fix(semantic): ensure memory processing always reports completion status #951

Open
deepakdevp wants to merge 2 commits into volcengine:main from deepakdevp:fix/memory-semantic-queue-stall

Conversation

@deepakdevp
Contributor

Summary

  • Fix memory semantic queue stalls where pending backlog grows while processed stays at 0
  • Root cause: _process_memory_directory() had error paths that caught exceptions and returned silently, but these errors need to propagate to on_dequeue()'s error handling, which calls report_error() and runs the circuit-breaker logic
  • Changed 2 silent early returns (ls failure, write failure) to re-raise as RuntimeError, properly caught by on_dequeue()'s existing exception handler

Changes

  • semantic_processor.py: Error paths in _process_memory_directory() now re-raise instead of silently returning, so on_dequeue() can call report_error() for permanent errors or re-enqueue for transient ones
  • 2 new tests verifying that an empty directory reports success and an ls failure reports an error
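The change above can be sketched as follows. This is a minimal illustration with assumed names (MemoryDirectoryProcessor, fs.ls) rather than the actual semantic_processor.py code; the point is that a listing failure now surfaces as a RuntimeError with the original exception chained, instead of a silent return.

```python
import logging

logger = logging.getLogger(__name__)

class MemoryDirectoryProcessor:
    """Hypothetical stand-in for the directory-processing step."""

    def __init__(self, fs):
        self.fs = fs  # any object exposing ls(uri) -> list of entries

    def process(self, dir_uri):
        try:
            entries = self.fs.ls(dir_uri)
        except Exception as e:
            logger.warning("Failed to list memory directory %s: %s", dir_uri, e)
            # Before the fix this path was a bare `return`, which dropped the
            # error and left the queue's in_progress counter stuck.
            raise RuntimeError(
                f"Failed to list memory directory {dir_uri}: {e}"
            ) from e
        return entries

class BrokenFS:
    """Filesystem double whose ls() always fails."""

    def ls(self, uri):
        raise FileNotFoundError(uri)
```

Because the RuntimeError is raised `from e`, the caller's exception handler can still inspect the underlying filesystem error via `__cause__`.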

Fixes #864.

Test plan

  • 2 new tests pass (pytest tests/storage/test_memory_semantic_stall.py)
  • Ruff check and format clean
  • Empty directory path still correctly reports success (no regression)

_process_memory_directory() had early return paths that could bypass
report_success()/report_error() in on_dequeue(), leaving the queue's
in_progress counter permanently stuck. This caused the semantic queue
to appear stalled with pending items never being processed.

All code paths now properly propagate to the completion callbacks.

Fixes volcengine#864.
@github-actions

Failed to generate code suggestions for PR

Collaborator

@qin-ctx qin-ctx left a comment


Thanks for chasing this down. The direction is correct: the silent early returns in _process_memory_directory() really can bypass the queue completion callbacks and leave in_progress stuck.

I found one blocking issue and one follow-up test gap below.

  except Exception as e:
      logger.warning(f"Failed to list memory directory {dir_uri}: {e}")
-     return
+     raise RuntimeError(f"Failed to list memory directory {dir_uri}: {e}") from e
Collaborator


[Bug] (blocking) Re-raising here fixes the silent early return, but it does not actually make this failure path report an error in production. on_dequeue() still routes non-permanent exceptions through the re-enqueue branch, and classify_api_error() only recognizes 401/403/5xx/timeout patterns. That means common filesystem failures here, such as FileNotFoundError, Permission denied, or local I/O errors, are classified as unknown, re-enqueued, and ultimately counted as success instead of report_error(). So this PR removes the stuck in_progress symptom, but it does not guarantee the intended issue behavior from the PR description, and it can turn invalid memory URIs into infinite retries. Please either classify these directory read/write failures as permanent at the source, or extend the error-classification path so local filesystem failures are reported as queue errors rather than retried forever.
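The routing the reviewer describes can be sketched as follows. The names (on_dequeue, report_error, re_enqueue) are assumed from the thread, not the actual OpenViking API; the stand-in classifier mimics the pre-fix behavior, where only HTTP-style failures count as permanent and local filesystem errors fall through to "unknown" and get retried.

```python
class FakeQueue:
    """Minimal queue double recording completion outcomes."""

    def __init__(self):
        self.outcomes = []

    def report_success(self, msg):
        self.outcomes.append(("success", msg))

    def report_error(self, msg, reason):
        self.outcomes.append(("error", msg))

    def re_enqueue(self, msg):
        self.outcomes.append(("requeue", msg))

def classify_api_error(exc):
    # Stand-in for the pre-fix classifier: only 401/403-style failures are
    # permanent; everything else, including OSError, is "unknown".
    return "permanent" if "401" in str(exc) or "403" in str(exc) else "unknown"

def on_dequeue(queue, msg, process):
    try:
        process(msg)
    except Exception as e:
        if classify_api_error(e) == "permanent":
            queue.report_error(msg, str(e))
        else:
            queue.re_enqueue(msg)  # unknown/transient: retry later
        return
    queue.report_success(msg)
```

With this dispatch, a FileNotFoundError from the directory step is re-enqueued rather than reported, which is exactly the infinite-retry risk the reviewer flags.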

Contributor Author


Addressed in b8b504b. Added _PERMANENT_IO_ERRORS = (FileNotFoundError, PermissionError, IsADirectoryError, NotADirectoryError) with an isinstance check at the top of classify_api_error(), so filesystem errors are classified as "permanent" and hit report_error() instead of being re-enqueued. This prevents both the infinite retry loop and the false success counting.
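A sketch of the classification change described in this reply; the real classify_api_error also matches 401/403/5xx/timeout patterns, which are omitted here.

```python
_PERMANENT_IO_ERRORS = (
    FileNotFoundError,
    PermissionError,
    IsADirectoryError,
    NotADirectoryError,
)

def classify_api_error(exc):
    # Filesystem failures are not retryable: a missing or unreadable
    # memory directory will not fix itself on re-enqueue.
    if isinstance(exc, _PERMANENT_IO_ERRORS):
        return "permanent"
    # ... existing 401/403/5xx/timeout pattern matching would go here ...
    return "unknown"
```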

        return_value=None,
    ),
    patch(
        "openviking.storage.queuefs.semantic_processor.classify_api_error",
Collaborator


[Suggestion] (non-blocking) This test currently proves the desired behavior only because classify_api_error() is mocked to return "permanent". In the real code path, OSError("disk read failed") is classified as unknown, so on_dequeue() re-enqueues it instead of calling report_error(). Please add at least one test that exercises the real classifier behavior, and ideally a second one for the new write_file() failure path as well, so the tests match production semantics.

Contributor Author


Fixed in b8b504b. Removed the classify_api_error mock — tests now use real FileNotFoundError and PermissionError which the updated classifier handles as permanent. Also added a write-failure test path and 2 new tests in test_circuit_breaker.py verifying all 4 filesystem error types + chained cause detection.

@qin-ctx qin-ctx self-assigned this Mar 25, 2026
…inite retry

Address review feedback: filesystem errors (FileNotFoundError,
PermissionError, IsADirectoryError, NotADirectoryError) are now
classified as permanent by classify_api_error(), so they hit
report_error() instead of being infinitely re-enqueued.

Tests updated to exercise real classifier behavior without mocking.
@deepakdevp
Contributor Author

Hi @qin-ctx — I've addressed both points in b8b504b (permanent IO error classification + real-classifier tests). Ready for re-review when you get a chance.

@deepakdevp
Contributor Author

Hi @qin-ctx — just a gentle ping on this one. Both points from your review were addressed in b8b504b (permanent IO error classification + real-classifier tests). Ready for re-review whenever you have a moment.

ZaynJarvis added a commit that referenced this pull request Apr 17, 2026
… fix

Bundles three in-flight contributor PRs (#533, #549, #951) with reviewer
feedback addressed, consolidated into a single set of focused edits.

memory_extraction.yaml (#549):
- Add length targets to the Three-Level Structure section: abstract
  ~50-80 chars, overview 3-5 bullets, content 2-4 sentences.
- Kept the concise guidance Zayn approved; dropped the BAD/GOOD content
  example blocks he flagged as redundant with the few-shot examples
  below, and kept all text in English per yangxinxin-7's language-mixing
  concern.

memory_merge_bundle.yaml (#533):
- Add facet coherence check: same category is not sufficient to merge;
  memories covering different facets (e.g. Python code style + food
  preference) must output {"decision": "skip"}.
- Add hard length limits: abstract ≤ 80, overview ≤ 200, content ≤ 300.
- Switch merge strategy from accumulate-all to condensed snapshot: on
  conflict keep newer value; do not retain superseded details.
- Bump template version 1.0.0 → 2.0.0.

memory_extractor.py (#549):
- Vectorize on `abstract or content` instead of `content`. Shorter text
  yields more discriminative embeddings and reduces score clustering.

semantic_processor.py + model_retry.py (#951):
- Fix memory semantic queue stall: _process_memory_directory() had two
  silent early-return paths (ls failure, write_file failure) that let
  on_dequeue() hit report_success() while the work actually failed —
  telemetry got marked_failed, but the queue's in_progress counter and
  processed count treated the message as done. Re-raise as RuntimeError
  so on_dequeue routes to report_error().
- Classify local filesystem errors (FileNotFoundError, PermissionError,
  IsADirectoryError, NotADirectoryError — including chained __cause__)
  as "permanent" in classify_api_error, so a bad path fails the queue
  entry instead of being re-enqueued forever.

Tests:
- tests/utils/test_circuit_breaker.py: cover the four filesystem error
  types and a chained FileNotFoundError.
- tests/storage/test_memory_semantic_stall.py: exercise on_dequeue
  through the real classifier — ls failure must hit on_error, empty dir
  must still hit on_success (no regression).
@ZaynJarvis
Collaborator

Cherry-picked into #1522 for batch merge along with #533 and #549. This PR had drifted on main: classify_api_error has since moved from openviking/utils/circuit_breaker.py to openviking/utils/model_retry.py, and _process_memory_directory gained a _mark_failed(...) telemetry call — so I re-applied your fix against the current tree:

  • _process_memory_directory: both error paths re-raise as RuntimeError(...) from e so on_dequeue reaches report_error. Kept the _mark_failed(str(e)) call before the raise.
  • classify_api_error (now in model_retry.py): added the _PERMANENT_IO_ERRORS tuple + isinstance check at the top, including the chained-__cause__ walk.
  • Tests: classifier assertions (4 filesystem types + chained) added to the existing tests/utils/test_circuit_breaker.py; a new tests/storage/test_memory_semantic_stall.py exercises on_dequeue end-to-end through the real classifier (no classify_api_error mock) to pin the production semantics @qin-ctx asked for.
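The chained-__cause__ walk mentioned above might look like the following; this is an assumed shape, not the actual model_retry.py code. It matters because _process_memory_directory re-raises as `RuntimeError(...) from e`, so the classifier has to follow __cause__ to find the underlying filesystem error.

```python
_PERMANENT_IO_ERRORS = (
    FileNotFoundError,
    PermissionError,
    IsADirectoryError,
    NotADirectoryError,
)

def classify_api_error(exc):
    # Walk the explicit cause chain so a RuntimeError raised `from e`
    # still classifies by the underlying filesystem error. The seen-set
    # guards against pathological cycles in the chain.
    seen = set()
    current = exc
    while current is not None and id(current) not in seen:
        if isinstance(current, _PERMANENT_IO_ERRORS):
            return "permanent"
        seen.add(id(current))
        current = current.__cause__
    return "unknown"
```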

Will merge via #1522 rather than this branch. Thanks @deepakdevp — the root-cause write-up and qin-ctx's classifier follow-up made the cherry-pick straightforward.

ZaynJarvis added a commit that referenced this pull request Apr 17, 2026
DequeueHandlerBase.set_callbacks now takes (on_success, on_requeue, on_error);
the original PR #951 test harness called it with only (on_success, on_error).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@ZaynJarvis
Collaborator

Following up on the earlier cherry-pick attempt — closed #1522 and re-did the cherry-pick properly so your original commits are preserved via git cherry-pick (deepakdevp authorship intact on both commits). Now split out as standalone #1531. Conflict resolution: since classify_api_error moved from circuit_breaker.py to model_retry.py on main, the _PERMANENT_IO_ERRORS isinstance check was applied to model_retry.py instead. Error paths in _process_memory_directory() now release lifecycle_lock_handle_id before re-raising to match the new lock-ownership model. 21/21 tests pass.
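The lock handling mentioned above could be sketched like this; lifecycle_lock_handle_id and the lock-manager API are assumed names taken from this comment, not verified against the current tree. Only the error path is shown: the handle is released before the re-raise so a failed entry does not keep the lifecycle lock.

```python
class LockManager:
    """Toy lock manager tracking held handle ids."""

    def __init__(self):
        self.held = set()

    def acquire(self, handle_id):
        self.held.add(handle_id)

    def release(self, handle_id):
        self.held.discard(handle_id)

def process_with_lock(locks, fs, dir_uri, lifecycle_lock_handle_id):
    locks.acquire(lifecycle_lock_handle_id)
    try:
        return fs.ls(dir_uri)
    except Exception as e:
        # Release ownership before propagating, matching the described
        # lock-ownership model for failed entries.
        locks.release(lifecycle_lock_handle_id)
        raise RuntimeError(f"Failed to list memory directory {dir_uri}: {e}") from e
```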



Development

Successfully merging this pull request may close these issues.

Memory semantic queue stalls on context_type=memory jobs; pending backlog grows while processed stays at 0
