fix(semantic): ensure memory processing always reports completion status #951
deepakdevp wants to merge 2 commits into volcengine:main
Conversation
_process_memory_directory() had early return paths that could bypass report_success()/report_error() in on_dequeue(), leaving the queue's in_progress counter permanently stuck. This caused the semantic queue to appear stalled with pending items never being processed. All code paths now properly propagate to the completion callbacks. Fixes volcengine#864.
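The failure mode can be sketched in a few lines with a hypothetical model of the queue accounting (illustrative names, not the actual openviking classes): a work function that swallows an exception and returns normally is indistinguishable from success, so the completion callbacks never see the failure, while re-raising routes the item to the error path.

```python
# Hypothetical model of the queue accounting; names are illustrative,
# not the real openviking API.
class SemanticQueue:
    def __init__(self):
        self.in_progress = 0
        self.processed = 0
        self.errors = 0

    def on_dequeue(self, process):
        self.in_progress += 1
        try:
            process()
        except Exception:
            self.errors += 1        # report_error() path
        else:
            self.processed += 1     # report_success() path
        self.in_progress -= 1


def buggy_process():
    try:
        raise OSError("ls failed")
    except Exception:
        return  # silent early return: counted as a success above


def fixed_process():
    try:
        raise OSError("ls failed")
    except Exception as e:
        # Re-raise so on_dequeue routes to the error callback.
        raise RuntimeError("failed to list memory directory") from e
```

With the buggy handler the failure is counted as a completed item; with the re-raise the same failure reaches the error callback.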
qin-ctx left a comment
Thanks for chasing this down. The direction is correct: the silent early returns in _process_memory_directory() really can bypass the queue completion callbacks and leave in_progress stuck.
I found one blocking issue and one follow-up test gap below.
```diff
 except Exception as e:
-    logger.warning(f"Failed to list memory directory {dir_uri}: {e}")
-    return
+    raise RuntimeError(f"Failed to list memory directory {dir_uri}: {e}") from e
```
[Bug] (blocking) Re-raising here fixes the silent early return, but it does not actually make this failure path report an error in production. on_dequeue() still routes non-permanent exceptions through the re-enqueue branch, and classify_api_error() only recognizes 401/403/5xx/timeout patterns. That means common filesystem failures here, such as FileNotFoundError, Permission denied, or local I/O errors, are classified as unknown, re-enqueued, and ultimately counted as success instead of report_error(). So this PR removes the stuck in_progress symptom, but it does not guarantee the intended issue behavior from the PR description, and it can turn invalid memory URIs into infinite retries. Please either classify these directory read/write failures as permanent at the source, or extend the error-classification path so local filesystem failures are reported as queue errors rather than retried forever.
Addressed in b8b504b. Added _PERMANENT_IO_ERRORS = (FileNotFoundError, PermissionError, IsADirectoryError, NotADirectoryError) with an isinstance check at the top of classify_api_error(), so filesystem errors are classified as "permanent" and hit report_error() instead of being re-enqueued. This prevents both the infinite retry loop and the false success counting.
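The shape of that change, sketched below. The tuple name and the chained-`__cause__` handling are from the commits; the rest of the function body is illustrative, not the real `classify_api_error`.

```python
# Filesystem failures are permanent: a bad path will not heal itself,
# so retrying is pointless. Names follow the commit; body is a sketch.
_PERMANENT_IO_ERRORS = (
    FileNotFoundError,
    PermissionError,
    IsADirectoryError,
    NotADirectoryError,
)

def classify_api_error(exc: BaseException) -> str:
    # Walk the __cause__ chain so a wrapping RuntimeError raised with
    # `raise ... from e` is still recognized as a filesystem failure.
    e = exc
    while e is not None:
        if isinstance(e, _PERMANENT_IO_ERRORS):
            return "permanent"
        e = e.__cause__
    # ... the existing 401/403/5xx/timeout pattern matching would follow ...
    return "unknown"
```

Note that a plain `OSError` (e.g. a transient disk read error) still falls through to the existing classification, so only the four unambiguously permanent subclasses short-circuit to `report_error()`.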
```python
        return_value=None,
    ),
    patch(
        "openviking.storage.queuefs.semantic_processor.classify_api_error",
```
[Suggestion] (non-blocking) This test currently proves the desired behavior only because classify_api_error() is mocked to return "permanent". In the real code path, OSError("disk read failed") is classified as unknown, so on_dequeue() re-enqueues it instead of calling report_error(). Please add at least one test that exercises the real classifier behavior, and ideally a second one for the new write_file() failure path as well, so the tests match production semantics.
Fixed in b8b504b. Removed the classify_api_error mock — tests now use real FileNotFoundError and PermissionError which the updated classifier handles as permanent. Also added a write-failure test path and 2 new tests in test_circuit_breaker.py verifying all 4 filesystem error types + chained cause detection.
…inite retry

Address review feedback: filesystem errors (FileNotFoundError, PermissionError, IsADirectoryError, NotADirectoryError) are now classified as permanent by classify_api_error(), so they hit report_error() instead of being infinitely re-enqueued. Tests updated to exercise real classifier behavior without mocking.
… fix

Bundles three in-flight contributor PRs (#533, #549, #951) with reviewer feedback addressed, consolidated into a single set of focused edits.

memory_extraction.yaml (#549):
- Add length targets to the Three-Level Structure section: abstract ~50-80 chars, overview 3-5 bullets, content 2-4 sentences.
- Kept the concise guidance Zayn approved; dropped the BAD/GOOD content example blocks he flagged as redundant with the few-shot examples below, and kept all text in English per yangxinxin-7's language-mixing concern.

memory_merge_bundle.yaml (#533):
- Add facet coherence check: same category is not sufficient to merge; memories covering different facets (e.g. Python code style + food preference) must output {"decision": "skip"}.
- Add hard length limits: abstract ≤ 80, overview ≤ 200, content ≤ 300.
- Switch merge strategy from accumulate-all to condensed snapshot: on conflict keep the newer value; do not retain superseded details.
- Bump template version 1.0.0 → 2.0.0.

memory_extractor.py (#549):
- Vectorize on `abstract or content` instead of `content`. Shorter text yields more discriminative embeddings and reduces score clustering.

semantic_processor.py + model_retry.py (#951):
- Fix memory semantic queue stall: _process_memory_directory() had two silent early-return paths (ls failure, write_file failure) that let on_dequeue() hit report_success() while the work actually failed — telemetry got marked_failed, but the queue's in_progress counter and processed count treated the message as done. Re-raise as RuntimeError so on_dequeue routes to report_error().
- Classify local filesystem errors (FileNotFoundError, PermissionError, IsADirectoryError, NotADirectoryError — including chained __cause__) as "permanent" in classify_api_error, so a bad path fails the queue entry instead of being re-enqueued forever.

Tests:
- tests/utils/test_circuit_breaker.py: cover the four filesystem error types and a chained FileNotFoundError.
- tests/storage/test_memory_semantic_stall.py: exercise on_dequeue through the real classifier — ls failure must hit on_error, empty dir must still hit on_success (no regression).
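Those two stall-test cases can be sketched against a minimal stand-in for the dequeue loop and classifier (illustrative names and URI, not openviking's real API): a directory-listing failure wrapped with `raise ... from` must reach the error callback through real classification, while an empty directory must still complete cleanly.

```python
# Minimal stand-in for the dequeue loop and classifier; names and the
# example URI are illustrative, not openviking's real API.
def classify(exc):
    permanent = (FileNotFoundError, PermissionError,
                 IsADirectoryError, NotADirectoryError)
    e = exc
    while e is not None:
        if isinstance(e, permanent):
            return "permanent"
        e = e.__cause__
    return "unknown"

def on_dequeue(process, on_success, on_error, on_requeue):
    try:
        process()
    except Exception as e:
        if classify(e) == "permanent":
            on_error(e)      # permanent failure: fail the queue entry
        else:
            on_requeue(e)    # transient/unknown: retry later
    else:
        on_success()

def run(process):
    calls = []
    on_dequeue(process,
               on_success=lambda: calls.append("success"),
               on_error=lambda e: calls.append("error"),
               on_requeue=lambda e: calls.append("requeue"))
    return calls

def ls_failure():
    # A missing memory directory surfaces as FileNotFoundError,
    # re-raised with the original exception as __cause__.
    try:
        raise FileNotFoundError("memories/missing")  # hypothetical path
    except FileNotFoundError as e:
        raise RuntimeError("failed to list memory directory") from e

def empty_dir():
    return  # nothing to process, but still a clean completion
```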
Cherry-picked into #1522 for batch merge along with #533 and #549. This PR had drifted on
Will merge via #1522 rather than this branch. Thanks @deepakdevp — the root-cause write-up and qin-ctx's classifier follow-up made the cherry-pick straightforward.
DequeueHandlerBase.set_callbacks now takes (on_success, on_requeue, on_error); the original PR #951 test harness called it with only (on_success, on_error).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
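For reference, a sketch of that signature drift (only the parameter lists come from the note above; the class body is illustrative):

```python
class DequeueHandlerBase:
    # Current three-callback signature; storing the callbacks is a sketch,
    # not the real implementation.
    def set_callbacks(self, on_success, on_requeue, on_error):
        self._on_success = on_success
        self._on_requeue = on_requeue
        self._on_error = on_error

# The original #951 harness used the old two-callback form:
#     handler.set_callbacks(on_success, on_error)
# which against the current signature raises TypeError: the second
# positional argument lands in on_requeue and on_error goes missing.
```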
Following up on the earlier cherry-pick attempt — closed #1522 and re-did the cherry-pick properly so your original commits are preserved via
Summary

- Symptom: `pending` backlog grows while `processed` stays at 0.
- `_process_memory_directory()` had error paths that caught exceptions and returned silently, but these errors need to propagate to `on_dequeue()`'s error handling, which calls `report_error()` and handles circuit-breaker logic.
- Errors are now re-raised as `RuntimeError`, properly caught by `on_dequeue()`'s existing exception handler.

Changes

- `semantic_processor.py`: error paths in `_process_memory_directory()` now re-raise instead of silently returning, so `on_dequeue()` can call `report_error()` for permanent errors or re-enqueue for transient ones.

Fixes #864.

Test plan

- `pytest tests/storage/test_memory_semantic_stall.py`