
fix(semantic): ensure memory processing always reports completion status #951

Open
deepakdevp wants to merge 2 commits into volcengine:main from deepakdevp:fix/memory-semantic-queue-stall

Conversation

@deepakdevp
Contributor

Summary

  • Fix memory semantic queue stalls where pending backlog grows while processed stays at 0
  • Root cause: _process_memory_directory() had error paths that caught exceptions and returned silently, but these errors need to propagate to on_dequeue()'s error handling, which calls report_error() and runs the circuit-breaker logic
  • Changed 2 silent early returns (ls failure, write failure) to re-raise as RuntimeError, properly caught by on_dequeue()'s existing exception handler

Changes

  • semantic_processor.py: Error paths in _process_memory_directory() now re-raise instead of silently returning, so on_dequeue() can call report_error() for permanent errors or re-enqueue for transient ones
  • 2 new tests verifying that an empty directory reports success and an ls failure reports an error
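The change above can be sketched as follows. This is a minimal illustration with assumed names (MemoryDirectoryProcessor, fs.ls) rather than the actual semantic_processor.py code; the point is that a listing failure now surfaces as a RuntimeError with the original exception chained, instead of a silent return.

```python
import logging

logger = logging.getLogger(__name__)

class MemoryDirectoryProcessor:
    """Hypothetical stand-in for the directory-processing step."""

    def __init__(self, fs):
        self.fs = fs  # any object exposing ls(uri) -> list of entries

    def process(self, dir_uri):
        try:
            entries = self.fs.ls(dir_uri)
        except Exception as e:
            logger.warning("Failed to list memory directory %s: %s", dir_uri, e)
            # Before the fix this path was a bare `return`, which dropped the
            # error and left the queue's in_progress counter stuck.
            raise RuntimeError(
                f"Failed to list memory directory {dir_uri}: {e}"
            ) from e
        return entries

class BrokenFS:
    """Filesystem double whose ls() always fails."""

    def ls(self, uri):
        raise FileNotFoundError(uri)
```

Because the RuntimeError is raised `from e`, the caller's exception handler can still inspect the underlying filesystem error via `__cause__`.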

Fixes #864.

Test plan

  • 2 new tests pass (pytest tests/storage/test_memory_semantic_stall.py)
  • Ruff check and format clean
  • Empty directory path still correctly reports success (no regression)

_process_memory_directory() had early return paths that could bypass
report_success()/report_error() in on_dequeue(), leaving the queue's
in_progress counter permanently stuck. This caused the semantic queue
to appear stalled with pending items never being processed.

All code paths now properly propagate to the completion callbacks.

Fixes volcengine#864.
@github-actions

Failed to generate code suggestions for PR

Collaborator

@qin-ctx qin-ctx left a comment


Thanks for chasing this down. The direction is correct: the silent early returns in _process_memory_directory() really can bypass the queue completion callbacks and leave in_progress stuck.

I found one blocking issue and one follow-up test gap below.

  except Exception as e:
      logger.warning(f"Failed to list memory directory {dir_uri}: {e}")
-     return
+     raise RuntimeError(f"Failed to list memory directory {dir_uri}: {e}") from e
Collaborator


[Bug] (blocking) Re-raising here fixes the silent early return, but it does not actually make this failure path report an error in production. on_dequeue() still routes non-permanent exceptions through the re-enqueue branch, and classify_api_error() only recognizes 401/403/5xx/timeout patterns. That means common filesystem failures here, such as FileNotFoundError, Permission denied, or local I/O errors, are classified as unknown, re-enqueued, and ultimately counted as success instead of report_error(). So this PR removes the stuck in_progress symptom, but it does not guarantee the intended issue behavior from the PR description, and it can turn invalid memory URIs into infinite retries. Please either classify these directory read/write failures as permanent at the source, or extend the error-classification path so local filesystem failures are reported as queue errors rather than retried forever.
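The routing the reviewer describes can be sketched as follows. The names (on_dequeue, report_error, re_enqueue) are assumed from the thread, not the actual OpenViking API; the stand-in classifier mimics the pre-fix behavior, where only HTTP-style failures count as permanent and local filesystem errors fall through to "unknown" and get retried.

```python
class FakeQueue:
    """Minimal queue double recording completion outcomes."""

    def __init__(self):
        self.outcomes = []

    def report_success(self, msg):
        self.outcomes.append(("success", msg))

    def report_error(self, msg, reason):
        self.outcomes.append(("error", msg))

    def re_enqueue(self, msg):
        self.outcomes.append(("requeue", msg))

def classify_api_error(exc):
    # Stand-in for the pre-fix classifier: only 401/403-style failures are
    # permanent; everything else, including OSError, is "unknown".
    return "permanent" if "401" in str(exc) or "403" in str(exc) else "unknown"

def on_dequeue(queue, msg, process):
    try:
        process(msg)
    except Exception as e:
        if classify_api_error(e) == "permanent":
            queue.report_error(msg, str(e))
        else:
            queue.re_enqueue(msg)  # unknown/transient: retry later
        return
    queue.report_success(msg)
```

With this dispatch, a FileNotFoundError from the directory step is re-enqueued rather than reported, which is exactly the infinite-retry risk the reviewer flags.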

Contributor Author


Addressed in b8b504b. Added _PERMANENT_IO_ERRORS = (FileNotFoundError, PermissionError, IsADirectoryError, NotADirectoryError) with an isinstance check at the top of classify_api_error(), so filesystem errors are classified as "permanent" and hit report_error() instead of being re-enqueued. This prevents both the infinite retry loop and the false success counting.
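A sketch of the classification change described in this reply; the real classify_api_error also matches 401/403/5xx/timeout patterns, which are omitted here.

```python
_PERMANENT_IO_ERRORS = (
    FileNotFoundError,
    PermissionError,
    IsADirectoryError,
    NotADirectoryError,
)

def classify_api_error(exc):
    # Filesystem failures are not retryable: a missing or unreadable
    # memory directory will not fix itself on re-enqueue.
    if isinstance(exc, _PERMANENT_IO_ERRORS):
        return "permanent"
    # ... existing 401/403/5xx/timeout pattern matching would go here ...
    return "unknown"
```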

        return_value=None,
    ),
    patch(
        "openviking.storage.queuefs.semantic_processor.classify_api_error",
Collaborator


[Suggestion] (non-blocking) This test currently proves the desired behavior only because classify_api_error() is mocked to return "permanent". In the real code path, OSError("disk read failed") is classified as unknown, so on_dequeue() re-enqueues it instead of calling report_error(). Please add at least one test that exercises the real classifier behavior, and ideally a second one for the new write_file() failure path as well, so the tests match production semantics.

Contributor Author


Fixed in b8b504b. Removed the classify_api_error mock — tests now use real FileNotFoundError and PermissionError which the updated classifier handles as permanent. Also added a write-failure test path and 2 new tests in test_circuit_breaker.py verifying all 4 filesystem error types + chained cause detection.

@qin-ctx qin-ctx self-assigned this Mar 25, 2026
…inite retry

Address review feedback: filesystem errors (FileNotFoundError,
PermissionError, IsADirectoryError, NotADirectoryError) are now
classified as permanent by classify_api_error(), so they hit
report_error() instead of being infinitely re-enqueued.

Tests updated to exercise real classifier behavior without mocking.
@deepakdevp
Contributor Author

Hi @qin-ctx — I've addressed both points in b8b504b (permanent IO error classification + real-classifier tests). Ready for re-review when you get a chance.

@deepakdevp
Contributor Author

Hi @qin-ctx — just a gentle ping on this one. Both points from your review were addressed in b8b504b (permanent IO error classification + real-classifier tests). Ready for re-review whenever you have a moment.

ZaynJarvis added a commit that referenced this pull request Apr 17, 2026
… fix

Bundles three in-flight contributor PRs (#533, #549, #951) with reviewer
feedback addressed, consolidated into a single set of focused edits.

memory_extraction.yaml (#549):
- Add length targets to the Three-Level Structure section: abstract
  ~50-80 chars, overview 3-5 bullets, content 2-4 sentences.
- Kept the concise guidance Zayn approved; dropped the BAD/GOOD content
  example blocks he flagged as redundant with the few-shot examples
  below, and kept all text in English per yangxinxin-7's language-mixing
  concern.

memory_merge_bundle.yaml (#533):
- Add facet coherence check: same category is not sufficient to merge;
  memories covering different facets (e.g. Python code style + food
  preference) must output {"decision": "skip"}.
- Add hard length limits: abstract ≤ 80, overview ≤ 200, content ≤ 300.
- Switch merge strategy from accumulate-all to condensed snapshot: on
  conflict keep newer value; do not retain superseded details.
- Bump template version 1.0.0 → 2.0.0.

memory_extractor.py (#549):
- Vectorize on `abstract or content` instead of `content`. Shorter text
  yields more discriminative embeddings and reduces score clustering.

semantic_processor.py + model_retry.py (#951):
- Fix memory semantic queue stall: _process_memory_directory() had two
  silent early-return paths (ls failure, write_file failure) that let
  on_dequeue() hit report_success() while the work actually failed —
  telemetry got marked_failed, but the queue's in_progress counter and
  processed count treated the message as done. Re-raise as RuntimeError
  so on_dequeue routes to report_error().
- Classify local filesystem errors (FileNotFoundError, PermissionError,
  IsADirectoryError, NotADirectoryError — including chained __cause__)
  as "permanent" in classify_api_error, so a bad path fails the queue
  entry instead of being re-enqueued forever.

Tests:
- tests/utils/test_circuit_breaker.py: cover the four filesystem error
  types and a chained FileNotFoundError.
- tests/storage/test_memory_semantic_stall.py: exercise on_dequeue
  through the real classifier — ls failure must hit on_error, empty dir
  must still hit on_success (no regression).
@ZaynJarvis
Collaborator

Cherry-picked into #1522 for batch merge along with #533 and #549. This PR had drifted on main: classify_api_error has since moved from openviking/utils/circuit_breaker.py to openviking/utils/model_retry.py, and _process_memory_directory gained a _mark_failed(...) telemetry call — so I re-applied your fix against the current tree:

  • _process_memory_directory: both error paths re-raise as RuntimeError(...) from e so on_dequeue reaches report_error. Kept the _mark_failed(str(e)) call before the raise.
  • classify_api_error (now in model_retry.py): added the _PERMANENT_IO_ERRORS tuple + isinstance check at the top, including the chained-__cause__ walk.
  • Tests: classifier assertions (4 filesystem types + chained) added to the existing tests/utils/test_circuit_breaker.py; a new tests/storage/test_memory_semantic_stall.py exercises on_dequeue end-to-end through the real classifier (no classify_api_error mock) to pin the production semantics @qin-ctx asked for.
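The chained-__cause__ walk mentioned above might look like the following; this is an assumed shape, not the actual model_retry.py code. It matters because _process_memory_directory re-raises as `RuntimeError(...) from e`, so the classifier has to follow __cause__ to find the underlying filesystem error.

```python
_PERMANENT_IO_ERRORS = (
    FileNotFoundError,
    PermissionError,
    IsADirectoryError,
    NotADirectoryError,
)

def classify_api_error(exc):
    # Walk the explicit cause chain so a RuntimeError raised `from e`
    # still classifies by the underlying filesystem error. The seen-set
    # guards against pathological cycles in the chain.
    seen = set()
    current = exc
    while current is not None and id(current) not in seen:
        if isinstance(current, _PERMANENT_IO_ERRORS):
            return "permanent"
        seen.add(id(current))
        current = current.__cause__
    return "unknown"
```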

Will merge via #1522 rather than this branch. Thanks @deepakdevp — the root-cause write-up and qin-ctx's classifier follow-up made the cherry-pick straightforward.

ZaynJarvis added a commit that referenced this pull request Apr 17, 2026
DequeueHandlerBase.set_callbacks now takes (on_success, on_requeue, on_error);
the original PR #951 test harness called it with only (on_success, on_error).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@ZaynJarvis
Collaborator

Following up on the earlier cherry-pick attempt — closed #1522 and re-did the cherry-pick properly so your original commits are preserved via git cherry-pick (deepakdevp authorship intact on both commits). Now split out as standalone #1531. Conflict resolution: since classify_api_error moved from circuit_breaker.py to model_retry.py on main, the _PERMANENT_IO_ERRORS isinstance check was applied to model_retry.py instead. Error paths in _process_memory_directory() now release lifecycle_lock_handle_id before re-raising to match the new lock-ownership model. 21/21 tests pass.
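The lock handling mentioned above could be sketched like this; lifecycle_lock_handle_id and the lock-manager API are assumed names taken from this comment, not verified against the current tree. Only the error path is shown: the handle is released before the re-raise so a failed entry does not keep the lifecycle lock.

```python
class LockManager:
    """Toy lock manager tracking held handle ids."""

    def __init__(self):
        self.held = set()

    def acquire(self, handle_id):
        self.held.add(handle_id)

    def release(self, handle_id):
        self.held.discard(handle_id)

def process_with_lock(locks, fs, dir_uri, lifecycle_lock_handle_id):
    locks.acquire(lifecycle_lock_handle_id)
    try:
        return fs.ls(dir_uri)
    except Exception as e:
        # Release ownership before propagating, matching the described
        # lock-ownership model for failed entries.
        locks.release(lifecycle_lock_handle_id)
        raise RuntimeError(f"Failed to list memory directory {dir_uri}: {e}") from e
```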



Development

Successfully merging this pull request may close these issues.

Memory semantic queue stalls on context_type=memory jobs; pending backlog grows while processed stays at 0
