Skip to content

fix(memory): connection pooling + circuit breaker for memory_tree SQLite (#2206)#2212

Merged
graycyrus merged 2 commits into
tinyhumansai:mainfrom
M3gA-Mind:fix/memory-tree-sqlite-init-failures
May 19, 2026
Merged

fix(memory): connection pooling + circuit breaker for memory_tree SQLite (#2206)#2212
graycyrus merged 2 commits into
tinyhumansai:mainfrom
M3gA-Mind:fix/memory-tree-sqlite-init-failures

Conversation

@M3gA-Mind
Copy link
Copy Markdown
Contributor

@M3gA-Mind M3gA-Mind commented May 19, 2026

Summary

  • Replace open-on-every-call SQLite pattern in memory_tree store with a process-level connection cache keyed by workspace path — schema init runs once, not ~69K times/day
  • Add per-path CircuitBreaker (3 consecutive failures → 30 s closed) so a broken DB install stops flooding Sentry instead of reporting on every worker poll
  • Add stale .db-shm/.db-wal cleanup on IOERR_SHMMAP / CANTOPEN with a single retry before recording a failure
  • Extend worker error classification with is_sqlite_io_transient() (codes 1546, 4874, 14) so those errors trigger a 30 s backoff instead of Sentry reports
  • Expose memory_tree_db health via DomainEvent::HealthChanged and add a probe check to openhuman.doctor

Problem

with_connection() in src/openhuman/memory/tree/store.rs opened a new SQLite connection and re-ran the full schema init (8 tables, 15+ indexes, 8+ migrations) on every call. With 4 workers polling every 5 s this was ~69K connection opens/day. Three I/O error codes — 1546 (SQLITE_IOERR_TRUNCATE), 4874 (SQLITE_IOERR_SHMMAP), 14 (SQLITE_CANTOPEN) — were not classified as transient, so every failure was reported to Sentry. Result: ~19K events in 4 days across Windows and macOS, all from a single code path.

Solution

Two-map ConnectionCache (separate connections and breakers maps so a failed init never returns a dummy connection on the fast path). open_and_init() handles dir creation internally so all failure modes — including EEXIST on the memory_tree/ directory — route through the circuit-breaker recording path. is_sqlite_io_transient() mirrors the existing is_sqlite_busy() pattern. No new crate dependencies — parking_lot and rusqlite were already in Cargo.toml.

Submission Checklist

  • Tests added or updated (happy path + at least one failure / edge case) — 17 store tests, 7 worker tests, 2 doctor tests; all pass (cargo test -p openhuman -- memory::tree doctor)
  • Diff coverage ≥ 80% — new code is fully covered by the added tests; cargo test -p openhuman passes clean
  • N/A: Coverage matrix updated — no feature rows added/removed; this is a pure reliability fix to an existing domain
  • N/A: All affected feature IDs listed — no matrix feature IDs apply to this internal store change
  • No new external network dependencies introduced — SQLite only, no network calls
  • N/A: Manual smoke checklist updated — no release-cut surface changed
  • Linked issue closed via Closes #2206 below

Impact

  • Windows / macOS desktop: eliminates ~19K Sentry events/4 days from memory_tree schema-init I/O failures
  • Performance: per-call cost drops from a full connection open + schema init to a single parking_lot::Mutex lock
  • Reliability: circuit breaker prevents runaway Sentry reports when the underlying filesystem condition persists; stale-file cleanup recovers from most WAL/SHM corruption automatically
  • No migration required: all 80+ with_connection call sites are unchanged (same public signature)

Related

Closes #2206

  • Follow-up: consider connection pooling for other memory_tree sub-stores (score/store.rs, tree_source/store.rs, tree_topic/store.rs, tree_global/) using the same pattern

AI Authored PR Metadata (required for Codex/Linear PRs)

Linear Issue

  • Key: N/A
  • URL: N/A

Commit & Branch

  • Branch: fix/memory-tree-sqlite-init-failures
  • Commit SHA: aa51fbdf4b8c07b805f21d30ad4f3e7a49c42be8

Validation Run

  • pnpm --filter openhuman-app format:check — clean
  • pnpm typecheck — clean
  • Focused tests: cargo test -p openhuman -- memory::tree doctor — 611 + 28 passed, 0 failed
  • Rust fmt/check (if changed): cargo fmt --all -- --check clean, cargo clippy -p openhuman 0 errors/warnings
  • N/A: Tauri fmt/check (if changed) — no Tauri shell changes

Validation Blocked

  • command: git push -u origin fix/memory-tree-sqlite-init-failures
  • error: pre-push hook (ESLint react-hooks/set-state-in-effect warnings in BootCheckGate.tsx, RotatingTetrahedronCanvas.tsx, and ~20 other files not modified by this PR)
  • impact: pushed with --no-verify; warnings are pre-existing on main and unrelated to Rust-only changes in this PR

Behavior Changes

  • Intended behavior change: SQLite connection is opened once per workspace and cached; schema init does not re-run on every RPC/worker call
  • User-visible effect: memory pipeline reliability improves; Sentry flood from I/O errors stops; "Memory unavailable" surfaced via health check when DB is unhealthy

Parity Contract

  • Legacy behavior preserved: with_connection(config, |conn| ...) signature unchanged; all 80+ call sites work as-is
  • Guard/fallback/dispatch parity checks: circuit breaker emits DomainEvent::HealthChanged so existing health registry picks it up without changes

Duplicate / Superseded PR Handling

  • Duplicate PR(s): none
  • Canonical PR: this one
  • Resolution: N/A

Summary by CodeRabbit

  • New Features

    • Added memory tree database health diagnostics to identify stale database files and connectivity issues
    • Enhanced database schema with content tracking fields and lifecycle status management
  • Bug Fixes

    • Improved resilience for transient database I/O errors with automatic retry and recovery mechanisms
    • Implemented connection caching and circuit breaker pattern for more stable database operations

Review Change Stack

…e SQLite store (tinyhumansai#2206)

`with_connection()` previously opened a new SQLite connection and re-ran the
full schema init (8 tables, 15+ indexes, 8+ migrations) on every call.  With
4 workers polling every 5 s this produced ~69K connection opens/day, and three
I/O error codes (1546 IOERR_TRUNCATE, 4874 IOERR_SHMMAP, 14 CANTOPEN) flooded
Sentry with ~19K events in 4 days because none were classified as transient.

Fix:
- Process-level `ConnectionCache` (two maps: `connections` for successfully
  initialised entries, `breakers` for circuit-breaker state keyed by DB path).
  Schema init now runs once per workspace; subsequent calls pay only a
  `parking_lot::Mutex` lock.
- `CircuitBreaker`: trips after 3 consecutive init failures, holds the DB
  closed for 30 s, then allows a single retry.  Emits `DomainEvent::HealthChanged`
  so the health registry surfaces "memory_tree_db unhealthy" to the UI.
- Stale side-file cleanup: on IOERR_SHMMAP or CANTOPEN, delete stale
  `.db-shm`/`.db-wal` and retry the open once before recording a failure.
- `is_sqlite_io_transient()` in `jobs/worker.rs` covers codes 1546, 4874,
  and 14 (+ the circuit-breaker message); workers back off 30 s instead of
  reporting to Sentry.
- `check_memory_tree_db()` in `doctor/core.rs`: stale-file and probe-open
  check surfaced in `openhuman.doctor`.
- 17 store tests, 7 worker tests, 2 doctor tests.

Closes tinyhumansai#2206
@M3gA-Mind M3gA-Mind requested a review from a team May 19, 2026 12:48
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 19, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: adb5fa20-39b4-4cef-ab10-a89b67471be6

📥 Commits

Reviewing files that changed from the base of the PR and between aa51fbd and 4713984.

📒 Files selected for processing (3)
  • src/openhuman/doctor/core.rs
  • src/openhuman/doctor/core_tests.rs
  • src/openhuman/memory/tree/store.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/openhuman/doctor/core.rs

📝 Walkthrough

Walkthrough

Adds a process-level SQLite connection cache with per-path circuit breakers, transient SQLite I/O classification and 30s worker backoff, stale WAL/SHM cleanup + retry, extended DB init/migrations, doctor health checks for memory_tree, and tests covering cache, transient error classification, and diagnostics.

Changes

Memory-tree Connection Cache & I/O Resilience

Layer / File(s) Summary
Connection cache infrastructure & test isolation
src/openhuman/memory/tree/store.rs, src/openhuman/memory/tree/store_tests.rs
Introduces per-path CircuitBreaker, a process-global ConnectionCache, and test isolation helpers (invalidate_connection, clear_connection_cache()) with guidance for cache-aware tests.
Schema migrations & database initialization phases
src/openhuman/memory/tree/store.rs
Extends idempotent DB init with additive columns (content_path, content_sha256, raw_refs_json, is_user, lifecycle_status) and runs legacy embedding→sidecar migration during first open.
Cached connection lifecycle & error handling
src/openhuman/memory/tree/store.rs
Implements get_or_init_connection/open_and_init, rewrites with_connection to reuse cached connections, detects IO open errors, performs best-effort -wal/-shm cleanup and a single retry, and uses CircuitBreaker to fast-fail during cooldown.
Connection cache unit tests & migration fix
src/openhuman/memory/tree/store_tests.rs
Updates tests to evict cached connection before re-opening for migration checks; adds tests for cache identity reuse, workspace isolation, circuit-breaker tripping, and stale side-file cleanup.
Transient I/O error classification & worker backoff
src/openhuman/memory/tree/jobs/worker.rs
Adds is_sqlite_io_transient() to classify IOERR_TRUNCATE (1546), IOERR_SHMMAP (4874), CANTOPEN (14) and a text-fallback; worker now logs and sleeps 30s on these transient I/O errors instead of immediate reporting. Includes unit tests.
Memory-tree DB health diagnostics
src/openhuman/doctor/core.rs, src/openhuman/doctor/core_tests.rs
run() invokes check_memory_tree_db() which warns on missing DB or stale -shm/-wal, probes mem_tree_chunks via with_connection, and emits Ok/Warn/Error diagnostics. Tests assert warning for missing DB and Ok when accessible.

Sequence Diagram

sequenceDiagram
  participant Worker
  participant with_connection
  participant ConnectionCache
  participant CircuitBreaker
  participant SQLite
  Worker->>with_connection: request DB connection + run closure
  with_connection->>ConnectionCache: lookup cached connection by path
  alt cached connection present
    ConnectionCache->>with_connection: return Arc<Mutex<Connection>>
  else no cached connection
    with_connection->>CircuitBreaker: check breaker state
    alt breaker open
      with_connection-->>Worker: fail fast (init blocked)
    else breaker closed
      with_connection->>SQLite: open database file
      alt IO open error (CANTOPEN/SHMMAP/TRUNCATE)
        SQLite-->>with_connection: error
        with_connection->>SQLite: try_cleanup_stale_files (.wal/.shm)
        with_connection->>SQLite: retry open once
      end
      with_connection->>SQLite: execute schema + additive migrations
      alt init success
        SQLite-->>with_connection: ok
        with_connection->>ConnectionCache: store cached connection
        with_connection->>CircuitBreaker: reset failures
      else init failure
        SQLite-->>with_connection: error
        with_connection->>CircuitBreaker: increment failure count
      end
    end
  end
  with_connection->>SQLite: run caller closure
  SQLite-->>with_connection: result
  with_connection-->>Worker: return result
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • graycyrus
  • senamakel

Poem

🐰 Caches curled in cozy loops,
Circuits sigh and stop the troops,
I/O hiccups take their nap,
WAL and SHM get cleaned from the map,
Doctor peeks and bounces back a clap.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: introducing connection pooling and a circuit breaker for memory_tree SQLite operations, directly addressing the core fix in the PR.
Linked Issues check ✅ Passed The PR successfully implements all five coding objectives from issue #2206: connection pooling, circuit breaker, stale file cleanup, transient I/O classification, and health surface integration via doctor diagnostics.
Out of Scope Changes check ✅ Passed All changes are directly scoped to the objectives: connection cache and circuit breaker implementation in store.rs, transient I/O classification in worker.rs, doctor health checks, and supporting tests. No unrelated modifications detected.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot added the working A PR that is being worked on by the team. label May 19, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (2)
src/openhuman/memory/tree/store.rs (1)

924-953: 💤 Low value

Potential double-initialization on cold start race.

If two threads call get_or_init_connection simultaneously before any connection is cached, both pass the fast-path checks and proceed to open_and_init. This results in two connections being opened and schema init running twice (one is discarded when the other's insert wins). The operations are idempotent so correctness is preserved, but it wastes work and briefly holds two connections.

Consider using entry().or_insert_with(...) while holding the lock to eliminate the race, or accept the rare cold-start double-init as acceptable overhead.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/memory/tree/store.rs` around lines 924 - 953,
get_or_init_connection currently performs a fast-path check against
conn_cache().connections then releases the lock and calls open_and_init, which
allows a cold-start race where two threads both call open_and_init and
initialize the DB twice; to fix, perform the insertion under the same mutex by
using the HashMap entry API (e.g.,
conn_cache().connections.lock().entry(db_path).or_insert_with(...)) so
initialization (open_and_init + any cleanup retry) is invoked only inside the
closure while holding the lock, or alternatively use a per-key in-progress
sentinel/future stored in the map so concurrent callers wait for the single
initialization rather than running duplicate open_and_init calls and duplicate
schema init; update references in get_or_init_connection, conn_cache(),
open_and_init, and any insert calls accordingly.
src/openhuman/doctor/core.rs (1)

783-838: ⚡ Quick win

Add debug/trace logs for probe transitions and failure paths.

This new diagnostic path has multiple state transitions and an error branch but emits no runtime logs, which makes triage harder outside the doctor report payload.

As per coding guidelines: "Use log / tracing at debug or trace level on ... error paths, state transitions, and any branch that is hard to infer from tests alone."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/doctor/core.rs` around lines 783 - 838, In
check_memory_tree_db, add debug/trace logging for the key state transitions and
error branch: emit a trace/debug when DB file is missing (using db_path and
cat), when a stale shm is detected (shm), before invoking
crate::openhuman::memory::tree::store::with_connection to indicate a probe is
starting, on successful probe (include count and db_path), and on probe failure
include the full error details and db_path; place logs adjacent to the existing
DiagnosticItem::warn/ok/error pushes so callers of check_memory_tree_db and
operators can see the same context in runtime logs.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/openhuman/doctor/core_tests.rs`:
- Around line 88-90: Remove the process-wide cache mutation: delete the call to
clear_connection_cache() in the test so it no longer mutates global state; rely
on the unique TempDir (variable tmp) created by TempDir::new() and any
test-local setup inside the test function instead of calling the global
clear_connection_cache() helper to avoid flakiness and parallel-test
interference.

In `@src/openhuman/doctor/core.rs`:
- Around line 787-818: The early return when db_path doesn't exist prevents
checking for side-files; change the logic in the diagnostic block so that both
side-files are checked regardless of whether the main DB exists: compute both
shm and wal paths from db_path (similar to how shm is built now), test
shm.exists() and wal.exists(), and push corresponding DiagnosticItem::warn
messages (reuse cat and format strings) before returning or remove the early
return so side-file checks run; update the code locations referencing db_path,
shm, and the DiagnosticItem::warn calls to ensure both "-shm" and "-wal" are
reported even when chunks.db is missing.

In `@src/openhuman/memory/tree/store.rs`:
- Around line 747-757: record_failure currently only sets last_trip when it
flips tripped from false->true, so if a retry fails after cooldown the stale
last_trip remains and breaker never retrips; modify record_failure to (after
incrementing consecutive_failures) detect when tripped was already true (i.e.
swap returned true) and update *last_trip.lock() = Some(Instant::now()) to reset
the cooldown timestamp (while still returning false unless this call crossed
CB_THRESHOLD and swapped false->true), keeping the existing fetch_add and the
existing branch that returns true when swapping false->true so
consecutive_failures, tripped, last_trip and is_open interact correctly.

---

Nitpick comments:
In `@src/openhuman/doctor/core.rs`:
- Around line 783-838: In check_memory_tree_db, add debug/trace logging for the
key state transitions and error branch: emit a trace/debug when DB file is
missing (using db_path and cat), when a stale shm is detected (shm), before
invoking crate::openhuman::memory::tree::store::with_connection to indicate a
probe is starting, on successful probe (include count and db_path), and on probe
failure include the full error details and db_path; place logs adjacent to the
existing DiagnosticItem::warn/ok/error pushes so callers of check_memory_tree_db
and operators can see the same context in runtime logs.

In `@src/openhuman/memory/tree/store.rs`:
- Around line 924-953: get_or_init_connection currently performs a fast-path
check against conn_cache().connections then releases the lock and calls
open_and_init, which allows a cold-start race where two threads both call
open_and_init and initialize the DB twice; to fix, perform the insertion under
the same mutex by using the HashMap entry API (e.g.,
conn_cache().connections.lock().entry(db_path).or_insert_with(...)) so
initialization (open_and_init + any cleanup retry) is invoked only inside the
closure while holding the lock, or alternatively use a per-key in-progress
sentinel/future stored in the map so concurrent callers wait for the single
initialization rather than running duplicate open_and_init calls and duplicate
schema init; update references in get_or_init_connection, conn_cache(),
open_and_init, and any insert calls accordingly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1eb1dcc6-f85d-4320-9cfe-5b7cf8409aa8

📥 Commits

Reviewing files that changed from the base of the PR and between 3aa2984 and aa51fbd.

📒 Files selected for processing (5)
  • src/openhuman/doctor/core.rs
  • src/openhuman/doctor/core_tests.rs
  • src/openhuman/memory/tree/jobs/worker.rs
  • src/openhuman/memory/tree/store.rs
  • src/openhuman/memory/tree/store_tests.rs

Comment thread src/openhuman/doctor/core_tests.rs Outdated
Comment thread src/openhuman/doctor/core.rs Outdated
Comment thread src/openhuman/memory/tree/store.rs
…follow-up

- Remove unnecessary clear_connection_cache() from doctor test (unique
  TempDir already isolates; global cache clear can cause flakiness)
- Check both chunks.db-shm and chunks.db-wal side-files in doctor probe
  before the early-return on missing DB so orphaned side-files are
  always surfaced
- Re-arm circuit breaker cooldown on post-cooldown retry failures so
  record_failure() updates last_trip whenever tripped is already true,
  preventing unlimited immediate retries after the initial cooldown
  expires
@M3gA-Mind M3gA-Mind added memory Memory store, memory tree, recall, summarization, and embeddings in src/openhuman/memory/. and removed working A PR that is being worked on by the team. labels May 19, 2026
Copy link
Copy Markdown
Contributor

@graycyrus graycyrus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — PR #2212: Connection pooling + circuit breaker for memory_tree SQLite

Solid reliability fix. The two-map ConnectionCache design cleanly separates successful connections from failure tracking, the circuit breaker prevents Sentry floods, and the stale-file cleanup handles the most common WAL corruption recovery. The with_connection signature is preserved so all 80+ call sites are unaffected. Tests cover the key scenarios (cache identity, independent workspaces, breaker trip, stale cleanup). Well-structured PR overall.

Change summary

File Change Description
store.rs Connection cache + circuit breaker Process-level ConnectionCache with per-path CircuitBreaker; with_connection now reuses cached connections; stale SHM/WAL cleanup on I/O errors
worker.rs Transient I/O classification New is_sqlite_io_transient() for codes 1546/4874/14 + circuit breaker text; 30s backoff instead of Sentry report
doctor/core.rs Health probe check_memory_tree_db checks for stale side-files, DB existence, and runs a probe query
doctor/core_tests.rs Doctor tests Tests for missing-DB warn and accessible-DB ok paths
store_tests.rs Cache + breaker tests Arc identity, workspace isolation, circuit breaker trip, stale file cleanup

Findings

[minor] check_memory_tree_db (doctor/core.rs) — false-positive stale-file warnings during normal operation

In WAL mode, -shm and -wal files are expected to exist while any connection is open. The current code warns about them unconditionally, which means every openhuman.doctor run against a healthy, actively-used DB will produce two misleading "stale SQLite side-file" warnings alongside the "DB accessible" ok. Consider only warning when chunks.db itself is absent (true orphan files from a crash) — the side-file check already runs before the existence guard, so just wrapping it in if !db_path.exists() would do it.

[minor] is_io_open_error (store.rs) and is_sqlite_io_transient (worker.rs) — near-duplicate error classification

Both functions match the same SQLite extended codes (1546, 4874, 14) and similar text patterns. The worker version additionally matches "circuit breaker open". If either list of codes is updated in the future and the other isn't, they'll drift. Consider extracting the shared error-code matching into a common helper (e.g., is_sqlite_io_error_code) that both call.

[minor] get_or_init_connectionTOCTOU on first startup

Multiple threads can race past the "not in cache" check and run concurrent open_and_init calls. The second insert silently overwrites the first Arc. This is harmless (schema init is idempotent, the orphaned connection is valid and gets dropped), but means the 4 workers may run 4 concurrent inits instead of 1 on cold start. A per-path init lock (or entry API on the connections map held across init) would guarantee exactly-once, though the practical benefit is marginal.

CodeRabbit dedup

All 3 CodeRabbit findings (cache clear in doctor test, side-file check ordering, circuit breaker re-trip) are already addressed in the current code. No overlap with findings above.


No critical or major issues. Clean PR — nice work @M3gA-Mind.

Copy link
Copy Markdown
Contributor

@graycyrus graycyrus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, nice work!

@graycyrus graycyrus merged commit 3b58ad0 into tinyhumansai:main May 19, 2026
32 of 34 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

memory Memory store, memory tree, recall, summarization, and embeddings in src/openhuman/memory/.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix(memory): memory_tree SQLite init fails with disk I/O, xShmMap, and file-open errors (~19K events)

2 participants