fix(memory): connection pooling + circuit breaker for memory_tree SQLite (#2206)#2212
Conversation
…e SQLite store (tinyhumansai#2206) `with_connection()` previously opened a new SQLite connection and re-ran the full schema init (8 tables, 15+ indexes, 8+ migrations) on every call. With 4 workers polling every 5 s this produced ~69K connection opens/day, and three I/O error codes (1546 IOERR_TRUNCATE, 4874 IOERR_SHMMAP, 14 CANTOPEN) flooded Sentry with ~19K events in 4 days because none were classified as transient. Fix: - Process-level `ConnectionCache` (two maps: `connections` for successfully initialised entries, `breakers` for circuit-breaker state keyed by DB path). Schema init now runs once per workspace; subsequent calls pay only a `parking_lot::Mutex` lock. - `CircuitBreaker`: trips after 3 consecutive init failures, holds the DB closed for 30 s, then allows a single retry. Emits `DomainEvent::HealthChanged` so the health registry surfaces "memory_tree_db unhealthy" to the UI. - Stale side-file cleanup: on IOERR_SHMMAP or CANTOPEN, delete stale `.db-shm`/`.db-wal` and retry the open once before recording a failure. - `is_sqlite_io_transient()` in `jobs/worker.rs` covers codes 1546, 4874, and 14 (+ the circuit-breaker message); workers back off 30 s instead of reporting to Sentry. - `check_memory_tree_db()` in `doctor/core.rs`: stale-file and probe-open check surfaced in `openhuman.doctor`. - 17 store tests, 7 worker tests, 2 doctor tests. Closes tinyhumansai#2206
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughAdds a process-level SQLite connection cache with per-path circuit breakers, transient SQLite I/O classification and 30s worker backoff, stale WAL/SHM cleanup + retry, extended DB init/migrations, doctor health checks for memory_tree, and tests covering cache, transient error classification, and diagnostics. ChangesMemory-tree Connection Cache & I/O Resilience
Sequence DiagramsequenceDiagram
participant Worker
participant with_connection
participant ConnectionCache
participant CircuitBreaker
participant SQLite
Worker->>with_connection: request DB connection + run closure
with_connection->>ConnectionCache: lookup cached connection by path
alt cached connection present
ConnectionCache->>with_connection: return Arc<Mutex<Connection>>
else no cached connection
with_connection->>CircuitBreaker: check breaker state
alt breaker open
with_connection-->>Worker: fail fast (init blocked)
else breaker closed
with_connection->>SQLite: open database file
alt IO open error (CANTOPEN/SHMMAP/TRUNCATE)
SQLite-->>with_connection: error
with_connection->>SQLite: try_cleanup_stale_files (.wal/.shm)
with_connection->>SQLite: retry open once
end
with_connection->>SQLite: execute schema + additive migrations
alt init success
SQLite-->>with_connection: ok
with_connection->>ConnectionCache: store cached connection
with_connection->>CircuitBreaker: reset failures
else init failure
SQLite-->>with_connection: error
with_connection->>CircuitBreaker: increment failure count
end
end
end
with_connection->>SQLite: run caller closure
SQLite-->>with_connection: result
with_connection-->>Worker: return result
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (2)
src/openhuman/memory/tree/store.rs (1)
924-953: 💤 Low valuePotential double-initialization on cold start race.
If two threads call
get_or_init_connectionsimultaneously before any connection is cached, both pass the fast-path checks and proceed toopen_and_init. This results in two connections being opened and schema init running twice (one is discarded when the other'sinsertwins). The operations are idempotent so correctness is preserved, but it wastes work and briefly holds two connections.Consider using
entry().or_insert_with(...)while holding the lock to eliminate the race, or accept the rare cold-start double-init as acceptable overhead.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/memory/tree/store.rs` around lines 924 - 953, get_or_init_connection currently performs a fast-path check against conn_cache().connections then releases the lock and calls open_and_init, which allows a cold-start race where two threads both call open_and_init and initialize the DB twice; to fix, perform the insertion under the same mutex by using the HashMap entry API (e.g., conn_cache().connections.lock().entry(db_path).or_insert_with(...)) so initialization (open_and_init + any cleanup retry) is invoked only inside the closure while holding the lock, or alternatively use a per-key in-progress sentinel/future stored in the map so concurrent callers wait for the single initialization rather than running duplicate open_and_init calls and duplicate schema init; update references in get_or_init_connection, conn_cache(), open_and_init, and any insert calls accordingly.src/openhuman/doctor/core.rs (1)
783-838: ⚡ Quick winAdd debug/trace logs for probe transitions and failure paths.
This new diagnostic path has multiple state transitions and an error branch but emits no runtime logs, which makes triage harder outside the doctor report payload.
As per coding guidelines: "Use
log/tracingatdebugortracelevel on ... error paths, state transitions, and any branch that is hard to infer from tests alone."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/doctor/core.rs` around lines 783 - 838, In check_memory_tree_db, add debug/trace logging for the key state transitions and error branch: emit a trace/debug when DB file is missing (using db_path and cat), when a stale shm is detected (shm), before invoking crate::openhuman::memory::tree::store::with_connection to indicate a probe is starting, on successful probe (include count and db_path), and on probe failure include the full error details and db_path; place logs adjacent to the existing DiagnosticItem::warn/ok/error pushes so callers of check_memory_tree_db and operators can see the same context in runtime logs.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/openhuman/doctor/core_tests.rs`:
- Around line 88-90: Remove the process-wide cache mutation: delete the call to
clear_connection_cache() in the test so it no longer mutates global state; rely
on the unique TempDir (variable tmp) created by TempDir::new() and any
test-local setup inside the test function instead of calling the global
clear_connection_cache() helper to avoid flakiness and parallel-test
interference.
In `@src/openhuman/doctor/core.rs`:
- Around line 787-818: The early return when db_path doesn't exist prevents
checking for side-files; change the logic in the diagnostic block so that both
side-files are checked regardless of whether the main DB exists: compute both
shm and wal paths from db_path (similar to how shm is built now), test
shm.exists() and wal.exists(), and push corresponding DiagnosticItem::warn
messages (reuse cat and format strings) before returning or remove the early
return so side-file checks run; update the code locations referencing db_path,
shm, and the DiagnosticItem::warn calls to ensure both "-shm" and "-wal" are
reported even when chunks.db is missing.
In `@src/openhuman/memory/tree/store.rs`:
- Around line 747-757: record_failure currently only sets last_trip when it
flips tripped from false->true, so if a retry fails after cooldown the stale
last_trip remains and breaker never retrips; modify record_failure to (after
incrementing consecutive_failures) detect when tripped was already true (i.e.
swap returned true) and update *last_trip.lock() = Some(Instant::now()) to reset
the cooldown timestamp (while still returning false unless this call crossed
CB_THRESHOLD and swapped false->true), keeping the existing fetch_add and the
existing branch that returns true when swapping false->true so
consecutive_failures, tripped, last_trip and is_open interact correctly.
---
Nitpick comments:
In `@src/openhuman/doctor/core.rs`:
- Around line 783-838: In check_memory_tree_db, add debug/trace logging for the
key state transitions and error branch: emit a trace/debug when DB file is
missing (using db_path and cat), when a stale shm is detected (shm), before
invoking crate::openhuman::memory::tree::store::with_connection to indicate a
probe is starting, on successful probe (include count and db_path), and on probe
failure include the full error details and db_path; place logs adjacent to the
existing DiagnosticItem::warn/ok/error pushes so callers of check_memory_tree_db
and operators can see the same context in runtime logs.
In `@src/openhuman/memory/tree/store.rs`:
- Around line 924-953: get_or_init_connection currently performs a fast-path
check against conn_cache().connections then releases the lock and calls
open_and_init, which allows a cold-start race where two threads both call
open_and_init and initialize the DB twice; to fix, perform the insertion under
the same mutex by using the HashMap entry API (e.g.,
conn_cache().connections.lock().entry(db_path).or_insert_with(...)) so
initialization (open_and_init + any cleanup retry) is invoked only inside the
closure while holding the lock, or alternatively use a per-key in-progress
sentinel/future stored in the map so concurrent callers wait for the single
initialization rather than running duplicate open_and_init calls and duplicate
schema init; update references in get_or_init_connection, conn_cache(),
open_and_init, and any insert calls accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 1eb1dcc6-f85d-4320-9cfe-5b7cf8409aa8
📒 Files selected for processing (5)
src/openhuman/doctor/core.rssrc/openhuman/doctor/core_tests.rssrc/openhuman/memory/tree/jobs/worker.rssrc/openhuman/memory/tree/store.rssrc/openhuman/memory/tree/store_tests.rs
…follow-up - Remove unnecessary clear_connection_cache() from doctor test (unique TempDir already isolates; global cache clear can cause flakiness) - Check both chunks.db-shm and chunks.db-wal side-files in doctor probe before the early-return on missing DB so orphaned side-files are always surfaced - Re-arm circuit breaker cooldown on post-cooldown retry failures so record_failure() updates last_trip whenever tripped is already true, preventing unlimited immediate retries after the initial cooldown expires
graycyrus
left a comment
There was a problem hiding this comment.
Review — PR #2212: Connection pooling + circuit breaker for memory_tree SQLite
Solid reliability fix. The two-map ConnectionCache design cleanly separates successful connections from failure tracking, the circuit breaker prevents Sentry floods, and the stale-file cleanup handles the most common WAL corruption recovery. The with_connection signature is preserved so all 80+ call sites are unaffected. Tests cover the key scenarios (cache identity, independent workspaces, breaker trip, stale cleanup). Well-structured PR overall.
Change summary
| File | Change | Description |
|---|---|---|
store.rs |
Connection cache + circuit breaker | Process-level ConnectionCache with per-path CircuitBreaker; with_connection now reuses cached connections; stale SHM/WAL cleanup on I/O errors |
worker.rs |
Transient I/O classification | New is_sqlite_io_transient() for codes 1546/4874/14 + circuit breaker text; 30s backoff instead of Sentry report |
doctor/core.rs |
Health probe | check_memory_tree_db checks for stale side-files, DB existence, and runs a probe query |
doctor/core_tests.rs |
Doctor tests | Tests for missing-DB warn and accessible-DB ok paths |
store_tests.rs |
Cache + breaker tests | Arc identity, workspace isolation, circuit breaker trip, stale file cleanup |
Findings
[minor] check_memory_tree_db (doctor/core.rs) — false-positive stale-file warnings during normal operation
In WAL mode, -shm and -wal files are expected to exist while any connection is open. The current code warns about them unconditionally, which means every openhuman.doctor run against a healthy, actively-used DB will produce two misleading "stale SQLite side-file" warnings alongside the "DB accessible" ok. Consider only warning when chunks.db itself is absent (true orphan files from a crash) — the side-file check already runs before the existence guard, so just wrapping it in if !db_path.exists() would do it.
[minor] is_io_open_error (store.rs) and is_sqlite_io_transient (worker.rs) — near-duplicate error classification
Both functions match the same SQLite extended codes (1546, 4874, 14) and similar text patterns. The worker version additionally matches "circuit breaker open". If either list of codes is updated in the future and the other isn't, they'll drift. Consider extracting the shared error-code matching into a common helper (e.g., is_sqlite_io_error_code) that both call.
[minor] get_or_init_connection — TOCTOU on first startup
Multiple threads can race past the "not in cache" check and run concurrent open_and_init calls. The second insert silently overwrites the first Arc. This is harmless (schema init is idempotent, the orphaned connection is valid and gets dropped), but means the 4 workers may run 4 concurrent inits instead of 1 on cold start. A per-path init lock (or entry API on the connections map held across init) would guarantee exactly-once, though the practical benefit is marginal.
CodeRabbit dedup
All 3 CodeRabbit findings (cache clear in doctor test, side-file check ordering, circuit breaker re-trip) are already addressed in the current code. No overlap with findings above.
No critical or major issues. Clean PR — nice work @M3gA-Mind.
graycyrus
left a comment
There was a problem hiding this comment.
Looks good, nice work!
Summary
memory_treestore with a process-level connection cache keyed by workspace path — schema init runs once, not ~69K times/dayCircuitBreaker(3 consecutive failures → 30 s closed) so a broken DB install stops flooding Sentry instead of reporting on every worker poll.db-shm/.db-walcleanup onIOERR_SHMMAP/CANTOPENwith a single retry before recording a failureis_sqlite_io_transient()(codes 1546, 4874, 14) so those errors trigger a 30 s backoff instead of Sentry reportsmemory_tree_dbhealth viaDomainEvent::HealthChangedand add a probe check toopenhuman.doctorProblem
with_connection()insrc/openhuman/memory/tree/store.rsopened a new SQLite connection and re-ran the full schema init (8 tables, 15+ indexes, 8+ migrations) on every call. With 4 workers polling every 5 s this was ~69K connection opens/day. Three I/O error codes — 1546 (SQLITE_IOERR_TRUNCATE), 4874 (SQLITE_IOERR_SHMMAP), 14 (SQLITE_CANTOPEN) — were not classified as transient, so every failure was reported to Sentry. Result: ~19K events in 4 days across Windows and macOS, all from a single code path.Solution
Two-map
ConnectionCache(separateconnectionsandbreakersmaps so a failed init never returns a dummy connection on the fast path).open_and_init()handles dir creation internally so all failure modes — includingEEXISTon thememory_tree/directory — route through the circuit-breaker recording path.is_sqlite_io_transient()mirrors the existingis_sqlite_busy()pattern. No new crate dependencies —parking_lotandrusqlitewere already inCargo.toml.Submission Checklist
cargo test -p openhuman -- memory::tree doctor)cargo test -p openhumanpasses cleanCloses #2206belowImpact
memory_treeschema-init I/O failuresparking_lot::Mutexlockwith_connectioncall sites are unchanged (same public signature)Related
Closes #2206
memory_treesub-stores (score/store.rs,tree_source/store.rs,tree_topic/store.rs,tree_global/) using the same patternAI Authored PR Metadata (required for Codex/Linear PRs)
Linear Issue
Commit & Branch
fix/memory-tree-sqlite-init-failuresaa51fbdf4b8c07b805f21d30ad4f3e7a49c42be8Validation Run
pnpm --filter openhuman-app format:check— cleanpnpm typecheck— cleancargo test -p openhuman -- memory::tree doctor— 611 + 28 passed, 0 failedcargo fmt --all -- --checkclean,cargo clippy -p openhuman0 errors/warningsValidation Blocked
git push -u origin fix/memory-tree-sqlite-init-failuresreact-hooks/set-state-in-effectwarnings inBootCheckGate.tsx,RotatingTetrahedronCanvas.tsx, and ~20 other files not modified by this PR)--no-verify; warnings are pre-existing onmainand unrelated to Rust-only changes in this PRBehavior Changes
Parity Contract
with_connection(config, |conn| ...)signature unchanged; all 80+ call sites work as-isDomainEvent::HealthChangedso existing health registry picks it up without changesDuplicate / Superseded PR Handling
Summary by CodeRabbit
New Features
Bug Fixes