fix(memory): serialize memory_tree schema init across worker pool#2059
Conversation
Four tree_jobs_worker tasks were racing into with_connection on cold start, each hitting SQLite WAL/SHM bootstrap simultaneously. Surfaced as ~19k events/day across three codes — 14 (CANTOPEN), 1546 (IOERR_TRUNCATE), 4874 (IOERR_SHMMAP) — all reported as "Failed to initialize memory_tree schema". Gate apply_schema behind a process-wide Mutex<HashSet<PathBuf>> so only the first caller per DB path runs WAL setup + SCHEMA + ALTERs; subsequent callers see the path in the set and skip the work. Keyed on PathBuf (not "process-wide once") so per-test tempdir configs each initialise independently — see feedback_oncelock_sqlite_test_isolation. Belt-and-suspenders: retry the open+init flow up to 3x on the three transient cold-start codes with 50→150→500ms backoff before surfacing the error. Closes OPENHUMAN-TAURI-HH Closes OPENHUMAN-TAURI-ZM Closes OPENHUMAN-TAURI-MB
Spawns 8 threads racing into with_connection on a cold tempdir DB. Asserts all callers return Ok and apply_schema runs exactly once per path — the structural guarantee that the mutex in the parent commit eliminates the OPENHUMAN-TAURI-HH/-ZM/-MB cold-start race. Per-path counter (not global) so the test is robust against other tests in the same binary running in parallel against their own tempdirs.
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughSQLite initialization now uses per-DB-path guards and retries transient cold-starts with backoff; schema application is idempotent and extended with additive migrations. Tests and a smoke binary validate concurrent init correctness and that foreign key enforcement is enabled per connection. ChangesConcurrent-Safe SQLite Initialization with Schema Extension
Sequence DiagramsequenceDiagram
participant WorkerThread
participant with_connection
participant open_and_init_with_retry
participant InitializationGuard as Guard
participant SQLiteDB as DB
WorkerThread->>with_connection: request connection(path)
with_connection->>open_and_init_with_retry: open/init path
open_and_init_with_retry->>Guard: check/acquire per-path lock
Guard->>open_and_init_with_retry: grant or block
open_and_init_with_retry->>DB: open and run SCHEMA / migrations (if leader)
open_and_init_with_retry->>with_connection: return Connection
with_connection->>WorkerThread: provide Connection
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/openhuman/memory/tree/store.rs`:
- Around line 758-815: The connection-level PRAGMA for foreign key enforcement
must be applied in open_connection so every returned Connection has FK checks
enabled; update open_connection (the function that calls Connection::open and
configures busy_timeout) to run "PRAGMA foreign_keys = ON" (or the equivalent
rusqlite call) before returning the conn, and remove or leave FK-related PRAGMA
only in apply_schema for one-time DDL if present, ensuring that
try_open_and_init / open_and_init_with_retry fast-path connections get FK
enforcement too.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 584d3086-9a24-4d25-99cd-a8609b5f0613
📒 Files selected for processing (2)
src/openhuman/memory/tree/store.rssrc/openhuman/memory/tree/store_tests.rs
Manual harness that spins N threads racing into with_connection
against a shared workspace. Cold-start exercises the race window
the OPENHUMAN-TAURI-HH/-ZM/-MB fix in the prior commits closes.
Bumps `with_connection` visibility from pub(crate) to pub so the
bin (a separate crate target) can call it directly. Existing
internal callers (read_rpc, ingest, bucket_seal, score::store)
are unaffected.
Usage:
rm -rf /tmp/mt-smoke
OPENHUMAN_WORKSPACE=/tmp/mt-smoke \
cargo run --bin memory-tree-init-smoke -- 64
Exit 0 if all threads Ok, 1 if any failed.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/bin/memory_tree_init_smoke.rs`:
- Around line 49-54: Replace the std print calls used for smoke diagnostics
(e.g. the eprintln! that prints "[smoke] workspace={} cold_start={} threads={}"
and the other println!/eprintln! occurrences around those blocks) with
structured diagnostic logs using the repo's logging crate (tracing or log) at
debug/trace level; update the top of this module to import and use
tracing::debug! or tracing::trace! (or log::debug! if the project uses log) and
change each eprintln!/println! invocation to the appropriate debug/trace macro
so the messages (including workspace.display(), cold, n and the other formatted
fields in the other two blocks) are emitted via the logging subsystem rather
than stdout/stderr.
- Around line 37-43: The current parsing sets let n: usize =
...parse().expect(...) but allows 0; after parsing (the binding named n)
validate that n > 0 and fail fast if not by panicking or returning an error with
a clear message (e.g., "thread count must be a positive integer greater than
0"); update the parsing block around the variable n (where
std::env::args().nth(1)...parse() is called) to check the parsed value and
replace the current expect with logic that enforces strictly positive thread
count.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 4a52bd65-537c-4e4c-bb80-8e2de785b296
📒 Files selected for processing (3)
Cargo.tomlsrc/bin/memory_tree_init_smoke.rssrc/openhuman/memory/tree/store.rs
🚧 Files skipped from review as they are similar to previous changes (1)
- src/openhuman/memory/tree/store.rs
PRAGMA foreign_keys is connection-local in SQLite — it resets to off on every new Connection::open. The earlier schema-init refactor moved the pragma out of per-call execution into apply_schema, which only runs on first init per DB path. Result: every later with_connection() fast-path call returned a connection with FK enforcement off. Move the pragma to open_connection() so every connection — cold init + fast path — sets it. Add a regression test that calls with_connection twice and asserts PRAGMA foreign_keys returns 1 on both calls. Caught by CodeRabbit review on PR.
Two CodeRabbit review fixes for the memory-tree-init-smoke binary: - Switch all println!/eprintln! to log::info/debug/error and init env_logger so the diagnostics flow through the same subsystem as the rest of the core. Workers default to debug level; control via RUST_LOG (e.g. RUST_LOG=debug to see per-worker results). - Validate the thread-count arg: reject n=0 and unparseable values with a clear error message and exit code 2 (vs 1 = at least one worker failed, 0 = all ok). n=0 previously produced a spurious "0/0 ok" pass with no work done.
graycyrus
left a comment
There was a problem hiding this comment.
Walkthrough
Solid fix for a real production pain point — 19k Sentry events/day from concurrent WAL/SHM bootstrap races is nasty. The mutex-gated SCHEMA_INITIALIZED set with per-path keying is clean, the retry backoff on the three known transient codes is well-targeted, and the FK pragma move to open_connection() (addressing CodeRabbit's earlier catch) is correct. Two issues worth addressing before merge: test instrumentation leaking into production, and the pub visibility escape on with_connection.
Change Summary
| File | Change type | Description |
|---|---|---|
Cargo.toml |
Modified | Added [[bin]] entry for memory-tree-init-smoke manual diagnostic tool |
src/bin/memory_tree_init_smoke.rs |
Added | Manual stress-test binary: N threads racing with_connection on cold workspace |
src/openhuman/memory/tree/store.rs |
Modified | Core fix: SCHEMA_INITIALIZED mutex guard, retry loop with backoff, open_connection() extracted with per-connection FK pragma, apply_schema() split out |
src/openhuman/memory/tree/store_tests.rs |
Modified | Two regression tests: concurrent init serialisation + FK pragma persistence |
…tinyhumansai#2059) The SCHEMA_APPLY_COUNTS HashMap exists only so the concurrent-init regression test can assert apply_schema runs exactly once per path. Before this commit, the static + record_schema_apply ran in release builds too — adding an OnceLock<Mutex<HashMap>>, a mutex acquisition, and a HashMap insert on every cold-start schema init. Wrap both the static and the write side behind cfg(test) so the production binary carries no instrumentation. record_schema_apply still exists in release builds but compiles to a no-op body. Addresses graycyrus review (major) on PR tinyhumansai#2059.
…mansai#2059) Previous commit widened with_connection from pub(crate) to pub so the smoke binary in src/bin/ (a separate crate target) could call it. That side effect made it part of openhuman_core's public surface — any downstream crate could rely on it. Add #[doc(hidden)] and a doc-comment line stating it is NOT a stable API. Lightest-touch fix: smoke bin still resolves the symbol, but rustdoc drops it from generated docs and downstream callers get a clear signal not to depend on it. Addresses graycyrus review (major) on PR tinyhumansai#2059.
…nyhumansai#2059) The retry loop's transient-error gate matched raw extended_code literals (14, 1546, 4874). Replace with named constants so the match arm is self-documenting and grep-friendly: SQLITE_CANTOPEN = 14 SQLITE_IOERR_TRUNCATE = 1546 SQLITE_IOERR_SHMMAP = 4874 is_transient_cold_start was the gatekeeper for the retry loop but had no direct unit test — only exercised incidentally if the concurrent-init test happened to fire a transient (rare on dev machines). Add a targeted test that pins classification for all three cold-start codes and explicitly rejects SQLITE_BUSY + non-SQLite errors so regressions in the match arms surface. Addresses graycyrus review (minor x2) on PR tinyhumansai#2059.
quote_store_round_trips_and_expires called reset_quote_store_for_tests() without acquiring TEST_LOCK, racing against the async execute_prepared_* tests that store a quote and then await. The reset wiped the in-flight quote between store + await, surfacing intermittently as 'quote q_retry not found' in Rust Core Coverage (CI run 26029017503 failed this PR on that race even though the wallet code is untouched here). All other tests in this module acquire TEST_LOCK before touching the shared store. One-line drive-by fix — flake was blocking the Coverage Gate on tinyhumansai#2059.
Resolved Cargo.toml [[bin]] conflict — keep both memory-tree-init-smoke (this PR) and inference-probe (added on main as separate top-level bin entry). Version field auto-resolved to 0.54.0 via release bumps on main. Verified post-merge: - cargo check core + both new bins → ok - memory_tree::store tests: 15/15 - wallet::execution tests: 12/12
graycyrus
left a comment
There was a problem hiding this comment.
Continuation Review — Previous Requested Changes Addressed
All four findings from the previous review have been addressed in follow-up commits:
SCHEMA_APPLY_COUNTStest instrumentation (major) —#[cfg(test)]-gated in 883b295. Static, HashMap import, andrecord_schema_applybody all compile to no-ops in production. ✓with_connectionpub visibility (major) —#[doc(hidden)] pubwith explanatory doc comment in b9fb859. Smoke binary resolves the symbol; rustdoc hides it. ✓- Magic SQLite error codes (minor) — Named constants
SQLITE_CANTOPEN/SQLITE_IOERR_TRUNCATE/SQLITE_IOERR_SHMMAPwith doc comments in 54a104e. ✓ - Missing
is_transient_cold_startunit test (minor) —is_transient_cold_start_classifies_known_extended_codesadded in 54a104e, covering all three transient codes plus negative cases. ✓
Additional Changes Since Last Review
| File | Change type | Description |
|---|---|---|
src/openhuman/wallet/execution.rs |
Bug fix | Added TEST_LOCK guard in quote_store_round_trips_and_expires to prevent cross-test race with reset_quote_store_for_tests() |
| merge commit c7ad0bb | Merge | upstream/main into fix branch — no conflicts |
The wallet test fix follows the established TEST_LOCK pattern already used in that module (see surrounding code in wallet/execution.rs). Clean, no issues.
Summary
Solid fix for a high-volume Sentry issue (~19k events/day). The mutex-gated init set, per-connection FK pragma, and retry-with-backoff are all well-structured. Test coverage is thorough — concurrent init, FK persistence across cold/warm paths, and classifier pin tests. Code follows existing patterns in memory/tree/ (naming, error handling, logging prefixes). No new critical or major findings. All previously requested changes are resolved.
All four review threads have been resolved.
|
@oxoxDev bro conflicts |
…#1574 §7) Reconcile worker-pool race fix (tinyhumansai#2059) with per-(row,model) embedding storage cutover (tinyhumansai#1574 §7): - store.rs: - Keep tinyhumansai#2059's pragma comment AND tinyhumansai#1574's TREE_EMBEDDING_MIGRATION_VERSION const at top. - Keep tinyhumansai#2059's `apply_schema` pure (schema only) — preserves split between `open_connection` (per-call FK pragma), `apply_schema` (first-init only), and the legacy migration. - Wire tinyhumansai#1574's `migrate_legacy_embeddings_to_sidecar(&conn, config)?` into `with_connection` after `open_and_init_with_retry` returns. Fast-path cost is one PRAGMA user_version read once gate set. - store_tests.rs: - Keep all three tinyhumansai#2059 regression tests (`with_connection_serialises_concurrent_schema_init`, `is_transient_cold_start_classifies_known_extended_codes`, `with_connection_keeps_foreign_keys_on_for_every_call`). - Keep tinyhumansai#1574's `legacy_embeddings_migrate_to_sidecar_once`. Verified: `cargo check --tests` clean; 600/600 `openhuman::memory::tree::` lib tests pass, including all 4 conflicting tests.
|
@graycyrus resolved bro — merged upstream/main into branch, reconciled #2059 worker-race split with #1574 §7 legacy→sidecar migration. 600/600 memory_tree tests green. CI re-running. |
# Conflicts: # src/openhuman/memory/tree/store.rs
…nyhumansai#2059) Co-authored-by: Steven Enamakel <enamakel@tinyhumans.ai>
…nyhumansai#2059) Co-authored-by: Steven Enamakel <enamakel@tinyhumans.ai>
Summary
tree_jobs_workertasks were racing intomemory::tree::store::with_connectionon cold start, triggering ~19k Sentry events/day across three SQLite WAL/SHM bootstrap codes.apply_schemabehind a process-wideMutex<HashSet<PathBuf>>so only the first caller per DB path runs WAL setup +SCHEMA+ALTER TABLEmigrations; subsequent callers skip.Okandapply_schemaruns exactly once per path.Problem
Three Sentry issues, all tagged
domain=memory,operation=tree_jobs_worker,worker_idx=0..3, sharing the same root cause:Failed to initialize memory_tree schema: disk I/O error: Error code 1546: Error attempting to truncate fileFailed to initialize memory_tree schema: disk I/O error: Error code 4874: I/O error within the xShmMap method (trying to resize an existing shared-memory segment)Failed to initialize memory_tree schema: unable to open database file: Error code 14: Unable to open the database fileZM + MB share a
trace_id— two SQLite errors in the same event chain. Cross-platform (Windows + macOS) → race, not OS-specific. All three codes are flavors of the same race window on WAL+SHM cold start:SQLITE_CANTOPEN) — lockfile race, WAL not yet readySQLITE_IOERR_TRUNCATE) — WAL truncation contentionSQLITE_IOERR_SHMMAP) — shared-memory segment resize raceThe 4 worker tasks (
WORKER_COUNT = 4inmemory::tree::jobs::worker) all enterwith_connectionviaclaim_nextimmediately after the pool boots. Each connection re-ranPRAGMA journal_mode=WAL+execute_batch(SCHEMA)+ALTER TABLEmigrations — fine for a warm DB, racy on cold start when WAL + SHM files don't yet exist.Solution
Centralised the init in
store.rs:SCHEMA_INITIALIZED: OnceLock<Mutex<HashSet<PathBuf>>>records DB paths whose schema has been applied this process.with_connectioncallsopen_and_init_with_retry(&db_path):busy_timeout+ return.apply_schema(WAL + SCHEMA + ALTERs + index) once, insert path, drop mutex..context()layers still match.PathBuf(not "process-wide once") so per-testtempdir()configs each initialise independently — guards against the stale-handle pattern infeedback_oncelock_sqlite_test_isolation.md.The
busy_timeout = 15sandjournal_mode = WALfrom the existing code are unchanged.Test:
with_connection_serialises_concurrent_schema_initspawns 8 threads racing intowith_connectionon a fresh tempdir. Asserts (a) no errors surface and (b)apply_schemaruns exactly once for the shared path. Per-path counter avoids false positives from parallel tests in the same binary against their own tempdirs.Micro-commits (in order):
fix(memory): serialize memory_tree schema init across workers— the mutex-gated guard + retry instore.rstest(memory): cover concurrent memory_tree schema init regression— 8-thread regression test instore_tests.rsSubmission Checklist
storetest suite (13 tests pass locally). Diff coverage ≥ 80% — changed lines (Vitest + cargo-llvm-cov merged viadiff-cover) meet the gate enforced by.github/workflows/coverage.yml.docs/TEST-COVERAGE-MATRIX.mdreflect this change.## Related.docs/RELEASE-MANUAL-SMOKE.md).Closes #NNNin the## RelatedsectionImpact
Failed to initialize memory_tree schemaerrors from the worker pool's cold-start path.with_connectioncall after first) now doesMutex<HashSet>::lock().contains()instead of runningPRAGMA journal_mode=WAL, the fullSCHEMAexecute_batch, and 12+ALTER TABLEprobes. Net speedup per call on warm DBs — small but free, especially under contention.with_connectioncall).Related
AI Authored PR Metadata (required for Codex/Linear PRs)
Linear Issue
Commit & Branch
fix/sentry-mt-init-worker-raceValidation Run
pnpm --filter openhuman-app format:checkpnpm typecheckcargo test --lib openhuman::memory::tree::store -- --test-threads=4→ 13/13 pass (includes newwith_connection_serialises_concurrent_schema_init)cargo fmt --checkclean;cargo checkcleanValidation Blocked
command:N/Aerror:N/Aimpact:N/ABehavior Changes
with_connectioncallers; skip redundant WAL + SCHEMA + ALTER work after first per process.Parity Contract
busy_timeout = 15sset; WAL journal mode persists in the DB header from the first init; allALTER TABLEmigrations remain idempotent and are unchanged.is_transient_cold_start) only matches the three documented cold-start codes; non-transient errors still bubble up unchanged.Duplicate / Superseded PR Handling
Summary by CodeRabbit
Bug Fixes
New Features
Tests