fix(memory): run memory_tree on TRUNCATE journal instead of WAL#2455
Conversation
memory_tree was the only tree DB on WAL. Its `-shm` shared-memory index and `-wal` checkpoint machinery are the root of the high-volume cold-start init failures — IOERR_SHMMAP (macOS, Sentry TAURI-RUST-X1) and IOERR_TRUNCATE (Windows, AV-held handles, Sentry TAURI-RUST-EV). All tree access already serialises on a single cached PMutex<Connection>, so WAL's only real benefit — concurrent readers — was unused while its `-shm`/`-wal` fragility caused the failures. Switch memory_tree to the TRUNCATE rollback journal with synchronous=FULL (crash-safe for non-WAL modes), matching the sibling tree DBs (cron/vault/redirect_links) that already run rollback journals without issue. Requesting TRUNCATE on a database a prior release left in WAL mode checkpoints the `-wal` back into the main file and removes the `-wal`/`-shm` side-files, so this migrates existing WAL databases in place on upgrade. This composes with tinyhumansai#2206 (single cached connection + init lock), which already removed the concurrent-cold-start race; this change removes the residual environmental `-shm`/`-wal` failure surface that the race fix could only mitigate. Tests: - memory_tree_uses_truncate_journal_not_wal: asserts TRUNCATE + synchronous=FULL and that no `-shm` file is created. - existing_wal_db_migrates_to_truncate: a WAL database migrates to TRUNCATE on open with `-wal`/`-shm` removed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `SQLITE_IOERR_SHMMAP` constant was 4874 — but 4874 is actually `SQLITE_IOERR_SHMSIZE`. The real `SHMMAP` is 5386, and the "open a new shared-memory segment" failure that surfaced on macOS (Sentry TAURI-RUST-X1) is `SHMOPEN` 4618. The numeric classifiers — `is_transient_cold_start` and `is_io_open_error` (store.rs) and `is_sqlite_io_transient` (worker.rs) — all matched `1546 | 4874 | 14`, so they MISSED the real macOS code 4618 (and IN_PAGE 8714); they only caught those codes by accident via a brittle message-text fallback. Define the full `-shm` family with correct values (SHMOPEN 4618, SHMSIZE 4874, SHMMAP 5386) plus IN_PAGE 8714, and broaden all three classifiers to recognise them on the robust numeric path. Fix the tests that baked in the wrong value (4874-as-SHMMAP) and add numeric-arm coverage for the real codes. Moot for memory_tree now that it runs TRUNCATE (no `-shm`), but the classifiers are shared and `memory/memory.db` is still WAL, so the correctness fix stands. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Strengthen existing_wal_db_migrates_to_truncate to read back a row written under WAL after the switch, proving the checkpoint-and-switch preserves committed data (not just that the mode flips and side-files are removed). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
✅ Files skipped from review due to trivial changes (1)
📝 WalkthroughWalkthroughThis PR broadens SQLite transient I/O classification to include WAL/SHM and in-page/mmap error variants, adds PRAGMA synchronous = FULL, switches schema application to request journal_mode=TRUNCATE (with migration from WAL), updates open-error matching, and adds tests plus a release-smoke checklist item for upgrade behavior. ChangesMemory Tree Cold-Start Reliability
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
src/openhuman/memory/tree/store_tests.rs (1)
629-639: ⚡ Quick winStrengthen the WAL→TRUNCATE cleanup assertion with a precondition
Lines 657-664 can pass even when
-wal/-shmwere never created, so the cleanup claim may be vacuous. Add a pre-migration side-file existence check (or force creation) before callingwith_connection.Proposed test hardening
fn existing_wal_db_migrates_to_truncate() { let (_tmp, cfg) = test_config(); let db_path = cfg.workspace_dir.join("memory_tree").join("chunks.db"); + let shm_path = db_path.with_file_name("chunks.db-shm"); + let wal_path = db_path.with_file_name("chunks.db-wal"); std::fs::create_dir_all(db_path.parent().unwrap()).expect("mkdir"); @@ { let conn = rusqlite::Connection::open(&db_path).expect("open wal db"); @@ conn.execute_batch("CREATE TABLE legacy_marker(x); INSERT INTO legacy_marker VALUES (1);") .expect("seed"); + + assert!( + wal_path.exists() || shm_path.exists(), + "precondition: WAL side-files should exist before migration is exercised" + ); } // connection dropped — the header still records WAL @@ assert!( - !db_path.with_file_name("chunks.db-shm").exists(), + !shm_path.exists(), "-shm must be gone after WAL→TRUNCATE migration" ); assert!( - !db_path.with_file_name("chunks.db-wal").exists(), + !wal_path.exists(), "-wal must be gone after WAL→TRUNCATE migration" ); }Also applies to: 657-664
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/memory/tree/store_tests.rs` around lines 629 - 639, The test currently seeds the DB into WAL mode but may never create the actual side-files, making the later WAL→TRUNCATE cleanup assertion vacuous; after opening the DB with rusqlite::Connection::open, running the PRAGMA and inserting the row (the block using db_path, Connection::open and the PRAGMA query_row), add a precondition that the WAL side-files exist (e.g. check for existence of db_path with "-wal" and "-shm" suffixes) or force their creation (e.g. execute a checkpoint or additional write/close sequence) before dropping the connection so that the subsequent call to with_connection observes a real WAL state; ensure the test fails if those side-files are not present so the cleanup assertion is meaningful.src/openhuman/memory/tree/jobs/worker.rs (1)
258-261: ⚡ Quick winConsider importing the named constants from
store.rs.
store.rsdefines named constants (SQLITE_CANTOPEN,SQLITE_IOERR_SHMOPEN, etc.) for these exact codes with detailed documentation. Using magic numbers here creates a maintenance burden — both files must stay synchronized.Importing or re-exporting the constants would make this explicit:
+use crate::openhuman::memory::tree::store::{ + SQLITE_CANTOPEN, SQLITE_IOERR_IN_PAGE, SQLITE_IOERR_SHMMAP, + SQLITE_IOERR_SHMOPEN, SQLITE_IOERR_SHMSIZE, SQLITE_IOERR_TRUNCATE, +}; ... - if matches!(f.extended_code, 14 | 1546 | 4618 | 4874 | 5386 | 8714) { + if matches!( + f.extended_code, + SQLITE_CANTOPEN + | SQLITE_IOERR_TRUNCATE + | SQLITE_IOERR_SHMOPEN + | SQLITE_IOERR_SHMSIZE + | SQLITE_IOERR_SHMMAP + | SQLITE_IOERR_IN_PAGE + ) {This requires making the constants
pub(crate)instore.rs.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/memory/tree/jobs/worker.rs` around lines 258 - 261, Replace the magic-number checks on f.extended_code with the named constants defined in store.rs (e.g., SQLITE_CANTOPEN, SQLITE_IOERR_TRUNCATE, SQLITE_IOERR_SHMOPEN, SQLITE_IOERR_SHMSIZE, SQLITE_IOERR_SHMMAP, SQLITE_IOERR_IN_PAGE) by making those constants pub(crate) in store.rs, importing them into this module, and changing the matches!(f.extended_code, 14 | 1546 | 4618 | 4874 | 5386 | 8714) to use the constants instead (e.g., matches!(f.extended_code, SQLITE_CANTOPEN | SQLITE_IOERR_TRUNCATE | ...)) so the codes are explicit and maintainable.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/openhuman/memory/tree/jobs/worker.rs`:
- Around line 258-261: Replace the magic-number checks on f.extended_code with
the named constants defined in store.rs (e.g., SQLITE_CANTOPEN,
SQLITE_IOERR_TRUNCATE, SQLITE_IOERR_SHMOPEN, SQLITE_IOERR_SHMSIZE,
SQLITE_IOERR_SHMMAP, SQLITE_IOERR_IN_PAGE) by making those constants pub(crate)
in store.rs, importing them into this module, and changing the
matches!(f.extended_code, 14 | 1546 | 4618 | 4874 | 5386 | 8714) to use the
constants instead (e.g., matches!(f.extended_code, SQLITE_CANTOPEN |
SQLITE_IOERR_TRUNCATE | ...)) so the codes are explicit and maintainable.
In `@src/openhuman/memory/tree/store_tests.rs`:
- Around line 629-639: The test currently seeds the DB into WAL mode but may
never create the actual side-files, making the later WAL→TRUNCATE cleanup
assertion vacuous; after opening the DB with rusqlite::Connection::open, running
the PRAGMA and inserting the row (the block using db_path, Connection::open and
the PRAGMA query_row), add a precondition that the WAL side-files exist (e.g.
check for existence of db_path with "-wal" and "-shm" suffixes) or force their
creation (e.g. execute a checkpoint or additional write/close sequence) before
dropping the connection so that the subsequent call to with_connection observes
a real WAL state; ensure the test fails if those side-files are not present so
the cleanup assertion is meaningful.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 11b90d3a-e182-4230-806b-3572e149daa5
📒 Files selected for processing (3)
src/openhuman/memory/tree/jobs/worker.rssrc/openhuman/memory/tree/store.rssrc/openhuman/memory/tree/store_tests.rs
This change migrates existing WAL databases on upgrade (a release-cut surface), so add a cross-platform smoke item: on upgrade from a WAL-era build, verify chunks.db reports journal_mode=truncate, the -wal/-shm side-files are gone, prior memories still surface, and no "Failed to initialize memory_tree schema" errors appear. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
graycyrus
left a comment
There was a problem hiding this comment.
Solid fix — switching memory_tree off WAL to TRUNCATE is the right call given the single-connection model and the ~99K Sentry events this eliminates. The migration path (checkpoint → switch → side-file removal) is sound, tests verify data preservation, and the corrected SHM error-code constants are all verified against the SQLite source. Clean work.
One minor comment inline.
graycyrus
left a comment
There was a problem hiding this comment.
Looks good, nice work!
…humansai#2455) Co-authored-by: sanil-23 <sanil@alphahuman.xyz> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…humansai#2455) Co-authored-by: sanil-23 <sanil@alphahuman.xyz> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
memory_treeSQLite store (chunks.db) from WAL to the TRUNCATE rollback journal (withsynchronous=FULL), removing the-shm/-walside-files that caused the high-volume cold-start init failures.Problem
The two highest-volume Sentry issues —
TAURI-RUST-EV(IOERR_TRUNCATE1546, ~86K events, 100% Windows) andTAURI-RUST-X1(IOERR_SHMOPEN4618, ~13K events, 100% macOS) — are both "Failed to initialize memory_tree schema: disk I/O error".Root cause:
memory_treeis the only tree DB on WAL. WAL's-shmshared-memory index +-walcheckpoint machinery fail at cold start (a concurrent-bootstrap race, amplified by Windows AV mandatory-locking and macOS sandbox/permission). The connection-pooling fix (#2206) removed the race, butmemory_treefunnels all access through a singlePMutex<Connection>, so WAL's only real benefit — concurrent readers — is unused, leaving WAL as pure liability. The sibling tree DBs (cron/vault/redirect_links) already run rollback journals without issue.Separately, the SHM error-code constants were mislabeled:
SQLITE_IOERR_SHMMAPwas4874(actuallySHMSIZE); the realSHMMAPis5386and the macOS failure code isSHMOPEN4618. The numeric classifiers matched1546 | 4874 | 14, so they missed the real macOS code (4618) andIN_PAGE(8714) — catching them only by accident via a fragile message-string fallback.Solution
apply_schema:PRAGMA journal_mode=WAL→journal_mode=TRUNCATE, reading back the result and warning if the switch is blocked.init_db: addPRAGMA synchronous=FULL(required for crash-safety in rollback mode;NORMALis only corruption-safe under WAL).TRUNCATEon a database a prior release left in WAL mode checkpoints the-walback into the main file and removes the-wal/-shmside-files — migrating existing databases in place on upgrade. Composes with fix(memory): memory_tree SQLite init fails with disk I/O, xShmMap, and file-open errors (~19K events) #2206: the single cached connection + init lock mean the switch runs cleanly on the sole connection.-shmfamily with correct values (SHMOPEN4618 /SHMSIZE4874 /SHMMAP5386) +IN_PAGE8714, and broadenis_transient_cold_start,is_io_open_error(store.rs) andis_sqlite_io_transient(worker.rs) to match them numerically.Submission Checklist
memory_tree_uses_truncate_journal_not_wal;existing_wal_db_migrates_to_truncate(migrates and preserves committed data);is_sqlite_io_transient_matches_shm_family(numeric arm, real codes); broadenedis_transient_cold_start_classifies_known_extended_codes. Existing race / cleanup / foreign-key tests still pass.N/A: internals change (journal mode + error-code classifier), no user-facing feature row.N/A(internals change).memory_treeWAL→TRUNCATE upgrade-migration step todocs/RELEASE-MANUAL-SMOKE.md(verify migration + memory intact on upgrade from a WAL-era build).N/A: no GitHub tracking issue; surfaced via SentryTAURI-RUST-EV/TAURI-RUST-X1.Impact
memory_treeWAL DB migrates to TRUNCATE on first open — committed data preserved (test-verified),-wal/-shmremoved. Idempotent and crash-safe.fsynccost of rollback+FULLis negligible; foreground reads are unaffected.SHMMAPfailure class entirely (no-shm) and shrinks the Windows-walcheckpoint-truncate surface.-wal/-shmcleanup path may drop committed-but-uncheckpointed WAL frames (reconstructible —memory_treere-ingests from source). Healthy installs are unaffected.Related
TAURI-RUST-EV,TAURI-RUST-X1)memory/memory.db(also WAL + single-connection, not currently failing) to TRUNCATE for consistency;spawn_blockingon the worker loop's inline DB ops (low value — the hot read/ingest paths already usespawn_blocking).AI Authored PR Metadata
Linear Issue
Commit & Branch
fix/memory-tree-truncate-journal1461b539(+11932f81,2ec0152d)🤖 Generated with Claude Code
Summary by CodeRabbit
Bug Fixes
Documentation