fix(memory): run memory_tree on TRUNCATE journal instead of WAL #2455

Merged
graycyrus merged 4 commits into tinyhumansai:main from sanil-23:fix/memory-tree-truncate-journal
May 21, 2026
@sanil-23 sanil-23 commented May 21, 2026

Summary

  • Switch the memory_tree SQLite store (chunks.db) from WAL to the TRUNCATE rollback journal (with synchronous=FULL), removing the -shm/-wal side-files that caused the high-volume cold-start init failures.
  • Existing WAL databases migrate in place on first open (checkpoint → switch → side-files removed); committed data is preserved (test-verified).
  • Correct the mislabeled SQLite SHM error-code constants and broaden the transient / I/O classifiers to recognise the real codes on the robust numeric path (not just a brittle message-text fallback).
  • No change to the worker pool or the single-connection model.

Problem

The two highest-volume Sentry issues — TAURI-RUST-EV (IOERR_TRUNCATE 1546, ~86K events, 100% Windows) and TAURI-RUST-X1 (IOERR_SHMOPEN 4618, ~13K events, 100% macOS) — are both "Failed to initialize memory_tree schema: disk I/O error".

Root cause: memory_tree is the only tree DB on WAL. WAL's -shm shared-memory index and -wal checkpoint machinery fail at cold start (a concurrent-bootstrap race, amplified by Windows AV mandatory locking and macOS sandbox/permission restrictions). The connection-pooling fix (#2206) removed the race, but memory_tree funnels all access through a single PMutex<Connection>, so WAL's only real benefit (concurrent readers) is unused, leaving WAL as pure liability. The sibling tree DBs (cron / vault / redirect_links) already run rollback journals without issue.

Separately, the SHM error-code constants were mislabeled: SQLITE_IOERR_SHMMAP was 4874 (actually SHMSIZE); the real SHMMAP is 5386 and the macOS failure code is SHMOPEN 4618. The numeric classifiers matched 1546 | 4874 | 14, so they missed the real macOS code (4618) and IN_PAGE (8714) — catching them only by accident via a fragile message-string fallback.
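The values quoted above follow from how SQLite builds extended result codes: the primary code sits in the low byte and the sub-code is shifted left by eight bits. The sketch below recomputes them and shows why the old numeric pattern missed 4618 and 8714. The helper names (`ioerr`, `primary_code`, `old_numeric_match`, `broadened_numeric_match`) are illustrative assumptions, not the functions in store.rs or worker.rs.

```rust
// SQLite extended result codes are `primary | (sub << 8)`.
// SQLITE_IOERR is primary code 10 and SQLITE_CANTOPEN is 14 (sqlite3.h).
const SQLITE_IOERR: i32 = 10;
const SQLITE_CANTOPEN: i32 = 14;

/// Build an extended I/O error code from its sub-code index (hypothetical helper).
const fn ioerr(sub: i32) -> i32 {
    SQLITE_IOERR | (sub << 8)
}

const SQLITE_IOERR_TRUNCATE: i32 = ioerr(6); // 1546
const SQLITE_IOERR_SHMOPEN: i32 = ioerr(18); // 4618
const SQLITE_IOERR_SHMSIZE: i32 = ioerr(19); // 4874 (the value once labeled SHMMAP)
const SQLITE_IOERR_SHMMAP: i32 = ioerr(21); // 5386
const SQLITE_IOERR_IN_PAGE: i32 = ioerr(34); // 8714

/// Recover the primary code from an extended code (low byte).
fn primary_code(extended: i32) -> i32 {
    extended & 0xff
}

/// The pre-fix numeric pattern quoted in the description.
fn old_numeric_match(code: i32) -> bool {
    matches!(code, 1546 | 4874 | 14)
}

/// A broadened pattern covering the codes the description lists.
fn broadened_numeric_match(code: i32) -> bool {
    matches!(
        code,
        SQLITE_CANTOPEN
            | SQLITE_IOERR_TRUNCATE
            | SQLITE_IOERR_SHMOPEN
            | SQLITE_IOERR_SHMSIZE
            | SQLITE_IOERR_SHMMAP
            | SQLITE_IOERR_IN_PAGE
    )
}
```

Calling `old_numeric_match(SQLITE_IOERR_SHMOPEN)` returns false while `broadened_numeric_match` returns true for the same code, which is the gap the message-text fallback was covering by accident.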

Solution

  • apply_schema: PRAGMA journal_mode=WAL → journal_mode=TRUNCATE, reading back the result and warning if the switch is blocked. init_db: add PRAGMA synchronous=FULL (required for crash-safety in rollback mode; NORMAL is only corruption-safe under WAL).
  • Migration: requesting TRUNCATE on a database that a prior release left in WAL mode checkpoints the -wal back into the main file and removes the -wal/-shm side-files, migrating existing databases in place on upgrade. Composes with #2206 (fix(memory): memory_tree SQLite init fails with disk I/O, xShmMap, and file-open errors, ~19K events): the single cached connection + init lock mean the switch runs cleanly on the sole connection.
  • Classifiers: define the full -shm family with correct values (SHMOPEN 4618 / SHMSIZE 4874 / SHMMAP 5386) + IN_PAGE 8714, and broaden is_transient_cold_start, is_io_open_error (store.rs) and is_sqlite_io_transient (worker.rs) to match them numerically.
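The read-back step in the first bullet relies on `PRAGMA journal_mode=...` returning the mode that is actually in effect afterwards, so a blocked switch shows up as a reply other than the one requested. A minimal sketch of that check follows, assuming the reply has already been fetched as a string (with rusqlite, for instance, via `query_row` on the PRAGMA statement). `JournalSwitch`, `check_journal_switch`, and `warn_if_blocked` are hypothetical names, not the code in apply_schema.

```rust
/// Outcome of requesting a journal mode and reading back SQLite's reply.
#[derive(Debug, PartialEq, Eq)]
enum JournalSwitch {
    /// SQLite reported the requested mode.
    Applied,
    /// SQLite reported a different mode, so the switch did not take effect.
    Blocked { still_in: String },
}

/// Compare the requested mode with the mode SQLite reported back.
/// SQLite reports mode names in lowercase, so the comparison ignores case.
fn check_journal_switch(requested: &str, reported: &str) -> JournalSwitch {
    let reported = reported.trim();
    if reported.eq_ignore_ascii_case(requested) {
        JournalSwitch::Applied
    } else {
        JournalSwitch::Blocked {
            still_in: reported.to_ascii_lowercase(),
        }
    }
}

/// Produce a warning message for a blocked switch, or None if it applied.
fn warn_if_blocked(outcome: &JournalSwitch) -> Option<String> {
    match outcome {
        JournalSwitch::Applied => None,
        JournalSwitch::Blocked { still_in } => Some(format!(
            "journal_mode switch blocked; database still in {} mode",
            still_in
        )),
    }
}
```

Returning the warning as a value rather than logging it directly keeps the sketch free of any logging dependency; real code would presumably hand the message to whatever logger the crate already uses.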

Submission Checklist

  • Tests added/updated — memory_tree_uses_truncate_journal_not_wal; existing_wal_db_migrates_to_truncate (migrates and preserves committed data); is_sqlite_io_transient_matches_shm_family (numeric arm, real codes); broadened is_transient_cold_start_classifies_known_extended_codes. Existing race / cleanup / foreign-key tests still pass.
  • Diff coverage ≥ 80% — new/updated unit tests cover the changed Rust lines (journal-mode switch, migration, all three classifiers); CI Coverage Gate (diff-cover ≥ 80%) passed.
  • Coverage matrix — N/A: internals change (journal mode + error-code classifier), no user-facing feature row.
  • Affected feature IDs — N/A (internals change).
  • No new external network dependencies.
  • Manual smoke — added a memory_tree WAL→TRUNCATE upgrade-migration step to docs/RELEASE-MANUAL-SMOKE.md (verify migration + memory intact on upgrade from a WAL-era build).
  • Linked issue — N/A: no GitHub tracking issue; surfaced via Sentry TAURI-RUST-EV / TAURI-RUST-X1.

Impact

  • Desktop (all platforms). On upgrade, each user's existing memory_tree WAL DB migrates to TRUNCATE on first open — committed data preserved (test-verified), -wal/-shm removed. Idempotent and crash-safe.
  • Functionally unchanged. Background ingest already batches writes into transactions, so the per-commit fsync cost of rollback+FULL is negligible; foreground reads are unaffected.
  • Removes the macOS -shm failure class (SHMOPEN, Sentry TAURI-RUST-X1) entirely, since no -shm file exists, and shrinks the Windows -wal checkpoint-truncate surface.
  • Residual on already-broken machines: the stale -wal/-shm cleanup path inherited from #2206 (fix(memory): memory_tree SQLite init fails with disk I/O, xShmMap, and file-open errors, ~19K events) may drop committed-but-uncheckpointed WAL frames (reconstructible: memory_tree re-ingests from source). Healthy installs are unaffected.

Related

  • Closes: N/A (no GitHub issue; Sentry TAURI-RUST-EV, TAURI-RUST-X1)
  • Follow-up PR(s)/TODOs: optional — migrate memory/memory.db (also WAL + single-connection, not currently failing) to TRUNCATE for consistency; spawn_blocking on the worker loop's inline DB ops (low value — the hot read/ingest paths already use spawn_blocking).

AI Authored PR Metadata

Linear Issue

  • Key: N/A
  • URL: N/A

Commit & Branch

  • Branch: fix/memory-tree-truncate-journal
  • Commit SHA: 1461b539 (+ 11932f81, 2ec0152d)

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Fewer startup errors from transient database I/O; app now backs off instead of surfacing noisy errors.
    • Datastore migration on first launch: prior WAL-mode DBs are migrated to TRUNCATE, preserving data and removing .db-wal/.db-shm side files.
    • Improved crash-safety and reliability during DB initialization.
  • Documentation

    • Added smoke-test checklist verifying migration behavior and preserved memories on upgrade.

sanil-23 and others added 3 commits May 21, 2026 15:10
memory_tree was the only tree DB on WAL. Its `-shm` shared-memory index and
`-wal` checkpoint machinery are the root of the high-volume cold-start init
failures — IOERR_SHMMAP (macOS, Sentry TAURI-RUST-X1) and IOERR_TRUNCATE
(Windows, AV-held handles, Sentry TAURI-RUST-EV). All tree access already
serialises on a single cached PMutex<Connection>, so WAL's only real benefit —
concurrent readers — was unused while its `-shm`/`-wal` fragility caused the
failures.

Switch memory_tree to the TRUNCATE rollback journal with synchronous=FULL
(crash-safe for non-WAL modes), matching the sibling tree DBs
(cron/vault/redirect_links) that already run rollback journals without issue.
Requesting TRUNCATE on a database a prior release left in WAL mode checkpoints
the `-wal` back into the main file and removes the `-wal`/`-shm` side-files, so
this migrates existing WAL databases in place on upgrade.

This composes with tinyhumansai#2206 (single cached connection + init lock), which already
removed the concurrent-cold-start race; this change removes the residual
environmental `-shm`/`-wal` failure surface that the race fix could only
mitigate.

Tests:
- memory_tree_uses_truncate_journal_not_wal: asserts TRUNCATE + synchronous=FULL
  and that no `-shm` file is created.
- existing_wal_db_migrates_to_truncate: a WAL database migrates to TRUNCATE on
  open with `-wal`/`-shm` removed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `SQLITE_IOERR_SHMMAP` constant was 4874 — but 4874 is actually
`SQLITE_IOERR_SHMSIZE`. The real `SHMMAP` is 5386, and the "open a new
shared-memory segment" failure that surfaced on macOS (Sentry TAURI-RUST-X1)
is `SHMOPEN` 4618. The numeric classifiers — `is_transient_cold_start` and
`is_io_open_error` (store.rs) and `is_sqlite_io_transient` (worker.rs) — all
matched `1546 | 4874 | 14`, so they MISSED the real macOS code 4618 (and
IN_PAGE 8714); they only caught those codes by accident via a brittle
message-text fallback.

Define the full `-shm` family with correct values (SHMOPEN 4618, SHMSIZE
4874, SHMMAP 5386) plus IN_PAGE 8714, and broaden all three classifiers to
recognise them on the robust numeric path. Fix the tests that baked in the
wrong value (4874-as-SHMMAP) and add numeric-arm coverage for the real codes.

Moot for memory_tree now that it runs TRUNCATE (no `-shm`), but the
classifiers are shared and `memory/memory.db` is still WAL, so the
correctness fix stands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Strengthen existing_wal_db_migrates_to_truncate to read back a row written
under WAL after the switch, proving the checkpoint-and-switch preserves
committed data (not just that the mode flips and side-files are removed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@sanil-23 sanil-23 requested a review from a team May 21, 2026 15:10

coderabbitai Bot commented May 21, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c58fd1dc-8d98-434a-9f10-d55d3064b82e

📥 Commits

Reviewing files that changed from the base of the PR and between 1461b53 and 6825b9f.

📒 Files selected for processing (1)
  • docs/RELEASE-MANUAL-SMOKE.md
✅ Files skipped from review due to trivial changes (1)
  • docs/RELEASE-MANUAL-SMOKE.md

Walkthrough

This PR broadens SQLite transient I/O classification to include WAL/SHM and in-page/mmap error variants, adds PRAGMA synchronous = FULL, switches schema application to request journal_mode=TRUNCATE (with migration from WAL), updates open-error matching, and adds tests plus a release-smoke checklist item for upgrade behavior.

Changes

Memory Tree Cold-Start Reliability

| Layer | File(s) | Summary |
|---|---|---|
| Expanded transient error detection | src/openhuman/memory/tree/jobs/worker.rs, src/openhuman/memory/tree/store.rs | is_sqlite_io_transient and is_transient_cold_start now recognize the full WAL -shm family and additional mmap/in-page extended codes; worker comments updated to reflect broader transient classification. |
| Journal mode initialization and crash-safety | src/openhuman/memory/tree/store.rs | DB init now sets PRAGMA synchronous = FULL and apply_schema requests journal_mode = TRUNCATE; comments explain TRUNCATE vs WAL semantics and in-place migration from WAL; is_io_open_error expanded to include the same transient set plus CannotOpen. |
| Unit tests for error classification | src/openhuman/memory/tree/jobs/worker.rs, src/openhuman/memory/tree/store_tests.rs | Replaced single -shm unit test with a parameterized test covering multiple -shm extended codes; expanded is_transient_cold_start test table and updated regression comments. |
| Integration tests for journal configuration | src/openhuman/memory/tree/store_tests.rs, docs/RELEASE-MANUAL-SMOKE.md | Two tests: one verifies fresh init uses TRUNCATE and synchronous=FULL with no -shm file; the other seeds a WAL-mode DB, runs with_connection, and asserts migration to TRUNCATE, data preservation, and removal of -wal/-shm side-files. Documentation adds a smoke-checklist row for this upgrade behavior. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • M3gA-Mind
  • graycyrus
  • senamakel

Poem

🐰 I hopped in the cold-start with a frown,
WAL and SHM scattered all around,
TRUNCATE and FULL synchronous I choose,
Transient faults now politely excuse,
Hooray — no lost memories, just safe ground!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly and clearly describes the main change: switching memory_tree SQLite from WAL to TRUNCATE journal mode, which is the primary objective across all modified files. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot added the memory and bug labels May 21, 2026

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (2)
src/openhuman/memory/tree/store_tests.rs (1)

629-639: ⚡ Quick win

Strengthen the WAL→TRUNCATE cleanup assertion with a precondition

Lines 657-664 can pass even when -wal/-shm were never created, so the cleanup claim may be vacuous. Add a pre-migration side-file existence check (or force creation) before calling with_connection.

Proposed test hardening
 fn existing_wal_db_migrates_to_truncate() {
     let (_tmp, cfg) = test_config();
     let db_path = cfg.workspace_dir.join("memory_tree").join("chunks.db");
+    let shm_path = db_path.with_file_name("chunks.db-shm");
+    let wal_path = db_path.with_file_name("chunks.db-wal");
     std::fs::create_dir_all(db_path.parent().unwrap()).expect("mkdir");

@@
     {
         let conn = rusqlite::Connection::open(&db_path).expect("open wal db");
@@
         conn.execute_batch("CREATE TABLE legacy_marker(x); INSERT INTO legacy_marker VALUES (1);")
             .expect("seed");
+
+        assert!(
+            wal_path.exists() || shm_path.exists(),
+            "precondition: WAL side-files should exist before migration is exercised"
+        );
     } // connection dropped — the header still records WAL
@@
     assert!(
-        !db_path.with_file_name("chunks.db-shm").exists(),
+        !shm_path.exists(),
         "-shm must be gone after WAL→TRUNCATE migration"
     );
     assert!(
-        !db_path.with_file_name("chunks.db-wal").exists(),
+        !wal_path.exists(),
         "-wal must be gone after WAL→TRUNCATE migration"
     );
 }

Also applies to: 657-664
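The precondition suggested above depends on locating the side-files next to the database. As a rough sketch using only the standard library, the helper below derives the `-wal` and `-shm` paths from any database path rather than hard-coding `chunks.db-wal`. `side_file` and `existing_side_files` are hypothetical names and are not part of store_tests.rs.

```rust
use std::ffi::OsString;
use std::path::{Path, PathBuf};

/// Append a suffix such as "-wal" or "-shm" to the database file name,
/// e.g. `chunks.db` + "-wal" -> `chunks.db-wal` in the same directory.
fn side_file(db_path: &Path, suffix: &str) -> PathBuf {
    let mut name: OsString = db_path
        .file_name()
        .map(|n| n.to_os_string())
        .unwrap_or_default();
    name.push(suffix);
    db_path.with_file_name(name)
}

/// Return the WAL side-files that currently exist next to `db_path`.
fn existing_side_files(db_path: &Path) -> Vec<PathBuf> {
    ["-wal", "-shm"]
        .iter()
        .map(|suffix| side_file(db_path, suffix))
        .filter(|path| path.exists())
        .collect()
}
```

A test could assert that `existing_side_files(&db_path)` is non-empty before the migration and empty after it, which makes the cleanup assertion fail loudly if the side-files were never created.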

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/memory/tree/store_tests.rs` around lines 629 - 639, The test
currently seeds the DB into WAL mode but may never create the actual side-files,
making the later WAL→TRUNCATE cleanup assertion vacuous; after opening the DB
with rusqlite::Connection::open, running the PRAGMA and inserting the row (the
block using db_path, Connection::open and the PRAGMA query_row), add a
precondition that the WAL side-files exist (e.g. check for existence of db_path
with "-wal" and "-shm" suffixes) or force their creation (e.g. execute a
checkpoint or additional write/close sequence) before dropping the connection so
that the subsequent call to with_connection observes a real WAL state; ensure
the test fails if those side-files are not present so the cleanup assertion is
meaningful.
src/openhuman/memory/tree/jobs/worker.rs (1)

258-261: ⚡ Quick win

Consider importing the named constants from store.rs.

store.rs defines named constants (SQLITE_CANTOPEN, SQLITE_IOERR_SHMOPEN, etc.) for these exact codes with detailed documentation. Using magic numbers here creates a maintenance burden — both files must stay synchronized.

Importing or re-exporting the constants would make this explicit:

+use crate::openhuman::memory::tree::store::{
+    SQLITE_CANTOPEN, SQLITE_IOERR_IN_PAGE, SQLITE_IOERR_SHMMAP,
+    SQLITE_IOERR_SHMOPEN, SQLITE_IOERR_SHMSIZE, SQLITE_IOERR_TRUNCATE,
+};
 ...
-        if matches!(f.extended_code, 14 | 1546 | 4618 | 4874 | 5386 | 8714) {
+        if matches!(
+            f.extended_code,
+            SQLITE_CANTOPEN
+                | SQLITE_IOERR_TRUNCATE
+                | SQLITE_IOERR_SHMOPEN
+                | SQLITE_IOERR_SHMSIZE
+                | SQLITE_IOERR_SHMMAP
+                | SQLITE_IOERR_IN_PAGE
+        ) {

This requires making the constants pub(crate) in store.rs.
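To illustrate the visibility change, here is a small self-contained sketch with two inline modules standing in for store.rs and worker.rs. The module layout and the function name `is_transient_extended_code` are assumptions for illustration; only the constant names and values come from this PR.

```rust
// Stand-in for store.rs: constants are `pub(crate)` so sibling modules can
// import them instead of repeating the numbers.
mod store {
    pub(crate) const SQLITE_CANTOPEN: i32 = 14;
    pub(crate) const SQLITE_IOERR_TRUNCATE: i32 = 1546;
    pub(crate) const SQLITE_IOERR_SHMOPEN: i32 = 4618;
    pub(crate) const SQLITE_IOERR_SHMSIZE: i32 = 4874;
    pub(crate) const SQLITE_IOERR_SHMMAP: i32 = 5386;
    pub(crate) const SQLITE_IOERR_IN_PAGE: i32 = 8714;
}

// Stand-in for worker.rs: the classifier names the codes it accepts.
mod worker {
    use super::store::{
        SQLITE_CANTOPEN, SQLITE_IOERR_IN_PAGE, SQLITE_IOERR_SHMMAP,
        SQLITE_IOERR_SHMOPEN, SQLITE_IOERR_SHMSIZE, SQLITE_IOERR_TRUNCATE,
    };

    /// Hypothetical numeric arm of a transient-I/O classifier.
    pub(crate) fn is_transient_extended_code(extended_code: i32) -> bool {
        matches!(
            extended_code,
            SQLITE_CANTOPEN
                | SQLITE_IOERR_TRUNCATE
                | SQLITE_IOERR_SHMOPEN
                | SQLITE_IOERR_SHMSIZE
                | SQLITE_IOERR_SHMMAP
                | SQLITE_IOERR_IN_PAGE
        )
    }
}
```

With a single definition site, a mislabeled value like the earlier 4874-as-SHMMAP would need correcting in one place only.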

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/memory/tree/jobs/worker.rs` around lines 258 - 261, Replace the
magic-number checks on f.extended_code with the named constants defined in
store.rs (e.g., SQLITE_CANTOPEN, SQLITE_IOERR_TRUNCATE, SQLITE_IOERR_SHMOPEN,
SQLITE_IOERR_SHMSIZE, SQLITE_IOERR_SHMMAP, SQLITE_IOERR_IN_PAGE) by making those
constants pub(crate) in store.rs, importing them into this module, and changing
the matches!(f.extended_code, 14 | 1546 | 4618 | 4874 | 5386 | 8714) to use the
constants instead (e.g., matches!(f.extended_code, SQLITE_CANTOPEN |
SQLITE_IOERR_TRUNCATE | ...)) so the codes are explicit and maintainable.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 11b90d3a-e182-4230-806b-3572e149daa5

📥 Commits

Reviewing files that changed from the base of the PR and between ec9708a and 1461b53.

📒 Files selected for processing (3)
  • src/openhuman/memory/tree/jobs/worker.rs
  • src/openhuman/memory/tree/store.rs
  • src/openhuman/memory/tree/store_tests.rs

coderabbitai[bot]
coderabbitai Bot previously approved these changes May 21, 2026
This change migrates existing WAL databases on upgrade (a release-cut
surface), so add a cross-platform smoke item: on upgrade from a WAL-era
build, verify chunks.db reports journal_mode=truncate, the -wal/-shm
side-files are gone, prior memories still surface, and no
"Failed to initialize memory_tree schema" errors appear.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@graycyrus graycyrus left a comment

Solid fix — switching memory_tree off WAL to TRUNCATE is the right call given the single-connection model and the ~99K Sentry events this eliminates. The migration path (checkpoint → switch → side-file removal) is sound, tests verify data preservation, and the corrected SHM error-code constants are all verified against the SQLite source. Clean work.

One minor comment inline.

Comment thread src/openhuman/memory/tree/store.rs

@graycyrus graycyrus left a comment

Looks good, nice work!

@graycyrus graycyrus merged commit d7b27b9 into tinyhumansai:main May 21, 2026
30 of 32 checks passed
CodeGhost21 pushed a commit to CodeGhost21/openhuman that referenced this pull request May 22, 2026
…humansai#2455)

Co-authored-by: sanil-23 <sanil@alphahuman.xyz>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
senamakel pushed a commit to aqilaziz/openhuman that referenced this pull request May 23, 2026
…humansai#2455)

Co-authored-by: sanil-23 <sanil@alphahuman.xyz>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Labels

bug, memory (Memory store, memory tree, recall, summarization, and embeddings in src/openhuman/memory/)
