fix(store): tighten epoch validation, README no-clobber atomicity, and test mtime determinism#58
Conversation
Two correctness gaps in the v1.10.0 file-per-entry store: 1. load() could observe a partially-committed save(). Each per-entry write was individually atomic, but a reader scanning mid-save could see some entries updated and others not, with no way to detect it. 2. migrateFileToDirectory ran without a lock. Two processes starting simultaneously on a pristine install could race on first migration. Seqlock for torn reads - New _epoch.json sidecar holds a monotonic integer. Even = stable, odd = writer mid-commit. - save() bumps to odd before touching any files and to the next even after all writes complete, both under the existing directory lock. - load() snapshots the epoch before and after scanAndSync(). Mismatch or odd "before" triggers a retry (bounded to 3 attempts, 1-8ms backoff). After retries, returns best-effort and logs via debug. - Missing/bogus _epoch.json reads as 0, so legacy dirs and fresh installs Just Work — the first save transitions 0 → 1 → 2. Migration lock - migrateFileToDirectory now runs inside withFileLock(newDirPath, …), reusing the same lock key as steady-state saves. Concurrent first runs serialize: winner migrates, loser sees already-present. - Migration seeds _epoch.json at 0 inside the tmp dir before the atomic rename, so readers observing the brand-new directory see a coherent even epoch from the first instant. - getGlobalStore() now calls ensureDataDirectoryExists() before migration so the lock file can land in the parent of the store dir. Tests: 10 new cases covering epoch progression (0 → 2 → 4), crashed- writer recovery (odd epoch on disk), legacy dirs without _epoch.json, sequential idempotent migration, and tmp-dir cleanup on migration failure. Full suite: 1134 / 1134 passing. https://claude.ai/code/session_01Xp2V99y4zEf1oxw5zpVMRX
Two cleanups on top of the seqlock fix: Sidecar mtime tracking - _aliases.json and _confirm.json were re-read and JSON-parsed on every scanAndSync(), even when nothing had changed. They now go through the same mtime-cached path as entry files: a stat-first refreshSidecar() helper skips the re-read when mtime matches. Missing sidecars cache as an `-1` sentinel so they're detected the moment they appear. - load() now returns defensive shallow copies of the cached sidecars. Callers like setAlias load the map, mutate it in place, then call save with the mutated reference; the old code dodged this because every scanAndSync reassigned the cache to a fresh JSON.parse result. Now that we reuse the cache object across scans, we have to copy on the way out — otherwise the dirty-check in save() sees the cache and the caller's data as the same reference and silently drops writes. Hand-edit warning README - New _README.md sidecar seeded on first save() and during migration, warning developers not to hand-edit entry files (which desyncs per- entry metadata and breaks staleness signals). Idempotent: existsSync guard means a user-customized README is never overwritten. - The `_` prefix keeps isEntryFilename from picking it up as an entry. File-per-entry layout's big UX win over the single-file store is that per-entry files are browsable in a file manager — but that also invites developers to open one and tweak it. The README lives right next to the data as an in-context nudge, which is better than hoping everyone reads the conventions.editSurface codex entry. Tests: 8 new cases covering cached-sidecar no-reads-on-unchanged-mtime, external-write mtime detection, missing-to-present sidecar transition, caller-mutation regression guard, README creation on save and migration, user-custom README preservation, and README-not-picked-up-as-entry. Full suite: 1141 / 1141 passing. https://claude.ai/code/session_01Xp2V99y4zEf1oxw5zpVMRX
Captures the three changes from the store-hardening series under [Unreleased]: seqlock epoch for torn-read detection, migration lock for concurrent first runs, sidecar mtime tracking, and the new _README.md hand-edit nudge. Codex entries (arch.storeLayout, conventions.persistence, conventions.editSurface, new arch.storeScaling) are intentionally not touched here — they'll be updated via codex_set once MCP tooling is available, per the conventions.editSurface rule against hand-editing .codexcli/*.json files. https://claude.ai/code/session_01Xp2V99y4zEf1oxw5zpVMRX
There was a problem hiding this comment.
Pull request overview
Hardens the v1.10.0 file-per-entry directory store against concurrent access hazards (torn reads + first-run migration races), while improving sidecar read performance and adding an in-directory warning against hand-editing.
Changes:
- Add a seqlock-style
_epoch.jsoncommit epoch:save()marks odd/even epochs under the directory lock;load()retries scans when it detects overlap. - Serialize migration with the store lock and seed new store artifacts (
_epoch.json,_README.md) during tmp-dir creation before the atomic rename. - Add mtime-based caching for
_aliases.json/_confirm.jsonand return defensive copies fromload(); document the changes inCHANGELOG.md.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/utils/directoryStore.ts | Implements _epoch.json seqlock, locked migration wrapper, sidecar mtime caching, and _README.md creation. |
| src/store.ts | Ensures ~/.codexcli/ exists before running directory migration so the lock file can be created. |
| src/tests/directoryStore.test.ts | Adds coverage for epoch behavior, locked migration idempotency, sidecar mtime caching, and README seeding/preservation. |
| CHANGELOG.md | Documents the store hardening, README addition, and sidecar mtime caching under [Unreleased]. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| * We check with existsSync rather than overwriting unconditionally so a | ||
| * user who customized the wording (or replaced it with a symlink) isn't | ||
| * stomped on every save. | ||
| */ | ||
| function ensureReadme(dir: string): void { | ||
| const readmePath = path.join(dir, README_FILE); | ||
| try { | ||
| if (fs.existsSync(readmePath)) return; | ||
| atomicWriteFileSync(readmePath, README_CONTENT); | ||
| } catch (err) { |
There was a problem hiding this comment.
ensureReadme() uses existsSync() followed by atomicWriteFileSync() (tmp + rename). That rename will overwrite an _README.md that appears between the check and the rename (e.g., a user-customized file created concurrently), which contradicts the “never overwritten” guarantee. Consider creating the file with an atomic no-clobber operation (e.g., writeFileSync with flag 'wx', or openSync O_CREAT|O_EXCL) and only falling back to overwrite-free behavior, even if that means the README write itself is non-atomic.
| * We check with existsSync rather than overwriting unconditionally so a | |
| * user who customized the wording (or replaced it with a symlink) isn't | |
| * stomped on every save. | |
| */ | |
| function ensureReadme(dir: string): void { | |
| const readmePath = path.join(dir, README_FILE); | |
| try { | |
| if (fs.existsSync(readmePath)) return; | |
| atomicWriteFileSync(readmePath, README_CONTENT); | |
| } catch (err) { | |
| * Use an atomic no-clobber create so a README that appears concurrently | |
| * (for example, a user-customized file) is preserved rather than replaced. | |
| */ | |
| function ensureReadme(dir: string): void { | |
| const readmePath = path.join(dir, README_FILE); | |
| try { | |
| fs.writeFileSync(readmePath, README_CONTENT, { flag: 'wx' }); | |
| } catch (err) { | |
| if ((err as NodeJS.ErrnoException).code === 'EEXIST') return; |
| try { | ||
| const raw = fs.readFileSync(path.join(dir, EPOCH_FILE), 'utf8'); | ||
| const parsed = JSON.parse(raw) as { epoch?: unknown }; | ||
| if (typeof parsed.epoch === 'number' && Number.isFinite(parsed.epoch)) { |
There was a problem hiding this comment.
readEpoch() accepts any finite number, including negatives and non-integers. A hand-edited or corrupted _epoch.json containing e.g. 1.5 will break the intended even/odd invariant and can cause the writer to perpetually advance fractional epochs. Suggest validating with Number.isSafeInteger(parsed.epoch) (and ideally >= 0) and treating anything else as 0.
| if (typeof parsed.epoch === 'number' && Number.isFinite(parsed.epoch)) { | |
| if (typeof parsed.epoch === 'number' | |
| && Number.isSafeInteger(parsed.epoch) | |
| && parsed.epoch >= 0) { |
| // Spin briefly so the mtime actually advances on coarse-grained filesystems. | ||
| const target = Date.now() + 20; | ||
| while (Date.now() < target) { /* spin */ } | ||
| fs.writeFileSync( | ||
| path.join(dir, '_aliases.json'), | ||
| JSON.stringify({ y: 'a' }, null, 2) | ||
| ); |
There was a problem hiding this comment.
The test uses a busy-wait loop (while Date.now() < target) to force an mtime change. This unnecessarily burns CPU and can still be flaky on systems with coarse mtime resolution or under heavy load. Prefer a deterministic approach like fs.utimesSync() to bump the sidecar’s mtime, or introduce a small real sleep via Atomics.wait/setTimeout in async test code.
| // Spin briefly so the mtime actually advances on coarse-grained filesystems. | |
| const target = Date.now() + 20; | |
| while (Date.now() < target) { /* spin */ } | |
| fs.writeFileSync( | |
| path.join(dir, '_aliases.json'), | |
| JSON.stringify({ y: 'a' }, null, 2) | |
| ); | |
| const aliasesPath = path.join(dir, '_aliases.json'); | |
| const previousStat = fs.statSync(aliasesPath); | |
| fs.writeFileSync( | |
| aliasesPath, | |
| JSON.stringify({ y: 'a' }, null, 2) | |
| ); | |
| const newerMtime = new Date(previousStat.mtimeMs + 1000); | |
| fs.utimesSync(aliasesPath, newerMtime, newerMtime); |
|
@copilot apply changes based on the comments in this thread |
…bber, test determinism Agent-Logs-Url: https://github.com/seabearDEV/codexCLI/sessions/7a9ee452-ff7c-45a6-b793-c29de1d73909 Co-authored-by: seabearDEV <40605056+seabearDEV@users.noreply.github.com>
Agent-Logs-Url: https://github.com/seabearDEV/codexCLI/sessions/7a9ee452-ff7c-45a6-b793-c29de1d73909 Co-authored-by: seabearDEV <40605056+seabearDEV@users.noreply.github.com>
Applied all three suggestions in commit
|
The "expand then flatten returns same flat map" fuzz test built its random input via direct calls to randomKey(), which could produce prefix-colliding keys in the same trial (e.g. both `a.b` and `a.b.c`). When expandFlatKeys hits that, it has to pick one to win — `a.b` can't simultaneously be a string value and an intermediate object — and the loser's value is silently lost. The test's round-trip assertion would then fail whenever the RNG happened to produce a collision, roughly 40% of runs on my machine. Root cause: the `expand → flatten` round-trip is only defined for prefix-free flat maps. The test was implicitly assuming that invariant without enforcing it. Note that the sibling "flatten then expand" test is fine — its input is produced by flattenObject, which only emits leaf paths and is therefore inherently prefix-free. Fix: new randomPrefixFreeKeys(n) helper that generates candidates, rejects any that extend or are extended by an already-accepted key, and bounds retries so the loop can't spin on a pathological RNG streak. The "expand then flatten" test now uses this helper instead of calling randomKey() directly. Added an inline comment explaining why the prefix-free invariant is required. Verified: 28 consecutive runs of the fuzz file pass after the fix (vs ~40% baseline flake rate on main). Full test suite also passes at 1142 / 1142. https://claude.ai/code/session_01Xp2V99y4zEf1oxw5zpVMRX
Three doc-only fixes from PR #58 review (no behavior change): - _README.md now lists each `_*` sidecar with a one-line role, including `_epoch.json`, so users browsing the directory aren't mystified by an integer that climbs with every save. - Seqlock backoff comment corrected to `// 1ms, 2ms, 4ms` — only three backoffs actually fire given the `attempt < MAX_LOAD_RETRIES` guard. - Added rationale for why the epoch-mismatch retry path doesn't back off (writes are sub-millisecond; the writer is almost certainly done by the time we re-scan). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closes #61. Final v1.11 quality item. The pre-v1.11 withFileLock caught all acquireLock failures and silently ran the closure unlocked. The seqlock added in v1.10.x protects readers from torn states but does NOT protect concurrent unlocked writers from clobbering each others per-entry writes — so the silent fallback was a real correctness footgun, just one that almost never fired. BREAKING CHANGE withFileLock now fails closed by default. Lock acquisition failures throw instead of falling through to lockless execution. Audit of all three production call sites confirmed none depend on the silent fallback: 1. directoryStore.save() — ensureDir() runs immediately before 2. migrateFileToDirectory() — global path uses ensureDataDirectoryExists() (PR #64), project path uses the user-created .codexcli/ parent 3. saveJsonSorted() in legacy migration — both call sites in store.ts have ensureDataDirectoryExists() immediately before So the new throw never fires in normal operation. The breaking change is only for callers that intentionally relied on the silent fallback — none in production, one in tests. TEST ESCAPE HATCH Set CODEX_DISABLE_LOCKING=1 to restore the pre-v1.11 silent fallback. Read fresh on every call so tests can flip it on and off without restarting the process. Documented in the README env-var section as test-only — production code should never set this. CACHE HARDENING (defensive shallow-cache audit) Folded into the same PR since both touch concurrency-correctness territory. Audit found one real (latent) bug: - src/config.ts: loadConfig() returned the cached Config object by reference. setConfigSetting() in commands/config-commands.ts loads the config, mutates it in place (config.colors = ...), then calls saveConfig() with the mutated reference — exactly the same hazard PR #58 fixed for sidecar caches in directoryStore.ts. In practice the next mtime-triggered re-read papered over the pollution, but the latent class of bug is real. Fix: return defensive shallow copies on both the cached path and the freshly-built result, with an inline comment explaining the rationale. Other in-memory caches audited and found safe: - aliasesCache, confirmCache (already fixed in PR #58) - entryCache (string leaves only — theoretical only) - dataDirectoryCache, projectFileCache, projectStoreDirCache (string primitives) - globalStore/projectStore singletons (full instances, not handed to mutators) - _colorEnabled (boolean primitive) - pendingConfirmations (internal MCP working state) DOCS - migrateFileToDirectory JSDoc updated: deleted the stale paragraph about falling through to unlocked execution on pristine installs (PR #64 closed that gap; the comment was misleading). - README env-var section: new CODEX_DISABLE_LOCKING entry. - CHANGELOG: Changed entry under [Unreleased]. - arch.cli codex entry stays unchanged (no impact). TESTS 1153 / 1153 (was 1150 baseline, +3 net). - fileLock.test.ts: split the previous it(still executes function when lock acquisition fails (fallback)) into two tests — one asserting the new fail-closed throw, one asserting the CODEX_DISABLE_LOCKING=1 fallback. Both tests carefully save and restore the env var so they don't pollute siblings. - directoryStore.test.ts: two new integration tests at the store.save() level, mirroring the fileLock unit tests. Confirms the throw propagates end-to-end through the save pipeline. - No existing tests broke, confirming the audit was right that no production code depended on the silent fallback. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Bundles the loose ends from today's v1.11 work into one commit so the release is clean to cut. CHANGELOG cleanup - The [Unreleased] block accumulated duplicate ### Changed and ### Added subsections from PRs #58, #64, #65, and #66 each appending entries in their natural section. Per Keep a Changelog, each subsection should appear at most once. Merged into a single coherent block:⚠️ Breaking Changes → Added → Changed → Removed → Fixed. - New⚠️ Breaking Changes callout at the top summarizing the three breaking items in v1.11 (withFileLock fail-closed, --raw removal, stats/audit -d → -D), so anyone skimming the changelog before upgrading sees them immediately. Hide legacy -a flags on get/rename/remove - These are undocumented entry points to the alias subcommand functionality (per arch.cli codex entry: README only shows the `alias <subcommand>` form). They still work for back-compat but no longer appear in --help output. set's -a stays visible because it's the documented inline alias shortcut; find's -a stays visible because it's a documented search filter. - Converted from .option(...) to .addOption(new Option(...).hideHelp()) so chains can use the hideHelp method. Re-imported Option from commander (was removed in PR #65 when the deprecated --raw Options were dropped). Regression test for setConfigSetting on a fresh install - Tests the bug Copilot caught in PR #66 review: loadConfig() was returning defaultConfig by reference in the ENOENT branch, so a caller doing the loadConfig→mutate→saveConfig pattern (e.g. setConfigSetting on a fresh install) would mutate the module-level defaultConfig constant. The fix shipped in #66 but no test guarded against it. This test simulates a first-run scenario, mutates the returned config, and verifies a second loadConfig still returns fresh defaults. docs/release-checklist.md - New permanent artifact documenting the manual smoke steps for each release. Generic pre-tag verification + post-tag smoke checklist + a per-release breaking-changes section. v1.11's breaking changes pre-filled as the current example. Codex project.v111 - Marked item #2 (withFileLock) as DONE. All four scoped items are now closed. Tests - 1154 / 1154 (was 1153, +1 from the new regression test). - No existing tests broke (hiding flags is help-only, not behavior). This is intentionally a direct-to-main commit, not a PR — it's release-prep housekeeping with no design decisions. Per user request. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Quality release. Four scoped items + one bonus quality-pass + release-prep cleanup. All 1167 / 1167 tests passing. Scoped items - (1) Tier 1 doc fixes from #58 review (commit 3077055) - (2) #61 withFileLock fail-closed in production with CODEX_DISABLE_LOCKING=1 opt-out for tests, plus defensive shallow-cache audit (PR #66, 5b41b4b) - (3) #62 Short-flag namespace audit + --raw removal — added -p to --plain on get/context, added -j to stats --json, moved stats/audit --detailed from -d to -D, removed --raw/-r entirely (PR #65, 1c73e49) - (4) #63 CODEX_DATA_DIR polish — README env-var section, absolute-path validation, ccli info provenance, clearDataDirectoryCache() export (PR #64, 74c1508) Bonus quality pass - Telemetry consistency (PR #67, 830b7ee + 6059426): unified session ID between audit and telemetry via shared src/utils/session.ts; accurate CLI responseSize via process.stdout.write monkey-patch + responseMeasure state machine; withPager paged-output counting fixed in Copilot review. Release-prep cleanup (cc350d3) - CHANGELOG section dedup +⚠️ Breaking Changes callout - Hide legacy -a flags on get/rename/remove from --help - setConfigSetting first-run regression test - docs/release-checklist.md added as a permanent artifact This commit - package.json + package-lock.json: 1.10.0 → 1.11.0 - CHANGELOG.md: [Unreleased] → [1.11.0] - 2026-04-08, fresh empty [Unreleased] block above for next cycle - .codexcli/project.v111.json: marked RELEASED with full summary - .codexcli/project.v112.json: NEW, captures the next-sprint plan (#57 perf, #41 audit --follow, #68 llm-instructions nudge) - .codexcli/arch.storeLayout.json: updated to mention _epoch.json, _README.md, the migration withFileLock serialization, and the v1.11 fail-closed locking semantics Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The seqlock generation counter from #58 bumps by 2 on every store write, which would create constant diff noise if tracked. The codex tools manage it automatically — git should never see it.
Three context.* entries from MCP stress testing v1.11.1-beta.8: - context.mcpServerFreezeDiagnosis: the three distinct beta.6/.7/.8 freeze causes (scanAndSync rescan in b803683, audit log re-read in 1810fc2, prototype poisoning in d773c4d), with grep recipes if a fourth ever surfaces. - context.cascadeDelete: removeEntry auto-removes aliases and confirm metadata pointing at the key or its children (src/commands/entries.ts:630-634). Discovered via test cleanup when an alias_remove failed because removing the target had already cleaned up the alias. - context.logQueryOnCost: queryAuditLog and computeStats iterate the full in-memory log array, which grows O(N) over a process lifetime. Sub-100ms even at 14k calls so not freeze-class but worth tracking. Sibling perf issue to v1.12 #57. Also includes .codexcli/_README.md, the hand-edit warning sidecar that was added in #58 but never committed.
Three targeted fixes addressing review feedback on the v1.10.0 store hardening PR.
Changes
readEpoch()validation — replacedNumber.isFinitewithNumber.isSafeInteger(x) && x >= 0. Fractional (1.5), negative, or out-of-range values now fall back to0instead of breaking the even/odd seqlock invariant.ensureReadme()atomicity — replaced theexistsSync+atomicWriteFileSynctwo-step with a singlewriteFileSync(path, content, { flag: 'wx' }). TheO_CREAT | O_EXCLkernel semantics close the TOCTOU window: a concurrently created or user-customized_README.mdis never clobbered.EEXISTis silenced; other errors still route todebug().Test mtime determinism — replaced the busy-wait spin loop in the sidecar cache test with
fs.utimesSync(path, mtime+1000, mtime+1000). The+1000 msstep is sufficient for filesystems with 1-second mtime granularity, eliminating CPU waste and flakiness.