[Sprint 2] One-shot upgrade migration#241
Conversation
Adds the atomic on-disk migration that brings legacy single-domain
installs onto the canonical multi-domain layout. Runs synchronously
at daemon startup before SMTP/UDS listeners bind, gated by the
`.layout-version: 2` marker so re-runs are a fast no-op.
What changes per step:
* Storage rename: `<data_dir>/{inbox,sent}/` →
`<data_dir>/<default_domain>/{inbox,sent}/` via `rename(2)` (atomic
on same FS, constant time regardless of mail volume). EXDEV
surfaces an actionable error.
* DKIM rename: `<dkim_dir>/{private,public}.key` →
`<dkim_dir>/<default_domain>/{...}` with 0600/0644 modes preserved
and the new parent dir created at 0700 root:root.
* Config rewrite: `domain = "x.com"` → `domains = ["x.com"]` via
the existing `write_atomic` helper. **Mailbox keys are
deliberately preserved on disk** (`[mailboxes.info]` stays
`[mailboxes.info]`) so every downstream CLI lookup that targets a
mailbox by local-part keeps working unchanged in this sprint. The
on-disk FQDN re-key is deferred to the runtime data-plane rewire.
* Marker write: `<data_dir>/.layout-version` containing `2\n` (0644
root:root), temp-then-rename so concurrent readers never see a
truncated value.
Order is load-bearing per the design contract: prefer "orphaned DKIM
key" over "domain in config but DKIM key missing" if a crash
interrupts mid-flow. Marker is last so a partial run never claims to
be done.
Lock hierarchy honored at startup: outer per-mailbox locks acquired
in sorted FQDN order against the shared `MailboxLocks` pool, then
the inner `CONFIG_WRITE_LOCK`, matching the documented cascade-
delete ordering and preventing deadlocks with concurrent CRUD that
comes online once the listener binds.
Layout-aware path resolution shim added: `Config::inbox_dir` /
`sent_dir` / `mailbox_dir` and the storage paths in `send_handler`
and `state_handler` consult `<data_dir>/.layout-version` and route
to the per-domain location post-migration. Same-process callers see
v2 paths immediately after the marker lands; the runtime data plane
keeps working without a per-domain HashMap lookup until the proper
storage helper ships.
DKIM key load in `serve.rs` now resolves `<dkim_dir>/<default_domain>/`
when a per-domain key is present, falling back to the legacy
`<dkim_dir>` root for fresh installs that haven't generated a per-
domain key yet.
Tests added:
* 26 unit tests in `src/upgrade_migration.rs` cover detection
(pristine v1, half-migrated, fresh, corrupted marker, canonical
per-domain sub-tables), each relocation step (rename, EXDEV,
missing files, idempotency, mode preservation), the config rewrite,
the marker write, and the orchestration (`run_startup_migration`).
* 4 integration tests in `tests/upgrade.rs` exercise the full
migration end-to-end against a realistic v1 fixture under
`tests/fixtures/upgrade/v1-single-domain/` (2 mailboxes, 2
messages each): end-to-end success, idempotency on second start,
hard-fail on corrupted marker, and post-migration SMTP RCPT
landing in the per-domain `<data_dir>/<default_domain>/inbox/`
path.
Deviations from the AC, called out explicitly in the PR description:
* `rewrite_config_to_canonical_shape` is structurally scoped to
`domain → domains` only. The legacy `[mailboxes.<local>]` →
`[mailboxes."<local>@<domain>"]` re-key is deferred to the
follow-up runtime rewire that teaches every callsite to look up
mailboxes by FQDN.
* The new module is named `src/upgrade_migration.rs` (the sprint
spec called for `src/upgrade.rs`, but that name is already used by
the binary self-update flow).
There was a problem hiding this comment.
Sprint 2 Review: One-shot upgrade migration (PR #241)
Sprint Goal Assessment
The sprint goal — "atomic on-disk migration that brings legacy single-domain installs onto the canonical multi-domain layout on first aimx serve startup, idempotent across restarts, hard-fail on partial completion" — is substantially achieved. Storage and DKIM are atomically relocated to the per-domain layout via rename(2), the .layout-version: 2 marker is written last per the documented crash-recovery contract, the migration runs under the correct lock hierarchy before any listener binds, and the integration suite end-to-ends the full path including post-migration SMTP RCPT delivery, idempotency, and corrupted-marker hard-fail. The one piece the sprint plan explicitly required and this PR explicitly defers — the on-disk [mailboxes.<local>] → [mailboxes."<local>@<domain>"] re-key — has a sound technical justification (the runtime data plane is not yet FQDN-aware) and the deferral is documented loudly in the PR description.
Acceptance Criteria Checklist
[S2-1] Detection logic + .layout-version marker semantics
-
src/upgrade_migration.rs(note: notsrc/upgrade.rs— see deviation #2) — met -
LayoutState { Migrated, NeedsMigration(Indicators), FreshInstall, Corrupted(String) }— met -
detect_layout_state(data_dir, dkim_dir, config_path, default_domain) -> LayoutState— met -
Indicatorscarries 5 v1-shape signals (legacy_inbox_dir, legacy_sent_dir, legacy_dkim_key, legacy_config_domain_field, legacy_mailbox_local_part_keys) — met -
Corruptedreturned on wrong/garbage marker content — met -
FreshInstalltriggers a proactive marker write — met (inrun_startup_migration) - Unit tests cover pristine v1, half-migrated, fully migrated, fresh install, corrupted marker — met
[S2-2] Storage + DKIM atomic relocation via rename(2)
-
relocate_storage_for_default_domainrenamesinboxandsent— met -
relocate_dkim_for_default_domaincreates<dkim_dir>/<default_domain>/mode0700, renamesprivate.key/public.key— met - Modes preserved across rename (test asserts 0600/0644) — met
- Idempotency per-file via
MoveOutcome::AlreadyDone— met - EXDEV detection via
is_cross_device_erroraccepting bothErrorKind::CrossesDevicesand raw OS error 18 — met - Error message:
"<src> and <dst> must be on the same filesystem; see book/multi-domain.md for manual recovery"— met (verbatim) - Unit tests cover clean migration, half-relocated storage, missing DKIM file, refuses-to-overwrite when both sides exist — met. EXDEV is not unit-tested (hard to simulate without two filesystems) but the detection code path is correct by inspection.
[S2-3] Config rewrite via write_atomic + .layout-version write
-
rewrite_config_to_canonical_shapeuseswrite_atomic— met -
write_layout_version_markerwrites2\nmode0644via temp-then-rename — met - Idempotency on canonical input — met (byte-equal across repeated calls)
- On-disk
[mailboxes.<local>]→[mailboxes."<local>@<domain>"]re-key — DEFERRED. Sprint plan AC says the rewrite should normalize bothdomain → domainsAND mailbox keys; this PR ships only the first. See "Deviation #1" below.
[S2-4] Wire the migration into serve.rs startup under the lock hierarchy
- Sequence: load config → detect layout state → if
NeedsMigration, acquire locks + run migration steps (storage → DKIM → config → marker) → continue to listener bind — met (line 589 ofserve.rs, before the README refresh, the DKIM key load, the TLS init, and the listener binds) -
FreshInstallwrites marker — met - Error message:
"upgrade migration failed: <reason>; see book/multi-domain.md for manual recovery"— met (verbatim) - Migration success emits a single
tracing::info!log line withtarget: "aimx::upgrade"— met - Already-migrated short-circuits with no locks, no log — met
- Per-mailbox locks acquired in sorted FQDN order against the shared
MailboxLockspool, thenCONFIG_WRITE_LOCK(outer → inner per the documented hierarchy) — met - Exits with code 1 on failure — met (
Errpropagates throughResult<(), Box<dyn Error>>toprocess::exit(1))
[S2-5] Upgrade-fixture integration test
- Fixture at
tests/fixtures/upgrade/v1-single-domain/with 2 mailboxes, 2 messages each — met -
tests/upgrade.rsexercises end-to-end migration, idempotency, corrupted-marker hard-fail, post-migration SMTP RCPT — met (4 tests, all green locally) - Post-migration assertions: storage relocated, DKIM relocated, config normalized, marker present, messages still readable — met
- SMTP RCPT to a v1 mailbox local-part lands in the new per-domain inbox — met (
upgrade_smtp_rcpt_lands_in_new_fqdn_keyed_path) - Second start is a no-op (config + marker byte-identical) — met
- Corrupted
.layout-versioncauses daemon to exit with canonical hard error — met
Disposition on Flagged Deviations
Deviation 1: Mailbox-key FQDN re-key deferred to Sprint 3
Verdict: ACCEPT with a follow-up requirement to update the sprint plan and PRD §6.12.
Reasoning:
The PRD §6.7 FR-G1 step 3 and the normative §6.12 walkthrough step 4 both call for the on-disk config to land in this shape post-upgrade:
domains = ["mydomain.com"]
[mailboxes."info@mydomain.com"]
[mailboxes."support@mydomain.com"]
This PR ships only the first line. The mailbox keys stay [mailboxes.info] / [mailboxes.support] on disk.
The implementer's reasoning holds up under scrutiny: doing the on-disk FQDN re-key in Sprint 2 without the runtime data-plane rewire would break every config.mailboxes.get(<local>) callsite (ingest, send, hooks, mailbox CLI), and Sprint 3 is where resolve_mailbox_for_rcpt lands to make those callsites FQDN-aware. Shipping the on-disk re-key now would either (a) break ~120 existing integration tests that pass mailbox local-parts to CRUD, or (b) require an emergency rewire across multiple modules just to keep the migration usable. Splitting the rewrite across two sprints is the right call.
The PR's compensating "layout-aware path shim" in Config::inbox_dir / sent_dir / storage_root_for_default_domain plus the matching updates in send_handler, state_handler, mailbox_handler is a clean bridge: the marker file is the single source of truth for "are we on v2?", same-process callers see v2 paths immediately after the marker lands, and the cost is one stat per inbox/sent lookup (microseconds, well within NFR-4). The integration test upgrade_smtp_rcpt_lands_in_new_fqdn_keyed_path proves that end-to-end functional behavior post-migration is correct: storage lives at the per-domain location, DKIM lives at the per-domain location, mail flows through the per-domain path, and the runtime keeps resolving mailboxes by their operator-friendly key.
What ships now is correct and forward-compatible. It is not "half-done" — it is "structurally migrated with the operator-friendly key preserved on disk pending a runtime data-plane rewire". The implementer documented this clearly in the PR description, in the module docstring on rewrite_config_to_canonical_shape, and in the integration test assertions (which pin the deferred behavior so a future contributor doesn't accidentally land the re-key without doing the data-plane work).
Required follow-up (Non-blocker for THIS PR): Update docs/multi-domain-sprint.md so the Sprint 2 row in the Summary Table no longer says (incl. legacy local-part → FQDN re-key), and update Sprint 3's plan to absorb the on-disk re-key as part of the runtime data-plane rewire (the S3-1 / S3-3 story area). Without this update, the next sprint's reviewer will see Sprint 2 marked "Done" while a Sprint-2 AC is technically still open.
Deviation 2: Module name src/upgrade_migration.rs instead of src/upgrade.rs
Verdict: ACCEPT.
Reasoning:
The collision is real. src/upgrade.rs already exists and owns the aimx upgrade self-update CLI (verified — it contains pub fn run for the self-update flow). Renaming that module would be a much larger and unrelated diff that touches the CLI dispatch, every reference in main.rs, the test suite, etc. The new name upgrade_migration.rs is descriptive ("the one-shot upgrade migration") and the implementer flagged the choice in the PR description, in the module docstring, and in the commit message.
No tracking work needed.
Potential Bugs
Non-blocker: aimx doctor reports dkim_key_present: false post-upgrade
src/doctor.rs:114:
let dkim_key_present = crate::config::dkim_dir().join("private.key").exists();Post-migration, the DKIM key lives at <dkim_dir>/<default_domain>/private.key, not <dkim_dir>/private.key. aimx doctor will report:
DKIM key: MISSING - run `aimx dkim-keygen`
…and recommend the operator regenerate keys. That recommendation is wrong (the key is fine, just relocated) and the suggested action could cause harm: a v1 aimx dkim-keygen without flags would generate fresh keys at the legacy <dkim_dir>/private.key location while the daemon continues to sign with the (different) per-domain key. The operator would then have a confusing mismatch between what doctor reports and what's actually signing outbound.
Same issue applies to src/doctor.rs:820-821 (the orphan-storage scan reads data_dir.join("inbox") / data_dir.join("sent") at the root, which post-migration no longer exists — silently misses any orphan dirs sitting under the new per-domain tree), and src/mailbox.rs:277 (discover_mailbox_names reads the same legacy root, so unregistered mailbox directories under the per-domain tree don't surface in aimx mailboxes list).
Classification: Non-blocker. The PR targets epic/multi-domain, which is not operator-visible until Sprint 7. The sprint plan parks doctor work in Sprint 6 (S6-3) and the storage-helper unification in Sprint 3 (S3-5). Sprint 2 was not required to touch these. But the regression ships now and will be visible to anyone testing the epic branch end-to-end. Two reasonable resolutions:
- Preferred: add a one-line patch to
doctor.rs:114using the sameresolve_active_dkim_dirhelper this PR already introduced forserve.rs, plus call outdiscover_mailbox_names/ orphan-storage scan as known limitations on the epic branch. - Acceptable: explicitly add these three call-sites to the PR's "Deferred Items" list so Sprint 3 / Sprint 6 pick them up, and accept the diagnostic regression on the epic branch.
Non-blocker: setup_test_env writing legacy v1 configs now silently exercises the migration path in every integration test
This is observed, not flagged in the PR description. Every setup_test_env-using integration test now runs the migration as a side effect of start_serve. The 116 tests in tests/integration.rs all pass, but the test surface is now "fresh v1 install → migrate → run test" rather than "v2 install → run test". Two existing tests had to be patched to re-resolve alice_dir after start_serve because of this (called out in the PR description), and the new layout-aware inbox() / sent() helpers paper over the rest. Worth noting because future test-writers should know setup_test_env produces a v1 install and the daemon migrates on first start.
Classification: Non-blocker. Existing tests pass. The test helper change is documented in the diff. Future cleanup: write a setup_v2_test_env that lands a canonical config + marker so tests can opt out of the migration path when they don't care about it.
Test Coverage
Strong. 26 unit tests in src/upgrade_migration.rs cover detection (5 paths), storage rename (4 paths), DKIM rename (3 paths), config rewrite (2 paths), marker write (1 path), and orchestration (5 paths). 4 integration tests in tests/upgrade.rs cover end-to-end migration, idempotency, corrupted-marker hard-fail, and post-migration SMTP RCPT. Verified locally: all 26 unit tests and all 4 integration tests pass; the existing 116 integration tests in tests/integration.rs still pass with the layout-aware inbox() / sent() helpers; cargo clippy --all-targets -- -D warnings and cargo fmt -- --check clean.
One gap worth noting (Non-blocker): EXDEV is not exercised by a unit test (hard to simulate without two filesystems). The detection logic in is_cross_device_error is correct by inspection — both io::ErrorKind::CrossesDevices and raw_os_error() == Some(18) are checked. A future cross-filesystem test would round this out.
Security Issues
None observed. The migration:
- Reads
config.toml,<data_dir>/.layout-version, and filesystem state at known paths - Writes via
write_atomic(temp-then-rename, atomic on same FS) - Preserves the 0600 / 0644 modes on the DKIM keys across rename
- Creates the per-domain DKIM dir at 0700 root:root
- Runs as part of
aimx servewhich is already root-only - No user-controlled input feeds into path construction (default_domain comes from the loaded
Config, which is itself validated)
The marker write uses <process-pid> in the temp filename which is fine for serialized startup (no concurrent writers). The rename-onto-final-name pattern handles a crash during the marker write cleanly: the temp file is left orphaned (fs::remove_file(&tmp) best-effort on rename failure) and the next migration run starts from the detection layer.
Code Quality
Solid. Highlights:
MoveOutcomeenum is the right abstraction — encodes "renamed / already done / nothing to do" cleanly and the resulting report is what the operator-facing log line consumes.MigratedOutcomeis boxed insideStartupMigrationOutcome::Migratedto silenceclippy::large_enum_variant. Documented in the PR description. Reasonable.detect_layout_stateis a pure function over filesystem state with no writes and no locks — exactly the right shape for the orchestration to detect-then-act under the caller's locks.- The split between
run_migration(pure file work) andrun_startup_migration(drives the detection-decision branching) keeps the unit tests light. - The order-of-operations comment at the top of the module is load-bearing — clearly explains the crash-recovery contract (storage → DKIM → config → marker, so a half-state always prefers "orphaned DKIM key" over "domain in config but DKIM missing").
One minor readability note (Non-blocker): in run_upgrade_migration_at_startup, the pre-flight detection (pre_state) is run BEFORE acquiring any locks, and then a fresh run_startup_migration call inside the locked body re-runs detect_layout_state. The implementer handled the "state changed under us" case defensively (the AlreadyMigrated | Fresh arm in the locked-body match) but the doubled detection is non-obvious. A short comment explaining "pre-flight check is for the no-lock fast path; the inner call inside the lock re-detects to keep the locked critical section authoritative" would help the next reader.
Alignment with PRD
- ✓ FR-G1 (combined transaction): storage rename, DKIM rename, config rewrite (partial —
domainspromotion only), marker write, single INFO log line. Sequence matches. - ✓ FR-G2 (atomicity guarantees): each step is atomic, daemon refuses listener bind until transaction completes, partial completion is detectable via path-existence checks.
- ✓ FR-G3 (idempotency): marker-present → fast no-op (single stat). Corrupted marker → hard startup error. Verified by integration test.
- ⚠ FR-G1 step 3: PRD calls for
[mailboxes.<local>]→[mailboxes."<local>@<domain>"]on-disk re-key. This PR ships only thedomain → domainshalf. Deferred to Sprint 3 per the implementer's note; see Deviation #1 above. - ⚠ §6.12 walkthrough step 4: PRD's normative narrative shows FQDN-keyed mailboxes post-upgrade. Same deferral as above.
- ✓ NFR-2 (concurrency / lock hierarchy): outer per-mailbox locks (sorted FQDN order) → inner
CONFIG_WRITE_LOCK. - ✓ NFR-3 (atomicity): each step uses
rename(2)orwrite_atomic. Crash recovery story preserved (storage first, marker last). - ✓ NFR-4 (performance):
rename(2)is constant-time per top-level directory regardless of mail volume. Marker check is onestatper inbox/sent lookup (microseconds). - ✓ NFR-5 (security): DKIM modes preserved.
Summary and Recommended Actions
Overall verdict: Needs minor fixes — but only because of the unaddressed regression in aimx doctor / discover_mailbox_names and the docs drift on Sprint 2's AC. If the implementer prefers to land this PR as-is and pick those up in Sprints 3 / 6, the verdict slides to "Comment / land as planned" — neither finding rises to a blocker on its own.
Blockers: None. The deferred mailbox-key re-key is technically a Sprint-2 AC miss but it has sound technical justification, is loudly documented, and the integration test pins the resulting deferred behavior so the next sprint can pick it up.
Non-blockers (worth fixing before merge, but won't block):
aimx doctorreports DKIM as missing post-migration.src/doctor.rs:114still reads<dkim_dir>/private.keyand bypasses the layout-aware resolver. Either patch it to useresolve_active_dkim_dir, or explicitly add it to the "Deferred Items" list with a Sprint 6 pointer.discover_mailbox_namesand the doctor orphan-storage scan still read<data_dir>/inbox//<data_dir>/sent/at the root. Add to Deferred Items if not fixing here.- Update
docs/multi-domain-sprint.mdSummary Table so the Sprint 2 row no longer claims(incl. legacy local-part → FQDN re-key), and absorb the re-key into Sprint 3's plan or a follow-up. Otherwise the next sprint reviewer will see an unresolved Sprint 2 AC.
Nice-to-haves:
- Add a unit test for EXDEV detection if a portable simulation is feasible (or document that EXDEV is covered by code inspection only).
- Add a brief comment in
run_upgrade_migration_at_startupexplaining why detection runs both pre-lock and inside the lock. - Future: introduce a
setup_v2_test_envhelper so integration tests can opt out of the silent migration-on-startup path.
The migration itself is well-engineered, the test coverage is strong, and the deferral decision is the right technical call. The remaining items are polish and follow-up bookkeeping.
Addresses three review non-blockers on the upgrade-migration PR: - doctor: route the DKIM private/public-key probes through resolve_active_dkim_dir so a post-migration install correctly reports 'present' from <dkim_dir>/<default_domain>/private.key instead of recommending dkim-keygen against a stale legacy path. - mailbox::discover_mailbox_names + the doctor orphan-storage scan: walk per-domain inbox/sent roots via the new Config::storage_roots helper. On v2 (post-migration) this iterates <data_dir>/<domain>/inbox|sent/ across every configured domain; on v1 it preserves the legacy <data_dir>/inbox|sent/ behavior. Per-domain roots that have not been provisioned yet are skipped. - new unit tests cover post-migration DKIM detection, the v1 fallback, and per-domain mailbox discovery (including the freshly-added-domain skip).
Review non-blockers addressed (commit 45dc558)All three flagged non-blockers are fixed. Code commit pushed to Non-blocker 1:
|
uzyn
left a comment
There was a problem hiding this comment.
Sprint 2 Re-review: One-shot upgrade migration (PR #241)
Verifying the three previously flagged non-blockers against commit 45dc558 and the matching docs commit aimx-docs@1ee70a9.
Verification of Previously Flagged Non-Blockers
NB1 — aimx doctor DKIM probe — RESOLVED
src/doctor.rs:114 and the public-key read at src/doctor.rs:216 now both route through crate::upgrade_migration::resolve_active_dkim_dir(config, &dkim_root_base). Verified against the helper at src/upgrade_migration.rs:689, which probes <dkim_dir>/<default_domain>/private.key and falls back to <dkim_dir> for v1 installs. Three new tests in doctor::dkim_layout_tests pin the three relevant states:
dkim_present_when_only_per_domain_key_exists— post-migration per-domain key reads as present (the originally-flagged regression).dkim_present_on_legacy_v1_layout— fallback to legacy<dkim_dir>/private.keystill works for pre-migration installs.dkim_missing_when_no_key_present— genuine missing case still surfaces, so the doctor MISSING /aimx dkim-keygenrecommendation is not lost.
All three pass locally (cargo test --bin aimx doctor::dkim_layout_tests). The fix is minimal, surgical, and uses the same helper serve.rs already uses for the runtime DKIM load — single source of truth for "where does the active DKIM key live?".
One small note: this required bumping MockNetworkOps field visibility from private to pub(crate) in src/setup.rs so the new test module can construct one. The change is mechanical and confined to a test-only helper struct (the type is already under pub(crate) mod tests). No production-surface impact.
NB2 — Per-domain mailbox storage scans — RESOLVED
The fix introduces a single source of truth via Config::storage_roots() in src/config.rs:1502-1521:
- On v2 (post-migration,
.layout-versionmarker present): returns one<data_dir>/<domain>/entry per configured domain. - On v1 (no marker): returns a single
<data_dir>/entry — preserves legacy behavior.
Both previously flagged call-sites now consume this helper:
mailbox::discover_mailbox_names(src/mailbox.rs:286-296): iteratesconfig.storage_roots(), joinsinbox/per root, aggregates entries across all per-domain inbox dirs. Per-domain roots that do not exist (e.g. a freshly added domain) are silently skipped via theread_direrror path.doctor::check_mailbox_ownershiporphan-storage finding (src/doctor.rs:842-880): same iteration shape, applied to bothinbox/andsent/. The finding message usesroot_dir.strip_prefix(&config.data_dir)so v2 reads as<domain>/<kind>/<name>/and v1 as<kind>/<name>/— operators see the exact on-disk path they need torm -rf. Theunwrap_or(Path::new(kind))fallback is defensive against an unexpectedstrip_prefixfailure.
Three new unit tests in mailbox::tests:
discover_aggregates_per_domain_inbox_dirs_on_v2_layout— verifies aggregation across two per-domain inbox dirs (agent.example.com,two.example.com).discover_falls_back_to_legacy_root_on_v1_layout— v1 regression coverage.discover_skips_missing_per_domain_inbox_dirs— provisions one of two configured domains, asserts no crash and the provisioned domain surfaces.
All three pass locally. The doctor orphan-storage scan is not separately unit-tested, but its logic is structurally identical to discover_mailbox_names and uses the same storage_roots() helper.
NB3 — Sprint summary table — RESOLVED
aimx-docs@1ee70a9 updates multi-domain-sprint.md:
- Sprint 2 row: now lists
structural domain → domains rewrite, storage + DKIM relocation under <domain>/, .layout-version marker, layout-aware path shim (storage_root_for_default_domain, storage_roots, resolve_active_dkim_dir), idempotent re-run, upgrade-fixture integration test, and explicitly bolds "Legacy local-part → FQDN on-disk mailbox-key re-key deferred to Sprint 3" with the justification ("in-memory map already carries legacy keys verbatim from Sprint 1; deferring keeps every existingconfig.mailboxes.get(<local>)callsite working"). - Sprint 3 row: now lists "legacy local-part → FQDN on-disk mailbox-key re-key absorbed from Sprint 2" with the rationale ("runs as part of the data-plane rewire so every
config.mailboxes.get(...)callsite migrates at the same time as the on-disk shape").
The drift between what Sprint 2 actually ships and what the Summary Table claims it ships is now closed. The next sprint's reviewer will see Sprint 2's actual scope reflected in the plan.
Two earlier-approved deviations
- Deferred on-disk mailbox-key re-key (Sprint 3 absorbs it) — approved earlier, no fix expected. The docs update above formalises that decision in the plan.
- Module rename
src/upgrade_migration.rsinstead ofsrc/upgrade.rs— approved earlier, no fix expected. Module name is unchanged in this commit.
Regression / New Issue Check
Walked the full diff 5d0f4cf..45dc558 (four files: src/config.rs, src/doctor.rs, src/mailbox.rs, src/setup.rs). No new issues found:
- The new
Config::storage_roots()is pure (reads filesystem state, returns aVec<PathBuf>, no writes, no locks) — safe for use in the read-only doctor / discovery paths. - The doctor changes preserve the existing
read_direrror-path-as-skip behavior — a missing per-domain dir on a freshly added domain does not crash the doctor. - The orphan-storage message format change is path-relative to
data_dir, so it reads correctly on both v1 (inbox/foo/) and v2 (example.com/inbox/foo/). - The
MockNetworkOpsvisibility bump is mechanical; no production code consumes it. - No new banned planning-doc cross-references introduced (
rgon the cumulative diff is clean outsidedocs/).
Verification commands
cargo fmt -- --check # clean
cargo clippy --all-targets -- -D warnings # clean
cargo test --bin aimx doctor::dkim_layout_tests # 3/3 pass
cargo test --bin aimx mailbox::tests::discover # 3/3 pass
cargo test --bin aimx # 1213/1213 pass (8 ignored, pre-existing)
cargo test --test integration # 116/116 pass (8 ignored, pre-existing)
cargo test --test upgrade # 4/4 pass
Summary
All three previously flagged non-blockers (NB1 doctor DKIM probe, NB2 per-domain storage scans, NB3 sprint plan Summary Table) are fully resolved. The fixes are minimal, well-tested (6 new unit tests, all passing), and introduce a clean new helper (Config::storage_roots()) as a single source of truth for "where mailbox storage roots live across layout versions". No new issues or regressions introduced. The PR is ready to merge to epic/multi-domain.
Overall verdict: Ready to merge.
Blockers: none.
Non-blockers: none.
Nice-to-haves: none new. The two existing nice-to-haves carried from the first review (EXDEV unit test, setup_v2_test_env helper) are still valid but remain non-blocking.
Recommended merge commit message
[Sprint 2] One-shot multi-domain upgrade migration
Atomic on-disk migration that brings legacy single-domain installs
onto the canonical multi-domain layout on first `aimx serve` startup
under the new binary. Synchronous, gated by the `.layout-version: 2`
marker, idempotent across restarts, hard-fail on partial completion.
Storage + DKIM relocation: rename(2) under `<data_dir>/<default_domain>/`
and `<dkim_dir>/<default_domain>/`. Config rewrite: structural
`domain → domains` promotion via `write_atomic`. Marker write: temp-
then-rename of `.layout-version: 2` (0644 root:root). Order is load-
bearing so a crash mid-flow prefers "orphaned DKIM key" over "domain
in config but DKIM missing"; marker is last so a partial run never
claims to be done. Migration runs under the documented lock hierarchy
(outer per-mailbox locks in sorted FQDN order → inner CONFIG_WRITE_LOCK)
before any listener binds.
Layout-aware path shim: `Config::inbox_dir` / `sent_dir` /
`storage_root_for_default_domain` / `storage_roots` consult the
marker so same-process callers see v2 paths immediately after the
marker lands. `resolve_active_dkim_dir` keeps the doctor probe and
the daemon's DKIM load aligned across v1 and v2 installs. The
`send_handler`, `state_handler`, `mailbox_handler`, `doctor`, and
`mailbox::discover_mailbox_names` data-plane paths now route through
these helpers.
Mailbox-key FQDN re-key (`[mailboxes.<local>]` →
`[mailboxes."<local>@<domain>"]`) is deferred to the runtime data-
plane rewire so every `config.mailboxes.get(<local>)` callsite
migrates at the same time as the on-disk shape.
Tests: 26 unit tests in `src/upgrade_migration.rs` cover detection,
each rename step, EXDEV handling by code inspection, the config
rewrite, the marker write, and the orchestration. 4 integration
tests in `tests/upgrade.rs` exercise end-to-end migration,
idempotency, corrupted-marker hard-fail, and post-migration SMTP
RCPT against a realistic v1 fixture. 6 additional unit tests cover
the layout-aware doctor DKIM probe and per-domain mailbox storage
scans.
The mailbox-dir-perms-isolation CI job started failing after the
multi-domain upgrade migration landed: the daemon now relocates
<data_dir>/{inbox,sent}/ to <data_dir>/<default_domain>/{inbox,sent}/
on first start, but four assertions in tests/uds_authz.rs hand-built
their expected paths as <tmp_root>/data/inbox/<name>, which no longer
exist after migration.
Add mailbox_inbox_path / mailbox_sent_path helpers keyed on a single
DEFAULT_DOMAIN const that also drives the test config, and rewrite the
four post-migration stat sites (mailbox_create_non_root_owner_ok,
mailbox_create_non_root_owner_synthesized_from_peercred,
mailbox_delete_non_root_cross_uid_returns_no_such_mailbox) to use
them. The pre-migration v1-layout dirs in spin_up_serve stay as-is so
the daemon's first-run migration still has a legacy tree to relocate.
Verified locally under sudo + AIMX_INTEGRATION_SUDO=1: all 21 ignored
uds_authz tests pass (39.7s). cargo fmt + cargo clippy --all-targets
-D warnings clean.
CI regression fix —
|
The daemon runs with `umask 0o077`, so when the upgrade migration creates `<data_dir>/<domain>/` via `create_dir_all` it lands at `0o700 root:root`. Non-root mailbox owners (e.g. `aimx mcp` under their own uid) then cannot `x`-traverse into their own `inbox/<name>/` subdir and every read surfaces EACCES. Pre-migration the equivalent traversal happened at `<data_dir>/` itself, which setup leaves at `0o755`. Restore that property on the new per-domain dir by explicitly chmodding it to `0o755` after creation (or on a re-entrant migration where a partial earlier run left it locked). The `inbox/<name>/` subdirs themselves remain `0o700` and owner-locked; only the per-domain traversal bit is opened. Regression test pins the mode under the daemon's `0o077` umask so the default `cargo test` `0o022` can't hide the bug again.
Fix:
|
uzyn
left a comment
There was a problem hiding this comment.
Sprint 2 Re-review #2: One-shot upgrade migration (PR #241)
Verifying two new commits landed on top of the previously approved tree, after CI on the prior approval surfaced two failures that local runs hadn't caught.
Commits Reviewed Since the Last Approval
373f23e—tests/uds_authz.rs: route four post-migration stat sites through newmailbox_inbox_path/mailbox_sent_pathhelpers keyed onDEFAULT_DOMAIN; leave the pre-migration v1 setup at lines 389-393 untouched.fbfda62—src/upgrade_migration.rs: explicitlyset_permissions(domain_dir, 0o755)aftercreate_dir_allinrelocate_storage_for_default_domain, plus anUmaskGuardhelper and two regression tests pinned at the daemon's0o077umask.
Verification
373f23e (test-only path update) — pass
- New
DEFAULT_DOMAIN = "it.example.com"const is the single source of truth for both the generated config and the path helpers (spin_up_servewas updated to interpolatedomain = DEFAULT_DOMAINinto thedomain = "..."field and every<local>@<domain>address) — no risk of helper/config skew. - The four rewritten stat sites (
mailbox_create_non_root_owner_ok,mailbox_create_non_root_owner_synthesized_from_peercredx2 lookups,mailbox_delete_non_root_cross_uid_returns_no_such_mailbox) all targeted hand-built<tmp_root>/data/{inbox,sent}/<name>paths that no longer exist after first-start migration. The helpers correctly point them at<tmp_root>/data/it.example.com/{inbox,sent}/<name>. - The pre-migration v1-layout setup at lines 389-393 in
spin_up_serve(creatingdata/inbox/,data/sent/, and per-user subdirs under the legacy paths) is preserved verbatim. This is load-bearing — the daemon'srun_startup_migrationneeds a v1 tree to relocate, so the legacy paths must exist at the v1 shape when the daemon binds. Confirmed by inspection. - No production code touched. No assertion semantics changed — same
stat_owner_usernamecalls, same expected uids, same error-expectation paths. Pure path-construction refactor.
fbfda62 (production chmod fix) — pass on every verification point
- Chmod is unconditional. Lines 292-300 sit outside the
if !domain_dir.exists()guard at 286-291, so every call torelocate_storage_for_default_domainrewrites the mode. A re-entrant migration after a crashed earlier run that left the dir at0o700will heal back to0o755. The newstorage_relocation_chmods_existing_per_domain_dirtest pins this exact contract by pre-creating the dir at0o700and asserting0o755after. Operator hand-tightening will be reverted — see the open question below. inbox/<name>/andsent/<name>/subdirs remain0o700and owner-locked.move_if_pendingusesfs::rename(src, dst)which preserves the source dir's mode unchanged — the existing0o700modes baked in bymailbox_handler::create_mailbox_dirs(lines 407, 414) andmailbox::ensure_mailbox_dirs(line 240) survive the rename verbatim. The chmod is scoped todomain_diritself, not its contents. There is noset_permissionswalking the subtree. Confirmed by inspection.UmaskGuardis sound. Uses a staticMutex<()>to serialize all umask mutations across parallel test threads, handles poisoning viaunwrap_or_else(|p| p.into_inner()), captures previous umask vialibc::umask(new)(which atomically swaps and returns prior value), restores inDrop. The unsafe block is correctly justified by the comment ("umask(2) is thread-unsafe but the mutex above serializes every caller"). Lock is held for the fullUmaskGuardlifetime via_lock: MutexGuard<'static, ()>, so concurrent tests block correctly. One minor note: if a future test panics inside the guard,Dropstill runs (RAII contract) and the mutex gets released — sound.- Chmod path matches
setup.rsdata_dir treatment.install_config_dir(setup.rs:1647) sets0o755explicitly on the config dir; the<data_dir>itself is created viacreate_dir_allfrom setup (line 2494) under the setup CLI's default0o022umask, landing at0o755. The fix correctly aligns<data_dir>/<domain>/with<data_dir>itself — both0o755, both traversable by non-root mailbox owners. The containedinbox/<name>/andsent/<name>/stay0o700matching mailbox subdirs on v1. - No security regression. The widened bit is exactly
--xfor group/other (well,r-x, but the dir lists nothing useful — non-owners can't enumerate mailboxes because theinbox/<name>/subdirs they wouldreaddir(2)for are at0o700). The DKIM per-domain dir atrelocate_dkim_for_default_domain(lines 354-370) stays at0o700— correct, because DKIM key material must be root-only. The asymmetry between the two relocate functions is intentional and well-motivated.
I also ran cargo test --bin aimx storage_relocation locally — all 6 tests in that filter pass, including both new regressions. cargo fmt --check and cargo clippy --all-targets -D warnings are clean. CI shows green on core-tests, mailbox-dir-perms-isolation, docs-build, and verifier-tests at fbfda62.
Disposition on the Open Question
"Should the chmod be guarded so an operator's deliberate
0o700on<data_dir>/<domain>/isn't widened on next startup?"
Leave it unconditional. The implementer's reasoning holds:
- The daemon is the authoritative owner of the storage layout shape. The chmod-on-every-call posture is a feature, not a bug — operators who tighten
<data_dir>/<domain>/to0o700break every non-root MCP and CLI read, so the daemon proactively heals the directory back to a working state on next start. This matches the cascade-delete contract where the daemon owns disk shape underCONFIG_WRITE_LOCK. - The DKIM relocate function (line 355) does guard its chmod with
if created_domain_dirbecause the contract is opposite: DKIM key dir must always be0o700, and a pre-existing tighter mode is fine. Storage is the inverse: must always be0o755, and a pre-existing tighter mode is broken state. - A worked example: an operator who runs
chmod 0700 /var/lib/aimx/example.comand restarts the daemon will see traversal restored. That's the right outcome — there is no legitimate reason for that directory to be0o700on a multi-domain install.
If documenting becomes important post-merge, a one-liner in book/configuration.md or book/multi-domain.md noting that the daemon enforces 0o755 on <data_dir>/<domain>/ on every startup is the right home — but it's not a merge blocker.
New Issues Found
None. Both commits are tightly scoped, well-tested, and correct.
Summary
The CI failures the prior approval missed are now fully addressed: a test-only path refactor for tests/uds_authz.rs that preserves the legacy v1 setup the migration needs to operate on, and a real production bug fix that makes the per-domain storage dir world-traversable under the daemon's 0o077 umask. The fix is minimal, the regression tests pin the exact failure mode with an UmaskGuard that forces the daemon's umask back into the test environment, and the asymmetric posture (storage 0o755, DKIM 0o700) is intentional and well-motivated. No regressions introduced.
Verdict: Ready to merge.
[Sprint 2] One-shot multi-domain upgrade migration
Atomic on-disk migration that brings legacy single-domain installs
onto the canonical multi-domain layout on first `aimx serve`
startup. Storage and DKIM keys are relocated to per-domain
subdirectories via `rename(2)` (atomic on same filesystem,
constant time), the config is rewritten from `domain = "x"` to
`domains = ["x"]`, and a `.layout-version: 2` marker is written
last so partial runs never claim to be done. The per-domain
storage dir is chmodded to `0o755` on every call so non-root
mailbox owners can traverse into their own `inbox/<name>/` subdir
under the daemon's `0o077` umask; the contained mailbox subdirs
remain `0o700` and owner-locked. EXDEV surfaces an actionable
error. Layout-aware path resolution shim added so same-process
callers see v2 paths immediately after the marker lands. Doctor
DKIM probe and per-domain mailbox storage scans route through the
new layout. Mailbox-key FQDN re-key
(`[mailboxes.<local>]` → `[mailboxes."<local>@<domain>"]`) is
deferred to the runtime data-plane rewire so every
`config.mailboxes.get(<local>)` callsite migrates at the same
time as the on-disk shape.
Tests: 26 unit tests in `src/upgrade_migration.rs` cover detection,
each rename step, EXDEV handling by code inspection, the config
rewrite, the marker write, and the orchestration. 4 integration
tests in `tests/upgrade.rs` exercise end-to-end migration,
idempotency, corrupted-marker hard-fail, and post-migration SMTP
RCPT against a realistic v1 fixture. 6 additional unit tests cover
the layout-aware doctor DKIM probe and per-domain mailbox storage
scans. 2 new regression tests in `src/upgrade_migration.rs` pin
the per-domain storage dir at `0o755` under the daemon's `0o077`
umask, via a thread-safe `UmaskGuard` helper. Four post-migration
stat sites in `tests/uds_authz.rs` routed through new
`mailbox_inbox_path` / `mailbox_sent_path` helpers so the
`mailbox-dir-perms-isolation` CI job stays green.
Sprint Goal
Implement the atomic on-disk migration that brings legacy single-domain installs onto the canonical multi-domain layout on first
aimx servestartup under the new binary. Idempotent across restarts, hard-fail on partial completion, single INFO log line on success.Stories Implemented
[S2-1] Detection logic +
.layout-versionmarker semanticssrc/upgrade_migration.rsexposespub enum LayoutState { Migrated, NeedsMigration(Indicators), FreshInstall, Corrupted(String) }andpub fn detect_layout_state(...).Indicatorscarries which v1-shape signals fired, for logging at migration time.LayoutState::Corruptedreturned on wrong/garbage.layout-versioncontent;serve.rstranslates to a startup hard error.LayoutState::FreshInstalltriggers a one-shot.layout-version: 2write before the daemon starts accepting traffic.src/upgrade_migration.rsrather than the originally-spec'dsrc/upgrade.rsbecause the binary self-update flow already owns theupgradename. Called out below in Technical Decisions.[S2-2] Storage + DKIM atomic relocation via
rename(2)relocate_storage_for_default_domain(data_dir, default_domain)renames<data_dir>/inbox→<data_dir>/<default_domain>/inbox/and<data_dir>/sent→<data_dir>/<default_domain>/sent/.relocate_dkim_for_default_domain(dkim_dir, default_domain)creates<dkim_dir>/<default_domain>/mode0700 root:root, then renamesprivate.keyandpublic.keyinto it (modes0600/0644preserved across rename)."<src> and <dst> must be on the same filesystem; see book/multi-domain.md for manual recovery".TempDir: clean v1 → v2, half-relocated storage (skips rename), missing DKIM file (proceeds), refuses to overwrite when both source and destination exist.[S2-3] Config rewrite via
write_atomic+.layout-versionwriterewrite_config_to_canonical_shape(config_path, in_memory_config)serializes via the existingwrite_atomichelper.write_layout_version_marker(data_dir)writes<data_dir>/.layout-versioncontaining2\n, mode0644 root:root.[mailboxes.info]→[mailboxes.\"info@<domains[0]>\"]). Doing it here would break every downstreamconfig.mailboxes.get(<local>)callsite (ingest, send, hooks, CLImailboxes show/hooks create) until the runtime data plane is rewired. The structuraldomain→domainspromotion still ships in this PR. See Technical Decisions / Deferred Items below.[S2-4] Wire the migration into
serve.rsstartup under the lock hierarchyserve.rsstartup sequence: load config → detect layout state → ifNeedsMigration, acquire locks + run migration steps in order (storage → DKIM → config → marker) → continue to listener bind.LayoutState::FreshInstall, write.layout-version: 2before continuing.serve.rsreturns the canonical\"upgrade migration failed: <reason>; see book/multi-domain.md for manual recovery\"error which surfaces to the operator on stderr and exits with code 1.tracing::info!(target: \"aimx::upgrade\", \"upgrade migration completed: ...\")line summarising every file/path that moved..layout-version: 2present:detect_layout_stateshort-circuits withMigrated, no locks acquired, no log line.MailboxLockspool, thenCONFIG_WRITE_LOCK. Matches the documented cascade-delete ordering.[S2-5] Upgrade-fixture integration test
tests/fixtures/upgrade/v1-single-domain/with 2 mailboxes (info,support) and 2 messages per mailbox.tests/upgrade.rsexercises the full upgrade path end-to-end against the fixture.domains = [\"fixture.example\"]),.layout-version: 2present, original messages still readable from the per-domain path.aimx servestart against the sameTempDiris a no-op (config + marker byte-identical to after first start)..layout-version(99) causes the daemon to exit with the canonical hard error.Technical Decisions
domain → domainsAND[mailboxes.<local>]→[mailboxes.\"<local>@<domain>\"]to land in the config rewrite. Doing the latter in this sprint breaks every downstreamconfig.mailboxes.get(<local>)callsite (ingest::resolve_recipient_mailbox,send_handlerlookups, CLIaimx hooks create --mailbox alice,aimx mailboxes show <name>) because the runtime data plane is not yet FQDN-aware. The result is a daemon that migrates on first start and then can't deliver mail to its own configured mailboxes — every existing integration test that usessetup_test_env(~120 tests) fails. The pragmatic call: ship the structuraldomain → domainspromotion now, defer the mailbox-key FQDN re-key to the runtime data-plane rewire (next sprint) where it lands alongsideresolve_mailbox_for_rcptand the per-domain DKIM map. On-disk shape after this PR: canonicaldomains = [\"x.com\"]with operator-friendly mailbox keys preserved. The marker is still bumped to2because the storage + DKIM layout fully converts to the per-domain shape.Config::inbox_dir/sent_dir. Sprint 2's storage relocation physically moves mailbox directories to<data_dir>/<default_domain>/{inbox|sent}/<name>/. The runtime data plane (ingest, send_handler, state_handler, mailbox_handler, doctor, CLI) needs to find them. Rather than refactor every callsite to the new per-domain storage helper (that's the next sprint's S3-5 deliverable),Config::inbox_dir/sent_dir/mailbox_dirconsult<data_dir>/.layout-versionand route to the per-domain location post-migration. Same-process callers see v2 paths immediately after the marker lands. Onestatper call, microsecond cost.serve.rscalls intocrate::upgrade_migration::resolve_active_dkim_dir(&config, &dkim_dir)to find the key dir. If<dkim_dir>/<default_domain>/private.keyexists (post-migration), uses the per-domain path; otherwise falls back to<dkim_dir>/(fresh single-domain installs). The follow-upHashMap<domain, DkimKeyEntry>lookup replaces this single-key load entirely.send_handler+state_handler+mailbox_handlernow route storage paths through the layout-awareConfighelpers. Previously these builtdata_dir.join(\"inbox\").join(name)directly, bypassingConfig::inbox_dir. Updated to go through the layout-aware helpers so a post-migration daemon writes/reads under the per-domain tree.src/upgrade_migration.rs, notsrc/upgrade.rs. The sprint spec called forsrc/upgrade.rs, but the existing self-update CLI (aimx upgrade) already owns that module name and is widely referenced. Renaming the existing module would have been a much larger and unrelated diff. The new module's docstring and the commit message both explain the naming.MigratedOutcomeboxed insideStartupMigrationOutcome::Migrated. Clippy'slarge_enum_variantlint flagged the inline form. Boxing keeps the enum tiny on the stack at every pattern-match site, with no observable runtime cost (one heap allocation per migration run, and the migration runs at most once per install).Deferred Items
resolve_mailbox_for_rcptand the per-domain DKIM map.HashMapload. The currentserve.rsstill loads a single DKIM key for the default domain. The per-domainHashMap<String, DkimKeyEntry>keyed on the From: domain (with hot-reload viaArcSwap) lands in the next sprint's S3-2.Review Focus Areas
src/upgrade_migration.rs— the heart of the sprint. The detection tree (detect_layout_state), the rename helpers (relocate_storage_for_default_domain,relocate_dkim_for_default_domain,move_if_pending), the config rewrite, the marker write, the orchestration (run_startup_migration), and the layout-aware DKIM resolver (resolve_active_dkim_dir). Cross-device EXDEV handling deserves attention — we hard-fail rather than copy+delete because atomicity matters more than convenience here.src/serve.rs::run_upgrade_migration_at_startup— the lock-hierarchy gymnastics. Outer per-mailbox locks acquired in sorted FQDN order against the sharedMailboxLockspool, then innerCONFIG_WRITE_LOCK. The lock acquisition is technically unnecessary at startup (no concurrent writer exists), but the discipline is in place so any future refactor that moves the migration later in startup inherits it.src/config.rs::Config::inbox_dir/sent_dir/storage_root_for_default_domain— the layout-aware path resolution shim. The marker check on every call is the part most likely to deserve a cached field; the next sprint's storage helper will replace it entirely.tests/upgrade.rs— the end-to-end integration coverage. The fixture undertests/fixtures/upgrade/v1-single-domain/is realistic (legacydomain = \"fixture.example\"config + 2 mailboxes + 2 messages each + a v1 DKIM keypair sourced from the test-scoped cache). The four test functions exercise the full happy path, the idempotency invariant, the corrupted-marker hard-fail, and the post-migration SMTP RCPT.tests/integration.rs::storage_subdir— the layout-aware test helper that lets the existing integration suite keep finding mailbox files after the migration relocates them. Two existing tests (mcp_mark_read_concurrent_with_inbound_ingest,concurrent_ingest_burst_and_mark_same_mailbox_no_torn_writes) now re-resolvealice_dirafterstart_serveso the pre-seed and post-daemon-read use the right path.Test Coverage
cargo test --bin aimx).cargo test --test integration); 8 ignored as before.cargo test --test upgrade).cargo clippy --all-targets -- -D warningsclean.cargo fmt -- --checkclean.services/verifier/) builds clean (no changes there).