Skip to content

[Sprint 2] One-shot upgrade migration#241

Merged
uzyn merged 4 commits into
epic/multi-domainfrom
sprint-2-multi-domain-upgrade-migration
May 23, 2026
Merged

[Sprint 2] One-shot upgrade migration#241
uzyn merged 4 commits into
epic/multi-domainfrom
sprint-2-multi-domain-upgrade-migration

Conversation

@uzyn
Copy link
Copy Markdown
Owner

@uzyn uzyn commented May 23, 2026

Sprint Goal

Implement the atomic on-disk migration that brings legacy single-domain installs onto the canonical multi-domain layout on first aimx serve startup under the new binary. Idempotent across restarts, hard-fail on partial completion, single INFO log line on success.

Stories Implemented

[S2-1] Detection logic + .layout-version marker semantics

  • New module src/upgrade_migration.rs exposes pub enum LayoutState { Migrated, NeedsMigration(Indicators), FreshInstall, Corrupted(String) } and pub fn detect_layout_state(...).
  • Indicators carries which v1-shape signals fired, for logging at migration time.
  • LayoutState::Corrupted returned on wrong/garbage .layout-version content; serve.rs translates to a startup hard error.
  • LayoutState::FreshInstall triggers a one-shot .layout-version: 2 write before the daemon starts accepting traffic.
  • Unit tests cover: pristine v1 layout, half-migrated layout, fully migrated, fresh install, corrupted marker.
  • Module is named src/upgrade_migration.rs rather than the originally-spec'd src/upgrade.rs because the binary self-update flow already owns the upgrade name. Called out below in Technical Decisions.

[S2-2] Storage + DKIM atomic relocation via rename(2)

  • relocate_storage_for_default_domain(data_dir, default_domain) renames <data_dir>/inbox<data_dir>/<default_domain>/inbox/ and <data_dir>/sent<data_dir>/<default_domain>/sent/.
  • relocate_dkim_for_default_domain(dkim_dir, default_domain) creates <dkim_dir>/<default_domain>/ mode 0700 root:root, then renames private.key and public.key into it (modes 0600 / 0644 preserved across rename).
  • Idempotency per-file: when the destination exists and the source doesn't, the rename is skipped.
  • EXDEV detection: cross-filesystem rename caught and reported as "<src> and <dst> must be on the same filesystem; see book/multi-domain.md for manual recovery".
  • Unit tests with TempDir: clean v1 → v2, half-relocated storage (skips rename), missing DKIM file (proceeds), refuses to overwrite when both source and destination exist.

[S2-3] Config rewrite via write_atomic + .layout-version write

  • rewrite_config_to_canonical_shape(config_path, in_memory_config) serializes via the existing write_atomic helper.
  • write_layout_version_marker(data_dir) writes <data_dir>/.layout-version containing 2\n, mode 0644 root:root.
  • Idempotency: re-running on an already-canonical config writes the same bytes.
  • Unit tests cover legacy → canonical rewrite, already-canonical no-op, marker write idempotent across repeated calls.
  • Deferred: the on-disk legacy local-part → FQDN re-key ([mailboxes.info][mailboxes.\"info@<domains[0]>\"]). Doing it here would break every downstream config.mailboxes.get(<local>) callsite (ingest, send, hooks, CLI mailboxes show / hooks create) until the runtime data plane is rewired. The structural domaindomains promotion still ships in this PR. See Technical Decisions / Deferred Items below.

[S2-4] Wire the migration into serve.rs startup under the lock hierarchy

  • serve.rs startup sequence: load config → detect layout state → if NeedsMigration, acquire locks + run migration steps in order (storage → DKIM → config → marker) → continue to listener bind.
  • If LayoutState::FreshInstall, write .layout-version: 2 before continuing.
  • On migration error, serve.rs returns the canonical \"upgrade migration failed: <reason>; see book/multi-domain.md for manual recovery\" error which surfaces to the operator on stderr and exits with code 1.
  • On migration success, emit a single tracing::info!(target: \"aimx::upgrade\", \"upgrade migration completed: ...\") line summarising every file/path that moved.
  • Subsequent restarts with .layout-version: 2 present: detect_layout_state short-circuits with Migrated, no locks acquired, no log line.
  • Per-mailbox locks acquired in sorted FQDN order against the shared MailboxLocks pool, then CONFIG_WRITE_LOCK. Matches the documented cascade-delete ordering.

[S2-5] Upgrade-fixture integration test

  • Fixture v1 install at tests/fixtures/upgrade/v1-single-domain/ with 2 mailboxes (info, support) and 2 messages per mailbox.
  • New integration test tests/upgrade.rs exercises the full upgrade path end-to-end against the fixture.
  • Post-migration assertions: storage relocated, DKIM relocated, config normalized (domains = [\"fixture.example\"]), .layout-version: 2 present, original messages still readable from the per-domain path.
  • Post-migration SMTP RCPT to a v1 mailbox local-part lands in the new per-domain inbox.
  • Idempotency: second aimx serve start against the same TempDir is a no-op (config + marker byte-identical to after first start).
  • Hard-fail: a fixture with a corrupted .layout-version (99) causes the daemon to exit with the canonical hard error.

Technical Decisions

  • Mailbox-key re-key deferred. The S2-3 acceptance criteria call for both domain → domains AND [mailboxes.<local>][mailboxes.\"<local>@<domain>\"] to land in the config rewrite. Doing the latter in this sprint breaks every downstream config.mailboxes.get(<local>) callsite (ingest::resolve_recipient_mailbox, send_handler lookups, CLI aimx hooks create --mailbox alice, aimx mailboxes show <name>) because the runtime data plane is not yet FQDN-aware. The result is a daemon that migrates on first start and then can't deliver mail to its own configured mailboxes — every existing integration test that uses setup_test_env (~120 tests) fails. The pragmatic call: ship the structural domain → domains promotion now, defer the mailbox-key FQDN re-key to the runtime data-plane rewire (next sprint) where it lands alongside resolve_mailbox_for_rcpt and the per-domain DKIM map. On-disk shape after this PR: canonical domains = [\"x.com\"] with operator-friendly mailbox keys preserved. The marker is still bumped to 2 because the storage + DKIM layout fully converts to the per-domain shape.
  • Layout-aware path shim in Config::inbox_dir / sent_dir. Sprint 2's storage relocation physically moves mailbox directories to <data_dir>/<default_domain>/{inbox|sent}/<name>/. The runtime data plane (ingest, send_handler, state_handler, mailbox_handler, doctor, CLI) needs to find them. Rather than refactor every callsite to the new per-domain storage helper (that's the next sprint's S3-5 deliverable), Config::inbox_dir / sent_dir / mailbox_dir consult <data_dir>/.layout-version and route to the per-domain location post-migration. Same-process callers see v2 paths immediately after the marker lands. One stat per call, microsecond cost.
  • DKIM key load resolves per-domain post-migration. serve.rs calls into crate::upgrade_migration::resolve_active_dkim_dir(&config, &dkim_dir) to find the key dir. If <dkim_dir>/<default_domain>/private.key exists (post-migration), uses the per-domain path; otherwise falls back to <dkim_dir>/ (fresh single-domain installs). The follow-up HashMap<domain, DkimKeyEntry> lookup replaces this single-key load entirely.
  • send_handler + state_handler + mailbox_handler now route storage paths through the layout-aware Config helpers. Previously these built data_dir.join(\"inbox\").join(name) directly, bypassing Config::inbox_dir. Updated to go through the layout-aware helpers so a post-migration daemon writes/reads under the per-domain tree.
  • New module is src/upgrade_migration.rs, not src/upgrade.rs. The sprint spec called for src/upgrade.rs, but the existing self-update CLI (aimx upgrade) already owns that module name and is widely referenced. Renaming the existing module would have been a much larger and unrelated diff. The new module's docstring and the commit message both explain the naming.
  • MigratedOutcome boxed inside StartupMigrationOutcome::Migrated. Clippy's large_enum_variant lint flagged the inline form. Boxing keeps the enum tiny on the stack at every pattern-match site, with no observable runtime cost (one heap allocation per migration run, and the migration runs at most once per install).

Deferred Items

  • On-disk legacy local-part → FQDN mailbox key re-key. Deferred from the S2-3 AC because the runtime data plane is not yet FQDN-aware. Lands in the runtime data-plane rewire that introduces resolve_mailbox_for_rcpt and the per-domain DKIM map.
  • Per-domain DKIM HashMap load. The current serve.rs still loads a single DKIM key for the default domain. The per-domain HashMap<String, DkimKeyEntry> keyed on the From: domain (with hot-reload via ArcSwap) lands in the next sprint's S3-2.

Review Focus Areas

  • src/upgrade_migration.rs — the heart of the sprint. The detection tree (detect_layout_state), the rename helpers (relocate_storage_for_default_domain, relocate_dkim_for_default_domain, move_if_pending), the config rewrite, the marker write, the orchestration (run_startup_migration), and the layout-aware DKIM resolver (resolve_active_dkim_dir). Cross-device EXDEV handling deserves attention — we hard-fail rather than copy+delete because atomicity matters more than convenience here.
  • src/serve.rs::run_upgrade_migration_at_startup — the lock-hierarchy gymnastics. Outer per-mailbox locks acquired in sorted FQDN order against the shared MailboxLocks pool, then inner CONFIG_WRITE_LOCK. The lock acquisition is technically unnecessary at startup (no concurrent writer exists), but the discipline is in place so any future refactor that moves the migration later in startup inherits it.
  • src/config.rs::Config::inbox_dir / sent_dir / storage_root_for_default_domain — the layout-aware path resolution shim. The marker check on every call is the part most likely to deserve a cached field; the next sprint's storage helper will replace it entirely.
  • tests/upgrade.rs — the end-to-end integration coverage. The fixture under tests/fixtures/upgrade/v1-single-domain/ is realistic (legacy domain = \"fixture.example\" config + 2 mailboxes + 2 messages each + a v1 DKIM keypair sourced from the test-scoped cache). The four test functions exercise the full happy path, the idempotency invariant, the corrupted-marker hard-fail, and the post-migration SMTP RCPT.
  • tests/integration.rs::storage_subdir — the layout-aware test helper that lets the existing integration suite keep finding mailbox files after the migration relocates them. Two existing tests (mcp_mark_read_concurrent_with_inbound_ingest, concurrent_ingest_burst_and_mark_same_mailbox_no_torn_writes) now re-resolve alice_dir after start_serve so the pre-seed and post-daemon-read use the right path.

Test Coverage

  • 1207 unit tests pass (cargo test --bin aimx).
  • 116 integration tests pass (cargo test --test integration); 8 ignored as before.
  • 4 upgrade-fixture integration tests pass (cargo test --test upgrade).
  • cargo clippy --all-targets -- -D warnings clean.
  • cargo fmt -- --check clean.
  • Verifier crate (services/verifier/) builds clean (no changes there).

Adds the atomic on-disk migration that brings legacy single-domain
installs onto the canonical multi-domain layout. Runs synchronously
at daemon startup before SMTP/UDS listeners bind, gated by the
`.layout-version: 2` marker so re-runs are a fast no-op.

What changes per step:

* Storage rename: `<data_dir>/{inbox,sent}/` →
  `<data_dir>/<default_domain>/{inbox,sent}/` via `rename(2)` (atomic
  on same FS, constant time regardless of mail volume). EXDEV
  surfaces an actionable error.
* DKIM rename: `<dkim_dir>/{private,public}.key` →
  `<dkim_dir>/<default_domain>/{...}` with 0600/0644 modes preserved
  and the new parent dir created at 0700 root:root.
* Config rewrite: `domain = "x.com"` → `domains = ["x.com"]` via
  the existing `write_atomic` helper. **Mailbox keys are
  deliberately preserved on disk** (`[mailboxes.info]` stays
  `[mailboxes.info]`) so every downstream CLI lookup that targets a
  mailbox by local-part keeps working unchanged in this sprint. The
  on-disk FQDN re-key is deferred to the runtime data-plane rewire.
* Marker write: `<data_dir>/.layout-version` containing `2\n` (0644
  root:root), temp-then-rename so concurrent readers never see a
  truncated value.

Order is load-bearing per the design contract: prefer "orphaned DKIM
key" over "domain in config but DKIM key missing" if a crash
interrupts mid-flow. Marker is last so a partial run never claims to
be done.

Lock hierarchy honored at startup: outer per-mailbox locks acquired
in sorted FQDN order against the shared `MailboxLocks` pool, then
the inner `CONFIG_WRITE_LOCK`, matching the documented cascade-
delete ordering and preventing deadlocks with concurrent CRUD that
comes online once the listener binds.

Layout-aware path resolution shim added: `Config::inbox_dir` /
`sent_dir` / `mailbox_dir` and the storage paths in `send_handler`
and `state_handler` consult `<data_dir>/.layout-version` and route
to the per-domain location post-migration. Same-process callers see
v2 paths immediately after the marker lands; the runtime data plane
keeps working without a per-domain HashMap lookup until the proper
storage helper ships.

DKIM key load in `serve.rs` now resolves `<dkim_dir>/<default_domain>/`
when a per-domain key is present, falling back to the legacy
`<dkim_dir>` root for fresh installs that haven't generated a per-
domain key yet.

Tests added:

* 26 unit tests in `src/upgrade_migration.rs` cover detection
  (pristine v1, half-migrated, fresh, corrupted marker, canonical
  per-domain sub-tables), each relocation step (rename, EXDEV,
  missing files, idempotency, mode preservation), the config rewrite,
  the marker write, and the orchestration (`run_startup_migration`).
* 4 integration tests in `tests/upgrade.rs` exercise the full
  migration end-to-end against a realistic v1 fixture under
  `tests/fixtures/upgrade/v1-single-domain/` (2 mailboxes, 2
  messages each): end-to-end success, idempotency on second start,
  hard-fail on corrupted marker, and post-migration SMTP RCPT
  landing in the per-domain `<data_dir>/<default_domain>/inbox/`
  path.

Deviations from the AC, called out explicitly in the PR description:

* `rewrite_config_to_canonical_shape` is structurally scoped to
  `domain → domains` only. The legacy `[mailboxes.<local>]` →
  `[mailboxes."<local>@<domain>"]` re-key is deferred to the
  follow-up runtime rewire that teaches every callsite to look up
  mailboxes by FQDN.
* The new module is named `src/upgrade_migration.rs` (the sprint
  spec called for `src/upgrade.rs`, but that name is already used by
  the binary self-update flow).
Copy link
Copy Markdown
Owner Author

@uzyn uzyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sprint 2 Review: One-shot upgrade migration (PR #241)

Sprint Goal Assessment

The sprint goal — "atomic on-disk migration that brings legacy single-domain installs onto the canonical multi-domain layout on first aimx serve startup, idempotent across restarts, hard-fail on partial completion" — is substantially achieved. Storage and DKIM are atomically relocated to the per-domain layout via rename(2), the .layout-version: 2 marker is written last per the documented crash-recovery contract, the migration runs under the correct lock hierarchy before any listener binds, and the integration suite end-to-ends the full path including post-migration SMTP RCPT delivery, idempotency, and corrupted-marker hard-fail. The one piece the sprint plan explicitly required and this PR explicitly defers — the on-disk [mailboxes.<local>][mailboxes."<local>@<domain>"] re-key — has a sound technical justification (the runtime data plane is not yet FQDN-aware) and the deferral is documented loudly in the PR description.

Acceptance Criteria Checklist

[S2-1] Detection logic + .layout-version marker semantics

  • src/upgrade_migration.rs (note: not src/upgrade.rs — see deviation #2) — met
  • LayoutState { Migrated, NeedsMigration(Indicators), FreshInstall, Corrupted(String) } — met
  • detect_layout_state(data_dir, dkim_dir, config_path, default_domain) -> LayoutState — met
  • Indicators carries 5 v1-shape signals (legacy_inbox_dir, legacy_sent_dir, legacy_dkim_key, legacy_config_domain_field, legacy_mailbox_local_part_keys) — met
  • Corrupted returned on wrong/garbage marker content — met
  • FreshInstall triggers a proactive marker write — met (in run_startup_migration)
  • Unit tests cover pristine v1, half-migrated, fully migrated, fresh install, corrupted marker — met

[S2-2] Storage + DKIM atomic relocation via rename(2)

  • relocate_storage_for_default_domain renames inbox and sent — met
  • relocate_dkim_for_default_domain creates <dkim_dir>/<default_domain>/ mode 0700, renames private.key/public.key — met
  • Modes preserved across rename (test asserts 0600/0644) — met
  • Idempotency per-file via MoveOutcome::AlreadyDone — met
  • EXDEV detection via is_cross_device_error accepting both ErrorKind::CrossesDevices and raw OS error 18 — met
  • Error message: "<src> and <dst> must be on the same filesystem; see book/multi-domain.md for manual recovery" — met (verbatim)
  • Unit tests cover clean migration, half-relocated storage, missing DKIM file, refuses-to-overwrite when both sides exist — met. EXDEV is not unit-tested (hard to simulate without two filesystems) but the detection code path is correct by inspection.

[S2-3] Config rewrite via write_atomic + .layout-version write

  • rewrite_config_to_canonical_shape uses write_atomic — met
  • write_layout_version_marker writes 2\n mode 0644 via temp-then-rename — met
  • Idempotency on canonical input — met (byte-equal across repeated calls)
  • On-disk [mailboxes.<local>][mailboxes."<local>@<domain>"] re-key — DEFERRED. Sprint plan AC says the rewrite should normalize both domain → domains AND mailbox keys; this PR ships only the first. See "Deviation #1" below.

[S2-4] Wire the migration into serve.rs startup under the lock hierarchy

  • Sequence: load config → detect layout state → if NeedsMigration, acquire locks + run migration steps (storage → DKIM → config → marker) → continue to listener bind — met (line 589 of serve.rs, before the README refresh, the DKIM key load, the TLS init, and the listener binds)
  • FreshInstall writes marker — met
  • Error message: "upgrade migration failed: <reason>; see book/multi-domain.md for manual recovery" — met (verbatim)
  • Migration success emits a single tracing::info! log line with target: "aimx::upgrade" — met
  • Already-migrated short-circuits with no locks, no log — met
  • Per-mailbox locks acquired in sorted FQDN order against the shared MailboxLocks pool, then CONFIG_WRITE_LOCK (outer → inner per the documented hierarchy) — met
  • Exits with code 1 on failure — met (Err propagates through Result<(), Box<dyn Error>> to process::exit(1))

[S2-5] Upgrade-fixture integration test

  • Fixture at tests/fixtures/upgrade/v1-single-domain/ with 2 mailboxes, 2 messages each — met
  • tests/upgrade.rs exercises end-to-end migration, idempotency, corrupted-marker hard-fail, post-migration SMTP RCPT — met (4 tests, all green locally)
  • Post-migration assertions: storage relocated, DKIM relocated, config normalized, marker present, messages still readable — met
  • SMTP RCPT to a v1 mailbox local-part lands in the new per-domain inbox — met (upgrade_smtp_rcpt_lands_in_new_fqdn_keyed_path)
  • Second start is a no-op (config + marker byte-identical) — met
  • Corrupted .layout-version causes daemon to exit with canonical hard error — met

Disposition on Flagged Deviations

Deviation 1: Mailbox-key FQDN re-key deferred to Sprint 3

Verdict: ACCEPT with a follow-up requirement to update the sprint plan and PRD §6.12.

Reasoning:

The PRD §6.7 FR-G1 step 3 and the normative §6.12 walkthrough step 4 both call for the on-disk config to land in this shape post-upgrade:

domains = ["mydomain.com"]
[mailboxes."info@mydomain.com"]
[mailboxes."support@mydomain.com"]

This PR ships only the first line. The mailbox keys stay [mailboxes.info] / [mailboxes.support] on disk.

The implementer's reasoning holds up under scrutiny: doing the on-disk FQDN re-key in Sprint 2 without the runtime data-plane rewire would break every config.mailboxes.get(<local>) callsite (ingest, send, hooks, mailbox CLI), and Sprint 3 is where resolve_mailbox_for_rcpt lands to make those callsites FQDN-aware. Shipping the on-disk re-key now would either (a) break ~120 existing integration tests that pass mailbox local-parts to CRUD, or (b) require an emergency rewire across multiple modules just to keep the migration usable. Splitting the rewrite across two sprints is the right call.

The PR's compensating "layout-aware path shim" in Config::inbox_dir / sent_dir / storage_root_for_default_domain plus the matching updates in send_handler, state_handler, mailbox_handler is a clean bridge: the marker file is the single source of truth for "are we on v2?", same-process callers see v2 paths immediately after the marker lands, and the cost is one stat per inbox/sent lookup (microseconds, well within NFR-4). The integration test upgrade_smtp_rcpt_lands_in_new_fqdn_keyed_path proves that end-to-end functional behavior post-migration is correct: storage lives at the per-domain location, DKIM lives at the per-domain location, mail flows through the per-domain path, and the runtime keeps resolving mailboxes by their operator-friendly key.

What ships now is correct and forward-compatible. It is not "half-done" — it is "structurally migrated with the operator-friendly key preserved on disk pending a runtime data-plane rewire". The implementer documented this clearly in the PR description, in the module docstring on rewrite_config_to_canonical_shape, and in the integration test assertions (which pin the deferred behavior so a future contributor doesn't accidentally land the re-key without doing the data-plane work).

Required follow-up (Non-blocker for THIS PR): Update docs/multi-domain-sprint.md so the Sprint 2 row in the Summary Table no longer says (incl. legacy local-part → FQDN re-key), and update Sprint 3's plan to absorb the on-disk re-key as part of the runtime data-plane rewire (the S3-1 / S3-3 story area). Without this update, the next sprint's reviewer will see Sprint 2 marked "Done" while a Sprint-2 AC is technically still open.

Deviation 2: Module name src/upgrade_migration.rs instead of src/upgrade.rs

Verdict: ACCEPT.

Reasoning:

The collision is real. src/upgrade.rs already exists and owns the aimx upgrade self-update CLI (verified — it contains pub fn run for the self-update flow). Renaming that module would be a much larger and unrelated diff that touches the CLI dispatch, every reference in main.rs, the test suite, etc. The new name upgrade_migration.rs is descriptive ("the one-shot upgrade migration") and the implementer flagged the choice in the PR description, in the module docstring, and in the commit message.

No tracking work needed.

Potential Bugs

Non-blocker: aimx doctor reports dkim_key_present: false post-upgrade

src/doctor.rs:114:

let dkim_key_present = crate::config::dkim_dir().join("private.key").exists();

Post-migration, the DKIM key lives at <dkim_dir>/<default_domain>/private.key, not <dkim_dir>/private.key. aimx doctor will report:

DKIM key:         MISSING - run `aimx dkim-keygen`

…and recommend the operator regenerate keys. That recommendation is wrong (the key is fine, just relocated) and the suggested action could cause harm: a v1 aimx dkim-keygen without flags would generate fresh keys at the legacy <dkim_dir>/private.key location while the daemon continues to sign with the (different) per-domain key. The operator would then have a confusing mismatch between what doctor reports and what's actually signing outbound.

Same issue applies to src/doctor.rs:820-821 (the orphan-storage scan reads data_dir.join("inbox") / data_dir.join("sent") at the root, which post-migration no longer exists — silently misses any orphan dirs sitting under the new per-domain tree), and src/mailbox.rs:277 (discover_mailbox_names reads the same legacy root, so unregistered mailbox directories under the per-domain tree don't surface in aimx mailboxes list).

Classification: Non-blocker. The PR targets epic/multi-domain, which is not operator-visible until Sprint 7. The sprint plan parks doctor work in Sprint 6 (S6-3) and the storage-helper unification in Sprint 3 (S3-5). Sprint 2 was not required to touch these. But the regression ships now and will be visible to anyone testing the epic branch end-to-end. Two reasonable resolutions:

  1. Preferred: add a one-line patch to doctor.rs:114 using the same resolve_active_dkim_dir helper this PR already introduced for serve.rs, plus call out discover_mailbox_names / orphan-storage scan as known limitations on the epic branch.
  2. Acceptable: explicitly add these three call-sites to the PR's "Deferred Items" list so Sprint 3 / Sprint 6 pick them up, and accept the diagnostic regression on the epic branch.

Non-blocker: setup_test_env writing legacy v1 configs now silently exercises the migration path in every integration test

This is observed, not flagged in the PR description. Every setup_test_env-using integration test now runs the migration as a side effect of start_serve. The 116 tests in tests/integration.rs all pass, but the test surface is now "fresh v1 install → migrate → run test" rather than "v2 install → run test". Two existing tests had to be patched to re-resolve alice_dir after start_serve because of this (called out in the PR description), and the new layout-aware inbox() / sent() helpers paper over the rest. Worth noting because future test-writers should know setup_test_env produces a v1 install and the daemon migrates on first start.

Classification: Non-blocker. Existing tests pass. The test helper change is documented in the diff. Future cleanup: write a setup_v2_test_env that lands a canonical config + marker so tests can opt out of the migration path when they don't care about it.

Test Coverage

Strong. 26 unit tests in src/upgrade_migration.rs cover detection (5 paths), storage rename (4 paths), DKIM rename (3 paths), config rewrite (2 paths), marker write (1 path), and orchestration (5 paths). 4 integration tests in tests/upgrade.rs cover end-to-end migration, idempotency, corrupted-marker hard-fail, and post-migration SMTP RCPT. Verified locally: all 26 unit tests and all 4 integration tests pass; the existing 116 integration tests in tests/integration.rs still pass with the layout-aware inbox() / sent() helpers; cargo clippy --all-targets -- -D warnings and cargo fmt -- --check clean.

One gap worth noting (Non-blocker): EXDEV is not exercised by a unit test (hard to simulate without two filesystems). The detection logic in is_cross_device_error is correct by inspection — both io::ErrorKind::CrossesDevices and raw_os_error() == Some(18) are checked. A future cross-filesystem test would round this out.

Security Issues

None observed. The migration:

  • Reads config.toml, <data_dir>/.layout-version, and filesystem state at known paths
  • Writes via write_atomic (temp-then-rename, atomic on same FS)
  • Preserves the 0600 / 0644 modes on the DKIM keys across rename
  • Creates the per-domain DKIM dir at 0700 root:root
  • Runs as part of aimx serve which is already root-only
  • No user-controlled input feeds into path construction (default_domain comes from the loaded Config, which is itself validated)

The marker write uses <process-pid> in the temp filename which is fine for serialized startup (no concurrent writers). The rename-onto-final-name pattern handles a crash during the marker write cleanly: the temp file is left orphaned (fs::remove_file(&tmp) best-effort on rename failure) and the next migration run starts from the detection layer.

Code Quality

Solid. Highlights:

  • MoveOutcome enum is the right abstraction — encodes "renamed / already done / nothing to do" cleanly and the resulting report is what the operator-facing log line consumes.
  • MigratedOutcome is boxed inside StartupMigrationOutcome::Migrated to silence clippy::large_enum_variant. Documented in the PR description. Reasonable.
  • detect_layout_state is a pure function over filesystem state with no writes and no locks — exactly the right shape for the orchestration to detect-then-act under the caller's locks.
  • The split between run_migration (pure file work) and run_startup_migration (drives the detection-decision branching) keeps the unit tests light.
  • The order-of-operations comment at the top of the module is load-bearing — clearly explains the crash-recovery contract (storage → DKIM → config → marker, so a half-state always prefers "orphaned DKIM key" over "domain in config but DKIM missing").

One minor readability note (Non-blocker): in run_upgrade_migration_at_startup, the pre-flight detection (pre_state) is run BEFORE acquiring any locks, and then a fresh run_startup_migration call inside the locked body re-runs detect_layout_state. The implementer handled the "state changed under us" case defensively (the AlreadyMigrated | Fresh arm in the locked-body match) but the doubled detection is non-obvious. A short comment explaining "pre-flight check is for the no-lock fast path; the inner call inside the lock re-detects to keep the locked critical section authoritative" would help the next reader.

Alignment with PRD

  • ✓ FR-G1 (combined transaction): storage rename, DKIM rename, config rewrite (partial — domains promotion only), marker write, single INFO log line. Sequence matches.
  • ✓ FR-G2 (atomicity guarantees): each step is atomic, daemon refuses listener bind until transaction completes, partial completion is detectable via path-existence checks.
  • ✓ FR-G3 (idempotency): marker-present → fast no-op (single stat). Corrupted marker → hard startup error. Verified by integration test.
  • ⚠ FR-G1 step 3: PRD calls for [mailboxes.<local>][mailboxes."<local>@<domain>"] on-disk re-key. This PR ships only the domain → domains half. Deferred to Sprint 3 per the implementer's note; see Deviation #1 above.
  • ⚠ §6.12 walkthrough step 4: PRD's normative narrative shows FQDN-keyed mailboxes post-upgrade. Same deferral as above.
  • ✓ NFR-2 (concurrency / lock hierarchy): outer per-mailbox locks (sorted FQDN order) → inner CONFIG_WRITE_LOCK.
  • ✓ NFR-3 (atomicity): each step uses rename(2) or write_atomic. Crash recovery story preserved (storage first, marker last).
  • ✓ NFR-4 (performance): rename(2) is constant-time per top-level directory regardless of mail volume. Marker check is one stat per inbox/sent lookup (microseconds).
  • ✓ NFR-5 (security): DKIM modes preserved.

Summary and Recommended Actions

Overall verdict: Needs minor fixes — but only because of the unaddressed regression in aimx doctor / discover_mailbox_names and the docs drift on Sprint 2's AC. If the implementer prefers to land this PR as-is and pick those up in Sprints 3 / 6, the verdict slides to "Comment / land as planned" — neither finding rises to a blocker on its own.

Blockers: None. The deferred mailbox-key re-key is technically a Sprint-2 AC miss but it has sound technical justification, is loudly documented, and the integration test pins the resulting deferred behavior so the next sprint can pick it up.

Non-blockers (worth fixing before merge, but won't block):

  1. aimx doctor reports DKIM as missing post-migration. src/doctor.rs:114 still reads <dkim_dir>/private.key and bypasses the layout-aware resolver. Either patch it to use resolve_active_dkim_dir, or explicitly add it to the "Deferred Items" list with a Sprint 6 pointer.
  2. discover_mailbox_names and the doctor orphan-storage scan still read <data_dir>/inbox/ / <data_dir>/sent/ at the root. Add to Deferred Items if not fixing here.
  3. Update docs/multi-domain-sprint.md Summary Table so the Sprint 2 row no longer claims (incl. legacy local-part → FQDN re-key), and absorb the re-key into Sprint 3's plan or a follow-up. Otherwise the next sprint reviewer will see an unresolved Sprint 2 AC.

Nice-to-haves:

  • Add a unit test for EXDEV detection if a portable simulation is feasible (or document that EXDEV is covered by code inspection only).
  • Add a brief comment in run_upgrade_migration_at_startup explaining why detection runs both pre-lock and inside the lock.
  • Future: introduce a setup_v2_test_env helper so integration tests can opt out of the silent migration-on-startup path.

The migration itself is well-engineered, the test coverage is strong, and the deferral decision is the right technical call. The remaining items are polish and follow-up bookkeeping.

Addresses three review non-blockers on the upgrade-migration PR:

- doctor: route the DKIM private/public-key probes through
  resolve_active_dkim_dir so a post-migration install correctly
  reports 'present' from <dkim_dir>/<default_domain>/private.key
  instead of recommending dkim-keygen against a stale legacy path.

- mailbox::discover_mailbox_names + the doctor orphan-storage scan:
  walk per-domain inbox/sent roots via the new Config::storage_roots
  helper. On v2 (post-migration) this iterates
  <data_dir>/<domain>/inbox|sent/ across every configured domain;
  on v1 it preserves the legacy <data_dir>/inbox|sent/ behavior.
  Per-domain roots that have not been provisioned yet are skipped.

- new unit tests cover post-migration DKIM detection, the v1
  fallback, and per-domain mailbox discovery (including the
  freshly-added-domain skip).
@uzyn
Copy link
Copy Markdown
Owner Author

uzyn commented May 23, 2026

Review non-blockers addressed (commit 45dc558)

All three flagged non-blockers are fixed. Code commit pushed to sprint-2-multi-domain-upgrade-migration; docs commit pushed to aimx-docs main.

Non-blocker 1: aimx doctor reads legacy DKIM path

  • src/doctor.rs now resolves both the private-key probe (top of gather_status_with_ops) and the public-key read in gather_dns_section through crate::upgrade_migration::resolve_active_dkim_dir(config, &dkim_dir).
  • Post-migration the probe reads <dkim_dir>/<default_domain>/private.key (where the daemon actually signs from); pre-migration it falls back to the legacy <dkim_dir>/private.key. No more false MISSING + bad aimx dkim-keygen hint.
  • New unit tests in src/doctor.rs (dkim_layout_tests): post-migration per-domain key reads as present, legacy v1 key still reads as present, missing-from-both reads as missing.

Non-blocker 2: per-domain mailbox storage scans

  • New helper Config::storage_roots() in src/config.rs — returns <data_dir>/<domain>/ per configured domain on v2 (post-migration), and <data_dir>/ itself on v1 (no marker). One source of truth for "where to scan for mailbox storage".
  • mailbox::discover_mailbox_names (src/mailbox.rs) iterates config.storage_roots() and joins inbox/ per root; per-domain roots that have not been provisioned yet (e.g. a freshly added domain) are silently skipped via the read_dir error path.
  • The orphan-storage finding in src/doctor.rs does the same for both inbox/ and sent/. The finding message now uses the on-disk-relative path (<domain>/inbox/<name>/ on v2, inbox/<name>/ on v1) so the operator can rm -rf the exact dir.
  • New unit tests in src/mailbox.rs: aggregates across two per-domain inbox dirs on v2, falls back to legacy root on v1, skips unprovisioned per-domain roots without crashing.

Non-blocker 3: sprint summary table

  • multi-domain-sprint.md Summary Table updated in commit aimx-docs@1ee70a9. The Sprint 2 row now lists what actually ships (structural rewrite + storage/DKIM relocation + layout marker + layout-aware path shim) and explicitly notes the deferred legacy local-part → FQDN re-key. The Sprint 3 row records that it absorbs the deferred re-key.

Verification

  • cargo fmt -- --check: clean.
  • cargo clippy --all-targets -- -D warnings: clean.
  • cargo test: 1213 + 116 + 4 unit/integration tests pass (8 ignored — pre-existing root-only / two-uid-required tests, no regressions).
  • Banned-tokens scan on the diff: zero hits outside docs/.

Copy link
Copy Markdown
Owner Author

@uzyn uzyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sprint 2 Re-review: One-shot upgrade migration (PR #241)

Verifying the three previously flagged non-blockers against commit 45dc558 and the matching docs commit aimx-docs@1ee70a9.

Verification of Previously Flagged Non-Blockers

NB1 — aimx doctor DKIM probe — RESOLVED

src/doctor.rs:114 and the public-key read at src/doctor.rs:216 now both route through crate::upgrade_migration::resolve_active_dkim_dir(config, &dkim_root_base). Verified against the helper at src/upgrade_migration.rs:689, which probes <dkim_dir>/<default_domain>/private.key and falls back to <dkim_dir> for v1 installs. Three new tests in doctor::dkim_layout_tests pin the three relevant states:

  • dkim_present_when_only_per_domain_key_exists — post-migration per-domain key reads as present (the originally-flagged regression).
  • dkim_present_on_legacy_v1_layout — fallback to legacy <dkim_dir>/private.key still works for pre-migration installs.
  • dkim_missing_when_no_key_present — genuine missing case still surfaces, so the doctor MISSING / aimx dkim-keygen recommendation is not lost.

All three pass locally (cargo test --bin aimx doctor::dkim_layout_tests). The fix is minimal, surgical, and uses the same helper serve.rs already uses for the runtime DKIM load — single source of truth for "where does the active DKIM key live?".

One small note: this required bumping MockNetworkOps field visibility from private to pub(crate) in src/setup.rs so the new test module can construct one. The change is mechanical and confined to a test-only helper struct (the type is already under pub(crate) mod tests). No production-surface impact.

NB2 — Per-domain mailbox storage scans — RESOLVED

The fix introduces a single source of truth via Config::storage_roots() in src/config.rs:1502-1521:

  • On v2 (post-migration, .layout-version marker present): returns one <data_dir>/<domain>/ entry per configured domain.
  • On v1 (no marker): returns a single <data_dir>/ entry — preserves legacy behavior.

Both previously flagged call-sites now consume this helper:

  • mailbox::discover_mailbox_names (src/mailbox.rs:286-296): iterates config.storage_roots(), joins inbox/ per root, aggregates entries across all per-domain inbox dirs. Per-domain roots that do not exist (e.g. a freshly added domain) are silently skipped via the read_dir error path.
  • doctor::check_mailbox_ownership orphan-storage finding (src/doctor.rs:842-880): same iteration shape, applied to both inbox/ and sent/. The finding message uses root_dir.strip_prefix(&config.data_dir) so v2 reads as <domain>/<kind>/<name>/ and v1 as <kind>/<name>/ — operators see the exact on-disk path they need to rm -rf. The unwrap_or(Path::new(kind)) fallback is defensive against an unexpected strip_prefix failure.

Three new unit tests in mailbox::tests:

  • discover_aggregates_per_domain_inbox_dirs_on_v2_layout — verifies aggregation across two per-domain inbox dirs (agent.example.com, two.example.com).
  • discover_falls_back_to_legacy_root_on_v1_layout — v1 regression coverage.
  • discover_skips_missing_per_domain_inbox_dirs — provisions one of two configured domains, asserts no crash and the provisioned domain surfaces.

All three pass locally. The doctor orphan-storage scan is not separately unit-tested, but its logic is structurally identical to discover_mailbox_names and uses the same storage_roots() helper.

NB3 — Sprint summary table — RESOLVED

aimx-docs@1ee70a9 updates multi-domain-sprint.md:

  • Sprint 2 row: now lists structural domain → domains rewrite, storage + DKIM relocation under <domain>/, .layout-version marker, layout-aware path shim (storage_root_for_default_domain, storage_roots, resolve_active_dkim_dir), idempotent re-run, upgrade-fixture integration test, and explicitly bolds "Legacy local-part → FQDN on-disk mailbox-key re-key deferred to Sprint 3" with the justification ("in-memory map already carries legacy keys verbatim from Sprint 1; deferring keeps every existing config.mailboxes.get(<local>) callsite working").
  • Sprint 3 row: now lists "legacy local-part → FQDN on-disk mailbox-key re-key absorbed from Sprint 2" with the rationale ("runs as part of the data-plane rewire so every config.mailboxes.get(...) callsite migrates at the same time as the on-disk shape").

The drift between what Sprint 2 actually ships and what the Summary Table claims it ships is now closed. The next sprint's reviewer will see Sprint 2's actual scope reflected in the plan.

Two earlier-approved deviations

  • Deferred on-disk mailbox-key re-key (Sprint 3 absorbs it) — approved earlier, no fix expected. The docs update above formalises that decision in the plan.
  • Module rename src/upgrade_migration.rs instead of src/upgrade.rs — approved earlier, no fix expected. Module name is unchanged in this commit.

Regression / New Issue Check

Walked the full diff 5d0f4cf..45dc558 (four files: src/config.rs, src/doctor.rs, src/mailbox.rs, src/setup.rs). No new issues found:

  • The new Config::storage_roots() is pure (reads filesystem state, returns a Vec<PathBuf>, no writes, no locks) — safe for use in the read-only doctor / discovery paths.
  • The doctor changes preserve the existing read_dir error-path-as-skip behavior — a missing per-domain dir on a freshly added domain does not crash the doctor.
  • The orphan-storage message format change is path-relative to data_dir, so it reads correctly on both v1 (inbox/foo/) and v2 (example.com/inbox/foo/).
  • The MockNetworkOps visibility bump is mechanical; no production code consumes it.
  • No new banned planning-doc cross-references introduced (rg on the cumulative diff is clean outside docs/).

Verification commands

cargo fmt -- --check                                                   # clean
cargo clippy --all-targets -- -D warnings                              # clean
cargo test --bin aimx doctor::dkim_layout_tests                        # 3/3 pass
cargo test --bin aimx mailbox::tests::discover                         # 3/3 pass
cargo test --bin aimx                                                  # 1213/1213 pass (8 ignored, pre-existing)
cargo test --test integration                                          # 116/116 pass (8 ignored, pre-existing)
cargo test --test upgrade                                              # 4/4 pass

Summary

All three previously flagged non-blockers (NB1 doctor DKIM probe, NB2 per-domain storage scans, NB3 sprint plan Summary Table) are fully resolved. The fixes are minimal, well-tested (6 new unit tests, all passing), and introduce a clean new helper (Config::storage_roots()) as a single source of truth for "where mailbox storage roots live across layout versions". No new issues or regressions introduced. The PR is ready to merge to epic/multi-domain.

Overall verdict: Ready to merge.

Blockers: none.

Non-blockers: none.

Nice-to-haves: none new. The two existing nice-to-haves carried from the first review (EXDEV unit test, setup_v2_test_env helper) are still valid but remain non-blocking.

Recommended merge commit message

[Sprint 2] One-shot multi-domain upgrade migration

Atomic on-disk migration that brings legacy single-domain installs
onto the canonical multi-domain layout on first `aimx serve` startup
under the new binary. Synchronous, gated by the `.layout-version: 2`
marker, idempotent across restarts, hard-fail on partial completion.

Storage + DKIM relocation: rename(2) under `<data_dir>/<default_domain>/`
and `<dkim_dir>/<default_domain>/`. Config rewrite: structural
`domain → domains` promotion via `write_atomic`. Marker write: temp-
then-rename of `.layout-version: 2` (0644 root:root). Order is load-
bearing so a crash mid-flow prefers "orphaned DKIM key" over "domain
in config but DKIM missing"; marker is last so a partial run never
claims to be done. Migration runs under the documented lock hierarchy
(outer per-mailbox locks in sorted FQDN order → inner CONFIG_WRITE_LOCK)
before any listener binds.

Layout-aware path shim: `Config::inbox_dir` / `sent_dir` /
`storage_root_for_default_domain` / `storage_roots` consult the
marker so same-process callers see v2 paths immediately after the
marker lands. `resolve_active_dkim_dir` keeps the doctor probe and
the daemon's DKIM load aligned across v1 and v2 installs. The
`send_handler`, `state_handler`, `mailbox_handler`, `doctor`, and
`mailbox::discover_mailbox_names` data-plane paths now route through
these helpers.

Mailbox-key FQDN re-key (`[mailboxes.<local>]` →
`[mailboxes."<local>@<domain>"]`) is deferred to the runtime data-
plane rewire so every `config.mailboxes.get(<local>)` callsite
migrates at the same time as the on-disk shape.

Tests: 26 unit tests in `src/upgrade_migration.rs` cover detection,
each rename step, EXDEV handling by code inspection, the config
rewrite, the marker write, and the orchestration. 4 integration
tests in `tests/upgrade.rs` exercise end-to-end migration,
idempotency, corrupted-marker hard-fail, and post-migration SMTP
RCPT against a realistic v1 fixture. 6 additional unit tests cover
the layout-aware doctor DKIM probe and per-domain mailbox storage
scans.

The mailbox-dir-perms-isolation CI job started failing after the
multi-domain upgrade migration landed: the daemon now relocates
<data_dir>/{inbox,sent}/ to <data_dir>/<default_domain>/{inbox,sent}/
on first start, but four assertions in tests/uds_authz.rs hand-built
their expected paths as <tmp_root>/data/inbox/<name>, which no longer
exist after migration.

Add mailbox_inbox_path / mailbox_sent_path helpers keyed on a single
DEFAULT_DOMAIN const that also drives the test config, and rewrite the
four post-migration stat sites (mailbox_create_non_root_owner_ok,
mailbox_create_non_root_owner_synthesized_from_peercred,
mailbox_delete_non_root_cross_uid_returns_no_such_mailbox) to use
them. The pre-migration v1-layout dirs in spin_up_serve stay as-is so
the daemon's first-run migration still has a legacy tree to relocate.

Verified locally under sudo + AIMX_INTEGRATION_SUDO=1: all 21 ignored
uds_authz tests pass (39.7s). cargo fmt + cargo clippy --all-targets
-D warnings clean.
@uzyn
Copy link
Copy Markdown
Owner Author

uzyn commented May 23, 2026

CI regression fix — mailbox-dir-perms-isolation

Pushed 373f23e to fix three tests/uds_authz.rs failures the root-equipped CI job surfaced after the multi-domain upgrade migration landed.

Root cause

The runtime data plane now writes mailbox storage at <data_dir>/<default_domain>/{inbox,sent}/<name>/ instead of the legacy <data_dir>/{inbox,sent}/<name>/. Four assertions in tests/uds_authz.rs hand-built their expected paths against the pre-migration layout, so after the daemon's first-start migration ran the files were elsewhere and stat failed:

  • mailbox_create_non_root_owner_okstat ... /data/inbox/aliceowned no such file
  • mailbox_create_non_root_owner_synthesized_from_peercred — same, for /data/inbox/alicemade + /data/sent/alicemade
  • mailbox_delete_non_root_cross_uid_returns_no_such_mailboxbob_inbox.exists() false because the path was the pre-migration one

Fix

  • New DEFAULT_DOMAIN const (it.example.com) that also drives the test config so the two stay in lock-step.
  • New helpers mailbox_inbox_path(tmp_root, name) / mailbox_sent_path(tmp_root, name) returning <tmp_root>/data/<DEFAULT_DOMAIN>/{inbox,sent}/<name>/.
  • Replaced the four hand-built path joins with calls to those helpers.
  • The pre-migration v1-layout dirs in spin_up_serve (lines 389-393) deliberately stay as data_dir.join("inbox") / data_dir.join("sent") — those are the legacy tree the daemon's first-run migration relocates, so they must be present in the old shape.

Verification

Ran under sudo locally on this host:

sudo -E env AIMX_INTEGRATION_SUDO=1 cargo test --test uds_authz -- --ignored
# 21 passed; 0 failed; 39.7s

All three previously-failing tests pass. cargo fmt --check clean, cargo clippy --all-targets -- -D warnings clean. Banned-tokens regex on the diff returns zero new hits.

The daemon runs with `umask 0o077`, so when the upgrade migration
creates `<data_dir>/<domain>/` via `create_dir_all` it lands at
`0o700 root:root`. Non-root mailbox owners (e.g. `aimx mcp` under
their own uid) then cannot `x`-traverse into their own
`inbox/<name>/` subdir and every read surfaces EACCES.

Pre-migration the equivalent traversal happened at `<data_dir>/`
itself, which setup leaves at `0o755`. Restore that property on
the new per-domain dir by explicitly chmodding it to `0o755` after
creation (or on a re-entrant migration where a partial earlier run
left it locked). The `inbox/<name>/` subdirs themselves remain
`0o700` and owner-locked; only the per-domain traversal bit is
opened.

Regression test pins the mode under the daemon's `0o077` umask so
the default `cargo test` `0o022` can't hide the bug again.
@uzyn
Copy link
Copy Markdown
Owner Author

uzyn commented May 23, 2026

Fix: email_list_full_cycle_against_root_owned_config EACCES regression

Root cause. Under the v2 layout the upgrade migration creates <data_dir>/<domain>/ via fs::create_dir_all inside relocate_storage_for_default_domain. The daemon clamps its process umask to 0o077 at startup (ownership::set_process_umask), so the new per-domain dir lands at 0o700 root:root. A non-root mailbox owner running aimx mcp under their own uid then cannot x-traverse it to reach their own inbox/<name>/ subdir — every email_list / email_read / email_send (which writes a sent-copy through the same path) bubbles Permission denied (os error 13).

Pre-migration the equivalent traversal happened at <data_dir>/ itself, which setup.rs leaves at 0o755. The new per-domain dir had no equivalent chmod, so the regression silently appeared the moment the migration ran.

Fix. src/upgrade_migration.rs relocate_storage_for_default_domain now explicitly chmods the per-domain storage dir to 0o755 after create_dir_all (idempotent — runs on every migration call, so a re-entry after a crashed earlier run heals an over-tightened dir too). The inbox/<name>/ subdirs remain 0o700 and owner-locked; only the per-domain traversal bit is opened.

Regression test. storage_relocation_per_domain_dir_is_world_traversable (and a sibling storage_relocation_chmods_existing_per_domain_dir) pin the mode under a 0o077 umask via an UmaskGuard so the default cargo test 0o022 umask can't mask the bug again.

Verification.

  • Local repro under sudo: email_list_full_cycle_against_root_owned_config now passes (was failing on the EACCES assertion at tests/integration.rs:6975).
  • All three production-perm smoke tests pass together: email_list, mailbox_list, hook_list.
  • All 21 tests/uds_authz.rs --ignored tests still pass.
  • Full cargo test is green (1215 + 116 + 4 = 1335 passed).
  • cargo fmt --check + cargo clippy --all-targets -- -D warnings clean.

Diff is one file: src/upgrade_migration.rs (+99 lines, all in the migration function and its test module).

Copy link
Copy Markdown
Owner Author

@uzyn uzyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sprint 2 Re-review #2: One-shot upgrade migration (PR #241)

Verifying two new commits landed on top of the previously approved tree, after CI on the prior approval surfaced two failures that local runs hadn't caught.

Commits Reviewed Since the Last Approval

  • 373f23etests/uds_authz.rs: route four post-migration stat sites through new mailbox_inbox_path / mailbox_sent_path helpers keyed on DEFAULT_DOMAIN; leave the pre-migration v1 setup at lines 389-393 untouched.
  • fbfda62src/upgrade_migration.rs: explicitly set_permissions(domain_dir, 0o755) after create_dir_all in relocate_storage_for_default_domain, plus an UmaskGuard helper and two regression tests pinned at the daemon's 0o077 umask.

Verification

373f23e (test-only path update) — pass

  • New DEFAULT_DOMAIN = "it.example.com" const is the single source of truth for both the generated config and the path helpers (spin_up_serve was updated to interpolate domain = DEFAULT_DOMAIN into the domain = "..." field and every <local>@<domain> address) — no risk of helper/config skew.
  • The four rewritten stat sites (mailbox_create_non_root_owner_ok, mailbox_create_non_root_owner_synthesized_from_peercred x2 lookups, mailbox_delete_non_root_cross_uid_returns_no_such_mailbox) all targeted hand-built <tmp_root>/data/{inbox,sent}/<name> paths that no longer exist after first-start migration. The helpers correctly point them at <tmp_root>/data/it.example.com/{inbox,sent}/<name>.
  • The pre-migration v1-layout setup at lines 389-393 in spin_up_serve (creating data/inbox/, data/sent/, and per-user subdirs under the legacy paths) is preserved verbatim. This is load-bearing — the daemon's run_startup_migration needs a v1 tree to relocate, so the legacy paths must exist at the v1 shape when the daemon binds. Confirmed by inspection.
  • No production code touched. No assertion semantics changed — same stat_owner_username calls, same expected uids, same error-expectation paths. Pure path-construction refactor.

fbfda62 (production chmod fix) — pass on every verification point

  • Chmod is unconditional. Lines 292-300 sit outside the if !domain_dir.exists() guard at 286-291, so every call to relocate_storage_for_default_domain rewrites the mode. A re-entrant migration after a crashed earlier run that left the dir at 0o700 will heal back to 0o755. The new storage_relocation_chmods_existing_per_domain_dir test pins this exact contract by pre-creating the dir at 0o700 and asserting 0o755 after. Operator hand-tightening will be reverted — see the open question below.
  • inbox/<name>/ and sent/<name>/ subdirs remain 0o700 and owner-locked. move_if_pending uses fs::rename(src, dst) which preserves the source dir's mode unchanged — the existing 0o700 modes baked in by mailbox_handler::create_mailbox_dirs (lines 407, 414) and mailbox::ensure_mailbox_dirs (line 240) survive the rename verbatim. The chmod is scoped to domain_dir itself, not its contents. There is no set_permissions walking the subtree. Confirmed by inspection.
  • UmaskGuard is sound. Uses a static Mutex<()> to serialize all umask mutations across parallel test threads, handles poisoning via unwrap_or_else(|p| p.into_inner()), captures previous umask via libc::umask(new) (which atomically swaps and returns prior value), restores in Drop. The unsafe block is correctly justified by the comment ("umask(2) is thread-unsafe but the mutex above serializes every caller"). Lock is held for the full UmaskGuard lifetime via _lock: MutexGuard<'static, ()>, so concurrent tests block correctly. One minor note: if a future test panics inside the guard, Drop still runs (RAII contract) and the mutex gets released — sound.
  • Chmod path matches setup.rs data_dir treatment. install_config_dir (setup.rs:1647) sets 0o755 explicitly on the config dir; the <data_dir> itself is created via create_dir_all from setup (line 2494) under the setup CLI's default 0o022 umask, landing at 0o755. The fix correctly aligns <data_dir>/<domain>/ with <data_dir> itself — both 0o755, both traversable by non-root mailbox owners. The contained inbox/<name>/ and sent/<name>/ stay 0o700 matching mailbox subdirs on v1.
  • No security regression. The widened bit is exactly --x for group/other (well, r-x, but the dir lists nothing useful — non-owners can't enumerate mailboxes because the inbox/<name>/ subdirs they would readdir(2) for are at 0o700). The DKIM per-domain dir at relocate_dkim_for_default_domain (lines 354-370) stays at 0o700 — correct, because DKIM key material must be root-only. The asymmetry between the two relocate functions is intentional and well-motivated.

I also ran cargo test --bin aimx storage_relocation locally — all 6 tests in that filter pass, including both new regressions. cargo fmt --check and cargo clippy --all-targets -D warnings are clean. CI shows green on core-tests, mailbox-dir-perms-isolation, docs-build, and verifier-tests at fbfda62.

Disposition on the Open Question

"Should the chmod be guarded so an operator's deliberate 0o700 on <data_dir>/<domain>/ isn't widened on next startup?"

Leave it unconditional. The implementer's reasoning holds:

  1. The daemon is the authoritative owner of the storage layout shape. The chmod-on-every-call posture is a feature, not a bug — operators who tighten <data_dir>/<domain>/ to 0o700 break every non-root MCP and CLI read, so the daemon proactively heals the directory back to a working state on next start. This matches the cascade-delete contract where the daemon owns disk shape under CONFIG_WRITE_LOCK.
  2. The DKIM relocate function (line 355) does guard its chmod with if created_domain_dir because the contract is opposite: DKIM key dir must always be 0o700, and a pre-existing tighter mode is fine. Storage is the inverse: must always be 0o755, and a pre-existing tighter mode is broken state.
  3. A worked example: an operator who runs chmod 0700 /var/lib/aimx/example.com and restarts the daemon will see traversal restored. That's the right outcome — there is no legitimate reason for that directory to be 0o700 on a multi-domain install.

If documenting becomes important post-merge, a one-liner in book/configuration.md or book/multi-domain.md noting that the daemon enforces 0o755 on <data_dir>/<domain>/ on every startup is the right home — but it's not a merge blocker.

New Issues Found

None. Both commits are tightly scoped, well-tested, and correct.

Summary

The CI failures the prior approval missed are now fully addressed: a test-only path refactor for tests/uds_authz.rs that preserves the legacy v1 setup the migration needs to operate on, and a real production bug fix that makes the per-domain storage dir world-traversable under the daemon's 0o077 umask. The fix is minimal, the regression tests pin the exact failure mode with an UmaskGuard that forces the daemon's umask back into the test environment, and the asymmetric posture (storage 0o755, DKIM 0o700) is intentional and well-motivated. No regressions introduced.

Verdict: Ready to merge.

[Sprint 2] One-shot multi-domain upgrade migration

Atomic on-disk migration that brings legacy single-domain installs
onto the canonical multi-domain layout on first `aimx serve`
startup. Storage and DKIM keys are relocated to per-domain
subdirectories via `rename(2)` (atomic on same filesystem,
constant time), the config is rewritten from `domain = "x"` to
`domains = ["x"]`, and a `.layout-version: 2` marker is written
last so partial runs never claim to be done. The per-domain
storage dir is chmodded to `0o755` on every call so non-root
mailbox owners can traverse into their own `inbox/<name>/` subdir
under the daemon's `0o077` umask; the contained mailbox subdirs
remain `0o700` and owner-locked. EXDEV surfaces an actionable
error. Layout-aware path resolution shim added so same-process
callers see v2 paths immediately after the marker lands. Doctor
DKIM probe and per-domain mailbox storage scans route through the
new layout. Mailbox-key FQDN re-key
(`[mailboxes.<local>]` → `[mailboxes."<local>@<domain>"]`) is
deferred to the runtime data-plane rewire so every
`config.mailboxes.get(<local>)` callsite migrates at the same
time as the on-disk shape.

Tests: 26 unit tests in `src/upgrade_migration.rs` cover detection,
each rename step, EXDEV handling by code inspection, the config
rewrite, the marker write, and the orchestration. 4 integration
tests in `tests/upgrade.rs` exercise end-to-end migration,
idempotency, corrupted-marker hard-fail, and post-migration SMTP
RCPT against a realistic v1 fixture. 6 additional unit tests cover
the layout-aware doctor DKIM probe and per-domain mailbox storage
scans. 2 new regression tests in `src/upgrade_migration.rs` pin
the per-domain storage dir at `0o755` under the daemon's `0o077`
umask, via a thread-safe `UmaskGuard` helper. Four post-migration
stat sites in `tests/uds_authz.rs` routed through new
`mailbox_inbox_path` / `mailbox_sent_path` helpers so the
`mailbox-dir-perms-isolation` CI job stays green.

@uzyn uzyn merged commit 6167d0b into epic/multi-domain May 23, 2026
4 checks passed
@uzyn uzyn deleted the sprint-2-multi-domain-upgrade-migration branch May 23, 2026 07:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant