Epic: Multi-Domain Support (v1)#240

Draft
uzyn wants to merge 7 commits into main from epic/multi-domain

uzyn commented May 23, 2026

Epic Integration PR

Track: multi-domain
PRD: docs/multi-domain-prd.md
Sprint plan: docs/multi-domain-sprint.md

All 7 sprints of the multi-domain track have landed on epic/multi-domain.
This PR is ready for human review and merge to main.

Sprints

Scope (v1, "light" multi-domain)

One operator hosts multiple sending/receiving domains on the same server:

  • Each domain has its own DKIM keypair, catchall, mailboxes
  • domains[0] is the default; bare local-parts resolve against it
  • Existing single-domain installs upgrade atomically with zero data loss
  • Domain CRUD is daemon-mediated and hot-reloads (no restart)
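The default-domain rule in the list above can be illustrated with a minimal sketch. The function name `qualify_address` and its signature are hypothetical, chosen for illustration only; the PR does not show the real resolver.

```rust
/// Hypothetical helper: turn a bare local-part into a full address by
/// appending the default domain (`domains[0]`). Inputs that already
/// contain an `@` are returned unchanged.
pub fn qualify_address(input: &str, domains: &[String]) -> Option<String> {
    // No configured domains means there is no default to resolve against.
    let default = domains.first()?;
    if input.contains('@') {
        Some(input.to_string())
    } else {
        Some(format!("{}@{}", input, default))
    }
}
```

Under this sketch, `info` resolves to `info@a.com` when `domains = ["a.com", "b.com"]`, while `info@b.com` passes through untouched.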

Explicitly out of scope: multi-tenant features (per-domain ACLs, separate
operators, per-domain rate limits, per-domain verifier endpoints, per-domain
TLS certs). See the PRD §1 and §9 for the full scope discipline.

How to review

Pre-tag recommendation

One smoke step is documented as "manual pre-tag verification recommended":
the rollback procedure on real hardware (see book/multi-domain.md rollback
section and docs/multi-domain-smoke-results.md). Not a blocker for merging
this PR — leaves room to validate it before tagging the multi-domain release.

Release notes

RELEASE_NOTES.md calls out everything operators need to know about the
upgrade (config rewrite, storage relocation, DKIM relocation, rollback
pointer). aimx upgrade also prints a one-screen post-upgrade reminder.

uzyn added 6 commits May 23, 2026 12:21
Lands the multi-domain config schema in `config.rs` (parse + validate
only; no migration, no daemon/runtime changes):

- `Config.domains: Vec<String>` replaces `Config.domain: String`.
  Canonical shape is `domains = ["a.com", "b.com"]`; legacy `domain
  = "x.com"` accepted on read and normalized to a one-entry vec.
  Mixed `domain` + `domains` rejects with the exact wording
  "specify either 'domain' (singular, legacy) or 'domains' (plural),
  not both". Entries lowercased, case-insensitively deduplicated,
  RFC 1035-validated. Order significant — `domains[0]` is default.

- `[mailboxes."info@a.com"]` (FQDN-keyed) parses; key must equal
  `address` (case-insensitive on the domain). Legacy
  `[mailboxes.info]` (local-part-keyed) preserves the
  operator-friendly key in the in-memory map; `address` must
  reference a configured domain. On-disk re-keying to FQDN is
  deferred to the later upgrade migration so single-domain runtime
  paths keep working unchanged this sprint.

- `MailboxConfig::is_catchall(&self, config: &Config)` matches
  `*@<d>` for any `d` in `config.domains`.

- `Config.per_domain: HashMap<String, DomainOverride>` parses from
  `[domain."<name>"]` sub-tables (singular `domain` key — TOML
  cannot let `domains` be both an array and a table). Each
  `DomainOverride` carries optional `signature`, `dkim_selector`,
  `trust`, `trusted_senders`. Dangling sub-tables reject at load.
  Per-domain trust validates against the same allowlist as the
  global trust.

- Top-level `dkim_selector` is now `Option<String>`;
  `Config::default_dkim_selector(&self) -> &str` resolves to
  `"aimx"` when unset. `Config::default_domain(&self) -> &str`
  returns `domains[0]`.

- 47 unit tests in `config::tests` cover every legal and rejected
  shape. 6 fixture configs land at `tests/fixtures/config/*.toml`
  with structural-invariant load tests for each.
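The `domain` / `domains` normalization rules from the first bullet could look roughly like the sketch below. It assumes a hypothetical free function `normalize_domains` that takes the two keys already parsed out of the TOML file; the hostname check is a simplified letters-digits-hyphen test, not the full validation in `config.rs`, and the "no domain configured" wording is invented for the example.

```rust
use std::collections::HashSet;

/// Error wording for the mixed-key case, quoted from the commit message.
pub const MIXED_KEYS_ERR: &str =
    "specify either 'domain' (singular, legacy) or 'domains' (plural), not both";

/// Hypothetical helper: fold the legacy `domain` key and the canonical
/// `domains` key into one ordered, lowercased, deduplicated list.
pub fn normalize_domains(
    domain: Option<&str>,
    domains: Option<&[&str]>,
) -> Result<Vec<String>, String> {
    let raw: Vec<&str> = match (domain, domains) {
        (Some(_), Some(_)) => return Err(MIXED_KEYS_ERR.to_string()),
        (Some(d), None) => vec![d],
        (None, Some(ds)) => ds.to_vec(),
        // Assumed wording; the real message is not shown in the PR.
        (None, None) => return Err("no domain configured".to_string()),
    };
    let mut seen = HashSet::new();
    let mut out = Vec::new();
    for entry in raw {
        let lower = entry.trim().to_ascii_lowercase();
        if !is_valid_hostname(&lower) {
            return Err(format!("invalid domain: {}", entry));
        }
        // Keep the first occurrence so `domains[0]` stays the default.
        if seen.insert(lower.clone()) {
            out.push(lower);
        }
    }
    if out.is_empty() {
        return Err("no domain configured".to_string());
    }
    Ok(out)
}

/// Simplified hostname check (assumption): dot-separated labels of
/// 1..=63 ASCII letters, digits, or hyphens, with no edge hyphens.
fn is_valid_hostname(name: &str) -> bool {
    if name.is_empty() || name.len() > 253 {
        return false;
    }
    name.split('.').all(|label| {
        !label.is_empty()
            && label.len() <= 63
            && !label.starts_with('-')
            && !label.ends_with('-')
            && label.chars().all(|c| c.is_ascii_alphanumeric() || c == '-')
    })
}
```

Deduplicating after lowercasing, and keeping the first occurrence, is what preserves the "order significant" guarantee for the default domain.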

All 1181 unit + 116 integration tests pass. `cargo clippy
--all-targets -- -D warnings` and `cargo fmt -- --check` clean.

Atomic on-disk migration that brings legacy single-domain installs
onto the canonical multi-domain layout on first `aimx serve` startup
under the new binary. Synchronous, gated by the `.layout-version: 2`
marker, idempotent across restarts, hard-fail on partial completion.

Storage + DKIM relocation: rename(2) under `<data_dir>/<default_domain>/`
and `<dkim_dir>/<default_domain>/`. Config rewrite: structural
`domain → domains` promotion via `write_atomic`. Marker write: temp-
then-rename of `.layout-version: 2` (0644 root:root). Order is load-
bearing so a crash mid-flow prefers "orphaned DKIM key" over "domain
in config but DKIM missing"; marker is last so a partial run never
claims to be done. Migration runs under the documented lock hierarchy
(outer per-mailbox locks in sorted FQDN order → inner CONFIG_WRITE_LOCK)
before any listener binds.
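The temp-then-rename marker write described above can be sketched with the standard library alone. The file body (a bare version number), the `.tmp` suffix, and both function names are assumptions for illustration; the sketch also leaves out the 0644 root:root ownership step.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Marker file name taken from the commit message; the body format
/// used here (just the version number) is an assumption.
const MARKER_NAME: &str = ".layout-version";

/// Write the marker to a temp file, flush it, then rename it into
/// place so readers never observe a half-written marker.
pub fn write_layout_marker(data_dir: &Path, version: u32) -> io::Result<()> {
    let tmp = data_dir.join(format!("{}.tmp", MARKER_NAME));
    let dst = data_dir.join(MARKER_NAME);
    {
        let mut f = File::create(&tmp)?;
        writeln!(f, "{}", version)?;
        f.sync_all()?;
    }
    // rename(2) within one directory replaces the target atomically.
    fs::rename(&tmp, &dst)
}

/// Read the marker: `Ok(None)` when absent (v1 layout), an error when
/// the content does not parse (a corrupted marker hard-fails).
pub fn read_layout_marker(data_dir: &Path) -> io::Result<Option<u32>> {
    match fs::read_to_string(data_dir.join(MARKER_NAME)) {
        Ok(s) => s
            .trim()
            .parse::<u32>()
            .map(Some)
            .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e)),
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(None),
        Err(e) => Err(e),
    }
}
```

Because the marker is written last, a crash at any earlier step leaves `read_layout_marker` returning `None`, and the next start re-runs the migration rather than trusting a partial result.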

Layout-aware path shim: `Config::inbox_dir` / `sent_dir` /
`storage_root_for_default_domain` / `storage_roots` consult the
marker so same-process callers see v2 paths immediately after the
marker lands. `resolve_active_dkim_dir` keeps the doctor probe and
the daemon's DKIM load aligned across v1 and v2 installs. The
`send_handler`, `state_handler`, `mailbox_handler`, `doctor`, and
`mailbox::discover_mailbox_names` data-plane paths now route through
these helpers.

Mailbox-key FQDN re-key (`[mailboxes.<local>]` →
`[mailboxes."<local>@<domain>"]`) is deferred to the runtime data-
plane rewire so every `config.mailboxes.get(<local>)` callsite
migrates at the same time as the on-disk shape.

Per-domain storage dir is explicitly chmod'd to 0o755 after creation
so the daemon's defensive 0o077 umask doesn't lock out non-root MCP
callers; contained inbox/<name>/ and sent/<name>/ subdirs remain
0o700 root-locked. UmaskGuard test helper pins the umask so cargo
test's default 0o022 can't mask the regression.
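The explicit chmod matters because the mode passed at directory creation is filtered by the process umask, while a later `chmod` is not. A minimal Unix-only sketch, with the hypothetical name `create_traversable_dir`:

```rust
use std::fs;
use std::io;
use std::os::unix::fs::PermissionsExt;
use std::path::Path;

/// Create the per-domain directory, then force its mode to 0o755.
/// Only the leaf directory is re-chmod'd; any parents created along
/// the way keep whatever mode the umask allowed.
pub fn create_traversable_dir(path: &Path) -> io::Result<()> {
    fs::create_dir_all(path)?;
    // set_permissions is a chmod(2) call, so the umask does not apply.
    fs::set_permissions(path, fs::Permissions::from_mode(0o755))
}
```

With a 0o077 umask, `create_dir_all` alone would leave the directory at 0o700, which is the lock-out the commit message describes.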

Tests: 26 unit tests in `src/upgrade_migration.rs` cover detection,
each rename step, EXDEV handling by code inspection, the config
rewrite, the marker write, the orchestration, and the per-domain
dir traversal-bit invariant. 4 integration tests in `tests/upgrade.rs`
exercise end-to-end migration, idempotency, corrupted-marker hard-fail,
and post-migration SMTP RCPT against a realistic v1 fixture. 6
additional unit tests cover the layout-aware doctor DKIM probe and
per-domain mailbox storage scans. `tests/uds_authz.rs` paths routed
through new per-domain helpers to keep the production-perm smoke
suite passing.

…re-key (#242)

Wires multi-domain into the runtime data plane: SMTP RCPT TO accepts
any configured domain, outbound signs with the per-message domain's
DKIM key + selector, sent copies persist under
`<data_dir>/<from-domain>/sent/<local>/`, bare-local-part From:
rewrites to the default domain daemon-side, and the deferred on-disk
mailbox-key FQDN re-key fires on first start under the new binary.

- `recipient_domain_matches_any` replaces the single-domain helper in
  the SMTP session state machine; `Config::resolve_mailbox_for_rcpt`
  does exact FQDN lookup with per-domain catchall fallback.
- Per-domain DKIM key map via
  `Arc<ArcSwap<HashMap<String, DkimKeyEntry>>>` so future domain CRUD
  verbs can hot-swap without restarting. Selector resolution order:
  per-domain override → top-level → built-in `"aimx"`. Missing key
  for non-default domains warns and the daemon still starts; missing
  default-domain key is fatal. Legacy `<dkim_dir>/private.key`
  fallback applies only to the default domain.
- `send_handler` extracts the From: domain from the submitted body,
  validates against `config.domains`, signs with the per-domain key,
  and rejects per-domain catchall as outbound sender. Bare-local-
  part From: rewrites both header and body bytes before signing so
  DMARC alignment stays valid.
- New `src/storage.rs` with `mailbox_storage_path` / `Folder`;
  `Config::inbox_dir` / `sent_dir` delegate to the helper, and a CI
  grep job rejects new raw `.join("inbox" / "sent")` outside the
  storage / upgrade-migration / mailbox modules.
- Carry-over startup re-key rewrites legacy `[mailboxes.<local>]` to
  `[mailboxes."<local>@<domain>"]` on already-v2 installs. Idempotent.
- MAILBOX-CREATE (daemon + CLI fallback) inserts new mailboxes
  FQDN-keyed so the in-memory shape is consistent post-create
  without waiting for the next-restart carry-over.
- 29 new tests across `src/config.rs`, `src/dkim_keys.rs`,
  `src/smtp/session.rs`, `src/send_handler.rs`, `src/storage.rs`,
  `tests/multi_domain.rs`, and `tests/upgrade.rs`.
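The RCPT resolution in the first bullet (exact FQDN lookup, then per-domain catchall) might be sketched as follows. The method name `resolve_mailbox_for_rcpt` comes from the commit message, but the `Config` shape, the signature, and the return type here are simplified assumptions.

```rust
use std::collections::HashMap;

/// Simplified stand-in for the real config: FQDN-keyed mailboxes
/// mapped to an illustrative mailbox name.
pub struct Config {
    pub domains: Vec<String>,
    pub mailboxes: HashMap<String, String>,
}

impl Config {
    /// Resolve an RCPT TO address to a mailbox name, if any.
    pub fn resolve_mailbox_for_rcpt(&self, rcpt: &str) -> Option<&str> {
        let (local, domain) = rcpt.rsplit_once('@')?;
        // Domains are matched case-insensitively; the local-part is
        // kept as given (an assumption of this sketch).
        let domain = domain.to_ascii_lowercase();
        if !self.domains.iter().any(|d| *d == domain) {
            return None; // not one of our domains
        }
        let exact = format!("{}@{}", local, domain);
        if let Some(name) = self.mailboxes.get(&exact) {
            return Some(name.as_str());
        }
        // Fall back to this domain's catchall only, never another's.
        self.mailboxes
            .get(&format!("*@{}", domain))
            .map(String::as_str)
    }
}
```

Scoping the fallback to `*@<domain>` means a catchall on one domain does not absorb mail for unknown recipients on a sibling domain.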

Adds `aimx domains list` / `aimx domains add` (with `aimx domain` clap
alias and a scaffolded `remove`), the `AIMX/1 DOMAIN-LIST` and
`DOMAIN-ADD` UDS verbs, the root-only `Action::DomainCrud` authz
variant, and a `--domain` flag on `aimx dkim-keygen` for targeting a
specific per-domain key directory. The `DOMAIN-ADD` handler hot-swaps
the in-memory `Arc<Config>` and the per-domain DKIM `ArcSwap` map
atomically (DKIM map first, config second) so a concurrent send
observing the new domain in `config.domains` always sees the matching
key. SMTP RCPT to a freshly-added domain is accepted by the running
daemon without a restart, validated by an end-to-end CI test under
sudo.
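The publish order (DKIM map first, config second) can be illustrated with a std-only sketch. It swaps in `RwLock<Arc<_>>` as a stand-in for the `ArcSwap` the daemon uses, and the `Shared` type, its fields, and the string-valued keys are all hypothetical.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Stand-in for the daemon's shared state.
pub struct Shared {
    dkim_keys: RwLock<Arc<HashMap<String, String>>>,
    domains: RwLock<Arc<Vec<String>>>,
}

impl Shared {
    pub fn new(domains: Vec<String>, keys: HashMap<String, String>) -> Self {
        Shared {
            dkim_keys: RwLock::new(Arc::new(keys)),
            domains: RwLock::new(Arc::new(domains)),
        }
    }

    /// Publish the key before the domain becomes visible in config.
    pub fn add_domain(&self, domain: &str, key: &str) {
        {
            let mut guard = self.dkim_keys.write().unwrap();
            let mut next = (**guard).clone();
            next.insert(domain.to_string(), key.to_string());
            *guard = Arc::new(next);
        }
        {
            let mut guard = self.domains.write().unwrap();
            let mut next = (**guard).clone();
            next.push(domain.to_string());
            *guard = Arc::new(next);
        }
    }

    /// A sender reads config first, then the key map. Any domain it
    /// finds in config already has its key published.
    pub fn key_for_send(&self, from_domain: &str) -> Option<String> {
        let domains = self.domains.read().unwrap().clone();
        if !domains.iter().any(|d| d == from_domain) {
            return None;
        }
        let keys = self.dkim_keys.read().unwrap().clone();
        keys.get(from_domain).cloned()
    }
}
```

The invariant depends on both sides: writers publish key then config, and readers consult config then key. Reversing either order would reopen the window where a domain is visible without its key.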

`aimx dkim-keygen` (no `--domain`) writes to
`<dkim_dir>/<default_domain>/` — the v2 per-domain layout the daemon
loader reads from — eliminating the rotation footgun where the new
key would have silently landed at a path the daemon ignored. The
read-side legacy fallback for unmigrated v1 installs is unchanged.

Daemon-stopped fallback: root falls back to a direct `config.toml`
edit plus DKIM keygen with a restart hint; non-root hard-errors with
the canonical "daemon must be running for non-root domain CRUD" hint.

`dkim::generate_keypair` now `chmod 0700`s its parent dir itself, so
both the CLI direct path and the daemon `handle_domain_add` path land
at identical on-disk permissions.

Land `aimx domains remove <domain>` with the `AIMX/1 DOMAIN-REMOVE`
UDS verb. Default path refuses with a sorted JSON list of blocking
mailbox FQDNs; `--force` cascades to per-mailbox wipe + per-domain
storage `rmdir` + config rewrite + DKIM-map hot-swap under the daemon
lock hierarchy (outer: per-mailbox locks in sorted FQDN order; inner:
CONFIG_WRITE_LOCK — matches the existing codebase convention so the
cascade cannot deadlock against concurrent MAILBOX-CRUD / HOOK-CRUD /
MARK-* / ingest).

Last-domain remove is hard-blocked regardless of `--force` with a
pointer to `aimx uninstall`. DKIM key files at `<dkim_dir>/<domain>/`
are preserved on disk so the operator can re-add the domain without
regenerating the keypair; the response echoes the path back so the
CLI can print the canonical preservation hint.
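The pre-flight decision for `DOMAIN-REMOVE` described above (refuse with a sorted blocker list, cascade under `--force`, never remove the last domain) can be sketched as a pure function. `RemovePlan` and `plan_remove` are hypothetical names, and the `UnknownDomain` case is an assumption the commit message does not spell out.

```rust
#[derive(Debug, PartialEq)]
pub enum RemovePlan {
    /// Target is not a configured domain (assumed case).
    UnknownDomain,
    /// Removing the only domain is refused even with `--force`.
    LastDomain,
    /// Without `--force`: sorted FQDNs of the blocking mailboxes.
    Blocked(Vec<String>),
    /// With `--force` (or no mailboxes): mailboxes to wipe, sorted so
    /// per-mailbox locks can be taken in a stable order.
    Proceed(Vec<String>),
}

pub fn plan_remove(
    domains: &[&str],
    mailbox_fqdns: &[&str],
    target: &str,
    force: bool,
) -> RemovePlan {
    if !domains.contains(&target) {
        return RemovePlan::UnknownDomain;
    }
    if domains.len() == 1 {
        return RemovePlan::LastDomain;
    }
    let suffix = format!("@{}", target);
    let mut affected: Vec<String> = mailbox_fqdns
        .iter()
        .filter(|m| m.ends_with(suffix.as_str()))
        .map(|m| m.to_string())
        .collect();
    affected.sort();
    if !affected.is_empty() && !force {
        RemovePlan::Blocked(affected)
    } else {
        RemovePlan::Proceed(affected)
    }
}
```

Sorting once up front lets the same list serve both as the refusal payload and as the lock acquisition order.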

The cascade is re-runnable, not strict-atomic: on partial IO failure
the in-memory Config and DKIM map are not swapped, external observers
still see the pre-cascade view, and a second invocation completes the
cascade idempotently. An under-lock re-snapshot guards against
mailbox-set drift between the pre-flight scan and lock acquisition:
if the mailbox set changed, the request is refused with a Conflict.

Daemon-stopped fallback: root falls back to direct config edit +
storage wipe + restart hint; non-root hard-errors with the canonical
"daemon must be running" message. The `storage_tree_removed` field on
the response is true only when an on-disk per-domain tree was actually
removed, so the CLI's "Storage tree removed." line is now accurate.

CI is wired: `tests/domains_remove.rs` runs under sudo on the
`mailbox-dir-perms-isolation` job. Coverage includes a concurrent-
ingest stress test that pins the lock-hierarchy invariant
operationally (cascade completes within 10s while a background thread
hammers SMTP RCPT TO on the surviving domain) and a unit test that
pins the `live_blocker_fqdns != lock_keys` conflict-detection branch
via a release-build-zero-cost test hook.

`src/domain_handler.rs` added to the storage-path enforcement awk
allowlist in CI so the cascade's per-domain `inbox/`/`sent/` walk is
the only sanctioned use of raw `.join("inbox")` outside `storage.rs`.

Per-domain runtime wiring + observability + MCP FQDN sweep for the
multi-domain track.

- Trust resolution helpers (`MailboxConfig::effective_trust` /
  `effective_trusted_senders`) walk per-mailbox → per-domain → global
  with replace semantics at every layer.
- DKIM selector + signature resolution helpers
  (`Config::dkim_selector_for_domain` / `signature_for_domain` /
  `effective_signature_for_domain`) walk per-domain → top-level →
  built-in default. Per-domain signature is appended to the body before
  DKIM signing so the recipient verifies the signed-over bytes.
- `aimx doctor` renders per-domain blocks on multi-domain installs with
  default-domain marker, per-domain DKIM key presence + DNS verification
  status, mailbox + unread counts. Single-domain installs keep the flat
  layout (no regression).
- MCP FQDN sweep: every tool returning mailbox identifiers
  (`mailbox_list`, `email_list`, `email_mark_read`, `email_mark_unread`,
  `hook_create`, `hook_list`, `mailbox_delete`) returns FQDN-shaped
  names. Bare local-parts on input continue to resolve against
  `domains[0]`.
- Datadir README template bumped to describe the per-domain layout +
  `.layout-version` marker; first `aimx serve` start post-upgrade
  refreshes via the existing version-gated overwrite.
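The layered resolution in the first two bullets reduces to "most specific layer that is set wins, wholesale". The sketch below assumes each layer has already been looked up and is passed in as an `Option`. The function names echo the helpers in the commit message, but these free-function signatures are hypothetical.

```rust
/// Trust: per-mailbox, then per-domain, then global.
pub fn effective_trust<'a>(
    per_mailbox: Option<&'a str>,
    per_domain: Option<&'a str>,
    global: &'a str,
) -> &'a str {
    per_mailbox.or(per_domain).unwrap_or(global)
}

/// Trusted senders use replace semantics: a more specific list
/// replaces the broader one entirely, with no merging. Treating an
/// empty list as "set" is an assumption of this sketch.
pub fn effective_trusted_senders<'a>(
    per_mailbox: Option<&'a [&'a str]>,
    per_domain: Option<&'a [&'a str]>,
    global: &'a [&'a str],
) -> &'a [&'a str] {
    per_mailbox.or(per_domain).unwrap_or(global)
}

/// DKIM selector: per-domain override, then top-level, then the
/// built-in default `"aimx"`.
pub fn dkim_selector_for_domain<'a>(
    per_domain: Option<&'a str>,
    top_level: Option<&'a str>,
) -> &'a str {
    per_domain.or(top_level).unwrap_or("aimx")
}
```

The placeholder trust values used when exercising these functions are arbitrary strings; the real allowlist of trust values is not shown in this PR.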

Tests: 8-combination trust resolution coverage, DKIM selector +
signature resolution order, per-domain doctor rendering (flat +
multi-domain blocks + per-domain DKIM DNS status), MAILBOX-LIST FQDN
regression on single + multi-domain, end-to-end MCP integration suite
(two-domain + single-domain) spanning mailbox_list FQDN shape and
email_list bare-vs-FQDN input acceptance.

- book/multi-domain.md: new 9-section operator reference (when to add a
  second domain, `aimx domains` CLI, per-domain config, per-domain DKIM,
  storage layout, upgrade migration walkthrough, removal semantics, light
  scope, rollback procedure). Linked from book/SUMMARY.md and
  book/README.md.
- book/{setup,mailboxes,mcp,cli,faq,troubleshooting}.md: multi-domain
  content threaded through existing pages — FQDN-keyed mailboxes,
  per-domain catchall, `--domain` flag on dkim-keygen, `aimx domains`
  command group, DOMAIN-* UDS verbs, three multi-domain FAQs, a new
  troubleshooting section.
- agents/common/aimx-primer.md: default-domain resolution and FQDN
  disambiguation rules; primer line-count soft cap bumped 500 -> 600.
- agents/common/references/multi-domain.md (new): operator-facing
  reference card covering the default-domain rule, FQDN disambiguation
  across mailbox-scoped MCP tools, per-domain storage, and the
  operator-only boundary on domain CRUD.
- RELEASE_NOTES.md (new): top-level notes calling out the config
  rewrite, storage relocation, DKIM relocation, and rollback pointer.
- src/upgrade.rs: `aimx upgrade` now prints a one-screen post-upgrade
  reminder; `post_upgrade_reminder_text()` pinned by a unit test so
  future edits cannot silently drop a section.
- scripts/check-docs.sh: allow `aimx domain` singular clap alias.

Smoke results documented in docs/multi-domain-smoke-results.md via a
synthetic-via-tests mapping — each step pins to a CI integration test
(tests/upgrade.rs, tests/domains_uds.rs, tests/domains_remove.rs,
tests/multi_domain.rs, tests/mcp_multi_domain.rs). Real-hardware
rollback verification is recommended pre-tag.