feat: add deployment status to Settings page #7

Merged
ericodom merged 1 commit into main from feat/settings-deployment-status on Apr 12, 2026

@ericodom
Contributor

Summary

  • Add deploymentStatus GraphQL query + resolver that reads Lambda env vars (no DB, no live AWS calls)
  • Add DeploymentStatus type to GraphQL schema with stage, region, services, resources, and URLs
  • Update Settings page with two new cards: Deployment (stage, region, account, service statuses) and Resources & URLs (S3, DB, ECR, clickable links)
  • Add Terraform env vars (ADMIN_URL, DOCS_URL, APPSYNC_REALTIME_URL, ECR_REPOSITORY_URL, AWS_ACCOUNT_ID) to Lambda common_env
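
A minimal sketch of the env-var-only approach described above, written in Python purely for illustration (the resolver's actual language and field names are not shown on this page). `STAGE` is an assumed variable name and `deployment_status` is a hypothetical helper; `AWS_REGION` is the region variable the Lambda runtime sets.

```python
import os

# URL-style variables named in the PR summary.
URL_VARS = ("ADMIN_URL", "DOCS_URL", "APPSYNC_REALTIME_URL", "ECR_REPOSITORY_URL")


def deployment_status(env=None):
    """Build a status payload from environment variables only (no DB, no AWS calls)."""
    env = os.environ if env is None else env
    rows = {name: env.get(name, "").strip() for name in URL_VARS}
    return {
        # STAGE is an assumed variable name, not confirmed by the PR.
        "stage": env.get("STAGE", "").strip() or None,
        "region": env.get("AWS_REGION", "").strip() or None,
        "account_id": env.get("AWS_ACCOUNT_ID", "").strip() or None,
        # Empty or whitespace-only values are dropped so the UI can hide those rows.
        "urls": {name: value for name, value in rows.items() if value},
    }
```

Dropping empty values on the server side is one way to get the "rows hidden" behavior from the test plan; the page could equally filter them client-side.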

Test plan

  • terraform plan confirms new env vars are added without drift
  • Admin app builds with no new type errors
  • Settings page renders Deployment and Resources cards after deploy
  • URLs are clickable and open in new tabs
  • Graceful handling when env vars are empty (rows hidden)

🤖 Generated with Claude Code

Surface deployment infrastructure info (stage, region, services, resources,
URLs) on the admin Settings page via a new deploymentStatus GraphQL query
that reads Lambda environment variables.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ericodom ericodom merged commit 17758b3 into main Apr 12, 2026
3 checks passed
@ericodom ericodom deleted the feat/settings-deployment-status branch April 12, 2026 22:02
ericodom added a commit that referenced this pull request Apr 24, 2026
…SI-2/3/6) (#510)

Lands the single code path every skill-with-scripts invocation will flow
through once U5 wires the Skill meta-tool (plan #7 §U4). It ships inert
today: the Dockerfile COPY picks it up (via U2a's wildcard) and _boot_assert
registers it, but no production path calls it yet. Shadow-dispatch in U7 is
the first consumer.

## What lands

### `container-sources/skill_session_pool.py`
Async pool keyed on `(tenant_id, user_id, environment)`. LRU cap 8 per key,
30-min idle timeout, per-key async lock so concurrent acquires on the same
key don't double-start a session. API:

  - `acquire(key) -> SessionHandle`  (warm reuse or fresh start)
  - `handle.release()`
  - `flush_for_tenant(tenant_id)` — U12 kill-switch path
  - `flush_all()` — ops escape hatch
  - `prune_idle()` — caller decides cadence; exposed so tests advance time
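
A rough sketch of how a pool with this API could be shaped, not the module's actual code. Assumptions: `start_session` is a hypothetical async factory, the pool raises when every slot for a key is busy, and LRU eviction and `flush_all()` are left out for brevity. The injectable `clock` mirrors the "tests advance time" note above.

```python
import asyncio
import time


class SessionHandle:
    def __init__(self, key, session, clock):
        self.key = key
        self.session = session
        self.in_use = True
        self._clock = clock
        self.last_used = clock()

    def release(self):
        self.in_use = False
        self.last_used = self._clock()


class SessionPool:
    def __init__(self, start_session, cap=8, idle_seconds=30 * 60, clock=time.monotonic):
        self._start = start_session  # assumed: async callable, key -> session
        self._cap = cap
        self._idle_seconds = idle_seconds
        self._clock = clock
        self._slots = {}  # key -> list of SessionHandle
        self._locks = {}  # key -> asyncio.Lock

    async def acquire(self, key):
        # One lock per key keeps the cap check and the session start atomic.
        lock = self._locks.setdefault(key, asyncio.Lock())
        async with lock:
            slots = self._slots.setdefault(key, [])
            for handle in slots:
                if not handle.in_use:  # warm reuse
                    handle.in_use = True
                    handle.last_used = self._clock()
                    return handle
            if len(slots) >= self._cap:
                raise RuntimeError("all sessions for this key are in use")
            handle = SessionHandle(key, await self._start(key), self._clock)
            slots.append(handle)
            return handle

    def prune_idle(self):
        # In-use handles are never dropped, only idle ones past the timeout.
        now = self._clock()
        for slots in self._slots.values():
            slots[:] = [h for h in slots if h.in_use or now - h.last_used < self._idle_seconds]

    def flush_for_tenant(self, tenant_id):
        # Keys are (tenant_id, user_id, environment) tuples.
        for key in [k for k in self._slots if k[0] == tenant_id]:
            del self._slots[key]
```

Because `user_id` is part of the key, two users on the same tenant can never be handed the same session in this sketch.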

### `container-sources/skill_dispatcher.py`
`dispatch_skill_script(tenant_id, user_id, skill_slug, args, environment, *,
pool, catalog, runner, counters)`. Security invariants enforced:

  - **SI-2** args travel via `writeFiles(_args.json=json.dumps(args))`; the
    executeCode string is a fixed template that opens the file and calls
    `run(**args)`. Model-controlled values never touch the Python source.
  - **SI-6** template purges `scripts.<slug>.*` from `sys.modules` +
    `importlib.invalidate_caches()` before every import, so a monkey-patch
    from call N cannot leak into call N+1 on the same pooled session.
  - Depth cap 5 (SkillDepthExceeded), per-turn budget 50 (SkillTurnBudgetExceeded).
  - Stdout parsed as JSON; structured errors (SkillOutputParseError,
    SkillTimeout, SkillExecutionError, SkillNotFound) all ride the same
    `DispatchResult` shape for uniform audit downstream.

SI-3 (user-scoped pool key) is enforced structurally in the pool itself.
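
To illustrate SI-2 and SI-6, here is a hedged sketch of a fixed exec template. The names `EXEC_TEMPLATE`, `build_invocation`, and `SLUG_RE`, the `scripts.<slug>.main` module layout, and the slug grammar are all assumptions; only the catalog-controlled slug is substituted, while model-controlled args go to `_args.json`.

```python
import json
import re

# Assumed slug grammar; the real catalog may validate differently.
SLUG_RE = re.compile(r"^[a-z0-9][a-z0-9_-]*$")

# Fixed template: only the validated slug is substituted, never the args.
EXEC_TEMPLATE = '''\
import importlib, json, sys
prefix = "scripts.{slug}"
for name in [m for m in sys.modules if m == prefix or m.startswith(prefix + ".")]:
    del sys.modules[name]  # SI-6: drop anything cached from a previous call
importlib.invalidate_caches()
module = importlib.import_module(prefix + ".main")  # ".main" is an assumed layout
with open("_args.json") as fh:
    args = json.load(fh)  # SI-2: args arrive as data, not as source
print(json.dumps(module.run(**args)))
'''


def build_invocation(slug, args):
    """Return (files_to_write, code_to_execute) for one skill call."""
    if not SLUG_RE.fullmatch(slug):
        raise ValueError(f"invalid skill slug: {slug!r}")
    files = {"_args.json": json.dumps(args)}
    return files, EXEC_TEMPLATE.format(slug=slug)
```

With this shape, two calls to the same slug with different args produce byte-identical code, which is the property the SI-2 structural test relies on.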

### `test_skill_session_pool.py` — 9 cases
Acquire + reuse, concurrent-acquire safety, LRU eviction of idle slots,
in-use-never-evicted, idle pruning with frozen time, flush-for-tenant
isolation, flush-all.

### `test_skill_dispatcher.py` — 9 cases
Happy path (args land in `_args.json`, not in exec string), unknown slug,
non-JSON stdout, timeout, non-zero exit with stderr, depth-cap boundary
(max OK, max+1 rejected), turn budget, audit hook firing on ok + failure.

### `test_skill_dispatcher_security.py` — 6 cases
Each named with its SI number so grep surfaces coverage at review time:

  - SI-2: adversarial args (`__import__('os').system('curl evil.test')`,
    nested `exec()`, unicode escapes) round-trip through _args.json
    unchanged, never appear in the exec string.
  - SI-2: exec template byte-identical across two invocations with
    different args — a structural assertion that fails if anyone ever
    reintroduces interpolation.
  - SI-3: alice and bob on the same tenant get distinct pool sessions;
    flush-for-tenant isolates.
  - SI-6: exec template purges `scripts.<slug>.*` before import, even on
    back-to-back calls with the same slug.

### Wiring
- `_boot_assert.EXPECTED_CONTAINER_SOURCES` grows skill_dispatcher +
  skill_session_pool so the Dockerfile RUN asserts they landed.
- `packages/api/src/lib/sandbox-preflight.ts` gains an optional
  `caller: 'execute_code' | 'skill_dispatch'` field on the input +
  result. Defaults to `execute_code` for backwards compat; dispatcher
  paths set `skill_dispatch` when U5+ wires them. No behavior change
  for existing callers.

## What this does NOT do

- Does NOT wire the dispatcher into server.py's Agent(tools=...) flow.
  That's U5 (Skill meta-tool).
- Does NOT extract the quota/audit loop from server.py:682-755. The plan
  calls for this as part of U4; deferring to the shadow-dispatch wiring
  in U7 where the quota call actually fires — extracting now would add
  a seam with no caller yet.
- Does NOT call the real AgentCore Code Interpreter. Tests drive
  injected runner/pool callables. Real integration happens in U7's
  shadow-dispatch harness.

## Test plan

- [x] `uv run ... pytest` on the three new files — 24 tests green
- [x] Full `pytest packages/agentcore-strands/agent-container/` — 211 green
      (24 new + 187 existing)
- [x] `pnpm --filter @thinkwork/api typecheck` green (preflight caller
      field threaded through existing tests)
- [x] `pnpm --filter @thinkwork/api test` on `sandbox-preflight.test.ts`
      — 9 tests green
- [x] ruff import-sort clean on every new file
- [x] prettier clean on every touched TS file

Part of the V1 agent-architecture plan
(`docs/plans/2026-04-23-007-feat-v1-agent-architecture-final-call-plan.md` §U4).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ericodom added a commit that referenced this pull request Apr 24, 2026
…s inert) (#511)

The single `Skill(name, args)` meta-tool that U6 flips to be the sole
invocation path once U7's shadow harness validates equivalence. Today it
ships as inert code — the Dockerfile wildcard COPY picks it up (via U2a)
and _boot_assert registers it, but server.py's live Agent(tools=...) path
still routes through the existing run_skill_dispatch / composition_runner
code.

## Why ship inert

The plan (#7 §U4/U5/U6/U7) explicitly gates U6's cutover on U7 PASS
— U7 is the shadow harness that dual-dispatches both the old and new
paths on real invocations and measures divergence. Wiring U5 into the
live Agent(tools=...) before U7 exists would swap the invocation path
without the safety net the plan itself calls for. This PR therefore
ships the module + tests and defers server.py wiring to U7.

## What lands

### `container-sources/skill_meta_tool.py`
- `SessionAllowlist` — intersection of
  `tenant_skills ∩ template_skills ∩ ¬template_blocks ∩ ¬tenant_kill_switches`
  pre-computed once at Agent(tools=...). Narrow-only: a template cannot
  widen past what the tenant enabled (plan R6/R7).
- `invoke_skill(name, args, *, ctx)` — pure entry point the Strands @tool
  wrapper calls. Routes script-bundle skills to U4's `dispatch_skill_script`;
  pure-SKILL.md skills return their body for in-prompt consumption
  (no sandbox roundtrip).
- `build_skill_meta_tool(ctx)` — factory returning the coroutine the
  `@strands.tool` decorator wraps. Decoupled from the SDK so unit tests
  exercise the full decision tree without importing strands.
- `intersect_allowed_tools(declared, session_tools)` — narrow-only
  intersection of a skill's declared `allowed-tools` frontmatter against
  the session's effective tool set. Warns on declared-but-missing so
  operators can spot disabled dependencies.
- `SkillUnauthorized` — distinct error from `SkillNotFound` so the model
  cannot enumerate tenant-scoped catalog membership by probing slugs.
  Both raise; the audit log gets full context.
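
The two narrow-only intersections above can be sketched with plain set operations. This is an illustration under assumptions: `session_allowlist` is a hypothetical function form of `SessionAllowlist`, and the return types and logger name are guesses.

```python
import logging

logger = logging.getLogger("skill_meta_tool_sketch")  # hypothetical logger name


def session_allowlist(tenant_skills, template_skills, template_blocks, tenant_kill_switches):
    """tenant AND template, minus template blocks and tenant kill switches."""
    return frozenset(
        (set(tenant_skills) & set(template_skills))
        - set(template_blocks)
        - set(tenant_kill_switches)
    )


def intersect_allowed_tools(declared, session_tools):
    """Keep only declared tools the session actually has; warn about the rest."""
    session = set(session_tools)
    missing = sorted(t for t in declared if t not in session)
    if missing:
        logger.warning("declared but not available in session: %s", missing)
    return [t for t in declared if t in session]
```

Since every step either intersects or subtracts, neither function can return something the tenant or session did not already allow.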

### `test_skill_meta_tool.py` — 12 cases
Covers plan AE4 + every listed test scenario:
- happy path: Skill("sales-prep") routes to dispatcher with correct args
- nested Skill() threads the same TurnCounters through
- pure-SKILL.md slug returns body, no sandbox
- unknown slug → SkillNotFound
- in catalog but not in session → SkillUnauthorized
- SessionAllowlist triple-constraint intersection correctness
- tenant kill-switch trumps template enablement (R7 precedence)
- allowed-tools frontmatter narrows (never widens) past session tools
- build_skill_meta_tool closure captures ctx correctly

### `_boot_assert.EXPECTED_CONTAINER_SOURCES`
Adds skill_meta_tool so the Dockerfile RUN asserts it landed.

## What this PR does NOT do

- Does NOT wire `Skill` into server.py's Agent(tools=...). Deferred to
  U7 (shadow wiring) then U6 (canonical cutover).
- Does NOT drop the AGENTS.md-conditional around AgentSkills. Plan calls
  for this at U5 but it's entangled with the live-path swap — lands
  alongside the cutover.
- Does NOT suppress AgentSkills' built-in `skills` tool. Same reason —
  suppression only makes sense once `Skill` is the canonical path.

## Test counts

- `test_skill_meta_tool.py` — 12 cases
- Full agent-container suite: 223 green (12 new + 211 existing)
- ruff import-sort clean on new files

Part of the V1 agent-architecture plan
(`docs/plans/2026-04-23-007-feat-v1-agent-architecture-final-call-plan.md` §U5).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ericodom added a commit that referenced this pull request Apr 24, 2026
…520)

Wires the admin decision surface for plugin-uploaded MCP servers. Plan §U11
lands:

- `POST /api/tenants/:tenantId/mcp-servers/:serverId/approve` computes
  `url_hash = sha256(canonical(url, auth_config))`, sets `status='approved'`
  + `approved_by` + `approved_at`.
- `POST /api/tenants/:tenantId/mcp-servers/:serverId/reject` clears
  approval metadata; reason captured in CloudWatch audit log.
- `buildMcpConfigs` SQL gate narrows to `status='approved' AND enabled=true`,
  with an in-code defensive hash-match check for drift (grandfathered
  `url_hash IS NULL` rows pass through).
- `applyMcpServerFieldUpdate` reverts approved rows back to `pending` on
  any url/auth_config mutation (SI-5). mcpUpdateServer + mcpRegisterServer
  upsert + DCR cache route through it; DCR stays approved by recomputing
  url_hash (system-internal discovery, not admin intent).
- Daily EventBridge sweeper auto-rejects pending rows older than 30 days.
- Admin SPA renders the approval badge and surfaces Approve / Reject
  buttons for pending rows; Reject accepts an optional reason.
- Cognito-only client (`cognitoFetch`) for the approval routes; mirrors
  plugin-upload.ts's REST analogue of requireTenantAdmin.
- 40 new unit tests: hash canonicalization, approve/reject handler
  (authz + tenant isolation), SI-5 url-swap protection, TTL sweeper,
  and buildMcpConfigs approved-filter behavior.
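
The approval hash and the SI-5 revert can be illustrated as follows. This is a Python sketch for readability only; the handlers themselves appear to be TypeScript, and the canonical form (sorted-key JSON), the row shape, and the helper names are assumptions.

```python
import hashlib
import json


def url_hash(url, auth_config):
    # Assumed canonical form: compact JSON with sorted keys.
    canonical = json.dumps(
        {"url": url, "auth_config": auth_config}, sort_keys=True, separators=(",", ":")
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def apply_field_update(row, changes):
    """Apply changes; an approved row whose url/auth_config changed goes back to pending."""
    updated = {**row, **changes}
    touched = any(
        field in changes and changes[field] != row.get(field)
        for field in ("url", "auth_config")
    )
    if touched and row.get("status") == "approved":
        updated["status"] = "pending"
    return updated


def passes_gate(row):
    """Approved + enabled, with a defensive hash check; NULL hashes are grandfathered."""
    if row.get("status") != "approved" or not row.get("enabled"):
        return False
    stored = row.get("url_hash")
    return stored is None or stored == url_hash(row["url"], row.get("auth_config"))
```

The hash check is a second line of defense: even if a write path bypassed the revert, a changed URL would no longer match the hash recorded at approval time.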

Terraform wires two new handlers (`mcp-approval`, `mcp-approval-sweeper`),
four new API Gateway routes, and a daily cron. No schema migration
required — `status`, `url_hash`, `approved_by`, `approved_at` all landed
with U3 migration 0025.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ericodom added a commit that referenced this pull request May 5, 2026
feat: add deployment status to Settings page
ericodom added a commit that referenced this pull request May 5, 2026
…SI-2/3/6) (#510)

ericodom added a commit that referenced this pull request May 5, 2026
…s inert) (#511)

ericodom added a commit that referenced this pull request May 5, 2026
…520)

ericodom added a commit that referenced this pull request May 7, 2026
…k retention)

Replaces `_anchor_fn_inert` with `_anchor_fn_live`, which performs the
actual S3 PutObject of per-tenant proof slices and the global anchor
JSON to the WORM-locked compliance bucket. The anchor object carries an
explicit `ObjectLockMode` + `ObjectLockRetainUntilDate` per-object
override (mirroring the bucket-default), so the retention contract is
portable across buckets and visible at the call site. Slices write
under `proofs/tenant-{id}/cadence-{cadence_id}.json` (no per-object
lock; bucket default applies); anchor writes last so a partial failure
never publishes a verifier-discoverable commit point.

Five guards land alongside the body swap:

  * **Deterministic cadence_id** — sha256 of canonical chain-head
    fingerprint, reshaped to UUIDv7 form. Same heads produce the same
    cadence_id, so a retry after a partial failure overwrites its own
    slice keys instead of orphaning slices for the full 365-day
    retention window.
  * **Merkle self-check** — `_anchor_fn_live` recomputes the root from
    the received leaves and asserts equality before any PutObject. Cheap
    insurance against latent runAnchorPass arithmetic bugs becoming
    WORM-locked poisoned evidence.
  * **Layer 2 body-swap test** — `compliance-anchor-s3-spy.test.ts`
    mocks S3Client.send and asserts the live function actually issues
    PutObjectCommand for both slices and anchor (with SHA256 checksum,
    SSE-KMS, and ObjectLock retention on the anchor key only). Pairs
    with the Layer 1 identity assertion (`getWiredAnchorFn() ===
    _anchor_fn_live`) in the integration test.
  * **Sibling watchdog IAM role** — watchdog moves OFF the shared
    lambda role onto a dedicated role with `kms:DescribeKey` only on
    the bucket CMK (NOT `kms:Decrypt` — the watchdog never reads
    object bodies), `s3:ListBucket` prefix-conditioned on `anchors/`,
    and an explicit Deny on every Delete + Bypass + Lock-mutation
    action so future role broadening cannot turn the watchdog into a
    deletion vector.
  * **Dev-COMPLIANCE precondition** — `var.allow_compliance_in_non_prod`
    (default false) keeps a stage typo from accidentally locking a dev bucket
    into irreversible COMPLIANCE-mode retention.
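
The first two guards can be sketched as below. This is a Python illustration, not the Lambda's code: the chain-head shape (tenant id to head hash), the fingerprint encoding, and the Merkle construction (SHA-256, last node duplicated on odd levels) are all assumptions.

```python
import hashlib
import json
import uuid


def cadence_id(chain_heads):
    """Derive a stable, UUIDv7-shaped id from the chain heads (assumed: dict of id -> hash)."""
    canonical = json.dumps(sorted(chain_heads.items()), separators=(",", ":"))
    raw = bytearray(hashlib.sha256(canonical.encode("utf-8")).digest()[:16])
    raw[6] = (raw[6] & 0x0F) | 0x70  # version nibble set to 7
    raw[8] = (raw[8] & 0x3F) | 0x80  # RFC variant bits set to 10
    return str(uuid.UUID(bytes=bytes(raw)))


def merkle_root(leaves):
    """SHA-256 Merkle root; duplicating the last node on odd levels is an assumed rule."""
    if not leaves:
        raise ValueError("no leaves")
    level = [hashlib.sha256(leaf).digest() for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


def assert_root_matches(leaves, claimed_root):
    """Self-check to run before any write: recompute and compare."""
    if merkle_root(leaves) != claimed_root:
        raise ValueError("Merkle root mismatch; refusing to write anchor")
```

Because the id depends only on the heads, a retry computes the same slice keys and overwrites its own partial output rather than creating new objects.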

Watchdog flips to live: `mode: "live"`, ListObjectsV2 with 1000-key
truncation warning, max-LastModified pick, `ComplianceAnchorGap` metric
emission (suppressed on greenfield-empty bucket), heartbeat unchanged.
The CloudWatch alarm cuts over: gap → `treat_missing_data = breaching`
(catches both real gaps and a watchdog-down regression); a sibling
heartbeat-missing alarm is born `notBreaching` so deploy-time gaps
don't fire it before the first heartbeat lands (Decision #7).
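
The watchdog's gap computation might look roughly like this. It is a Python sketch over a dict shaped like an S3 ListObjectsV2 response (`Contents`, `LastModified`, `IsTruncated`); `anchor_gap_seconds` and the logger name are hypothetical, and returning `None` stands in for "emit no metric".

```python
import logging
from datetime import datetime, timezone

logger = logging.getLogger("anchor_watchdog_sketch")  # hypothetical logger name


def anchor_gap_seconds(response, now):
    """Seconds since the newest anchor object, or None for an empty bucket."""
    if response.get("IsTruncated"):
        # A single page holds at most 1000 keys; the newest key may be on a later page.
        logger.warning("listing truncated; gap may be overstated")
    contents = response.get("Contents", [])
    if not contents:
        return None  # greenfield-empty bucket: suppress the gap metric
    newest = max(obj["LastModified"] for obj in contents)
    return (now - newest).total_seconds()
```

A caller would pass `datetime.now(timezone.utc)` as `now` and publish the result as the `ComplianceAnchorGap` value when it is not `None`.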

Operator pre-merge step: `terraform state mv` the watchdog from the
for_each handler set to the new standalone resource address. Without
it, the next `terraform apply` fails with ResourceConflictException on
the function name. Plan documents the exact command.

Plan: docs/plans/2026-05-07-012-feat-compliance-u8b-anchor-lambda-live-plan.md
Master plan: docs/plans/2026-05-06-011-feat-compliance-audit-event-log-plan.md (U8b)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ericodom added a commit that referenced this pull request May 7, 2026
…k retention) (#927)

* feat(compliance): U8b — anchor Lambda live (S3 PutObject + Object Lock retention)

* fix(review): apply autofix feedback

Drop unused drizzle-orm imports flagged by ce-code-review:
- compliance-anchor.ts: `and`, `eq`, `gt`, plus the `auditEvents` schema
  import (raw SQL via `` sql`...` `` is the actual codepath there)
- compliance-anchor.integration.test.ts: `and`, `gt`, `auditOutbox`

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(compliance): make compliance-anchor.test.ts stub anchorFn async

`AnchorFn` is now `=> Promise<...>` in U8b. The timestamp-normalization
test added in #925 used a sync stub, which fails typecheck against the
new contract. Switch the stub to `async () => ({ anchored: false })` —
test still exercises the same path (recorded_at coercion → drainer
update) since runAnchorPass awaits the result either way.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>