feat: out-of-band kind=run_skill dispatcher (plan §007-008) by ericodom · Pull Request #552 · thinkwork-ai/thinkwork

ericodom · 2026-04-24T20:12:21Z

Summary

Launch-blocker #1 from today's session. Replaces the post-§U6 placeholder rejector with a real dispatcher: scheduled jobs, admin "Run now", chat-intent-dispatched skill runs, and webhook-triggered runs (with non-null agentId) now actually execute the target skill via the same Strands agent loop the chat turn uses.

Plan: docs/plans/2026-04-24-008-feat-skill-run-dispatcher-plan.md

What ships

Five focused units, dependency-ordered, each a separate commit:

U1 — Shared runtime-config helper + new service-auth REST endpoint

New packages/api/src/lib/resolve-agent-runtime-config.ts factors the ~300-line agent/template/skills/MCP/KB/guardrail resolution out of chat-agent-invoke.ts. Helper is the single source of truth; both chat-invoke and the new endpoint call it. 9 unit tests.
New packages/api/src/handlers/agents-runtime-config.ts exposes GET /api/agents/runtime-config?tenantId=X&agentId=Y (Bearer API_AUTH_SECRET). 13 handler tests.
chat-agent-invoke.ts refactored to call the helper; zero behavior change. Per-turn fields (thread context, sandbox preflight, user resolution) stay inline.
Terraform route + handler registration in lambda-api/handlers.tf.

U2 — Extract `_execute_agent_turn` helper in `server.py`

~170-line chat-loop prologue factored into a reusable helper. Same env setup, skill install, workspace ready, eval span, messages build, system prompt, _call_strands_agent, tool-cost drain. Pure lift-and-shift refactor.
Chat do_POST now calls the helper; dispatcher will too.
All 234 existing container tests pass unchanged.

U3 — Python `api_runtime_config` client + real dispatcher

New container-sources/api_runtime_config.py: urllib GET with 3-attempt retry (5xx + transient retry, 4xx terminal, 404 → AgentConfigNotFoundError). 9 unit tests.
run_skill_dispatch.py rewritten: null-agentId fails fast; otherwise fetch config → build synthetic chat turn ("Run the sales-prep skill with these inputs: ...") → call _execute_agent_turn → POST /api/skills/complete with {type: inline, payload} on complete, reason on failure. 16 new tests cover happy path + all failure modes.
Dockerfile uses wildcard COPY container-sources/ /app/ so new module picked up automatically; _boot_assert.py::EXPECTED_CONTAINER_SOURCES updated per the institutional learning.

U4 — Flip TS emitters to `InvocationType: "Event"`

packages/api/src/graphql/utils.ts::invokeSkillRun, packages/api/src/handlers/skills.ts::invokeAgentcoreRunSkill, packages/lambda/job-trigger.ts all flip RequestResponse → Event. Webhook path inherits via invokeSkillRun.
Why: agent loop routinely takes 10-60s; RequestResponse with 28s socket timeout falsely timed out. Event returns 202 on enqueue; the HMAC-signed /api/skills/complete callback is the authoritative execution-result signal.
Does NOT cross feedback_avoid_fire_and_forget_lambda_invokes — that rule governs paths without callbacks; we have a durable HMAC-signed one.
Chat-agent-invoke stays RequestResponse (needs the synchronous response for the thread message stream).

U5 — Smoke scripts + CHECKS.md refresh

scripts/smoke/{chat,catalog}-smoke.sh + CHECKS.md + README.md flip passing language off "every envelope terminates with U6 unsupported-runtime" onto "row reaches complete or failed with a specific reason; only stuck-running / transport errors FAIL."

ce-code-review fixes (autofix + P0s, post-U5)

Four additional commits address findings from ce-code-review's multi-agent pass against U1–U5:

Autofix (ac59a32) — removed stale comment in chat-agent-invoke.ts describing nonexistent currentUserEmail propagation; updated scheduled.test.ts + harness README to reflect post-§U4 Event + HMAC callback semantics.
P0-A (c61ff58) — agentId threading. The SkillRunInvokePayload type was missing agentId; all four TS emitters stripped the field; the Python dispatcher rejected every non-webhook envelope with _MISSING_AGENT_REASON. Added agentId: string | null to the type, threaded through startSkillRun.mutation.ts, skills.ts::startSkillRunService, webhooks/_shared.ts, and job-trigger.ts. 3 regression tests (including an InvocationType=Event pin).
P0-B (7351cbe) — cross-tenant user-email leak. resolveAgentRuntimeConfig's users lookup scoped only on users.id; any holder of API_AUTH_SECRET could enumerate cross-tenant emails via the currentUserId query param. All three users lookups (currentUserId, human_pair email, human_pair name) now AND on users.tenant_id. Added a regression test that captures the actual where() predicates.
P0-C (5f03f00) — async-invoke retry non-idempotency. AWS Lambda default MaximumRetryAttempts=2 on the §U4 Event invoke would re-run the agent turn (re-burning Bedrock tokens) on any transient failure. Set MaximumRetryAttempts=0 on agentcore-invoke via aws_lambda_function_event_invoke_config; added a thinkwork-<stage>-agentcore-async-dlq SQS queue (14-day retention, SSE-managed) for failure visibility.

Key technical decisions

Reuse _call_strands_agent via _execute_agent_turn — one code path for the agent loop; the chat-invoke and run_skill paths cannot drift.
Runtime config fetched inside the container via GET /api/agents/runtime-config — keeps the four TS emitters lean (no envelope bloat); single resolver for both chat + dispatcher; tenant-scoped 404 returns don't leak cross-tenant existence.
Fail fast on null agentId — scheduled/catalog/chat-intent always carry one; only webhook resolvers emit null, and tenant-admin fallback routing is a product decision that belongs in its own PR.
Synthetic user message, not SKILL.md injection — the model reads the skill's SKILL.md via the AgentSkills plugin's progressive disclosure; we just tell it to run it with the given args. No harness magic.
One-shot dispatch, not retried — agent turn is non-idempotent (Bedrock tokens); /api/skills/complete atomic CAS + 15-min reconciler provide the safety net; DLQ preserves observability of dropped invokes.

What's explicitly NOT in this PR

Refactor of _call_strands_agent (1,400-line function; too risky for this change).
Streaming the skill-run output.
Cancellation cooperation inside the agent loop (cancelSkillRun mutation flips the row; mid-run polling is separate work).
Webhook null-agentId fallback routing — fails with named reason for now.

Test plan

pnpm -r typecheck (minus @thinkwork/agent-tools pre-existing no-tsc) — green
pnpm -r test — 1378 passed, 7 skipped across api (1385 total); 68 passed lambda
uv run pytest packages/agentcore-strands/agent-container/ — 249 passed
pnpm exec prettier --check on touched packages — clean
terraform validate on agentcore-runtime module — clean
Mandatory grep for old rejector strings returns zero hits
Post-merge: run catalog-smoke on dev for sales-prep, confirm row reaches complete or fails at a specific connector
Post-merge: schedule a run via the admin "Run now" button, confirm same
Post-merge: log-grep dispatch_run_skill in CloudWatch to confirm the new code is serving (per docs/solutions/workflow-issues/deploy-silent-arch-mismatch-took-a-week-to-surface-2026-04-24.md)

Deploy ordering notes

Migration 0029 already applied on dev (from feat(U6): delete composition runner + skill_catalog.execution/mode #542). This PR adds no new migrations.
The AgentCore runtime update (deploy.yml's container rebuild + UpdateAgentRuntime) is required for the new Python code to serve — per docs/solutions/workflow-issues/agentcore-runtime-no-auto-repull-requires-explicit-update-2026-04-24.md the ECR push alone is not enough. The merge pipeline handles this.
Warm containers may boot pre-env-injection during terraform-apply per project_agentcore_deploy_race_env; the 15-min reconciler is the backstop. The new api_runtime_config.fetch reads env at call time, not import, so post-race recovery is automatic.
The new aws_sqs_queue.agentcore_async_dlq + aws_lambda_function_event_invoke_config.agentcore resources are additive; first apply creates them and flips retry semantics. No data migration needed.

Residual Review Findings

ce-code-review (autofix mode, 10 reviewers, 8 returned findings) surfaced the P0s above plus the P1/P2/P3 work listed here. Durable sink: this PR body (tracker filing was not invoked for in-scope launch work). Full artifact on disk at .context/compound-engineering/ce-code-review/20260424-151323-e0154520/findings.md in the worktree.

Actionable (P1) — should ship in a follow-up before/alongside launch

P1 | reliability-02 — aws_lambda_function_event_invoke_config now exists, but consider EventInvoke destinations for OnSuccess too if operator-side observability wants completion streaming. Not blocking.
P1 | reliability-05 — post_skill_run_complete in run_skill_dispatch.py returns silently when API_AUTH_SECRET / THINKWORK_API_URL are missing. Row stays running until the 15-min reconciler. Should log ERROR + raise so the AWS retry (now 0, but future-proof) or the operator sees it immediately.
P1 | testing-1 — _execute_agent_turn (~170 LOC, heavy hidden state) has zero direct unit tests. Add at minimum: happy path, empty response, tool-call path, exception propagation.
P1 | testing-2 — PostCompletionTests.test_missing_env_logs_and_returns asserts nothing; false confidence. Assert return value + log emission.
P1 | testing-4 — resolve-agent-runtime-config branches (guardrail-assigned path, human-pair email fallback path, humanN lookup) have no direct unit tests; current coverage is indirect via chat + dispatcher integration.
P1 | kieran-python async-def-without-awaits — dispatch_run_skill is async def with zero awaits. asyncio.run(...) makes it work, but either drop async or convert the blocking urllib calls to awaitable HTTP so the label reflects reality.
P1 | kieran-python late-imports — from server import _execute_agent_turn inside the function hides a module-load cycle. Either declare explicitly or extract _execute_agent_turn into its own module.
P1 | kt-002 — three inlined copies of the invokeAgentcoreRunSkill envelope constructor (utils.ts, skills.ts, job-trigger.ts). Extract to a shared helper so future envelope-shape changes (like today's agentId thread) happen once, not thrice.

Polish (P2) — defer to ops hardening pass

_tool_costs global accumulates across invocations if _execute_agent_turn raises (reliability-03, reliability-07) — wrap in try/finally _tool_costs.clear().
No soft deadline inside dispatch_run_skill; runaway agent loop burns full 900s budget (reliability-04).
Missing completion_hmac_secret in envelope strands the row on 401 terminal (reliability-06).
Guardrail-blocked response is posted as successful skill completion instead of failed with guardrail_blocked reason (correctness-2, adv-05).
Tenant built-in-tool kill-switch not propagated to skill-run dispatch (correctness-3).
Dead fallback reads: scope.get('invokerUserId' | 'agentId' | 'threadId') never match TS producer shape — remove to prevent silent future breakage (correctness-4, api-contract-002).
_execute_agent_turn leaves workspace/composer/hindsight env vars set on warm containers (correctness-5).
_FETCH_RETRY_DELAYS and _COMPLETE_RETRY_DELAYS identical tuples in two modules (kieran-python retry-delays-duplicated).
_cfg(camel, snake) closure repeated 3 places (kieran-python envelope-scope-fallback-duplicated-three-times).
(k: any) cast in chat-agent-invoke.ts defeats KnowledgeBaseConfig type (kt-003).
JSONB unchecked casts in resolveAgentRuntimeConfig rely on schema stability (kt-004).
run_skill_dispatch reaches past the underscore prefix into server._execute_agent_turn (M2).
RuntimeConfigFetchError(code=None) conflates missing-env with retry-exhaustion (M4).
Empty-skills envelope silently posts "no skills available" deliverable as complete (adv-04).
Unvalidated response_text posted verbatim as deliveredArtifactRef with no size/shape check (adv-06).
Warm-container env leak via exception in do_POST snake_case/camelCase ordering (adv-07).
Admin cancel race — dispatcher blind to skill_runs.status flip mid-run; 4xx on callback is only signal (adv-08).
Several coverage gaps: _build_synthetic_payload asserts only 4 of 16+ keys (testing-5); scope casing mismatch untested (testing-6); agents-runtime-config handler test mocks the helper wholesale (testing-7); _execute_agent_turn edge cases — response_text=None, empty skills, None mcp_configs — untested (testing-8).

Advisory (P3)

(k: any) parameter naming drift (kt-005).
Inlined UUID_RE duplication — 6th+ copy (kt-006).
agents-runtime-config handler hand-rolls OPTIONS instead of using handleCors/cors() (kt-007).
_urlopen_with_retry's assert last_exc is not None (reliability-08).
Runtime-config fetch budget sums to ~14s but each call allows timeout=15s × 4 attempts = 60s worst case (reliability-09).
No circuit breaker on runtime-config fetch — API outage cascades across N dispatches (adv-09).
HMAC header parse edge case — bare-hex accepted alongside sha256=... (adv-10).

🤖 Generated with Claude Code

…config endpoint Plan docs/plans/2026-04-24-008-feat-skill-run-dispatcher-plan.md §U1. Factors the agent-level resolution logic (template + skills + KBs + MCP configs + guardrail + sandbox template) out of chat-agent-invoke.ts into a shared helper. Both the chat-invoke Lambda and the new service-auth REST endpoint call the same resolver, so the run_skill dispatcher and the chat loop can never drift on "what this agent's runtime looks like." * packages/api/src/lib/resolve-agent-runtime-config.ts — new helper. Returns AgentRuntimeConfig (agent, tenant, template, guardrailConfig, guardrailId, skillsConfig with defaults + built-in tools + blocked- tools filter, knowledgeBasesConfig, mcpConfigs, sandboxTemplate). Exports AgentNotFoundError + AgentTemplateNotFoundError for caller mapping to HTTP responses. * packages/api/src/handlers/agents-runtime-config.ts — new service-auth GET /api/agents/runtime-config?tenantId=X&agentId=Y. Bearer API_AUTH_SECRET per the `docs/solutions/best-practices/service- endpoint-vs-widening-resolvecaller-auth-2026-04-21.md` precedent. Optional currentUserId + currentUserEmail query params drive the CURRENT_USER_EMAIL overlay on default skills. * packages/api/src/handlers/chat-agent-invoke.ts — replaces the ~300- line inline resolution block with a single `resolveAgentRuntimeConfig` call. Per-turn concerns (thread_id, trace_id, message, messages_ history, currentUserId + currentUserEmail derivation, sandbox preflight) stay in the handler. Behavior-preserving refactor. * terraform/modules/app/lambda-api/handlers.tf — registers the new handler and routes GET /api/agents/runtime-config to it. * Tests: 8 helper-level unit tests (happy path, AgentNotFoundError, AgentTemplateNotFoundError, template vs tenant-default guardrail, blocked-tools filter, currentUserEmail overlay, built-in tool injection, MCP delegation). 13 handler-level tests (method/path/auth/ UUID validation + helper-exception → 404 mapping + currentUser param pass-through). Next: U2 — extract `_execute_agent_turn` helper in server.py so the dispatcher can reuse the chat-loop prologue + `_call_strands_agent`.

Plan docs/plans/2026-04-24-008-feat-skill-run-dispatcher-plan.md §U2. Pure refactor — no behavior change. Factors the ~170-line chat-loop prologue out of AgentCoreHandler.do_POST into a free function `_execute_agent_turn(payload)` so U3's dispatcher can reuse the same path without duplicating env setup + skill install + workspace bootstrap + eval-span attach + message build + system_prompt build + the _call_strands_agent call. What stayed in do_POST (chat-specific): * invocation_env.apply/cleanup (run_skill branch has its own already). * OpenAI chat.completion response shape + tool_costs/hindsight_usage projection. * _audit_response, auto-retain via api_memory_client, log_agent_invocation. * Guardrail-exception classification into a blocked-response 200. * self._respond HTTP writes. What moved into _execute_agent_turn: * Payload env unpack (workspace_bucket, thinkwork_api_url/secret, hindsight_endpoint, _INSTANCE_ID, _ASSISTANT_ID). * install_skill_from_s3 loop. * _ensure_workspace_ready. * _inject_skill_env. * attach/detach_eval_context. * messages list build (history + current). * Router profile resolution + effective_skills filter. * _build_system_prompt + external-task / workflow-skill / KB context injection. * _call_strands_agent. * tool_costs drain snapshot. Returns {response_text, strands_usage, duration_ms, invocation_tool_costs}. Raises on _call_strands_agent failure; caller classifies guardrail vs non-guardrail. Covered by existing container tests (234/234 passing). Adds no new tests — the extraction is a lift-and-shift. U3 will exercise the helper from the dispatcher path with real assertions.

Plan docs/plans/2026-04-24-008-feat-skill-run-dispatcher-plan.md §U3. Replaces the post-§U6 unsupported-runtime rejector shipped in #542 with a real dispatcher that: 1. Validates the envelope — null agentId fails fast with a named reason (webhook-sourced runs deferred to a follow-up per plan). 2. Fetches the agent's runtime config from the new GET /api/agents/runtime-config endpoint (U1) via a new Python client api_runtime_config.py. Bounded retry: 5xx + transient transport errors retry 3× with jittered backoff; 4xx terminal. 404 is a specific AgentConfigNotFoundError subclass. 3. Builds a chat-invoke-shaped synthetic payload (tenant / agent / skills / MCP / guardrail + a synthetic user message instructing the agent to run the skill with the given inputs). 4. Runs a headless Strands agent turn via the shared server._execute_agent_turn helper (U2) — the chat loop and the dispatcher now share one execution path; no second codebase to drift. 5. POSTs terminal status back to /api/skills/complete with the HMAC-signed callback. Successful runs write deliveredArtifactRef={type: "inline", payload: <markdown>}; empty responses fail with "agent produced no final text"; agent-loop exceptions fail with "agent loop crashed: <msg>". Also: * container-sources/_boot_assert.py — adds api_runtime_config to EXPECTED_CONTAINER_SOURCES so a Dockerfile COPY regression surfaces loudly at boot instead of silently shipping non-functional. * test_api_runtime_config.py — 9 tests: happy path, query-string shape, currentUserId forwarding, missing env, 404 → AgentConfigNot- FoundError, 401 terminal, 5xx retry, 5xx-then-200 recovery, socket timeout exhaustion, non-JSON body. * test_server_run_skill.py — rewritten for the new contract. 16 tests covering: happy path inline deliverable, null-agentId fast-fail, missing-field short-circuit, config-404, config-5xx, agent-loop crash, empty-response-text failure, synthetic-payload shape (both camelCase + snake_case config keys), HMAC retry semantics carried forward. Full container suite: 249 passed (up from 234 pre-U3). Dockerfile uses the wildcard COPY container-sources/ /app/ so no explicit entry needed (per the post-#542 Dockerfile shape).

Plan docs/plans/2026-04-24-008-feat-skill-run-dispatcher-plan.md §U4. RequestResponse with a 28s socket timeout was fundamentally incompatible with the agent loop the dispatcher now runs — the AgentCore Lambda has a 900s ceiling and real skill runs routinely take 10-60s+. Every emitter flips to InvocationType: Event so enqueue acknowledges immediately while the agent executes for as long as it needs. The HMAC-signed /api/skills/complete callback is the authoritative execution-result signal — it writes skill_runs.status on completion whether success or failure. Enqueue-level errors (IAM, Lambda missing) still surface synchronously via AWS returning 4xx/5xx on the Invoke call, which each emitter re-exposes through its ok/error return shape. This does NOT cross feedback_avoid_fire_and_forget_lambda_invokes — that rule governs paths with no callback; we have a durable HMAC-signed one. Flipped sites: * packages/api/src/graphql/utils.ts — invokeSkillRun (GraphQL mutation path + webhooks/_shared.ts both call this). * packages/api/src/handlers/skills.ts — invokeAgentcoreRunSkill (POST /api/skills/start service-auth path). * packages/lambda/job-trigger.ts — scheduled job invoke. Post-flip error handling checks `res.StatusCode` (Event returns 202 on success, 4xx/5xx on enqueue failure) rather than `res.FunctionError` + `res.Payload` (RequestResponse-only). Removed the NodeHttpHandler socket-timeout override — Event invokes don't hold the connection open waiting for the Lambda, so the 28s timeout workaround isn't meaningful. The chat-agent-invoke path (InvocationType: RequestResponse at chat-agent-invoke.ts:426) is unaffected — chat turns still need the synchronous response to land in the thread message stream. Full TS suite: 1375 passed, 68 passed (no regressions).

Plan docs/plans/2026-04-24-008-feat-skill-run-dispatcher-plan.md §U5. Replaces the post-#542 "every kind=run_skill envelope terminates with U6 unsupported-runtime" passing condition with the real contract: the row reaches `complete` (agent produced the deliverable) or `failed` with a specific reason (connector missing, agent loop error, config-fetch failure). Only stuck-running / transport errors fail the smoke. * scripts/smoke/CHECKS.md — rewrites the "What passing means today" section to describe the U4 dispatcher flow (config fetch → headless agent turn → HMAC callback). Refreshes the "Known gaps" list to note the remaining webhook null-agentId deferral instead of the retired U6 placeholder. * scripts/smoke/README.md — updates the webhook smoke test expectations; complete or specific-reason failure are both PASS. Refreshes the "What passes vs fails" bullet list. * scripts/smoke/catalog-smoke.sh — header docstring + PASS-condition comment describe the new terminal-states contract. * scripts/smoke/chat-smoke.sh — same.

- chat-agent-invoke.ts: delete stale 8-line comment describing currentUserEmail propagation that does not happen (M5). - scheduled.test.ts + _harness/README.md: update docstrings to reflect post-U4 Event + HMAC callback semantics, replacing stale RequestResponse prose (M6). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ce-code-review P0-A. SkillRunInvokePayload was missing agentId; all four TS emitters (startSkillRun.mutation, skills.ts startSkillRunService, webhooks/_shared.ts, job-trigger.ts) omitted the field; Python dispatcher rejected every non-webhook envelope with _MISSING_AGENT_REASON. test_server_run_skill.py hand-injected agentId="agent-1" which masked the prod bug. The fix: - Add `agentId: string | null` to SkillRunInvokePayload. - Thread `i.agentId ?? null` into the GraphQL mutation emitter. - Thread `agentId ?? null` into the service-REST startSkillRunService. - Thread `dispatch.agentId ?? null` into the webhook shared handler. - Thread `targetAgentId ?? null` into job-trigger's scheduled path. - Add regression tests in skill-runs-resolvers, webhook-shared, and job-trigger that assert agentId survives the envelope round-trip. The job-trigger test also pins InvocationType=Event (§U4 contract). Without this, PR #552 ships green CI but is inert for scheduled/chat/catalog runs in production. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ce-code-review P0-B. The helper is reachable via the service-auth REST endpoint `GET /api/agents/runtime-config?currentUserId=...`, where `currentUserId` is a caller-controlled query parameter. The users lookup at L225 only filtered on users.id, so any holder of API_AUTH_SECRET could enumerate a foreign tenant's user emails by flipping `tenantId` to their own while passing another tenant's userId. All three users lookups (currentUserId email, human_pair email fallback, human_pair name) now AND on users.tenant_id = opts.tenantId. The human_pair lookups derive from a tenant-scoped agent row so the predicate is defense-in-depth there; the currentUserId lookup is the actual leak. Added a regression test that upgrades the drizzle mock to capture where() predicates and asserts both `users.id` and `users.tenant_id` eq-pairs appear for the currentUserId lookup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ce-code-review P0-C. Plan §U4 flipped kind=run_skill dispatch to InvocationType=Event so the agent loop has the full 900s Lambda budget. AWS Lambda async-invoke defaults to MaximumRetryAttempts=2, which on this path means a single transient failure (5xx on the /api/skills/complete callback, container OOM, etc.) causes the whole agent turn to run again — re-burning Bedrock tokens and (in pathological cases) racing the first deliverable's writeback. The agent turn is not idempotent. Set MaximumRetryAttempts=0 so dispatch is one-shot. Durable safety net already exists: /api/skills/complete atomically CAS-updates on status='running' (any stray retry gets 409), and the skill-runs-reconciler flips rows stuck in `running` past the 15-min cutoff to `failed`. Failed invokes now land in thinkwork-<stage>-agentcore-async-dlq (SQS, 14-day retention, SSE-managed) instead of disappearing. Added IAM policy on the agentcore role for sqs:SendMessage, plus two outputs for operator inspection. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…sions (#562) Two related deploy-pipeline fixes that surfaced together when PR #561 merged: terraform-apply on main has been failing since PR #552 because the agents-runtime-config Lambda zip was never built, and the catalog S3 sync was uploading dev-only artifacts (node_modules, __tests__, tests/, characterization/) into the public skills/catalog/ prefix where the admin UI surfaces them as if they were skills. (1) scripts/build-lambdas.sh — add agents-runtime-config build entry PR #552 §U1 added the handler at packages/api/src/handlers/agents-runtime-config.ts plus the Terraform route in modules/app/lambda-api/handlers.tf, but missed the build entry. Result: every deploy after PR #552 merged failed at terraform-apply with: Error: Error in function call ... filebase64sha256 ... /home/runner/.../dist/lambdas/agents-runtime-config.zip: no such file or directory. This blocked PR #561 from actually deploying skill.yaml removal, and will block every subsequent merge until fixed. Add the build entry between `skills` and `plugin-upload` (matching the registration order in handlers.tf). (2) scripts/bootstrap-workspace.sh — exclude dev artifacts from S3 sync The script iterates `packages/skill-catalog/*/` and `aws s3 syncs` each subdir into `s3://<bucket>/skills/catalog/<slug>/`. The only exclusion was `scripts/`. Everything else got uploaded: - node_modules/ — 5900+ objects from pnpm install - __tests__/ — vitest test directories - characterization/ — test fixtures - tests/ — top-level test artifacts - per-skill tests/ dirs — Python pytest dirs (e.g., skills/catalog/agent-email-send/tests/test_send_outbound.py) The admin Capabilities → Skills file viewer reads `listCatalogFiles` which does an S3 ListObjects on the slug prefix and renders whatever it finds — so operators saw `tests/`, `node_modules/`, etc. surfaced as if they were part of the skill. Two-layer defense added: - NON_SKILL_DIRS array — top-level dirs that aren't skills are skipped entirely (not synced to any catalog prefix). Includes scripts (preserved), __tests__, characterization, node_modules, tests, dist. - Per-skill --exclude flags on `aws s3 sync` — drops tests/, __tests__/, __pycache__/, node_modules/, dist/, *.pyc, .DS_Store from the per-skill upload regardless of whether the skill author accidentally committed them. Also extends RETIRED_SLUGS to include thinkwork-admin (retired in PR #488), smoke-package-only, and the legacy top-level cruft prefixes (__tests__, characterization, node_modules, tests) so the next bootstrap run cleans up existing stale objects in dev/prod S3 even from before this script change. Verification: - bash -n on both scripts — syntax clean - bash scripts/build-lambdas.sh — produces dist/lambdas/agents-runtime-config.zip (110K) alongside every other handler Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…563) Real incident on dev (2026-04-25, runs c886c82e + 6d143ead): the agent turn ran successfully end-to-end, but `post_skill_run_complete` saw `os.environ.get("THINKWORK_API_URL")` come back empty and bailed silently with the ERROR log "cannot post completion — missing THINKWORK_API_URL / API_AUTH_SECRET". The skill_runs row was stuck at status=running for ~17 minutes until the reconciler backstopped it with the generic "stale running row" reason. The strange part: 30 seconds earlier in the same coroutine, `api_runtime_config.fetch` (which reads the same env) succeeded, and the workspace_sync log line `action=composer_fetch sync_ms=2838 files=14` confirmed the env was populated. Something inside the long agent turn appears to clear or shadow these vars by the time `post_skill_run_complete` runs. Root cause not fully diagnosed — the container code itself only reads these vars (no `os.environ.pop` of THINKWORK_API_URL anywhere), so it's likely a Strands SDK or botocore subprocess interaction. Rather than chase the diagnosis, this commit eliminates the failure mode structurally: capture the env values at dispatcher entry into local snapshots, and pass them as `api_url` / `api_secret` parameters to every `post_skill_run_complete` call. The function falls back to `os.environ` when not provided, preserving backward compat for the few callers that don't snapshot. Changes: - `post_skill_run_complete` accepts new optional params `api_url` + `api_secret`. Snapshot wins over env. Sharper ERROR log on the unrecoverable case (both snapshot and env empty), naming the runId + status so operators can correlate to the stuck skill_runs row when the reconciler picks it up. - `dispatch_run_skill` captures env at top, passes the snapshots through to all 7 `post_skill_run_complete` call sites (null-agentId, AgentConfigNotFoundError, RuntimeConfigFetchError, _execute_agent_turn ImportError, agent loop crash, empty response, success). If env IS empty at dispatcher entry — the genuine pre-injection race from `project_agentcore_deploy_race_env` — log loud at the entry point so operators see the cause without having to grep across multiple callback paths. Two new tests in `PostCompletionTests`: - `test_snapshot_params_override_empty_env` — env empty + snapshot provided → urlopen called with snapshot URL + Bearer header. - `test_snapshot_params_take_precedence_over_env` — env populated + snapshot provided → snapshot wins, env is ignored. Pins the precedence rule that's load-bearing for the env-drift fix. Verification: - `uv run pytest packages/agentcore-strands/agent-container/test_server_run_skill.py` — 18 passed (16 prior + 2 new). Closes the P1 reliability-05 finding from the ce-code-review autofix run on PR #552. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Plan hygiene + solution doc capturing today's launch-blocker arc: (1) docs/plans/2026-04-24-008-feat-skill-run-dispatcher-plan.md - Flip U1-U5 checkboxes from [ ] to [x] - Bump status: active → status: shipped - Add `shipped: 2026-04-25` + `prs:` block listing #552 (U1-U5 + 3 P0 fixes) and #563 (env-snapshot follow-up after dev incident) (2) docs/solutions/workflow-issues/agentcore-completion-callback-env-shadowing-2026-04-25.md - Real incident on dev today: dispatch_run_skill ran the agent turn successfully but post_skill_run_complete saw os.environ empty and bailed silently with "missing THINKWORK_API_URL / API_AUTH_SECRET", even though api_runtime_config.fetch in the same coroutine 30s earlier had succeeded with the same env read. - Container code never `pop`s these vars; strong hypothesis is a Strands SDK or botocore subprocess shadowing them mid-turn. Not diagnosed conclusively — the structural fix (snapshot at entry, thread through to writeback) makes the diagnosis moot. - Solution doc captures: full symptom trail, the obvious-explanation rule-out, the structural fix pattern, when this pattern applies to other coroutines, and operator triage playbook for prod. - Reusable for any future long-running agent coroutine that ends with a callback POST (memory-retain, MCP tool telemetry, etc.). Auto-memory entries (out-of-tree, in ~/.claude/projects/.../memory/): - feedback_completion_callback_snapshot_pattern — the snapshot rule - feedback_lambda_zip_build_entry_required — handlers.tf + build-lambdas.sh together - feedback_bootstrap_script_excludes_dev_artifacts — never restore unfiltered S3 sync - project_async_retry_idempotency_lessons — Event invoke retry=0 + DLQ pattern Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(api): U1 resolveAgentRuntimeConfig helper + /api/agents/runtime-config endpoint Plan docs/plans/2026-04-24-008-feat-skill-run-dispatcher-plan.md §U1. Factors the agent-level resolution logic (template + skills + KBs + MCP configs + guardrail + sandbox template) out of chat-agent-invoke.ts into a shared helper. Both the chat-invoke Lambda and the new service-auth REST endpoint call the same resolver, so the run_skill dispatcher and the chat loop can never drift on "what this agent's runtime looks like." * packages/api/src/lib/resolve-agent-runtime-config.ts — new helper. Returns AgentRuntimeConfig (agent, tenant, template, guardrailConfig, guardrailId, skillsConfig with defaults + built-in tools + blocked- tools filter, knowledgeBasesConfig, mcpConfigs, sandboxTemplate). Exports AgentNotFoundError + AgentTemplateNotFoundError for caller mapping to HTTP responses. * packages/api/src/handlers/agents-runtime-config.ts — new service-auth GET /api/agents/runtime-config?tenantId=X&agentId=Y. Bearer API_AUTH_SECRET per the `docs/solutions/best-practices/service- endpoint-vs-widening-resolvecaller-auth-2026-04-21.md` precedent. Optional currentUserId + currentUserEmail query params drive the CURRENT_USER_EMAIL overlay on default skills. * packages/api/src/handlers/chat-agent-invoke.ts — replaces the ~300- line inline resolution block with a single `resolveAgentRuntimeConfig` call. Per-turn concerns (thread_id, trace_id, message, messages_ history, currentUserId + currentUserEmail derivation, sandbox preflight) stay in the handler. Behavior-preserving refactor. * terraform/modules/app/lambda-api/handlers.tf — registers the new handler and routes GET /api/agents/runtime-config to it. * Tests: 8 helper-level unit tests (happy path, AgentNotFoundError, AgentTemplateNotFoundError, template vs tenant-default guardrail, blocked-tools filter, currentUserEmail overlay, built-in tool injection, MCP delegation). 13 handler-level tests (method/path/auth/ UUID validation + helper-exception → 404 mapping + currentUser param pass-through). Next: U2 — extract `_execute_agent_turn` helper in server.py so the dispatcher can reuse the chat-loop prologue + `_call_strands_agent`. * refactor(agentcore-strands): U2 extract _execute_agent_turn Plan docs/plans/2026-04-24-008-feat-skill-run-dispatcher-plan.md §U2. Pure refactor — no behavior change. Factors the ~170-line chat-loop prologue out of AgentCoreHandler.do_POST into a free function `_execute_agent_turn(payload)` so U3's dispatcher can reuse the same path without duplicating env setup + skill install + workspace bootstrap + eval-span attach + message build + system_prompt build + the _call_strands_agent call. What stayed in do_POST (chat-specific): * invocation_env.apply/cleanup (run_skill branch has its own already). * OpenAI chat.completion response shape + tool_costs/hindsight_usage projection. * _audit_response, auto-retain via api_memory_client, log_agent_invocation. * Guardrail-exception classification into a blocked-response 200. * self._respond HTTP writes. What moved into _execute_agent_turn: * Payload env unpack (workspace_bucket, thinkwork_api_url/secret, hindsight_endpoint, _INSTANCE_ID, _ASSISTANT_ID). * install_skill_from_s3 loop. * _ensure_workspace_ready. * _inject_skill_env. * attach/detach_eval_context. * messages list build (history + current). * Router profile resolution + effective_skills filter. * _build_system_prompt + external-task / workflow-skill / KB context injection. * _call_strands_agent. * tool_costs drain snapshot. Returns {response_text, strands_usage, duration_ms, invocation_tool_costs}. Raises on _call_strands_agent failure; caller classifies guardrail vs non-guardrail. Covered by existing container tests (234/234 passing). Adds no new tests — the extraction is a lift-and-shift. U3 will exercise the helper from the dispatcher path with real assertions. * feat(agentcore-strands): U3 real kind=run_skill dispatcher Plan docs/plans/2026-04-24-008-feat-skill-run-dispatcher-plan.md §U3. Replaces the post-§U6 unsupported-runtime rejector shipped in #542 with a real dispatcher that: 1. Validates the envelope — null agentId fails fast with a named reason (webhook-sourced runs deferred to a follow-up per plan). 2. Fetches the agent's runtime config from the new GET /api/agents/runtime-config endpoint (U1) via a new Python client api_runtime_config.py. Bounded retry: 5xx + transient transport errors retry 3× with jittered backoff; 4xx terminal. 404 is a specific AgentConfigNotFoundError subclass. 3. Builds a chat-invoke-shaped synthetic payload (tenant / agent / skills / MCP / guardrail + a synthetic user message instructing the agent to run the skill with the given inputs). 4. Runs a headless Strands agent turn via the shared server._execute_agent_turn helper (U2) — the chat loop and the dispatcher now share one execution path; no second codebase to drift. 5. POSTs terminal status back to /api/skills/complete with the HMAC-signed callback. Successful runs write deliveredArtifactRef={type: "inline", payload: <markdown>}; empty responses fail with "agent produced no final text"; agent-loop exceptions fail with "agent loop crashed: <msg>". Also: * container-sources/_boot_assert.py — adds api_runtime_config to EXPECTED_CONTAINER_SOURCES so a Dockerfile COPY regression surfaces loudly at boot instead of silently shipping non-functional. * test_api_runtime_config.py — 9 tests: happy path, query-string shape, currentUserId forwarding, missing env, 404 → AgentConfigNot- FoundError, 401 terminal, 5xx retry, 5xx-then-200 recovery, socket timeout exhaustion, non-JSON body. * test_server_run_skill.py — rewritten for the new contract. 16 tests covering: happy path inline deliverable, null-agentId fast-fail, missing-field short-circuit, config-404, config-5xx, agent-loop crash, empty-response-text failure, synthetic-payload shape (both camelCase + snake_case config keys), HMAC retry semantics carried forward. Full container suite: 249 passed (up from 234 pre-U3). Dockerfile uses the wildcard COPY container-sources/ /app/ so no explicit entry needed (per the post-#542 Dockerfile shape). * feat(api,lambda): U4 flip kind=run_skill Lambda invokes to Event Plan docs/plans/2026-04-24-008-feat-skill-run-dispatcher-plan.md §U4. RequestResponse with a 28s socket timeout was fundamentally incompatible with the agent loop the dispatcher now runs — the AgentCore Lambda has a 900s ceiling and real skill runs routinely take 10-60s+. Every emitter flips to InvocationType: Event so enqueue acknowledges immediately while the agent executes for as long as it needs. The HMAC-signed /api/skills/complete callback is the authoritative execution-result signal — it writes skill_runs.status on completion whether success or failure. Enqueue-level errors (IAM, Lambda missing) still surface synchronously via AWS returning 4xx/5xx on the Invoke call, which each emitter re-exposes through its ok/error return shape. This does NOT cross feedback_avoid_fire_and_forget_lambda_invokes — that rule governs paths with no callback; we have a durable HMAC-signed one. Flipped sites: * packages/api/src/graphql/utils.ts — invokeSkillRun (GraphQL mutation path + webhooks/_shared.ts both call this). * packages/api/src/handlers/skills.ts — invokeAgentcoreRunSkill (POST /api/skills/start service-auth path). * packages/lambda/job-trigger.ts — scheduled job invoke. Post-flip error handling checks `res.StatusCode` (Event returns 202 on success, 4xx/5xx on enqueue failure) rather than `res.FunctionError` + `res.Payload` (RequestResponse-only). Removed the NodeHttpHandler socket-timeout override — Event invokes don't hold the connection open waiting for the Lambda, so the 28s timeout workaround isn't meaningful. The chat-agent-invoke path (InvocationType: RequestResponse at chat-agent-invoke.ts:426) is unaffected — chat turns still need the synchronous response to land in the thread message stream. Full TS suite: 1375 passed, 68 passed (no regressions). * docs(smoke): U5 update smoke scripts for the real dispatcher Plan docs/plans/2026-04-24-008-feat-skill-run-dispatcher-plan.md §U5. Replaces the post-#542 "every kind=run_skill envelope terminates with U6 unsupported-runtime" passing condition with the real contract: the row reaches `complete` (agent produced the deliverable) or `failed` with a specific reason (connector missing, agent loop error, config-fetch failure). Only stuck-running / transport errors fail the smoke. * scripts/smoke/CHECKS.md — rewrites the "What passing means today" section to describe the U4 dispatcher flow (config fetch → headless agent turn → HMAC callback). Refreshes the "Known gaps" list to note the remaining webhook null-agentId deferral instead of the retired U6 placeholder. * scripts/smoke/README.md — updates the webhook smoke test expectations; complete or specific-reason failure are both PASS. Refreshes the "What passes vs fails" bullet list. * scripts/smoke/catalog-smoke.sh — header docstring + PASS-condition comment describe the new terminal-states contract. * scripts/smoke/chat-smoke.sh — same. * fix(review): apply ce-code-review autofix feedback - chat-agent-invoke.ts: delete stale 8-line comment describing currentUserEmail propagation that does not happen (M5). - scheduled.test.ts + _harness/README.md: update docstrings to reflect post-U4 Event + HMAC callback semantics, replacing stale RequestResponse prose (M6). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(api,lambda): thread agentId through run_skill envelope ce-code-review P0-A. SkillRunInvokePayload was missing agentId; all four TS emitters (startSkillRun.mutation, skills.ts startSkillRunService, webhooks/_shared.ts, job-trigger.ts) omitted the field; Python dispatcher rejected every non-webhook envelope with _MISSING_AGENT_REASON. test_server_run_skill.py hand-injected agentId="agent-1" which masked the prod bug. The fix: - Add `agentId: string | null` to SkillRunInvokePayload. - Thread `i.agentId ?? null` into the GraphQL mutation emitter. - Thread `agentId ?? null` into the service-REST startSkillRunService. - Thread `dispatch.agentId ?? null` into the webhook shared handler. - Thread `targetAgentId ?? null` into job-trigger's scheduled path. - Add regression tests in skill-runs-resolvers, webhook-shared, and job-trigger that assert agentId survives the envelope round-trip. The job-trigger test also pins InvocationType=Event (§U4 contract). Without this, PR #552 ships green CI but is inert for scheduled/chat/catalog runs in production. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(api): tenant-scope users lookups in resolveAgentRuntimeConfig ce-code-review P0-B. The helper is reachable via the service-auth REST endpoint `GET /api/agents/runtime-config?currentUserId=...`, where `currentUserId` is a caller-controlled query parameter. The users lookup at L225 only filtered on users.id, so any holder of API_AUTH_SECRET could enumerate a foreign tenant's user emails by flipping `tenantId` to their own while passing another tenant's userId. All three users lookups (currentUserId email, human_pair email fallback, human_pair name) now AND on users.tenant_id = opts.tenantId. The human_pair lookups derive from a tenant-scoped agent row so the predicate is defense-in-depth there; the currentUserId lookup is the actual leak. Added a regression test that upgrades the drizzle mock to capture where() predicates and asserts both `users.id` and `users.tenant_id` eq-pairs appear for the currentUserId lookup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(tf): MaximumRetryAttempts=0 + DLQ for agentcore-invoke ce-code-review P0-C. Plan §U4 flipped kind=run_skill dispatch to InvocationType=Event so the agent loop has the full 900s Lambda budget. AWS Lambda async-invoke defaults to MaximumRetryAttempts=2, which on this path means a single transient failure (5xx on the /api/skills/complete callback, container OOM, etc.) causes the whole agent turn to run again — re-burning Bedrock tokens and (in pathological cases) racing the first deliverable's writeback. The agent turn is not idempotent. Set MaximumRetryAttempts=0 so dispatch is one-shot. Durable safety net already exists: /api/skills/complete atomically CAS-updates on status='running' (any stray retry gets 409), and the skill-runs-reconciler flips rows stuck in `running` past the 15-min cutoff to `failed`. Failed invokes now land in thinkwork-<stage>-agentcore-async-dlq (SQS, 14-day retention, SSE-managed) instead of disappearing. Added IAM policy on the agentcore role for sqs:SendMessage, plus two outputs for operator inspection. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…sions (#562) Two related deploy-pipeline fixes that surfaced together when PR #561 merged: terraform-apply on main has been failing since PR #552 because the agents-runtime-config Lambda zip was never built, and the catalog S3 sync was uploading dev-only artifacts (node_modules, __tests__, tests/, characterization/) into the public skills/catalog/ prefix where the admin UI surfaces them as if they were skills. (1) scripts/build-lambdas.sh — add agents-runtime-config build entry PR #552 §U1 added the handler at packages/api/src/handlers/agents-runtime-config.ts plus the Terraform route in modules/app/lambda-api/handlers.tf, but missed the build entry. Result: every deploy after PR #552 merged failed at terraform-apply with: Error: Error in function call ... filebase64sha256 ... /home/runner/.../dist/lambdas/agents-runtime-config.zip: no such file or directory. This blocked PR #561 from actually deploying skill.yaml removal, and will block every subsequent merge until fixed. Add the build entry between `skills` and `plugin-upload` (matching the registration order in handlers.tf). (2) scripts/bootstrap-workspace.sh — exclude dev artifacts from S3 sync The script iterates `packages/skill-catalog/*/` and `aws s3 syncs` each subdir into `s3://<bucket>/skills/catalog/<slug>/`. The only exclusion was `scripts/`. Everything else got uploaded: - node_modules/ — 5900+ objects from pnpm install - __tests__/ — vitest test directories - characterization/ — test fixtures - tests/ — top-level test artifacts - per-skill tests/ dirs — Python pytest dirs (e.g., skills/catalog/agent-email-send/tests/test_send_outbound.py) The admin Capabilities → Skills file viewer reads `listCatalogFiles` which does an S3 ListObjects on the slug prefix and renders whatever it finds — so operators saw `tests/`, `node_modules/`, etc. surfaced as if they were part of the skill. Two-layer defense added: - NON_SKILL_DIRS array — top-level dirs that aren't skills are skipped entirely (not synced to any catalog prefix). Includes scripts (preserved), __tests__, characterization, node_modules, tests, dist. - Per-skill --exclude flags on `aws s3 sync` — drops tests/, __tests__/, __pycache__/, node_modules/, dist/, *.pyc, .DS_Store from the per-skill upload regardless of whether the skill author accidentally committed them. Also extends RETIRED_SLUGS to include thinkwork-admin (retired in PR #488), smoke-package-only, and the legacy top-level cruft prefixes (__tests__, characterization, node_modules, tests) so the next bootstrap run cleans up existing stale objects in dev/prod S3 even from before this script change. Verification: - bash -n on both scripts — syntax clean - bash scripts/build-lambdas.sh — produces dist/lambdas/agents-runtime-config.zip (110K) alongside every other handler Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…563) Real incident on dev (2026-04-25, runs c886c82e + 6d143ead): the agent turn ran successfully end-to-end, but `post_skill_run_complete` saw `os.environ.get("THINKWORK_API_URL")` come back empty and bailed silently with the ERROR log "cannot post completion — missing THINKWORK_API_URL / API_AUTH_SECRET". The skill_runs row was stuck at status=running for ~17 minutes until the reconciler backstopped it with the generic "stale running row" reason. The strange part: 30 seconds earlier in the same coroutine, `api_runtime_config.fetch` (which reads the same env) succeeded, and the workspace_sync log line `action=composer_fetch sync_ms=2838 files=14` confirmed the env was populated. Something inside the long agent turn appears to clear or shadow these vars by the time `post_skill_run_complete` runs. Root cause not fully diagnosed — the container code itself only reads these vars (no `os.environ.pop` of THINKWORK_API_URL anywhere), so it's likely a Strands SDK or botocore subprocess interaction. Rather than chase the diagnosis, this commit eliminates the failure mode structurally: capture the env values at dispatcher entry into local snapshots, and pass them as `api_url` / `api_secret` parameters to every `post_skill_run_complete` call. The function falls back to `os.environ` when not provided, preserving backward compat for the few callers that don't snapshot. Changes: - `post_skill_run_complete` accepts new optional params `api_url` + `api_secret`. Snapshot wins over env. Sharper ERROR log on the unrecoverable case (both snapshot and env empty), naming the runId + status so operators can correlate to the stuck skill_runs row when the reconciler picks it up. - `dispatch_run_skill` captures env at top, passes the snapshots through to all 7 `post_skill_run_complete` call sites (null-agentId, AgentConfigNotFoundError, RuntimeConfigFetchError, _execute_agent_turn ImportError, agent loop crash, empty response, success). If env IS empty at dispatcher entry — the genuine pre-injection race from `project_agentcore_deploy_race_env` — log loud at the entry point so operators see the cause without having to grep across multiple callback paths. Two new tests in `PostCompletionTests`: - `test_snapshot_params_override_empty_env` — env empty + snapshot provided → urlopen called with snapshot URL + Bearer header. - `test_snapshot_params_take_precedence_over_env` — env populated + snapshot provided → snapshot wins, env is ignored. Pins the precedence rule that's load-bearing for the env-drift fix. Verification: - `uv run pytest packages/agentcore-strands/agent-container/test_server_run_skill.py` — 18 passed (16 prior + 2 new). Closes the P1 reliability-05 finding from the ce-code-review autofix run on PR #552. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Plan hygiene + solution doc capturing today's launch-blocker arc: (1) docs/plans/2026-04-24-008-feat-skill-run-dispatcher-plan.md - Flip U1-U5 checkboxes from [ ] to [x] - Bump status: active → status: shipped - Add `shipped: 2026-04-25` + `prs:` block listing #552 (U1-U5 + 3 P0 fixes) and #563 (env-snapshot follow-up after dev incident) (2) docs/solutions/workflow-issues/agentcore-completion-callback-env-shadowing-2026-04-25.md - Real incident on dev today: dispatch_run_skill ran the agent turn successfully but post_skill_run_complete saw os.environ empty and bailed silently with "missing THINKWORK_API_URL / API_AUTH_SECRET", even though api_runtime_config.fetch in the same coroutine 30s earlier had succeeded with the same env read. - Container code never `pop`s these vars; strong hypothesis is a Strands SDK or botocore subprocess shadowing them mid-turn. Not diagnosed conclusively — the structural fix (snapshot at entry, thread through to writeback) makes the diagnosis moot. - Solution doc captures: full symptom trail, the obvious-explanation rule-out, the structural fix pattern, when this pattern applies to other coroutines, and operator triage playbook for prod. - Reusable for any future long-running agent coroutine that ends with a callback POST (memory-retain, MCP tool telemetry, etc.). Auto-memory entries (out-of-tree, in ~/.claude/projects/.../memory/): - feedback_completion_callback_snapshot_pattern — the snapshot rule - feedback_lambda_zip_build_entry_required — handlers.tf + build-lambdas.sh together - feedback_bootstrap_script_excludes_dev_artifacts — never restore unfiltered S3 sync - project_async_retry_idempotency_lessons — Event invoke retry=0 + DLQ pattern Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ericodom and others added 9 commits April 24, 2026 14:54

ericodom merged commit 168ef25 into main Apr 24, 2026
4 checks passed

ericodom deleted the feat/skill-run-dispatcher branch April 24, 2026 23:56

ericodom mentioned this pull request Apr 25, 2026

fix(deploy): unblock terraform-apply + harden bootstrap S3 sync exclusions #562

Merged

5 tasks

ericodom mentioned this pull request Apr 25, 2026

fix(agentcore-strands): snapshot completion env at dispatcher entry #563

Merged

3 tasks

ericodom mentioned this pull request Apr 25, 2026

docs(plans,solutions): close out launch-blocker-1 dispatcher work #564

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: out-of-band kind=run_skill dispatcher (plan §007-008)#552

feat: out-of-band kind=run_skill dispatcher (plan §007-008)#552
ericodom merged 9 commits into
mainfrom
feat/skill-run-dispatcher

ericodom commented Apr 24, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ericodom commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What ships

U1 — Shared runtime-config helper + new service-auth REST endpoint

U2 — Extract _execute_agent_turn helper in server.py

U3 — Python api_runtime_config client + real dispatcher

U4 — Flip TS emitters to InvocationType: "Event"

U5 — Smoke scripts + CHECKS.md refresh

ce-code-review fixes (autofix + P0s, post-U5)

Key technical decisions

What's explicitly NOT in this PR

Test plan

Deploy ordering notes

Residual Review Findings

Actionable (P1) — should ship in a follow-up before/alongside launch

Polish (P2) — defer to ops hardening pass

Advisory (P3)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ericodom commented Apr 24, 2026 •

edited

Loading

U2 — Extract `_execute_agent_turn` helper in `server.py`

U3 — Python `api_runtime_config` client + real dispatcher

U4 — Flip TS emitters to `InvocationType: "Event"`