Skip to content

feat: out-of-band kind=run_skill dispatcher (plan §007-008)#552

Merged
ericodom merged 9 commits into
mainfrom
feat/skill-run-dispatcher
Apr 24, 2026
Merged

feat: out-of-band kind=run_skill dispatcher (plan §007-008)#552
ericodom merged 9 commits into
mainfrom
feat/skill-run-dispatcher

Conversation

@ericodom
Copy link
Copy Markdown
Contributor

@ericodom ericodom commented Apr 24, 2026

Summary

Launch-blocker #1 from today's session. Replaces the post-§U6 placeholder rejector with a real dispatcher: scheduled jobs, admin "Run now", chat-intent-dispatched skill runs, and webhook-triggered runs (with non-null agentId) now actually execute the target skill via the same Strands agent loop the chat turn uses.

Plan: docs/plans/2026-04-24-008-feat-skill-run-dispatcher-plan.md

What ships

Five focused units, dependency-ordered, each a separate commit:

U1 — Shared runtime-config helper + new service-auth REST endpoint

  • New packages/api/src/lib/resolve-agent-runtime-config.ts factors the ~300-line agent/template/skills/MCP/KB/guardrail resolution out of chat-agent-invoke.ts. Helper is the single source of truth; both chat-invoke and the new endpoint call it. 9 unit tests.
  • New packages/api/src/handlers/agents-runtime-config.ts exposes GET /api/agents/runtime-config?tenantId=X&agentId=Y (Bearer API_AUTH_SECRET). 13 handler tests.
  • chat-agent-invoke.ts refactored to call the helper; zero behavior change. Per-turn fields (thread context, sandbox preflight, user resolution) stay inline.
  • Terraform route + handler registration in lambda-api/handlers.tf.

U2 — Extract _execute_agent_turn helper in server.py

  • ~170-line chat-loop prologue factored into a reusable helper. Same env setup, skill install, workspace ready, eval span, messages build, system prompt, _call_strands_agent, tool-cost drain. Pure lift-and-shift refactor.
  • Chat do_POST now calls the helper; dispatcher will too.
  • All 234 existing container tests pass unchanged.

U3 — Python api_runtime_config client + real dispatcher

  • New container-sources/api_runtime_config.py: urllib GET with 3-attempt retry (5xx + transient retry, 4xx terminal, 404 → AgentConfigNotFoundError). 9 unit tests.
  • run_skill_dispatch.py rewritten: null-agentId fails fast; otherwise fetch config → build synthetic chat turn ("Run the sales-prep skill with these inputs: ...") → call _execute_agent_turn → POST /api/skills/complete with {type: inline, payload} on complete, reason on failure. 16 new tests cover happy path + all failure modes.
  • Dockerfile uses wildcard COPY container-sources/ /app/ so new module picked up automatically; _boot_assert.py::EXPECTED_CONTAINER_SOURCES updated per the institutional learning.

U4 — Flip TS emitters to InvocationType: "Event"

  • packages/api/src/graphql/utils.ts::invokeSkillRun, packages/api/src/handlers/skills.ts::invokeAgentcoreRunSkill, packages/lambda/job-trigger.ts all flip RequestResponseEvent. Webhook path inherits via invokeSkillRun.
  • Why: agent loop routinely takes 10-60s; RequestResponse with 28s socket timeout falsely timed out. Event returns 202 on enqueue; the HMAC-signed /api/skills/complete callback is the authoritative execution-result signal.
  • Does NOT cross feedback_avoid_fire_and_forget_lambda_invokes — that rule governs paths without callbacks; we have a durable HMAC-signed one.
  • Chat-agent-invoke stays RequestResponse (needs the synchronous response for the thread message stream).

U5 — Smoke scripts + CHECKS.md refresh

  • scripts/smoke/{chat,catalog}-smoke.sh + CHECKS.md + README.md flip passing language off "every envelope terminates with U6 unsupported-runtime" onto "row reaches complete or failed with a specific reason; only stuck-running / transport errors FAIL."

ce-code-review fixes (autofix + P0s, post-U5)

Four additional commits address findings from ce-code-review's multi-agent pass against U1–U5:

  • Autofix (ac59a32) — removed stale comment in chat-agent-invoke.ts describing nonexistent currentUserEmail propagation; updated scheduled.test.ts + harness README to reflect post-§U4 Event + HMAC callback semantics.
  • P0-A (c61ff58) — agentId threading. The SkillRunInvokePayload type was missing agentId; all four TS emitters stripped the field; the Python dispatcher rejected every non-webhook envelope with _MISSING_AGENT_REASON. Added agentId: string | null to the type, threaded through startSkillRun.mutation.ts, skills.ts::startSkillRunService, webhooks/_shared.ts, and job-trigger.ts. 3 regression tests (including an InvocationType=Event pin).
  • P0-B (7351cbe) — cross-tenant user-email leak. resolveAgentRuntimeConfig's users lookup scoped only on users.id; any holder of API_AUTH_SECRET could enumerate cross-tenant emails via the currentUserId query param. All three users lookups (currentUserId, human_pair email, human_pair name) now AND on users.tenant_id. Added a regression test that captures the actual where() predicates.
  • P0-C (5f03f00) — async-invoke retry non-idempotency. AWS Lambda default MaximumRetryAttempts=2 on the §U4 Event invoke would re-run the agent turn (re-burning Bedrock tokens) on any transient failure. Set MaximumRetryAttempts=0 on agentcore-invoke via aws_lambda_function_event_invoke_config; added a thinkwork-<stage>-agentcore-async-dlq SQS queue (14-day retention, SSE-managed) for failure visibility.

Key technical decisions

  1. Reuse _call_strands_agent via _execute_agent_turn — one code path for the agent loop; the chat-invoke and run_skill paths cannot drift.
  2. Runtime config fetched inside the container via GET /api/agents/runtime-config — keeps the four TS emitters lean (no envelope bloat); single resolver for both chat + dispatcher; tenant-scoped 404 returns don't leak cross-tenant existence.
  3. Fail fast on null agentId — scheduled/catalog/chat-intent always carry one; only webhook resolvers emit null, and tenant-admin fallback routing is a product decision that belongs in its own PR.
  4. Synthetic user message, not SKILL.md injection — the model reads the skill's SKILL.md via the AgentSkills plugin's progressive disclosure; we just tell it to run it with the given args. No harness magic.
  5. One-shot dispatch, not retried — agent turn is non-idempotent (Bedrock tokens); /api/skills/complete atomic CAS + 15-min reconciler provide the safety net; DLQ preserves observability of dropped invokes.

What's explicitly NOT in this PR

  • Refactor of _call_strands_agent (1,400-line function; too risky for this change).
  • Streaming the skill-run output.
  • Cancellation cooperation inside the agent loop (cancelSkillRun mutation flips the row; mid-run polling is separate work).
  • Webhook null-agentId fallback routing — fails with named reason for now.

Test plan

  • pnpm -r typecheck (minus @thinkwork/agent-tools pre-existing no-tsc) — green
  • pnpm -r test — 1378 passed, 7 skipped across api (1385 total); 68 passed lambda
  • uv run pytest packages/agentcore-strands/agent-container/ — 249 passed
  • pnpm exec prettier --check on touched packages — clean
  • terraform validate on agentcore-runtime module — clean
  • Mandatory grep for old rejector strings returns zero hits
  • Post-merge: run catalog-smoke on dev for sales-prep, confirm row reaches complete or fails at a specific connector
  • Post-merge: schedule a run via the admin "Run now" button, confirm same
  • Post-merge: log-grep dispatch_run_skill in CloudWatch to confirm the new code is serving (per docs/solutions/workflow-issues/deploy-silent-arch-mismatch-took-a-week-to-surface-2026-04-24.md)

Deploy ordering notes

  • Migration 0029 already applied on dev (from feat(U6): delete composition runner + skill_catalog.execution/mode #542). This PR adds no new migrations.
  • The AgentCore runtime update (deploy.yml's container rebuild + UpdateAgentRuntime) is required for the new Python code to serve — per docs/solutions/workflow-issues/agentcore-runtime-no-auto-repull-requires-explicit-update-2026-04-24.md the ECR push alone is not enough. The merge pipeline handles this.
  • Warm containers may boot pre-env-injection during terraform-apply per project_agentcore_deploy_race_env; the 15-min reconciler is the backstop. The new api_runtime_config.fetch reads env at call time, not import, so post-race recovery is automatic.
  • The new aws_sqs_queue.agentcore_async_dlq + aws_lambda_function_event_invoke_config.agentcore resources are additive; first apply creates them and flips retry semantics. No data migration needed.

Residual Review Findings

ce-code-review (autofix mode, 10 reviewers, 8 returned findings) surfaced the P0s above plus the P1/P2/P3 work listed here. Durable sink: this PR body (tracker filing was not invoked for in-scope launch work). Full artifact on disk at .context/compound-engineering/ce-code-review/20260424-151323-e0154520/findings.md in the worktree.

Actionable (P1) — should ship in a follow-up before/alongside launch

  • P1 | reliability-02aws_lambda_function_event_invoke_config now exists, but consider EventInvoke destinations for OnSuccess too if operator-side observability wants completion streaming. Not blocking.
  • P1 | reliability-05post_skill_run_complete in run_skill_dispatch.py returns silently when API_AUTH_SECRET / THINKWORK_API_URL are missing. Row stays running until the 15-min reconciler. Should log ERROR + raise so the AWS retry (now 0, but future-proof) or the operator sees it immediately.
  • P1 | testing-1_execute_agent_turn (~170 LOC, heavy hidden state) has zero direct unit tests. Add at minimum: happy path, empty response, tool-call path, exception propagation.
  • P1 | testing-2PostCompletionTests.test_missing_env_logs_and_returns asserts nothing; false confidence. Assert return value + log emission.
  • P1 | testing-4resolve-agent-runtime-config branches (guardrail-assigned path, human-pair email fallback path, humanN lookup) have no direct unit tests; current coverage is indirect via chat + dispatcher integration.
  • P1 | kieran-python async-def-without-awaitsdispatch_run_skill is async def with zero awaits. asyncio.run(...) makes it work, but either drop async or convert the blocking urllib calls to awaitable HTTP so the label reflects reality.
  • P1 | kieran-python late-importsfrom server import _execute_agent_turn inside the function hides a module-load cycle. Either declare explicitly or extract _execute_agent_turn into its own module.
  • P1 | kt-002 — three inlined copies of the invokeAgentcoreRunSkill envelope constructor (utils.ts, skills.ts, job-trigger.ts). Extract to a shared helper so future envelope-shape changes (like today's agentId thread) happen once, not thrice.

Polish (P2) — defer to ops hardening pass

  • _tool_costs global accumulates across invocations if _execute_agent_turn raises (reliability-03, reliability-07) — wrap in try/finally _tool_costs.clear().
  • No soft deadline inside dispatch_run_skill; runaway agent loop burns full 900s budget (reliability-04).
  • Missing completion_hmac_secret in envelope strands the row on 401 terminal (reliability-06).
  • Guardrail-blocked response is posted as successful skill completion instead of failed with guardrail_blocked reason (correctness-2, adv-05).
  • Tenant built-in-tool kill-switch not propagated to skill-run dispatch (correctness-3).
  • Dead fallback reads: scope.get('invokerUserId' | 'agentId' | 'threadId') never match TS producer shape — remove to prevent silent future breakage (correctness-4, api-contract-002).
  • _execute_agent_turn leaves workspace/composer/hindsight env vars set on warm containers (correctness-5).
  • _FETCH_RETRY_DELAYS and _COMPLETE_RETRY_DELAYS identical tuples in two modules (kieran-python retry-delays-duplicated).
  • _cfg(camel, snake) closure repeated 3 places (kieran-python envelope-scope-fallback-duplicated-three-times).
  • (k: any) cast in chat-agent-invoke.ts defeats KnowledgeBaseConfig type (kt-003).
  • JSONB unchecked casts in resolveAgentRuntimeConfig rely on schema stability (kt-004).
  • run_skill_dispatch reaches past the underscore prefix into server._execute_agent_turn (M2).
  • RuntimeConfigFetchError(code=None) conflates missing-env with retry-exhaustion (M4).
  • Empty-skills envelope silently posts "no skills available" deliverable as complete (adv-04).
  • Unvalidated response_text posted verbatim as deliveredArtifactRef with no size/shape check (adv-06).
  • Warm-container env leak via exception in do_POST snake_case/camelCase ordering (adv-07).
  • Admin cancel race — dispatcher blind to skill_runs.status flip mid-run; 4xx on callback is only signal (adv-08).
  • Several coverage gaps: _build_synthetic_payload asserts only 4 of 16+ keys (testing-5); scope casing mismatch untested (testing-6); agents-runtime-config handler test mocks the helper wholesale (testing-7); _execute_agent_turn edge cases — response_text=None, empty skills, None mcp_configs — untested (testing-8).

Advisory (P3)

  • (k: any) parameter naming drift (kt-005).
  • Inlined UUID_RE duplication — 6th+ copy (kt-006).
  • agents-runtime-config handler hand-rolls OPTIONS instead of using handleCors/cors() (kt-007).
  • _urlopen_with_retry's assert last_exc is not None (reliability-08).
  • Runtime-config fetch budget sums to ~14s but each call allows timeout=15s × 4 attempts = 60s worst case (reliability-09).
  • No circuit breaker on runtime-config fetch — API outage cascades across N dispatches (adv-09).
  • HMAC header parse edge case — bare-hex accepted alongside sha256=... (adv-10).

🤖 Generated with Claude Code

ericodom and others added 9 commits April 24, 2026 14:54
…config endpoint

Plan docs/plans/2026-04-24-008-feat-skill-run-dispatcher-plan.md §U1.

Factors the agent-level resolution logic (template + skills + KBs + MCP
configs + guardrail + sandbox template) out of chat-agent-invoke.ts into
a shared helper. Both the chat-invoke Lambda and the new service-auth
REST endpoint call the same resolver, so the run_skill dispatcher and
the chat loop can never drift on "what this agent's runtime looks like."

* packages/api/src/lib/resolve-agent-runtime-config.ts — new helper.
  Returns AgentRuntimeConfig (agent, tenant, template, guardrailConfig,
  guardrailId, skillsConfig with defaults + built-in tools + blocked-
  tools filter, knowledgeBasesConfig, mcpConfigs, sandboxTemplate).
  Exports AgentNotFoundError + AgentTemplateNotFoundError for caller
  mapping to HTTP responses.
* packages/api/src/handlers/agents-runtime-config.ts — new service-auth
  GET /api/agents/runtime-config?tenantId=X&agentId=Y. Bearer
  API_AUTH_SECRET per the `docs/solutions/best-practices/service-
  endpoint-vs-widening-resolvecaller-auth-2026-04-21.md` precedent.
  Optional currentUserId + currentUserEmail query params drive the
  CURRENT_USER_EMAIL overlay on default skills.
* packages/api/src/handlers/chat-agent-invoke.ts — replaces the ~300-
  line inline resolution block with a single `resolveAgentRuntimeConfig`
  call. Per-turn concerns (thread_id, trace_id, message, messages_
  history, currentUserId + currentUserEmail derivation, sandbox
  preflight) stay in the handler. Behavior-preserving refactor.
* terraform/modules/app/lambda-api/handlers.tf — registers the new
  handler and routes GET /api/agents/runtime-config to it.
* Tests: 8 helper-level unit tests (happy path, AgentNotFoundError,
  AgentTemplateNotFoundError, template vs tenant-default guardrail,
  blocked-tools filter, currentUserEmail overlay, built-in tool
  injection, MCP delegation). 13 handler-level tests (method/path/auth/
  UUID validation + helper-exception → 404 mapping + currentUser param
  pass-through).

Next: U2 — extract `_execute_agent_turn` helper in server.py so the
dispatcher can reuse the chat-loop prologue + `_call_strands_agent`.
Plan docs/plans/2026-04-24-008-feat-skill-run-dispatcher-plan.md §U2.

Pure refactor — no behavior change. Factors the ~170-line chat-loop
prologue out of AgentCoreHandler.do_POST into a free function
`_execute_agent_turn(payload)` so U3's dispatcher can reuse the same
path without duplicating env setup + skill install + workspace bootstrap
+ eval-span attach + message build + system_prompt build + the
_call_strands_agent call.

What stayed in do_POST (chat-specific):
* invocation_env.apply/cleanup (run_skill branch has its own already).
* OpenAI chat.completion response shape + tool_costs/hindsight_usage
  projection.
* _audit_response, auto-retain via api_memory_client, log_agent_invocation.
* Guardrail-exception classification into a blocked-response 200.
* self._respond HTTP writes.

What moved into _execute_agent_turn:
* Payload env unpack (workspace_bucket, thinkwork_api_url/secret,
  hindsight_endpoint, _INSTANCE_ID, _ASSISTANT_ID).
* install_skill_from_s3 loop.
* _ensure_workspace_ready.
* _inject_skill_env.
* attach/detach_eval_context.
* messages list build (history + current).
* Router profile resolution + effective_skills filter.
* _build_system_prompt + external-task / workflow-skill / KB context
  injection.
* _call_strands_agent.
* tool_costs drain snapshot.

Returns {response_text, strands_usage, duration_ms, invocation_tool_costs}.
Raises on _call_strands_agent failure; caller classifies guardrail vs
non-guardrail.

Covered by existing container tests (234/234 passing). Adds no new
tests — the extraction is a lift-and-shift. U3 will exercise the
helper from the dispatcher path with real assertions.
Plan docs/plans/2026-04-24-008-feat-skill-run-dispatcher-plan.md §U3.

Replaces the post-§U6 unsupported-runtime rejector shipped in #542 with
a real dispatcher that:

  1. Validates the envelope — null agentId fails fast with a named
     reason (webhook-sourced runs deferred to a follow-up per plan).
  2. Fetches the agent's runtime config from the new
     GET /api/agents/runtime-config endpoint (U1) via a new Python
     client api_runtime_config.py. Bounded retry: 5xx + transient
     transport errors retry 3× with jittered backoff; 4xx terminal.
     404 is a specific AgentConfigNotFoundError subclass.
  3. Builds a chat-invoke-shaped synthetic payload (tenant / agent /
     skills / MCP / guardrail + a synthetic user message instructing
     the agent to run the skill with the given inputs).
  4. Runs a headless Strands agent turn via the shared
     server._execute_agent_turn helper (U2) — the chat loop and the
     dispatcher now share one execution path; no second codebase to
     drift.
  5. POSTs terminal status back to /api/skills/complete with the
     HMAC-signed callback. Successful runs write
     deliveredArtifactRef={type: "inline", payload: <markdown>};
     empty responses fail with "agent produced no final text";
     agent-loop exceptions fail with "agent loop crashed: <msg>".

Also:

* container-sources/_boot_assert.py — adds api_runtime_config to
  EXPECTED_CONTAINER_SOURCES so a Dockerfile COPY regression surfaces
  loudly at boot instead of silently shipping non-functional.
* test_api_runtime_config.py — 9 tests: happy path, query-string
  shape, currentUserId forwarding, missing env, 404 → AgentConfigNot-
  FoundError, 401 terminal, 5xx retry, 5xx-then-200 recovery, socket
  timeout exhaustion, non-JSON body.
* test_server_run_skill.py — rewritten for the new contract. 16 tests
  covering: happy path inline deliverable, null-agentId fast-fail,
  missing-field short-circuit, config-404, config-5xx, agent-loop
  crash, empty-response-text failure, synthetic-payload shape (both
  camelCase + snake_case config keys), HMAC retry semantics carried
  forward.

Full container suite: 249 passed (up from 234 pre-U3).

Dockerfile uses the wildcard COPY container-sources/ /app/ so no
explicit entry needed (per the post-#542 Dockerfile shape).
Plan docs/plans/2026-04-24-008-feat-skill-run-dispatcher-plan.md §U4.

RequestResponse with a 28s socket timeout was fundamentally incompatible
with the agent loop the dispatcher now runs — the AgentCore Lambda has
a 900s ceiling and real skill runs routinely take 10-60s+. Every emitter
flips to InvocationType: Event so enqueue acknowledges immediately while
the agent executes for as long as it needs.

The HMAC-signed /api/skills/complete callback is the authoritative
execution-result signal — it writes skill_runs.status on completion
whether success or failure. Enqueue-level errors (IAM, Lambda missing)
still surface synchronously via AWS returning 4xx/5xx on the Invoke
call, which each emitter re-exposes through its ok/error return shape.

This does NOT cross feedback_avoid_fire_and_forget_lambda_invokes —
that rule governs paths with no callback; we have a durable HMAC-signed
one.

Flipped sites:
* packages/api/src/graphql/utils.ts — invokeSkillRun (GraphQL
  mutation path + webhooks/_shared.ts both call this).
* packages/api/src/handlers/skills.ts — invokeAgentcoreRunSkill
  (POST /api/skills/start service-auth path).
* packages/lambda/job-trigger.ts — scheduled job invoke.

Post-flip error handling checks `res.StatusCode` (Event returns 202 on
success, 4xx/5xx on enqueue failure) rather than `res.FunctionError` +
`res.Payload` (RequestResponse-only). Removed the NodeHttpHandler
socket-timeout override — Event invokes don't hold the connection
open waiting for the Lambda, so the 28s timeout workaround isn't
meaningful.

The chat-agent-invoke path (InvocationType: RequestResponse at
chat-agent-invoke.ts:426) is unaffected — chat turns still need the
synchronous response to land in the thread message stream.

Full TS suite: 1375 passed, 68 passed (no regressions).
Plan docs/plans/2026-04-24-008-feat-skill-run-dispatcher-plan.md §U5.

Replaces the post-#542 "every kind=run_skill envelope terminates with
U6 unsupported-runtime" passing condition with the real contract: the
row reaches `complete` (agent produced the deliverable) or `failed`
with a specific reason (connector missing, agent loop error,
config-fetch failure). Only stuck-running / transport errors fail the
smoke.

* scripts/smoke/CHECKS.md — rewrites the "What passing means today"
  section to describe the U4 dispatcher flow (config fetch → headless
  agent turn → HMAC callback). Refreshes the "Known gaps" list to
  note the remaining webhook null-agentId deferral instead of the
  retired U6 placeholder.
* scripts/smoke/README.md — updates the webhook smoke test
  expectations; complete or specific-reason failure are both PASS.
  Refreshes the "What passes vs fails" bullet list.
* scripts/smoke/catalog-smoke.sh — header docstring + PASS-condition
  comment describe the new terminal-states contract.
* scripts/smoke/chat-smoke.sh — same.
- chat-agent-invoke.ts: delete stale 8-line comment describing
  currentUserEmail propagation that does not happen (M5).
- scheduled.test.ts + _harness/README.md: update docstrings to
  reflect post-U4 Event + HMAC callback semantics, replacing stale
  RequestResponse prose (M6).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ce-code-review P0-A. SkillRunInvokePayload was missing agentId; all
four TS emitters (startSkillRun.mutation, skills.ts
startSkillRunService, webhooks/_shared.ts, job-trigger.ts) omitted
the field; Python dispatcher rejected every non-webhook envelope with
_MISSING_AGENT_REASON. test_server_run_skill.py hand-injected
agentId="agent-1" which masked the prod bug.

The fix:
- Add `agentId: string | null` to SkillRunInvokePayload.
- Thread `i.agentId ?? null` into the GraphQL mutation emitter.
- Thread `agentId ?? null` into the service-REST startSkillRunService.
- Thread `dispatch.agentId ?? null` into the webhook shared handler.
- Thread `targetAgentId ?? null` into job-trigger's scheduled path.
- Add regression tests in skill-runs-resolvers, webhook-shared, and
  job-trigger that assert agentId survives the envelope round-trip.
  The job-trigger test also pins InvocationType=Event (§U4 contract).

Without this, PR #552 ships green CI but is inert for
scheduled/chat/catalog runs in production.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ce-code-review P0-B. The helper is reachable via the service-auth
REST endpoint `GET /api/agents/runtime-config?currentUserId=...`,
where `currentUserId` is a caller-controlled query parameter. The
users lookup at L225 only filtered on users.id, so any holder of
API_AUTH_SECRET could enumerate a foreign tenant's user emails by
flipping `tenantId` to their own while passing another tenant's
userId.

All three users lookups (currentUserId email, human_pair email
fallback, human_pair name) now AND on users.tenant_id = opts.tenantId.
The human_pair lookups derive from a tenant-scoped agent row so the
predicate is defense-in-depth there; the currentUserId lookup is the
actual leak.

Added a regression test that upgrades the drizzle mock to capture
where() predicates and asserts both `users.id` and `users.tenant_id`
eq-pairs appear for the currentUserId lookup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ce-code-review P0-C. Plan §U4 flipped kind=run_skill dispatch to
InvocationType=Event so the agent loop has the full 900s Lambda
budget. AWS Lambda async-invoke defaults to MaximumRetryAttempts=2,
which on this path means a single transient failure (5xx on the
/api/skills/complete callback, container OOM, etc.) causes the whole
agent turn to run again — re-burning Bedrock tokens and (in
pathological cases) racing the first deliverable's writeback.

The agent turn is not idempotent. Set MaximumRetryAttempts=0 so
dispatch is one-shot. Durable safety net already exists:
/api/skills/complete atomically CAS-updates on status='running' (any
stray retry gets 409), and the skill-runs-reconciler flips rows stuck
in `running` past the 15-min cutoff to `failed`.

Failed invokes now land in thinkwork-<stage>-agentcore-async-dlq (SQS,
14-day retention, SSE-managed) instead of disappearing. Added IAM
policy on the agentcore role for sqs:SendMessage, plus two outputs
for operator inspection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ericodom ericodom merged commit 168ef25 into main Apr 24, 2026
4 checks passed
@ericodom ericodom deleted the feat/skill-run-dispatcher branch April 24, 2026 23:56
ericodom added a commit that referenced this pull request Apr 25, 2026
…sions (#562)

Two related deploy-pipeline fixes that surfaced together when PR #561
merged: terraform-apply on main has been failing since PR #552 because
the agents-runtime-config Lambda zip was never built, and the catalog
S3 sync was uploading dev-only artifacts (node_modules, __tests__,
tests/, characterization/) into the public skills/catalog/ prefix
where the admin UI surfaces them as if they were skills.

(1) scripts/build-lambdas.sh — add agents-runtime-config build entry

PR #552 §U1 added the handler at packages/api/src/handlers/agents-runtime-config.ts
plus the Terraform route in modules/app/lambda-api/handlers.tf, but
missed the build entry. Result: every deploy after PR #552 merged
failed at terraform-apply with:

  Error: Error in function call ... filebase64sha256 ...
  /home/runner/.../dist/lambdas/agents-runtime-config.zip:
  no such file or directory.

This blocked PR #561 from actually deploying skill.yaml removal, and
will block every subsequent merge until fixed. Add the build entry
between `skills` and `plugin-upload` (matching the registration order
in handlers.tf).

(2) scripts/bootstrap-workspace.sh — exclude dev artifacts from S3 sync

The script iterates `packages/skill-catalog/*/` and `aws s3 syncs`
each subdir into `s3://<bucket>/skills/catalog/<slug>/`. The only
exclusion was `scripts/`. Everything else got uploaded:

  - node_modules/ — 5900+ objects from pnpm install
  - __tests__/ — vitest test directories
  - characterization/ — test fixtures
  - tests/ — top-level test artifacts
  - per-skill tests/ dirs — Python pytest dirs (e.g.,
    skills/catalog/agent-email-send/tests/test_send_outbound.py)

The admin Capabilities → Skills file viewer reads `listCatalogFiles`
which does an S3 ListObjects on the slug prefix and renders whatever
it finds — so operators saw `tests/`, `node_modules/`, etc. surfaced
as if they were part of the skill.

Two-layer defense added:

  - NON_SKILL_DIRS array — top-level dirs that aren't skills are
    skipped entirely (not synced to any catalog prefix). Includes
    scripts (preserved), __tests__, characterization, node_modules,
    tests, dist.

  - Per-skill --exclude flags on `aws s3 sync` — drops tests/,
    __tests__/, __pycache__/, node_modules/, dist/, *.pyc, .DS_Store
    from the per-skill upload regardless of whether the skill author
    accidentally committed them.

Also extends RETIRED_SLUGS to include thinkwork-admin (retired in
PR #488), smoke-package-only, and the legacy top-level cruft prefixes
(__tests__, characterization, node_modules, tests) so the next
bootstrap run cleans up existing stale objects in dev/prod S3
even from before this script change.

Verification:
- bash -n on both scripts — syntax clean
- bash scripts/build-lambdas.sh — produces dist/lambdas/agents-runtime-config.zip
  (110K) alongside every other handler

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ericodom added a commit that referenced this pull request Apr 25, 2026
…563)

Real incident on dev (2026-04-25, runs c886c82e + 6d143ead): the
agent turn ran successfully end-to-end, but `post_skill_run_complete`
saw `os.environ.get("THINKWORK_API_URL")` come back empty and bailed
silently with the ERROR log "cannot post completion — missing
THINKWORK_API_URL / API_AUTH_SECRET". The skill_runs row was stuck at
status=running for ~17 minutes until the reconciler backstopped it
with the generic "stale running row" reason.

The strange part: 30 seconds earlier in the same coroutine,
`api_runtime_config.fetch` (which reads the same env) succeeded, and
the workspace_sync log line `action=composer_fetch sync_ms=2838
files=14` confirmed the env was populated. Something inside the long
agent turn appears to clear or shadow these vars by the time
`post_skill_run_complete` runs. Root cause not fully diagnosed — the
container code itself only reads these vars (no `os.environ.pop` of
THINKWORK_API_URL anywhere), so it's likely a Strands SDK or
botocore subprocess interaction.

Rather than chase the diagnosis, this commit eliminates the failure
mode structurally: capture the env values at dispatcher entry into
local snapshots, and pass them as `api_url` / `api_secret` parameters
to every `post_skill_run_complete` call. The function falls back to
`os.environ` when not provided, preserving backward compat for the
few callers that don't snapshot.

Changes:
- `post_skill_run_complete` accepts new optional params `api_url` +
  `api_secret`. Snapshot wins over env. Sharper ERROR log on the
  unrecoverable case (both snapshot and env empty), naming the
  runId + status so operators can correlate to the stuck skill_runs
  row when the reconciler picks it up.
- `dispatch_run_skill` captures env at top, passes the snapshots
  through to all 7 `post_skill_run_complete` call sites
  (null-agentId, AgentConfigNotFoundError, RuntimeConfigFetchError,
  _execute_agent_turn ImportError, agent loop crash, empty response,
  success). If env IS empty at dispatcher entry — the genuine
  pre-injection race from `project_agentcore_deploy_race_env` — log
  loud at the entry point so operators see the cause without having
  to grep across multiple callback paths.

Two new tests in `PostCompletionTests`:
- `test_snapshot_params_override_empty_env` — env empty + snapshot
  provided → urlopen called with snapshot URL + Bearer header.
- `test_snapshot_params_take_precedence_over_env` — env populated +
  snapshot provided → snapshot wins, env is ignored. Pins the
  precedence rule that's load-bearing for the env-drift fix.

Verification:
- `uv run pytest packages/agentcore-strands/agent-container/test_server_run_skill.py`
  — 18 passed (16 prior + 2 new).

Closes the P1 reliability-05 finding from the ce-code-review autofix
run on PR #552.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ericodom added a commit that referenced this pull request Apr 25, 2026
Plan hygiene + solution doc capturing today's launch-blocker arc:

(1) docs/plans/2026-04-24-008-feat-skill-run-dispatcher-plan.md
- Flip U1-U5 checkboxes from [ ] to [x]
- Bump status: active → status: shipped
- Add `shipped: 2026-04-25` + `prs:` block listing #552 (U1-U5 + 3 P0
  fixes) and #563 (env-snapshot follow-up after dev incident)

(2) docs/solutions/workflow-issues/agentcore-completion-callback-env-shadowing-2026-04-25.md
- Real incident on dev today: dispatch_run_skill ran the agent turn
  successfully but post_skill_run_complete saw os.environ empty and
  bailed silently with "missing THINKWORK_API_URL / API_AUTH_SECRET",
  even though api_runtime_config.fetch in the same coroutine 30s
  earlier had succeeded with the same env read.
- Container code never `pop`s these vars; strong hypothesis is a
  Strands SDK or botocore subprocess shadowing them mid-turn. Not
  diagnosed conclusively — the structural fix (snapshot at entry,
  thread through to writeback) makes the diagnosis moot.
- Solution doc captures: full symptom trail, the obvious-explanation
  rule-out, the structural fix pattern, when this pattern applies to
  other coroutines, and operator triage playbook for prod.
- Reusable for any future long-running agent coroutine that ends
  with a callback POST (memory-retain, MCP tool telemetry, etc.).

Auto-memory entries (out-of-tree, in ~/.claude/projects/.../memory/):
- feedback_completion_callback_snapshot_pattern — the snapshot rule
- feedback_lambda_zip_build_entry_required — handlers.tf + build-lambdas.sh together
- feedback_bootstrap_script_excludes_dev_artifacts — never restore unfiltered S3 sync
- project_async_retry_idempotency_lessons — Event invoke retry=0 + DLQ pattern

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ericodom added a commit that referenced this pull request May 5, 2026
* feat(api): U1 resolveAgentRuntimeConfig helper + /api/agents/runtime-config endpoint

Plan docs/plans/2026-04-24-008-feat-skill-run-dispatcher-plan.md §U1.

Factors the agent-level resolution logic (template + skills + KBs + MCP
configs + guardrail + sandbox template) out of chat-agent-invoke.ts into
a shared helper. Both the chat-invoke Lambda and the new service-auth
REST endpoint call the same resolver, so the run_skill dispatcher and
the chat loop can never drift on "what this agent's runtime looks like."

* packages/api/src/lib/resolve-agent-runtime-config.ts — new helper.
  Returns AgentRuntimeConfig (agent, tenant, template, guardrailConfig,
  guardrailId, skillsConfig with defaults + built-in tools + blocked-
  tools filter, knowledgeBasesConfig, mcpConfigs, sandboxTemplate).
  Exports AgentNotFoundError + AgentTemplateNotFoundError for caller
  mapping to HTTP responses.
* packages/api/src/handlers/agents-runtime-config.ts — new service-auth
  GET /api/agents/runtime-config?tenantId=X&agentId=Y. Bearer
  API_AUTH_SECRET per the `docs/solutions/best-practices/service-
  endpoint-vs-widening-resolvecaller-auth-2026-04-21.md` precedent.
  Optional currentUserId + currentUserEmail query params drive the
  CURRENT_USER_EMAIL overlay on default skills.
* packages/api/src/handlers/chat-agent-invoke.ts — replaces the ~300-
  line inline resolution block with a single `resolveAgentRuntimeConfig`
  call. Per-turn concerns (thread_id, trace_id, message, messages_
  history, currentUserId + currentUserEmail derivation, sandbox
  preflight) stay in the handler. Behavior-preserving refactor.
* terraform/modules/app/lambda-api/handlers.tf — registers the new
  handler and routes GET /api/agents/runtime-config to it.
* Tests: 8 helper-level unit tests (happy path, AgentNotFoundError,
  AgentTemplateNotFoundError, template vs tenant-default guardrail,
  blocked-tools filter, currentUserEmail overlay, built-in tool
  injection, MCP delegation). 13 handler-level tests (method/path/auth/
  UUID validation + helper-exception → 404 mapping + currentUser param
  pass-through).

Next: U2 — extract `_execute_agent_turn` helper in server.py so the
dispatcher can reuse the chat-loop prologue + `_call_strands_agent`.

* refactor(agentcore-strands): U2 extract _execute_agent_turn

Plan docs/plans/2026-04-24-008-feat-skill-run-dispatcher-plan.md §U2.

Pure refactor — no behavior change. Factors the ~170-line chat-loop
prologue out of AgentCoreHandler.do_POST into a free function
`_execute_agent_turn(payload)` so U3's dispatcher can reuse the same
path without duplicating env setup + skill install + workspace bootstrap
+ eval-span attach + message build + system_prompt build + the
_call_strands_agent call.

What stayed in do_POST (chat-specific):
* invocation_env.apply/cleanup (run_skill branch has its own already).
* OpenAI chat.completion response shape + tool_costs/hindsight_usage
  projection.
* _audit_response, auto-retain via api_memory_client, log_agent_invocation.
* Guardrail-exception classification into a blocked-response 200.
* self._respond HTTP writes.

What moved into _execute_agent_turn:
* Payload env unpack (workspace_bucket, thinkwork_api_url/secret,
  hindsight_endpoint, _INSTANCE_ID, _ASSISTANT_ID).
* install_skill_from_s3 loop.
* _ensure_workspace_ready.
* _inject_skill_env.
* attach/detach_eval_context.
* messages list build (history + current).
* Router profile resolution + effective_skills filter.
* _build_system_prompt + external-task / workflow-skill / KB context
  injection.
* _call_strands_agent.
* tool_costs drain snapshot.

Returns {response_text, strands_usage, duration_ms, invocation_tool_costs}.
Raises on _call_strands_agent failure; caller classifies guardrail vs
non-guardrail.

Covered by existing container tests (234/234 passing). Adds no new
tests — the extraction is a lift-and-shift. U3 will exercise the
helper from the dispatcher path with real assertions.

* feat(agentcore-strands): U3 real kind=run_skill dispatcher

Plan docs/plans/2026-04-24-008-feat-skill-run-dispatcher-plan.md §U3.

Replaces the post-§U6 unsupported-runtime rejector shipped in #542 with
a real dispatcher that:

  1. Validates the envelope — null agentId fails fast with a named
     reason (webhook-sourced runs deferred to a follow-up per plan).
  2. Fetches the agent's runtime config from the new
     GET /api/agents/runtime-config endpoint (U1) via a new Python
     client api_runtime_config.py. Bounded retry: 5xx + transient
     transport errors retry 3× with jittered backoff; 4xx terminal.
     404 is a specific AgentConfigNotFoundError subclass.
  3. Builds a chat-invoke-shaped synthetic payload (tenant / agent /
     skills / MCP / guardrail + a synthetic user message instructing
     the agent to run the skill with the given inputs).
  4. Runs a headless Strands agent turn via the shared
     server._execute_agent_turn helper (U2) — the chat loop and the
     dispatcher now share one execution path; no second codebase to
     drift.
  5. POSTs terminal status back to /api/skills/complete with the
     HMAC-signed callback. Successful runs write
     deliveredArtifactRef={type: "inline", payload: <markdown>};
     empty responses fail with "agent produced no final text";
     agent-loop exceptions fail with "agent loop crashed: <msg>".

Also:

* container-sources/_boot_assert.py — adds api_runtime_config to
  EXPECTED_CONTAINER_SOURCES so a Dockerfile COPY regression surfaces
  loudly at boot instead of silently shipping non-functional.
* test_api_runtime_config.py — 9 tests: happy path, query-string
  shape, currentUserId forwarding, missing env, 404 → AgentConfigNot-
  FoundError, 401 terminal, 5xx retry, 5xx-then-200 recovery, socket
  timeout exhaustion, non-JSON body.
* test_server_run_skill.py — rewritten for the new contract. 16 tests
  covering: happy path inline deliverable, null-agentId fast-fail,
  missing-field short-circuit, config-404, config-5xx, agent-loop
  crash, empty-response-text failure, synthetic-payload shape (both
  camelCase + snake_case config keys), HMAC retry semantics carried
  forward.

Full container suite: 249 passed (up from 234 pre-U3).

Dockerfile uses the wildcard COPY container-sources/ /app/ so no
explicit entry needed (per the post-#542 Dockerfile shape).

* feat(api,lambda): U4 flip kind=run_skill Lambda invokes to Event

Plan docs/plans/2026-04-24-008-feat-skill-run-dispatcher-plan.md §U4.

RequestResponse with a 28s socket timeout was fundamentally incompatible
with the agent loop the dispatcher now runs — the AgentCore Lambda has
a 900s ceiling and real skill runs routinely take 10-60s+. Every emitter
flips to InvocationType: Event so enqueue acknowledges immediately while
the agent executes for as long as it needs.

The HMAC-signed /api/skills/complete callback is the authoritative
execution-result signal — it writes skill_runs.status on completion
whether success or failure. Enqueue-level errors (IAM, Lambda missing)
still surface synchronously via AWS returning 4xx/5xx on the Invoke
call, which each emitter re-exposes through its ok/error return shape.

This does NOT cross feedback_avoid_fire_and_forget_lambda_invokes —
that rule governs paths with no callback; we have a durable HMAC-signed
one.

Flipped sites:
* packages/api/src/graphql/utils.ts — invokeSkillRun (GraphQL
  mutation path + webhooks/_shared.ts both call this).
* packages/api/src/handlers/skills.ts — invokeAgentcoreRunSkill
  (POST /api/skills/start service-auth path).
* packages/lambda/job-trigger.ts — scheduled job invoke.

Post-flip error handling checks `res.StatusCode` (Event returns 202 on
success, 4xx/5xx on enqueue failure) rather than `res.FunctionError` +
`res.Payload` (RequestResponse-only). Removed the NodeHttpHandler
socket-timeout override — Event invokes don't hold the connection
open waiting for the Lambda, so the 28s timeout workaround isn't
meaningful.

The chat-agent-invoke path (InvocationType: RequestResponse at
chat-agent-invoke.ts:426) is unaffected — chat turns still need the
synchronous response to land in the thread message stream.

Full TS suite: 1375 passed, 68 passed (no regressions).

* docs(smoke): U5 update smoke scripts for the real dispatcher

Plan docs/plans/2026-04-24-008-feat-skill-run-dispatcher-plan.md §U5.

Replaces the post-#542 "every kind=run_skill envelope terminates with
U6 unsupported-runtime" passing condition with the real contract: the
row reaches `complete` (agent produced the deliverable) or `failed`
with a specific reason (connector missing, agent loop error,
config-fetch failure). Only stuck-running / transport errors fail the
smoke.

* scripts/smoke/CHECKS.md — rewrites the "What passing means today"
  section to describe the U4 dispatcher flow (config fetch → headless
  agent turn → HMAC callback). Refreshes the "Known gaps" list to
  note the remaining webhook null-agentId deferral instead of the
  retired U6 placeholder.
* scripts/smoke/README.md — updates the webhook smoke test
  expectations; complete or specific-reason failure are both PASS.
  Refreshes the "What passes vs fails" bullet list.
* scripts/smoke/catalog-smoke.sh — header docstring + PASS-condition
  comment describe the new terminal-states contract.
* scripts/smoke/chat-smoke.sh — same.

* fix(review): apply ce-code-review autofix feedback

- chat-agent-invoke.ts: delete stale 8-line comment describing
  currentUserEmail propagation that does not happen (M5).
- scheduled.test.ts + _harness/README.md: update docstrings to
  reflect post-U4 Event + HMAC callback semantics, replacing stale
  RequestResponse prose (M6).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(api,lambda): thread agentId through run_skill envelope

ce-code-review P0-A. SkillRunInvokePayload was missing agentId; all
four TS emitters (startSkillRun.mutation, skills.ts
startSkillRunService, webhooks/_shared.ts, job-trigger.ts) omitted
the field; Python dispatcher rejected every non-webhook envelope with
_MISSING_AGENT_REASON. test_server_run_skill.py hand-injected
agentId="agent-1" which masked the prod bug.

The fix:
- Add `agentId: string | null` to SkillRunInvokePayload.
- Thread `i.agentId ?? null` into the GraphQL mutation emitter.
- Thread `agentId ?? null` into the service-REST startSkillRunService.
- Thread `dispatch.agentId ?? null` into the webhook shared handler.
- Thread `targetAgentId ?? null` into job-trigger's scheduled path.
- Add regression tests in skill-runs-resolvers, webhook-shared, and
  job-trigger that assert agentId survives the envelope round-trip.
  The job-trigger test also pins InvocationType=Event (§U4 contract).

Without this, PR #552 ships green CI but is inert for
scheduled/chat/catalog runs in production.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(api): tenant-scope users lookups in resolveAgentRuntimeConfig

ce-code-review P0-B. The helper is reachable via the service-auth
REST endpoint `GET /api/agents/runtime-config?currentUserId=...`,
where `currentUserId` is a caller-controlled query parameter. The
users lookup at L225 only filtered on users.id, so any holder of
API_AUTH_SECRET could enumerate a foreign tenant's user emails by
flipping `tenantId` to their own while passing another tenant's
userId.

All three users lookups (currentUserId email, human_pair email
fallback, human_pair name) now AND on users.tenant_id = opts.tenantId.
The human_pair lookups derive from a tenant-scoped agent row so the
predicate is defense-in-depth there; the currentUserId lookup is the
actual leak.

Added a regression test that upgrades the drizzle mock to capture
where() predicates and asserts both `users.id` and `users.tenant_id`
eq-pairs appear for the currentUserId lookup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(tf): MaximumRetryAttempts=0 + DLQ for agentcore-invoke

ce-code-review P0-C. Plan §U4 flipped kind=run_skill dispatch to
InvocationType=Event so the agent loop has the full 900s Lambda
budget. AWS Lambda async-invoke defaults to MaximumRetryAttempts=2,
which on this path means a single transient failure (5xx on the
/api/skills/complete callback, container OOM, etc.) causes the whole
agent turn to run again — re-burning Bedrock tokens and (in
pathological cases) racing the first deliverable's writeback.

The agent turn is not idempotent. Set MaximumRetryAttempts=0 so
dispatch is one-shot. Durable safety net already exists:
/api/skills/complete atomically CAS-updates on status='running' (any
stray retry gets 409), and the skill-runs-reconciler flips rows stuck
in `running` past the 15-min cutoff to `failed`.

Failed invokes now land in thinkwork-<stage>-agentcore-async-dlq (SQS,
14-day retention, SSE-managed) instead of disappearing. Added IAM
policy on the agentcore role for sqs:SendMessage, plus two outputs
for operator inspection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ericodom added a commit that referenced this pull request May 5, 2026
…sions (#562)

Two related deploy-pipeline fixes that surfaced together when PR #561
merged: terraform-apply on main has been failing since PR #552 because
the agents-runtime-config Lambda zip was never built, and the catalog
S3 sync was uploading dev-only artifacts (node_modules, __tests__,
tests/, characterization/) into the public skills/catalog/ prefix
where the admin UI surfaces them as if they were skills.

(1) scripts/build-lambdas.sh — add agents-runtime-config build entry

PR #552 §U1 added the handler at packages/api/src/handlers/agents-runtime-config.ts
plus the Terraform route in modules/app/lambda-api/handlers.tf, but
missed the build entry. Result: every deploy after PR #552 merged
failed at terraform-apply with:

  Error: Error in function call ... filebase64sha256 ...
  /home/runner/.../dist/lambdas/agents-runtime-config.zip:
  no such file or directory.

This blocked PR #561 from actually deploying skill.yaml removal, and
will block every subsequent merge until fixed. Add the build entry
between `skills` and `plugin-upload` (matching the registration order
in handlers.tf).

(2) scripts/bootstrap-workspace.sh — exclude dev artifacts from S3 sync

The script iterates `packages/skill-catalog/*/` and `aws s3 syncs`
each subdir into `s3://<bucket>/skills/catalog/<slug>/`. The only
exclusion was `scripts/`. Everything else got uploaded:

  - node_modules/ — 5900+ objects from pnpm install
  - __tests__/ — vitest test directories
  - characterization/ — test fixtures
  - tests/ — top-level test artifacts
  - per-skill tests/ dirs — Python pytest dirs (e.g.,
    skills/catalog/agent-email-send/tests/test_send_outbound.py)

The admin Capabilities → Skills file viewer reads `listCatalogFiles`
which does an S3 ListObjects on the slug prefix and renders whatever
it finds — so operators saw `tests/`, `node_modules/`, etc. surfaced
as if they were part of the skill.

Two-layer defense added:

  - NON_SKILL_DIRS array — top-level dirs that aren't skills are
    skipped entirely (not synced to any catalog prefix). Includes
    scripts (preserved), __tests__, characterization, node_modules,
    tests, dist.

  - Per-skill --exclude flags on `aws s3 sync` — drops tests/,
    __tests__/, __pycache__/, node_modules/, dist/, *.pyc, .DS_Store
    from the per-skill upload regardless of whether the skill author
    accidentally committed them.

Also extends RETIRED_SLUGS to include thinkwork-admin (retired in
PR #488), smoke-package-only, and the legacy top-level cruft prefixes
(__tests__, characterization, node_modules, tests) so the next
bootstrap run cleans up existing stale objects in dev/prod S3
even from before this script change.

Verification:
- bash -n on both scripts — syntax clean
- bash scripts/build-lambdas.sh — produces dist/lambdas/agents-runtime-config.zip
  (110K) alongside every other handler

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ericodom added a commit that referenced this pull request May 5, 2026
…563)

Real incident on dev (2026-04-25, runs c886c82e + 6d143ead): the
agent turn ran successfully end-to-end, but `post_skill_run_complete`
saw `os.environ.get("THINKWORK_API_URL")` come back empty and bailed
silently with the ERROR log "cannot post completion — missing
THINKWORK_API_URL / API_AUTH_SECRET". The skill_runs row was stuck at
status=running for ~17 minutes until the reconciler backstopped it
with the generic "stale running row" reason.

The strange part: 30 seconds earlier in the same coroutine,
`api_runtime_config.fetch` (which reads the same env) succeeded, and
the workspace_sync log line `action=composer_fetch sync_ms=2838
files=14` confirmed the env was populated. Something inside the long
agent turn appears to clear or shadow these vars by the time
`post_skill_run_complete` runs. Root cause not fully diagnosed — the
container code itself only reads these vars (no `os.environ.pop` of
THINKWORK_API_URL anywhere), so it's likely a Strands SDK or
botocore subprocess interaction.

Rather than chase the diagnosis, this commit eliminates the failure
mode structurally: capture the env values at dispatcher entry into
local snapshots, and pass them as `api_url` / `api_secret` parameters
to every `post_skill_run_complete` call. The function falls back to
`os.environ` when not provided, preserving backward compat for the
few callers that don't snapshot.

Changes:
- `post_skill_run_complete` accepts new optional params `api_url` +
  `api_secret`. Snapshot wins over env. Sharper ERROR log on the
  unrecoverable case (both snapshot and env empty), naming the
  runId + status so operators can correlate to the stuck skill_runs
  row when the reconciler picks it up.
- `dispatch_run_skill` captures env at top, passes the snapshots
  through to all 7 `post_skill_run_complete` call sites
  (null-agentId, AgentConfigNotFoundError, RuntimeConfigFetchError,
  _execute_agent_turn ImportError, agent loop crash, empty response,
  success). If env IS empty at dispatcher entry — the genuine
  pre-injection race from `project_agentcore_deploy_race_env` — log
  loud at the entry point so operators see the cause without having
  to grep across multiple callback paths.

Two new tests in `PostCompletionTests`:
- `test_snapshot_params_override_empty_env` — env empty + snapshot
  provided → urlopen called with snapshot URL + Bearer header.
- `test_snapshot_params_take_precedence_over_env` — env populated +
  snapshot provided → snapshot wins, env is ignored. Pins the
  precedence rule that's load-bearing for the env-drift fix.

Verification:
- `uv run pytest packages/agentcore-strands/agent-container/test_server_run_skill.py`
  — 18 passed (16 prior + 2 new).

Closes the P1 reliability-05 finding from the ce-code-review autofix
run on PR #552.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ericodom added a commit that referenced this pull request May 5, 2026
Plan hygiene + solution doc capturing today's launch-blocker arc:

(1) docs/plans/2026-04-24-008-feat-skill-run-dispatcher-plan.md
- Flip U1-U5 checkboxes from [ ] to [x]
- Bump status: active → status: shipped
- Add `shipped: 2026-04-25` + `prs:` block listing #552 (U1-U5 + 3 P0
  fixes) and #563 (env-snapshot follow-up after dev incident)

(2) docs/solutions/workflow-issues/agentcore-completion-callback-env-shadowing-2026-04-25.md
- Real incident on dev today: dispatch_run_skill ran the agent turn
  successfully but post_skill_run_complete saw os.environ empty and
  bailed silently with "missing THINKWORK_API_URL / API_AUTH_SECRET",
  even though api_runtime_config.fetch in the same coroutine 30s
  earlier had succeeded with the same env read.
- Container code never `pop`s these vars; strong hypothesis is a
  Strands SDK or botocore subprocess shadowing them mid-turn. Not
  diagnosed conclusively — the structural fix (snapshot at entry,
  thread through to writeback) makes the diagnosis moot.
- Solution doc captures: full symptom trail, the obvious-explanation
  rule-out, the structural fix pattern, when this pattern applies to
  other coroutines, and operator triage playbook for prod.
- Reusable for any future long-running agent coroutine that ends
  with a callback POST (memory-retain, MCP tool telemetry, etc.).

Auto-memory entries (out-of-tree, in ~/.claude/projects/.../memory/):
- feedback_completion_callback_snapshot_pattern — the snapshot rule
- feedback_lambda_zip_build_entry_required — handlers.tf + build-lambdas.sh together
- feedback_bootstrap_script_excludes_dev_artifacts — never restore unfiltered S3 sync
- project_async_retry_idempotency_lessons — Event invoke retry=0 + DLQ pattern

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant