feat: out-of-band kind=run_skill dispatcher (plan §007-008)#552
Merged
Conversation
…config endpoint Plan docs/plans/2026-04-24-008-feat-skill-run-dispatcher-plan.md §U1. Factors the agent-level resolution logic (template + skills + KBs + MCP configs + guardrail + sandbox template) out of chat-agent-invoke.ts into a shared helper. Both the chat-invoke Lambda and the new service-auth REST endpoint call the same resolver, so the run_skill dispatcher and the chat loop can never drift on "what this agent's runtime looks like." * packages/api/src/lib/resolve-agent-runtime-config.ts — new helper. Returns AgentRuntimeConfig (agent, tenant, template, guardrailConfig, guardrailId, skillsConfig with defaults + built-in tools + blocked- tools filter, knowledgeBasesConfig, mcpConfigs, sandboxTemplate). Exports AgentNotFoundError + AgentTemplateNotFoundError for caller mapping to HTTP responses. * packages/api/src/handlers/agents-runtime-config.ts — new service-auth GET /api/agents/runtime-config?tenantId=X&agentId=Y. Bearer API_AUTH_SECRET per the `docs/solutions/best-practices/service- endpoint-vs-widening-resolvecaller-auth-2026-04-21.md` precedent. Optional currentUserId + currentUserEmail query params drive the CURRENT_USER_EMAIL overlay on default skills. * packages/api/src/handlers/chat-agent-invoke.ts — replaces the ~300- line inline resolution block with a single `resolveAgentRuntimeConfig` call. Per-turn concerns (thread_id, trace_id, message, messages_ history, currentUserId + currentUserEmail derivation, sandbox preflight) stay in the handler. Behavior-preserving refactor. * terraform/modules/app/lambda-api/handlers.tf — registers the new handler and routes GET /api/agents/runtime-config to it. * Tests: 8 helper-level unit tests (happy path, AgentNotFoundError, AgentTemplateNotFoundError, template vs tenant-default guardrail, blocked-tools filter, currentUserEmail overlay, built-in tool injection, MCP delegation). 13 handler-level tests (method/path/auth/ UUID validation + helper-exception → 404 mapping + currentUser param pass-through). Next: U2 — extract `_execute_agent_turn` helper in server.py so the dispatcher can reuse the chat-loop prologue + `_call_strands_agent`.
Plan docs/plans/2026-04-24-008-feat-skill-run-dispatcher-plan.md §U2.
Pure refactor — no behavior change. Factors the ~170-line chat-loop
prologue out of AgentCoreHandler.do_POST into a free function
`_execute_agent_turn(payload)` so U3's dispatcher can reuse the same
path without duplicating env setup + skill install + workspace bootstrap
+ eval-span attach + message build + system_prompt build + the
_call_strands_agent call.
What stayed in do_POST (chat-specific):
* invocation_env.apply/cleanup (run_skill branch has its own already).
* OpenAI chat.completion response shape + tool_costs/hindsight_usage
projection.
* _audit_response, auto-retain via api_memory_client, log_agent_invocation.
* Guardrail-exception classification into a blocked-response 200.
* self._respond HTTP writes.
What moved into _execute_agent_turn:
* Payload env unpack (workspace_bucket, thinkwork_api_url/secret,
hindsight_endpoint, _INSTANCE_ID, _ASSISTANT_ID).
* install_skill_from_s3 loop.
* _ensure_workspace_ready.
* _inject_skill_env.
* attach/detach_eval_context.
* messages list build (history + current).
* Router profile resolution + effective_skills filter.
* _build_system_prompt + external-task / workflow-skill / KB context
injection.
* _call_strands_agent.
* tool_costs drain snapshot.
Returns {response_text, strands_usage, duration_ms, invocation_tool_costs}.
Raises on _call_strands_agent failure; caller classifies guardrail vs
non-guardrail.
Covered by existing container tests (234/234 passing). Adds no new
tests — the extraction is a lift-and-shift. U3 will exercise the
helper from the dispatcher path with real assertions.
Plan docs/plans/2026-04-24-008-feat-skill-run-dispatcher-plan.md §U3. Replaces the post-§U6 unsupported-runtime rejector shipped in #542 with a real dispatcher that: 1. Validates the envelope — null agentId fails fast with a named reason (webhook-sourced runs deferred to a follow-up per plan). 2. Fetches the agent's runtime config from the new GET /api/agents/runtime-config endpoint (U1) via a new Python client api_runtime_config.py. Bounded retry: 5xx + transient transport errors retry 3× with jittered backoff; 4xx terminal. 404 is a specific AgentConfigNotFoundError subclass. 3. Builds a chat-invoke-shaped synthetic payload (tenant / agent / skills / MCP / guardrail + a synthetic user message instructing the agent to run the skill with the given inputs). 4. Runs a headless Strands agent turn via the shared server._execute_agent_turn helper (U2) — the chat loop and the dispatcher now share one execution path; no second codebase to drift. 5. POSTs terminal status back to /api/skills/complete with the HMAC-signed callback. Successful runs write deliveredArtifactRef={type: "inline", payload: <markdown>}; empty responses fail with "agent produced no final text"; agent-loop exceptions fail with "agent loop crashed: <msg>". Also: * container-sources/_boot_assert.py — adds api_runtime_config to EXPECTED_CONTAINER_SOURCES so a Dockerfile COPY regression surfaces loudly at boot instead of silently shipping non-functional. * test_api_runtime_config.py — 9 tests: happy path, query-string shape, currentUserId forwarding, missing env, 404 → AgentConfigNot- FoundError, 401 terminal, 5xx retry, 5xx-then-200 recovery, socket timeout exhaustion, non-JSON body. * test_server_run_skill.py — rewritten for the new contract. 16 tests covering: happy path inline deliverable, null-agentId fast-fail, missing-field short-circuit, config-404, config-5xx, agent-loop crash, empty-response-text failure, synthetic-payload shape (both camelCase + snake_case config keys), HMAC retry semantics carried forward. Full container suite: 249 passed (up from 234 pre-U3). Dockerfile uses the wildcard COPY container-sources/ /app/ so no explicit entry needed (per the post-#542 Dockerfile shape).
Plan docs/plans/2026-04-24-008-feat-skill-run-dispatcher-plan.md §U4. RequestResponse with a 28s socket timeout was fundamentally incompatible with the agent loop the dispatcher now runs — the AgentCore Lambda has a 900s ceiling and real skill runs routinely take 10-60s+. Every emitter flips to InvocationType: Event so enqueue acknowledges immediately while the agent executes for as long as it needs. The HMAC-signed /api/skills/complete callback is the authoritative execution-result signal — it writes skill_runs.status on completion whether success or failure. Enqueue-level errors (IAM, Lambda missing) still surface synchronously via AWS returning 4xx/5xx on the Invoke call, which each emitter re-exposes through its ok/error return shape. This does NOT cross feedback_avoid_fire_and_forget_lambda_invokes — that rule governs paths with no callback; we have a durable HMAC-signed one. Flipped sites: * packages/api/src/graphql/utils.ts — invokeSkillRun (GraphQL mutation path + webhooks/_shared.ts both call this). * packages/api/src/handlers/skills.ts — invokeAgentcoreRunSkill (POST /api/skills/start service-auth path). * packages/lambda/job-trigger.ts — scheduled job invoke. Post-flip error handling checks `res.StatusCode` (Event returns 202 on success, 4xx/5xx on enqueue failure) rather than `res.FunctionError` + `res.Payload` (RequestResponse-only). Removed the NodeHttpHandler socket-timeout override — Event invokes don't hold the connection open waiting for the Lambda, so the 28s timeout workaround isn't meaningful. The chat-agent-invoke path (InvocationType: RequestResponse at chat-agent-invoke.ts:426) is unaffected — chat turns still need the synchronous response to land in the thread message stream. Full TS suite: 1375 passed, 68 passed (no regressions).
Plan docs/plans/2026-04-24-008-feat-skill-run-dispatcher-plan.md §U5. Replaces the post-#542 "every kind=run_skill envelope terminates with U6 unsupported-runtime" passing condition with the real contract: the row reaches `complete` (agent produced the deliverable) or `failed` with a specific reason (connector missing, agent loop error, config-fetch failure). Only stuck-running / transport errors fail the smoke. * scripts/smoke/CHECKS.md — rewrites the "What passing means today" section to describe the U4 dispatcher flow (config fetch → headless agent turn → HMAC callback). Refreshes the "Known gaps" list to note the remaining webhook null-agentId deferral instead of the retired U6 placeholder. * scripts/smoke/README.md — updates the webhook smoke test expectations; complete or specific-reason failure are both PASS. Refreshes the "What passes vs fails" bullet list. * scripts/smoke/catalog-smoke.sh — header docstring + PASS-condition comment describe the new terminal-states contract. * scripts/smoke/chat-smoke.sh — same.
- chat-agent-invoke.ts: delete stale 8-line comment describing currentUserEmail propagation that does not happen (M5). - scheduled.test.ts + _harness/README.md: update docstrings to reflect post-U4 Event + HMAC callback semantics, replacing stale RequestResponse prose (M6). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ce-code-review P0-A. SkillRunInvokePayload was missing agentId; all four TS emitters (startSkillRun.mutation, skills.ts startSkillRunService, webhooks/_shared.ts, job-trigger.ts) omitted the field; Python dispatcher rejected every non-webhook envelope with _MISSING_AGENT_REASON. test_server_run_skill.py hand-injected agentId="agent-1" which masked the prod bug. The fix: - Add `agentId: string | null` to SkillRunInvokePayload. - Thread `i.agentId ?? null` into the GraphQL mutation emitter. - Thread `agentId ?? null` into the service-REST startSkillRunService. - Thread `dispatch.agentId ?? null` into the webhook shared handler. - Thread `targetAgentId ?? null` into job-trigger's scheduled path. - Add regression tests in skill-runs-resolvers, webhook-shared, and job-trigger that assert agentId survives the envelope round-trip. The job-trigger test also pins InvocationType=Event (§U4 contract). Without this, PR #552 ships green CI but is inert for scheduled/chat/catalog runs in production. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ce-code-review P0-B. The helper is reachable via the service-auth REST endpoint `GET /api/agents/runtime-config?currentUserId=...`, where `currentUserId` is a caller-controlled query parameter. The users lookup at L225 only filtered on users.id, so any holder of API_AUTH_SECRET could enumerate a foreign tenant's user emails by flipping `tenantId` to their own while passing another tenant's userId. All three users lookups (currentUserId email, human_pair email fallback, human_pair name) now AND on users.tenant_id = opts.tenantId. The human_pair lookups derive from a tenant-scoped agent row so the predicate is defense-in-depth there; the currentUserId lookup is the actual leak. Added a regression test that upgrades the drizzle mock to capture where() predicates and asserts both `users.id` and `users.tenant_id` eq-pairs appear for the currentUserId lookup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ce-code-review P0-C. Plan §U4 flipped kind=run_skill dispatch to InvocationType=Event so the agent loop has the full 900s Lambda budget. AWS Lambda async-invoke defaults to MaximumRetryAttempts=2, which on this path means a single transient failure (5xx on the /api/skills/complete callback, container OOM, etc.) causes the whole agent turn to run again — re-burning Bedrock tokens and (in pathological cases) racing the first deliverable's writeback. The agent turn is not idempotent. Set MaximumRetryAttempts=0 so dispatch is one-shot. Durable safety net already exists: /api/skills/complete atomically CAS-updates on status='running' (any stray retry gets 409), and the skill-runs-reconciler flips rows stuck in `running` past the 15-min cutoff to `failed`. Failed invokes now land in thinkwork-<stage>-agentcore-async-dlq (SQS, 14-day retention, SSE-managed) instead of disappearing. Added IAM policy on the agentcore role for sqs:SendMessage, plus two outputs for operator inspection. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5 tasks
ericodom
added a commit
that referenced
this pull request
Apr 25, 2026
…sions (#562) Two related deploy-pipeline fixes that surfaced together when PR #561 merged: terraform-apply on main has been failing since PR #552 because the agents-runtime-config Lambda zip was never built, and the catalog S3 sync was uploading dev-only artifacts (node_modules, __tests__, tests/, characterization/) into the public skills/catalog/ prefix where the admin UI surfaces them as if they were skills. (1) scripts/build-lambdas.sh — add agents-runtime-config build entry PR #552 §U1 added the handler at packages/api/src/handlers/agents-runtime-config.ts plus the Terraform route in modules/app/lambda-api/handlers.tf, but missed the build entry. Result: every deploy after PR #552 merged failed at terraform-apply with: Error: Error in function call ... filebase64sha256 ... /home/runner/.../dist/lambdas/agents-runtime-config.zip: no such file or directory. This blocked PR #561 from actually deploying skill.yaml removal, and will block every subsequent merge until fixed. Add the build entry between `skills` and `plugin-upload` (matching the registration order in handlers.tf). (2) scripts/bootstrap-workspace.sh — exclude dev artifacts from S3 sync The script iterates `packages/skill-catalog/*/` and `aws s3 syncs` each subdir into `s3://<bucket>/skills/catalog/<slug>/`. The only exclusion was `scripts/`. Everything else got uploaded: - node_modules/ — 5900+ objects from pnpm install - __tests__/ — vitest test directories - characterization/ — test fixtures - tests/ — top-level test artifacts - per-skill tests/ dirs — Python pytest dirs (e.g., skills/catalog/agent-email-send/tests/test_send_outbound.py) The admin Capabilities → Skills file viewer reads `listCatalogFiles` which does an S3 ListObjects on the slug prefix and renders whatever it finds — so operators saw `tests/`, `node_modules/`, etc. surfaced as if they were part of the skill. Two-layer defense added: - NON_SKILL_DIRS array — top-level dirs that aren't skills are skipped entirely (not synced to any catalog prefix). Includes scripts (preserved), __tests__, characterization, node_modules, tests, dist. - Per-skill --exclude flags on `aws s3 sync` — drops tests/, __tests__/, __pycache__/, node_modules/, dist/, *.pyc, .DS_Store from the per-skill upload regardless of whether the skill author accidentally committed them. Also extends RETIRED_SLUGS to include thinkwork-admin (retired in PR #488), smoke-package-only, and the legacy top-level cruft prefixes (__tests__, characterization, node_modules, tests) so the next bootstrap run cleans up existing stale objects in dev/prod S3 even from before this script change. Verification: - bash -n on both scripts — syntax clean - bash scripts/build-lambdas.sh — produces dist/lambdas/agents-runtime-config.zip (110K) alongside every other handler Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3 tasks
ericodom
added a commit
that referenced
this pull request
Apr 25, 2026
…563) Real incident on dev (2026-04-25, runs c886c82e + 6d143ead): the agent turn ran successfully end-to-end, but `post_skill_run_complete` saw `os.environ.get("THINKWORK_API_URL")` come back empty and bailed silently with the ERROR log "cannot post completion — missing THINKWORK_API_URL / API_AUTH_SECRET". The skill_runs row was stuck at status=running for ~17 minutes until the reconciler backstopped it with the generic "stale running row" reason. The strange part: 30 seconds earlier in the same coroutine, `api_runtime_config.fetch` (which reads the same env) succeeded, and the workspace_sync log line `action=composer_fetch sync_ms=2838 files=14` confirmed the env was populated. Something inside the long agent turn appears to clear or shadow these vars by the time `post_skill_run_complete` runs. Root cause not fully diagnosed — the container code itself only reads these vars (no `os.environ.pop` of THINKWORK_API_URL anywhere), so it's likely a Strands SDK or botocore subprocess interaction. Rather than chase the diagnosis, this commit eliminates the failure mode structurally: capture the env values at dispatcher entry into local snapshots, and pass them as `api_url` / `api_secret` parameters to every `post_skill_run_complete` call. The function falls back to `os.environ` when not provided, preserving backward compat for the few callers that don't snapshot. Changes: - `post_skill_run_complete` accepts new optional params `api_url` + `api_secret`. Snapshot wins over env. Sharper ERROR log on the unrecoverable case (both snapshot and env empty), naming the runId + status so operators can correlate to the stuck skill_runs row when the reconciler picks it up. - `dispatch_run_skill` captures env at top, passes the snapshots through to all 7 `post_skill_run_complete` call sites (null-agentId, AgentConfigNotFoundError, RuntimeConfigFetchError, _execute_agent_turn ImportError, agent loop crash, empty response, success). If env IS empty at dispatcher entry — the genuine pre-injection race from `project_agentcore_deploy_race_env` — log loud at the entry point so operators see the cause without having to grep across multiple callback paths. Two new tests in `PostCompletionTests`: - `test_snapshot_params_override_empty_env` — env empty + snapshot provided → urlopen called with snapshot URL + Bearer header. - `test_snapshot_params_take_precedence_over_env` — env populated + snapshot provided → snapshot wins, env is ignored. Pins the precedence rule that's load-bearing for the env-drift fix. Verification: - `uv run pytest packages/agentcore-strands/agent-container/test_server_run_skill.py` — 18 passed (16 prior + 2 new). Closes the P1 reliability-05 finding from the ce-code-review autofix run on PR #552. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3 tasks
ericodom
added a commit
that referenced
this pull request
Apr 25, 2026
Plan hygiene + solution doc capturing today's launch-blocker arc: (1) docs/plans/2026-04-24-008-feat-skill-run-dispatcher-plan.md - Flip U1-U5 checkboxes from [ ] to [x] - Bump status: active → status: shipped - Add `shipped: 2026-04-25` + `prs:` block listing #552 (U1-U5 + 3 P0 fixes) and #563 (env-snapshot follow-up after dev incident) (2) docs/solutions/workflow-issues/agentcore-completion-callback-env-shadowing-2026-04-25.md - Real incident on dev today: dispatch_run_skill ran the agent turn successfully but post_skill_run_complete saw os.environ empty and bailed silently with "missing THINKWORK_API_URL / API_AUTH_SECRET", even though api_runtime_config.fetch in the same coroutine 30s earlier had succeeded with the same env read. - Container code never `pop`s these vars; strong hypothesis is a Strands SDK or botocore subprocess shadowing them mid-turn. Not diagnosed conclusively — the structural fix (snapshot at entry, thread through to writeback) makes the diagnosis moot. - Solution doc captures: full symptom trail, the obvious-explanation rule-out, the structural fix pattern, when this pattern applies to other coroutines, and operator triage playbook for prod. - Reusable for any future long-running agent coroutine that ends with a callback POST (memory-retain, MCP tool telemetry, etc.). Auto-memory entries (out-of-tree, in ~/.claude/projects/.../memory/): - feedback_completion_callback_snapshot_pattern — the snapshot rule - feedback_lambda_zip_build_entry_required — handlers.tf + build-lambdas.sh together - feedback_bootstrap_script_excludes_dev_artifacts — never restore unfiltered S3 sync - project_async_retry_idempotency_lessons — Event invoke retry=0 + DLQ pattern Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ericodom
added a commit
that referenced
this pull request
May 5, 2026
* feat(api): U1 resolveAgentRuntimeConfig helper + /api/agents/runtime-config endpoint
Plan docs/plans/2026-04-24-008-feat-skill-run-dispatcher-plan.md §U1.
Factors the agent-level resolution logic (template + skills + KBs + MCP
configs + guardrail + sandbox template) out of chat-agent-invoke.ts into
a shared helper. Both the chat-invoke Lambda and the new service-auth
REST endpoint call the same resolver, so the run_skill dispatcher and
the chat loop can never drift on "what this agent's runtime looks like."
* packages/api/src/lib/resolve-agent-runtime-config.ts — new helper.
Returns AgentRuntimeConfig (agent, tenant, template, guardrailConfig,
guardrailId, skillsConfig with defaults + built-in tools + blocked-
tools filter, knowledgeBasesConfig, mcpConfigs, sandboxTemplate).
Exports AgentNotFoundError + AgentTemplateNotFoundError for caller
mapping to HTTP responses.
* packages/api/src/handlers/agents-runtime-config.ts — new service-auth
GET /api/agents/runtime-config?tenantId=X&agentId=Y. Bearer
API_AUTH_SECRET per the `docs/solutions/best-practices/service-
endpoint-vs-widening-resolvecaller-auth-2026-04-21.md` precedent.
Optional currentUserId + currentUserEmail query params drive the
CURRENT_USER_EMAIL overlay on default skills.
* packages/api/src/handlers/chat-agent-invoke.ts — replaces the ~300-
line inline resolution block with a single `resolveAgentRuntimeConfig`
call. Per-turn concerns (thread_id, trace_id, message, messages_
history, currentUserId + currentUserEmail derivation, sandbox
preflight) stay in the handler. Behavior-preserving refactor.
* terraform/modules/app/lambda-api/handlers.tf — registers the new
handler and routes GET /api/agents/runtime-config to it.
* Tests: 8 helper-level unit tests (happy path, AgentNotFoundError,
AgentTemplateNotFoundError, template vs tenant-default guardrail,
blocked-tools filter, currentUserEmail overlay, built-in tool
injection, MCP delegation). 13 handler-level tests (method/path/auth/
UUID validation + helper-exception → 404 mapping + currentUser param
pass-through).
Next: U2 — extract `_execute_agent_turn` helper in server.py so the
dispatcher can reuse the chat-loop prologue + `_call_strands_agent`.
* refactor(agentcore-strands): U2 extract _execute_agent_turn
Plan docs/plans/2026-04-24-008-feat-skill-run-dispatcher-plan.md §U2.
Pure refactor — no behavior change. Factors the ~170-line chat-loop
prologue out of AgentCoreHandler.do_POST into a free function
`_execute_agent_turn(payload)` so U3's dispatcher can reuse the same
path without duplicating env setup + skill install + workspace bootstrap
+ eval-span attach + message build + system_prompt build + the
_call_strands_agent call.
What stayed in do_POST (chat-specific):
* invocation_env.apply/cleanup (run_skill branch has its own already).
* OpenAI chat.completion response shape + tool_costs/hindsight_usage
projection.
* _audit_response, auto-retain via api_memory_client, log_agent_invocation.
* Guardrail-exception classification into a blocked-response 200.
* self._respond HTTP writes.
What moved into _execute_agent_turn:
* Payload env unpack (workspace_bucket, thinkwork_api_url/secret,
hindsight_endpoint, _INSTANCE_ID, _ASSISTANT_ID).
* install_skill_from_s3 loop.
* _ensure_workspace_ready.
* _inject_skill_env.
* attach/detach_eval_context.
* messages list build (history + current).
* Router profile resolution + effective_skills filter.
* _build_system_prompt + external-task / workflow-skill / KB context
injection.
* _call_strands_agent.
* tool_costs drain snapshot.
Returns {response_text, strands_usage, duration_ms, invocation_tool_costs}.
Raises on _call_strands_agent failure; caller classifies guardrail vs
non-guardrail.
Covered by existing container tests (234/234 passing). Adds no new
tests — the extraction is a lift-and-shift. U3 will exercise the
helper from the dispatcher path with real assertions.
* feat(agentcore-strands): U3 real kind=run_skill dispatcher
Plan docs/plans/2026-04-24-008-feat-skill-run-dispatcher-plan.md §U3.
Replaces the post-§U6 unsupported-runtime rejector shipped in #542 with
a real dispatcher that:
1. Validates the envelope — null agentId fails fast with a named
reason (webhook-sourced runs deferred to a follow-up per plan).
2. Fetches the agent's runtime config from the new
GET /api/agents/runtime-config endpoint (U1) via a new Python
client api_runtime_config.py. Bounded retry: 5xx + transient
transport errors retry 3× with jittered backoff; 4xx terminal.
404 is a specific AgentConfigNotFoundError subclass.
3. Builds a chat-invoke-shaped synthetic payload (tenant / agent /
skills / MCP / guardrail + a synthetic user message instructing
the agent to run the skill with the given inputs).
4. Runs a headless Strands agent turn via the shared
server._execute_agent_turn helper (U2) — the chat loop and the
dispatcher now share one execution path; no second codebase to
drift.
5. POSTs terminal status back to /api/skills/complete with the
HMAC-signed callback. Successful runs write
deliveredArtifactRef={type: "inline", payload: <markdown>};
empty responses fail with "agent produced no final text";
agent-loop exceptions fail with "agent loop crashed: <msg>".
Also:
* container-sources/_boot_assert.py — adds api_runtime_config to
EXPECTED_CONTAINER_SOURCES so a Dockerfile COPY regression surfaces
loudly at boot instead of silently shipping non-functional.
* test_api_runtime_config.py — 9 tests: happy path, query-string
shape, currentUserId forwarding, missing env, 404 → AgentConfigNot-
FoundError, 401 terminal, 5xx retry, 5xx-then-200 recovery, socket
timeout exhaustion, non-JSON body.
* test_server_run_skill.py — rewritten for the new contract. 16 tests
covering: happy path inline deliverable, null-agentId fast-fail,
missing-field short-circuit, config-404, config-5xx, agent-loop
crash, empty-response-text failure, synthetic-payload shape (both
camelCase + snake_case config keys), HMAC retry semantics carried
forward.
Full container suite: 249 passed (up from 234 pre-U3).
Dockerfile uses the wildcard COPY container-sources/ /app/ so no
explicit entry needed (per the post-#542 Dockerfile shape).
* feat(api,lambda): U4 flip kind=run_skill Lambda invokes to Event
Plan docs/plans/2026-04-24-008-feat-skill-run-dispatcher-plan.md §U4.
RequestResponse with a 28s socket timeout was fundamentally incompatible
with the agent loop the dispatcher now runs — the AgentCore Lambda has
a 900s ceiling and real skill runs routinely take 10-60s+. Every emitter
flips to InvocationType: Event so enqueue acknowledges immediately while
the agent executes for as long as it needs.
The HMAC-signed /api/skills/complete callback is the authoritative
execution-result signal — it writes skill_runs.status on completion
whether success or failure. Enqueue-level errors (IAM, Lambda missing)
still surface synchronously via AWS returning 4xx/5xx on the Invoke
call, which each emitter re-exposes through its ok/error return shape.
This does NOT cross feedback_avoid_fire_and_forget_lambda_invokes —
that rule governs paths with no callback; we have a durable HMAC-signed
one.
Flipped sites:
* packages/api/src/graphql/utils.ts — invokeSkillRun (GraphQL
mutation path + webhooks/_shared.ts both call this).
* packages/api/src/handlers/skills.ts — invokeAgentcoreRunSkill
(POST /api/skills/start service-auth path).
* packages/lambda/job-trigger.ts — scheduled job invoke.
Post-flip error handling checks `res.StatusCode` (Event returns 202 on
success, 4xx/5xx on enqueue failure) rather than `res.FunctionError` +
`res.Payload` (RequestResponse-only). Removed the NodeHttpHandler
socket-timeout override — Event invokes don't hold the connection
open waiting for the Lambda, so the 28s timeout workaround isn't
meaningful.
The chat-agent-invoke path (InvocationType: RequestResponse at
chat-agent-invoke.ts:426) is unaffected — chat turns still need the
synchronous response to land in the thread message stream.
Full TS suite: 1375 passed, 68 passed (no regressions).
* docs(smoke): U5 update smoke scripts for the real dispatcher
Plan docs/plans/2026-04-24-008-feat-skill-run-dispatcher-plan.md §U5.
Replaces the post-#542 "every kind=run_skill envelope terminates with
U6 unsupported-runtime" passing condition with the real contract: the
row reaches `complete` (agent produced the deliverable) or `failed`
with a specific reason (connector missing, agent loop error,
config-fetch failure). Only stuck-running / transport errors fail the
smoke.
* scripts/smoke/CHECKS.md — rewrites the "What passing means today"
section to describe the U4 dispatcher flow (config fetch → headless
agent turn → HMAC callback). Refreshes the "Known gaps" list to
note the remaining webhook null-agentId deferral instead of the
retired U6 placeholder.
* scripts/smoke/README.md — updates the webhook smoke test
expectations; complete or specific-reason failure are both PASS.
Refreshes the "What passes vs fails" bullet list.
* scripts/smoke/catalog-smoke.sh — header docstring + PASS-condition
comment describe the new terminal-states contract.
* scripts/smoke/chat-smoke.sh — same.
* fix(review): apply ce-code-review autofix feedback
- chat-agent-invoke.ts: delete stale 8-line comment describing
currentUserEmail propagation that does not happen (M5).
- scheduled.test.ts + _harness/README.md: update docstrings to
reflect post-U4 Event + HMAC callback semantics, replacing stale
RequestResponse prose (M6).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(api,lambda): thread agentId through run_skill envelope
ce-code-review P0-A. SkillRunInvokePayload was missing agentId; all
four TS emitters (startSkillRun.mutation, skills.ts
startSkillRunService, webhooks/_shared.ts, job-trigger.ts) omitted
the field; Python dispatcher rejected every non-webhook envelope with
_MISSING_AGENT_REASON. test_server_run_skill.py hand-injected
agentId="agent-1" which masked the prod bug.
The fix:
- Add `agentId: string | null` to SkillRunInvokePayload.
- Thread `i.agentId ?? null` into the GraphQL mutation emitter.
- Thread `agentId ?? null` into the service-REST startSkillRunService.
- Thread `dispatch.agentId ?? null` into the webhook shared handler.
- Thread `targetAgentId ?? null` into job-trigger's scheduled path.
- Add regression tests in skill-runs-resolvers, webhook-shared, and
job-trigger that assert agentId survives the envelope round-trip.
The job-trigger test also pins InvocationType=Event (§U4 contract).
Without this, PR #552 ships green CI but is inert for
scheduled/chat/catalog runs in production.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(api): tenant-scope users lookups in resolveAgentRuntimeConfig
ce-code-review P0-B. The helper is reachable via the service-auth
REST endpoint `GET /api/agents/runtime-config?currentUserId=...`,
where `currentUserId` is a caller-controlled query parameter. The
users lookup at L225 only filtered on users.id, so any holder of
API_AUTH_SECRET could enumerate a foreign tenant's user emails by
flipping `tenantId` to their own while passing another tenant's
userId.
All three users lookups (currentUserId email, human_pair email
fallback, human_pair name) now AND on users.tenant_id = opts.tenantId.
The human_pair lookups derive from a tenant-scoped agent row so the
predicate is defense-in-depth there; the currentUserId lookup is the
actual leak.
Added a regression test that upgrades the drizzle mock to capture
where() predicates and asserts both `users.id` and `users.tenant_id`
eq-pairs appear for the currentUserId lookup.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(tf): MaximumRetryAttempts=0 + DLQ for agentcore-invoke
ce-code-review P0-C. Plan §U4 flipped kind=run_skill dispatch to
InvocationType=Event so the agent loop has the full 900s Lambda
budget. AWS Lambda async-invoke defaults to MaximumRetryAttempts=2,
which on this path means a single transient failure (5xx on the
/api/skills/complete callback, container OOM, etc.) causes the whole
agent turn to run again — re-burning Bedrock tokens and (in
pathological cases) racing the first deliverable's writeback.
The agent turn is not idempotent. Set MaximumRetryAttempts=0 so
dispatch is one-shot. Durable safety net already exists:
/api/skills/complete atomically CAS-updates on status='running' (any
stray retry gets 409), and the skill-runs-reconciler flips rows stuck
in `running` past the 15-min cutoff to `failed`.
Failed invokes now land in thinkwork-<stage>-agentcore-async-dlq (SQS,
14-day retention, SSE-managed) instead of disappearing. Added IAM
policy on the agentcore role for sqs:SendMessage, plus two outputs
for operator inspection.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ericodom
added a commit
that referenced
this pull request
May 5, 2026
…sions (#562) Two related deploy-pipeline fixes that surfaced together when PR #561 merged: terraform-apply on main has been failing since PR #552 because the agents-runtime-config Lambda zip was never built, and the catalog S3 sync was uploading dev-only artifacts (node_modules, __tests__, tests/, characterization/) into the public skills/catalog/ prefix where the admin UI surfaces them as if they were skills. (1) scripts/build-lambdas.sh — add agents-runtime-config build entry PR #552 §U1 added the handler at packages/api/src/handlers/agents-runtime-config.ts plus the Terraform route in modules/app/lambda-api/handlers.tf, but missed the build entry. Result: every deploy after PR #552 merged failed at terraform-apply with: Error: Error in function call ... filebase64sha256 ... /home/runner/.../dist/lambdas/agents-runtime-config.zip: no such file or directory. This blocked PR #561 from actually deploying skill.yaml removal, and will block every subsequent merge until fixed. Add the build entry between `skills` and `plugin-upload` (matching the registration order in handlers.tf). (2) scripts/bootstrap-workspace.sh — exclude dev artifacts from S3 sync The script iterates `packages/skill-catalog/*/` and `aws s3 syncs` each subdir into `s3://<bucket>/skills/catalog/<slug>/`. The only exclusion was `scripts/`. Everything else got uploaded: - node_modules/ — 5900+ objects from pnpm install - __tests__/ — vitest test directories - characterization/ — test fixtures - tests/ — top-level test artifacts - per-skill tests/ dirs — Python pytest dirs (e.g., skills/catalog/agent-email-send/tests/test_send_outbound.py) The admin Capabilities → Skills file viewer reads `listCatalogFiles` which does an S3 ListObjects on the slug prefix and renders whatever it finds — so operators saw `tests/`, `node_modules/`, etc. surfaced as if they were part of the skill. Two-layer defense added: - NON_SKILL_DIRS array — top-level dirs that aren't skills are skipped entirely (not synced to any catalog prefix). Includes scripts (preserved), __tests__, characterization, node_modules, tests, dist. - Per-skill --exclude flags on `aws s3 sync` — drops tests/, __tests__/, __pycache__/, node_modules/, dist/, *.pyc, .DS_Store from the per-skill upload regardless of whether the skill author accidentally committed them. Also extends RETIRED_SLUGS to include thinkwork-admin (retired in PR #488), smoke-package-only, and the legacy top-level cruft prefixes (__tests__, characterization, node_modules, tests) so the next bootstrap run cleans up existing stale objects in dev/prod S3 even from before this script change. Verification: - bash -n on both scripts — syntax clean - bash scripts/build-lambdas.sh — produces dist/lambdas/agents-runtime-config.zip (110K) alongside every other handler Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ericodom
added a commit
that referenced
this pull request
May 5, 2026
…563) Real incident on dev (2026-04-25, runs c886c82e + 6d143ead): the agent turn ran successfully end-to-end, but `post_skill_run_complete` saw `os.environ.get("THINKWORK_API_URL")` come back empty and bailed silently with the ERROR log "cannot post completion — missing THINKWORK_API_URL / API_AUTH_SECRET". The skill_runs row was stuck at status=running for ~17 minutes until the reconciler backstopped it with the generic "stale running row" reason. The strange part: 30 seconds earlier in the same coroutine, `api_runtime_config.fetch` (which reads the same env) succeeded, and the workspace_sync log line `action=composer_fetch sync_ms=2838 files=14` confirmed the env was populated. Something inside the long agent turn appears to clear or shadow these vars by the time `post_skill_run_complete` runs. Root cause not fully diagnosed — the container code itself only reads these vars (no `os.environ.pop` of THINKWORK_API_URL anywhere), so it's likely a Strands SDK or botocore subprocess interaction. Rather than chase the diagnosis, this commit eliminates the failure mode structurally: capture the env values at dispatcher entry into local snapshots, and pass them as `api_url` / `api_secret` parameters to every `post_skill_run_complete` call. The function falls back to `os.environ` when not provided, preserving backward compat for the few callers that don't snapshot. Changes: - `post_skill_run_complete` accepts new optional params `api_url` + `api_secret`. Snapshot wins over env. Sharper ERROR log on the unrecoverable case (both snapshot and env empty), naming the runId + status so operators can correlate to the stuck skill_runs row when the reconciler picks it up. - `dispatch_run_skill` captures env at top, passes the snapshots through to all 7 `post_skill_run_complete` call sites (null-agentId, AgentConfigNotFoundError, RuntimeConfigFetchError, _execute_agent_turn ImportError, agent loop crash, empty response, success). If env IS empty at dispatcher entry — the genuine pre-injection race from `project_agentcore_deploy_race_env` — log loud at the entry point so operators see the cause without having to grep across multiple callback paths. Two new tests in `PostCompletionTests`: - `test_snapshot_params_override_empty_env` — env empty + snapshot provided → urlopen called with snapshot URL + Bearer header. - `test_snapshot_params_take_precedence_over_env` — env populated + snapshot provided → snapshot wins, env is ignored. Pins the precedence rule that's load-bearing for the env-drift fix. Verification: - `uv run pytest packages/agentcore-strands/agent-container/test_server_run_skill.py` — 18 passed (16 prior + 2 new). Closes the P1 reliability-05 finding from the ce-code-review autofix run on PR #552. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ericodom
added a commit
that referenced
this pull request
May 5, 2026
Plan hygiene + solution doc capturing today's launch-blocker arc: (1) docs/plans/2026-04-24-008-feat-skill-run-dispatcher-plan.md - Flip U1-U5 checkboxes from [ ] to [x] - Bump status: active → status: shipped - Add `shipped: 2026-04-25` + `prs:` block listing #552 (U1-U5 + 3 P0 fixes) and #563 (env-snapshot follow-up after dev incident) (2) docs/solutions/workflow-issues/agentcore-completion-callback-env-shadowing-2026-04-25.md - Real incident on dev today: dispatch_run_skill ran the agent turn successfully but post_skill_run_complete saw os.environ empty and bailed silently with "missing THINKWORK_API_URL / API_AUTH_SECRET", even though api_runtime_config.fetch in the same coroutine 30s earlier had succeeded with the same env read. - Container code never `pop`s these vars; strong hypothesis is a Strands SDK or botocore subprocess shadowing them mid-turn. Not diagnosed conclusively — the structural fix (snapshot at entry, thread through to writeback) makes the diagnosis moot. - Solution doc captures: full symptom trail, the obvious-explanation rule-out, the structural fix pattern, when this pattern applies to other coroutines, and operator triage playbook for prod. - Reusable for any future long-running agent coroutine that ends with a callback POST (memory-retain, MCP tool telemetry, etc.). Auto-memory entries (out-of-tree, in ~/.claude/projects/.../memory/): - feedback_completion_callback_snapshot_pattern — the snapshot rule - feedback_lambda_zip_build_entry_required — handlers.tf + build-lambdas.sh together - feedback_bootstrap_script_excludes_dev_artifacts — never restore unfiltered S3 sync - project_async_retry_idempotency_lessons — Event invoke retry=0 + DLQ pattern Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Launch-blocker #1 from today's session. Replaces the post-§U6 placeholder rejector with a real dispatcher: scheduled jobs, admin "Run now", chat-intent-dispatched skill runs, and webhook-triggered runs (with non-null agentId) now actually execute the target skill via the same Strands agent loop the chat turn uses.
Plan:
docs/plans/2026-04-24-008-feat-skill-run-dispatcher-plan.mdWhat ships
Five focused units, dependency-ordered, each a separate commit:
U1 — Shared runtime-config helper + new service-auth REST endpoint
packages/api/src/lib/resolve-agent-runtime-config.tsfactors the ~300-line agent/template/skills/MCP/KB/guardrail resolution out ofchat-agent-invoke.ts. Helper is the single source of truth; both chat-invoke and the new endpoint call it. 9 unit tests.packages/api/src/handlers/agents-runtime-config.tsexposesGET /api/agents/runtime-config?tenantId=X&agentId=Y(BearerAPI_AUTH_SECRET). 13 handler tests.chat-agent-invoke.tsrefactored to call the helper; zero behavior change. Per-turn fields (thread context, sandbox preflight, user resolution) stay inline.lambda-api/handlers.tf.U2 — Extract
_execute_agent_turnhelper inserver.py_call_strands_agent, tool-cost drain. Pure lift-and-shift refactor.U3 — Python
api_runtime_configclient + real dispatchercontainer-sources/api_runtime_config.py: urllib GET with 3-attempt retry (5xx + transient retry, 4xx terminal, 404 →AgentConfigNotFoundError). 9 unit tests.run_skill_dispatch.pyrewritten: null-agentId fails fast; otherwise fetch config → build synthetic chat turn ("Run thesales-prepskill with these inputs: ...") → call_execute_agent_turn→ POST/api/skills/completewith{type: inline, payload}on complete, reason on failure. 16 new tests cover happy path + all failure modes.COPY container-sources/ /app/so new module picked up automatically;_boot_assert.py::EXPECTED_CONTAINER_SOURCESupdated per the institutional learning.U4 — Flip TS emitters to
InvocationType: "Event"packages/api/src/graphql/utils.ts::invokeSkillRun,packages/api/src/handlers/skills.ts::invokeAgentcoreRunSkill,packages/lambda/job-trigger.tsall flipRequestResponse→Event. Webhook path inherits viainvokeSkillRun./api/skills/completecallback is the authoritative execution-result signal.feedback_avoid_fire_and_forget_lambda_invokes— that rule governs paths without callbacks; we have a durable HMAC-signed one.RequestResponse(needs the synchronous response for the thread message stream).U5 — Smoke scripts + CHECKS.md refresh
scripts/smoke/{chat,catalog}-smoke.sh+CHECKS.md+README.mdflip passing language off "every envelope terminates with U6 unsupported-runtime" onto "row reachescompleteorfailedwith a specific reason; only stuck-running / transport errors FAIL."ce-code-review fixes (autofix + P0s, post-U5)
Four additional commits address findings from ce-code-review's multi-agent pass against U1–U5:
ac59a32) — removed stale comment inchat-agent-invoke.tsdescribing nonexistentcurrentUserEmailpropagation; updatedscheduled.test.ts+ harness README to reflect post-§U4 Event + HMAC callback semantics.c61ff58) — agentId threading. TheSkillRunInvokePayloadtype was missingagentId; all four TS emitters stripped the field; the Python dispatcher rejected every non-webhook envelope with_MISSING_AGENT_REASON. AddedagentId: string | nullto the type, threaded throughstartSkillRun.mutation.ts,skills.ts::startSkillRunService,webhooks/_shared.ts, andjob-trigger.ts. 3 regression tests (including anInvocationType=Eventpin).7351cbe) — cross-tenant user-email leak.resolveAgentRuntimeConfig'suserslookup scoped only onusers.id; any holder ofAPI_AUTH_SECRETcould enumerate cross-tenant emails via thecurrentUserIdquery param. All threeuserslookups (currentUserId, human_pair email, human_pair name) now AND onusers.tenant_id. Added a regression test that captures the actualwhere()predicates.5f03f00) — async-invoke retry non-idempotency. AWS Lambda defaultMaximumRetryAttempts=2on the §U4 Event invoke would re-run the agent turn (re-burning Bedrock tokens) on any transient failure. SetMaximumRetryAttempts=0onagentcore-invokeviaaws_lambda_function_event_invoke_config; added athinkwork-<stage>-agentcore-async-dlqSQS queue (14-day retention, SSE-managed) for failure visibility.Key technical decisions
_call_strands_agentvia_execute_agent_turn— one code path for the agent loop; the chat-invoke and run_skill paths cannot drift.GET /api/agents/runtime-config— keeps the four TS emitters lean (no envelope bloat); single resolver for both chat + dispatcher; tenant-scoped 404 returns don't leak cross-tenant existence.agentId— scheduled/catalog/chat-intent always carry one; only webhook resolvers emit null, and tenant-admin fallback routing is a product decision that belongs in its own PR./api/skills/completeatomic CAS + 15-min reconciler provide the safety net; DLQ preserves observability of dropped invokes.What's explicitly NOT in this PR
_call_strands_agent(1,400-line function; too risky for this change).cancelSkillRunmutation flips the row; mid-run polling is separate work).Test plan
pnpm -r typecheck(minus@thinkwork/agent-toolspre-existing no-tsc) — greenpnpm -r test— 1378 passed, 7 skipped across api (1385 total); 68 passed lambdauv run pytest packages/agentcore-strands/agent-container/— 249 passedpnpm exec prettier --checkon touched packages — cleanterraform validateon agentcore-runtime module — cleancompleteor fails at a specific connectordispatch_run_skillin CloudWatch to confirm the new code is serving (perdocs/solutions/workflow-issues/deploy-silent-arch-mismatch-took-a-week-to-surface-2026-04-24.md)Deploy ordering notes
deploy.yml's container rebuild +UpdateAgentRuntime) is required for the new Python code to serve — perdocs/solutions/workflow-issues/agentcore-runtime-no-auto-repull-requires-explicit-update-2026-04-24.mdthe ECR push alone is not enough. The merge pipeline handles this.project_agentcore_deploy_race_env; the 15-min reconciler is the backstop. The newapi_runtime_config.fetchreads env at call time, not import, so post-race recovery is automatic.aws_sqs_queue.agentcore_async_dlq+aws_lambda_function_event_invoke_config.agentcoreresources are additive; first apply creates them and flips retry semantics. No data migration needed.Residual Review Findings
ce-code-review (autofix mode, 10 reviewers, 8 returned findings) surfaced the P0s above plus the P1/P2/P3 work listed here. Durable sink: this PR body (tracker filing was not invoked for in-scope launch work). Full artifact on disk at
.context/compound-engineering/ce-code-review/20260424-151323-e0154520/findings.mdin the worktree.Actionable (P1) — should ship in a follow-up before/alongside launch
aws_lambda_function_event_invoke_confignow exists, but consider EventInvoke destinations forOnSuccesstoo if operator-side observability wants completion streaming. Not blocking.post_skill_run_completeinrun_skill_dispatch.pyreturns silently whenAPI_AUTH_SECRET/THINKWORK_API_URLare missing. Row staysrunninguntil the 15-min reconciler. Should log ERROR + raise so the AWS retry (now 0, but future-proof) or the operator sees it immediately._execute_agent_turn(~170 LOC, heavy hidden state) has zero direct unit tests. Add at minimum: happy path, empty response, tool-call path, exception propagation.PostCompletionTests.test_missing_env_logs_and_returnsasserts nothing; false confidence. Assert return value + log emission.resolve-agent-runtime-configbranches (guardrail-assigned path, human-pair email fallback path, humanN lookup) have no direct unit tests; current coverage is indirect via chat + dispatcher integration.dispatch_run_skillisasync defwith zeroawaits.asyncio.run(...)makes it work, but either dropasyncor convert the blockingurllibcalls to awaitable HTTP so the label reflects reality.from server import _execute_agent_turninside the function hides a module-load cycle. Either declare explicitly or extract_execute_agent_turninto its own module.invokeAgentcoreRunSkillenvelope constructor (utils.ts,skills.ts,job-trigger.ts). Extract to a shared helper so future envelope-shape changes (like today's agentId thread) happen once, not thrice.Polish (P2) — defer to ops hardening pass
_tool_costsglobal accumulates across invocations if_execute_agent_turnraises (reliability-03, reliability-07) — wrap in try/finally_tool_costs.clear().dispatch_run_skill; runaway agent loop burns full 900s budget (reliability-04).completion_hmac_secretin envelope strands the row on 401 terminal (reliability-06).failedwithguardrail_blockedreason (correctness-2, adv-05).scope.get('invokerUserId' | 'agentId' | 'threadId')never match TS producer shape — remove to prevent silent future breakage (correctness-4, api-contract-002)._execute_agent_turnleaves workspace/composer/hindsight env vars set on warm containers (correctness-5)._FETCH_RETRY_DELAYSand_COMPLETE_RETRY_DELAYSidentical tuples in two modules (kieran-python retry-delays-duplicated)._cfg(camel, snake)closure repeated 3 places (kieran-python envelope-scope-fallback-duplicated-three-times).(k: any)cast inchat-agent-invoke.tsdefeatsKnowledgeBaseConfigtype (kt-003).resolveAgentRuntimeConfigrely on schema stability (kt-004).run_skill_dispatchreaches past the underscore prefix intoserver._execute_agent_turn(M2).RuntimeConfigFetchError(code=None)conflates missing-env with retry-exhaustion (M4).complete(adv-04).response_textposted verbatim asdeliveredArtifactRefwith no size/shape check (adv-06).do_POSTsnake_case/camelCase ordering (adv-07).skill_runs.statusflip mid-run; 4xx on callback is only signal (adv-08)._build_synthetic_payloadasserts only 4 of 16+ keys (testing-5); scope casing mismatch untested (testing-6); agents-runtime-config handler test mocks the helper wholesale (testing-7);_execute_agent_turnedge cases —response_text=None, empty skills, None mcp_configs — untested (testing-8).Advisory (P3)
(k: any)parameter naming drift (kt-005).agents-runtime-confighandler hand-rolls OPTIONS instead of usinghandleCors/cors()(kt-007)._urlopen_with_retry'sassert last_exc is not None(reliability-08).sha256=...(adv-10).🤖 Generated with Claude Code