feat(telemetry): per-method telemetry events for workflow runs (swamp-club#301)#1349
feat(telemetry): per-method telemetry events for workflow runs (swamp-club#301)#1349
Conversation
…-club#301) Workflow runs now emit one TelemetryEntry per workflow YAML step that resolves to a model method, alongside the parent CLI invocation entry. Children use the existing cli_invocation event shape (same redactions as a direct `swamp model method run`) and link to the parent via a new optional `parentInvocationId` field. A new optional `workflowContext` block carries workflowName/runId/jobName/stepName/modelType/driver so per-driver and per-model-type analytics are first-class without joining through the parent. The bridge lives in src/libswamp/workflows/telemetry_bridge.ts: it tracks in-flight method invocations between method_executing and the matching step_completed/step_failed events, synthesizes durationMs=0 entries for pre-method-executing failures (model lookup, vault expression resolution, vary-key validation, env-var validation), and finalizes any unfinished invocations on stream termination so cancellation/timeout paths don't silently drop telemetry. Domain event extensions: - step_failed gains optional modelName/methodName/driver, populated only at the model-method failure site; structural failures (max-depth, cycle, nested-workflow) leave them undefined so the bridge can distinguish method failures from structural failures. - method_executing gains optional driver, captured from the resolved DriverPlan; the yield is reordered to fire after DriverPlan resolution. Wire shape is opaque on the swamp-club ingest side (properties: Record<string, unknown>) so the additive fields ride across with no consumer-side coordination — verified against services/telemetry/lib/schema.ts and consumers/metrics.ts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous matcher used `stepName.startsWith("fanout-") && stepName.includes("a")`
which non-deterministically aliased `"fanout-b"` to the same entry as
`"fanout-a"` because the prefix `"fanout-"` itself contains the letter `"a"`.
Linux CI's directory iteration order returned `"fanout-b"` first, so
`find()` matched it for BOTH `fanoutA` and `fanoutB` and the
distinct-stepNames assertion failed.
Use exact `===` match instead — the iterations are known constants in
this fixture.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The fixture uses POSIX shell built-ins (`echo`, `exit`) via the command/shell model. On Windows the shell exec exits with code -65536 because shell built-ins aren't directly resolvable as Windows binaries — already a known limitation handled by `keeb_shell_model_test.ts` which uses the same pattern. The bridge logic itself is platform-independent and covered by src/libswamp/workflows/telemetry_bridge_test.ts which runs on all platforms. This integration test verifies end-to-end CLI plumbing on POSIX only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
CLI UX Review
Blocking
None.
Suggestions
None.
Verdict
PASS — This PR makes no user-facing changes. All modifications are internal telemetry plumbing: enriching the TelemetryEntry wire shape with parentInvocationId and workflowContext, wiring a telemetry sink into WorkflowRunDeps, and a module-scoped accessor in telemetry_integration.ts. No command flags, help text, log-mode output, JSON-mode output, or error messages were added or changed.
There was a problem hiding this comment.
Code Review
Well-architected feature with comprehensive test coverage across all layers.
Blocking Issues
None.
Suggestions
key.split(":")infinalize()is fragile (src/libswamp/workflows/telemetry_bridge.ts:197): The step key uses${jobId}:${stepId}as the map key, thensplit(":")to recover the parts during drain. If a step ID ever contained a colon (e.g., from a CEL expression or template expansion), the split would misattribute thestepNamein the workflow context. Step names don't currently use colons so this isn't realistic today, but a safer approach would be to store the(jobId, stepId)tuple directly onInFlightMethodInvocationrather than re-parsing the key. Low-priority since it only affects the error-drain path.
What looks good
- DDD alignment:
WorkflowContextis a proper value object (immutable, equality by value), the bridge acts as an application-layer service mediating between domain events and the telemetry sink, and the sink callback keeps libswamp decoupled from the domain telemetry service. - Import boundary: CLI command imports
WorkflowTelemetrySinkandWorkflowRunDepsfrom../../libswamp/mod.ts— no direct internal imports. - Additive wire schema: New optional fields on
cli_invocationevent with backward-compat regression tests for legacy entries missingparentInvocationId/workflowContext. Clean zero-serialization for absent optional fields. - Failure semantics: The five-way failure categorization (success, post-method error, pre-method-executing error, structural skip, finalize drain) is well-mapped and each branch has dedicated test coverage.
- Pre-allocated
invocationId: Letting children reference the parent ID before the parent entry is written avoids timestamp-based join heuristics — correct design. method_executingreordering: Moving the yield to after DriverPlan resolution is the right call — it gives the bridge the resolved driver and correctly reclassifies vary-key failures as pre-method-executing.- Test breadth: 23 new unit tests across bridge, service, entry, repository, and HTTP sender; plus an integration test that verifies the full CLI → libswamp → persistence path. The
finalize()idempotency, sequential-workflows, and mid-stream-throw tests are particularly well-constructed. - License headers present on all new files.
There was a problem hiding this comment.
Adversarial Review
Critical / High
No critical or high severity issues found.
Medium
-
Unhandled telemetry write failure can crash workflow execution —
src/libswamp/workflows/run.ts:579and:631await telemetryBridge.observe(mapped)inside the main for-await loop andawait telemetryBridge.finalize()in thefinallyblock both propagate any exception thrown bysink.recordChildInvocation. If the underlyingJsonTelemetryRepository.save()fails (disk full, permission denied, corrupted directory), the workflow fails with a confusing telemetry error instead of completing normally.Breaking scenario: Disk nears capacity during a long workflow run. A child telemetry entry write fails →
observe()throws → the catch block yields{ kind: "error", error: workflowExecutionFailed(diskError) }→ the workflow appears to have failed, even though all model methods succeeded.Additionally, if
finalize()throws in thefinallyblock, it can mask the original workflow error (the thrown error fromfinallyreplaces whatever thetry/catchwas doing).Suggested fix: Wrap both the
observeandfinalizecalls in try/catch to swallow telemetry failures gracefully:if (telemetryBridge) { try { await telemetryBridge.observe(mapped); } catch { /* telemetry best-effort */ } } // ... if (telemetryBridge) { try { await telemetryBridge.finalize(); } catch { /* telemetry best-effort */ } }
Severity is medium rather than high because: (a) in practice the JSON repository writes small files to the
.swamp/telemetry/directory which is unlikely to fail in normal operation, and (b) the parentrecordSuccess/recordErrorcalls in the CLI lifecycle have the same unguarded pattern, so this isn't a regression — it's consistent with the existing design. But since child invocations fire mid-workflow (not just at CLI exit), the blast radius of a failure is larger here. -
key.split(":")infinalize()is fragile when identifiers contain colons —src/libswamp/workflows/telemetry_bridge.ts:197const [jobId, stepId] = key.split(":");destructures only the first two segments. If a job name or step name contains a colon (e.g."deploy:prod", or a forEach-expanded name like"step-host:port[0]"), the stepId would be truncated. ThestepKeyfunction on line 230 joins with:but the reverse split is not symmetric.Breaking scenario: A workflow YAML names a job
"deploy:us-east-1". The key becomes"deploy:us-east-1:validate". The split producesjobId = "deploy",stepId = "us-east-1"— both wrong, and"validate"is lost entirely. TheworkflowContextin the drained telemetry entry would have incorrectjobNameandstepName.Suggested fix: Use
indexOffor a single split:const sep = key.indexOf(":"); const jobId = key.slice(0, sep); const stepId = key.slice(sep + 1);Impact is limited to telemetry metadata for drained in-flight entries (the
finalizepath). The normalobservepath uses the original event'sjobId/stepIddirectly and is unaffected.
Low
-
new Date(0)dead write —src/libswamp/workflows/telemetry_bridge.ts:159The synthesized
InFlightMethodInvocationsetsstartedAt: new Date(0)but this value is never read —sameInstanton line 165 is passed torecordChildInvocationdirectly. ThestartedAtinside the synthesized object is only consumed bybuildWorkflowContext, which doesn't use it. Not a bug, just a misleading dead value. -
Nested workflow events forwarded to parent bridge — The execution service's
runWorkflowStepforwards child workflow events to the parent stream (line 1961 inexecution_service.ts). If a nested workflow emits its ownmethod_executing/step_completedpairs, the parent bridge would observe them and create child telemetry entries attributed to the parent workflow'sworkflowName/runId. This is arguably correct (the parent bridge sees all events from the parent stream), but nested workflow method invocations would carry the outer workflow's name, not the inner workflow's. The PR explicitly documents nested workflows as out of scope for V1, so this is just a note for future iterations.
Verdict
PASS — The architecture is well-considered. The bridge design is clean, idempotent, and correctly handles all five documented failure branches. Event ordering is correct — method_executing fires after driver resolution, model_resolved fires before it, and step_failed carries the right context for pre/post-method-executing failures. Wire-shape tests lock the contract. The two medium findings are worth addressing in a follow-up but neither represents data loss or incorrect behavior in normal operation.
Summary
Closes swamp-club#301.
Workflow runs now emit one
TelemetryEntryper workflow YAML step that resolves to a model method, alongside the parent CLI invocation entry. Children use the existingcli_invocationevent shape (same redactions as a directswamp model method run) and link to the parent via a new optionalparentInvocationIdfield. A new optionalworkflowContextblock carriesworkflowName/runId/jobName/stepName/modelType/driverso per-driver and per-model-type analytics are first-class without joining through the parent.The design choice was deliberate: the issue originally proposed a new
workflow_method_invocationevent type. We pushed back during planning and chose additive optional fields oncli_invocationinstead — the swamp-club ingest side declaresproperties: Record<string, unknown>so additive fields ride across with no consumer-side coordination. Analytics queries that aggregate bycommand/subcommand/durationimmediately see workflow-internal method invocations alongside direct ones.What's new on the wire
{ "event": "cli_invocation", "properties": { "id": "<child-uuid>", "invocation": { "command": "model", "subcommand": "method", "args": ["run", "<REDACTED>", "<methodName>"], "optionKeys": [], "globalOptions": [] }, "result": { "status": "success", "exitCode": 0 }, "parentInvocationId": "<parent-cli-invocation-uuid>", "workflowContext": { "workflowName": "deploy", "runId": "<workflow-run-uuid>", "jobName": "build", "stepName": "validate", "modelType": "command/shell", "driver": "local" } // ... existing fields (startedAt, completedAt, durationMs, swampVersion, // denoVersion, platform, invocationContext) unchanged } }Older entries continue to decode without
parentInvocationId/workflowContext(forward-compat regression test added).Architecture
src/libswamp/workflows/telemetry_bridge.ts) — tracks in-flight method invocations by${jobId}:${stepId}, maps the existingmethod_executing→step_completed/step_failedevent pairs into success/error child entries, synthesizesdurationMs = 0entries for pre-method-executing failures (model lookup, vault expression resolution, vary-key validation, env-var validation), and finalizes any unfinished invocations on stream termination so cancellation/timeout paths don't silently drop telemetry.WorkflowTelemetrySinkinsrc/libswamp/workflows/run.ts) — narrow callback shape onWorkflowRunDeps. CLI binds it toTelemetryService.recordChildInvocation; non-CLI consumers passundefinedand the bridge becomes a no-op. Keeps libswamp free of direct domain.telemetry imports beyond plain DTOs.TelemetryServiceexposes a stableinvocationId(constructor pre-allocates aTelemetryId) so children can reference it asparentInvocationIdbefore the parent entry itself is recorded at the end of the CLI lifecycle. Module-scoped accessor (getActiveTelemetryServiceinsrc/cli/telemetry_integration.ts) is set inrunClibefore parse and cleared in the surroundingtry/finally.Domain event extensions
step_failedgains optionalmodelName/methodName/driver, populated only at the model-method failure site (line ~1820 inrunStep's catch block). Structural failures — max-nesting-depth, cycle detection, nested-workflow throw/failed — leave them undefined so the bridge can distinguish method failures from structural failures.method_executinggains optionaldriver, captured from the resolvedDriverPlan. The yield is reordered to fire after DriverPlan resolution; vary-key validation failures (which happen between event start and method_executing) become pre-method-executing failures by design — more accurate categorization since the method was never invoked.Failure semantics
status: success, real durationmethod_executingstatus: error, real durationmethod_executing(model lookup, vault, vary, env var)status: error,durationMs = 0(synthesized)allowFailure: truestepstatus: erroron the child (method outcome); parent records workflowsuccesserrorvia the bridge'sfinalize()V1 limitations (documented in
design/workflow.md)DefaultMethodExecutionService.executeare not captured separately.Test Plan
TelemetryEntryround-trip with/without new fields (back-compat regression locked in);TelemetryService.recordChildInvocationsuccess and error paths withUserErrorclassification;WorkflowTelemetryBridgefor all five branches (success, post-method failure, pre-method failure, structural skip, finalize drain) plus idempotency, sequential workflows, forEach, allowFailure semantics — 23 new test cases.throwwith an in-flight method invocation: bridge'stry/finallydrains it as an error child, parent stream'serrorevent still propagates cleanly.integration/telemetry_workflow_method_invocations_test.ts) — end-to-end CLI invocation runs a workflow with success step + forEach iterations, asserts one parent + correct number of children withparentInvocationIdlinkage and fullworkflowContext(includingdriver,modelType).HttpTelemetrySenderincludes new fields atproperties.parentInvocationId/properties.workflowContext.*; omitted entirely when absent (noundefinedserialization).JsonTelemetryRepositorysaves and reads new fields; legacy entries without them decode cleanly.deno check,deno lint,deno fmt --check,deno run test(5723 passed, 0 failed),deno run compile.~/git/swamp-mediaand inspected~/git/swamp-media/.swamp/telemetry/. Got one parent + three children (ok-step, fanout-a, fanout-b) with allworkflowContextfields populated and consistentparentInvocationId/runId. Children share the redacted-args shape with directmodel method runinvocations. forEach iterations have distinctstepNames.Consumer side
Verified against swamp-club:
services/telemetry/lib/schema.tsdeclaresproperties: Record<string, unknown>so the additive fields ride across the wire with zero coordination. Existing rollup metrics inconsumers/metrics.tsalready follow the "read what you need from the opaque bag" pattern. A follow-up workflowContext rollup metric (per-driver / per-model-type / per-step counts) is a separate swamp-club issue, not blocking.🤖 Generated with Claude Code