
feat(telemetry): per-method telemetry events for workflow runs (swamp-club#301)#1349

Merged
stack72 merged 3 commits into main from feat/per-method-workflow-telemetry
May 9, 2026

Conversation

Contributor

@keeb keeb commented May 9, 2026

Summary

Closes swamp-club#301.

Workflow runs now emit one TelemetryEntry per workflow YAML step that resolves to a model method, alongside the parent CLI invocation entry. Children use the existing cli_invocation event shape (same redactions as a direct swamp model method run) and link to the parent via a new optional parentInvocationId field. A new optional workflowContext block carries workflowName / runId / jobName / stepName / modelType / driver so per-driver and per-model-type analytics are first-class without joining through the parent.

The design choice was deliberate: the issue originally proposed a new workflow_method_invocation event type. We pushed back during planning and chose additive optional fields on cli_invocation instead — the swamp-club ingest side declares properties: Record<string, unknown> so additive fields ride across with no consumer-side coordination. Analytics queries that aggregate by command/subcommand/duration immediately see workflow-internal method invocations alongside direct ones.

What's new on the wire

{
  "event": "cli_invocation",
  "properties": {
    "id": "<child-uuid>",
    "invocation": {
      "command": "model",
      "subcommand": "method",
      "args": ["run", "<REDACTED>", "<methodName>"],
      "optionKeys": [],
      "globalOptions": []
    },
    "result": { "status": "success", "exitCode": 0 },
    "parentInvocationId": "<parent-cli-invocation-uuid>",
    "workflowContext": {
      "workflowName": "deploy",
      "runId": "<workflow-run-uuid>",
      "jobName": "build",
      "stepName": "validate",
      "modelType": "command/shell",
      "driver": "local"
    }
    // ... existing fields (startedAt, completedAt, durationMs, swampVersion,
    //     denoVersion, platform, invocationContext) unchanged
  }
}

Older entries continue to decode without parentInvocationId / workflowContext (backward-compat regression test added).
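
As a rough illustration of how a reader of these entries might tolerate both shapes, here is a minimal sketch. The type names `WorkflowContextSketch` and `EntryPropertiesSketch` and the helpers `decodeProperties` and `isWorkflowChild` are hypothetical, inferred from the JSON example above; the real TelemetryEntry decoding in the codebase may look different.

```typescript
// Hypothetical types inferred from the wire example; not the real domain types.
interface WorkflowContextSketch {
  workflowName: string;
  runId: string;
  jobName: string;
  stepName: string;
  modelType: string;
  driver: string;
}

interface EntryPropertiesSketch {
  id: string;
  // Both fields are absent on direct CLI invocations and on legacy entries.
  parentInvocationId?: string;
  workflowContext?: WorkflowContextSketch;
  // Remaining fields (invocation, result, ...) are left opaque in this sketch.
  [key: string]: unknown;
}

// Decode a serialized `properties` bag without requiring the new fields.
function decodeProperties(raw: string): EntryPropertiesSketch {
  const parsed: unknown = JSON.parse(raw);
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    throw new Error("expected a JSON object");
  }
  const bag = parsed as Record<string, unknown>;
  if (typeof bag.id !== "string") {
    throw new Error("missing id");
  }
  return bag as EntryPropertiesSketch;
}

// A workflow child is an entry that carries both additive fields.
function isWorkflowChild(p: EntryPropertiesSketch): boolean {
  return p.parentInvocationId !== undefined && p.workflowContext !== undefined;
}

const legacyEntry = decodeProperties('{"id":"a"}');
const childEntry = decodeProperties(
  '{"id":"b","parentInvocationId":"a","workflowContext":{"workflowName":"deploy",' +
    '"runId":"r","jobName":"build","stepName":"validate",' +
    '"modelType":"command/shell","driver":"local"}}',
);
console.log(isWorkflowChild(legacyEntry), isWorkflowChild(childEntry));
```

Treating the new fields as optional on the reading side is what lets old and new entries share one decoder.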

Architecture

  • Bridge (src/libswamp/workflows/telemetry_bridge.ts): tracks in-flight method invocations by ${jobId}:${stepId}, maps the existing method_executing → step_completed/step_failed event pairs into success/error child entries, synthesizes durationMs = 0 entries for pre-method-executing failures (model lookup, vault expression resolution, vary-key validation, env-var validation), and finalizes any unfinished invocations on stream termination so cancellation/timeout paths don't silently drop telemetry.
  • Sink (WorkflowTelemetrySink in src/libswamp/workflows/run.ts): narrow callback shape on WorkflowRunDeps. The CLI binds it to TelemetryService.recordChildInvocation; non-CLI consumers pass undefined and the bridge becomes a no-op. Keeps libswamp free of direct domain.telemetry imports beyond plain DTOs.
  • Pre-allocated parent id: TelemetryService exposes a stable invocationId (constructor pre-allocates a TelemetryId) so children can reference it as parentInvocationId before the parent entry itself is recorded at the end of the CLI lifecycle. A module-scoped accessor (getActiveTelemetryService in src/cli/telemetry_integration.ts) is set in runCli before parse and cleared in the surrounding try/finally.
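
The pairing and drain behaviour of the bridge described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in telemetry_bridge.ts: `StepEvent`, `ChildRecord`, `Sink` and `BridgeSketch` are hypothetical names, the sketch is synchronous for brevity (the real observe/finalize are awaited), and the pre-method-executing branch is left out.

```typescript
// Hypothetical event and sink shapes, for illustration only.
type StepEvent =
  | { kind: "method_executing"; jobId: string; stepId: string; methodName: string }
  | { kind: "step_completed"; jobId: string; stepId: string }
  | { kind: "step_failed"; jobId: string; stepId: string };

interface ChildRecord {
  key: string;
  methodName: string;
  status: "success" | "error";
  durationMs: number;
}

type Sink = (record: ChildRecord) => void;

class BridgeSketch {
  private inFlight = new Map<string, { methodName: string; startedAt: number }>();
  private sink: Sink | undefined;
  private now: () => number;

  constructor(sink: Sink | undefined, now: () => number = () => Date.now()) {
    this.sink = sink;
    this.now = now;
  }

  observe(event: StepEvent): void {
    if (this.sink === undefined) return; // no sink bound: the bridge is a no-op
    const key = `${event.jobId}:${event.stepId}`;
    if (event.kind === "method_executing") {
      this.inFlight.set(key, { methodName: event.methodName, startedAt: this.now() });
      return;
    }
    const started = this.inFlight.get(key);
    // Terminal events with no matching method_executing are ignored in this sketch.
    if (started === undefined) return;
    this.inFlight.delete(key);
    this.sink({
      key,
      methodName: started.methodName,
      status: event.kind === "step_completed" ? "success" : "error",
      durationMs: this.now() - started.startedAt,
    });
  }

  // Drain whatever is still in flight as errors; a second call drains nothing.
  finalize(): void {
    if (this.sink === undefined) return;
    const pending = Array.from(this.inFlight.entries());
    this.inFlight.clear();
    for (const [key, started] of pending) {
      this.sink({
        key,
        methodName: started.methodName,
        status: "error",
        durationMs: this.now() - started.startedAt,
      });
    }
  }
}

// Usage with a fake clock so durations are deterministic.
const records: ChildRecord[] = [];
let clock = 0;
const bridge = new BridgeSketch((r) => { records.push(r); }, () => clock);
bridge.observe({ kind: "method_executing", jobId: "build", stepId: "validate", methodName: "run" });
clock = 40;
bridge.observe({ kind: "step_completed", jobId: "build", stepId: "validate" });
bridge.observe({ kind: "method_executing", jobId: "build", stepId: "slow", methodName: "run" });
clock = 100;
bridge.finalize(); // drains "build:slow" as an error
bridge.finalize(); // nothing left to drain
console.log(records.map((r) => `${r.key} ${r.status} ${r.durationMs}`));
```

Clearing the map before emitting is one simple way to make the drain idempotent, which is the property the test plan below calls out.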

Domain event extensions

  • step_failed gains optional modelName / methodName / driver, populated only at the model-method failure site (line ~1820 in runStep's catch block). Structural failures — max-nesting-depth, cycle detection, nested-workflow throw/failed — leave them undefined so the bridge can distinguish method failures from structural failures.
  • method_executing gains optional driver, captured from the resolved DriverPlan. The yield is reordered to fire after DriverPlan resolution; vary-key validation failures (which happen between event start and method_executing) become pre-method-executing failures by design — more accurate categorization since the method was never invoked.
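
To make the role of the optional fields concrete, here is a minimal sketch of how a consumer might tell the failure kinds apart. `StepFailedSketch` and `classifyFailure` are hypothetical; in particular, the assumption that a pre-method-executing failure carries modelName and methodName while a structural failure carries neither is read off the description above rather than taken from the code.

```typescript
// Hypothetical subset of the step_failed event; field names follow the PR text.
interface StepFailedSketch {
  jobId: string;
  stepId: string;
  modelName?: string;
  methodName?: string;
  driver?: string;
}

type FailureClass = "post-method" | "pre-method" | "structural";

// sawMethodExecuting: whether a method_executing event was observed for this step.
function classifyFailure(event: StepFailedSketch, sawMethodExecuting: boolean): FailureClass {
  if (sawMethodExecuting) return "post-method"; // the method ran and then failed
  if (event.modelName !== undefined && event.methodName !== undefined) {
    return "pre-method"; // a method was targeted but never invoked
  }
  return "structural"; // no method context at all: cycle, depth, nested workflow
}

const afterMethod = classifyFailure(
  { jobId: "build", stepId: "validate", modelName: "shell", methodName: "run", driver: "local" },
  true,
);
const beforeMethod = classifyFailure(
  { jobId: "build", stepId: "validate", modelName: "shell", methodName: "run" },
  false,
);
const structural = classifyFailure({ jobId: "build", stepId: "nested" }, false);
console.log(afterMethod, beforeMethod, structural);
```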

Failure semantics

| Step outcome | Child entry |
| --- | --- |
| Success | status: success, real duration |
| Failure after method_executing | status: error, real duration |
| Failure before method_executing (model lookup, vault, vary, env var) | status: error, durationMs = 0 (synthesized) |
| allowFailure: true step | status: error on the child (method outcome); parent records workflow success |
| Workflow-task / nested workflow / cycle / depth | No child entry (no method was ever invoked at this step) |
| Cancellation / timeout / mid-stream throw | In-flight invocations drained as error via the bridge's finalize() |

V1 limitations (documented in design/workflow.md)

  • Workflow-step granularity only. Sub-method follow-up calls inside DefaultMethodExecutionService.execute are not captured separately.
  • Failures before workflow validation (workflow not found, input schema validation) produce no child entry — no method was ever resolved.

Test Plan

  • Unit tests: TelemetryEntry round-trip with/without new fields (back-compat regression locked in); TelemetryService.recordChildInvocation success and error paths with UserError classification; WorkflowTelemetryBridge for all five branches (success, post-method failure, pre-method failure, structural skip, finalize drain) plus idempotency, sequential workflows, forEach, allowFailure semantics (23 new test cases).
  • libswamp error-terminal test: mid-stream throw with an in-flight method invocation; the bridge's try/finally drains it as an error child, and the parent stream's error event still propagates cleanly.
  • Integration test (integration/telemetry_workflow_method_invocations_test.ts): end-to-end CLI invocation runs a workflow with success step + forEach iterations, asserts one parent + correct number of children with parentInvocationId linkage and full workflowContext (including driver, modelType).
  • Wire-shape tests: HttpTelemetrySender includes new fields at properties.parentInvocationId / properties.workflowContext.*; omitted entirely when absent (no undefined serialization).
  • Repository round-trip: JsonTelemetryRepository saves and reads new fields; legacy entries without them decode cleanly.
  • Verification gates: deno check, deno lint, deno fmt --check, deno run test (5723 passed, 0 failed), deno run compile.
  • Manual end-to-end: ran a throwaway workflow in ~/git/swamp-media and inspected ~/git/swamp-media/.swamp/telemetry/. Got one parent + three children (ok-step, fanout-a, fanout-b) with all workflowContext fields populated and consistent parentInvocationId / runId. Children share the redacted-args shape with direct model method run invocations. forEach iterations have distinct stepNames.
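
The "omitted entirely when absent" property checked by the wire-shape tests can be illustrated with a small sketch. `buildProperties` and `ChildFieldsSketch` are hypothetical and are not the HttpTelemetrySender code; the point is only the conditional-spread pattern.

```typescript
// Hypothetical helper: merge the additive fields into the properties bag
// only when they are actually present.
interface ChildFieldsSketch {
  parentInvocationId?: string;
  workflowContext?: Record<string, string>;
}

function buildProperties(
  base: Record<string, unknown>,
  extra: ChildFieldsSketch,
): Record<string, unknown> {
  return {
    ...base,
    ...(extra.parentInvocationId !== undefined
      ? { parentInvocationId: extra.parentInvocationId }
      : {}),
    ...(extra.workflowContext !== undefined
      ? { workflowContext: extra.workflowContext }
      : {}),
  };
}

const directProps = buildProperties({ id: "p" }, {});
const childProps = buildProperties(
  { id: "c" },
  { parentInvocationId: "p", workflowContext: { driver: "local" } },
);
console.log(JSON.stringify(directProps));
console.log(JSON.stringify(childProps));
```

JSON.stringify already drops properties whose value is undefined, so the serialized output would be the same either way; leaving the key out of the object as well keeps `in` checks and key-count assertions accurate before serialization.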

Consumer side

Verified against swamp-club: services/telemetry/lib/schema.ts declares properties: Record<string, unknown> so the additive fields ride across the wire with zero coordination. Existing rollup metrics in consumers/metrics.ts already follow the "read what you need from the opaque bag" pattern. A follow-up workflowContext rollup metric (per-driver / per-model-type / per-step counts) is a separate swamp-club issue, not blocking.

🤖 Generated with Claude Code

keeb and others added 3 commits May 8, 2026 18:37
…-club#301)

Workflow runs now emit one TelemetryEntry per workflow YAML step that
resolves to a model method, alongside the parent CLI invocation entry.
Children use the existing cli_invocation event shape (same redactions as
a direct `swamp model method run`) and link to the parent via a new
optional `parentInvocationId` field. A new optional `workflowContext`
block carries workflowName/runId/jobName/stepName/modelType/driver so
per-driver and per-model-type analytics are first-class without joining
through the parent.

The bridge lives in src/libswamp/workflows/telemetry_bridge.ts: it tracks
in-flight method invocations between method_executing and the matching
step_completed/step_failed events, synthesizes durationMs=0 entries for
pre-method-executing failures (model lookup, vault expression resolution,
vary-key validation, env-var validation), and finalizes any unfinished
invocations on stream termination so cancellation/timeout paths don't
silently drop telemetry.

Domain event extensions:
- step_failed gains optional modelName/methodName/driver, populated only
  at the model-method failure site; structural failures (max-depth,
  cycle, nested-workflow) leave them undefined so the bridge can
  distinguish method failures from structural failures.
- method_executing gains optional driver, captured from the resolved
  DriverPlan; the yield is reordered to fire after DriverPlan
  resolution.

Wire shape is opaque on the swamp-club ingest side (properties:
Record<string, unknown>) so the additive fields ride across with no
consumer-side coordination — verified against
services/telemetry/lib/schema.ts and consumers/metrics.ts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous matcher used `stepName.startsWith("fanout-") && stepName.includes("a")`
which non-deterministically aliased `"fanout-b"` to the same entry as
`"fanout-a"` because the prefix `"fanout-"` itself contains the letter `"a"`.
Linux CI's directory iteration order returned `"fanout-b"` first, so
`find()` matched it for BOTH `fanoutA` and `fanoutB` and the
distinct-stepNames assertion failed.

Use exact `===` match instead — the iterations are known constants in
this fixture.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
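
The aliasing described in this commit message is easy to reproduce in isolation. The snippet below is a simplified sketch that uses bare strings in place of telemetry entries; the helper `looseMatcher` is hypothetical and only mirrors the predicate quoted above.

```typescript
// Order in which the entries came back on Linux CI, per the commit message.
const stepNames = ["fanout-b", "fanout-a"];

// The old predicate: the prefix "fanout-" itself contains "a", so the
// includes("a") check is true for every fanout step.
const looseMatcher = (letter: string) => (s: string) =>
  s.startsWith("fanout-") && s.includes(letter);

const looseA = stepNames.find(looseMatcher("a"));
const looseB = stepNames.find(looseMatcher("b"));

// The fix: exact comparison against the known fixture names.
const exactA = stepNames.find((s) => s === "fanout-a");
const exactB = stepNames.find((s) => s === "fanout-b");

console.log({ looseA, looseB, exactA, exactB });
```

With this ordering both loose lookups resolve to "fanout-b", while the exact lookups return the two distinct names regardless of iteration order.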
The fixture uses POSIX shell built-ins (`echo`, `exit`) via the
command/shell model. On Windows the shell exec exits with code -65536
because shell built-ins aren't directly resolvable as Windows binaries —
already a known limitation handled by `keeb_shell_model_test.ts` which
uses the same pattern.

The bridge logic itself is platform-independent and covered by
src/libswamp/workflows/telemetry_bridge_test.ts which runs on all
platforms. This integration test verifies end-to-end CLI plumbing on
POSIX only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@github-actions github-actions Bot left a comment


CLI UX Review

Blocking

None.

Suggestions

None.

Verdict

PASS — This PR makes no user-facing changes. All modifications are internal telemetry plumbing: enriching the TelemetryEntry wire shape with parentInvocationId and workflowContext, wiring a telemetry sink into WorkflowRunDeps, and a module-scoped accessor in telemetry_integration.ts. No command flags, help text, log-mode output, JSON-mode output, or error messages were added or changed.


@github-actions github-actions Bot left a comment


Code Review

Well-architected feature with comprehensive test coverage across all layers.

Blocking Issues

None.

Suggestions

  1. key.split(":") in finalize() is fragile (src/libswamp/workflows/telemetry_bridge.ts:197): The step key uses ${jobId}:${stepId} as the map key, then split(":") to recover the parts during drain. If a step ID ever contained a colon (e.g., from a CEL expression or template expansion), the split would misattribute the stepName in the workflow context. Step names don't currently use colons so this isn't realistic today, but a safer approach would be to store the (jobId, stepId) tuple directly on InFlightMethodInvocation rather than re-parsing the key. Low-priority since it only affects the error-drain path.
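
A small sketch of the parsing problem and of the tuple-on-the-record alternative follows. All names here (`stepKey`, `parseWithSplit`, `parseAtFirstColon`, `InFlightSketch`, `track`) are hypothetical stand-ins, not the code in telemetry_bridge.ts, and the identifiers containing colons are made-up inputs.

```typescript
function stepKey(jobId: string, stepId: string): string {
  return `${jobId}:${stepId}`;
}

// Mirrors a plain split: only the first two segments are kept.
function parseWithSplit(key: string): { jobId: string; stepId: string } {
  const parts = key.split(":");
  return { jobId: parts[0] ?? "", stepId: parts[1] ?? "" };
}

// Splitting at the first colon recovers step IDs that contain colons,
// but still misparses when the job ID contains one.
function parseAtFirstColon(key: string): { jobId: string; stepId: string } {
  const sep = key.indexOf(":");
  return { jobId: key.slice(0, sep), stepId: key.slice(sep + 1) };
}

// Alternative: keep the pair on the in-flight record and never re-parse the key.
interface InFlightSketch {
  jobId: string;
  stepId: string;
  startedAt: Date;
}

const inFlightByKey = new Map<string, InFlightSketch>();

function track(jobId: string, stepId: string): void {
  inFlightByKey.set(stepKey(jobId, stepId), { jobId, stepId, startedAt: new Date() });
}

track("build", "step-host:port[0]");
track("deploy:us-east-1", "validate");

const drained = Array.from(inFlightByKey.values()).map((r) => `${r.jobId} | ${r.stepId}`);
console.log(parseWithSplit(stepKey("build", "step-host:port[0]")));
console.log(parseAtFirstColon(stepKey("deploy:us-east-1", "validate")));
console.log(drained);
```

The key remains useful as a map index; it just stops being the source of truth for the two identifiers.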

What looks good

  • DDD alignment: WorkflowContext is a proper value object (immutable, equality by value), the bridge acts as an application-layer service mediating between domain events and the telemetry sink, and the sink callback keeps libswamp decoupled from the domain telemetry service.
  • Import boundary: CLI command imports WorkflowTelemetrySink and WorkflowRunDeps from ../../libswamp/mod.ts — no direct internal imports.
  • Additive wire schema: New optional fields on cli_invocation event with backward-compat regression tests for legacy entries missing parentInvocationId/workflowContext. Clean zero-serialization for absent optional fields.
  • Failure semantics: The five-way failure categorization (success, post-method error, pre-method-executing error, structural skip, finalize drain) is well-mapped and each branch has dedicated test coverage.
  • Pre-allocated invocationId: Letting children reference the parent ID before the parent entry is written avoids timestamp-based join heuristics — correct design.
  • method_executing reordering: Moving the yield to after DriverPlan resolution is the right call — it gives the bridge the resolved driver and correctly reclassifies vary-key failures as pre-method-executing.
  • Test breadth: 23 new unit tests across bridge, service, entry, repository, and HTTP sender; plus an integration test that verifies the full CLI → libswamp → persistence path. The finalize() idempotency, sequential-workflows, and mid-stream-throw tests are particularly well-constructed.
  • License headers present on all new files.


@github-actions github-actions Bot left a comment


Adversarial Review

Critical / High

No critical or high severity issues found.

Medium

  1. Unhandled telemetry write failure can crash workflow execution (src/libswamp/workflows/run.ts:579 and :631)

    await telemetryBridge.observe(mapped) inside the main for-await loop and await telemetryBridge.finalize() in the finally block both propagate any exception thrown by sink.recordChildInvocation. If the underlying JsonTelemetryRepository.save() fails (disk full, permission denied, corrupted directory), the workflow fails with a confusing telemetry error instead of completing normally.

    Breaking scenario: Disk nears capacity during a long workflow run. A child telemetry entry write fails → observe() throws → the catch block yields { kind: "error", error: workflowExecutionFailed(diskError) } → the workflow appears to have failed, even though all model methods succeeded.

    Additionally, if finalize() throws in the finally block, it can mask the original workflow error (the thrown error from finally replaces whatever the try/catch was doing).

    Suggested fix: Wrap both the observe and finalize calls in try/catch to swallow telemetry failures gracefully:

    if (telemetryBridge) {
      try { await telemetryBridge.observe(mapped); } catch { /* telemetry best-effort */ }
    }
    // ...
    if (telemetryBridge) {
      try { await telemetryBridge.finalize(); } catch { /* telemetry best-effort */ }
    }

    Severity is medium rather than high because: (a) in practice the JSON repository writes small files to the .swamp/telemetry/ directory which is unlikely to fail in normal operation, and (b) the parent recordSuccess/recordError calls in the CLI lifecycle have the same unguarded pattern, so this isn't a regression — it's consistent with the existing design. But since child invocations fire mid-workflow (not just at CLI exit), the blast radius of a failure is larger here.

  2. key.split(":") in finalize() is fragile when identifiers contain colons (src/libswamp/workflows/telemetry_bridge.ts:197)

    const [jobId, stepId] = key.split(":"); destructures only the first two segments. If a job name or step name contains a colon (e.g. "deploy:prod", or a forEach-expanded name like "step-host:port[0]"), the stepId would be truncated. The stepKey function on line 230 joins with : but the reverse split is not symmetric.

    Breaking scenario: A workflow YAML names a job "deploy:us-east-1". The key becomes "deploy:us-east-1:validate". The split produces jobId = "deploy", stepId = "us-east-1" — both wrong, and "validate" is lost entirely. The workflowContext in the drained telemetry entry would have incorrect jobName and stepName.

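    Suggested fix: Use indexOf for a single split: const sep = key.indexOf(":"); const jobId = key.slice(0, sep); const stepId = key.slice(sep + 1); This recovers step IDs that contain colons, but a colon in the job ID (as in the scenario above) would still be misparsed; storing jobId and stepId on the in-flight record avoids re-parsing the key altogether.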

    Impact is limited to telemetry metadata for drained in-flight entries (the finalize path). The normal observe path uses the original event's jobId/stepId directly and is unaffected.
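
The masking effect noted in the first medium finding (an exception thrown from a finally block replaces the one already in flight) can be shown with a self-contained sketch. `workflowThatThrows`, `finalizeThatThrows` and `run` are hypothetical stand-ins for the workflow loop and the bridge's finalize(); the sketch is synchronous for brevity.

```typescript
// Stand-in for the workflow body failing with its own error.
function workflowThatThrows(): void {
  throw new Error("workflow: step failed");
}

// Stand-in for finalize() failing because the telemetry write failed.
function finalizeThatThrows(): void {
  throw new Error("telemetry: disk full");
}

// Returns the message of whichever error reaches the caller.
function run(guarded: boolean): string {
  let surfaced = "none";
  try {
    try {
      workflowThatThrows();
    } finally {
      if (guarded) {
        try {
          finalizeThatThrows();
        } catch {
          // telemetry is best-effort: swallow so the original error survives
        }
      } else {
        finalizeThatThrows(); // replaces the workflow error
      }
    }
  } catch (e) {
    surfaced = e instanceof Error ? e.message : String(e);
  }
  return surfaced;
}

console.log(run(false));
console.log(run(true));
```

Unguarded, the caller sees the telemetry error; guarded, the original workflow error is the one that propagates, which is the behaviour the suggested try/catch wrapper is after.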

Low

  1. new Date(0) dead write (src/libswamp/workflows/telemetry_bridge.ts:159)

    The synthesized InFlightMethodInvocation sets startedAt: new Date(0) but this value is never read — sameInstant on line 165 is passed to recordChildInvocation directly. The startedAt inside the synthesized object is only consumed by buildWorkflowContext, which doesn't use it. Not a bug, just a misleading dead value.

  2. Nested workflow events forwarded to parent bridge — The execution service's runWorkflowStep forwards child workflow events to the parent stream (line 1961 in execution_service.ts). If a nested workflow emits its own method_executing / step_completed pairs, the parent bridge would observe them and create child telemetry entries attributed to the parent workflow's workflowName/runId. This is arguably correct (the parent bridge sees all events from the parent stream), but nested workflow method invocations would carry the outer workflow's name, not the inner workflow's. The PR explicitly documents nested workflows as out of scope for V1, so this is just a note for future iterations.

Verdict

PASS — The architecture is well-considered. The bridge design is clean, idempotent, and correctly handles all five documented failure branches. Event ordering is correct — method_executing fires after driver resolution, model_resolved fires before it, and step_failed carries the right context for pre/post-method-executing failures. Wire-shape tests lock the contract. The two medium findings are worth addressing in a follow-up but neither represents data loss or incorrect behavior in normal operation.

@stack72 stack72 merged commit 16e942b into main May 9, 2026
11 checks passed
@stack72 stack72 deleted the feat/per-method-workflow-telemetry branch May 9, 2026 02:10

2 participants