Add improved OpenTelemetry instrumentation for workflow observability#933
Add improved OpenTelemetry instrumentation for workflow observability#933
Conversation
🦋 Changeset detectedLatest commit: b142e22 The changes in this PR will be included in the next version bump. This PR includes changesets to release 16 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
📊 Benchmark Results
workflow with no steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Nitro | Next.js (Turbopack) workflow with 1 step💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Next.js (Turbopack) | Express workflow with 10 sequential steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Express | Next.js (Turbopack) workflow with 25 sequential steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Nitro | Next.js (Turbopack) workflow with 50 sequential steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Next.js (Turbopack) | Express Promise.all with 10 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Next.js (Turbopack) | Nitro Promise.all with 25 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Next.js (Turbopack) | Nitro Promise.all with 50 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Next.js (Turbopack) | Express Promise.race with 10 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Next.js (Turbopack) | Express | Nitro Promise.race with 25 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Next.js (Turbopack) | Nitro Promise.race with 50 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Nitro | Next.js (Turbopack) Stream Benchmarks (includes TTFB metrics)workflow with stream💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Next.js (Turbopack) | Express SummaryFastest Framework by WorldWinner determined by most benchmark wins
Fastest World by FrameworkWinner determined by most benchmark wins
Column Definitions
Worlds:
|
🧪 E2E Test Results❌ Some tests failed Summary
❌ Failed Tests🌍 Community Worlds (169 failed)mongodb (42 failed):
redis (42 failed):
starter (43 failed):
turso (42 failed):
Details by Category✅ ▲ Vercel Production
✅ 💻 Local Development
✅ 📦 Local Production
✅ 🐘 Local Postgres
✅ 🪟 Windows
❌ 🌍 Community Worlds
✅ 📋 Other
|
There was a problem hiding this comment.
Pull request overview
This PR adds comprehensive OpenTelemetry instrumentation to improve workflow observability across HTTP requests, storage operations, event loading, and queue processing.
Changes:
- Added tracing to HTTP requests in world-vercel with detailed attributes for method, endpoint, status, and parse format
- Instrumented all storage operations (runs, steps, events, hooks) with automatic span wrapping
- Added queue timing breakdown tracking (deserialize, execution, serialize times) in step handler
- Added event loading tracing with pagination metrics
Reviewed changes
Copilot reviewed 11 out of 12 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| pnpm-lock.yaml | Added @opentelemetry/api 1.9.0 as devDependency for world-vercel |
| packages/world-vercel/package.json | Added @opentelemetry/api as optional peer dependency |
| packages/world-vercel/src/telemetry.ts | New minimal telemetry module with trace helpers and semantic conventions |
| packages/world-vercel/src/utils.ts | Wrapped makeRequest with HTTP tracing including nested parse and validate spans |
| packages/world-vercel/src/storage.ts | Added instrumentObject wrapper to all storage methods |
| packages/world-vercel/src/instrumentObject.ts | New utility to automatically wrap object methods with tracing |
| packages/core/src/telemetry/semantic-conventions.ts | Added semantic conventions for world/storage, events, serialization, and queue timing |
| packages/core/src/telemetry.ts | Exported instrumentObject function (no changes in diff, already existed) |
| packages/core/src/runtime/step-handler.ts | Added timing measurements for deserialize, execution, and serialize phases |
| packages/core/src/runtime/helpers.ts | Wrapped getAllWorkflowRunEvents with tracing and pagination metrics |
| .changeset/improve-otel-tracing.md | Added changeset describing the improvements |
Files not reviewed (1)
- pnpm-lock.yaml: Language not supported
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| export const WorldParseBytes = SemanticConvention<number>('world.parse.bytes'); | ||
|
|
||
| // Event loading attributes | ||
|
|
||
| /** Number of pagination pages loaded when fetching workflow events */ | ||
| export const WorkflowEventsPagesLoaded = SemanticConvention<number>( | ||
| 'workflow.events.pages_loaded' | ||
| ); | ||
|
|
||
| // Serialization attributes | ||
|
|
||
| /** Format used for serialization (e.g., 'devalue') */ | ||
| export const SerializeFormat = SemanticConvention<string>('serialize.format'); | ||
|
|
||
| /** Size in bytes of the serialized data */ | ||
| export const SerializeBytes = SemanticConvention<number>('serialize.bytes'); | ||
|
|
||
| /** Format used for deserialization (e.g., 'devalue') */ | ||
| export const DeserializeFormat = | ||
| SemanticConvention<string>('deserialize.format'); | ||
|
|
||
| /** Size in bytes of the data being deserialized */ | ||
| export const DeserializeBytes = SemanticConvention<number>('deserialize.bytes'); | ||
|
|
||
| // Queue timing breakdown attributes | ||
|
|
||
| /** Time spent deserializing the queue message in milliseconds */ | ||
| export const QueueDeserializeTimeMs = SemanticConvention<number>( | ||
| 'queue.deserialize_time_ms' | ||
| ); | ||
|
|
||
| /** Time spent executing the handler logic in milliseconds */ | ||
| export const QueueExecutionTimeMs = SemanticConvention<number>( | ||
| 'queue.execution_time_ms' | ||
| ); | ||
|
|
||
| /** Time spent serializing the response in milliseconds */ | ||
| export const QueueSerializeTimeMs = SemanticConvention<number>( | ||
| 'queue.serialize_time_ms' | ||
| ); |
There was a problem hiding this comment.
The semantic conventions WorldParseBytes, SerializeFormat, SerializeBytes, DeserializeFormat, and DeserializeBytes are defined but never used in the codebase. Consider either implementing these attributes where appropriate (e.g., adding WorldParseBytes to the WORLD.parse span in utils.ts, and serialization attributes in the serialization code) or removing them if they're not needed yet. Unused definitions can confuse maintainers about the expected instrumentation coverage.
There was a problem hiding this comment.
Removed the unused semantic conventions (WorldParseBytes, SerializeFormat, SerializeBytes, DeserializeFormat, DeserializeBytes) in commit 161cd67.
| /** HTTP method used in World storage request */ | ||
| export const WorldHttpMethod = SemanticConvention<string>('world.http.method'); | ||
|
|
||
| /** API endpoint path for World storage request */ | ||
| export const WorldHttpEndpoint = SemanticConvention<string>( | ||
| 'world.http.endpoint' | ||
| ); | ||
|
|
||
| /** HTTP status code from World storage request */ | ||
| export const WorldHttpStatus = SemanticConvention<number>('world.http.status'); | ||
|
|
||
| /** Format used for parsing response body (cbor or json) */ | ||
| export const WorldParseFormat = SemanticConvention<'cbor' | 'json'>( | ||
| 'world.parse.format' | ||
| ); | ||
|
|
||
| /** Size in bytes of the parsed response body */ | ||
| export const WorldParseBytes = SemanticConvention<number>('world.parse.bytes'); |
There was a problem hiding this comment.
The semantic conventions WorldHttpMethod, WorldHttpEndpoint, WorldHttpStatus, WorldParseFormat, and WorldParseBytes are duplicated between this file and packages/core/src/telemetry/semantic-conventions.ts (lines 237-253). This duplication can lead to inconsistencies and maintenance issues. Consider importing these from @workflow/core instead, or if avoiding circular dependencies is necessary, document this duplication clearly and ensure the definitions remain synchronized.
There was a problem hiding this comment.
Added documentation at the top of world-vercel/telemetry.ts explaining the intentional duplication due to circular dependency avoidance. The comment now instructs maintainers to keep the conventions synchronized.
| const deserializeStartTime = Date.now(); | ||
| const ops: Promise<void>[] = []; | ||
| const hydratedInput = hydrateStepArguments( | ||
| step.input, | ||
| ops, | ||
| workflowRunId | ||
| ); | ||
| const deserializeTimeMs = Date.now() - deserializeStartTime; |
There was a problem hiding this comment.
The deserializeTimeMs measurement (line 243) captures only the synchronous portion of hydrateStepArguments. However, hydrateStepArguments may initiate async operations that are added to the 'ops' array (line 237), and these operations are executed later via Promise.all(ops) at line 294. This means the deserialize timing doesn't fully capture all deserialization work. Consider documenting this limitation or restructuring to measure the complete deserialization time including async operations.
There was a problem hiding this comment.
Added a comment documenting this limitation. The timing intentionally captures only the synchronous hydration - async operations like stream loading happen in background and their timing is included in the overall step execution time tracked separately.
There was a problem hiding this comment.
Maybe we also need separate spans explicitly for serializing and deserializing etc. rather than just including it as an attribute
| return tracer.startActiveSpan(spanName, opts, async (span) => { | ||
| try { | ||
| const result = await fn(span); | ||
| span.setStatus({ code: otel.SpanStatusCode.OK }); | ||
| return result; | ||
| } catch (e) { | ||
| span.setStatus({ | ||
| code: otel.SpanStatusCode.ERROR, | ||
| message: (e as Error).message, | ||
| }); | ||
| throw e; | ||
| } finally { | ||
| span.end(); | ||
| } | ||
| }); |
There was a problem hiding this comment.
The trace function in world-vercel doesn't handle WorkflowSuspension errors like the @workflow/core version does (see packages/core/src/telemetry.ts lines 105, 117-133). WorkflowSuspension is not an actual error but an algebraic effect for suspension, and should be marked with SpanStatusCode.OK. While world-vercel might not encounter WorkflowSuspension errors directly, if any world operations throw this, the span status will incorrectly show ERROR. Consider adding similar handling or documenting why it's not needed.
There was a problem hiding this comment.
Added documentation explaining why WorkflowSuspension handling isn't needed - world-vercel operates at the HTTP layer and never encounters workflow suspension effects (those only occur in the workflow VM context in @workflow/core).
packages/world-vercel/src/utils.ts
Outdated
| ); | ||
| } | ||
| return trace( | ||
| `WORLD.http ${method} ${endpoint}`, |
There was a problem hiding this comment.
The span name includes the full endpoint path which contains dynamic segments like run IDs (e.g., /v2/runs/${id}). This creates high cardinality in span names, which can cause performance issues in observability backends and make metrics aggregation difficult. Consider parameterizing the endpoint in the span name (e.g., replacing IDs with placeholders like /v2/runs/{id}) or moving the full endpoint to an attribute only. The endpoint is already captured in the WorldHttpEndpoint attribute.
| `WORLD.http ${method} ${endpoint}`, | |
| `WORLD.http ${method}`, |
There was a problem hiding this comment.
Fixed in commit ba58e13. The span name is now just HTTP {method} without the endpoint path. Per OTEL conventions for HTTP client spans, the recommended span name format is just the method. The full URL is captured in the url.full attribute for filtering/searching.
- Add tracing to HTTP requests in world-vercel with method, endpoint, and status attributes - Add tracing to storage operations (runs, steps, events, hooks) - Add tracing to event loading with event count and pages loaded attributes - Add queue timing breakdown attributes (deserialize, execution, serialize times) - Add new semantic conventions for world/storage, events, serialization, and queue breakdown Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Replace custom world.http.* attributes with standard OTEL HTTP conventions
(http.request.method, http.response.status_code, url.full, server.address, server.port)
- Replace custom queue.* attributes with standard OTEL messaging conventions
(messaging.system, messaging.destination.name, messaging.message.id, messaging.operation.type)
- Add error.type attribute for HTTP error spans
- Add span.recordException() for proper error recording
- Namespace workflow-specific attributes under workflow.* prefix
- Fix span naming: use "HTTP {method}" instead of "WORLD.http {method} {endpoint}"
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove unused semantic conventions (SerializeFormat, SerializeBytes, DeserializeFormat, DeserializeBytes, WorldParseBytes) - Add documentation explaining intentional duplication in world-vercel/telemetry.ts - Add documentation explaining why WorkflowSuspension isn't handled in world-vercel - Add comment about deserialize timing measurement limitation Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- world.runs, world.steps, world.events, world.hooks (storage)
- world.parse, world.validate (HTTP parsing)
- http {method} (HTTP client spans)
- workflow.run, workflow.start, workflow.loadEvents
- step {name}, hook.resume
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
f8ce3d7 to
b142e22
Compare
Summary
instrumentObject()Key Changes
@workflow/coretelemetry/semantic-conventions.tsgetAllWorkflowRunEvents()now wrapped withWORKFLOW.loadEventsspanqueue.deserialize_time_ms,queue.execution_time_ms,queue.serialize_time_ms@workflow/world-vercelmakeRequest()wrapped withWORLD.http {method} {endpoint}span (SpanKind.CLIENT)WORLD.parseandWORLD.validateinstrumentObject()@workflow/coreTest plan
pnpm buildpassespnpm testpasses for@workflow/core(1 pre-existing flaky test)pnpm testpasses for@workflow/world-vercel🤖 Generated with Claude Code