fix(workflows): fix race condition of workflow waking before commit#2748
fix(workflows): fix race condition of workflow waking before commit#2748MasterPtato wants to merge 1 commit intomainfrom
Conversation
|
The latest updates on your projects. Learn more about Vercel for Git ↗︎
|
WalkthroughThis update introduces significant changes to Grafana trace dashboards, workflow database logic, and test infrastructure. It revises SQL queries, panel layouts, and error handling in dashboards, enhances workflow wake logic and error variants in core Rust modules, and updates test execution strategy by removing some test files and refactoring others. Minor improvements to timestamp formatting and logging are also included. Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant Grafana
participant DB
User->>Grafana: View Traces Dashboard
Grafana->>DB: Run updated SQL queries for panels
DB-->>Grafana: Return trace, error, and log data
Grafana-->>User: Display updated panels, tables, and logs
sequenceDiagram
participant Workflow
participant DB
participant Worker
Workflow->>DB: complete_workflow()
DB->>Workflow: Return wrote_to_wake_idx flag
alt wrote_to_wake_idx == true
Workflow->>Worker: wake_worker()
end
Workflow->>DB: commit_workflow()
DB-->>Workflow: Commit result
Workflow->>Worker: wake_worker() (always after commit)
Poem
✨ Finishing Touches
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
|
Claude finished @MasterPtato's task —— View job Code Review Complete ✅Reviewing PR for race condition fix in workflows. Review Tasks
Key Findings🔍 Race Condition FixThe core issue being addressed is a race condition where workflows could wake before their state was committed to the database. The main fixes are:
📊 Observability Improvements
🧪 Test Strategy Changes
Code Quality Assessment✅ Strengths
|
How to use the Graphite Merge QueueAdd the label merge-queue to this PR to add it to the merge queue. You must have a Graphite account in order to use the merge queue. Sign up using this link. An organization admin has enabled the Graphite Merge Queue in this repository. Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue. This stack of pull requests is managed by Graphite. Learn more about stacking. |
There was a problem hiding this comment.
Actionable comments posted: 3
🔭 Outside diff range comments (1)
docker/dev-full/grafana/dashboards/traces.json (1)
225-233: Stray zero-width characters break dashboard linksThe URLs contain invisible U+FEFF characters (
). These wreak havoc when Grafana renders the link, resulting in a 404.-"/d/8klBUGfVk/otel-traces?${__url_time_range}&${service_name:queryparam}&var-trace_id=${__value.raw}" +"/d/8klBUGfVk/otel-traces?${__url_time_range}&${service_name:queryparam}&var-trace_id=${__value.raw}"Remove the zero-width chars here and in the similar link below (lines 249-251).
🧹 Nitpick comments (1)
docker/dev-full/otel-collector/config.yaml (1)
51-52: Commented-out exporter is fine, but consider a clearer toggle mechanismKeeping the debug exporter as commented YAML works, yet teammates may miss it. A small
OTEL_DEBUG_EXPORTER=trueenv-driven include (via${env:…}or a separateotel-collector.debug.yaml) gives a cleaner on/off switch without editing the file each time.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
docker/dev-full/grafana/dashboards/traces.json(10 hunks)docker/dev-full/otel-collector/config.yaml(1 hunks)packages/common/chirp-workflow/core/src/ctx/listen.rs(1 hunks)packages/common/chirp-workflow/core/src/db/crdb_nats/mod.rs(1 hunks)packages/common/chirp-workflow/core/src/db/fdb_sqlite_nats/mod.rs(6 hunks)packages/common/chirp-workflow/core/src/error.rs(1 hunks)packages/common/chirp-workflow/core/src/worker.rs(1 hunks)packages/common/chirp-workflow/core/tests/common.rs(0 hunks)packages/common/chirp-workflow/core/tests/integration.rs(1 hunks)packages/common/chirp-workflow/core/tests/sqlite.rs(0 hunks)packages/common/chirp/client/src/metrics.rs(1 hunks)packages/common/server-cli/src/util/wf/mod.rs(2 hunks)packages/common/server-cli/src/util/wf/signal.rs(2 hunks)
💤 Files with no reviewable changes (2)
- packages/common/chirp-workflow/core/tests/common.rs
- packages/common/chirp-workflow/core/tests/sqlite.rs
🧰 Additional context used
🧬 Code Graph Analysis (3)
packages/common/server-cli/src/util/wf/mod.rs (4)
packages/common/server-cli/src/util/format.rs (2)
indent_string(3-22)colored_json(31-33)packages/common/chirp-workflow/core/src/db/fdb_sqlite_nats/mod.rs (2)
tags(743-754)tags(2558-2564)packages/common/server-cli/src/commands/wf/signal.rs (1)
tags(48-51)packages/common/server-cli/src/commands/wf/mod.rs (1)
tags(89-92)
packages/common/server-cli/src/util/wf/signal.rs (4)
packages/common/server-cli/src/util/wf/mod.rs (1)
display_state(587-595)packages/common/chirp-workflow/core/src/builder/common/signal.rs (1)
tags(55-68)packages/common/server-cli/src/commands/wf/signal.rs (1)
tags(48-51)packages/common/server-cli/src/util/format.rs (2)
indent_string(3-22)colored_json(31-33)
packages/common/chirp-workflow/core/src/db/fdb_sqlite_nats/mod.rs (5)
packages/common/pools/src/pools.rs (1)
fdb(199-201)packages/common/chirp-workflow/core/src/ctx/activity.rs (1)
fdb(247-249)packages/common/chirp-workflow/core/src/ctx/api.rs (1)
fdb(251-253)packages/common/chirp-workflow/core/src/ctx/operation.rs (1)
fdb(275-277)packages/common/chirp-workflow/core/src/ctx/standalone.rs (1)
fdb(262-264)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: cargo-check-main
- GitHub Check: claude-review
🔇 Additional comments (14)
packages/common/server-cli/src/util/wf/signal.rs (3)
29-29: Enhanced timestamp precision improves debugging.The addition of milliseconds (%.3f) to timestamp formatting provides better precision for debugging workflow timing issues, which aligns well with the PR objective of fixing race condition issues.
Also applies to: 38-38
51-57: Tags display implementation looks correct.The optional tags field is properly handled with null checking and follows the same formatting pattern as other fields. The use of
colored_jsonandindent_stringwith consistent styling maintains visual consistency.
59-65: Workflow ID display adds valuable context.The optional workflow_id field provides useful debugging information for signal routing, especially helpful for troubleshooting workflow wake-up timing issues mentioned in the PR objective.
packages/common/server-cli/src/util/wf/mod.rs (2)
61-61: Consistent timestamp precision enhancement.The milliseconds addition (%.3f) matches the formatting improvements in signal.rs, providing consistent timestamp precision across workflow CLI utilities.
71-75: Tags display follows consistent formatting pattern.The tags display implementation uses the same
colored_jsonandindent_stringapproach as signal.rs, maintaining visual consistency. The direct access toworkflow.tagsis appropriate since workflow tags appear to be mandatory based on the table implementation below (line 623).packages/common/chirp/client/src/metrics.rs (1)
40-41: Good wording improvementThe revised log string is clearer and still precise. No further action needed.
docker/dev-full/grafana/dashboards/traces.json (1)
731-733: Macro typo:__conditionalAlldoesn’t exist in ClickHouse pluginThe function name should be
__conditionalAllwithout the leading underscores in most plugin versions, or you can drop the macro and reuse the tuple pattern shown above to stay version-agnostic.packages/common/chirp-workflow/core/src/error.rs (1)
147-149: Good addition for error categorization.The new
ConnSqlxvariant provides clearer distinction between connection acquisition failures and general SQL errors, which helps with debugging and monitoring.packages/common/chirp-workflow/core/src/db/crdb_nats/mod.rs (1)
59-59: Correct usage of the new error variant.The change properly categorizes connection acquisition failures using the new
ConnSqlxvariant.packages/common/chirp-workflow/core/src/db/fdb_sqlite_nats/mod.rs (3)
1551-1712: Good optimization to reduce unnecessary wake calls.The change to track
wrote_to_wake_idxand conditionally callwake_worker()is a smart optimization that avoids unnecessary wake operations when no workflows are waiting for this one to complete.
1860-1877: Critical race condition fix with excellent documentation.This unconditional
wake_worker()call effectively prevents the race condition where a workflow could miss its wake condition between lease check and commit. The detailed comment clearly explains the scenario and justifies why this approach is simpler than alternatives. This is the key fix addressing the PR's objective.
2287-2287: Good comment clarification.The updated comment more accurately describes what's being checked.
packages/common/chirp-workflow/core/src/ctx/listen.rs (1)
86-86: LGTM: Improved signal lag monitoring sensitivity.Reducing the threshold from 15.0 to 3.0 seconds will help detect workflow timing issues earlier, which aligns well with the PR's goal of fixing race conditions related to workflow waking.
packages/common/chirp-workflow/core/src/worker.rs (1)
115-118: LGTM: Improved shutdown logging accuracy.Pre-filtering finished workflows before logging the remaining count ensures accurate reporting during shutdown. This enhances debugging capabilities and provides clearer operational visibility.
9bf40be to
f8f8729
Compare
|
Claude finished @MasterPtato's task —— View job Code Review Complete ✅Reviewing PR for race condition fix in workflows. Review Tasks
Key Findings🔍 Race Condition FixThe core issue being addressed is a race condition where workflows could wake before their state was committed to the database. The main fixes are:
📊 Observability Improvements
🧪 Test Strategy Changes
Code Quality Assessment✅ Strengths
|
There was a problem hiding this comment.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (15)
docker/dev-full/grafana/dashboards/traces.json(10 hunks)docker/dev-full/otel-collector/config.yaml(1 hunks)packages/common/chirp-workflow/core/src/ctx/listen.rs(1 hunks)packages/common/chirp-workflow/core/src/db/crdb_nats/mod.rs(1 hunks)packages/common/chirp-workflow/core/src/db/fdb_sqlite_nats/mod.rs(6 hunks)packages/common/chirp-workflow/core/src/error.rs(1 hunks)packages/common/chirp-workflow/core/src/worker.rs(1 hunks)packages/common/chirp-workflow/core/tests/common.rs(0 hunks)packages/common/chirp-workflow/core/tests/integration.rs(1 hunks)packages/common/chirp-workflow/core/tests/sqlite.rs(0 hunks)packages/common/chirp/client/src/metrics.rs(1 hunks)packages/common/server-cli/src/commands/start.rs(1 hunks)packages/common/server-cli/src/util/wf/mod.rs(2 hunks)packages/common/server-cli/src/util/wf/signal.rs(2 hunks)packages/edge/services/pegboard/standalone/ws/src/lib.rs(1 hunks)
💤 Files with no reviewable changes (2)
- packages/common/chirp-workflow/core/tests/common.rs
- packages/common/chirp-workflow/core/tests/sqlite.rs
✅ Files skipped from review due to trivial changes (4)
- packages/common/server-cli/src/commands/start.rs
- packages/edge/services/pegboard/standalone/ws/src/lib.rs
- packages/common/chirp/client/src/metrics.rs
- packages/common/chirp-workflow/core/src/db/crdb_nats/mod.rs
🚧 Files skipped from review as they are similar to previous changes (8)
- docker/dev-full/otel-collector/config.yaml
- packages/common/chirp-workflow/core/src/ctx/listen.rs
- packages/common/chirp-workflow/core/src/error.rs
- packages/common/chirp-workflow/core/src/worker.rs
- packages/common/server-cli/src/util/wf/mod.rs
- packages/common/server-cli/src/util/wf/signal.rs
- packages/common/chirp-workflow/core/tests/integration.rs
- packages/common/chirp-workflow/core/src/db/fdb_sqlite_nats/mod.rs
🧰 Additional context used
🧠 Learnings (1)
docker/dev-full/grafana/dashboards/traces.json (1)
Learnt from: MasterPtato
PR: rivet-gg/rivet#2748
File: docker/dev-full/grafana/dashboards/traces.json:178-181
Timestamp: 2025-07-15T00:17:16.875Z
Learning: In Grafana dashboards, when "All" is selected for multi-value template variables like service_name and span_name, Grafana automatically handles the expansion and doesn't pass literal "__all" strings to the query that would cause IN ('__all') failures. The ServiceName IN (${service_name:singlequote}) and SpanName IN (${span_name:singlequote}) patterns work correctly without needing tuple conditional logic to check for 'All' values.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: cargo-check-main
- GitHub Check: claude-review
🔇 Additional comments (1)
docker/dev-full/grafana/dashboards/traces.json (1)
730-732: Non-standard macro__$conditionalAll– verify or replaceThe query relies on
__$conditionalAll(TraceId IN (${trace_id:singlequote}), $trace_id).
This macro is not part of upstream Grafana nor the official ClickHouse plugin; if it isn’t defined by your own provisioning, the query will be rendered literally and fail.Please confirm the macro exists in the target environment or switch to the regular pattern that Grafana expands automatically for “All”:
- $__conditionalAll(TraceId IN (${trace_id:singlequote}), $trace_id) + TraceId IN (${trace_id:singlequote})(or add the tuple guard if needed).
| "pluginVersion": "4.0.6", | ||
| "queryType": "traces", | ||
| "rawSql": "WITH '${trace_id}' as trace_id, (SELECT min(Start) FROM \"otel\".\"otel_traces_trace_id_ts\" WHERE TraceId = trace_id) as trace_start, (SELECT max(End) + 1 FROM \"otel\".\"otel_traces_trace_id_ts\" WHERE TraceId = trace_id) as trace_end SELECT \"TraceId\" as traceID, \"SpanId\" as spanID, \"ParentSpanId\" as parentSpanID, \"ServiceName\" as serviceName, \"SpanName\" as operationName, \"Timestamp\" as startTime, multiply(\"Duration\", 0.000001) as duration, arrayMap(key -> map('key', key, 'value',\"SpanAttributes\"[key]), mapKeys(\"SpanAttributes\")) as tags, arrayMap(key -> map('key', key, 'value',\"ResourceAttributes\"[key]), mapKeys(\"ResourceAttributes\")) as serviceTags FROM \"otel\".\"otel_traces\" WHERE traceID = trace_id AND startTime >= trace_start AND startTime <= trace_end AND ( Duration > 0 ) ORDER BY Timestamp DESC, Duration DESC LIMIT 1000", | ||
| "rawSql": "WITH\n\t'${trace_id}' as trace_id,\n\t(SELECT min(Start) FROM otel.otel_traces_trace_id_ts WHERE TraceId = trace_id) as trace_start,\n\t(SELECT max(End) + 1 FROM otel.otel_traces_trace_id_ts WHERE TraceId = trace_id) as trace_end\nSELECT\n\tTraceId as traceID,\n\tSpanId as spanID,\n\tParentSpanId as parentSpanID,\n\tServiceName as serviceName,\n\tSpanName as operationName, Timestamp as startTime,\n\tmultiply(Duration, 0.000001) as duration,\n\tarrayMap(key -> map('key', key, 'value', SpanAttributes[key]), mapKeys(SpanAttributes)) as tags,\n\tarrayMap(key -> map('key', key, 'value', ResourceAttributes[key]), mapKeys(ResourceAttributes)) as serviceTags,\n\tarrayMap((name, timestamp, attributes) -> tuple(name, toString(multiply(toUnixTimestamp64Nano(timestamp), 0.000001)), arrayMap( key -> map('key', key, 'value', attributes[key]), mapKeys(attributes)))::Tuple(name String, timestamp String, fields Array(Map(String, String))), `Events.Name`, `Events.Timestamp`, `Events.Attributes`) AS logs,\n\tarrayMap((traceID, spanID, attributes) -> tuple(traceID, spanID, arrayMap(key -> map('key', key, 'value', attributes[key]), mapKeys(attributes)))::Tuple(traceID String, spanID String, tags Array(Map(String, String))), `Links.TraceId`, `Links.SpanId`, `Links.Attributes`) AS references\nFROM otel.otel_traces\nWHERE\n\ttraceID = trace_id AND startTime >= trace_start AND startTime <= trace_end AND ( Duration > 0 )\nORDER BY Timestamp DESC, Duration DESC\nLIMIT 1000", | ||
| "refId": "A" |
There was a problem hiding this comment.
startTime alias used in the WHERE-clause breaks the query
ClickHouse does not allow referencing a column alias in the WHERE clause.
startTime is defined in the SELECT list (Timestamp as startTime) but is then reused in the filter:
... WHERE
traceID = trace_id AND startTime >= trace_start AND startTime <= trace_end ...At runtime this raises: “Unknown identifier: startTime”.
Filter on the original column instead.
-WHERE
- traceID = trace_id AND startTime >= trace_start AND startTime <= trace_end
+WHERE
+ TraceId = trace_id
+ AND Timestamp >= trace_start
+ AND Timestamp <= trace_end🤖 Prompt for AI Agents
In docker/dev-full/grafana/dashboards/traces.json around lines 586 to 589, the
SQL query uses the alias 'startTime' in the WHERE clause, which ClickHouse does
not support and causes an "Unknown identifier" error. To fix this, replace all
instances of 'startTime' in the WHERE clause with the original column name
'Timestamp' to ensure the query runs correctly.
Merge activity
|
…2748) <!-- Please make sure there is an issue that this PR is correlated to. --> ## Changes <!-- If there are frontend changes, please include screenshots. --> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Enhanced CLI output to display workflow and signal timestamps with millisecond precision. * CLI now shows workflow and signal tags and workflow IDs in a formatted, readable manner. * **Improvements** * Grafana traces dashboard updated: simplified queries, improved trace table with error indicators, reorganized panels, and enhanced trace details. * Logs and error panels in the dashboard were restructured for better visibility and usability. * Worker shutdown process improved to accurately reflect remaining workflows at start. * Workflow processing logic updated to ensure timely worker wake-up and avoid race conditions. * **Bug Fixes** * Reduced the threshold for logging long signal receive lag, enabling earlier detection of delays. * **Chores** * Added a new error variant for SQL connection failures to improve error clarity. * Updated test structure: converted async test to sync with explicit runtime and removed obsolete test files. * Improved internal logging messages for metrics loop shutdown. * Minor comment and typo corrections for clarity. <!-- end of auto-generated comment: release notes by coderabbit.ai -->

Changes
Summary by CodeRabbit
New Features
Improvements
Bug Fixes
Chores