fix(workflows): fix race condition of workflow waking before commit by MasterPtato · Pull Request #2748 · rivet-dev/rivet

MasterPtato · 2025-07-14T23:59:51Z

Changes

Summary by CodeRabbit

New Features
- Enhanced CLI output to display workflow and signal timestamps with millisecond precision.
- CLI now shows workflow and signal tags and workflow IDs in a formatted, readable manner.
Improvements
- Grafana traces dashboard updated: simplified queries, improved trace table with error indicators, reorganized panels, and enhanced trace details.
- Logs and error panels in the dashboard were restructured for better visibility and usability.
- Worker shutdown logging now accurately reports the number of workflows still running when shutdown starts.
- Workflow processing logic updated to ensure timely worker wake-up and avoid race conditions.
Bug Fixes
- Reduced the threshold for logging long signal receive lag, enabling earlier detection of delays.
Chores
- Added a new error variant for SQL connection failures to improve error clarity.
- Updated test structure: converted async test to sync with explicit runtime and removed obsolete test files.
- Improved internal logging messages for metrics loop shutdown.
- Minor comment and typo corrections for clarity.

vercel · 2025-07-14T23:59:55Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name	Status	Preview	Comments	Updated (UTC)
rivet-hub	❌ Failed (Inspect)			Jul 16, 2025 5:34pm
rivet-hub-pe86	❌ Failed (Inspect)			Jul 16, 2025 5:34pm
rivet-site	✅ Ready (Inspect)	Visit Preview	💬 Add feedback	Jul 16, 2025 5:34pm

coderabbitai · 2025-07-14T23:59:57Z

Walkthrough

This update introduces significant changes to Grafana trace dashboards, workflow database logic, and test infrastructure. It revises SQL queries, panel layouts, and error handling in dashboards, enhances workflow wake logic and error variants in core Rust modules, and updates test execution strategy by removing some test files and refactoring others. Minor improvements to timestamp formatting and logging are also included.

Changes

File(s)	Change Summary
docker/dev-full/grafana/dashboards/traces.json	Major overhaul of trace dashboard: dashboard ID/version changed, panel IDs updated, queries simplified and enhanced, error panels removed and re-added with new queries, logs row reintroduced, new columns and field configs added, and panel layouts adjusted to full width.
docker/dev-full/otel-collector/config.yaml	Added a commented-out "debug" exporter section with detailed verbosity; no functional changes.
packages/common/chirp-workflow/core/src/ctx/listen.rs	Reduced error logging threshold for signal receive lag from 15s to 3s.
packages/common/chirp-workflow/core/src/db/crdb_nats/mod.rs	Changed error mapping in connection acquisition from `WorkflowError::Sqlx` to new variant `WorkflowError::ConnSqlx`.
packages/common/chirp-workflow/core/src/db/fdb_sqlite_nats/mod.rs	Refined workflow wake logic: track if wake index entries were written, conditionally call `wake_worker()` after completion, always wake worker after commit to avoid race conditions, and clarified related comments.
packages/common/chirp-workflow/core/src/error.rs	Added new enum variant: `ConnSqlx(sqlx::Error)` with custom error message for connection errors.
packages/common/chirp-workflow/core/src/worker.rs	Cleans up finished workflows from the running map before logging shutdown start and the remaining count.
packages/common/chirp-workflow/core/tests/common.rs packages/common/chirp-workflow/core/tests/sqlite.rs	Deleted test utility for tracing setup and an async SQLite integration test with concurrent workers and timing statistics.
packages/common/chirp-workflow/core/tests/integration.rs	Converted async Tokio test to synchronous test with manual runtime execution; moved async logic to helper function; removed unused imports and commented code; added tracing instrumentation.
packages/common/chirp/client/src/metrics.rs	Clarified shutdown log message from "stopping loop" to "stopping metrics loop".
packages/common/server-cli/src/util/wf/mod.rs	Workflow creation timestamps now include milliseconds; workflow tags are pretty-printed as indented colored JSON before error info.
packages/common/server-cli/src/util/wf/signal.rs	Signal creation and acknowledgment timestamps now include milliseconds; optional `tags` and `workflow_id` fields are pretty-printed if present in pretty output.
packages/common/server-cli/src/commands/start.rs	Fixed typo in comment from "Select services t orun" to "Select services to run".
packages/edge/services/pegboard/standalone/ws/src/lib.rs	Clarified comment in ping update function to specify parallelization at operation level; no code changes.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Grafana
    participant DB

    User->>Grafana: View Traces Dashboard
    Grafana->>DB: Run updated SQL queries for panels
    DB-->>Grafana: Return trace, error, and log data
    Grafana-->>User: Display updated panels, tables, and logs

sequenceDiagram
    participant Workflow
    participant DB
    participant Worker

    Workflow->>DB: complete_workflow()
    DB->>Workflow: Return wrote_to_wake_idx flag
    alt wrote_to_wake_idx == true
        Workflow->>Worker: wake_worker()
    end
    Workflow->>DB: commit_workflow()
    DB-->>Workflow: Commit result
    Workflow->>Worker: wake_worker() (always after commit)

Poem

A dashboard renewed, with panels re-spun,
Wake logic refined so the workflows can run.
Milliseconds now twinkle in timestamps so bright,
And errors caught early, no bugs in sight!
Old tests hop away, new logs leap in—
With every commit, let the workflow begin!

((\
( -.-)
o_(")(")

claude · 2025-07-15T00:00:08Z

Claude finished @MasterPtato's task —— View job

Code Review Complete ✅

Reviewing PR for race condition fix in workflows.

Review Tasks

Analyze core workflow changes in chirp-workflow
Review database implementation changes (FDB/SQLite)
Examine deleted test files and testing strategy
Check monitoring/observability updates
Evaluate code quality and potential issues
Assess security and performance implications
Provide comprehensive feedback

Key Findings

🔍 Race Condition Fix

The core issue being addressed is a race condition where workflows could wake before their state was committed to the database. The main fixes are:

FDB/SQLite Driver (fdb_sqlite_nats/mod.rs:1023-1039): Added wake_worker() call at the end of commit_workflow() with detailed comment explaining the race condition scenario
Signal Handling (fdb_sqlite_nats/mod.rs:1089-1102): Added signal wake index writing during signal pulls to prevent race conditions when signals arrive between transaction and commit
Sub-workflow Handling (fdb_sqlite_nats/mod.rs:1135-1147): Similar fix for sub-workflow completion races

📊 Observability Improvements

Added Prometheus scraping for cadvisor metrics in OTEL collector config
Extensive Grafana dashboard updates (traces.json) with workflow-specific panels

🧪 Test Strategy Changes

Deleted: common.rs (16 lines) and sqlite.rs (209 lines) test files
Simplified: Integration test now focuses on FDB/SQLite/NATS driver only
Concern: Significant reduction in test coverage without clear justification

Code Quality Assessment

✅ Strengths

Well-documented fix: Excellent inline comments explaining the race condition and solution
Defensive programming: Proactive wake calls to prevent edge cases
Minimal core changes: Most changes confined to database drivers
Consistent patterns: Similar fixes applied across signal and sub-workflow handling

⚠️ Areas of Concern

Test Coverage Reduction: Removing 225 lines of tests without replacement is concerning for a race condition fix
Performance Impact: Additional wake_worker() calls may increase unnecessary worker wake-ups
Minor formatting: Empty line at signal.rs:43

Security Analysis

✅ No security concerns identified. Changes are internal workflow orchestration improvements.

Performance Considerations

Potential Issues:

Additional NATS publishes from extra wake_worker() calls
Wake-ups may become more frequent, though comments indicate this is intentional for correctness

Mitigations:

Worker wake-ups should be idempotent per design
Better to have extra wake-ups than missed workflow executions
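
The idempotency argument above can be illustrated with a coalescing flag. This is only a sketch under stated assumptions: `WakeFlag`, `wake`, and `take` are hypothetical names, and the real driver publishes wake messages over NATS rather than setting an in-process flag.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical stand-in for a worker's wake signal (not the chirp-workflow API).
pub struct WakeFlag {
    pending: AtomicBool,
}

impl WakeFlag {
    pub fn new() -> Self {
        WakeFlag {
            pending: AtomicBool::new(false),
        }
    }

    /// Request a wake-up. Calling this several times before the worker polls
    /// has the same effect as calling it once.
    pub fn wake(&self) {
        self.pending.store(true, Ordering::SeqCst);
    }

    /// Called by the worker loop: returns true at most once per batch of wakes,
    /// then resets the flag.
    pub fn take(&self) -> bool {
        self.pending.swap(false, Ordering::SeqCst)
    }
}
```

With this shape, an extra `wake()` costs one redundant store, while a missing `wake()` would leave the worker idle, which is the trade-off the mitigation list describes.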

Recommendations

🔴 Critical

Test Coverage: Add back essential test cases or provide justification for why the deleted tests are no longer needed
Integration Tests: Consider adding specific tests for the race condition scenarios described in comments

🟡 Minor

Remove empty line at packages/common/server-cli/src/util/wf/signal.rs:43
Consider adding metrics to monitor frequency of "defensive" wake calls

Conclusion

This is a well-engineered fix for a legitimate race condition in workflow orchestration. The implementation is thoughtful with clear documentation of the problem and solution. However, the significant reduction in test coverage is concerning for a concurrency fix.

Recommendation: APPROVE with test coverage addressed

MasterPtato · 2025-07-15T00:00:09Z

fix(workflows): fix race condition of workflow waking before commit #2748 👈 (View in Graphite)
main

How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository. Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

cloudflare-workers-and-pages · 2025-07-15T00:02:26Z

Deploying rivet with Cloudflare Pages

Latest commit:	`f8f8729`
Status:	🚫 Build failed.

View logs

coderabbitai

Actionable comments posted: 3

🔭 Outside diff range comments (1)

docker/dev-full/grafana/dashboards/traces.json (1)
225-233: Stray zero-width characters break dashboard links

The URLs contain invisible U+FEFF (zero-width no-break space) characters. These break the link when Grafana renders it, resulting in a 404.
-"/d/8klBUGfVk/otel-traces?${__url_time_range}&${service_name:queryparam}&var-trace_id=${__value.raw}"
+"/d/8klBUGfVk/otel-traces?${__url_time_range}&${service_name:queryparam}&var-trace_id=${__value.raw}"
Remove the zero-width chars here and in the similar link below (lines 249-251).
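
Because U+FEFF is invisible in most editors, a small check can help locate it. The following is a minimal sketch; the function names are hypothetical and not part of the repository.

```rust
/// Remove zero-width no-break space (U+FEFF) characters from a string.
pub fn strip_bom_chars(input: &str) -> String {
    input.chars().filter(|&c| c != '\u{FEFF}').collect()
}

/// Report the byte offsets of any U+FEFF characters, e.g. for a lint-style check.
pub fn find_bom_offsets(input: &str) -> Vec<usize> {
    input
        .char_indices()
        .filter(|&(_, c)| c == '\u{FEFF}')
        .map(|(offset, _)| offset)
        .collect()
}
```

Running `find_bom_offsets` over each link URL in the dashboard JSON would flag the affected strings without relying on visual inspection.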

🧹 Nitpick comments (1)

docker/dev-full/otel-collector/config.yaml (1)

51-52: Commented-out exporter is fine, but consider a clearer toggle mechanism

Keeping the debug exporter as commented YAML works, yet teammates may miss it. A small OTEL_DEBUG_EXPORTER=true env-driven include (via ${env:…} or a separate otel-collector.debug.yaml) gives a cleaner on/off switch without editing the file each time.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a07577d and 9bf40be.

📒 Files selected for processing (13)

docker/dev-full/grafana/dashboards/traces.json (10 hunks)
docker/dev-full/otel-collector/config.yaml (1 hunks)
packages/common/chirp-workflow/core/src/ctx/listen.rs (1 hunks)
packages/common/chirp-workflow/core/src/db/crdb_nats/mod.rs (1 hunks)
packages/common/chirp-workflow/core/src/db/fdb_sqlite_nats/mod.rs (6 hunks)
packages/common/chirp-workflow/core/src/error.rs (1 hunks)
packages/common/chirp-workflow/core/src/worker.rs (1 hunks)
packages/common/chirp-workflow/core/tests/common.rs (0 hunks)
packages/common/chirp-workflow/core/tests/integration.rs (1 hunks)
packages/common/chirp-workflow/core/tests/sqlite.rs (0 hunks)
packages/common/chirp/client/src/metrics.rs (1 hunks)
packages/common/server-cli/src/util/wf/mod.rs (2 hunks)
packages/common/server-cli/src/util/wf/signal.rs (2 hunks)

💤 Files with no reviewable changes (2)

packages/common/chirp-workflow/core/tests/common.rs
packages/common/chirp-workflow/core/tests/sqlite.rs

🧰 Additional context used

🧬 Code Graph Analysis (3)

packages/common/server-cli/src/util/wf/mod.rs (4)

packages/common/server-cli/src/util/format.rs (2)

indent_string (3-22)

colored_json (31-33)

packages/common/chirp-workflow/core/src/db/fdb_sqlite_nats/mod.rs (2)

tags (743-754)

tags (2558-2564)

packages/common/server-cli/src/commands/wf/signal.rs (1)

tags (48-51)

packages/common/server-cli/src/commands/wf/mod.rs (1)

tags (89-92)

packages/common/server-cli/src/util/wf/signal.rs (4)

packages/common/server-cli/src/util/wf/mod.rs (1)

display_state (587-595)

packages/common/chirp-workflow/core/src/builder/common/signal.rs (1)

tags (55-68)

packages/common/server-cli/src/commands/wf/signal.rs (1)

tags (48-51)

packages/common/server-cli/src/util/format.rs (2)

indent_string (3-22)

colored_json (31-33)

packages/common/chirp-workflow/core/src/db/fdb_sqlite_nats/mod.rs (5)

packages/common/pools/src/pools.rs (1)

fdb (199-201)

packages/common/chirp-workflow/core/src/ctx/activity.rs (1)

fdb (247-249)

packages/common/chirp-workflow/core/src/ctx/api.rs (1)

fdb (251-253)

packages/common/chirp-workflow/core/src/ctx/operation.rs (1)

fdb (275-277)

packages/common/chirp-workflow/core/src/ctx/standalone.rs (1)

fdb (262-264)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: cargo-check-main
GitHub Check: claude-review

🔇 Additional comments (14)

packages/common/server-cli/src/util/wf/signal.rs (3)

29-29: Enhanced timestamp precision improves debugging.

The addition of milliseconds (%.3f) to timestamp formatting provides better precision for debugging workflow timing issues, which aligns well with the PR objective of fixing race condition issues.

Also applies to: 38-38
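
To show what millisecond precision looks like in the output, here is a dependency-free sketch. It assumes the CLI formats timestamps with a chrono-style format string (where `%.3f` appends a three-digit fractional second); `format_time_millis` is a hypothetical helper that reproduces only the time-of-day portion.

```rust
/// Format milliseconds since the Unix epoch as a UTC time of day, `HH:MM:SS.mmm`.
/// Date handling is deliberately omitted to keep the sketch dependency-free.
pub fn format_time_millis(epoch_ms: u64) -> String {
    let millis = epoch_ms % 1000;
    let total_secs = epoch_ms / 1000;
    let secs = total_secs % 60;
    let mins = (total_secs / 60) % 60;
    let hours = (total_secs / 3600) % 24;
    format!("{:02}:{:02}:{:02}.{:03}", hours, mins, secs, millis)
}
```

Two events that land in the same second, such as a signal publish and the workflow wake that follows it, become distinguishable once the `.mmm` suffix is present.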

51-57: Tags display implementation looks correct.

The optional tags field is properly handled with null checking and follows the same formatting pattern as other fields. The use of colored_json and indent_string with consistent styling maintains visual consistency.
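
The optional-field pattern described here can be sketched as follows. The helpers are hypothetical stand-ins: the repository's `indent_string` and `colored_json` may have different signatures, and coloring is left out.

```rust
/// Prefix every line of `text` with `indent` (hypothetical helper, not the
/// repository's `indent_string`).
pub fn indent_lines(text: &str, indent: &str) -> String {
    text.lines()
        .map(|line| format!("{}{}", indent, line))
        .collect::<Vec<_>>()
        .join("\n")
}

/// Render an optional, already-serialized tags value. Absent tags render nothing.
pub fn render_tags(tags: Option<&str>) -> Option<String> {
    tags.map(|json| format!("tags\n{}", indent_lines(json, "    ")))
}
```

Returning `Option<String>` keeps the caller's output free of empty `tags` headings when a signal carries no tags.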

59-65: Workflow ID display adds valuable context.

The optional workflow_id field provides useful debugging information for signal routing, especially helpful for troubleshooting workflow wake-up timing issues mentioned in the PR objective.

packages/common/server-cli/src/util/wf/mod.rs (2)

61-61: Consistent timestamp precision enhancement.

The milliseconds addition (%.3f) matches the formatting improvements in signal.rs, providing consistent timestamp precision across workflow CLI utilities.

71-75: Tags display follows consistent formatting pattern.

The tags display implementation uses the same colored_json and indent_string approach as signal.rs, maintaining visual consistency. The direct access to workflow.tags is appropriate since workflow tags appear to be mandatory based on the table implementation below (line 623).

packages/common/chirp/client/src/metrics.rs (1)

40-41: Good wording improvement

The revised log string is clearer and still precise. No further action needed.

docker/dev-full/grafana/dashboards/traces.json (1)

731-733: Macro typo: __conditionalAll doesn’t exist in ClickHouse plugin

The function name should be __conditionalAll without the leading underscores in most plugin versions, or you can drop the macro and reuse the tuple pattern shown above to stay version-agnostic.

packages/common/chirp-workflow/core/src/error.rs (1)

147-149: Good addition for error categorization.

The new ConnSqlx variant provides clearer distinction between connection acquisition failures and general SQL errors, which helps with debugging and monitoring.
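
A reduced sketch of the distinction, using only the standard library. `FakeSqlxError` stands in for `sqlx::Error`, the error messages are invented, and `acquire_conn` is a hypothetical function; only the variant names `Sqlx` and `ConnSqlx` come from the change itself.

```rust
use std::fmt;

/// Stand-in for `sqlx::Error` so the sketch compiles without external crates.
#[derive(Debug)]
pub struct FakeSqlxError(pub String);

/// Reduced, hypothetical version of the error enum. The messages are invented.
#[derive(Debug)]
pub enum WorkflowError {
    /// A query failed after a connection was obtained.
    Sqlx(FakeSqlxError),
    /// Acquiring a connection from the pool failed.
    ConnSqlx(FakeSqlxError),
}

impl fmt::Display for WorkflowError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            WorkflowError::Sqlx(e) => write!(f, "sql error: {}", e.0),
            WorkflowError::ConnSqlx(e) => write!(f, "sql connection error: {}", e.0),
        }
    }
}

/// Simulated pool acquisition that maps failures to the connection-specific variant.
pub fn acquire_conn(pool_available: bool) -> Result<(), WorkflowError> {
    let raw: Result<(), FakeSqlxError> = if pool_available {
        Ok(())
    } else {
        Err(FakeSqlxError("pool timed out".to_string()))
    };
    raw.map_err(WorkflowError::ConnSqlx)
}
```

Callers can then match on `ConnSqlx` to treat pool exhaustion differently from a failed query, for example when deciding whether to retry.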

packages/common/chirp-workflow/core/src/db/crdb_nats/mod.rs (1)

59-59: Correct usage of the new error variant.

The change properly categorizes connection acquisition failures using the new ConnSqlx variant.

packages/common/chirp-workflow/core/src/db/fdb_sqlite_nats/mod.rs (3)

1551-1712: Good optimization to reduce unnecessary wake calls.

The change to track wrote_to_wake_idx and conditionally call wake_worker() is a smart optimization that avoids unnecessary wake operations when no workflows are waiting for this one to complete.
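
The conditional wake can be modeled in memory as below. This is a sketch, not the driver's code: `Driver`, the `waiting_parents` slice, and the `Vec` used as a wake index are all assumptions made for illustration; only `wrote_to_wake_idx` and `wake_worker` are names taken from the review.

```rust
/// Hypothetical in-memory model of the completion path.
pub struct Driver {
    /// Number of wake messages published so far.
    pub wakes_published: u32,
}

impl Driver {
    pub fn wake_worker(&mut self) {
        self.wakes_published += 1;
    }

    /// Complete a workflow. `waiting_parents` models the workflows blocked on
    /// this one; each gets a wake index entry.
    /// Returns whether any wake index entry was written.
    pub fn complete_workflow(&mut self, waiting_parents: &[u64], wake_idx: &mut Vec<u64>) -> bool {
        let mut wrote_to_wake_idx = false;
        for parent_id in waiting_parents {
            wake_idx.push(*parent_id);
            wrote_to_wake_idx = true;
        }
        // Only publish a wake if there is something for a worker to pick up.
        if wrote_to_wake_idx {
            self.wake_worker();
        }
        wrote_to_wake_idx
    }
}
```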

1860-1877: Critical race condition fix with excellent documentation.

This unconditional wake_worker() call effectively prevents the race condition where a workflow could miss its wake condition between lease check and commit. The detailed comment clearly explains the scenario and justifies why this approach is simpler than alternatives. This is the key fix addressing the PR's objective.
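
The scenario can be reduced to a deterministic toy model. It assumes, based only on the description above, that a signal sender wakes workers only when the target workflow is already committed as sleeping. `Store` and its fields are hypothetical and do not reflect the driver's real data model.

```rust
/// Simplified, hypothetical model of the race.
#[derive(Default)]
pub struct Store {
    /// True once the workflow's sleeping state has been committed.
    pub committed_sleeping: bool,
    /// A signal arrived for the workflow and has not been consumed.
    pub pending_signal: bool,
    /// Wake messages published to workers.
    pub wakes: u32,
}

impl Store {
    /// A signal arrives. The sender only wakes workers if the target is
    /// already committed as sleeping.
    pub fn publish_signal(&mut self) {
        self.pending_signal = true;
        if self.committed_sleeping {
            self.wakes += 1;
        }
    }

    /// Commit the workflow as sleeping. With `wake_after_commit`, a wake is
    /// always published afterwards so a worker re-checks wake conditions.
    pub fn commit_workflow(&mut self, wake_after_commit: bool) {
        self.committed_sleeping = true;
        if wake_after_commit {
            self.wakes += 1;
        }
    }

    /// The workflow is stuck if a signal is pending but no worker was woken.
    pub fn is_stuck(&self) -> bool {
        self.committed_sleeping && self.pending_signal && self.wakes == 0
    }
}
```

Calling `publish_signal()` before `commit_workflow(false)` leaves the model stuck, while `commit_workflow(true)` does not, which is why the unconditional wake is cheaper than trying to detect the interleaving.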

2287-2287: Good comment clarification.

The updated comment more accurately describes what's being checked.

packages/common/chirp-workflow/core/src/ctx/listen.rs (1)

86-86: LGTM: Improved signal lag monitoring sensitivity.

Reducing the threshold from 15.0 to 3.0 seconds will help detect workflow timing issues earlier, which aligns well with the PR's goal of fixing race conditions related to workflow waking.
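
A minimal sketch of such a threshold check. The constant and function names are hypothetical, and the strict `>` comparison is an assumption; only the 3-second value comes from the change.

```rust
use std::time::Duration;

/// Lag above which a signal receive is worth logging (3 s per the change above).
pub const SIGNAL_LAG_WARN_THRESHOLD: Duration = Duration::from_secs(3);

/// Time a signal waited between creation and receipt. Both arguments are
/// milliseconds since the Unix epoch; clock skew saturates to zero.
pub fn recv_lag(created_at_ms: u64, received_at_ms: u64) -> Duration {
    Duration::from_millis(received_at_ms.saturating_sub(created_at_ms))
}

/// Whether the lag exceeds the logging threshold.
pub fn should_log_lag(lag: Duration) -> bool {
    lag > SIGNAL_LAG_WARN_THRESHOLD
}
```

Under the previous 15-second threshold, a 4-second lag like `recv_lag(1_000, 5_000)` would have gone unreported.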

packages/common/chirp-workflow/core/src/worker.rs (1)

115-118: LGTM: Improved shutdown logging accuracy.

Pre-filtering finished workflows before logging the remaining count ensures accurate reporting during shutdown. This enhances debugging capabilities and provides clearer operational visibility.
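
The ordering matters because the count is only meaningful after finished entries are dropped. A sketch with hypothetical types (`WorkflowHandle` and the `u64` key are assumptions, not the worker's actual map):

```rust
use std::collections::HashMap;

/// Hypothetical handle for a running workflow task.
pub struct WorkflowHandle {
    pub finished: bool,
}

/// Drop finished workflows from the running map, then report how many remain.
pub fn remaining_at_shutdown(running: &mut HashMap<u64, WorkflowHandle>) -> usize {
    running.retain(|_, wf| !wf.finished);
    running.len()
}
```

Logging `running.len()` before the `retain` call would overstate the number of workflows the shutdown still has to wait for.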

claude · 2025-07-16T17:31:29Z

Claude finished @MasterPtato's task —— View job

Code Review Complete ✅

coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9bf40be and f8f8729.

📒 Files selected for processing (15)

docker/dev-full/grafana/dashboards/traces.json (10 hunks)
docker/dev-full/otel-collector/config.yaml (1 hunks)
packages/common/chirp-workflow/core/src/ctx/listen.rs (1 hunks)
packages/common/chirp-workflow/core/src/db/crdb_nats/mod.rs (1 hunks)
packages/common/chirp-workflow/core/src/db/fdb_sqlite_nats/mod.rs (6 hunks)
packages/common/chirp-workflow/core/src/error.rs (1 hunks)
packages/common/chirp-workflow/core/src/worker.rs (1 hunks)
packages/common/chirp-workflow/core/tests/common.rs (0 hunks)
packages/common/chirp-workflow/core/tests/integration.rs (1 hunks)
packages/common/chirp-workflow/core/tests/sqlite.rs (0 hunks)
packages/common/chirp/client/src/metrics.rs (1 hunks)
packages/common/server-cli/src/commands/start.rs (1 hunks)
packages/common/server-cli/src/util/wf/mod.rs (2 hunks)
packages/common/server-cli/src/util/wf/signal.rs (2 hunks)
packages/edge/services/pegboard/standalone/ws/src/lib.rs (1 hunks)

💤 Files with no reviewable changes (2)

packages/common/chirp-workflow/core/tests/common.rs
packages/common/chirp-workflow/core/tests/sqlite.rs

✅ Files skipped from review due to trivial changes (4)

packages/common/server-cli/src/commands/start.rs
packages/edge/services/pegboard/standalone/ws/src/lib.rs
packages/common/chirp/client/src/metrics.rs
packages/common/chirp-workflow/core/src/db/crdb_nats/mod.rs

🚧 Files skipped from review as they are similar to previous changes (8)

docker/dev-full/otel-collector/config.yaml
packages/common/chirp-workflow/core/src/ctx/listen.rs
packages/common/chirp-workflow/core/src/error.rs
packages/common/chirp-workflow/core/src/worker.rs
packages/common/server-cli/src/util/wf/mod.rs
packages/common/server-cli/src/util/wf/signal.rs
packages/common/chirp-workflow/core/tests/integration.rs
packages/common/chirp-workflow/core/src/db/fdb_sqlite_nats/mod.rs

🧰 Additional context used

🧠 Learnings (1)

docker/dev-full/grafana/dashboards/traces.json (1)

Learnt from: MasterPtato
PR: rivet-gg/rivet#2748
File: docker/dev-full/grafana/dashboards/traces.json:178-181
Timestamp: 2025-07-15T00:17:16.875Z
Learning: In Grafana dashboards, when "All" is selected for multi-value template variables like service_name and span_name, Grafana automatically handles the expansion and doesn't pass literal "__all" strings to the query that would cause IN ('__all') failures. The ServiceName IN (${service_name:singlequote}) and SpanName IN (${span_name:singlequote}) patterns work correctly without needing tuple conditional logic to check for 'All' values.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: cargo-check-main
GitHub Check: claude-review

🔇 Additional comments (1)

docker/dev-full/grafana/dashboards/traces.json (1)

730-732: Non-standard macro __$conditionalAll – verify or replace

The query relies on __$conditionalAll(TraceId IN (${trace_id:singlequote}), $trace_id).
This macro is not part of upstream Grafana nor the official ClickHouse plugin; if it isn’t defined by your own provisioning, the query will be rendered literally and fail.

Please confirm the macro exists in the target environment or switch to the regular pattern that Grafana expands automatically for “All”:
-  $__conditionalAll(TraceId IN (${trace_id:singlequote}),  $trace_id)
+  TraceId IN (${trace_id:singlequote})
(or add the tuple guard if needed).

coderabbitai · 2025-07-16T17:36:31Z

          "pluginVersion": "4.0.6",
          "queryType": "traces",
-          "rawSql": "WITH '${trace_id}' as trace_id, (SELECT min(Start) FROM \"otel\".\"otel_traces_trace_id_ts\" WHERE TraceId = trace_id) as trace_start, (SELECT max(End) + 1 FROM \"otel\".\"otel_traces_trace_id_ts\" WHERE TraceId = trace_id) as trace_end SELECT \"TraceId\" as traceID, \"SpanId\" as spanID, \"ParentSpanId\" as parentSpanID, \"ServiceName\" as serviceName, \"SpanName\" as operationName, \"Timestamp\" as startTime, multiply(\"Duration\", 0.000001) as duration, arrayMap(key -> map('key', key, 'value',\"SpanAttributes\"[key]), mapKeys(\"SpanAttributes\")) as tags, arrayMap(key -> map('key', key, 'value',\"ResourceAttributes\"[key]), mapKeys(\"ResourceAttributes\")) as serviceTags FROM \"otel\".\"otel_traces\" WHERE traceID = trace_id AND startTime >= trace_start AND startTime <= trace_end AND ( Duration > 0 ) ORDER BY Timestamp DESC, Duration DESC LIMIT 1000",
+          "rawSql": "WITH\n\t'${trace_id}' as trace_id,\n\t(SELECT min(Start) FROM otel.otel_traces_trace_id_ts WHERE TraceId = trace_id) as trace_start,\n\t(SELECT max(End) + 1 FROM otel.otel_traces_trace_id_ts WHERE TraceId = trace_id) as trace_end\nSELECT\n\tTraceId as traceID,\n\tSpanId as spanID,\n\tParentSpanId as parentSpanID,\n\tServiceName as serviceName,\n\tSpanName as operationName, Timestamp as startTime,\n\tmultiply(Duration, 0.000001) as duration,\n\tarrayMap(key -> map('key', key, 'value', SpanAttributes[key]), mapKeys(SpanAttributes)) as tags,\n\tarrayMap(key -> map('key', key, 'value', ResourceAttributes[key]), mapKeys(ResourceAttributes)) as serviceTags,\n\tarrayMap((name, timestamp, attributes) -> tuple(name, toString(multiply(toUnixTimestamp64Nano(timestamp), 0.000001)), arrayMap( key -> map('key', key, 'value', attributes[key]), mapKeys(attributes)))::Tuple(name String, timestamp String, fields Array(Map(String, String))), `Events.Name`, `Events.Timestamp`, `Events.Attributes`) AS logs,\n\tarrayMap((traceID, spanID, attributes) -> tuple(traceID, spanID, arrayMap(key -> map('key', key, 'value', attributes[key]), mapKeys(attributes)))::Tuple(traceID String, spanID String, tags Array(Map(String, String))), `Links.TraceId`, `Links.SpanId`, `Links.Attributes`) AS references\nFROM otel.otel_traces\nWHERE\n\ttraceID = trace_id AND startTime >= trace_start AND startTime <= trace_end AND ( Duration > 0 )\nORDER BY Timestamp DESC, Duration DESC\nLIMIT 1000",
          "refId": "A"


⚠️ Potential issue

startTime alias used in the WHERE-clause breaks the query

ClickHouse does not allow referencing a column alias in the WHERE clause.
startTime is defined in the SELECT list (Timestamp as startTime) but is then reused in the filter:

... WHERE traceID = trace_id AND startTime >= trace_start AND startTime <= trace_end ...

At runtime this raises: “Unknown identifier: startTime”.
Filter on the original column instead.

-WHERE
-  traceID = trace_id AND startTime >= trace_start AND startTime <= trace_end
+WHERE
+  TraceId = trace_id
+  AND Timestamp >= trace_start
+  AND Timestamp <= trace_end

🤖 Prompt for AI Agents

In docker/dev-full/grafana/dashboards/traces.json around lines 586 to 589, the SQL query uses the alias 'startTime' in the WHERE clause, which ClickHouse does not support and causes an "Unknown identifier" error. To fix this, replace all instances of 'startTime' in the WHERE clause with the original column name 'Timestamp' to ensure the query runs correctly.

graphite-app · 2025-07-16T19:45:18Z

Merge activity

Jul 16, 7:45 PM UTC: MasterPtato added this pull request to the Graphite merge queue.
Jul 16, 7:46 PM UTC: CI is running for this pull request on a draft pull request (#2768) due to your merge queue CI optimization settings.
Jul 16, 7:47 PM UTC: Merged by the Graphite merge queue via draft PR: #2768.

MasterPtato requested a review from NathanFlurry July 14, 2025 23:59

vercel Bot had a problem deploying to Preview – rivet-hub July 15, 2025 00:00 Failure

vercel Bot had a problem deploying to Preview – rivet-hub-pe86 July 15, 2025 00:00 Failure

coderabbitai Bot reviewed Jul 15, 2025

Comment thread docker/dev-full/grafana/dashboards/traces.json

Comment thread packages/common/chirp-workflow/core/tests/integration.rs

Comment thread packages/common/chirp-workflow/core/tests/integration.rs

fix(workflows): fix race condition of workflow waking before commit

f8f8729

MasterPtato force-pushed the 07-14-fix_workflows_fix_race_condition_of_workflow_waking_before_commit branch from 9bf40be to f8f8729 Compare July 16, 2025 17:31

vercel Bot deployed to Preview July 16, 2025 17:34 View deployment

coderabbitai Bot reviewed Jul 16, 2025

NathanFlurry approved these changes Jul 16, 2025

graphite-app Bot closed this Jul 16, 2025

graphite-app Bot deleted the 07-14-fix_workflows_fix_race_condition_of_workflow_waking_before_commit branch July 16, 2025 19:47

github-actions Bot mentioned this pull request Jul 16, 2025

chore(main): release 25.6.0 #2746

Merged
