fix(workflows): fix race condition of workflow waking before commit #2748

Closed
MasterPtato wants to merge 1 commit into main from 07-14-fix_workflows_fix_race_condition_of_workflow_waking_before_commit
Conversation


@MasterPtato MasterPtato commented Jul 14, 2025

Changes

Summary by CodeRabbit

  • New Features

    • Enhanced CLI output to display workflow and signal timestamps with millisecond precision.
    • CLI now shows workflow and signal tags and workflow IDs in a formatted, readable manner.
  • Improvements

    • Grafana traces dashboard updated: simplified queries, improved trace table with error indicators, reorganized panels, and enhanced trace details.
    • Logs and error panels in the dashboard were restructured for better visibility and usability.
    • Worker shutdown logging now reports the number of workflows still running when shutdown begins.
    • Workflow processing logic updated to ensure timely worker wake-up and avoid race conditions.
  • Bug Fixes

    • Reduced the threshold for logging long signal receive lag, enabling earlier detection of delays.
  • Chores

    • Added a new error variant for SQL connection failures to improve error clarity.
    • Updated test structure: converted async test to sync with explicit runtime and removed obsolete test files.
    • Improved internal logging messages for metrics loop shutdown.
    • Minor comment and typo corrections for clarity.

@MasterPtato MasterPtato requested a review from NathanFlurry July 14, 2025 23:59

vercel Bot commented Jul 14, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

| Name | Status | Preview | Comments | Updated (UTC) |
|---|---|---|---|---|
| rivet-hub | ❌ Failed (Inspect) | | | Jul 16, 2025 5:34pm |
| rivet-hub-pe86 | ❌ Failed (Inspect) | | | Jul 16, 2025 5:34pm |
| rivet-site | ✅ Ready (Inspect) | Visit Preview | 💬 Add feedback | Jul 16, 2025 5:34pm |


coderabbitai Bot commented Jul 14, 2025

Walkthrough

This update introduces significant changes to Grafana trace dashboards, workflow database logic, and test infrastructure. It revises SQL queries, panel layouts, and error handling in dashboards, enhances workflow wake logic and error variants in core Rust modules, and updates test execution strategy by removing some test files and refactoring others. Minor improvements to timestamp formatting and logging are also included.

Changes

| File(s) | Change Summary |
|---|---|
| docker/dev-full/grafana/dashboards/traces.json | Major overhaul of trace dashboard: dashboard ID/version changed, panel IDs updated, queries simplified and enhanced, error panels removed and re-added with new queries, logs row reintroduced, new columns and field configs added, and panel layouts adjusted to full width. |
| docker/dev-full/otel-collector/config.yaml | Added a commented-out "debug" exporter section with detailed verbosity; no functional changes. |
| packages/common/chirp-workflow/core/src/ctx/listen.rs | Reduced error logging threshold for signal receive lag from 15s to 3s. |
| packages/common/chirp-workflow/core/src/db/crdb_nats/mod.rs | Changed error mapping in connection acquisition from WorkflowError::Sqlx to new variant WorkflowError::ConnSqlx. |
| packages/common/chirp-workflow/core/src/db/fdb_sqlite_nats/mod.rs | Refined workflow wake logic: track if wake index entries were written, conditionally call wake_worker() after completion, always wake worker after commit to avoid race conditions, and clarified related comments. |
| packages/common/chirp-workflow/core/src/error.rs | Added new enum variant: ConnSqlx(sqlx::Error) with custom error message for connection errors. |
| packages/common/chirp-workflow/core/src/worker.rs | Cleanup finished workflows from running map before logging shutdown start and count. |
| packages/common/chirp-workflow/core/tests/common.rs, packages/common/chirp-workflow/core/tests/sqlite.rs | Deleted test utility for tracing setup and an async SQLite integration test with concurrent workers and timing statistics. |
| packages/common/chirp-workflow/core/tests/integration.rs | Converted async Tokio test to synchronous test with manual runtime execution; moved async logic to helper function; removed unused imports and commented code; added tracing instrumentation. |
| packages/common/chirp/client/src/metrics.rs | Clarified shutdown log message from "stopping loop" to "stopping metrics loop". |
| packages/common/server-cli/src/util/wf/mod.rs | Workflow creation timestamps now include milliseconds; workflow tags are pretty-printed as indented colored JSON before error info. |
| packages/common/server-cli/src/util/wf/signal.rs | Signal creation and acknowledgment timestamps now include milliseconds; optional tags and workflow_id fields are pretty-printed if present in pretty output. |
| packages/common/server-cli/src/commands/start.rs | Fixed typo in comment from "Select services t orun" to "Select services to run". |
| packages/edge/services/pegboard/standalone/ws/src/lib.rs | Clarified comment in ping update function to specify parallelization at operation level; no code changes. |

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Grafana
    participant DB

    User->>Grafana: View Traces Dashboard
    Grafana->>DB: Run updated SQL queries for panels
    DB-->>Grafana: Return trace, error, and log data
    Grafana-->>User: Display updated panels, tables, and logs
sequenceDiagram
    participant Workflow
    participant DB
    participant Worker

    Workflow->>DB: complete_workflow()
    DB->>Workflow: Return wrote_to_wake_idx flag
    alt wrote_to_wake_idx == true
        Workflow->>Worker: wake_worker()
    end
    Workflow->>DB: commit_workflow()
    DB-->>Workflow: Commit result
    Workflow->>Worker: wake_worker() (always after commit)
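
The second diagram can be modeled with a small, deterministic simulation. This is only a sketch under stated assumptions, not the driver's code: `Sim`, `publish_signal`, `drain_worker`, and the `leased` / `wake_pending` fields are hypothetical stand-ins for the wake index, the lease, and the worker's wake messages. It shows how a wake that arrives while the workflow is still leased can be lost unless the worker is woken again after the commit.

```rust
use std::collections::VecDeque;

/// Hypothetical in-memory stand-in for the workflow database plus one worker.
#[derive(Default)]
struct Sim {
    /// The workflow is currently leased (being run) and cannot be pulled again.
    leased: bool,
    /// A wake condition (for example a signal) is recorded in the wake index.
    wake_pending: bool,
    /// Wake messages published to the worker and not yet handled.
    wake_msgs: VecDeque<&'static str>,
    /// How many times the worker pulled the workflow.
    runs: u32,
}

impl Sim {
    fn wake_worker(&mut self, reason: &'static str) {
        self.wake_msgs.push_back(reason);
    }

    /// A signal arrives: record the wake condition, then wake the worker.
    fn publish_signal(&mut self) {
        self.wake_pending = true;
        self.wake_worker("signal");
    }

    /// The worker handles every queued wake message. It only pulls the
    /// workflow when a wake condition exists and the lease is free.
    fn drain_worker(&mut self) {
        while self.wake_msgs.pop_front().is_some() {
            if self.wake_pending && !self.leased {
                self.wake_pending = false;
                self.leased = true;
                self.runs += 1;
            }
        }
    }

    /// The running workflow commits and releases its lease.
    fn commit_workflow(&mut self, wake_after_commit: bool) {
        self.leased = false;
        if wake_after_commit {
            self.wake_worker("commit");
        }
    }
}

/// Returns how many times the workflow was pulled after the signal.
fn scenario(wake_after_commit: bool) -> u32 {
    let mut sim = Sim { leased: true, ..Sim::default() }; // workflow is mid-run
    sim.publish_signal(); // signal lands before the commit
    sim.drain_worker(); // worker wakes, sees the lease, skips the workflow
    sim.commit_workflow(wake_after_commit);
    sim.drain_worker();
    sim.runs
}

fn main() {
    println!("no wake after commit: {} run(s)", scenario(false));
    println!("wake after commit:    {} run(s)", scenario(true));
}
```

In this model the workflow is never pulled when the post-commit wake is omitted, because the only wake message was consumed while the lease was still held.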

Poem

A dashboard renewed, with panels re-spun,
Wake logic refined so the workflows can run.
Milliseconds now twinkle in timestamps so bright,
And errors caught early, no bugs in sight!
Old tests hop away, new logs leap in—
With every commit, let the workflow begin!

((\
( -.-)
o_(")(")



claude Bot commented Jul 15, 2025

Claude finished @MasterPtato's task


Code Review Complete ✅

Reviewing PR for race condition fix in workflows.

Review Tasks

  • Analyze core workflow changes in chirp-workflow
  • Review database implementation changes (FDB/SQLite)
  • Examine deleted test files and testing strategy
  • Check monitoring/observability updates
  • Evaluate code quality and potential issues
  • Assess security and performance implications
  • Provide comprehensive feedback

Key Findings

🔍 Race Condition Fix

The core issue being addressed is a race condition where workflows could wake before their state was committed to the database. The main fixes are:

  1. FDB/SQLite Driver (fdb_sqlite_nats/mod.rs:1023-1039): Added wake_worker() call at the end of commit_workflow() with detailed comment explaining the race condition scenario
  2. Signal Handling (fdb_sqlite_nats/mod.rs:1089-1102): Added signal wake index writing during signal pulls to prevent race conditions when signals arrive between transaction and commit
  3. Sub-workflow Handling (fdb_sqlite_nats/mod.rs:1135-1147): Similar fix for sub-workflow completion races

📊 Observability Improvements

  • Added Prometheus scraping for cadvisor metrics in OTEL collector config
  • Extensive Grafana dashboard updates (traces.json) with workflow-specific panels

🧪 Test Strategy Changes

  • Deleted: common.rs (16 lines) and sqlite.rs (209 lines) test files
  • Simplified: Integration test now focuses on FDB/SQLite/NATS driver only
  • Concern: Significant reduction in test coverage without clear justification

Code Quality Assessment

Strengths

  1. Well-documented fix: Excellent inline comments explaining the race condition and solution
  2. Defensive programming: Proactive wake calls to prevent edge cases
  3. Minimal core changes: Most changes confined to database drivers
  4. Consistent patterns: Similar fixes applied across signal and sub-workflow handling

⚠️ Areas of Concern

  1. Test Coverage Reduction: Removing 225 lines of tests without replacement is concerning for a race condition fix
  2. Performance Impact: Additional wake_worker() calls may increase unnecessary worker wake-ups
  3. Minor formatting: Empty line at signal.rs:43

Security Analysis

No security concerns identified. Changes are internal workflow orchestration improvements.

Performance Considerations

Potential Issues:

  • Additional NATS publishes from extra wake_worker() calls
  • Wake-ups may become more frequent, though comments indicate this is intentional for correctness

Mitigations:

  • Worker wake-ups should be idempotent per design
  • Better to have extra wake-ups than missed workflow executions
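
One way to read "idempotent" here is that several wake requests collapse into a single pull. The sketch below illustrates that idea with a local atomic flag. It is an analogy only: `WakeFlag` and `pulls_after` are hypothetical names, and the PR itself wakes workers through NATS publishes rather than an in-process flag.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical coalescing wake flag: any number of `wake` calls before the
/// next `take` result in exactly one pull.
struct WakeFlag {
    pending: AtomicBool,
}

impl WakeFlag {
    fn new() -> Self {
        WakeFlag { pending: AtomicBool::new(false) }
    }

    /// Request a wake. Repeated calls are harmless.
    fn wake(&self) {
        self.pending.store(true, Ordering::Release);
    }

    /// Returns true if at least one wake was requested since the last call,
    /// and clears the flag.
    fn take(&self) -> bool {
        self.pending.swap(false, Ordering::AcqRel)
    }
}

/// Issues `n_wakes` wake requests, then counts how many pulls the worker does.
fn pulls_after(n_wakes: u32) -> u32 {
    let flag = WakeFlag::new();
    for _ in 0..n_wakes {
        flag.wake();
    }
    let mut pulls = 0;
    while flag.take() {
        pulls += 1; // a real worker would query for pending workflows here
    }
    pulls
}

fn main() {
    println!("3 wakes -> {} pull(s)", pulls_after(3));
    println!("0 wakes -> {} pull(s)", pulls_after(0));
}
```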

Recommendations

🔴 Critical

  1. Test Coverage: Add back essential test cases or provide justification for why the deleted tests are no longer needed
  2. Integration Tests: Consider adding specific tests for the race condition scenarios described in comments

🟡 Minor

  1. Remove empty line at packages/common/server-cli/src/util/wf/signal.rs:43
  2. Consider adding metrics to monitor frequency of "defensive" wake calls

Conclusion

This is a well-engineered fix for a legitimate race condition in workflow orchestration. The implementation is thoughtful with clear documentation of the problem and solution. However, the significant reduction in test coverage is concerning for a concurrency fix.

Recommendation: APPROVE with test coverage addressed




cloudflare-workers-and-pages Bot commented Jul 15, 2025

Deploying rivet with Cloudflare Pages

Latest commit: f8f8729
Status: 🚫 Build failed.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🔭 Outside diff range comments (1)
docker/dev-full/grafana/dashboards/traces.json (1)

225-233: Stray zero-width characters break dashboard links

The URLs contain invisible U+FEFF (zero-width no-break space) characters. They break the link when Grafana renders it, resulting in a 404.

-"/d/8klBUGfVk/otel-traces?${__url_time_range}&${service_name:queryparam}&var-trace_id=${__value.raw}"
+"/d/8klBUGfVk/otel-traces?${__url_time_range}&${service_name:queryparam}&var-trace_id=${__value.raw}"

Remove the zero-width chars here and in the similar link below (lines 249-251).

🧹 Nitpick comments (1)
docker/dev-full/otel-collector/config.yaml (1)

51-52: Commented-out exporter is fine, but consider a clearer toggle mechanism

Keeping the debug exporter as commented YAML works, yet teammates may miss it. A small OTEL_DEBUG_EXPORTER=true env-driven include (via ${env:…} or a separate otel-collector.debug.yaml) gives a cleaner on/off switch without editing the file each time.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a07577d and 9bf40be.

📒 Files selected for processing (13)
  • docker/dev-full/grafana/dashboards/traces.json (10 hunks)
  • docker/dev-full/otel-collector/config.yaml (1 hunks)
  • packages/common/chirp-workflow/core/src/ctx/listen.rs (1 hunks)
  • packages/common/chirp-workflow/core/src/db/crdb_nats/mod.rs (1 hunks)
  • packages/common/chirp-workflow/core/src/db/fdb_sqlite_nats/mod.rs (6 hunks)
  • packages/common/chirp-workflow/core/src/error.rs (1 hunks)
  • packages/common/chirp-workflow/core/src/worker.rs (1 hunks)
  • packages/common/chirp-workflow/core/tests/common.rs (0 hunks)
  • packages/common/chirp-workflow/core/tests/integration.rs (1 hunks)
  • packages/common/chirp-workflow/core/tests/sqlite.rs (0 hunks)
  • packages/common/chirp/client/src/metrics.rs (1 hunks)
  • packages/common/server-cli/src/util/wf/mod.rs (2 hunks)
  • packages/common/server-cli/src/util/wf/signal.rs (2 hunks)
💤 Files with no reviewable changes (2)
  • packages/common/chirp-workflow/core/tests/common.rs
  • packages/common/chirp-workflow/core/tests/sqlite.rs
🧰 Additional context used
🧬 Code Graph Analysis (3)
packages/common/server-cli/src/util/wf/mod.rs (4)
packages/common/server-cli/src/util/format.rs (2)
  • indent_string (3-22)
  • colored_json (31-33)
packages/common/chirp-workflow/core/src/db/fdb_sqlite_nats/mod.rs (2)
  • tags (743-754)
  • tags (2558-2564)
packages/common/server-cli/src/commands/wf/signal.rs (1)
  • tags (48-51)
packages/common/server-cli/src/commands/wf/mod.rs (1)
  • tags (89-92)
packages/common/server-cli/src/util/wf/signal.rs (4)
packages/common/server-cli/src/util/wf/mod.rs (1)
  • display_state (587-595)
packages/common/chirp-workflow/core/src/builder/common/signal.rs (1)
  • tags (55-68)
packages/common/server-cli/src/commands/wf/signal.rs (1)
  • tags (48-51)
packages/common/server-cli/src/util/format.rs (2)
  • indent_string (3-22)
  • colored_json (31-33)
packages/common/chirp-workflow/core/src/db/fdb_sqlite_nats/mod.rs (5)
packages/common/pools/src/pools.rs (1)
  • fdb (199-201)
packages/common/chirp-workflow/core/src/ctx/activity.rs (1)
  • fdb (247-249)
packages/common/chirp-workflow/core/src/ctx/api.rs (1)
  • fdb (251-253)
packages/common/chirp-workflow/core/src/ctx/operation.rs (1)
  • fdb (275-277)
packages/common/chirp-workflow/core/src/ctx/standalone.rs (1)
  • fdb (262-264)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: cargo-check-main
  • GitHub Check: claude-review
🔇 Additional comments (14)
packages/common/server-cli/src/util/wf/signal.rs (3)

29-29: Enhanced timestamp precision improves debugging.

The addition of milliseconds (%.3f) to timestamp formatting provides better precision for debugging workflow timing issues, which aligns well with the PR objective of fixing race condition issues.

Also applies to: 38-38
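
To illustrate what three fractional digits add, here is a minimal sketch that avoids any date library. `format_secs_millis` is a hypothetical helper, not the CLI's formatter: it renders a millisecond Unix timestamp as seconds with a zero-padded millisecond suffix, which is the precision the comment above describes.

```rust
/// Hypothetical helper: renders milliseconds since the Unix epoch as
/// `<seconds>.<milliseconds>` with exactly three fractional digits.
fn format_secs_millis(ts_ms: i64) -> String {
    // Euclidean division keeps the millisecond part in 0..1000 even for
    // negative timestamps.
    let secs = ts_ms.div_euclid(1000);
    let millis = ts_ms.rem_euclid(1000);
    format!("{secs}.{millis:03}")
}

fn main() {
    // Two events 40 ms apart are indistinguishable at second precision
    // but clearly ordered at millisecond precision.
    println!("{}", format_secs_millis(1_752_537_436_835));
    println!("{}", format_secs_millis(1_752_537_436_875));
}
```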


51-57: Tags display implementation looks correct.

The optional tags field is properly handled with null checking and follows the same formatting pattern as other fields. The use of colored_json and indent_string with consistent styling maintains visual consistency.


59-65: Workflow ID display adds valuable context.

The optional workflow_id field provides useful debugging information for signal routing, especially helpful for troubleshooting workflow wake-up timing issues mentioned in the PR objective.

packages/common/server-cli/src/util/wf/mod.rs (2)

61-61: Consistent timestamp precision enhancement.

The milliseconds addition (%.3f) matches the formatting improvements in signal.rs, providing consistent timestamp precision across workflow CLI utilities.


71-75: Tags display follows consistent formatting pattern.

The tags display implementation uses the same colored_json and indent_string approach as signal.rs, maintaining visual consistency. The direct access to workflow.tags is appropriate since workflow tags appear to be mandatory based on the table implementation below (line 623).

packages/common/chirp/client/src/metrics.rs (1)

40-41: Good wording improvement

The revised log string is clearer and still precise. No further action needed.

docker/dev-full/grafana/dashboards/traces.json (1)

731-733: Macro typo: __conditionalAll doesn’t exist in ClickHouse plugin

The function name should be __conditionalAll without the leading underscores in most plugin versions, or you can drop the macro and reuse the tuple pattern shown above to stay version-agnostic.

packages/common/chirp-workflow/core/src/error.rs (1)

147-149: Good addition for error categorization.

The new ConnSqlx variant provides clearer distinction between connection acquisition failures and general SQL errors, which helps with debugging and monitoring.

packages/common/chirp-workflow/core/src/db/crdb_nats/mod.rs (1)

59-59: Correct usage of the new error variant.

The change properly categorizes connection acquisition failures using the new ConnSqlx variant.
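
The split between the two variants can be sketched as follows. This is an illustration under assumptions, not the crate's definition: `DbError` is a hypothetical stand-in for `sqlx::Error`, the display strings are invented, and `acquire_conn` / `run_query` are hypothetical helpers. Only the variant names `Sqlx` and `ConnSqlx` come from the review.

```rust
use std::fmt;

/// Hypothetical stand-in for `sqlx::Error`, so the sketch needs no dependencies.
#[derive(Debug)]
struct DbError(String);

#[derive(Debug)]
enum WorkflowError {
    /// A query failed after a connection was obtained.
    Sqlx(DbError),
    /// Acquiring a connection failed.
    ConnSqlx(DbError),
}

impl fmt::Display for WorkflowError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            // Message texts are assumptions made for this example.
            WorkflowError::Sqlx(err) => write!(f, "sql error: {}", err.0),
            WorkflowError::ConnSqlx(err) => write!(f, "sql connection error: {}", err.0),
        }
    }
}

/// Connection acquisition maps its failure to the connection-specific variant.
fn acquire_conn(pool_has_capacity: bool) -> Result<(), WorkflowError> {
    if pool_has_capacity {
        Ok(())
    } else {
        Err(DbError("pool timed out".to_string())).map_err(WorkflowError::ConnSqlx)
    }
}

/// Query failures keep using the general variant.
fn run_query(conn_ok: bool, query_ok: bool) -> Result<(), WorkflowError> {
    acquire_conn(conn_ok)?;
    if query_ok {
        Ok(())
    } else {
        Err(WorkflowError::Sqlx(DbError("syntax error".to_string())))
    }
}

fn main() {
    for (conn_ok, query_ok) in [(false, true), (true, false)] {
        if let Err(err) = run_query(conn_ok, query_ok) {
            println!("{err}");
        }
    }
}
```

With separate variants, logs and metrics can distinguish pool or network trouble from a failing statement without parsing the inner error.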

packages/common/chirp-workflow/core/src/db/fdb_sqlite_nats/mod.rs (3)

1551-1712: Good optimization to reduce unnecessary wake calls.

The change to track wrote_to_wake_idx and conditionally call wake_worker() is a smart optimization that avoids unnecessary wake operations when no workflows are waiting for this one to complete.


1860-1877: Critical race condition fix with excellent documentation.

This unconditional wake_worker() call effectively prevents the race condition where a workflow could miss its wake condition between lease check and commit. The detailed comment clearly explains the scenario and justifies why this approach is simpler than alternatives. This is the key fix addressing the PR's objective.


2287-2287: Good comment clarification.

The updated comment more accurately describes what's being checked.

packages/common/chirp-workflow/core/src/ctx/listen.rs (1)

86-86: LGTM: Improved signal lag monitoring sensitivity.

Reducing the threshold from 15.0 to 3.0 seconds will help detect workflow timing issues earlier, which aligns well with the PR's goal of fixing race conditions related to workflow waking.
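
A threshold check of this kind might look like the sketch below. The names `LONG_RECV_LAG` and `is_long_recv_lag` are hypothetical, and the strict `>` comparison is an assumption; only the 3-second value comes from the change described above.

```rust
use std::time::Duration;

/// Threshold after which signal receive lag is considered worth logging.
/// The 3 second value is from the PR; the constant name is hypothetical.
const LONG_RECV_LAG: Duration = Duration::from_secs(3);

/// Returns true when the time between signal creation and receipt exceeds
/// the threshold. Timestamps are milliseconds since the Unix epoch.
fn is_long_recv_lag(created_at_ms: i64, received_at_ms: i64) -> bool {
    // Clamp to zero so clock skew cannot produce a negative lag.
    let lag_ms = received_at_ms.saturating_sub(created_at_ms).max(0) as u64;
    Duration::from_millis(lag_ms) > LONG_RECV_LAG
}

fn main() {
    // A 5 second lag is flagged at the 3 s threshold; under the previous
    // 15 s threshold it would not have been.
    println!("5s lag flagged: {}", is_long_recv_lag(0, 5_000));
    println!("2s lag flagged: {}", is_long_recv_lag(0, 2_000));
}
```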

packages/common/chirp-workflow/core/src/worker.rs (1)

115-118: LGTM: Improved shutdown logging accuracy.

Pre-filtering finished workflows before logging the remaining count ensures accurate reporting during shutdown. This enhances debugging capabilities and provides clearer operational visibility.
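
The pre-filtering step can be sketched with standard-library threads. This is an assumption-laden illustration: the real worker presumably tracks async tasks rather than OS threads, and `prune_finished` and `demo` are hypothetical names. The point is that pruning finished handles before counting gives an accurate "remaining" figure.

```rust
use std::collections::HashMap;
use std::sync::mpsc;
use std::thread::{self, JoinHandle};

/// Drops handles whose work has already finished and returns how many
/// are still running.
fn prune_finished(running: &mut HashMap<u32, JoinHandle<()>>) -> usize {
    running.retain(|_, handle| !handle.is_finished());
    running.len()
}

/// Returns (tracked entries before pruning, entries remaining after pruning).
fn demo() -> (usize, usize) {
    let (release_tx, release_rx) = mpsc::channel::<()>();
    let mut running = HashMap::new();

    // Workflow 1 finishes immediately; wait until the handle reports it.
    let done = thread::spawn(|| {});
    while !done.is_finished() {
        thread::yield_now();
    }
    running.insert(1, done);

    // Workflow 2 stays blocked until it is released below.
    running.insert(
        2,
        thread::spawn(move || {
            let _ = release_rx.recv();
        }),
    );

    let tracked = running.len();
    let remaining = prune_finished(&mut running);

    // Release and join the remaining thread so the demo exits cleanly.
    let _ = release_tx.send(());
    for (_, handle) in running {
        let _ = handle.join();
    }
    (tracked, remaining)
}

fn main() {
    let (tracked, remaining) = demo();
    println!("tracked: {tracked}, remaining at shutdown start: {remaining}");
}
```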

Comment thread docker/dev-full/grafana/dashboards/traces.json
Comment thread packages/common/chirp-workflow/core/tests/integration.rs
Comment thread packages/common/chirp-workflow/core/tests/integration.rs
@MasterPtato MasterPtato force-pushed the 07-14-fix_workflows_fix_race_condition_of_workflow_waking_before_commit branch from 9bf40be to f8f8729 July 16, 2025 17:31

claude Bot commented Jul 16, 2025

Claude finished @MasterPtato's task



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9bf40be and f8f8729.

📒 Files selected for processing (15)
  • docker/dev-full/grafana/dashboards/traces.json (10 hunks)
  • docker/dev-full/otel-collector/config.yaml (1 hunks)
  • packages/common/chirp-workflow/core/src/ctx/listen.rs (1 hunks)
  • packages/common/chirp-workflow/core/src/db/crdb_nats/mod.rs (1 hunks)
  • packages/common/chirp-workflow/core/src/db/fdb_sqlite_nats/mod.rs (6 hunks)
  • packages/common/chirp-workflow/core/src/error.rs (1 hunks)
  • packages/common/chirp-workflow/core/src/worker.rs (1 hunks)
  • packages/common/chirp-workflow/core/tests/common.rs (0 hunks)
  • packages/common/chirp-workflow/core/tests/integration.rs (1 hunks)
  • packages/common/chirp-workflow/core/tests/sqlite.rs (0 hunks)
  • packages/common/chirp/client/src/metrics.rs (1 hunks)
  • packages/common/server-cli/src/commands/start.rs (1 hunks)
  • packages/common/server-cli/src/util/wf/mod.rs (2 hunks)
  • packages/common/server-cli/src/util/wf/signal.rs (2 hunks)
  • packages/edge/services/pegboard/standalone/ws/src/lib.rs (1 hunks)
💤 Files with no reviewable changes (2)
  • packages/common/chirp-workflow/core/tests/common.rs
  • packages/common/chirp-workflow/core/tests/sqlite.rs
✅ Files skipped from review due to trivial changes (4)
  • packages/common/server-cli/src/commands/start.rs
  • packages/edge/services/pegboard/standalone/ws/src/lib.rs
  • packages/common/chirp/client/src/metrics.rs
  • packages/common/chirp-workflow/core/src/db/crdb_nats/mod.rs
🚧 Files skipped from review as they are similar to previous changes (8)
  • docker/dev-full/otel-collector/config.yaml
  • packages/common/chirp-workflow/core/src/ctx/listen.rs
  • packages/common/chirp-workflow/core/src/error.rs
  • packages/common/chirp-workflow/core/src/worker.rs
  • packages/common/server-cli/src/util/wf/mod.rs
  • packages/common/server-cli/src/util/wf/signal.rs
  • packages/common/chirp-workflow/core/tests/integration.rs
  • packages/common/chirp-workflow/core/src/db/fdb_sqlite_nats/mod.rs
🧰 Additional context used
🧠 Learnings (1)
docker/dev-full/grafana/dashboards/traces.json (1)
Learnt from: MasterPtato
PR: rivet-gg/rivet#2748
File: docker/dev-full/grafana/dashboards/traces.json:178-181
Timestamp: 2025-07-15T00:17:16.875Z
Learning: In Grafana dashboards, when "All" is selected for multi-value template variables like service_name and span_name, Grafana automatically handles the expansion and doesn't pass literal "__all" strings to the query that would cause IN ('__all') failures. The ServiceName IN (${service_name:singlequote}) and SpanName IN (${span_name:singlequote}) patterns work correctly without needing tuple conditional logic to check for 'All' values.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: cargo-check-main
  • GitHub Check: claude-review
🔇 Additional comments (1)
docker/dev-full/grafana/dashboards/traces.json (1)

730-732: Non-standard macro __$conditionalAll – verify or replace

The query relies on __$conditionalAll(TraceId IN (${trace_id:singlequote}), $trace_id).
This macro is not part of upstream Grafana nor the official ClickHouse plugin; if it isn’t defined by your own provisioning, the query will be rendered literally and fail.

Please confirm the macro exists in the target environment or switch to the regular pattern that Grafana expands automatically for “All”:

-  $__conditionalAll(TraceId IN (${trace_id:singlequote}),  $trace_id)
+  TraceId IN (${trace_id:singlequote})

(or add the tuple guard if needed).

Comment on lines 586 to 589
"pluginVersion": "4.0.6",
"queryType": "traces",
"rawSql": "WITH '${trace_id}' as trace_id, (SELECT min(Start) FROM \"otel\".\"otel_traces_trace_id_ts\" WHERE TraceId = trace_id) as trace_start, (SELECT max(End) + 1 FROM \"otel\".\"otel_traces_trace_id_ts\" WHERE TraceId = trace_id) as trace_end SELECT \"TraceId\" as traceID, \"SpanId\" as spanID, \"ParentSpanId\" as parentSpanID, \"ServiceName\" as serviceName, \"SpanName\" as operationName, \"Timestamp\" as startTime, multiply(\"Duration\", 0.000001) as duration, arrayMap(key -> map('key', key, 'value',\"SpanAttributes\"[key]), mapKeys(\"SpanAttributes\")) as tags, arrayMap(key -> map('key', key, 'value',\"ResourceAttributes\"[key]), mapKeys(\"ResourceAttributes\")) as serviceTags FROM \"otel\".\"otel_traces\" WHERE traceID = trace_id AND startTime >= trace_start AND startTime <= trace_end AND ( Duration > 0 ) ORDER BY Timestamp DESC, Duration DESC LIMIT 1000",
"rawSql": "WITH\n\t'${trace_id}' as trace_id,\n\t(SELECT min(Start) FROM otel.otel_traces_trace_id_ts WHERE TraceId = trace_id) as trace_start,\n\t(SELECT max(End) + 1 FROM otel.otel_traces_trace_id_ts WHERE TraceId = trace_id) as trace_end\nSELECT\n\tTraceId as traceID,\n\tSpanId as spanID,\n\tParentSpanId as parentSpanID,\n\tServiceName as serviceName,\n\tSpanName as operationName, Timestamp as startTime,\n\tmultiply(Duration, 0.000001) as duration,\n\tarrayMap(key -> map('key', key, 'value', SpanAttributes[key]), mapKeys(SpanAttributes)) as tags,\n\tarrayMap(key -> map('key', key, 'value', ResourceAttributes[key]), mapKeys(ResourceAttributes)) as serviceTags,\n\tarrayMap((name, timestamp, attributes) -> tuple(name, toString(multiply(toUnixTimestamp64Nano(timestamp), 0.000001)), arrayMap( key -> map('key', key, 'value', attributes[key]), mapKeys(attributes)))::Tuple(name String, timestamp String, fields Array(Map(String, String))), `Events.Name`, `Events.Timestamp`, `Events.Attributes`) AS logs,\n\tarrayMap((traceID, spanID, attributes) -> tuple(traceID, spanID, arrayMap(key -> map('key', key, 'value', attributes[key]), mapKeys(attributes)))::Tuple(traceID String, spanID String, tags Array(Map(String, String))), `Links.TraceId`, `Links.SpanId`, `Links.Attributes`) AS references\nFROM otel.otel_traces\nWHERE\n\ttraceID = trace_id AND startTime >= trace_start AND startTime <= trace_end AND ( Duration > 0 )\nORDER BY Timestamp DESC, Duration DESC\nLIMIT 1000",
"refId": "A"
⚠️ Potential issue

startTime alias used in the WHERE-clause breaks the query

ClickHouse does not allow referencing a column alias in the WHERE clause.
startTime is defined in the SELECT list (Timestamp as startTime) but is then reused in the filter:

... WHERE
    traceID = trace_id AND startTime >= trace_start AND startTime <= trace_end ...

At runtime this raises: “Unknown identifier: startTime”.
Filter on the original column instead.

-WHERE
-    traceID = trace_id AND startTime >= trace_start AND startTime <= trace_end
+WHERE
+    TraceId = trace_id
+    AND Timestamp >= trace_start
+    AND Timestamp <= trace_end
🤖 Prompt for AI Agents
In docker/dev-full/grafana/dashboards/traces.json around lines 586 to 589, the
SQL query uses the alias 'startTime' in the WHERE clause, which ClickHouse does
not support and causes an "Unknown identifier" error. To fix this, replace all
instances of 'startTime' in the WHERE clause with the original column name
'Timestamp' to ensure the query runs correctly.


graphite-app Bot commented Jul 16, 2025

Merge activity

  • Jul 16, 7:45 PM UTC: MasterPtato added this pull request to the Graphite merge queue.
  • Jul 16, 7:46 PM UTC: CI is running for this pull request on a draft pull request (#2768) due to your merge queue CI optimization settings.
  • Jul 16, 7:47 PM UTC: Merged by the Graphite merge queue via draft PR: #2768.

graphite-app Bot pushed a commit that referenced this pull request Jul 16, 2025
…2748)

@graphite-app graphite-app Bot closed this Jul 16, 2025
@graphite-app graphite-app Bot deleted the 07-14-fix_workflows_fix_race_condition_of_workflow_waking_before_commit branch July 16, 2025 19:47