Trace aggregation reports workflow as 'failed' after loop recovery #240

@spinje

Description

Summary

When a node fails on visit 1 and succeeds on visit 2 via loop recovery, the workflow completes successfully and produces correct output, but the `--report` summary says `Status: failed` and lists the node in the Errors section. This is a cosmetic issue in trace event aggregation, not a runtime correctness bug.

Background

Found during Task 148 verification while testing loop recovery semantics. The Task 148 invariant is upheld correctly at runtime (__failures__ is cleared on loop re-entry, shared[node_id] holds the fresh success data, workflow exits 0, template resolution sees success data). The issue is entirely in how --report / trace consumers aggregate per-visit events.

Task 148's spec covered the runtime invariant and direct failed-node reference UX, but explicitly did not address trace aggregation for loop recovery. The only --report acceptance criterion in Task 148 was for the non-loop #208 repro. Filing this as a follow-up so it's tracked.

Reproduction

scratchpads/task148-verification/09_loop_recovery.pflow.md:

````markdown
### setup
Remove any pre-existing marker so the test is repeatable.
- type: shell
- next: maybe-fail
- cache: false
```shell command
rm -f /tmp/pflow-loop-marker
```

### maybe-fail
Fails first time, creates marker, succeeds on retry.
- type: shell
- on-error: retry
- next: end
- cache: false
```shell command
if [ -f /tmp/pflow-loop-marker ]; then
  echo "succeeded-on-retry"
else
  touch /tmp/pflow-loop-marker
  exit 9
fi
```

### retry
Loops back to maybe-fail.
- type: shell
- next: maybe-fail
- cache: false
```shell command
echo "retrying"
```

## Outputs
### result
- source: ${maybe-fail.stdout}
````

Run:

```bash
rm -f /tmp/pflow-loop-marker
uv run pflow scratchpads/task148-verification/09_loop_recovery.pflow.md --no-cache --report
```

CLI output (correct): `succeeded-on-retry`, exit 0, workflow completed successfully.

summary.md (wrong):

```
Execution Report: 09_loop_recovery

  • Status: failed
  • Nodes: 4

Pipeline

  #  Node        Type       Time  Status
  1  setup       ShellNode  6ms   ok
  2  maybe-fail  ShellNode  5ms   FAILED
  3  retry       ShellNode  3ms   ok
  4  maybe-fail  ShellNode  3ms   ok

Errors

  • maybe-fail (ShellNode): Command failed with exit code 9
```

The workflow succeeded and produced `succeeded-on-retry`, but the report claims it failed and lists a stale error from visit 1.

Root cause

src/pflow/runtime/workflow_trace.py:275-288 (_determine_trace_status):

```python
failed = any(not e.get("success", True) for e in self.events)
if failed:
return "failed"
```

This is monotonic over the full event list. Each node execution appends its own event, and events are never removed. Visit 1's success: False event persists forever in self.events, so any() returns True even though visit 2 succeeded.
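Concretely, for the reproducer the event list looks something like this (field names reduced to `node_id`/`success` for illustration; the real events carry more fields):

```python
# Illustrative event sequence from the loop recovery reproducer.
events = [
    {"node_id": "setup",      "success": True},
    {"node_id": "maybe-fail", "success": False},  # visit 1: exit 9
    {"node_id": "retry",      "success": True},
    {"node_id": "maybe-fail", "success": True},   # visit 2: recovered
]

# Current aggregation: one historical failure anywhere poisons the status.
any(not e.get("success", True) for e in events)  # True -> "failed"
```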

src/pflow/runtime/workflow_trace.py:408-421 (save_to_file):

```python
failed_nodes = [e for e in self.events if not e.get("success", True)]
trace_data = {
"final_status": final_status,
"nodes_executed": len(self.events),
"nodes_failed": len(failed_nodes),
...
}
```

Same pattern: it counts failed events, not nodes whose final state is failed.

src/pflow/core/trace_report.py:223-229 (_collect_errors) filters events the same way, which is why the Errors section lists visit 1's exit 9 even after visit 2 recovered.

The trace format currently conflates "this node ever failed during the run" with "this node's final state is failed". For non-loop workflows the two are identical; for loop recovery they diverge.

Design question (not prescribing a fix)

There are legitimate uses for both interpretations:

  1. Final-state view: An operator wants to know "did the workflow succeed?" → yes. Loop recovery should report green.
  2. Audit view: An agent debugging flakiness wants to know "did this node ever fail during the run?" → yes. Loop recovery should surface the visit-1 failure.

A reasonable fix would report both: top-level final_status from the final per-node state, but preserve per-event history (the pipeline table already shows both visits). Options to consider:

  • Group events by `node_id`, take the last event's `success` flag as each node's final state. Compute `final_status` from that. Pipeline table and per-node files stay unchanged.
  • Walk the engine's `shared["execution"]["completed_nodes"]` and `shared["failures"]` at save time to compute final state from runtime invariant rather than event history. Requires threading shared state into the trace collector.
  • Tag each event with a `visit_number` and have the report aggregator distinguish "final" from "prior". More intrusive but preserves all the data.

The first option is the smallest change and matches what users intuitively expect from `--report`.
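A minimal sketch of that first option, assuming trace events carry `node_id` and `success` keys as in the snippets above. Function names and the `completed` string are illustrative placeholders, not the existing API:

```python
def last_event_per_node(events):
    """Collapse the event history to each node's most recent visit."""
    last = {}
    for event in events:
        last[event["node_id"]] = event  # later visits overwrite earlier ones
    return last

def determine_final_status(events):
    """Top-level status from per-node final state, not full history."""
    final = last_event_per_node(events)
    if any(not e.get("success", True) for e in final.values()):
        return "failed"
    return "completed"  # placeholder for the actual success status string

def collect_final_errors(events):
    """Errors section: only nodes whose final visit failed."""
    final = last_event_per_node(events)
    return [e for e in final.values() if not e.get("success", True)]
```

With the reproducer's events, `determine_final_status` no longer returns `failed` and `collect_final_errors` returns an empty list, while `self.events` still holds both `maybe-fail` visits for the pipeline table.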

Acceptance criteria

  • Running the reproducer with `--report` produces `Status: succeeded` (or equivalent) in `summary.md`.
  • The pipeline table still shows both `maybe-fail` visits (first FAILED, second ok) — per-visit history is audit data, not rewritten.
  • The Errors section does NOT list `maybe-fail` as an error, because its final state is success.
  • Non-loop workflows are unaffected: the #208 repro ("Coalesce (`??`) resolves to empty string instead of falling through when on-error source node failed") and any single-pass workflow with a truly failed node still report `Status: failed` with the failed node in the Errors section.
  • Regression test added that runs the loop recovery reproducer and asserts `summary.md` contains `Status: succeeded` (or checks the `trace.final_status` equivalent), not `failed`; see the sketch after this list.
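A sketch of the shape such a test could take, reusing the illustrative aggregation helpers from the fix sketch above rather than the real module API:

```python
def test_loop_recovery_final_status_is_success():
    # Event sequence from the loop recovery reproducer: maybe-fail
    # fails on visit 1 and succeeds on visit 2.
    events = [
        {"node_id": "setup", "success": True},
        {"node_id": "maybe-fail", "success": False},
        {"node_id": "retry", "success": True},
        {"node_id": "maybe-fail", "success": True},
    ]
    assert determine_final_status(events) != "failed"
    assert collect_final_errors(events) == []

def test_single_pass_failure_still_fails():
    # Non-loop guard (#208-style): a node that never recovers must
    # still surface as failed.
    events = [{"node_id": "broken", "success": False}]
    assert determine_final_status(events) == "failed"
    assert len(collect_final_errors(events)) == 1
```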

Related

  • Task 148 — Failed-node invariant fix. This issue is explicitly out of scope per `task-148.md` (only non-loop `--report` criterion was specified).
  • #235: Replace `runner._extract_runtime_warnings` canned suggestions with actionable signal (related UX cleanup in the same area).
  • Verified during Task 148 manual testing that runtime correctness is not affected: `shared["failures"]` is correctly empty after loop recovery, `${maybe-fail.stdout}` correctly resolves to `succeeded-on-retry`, and the workflow exits 0. Only report/trace aggregation is affected.

Files likely to change

  • `src/pflow/runtime/workflow_trace.py` — `_determine_trace_status`, `save_to_file` (per-node-final-state aggregation)
  • `src/pflow/core/trace_report.py` — `_collect_errors` (filter by final state)
  • Tests: `tests/test_runtime/test_workflow_trace.py` or a new `tests/test_core/test_trace_report.py` for the loop recovery regression case
