Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -0,0 +1,130 @@
# Braindump: Shell Stderr Visibility Bug Fix - Overlap with Task 108

**Date:** 2026-01-09
**Context:** Just completed implementation of shell stderr visibility fix, which has significant overlap with Task 108's `--debug` flag

## Where I Am

I just finished implementing a fix for a bug where shell node stderr was hidden when `exit_code=0`. This was a debugging nightmare - pipelines would "succeed" but produce wrong output, and the stderr containing the actual error was invisible.

**What we implemented:**
- `execution_state.py`: Added `has_stderr`/`stderr` detection for shell nodes with exit_code=0
- `main.py`: Added `_display_stderr_warnings()` function, modified status indicators to show ⚠️
- 28 new tests covering CLI, JSON output, and the exact bug scenario

**The fix is complete and working.** This braindump is about how it affects Task 108.

## User's Mental Model

The user thinks about debugging in terms of **immediate visibility**:
- "Quality over quantity" for tests - they want tests that catch real bugs
- They want warnings to be **always visible**, not behind flags
- They verified manually that all scenarios work before considering done
- They explicitly asked about Task 108 overlap before closing out

Key quote from earlier research (from another agent's session):
> "The problem: There was no error. The workflow 'succeeded' with wrong output. So it's not about error messages - it's about visibility."

## Key Insights

### 1. The Stderr Fix and Task 108's `--debug` Are Complementary, Not Overlapping

Initially I thought there was significant overlap. After implementation, I see they solve different problems:

| Aspect | Stderr Fix (Done) | Task 108 `--debug` |
|--------|-------------------|-------------------|
| **Problem solved** | Hidden errors from pipelines | Can't see data flow |
| **Trigger** | Automatic (always on) | Manual flag |
| **Scope** | Only stderr with exit_code=0 | All shared store data |
| **Output location** | In execution summary | After execution summary |
| **Primary user** | Humans debugging failures | Agents understanding data flow |

### 2. Task 108's `--debug` No Longer Needs to Special-Case Stderr

The research doc (`research-inline-debug-flag.md`) mentioned showing stderr in the `--debug` output. Now that stderr warnings are always visible, `--debug` can focus purely on showing the shared store state without needing to highlight stderr specially.

### 3. The `has_stderr` Flag in JSON is Critical for Agents

We added `has_stderr: true` and `stderr: "..."` to the JSON output's `execution.steps[]`. This is important for Task 108 because agents consuming JSON now have a programmatic way to detect stderr issues without parsing text output.

Example JSON output:
```json
{
"execution": {
"steps": [{
"node_id": "pipeline-node",
"has_stderr": true,
"stderr": "sed: RE error: repetition-operator operand invalid"
}]
}
}
```

## Assumptions & Uncertainties

**ASSUMPTION:** Task 108's `--debug` flag should NOT duplicate the stderr warning display. The shared store dump will include `node.stderr` anyway, so showing it twice would be redundant.

**ASSUMPTION:** The `has_stderr` flag in JSON steps is sufficient for agent detection. Task 108 might not need additional stderr handling.

**UNCLEAR:** Should Task 108's `--debug` output indicate which nodes had stderr warnings? Or just show raw shared store? The warning indicators are already in the normal output, so maybe `--debug` should be purely data dump.

**NEEDS VERIFICATION:** Does the user want Task 108's `--debug` to be purely additive (show shared store AFTER existing output), or should it modify the existing output format?

## Unexplored Territory

**UNEXPLORED:** We didn't discuss whether the JSON output's `status` field should change from `"success"` to `"degraded"` when stderr is present. Currently it stays `"success"`. This might matter for agents making decisions based on status.

**CONSIDER:** The `--debug` flag from Task 108 research shows truncated values (e.g., "2.3kb"). We used 300 char truncation for stderr. Should these be consistent?

**MIGHT MATTER:** The stderr fix only applies to shell nodes. If future nodes produce stderr-like warnings (e.g., deprecation warnings from HTTP nodes), we might need a more generic "node warnings" system. Task 108's design should consider this.

**CONSIDER:** Should `--debug` work in MCP context? The research doc asked this question but didn't answer it. Now that stderr visibility is handled, this is purely about shared store visibility for agents.

## What I'd Tell Myself

1. **Don't try to merge these features.** Stderr visibility (always-on warnings) and `--debug` (opt-in data dump) solve different problems. Keep them separate.

2. **The JSON output is the real interface for agents.** The `has_stderr` flag in `execution.steps[]` is how agents will detect issues. Task 108 should ensure `--debug` mode provides equally structured data.

3. **The user values "always visible" over "opt-in".** They explicitly chose to show stderr warnings without requiring flags. Task 108's `--debug` is opt-in by design, but consider if any of its features should be always-on.

## Open Threads

1. **Update Task 108 spec?** The task spec might mention stderr visibility as a goal. It's now handled separately and should be noted.

2. **Consistency of truncation.** Stderr uses 300 chars, Task 108 research mentioned configurable thresholds. Should standardize.

3. **The `status` field question.** Workflows with stderr show `status: "success"` in JSON. Is this correct? Or should it be `"degraded"` or a new `"success_with_warnings"`?

## Relevant Files & References

**Files I modified (stderr fix):**
- `src/pflow/execution/execution_state.py:129-137` - stderr detection logic
- `src/pflow/cli/main.py:732-758` - `_display_stderr_warnings()` function
- `src/pflow/cli/main.py:797-814` - `_display_workflow_completion_status()` helper
- `tests/test_cli/test_shell_stderr_warnings.py` - 16 tests
- `tests/test_execution/test_execution_state.py` - 12 tests

**Task 108 research (read these first):**
- `.taskmaster/tasks/task_108/starting-context/research-inline-debug-flag.md` - `--debug` flag proposal
- `.taskmaster/tasks/task_108/starting-context/braindump-inline-debug-discovery.md` - How the need was discovered
- `.taskmaster/tasks/task_108/task-108.md` - Main task definition

**Related PR:**
- PR #52 - Surfaces upstream stderr when downstream nodes fail (different scenario, but related)

## For the Next Agent

**Start by:** Reading Task 108's spec and research docs, then compare with what's now implemented for stderr. Note that stderr visibility is DONE - don't re-implement it.

**Don't bother with:** Trying to show stderr in `--debug` output specially. It's already shown in the execution summary. Just include it in the raw shared store dump like any other data.

**The user cares most about:** Immediate visibility without requiring flags. Task 108's `--debug` is opt-in by design, which is fine since it's for deeper debugging. But if you're adding features, consider if they should be always-on.

**Key decision to make:** Should `--debug` show raw shared store data only, or should it include the new `has_stderr` metadata? The metadata is already in JSON output's `execution.steps[]`, so maybe `--debug` just needs to show the raw `shared[node_id]` dict which includes `stderr` naturally.

**Test the overlap:** Run `uv run pflow scratchpads/shell-stderr-visibility-bug/test-cases/stderr-hidden-exit0.json` to see what stderr visibility looks like now. Then design `--debug` to add value beyond this.

---

> **Note to next agent**: Read this document fully before taking any action. When ready, confirm you've read and understood by summarizing the key points, then state you're ready to proceed.
63 changes: 56 additions & 7 deletions src/pflow/cli/main.py
Original file line number Diff line number Diff line change
Expand Up @@ -690,6 +690,10 @@ def _format_node_status_line(step: dict[str, Any]) -> str:
return f" ✓ {node_id} {timing} - {total}/{total} items succeeded{tag_str}"

# Regular node
# Use warning indicator if node produced stderr (has_stderr already implies completed)
if step.get("has_stderr"):
return f" ⚠ {node_id} {timing}{tag_str}"

indicator = _get_status_indicator(status)
return f" {indicator} {node_id} {timing}{tag_str}"

Expand Down Expand Up @@ -725,6 +729,35 @@ def _display_batch_errors(steps: list[dict[str, Any]]) -> None:
click.echo(f" ...and {truncated} more errors", err=True)


def _display_stderr_warnings(steps: list[dict[str, Any]]) -> None:
"""Display stderr warnings for shell nodes that succeeded but produced stderr.

This helps surface hidden errors from shell pipeline failures where
intermediate commands fail but the overall exit code is 0.

Args:
steps: List of execution step dicts (may contain has_stderr and stderr fields)
"""
stderr_warnings = [
(step.get("node_id", "unknown"), step.get("stderr", ""))
for step in steps
if step.get("has_stderr") and step.get("stderr")
]

if not stderr_warnings:
return

click.echo("\n⚠️ Shell stderr (exit code 0):", err=True)
for node_id, stderr in stderr_warnings:
# Truncate long stderr to 300 chars
stderr_preview = stderr[:300]
if len(stderr) > 300:
stderr_preview += "..."
# Indent multiline stderr for readability
indented = stderr_preview.replace("\n", "\n ")
click.echo(f" • {node_id}: {indented}", err=True)


def _display_workflow_action(workflow_name: str, workflow_action: str) -> None:
"""Display workflow name and action message.

Expand Down Expand Up @@ -761,6 +794,24 @@ def _display_cost_summary(total_cost: float | None, formatted_result: dict[str,
click.echo(f"💰 Cost: ${total_cost:.4f}", err=True)


def _display_workflow_completion_status(duration_s: float, status: str, has_stderr_warnings: bool) -> None:
"""Display workflow completion status with appropriate indicator.

Args:
duration_s: Execution duration in seconds
status: Workflow status ("success", "degraded", "failed")
has_stderr_warnings: Whether any shell node produced stderr with exit_code=0
"""
if status == "degraded":
click.echo(f"⚠️ Workflow completed with warnings in {duration_s:.3f}s", err=True)
elif status == "failed":
click.echo(f"❌ Workflow failed after {duration_s:.3f}s", err=True)
elif has_stderr_warnings:
click.echo(f"⚠️ Workflow completed in {duration_s:.3f}s", err=True)
else:
click.echo(f"✓ Workflow completed in {duration_s:.3f}s", err=True)


def _display_execution_summary(formatted_result: dict[str, Any], verbose: bool) -> None:
"""Display execution summary with metrics in text mode.

Expand Down Expand Up @@ -796,13 +847,8 @@ def _display_execution_summary(formatted_result: dict[str, Any], verbose: bool)
if duration_ms is not None:
duration_s = duration_ms / 1000.0
status = formatted_result.get("status", "success")

if status == "degraded":
click.echo(f"⚠️ Workflow completed with warnings in {duration_s:.3f}s", err=True)
elif status == "failed":
click.echo(f"❌ Workflow failed after {duration_s:.3f}s", err=True)
else:
click.echo(f"✓ Workflow completed in {duration_s:.3f}s", err=True)
has_stderr_warnings = any(step.get("has_stderr") for step in steps)
_display_workflow_completion_status(duration_s, status, has_stderr_warnings)

# Show per-node execution details
if steps:
Expand All @@ -814,6 +860,9 @@ def _display_execution_summary(formatted_result: dict[str, Any], verbose: bool)
# Show batch errors section if any batch nodes had failures
_display_batch_errors(steps)

# Show stderr warnings for shell nodes that succeeded but produced stderr
_display_stderr_warnings(steps)

# Show cost if > 0
_display_cost_summary(total_cost, formatted_result)

Expand Down
Loading
Loading