
feat: add smart output display for registry run (closes #3) #4

Merged

spinje merged 4 commits into main from feat/smart-output-display on Dec 20, 2025

Conversation

spinje (Owner) commented on Dec 19, 2025

Summary

Extends Task 89's structure-only mode with configurable output modes for pflow registry run. Users can now see actual values directly in the output, with intelligent truncation for large data.

Closes #3

Changes

  • Settings: Add output_mode field to RegistrySettings (smart | structure | full)
  • CLI Commands: pflow settings registry output-mode [mode] to view/set mode
  • Smart Formatting: New functions for value display with truncation tracking
    • format_value_for_smart_display() - formats values with (truncated) indicator
    • format_smart_paths_with_values() - paths with values, shows read-fields hint when truncated
    • format_full_paths_with_values() - all values without truncation
  • Routing: format_structure_output() now routes based on output_mode
  • Integration: CLI and MCP both read settings and pass output_mode

Output Modes

| Mode      | Filtering        | Values | Truncation      | Default |
|-----------|------------------|--------|-----------------|---------|
| smart     | Yes (>25 fields) | Yes    | Yes (200 chars) | Yes     |
| structure | Yes (>25 fields) | No     | N/A             | No      |
| full      | No               | Yes    | No              | No      |
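A minimal sketch of the mode behavior above (function names, signatures, and the exact line format here are illustrative, not the PR's implementation):

```python
SMART_MAX_STRING_LENGTH = 200  # smart-mode truncation limit from the PR


def format_value(value, mode):
    """Render one output value according to the configured output_mode."""
    if mode == "structure":
        return ""  # structure mode shows paths only, no values
    text = f'"{value}"' if isinstance(value, str) else str(value)
    if mode == "smart" and len(text) > SMART_MAX_STRING_LENGTH:
        text = text[: SMART_MAX_STRING_LENGTH - 3] + "... (truncated)"
    return f" = {text}"


def format_line(path, type_name, value, mode):
    """Build one '${path} (type) = value' output line."""
    return f"${{{path}}} ({type_name}){format_value(value, mode)}"


print(format_line("stdout", "str", "hello world", "smart"))
# ${stdout} (str) = "hello world"
print(format_line("stdout", "str", "hello world", "structure"))
# ${stdout} (str)
```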

Example Output (Smart Mode)

✓ Node executed successfully

Execution ID: exec-1766180609-f4e20a18

Output:
  ✓ ${stdout} (str) = "hello world"
  ✓ ${stderr} (str) = ""
  ✓ ${exit_code} (int) = 0

Execution time: 7ms

Files Changed

 src/pflow/core/settings.py                         |  14 +
 src/pflow/cli/commands/settings.py                 |  51 +++
 src/pflow/cli/registry_run.py                      |  18 +-
 src/pflow/execution/formatters/node_output_formatter.py  | 289 ++++++++++---
 src/pflow/mcp_server/services/execution_service.py |   8 +
 tests/test_execution/formatters/test_node_output_formatter.py | 150 ++++++-
 tests/test_cli/test_settings_cli.py                |  31 ++

Testing

  • Added 10 new tests for smart output mode and settings validation
  • Updated 3 existing tests to explicitly use output_mode="structure"
  • All 2940 tests pass
make test  # Run all tests
make check # Run linting and type checks

Documentation

  • scratchpads/smart-output-display/implementation-spec.md - Full specification
  • scratchpads/smart-output-display/manual-test-plan.md - Manual testing guide

Some MCP servers (e.g., klavis-youtube-transcripts) incorrectly return
str(dict) instead of json.dumps(dict), producing Python repr format
with single quotes that breaks JSON tools like jq.

Changes:
- Add ast.literal_eval() fallback in _safe_parse_json() when JSON
  parsing fails for dict/list-like strings
- Add type checks to preserve pre-parsed dict/list in content.text
- Add dict/list preservation in _extract_unknown_content() and
  _extract_result() fallback
- Add 9 new tests for pre-parsed data and Python repr handling
- Add bugfix log entry documenting the fix

The fix is backward compatible: JSON is always tried first, and
ast.literal_eval() is only used as a fallback for non-compliant servers.
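A minimal sketch of that JSON-first fallback (the helper name and the leading-character guard are illustrative; the real code lives in `_safe_parse_json()`):

```python
import ast
import json


def safe_parse(text):
    """Parse tool output: JSON first, Python-literal fallback second.

    ast.literal_eval() only evaluates Python literals (strings, numbers,
    tuples, lists, dicts) and never executes code, so it is safe for
    untrusted input from non-compliant servers.
    """
    try:
        return json.loads(text)
    except ValueError:  # JSONDecodeError is a ValueError subclass
        pass
    stripped = text.strip()
    if stripped[:1] in ("{", "["):  # only try repr-looking payloads
        try:
            return ast.literal_eval(stripped)
        except (ValueError, SyntaxError):
            pass
    return text  # leave unparseable payloads as raw text


print(safe_parse("{'a': 1}"))  # Python repr from a non-compliant server -> {'a': 1}
```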
claude bot commented on Dec 19, 2025

Code Review - PR #4: Smart Output Display for Registry Run

Summary

This PR extends Task 89's structure-only mode with configurable output modes. The implementation is well-structured and follows the project's patterns. I've identified a few critical issues and several suggestions for improvement.


Strengths

  1. Excellent test coverage: 10 new tests with clear documentation of what they protect against
  2. Good separation of concerns: Settings, CLI commands, and formatting logic properly separated
  3. Backward compatible: Defaults to 'smart' mode, structure-only available via settings
  4. Comprehensive documentation: Implementation spec and manual test plan in scratchpads
  5. Security-aware: MCP fix handles Python repr format safely with ast.literal_eval()
  6. Smart filtering integration: Reuses existing smart filtering infrastructure

🔴 Critical — Must fix before merge

1. Missing truncation indicator in string formatting

File: src/pflow/execution/formatters/node_output_formatter.py:673-676

The string truncation logic is inconsistent. When truncating, you add `"..."` to the string but then append ` (truncated)` separately, which could confuse parsing.

Current:

if len(value) > SMART_MAX_STRING_LENGTH:
    truncated = value[: SMART_MAX_STRING_LENGTH - 3] + "..."
    return f'"{truncated}" (truncated)', True

Issue: The calculation is off. You take 197 chars (200 - 3) and then add "..." (3 chars), so the truncated variable is exactly 200 chars, but the appended (truncated) suffix brings the total output to ~212 chars, defeating the purpose of the 200-char limit.

Fix:

if len(value) > SMART_MAX_STRING_LENGTH:
    # Reserve space for "..." (3) + " (truncated)" (12) = 15 chars
    max_content = SMART_MAX_STRING_LENGTH - 15
    truncated = value[:max_content] + "..."
    return f'"{truncated}" (truncated)', True

Or if the 200-char limit should apply to the displayed output (not the original value):

if len(value) > SMART_MAX_STRING_LENGTH:
    truncated = value[:SMART_MAX_STRING_LENGTH - 3] + "..."
    return f'"{truncated}" (truncated)', True
return f'"{value}"', False

2. Inconsistent truncation tracking for collections

File: src/pflow/execution/formatters/node_output_formatter.py:659-665

def _format_collection_smart(value: dict | list) -> tuple[str, bool]:
    """Format dict or list for smart display."""
    is_dict = isinstance(value, dict)
    count = len(value)
    threshold = SMART_MAX_DICT_KEYS if is_dict else SMART_MAX_LIST_ITEMS
    summary = f"{{...{count} keys}}" if is_dict else f"[...{count} items]"

    if count > threshold:
        return summary, False  # ❌ Should return True (was truncated/summarized)

Issue: When a collection is summarized (e.g., {...10 keys}), you return False for the truncation flag. This means the "Use pflow read-fields for full values" hint won't appear even though the user can't see the actual data.

Fix:

if count > threshold:
    return summary, True  # Mark as truncated so hint appears

⚠️ Warnings — Should be addressed

3. Type annotation could be more specific

File: src/pflow/execution/formatters/node_output_formatter.py:690

def format_smart_paths_with_values(
    paths: list[tuple[str, str]],
    outputs: dict[str, Any],
    shared_store: dict[str, Any],
    source_description: str | None,
    execution_id: str,
) -> tuple[list[str], bool]:

The return type says bool for the second element, but it's specifically tracking "any_truncated". Consider using a more descriptive variable name or a NamedTuple for clarity.

Suggestion:

from typing import NamedTuple

class FormattedOutput(NamedTuple):
    lines: list[str]
    has_truncated_values: bool

def format_smart_paths_with_values(...) -> FormattedOutput:
    # ...
    return FormattedOutput(lines=lines, has_truncated_values=any_truncated)

Or at minimum, document the return tuple better in the docstring.

4. Potential issue with TemplateResolver.resolve_value failure modes

File: src/pflow/execution/formatters/node_output_formatter.py:715-720

value = TemplateResolver.resolve_value(path, outputs)
if value is None:
    # Try shared_store as fallback
    value = TemplateResolver.resolve_value(path, shared_store)

# Format the value
if value is None:
    formatted_value = "<not found>"

Issue: resolve_value might return None for two reasons:

  1. The path doesn't exist (error case)
  2. The value is legitimately None (valid data)

You can't distinguish between these cases. Consider checking if the path exists first, or handling the None value case explicitly.

Suggestion:

try:
    value = TemplateResolver.resolve_value(path, outputs)
except (KeyError, AttributeError):
    try:
        value = TemplateResolver.resolve_value(path, shared_store)
    except (KeyError, AttributeError):
        value = "<not found>"
        formatted_value = value
        truncated = False
else:
    formatted_value, truncated = format_value_for_smart_display(value)

5. MCP node: ast.literal_eval security consideration

File: src/pflow/nodes/mcp/node.py:644-654

While ast.literal_eval() is designed to be safe (only evaluates literals, doesn't execute code), you should add a comment explaining WHY it's safe for untrusted input. This helps future maintainers understand the security model.

Current:

try:
    import ast
    parsed = ast.literal_eval(text_stripped)

Suggested:

try:
    import ast
    # ast.literal_eval() is safe for untrusted input - it only evaluates
    # Python literals (strings, numbers, tuples, lists, dicts) and never
    # executes arbitrary code. It's the recommended way to parse Python repr format.
    parsed = ast.literal_eval(text_stripped)

6. Fallback string formatting might double-truncate

File: src/pflow/execution/formatters/node_output_formatter.py:685-687

value_str = str(value)
if len(value_str) > SMART_MAX_STRING_LENGTH:
    return f"{value_str[: SMART_MAX_STRING_LENGTH - 3]}..." + " (truncated)", True

Same issue as #1 - you're not accounting for the length of (truncated) when slicing. Also, string concatenation here is inefficient.

Fix:

value_str = str(value)
if len(value_str) > SMART_MAX_STRING_LENGTH:
    max_len = SMART_MAX_STRING_LENGTH - 3 - len(" (truncated)")
    return f"{value_str[:max_len]}... (truncated)", True
return value_str, False

7. Settings validation happens at wrong layer

File: src/pflow/core/settings.py:38-44

The validation is in the Pydantic model, which is good, but the CLI also validates via click.Choice. This is duplication and could lead to inconsistency if the valid modes change.

Current (in CLI):

@click.argument("mode", required=False, type=click.Choice(["smart", "structure", "full"]))

Suggestion: Define the valid modes as a constant that both can import:

# In settings.py
OUTPUT_MODES = ["smart", "structure", "full"]

class RegistrySettings(BaseModel):
    # ...
    @field_validator("output_mode")
    @classmethod
    def validate_output_mode(cls, v: str) -> str:
        if v not in OUTPUT_MODES:
            raise ValueError(f"Invalid output_mode: {v}. Must be one of: {', '.join(OUTPUT_MODES)}")
        return v

# In CLI
from pflow.core.settings import OUTPUT_MODES
@click.argument("mode", required=False, type=click.Choice(OUTPUT_MODES))

💡 Suggestions — Optional improvements

8. Consider using an Enum for output modes

from enum import Enum

class OutputMode(str, Enum):
    SMART = "smart"
    STRUCTURE = "structure"
    FULL = "full"

class RegistrySettings(BaseModel):
    output_mode: OutputMode = Field(default=OutputMode.SMART, ...)

This provides type safety and autocompletion in IDEs.

9. Test could be more specific

File: tests/test_execution/formatters/test_node_output_formatter.py:660-670

def test_smart_mode_summarizes_large_dicts(self):
    large_dict = {f"key{i}": f"value{i}" for i in range(10)}
    formatted, truncated = format_value_for_smart_display(large_dict)
    assert "{...10 keys}" in formatted
    assert truncated is False  # Summarized, not truncated

Issue: The comment says "Summarized, not truncated", but per issue #2 above, this SHOULD be True because the user can't see the actual data. The test and implementation both need fixing.

10. Hint message could be more helpful

File: src/pflow/execution/formatters/node_output_formatter.py:737

lines.append(f"\nUse `pflow read-fields {execution_id} <path>` for full values.")

Suggestion: Include an example path from the actual output:

if any_truncated and paths:
    example_path = paths[0][0]  # First path
    lines.append(f"\nUse `pflow read-fields {execution_id} {example_path}` to see full values.")

11. Missing edge case: what if execution_id is empty?

File: src/pflow/execution/formatters/node_output_formatter.py:737

If execution_id is an empty string, the hint will be malformed: pflow read-fields <path>

Fix:

if any_truncated and execution_id:  # Only show hint if we have an execution ID
    lines.append(f"\nUse `pflow read-fields {execution_id} <path>` for full values.")

12. Code duplication in format_full_paths_with_values

File: The pattern of resolving values appears in both format_smart_paths_with_values and format_full_paths_with_values. Consider extracting a helper:

def _resolve_template_value(path: str, outputs: dict, shared_store: dict) -> Any:
    """Resolve a template path to its actual value."""
    value = TemplateResolver.resolve_value(path, outputs)
    if value is None:
        value = TemplateResolver.resolve_value(path, shared_store)
    return value

📋 Test Coverage Assessment

Good:

  • Tests cover the happy path for all three modes
  • Tests verify truncation behavior
  • Tests verify settings persistence and validation

Missing (consider adding):

  • Test for None values vs missing values (warning #4 above)
  • Test for binary data handling in smart mode
  • Test for deeply nested structures
  • Test for empty collections ({}, [])
  • Test for unicode/emoji in truncated strings
  • Test for collections that fail JSON serialization in _format_collection_smart

🔍 Performance Considerations

No major concerns, but note:

  • TemplateResolver.resolve_value is called twice per path (outputs then shared_store). Consider if there's a way to make this more efficient.
  • json.dumps in _format_collection_smart could be expensive for large objects. Since you're checking len(compact), this could cause performance issues. Consider catching the performance-critical cases earlier.
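A sketch of the suggested reordering, with illustrative thresholds: check the item count before serializing, so `json.dumps()` never runs on an oversized object.

```python
import json

SMART_MAX_ITEMS = 5  # illustrative threshold (PR summarizes collections >5 items)


def format_collection(value):
    """Summarize large collections before serializing them.

    Checking len() first keeps json.dumps() off the expensive path
    for oversized dicts and lists.
    """
    count = len(value)
    if count > SMART_MAX_ITEMS:
        return f"{{...{count} keys}}" if isinstance(value, dict) else f"[...{count} items]"
    return json.dumps(value)


print(format_collection({f"k{i}": i for i in range(10)}))  # {...10 keys}
```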

🛡️ Security Review

Good:

  • ast.literal_eval() is appropriate for parsing untrusted Python repr format
  • Settings validation prevents invalid modes
  • No credential exposure risks

Note: Consider documenting that MCP servers returning malformed data (Python repr instead of JSON) are non-compliant and should be reported upstream.


📚 Documentation

Excellent: The PR description, scratchpad specs, and manual test plan are comprehensive.

Suggestion: Add a note to the main docs about the three output modes and when to use each.


Summary Score

| Category      | Score | Notes                                       |
|---------------|-------|---------------------------------------------|
| Code Quality  | 7/10  | Several bugs in truncation logic            |
| Test Coverage | 8/10  | Good coverage, missing edge cases           |
| Performance   | 9/10  | No major concerns                           |
| Security      | 9/10  | Handles untrusted input safely              |
| Documentation | 9/10  | Excellent spec and manual tests             |
| Overall       | 8/10  | Strong PR with a few critical fixes needed  |

Recommended Actions

  1. Fix truncation calculation (issues #1, #6) - ensures 200-char limit is respected
  2. Fix collection truncation flag (issues #2, #9) - ensures hint appears when needed
  3. Add edge case tests - particularly for None values and empty collections
  4. Extract OUTPUT_MODES constant (issue #7) - DRY principle
  5. Improve hint message (issues #10, #11) - better UX

Once these critical issues are addressed, this PR is ready to merge. Great work on the comprehensive testing and documentation! 🎉

@spinje spinje merged commit 8bd44c2 into main Dec 20, 2025
6 of 7 checks passed
Extends Task 89's structure-only mode with configurable output modes:

- smart (default): Shows template paths WITH values, truncates large values
  (>200 chars for strings, >5 items for collections), applies smart filtering
- structure: Original Task 89 behavior - paths only, no values
- full: Shows all paths with full values, no filtering or truncation

Changes:
- Add output_mode field to RegistrySettings with validator
- Add CLI commands: pflow settings registry output-mode [smart|structure|full]
- Implement format_value_for_smart_display() with truncation tracking
- Implement format_smart_paths_with_values() and format_full_paths_with_values()
- Update format_structure_output() to route based on output_mode
- Update CLI and MCP to read setting and pass output_mode
- Add 10 new tests for smart output mode and settings

Output format examples:
- smart: ${stdout} (str) = "hello world"
- smart truncated: ${content} (str) = "first 197 chars..." (truncated)
- smart dict: ${result.data} (dict) = {...5 keys}
- structure: ${stdout} (str)
spinje added a commit that referenced this pull request Jan 4, 2026
Add three high-value principles for AI agents:
- Epistemic #7: Solve observed problems, not theorized ones
- Operational #4: Bad research → bad plans → bad code (verify early)
- Operational #7: Prefer reversible decisions, leave room to navigate
spinje added a commit that referenced this pull request Jan 8, 2026
Code review fixes:
- Remove default from _store_output() method signature (warning #4)
- Add explicit strip_newline=False for stderr call
- Fix type hints: str | bytes with isinstance narrowing (warning #5)
- Update docstrings to clarify stdout-only behavior (warning #3)
- Add test for binary data + strip_newline interaction (critical #1)
- Add test for empty stdout edge case (critical #2)
- Update agent instructions to clarify stderr is never modified

All code review items addressed except Windows CRLF (documented as Unix-first).
spinje added a commit that referenced this pull request Feb 5, 2026
…eview]

Review feedback addressed:

1. Identity check pattern fragility (Warning #1):
   - Changed _coerce_provided_input() to return (value, was_coerced) tuple
   - Makes intent explicit and prevents subtle bugs from identity assumptions

2. Type alias normalization (Warning #4):
   - _normalize_type() now preserves case for unknown types
   - Supports future custom/user-defined types

3. dict/list to string uses Python repr (Review #2):
   - _coerce_to_string() now uses json.dumps() for containers
   - Produces valid JSON strings consistent with coerce_to_declared_type()

4. Env/defaults values not coerced (Review #2):
   - Apply coercion to values from os.environ, settings_env, and defaults
   - Ensures "respect declared types" works for all input sources

5. None value handling:
   - Preserve None values (don't coerce to "None" string)
   - Optional inputs without defaults correctly resolve to None

Also:
- Added 4 new tests for env/settings coercion
- Documented lenient coercion behavior in docstrings
- Created Task 120 for future strict input type validation
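A minimal sketch of the coercion rules described in items 3 and 5 above (the function name and signature are assumptions, not the PR's actual helper):

```python
import json


def coerce_to_string(value):
    """Coerce an input value for a string-typed input.

    Containers become valid JSON (not Python repr), and None is
    preserved so optional inputs without defaults resolve to None.
    """
    if value is None:
        return None  # never coerce None into the string "None"
    if isinstance(value, (dict, list)):
        return json.dumps(value)  # valid JSON, double quotes
    return str(value)


print(coerce_to_string({"key": "value", "n": 42}))  # {"key": "value", "n": 42}
```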
spinje added a commit that referenced this pull request Apr 8, 2026
…49 follow-up)

Adversarial verification of the Task 149 refactor found three real
regressions the test suite didn't catch (CliRunner intercepts stderr
differently from a real subprocess, so logging-based corruption was
invisible to the mocked tests). This commit fixes them.

Fix #2 — Nested workflow rendering (OutputController partial-line tracking):
- OutputController now tracks _partial_line_open so every _handle_node_*
  path self-terminates any open partial line before writing a fresh line.
- _handle_node_complete / _cached / _warning now accept node_id + indent
  and re-emit a fresh "  node_id" lead-in if the partial line was closed
  between node_start and node_complete (e.g. by a sub-workflow's own
  progress events or a logger write).
- Before: `  nested_call...    inner_a... ✓ 0.5s` (parent's partial line
  concatenated with child's first start, parent's completion orphaned).
- After: clean two-level indented rendering, parent re-emitted on its
  own line as `  nested_call ✓ 1.0s`.

Fix #3 — Drop redundant logger.warning("Command failed with exit code N")
from shell.py's post() method:
- That call wrote directly to stderr via Python logging, bypassing
  OutputController, and concatenated onto the partial `node_id...` line
  as `will_fail...WARNING: Command failed with exit code 1\n ✗ Failed`.
- The information is redundant — pflow's diagnostic pipeline already
  surfaces the exact same error via call_completion_callback (which
  reads shared["exit_code"] and builds the message) and via the
  post-execution diagnostic block that prints shared["error"].
- Updated test_auto_handling.py's TestLoggingBehavior to assert on the
  canonical behavior (action="error" + shared["exit_code"] + shared["error"])
  instead of the removed logger.warning.

Fix #1 — JSON-encode structured outputs in safe_output:
- Before: `{'key': 'value', 'n': 42}` (Python repr, single quotes — jq
  cannot parse).
- After: `{"key": "value", "n": 42}` (valid JSON, `pflow foo | jq` works).
- Strings still pass through verbatim. Non-JSON-serializable objects
  fall back to str() so safe_output never raises from the output path.
- Updated test_complex_output_types to verify JSON tokens instead of
  str(value) — Python True/None were pre-existing never-worked cases
  exposed by the GH #194 routing fix (data now actually lands on stdout
  where consumers can pipe it).
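A minimal sketch of Fix #1's rendering rule (the function name is illustrative, not the actual safe_output code):

```python
import json


def render_output(value):
    """Render a node output for stdout.

    Strings pass through verbatim; structured values become valid JSON
    (so `pflow foo | jq` works); non-JSON-serializable objects fall
    back to str() so rendering never raises from the output path.
    """
    if isinstance(value, str):
        return value
    try:
        return json.dumps(value)
    except (TypeError, ValueError):
        return str(value)


print(render_output({"key": "value", "n": 42}))  # {"key": "value", "n": 42}
```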

Fix #4 — Real subprocess regression tests:
- New tests/test_cli/test_progress_streaming_subprocess.py with two tests
  that spawn `uv run pflow` in a real subprocess (not CliRunner) so the
  captured stderr reflects what an agent or CI system actually sees:
  * test_failing_shell_node_progress_line_is_clean — asserts no
    `will_fail...WARNING:` or `will_fail...Command failed` corruption.
  * test_nested_workflow_progress_lines_are_not_concatenated —
    asserts no `nested_call...    inner_a` concatenation + that the
    parent's completion line is re-emitted attached to its node id.
- Guards against both Fix #2 and Fix #3 regressing.

Verification:
- `make test`: 4641 passed, 9 skipped.
- `make check`: ruff + mypy + deptry clean.
- Manual subprocess: failing shell, nested workflow, structured outputs,
  and the original #194 fix all clean.
spinje added a commit that referenced this pull request May 8, 2026
…kip review]

PR #378 review finding #4. Closes the synthetic-builder ↔ production-analyze()
fidelity gap in tests/test_core/test_cache_analysis_renderers.py.

What landed
- _BUILDER_DOCUMENTED_DEFAULTS frozenset (6 entries) names the AnalysisSummary
  fields the synthetic builder cannot faithfully model. Tests asserting on
  these MUST drive analyze() end-to-end.
- TestMakeAnalysisShapeParity class with two methods:
  * test_builder_field_set_matches_dataclass_minus_documented_defaults — uses
    dataclasses.fields() introspection to fail noisily when a new field is
    added without builder coverage or allowlist documentation.
  * test_documented_defaults_get_overwritten_by_production — drives analyze()
    against a contrived IR + trace that triggers each documented-default
    overwrite, catching the case where production's overwrite logic is
    deleted while the allowlist stays stale.
- Three renderer tests migrated from synthetic to e2e analyze() calls:
  * test_json_partial_trace_exposes_evidence_scope_and_observed_models
  * test_json_summary_exposes_projection_exclusions_and_delta_reason
  * test_render_json_includes_rollup_workflow_paths_and_unavailable_models_by_workflow

Reviewer's two non-issues confirmed via inspection
- test_text_summary_renders_blocking_errors_categorically (line 272 in PR
  baseline) only asserts on builder-populated fields. Kept synthetic.
- test_json_emits_root_and_sub_workflow_llm_node_counts already drove
  analyze() end-to-end against the committed 3-deep fixture. No migration
  needed.

Verification
- 5 mutation contracts checked by reverting production code:
  * Add new AnalysisSummary field → parity method 1 fails naming the field.
  * Delete observed_models_in_trace overwrite → migration #2 + parity
    method 2 fail with documented diagnostics.
  * Delete unavailable_models_by_workflow overwrite → migration #4 + parity
    method 2 fail.
- 6,335 tests passing on default suite.
- make check clean (ruff + ruff-format + mypy + deptry).
- test_golden_baseline_hashes_match (DD#19) green; test_plan_drift.py 34/34.

Plan + progress log
- Atomic plan at .taskmaster/tasks/task_159/implementation/fix-plans/
  renderer-test-fidelity-shape-parity-plan.md.
- Consolidated PR #378 review-fix sweep entry appended to
  implementation-progress-log.md, documenting all four phases (Phase 1
  easy bundle / Phase 2 medium bundle / Phase 3 cohort-key correctness /
  Phase 4 this commit), the five disputed findings (with citations), the
  GH #380 follow-up filed, and the cross-cutting insights from the sweep.

GH issue #380 filed for the deferred test-bloat parametrize-collapse work.
