
feat: unify LLM usage via Simon Willison's llm library (Task 95)#2

Merged
spinje merged 3 commits into main from feat/llm-usage
Dec 19, 2025

Conversation

@spinje
Owner

@spinje spinje commented Dec 19, 2025

Summary

Fix a critical bug where all Claude models were silently redirected to a hardcoded model (claude-sonnet-4-20250514), ignoring the user's model choice. Migrate the discovery and filtering features to the standard llm library with configurable models.

Changes

Bug Fix (Phase 1) - Critical

  • Move the monkey-patch from the workflow_command() entry point to the planner-only path
  • File-based and saved workflows now use the standard llm library
  • The user's model choice is now respected
  • Invalid models now fail with a clear error instead of silently succeeding

Discovery Migration (Phase 2)

  • Remove install_anthropic_model() from registry discover
  • Remove install_anthropic_model() from workflow discover
  • Use get_model_for_feature("discovery") for configurable model

Smart Filter + Settings (Phase 3+4)

  • Add LLMSettings class with default_model, discovery_model, filtering_model
  • Add get_model_for_feature() helper with resolution chain
  • Smart filter uses configurable model instead of hardcoded Haiku

Default Model Configuration

  • LLM nodes now require explicit model configuration
  • Resolution order: IR params → settings → llm CLI → CompilationError
  • Clear error messages with setup instructions
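The documented resolution order can be sketched as a simple fall-through (a minimal sketch; the function and parameter names here are illustrative, not the actual pflow API):

```python
from typing import Optional


class CompilationError(Exception):
    """Raised when no model can be resolved at compile time."""


def resolve_model(
    ir_params: dict,
    settings_model: Optional[str],
    cli_default: Optional[str],
) -> str:
    # 1. An explicit model in the node's IR params always wins.
    if ir_params.get("model"):
        return ir_params["model"]
    # 2. Fall back to the model configured in settings.
    if settings_model:
        return settings_model
    # 3. Fall back to the llm CLI's own default, if one is set.
    if cli_default:
        return cli_default
    # 4. Nothing configured: fail at compile time with setup guidance,
    #    rather than silently succeeding at runtime.
    raise CompilationError(
        "No model configured; set one in node params, settings, or the llm CLI."
    )
```

Failing at step 4 with an instructive message is what turns the old silent redirection into a visible, fixable configuration error.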

Model Updates

  • Add GPT-5.2, Gemini 3 Flash to pricing
  • Update default detection to use latest models
  • Bump llm-gemini>=0.28.1, llm-anthropic==0.23

Files Changed (24 files)

Core Implementation:

  • src/pflow/cli/main.py - Move monkey-patch to planner path
  • src/pflow/cli/registry.py - Use configurable discovery model
  • src/pflow/cli/commands/workflow.py - Use configurable discovery model
  • src/pflow/core/llm_config.py - Add model resolution helpers
  • src/pflow/core/settings.py - Add LLMSettings class
  • src/pflow/core/smart_filter.py - Use configurable filtering model
  • src/pflow/runtime/compiler.py - Add LLM model injection

Tests (21 new tests):

  • tests/test_core/test_llm_config_workflow_model.py (15 tests)
  • tests/test_runtime/test_compiler_llm_model.py (6 tests)

Task Documentation

  • .taskmaster/tasks/task_95/implementation/progress-log.md
  • .taskmaster/tasks/task_95/research/llm-usage-findings.md

Testing

All 3388 tests pass:

make test  → 3388 passed, 7 skipped
make check → All checks pass

Manual verification completed for all phases (see progress-log.md).

Tests:
- Add test_llm_config_workflow_model.py (15 tests)
- Add test_compiler_llm_model.py (6 tests)
- Add wildcard model support to llm_mock.py
- Update existing tests for new patterns

All 3388 tests pass.
Review fixes (follow-up commit):
- Remove CI debug print statements from llm_config.py
- Move imports to top level in compiler.py (avoid repeated imports)
- Move the SettingsManager import to top level in llm_config.py (no circular import)
- Simplify test mocking patterns (use standard patch() syntax)
- Update test patch locations to match the top-level imports

All 3388 tests pass.
@claude

claude Bot commented Dec 19, 2025

Code Review: Task 95 - Unify LLM Usage via Simon Willison's llm Library

Overview

This PR successfully addresses a critical bug where all Claude models were being silently redirected to a hardcoded model, while also migrating the codebase to use a configurable LLM system based on Simon Willison's llm library. The implementation is well-structured, thoroughly tested (3388 passing tests), and includes comprehensive documentation.


Strengths

Architecture & Design

  • Excellent separation of concerns: The bug fix (Phase 1) is cleanly separated from feature additions (Phases 2-5)
  • Proper error handling: Compile-time failures with helpful error messages instead of silent runtime failures
  • Provider agnostic: Works with any LLM provider supported by the llm library (Anthropic, Gemini, OpenAI, OpenRouter, Ollama, etc.)
  • Clear resolution chain: IR params → settings → llm CLI default → error (well documented and tested)

Code Quality

  • Defensive programming: Input validation on provider names (allowlist), subprocess security (no shell injection), proper timeout handling
  • Immutability preserved: params = {**params, "model": default_model} creates new dict instead of mutating IR
  • Test coverage: 21 new tests covering all resolution paths, edge cases, and error conditions
  • Documentation: Excellent docstrings with resolution order, examples, and clear explanations
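The immutability point above can be illustrated with a minimal sketch (inject_default_model is a hypothetical name, not pflow's actual helper):

```python
def inject_default_model(params: dict, default_model: str) -> dict:
    """Return params with a model set, without mutating the caller's IR dict."""
    if "model" in params:
        return params
    # Build a new dict via unpacking instead of assigning into params,
    # so the original IR data structure is never modified.
    return {**params, "model": default_model}
```

Because the IR dict is shared state that may be re-read (for caching, reporting, or re-compilation), returning a fresh dict avoids a whole class of aliasing bugs.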

Security

  • Subprocess safety: Uses shutil.which() for path validation, stdin=subprocess.DEVNULL, timeouts, no shell expansion
  • Provider allowlist: ALLOWED_PROVIDERS prevents arbitrary command injection
  • No credentials in logs: Sensitive data properly handled
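Those subprocess precautions combine into a pattern like the following (a sketch, not pflow's actual code; the command name is parameterized purely for illustration):

```python
import shutil
import subprocess
from typing import Optional


def read_cli_default(command: str = "llm", timeout: float = 2.0) -> Optional[str]:
    """Ask a CLI tool for its default model, applying the safety measures above."""
    # Resolve the executable through PATH rather than trusting a bare name.
    exe = shutil.which(command)
    if exe is None:
        return None
    try:
        result = subprocess.run(
            [exe, "models", "default"],  # argv list: no shell, no injection
            stdin=subprocess.DEVNULL,    # never block waiting on stdin
            capture_output=True,
            text=True,
            timeout=timeout,             # bound worst-case hangs
        )
    except (subprocess.TimeoutExpired, OSError):
        return None
    output = result.stdout.strip()
    return output or None
```

Every failure mode (missing binary, timeout, OS error, empty output) degrades to None, so the caller's resolution chain can fall through cleanly.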

⚠️ Warnings — Should Be Addressed

1. Import Location in Compiler (compiler.py:605)

The import is placed inside the function, which is executed for every LLM node compilation:

if node_type == "llm" and "model" not in params:
    from pflow.core.llm_config import get_default_workflow_model, get_model_not_configured_help  # ← Inside function

Issue: Repeated imports on every LLM node compilation add unnecessary overhead.

Fix: Move to top-level imports:

# At top of compiler.py
from pflow.core.llm_config import get_default_workflow_model, get_model_not_configured_help

# In function
if node_type == "llm" and "model" not in params:
    default_model = get_default_workflow_model()
    # ...

Rationale: Follows Python best practices and the project's "boring and obvious" guideline. The circular import concern doesn't apply here since llm_config doesn't import from compiler.


2. Lazy Import Pattern in llm_config.py (llm_config.py:243)

from pflow.core.settings import SettingsManager  # ← Inside function to avoid circular import

Issue: This pattern is used in multiple functions (get_model_for_feature, get_default_workflow_model), suggesting a deeper architectural issue.

Concern: If this is truly a circular import, it indicates tight coupling. If not, the comment is misleading.

Recommendation:

  1. Verify if the circular import actually exists by testing top-level import
  2. If it does exist, consider refactoring to break the cycle (e.g., extract shared utilities)
  3. If it doesn't exist, move to top-level imports and remove misleading comment

3. Global State Caching (llm_config.py:16-18)

_cached_default_model: Optional[str] = None
_detection_complete: bool = False

Issue: Global mutable state makes testing harder and introduces potential bugs in concurrent scenarios.

Current mitigation: PYTEST_CURRENT_TEST check skips caching in tests.

Better approach:

  • Use a singleton pattern with explicit cache management
  • Or: Accept the cost of re-detection and remove caching entirely (subprocess calls are already fast with 1-2s timeouts)
  • Or: Document thread-safety assumptions clearly

Why it matters: The current approach works but is fragile. If pflow adds multi-threading in the future, this could cause race conditions.
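One concrete alternative is functools.lru_cache, which provides run-once semantics with explicit cache management instead of module-level globals (a sketch; _probe_default_model stands in for the real subprocess-based detection):

```python
import functools
from typing import Optional


def _probe_default_model() -> Optional[str]:
    # Stand-in for the real (subprocess-based) detection logic.
    return "claude-sonnet-4"


@functools.lru_cache(maxsize=1)
def detect_default_model() -> Optional[str]:
    # Detection runs at most once per process. Tests reset state with
    # detect_default_model.cache_clear() instead of relying on a
    # PYTEST_CURRENT_TEST environment check.
    return _probe_default_model()
```

Note that lru_cache does not serialize the first computation across threads, so the thread-safety assumption still needs documenting, but cache_clear() removes the fragile test-environment special case.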


4. CI Debug Logging (llm_config.py:74-102)

if os.environ.get("CI"):
    print(f"[DEBUG CI] About to run subprocess: {command}", file=sys.stderr, flush=True)
    # ... more debug prints

Issue: This debug code is still present in production. It was likely added to diagnose CI issues but should be removed before merge.

Recommendation: Remove all [DEBUG CI] print statements or move to proper logging:

logger.debug(f"Running subprocess: {command}")  # Uses logging, controlled by log level

5. Test Mock Pattern Inconsistency (test_llm_config_workflow_model.py:37-39)

with patch.object(
    __import__("pflow.core.llm_config", fromlist=["get_llm_cli_default_model"]),
    "get_llm_cli_default_model",
    return_value="claude-3-sonnet",
):

Issue: This is unnecessarily complex. Compare to the cleaner approach in other tests:

with patch("pflow.core.llm_config.get_llm_cli_default_model") as mock_cli:
    mock_cli.return_value = "claude-3-sonnet"

Recommendation: Simplify the patching approach for consistency and readability.


💡 Suggestions — Optional Improvements

1. Error Message Formatting

The error message in get_model_not_configured_help() is excellent, but the JSON formatting could be improved for readability:

Current:

{"id": "my-llm", "type": "llm", "params": {"model": "gpt-5.2", "prompt": "..."}}

Suggested (multi-line JSON):

{
  "id": "my-llm",
  "type": "llm",
  "params": {
    "model": "gpt-5.2",
    "prompt": "..."
  }
}

This is more readable but not critical.


2. Type Hints in Test Fixtures

@pytest.fixture
def mock_registry(self):  # ← Missing return type
    """Create mock registry with LLM node."""
    registry = MagicMock()
    # ...
    return registry

Suggested:

@pytest.fixture
def mock_registry(self) -> MagicMock:
    """Create mock registry with LLM node."""

This improves IDE support and type checking, though not strictly necessary for test code.


3. Consolidate Model Resolution Logic

There's some duplication in model resolution between:

  • get_default_workflow_model() (settings → llm CLI → None)
  • get_model_for_feature() (settings → auto-detect → fallback)

Potential refactoring:

def _resolve_model(
    settings_getter: Callable[[LLMSettings], Optional[str]],
    allow_fallback: bool = False
) -> Optional[str]:
    """Generic model resolution with configurable fallback."""
    # Common logic here

This is a minor optimization and may not be worth the added abstraction at this stage.


🔍 Specific Code Observations

Good Practices Observed

  1. Subprocess security: Excellent use of stdin=subprocess.DEVNULL, timeouts, and validated paths
  2. Error message quality: The get_model_not_configured_help() message is a model for user-friendly errors
  3. Test coverage: Edge cases well-covered (settings load failure, timeout, invalid provider, etc.)
  4. Immutability: IR data structures are never mutated (new dict created in compiler)
  5. Logging levels: Proper use of logger.debug() for internal operations and logger.info() for user-visible actions

Areas for Improvement

  1. Inline imports: Should be moved to top-level where possible
  2. Global state: Consider alternatives to module-level caching
  3. Debug code: Remove CI-specific print statements
  4. Test mocking: Use simpler patch() syntax consistently

📋 Test Quality Assessment

Excellent Coverage

  • Resolution chain tested: All three sources (IR, settings, llm CLI) verified
  • Error paths tested: Missing config, load failures, timeouts, invalid providers
  • Edge cases tested: Empty stdout, subprocess failures, PYTEST_CURRENT_TEST skipping
  • Integration verified: Manual testing confirms file workflows work correctly

Test Quality

  • Tests are focused and clear - each tests one behavior
  • Good use of fixtures - mock_registry reused appropriately
  • Proper assertions - verifies both return values and function calls
  • No flaky tests - deterministic mocking, no timing dependencies

Minor Improvements

  • Some tests could be more concise (the patch.object(__import__()) pattern)
  • Consider parameterized tests for similar scenarios (e.g., different providers)
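The parameterized-test suggestion could look like this (resolve is a minimal stand-in for the real resolution helper, not pflow's actual function):

```python
import pytest


def resolve(settings_model, cli_default):
    # Minimal stand-in for the settings -> CLI-default resolution chain.
    return settings_model or cli_default


@pytest.mark.parametrize(
    ("settings_model", "cli_default", "expected"),
    [
        ("claude-3-sonnet", "gpt-4o-mini", "claude-3-sonnet"),  # settings wins
        (None, "gpt-4o-mini", "gpt-4o-mini"),                   # CLI fallback
    ],
)
def test_model_resolution(settings_model, cli_default, expected):
    assert resolve(settings_model, cli_default) == expected
```

Each scenario becomes one row in the parameter table, so adding a new provider is a one-line change instead of a copy-pasted test function.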

🎯 Breaking Changes

The PR correctly identifies this as a breaking change for users who:

  1. Don't specify a model in their workflows
  2. Don't have settings configured
  3. Haven't set a default via the llm models default command

Mitigation: The error message provides three clear solutions with examples. This is the right approach - fail fast with helpful guidance.


📊 Overall Assessment

Criterion         Rating        Notes
Correctness       ✅ Excellent   Bug fixed, new features work as intended
Code Quality      ⚠️ Good        Minor issues (imports, debug code) prevent "Excellent"
Security          ✅ Excellent   Subprocess handling, input validation exemplary
Test Coverage     ✅ Excellent   Comprehensive, well-structured tests
Documentation     ✅ Excellent   Clear docstrings, progress log, error messages
Architecture      ✅ Excellent   Clean separation, proper abstractions
User Experience   ✅ Excellent   Helpful errors, clear resolution chain

🏁 Final Recommendation

Approve with minor fixes - This is high-quality work that significantly improves the codebase. The warnings above are non-critical but should be addressed before merge:

Required Before Merge

  1. ✅ Remove CI debug print statements (or convert to proper logging)
  2. ✅ Move imports to top-level in compiler.py (avoid repeated imports)

Recommended Before Merge

  1. Investigate and resolve/document circular import in llm_config.py
  2. Simplify test mocking patterns for consistency

Optional (Future Tasks)

  1. Consider alternatives to global state caching
  2. Format error message JSON for better readability

💬 Additional Comments

This PR demonstrates excellent engineering practice:

  • Clear problem identification and incremental solution (5 phases)
  • Thorough manual and automated verification
  • Comprehensive documentation (progress log is exemplary)
  • User-centric design (error messages, configuration options)

The implementation follows the project's "boring and obvious" principle while solving a complex problem elegantly. Great work! 🎉


Reviewed by: Claude Code
Date: 2025-12-19
Test Status: ✅ 3388 passed, 7 skipped
Code Quality: ✅ All checks pass

@spinje spinje merged commit 309b6b8 into main Dec 19, 2025
7 checks passed
spinje added a commit that referenced this pull request Jan 8, 2026
Code review fixes:
- Remove default from _store_output() method signature (warning #4)
- Add explicit strip_newline=False for stderr call
- Fix type hints: str | bytes with isinstance narrowing (warning #5)
- Update docstrings to clarify stdout-only behavior (warning #3)
- Add test for binary data + strip_newline interaction (critical #1)
- Add test for empty stdout edge case (critical #2)
- Update agent instructions to clarify stderr is never modified

All code review items addressed except Windows CRLF (documented as Unix-first).
spinje added a commit that referenced this pull request Feb 5, 2026
…eview]

Review feedback addressed:

1. Identity check pattern fragility (Warning #1):
   - Changed _coerce_provided_input() to return (value, was_coerced) tuple
   - Makes intent explicit and prevents subtle bugs from identity assumptions

2. Type alias normalization (Warning #4):
   - _normalize_type() now preserves case for unknown types
   - Supports future custom/user-defined types

3. dict/list to string uses Python repr (Review #2):
   - _coerce_to_string() now uses json.dumps() for containers
   - Produces valid JSON strings consistent with coerce_to_declared_type()

4. Env/defaults values not coerced (Review #2):
   - Apply coercion to values from os.environ, settings_env, and defaults
   - Ensures "respect declared types" works for all input sources

5. None value handling:
   - Preserve None values (don't coerce to "None" string)
   - Optional inputs without defaults correctly resolve to None

Also:
- Added 4 new tests for env/settings coercion
- Documented lenient coercion behavior in docstrings
- Created Task 120 for future strict input type validation
spinje added a commit that referenced this pull request Apr 8, 2026
…49 follow-up)

Adversarial verification of the Task 149 refactor found three real
regressions the test suite didn't catch (CliRunner intercepts stderr
differently from a real subprocess, so logging-based corruption was
invisible to the mocked tests). This commit fixes them.

Fix #2 — Nested workflow rendering (OutputController partial-line tracking):
- OutputController now tracks _partial_line_open so every _handle_node_*
  path self-terminates any open partial line before writing a fresh line.
- _handle_node_complete / _cached / _warning now accept node_id + indent
  and re-emit a fresh "  node_id" lead-in if the partial line was closed
  between node_start and node_complete (e.g. by a sub-workflow's own
  progress events or a logger write).
- Before: `  nested_call...    inner_a... ✓ 0.5s` (parent's partial line
  concatenated with child's first start, parent's completion orphaned).
- After: clean two-level indented rendering, parent re-emitted on its
  own line as `  nested_call ✓ 1.0s`.

Fix #3 — Drop redundant logger.warning("Command failed with exit code N")
from shell.py's post() method:
- That call wrote directly to stderr via Python logging, bypassing
  OutputController, and concatenated onto the partial `node_id...` line
  as `will_fail...WARNING: Command failed with exit code 1\n ✗ Failed`.
- The information is redundant — pflow's diagnostic pipeline already
  surfaces the exact same error via call_completion_callback (which
  reads shared["exit_code"] and builds the message) and via the
  post-execution diagnostic block that prints shared["error"].
- Updated test_auto_handling.py's TestLoggingBehavior to assert on the
  canonical behavior (action="error" + shared["exit_code"] + shared["error"])
  instead of the removed logger.warning.

Fix #1 — JSON-encode structured outputs in safe_output:
- Before: `{'key': 'value', 'n': 42}` (Python repr, single quotes — jq
  cannot parse).
- After: `{"key": "value", "n": 42}` (valid JSON, `pflow foo | jq` works).
- Strings still pass through verbatim. Non-JSON-serializable objects
  fall back to str() so safe_output never raises from the output path.
- Updated test_complex_output_types to verify JSON tokens instead of
  str(value) — Python True/None were pre-existing never-worked cases
  exposed by the GH #194 routing fix (data now actually lands on stdout
  where consumers can pipe it).

Fix #4 — Real subprocess regression tests:
- New tests/test_cli/test_progress_streaming_subprocess.py with two tests
  that spawn `uv run pflow` in a real subprocess (not CliRunner) so the
  captured stderr reflects what an agent or CI system actually sees:
  * test_failing_shell_node_progress_line_is_clean — asserts no
    `will_fail...WARNING:` or `will_fail...Command failed` corruption.
  * test_nested_workflow_progress_lines_are_not_concatenated —
    asserts no `nested_call...    inner_a` concatenation + that the
    parent's completion line is re-emitted attached to its node id.
- Guards against both Fix #2 and Fix #3 regressing.

Verification:
- `make test`: 4641 passed, 9 skipped.
- `make check`: ruff + mypy + deptry clean.
- Manual subprocess: failing shell, nested workflow, structured outputs,
  and the original #194 fix all clean.
spinje added a commit that referenced this pull request Apr 8, 2026
… review]

Applies two items from the "anything else?" review re-visit:

1. claude[bot] Warning #3 — promote _pflow_validation_warnings to
   first-class WorkflowValidationError constructor kwarg. Reversed my
   Round 8 "defer" recommendation after re-reading the two dynamic-
   attribute sites in runner.py and realizing they're different
   concerns. _pflow_parser_diagnostics stays (cross-cutting annotation
   on any exception type). _pflow_validation_warnings moves to a proper
   kwarg because it's specific to WorkflowValidationError and has a
   clean destination.

2. Fix [1] cascade side effect — discovered during the "anything else?"
   verification pass. When a workflow has an unknown node type AND a
   downstream template reference, Fix [1]'s silent-skip made the
   defensive fallback in _get_node_outputs_from_registry reachable.
   The old fallback had a logger.warning("this is unexpected") call
   that leaked to user-visible stderr. Demoted to logger.debug; the
   case is now legitimate and doesn't warrant a stderr warning.

## Production changes

- core/exceptions.py — WorkflowValidationError gains validation_warnings
  constructor kwarg. Stored as self.validation_warnings. Docstring
  documents the two-field contract.

- runtime/template_validation/path_validation.py:787-817 —
  _get_node_outputs_from_registry: logger.warning → logger.debug,
  docstring updated to describe the two legitimate-reach cases
  (unknown node type + defensive backstop).

- execution/runner.py:373-377 — _validate() uses validation_warnings
  constructor kwarg instead of dynamic attribute assignment. The
  # type: ignore[attr-defined] is gone.

- execution/runner.py:538-549 — _exception_to_result reads
  exception.validation_warnings (instead of _pflow_validation_warnings).
  Still getattr because exception is loosely typed at this layer.

## Regression tests (3)

- tests/test_core/test_exception_hierarchy.py::
  test_workflow_validation_error_carries_warnings_as_first_class_attr
  — locks in the kwarg contract: round-trips errors + warnings,
  defaults to empty list when omitted, backward-compat summary
  constructor still works.

- tests/test_execution/test_runner.py::
  test_workflow_validation_error_warnings_survive_via_kwarg
  — end-to-end: WorkflowValidationError raised with
  validation_warnings=[...] survives through _exception_to_result
  into ExecutionResult.diagnostics.

- tests/test_runtime/test_template_validation/test_enhanced_errors.py::
  test_unknown_node_type_downstream_ref_no_stderr_warning
  — three-part structural guard for the fallback log demotion:
  (a) template error diagnostic still produced (behavior),
  (b) no WARNING-level log from fallback (UX fix),
  (c) DEBUG-level log still fires (observability preserved).

## Verification

- make test: 4680 passed (was 4677 + 3 new regressions)
- make check: clean (ruff + ruff-format + mypy 171 files + deptry)
- Manual cascade repro: 2 clean errors, zero stderr noise
- Grep: zero _pflow_validation_warnings in src/ (promotion complete)
- Baseline refresh: zero drift (Round 9 changes are rendering-neutral)

## Deferred (now truly done)

- claude[bot] Suggestion #2 (generic runtime warning text): still
  deferred. Requires runtime warning categorization infrastructure
  that doesn't exist. No user-reported pain point.
- _pflow_parser_diagnostics cleanup: explicitly NOT touched — different
  concern, correctly uses dynamic-attr pattern for cross-cutting
  exception annotation. Task 147 braindump's "attr-defined pattern is
  intentional" applies to this site, not to _pflow_validation_warnings.

Refs: #219, #244
spinje added a commit that referenced this pull request Apr 8, 2026
…ti-failure status, trace errors

Three regressions found during post-implementation verification of Task 148,
all violating explicit spec requirements or acceptance criteria.

BUG #1 — output_resolver silently swallowed all-failed coalesce. The gate at
output_resolver.py:168-172 skipped any unresolved coalesce expression, so
${primary.stdout ?? fallback.stdout} with both operands in __failures__
silently dropped the output instead of raising OutputResolutionError. This
is the exact silent-bad-output behavior Task 148 set out to eliminate (spec
requirement: "both failed produces a structured error" + "All coalesce
operands failed" summary block). Fix: extract _is_all_absent_coalesce helper
that calls classify_unresolved_references and only silent-skips when every
ref is status=absent (Task 128 branch-convergence semantic). Any FAILED or
PATH_ERROR operand now falls through to error recording with the structured
unresolved_references payload.

BUG #2 — build_execution_steps used singular failed_node for per-node status.
execution_state.py:110-133 determined status from exec_state["failed_node"]
(singular pointer — only last failure). In multi-failure workflows, earlier
failed nodes matched neither completed_nodes nor the singular and fell
through to "not_executed". An AI agent consuming the JSON output would see
status:"not_executed" for a node that clearly did execute and fail. Missed
Phase 3 migration — only the metadata get_node_output lookup at line 143
was migrated, status determination at 128-133 was left on the pre-invariant
singular-field model. Fix: replace the three-arm conditional with a call to
node_state.get_node_status() mapped through _STATUS_MAP to the existing
status strings. Single source of truth.

BUG #3 — --report summary showed "Unknown error" for #208 repro. Task 148
acceptance criterion: "--report for the #208 repro shows primary as failed
with full context". Root cause: record_trace was called with no error kwarg
in the happy-path action="error" branch (only raised exceptions passed error).
For shell exit N routed via on-error, the trace event had success:False but
no error field, so trace_report.py fell back to "Unknown error". Fix:
pre-compute trace_error from shared[node_id].error and pass it to record_trace
as the new Optional[Exception | str] error kwarg.

Test fixture migration. BUG #2 fix broke 4 tests whose fixtures set
failed_node singular without populating __failures__ (pre-invariant shape).
Updated test_execution_state.py (3 tests) and test_error_formatter.py (1
test) to populate __failures__ dict per the Task 148 invariant.

Regression tests. Added 4 new tests targeting the exact code paths that
would have caught the bugs:
  - test_multi_failure_all_show_failed_status — multi-failure build_execution_steps
  - test_all_failed_coalesce_in_output_raises_structured_error — BUG #1
  - test_mixed_absent_and_failed_coalesce_in_output_raises_error — BUG #1 variant
  - test_all_absent_coalesce_in_output_is_silently_skipped — Task 128 positive regression

Verification. make check clean. pytest -n 4 → 4708 passed (was 4704).
#208 repro, test 02 ignore_errors, test 03 both-fail, test 09 loop
recovery, test 23 mixed absent+failed all produce their expected output.
Direct build_execution_steps probe with multi-failure state labels both
failed nodes correctly.

Follow-ups filed (out of scope, both verified during the same pass):
  - #240 — Trace aggregation reports workflow as 'failed' after
    loop recovery (pre-existing trace semantic gap)
  - #241 — Invariant does not hold for enable_namespacing=false
    (real bug, not reachable via user paths, recommended deprecation)

Progress log at .taskmaster/tasks/task_149/implementation/progress-log.md
documents the full verification pass including withdrawn claims and
Bug #2b (stale failed_node pointer) rationale for deferral.
spinje added a commit that referenced this pull request May 8, 2026
…kip review]

PR #378 review finding #4. Closes the synthetic-builder ↔ production-analyze()
fidelity gap in tests/test_core/test_cache_analysis_renderers.py.

What landed
- _BUILDER_DOCUMENTED_DEFAULTS frozenset (6 entries) names the AnalysisSummary
  fields the synthetic builder cannot faithfully model. Tests asserting on
  these MUST drive analyze() end-to-end.
- TestMakeAnalysisShapeParity class with two methods:
  * test_builder_field_set_matches_dataclass_minus_documented_defaults — uses
    dataclasses.fields() introspection to fail noisily when a new field is
    added without builder coverage or allowlist documentation.
  * test_documented_defaults_get_overwritten_by_production — drives analyze()
    against a contrived IR + trace that triggers each documented-default
    overwrite, catching the case where production's overwrite logic is
    deleted while the allowlist stays stale.
- Three renderer tests migrated from synthetic to e2e analyze() calls:
  * test_json_partial_trace_exposes_evidence_scope_and_observed_models
  * test_json_summary_exposes_projection_exclusions_and_delta_reason
  * test_render_json_includes_rollup_workflow_paths_and_unavailable_models_by_workflow

Reviewer's two non-issues confirmed via inspection
- test_text_summary_renders_blocking_errors_categorically (line 272 in PR
  baseline) only asserts on builder-populated fields. Kept synthetic.
- test_json_emits_root_and_sub_workflow_llm_node_counts already drove
  analyze() end-to-end against the committed 3-deep fixture. No migration
  needed.

Verification
- 5 mutation contracts checked by reverting production code:
  * Add new AnalysisSummary field → parity method 1 fails naming the field.
  * Delete observed_models_in_trace overwrite → migration #2 + parity
    method 2 fail with documented diagnostics.
  * Delete unavailable_models_by_workflow overwrite → migration #4 + parity
    method 2 fail.
- 6,335 tests passing on default suite.
- make check clean (ruff + ruff-format + mypy + deptry).
- test_golden_baseline_hashes_match (DD#19) green; test_plan_drift.py 34/34.

Plan + progress log
- Atomic plan at .taskmaster/tasks/task_159/implementation/fix-plans/
  renderer-test-fidelity-shape-parity-plan.md.
- Consolidated PR #378 review-fix sweep entry appended to
  implementation-progress-log.md, documenting all four phases (Phase 1
  easy bundle / Phase 2 medium bundle / Phase 3 cohort-key correctness /
  Phase 4 this commit), the five disputed findings (with citations), the
  GH #380 follow-up filed, and the cross-cutting insights from the sweep.

GH issue #380 filed for the deferred test-bloat parametrize-collapse work.
spinje added a commit that referenced this pull request May 10, 2026
`pflow analyze-cache --from-trace` previously rendered three delta lines
(Actual savings + First-run + Rerun) in trace mode. The first-run-with-cache
projection assumes a fresh run with no memo cache hits and no provider
implicit caching, neither of which is modeled. When those fire (the common
case), the projection diverges from actual cost by an order of magnitude
(lyrics-generator: the projection said "saves 1%"; actual was 49%), and the
two competing numbers anchor agents on the smaller, misleading figure.

Option B: drop the first-run delta line in trace mode entirely. Show only
the measured number plus the steady-state forward-looking projection.

- `_render_summary_deltas` split into 4-line dispatch + two named helpers
  `_render_trace_deltas` and `_render_greenfield_deltas`. Dispatch on
  `evidence_scope ∈ {"complete_trace", "truncated_trace_executed_subset"}`
  rather than `actually_paid_usd is not None` to handle the all-unpriced-
  trace edge case correctly.
- Trace mode renders `Actual savings (this run):` (or `unavailable
  (projection excludes …)`) + `Rerun delta (projected):`. The
  `(projected)` suffix on rerun signals "model, not measurement" right
  at the line.
- Greenfield mode unchanged shape: `First-run delta:` + `Rerun delta:`,
  no `(projected)` suffix (both are projections by construction; the
  absence of an actual-savings line carries the signal).
- The truncated-mode `(executed)` qualifier on rerun was retired —
  the suppression note already conveys executed-subset context.
- `_format_delta` actual-delta label simplified `"actual vs no-cache"`
  → `"vs no-cache"` (the row label "Actual savings (this run):" already
  says "actual"; doubled word was a Stage A artifact).
- Dead branch `if not in_trace_mode and actual_line:` removed
  (production-unreachable per searcher #2; only synthetic fixtures hit it).
- `_make_analysis` test builder upgraded to set `evidence_scope` based
  on `actually_paid` (matches production shape; closes a Pitfall #19
  gap that searcher #2 specifically warned about under the new dispatch).

JSON shape unchanged — all three deltas still on `AnalysisSummary`.

Tests: 2 new regression tests with verified mutation contracts; 1
existing test renamed and rewritten for Option B truncated semantics;
1 new mutation guard on the `vs no-cache` label simplification. Three
mutation contracts verified by reverting production code and observing
the matching test fail.

6,396 default-suite tests pass. 65/65 baselines pass (no drift — the
parallel L-12/L-2/L-1 commit had already regenerated the 3 affected
baselines anticipating Option B's final shape). Manual smoke confirms
trace mode shows actual + rerun (projected) only; greenfield shows
first-run + rerun.

Closes L-3 from BASELINE-AUDIT Section F.