feat: unify LLM usage via Simon Willison's llm library (Task 95) by spinje · Pull Request #2 · spinje/pflow

spinje · 2025-12-19T14:39:51Z

Summary

Fix critical bug where ALL Claude models were silently redirected to a hardcoded model (claude-sonnet-4-20250514), ignoring user's model choice. Migrate discovery and filtering features to use standard llm library with configurable models.

Changes

Bug Fix (Phase 1) - Critical

Move monkey-patch from workflow_command() entry to planner-only path
File/saved workflows now use standard llm library
User's model choice is now respected
Invalid models now properly error (not silently succeed)

Discovery Migration (Phase 2)

Remove install_anthropic_model() from registry discover
Remove install_anthropic_model() from workflow discover
Use get_model_for_feature("discovery") for configurable model

Smart Filter + Settings (Phase 3+4)

Add LLMSettings class with default_model, discovery_model, filtering_model
Add get_model_for_feature() helper with resolution chain
Smart filter uses configurable model instead of hardcoded Haiku

Default Model Configuration

LLM nodes now require explicit model configuration
Resolution order: IR params → settings → llm CLI → CompilationError
Clear error messages with setup instructions

Model Updates

Add GPT-5.2, Gemini 3 Flash to pricing
Update default detection to use latest models
Bump llm-gemini>=0.28.1, llm-anthropic==0.23

Files Changed (24 files)

Core Implementation:

src/pflow/cli/main.py - Move monkey-patch to planner path
src/pflow/cli/registry.py - Use configurable discovery model
src/pflow/cli/commands/workflow.py - Use configurable discovery model
src/pflow/core/llm_config.py - Add model resolution helpers
src/pflow/core/settings.py - Add LLMSettings class
src/pflow/core/smart_filter.py - Use configurable filtering model
src/pflow/runtime/compiler.py - Add LLM model injection

Tests (21 new tests):

tests/test_core/test_llm_config_workflow_model.py (15 tests)
tests/test_runtime/test_compiler_llm_model.py (6 tests)

Task Documentation

.taskmaster/tasks/task_95/implementation/progress-log.md
.taskmaster/tasks/task_95/research/llm-usage-findings.md

Testing

All 3388 tests pass:

make test  → 3388 passed, 7 skipped
make check → All checks pass

Manual verification completed for all phases (see progress-log.md).

Fix critical bug where ALL Claude models were silently redirected to a hardcoded model, ignoring user's model choice. Migrate discovery and filtering features to use standard llm library with configurable models. Key changes: Bug Fix (Phase 1): - Move monkey-patch from workflow_command() entry to planner-only path - File/saved workflows now use standard llm library - User's model choice is now respected Discovery Migration (Phase 2): - Remove install_anthropic_model() from registry discover - Remove install_anthropic_model() from workflow discover - Use get_model_for_feature("discovery") for configurable model Smart Filter + Settings (Phase 3+4): - Add LLMSettings class with default_model, discovery_model, filtering_model - Add get_model_for_feature() helper with resolution chain - Smart filter uses configurable model instead of hardcoded Haiku Default Model Configuration: - LLM nodes now require explicit model configuration - Resolution order: IR params → settings → llm CLI → CompilationError - Clear error messages with setup instructions Model Updates: - Add GPT-5.2, Gemini 3 Flash to pricing - Update default detection to use latest models - Bump llm-gemini>=0.28.1, llm-anthropic==0.23 Tests: - Add test_llm_config_workflow_model.py (15 tests) - Add test_compiler_llm_model.py (6 tests) - Add wildcard model support to llm_mock.py - Update existing tests for new patterns All 3388 tests pass.

- Remove CI debug print statements from llm_config.py - Move imports to top-level in compiler.py (avoid repeated imports) - Move SettingsManager import to top-level in llm_config.py (no circular import) - Simplify test mocking patterns (use standard patch() syntax) - Update test patch locations to match top-level imports All 3388 tests pass.

claude · 2025-12-19T14:42:28Z

Code Review: Task 95 - Unify LLM Usage via Simon Willison's llm Library

Overview

This PR successfully addresses a critical bug where all Claude models were being silently redirected to a hardcoded model, while also migrating the codebase to use a configurable LLM system based on Simon Willison's llm library. The implementation is well-structured, thoroughly tested (3388 passing tests), and includes comprehensive documentation.

✅ Strengths

Architecture & Design

Excellent separation of concerns: The bug fix (Phase 1) is cleanly separated from feature additions (Phases 2-5)
Proper error handling: Compile-time failures with helpful error messages instead of silent runtime failures
Provider agnostic: Works with any LLM provider supported by the llm library (Anthropic, Gemini, OpenAI, OpenRouter, Ollama, etc.)
Clear resolution chain: IR params → settings → llm CLI default → error (well documented and tested)

Code Quality

Defensive programming: Input validation on provider names (allowlist), subprocess security (no shell injection), proper timeout handling
Immutability preserved: params = {**params, "model": default_model} creates new dict instead of mutating IR
Test coverage: 21 new tests covering all resolution paths, edge cases, and error conditions
Documentation: Excellent docstrings with resolution order, examples, and clear explanations

Security

Subprocess safety: Uses shutil.which() for path validation, stdin=subprocess.DEVNULL, timeouts, no shell expansion
Provider allowlist: ALLOWED_PROVIDERS prevents arbitrary command injection
No credentials in logs: Sensitive data properly handled

⚠️ Warnings — Should Be Addressed

1. Import Location in Compiler (compiler.py:605)

The import is placed inside the function, which is executed for every LLM node compilation:

if node_type == "llm" and "model" not in params:
    from pflow.core.llm_config import get_default_workflow_model, get_model_not_configured_help  # ← Inside function

Issue: Repeated imports on every LLM node compilation add unnecessary overhead.

Fix: Move to top-level imports:

# At top of compiler.py
from pflow.core.llm_config import get_default_workflow_model, get_model_not_configured_help

# In function
if node_type == "llm" and "model" not in params:
    default_model = get_default_workflow_model()
    # ...

Rationale: Follows Python best practices and the project's "boring and obvious" guideline. The circular import concern doesn't apply here since llm_config doesn't import from compiler.

2. Patch Import Pattern in llm_config.py (llm_config.py:243)

from pflow.core.settings import SettingsManager  # ← Inside function to avoid circular import

Issue: This pattern is used in multiple functions (get_model_for_feature, get_default_workflow_model), suggesting a deeper architectural issue.

Concern: If this is truly a circular import, it indicates tight coupling. If not, the comment is misleading.

Recommendation:

Verify if the circular import actually exists by testing top-level import
If it does exist, consider refactoring to break the cycle (e.g., extract shared utilities)
If it doesn't exist, move to top-level imports and remove misleading comment

3. Global State Caching (llm_config.py:16-18)

_cached_default_model: Optional[str] = None
_detection_complete: bool = False

Issue: Global mutable state makes testing harder and introduces potential bugs in concurrent scenarios.

Current mitigation: PYTEST_CURRENT_TEST check skips caching in tests.

Better approach:

Use a singleton pattern with explicit cache management
Or: Accept the cost of re-detection and remove caching entirely (subprocess calls are already fast with 1-2s timeouts)
Or: Document thread-safety assumptions clearly

Why it matters: The current approach works but is fragile. If pflow adds multi-threading in the future, this could cause race conditions.

4. CI Debug Logging (llm_config.py:74-102)

if os.environ.get("CI"):
    print(f"[DEBUG CI] About to run subprocess: {command}", file=sys.stderr, flush=True)
    # ... more debug prints

Issue: This debug code is still present in production. It was likely added to diagnose CI issues but should be removed before merge.

Recommendation: Remove all [DEBUG CI] print statements or move to proper logging:

logger.debug(f"Running subprocess: {command}")  # Uses logging, controlled by log level

5. Test Mock Pattern Inconsistency (test_llm_config_workflow_model.py:37-39)

with patch.object(
    __import__("pflow.core.llm_config", fromlist=["get_llm_cli_default_model"]),
    "get_llm_cli_default_model",
    return_value="claude-3-sonnet",
):

Issue: This is unnecessarily complex. Compare to the cleaner approach in other tests:

with patch("pflow.core.llm_config.get_llm_cli_default_model") as mock_cli:
    mock_cli.return_value = "claude-3-sonnet"

Recommendation: Simplify the patching approach for consistency and readability.

💡 Suggestions — Optional Improvements

1. Error Message Formatting

The error message in get_model_not_configured_help() is excellent, but the JSON formatting could be improved for readability:

Current:

{"id": "my-llm", "type": "llm", "params": {"model": "gpt-5.2", "prompt": "..."}}

Suggested (multi-line JSON):

{
  "id": "my-llm",
  "type": "llm",
  "params": {
    "model": "gpt-5.2",
    "prompt": "..."
  }
}

This is more readable but not critical.

2. Type Hints in Test Fixtures

@pytest.fixture
def mock_registry(self):  # ← Missing return type
    """Create mock registry with LLM node."""
    registry = MagicMock()
    # ...
    return registry

Suggested:

@pytest.fixture
def mock_registry(self) -> MagicMock:
    """Create mock registry with LLM node."""

This improves IDE support and type checking, though not strictly necessary for test code.

3. Consolidate Model Resolution Logic

There's some duplication in model resolution between:

get_default_workflow_model() (settings → llm CLI → None)
get_model_for_feature() (settings → auto-detect → fallback)

Potential refactoring:

def _resolve_model(
    settings_getter: Callable[[LLMSettings], Optional[str]],
    allow_fallback: bool = False
) -> Optional[str]:
    """Generic model resolution with configurable fallback."""
    # Common logic here

This is a minor optimization and may not be worth the added abstraction at this stage.

🔍 Specific Code Observations

Good Practices Observed

Subprocess security: Excellent use of stdin=subprocess.DEVNULL, timeouts, and validated paths
Error message quality: The get_model_not_configured_help() message is a model for user-friendly errors
Test coverage: Edge cases well-covered (settings load failure, timeout, invalid provider, etc.)
Immutability: IR data structures are never mutated (new dict created in compiler)
Logging levels: Proper use of logger.debug() for internal operations and logger.info() for user-visible actions

Areas for Improvement

Inline imports: Should be moved to top-level where possible
Global state: Consider alternatives to module-level caching
Debug code: Remove CI-specific print statements
Test mocking: Use simpler patch() syntax consistently

📋 Test Quality Assessment

Excellent Coverage

Resolution chain tested: All three sources (IR, settings, llm CLI) verified
Error paths tested: Missing config, load failures, timeouts, invalid providers
Edge cases tested: Empty stdout, subprocess failures, PYTEST_CURRENT_TEST skipping
Integration verified: Manual testing confirms file workflows work correctly

Test Quality

Tests are focused and clear - each tests one behavior
Good use of fixtures - mock_registry reused appropriately
Proper assertions - verifies both return values and function calls
No flaky tests - deterministic mocking, no timing dependencies

Minor Improvements

Some tests could be more concise (the patch.object(__import__()) pattern)
Consider parameterized tests for similar scenarios (e.g., different providers)

🎯 Breaking Changes

The PR correctly identifies this as a breaking change for users who:

Don't specify model in workflows
Don't have settings configured
Don't have llm models default set

Mitigation: The error message provides three clear solutions with examples. This is the right approach - fail fast with helpful guidance.

📊 Overall Assessment

Criterion	Rating	Notes
Correctness	✅ Excellent	Bug fixed, new features work as intended
Code Quality	⚠️ Good	Minor issues (imports, debug code) prevent "Excellent"
Security	✅ Excellent	Subprocess handling, input validation exemplary
Test Coverage	✅ Excellent	Comprehensive, well-structured tests
Documentation	✅ Excellent	Clear docstrings, progress log, error messages
Architecture	✅ Excellent	Clean separation, proper abstractions
User Experience	✅ Excellent	Helpful errors, clear resolution chain

🏁 Final Recommendation

Approve with minor fixes - This is high-quality work that significantly improves the codebase. The warnings above are non-critical but should be addressed before merge:

Required Before Merge

✅ Remove CI debug print statements (or convert to proper logging)
✅ Move imports to top-level in compiler.py (avoid repeated imports)

Recommended Before Merge

Investigate and resolve/document circular import in llm_config.py
Simplify test mocking patterns for consistency

Optional (Future Tasks)

Consider alternatives to global state caching
Format error message JSON for better readability

💬 Additional Comments

This PR demonstrates excellent engineering practice:

Clear problem identification and incremental solution (5 phases)
Thorough manual and automated verification
Comprehensive documentation (progress log is exemplary)
User-centric design (error messages, configuration options)

The implementation follows the project's "boring and obvious" principle while solving a complex problem elegantly. Great work! 🎉

Reviewed by: Claude Code
Date: 2025-12-19
Test Status: ✅ 3388 passed, 7 skipped
Code Quality: ✅ All checks pass

Code review fixes: - Remove default from _store_output() method signature (warning #4) - Add explicit strip_newline=False for stderr call - Fix type hints: str | bytes with isinstance narrowing (warning #5) - Update docstrings to clarify stdout-only behavior (warning #3) - Add test for binary data + strip_newline interaction (critical #1) - Add test for empty stdout edge case (critical #2) - Update agent instructions to clarify stderr is never modified All code review items addressed except Windows CRLF (documented as Unix-first).

…eview] Review feedback addressed: 1. Identity check pattern fragility (Warning #1): - Changed _coerce_provided_input() to return (value, was_coerced) tuple - Makes intent explicit and prevents subtle bugs from identity assumptions 2. Type alias normalization (Warning #4): - _normalize_type() now preserves case for unknown types - Supports future custom/user-defined types 3. dict/list to string uses Python repr (Review #2): - _coerce_to_string() now uses json.dumps() for containers - Produces valid JSON strings consistent with coerce_to_declared_type() 4. Env/defaults values not coerced (Review #2): - Apply coercion to values from os.environ, settings_env, and defaults - Ensures "respect declared types" works for all input sources 5. None value handling: - Preserve None values (don't coerce to "None" string) - Optional inputs without defaults correctly resolve to None Also: - Added 4 new tests for env/settings coercion - Documented lenient coercion behavior in docstrings - Created Task 120 for future strict input type validation

…49 follow-up) Adversarial verification of the Task 149 refactor found three real regressions the test suite didn't catch (CliRunner intercepts stderr differently from a real subprocess, so logging-based corruption was invisible to the mocked tests). This commit fixes them. Fix #2 — Nested workflow rendering (OutputController partial-line tracking): - OutputController now tracks _partial_line_open so every _handle_node_* path self-terminates any open partial line before writing a fresh line. - _handle_node_complete / _cached / _warning now accept node_id + indent and re-emit a fresh " node_id" lead-in if the partial line was closed between node_start and node_complete (e.g. by a sub-workflow's own progress events or a logger write). - Before: ` nested_call... inner_a... ✓ 0.5s` (parent's partial line concatenated with child's first start, parent's completion orphaned). - After: clean two-level indented rendering, parent re-emitted on its own line as ` nested_call ✓ 1.0s`. Fix #3 — Drop redundant logger.warning("Command failed with exit code N") from shell.py's post() method: - That call wrote directly to stderr via Python logging, bypassing OutputController, and concatenated onto the partial `node_id...` line as `will_fail...WARNING: Command failed with exit code 1\n ✗ Failed`. - The information is redundant — pflow's diagnostic pipeline already surfaces the exact same error via call_completion_callback (which reads shared["exit_code"] and builds the message) and via the post-execution diagnostic block that prints shared["error"]. - Updated test_auto_handling.py's TestLoggingBehavior to assert on the canonical behavior (action="error" + shared["exit_code"] + shared["error"]) instead of the removed logger.warning. Fix #1 — JSON-encode structured outputs in safe_output: - Before: `{'key': 'value', 'n': 42}` (Python repr, single quotes — jq cannot parse). - After: `{"key": "value", "n": 42}` (valid JSON, `pflow foo | jq` works). - Strings still pass through verbatim. Non-JSON-serializable objects fall back to str() so safe_output never raises from the output path. - Updated test_complex_output_types to verify JSON tokens instead of str(value) — Python True/None were pre-existing never-worked cases exposed by the GH #194 routing fix (data now actually lands on stdout where consumers can pipe it). Fix #4 — Real subprocess regression tests: - New tests/test_cli/test_progress_streaming_subprocess.py with two tests that spawn `uv run pflow` in a real subprocess (not CliRunner) so the captured stderr reflects what an agent or CI system actually sees: * test_failing_shell_node_progress_line_is_clean — asserts no `will_fail...WARNING:` or `will_fail...Command failed` corruption. * test_nested_workflow_progress_lines_are_not_concatenated — asserts no `nested_call... inner_a` concatenation + that the parent's completion line is re-emitted attached to its node id. - Guards against both Fix #2 and Fix #3 regressing. Verification: - `make test`: 4641 passed, 9 skipped. - `make check`: ruff + mypy + deptry clean. - Manual subprocess: failing shell, nested workflow, structured outputs, and the original #194 fix all clean.

… review] Applies two items from the "anything else?" review re-visit: 1. claude[bot] Warning #3 — promote _pflow_validation_warnings to first-class WorkflowValidationError constructor kwarg. Reversed my Round 8 "defer" recommendation after re-reading the two dynamic- attribute sites in runner.py and realizing they're different concerns. _pflow_parser_diagnostics stays (cross-cutting annotation on any exception type). _pflow_validation_warnings moves to a proper kwarg because it's specific to WorkflowValidationError and has a clean destination. 2. Fix [1] cascade side effect — discovered during the "anything else?" verification pass. When a workflow has an unknown node type AND a downstream template reference, Fix [1]'s silent-skip made the defensive fallback in _get_node_outputs_from_registry reachable. The old fallback had a logger.warning("this is unexpected") call that leaked to user-visible stderr. Demoted to logger.debug; the case is now legitimate and doesn't warrant a stderr warning. ## Production changes - core/exceptions.py — WorkflowValidationError gains validation_warnings constructor kwarg. Stored as self.validation_warnings. Docstring documents the two-field contract. - runtime/template_validation/path_validation.py:787-817 — _get_node_outputs_from_registry: logger.warning → logger.debug, docstring updated to describe the two legitimate-reach cases (unknown node type + defensive backstop). - execution/runner.py:373-377 — _validate() uses validation_warnings constructor kwarg instead of dynamic attribute assignment. The # type: ignore[attr-defined] is gone. - execution/runner.py:538-549 — _exception_to_result reads exception.validation_warnings (instead of _pflow_validation_warnings). Still getattr because exception is loosely typed at this layer. ## Regression tests (3) - tests/test_core/test_exception_hierarchy.py:: test_workflow_validation_error_carries_warnings_as_first_class_attr — locks in the kwarg contract: round-trips errors + warnings, defaults to empty list when omitted, backward-compat summary constructor still works. - tests/test_execution/test_runner.py:: test_workflow_validation_error_warnings_survive_via_kwarg — end-to-end: WorkflowValidationError raised with validation_warnings=[...] survives through _exception_to_result into ExecutionResult.diagnostics. - tests/test_runtime/test_template_validation/test_enhanced_errors.py:: test_unknown_node_type_downstream_ref_no_stderr_warning — three-part structural guard for the fallback log demotion: (a) template error diagnostic still produced (behavior), (b) no WARNING-level log from fallback (UX fix), (c) DEBUG-level log still fires (observability preserved). ## Verification - make test: 4680 passed (was 4677 + 3 new regressions) - make check: clean (ruff + ruff-format + mypy 171 files + deptry) - Manual cascade repro: 2 clean errors, zero stderr noise - Grep: zero _pflow_validation_warnings in src/ (promotion complete) - Baseline refresh: zero drift (Round 9 changes are rendering-neutral) ## Deferred (now truly done) - claude[bot] Suggestion #2 (generic runtime warning text): still deferred. Requires runtime warning categorization infrastructure that doesn't exist. No user-reported pain point. - _pflow_parser_diagnostics cleanup: explicitly NOT touched — different concern, correctly uses dynamic-attr pattern for cross-cutting exception annotation. Task 147 braindump's "attr-defined pattern is intentional" applies to this site, not to _pflow_validation_warnings. Refs: #219, #244

…ti-failure status, trace errors Three regressions found during post-implementation verification of Task 148, all violating explicit spec requirements or acceptance criteria. BUG #1 — output_resolver silently swallowed all-failed coalesce. The gate at output_resolver.py:168-172 skipped any unresolved coalesce expression, so ${primary.stdout ?? fallback.stdout} with both operands in __failures__ silently dropped the output instead of raising OutputResolutionError. This is the exact silent-bad-output behavior Task 148 set out to eliminate (spec requirement: "both failed produces a structured error" + "All coalesce operands failed" summary block). Fix: extract _is_all_absent_coalesce helper that calls classify_unresolved_references and only silent-skips when every ref is status=absent (Task 128 branch-convergence semantic). Any FAILED or PATH_ERROR operand now falls through to error recording with the structured unresolved_references payload. BUG #2 — build_execution_steps used singular failed_node for per-node status. execution_state.py:110-133 determined status from exec_state["failed_node"] (singular pointer — only last failure). In multi-failure workflows, earlier failed nodes matched neither completed_nodes nor the singular and fell through to "not_executed". An AI agent consuming the JSON output would see status:"not_executed" for a node that clearly did execute and fail. Missed Phase 3 migration — only the metadata get_node_output lookup at line 143 was migrated, status determination at 128-133 was left on the pre-invariant singular-field model. Fix: replace the three-arm conditional with a call to node_state.get_node_status() mapped through _STATUS_MAP to the existing status strings. Single source of truth. BUG #3 — --report summary showed "Unknown error" for #208 repro. Task 148 acceptance criterion: "--report for the #208 repro shows primary as failed with full context". Root cause: record_trace was called with no error kwarg in the happy-path action="error" branch (only raised exceptions passed error). For shell exit N routed via on-error, the trace event had success:False but no error field, so trace_report.py fell back to "Unknown error". Fix: pre-compute trace_error from shared[node_id].error and pass it to record_trace as the new Optional[Exception | str] error kwarg. Test fixture migration. BUG #2 fix broke 4 tests whose fixtures set failed_node singular without populating __failures__ (pre-invariant shape). Updated test_execution_state.py (3 tests) and test_error_formatter.py (1 test) to populate __failures__ dict per the Task 148 invariant. Regression tests. Added 4 new tests targeting the exact code paths that would have caught the bugs: - test_multi_failure_all_show_failed_status — multi-failure build_execution_steps - test_all_failed_coalesce_in_output_raises_structured_error — BUG #1 - test_mixed_absent_and_failed_coalesce_in_output_raises_error — BUG #1 variant - test_all_absent_coalesce_in_output_is_silently_skipped — Task 128 positive regression Verification. make check clean. pytest -n 4 —> 4708 passed (was 4704). #208 repro, test 02 ignore_errors, test 03 both-fail, test 09 loop recovery, test 23 mixed absent+failed all produce their expected output. Direct build_execution_steps probe with multi-failure state labels both failed nodes correctly. Follow-ups filed (out of scope, both verified during the same pass): - #240 — Trace aggregation reports workflow as 'failed' after loop recovery (pre-existing trace semantic gap) - #241 — Invariant does not hold for enable_namespacing=false (real bug, not reachable via user paths, recommended deprecation) Progress log at .taskmaster/tasks/task_149/implementation/progress-log.md documents the full verification pass including withdrawn claims and Bug #2b (stale failed_node pointer) rationale for deferral.

…kip review] PR #378 review finding #4. Closes the synthetic-builder ↔ production-analyze() fidelity gap in tests/test_core/test_cache_analysis_renderers.py. What landed - _BUILDER_DOCUMENTED_DEFAULTS frozenset (6 entries) names the AnalysisSummary fields the synthetic builder cannot faithfully model. Tests asserting on these MUST drive analyze() end-to-end. - TestMakeAnalysisShapeParity class with two methods: * test_builder_field_set_matches_dataclass_minus_documented_defaults — uses dataclasses.fields() introspection to fail noisily when a new field is added without builder coverage or allowlist documentation. * test_documented_defaults_get_overwritten_by_production — drives analyze() against a contrived IR + trace that triggers each documented-default overwrite, catching the case where production's overwrite logic is deleted while the allowlist stays stale. - Three renderer tests migrated from synthetic to e2e analyze() calls: * test_json_partial_trace_exposes_evidence_scope_and_observed_models * test_json_summary_exposes_projection_exclusions_and_delta_reason * test_render_json_includes_rollup_workflow_paths_and_unavailable_models_by_workflow Reviewer's two non-issues confirmed via inspection - test_text_summary_renders_blocking_errors_categorically (line 272 in PR baseline) only asserts on builder-populated fields. Kept synthetic. - test_json_emits_root_and_sub_workflow_llm_node_counts already drove analyze() end-to-end against the committed 3-deep fixture. No migration needed. Verification - 5 mutation contracts checked by reverting production code: * Add new AnalysisSummary field → parity method 1 fails naming the field. * Delete observed_models_in_trace overwrite → migration #2 + parity method 2 fail with documented diagnostics. * Delete unavailable_models_by_workflow overwrite → migration #4 + parity method 2 fail. - 6,335 tests passing on default suite. - make check clean (ruff + ruff-format + mypy + deptry). - test_golden_baseline_hashes_match (DD#19) green; test_plan_drift.py 34/34. Plan + progress log - Atomic plan at .taskmaster/tasks/task_159/implementation/fix-plans/ renderer-test-fidelity-shape-parity-plan.md. - Consolidated PR #378 review-fix sweep entry appended to implementation-progress-log.md, documenting all four phases (Phase 1 easy bundle / Phase 2 medium bundle / Phase 3 cohort-key correctness / Phase 4 this commit), the five disputed findings (with citations), the GH #380 follow-up filed, and the cross-cutting insights from the sweep. GH issue #380 filed for the deferred test-bloat parametrize-collapse work.

`pflow analyze-cache --from-trace` previously rendered three delta lines (Actual savings + First-run + Rerun) in trace mode. The first-run-with- cache projection assumes a fresh run with no memo cache hits and no provider implicit caching — neither modeled. When those fire (the common case), the projection diverges from actual cost by an order of magnitude (lyrics-generator: projection said "saves 1%"; actual was 49%) and the two competing numbers anchor agents on the smaller, misleading figure. Option B: drop the first-run delta line in trace mode entirely. Show only the measured number plus the steady-state forward-looking projection. - `_render_summary_deltas` split into 4-line dispatch + two named helpers `_render_trace_deltas` and `_render_greenfield_deltas`. Dispatch on `evidence_scope ∈ {"complete_trace", "truncated_trace_executed_subset"}` rather than `actually_paid_usd is not None` to handle the all-unpriced- trace edge case correctly. - Trace mode renders `Actual savings (this run):` (or `unavailable (projection excludes …)`) + `Rerun delta (projected):`. The `(projected)` suffix on rerun signals "model, not measurement" right at the line. - Greenfield mode unchanged shape: `First-run delta:` + `Rerun delta:`, no `(projected)` suffix (both are projections by construction; the absence of an actual-savings line carries the signal). - The truncated-mode `(executed)` qualifier on rerun was retired — the suppression note already conveys executed-subset context. - `_format_delta` actual-delta label simplified `"actual vs no-cache"` → `"vs no-cache"` (the row label "Actual savings (this run):" already says "actual"; doubled word was a Stage A artifact). - Dead branch `if not in_trace_mode and actual_line:` removed (production-unreachable per searcher #2; only synthetic fixtures hit it). - `_make_analysis` test builder upgraded to set `evidence_scope` based on `actually_paid` (matches production shape; closes a Pitfall #19 gap that searcher #2 specifically warned about under the new dispatch). JSON shape unchanged — all three deltas still on `AnalysisSummary`. Tests: 2 new regression tests with verified mutation contracts; 1 existing test renamed and rewritten for Option B truncated semantics; 1 new mutation guard on the `vs no-cache` label simplification. Three mutation contracts verified by reverting production code and observing the matching test fail. 6,396 default-suite tests pass. 65/65 baselines pass (no drift — the parallel L-12/L-2/L-1 commit had already regenerated the 3 affected baselines anticipating Option B's final shape). Manual smoke confirms trace mode shows actual + rerun (projected) only; greenfield shows first-run + rerun. Closes L-3 from BASELINE-AUDIT Section F.

spinje added 3 commits December 19, 2025 12:00

handover and task review [skip review]

cde44f0

spinje merged commit 309b6b8 into main Dec 19, 2025
7 checks passed

claude Bot mentioned this pull request Dec 19, 2025

feat: add smart output display for registry run (closes #3) #4

Merged

claude Bot mentioned this pull request Dec 29, 2025

fix: allow arrays/dicts to serialize to JSON in string contexts #19

Merged

spinje mentioned this pull request Dec 29, 2025

feat: display batch node progress as items complete #20

Merged

claude Bot mentioned this pull request Jan 1, 2026

feat: preserve inline object type in template resolution (fixes #31) #32

Merged

claude Bot mentioned this pull request Jan 8, 2026

fix: resolve optional inputs without defaults to None (fixes #47) #49

Merged

spinje mentioned this pull request Jan 8, 2026

fix: strip trailing newlines from shell stdout by default #51

Merged

claude Bot mentioned this pull request Jan 9, 2026

fix: surface shell stderr when exit_code=0 (fixes #55) #56

Merged

spinje mentioned this pull request Jan 14, 2026

fix: unify pre-execution validation with --validate-only (fixes #66) #67

Merged

claude Bot mentioned this pull request Jan 17, 2026

docs: add task browser script and improve CLAUDE.md documentation #70

Merged

spinje mentioned this pull request Feb 4, 2026

Batch workflows fail through MCP server with unresolved template variables #79

Closed

This was referenced Feb 5, 2026

feat: Add pflow skill command to publish workflows as AI agent skills #81

Merged

fix: Respect declared input types for CLI parameters #84

Merged

spinje mentioned this pull request Mar 31, 2026

Sub-workflow per-item input type coercion lost after compile-once #188

Open

claude Bot mentioned this pull request Apr 5, 2026

refactor: shared MCP error handling with Diagnostic integration #227

Merged

claude Bot mentioned this pull request Apr 8, 2026

feat: task 147 — validator produces Diagnostics natively (#219) #244

Merged

spinje mentioned this pull request Apr 8, 2026

Trace event records success=True for node that hits routing failure with custom non-error action #250

Closed

spinje mentioned this pull request Apr 17, 2026

fix: error pipeline silent failures cluster (fixes #292, #288) #298

Merged

claude Bot mentioned this pull request Apr 17, 2026

fix: consolidate mermaid ref resolution (fixes #283, #263) #299

Merged

spinje mentioned this pull request Apr 17, 2026

Validator: reject same-name collisions between declared inputs and node IDs #300

Open

claude Bot mentioned this pull request Apr 26, 2026

refactor(llm): replace llm library with LiteLLM via pflow-owned adapter #356

Merged

This was referenced May 1, 2026

Refactor: consolidate trace-event walkers with divergent cached filter behavior #364

Closed

feat: Prompt caching via declarative ## Cache block + pflow analyze-cache #378

Merged

claude Bot mentioned this pull request May 11, 2026

fix(parser): preserve blank lines inside multi-line YAML block scalars (#387) #388

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: unify LLM usage via Simon Willison's llm library (Task 95)#2

feat: unify LLM usage via Simon Willison's llm library (Task 95)#2
spinje merged 3 commits into
mainfrom
feat/llm-usage

spinje commented Dec 19, 2025

Uh oh!

claude Bot commented Dec 19, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

spinje commented Dec 19, 2025

Summary

Changes

Bug Fix (Phase 1) - Critical

Discovery Migration (Phase 2)

Smart Filter + Settings (Phase 3+4)

Default Model Configuration

Model Updates

Files Changed (24 files)

Task Documentation

Testing

Uh oh!

claude Bot commented Dec 19, 2025

Code Review: Task 95 - Unify LLM Usage via Simon Willison's llm Library

Overview

✅ Strengths

Architecture & Design

Code Quality

Security

⚠️ Warnings — Should Be Addressed

1. Import Location in Compiler (compiler.py:605)

2. Patch Import Pattern in llm_config.py (llm_config.py:243)

3. Global State Caching (llm_config.py:16-18)

4. CI Debug Logging (llm_config.py:74-102)

5. Test Mock Pattern Inconsistency (test_llm_config_workflow_model.py:37-39)

💡 Suggestions — Optional Improvements

1. Error Message Formatting

2. Type Hints in Test Fixtures

3. Consolidate Model Resolution Logic

🔍 Specific Code Observations

Good Practices Observed

Areas for Improvement

📋 Test Quality Assessment

Excellent Coverage

Test Quality

Minor Improvements

🎯 Breaking Changes

📊 Overall Assessment

🏁 Final Recommendation

Required Before Merge

Recommended Before Merge

Optional (Future Tasks)

💬 Additional Comments

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant