feat: unify LLM usage via Simon Willison's llm library (Task 95) #2
Fix critical bug where ALL Claude models were silently redirected to a
hardcoded model, ignoring user's model choice. Migrate discovery and
filtering features to use standard llm library with configurable models.
Key changes:
Bug Fix (Phase 1):
- Move monkey-patch from workflow_command() entry to planner-only path
- File/saved workflows now use standard llm library
- User's model choice is now respected
Discovery Migration (Phase 2):
- Remove install_anthropic_model() from registry discover
- Remove install_anthropic_model() from workflow discover
- Use get_model_for_feature("discovery") for configurable model
Smart Filter + Settings (Phase 3+4):
- Add LLMSettings class with default_model, discovery_model, filtering_model
- Add get_model_for_feature() helper with resolution chain
- Smart filter uses configurable model instead of hardcoded Haiku
Default Model Configuration:
- LLM nodes now require explicit model configuration
- Resolution order: IR params → settings → llm CLI → CompilationError
- Clear error messages with setup instructions
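The resolution order above can be sketched roughly as follows. This is a minimal illustration, not the code in compiler.py: the function name `resolve_llm_model`, its parameters, the stand-in `CompilationError` class, and the error wording are all assumptions made for the example.

```python
from typing import Optional


class CompilationError(Exception):
    """Stand-in for the compiler's error type (hypothetical definition)."""


def resolve_llm_model(
    ir_params: dict,
    settings_default: Optional[str],
    llm_cli_default: Optional[str],
) -> str:
    """Return the first configured model, or raise with setup instructions."""
    # Candidates are checked in the order the commit message lists them:
    # IR params, then settings, then the llm CLI default.
    for candidate in (ir_params.get("model"), settings_default, llm_cli_default):
        if candidate:
            return candidate
    # Nothing configured anywhere: fail at compile time instead of guessing.
    raise CompilationError(
        "No model configured for LLM node. Set 'model' in the node params, "
        "configure a default model in settings, or set a default in the llm CLI."
    )
```

With this shape an explicit node-level model always wins, and a workflow with no model configured anywhere fails early with a message that says how to fix it.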
Model Updates:
- Add GPT-5.2, Gemini 3 Flash to pricing
- Update default detection to use latest models
- Bump llm-gemini>=0.28.1, llm-anthropic==0.23
Tests:
- Add test_llm_config_workflow_model.py (15 tests)
- Add test_compiler_llm_model.py (6 tests)
- Add wildcard model support to llm_mock.py
- Update existing tests for new patterns
All 3388 tests pass.
- Remove CI debug print statements from llm_config.py
- Move imports to top-level in compiler.py (avoid repeated imports)
- Move SettingsManager import to top-level in llm_config.py (no circular import)
- Simplify test mocking patterns (use standard patch() syntax)
- Update test patch locations to match top-level imports
All 3388 tests pass.
Code Review: Task 95 - Unify LLM Usage via Simon Willison's llm Library
Overview
This PR successfully addresses a critical bug where all Claude models were being silently redirected to a hardcoded model, while also migrating the codebase to use a configurable LLM system based on Simon Willison's llm library.
✅ Strengths
Architecture & Design
Code Quality
Security
| Criterion | Rating | Notes |
|---|---|---|
| Correctness | ✅ Excellent | Bug fixed, new features work as intended |
| Code Quality | | Minor issues (imports, debug code) prevent "Excellent" |
| Security | ✅ Excellent | Subprocess handling, input validation exemplary |
| Test Coverage | ✅ Excellent | Comprehensive, well-structured tests |
| Documentation | ✅ Excellent | Clear docstrings, progress log, error messages |
| Architecture | ✅ Excellent | Clean separation, proper abstractions |
| User Experience | ✅ Excellent | Helpful errors, clear resolution chain |
🏁 Final Recommendation
Approve with minor fixes - This is high-quality work that significantly improves the codebase. The warnings above are non-critical but should be addressed before merge:
Required Before Merge
- ✅ Remove CI debug print statements (or convert to proper logging)
- ✅ Move imports to top-level in compiler.py (avoid repeated imports)
Recommended Before Merge
- Investigate and resolve/document circular import in llm_config.py
- Simplify test mocking patterns for consistency
Optional (Future Tasks)
- Consider alternatives to global state caching
- Format error message JSON for better readability
💬 Additional Comments
This PR demonstrates excellent engineering practice:
- Clear problem identification and incremental solution (5 phases)
- Thorough manual and automated verification
- Comprehensive documentation (progress log is exemplary)
- User-centric design (error messages, configuration options)
The implementation follows the project's "boring and obvious" principle while solving a complex problem elegantly. Great work! 🎉
Reviewed by: Claude Code
Date: 2025-12-19
Test Status: ✅ 3388 passed, 7 skipped
Code Quality: ✅ All checks pass
Code review fixes:
- Remove default from _store_output() method signature (warning #4)
- Add explicit strip_newline=False for stderr call
- Fix type hints: str | bytes with isinstance narrowing (warning #5)
- Update docstrings to clarify stdout-only behavior (warning #3)
- Add test for binary data + strip_newline interaction (critical #1)
- Add test for empty stdout edge case (critical #2)
- Update agent instructions to clarify stderr is never modified
All code review items addressed except Windows CRLF (documented as Unix-first).
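A minimal sketch of the pattern these fixes describe: a required `strip_newline` argument plus `isinstance` narrowing over `str | bytes`. The helper names and the choice to leave binary data unchanged are assumptions for illustration; the commit does not show the actual `_store_output()` body.

```python
from typing import Union


def strip_trailing_newline(value: Union[str, bytes]) -> Union[str, bytes]:
    """Remove one trailing newline from text output."""
    if isinstance(value, bytes):
        # Assumption: binary payloads are returned unchanged, since trimming
        # a byte could corrupt them.
        return value
    # After the isinstance check the type checker knows value is a str.
    return value[:-1] if value.endswith("\n") else value


def store_output(shared: dict, key: str, value: Union[str, bytes], strip_newline: bool) -> None:
    """Store output; strip_newline has no default, so every caller must choose."""
    shared[key] = strip_trailing_newline(value) if strip_newline else value


shared: dict = {}
store_output(shared, "stdout", "hello\n", strip_newline=True)
store_output(shared, "stderr", "warning\n", strip_newline=False)  # stderr is never modified
```

Dropping the default makes the stdout-only behavior visible at each call site instead of hiding it in the signature.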
…eview]
Review feedback addressed:
1. Identity check pattern fragility (Warning #1):
   - Changed _coerce_provided_input() to return (value, was_coerced) tuple
   - Makes intent explicit and prevents subtle bugs from identity assumptions
2. Type alias normalization (Warning #4):
   - _normalize_type() now preserves case for unknown types
   - Supports future custom/user-defined types
3. dict/list to string uses Python repr (Review #2):
   - _coerce_to_string() now uses json.dumps() for containers
   - Produces valid JSON strings consistent with coerce_to_declared_type()
4. Env/defaults values not coerced (Review #2):
   - Apply coercion to values from os.environ, settings_env, and defaults
   - Ensures "respect declared types" works for all input sources
5. None value handling:
   - Preserve None values (don't coerce to "None" string)
   - Optional inputs without defaults correctly resolve to None
Also:
- Added 4 new tests for env/settings coercion
- Documented lenient coercion behavior in docstrings
- Created Task 120 for future strict input type validation
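Items 1, 3, and 5 above can be illustrated together. This is a sketch under assumptions: `coerce_to_string` is a hypothetical stand-in that only covers coercion toward a declared string type, not the project's `_coerce_provided_input()` itself.

```python
import json
from typing import Any, Tuple


def coerce_to_string(value: Any) -> Tuple[Any, bool]:
    """Return (value, was_coerced) for an input declared as a string."""
    # None is preserved so optional inputs do not become the string "None".
    if value is None or isinstance(value, str):
        return value, False
    # Containers become valid JSON rather than Python repr (single quotes).
    if isinstance(value, (dict, list)):
        return json.dumps(value), True
    return str(value), True


value, was_coerced = coerce_to_string({"a": 1})
```

Returning an explicit flag means callers no longer need an `is` comparison between input and output to detect whether coercion happened.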
…49 follow-up)
Adversarial verification of the Task 149 refactor found three real regressions the test suite didn't catch (CliRunner intercepts stderr differently from a real subprocess, so logging-based corruption was invisible to the mocked tests). This commit fixes them.
Fix #2: Nested workflow rendering (OutputController partial-line tracking):
- OutputController now tracks _partial_line_open so every _handle_node_* path self-terminates any open partial line before writing a fresh line.
- _handle_node_complete / _cached / _warning now accept node_id + indent and re-emit a fresh " node_id" lead-in if the partial line was closed between node_start and node_complete (e.g. by a sub-workflow's own progress events or a logger write).
- Before: ` nested_call... inner_a... ✓ 0.5s` (parent's partial line concatenated with child's first start, parent's completion orphaned).
- After: clean two-level indented rendering, parent re-emitted on its own line as ` nested_call ✓ 1.0s`.
Fix #3: Drop redundant logger.warning("Command failed with exit code N") from shell.py's post() method:
- That call wrote directly to stderr via Python logging, bypassing OutputController, and concatenated onto the partial `node_id...` line as `will_fail...WARNING: Command failed with exit code 1\n ✗ Failed`.
- The information is redundant: pflow's diagnostic pipeline already surfaces the exact same error via call_completion_callback (which reads shared["exit_code"] and builds the message) and via the post-execution diagnostic block that prints shared["error"].
- Updated test_auto_handling.py's TestLoggingBehavior to assert on the canonical behavior (action="error" + shared["exit_code"] + shared["error"]) instead of the removed logger.warning.
Fix #1: JSON-encode structured outputs in safe_output:
- Before: `{'key': 'value', 'n': 42}` (Python repr, single quotes; jq cannot parse).
- After: `{"key": "value", "n": 42}` (valid JSON, `pflow foo | jq` works).
- Strings still pass through verbatim. Non-JSON-serializable objects fall back to str() so safe_output never raises from the output path.
- Updated test_complex_output_types to verify JSON tokens instead of str(value). Python True/None were pre-existing never-worked cases exposed by the GH #194 routing fix (data now actually lands on stdout where consumers can pipe it).
Fix #4: Real subprocess regression tests:
- New tests/test_cli/test_progress_streaming_subprocess.py with two tests that spawn `uv run pflow` in a real subprocess (not CliRunner) so the captured stderr reflects what an agent or CI system actually sees:
  * test_failing_shell_node_progress_line_is_clean: asserts no `will_fail...WARNING:` or `will_fail...Command failed` corruption.
  * test_nested_workflow_progress_lines_are_not_concatenated: asserts no `nested_call... inner_a` concatenation + that the parent's completion line is re-emitted attached to its node id.
- Guards against both Fix #2 and Fix #3 regressing.
Verification:
- `make test`: 4641 passed, 9 skipped.
- `make check`: ruff + mypy + deptry clean.
- Manual subprocess: failing shell, nested workflow, structured outputs, and the original #194 fix all clean.
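The partial-line tracking described under Fix #2 can be sketched as below. `ProgressWriter` is a hypothetical, simplified stand-in for OutputController; only the `_partial_line_open` flag name comes from the commit message, and the indentation widths are assumed.

```python
import io
from typing import TextIO


class ProgressWriter:
    """Tracks whether a 'node_id...' line is still waiting for its result."""

    def __init__(self, stream: TextIO) -> None:
        self._stream = stream
        self._partial_line_open = False

    def _close_partial_line(self) -> None:
        # Terminate any open partial line before starting a fresh one.
        if self._partial_line_open:
            self._stream.write("\n")
            self._partial_line_open = False

    def node_start(self, node_id: str, indent: str = "  ") -> None:
        self._close_partial_line()
        self._stream.write(f"{indent}{node_id}...")
        self._partial_line_open = True

    def node_complete(self, node_id: str, seconds: float, indent: str = "  ") -> None:
        if not self._partial_line_open:
            # Something else closed our line (for example a child node's
            # start), so re-emit the lead-in to keep the result attached.
            self._stream.write(f"{indent}{node_id}")
        self._stream.write(f" ✓ {seconds:.1f}s\n")
        self._partial_line_open = False


buffer = io.StringIO()
writer = ProgressWriter(buffer)
writer.node_start("nested_call")
writer.node_start("inner_a", indent="    ")
writer.node_complete("inner_a", 0.5, indent="    ")
writer.node_complete("nested_call", 1.0)
```

In this sketch the parent's completion lands on its own line together with its node id, rather than being orphaned after the child's output.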
… review]
Applies two items from the "anything else?" review re-visit:
1. claude[bot] Warning #3: promote _pflow_validation_warnings to first-class WorkflowValidationError constructor kwarg. Reversed my Round 8 "defer" recommendation after re-reading the two dynamic-attribute sites in runner.py and realizing they're different concerns. _pflow_parser_diagnostics stays (cross-cutting annotation on any exception type). _pflow_validation_warnings moves to a proper kwarg because it's specific to WorkflowValidationError and has a clean destination.
2. Fix [1] cascade side effect, discovered during the "anything else?" verification pass. When a workflow has an unknown node type AND a downstream template reference, Fix [1]'s silent-skip made the defensive fallback in _get_node_outputs_from_registry reachable. The old fallback had a logger.warning("this is unexpected") call that leaked to user-visible stderr. Demoted to logger.debug; the case is now legitimate and doesn't warrant a stderr warning.
## Production changes
- core/exceptions.py: WorkflowValidationError gains validation_warnings constructor kwarg. Stored as self.validation_warnings. Docstring documents the two-field contract.
- runtime/template_validation/path_validation.py:787-817: _get_node_outputs_from_registry: logger.warning → logger.debug, docstring updated to describe the two legitimate-reach cases (unknown node type + defensive backstop).
- execution/runner.py:373-377: _validate() uses validation_warnings constructor kwarg instead of dynamic attribute assignment. The # type: ignore[attr-defined] is gone.
- execution/runner.py:538-549: _exception_to_result reads exception.validation_warnings (instead of _pflow_validation_warnings). Still getattr because exception is loosely typed at this layer.
## Regression tests (3)
- tests/test_core/test_exception_hierarchy.py::test_workflow_validation_error_carries_warnings_as_first_class_attr: locks in the kwarg contract: round-trips errors + warnings, defaults to empty list when omitted, backward-compat summary constructor still works.
- tests/test_execution/test_runner.py::test_workflow_validation_error_warnings_survive_via_kwarg: end-to-end: WorkflowValidationError raised with validation_warnings=[...] survives through _exception_to_result into ExecutionResult.diagnostics.
- tests/test_runtime/test_template_validation/test_enhanced_errors.py::test_unknown_node_type_downstream_ref_no_stderr_warning: three-part structural guard for the fallback log demotion: (a) template error diagnostic still produced (behavior), (b) no WARNING-level log from fallback (UX fix), (c) DEBUG-level log still fires (observability preserved).
## Verification
- make test: 4680 passed (was 4677 + 3 new regressions)
- make check: clean (ruff + ruff-format + mypy 171 files + deptry)
- Manual cascade repro: 2 clean errors, zero stderr noise
- Grep: zero _pflow_validation_warnings in src/ (promotion complete)
- Baseline refresh: zero drift (Round 9 changes are rendering-neutral)
## Deferred (now truly done)
- claude[bot] Suggestion #2 (generic runtime warning text): still deferred. Requires runtime warning categorization infrastructure that doesn't exist. No user-reported pain point.
- _pflow_parser_diagnostics cleanup: explicitly NOT touched. Different concern, correctly uses dynamic-attr pattern for cross-cutting exception annotation. Task 147 braindump's "attr-defined pattern is intentional" applies to this site, not to _pflow_validation_warnings.
Refs: #219, #244
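As a rough illustration of promoting the warnings to a constructor kwarg, the sketch below uses a simplified exception class. The real WorkflowValidationError signature is not shown in this conversation, so the `errors` parameter and the helper `warnings_from` are assumptions.

```python
from typing import List, Optional


class WorkflowValidationError(Exception):
    """Simplified stand-in: carries errors plus optional non-fatal warnings."""

    def __init__(
        self,
        errors: List[str],
        validation_warnings: Optional[List[str]] = None,
    ) -> None:
        super().__init__("; ".join(errors))
        self.errors = list(errors)
        # Defaults to an empty list when omitted, so readers need no None check.
        self.validation_warnings = list(validation_warnings or [])


def warnings_from(exception: BaseException) -> List[str]:
    """Read warnings from any exception; other types simply have none."""
    # getattr is still needed because the exception is loosely typed here.
    return list(getattr(exception, "validation_warnings", []))


error = WorkflowValidationError(["unknown node type"], validation_warnings=["unused input"])
```

Because the attribute is declared in `__init__`, the assignment site no longer needs a `# type: ignore[attr-defined]` comment.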
…ti-failure status, trace errors
Three regressions found during post-implementation verification of Task 148, all violating explicit spec requirements or acceptance criteria.
BUG #1: output_resolver silently swallowed all-failed coalesce.
The gate at output_resolver.py:168-172 skipped any unresolved coalesce expression, so ${primary.stdout ?? fallback.stdout} with both operands in __failures__ silently dropped the output instead of raising OutputResolutionError. This is the exact silent-bad-output behavior Task 148 set out to eliminate (spec requirement: "both failed produces a structured error" + "All coalesce operands failed" summary block).
Fix: extract _is_all_absent_coalesce helper that calls classify_unresolved_references and only silent-skips when every ref is status=absent (Task 128 branch-convergence semantic). Any FAILED or PATH_ERROR operand now falls through to error recording with the structured unresolved_references payload.
BUG #2: build_execution_steps used singular failed_node for per-node status.
execution_state.py:110-133 determined status from exec_state["failed_node"] (singular pointer, only last failure). In multi-failure workflows, earlier failed nodes matched neither completed_nodes nor the singular and fell through to "not_executed". An AI agent consuming the JSON output would see status:"not_executed" for a node that clearly did execute and fail. Missed Phase 3 migration: only the metadata get_node_output lookup at line 143 was migrated, status determination at 128-133 was left on the pre-invariant singular-field model.
Fix: replace the three-arm conditional with a call to node_state.get_node_status() mapped through _STATUS_MAP to the existing status strings. Single source of truth.
BUG #3: --report summary showed "Unknown error" for #208 repro.
Task 148 acceptance criterion: "--report for the #208 repro shows primary as failed with full context". Root cause: record_trace was called with no error kwarg in the happy-path action="error" branch (only raised exceptions passed error). For shell exit N routed via on-error, the trace event had success:False but no error field, so trace_report.py fell back to "Unknown error".
Fix: pre-compute trace_error from shared[node_id].error and pass it to record_trace as the new Optional[Exception | str] error kwarg.
Test fixture migration.
BUG #2 fix broke 4 tests whose fixtures set failed_node singular without populating __failures__ (pre-invariant shape). Updated test_execution_state.py (3 tests) and test_error_formatter.py (1 test) to populate __failures__ dict per the Task 148 invariant.
Regression tests.
Added 4 new tests targeting the exact code paths that would have caught the bugs:
- test_multi_failure_all_show_failed_status: multi-failure build_execution_steps
- test_all_failed_coalesce_in_output_raises_structured_error: BUG #1
- test_mixed_absent_and_failed_coalesce_in_output_raises_error: BUG #1 variant
- test_all_absent_coalesce_in_output_is_silently_skipped: Task 128 positive regression
Verification.
make check clean. pytest -n 4 -> 4708 passed (was 4704). #208 repro, test 02 ignore_errors, test 03 both-fail, test 09 loop recovery, test 23 mixed absent+failed all produce their expected output. Direct build_execution_steps probe with multi-failure state labels both failed nodes correctly.
Follow-ups filed (out of scope, both verified during the same pass):
- #240: Trace aggregation reports workflow as 'failed' after loop recovery (pre-existing trace semantic gap)
- #241: Invariant does not hold for enable_namespacing=false (real bug, not reachable via user paths, recommended deprecation)
Progress log at .taskmaster/tasks/task_149/implementation/progress-log.md documents the full verification pass including withdrawn claims and Bug #2b (stale failed_node pointer) rationale for deferral.
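BUG #2 is easiest to see side by side. The two functions below are hypothetical reductions of the old and new status logic. They are not the code in execution_state.py or node_state.get_node_status(), and the "completed" status string is an assumption; "failed" and "not_executed" come from the commit message.

```python
from typing import Dict, List, Optional


def status_from_singular(node_id: str, completed: List[str], failed_node: Optional[str]) -> str:
    """Pre-fix shape: only the last failure is remembered."""
    if node_id in completed:
        return "completed"
    if node_id == failed_node:
        return "failed"
    return "not_executed"


def status_from_failures(node_id: str, completed: List[str], failures: Dict[str, str]) -> str:
    """Post-fix shape: every failed node is recorded in a failures mapping."""
    if node_id in failures:
        return "failed"
    if node_id in completed:
        return "completed"
    return "not_executed"


# Two nodes failed; the singular pointer only remembers the later one.
failures = {"primary": "exit 1", "fallback": "exit 2"}
old_status = status_from_singular("primary", [], failed_node="fallback")
new_status = status_from_failures("primary", [], failures)
```

Here `old_status` mislabels the earlier failure as never executed, while `new_status` reports it as failed.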
…kip review]
PR #378 review finding #4. Closes the synthetic-builder ↔ production-analyze() fidelity gap in tests/test_core/test_cache_analysis_renderers.py.
What landed
- _BUILDER_DOCUMENTED_DEFAULTS frozenset (6 entries) names the AnalysisSummary fields the synthetic builder cannot faithfully model. Tests asserting on these MUST drive analyze() end-to-end.
- TestMakeAnalysisShapeParity class with two methods:
  * test_builder_field_set_matches_dataclass_minus_documented_defaults: uses dataclasses.fields() introspection to fail noisily when a new field is added without builder coverage or allowlist documentation.
  * test_documented_defaults_get_overwritten_by_production: drives analyze() against a contrived IR + trace that triggers each documented-default overwrite, catching the case where production's overwrite logic is deleted while the allowlist stays stale.
- Three renderer tests migrated from synthetic to e2e analyze() calls:
  * test_json_partial_trace_exposes_evidence_scope_and_observed_models
  * test_json_summary_exposes_projection_exclusions_and_delta_reason
  * test_render_json_includes_rollup_workflow_paths_and_unavailable_models_by_workflow
Reviewer's two non-issues confirmed via inspection
- test_text_summary_renders_blocking_errors_categorically (line 272 in PR baseline) only asserts on builder-populated fields. Kept synthetic.
- test_json_emits_root_and_sub_workflow_llm_node_counts already drove analyze() end-to-end against the committed 3-deep fixture. No migration needed.
Verification
- 5 mutation contracts checked by reverting production code:
  * Add new AnalysisSummary field → parity method 1 fails naming the field.
  * Delete observed_models_in_trace overwrite → migration #2 + parity method 2 fail with documented diagnostics.
  * Delete unavailable_models_by_workflow overwrite → migration #4 + parity method 2 fail.
- 6,335 tests passing on default suite.
- make check clean (ruff + ruff-format + mypy + deptry).
- test_golden_baseline_hashes_match (DD#19) green; test_plan_drift.py 34/34.
Plan + progress log
- Atomic plan at .taskmaster/tasks/task_159/implementation/fix-plans/renderer-test-fidelity-shape-parity-plan.md.
- Consolidated PR #378 review-fix sweep entry appended to implementation-progress-log.md, documenting all four phases (Phase 1 easy bundle / Phase 2 medium bundle / Phase 3 cohort-key correctness / Phase 4 this commit), the five disputed findings (with citations), the GH #380 follow-up filed, and the cross-cutting insights from the sweep.
GH issue #380 filed for the deferred test-bloat parametrize-collapse work.
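The shape-parity check can be sketched with `dataclasses.fields()`. The three-field `AnalysisSummary` below is a toy stand-in; the field `total_cost_usd` and the two set names are hypothetical, and the real dataclass has more fields.

```python
from dataclasses import dataclass, fields


@dataclass
class AnalysisSummary:
    """Toy stand-in for the production dataclass."""
    total_cost_usd: float = 0.0
    evidence_scope: str = ""
    observed_models_in_trace: tuple = ()


# Fields the synthetic test builder sets itself.
BUILDER_FIELDS = frozenset({"total_cost_usd", "evidence_scope"})
# Fields the builder cannot model; tests on these must drive analyze() end-to-end.
DOCUMENTED_DEFAULTS = frozenset({"observed_models_in_trace"})


def uncovered_fields() -> set:
    """Dataclass fields that are neither built nor explicitly allowlisted."""
    declared = {f.name for f in fields(AnalysisSummary)}
    return declared - BUILDER_FIELDS - DOCUMENTED_DEFAULTS


def test_builder_field_set_matches_dataclass() -> None:
    missing = uncovered_fields()
    assert not missing, f"Add builder coverage or document a default for: {sorted(missing)}"
```

Adding a field to the dataclass without updating either set makes the test fail and name the new field, which is the first mutation contract listed above.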
`pflow analyze-cache --from-trace` previously rendered three delta lines
(Actual savings + First-run + Rerun) in trace mode. The first-run-with-
cache projection assumes a fresh run with no memo cache hits and no
provider implicit caching — neither modeled. When those fire (the common
case), the projection diverges from actual cost by an order of magnitude
(lyrics-generator: projection said "saves 1%"; actual was 49%) and the
two competing numbers anchor agents on the smaller, misleading figure.
Option B: drop the first-run delta line in trace mode entirely. Show only
the measured number plus the steady-state forward-looking projection.
- `_render_summary_deltas` split into 4-line dispatch + two named helpers
`_render_trace_deltas` and `_render_greenfield_deltas`. Dispatch on
`evidence_scope ∈ {"complete_trace", "truncated_trace_executed_subset"}`
rather than `actually_paid_usd is not None` to handle the all-unpriced-
trace edge case correctly.
- Trace mode renders `Actual savings (this run):` (or `unavailable
(projection excludes …)`) + `Rerun delta (projected):`. The
`(projected)` suffix on rerun signals "model, not measurement" right
at the line.
- Greenfield mode unchanged shape: `First-run delta:` + `Rerun delta:`,
no `(projected)` suffix (both are projections by construction; the
absence of an actual-savings line carries the signal).
- The truncated-mode `(executed)` qualifier on rerun was retired —
the suppression note already conveys executed-subset context.
- `_format_delta` actual-delta label simplified `"actual vs no-cache"`
→ `"vs no-cache"` (the row label "Actual savings (this run):" already
says "actual"; doubled word was a Stage A artifact).
- Dead branch `if not in_trace_mode and actual_line:` removed
(production-unreachable per searcher #2; only synthetic fixtures hit it).
- `_make_analysis` test builder upgraded to set `evidence_scope` based
on `actually_paid` (matches production shape; closes a Pitfall #19
gap that searcher #2 specifically warned about under the new dispatch).
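The dispatch described in the first bullets might look roughly like this. The function names drop the leading underscore and take pre-formatted strings, the `unavailable` wording is shortened, and the `"greenfield"` scope value is a placeholder; all of these are simplifications for the example rather than the renderer's actual code.

```python
from typing import List, Optional

# The two evidence_scope values that select trace mode.
TRACE_SCOPES = frozenset({"complete_trace", "truncated_trace_executed_subset"})


def render_trace_deltas(actual: Optional[str], rerun: str) -> List[str]:
    # The measured number comes first; an all-unpriced trace has no actual figure.
    actual_text = actual if actual is not None else "unavailable"
    return [
        f"Actual savings (this run): {actual_text}",
        f"Rerun delta (projected): {rerun}",
    ]


def render_greenfield_deltas(first_run: str, rerun: str) -> List[str]:
    # Both lines are projections, so neither carries a "(projected)" suffix.
    return [f"First-run delta: {first_run}", f"Rerun delta: {rerun}"]


def render_summary_deltas(
    evidence_scope: str, actual: Optional[str], first_run: str, rerun: str
) -> List[str]:
    # Dispatch on scope rather than on whether an actual cost exists.
    if evidence_scope in TRACE_SCOPES:
        return render_trace_deltas(actual, rerun)
    return render_greenfield_deltas(first_run, rerun)


lines = render_summary_deltas("complete_trace", None, "saves 1%", "saves 60%")
```

Keying on `evidence_scope` keeps an all-unpriced trace in trace mode, where a check on the paid amount being present would have fallen through to the greenfield rendering.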
JSON shape unchanged — all three deltas still on `AnalysisSummary`.
Tests: 2 new regression tests with verified mutation contracts; 1
existing test renamed and rewritten for Option B truncated semantics;
1 new mutation guard on the `vs no-cache` label simplification. Three
mutation contracts verified by reverting production code and observing
the matching test fail.
6,396 default-suite tests pass. 65/65 baselines pass (no drift — the
parallel L-12/L-2/L-1 commit had already regenerated the 3 affected
baselines anticipating Option B's final shape). Manual smoke confirms
trace mode shows actual + rerun (projected) only; greenfield shows
first-run + rerun.
Closes L-3 from BASELINE-AUDIT Section F.
Summary
Fix critical bug where ALL Claude models were silently redirected to a hardcoded model (`claude-sonnet-4-20250514`), ignoring user's model choice. Migrate discovery and filtering features to use standard `llm` library with configurable models.
Changes
Bug Fix (Phase 1) - Critical
- Move monkey-patch from `workflow_command()` entry to planner-only path
Discovery Migration (Phase 2)
- Remove `install_anthropic_model()` from registry discover
- Remove `install_anthropic_model()` from workflow discover
- Use `get_model_for_feature("discovery")` for configurable model
Smart Filter + Settings (Phase 3+4)
- Add `LLMSettings` class with `default_model`, `discovery_model`, `filtering_model`
- Add `get_model_for_feature()` helper with resolution chain
Default Model Configuration
Model Updates
- Bump `llm-gemini>=0.28.1`, `llm-anthropic==0.23`
Files Changed (24 files)
Core Implementation:
- `src/pflow/cli/main.py` - Move monkey-patch to planner path
- `src/pflow/cli/registry.py` - Use configurable discovery model
- `src/pflow/cli/commands/workflow.py` - Use configurable discovery model
- `src/pflow/core/llm_config.py` - Add model resolution helpers
- `src/pflow/core/settings.py` - Add LLMSettings class
- `src/pflow/core/smart_filter.py` - Use configurable filtering model
- `src/pflow/runtime/compiler.py` - Add LLM model injection
Tests (21 new tests):
- `tests/test_core/test_llm_config_workflow_model.py` (15 tests)
- `tests/test_runtime/test_compiler_llm_model.py` (6 tests)
Task Documentation
- `.taskmaster/tasks/task_95/implementation/progress-log.md`
- `.taskmaster/tasks/task_95/research/llm-usage-findings.md`
Testing
All 3388 tests pass:
Manual verification completed for all phases (see progress-log.md).