refactor(llm): replace llm library with LiteLLM via pflow-owned adapter#356
Conversation
Phase A.1 of Task 158 — installs LiteLLM next to the existing llm/llm-anthropic/llm-gemini trio without removing them. The pflow-owned adapter (A.3) will import litellm; old deps get removed in A.11 once all callsites have migrated. Pinned to 1.82.6 (not 1.83.7 as recommended in the Phase 0 spike): all 1.83.x releases hard-pin click==8.1.8, which downgrades click from 8.3.1 and breaks 3 CliRunner tests that depend on click 8.2+'s default stderr separation. 1.82.6 leaves click unconstrained. Gemini PR #15226 (2025-10-07) is present in 1.82.6. Temporary deptry ignore for litellm — adapter imports it in A.3.
Phase A.2 — `src/pflow/core/llm_reasoning_map.py`:
Replaces the live `model.Options.model_fields` introspection at the previous
`nodes/llm/llm.py:35-114`. LiteLLM has no equivalent contract, so capabilities
are detected by model-name string sniffing (same approach as
`registry/smart_filter.py:178`). Output shape preserves the legacy
llm-library kwarg shape (e.g. `{"thinking": True, "thinking_budget": N}`);
the adapter translates to LiteLLM-native shapes at the API boundary.
Critical: Anthropic Opus 4.5 thinking_effort precedence preserved.
Phase A.3 — `src/pflow/core/llm_client.py`:
Single seam for all LLM calls. Wraps `litellm.completion`. Public API:
`complete(...)` returns an `AdapterResponse` with stable `.text` (attribute,
not callable), `.usage` dict (matching keys downstream code already expects),
`.model`, `.has_schema`, plus `.error`/`.status` for the PATTERN EXCEPTION
case. Catches `BadRequestError` and returns an error-marked response so
the Node retry loop doesn't burn 3 attempts on permanent failures.
Cache-token normalization handles both Anthropic and Gemini/OpenAI paths
into one stable `usage` dict shape. Cost is read from LiteLLM's
`_hidden_params["response_cost"]` per Phase 0 outcome A.
`trace_hook` parameter is the seam that will replace the
`runtime/workflow_trace.py:520-599` monkey-patch in Phase A.6.
No production code is wired to either module yet; that's A.5 and A.7.
Tests: 60 in test_llm_reasoning_map.py + 35 in test_llm_client.py, all
mocked, no network.
Phase A.4 — test infrastructure:
* `tests/shared/llm_mock.py`: add `MockLLMClient` returning `AdapterResponse`.
Shares `_DEFAULT_RESPONSES` table with the legacy `MockLLMModel`. Preserves
the 500-char `call_history` truncation; adds parallel `call_history_full`
for future cache-structure tests. `MockLLMModel` and `MockGetModel` are
retained — both fixtures coexist during A.4-A.7 while callers migrate.
* `tests/conftest.py`: add `mock_llm_client` autouse fixture that patches
`pflow.core.llm_client.complete` plus the consumer-module bindings.
Same `/llm/` skip pattern as the legacy fixture.
Phase A.5 — LLMNode rewrite:
* `src/pflow/nodes/llm/llm.py`:
- Drop `import llm` and `from pydantic import ValidationError`.
- `_call_llm` calls `pflow.core.llm_client.complete` instead of
`model.prompt`. The PATTERN EXCEPTION (deterministic BadRequestError →
error-marked response, no retry burned) moves into the adapter; the
branch in LLMNode now reads `adapter_response.status == "error"`.
- `prep` builds `Attachment` (the new dataclass) instead of
`llm.Attachment`.
- `post` simplified — the adapter normalizes usage to a stable dict shape
so the legacy object-with-.input/.output/.details path is gone.
- `exec_fallback` uses `isinstance` on `litellm.exceptions.{Timeout,
NotFoundError, AuthenticationError, BadRequestError}` instead of class-
name string matching. User-facing error text adjusted minimally:
`'llm models'` → `'pflow settings llm show'`,
`'llm keys set <provider>'` → `'pflow settings env set'`.
- `EFFORT_RATIOS`, `DEFAULT_MAX_TOKENS_BASE` re-exported from
`pflow.core.llm_reasoning_map` for backward compatibility.
- New `_active_trace_hook()` reads the active collector from
`WorkflowTraceCollector._active_collectors[thread_id]` and returns
`collector.get_trace_hook(node_id)` to pass to the adapter.
* `src/pflow/runtime/workflow_trace.py`: add `get_trace_hook(node_id)`
returning a callable that captures `before_call.prompt` into
`self.llm_prompts[node_id]` — same destination the monkey-patch writes
to. `setup_llm_interception` / `cleanup_llm_interception` retained
during A.5-A.6 (still active for the 3 discovery callsites until A.7).
Test migration:
* `tests/test_nodes/test_llm/test_llm.py`: 61 tests rewritten to use
the new fixture + assert against `mock_llm_client.call_history`. Added
new `TestReasoningEffortValidation` and `TestReasoningKwargsForwarded`
classes (12 tests) to cover prep validation and kwarg forwarding.
* `tests/test_nodes/test_llm/test_llm_images.py`: 12 tests rewritten to
assert against the new `Attachment` dataclass shape.
* `tests/test_nodes/test_llm/test_llm_reasoning.py`: superseded by
`tests/test_core/test_llm_reasoning_map.py` + the new test_llm.py
classes. Module-level `pytest.skip` so collection is clean; body
retained pending user-approved deletion. Per-file F821/F401 ignore in
pyproject.toml for the dead-code body.
* `tests/test_integration/test_metrics_integration.py`: `test_llm_cost_calculation`
and `test_llm_accumulation_across_nodes` rewired to monkeypatch
`pflow.nodes.llm.llm.complete` with custom AdapterResponse-returning
functions (the per-test usage-token control they need).
Verification:
* `make test` → 5303 passed, 10 skipped, 0 failed
* `tests/test_execution/test_plan_drift.py` → 32 passed (sacred parity intact)
* `make check` → ruff/mypy/deptry all green
The mock intentionally OMITS `cost_usd` from the usage dict so
`enrich_llm_usage_with_cost` runs the existing pricing-from-tokens path
(several tests rely on historical-cost propagation through the memo cache).
…leanup (A.6+A.7+A.8) Phase A.6 — tracing redesign: * `src/pflow/runtime/workflow_trace.py`: deleted the `setup_llm_interception` monkey-patch (intercepted `llm.get_model` and every `model.prompt` instance); replaced with a thin per-thread registration step that the engine still calls before each LLM-capable node. The new path is the `trace_hook` parameter the LiteLLM adapter takes (added in A.3, wired by LLMNode in A.5). * Dropped class-level state that supported the patch: `_llm_interception_count` and `_original_get_model`. Kept `_active_collectors`, `_thread_local`, and `_llm_lock` — LLMNode's `_active_trace_hook()` reads them to find the right collector. * `cleanup_llm_interception` is now a thin unregister + idempotent guard. Phase A.7 — three discovery callsites migrated: * `src/pflow/registry/discovery.py::find_components` * `src/pflow/registry/smart_filter.py::smart_filter_fields` * `src/pflow/core/workflow/discovery.py::find_workflow` * All three now call `pflow.core.llm_client.complete(...)` instead of `llm.get_model(...).prompt(...)`. Pydantic schema → JSON Schema dict via `Class.model_json_schema()` at the call site. * Smart filter's Gemini thinking heuristics flow through `model_options` (LiteLLM accepts these top-level for Gemini). * `parse_structured_response` already tolerated both the old callable `text()` and the new attribute `text` — no change needed. Phase A.8 — mass test cleanup: * `tests/conftest.py`: deleted the legacy `mock_llm_calls` autouse fixture (which patched `llm.get_model`) and the `mock_llm_responses` helper. `mock_llm_client` is now the sole LLM mock. * `tests/shared/llm_mock.py`: docstring updated. `MockLLMModel` / `MockGetModel` / `create_mock_get_model` retained but marked OBSOLETE per user instruction (no file/symbol deletes during Phase A — they go on the end-of-task deletion checklist). * `tests/test_integration/test_metrics_integration.py`: `test_trace_captures_llm_calls` migrated to monkeypatch `pflow.nodes.llm.llm.complete`. The bespoke `mock_llm` fixture removed (no remaining consumers); `test_cost_calculation_accuracy` signature trimmed. * `tests/test_cli/test_dry_run.py`: dry-run "no LLM/HTTP calls" test now patches `pflow.nodes.llm.llm.complete` (the live binding). * `tests/test_nodes/test_llm/test_llm_integration.py::has_openai_api_key`: rewritten to read `OPENAI_API_KEY` directly — the LiteLLM adapter resolves keys from env vars, no `llm` library introspection needed. Keeps the file collectable after A.11 removes the legacy package. * Subagent migrated 5 test files (test_smart_filter, test_component_discovery, test_workflow_discovery, test_node_output_formatter, test_nested_workflow_cli) — all bulk renames `mock_llm_calls` → `mock_llm_client`. One regression-guard fix in `test_nested_workflow_cli.py` where the old assertion targeted the wrong mock. * `pyproject.toml`: added `llm` to the temporary `DEP002` ignore — it remains in the lockfile until A.11 removes it from `dependencies`. Verification: * `make test` → 5303 passed, 10 skipped, 0 failed * `tests/test_execution/test_plan_drift.py` → 32 passed (sacred parity intact) * `make check` → ruff/mypy/deptry all green * No production code references `llm` (the package). The two remaining references are: a docstring in `runtime/workflow_trace.py` (historical context for the deleted patch), and `tests/shared/llm_mock.py`'s obsolete-but-retained code block. Pending deletion checklist (added to in this commit): * `MockLLMModel` / `MockGetModel` / `create_mock_get_model` in `tests/shared/llm_mock.py` — unreferenced. * The bespoke `mock_llm` fixture references in `tests/test_integration/test_metrics_integration.py` (already removed in this commit; checklist entry tracks the import line).
Removes the `llm keys get`/`llm models default` subprocess shells from llm_config.py — LiteLLM (since A.5) reads keys from os.environ directly, so the subprocess fallback is dead weight. Production changes (`src/pflow/core/llm_config.py`): - DROP `_has_llm_key()`, `get_llm_cli_default_model()`, `_get_validated_llm_path()` - DROP `LLM_COMMAND`, `_LLM_KEYS_SUBCOMMAND` constants - DROP `subprocess` and `shutil` imports (no callers remain) - `_has_provider_key()` now only consults env vars + pflow settings (two sources) - `_detect_default_model()` no longer skip-guards on PYTEST_CURRENT_TEST (the guard was there to prevent subprocess hangs; no subprocess to hang now) - `get_default_workflow_model()` resolution: settings → auto-detect → None - `get_llm_setup_help()` rewritten: env vars + `pflow settings set-env` instead of `llm keys set <provider>` - `get_model_not_configured_help()`: removed `llm models default`, `llm models list`, `llm keys list` references; points at `pflow settings llm show` instead - `inject_settings_env_vars()` UNCHANGED (still preserves the test guard to avoid polluting test env) Help-text updates (`src/pflow/cli/commands/settings.py`): - Settings group docstring: drop "via Simon Willison's llm tool"; show env-var + `pflow settings set-env` instead - `_get_resolved_model()`: drop `get_llm_cli_default_model` import; default-model resolution now shows auto-detect, not "llm CLI default" - `llm_show` resolution-order text and `llm_unset` "default" message updated to remove "llm CLI default" phrasing Lint: - Drop the `S603` per-file ignore for `llm_config.py` (no subprocess remains in the file) Tests: - `test_llm_config.py`: replace `test_pytest_environment_skips_detection` (tested the removed guard) with `test_detection_only_uses_env_and_settings` - `test_llm_config_provider_detection.py`: replace `test_falls_back_to_llm_cli` with `test_returns_false_when_neither_env_nor_settings_have_key`; drop the PYTEST_CURRENT_TEST bypass dance from `_detect_default_model` tests - `test_llm_config_workflow_model.py`: drop `TestGetLlmCliDefaultModel` class entirely (function gone); drop the 3 `TestGetDefaultWorkflowModel` tests that exercised the llm-CLI fallback chain; rewrite `TestGetModelNotConfiguredHelp` assertions to expect `pflow settings` guidance and reject the removed `llm models default`/`llm keys list` text - `test_compiler_llm_model.py`: same help-text assertion update Verification: - `make test` — 5294 passed, 0 failed (9 fewer tests = removed dead llm-CLI subprocess tests) - `tests/test_execution/test_plan_drift.py` — 32 passed (sacred parity intact) - `make check` — ruff, ruff-format, mypy, deptry all green - Manual: `pflow settings llm show` and `pflow settings --help` produce sensible output with no llm-CLI references
Phase 0 spike confirmed Outcome A: LiteLLM's per-model pricing data covers every model pflow knows about (2,678 models vs pflow's 41) and fixes the per-provider cache-multiplier bugs in pflow's table (e.g. OpenAI cache_read should be 0.5×, pflow had 0.1× universally — 80% off). The Phase 0 spike's live regression test on Gemini PR #15226 confirmed the double-count fix is present in the pinned LiteLLM version. Production: - DELETE `src/pflow/core/llm_pricing.py` (188 lines: MODEL_PRICING, MODEL_ALIASES, calculate_llm_cost, get_model_pricing, the inline enrich_llm_usage_with_cost computation). - MOVE `enrich_llm_usage_with_cost` to `pflow.core.llm_client` as a 10-line wrapper. Single responsibility: ensure `cost_usd` key is present (preserve when set; mirror Claude Code SDK's `total_cost_usd`; otherwise None). No pricing math. - Production callers (`metrics.py`, `nodes/llm/llm.py`, `runtime/engine/instrumentation.py`, `runtime/engine/batch_executor.py`) updated to import from the new location. - `core/__init__.py` docstring updated: `llm_client` is the public surface now, `llm_pricing` no longer referenced. Mock contract change (option (c) — the right end-state, not the least-disruptive migration): `MockLLMClient` previously OMITTED `cost_usd` so the deleted `enrich_llm_usage_with_cost` could compute it from MODEL_PRICING. With that path gone, the honest end-state is: - Default response: `cost_usd: None` in usage dict — matches production behavior when LiteLLM has no pricing data (custom endpoints, brand-new models, Ollama). - `set_response()` accepts an optional `cost_usd=` kwarg so tests that care about specific values can declare them upfront. Self-documenting: reading the test tells you what cost the test expects. This rejected option (a) `cost_usd: 0.0` (still a fiction; production never returns 0.0 for a real call) and option (b) `cost_usd = input_tokens * X + output_tokens * Y` (a fake formula that future test authors would mistake for a real computation). Tests updated (15 affected, all the failures from the production deletion): - `test_metrics.py`: rewrote 4 tests that exercised pricing math — `test_aggregation_sums_explicit_cost_usd` (sums injected cost_usd), `test_summary_generation_with_workflow_metrics` (injects cost_usd), `test_missing_fields_in_llm_usage_handled_gracefully` (verifies the new "no cost_usd → unavailable" semantic with a partial-cost entry), `test_cost_rounding_to_six_decimal_places` (rounds the sum). Removed `test_cost_calculation_for_different_models` — it tested the 4-model pricing table directly. - `test_metrics_thinking_cache.py`: removed `test_thinking_cost_calculation` and `test_cache_cost_calculation` (tested pricing math); updated `test_no_thinking_or_cache_tokens` to inject cost_usd; dropped now-unused `pytest` import. - `test_metrics_integration.py`: updated `test_llm_cost_calculation` and `test_llm_accumulation_across_nodes` to set cost_usd on inline mock usage dicts and assert exact sums (more rigorous than the prior `> 0` check). Removed `test_cost_calculation_accuracy` — it directly tested the deleted MODEL_PRICING table. - `test_unknown_model_user_experience.py`: renamed `test_mixed_models_shows_partial_cost_clearly` to `test_mixed_priced_unpriced_shows_partial_cost_clearly` (the relevant split is now "has cost_usd vs doesn't", not "known vs unknown model"). Also updated obsolete "Update MODEL_PRICING" comment. - `test_instrumented_wrapper.py`: rewrote `test_llm_usage_enriched_with_cost_in_shared_store` and `test_trace_event_has_cost_usd` to verify the new contract (cost_usd preserved when set, None when not), instead of pinning gpt-4 dollar values from the deleted pricing table. - `test_plan_drift.py::test_plan_cost_nested_rollup` now takes the `mock_llm_client` fixture and sets an explicit cost — assertions pin the exact value through the rollup chain (sharper than the prior `> 0` checks). - `test_llm_pricing.py`: pytest.skip'd at module level (matches the Phase A.5 `test_llm_reasoning.py` pattern). Body retained pending user-approved deletion. Per-file F821/F401 ignore added. Documentation: - `core/CLAUDE.md`: replaced the `llm_pricing.py` section with a new `llm_client.py` section that documents the cost-from-LiteLLM contract. Removed the "46+ models" / "Broken aliases" claims. Trimmed `llm_config.py` section to remove the deleted llm-CLI default resolution step. Verification: - `make test` — 5266 passed, 0 failed (1 fewer than A.9 baseline = removed `test_cost_calculation_accuracy`) - `tests/test_execution/test_plan_drift.py` — 32 passed (sacred parity invariant intact) - `make check` — ruff, ruff-format, mypy, deptry all green Deletion checklist additions (pending user approval): - `tests/test_core/test_llm_pricing.py` (entire file body) — already pytest.skip'd at module level - The temporary F821/F401 per-file ignore for `test_llm_pricing.py` in `pyproject.toml`
Production code finished migrating to the LiteLLM adapter in A.7; the last test/fixture references were removed in A.8. The llm trio (and its temporary DEP002 ignore entries) can now be dropped. - pyproject.toml: remove `llm>=0.29`, `llm-anthropic==0.25`, `llm-gemini>=0.30` from `[project] dependencies`. Reduce `[tool.deptry.per_rule_ignores] DEP002` to just `["PyYAML"]` (PyYAML is loaded via lazy import in the markdown parser). Also remove the obsolete commented-out `[project.optional-dependencies]` stub. - uv.lock: regenerated via `uv sync`. llm, llm-anthropic, llm-gemini, and their transitive deps (ijson, puremagic, sqlite-utils, etc.) uninstalled. Verification: - `uv pip list | grep -iE '^(llm|llm-)'` → empty (only `litellm` present) - `grep -rn 'import llm$|from llm import' src/pflow/ tests/` → zero hits - `make test` — 5266 passed, 0 failed - `make check` — ruff, ruff-format, mypy, deptry all green
Sweeps the user-facing docs and bundled MCP agent instructions for the LiteLLM migration. Adds a changelog entry describing the swap, the provider-coverage upgrade, and the breaking change for users storing keys via Simon Willison's `llm` CLI. User-facing docs: - `docs/quickstart.mdx`: drop the now-invalid "If you already have llm installed and configured" tip; replace the "Simon Willison's llm" pointer with LiteLLM's provider list. Add a tip showing the env-var path works too. - `docs/reference/nodes/llm.mdx`: rewrite the intro and replace the "Extending with plugins" section (which described `llm-openrouter` / `llm-ollama` plugin install) with a simpler "Other providers" section noting that LiteLLM speaks 100+ providers natively. OpenRouter and Ollama get concrete examples without plugin install. Update the error- table guidance to point at `pflow settings llm show` and the LiteLLM provider list. - `docs/reference/cli/settings.mdx`: drop the "If you use Simon Willison's llm" alternative; show the env-var path. Remove the now- invalid `llm models default` step from the model-resolution chain. - `docs/roadmap.mdx`: update the "Unified model support" current-status line to credit LiteLLM and note the 100+ provider count. Bundled MCP instructions: - `mcp-agent-instructions.md`: rewrite the "For LLM providers" block — use `pflow settings set-env` and shell env vars instead of the deleted `llm keys set <provider>`. Update the cheatsheet's `llm keys set` line to `pflow settings llm show`. - `mcp-sandbox-agent-instructions.md`: same treatment for the sandboxed-agent guidance. Node README: - `src/pflow/nodes/llm/README.md`: rewrite the installation section end-to-end. Provider keys are now set via `pflow settings set-env` or shell env vars; OpenRouter and Ollama are first-class without plugin install. Drop the dead `pip install llm-anthropic` / `llm-gpt4all` / `pflow[all-llms]` instructions. Update the token-usage example to show `cost_usd`. Update error-handling guidance. Changelog: - New `<Update>` block at the top labeled "Unreleased" documenting the swap. Highlights: native 100+ provider support, current pricing data. Breaking changes: env vars / `pflow settings set-env` only (no automatic pickup of `llm keys set` storage); `llm`/`llm-anthropic`/ `llm-gemini` packages no longer pflow dependencies. Accordion enumerating what stays the same so users don't worry their workflows will need rewriting. Intentionally retained: - `src/pflow/core/llm_reasoning_map.py` module docstring's reference to Simon Willison's `llm` library — this is historical context that explains *why* the file exists (LiteLLM doesn't expose the `Options` introspection contract the original code relied on). Removing the reference would erase the design rationale. - `docs/changelog.mdx` historical entry for the original `llm` integration — that's a record of what was true at release time. Verification: - `make test` — 5266 passed, 0 failed - `make check` — ruff, ruff-format, mypy, deptry all green - `grep -rln 'llm keys|llm models|Simon Willison' src/pflow/ docs/` — only the two intentional retentions remain
Five items the user explicitly deferred during Phase A so they could review before we cut anything. All approved. Deletes (items 1-3): - DELETE `tests/test_nodes/test_llm/test_llm_reasoning.py` — entire file. Module was pytest.skip'd since A.5; coverage migrated to `tests/test_core/test_llm_reasoning_map.py` (60 tests on the new `map_reasoning_options` API) plus `TestReasoningEffortValidation` and `TestReasoningKwargsForwarded` in `tests/test_nodes/test_llm/test_llm.py`. - DELETE `tests/test_core/test_llm_pricing.py` — entire file. Module was pytest.skip'd since A.10; the production `pflow.core.llm_pricing` module was deleted in that step. Pricing is LiteLLM's responsibility now. - DELETE `MockLLMModel`, `MockGetModel`, `create_mock_get_model` from `tests/shared/llm_mock.py`. Unreferenced since A.8 — they patched the `llm.get_model` seam that no longer exists. Kept the shared `_DEFAULT_RESPONSES` table and `_schema_name()` helper (still used by `MockLLMClient`). Updated `tests/shared/README.md` to document the new state. Config (item 4): - Drop the two `[tool.ruff.lint.per-file-ignores]` entries for `test_llm_reasoning.py` (F821, F401) and `test_llm_pricing.py` (F821, F401) — added during A.5 and A.10 to silence linter warnings on the dead-code bodies retained inside the skipped files. Both files are gone, so the ignores are gone too. Rename (item 5): - `WorkflowTraceCollector.setup_llm_interception` → `register_for_llm_call` - `WorkflowTraceCollector.cleanup_llm_interception` → `unregister_from_llm_call` - Module-level engine wrapper `setup_llm_interception(...)` → `register_for_llm_call(...)` in `runtime/engine/instrumentation.py` - Caller updates: `runtime/engine/engine.py:45,293` (import + call site) and `execution/runner.py:708-712` (cleanup hasattr check + call) - Docstring touch-ups in `nodes/llm/llm.py` (`_active_trace_hook`), `runtime/workflow_trace.py` (`get_trace_hook` + the renamed methods' docstrings), and `runtime/engine/CLAUDE.md` (lifecycle diagram + registration entry) Why rename: A.6 collapsed the prior global `llm.get_model` / `model.prompt` monkey-patch (with reference-counted lazy install/ teardown) into a thin per-thread registration step. The "interception" name described what the methods did before A.6 — it misleads now. The new names describe the actual work: the engine "registers" each LLM-capable node before it runs so the adapter's trace_hook can find the right collector at call time, and the runner "unregisters" the collector at cleanup. Behavior unchanged. The class-level state (`_active_collectors`, `_thread_local`, `_llm_lock`, `enable_llm_interception` flag, etc.) keeps the same semantics — only the method names change. Verification: - `make test` — 5266 passed, 0 skipped, 0 failed - `tests/test_execution/test_plan_drift.py` — 32/32 sacred parity tests green - `make check` — ruff, ruff-format, mypy, deptry all green - Two retained references (`workflow_trace.py:566` historical context in the renamed method's docstring; `test_metrics_integration.py:21` obsolete-fixture comment) are intentional — they document why the current code looks the way it does.
Capture the A.9-A.12 + end-of-task cleanup work in the progress log and hand off a review-focused braindump for the next agent. Progress log §30 covers: - A.9: llm CLI subprocess paths dropped, with the spec deviation (`pflow settings set-env`, NOT `pflow settings env set` as the prior braindumps assumed) - A.10: the cost-mock decision the user pushed back on — option (c) with `cost_usd: None` default, tests inject explicit costs - A.11: drop the llm trio from pyproject - A.12: docs pass + CHANGELOG migration entry, with the two intentional retentions (llm_reasoning_map.py docstring, historical changelog entry) documented - Migration of user's llm-stored keys into pflow settings - Real-API smoke test on Gemini-3-flash-preview ($0.0005), with the pre-existing reasoning-model UX issue surfaced and flagged - End-of-task cleanup: 5 items processed (2 file deletes, MockLLMModel family removed, 2 lint ignores dropped, methods renamed) - Final state at handoff and loose-ends acknowledged but not blockers Braindump replacement: - DELETE the now-obsolete A9-A12 handoff (that work is done) - NEW review-focused braindump for the next agent: lists 10 high-leverage review angles in descending risk order, admits weak reasoning where I have it, scopes out the Phase B-G items so the reviewer doesn't flag them as missing
…or tip The AuthenticationError fallback in `LLMNode.exec_fallback` referenced `pflow settings env set`, which is not a real command. The correct name is `pflow settings set-env <KEY> <value>` — already used elsewhere in A.9's help text but missed in this site. Also enrich the NotFoundError tip with a pointer to LiteLLM's provider list so an agent that mistypes a model can find the valid model strings without trial-and-error. Updated the Interface docstring's `cost_usd` description to reflect the post-A.10 reality (cost is None when LiteLLM has no pricing data — there is no longer a pricing table to maintain).
…coverage Three regression-prone gaps surfaced during the Phase A code review: 1. Anthropic Opus 4.5 has BOTH thinking_effort AND thinking_budget. When a caller provides BOTH reasoning_effort AND reasoning_max_tokens, max_tokens wins (returns thinking_budget shape). Reordering the precedence dispatch in `map_reasoning_options` would silently degrade Opus 4.5 reasoning behavior — a test pins the contract. 2. `enrich_llm_usage_with_cost` has three branches (cost_usd preserved, total_cost_usd mirrored, neither → None). The total_cost_usd mirror path is the Claude Code SDK contract — it lost its dedicated test when `tests/test_core/test_llm_pricing.py` was deleted in A.10. Five tests now cover all three branches plus the `cost_usd=None` corner case (explicit None must not be overwritten by the mirror). 3. `smart_filter` flows Gemini-specific reasoning kwargs via `model_options` to disable thinking on filtering calls (otherwise gemini-3 burns tokens on a trivial classification task). No assertion existed on `mock_llm_client.call_history[-1]["model_options"]` — five parametrized tests pin the wiring across gemini-3-flash, gemini-2.5-pro, gemini-2.5-flash, gemini-2.5-flash-lite (no kwargs), and a non-Gemini control.
Phase A.10 deleted `pflow.core.llm_pricing` and Phase A.11 dropped the `llm`/`llm-anthropic`/`llm-gemini` packages, but several user-facing and agent-facing docs still referenced the removed code. The Phase A code review caught these. Doc cleanup across 10 files: User-facing - docs/changelog.mdx: add a Note that memo cache entries written before this upgrade lack `cost_usd`, so dry-run plans for cached nodes show `estimated_cost_usd: null` until the 24h TTL refreshes them. - docs/reference/cli/settings.mdx: replace stale `pflow settings llm show` output (5-tier resolution chain with the dead "llm CLI default" step) with the actual per-feature output. - src/pflow/cli/workflow_output.py + src/pflow/execution/formatters/ success_formatter.py: rephrase "model not in pricing table" warning to "pricing data missing for" — consistent across CLI and MCP paths per the formatter parity contract. Agent-facing - src/pflow/nodes/llm/llm.py: Interface docstring's `cost_usd` description (already corrected in commit 96f5f3d along with the typo fix; this commit only addresses non-llm.py files). - src/pflow/core/settings.py: `LLMSettings` docstring resolution chain drops the "llm CLI default" tier. - src/pflow/core/CLAUDE.md: directory tree adds `llm_client.py` and `llm_reasoning_map.py`, removes `llm_pricing.py`. - src/pflow/runtime/engine/CLAUDE.md: cross-module dependencies point at `core/llm_client.py::enrich_llm_usage_with_cost` instead of the deleted `llm_pricing.py`. - tests/CLAUDE.md: rewrite the autouse-fixture and LLM-mock-resolution sections to match post-A.8 reality (`mock_llm_client`, AdapterResponse shape, `cost_usd=None` default, `set_response(..., cost_usd=...)` pattern). Old text described the deleted `mock_llm_calls` fixture and the legacy `response.text()` callable contract. - tests/test_nodes/test_llm/TESTING.md: full rewrite. Old version told testers to install `llm-anthropic` plugin and run `llm keys set` — both gone. New version covers `pflow settings set-env` / env vars and notes that LiteLLM speaks 100+ providers natively (no plugin install). - tests/test_cli/test_direct_execution_helpers.py: small docstring refresh on the partial-cost warning test.
…asoning output
Two related improvements to the LLM adapter, surfaced by the Phase A
code review.
(1) Adapter raises typed LLMCallError instead of returning error-marked
AdapterResponse for deterministic provider errors
The adapter previously had a dual contract: AdapterResponse could be
either a successful response (status="ok", error=None) OR an error
marker (status="error", error="..."). Every consumer had to check
status before reading text/usage. Three of four production consumers
(registry/discovery × 2, registry/smart_filter) didn't check status —
they passed the empty-text response straight into parse_structured_response
and surfaced ValueError("LLM returned empty response") instead of the
real provider error message. Smart_filter's catch-all swallowed it as a
silent degradation.
The cleaner contract: AdapterResponse means "successful response", full
stop. Deterministic errors raise LLMCallError (new, in core/exceptions.py).
LLMNode catches LLMCallError at its single _call_llm seam to preserve
the PATTERN EXCEPTION semantics (Node retry loop must not burn three
attempts on a permanent failure). Discovery callers let it propagate.
Smart_filter's existing except-Exception still catches it — intentional
graceful degradation for smart filtering.
Net change: deletes the `error` and `status` fields from AdapterResponse,
removes the dual-contract conditional in LLMNode, and discovery callers
get correct error handling for free without any callsite modification.
The trace_hook still fires `after_call` with `error` set before the
raise, so traces capture the failure.
(2) Warn when a reasoning model consumes tokens but emits no visible text
The Phase 0 spike on gemini-3-flash-preview surfaced an opaque failure
mode: max_tokens too low for a reasoning model means the entire token
budget gets spent on internal thinking before any visible output is
emitted. The provider returns content=None with finish_reason=length;
the adapter normalizes this to text="" and pflow surfaces an empty
result with no clue why.
`_normalize` now logs a targeted warning when:
text == "" AND output_tokens > 0 AND finish_reason in {length, max_tokens}
No false positives because text="" on a normal successful call is
genuinely strange. The warning explains the likely cause and the fix
(increase max_tokens). Pre-existing UX issue; this surfaces it cleanly
without changing any behavior.
The trace_hook abstraction added in Phase A.6 was non-functional in production. The engine registered the collector against the main thread (`_active_collectors[main_thread_id] = collector`), but `LLMNode._call_llm` runs inside an inner `ThreadPoolExecutor(max_workers=1)`. The worker thread's id never matched the registered id, so `_active_trace_hook()` returned None on every adapter call and the `trace_hook` parameter was never invoked. Verified by smoke test against real Gemini-3: `event["llm_prompt"]` was missing from the trace JSON for every literal-prompt LLM node, and `pflow report` had no `## Prompt` section for those nodes as a result. Replace ~110 lines of dead infrastructure with the existing pflow shared-store pattern: - `WorkflowEngine.run()` save/restore `shared["_trace_collector"]` around the graph walk. Reuses the established `_trace_collector` shared key (already in `_PROPAGATED_KEYS`, already read by formatters at `success_formatter.py:64` etc., already installed by the runner at `runner.py:490`). Save/restore correctly handles nested sub-workflow runs in both `storage_mode: mapped` and `storage_mode: shared` — the child engine's installation is local to its run frame, then the parent's collector is reinstated. Write-back form (not `.pop()`) because shared may be a `NamespacedSharedStore` which doesn't expose `pop`; all consumers use `.get()` so writing `None` back is indistinguishable from absence. - `LLMNode.prep()` reads `shared["_trace_collector"]` and `self.node_id` (compiler-set dynamic attribute, see `compilation/compiler.py:299`), resolves the per-call hook via `collector.get_trace_hook(node_id)`, stores it in `prep_res["_trace_hook"]`. `LLMNode.exec()` captures the hook from `prep_res` BEFORE submitting `_call_llm` to the inner pool. The hook is then passed as an explicit arg through the thread boundary — no thread-id lookup, no global registry. DELETE: - `WorkflowTraceCollector._active_collectors`, `_thread_local`, `_llm_lock`, `_llm_interceptor_installed` class state - `WorkflowTraceCollector.register_for_llm_call` / `unregister_from_llm_call` methods - `WorkflowTraceCollector.enable_llm_interception` instance flag (and the line in `workflow_executor.py:343` that set it on child collectors — now unnecessary because save/restore handles inheritance) - `_active_trace_hook()` function in `nodes/llm/llm.py` - `register_for_llm_call(...)` wrapper in `instrumentation.py` + its call site in `engine.py::_execute_node` + import - `unregister_from_llm_call` cleanup in `runner.py::_cleanup` - `threading` and `ClassVar` imports in `workflow_trace.py`, `threading` import in `nodes/llm/llm.py` KEEP: - `WorkflowTraceCollector.get_trace_hook(node_id)` — same contract, now invoked from LLMNode.prep instead of the deleted `_active_trace_hook` - `WorkflowTraceCollector.llm_prompts` dict — same writer (the hook fires from inside the adapter), same reader (`_add_llm_data`) Free behavior fixes (intentional — the bug fix has user-visible improvements, no regressions): 1. `event["llm_prompt"]` now populates in trace JSON for every non-batch LLM call. `pflow report` gains the `## Prompt` section for those nodes (was empty for literal-prompt nodes due to thread-id mismatch in the trace_hook plumbing). 2. Sub-workflow LLM prompts now correctly land in the child collector's `llm_prompts` under the child node's id (was misrouted under the parent's WorkflowExecutor event when the trace_hook would have fired at all — which it didn't). 3. Cross-test stale-collector contamination eliminated (no class-level globals to leak across tests). Documentation updated: - `runtime/engine/CLAUDE.md` — lifecycle diagram + trace section - `runtime/CLAUDE.md` — WorkflowTraceCollector description Known limitation NOT addressed by this refactor: parallel batch LLM prompts continue to be keyed by the batch wrapper id (last-item-wins) — same as today; per-item prompt indexing is a separate concern that would require either writing prompts to `node_output` so `_capture_item_trace` picks them up, or keying `llm_prompts` by `(node_id, batch_idx)`. Not in scope for this commit. Plan: .taskmaster/tasks/task_158/implementation/plan-refactor-trace.md (also at ~/.claude/plans/magical-swinging-taco.md)
…orage_mode Companion to commit 96003f3. Four new tests pin the behavior the shared-store-seam refactor delivers; two old tests deleted because they covered class-level state that no longer exists. DELETED tests: - tests/test_runtime/test_workflow_trace.py::TestWorkflowTraceCollector ::test_enable_llm_interception_attribute (the flag is gone) - tests/test_runtime/test_workflow_trace.py::TestThreadLocalCurrentNode ::test_current_node_is_thread_local (the registry is gone) DELETED 4 setup lines in tests/test_runtime/test_trace_integration.py that assigned `collector.enable_llm_interception = False` (no longer needed; sub-workflow inheritance is handled by the engine.run save/ restore pattern). NEW tests in tests/test_runtime/test_trace_integration.py: - TestLLMTraceHookCapture::test_llm_prompt_captured_via_trace_hook Direct regression for the worker-thread-mismatch bug. Runs an LLMNode through the engine, asserts collector.llm_prompts[node_id] populates with the rendered prompt AND the trace event surfaces llm_prompt. This test would have caught the bug pre-Phase-A. - TestSubWorkflowTraceCollector ::test_sub_workflow_llm_prompt_in_child_collector Verifies sub-workflow LLM prompts land in the CHILD collector under the child node's id, not in the parent's WorkflowExecutor event. Confirms the engine.run save/restore correctly swaps in the child collector for the duration of the child run. - TestSubWorkflowTraceCollector ::test_storage_mode_shared_does_not_pollute_parent_collector Identity-check (`is`) that parent's collector is reinstated in shared["_trace_collector"] after a storage_mode=shared sub-workflow completes. Catches any case where save/restore returned an equal-but- different object. - TestParallelBatchSubWorkflowTrace ::test_each_batch_item_subworkflow_captures_own_llm_prompt The most complex nested case: parallel batch where each item is a sub-workflow containing an LLM call. Asserts each child collector captures its own per-item prompt (no cross-worker contamination). This is exactly where the previous thread-local design would have regressed; the per-worker item_shared + per-engine save/restore design handles it correctly. 5284 passed (was 5280 + 4 new); test_plan_drift.py 32 green; make check green.
…r all deterministic LiteLLM errors Bundles Phase A code-review item #6 (typed-error translation) with three deferred findings from the §31 review (#8 smart_filter narrowing, #9 parse_structured_response → LLMCallError, trace_report unpriced-cost propagation). Item #6 — typed exception hierarchy at the adapter seam: - Add UnknownModelError / MissingApiKeyError / InvalidRequestError as LLMCallError subclasses. UnknownModelError carries a structured `reason` attribute ("unknown_name" | "missing_prefix") so consumers branch on a typed contract rather than substring-matching message text. - Adapter catch tuple expanded from BadRequestError only to all 4 deterministic LiteLLM exception classes; new _classify_litellm_error helper does the translation in one place at the LiteLLM boundary. - Drop `import litellm.exceptions` from nodes/llm/llm.py — the architectural seal is now complete and grep-verifiable. - LLMNode._call_llm rewrites the catch chain into 3 typed branches with precise messages; exec_fallback shrinks from 4 isinstance branches (~50 lines) to a substring-matched timeout case + generic fallback (~15 lines). All 5 error paths populate a new error_class field surfaced to shared["error_class"] so JSON-mode consumers get a machine-parseable cause. Bundled deferred findings: - parse_structured_response now raises LLMCallError instead of ValueError, giving smart_filter and discovery callers a single typed contract for parsing failures. - smart_filter narrowed from `except Exception` to `except (LLMCallError, ConnectionError, TimeoutError, OSError)` so programming errors surface loudly while documented failure modes still degrade gracefully. - trace_report propagates `cost_usd: None` (unpriced models like Ollama / custom endpoints / brand-new releases) instead of silently coercing to 0.0 — reports show "—" for unpriced subtrees rather than misleading partial sums. Verification: 5301 tests green; ruff + ruff-format + mypy + deptry clean; `grep -rn 'import litellm\.exceptions' src/pflow/` returns exactly the one expected match in core/llm_client.py. Progress log §32 captures the discriminator-loss bug pattern caught by review-agent-ux, the dead-field bug pattern (error_class added but not surfaced), the autouse-mock skip-pattern testing trap, and the established conventions for any future agents extending this seam.
…eline Closes 3 critical + 7 high-value findings from Phase A code review #2 by completing the typed-exception architecture so error info flows from adapter → exception → diagnostic → JSON without duplication. Architectural changes (10 steps; see plan-review-revised plan for full rationale): - core/exceptions.py: model arg on LLMCallError; new LLMTransientError marker (Timeout/RateLimit/InternalServerError) and LLMResponseParseError (JSON parse); kind discriminator on MissingApiKeyError mirroring UnknownModelError.reason; to_diagnostics() override per subclass producing rich Diagnostics with structured context (category="llm_failure", error_class, model, reason/kind/provider_message) + remediation suggestions + see_also=["llm"]. Single source of truth for prose + structured info. - core/diagnostic.py: LLM_FAILURE_CATEGORY constant + CATEGORY_TITLES entries. Constant referenced by both override (pre-execution path) and _FAILURE_CATEGORY_MAP (runtime path) so they cannot drift. - core/llm_client.py: catch tuple extended from 4 to 7 LiteLLM exceptions; _classify_litellm_error wraps transients in LLMTransientError; every typed exception construction passes model=. AdapterResponse gains warnings: list[dict] field; _normalize emits structured warnings for the full empty-content finish_reason matrix (length/max_tokens reasoning vs non-reasoning, content_filter, stop, None; tool_calls intentionally silent). Replaces logger.warning emission with surfaced warnings that reach JSON output. Lint-driven extraction: _detect_empty_response_warnings. - core/llm_utils.py: parse_structured_response signature gains model= kwarg; raises LLMResponseParseError instead of bare LLMCallError. Three callers updated to thread model. - nodes/llm/llm.py: deleted _error_dict + _api_key_tip helpers (~50 lines); 3 typed-catch branches collapsed to one `except LLMCallError` (LLMNode consumes e.to_diagnostics()[0]; LLMTransientError re-raises so retry loop fires). New _propagate_error_to_shared single-seam mutation helper. post() reads adapter_response.warnings into shared['__warnings__']; JSON-parse path raises LLMResponseParseError. - execution/executor_service.py: new LLM branch in _enrich_error_from_node_output lifts _diagnostic_context into runtime Diagnostic (the dual-transport design that delivers the same context shape on both pre-execution and runtime paths). _FAILURE_CATEGORY_MAP gains 'llm_failure' entry. - runtime/node_state.py + runtime/engine/engine.py: FAILURE_CATEGORY_LLM constant; _NODE_TYPE_FAILURE_CATEGORY adds LLMNode → LLM. - registry/smart_filter.py: umbrella narrowed to (LLMCallError, ConnectionError, OSError) — now correctly catches transient errors that used to crash the discovery caller (verified at runtime: litellm Timeout doesn't inherit from builtin TimeoutError). - cli/find_errors.py: new LLMCallError branch using to_diagnostics() — discovery callers (find/find_workflow) get rich UX without duplicating remediation logic. - _trace_collector → __trace_collector__ rename (24 sites: 8 production + 14 tests + 2 CLAUDE.md + 1 load-bearing filter at workflow_trace.py:313). Aligns with __failures__/__warnings__/__progress_callback__ convention; routes to root deterministically via NamespacedSharedStore dunder bypass. - execution/formatters/success_formatter.py: top-level cost tri-state mirror (pricing_available/partial_cost_usd/unavailable_models). Lint- driven extraction: _mirror_pricing_tri_state. - core/llm_config.py: openai/gpt-5.2 fallback prefix (LiteLLM rejects bare names). get_model_not_configured_help text updated. - Doc/example sweep — 17 files: provider-prefix model names in mintlify docs, pflow guide, MCP agent instructions, example workflows. CLAUDE.md exception hierarchy + adapter section refreshed. CHANGELOG enumerates 3 new JSON-shape additions (category='llm_failure', error_class/model context fields, top-level cost tri-state, empty-response warnings via __warnings__). Verification: - 5306 tests pass; tests/test_execution/test_plan_drift.py 32/32 sacred parity tests green throughout every step. - make check clean (ruff, ruff-format, mypy, deptry). - Architectural seal intact: grep -rn 'import litellm.exceptions' src/pflow/ returns exactly 1 match (core/llm_client.py:35). Loose ends documented in scratchpads/task-158-phase-A-completion-loose-ends.md — most important gap is integration test coverage (the new contract that JSON output's errors[i].context carries error_class/model/reason/kind is verified at unit level only). Plan + plan-review docs at .taskmaster/tasks/task_158/implementation/phase-A-completion-plan.md.
Preserve structured LLM warnings through the runtime diagnostic pipeline and add end-to-end coverage for the LLM failure context contract. Made-with: Cursor
…plete()
Phase A regressed CLI startup from ~0.3s to ~1.2s on every pflow
invocation: a stray import of `enrich_llm_usage_with_cost` from
`llm_client.py` (which had `import litellm` at module top, ~700ms cold)
was sitting in `runtime/engine/batch_executor.py` — squarely in pflow's
eager startup chain. The ~30 subprocess tests in the test suite
multiplied the cost into a 20s → 86s test-suite regression.
Stage 1 — delete the obsolete helper:
- `enrich_llm_usage_with_cost` was already a no-op for new data:
ClaudeCodeNode mirrors `total_cost_usd → cost_usd` at the producer
(claude_code.py:883), LLMNode reads `cost_usd` from the adapter
response, and every consumer uses `.get("cost_usd")`. Helper deleted
along with its 4 callsites + the lazy-import workaround in metrics.py.
- Engine modules (`instrumentation.py`, `batch_executor.py`, `engine.py`)
no longer import from `llm_client.py`. Architectural rule encoded in
`runtime/engine/CLAUDE.md`: heavy-import modules at leaves of the
dependency graph.
Stage 2 — lazy-import litellm inside the adapter:
- Moved `import litellm` and `import litellm.exceptions` from module
top of `llm_client.py` into `complete()` and `_classify_litellm_error()`.
- The first `complete()` call in a process pays the ~700ms; subsequent
calls resolve from `sys.modules`. Paths that don't need litellm
(`pflow validate`, `--dry-run`, fully-cached LLM workflows, and the
future `analyze-cache`) skip the cost entirely.
- Side effect: the first LLM node's recorded duration includes the
one-time import. Documented in `core/CLAUDE.md`. Affects 1/N nodes
per workflow run; cost predictions are token-based and unaffected.
- Tests: 20 patches in `test_llm_client.py` rewritten from
`pflow.core.llm_client.litellm.completion` to `litellm.completion`
(the former path no longer exists as a module attribute).
Results:
- pflow --version: 1.17s → 0.29s
- pflow validate: 1.04s → 0.28s
- Non-LLM workflow run: 1.05s → 0.27s
- pflow --dry-run on LLM workflow: 1.08s → 0.28s
- Cached LLM workflow run: 1.12s → 0.30s
- make test (parallel): ~55s → 18-20s (faster than main)
- Tests passing: 5305; make check green
- litellm in eager CLI chain: 779 hits → 0 hits
…pty-response gate Closes the 5 findings from scratchpads/task-158-phase-a-code-review-20260425.md. Adapter seal (findings #1, #5): - Catch openai.OpenAIError in complete() instead of the enumerative tuple. Verified by introspection that litellm.exceptions.OpenAIError is a SIBLING class to LiteLLM's other exceptions, not their base — every litellm.X actually inherits from openai.X. Catching the OpenAI SDK base is what makes the seal structural. - Rewrite _classify_litellm_error to dispatch via openai.X for mirrored classes plus litellm.exceptions.X for litellm-only ones. New typed coverage: APIConnectionError, ServiceUnavailableError, BadGatewayError, LiteLLMUnknownProvider, APIResponseValidationError. Default branch treats unknown OpenAIError subclasses as deterministic InvalidRequestError (fail-fast over infinite-retry). - Add TestAdapterSealContract: 16 parametrized cases pinning the "every classified LiteLLM exception wraps to LLMCallError" contract, plus a synthetic-unknown subclass test for the safe default. - Add smart_filter regression test for APIConnectionError graceful degradation (the central reviewer reproduction). - Declare openai as a direct pyproject dep (was transitive via litellm). Empty-response warnings (finding #3): - Drop the output_tokens <= 0 short-circuit in _detect_empty_response_warnings — provider refusals (content_filter, stop, None) fire at zero token counts, and silencing them dropped exactly the signal the warning system exists for. Gate now: text presence + tool_calls carve-out only. - Delete test_no_warning_when_zero_tokens_and_empty (phantom case). - Add zero-token tests for content_filter, stop, None, length. LLMNode model fallback (finding #4 — internal): - Delete the silent "gemini-3-flash-preview" fallback in prep(). Production never hits it (compiler always injects via get_default_workflow_model or raises CompilationError). ~60 unit tests that bypass compilation now set "model" explicitly. User-facing examples (finding #2): - Prefix every model name in CLI docstrings (settings.py x3), LLMSettings docstring (settings.py), README "Available Models" + token-usage example, complete() docstring example, and 4 mintlify doc tables across docs/reference/cli/settings.mdx and docs/reference/nodes/llm.mdx (auto-detect defaults + 2 performance comparison tables + Model support table). README rewrite (finding #4 — README): - Replace invented "pflow llm --prompt=..." CLI examples with real .pflow.md workflow snippets demonstrating type: llm. Verification: - make test: 5325 passed - make check: ruff, ruff-format, mypy, deptry all green - tests/test_execution/test_plan_drift.py: 32 sacred parity tests green
Three fixes from end-to-end verification: 1. Auto-prefix bare model names at the adapter seam and at settings-write time. LiteLLM routes bare `gemini-3-flash-preview` to Vertex AI instead of Google AI Studio; `gemini/gemini-3-flash-preview` routes correctly. Settings CLI now normalizes with user feedback; adapter normalizes as safety net for workflow files. 2. Classify APIConnectionError from missing SDK installs (ImportError in __cause__ chain) as InvalidRequestError instead of LLMTransientError. Prevents 3 retries on a permanent failure. 3. Set LiteLLM logger to CRITICAL — the adapter's typed exception system is the single error surface; redundant ERROR tracebacks on stderr are suppressed. Bonus: add inject_settings_env_vars() to probe command (pre-existing gap).
… (159)
The original Task 158 ("Prompt Caching via LiteLLM + Declarative `## Cache`
Block") covered both the LiteLLM library migration and the caching feature.
After Phase A landed, the migration scope had grown into something
architecturally significant on its own (typed-exception architecture,
diagnostic pipeline integration, tracing redesign, perf fixes) — splitting
gives each side its own coherent narrative and PR review attention.
Task 158 (this branch) is now scoped to the migration only:
- spec rewritten to migration-focus, what/why with no implementation detail
- progress log gets a new §25 capturing the phased-implementation decision
that triggered the split; existing §26-§38 unchanged; renumbered duplicate
§35 to §36, then §36→§37, §37→§38; added header with quick-reference table
- CLAUDE.md marks Task 158 complete and lists Task 159 as next
Task 159 (new) is the prompt-caching feature:
- spec extracted from original 158, trimmed to caching-only with explicit
Task 158 prerequisite; migration-related Design Decisions marked as
"Resolved by Task 158"
- progress log inherits original design discussion (§1-§24) plus the
caching half of session §25 (8 ambiguities resolved, caching spec
revisions, batch-threshold deferred, pflow publish open assumption)
- braindump rewritten for post-migration reality: documents what's already
in place (adapter, typed exceptions, diagnostic pipeline, trace seam,
cost contract, mock infrastructure), what's still relevant for Phase B-G,
and what got resolved during Task 158
The CHANGELOG entry for the LiteLLM migration was deferred — Phase A ships without an Unreleased changelog block. Progress log updated to remove all references to docs/changelog.mdx changes so the file appears untouched in this branch.
…exception completion, reasoning-path consolidation Review pass after Task 158 Phase A: instead of patching individual review findings, fix the duplicated contracts they were symptoms of. Reduces overall code volume and prevents the same bug class from recurring. ## Provider metadata: one canonical registry `src/pflow/core/llm_providers.py` (new) — `ProviderInfo` dataclass + `PROVIDERS` registry + `detect_provider` / `normalize_model_name` / `model_name_without_provider`. Single source of truth consumed by: - `llm_client._normalize_model_name` and `_is_anthropic` - `llm_reasoning_map._detect_capabilities` (Anthropic / Gemini / OpenAI branches all use `detect_provider`) - `exceptions._derive_env_var_for_model` Adding a new provider/model family is one registry-line. Closes the o4-* drift (the o4 prefix existed only in `_BARE_MODEL_PREFIX_MAP`, missing from `_detect_capabilities` and `_derive_env_var_for_model`). Prefix matching is exact `startswith(provider_prefix)`, so `openrouter/anthropic/...` is no longer mis-classified as Anthropic. ## Typed exception architecture completed `LLMCallError` base now defines `diagnostic_context(**extra)` invariant — every subclass extends it (no more "InvalidRequestError remembered provider_message, MissingSdkError forgot it" failure mode). Subclasses that previously discarded raw upstream text now preserve it via the new `provider_message: str | None` field. - `LLMTransientError` gains `kind` discriminator (timeout / rate_limit / server_error / connection) mirroring the `UnknownModelError(reason)` and `MissingApiKeyError(kind)` pattern. `to_diagnostics()` override branches on kind for targeted suggestions instead of a generic "LLM Call Failed" diagnostic. - `_classify_litellm_error` now translates every transient class with the correct kind discriminator AND threads `provider_message=str(exc)` to every classification branch (8 sites including the unknown-OpenAI- subclass default branch). Captured at the seam, surfaced honestly through `diagnostic_context["provider_message"]`. - `LLMNode.exec_fallback` preserves the original transient kind as `transient_kind` after retry exhaustion so downstream consumers see both `kind="retry_exhausted"` and the underlying provider failure type. ## One reasoning path `_validate_model_options` (new) rejects reasoning-shaped keys (`thinking`, `thinking_budget`, `thinking_effort`, `reasoning_effort`, `reasoning_max_tokens`, `thinking_level`) in raw `model_options` — pflow has dedicated `reasoning_effort` / `reasoning_max_tokens` channels and letting `model_options` bypass them silently drops provider-specific translation. `_translate_reasoning_for_litellm` raises `InvalidRequestError` on Anthropic `thinking=True` without a budget (previously dropped silently). `smart_filter.py` rewrites: the bespoke gemini-3 / gemini-2.5 substring heuristics (which passed `thinking_level` / `thinking_budget` via `model_options` — exactly what `_validate_model_options` rejects) are replaced with `reasoning_kwargs=map_reasoning_options(filtering_model, "none", None, None)`. Routes through the canonical reasoning translator; provider detection lives in one place. `map_reasoning_options` `effort="none"` branch extended to handle `thinking_level` capability (Gemini 3 has no off-switch — "minimal" is the lowest equivalent). Latent gap independent of smart_filter: a user passing `reasoning_effort: none` to a Gemini 3 LLM node previously got default expensive thinking, silently. ## Other contract alignments - Empty-response `_detect_empty_response_warnings` no longer silently returns `[]` for unrecognized finish_reason values — adds a `kind="llm_empty_response_unrecognized_finish_reason"` warning. Aligns the implementation with the function's own documented invariant. - `parse_structured_response` raises `LLMResponseParseError` on Pydantic validation failure (previously logged at warning and returned the raw invalid dict, which let callers fail later with confusing KeyError/TypeError). Aligns with `LLMResponseParseError`'s docstring contract. - `_emit_trace` exception swallow logs at WARNING (was DEBUG); a trace-hook bug shouldn't break workflows but also shouldn't be invisible. - Dead-code: `CriticalDiscoveryError` removed (had no production raise sites; remaining CLI handler was unreachable; `LLMCallError`'s rich `to_diagnostics()` covers the discovery error UX). ## Tests 22 new test cases covering: - Provider registry contract (`test_llm_providers.py` new). - `o4-*` reasoning capability + env-var derivation regression guards. - Pydantic-validation failure raises `LLMResponseParseError`. - Smart_filter routes via `reasoning_kwargs`, never `model_options` (parametrized; `assert last_call["model_options"] is None` is the regression guard for the bypass). - `effort="none"` Gemini 3 → `thinking_level=minimal`. - `LLMTransientError` retry-exhaustion preserves `transient_kind`. - Anthropic `thinking=True` without budget raises. - `_validate_model_options` rejects reasoning keys (with `litellm.completion` patched directly, exercising the real validation pipeline). - Parametrized seal-contract test verifies `provider_message=str(exc)` is threaded through every classification branch. - End-to-end round-trip: real LiteLLM exception → seam → `_classify_litellm_error` → typed exception → LLMNode → runner → `result.errors[i].context["provider_message"]`. Pins the central UX promise across 5+ layers; a regression dropping `provider_message` in any layer fails this test loudly. ## Verification - `make test`: 5380 passed, 9 skipped (was 5325 pre-pass; +55 net). - `make check`: ruff, ruff-format, mypy, deptry — all green. - Architectural seal verified: `grep -rn 'litellm\.exceptions\|^import openai\|^from openai' src/pflow/` returns hits only in `core/llm_client.py` (the seam). ## Follow-up issues filed - #353 — typed sub-discriminators for `MissingApiKeyError` quota/suspended/region - #354 — enrich diagnostic suggestions from `provider_message` text - #355 — codify autouse-mock-bypass test pattern in `tests/CLAUDE.md` - #348 commented — `llm_providers.py` substantially closes the duplication-elimination half; remaining work is provider expansion gated on real demand. See `.taskmaster/tasks/task_158/implementation/progress-log.md` §39 §40 for the full session-by-session narrative including the smart_filter regression discovery and the discriminator-loss bug pattern that drove the `provider_message` field design.
`pflow find` and `pflow mcp find` invoke the LLM adapter via `find_workflow` / `find_components` but never called `inject_settings_env_vars()`. Users who stored API keys via `pflow settings set-env` (the recommended migration path post-Task 158) saw "API Key Missing or Insufficient" even though `pflow run` worked with the same env. Surfaced during independent end-to-end verification. Matches the §38 probe.py pattern: lazy import + single call before the LLM-using code path. 4 lines added across 2 files. Repro before fix: env -i HOME="$HOME" PATH="$PATH" pflow find "github" # Error: API Key Missing or Insufficient After fix: same command succeeds (settings inject as designed). Verification: - `make test`: 5380 passed - `make check`: ruff, ruff-format, mypy, deptry — all green - 249 targeted tests for find/probe/mcp pass
Comprehensive review of Task 158 (LiteLLM migration) optimized for future AI agents — especially the agent implementing Task 159 (caching) which builds on this substrate. Covers: integration points, shared store keys + routing rules, 10 architectural decisions with alternatives, 8 gotchas, 10 reusable patterns, 9 anti-patterns, breaking changes, extension points for caching, and 10 common pitfalls each tied to a real bug caught during the implementation review cycles. Companion to implementation/progress-log.md (the session-by-session narrative) and task-158.md (the spec).
… GOOGLE_API_KEY alias, _normalize envelope [skip review] Three correctness fixes from PR #356 review (claude[bot] warnings 1-3): 1. ProviderInfo.env_var (str) -> env_vars (tuple[str, ...]). Gemini now carries both GEMINI_API_KEY (canonical) and GOOGLE_API_KEY (the alias LiteLLM checks first in its Gemini auth path). MissingApiKeyError's missing_key remediation surfaces the canonical in the primary suggestion and lists aliases in a follow-up so users with only GOOGLE_API_KEY set see it. ctx field renamed env_var -> env_vars (list, canonical first). llm_config.PROVIDER_ENV_VARS now derives from the registry — single source of truth. 2. UnknownModelError.reason / MissingApiKeyError.kind / LLMTransientError.kind tightened from str to Literal[...]. Mypy immediately caught one untyped str return in _classify_transient_kind, which now returns LLMTransientKind. Typo at any raise site now fails type-check rather than silently falling through to a default branch with wrong remediation. 3. _normalize call in complete() wrapped in try/except (IndexError, AttributeError) -> LLMResponseParseError. Without this envelope, a malformed LiteLLM response (empty choices list, missing usage) would escape the typed-exception seal AND skip the after_call trace event, breaking both invariants the adapter owns. The wrap also fires after_call with error= so trace consumers preserve their begin/end-pair invariant. Tests: TestNormalizeMalformedResponse pins the new envelope (3 cases: empty choices, missing usage, after_call still fires). New test_gemini_missing_key_lists_both_aliases pins the dual-env-var remediation. Seal contract gains a connection-kind regression case (suggestion S6). test_provider_env_vars_canonical_first asserts the registry shape. 5386 passed, make check green (ruff, ruff-format, mypy, deptry).
…ip review] Two coverage gaps from PR #356 review (claude[bot] warnings 4-5): 1. tests/test_cli/test_lazy_imports.py — new file. Subprocess test asserts ``litellm`` does not enter ``sys.modules`` when only the pflow CLI entry point is imported. The ~700ms LiteLLM import cost is documented as a contract in core/llm_client.py's module docstring; this test catches a regression that hoists ``import litellm`` to module top or routes an engine module through llm_client.py. 2. tests/test_runtime/test_trace_integration.py — test_llm_failure_in_one_item_does_not_corrupt_siblings. Failure-path complement to the existing parallel-batch + sub-workflow + LLM success test. Patches the LLMNode binding so one item raises MissingApiKeyError; asserts each worker reports its own outcome, sibling success traces are intact, and the typed _diagnostic_context propagates from the typed exception through the worker boundary into the user-facing surface (either trace event or batch errors list). The §31 worker-thread bug had this exact shape — silent failure or sibling corruption — but only the success path was pinned. 5388 passed, make check green.
… artifacts [skip review] Three review-cleanup items from PR #356 (claude[bot] warnings 7-9): 1. pyproject.toml: openai gains a >=1.0 floor. The adapter catches openai.OpenAIError (the structural base of LiteLLM's exception hierarchy); a future major-version restructuring of openai would silently break the seal. uv.lock unchanged (current resolution is well above 1.0). 2. Docstring + mintlify docs for set-default / set-discovery / set-filtering now name the bare-name auto-prefix behavior. The _normalize_and_warn_model helper has been the contract for a while, but neither command help nor docs/reference/cli/settings.mdx surfaced it. A single Note block in settings.mdx covers all three commands; each command's own docstring gets a one-liner pointing at the same behavior. 3. scratchpads/task-158-spike/ removed from git tracking. The spike scripts and outputs were always intended as throwaway runnable documentation per phase-a-execution-guide.md and progress-log.md §27, but were inadvertently committed in a7d8c77. The 2.1 MB spike_4_output.txt also leaked a developer's macOS home path. The research guide updated to reflect reality (the artifacts no longer exist; §27 of the progress log is the canonical findings record). 5388 passed, make check green.
… unknown providers [skip review] Pre-existing UX bug surfaced during the PR #356 review discussion: a user calling ``together_ai/llama-3-70b`` (or any provider absent from pflow's registry) who hit an authentication error saw a misleading suggestion to ``export ANTHROPIC_API_KEY=...``. The hardcoded fallback literal in ``MissingApiKeyError.to_diagnostics()`` was always wrong for non-Anthropic providers — it just happened to be the registry's first entry, used as a placeholder. The fix splits the missing_key path into three branches: 1. Known provider (registry hit) — unchanged, precise canonical env var plus aliases (the GEMINI_API_KEY / GOOGLE_API_KEY case Tier A added). 2. Unknown provider with parseable prefix — defer to provider_message (the raw LiteLLM error text usually names the expected env var) plus a heuristic "<PROVIDER>_API_KEY" candidate framed as likely, NOT authoritative. The suggestion text explicitly disclaims multi-key cloud providers (AWS Bedrock, Azure, Vertex AI need additional credentials beyond a single API key). The heuristic is correct for the Mistral/Cohere/Groq/DeepSeek/Together/OpenRouter/Anyscale majority that follow the convention; the disclaimer covers the meaningful exceptions. 3. No parseable prefix — pure docs link, no fabricated env var. New ``extract_provider_prefix(model)`` primitive in llm_providers.py (provider parsing belongs there); the env-var derivation lives in exceptions.py as ``_likely_env_var_for_unknown_provider`` since it's specific to the diagnostic UX. Agents reading the JSON output get a new structured field ``context["likely_env_var"]`` when the heuristic fires (filtered out when None — the diagnostic_context base helper drops Nones already). Tests cover all three branches: TogetherAI (heuristic fires, no ANTHROPIC_API_KEY anywhere in suggestions), Bedrock (heuristic fires but suggestion disclaims multi-key providers explicitly), and the no-prefix case (pure docs link, no fabricated env var of any kind). 5393 passed, make check green.
…ow-up commits [skip review]
…t-state braindump [skip review] The two prior handovers (task-94-handover-01.md and -02.md, 622 lines combined) described the llm-library architecture that Task 158 demolished. Most of their guidance referenced infrastructure that no longer exists: _has_llm_key subprocess paths, llm models list, llm keys get, the pflow registry describe llm and pflow registry discover commands removed by Task 151, and the multi-source detection that checked llm CLI as a third tier. Replaced with a single ~200-line starting-context document that: - States explicitly what it supersedes and why - Captures the premise of the task (still valid) and the plural-models clarification from today's discussion - Maps what's gone (llm library, registry CLI commands, hardcoded defaults) vs what's new and infrastructure-ready (PROVIDERS registry, _has_provider_key, _likely_env_var_for_unknown_provider, MissingApiKeyError.to_diagnostics(), pflow settings llm show, the lazy-import contract) - Pulls forward the 5 insights from the originals that still apply (display-time check, don't enumerate 100+ providers, the user's "don't over-engineer" framing, PYTEST_CURRENT_TEST short-circuit, detection-vs-display split) - Surfaces the model-enumeration question (raw LiteLLM dump vs curated regex vs hybrid) as the central new design choice - Lists 6 open design questions explicitly NOT resolved - Updates file paths (the originals pointed at dead files) - Gives concrete first-steps guidance The new doc is a starting-context document, not a design spec — design decisions remain open for the agent picking up Task 94.
There was a problem hiding this comment.
Code Review
This pull request replaces the llm library with a pflow-owned LiteLLM adapter, introduces a typed exception hierarchy for LLM failures, and redesigns the tracing mechanism to use a shared-store save/restore seam. Feedback suggests that specific exception handling in the adapter may be redundant and recommends enhancing the warning propagation logic to support multiple warnings per call.
| # error LiteLLM raises in the call path pflow uses — both the | ||
| # known subclasses and any future additions. The architectural | ||
| # seal stays intact: consumers (LLMNode retry loop, smart_filter, | ||
| # discovery callers) catch the LLMCallError umbrella without ever |
There was a problem hiding this comment.
| node_id = getattr(self, "node_id", None) | ||
| if warnings_list and node_id is not None: | ||
| # In v1 the adapter emits at most one warning per call. If a | ||
| # future case needs multiple, change the contract to a list value. |
Code Review — PR #356 (Task 158:
|
Summary
Replaces Simon Willison's
llmlibrary with a pflow-owned adapter (src/pflow/core/llm_client.py) backed by LiteLLM, becoming the single seam for every LLM call in the codebase. Three architectural changes ship together because the migration touches all of them: a typed exception hierarchy with structured discriminators, a diagnostic pipeline integrated through a newllm_failurecategory, and a tracing redesign that replaces the brokenllm.get_modelmonkey-patch with ashared["__trace_collector__"]save/restore seam.This PR is pure substrate — caching syntax (the original motivating use case) is deferred to Task 159. The architectural work is independently valuable and warrants its own review attention.
Task
158
Changes
New core modules
core/llm_client.py— single seam for all LLM calls;complete()returnsAdapterResponsecore/llm_providers.py— canonical provider metadata registry (Anthropic / OpenAI / Gemini); single source of truth for model-name normalization, env-var derivation, reasoning capability detectioncore/llm_reasoning_map.py— provider-neutral reasoning capability detection +map_reasoning_options()Typed exception hierarchy in
core/exceptions.pyLLMCallError(PflowError)base + 6 subclasses withto_diagnostics()overrides:UnknownModelError(reason="unknown_name" | "missing_prefix")MissingApiKeyError(kind="missing_key" | "lacks_permission")LLMTransientError(kind="timeout" | "rate_limit" | "server_error" | "connection")MissingSdkError(package=...)InvalidRequestError,LLMResponseParseErrormodel+provider_message(raw upstream text, distinct from pflow framing)category="llm_failure", structured_diagnostic_contextpropagates from exception → shared store →__failures__→ExecutionResult.errors[i].contextTracing redesign
llm.get_modelmonkey-patch and the_active_collectors[thread_id]registry (broken in production — worker thread id never matched main thread id)shared["__trace_collector__"]save/restore aroundengine.run's graph walkprep(), capturestrace_hookBEFORE submitting to innerThreadPoolExecutorevent["llm_prompt"]now actually populates in trace JSONPerformance
import litellminsidecomplete()— keeps non-LLM CLI paths fast (pflow --version,--validate-only,--dry-run, cached runs)llm_client.py(architectural rule encoded inruntime/engine/CLAUDE.md)make testparallel: ~18-20s (was 55s+ mid-migration)Other
model_optionsrejects reasoning-shaped keys (forces use of canonicalreasoning_effort/reasoning_max_tokenschannels)_hidden_params["response_cost"]; deletedcore/llm_pricing.py(188 lines, 41 stale model entries)inject_settings_env_vars()added tofind.py+mcp.py::find_tools(audit gap caught in independent verification)1.82.6(1.83.x hard-pinsclick==8.1.8which downgrades pflow's click and breaks 3 tests)Dependencies
llm,llm-anthropic,llm-geminilitellm==1.82.6,openai(direct dep — adapter catchesopenai.OpenAIError, the structural base of LiteLLM's exception hierarchy)Explanation
Three reasons the migration shipped independently of the caching feature it was originally Phase 0 + Phase A of:
llm-anthropic==0.25blocks caching. The plugin'scache: boolonly markscache_controlon attachments / last prior user message. System prompts and first-turn user content (where the savings live) cannot be cached.llm.models.Optionsisextra="forbid"— no kwarg passthrough. Keepingllmand monkey-patching is fragile across plugin upgrades.cache_controlsyntax across Anthropic / Gemini / OpenAI; Ollama and other local-model runtimes still supported; provider pricing data is comprehensive (2,678 entries vs pflow's 41) and per-provider correct (pflow's table had Anthropic-only cache multipliers applied universally).The architectural seal is structural: the adapter catches
openai.OpenAIError(the OpenAI SDK base —litellm.exceptions.OpenAIErroris a sibling class, not a parent). Past the seal, consumers branch on structured exception attributes (reason,kind,transient_kind), never on message text.129 files changed, +30,156 / −5,535. Most of the addition is in
.taskmaster/tasks/task_158/documentation (progress log, task review, research notes, spike scripts), the LiteLLM-bundleduv.lockregeneration, and the new test files; the production diff is much smaller.Created Docs
.taskmaster/tasks/task_158/task-158.md— task spec.taskmaster/tasks/task_158/task-review.md— comprehensive review optimized for future agents (especially Task 159 / caching). Covers integration points, shared store keys + routing rules, 10 architectural decisions with alternatives, 8 gotchas, 10 reusable patterns, 9 anti-patterns, 10 common pitfalls each tied to a real bug caught during review cycles.taskmaster/tasks/task_158/implementation/progress-log.md— session-by-session narrative (§25–§41) including 3 rounds of code review, perf-fix pass, end-to-end verification, and the §41 verification-driveninject_settings_env_varsfixdocs/quickstart.mdx,docs/reference/cli/settings.mdx,docs/reference/nodes/llm.mdx,docs/reference/configuration.mdx,docs/how-it-works/{batch-processing,template-variables}.mdx,docs/guides/debugging.mdx,docs/roadmap.mdx— purged Simon Willisonllmlibrary /llm-anthropicplugin references; replaced with LiteLLM provider list andpflow settings set-envpathsTesting
Sacred parity tests (32) green at every commit:
tests/test_execution/test_plan_drift.py— planner ↔ runtime cost parity invariantAdapter contract tests (
tests/test_core/test_llm_client.py):TestAdapterSealContract.test_litellm_exception_wraps_to_typed_pflow_exception— parametrized over 16 LiteLLM exception classes; verifies every classified exception becomes a typedLLMCallErrorsubclassTestAdapterSealContract.test_seam_threads_provider_message_to_every_classification— every classification branch threadsprovider_message=str(exc)to the typed exceptionTestAdapterSealContract.test_unknown_litellm_subclass_wraps_safely— default fallback wraps unknown future classes asInvalidRequestError(deterministic, NOT transient — won't infinite-retry)test_real_litellm_exception_provider_message_reaches_runner_diagnostics— full pipeline: real exception → seam → LLMNode → runner →result.errors[i].context.provider_message(5+ layers; regression in any layer fails this test loudly)Trace seam regression guards (
tests/test_runtime/test_trace_integration.py):TestLLMTraceHookCapture— trace_hook fires across worker-thread boundary;event["llm_prompt"]populatesTestSubWorkflowTraceCollector— sub-workflow LLM prompts land in child collectorTestParallelBatchSubWorkflowTrace— parallel batch + sub-workflow trace isolationProvider/reasoning regression guards:
TestSmartFilterMinimizesReasoning— smart_filter routes viareasoning_kwargs(canonical channel), withassert last_call["model_options"] is Noneregression guardtest_disable_gemini_3—effort="none"on Gemini 3 maps tothinking_level=minimaltest_opus_45_max_tokens_takes_precedence_over_effort— Anthropic Opus 4.5thinking_effortprecedence preserved (load-bearing legacy invariant)End-to-end manual verification (real APIs across all 3 providers, ~$0.0001 spend):
gpt-4o-mini→openai/gpt-4o-mini)llm_promptpopulates end-to-end (verified via--reportmarkdown output)cost_usdfrom LiteLLM rolls up correctly throughtotal_cost_usd;pricing_availabletri-state correctcategory="llm_failure",error_class,model,reason/kind,provider_message(raw upstream text)pflow probe llmworks with all 9 stable usage keysbatch_items[i]has its ownllm_prompt)pflow find+pflow mcp findwork with API keys stored in pflow settings (the §41 fix)Architectural seal verified:
grep -rn 'litellm\.exceptions\|^import openai\|^from openai' src/pflow/returns hits only incore/llm_client.py(the seam)grep -rn '^import llm$\|^from llm import' src/pflow/ tests/returns zero hitspython -X importtime $(which pflow) --version 2>&1 | grep -c litellmreturns 0Final state:
make test5,380 passed,make checkclean (ruff, ruff-format, mypy, deptry).