Skip to content

refactor(llm): replace llm library with LiteLLM via pflow-owned adapter#356

Merged
spinje merged 41 commits into
mainfrom
feat/prompt-caching-lite-llm
Apr 26, 2026
Merged

refactor(llm): replace llm library with LiteLLM via pflow-owned adapter#356
spinje merged 41 commits into
mainfrom
feat/prompt-caching-lite-llm

Conversation

@spinje
Copy link
Copy Markdown
Owner

@spinje spinje commented Apr 26, 2026

Summary

Replaces Simon Willison's llm library with a pflow-owned adapter (src/pflow/core/llm_client.py) backed by LiteLLM, becoming the single seam for every LLM call in the codebase. Three architectural changes ship together because the migration touches all of them: a typed exception hierarchy with structured discriminators, a diagnostic pipeline integrated through a new llm_failure category, and a tracing redesign that replaces the broken llm.get_model monkey-patch with a shared["__trace_collector__"] save/restore seam.

This PR is pure substrate — caching syntax (the original motivating use case) is deferred to Task 159. The architectural work is independently valuable and warrants its own review attention.

Task

158

Changes

New core modules

  • core/llm_client.py — single seam for all LLM calls; complete() returns AdapterResponse
  • core/llm_providers.py — canonical provider metadata registry (Anthropic / OpenAI / Gemini); single source of truth for model-name normalization, env-var derivation, reasoning capability detection
  • core/llm_reasoning_map.py — provider-neutral reasoning capability detection + map_reasoning_options()

Typed exception hierarchy in core/exceptions.py

  • LLMCallError(PflowError) base + 6 subclasses with to_diagnostics() overrides:
    • UnknownModelError(reason="unknown_name" | "missing_prefix")
    • MissingApiKeyError(kind="missing_key" | "lacks_permission")
    • LLMTransientError(kind="timeout" | "rate_limit" | "server_error" | "connection")
    • MissingSdkError(package=...)
    • InvalidRequestError, LLMResponseParseError
  • Every instance carries model + provider_message (raw upstream text, distinct from pflow framing)
  • Diagnostic pipeline: category="llm_failure", structured _diagnostic_context propagates from exception → shared store → __failures__ExecutionResult.errors[i].context

Tracing redesign

  • Deleted the llm.get_model monkey-patch and the _active_collectors[thread_id] registry (broken in production — worker thread id never matched main thread id)
  • Replaced with shared["__trace_collector__"] save/restore around engine.run's graph walk
  • LLMNode reads in prep(), captures trace_hook BEFORE submitting to inner ThreadPoolExecutor
  • Free behavior fix: event["llm_prompt"] now actually populates in trace JSON

Performance

  • Lazy import litellm inside complete() — keeps non-LLM CLI paths fast (pflow --version, --validate-only, --dry-run, cached runs)
  • Engine modules MUST NOT import llm_client.py (architectural rule encoded in runtime/engine/CLAUDE.md)
  • CLI startup: 0.27s (was 1.17s mid-migration); make test parallel: ~18-20s (was 55s+ mid-migration)

Other

  • Bare model names auto-prefix in two layers: settings CLI at write time + adapter as safety net
  • model_options rejects reasoning-shaped keys (forces use of canonical reasoning_effort / reasoning_max_tokens channels)
  • Cost from LiteLLM's _hidden_params["response_cost"]; deleted core/llm_pricing.py (188 lines, 41 stale model entries)
  • inject_settings_env_vars() added to find.py + mcp.py::find_tools (audit gap caught in independent verification)
  • LiteLLM pinned at 1.82.6 (1.83.x hard-pins click==8.1.8 which downgrades pflow's click and breaks 3 tests)

Dependencies

  • Removed: llm, llm-anthropic, llm-gemini
  • Added: litellm==1.82.6, openai (direct dep — adapter catches openai.OpenAIError, the structural base of LiteLLM's exception hierarchy)

Explanation

Three reasons the migration shipped independently of the caching feature it was originally Phase 0 + Phase A of:

  1. llm-anthropic==0.25 blocks caching. The plugin's cache: bool only marks cache_control on attachments / last prior user message. System prompts and first-turn user content (where the savings live) cannot be cached. llm.models.Options is extra="forbid" — no kwarg passthrough. Keeping llm and monkey-patching is fragile across plugin upgrades.
  2. LiteLLM is the right substrate. Unified cache_control syntax across Anthropic / Gemini / OpenAI; Ollama and other local-model runtimes still supported; provider pricing data is comprehensive (2,678 entries vs pflow's 41) and per-provider correct (pflow's table had Anthropic-only cache multipliers applied universally).
  3. The architectural work is real. Typed exceptions, diagnostic pipeline, tracing redesign are independently valuable and warrant their own review attention before stacking caching on top.

The architectural seal is structural: the adapter catches openai.OpenAIError (the OpenAI SDK base — litellm.exceptions.OpenAIError is a sibling class, not a parent). Past the seal, consumers branch on structured exception attributes (reason, kind, transient_kind), never on message text.

129 files changed, +30,156 / −5,535. Most of the addition is in .taskmaster/tasks/task_158/ documentation (progress log, task review, research notes, spike scripts), the LiteLLM-bundled uv.lock regeneration, and the new test files; the production diff is much smaller.

Created Docs

  • .taskmaster/tasks/task_158/task-158.md — task spec
  • .taskmaster/tasks/task_158/task-review.md — comprehensive review optimized for future agents (especially Task 159 / caching). Covers integration points, shared store keys + routing rules, 10 architectural decisions with alternatives, 8 gotchas, 10 reusable patterns, 9 anti-patterns, 10 common pitfalls each tied to a real bug caught during review cycles
  • .taskmaster/tasks/task_158/implementation/progress-log.md — session-by-session narrative (§25–§41) including 3 rounds of code review, perf-fix pass, end-to-end verification, and the §41 verification-driven inject_settings_env_vars fix
  • User-facing docs updated: docs/quickstart.mdx, docs/reference/cli/settings.mdx, docs/reference/nodes/llm.mdx, docs/reference/configuration.mdx, docs/how-it-works/{batch-processing,template-variables}.mdx, docs/guides/debugging.mdx, docs/roadmap.mdx — purged Simon Willison llm library / llm-anthropic plugin references; replaced with LiteLLM provider list and pflow settings set-env paths

Testing

Sacred parity tests (32) green at every commit:

  • tests/test_execution/test_plan_drift.py — planner ↔ runtime cost parity invariant

Adapter contract tests (tests/test_core/test_llm_client.py):

  • TestAdapterSealContract.test_litellm_exception_wraps_to_typed_pflow_exception — parametrized over 16 LiteLLM exception classes; verifies every classified exception becomes a typed LLMCallError subclass
  • TestAdapterSealContract.test_seam_threads_provider_message_to_every_classification — every classification branch threads provider_message=str(exc) to the typed exception
  • TestAdapterSealContract.test_unknown_litellm_subclass_wraps_safely — default fallback wraps unknown future classes as InvalidRequestError (deterministic, NOT transient — won't infinite-retry)
  • test_real_litellm_exception_provider_message_reaches_runner_diagnostics — full pipeline: real exception → seam → LLMNode → runner → result.errors[i].context.provider_message (5+ layers; regression in any layer fails this test loudly)

Trace seam regression guards (tests/test_runtime/test_trace_integration.py):

  • TestLLMTraceHookCapture — trace_hook fires across worker-thread boundary; event["llm_prompt"] populates
  • TestSubWorkflowTraceCollector — sub-workflow LLM prompts land in child collector
  • TestParallelBatchSubWorkflowTrace — parallel batch + sub-workflow trace isolation

Provider/reasoning regression guards:

  • TestSmartFilterMinimizesReasoning — smart_filter routes via reasoning_kwargs (canonical channel), with assert last_call["model_options"] is None regression guard
  • test_disable_gemini_3effort="none" on Gemini 3 maps to thinking_level=minimal
  • test_opus_45_max_tokens_takes_precedence_over_effort — Anthropic Opus 4.5 thinking_effort precedence preserved (load-bearing legacy invariant)

End-to-end manual verification (real APIs across all 3 providers, ~$0.0001 spend):

  • Bare-name auto-prefix verified at workflow + adapter layers (gpt-4o-miniopenai/gpt-4o-mini)
  • Trace JSON llm_prompt populates end-to-end (verified via --report markdown output)
  • cost_usd from LiteLLM rolls up correctly through total_cost_usd; pricing_available tri-state correct
  • All 4 typed-exception paths produce structured JSON output with category="llm_failure", error_class, model, reason/kind, provider_message (raw upstream text)
  • pflow probe llm works with all 9 stable usage keys
  • Sub-workflow LLM cost rollup ($0.0000110 = child + parent verified)
  • Parallel batch LLM per-item prompt capture (B11 fix verified — each batch_items[i] has its own llm_prompt)
  • pflow find + pflow mcp find work with API keys stored in pflow settings (the §41 fix)

Architectural seal verified:

  • grep -rn 'litellm\.exceptions\|^import openai\|^from openai' src/pflow/ returns hits only in core/llm_client.py (the seam)
  • grep -rn '^import llm$\|^from llm import' src/pflow/ tests/ returns zero hits
  • python -X importtime $(which pflow) --version 2>&1 | grep -c litellm returns 0

Final state: make test 5,380 passed, make check clean (ruff, ruff-format, mypy, deptry).

spinje added 30 commits April 24, 2026 12:00
Phase A.1 of Task 158 — installs LiteLLM next to the existing llm/llm-anthropic/llm-gemini
trio without removing them. The pflow-owned adapter (A.3) will import litellm; old deps
get removed in A.11 once all callsites have migrated.

Pinned to 1.82.6 (not 1.83.7 as recommended in the Phase 0 spike): all 1.83.x releases
hard-pin click==8.1.8, which downgrades click from 8.3.1 and breaks 3 CliRunner tests
that depend on click 8.2+'s default stderr separation. 1.82.6 leaves click unconstrained.
Gemini PR #15226 (2025-10-07) is present in 1.82.6.

Temporary deptry ignore for litellm — adapter imports it in A.3.
Phase A.2 — `src/pflow/core/llm_reasoning_map.py`:
Replaces the live `model.Options.model_fields` introspection at the previous
`nodes/llm/llm.py:35-114`. LiteLLM has no equivalent contract, so capabilities
are detected by model-name string sniffing (same approach as
`registry/smart_filter.py:178`). Output shape preserves the legacy
llm-library kwarg shape (e.g. `{"thinking": True, "thinking_budget": N}`);
the adapter translates to LiteLLM-native shapes at the API boundary.
Critical: Anthropic Opus 4.5 thinking_effort precedence preserved.

Phase A.3 — `src/pflow/core/llm_client.py`:
Single seam for all LLM calls. Wraps `litellm.completion`. Public API:
`complete(...)` returns an `AdapterResponse` with stable `.text` (attribute,
not callable), `.usage` dict (matching keys downstream code already expects),
`.model`, `.has_schema`, plus `.error`/`.status` for the PATTERN EXCEPTION
case. Catches `BadRequestError` and returns an error-marked response so
the Node retry loop doesn't burn 3 attempts on permanent failures.

Cache-token normalization handles both Anthropic and Gemini/OpenAI paths
into one stable `usage` dict shape. Cost is read from LiteLLM's
`_hidden_params["response_cost"]` per Phase 0 outcome A.

`trace_hook` parameter is the seam that will replace the
`runtime/workflow_trace.py:520-599` monkey-patch in Phase A.6.

No production code is wired to either module yet; that's A.5 and A.7.

Tests: 60 in test_llm_reasoning_map.py + 35 in test_llm_client.py, all
mocked, no network.
Phase A.4 — test infrastructure:
* `tests/shared/llm_mock.py`: add `MockLLMClient` returning `AdapterResponse`.
  Shares `_DEFAULT_RESPONSES` table with the legacy `MockLLMModel`. Preserves
  the 500-char `call_history` truncation; adds parallel `call_history_full`
  for future cache-structure tests. `MockLLMModel` and `MockGetModel` are
  retained — both fixtures coexist during A.4-A.7 while callers migrate.
* `tests/conftest.py`: add `mock_llm_client` autouse fixture that patches
  `pflow.core.llm_client.complete` plus the consumer-module bindings.
  Same `/llm/` skip pattern as the legacy fixture.

Phase A.5 — LLMNode rewrite:
* `src/pflow/nodes/llm/llm.py`:
  - Drop `import llm` and `from pydantic import ValidationError`.
  - `_call_llm` calls `pflow.core.llm_client.complete` instead of
    `model.prompt`. The PATTERN EXCEPTION (deterministic BadRequestError →
    error-marked response, no retry burned) moves into the adapter; the
    branch in LLMNode now reads `adapter_response.status == "error"`.
  - `prep` builds `Attachment` (the new dataclass) instead of
    `llm.Attachment`.
  - `post` simplified — the adapter normalizes usage to a stable dict shape
    so the legacy object-with-.input/.output/.details path is gone.
  - `exec_fallback` uses `isinstance` on `litellm.exceptions.{Timeout,
    NotFoundError, AuthenticationError, BadRequestError}` instead of class-
    name string matching. User-facing error text adjusted minimally:
    `'llm models'` → `'pflow settings llm show'`,
    `'llm keys set <provider>'` → `'pflow settings env set'`.
  - `EFFORT_RATIOS`, `DEFAULT_MAX_TOKENS_BASE` re-exported from
    `pflow.core.llm_reasoning_map` for backward compatibility.
  - New `_active_trace_hook()` reads the active collector from
    `WorkflowTraceCollector._active_collectors[thread_id]` and returns
    `collector.get_trace_hook(node_id)` to pass to the adapter.
* `src/pflow/runtime/workflow_trace.py`: add `get_trace_hook(node_id)`
  returning a callable that captures `before_call.prompt` into
  `self.llm_prompts[node_id]` — same destination the monkey-patch writes
  to. `setup_llm_interception` / `cleanup_llm_interception` retained
  during A.5-A.6 (still active for the 3 discovery callsites until A.7).

Test migration:
* `tests/test_nodes/test_llm/test_llm.py`: 61 tests rewritten to use
  the new fixture + assert against `mock_llm_client.call_history`. Added
  new `TestReasoningEffortValidation` and `TestReasoningKwargsForwarded`
  classes (12 tests) to cover prep validation and kwarg forwarding.
* `tests/test_nodes/test_llm/test_llm_images.py`: 12 tests rewritten to
  assert against the new `Attachment` dataclass shape.
* `tests/test_nodes/test_llm/test_llm_reasoning.py`: superseded by
  `tests/test_core/test_llm_reasoning_map.py` + the new test_llm.py
  classes. Module-level `pytest.skip` so collection is clean; body
  retained pending user-approved deletion. Per-file F821/F401 ignore in
  pyproject.toml for the dead-code body.
* `tests/test_integration/test_metrics_integration.py`: `test_llm_cost_calculation`
  and `test_llm_accumulation_across_nodes` rewired to monkeypatch
  `pflow.nodes.llm.llm.complete` with custom AdapterResponse-returning
  functions (the per-test usage-token control they need).

Verification:
* `make test` → 5303 passed, 10 skipped, 0 failed
* `tests/test_execution/test_plan_drift.py` → 32 passed (sacred parity intact)
* `make check` → ruff/mypy/deptry all green

The mock intentionally OMITS `cost_usd` from the usage dict so
`enrich_llm_usage_with_cost` runs the existing pricing-from-tokens path
(several tests rely on historical-cost propagation through the memo cache).
…leanup (A.6+A.7+A.8)

Phase A.6 — tracing redesign:
* `src/pflow/runtime/workflow_trace.py`: deleted the
  `setup_llm_interception` monkey-patch (intercepted `llm.get_model` and
  every `model.prompt` instance); replaced with a thin per-thread
  registration step that the engine still calls before each LLM-capable
  node. The new path is the `trace_hook` parameter the LiteLLM adapter
  takes (added in A.3, wired by LLMNode in A.5).
* Dropped class-level state that supported the patch:
  `_llm_interception_count` and `_original_get_model`. Kept
  `_active_collectors`, `_thread_local`, and `_llm_lock` — LLMNode's
  `_active_trace_hook()` reads them to find the right collector.
* `cleanup_llm_interception` is now a thin unregister + idempotent guard.

Phase A.7 — three discovery callsites migrated:
* `src/pflow/registry/discovery.py::find_components`
* `src/pflow/registry/smart_filter.py::smart_filter_fields`
* `src/pflow/core/workflow/discovery.py::find_workflow`
* All three now call `pflow.core.llm_client.complete(...)` instead of
  `llm.get_model(...).prompt(...)`. Pydantic schema → JSON Schema dict
  via `Class.model_json_schema()` at the call site.
* Smart filter's Gemini thinking heuristics flow through `model_options`
  (LiteLLM accepts these top-level for Gemini).
* `parse_structured_response` already tolerated both the old callable
  `text()` and the new attribute `text` — no change needed.

Phase A.8 — mass test cleanup:
* `tests/conftest.py`: deleted the legacy `mock_llm_calls` autouse
  fixture (which patched `llm.get_model`) and the `mock_llm_responses`
  helper. `mock_llm_client` is now the sole LLM mock.
* `tests/shared/llm_mock.py`: docstring updated. `MockLLMModel` /
  `MockGetModel` / `create_mock_get_model` retained but marked OBSOLETE
  per user instruction (no file/symbol deletes during Phase A — they go
  on the end-of-task deletion checklist).
* `tests/test_integration/test_metrics_integration.py`:
  `test_trace_captures_llm_calls` migrated to monkeypatch
  `pflow.nodes.llm.llm.complete`. The bespoke `mock_llm` fixture
  removed (no remaining consumers); `test_cost_calculation_accuracy`
  signature trimmed.
* `tests/test_cli/test_dry_run.py`: dry-run "no LLM/HTTP calls" test
  now patches `pflow.nodes.llm.llm.complete` (the live binding).
* `tests/test_nodes/test_llm/test_llm_integration.py::has_openai_api_key`:
  rewritten to read `OPENAI_API_KEY` directly — the LiteLLM adapter
  resolves keys from env vars, no `llm` library introspection needed.
  Keeps the file collectable after A.11 removes the legacy package.
* Subagent migrated 5 test files (test_smart_filter, test_component_discovery,
  test_workflow_discovery, test_node_output_formatter,
  test_nested_workflow_cli) — all bulk renames `mock_llm_calls` →
  `mock_llm_client`. One regression-guard fix in
  `test_nested_workflow_cli.py` where the old assertion targeted the
  wrong mock.

* `pyproject.toml`: added `llm` to the temporary `DEP002` ignore — it
  remains in the lockfile until A.11 removes it from `dependencies`.

Verification:
* `make test` → 5303 passed, 10 skipped, 0 failed
* `tests/test_execution/test_plan_drift.py` → 32 passed (sacred parity intact)
* `make check` → ruff/mypy/deptry all green
* No production code references `llm` (the package). The two remaining
  references are: a docstring in `runtime/workflow_trace.py` (historical
  context for the deleted patch), and `tests/shared/llm_mock.py`'s
  obsolete-but-retained code block.

Pending deletion checklist (added to in this commit):
* `MockLLMModel` / `MockGetModel` / `create_mock_get_model` in
  `tests/shared/llm_mock.py` — unreferenced.
* The bespoke `mock_llm` fixture references in
  `tests/test_integration/test_metrics_integration.py` (already removed
  in this commit; checklist entry tracks the import line).
Removes the `llm keys get`/`llm models default` subprocess shells from
llm_config.py — LiteLLM (since A.5) reads keys from os.environ directly,
so the subprocess fallback is dead weight.

Production changes (`src/pflow/core/llm_config.py`):
- DROP `_has_llm_key()`, `get_llm_cli_default_model()`, `_get_validated_llm_path()`
- DROP `LLM_COMMAND`, `_LLM_KEYS_SUBCOMMAND` constants
- DROP `subprocess` and `shutil` imports (no callers remain)
- `_has_provider_key()` now only consults env vars + pflow settings (two sources)
- `_detect_default_model()` no longer skip-guards on PYTEST_CURRENT_TEST
  (the guard was there to prevent subprocess hangs; no subprocess to hang now)
- `get_default_workflow_model()` resolution: settings → auto-detect → None
- `get_llm_setup_help()` rewritten: env vars + `pflow settings set-env`
  instead of `llm keys set <provider>`
- `get_model_not_configured_help()`: removed `llm models default`,
  `llm models list`, `llm keys list` references; points at
  `pflow settings llm show` instead
- `inject_settings_env_vars()` UNCHANGED (still preserves the test guard
  to avoid polluting test env)

Help-text updates (`src/pflow/cli/commands/settings.py`):
- Settings group docstring: drop "via Simon Willison's llm tool"; show
  env-var + `pflow settings set-env` instead
- `_get_resolved_model()`: drop `get_llm_cli_default_model` import;
  default-model resolution now shows auto-detect, not "llm CLI default"
- `llm_show` resolution-order text and `llm_unset` "default" message
  updated to remove "llm CLI default" phrasing

Lint:
- Drop the `S603` per-file ignore for `llm_config.py` (no subprocess
  remains in the file)

Tests:
- `test_llm_config.py`: replace `test_pytest_environment_skips_detection`
  (tested the removed guard) with `test_detection_only_uses_env_and_settings`
- `test_llm_config_provider_detection.py`: replace
  `test_falls_back_to_llm_cli` with
  `test_returns_false_when_neither_env_nor_settings_have_key`; drop the
  PYTEST_CURRENT_TEST bypass dance from `_detect_default_model` tests
- `test_llm_config_workflow_model.py`: drop `TestGetLlmCliDefaultModel`
  class entirely (function gone); drop the 3 `TestGetDefaultWorkflowModel`
  tests that exercised the llm-CLI fallback chain; rewrite
  `TestGetModelNotConfiguredHelp` assertions to expect `pflow settings`
  guidance and reject the removed `llm models default`/`llm keys list`
  text
- `test_compiler_llm_model.py`: same help-text assertion update

Verification:
- `make test` — 5294 passed, 0 failed (9 fewer tests = removed dead llm-CLI
  subprocess tests)
- `tests/test_execution/test_plan_drift.py` — 32 passed (sacred parity intact)
- `make check` — ruff, ruff-format, mypy, deptry all green
- Manual: `pflow settings llm show` and `pflow settings --help` produce
  sensible output with no llm-CLI references
Phase 0 spike confirmed Outcome A: LiteLLM's per-model pricing data
covers every model pflow knows about (2,678 models vs pflow's 41) and
fixes the per-provider cache-multiplier bugs in pflow's table (e.g.
OpenAI cache_read should be 0.5×, pflow had 0.1× universally — 80% off).
The Phase 0 spike's live regression test on Gemini PR #15226 confirmed
the double-count fix is present in the pinned LiteLLM version.

Production:
- DELETE `src/pflow/core/llm_pricing.py` (188 lines: MODEL_PRICING,
  MODEL_ALIASES, calculate_llm_cost, get_model_pricing, the inline
  enrich_llm_usage_with_cost computation).
- MOVE `enrich_llm_usage_with_cost` to `pflow.core.llm_client` as a
  10-line wrapper. Single responsibility: ensure `cost_usd` key is
  present (preserve when set; mirror Claude Code SDK's
  `total_cost_usd`; otherwise None). No pricing math.
- Production callers (`metrics.py`, `nodes/llm/llm.py`,
  `runtime/engine/instrumentation.py`,
  `runtime/engine/batch_executor.py`) updated to import from the new
  location.
- `core/__init__.py` docstring updated: `llm_client` is the public
  surface now, `llm_pricing` no longer referenced.

Mock contract change (option (c) — the right end-state, not the
least-disruptive migration):

`MockLLMClient` previously OMITTED `cost_usd` so the deleted
`enrich_llm_usage_with_cost` could compute it from MODEL_PRICING. With
that path gone, the honest end-state is:

- Default response: `cost_usd: None` in usage dict — matches production
  behavior when LiteLLM has no pricing data (custom endpoints, brand-new
  models, Ollama).
- `set_response()` accepts an optional `cost_usd=` kwarg so tests that
  care about specific values can declare them upfront. Self-documenting:
  reading the test tells you what cost the test expects.

This rejected option (a) `cost_usd: 0.0` (still a fiction; production
never returns 0.0 for a real call) and option (b) `cost_usd =
input_tokens * X + output_tokens * Y` (a fake formula that future test
authors would mistake for a real computation).

Tests updated (15 affected, all the failures from the production
deletion):

- `test_metrics.py`: rewrote 4 tests that exercised pricing math —
  `test_aggregation_sums_explicit_cost_usd` (sums injected cost_usd),
  `test_summary_generation_with_workflow_metrics` (injects cost_usd),
  `test_missing_fields_in_llm_usage_handled_gracefully` (verifies the
  new "no cost_usd → unavailable" semantic with a partial-cost entry),
  `test_cost_rounding_to_six_decimal_places` (rounds the sum).
  Removed `test_cost_calculation_for_different_models` — it tested the
  4-model pricing table directly.
- `test_metrics_thinking_cache.py`: removed
  `test_thinking_cost_calculation` and `test_cache_cost_calculation`
  (tested pricing math); updated `test_no_thinking_or_cache_tokens` to
  inject cost_usd; dropped now-unused `pytest` import.
- `test_metrics_integration.py`: updated `test_llm_cost_calculation` and
  `test_llm_accumulation_across_nodes` to set cost_usd on inline mock
  usage dicts and assert exact sums (more rigorous than the prior `> 0`
  check). Removed `test_cost_calculation_accuracy` — it directly tested
  the deleted MODEL_PRICING table.
- `test_unknown_model_user_experience.py`: renamed
  `test_mixed_models_shows_partial_cost_clearly` to
  `test_mixed_priced_unpriced_shows_partial_cost_clearly` (the relevant
  split is now "has cost_usd vs doesn't", not "known vs unknown
  model"). Also updated obsolete "Update MODEL_PRICING" comment.
- `test_instrumented_wrapper.py`: rewrote
  `test_llm_usage_enriched_with_cost_in_shared_store` and
  `test_trace_event_has_cost_usd` to verify the new contract (cost_usd
  preserved when set, None when not), instead of pinning gpt-4 dollar
  values from the deleted pricing table.
- `test_plan_drift.py::test_plan_cost_nested_rollup` now takes the
  `mock_llm_client` fixture and sets an explicit cost — assertions
  pin the exact value through the rollup chain (sharper than
  the prior `> 0` checks).
- `test_llm_pricing.py`: pytest.skip'd at module level (matches the
  Phase A.5 `test_llm_reasoning.py` pattern). Body retained pending
  user-approved deletion. Per-file F821/F401 ignore added.

Documentation:
- `core/CLAUDE.md`: replaced the `llm_pricing.py` section with a new
  `llm_client.py` section that documents the cost-from-LiteLLM
  contract. Removed the "46+ models" / "Broken aliases" claims.
  Trimmed `llm_config.py` section to remove the deleted llm-CLI default
  resolution step.

Verification:
- `make test` — 5266 passed, 0 failed (1 fewer than A.9 baseline =
  removed `test_cost_calculation_accuracy`)
- `tests/test_execution/test_plan_drift.py` — 32 passed (sacred
  parity invariant intact)
- `make check` — ruff, ruff-format, mypy, deptry all green

Deletion checklist additions (pending user approval):
- `tests/test_core/test_llm_pricing.py` (entire file body) — already
  pytest.skip'd at module level
- The temporary F821/F401 per-file ignore for `test_llm_pricing.py`
  in `pyproject.toml`
Production code finished migrating to the LiteLLM adapter in A.7; the
last test/fixture references were removed in A.8. The llm trio (and
its temporary DEP002 ignore entries) can now be dropped.

- pyproject.toml: remove `llm>=0.29`, `llm-anthropic==0.25`,
  `llm-gemini>=0.30` from `[project] dependencies`. Reduce
  `[tool.deptry.per_rule_ignores] DEP002` to just `["PyYAML"]`
  (PyYAML is loaded via lazy import in the markdown parser).
  Also remove the obsolete commented-out `[project.optional-dependencies]`
  stub.
- uv.lock: regenerated via `uv sync`. llm, llm-anthropic, llm-gemini,
  and their transitive deps (ijson, puremagic, sqlite-utils, etc.)
  uninstalled.

Verification:
- `uv pip list | grep -iE '^(llm|llm-)'` → empty (only `litellm` present)
- `grep -rn 'import llm$|from llm import' src/pflow/ tests/` → zero hits
- `make test` — 5266 passed, 0 failed
- `make check` — ruff, ruff-format, mypy, deptry all green
Sweeps the user-facing docs and bundled MCP agent instructions for the
LiteLLM migration. Adds a changelog entry describing the swap, the
provider-coverage upgrade, and the breaking change for users storing
keys via Simon Willison's `llm` CLI.

User-facing docs:
- `docs/quickstart.mdx`: drop the now-invalid "If you already have llm
  installed and configured" tip; replace the "Simon Willison's llm"
  pointer with LiteLLM's provider list. Add a tip showing the env-var
  path works too.
- `docs/reference/nodes/llm.mdx`: rewrite the intro and replace the
  "Extending with plugins" section (which described `llm-openrouter` /
  `llm-ollama` plugin install) with a simpler "Other providers" section
  noting that LiteLLM speaks 100+ providers natively. OpenRouter and
  Ollama get concrete examples without plugin install. Update the error-
  table guidance to point at `pflow settings llm show` and the LiteLLM
  provider list.
- `docs/reference/cli/settings.mdx`: drop the "If you use Simon
  Willison's llm" alternative; show the env-var path. Remove the now-
  invalid `llm models default` step from the model-resolution chain.
- `docs/roadmap.mdx`: update the "Unified model support" current-status
  line to credit LiteLLM and note the 100+ provider count.

Bundled MCP instructions:
- `mcp-agent-instructions.md`: rewrite the "For LLM providers" block —
  use `pflow settings set-env` and shell env vars instead of the deleted
  `llm keys set <provider>`. Update the cheatsheet's `llm keys set` line
  to `pflow settings llm show`.
- `mcp-sandbox-agent-instructions.md`: same treatment for the
  sandboxed-agent guidance.

Node README:
- `src/pflow/nodes/llm/README.md`: rewrite the installation section
  end-to-end. Provider keys are now set via `pflow settings set-env` or
  shell env vars; OpenRouter and Ollama are first-class without plugin
  install. Drop the dead `pip install llm-anthropic` / `llm-gpt4all` /
  `pflow[all-llms]` instructions. Update the token-usage example to show
  `cost_usd`. Update error-handling guidance.

Changelog:
- New `<Update>` block at the top labeled "Unreleased" documenting the
  swap. Highlights: native 100+ provider support, current pricing data.
  Breaking changes: env vars / `pflow settings set-env` only (no
  automatic pickup of `llm keys set` storage); `llm`/`llm-anthropic`/
  `llm-gemini` packages no longer pflow dependencies. Accordion
  enumerating what stays the same so users don't worry their workflows
  will need rewriting.

Intentionally retained:
- `src/pflow/core/llm_reasoning_map.py` module docstring's reference to
  Simon Willison's `llm` library — this is historical context that
  explains *why* the file exists (LiteLLM doesn't expose the `Options`
  introspection contract the original code relied on). Removing the
  reference would erase the design rationale.
- `docs/changelog.mdx` historical entry for the original `llm`
  integration — that's a record of what was true at release time.

Verification:
- `make test` — 5266 passed, 0 failed
- `make check` — ruff, ruff-format, mypy, deptry all green
- `grep -rln 'llm keys|llm models|Simon Willison' src/pflow/ docs/` —
  only the two intentional retentions remain
Five items the user explicitly deferred during Phase A so they could
review before we cut anything. All approved.

Deletes (items 1-3):
- DELETE `tests/test_nodes/test_llm/test_llm_reasoning.py` — entire file.
  Module was pytest.skip'd since A.5; coverage migrated to
  `tests/test_core/test_llm_reasoning_map.py` (60 tests on the new
  `map_reasoning_options` API) plus `TestReasoningEffortValidation` and
  `TestReasoningKwargsForwarded` in `tests/test_nodes/test_llm/test_llm.py`.
- DELETE `tests/test_core/test_llm_pricing.py` — entire file. Module was
  pytest.skip'd since A.10; the production `pflow.core.llm_pricing`
  module was deleted in that step. Pricing is LiteLLM's responsibility
  now.
- DELETE `MockLLMModel`, `MockGetModel`, `create_mock_get_model` from
  `tests/shared/llm_mock.py`. Unreferenced since A.8 — they patched the
  `llm.get_model` seam that no longer exists. Kept the shared
  `_DEFAULT_RESPONSES` table and `_schema_name()` helper (still used by
  `MockLLMClient`). Updated `tests/shared/README.md` to document the
  new state.

Config (item 4):
- Drop the two `[tool.ruff.lint.per-file-ignores]` entries for
  `test_llm_reasoning.py` (F821, F401) and `test_llm_pricing.py`
  (F821, F401) — added during A.5 and A.10 to silence linter warnings
  on the dead-code bodies retained inside the skipped files. Both
  files are gone, so the ignores are gone too.

Rename (item 5):
- `WorkflowTraceCollector.setup_llm_interception` →
  `register_for_llm_call`
- `WorkflowTraceCollector.cleanup_llm_interception` →
  `unregister_from_llm_call`
- Module-level engine wrapper `setup_llm_interception(...)` →
  `register_for_llm_call(...)` in `runtime/engine/instrumentation.py`
- Caller updates: `runtime/engine/engine.py:45,293` (import + call site)
  and `execution/runner.py:708-712` (cleanup hasattr check + call)
- Docstring touch-ups in `nodes/llm/llm.py` (`_active_trace_hook`),
  `runtime/workflow_trace.py` (`get_trace_hook` + the renamed methods'
  docstrings), and `runtime/engine/CLAUDE.md` (lifecycle diagram +
  registration entry)

Why rename: A.6 collapsed the prior global `llm.get_model` /
`model.prompt` monkey-patch (with reference-counted lazy install/
teardown) into a thin per-thread registration step. The
"interception" name described what the methods did before A.6 — it
misleads now. The new names describe the actual work: the engine
"registers" each LLM-capable node before it runs so the adapter's
trace_hook can find the right collector at call time, and the
runner "unregisters" the collector at cleanup.

Behavior unchanged. The class-level state (`_active_collectors`,
`_thread_local`, `_llm_lock`, `enable_llm_interception` flag, etc.)
keeps the same semantics — only the method names change.

Verification:
- `make test` — 5266 passed, 0 skipped, 0 failed
- `tests/test_execution/test_plan_drift.py` — 32/32 sacred parity tests green
- `make check` — ruff, ruff-format, mypy, deptry all green
- Two retained references (`workflow_trace.py:566` historical context
  in the renamed method's docstring; `test_metrics_integration.py:21`
  obsolete-fixture comment) are intentional — they document why the
  current code looks the way it does.
Capture the A.9-A.12 + end-of-task cleanup work in the progress log and
hand off a review-focused braindump for the next agent.

Progress log §30 covers:
- A.9: llm CLI subprocess paths dropped, with the spec deviation
  (`pflow settings set-env`, NOT `pflow settings env set` as the prior
  braindumps assumed)
- A.10: the cost-mock decision the user pushed back on — option (c)
  with `cost_usd: None` default, tests inject explicit costs
- A.11: drop the llm trio from pyproject
- A.12: docs pass + CHANGELOG migration entry, with the two
  intentional retentions (llm_reasoning_map.py docstring, historical
  changelog entry) documented
- Migration of user's llm-stored keys into pflow settings
- Real-API smoke test on Gemini-3-flash-preview ($0.0005), with the
  pre-existing reasoning-model UX issue surfaced and flagged
- End-of-task cleanup: 5 items processed (2 file deletes, MockLLMModel
  family removed, 2 lint ignores dropped, methods renamed)
- Final state at handoff and loose-ends acknowledged but not blockers

Braindump replacement:
- DELETE the now-obsolete A9-A12 handoff (that work is done)
- NEW review-focused braindump for the next agent: lists 10
  high-leverage review angles in descending risk order, admits weak
  reasoning where I have it, scopes out the Phase B-G items so the
  reviewer doesn't flag them as missing
…or tip

The AuthenticationError fallback in `LLMNode.exec_fallback` referenced
`pflow settings env set`, which is not a real command. The correct name
is `pflow settings set-env <KEY> <value>` — already used elsewhere in
A.9's help text but missed in this site.

Also enrich the NotFoundError tip with a pointer to LiteLLM's provider
list so an agent that mistypes a model can find the valid model strings
without trial-and-error.

Updated the Interface docstring's `cost_usd` description to reflect the
post-A.10 reality (cost is None when LiteLLM has no pricing data — there
is no longer a pricing table to maintain).
…coverage

Three regression-prone gaps surfaced during the Phase A code review:

1. Anthropic Opus 4.5 has BOTH thinking_effort AND thinking_budget.
   When a caller provides BOTH reasoning_effort AND reasoning_max_tokens,
   max_tokens wins (returns thinking_budget shape). Reordering the
   precedence dispatch in `map_reasoning_options` would silently degrade
   Opus 4.5 reasoning behavior — a test pins the contract.

2. `enrich_llm_usage_with_cost` has three branches (cost_usd preserved,
   total_cost_usd mirrored, neither → None). The total_cost_usd mirror
   path is the Claude Code SDK contract — it lost its dedicated test
   when `tests/test_core/test_llm_pricing.py` was deleted in A.10. Five
   tests now cover all three branches plus the `cost_usd=None` corner
   case (explicit None must not be overwritten by the mirror).

3. `smart_filter` flows Gemini-specific reasoning kwargs via
   `model_options` to disable thinking on filtering calls (otherwise
   gemini-3 burns tokens on a trivial classification task). No
   assertion existed on `mock_llm_client.call_history[-1]["model_options"]`
   — five parametrized tests pin the wiring across gemini-3-flash,
   gemini-2.5-pro, gemini-2.5-flash, gemini-2.5-flash-lite (no kwargs),
   and a non-Gemini control.
Phase A.10 deleted `pflow.core.llm_pricing` and Phase A.11 dropped the
`llm`/`llm-anthropic`/`llm-gemini` packages, but several user-facing and
agent-facing docs still referenced the removed code. The Phase A code
review caught these.

Doc cleanup across 10 files:

User-facing
- docs/changelog.mdx: add a Note that memo cache entries written before
  this upgrade lack `cost_usd`, so dry-run plans for cached nodes show
  `estimated_cost_usd: null` until the 24h TTL refreshes them.
- docs/reference/cli/settings.mdx: replace stale `pflow settings llm
  show` output (5-tier resolution chain with the dead "llm CLI default"
  step) with the actual per-feature output.
- src/pflow/cli/workflow_output.py + src/pflow/execution/formatters/
  success_formatter.py: rephrase "model not in pricing table" warning
  to "pricing data missing for" — consistent across CLI and MCP paths
  per the formatter parity contract.

Agent-facing
- src/pflow/nodes/llm/llm.py: Interface docstring's `cost_usd`
  description (already corrected in commit 96f5f3d along with the typo
  fix; this commit only addresses non-llm.py files).
- src/pflow/core/settings.py: `LLMSettings` docstring resolution chain
  drops the "llm CLI default" tier.
- src/pflow/core/CLAUDE.md: directory tree adds `llm_client.py` and
  `llm_reasoning_map.py`, removes `llm_pricing.py`.
- src/pflow/runtime/engine/CLAUDE.md: cross-module dependencies point
  at `core/llm_client.py::enrich_llm_usage_with_cost` instead of the
  deleted `llm_pricing.py`.
- tests/CLAUDE.md: rewrite the autouse-fixture and LLM-mock-resolution
  sections to match post-A.8 reality (`mock_llm_client`, AdapterResponse
  shape, `cost_usd=None` default, `set_response(..., cost_usd=...)`
  pattern). Old text described the deleted `mock_llm_calls` fixture and
  the legacy `response.text()` callable contract.
- tests/test_nodes/test_llm/TESTING.md: full rewrite. Old version told
  testers to install `llm-anthropic` plugin and run `llm keys set` —
  both gone. New version covers `pflow settings set-env` / env vars and
  notes that LiteLLM speaks 100+ providers natively (no plugin install).
- tests/test_cli/test_direct_execution_helpers.py: small docstring
  refresh on the partial-cost warning test.
…asoning output

Two related improvements to the LLM adapter, surfaced by the Phase A
code review.

(1) Adapter raises typed LLMCallError instead of returning error-marked
AdapterResponse for deterministic provider errors

The adapter previously had a dual contract: AdapterResponse could be
either a successful response (status="ok", error=None) OR an error
marker (status="error", error="..."). Every consumer had to check
status before reading text/usage. Three of four production consumers
(registry/discovery × 2, registry/smart_filter) didn't check status —
they passed the empty-text response straight into parse_structured_response
and surfaced ValueError("LLM returned empty response") instead of the
real provider error message. Smart_filter's catch-all swallowed it as a
silent degradation.

The cleaner contract: AdapterResponse means "successful response", full
stop. Deterministic errors raise LLMCallError (new, in core/exceptions.py).
LLMNode catches LLMCallError at its single _call_llm seam to preserve
the PATTERN EXCEPTION semantics (Node retry loop must not burn three
attempts on a permanent failure). Discovery callers let it propagate.
Smart_filter's existing except-Exception still catches it — intentional
graceful degradation for smart filtering.

Net change: deletes the `error` and `status` fields from AdapterResponse,
removes the dual-contract conditional in LLMNode, and discovery callers
get correct error handling for free without any callsite modification.
The trace_hook still fires `after_call` with `error` set before the
raise, so traces capture the failure.

(2) Warn when a reasoning model consumes tokens but emits no visible text

The Phase 0 spike on gemini-3-flash-preview surfaced an opaque failure
mode: max_tokens too low for a reasoning model means the entire token
budget gets spent on internal thinking before any visible output is
emitted. The provider returns content=None with finish_reason=length;
the adapter normalizes this to text="" and pflow surfaces an empty
result with no clue why.

`_normalize` now logs a targeted warning when:
  text == "" AND output_tokens > 0 AND finish_reason in {length, max_tokens}

No false positives because text="" on a normal successful call is
genuinely strange. The warning explains the likely cause and the fix
(increase max_tokens). Pre-existing UX issue; this surfaces it cleanly
without changing any behavior.
The trace_hook abstraction added in Phase A.6 was non-functional in
production. The engine registered the collector against the main thread
(`_active_collectors[main_thread_id] = collector`), but `LLMNode._call_llm`
runs inside an inner `ThreadPoolExecutor(max_workers=1)`. The worker
thread's id never matched the registered id, so `_active_trace_hook()`
returned None on every adapter call and the `trace_hook` parameter was
never invoked. Verified by smoke test against real Gemini-3:
`event["llm_prompt"]` was missing from the trace JSON for every
literal-prompt LLM node, and `pflow report` had no `## Prompt` section
for those nodes as a result.

Replace ~110 lines of dead infrastructure with the existing pflow
shared-store pattern:

- `WorkflowEngine.run()` save/restore `shared["_trace_collector"]` around
  the graph walk. Reuses the established `_trace_collector` shared key
  (already in `_PROPAGATED_KEYS`, already read by formatters at
  `success_formatter.py:64` etc., already installed by the runner at
  `runner.py:490`). Save/restore correctly handles nested sub-workflow
  runs in both `storage_mode: mapped` and `storage_mode: shared` — the
  child engine's installation is local to its run frame, then the
  parent's collector is reinstated. Write-back form (not `.pop()`)
  because shared may be a `NamespacedSharedStore` which doesn't expose
  `pop`; all consumers use `.get()` so writing `None` back is
  indistinguishable from absence.

- `LLMNode.prep()` reads `shared["_trace_collector"]` and `self.node_id`
  (compiler-set dynamic attribute, see `compilation/compiler.py:299`),
  resolves the per-call hook via `collector.get_trace_hook(node_id)`,
  stores it in `prep_res["_trace_hook"]`. `LLMNode.exec()` captures the
  hook from `prep_res` BEFORE submitting `_call_llm` to the inner pool.
  The hook is then passed as an explicit arg through the thread
  boundary — no thread-id lookup, no global registry.

DELETE:
- `WorkflowTraceCollector._active_collectors`, `_thread_local`,
  `_llm_lock`, `_llm_interceptor_installed` class state
- `WorkflowTraceCollector.register_for_llm_call` /
  `unregister_from_llm_call` methods
- `WorkflowTraceCollector.enable_llm_interception` instance flag (and
  the line in `workflow_executor.py:343` that set it on child
  collectors — now unnecessary because save/restore handles inheritance)
- `_active_trace_hook()` function in `nodes/llm/llm.py`
- `register_for_llm_call(...)` wrapper in `instrumentation.py` + its call
  site in `engine.py::_execute_node` + import
- `unregister_from_llm_call` cleanup in `runner.py::_cleanup`
- `threading` and `ClassVar` imports in `workflow_trace.py`,
  `threading` import in `nodes/llm/llm.py`

KEEP:
- `WorkflowTraceCollector.get_trace_hook(node_id)` — same contract,
  now invoked from LLMNode.prep instead of the deleted `_active_trace_hook`
- `WorkflowTraceCollector.llm_prompts` dict — same writer (the hook
  fires from inside the adapter), same reader (`_add_llm_data`)

Free behavior fixes (intentional — the bug fix has user-visible
improvements, no regressions):

1. `event["llm_prompt"]` now populates in trace JSON for every non-batch
   LLM call. `pflow report` gains the `## Prompt` section for those
   nodes (was empty for literal-prompt nodes due to thread-id mismatch
   in the trace_hook plumbing).
2. Sub-workflow LLM prompts now correctly land in the child collector's
   `llm_prompts` under the child node's id (was misrouted under the
   parent's WorkflowExecutor event when the trace_hook would have fired
   at all — which it didn't).
3. Cross-test stale-collector contamination eliminated (no class-level
   globals to leak across tests).

Documentation updated:
- `runtime/engine/CLAUDE.md` — lifecycle diagram + trace section
- `runtime/CLAUDE.md` — WorkflowTraceCollector description

Known limitation NOT addressed by this refactor: parallel batch LLM
prompts continue to be keyed by the batch wrapper id (last-item-wins) —
same as today; per-item prompt indexing is a separate concern that
would require either writing prompts to `node_output` so
`_capture_item_trace` picks them up, or keying `llm_prompts` by
`(node_id, batch_idx)`. Not in scope for this commit.

Plan: .taskmaster/tasks/task_158/implementation/plan-refactor-trace.md
(also at ~/.claude/plans/magical-swinging-taco.md)
…orage_mode

Companion to commit 96003f3. Four new tests pin the behavior the
shared-store-seam refactor delivers; two old tests deleted because they
covered class-level state that no longer exists.

DELETED tests:
- tests/test_runtime/test_workflow_trace.py::TestWorkflowTraceCollector
  ::test_enable_llm_interception_attribute (the flag is gone)
- tests/test_runtime/test_workflow_trace.py::TestThreadLocalCurrentNode
  ::test_current_node_is_thread_local (the registry is gone)

DELETED 4 setup lines in tests/test_runtime/test_trace_integration.py
that assigned `collector.enable_llm_interception = False` (no longer
needed; sub-workflow inheritance is handled by the engine.run save/
restore pattern).

NEW tests in tests/test_runtime/test_trace_integration.py:

- TestLLMTraceHookCapture::test_llm_prompt_captured_via_trace_hook
  Direct regression for the worker-thread-mismatch bug. Runs an LLMNode
  through the engine, asserts collector.llm_prompts[node_id] populates
  with the rendered prompt AND the trace event surfaces llm_prompt.
  This test would have caught the bug pre-Phase-A.

- TestSubWorkflowTraceCollector
  ::test_sub_workflow_llm_prompt_in_child_collector
  Verifies sub-workflow LLM prompts land in the CHILD collector under
  the child node's id, not in the parent's WorkflowExecutor event.
  Confirms the engine.run save/restore correctly swaps in the child
  collector for the duration of the child run.

- TestSubWorkflowTraceCollector
  ::test_storage_mode_shared_does_not_pollute_parent_collector
  Identity-check (`is`) that parent's collector is reinstated in
  shared["_trace_collector"] after a storage_mode=shared sub-workflow
  completes. Catches any case where save/restore returned an equal-but-
  different object.

- TestParallelBatchSubWorkflowTrace
  ::test_each_batch_item_subworkflow_captures_own_llm_prompt
  The most complex nested case: parallel batch where each item is a
  sub-workflow containing an LLM call. Asserts each child collector
  captures its own per-item prompt (no cross-worker contamination).
  This is exactly where the previous thread-local design would have
  regressed; the per-worker item_shared + per-engine save/restore design
  handles it correctly.

5284 passed (was 5280 + 4 new); test_plan_drift.py 32 green; make
check green.
…r all deterministic LiteLLM errors

Bundles Phase A code-review item #6 (typed-error translation) with three
deferred findings from the §31 review (#8 smart_filter narrowing, #9
parse_structured_response → LLMCallError, trace_report unpriced-cost
propagation).

Item #6 — typed exception hierarchy at the adapter seam:
- Add UnknownModelError / MissingApiKeyError / InvalidRequestError as
  LLMCallError subclasses. UnknownModelError carries a structured
  `reason` attribute ("unknown_name" | "missing_prefix") so consumers
  branch on a typed contract rather than substring-matching message text.
- Adapter catch tuple expanded from BadRequestError only to all 4
  deterministic LiteLLM exception classes; new _classify_litellm_error
  helper does the translation in one place at the LiteLLM boundary.
- Drop `import litellm.exceptions` from nodes/llm/llm.py — the
  architectural seal is now complete and grep-verifiable.
- LLMNode._call_llm rewrites the catch chain into 3 typed branches
  with precise messages; exec_fallback shrinks from 4 isinstance
  branches (~50 lines) to a substring-matched timeout case + generic
  fallback (~15 lines). All 5 error paths populate a new error_class
  field surfaced to shared["error_class"] so JSON-mode consumers get
  a machine-parseable cause.

Bundled deferred findings:
- parse_structured_response now raises LLMCallError instead of
  ValueError, giving smart_filter and discovery callers a single
  typed contract for parsing failures.
- smart_filter narrowed from `except Exception` to
  `except (LLMCallError, ConnectionError, TimeoutError, OSError)` so
  programming errors surface loudly while documented failure modes
  still degrade gracefully.
- trace_report propagates `cost_usd: None` (unpriced models like
  Ollama / custom endpoints / brand-new releases) instead of silently
  coercing to 0.0 — reports show "—" for unpriced subtrees rather
  than misleading partial sums.

Verification: 5301 tests green; ruff + ruff-format + mypy + deptry
clean; `grep -rn 'import litellm\.exceptions' src/pflow/` returns
exactly the one expected match in core/llm_client.py.

Progress log §32 captures the discriminator-loss bug pattern caught by
review-agent-ux, the dead-field bug pattern (error_class added but not
surfaced), the autouse-mock skip-pattern testing trap, and the
established conventions for any future agents extending this seam.
…eline

Closes 3 critical + 7 high-value findings from Phase A code review #2 by
completing the typed-exception architecture so error info flows from
adapter → exception → diagnostic → JSON without duplication.

Architectural changes (10 steps; see plan-review-revised plan for full
rationale):

- core/exceptions.py: model arg on LLMCallError; new LLMTransientError
  marker (Timeout/RateLimit/InternalServerError) and LLMResponseParseError
  (JSON parse); kind discriminator on MissingApiKeyError mirroring
  UnknownModelError.reason; to_diagnostics() override per subclass
  producing rich Diagnostics with structured context (category="llm_failure",
  error_class, model, reason/kind/provider_message) + remediation
  suggestions + see_also=["llm"]. Single source of truth for prose +
  structured info.
- core/diagnostic.py: LLM_FAILURE_CATEGORY constant + CATEGORY_TITLES
  entries. Constant referenced by both override (pre-execution path) and
  _FAILURE_CATEGORY_MAP (runtime path) so they cannot drift.
- core/llm_client.py: catch tuple extended from 4 to 7 LiteLLM exceptions;
  _classify_litellm_error wraps transients in LLMTransientError; every
  typed exception construction passes model=. AdapterResponse gains
  warnings: list[dict] field; _normalize emits structured warnings for the
  full empty-content finish_reason matrix (length/max_tokens reasoning
  vs non-reasoning, content_filter, stop, None; tool_calls intentionally
  silent). Replaces logger.warning emission with surfaced warnings that
  reach JSON output. Lint-driven extraction: _detect_empty_response_warnings.
- core/llm_utils.py: parse_structured_response signature gains model=
  kwarg; raises LLMResponseParseError instead of bare LLMCallError. Three
  callers updated to thread model.
- nodes/llm/llm.py: deleted _error_dict + _api_key_tip helpers (~50 lines);
  3 typed-catch branches collapsed to one `except LLMCallError` (LLMNode
  consumes e.to_diagnostics()[0]; LLMTransientError re-raises so retry
  loop fires). New _propagate_error_to_shared single-seam mutation
  helper. post() reads adapter_response.warnings into shared['__warnings__'];
  JSON-parse path raises LLMResponseParseError.
- execution/executor_service.py: new LLM branch in
  _enrich_error_from_node_output lifts _diagnostic_context into runtime
  Diagnostic (the dual-transport design that delivers the same context
  shape on both pre-execution and runtime paths). _FAILURE_CATEGORY_MAP
  gains 'llm_failure' entry.
- runtime/node_state.py + runtime/engine/engine.py: FAILURE_CATEGORY_LLM
  constant; _NODE_TYPE_FAILURE_CATEGORY adds LLMNode → LLM.
- registry/smart_filter.py: umbrella narrowed to (LLMCallError,
  ConnectionError, OSError) — now correctly catches transient errors that
  used to crash the discovery caller (verified at runtime: litellm Timeout
  doesn't inherit from builtin TimeoutError).
- cli/find_errors.py: new LLMCallError branch using to_diagnostics() —
  discovery callers (find/find_workflow) get rich UX without duplicating
  remediation logic.
- _trace_collector → __trace_collector__ rename (24 sites: 8 production +
  14 tests + 2 CLAUDE.md + 1 load-bearing filter at workflow_trace.py:313).
  Aligns with __failures__/__warnings__/__progress_callback__ convention;
  routes to root deterministically via NamespacedSharedStore dunder bypass.
- execution/formatters/success_formatter.py: top-level cost tri-state
  mirror (pricing_available/partial_cost_usd/unavailable_models). Lint-
  driven extraction: _mirror_pricing_tri_state.
- core/llm_config.py: openai/gpt-5.2 fallback prefix (LiteLLM rejects
  bare names). get_model_not_configured_help text updated.
- Doc/example sweep — 17 files: provider-prefix model names in mintlify
  docs, pflow guide, MCP agent instructions, example workflows. CLAUDE.md
  exception hierarchy + adapter section refreshed. CHANGELOG enumerates 3
  new JSON-shape additions (category='llm_failure', error_class/model
  context fields, top-level cost tri-state, empty-response warnings via
  __warnings__).

Verification:
- 5306 tests pass; tests/test_execution/test_plan_drift.py 32/32 sacred
  parity tests green throughout every step.
- make check clean (ruff, ruff-format, mypy, deptry).
- Architectural seal intact: grep -rn 'import litellm.exceptions' src/pflow/
  returns exactly 1 match (core/llm_client.py:35).

Loose ends documented in
scratchpads/task-158-phase-A-completion-loose-ends.md — most important
gap is integration test coverage (the new contract that JSON output's
errors[i].context carries error_class/model/reason/kind is verified at
unit level only). Plan + plan-review docs at
.taskmaster/tasks/task_158/implementation/phase-A-completion-plan.md.
Preserve structured LLM warnings through the runtime diagnostic pipeline and add end-to-end coverage for the LLM failure context contract.

Made-with: Cursor
…plete()

Phase A regressed CLI startup from ~0.3s to ~1.2s on every pflow
invocation: a stray import of `enrich_llm_usage_with_cost` from
`llm_client.py` (which had `import litellm` at module top, ~700ms cold)
was sitting in `runtime/engine/batch_executor.py` — squarely in pflow's
eager startup chain. The ~30 subprocess tests in the test suite
multiplied the cost into a 20s → 86s test-suite regression.

Stage 1 — delete the obsolete helper:
- `enrich_llm_usage_with_cost` was already a no-op for new data:
  ClaudeCodeNode mirrors `total_cost_usd → cost_usd` at the producer
  (claude_code.py:883), LLMNode reads `cost_usd` from the adapter
  response, and every consumer uses `.get("cost_usd")`. Helper deleted
  along with its 4 callsites + the lazy-import workaround in metrics.py.
- Engine modules (`instrumentation.py`, `batch_executor.py`, `engine.py`)
  no longer import from `llm_client.py`. Architectural rule encoded in
  `runtime/engine/CLAUDE.md`: heavy-import modules at leaves of the
  dependency graph.

Stage 2 — lazy-import litellm inside the adapter:
- Moved `import litellm` and `import litellm.exceptions` from module
  top of `llm_client.py` into `complete()` and `_classify_litellm_error()`.
- The first `complete()` call in a process pays the ~700ms; subsequent
  calls resolve from `sys.modules`. Paths that don't need litellm
  (`pflow validate`, `--dry-run`, fully-cached LLM workflows, and the
  future `analyze-cache`) skip the cost entirely.
- Side effect: the first LLM node's recorded duration includes the
  one-time import. Documented in `core/CLAUDE.md`. Affects 1/N nodes
  per workflow run; cost predictions are token-based and unaffected.
- Tests: 20 patches in `test_llm_client.py` rewritten from
  `pflow.core.llm_client.litellm.completion` to `litellm.completion`
  (the former path no longer exists as a module attribute).

Results:
- pflow --version: 1.17s → 0.29s
- pflow validate: 1.04s → 0.28s
- Non-LLM workflow run: 1.05s → 0.27s
- pflow --dry-run on LLM workflow: 1.08s → 0.28s
- Cached LLM workflow run: 1.12s → 0.30s
- make test (parallel): ~55s → 18-20s (faster than main)
- Tests passing: 5305; make check green
- litellm in eager CLI chain: 779 hits → 0 hits
…pty-response gate

Closes the 5 findings from scratchpads/task-158-phase-a-code-review-20260425.md.

Adapter seal (findings #1, #5):
- Catch openai.OpenAIError in complete() instead of the enumerative tuple.
  Verified by introspection that litellm.exceptions.OpenAIError is a
  SIBLING class to LiteLLM's other exceptions, not their base — every
  litellm.X actually inherits from openai.X. Catching the OpenAI SDK
  base is what makes the seal structural.
- Rewrite _classify_litellm_error to dispatch via openai.X for mirrored
  classes plus litellm.exceptions.X for litellm-only ones. New typed
  coverage: APIConnectionError, ServiceUnavailableError, BadGatewayError,
  LiteLLMUnknownProvider, APIResponseValidationError. Default branch
  treats unknown OpenAIError subclasses as deterministic
  InvalidRequestError (fail-fast over infinite-retry).
- Add TestAdapterSealContract: 16 parametrized cases pinning the
  "every classified LiteLLM exception wraps to LLMCallError" contract,
  plus a synthetic-unknown subclass test for the safe default.
- Add smart_filter regression test for APIConnectionError graceful
  degradation (the central reviewer reproduction).
- Declare openai as a direct pyproject dep (was transitive via litellm).

Empty-response warnings (finding #3):
- Drop the output_tokens <= 0 short-circuit in
  _detect_empty_response_warnings — provider refusals (content_filter,
  stop, None) fire at zero token counts, and silencing them dropped
  exactly the signal the warning system exists for. Gate now: text
  presence + tool_calls carve-out only.
- Delete test_no_warning_when_zero_tokens_and_empty (phantom case).
- Add zero-token tests for content_filter, stop, None, length.

LLMNode model fallback (finding #4 — internal):
- Delete the silent "gemini-3-flash-preview" fallback in prep().
  Production never hits it (compiler always injects via
  get_default_workflow_model or raises CompilationError). ~60 unit tests
  that bypass compilation now set "model" explicitly.

User-facing examples (finding #2):
- Prefix every model name in CLI docstrings (settings.py x3),
  LLMSettings docstring (settings.py), README "Available Models" +
  token-usage example, complete() docstring example, and 4 mintlify
  doc tables across docs/reference/cli/settings.mdx and
  docs/reference/nodes/llm.mdx (auto-detect defaults + 2 performance
  comparison tables + Model support table).

README rewrite (finding #4 — README):
- Replace invented "pflow llm --prompt=..." CLI examples with real
  .pflow.md workflow snippets demonstrating type: llm.

Verification:
- make test: 5325 passed
- make check: ruff, ruff-format, mypy, deptry all green
- tests/test_execution/test_plan_drift.py: 32 sacred parity tests green
Three fixes from end-to-end verification:

1. Auto-prefix bare model names at the adapter seam and at settings-write
   time. LiteLLM routes bare `gemini-3-flash-preview` to Vertex AI instead
   of Google AI Studio; `gemini/gemini-3-flash-preview` routes correctly.
   Settings CLI now normalizes with user feedback; adapter normalizes as
   safety net for workflow files.

2. Classify APIConnectionError from missing SDK installs (ImportError in
   __cause__ chain) as InvalidRequestError instead of LLMTransientError.
   Prevents 3 retries on a permanent failure.

3. Set LiteLLM logger to CRITICAL — the adapter's typed exception system
   is the single error surface; redundant ERROR tracebacks on stderr are
   suppressed.

Bonus: add inject_settings_env_vars() to probe command (pre-existing gap).
… (159)

The original Task 158 ("Prompt Caching via LiteLLM + Declarative `## Cache`
Block") covered both the LiteLLM library migration and the caching feature.
After Phase A landed, the migration scope had grown into something
architecturally significant on its own (typed-exception architecture,
diagnostic pipeline integration, tracing redesign, perf fixes) — splitting
gives each side its own coherent narrative and PR review attention.

Task 158 (this branch) is now scoped to the migration only:
- spec rewritten to migration-focus, what/why with no implementation detail
- progress log gets a new §25 capturing the phased-implementation decision
  that triggered the split; existing §26-§38 unchanged; renumbered duplicate
  §35 to §36, then §36→§37, §37→§38; added header with quick-reference table
- CLAUDE.md marks Task 158 complete and lists Task 159 as next

Task 159 (new) is the prompt-caching feature:
- spec extracted from original 158, trimmed to caching-only with explicit
  Task 158 prerequisite; migration-related Design Decisions marked as
  "Resolved by Task 158"
- progress log inherits original design discussion (§1-§24) plus the
  caching half of session §25 (8 ambiguities resolved, caching spec
  revisions, batch-threshold deferred, pflow publish open assumption)
- braindump rewritten for post-migration reality: documents what's already
  in place (adapter, typed exceptions, diagnostic pipeline, trace seam,
  cost contract, mock infrastructure), what's still relevant for Phase B-G,
  and what got resolved during Task 158
spinje added 11 commits April 26, 2026 12:00
The CHANGELOG entry for the LiteLLM migration was deferred — Phase A
ships without an Unreleased changelog block. Progress log updated to
remove all references to docs/changelog.mdx changes so the file
appears untouched in this branch.
…exception completion, reasoning-path consolidation

Review pass after Task 158 Phase A: instead of patching individual review
findings, fix the duplicated contracts they were symptoms of. Reduces
overall code volume and prevents the same bug class from recurring.

## Provider metadata: one canonical registry

`src/pflow/core/llm_providers.py` (new) — `ProviderInfo` dataclass +
`PROVIDERS` registry + `detect_provider` / `normalize_model_name` /
`model_name_without_provider`. Single source of truth consumed by:

- `llm_client._normalize_model_name` and `_is_anthropic`
- `llm_reasoning_map._detect_capabilities` (Anthropic / Gemini / OpenAI
  branches all use `detect_provider`)
- `exceptions._derive_env_var_for_model`

Adding a new provider/model family is one registry-line. Closes the o4-*
drift (the o4 prefix existed only in `_BARE_MODEL_PREFIX_MAP`, missing
from `_detect_capabilities` and `_derive_env_var_for_model`). Prefix
matching is exact `startswith(provider_prefix)`, so
`openrouter/anthropic/...` is no longer mis-classified as Anthropic.

## Typed exception architecture completed

`LLMCallError` base now defines `diagnostic_context(**extra)` invariant —
every subclass extends it (no more "InvalidRequestError remembered
provider_message, MissingSdkError forgot it" failure mode). Subclasses
that previously discarded raw upstream text now preserve it via the new
`provider_message: str | None` field.

- `LLMTransientError` gains `kind` discriminator (timeout / rate_limit /
  server_error / connection) mirroring the `UnknownModelError(reason)`
  and `MissingApiKeyError(kind)` pattern. `to_diagnostics()` override
  branches on kind for targeted suggestions instead of a generic
  "LLM Call Failed" diagnostic.
- `_classify_litellm_error` now translates every transient class with
  the correct kind discriminator AND threads `provider_message=str(exc)`
  to every classification branch (8 sites including the unknown-OpenAI-
  subclass default branch). Captured at the seam, surfaced honestly
  through `diagnostic_context["provider_message"]`.
- `LLMNode.exec_fallback` preserves the original transient kind as
  `transient_kind` after retry exhaustion so downstream consumers see
  both `kind="retry_exhausted"` and the underlying provider failure
  type.

## One reasoning path

`_validate_model_options` (new) rejects reasoning-shaped keys
(`thinking`, `thinking_budget`, `thinking_effort`, `reasoning_effort`,
`reasoning_max_tokens`, `thinking_level`) in raw `model_options` —
pflow has dedicated `reasoning_effort` / `reasoning_max_tokens` channels
and letting `model_options` bypass them silently drops provider-specific
translation. `_translate_reasoning_for_litellm` raises
`InvalidRequestError` on Anthropic `thinking=True` without a budget
(previously dropped silently).

`smart_filter.py` rewrites: the bespoke gemini-3 / gemini-2.5 substring
heuristics (which passed `thinking_level` / `thinking_budget` via
`model_options` — exactly what `_validate_model_options` rejects) are
replaced with `reasoning_kwargs=map_reasoning_options(filtering_model,
"none", None, None)`. Routes through the canonical reasoning translator;
provider detection lives in one place.

`map_reasoning_options` `effort="none"` branch extended to handle
`thinking_level` capability (Gemini 3 has no off-switch — "minimal" is
the lowest equivalent). Latent gap independent of smart_filter: a user
passing `reasoning_effort: none` to a Gemini 3 LLM node previously got
default expensive thinking, silently.

## Other contract alignments

- Empty-response `_detect_empty_response_warnings` no longer silently
  returns `[]` for unrecognized finish_reason values — adds a
  `kind="llm_empty_response_unrecognized_finish_reason"` warning. Aligns
  the implementation with the function's own documented invariant.
- `parse_structured_response` raises `LLMResponseParseError` on Pydantic
  validation failure (previously logged at warning and returned the raw
  invalid dict, which let callers fail later with confusing
  KeyError/TypeError). Aligns with `LLMResponseParseError`'s docstring
  contract.
- `_emit_trace` exception swallow logs at WARNING (was DEBUG); a
  trace-hook bug shouldn't break workflows but also shouldn't be
  invisible.
- Dead-code: `CriticalDiscoveryError` removed (had no production raise
  sites; remaining CLI handler was unreachable; `LLMCallError`'s rich
  `to_diagnostics()` covers the discovery error UX).

## Tests

22 new test cases covering:
- Provider registry contract (`test_llm_providers.py` new).
- `o4-*` reasoning capability + env-var derivation regression guards.
- Pydantic-validation failure raises `LLMResponseParseError`.
- Smart_filter routes via `reasoning_kwargs`, never `model_options`
  (parametrized; `assert last_call["model_options"] is None` is the
  regression guard for the bypass).
- `effort="none"` Gemini 3 → `thinking_level=minimal`.
- `LLMTransientError` retry-exhaustion preserves `transient_kind`.
- Anthropic `thinking=True` without budget raises.
- `_validate_model_options` rejects reasoning keys (with `litellm.completion`
  patched directly, exercising the real validation pipeline).
- Parametrized seal-contract test verifies `provider_message=str(exc)` is
  threaded through every classification branch.
- End-to-end round-trip: real LiteLLM exception → seam →
  `_classify_litellm_error` → typed exception → LLMNode → runner →
  `result.errors[i].context["provider_message"]`. Pins the central UX
  promise across 5+ layers; a regression dropping `provider_message` in
  any layer fails this test loudly.

## Verification

- `make test`: 5380 passed, 9 skipped (was 5325 pre-pass; +55 net).
- `make check`: ruff, ruff-format, mypy, deptry — all green.
- Architectural seal verified: `grep -rn 'litellm\.exceptions\|^import openai\|^from openai' src/pflow/`
  returns hits only in `core/llm_client.py` (the seam).

## Follow-up issues filed

- #353 — typed sub-discriminators for `MissingApiKeyError` quota/suspended/region
- #354 — enrich diagnostic suggestions from `provider_message` text
- #355 — codify autouse-mock-bypass test pattern in `tests/CLAUDE.md`
- #348 commented — `llm_providers.py` substantially closes the
  duplication-elimination half; remaining work is provider expansion
  gated on real demand.

See `.taskmaster/tasks/task_158/implementation/progress-log.md` §39 §40
for the full session-by-session narrative including the smart_filter
regression discovery and the discriminator-loss bug pattern that drove
the `provider_message` field design.
`pflow find` and `pflow mcp find` invoke the LLM adapter via
`find_workflow` / `find_components` but never called
`inject_settings_env_vars()`. Users who stored API keys via
`pflow settings set-env` (the recommended migration path post-Task 158)
saw "API Key Missing or Insufficient" even though `pflow run` worked
with the same env. Surfaced during independent end-to-end verification.

Matches the §38 probe.py pattern: lazy import + single call before the
LLM-using code path. 4 lines added across 2 files.

Repro before fix:
  env -i HOME="$HOME" PATH="$PATH" pflow find "github"
  # Error: API Key Missing or Insufficient

After fix: same command succeeds (settings inject as designed).

Verification:
- `make test`: 5380 passed
- `make check`: ruff, ruff-format, mypy, deptry — all green
- 249 targeted tests for find/probe/mcp pass
Comprehensive review of Task 158 (LiteLLM migration) optimized for
future AI agents — especially the agent implementing Task 159 (caching)
which builds on this substrate.

Covers: integration points, shared store keys + routing rules, 10
architectural decisions with alternatives, 8 gotchas, 10 reusable
patterns, 9 anti-patterns, breaking changes, extension points for
caching, and 10 common pitfalls each tied to a real bug caught during
the implementation review cycles.

Companion to implementation/progress-log.md (the session-by-session
narrative) and task-158.md (the spec).
… GOOGLE_API_KEY alias, _normalize envelope [skip review]

Three correctness fixes from PR #356 review (claude[bot] warnings 1-3):

1. ProviderInfo.env_var (str) -> env_vars (tuple[str, ...]). Gemini now
   carries both GEMINI_API_KEY (canonical) and GOOGLE_API_KEY (the alias
   LiteLLM checks first in its Gemini auth path). MissingApiKeyError's
   missing_key remediation surfaces the canonical in the primary
   suggestion and lists aliases in a follow-up so users with only
   GOOGLE_API_KEY set see it. ctx field renamed env_var -> env_vars
   (list, canonical first). llm_config.PROVIDER_ENV_VARS now derives
   from the registry — single source of truth.

2. UnknownModelError.reason / MissingApiKeyError.kind /
   LLMTransientError.kind tightened from str to Literal[...]. Mypy
   immediately caught one untyped str return in
   _classify_transient_kind, which now returns LLMTransientKind. Typo
   at any raise site now fails type-check rather than silently falling
   through to a default branch with wrong remediation.

3. _normalize call in complete() wrapped in try/except (IndexError,
   AttributeError) -> LLMResponseParseError. Without this envelope, a
   malformed LiteLLM response (empty choices list, missing usage) would
   escape the typed-exception seal AND skip the after_call trace event,
   breaking both invariants the adapter owns. The wrap also fires
   after_call with error= so trace consumers preserve their
   begin/end-pair invariant.

Tests: TestNormalizeMalformedResponse pins the new envelope (3 cases:
empty choices, missing usage, after_call still fires). New
test_gemini_missing_key_lists_both_aliases pins the dual-env-var
remediation. Seal contract gains a connection-kind regression case
(suggestion S6). test_provider_env_vars_canonical_first asserts the
registry shape.

5386 passed, make check green (ruff, ruff-format, mypy, deptry).
…ip review]

Two coverage gaps from PR #356 review (claude[bot] warnings 4-5):

1. tests/test_cli/test_lazy_imports.py — new file. Subprocess test
   asserts ``litellm`` does not enter ``sys.modules`` when only the
   pflow CLI entry point is imported. The ~700ms LiteLLM import cost
   is documented as a contract in core/llm_client.py's module docstring;
   this test catches a regression that hoists ``import litellm`` to
   module top or routes an engine module through llm_client.py.

2. tests/test_runtime/test_trace_integration.py —
   test_llm_failure_in_one_item_does_not_corrupt_siblings. Failure-path
   complement to the existing parallel-batch + sub-workflow + LLM
   success test. Patches the LLMNode binding so one item raises
   MissingApiKeyError; asserts each worker reports its own outcome,
   sibling success traces are intact, and the typed _diagnostic_context
   propagates from the typed exception through the worker boundary
   into the user-facing surface (either trace event or batch errors
   list). The §31 worker-thread bug had this exact shape — silent
   failure or sibling corruption — but only the success path was
   pinned.

5388 passed, make check green.
… artifacts [skip review]

Three review-cleanup items from PR #356 (claude[bot] warnings 7-9):

1. pyproject.toml: openai gains a >=1.0 floor. The adapter catches
   openai.OpenAIError (the structural base of LiteLLM's exception
   hierarchy); a future major-version restructuring of openai would
   silently break the seal. uv.lock unchanged (current resolution is
   well above 1.0).

2. Docstring + mintlify docs for set-default / set-discovery /
   set-filtering now name the bare-name auto-prefix behavior. The
   _normalize_and_warn_model helper has been the contract for a
   while, but neither command help nor docs/reference/cli/settings.mdx
   surfaced it. A single Note block in settings.mdx covers all three
   commands; each command's own docstring gets a one-liner pointing
   at the same behavior.

3. scratchpads/task-158-spike/ removed from git tracking. The spike
   scripts and outputs were always intended as throwaway runnable
   documentation per phase-a-execution-guide.md and progress-log.md
   §27, but were inadvertently committed in a7d8c77. The 2.1 MB
   spike_4_output.txt also leaked a developer's macOS home path. The
   research guide updated to reflect reality (the artifacts no longer
   exist; §27 of the progress log is the canonical findings record).

5388 passed, make check green.
… unknown providers [skip review]

Pre-existing UX bug surfaced during the PR #356 review discussion: a
user calling ``together_ai/llama-3-70b`` (or any provider absent from
pflow's registry) who hit an authentication error saw a misleading
suggestion to ``export ANTHROPIC_API_KEY=...``. The hardcoded fallback
literal in ``MissingApiKeyError.to_diagnostics()`` was always wrong for
non-Anthropic providers — it just happened to be the registry's first
entry, used as a placeholder.

The fix splits the missing_key path into three branches:

1. Known provider (registry hit) — unchanged, precise canonical env var
   plus aliases (the GEMINI_API_KEY / GOOGLE_API_KEY case Tier A added).

2. Unknown provider with parseable prefix — defer to provider_message
   (the raw LiteLLM error text usually names the expected env var) plus
   a heuristic "<PROVIDER>_API_KEY" candidate framed as likely, NOT
   authoritative. The suggestion text explicitly disclaims multi-key
   cloud providers (AWS Bedrock, Azure, Vertex AI need additional
   credentials beyond a single API key). The heuristic is correct for
   the Mistral/Cohere/Groq/DeepSeek/Together/OpenRouter/Anyscale
   majority that follow the convention; the disclaimer covers the
   meaningful exceptions.

3. No parseable prefix — pure docs link, no fabricated env var.

New ``extract_provider_prefix(model)`` primitive in llm_providers.py
(provider parsing belongs there); the env-var derivation lives in
exceptions.py as ``_likely_env_var_for_unknown_provider`` since it's
specific to the diagnostic UX.

Agents reading the JSON output get a new structured field
``context["likely_env_var"]`` when the heuristic fires (filtered out
when None — the diagnostic_context base helper drops Nones already).

Tests cover all three branches: TogetherAI (heuristic fires, no
ANTHROPIC_API_KEY anywhere in suggestions), Bedrock (heuristic fires
but suggestion disclaims multi-key providers explicitly), and the
no-prefix case (pure docs link, no fabricated env var of any kind).

5393 passed, make check green.
…t-state braindump [skip review]

The two prior handovers (task-94-handover-01.md and -02.md, 622 lines
combined) described the llm-library architecture that Task 158
demolished. Most of their guidance referenced infrastructure that no
longer exists: _has_llm_key subprocess paths, llm models list, llm keys
get, the pflow registry describe llm and pflow registry discover
commands removed by Task 151, and the multi-source detection that
checked llm CLI as a third tier.

Replaced with a single ~200-line starting-context document that:

- States explicitly what it supersedes and why
- Captures the premise of the task (still valid) and the plural-models
  clarification from today's discussion
- Maps what's gone (llm library, registry CLI commands, hardcoded
  defaults) vs what's new and infrastructure-ready (PROVIDERS registry,
  _has_provider_key, _likely_env_var_for_unknown_provider,
  MissingApiKeyError.to_diagnostics(), pflow settings llm show, the
  lazy-import contract)
- Pulls forward the 5 insights from the originals that still apply
  (display-time check, don't enumerate 100+ providers, the user's
  "don't over-engineer" framing, PYTEST_CURRENT_TEST short-circuit,
  detection-vs-display split)
- Surfaces the model-enumeration question (raw LiteLLM dump vs curated
  regex vs hybrid) as the central new design choice
- Lists 6 open design questions explicitly NOT resolved
- Updates file paths (the originals pointed at dead files)
- Gives concrete first-steps guidance

The new doc is a starting-context document, not a design spec — design
decisions remain open for the agent picking up Task 94.
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request replaces the llm library with a pflow-owned LiteLLM adapter, introduces a typed exception hierarchy for LLM failures, and redesigns the tracing mechanism to use a shared-store save/restore seam. Feedback suggests that specific exception handling in the adapter may be redundant and recommends enhancing the warning propagation logic to support multiple warnings per call.

# error LiteLLM raises in the call path pflow uses — both the
# known subclasses and any future additions. The architectural
# seal stays intact: consumers (LLMNode retry loop, smart_filter,
# discovery callers) catch the LLMCallError umbrella without ever
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The adapter is designed to catch openai.OpenAIError as the structural base for all LiteLLM exceptions. Catching litellm.exceptions.BadRequestError specifically here might be redundant if the goal is a unified exception handling strategy as described in the module docstring.

node_id = getattr(self, "node_id", None)
if warnings_list and node_id is not None:
# In v1 the adapter emits at most one warning per call. If a
# future case needs multiple, change the contract to a list value.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The implementation assumes a single warning per call. If LiteLLM or the adapter starts producing multiple warnings, only the first one will be captured in __warnings__. Consider if warnings should be appended or joined if multiple exist.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 26, 2026

Code Review — PR #356 (Task 158: llm → LiteLLM adapter)

Reviewed against the Epistemic Manifesto + CLAUDE.md guidance. Six parallel agents inspected the adapter, exception hierarchy, tracing redesign, CLI plumbing, tests, and security/diff hygiene. Bottom line: the architectural work is solid and ships well. The migration is clean, the trace seam fix is structurally sound (closure capture + save/restore), and the test bar is above average for a refactor of this size. The findings below are cleanup, not blockers.


Warnings — should be addressed

1. GOOGLE_API_KEYGEMINI_API_KEY aliasing leaks wrong remediation

core/llm_config.py:29-33 lists PROVIDER_ENV_VARS with both GEMINI_API_KEY and GOOGLE_API_KEY for Gemini, but the canonical registry ProviderInfo.env_var (core/llm_providers.py) is a single string (GEMINI_API_KEY). When MissingApiKeyError.to_diagnostics() calls _derive_env_var_for_model (exceptions.py:521-528), a user with only GOOGLE_API_KEY set gets the misleading suggestion export GEMINI_API_KEY=....

Fix: extend ProviderInfo to env_vars: tuple[str, ...] and consolidate llm_config.PROVIDER_ENV_VARS into the registry. PR claims llm_providers.py is the single source of truth — this divergence weakens that claim.

2. Discriminators should be Literal[...], not str

The PR description claims typed Literal values for reason/kind/package, but exceptions.py:249, 319, 406, 462 define them as str with default values:

def __init__(self, ... reason: str = "unknown_name", ...): ...

to_diagnostics() consumers branch on equality (if self.reason == "missing_prefix" at :257) and use .get(self.kind, "...connection...") (:435). A typo at the raise site silently falls through to the default branch with wrong remediation, with no mypy signal.

Fix: tighten to Literal["unknown_name", "missing_prefix"], Literal["timeout", "rate_limit", "server_error", "connection"], etc. Trivial change, real safety.

3. Malformed LiteLLM responses bypass the typed exception system

llm_client.py:716, 722, 742 access raw_response.choices[0] and raw_response.usage directly inside _normalize. If LiteLLM returns an empty choices list or missing usage (e.g., LiteLLM bug, unusual provider response), this raises IndexError/AttributeError outside the except openai.OpenAIError: block at :311. The trace after_call event does not fire, and the runtime sees an unwrapped exception that escapes the seal contract.

Fix: wrap _normalize in a try/except that produces LLMResponseParseError(provider_message=str(exc)).

4. Missing regression test for the lazy-import contract

The PR's headline perf claim (~700ms saved on cold CLI) depends on litellm staying out of sys.modules for non-LLM CLI paths. There's no test pinning this. A 5-line subprocess test would catch a regression where someone does import litellm at module top:

def test_litellm_not_imported_for_validate(uv_exe, prepared_subprocess_env):
    result = subprocess.run([uv_exe, "run", "python", "-c",
        "import sys; from pflow.cli.main import main; "
        "assert 'litellm' not in sys.modules"],
        env=prepared_subprocess_env, capture_output=True)
    assert result.returncode == 0

This is the single biggest concrete coverage gap.

5. Missing test: LLM failure inside parallel-batch sub-workflow

tests/test_runtime/test_trace_integration.py:673-732 covers the success path for "parallel batch + sub-workflow + LLM" — the concurrency-safety bug class explicitly cited in the PR. The failure path (LLM raises during a parallel-batch sub-workflow) is untested. Given that the worker-thread _active_collectors bug was the original motivator for this redesign, the failure path under that exact configuration is the highest-value gap to close.

6. Process-wide LiteLLM logger silencing on every call

llm_client.py:307 sets logging.getLogger("LiteLLM").setLevel(logging.CRITICAL) inside complete(). Idempotent, but a process-wide side effect: any downstream tool that wants LiteLLM logs is suppressed once any pflow LLM call has run. Documented but worth flagging — consider doing it once at first import and noting it in the lazy-import block.

7. pyproject.toml: openai dep has no version floor

pyproject.toml:35 adds openai as a direct dep with no constraint. The adapter catches openai.OpenAIError — if openai ever restructures its exception hierarchy, the adapter breaks silently. uv.lock pins the resolved version today, but the intent should be encoded as a >= floor.

8. Auto-prefix rule is not user-documented

cli/commands/settings.py:458-476 auto-prefixes bare model names at write time and warns on unknown providers. The behavior is correct (azure/... short-circuits via "/" in model at llm_providers.py:57 — no double-prefix bug) but the rule is not documented in command help (set-default, set-discovery, set-filtering) or docs/. Add a one-liner under each set-* block.

9. Spike artifacts should be cleaned before merge

  • scratchpads/task-158-spike/spike_4_output.txt is 1.5 MB / 13,042 lines of LiteLLM dependency dumps. Pure noise.
  • Same file leaks a developer's macOS home path (/Users/andfal/.cache/uv/builds-v0/...) at lines 25 and 113.
  • .taskmaster/tasks/task_158/research/phase-a-execution-guide.md:41-43 says spike scripts are "not committed" — they ARE committed. Either delete them or update the doc to match reality.

Suggestions — optional improvements

  1. llm_reasoning_map.py:141max(min(int(base * ratio), 128000), 1024) has inline magic numbers. Extract MIN_THINKING_BUDGET = 1024 and MAX_THINKING_BUDGET = 128000 as module constants.
  2. llm_client.py:642EFFORT_RATIOS.get(thinking_effort, 0.5) silently substitutes "medium-ish" for unknown effort strings. At least logger.warning would help when the Literal-typing fix above doesn't catch it.
  3. llm_client.py:269 — schema response_format hardcodes "name": "response", "strict": True. Some providers may not support strict mode. Consider making it adapter-configurable if a provider rejects.
  4. llm_client.py:532_extract_missing_package regex r"pip install ['\"]?([^'\";\s]+)['\"]?" will silently return None if LiteLLM changes the hint wording (e.g., pip3 install, uv pip install). The MissingSdkError is still raised, but with package=None — less actionable. Compile the regex at module level while at it.
  5. llm_client.py:759isinstance(raw_cost, (int, float)) accepts bool (Python: isinstance(True, int) is True). Theoretical, not reachable from LiteLLM, but a smell.
  6. Discriminator coverage — add an explicit assert exc_info.value.kind == "connection" to the bare-APIConnectionError seal-contract case in test_llm_client.py.
  7. Drop mock_strip.assert_not_called() at tests/test_nodes/test_llm/test_llm.py:865-867 — testing internal dispatch shape, redundant with the type assertions on shared[\"response\"].
  8. MockLLMClient.set_error() helper — would deduplicate ~6 test sites that monkeypatch pflow.nodes.llm.llm.complete with throw-closures.
  9. CLAUDE.md L161-162: the "Never use the Explore agent" guidance is unrelated to this PR — should ideally be split into its own commit, but harmless as-is.

What works well (worth calling out)

  • Trace seam redesign is structurally sound. The closure capture (workflow_trace.py:535-554) binds stable references; the explicit prep_res[\"_trace_hook\"] propagation across the worker boundary (llm.py:325-328, 402-408) makes the prior worker-thread bug structurally impossible. Save/restore in engine.run is unconditional via try/finally.
  • Exception hierarchy quality. LLMCallError docstrings (exceptions.py:130-176) are above-average — explain the seam, deterministic vs transient split, and the provider_message rationale.
  • Test discipline: test_executor_service_llm.py:154-219 is the right pattern — patches litellm.completion directly and asserts the raw provider text survives the full chain (5+ layers). Catches regressions any other layer would mask.
  • Audit gap fix is complete: inject_settings_env_vars() covers all five entry points that reach complete() (run, find, mcp find, probe, mcp_server). Verified.
  • uv.lock regen is clean: every added/removed package is attributable to the swap; no unrelated bumps to core deps.

Verdict

Ship after addressing the warnings (especially #1 GOOGLE_API_KEY alias, #2 Literal discriminators, #3 _normalize exception envelope, and #4–5 missing tests). The rest is cleanup and can land separately.

@spinje spinje merged commit 0455156 into main Apr 26, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant