
feat: Prompt caching via declarative ## Cache block + pflow analyze-cache#378

Merged
spinje merged 156 commits into main from feat/prompt-caching on May 11, 2026

Conversation


spinje (Owner) commented May 7, 2026

Summary

Ships provider-level prompt caching as a declarative workflow surface, plus a new pflow analyze-cache command for static + trace-driven cache analysis. Verified end-to-end on real Anthropic + Gemini calls (-25% first-run, -73% rerun-within-TTL on the lyrics-generator smoke fixture).

The main thing to review is the analyzer contract, not the comparatively small provider-rendering seam: late PR and baseline sweeps turned pflow analyze-cache into the primary diagnostic surface for cache recommendations, trace fidelity, cross-workflow opportunities, and JSON/MCP consumers.

Task

159 — see .taskmaster/tasks/task_159/task-review.md for the comprehensive implementation review.

Changes

Workflow author surface

  • New top-level ## Cache markdown section: prose interleaved with ${var} chunks, optional - ttl: 5m|1h. Each chunk is a [prose-before-${var}][${var}] pair, parsed into ir["cache"] = {ttl, items: [{name, var, prose_before, _source_line}, ...]}.
  • Per-LLM-node fields: prompt_cache: [name1, name2, ...] (subset of ## Cache.items, declaration-order strict — out-of-order is a hard error) and prewarm: bool (gates auto batch-prefix marker insertion).
  • Existing cache: bool (memo opt-out) untouched and orthogonal — two cache layers, two distinct fields (DD#5).
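A minimal sketch of the section shape described above — the prose and variable names are illustrative, and real workflows may differ in surrounding structure:

````markdown
## Cache
- ttl: 1h

```cache
You are a lyrics assistant. Follow this style guide:
${style_guide}

Reference corpus:
${corpus}
```
````

A node opting in would then declare prompt_cache: [style_guide, corpus] in exactly this declaration order; listing corpus before style_guide is a hard error per the order invariant.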

Cache rendering at the LiteLLM adapter seam (src/pflow/core/llm_client.py)

  • complete() widened: system: str | None -> str | list[ContentBlock] | None; new user_message_blocks: list[dict] | None for batch-prefix paths.
  • Per-provider TTL translation in core/cache_render.py::_build_cache_control_marker (Anthropic native, Gemini -> cachedContents, OpenAI -> prompt_cache_retention).
  • Memo-cache-hash correctness (DD#19): compute_node_config(prompt_cache_content=) keyword-only kwarg; nodes that don't opt in keep their existing hash.
  • Auto batch-prefix caching gated on prewarm: true; render/hash byte parity is guarded by golden config hashes and NamespacedSharedStore production-shape tests.
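As a sketch of the per-provider TTL translation above — a hypothetical helper mirroring the mapping, not the actual core/cache_render.py code; the returned field shapes are assumptions:

```python
# Hypothetical sketch of per-provider TTL translation; field shapes are
# illustrative assumptions, not the real wire formats pflow emits.
def build_cache_control_marker(provider: str, ttl: str) -> dict:
    if provider == "anthropic":
        # Anthropic accepts cache_control markers natively on content blocks.
        return {"cache_control": {"type": "ephemeral", "ttl": ttl}}
    if provider == "gemini":
        # Gemini routes through cachedContents with an explicit TTL.
        return {"cached_content_ttl": ttl}
    if provider == "openai":
        # OpenAI exposes a retention setting rather than an inline marker.
        return {"prompt_cache_retention": ttl}
    raise ValueError(f"unknown provider: {provider}")
```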

pflow analyze-cache — new CLI + MCP command, full text + JSON parity

  • 4 output modes: greenfield / steady-state / already-optimal / --from-trace.
  • Evidence-only token estimation for cacheable bytes: the old fabricated Tier-3 cacheable heuristic was deleted. Unresolved declared chunks return (None, "unavailable") instead of a guessed value.
  • 4-state cost (trace / trace_partial / recomputed / unavailable) — never emits $0.00 for unavailable.
  • Full-validator unification: analyze-cache now surfaces execution-blocking non-cache validation errors via blocking_errors[] / other_blocking_errors[] instead of silently filtering them out.
  • Cross-workflow walker (Tier 2): prose mismatch, value-flow opportunities, grouped child-workflow recommendations, and JSON/raw rename diagnostics. Rename diagnostics are intentionally suppressed from text because variable names do not reach the provider wire.
  • Per-call projections now distinguish batch_prefix and cross_workflow_projection from ordinary parameter/memo/trace evidence; this is load-bearing for dynamic batch and parent->child cache opportunities.
  • Trace coverage is none / complete / truncated; successful conditional runs with unreached rows are still complete. Trace-dependent warning suppression is catalog-driven via CacheWarningSpec.requires_complete_trace, not severity.
  • 22 stable catalog entries at core/cache_analysis/warning_catalog.py (21 cache.* + llm.thinking-temperature-mismatch; DD#29 design review for additions).
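The catalog-driven suppression rule can be sketched as follows — the CacheWarningSpec field name follows the description above, but the catalog entries and their flags here are illustrative, not the real 22-entry table:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CacheWarningSpec:
    id: str
    requires_complete_trace: bool

# Illustrative subset of the catalog; flags are assumed for the example.
CATALOG = {
    "cache.unused-chunk": CacheWarningSpec("cache.unused-chunk", False),
    "cache.discrepancy": CacheWarningSpec("cache.discrepancy", True),
}

def keep_warning(warning_id: str, trace_coverage: str) -> bool:
    # Suppression keys off the per-spec flag, never off severity.
    spec = CATALOG[warning_id]
    return not (spec.requires_complete_trace and trace_coverage != "complete")
```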

Trace format

  • 2.0.0 -> 2.1.0 -> 2.2.0 (additive). New fields: workflow_path, per-event cache_key / cache_source / cache_age_sec / cache_chunks_skipped, cache-render snapshots, sub-workflow cost rollup, and canonical child workflow paths for dynamic workflow batches.
  • Filename schema bumped to identity-encoded (workflow-trace-{wf_hash}-{safe_name}-{timestamp}.json) — O(matches) glob lookup vs. O(N) scan-and-parse. Prior schema scaled to ~14s per analyze-cache invocation on machines with 67k+ traces; now <1s.
  • Forward-compat: existing 2.0.0 readers gate on format_version.startswith("2."); auto-load skips 2.0.0 (no workflow_path) but --from-trace <path> still works.
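The identity-encoded filename win can be illustrated with a simplified matcher — a hypothetical analogue of the glob lookup, operating on names instead of a directory:

```python
# Mirrors the O(matches) glob workflow-trace-{wf_hash}-*.json: the hash
# prefix narrows candidates without opening and parsing every trace file,
# which is what made the old schema scale as O(N) scan-and-parse.
def match_traces(filenames: list[str], wf_hash: str) -> list[str]:
    prefix = f"workflow-trace-{wf_hash}-"
    return sorted(n for n in filenames if n.startswith(prefix) and n.endswith(".json"))
```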

JSON output schema: 1.0 -> 1.1 -> 2.0 -> 2.1 -> 4.0 -> 4.1. format_version is the first key. JSON 4.1 includes other_blocking_errors[], summary.suggested_run_command, typed per_node_thresholds[].model_state, per_call[].cross_workflow_inputs, and cacheable evidence tiers batch_prefix / cross_workflow_projection.
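A consumer following that gating policy might look like this minimal sketch (function name is illustrative):

```python
def accept_report(report: dict) -> bool:
    # Gate on the major-version prefix; format_version is the first key
    # in the emitted JSON, so cheap early checks are possible too.
    return str(report.get("format_version", "")).startswith("4.")
```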

Architectural fix — Path 1 boundary contract

Companion fix — GH #357 (memo cache key line-shift)

  • pflow save rewrites frontmatter on every invocation, shifting source lines and invalidating cache_key. Fix: _METADATA_KEY_SUFFIXES = ("_source_line", "_source_lines", "_source_files") filter at runtime/cache.py::compute_node_cache_key. Pre-existing pflow bug; fixed inline because shipping caching without it would surprise the first user who saves a ## Cache workflow.
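The suffix filter can be sketched as a simplified analogue of the compute_node_cache_key filtering — recursion shape and non-dict handling here are assumptions:

```python
_METADATA_KEY_SUFFIXES = ("_source_line", "_source_lines", "_source_files")

def strip_metadata_keys(config: dict) -> dict:
    # Drop parser positional metadata recursively, so frontmatter rewrites
    # that shift source lines cannot change the memo cache key.
    return {
        key: strip_metadata_keys(value) if isinstance(value, dict) else value
        for key, value in config.items()
        if not key.endswith(_METADATA_KEY_SUFFIXES)
    }
```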

Diagnostic infrastructure

  • Diagnostic.id: str | None field added (top-level stable IDs per DD#27).
  • Identity tuple updated to (severity, source, node_id, id or message) — when id is set, it's the dedup key.
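The null-safe dedup key can be sketched as:

```python
def identity_key(severity: str, source: str, node_id, diag_id, message: str) -> tuple:
    # When a stable id is set it becomes the dedup key; legacy diagnostics
    # with id=None keep the original message-keyed behavior.
    return (severity, source, node_id, diag_id or message)
```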

NamespacedSharedStore now inherits MutableMapping (was duck-typed but not type-tagged). TemplateResolver._get_dict_value now checks Mapping not dict. Closes a silent stale-cache class where dotted-path chunks were filtered as _CHUNK_ABSENT in production.
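The Mapping-based check can be sketched as follows — simplified, since the real TemplateResolver does dotted-path traversal; the helper name is illustrative:

```python
from collections.abc import Mapping

def get_dict_value(container: object, key: str):
    # Checking Mapping (not dict) lets MutableMapping-based stores such as
    # a namespaced shared store pass; a bare dict check filtered them out
    # silently, which was the stale-cache class described above.
    if isinstance(container, Mapping):
        return container.get(key)
    return None
```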

Explanation

Most production risk is in the analyzer (src/pflow/core/cache_analysis/), not the provider rendering hot path. The provider rendering seam is comparatively small; the analyzer owns recommendations, cross-workflow traversal, trace evidence, cost semantics, JSON/MCP shape, and agent-facing text.

The implementation arc: 4 segments (foundations -> memo-hash gate -> rendering+prewarm+trace -> analyzer+docs) -> Stage 1 lyrics-generator verification -> Stage A/0/B/C data-model redesign (recommended_actions + cross_workflow.* derived from analysis.warnings) -> Stage 2 paid verification -> PR #378 review fixes -> baseline/polish sweeps.

The progress log is now 11,777 lines at .taskmaster/tasks/task_159/implementation/implementation-progress-log.md. For reviewer orientation, prefer .taskmaster/tasks/task_159/task-review.md; use the progress log only for forensic implementation history.

This is a large PR (700+ files, mostly task docs/baselines/tests). Exact file/commit counts may drift while final docs and baselines are pushed; the stable review surface is the task review, baseline harness, and tests listed below.

Created Docs

  • .taskmaster/tasks/task_159/task-159.md — spec (~617 lines, 37 design decisions, full warning ID catalog, output format reference)
  • .taskmaster/tasks/task_159/task-review.md — primary future-agent navigation artifact (file tour, integration points, patterns, anti-patterns, common pitfalls, Task 160 baseline guardrail)
  • .taskmaster/tasks/task_159/implementation/implementation-plan.md — full implementation plan
  • .taskmaster/tasks/task_159/implementation/implementation-progress-log.md — segment-by-segment progress log (11,777 lines)
  • .taskmaster/tasks/task_159/baseline/ — analyzer output oracle used during development and intended as a Task 160 refactor safeguard

User-facing documentation:

  • src/pflow/guide/features/caching.md — new pflow guide caching topic covering two-cache-layer model, ## Cache syntax, order invariant, TTL opt-in, sub-workflows, provider-specific notes, full warning ID catalog
  • docs/guides/debugging.mdx, docs/reference/cli/index.mdx, docs/reference/nodes/llm.mdx — caching cross-references and CLI help updates
  • src/pflow/core/cache_analysis/CLAUDE.md — current-state tacit knowledge for the analyzer package
  • Multiple CLAUDE.md updates across CLI, core, runtime, execution, nodes, and tests

Breaking Changes

These are not strictly breaking for end-user workflows (no users yet — see CLAUDE.md), but consumers of internal APIs should be aware:

  • Diagnostic identity tuple changed from (severity, source, node_id, message) to (severity, source, node_id, id or message). When id is set, it's the dedup key.
  • Adapter complete() signature: system: str | None -> system: str | list[ContentBlock] | None; new user_message_blocks: list[dict] | None kwarg.
  • compute_node_config, apply_memo_hit, write_memo_cache, handle_cached_execution widened with keyword-only kwargs (prompt_cache_content, node_type_name, cache_key, created_at, cache_source). Production callers updated; future direct callers must pass these.
  • WorkflowTraceCollector(*, workflow_path=None) — new keyword-only kwarg.
  • NamespacedSharedStore now inherits MutableMapping; __contains__(key: object) (ABC requirement) instead of key: str.
  • LLM usage telemetry now preserves has_cache_telemetry, uncached_input_tokens, and input_token_accounting; absence vs explicit zero is load-bearing.
  • JSON output is now major version 4. Consumers should gate on format_version.startswith("4.") and dispatch on typed fields like model_state rather than sentinel strings.
  • per_call[].cacheable_data_source changed: cacheable evidence no longer emits the old fabricated estimator tier; valid tiers include trace, memo, parameters, batch_prefix, cross_workflow_projection, unavailable.
  • pflow save byte-for-byte round-trips ## Cache blocks. Existing memo cache entries written under the old line-shift-corrupted cache key become unreachable; expire naturally via 24h TTL.
  • --no-cache flag scope clarified: disables pflow's local memoization only. Does NOT disable LLM provider prompt caching.
  • Trace filename schema: existing pre-fix traces silently bypass autoload (no hash prefix). Users with valuable old traces use pflow analyze-cache --from-trace <path>.

Testing

Latest logged state: default suite 6,592 passing, 1 skipped, 3 deselected; near-full sandbox-safe suite 6,562 passing, 1 skipped; baseline harness 67/67 passing after latest regeneration. make check was clean in the logged passes, with known sandbox uv caveats documented in the progress log.

Critical regression gates (do not delete or weaken — see task-review.md "Test Files" table for full list):

  • tests/test_runtime/fixtures/golden_config_hashes.json + test_prompt_cache_hash.py::test_golden_baseline_hashes_match — DD#19 byte-identity of compute_node_config across 30 nodes / 9 workflows.
  • test_plan_drift.py::test_plan_matches_engine_for_workflow_with_prompt_cache — planner <-> engine parity for cache-using workflows.
  • test_prompt_cache_rendering.py::test_resolve_chunk_value_is_imported_locally_at_both_sites — hash-side and prep-side import the SAME helper as local module bindings.
  • test_prompt_cache_rendering.py::test_hash_render_and_prep_render_byte_equivalent_through_namespaced_store — DD#19 production-shape (NamespacedSharedStore wrap + dotted path).
  • test_workflow_resolver_contract.py::test_resolve_workflow_returns_fully_file_resolved_ir and test_sub_workflow_resolver.py::test_resolve_sub_workflow_cross_workflow_walker_sees_resolved_prompts — Path 1 boundary contract.
  • test_trace_format_2_1.py::test_handle_cached_execution_does_not_overwrite_memo_cache_source — memo-hit cache_source="memo" must not be overwritten to "in_process".
  • test_cache_analysis_per_id_coverage.py — every catalog ID round-trips through JSON with id at top level.
  • test_cache_analysis_per_id_emission.py — production-shape warning emission tests via analyze(...) with real memo/trace fixtures.
  • JSON/version/error-envelope tests — format_version remains first key; JSON error paths emit parseable envelopes.
  • Trace-filtering tests — warning suppression is driven by CacheWarningSpec.requires_complete_trace, not severity.
  • batch_prefix / cross_workflow_projection tests — protect per-call projection tiers and no-double-counting behavior.
  • Task 159 baseline harness — output oracle for analyzer text/JSON behavior; Task 160 should run it before/after each refactor slice and treat unexpected drift as a stop signal.

End-to-end paid verification (Stage 2):

  • Anthropic Sonnet 4.5 smoke: cache_creation/cache_read observed; first-run -25%, rerun within TTL -73%.
  • Gemini smoke: cost-projection accuracy verified within +/-5%.
  • lyrics-generator workflows drove most analyzer hardening: file-resolution boundary, cross-workflow attribution, trace coverage, baseline output, and agent-UX polish.

Adversarial/baseline verification caught real bugs

  • cache_source mislabel (memo reported as in_process).
  • NamespacedSharedStore type taxonomy: dotted-path chunks silently dropped.
  • Trace events vs nodes typo: Tier-1 trace data unreachable.
  • Validator errors filtered out of analyze-cache.
  • Cache ratio > 100%; -$0.00/run; per-call vs cohort token-unit mistakes.
  • Dynamic workflow child rows misattributed because traces lacked canonical child paths.
  • Baseline sweeps caught multiple agent-UX regressions that unit tests did not.

Follow-up issues filed

#357 (closed inline — line-shift breaking memo), #358 (image cache support), #359 (LiteLLM cost-map fetch warnings to stderr), #360 (dynamic batch size cost undercount), #361 (Path 2 architectural cleanup), #362 (cross-workflow rename suppression — closed), #363 (shared prose detection), #364 (multi-walker consolidation in trace event traversal), #365 (sub-workflow cost rollup — resolved), #371 (test slowdown — closed), #372 (remaining perf), #373 (near-threshold expansion hints for SuggestedBlock), #374 (__warnings__ channel modernization), #375 (mock fidelity), #376 (child rollup confidence gap), #377 (split partial_cost_usd semantics), #383 (repeated non-batch prompt-shape detection).

Task 160 (cache analysis architectural refactor) is filed as a follow-up task. Its main guardrail should be the Task 159 baseline harness plus the focused cache-analysis tests above.

spinje added 30 commits April 27, 2026 12:00
…n, DDs 26-36

Synthesize the analyze-cache output format into the spec contract and lock
in the architectural decisions surfaced this session.

Spec changes (task-159.md):
- DDs #26-36 added: Tier 2 in-by-default; Diagnostic id field extension;
  no FixAction (mypy is the right analog, not rustc); closed warning ID
  catalog of 9 entries; four-level confidence labeling; savings-ratio
  prewarm thresholds; auto-load trace from ~/.pflow/debug/; optional
  inputs; three-tier validation/analysis architecture (pflow run stays
  fast/deterministic, --dry-run does opt-in expensive analysis,
  analyze-cache is the dedicated full report).
- New requirements subsections: Stable Warning ID Catalog, Output Format
  (Text + JSON with full mockups), Confidence Labeling Algorithm,
  Cross-Workflow Walker (Tier 2), Token Estimation Strategy, Per-Model
  Capabilities Table, Diagnostic Extension.
- Verification pass corrections: complete() signature extension required
  (cache feature is first to need structured system content blocks);
  compute_node_config uses key "batch" not "batch_config"; LLMNode lacks
  NodeConfig access (auto batch-prefix detection moves to batch_executor).
- All cache analytical findings are advisory; only structural errors
  (cache.order-mismatch, reference-resolution) block pflow run.

Progress log (§26): journey + three principle clarifications driven by
user pushback (FixAction overlap with suggestions; trace-vs-memo data
source conflation; savings-ratio threshold reasoning), verification pass
results with three surprises, seven open threads for plan-writing.

Braindump updates: drop outdated Tier 2 / prewarm framings (now resolved
via DD#26 / DD#36). New supplementary braindump captures tacit knowledge
not in other docs (user pushback patterns, DD interactions, hedged
claims, process recommendations).

v3 output-format draft deprecated — contract content folded into spec;
v3 retained as historical artifact with deprecation header.
…olidation, verification corrections

Resolves the DD#18 vs DD#33/DD#36 contradiction left from §26 (analytical
findings never block pflow run; advisory only). Locks in research-backed
decisions: adapter signature widening (system: str | list[ContentBlock]),
TTL translation table (Anthropic vs Gemini wire formats), source=
"cache_analyzer" diagnostic convention, 2.0.0-trace auto-load skip,
cross-workflow rename precedence, sparse-memo aggregate coverage.

Replaces two starting-context braindumps with single agent-handoff.md
capturing only what isn't in the spec or progress log (user style,
methodology, concrete substrate, watch-fors, plan-writer gotchas).

Final codebase-verification round caught five wrong claims now corrected:
- Anthropic does not accept ttl: "5m" explicitly (only "1h" or omit)
- Phase D option (c) rejected — batch_executor.py doesn't do per-item
  template resolution; option (a) is now recommended
- Function name was wrong (_attach_llm_call_to_event → _add_llm_data)
- Co-edit invariant mischaracterized (it's three places, not two)
- scratchpads/task-158-spike/ was cleaned up; pattern only

Progress log gains §27 (pre-plan-writing consistency pass), §28 (TTL
clarification + cost-degradation gap), §29 (verification round). Spec is
contract-ready for Phase B-G plan-writing.
…ion + DD#37

Verification spike caught Task 158 §27 misdiagnosis: "Opus 4.5 + thinking +
cache fails" was actually below-4096-token-threshold silent no-op. Anthropic
docs (April 2026) + spike confirm per-version minimums: Opus 4.5/4.6/4.7 +
Haiku 4.5 = 4096; Sonnet 4.6 = 2048; older Sonnet/Opus = 1024. New DD#37
locks OpenAI prompt_cache_retention mapping. Boundaries cleaned: spec holds
contract values, progress log §30 holds journey, handoff holds plan-writer
operational context.
7 Critical + 8 High-Priority + 7 Medium fixes encoded from round-3 8-agent review.
1 finding disputed (validate_data_flow reach — verified correct as planned).

Critical:
- D.1/D.2 unsafe defensive-read pattern (round-2 claim of universal coverage was false)
- F1 cache.discrepancy row had unresolvable {root_cause_action} placeholder
- B2.2 false claim: validator step 8 cannot reach top-level node keys
- ABSENT-branch hash-vs-prep symmetry (silent-stale-cache class) via _CHUNK_ABSENT sentinel
- D.2 prewarm execution made structurally concrete with index-alignment guarantees
- cache_chunks_skipped propagated through error path + memo round-trip
- 40+ WorkflowTraceCollector test sites: dropped strict assertion, added integration test

High-Priority:
- --no-trace-autoload rename (collision with pflow run --no-trace)
- storage_mode: shared × ## Cache documented as v1 limitation
- OpenAI 15 RPM batch routing documented; v1.x follow-up
- B3 in-memory mutation test (baseline can't include post-task fields)
- F3.1 exit code policy: ERROR severity in warnings[], exit 0 (advisory)
- Engine save/restore simplified to single try/finally; _EMPTY_CACHE_RENDER hoisted

Catalog: 10 → 12 entries (cache.invalid-on-non-llm, cache.prewarm-no-prefix added).

Plan: 1290 → 1445 lines.
Targeted Round 4 review (5 agents) caught pseudo-code-precision bugs in
Round-3-introduced code (wrong function signatures, wrong symbol paths,
references to non-existent classes, count drift). All verified against
actual code via direct Read before encoding fixes.

Verified pseudo-code corrections:
- process_item returns 5-tuple; D.2 pseudo-code now destructures correctly
- NodeStatus.ABSENT (not node_state.ABSENT)
- TemplateResolutionError doesn't exist; use ValueError
- extract_root_node_id always returns str; dropped dead None check
- Diagnostic identity tuple; multi-field rejection collapse fixed (V6)

V5 fix (top-10% pattern: one rule per error condition):
- Schema is single source for shape; _validate_cache_block does only
  semantics + defensive isinstance skip on compile path
- Removed obsolete "two diagnostics that dedupe" test

V6 fix (top-10% pattern: one diagnostic per [rule, location]):
- cache.invalid-on-non-llm now ONE Diagnostic per node with
  context["invalid_fields"]: list[str]; matches mypy/ruff convention

Five high-value additions (top-10% / simplest-final-code lens):
1. _resolve_static_prefix_for_cache companion helper — byte-identical
   resolution across all three cache paths (chunk/static-prefix/analyzer)
2. cache.discrepancy structured context["root_cause_action"] payload —
   agents dispatch on typed data, not prose
3. JSON format_version evolution policy — JSON_FORMAT_VERSION_MAJOR
   constant + consumer rule format_version.startswith("1.")
4. Compile-path defensive isinstance guards enumerated (bool subclasses
   int — must use isinstance(x, bool) not (int, bool))
5. Spike contingencies subsection — explicit table mapping each spike
   to the encoded plan decision it tests

Plan: 1445 -> 1813 lines.
Catalog count drift defense: EXPECTED_CATALOG_COUNT constant; tests
iterate CACHE_WARNING_CATALOG.keys() instead of hardcoding numbers.
Minimal additions — preserves existing Round 1+2 sections intact, adds:

- Header + "Where I am" updated for plan v4 (1813 lines, 12 catalog entries,
  4 review passes deep)
- New "What §33 / §34 don't capture" section: 7 fix-shape WHYs not in progress log
  (V6 combined diagnostic, V5 single-source-of-truth, _resolve_static_prefix_for_cache
  shape, cache.discrepancy dict payload choice, EXPECTED_CATALOG_COUNT pattern,
  skipped verification-only pass, "verify against actual code" methodology shift)
- User's load-bearing principles extended with §33 + §34 quotes (top-10% lens,
  read-the-actual-code instruction, diminishing-returns analysis)
- Resolved/Still-open lists updated through Round 4 (16 newly-resolved items
  enumerated; spike-gated items separated from Round-4-still-open)
- "For the next agent" section updated with Round 4 / 5-additions context,
  Spike contingencies table reference, current §35 recording target
- Code anchors expanded with Round 4 verifications (process_item shape,
  NodeStatus enum, extract_root_node_id signature, plan_node ordering,
  compile_validation reach)

Test: "Could the next agent find this by reading files?" — applied to every
addition. Tacit-knowledge sections capture WHY decisions were made, not WHAT.
Round 5 (5 targeted agents) closed layer-placement + defensive-bypass
bugs introduced by Round 4: shared cache helpers moved to new
`core/cache_render.py` (verified `nodes/llm/llm.py` only imports
`pflow.core.*`); `_CHUNK_ABSENT` defense relocated from sentinel
`__repr__/__str__` (verified bypassed by `_make_serializable` at
`cache.py:25-51`) to explicit isinstance branch; `cache_chunks_skipped`
error-path injection moved from builder signatures to caller wrap sites;
`cache.order-mismatch` formatting fixed (Python `str(list)` vs spec
bare-identifier form); `MockLLMClient.set_response` cache-tokens
parallel-dict pattern fully spec'd.

Round 6 (4 agents) caught factual errors in Round 5 fixes:
`tuple("string")` silent-splat (iterable but wrong shape); `apply_memo_hit`
3rd caller at `execution/plan.py:862` missed; `_PROPAGATED_KEYS` 5→7
entry correction; F2 confidence aggregation flipped permissive→STRICT
per DD#34 line 634 verbatim; `cache_chunks_skipped` 4th wrap site at
`post():511` LLMResponseParseError; V6 sub-workflow dedup test marked
`xfail` because `format_child_provenance` modifies `Diagnostic.message`.

Diminishing-returns analysis run jointly with user — bug class shifted
from architecture (Round 1) → concurrency (Round 2) → correctness gaps
(Round 3) → pseudo-code precision (Round 4) → layer/defense (Round 5)
→ factual errors (Round 6). Cost-saved-per-fix dropped ~2hr → ~30min.
Stop decision: switch to code-review per phase merge from here.

Plan summary section collapsed (330 lines of fix-history → 10-line
pointer). Journey moved to progress-log §35. Braindump updated with
Round 5+6 tacit knowledge per /braindump skill rules (don't repeat
what's already written; capture only what isn't in other docs).
Three new artifacts to manage the Task 159 implementation handoff:

1. `implementing-agent-brief.md` — self-contained brief for the
   implementing agent. Specifies the 4-segment split (Foundations,
   Memo-hash gate, Rendering+Trace, Analyzer+Docs) with mandatory
   STOP-and-log behavior at each firebreak. User decides handoff vs
   continue at each stop.

2. `spike-runner-brief.md` — self-contained brief for the spike-running
   agent. Covers the 3 pre-authorized paid spikes (~$0.30) with concrete
   protocols, expected outcomes, plan-update instructions if outcomes
   contradict the encoded Spike Contingencies decisions.

3. `implementation-progress-log.md` (stub) — target file for implementing
   agents to append segment entries at each firebreak.

Plus a new Claude Code skill at `.claude/skills/plan-breakdown/SKILL.md`
for analyzing implementation plans and recommending optimal handoff
breakpoints based on size + tacit-knowledge dependency. Generalizes the
analysis we did for Task 159 into a reusable pattern.
Three pre-authorized paid spikes ran 2026-04-29 (~$0.04 actual vs $0.30
budget). Outcomes recorded as progress-log §36.

Spike 1 (Gemini explicit cache_control): scenario A AMBIGUOUS leaning
CONFIRM via a 1b disambiguator (no-marker control proved the marker does
real work, but LiteLLM's Vertex telemetry surfaces only reads — no
cache_creation_input_tokens field). Scenario B CONFIRMS (multi-marker
request succeeds). F2 plan update applied: Gemini info-note in analyze.py
documenting the reads-only telemetry shape.

Spike 2 (OpenAI prompt_cache_key parallel-batch routing): CONFIRMS, 6/6
parallel calls hit cache after warm-up. No plan update.

Spike 3 (Anthropic per-TTL pricing): CONTRADICTS severely — 5m write
priced correctly but 1h cache-write cost is missing entirely from
litellm.completion_cost (ratio 0.0083 vs expected ~1.6). E.1 plan update
applied: 1h-TTL normalization override at llm_client.py:776-784,
Anthropic-gated, with explicit test cases pinned to spike actuals.

Doc updates trim the planner-era "go run the spikes" framing across
brief, braindump, and plan; spikes are now past-tense audit material,
the plan is the implementer's single source of truth.
Add stable warning identifier (Diagnostic.id) for cache-namespaced diagnostics
per DD#27. Identity tuple becomes (severity, source, node_id, id or message);
null-safe by construction so legacy diagnostics with id=None preserve the
existing message-keyed dedup behavior.

Add three cache category constants (cache_failure / cache_warning /
cache_advisory) and corresponding CATEGORY_TITLES entries for renderer title
lookup. _FAILURE_CATEGORY_MAP intentionally NOT extended in v1 — no typed
cache exception flows through __failures__ so adding the entry would create
dead code; deferred to v1.x.

Extend diagnostic_render._format_warning_or_info_diagnostic to dispatch on
cache categories: render [id] inline prefix and surface a closed list of
context keys (savings_pct, savings_usd, batch_size, prefix_tokens_estimated,
target_file) so agents reading text output can route on stable warning IDs
without parsing JSON.

Update the load-bearing SSoT comment in core/CLAUDE.md to document the new
identity tuple shape and null-safety contract.
New dependency-free module mirroring core/llm_providers.py shape: frozen
ModelCapability dataclass, module-level MODEL_CAPABILITIES tuple, and the
get_min_cache_tokens(model) lookup used by cache.below-min-tokens warning
emission and auto-batch-prefix detection.

Per DD#32, Anthropic minimums are version-specific: Sonnet 4.5 / Opus 4.1 /
Opus 4 / Sonnet 4 / Sonnet 3.7 -> 1024; Sonnet 4.6 / Haiku 3.5 -> 2048;
Opus 4.7 / Opus 4.6 / Opus 4.5 / Haiku 4.5 -> 4096. OpenAI auto-cache fires at
1024 across all families. Gemini explicit cachedContents requires ~4k.
Unknown / unprefixed models fall back to the conservative floor (4096) to
prevent silent cache-warning suppression.

Lookup uses most-specific-prefix-wins so dated version suffixes like
``claude-sonnet-4-5-20240620`` correctly route to the family threshold.
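The lookup can be sketched as follows — the threshold table is an illustrative subset of the values listed above, not the real capabilities module:

```python
# Illustrative subset of the per-model minimum-cacheable-token table.
MODEL_MIN_CACHE_TOKENS = {
    "claude-sonnet-4-6": 2048,
    "claude-sonnet-4-5": 1024,
    "claude-haiku-4-5": 4096,
}
_CONSERVATIVE_FLOOR = 4096  # unknown models fall back to the strictest floor

def get_min_cache_tokens(model: str) -> int:
    # Most-specific-prefix-wins: dated version suffixes still route to
    # their family threshold rather than the conservative fallback.
    best_prefix, result = "", _CONSERVATIVE_FLOOR
    for prefix, minimum in MODEL_MIN_CACHE_TOKENS.items():
        if model.startswith(prefix) and len(prefix) > len(best_prefix):
            best_prefix, result = prefix, minimum
    return result
```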
Extend the markdown parser state machine to recognize a structurally novel
section mode: ``## Cache`` carries section-level YAML params (only ``- ttl:``
in v1) plus a single tagged ``cache`` code block, with NO ``### entities``.

Add _SectionType.CACHE (intentionally NOT in _KNOWN_SECTIONS so the
orphan-content rules don't fire), _CacheChunk / _CacheSection collectors,
section-level YAML branch (validates `- ttl: 5m | 1h`, rejects other keys),
code-fence-close branch wired to _attach_cache_code_block.

The new _parse_cache_code_block helper splits the cache block content into
``[prose-before-${var}][${var}]`` chunks via _CACHE_TEMPLATE_RE; trailing
prose after the last ${var} is silently discarded per the contract; empty
or prose-only blocks raise. Chunk identifiers are the raw ${...} content
verbatim — downstream root-extraction in B2.3 uses
TemplateResolver.extract_root_node_id for bracket-syntax handling.
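The chunk split can be sketched as a hypothetical analogue of _CACHE_TEMPLATE_RE and _parse_cache_code_block — error handling and the per-chunk shape are simplified:

```python
import re

# Simplified stand-in for _CACHE_TEMPLATE_RE; real chunk items also carry
# name and _source_line metadata.
_CACHE_TEMPLATE_RE = re.compile(r"\$\{([^}]+)\}")

def parse_cache_chunks(content: str) -> list[dict]:
    chunks, pos = [], 0
    for match in _CACHE_TEMPLATE_RE.finditer(content):
        chunks.append({
            "var": match.group(1),                 # raw ${...} content, verbatim
            "prose_before": content[pos:match.start()],
        })
        pos = match.end()
    # Trailing prose after the last ${var} is silently discarded per the
    # contract; empty or prose-only blocks raise.
    if not chunks:
        raise ValueError("cache block must contain at least one ${var}")
    return chunks
```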

_build_node_dict now extracts ``prompt_cache:`` and ``prewarm:`` to TOP-LEVEL
node keys (siblings of ``cache``, ``batch``, ``type``), not inside
``node["params"]``. This is required for B2.2's per-node schema check to
see them AND for B2.3's ``cache.invalid-on-non-llm`` rule (the validator's
step 8 only iterates ``node["params"]``, so top-level placement is the
load-bearing reach for the non-LLM-node check).

ir["cache"] is built at Phase 4 with shape {ttl, items: [{name, var,
prose_before, _source_line}], _source_line}; matches the schema added in B2.2.
…prewarm

Add top-level ``cache`` field to FLOW_IR_SCHEMA: object with required
``items`` array (per-chunk shape: {name, var, prose_before, _source_line}),
optional ``ttl`` enum {"5m", "1h"}, optional ``_source_line`` parser metadata.
``additionalProperties: False`` on both the cache object and per-item objects
so typos like ``cahe`` and stray fields are caught at schema validation.

Add ``prompt_cache: list[str]`` and ``prewarm: bool`` as additive per-node
properties. Schema accepts them on EVERY node type — semantic
``cache.invalid-on-non-llm`` rejection lives in core/workflow/data_flow.py
(B2.3), not here. Per the V5 fix from plan Round 4: schema is single source
of truth for SHAPE; data_flow does only SEMANTICS. Top-10% pattern (mypy /
ruff): one rule per error condition, not belt-and-suspenders. Without the
single-source rule, save-path validation would emit two diagnostics for
the same shape error.

The existing per-node ``additionalProperties: False`` invariant catches typos
like ``prompt_cahe`` automatically — no extra suggestion infrastructure needed.
…ata_flow

Add _validate_cache_block to validate_data_flow's tail. Both validation entry
points (WorkflowValidator.validate at save-time + compile_validation at
compile-time) reach this via the shared call site.

STEP ordering is load-bearing per the V5 fix from plan Round 5:

  STEP 1 (FIRST): non-LLM rejection. Walks every node; for nodes whose type
  is a string other than "llm" that declare prompt_cache: or prewarm:, emits ONE
  ``cache.invalid-on-non-llm`` Diagnostic per node listing ALL invalid fields
  in context["invalid_fields"] (V6 combined-diagnostic shape — not one
  diagnostic per field). Shape-agnostic: a malformed prompt_cache: 5 on a
  type: shell node still emits the structured rejection rather than silently
  log-skipping. isinstance(type, str) gate prevents firing against
  type: ["llm"] (a structural error) when the real issue is the type field
  itself, not its value vs llm.

  STEP 2: defensive shape skip on the compile path. Surviving LLM nodes with
  malformed prompt_cache (non-list / list-of-non-strings) or prewarm
  (non-bool — note bool is a subclass of int; isinstance(x, bool) is required
  to reject prewarm: 1) are logger.warning'd and skipped. The schema-validator
  path catches these at step 1 and short-circuits; the compile path bypasses
  jsonschema, so we MUST guard here. Per V5: schema is single source of truth
  for shape; data_flow emits ZERO diagnostics for shape errors. The deeper
  compile error (CompilationError on CacheBlockIR construction) surfaces.

  STEP 3a: per-node prompt_cache: semantic checks. Each chunk must resolve
  to a declared cache item (find_similar_items "Did you mean?" hint on miss);
  declaration order must match (cache.order-mismatch ERROR with spec-locked
  four-line message format using bare-identifier bracketed lists, NOT Python
  repr quoted form).

  STEP 3b: top-level chunk-var resolution + batch-scoped rejection
  (${item.X} and descendants reject with no catalog id — flows through
  generic validation pipeline) + cache.unused-chunk WARNING for declared
  chunks no node references.
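The prewarm shape guard in STEP 2 hinges on a Python quirk worth spelling out: `bool` is a subclass of `int`, so a naive integer-aware check accepts `prewarm: 1`. A standalone demo (function name hypothetical):

```python
def is_valid_prewarm(value: object) -> bool:
    """Shape check for prewarm:. bool is a subclass of int in Python,
    so anything looser than isinstance(value, bool) would wrongly
    accept prewarm: 1."""
    return isinstance(value, bool)

# True is an int too, which is exactly the trap:
assert isinstance(True, int)
assert is_valid_prewarm(True)
assert not is_valid_prewarm(1)  # truthy int must be rejected
```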

Top-level cache block defense is split-scope (Round 5): when ## Cache shape
is malformed, emit ONE logger.warning and skip top-level cross-references,
but the per-node cache.invalid-on-non-llm STILL runs since it's independent
of top-level cache shape.
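STEP 1's combined-diagnostic shape, sketched with a plain-dict stand-in for the real Diagnostic type. The function name and dict keys here are illustrative; only the catalog id comes from the text above:

```python
def reject_cache_fields_on_non_llm(nodes: list[dict]) -> list[dict]:
    """Sketch of STEP 1: one combined diagnostic per offending node.

    `nodes` is a simplified stand-in for the workflow IR's node list;
    the real Diagnostic type carries more fields than this dict.
    """
    diagnostics = []
    for node in nodes:
        node_type = node.get("type")
        # isinstance gate: type: ["llm"] is a structural error elsewhere,
        # so this rule must not also fire against it.
        if not isinstance(node_type, str) or node_type == "llm":
            continue
        invalid = [f for f in ("prompt_cache", "prewarm") if f in node]
        if invalid:
            diagnostics.append({
                "id": "cache.invalid-on-non-llm",
                "node_id": node.get("id"),
                # V6 shape: ALL invalid fields in one diagnostic, not one each
                "context": {"invalid_fields": invalid},
            })
    return diagnostics
```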

The see_also guide-topic pointer is intentionally OMITTED here — the caching
topic is added in Phase G; the repo-wide
test_all_see_also_literals_resolve_to_real_guide_topics enforces topic
existence. G.2 will wire the references back in.

V6 sub-workflow dedup test (xfail, strict=False) locks the contract that
parent + child versions of cache.invalid-on-non-llm dedup correctly via
the id-keyed identity tuple. The xfail wrapper is a tripwire for the
open user decision (granular dedup tuple including workflow_path vs
special-case per-id dedup ignoring message/node_id) — DO NOT silently
weaken the assertion.

# noqa: C901 on _validate_cache_block: the function is split into clearly
numbered STEP 1/2/3a/3b and a refactor would obscure the linear flow.

Append the segment-1 entry covering B1.1 + B1.2 + B2.1 + B2.2 + B2.3.
Documents what shipped, deviations from plan, tacit knowledge for the
next agent, open hedged claims, and segment-2 preflight (the
golden_config_hashes.json baseline fixture is the load-bearing gate
before B3.1 patches begin).

A workflow input referenced ONLY in ``## Cache`` (no node param uses it) was
flagged as unused by ``_validate_unused_inputs`` because the template-walker
``_extract_all_templates`` only iterated ``node.params`` and ``batch.items``.
Top-level ``workflow_ir["cache"]["items"]`` was structurally invisible to it.

Surfaced via the post-segment-1 smoke test on a valid lyrics-generator-shape
workflow (case 1: input declared, used in cache only). Without the walker
extension, every workflow with ``## Cache`` referencing inputs would produce
a spurious 'Declared input(s) never used as template variable' ERROR.

Fix: extend ``_extract_all_templates`` with a top-level walk that wraps each
``cache.items[i].var`` in ``${...}`` and feeds it through the existing
``extract_from_value`` pipeline. Cache chunks reach the unused-input check
the same way as node params and batch items.

Regression test: tests/test_core/test_cache_block_parser.py::test_inputs_referenced_only_in_cache_not_flagged_unused

Append a "Post-segment-1 smoke test + bug fix" subsection to the segment-1
log entry documenting the 4-case manual UX check, the false-positive
unused-input bug it surfaced, the _extract_all_templates fix, and what
follow-up agents need to know about template-walker exhaustiveness.
…plicates

Two regressions surfaced by adversarial smoke testing post-segment-1:

1. **Double-emit on bad cache var references.** The earlier walker fix routed
   ``cache.items[].var`` through ``_extract_all_templates``'s shared template
   set; both my B2.3 ``_validate_cache_block`` AND the existing template-path
   validator (pass 5) saw the same template and both emitted. User would see
   two errors for one mistake.

   Fix: split into a separate ``_extract_cache_templates`` extractor. The
   union flows ONLY into ``_validate_unused_inputs`` (so cache-only inputs
   aren't flagged unused). Path validation receives ONLY node-param templates;
   cache var resolution is owned by ``_validate_cache_block`` with its richer
   "Cache chunk 'X' references..." diagnostics.

   This is the V5 single-source-of-truth pattern applied to template
   extraction: one extractor per consumer concern, no shared sets.

2. **``prompt_cache: [a, a]`` (duplicate items) passed validation silently.**
   The chunk would render twice into the system prompt — wasted tokens AND a
   silent semantic shift the author likely didn't intend. The parser already
   rejects duplicate ``${var}`` in ``## Cache``; per-node ``prompt_cache:``
   needs the parallel rule.

   Fix: detect duplicates BEFORE the resolution check; emit one validation
   error per offending node listing all duplicate names; suppress the
   order-mismatch follow-on (would be noise on top of the actionable error).
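The split-extractor contract from fix 1, as a hedged sketch. Function bodies are simplified stand-ins for the real walkers:

```python
def extract_node_templates(workflow_ir: dict) -> set[str]:
    """Templates from node params only -- feeds template-PATH validation.
    Simplified sketch: the real walker also covers batch.items etc."""
    found: set[str] = set()
    for node in workflow_ir.get("nodes", []):
        for value in node.get("params", {}).values():
            if isinstance(value, str) and "${" in value:
                found.add(value)
    return found


def extract_cache_templates_for_unused_check(workflow_ir: dict) -> set[str]:
    """Templates from ## Cache chunks -- feeds ONLY the unused-input check.
    Keeping this extractor separate avoids double-emitting when a cache var
    fails to resolve: cache var resolution owns its own diagnostics."""
    cache = workflow_ir.get("cache") or {}
    return {"${" + item["var"] + "}" for item in cache.get("items", [])}
```

One extractor per consumer concern, no shared sets, so each validation pass sees exactly the templates it owns.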

Both fixes have regression tests:
- tests/test_core/test_cache_block_parser.py::test_inputs_referenced_only_in_cache_not_flagged_unused
  (updated to assert split-extractor contract)
- tests/test_core/test_prompt_cache_validation.py::test_duplicate_chunk_in_prompt_cache_emits_error
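Fix 2's duplicate detection, sketched (helper name hypothetical):

```python
def find_duplicate_chunks(prompt_cache: list[str]) -> list[str]:
    """Names listed more than once in a node's prompt_cache:, in first-seen
    order. Runs BEFORE the order-mismatch check so a duplicate produces one
    actionable error instead of noise from the follow-on rule."""
    seen: set[str] = set()
    duplicates: list[str] = []
    for name in prompt_cache:
        if name in seen and name not in duplicates:
            duplicates.append(name)
        seen.add(name)
    return duplicates
```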

Manual verification of all 4 original smoke cases + 9 new adversarial cases
(end-to-end run, save round-trip, sub-workflow propagation, batch + cache,
JSON output, parser edge cases) confirms no other regressions in the
segment-1 surface.

Reviews identified mostly minor issues — the implementation held under
verification-specialist scrutiny + 4-agent code review. Fixes applied:

**review-test-fidelity findings:**

1. V6 sub-workflow dedup test was a tautology — synthetic
   ``deduplicate_diagnostics([d1, d2])`` necessarily collapses two
   identically-id'd diagnostics regardless of production behavior.
   REPLACED with a Pattern 2 fixture-based test that drives a real
   parent → child workflow file pair through ``WorkflowValidator.validate()``,
   asserting the propagation path emits ONE diagnostic with provenance
   prefix and preserved node_id.

2. Two circular identity tests
   (``test_legacy_identity_tuple_in_hash_when_id_absent``,
   ``test_id_set_changes_hash_to_id_keyed_tuple``) reconstructed the tuple
   from production code and compared. REMOVED as bloat — the behavioral
   tests (``test_id_set_collapses_message_variants``,
   ``test_subworkflow_legacy_dedup_regression``) are the real lock.

3. Split-extractor regression test asserted MECHANISM only (cache vars in
   one set, not the other). EXTENDED with end-to-end EFFECT assertion:
   running the FULL ``validate_workflow_templates`` produces zero spurious
   errors for cache-only inputs. A future contributor breaking the
   mechanism via re-export now surfaces immediately.

4. Pattern 2 (file-based) parser→validator integration test added —
   parses a real ``.pflow.md`` with cache.order-mismatch and asserts the
   validator emits the spec-locked diagnostic. Closes the "all hand-built
   IR" gap that masked the post-segment-1 bugs.

5. ERROR-severity cache diagnostics now show ``[cache.X]`` next to the
   title (e.g. "Error: Cache Failure [cache.invalid-on-non-llm]") — same
   agent-routing handle as WARNING/INFO. Text-mode agents can grep on id
   without parsing JSON. Non-cache errors render unchanged (additive).

**review-silent-failures findings:**

6. ``_extract_cache_templates`` API surface had no enforcement against
   future inclusion in ``all_templates`` (would silently double-emit on
   cache var resolution failures). RENAMED to
   ``_extract_cache_templates_for_unused_check`` so the contract is
   explicit at every call site.

**review-validation-consistency findings:**

7. Parser-level invariant assertion ``chunk.name == chunk.var_expr`` in
   ``_build_cache_dict``. The schema doesn't enforce equality but the
   parser always populates them equal. Prevents future refactors from
   silently drifting the two fields (B2.3 resolves by ``name``; segment-3
   rendering will substitute via ``var`` — drift would break symmetry).

**review-impact-completeness findings:**

8. Stale doc strings updated:
   - ``core/CLAUDE.md:107`` short paragraph aligned with the load-bearing
     SSoT comment at line 103 (id-keyed identity tuple).
   - ``runtime/template_validation/CLAUDE.md`` documents both extractors
     and the split-extractor contract.

Skipped (per the verification specialist's triage):

- "prompt_cache: inside node[\"params\"]" asymmetry: programmatic-IR-only
  path; step 8 already rejects with clear error. Not worth a third rule.
- "cache.duplicate-chunk" stable id: adding would expand the catalog
  beyond DD#29's closed-list. Substring match in regression test is fine
  for now.
- Empty-string ``id=""`` ``__post_init__`` check: speculative future
  risk; no production diagnostic constructs with empty id.

Final state: 5519 tests pass, mypy + ruff + deptry clean,
``test_plan_drift.py`` green (32/32). 9 manual adversarial smoke tests +
4 original cases all behave correctly end-to-end.

Append a comprehensive subsection covering the verification specialist's
14-case adversarial smoke test, the 4-agent code review (impact-completeness,
silent-failures, validation-consistency, test-fidelity), the 10 fixes
applied, the 3 findings deliberately not fixed (with reasoning), and the
end-to-end manual verification that confirmed the segment-1 surface holds.

Pre-merge precondition for Task 159 B3 (memo-hash conditional inclusion).
Generates per-node config_hash for a curated set of 9 example workflows
covering 9 mandatory hash-affecting shapes (plain, batch sequential/parallel,
sub-workflow, branching, cache:false, template-heavy, multi-input).

The committed fixture is the LOAD-BEARING regression gate (DD#19): workflows
WITHOUT prompt_cache:/##Cache must produce byte-identical hashes pre- and
post-Task-159 B3 patches. Drift signals a silent stale-cache regression class.

Files:
- scripts/generate_config_hash_baseline.py — re-runnable generator
- tests/test_runtime/fixtures/baseline_workflows.py — single source of truth
  (imported by both the script and the regression test to prevent drift)
- tests/test_runtime/fixtures/golden_config_hashes.json — 30 hashes, 9 workflows

Regen command (only after human-reviewed compute_node_config change):
  uv run python scripts/generate_config_hash_baseline.py
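How such a golden-hash gate works, as an assumed sketch; the real generator and regression test import the same fixture list, and the hash recipe below is illustrative:

```python
import hashlib
import json


def config_hash(node_config: dict) -> str:
    """Order-independent hash of a node config: canonical JSON
    (sorted keys, fixed separators) -> sha256 hex digest."""
    canonical = json.dumps(node_config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_against_golden(current: dict[str, str], golden: dict[str, str]) -> list[str]:
    """Node ids whose hash drifted from the committed fixture. Any
    non-empty result on a no-cache workflow signals the silent
    stale-cache regression class."""
    return [node_id for node_id, h in golden.items() if current.get(node_id) != h]
```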
… context

Implements DD#19's load-bearing invariant: workflows WITHOUT prompt_cache:/##Cache
hash byte-identically pre/post-task (verified by golden_config_hashes.json
regression gate); workflows WITH prompt_cache: get a fresh hash that includes
the rendered cache content, so existing memo entries don't silently serve
outputs produced without the prepended prefix.

B3.1 — schema additions to runtime types
  CacheChunkIR / CacheBlockIR / CacheRenderContext (frozen dataclasses) live in
  NEW src/pflow/core/cache_render.py because LLMNode.prep (C1.2) reads
  CacheRenderContext from shared and only imports pflow.core.* — placing the
  dataclasses in runtime/engine/types.py would force a layer-policy violation.
  NodeConfig gains prompt_cache_items: tuple[str, ...] and prewarm: bool.
  CompiledWorkflow gains cache_block: CacheBlockIR | None.
  Compiler extracts both with explicit isinstance preconditions that raise
  CompilationError on malformed shapes (Round-6 hardening: rejects tuple("string")
  silent splat, prewarm: 1 truthy-int, etc.).
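The tuple("string") splat that the Round-6 preconditions guard against, sketched with a hypothetical coercion helper:

```python
class CompilationError(Exception):
    pass


def coerce_prompt_cache_items(raw: object) -> tuple[str, ...]:
    """Precondition before tuple() coercion. tuple("string") silently
    splats a str into ('s', 't', 'r', ...), so require an actual list
    of strings and raise instead of guessing."""
    if not isinstance(raw, list) or not all(isinstance(x, str) for x in raw):
        raise CompilationError(f"prompt_cache must be a list of strings, got {raw!r}")
    return tuple(raw)
```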

B3.2 — CacheRenderContext build + install at engine boundary
  build_cache_render_dict() builds the per-workflow {node_id: CacheRenderContext}
  map (sparse — LLM nodes with cache state only). WorkflowEngine.run installs
  it under shared["__pflow_cache_render__"], wrapped in MappingProxyType for
  read-only enforcement. Save/restore mirrors __trace_collector__ exactly;
  restore-from-absent writes the module-level _EMPTY_CACHE_RENDER constant
  (frozen empty proxy), never None. NOT in _PROPAGATED_KEYS — each .pflow.md
  scopes its own ##Cache (DD#12); leaking parent->child would break cache
  scoping AND the CacheBlockIR freeze guarantee.
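The read-only install pattern in miniature. `MappingProxyType` is the stdlib mechanism named above; the keys and shapes below are illustrative:

```python
from types import MappingProxyType

# Frozen empty proxy written on restore-from-absent -- never None, so
# readers can always iterate without a guard.
_EMPTY_CACHE_RENDER = MappingProxyType({})

cache_render = MappingProxyType({"summarize": {"chunks": ("style",)}})
shared = {"__pflow_cache_render__": cache_render}

# Reads pass through; writes raise, enforcing the read-only contract.
assert shared["__pflow_cache_render__"]["summarize"]["chunks"] == ("style",)
try:
    shared["__pflow_cache_render__"]["summarize"] = {}
except TypeError:
    pass  # MappingProxyType rejects item assignment
```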

B3.3 — plan_node renders cache content into the hash
  Reorders plan_node so resolve_templates runs BEFORE config_hash (cache
  rendering needs shared state). _render_cache_for_hash filters _CHUNK_ABSENT
  chunks (branch-not-taken upstreams) symmetrically with the future C1.2 prep
  site. _resolve_chunk_value handles two unresolvable cases identically: (a)
  NodeStatus.ABSENT upstream, (b) permissive-mode echo where TemplateResolver
  returns the literal "${var}" string. Both collapse to _CHUNK_ABSENT so the
  filter contract holds.

  Defense-in-depth at runtime/cache.py::_make_serializable: raises TypeError
  if a leaked sentinel reaches hash serialization. Without it, a leaked
  sentinel would silently serialize to "<...ChunkAbsentSentinel>" and fold
  byte-identically into the cache hash — the silent stale-cache class.

B3.4 — compute_node_config accepts rendered cache content
  Keyword-only prompt_cache_content kwarg (forces explicit pass; future
  callers can't accidentally feed wrong shape as 5th positional). Conditional
  inclusion: empty list and None both fall through, byte-identical to absent
  (DD#19 three-state: absent equiv [] not equiv [chunk]).

Code-review fixes applied (4-agent /code-review pass):
  * Planner cache_render install: _create_planner_shared installs the same
    MappingProxyType the engine does, so plan_node's hash matches the engine's
    for cache-using workflows. Without this, pflow run --dry-run silently
    mispredicted "execute" vs "cached" on workflows declaring ##Cache.
    Regression test in test_plan_drift.py exercises an LLM workflow with
    ##Cache end-to-end via mock LLM; mutation-tested by removing the install.
  * Engine install ordering: build cache_render dict BEFORE installing trace
    so a build-time exception leaves shared completely unchanged.
  * isinstance(_, _ChunkAbsentSentinel) instead of type(_CHUNK_ABSENT) —
    cleaner, less brittle.
  * logger.warning at the defensive _render_cache_for_hash skip site so
    bypass scenarios become observable.

Storage-mode shared x ##Cache documented as v1-unsupported in
runtime/CLAUDE.md (parallel batch with shared root store would have
non-deterministic save/restore order; lyrics-generator doesn't use this combo).

Fix #357 (companion, pre-existing pflow bug surfaced during E2E verification):
  compute_node_cache_key now filters _*_source_line keys, mirroring the
  existing filter in compute_node_config. Without this, every saved-library
  workflow re-executed on every invocation because pflow save mutates the
  workflow file's frontmatter post-run, shifting every body section's source
  line, which flowed into resolved_inputs -> cache_key. End-to-end verified:
  saved A-basic.pflow.md now hits memo cache on second run (was 2s/$0.0002,
  now 6ms/cached).

Tests added: 67 (5523 -> 5590)
- tests/test_runtime/test_prompt_cache_compile.py (21)
- tests/test_runtime/test_cache_render_dict.py (12)
- tests/test_runtime/test_prompt_cache_hash.py (17)
- tests/test_core/test_cache_render.py (16)
- tests/test_runtime/test_cache.py (3 new for #357 filter)
- tests/test_execution/test_plan_drift.py (1 new for planner-engine parity
  on cache-using workflows)

End-to-end CLI verification (real LLM calls via claude-haiku-4-5):
- Cross-run: same input -> memo HIT; different input -> memo MISS
- Sub-workflow nested ##Cache: parent + child both HIT, dry-run agrees
- Parallel batch with prompt_cache: 2/2 in 1.5s parallel, HIT on re-run
- Branch-absent: route=use-a vs use-b produce different cache_keys (sentinel
  filter symmetric); same-route re-run HITs
- Validation errors render correctly (text + JSON):
  cache.order-mismatch, cache.invalid-on-non-llm, cache.unused-chunk,
  undeclared chunk, empty/prose-only block, multiple blocks, batch-scoped ref
- pflow save round-trip: ##Cache content preserved byte-for-byte
- pflow describe / --report / visualize / --no-cache / --only / --dry-run /
  --output-format json all work cleanly on cache workflows
- No __pflow_cache_render__ leakage in trace JSON or run JSON output

Closes #357
…view fixes

C-D-E phases of prompt caching:
- C1.1/C1.2: complete() system: str | list[dict] widening; LLMNode.prep
  builds system_blocks via shared _resolve_chunk_value; cache_chunks_skipped
  channel through 4 wrap sites + memo HIT round-trip.
- C2/C3: per-provider TTL marker translation (Anthropic / Gemini seconds-suffix
  / OpenAI prompt_cache_key MD5 + prompt_cache_retention "24h" for ttl=1h).
- D.1: auto-batch-prefix detection in prep() via _resolve_static_prefix_for_cache
  (canonical-JSON byte-stable prefix); user_message_blocks kwarg on complete().
- D.2: prewarm execution split inside _execute_parallel — item[0] runs
  synchronously through the SAME process_item closure; _collect_parallel_results
  widened with *, initial_completed=0, total=None.
- E.1: TRACE_FORMAT_VERSION = "2.1.0"; workflow_path plumbing; apply_memo_hit /
  write_memo_cache widened with keyword-only node_type_name; cache fields flow
  via existing llm_usage keyset (no sidecar); _should_write_cache_metadata
  allowlist (LLMNode only; ClaudeCodeNode INTENTIONALLY excluded); Anthropic
  1h-TTL cost normalization override per Spike 3.
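A hedged sketch of the per-provider TTL translation. Field names follow this PR's prose, not any provider SDK verbatim, and the OpenAI branch pins only the 1h case the prose mentions:

```python
_TTL_SECONDS = {"5m": 300, "1h": 3600}


def translate_ttl(provider: str, ttl: str) -> dict:
    """Illustrative per-provider TTL marker translation."""
    if provider == "anthropic":
        # Anthropic cache_control takes the ttl natively
        return {"cache_control": {"type": "ephemeral", "ttl": ttl}}
    if provider == "gemini":
        # Gemini cachedContents wants a seconds-suffixed duration string
        return {"ttl": f"{_TTL_SECONDS[ttl]}s"}
    if provider == "openai":
        # the prose pins only the 1h case ("24h" retention); other ttls
        # fall back to provider defaults in this sketch
        return {"prompt_cache_retention": "24h"} if ttl == "1h" else {}
    raise ValueError(f"unknown provider: {provider}")
```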

Post-review fixes from 5-agent /code-review pass:
- Image-drop under prewarm + images: graceful runtime degradation +
  __warnings__ entry; GH #358 filed for v1.x native image-cache support.
- Dead-code cache_chunks_skipped wraps on 3 of 4 error paths:
  _propagate_error_to_shared now threads the field from err_dict into shared.
- Strengthened divergence-injection test with behavioral binding-independence
  check; ABSENT-case byte-equivalence test; Anthropic 1h-cost integration test
  through _normalize; logger.warning on _build_user_message_blocks last-resort
  fall-back; ClaudeCodeNode skip test exhaustive exclusion; litellm.model_cost
  test cleanup via monkeypatch.setattr; runtime/CLAUDE.md + runtime/engine/CLAUDE.md
  doc updates; debug-level signal for non-allowlisted LLM-producing node types;
  MockLLMClient.set_response full-replacement contract documented.

Tests: 5709 passing (+143 vs end-of-Segment-2 baseline). test_plan_drift.py
34/34. test_golden_baseline_hashes_match (DD#19 silent-stale-cache gate) PASSED.
make check clean (ruff + ruff-format + mypy + deptry).

Both bugs surfaced by manual `pflow` CLI runs against real Anthropic Haiku
calls. Test suite was green going in (5709 passed); both were architecturally
invisible to the existing unit-test fixtures.

Bug #1 — `cache_source` mislabeled `"in_process"` for cross-process memo HITs.
`engine.py:420` called `apply_memo_hit` (which augments `cache_source="memo"`)
followed unconditionally by `handle_cached_execution`, which augmented
`"in_process"` for both branches and silently overwrote the memo augment.
Fix: `handle_cached_execution` takes `cache_source: str | None = None`
keyword-only; engine specifies `"in_process"` for the in-process branch and
omits for memo (apply_memo_hit handles it). Eliminates the overlap class.

Bug #2 — Cache rendering silently drops every dotted-path chunk
(`${node.field}`) through `NamespacedSharedStore`. `LLMNode.prep` receives
shared as a `NamespacedSharedStore`; `TemplateResolver._get_dict_value`
checked `isinstance(value, dict)` which excluded the proxy, so dotted-path
resolution returned `(False, None)` → permissive-echo → `_CHUNK_ABSENT`.
Fix at the type-taxonomy level: `NamespacedSharedStore` inherits
`collections.abc.MutableMapping` (it always was one structurally — it just
didn't declare it); `_get_dict_value` checks `isinstance(value, Mapping)`.
The motivating workflow (lyrics-generator) uses dotted-path chunks almost
exclusively — every one would have been silently filtered.

Side-effect: `NamespacedSharedStore` shrank 218 → 154 lines because the
`MutableMapping` mixin handles `keys`/`items`/`values`/`get`/`setdefault`/
`update`/`pop`/`popitem`/`clear` that we used to hand-implement.
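The type-taxonomy fix in miniature: a hypothetical namespaced store that declares `MutableMapping` and gets the mapping surface for free, so `isinstance(value, Mapping)` passes where `isinstance(value, dict)` did not:

```python
from collections.abc import MutableMapping


class NamespacedStoreSketch(MutableMapping):
    """Minimal stand-in for the fix: declare the Mapping contract the
    class always satisfied structurally. The five dunders below are all
    MutableMapping needs; keys/items/values/get/setdefault/... come free."""

    def __init__(self, backing: dict, prefix: str):
        self._backing = backing
        self._prefix = prefix + "."

    def __getitem__(self, key):
        return self._backing[self._prefix + key]

    def __setitem__(self, key, value):
        self._backing[self._prefix + key] = value

    def __delitem__(self, key):
        del self._backing[self._prefix + key]

    def __iter__(self):
        return (k[len(self._prefix):] for k in self._backing if k.startswith(self._prefix))

    def __len__(self):
        return sum(1 for _ in self)
```

A resolver checking `isinstance(value, Mapping)` now descends into the proxy, which is why dotted-path chunks stop collapsing to the absent sentinel.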

9 new regression tests, including production-shape DD#19 byte-identity check
through `NamespacedSharedStore`. The historical byte-equivalence test was a
tautology (synthetic dict + single-root chunks).

Verification artifacts: scratchpads/segment3-verification/ — 10 adversarial
.pflow.md workflows + VERIFICATION-REPORT.md.
spinje added 29 commits May 9, 2026 12:00
Four agent-UX fixes after a fresh-eyes read of the post-Cluster-A/B
output, each targeting a different section of `pflow analyze-cache`.
Total +35 LOC across 3 source files; 6 baselines drifted as strict
improvements.

- **L-5**: drop the `Discrepancy detection: skipped attribution for N
  trace event(s)…` Note. The prose was internal accounting (predicted
  cache_key, observable signal, IR-resolvable node — analyzer-internal
  vocabulary the agent can't act on). The per-call header `Showing N
  of M LLM nodes` already conveys coverage; the per-child commands
  section points to where to look.

- **N-10**: when only the non-cache blocking-errors list renders,
  emit bare `## Blocking errors` instead of the orphan `## Other
  blocking errors (surfaced for awareness)`. Pass `cache_blocking_
  present: bool` from the orchestrator into `_render_other_blocking_
  errors`. The "Other" qualifier is purely relational — when a
  sibling section isn't present, "other than what?" is incoherent.

- **B-3**: collapse `## Per-child analyze-cache commands` from N
  absolute paths × ~200 chars (15 on lyrics-generator) to one
  `cd <dirname(workflow_path)>` + relative-path commands. Sub-workflow
  paths naturally compose from the parent's directory, so dirname is
  the deepest natural common prefix — no `os.path.commonpath` heuristic
  needed.

- **B-4**: walker emits typed `DynamicBatchInfo` entries on
  `CrossWorkflowResult.dynamic_batches` instead of per-batch prose
  appended to `notes`. `_format_dynamic_batches_note` in analyze.py
  emits ONE aggregated Note. Singular case keeps the original wording
  for backward compat with existing substring tests; multi case lists
  each batch's `(node_id, items_expression)` once and shares the
  explanatory prose. The walker is documented as a data primitive
  (per cache_analysis/CLAUDE.md); this brings the dynamic-batch
  emission path into compliance.

Verification:
- `make test`: 6,426 passed, 1 skipped (no regressions).
- `make check`: ruff + ruff-format + mypy + deptry clean.
- 65/65 baselines pass; 6 regenerated as strict-improvement diffs.

Closes L-5 + N-10 + B-3 + B-4 from POLISH-PLAN.md Tier 1.

Recommendations now name affected child LLM nodes inline and project
savings_usd via memo-then-trace fallback. Honors honest-unmeasurable
when neither memo nor trace can resolve the parent value.
…licates

Restructured analyze-cache header for AI-agent readability. The 39-word
run-on scale line splits into 3 cohesive lines (`Workflow:` count +
invocation status; `Models:` for 2+ models; `Heterogeneous:` for batch
sub-workflows declaring `model: ${item.model}`). Single-model workflows
keep `using X` inline (pragmatic density).

The duplicate `Observed models:` line is dropped — `models_in_use` is
a superset of `observed_models_in_trace` in complete trace mode. The
sub-workflow breakdown line drops its trailing 15-name CSV; names
already appear in `## Per-call cache report` headings and
`## Per-child analyze-cache commands`.

Latent bug fixed as bonus: static-mode workflows on the no-model-resolved
branch previously suppressed the heterogeneous suffix. The static-mode
lyrics-generator capture now correctly surfaces `Heterogeneous:
generate-chorus-options (model varies per batch item)`.

`_format_heterogeneous_line` collapses the old 1-vs-N branches via
plain `', '.join`; `_format_heterogeneous_suffix` deleted as dead code.

Files: render_text.py (+50/-65 LOC, net -15 simpler), test_cache_analysis_renderers.py
(3 helper-direct tests migrated to end-to-end render assertions per
Pitfall #19; 2 N-9 CSV literal updates; 1 multi-model test rewritten;
1 new N-2 regression test). 4 mutation contracts verified. 5 baselines
regenerated as strict-improvement diffs.

6,430 tests pass (+4). `make check` clean. 65/65 baselines pass.

Closes N-2 + N-3 + N-9 from POLISH-PLAN.md Cluster D.

Recommendation #1 on the lyrics-generator canonical capture flips from
`savings unavailable` to `saves ~$0.20/run`, and the
`## Recommended actions` header gains the `(ordered by impact)`
qualifier — closes the gap where Cluster C v1's by-node-id walker
couldn't see workflow-input passthroughs (concept passed parent → child
→ grandchild as an input parameter).

Adds Tier 3 to `_estimate_parent_value_tokens`: read the resolved
value from the parent's workflow-node trace event under
`node_params['inputs'][child_input_name]`. The engine populates
`node.params` post-template-resolution before `record_trace`, so the
runtime invocation site carries the resolved value directly. Robust
against `template_resolutions` size-trimming applied to real-world
fixtures.

One new production-shape test with verified mutation contract; full
default suite green (6,427 passed); 65/65 baselines pass.
…clared`

When the parent value's estimated tokens are definitively below the
child model's minimum cache threshold, the recommendation now drops
`savings_usd` to None and appends a `Note:` line naming model +
threshold + token gap. Caching wouldn't fire as-stated, so the dollar
tag would be a lie; the warning prose carries the "declare AND increase
content" framing.

Asymmetric with the greenfield `cache.shared-context-undeclared` path
(which suppresses the recommendation when no node clears threshold):
cross-boundary keeps the recommendation visible because the agent
needs to know about the boundary even when caching wouldn't fire
as-stated. The recommendation surface IS the actionable signal for
this catalog ID.

Catalog gains additive `below_threshold_clause: str` context key
(always passed; empty string when not below). Helper
`_below_threshold_clause` formats the warning;
`_project_sub_workflow_cache_savings` returns
`(savings_usd, tokens, threshold_model)` so the emit site gates
without recomputing.

Heterogeneous-children case (different child models within one
sub-workflow) samples the first child's model — mirrors the existing
single-model sampling in `_consolidate_to_root_advisories`. v1
simplification documented in the helper docstring.

Files: analyze.py, warning_catalog.py, cache_analysis/CLAUDE.md, 4
test files (+3 emission tests with verified mutation contracts;
synthetic-fixture helpers updated). 2 baselines regenerated as
strict-improvement diffs (additive JSON field only).

6,434 tests pass. `make check` clean. 65/65 baselines.

Surface excluded-node cost as an explicit line so trace-mode dollar
figures compose visually for a fresh agent. Drops `(projected subset)`
qualifiers and the inline `(excludes ...)` clause from `_format_delta`.

Lyrics-generator before:
  Actually paid (trace):       ~$2.31 (trace)
  Cost without caching (projected subset):        ~$2.53 (partial)
  Actual savings (this run):    saves ~$0.49/run vs no-cache (excludes generate-chorus-options), 19% of no-cache cost

After ($2.31 − $0.27 = $2.04, $2.53 − $2.04 = $0.49 — math composes):
  Actually paid (trace):       ~$2.31 (trace)
  Excluded from analysis:      ~$0.27 (generate-chorus-options: model varies per call)
  Cost without caching:        ~$2.53 (partial)
  Actual savings (this run):    saves ~$0.49/run vs no-cache, 19% of no-cache cost

- Per-call rows: render `model=<unresolved>` when row.model is empty
  (was 35 chars of whitespace, read as a broken column).
- Drop the `Suggested-blocks: ... deferred to v1.x.` Note: internal
  jargon roadmap leak with no agent action.
- Header: promote `(no model resolved — set settings.default_model)`
  parenthetical to its own `Models: not resolved (...)` line, parallel
  to the multi-model `Models: A, B, C` case.
- Summary: drop duplicate `(trace)` from `Actually paid:` label since
  the value's tier annotation already carries the signal. Truncated
  branch keeps `(executed trace)` because "executed" is unique signal.

Tests: 1 assertion update + 1 existing test rewritten for new shape +
2 new mutation-contract tests. 6,438 pass, `make check` clean, 65/65
baselines pass after regen (34 strict-improvement diffs, net −77 lines
mostly from the Suggested-blocks Note vanishing.

Adds a parameter-aware resolution tier before memo/trace/invocation in
`_estimate_parent_value_tokens` so `pflow analyze-cache root.pflow.md
input=<sample>` produces real `saves ~$X.XX/run` figures on first contact
without a run-then-reanalyze loop. Reads from `parameters_for_workflow`
so CLI params at the root and walker-propagated params at nested
boundaries share one scoping primitive.

Replace the misleading severity-keyed headline (`19 opportunities (0
warnings, 19 info)`) with section-mapped counts that match what the
agent actually sees post-Cluster-A grouping:

- Summary: `2 recommended actions + 5 cross-workflow boundary findings`
  (suppressed when both are zero — was unconditional `0 opportunities`).
- `## Recommended actions (N, ordered by impact)` / `(N)` inline counts;
  B-6 ordered-by-impact gating preserved.
- `## Sub-workflow boundaries (M, covering K underlying renames)` —
  rollup suffix only when grouping collapsed renames; prose-mismatches
  contribute to M but not K (1-per-finding, no collapse).
- Dry-run nudge now uses the same rendered count via new
  `view_helpers.count_rendered_findings` so `pflow run --dry-run` and
  `pflow analyze-cache` agree (lyrics-generator: was 19 vs 7, now 7
  on both surfaces).
- `_group_renames_by_parent` lifted to `view_helpers` as the SSoT for
  rename grouping.
- `actionable_opportunities`/`warnings_count`/`info_count` preserved on
  `AnalysisSummary` and in JSON for machine consumers.

16 baselines regenerated as strict improvements (net −20 lines).
6,449 tests pass (+11 net). `make check` clean. 65/65 baselines pass.

Closes A-4 from POLISH-PLAN Cluster E (cluster fully closed: A-4 +
N-10 + L-5 + B-3 + B-4 all shipped).
…etector

Pass A:
- A1: consolidate visibility predicate (`per_call_row_has_real_data`) into
  view_helpers; both analyze.py and render_text.py import from there.
  Removes the 4-vs-3-condition divergence that emitted a misleading
  "Per-call cache report hidden" note on workflows with parameter-backed
  cacheable evidence (5+ baselines affected).
- A2: divide input/output/Tier 1 cacheable tokens by observed_call_count
  for static-list batch trace rows at the producer (`_build_per_call_row`).
  Cost helpers' `× _invocation_count` then recovers cohort cleanly. Fixes
  N²-inflated `no_cache_hypothetical_usd` and `rerun_within_ttl_hypothetical_usd`
  for static-list batches. New `_is_static_batch_trace_row` /
  `_divide_static_batch_trace_tokens` helpers; uses `round()` not floor.
  Tier 2 chunk-path now multiplies by `resolved_chunk_call_count` for
  non-batch repeated rows so units match cohort input.
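The A2 normalization, sketched (helper name hypothetical):

```python
def normalize_batch_tokens(total_tokens: int, observed_call_count: int) -> int:
    """Sketch of the Pass A2 fix: trace rows for static-list batches
    carry cohort totals, so divide back to per-call units at the
    producer. round() rather than floor keeps the reconstructed cohort
    total honest when the cost helpers multiply back by call count."""
    if observed_call_count <= 1:
        return total_tokens
    return round(total_tokens / observed_call_count)
```

Downstream cost helpers then multiply by the invocation count and recover the cohort cleanly, instead of N²-inflating it.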

Pass B — new catalog ID `cache.prompt-cache-incomplete`:
- Detector `_detect_partial_prompt_cache_declarations` walks LLM nodes via
  `_partial_declaration_finding_for_node`; uses `TemplateResolver.TEMPLATE_PATTERN`
  + `split_coalesce_operands` (NOT a hand-rolled regex) for prompt operand
  scanning. Root-aware via `_template_root_segment` / longest-var-prefix
  match. Filters batch-scoped refs.
- Emitter `_emit_partial_declaration_findings` aggregates per-node findings
  into one workflow-level diagnostic (matches `node_id=None` convention,
  documents latent dedup risk shared with 4 existing IDs).
- Cleanup hint via `_prompt_body_cleanup_for_node` calling
  `cache_overlap.compute_overlaps` — single source of truth shared with
  validator. Suggestions ordered cleanup-first to avoid the
  `cache.prompt-body-duplicates-cache` ERROR trap.
- Threshold gate: visible-with-clause + `savings_usd=None` mirrors N-7
  precedent. Per-node check for heterogeneous-models case.
- Suppressed when `cache.consolidate-to-root-recommended` covers same root.

Catalog count: 21 → 22. New baseline at
04-warning-catalog/21-cache.prompt-cache-incomplete/. Documentation
updated in CLAUDE.md, MCP tool docstring, and guide.

Existing baselines drift on Pass A (cost projections shrink to honest
totals) and Pass B (recommendation surface gains the new ID where
applicable). All diffs strict-improvement.
…tatic calls

Three small renderer-only fixes surfaced by a fresh-agent read of the
post-Pass-A/B captures. Each was unmasked by an earlier fix that
introduced a better signal but didn't drop the now-redundant or wrong
original.

A. ``(partial)`` qualifier on no-cache / rerun cost lines is redundant
   in trace mode (``Excluded from analysis:`` line carries the cohort
   gap in plain English) and in greenfield-with-cache mode (``(projected
   subset)`` label suffix is the parallel signal). Suppress it via
   ``projection_partial_marker`` in both branches; greenfield-no-cache
   keeps it (single-line ``Cost per run:`` has no other signal).

B. ``_per_call_scope_explainer`` returned a 60-word run-on packing four
   facts. Worse, the "(divide by calls for per-call values)" guidance
   became structurally wrong after Pass A2 normalized static-list batch
   trace rows to per-call units; agents following it would compute 1/N
   of the actual per-call value. The explainer now returns ``list[str]``
   (a lead line plus 3 bullets), and the wrong advice is dropped entirely.

C. ``_cell_calls`` renders ``—`` instead of ``0`` when
   ``evidence_scope == "static_analysis"``. ``calls=0`` for sub-workflow
   standalone analysis (the case ``## Per-child analyze-cache commands``
   instructs agents to run) read as "this node never runs". Trace mode
   keeps integer rendering — ``observed_call_count == 0`` there is a
   real signal (conditional dispatch not taken).
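Fix C's rendering rule can be restated in a few lines. A hedged sketch, not the actual `_cell_calls`; the `evidence_scope` values are taken from the text above:

```python
def cell_calls(observed_call_count: int, evidence_scope: str) -> str:
    """In static analysis a zero count only means 'no trace was taken', so an
    em dash avoids the 'this node never runs' misread; in trace mode zero is
    a real signal (conditional dispatch not taken), so the integer stays."""
    if evidence_scope == "static_analysis" and observed_call_count == 0:
        return "\u2014"  # em dash
    return str(observed_call_count)
```
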

6 new mutation-contract tests. 14 baselines drift as strict
improvements; 1 unrelated upstream LiteLLM pricing-data drift on
``04-warning-catalog/20-llm.thinking-temperature-mismatch`` regenerated
in lockstep (same drift class as prior Tier-0 / per-call-column-split
commits).

Coordination: surfaces are disjoint from in-flight Pass C
(cross-workflow → per-call row plumbing). Pass C touches
``_cell_could_cache``, ``_per_call_confidence_footer``, the catalog
template for ``cache.sub-workflow-cache-undeclared``, and
``_build_per_call_row``; this commit edits ``_render_trace_cost_lines``,
``_render_greenfield_with_cache_lines``, ``_per_call_scope_explainer``,
``_append_per_call_rows``, ``_format_per_call_row``,
``_compute_per_call_column_widths``, ``_per_call_cells``, and
``_cell_calls``.

Verification: 6,555 passed, 1 skipped (near-full sandbox suite); 142
renderer tests; 566 cache-analysis suite; 66/66 baselines; ruff +
mypy + ruff format clean on touched files.
Five Stage 1 UX fixes from a fresh-eyes read of the post-Pass-C lyrics-generator
capture, plus the "Shape δ" rename-text-suppression investigation that makes
`cache.cross-workflow-rename-detected` JSON-only.

**Stage 1 — five renderer/catalog fixes.**

- Fix 3 (pluralization): `nodes_phrase` computed field in `make_diagnostic`;
  `cache.sub-workflow-cache-undeclared` no longer emits "1 LLM nodes there".
- Fix 4 (Confidence): suppress `Confidence: high_from_trace (N of N nodes)`
  when Evidence already says complete trace with no unreached. Keeps the
  line for `medium_from_memo`, partial coverage, truncated trace, unreached.
- Fix 5+6 (token-confidence footer): `_per_call_confidence_footer` returns
  `list[str] | None`; multi-line bullet block. New `_format_node_list`
  aggregates duplicate node names — lyrics-generator's "review, review,
  review, ..." collapses to "8 review nodes".
- Fix 7 (cost-block pass-through): fold excluded `actual_cost_usd` into
  `Cost without caching` / `Cost on rerun` so the cost block reconciles
  in-place ($2.31 paid < $2.42 no-cache, no subtraction). Plain-English
  footnote names the pass-through nodes; falls back to current Excluded-
  line behavior when any exclusion has `actual_cost_usd=None`. Wording
  corrected: "no caching can apply" → "projected savings unavailable; real
  savings could be higher if model+content combos repeat".
- Fix 8 (Notes-section jargon): memo-empty discrepancy note + Gemini
  telemetry note rewritten in plain English; "Discrepancy detection /
  predicted-key matching / Observable-field attributions" → "Cache fidelity
  check / TTL expiry / chunk-skip detection". `_format_fidelity_skip_note`
  helper applies the same rewrite across all 9 remaining per-node skip-note
  sites in `_format_skipped_workflows_note`, `_build_predict_scaffold`, and
  `_predict_node_with_scaffold`. Production grep confirms zero remaining
  emission of the old jargon strings.
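The duplicate-name aggregation in Fix 5+6 can be sketched as below. This is a hypothetical re-statement of the behavior, not the real `_format_node_list`:

```python
from collections import Counter

def format_node_list(names: list[str]) -> str:
    """Collapse repeated node names into '<count> <name> nodes' while keeping
    first-seen order for distinct names."""
    counts = Counter(names)
    emitted: set[str] = set()
    parts: list[str] = []
    for name in names:
        if name in emitted:
            continue
        emitted.add(name)
        n = counts[name]
        parts.append(f"{n} {name} nodes" if n > 1 else name)
    return ", ".join(parts)

# lyrics-generator's "review, review, review, ..." collapses:
format_node_list(["review"] * 8)
```
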

**Shape δ — rename diagnostics text-suppression.**

`cache.cross-workflow-rename-detected` continues to emit for JSON/raw consumers
(`analysis.warnings` unchanged, catalog count stays at 22), but CLI text no
longer renders rename diagnostics in `## Sub-workflow boundaries`. Empirical
trigger: 17 rename findings on lyrics-generator collapsed to 5 boundary entries
that looked actionable but were cache-fidelity false positives — variable names
are stripped before the provider wire format; cache hits depend on prose before
each cached value and the resolved value bytes.

Helpers removed: `view_helpers.group_renames_by_parent`,
`render_text._render_renames_for_parent`. Rename suggestion in the catalog
rewritten to state the warning is informational. `_CROSS_WORKFLOW_ALIGNMENT_IDS`
still includes rename so it stays out of Recommended actions.
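The text/JSON split can be sketched as a renderer-side filter. The ID set and warning shape here are illustrative assumptions; JSON/raw consumers keep reading `analysis.warnings` untouched:

```python
# Hypothetical sketch: suppress rename diagnostics in CLI text only.
TEXT_SUPPRESSED_IDS = {"cache.cross-workflow-rename-detected"}

def render_boundary_warnings(warnings: list[dict]) -> list[str]:
    """Text view for '## Sub-workflow boundaries': drop suppressed IDs.
    The JSON payload is produced elsewhere and still carries every finding."""
    return [
        f"- {w['id']}: {w['message']}"
        for w in warnings
        if w["id"] not in TEXT_SUPPRESSED_IDS
    ]

warnings = [
    {"id": "cache.cross-workflow-rename-detected", "message": "input renamed"},
    {"id": "cache.sub-workflow-cache-undeclared", "message": "no cache declared"},
]
lines = render_boundary_warnings(warnings)
```
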

**Tests:** +12 new tests (5 emission/dispatch, 3 Confidence gating, 2 footer,
4 cost-block fold, 1 SSoT helper), 8 existing tests migrated to new wording,
1 prose-mismatch nudge-count test added. `make test` default suite: 6,545
passed. Cache-analysis subset: 634 passed.

**Baselines:** 67/67 verify after regenerating 15 cases (lyrics-generator
captures + adjacent text/JSON cases hit by Fix 5+6/7/8 wording + Shape δ
rendering changes + 1 unrelated pre-existing LiteLLM pricing drift on
`04-warning-catalog/20-llm.thinking-temperature-mismatch`).

Carried forward (Stage 2 / data-model follow-ups):

- Stage 2: lift `cache.sub-workflow-cache-undeclared` cleanup-first
  suggestions to a section header preamble (waiting for the in-flight
  below-threshold demotion work).
- Text/JSON divergence: Fix 7's pass-through fold is renderer-only;
  `summary.no_cache_hypothetical_usd` JSON field still emits the priced-
  cohort value. Add `no_cache_with_passthrough_usd` (additive) or document
  the asymmetry in a follow-up.
Aligns the per-call cache report's `cacheable inputs:` note with the
Recommended actions section by switching from parent value expressions
to child input names (alphabetized). Restores parent→child data-flow
visibility by adding a `flows in from parent as ${expr}` sub-line under
each renamed input bullet inside the action body — scoped to where it's
actionable, not as a standalone warning section.

CrossWorkflowInputContribution: renamed `name` → `child_input_name`,
added `parent_value_expr: str | None`. JSON shape correspondingly
updated (same-minor 4.1 pre-merge shape correction).

Body-block render duplication eliminated via two helpers
(`_format_per_consumer_input_lines`, `_format_single_consumer_input_lines`)
sharing one rename-suffix helper (`_parent_origin_clause`). Satisfies
C901 without noqa.
@spinje spinje merged commit 666d847 into main May 11, 2026
9 checks passed