
feat: Prompt caching via declarative ## Cache block + pflow analyze-cache#378

Merged
spinje merged 156 commits into main from feat/prompt-caching on May 11, 2026

Conversation


spinje (Owner) commented May 7, 2026

Summary

Ships provider-level prompt caching as a declarative workflow surface, plus a new pflow analyze-cache command for static + trace-driven cache analysis. Verified end-to-end on real Anthropic + Gemini calls (-25% first-run, -73% rerun-within-TTL on the lyrics-generator smoke fixture).

The main thing to review is the analyzer contract, not the comparatively small provider-rendering seam: late PR and baseline sweeps turned pflow analyze-cache into the primary diagnostic surface for cache recommendations, trace fidelity, cross-workflow opportunities, and JSON/MCP consumers.

Task

159 — see .taskmaster/tasks/task_159/task-review.md for the comprehensive implementation review.

Changes

Workflow author surface

  • New top-level ## Cache markdown section: prose interleaved with ${var} chunks, optional - ttl: 5m|1h. Each chunk is a [prose-before-${var}][${var}] pair, parsed into ir["cache"] = {ttl, items: [{name, var, prose_before, _source_line}, ...]}.
  • Per-LLM-node fields: prompt_cache: [name1, name2, ...] (subset of ## Cache.items, declaration-order strict — out-of-order is a hard error) and prewarm: bool (gates auto batch-prefix marker insertion).
  • Existing cache: bool (memo opt-out) untouched and orthogonal — two cache layers, two distinct fields (DD#5).
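A minimal sketch of the section shape described above — the prose and variable names are illustrative, and real workflows may differ in surrounding structure:

````markdown
## Cache
- ttl: 1h

```cache
You are a lyrics assistant. Follow this style guide:
${style_guide}

Reference corpus:
${corpus}
```
````

A node opting in would then declare prompt_cache: [style_guide, corpus] in exactly this declaration order; listing corpus before style_guide is a hard error per the order invariant.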

Cache rendering at the LiteLLM adapter seam (src/pflow/core/llm_client.py)

  • complete() widened: system: str | None -> str | list[ContentBlock] | None; new user_message_blocks: list[dict] | None for batch-prefix paths.
  • Per-provider TTL translation in core/cache_render.py::_build_cache_control_marker (Anthropic native, Gemini -> cachedContents, OpenAI -> prompt_cache_retention).
  • Memo-cache-hash correctness (DD#19): compute_node_config(prompt_cache_content=) keyword-only kwarg; nodes that don't opt in keep their existing hash.
  • Auto batch-prefix caching gated on prewarm: true; render/hash byte parity is guarded by golden config hashes and NamespacedSharedStore production-shape tests.
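As a sketch of the per-provider TTL translation above — a hypothetical helper mirroring the mapping, not the actual core/cache_render.py code; the returned field shapes are assumptions:

```python
# Hypothetical sketch of per-provider TTL translation; field shapes are
# illustrative assumptions, not the real wire formats pflow emits.
def build_cache_control_marker(provider: str, ttl: str) -> dict:
    if provider == "anthropic":
        # Anthropic accepts cache_control markers natively on content blocks.
        return {"cache_control": {"type": "ephemeral", "ttl": ttl}}
    if provider == "gemini":
        # Gemini routes through cachedContents with an explicit TTL.
        return {"cached_content_ttl": ttl}
    if provider == "openai":
        # OpenAI exposes a retention setting rather than an inline marker.
        return {"prompt_cache_retention": ttl}
    raise ValueError(f"unknown provider: {provider}")
```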

pflow analyze-cache — new CLI + MCP command, full text + JSON parity

  • 4 output modes: greenfield / steady-state / already-optimal / --from-trace.
  • Evidence-only token estimation for cacheable bytes: the old fabricated Tier-3 cacheable heuristic was deleted. Unresolved declared chunks return (None, "unavailable") instead of a guessed value.
  • 4-state cost (trace / trace_partial / recomputed / unavailable) — never emits $0.00 for unavailable.
  • Full-validator unification: analyze-cache now surfaces execution-blocking non-cache validation errors via blocking_errors[] / other_blocking_errors[] instead of silently filtering them out.
  • Cross-workflow walker (Tier 2): prose mismatch, value-flow opportunities, grouped child-workflow recommendations, and JSON/raw rename diagnostics. Rename diagnostics are intentionally suppressed from text because variable names do not reach the provider wire.
  • Per-call projections now distinguish batch_prefix and cross_workflow_projection from ordinary parameter/memo/trace evidence; this is load-bearing for dynamic batch and parent->child cache opportunities.
  • Trace coverage is none / complete / truncated; successful conditional runs with unreached rows are still complete. Trace-dependent warning suppression is catalog-driven via CacheWarningSpec.requires_complete_trace, not severity.
  • 22 stable catalog entries at core/cache_analysis/warning_catalog.py (21 cache.* + llm.thinking-temperature-mismatch; DD#29 design review for additions).
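The catalog-driven suppression rule can be sketched as follows — the CacheWarningSpec field name follows the description above, but the catalog entries and their flags here are illustrative, not the real 22-entry table:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CacheWarningSpec:
    id: str
    requires_complete_trace: bool

# Illustrative subset of the catalog; flags are assumed for the example.
CATALOG = {
    "cache.unused-chunk": CacheWarningSpec("cache.unused-chunk", False),
    "cache.discrepancy": CacheWarningSpec("cache.discrepancy", True),
}

def keep_warning(warning_id: str, trace_coverage: str) -> bool:
    # Suppression keys off the per-spec flag, never off severity.
    spec = CATALOG[warning_id]
    return not (spec.requires_complete_trace and trace_coverage != "complete")
```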

Trace format

  • 2.0.0 -> 2.1.0 -> 2.2.0 (additive). New fields: workflow_path, per-event cache_key / cache_source / cache_age_sec / cache_chunks_skipped, cache-render snapshots, sub-workflow cost rollup, and canonical child workflow paths for dynamic workflow batches.
  • Filename schema bumped to identity-encoded (workflow-trace-{wf_hash}-{safe_name}-{timestamp}.json) — O(matches) glob lookup vs. O(N) scan-and-parse. Prior schema scaled to ~14s per analyze-cache invocation on machines with 67k+ traces; now <1s.
  • Forward-compat: existing 2.0.0 readers gate on format_version.startswith("2."); auto-load skips 2.0.0 (no workflow_path) but --from-trace <path> still works.
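The identity-encoded filename win can be illustrated with a simplified matcher — a hypothetical analogue of the glob lookup, operating on names instead of a directory:

```python
# Mirrors the O(matches) glob workflow-trace-{wf_hash}-*.json: the hash
# prefix narrows candidates without opening and parsing every trace file,
# which is what made the old schema scale as O(N) scan-and-parse.
def match_traces(filenames: list[str], wf_hash: str) -> list[str]:
    prefix = f"workflow-trace-{wf_hash}-"
    return sorted(n for n in filenames if n.startswith(prefix) and n.endswith(".json"))
```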

JSON output schema: 1.0 -> 1.1 -> 2.0 -> 2.1 -> 4.0 -> 4.1. format_version is the first key. JSON 4.1 includes other_blocking_errors[], summary.suggested_run_command, typed per_node_thresholds[].model_state, per_call[].cross_workflow_inputs, and cacheable evidence tiers batch_prefix / cross_workflow_projection.
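A consumer following that gating policy might look like this minimal sketch (function name is illustrative):

```python
def accept_report(report: dict) -> bool:
    # Gate on the major-version prefix; format_version is the first key
    # in the emitted JSON, so cheap early checks are possible too.
    return str(report.get("format_version", "")).startswith("4.")
```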

Architectural fix — Path 1 boundary contract

Companion fix — GH #357 (memo cache key line-shift)

  • pflow save rewrites frontmatter on every invocation, shifting source lines and invalidating cache_key. Fix: _METADATA_KEY_SUFFIXES = ("_source_line", "_source_lines", "_source_files") filter at runtime/cache.py::compute_node_cache_key. Pre-existing pflow bug; fixed inline because shipping caching without it would surprise the first user who saves a ## Cache workflow.
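The suffix filter can be sketched as a simplified analogue of the compute_node_cache_key filtering — recursion shape and non-dict handling here are assumptions:

```python
_METADATA_KEY_SUFFIXES = ("_source_line", "_source_lines", "_source_files")

def strip_metadata_keys(config: dict) -> dict:
    # Drop parser positional metadata recursively, so frontmatter rewrites
    # that shift source lines cannot change the memo cache key.
    return {
        key: strip_metadata_keys(value) if isinstance(value, dict) else value
        for key, value in config.items()
        if not key.endswith(_METADATA_KEY_SUFFIXES)
    }
```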

Diagnostic infrastructure

  • Diagnostic.id: str | None field added (top-level stable IDs per DD#27).
  • Identity tuple updated to (severity, source, node_id, id or message) — when id is set, it's the dedup key.
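The null-safe dedup key can be sketched as:

```python
def identity_key(severity: str, source: str, node_id, diag_id, message: str) -> tuple:
    # When a stable id is set it becomes the dedup key; legacy diagnostics
    # with id=None keep the original message-keyed behavior.
    return (severity, source, node_id, diag_id or message)
```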

NamespacedSharedStore now inherits MutableMapping (was duck-typed but not type-tagged). TemplateResolver._get_dict_value now checks Mapping not dict. Closes a silent stale-cache class where dotted-path chunks were filtered as _CHUNK_ABSENT in production.
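The Mapping-based check can be sketched as follows — simplified, since the real TemplateResolver does dotted-path traversal; the helper name is illustrative:

```python
from collections.abc import Mapping

def get_dict_value(container: object, key: str):
    # Checking Mapping (not dict) lets MutableMapping-based stores such as
    # a namespaced shared store pass; a bare dict check filtered them out
    # silently, which was the stale-cache class described above.
    if isinstance(container, Mapping):
        return container.get(key)
    return None
```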

Explanation

Most production risk is in the analyzer (src/pflow/core/cache_analysis/), not the provider rendering hot path. The provider rendering seam is comparatively small; the analyzer owns recommendations, cross-workflow traversal, trace evidence, cost semantics, JSON/MCP shape, and agent-facing text.

The implementation arc: 4 segments (foundations -> memo-hash gate -> rendering+prewarm+trace -> analyzer+docs) -> Stage 1 lyrics-generator verification -> Stage A/0/B/C data-model redesign (recommended_actions + cross_workflow.* derived from analysis.warnings) -> Stage 2 paid verification -> PR #378 review fixes -> baseline/polish sweeps.

The progress log is now 11,777 lines at .taskmaster/tasks/task_159/implementation/implementation-progress-log.md. For reviewer orientation, prefer .taskmaster/tasks/task_159/task-review.md; use the progress log only for forensic implementation history.

This is a large PR (700+ files, mostly task docs/baselines/tests). Exact file/commit counts may drift while final docs and baselines are pushed; the stable review surface is the task review, baseline harness, and tests listed below.

Created Docs

  • .taskmaster/tasks/task_159/task-159.md — spec (~617 lines, 37 design decisions, full warning ID catalog, output format reference)
  • .taskmaster/tasks/task_159/task-review.md — primary future-agent navigation artifact (file tour, integration points, patterns, anti-patterns, common pitfalls, Task 160 baseline guardrail)
  • .taskmaster/tasks/task_159/implementation/implementation-plan.md — full implementation plan
  • .taskmaster/tasks/task_159/implementation/implementation-progress-log.md — segment-by-segment progress log (11,777 lines)
  • .taskmaster/tasks/task_159/baseline/ — analyzer output oracle used during development and intended as a Task 160 refactor safeguard

User-facing documentation:

  • src/pflow/guide/features/caching.md — new pflow guide caching topic covering two-cache-layer model, ## Cache syntax, order invariant, TTL opt-in, sub-workflows, provider-specific notes, full warning ID catalog
  • docs/guides/debugging.mdx, docs/reference/cli/index.mdx, docs/reference/nodes/llm.mdx — caching cross-references and CLI help updates
  • src/pflow/core/cache_analysis/CLAUDE.md — current-state tacit knowledge for the analyzer package
  • Multiple CLAUDE.md updates across CLI, core, runtime, execution, nodes, and tests

Breaking Changes

These are not strictly breaking for end-user workflows (no users yet — see CLAUDE.md), but consumers of internal APIs should be aware:

  • Diagnostic identity tuple changed from (severity, source, node_id, message) to (severity, source, node_id, id or message). When id is set, it's the dedup key.
  • Adapter complete() signature: system: str | None -> system: str | list[ContentBlock] | None; new user_message_blocks: list[dict] | None kwarg.
  • compute_node_config, apply_memo_hit, write_memo_cache, handle_cached_execution widened with keyword-only kwargs (prompt_cache_content, node_type_name, cache_key, created_at, cache_source). Production callers updated; future direct callers must pass these.
  • WorkflowTraceCollector(*, workflow_path=None) — new keyword-only kwarg.
  • NamespacedSharedStore now inherits MutableMapping; __contains__(key: object) (ABC requirement) instead of key: str.
  • LLM usage telemetry now preserves has_cache_telemetry, uncached_input_tokens, and input_token_accounting; absence vs explicit zero is load-bearing.
  • JSON output is now major version 4. Consumers should gate on format_version.startswith("4.") and dispatch on typed fields like model_state rather than sentinel strings.
  • per_call[].cacheable_data_source changed: cacheable evidence no longer emits the old fabricated estimator tier; valid tiers include trace, memo, parameters, batch_prefix, cross_workflow_projection, unavailable.
  • pflow save byte-for-byte round-trips ## Cache blocks. Existing memo cache entries written under the old line-shift-corrupted cache key become unreachable; expire naturally via 24h TTL.
  • --no-cache flag scope clarified: disables pflow's local memoization only. Does NOT disable LLM provider prompt caching.
  • Trace filename schema: existing pre-fix traces silently bypass autoload (no hash prefix). Users with valuable old traces use pflow analyze-cache --from-trace <path>.

Testing

Latest logged state: default suite 6,592 passing, 1 skipped, 3 deselected; near-full sandbox-safe suite 6,562 passing, 1 skipped; baseline harness 67/67 passing after latest regeneration. make check was clean in the logged passes, with known sandbox uv caveats documented in the progress log.

Critical regression gates (do not delete or weaken — see task-review.md "Test Files" table for full list):

  • tests/test_runtime/fixtures/golden_config_hashes.json + test_prompt_cache_hash.py::test_golden_baseline_hashes_match — DD#19 byte-identity of compute_node_config across 30 nodes / 9 workflows.
  • test_plan_drift.py::test_plan_matches_engine_for_workflow_with_prompt_cache — planner <-> engine parity for cache-using workflows.
  • test_prompt_cache_rendering.py::test_resolve_chunk_value_is_imported_locally_at_both_sites — hash-side and prep-side import the SAME helper as local module bindings.
  • test_prompt_cache_rendering.py::test_hash_render_and_prep_render_byte_equivalent_through_namespaced_store — DD#19 production-shape (NamespacedSharedStore wrap + dotted path).
  • test_workflow_resolver_contract.py::test_resolve_workflow_returns_fully_file_resolved_ir and test_sub_workflow_resolver.py::test_resolve_sub_workflow_cross_workflow_walker_sees_resolved_prompts — Path 1 boundary contract.
  • test_trace_format_2_1.py::test_handle_cached_execution_does_not_overwrite_memo_cache_source — memo-hit cache_source="memo" must not be overwritten to "in_process".
  • test_cache_analysis_per_id_coverage.py — every catalog ID round-trips through JSON with id at top level.
  • test_cache_analysis_per_id_emission.py — production-shape warning emission tests via analyze(...) with real memo/trace fixtures.
  • JSON/version/error-envelope tests — format_version remains first key; JSON error paths emit parseable envelopes.
  • Trace-filtering tests — warning suppression is driven by CacheWarningSpec.requires_complete_trace, not severity.
  • batch_prefix / cross_workflow_projection tests — protect per-call projection tiers and no-double-counting behavior.
  • Task 159 baseline harness — output oracle for analyzer text/JSON behavior; Task 160 should run it before/after each refactor slice and treat unexpected drift as a stop signal.

End-to-end paid verification (Stage 2):

  • Anthropic Sonnet 4.5 smoke: cache_creation/cache_read observed; first-run -25%, rerun within TTL -73%.
  • Gemini smoke: cost-projection accuracy verified within +/-5%.
  • lyrics-generator workflows drove most analyzer hardening: file-resolution boundary, cross-workflow attribution, trace coverage, baseline output, and agent-UX polish.

Adversarial/baseline verification caught real bugs

  • cache_source mislabel (memo reported as in_process).
  • NamespacedSharedStore type taxonomy: dotted-path chunks silently dropped.
  • Trace events vs nodes typo: Tier-1 trace data unreachable.
  • Validator errors filtered out of analyze-cache.
  • Cache ratio > 100%; -$0.00/run; per-call vs cohort token-unit mistakes.
  • Dynamic workflow child rows misattributed because traces lacked canonical child paths.
  • Baseline sweeps caught multiple agent-UX regressions that unit tests did not.

Follow-up issues filed

#357 (closed inline — line-shift breaking memo), #358 (image cache support), #359 (LiteLLM cost-map fetch warnings to stderr), #360 (dynamic batch size cost undercount), #361 (Path 2 architectural cleanup), #362 (cross-workflow rename suppression — closed), #363 (shared prose detection), #364 (multi-walker consolidation in trace event traversal), #365 (sub-workflow cost rollup — resolved), #371 (test slowdown — closed), #372 (remaining perf), #373 (near-threshold expansion hints for SuggestedBlock), #374 (__warnings__ channel modernization), #375 (mock fidelity), #376 (child rollup confidence gap), #377 (split partial_cost_usd semantics), #383 (repeated non-batch prompt-shape detection).

Task 160 (cache analysis architectural refactor) is filed as a follow-up task. Its main guardrail should be the Task 159 baseline harness plus the focused cache-analysis tests above.

spinje added 30 commits April 27, 2026 12:00
…n, DDs 26-36

Synthesize the analyze-cache output format into the spec contract and lock
in the architectural decisions surfaced this session.

Spec changes (task-159.md):
- DDs #26-36 added: Tier 2 in-by-default; Diagnostic id field extension;
  no FixAction (mypy is the right analog, not rustc); closed warning ID
  catalog of 9 entries; four-level confidence labeling; savings-ratio
  prewarm thresholds; auto-load trace from ~/.pflow/debug/; optional
  inputs; three-tier validation/analysis architecture (pflow run stays
  fast/deterministic, --dry-run does opt-in expensive analysis,
  analyze-cache is the dedicated full report).
- New requirements subsections: Stable Warning ID Catalog, Output Format
  (Text + JSON with full mockups), Confidence Labeling Algorithm,
  Cross-Workflow Walker (Tier 2), Token Estimation Strategy, Per-Model
  Capabilities Table, Diagnostic Extension.
- Verification pass corrections: complete() signature extension required
  (cache feature is first to need structured system content blocks);
  compute_node_config uses key "batch" not "batch_config"; LLMNode lacks
  NodeConfig access (auto batch-prefix detection moves to batch_executor).
- All cache analytical findings are advisory; only structural errors
  (cache.order-mismatch, reference-resolution) block pflow run.

Progress log (§26): journey + three principle clarifications driven by
user pushback (FixAction overlap with suggestions; trace-vs-memo data
source conflation; savings-ratio threshold reasoning), verification pass
results with three surprises, seven open threads for plan-writing.

Braindump updates: drop outdated Tier 2 / prewarm framings (now resolved
via DD#26 / DD#36). New supplementary braindump captures tacit knowledge
not in other docs (user pushback patterns, DD interactions, hedged
claims, process recommendations).

v3 output-format draft deprecated — contract content folded into spec;
v3 retained as historical artifact with deprecation header.
…olidation, verification corrections

Resolves the DD#18 vs DD#33/DD#36 contradiction left from §26 (analytical
findings never block pflow run; advisory only). Locks in research-backed
decisions: adapter signature widening (system: str | list[ContentBlock]),
TTL translation table (Anthropic vs Gemini wire formats), source=
"cache_analyzer" diagnostic convention, 2.0.0-trace auto-load skip,
cross-workflow rename precedence, sparse-memo aggregate coverage.

Replaces two starting-context braindumps with single agent-handoff.md
capturing only what isn't in the spec or progress log (user style,
methodology, concrete substrate, watch-fors, plan-writer gotchas).

Final codebase-verification round caught five wrong claims now corrected:
- Anthropic does not accept ttl: "5m" explicitly (only "1h" or omit)
- Phase D option (c) rejected — batch_executor.py doesn't do per-item
  template resolution; option (a) is now recommended
- Function name was wrong (_attach_llm_call_to_event → _add_llm_data)
- Co-edit invariant mischaracterized (it's three places, not two)
- scratchpads/task-158-spike/ was cleaned up; pattern only

Progress log gains §27 (pre-plan-writing consistency pass), §28 (TTL
clarification + cost-degradation gap), §29 (verification round). Spec is
contract-ready for Phase B-G plan-writing.
…ion + DD#37

Verification spike caught Task 158 §27 misdiagnosis: "Opus 4.5 + thinking +
cache fails" was actually below-4096-token-threshold silent no-op. Anthropic
docs (April 2026) + spike confirm per-version minimums: Opus 4.5/4.6/4.7 +
Haiku 4.5 = 4096; Sonnet 4.6 = 2048; older Sonnet/Opus = 1024. New DD#37
locks OpenAI prompt_cache_retention mapping. Boundaries cleaned: spec holds
contract values, progress log §30 holds journey, handoff holds plan-writer
operational context.
7 Critical + 8 High-Priority + 7 Medium fixes encoded from round-3 8-agent review.
1 finding disputed (validate_data_flow reach — verified correct as planned).

Critical:
- D.1/D.2 unsafe defensive-read pattern (round-2 claim of universal coverage was false)
- F1 cache.discrepancy row had unresolvable {root_cause_action} placeholder
- B2.2 false claim: validator step 8 cannot reach top-level node keys
- ABSENT-branch hash-vs-prep symmetry (silent-stale-cache class) via _CHUNK_ABSENT sentinel
- D.2 prewarm execution made structurally concrete with index-alignment guarantees
- cache_chunks_skipped propagated through error path + memo round-trip
- 40+ WorkflowTraceCollector test sites: dropped strict assertion, added integration test

High-Priority:
- --no-trace-autoload rename (collision with pflow run --no-trace)
- storage_mode: shared × ## Cache documented as v1 limitation
- OpenAI 15 RPM batch routing documented; v1.x follow-up
- B3 in-memory mutation test (baseline can't include post-task fields)
- F3.1 exit code policy: ERROR severity in warnings[], exit 0 (advisory)
- Engine save/restore simplified to single try/finally; _EMPTY_CACHE_RENDER hoisted

Catalog: 10 → 12 entries (cache.invalid-on-non-llm, cache.prewarm-no-prefix added).

Plan: 1290 → 1445 lines.
Targeted Round 4 review (5 agents) caught pseudo-code-precision bugs in
Round-3-introduced code (wrong function signatures, wrong symbol paths,
references to non-existent classes, count drift). All verified against
actual code via direct Read before encoding fixes.

Verified pseudo-code corrections:
- process_item returns 5-tuple; D.2 pseudo-code now destructures correctly
- NodeStatus.ABSENT (not node_state.ABSENT)
- TemplateResolutionError doesn't exist; use ValueError
- extract_root_node_id always returns str; dropped dead None check
- Diagnostic identity tuple; multi-field rejection collapse fixed (V6)

V5 fix (top-10% pattern: one rule per error condition):
- Schema is single source for shape; _validate_cache_block does only
  semantics + defensive isinstance skip on compile path
- Removed obsolete "two diagnostics that dedupe" test

V6 fix (top-10% pattern: one diagnostic per [rule, location]):
- cache.invalid-on-non-llm now ONE Diagnostic per node with
  context["invalid_fields"]: list[str]; matches mypy/ruff convention

Five high-value additions (top-10% / simplest-final-code lens):
1. _resolve_static_prefix_for_cache companion helper — byte-identical
   resolution across all three cache paths (chunk/static-prefix/analyzer)
2. cache.discrepancy structured context["root_cause_action"] payload —
   agents dispatch on typed data, not prose
3. JSON format_version evolution policy — JSON_FORMAT_VERSION_MAJOR
   constant + consumer rule format_version.startswith("1.")
4. Compile-path defensive isinstance guards enumerated (bool subclasses
   int — must use isinstance(x, bool) not (int, bool))
5. Spike contingencies subsection — explicit table mapping each spike
   to the encoded plan decision it tests

Plan: 1445 -> 1813 lines.
Catalog count drift defense: EXPECTED_CATALOG_COUNT constant; tests
iterate CACHE_WARNING_CATALOG.keys() instead of hardcoding numbers.
Minimal additions — preserves existing Round 1+2 sections intact, adds:

- Header + "Where I am" updated for plan v4 (1813 lines, 12 catalog entries,
  4 review passes deep)
- New "What §33 / §34 don't capture" section: 7 fix-shape WHYs not in progress log
  (V6 combined diagnostic, V5 single-source-of-truth, _resolve_static_prefix_for_cache
  shape, cache.discrepancy dict payload choice, EXPECTED_CATALOG_COUNT pattern,
  skipped verification-only pass, "verify against actual code" methodology shift)
- User's load-bearing principles extended with §33 + §34 quotes (top-10% lens,
  read-the-actual-code instruction, diminishing-returns analysis)
- Resolved/Still-open lists updated through Round 4 (16 newly-resolved items
  enumerated; spike-gated items separated from Round-4-still-open)
- "For the next agent" section updated with Round 4 / 5-additions context,
  Spike contingencies table reference, current §35 recording target
- Code anchors expanded with Round 4 verifications (process_item shape,
  NodeStatus enum, extract_root_node_id signature, plan_node ordering,
  compile_validation reach)

Test: "Could the next agent find this by reading files?" — applied to every
addition. Tacit-knowledge sections capture WHY decisions were made, not WHAT.
Round 5 (5 targeted agents) closed layer-placement + defensive-bypass
bugs introduced by Round 4: shared cache helpers moved to new
`core/cache_render.py` (verified `nodes/llm/llm.py` only imports
`pflow.core.*`); `_CHUNK_ABSENT` defense relocated from sentinel
`__repr__/__str__` (verified bypassed by `_make_serializable` at
`cache.py:25-51`) to explicit isinstance branch; `cache_chunks_skipped`
error-path injection moved from builder signatures to caller wrap sites;
`cache.order-mismatch` formatting fixed (Python `str(list)` vs spec
bare-identifier form); `MockLLMClient.set_response` cache-tokens
parallel-dict pattern fully spec'd.

Round 6 (4 agents) caught factual errors in Round 5 fixes:
`tuple("string")` silent-splat (iterable but wrong shape); `apply_memo_hit`
3rd caller at `execution/plan.py:862` missed; `_PROPAGATED_KEYS` 5→7
entry correction; F2 confidence aggregation flipped permissive→STRICT
per DD#34 line 634 verbatim; `cache_chunks_skipped` 4th wrap site at
`post():511` LLMResponseParseError; V6 sub-workflow dedup test marked
`xfail` because `format_child_provenance` modifies `Diagnostic.message`.

Diminishing-returns analysis run jointly with user — bug class shifted
from architecture (Round 1) → concurrency (Round 2) → correctness gaps
(Round 3) → pseudo-code precision (Round 4) → layer/defense (Round 5)
→ factual errors (Round 6). Cost-saved-per-fix dropped ~2hr → ~30min.
Stop decision: switch to code-review per phase merge from here.

Plan summary section collapsed (330 lines of fix-history → 10-line
pointer). Journey moved to progress-log §35. Braindump updated with
Round 5+6 tacit knowledge per /braindump skill rules (don't repeat
what's already written; capture only what isn't in other docs).
Three new artifacts to manage the Task 159 implementation handoff:

1. `implementing-agent-brief.md` — self-contained brief for the
   implementing agent. Specifies the 4-segment split (Foundations,
   Memo-hash gate, Rendering+Trace, Analyzer+Docs) with mandatory
   STOP-and-log behavior at each firebreak. User decides handoff vs
   continue at each stop.

2. `spike-runner-brief.md` — self-contained brief for the spike-running
   agent. Covers the 3 pre-authorized paid spikes (~$0.30) with concrete
   protocols, expected outcomes, plan-update instructions if outcomes
   contradict the encoded Spike Contingencies decisions.

3. `implementation-progress-log.md` (stub) — target file for implementing
   agents to append segment entries at each firebreak.

Plus a new Claude Code skill at `.claude/skills/plan-breakdown/SKILL.md`
for analyzing implementation plans and recommending optimal handoff
breakpoints based on size + tacit-knowledge dependency. Generalizes the
analysis we did for Task 159 into a reusable pattern.
Three pre-authorized paid spikes ran 2026-04-29 (~$0.04 actual vs $0.30
budget). Outcomes recorded as progress-log §36.

Spike 1 (Gemini explicit cache_control): scenario A AMBIGUOUS leaning
CONFIRM via a 1b disambiguator (no-marker control proved the marker does
real work, but LiteLLM's Vertex telemetry surfaces only reads — no
cache_creation_input_tokens field). Scenario B CONFIRMS (multi-marker
request succeeds). F2 plan update applied: Gemini info-note in analyze.py
documenting the reads-only telemetry shape.

Spike 2 (OpenAI prompt_cache_key parallel-batch routing): CONFIRMS, 6/6
parallel calls hit cache after warm-up. No plan update.

Spike 3 (Anthropic per-TTL pricing): CONTRADICTS severely — 5m write
priced correctly but 1h cache-write cost is missing entirely from
litellm.completion_cost (ratio 0.0083 vs expected ~1.6). E.1 plan update
applied: 1h-TTL normalization override at llm_client.py:776-784,
Anthropic-gated, with explicit test cases pinned to spike actuals.

Doc updates trim the planner-era "go run the spikes" framing across
brief, braindump, and plan; spikes are now past-tense audit material,
the plan is the implementer's single source of truth.
Add stable warning identifier (Diagnostic.id) for cache-namespaced diagnostics
per DD#27. Identity tuple becomes (severity, source, node_id, id or message);
null-safe by construction so legacy diagnostics with id=None preserve the
existing message-keyed dedup behavior.

Add three cache category constants (cache_failure / cache_warning /
cache_advisory) and corresponding CATEGORY_TITLES entries for renderer title
lookup. _FAILURE_CATEGORY_MAP intentionally NOT extended in v1 — no typed
cache exception flows through __failures__ so adding the entry would create
dead code; deferred to v1.x.

Extend diagnostic_render._format_warning_or_info_diagnostic to dispatch on
cache categories: render [id] inline prefix and surface a closed list of
context keys (savings_pct, savings_usd, batch_size, prefix_tokens_estimated,
target_file) so agents reading text output can route on stable warning IDs
without parsing JSON.

Update the load-bearing SSoT comment in core/CLAUDE.md to document the new
identity tuple shape and null-safety contract.
New dependency-free module mirroring core/llm_providers.py shape: frozen
ModelCapability dataclass, module-level MODEL_CAPABILITIES tuple, and the
get_min_cache_tokens(model) lookup used by cache.below-min-tokens warning
emission and auto-batch-prefix detection.

Per DD#32, Anthropic minimums are version-specific: Sonnet 4.5 / Opus 4.1 /
Opus 4 / Sonnet 4 / Sonnet 3.7 -> 1024; Sonnet 4.6 / Haiku 3.5 -> 2048;
Opus 4.7 / Opus 4.6 / Opus 4.5 / Haiku 4.5 -> 4096. OpenAI auto-cache fires at
1024 across all families. Gemini explicit cachedContents requires ~4k.
Unknown / unprefixed models fall back to the conservative floor (4096) to
prevent silent cache-warning suppression.

Lookup uses most-specific-prefix-wins so dated version suffixes like
``claude-sonnet-4-5-20240620`` correctly route to the family threshold.
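The lookup can be sketched as follows — the threshold table is an illustrative subset of the values listed above, not the real capabilities module:

```python
# Illustrative subset of the per-model minimum-cacheable-token table.
MODEL_MIN_CACHE_TOKENS = {
    "claude-sonnet-4-6": 2048,
    "claude-sonnet-4-5": 1024,
    "claude-haiku-4-5": 4096,
}
_CONSERVATIVE_FLOOR = 4096  # unknown models fall back to the strictest floor

def get_min_cache_tokens(model: str) -> int:
    # Most-specific-prefix-wins: dated version suffixes still route to
    # their family threshold rather than the conservative fallback.
    best_prefix, result = "", _CONSERVATIVE_FLOOR
    for prefix, minimum in MODEL_MIN_CACHE_TOKENS.items():
        if model.startswith(prefix) and len(prefix) > len(best_prefix):
            best_prefix, result = prefix, minimum
    return result
```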
Extend the markdown parser state machine to recognize a structurally novel
section mode: ``## Cache`` carries section-level YAML params (only ``- ttl:``
in v1) plus a single tagged ``cache`` code block, with NO ``### entities``.

Add _SectionType.CACHE (intentionally NOT in _KNOWN_SECTIONS so the
orphan-content rules don't fire), _CacheChunk / _CacheSection collectors,
section-level YAML branch (validates `- ttl: 5m | 1h`, rejects other keys),
code-fence-close branch wired to _attach_cache_code_block.

The new _parse_cache_code_block helper splits the cache block content into
``[prose-before-${var}][${var}]`` chunks via _CACHE_TEMPLATE_RE; trailing
prose after the last ${var} is silently discarded per the contract; empty
or prose-only blocks raise. Chunk identifiers are the raw ${...} content
verbatim — downstream root-extraction in B2.3 uses
TemplateResolver.extract_root_node_id for bracket-syntax handling.
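The chunk split can be sketched as a hypothetical analogue of _CACHE_TEMPLATE_RE and _parse_cache_code_block — error handling and the per-chunk shape are simplified:

```python
import re

# Simplified stand-in for _CACHE_TEMPLATE_RE; real chunk items also carry
# name and _source_line metadata.
_CACHE_TEMPLATE_RE = re.compile(r"\$\{([^}]+)\}")

def parse_cache_chunks(content: str) -> list[dict]:
    chunks, pos = [], 0
    for match in _CACHE_TEMPLATE_RE.finditer(content):
        chunks.append({
            "var": match.group(1),                 # raw ${...} content, verbatim
            "prose_before": content[pos:match.start()],
        })
        pos = match.end()
    # Trailing prose after the last ${var} is silently discarded per the
    # contract; empty or prose-only blocks raise.
    if not chunks:
        raise ValueError("cache block must contain at least one ${var}")
    return chunks
```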

_build_node_dict now extracts ``prompt_cache:`` and ``prewarm:`` to TOP-LEVEL
node keys (siblings of ``cache``, ``batch``, ``type``), not inside
``node["params"]``. This is required for B2.2's per-node schema check to
see them AND for B2.3's ``cache.invalid-on-non-llm`` rule (the validator's
step 8 only iterates ``node["params"]``, so top-level placement is the
load-bearing reach for the non-LLM-node check).

ir["cache"] is built at Phase 4 with shape {ttl, items: [{name, var,
prose_before, _source_line}], _source_line}; matches the schema added in B2.2.
…prewarm

Add top-level ``cache`` field to FLOW_IR_SCHEMA: object with required
``items`` array (per-chunk shape: {name, var, prose_before, _source_line}),
optional ``ttl`` enum {"5m", "1h"}, optional ``_source_line`` parser metadata.
``additionalProperties: False`` on both the cache object and per-item objects
so typos like ``cahe`` and stray fields are caught at schema validation.

Add ``prompt_cache: list[str]`` and ``prewarm: bool`` as additive per-node
properties. Schema accepts them on EVERY node type — semantic
``cache.invalid-on-non-llm`` rejection lives in core/workflow/data_flow.py
(B2.3), not here. Per the V5 fix from plan Round 4: schema is single source
of truth for SHAPE; data_flow does only SEMANTICS. Top-10% pattern (mypy /
ruff): one rule per error condition, not belt-and-suspenders. Without the
single-source rule, save-path validation would emit two diagnostics for
the same shape error.

The existing per-node ``additionalProperties: False`` invariant catches typos
like ``prompt_cahe`` automatically — no extra suggestion infrastructure needed.
…ata_flow

Add _validate_cache_block to validate_data_flow's tail. Both validation entry
points (WorkflowValidator.validate at save-time + compile_validation at
compile-time) reach this via the shared call site.

STEP ordering is load-bearing per the V5 fix from plan Round 5:

  STEP 1 (FIRST): non-LLM rejection. Walks every node; for nodes whose type
  is a string other than "llm" that declare prompt_cache: or prewarm:, emits ONE
  ``cache.invalid-on-non-llm`` Diagnostic per node listing ALL invalid fields
  in context["invalid_fields"] (V6 combined-diagnostic shape — not one
  diagnostic per field). Shape-agnostic: a malformed prompt_cache: 5 on a
  type: shell node still emits the structured rejection rather than silently
  log-skipping. isinstance(type, str) gate prevents firing against
  type: ["llm"] (a structural error) when the real issue is the type field
  itself, not its value vs llm.

  STEP 2: defensive shape skip on the compile path. Surviving LLM nodes with
  malformed prompt_cache (non-list / list-of-non-strings) or prewarm
  (non-bool — note bool is a subclass of int; isinstance(x, bool) is required
  to reject prewarm: 1) are logger.warning'd and skipped. The schema-validator
  path catches these at step 1 and short-circuits; the compile path bypasses
  jsonschema, so we MUST guard here. Per V5: schema is single source of truth
  for shape; data_flow emits ZERO diagnostics for shape errors. The deeper
  compile error (CompilationError on CacheBlockIR construction) surfaces.

  STEP 3a: per-node prompt_cache: semantic checks. Each chunk must resolve
  to a declared cache item (find_similar_items "Did you mean?" hint on miss);
  declaration order must match (cache.order-mismatch ERROR with spec-locked
  four-line message format using bare-identifier bracketed lists, NOT Python
  repr quoted form).

  STEP 3b: top-level chunk-var resolution + batch-scoped rejection
  (${item.X} and descendants reject with no catalog id — flows through
  generic validation pipeline) + cache.unused-chunk WARNING for declared
  chunks no node references.
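The prewarm shape guard in STEP 2 hinges on a Python quirk worth spelling out: `bool` is a subclass of `int`, so a naive integer-aware check accepts `prewarm: 1`. A standalone demo (function name hypothetical):

```python
def is_valid_prewarm(value: object) -> bool:
    """Shape check for prewarm:. bool is a subclass of int in Python,
    so anything looser than isinstance(value, bool) would wrongly
    accept prewarm: 1."""
    return isinstance(value, bool)

# True is an int too, which is exactly the trap:
assert isinstance(True, int)
assert is_valid_prewarm(True)
assert not is_valid_prewarm(1)  # truthy int must be rejected
```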

Top-level cache block defense is split-scope (Round 5): when ## Cache shape
is malformed, emit ONE logger.warning and skip top-level cross-references,
but the per-node cache.invalid-on-non-llm STILL runs since it's independent
of top-level cache shape.
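STEP 1's combined-diagnostic shape, sketched with a plain-dict stand-in for the real Diagnostic type. The function name and dict keys here are illustrative; only the catalog id comes from the text above:

```python
def reject_cache_fields_on_non_llm(nodes: list[dict]) -> list[dict]:
    """Sketch of STEP 1: one combined diagnostic per offending node.

    `nodes` is a simplified stand-in for the workflow IR's node list;
    the real Diagnostic type carries more fields than this dict.
    """
    diagnostics = []
    for node in nodes:
        node_type = node.get("type")
        # isinstance gate: type: ["llm"] is a structural error elsewhere,
        # so this rule must not also fire against it.
        if not isinstance(node_type, str) or node_type == "llm":
            continue
        invalid = [f for f in ("prompt_cache", "prewarm") if f in node]
        if invalid:
            diagnostics.append({
                "id": "cache.invalid-on-non-llm",
                "node_id": node.get("id"),
                # V6 shape: ALL invalid fields in one diagnostic, not one each
                "context": {"invalid_fields": invalid},
            })
    return diagnostics
```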

The see_also guide-topic pointer is intentionally OMITTED here — the caching
topic is added in Phase G; the repo-wide
test_all_see_also_literals_resolve_to_real_guide_topics enforces topic
existence. G.2 will wire the references back in.

V6 sub-workflow dedup test (xfail, strict=False) locks the contract that
parent + child versions of cache.invalid-on-non-llm dedup correctly via
the id-keyed identity tuple. The xfail wrapper is a tripwire for the
open user decision (granular dedup tuple including workflow_path vs
special-case per-id dedup ignoring message/node_id) — DO NOT silently
weaken the assertion.

# noqa: C901 on _validate_cache_block: the function is split into clearly
numbered STEP 1/2/3a/3b and a refactor would obscure the linear flow.

Append the segment-1 entry covering B1.1 + B1.2 + B2.1 + B2.2 + B2.3.
Documents what shipped, deviations from plan, tacit knowledge for the
next agent, open hedged claims, and segment-2 preflight (the
golden_config_hashes.json baseline fixture is the load-bearing gate
before B3.1 patches begin).

A workflow input referenced ONLY in ``## Cache`` (no node param uses it) was
flagged as unused by ``_validate_unused_inputs`` because the template-walker
``_extract_all_templates`` only iterated ``node.params`` and ``batch.items``.
Top-level ``workflow_ir["cache"]["items"]`` was structurally invisible to it.

Surfaced via the post-segment-1 smoke test on a valid lyrics-generator-shape
workflow (case 1: input declared, used in cache only). Without the walker
extension, every workflow with ``## Cache`` referencing inputs would produce
a spurious 'Declared input(s) never used as template variable' ERROR.

Fix: extend ``_extract_all_templates`` with a top-level walk that wraps each
``cache.items[i].var`` in ``${...}`` and feeds it through the existing
``extract_from_value`` pipeline. Cache chunks reach the unused-input check
the same way as node params and batch items.

Regression test: tests/test_core/test_cache_block_parser.py::test_inputs_referenced_only_in_cache_not_flagged_unused

Append a "Post-segment-1 smoke test + bug fix" subsection to the segment-1
log entry documenting the 4-case manual UX check, the false-positive
unused-input bug it surfaced, the _extract_all_templates fix, and what
follow-up agents need to know about template-walker exhaustiveness.
…plicates

Two regressions surfaced by adversarial smoke testing post-segment-1:

1. **Double-emit on bad cache var references.** The earlier walker fix routed
   ``cache.items[].var`` through ``_extract_all_templates``'s shared template
   set; both my B2.3 ``_validate_cache_block`` AND the existing template-path
   validator (pass 5) saw the same template and both emitted. User would see
   two errors for one mistake.

   Fix: split into a separate ``_extract_cache_templates`` extractor. The
   union flows ONLY into ``_validate_unused_inputs`` (so cache-only inputs
   aren't flagged unused). Path validation receives ONLY node-param templates;
   cache var resolution is owned by ``_validate_cache_block`` with its richer
   "Cache chunk 'X' references..." diagnostics.

   This is the V5 single-source-of-truth pattern applied to template
   extraction: one extractor per consumer concern, no shared sets.

2. **``prompt_cache: [a, a]`` (duplicate items) passed validation silently.**
   The chunk would render twice into the system prompt — wasted tokens AND a
   silent semantic shift the author likely didn't intend. The parser already
   rejects duplicate ``${var}`` in ``## Cache``; per-node ``prompt_cache:``
   needs the parallel rule.

   Fix: detect duplicates BEFORE the resolution check; emit one validation
   error per offending node listing all duplicate names; suppress the
   order-mismatch follow-on (would be noise on top of the actionable error).
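The split-extractor contract from fix 1, as a hedged sketch. Function bodies are simplified stand-ins for the real walkers:

```python
def extract_node_templates(workflow_ir: dict) -> set[str]:
    """Templates from node params only -- feeds template-PATH validation.
    Simplified sketch: the real walker also covers batch.items etc."""
    found: set[str] = set()
    for node in workflow_ir.get("nodes", []):
        for value in node.get("params", {}).values():
            if isinstance(value, str) and "${" in value:
                found.add(value)
    return found


def extract_cache_templates_for_unused_check(workflow_ir: dict) -> set[str]:
    """Templates from ## Cache chunks -- feeds ONLY the unused-input check.
    Keeping this extractor separate avoids double-emitting when a cache var
    fails to resolve: cache var resolution owns its own diagnostics."""
    cache = workflow_ir.get("cache") or {}
    return {"${" + item["var"] + "}" for item in cache.get("items", [])}
```

One extractor per consumer concern, no shared sets, so each validation pass sees exactly the templates it owns.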

Both fixes have regression tests:
- tests/test_core/test_cache_block_parser.py::test_inputs_referenced_only_in_cache_not_flagged_unused
  (updated to assert split-extractor contract)
- tests/test_core/test_prompt_cache_validation.py::test_duplicate_chunk_in_prompt_cache_emits_error
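Fix 2's duplicate detection, sketched (helper name hypothetical):

```python
def find_duplicate_chunks(prompt_cache: list[str]) -> list[str]:
    """Names listed more than once in a node's prompt_cache:, in first-seen
    order. Runs BEFORE the order-mismatch check so a duplicate produces one
    actionable error instead of noise from the follow-on rule."""
    seen: set[str] = set()
    duplicates: list[str] = []
    for name in prompt_cache:
        if name in seen and name not in duplicates:
            duplicates.append(name)
        seen.add(name)
    return duplicates
```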

Manual verification of all 4 original smoke cases + 9 new adversarial cases
(end-to-end run, save round-trip, sub-workflow propagation, batch + cache,
JSON output, parser edge cases) confirms no other regressions in the
segment-1 surface.

Reviews identified mostly minor issues — the implementation held under
verification-specialist scrutiny + 4-agent code review. Fixes applied:

**review-test-fidelity findings:**

1. V6 sub-workflow dedup test was a tautology — synthetic
   ``deduplicate_diagnostics([d1, d2])`` necessarily collapses two
   identically-id'd diagnostics regardless of production behavior.
   REPLACED with a Pattern 2 fixture-based test that drives a real
   parent → child workflow file pair through ``WorkflowValidator.validate()``,
   asserting the propagation path emits ONE diagnostic with provenance
   prefix and preserved node_id.

2. Two circular identity tests
   (``test_legacy_identity_tuple_in_hash_when_id_absent``,
   ``test_id_set_changes_hash_to_id_keyed_tuple``) reconstructed the tuple
   from production code and compared. REMOVED as bloat — the behavioral
   tests (``test_id_set_collapses_message_variants``,
   ``test_subworkflow_legacy_dedup_regression``) are the real lock.

3. Split-extractor regression test asserted MECHANISM only (cache vars in
   one set, not the other). EXTENDED with end-to-end EFFECT assertion:
   running the FULL ``validate_workflow_templates`` produces zero spurious
   errors for cache-only inputs. A future contributor breaking the
   mechanism via re-export now surfaces immediately.

4. Pattern 2 (file-based) parser→validator integration test added —
   parses a real ``.pflow.md`` with cache.order-mismatch and asserts the
   validator emits the spec-locked diagnostic. Closes the "all hand-built
   IR" gap that masked the post-segment-1 bugs.

5. ERROR-severity cache diagnostics now show ``[cache.X]`` next to the
   title (e.g. "Error: Cache Failure [cache.invalid-on-non-llm]") — same
   agent-routing handle as WARNING/INFO. Text-mode agents can grep on id
   without parsing JSON. Non-cache errors render unchanged (additive).

**review-silent-failures findings:**

6. ``_extract_cache_templates`` API surface had no enforcement against
   future inclusion in ``all_templates`` (would silently double-emit on
   cache var resolution failures). RENAMED to
   ``_extract_cache_templates_for_unused_check`` so the contract is
   explicit at every call site.

**review-validation-consistency findings:**

7. Parser-level invariant assertion ``chunk.name == chunk.var_expr`` in
   ``_build_cache_dict``. The schema doesn't enforce equality but the
   parser always populates them equal. Prevents future refactors from
   silently drifting the two fields (B2.3 resolves by ``name``; segment-3
   rendering will substitute via ``var`` — drift would break symmetry).

**review-impact-completeness findings:**

8. Stale doc strings updated:
   - ``core/CLAUDE.md:107`` short paragraph aligned with the load-bearing
     SSoT comment at line 103 (id-keyed identity tuple).
   - ``runtime/template_validation/CLAUDE.md`` documents both extractors
     and the split-extractor contract.

Skipped (per the verification specialist's triage):

- "prompt_cache: inside node[\"params\"]" asymmetry: programmatic-IR-only
  path; step 8 already rejects with clear error. Not worth a third rule.
- "cache.duplicate-chunk" stable id: adding would expand the catalog
  beyond DD#29's closed-list. Substring match in regression test is fine
  for now.
- Empty-string ``id=""`` ``__post_init__`` check: speculative future
  risk; no production diagnostic constructs with empty id.

Final state: 5519 tests pass, mypy + ruff + deptry clean,
``test_plan_drift.py`` green (32/32). 9 manual adversarial smoke tests +
4 original cases all behave correctly end-to-end.

Append a comprehensive subsection covering the verification specialist's
14-case adversarial smoke test, the 4-agent code review (impact-completeness,
silent-failures, validation-consistency, test-fidelity), the 10 fixes
applied, the 3 findings deliberately not fixed (with reasoning), and the
end-to-end manual verification that confirmed the segment-1 surface holds.

Pre-merge precondition for Task 159 B3 (memo-hash conditional inclusion).
Generates per-node config_hash for a curated set of 9 example workflows
covering 9 mandatory hash-affecting shapes (plain, batch sequential/parallel,
sub-workflow, branching, cache:false, template-heavy, multi-input).

The committed fixture is the LOAD-BEARING regression gate (DD#19): workflows
WITHOUT prompt_cache:/##Cache must produce byte-identical hashes pre- and
post-Task-159 B3 patches. Drift signals a silent stale-cache regression class.

Files:
- scripts/generate_config_hash_baseline.py — re-runnable generator
- tests/test_runtime/fixtures/baseline_workflows.py — single source of truth
  (imported by both the script and the regression test to prevent drift)
- tests/test_runtime/fixtures/golden_config_hashes.json — 30 hashes, 9 workflows

Regen command (only after human-reviewed compute_node_config change):
  uv run python scripts/generate_config_hash_baseline.py
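How such a golden-hash gate works, as an assumed sketch; the real generator and regression test import the same fixture list, and the hash recipe below is illustrative:

```python
import hashlib
import json


def config_hash(node_config: dict) -> str:
    """Order-independent hash of a node config: canonical JSON
    (sorted keys, fixed separators) -> sha256 hex digest."""
    canonical = json.dumps(node_config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_against_golden(current: dict[str, str], golden: dict[str, str]) -> list[str]:
    """Node ids whose hash drifted from the committed fixture. Any
    non-empty result on a no-cache workflow signals the silent
    stale-cache regression class."""
    return [node_id for node_id, h in golden.items() if current.get(node_id) != h]
```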
… context

Implements DD#19's load-bearing invariant: workflows WITHOUT prompt_cache:/##Cache
hash byte-identically pre/post-task (verified by golden_config_hashes.json
regression gate); workflows WITH prompt_cache: get a fresh hash that includes
the rendered cache content, so existing memo entries don't silently serve
outputs produced without the prepended prefix.

B3.1 — schema additions to runtime types
  CacheChunkIR / CacheBlockIR / CacheRenderContext (frozen dataclasses) live in
  NEW src/pflow/core/cache_render.py because LLMNode.prep (C1.2) reads
  CacheRenderContext from shared and only imports pflow.core.* — placing the
  dataclasses in runtime/engine/types.py would force a layer-policy violation.
  NodeConfig gains prompt_cache_items: tuple[str, ...] and prewarm: bool.
  CompiledWorkflow gains cache_block: CacheBlockIR | None.
  Compiler extracts both with explicit isinstance preconditions that raise
  CompilationError on malformed shapes (Round-6 hardening: rejects tuple("string")
  silent splat, prewarm: 1 truthy-int, etc.).
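The tuple("string") splat that the Round-6 preconditions guard against, sketched with a hypothetical coercion helper:

```python
class CompilationError(Exception):
    pass


def coerce_prompt_cache_items(raw: object) -> tuple[str, ...]:
    """Precondition before tuple() coercion. tuple("string") silently
    splats a str into ('s', 't', 'r', ...), so require an actual list
    of strings and raise instead of guessing."""
    if not isinstance(raw, list) or not all(isinstance(x, str) for x in raw):
        raise CompilationError(f"prompt_cache must be a list of strings, got {raw!r}")
    return tuple(raw)
```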

B3.2 — CacheRenderContext build + install at engine boundary
  build_cache_render_dict() builds the per-workflow {node_id: CacheRenderContext}
  map (sparse — LLM nodes with cache state only). WorkflowEngine.run installs
  it under shared["__pflow_cache_render__"], wrapped in MappingProxyType for
  read-only enforcement. Save/restore mirrors __trace_collector__ exactly;
  restore-from-absent writes the module-level _EMPTY_CACHE_RENDER constant
  (frozen empty proxy), never None. NOT in _PROPAGATED_KEYS — each .pflow.md
  scopes its own ##Cache (DD#12); leaking parent->child would break cache
  scoping AND the CacheBlockIR freeze guarantee.
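The read-only install pattern in miniature. `MappingProxyType` is the stdlib mechanism named above; the keys and shapes below are illustrative:

```python
from types import MappingProxyType

# Frozen empty proxy written on restore-from-absent -- never None, so
# readers can always iterate without a guard.
_EMPTY_CACHE_RENDER = MappingProxyType({})

cache_render = MappingProxyType({"summarize": {"chunks": ("style",)}})
shared = {"__pflow_cache_render__": cache_render}

# Reads pass through; writes raise, enforcing the read-only contract.
assert shared["__pflow_cache_render__"]["summarize"]["chunks"] == ("style",)
try:
    shared["__pflow_cache_render__"]["summarize"] = {}
except TypeError:
    pass  # MappingProxyType rejects item assignment
```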

B3.3 — plan_node renders cache content into the hash
  Reorders plan_node so resolve_templates runs BEFORE config_hash (cache
  rendering needs shared state). _render_cache_for_hash filters _CHUNK_ABSENT
  chunks (branch-not-taken upstreams) symmetrically with the future C1.2 prep
  site. _resolve_chunk_value handles two unresolvable cases identically: (a)
  NodeStatus.ABSENT upstream, (b) permissive-mode echo where TemplateResolver
  returns the literal "${var}" string. Both collapse to _CHUNK_ABSENT so the
  filter contract holds.

  Defense-in-depth at runtime/cache.py::_make_serializable: raises TypeError
  if a leaked sentinel reaches hash serialization. Without it, a leaked
  sentinel would silently serialize to "<...ChunkAbsentSentinel>" and fold
  byte-identically into the cache hash — the silent stale-cache class.

B3.4 — compute_node_config accepts rendered cache content
  Keyword-only prompt_cache_content kwarg (forces explicit pass; future
  callers can't accidentally feed wrong shape as 5th positional). Conditional
  inclusion: empty list and None both fall through, byte-identical to absent
  (DD#19 three-state: absent equiv [] not equiv [chunk]).

Code-review fixes applied (4-agent /code-review pass):
  * Planner cache_render install: _create_planner_shared installs the same
    MappingProxyType the engine does, so plan_node's hash matches the engine's
    for cache-using workflows. Without this, pflow run --dry-run silently
    mispredicted "execute" vs "cached" on workflows declaring ##Cache.
    Regression test in test_plan_drift.py exercises an LLM workflow with
    ##Cache end-to-end via mock LLM; mutation-tested by removing the install.
  * Engine install ordering: build cache_render dict BEFORE installing trace
    so a build-time exception leaves shared completely unchanged.
  * isinstance(_, _ChunkAbsentSentinel) instead of type(_CHUNK_ABSENT) —
    cleaner, less brittle.
  * logger.warning at the defensive _render_cache_for_hash skip site so
    bypass scenarios become observable.

Storage-mode shared x ##Cache documented as v1-unsupported in
runtime/CLAUDE.md (parallel batch with shared root store would have
non-deterministic save/restore order; lyrics-generator doesn't use this combo).

Fix #357 (companion, pre-existing pflow bug surfaced during E2E verification):
  compute_node_cache_key now filters _*_source_line keys, mirroring the
  existing filter in compute_node_config. Without this, every saved-library
  workflow re-executed on every invocation because pflow save mutates the
  workflow file's frontmatter post-run, shifting every body section's source
  line, which flowed into resolved_inputs -> cache_key. End-to-end verified:
  saved A-basic.pflow.md now hits memo cache on second run (was 2s/$0.0002,
  now 6ms/cached).

Tests added: 67 (5523 -> 5590)
- tests/test_runtime/test_prompt_cache_compile.py (21)
- tests/test_runtime/test_cache_render_dict.py (12)
- tests/test_runtime/test_prompt_cache_hash.py (17)
- tests/test_core/test_cache_render.py (16)
- tests/test_runtime/test_cache.py (3 new for #357 filter)
- tests/test_execution/test_plan_drift.py (1 new for planner-engine parity
  on cache-using workflows)

End-to-end CLI verification (real LLM calls via claude-haiku-4-5):
- Cross-run: same input -> memo HIT; different input -> memo MISS
- Sub-workflow nested ##Cache: parent + child both HIT, dry-run agrees
- Parallel batch with prompt_cache: 2/2 in 1.5s parallel, HIT on re-run
- Branch-absent: route=use-a vs use-b produce different cache_keys (sentinel
  filter symmetric); same-route re-run HITs
- Validation errors render correctly (text + JSON):
  cache.order-mismatch, cache.invalid-on-non-llm, cache.unused-chunk,
  undeclared chunk, empty/prose-only block, multiple blocks, batch-scoped ref
- pflow save round-trip: ##Cache content preserved byte-for-byte
- pflow describe / --report / visualize / --no-cache / --only / --dry-run /
  --output-format json all work cleanly on cache workflows
- No __pflow_cache_render__ leakage in trace JSON or run JSON output

Closes #357
…view fixes

C-D-E phases of prompt caching:
- C1.1/C1.2: complete() system: str | list[dict] widening; LLMNode.prep
  builds system_blocks via shared _resolve_chunk_value; cache_chunks_skipped
  channel through 4 wrap sites + memo HIT round-trip.
- C2/C3: per-provider TTL marker translation (Anthropic / Gemini seconds-suffix
  / OpenAI prompt_cache_key MD5 + prompt_cache_retention "24h" for ttl=1h).
- D.1: auto-batch-prefix detection in prep() via _resolve_static_prefix_for_cache
  (canonical-JSON byte-stable prefix); user_message_blocks kwarg on complete().
- D.2: prewarm execution split inside _execute_parallel — item[0] runs
  synchronously through the SAME process_item closure; _collect_parallel_results
  widened with *, initial_completed=0, total=None.
- E.1: TRACE_FORMAT_VERSION = "2.1.0"; workflow_path plumbing; apply_memo_hit /
  write_memo_cache widened with keyword-only node_type_name; cache fields flow
  via existing llm_usage keyset (no sidecar); _should_write_cache_metadata
  allowlist (LLMNode only; ClaudeCodeNode INTENTIONALLY excluded); Anthropic
  1h-TTL cost normalization override per Spike 3.
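A hedged sketch of the per-provider TTL translation. Field names follow this PR's prose, not any provider SDK verbatim, and the OpenAI branch pins only the 1h case the prose mentions:

```python
_TTL_SECONDS = {"5m": 300, "1h": 3600}


def translate_ttl(provider: str, ttl: str) -> dict:
    """Illustrative per-provider TTL marker translation."""
    if provider == "anthropic":
        # Anthropic cache_control takes the ttl natively
        return {"cache_control": {"type": "ephemeral", "ttl": ttl}}
    if provider == "gemini":
        # Gemini cachedContents wants a seconds-suffixed duration string
        return {"ttl": f"{_TTL_SECONDS[ttl]}s"}
    if provider == "openai":
        # the prose pins only the 1h case ("24h" retention); other ttls
        # fall back to provider defaults in this sketch
        return {"prompt_cache_retention": "24h"} if ttl == "1h" else {}
    raise ValueError(f"unknown provider: {provider}")
```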

Post-review fixes from 5-agent /code-review pass:
- Image-drop under prewarm + images: graceful runtime degradation +
  __warnings__ entry; GH #358 filed for v1.x native image-cache support.
- Dead-code cache_chunks_skipped wraps on 3 of 4 error paths:
  _propagate_error_to_shared now threads the field from err_dict into shared.
- Strengthened divergence-injection test with behavioral binding-independence
  check; ABSENT-case byte-equivalence test; Anthropic 1h-cost integration test
  through _normalize; logger.warning on _build_user_message_blocks last-resort
  fall-back; ClaudeCodeNode skip test exhaustive exclusion; litellm.model_cost
  test cleanup via monkeypatch.setattr; runtime/CLAUDE.md + runtime/engine/CLAUDE.md
  doc updates; debug-level signal for non-allowlisted LLM-producing node types;
  MockLLMClient.set_response full-replacement contract documented.

Tests: 5709 passing (+143 vs end-of-Segment-2 baseline). test_plan_drift.py
34/34. test_golden_baseline_hashes_match (DD#19 silent-stale-cache gate) PASSED.
make check clean (ruff + ruff-format + mypy + deptry).

Both bugs surfaced by manual `pflow` CLI runs against real Anthropic Haiku
calls. Test suite was green going in (5709 passed); both were architecturally
invisible to the existing unit-test fixtures.

Bug #1 — `cache_source` mislabeled `"in_process"` for cross-process memo HITs.
`engine.py:420` called `apply_memo_hit` (which augments `cache_source="memo"`)
followed unconditionally by `handle_cached_execution`, which augmented
`"in_process"` for both branches and silently overwrote the memo augment.
Fix: `handle_cached_execution` takes `cache_source: str | None = None`
keyword-only; engine specifies `"in_process"` for the in-process branch and
omits for memo (apply_memo_hit handles it). Eliminates the overlap class.

Bug #2 — Cache rendering silently drops every dotted-path chunk
(`${node.field}`) through `NamespacedSharedStore`. `LLMNode.prep` receives
shared as a `NamespacedSharedStore`; `TemplateResolver._get_dict_value`
checked `isinstance(value, dict)` which excluded the proxy, so dotted-path
resolution returned `(False, None)` → permissive-echo → `_CHUNK_ABSENT`.
Fix at the type-taxonomy level: `NamespacedSharedStore` inherits
`collections.abc.MutableMapping` (it always was one structurally — it just
didn't declare it); `_get_dict_value` checks `isinstance(value, Mapping)`.
The motivating workflow (lyrics-generator) uses dotted-path chunks almost
exclusively — every one would have been silently filtered.

Side-effect: `NamespacedSharedStore` shrank 218 → 154 lines because the
`MutableMapping` mixin handles `keys`/`items`/`values`/`get`/`setdefault`/
`update`/`pop`/`popitem`/`clear` that we used to hand-implement.
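The type-taxonomy fix in miniature: a hypothetical namespaced store that declares `MutableMapping` and gets the mapping surface for free, so `isinstance(value, Mapping)` passes where `isinstance(value, dict)` did not:

```python
from collections.abc import MutableMapping


class NamespacedStoreSketch(MutableMapping):
    """Minimal stand-in for the fix: declare the Mapping contract the
    class always satisfied structurally. The five dunders below are all
    MutableMapping needs; keys/items/values/get/setdefault/... come free."""

    def __init__(self, backing: dict, prefix: str):
        self._backing = backing
        self._prefix = prefix + "."

    def __getitem__(self, key):
        return self._backing[self._prefix + key]

    def __setitem__(self, key, value):
        self._backing[self._prefix + key] = value

    def __delitem__(self, key):
        del self._backing[self._prefix + key]

    def __iter__(self):
        return (k[len(self._prefix):] for k in self._backing if k.startswith(self._prefix))

    def __len__(self):
        return sum(1 for _ in self)
```

A resolver checking `isinstance(value, Mapping)` now descends into the proxy, which is why dotted-path chunks stop collapsing to the absent sentinel.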

9 new regression tests, including production-shape DD#19 byte-identity check
through `NamespacedSharedStore`. The historical byte-equivalence test was a
tautology (synthetic dict + single-root chunks).

Verification artifacts: scratchpads/segment3-verification/ — 10 adversarial
.pflow.md workflows + VERIFICATION-REPORT.md.
spinje added 29 commits May 9, 2026 12:00
Four agent-UX fixes after a fresh-eyes read of the post-Cluster-A/B
output, each targeting a different section of `pflow analyze-cache`.
Total +35 LOC across 3 source files; 6 baselines drifted as strict
improvements.

- **L-5**: drop the `Discrepancy detection: skipped attribution for N
  trace event(s)…` Note. The prose was internal accounting (predicted
  cache_key, observable signal, IR-resolvable node — analyzer-internal
  vocabulary the agent can't act on). The per-call header `Showing N
  of M LLM nodes` already conveys coverage; the per-child commands
  section points to where to look.

- **N-10**: when only the non-cache blocking-errors list renders,
  emit bare `## Blocking errors` instead of the orphan `## Other
  blocking errors (surfaced for awareness)`. Pass `cache_blocking_
  present: bool` from the orchestrator into `_render_other_blocking_
  errors`. The "Other" qualifier is purely relational — when a
  sibling section isn't present, "other than what?" is incoherent.

- **B-3**: collapse `## Per-child analyze-cache commands` from N
  absolute paths × ~200 chars (15 on lyrics-generator) to one
  `cd <dirname(workflow_path)>` + relative-path commands. Sub-workflow
  paths naturally compose from the parent's directory, so dirname is
  the deepest natural common prefix — no `os.path.commonpath` heuristic
  needed.

- **B-4**: walker emits typed `DynamicBatchInfo` entries on
  `CrossWorkflowResult.dynamic_batches` instead of per-batch prose
  appended to `notes`. `_format_dynamic_batches_note` in analyze.py
  emits ONE aggregated Note. Singular case keeps the original wording
  for backward compat with existing substring tests; multi case lists
  each batch's `(node_id, items_expression)` once and shares the
  explanatory prose. The walker is documented as a data primitive
  (per cache_analysis/CLAUDE.md); this brings the dynamic-batch
  emission path into compliance.

Verification:
- `make test`: 6,426 passed, 1 skipped (no regressions).
- `make check`: ruff + ruff-format + mypy + deptry clean.
- 65/65 baselines pass; 6 regenerated as strict-improvement diffs.

Closes L-5 + N-10 + B-3 + B-4 from POLISH-PLAN.md Tier 1.

Recommendations now name affected child LLM nodes inline and project
savings_usd via memo-then-trace fallback. Honors honest-unmeasurable
when neither memo nor trace can resolve the parent value.
…licates

Restructured analyze-cache header for AI-agent readability. The 39-word
run-on scale line splits into 3 cohesive lines (`Workflow:` count +
invocation status; `Models:` for 2+ models; `Heterogeneous:` for batch
sub-workflows declaring `model: ${item.model}`). Single-model workflows
keep `using X` inline (pragmatic density).

The duplicate `Observed models:` line is dropped — `models_in_use` is
a superset of `observed_models_in_trace` in complete trace mode. The
sub-workflow breakdown line drops its trailing 15-name CSV; names
already appear in `## Per-call cache report` headings and
`## Per-child analyze-cache commands`.

Latent bug fixed as bonus: static-mode workflows on the no-model-resolved
branch previously suppressed the heterogeneous suffix. The static-mode
lyrics-generator capture now correctly surfaces `Heterogeneous:
generate-chorus-options (model varies per batch item)`.

`_format_heterogeneous_line` collapses the old 1-vs-N branches via
plain `', '.join`; `_format_heterogeneous_suffix` deleted as dead code.

Files: render_text.py (+50/-65 LOC, net -15 simpler), test_cache_analysis_renderers.py
(3 helper-direct tests migrated to end-to-end render assertions per
Pitfall #19; 2 N-9 CSV literal updates; 1 multi-model test rewritten;
1 new N-2 regression test). 4 mutation contracts verified. 5 baselines
regenerated as strict-improvement diffs.

6,430 tests pass (+4). `make check` clean. 65/65 baselines pass.

Closes N-2 + N-3 + N-9 from POLISH-PLAN.md Cluster D.

Recommendation #1 on the lyrics-generator canonical capture flips from
`savings unavailable` to `saves ~$0.20/run`, and the
`## Recommended actions` header gains the `(ordered by impact)`
qualifier — closes the gap where Cluster C v1's by-node-id walker
couldn't see workflow-input passthroughs (concept passed parent → child
→ grandchild as an input parameter).

Adds Tier 3 to `_estimate_parent_value_tokens`: read the resolved
value from the parent's workflow-node trace event under
`node_params['inputs'][child_input_name]`. The engine populates
`node.params` post-template-resolution before `record_trace`, so the
runtime invocation site carries the resolved value directly. Robust
against `template_resolutions` size-trimming applied to real-world
fixtures.

One new production-shape test with verified mutation contract; full
default suite green (6,427 passed); 65/65 baselines pass.
…clared`

When the parent value's estimated tokens are definitively below the
child model's minimum cache threshold, the recommendation now drops
`savings_usd` to None and appends a `Note:` line naming model +
threshold + token gap. Caching wouldn't fire as-stated, so the dollar
tag would be a lie; the warning prose carries the "declare AND increase
content" framing.

Asymmetric with the greenfield `cache.shared-context-undeclared` path
(which suppresses the recommendation when no node clears threshold):
cross-boundary keeps the recommendation visible because the agent
needs to know about the boundary even when caching wouldn't fire
as-stated. The recommendation surface IS the actionable signal for
this catalog ID.

Catalog gains additive `below_threshold_clause: str` context key
(always passed; empty string when not below). Helper
`_below_threshold_clause` formats the warning;
`_project_sub_workflow_cache_savings` returns
`(savings_usd, tokens, threshold_model)` so the emit site gates
without recomputing.

Heterogeneous-children case (different child models within one
sub-workflow) samples the first child's model — mirrors the existing
single-model sampling in `_consolidate_to_root_advisories`. v1
simplification documented in the helper docstring.

Files: analyze.py, warning_catalog.py, cache_analysis/CLAUDE.md, 4
test files (+3 emission tests with verified mutation contracts;
synthetic-fixture helpers updated). 2 baselines regenerated as
strict-improvement diffs (additive JSON field only).

6,434 tests pass. `make check` clean. 65/65 baselines.

Surface excluded-node cost as an explicit line so trace-mode dollar
figures compose visually for a fresh agent. Drops `(projected subset)`
qualifiers and the inline `(excludes ...)` clause from `_format_delta`.

Lyrics-generator before:
  Actually paid (trace):       ~$2.31 (trace)
  Cost without caching (projected subset):        ~$2.53 (partial)
  Actual savings (this run):    saves ~$0.49/run vs no-cache (excludes generate-chorus-options), 19% of no-cache cost

After ($2.31 − $0.27 = $2.04, $2.53 − $2.04 = $0.49 — math composes):
  Actually paid (trace):       ~$2.31 (trace)
  Excluded from analysis:      ~$0.27 (generate-chorus-options: model varies per call)
  Cost without caching:        ~$2.53 (partial)
  Actual savings (this run):    saves ~$0.49/run vs no-cache, 19% of no-cache cost

- Per-call rows: render `model=<unresolved>` when row.model is empty
  (was 35 chars of whitespace, read as a broken column).
- Drop the `Suggested-blocks: ... deferred to v1.x.` Note: internal
  jargon roadmap leak with no agent action.
- Header: promote `(no model resolved — set settings.default_model)`
  parenthetical to its own `Models: not resolved (...)` line, parallel
  to the multi-model `Models: A, B, C` case.
- Summary: drop duplicate `(trace)` from `Actually paid:` label since
  the value's tier annotation already carries the signal. Truncated
  branch keeps `(executed trace)` because "executed" is unique signal.

Tests: 1 assertion update + 1 existing test rewritten for new shape +
2 new mutation-contract tests. 6,438 pass, `make check` clean, 65/65
baselines pass after regen (34 strict-improvement diffs, net −77 lines
mostly from the Suggested-blocks Note vanishing.

Adds a parameter-aware resolution tier before memo/trace/invocation in
`_estimate_parent_value_tokens` so `pflow analyze-cache root.pflow.md
input=<sample>` produces real `saves ~$X.XX/run` figures on first contact
without a run-then-reanalyze loop. Reads from `parameters_for_workflow`
so CLI params at the root and walker-propagated params at nested
boundaries share one scoping primitive.

Replace the misleading severity-keyed headline (`19 opportunities (0
warnings, 19 info)`) with section-mapped counts that match what the
agent actually sees post-Cluster-A grouping:

- Summary: `2 recommended actions + 5 cross-workflow boundary findings`
  (suppressed when both are zero — was unconditional `0 opportunities`).
- `## Recommended actions (N, ordered by impact)` / `(N)` inline counts;
  B-6 ordered-by-impact gating preserved.
- `## Sub-workflow boundaries (M, covering K underlying renames)` —
  rollup suffix only when grouping collapsed renames; prose-mismatches
  contribute to M but not K (1-per-finding, no collapse).
- Dry-run nudge now uses the same rendered count via new
  `view_helpers.count_rendered_findings` so `pflow run --dry-run` and
  `pflow analyze-cache` agree (lyrics-generator: was 19 vs 7, now 7
  on both surfaces).
- `_group_renames_by_parent` lifted to `view_helpers` as the SSoT for
  rename grouping.
- `actionable_opportunities`/`warnings_count`/`info_count` preserved on
  `AnalysisSummary` and in JSON for machine consumers.

16 baselines regenerated as strict improvements (net −20 lines).
6,449 tests pass (+11 net). `make check` clean. 65/65 baselines pass.

Closes A-4 from POLISH-PLAN Cluster E (cluster fully closed: A-4 +
N-10 + L-5 + B-3 + B-4 all shipped).
…etector

Pass A:
- A1: consolidate visibility predicate (`per_call_row_has_real_data`) into
  view_helpers; both analyze.py and render_text.py import from there.
  Removes the 4-vs-3-condition divergence that emitted a misleading
  "Per-call cache report hidden" note on workflows with parameter-backed
  cacheable evidence (5+ baselines affected).
- A2: divide input/output/Tier 1 cacheable tokens by observed_call_count
  for static-list batch trace rows at the producer (`_build_per_call_row`).
  Cost helpers' `× _invocation_count` then recovers cohort cleanly. Fixes
  N²-inflated `no_cache_hypothetical_usd` and `rerun_within_ttl_hypothetical_usd`
  for static-list batches. New `_is_static_batch_trace_row` /
  `_divide_static_batch_trace_tokens` helpers; uses `round()` not floor.
  Tier 2 chunk-path now multiplies by `resolved_chunk_call_count` for
  non-batch repeated rows so units match cohort input.
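The A2 normalization, sketched (helper name hypothetical):

```python
def normalize_batch_tokens(total_tokens: int, observed_call_count: int) -> int:
    """Sketch of the Pass A2 fix: trace rows for static-list batches
    carry cohort totals, so divide back to per-call units at the
    producer. round() rather than floor keeps the reconstructed cohort
    total honest when the cost helpers multiply back by call count."""
    if observed_call_count <= 1:
        return total_tokens
    return round(total_tokens / observed_call_count)
```

Downstream cost helpers then multiply by the invocation count and recover the cohort cleanly, instead of N²-inflating it.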

Pass B — new catalog ID `cache.prompt-cache-incomplete`:
- Detector `_detect_partial_prompt_cache_declarations` walks LLM nodes via
  `_partial_declaration_finding_for_node`; uses `TemplateResolver.TEMPLATE_PATTERN`
  + `split_coalesce_operands` (NOT a hand-rolled regex) for prompt operand
  scanning. Root-aware via `_template_root_segment` / longest-var-prefix
  match. Filters batch-scoped refs.
- Emitter `_emit_partial_declaration_findings` aggregates per-node findings
  into one workflow-level diagnostic (matches `node_id=None` convention,
  documents latent dedup risk shared with 4 existing IDs).
- Cleanup hint via `_prompt_body_cleanup_for_node` calling
  `cache_overlap.compute_overlaps` — single source of truth shared with
  validator. Suggestions ordered cleanup-first to avoid the
  `cache.prompt-body-duplicates-cache` ERROR trap.
- Threshold gate: visible-with-clause + `savings_usd=None` mirrors N-7
  precedent. Per-node check for heterogeneous-models case.
- Suppressed when `cache.consolidate-to-root-recommended` covers same root.

Catalog count: 21 → 22. New baseline at
04-warning-catalog/21-cache.prompt-cache-incomplete/. Documentation
updated in CLAUDE.md, MCP tool docstring, and guide.

Existing baselines drift on Pass A (cost projections shrink to honest
totals) and Pass B (recommendation surface gains the new ID where
applicable). All diffs strict-improvement.
…tatic calls

Three small renderer-only fixes surfaced by a fresh-agent read of the
post-Pass-A/B captures. Each was unmasked by an earlier fix that
introduced a better signal but didn't drop the now-redundant or wrong
original.

A. ``(partial)`` qualifier on no-cache / rerun cost lines is redundant
   in trace mode (``Excluded from analysis:`` line carries the cohort
   gap in plain English) and in greenfield-with-cache mode (``(projected
   subset)`` label suffix is the parallel signal). Suppress it via
   ``projection_partial_marker`` in both branches; greenfield-no-cache
   keeps it (single-line ``Cost per run:`` has no other signal).

B. ``_per_call_scope_explainer`` returned a 60-word run-on packing four
   facts. Worse, the "(divide by calls for per-call values)" guidance
   became structurally wrong after Pass A2 normalized static-list batch
   trace rows to per-call units; agents following it would compute 1/N
   of the actual per-call value. The explainer now returns ``list[str]``
   (a lead line plus 3 bullets), and the wrong advice is dropped entirely.

C. ``_cell_calls`` renders ``—`` instead of ``0`` when
   ``evidence_scope == "static_analysis"``. ``calls=0`` for sub-workflow
   standalone analysis (the case ``## Per-child analyze-cache commands``
   instructs agents to run) read as "this node never runs". Trace mode
   keeps integer rendering — ``observed_call_count == 0`` there is a
   real signal (conditional dispatch not taken).
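Fix C's rendering rule can be restated in a few lines. A hedged sketch, not the actual `_cell_calls`; the `evidence_scope` values are taken from the text above:

```python
def cell_calls(observed_call_count: int, evidence_scope: str) -> str:
    """In static analysis a zero count only means 'no trace was taken', so an
    em dash avoids the 'this node never runs' misread; in trace mode zero is
    a real signal (conditional dispatch not taken), so the integer stays."""
    if evidence_scope == "static_analysis" and observed_call_count == 0:
        return "\u2014"  # em dash
    return str(observed_call_count)
```
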

6 new mutation-contract tests. 14 baselines drift as strict
improvements; 1 unrelated upstream LiteLLM pricing-data drift on
``04-warning-catalog/20-llm.thinking-temperature-mismatch`` regenerated
in lockstep (same drift class as prior Tier-0 / per-call-column-split
commits).

Coordination: surfaces are disjoint from in-flight Pass C
(cross-workflow → per-call row plumbing). Pass C touches
``_cell_could_cache``, ``_per_call_confidence_footer``, the catalog
template for ``cache.sub-workflow-cache-undeclared``, and
``_build_per_call_row``; this commit edits ``_render_trace_cost_lines``,
``_render_greenfield_with_cache_lines``, ``_per_call_scope_explainer``,
``_append_per_call_rows``, ``_format_per_call_row``,
``_compute_per_call_column_widths``, ``_per_call_cells``, and
``_cell_calls``.

Verification: 6,555 passed, 1 skipped (near-full sandbox suite); 142
renderer tests; 566 cache-analysis suite; 66/66 baselines; ruff +
mypy + ruff format clean on touched files.
Five Stage 1 UX fixes from a fresh-eyes read of the post-Pass-C lyrics-generator
capture, plus the "Shape δ" rename-text-suppression investigation that makes
`cache.cross-workflow-rename-detected` JSON-only.

**Stage 1 — five renderer/catalog fixes.**

- Fix 3 (pluralization): `nodes_phrase` computed field in `make_diagnostic`;
  `cache.sub-workflow-cache-undeclared` no longer emits "1 LLM nodes there".
- Fix 4 (Confidence): suppress `Confidence: high_from_trace (N of N nodes)`
  when Evidence already says complete trace with no unreached. Keeps the
  line for `medium_from_memo`, partial coverage, truncated trace, unreached.
- Fix 5+6 (token-confidence footer): `_per_call_confidence_footer` returns
  `list[str] | None`; multi-line bullet block. New `_format_node_list`
  aggregates duplicate node names — lyrics-generator's "review, review,
  review, ..." collapses to "8 review nodes".
- Fix 7 (cost-block pass-through): fold excluded `actual_cost_usd` into
  `Cost without caching` / `Cost on rerun` so the cost block reconciles
  in-place ($2.31 paid < $2.42 no-cache, no subtraction). Plain-English
  footnote names the pass-through nodes; falls back to current Excluded-
  line behavior when any exclusion has `actual_cost_usd=None`. Wording
  corrected: "no caching can apply" → "projected savings unavailable; real
  savings could be higher if model+content combos repeat".
- Fix 8 (Notes-section jargon): memo-empty discrepancy note + Gemini
  telemetry note rewritten in plain English; "Discrepancy detection /
  predicted-key matching / Observable-field attributions" → "Cache fidelity
  check / TTL expiry / chunk-skip detection". `_format_fidelity_skip_note`
  helper applies the same rewrite across all 9 remaining per-node skip-note
  sites in `_format_skipped_workflows_note`, `_build_predict_scaffold`, and
  `_predict_node_with_scaffold`. Production grep confirms zero remaining
  emission of the old jargon strings.
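The duplicate-name aggregation in Fix 5+6 can be sketched as below. This is a hypothetical re-statement of the behavior, not the real `_format_node_list`:

```python
from collections import Counter

def format_node_list(names: list[str]) -> str:
    """Collapse repeated node names into '<count> <name> nodes' while keeping
    first-seen order for distinct names."""
    counts = Counter(names)
    emitted: set[str] = set()
    parts: list[str] = []
    for name in names:
        if name in emitted:
            continue
        emitted.add(name)
        n = counts[name]
        parts.append(f"{n} {name} nodes" if n > 1 else name)
    return ", ".join(parts)

# lyrics-generator's "review, review, review, ..." collapses:
format_node_list(["review"] * 8)
```
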

**Shape δ — rename diagnostics text-suppression.**

`cache.cross-workflow-rename-detected` continues to emit for JSON/raw consumers
(`analysis.warnings` unchanged, catalog count stays at 22), but CLI text no
longer renders rename diagnostics in `## Sub-workflow boundaries`. Empirical
trigger: 17 rename findings on lyrics-generator collapsed to 5 boundary entries
that looked actionable but were cache-fidelity false positives — variable names
are stripped before the provider wire format; cache hits depend on prose before
each cached value and the resolved value bytes.

Helpers removed: `view_helpers.group_renames_by_parent`,
`render_text._render_renames_for_parent`. Rename suggestion in the catalog
rewritten to state the warning is informational. `_CROSS_WORKFLOW_ALIGNMENT_IDS`
still includes rename so it stays out of Recommended actions.
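The text/JSON split can be sketched as a renderer-side filter. The ID set and warning shape here are illustrative assumptions; JSON/raw consumers keep reading `analysis.warnings` untouched:

```python
# Hypothetical sketch: suppress rename diagnostics in CLI text only.
TEXT_SUPPRESSED_IDS = {"cache.cross-workflow-rename-detected"}

def render_boundary_warnings(warnings: list[dict]) -> list[str]:
    """Text view for '## Sub-workflow boundaries': drop suppressed IDs.
    The JSON payload is produced elsewhere and still carries every finding."""
    return [
        f"- {w['id']}: {w['message']}"
        for w in warnings
        if w["id"] not in TEXT_SUPPRESSED_IDS
    ]

warnings = [
    {"id": "cache.cross-workflow-rename-detected", "message": "input renamed"},
    {"id": "cache.sub-workflow-cache-undeclared", "message": "no cache declared"},
]
lines = render_boundary_warnings(warnings)
```
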

**Tests:** +12 new tests (5 emission/dispatch, 3 Confidence gating, 2 footer,
4 cost-block fold, 1 SSoT helper), 8 existing tests migrated to new wording,
1 prose-mismatch nudge-count test added. `make test` default suite: 6,545
passed. Cache-analysis subset: 634 passed.

**Baselines:** 67/67 verify after regenerating 15 cases (lyrics-generator
captures + adjacent text/JSON cases hit by Fix 5+6/7/8 wording + Shape δ
rendering changes + 1 unrelated pre-existing LiteLLM pricing drift on
`04-warning-catalog/20-llm.thinking-temperature-mismatch`).

Carried forward (Stage 2 / data-model follow-ups):

- Stage 2: lift `cache.sub-workflow-cache-undeclared` cleanup-first
  suggestions to a section header preamble (waiting for the in-flight
  below-threshold demotion work).
- Text/JSON divergence: Fix 7's pass-through fold is renderer-only;
  `summary.no_cache_hypothetical_usd` JSON field still emits the priced-
  cohort value. Add `no_cache_with_passthrough_usd` (additive) or document
  the asymmetry in a follow-up.
Aligns the per-call cache report's `cacheable inputs:` note with the
Recommended actions section by switching from parent value expressions
to child input names (alphabetized). Restores parent→child data-flow
visibility by adding a `flows in from parent as ${expr}` sub-line under
each renamed input bullet inside the action body — scoped to where it's
actionable, not as a standalone warning section.

CrossWorkflowInputContribution: renamed `name` → `child_input_name`,
added `parent_value_expr: str | None`. JSON shape correspondingly
updated (same-minor 4.1 pre-merge shape correction).

Body-block render duplication eliminated via two helpers
(`_format_per_consumer_input_lines`, `_format_single_consumer_input_lines`)
sharing one rename-suffix helper (`_parent_origin_clause`). Satisfies
C901 without noqa.
@spinje spinje merged commit 666d847 into main May 11, 2026
9 checks passed