feat: Prompt caching via declarative ## Cache block + pflow analyze-cache#378
Merged
Conversation
…n, DDs 26-36
Synthesize the analyze-cache output format into the spec contract and lock in the architectural decisions surfaced this session.
Spec changes (task-159.md):
- DDs #26-36 added: Tier 2 in-by-default; Diagnostic id field extension; no FixAction (mypy is the right analog, not rustc); closed warning ID catalog of 9 entries; four-level confidence labeling; savings-ratio prewarm thresholds; auto-load trace from ~/.pflow/debug/; optional inputs; three-tier validation/analysis architecture (pflow run stays fast/deterministic, --dry-run does opt-in expensive analysis, analyze-cache is the dedicated full report).
- New requirements subsections: Stable Warning ID Catalog, Output Format (Text + JSON with full mockups), Confidence Labeling Algorithm, Cross-Workflow Walker (Tier 2), Token Estimation Strategy, Per-Model Capabilities Table, Diagnostic Extension.
- Verification pass corrections: complete() signature extension required (the cache feature is the first to need structured system content blocks); compute_node_config uses key "batch", not "batch_config"; LLMNode lacks NodeConfig access (auto batch-prefix detection moves to batch_executor).
- All cache analytical findings are advisory; only structural errors (cache.order-mismatch, reference-resolution) block pflow run.
Progress log (§26): journey + three principle clarifications driven by user pushback (FixAction overlap with suggestions; trace-vs-memo data source conflation; savings-ratio threshold reasoning), verification pass results with three surprises, seven open threads for plan-writing.
Braindump updates: drop outdated Tier 2 / prewarm framings (now resolved via DD#26 / DD#36). A new supplementary braindump captures tacit knowledge not in other docs (user pushback patterns, DD interactions, hedged claims, process recommendations).
v3 output-format draft deprecated — contract content folded into the spec; v3 retained as a historical artifact with a deprecation header.
…olidation, verification corrections
Resolves the DD#18 vs DD#33/DD#36 contradiction left from §26 (analytical findings never block pflow run; advisory only).
Locks in research-backed decisions: adapter signature widening (system: str | list[ContentBlock]), TTL translation table (Anthropic vs Gemini wire formats), source="cache_analyzer" diagnostic convention, 2.0.0-trace auto-load skip, cross-workflow rename precedence, sparse-memo aggregate coverage.
Replaces two starting-context braindumps with a single agent-handoff.md capturing only what isn't in the spec or progress log (user style, methodology, concrete substrate, watch-fors, plan-writer gotchas).
A final codebase-verification round caught five wrong claims, now corrected:
- Anthropic does not accept ttl: "5m" explicitly (only "1h" or omit)
- Phase D option (c) rejected — batch_executor.py doesn't do per-item template resolution; option (a) is now recommended
- Function name was wrong (_attach_llm_call_to_event → _add_llm_data)
- Co-edit invariant mischaracterized (it's three places, not two)
- scratchpads/task-158-spike/ was cleaned up; pattern only
Progress log gains §27 (pre-plan-writing consistency pass), §28 (TTL clarification + cost-degradation gap), §29 (verification round). The spec is contract-ready for Phase B-G plan-writing.
…ion + DD#37
A verification spike caught the Task 158 §27 misdiagnosis: "Opus 4.5 + thinking + cache fails" was actually a below-4096-token-threshold silent no-op. Anthropic docs (April 2026) + the spike confirm per-version minimums: Opus 4.5/4.6/4.7 + Haiku 4.5 = 4096; Sonnet 4.6 = 2048; older Sonnet/Opus = 1024.
New DD#37 locks the OpenAI prompt_cache_retention mapping.
Boundaries cleaned: the spec holds contract values, progress log §30 holds the journey, the handoff holds plan-writer operational context.
7 Critical + 8 High-Priority + 7 Medium fixes encoded from round-3 8-agent review.
1 finding disputed (validate_data_flow reach — verified correct as planned).
Critical:
- D.1/D.2 unsafe defensive-read pattern (round-2 claim of universal coverage was false)
- F1 cache.discrepancy row had unresolvable {root_cause_action} placeholder
- B2.2 false claim: validator step 8 cannot reach top-level node keys
- ABSENT-branch hash-vs-prep symmetry (silent-stale-cache class) via _CHUNK_ABSENT sentinel
- D.2 prewarm execution made structurally concrete with index-alignment guarantees
- cache_chunks_skipped propagated through error path + memo round-trip
- 40+ WorkflowTraceCollector test sites: dropped strict assertion, added integration test
High-Priority:
- --no-trace-autoload rename (collision with pflow run --no-trace)
- storage_mode: shared × ## Cache documented as v1 limitation
- OpenAI 15 RPM batch routing documented; v1.x follow-up
- B3 in-memory mutation test (baseline can't include post-task fields)
- F3.1 exit code policy: ERROR severity in warnings[], exit 0 (advisory)
- Engine save/restore simplified to single try/finally; _EMPTY_CACHE_RENDER hoisted
Catalog: 10 → 12 entries (cache.invalid-on-non-llm, cache.prewarm-no-prefix added).
Plan: 1290 → 1445 lines.
Targeted Round 4 review (5 agents) caught pseudo-code-precision bugs in
Round-3-introduced code (wrong function signatures, wrong symbol paths,
references to non-existent classes, count drift). All verified against
actual code via direct Read before encoding fixes.
Verified pseudo-code corrections:
- process_item returns 5-tuple; D.2 pseudo-code now destructures correctly
- NodeStatus.ABSENT (not node_state.ABSENT)
- TemplateResolutionError doesn't exist; use ValueError
- extract_root_node_id always returns str; dropped dead None check
- Diagnostic identity tuple; multi-field rejection collapse fixed (V6)
V5 fix (top-10% pattern: one rule per error condition):
- Schema is single source for shape; _validate_cache_block does only
semantics + defensive isinstance skip on compile path
- Removed obsolete "two diagnostics that dedupe" test
V6 fix (top-10% pattern: one diagnostic per [rule, location]):
- cache.invalid-on-non-llm now ONE Diagnostic per node with
context["invalid_fields"]: list[str]; matches mypy/ruff convention
Five high-value additions (top-10% / simplest-final-code lens):
1. _resolve_static_prefix_for_cache companion helper — byte-identical
resolution across all three cache paths (chunk/static-prefix/analyzer)
2. cache.discrepancy structured context["root_cause_action"] payload —
agents dispatch on typed data, not prose
3. JSON format_version evolution policy — JSON_FORMAT_VERSION_MAJOR
constant + consumer rule format_version.startswith("1.")
4. Compile-path defensive isinstance guards enumerated (bool subclasses
int — must use isinstance(x, bool) not (int, bool))
5. Spike contingencies subsection — explicit table mapping each spike
to the encoded plan decision it tests
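Item 4's bool-subclasses-int trap is easy to reproduce; a minimal sketch (the guard function name is hypothetical, not from the plan):

```python
def is_valid_prewarm(value: object) -> bool:
    # bool is a subclass of int in Python, so isinstance(value, int) or
    # isinstance(value, (int, bool)) would wrongly accept prewarm: 1.
    # Only a literal true/false should pass.
    return isinstance(value, bool)


# The trap: True is both a bool and an int; 1 is only an int.
assert isinstance(True, int)
assert is_valid_prewarm(True) is True
assert is_valid_prewarm(1) is False  # truthy int rejected
assert is_valid_prewarm(0) is False
```

This is why the enumerated compile-path guards must check isinstance(x, bool) before any numeric-type check.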
Plan: 1445 → 1813 lines.
Catalog count drift defense: EXPECTED_CATALOG_COUNT constant; tests
iterate CACHE_WARNING_CATALOG.keys() instead of hardcoding numbers.
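The count-drift defense can be sketched as follows (the catalog entries here are illustrative stand-ins, not the real 12-entry catalog):

```python
# Hypothetical closed catalog of stable warning IDs (illustrative entries).
CACHE_WARNING_CATALOG: dict[str, str] = {
    "cache.order-mismatch": "Chunk order differs from declaration order",
    "cache.unused-chunk": "Declared chunk is never referenced",
    "cache.invalid-on-non-llm": "Cache fields declared on a non-LLM node",
}

# Single place the count lives; tests compare against this constant
# instead of hardcoding the number across many assertions.
EXPECTED_CATALOG_COUNT = 3


def test_catalog_count() -> None:
    assert len(CACHE_WARNING_CATALOG) == EXPECTED_CATALOG_COUNT


def test_every_entry_is_cache_namespaced() -> None:
    # Iterating keys() means newly added entries are covered automatically.
    for warning_id in CACHE_WARNING_CATALOG.keys():
        assert warning_id.startswith("cache.")


test_catalog_count()
test_every_entry_is_cache_namespaced()
```

Adding a catalog entry then forces exactly one edit (the constant), and every key-iterating test picks up the new entry for free.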
Minimal additions — preserves existing Round 1+2 sections intact, adds:
- Header + "Where I am" updated for plan v4 (1813 lines, 12 catalog entries, 4 review passes deep)
- New "What §33 / §34 don't capture" section: 7 fix-shape WHYs not in the progress log (V6 combined diagnostic, V5 single-source-of-truth, _resolve_static_prefix_for_cache shape, cache.discrepancy dict payload choice, EXPECTED_CATALOG_COUNT pattern, skipped verification-only pass, "verify against actual code" methodology shift)
- User's load-bearing principles extended with §33 + §34 quotes (top-10% lens, read-the-actual-code instruction, diminishing-returns analysis)
- Resolved/Still-open lists updated through Round 4 (16 newly-resolved items enumerated; spike-gated items separated from Round-4-still-open)
- "For the next agent" section updated with Round 4 / 5-additions context, the Spike contingencies table reference, and the current §35 recording target
- Code anchors expanded with Round 4 verifications (process_item shape, NodeStatus enum, extract_root_node_id signature, plan_node ordering, compile_validation reach)
Test applied to every addition: "Could the next agent find this by reading files?" Tacit-knowledge sections capture WHY decisions were made, not WHAT.
Round 5 (5 targeted agents) closed layer-placement + defensive-bypass
bugs introduced by Round 4: shared cache helpers moved to new
`core/cache_render.py` (verified `nodes/llm/llm.py` only imports
`pflow.core.*`); `_CHUNK_ABSENT` defense relocated from sentinel
`__repr__/__str__` (verified bypassed by `_make_serializable` at
`cache.py:25-51`) to explicit isinstance branch; `cache_chunks_skipped`
error-path injection moved from builder signatures to caller wrap sites;
`cache.order-mismatch` formatting fixed (Python `str(list)` vs spec
bare-identifier form); `MockLLMClient.set_response` cache-tokens
parallel-dict pattern fully spec'd.
Round 6 (4 agents) caught factual errors in Round 5 fixes:
`tuple("string")` silent-splat (iterable but wrong shape); `apply_memo_hit`
3rd caller at `execution/plan.py:862` missed; `_PROPAGATED_KEYS` 5→7
entry correction; F2 confidence aggregation flipped permissive→STRICT
per DD#34 line 634 verbatim; `cache_chunks_skipped` 4th wrap site at
`post():511` LLMResponseParseError; V6 sub-workflow dedup test marked
`xfail` because `format_child_provenance` modifies `Diagnostic.message`.
Diminishing-returns analysis run jointly with user — bug class shifted
from architecture (Round 1) → concurrency (Round 2) → correctness gaps
(Round 3) → pseudo-code precision (Round 4) → layer/defense (Round 5)
→ factual errors (Round 6). Cost-saved-per-fix dropped ~2hr → ~30min.
Stop decision: switch to code-review per phase merge from here.
Plan summary section collapsed (330 lines of fix-history → 10-line
pointer). Journey moved to progress-log §35. Braindump updated with
Round 5+6 tacit knowledge per /braindump skill rules (don't repeat
what's already written; capture only what isn't in other docs).
Three new artifacts to manage the Task 159 implementation handoff:
1. `implementing-agent-brief.md` — self-contained brief for the implementing agent. Specifies the 4-segment split (Foundations, Memo-hash gate, Rendering+Trace, Analyzer+Docs) with mandatory STOP-and-log behavior at each firebreak. The user decides handoff vs continue at each stop.
2. `spike-runner-brief.md` — self-contained brief for the spike-running agent. Covers the 3 pre-authorized paid spikes (~$0.30) with concrete protocols, expected outcomes, and plan-update instructions if outcomes contradict the encoded Spike Contingencies decisions.
3. `implementation-progress-log.md` (stub) — target file for implementing agents to append segment entries at each firebreak.
Plus a new Claude Code skill at `.claude/skills/plan-breakdown/SKILL.md` for analyzing implementation plans and recommending optimal handoff breakpoints based on size + tacit-knowledge dependency. Generalizes the analysis we did for Task 159 into a reusable pattern.
Three pre-authorized paid spikes ran 2026-04-29 (~$0.04 actual vs $0.30 budget). Outcomes recorded as progress-log §36.
- Spike 1 (Gemini explicit cache_control): scenario A AMBIGUOUS leaning CONFIRM via a 1b disambiguator (the no-marker control proved the marker does real work, but LiteLLM's Vertex telemetry surfaces only reads — no cache_creation_input_tokens field). Scenario B CONFIRMS (multi-marker request succeeds). F2 plan update applied: Gemini info-note in analyze.py documenting the reads-only telemetry shape.
- Spike 2 (OpenAI prompt_cache_key parallel-batch routing): CONFIRMS, 6/6 parallel calls hit cache after warm-up. No plan update.
- Spike 3 (Anthropic per-TTL pricing): CONTRADICTS severely — the 5m write is priced correctly, but the 1h cache-write cost is missing entirely from litellm.completion_cost (ratio 0.0083 vs expected ~1.6). E.1 plan update applied: 1h-TTL normalization override at llm_client.py:776-784, Anthropic-gated, with explicit test cases pinned to spike actuals.
Doc updates trim the planner-era "go run the spikes" framing across brief, braindump, and plan; the spikes are now past-tense audit material, and the plan is the implementer's single source of truth.
Add a stable warning identifier (Diagnostic.id) for cache-namespaced diagnostics per DD#27. The identity tuple becomes (severity, source, node_id, id or message); null-safe by construction, so legacy diagnostics with id=None preserve the existing message-keyed dedup behavior.
Add three cache category constants (cache_failure / cache_warning / cache_advisory) and corresponding CATEGORY_TITLES entries for renderer title lookup. _FAILURE_CATEGORY_MAP is intentionally NOT extended in v1 — no typed cache exception flows through __failures__, so adding the entry would create dead code; deferred to v1.x.
Extend diagnostic_render._format_warning_or_info_diagnostic to dispatch on cache categories: render the [id] inline prefix and surface a closed list of context keys (savings_pct, savings_usd, batch_size, prefix_tokens_estimated, target_file) so agents reading text output can route on stable warning IDs without parsing JSON.
Update the load-bearing SSoT comment in core/CLAUDE.md to document the new identity tuple shape and null-safety contract.
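The identity-tuple contract can be sketched as follows (the Diagnostic fields are simplified stand-ins for the real class):

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Diagnostic:
    # Simplified stand-in; the real Diagnostic carries more fields.
    severity: str
    source: str
    node_id: str
    message: str
    id: str | None = None


def identity_tuple(d: Diagnostic) -> tuple[str, str, str, str]:
    # Null-safe by construction: legacy diagnostics (id=None) keep the
    # existing message-keyed dedup behavior.
    return (d.severity, d.source, d.node_id, d.id or d.message)


def deduplicate(diags: list[Diagnostic]) -> list[Diagnostic]:
    seen: set[tuple[str, str, str, str]] = set()
    out: list[Diagnostic] = []
    for d in diags:
        key = identity_tuple(d)
        if key not in seen:
            seen.add(key)
            out.append(d)
    return out


# Same id collapses message variants; legacy id=None stays message-keyed.
a = Diagnostic("error", "cache_analyzer", "n1", "msg A", id="cache.order-mismatch")
b = Diagnostic("error", "cache_analyzer", "n1", "msg B", id="cache.order-mismatch")
c = Diagnostic("error", "validator", "n1", "msg A")  # legacy, id=None
d = Diagnostic("error", "validator", "n1", "msg B")  # different message survives
assert len(deduplicate([a, b, c, d])) == 3
```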
New dependency-free module mirroring the core/llm_providers.py shape: a frozen ModelCapability dataclass, a module-level MODEL_CAPABILITIES tuple, and the get_min_cache_tokens(model) lookup used by cache.below-min-tokens warning emission and auto-batch-prefix detection.
Per DD#32, Anthropic minimums are version-specific: Sonnet 4.5 / Opus 4.1 / Opus 4 / Sonnet 4 / Sonnet 3.7 -> 1024; Sonnet 4.6 / Haiku 3.5 -> 2048; Opus 4.7 / Opus 4.6 / Opus 4.5 / Haiku 4.5 -> 4096. OpenAI auto-cache fires at 1024 across all families. Gemini explicit cachedContents requires ~4k. Unknown / unprefixed models fall back to the conservative floor (4096) to prevent silent cache-warning suppression.
The lookup uses most-specific-prefix-wins so dated version suffixes like ``claude-sonnet-4-5-20240620`` correctly route to the family threshold.
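The most-specific-prefix-wins lookup can be sketched like this (a table subset with assumed model-name prefixes; thresholds follow the DD#32 figures quoted above):

```python
# Illustrative subset of the per-model minimum-cacheable-token table.
MODEL_MIN_CACHE_TOKENS: tuple[tuple[str, int], ...] = (
    ("claude-sonnet-4-6", 2048),
    ("claude-sonnet-4", 1024),
    ("claude-opus-4-5", 4096),
    ("claude-haiku-4-5", 4096),
    ("gpt-", 1024),
)

# Unknown / unprefixed models: conservative floor, so a cache warning is
# never silently suppressed for a model we don't know.
CONSERVATIVE_FLOOR = 4096


def get_min_cache_tokens(model: str) -> int:
    """Most-specific-prefix-wins lookup over the capability table."""
    best_len = -1
    best_value = CONSERVATIVE_FLOOR
    for prefix, minimum in MODEL_MIN_CACHE_TOKENS:
        if model.startswith(prefix) and len(prefix) > best_len:
            best_len = len(prefix)
            best_value = minimum
    return best_value


# A dated suffix routes to its family row, not the shorter-prefix row.
assert get_min_cache_tokens("claude-sonnet-4-6-20261001") == 2048
assert get_min_cache_tokens("claude-sonnet-4-20250514") == 1024
assert get_min_cache_tokens("totally-unknown-model") == 4096
```

Without the longest-prefix rule, ``claude-sonnet-4-6-...`` would match the shorter ``claude-sonnet-4`` entry first and get the wrong threshold.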
Extend the markdown parser state machine to recognize a structurally novel
section mode: ``## Cache`` carries section-level YAML params (only ``- ttl:``
in v1) plus a single tagged ``cache`` code block, with NO ``### entities``.
Add _SectionType.CACHE (intentionally NOT in _KNOWN_SECTIONS so the
orphan-content rules don't fire), _CacheChunk / _CacheSection collectors,
section-level YAML branch (validates `- ttl: 5m | 1h`, rejects other keys),
code-fence-close branch wired to _attach_cache_code_block.
The new _parse_cache_code_block helper splits the cache block content into
``[prose-before-${var}][${var}]`` chunks via _CACHE_TEMPLATE_RE; trailing
prose after the last ${var} is silently discarded per the contract; empty
or prose-only blocks raise. Chunk identifiers are the raw ${...} content
verbatim — downstream root-extraction in B2.3 uses
TemplateResolver.extract_root_node_id for bracket-syntax handling.
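The chunk-splitting contract above can be sketched as follows (the regex shape and helper name are assumptions; the real _CACHE_TEMPLATE_RE may differ, and bracket-syntax handling is delegated downstream as noted):

```python
import re

# Assumed shape: match ${...} markers and capture the raw inner content.
_CACHE_TEMPLATE_RE = re.compile(r"\$\{([^}]+)\}")


def parse_cache_code_block(content: str) -> list[tuple[str, str]]:
    """Split cache block text into (prose_before, var) chunks.

    Trailing prose after the last ${var} is silently discarded per the
    contract; empty or prose-only blocks raise.
    """
    matches = list(_CACHE_TEMPLATE_RE.finditer(content))
    if not matches:
        raise ValueError("## Cache block contains no ${...} chunk markers")
    chunks: list[tuple[str, str]] = []
    pos = 0
    for m in matches:
        chunks.append((content[pos : m.start()], m.group(1)))
        pos = m.end()
    return chunks  # anything after matches[-1].end() is dropped


chunks = parse_cache_code_block(
    "Style guide:\n${style.result}\nReference:\n${ref.result}\nignored tail"
)
assert [v for _, v in chunks] == ["style.result", "ref.result"]
assert chunks[0][0] == "Style guide:\n"
```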
_build_node_dict now extracts ``prompt_cache:`` and ``prewarm:`` to TOP-LEVEL
node keys (siblings of ``cache``, ``batch``, ``type``), not inside
``node["params"]``. This is required for B2.2's per-node schema check to
see them AND for B2.3's ``cache.invalid-on-non-llm`` rule (the validator's
step 8 only iterates ``node["params"]``, so top-level placement is the
load-bearing reach for the non-LLM-node check).
ir["cache"] is built at Phase 4 with shape {ttl, items: [{name, var,
prose_before, _source_line}], _source_line}; matches the schema added in B2.2.
…prewarm
Add top-level ``cache`` field to FLOW_IR_SCHEMA: object with required
``items`` array (per-chunk shape: {name, var, prose_before, _source_line}),
optional ``ttl`` enum {"5m", "1h"}, optional ``_source_line`` parser metadata.
``additionalProperties: False`` on both the cache object and per-item objects
so typos like ``cahe`` and stray fields are caught at schema validation.
Add ``prompt_cache: list[str]`` and ``prewarm: bool`` as additive per-node
properties. Schema accepts them on EVERY node type — semantic
``cache.invalid-on-non-llm`` rejection lives in core/workflow/data_flow.py
(B2.3), not here. Per the V5 fix from plan Round 4: schema is single source
of truth for SHAPE; data_flow does only SEMANTICS. Top-10% pattern (mypy /
ruff): one rule per error condition, not belt-and-suspenders. Without the
single-source rule, save-path validation would emit two diagnostics for
the same shape error.
The existing per-node ``additionalProperties: False`` invariant catches typos
like ``prompt_cahe`` automatically — no extra suggestion infrastructure needed.
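A minimal sketch of the shape contract described above, using a tiny stand-in for jsonschema's additionalProperties handling (the real schema lives in FLOW_IR_SCHEMA and is enforced by the existing validator):

```python
# Allowed keys per the schema additions described above.
CACHE_PROPERTIES = {"ttl", "items", "_source_line"}
CACHE_ITEM_PROPERTIES = {"name", "var", "prose_before", "_source_line"}
TTL_ENUM = {"5m", "1h"}


def extra_keys(obj: dict, allowed: set[str]) -> list[str]:
    """Tiny stand-in for jsonschema's additionalProperties: False."""
    return sorted(k for k in obj if k not in allowed)


cache = {
    "ttl": "5m",
    "items": [{"name": "style", "var": "style", "prose_before": "Guide:\n"}],
}
assert extra_keys(cache, CACHE_PROPERTIES) == []
assert cache["ttl"] in TTL_ENUM
assert extra_keys(cache["items"][0], CACHE_ITEM_PROPERTIES) == []

# The typo classes additionalProperties: False catches at schema time:
assert extra_keys({"cahe": {}}, CACHE_PROPERTIES) == ["cahe"]
assert extra_keys({"name": "s", "vra": "s"}, CACHE_ITEM_PROPERTIES) == ["vra"]
```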
…ata_flow
Add _validate_cache_block to validate_data_flow's tail. Both validation entry
points (WorkflowValidator.validate at save-time + compile_validation at
compile-time) reach this via the shared call site.
STEP ordering is load-bearing per the V5 fix from plan Round 4:
STEP 1 (FIRST): non-LLM rejection. Walks every node; for each node whose
``type`` is a string other than "llm" with prompt_cache: or prewarm:
declared, emits ONE ``cache.invalid-on-non-llm`` Diagnostic per node
listing ALL invalid fields in context["invalid_fields"] (the V6
combined-diagnostic shape — not one diagnostic per field). Shape-agnostic:
a malformed prompt_cache: 5 on a type: shell node still emits the
structured rejection rather than silently log-skipping. The
isinstance(type, str) gate prevents firing against type: ["llm"] (a
structural error) when the real issue is the type field itself, not its
value vs "llm".
STEP 2: defensive shape skip on the compile path. Surviving LLM nodes with
malformed prompt_cache (non-list / list-of-non-strings) or prewarm
(non-bool — note bool is a subclass of int; isinstance(x, bool) is required
to reject prewarm: 1) are logger.warning'd and skipped. The schema-validator
path catches these at step 1 and short-circuits; the compile path bypasses
jsonschema, so we MUST guard here. Per V5: schema is single source of truth
for shape; data_flow emits ZERO diagnostics for shape errors. The deeper
compile error (CompilationError on CacheBlockIR construction) surfaces.
STEP 3a: per-node prompt_cache: semantic checks. Each chunk must resolve
to a declared cache item (find_similar_items "Did you mean?" hint on miss);
declaration order must match (cache.order-mismatch ERROR with spec-locked
four-line message format using bare-identifier bracketed lists, NOT Python
repr quoted form).
STEP 3b: top-level chunk-var resolution + batch-scoped rejection
(${item.X} and descendants reject with no catalog id — flows through
generic validation pipeline) + cache.unused-chunk WARNING for declared
chunks no node references.
Top-level cache block defense is split-scope (Round 5): when ## Cache shape
is malformed, emit ONE logger.warning and skip top-level cross-references,
but the per-node cache.invalid-on-non-llm STILL runs since it's independent
of top-level cache shape.
The see_also guide-topic pointer is intentionally OMITTED here — the caching
topic is added in Phase G; the repo-wide
test_all_see_also_literals_resolve_to_real_guide_topics enforces topic
existence. G.2 will wire the references back in.
V6 sub-workflow dedup test (xfail, strict=False) locks the contract that
parent + child versions of cache.invalid-on-non-llm dedup correctly via
the id-keyed identity tuple. The xfail wrapper is a tripwire for the
open user decision (granular dedup tuple including workflow_path vs
special-case per-id dedup ignoring message/node_id) — DO NOT silently
weaken the assertion.
# noqa: C901 on _validate_cache_block: the function is split into clearly
numbered STEP 1/2/3a/3b and a refactor would obscure the linear flow.
Append the segment-1 entry covering B1.1 + B1.2 + B2.1 + B2.2 + B2.3. Documents what shipped, deviations from plan, tacit knowledge for the next agent, open hedged claims, and segment-2 preflight (the golden_config_hashes.json baseline fixture is the load-bearing gate before B3.1 patches begin).
A workflow input referenced ONLY in ``## Cache`` (no node param uses it) was
flagged as unused by ``_validate_unused_inputs`` because the template-walker
``_extract_all_templates`` only iterated ``node.params`` and ``batch.items``.
Top-level ``workflow_ir["cache"]["items"]`` was structurally invisible to it.
Surfaced via the post-segment-1 smoke test on a valid lyrics-generator-shape
workflow (case 1: input declared, used in cache only). Without the walker
extension, every workflow with ``## Cache`` referencing inputs would produce
a spurious 'Declared input(s) never used as template variable' ERROR.
Fix: extend ``_extract_all_templates`` with a top-level walk that wraps each
``cache.items[i].var`` in ``${...}`` and feeds it through the existing
``extract_from_value`` pipeline. Cache chunks reach the unused-input check
the same way as node params and batch items.
Regression test: tests/test_core/test_cache_block_parser.py::test_inputs_referenced_only_in_cache_not_flagged_unused
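The walker extension described above, as a simplified sketch (the IR shape is reduced to the relevant fields; the real pipeline routes through extract_from_value):

```python
import re

TEMPLATE_RE = re.compile(r"\$\{([^}]+)\}")


def extract_from_value(value: str, found: set[str]) -> None:
    # Simplified existing pipeline: collect every ${...} expression.
    found.update(TEMPLATE_RE.findall(value))


def extract_all_templates(workflow_ir: dict) -> set[str]:
    """Walk node params AND the top-level cache items (the fix)."""
    found: set[str] = set()
    for node in workflow_ir.get("nodes", []):
        for value in node.get("params", {}).values():
            if isinstance(value, str):
                extract_from_value(value, found)
    # The fix: cache item vars are stored bare, so re-wrap each in ${...}
    # and feed it through the same extraction pipeline.
    for item in workflow_ir.get("cache", {}).get("items", []):
        extract_from_value("${" + item["var"] + "}", found)
    return found


ir = {
    "nodes": [{"params": {"prompt": "Write about ${topic}"}}],
    "cache": {"items": [{"name": "style", "var": "style_guide"}]},
}
# style_guide is referenced only in ## Cache, yet is no longer invisible
# to the unused-input check.
assert extract_all_templates(ir) == {"topic", "style_guide"}
```

(The later "split extractor" entry below refines this so cache vars feed only the unused-input check, not path validation.)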
Append a "Post-segment-1 smoke test + bug fix" subsection to the segment-1 log entry documenting the 4-case manual UX check, the false-positive unused-input bug it surfaced, the _extract_all_templates fix, and what follow-up agents need to know about template-walker exhaustiveness.
…plicates
Two regressions surfaced by adversarial smoke testing post-segment-1:
1. **Double-emit on bad cache var references.** The earlier walker fix routed
``cache.items[].var`` through ``_extract_all_templates``'s shared template
set; both my B2.3 ``_validate_cache_block`` AND the existing template-path
validator (pass 5) saw the same template and both emitted. User would see
two errors for one mistake.
Fix: split into a separate ``_extract_cache_templates`` extractor. The
union flows ONLY into ``_validate_unused_inputs`` (so cache-only inputs
aren't flagged unused). Path validation receives ONLY node-param templates;
cache var resolution is owned by ``_validate_cache_block`` with its richer
"Cache chunk 'X' references..." diagnostics.
This is the V5 single-source-of-truth pattern applied to template
extraction: one extractor per consumer concern, no shared sets.
2. **``prompt_cache: [a, a]`` (duplicate items) passed validation silently.**
The chunk would render twice into the system prompt — wasted tokens AND a
silent semantic shift the author likely didn't intend. The parser already
rejects duplicate ``${var}`` in ``## Cache``; per-node ``prompt_cache:``
needs the parallel rule.
Fix: detect duplicates BEFORE the resolution check; emit one validation
error per offending node listing all duplicate names; suppress the
order-mismatch follow-on (would be noise on top of the actionable error).
Both fixes have regression tests:
- tests/test_core/test_cache_block_parser.py::test_inputs_referenced_only_in_cache_not_flagged_unused
(updated to assert split-extractor contract)
- tests/test_core/test_prompt_cache_validation.py::test_duplicate_chunk_in_prompt_cache_emits_error
Manual verification of all 4 original smoke cases + 9 new adversarial cases
(end-to-end run, save round-trip, sub-workflow propagation, batch + cache,
JSON output, parser edge cases) confirms no other regressions in the
segment-1 surface.
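Fix 2's duplicate detection, sketched (function names are hypothetical; the real rule runs inside _validate_cache_block):

```python
def find_duplicate_chunks(prompt_cache: list[str]) -> list[str]:
    """Chunk names appearing more than once, in first-seen order."""
    seen: set[str] = set()
    dupes: list[str] = []
    for name in prompt_cache:
        if name in seen and name not in dupes:
            dupes.append(name)
        seen.add(name)
    return dupes


def validate_prompt_cache(node_id: str, prompt_cache: list[str]) -> list[str]:
    errors: list[str] = []
    dupes = find_duplicate_chunks(prompt_cache)
    if dupes:
        # ONE error per node listing ALL duplicates; the order-mismatch
        # follow-on is suppressed (noise on top of the actionable error).
        errors.append(
            f"Node '{node_id}': duplicate cache chunk(s) "
            f"[{', '.join(dupes)}] in prompt_cache"
        )
        return errors
    # (order-mismatch / resolution checks would run here)
    return errors


assert find_duplicate_chunks(["a", "b", "a", "b"]) == ["a", "b"]
assert validate_prompt_cache("writer", ["a", "a"]) == [
    "Node 'writer': duplicate cache chunk(s) [a] in prompt_cache"
]
assert validate_prompt_cache("writer", ["a", "b"]) == []
```

Running the duplicate check before resolution keeps the output to one actionable error per node, mirroring the parser's existing duplicate-${var} rejection.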
Reviews identified mostly minor issues — the implementation held under
verification-specialist scrutiny + 4-agent code review. Fixes applied:
**review-test-fidelity findings:**
1. V6 sub-workflow dedup test was a tautology — synthetic
``deduplicate_diagnostics([d1, d2])`` necessarily collapses two
identically-id'd diagnostics regardless of production behavior.
REPLACED with a Pattern 2 fixture-based test that drives a real
parent → child workflow file pair through ``WorkflowValidator.validate()``,
asserting the propagation path emits ONE diagnostic with provenance
prefix and preserved node_id.
2. Two circular identity tests
(``test_legacy_identity_tuple_in_hash_when_id_absent``,
``test_id_set_changes_hash_to_id_keyed_tuple``) reconstructed the tuple
from production code and compared. REMOVED as bloat — the behavioral
tests (``test_id_set_collapses_message_variants``,
``test_subworkflow_legacy_dedup_regression``) are the real lock.
3. Split-extractor regression test asserted MECHANISM only (cache vars in
one set, not the other). EXTENDED with end-to-end EFFECT assertion:
running the FULL ``validate_workflow_templates`` produces zero spurious
errors for cache-only inputs. A future contributor breaking the
mechanism via re-export now surfaces immediately.
4. Pattern 2 (file-based) parser→validator integration test added —
parses a real ``.pflow.md`` with cache.order-mismatch and asserts the
validator emits the spec-locked diagnostic. Closes the "all hand-built
IR" gap that masked the post-segment-1 bugs.
5. ERROR-severity cache diagnostics now show ``[cache.X]`` next to the
title (e.g. "Error: Cache Failure [cache.invalid-on-non-llm]") — same
agent-routing handle as WARNING/INFO. Text-mode agents can grep on id
without parsing JSON. Non-cache errors render unchanged (additive).
**review-silent-failures findings:**
6. ``_extract_cache_templates`` API surface had no enforcement against
future inclusion in ``all_templates`` (would silently double-emit on
cache var resolution failures). RENAMED to
``_extract_cache_templates_for_unused_check`` so the contract is
explicit at every call site.
**review-validation-consistency findings:**
7. Parser-level invariant assertion ``chunk.name == chunk.var_expr`` in
``_build_cache_dict``. The schema doesn't enforce equality but the
parser always populates them equal. Prevents future refactors from
silently drifting the two fields (B2.3 resolves by ``name``; segment-3
rendering will substitute via ``var`` — drift would break symmetry).
**review-impact-completeness findings:**
8. Stale doc strings updated:
- ``core/CLAUDE.md:107`` short paragraph aligned with the load-bearing
SSoT comment at line 103 (id-keyed identity tuple).
- ``runtime/template_validation/CLAUDE.md`` documents both extractors
and the split-extractor contract.
Skipped (per the verification specialist's triage):
- "prompt_cache: inside node[\"params\"]" asymmetry: programmatic-IR-only
path; step 8 already rejects with clear error. Not worth a third rule.
- "cache.duplicate-chunk" stable id: adding would expand the catalog
beyond DD#29's closed-list. Substring match in regression test is fine
for now.
- Empty-string ``id=""`` ``__post_init__`` check: speculative future
risk; no production diagnostic constructs with empty id.
Final state: 5519 tests pass, mypy + ruff + deptry clean,
``test_plan_drift.py`` green (32/32). 9 manual adversarial smoke tests +
4 original cases all behave correctly end-to-end.
Append a comprehensive subsection covering the verification specialist's 14-case adversarial smoke test, the 4-agent code review (impact-completeness, silent-failures, validation-consistency, test-fidelity), the 10 fixes applied, the 3 findings deliberately not fixed (with reasoning), and the end-to-end manual verification that confirmed the segment-1 surface holds.
Pre-merge precondition for Task 159 B3 (memo-hash conditional inclusion). Generates per-node config_hash for a curated set of 9 example workflows covering 9 mandatory hash-affecting shapes (plain, batch sequential/parallel, sub-workflow, branching, cache:false, template-heavy, multi-input).
The committed fixture is the LOAD-BEARING regression gate (DD#19): workflows WITHOUT prompt_cache:/##Cache must produce byte-identical hashes pre- and post-Task-159 B3 patches. Drift signals a silent stale-cache regression class.
Files:
- scripts/generate_config_hash_baseline.py — re-runnable generator
- tests/test_runtime/fixtures/baseline_workflows.py — single source of truth (imported by both the script and the regression test to prevent drift)
- tests/test_runtime/fixtures/golden_config_hashes.json — 30 hashes, 9 workflows
Regen command (only after a human-reviewed compute_node_config change):
uv run python scripts/generate_config_hash_baseline.py
… context
Implements DD#19's load-bearing invariant: workflows WITHOUT prompt_cache:/##Cache
hash byte-identically pre/post-task (verified by golden_config_hashes.json
regression gate); workflows WITH prompt_cache: get a fresh hash that includes
the rendered cache content, so existing memo entries don't silently serve
outputs produced without the prepended prefix.
B3.1 — schema additions to runtime types
CacheChunkIR / CacheBlockIR / CacheRenderContext (frozen dataclasses) live in
NEW src/pflow/core/cache_render.py because LLMNode.prep (C1.2) reads
CacheRenderContext from shared and only imports pflow.core.* — placing the
dataclasses in runtime/engine/types.py would force a layer-policy violation.
NodeConfig gains prompt_cache_items: tuple[str, ...] and prewarm: bool.
CompiledWorkflow gains cache_block: CacheBlockIR | None.
Compiler extracts both with explicit isinstance preconditions that raise
CompilationError on malformed shapes (Round-6 hardening: rejects tuple("string")
silent splat, prewarm: 1 truthy-int, etc.).
B3.2 — CacheRenderContext build + install at engine boundary
build_cache_render_dict() builds the per-workflow {node_id: CacheRenderContext}
map (sparse — LLM nodes with cache state only). WorkflowEngine.run installs
it under shared["__pflow_cache_render__"], wrapped in MappingProxyType for
read-only enforcement. Save/restore mirrors __trace_collector__ exactly;
restore-from-absent writes the module-level _EMPTY_CACHE_RENDER constant
(frozen empty proxy), never None. NOT in _PROPAGATED_KEYS — each .pflow.md
scopes its own ##Cache (DD#12); leaking parent->child would break cache
scoping AND the CacheBlockIR freeze guarantee.
B3.3 — plan_node renders cache content into the hash
Reorders plan_node so resolve_templates runs BEFORE config_hash (cache
rendering needs shared state). _render_cache_for_hash filters _CHUNK_ABSENT
chunks (branch-not-taken upstreams) symmetrically with the future C1.2 prep
site. _resolve_chunk_value handles two unresolvable cases identically: (a)
NodeStatus.ABSENT upstream, (b) permissive-mode echo where TemplateResolver
returns the literal "${var}" string. Both collapse to _CHUNK_ABSENT so the
filter contract holds.
Defense-in-depth at runtime/cache.py::_make_serializable: raises TypeError
if a leaked sentinel reaches hash serialization. Without it, a leaked
sentinel would silently serialize to "<...ChunkAbsentSentinel>" and fold
byte-identically into the cache hash — the silent stale-cache class.
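The sentinel + defense-in-depth pair, sketched (class name and repr are simplified relative to the real runtime/cache.py code):

```python
class _ChunkAbsentSentinel:
    """Marker for a cache chunk whose upstream node never ran."""

    def __repr__(self) -> str:
        return "<chunk-absent>"


_CHUNK_ABSENT = _ChunkAbsentSentinel()


def make_serializable(value: object) -> object:
    # Defense-in-depth: a leaked sentinel must NEVER fold into the cache
    # hash as its repr string (the silent-stale-cache class). Explicit
    # isinstance branch, not repr-based detection, because generic
    # serializers would happily stringify the sentinel.
    if isinstance(value, _ChunkAbsentSentinel):
        raise TypeError("_CHUNK_ABSENT leaked into hash serialization")
    if isinstance(value, (str, int, float, bool)) or value is None:
        return value
    return repr(value)


# Normal flow: absent chunks are filtered out before hashing...
chunks = ["kept chunk", _CHUNK_ABSENT, "another chunk"]
rendered = [c for c in chunks if not isinstance(c, _ChunkAbsentSentinel)]
assert rendered == ["kept chunk", "another chunk"]

# ...and a leak past the filter fails loudly instead of silently.
try:
    make_serializable(_CHUNK_ABSENT)
    raise AssertionError("expected TypeError")
except TypeError:
    pass
```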
B3.4 — compute_node_config accepts rendered cache content
Keyword-only prompt_cache_content kwarg (forces explicit pass; future
callers can't accidentally feed wrong shape as 5th positional). Conditional
inclusion: empty list and None both fall through, byte-identical to absent
(DD#19 three-state: absent equiv [] not equiv [chunk]).
Code-review fixes applied (4-agent /code-review pass):
* Planner cache_render install: _create_planner_shared installs the same
MappingProxyType the engine does, so plan_node's hash matches the engine's
for cache-using workflows. Without this, pflow run --dry-run silently
mispredicted "execute" vs "cached" on workflows declaring ##Cache.
Regression test in test_plan_drift.py exercises an LLM workflow with
##Cache end-to-end via mock LLM; mutation-tested by removing the install.
* Engine install ordering: build cache_render dict BEFORE installing trace
so a build-time exception leaves shared completely unchanged.
* isinstance(_, _ChunkAbsentSentinel) instead of type(_CHUNK_ABSENT) —
cleaner, less brittle.
* logger.warning at the defensive _render_cache_for_hash skip site so
bypass scenarios become observable.
Storage-mode shared x ##Cache documented as v1-unsupported in
runtime/CLAUDE.md (parallel batch with shared root store would have
non-deterministic save/restore order; lyrics-generator doesn't use this combo).
Fix #357 (companion, pre-existing pflow bug surfaced during E2E verification):
compute_node_cache_key now filters _*_source_line keys, mirroring the
existing filter in compute_node_config. Without this, every saved-library
workflow re-executed on every invocation because pflow save mutates the
workflow file's frontmatter post-run, shifting every body section's source
line, which flowed into resolved_inputs -> cache_key. End-to-end verified:
saved A-basic.pflow.md now hits memo cache on second run (was 2s/$0.0002,
now 6ms/cached).
Tests added: 67 (5523 -> 5590)
- tests/test_runtime/test_prompt_cache_compile.py (21)
- tests/test_runtime/test_cache_render_dict.py (12)
- tests/test_runtime/test_prompt_cache_hash.py (17)
- tests/test_core/test_cache_render.py (16)
- tests/test_runtime/test_cache.py (3 new for #357 filter)
- tests/test_execution/test_plan_drift.py (1 new for planner-engine parity
on cache-using workflows)
End-to-end CLI verification (real LLM calls via claude-haiku-4-5):
- Cross-run: same input -> memo HIT; different input -> memo MISS
- Sub-workflow nested ##Cache: parent + child both HIT, dry-run agrees
- Parallel batch with prompt_cache: 2/2 in 1.5s parallel, HIT on re-run
- Branch-absent: route=use-a vs use-b produce different cache_keys (sentinel
filter symmetric); same-route re-run HITs
- Validation errors render correctly (text + JSON):
cache.order-mismatch, cache.invalid-on-non-llm, cache.unused-chunk,
undeclared chunk, empty/prose-only block, multiple blocks, batch-scoped ref
- pflow save round-trip: ##Cache content preserved byte-for-byte
- pflow describe / --report / visualize / --no-cache / --only / --dry-run /
--output-format json all work cleanly on cache workflows
- No __pflow_cache_render__ leakage in trace JSON or run JSON output
Closes #357
…view fixes C-D-E phases of prompt caching:
- C1.1/C1.2: complete() system: str | list[dict] widening; LLMNode.prep builds system_blocks via shared _resolve_chunk_value; cache_chunks_skipped channel through 4 wrap sites + memo HIT round-trip.
- C2/C3: per-provider TTL marker translation (Anthropic / Gemini seconds-suffix / OpenAI prompt_cache_key MD5 + prompt_cache_retention "24h" for ttl=1h).
- D.1: auto-batch-prefix detection in prep() via _resolve_static_prefix_for_cache (canonical-JSON byte-stable prefix); user_message_blocks kwarg on complete().
- D.2: prewarm execution split inside _execute_parallel — item[0] runs synchronously through the SAME process_item closure; _collect_parallel_results widened with *, initial_completed=0, total=None.
- E.1: TRACE_FORMAT_VERSION = "2.1.0"; workflow_path plumbing; apply_memo_hit / write_memo_cache widened with keyword-only node_type_name; cache fields flow via existing llm_usage keyset (no sidecar); _should_write_cache_metadata allowlist (LLMNode only; ClaudeCodeNode INTENTIONALLY excluded); Anthropic 1h-TTL cost normalization override per Spike 3.

Post-review fixes from 5-agent /code-review pass:
- Image-drop under prewarm + images: graceful runtime degradation + __warnings__ entry; GH #358 filed for v1.x native image-cache support.
- Dead-code cache_chunks_skipped wraps on 3 of 4 error paths: _propagate_error_to_shared now threads the field from err_dict into shared.
- Strengthened divergence-injection test with behavioral binding-independence check; ABSENT-case byte-equivalence test; Anthropic 1h-cost integration test through _normalize; logger.warning on _build_user_message_blocks last-resort fall-back; ClaudeCodeNode skip test exhaustive exclusion; litellm.model_cost test cleanup via monkeypatch.setattr; runtime/CLAUDE.md + runtime/engine/CLAUDE.md doc updates; debug-level signal for non-allowlisted LLM-producing node types; MockLLMClient.set_response full-replacement contract documented.
Tests: 5709 passing (+143 vs end-of-Segment-2 baseline). test_plan_drift.py 34/34. test_golden_baseline_hashes_match (DD#19 silent-stale-cache gate) PASSED. make check clean (ruff + ruff-format + mypy + deptry).
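The C2/C3 per-provider TTL translation above can be sketched as a small dispatch. This is an illustration under stated assumptions — function name and exact provider payload shapes are hypothetical, not pflow's real adapter code; only the marker facts (Gemini seconds-suffix, OpenAI prompt_cache_key MD5 + prompt_cache_retention "24h" for ttl=1h) come from the commit text:

```python
# Hypothetical sketch of per-provider TTL marker translation.
import hashlib


def translate_ttl_marker(provider: str, ttl: str, prompt_key: str) -> dict:
    if provider == "anthropic":
        # Anthropic takes the cache-control marker natively.
        return {"cache_control": {"type": "ephemeral", "ttl": ttl}}
    if provider == "gemini":
        # Gemini wants a seconds-suffix duration string.
        seconds = {"5m": 300, "1h": 3600}[ttl]
        return {"ttl": f"{seconds}s"}
    if provider == "openai":
        # OpenAI keys the cache on an MD5 of the prompt identity and,
        # for 1h TTLs, extends retention to 24h.
        payload = {"prompt_cache_key": hashlib.md5(prompt_key.encode()).hexdigest()}
        if ttl == "1h":
            payload["prompt_cache_retention"] = "24h"
        return payload
    raise ValueError(f"unknown provider: {provider}")
```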
Both bugs surfaced by manual `pflow` CLI runs against real Anthropic Haiku calls. Test suite was green going in (5709 passed); both were architecturally invisible to the existing unit-test fixtures.

Bug #1 — `cache_source` mislabeled `"in_process"` for cross-process memo HITs. `engine.py:420` called `apply_memo_hit` (which augments `cache_source="memo"`) followed unconditionally by `handle_cached_execution`, which augmented `"in_process"` for both branches and silently overwrote the memo augment. Fix: `handle_cached_execution` takes `cache_source: str | None = None` keyword-only; the engine specifies `"in_process"` for the in-process branch and omits it for memo (`apply_memo_hit` handles it). Eliminates the overlap class.

Bug #2 — Cache rendering silently drops every dotted-path chunk (`${node.field}`) through `NamespacedSharedStore`. `LLMNode.prep` receives shared as a `NamespacedSharedStore`; `TemplateResolver._get_dict_value` checked `isinstance(value, dict)`, which excluded the proxy, so dotted-path resolution returned `(False, None)` → permissive-echo → `_CHUNK_ABSENT`. Fix at the type-taxonomy level: `NamespacedSharedStore` inherits `collections.abc.MutableMapping` (it always was one structurally — it just didn't declare it); `_get_dict_value` checks `isinstance(value, Mapping)`. The motivating workflow (lyrics-generator) uses dotted-path chunks almost exclusively — every one would have been silently filtered. Side-effect: `NamespacedSharedStore` shrank 218 → 154 lines because the `MutableMapping` mixin handles `keys`/`items`/`values`/`get`/`setdefault`/`update`/`pop`/`popitem`/`clear` that we used to hand-implement.

9 new regression tests, including a production-shape DD#19 byte-identity check through `NamespacedSharedStore`. The historical byte-equivalence test was a tautology (synthetic dict + single-root chunks). Verification artifacts: scratchpads/segment3-verification/ — 10 adversarial .pflow.md workflows + VERIFICATION-REPORT.md.
Four agent-UX fixes after a fresh-eyes read of the post-Cluster-A/B output, each targeting a different section of `pflow analyze-cache`. Total +35 LOC across 3 source files; 6 baselines drifted as strict improvements.

- **L-5**: drop the `Discrepancy detection: skipped attribution for N trace event(s)…` Note. The prose was internal accounting (predicted cache_key, observable signal, IR-resolvable node — analyzer-internal vocabulary the agent can't act on). The per-call header `Showing N of M LLM nodes` already conveys coverage; the per-child commands section points to where to look.
- **N-10**: when only the non-cache blocking-errors list renders, emit bare `## Blocking errors` instead of the orphan `## Other blocking errors (surfaced for awareness)`. Pass `cache_blocking_present: bool` from the orchestrator into `_render_other_blocking_errors`. The "Other" qualifier is purely relational — when a sibling section isn't present, "other than what?" is incoherent.
- **B-3**: collapse `## Per-child analyze-cache commands` from N absolute paths × ~200 chars (15 on lyrics-generator) to one `cd <dirname(workflow_path)>` + relative-path commands. Sub-workflow paths naturally compose from the parent's directory, so dirname is the deepest natural common prefix — no `os.path.commonpath` heuristic needed.
- **B-4**: walker emits typed `DynamicBatchInfo` entries on `CrossWorkflowResult.dynamic_batches` instead of per-batch prose appended to `notes`. `_format_dynamic_batches_note` in analyze.py emits ONE aggregated Note. Singular case keeps the original wording for backward compat with existing substring tests; multi case lists each batch's `(node_id, items_expression)` once and shares the explanatory prose. The walker is documented as a data primitive (per cache_analysis/CLAUDE.md); this brings the dynamic-batch emission path into compliance.

Verification:
- `make test`: 6,426 passed, 1 skipped (no regressions).
- `make check`: ruff + ruff-format + mypy + deptry clean.
- 65/65 baselines pass; 6 regenerated as strict-improvement diffs.

Closes L-5 + N-10 + B-3 + B-4 from POLISH-PLAN.md Tier 1.
Recommendations now name affected child LLM nodes inline and project savings_usd via memo-then-trace fallback. Honors honest-unmeasurable when neither memo nor trace can resolve the parent value.
…licates
Restructured analyze-cache header for AI-agent readability. The 39-word
run-on scale line splits into 3 cohesive lines (`Workflow:` count +
invocation status; `Models:` for 2+ models; `Heterogeneous:` for batch
sub-workflows declaring `model: ${item.model}`). Single-model workflows
keep `using X` inline (pragmatic density).
The duplicate `Observed models:` line is dropped — `models_in_use` is
a superset of `observed_models_in_trace` in complete trace mode. The
sub-workflow breakdown line drops its trailing 15-name CSV; names
already appear in `## Per-call cache report` headings and
`## Per-child analyze-cache commands`.
Latent bug fixed as bonus: static-mode workflows on the no-model-resolved
branch previously suppressed the heterogeneous suffix. The static-mode
lyrics-generator capture now correctly surfaces `Heterogeneous:
generate-chorus-options (model varies per batch item)`.
`_format_heterogeneous_line` collapses the old 1-vs-N branches via
plain `', '.join`; `_format_heterogeneous_suffix` deleted as dead code.
Files: render_text.py (+50/-65 LOC, net -15 simpler), test_cache_analysis_renderers.py
(3 helper-direct tests migrated to end-to-end render assertions per
Pitfall #19; 2 N-9 CSV literal updates; 1 multi-model test rewritten;
1 new N-2 regression test). 4 mutation contracts verified. 5 baselines
regenerated as strict-improvement diffs.
6,430 tests pass (+4). `make check` clean. 65/65 baselines pass.
Closes N-2 + N-3 + N-9 from POLISH-PLAN.md Cluster D.
Recommendation #1 on the lyrics-generator canonical capture flips from `savings unavailable` to `saves ~$0.20/run`, and the `## Recommended actions` header gains the `(ordered by impact)` qualifier — closes the gap where Cluster C v1's by-node-id walker couldn't see workflow-input passthroughs (concept passed parent → child → grandchild as an input parameter).

Adds Tier 3 to `_estimate_parent_value_tokens`: read the resolved value from the parent's workflow-node trace event under `node_params['inputs'][child_input_name]`. The engine populates `node.params` post-template-resolution before `record_trace`, so the runtime invocation site carries the resolved value directly. Robust against `template_resolutions` size-trimming applied to real-world fixtures.

One new production-shape test with verified mutation contract; full default suite green (6,427 passed); 65/65 baselines pass.
…clared`

When the parent value's estimated tokens are definitively below the child model's minimum cache threshold, the recommendation now drops `savings_usd` to None and appends a `Note:` line naming model + threshold + token gap. Caching wouldn't fire as-stated, so the dollar tag would be a lie; the warning prose carries the "declare AND increase content" framing.

Asymmetric with the greenfield `cache.shared-context-undeclared` path (which suppresses the recommendation when no node clears threshold): cross-boundary keeps the recommendation visible because the agent needs to know about the boundary even when caching wouldn't fire as-stated. The recommendation surface IS the actionable signal for this catalog ID.

Catalog gains additive `below_threshold_clause: str` context key (always passed; empty string when not below). Helper `_below_threshold_clause` formats the warning; `_project_sub_workflow_cache_savings` returns `(savings_usd, tokens, threshold_model)` so the emit site gates without recomputing. Heterogeneous-children case (different child models within one sub-workflow) samples the first child's model — mirrors the existing single-model sampling in `_consolidate_to_root_advisories`. v1 simplification documented in the helper docstring.

Files: analyze.py, warning_catalog.py, cache_analysis/CLAUDE.md, 4 test files (+3 emission tests with verified mutation contracts; synthetic-fixture helpers updated). 2 baselines regenerated as strict-improvement diffs (additive JSON field only). 6,434 tests pass. `make check` clean. 65/65 baselines.
Surface excluded-node cost as an explicit line so trace-mode dollar figures compose visually for a fresh agent. Drops `(projected subset)` qualifiers and the inline `(excludes ...)` clause from `_format_delta`.

Lyrics-generator before:

    Actually paid (trace): ~$2.31 (trace)
    Cost without caching (projected subset): ~$2.53 (partial)
    Actual savings (this run): saves ~$0.49/run vs no-cache (excludes generate-chorus-options), 19% of no-cache cost

After ($2.31 − $0.27 = $2.04, $2.53 − $2.04 = $0.49 — math composes):

    Actually paid (trace): ~$2.31 (trace)
    Excluded from analysis: ~$0.27 (generate-chorus-options: model varies per call)
    Cost without caching: ~$2.53 (partial)
    Actual savings (this run): saves ~$0.49/run vs no-cache, 19% of no-cache cost
- Per-call rows: render `model=<unresolved>` when row.model is empty (was 35 chars of whitespace, read as a broken column).
- Drop the `Suggested-blocks: ... deferred to v1.x.` Note: internal jargon roadmap leak with no agent action.
- Header: promote the `(no model resolved — set settings.default_model)` parenthetical to its own `Models: not resolved (...)` line, parallel to the multi-model `Models: A, B, C` case.
- Summary: drop duplicate `(trace)` from the `Actually paid:` label since the value's tier annotation already carries the signal. Truncated branch keeps `(executed trace)` because "executed" is unique signal.

Tests: 1 assertion update + 1 existing test rewritten for new shape + 2 new mutation-contract tests. 6,438 pass, `make check` clean, 65/65 baselines pass after regen (34 strict-improvement diffs, net −77 lines mostly from the Suggested-blocks Note vanishing).
Adds a parameter-aware resolution tier before memo/trace/invocation in `_estimate_parent_value_tokens` so `pflow analyze-cache root.pflow.md input=<sample>` produces real `saves ~$X.XX/run` figures on first contact without a run-then-reanalyze loop. Reads from `parameters_for_workflow` so CLI params at the root and walker-propagated params at nested boundaries share one scoping primitive.
Replace the misleading severity-keyed headline (`19 opportunities (0 warnings, 19 info)`) with section-mapped counts that match what the agent actually sees post-Cluster-A grouping:
- Summary: `2 recommended actions + 5 cross-workflow boundary findings` (suppressed when both are zero — was unconditional `0 opportunities`).
- `## Recommended actions (N, ordered by impact)` / `(N)` inline counts; B-6 ordered-by-impact gating preserved.
- `## Sub-workflow boundaries (M, covering K underlying renames)` — rollup suffix only when grouping collapsed renames; prose-mismatches contribute to M but not K (1-per-finding, no collapse).
- Dry-run nudge now uses the same rendered count via new `view_helpers.count_rendered_findings` so `pflow run --dry-run` and `pflow analyze-cache` agree (lyrics-generator: was 19 vs 7, now 7 on both surfaces).
- `_group_renames_by_parent` lifted to `view_helpers` as the SSoT for rename grouping.
- `actionable_opportunities`/`warnings_count`/`info_count` preserved on `AnalysisSummary` and in JSON for machine consumers.

16 baselines regenerated as strict improvements (net −20 lines). 6,449 tests pass (+11 net). `make check` clean. 65/65 baselines pass. Closes A-4 from POLISH-PLAN Cluster E (cluster fully closed: A-4 + N-10 + L-5 + B-3 + B-4 all shipped).
…etector

Pass A:
- A1: consolidate visibility predicate (`per_call_row_has_real_data`) into view_helpers; both analyze.py and render_text.py import from there. Removes the 4-vs-3-condition divergence that emitted a misleading "Per-call cache report hidden" note on workflows with parameter-backed cacheable evidence (5+ baselines affected).
- A2: divide input/output/Tier 1 cacheable tokens by observed_call_count for static-list batch trace rows at the producer (`_build_per_call_row`). Cost helpers' `× _invocation_count` then recovers the cohort cleanly. Fixes N²-inflated `no_cache_hypothetical_usd` and `rerun_within_ttl_hypothetical_usd` for static-list batches. New `_is_static_batch_trace_row` / `_divide_static_batch_trace_tokens` helpers; uses `round()` not floor. Tier 2 chunk-path now multiplies by `resolved_chunk_call_count` for non-batch repeated rows so units match cohort input.

Pass B — new catalog ID `cache.prompt-cache-incomplete`:
- Detector `_detect_partial_prompt_cache_declarations` walks LLM nodes via `_partial_declaration_finding_for_node`; uses `TemplateResolver.TEMPLATE_PATTERN` + `split_coalesce_operands` (NOT a hand-rolled regex) for prompt operand scanning. Root-aware via `_template_root_segment` / longest-var-prefix match. Filters batch-scoped refs.
- Emitter `_emit_partial_declaration_findings` aggregates per-node findings into one workflow-level diagnostic (matches the `node_id=None` convention, documents latent dedup risk shared with 4 existing IDs).
- Cleanup hint via `_prompt_body_cleanup_for_node` calling `cache_overlap.compute_overlaps` — single source of truth shared with the validator. Suggestions ordered cleanup-first to avoid the `cache.prompt-body-duplicates-cache` ERROR trap.
- Threshold gate: visible-with-clause + `savings_usd=None` mirrors the N-7 precedent. Per-node check for the heterogeneous-models case.
- Suppressed when `cache.consolidate-to-root-recommended` covers the same root.

Catalog count: 21 → 22.
New baseline at 04-warning-catalog/21-cache.prompt-cache-incomplete/. Documentation updated in CLAUDE.md, MCP tool docstring, and guide. Existing baselines drift on Pass A (cost projections shrink to honest totals) and Pass B (recommendation surface gains the new ID where applicable). All diffs strict-improvement.
…tatic calls

Three small renderer-only fixes a fresh-agent read of the post-Pass-A/B captures surfaced. Each was unmasked by an earlier fix that introduced the better signal but didn't drop the now-redundant or wrong original.

A. The ``(partial)`` qualifier on no-cache / rerun cost lines is redundant in trace mode (the ``Excluded from analysis:`` line carries the cohort gap in plain English) and in greenfield-with-cache mode (the ``(projected subset)`` label suffix is the parallel signal). Suppress it via ``projection_partial_marker`` in both branches; greenfield-no-cache keeps it (single-line ``Cost per run:`` has no other signal).

B. ``_per_call_scope_explainer`` returned a 60-word run-on packing four facts. Worse, the "(divide by calls for per-call values)" guidance became structurally wrong after Pass A2 normalized static-list batch trace rows to per-call units — agents following it would compute 1/N the actual per-call value. Now returns ``list[str]`` (lead + 3 bullets); the wrong advice is dropped entirely.

C. ``_cell_calls`` renders ``—`` instead of ``0`` when ``evidence_scope == "static_analysis"``. ``calls=0`` for sub-workflow standalone analysis (the case ``## Per-child analyze-cache commands`` instructs agents to run) read as "this node never runs". Trace mode keeps integer rendering — ``observed_call_count == 0`` there is a real signal (conditional dispatch not taken).

6 new mutation-contract tests. 14 baselines drift as strict improvements; 1 unrelated upstream LiteLLM pricing-data drift on ``04-warning-catalog/20-llm.thinking-temperature-mismatch`` regenerated in lockstep (same drift class as prior Tier-0 / per-call-column-split commits). Coordination: surfaces are disjoint from in-flight Pass C (cross-workflow → per-call row plumbing).
Pass C touches ``_cell_could_cache``, ``_per_call_confidence_footer``, the catalog template for ``cache.sub-workflow-cache-undeclared``, and ``_build_per_call_row``; this commit edits ``_render_trace_cost_lines``, ``_render_greenfield_with_cache_lines``, ``_per_call_scope_explainer``, ``_append_per_call_rows``, ``_format_per_call_row``, ``_compute_per_call_column_widths``, ``_per_call_cells``, and ``_cell_calls``.

Verification: 6,555 passed, 1 skipped (near-full sandbox suite); 142 renderer tests; 566 cache-analysis suite; 66/66 baselines; ruff + mypy + ruff format clean on touched files.
Five Stage 1 UX fixes from a fresh-eyes read of the post-Pass-C lyrics-generator capture, plus the "Shape δ" rename-text-suppression investigation that makes `cache.cross-workflow-rename-detected` JSON-only.

**Stage 1 — five renderer/catalog fixes.**
- Fix 3 (pluralization): `nodes_phrase` computed field in `make_diagnostic`; `cache.sub-workflow-cache-undeclared` no longer emits "1 LLM nodes there".
- Fix 4 (Confidence): suppress `Confidence: high_from_trace (N of N nodes)` when Evidence already says complete trace with no unreached. Keeps the line for `medium_from_memo`, partial coverage, truncated trace, unreached.
- Fix 5+6 (token-confidence footer): `_per_call_confidence_footer` returns `list[str] | None`; multi-line bullet block. New `_format_node_list` aggregates duplicate node names — lyrics-generator's "review, review, review, ..." collapses to "8 review nodes".
- Fix 7 (cost-block pass-through): fold excluded `actual_cost_usd` into `Cost without caching` / `Cost on rerun` so the cost block reconciles in-place ($2.31 paid < $2.42 no-cache, no subtraction). Plain-English footnote names the pass-through nodes; falls back to the current Excluded- line behavior when any exclusion has `actual_cost_usd=None`. Wording corrected: "no caching can apply" → "projected savings unavailable; real savings could be higher if model+content combos repeat".
- Fix 8 (Notes-section jargon): memo-empty discrepancy note + Gemini telemetry note rewritten in plain English; "Discrepancy detection / predicted-key matching / Observable-field attributions" → "Cache fidelity check / TTL expiry / chunk-skip detection". `_format_fidelity_skip_note` helper applies the same rewrite across all 9 remaining per-node skip-note sites in `_format_skipped_workflows_note`, `_build_predict_scaffold`, and `_predict_node_with_scaffold`. Production grep confirms zero remaining emission of the old jargon strings.
**Shape δ — rename diagnostics text-suppression.** `cache.cross-workflow-rename-detected` continues to emit for JSON/raw consumers (`analysis.warnings` unchanged, catalog count stays at 22), but CLI text no longer renders rename diagnostics in `## Sub-workflow boundaries`. Empirical trigger: 17 rename findings on lyrics-generator collapsed to 5 boundary entries that looked actionable but were cache-fidelity false positives — variable names are stripped before the provider wire format; cache hits depend on the prose before each cached value and the resolved value bytes. Helpers removed: `view_helpers.group_renames_by_parent`, `render_text._render_renames_for_parent`. The rename suggestion in the catalog is rewritten to state the warning is informational. `_CROSS_WORKFLOW_ALIGNMENT_IDS` still includes rename so it stays out of Recommended actions.

**Tests:** +12 new tests (5 emission/dispatch, 3 Confidence gating, 2 footer, 4 cost-block fold, 1 SSoT helper), 8 existing tests migrated to new wording, 1 prose-mismatch nudge-count test added. `make test` default suite: 6,545 passed. Cache-analysis subset: 634 passed.

**Baselines:** 67/67 verify after regenerating 15 cases (lyrics-generator captures + adjacent text/JSON cases hit by Fix 5+6/7/8 wording + Shape δ rendering changes + 1 unrelated pre-existing LiteLLM pricing drift on `04-warning-catalog/20-llm.thinking-temperature-mismatch`).

Carried forward (Stage 2 / data-model follow-ups):
- Stage 2: lift `cache.sub-workflow-cache-undeclared` cleanup-first suggestions to a section header preamble (waiting for the in-flight below-threshold demotion work).
- Text/JSON divergence: Fix 7's pass-through fold is renderer-only; `summary.no_cache_hypothetical_usd` JSON field still emits the priced-cohort value. Add `no_cache_with_passthrough_usd` (additive) or document the asymmetry in a follow-up.
Aligns the per-call cache report's `cacheable inputs:` note with the
Recommended actions section by switching from parent value expressions
to child input names (alphabetized). Restores parent→child data-flow
visibility by adding a `flows in from parent as ${expr}` sub-line under
each renamed input bullet inside the action body — scoped to where it's
actionable, not as a standalone warning section.
CrossWorkflowInputContribution: renamed `name` → `child_input_name`,
added `parent_value_expr: str | None`. JSON shape correspondingly
updated (same-minor 4.1 pre-merge shape correction).
Body-block render duplication eliminated via two helpers
(_format_per_consumer_input_lines, _format_single_consumer_input_lines)
sharing one rename-suffix helper (_parent_origin_clause). Satisfies C901
without noqa.
Summary
Ships provider-level prompt caching as a declarative workflow surface, plus a new
`pflow analyze-cache` command for static + trace-driven cache analysis. Verified end-to-end on real Anthropic + Gemini calls (-25% first-run, -73% rerun-within-TTL on the lyrics-generator smoke fixture).

The important review focus is the analyzer contract, not the small provider-rendering seam: late PR/baseline sweeps turned `pflow analyze-cache` into the primary diagnostic surface for cache recommendations, trace fidelity, cross-workflow opportunities, and JSON/MCP consumers.

Task 159 — see `.taskmaster/tasks/task_159/task-review.md` for the comprehensive implementation review.

Changes
Workflow author surface
- `## Cache` markdown section: prose interleaved with `${var}` chunks, optional `ttl: 5m|1h`. Each chunk is a `[prose-before-${var}][${var}]` pair, parsed into `ir["cache"] = {ttl, items: [{name, var, prose_before, _source_line}, ...]}`.
- `prompt_cache: [name1, name2, ...]` (subset of `## Cache.items`, declaration-order strict — out-of-order is a hard error) and `prewarm: bool` (gates auto batch-prefix marker insertion).
- `cache: bool` (memo opt-out) untouched and orthogonal — two cache layers, two distinct fields (DD#5).

Cache rendering at the LiteLLM adapter seam (`src/pflow/core/llm_client.py`)
- `complete()` widened: `system: str | None -> str | list[ContentBlock] | None`; new `user_message_blocks: list[dict] | None` for batch-prefix paths.
- `core/cache_render.py::_build_cache_control_marker` (Anthropic native, Gemini -> `cachedContents`, OpenAI -> `prompt_cache_retention`).
- `compute_node_config(prompt_cache_content=)` keyword-only kwarg; nodes that don't opt in keep their existing hash.
- `prewarm: true`; render/hash byte parity is guarded by golden config hashes and NamespacedSharedStore production-shape tests.

pflow analyze-cache — new CLI + MCP command, full text + JSON parity
- Trace input via `--from-trace`.
- `(None, "unavailable")` instead of a guessed value.
- `trace`/`trace_partial`/`recomputed`/`unavailable` — never emits `$0.00` for unavailable.
- `analyze-cache` now surfaces execution-blocking non-cache validation errors via `blocking_errors[]`/`other_blocking_errors[]` instead of silently filtering them out.
- Cacheable-evidence tiers `batch_prefix` and `cross_workflow_projection`, distinct from ordinary parameter/memo/trace evidence; this is load-bearing for dynamic batch and parent->child cache opportunities.
- Trace coverage `none`/`complete`/`truncated`; successful conditional runs with unreached rows are still `complete`. Trace-dependent warning suppression is catalog-driven via `CacheWarningSpec.requires_complete_trace`, not severity.
- `core/cache_analysis/warning_catalog.py` (21 `cache.*` + `llm.thinking-temperature-mismatch`; DD#29 design review for additions).

Trace format
- `workflow_path`, per-event `cache_key`/`cache_source`/`cache_age_sec`/`cache_chunks_skipped`, cache-render snapshots, sub-workflow cost rollup, and canonical child workflow paths for dynamic workflow batches.
- Trace filenames (`workflow-trace-{wf_hash}-{safe_name}-{timestamp}.json`) — O(matches) glob lookup vs. O(N) scan-and-parse. Prior schema scaled to ~14s per `analyze-cache` invocation on machines with 67k+ traces; now <1s.
- `format_version.startswith("2.")`; auto-load skips 2.0.0 (no `workflow_path`) but `--from-trace <path>` still works.
- JSON output schema: 1.0 -> 1.1 -> 2.0 -> 2.1 -> 4.0 -> 4.1.
`format_version` is the first key. JSON 4.1 includes `other_blocking_errors[]`, `summary.suggested_run_command`, typed `per_node_thresholds[].model_state`, `per_call[].cross_workflow_inputs`, and cacheable evidence tiers `batch_prefix`/`cross_workflow_projection`.

Architectural fix — Path 1 boundary contract
`resolve_workflow` and `resolve_sub_workflow` now both file-resolve at the boundary. Every downstream consumer (analyzer, validator, compiler, runner) gets fully-resolved IR for free. Contract tests catch the regression class structurally. Cross-references issues #321 (Extend Option E shared-primitive pattern to sub-workflow output population + cycle detection) and #334 (refactor: extract shared sub-workflow resolution primitives for heterogeneous batch planning).

Companion fix — GH #357 (memo cache key line-shift)
`pflow save` rewrites frontmatter on every invocation, shifting source lines and invalidating cache_key. Fix: `_METADATA_KEY_SUFFIXES = ("_source_line", "_source_lines", "_source_files")` filter at `runtime/cache.py::compute_node_cache_key`. Pre-existing pflow bug; fixed inline because shipping caching without it would surprise the first user who saves a `## Cache` workflow.

Diagnostic infrastructure
- `Diagnostic.id: str | None` field added (top-level stable IDs per DD#27). Identity tuple `(severity, source, node_id, id or message)` — when `id` is set, it's the dedup key.
- `NamespacedSharedStore` now inherits `MutableMapping` (was duck-typed but not type-tagged). `TemplateResolver._get_dict_value` now checks `Mapping` not `dict`. Closes a silent stale-cache class where dotted-path chunks were filtered as `_CHUNK_ABSENT` in production.

Explanation
Most production risk is in the analyzer (`src/pflow/core/cache_analysis/`), not the provider rendering hot path. The provider rendering seam is comparatively small; the analyzer owns recommendations, cross-workflow traversal, trace evidence, cost semantics, JSON/MCP shape, and agent-facing text.

The implementation arc: 4 segments (foundations -> memo-hash gate -> rendering+prewarm+trace -> analyzer+docs) -> Stage 1 lyrics-generator verification -> Stage A/0/B/C data-model redesign (`recommended_actions` + `cross_workflow.*` derived from `analysis.warnings`) -> Stage 2 paid verification -> PR #378 review fixes -> baseline/polish sweeps.

The progress log is now 11,777 lines at `.taskmaster/tasks/task_159/implementation/implementation-progress-log.md`. For reviewer orientation, prefer `.taskmaster/tasks/task_159/task-review.md`; use the progress log only for forensic implementation history.

This is a large PR (700+ files, mostly task docs/baselines/tests). Exact file/commit counts may drift while final docs and baselines are pushed; the stable review surface is the task review, baseline harness, and tests listed below.
Created Docs
- `.taskmaster/tasks/task_159/task-159.md` — spec (~617 lines, 37 design decisions, full warning ID catalog, output format reference)
- `.taskmaster/tasks/task_159/task-review.md` — primary future-agent navigation artifact (file tour, integration points, patterns, anti-patterns, common pitfalls, Task 160 baseline guardrail)
- `.taskmaster/tasks/task_159/implementation/implementation-plan.md` — full implementation plan
- `.taskmaster/tasks/task_159/implementation/implementation-progress-log.md` — segment-by-segment progress log (11,777 lines)
- `.taskmaster/tasks/task_159/baseline/` — analyzer output oracle used during development and intended as a Task 160 refactor safeguard

User-facing documentation:
- `src/pflow/guide/features/caching.md` — new `pflow guide caching` topic covering the two-cache-layer model, `## Cache` syntax, order invariant, TTL opt-in, sub-workflows, provider-specific notes, and the full warning ID catalog
- `docs/guides/debugging.mdx`, `docs/reference/cli/index.mdx`, `docs/reference/nodes/llm.mdx` — caching cross-references and CLI help updates
- `src/pflow/core/cache_analysis/CLAUDE.md` — current-state tacit knowledge for the analyzer package
- `CLAUDE.md` updates across CLI, core, runtime, execution, nodes, and tests

## Breaking Changes
These are not strictly breaking for end-user workflows (no users yet — see CLAUDE.md), but consumers of internal APIs should be aware:
- `Diagnostic` identity tuple changed from `(severity, source, node_id, message)` to `(severity, source, node_id, id or message)`. When `id` is set, it's the dedup key.
- `complete()` signature: `system: str | None` -> `system: str | list[ContentBlock] | None`; new `user_message_blocks: list[dict] | None` kwarg.
- `compute_node_config`, `apply_memo_hit`, `write_memo_cache`, and `handle_cached_execution` widened with keyword-only kwargs (`prompt_cache_content`, `node_type_name`, `cache_key`, `created_at`, `cache_source`). Production callers updated; future direct callers must pass these.
- `WorkflowTraceCollector(*, workflow_path=None)` — new keyword-only kwarg.
- `NamespacedSharedStore` now inherits `MutableMapping`; `__contains__(key: object)` (ABC requirement) instead of `key: str`.
- New fields `has_cache_telemetry`, `uncached_input_tokens`, and `input_token_accounting`; absence vs explicit zero is load-bearing.
- Consumers should check `format_version.startswith("4.")` and dispatch on typed fields like `model_state` rather than sentinel strings.
- `per_call[].cacheable_data_source` changed: cacheable evidence no longer emits the old fabricated `estimator` tier; valid tiers include `trace`, `memo`, `parameters`, `batch_prefix`, `cross_workflow_projection`, `unavailable`.
- `pflow save` byte-for-byte round-trips `## Cache` blocks. Existing memo cache entries written under the old line-shift-corrupted cache key become unreachable; they expire naturally via the 24h TTL.
- `--no-cache` flag scope clarified: disables pflow's local memoization only. Does NOT disable LLM provider prompt caching.
- `pflow analyze-cache --from-trace <path>`.

## Testing
Latest logged state: default suite 6,592 passing, 1 skipped, 3 deselected; near-full sandbox-safe suite 6,562 passing, 1 skipped; baseline harness 67/67 passing after latest regeneration.
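The DD#19 byte-identity gate listed below hinges on config hashing being canonical and deterministic. A minimal sketch of the pattern — `stable_config_hash` and `assert_matches_golden` are hypothetical helpers illustrating the idea, not pflow's actual `compute_node_config` machinery:

```python
import hashlib
import json


def stable_config_hash(config: dict) -> str:
    # Canonical serialization: sorted keys, fixed separators, UTF-8.
    # Any drift in these choices changes every hash and silently
    # invalidates existing cache entries.
    blob = json.dumps(config, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def assert_matches_golden(configs: dict[str, dict], golden: dict[str, str]) -> None:
    # Golden-file gate: recompute every hash and compare byte-for-byte
    # against the committed fixture.
    for node_id, config in configs.items():
        assert stable_config_hash(config) == golden[node_id], node_id
```

The golden fixture is regenerated only on deliberate hash-format changes; an unexplained mismatch means cache keys drifted.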
`make check` was clean in the logged passes, with known sandbox `uv` caveats documented in the progress log.

Critical regression gates (do not delete or weaken — see task-review.md "Test Files" table for the full list):
- `tests/test_runtime/fixtures/golden_config_hashes.json` + `test_prompt_cache_hash.py::test_golden_baseline_hashes_match` — DD#19 byte-identity of `compute_node_config` across 30 nodes / 9 workflows.
- `test_plan_drift.py::test_plan_matches_engine_for_workflow_with_prompt_cache` — planner <-> engine parity for cache-using workflows.
- `test_prompt_cache_rendering.py::test_resolve_chunk_value_is_imported_locally_at_both_sites` — hash-side and prep-side import the SAME helper as local module bindings.
- `test_prompt_cache_rendering.py::test_hash_render_and_prep_render_byte_equivalent_through_namespaced_store` — DD#19 production shape (NamespacedSharedStore wrap + dotted path).
- `test_workflow_resolver_contract.py::test_resolve_workflow_returns_fully_file_resolved_ir` and `test_sub_workflow_resolver.py::test_resolve_sub_workflow_cross_workflow_walker_sees_resolved_prompts` — Path 1 boundary contract.
- `test_trace_format_2_1.py::test_handle_cached_execution_does_not_overwrite_memo_cache_source` — memo-hit `cache_source="memo"` must not be overwritten to `"in_process"`.
- `test_cache_analysis_per_id_coverage.py` — every catalog ID round-trips through JSON with `id` at top level.
- `test_cache_analysis_per_id_emission.py` — production-shape warning emission tests via `analyze(...)` with real memo/trace fixtures.
- `format_version` remains first key; JSON error paths emit parseable envelopes.
- Gating on `CacheWarningSpec.requires_complete_trace`, not severity.
- `batch_prefix` / `cross_workflow_projection` tests — protect per-call projection tiers and no-double-counting behavior.

End-to-end paid verification (Stage 2):
Adversarial/baseline verification caught real bugs:

- `cache_source` mislabel (memo reported as `in_process`).
- `events` vs `nodes` typo: Tier-1 trace data unreachable in `analyze-cache`.
- `-$0.00/run`; per-call vs cohort token-unit mistakes.

## Follow-up issues filed
- #357 (closed inline — line-shift breaking memo)
- #358 (image cache support)
- #359 (LiteLLM cost-map fetch warnings to stderr)
- #360 (dynamic batch size cost undercount)
- #361 (Path 2 architectural cleanup)
- #362 (cross-workflow rename suppression — closed)
- #363 (shared prose detection)
- #364 (multi-walker consolidation in trace event traversal)
- #365 (sub-workflow cost rollup — resolved)
- #371 (test slowdown — closed)
- #372 (remaining perf)
- #373 (near-threshold expansion hints for SuggestedBlock)
- #374 (`__warnings__` channel modernization)
- #375 (mock fidelity)
- #376 (child rollup confidence gap)
- #377 (split `partial_cost_usd` semantics)
- #383 (repeated non-batch prompt-shape detection)

Task 160 (cache analysis architectural refactor) is filed as a follow-up task. Its main guardrail should be the Task 159 baseline harness plus the focused cache-analysis tests above.
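That baseline-harness guardrail is a re-run-and-diff loop. A minimal sketch of the pattern — `check_against_baseline`, the directory layout, and the `analyze` callable are all hypothetical illustrations, not the actual harness:

```python
import json
from pathlib import Path
from typing import Callable


def check_against_baseline(
    analyze: Callable[[dict], dict], fixtures_dir: Path, baseline_dir: Path
) -> list[str]:
    # Re-run the analyzer over each fixture and diff its output against
    # the stored baseline; any mismatch is a behavior change to review
    # (or a deliberate baseline regeneration).
    mismatches = []
    for baseline_file in sorted(baseline_dir.glob("*.json")):
        fixture = fixtures_dir / baseline_file.name
        expected = json.loads(baseline_file.read_text())
        actual = analyze(json.loads(fixture.read_text()))
        if actual != expected:
            mismatches.append(baseline_file.name)
    return mismatches
```

During a refactor like Task 160, an empty mismatch list is the cheap proof that analyzer behavior survived the restructuring.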