fix: allow arrays/dicts to serialize to JSON in string contexts #19
Fixes #13

Shell nodes always output text (stdout is a string). When that text is valid JSON like `["a","b","c"]`, batch processing now auto-parses it to a Python list, matching the behavior already in node_wrapper.py for node parameters.

Changes:
- Add JSON auto-parsing in batch_node.py `prep()` method
- Add type coercion helpers for batch config (defense-in-depth)
- Add 25 new tests (10 for JSON parsing, 15 for type coercion)

The JSON parsing follows the proven pattern from node_wrapper.py:
- Strip whitespace (handles shell output newlines)
- 10MB size limit for security
- Quick check for `[` prefix before parsing
- Graceful fallback if parsing fails
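The parsing pattern above can be sketched roughly as follows. This is a hedged illustration, not the actual batch_node.py code; the function name `maybe_parse_json_list` and constant are assumptions:

```python
import json

# Illustrative sketch of the auto-parsing step; names are assumptions,
# not the actual batch_node.py implementation.
MAX_JSON_SIZE = 10 * 1024 * 1024  # 10MB security limit

def maybe_parse_json_list(text):
    """Parse shell stdout into a list when it looks like JSON, else return it unchanged."""
    if not isinstance(text, str):
        return text
    stripped = text.strip()  # shell output commonly ends with a newline
    if len(stripped) > MAX_JSON_SIZE:
        return text  # refuse to parse oversized payloads
    if not stripped.startswith("["):
        return text  # quick prefix check avoids json.loads on non-list text
    try:
        parsed = json.loads(stripped)
    except json.JSONDecodeError:
        return text  # graceful fallback: keep the original string
    return parsed if isinstance(parsed, list) else text
```

The quick `[` prefix check keeps the common non-JSON case cheap; the fallback path means malformed JSON never raises into the workflow.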
Fixes #15

Three related issues fixed:
1. Batch item alias (`${item}` or custom via `batch.as`) now recognized
   - Added batch alias extraction to `_extract_node_outputs()`
   - Added batch alias recognition to `validate_data_flow()`
2. Batch outputs (`${node.results}`, `${node.count}`, etc.) now recognized
   - Added `_BATCH_OUTPUTS` constant with all batch output definitions
   - Batch nodes register batch outputs instead of inner node outputs
3. Templates in `batch.items` now extracted for validation
   - Extended `_extract_all_templates()` to scan `node.batch.items`

Added 14 new tests covering batch template validation scenarios.
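Point 2 above can be sketched as follows. The `_BATCH_OUTPUTS` name comes from the commit text, but the entries and the extraction function are illustrative assumptions:

```python
# Hedged sketch of batch-output registration; the constant name follows the
# commit text, the entries and node shape are assumptions.
_BATCH_OUTPUTS = {
    "results": "list",  # per-item results collected by the batch wrapper
    "count": "int",     # number of items processed
}

def extract_node_outputs(node):
    """Batch nodes register batch outputs instead of the inner node's outputs."""
    if node.get("batch"):
        return {f"{node['id']}.{name}": typ for name, typ in _BATCH_OUTPUTS.items()}
    return {f"{node['id']}.{name}": typ
            for name, typ in node.get("outputs", {}).items()}
```

This mirrors the lesson noted later in this thread: a runtime wrapper that changes output structure must be matched in compile-time validation.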
Clarified comments in the cli-agent-instructions.md file to enhance understanding of shell command usage. Updated the example command to demonstrate the use of pflow variables directly, ensuring they resolve before the shell execution. This change aims to prevent common mistakes and improve the clarity of instructions for users.
Documents lesson learned from batch validation bug fix (GitHub #15): When adding runtime wrappers that change node output structure, compile-time validation systems must also be updated.
Introduces a new JSON file to demonstrate the current workaround for passing arrays via stdin in shell commands, addressing the limitation of embedding arrays directly in command strings. Additionally, adds a design issue report outlining the challenges faced when combining multiple arrays in workflows, along with reproduction steps and potential solutions. Includes minimal reproduction and real-world problem examples to illustrate the issue.
Fixes #18

- Update type checker to allow dict/list/array → str compatibility
- Add shell-unsafe JSON warning for apostrophes, backticks, $(...)
- Document limitation in template-variables.md
- Update tests for new behavior
- Fix pre-existing mypy error in template_validator.py

Behavior:
- Complex templates like `"echo '${items}'"` auto-serialize arrays to JSON
- Simple templates like `"${items}"` still blocked at runtime (preserves guardrail)
- Warning emitted if JSON contains shell-unsafe characters
- Fix ReDoS vulnerability in `_warn_shell_unsafe_json()` by limiting regex input to 500 chars and removing the DOTALL flag
- Clarify bidirectional type compatibility in type_checker.py comments
- Improve shell warning message clarity (mention multiple issues may exist)
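The mitigation pattern described above can be sketched like this. The pattern and helper name are assumptions modeled on the commit text, not the project's actual code:

```python
import re

# Hedged sketch of the ReDoS mitigation: cap the input the regex sees so a
# pathological payload cannot blow up matching time, and avoid DOTALL.
_SHELL_UNSAFE = re.compile(r"'|`|\$\(")  # apostrophe, backtick, $( substitution

def warn_shell_unsafe_json(json_text):
    """Return a warning if the (truncated) JSON contains shell-unsafe characters."""
    sample = json_text[:500]  # bound regex work to the first 500 chars
    if _SHELL_UNSAFE.search(sample):
        return ("JSON value contains shell-unsafe characters "
                "(apostrophes, backticks, or $(...)); multiple issues may exist")
    return None
```

Truncating before matching bounds worst-case regex time regardless of payload size, which is the standard defense when the pattern itself cannot be proven backtracking-safe.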
Code Review for PR #19: Array/Dict to JSON String Serialization

Summary

This PR enables arrays and dictionaries to be embedded in string parameters by auto-serializing them to JSON. This solves a real problem where shell nodes output JSON strings that need to flow into batch processing or be embedded in commands. The implementation is well-structured with proper guardrails, but some areas need attention before merge.

✅ Strengths
Review Fixes Applied

Thanks for the thorough review! I've addressed the critical issues in commit

✅ Fixed

#1 ReDoS Vulnerability (Critical)
#2 Type Checker Comment Clarity (Critical)
#5 Warning Message Clarity (Warning)
❌ Not Addressed (Pre-existing, out of scope)

#3 Batch Node 10MB Limit
#4 Batch Validation Inconsistency
Answers to Questions
In the webpage-to-markdown workflow, batch items are image descriptions - typically a few KB each. 10MB would be extreme. However, since this limit is pre-existing, I'd prefer to address it in a separate PR if we want to change it.
Yes, documented in pitfalls.md (lines 360+). It's a fundamental limitation of static validation vs runtime wrappers.
Good suggestion for follow-up. The shell.py file is getting large (760 lines).
…kip review]

PR #378 review finding #4. Closes the synthetic-builder ↔ production-analyze() fidelity gap in tests/test_core/test_cache_analysis_renderers.py.

What landed
- `_BUILDER_DOCUMENTED_DEFAULTS` frozenset (6 entries) names the AnalysisSummary fields the synthetic builder cannot faithfully model. Tests asserting on these MUST drive analyze() end-to-end.
- `TestMakeAnalysisShapeParity` class with two methods:
  * test_builder_field_set_matches_dataclass_minus_documented_defaults — uses dataclasses.fields() introspection to fail noisily when a new field is added without builder coverage or allowlist documentation.
  * test_documented_defaults_get_overwritten_by_production — drives analyze() against a contrived IR + trace that triggers each documented-default overwrite, catching the case where production's overwrite logic is deleted while the allowlist stays stale.
- Three renderer tests migrated from synthetic to e2e analyze() calls:
  * test_json_partial_trace_exposes_evidence_scope_and_observed_models
  * test_json_summary_exposes_projection_exclusions_and_delta_reason
  * test_render_json_includes_rollup_workflow_paths_and_unavailable_models_by_workflow

Reviewer's two non-issues confirmed via inspection
- test_text_summary_renders_blocking_errors_categorically (line 272 in PR baseline) only asserts on builder-populated fields. Kept synthetic.
- test_json_emits_root_and_sub_workflow_llm_node_counts already drove analyze() end-to-end against the committed 3-deep fixture. No migration needed.

Verification
- 5 mutation contracts checked by reverting production code:
  * Add new AnalysisSummary field → parity method 1 fails naming the field.
  * Delete observed_models_in_trace overwrite → migration #2 + parity method 2 fail with documented diagnostics.
  * Delete unavailable_models_by_workflow overwrite → migration #4 + parity method 2 fail.
- 6,335 tests passing on default suite.
- make check clean (ruff + ruff-format + mypy + deptry).
- test_golden_baseline_hashes_match (DD#19) green; test_plan_drift.py 34/34.

Plan + progress log
- Atomic plan at .taskmaster/tasks/task_159/implementation/fix-plans/renderer-test-fidelity-shape-parity-plan.md.
- Consolidated PR #378 review-fix sweep entry appended to implementation-progress-log.md, documenting all four phases (Phase 1 easy bundle / Phase 2 medium bundle / Phase 3 cohort-key correctness / Phase 4 this commit), the five disputed findings (with citations), the GH #380 follow-up filed, and the cross-cutting insights from the sweep.

GH issue #380 filed for the deferred test-bloat parametrize-collapse work.
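The shape-parity pattern from the commit above can be sketched as follows. The dataclass, its fields, and the allowlist contents here are invented stand-ins for AnalysisSummary, not the real definitions:

```python
import dataclasses

# Hedged sketch of the parity check; the dataclass and allowlist are
# illustrative stand-ins, not the actual AnalysisSummary.
@dataclasses.dataclass
class AnalysisSummary:
    total_cost: float = 0.0
    observed_models: list = dataclasses.field(default_factory=list)

# Fields the synthetic builder cannot faithfully model (must be e2e-tested).
BUILDER_DOCUMENTED_DEFAULTS = frozenset({"observed_models"})
# Fields the synthetic builder does populate.
BUILDER_FIELDS = {"total_cost"}

def check_shape_parity():
    """Fail noisily if a dataclass field is neither built nor allowlisted."""
    all_fields = {f.name for f in dataclasses.fields(AnalysisSummary)}
    uncovered = all_fields - BUILDER_FIELDS - BUILDER_DOCUMENTED_DEFAULTS
    assert not uncovered, (
        f"new fields need builder coverage or allowlist documentation: {uncovered}")
```

Because the check introspects the dataclass itself, adding a new field without updating either set fails with the field's name, which is the "fail noisily" property the commit describes.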
…le [skip review]
Adds `.taskmaster/tasks/task_159/baseline/` as a byte-stable regression
oracle for the upcoming Task 160 cache_analysis refactor. Per-case folder:
workflow.pflow.md + command.sh + README.md (mutation contract) +
expected-{stdout,stderr,exit-code}.txt. Harness wraps each case in
`HOME=$BASELINE_HOME` redirect for filesystem isolation; normalize.py
redacts non-deterministic content (timestamps, paths, hashes, sub-cent
costs) without touching warning IDs / severity / JSON ordering.
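The redaction step above can be sketched roughly like this. The exact patterns and placeholder tokens are assumptions, not the committed normalize.py:

```python
import re

# Hedged sketch of baseline normalization; patterns and placeholders are
# illustrative, not the committed script's.
REDACTIONS = [
    (re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\S*"), "<TIMESTAMP>"),
    (re.compile(r"/home/\S+"), "<PATH>"),
    (re.compile(r"\b[0-9a-f]{40}\b"), "<HASH>"),
    (re.compile(r"\$0\.00\d+"), "<SUBCENT>"),
]

def normalize(text):
    """Replace non-deterministic content; warning IDs, severity, and JSON
    ordering pass through untouched because no pattern matches them."""
    for pattern, placeholder in REDACTIONS:
        text = pattern.sub(placeholder, text)
    return text
```

Redacting by substitution rather than deletion keeps line counts stable, so `verify.sh`-style diffs stay byte-comparable across runs.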
Surfaces (63 cases total):
01 parser errors 10
02 validator errors 8
03 analyze-cache modes 8
04 warning catalog (one per ID) 20 (15/20 trigger target ID; 5 in F-02)
05 advisory cases 5
12 real-world lyrics-generator 5 (committed snapshot of music workflow tree)
13 happy-path interactions 4 (batch+cache+prewarm, sub-workflow with cache, 3-level nesting)
14 Pitfall #19 defenses 3 (dotted-path through NamespacedSharedStore, file-resolve boundary)
`./regenerate.sh` writes expected-* files; `./verify.sh` re-runs and
diffs (drift = stop signal). Exit 0 = clean. Trailing-newline + per-line
trailing-whitespace handling aligned with pre-commit's
trailing-whitespace + end-of-file-fixer hooks so `make check` stays clean.
Findings logged in `FINDINGS.md`:
- F-01: parser silently splits two-vars-on-one-line (spec promises error)
- F-02: 5 catalog warning IDs need more elaborate fixtures than minimal
- F-03: `pflow guide <workflow>` auto-detect misses `caching` topic
- F-04: `cache.below-min-tokens` false-positive on greenfield when chunks
resolve to LLM responses
- F-05: `pflow visualize` validates before rendering — blocks on unknown
node types in unrelated sub-workflows
Audit handoff: `NEXT-AGENT-AUDIT.md` (mission, priorities, 7 worked
examples) + `starting-context/braindump-2026-05-08-baseline-audit-handoff.md`
(tacit knowledge transfer).
Out of scope per user direction: trace format 2.0.0 backcompat
(2.0.0 fixture removed); MCP server cases; surfaces 06-11 (deferred).
`pflow analyze-cache --from-trace` previously rendered three delta lines
(Actual savings + First-run + Rerun) in trace mode. The first-run-with-
cache projection assumes a fresh run with no memo cache hits and no
provider implicit caching — neither modeled. When those fire (the common
case), the projection diverges from actual cost by an order of magnitude
(lyrics-generator: projection said "saves 1%"; actual was 49%) and the
two competing numbers anchor agents on the smaller, misleading figure.
Option B: drop the first-run delta line in trace mode entirely. Show only
the measured number plus the steady-state forward-looking projection.
- `_render_summary_deltas` split into 4-line dispatch + two named helpers
`_render_trace_deltas` and `_render_greenfield_deltas`. Dispatch on
`evidence_scope ∈ {"complete_trace", "truncated_trace_executed_subset"}`
rather than `actually_paid_usd is not None` to handle the all-unpriced-
trace edge case correctly.
- Trace mode renders `Actual savings (this run):` (or `unavailable
(projection excludes …)`) + `Rerun delta (projected):`. The
`(projected)` suffix on rerun signals "model, not measurement" right
at the line.
- Greenfield mode unchanged shape: `First-run delta:` + `Rerun delta:`,
no `(projected)` suffix (both are projections by construction; the
absence of an actual-savings line carries the signal).
- The truncated-mode `(executed)` qualifier on rerun was retired —
the suppression note already conveys executed-subset context.
- `_format_delta` actual-delta label simplified `"actual vs no-cache"`
→ `"vs no-cache"` (the row label "Actual savings (this run):" already
says "actual"; doubled word was a Stage A artifact).
- Dead branch `if not in_trace_mode and actual_line:` removed
(production-unreachable per searcher #2; only synthetic fixtures hit it).
- `_make_analysis` test builder upgraded to set `evidence_scope` based
on `actually_paid` (matches production shape; closes a Pitfall #19
gap that searcher #2 specifically warned about under the new dispatch).
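The dispatch described above can be sketched as follows. Helper names follow the commit text (underscore prefixes dropped), but the bodies, summary shape, and dollar formatting are illustrative assumptions:

```python
# Hedged sketch of the Option B dispatch; helper names follow the commit
# text, bodies and the summary dict shape are assumptions.
TRACE_SCOPES = {"complete_trace", "truncated_trace_executed_subset"}

def render_summary_deltas(summary):
    """Dispatch on evidence scope, not on whether an actual cost exists,
    so the all-unpriced-trace edge case still takes the trace path."""
    if summary["evidence_scope"] in TRACE_SCOPES:
        return render_trace_deltas(summary)
    return render_greenfield_deltas(summary)

def render_trace_deltas(summary):
    actual = summary.get("actually_paid_usd")
    actual_line = (f"Actual savings (this run): ${actual:.2f} vs no-cache"
                   if actual is not None
                   else "Actual savings (this run): unavailable (projection excludes …)")
    # "(projected)" marks the rerun number as a model, not a measurement.
    return [actual_line,
            f"Rerun delta (projected): ${summary['rerun_delta_usd']:.2f}"]

def render_greenfield_deltas(summary):
    # Both lines are projections by construction, so no "(projected)" suffix.
    return [f"First-run delta: ${summary['first_run_delta_usd']:.2f}",
            f"Rerun delta: ${summary['rerun_delta_usd']:.2f}"]
```

The key property is that a trace whose calls are all unpriced (`actually_paid_usd is None`) still renders trace-mode output instead of falling back to greenfield projections.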
JSON shape unchanged — all three deltas still on `AnalysisSummary`.
Tests: 2 new regression tests with verified mutation contracts; 1
existing test renamed and rewritten for Option B truncated semantics;
1 new mutation guard on the `vs no-cache` label simplification. Three
mutation contracts verified by reverting production code and observing
the matching test fail.
6,396 default-suite tests pass. 65/65 baselines pass (no drift — the
parallel L-12/L-2/L-1 commit had already regenerated the 3 affected
baselines anticipating Option B's final shape). Manual smoke confirms
trace mode shows actual + rerun (projected) only; greenfield shows
first-run + rerun.
Closes L-3 from BASELINE-AUDIT Section F.
Five deferred BASELINE-AUDIT findings shipped together as the post-merge-block polish round. Single-render-format changes; no architectural shift. Bundled with the prior in-flight B-9 blocking-errors split work in this commit.

- B-6: drop "(ordered by impact)" qualifier when no action has positive savings (render_text.py).
- B-11: heterogeneous suffix uses "; plus " separator instead of bare "+" so it doesn't parse as "model X PLUS model Y" (render_text.py).
- B-8 + L-7: per-call tokens/cacheable columns widened from :>5 to :>7 (thousands separator + 6-digit alignment) (render_text.py).
- L-4: collect skipped sub-workflow paths and emit ONE aggregated note via new _format_skipped_workflows_note helper. Lyrics-generator collapses 15 near-identical "predicted-key matching skipped" notes (~4KB) to one summary line (analyze.py).

Tests: 12 new tests including a production-shape integration test (test_predict_cache_keys_aggregates_skip_notes_via_production_path) that drives _predict_cache_keys end-to-end — caught a Pitfall #19 where helper-only tests passed against reverted production code. Each fix has a verified mutation contract.

Doc fix: docs/reference/cli/analyze-cache.mdx now reflects the conditional "ordered by impact" claim.

Verification: make check clean (ruff + ruff-format + mypy + deptry); make test 6,415 passed, 1 skipped (+11 from baseline); 65/65 baseline cases pass after re-capture of 46 affected cases. Strict-improvement audit confirms all baseline drifts fall into expected classes.

Closes B-6, B-8, B-9, B-11, L-4, L-7 from BASELINE-AUDIT.md.
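The L-4 aggregation can be sketched as follows. The helper name comes from the commit text; the message format is an assumption:

```python
# Hedged sketch of the L-4 skip-note aggregation; helper name per the commit
# text (underscore prefix dropped), message wording is an assumption.
def format_skipped_workflows_note(skipped_paths):
    """Collapse per-workflow skip notes into one aggregated summary line."""
    if not skipped_paths:
        return None
    unique = sorted(set(skipped_paths))  # dedupe repeated sub-workflow paths
    return (f"Note: predicted-key matching skipped for {len(unique)} "
            f"sub-workflow(s): {', '.join(unique)}")
```

Collecting paths first and rendering once is what turns 15 near-identical notes into a single line in the lyrics-generator output.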
…licates
Restructured analyze-cache header for AI-agent readability. The 39-word
run-on scale line splits into 3 cohesive lines (`Workflow:` count +
invocation status; `Models:` for 2+ models; `Heterogeneous:` for batch
sub-workflows declaring `model: ${item.model}`). Single-model workflows
keep `using X` inline (pragmatic density).
The duplicate `Observed models:` line is dropped — `models_in_use` is
a superset of `observed_models_in_trace` in complete trace mode. The
sub-workflow breakdown line drops its trailing 15-name CSV; names
already appear in `## Per-call cache report` headings and
`## Per-child analyze-cache commands`.
Latent bug fixed as bonus: static-mode workflows on the no-model-resolved
branch previously suppressed the heterogeneous suffix. The static-mode
lyrics-generator capture now correctly surfaces `Heterogeneous:
generate-chorus-options (model varies per batch item)`.
`_format_heterogeneous_line` collapses the old 1-vs-N branches via
plain `', '.join`; `_format_heterogeneous_suffix` deleted as dead code.
Files: render_text.py (+50/-65 LOC, net -15 simpler), test_cache_analysis_renderers.py
(3 helper-direct tests migrated to end-to-end render assertions per
Pitfall #19; 2 N-9 CSV literal updates; 1 multi-model test rewritten;
1 new N-2 regression test). 4 mutation contracts verified. 5 baselines
regenerated as strict-improvement diffs.
6,430 tests pass (+4). `make check` clean. 65/65 baselines pass.
Closes N-2 + N-3 + N-9 from POLISH-PLAN.md Cluster D.
Summary
Fixes #18 - Arrays and dicts can now be embedded in shell command strings (and other string parameters) by auto-serializing to JSON.
Changes
- Add `str`/`string` to compatible types for `dict`, `list`, `array`, `object`
- New `_warn_shell_unsafe_json()` method warns when JSON contains characters that could break shell parsing (`'`, `` ` ``, `$()`)
- Update `architecture/reference/template-variables.md` with new type matrix and "Shell Command Limitations" section

Behavior
"echo '${items}'"["a", "b"]echo '["a", "b"]'"${items}"["a", "b"]"echo '${data}'"{"msg": "it's"}Guardrails Preserved
- Simple templates (`${items}` alone) → runtime check still blocks dict/list→str
- stdin workaround for passing arrays still works

File Changes
Testing
Run
make testto verify all tests pass (3612 passed).Run
make checkto verify linting and type checks pass.
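The behavior table above can be sketched roughly like this. `render_template` and its rules are illustrative assumptions, not the project's actual template engine:

```python
import json
import re

# Hedged sketch of the serialization rules; names and regex are assumptions.
def render_template(template, values):
    """JSON-serialize dict/list values only when embedded in a larger string."""
    match = re.fullmatch(r"\$\{(\w+)\}", template)
    if match:  # simple template: the whole string is one variable reference
        value = values[match.group(1)]
        if isinstance(value, (dict, list)):
            raise TypeError("dict/list -> str blocked for simple templates")
        return str(value)

    def substitute(m):  # complex template: auto-serialize complex values
        value = values[m.group(1)]
        return json.dumps(value) if isinstance(value, (dict, list)) else str(value)

    return re.sub(r"\$\{(\w+)\}", substitute, template)
```

The asymmetry is the guardrail: a bare `${items}` almost always signals a wiring mistake and stays blocked, while an embedded `${items}` inside a command string gets the JSON form the shell needs.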