feat(spec-009): orchestration × complexity matrix — full implementation by zenprocess · Pull Request #14 · zenprocess/pawbench

zenprocess · 2026-04-07T21:41:31Z

Summary

Operationalizes the orchestration × complexity matrix design inside Pawbench. Inspired by Fabian Wesner's One-Shot Shop Challenge (announcement) — the empirical demonstration that orchestration architecture beats model choice (Team Mode 85% vs Sub-Agents 57% on the same model, 143 E2E tests).

This PR closes Pawbench phases P3 through P7 of the matrix design.

What's new

Modules

complexity.py — ComplexityTier enum + heuristic inference (B2)
quality.py — ArtifactQuality + Python analyzer (ruff/mypy/radon, all optional) + generic fallback + pluggable registry (B4)
dqs.py — Composite Dispatch Quality Score v1 + dqs_spread
orchestration.py — OrchestrationShape vocabulary + run_with_shape executor + merge-turn synthesis (B1)
ablation.py — Pure ablation matrix over already-collected results, interpretation thresholds, removal candidates (B7)
context_tier.py — manifest-only context stripping (B6)
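As a rough illustration of what "ComplexityTier enum + heuristic inference" could look like: the four tier names below come from the `quality_by_tier` report field, while the keyword table and the `infer_tier` function are hypothetical and not taken from `complexity.py`.

```python
from enum import Enum


class ComplexityTier(str, Enum):
    """The four tiers named in this PR's report fields."""
    DISPLAY = "display"
    CRUD = "crud"
    TRANSACTIONAL = "transactional"
    CROSS_CUTTING = "cross_cutting"


# Hypothetical keyword hints, checked from most to least complex tier.
_HINTS = (
    (ComplexityTier.CROSS_CUTTING, ("auth", "audit", "logging", "permission")),
    (ComplexityTier.TRANSACTIONAL, ("checkout", "payment", "transaction", "rollback")),
    (ComplexityTier.CRUD, ("create", "update", "delete", "edit")),
)


def infer_tier(description: str) -> ComplexityTier:
    """Return the first tier whose keywords appear in the description.

    Falls back to DISPLAY when nothing matches (an assumed default).
    """
    text = description.lower()
    for tier, keywords in _HINTS:
        if any(keyword in text for keyword in keywords):
            return tier
    return ComplexityTier.DISPLAY
```

An explicit `complexity_tier` tag on a scenario would presumably take precedence over any inference of this kind.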

Scenarios

pawstyle-orchestration-matrix.json (new) — 4 independent feature blocks, one per complexity tier, designed to differentiate orchestration shapes
All 8 existing scenarios retroactively tagged with complexity_tier

CLI surface

--orchestration flat,waves,scatter-gather,team-mode,subagents
--ablate quality,format_compliance,tool_accuracy,useful_ratio,steering_rate
--context-tier {standard,manifest-only}
--verification-runs N
--no-quality-analysis
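The flags above could be wired up with `argparse` roughly as follows. This is a sketch only: the flag names and allowed values are from this PR, but the defaults, the `csv_of` helper, and `build_parser` are assumptions rather than Pawbench's actual CLI code.

```python
import argparse

SHAPES = ("flat", "waves", "scatter-gather", "team-mode", "subagents")
ABLATABLE = ("quality", "format_compliance", "tool_accuracy",
             "useful_ratio", "steering_rate")


def csv_of(allowed):
    """Build an argparse type that parses a comma-separated list of allowed values."""
    def parse(value):
        items = [v.strip() for v in value.split(",") if v.strip()]
        unknown = [v for v in items if v not in allowed]
        if unknown:
            raise argparse.ArgumentTypeError(
                f"unknown value(s): {', '.join(unknown)}")
        return items
    return parse


def build_parser():
    parser = argparse.ArgumentParser(prog="pawbench")
    # Defaults below are assumed for illustration.
    parser.add_argument("--orchestration", type=csv_of(SHAPES), default=["flat"])
    parser.add_argument("--ablate", type=csv_of(ABLATABLE), default=[])
    parser.add_argument("--context-tier",
                        choices=("standard", "manifest-only"), default="standard")
    parser.add_argument("--verification-runs", type=int, default=1, metavar="N")
    parser.add_argument("--no-quality-analysis", action="store_true")
    return parser
```

For example, `build_parser().parse_args(["--orchestration", "flat,team-mode"])` yields `orchestration == ["flat", "team-mode"]`.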

Report fields

dim5_artifact_quality (per-scenario + aggregate)
quality_by_tier — display / crud / transactional / cross_cutting
orchestration_results + orchestration_dqs_spread — the headline SLI
ablation — per-component delta + removal candidates
dqs — composite + breakdown + verification reliability
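To make "per-component delta + removal candidates" and the DQS spread concrete, here is a minimal sketch. It assumes the composite is a weighted mean of component scores, that ablation re-scores with one component dropped, and that spread is max minus min across shapes; the actual DQS v1 formula and thresholds are not stated in this PR, so all names and numbers here are hypothetical.

```python
def composite(scores, weights):
    """Weighted mean over whichever components are present (assumed formula)."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight


def ablate(scores, weights, threshold=0.01):
    """Recompute the composite with each component removed in turn.

    Works on already-collected scores, so no benchmark is re-run.
    Returns (deltas, removal_candidates); a component whose removal moves
    the composite by less than `threshold` is flagged as a candidate.
    """
    base = composite(scores, weights)
    deltas = {}
    for name in scores:
        remaining = {k: v for k, v in scores.items() if k != name}
        deltas[name] = base - composite(remaining, weights)
    candidates = sorted(n for n, d in deltas.items() if abs(d) < threshold)
    return deltas, candidates


def spread(dqs_by_shape):
    """Gap between the best and worst orchestration shape (assumed definition)."""
    return max(dqs_by_shape.values()) - min(dqs_by_shape.values())
```

With scores `{"format_compliance": 1.0, "tool_accuracy": 0.5, "useful_ratio": 0.75}` and equal weights, removing `useful_ratio` leaves the composite unchanged, so it would be the only removal candidate under these assumptions.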

Tests

53 new tests across 6 test_spec009_* files
Full regression: 151/151 green (98 existing + 53 new)
Network-free unit tests via lazy aiohttp/engine imports in orchestration.py
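The lazy-import pattern mentioned above can be sketched like this. The function names are hypothetical; the point is only that `aiohttp` is imported inside the function that needs it, so importing the module and testing its pure helpers does not require the network stack to be installed.

```python
def merge_outputs(outputs):
    """Pure helper: join non-empty agent outputs. Needs no network imports."""
    return "\n\n".join(o.strip() for o in outputs if o.strip())


async def fetch_json(url):
    """Network path: aiohttp is imported only when this coroutine actually runs."""
    import aiohttp  # deferred so module import stays dependency-free

    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            resp.raise_for_status()
            return await resp.json()
```

A unit test can then exercise `merge_outputs` directly without touching `fetch_json`.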

Explicitly out of scope (deferred follow-ups)

waves shape currently degenerates to subagents; team-mode degenerates to scatter-gather. Real DAG-aware waves and shared-scratchpad team-mode need precise operational contracts before they ship distinct executors.
artifact_quality is reported but does NOT feed DQS — calibration data required first (≥100 dispatches), then formula change in a separate PR.
verifier_agreement_rate is structurally surfaced but degenerate at 1.0 until an LLM-judge verifier replaces deterministic score_turn. Plumbing is in place; the judge is a follow-up.
Leaderboard rendering of new dimensions is a separate PR.
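Why `verifier_agreement_rate` sits at 1.0 with a deterministic scorer can be shown with a small sketch. It assumes agreement is the share of verification runs matching the majority verdict; that definition, and the stand-in scorer below, are illustrative guesses and not the code in this PR.

```python
from collections import Counter


def agreement_rate(verdicts):
    """Fraction of runs that agree with the most common verdict (assumed definition)."""
    if not verdicts:
        raise ValueError("at least one verification run is required")
    majority_count = Counter(verdicts).most_common(1)[0][1]
    return majority_count / len(verdicts)


def deterministic_score(turn_output):
    """Hypothetical stand-in for a deterministic scorer such as score_turn."""
    return "tool_call" in turn_output


# Re-running a deterministic scorer on the same output always gives the
# same verdict, so the rate cannot drop below 1.0.
same_input_runs = [deterministic_score("... tool_call ...") for _ in range(5)]
```

Only a non-deterministic verifier, such as the LLM judge planned as a follow-up, can produce mixed verdicts like `[True, True, False, True]`, which would score 0.75 here.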

Companion work

Axiom §17 — normative orchestration shape and complexity tier vocabularies (v0.6.0)
CACP additive fields — complexity_tier, verification_runs[], artifact_quality, fixture_gap status

Test plan

pytest tests/test_spec009_* — 53/53 green
pytest tests/ full regression — 151/151 green
CLI surface smoke (pawbench --help shows all 5 new flags)

🤖 Generated with Claude Code

Operationalizes Switchyard spec 009 inside Pawbench. Inspired by Fabian Wesner's One-Shot Shop Challenge (agentic-engineers.dev), the empirical demonstration that orchestration architecture beats model choice (Team Mode 85% vs Sub-Agents 57% on the same model, 143 E2E tests).

## New modules

- complexity.py: ComplexityTier enum + heuristic inference (B2)
- quality.py: ArtifactQuality + Python analyzer (ruff/mypy/radon) + generic fallback + pluggable registry (B4)
- dqs.py: Composite Dispatch Quality Score v1 + dqs_spread
- orchestration.py: OrchestrationShape vocabulary + run_with_shape executor (flat / waves / scatter-gather / team-mode / subagents) with merge-turn synthesis (B1)
- ablation.py: Pure ablation matrix over already-collected results, interpretation thresholds, removal candidates (B7)
- context_tier.py: manifest-only context stripping (B6)

## Scenarios

- pawstyle-orchestration-matrix.json (NEW): 4 independent feature blocks, one per complexity tier (display / crud / transactional / cross_cutting), designed to differentiate orchestration shapes
- All 8 existing scenarios retroactively tagged with complexity_tier

## CLI surface

- --orchestration flat,waves,scatter-gather,team-mode,subagents
- --ablate quality,format_compliance,tool_accuracy,useful_ratio,steering_rate
- --context-tier standard|manifest-only
- --verification-runs N
- --no-quality-analysis

## Report fields

- dim5_artifact_quality (per-scenario + aggregate)
- quality_by_tier (display/crud/transactional/cross_cutting)
- orchestration_results + orchestration_dqs_spread (the headline SLI)
- ablation (per-component delta + removal candidates)
- dqs (composite + breakdown + verification reliability)

## Tests

- 53 new tests across 6 spec_009_* test files
- Full regression: 151/151 green (98 existing + 53 new)
- Network-free unit tests via lazy aiohttp/engine imports in orchestration.py

## Out of scope (deferred follow-ups)

- waves shape currently degenerates to subagents; team-mode degenerates to scatter-gather. Real DAG-aware waves and shared-scratchpad team-mode need precise operational contracts before they ship distinct executors.
- artifact_quality is reported but does NOT feed DQS; calibration data required first (>=100 dispatches), then formula change in a separate PR.
- verifier_agreement_rate is structurally surfaced but degenerate at 1.0 until an LLM-judge verifier replaces deterministic score_turn.
- Leaderboard rendering of new dimensions is a separate (web) PR.

Tracking: zenprocess/switchyard#476 (closes pawbench phases P3-P7 of spec 009).

Companion work: switchyard#472 (CACP additive fields), axiom#10 (§17 Stratification spec), pawbench#9 (README attribution).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

codecov-commenter · 2026-04-07T21:42:02Z

Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

ℹ️ You can also turn on project coverage checks and project coverage reporting on Pull Request comment

Thanks for integrating Codecov - We've got you covered ☂️


… stub

- scoring.py: extract _resolve_agent_spec/_turn_spec_at/_resolve_tier helpers to lower quality_by_tier cognitive complexity
- quality.py: extract _count_ruff_issues/_count_mypy_errors/_max_radon_complexity + _run_python_analyzers to lower _analyze_python cognitive complexity
- servingcard.py: type: ignore on yaml import + raise from e (pre-existing typecheck failure on master, fixed here so spec 009 CI is green)
- ruff format applied across all spec 009 files

Adds an "Inspired by" callout in the About section linking to:

- agentic-engineers.dev (the study)
- Fabian's LinkedIn announcement
- switchyard spec 009 (operational mapping)

Fabian's headline finding, that orchestration architecture beats model choice (Team Mode 85% vs Sub-Agents 57%), is the empirical motivation for the upcoming Pawbench orchestration × complexity matrix.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

sonarqubecloud · 2026-04-07T22:01:21Z

Quality Gate failed

Failed conditions
C Reliability Rating on New Code (required ≥ A)

See analysis details on SonarQube Cloud


Follow-up to #14 to clear the sonar quality gate on master.

- orchestration._run_parallel: replace `assert isinstance(item, AgentResult)` with explicit elif/else fallback. Asserts are stripped under python -O, so they're not load-bearing in production. Sonar flagged this as a bug.
- quality._run: add a security review docstring documenting why these subprocess.run calls are safe (shell=False by default, argv-list, no user-controlled tokens reach argv, fixed scratch cwd, bounded timeout, capture_output=True). Add # nosec marker for B603 (subprocess-without-shell-equals-true is the desired form, not a finding).
- quality._materialize: lift root.resolve() out of the loop and clarify the path-escape rejection comment. No behavior change.

Same 53 spec-009 tests pass; ruff + format clean.
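The assert-to-branch change described for `orchestration._run_parallel` can be illustrated with a small sketch. The `AgentResult` fields, the `_run_agent` stub, and the `specs` shape are hypothetical; only the pattern (narrowing `asyncio.gather(..., return_exceptions=True)` results with explicit branches, which still work when `python -O` strips asserts) reflects the fix as described.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class AgentResult:
    # Field names are assumed for illustration.
    agent: str
    output: str
    error: Optional[str] = None


async def _run_agent(name, fail=False):
    """Stub agent: yields control once, then succeeds or raises."""
    await asyncio.sleep(0)
    if fail:
        raise RuntimeError(f"{name} failed")
    return AgentResult(name, f"{name} done")


async def run_parallel(specs):
    """Run all agents concurrently and normalise every outcome to AgentResult."""
    names = list(specs)
    raw = await asyncio.gather(
        *(_run_agent(name, specs[name]) for name in names),
        return_exceptions=True,
    )
    results = []
    for name, item in zip(names, raw):
        if isinstance(item, AgentResult):
            results.append(item)
        elif isinstance(item, BaseException):
            # Exceptions come back as values when return_exceptions=True.
            results.append(AgentResult(name, "", error=repr(item)))
        else:
            # Explicit fallback in place of an assert.
            results.append(AgentResult(
                name, "", error=f"unexpected result type: {type(item).__name__}"))
    return results
```

Calling `asyncio.run(run_parallel({"a": False, "b": True}))` returns one successful result and one carrying the captured error, without any failure cancelling the batch.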

Val Vladescu and others added 4 commits April 8, 2026 00:49

fix(spec-009): ruff import sort + line length + mypy gather-result na…

7f087a3

…rrowing

docs: README — advertise spec 009 dimensions, flags, and orchestratio…

c8f5e52

…n matrix scenario

zenprocess mentioned this pull request Apr 7, 2026

docs: attribute Fabian Wesner's One-Shot Shop Challenge #9

Closed

zenprocess merged commit a75a36e into master Apr 7, 2026
6 of 7 checks passed

zenprocess deleted the spec-009-full-implementation branch April 7, 2026 22:02

zenprocess mentioned this pull request Apr 8, 2026

fix(spec-009): clear sonar reliability rating + harden subprocess calls #15

Merged

5 tasks

zenprocess merged 5 commits into master from spec-009-full-implementation
