feat(spec-009): orchestration × complexity matrix — full implementation #14
Merged
zenprocess merged 5 commits into master on Apr 7, 2026
Operationalizes Switchyard spec 009 inside Pawbench. Inspired by Fabian
Wesner's One-Shot Shop Challenge (agentic-engineers.dev) — the empirical
demonstration that orchestration architecture beats model choice
(Team Mode 85% vs Sub-Agents 57% on the same model, 143 E2E tests).
## New modules
- complexity.py: ComplexityTier enum + heuristic inference (B2)
- quality.py: ArtifactQuality + Python analyzer (ruff/mypy/radon)
  + generic fallback + pluggable registry (B4)
- dqs.py: Composite Dispatch Quality Score v1 + dqs_spread
- orchestration.py: OrchestrationShape vocabulary + run_with_shape
  executor (flat / waves / scatter-gather / team-mode
  / subagents) with merge-turn synthesis (B1)
- ablation.py: Pure ablation matrix over already-collected results,
  interpretation thresholds, removal candidates (B7)
- context_tier.py: manifest-only context stripping (B6)
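To make the ablation.py entry concrete, here is a minimal sketch of a "pure" ablation pass over scores that were already collected. It is not the module's actual code: the equal-weight composite, the function names, and the 0.01 threshold are all assumptions made for illustration, and the component names are borrowed from the --ablate flag.

```python
from statistics import mean


def composite(scores, skip=None):
    """Equal-weight mean of component scores (assumed formula), optionally leaving one out."""
    return mean(value for name, value in scores.items() if name != skip)


def ablation_matrix(scores, threshold=0.01):
    """Recompute the composite with each component removed; no re-dispatch is needed."""
    baseline = composite(scores)
    deltas = {name: composite(scores, skip=name) - baseline for name in scores}
    # Hypothetical rule: a component whose removal barely moves the composite
    # is flagged as a removal candidate.
    candidates = sorted(name for name, delta in deltas.items() if abs(delta) < threshold)
    return {"baseline": baseline, "deltas": deltas, "removal_candidates": candidates}


# Made-up component scores, for illustration only.
example = {
    "quality": 0.8,
    "format_compliance": 0.9,
    "tool_accuracy": 0.7,
    "useful_ratio": 0.8,
    "steering_rate": 0.8,
}
result = ablation_matrix(example)
```

Because the function only reads stored scores, it can run in tests or after the fact without touching the network.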
## Scenarios
- pawstyle-orchestration-matrix.json (NEW): 4 independent feature blocks,
  one per complexity tier (display / crud / transactional / cross_cutting),
  designed to differentiate orchestration shapes
- All 8 existing scenarios retroactively tagged with complexity_tier
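The tier tagging above could look roughly like the sketch below. Only the four tier names and the complexity_tier key come from this PR. The `description` field, the keyword hints, and the `infer_tier` name are hypothetical, and the real heuristic in complexity.py may work quite differently.

```python
from enum import Enum


class ComplexityTier(str, Enum):
    DISPLAY = "display"
    CRUD = "crud"
    TRANSACTIONAL = "transactional"
    CROSS_CUTTING = "cross_cutting"


# Hypothetical keyword hints, checked from the most to the least complex tier.
_HINTS = [
    (ComplexityTier.CROSS_CUTTING, ("auth", "audit", "permission", "i18n")),
    (ComplexityTier.TRANSACTIONAL, ("checkout", "payment", "rollback", "transaction")),
    (ComplexityTier.CRUD, ("create", "update", "delete", "edit")),
]


def infer_tier(scenario: dict) -> ComplexityTier:
    """Prefer an explicit complexity_tier tag; otherwise guess from the description."""
    tagged = scenario.get("complexity_tier")
    if tagged is not None:
        return ComplexityTier(tagged)
    text = scenario.get("description", "").lower()
    for tier, keywords in _HINTS:
        if any(word in text for word in keywords):
            return tier
    return ComplexityTier.DISPLAY
```

An explicit tag always wins, so retroactively tagged scenarios bypass the heuristic entirely.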
## CLI surface
- --orchestration flat,waves,scatter-gather,team-mode,subagents
- --ablate quality,format_compliance,tool_accuracy,useful_ratio,steering_rate
- --context-tier standard|manifest-only
- --verification-runs N
- --no-quality-analysis
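For readers who want to see how such a flag set might be wired up, the following argparse sketch parses the five flags listed above. The flag names and allowed values are taken from the list. The defaults, the `csv_of` helper, and the `build_parser` name are assumptions, not Pawbench's actual CLI code.

```python
import argparse

SHAPES = ("flat", "waves", "scatter-gather", "team-mode", "subagents")
COMPONENTS = ("quality", "format_compliance", "tool_accuracy", "useful_ratio", "steering_rate")


def csv_of(allowed):
    """Build an argparse type that splits a comma-separated value and validates each item."""
    def parse(value):
        items = [item.strip() for item in value.split(",") if item.strip()]
        unknown = [item for item in items if item not in allowed]
        if unknown:
            raise argparse.ArgumentTypeError(f"unknown value(s): {', '.join(unknown)}")
        return items
    return parse


def build_parser():
    parser = argparse.ArgumentParser(prog="pawbench")
    # All defaults below are assumed for the sake of the example.
    parser.add_argument("--orchestration", type=csv_of(SHAPES), default=["flat"])
    parser.add_argument("--ablate", type=csv_of(COMPONENTS), default=[])
    parser.add_argument("--context-tier", choices=("standard", "manifest-only"), default="standard")
    parser.add_argument("--verification-runs", type=int, default=1)
    parser.add_argument("--no-quality-analysis", action="store_true")
    return parser


args = build_parser().parse_args(
    ["--orchestration", "flat,team-mode", "--context-tier", "manifest-only"]
)
```

Validating each comma-separated item at parse time means a typo such as `team_mode` fails fast instead of surfacing midway through a run.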
## Report fields
- dim5_artifact_quality (per-scenario + aggregate)
- quality_by_tier (display/crud/transactional/cross_cutting)
- orchestration_results + orchestration_dqs_spread (the headline SLI)
- ablation (per-component delta + removal candidates)
- dqs (composite + breakdown + verification reliability)
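The PR does not spell out how orchestration_dqs_spread is computed. One plausible reading, shown here purely as an assumption, is the range (maximum minus minimum) of the composite DQS across the orchestration shapes that were run.

```python
def dqs_spread(dqs_by_shape: dict) -> float:
    """Range of composite DQS across shapes (assumed definition: max minus min)."""
    if not dqs_by_shape:
        return 0.0
    values = list(dqs_by_shape.values())
    return max(values) - min(values)


# Made-up scores for illustration only; these are not results from this PR.
scores = {"flat": 0.5, "scatter-gather": 0.75, "subagents": 0.25}
spread = dqs_spread(scores)
```

Under this reading, a spread near zero would suggest the scenario does not differentiate orchestration shapes, while a large spread would suggest it does.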
## Tests
- 53 new tests across 6 spec_009_* test files
- Full regression: 151/151 green (98 existing + 53 new)
- Network-free unit tests via lazy aiohttp/engine imports in orchestration.py
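The lazy-import approach mentioned in the last bullet can be sketched as follows. The function names are hypothetical and this is not the code in orchestration.py. The point is only that the network dependency is imported inside the function that needs it, so pure logic stays importable without it.

```python
def plan_dispatch(agent_ids):
    """Pure planning logic: importable and testable without aiohttp installed."""
    return [{"agent": agent_id, "turn": 0} for agent_id in agent_ids]


async def dispatch(url, payload):
    """Network path: aiohttp is imported only when this coroutine actually runs."""
    import aiohttp  # deferred import keeps module import network-free

    async with aiohttp.ClientSession() as session:
        async with session.post(url, json=payload) as response:
            return await response.json()


plan = plan_dispatch(["agent-a", "agent-b"])
```

Unit tests can then exercise `plan_dispatch` directly, and only integration tests need aiohttp and a reachable endpoint.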
## Out of scope (deferred follow-ups)
- waves shape currently degenerates to subagents; team-mode degenerates to
  scatter-gather. Real DAG-aware waves and shared-scratchpad team-mode need
  precise operational contracts before they ship distinct executors.
- artifact_quality is reported but does NOT feed DQS: calibration data is
  required first (>=100 dispatches), then a formula change in a separate PR.
- verifier_agreement_rate is structurally surfaced but degenerate at 1.0
  until an LLM-judge verifier replaces deterministic score_turn.
- Leaderboard rendering of new dimensions is a separate (web) PR.
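The shape degeneration described in the first bullet can be modelled as a simple lookup. This is an illustrative sketch: the shape names and the two fallbacks come from the text above, while the enum layout and the `effective_shape` helper are assumptions about how run_with_shape might resolve a requested shape.

```python
from enum import Enum


class OrchestrationShape(str, Enum):
    FLAT = "flat"
    WAVES = "waves"
    SCATTER_GATHER = "scatter-gather"
    TEAM_MODE = "team-mode"
    SUBAGENTS = "subagents"


# Shapes without a distinct executor yet, mapped to the executor they currently reuse.
_DEGENERATES_TO = {
    OrchestrationShape.WAVES: OrchestrationShape.SUBAGENTS,
    OrchestrationShape.TEAM_MODE: OrchestrationShape.SCATTER_GATHER,
}


def effective_shape(requested: str) -> OrchestrationShape:
    """Resolve a requested shape name to the executor that would actually run."""
    shape = OrchestrationShape(requested)
    return _DEGENERATES_TO.get(shape, shape)
```

Recording both the requested and the effective shape in a report would make it obvious when two columns of the matrix were produced by the same executor.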
Tracking: zenprocess/switchyard#476 — closes pawbench phases P3-P7 of
spec 009. Companion work: switchyard#472 (CACP additive fields), axiom#10
(§17 Stratification spec), pawbench#9 (README attribution).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…n matrix scenario
… stub
- scoring.py: extract _resolve_agent_spec/_turn_spec_at/_resolve_tier helpers to lower quality_by_tier cognitive complexity
- quality.py: extract _count_ruff_issues/_count_mypy_errors/_max_radon_complexity + _run_python_analyzers to lower _analyze_python cognitive complexity
- servingcard.py: type: ignore on yaml import + raise from e (pre-existing typecheck failure on master, fixed here so spec 009 CI is green)
- ruff format applied across all spec 009 files
Adds an "Inspired by" callout in the About section linking to:
- agentic-engineers.dev (the study)
- Fabian's LinkedIn announcement
- switchyard spec 009 (operational mapping)
Fabian's headline finding, that orchestration architecture beats model choice (Team Mode 85% vs Sub-Agents 57%), is the empirical motivation for the upcoming Pawbench orchestration × complexity matrix.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
zenprocess pushed a commit that referenced this pull request on Apr 8, 2026
Follow-up to #14 to clear the sonar quality gate on master.
- orchestration._run_parallel: replace `assert isinstance(item, AgentResult)` with explicit elif/else fallback. Asserts are stripped under python -O, so they're not load-bearing in production. Sonar flagged this as a bug.
- quality._run: add a security review docstring documenting why these subprocess.run calls are safe (shell=False by default, argv-list, no user-controlled tokens reach argv, fixed scratch cwd, bounded timeout, capture_output=True). Add # nosec marker for B603 (subprocess-without-shell-equals-true is the desired form, not a finding).
- quality._materialize: lift root.resolve() out of the loop and clarify the path-escape rejection comment. No behavior change.
Same 53 spec-009 tests pass; ruff + format clean.
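The assert replacement in orchestration._run_parallel can be illustrated with the sketch below. It assumes the parallel runner collects results via `asyncio.gather(..., return_exceptions=True)`, which this page does not confirm. The `AgentResult` dataclass, `_run_one`, and `run_parallel` here are hypothetical stand-ins, not the real implementation.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class AgentResult:  # hypothetical stand-in for the real result type
    agent: str
    ok: bool
    error: str = ""


async def _run_one(agent: str) -> AgentResult:
    """Pretend dispatch: one agent name is wired to fail for the demo."""
    if agent == "broken":
        raise RuntimeError("dispatch failed")
    return AgentResult(agent=agent, ok=True)


async def run_parallel(agents):
    raw = await asyncio.gather(*(_run_one(a) for a in agents), return_exceptions=True)
    results = []
    for agent, item in zip(agents, raw):
        if isinstance(item, AgentResult):
            results.append(item)
        elif isinstance(item, BaseException):
            results.append(AgentResult(agent=agent, ok=False, error=repr(item)))
        else:
            # Explicit fallback instead of `assert`, which python -O would strip.
            message = f"unexpected {type(item).__name__}"
            results.append(AgentResult(agent=agent, ok=False, error=message))
    return results


results = asyncio.run(run_parallel(["a", "broken"]))
```

With explicit branches, an unexpected item becomes a failed result in every interpreter mode rather than passing silently when assertions are disabled.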
## Summary
Operationalizes the orchestration × complexity matrix design inside Pawbench. Inspired by Fabian Wesner's One-Shot Shop Challenge (announcement), the empirical demonstration that orchestration architecture beats model choice (Team Mode 85% vs Sub-Agents 57% on the same model, 143 E2E tests).
This PR closes Pawbench phases P3 through P7 of the matrix design.
## What's new
### Modules
- complexity.py: ComplexityTier enum + heuristic inference (B2)
- quality.py: ArtifactQuality + Python analyzer (ruff/mypy/radon, all optional) + generic fallback + pluggable registry (B4)
- dqs.py: Composite Dispatch Quality Score v1 + dqs_spread
- orchestration.py: OrchestrationShape vocabulary + run_with_shape executor + merge-turn synthesis (B1)
- ablation.py: Pure ablation matrix over already-collected results, interpretation thresholds, removal candidates (B7)
- context_tier.py: manifest-only context stripping (B6)
### Scenarios
- pawstyle-orchestration-matrix.json (new): 4 independent feature blocks, one per complexity tier, designed to differentiate orchestration shapes
- All 8 existing scenarios retroactively tagged with complexity_tier
### CLI surface
- --orchestration flat,waves,scatter-gather,team-mode,subagents
- --ablate quality,format_compliance,tool_accuracy,useful_ratio,steering_rate
- --context-tier standard|manifest-only
- --verification-runs N
- --no-quality-analysis
### Report fields
- dim5_artifact_quality (per-scenario + aggregate)
- quality_by_tier: display / crud / transactional / cross_cutting
- orchestration_results + orchestration_dqs_spread: the headline SLI
- ablation: per-component delta + removal candidates
- dqs: composite + breakdown + verification reliability
### Tests
- 53 new tests across 6 test_spec009_* files
- Full regression: 151/151 green (98 existing + 53 new)
- Network-free unit tests via lazy aiohttp/engine imports in orchestration.py
## Explicitly out of scope (deferred follow-ups)
- waves shape currently degenerates to subagents; team-mode degenerates to scatter-gather. Real DAG-aware waves and shared-scratchpad team-mode need precise operational contracts before they ship distinct executors.
- artifact_quality is reported but does NOT feed DQS: calibration data is required first (≥100 dispatches), then a formula change in a separate PR.
- verifier_agreement_rate is structurally surfaced but degenerate at 1.0 until an LLM-judge verifier replaces deterministic score_turn. Plumbing is in place; the judge is a follow-up.
## Companion work
- complexity_tier, verification_runs[], artifact_quality, fixture_gap status
## Test plan
- pytest tests/test_spec009_*: 53/53 green
- pytest tests/ full regression: 151/151 green
- pawbench --help shows all 5 new flags
🤖 Generated with Claude Code