feat(spec-009): orchestration × complexity matrix — full implementation #14

Merged: zenprocess merged 5 commits into master from spec-009-full-implementation on Apr 7, 2026

zenprocess (Owner) commented Apr 7, 2026

Summary

Operationalizes the orchestration × complexity matrix design inside Pawbench. Inspired by Fabian Wesner's One-Shot Shop Challenge (announcement) — the empirical demonstration that orchestration architecture beats model choice (Team Mode 85% vs Sub-Agents 57% on the same model, 143 E2E tests).

This PR closes Pawbench phases P3 through P7 of the matrix design.

What's new

Modules

  • complexity.py: ComplexityTier enum + heuristic inference (B2)
  • quality.py: ArtifactQuality + Python analyzer (ruff/mypy/radon, all optional) + generic fallback + pluggable registry (B4)
  • dqs.py: Composite Dispatch Quality Score v1 + dqs_spread
  • orchestration.py: OrchestrationShape vocabulary + run_with_shape executor + merge-turn synthesis (B1)
  • ablation.py: Pure ablation matrix over already-collected results, interpretation thresholds, removal candidates (B7)
  • context_tier.py: manifest-only context stripping (B6)
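
To make the "enum + heuristic inference" idea in complexity.py concrete, here is a minimal sketch. The four tier names come from the report fields in this PR; the keyword table and the `infer_tier` function are hypothetical illustrations and are not taken from the actual module, whose heuristic may work quite differently.

```python
from enum import Enum


class ComplexityTier(str, Enum):
    """The four tier names that appear in this PR's report fields."""

    DISPLAY = "display"
    CRUD = "crud"
    TRANSACTIONAL = "transactional"
    CROSS_CUTTING = "cross_cutting"


# Hypothetical keyword hints (an assumption for illustration only),
# checked from the most complex tier down to the least complex one.
_TIER_HINTS = (
    (ComplexityTier.CROSS_CUTTING, ("auth", "audit", "permission", "logging")),
    (ComplexityTier.TRANSACTIONAL, ("checkout", "payment", "rollback", "transaction")),
    (ComplexityTier.CRUD, ("create", "update", "delete", "edit")),
)


def infer_tier(description: str) -> ComplexityTier:
    """Return the first tier whose hint occurs in the text, else DISPLAY."""
    text = description.lower()
    for tier, hints in _TIER_HINTS:
        if any(hint in text for hint in hints):
            return tier
    return ComplexityTier.DISPLAY
```

Checking the most complex tier first means a task that mentions both "update" and "audit" is classified as cross_cutting rather than crud, which seems the safer default when a heuristic has to guess.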

Scenarios

  • pawstyle-orchestration-matrix.json (new) — 4 independent feature blocks, one per complexity tier, designed to differentiate orchestration shapes
  • All 8 existing scenarios retroactively tagged with complexity_tier

CLI surface

--orchestration flat,waves,scatter-gather,team-mode,subagents
--ablate quality,format_compliance,tool_accuracy,useful_ratio,steering_rate
--context-tier {standard,manifest-only}
--verification-runs N
--no-quality-analysis
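
For readers who want to see how flags of this form could be parsed, the following is a rough argparse sketch. The flag names and allowed values are copied from the list above; the defaults, the `csv_of` helper, and `build_parser` are assumptions made for illustration and may not match Pawbench's real CLI wiring.

```python
import argparse

SHAPES = ("flat", "waves", "scatter-gather", "team-mode", "subagents")
ABLATABLE = (
    "quality",
    "format_compliance",
    "tool_accuracy",
    "useful_ratio",
    "steering_rate",
)


def csv_of(allowed):
    """Build an argparse type that parses a comma-separated list of allowed names."""

    def parse(raw: str) -> list[str]:
        items = [part.strip() for part in raw.split(",") if part.strip()]
        unknown = [item for item in items if item not in allowed]
        if unknown:
            raise argparse.ArgumentTypeError(f"unknown value(s): {', '.join(unknown)}")
        return items

    return parse


def build_parser() -> argparse.ArgumentParser:
    # All defaults below are assumptions; the PR does not state them.
    parser = argparse.ArgumentParser(prog="pawbench")
    parser.add_argument("--orchestration", type=csv_of(SHAPES), default=["flat"])
    parser.add_argument("--ablate", type=csv_of(ABLATABLE), default=[])
    parser.add_argument(
        "--context-tier", choices=("standard", "manifest-only"), default="standard"
    )
    parser.add_argument("--verification-runs", type=int, default=1, metavar="N")
    parser.add_argument("--no-quality-analysis", action="store_true")
    return parser
```

With this sketch, `build_parser().parse_args(["--orchestration", "flat,team-mode"])` yields a list of shape names, and an unrecognized shape is rejected at parse time instead of surfacing later in the run.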

Report fields

  • dim5_artifact_quality (per-scenario + aggregate)
  • quality_by_tier: display / crud / transactional / cross_cutting
  • orchestration_results + orchestration_dqs_spread: the headline SLI
  • ablation: per-component delta + removal candidates
  • dqs: composite + breakdown + verification reliability
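
The PR does not spell out the DQS v1 formula, the definition of the spread, or how ablation deltas are computed, so the sketch below only illustrates the general shape of these report fields under loudly stated assumptions: an equal-weight mean as the composite, "best minus worst" as the spread, and "score without the component minus baseline" as the ablation delta. All function names are hypothetical.

```python
from statistics import mean


def composite(components: dict[str, float], drop: frozenset[str] = frozenset()) -> float:
    """Equal-weight mean of the kept components (assumed formula, not DQS v1)."""
    kept = [value for name, value in components.items() if name not in drop]
    return mean(kept)


def dqs_spread(dqs_by_shape: dict[str, float]) -> float:
    """Assumed definition: the best shape's score minus the worst shape's score."""
    scores = dqs_by_shape.values()
    return max(scores) - min(scores)


def ablation_deltas(components: dict[str, float], candidates: list[str]) -> dict[str, float]:
    """Recompute the composite with each candidate removed and report the change.

    This works purely on already-collected component scores; nothing is re-run.
    """
    baseline = composite(components)
    return {
        name: composite(components, frozenset({name})) - baseline
        for name in candidates
    }
```

Under these assumptions, a delta close to zero suggests the component barely moves the composite, which is the kind of signal a "removal candidate" list would be built from. A large spread across shapes on the same model is what would make orchestration choice visible in the report.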

Tests

  • 53 new tests across 6 test_spec009_* files
  • Full regression: 151/151 green (98 existing + 53 new)
  • Network-free unit tests via lazy aiohttp/engine imports in orchestration.py

Explicitly out of scope (deferred follow-ups)

  • waves shape currently degenerates to subagents; team-mode degenerates to scatter-gather. Real DAG-aware waves and shared-scratchpad team-mode need precise operational contracts before they ship distinct executors.
  • artifact_quality is reported but does NOT feed DQS — calibration data required first (≥100 dispatches), then formula change in a separate PR.
  • verifier_agreement_rate is structurally surfaced but degenerate at 1.0 until an LLM-judge verifier replaces deterministic score_turn. Plumbing is in place; the judge is a follow-up.
  • Leaderboard rendering of new dimensions is a separate PR.
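
Two of the deferred items above can be illustrated with a small sketch. The fallback table mirrors the degeneration stated in this PR (waves to subagents, team-mode to scatter-gather); everything else, including `effective_shape`, `agreement_rate`, and the stand-in scorer, is a hypothetical illustration and not the project's code. The agreement definition used here (share of runs matching the most common verdict) is an assumption.

```python
from collections import Counter

# Fallbacks stated in the PR; every other shape runs as itself.
SHAPE_FALLBACK = {"waves": "subagents", "team-mode": "scatter-gather"}


def effective_shape(requested: str) -> str:
    """Return the executor that actually runs for a requested shape."""
    return SHAPE_FALLBACK.get(requested, requested)


def agreement_rate(verdicts: list[bool]) -> float:
    """Share of verification runs that match the most common verdict (assumed)."""
    if not verdicts:
        raise ValueError("at least one verification run is required")
    most_common_count = Counter(verdicts).most_common(1)[0][1]
    return most_common_count / len(verdicts)


def deterministic_score(turn_output: str) -> bool:
    """Stand-in for a deterministic scorer: the same input always gives the same verdict."""
    return "ok" in turn_output
```

Because a deterministic scorer returns the same verdict on every repeat, `agreement_rate` over any number of runs of the same output is always 1.0. The metric only becomes informative once a verifier that can disagree with itself, such as an LLM judge, produces the verdicts.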

Companion work

  • Axiom §17: normative orchestration shape and complexity tier vocabularies (v0.6.0)
  • CACP additive fields: complexity_tier, verification_runs[], artifact_quality, fixture_gap status

Test plan

  • pytest tests/test_spec009_* — 53/53 green
  • pytest tests/ full regression — 151/151 green
  • CLI surface smoke (pawbench --help shows all 5 new flags)

🤖 Generated with Claude Code

Operationalizes Switchyard spec 009 inside Pawbench. Inspired by Fabian
Wesner's One-Shot Shop Challenge (agentic-engineers.dev) — the empirical
demonstration that orchestration architecture beats model choice
(Team Mode 85% vs Sub-Agents 57% on the same model, 143 E2E tests).

## New modules

- complexity.py        — ComplexityTier enum + heuristic inference (B2)
- quality.py           — ArtifactQuality + Python analyzer (ruff/mypy/radon)
                         + generic fallback + pluggable registry (B4)
- dqs.py               — Composite Dispatch Quality Score v1 + dqs_spread
- orchestration.py     — OrchestrationShape vocabulary + run_with_shape
                         executor (flat / waves / scatter-gather / team-mode
                         / subagents) with merge-turn synthesis (B1)
- ablation.py          — Pure ablation matrix over already-collected results,
                         interpretation thresholds, removal candidates (B7)
- context_tier.py      — manifest-only context stripping (B6)

## Scenarios

- pawstyle-orchestration-matrix.json (NEW) — 4 independent feature blocks,
  one per complexity tier (display / crud / transactional / cross_cutting),
  designed to differentiate orchestration shapes
- All 8 existing scenarios retroactively tagged with complexity_tier

## CLI surface

- --orchestration flat,waves,scatter-gather,team-mode,subagents
- --ablate quality,format_compliance,tool_accuracy,useful_ratio,steering_rate
- --context-tier standard|manifest-only
- --verification-runs N
- --no-quality-analysis

## Report fields

- dim5_artifact_quality (per-scenario + aggregate)
- quality_by_tier (display/crud/transactional/cross_cutting)
- orchestration_results + orchestration_dqs_spread (the headline SLI)
- ablation (per-component delta + removal candidates)
- dqs (composite + breakdown + verification reliability)

## Tests

- 53 new tests across 6 test_spec009_* files
- Full regression: 151/151 green (98 existing + 53 new)
- Network-free unit tests via lazy aiohttp/engine imports in orchestration.py

## Out of scope (deferred follow-ups)

- waves shape currently degenerates to subagents; team-mode degenerates to
  scatter-gather. Real DAG-aware waves and shared-scratchpad team-mode need
  precise operational contracts before they ship distinct executors.
- artifact_quality is reported but does NOT feed DQS — calibration data
  required first (>=100 dispatches), then formula change in a separate PR.
- verifier_agreement_rate is structurally surfaced but degenerate at 1.0
  until an LLM-judge verifier replaces deterministic score_turn.
- Leaderboard rendering of new dimensions is a separate (web) PR.

Tracking: zenprocess/switchyard#476 — closes pawbench phases P3-P7 of
spec 009. Companion work: switchyard#472 (CACP additive fields), axiom#10
(§17 Stratification spec), pawbench#9 (README attribution).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@codecov-commenter

Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

ℹ️ You can also turn on project coverage checks and project coverage reporting on Pull Request comment

Thanks for integrating Codecov - We've got you covered ☂️

Val Vladescu and others added 4 commits April 8, 2026 00:49
… stub

- scoring.py: extract _resolve_agent_spec/_turn_spec_at/_resolve_tier helpers
  to lower quality_by_tier cognitive complexity
- quality.py: extract _count_ruff_issues/_count_mypy_errors/_max_radon_complexity
  + _run_python_analyzers to lower _analyze_python cognitive complexity
- servingcard.py: type: ignore on yaml import + raise from e
  (pre-existing typecheck failure on master, fixed here so spec 009 CI is green)
- ruff format applied across all spec 009 files

Adds an "Inspired by" callout in the About section linking to:
- agentic-engineers.dev (the study)
- Fabian's LinkedIn announcement
- switchyard spec 009 (operational mapping)

Fabian's headline finding — orchestration architecture beats model choice
(Team Mode 85% vs Sub-Agents 57%) — is the empirical motivation for the
upcoming Pawbench orchestration × complexity matrix.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@sonarqubecloud

sonarqubecloud bot commented Apr 7, 2026

Quality Gate failed

Failed conditions
C Reliability Rating on New Code (required ≥ A)

See analysis details on SonarQube Cloud

@zenprocess zenprocess merged commit a75a36e into master Apr 7, 2026
6 of 7 checks passed
@zenprocess zenprocess deleted the spec-009-full-implementation branch April 7, 2026 22:02
zenprocess pushed a commit that referenced this pull request Apr 8, 2026
Follow-up to #14 to clear the sonar quality gate on master.

- orchestration._run_parallel: replace `assert isinstance(item, AgentResult)`
  with explicit elif/else fallback. Asserts are stripped under python -O,
  so they're not load-bearing in production. Sonar flagged this as a bug.

- quality._run: add a security review docstring documenting why these
  subprocess.run calls are safe (shell=False by default, argv-list, no
  user-controlled tokens reach argv, fixed scratch cwd, bounded timeout,
  capture_output=True). Add # nosec marker for B603 (subprocess-without-
  shell-equals-true is the desired form, not a finding).

- quality._materialize: lift root.resolve() out of the loop and clarify
  the path-escape rejection comment.
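
The first bullet above, replacing an assert with an explicit branch, can be sketched as follows. `AgentResult` here is a hypothetical stand-in dataclass and `collect` is an illustrative helper; the real `_run_parallel` and its fallback behavior are not shown in this PR, so treat the error handling as an assumption.

```python
from dataclasses import dataclass


@dataclass
class AgentResult:
    """Hypothetical stand-in for the real AgentResult type."""

    agent: str
    ok: bool
    error: str = ""


def collect(items: list[object]) -> list[AgentResult]:
    """Normalize mixed parallel outputs without relying on assert.

    An `assert isinstance(...)` would be removed when Python runs with -O,
    so the type check is written as ordinary control flow instead.
    """
    results: list[AgentResult] = []
    for index, item in enumerate(items):
        if isinstance(item, AgentResult):
            results.append(item)
        elif isinstance(item, BaseException):
            # A task that raised is recorded as a failed result.
            results.append(AgentResult(agent=f"agent-{index}", ok=False, error=repr(item)))
        else:
            # Anything else is unexpected; record it rather than crash or skip it.
            results.append(
                AgentResult(
                    agent=f"agent-{index}",
                    ok=False,
                    error=f"unexpected result type: {type(item).__name__}",
                )
            )
    return results
```

The explicit else branch behaves the same with or without `-O`, which is the property the assert could not guarantee.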

No behavior change. Same 53 spec-009 tests pass; ruff + format clean.
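
The quality.py points in the commit message above (resolving the root once, rejecting paths that escape it, and calling subprocess with an argv list, a fixed working directory, a timeout, and captured output) might look roughly like this. `materialize` and `run_tool` are hypothetical names, and silently skipping an escaping path is an assumption; the real code may raise or log instead.

```python
import subprocess
from pathlib import Path


def materialize(root: Path, files: dict[str, str]) -> list[Path]:
    """Write files under root, skipping any path that would land outside it."""
    resolved_root = root.resolve()  # resolved once, outside the loop
    written: list[Path] = []
    for relative, content in files.items():
        target = (resolved_root / relative).resolve()
        if not target.is_relative_to(resolved_root):  # Python 3.9+
            continue  # e.g. "../evil.py" or an absolute path: reject
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content, encoding="utf-8")
        written.append(target)
    return written


def run_tool(argv: list[str], cwd: Path, timeout: float = 30.0) -> subprocess.CompletedProcess:
    """Run an analyzer with an argv list (no shell), bounded time, captured output."""
    return subprocess.run(
        argv,
        cwd=cwd,
        capture_output=True,
        text=True,
        timeout=timeout,
        check=False,  # a non-zero exit is data for the caller, not an exception
    )
```

Passing a list to `subprocess.run` leaves `shell=False` in effect, so no argument is interpreted by a shell. Resolving both the root and the target before comparing them is what stops `..` segments from slipping past the check.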