Pipeline Design 56

Design: Multi-model orchestration — intelligent model routing by task complexity and stage

Context

Shipwright currently routes all pipeline stages to a single model (typically opus) regardless of task complexity. The intelligence layer already has intelligence_recommend_model() with SPRT-based evidence testing and per-model pricing in sw-intelligence.sh, plus an A/B testing gate in sw-pipeline.sh (~L7267) that defaults to opus with intelligent routing as the experimental arm (20%).

The problem: this wastes budget on simple tasks (haiku could handle intake/test stages) and under-provisions complex ones. The A/B gate is inverted — the smarter routing should be the default, not the experiment.

Constraints:

Bash 3.2 compatibility (no associative arrays, no readarray, no ${var,,})
set -euo pipefail in all scripts
Atomic file writes (tmp + mv)
JSON via jq --arg, never string interpolation
Existing intelligence_recommend_model() and SPRT infrastructure must be preserved, not replaced
Pipeline templates are JSON; changes must be backward-compatible (stages without config.model should fall back gracefully)

Decision

Three-tier model routing with automatic escalation on failure.

Data Flow

Stage Start
  → intelligence_stage_defaults(stage, complexity) returns default model
  → Pipeline checks template override (config.model per stage)
  → Template override wins if set; otherwise use stage default
  → On retry failure: escalate_model(current) bumps haiku→sonnet→opus
  → On loop stall (CONSECUTIVE_FAILURES >= 2): escalate from sonnet→opus
  → All routing decisions logged to .claude/pipeline-artifacts/model-routing.log
  → cost_pipeline_summary() reads routing log + costs.json for per-stage breakdown
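
The precedence rule in this flow (template override wins, otherwise the stage default applies) can be sketched as a small pure function. This is a minimal illustration only: resolve_stage_model is a hypothetical helper name, and the "model|source" output shape is an assumption chosen to line up with the routing log's source field.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not part of the existing scripts).
# $1 = template override from config.model (may be empty)
# $2 = stage default, e.g. the result of intelligence_stage_defaults()
# Prints "model|source" so the caller can log where the choice came from.
resolve_stage_model() {
  local template_model="${1:-}"
  local default_model="${2:-sonnet}"
  if [ -n "$template_model" ]; then
    printf '%s|template\n' "$template_model"
  else
    printf '%s|default\n' "$default_model"
  fi
}

resolve_stage_model "" "haiku"      # prints haiku|default
resolve_stage_model "opus" "haiku"  # prints opus|template
```

Keeping the function free of file or environment access leaves logging to the caller, in the same way the design treats escalate_model().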

Stage × Complexity Model Map (implemented as case statements, not associative arrays)

Stage	Low complexity	Medium complexity	High complexity
intake	haiku	haiku	sonnet
plan	sonnet	sonnet	opus
design	sonnet	opus	opus
build	sonnet	sonnet	opus
test	haiku	sonnet	sonnet
review	sonnet	opus	opus
compound_quality	sonnet	opus	opus
pr	haiku	haiku	sonnet
merge	haiku	haiku	haiku
deploy	haiku	sonnet	sonnet
validate	haiku	sonnet	sonnet
monitor	haiku	haiku	sonnet
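
One way the table above could be encoded with case statements alone is sketched below. The complexity labels low, medium, and high are assumed spellings, and grouping rows that share a pattern is an illustrative choice rather than the actual implementation. Unrecognized input falls back to sonnet, as described under Error Handling.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Sketch: map (stage, complexity) to a model tier without associative arrays.
# Assumes complexity is one of: low, medium, high.
intelligence_stage_defaults() {
  local stage="${1:-}" complexity="${2:-}"

  # Unknown complexity: safe middle ground.
  case "$complexity" in
    low|medium|high) ;;
    *) echo "sonnet"; return 0 ;;
  esac

  # More specific patterns come first; the first match wins.
  case "$stage:$complexity" in
    merge:*)                                     echo "haiku"  ;;
    intake:high|pr:high|monitor:high)            echo "sonnet" ;;
    intake:*|pr:*|monitor:*)                     echo "haiku"  ;;
    test:low|deploy:low|validate:low)            echo "haiku"  ;;
    test:*|deploy:*|validate:*)                  echo "sonnet" ;;
    plan:high|build:high)                        echo "opus"   ;;
    plan:*|build:*)                              echo "sonnet" ;;
    design:low|review:low|compound_quality:low)  echo "sonnet" ;;
    design:*|review:*|compound_quality:*)        echo "opus"   ;;
    *)                                           echo "sonnet" ;;  # unknown stage
  esac
}

intelligence_stage_defaults design medium   # prints opus
intelligence_stage_defaults nosuch low      # prints sonnet
```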

Escalation Chain

escalate_model(current_model) returns next tier: haiku → sonnet → opus → opus (opus is ceiling). Pure function, no side effects. Caller logs the escalation event.
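
A minimal sketch of the chain, assuming the model names are passed as the plain strings haiku, sonnet, and opus. The wildcard branch covers both the opus ceiling and the fail-to-most-capable rule for unrecognized input. The loop after the function is only a demonstration of how a caller might walk the tiers across attempts.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Pure function: prints the next tier and touches no files or globals.
escalate_model() {
  case "${1:-}" in
    haiku)  echo "sonnet" ;;
    sonnet) echo "opus"   ;;
    *)      echo "opus"   ;;  # opus is the ceiling; unknown input also maps to opus
  esac
}

# Demonstration: the model a caller would use on each successive attempt.
model="haiku"
for attempt in 1 2 3; do
  echo "attempt $attempt: $model"
  model="$(escalate_model "$model")"
done
```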

A/B Gate Inversion

In run_pipeline() (~L7267), flip the ratio: intelligent routing becomes 80% (default), opus-everywhere becomes the 20% control group. The existing SPRT evidence framework continues to measure which arm performs better, and the ab_test_ratio config flag still controls the split.
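
The inverted split might look roughly like the sketch below. select_routing_arm is a hypothetical name, and treating the configured ratio as "percentage of runs sent to intelligent routing" is an assumption; the real ab_test_ratio flag may be expressed differently. The roll is accepted as an optional argument so the behavior can be checked deterministically.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper.
# $1 = percentage of runs routed intelligently (assumed meaning; 80 after inversion)
# $2 = optional roll in 0..99; drawn from $RANDOM when omitted
select_routing_arm() {
  local intelligent_pct="${1:-80}"
  local roll="${2:-}"
  if [ -z "$roll" ]; then
    roll=$((RANDOM % 100))
  fi
  if [ "$roll" -lt "$intelligent_pct" ]; then
    echo "intelligent"
  else
    echo "control"   # opus-everywhere arm
  fi
}

select_routing_arm 80 79   # prints intelligent
select_routing_arm 80 80   # prints control
```

Calling this once at pipeline startup and storing the result matches the mitigation described under Risk Areas: the assignment is fixed for the lifetime of the run.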

Error Handling

intelligence_stage_defaults() returns "sonnet" if stage or complexity is unrecognized (safe middle-ground)
escalate_model() returns "opus" for any unrecognized input (fail to most capable)
Template config.model is optional; missing key → fall through to intelligence_stage_defaults()
cost_pipeline_summary() gracefully handles missing model-routing.log (prints "no routing data")
Loop stall escalation only triggers if MODEL_ESCALATION_ENABLED is not explicitly "false" (opt-out, not opt-in)
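
The opt-out rule for stall escalation can be expressed as a predicate, sketched here under the assumption that CONSECUTIVE_FAILURES is a non-negative integer. should_escalate_on_stall is a hypothetical name. The ${VAR:-default} expansion keeps the check safe under set -u when MODEL_ESCALATION_ENABLED is unset.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical predicate: succeeds (exit 0) when the loop should escalate.
# Escalation is on unless MODEL_ESCALATION_ENABLED is exactly "false".
should_escalate_on_stall() {
  local failures="${1:-0}"
  if [ "${MODEL_ESCALATION_ENABLED:-true}" = "false" ]; then
    return 1
  fi
  [ "$failures" -ge 2 ]
}

# Call it inside an `if` so a non-zero status does not trip `set -e`.
CONSECUTIVE_FAILURES=2
if should_escalate_on_stall "$CONSECUTIVE_FAILURES"; then
  echo "stall detected: escalating model"
fi
```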

Logging Format

Each routing decision appends to model-routing.log:

timestamp|stage|complexity|selected_model|source|attempt_number

where source is one of: default, template, escalation.
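
A sketch of an append helper for this six-field, pipe-delimited record follows. log_routing_decision is a hypothetical name, the ROUTING_LOG variable is introduced here only so the path can be overridden, and the ISO 8601 UTC timestamp is an assumption, since the design does not fix a timestamp format.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Path from the design; the override variable is an assumption of this sketch.
ROUTING_LOG="${ROUTING_LOG:-.claude/pipeline-artifacts/model-routing.log}"

# Hypothetical helper: append one routing decision.
# Args: stage complexity selected_model source attempt_number
log_routing_decision() {
  local stage="$1" complexity="$2" model="$3" src="$4" attempt="$5"
  local ts
  ts="$(date -u +%Y-%m-%dT%H:%M:%SZ)"   # assumed timestamp format
  mkdir -p "$(dirname "$ROUTING_LOG")"
  printf '%s|%s|%s|%s|%s|%s\n' \
    "$ts" "$stage" "$complexity" "$model" "$src" "$attempt" >> "$ROUTING_LOG"
}

# Example call (writes to $ROUTING_LOG):
#   log_routing_decision build medium opus escalation 2
```

A fixed-width UTC timestamp has the side benefit that records sort and compare correctly as plain strings, which helps when a summary needs to filter by time.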

Alternatives Considered

Central routing service / separate script — Pros: clean separation, independently testable. Cons: adds another script to source/maintain, introduces IPC overhead, and the existing sw-intelligence.sh already owns model selection logic. Adding two functions to an existing script is simpler than a new component.
Associative array for stage→model mapping — Pros: cleaner lookup syntax. Cons: requires Bash 4+, violates the Bash 3.2 compatibility constraint. Case statements are verbose but compatible.
Always-escalate on any failure (no threshold) — Pros: faster recovery. Cons: expensive — a single flaky test would immediately jump to opus. The threshold (CONSECUTIVE_FAILURES >= 2 in loop, per-retry in pipeline) balances cost against recovery speed.
Model routing as a daemon-config-only setting (no per-stage templates) — Pros: simpler config. Cons: loses the ability to override per-stage in specific pipeline templates (e.g., cost-aware template could force haiku everywhere).

Implementation Plan

Files to create

None (all changes are additions to existing files)

Files to modify

scripts/sw-intelligence.sh — Add intelligence_stage_defaults() and escalate_model()
scripts/sw-pipeline.sh — Model escalation in run_stage_with_retry() (~L6755), invert A/B gate in run_pipeline() (~L7267)
scripts/sw-loop.sh — Stall-based escalation when CONSECUTIVE_FAILURES >= 2 (~L2150)
scripts/sw-cost.sh — Add cost_pipeline_summary() function
templates/pipelines/standard.json — Add per-stage config.model keys
templates/pipelines/full.json — Add per-stage config.model keys
templates/pipelines/autonomous.json — Add per-stage config.model keys
templates/pipelines/deployed.json — Add per-stage config.model keys
scripts/sw-intelligence-test.sh — Unit tests for new functions
scripts/sw-e2e-smoke-test.sh — Smoke tests for routing integration
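
For the template changes, a per-stage lookup could be sketched as follows. The sample JSON is illustrative only: the id key used to identify a stage is an assumption about the template schema, and template_stage_model is a hypothetical helper name. The stage name is passed with jq --arg rather than interpolated into the filter, in line with the constraints above.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Illustrative template: one stage with config.model, one with an empty
# config, and one with no config at all (the backward-compatible cases).
sample_template='{"stages":[
  {"id":"intake","config":{"model":"haiku"}},
  {"id":"build","config":{}},
  {"id":"merge"}
]}'

# Hypothetical helper: prints the override for a stage, or nothing if unset.
template_stage_model() {
  local template_json="$1" stage="$2"
  printf '%s' "$template_json" |
    jq -r --arg stage "$stage" \
      '.stages[] | select(.id == $stage) | .config.model // empty'
}
```

With this sample, template_stage_model "$sample_template" intake prints haiku, while build and merge print nothing, so the caller can fall through to intelligence_stage_defaults().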

Dependencies

None new. Uses existing jq, existing intelligence infrastructure, existing template loading.

Risk Areas

run_stage_with_retry() modification (~L6755 in sw-pipeline.sh): This is a hot path — every stage passes through it. The model escalation must not break the existing retry logic. The change should be additive: read current model, call escalate_model(), set the new model env var, then proceed with existing retry flow.
A/B gate inversion: Swapping the default/control percentages could affect running daemon pipelines mid-flight. Mitigation: the gate is evaluated once per pipeline run at startup, so in-flight pipelines keep their original assignment.
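Template backward compatibility: Existing templates without config.model must continue to work. The pipeline code must use jq -r '.stages[].config.model // empty' rather than a bare .config.model: with -r, a missing key yields the literal string "null", which would then be passed along as a model name, whereas // empty produces no output so the caller can fall through to intelligence_stage_defaults().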
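model-routing.log growth: The log file could grow without bound across many pipeline runs. cost_pipeline_summary() should scope to the current pipeline run by filtering on run ID or timestamp. The logging format above carries no run ID field, so filtering on timestamp works with the format as specified, while filtering on run ID would require adding a field.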
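
Scoping the summary to one run by timestamp could be sketched as below. routing_summary_since is a hypothetical helper, not the planned cost_pipeline_summary() itself: it ignores costs.json and assumes timestamps are fixed-width ISO 8601 UTC strings, so that string comparison matches chronological order. It reports, per stage, the last model selected and the number of routing decisions since the given start time.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper.
# $1 = path to model-routing.log
# $2 = run start timestamp (assumed ISO 8601 UTC, e.g. 2024-06-01T00:00:00Z)
# Prints "stage last_model decision_count" per stage, sorted by stage.
routing_summary_since() {
  local log_file="$1" since="$2"
  if [ ! -s "$log_file" ]; then
    echo "no routing data"
    return 0
  fi
  awk -F'|' -v since="$since" '
    # Appending "" forces a string comparison in awk.
    ($1 "") >= (since "") { count[$2]++; last[$2] = $4 }
    END { for (s in count) printf "%s %s %d\n", s, last[s], count[s] }
  ' "$log_file" | sort
}
```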

