
Pipeline Design 56

Seth Ford edited this page Feb 14, 2026 · 1 revision

Design: Multi-model orchestration — intelligent model routing by task complexity and stage

Context

Shipwright currently routes pipeline stages to a single model (typically opus) by default, regardless of task complexity. The intelligence layer already has intelligence_recommend_model() with SPRT-based evidence testing and per-model pricing in sw-intelligence.sh, plus an A/B testing gate in sw-pipeline.sh (~L7267) that defaults to opus and runs intelligent routing as the experimental arm (20% of runs).

The problem: this wastes budget on simple tasks (haiku could handle intake/test stages) and under-provisions complex ones. The A/B gate is also backwards: the smarter routing should be the default, not the experiment.

Constraints:

  • Bash 3.2 compatibility (no associative arrays, no readarray, no ${var,,})
  • set -euo pipefail in all scripts
  • Atomic file writes (tmp + mv)
  • JSON via jq --arg, never string interpolation
  • Existing intelligence_recommend_model() and SPRT infrastructure must be preserved, not replaced
  • Pipeline templates are JSON; changes must be backward-compatible (stages without config.model should fall back gracefully)
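
As a rough illustration of the jq and backward-compatibility constraints, the sketch below reads an optional per-stage model from a template. The helper name template_stage_model and the "id" field used to select a stage are assumptions made for illustration; the real template schema is not shown in this document.

```shell
# Hypothetical helper: print the config.model override for one stage, or
# print nothing if the stage or the key is absent. Assumes each stage object
# carries an "id" field; the actual templates may name it differently.
template_stage_model() {
    local template_file="$1" stage="$2"
    # --arg passes the stage name as data (no string interpolation), and
    # "// empty" turns a missing key (null) into no output at all.
    jq -r --arg stage "$stage" \
        '.stages[] | select(.id == $stage) | .config.model // empty' \
        "$template_file"
}
```

An empty result is the signal to fall back to the stage default, so callers can test it with [ -n "$override" ].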

Decision

Three-tier model routing with automatic escalation on failure.

Data Flow

Stage Start
  → intelligence_stage_defaults(stage, complexity) returns default model
  → Pipeline checks template override (config.model per stage)
  → Template override wins if set; otherwise use stage default
  → On retry failure: escalate_model(current) bumps haiku→sonnet→opus
  → On loop stall (CONSECUTIVE_FAILURES >= 2): escalate from sonnet→opus
  → All routing decisions logged to .claude/pipeline-artifacts/model-routing.log
  → cost_pipeline_summary() reads routing log + costs.json for per-stage breakdown
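
The precedence in this flow can be sketched as a single function. This is a minimal illustration, not the pipeline's actual code: resolve_attempt_model is a hypothetical name, the stage default is passed in as an argument rather than looked up, and the tier bump is inlined so the snippet stands alone.

```shell
# Hypothetical helper: choose the model for one attempt of a stage.
#   $1 = stage default (what intelligence_stage_defaults would return)
#   $2 = template override (config.model); may be empty
#   $3 = attempt number, starting at 1
# Prints "model|source", where source is default, template, or escalation.
resolve_attempt_model() {
    local default_model="$1" override="${2:-}" attempt="${3:-1}"
    local model src i

    # Template override wins if set; otherwise use the stage default.
    if [ -n "$override" ]; then
        model="$override"; src="template"
    else
        model="$default_model"; src="default"
    fi

    # Each earlier failed attempt bumps one tier: haiku -> sonnet -> opus.
    # A retry is tagged "escalation" even when the model is already opus.
    i=1
    while [ "$i" -lt "$attempt" ]; do
        case "$model" in
            haiku) model="sonnet" ;;
            *)     model="opus" ;;
        esac
        src="escalation"
        i=$((i + 1))
    done

    echo "${model}|${src}"
}
```

For example, resolve_attempt_model haiku "" 2 yields sonnet|escalation, while resolve_attempt_model haiku opus 1 yields opus|template.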

Stage × Complexity Model Map (implemented as case statements, not associative arrays)

Stage             Low     Medium  High
intake            haiku   haiku   sonnet
plan              sonnet  sonnet  opus
design            sonnet  opus    opus
build             sonnet  sonnet  opus
test              haiku   sonnet  sonnet
review            sonnet  opus    opus
compound_quality  sonnet  opus    opus
pr                haiku   haiku   sonnet
merge             haiku   haiku   haiku
deploy            haiku   sonnet  sonnet
validate          haiku   sonnet  sonnet
monitor           haiku   haiku   sonnet
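
One way the map above could be written with nested case statements is sketched below. The grouping of stages that share a row pattern is an illustrative choice, and the complexity labels low, medium, and high are assumed to be the exact strings passed in.

```shell
# Sketch of intelligence_stage_defaults(stage, complexity) for Bash 3.2:
# nested case statements instead of an associative array.
intelligence_stage_defaults() {
    local stage="${1:-}" complexity="${2:-}"

    # Unrecognized complexity falls back to the safe middle tier.
    case "$complexity" in
        low|medium|high) ;;
        *) echo "sonnet"; return 0 ;;
    esac

    case "$stage" in
        intake|pr|monitor)
            case "$complexity" in high) echo "sonnet" ;; *) echo "haiku" ;; esac ;;
        plan|build)
            case "$complexity" in high) echo "opus" ;; *) echo "sonnet" ;; esac ;;
        design|review|compound_quality)
            case "$complexity" in low) echo "sonnet" ;; *) echo "opus" ;; esac ;;
        test|deploy|validate)
            case "$complexity" in low) echo "haiku" ;; *) echo "sonnet" ;; esac ;;
        merge)
            echo "haiku" ;;
        *)
            # Unrecognized stage: safe middle ground.
            echo "sonnet" ;;
    esac
}
```

Grouping rows keeps the function short, at the cost of having to split a group if one stage's row later diverges from the others.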

Escalation Chain

escalate_model(current_model) returns next tier: haiku → sonnet → opus → opus (opus is ceiling). Pure function, no side effects. Caller logs the escalation event.
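
A minimal sketch of the chain as described, with unrecognized input failing to the most capable tier:

```shell
# Sketch of escalate_model(current_model): prints the next tier up.
# Pure function: no logging, no globals; the caller records the event.
escalate_model() {
    case "${1:-}" in
        haiku)  echo "sonnet" ;;
        sonnet) echo "opus" ;;
        *)      echo "opus" ;;   # opus is the ceiling; unknown input also maps to opus
    esac
}
```

Because the function only writes to stdout, a caller can capture the result with next_model="$(escalate_model "$current_model")" and log the change itself.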

A/B Gate Inversion

In run_pipeline() (~L7267), flip the ratio: intelligent routing becomes 80% (default), opus-everywhere becomes the 20% control group. The existing SPRT evidence framework continues to measure which arm performs better, and the ab_test_ratio config flag still controls the split.
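
The flipped split could look roughly like the following. This is only a sketch: ab_assign_arm and the arm labels are hypothetical, and how ab_test_ratio maps onto the percentage argument is an assumption, since the existing gate code is not reproduced here.

```shell
# Hypothetical helper: assign a pipeline run to an A/B arm.
#   $1 = percent of runs sent to intelligent routing (0-100), default 80
#   $2 = roll in 0..99; if omitted, a fresh draw from $RANDOM is used
ab_assign_arm() {
    local intelligent_pct="${1:-80}" roll="${2:-}"
    if [ -z "$roll" ]; then
        roll=$((RANDOM % 100))
    fi
    if [ "$roll" -lt "$intelligent_pct" ]; then
        echo "intelligent"      # new default arm
    else
        echo "opus_control"     # opus-everywhere control group
    fi
}
```

Taking the roll as an optional argument keeps the function deterministic under test while still allowing a random draw once per pipeline run.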

Error Handling

  • intelligence_stage_defaults() returns "sonnet" if stage or complexity is unrecognized (safe middle-ground)
  • escalate_model() returns "opus" for any unrecognized input (fail to most capable)
  • Template config.model is optional; missing key → fall through to intelligence_stage_defaults()
  • cost_pipeline_summary() gracefully handles missing model-routing.log (prints "no routing data")
  • Loop stall escalation only triggers if MODEL_ESCALATION_ENABLED is not explicitly "false" (opt-out, not opt-in)
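
The opt-out behavior in the last bullet can be sketched as a small predicate. The name should_escalate_on_stall is hypothetical; only MODEL_ESCALATION_ENABLED and the failure threshold of 2 come from this design.

```shell
# Hypothetical predicate: succeed (exit 0) when a loop stall should trigger
# model escalation.
#   $1 = current CONSECUTIVE_FAILURES count
should_escalate_on_stall() {
    local failures="${1:-0}"
    # Opt-out semantics: only the literal string "false" disables escalation.
    # An unset or empty variable leaves it enabled, and the :- default keeps
    # the expansion safe under set -u.
    if [ "${MODEL_ESCALATION_ENABLED:-true}" = "false" ]; then
        return 1
    fi
    [ "$failures" -ge 2 ]
}
```

Under set -e, call it inside an if condition (if should_escalate_on_stall "$CONSECUTIVE_FAILURES"; then ...) so a non-zero return does not abort the script.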

Logging Format

Each routing decision appends to model-routing.log:

timestamp|stage|complexity|selected_model|source|attempt_number

The source field is one of default, template, or escalation.
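
A hedged sketch of a writer for this format follows. The function name log_routing_decision, the MODEL_ROUTING_LOG override variable, and the ISO-8601 UTC timestamp are assumptions made for illustration; the design fixes only the field order and the log location.

```shell
# Hypothetical helper: append one routing decision to the routing log.
#   $1 = stage, $2 = complexity, $3 = selected model,
#   $4 = source (default, template, or escalation), $5 = attempt number
log_routing_decision() {
    local stage="$1" complexity="$2" model="$3" src="$4" attempt="$5"
    # MODEL_ROUTING_LOG is an assumed override, handy for tests.
    local log_file="${MODEL_ROUTING_LOG:-.claude/pipeline-artifacts/model-routing.log}"

    mkdir -p "$(dirname "$log_file")"
    # One printf call per record, so each entry is written as a single line.
    printf '%s|%s|%s|%s|%s|%s\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
        "$stage" "$complexity" "$model" "$src" "$attempt" >> "$log_file"
}
```

This sketch appends rather than using the tmp + mv pattern, on the assumption that an append-only log is acceptable here; a fixed-width sortable timestamp also makes later filtering by time a plain string comparison.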

Alternatives Considered

  1. Central routing service / separate script — Pros: clean separation, independently testable. Cons: adds another script to source/maintain, introduces IPC overhead, and the existing sw-intelligence.sh already owns model selection logic. Adding two functions to an existing script is simpler than a new component.

  2. Associative array for stage→model mapping — Pros: cleaner lookup syntax. Cons: requires Bash 4+, violates the Bash 3.2 compatibility constraint. Case statements are verbose but compatible.

  3. Always-escalate on any failure (no threshold) — Pros: faster recovery. Cons: expensive — a single flaky test would immediately jump to opus. The threshold (CONSECUTIVE_FAILURES >= 2 in loop, per-retry in pipeline) balances cost against recovery speed.

  4. Model routing as a daemon-config-only setting (no per-stage templates) — Pros: simpler config. Cons: loses the ability to override per-stage in specific pipeline templates (e.g., cost-aware template could force haiku everywhere).

Implementation Plan

Files to create

  • None (all changes are additions to existing files)

Files to modify

  1. scripts/sw-intelligence.sh — Add intelligence_stage_defaults() and escalate_model()
  2. scripts/sw-pipeline.sh — Model escalation in run_stage_with_retry() (~L6755), invert A/B gate in run_pipeline() (~L7267)
  3. scripts/sw-loop.sh — Stall-based escalation when CONSECUTIVE_FAILURES >= 2 (~L2150)
  4. scripts/sw-cost.sh — Add cost_pipeline_summary() function
  5. templates/pipelines/standard.json — Add per-stage config.model keys
  6. templates/pipelines/full.json — Add per-stage config.model keys
  7. templates/pipelines/autonomous.json — Add per-stage config.model keys
  8. templates/pipelines/deployed.json — Add per-stage config.model keys
  9. scripts/sw-intelligence-test.sh — Unit tests for new functions
  10. scripts/sw-e2e-smoke-test.sh — Smoke tests for routing integration

Dependencies

  • None new. Uses existing jq, existing intelligence infrastructure, existing template loading.

Risk Areas

  • run_stage_with_retry() modification (~L6755 in sw-pipeline.sh): This is a hot path — every stage passes through it. The model escalation must not break the existing retry logic. The change should be additive: read current model, call escalate_model(), set the new model env var, then proceed with existing retry flow.
  • A/B gate inversion: Swapping the default/control percentages could affect running daemon pipelines mid-flight. Mitigation: the gate is evaluated once per pipeline run at startup, so in-flight pipelines keep their original assignment.
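  • Template backward compatibility: Existing templates without config.model must continue to work. The pipeline code must read the key with a fallback, e.g. jq -r '.stages[].config.model // empty'. A bare .config.model does not error on a missing key; under jq -r it prints the literal string "null", which could then be mistaken for a model name.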
  • model-routing.log growth: Log file could grow unbounded across many pipeline runs. cost_pipeline_summary() should scope to the current pipeline run (filter by run ID or timestamp).
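
The log-growth mitigation could be approached by filtering on the timestamp field, as in this sketch. It assumes timestamps are fixed-width ISO-8601 UTC strings (so string order matches time order), and routing_entries_since is a hypothetical helper, not part of the existing scripts.

```shell
# Hypothetical helper: print "stage|model|source" for routing entries logged
# at or after a given timestamp, or a notice if the log does not exist.
#   $1 = path to model-routing.log
#   $2 = run start timestamp, e.g. 2026-02-14T00:00:00Z
routing_entries_since() {
    local log_file="$1" since="$2"
    if [ ! -f "$log_file" ]; then
        echo "no routing data"
        return 0
    fi
    # Appending "" forces awk to compare the two values as strings.
    awk -F'|' -v since="$since" \
        '($1 "") >= (since "") { print $2 "|" $4 "|" $5 }' "$log_file"
}
```

A run ID column would be a sturdier filter than a timestamp, but it would also change the log format defined earlier.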

Validation Criteria

  • intelligence_stage_defaults "build" "low" returns "sonnet"; "intake" "low" returns "haiku"; "design" "high" returns "opus"
  • escalate_model "haiku" returns "sonnet"; "sonnet" returns "opus"; "opus" returns "opus"
  • escalate_model "unknown" returns "opus" (fail-safe)
  • intelligence_stage_defaults "nonexistent_stage" "low" returns "sonnet" (safe default)
  • Pipeline retry escalates model: first attempt uses stage default, second attempt uses next tier
  • Loop stall with CONSECUTIVE_FAILURES=2 triggers model escalation log entry
  • Loop stall with MODEL_ESCALATION_ENABLED=false does NOT escalate
  • Templates with config.model override intelligence_stage_defaults() return value
  • Templates without config.model fall through to intelligence_stage_defaults() cleanly
  • cost_pipeline_summary produces per-stage cost breakdown when routing log exists
  • cost_pipeline_summary handles missing routing log gracefully (no error, informative message)
  • A/B gate now assigns 80% to intelligent routing, 20% to opus control
  • Full test suite passes: npm test exits 0
  • No Bash 4+ features used (no declare -A, no readarray, no ${var,,})
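
The last criterion could be spot-checked with a simple grep-based lint, sketched below. check_bash32_compat is a hypothetical helper and the pattern is not exhaustive: it covers only declare -A, readarray/mapfile, and the case-conversion expansions, and it will also flag those strings when they appear inside comments.

```shell
# Hypothetical lint: print lines that use common Bash 4+ constructs and
# return 1 if any are found; return 0 when the file looks clean.
check_bash32_compat() {
    local file="$1"
    if grep -nE 'declare[[:space:]]+-A|readarray|mapfile|\$\{[A-Za-z_][A-Za-z0-9_]*(,,|\^\^)' "$file"; then
        return 1
    fi
    return 0
}
```

Run it per script, for example in a loop over scripts/sw-*.sh, and treat any non-zero return as a failed check.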
