
Pipeline Design 46

Seth Ford edited this page Feb 13, 2026 · 1 revision



Design: Dynamic team scaling — spawn and dismiss agents mid-pipeline based on demand

Context

Shipwright currently supports two fixed scaling models: (1) the daemon auto-scaler (sw-daemon.sh:4260-4444) which adjusts concurrent pipeline workers between poll cycles based on CPU/memory/budget/queue-depth, and (2) multi-agent mode in sw-loop.sh which spawns a fixed --agents N count of tmux worker panes at launch and never changes it. Neither model can adjust agent count within a running pipeline — if a build stage starts with 1 agent but discovers 4 independent modules, it runs serially. If a 3-agent build enters a merge-conflict-prone phase, extra agents waste budget.

Constraints from the codebase

  • Bash 3.2 compatible — no declare -A, readarray, or ${var,,}
  • File-based coordination — pipeline communicates with the loop via .agent-N-complete, progress.md, error-summary.json, failure-reason.txt
  • tmux pane lifecycle — panes created via tmux split-window in launch_multi_agent() (sw-loop.sh:1849), killed via pane IDs in cleanup_multi_agent() (sw-loop.sh:1940)
  • Event logging — all state changes must call emit_event to events.jsonl
  • Budget awareness: sw-cost.sh remaining-budget already integrated in daemon
  • Atomic writes — all JSON/state files use tmp + mv pattern
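
The tmp + mv constraint can be illustrated with a minimal sketch. The helper name atomic_write and its argument order are assumptions made for illustration, not functions from the Shipwright codebase; only mktemp, printf, and mv are relied on.

```shell
#!/usr/bin/env bash
# Sketch of the tmp + mv pattern. atomic_write is a hypothetical helper name.

# atomic_write <target-path> <content>
# Writes the content to a temp file next to the target, then renames it over
# the target. A rename within one filesystem is atomic, so a concurrent reader
# sees either the old file or the new one, never a half-written file.
atomic_write() {
    local target="$1" content="$2" tmp
    tmp="$(mktemp "${target}.tmp.XXXXXX")" || return 1
    if ! printf '%s\n' "$content" > "$tmp"; then
        rm -f "$tmp"
        return 1
    fi
    mv -f "$tmp" "$target"
}
```

The temp file is created in the same directory as the target on purpose: a rename across filesystems degrades to copy-and-delete and loses the atomicity guarantee.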

Decision

File-based scaling signal protocol. New scaling engine (scripts/sw-scaling.sh) sourced by both sw-pipeline.sh and sw-loop.sh. Three participants: trigger evaluators write JSON request files, the loop monitor reads/executes them between iterations, and the daemon surfaces events in metrics.

Signal files:

  • scaling-requests.json — pipeline writes spawn/dismiss requests atomically; loop consumes them
  • scaling-state.json — current agent count, active pane IDs, cooldown timestamps, history

12 functions in sw-scaling.sh: 6 trigger evaluators + scaling_monitor_tick() + scaling_spawn() + scaling_dismiss() + scaling_check_cooldown() + scaling_check_budget() + scaling_prepare_context().

6 trigger types: iteration_threshold (spawn if stuck at iter 8+), coverage_gap (spawn reviewer if coverage drops >10% below target), security_critical (spawn security agent), idle_agent (dismiss if 0 commits in 3 iterations), multi_module_split (spawn to match independent module count), consecutive_failures (dismiss if 3+ low-progress iterations).
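
Four of the trigger conditions above reduce to integer comparisons, which the following sketch expresses as small Bash 3.2-compatible functions. The function names, argument conventions, and the "spawn"/"dismiss"/"none" output strings are assumptions for illustration. The sketch also assumes that coverage and target are whole-number percentages and that ">10% below target" means more than 10 percentage points; the ADR does not specify either.

```shell
#!/usr/bin/env bash
# Thresholds come from the trigger descriptions; names and I/O are hypothetical.

# Prints "spawn" once the loop has reached iteration 8 or later.
# Assumes the caller has already established that the loop is still stuck.
trigger_iteration_threshold() {
    local iteration="$1"
    if [ "$iteration" -ge 8 ]; then echo "spawn"; else echo "none"; fi
}

# Prints "spawn" when coverage is more than 10 points below the target.
trigger_coverage_gap() {
    local coverage="$1" target="$2"
    if [ $((target - coverage)) -gt 10 ]; then echo "spawn"; else echo "none"; fi
}

# Prints "dismiss" when an agent made 0 commits over its last 3 iterations.
trigger_idle_agent() {
    local commits_last_3="$1"
    if [ "$commits_last_3" -eq 0 ]; then echo "dismiss"; else echo "none"; fi
}

# Prints "dismiss" after 3 or more consecutive low-progress iterations.
trigger_consecutive_failures() {
    local low_progress_streak="$1"
    if [ "$low_progress_streak" -ge 3 ]; then echo "dismiss"; else echo "none"; fi
}
```

Keeping each evaluator a pure function of its arguments is what makes the "in isolation" unit testing in the validation criteria straightforward: no tmux session or pipeline state is needed to exercise a threshold.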

Guards: 120s cooldown between actions, 20% budget reserve blocks spawns, max_agents ceiling (default 4), min_agents floor (always 1). One request processed per tick to serialize operations.
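
The guards compose into a single yes/no decision per request. The sketch below is one possible arrangement under stated assumptions: the function and variable names are hypothetical, budget values are integers in any consistent unit, and "20% budget reserve" is read as "block spawns when remaining budget is below 20% of the total". The ADR names scaling_check_cooldown() and scaling_check_budget() but does not give their signatures, so different names are used here.

```shell
#!/usr/bin/env bash
# Guard values from the ADR; all names below are hypothetical.
COOLDOWN_SECONDS=120
BUDGET_RESERVE_PCT=20
MAX_AGENTS=4
MIN_AGENTS=1

# guard_cooldown_ok <last_action_epoch> <now_epoch>
# Succeeds when at least COOLDOWN_SECONDS have passed since the last action.
guard_cooldown_ok() {
    local last="$1" now="$2"
    [ $((now - last)) -ge "$COOLDOWN_SECONDS" ]
}

# guard_budget_ok <remaining> <total>
# Succeeds when remaining budget is at least BUDGET_RESERVE_PCT of the total.
guard_budget_ok() {
    local remaining="$1" total="$2"
    [ $((remaining * 100)) -ge $((total * BUDGET_RESERVE_PCT)) ]
}

# guard_allows <spawn|dismiss> <agents> <last_epoch> <now_epoch> <remaining> <total>
# Cooldown applies to both actions; ceiling and budget only to spawn,
# floor only to dismiss.
guard_allows() {
    local action="$1" agents="$2" last="$3" now="$4" remaining="$5" total="$6"
    guard_cooldown_ok "$last" "$now" || return 1
    case "$action" in
        spawn)
            [ "$agents" -lt "$MAX_AGENTS" ] || return 1
            guard_budget_ok "$remaining" "$total" || return 1
            ;;
        dismiss)
            [ "$agents" -gt "$MIN_AGENTS" ] || return 1
            ;;
        *)
            return 1
            ;;
    esac
    return 0
}
```

Passing the current time in as an argument, rather than calling date inside the guard, keeps the cooldown check deterministic under test.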

Error handling: Spawn failures emit scale.spawn_failed and respect cooldown. Dismiss preserves uncommitted work via git stash before killing pane. Corrupt signal files are moved to .bad and replaced. Single-to-multi transition creates tmux infrastructure before spawning additional agents.
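
The corrupt-signal-file behaviour (move to .bad, then replace) might look roughly like the sketch below. The helper names and the default document passed by the caller are assumptions. Validation uses jq, which the ADR lists as an existing dependency; the non-jq fallback is a crude structural check included only so the sketch runs on a machine without jq.

```shell
#!/usr/bin/env bash
# Hypothetical helpers sketching "move corrupt file to .bad and replace".

# is_valid_json <file>
# Succeeds when the file holds a JSON object. Uses jq when installed;
# otherwise falls back to a rough "starts with { and ends with }" check.
is_valid_json() {
    local file="$1"
    [ -s "$file" ] || return 1
    if command -v jq >/dev/null 2>&1; then
        jq -e 'type == "object"' "$file" >/dev/null 2>&1
    else
        case "$(tr -d '[:space:]' < "$file")" in
            \{*\}) return 0 ;;
            *) return 1 ;;
        esac
    fi
}

# recover_signal_file <file> <default-json>
# Leaves a valid or missing file alone. An invalid file is renamed to
# <file>.bad for later inspection and replaced atomically with the default.
recover_signal_file() {
    local file="$1" default_json="$2" tmp
    [ -f "$file" ] || return 0
    is_valid_json "$file" && return 0
    mv -f "$file" "${file}.bad" || return 1
    tmp="$(mktemp "${file}.tmp.XXXXXX")" || return 1
    printf '%s\n' "$default_json" > "$tmp" && mv -f "$tmp" "$file"
}
```

Keeping the .bad copy rather than deleting the corrupt file preserves evidence for debugging while still letting the loop continue on a clean file.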

Alternatives Considered

  1. IPC-Based Scaling (Named Pipes / Unix Sockets) — Pros: lower latency, real-time events, no file corruption risk / Cons: breaks file-based coordination pattern used throughout Shipwright, Bash 3.2 has no socket support, harder to debug, requires background listener process

  2. Daemon-Driven Scaling (Scale From Outside) — Pros: centralizes logic in daemon which already has auto-scaler / Cons: daemon operates at pipeline-worker level not agent-within-pipeline level, 5-minute poll interval is too slow, daemon lacks visibility into iteration progress and module structure, violates encapsulation

  3. Pre-Computed Scaling Plan (Static at Pipeline Start) — Pros: simple, no runtime complexity / Cons: cannot react to runtime conditions (failures, coverage drops, security findings), equivalent to smarter --agents N default

Implementation Plan

  • Files to create:

    • scripts/sw-scaling.sh — core scaling engine (~400 lines)
    • scripts/sw-scaling-test.sh — 18 test cases (~600 lines)
  • Files to modify:

    • scripts/sw-loop.sh — scaling monitor in wait_for_multi_completion() polling loop; single-to-multi transition in run_single_agent_loop()
    • scripts/sw-pipeline.sh — source scaling engine; scaling_evaluate_triggers() at stage boundaries; scaling state in write_state()
    • templates/pipelines/full.json — add scaling config block (all 6 triggers, max_agents: 4)
    • templates/pipelines/autonomous.json — add scaling config block (all 6 triggers, max_agents: 4)
    • templates/pipelines/standard.json — add scaling config block (4 triggers, max_agents: 3)
    • scripts/sw-daemon.sh: scale.spawn/scale.dismiss event counting in daemon_metrics(); scaling health in daemon_health_check()
    • .claude/CLAUDE.md — documentation updates
    • package.json — register test suite #23
  • Dependencies: none (uses existing jq, tmux, emit_event, sw-cost.sh)

  • Risk areas:

    • Single-to-multi transition in sw-loop.sh — must create tmux window mid-run without losing original agent state
    • Worktree creation under pipefail: git worktree add failures must not kill the loop
    • Race on scaling-requests.json — mitigated by rare writes + atomic tmp+mv + polling interval
    • tmux pane ID stability — panes are stable per-server but must handle external kills gracefully
    • Budget estimation accuracy — should use daemon's adaptive cost estimation when available
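
For the worktree risk above, one way to keep a failed git worktree add from terminating a script running under set -e and pipefail is to run it as an if condition, which the shell exempts from errexit. The wrapper name, its arguments, and the log message are hypothetical; the sketch shows the guarding technique, not the planned scaling_spawn() implementation.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper; shows errexit-safe handling of a failing git command.

# safe_worktree_add <path> <new-branch>
# A command tested by `if` does not trigger errexit, so a failure here
# becomes an ordinary return code that the caller can act on.
safe_worktree_add() {
    local path="$1" branch="$2"
    if ! git worktree add -b "$branch" "$path" >/dev/null 2>&1; then
        echo "worktree add failed for $path; skipping spawn" >&2
        return 1
    fi
    return 0
}
```

A caller under set -e still has to consume the return code, for example with `safe_worktree_add "$path" "$branch" || continue` inside the monitor loop, so that a failed spawn is skipped rather than fatal.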

Validation Criteria

  • scaling_evaluate_triggers() correctly identifies all 6 trigger types in isolation (18 unit tests)
  • Cooldown enforcement prevents scaling actions within 120s of the previous action
  • Budget reserve blocks spawns when remaining budget is below threshold
  • scaling_spawn() creates a functional tmux pane with worktree and worker script
  • scaling_dismiss() gracefully stops agent, preserves uncommitted work, cleans up worktree
  • Signal file corruption is detected and recovered without crashing the loop
  • Single-to-multi agent transition works without losing original agent's progress
  • write_state() includes scaling state visible in pipeline state file
  • daemon_metrics() counts and displays scale.spawn and scale.dismiss events
  • All 22 existing test suites continue to pass (npm test)
  • New sw-scaling-test.sh passes all 18 test cases
  • No Bash 3.2 incompatibilities

