Pipeline Design 212

Seth Ford edited this page Mar 7, 2026 · 1 revision

Design: Adaptive Stage Timeout Engine with P95 Duration-Based Auto-Tuning

Context

Pipeline stages currently run unbounded — there is no wall-clock enforcement. A failing build stage can burn 60+ minutes of compute before the retry/failure logic kicks in. The existing get_timeout() function in sw-adaptive.sh:141 already computes P95-based timeouts from historical stage.completed events, but nothing enforces them. The value is computed and used only for informational tuning recommendations — never as a kill signal.

Constraints from the codebase:

  • All scripts are Bash 3.2 compatible (no associative arrays, no readarray)
  • The scripts/lib/*.sh pattern (40 existing libraries) is the established way to add shared logic — libraries use an idempotent load guard ([[ -n "${_LIB_LOADED:-}" ]] && return 0)
  • Stage execution happens in sw-pipeline.sh:1683 via run_stage_with_retry() which calls stage_${id} functions — stages spawn Claude CLI sessions that create child processes
  • Event storage is dual-layer: SQLite (sw-db.sh) primary + JSONL fallback, queried via db_query_events() and db_query_events_since()
  • Pipeline templates are JSON files in templates/pipelines/ with per-stage config objects
  • The daemon already has an optimization/ directory at ~/.shipwright/optimization/ (used by daemon-adaptive.sh for daemon-tuning.json)
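
The load-guard convention mentioned above could look roughly like this in the new library. This is a minimal sketch: the guard variable name _STAGE_TIMEOUT_LOADED and the load counter are hypothetical (the source only shows the generic _LIB_LOADED form), and the library body is written to a temporary file purely so the example can source it twice.

```shell
# Write a stand-in library to a temp file (illustration only).
lib="$(mktemp "${TMPDIR:-/tmp}/stage-timeout-lib.XXXXXX")"
cat > "$lib" <<'EOF'
# Idempotent load guard: a second `source` returns immediately.
# _STAGE_TIMEOUT_LOADED is a hypothetical name for this library's guard.
[[ -n "${_STAGE_TIMEOUT_LOADED:-}" ]] && return 0
_STAGE_TIMEOUT_LOADED=1

# Counter used only to show that the body runs once.
_STAGE_TIMEOUT_LOAD_COUNT=$(( ${_STAGE_TIMEOUT_LOAD_COUNT:-0} + 1 ))
EOF

source "$lib"
source "$lib"   # no-op thanks to the guard
echo "library body executed ${_STAGE_TIMEOUT_LOAD_COUNT} time(s)"
rm -f "$lib"
```

The guard relies on `return` being valid at the top level of a sourced file, which holds on Bash 3.2.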

The problem: Without enforcement, a single stage can waste an entire pipeline budget. With enforcement, we need careful process group cleanup (Claude sessions spawn subprocesses), a priority chain for timeout sources, and a bootstrap path for repos with no historical data.

Decision

Create a new shared library scripts/lib/stage-timeout.sh that provides:

  1. Adaptive timeout computation with a 4-tier priority chain (template override > daemon-config operator default > cached P95×1.2 > hardcoded defaults)
  2. Timeout enforcement via process group kill with watchdog timer
  3. Metrics emission for observability (fast-fail detection, false-timeout prevention)
  4. Dashboard data for sw adaptive timeout CLI

Component Diagram

+------------------------------------------------------------------+
|                        Consumers                                  |
|                                                                   |
|  sw-pipeline.sh          sw-adaptive.sh         sw-daemon.sh      |
|  (enforcement)           (CLI dashboard)        (config source)   |
+--------+-----------------------+---------------------+-----------+
         |                       |                     |
         v                       v                     v
+------------------------------------------------------------------+
|              scripts/lib/stage-timeout.sh                         |
|                                                                   |
|  compute_adaptive_timeout()  run_with_stage_timeout()             |
|  recalculate_stage_timeouts() record_stage_timeout_metric()       |
|  get_timeout_dashboard_json() _stage_timeout_default()            |
+--------+-----------------------+---------------------+-----------+
         |                       |                     |
         v                       v                     v
+------------------+  +-------------------+  +---------------------+
| stage-timeouts   |  | sw-db.sh          |  | sw-adaptive.sh      |
| .json            |  | db_query_events   |  | percentile()        |
| (cached P50/95/  |  | _since()          |  | mean()              |
|  99 + history)   |  | record_stage()    |  |                     |
+------------------+  +-------------------+  +---------------------+
         |                       |
         v                       v
+------------------+  +-------------------+
| Pipeline         |  | events.jsonl /    |
| templates JSON   |  | shipwright.db     |
|(manual overrides)|  | (source of truth) |
+------------------+  +-------------------+

Dependencies point inward: consumers depend on the library, the library depends on the data layer. No circular references.

Interface Contracts

// All functions are bash — TypeScript signatures used to express contracts

// Returns timeout in seconds. Never fails (returns default on error).
// Priority: template config.timeout_s > daemon-config stage_timeouts.defaults.$stage
//           > cached adaptive P95*1.2 > hardcoded default
function compute_adaptive_timeout(
  stage: string,                    // e.g. "build", "test", "review"
  pipeline_config_json?: string     // path to composed pipeline JSON
): number;                          // timeout in seconds, always > 0

// Recomputes percentiles from 30-day event window.
// Writes to STAGE_TIMEOUT_FILE atomically (tmp + mv).
// Called on-demand when cache is stale (>7 days). Idempotent.
function recalculate_stage_timeouts(
  force?: boolean                   // --force flag bypasses staleness check
): void;                            // exit 0 on success, exit 1 on error

// Wraps a command with timeout enforcement.
// Forks command in process group, starts watchdog, waits for either.
// On timeout: kills process group (kill -- -$pgid), emits stage.timeout event.
function run_with_stage_timeout(
  stage_id: string,                 // stage name for timeout lookup
  ...command: string[]              // command to execute (e.g. "run_stage_with_retry" "build")
): number;                          // exit 0 = success, 124 = timeout, other = command failure

// Emits observability metrics. Called after stage completion or timeout.
// Errors: none (best-effort emit, never fails the caller)
function record_stage_timeout_metric(
  stage_id: string,
  duration_s: number,
  timeout_s: number,
  result: "success" | "timeout" | "failed"
): void;

// Returns JSON blob for CLI dashboard rendering.
// Errors: returns minimal JSON with empty stages on error.
function get_timeout_dashboard_json(): string; // JSON

// Returns hardcoded default timeout for a stage type.
// Pure function, no side effects.
function _stage_timeout_default(stage: string): number;
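
As a rough illustration of the priority chain in the compute_adaptive_timeout contract, the sketch below resolves a timeout from values that have already been looked up. The helper names _is_pos_int and _pick_timeout are hypothetical, as is the catch-all default; only build=3600 and test=1800 come from this design. A case statement stands in for an associative array to stay Bash 3.2 compatible.

```shell
# Hardcoded defaults. build/test values are from this design;
# the catch-all value is an assumption for illustration.
_stage_timeout_default() {
  case "$1" in
    build) echo 3600 ;;
    test)  echo 1800 ;;
    *)     echo 1800 ;;
  esac
}

# Hypothetical helper: succeeds only for a positive integer.
_is_pos_int() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
  esac
  [ "$1" -gt 0 ]
}

# Hypothetical helper: first valid source wins.
# Args: stage, template timeout_s, daemon-config default,
#       cached adaptive timeout_s, cached sample count.
_pick_timeout() {
  local stage="$1" template="$2" operator="$3" cached="$4" samples="$5"
  if _is_pos_int "$template"; then echo "$template"; return 0; fi
  if _is_pos_int "$operator"; then echo "$operator"; return 0; fi
  if _is_pos_int "$cached" && _is_pos_int "$samples" && [ "$samples" -ge 10 ]; then
    echo "$cached"; return 0
  fi
  _stage_timeout_default "$stage"
}

_pick_timeout build "" "" 540 47   # cached value, enough samples
_pick_timeout build "" "" 540 5    # too few samples, falls back to the default
```

Any empty or malformed input simply falls through to the next tier, which matches the "never fails" requirement of the contract.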

Data Flow

Pipeline Start
    |
    v
[Check stage_timeouts.enabled in daemon-config]
    |
    +-- disabled --> run_stage_with_retry() directly (current behavior)
    |
    +-- enabled -->
         |
         v
    [compute_adaptive_timeout(stage, pipeline_config)]
         |
         +-- 1. Check pipeline template: stages[].config.timeout_s
         |      found? --> use it (manual override)
         |
         +-- 2. Check daemon-config: stage_timeouts.defaults.$stage
         |      found? --> use it (operator override)
         |
         +-- 3. Read ~/.shipwright/optimization/stage-timeouts.json
         |      |
         |      +-- missing or stale (>7 days)?
         |      |      |
         |      |      v
         |      |   [recalculate_stage_timeouts()]
         |      |      |
         |      |      +-- db_query_events_since(now-30days, "stage.completed")
         |      |      +-- group by stage, compute P50/P95/P99 via percentile()
         |      |      +-- timeout = max(P95 * 1.2, min_threshold)
         |      |      +-- atomic write to stage-timeouts.json
         |      |
         |      +-- fresh --> read cached timeout_s
         |             < 10 samples? --> use _stage_timeout_default()
         |
         +-- 4. Fallback: _stage_timeout_default(stage)
         |
         v
    [run_with_stage_timeout(stage, run_stage_with_retry, stage)]
         |
         +-- set -m (enable job control for process groups)
         +-- run command in background: command & pid=$!
         +-- start watchdog: ( sleep $timeout && kill -- -$pid ) & wd=$!
         +-- wait $pid --> capture exit_code
         |
         +-- exit 0 (success):
         |      kill watchdog
         |      record_stage_timeout_metric(stage, dur, timeout, "success")
         |      if dur > fixed_default: emit timeout.false_timeout_prevented
         |      return 0
         |
         +-- killed by signal (timeout):          <-- FAILURE POINT [F1]
         |      emit stage.timeout event
         |      set LAST_STAGE_ERROR_CLASS="timeout"
         |      record_stage_timeout_metric(stage, timeout, timeout, "timeout")
         |      return 124
         |
         +-- non-zero exit (command failure):     <-- FAILURE POINT [F2]
              kill watchdog
              record_stage_timeout_metric(stage, dur, timeout, "failed")
              return original exit code
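
The enforcement branch of this flow can be sketched as follows. This is an illustration under stated assumptions, not the final implementation: the function name run_with_timeout_sketch is hypothetical, the timeout is passed as the first argument rather than looked up by stage, and a marker file (an addition not in the design) is used to tell a watchdog kill apart from a command that died from some other signal. Event emission and metrics are omitted.

```shell
# Hypothetical sketch. The ( ... ) function body scopes `set -m` to a subshell.
run_with_timeout_sketch() (
  timeout_s="$1"; shift
  scratch="$(mktemp -d "${TMPDIR:-/tmp}/stage-timeout.XXXXXX")" || exit 1
  marker="$scratch/fired"

  set -m          # job control: each background job gets its own process group
  "$@" &          # the command leads a new process group (pgid == pid)
  pid=$!

  # Watchdog: after the timeout, kill the whole group if it is still alive.
  (
    sleep "$timeout_s"
    if kill -0 "$pid" 2>/dev/null; then
      : > "$marker"
      kill -- "-$pid" 2>/dev/null
    fi
  ) &
  wd=$!

  rc=0
  wait "$pid" || rc=$?

  # Stop the watchdog (and its sleep) so it does not linger.
  kill -- "-$wd" 2>/dev/null || true
  wait "$wd" 2>/dev/null || true

  if [ -e "$marker" ]; then rc=124; fi
  rm -rf "$scratch"
  exit "$rc"
)

rc=0; run_with_timeout_sketch 1 sleep 30 || rc=$?
echo "exit code after timeout: $rc"
```

Because the wrapped command always runs in the background, any shell variables it sets do not reach the caller; state such as LAST_STAGE_ERROR_CLASS would have to be set by the wrapper's caller based on the exit code.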

Error Boundaries

| Component | Error Source | Handling | Propagation |
| --- | --- | --- | --- |
| compute_adaptive_timeout | JSON parse failure, missing file, jq error | Return _stage_timeout_default(); never propagate errors upward | Swallowed; logged as warning |
| compute_adaptive_timeout | recalculate_stage_timeouts failure | Fall back to default; recalculation is best-effort | Swallowed |
| run_with_stage_timeout | Timeout fires | Kill process group, return exit 124 | Propagated; caller must distinguish 124 from other failures |
| run_with_stage_timeout | Command fails before timeout | Kill watchdog, return original exit code | Propagated; transparent to existing error handling |
| run_with_stage_timeout | Race: command exits while kill fires | Check if PID still alive before kill; kill 2>/dev/null | Swallowed; harmless |
| recalculate_stage_timeouts | DB unavailable, no events | Write empty/default JSON, log warning | Swallowed |
| record_stage_timeout_metric | Event emission failure | Best-effort emit; never fails the caller | Swallowed |
| get_timeout_dashboard_json | Missing cache file | Return JSON with defaults and zero samples | Swallowed |

Key error design principle: The timeout library must never prevent a pipeline from running. All computation errors fall back to sensible defaults. Only the enforcement outcome (exit 124) propagates to the caller.

Alternatives Considered

1. Inline in sw-pipeline.sh

Pros: Single file change, co-located with stage execution, no new imports.

Cons: sw-pipeline.sh is already 3041 lines. Adding 200+ lines of timeout logic (computation, watchdog, metrics, dashboard data) violates single-responsibility. Impossible to unit-test the timeout logic without loading the entire pipeline. Mixes data access (event queries, JSON caching) with orchestration.

Rejected because it makes the largest file even larger and prevents isolated testing.

2. Standalone script sw-stage-timeout.sh

Pros: Clean CLI entry point (sw stage-timeout dashboard), fully self-contained.

Cons: Requires registration in the CLI router (scripts/sw), creates another 600+ line top-level script (already 100+ scripts). Functions can't be sourced by sw-pipeline.sh without also sourcing its CLI logic. Doesn't follow the lib/*.sh decomposition pattern used by all other pipeline subsystems (e.g., lib/pipeline-state.sh, lib/daemon-adaptive.sh).

Rejected because it introduces a new integration pattern inconsistent with codebase conventions.

3. Reuse existing get_timeout() from sw-adaptive.sh directly

Pros: Zero new files, the function already computes P95-based timeouts.

Cons: get_timeout() calls db_query_events() with a 5000-event scan on every invocation, which is too expensive to run per-stage. No caching layer. No enforcement wrapper. Adding enforcement into sw-adaptive.sh would couple the adaptive tuning CLI with pipeline execution. The existing function would need significant modification (caching, process group kill, metrics) that would bloat sw-adaptive.sh beyond its scope.

Rejected because it conflates tuning analysis with runtime enforcement and has no caching.

Implementation Plan

Files to create

  1. scripts/lib/stage-timeout.sh (~250 lines) — Core library with all 6 functions, load guard, constants, per-stage defaults
  2. scripts/sw-stage-timeout-test.sh (~300 lines) — 17 tests covering unit, integration, and edge cases

Files to modify

  1. scripts/sw-pipeline.sh (~15 lines changed) — Source the library, wrap line 1683 run_stage_with_retry with run_with_stage_timeout, handle exit 124 as timeout-specific failure
  2. scripts/sw-adaptive.sh (~30 lines) — Add timeout subcommand calling get_timeout_dashboard_json(), register in case statement and help text
  3. scripts/sw-adaptive-test.sh (~20 lines) — Tests for the new timeout subcommand
  4. templates/pipelines/autonomous.json (~2 lines) — Add "timeout_s": 3600 to build stage config as a documented example of manual override
  5. .claude/CLAUDE.md (~1 line) — Add stage-timeouts.json to Runtime State section

Dependencies

  • No new external dependencies. Uses existing jq, bc, sqlite3 (all already required by the project).
  • Internal dependencies: sw-db.sh (event queries), sw-adaptive.sh (percentile function — sourced only for recalculate_stage_timeouts, not at load time).

Risk areas

| Risk | Severity | Mitigation |
| --- | --- | --- |
| Process group kill leaves orphan Claude sessions | High | Use kill -- -$pgid to kill entire process group. Add a cleanup trap in run_with_stage_timeout that runs on EXIT/TERM/INT. Verify with test that spawns a subprocess tree. |
| Race between natural completion and watchdog kill | Medium | Check kill -0 $pid 2>/dev/null before killing. The wait returns the actual exit code; if the process already exited, the kill is a no-op. |
| set -m (job control) interacts with set -e | Medium | Only enable job control in the subprocess fork, not globally. Use ( set -m; command ) subshell pattern to scope it. |
| recalculate_stage_timeouts during concurrent pipeline runs | Low | Atomic write (tmp + mv) prevents partial reads. Concurrent recalculations produce identical results (deterministic query window). Last writer wins, which is correct. |
| P95 too tight for legitimate long runs | Medium | 1.2x buffer on P95, configurable min_threshold_s per stage, manual override escape hatch, warning logged at 80% of timeout before kill. |
| Bootstrap: no historical data | Low | _stage_timeout_default() provides conservative defaults (build=3600s, test=1800s). System never blocks a pipeline from running due to missing data. |

Schema Changes

New file: ~/.shipwright/optimization/stage-timeouts.json

{
  "version": 1,
  "last_global_recalc": "2026-03-07T00:00:00Z",
  "stages": {
    "build": {
      "p50_s": 120, "p95_s": 450, "p99_s": 680,
      "timeout_s": 540,
      "min_threshold_s": 300,
      "samples": 47,
      "last_calculated": "2026-03-07T00:00:00Z",
      "history": [
        { "ts": "2026-03-01T00:00:00Z", "timeout_s": 520, "p95_s": 433, "samples": 42 }
      ]
    }
  }
}

Forward migration: None — file is created on-demand by recalculate_stage_timeouts().

Rollback: rm ~/.shipwright/optimization/stage-timeouts.json — system falls back to hardcoded defaults. No database schema changes. Events with stage.timeout type are harmless if ignored.
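
Reading a cached value from this file without ever failing the caller might look like the sketch below. The function name read_cached_timeout and its argument order are hypothetical; it assumes the schema shown above and treats a missing file, a missing jq binary, invalid JSON, an unknown stage, or fewer than 10 samples as reasons to return the supplied fallback.

```shell
# Hypothetical reader. Args: stage, cache file path, fallback timeout.
read_cached_timeout() {
  local stage="$1" file="$2" fallback="$3" value=""
  if [ -r "$file" ] && command -v jq >/dev/null 2>&1; then
    # Emit timeout_s only when the stage exists and has >= 10 samples.
    value="$(jq -r --arg s "$stage" \
      '.stages[$s] | select(.samples >= 10) | .timeout_s // empty' \
      "$file" 2>/dev/null)" || value=""
  fi
  case "$value" in
    ''|*[!0-9]*) echo "$fallback" ;;   # anything but a plain integer: use the fallback
    *)           echo "$value" ;;
  esac
}

# With no cache file present, the fallback is returned.
read_cached_timeout build /nonexistent/stage-timeouts.json 3600
```

Passing the fallback in explicitly keeps the reader independent of the defaults table, so deleting the cache file (the rollback path above) changes nothing for callers except the value returned.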

Data Flow Diagram (Ingestion)

Stage Execution
    |
    v
[stage.completed event] ---> events.jsonl / shipwright.db
    |                              |
    v                              v
emit_event()               record_stage() in sw-db.sh
                                   |
                                   v
              [recalculate_stage_timeouts() on staleness]
                                   |
                     db_query_events_since(30-day window)
                                   |
                     group by stage, percentile(durations, 95)
                                   |
                     atomic write --> stage-timeouts.json
                                          ^
                                          |
                           [F] Failure: write defaults, log warning
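
The percentile and timeout arithmetic in this flow can be illustrated with a small sketch. Both function names are hypothetical, and the nearest-rank method is an assumption: the actual percentile() in sw-adaptive.sh may interpolate differently. awk is used here for the arithmetic instead of bc to keep the example short.

```shell
# Nearest-rank percentile of newline-separated durations read from stdin.
# $1 is an integer percentile such as 50, 95 or 99.
percentile_sketch() {
  sort -n | awk -v p="$1" '
    NF { v[++n] = $1 }
    END {
      if (n == 0) exit 1                 # no samples: signal failure
      rank = int((p * n + 99) / 100)     # ceil(p * n / 100) for integers
      if (rank < 1) rank = 1
      print v[rank]
    }'
}

# timeout = max(round(P95 * 1.2), min_threshold)
timeout_from_p95() {
  awk -v p="$1" -v m="$2" 'BEGIN {
    t = int(p * 1.2 + 0.5)
    print (t > m ? t : m)
  }'
}

p95="$(seq 1 20 | percentile_sketch 95)"
echo "P95 of 1..20: $p95"
echo "timeout for P95=450s, min 300s: $(timeout_from_p95 450 300)"
```

With the schema example's numbers, a P95 of 450s and a min_threshold_s of 300 yield the 540s timeout shown there, while a P95 of 100s would be lifted to the 300s floor.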

Idempotency Strategy

  • Cache file (stage-timeouts.json): Recomputed deterministically from the events DB. Deleting it triggers a fresh recalculation on next pipeline run. Concurrent writes are safe via atomic tmp+mv.
  • Event emission: Events are append-only. The events table has a UNIQUE constraint on (ts_epoch, type, job_id) preventing duplicates. emit_event is idempotent for the same timestamp+type+job combination.
  • Timeout enforcement: Stateless per-run — reads config, applies, done. No persistent state modified during enforcement. Re-running a stage after timeout is safe.
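
The atomic tmp+mv write that the cache relies on can be sketched as below. The function name atomic_write_sketch is hypothetical. The temporary file is created next to the target so that mv is a same-filesystem rename, which is what lets readers see either the old file or the new one, never a partial write.

```shell
# Hypothetical helper: write stdin to $1 atomically.
atomic_write_sketch() {
  local target="$1" tmp
  # Create the temp file in the target's directory so mv is a rename.
  tmp="$(mktemp "${target}.tmp.XXXXXX")" || return 1
  if cat > "$tmp" && mv -f "$tmp" "$target"; then
    return 0
  fi
  rm -f "$tmp"   # clean up on failure, leave the old target untouched
  return 1
}

demo_dir="$(mktemp -d "${TMPDIR:-/tmp}/stage-timeout-demo.XXXXXX")"
printf '%s\n' '{"version":1,"stages":{}}' | atomic_write_sketch "$demo_dir/stage-timeouts.json"
cat "$demo_dir/stage-timeouts.json"
```

If two recalculations race, each renames its own complete temp file into place, so the last writer wins with a whole document.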

Rollback Plan

  1. Set stage_timeouts.enabled=false in daemon-config.json (instant, no code change)
  2. If code rollback needed: revert the run_with_stage_timeout wrapper at sw-pipeline.sh:1683 (restore the direct run_stage_with_retry call)
  3. Delete cached data: rm ~/.shipwright/optimization/stage-timeouts.json
  4. No database migration to revert — no schema was changed
  5. Orphaned stage.timeout events in the events DB are harmless (filtered out by queries that don't select that type)

Validation Criteria

  • compute_adaptive_timeout "build" returns 3600 (default) when no history exists
  • compute_adaptive_timeout "build" returns P95×1.2 when stage-timeouts.json has >=10 samples
  • Manual timeout_s in pipeline template config takes precedence over adaptive value
  • run_with_stage_timeout "test" sleep 100 with a 2-second timeout exits 124 and kills the sleep process group (no orphans)
  • run_with_stage_timeout "test" true exits 0 and kills the watchdog timer (no zombie watchdogs)
  • run_with_stage_timeout "test" false exits 1 (not 124) and preserves the original error semantics
  • stage.timeout events are distinguishable from stage.failed in the events DB
  • timeout.false_timeout_prevented is emitted when a stage takes longer than the fixed default but completes within the adaptive timeout
  • recalculate_stage_timeouts --force rewrites the cache file atomically (verified via concurrent reads during write)
  • sw adaptive timeout renders a formatted dashboard with P50/P95/P99 per stage
  • Setting stage_timeouts.enabled=false in daemon-config bypasses all timeout enforcement (stages run unbounded, existing behavior preserved)
  • History array in stage-timeouts.json is capped at 52 entries (1 year of weekly recalculations)
  • Full test suite bash scripts/sw-stage-timeout-test.sh passes with 0 failures
  • npm test passes with no regressions in existing test suites
  • No Bash 3.2 incompatibilities (no associative arrays, no readarray, no ${var,,})
