Pipeline Design 212

Design: Adaptive Stage Timeout Engine with P95 Duration-Based Auto-Tuning

Context

Pipeline stages currently run unbounded — there is no wall-clock enforcement. A failing build stage can burn 60+ minutes of compute before the retry/failure logic kicks in. The existing get_timeout() function in sw-adaptive.sh:141 already computes P95-based timeouts from historical stage.completed events, but nothing enforces them. The value is computed and used only for informational tuning recommendations — never as a kill signal.

Constraints from the codebase:

All scripts are Bash 3.2 compatible (no associative arrays, no readarray)
The scripts/lib/*.sh pattern (40 existing libraries) is the established way to add shared logic — libraries use an idempotent load guard ([[ -n "${_LIB_LOADED:-}" ]] && return 0)
Stage execution happens in sw-pipeline.sh:1683 via run_stage_with_retry() which calls stage_${id} functions — stages spawn Claude CLI sessions that create child processes
Event storage is dual-layer: SQLite (sw-db.sh) primary + JSONL fallback, queried via db_query_events() and db_query_events_since()
Pipeline templates are JSON files in templates/pipelines/ with per-stage config objects
The daemon already has an optimization/ directory at ~/.shipwright/optimization/ (used by daemon-adaptive.sh for daemon-tuning.json)
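
A minimal sketch of the idempotent load guard from the list above. The guard variable name _STAGE_TIMEOUT_LOADED and the LOAD_COUNT counter are hypothetical, chosen only to make the effect visible; the real libraries may name their guards differently. The sketch writes a throwaway library to a temp file and sources it twice:

```shell
#!/usr/bin/env bash
# Start from a clean slate so the demonstration is repeatable.
unset _STAGE_TIMEOUT_LOADED LOAD_COUNT

# Write a throwaway library whose first line is the load guard.
lib="$(mktemp)"
cat > "$lib" <<'EOF'
[[ -n "${_STAGE_TIMEOUT_LOADED:-}" ]] && return 0   # already sourced: skip the body
_STAGE_TIMEOUT_LOADED=1
LOAD_COUNT=$(( ${LOAD_COUNT:-0} + 1 ))              # counts how often the body ran
EOF

. "$lib"
. "$lib"   # the second source returns at the guard
rm -f "$lib"
echo "library body ran $LOAD_COUNT time(s)"
```

Because the guard returns before any definitions are re-evaluated, several consumers can source the same library without caring about load order.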

The problem: Without enforcement, a single stage can waste an entire pipeline budget. With enforcement, we need careful process group cleanup (Claude sessions spawn subprocesses), a priority chain for timeout sources, and a bootstrap path for repos with no historical data.

Decision

Create a new shared library scripts/lib/stage-timeout.sh that provides:

Adaptive timeout computation with a 4-tier priority chain (template override > daemon-config override > cached P95×1.2 > hardcoded defaults)
Timeout enforcement via process group kill with watchdog timer
Metrics emission for observability (fast-fail detection, false-timeout prevention)
Dashboard data for sw adaptive timeout CLI

Component Diagram

+------------------------------------------------------------------+
|                        Consumers                                  |
|                                                                   |
|  sw-pipeline.sh          sw-adaptive.sh         sw-daemon.sh      |
|  (enforcement)           (CLI dashboard)        (config source)   |
+--------+-----------------------+---------------------+-----------+
         |                       |                     |
         v                       v                     v
+------------------------------------------------------------------+
|              scripts/lib/stage-timeout.sh                         |
|                                                                   |
|  compute_adaptive_timeout()  run_with_stage_timeout()             |
|  recalculate_stage_timeouts() record_stage_timeout_metric()       |
|  get_timeout_dashboard_json() _stage_timeout_default()            |
+--------+-----------------------+---------------------+-----------+
         |                       |                     |
         v                       v                     v
+------------------+  +-------------------+  +---------------------+
| stage-timeouts   |  | sw-db.sh          |  | sw-adaptive.sh      |
| .json            |  | db_query_events   |  | percentile()        |
| (cached P50/95/  |  | _since()          |  | mean()              |
|  99 + history)   |  | record_stage()    |  |                     |
+------------------+  +-------------------+  +---------------------+
         |                       |
         v                       v
+------------------+  +-------------------+
| Pipeline         |  | events.jsonl /    |
| templates JSON   |  | shipwright.db     |
|(manual overrides)|  | (source of truth) |
+------------------+  +-------------------+

Dependencies point inward: consumers depend on the library, the library depends on the data layer. No circular references.

Interface Contracts

// All functions are bash — TypeScript signatures used to express contracts

// Returns timeout in seconds. Never fails (returns default on error).
// Priority: template config.timeout_s > daemon-config stage_timeouts.defaults.$stage
//           > cached adaptive P95*1.2 > hardcoded default
function compute_adaptive_timeout(
  stage: string,                    // e.g. "build", "test", "review"
  pipeline_config_json?: string     // path to composed pipeline JSON
): number;                          // timeout in seconds, always > 0

// Recomputes percentiles from 30-day event window.
// Writes to STAGE_TIMEOUT_FILE atomically (tmp + mv).
// Called on-demand when cache is stale (>7 days). Idempotent.
function recalculate_stage_timeouts(
  force?: boolean                   // --force flag bypasses staleness check
): void;                            // returns 0 on success, 1 on error

// Wraps a command with timeout enforcement.
// Forks command in process group, starts watchdog, waits for either.
// On timeout: kills process group (kill -- -$pgid), emits stage.timeout event.
function run_with_stage_timeout(
  stage_id: string,                 // stage name for timeout lookup
  ...command: string[]              // command to execute (e.g. "run_stage_with_retry" "build")
): number;                          // exit 0 = success, 124 = timeout, other = command failure

// Emits observability metrics. Called after stage completion or timeout.
// Errors: none (best-effort emit, never fails the caller)
function record_stage_timeout_metric(
  stage_id: string,
  duration_s: number,
  timeout_s: number,
  result: "success" | "timeout" | "failed"
): void;

// Returns JSON blob for CLI dashboard rendering.
// Errors: returns minimal JSON with empty stages on error.
function get_timeout_dashboard_json(): string; // JSON

// Returns hardcoded default timeout for a stage type.
// Pure function, no side effects.
function _stage_timeout_default(stage: string): number;
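
As an illustration of the last contract, here is one way a pure lookup could be written without associative arrays. The build and test values are the bootstrap defaults given under Risk areas (build=3600s, test=1800s); the catch-all value and the function name stage_timeout_default_sketch are assumptions, not part of the design.

```shell
#!/usr/bin/env bash
# Sketch: per-stage defaults via case, since Bash 3.2 has no associative arrays.
stage_timeout_default_sketch() {
  case "$1" in
    build) echo 3600 ;;
    test)  echo 1800 ;;
    *)     echo 1800 ;;   # assumed fallback for stages the design does not list
  esac
}

stage_timeout_default_sketch build   # prints 3600
```

A case statement keeps the function free of side effects and always prints a positive integer, which matches the "always > 0" guarantee of compute_adaptive_timeout.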

Data Flow

Pipeline Start
    |
    v
[Check stage_timeouts.enabled in daemon-config]
    |
    +-- disabled --> run_stage_with_retry() directly (current behavior)
    |
    +-- enabled -->
         |
         v
    [compute_adaptive_timeout(stage, pipeline_config)]
         |
         +-- 1. Check pipeline template: stages[].config.timeout_s
         |      found? --> use it (manual override)
         |
         +-- 2. Check daemon-config: stage_timeouts.defaults.$stage
         |      found? --> use it (operator override)
         |
         +-- 3. Read ~/.shipwright/optimization/stage-timeouts.json
         |      |
         |      +-- missing or stale (>7 days)?
         |      |      |
         |      |      v
         |      |   [recalculate_stage_timeouts()]
         |      |      |
         |      |      +-- db_query_events_since(now-30days, "stage.completed")
         |      |      +-- group by stage, compute P50/P95/P99 via percentile()
         |      |      +-- timeout = max(P95 * 1.2, min_threshold)
         |      |      +-- atomic write to stage-timeouts.json
         |      |
         |      +-- fresh --> read cached timeout_s
         |             < 10 samples? --> use _stage_timeout_default()
         |
         +-- 4. Fallback: _stage_timeout_default(stage)
         |
         v
    [run_with_stage_timeout(stage, run_stage_with_retry, stage)]
         |
         +-- set -m (enable job control for process groups)
         +-- run command in background: command & pid=$!
         +-- start watchdog: ( sleep $timeout && kill -- -$pid ) & wd=$!
         +-- wait $pid --> capture exit_code
         |
         +-- exit 0 (success):
         |      kill watchdog
         |      record_stage_timeout_metric(stage, dur, timeout, "success")
         |      if dur > fixed_default: emit timeout.false_timeout_prevented
         |      return 0
         |
         +-- killed by signal (timeout):          <-- FAILURE POINT [F1]
         |      emit stage.timeout event
         |      set LAST_STAGE_ERROR_CLASS="timeout"
         |      record_stage_timeout_metric(stage, timeout, timeout, "timeout")
         |      return 124
         |
         +-- non-zero exit (command failure):     <-- FAILURE POINT [F2]
              kill watchdog
              record_stage_timeout_metric(stage, dur, timeout, "failed")
              return original exit code
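
The enforcement branch of the flow above can be sketched as follows. This is not the proposed implementation: the function name run_with_timeout_sketch is hypothetical, the marker file used to tell a timeout apart from an ordinary failure is an assumption, and job control is toggled only around the two forks rather than scoped with a subshell. Event emission and metrics are left out.

```shell
#!/usr/bin/env bash
# Sketch: run a command in its own process group and kill the whole group
# if it exceeds timeout_s. Returns 0, 124 (timeout), or the command's own code.
run_with_timeout_sketch() {
  local timeout_s="$1"; shift
  local pid wd marker rc=0

  marker="$(mktemp)"            # stays empty unless the watchdog fires

  set -m                        # job control: each background job gets its own process group
  "$@" &
  pid=$!                        # under set -m, $pid is also the process group ID
  ( sleep "$timeout_s"; echo fired > "$marker"; kill -TERM -- "-$pid" 2>/dev/null ) &
  wd=$!
  set +m

  wait "$pid" || rc=$?          # capture the exit code without tripping set -e

  # Stop the watchdog and its sleep child (the watchdog is its own group too).
  kill -TERM -- "-$wd" 2>/dev/null || true
  wait "$wd" 2>/dev/null || true

  if [ -s "$marker" ]; then rc=124; fi
  rm -f "$marker"
  return "$rc"
}

rc=0
run_with_timeout_sketch 1 sleep 5 || rc=$?
echo "exit code: $rc"           # 124 when the watchdog fires
```

Sending the signal to the negative PID targets the whole process group, so children spawned by the stage are terminated along with it. Bash may print a "Terminated" notice on stderr when it reaps the killed job.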

Error Boundaries

Component	Error Source	Handling	Propagation
`compute_adaptive_timeout`	JSON parse failure, missing file, jq error	Return `_stage_timeout_default()` — never propagate errors upward	Swallowed — logged as warning
`compute_adaptive_timeout`	`recalculate_stage_timeouts` failure	Fall back to default — recalculation is best-effort	Swallowed
`run_with_stage_timeout`	Timeout fires	Kill process group, return exit 124	Propagated — caller must distinguish 124 from other failures
`run_with_stage_timeout`	Command fails before timeout	Kill watchdog, return original exit code	Propagated — transparent to existing error handling
`run_with_stage_timeout`	Race: command exits while kill fires	Check if PID still alive before kill; `kill 2>/dev/null`	Swallowed — harmless
`recalculate_stage_timeouts`	DB unavailable, no events	Write empty/default JSON, log warning	Swallowed
`record_stage_timeout_metric`	Event emission failure	Best-effort emit; never fails the caller	Swallowed
`get_timeout_dashboard_json`	Missing cache file	Return JSON with defaults and zero samples	Swallowed

Key error design principle: The timeout library must never prevent a pipeline from running. All computation errors fall back to sensible defaults. Only the enforcement outcome (exit 124) propagates to the caller.
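
One way to make "fall back to sensible defaults" concrete is to validate every candidate value and take the first usable one in priority order. The sketch below assumes the lookups have already happened and their results are passed as arguments; both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first argument that is a positive integer.
_first_valid_timeout() {
  local candidate
  for candidate in "$@"; do
    case "$candidate" in
      ''|*[!0-9]*) continue ;;        # empty, "null", negative, or non-numeric
    esac
    if [ "$candidate" -gt 0 ]; then
      echo "$candidate"
      return 0
    fi
  done
  return 1
}

# Arguments are the already-looked-up values, highest priority first.
compute_timeout_sketch() {
  local template_s="$1" operator_s="$2" cached_s="$3" default_s="$4"
  _first_valid_timeout "$template_s" "$operator_s" "$cached_s" || echo "$default_s"
}

compute_timeout_sketch ""   ""    540  3600   # cached value wins: 540
compute_timeout_sketch 900  ""    540  3600   # template override wins: 900
compute_timeout_sketch null "abc" ""   3600   # nothing usable: 3600
```

A failed jq call or a missing file typically surfaces as an empty string or "null", so both are treated as "no value" and the chain moves on instead of raising an error.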

Alternatives Considered

1. Inline in `sw-pipeline.sh`

Pros: Single file change, co-located with stage execution, no new imports.
Cons: sw-pipeline.sh is already 3041 lines. Adding 200+ lines of timeout logic (computation, watchdog, metrics, dashboard data) violates single-responsibility. Impossible to unit-test the timeout logic without loading the entire pipeline. Mixes data access (event queries, JSON caching) with orchestration.
Rejected because it makes the largest file even larger and prevents isolated testing.

2. Standalone script `sw-stage-timeout.sh`

Pros: Clean CLI entry point (sw stage-timeout dashboard), fully self-contained.
Cons: Requires registration in the CLI router (scripts/sw) and adds another 600+ line top-level script to a repo that already has 100+ scripts. Functions can't be sourced by sw-pipeline.sh without also sourcing its CLI logic. Doesn't follow the lib/*.sh decomposition pattern used by all other pipeline subsystems (e.g., lib/pipeline-state.sh, lib/daemon-adaptive.sh).
Rejected because it introduces a new integration pattern inconsistent with codebase conventions.

3. Reuse existing `get_timeout()` from `sw-adaptive.sh` directly

Pros: Zero new files, the function already computes P95-based timeouts.
Cons: get_timeout() calls db_query_events() with a 5000-event scan on every invocation, which is too expensive to run per-stage. No caching layer. No enforcement wrapper. Adding enforcement into sw-adaptive.sh would couple the adaptive tuning CLI with pipeline execution. The existing function would need significant modification (caching, process group kill, metrics) that would bloat sw-adaptive.sh beyond its scope.
Rejected because it conflates tuning analysis with runtime enforcement and has no caching.

Implementation Plan

Files to create

scripts/lib/stage-timeout.sh (~250 lines) — Core library with all 6 functions, load guard, constants, per-stage defaults
scripts/sw-stage-timeout-test.sh (~300 lines) — 17 tests covering unit, integration, and edge cases

Files to modify

scripts/sw-pipeline.sh (~15 lines changed) — Source the library, wrap line 1683 run_stage_with_retry with run_with_stage_timeout, handle exit 124 as timeout-specific failure
scripts/sw-adaptive.sh (~30 lines) — Add timeout subcommand calling get_timeout_dashboard_json(), register in case statement and help text
scripts/sw-adaptive-test.sh (~20 lines) — Tests for the new timeout subcommand
templates/pipelines/autonomous.json (~2 lines) — Add "timeout_s": 3600 to build stage config as a documented example of manual override
.claude/CLAUDE.md (~1 line) — Add stage-timeouts.json to Runtime State section

Dependencies

No new external dependencies. Uses existing jq, bc, sqlite3 (all already required by the project).
Internal dependencies: sw-db.sh (event queries), sw-adaptive.sh (percentile function — sourced only for recalculate_stage_timeouts, not at load time).

Risk areas

Risk	Severity	Mitigation
Process group kill leaves orphan Claude sessions	High	Use `kill -- -$pgid` to kill entire process group. Add a cleanup trap in `run_with_stage_timeout` that runs on EXIT/TERM/INT. Verify with test that spawns a subprocess tree.
Race between natural completion and watchdog kill	Medium	Check `kill -0 $pid 2>/dev/null` before killing. The `wait` returns the actual exit code — if the process already exited, the kill is a no-op.
`set -m` (job control) interacts with `set -e`	Medium	Only enable job control in the subprocess fork, not globally. Use `( set -m; command )` subshell pattern to scope it.
`recalculate_stage_timeouts` during concurrent pipeline runs	Low	Atomic write (tmp + mv) prevents partial reads. Concurrent recalculations produce identical results (deterministic query window). Last writer wins, which is correct.
P95 too tight for legitimate long runs	Medium	1.2x buffer on P95, configurable `min_threshold_s` per stage, manual override escape hatch, warning logged at 80% of timeout before kill.
Bootstrap: no historical data	Low	`_stage_timeout_default()` provides conservative defaults (build=3600s, test=1800s). System never blocks a pipeline from running due to missing data.

Schema Changes

New file: `~/.shipwright/optimization/stage-timeouts.json`

{
  "version": 1,
  "last_global_recalc": "2026-03-07T00:00:00Z",
  "stages": {
    "build": {
      "p50_s": 120, "p95_s": 450, "p99_s": 680,
      "timeout_s": 540,
      "min_threshold_s": 300,
      "samples": 47,
      "last_calculated": "2026-03-07T00:00:00Z",
      "history": [
        { "ts": "2026-03-01T00:00:00Z", "timeout_s": 520, "p95_s": 433, "samples": 42 }
      ]
    }
  }
}
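
Reading a cached value out of this file could look like the sketch below. It assumes jq 1.5 or later (for --argjson) and applies the 10-sample minimum from the Data Flow section; the function name cached_timeout_sketch is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print a stage's cached timeout_s, or nothing if the stage is missing,
# has too few samples, or the file cannot be parsed.
cached_timeout_sketch() {
  local stage="$1" file="$2" min_samples="${3:-10}"
  jq -r --arg s "$stage" --argjson min "$min_samples" '
    .stages[$s] // empty
    | select((.samples // 0) >= $min)
    | .timeout_s // empty
  ' "$file" 2>/dev/null || true      # never fail the caller
}

# Usage: an empty result means "fall through to the hardcoded default".
timeout_s="$(cached_timeout_sketch build "$HOME/.shipwright/optimization/stage-timeouts.json")"
[ -n "$timeout_s" ] || timeout_s=3600
echo "$timeout_s"
```

With the example file above, the build stage has 47 samples, so the cached 540 would be returned. A missing file or an unknown stage yields an empty string rather than an error.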

Forward migration: None — file is created on-demand by recalculate_stage_timeouts().

Rollback: rm ~/.shipwright/optimization/stage-timeouts.json — system falls back to hardcoded defaults. No database schema changes. Events with stage.timeout type are harmless if ignored.

Data Flow Diagram (Ingestion)

Stage Execution
    |
    v
[stage.completed event] ---> events.jsonl / shipwright.db
    |                              |
    v                              v
emit_event()               record_stage() in sw-db.sh
                                   |
                                   v
              [recalculate_stage_timeouts() on staleness]
                                   |
                     db_query_events_since(30-day window)
                                   |
                     group by stage, percentile(durations, 95)
                                   |
                     atomic write --> stage-timeouts.json
                                          ^
                                          |
                           [F] Failure: write defaults, log warning
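
The percentile and buffer arithmetic in this flow can be sketched in portable shell. The existing percentile() in sw-adaptive.sh is not shown in this document, so the nearest-rank method below is an assumption, as are the function names and the round-to-nearest-second step (which reproduces the 540 and 520 values in the schema example).

```shell
#!/usr/bin/env bash
# Sketch: nearest-rank percentile over newline-separated integer durations on stdin.
percentile_sketch() {
  local p="$1"
  sort -n | awk -v p="$p" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((p * NR + 99) / 100)   # ceil(p/100 * NR) for integer p
      if (rank < 1) rank = 1
      print v[rank]
    }'
}

# timeout = max(round(P95 * 1.2), min_threshold), in whole seconds.
adaptive_timeout_sketch() {
  local p95="$1" min_threshold="$2" t
  t=$(( (p95 * 12 + 5) / 10 ))          # integer form of round(p95 * 1.2)
  if [ "$t" -lt "$min_threshold" ]; then t="$min_threshold"; fi
  echo "$t"
}

adaptive_timeout_sketch 450 300   # prints 540
adaptive_timeout_sketch 100 300   # prints 300 (min_threshold wins)
```

Integer arithmetic avoids a dependency on bc for this step and keeps the result deterministic across platforms.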

Idempotency Strategy

Cache file (stage-timeouts.json): Recomputed deterministically from the events DB. Deleting it triggers a fresh recalculation on next pipeline run. Concurrent writes are safe via atomic tmp+mv.
Event emission: Events are append-only. The events table has a UNIQUE constraint on (ts_epoch, type, job_id) preventing duplicates. emit_event is idempotent for the same timestamp+type+job combination.
Timeout enforcement: Stateless per-run — reads config, applies, done. No persistent state modified during enforcement. Re-running a stage after timeout is safe.
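
The atomic tmp+mv write mentioned for the cache file can be sketched as below. The function name atomic_write_sketch is hypothetical. The temp file is created next to the target so that mv is a rename within one filesystem, which is what makes the replacement atomic for readers.

```shell
#!/usr/bin/env bash
# Sketch: write stdin to target via a sibling temp file, then rename into place.
atomic_write_sketch() {
  local target="$1" tmp
  tmp="$(mktemp "${target}.XXXXXX")" || return 1
  if cat > "$tmp" && mv -f "$tmp" "$target"; then
    return 0
  fi
  rm -f "$tmp"                 # clean up the partial file on failure
  return 1
}

# Usage with a throwaway directory.
demo_dir="$(mktemp -d)"
printf '{"version":1}\n' | atomic_write_sketch "$demo_dir/stage-timeouts.json"
cat "$demo_dir/stage-timeouts.json"
```

A concurrent reader sees either the old file or the new one, never a half-written one, and the last writer wins.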

Rollback Plan

Set stage_timeouts.enabled=false in daemon-config.json (instant, no code change)
If code rollback is needed: revert the wrapper in sw-pipeline.sh:1683 (restore the direct run_stage_with_retry call)
Delete cached data: rm ~/.shipwright/optimization/stage-timeouts.json
No database migration to revert — no schema was changed
Orphaned stage.timeout events in the events DB are harmless (filtered out by queries that don't select that type)
