
Pipeline Design 51

Seth Ford edited this page Feb 13, 2026 · 1 revision


Design: Event-driven architecture — replace polling with durable event bus, SIGCHLD traps, and real-time notifications

Context

Shipwright's daemon (scripts/sw-daemon.sh) orchestrates pipeline processes using a poll-based architecture. The main loop (daemon_poll_loop(), line 4856) sleeps for 30–120 seconds per cycle (adaptive), then sequentially calls daemon_poll_issues(), daemon_reap_completed(), and daemon_health_check().

The core problem: Process completion detection is slow and fragile.

  1. Reaping latency: daemon_reap_completed() (line 1823) iterates all active_jobs in daemon-state.json, probing each PID with kill -0 $pid (line 1847). A pipeline exiting at T=5 into a 60-second cycle won't be detected until T=60 — up to 55 seconds of dead time before the queue drains or GitHub gets notified.

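  2. Exit code race condition: After kill -0 returns non-zero, wait $pid (line 1855) may return 127, which is what Bash reports when it no longer tracks that PID as one of its children (for example, the status was already collected, or the daemon restarted and the pipeline is no longer its child). The daemon then falls back to grepping log files for "Pipeline completed successfully" (line 1860), a fragile heuristic that can misclassify outcomes.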

  3. Wasted API calls: daemon_poll_issues() (line 3827) hits the GitHub API every cycle regardless of whether new issues exist. Periodic tasks (config reload, auto-scale, self-optimize, stale cleanup) fire on modulo counters tied to poll cycles, creating coupling between timer frequency and task scheduling.
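
The 127 status described in item 2 is easy to reproduce in isolation. The sketch below is not taken from sw-daemon.sh, and its variable names are illustrative. It only shows that wait yields the real exit status for a child the current shell still tracks, and 127 for a PID it does not track.

```shell
#!/usr/bin/env bash
# A child this shell started: wait returns its real exit status.
( exit 42 ) &
child_pid=$!
child_status=0
wait "$child_pid" || child_status=$?

# PID 1 is never a child of this shell, so wait cannot report on it
# and returns 127 instead.
foreign_status=0
wait 1 2>/dev/null || foreign_status=$?

echo "child=$child_status foreign=$foreign_status"
```

Run under Bash, this prints child=42 foreign=127. Note that calling wait inside a command substitution would also give 127, because the subshell is not the parent of the child.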

Constraints:

  • All scripts must be Bash 3.2 compatible (no associative arrays, no readarray)
  • No external dependencies beyond standard POSIX utilities + jq + gh
  • The events.jsonl append-only log must continue working unchanged for metrics, memory system, and DORA calculations
  • Graceful degradation to current polling if the event bus is unavailable
  • Pipelines run as background subprocesses spawned via (trap '' HUP; cd $dir; exec sw-pipeline.sh ...) & (line 1763)

Decision

Introduce three cooperating mechanisms that replace the sleep-poll-probe cycle:

1. Named FIFO Event Bus (scripts/lib/event-bus.sh)

A new library creates and manages a named pipe at ~/.shipwright/events.fifo. The FIFO provides a zero-dependency inter-process communication (IPC) channel between pipeline subprocesses and the daemon.

Lifecycle:

  • event_bus_init(): creates the FIFO with mkfifo if absent and opens it for reading on fd 3 without blocking
  • event_bus_write(): writes a single JSON line to the FIFO; it must never block the caller, even when no reader is attached
  • event_bus_read(): runs read -t $timeout <&3 to consume one event from the FIFO
  • event_bus_destroy(): closes fd 3 and removes the FIFO file

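Writer safety: Writers open and close the FIFO once per event. If the FIFO doesn't exist or has no reader, the write is skipped rather than blocking the pipeline; events still land in events.jsonl via the existing path. Bash redirection cannot request O_WRONLY|O_NONBLOCK directly, so the Risk Areas section describes how the non-blocking behavior is obtained in practice.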
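
The four lifecycle functions could look roughly like the sketch below. It is a minimal illustration, not the actual scripts/lib/event-bus.sh. EVENT_BUS_FIFO is a hypothetical variable name, and the read-write open on fd 3 is an assumption: POSIX leaves read-write opens of a FIFO unspecified, although Linux and macOS both accept them. That open lets init return without waiting for a writer and stops the daemon from seeing EOF each time a writer closes.

```shell
#!/usr/bin/env bash
# Hypothetical variable name; the path is the one given in this document.
EVENT_BUS_FIFO="${EVENT_BUS_FIFO:-$HOME/.shipwright/events.fifo}"
EVENT_BUS_DEGRADED=""

event_bus_init() {
    mkdir -p "$(dirname "$EVENT_BUS_FIFO")" || { EVENT_BUS_DEGRADED=1; return 1; }
    if [[ ! -p "$EVENT_BUS_FIFO" ]]; then
        mkfifo "$EVENT_BUS_FIFO" 2>/dev/null || { EVENT_BUS_DEGRADED=1; return 1; }
    fi
    # Read-write open (assumption): does not block waiting for a writer.
    { exec 3<>"$EVENT_BUS_FIFO"; } 2>/dev/null || { EVENT_BUS_DEGRADED=1; return 1; }
}

event_bus_write() {
    # Only write to a real FIFO: redirecting to a missing path would
    # create a regular file instead of failing.
    [[ -z "$EVENT_BUS_DEGRADED" && -p "$EVENT_BUS_FIFO" ]] || return 1
    printf '%s\n' "$1" > "$EVENT_BUS_FIFO"
}

event_bus_read() {
    # Bash 3.2 accepts only whole-second timeouts for read -t.
    local line=""
    IFS= read -r -t "${1:-1}" line <&3 || return 1
    printf '%s\n' "$line"
}

event_bus_destroy() {
    exec 3>&-
    rm -f "$EVENT_BUS_FIFO"
}
```

A round trip in one process would be event_bus_init, then event_bus_write '{"type":"pipeline.started"}', then event_bus_read 1. The write does not block here only because fd 3 is already open on the FIFO.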

Why FIFO over alternatives: Named pipes are POSIX standard, work on macOS Bash 3.2, require zero dependencies, and provide natural backpressure. They're simpler than Unix domain sockets and more portable than inotifywait.

2. Dual-Write emit_event() in scripts/lib/helpers.sh

The existing emit_event() function (line 56) is extended to dual-write:

emit_event() {
    # ... build JSON line (unchanged) ...
    
    # 1. Durable write (unchanged) — append to events.jsonl
    echo "$json_line" >> "$EVENTS_FILE"
    
    # 2. Real-time write (new) — push to FIFO if available
    event_bus_write "$json_line" 2>/dev/null || true
}

Every existing call site (daemon.spawn, daemon.reap, pipeline.completed, pipeline.started, etc.) automatically gains real-time delivery with zero code changes at call sites. The || true ensures FIFO failure never breaks event logging.

Events gain two new fields for idempotency:

  • seq — monotonic counter (per-process, reset on restart)
  • correlation_id — set by the daemon at spawn time, inherited by the pipeline child (via environment variable SHIPWRIGHT_CORRELATION_ID)
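
As a rough illustration of how the two new fields might be attached, the sketch below increments a per-process counter and reads the correlation ID from the environment. The function name emit_event_sketch, the EVENTS_FILE default path, and the JSON field layout are assumptions. The real helper presumably builds the line with jq, whereas this sketch uses printf and so assumes the values need no JSON escaping.

```shell
#!/usr/bin/env bash
# Assumed default path; the document does not state where events.jsonl lives.
EVENTS_FILE="${EVENTS_FILE:-$HOME/.shipwright/events.jsonl}"
EVENT_SEQ=0   # per-process counter, starts over when the process restarts

emit_event_sketch() {
    local event_type="$1" json_line
    EVENT_SEQ=$((EVENT_SEQ + 1))
    # Simplification: no JSON escaping is applied to the values.
    json_line=$(printf '{"ts":%s,"type":"%s","seq":%d,"correlation_id":"%s"}' \
        "$(date +%s)" "$event_type" "$EVENT_SEQ" "${SHIPWRIGHT_CORRELATION_ID:-}")

    # 1. Durable write: append to events.jsonl.
    printf '%s\n' "$json_line" >> "$EVENTS_FILE"

    # 2. Real-time write: only if an event bus library has been loaded.
    if declare -F event_bus_write >/dev/null; then
        event_bus_write "$json_line" 2>/dev/null || true
    fi
}
```

Because the counter lives in a shell variable, the function has to run in the calling shell. An increment made inside a command substitution or other subshell would be lost.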

3. SIGCHLD Trap + Event-Driven Loop in scripts/sw-daemon.sh

SIGCHLD handler (replaces kill -0 probing):

SIGCHLD_FLAG=""
trap 'SIGCHLD_FLAG=1' SIGCHLD

When a pipeline subprocess exits, the kernel delivers SIGCHLD to the daemon. The trap sets a flag variable. The event loop checks the flag, then walks the tracked PIDs and calls wait $pid on those that have exited (wait -n is not available in Bash 3.2) to reap completed children and retrieve their exit codes directly. This removes the timer-driven kill -0 polling and the 127/log-grepping fallback.

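Compatibility note: Bash 3.2 does not have wait -n (it arrived in Bash 4.3). The handler will instead iterate known PIDs from active_jobs and call wait $pid on each one that has exited. Because wait $pid blocks while the child is still running, the handler must first confirm the exit (for example with kill -0 $pid) rather than waiting on every tracked PID. The daemon is the direct parent of each pipeline, so no other process can collect the exit status first, and wait will reliably return the true exit code.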
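
A Bash 3.2-compatible reap pass along these lines might look like the following sketch. ACTIVE_PIDS and REAPED are hypothetical stand-ins for the active_jobs state in daemon-state.json, and reap_exited_children is an illustrative name rather than the real daemon_reap_completed_sigchld().

```shell
#!/usr/bin/env bash
SIGCHLD_FLAG=""
trap 'SIGCHLD_FLAG=1' CHLD

ACTIVE_PIDS=""   # hypothetical: space-separated PIDs of running pipelines
REAPED=""        # hypothetical: "pid:exit_code" pairs collected so far

reap_exited_children() {
    local pid code still_running=""
    for pid in $ACTIVE_PIDS; do
        if kill -0 "$pid" 2>/dev/null; then
            # Still alive (or not yet reaped by Bash): wait would block,
            # so leave it for a later pass.
            still_running="$still_running $pid"
            continue
        fi
        code=0
        wait "$pid" 2>/dev/null || code=$?
        REAPED="$REAPED $pid:$code"
    done
    ACTIVE_PIDS="$still_running"
}
```

Walking every tracked PID on each pass also covers coalesced signals: if two children exit before the flag is checked, both are picked up in the same call. A child that looks alive for a moment after exiting is simply retried on the next pass.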

Event-driven main loop (daemon_event_loop(), replaces daemon_poll_loop()):

daemon_event_loop() {
    event_bus_init
    while [[ ! -f "$SHUTDOWN_FLAG" ]]; do
        # 1. Check SIGCHLD flag — reap immediately
        if [[ -n "$SIGCHLD_FLAG" ]]; then
            SIGCHLD_FLAG=""
            daemon_reap_completed_sigchld  # new: uses wait, not kill -0
        fi
        
        # 2. Read from FIFO with 1-second timeout
        local event_line=""
        if event_line=$(event_bus_read 1); then
            daemon_dispatch_event "$event_line"
        fi
        
        # 3. Time-based periodic tasks (unchanged logic, now wall-clock based)
        daemon_run_periodic_tasks
        
        # 4. GitHub issue polling on its own timer (not every loop iteration)
        daemon_maybe_poll_issues
    done
    event_bus_destroy
}

The loop blocks on read -t 1 instead of sleep 1. This means:

  • Pipeline events (stage transitions, completions) wake the daemon instantly
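  • SIGCHLD sets the flag, and the reap runs on the next pass through the loop, at most about one second later (outside POSIX mode, Bash does not promise that a trapped signal interrupts read)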
  • Periodic tasks still fire on wall-clock intervals (decoupled from poll frequency)
  • If the FIFO is broken, event_bus_read returns non-zero every 1 second — same cadence as the old 1-second sleep increments

Periodic task scheduling changes from modulo-on-cycle-counter to wall-clock timestamps:

daemon_run_periodic_tasks() {
    local now
    now=$(now_epoch)   # declared separately so local cannot mask a failure (shellcheck SC2155)
    [[ $((now - LAST_CONFIG_RELOAD)) -ge 180 ]]   && { daemon_reload_config; LAST_CONFIG_RELOAD=$now; }
    [[ $((now - LAST_DEGRADATION_CHECK)) -ge 300 ]] && { daemon_check_degradation; LAST_DEGRADATION_CHECK=$now; }
    [[ $((now - LAST_AUTO_SCALE)) -ge ${AUTO_SCALE_INTERVAL_SECS:-300} ]] && { daemon_auto_scale; LAST_AUTO_SCALE=$now; }
    # ... etc
}

This decouples task frequency from poll interval entirely.

Safety net: A 5-minute kill -0 sweep runs as a periodic task to catch any SIGCHLDs that were missed (e.g., signal delivered during a non-interruptible section). This sweep is the existing daemon_reap_completed() logic, kept as a fallback.

4. Pipeline Correlation ID (scripts/sw-pipeline.sh)

daemon_spawn_pipeline() sets SHIPWRIGHT_CORRELATION_ID as an environment variable before exec. The pipeline inherits it and includes it in all emit_event() calls. This enables:

  • Tracing all events from a single pipeline run
  • Deduplication by the daemon's event dispatcher
  • Future: filtering the FIFO by correlation ID for multi-pipeline scenarios
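
Deduplication on (correlation_id, seq) has to work without associative arrays under Bash 3.2. One possible approach, sketched below with hypothetical names, keeps the seen keys in a delimited string and matches against it with case. STATE_MUTATIONS stands in for whatever state change the dispatcher would perform.

```shell
#!/usr/bin/env bash
SEEN_EVENT_KEYS="|"   # hypothetical: "|corr:seq|corr:seq|..." 
STATE_MUTATIONS=0     # hypothetical stand-in for a real state update

# Returns 0 if this (correlation_id, seq) pair was already recorded;
# otherwise records it and returns 1.
event_already_seen() {
    local key="$1:$2"
    case "$SEEN_EVENT_KEYS" in
        *"|$key|"*) return 0 ;;
    esac
    SEEN_EVENT_KEYS="$SEEN_EVENT_KEYS$key|"
    return 1
}

handle_event_once() {
    if event_already_seen "$1" "$2"; then
        return 0   # duplicate: skip the state mutation
    fi
    STATE_MUTATIONS=$((STATE_MUTATIONS + 1))
}
```

Keying on both fields matters because seq is per-process and restarts from the beginning on each run, so the same seq value legitimately appears under different correlation IDs. The string grows without bound in this sketch; a real dispatcher would likely drop a run's keys once that pipeline is reaped.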

Data Flow

Pipeline Process                    Daemon Process
─────────────────                   ──────────────
emit_event("pipeline.completed")
  ├─ append to events.jsonl  ───→  (durable, read by metrics later)
  └─ write to events.fifo   ───→  event_bus_read() wakes daemon
                                     └─ daemon_dispatch_event()
                                          └─ triggers immediate reap/notify
Process exits
  └─ SIGCHLD delivered       ───→  trap sets SIGCHLD_FLAG
                                     └─ daemon_reap_completed_sigchld()
                                          └─ wait $pid → exit code (reliable)
                                          └─ daemon_on_success / daemon_on_failure
                                          └─ dequeue next issue

Event Dispatch Table

Event Type                    Daemon Action
────────────────────────────  ───────────────────────────────────────────────────────────
pipeline.completed            Skip next kill -0 sweep for this PID (already know outcome)
pipeline.started              Log, update dashboard
pipeline.context_exhaustion   Tag for retry escalation
pipeline.quality_gate_failed  Optionally alert early
daemon.*                      Self-events; ignore on read-back
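
The table maps naturally onto a case statement. The sketch below is only an illustration of that mapping: the function name, the bookkeeping variables, and the convention of passing an already-parsed event type and PID are all assumptions, and the actions are reduced to recording what would happen.

```shell
#!/usr/bin/env bash
SKIP_SWEEP_PIDS=""         # hypothetical: PIDs whose outcome is already known
RETRY_ESCALATION_PIDS=""   # hypothetical: PIDs tagged for retry escalation
LAST_DISPATCH=""           # hypothetical: records the action taken, for inspection

daemon_dispatch_event_sketch() {
    local event_type="$1" pid="${2:-}"
    case "$event_type" in
        daemon.*)
            LAST_DISPATCH="ignored" ;;            # self-events read back from the bus
        pipeline.completed)
            SKIP_SWEEP_PIDS="$SKIP_SWEEP_PIDS $pid"
            LAST_DISPATCH="skip_sweep" ;;
        pipeline.started)
            LAST_DISPATCH="logged" ;;
        pipeline.context_exhaustion)
            RETRY_ESCALATION_PIDS="$RETRY_ESCALATION_PIDS $pid"
            LAST_DISPATCH="retry_tagged" ;;
        pipeline.quality_gate_failed)
            LAST_DISPATCH="alert_candidate" ;;
        *)
            LAST_DISPATCH="unknown" ;;            # unrecognized types are not an error
    esac
}
```

In the daemon, the event type would first be extracted from the JSON line, for instance with jq -r on the relevant field. The field name depends on how emit_event() lays out its JSON, which this document does not show.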

Fallback / Degradation

If mkfifo fails or the FIFO becomes unreadable:

  1. event_bus_init() logs a warning and sets EVENT_BUS_DEGRADED=1
  2. daemon_event_loop() detects degradation and falls back to daemon_poll_loop() (the existing code, completely unchanged)
  3. emit_event() skips the FIFO write when degraded — events.jsonl still captures everything
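
The choice between the two loops can be reduced to a small health check, sketched below. event_bus_healthy and select_daemon_loop are hypothetical helper names, and EVENT_BUS_FIFO is an assumed variable holding the FIFO path. The loop function names and EVENT_BUS_DEGRADED come from this document.

```shell
#!/usr/bin/env bash
# Healthy means: not flagged as degraded, and the path is still a FIFO.
event_bus_healthy() {
    [[ -z "${EVENT_BUS_DEGRADED:-}" && -p "${EVENT_BUS_FIFO:-}" ]]
}

# Prints the name of the loop function the daemon should run.
select_daemon_loop() {
    if event_bus_healthy; then
        echo "daemon_event_loop"
    else
        echo "daemon_poll_loop"
    fi
}
```

Running the same check at the top of each event-loop iteration would be one way to notice a FIFO that is deleted mid-run. Removing the file does not invalidate the daemon's already-open fd 3, so the loss would not otherwise show up as a read error.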

Alternatives Considered

1. inotifywait on events.jsonl

Approach: Use inotify to watch the events file for writes, waking the daemon on each append.

Pros: No new IPC primitive; events.jsonl is already the single write target; filesystem-level notification.

Cons: inotifywait is Linux-only; macOS would need a substitute such as fswatch, a Homebrew dependency. Bash 3.2 has no built-in inotify support. Either tool adds an external dependency that violates the project's zero-dep constraint. Watching the file also can't distinguish between event types without re-parsing the file tail.

2. Unix Domain Socket

Approach: Replace FIFO with a UDS for bidirectional communication. Daemon listens; pipelines connect and send structured messages.

Pros: Bidirectional (daemon could send commands back to pipelines). Handles multiple concurrent writers without blocking. More robust connection lifecycle.

Cons: Bash 3.2 cannot natively open Unix domain sockets; it requires socat or nc -U, adding dependencies. Connection management (accept, read, close per client) is complex in bash. Over-engineered for one-way event delivery. The FIFO handles concurrent writers natively: the kernel performs each write of at most PIPE_BUF bytes atomically (PIPE_BUF is 4096 on Linux and 512 on macOS), and our JSON events are typically a few hundred bytes.

3. Polling with Reduced Interval

Approach: Simply decrease POLL_INTERVAL from 60s to 5s.

Pros: Zero code changes. Immediately reduces reaping latency to ~5s max.

Cons: 12x more GitHub API calls per hour (may hit rate limits). 12x more jq state file parses. Doesn't fix the wait exit-code race condition (127 fallback). CPU and I/O overhead scales linearly with poll frequency. Doesn't address the fundamental coupling between poll rate and task scheduling.

4. Self-pipe Trick (Signal-only, no FIFO)

Approach: Use SIGCHLD trap with a self-pipe (write to pipe in signal handler, read from pipe in main loop) — no FIFO, rely purely on signals.

Pros: Even simpler than FIFO. Standard Unix pattern. No filesystem artifact.

Cons: Only provides process-exit notifications, not real-time stage events. The daemon would still need to poll or use another mechanism to learn about pipeline.completed, pipeline.quality_gate_failed, etc. SIGCHLD alone can coalesce — if two pipelines exit in the same instant, only one SIGCHLD is delivered. The FIFO complements SIGCHLD by carrying rich event data that signals cannot.

Implementation Plan

Files to Create

File                      Purpose
────────────────────────  ───────────────────────────────────────────────────────────
scripts/lib/event-bus.sh  FIFO lifecycle (event_bus_init, event_bus_write, event_bus_read, event_bus_destroy), degradation detection, fd management

Files to Modify

File                       Changes
─────────────────────────  ───────────────────────────────────────────────────────────
scripts/lib/helpers.sh     Add seq counter and FIFO dual-write to emit_event(); add correlation_id from env var
scripts/sw-daemon.sh       Add SIGCHLD trap; add daemon_event_loop() and daemon_dispatch_event(); add daemon_reap_completed_sigchld() (wait-based); refactor periodic tasks to wall-clock; set SHIPWRIGHT_CORRELATION_ID in daemon_spawn_pipeline(); wire event_bus_init/destroy into startup/cleanup
scripts/sw-pipeline.sh     Read SHIPWRIGHT_CORRELATION_ID from env; pass through to emit_event() calls
scripts/sw-daemon-test.sh  8 new test cases (see validation criteria)

Dependencies

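None. mkfifo is a standard POSIX utility; file-descriptor redirection, trap on SIGCHLD, and read -t are available in Bash 3.2 (read -t is a Bash extension rather than POSIX, and accepts only whole seconds before Bash 4.0).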

Risk Areas

  1. SIGCHLD coalescing: If multiple pipelines exit simultaneously, Linux/macOS may deliver only one SIGCHLD. Mitigation: The handler checks every tracked PID and reaps each one that has exited, not just one. The 5-minute safety sweep catches any stragglers.

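  2. FIFO blocking on open: A writer opening a FIFO with no reader blocks indefinitely by default. Mitigation: The daemon holds fd 3 open on the FIFO for its whole lifetime, so a writer's open() finds a reader and returns immediately. With a true O_NONBLOCK open, a writer with no reader would get ENXIO and could skip the write, but Bash redirection cannot request that flag. Either way, the event is still in JSONL.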

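  3. FIFO blocking in Bash: Bash's built-in redirection cannot request O_NONBLOCK on open. Mitigation: The daemon opens the FIFO on fd 3 at startup and keeps it open, so writers can open and write without blocking. Opening it read-write (exec 3<>"$FIFO_PATH") is the usual way to keep the daemon's own open from blocking until the first writer appears. Writers use echo "$line" > "$FIFO_PATH", which opens, writes, and closes. Two caveats follow: a redirect to a missing path creates a regular file instead of failing, so writers should test [[ -p "$FIFO_PATH" ]] first; and a FIFO left behind by a dead daemon has no reader, so a writer's open would block until one appears. Errors from the write itself are suppressed with 2>/dev/null || true.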

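  4. Interleaved writes to the FIFO: POSIX guarantees that a single write of at most PIPE_BUF bytes is atomic; larger writes from concurrent writers may interleave. PIPE_BUF is 4096 bytes on Linux but only 512 bytes on macOS (the POSIX minimum). Mitigation: Our JSON event lines are typically 200–500 bytes, comfortably under the Linux limit but close to the macOS one, so event lines should be kept under 512 bytes for the atomic-write guarantee to hold on both platforms.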

  5. State file contention during rapid events: Instant reaping could cause more frequent locked_state_update() calls. Mitigation: The flock mechanism already handles contention. Event dispatch can batch state updates if needed (process all queued events before writing state).

  6. daemon_poll_loop() must not be deleted: It serves as the fallback. Mitigation: The old function stays intact. daemon_event_loop() is an alternative entry point selected at startup based on FIFO availability.

  7. SIGHUP interaction: The daemon already traps SIGHUP (line 5057: trap '' SIGHUP). Adding SIGCHLD doesn't conflict — Bash supports multiple simultaneous traps on different signals.
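
For the PIPE_BUF risk in item 4, a writer could check the size of an event line before pushing it to the FIFO. The sketch below is one hedged way to do that. PIPE_BUF_LIMIT and event_fits_pipe_buf are hypothetical names, and the fallback of 512 is the POSIX minimum rather than a measured value.

```shell
#!/usr/bin/env bash
# Ask the system for the atomic-write limit on the root filesystem;
# fall back to the POSIX minimum if getconf gives no usable number.
PIPE_BUF_LIMIT=$(getconf PIPE_BUF / 2>/dev/null) || PIPE_BUF_LIMIT=""
case "$PIPE_BUF_LIMIT" in
    ''|*[!0-9]*) PIPE_BUF_LIMIT=512 ;;
esac

# Succeeds if the line plus its trailing newline fits in one atomic write.
event_fits_pipe_buf() {
    local bytes
    bytes=$(printf '%s' "$1" | wc -c)   # wc -c counts bytes, not characters
    [ $((bytes + 1)) -le "$PIPE_BUF_LIMIT" ]
}
```

An oversized event would then skip the FIFO and rely on events.jsonl alone, which matches the degradation behavior already described: the durable log stays complete even when real-time delivery is not possible.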

Validation Criteria

  • Reaping latency < 2 seconds: When a pipeline exits, the daemon calls daemon_on_success/daemon_on_failure within 2 seconds (measured via daemon.reap event timestamp minus pipeline exit timestamp)
  • Exit code reliability: daemon_reap_completed_sigchld() never falls back to log-file grepping — wait $pid always returns the true exit code (verified by test that spawns a child exiting with code 42)
  • FIFO dual-write: Events appear in both events.jsonl and are readable from events.fifo (test writes an event, reads it back from FIFO, confirms JSON matches)
  • Graceful degradation: When events.fifo is deleted mid-run, the daemon switches to daemon_poll_loop() within one loop iteration and logs a warning (test removes FIFO, verifies fallback activates)
  • SIGCHLD coalescing: When 3 pipelines exit simultaneously, all 3 are reaped within one loop iteration (test forks 3 children, kills them together, asserts 3 reap events emitted)
  • Idempotency: Duplicate events (same seq+correlation_id) are processed exactly once (test writes same event twice to FIFO, verifies single state mutation)
  • Periodic tasks fire on wall-clock: Config reload fires every 180 seconds regardless of event volume (test with 0 events over 200s verifies reload happened; test with 1000 events in 100s verifies reload did not fire early)
  • Backward compatibility: events.jsonl entries include seq and correlation_id fields but remain valid JSON parseable by existing consumers (run sw-memory.sh and sw-cost.sh against new-format events)
  • All 22 existing test suites pass: npm test green — no regressions from emit_event() changes
  • No Bash 3.2 incompatibilities: shellcheck clean; no associative arrays, readarray, ${var,,}, or wait -n
