Pipeline Design 51
Design: Event-driven architecture — replace polling with durable event bus, SIGCHLD traps, and real-time notifications
Shipwright's daemon (scripts/sw-daemon.sh) orchestrates pipeline processes using a poll-based architecture. The main loop (daemon_poll_loop(), line 4856) sleeps for 30–120 seconds per cycle (adaptive), then sequentially calls daemon_poll_issues(), daemon_reap_completed(), and daemon_health_check().
The core problem: Process completion detection is slow and fragile.
- Reaping latency: `daemon_reap_completed()` (line 1823) iterates all `active_jobs` in `daemon-state.json`, probing each PID with `kill -0 $pid` (line 1847). A pipeline exiting at T=5 into a 60-second cycle won't be detected until T=60, leaving up to 55 seconds of dead time before the queue drains or GitHub gets notified.
- Exit code race condition: After `kill -0` returns non-zero, `wait $pid` (line 1855) may return 127 if init already reaped the zombie. The daemon falls back to grepping log files for "Pipeline completed successfully" (line 1860), a fragile heuristic that can misclassify outcomes.
- Wasted API calls: `daemon_poll_issues()` (line 3827) hits the GitHub API every cycle regardless of whether new issues exist. Periodic tasks (config reload, auto-scale, self-optimize, stale cleanup) fire on modulo counters tied to poll cycles, creating coupling between timer frequency and task scheduling.
Constraints:
- All scripts must be Bash 3.2 compatible (no associative arrays, no `readarray`)
- No external dependencies beyond standard POSIX utilities + `jq` + `gh`
- The `events.jsonl` append-only log must continue working unchanged for metrics, memory system, and DORA calculations
- Graceful degradation to current polling if the event bus is unavailable
- Pipelines run as background subprocesses spawned via `(trap '' HUP; cd $dir; exec sw-pipeline.sh ...) &` (line 1763)
Introduce three cooperating mechanisms that replace the sleep-poll-probe cycle:
A new library creates and manages a named pipe at ~/.shipwright/events.fifo. The FIFO provides a local, zero-dependency inter-process communication (IPC) channel between pipeline subprocesses and the daemon.
Lifecycle:
- `event_bus_init()`: creates the FIFO with `mkfifo` if absent, opens it for non-blocking read (fd 3)
- `event_bus_write()`: writes a single JSON line to the FIFO (with O_NONBLOCK so writers never block if no reader)
- `event_bus_read()`: `read -t $timeout <&3` to consume one event from the FIFO
- `event_bus_destroy()`: closes fd 3, removes the FIFO file
Writer safety: Writers open-close the FIFO per event (O_WRONLY|O_NONBLOCK). If the FIFO doesn't exist or has no reader, the write silently fails — events still land in events.jsonl via the existing path.
Why FIFO over alternatives: Named pipes are POSIX standard, work on macOS Bash 3.2, require zero dependencies, and provide natural backpressure. They're simpler than Unix domain sockets and more portable than inotifywait.
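The lifecycle above could be sketched roughly as follows. This is a minimal illustration, not the actual `scripts/lib/event-bus.sh`: the `EVENT_FIFO` variable name is an assumption, and the sketch opens the FIFO read-write (`3<>`) so that the daemon's `open()` does not block waiting for a writer. That read-write trick is left undefined by POSIX but works on Linux and macOS; it stands in for the "non-blocking read" described above.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the event bus lifecycle (Bash 3.2 compatible).
# EVENT_FIFO is an assumed variable name, not taken from the codebase.
EVENT_FIFO="${EVENT_FIFO:-$HOME/.shipwright/events.fifo}"
EVENT_BUS_DEGRADED=""

event_bus_init() {
    mkdir -p "$(dirname "$EVENT_FIFO")" || { EVENT_BUS_DEGRADED=1; return 1; }
    if [ ! -p "$EVENT_FIFO" ]; then
        rm -f "$EVENT_FIFO"   # clear any stale regular file at that path
        mkfifo "$EVENT_FIFO" 2>/dev/null || { EVENT_BUS_DEGRADED=1; return 1; }
    fi
    # Read-write open: does not block, and keeps a reader attached for writers.
    exec 3<>"$EVENT_FIFO" || { EVENT_BUS_DEGRADED=1; return 1; }
}

event_bus_write() {
    [ -z "$EVENT_BUS_DEGRADED" ] || return 1
    # Only write to a real FIFO; a plain ">" would create a regular file otherwise.
    [ -p "$EVENT_FIFO" ] || return 1
    printf '%s\n' "$1" > "$EVENT_FIFO"
}

event_bus_read() {
    # Prints one event line; returns non-zero on timeout or error.
    local line=""
    IFS= read -r -t "${1:-1}" line <&3 || return 1
    printf '%s\n' "$line"
}

event_bus_destroy() {
    exec 3<&-
    rm -f "$EVENT_FIFO"
}
```

A caller would run `event_bus_init` once at daemon startup, then loop on `event_bus_read 1`. Note that `read -t` accepts only whole seconds in Bash 3.2.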
The existing emit_event() function (line 56) is extended to dual-write:
    emit_event() {
        # ... build JSON line (unchanged) ...
        # 1. Durable write (unchanged): append to events.jsonl
        echo "$json_line" >> "$EVENTS_FILE"
        # 2. Real-time write (new): push to FIFO if available
        event_bus_write "$json_line" 2>/dev/null || true
    }
Every existing call site (daemon.spawn, daemon.reap, pipeline.completed, pipeline.started, etc.) automatically gains real-time delivery with zero code changes at call sites. The || true ensures FIFO failure never breaks event logging.
Events gain two new fields for idempotency:
- `seq`: monotonic counter (per-process, reset on restart)
- `correlation_id`: set by the daemon at spawn time, inherited by the pipeline child (via environment variable `SHIPWRIGHT_CORRELATION_ID`)
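As a rough illustration of how the two fields might be attached, the sketch below keeps a per-process counter and reads the correlation ID from the environment. The JSON is built with `printf` purely to keep the example self-contained; it assumes field values need no JSON escaping, and the real `emit_event()` keeps its existing JSON construction. `EVENT_SEQ` is a hypothetical variable name.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: adding seq and correlation_id to emitted events.
EVENTS_FILE="${EVENTS_FILE:-$HOME/.shipwright/events.jsonl}"
EVENT_SEQ=0   # per-process counter; resets whenever the process restarts

emit_event() {
    local event_type="$1"
    local json_line
    EVENT_SEQ=$((EVENT_SEQ + 1))
    # Assumes event_type and the correlation ID contain no characters
    # that would require JSON escaping.
    json_line=$(printf '{"type":"%s","seq":%d,"correlation_id":"%s"}' \
        "$event_type" "$EVENT_SEQ" "${SHIPWRIGHT_CORRELATION_ID:-}")
    # Durable write first, so a FIFO failure can never lose the event.
    echo "$json_line" >> "$EVENTS_FILE"
    # Real-time write only if an event bus writer has been loaded.
    if type event_bus_write >/dev/null 2>&1; then
        event_bus_write "$json_line" 2>/dev/null || true
    fi
}
```

Because the counter lives in a shell variable, calling `emit_event` inside a subshell or command substitution would not advance the parent's `seq`; that is one reason the design pairs `seq` with `correlation_id` rather than relying on `seq` alone.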
SIGCHLD handler (replaces kill -0 probing):
    SIGCHLD_FLAG=""
    trap 'SIGCHLD_FLAG=1' SIGCHLD

When a pipeline subprocess exits, the kernel delivers SIGCHLD to the daemon. The trap sets a flag variable. The event loop checks the flag, then calls wait -n (or iterates PIDs with wait $pid) to reap completed children and retrieve their exit codes directly, with no more kill -0 polling and no more 127/log-grepping fallback.
Compatibility note: Bash 3.2 does not have wait -n. The handler will iterate known PIDs from active_jobs and call wait $pid for each. Since SIGCHLD fires before init can reap the zombie (the daemon is the direct parent), wait will reliably return the true exit code.
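A possible shape for the Bash 3.2 reaping pass is sketched below. `ACTIVE_PIDS` and `REAPED` are hypothetical stand-ins for the `active_jobs` state. Because `wait $pid` blocks on a child that is still running and `wait -n` is unavailable, the sketch guards each `wait` with a `kill -0` liveness check; that guard is an assumption of this sketch rather than something the design specifies.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: flag-driven reaping without wait -n (Bash 3.2).
SIGCHLD_FLAG=""
trap 'SIGCHLD_FLAG=1' SIGCHLD

ACTIVE_PIDS=""   # space-separated PIDs; stand-in for active_jobs
REAPED=""        # space-separated "pid:exit_code" entries

track_pid() {
    ACTIVE_PIDS="$ACTIVE_PIDS $1"
}

reap_exited_children() {
    local pid rc still_running=""
    for pid in $ACTIVE_PIDS; do
        if kill -0 "$pid" 2>/dev/null; then
            # Still alive: keep tracking it, do not block on wait.
            still_running="$still_running $pid"
        else
            # Already exited: wait returns the child's recorded exit status.
            rc=0
            wait "$pid" || rc=$?
            REAPED="$REAPED $pid:$rc"
        fi
    done
    ACTIVE_PIDS="$still_running"
}
```

The event loop would call `reap_exited_children` whenever `SIGCHLD_FLAG` is set. Iterating every tracked PID on each pass also covers the case where several exits are coalesced into a single signal.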
Event-driven main loop (daemon_event_loop(), replaces daemon_poll_loop()):
    daemon_event_loop() {
        event_bus_init
        while [[ ! -f "$SHUTDOWN_FLAG" ]]; do
            # 1. Check SIGCHLD flag and reap immediately
            if [[ -n "$SIGCHLD_FLAG" ]]; then
                SIGCHLD_FLAG=""
                daemon_reap_completed_sigchld  # new: uses wait, not kill -0
            fi
            # 2. Read from FIFO with 1-second timeout
            local event_line=""
            if event_line=$(event_bus_read 1); then
                daemon_dispatch_event "$event_line"
            fi
            # 3. Time-based periodic tasks (unchanged logic, now wall-clock based)
            daemon_run_periodic_tasks
            # 4. GitHub issue polling on its own timer (not every loop iteration)
            daemon_maybe_poll_issues
        done
        event_bus_destroy
    }

The loop blocks on read -t 1 instead of sleep 1. This means:
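- Pipeline events (stage transitions, completions) wake the daemon instantly
- A SIGCHLD that arrives during the `read` is acted on no later than the 1-second timeout; whether the trap interrupts `read` early depends on the Bash version and mode, so reaping happens within about a second at worst
- Periodic tasks still fire on wall-clock intervals (decoupled from poll frequency)
- If the FIFO is broken, `event_bus_read` returns non-zero every 1 second, the same cadence as the old 1-second sleep increments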
Periodic task scheduling changes from modulo-on-cycle-counter to wall-clock timestamps:
    daemon_run_periodic_tasks() {
        # Declare and assign separately so a now_epoch failure is not masked (shellcheck SC2155)
        local now
        now=$(now_epoch)
        # if-blocks keep the function's exit status at 0 when a task is not yet due
        if (( now - LAST_CONFIG_RELOAD >= 180 )); then daemon_reload_config; LAST_CONFIG_RELOAD=$now; fi
        if (( now - LAST_DEGRADATION_CHECK >= 300 )); then daemon_check_degradation; LAST_DEGRADATION_CHECK=$now; fi
        if (( now - LAST_AUTO_SCALE >= ${AUTO_SCALE_INTERVAL_SECS:-300} )); then daemon_auto_scale; LAST_AUTO_SCALE=$now; fi
        # ... etc
    }

This decouples task frequency from poll interval entirely.
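One way to factor the wall-clock check so it can be tested with a fake clock is sketched here. `run_if_due` and the `LAST_RUN_*` variables are hypothetical names. Dynamic variable names are handled with `eval` because Bash 3.2 has no associative arrays, so the task name must be a plain identifier.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: generic wall-clock scheduling helper (Bash 3.2).

# Overridable clock; tests can redefine this function to control time.
now_epoch() {
    date +%s
}

# Usage: run_if_due TASK_NAME INTERVAL_SECS command [args...]
# TASK_NAME must be a valid identifier suffix (letters, digits, underscore).
run_if_due() {
    local name="$1" interval="$2"
    shift 2
    local now last
    now=$(now_epoch)
    eval "last=\${LAST_RUN_$name:-0}"
    if [ $((now - last)) -ge "$interval" ]; then
        "$@"
        eval "LAST_RUN_$name=$now"
    fi
}
```

With such a helper, a task like the config reload could be written as `run_if_due CONFIG_RELOAD 180 daemon_reload_config`, and a test can advance time by redefining `now_epoch` instead of sleeping for minutes.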
Safety net: A 5-minute kill -0 sweep runs as a periodic task to catch any SIGCHLDs that were missed (e.g., signal delivered during a non-interruptible section). This sweep is the existing daemon_reap_completed() logic, kept as a fallback.
daemon_spawn_pipeline() sets SHIPWRIGHT_CORRELATION_ID as an environment variable before exec. The pipeline inherits it and includes it in all emit_event() calls. This enables:
- Tracing all events from a single pipeline run
- Deduplication by the daemon's event dispatcher
- Future: filtering the FIFO by correlation ID for multi-pipeline scenarios
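The spawn-side propagation might look like the following sketch. The function name, the `LAST_SPAWN_*` variables, and the ID format are all assumptions made for illustration; only the `SHIPWRIGHT_CORRELATION_ID` variable and the `(trap '' HUP; cd ...; exec ...) &` spawn shape come from the design.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: passing a correlation ID to a spawned pipeline.

# Usage: spawn_with_correlation_id WORKDIR command [args...]
spawn_with_correlation_id() {
    local dir="$1"
    shift
    # Assumed ID format: epoch seconds, daemon PID, and a random suffix.
    local cid
    cid="$(date +%s)-$$-$RANDOM"
    (
        trap '' HUP
        cd "$dir" || exit 1
        # Exported inside the subshell so only this child inherits it.
        export SHIPWRIGHT_CORRELATION_ID="$cid"
        exec "$@"
    ) &
    LAST_SPAWN_PID=$!
    LAST_SPAWN_CID="$cid"
}
```

The daemon can then record `LAST_SPAWN_PID` and `LAST_SPAWN_CID` together in its job state, which lets it match later events and the eventual exit code to the same run.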
    Pipeline Process                         Daemon Process
    ─────────────────                        ──────────────
    emit_event("pipeline.completed")
      ├─ append to events.jsonl   ───→  (durable, read by metrics later)
      └─ write to events.fifo     ───→  event_bus_read() wakes daemon
                                          └─ daemon_dispatch_event()
                                               └─ triggers immediate reap/notify
    Process exits
      └─ SIGCHLD delivered        ───→  trap sets SIGCHLD_FLAG
                                          └─ daemon_reap_completed_sigchld()
                                               └─ wait $pid → exit code (reliable)
                                               └─ daemon_on_success / daemon_on_failure
                                               └─ dequeue next issue
| Event Type | Daemon Action |
|---|---|
| `pipeline.completed` | Skip next `kill -0` sweep for this PID (already know outcome) |
| `pipeline.started` | Log, update dashboard |
| `pipeline.context_exhaustion` | Tag for retry escalation |
| `pipeline.quality_gate_failed` | Optionally alert early |
| `daemon.*` | Self-events; ignore on read-back |
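The dispatch table, combined with the `seq` + `correlation_id` deduplication, could be sketched as below. The function takes already-extracted fields; in the daemon these would presumably be pulled from the JSON line with `jq`. The `SEEN_EVENTS` string and the `HANDLED` log are hypothetical and exist only to make the behaviour observable. A delimited string is used instead of an associative array to stay Bash 3.2 compatible.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: idempotent event dispatch (Bash 3.2).
SEEN_EVENTS=" "   # space-delimited "correlation_id:seq" keys already handled
HANDLED=""        # illustrative record of the actions taken

# Returns 0 if the key was seen before; otherwise records it and returns 1.
event_already_seen() {
    local key="$1:$2"
    case "$SEEN_EVENTS" in
        *" $key "*) return 0 ;;
    esac
    SEEN_EVENTS="$SEEN_EVENTS$key "
    return 1
}

# Usage: dispatch_event TYPE CORRELATION_ID SEQ
dispatch_event() {
    local type="$1" cid="$2" seq="$3"
    # Self-events written by the daemon are ignored on read-back.
    case "$type" in
        daemon.*) return 0 ;;
    esac
    if event_already_seen "$cid" "$seq"; then
        return 0
    fi
    case "$type" in
        pipeline.completed)           HANDLED="$HANDLED skip_sweep:$cid" ;;
        pipeline.started)             HANDLED="$HANDLED dashboard:$cid" ;;
        pipeline.context_exhaustion)  HANDLED="$HANDLED retry_escalation:$cid" ;;
        pipeline.quality_gate_failed) HANDLED="$HANDLED early_alert:$cid" ;;
        *) ;;   # unknown event types are ignored
    esac
}
```

In a long-running daemon the seen-keys string would grow without bound, so a real implementation would likely prune entries once a pipeline's run has been reaped.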
If mkfifo fails or the FIFO becomes unreadable:
- `event_bus_init()` logs a warning and sets `EVENT_BUS_DEGRADED=1`
- `daemon_event_loop()` detects degradation and falls back to `daemon_poll_loop()` (the existing code, completely unchanged)
- `emit_event()` skips the FIFO write when degraded; `events.jsonl` still captures everything
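The fallback decision can be illustrated with a small selector. `event_bus_healthy`, `select_daemon_loop`, and `DAEMON_LOOP` are hypothetical names; the sketch only shows the choice between the two loops, not the loops themselves.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: choosing between the event loop and the poll loop.
EVENT_FIFO="${EVENT_FIFO:-$HOME/.shipwright/events.fifo}"
EVENT_BUS_DEGRADED=""
DAEMON_LOOP=""

# Healthy means: not already marked degraded, and the path is still a FIFO.
event_bus_healthy() {
    [ -z "$EVENT_BUS_DEGRADED" ] && [ -p "$EVENT_FIFO" ]
}

select_daemon_loop() {
    if event_bus_healthy; then
        DAEMON_LOOP="daemon_event_loop"
    else
        EVENT_BUS_DEGRADED=1
        DAEMON_LOOP="daemon_poll_loop"
        echo "WARN: event bus unavailable, falling back to polling" >&2
    fi
}
```

Running the same check at the top of each event-loop iteration would let the daemon notice a FIFO deleted mid-run and hand control to the polling loop.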
Approach: Use inotify to watch the events file for writes, waking the daemon on each append.
Pros: No new IPC primitive; events.jsonl is already the single write target; filesystem-level notification.
Cons: inotifywait is a Linux-only tool (not available on macOS without fswatch, a Homebrew dependency). Bash 3.2 has no built-in inotify support. Adds an external dependency that violates the project's zero-dep constraint. Also can't distinguish between event types without re-parsing the file tail.
Approach: Replace FIFO with a UDS for bidirectional communication. Daemon listens; pipelines connect and send structured messages.
Pros: Bidirectional (daemon could send commands back to pipelines). Handles multiple concurrent writers without blocking. More robust connection lifecycle.
Cons: Bash 3.2 cannot natively open Unix domain sockets; it requires socat or nc -U, adding dependencies. Connection management (accept, read, close per client) is complex in bash. Over-engineered for one-way event delivery. The FIFO handles concurrent writers natively: the kernel writes each message of at most PIPE_BUF bytes atomically. PIPE_BUF is 4096 bytes on Linux and 512 bytes on macOS (the POSIX minimum), and our JSON events, typically 200–500 bytes, fit within both limits, though with little headroom on macOS.
Approach: Simply decrease POLL_INTERVAL from 60s to 5s.
Pros: Zero code changes. Immediately reduces reaping latency to ~5s max.
Cons: 12x more GitHub API calls per hour (may hit rate limits). 12x more jq state file parses. Doesn't fix the wait exit-code race condition (127 fallback). CPU and I/O overhead scales linearly with poll frequency. Doesn't address the fundamental coupling between poll rate and task scheduling.
Approach: Use SIGCHLD trap with a self-pipe (write to pipe in signal handler, read from pipe in main loop) — no FIFO, rely purely on signals.
Pros: Even simpler than FIFO. Standard Unix pattern. No filesystem artifact.
Cons: Only provides process-exit notifications, not real-time stage events. The daemon would still need to poll or use another mechanism to learn about pipeline.completed, pipeline.quality_gate_failed, etc. SIGCHLD alone can coalesce — if two pipelines exit in the same instant, only one SIGCHLD is delivered. The FIFO complements SIGCHLD by carrying rich event data that signals cannot.
| File | Purpose |
|---|---|
| `scripts/lib/event-bus.sh` | FIFO lifecycle (`event_bus_init`, `event_bus_write`, `event_bus_read`, `event_bus_destroy`), degradation detection, fd management |

| File | Changes |
|---|---|
| `scripts/lib/helpers.sh` | Add `seq` counter and FIFO dual-write to `emit_event()`; add `correlation_id` from env var |
| `scripts/sw-daemon.sh` | Add SIGCHLD trap; add `daemon_event_loop()` and `daemon_dispatch_event()`; add `daemon_reap_completed_sigchld()` (wait-based); refactor periodic tasks to wall-clock; set `SHIPWRIGHT_CORRELATION_ID` in `daemon_spawn_pipeline()`; wire `event_bus_init`/`destroy` into startup/cleanup |
| `scripts/sw-pipeline.sh` | Read `SHIPWRIGHT_CORRELATION_ID` from env; pass through to `emit_event()` calls |
| `scripts/sw-daemon-test.sh` | 8 new test cases (see validation criteria) |
None. mkfifo is a standard POSIX utility, and file descriptor redirection, read -t (a Bash extension rather than POSIX), and trap SIGCHLD are all available in Bash 3.2.
- SIGCHLD coalescing: If multiple pipelines exit simultaneously, Linux/macOS may deliver only one SIGCHLD. Mitigation: The handler iterates all tracked PIDs calling `wait` on each, not just one. The 5-minute safety sweep catches any stragglers.
- FIFO blocking on open: A writer opening a FIFO with no reader blocks indefinitely by default. Mitigation: Writers use `O_NONBLOCK`; if no reader exists, `open()` returns ENXIO and the write is skipped (event still in JSONL).
- FIFO blocking in Bash: Bash's built-in redirection doesn't support `O_NONBLOCK` on open. Mitigation: The daemon opens the FIFO for reading at startup (fd 3) and keeps it open. Writers can then open and write without blocking. For the writer side, use `echo "$line" > "$FIFO_PATH"`, which opens, writes, and closes; this works because the reader fd is already open. If the FIFO path doesn't exist, a plain `>` redirect would create a regular file rather than fail, so the writer should first check `[ -p "$FIFO_PATH" ]` and skip the write; any remaining error is suppressed with `2>/dev/null || true`.
- Partial reads from FIFO: If two writers write concurrently and either write exceeds PIPE_BUF (4096 bytes on Linux, 512 bytes on macOS), their output may interleave. Mitigation: Our JSON event lines are typically 200–500 bytes, within PIPE_BUF on both platforms, although the macOS limit leaves little headroom. The atomic write guarantee holds as long as each line stays within the limit.
- State file contention during rapid events: Instant reaping could cause more frequent `locked_state_update()` calls. Mitigation: The flock mechanism already handles contention. Event dispatch can batch state updates if needed (process all queued events before writing state).
- `daemon_poll_loop()` must not be deleted: It serves as the fallback. Mitigation: The old function stays intact. `daemon_event_loop()` is an alternative entry point selected at startup based on FIFO availability.
- SIGHUP interaction: The daemon already traps `SIGHUP` (line 5057: `trap '' SIGHUP`). Adding `SIGCHLD` doesn't conflict, because Bash supports multiple simultaneous traps on different signals.
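Since the atomicity guarantee depends on the platform's PIPE_BUF, a writer could check an event's size before pushing it to the FIFO. The sketch below is one way to do that; the function names are hypothetical. It queries the limit with `getconf PIPE_BUF /` and falls back to 512, the POSIX minimum, if the query fails.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: guard FIFO writes against the PIPE_BUF atomicity limit.

# Prints the platform's PIPE_BUF, or the POSIX minimum (512) if unknown.
pipe_buf_limit() {
    local limit
    limit=$(getconf PIPE_BUF / 2>/dev/null) || limit=""
    case "$limit" in
        ''|*[!0-9]*) limit=512 ;;
    esac
    echo "$limit"
}

# Returns 0 if the line plus its trailing newline fits in one atomic write.
event_fits_atomic_write() {
    local line="$1" bytes
    bytes=$(printf '%s\n' "$line" | wc -c | tr -d ' ')
    [ "$bytes" -le "$(pipe_buf_limit)" ]
}
```

An oversized event could then skip the FIFO and rely on `events.jsonl` alone, which keeps concurrent writers from interleaving partial lines.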
- Reaping latency < 2 seconds: When a pipeline exits, the daemon calls `daemon_on_success`/`daemon_on_failure` within 2 seconds (measured via `daemon.reap` event timestamp minus pipeline exit timestamp)
- Exit code reliability: `daemon_reap_completed_sigchld()` never falls back to log-file grepping; `wait $pid` always returns the true exit code (verified by test that spawns a child exiting with code 42)
- FIFO dual-write: Events appear in `events.jsonl` and are also readable from `events.fifo` (test writes an event, reads it back from FIFO, confirms JSON matches)
- Graceful degradation: When `events.fifo` is deleted mid-run, the daemon switches to `daemon_poll_loop()` within one loop iteration and logs a warning (test removes FIFO, verifies fallback activates)
- SIGCHLD coalescing: When 3 pipelines exit simultaneously, all 3 are reaped within one loop iteration (test forks 3 children, kills them together, asserts 3 reap events emitted)
- Idempotency: Duplicate events (same `seq` + `correlation_id`) are processed exactly once (test writes same event twice to FIFO, verifies single state mutation)
- Periodic tasks fire on wall-clock: Config reload fires every 180 seconds regardless of event volume (test with 0 events over 200s verifies reload happened; test with 1000 events in 100s verifies reload did not fire early)
- Backward compatibility: `events.jsonl` entries include `seq` and `correlation_id` fields but remain valid JSON parseable by existing consumers (run `sw-memory.sh` and `sw-cost.sh` against new-format events)
- All 22 existing test suites pass: `npm test` green, with no regressions from `emit_event()` changes
- No Bash 3.2 incompatibilities: `shellcheck` clean; no associative arrays, `readarray`, `${var,,}`, or `wait -n`