Pipeline Plan 246
What is the minimum viable change?
A new library (scripts/lib/build-loop-failure.sh) that classifies build loop failures into 5 modes and returns mode-specific recovery strategies. The daemon's retry logic and the loop's restart logic both call into this library to select the right recovery approach instead of using generic retry.
Implicit requirements:
- Must integrate with the existing `daemon-failure.sh` classification (enhance, not replace)
- Must work within the existing `error-summary.json` and `progress.md` data structures
- Must be Bash 3.2 compatible (no associative arrays, no `readarray`, no `${var,,}`)
- Must follow the existing test harness patterns (`test-helpers.sh`, `assert_eq`, etc.)
- Recovery strategies must be actionable CLI args that the daemon can pass to the next pipeline/loop invocation
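The Bash 3.2 constraint rules out a few common idioms. The sketch below shows one portable substitute for each of the three features named above. The helper names `to_lower` and `action_for_mode` are hypothetical and exist only for illustration; they are not part of the plan.

```shell
#!/usr/bin/env bash
# Sketch only: to_lower and action_for_mode are hypothetical helper names.

# Lowercase without ${var,,} (a Bash 4+ expansion).
to_lower() {
  printf '%s' "$1" | tr '[:upper:]' '[:lower:]'
}

# Key-to-value lookup without associative arrays (declare -A is Bash 4+).
action_for_mode() {
  case "$1" in
    test_flakiness)   echo "rerun_tests" ;;
    dependency_issue) echo "reinstall_deps" ;;
    *)                echo "standard_retry" ;;
  esac
}

# Read lines into an indexed array without readarray/mapfile (Bash 4+).
lines=()
while IFS= read -r line; do
  lines+=("$line")
done <<EOF
first error
second error
EOF

to_lower "Module Not Found"; echo
action_for_mode "test_flakiness"
echo "${#lines[@]} lines read"
```

A `case` statement scales poorly past a few dozen keys, but for five failure modes it stays readable and needs no external process.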
Acceptance criteria (from issue + derived):
- `classify_build_loop_failure()` analyzes `error-summary.json` + `progress.md` and returns one of: `context_exhaustion`, `infinite_loop`, `test_flakiness`, `dependency_issue`, `code_error`
- `get_recovery_strategy()` returns mode-specific recovery parameters (JSON)
- Failure mode + strategy logged to `events.jsonl` via `emit_event`
- Daemon retry logic uses classification to select recovery strategy
- `--failure-mode` flag on `sw-loop.sh` allows manual override for testing
- Test suite verifies each mode triggers correct strategy
Approach A: Extend daemon-failure.sh inline
- Pros: Single file, no new imports needed
- Cons: `daemon-failure.sh` is daemon-scoped (reads daemon logs); build-loop classification reads different data (`error-summary.json`, `progress.md`, iteration history). Mixing concerns makes both harder to test.
- Blast radius: Medium. Changes to `daemon-failure.sh` affect all daemon retry paths.
Approach B: New library lib/build-loop-failure.sh + integration points (CHOSEN)
- Pros: Clean separation of concerns. Loop-level classification can be tested independently. Daemon calls into it as an enhancement layer. Can be sourced by both `sw-loop.sh` and `sw-daemon.sh`.
- Cons: One more file to maintain. Need to coordinate with existing `classify_failure()`.
- Blast radius: Low. New file with integration hooks into existing code.
Approach C: Enhance existing root-cause.sh with loop-specific patterns
- Pros: Reuses confidence scoring, evidence collection
- Cons: `root-cause.sh` reads `error-log.jsonl` (pipeline-level), not `error-summary.json` (loop-level). Different data sources, different granularity. Would need significant refactoring.
- Blast radius: Medium. Changes to `root-cause.sh` affect pipeline analysis.
Why Approach B: The build loop failure classification operates on different data (per-iteration error-summary.json, progress.md history, git diff patterns) than daemon-level classification (log tail). Keeping them separate allows each to evolve independently while the daemon can call both: classify_failure() for broad category, then classify_build_loop_failure() for granular recovery strategy when the broad category is build_failure or context_exhaustion.
| Risk | Impact | Mitigation |
|---|---|---|
| New classification disagrees with existing `classify_failure()` | Daemon applies wrong retry strategy | New classification is an enhancement layer, consulted only after daemon-level classification. If unavailable, falls back to existing behavior |
| Pattern matching for test flakiness produces false positives | Wastes a retry on "rerun tests" when code is actually broken | Require 2+ instances of alternating pass/fail in iteration history before classifying as flaky. Conservative threshold. |
| `--failure-mode` flag bypasses real classification | Could mask bugs in classification logic | Flag is for testing only: documented as such, emits warning event |
| Dependency reinstall recovery corrupts `node_modules` mid-build | Build state becomes inconsistent | Recovery strategy uses `rm -rf node_modules && npm ci` (clean install), not an incremental fix |
| New library not sourced in all code paths | Silent fallback to generic retry | Guard all calls with `type classify_build_loop_failure >/dev/null 2>&1` checks |
This depends on:
- `scripts/lib/daemon-failure.sh`: integration point for retry strategy selection
- `scripts/sw-loop.sh`: integration point for `--failure-mode` flag and mid-loop classification
- `scripts/lib/session-restart.sh`: provides restart reason detection we build on
- `scripts/lib/test-helpers.sh`: test harness for the new test suite
- `error-summary.json` format: `{iteration, error_count, error_lines[], test_cmd}`
- `progress.md` format: Iteration, Tests passing, Status fields
What depends on what we're changing:
- `daemon-failure.sh` changes: `sw-daemon.sh` sources this, so it must remain backward-compatible
- `sw-loop.sh` changes: pipeline build stage calls loop, so the new flag must be optional
- No circular dependency risk: new library is leaf-level, sources nothing from daemon or loop
- Runtime: If `error-summary.json` doesn't exist or is malformed JSON, classification must return `code_error` (safe default), not crash. Use `jq` with fallback.
- Concurrency: Multiple daemon workers could read/write `error-summary.json` simultaneously in worktree mode, but each worktree has its own copy, so there is no race condition.
- Scale: Classification reads at most 2 small files (`error-summary.json` ~1KB, `progress.md` ~2KB), so there is no scale concern.
- Rollback: All changes are additive (new library, new flag, enhanced retry logic). Removing the library causes graceful fallback to existing behavior via `type` guards.
┌────────────────────────────────────────────────────────────┐
│  sw-daemon.sh                                              │
│    daemon_on_failure()                                     │
│      → classify_failure()  [existing, daemon-level]        │
│      → classify_build_loop_failure()  [NEW, granular]      │
│      → get_recovery_strategy()  [NEW]                      │
│      → apply recovery args to daemon_spawn_pipeline()      │
└────────────────────────────────────────────────────────────┘
            │ sources                        │ sources
            ▼                                ▼
┌───────────────────────┐    ┌───────────────────────────────┐
│ lib/daemon-failure.sh │    │ lib/build-loop-failure.sh     │
│ [existing]            │    │ [NEW]                         │
│                       │    │                               │
│ classify_failure()    │    │ classify_build_loop_failure() │
│ get_max_retries()     │    │ get_recovery_strategy()       │
│ daemon_on_failure()   │    │ detect_test_flakiness()       │
│                       │    │ detect_infinite_loop()        │
│                       │    │ detect_dependency_issue()     │
│                       │    │ _blf_read_error_summary()     │
│                       │    │ _blf_read_progress_history()  │
└───────────────────────┘    └───────────────────────────────┘
                                             ▲
                                             │ sources
                                ┌─────────────────────────┐
                                │ sw-loop.sh              │
                                │ [modified]              │
                                │                         │
                                │ --failure-mode flag     │
                                │ post-failure classify   │
                                │ write failure-mode.json │
                                └─────────────────────────┘
classify_build_loop_failure(artifacts_dir)
- Input: `artifacts_dir`, a path containing `error-summary.json` and `progress.md`
- Output (stdout): One of `context_exhaustion`, `infinite_loop`, `test_flakiness`, `dependency_issue`, `code_error`
- Side effect: Writes `failure-mode.json` to `artifacts_dir` with `{mode, confidence, evidence[], timestamp}`
- Error: Returns `code_error` on any parse failure (safe default)
get_recovery_strategy(failure_mode)
- Input: `failure_mode` string
- Output (stdout): JSON object:
{
  "mode": "test_flakiness",
  "action": "rerun_tests",
  "args": ["--max-iterations", "3"],
  "description": "Rerun tests without code changes",
  "max_retries_override": 3
}
detect_test_flakiness(artifacts_dir) — exit 0 = flaky, 1 = not flaky
detect_infinite_loop(artifacts_dir) — exit 0 = stuck, 1 = not stuck
detect_dependency_issue(error_lines) — exit 0 = dep issue, 1 = not
detect_context_exhaustion(artifacts_dir) — exit 0 = exhausted, 1 = not
New files:
- `scripts/lib/build-loop-failure.sh`: Core classification library (~300 lines)
- `scripts/sw-lib-build-loop-failure-test.sh`: Test suite (~400 lines)

Modified files:
- `scripts/sw-loop.sh`: Add `--failure-mode` flag, source library, call classification after test failures, write `failure-mode.json`
- `scripts/lib/daemon-failure.sh`: Source new library, enhance retry strategy selection with granular classification
- `package.json`: Register new test suite
Core library structure:
#!/usr/bin/env bash
# build-loop-failure.sh: Build loop failure mode classification and adaptive recovery
# Source guard: do nothing if the library is already loaded
[[ -n "${_BUILD_LOOP_FAILURE_LOADED:-}" ]] && return 0
_BUILD_LOOP_FAILURE_LOADED=1

Internal helpers:
- `_blf_read_error_summary(artifacts_dir)`: Read and parse `error-summary.json`. Returns error_lines as newline-separated text. Handle missing file and malformed JSON with empty-string fallback.
- `_blf_read_progress(artifacts_dir)`: Read `progress.md`, extract iteration count, test status, and status field. Return as `iter=N tests=true/false status=running/stuck/exhausted`.
- `_blf_read_iteration_history(artifacts_dir)`: Scan `restart-*/error-summary.json` archives plus current to build a history of per-iteration test results. Returns newline-separated `iteration:N|passed:true/false`.
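As an illustration of the empty-string fallback described for `_blf_read_error_summary`, here is a minimal sketch. The function body is an assumption about how the helper could be written, not the final implementation; it relies only on the `error_lines[]` field named in the `error-summary.json` format above.

```shell
#!/usr/bin/env bash
# Sketch of the reader described above; the body is an assumption, not the final code.
_blf_read_error_summary() {
  local artifacts_dir="$1"
  local file="$artifacts_dir/error-summary.json"

  # Missing file or missing jq: print nothing and succeed.
  [[ -f "$file" ]] || return 0
  command -v jq >/dev/null 2>&1 || return 0

  # jq exits non-zero on malformed JSON; swallow that and print nothing.
  jq -r '.error_lines[]?' "$file" 2>/dev/null || true
}
```

Because every failure path prints nothing and returns 0, a caller can treat empty output as "no signal" and fall through to the `code_error` default without any extra error handling.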
Detectors (priority order):
- `detect_dependency_issue(error_lines)`: Pattern match:
  - `npm ERR.*ERESOLVE|npm ERR.*peer dep|Module not found|Cannot find module`
  - `pip.*install.*fail|ModuleNotFoundError|ImportError`
  - `cargo.*fetch|cargo.*resolve|unresolved import`
  - `ENOENT.*node_modules|package.*not found|version.*conflict`
- `detect_test_flakiness(artifacts_dir)`: Check:
  - Iteration history shows alternating pass/fail (2+ alternations)
  - Error contains timing patterns: `timeout|EADDRINUSE|ECONNREFUSED|race|flaky|intermittent`
  - Different error lines across consecutive failures (error content changes but tests still fail)
- `detect_infinite_loop(artifacts_dir)`: Check:
  - Same error line appears in 3+ consecutive `error-summary.json` entries (current + archived)
  - Progress status is "stuck" or "diverging"
  - `CONSECUTIVE_FAILURES` >= circuit breaker threshold (from `progress.md`)
  - No new commits in last 3+ iterations
- `detect_context_exhaustion(artifacts_dir)`: Check:
  - Progress status is "exhausted"
  - Iteration count >= `max_iterations` (near limit)
  - Error lines contain `context|token.*limit|compact|truncat`
  - Multiple restart archives exist with declining progress
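Two of the checks above lend themselves to a short sketch: the regex match for dependency issues and the alternation count behind the flakiness check. The regex is joined from the four pattern groups listed for `detect_dependency_issue`. The function bodies and the helper name `_blf_count_alternations` are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch: the regex comes from the pattern list above; function bodies are assumptions.

_BLF_DEP_PATTERN='npm ERR.*ERESOLVE|npm ERR.*peer dep|Module not found|Cannot find module'
_BLF_DEP_PATTERN="$_BLF_DEP_PATTERN|pip.*install.*fail|ModuleNotFoundError|ImportError"
_BLF_DEP_PATTERN="$_BLF_DEP_PATTERN|cargo.*fetch|cargo.*resolve|unresolved import"
_BLF_DEP_PATTERN="$_BLF_DEP_PATTERN|ENOENT.*node_modules|package.*not found|version.*conflict"

# exit 0 = dependency issue, 1 = not
detect_dependency_issue() {
  local error_lines="$1"
  [[ -n "$error_lines" ]] || return 1
  printf '%s\n' "$error_lines" | grep -Eq "$_BLF_DEP_PATTERN"
}

# Hypothetical helper: count pass/fail flips in history lines of the form
# "iteration:N|passed:true/false", read from stdin.
_blf_count_alternations() {
  local prev="" cur="" flips=0 line
  while IFS= read -r line; do
    cur="${line##*passed:}"
    if [[ -n "$prev" && "$cur" != "$prev" ]]; then
      flips=$((flips + 1))
    fi
    prev="$cur"
  done
  echo "$flips"
}
```

A flakiness detector built on this could compare the count against the "2+ alternations" threshold, for example `[[ "$(_blf_read_iteration_history "$dir" | _blf_count_alternations)" -ge 2 ]]`.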
Orchestrator:
- `classify_build_loop_failure(artifacts_dir)`: Runs detectors in priority order (dependency → flakiness → infinite_loop → context_exhaustion → code_error). First match wins. Writes `failure-mode.json` atomically. Echoes mode to stdout.
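The first-match-wins ordering and the atomic write can be sketched as follows. To keep the example runnable on its own, the detectors and the reader are replaced by stand-ins that check marker files and a plain-text `errors.txt`; those stand-ins and file names are assumptions for illustration only. The JSON written here also omits the `confidence` and `evidence[]` fields.

```shell
#!/usr/bin/env bash
# Stand-ins (assumptions) so the ordering logic can be run in isolation.
_blf_read_error_summary()   { cat "$1/errors.txt" 2>/dev/null || true; }
detect_dependency_issue()   { [[ "$1" == *"Cannot find module"* ]]; }
detect_test_flakiness()     { [[ -f "$1/flaky.marker" ]]; }
detect_infinite_loop()      { [[ -f "$1/stuck.marker" ]]; }
detect_context_exhaustion() { [[ -f "$1/exhausted.marker" ]]; }

classify_build_loop_failure() {
  local artifacts_dir="$1" mode="code_error" error_lines tmp
  error_lines=$(_blf_read_error_summary "$artifacts_dir")

  # Priority order: the first detector that matches decides the mode.
  if   detect_dependency_issue   "$error_lines";   then mode="dependency_issue"
  elif detect_test_flakiness     "$artifacts_dir"; then mode="test_flakiness"
  elif detect_infinite_loop      "$artifacts_dir"; then mode="infinite_loop"
  elif detect_context_exhaustion "$artifacts_dir"; then mode="context_exhaustion"
  fi

  # Atomic write: create a temp file in the same directory, then rename it.
  if [[ -d "$artifacts_dir" ]]; then
    if tmp=$(mktemp "$artifacts_dir/failure-mode.XXXXXX"); then
      printf '{"mode":"%s","timestamp":"%s"}\n' \
        "$mode" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" > "$tmp"
      mv "$tmp" "$artifacts_dir/failure-mode.json"
    fi
  fi
  echo "$mode"
}
```

Writing the temp file into the target directory keeps the final `mv` a same-filesystem rename, so a concurrent reader sees either the old file or the new one, never a partial write.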
Recovery strategy:
- `get_recovery_strategy(failure_mode)`: Returns JSON per mode:

| Mode | Action | Extra Args | Description |
|---|---|---|---|
| `context_exhaustion` | `restart_compressed` | `--max-restarts +2` | Restart with compressed briefing, boost restarts |
| `infinite_loop` | `reduce_and_redirect` | `--max-iterations 10` | Reduce iterations, inject "try different approach" |
| `test_flakiness` | `rerun_tests` | `--max-iterations 3` | Rerun tests without code changes |
| `dependency_issue` | `reinstall_deps` | `--max-iterations 5` | Clean reinstall dependencies first |
| `code_error` | `standard_retry` | (none) | Standard retry with model upgrade (existing behavior) |
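The table maps directly onto a `case` statement. The sketch below is one possible shape for `get_recovery_strategy`, with values copied from the table. Two things are assumptions: unknown modes are normalized to `code_error`, and `max_retries_override` is left out because the table gives no per-mode value for it.

```shell
#!/usr/bin/env bash
# Sketch only: values mirror the table above; the unknown-mode handling is an assumption.
get_recovery_strategy() {
  local mode="$1" action="standard_retry" args="[]"
  local desc="Standard retry with model upgrade (existing behavior)"
  case "$mode" in
    context_exhaustion)
      action="restart_compressed"; args='["--max-restarts","+2"]'
      desc="Restart with compressed briefing, boost restarts" ;;
    infinite_loop)
      action="reduce_and_redirect"; args='["--max-iterations","10"]'
      desc="Reduce iterations, inject 'try different approach'" ;;
    test_flakiness)
      action="rerun_tests"; args='["--max-iterations","3"]'
      desc="Rerun tests without code changes" ;;
    dependency_issue)
      action="reinstall_deps"; args='["--max-iterations","5"]'
      desc="Clean reinstall dependencies first" ;;
    code_error) ;;
    *) mode="code_error" ;;  # assumption: unknown modes fall back to the default
  esac
  printf '{"mode":"%s","action":"%s","args":%s,"description":"%s"}\n' \
    "$mode" "$action" "$args" "$desc"
}
```

Normalizing unknown input before printing also keeps arbitrary caller-supplied text out of the JSON, so the output stays valid without an escaping step.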
In the CLI parsing section (before the -* catch-all at ~line 308):
--failure-mode)
    FAILURE_MODE_OVERRIDE="${2:-}"
    [[ -z "$FAILURE_MODE_OVERRIDE" ]] && { error "Missing value for --failure-mode"; exit 1; }
    shift 2
    ;;
--failure-mode=*) FAILURE_MODE_OVERRIDE="${1#--failure-mode=}"; shift ;;

Add validation after parsing (around line 332):

if [[ -n "${FAILURE_MODE_OVERRIDE:-}" ]]; then
    case "$FAILURE_MODE_OVERRIDE" in
        context_exhaustion|infinite_loop|test_flakiness|dependency_issue|code_error) ;;
        *) error "--failure-mode must be: context_exhaustion, infinite_loop, test_flakiness, dependency_issue, code_error"; exit 1 ;;
    esac
fi

Initialize the variable with other defaults (around line 90):

FAILURE_MODE_OVERRIDE=""

Source the library near the top (after other lib sources, ~line 60):

_BLF_LIB="$SCRIPT_DIR/lib/build-loop-failure.sh"
[[ -f "$_BLF_LIB" ]] && source "$_BLF_LIB"

After write_error_summary() is called (around line 1112, in the iteration flow when TEST_PASSED=false), add classification:
# Classify failure mode for adaptive recovery
if type classify_build_loop_failure >/dev/null 2>&1; then
    local failure_mode
    if [[ -n "${FAILURE_MODE_OVERRIDE:-}" ]]; then
        failure_mode="$FAILURE_MODE_OVERRIDE"
        warn "Using manual failure mode override: $failure_mode"
    else
        failure_mode=$(classify_build_loop_failure "$LOG_DIR")
    fi
    if type emit_event >/dev/null 2>&1; then
        emit_event "loop.failure_classified" \
            "mode=$failure_mode" \
            "iteration=$ITERATION" \
            "job_id=${PIPELINE_JOB_ID:-loop-$$}"
    fi
fi

In run_loop_with_restarts() (around line 2462, before incrementing RESTART_COUNT), add recovery logic:
# Classify failure and apply mode-specific recovery
local loop_failure_mode="code_error"
if type classify_build_loop_failure >/dev/null 2>&1; then
    loop_failure_mode=$(classify_build_loop_failure "$LOG_DIR" 2>/dev/null || echo "code_error")
fi
case "$loop_failure_mode" in
    test_flakiness)
        info "Detected test flakiness: will rerun tests without code changes"
        ;;
    dependency_issue)
        info "Detected dependency issue: reinstalling dependencies"
        # Look for manifests in PROJECT_ROOT, the same directory the reinstall runs in
        if [[ -f "$PROJECT_ROOT/package.json" ]]; then
            ( cd "$PROJECT_ROOT" && rm -rf node_modules 2>/dev/null && npm ci 2>/dev/null ) || true
        elif [[ -f "$PROJECT_ROOT/requirements.txt" ]]; then
            ( cd "$PROJECT_ROOT" && pip install -r requirements.txt 2>/dev/null ) || true
        fi
        ;;
    infinite_loop)
        info "Detected infinite loop: reducing iterations and injecting new approach guidance"
        if [[ "$MAX_ITERATIONS" -gt 10 ]]; then
            MAX_ITERATIONS=10
        fi
        ;;
    context_exhaustion)
        info "Detected context exhaustion: restarting with compressed briefing"
        # Default restart behavior handles this via session-restart.sh
        ;;
esac
if type emit_event >/dev/null 2>&1; then
    emit_event "loop.recovery_applied" \
        "mode=$loop_failure_mode" \
        "restart=$RESTART_COUNT" \
        "job_id=${PIPELINE_JOB_ID:-loop-$$}"
fi

In scripts/lib/daemon-failure.sh:
Source the new library (after the guard, around line 5):

_BLF_LIB="${BASH_SOURCE[0]%/*}/build-loop-failure.sh"
[[ -f "$_BLF_LIB" ]] && source "$_BLF_LIB"

Enhance retry strategy in daemon_on_failure() (around line 270, after existing escalation logic, before the daemon_spawn_pipeline call):
# Granular build-loop failure classification
local granular_mode=""
if type classify_build_loop_failure >/dev/null 2>&1; then
    local issue_artifacts="${issue_worktree_path}/.claude/loop-logs"
    if [[ -d "$issue_artifacts" ]]; then
        granular_mode=$(classify_build_loop_failure "$issue_artifacts" 2>/dev/null || echo "")
    fi
fi
if [[ -n "$granular_mode" ]] && type get_recovery_strategy >/dev/null 2>&1; then
    local strategy_json
    strategy_json=$(get_recovery_strategy "$granular_mode")
    local strategy_action
    strategy_action=$(echo "$strategy_json" | jq -r '.action // "standard_retry"' 2>/dev/null || echo "standard_retry")
    emit_event "daemon.granular_failure" \
        "issue=$issue_num" \
        "broad_class=$failure_class" \
        "granular_mode=$granular_mode" \
        "action=$strategy_action"
    case "$strategy_action" in
        rerun_tests)
            extra_args+=("--max-iterations" "3")
            daemon_log INFO "Granular recovery: rerun tests (flaky test detected)"
            ;;
        reinstall_deps)
            extra_args+=("--max-iterations" "5")
            daemon_log INFO "Granular recovery: reinstall dependencies"
            ;;
        reduce_and_redirect)
            extra_args+=("--max-iterations" "10")
            daemon_log INFO "Granular recovery: reduce iterations (infinite loop)"
            ;;
        restart_compressed)
            # Boost the restart budget, capped at 5
            local boosted=$(( ${MAX_RESTARTS_CFG:-3} + retry_count + 2 ))
            [[ "$boosted" -gt 5 ]] && boosted=5
            extra_args+=("--max-restarts" "$boosted")
            daemon_log INFO "Granular recovery: compressed restart (context exhaustion)"
            ;;
    esac
fi

Standard test harness pattern with these test groups:
Group 1: classify_build_loop_failure tests (12 cases)
1. Returns `dependency_issue` for "Module not found" errors
2. Returns `dependency_issue` for "npm ERR! ERESOLVE" errors
3. Returns `dependency_issue` for "ModuleNotFoundError" errors
4. Returns `test_flakiness` for alternating pass/fail history
5. Returns `test_flakiness` for EADDRINUSE errors
6. Returns `infinite_loop` for 3+ repeated same errors
7. Returns `infinite_loop` for "stuck" status in progress.md
8. Returns `context_exhaustion` for "exhausted" status
9. Returns `context_exhaustion` for token/context error patterns
10. Returns `code_error` as default (assertion errors, no special patterns)
11. Returns `code_error` when error-summary.json is missing
12. Returns `code_error` when error-summary.json is malformed
Group 2: get_recovery_strategy tests (6 cases)
13. Returns rerun_tests action for test_flakiness
14. Returns reinstall_deps action for dependency_issue
15. Returns reduce_and_redirect action for infinite_loop
16. Returns restart_compressed action for context_exhaustion
17. Returns standard_retry action for code_error
18. Returns valid JSON for unknown mode (defaults to standard_retry)
Group 3: detect_* function tests (8 cases)
19. detect_test_flakiness returns 0 with alternating pass/fail history
20. detect_test_flakiness returns 1 with consistent failures
21. detect_infinite_loop returns 0 with 3+ identical error lines
22. detect_infinite_loop returns 1 with varying errors
23. detect_dependency_issue returns 0 with npm ERESOLVE in error_lines
24. detect_dependency_issue returns 1 with assertion errors
25. detect_context_exhaustion returns 0 with exhausted status
26. detect_context_exhaustion returns 1 with normal running status
Group 4: Integration tests (2 cases)
27. Classification writes well-formed failure-mode.json
28. Priority: dependency_issue beats code_error with mixed signals
Setup pattern per test:
- Create temp `$LOG_DIR` with mock `error-summary.json` and `progress.md`
- For history tests, create `restart-1/`, `restart-2/` with archived error summaries
- Source `build-loop-failure.sh` after stubbing `emit_event`
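The setup pattern above could look roughly like this. The `assert_eq` shown here is a minimal stand-in with an assumed `(description, expected, actual)` signature; the real helper in `test-helpers.sh` may differ. The line layout of the mock `progress.md` is also an assumption, since the plan only names its fields.

```shell
#!/usr/bin/env bash
# Sketch of per-test setup. assert_eq is a stand-in, not the project's helper.
assert_eq() {  # assert_eq <description> <expected> <actual>
  if [[ "$2" == "$3" ]]; then
    echo "PASS: $1"
  else
    echo "FAIL: $1 (expected '$2', got '$3')"
    return 1
  fi
}

# Stub emit_event before sourcing the library so tests write no events.
emit_event() { :; }

setup_artifacts() {
  LOG_DIR=$(mktemp -d)
  mkdir -p "$LOG_DIR/restart-1" "$LOG_DIR/restart-2"
  printf '%s\n' \
    '{"iteration":3,"error_count":1,"error_lines":["npm ERR! code ERESOLVE"],"test_cmd":"npm test"}' \
    > "$LOG_DIR/error-summary.json"
  # Field layout below is assumed; only the field names come from the plan.
  printf 'Iteration: 3\nTests passing: false\nStatus: running\n' > "$LOG_DIR/progress.md"
}

teardown_artifacts() { rm -rf "$LOG_DIR"; }
```

A test case would then call `setup_artifacts`, source the library, compare `$(classify_build_loop_failure "$LOG_DIR")` against the expected mode with `assert_eq`, and finish with `teardown_artifacts`.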
Add to the scripts section in package.json:

"test:build-loop-failure": "bash scripts/sw-lib-build-loop-failure-test.sh"

And append to the aggregate test script.
- Task 1: Create `scripts/lib/build-loop-failure.sh` with all classification and recovery functions
- Task 2: Add `--failure-mode` CLI flag to `scripts/sw-loop.sh` with validation
- Task 3: Source library in `sw-loop.sh` and integrate classification after test failures
- Task 4: Integrate classification into `run_loop_with_restarts()` for mid-loop recovery
- Task 5: Enhance `daemon-failure.sh` to use granular classification for retry strategy
- Task 6: Create test suite `scripts/sw-lib-build-loop-failure-test.sh` (28 test cases)
- Task 7: Register test suite in `package.json`
- Task 8: Run new test suite and fix any failures
- Task 9: Run `sw-lib-daemon-failure-test.sh` to verify no regressions
- Task 10: Run `sw-loop-test.sh` to verify no regressions
- Unit tests (26 cases): Test each classification function in isolation with mock `error-summary.json` and `progress.md`. Covers all 5 failure modes and edge cases (missing files, malformed JSON).
- Integration tests (2 cases): Test that classification writes correct `failure-mode.json` and priority ordering works end-to-end.
- Regression tests (2 existing suites): Run `sw-lib-daemon-failure-test.sh` and `sw-loop-test.sh`.

- Each of the 5 failure modes has at least 2 test cases
- All 4 detection functions tested for both positive and negative cases
- `get_recovery_strategy()` tested for all 5 modes + unknown fallback
- Error handling: missing files, malformed JSON, empty error_lines
- Happy path: Each failure mode correctly identified from representative error patterns
- Error case 1: Missing `error-summary.json` → defaults to `code_error`
- Error case 2: Malformed JSON in `error-summary.json` → defaults to `code_error`
- Edge case 1: Mixed signals (dependency error + test flakiness) → priority order respected (dependency wins)
- Edge case 2: Empty `progress.md` with real errors → classifies based on error content alone
- `classify_build_loop_failure()` correctly classifies all 5 failure modes from `error-summary.json` and `progress.md`
- `get_recovery_strategy()` returns mode-specific recovery parameters as valid JSON
- Failure mode and recovery strategy logged to `events.jsonl` via `emit_event`
- Daemon retry logic (`daemon_on_failure`) uses granular classification for mode-specific retry args
- `--failure-mode` flag on `sw-loop.sh` allows manual override for testing (with validation)
- Test suite passes with 28 test cases covering all 5 modes + edge cases
- Existing `sw-lib-daemon-failure-test.sh` passes (no regressions)
- Existing `sw-loop-test.sh` passes (no regressions)
- All code is Bash 3.2 compatible
- Library gracefully degrades when not available (guarded with `type` checks)
No previous attempt has failed — this is the first plan stage. However, proactive analysis of potential failure modes:
- Most likely: Bash 3.2 compatibility issue, such as using `readarray`, `declare -A`, or `${var,,}` in the new library. Evidence to confirm: Shellcheck or macOS test run. Mitigation: Strict adherence to project conventions, test on macOS.
- Possible: `jq` not available in test environment, so classification falls back to the wrong path. Evidence: Check mock setup in test harness. Mitigation: Test both jq-present and jq-absent paths.
- Unlikely: Sourcing order issue, where `build-loop-failure.sh` tries to use functions from a file sourced after it. Evidence: Check dependency chain. Mitigation: Library is self-contained, no external function deps.
- `daemon-failure.sh:1-71`: existing classification reads log tail, returns string
- `sw-loop.sh:1051-1112`: `write_error_summary()` format is stable
- `sw-loop.sh:193-325`: CLI parsing pattern is well-established
- `sw-loop.sh:2440-2500`: restart wrapper has clear hook points
- All existing libraries use `_LOADED` guard pattern
Not applicable (first attempt). Approach is conservative:
- New library with no external deps (self-contained)
- Integration via `type` guards (graceful fallback)
- Additive changes only (no existing behavior modified unless granular classification available)

- Run `sw-lib-build-loop-failure-test.sh`: all 28 tests pass
- Run `sw-lib-daemon-failure-test.sh`: no regressions
- Run `sw-loop-test.sh`: no regressions
- Manual test: `shipwright loop "test" --failure-mode test_flakiness` exits cleanly
- Approach: Send error logs to Claude API for classification
- Pros: More accurate, handles novel error patterns
- Cons: Adds API cost per failure, adds latency, requires API availability
- Trade-offs: Higher accuracy but much higher cost/complexity. Pattern matching handles 90%+ of cases.
- Verdict: Rejected — deterministic pattern matching is sufficient and free
- Approach: Add build-loop categories to `rootcause_classify()`
- Pros: Reuses confidence scoring, evidence collection
- Cons: `root-cause.sh` reads `error-log.jsonl` (pipeline-level), not `error-summary.json` (loop-level). Different data sources.
- Trade-offs: Less code duplication but worse separation of concerns
- Verdict: Rejected — different data sources make clean integration difficult
- Approach: Define patterns in a JSON config that maps regex → failure mode
- Pros: User-configurable without code changes
- Cons: Stateful detectors (flakiness=alternating history, infinite_loop=repeated errors) can't be expressed as simple regex rules
- Trade-offs: More flexible but significantly more complex for marginal benefit
- Verdict: Rejected — stateful detection logic requires code, not just patterns