Pipeline Plan 246

Build Loop Failure Mode Classification and Adaptive Recovery

Brainstorming: Socratic Design Refinement

Requirements Clarity

What is the minimum viable change? A new library (scripts/lib/build-loop-failure.sh) that classifies build loop failures into 5 modes and returns mode-specific recovery strategies. The daemon's retry logic and the loop's restart logic both call into this library to select the right recovery approach instead of using generic retry.

Implicit requirements:

Must integrate with the existing daemon-failure.sh classification (enhance, not replace)
Must work within the existing error-summary.json and progress.md data structures
Must be Bash 3.2 compatible (no associative arrays, no readarray, no ${var,,})
Must follow the existing test harness patterns (test-helpers.sh, assert_eq, etc.)
Recovery strategies must be actionable CLI args that the daemon can pass to the next pipeline/loop invocation

Acceptance criteria (from issue + derived):

classify_build_loop_failure() analyzes error-summary.json + progress.md and returns one of: context_exhaustion, infinite_loop, test_flakiness, dependency_issue, code_error
get_recovery_strategy() returns mode-specific recovery parameters (JSON)
Failure mode + strategy logged to events.jsonl via emit_event
Daemon retry logic uses classification to select recovery strategy
--failure-mode flag on sw-loop.sh allows manual override for testing
Test suite verifies each mode triggers correct strategy

Design Alternatives

Approach A: Extend daemon-failure.sh inline

Pros: Single file, no new imports needed
Cons: daemon-failure.sh is daemon-scoped (reads daemon logs); build-loop classification reads different data (error-summary.json, progress.md, iteration history). Mixing concerns makes both harder to test.
Blast radius: Medium — changes to daemon-failure.sh affect all daemon retry paths

Approach B: New library lib/build-loop-failure.sh + integration points (CHOSEN)

Pros: Clean separation of concerns. Loop-level classification can be tested independently. Daemon calls into it as an enhancement layer. Can be sourced by both sw-loop.sh and sw-daemon.sh.
Cons: One more file to maintain. Need to coordinate with existing classify_failure().
Blast radius: Low — new file with integration hooks into existing code

Approach C: Enhance existing root-cause.sh with loop-specific patterns

Pros: Reuses confidence scoring, evidence collection
Cons: root-cause.sh reads error-log.jsonl (pipeline-level), not error-summary.json (loop-level). Different data sources, different granularity. Would need significant refactoring.
Blast radius: Medium — changes to root-cause.sh affect pipeline analysis

Why Approach B: The build loop failure classification operates on different data (per-iteration error-summary.json, progress.md history, git diff patterns) than daemon-level classification (log tail). Keeping them separate allows each to evolve independently while the daemon can call both: classify_failure() for broad category, then classify_build_loop_failure() for granular recovery strategy when the broad category is build_failure or context_exhaustion.

Risk Assessment

Risk	Impact	Mitigation
New classification disagrees with existing `classify_failure()`	Daemon applies wrong retry strategy	New classification is an enhancement layer — only consulted after daemon-level classification. If unavailable, falls back to existing behavior
Pattern matching for test flakiness produces false positives	Wastes a retry on "rerun tests" when code is actually broken	Require 2+ instances of alternating pass/fail in iteration history before classifying as flaky. Conservative threshold.
`--failure-mode` flag bypasses real classification	Could mask bugs in classification logic	Flag is for testing only — documented as such, emits warning event
Dependency reinstall recovery corrupts node_modules mid-build	Build state becomes inconsistent	Recovery strategy includes `rm -rf node_modules && npm ci` (clean install), not incremental fix
New library not sourced in all code paths	Silent fallback to generic retry	Guard all calls with `type classify_build_loop_failure >/dev/null 2>&1` checks

Dependency Analysis

This depends on:

scripts/lib/daemon-failure.sh — integration point for retry strategy selection
scripts/sw-loop.sh — integration point for --failure-mode flag and mid-loop classification
scripts/lib/session-restart.sh — provides restart reason detection we build on
scripts/lib/test-helpers.sh — test harness for the new test suite
error-summary.json format: {iteration, error_count, error_lines[], test_cmd}
progress.md format: Iteration, Tests passing, Status fields

What depends on what we're changing:

daemon-failure.sh changes: sw-daemon.sh sources this — must remain backward-compatible
sw-loop.sh changes: pipeline build stage calls loop — new flag must be optional
No circular dependency risk — new library is leaf-level, sources nothing from daemon or loop

Failure Mode Analysis

Runtime: If error-summary.json doesn't exist or is malformed JSON, classification must return code_error (safe default), not crash. Use jq with fallback.
Concurrency: Multiple daemon workers could read/write error-summary.json simultaneously in worktree mode — but each worktree has its own copy, so no race condition.
Scale: Classification reads at most 2 small files (error-summary.json ~1KB, progress.md ~2KB) — no scale concern.
Rollback: All changes are additive (new library, new flag, enhanced retry logic). Removing the library causes graceful fallback to existing behavior via type guards.

Architecture Decision Record

Component Decomposition

┌─────────────────────────────────────────────────────────┐
│                    sw-daemon.sh                          │
│  daemon_on_failure()                                     │
│    → classify_failure() [existing, daemon-level]         │
│    → classify_build_loop_failure() [NEW, granular]       │
│    → get_recovery_strategy() [NEW]                       │
│    → apply recovery args to daemon_spawn_pipeline()      │
└─────────────────────────────────────────────────────────┘
         │ sources                    │ sources
         ▼                            ▼
┌──────────────────────┐  ┌───────────────────────────────┐
│ lib/daemon-failure.sh │  │ lib/build-loop-failure.sh     │
│ [existing]            │  │ [NEW]                         │
│                       │  │                               │
│ classify_failure()    │  │ classify_build_loop_failure()  │
│ get_max_retries()     │  │ get_recovery_strategy()        │
│ daemon_on_failure()   │  │ detect_test_flakiness()        │
│                       │  │ detect_infinite_loop()         │
│                       │  │ detect_dependency_issue()      │
│                       │  │ _blf_read_error_summary()      │
│                       │  │ _blf_read_progress_history()   │
└──────────────────────┘  └───────────────────────────────┘
                                     ▲
                                     │ sources
                           ┌─────────────────────┐
                           │   sw-loop.sh         │
                           │ [modified]            │
                           │                       │
                           │ --failure-mode flag   │
                           │ post-failure classify │
                           │ write failure-mode.json│
                           └───────────────────────┘

Interface Contracts

classify_build_loop_failure(artifacts_dir)

Input: artifacts_dir — path containing error-summary.json and progress.md
Output (stdout): One of context_exhaustion, infinite_loop, test_flakiness, dependency_issue, code_error
Side effect: Writes failure-mode.json to artifacts_dir with {mode, confidence, evidence[], timestamp}
Error: Returns code_error on any parse failure (safe default)

get_recovery_strategy(failure_mode)

Input: failure_mode string

Output (stdout): JSON object:

{
  "mode": "test_flakiness",
  "action": "rerun_tests",
  "args": ["--max-iterations", "5"],
  "description": "Rerun tests without code changes",
  "max_retries_override": 3
}

detect_test_flakiness(artifacts_dir) — exit 0 = flaky, 1 = not flaky detect_infinite_loop(artifacts_dir) — exit 0 = stuck, 1 = not stuck detect_dependency_issue(error_lines) — exit 0 = dep issue, 1 = not detect_context_exhaustion(artifacts_dir) — exit 0 = exhausted, 1 = not

Files to Modify

New Files

scripts/lib/build-loop-failure.sh — Core classification library (~300 lines)
scripts/sw-lib-build-loop-failure-test.sh — Test suite (~400 lines)

Modified Files

scripts/sw-loop.sh — Add --failure-mode flag, source library, call classification after test failures, write failure-mode.json
scripts/lib/daemon-failure.sh — Source new library, enhance retry strategy selection with granular classification
package.json — Register new test suite

Implementation Steps

Step 1: Create `scripts/lib/build-loop-failure.sh`

Core library structure:

#!/usr/bin/env bash
# build-loop-failure.sh — Build loop failure mode classification and adaptive recovery
[[ -n "${_BUILD_LOOP_FAILURE_LOADED:-}" ]] && return 0
_BUILD_LOOP_FAILURE_LOADED=1

Internal helpers:

_blf_read_error_summary(artifacts_dir) — Read and parse error-summary.json. Returns error_lines as newline-separated text. Handle missing file and malformed JSON with empty-string fallback.
_blf_read_progress(artifacts_dir) — Read progress.md, extract iteration count, test status, and status field. Return as iter=N tests=true/false status=running/stuck/exhausted.
_blf_read_iteration_history(artifacts_dir) — Scan restart-*/error-summary.json archives plus current to build a history of per-iteration test results. Returns newline-separated iteration:N|passed:true/false.

Detectors (priority order):

detect_dependency_issue(error_lines) — Pattern match:
- npm ERR.*ERESOLVE|npm ERR.*peer dep|Module not found|Cannot find module
- pip.*install.*fail|ModuleNotFoundError|ImportError
- cargo.*fetch|cargo.*resolve|unresolved import
- ENOENT.*node_modules|package.*not found|version.*conflict
detect_test_flakiness(artifacts_dir) — Check:
- Iteration history shows alternating pass/fail (2+ alternations)
- Error contains timing patterns: timeout|EADDRINUSE|ECONNREFUSED|race|flaky|intermittent
- Different error lines across consecutive failures (error content changes but tests still fail)
detect_infinite_loop(artifacts_dir) — Check:
- Same error line appears in 3+ consecutive error-summary.json entries (current + archived)
- Progress status is "stuck" or "diverging"
- CONSECUTIVE_FAILURES >= circuit breaker threshold (from progress.md)
- No new commits in last 3+ iterations
detect_context_exhaustion(artifacts_dir) — Check:
- Progress status is "exhausted"
- Iteration count >= max_iterations (near limit)
- Error lines contain context|token.*limit|compact|truncat
- Multiple restart archives exist with declining progress

Orchestrator:

classify_build_loop_failure(artifacts_dir) — Runs detectors in priority order (dependency → flakiness → infinite_loop → context_exhaustion → code_error). First match wins. Writes failure-mode.json atomically. Echoes mode to stdout.

Recovery strategy:

get_recovery_strategy(failure_mode) — Returns JSON per mode:

Mode	Action	Extra Args	Description
`context_exhaustion`	`restart_compressed`	`--max-restarts +2`	Restart with compressed briefing, boost restarts
`infinite_loop`	`reduce_and_redirect`	`--max-iterations 10`	Reduce iterations, inject "try different approach"
`test_flakiness`	`rerun_tests`	`--max-iterations 3`	Rerun tests without code changes
`dependency_issue`	`reinstall_deps`	`--max-iterations 5`	Clean reinstall dependencies first
`code_error`	`standard_retry`	(none)	Standard retry with model upgrade (existing behavior)

Step 2: Add `--failure-mode` flag to `sw-loop.sh`

In the CLI parsing section (before the -* catch-all at ~line 308):

--failure-mode)
    FAILURE_MODE_OVERRIDE="${2:-}"
    [[ -z "$FAILURE_MODE_OVERRIDE" ]] && { error "Missing value for --failure-mode"; exit 1; }
    shift 2
    ;;
--failure-mode=*) FAILURE_MODE_OVERRIDE="${1#--failure-mode=}"; shift ;;

Add validation after parsing (around line 332):

if [[ -n "${FAILURE_MODE_OVERRIDE:-}" ]]; then
    case "$FAILURE_MODE_OVERRIDE" in
        context_exhaustion|infinite_loop|test_flakiness|dependency_issue|code_error) ;;
        *) error "--failure-mode must be: context_exhaustion, infinite_loop, test_flakiness, dependency_issue, code_error"; exit 1 ;;
    esac
fi

Initialize the variable with other defaults (around line 90):

FAILURE_MODE_OVERRIDE=""

Step 3: Source library and integrate into `sw-loop.sh` post-failure flow

Source the library near the top (after other lib sources, ~line 60):

_BLF_LIB="$SCRIPT_DIR/lib/build-loop-failure.sh"
[[ -f "$_BLF_LIB" ]] && source "$_BLF_LIB"

After write_error_summary() is called (around line 1112, in the iteration flow when TEST_PASSED=false), add classification:

# Classify failure mode for adaptive recovery
if type classify_build_loop_failure >/dev/null 2>&1; then
    local failure_mode
    if [[ -n "${FAILURE_MODE_OVERRIDE:-}" ]]; then
        failure_mode="$FAILURE_MODE_OVERRIDE"
        warn "Using manual failure mode override: $failure_mode"
    else
        failure_mode=$(classify_build_loop_failure "$LOG_DIR")
    fi
    if type emit_event >/dev/null 2>&1; then
        emit_event "loop.failure_classified" \
            "mode=$failure_mode" \
            "iteration=$ITERATION" \
            "job_id=${PIPELINE_JOB_ID:-loop-$$}"
    fi
fi

Step 4: Integrate into `run_loop_with_restarts()` for mid-loop recovery

In run_loop_with_restarts() (around line 2462, before incrementing RESTART_COUNT), add recovery logic:

# Classify failure and apply mode-specific recovery
local loop_failure_mode="code_error"
if type classify_build_loop_failure >/dev/null 2>&1; then
    loop_failure_mode=$(classify_build_loop_failure "$LOG_DIR" 2>/dev/null || echo "code_error")
fi

case "$loop_failure_mode" in
    test_flakiness)
        info "Detected test flakiness — will rerun tests without code changes"
        ;;
    dependency_issue)
        info "Detected dependency issue — reinstalling dependencies"
        if [[ -f "package.json" ]]; then
            ( cd "$PROJECT_ROOT" && rm -rf node_modules 2>/dev/null && npm ci 2>/dev/null ) || true
        elif [[ -f "requirements.txt" ]]; then
            ( cd "$PROJECT_ROOT" && pip install -r requirements.txt 2>/dev/null ) || true
        fi
        ;;
    infinite_loop)
        info "Detected infinite loop — reducing iterations and injecting new approach guidance"
        if [[ "$MAX_ITERATIONS" -gt 10 ]]; then
            MAX_ITERATIONS=10
        fi
        ;;
    context_exhaustion)
        info "Detected context exhaustion — restarting with compressed briefing"
        # Default restart behavior handles this via session-restart.sh
        ;;
esac

if type emit_event >/dev/null 2>&1; then
    emit_event "loop.recovery_applied" \
        "mode=$loop_failure_mode" \
        "restart=$RESTART_COUNT" \
        "job_id=${PIPELINE_JOB_ID:-loop-$$}"
fi

Step 5: Enhance daemon retry with granular classification

In scripts/lib/daemon-failure.sh:

Source the new library (after the guard, around line 5):

_BLF_LIB="${BASH_SOURCE[0]%/*}/build-loop-failure.sh"
[[ -f "$_BLF_LIB" ]] && source "$_BLF_LIB"

Enhance retry strategy in daemon_on_failure() (around line 270, after existing escalation logic, before the daemon_spawn_pipeline call):

# Granular build-loop failure classification
local granular_mode=""
if type classify_build_loop_failure >/dev/null 2>&1; then
    local issue_artifacts="${issue_worktree_path}/.claude/loop-logs"
    if [[ -d "$issue_artifacts" ]]; then
        granular_mode=$(classify_build_loop_failure "$issue_artifacts" 2>/dev/null || echo "")
    fi
fi

if [[ -n "$granular_mode" ]] && type get_recovery_strategy >/dev/null 2>&1; then
    local strategy_json
    strategy_json=$(get_recovery_strategy "$granular_mode")
    local strategy_action
    strategy_action=$(echo "$strategy_json" | jq -r '.action // "standard_retry"' 2>/dev/null || echo "standard_retry")

    emit_event "daemon.granular_failure" \
        "issue=$issue_num" \
        "broad_class=$failure_class" \
        "granular_mode=$granular_mode" \
        "action=$strategy_action"

    case "$strategy_action" in
        rerun_tests)
            extra_args+=("--max-iterations" "3")
            daemon_log INFO "Granular recovery: rerun tests (flaky test detected)"
            ;;
        reinstall_deps)
            extra_args+=("--max-iterations" "5")
            daemon_log INFO "Granular recovery: reinstall dependencies"
            ;;
        reduce_and_redirect)
            extra_args+=("--max-iterations" "10")
            daemon_log INFO "Granular recovery: reduce iterations (infinite loop)"
            ;;
        restart_compressed)
            local boosted=$(( ${MAX_RESTARTS_CFG:-3} + retry_count + 2 ))
            [[ "$boosted" -gt 5 ]] && boosted=5
            extra_args+=("--max-restarts" "$boosted")
            daemon_log INFO "Granular recovery: compressed restart (context exhaustion)"
            ;;
    esac
fi

Step 6: Create test suite `scripts/sw-lib-build-loop-failure-test.sh`

Standard test harness pattern with these test groups:

Group 1: classify_build_loop_failure tests (12 cases)

Returns dependency_issue for "Module not found" errors
Returns dependency_issue for "npm ERR! ERESOLVE" errors
Returns dependency_issue for "ModuleNotFoundError" errors
Returns test_flakiness for alternating pass/fail history
Returns test_flakiness for EADDRINUSE errors
Returns infinite_loop for 3+ repeated same errors
Returns infinite_loop for "stuck" status in progress.md
Returns context_exhaustion for "exhausted" status
Returns context_exhaustion for token/context error patterns
Returns code_error as default (assertion errors, no special patterns)
Returns code_error when error-summary.json is missing
Returns code_error when error-summary.json is malformed

Group 2: get_recovery_strategy tests (6 cases) 13. Returns rerun_tests action for test_flakiness 14. Returns reinstall_deps action for dependency_issue 15. Returns reduce_and_redirect action for infinite_loop 16. Returns restart_compressed action for context_exhaustion 17. Returns standard_retry action for code_error 18. Returns valid JSON for unknown mode (defaults to standard_retry)

Group 3: detect_ function tests (8 cases)* 19. detect_test_flakiness returns 0 with alternating pass/fail history 20. detect_test_flakiness returns 1 with consistent failures 21. detect_infinite_loop returns 0 with 3+ identical error lines 22. detect_infinite_loop returns 1 with varying errors 23. detect_dependency_issue returns 0 with npm ERESOLVE in error_lines 24. detect_dependency_issue returns 1 with assertion errors 25. detect_context_exhaustion returns 0 with exhausted status 26. detect_context_exhaustion returns 1 with normal running status

Group 4: Integration tests (2 cases) 27. Classification writes well-formed failure-mode.json 28. Priority: dependency_issue beats code_error with mixed signals

Setup pattern per test:

Create temp $LOG_DIR with mock error-summary.json and progress.md
For history tests, create restart-1/, restart-2/ with archived error summaries
Source build-loop-failure.sh after stubbing emit_event

Step 7: Register test in `package.json`

Add to the scripts section in package.json:

"test:build-loop-failure": "bash scripts/sw-lib-build-loop-failure-test.sh"

And append to the aggregate test script.

Task Checklist

Testing Approach

Test Pyramid Breakdown

Unit tests (28 cases): Test each classification function in isolation with mock error-summary.json and progress.md. Covers all 5 failure modes, edge cases (missing files, malformed JSON), and priority ordering.
Integration tests (2 cases): Test that classification writes correct failure-mode.json and priority ordering works end-to-end.
Regression tests (2 existing suites): Run sw-lib-daemon-failure-test.sh and sw-loop-test.sh.

Coverage Targets

100% of the 5 failure modes have at least 2 test cases each
All 4 detection functions tested for both positive and negative cases
get_recovery_strategy() tested for all 5 modes + unknown fallback
Error handling: missing files, malformed JSON, empty error_lines

Critical Paths to Test

Happy path: Each failure mode correctly identified from representative error patterns
Error case 1: Missing error-summary.json → defaults to code_error
Error case 2: Malformed JSON in error-summary.json → defaults to code_error
Edge case 1: Mixed signals (dependency error + test flakiness) → priority order respected (dependency wins)
Edge case 2: Empty progress.md with real errors → classifies based on error content alone

Definition of Done

Systematic Debugging: Previous Attempt Analysis

Root Cause Hypothesis

No previous attempt has failed — this is the first plan stage. However, proactive analysis of potential failure modes:

Most likely: Bash 3.2 compatibility issue — using readarray, declare -A, or ${var,,} in the new library. Evidence to confirm: Shellcheck or macOS test run. Mitigation: Strict adherence to project conventions, test on macOS.
Possible: jq not available in test environment — classification falls back to wrong path. Evidence: Check mock setup in test harness. Mitigation: Test both jq-present and jq-absent paths.
Unlikely: Sourcing order issue — build-loop-failure.sh tries to use functions from a file sourced after it. Evidence: Check dependency chain. Mitigation: Library is self-contained, no external function deps.

Evidence Gathered

daemon-failure.sh:1-71 — existing classification reads log tail, returns string
sw-loop.sh:1051-1112 — write_error_summary() format is stable
sw-loop.sh:193-325 — CLI parsing pattern is well-established
sw-loop.sh:2440-2500 — restart wrapper has clear hook points
All existing libraries use _LOADED guard pattern

Fix Strategy

Not applicable (first attempt). Approach is conservative:

New library with no external deps (self-contained)
Integration via type guards (graceful fallback)
Additive changes only (no existing behavior modified unless granular classification available)

Verification Plan

Run sw-lib-build-loop-failure-test.sh — all 28 tests pass
Run sw-lib-daemon-failure-test.sh — no regressions
Run sw-loop-test.sh — no regressions
Manual test: shipwright loop "test" --failure-mode test_flakiness exits cleanly

Alternatives Considered

Alternative 1: ML-based classification using Claude API

Approach: Send error logs to Claude API for classification
Pros: More accurate, handles novel error patterns
Cons: Adds API cost per failure, adds latency, requires API availability
Trade-offs: Higher accuracy but much higher cost/complexity. Pattern matching handles 90%+ of cases.
Verdict: Rejected — deterministic pattern matching is sufficient and free

Alternative 2: Extend `root-cause.sh` with loop-specific patterns

Approach: Add build-loop categories to rootcause_classify()
Pros: Reuses confidence scoring, evidence collection
Cons: root-cause.sh reads error-log.jsonl (pipeline-level), not error-summary.json (loop-level). Different data sources.
Trade-offs: Less code duplication but worse separation of concerns
Verdict: Rejected — different data sources make clean integration difficult

Alternative 3: Configuration-driven classification (JSON rules file)

Approach: Define patterns in a JSON config that maps regex → failure mode
Pros: User-configurable without code changes
Cons: Stateful detectors (flakiness=alternating history, infinite_loop=repeated errors) can't be expressed as simple regex rules
Trade-offs: More flexible but significantly more complex for marginal benefit
Verdict: Rejected — stateful detection logic requires code, not just patterns

Pipeline Plan 246

Build Loop Failure Mode Classification and Adaptive Recovery

Brainstorming: Socratic Design Refinement

Requirements Clarity

Design Alternatives

Risk Assessment

Dependency Analysis

Failure Mode Analysis

Architecture Decision Record

Component Decomposition

Interface Contracts

Files to Modify

New Files

Modified Files

Implementation Steps

Step 1: Create scripts/lib/build-loop-failure.sh

Step 2: Add --failure-mode flag to sw-loop.sh

Step 3: Source library and integrate into sw-loop.sh post-failure flow

Step 4: Integrate into run_loop_with_restarts() for mid-loop recovery

Step 5: Enhance daemon retry with granular classification

Step 6: Create test suite scripts/sw-lib-build-loop-failure-test.sh

Step 7: Register test in package.json

Task Checklist

Testing Approach

Test Pyramid Breakdown

Coverage Targets

Critical Paths to Test

Definition of Done

Systematic Debugging: Previous Attempt Analysis

Root Cause Hypothesis

Evidence Gathered

Fix Strategy

Verification Plan

Alternatives Considered

Alternative 1: ML-based classification using Claude API

Alternative 2: Extend root-cause.sh with loop-specific patterns

Alternative 3: Configuration-driven classification (JSON rules file)

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Step 1: Create `scripts/lib/build-loop-failure.sh`

Step 2: Add `--failure-mode` flag to `sw-loop.sh`

Step 3: Source library and integrate into `sw-loop.sh` post-failure flow

Step 4: Integrate into `run_loop_with_restarts()` for mid-loop recovery

Step 6: Create test suite `scripts/sw-lib-build-loop-failure-test.sh`

Step 7: Register test in `package.json`

Alternative 2: Extend `root-cause.sh` with loop-specific patterns