
Pipeline Plan 246

Seth Ford edited this page Mar 10, 2026 · 1 revision

Build Loop Failure Mode Classification and Adaptive Recovery

Brainstorming: Socratic Design Refinement

Requirements Clarity

What is the minimum viable change? A new library (scripts/lib/build-loop-failure.sh) that classifies build loop failures into 5 modes and returns mode-specific recovery strategies. The daemon's retry logic and the loop's restart logic both call into this library to select the right recovery approach instead of using generic retry.

Implicit requirements:

  • Must integrate with the existing daemon-failure.sh classification (enhance, not replace)
  • Must work within the existing error-summary.json and progress.md data structures
  • Must be Bash 3.2 compatible (no associative arrays, no readarray, no ${var,,})
  • Must follow the existing test harness patterns (test-helpers.sh, assert_eq, etc.)
  • Recovery strategies must be actionable CLI args that the daemon can pass to the next pipeline/loop invocation

Acceptance criteria (from issue + derived):

  1. classify_build_loop_failure() analyzes error-summary.json + progress.md and returns one of: context_exhaustion, infinite_loop, test_flakiness, dependency_issue, code_error
  2. get_recovery_strategy() returns mode-specific recovery parameters (JSON)
  3. Failure mode + strategy logged to events.jsonl via emit_event
  4. Daemon retry logic uses classification to select recovery strategy
  5. --failure-mode flag on sw-loop.sh allows manual override for testing
  6. Test suite verifies each mode triggers correct strategy

Design Alternatives

Approach A: Extend daemon-failure.sh inline

  • Pros: Single file, no new imports needed
  • Cons: daemon-failure.sh is daemon-scoped (reads daemon logs); build-loop classification reads different data (error-summary.json, progress.md, iteration history). Mixing concerns makes both harder to test.
  • Blast radius: Medium — changes to daemon-failure.sh affect all daemon retry paths

Approach B: New library lib/build-loop-failure.sh + integration points (CHOSEN)

  • Pros: Clean separation of concerns. Loop-level classification can be tested independently. Daemon calls into it as an enhancement layer. Can be sourced by both sw-loop.sh and sw-daemon.sh.
  • Cons: One more file to maintain. Need to coordinate with existing classify_failure().
  • Blast radius: Low — new file with integration hooks into existing code

Approach C: Enhance existing root-cause.sh with loop-specific patterns

  • Pros: Reuses confidence scoring, evidence collection
  • Cons: root-cause.sh reads error-log.jsonl (pipeline-level), not error-summary.json (loop-level). Different data sources, different granularity. Would need significant refactoring.
  • Blast radius: Medium — changes to root-cause.sh affect pipeline analysis

Why Approach B: The build loop failure classification operates on different data (per-iteration error-summary.json, progress.md history, git diff patterns) than daemon-level classification (log tail). Keeping them separate allows each to evolve independently while the daemon can call both: classify_failure() for broad category, then classify_build_loop_failure() for granular recovery strategy when the broad category is build_failure or context_exhaustion.

Risk Assessment

| Risk | Impact | Mitigation |
|------|--------|------------|
| New classification disagrees with existing classify_failure() | Daemon applies wrong retry strategy | New classification is an enhancement layer — only consulted after daemon-level classification. If unavailable, falls back to existing behavior |
| Pattern matching for test flakiness produces false positives | Wastes a retry on "rerun tests" when code is actually broken | Require 2+ instances of alternating pass/fail in iteration history before classifying as flaky. Conservative threshold. |
| --failure-mode flag bypasses real classification | Could mask bugs in classification logic | Flag is for testing only — documented as such, emits warning event |
| Dependency reinstall recovery corrupts node_modules mid-build | Build state becomes inconsistent | Recovery strategy includes rm -rf node_modules && npm ci (clean install), not incremental fix |
| New library not sourced in all code paths | Silent fallback to generic retry | Guard all calls with type classify_build_loop_failure >/dev/null 2>&1 checks |

Dependency Analysis

This depends on:

  • scripts/lib/daemon-failure.sh — integration point for retry strategy selection
  • scripts/sw-loop.sh — integration point for --failure-mode flag and mid-loop classification
  • scripts/lib/session-restart.sh — provides restart reason detection we build on
  • scripts/lib/test-helpers.sh — test harness for the new test suite
  • error-summary.json format: {iteration, error_count, error_lines[], test_cmd}
  • progress.md format: Iteration, Tests passing, Status fields

What depends on what we're changing:

  • daemon-failure.sh changes: sw-daemon.sh sources this — must remain backward-compatible
  • sw-loop.sh changes: pipeline build stage calls loop — new flag must be optional
  • No circular dependency risk — new library is leaf-level, sources nothing from daemon or loop

Failure Mode Analysis

  1. Runtime: If error-summary.json doesn't exist or is malformed JSON, classification must return code_error (safe default), not crash. Use jq with fallback.
  2. Concurrency: Multiple daemon workers could read/write error-summary.json simultaneously in worktree mode — but each worktree has its own copy, so no race condition.
  3. Scale: Classification reads at most 2 small files (error-summary.json ~1KB, progress.md ~2KB) — no scale concern.
  4. Rollback: All changes are additive (new library, new flag, enhanced retry logic). Removing the library causes graceful fallback to existing behavior via type guards.
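The "jq with fallback" requirement in item 1 could look roughly like the sketch below. The function name comes from the plan, but the body is an assumption: it treats a missing file, a missing jq binary, and malformed JSON the same way, by printing nothing and returning 0, so a caller that sees no error lines can fall through to code_error.

```shell
#!/usr/bin/env bash
# Sketch only: the function name comes from the plan; the body is an assumption.

# Print error_lines from error-summary.json, one per line.
# Prints nothing (and still returns 0) when the file is missing,
# jq is unavailable, or the JSON is malformed.
_blf_read_error_summary() {
    local artifacts_dir="${1:-}"
    local summary="$artifacts_dir/error-summary.json"

    [[ -f "$summary" ]] || return 0
    command -v jq >/dev/null 2>&1 || return 0

    # jq exits non-zero on malformed JSON; swallow that and emit nothing.
    jq -r '.error_lines[]? // empty' "$summary" 2>/dev/null || true
}
```

Because the helper never fails, the orchestrator does not need its own error handling around the read; an empty result simply means no detector that relies on error lines will match.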

Architecture Decision Record

Component Decomposition

┌─────────────────────────────────────────────────────────┐
│                    sw-daemon.sh                          │
│  daemon_on_failure()                                     │
│    → classify_failure() [existing, daemon-level]         │
│    → classify_build_loop_failure() [NEW, granular]       │
│    → get_recovery_strategy() [NEW]                       │
│    → apply recovery args to daemon_spawn_pipeline()      │
└─────────────────────────────────────────────────────────┘
         │ sources                    │ sources
         ▼                            ▼
┌──────────────────────┐  ┌───────────────────────────────┐
│ lib/daemon-failure.sh │  │ lib/build-loop-failure.sh     │
│ [existing]            │  │ [NEW]                         │
│                       │  │                               │
│ classify_failure()    │  │ classify_build_loop_failure()  │
│ get_max_retries()     │  │ get_recovery_strategy()        │
│ daemon_on_failure()   │  │ detect_test_flakiness()        │
│                       │  │ detect_infinite_loop()         │
│                       │  │ detect_dependency_issue()      │
│                       │  │ detect_context_exhaustion()    │
│                       │  │ _blf_read_error_summary()      │
│                       │  │ _blf_read_progress()           │
│                       │  │ _blf_read_iteration_history()  │
└──────────────────────┘  └───────────────────────────────┘
                                     ▲
                                     │ sources
                           ┌─────────────────────┐
                           │   sw-loop.sh         │
                           │ [modified]            │
                           │                       │
                           │ --failure-mode flag   │
                           │ post-failure classify │
                           │ write failure-mode.json│
                           └───────────────────────┘

Interface Contracts

classify_build_loop_failure(artifacts_dir)

  • Input: artifacts_dir — path containing error-summary.json and progress.md
  • Output (stdout): One of context_exhaustion, infinite_loop, test_flakiness, dependency_issue, code_error
  • Side effect: Writes failure-mode.json to artifacts_dir with {mode, confidence, evidence[], timestamp}
  • Error: Returns code_error on any parse failure (safe default)

get_recovery_strategy(failure_mode)

  • Input: failure_mode string
  • Output (stdout): JSON object:
    {
      "mode": "test_flakiness",
      "action": "rerun_tests",
      "args": ["--max-iterations", "3"],
      "description": "Rerun tests without code changes",
      "max_retries_override": 3
    }

  • detect_test_flakiness(artifacts_dir) — exit 0 = flaky, 1 = not flaky
  • detect_infinite_loop(artifacts_dir) — exit 0 = stuck, 1 = not stuck
  • detect_dependency_issue(error_lines) — exit 0 = dep issue, 1 = not
  • detect_context_exhaustion(artifacts_dir) — exit 0 = exhausted, 1 = not
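The detectors' exit-status contract (0 = detected, 1 = not detected) means a caller can use them directly in an `if`. The sketch below illustrates that with a simplified "same error repeated" check. `has_repeated_tail` is a hypothetical helper, not a function from the plan, and it reads one error line per iteration from stdin instead of taking an artifacts_dir.

```shell
#!/usr/bin/env bash
# Sketch only: has_repeated_tail is a hypothetical helper, not part of the plan.

# Read one error line per iteration on stdin (oldest first).
# Exit 0 if the last $1 lines (default 3) are identical and non-empty, else 1.
has_repeated_tail() {
    local needed="${1:-3}"
    local line prev="" run=0

    while IFS= read -r line; do
        if [[ -n "$line" && "$line" == "$prev" ]]; then
            run=$((run + 1))
        else
            run=1
        fi
        prev="$line"
    done

    [[ -n "$prev" && "$run" -ge "$needed" ]]
}

# The exit status plugs straight into `if`, as the detectors are meant to.
if printf '%s\n' "TypeError: x" "TypeError: x" "TypeError: x" | has_repeated_tail 3; then
    echo "stuck"
else
    echo "not stuck"
fi
```

Returning a status instead of printing a boolean keeps the detectors composable in an `if`/`elif` chain without any string comparison.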


Files to Modify

New Files

  1. scripts/lib/build-loop-failure.sh — Core classification library (~300 lines)
  2. scripts/sw-lib-build-loop-failure-test.sh — Test suite (~400 lines)

Modified Files

  1. scripts/sw-loop.sh — Add --failure-mode flag, source library, call classification after test failures, write failure-mode.json
  2. scripts/lib/daemon-failure.sh — Source new library, enhance retry strategy selection with granular classification
  3. package.json — Register new test suite

Implementation Steps

Step 1: Create scripts/lib/build-loop-failure.sh

Core library structure:

#!/usr/bin/env bash
# build-loop-failure.sh — Build loop failure mode classification and adaptive recovery
[[ -n "${_BUILD_LOOP_FAILURE_LOADED:-}" ]] && return 0
_BUILD_LOOP_FAILURE_LOADED=1

Internal helpers:

  • _blf_read_error_summary(artifacts_dir) — Read and parse error-summary.json. Returns error_lines as newline-separated text. Handle missing file and malformed JSON with empty-string fallback.

  • _blf_read_progress(artifacts_dir) — Read progress.md, extract iteration count, test status, and status field. Return as iter=N tests=true/false status=running/stuck/exhausted.

  • _blf_read_iteration_history(artifacts_dir) — Scan restart-*/error-summary.json archives plus current to build a history of per-iteration test results. Returns newline-separated iteration:N|passed:true/false.
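As an illustration of the progress helper, here is a minimal sketch. It assumes progress.md contains plain "Key: value" lines (Iteration, Tests passing, Status); the plan names those fields but not their exact layout, so the patterns would need adjusting to the real file. `_blf_progress_field` and the default values are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only. ASSUMPTION: progress.md holds "Key: value" lines such as
#   Iteration: 7
#   Tests passing: false
#   Status: stuck
# The real layout may differ; adjust the patterns accordingly.

# Print the value of the first line starting with "<key>:", or nothing.
_blf_progress_field() {
    local file="$1" key="$2"
    [[ -f "$file" ]] || return 0
    sed -n "s/^${key}:[[:space:]]*//p" "$file" | head -n 1
}

# Emit "iter=N tests=true/false status=..." with conservative defaults
# (assumed here: 0, false, running) when a field or the file is missing.
_blf_read_progress() {
    local file="${1:-}/progress.md"
    local iter tests status
    iter=$(_blf_progress_field "$file" "Iteration")
    tests=$(_blf_progress_field "$file" "Tests passing")
    status=$(_blf_progress_field "$file" "Status")
    echo "iter=${iter:-0} tests=${tests:-false} status=${status:-running}"
}
```

Defaulting the status to running means an empty or missing progress.md contributes no "stuck" or "exhausted" signal, so classification falls back to error content alone.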

Detectors (priority order):

  1. detect_dependency_issue(error_lines) — Pattern match:

    • npm ERR.*ERESOLVE|npm ERR.*peer dep|Module not found|Cannot find module
    • pip.*install.*fail|ModuleNotFoundError|ImportError
    • cargo.*fetch|cargo.*resolve|unresolved import
    • ENOENT.*node_modules|package.*not found|version.*conflict
  2. detect_test_flakiness(artifacts_dir) — Check:

    • Iteration history shows alternating pass/fail (2+ alternations)
    • Error contains timing patterns: timeout|EADDRINUSE|ECONNREFUSED|race|flaky|intermittent
    • Different error lines across consecutive failures (error content changes but tests still fail)
  3. detect_infinite_loop(artifacts_dir) — Check:

    • Same error line appears in 3+ consecutive error-summary.json entries (current + archived)
    • Progress status is "stuck" or "diverging"
    • CONSECUTIVE_FAILURES >= circuit breaker threshold (from progress.md)
    • No new commits in last 3+ iterations
  4. detect_context_exhaustion(artifacts_dir) — Check:

    • Progress status is "exhausted"
    • Iteration count >= max_iterations (limit reached)
    • Error lines contain context|token.*limit|compact|truncat
    • Multiple restart archives exist with declining progress
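Two of the checks above can be sketched compactly. The dependency patterns are copied from the list; `count_alternations` is a hypothetical helper that counts pass/fail flips in the `iteration:N|passed:true/false` history format, which a flakiness detector could compare against the 2+ alternation threshold. Matching is case-sensitive in this sketch.

```shell
#!/usr/bin/env bash
# Sketch only: patterns are copied from the plan; count_alternations is a
# hypothetical helper operating on "iteration:N|passed:true/false" lines.

# Exit 0 if any error line matches a dependency pattern, else 1.
detect_dependency_issue() {
    local error_lines="${1:-}"
    local pattern='npm ERR.*ERESOLVE|npm ERR.*peer dep|Module not found|Cannot find module'
    pattern="$pattern|pip.*install.*fail|ModuleNotFoundError|ImportError"
    pattern="$pattern|cargo.*fetch|cargo.*resolve|unresolved import"
    pattern="$pattern|ENOENT.*node_modules|package.*not found|version.*conflict"
    grep -Eq "$pattern" <<< "$error_lines"
}

# Count pass<->fail flips in a history read from stdin (oldest first).
count_alternations() {
    local line result prev="" flips=0
    while IFS= read -r line; do
        result="${line##*passed:}"
        if [[ -n "$prev" && "$result" != "$prev" ]]; then
            flips=$((flips + 1))
        fi
        prev="$result"
    done
    echo "$flips"
}
```

A history of fail, pass, fail yields 2 flips, which would meet the conservative threshold; a run of consecutive failures yields 0 and would not.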

Orchestrator:

  • classify_build_loop_failure(artifacts_dir) — Runs detectors in priority order (dependency → flakiness → infinite_loop → context_exhaustion → code_error). First match wins. Writes failure-mode.json atomically. Echoes mode to stdout.
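The first-match-wins chain and the atomic write might be structured as in the sketch below. The detectors and the reader are stubbed so the control flow runs on its own; the `stuck.marker` file is a hypothetical trigger for the stub, and the confidence and evidence fields from the interface contract are left out for brevity.

```shell
#!/usr/bin/env bash
# Sketch only. Detectors and the reader are stubs; stuck.marker is a
# hypothetical trigger. confidence/evidence fields are omitted here.
_blf_read_error_summary()   { :; }
detect_dependency_issue()   { return 1; }
detect_test_flakiness()     { return 1; }
detect_infinite_loop()      { [[ -f "$1/stuck.marker" ]]; }
detect_context_exhaustion() { return 1; }

classify_build_loop_failure() {
    local artifacts_dir="${1:-}"
    local mode="code_error"   # safe default when nothing matches

    # First match wins, in the plan's priority order.
    if   detect_dependency_issue "$(_blf_read_error_summary "$artifacts_dir")"; then mode="dependency_issue"
    elif detect_test_flakiness "$artifacts_dir";     then mode="test_flakiness"
    elif detect_infinite_loop "$artifacts_dir";      then mode="infinite_loop"
    elif detect_context_exhaustion "$artifacts_dir"; then mode="context_exhaustion"
    fi

    # Atomic write: build the file under a temp name, then rename into place.
    if [[ -d "$artifacts_dir" ]]; then
        local tmp="$artifacts_dir/failure-mode.json.tmp.$$"
        if printf '{"mode":"%s","timestamp":"%s"}\n' \
                "$mode" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" 2>/dev/null > "$tmp"; then
            mv -f "$tmp" "$artifacts_dir/failure-mode.json" 2>/dev/null || rm -f "$tmp"
        fi
    fi

    echo "$mode"
}
```

Writing to a temp file in the same directory and renaming it means a reader never sees a half-written failure-mode.json, and a failed write still leaves the mode on stdout.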

Recovery strategy:

  • get_recovery_strategy(failure_mode) — Returns JSON per mode:

| Mode | Action | Extra Args | Description |
|------|--------|------------|-------------|
| context_exhaustion | restart_compressed | --max-restarts +2 | Restart with compressed briefing, boost restarts |
| infinite_loop | reduce_and_redirect | --max-iterations 10 | Reduce iterations, inject "try different approach" |
| test_flakiness | rerun_tests | --max-iterations 3 | Rerun tests without code changes |
| dependency_issue | reinstall_deps | --max-iterations 5 | Clean reinstall dependencies first |
| code_error | standard_retry | (none) | Standard retry with model upgrade (existing behavior) |
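The strategy table above maps naturally onto a `case` statement, which also keeps the code Bash 3.2 compatible. This is a sketch under stated assumptions: it omits max_retries_override because the plan gives a value for only one mode, it normalizes unknown modes to code_error so the output stays valid JSON, and it passes the relative "+2" through as a literal string for the caller to resolve.

```shell
#!/usr/bin/env bash
# Sketch only: values mirror the table above. max_retries_override is omitted,
# and unknown modes are normalized to code_error (both are assumptions).

get_recovery_strategy() {
    local mode="${1:-code_error}"
    local action args desc

    case "$mode" in
        context_exhaustion)
            action="restart_compressed"; args='["--max-restarts","+2"]'
            desc="Restart with compressed briefing, boost restarts" ;;
        infinite_loop)
            action="reduce_and_redirect"; args='["--max-iterations","10"]'
            desc="Reduce iterations, inject new-approach guidance" ;;
        test_flakiness)
            action="rerun_tests"; args='["--max-iterations","3"]'
            desc="Rerun tests without code changes" ;;
        dependency_issue)
            action="reinstall_deps"; args='["--max-iterations","5"]'
            desc="Clean reinstall dependencies first" ;;
        *)
            mode="code_error"; action="standard_retry"; args='[]'
            desc="Standard retry with model upgrade (existing behavior)" ;;
    esac

    printf '{"mode":"%s","action":"%s","args":%s,"description":"%s"}\n' \
        "$mode" "$action" "$args" "$desc"
}
```

Building the JSON with printf avoids a jq dependency on the write side; the fixed strings contain no characters that need escaping.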

Step 2: Add --failure-mode flag to sw-loop.sh

In the CLI parsing section (before the -* catch-all at ~line 308):

--failure-mode)
    FAILURE_MODE_OVERRIDE="${2:-}"
    [[ -z "$FAILURE_MODE_OVERRIDE" ]] && { error "Missing value for --failure-mode"; exit 1; }
    shift 2
    ;;
--failure-mode=*) FAILURE_MODE_OVERRIDE="${1#--failure-mode=}"; shift ;;

Add validation after parsing (around line 332):

if [[ -n "${FAILURE_MODE_OVERRIDE:-}" ]]; then
    case "$FAILURE_MODE_OVERRIDE" in
        context_exhaustion|infinite_loop|test_flakiness|dependency_issue|code_error) ;;
        *) error "--failure-mode must be: context_exhaustion, infinite_loop, test_flakiness, dependency_issue, code_error"; exit 1 ;;
    esac
fi

Initialize the variable with other defaults (around line 90):

FAILURE_MODE_OVERRIDE=""
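To exercise the flag handling outside sw-loop.sh, the parsing and validation can be wrapped in a function. `parse_failure_mode` is a hypothetical stand-in for the relevant slice of the argument loop, not code from the script; it accepts both `--failure-mode VALUE` and `--failure-mode=VALUE` and ignores every other argument.

```shell
#!/usr/bin/env bash
# Sketch only: parse_failure_mode is a hypothetical stand-in for the relevant
# slice of sw-loop.sh's argument loop. It prints the mode or returns 1.

parse_failure_mode() {
    local override=""
    while [[ $# -gt 0 ]]; do
        case "$1" in
            --failure-mode)
                override="${2:-}"
                [[ -z "$override" ]] && { echo "Missing value for --failure-mode" >&2; return 1; }
                shift 2 ;;
            --failure-mode=*)
                override="${1#--failure-mode=}"; shift ;;
            *)
                shift ;;   # other arguments are ignored in this sketch
        esac
    done

    # An empty override means the flag was not given; that is valid.
    case "$override" in
        ""|context_exhaustion|infinite_loop|test_flakiness|dependency_issue|code_error) ;;
        *) echo "Invalid --failure-mode: $override" >&2; return 1 ;;
    esac
    echo "$override"
}
```

For example, `parse_failure_mode "goal" --failure-mode test_flakiness` prints test_flakiness, while an unrecognized value produces an error on stderr and a non-zero status.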

Step 3: Source library and integrate into sw-loop.sh post-failure flow

Source the library near the top (after other lib sources, ~line 60):

_BLF_LIB="$SCRIPT_DIR/lib/build-loop-failure.sh"
[[ -f "$_BLF_LIB" ]] && source "$_BLF_LIB"

After write_error_summary() is called (around line 1112, in the iteration flow when TEST_PASSED=false), add classification:

# Classify failure mode for adaptive recovery
if type classify_build_loop_failure >/dev/null 2>&1; then
    local failure_mode
    if [[ -n "${FAILURE_MODE_OVERRIDE:-}" ]]; then
        failure_mode="$FAILURE_MODE_OVERRIDE"
        warn "Using manual failure mode override: $failure_mode"
    else
        failure_mode=$(classify_build_loop_failure "$LOG_DIR")
    fi
    if type emit_event >/dev/null 2>&1; then
        emit_event "loop.failure_classified" \
            "mode=$failure_mode" \
            "iteration=$ITERATION" \
            "job_id=${PIPELINE_JOB_ID:-loop-$$}"
    fi
fi

Step 4: Integrate into run_loop_with_restarts() for mid-loop recovery

In run_loop_with_restarts() (around line 2462, before incrementing RESTART_COUNT), add recovery logic:

# Classify failure and apply mode-specific recovery
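local loop_failure_mode="code_error"
if [[ -n "${FAILURE_MODE_OVERRIDE:-}" ]]; then
    # Honor the manual override here too, so --failure-mode exercises the recovery path
    loop_failure_mode="$FAILURE_MODE_OVERRIDE"
elif type classify_build_loop_failure >/dev/null 2>&1; then
    loop_failure_mode=$(classify_build_loop_failure "$LOG_DIR" 2>/dev/null || echo "code_error")
fi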

case "$loop_failure_mode" in
    test_flakiness)
        info "Detected test flakiness — will rerun tests without code changes"
        ;;
    dependency_issue)
        info "Detected dependency issue — reinstalling dependencies"
        # Look for manifests in PROJECT_ROOT, the same directory the reinstall runs in
        if [[ -f "$PROJECT_ROOT/package.json" ]]; then
            ( cd "$PROJECT_ROOT" && rm -rf node_modules 2>/dev/null && npm ci 2>/dev/null ) || true
        elif [[ -f "$PROJECT_ROOT/requirements.txt" ]]; then
            ( cd "$PROJECT_ROOT" && pip install -r requirements.txt 2>/dev/null ) || true
        fi
        ;;
    infinite_loop)
        info "Detected infinite loop — reducing iterations and injecting new approach guidance"
        if [[ "$MAX_ITERATIONS" -gt 10 ]]; then
            MAX_ITERATIONS=10
        fi
        ;;
    context_exhaustion)
        info "Detected context exhaustion — restarting with compressed briefing"
        # Default restart behavior handles this via session-restart.sh
        ;;
esac

if type emit_event >/dev/null 2>&1; then
    emit_event "loop.recovery_applied" \
        "mode=$loop_failure_mode" \
        "restart=$RESTART_COUNT" \
        "job_id=${PIPELINE_JOB_ID:-loop-$$}"
fi

Step 5: Enhance daemon retry with granular classification

In scripts/lib/daemon-failure.sh:

Source the new library (after the guard, around line 5):

_BLF_LIB="${BASH_SOURCE[0]%/*}/build-loop-failure.sh"
[[ -f "$_BLF_LIB" ]] && source "$_BLF_LIB"

Enhance retry strategy in daemon_on_failure() (around line 270, after existing escalation logic, before the daemon_spawn_pipeline call):

# Granular build-loop failure classification
local granular_mode=""
if type classify_build_loop_failure >/dev/null 2>&1; then
    local issue_artifacts="${issue_worktree_path}/.claude/loop-logs"
    if [[ -d "$issue_artifacts" ]]; then
        granular_mode=$(classify_build_loop_failure "$issue_artifacts" 2>/dev/null || echo "")
    fi
fi

if [[ -n "$granular_mode" ]] && type get_recovery_strategy >/dev/null 2>&1; then
    local strategy_json
    strategy_json=$(get_recovery_strategy "$granular_mode")
    local strategy_action
    strategy_action=$(echo "$strategy_json" | jq -r '.action // "standard_retry"' 2>/dev/null || echo "standard_retry")

    emit_event "daemon.granular_failure" \
        "issue=$issue_num" \
        "broad_class=$failure_class" \
        "granular_mode=$granular_mode" \
        "action=$strategy_action"

    case "$strategy_action" in
        rerun_tests)
            extra_args+=("--max-iterations" "3")
            daemon_log INFO "Granular recovery: rerun tests (flaky test detected)"
            ;;
        reinstall_deps)
            extra_args+=("--max-iterations" "5")
            daemon_log INFO "Granular recovery: reinstall dependencies"
            ;;
        reduce_and_redirect)
            extra_args+=("--max-iterations" "10")
            daemon_log INFO "Granular recovery: reduce iterations (infinite loop)"
            ;;
        restart_compressed)
            local boosted=$(( ${MAX_RESTARTS_CFG:-3} + retry_count + 2 ))
            [[ "$boosted" -gt 5 ]] && boosted=5
            extra_args+=("--max-restarts" "$boosted")
            daemon_log INFO "Granular recovery: compressed restart (context exhaustion)"
            ;;
    esac
fi
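The restart boost in the restart_compressed branch is simple arithmetic, but the cap interacts with the default in a way worth checking. The sketch below isolates it in a hypothetical helper, `compute_boosted_restarts`, whose first argument stands in for MAX_RESTARTS_CFG.

```shell
#!/usr/bin/env bash
# Sketch only: compute_boosted_restarts is a hypothetical helper that isolates
# the arithmetic from the restart_compressed branch.

compute_boosted_restarts() {
    local configured="${1:-3}"   # stands in for MAX_RESTARTS_CFG
    local retry_count="${2:-0}"
    local cap=5
    local boosted=$(( configured + retry_count + 2 ))
    if [[ "$boosted" -gt "$cap" ]]; then
        boosted="$cap"
    fi
    echo "$boosted"
}

compute_boosted_restarts 1 0   # prints 3
compute_boosted_restarts 3 0   # prints 5 (3 + 0 + 2, exactly at the cap)
compute_boosted_restarts 3 2   # prints 5 (7 capped to 5)
```

With the default of 3, the result is already 5 on the first retry, so retry_count only changes the outcome when the configured value is below 3.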

Step 6: Create test suite scripts/sw-lib-build-loop-failure-test.sh

Standard test harness pattern with these test groups:

Group 1: classify_build_loop_failure tests (12 cases)

  1. Returns dependency_issue for "Module not found" errors
  2. Returns dependency_issue for "npm ERR! ERESOLVE" errors
  3. Returns dependency_issue for "ModuleNotFoundError" errors
  4. Returns test_flakiness for alternating pass/fail history
  5. Returns test_flakiness for EADDRINUSE errors
  6. Returns infinite_loop for 3+ repeated same errors
  7. Returns infinite_loop for "stuck" status in progress.md
  8. Returns context_exhaustion for "exhausted" status
  9. Returns context_exhaustion for token/context error patterns
  10. Returns code_error as default (assertion errors, no special patterns)
  11. Returns code_error when error-summary.json is missing
  12. Returns code_error when error-summary.json is malformed

Group 2: get_recovery_strategy tests (6 cases)

  13. Returns rerun_tests action for test_flakiness
  14. Returns reinstall_deps action for dependency_issue
  15. Returns reduce_and_redirect action for infinite_loop
  16. Returns restart_compressed action for context_exhaustion
  17. Returns standard_retry action for code_error
  18. Returns valid JSON for unknown mode (defaults to standard_retry)

Group 3: detect_* function tests (8 cases)

  19. detect_test_flakiness returns 0 with alternating pass/fail history
  20. detect_test_flakiness returns 1 with consistent failures
  21. detect_infinite_loop returns 0 with 3+ identical error lines
  22. detect_infinite_loop returns 1 with varying errors
  23. detect_dependency_issue returns 0 with npm ERESOLVE in error_lines
  24. detect_dependency_issue returns 1 with assertion errors
  25. detect_context_exhaustion returns 0 with exhausted status
  26. detect_context_exhaustion returns 1 with normal running status

Group 4: Integration tests (2 cases)

  27. Classification writes well-formed failure-mode.json
  28. Priority: dependency_issue beats code_error with mixed signals

Setup pattern per test:

  • Create temp $LOG_DIR with mock error-summary.json and progress.md
  • For history tests, create restart-1/, restart-2/ with archived error summaries
  • Source build-loop-failure.sh after stubbing emit_event
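The setup pattern could be sketched as follows. Several things here are assumptions: the `assert_eq` signature (description, expected, actual) is a local stand-in for whatever test-helpers.sh provides, `make_fixture` is a hypothetical helper, and the progress.md layout is guessed. The error-summary.json fields follow the format listed under Dependency Analysis.

```shell
#!/usr/bin/env bash
# Sketch only. ASSUMPTIONS: assert_eq takes (description, expected, actual);
# the real test-helpers.sh may differ. make_fixture is a hypothetical helper,
# and the progress.md layout is a guess.

emit_event() { :; }   # stub so the library under test emits nothing

assert_eq() {
    local desc="$1" expected="$2" actual="$3"
    if [[ "$expected" == "$actual" ]]; then
        echo "PASS: $desc"
    else
        echo "FAIL: $desc (expected '$expected', got '$actual')"
        return 1
    fi
}

# Create a temp LOG_DIR with a current error summary, progress.md,
# and two archived summaries for history-based tests.
make_fixture() {
    local dir
    dir=$(mktemp -d) || return 1
    mkdir -p "$dir/restart-1" "$dir/restart-2"
    printf '%s\n' '{"iteration":3,"error_count":1,"error_lines":["Cannot find module x"],"test_cmd":"npm test"}' \
        > "$dir/error-summary.json"
    cp "$dir/error-summary.json" "$dir/restart-1/error-summary.json"
    cp "$dir/error-summary.json" "$dir/restart-2/error-summary.json"
    printf 'Iteration: 3\nTests passing: false\nStatus: running\n' > "$dir/progress.md"
    echo "$dir"
}

LOG_DIR=$(make_fixture)
# The real suite would source lib/build-loop-failure.sh at this point.
assert_eq "fixture has current summary" "yes" \
    "$([[ -f "$LOG_DIR/error-summary.json" ]] && echo yes || echo no)"
rm -rf "$LOG_DIR"
```

Each test would then vary the fixture (for example, different error_lines per restart archive) before calling the function under test.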

Step 7: Register test in package.json

Add to the scripts section in package.json:

"test:build-loop-failure": "bash scripts/sw-lib-build-loop-failure-test.sh"

And append to the aggregate test script.


Task Checklist

  • Task 1: Create scripts/lib/build-loop-failure.sh with all classification and recovery functions
  • Task 2: Add --failure-mode CLI flag to scripts/sw-loop.sh with validation
  • Task 3: Source library in sw-loop.sh and integrate classification after test failures
  • Task 4: Integrate classification into run_loop_with_restarts() for mid-loop recovery
  • Task 5: Enhance daemon-failure.sh to use granular classification for retry strategy
  • Task 6: Create test suite scripts/sw-lib-build-loop-failure-test.sh (28 test cases)
  • Task 7: Register test suite in package.json
  • Task 8: Run new test suite and fix any failures
  • Task 9: Run sw-lib-daemon-failure-test.sh to verify no regressions
  • Task 10: Run sw-loop-test.sh to verify no regressions

Testing Approach

Test Pyramid Breakdown

  • Unit tests (26 cases, Groups 1-3): Test each classification function in isolation with mock error-summary.json and progress.md. Covers all 5 failure modes, edge cases (missing files, malformed JSON), and priority ordering.
  • Integration tests (2 cases): Test that classification writes correct failure-mode.json and priority ordering works end-to-end.
  • Regression tests (2 existing suites): Run sw-lib-daemon-failure-test.sh and sw-loop-test.sh.

Coverage Targets

  • Each of the 5 failure modes has at least 2 test cases
  • All 4 detection functions tested for both positive and negative cases
  • get_recovery_strategy() tested for all 5 modes + unknown fallback
  • Error handling: missing files, malformed JSON, empty error_lines

Critical Paths to Test

  • Happy path: Each failure mode correctly identified from representative error patterns
  • Error case 1: Missing error-summary.json → defaults to code_error
  • Error case 2: Malformed JSON in error-summary.json → defaults to code_error
  • Edge case 1: Mixed signals (dependency error + test flakiness) → priority order respected (dependency wins)
  • Edge case 2: Empty progress.md with real errors → classifies based on error content alone

Definition of Done

  • classify_build_loop_failure() correctly classifies all 5 failure modes from error-summary.json and progress.md
  • get_recovery_strategy() returns mode-specific recovery parameters as valid JSON
  • Failure mode and recovery strategy logged to events.jsonl via emit_event
  • Daemon retry logic (daemon_on_failure) uses granular classification for mode-specific retry args
  • --failure-mode flag on sw-loop.sh allows manual override for testing (with validation)
  • Test suite passes with 28 test cases covering all 5 modes + edge cases
  • Existing sw-lib-daemon-failure-test.sh passes (no regressions)
  • Existing sw-loop-test.sh passes (no regressions)
  • All code is Bash 3.2 compatible
  • Library gracefully degrades when not available (guarded with type checks)

Systematic Debugging: Previous Attempt Analysis

Root Cause Hypothesis

No previous attempt has failed; this is the first plan stage. The following is a proactive analysis of potential failure modes:

  1. Most likely: Bash 3.2 compatibility issue — using readarray, declare -A, or ${var,,} in the new library. Evidence to confirm: Shellcheck or macOS test run. Mitigation: Strict adherence to project conventions, test on macOS.

  2. Possible: jq not available in test environment — classification falls back to wrong path. Evidence: Check mock setup in test harness. Mitigation: Test both jq-present and jq-absent paths.

  3. Unlikely: Sourcing order issue — build-loop-failure.sh tries to use functions from a file sourced after it. Evidence: Check dependency chain. Mitigation: Library is self-contained, no external function deps.
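For the Bash 3.2 risk in item 1, the usual substitutes for the ruled-out features are sketched below. The helper names are hypothetical; the techniques themselves (tr for lowercasing, a `case` statement in place of an associative array, a `while read` loop in place of readarray) work on Bash 3.2.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical. Each part avoids a
# Bash 4+ feature that the plan rules out.

# Instead of ${var,,}: lowercase with tr.
to_lower() {
    printf '%s' "$1" | tr '[:upper:]' '[:lower:]'
}

# Instead of declare -A: a case statement as a fixed lookup table.
action_for_mode() {
    case "$1" in
        test_flakiness)     echo "rerun_tests" ;;
        dependency_issue)   echo "reinstall_deps" ;;
        infinite_loop)      echo "reduce_and_redirect" ;;
        context_exhaustion) echo "restart_compressed" ;;
        *)                  echo "standard_retry" ;;
    esac
}

# Instead of readarray: a while-read loop appending to an indexed array.
lines=()
while IFS= read -r line; do
    lines+=("$line")
done <<EOF
first error
second error
EOF

echo "${#lines[@]} lines, action: $(action_for_mode "$(to_lower "TEST_FLAKINESS")")"
```

Running ShellCheck alone may not flag every Bash 4 construct, so a run under the macOS system bash remains the more direct confirmation.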

Evidence Gathered

  • daemon-failure.sh:1-71 — existing classification reads log tail, returns string
  • sw-loop.sh:1051-1112 — write_error_summary() format is stable
  • sw-loop.sh:193-325 — CLI parsing pattern is well-established
  • sw-loop.sh:2440-2500 — restart wrapper has clear hook points
  • All existing libraries use _LOADED guard pattern

Fix Strategy

Not applicable (first attempt). Approach is conservative:

  • New library with no external deps (self-contained)
  • Integration via type guards (graceful fallback)
  • Additive changes only (no existing behavior modified unless granular classification available)

Verification Plan

  1. Run sw-lib-build-loop-failure-test.sh — all 28 tests pass
  2. Run sw-lib-daemon-failure-test.sh — no regressions
  3. Run sw-loop-test.sh — no regressions
  4. Manual test: shipwright loop "test" --failure-mode test_flakiness exits cleanly

Alternatives Considered

Alternative 1: ML-based classification using Claude API

  • Approach: Send error logs to Claude API for classification
  • Pros: More accurate, handles novel error patterns
  • Cons: Adds API cost per failure, adds latency, requires API availability
  • Trade-offs: Higher accuracy but much higher cost/complexity. Pattern matching is expected to handle an estimated 90%+ of cases.
  • Verdict: Rejected — deterministic pattern matching is sufficient and free

Alternative 2: Extend root-cause.sh with loop-specific patterns

  • Approach: Add build-loop categories to rootcause_classify()
  • Pros: Reuses confidence scoring, evidence collection
  • Cons: root-cause.sh reads error-log.jsonl (pipeline-level), not error-summary.json (loop-level). Different data sources.
  • Trade-offs: Less code duplication but worse separation of concerns
  • Verdict: Rejected — different data sources make clean integration difficult

Alternative 3: Configuration-driven classification (JSON rules file)

  • Approach: Define patterns in a JSON config that maps regex → failure mode
  • Pros: User-configurable without code changes
  • Cons: Stateful detectors (flakiness=alternating history, infinite_loop=repeated errors) can't be expressed as simple regex rules
  • Trade-offs: More flexible but significantly more complex for marginal benefit
  • Verdict: Rejected — stateful detection logic requires code, not just patterns
