Pipeline Plan 184

Seth Ford edited this page Mar 10, 2026 · 2 revisions

Implementation Plan: Failure Root Cause Classifier with Automated Platform Issue Creation

Issue: #184
Branch: feat/-failure-root-cause-classifier-with-auto-184
Complexity: Standard
Estimated files: 7 modified, 1 new


Brainstorming & Design Decisions

Requirements Clarity

Minimum viable change: Wire the existing lib/root-cause.sh classifier (427 lines, already on this branch) into the daemon's failure handling path, add historical pattern learning from events.jsonl, create a dashboard breakdown visualization, and expose a CLI command.

Implicit requirements not stated:

  • The classifier library already exists but isn't called from the daemon — the critical integration is missing
  • Historical learning needs to feed back into classification confidence (not just regex)
  • Dashboard needs both an API endpoint and frontend component
  • CLI command needed for standalone shipwright root-cause usage

Acceptance criteria (from issue):

  1. Failure classifier analyzes error-log.jsonl and categorizes root cause — LIBRARY EXISTS
  2. Decision tree trained on historical failure patterns from events.jsonl — NEEDS IMPLEMENTATION
  3. Platform bugs trigger automatic GitHub issue creation — LIBRARY EXISTS, NEEDS DAEMON WIRING
  4. Dashboard shows failure breakdown by category — NEEDS IMPLEMENTATION
  5. Reduce repeat platform failures by >30% — MEASURABLE VIA LEARNING SYSTEM

Alternatives Considered

Approach A: Enhance existing library + add integration points (CHOSEN)

  • Pros: Minimal blast radius (7 files), builds on 427-line library with 25+ tests, test suite already passes
  • Cons: Regex-based classification has limits vs ML-based approach
  • Blast radius: 7 files modified, 1 new
  • Complexity: Low-medium

Approach B: ML-based classifier using Claude API calls

  • Pros: More sophisticated, context-aware classification
  • Cons: Over-engineered for bash, adds API cost per failure, adds latency to failure handling path, fragile in offline/local mode
  • Blast radius: 15+ files
  • Complexity: High

Approach C: Build entirely new classification system

  • Pros: Clean design from scratch
  • Cons: Discards 427 lines of working code + 374 lines of tests, massive waste
  • Blast radius: 20+ files
  • Complexity: Very high

Decision: Approach A — The library is feature-complete. The gap is purely integration: wiring it into the daemon, adding historical learning, and surfacing data in the dashboard.

Risk Analysis

| Risk | Impact | Likelihood | Mitigation |
|---|---|---|---|
| Daemon integration breaks failure handling | High | Low | Wrap all rootcause calls in ` |
| GitHub issue spam from auto-creation | Medium | Low | Already gated: confidence >70% + dedup via cksum signature |
| events.jsonl too large for analysis | Low | Medium | Read only last 200 entries, use tail not cat |
| Dashboard endpoint performance | Low | Low | Aggregate at read time, cache results |
| Root cause misclassification | Medium | Medium | Historical learning improves over time, unknown defaults to code_bug |
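
The issue-spam mitigation above combines a confidence gate with signature dedup. A minimal sketch of that gate follows, under stated assumptions: the function name `should_create_issue`, the 100-byte signature prefix, and the plain-text file of seen signatures are hypothetical and not taken from lib/root-cause.sh.

```bash
# Hypothetical helper: decide whether a classified failure should open an issue.
# Assumptions (not from lib/root-cause.sh): signature = cksum of the first
# 100 bytes of the message; seen signatures are stored one per line.
should_create_issue() {
    local category="$1" confidence="$2" message="$3" seen_file="$4"

    # Gate 1: only platform bugs above the 70% confidence threshold qualify.
    [ "$category" = "platform_bug" ] || return 1
    [ "$confidence" -gt 70 ] || return 1

    # Gate 2: skip messages whose signature was already recorded.
    local sig
    sig=$(printf '%s' "$message" | head -c 100 | cksum | awk '{print $1}')
    if [ -f "$seen_file" ] && grep -qxF "$sig" "$seen_file"; then
        return 1
    fi

    # First occurrence: remember the signature and allow issue creation.
    printf '%s\n' "$sig" >> "$seen_file"
    return 0
}
```

Calling it twice with the same message returns success only the first time, which is what keeps repeated failures from opening duplicate issues.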

Current State Assessment

What EXISTS (from WIP commit 6d70188):

| File | Lines | Status |
|---|---|---|
| scripts/lib/root-cause.sh | 427 | Complete: 7 categories, classify/analyze/create_issue/suggest_fix/learn/report/main |
| scripts/sw-root-cause-test.sh | 374 | Complete: 25+ tests covering all functions |
| scripts/sw-pipeline.sh:58 | 1 line | Sources root-cause.sh |
| package.json | 1 line | Test suite registered |

What's MISSING:

| Component | File | Description |
|---|---|---|
| CLI entry point | scripts/sw-root-cause.sh (NEW) | Standalone shipwright root-cause command |
| CLI router | scripts/sw | Add root-cause dispatch |
| Daemon integration | scripts/lib/daemon-failure.sh | Wire rootcause_main() into daemon_on_failure() |
| Historical learning | scripts/lib/root-cause.sh | New function: rootcause_analyze_history() using events.jsonl |
| Enhanced failure comment | scripts/lib/daemon-failure.sh | Include root cause + fix suggestions in GitHub comment |
| Dashboard API | dashboard/server.ts | GET /api/root-cause/breakdown endpoint |
| Dashboard frontend | dashboard/src/views/insights.ts | Failure breakdown visualization |
| Dashboard types | dashboard/src/types/api.ts | RootCauseBreakdown interface |
| Dashboard API wrapper | dashboard/src/core/api.ts | fetchRootCauseBreakdown() function |

Files to Modify

Modified Files (7):

  1. scripts/lib/root-cause.sh — Add rootcause_analyze_history() for events.jsonl historical learning, enhance rootcause_classify() to incorporate historical confidence boosting
  2. scripts/lib/daemon-failure.sh — Wire rootcause_main() into daemon_on_failure(), enhance failure comments with classification
  3. scripts/sw — Add root-cause command dispatch to CLI router
  4. dashboard/server.ts — Add GET /api/root-cause/breakdown endpoint
  5. dashboard/src/views/insights.ts — Add failure breakdown by category visualization
  6. dashboard/src/types/api.ts — Add RootCauseBreakdown TypeScript interface
  7. dashboard/src/core/api.ts — Add fetchRootCauseBreakdown() API wrapper

New Files (1):

  1. scripts/sw-root-cause.sh — CLI entry point for shipwright root-cause command (classify, analyze, report, history subcommands)

Implementation Steps

Step 1: Add Historical Pattern Analysis to root-cause.sh

File: scripts/lib/root-cause.sh

Add rootcause_analyze_history() function after rootcause_analyze_error_log() (after line 159). This function:

  • Reads last 200 entries from ~/.shipwright/events.jsonl where type matches daemon.failure_classified or memory.failure
  • Groups by failure class/category
  • Computes frequency distribution and recency weighting
  • Returns JSON with historical patterns and confidence adjustments

Enhance rootcause_classify() to call historical analysis when available:

  • After regex classification, check if ~/.shipwright/optimization/root-causes.jsonl has matching patterns
  • If a pattern has been seen 3+ times with the same category, boost confidence by 5 points (10 points at 10+ occurrences)
  • If a pattern was previously classified differently, flag as "disputed" in evidence

rootcause_analyze_history() {
    # NOTE: events_file is declared for the events.jsonl pass described above but
    # is not read yet; only the learned classifications are analyzed below
    local events_file="${HOME}/.shipwright/events.jsonl"
    local learn_file="${HOME}/.shipwright/optimization/root-causes.jsonl"

    # Analyze learned classifications
    [[ ! -f "$learn_file" ]] && { echo '{"total":0,"categories":{},"trends":{}}'; return 0; }

    # Category distribution from historical data
    local dist
    dist=$(tail -200 "$learn_file" 2>/dev/null | jq -s '
        group_by(.category) |
        map({key: .[0].category, value: length}) |
        from_entries
    ' 2>/dev/null || echo '{}')

    # Recent trend (last 24h vs last 7d)
    local recent_counts
    recent_counts=$(tail -200 "$learn_file" 2>/dev/null | jq -s '
        # Cutoffs computed in jq as ISO-8601 UTC strings; assumes recorded_at
        # is stored in the same format so string comparison is chronological
        ((now - 86400) | floor | todate) as $cutoff_1d |
        ((now - 7 * 86400) | floor | todate) as $cutoff_7d |
        {
            last_24h: [.[] | select(.recorded_at > $cutoff_1d)] | length,
            last_7d: [.[] | select(.recorded_at > $cutoff_7d)] | length,
            platform_bugs_24h: [.[] | select(.recorded_at > $cutoff_1d and .category == "platform_bug")] | length,
            platform_bugs_7d: [.[] | select(.recorded_at > $cutoff_7d and .category == "platform_bug")] | length
        }
    ' 2>/dev/null || echo '{}')

    # Total counts every line in the file, not just the last 200
    local total
    total=$(wc -l < "$learn_file" 2>/dev/null | tr -d ' ' || echo "0")

    jq -n --arg total "$total" --argjson categories "$dist" --argjson trends "$recent_counts" \
        '{total: ($total | tonumber), categories: $categories, trends: $trends}'
}
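
The bullet list above mentions recency weighting, which the function as drafted does not compute. One possible scheme is sketched here, purely as an assumption: exponential decay with a 7-day half-life. The helper name `rootcause_weighted_counts` and its tab-separated input format are hypothetical; the plan does not specify either.

```bash
# Hypothetical helper, not part of lib/root-cause.sh.
# Input (stdin): one "category<TAB>age_in_days" pair per line.
# Output: "category<TAB>weighted_count", sorted by category.
# Assumption: exponential decay with a configurable half-life (default 7 days).
rootcause_weighted_counts() {
    local half_life="${1:-7}"
    awk -F '\t' -v hl="$half_life" '
        NF >= 2 {
            # A failure hl days old counts half as much as one from today.
            sum[$1] += exp(-log(2) * $2 / hl)
        }
        END {
            for (cat in sum) printf "%s\t%.2f\n", cat, sum[cat]
        }
    ' | sort
}

# Usage: two platform bugs (today and 7 days ago) and one 14-day-old code bug.
printf 'platform_bug\t0\nplatform_bug\t7\ncode_bug\t14\n' | rootcause_weighted_counts 7
```

With a 7-day half-life the two platform bugs contribute 1 + 0.5 = 1.50 and the 14-day-old code bug contributes 0.25, so recent categories dominate the distribution even when raw counts are similar.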

Also add rootcause_boost_from_history() — a helper that checks learned patterns against the current error message to adjust confidence:

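rootcause_boost_from_history() {
    local error_msg="${1:-}"
    local current_category="${2:-}"
    local learn_file="${HOME}/.shipwright/optimization/root-causes.jsonl"

    [[ ! -f "$learn_file" ]] && { echo "0"; return 0; }

    # Search pattern: first 50 bytes of the message, stripped to alphanumerics
    # and spaces, then cut to 30 characters
    local pattern
    pattern=$(printf '%s' "$error_msg" | head -c 50 | sed 's/[^a-zA-Z0-9 ]//g' | head -c 30)
    [[ -z "$pattern" ]] && { echo "0"; return 0; }

    # Count learned entries in this category that contain the pattern.
    # grep -c prints "0" itself when nothing matches (and exits 1), so use
    # "|| true" rather than "|| echo 0", which would yield "0" twice.
    local matching
    matching=$(jq -c --arg cat "$current_category" \
        'select($cat == "" or .category == $cat)' "$learn_file" 2>/dev/null \
        | grep -c -F -- "$pattern" || true)

    # Boost: +5 if seen 3+ times, +10 if seen 10+ times
    if [[ "${matching:-0}" -ge 10 ]]; then
        echo "10"
    elif [[ "${matching:-0}" -ge 3 ]]; then
        echo "5"
    else
        echo "0"
    fi
}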
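
How the boost feeds back into the regex confidence is not spelled out in the plan. A minimal sketch, assuming the boost is simply added and the result capped at 100 (the helper name `rootcause_apply_boost` is hypothetical):

```bash
# Hypothetical helper: add a history boost to a regex-derived confidence,
# capping the result at 100 so the percentage stays meaningful.
rootcause_apply_boost() {
    local base="${1:-0}" boost="${2:-0}"
    local boosted=$((base + boost))
    if [ "$boosted" -gt 100 ]; then
        boosted=100
    fi
    echo "$boosted"
}

# Usage: a 95% regex match seen 10+ times before stays at the cap.
rootcause_apply_boost 95 10
```

Capping matters here because the auto-issue gate compares against a fixed 70% threshold; an uncapped sum would still pass the gate but would show confidence values above 100% in comments and on the dashboard.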

Step 2: Wire Root Cause into Daemon Failure Handler

File: scripts/lib/daemon-failure.sh

Integration point: After line 198 (record_failure_class "$failure_class") and before line 201 (retry escalation).

Source root-cause.sh at the top of daemon-failure.sh (after module guard):

# Root cause classifier (optional — degrades gracefully)
[[ -f "$SCRIPT_DIR/lib/root-cause.sh" ]] && source "$SCRIPT_DIR/lib/root-cause.sh" 2>/dev/null || true

Add root cause classification block after record_failure_class:

    # ── Root cause classification (Issue #184) ──
    local root_cause_result=""
    local root_cause_category="unknown"
    local root_cause_confidence=0
    local root_cause_fix=""
    if type rootcause_main >/dev/null 2>&1; then
        local error_tail=""
        local log_path="$LOG_DIR/issue-${issue_num}.log"
        [[ -f "$log_path" ]] && error_tail=$(tail -200 "$log_path" 2>/dev/null || true)

        if [[ -n "$error_tail" ]]; then
            root_cause_result=$(rootcause_main "$error_tail" "$failure_class" "$exit_code" 2>/dev/null || echo "")
            if [[ -n "$root_cause_result" ]]; then
                root_cause_category=$(echo "$root_cause_result" | jq -r '.classification.category // "unknown"' 2>/dev/null || echo "unknown")
                root_cause_confidence=$(echo "$root_cause_result" | jq -r '.classification.confidence // 0' 2>/dev/null || echo "0")
                root_cause_fix=$(echo "$root_cause_result" | jq -r '.fix.suggestions // ""' 2>/dev/null || echo "")
                daemon_log INFO "Root cause: ${root_cause_category} (${root_cause_confidence}% confidence)"
                emit_event "daemon.root_cause_classified" \
                    "issue=$issue_num" \
                    "category=$root_cause_category" \
                    "confidence=$root_cause_confidence" \
                    "daemon_class=$failure_class"
            fi
        fi
    fi

Enhance the retry comment (around lines 289-301) to include the root cause. Add after the existing retry table:

${root_cause_category:+
**Root Cause:** \`${root_cause_category}\` (${root_cause_confidence}% confidence)
${root_cause_fix:+**Suggested Fix:** ${root_cause_fix}}}

Enhance the final failure comment (around lines 371-391) to include the root cause classification. Add a new row to the table and a new section:

| Root Cause | \`${root_cause_category}\` (${root_cause_confidence}% confidence) |

And after the log details block:

${root_cause_fix:+
### 🔍 Root Cause Analysis

**Category:** \`${root_cause_category}\`
**Confidence:** ${root_cause_confidence}%
**Suggestions:** ${root_cause_fix}
}
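
Because root_cause_category is initialized to "unknown" in the classification block, a `${root_cause_category:+...}` guard is always non-empty and would render "unknown (0% confidence)" when no classification ran. One way to avoid that is to build the section in a small function. This is a sketch only; `format_root_cause_section` is a hypothetical name, not an existing helper in daemon-failure.sh.

```bash
# Hypothetical helper: render the root cause lines of a GitHub comment.
# Prints nothing when no real classification is available, so callers can
# append the output unconditionally.
format_root_cause_section() {
    local category="${1:-unknown}" confidence="${2:-0}" fix="${3:-}"

    # "unknown" is the default set before classification runs; skip it.
    if [ "$category" = "unknown" ]; then
        return 0
    fi

    printf '**Root Cause:** `%s` (%s%% confidence)\n' "$category" "$confidence"
    if [ -n "$fix" ]; then
        printf '**Suggested Fix:** %s\n' "$fix"
    fi
}

# Usage: capture the section once and interpolate it into the comment body.
root_cause_section=$(format_root_cause_section "platform_bug" 82 "Restart the daemon")
```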

Step 3: Create CLI Entry Point

File: scripts/sw-root-cause.sh (NEW)

Standard Shipwright script structure with subcommands:

  • classify <error_message> [--stage <stage>] — Classify a single error
  • analyze — Analyze error-log.jsonl for patterns
  • report — Generate root cause analytics report
  • history — Show historical pattern analysis from events.jsonl
  • help — Usage info

#!/usr/bin/env bash
# ╔═══════════════════════════════════════════════════════════════════════════╗
# ║  shipwright root-cause — Failure Root Cause Classification & Analytics   ║
# ╚═══════════════════════════════════════════════════════════════════════════╝
set -euo pipefail
VERSION="3.2.4"
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
source "$SCRIPT_DIR/lib/helpers.sh" 2>/dev/null || true
source "$SCRIPT_DIR/lib/root-cause.sh"

case "${1:-help}" in
    classify)
        shift
        # No "local" here: this code runs at top level, outside any function
        error_msg="${1:-}"
        stage="unknown"
        # Accept the documented --stage flag as well as a positional stage
        if [[ "${2:-}" == "--stage" ]]; then
            stage="${3:-unknown}"
        elif [[ -n "${2:-}" ]]; then
            stage="$2"
        fi
        rootcause_main "$error_msg" "$stage" "1"
        ;;
    analyze)
        rootcause_analyze_error_log
        ;;
    report)
        rootcause_report
        ;;
    history)
        rootcause_analyze_history
        ;;
    help|--help|-h)
        # show_help
        ;;
    *)
        # Fail loudly on unknown subcommands instead of exiting silently
        echo "Unknown command: $1" >&2
        exit 1
        ;;
esac

Step 4: Add CLI Router Dispatch

File: scripts/sw

Add to the command dispatch case statement (alphabetically near other r commands):

root-cause)     exec "$SCRIPT_DIR/sw-root-cause.sh" "$@" ;;

Step 5: Dashboard API Endpoint

File: dashboard/server.ts

Add new endpoint after the existing /api/memory/patterns endpoint (around line 3808):

// Root cause failure breakdown
app.get("/api/root-cause/breakdown", async (req) => {
    const url = new URL(req.url);
    // Falls back to 30 when the parameter is missing, non-numeric, or 0
    const days = Number.parseInt(url.searchParams.get("days") ?? "30", 10) || 30;

    // Read from root-causes.jsonl (learning system output)
    const rcFile = path.join(os.homedir(), ".shipwright/optimization/root-causes.jsonl");
    let classifications: Array<{category: string; confidence: number; message: string; recorded_at: string}> = [];

    try {
        const content = await Bun.file(rcFile).text();
        const cutoff = new Date(Date.now() - days * 86400000).toISOString();
        classifications = content.trim().split("\n")
            .filter(Boolean)
            .map(line => { try { return JSON.parse(line); } catch { return null; } })
            .filter((e): e is NonNullable<typeof e> => e !== null && e.recorded_at > cutoff);
    } catch { /* no data yet */ }

    // Aggregate by category
    const byCategory: Record<string, number> = {};
    const byDay: Record<string, Record<string, number>> = {};

    for (const c of classifications) {
        byCategory[c.category] = (byCategory[c.category] || 0) + 1;
        const day = c.recorded_at?.substring(0, 10) || "unknown";
        if (!byDay[day]) byDay[day] = {};
        byDay[day][c.category] = (byDay[day][c.category] || 0) + 1;
    }

    // Platform bug trend
    const now = Date.now();
    const platformBugs24h = classifications.filter(c =>
        c.category === "platform_bug" &&
        new Date(c.recorded_at).getTime() > now - 86400000
    ).length;
    const platformBugs7d = classifications.filter(c =>
        c.category === "platform_bug" &&
        new Date(c.recorded_at).getTime() > now - 7 * 86400000
    ).length;

    return Response.json({
        total: classifications.length,
        breakdown: byCategory,
        daily: byDay,
        trends: {
            platform_bugs_24h: platformBugs24h,
            platform_bugs_7d: platformBugs7d,
            trend: platformBugs7d > 0
                ? (platformBugs24h * 7 > platformBugs7d ? "increasing" : "stable_or_decreasing")
                : "no_data"
        },
        top_errors: classifications
            .slice(-20)
            .reverse()
            .map(c => ({ category: c.category, confidence: c.confidence, message: c.message?.substring(0, 100) }))
    });
});
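
The trend ternary in the endpoint compares the last day's platform-bug count, scaled to a week, against the 7-day total. A shell mirror of the same rule is sketched below, for example for reuse on the CLI side; `platform_bug_trend` is a hypothetical name and nothing in the plan requires it.

```bash
# Hypothetical shell mirror of the endpoint's trend rule.
# Arguments: platform bugs in the last 24h, platform bugs in the last 7d.
platform_bug_trend() {
    local last_24h="${1:-0}" last_7d="${2:-0}"
    if [ "$last_7d" -le 0 ]; then
        echo "no_data"
    elif [ $((last_24h * 7)) -gt "$last_7d" ]; then
        # Today's rate, projected over a week, exceeds the weekly total.
        echo "increasing"
    else
        echo "stable_or_decreasing"
    fi
}

# Usage: 2 bugs today against 8 this week projects to 14 > 8.
platform_bug_trend 2 8
```

Note that the comparison is strict: one bug today against seven this week (7 > 7 is false) reports stable_or_decreasing.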

Step 6: Dashboard Frontend — Types

File: dashboard/src/types/api.ts

Add interface:

export interface RootCauseBreakdown {
  total: number;
  breakdown: Record<string, number>;
  daily: Record<string, Record<string, number>>;
  trends: {
    platform_bugs_24h: number;
    platform_bugs_7d: number;
    trend: "increasing" | "stable_or_decreasing" | "no_data";
  };
  top_errors: Array<{
    category: string;
    confidence: number;
    message: string;
  }>;
}

Step 7: Dashboard Frontend — API Wrapper

File: dashboard/src/core/api.ts

Add function:

export const fetchRootCauseBreakdown = (days = 30) =>
  request<RootCauseBreakdown>(`/api/root-cause/breakdown?days=${days}`);

Step 8: Dashboard Frontend — Insights Visualization

File: dashboard/src/views/insights.ts

Make two changes in the Insights tab:

  1. Add fetchRootCauseBreakdown() to the parallel fetch calls
  2. Add a "Root Cause Breakdown" card to the Insights view

The card renders:

  • Bar chart (CSS-only, no external deps) showing category distribution
  • Platform bug trend indicator (increasing, stable or decreasing, or no data)
  • Top 5 recent errors with category badges
  • Color-coded by category (platform_bug=rose, code_bug=amber, infra_issue=cyan, etc.)

Step 9: Update Test Suite

File: scripts/sw-root-cause-test.sh

Add tests for the new functions:

  • test_analyze_history_empty — handles missing history file
  • test_analyze_history_with_data — returns correct distribution
  • test_boost_from_history — returns correct confidence boost
  • test_cli_classify — standalone CLI classify works
  • test_cli_report — standalone CLI report works
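
As an illustration of how test_analyze_history_empty could isolate itself from the real ~/.shipwright directory, the sketch below overrides HOME with a temporary directory. To stay self-contained it uses a stand-in that reproduces only the missing-file branch of rootcause_analyze_history(); the real test would source lib/root-cause.sh instead, and the harness conventions of sw-root-cause-test.sh are not shown here.

```bash
# Stand-in reproducing only the missing-file branch of the planned function.
# In the real suite this comes from sourcing scripts/lib/root-cause.sh.
rootcause_analyze_history() {
    local learn_file="${HOME}/.shipwright/optimization/root-causes.jsonl"
    [ -f "$learn_file" ] || { echo '{"total":0,"categories":{},"trends":{}}'; return 0; }
    echo "history present (not exercised in this sketch)"
}

test_analyze_history_empty() {
    local fake_home result
    fake_home=$(mktemp -d)

    # Point HOME at an empty directory so no history file can be found.
    result=$(HOME="$fake_home" rootcause_analyze_history)
    rm -rf "$fake_home"

    [ "$result" = '{"total":0,"categories":{},"trends":{}}' ]
}
```

Running the function inside a command substitution keeps the HOME override from leaking into later tests.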

Task Checklist

  • Task 1: Add rootcause_analyze_history() and rootcause_boost_from_history() to scripts/lib/root-cause.sh
  • Task 2: Enhance rootcause_classify() to incorporate historical confidence boosting
  • Task 3: Wire root cause classifier into daemon_on_failure() in scripts/lib/daemon-failure.sh
  • Task 4: Enhance daemon failure/retry GitHub comments with root cause classification
  • Task 5: Create scripts/sw-root-cause.sh CLI entry point
  • Task 6: Add root-cause dispatch to scripts/sw CLI router
  • Task 7: Add GET /api/root-cause/breakdown endpoint to dashboard/server.ts
  • Task 8: Add RootCauseBreakdown TypeScript interface to dashboard/src/types/api.ts
  • Task 9: Add fetchRootCauseBreakdown() to dashboard/src/core/api.ts
  • Task 10: Add failure breakdown visualization to dashboard/src/views/insights.ts
  • Task 11: Add tests for new functions in scripts/sw-root-cause-test.sh
  • Task 12: Run sw-root-cause-test.sh and sw-lib-daemon-failure-test.sh to verify

Task Dependencies

Task 1 → Task 2 (history functions needed before classify enhancement)
Task 1 → Task 3 (library must be complete before daemon wiring)
Task 3 → Task 4 (daemon integration before comment enhancement)
Task 1 → Task 5 (CLI wraps library functions)
Task 5 → Task 6 (router needs entry point)
Task 8 → Task 9 → Task 10 (types → API → UI)
Task 7 is independent (server-side endpoint)
Tasks 1-2 → Task 11 (tests for new functions)
All other tasks → Task 12

Testing Approach

Unit Tests (Tasks 11-12)

  1. Run existing sw-root-cause-test.sh — all 25+ tests must pass
  2. Add new tests for rootcause_analyze_history() and rootcause_boost_from_history()
  3. Run sw-lib-daemon-failure-test.sh — existing tests must still pass
  4. Test daemon integration by verifying rootcause_main is called (mock via function override)
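
The function-override mock in item 4 could look roughly like this. It is a sketch under assumptions: `classify_if_available` is a simplified stand-in for the relevant part of daemon_on_failure(), and the failure class "api_error" is an illustrative value only.

```bash
# Simplified stand-in for the daemon hook; the real daemon_on_failure() does more.
classify_if_available() {
    local error_tail="$1" failure_class="$2" exit_code="$3"
    # Same guard as the integration block: call only if the library is loaded.
    if type rootcause_main >/dev/null 2>&1; then
        rootcause_main "$error_tail" "$failure_class" "$exit_code" 2>/dev/null || true
    fi
}

test_rootcause_main_is_called() {
    local calls_file recorded
    calls_file=$(mktemp)

    # Override: record the arguments instead of running the classifier.
    rootcause_main() { printf '%s|%s|%s\n' "$1" "$2" "$3" >> "$calls_file"; }

    classify_if_available "rate limit 429" "api_error" "1"

    recorded=$(cat "$calls_file")
    rm -f "$calls_file"
    unset -f rootcause_main   # restore the "library not loaded" state

    [ "$recorded" = "rate limit 429|api_error|1" ]
}
```

After `unset -f`, calling classify_if_available again exercises the graceful-degradation path, where the guard skips classification entirely.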

Integration Tests

  1. Verify CLI shipwright root-cause classify "rate limit 429" returns correct JSON
  2. Verify shipwright root-cause report produces formatted output
  3. Verify dashboard endpoint returns valid JSON structure

Targeted Test Commands

# Core classifier tests
./scripts/sw-root-cause-test.sh

# Daemon failure handling tests
./scripts/sw-lib-daemon-failure-test.sh

# Dashboard API tests (if server running)
./scripts/sw-server-api-test.sh

Definition of Done

  • scripts/lib/root-cause.sh has rootcause_analyze_history() and rootcause_boost_from_history()
  • rootcause_classify() incorporates historical confidence boosting
  • daemon_on_failure() calls rootcause_main() on every failure
  • Daemon retry comments include root cause category + confidence + suggestions
  • Daemon final failure comments include root cause analysis section
  • Platform bugs with >70% confidence auto-create GitHub issues (already in library)
  • shipwright root-cause CLI command works with classify/analyze/report/history subcommands
  • Dashboard API returns failure breakdown by category
  • Dashboard Insights tab shows failure breakdown visualization
  • All existing tests pass (sw-root-cause-test.sh, sw-lib-daemon-failure-test.sh)
  • New tests cover historical analysis and confidence boosting
  • Events emitted: daemon.root_cause_classified with category, confidence, daemon_class

Endpoint Specification (API Skill)

GET /api/root-cause/breakdown

Query parameters:

  • days (optional, default: 30) — Number of days of history to include

Response (200 OK):

{
  "total": 47,
  "breakdown": {
    "code_bug": 22,
    "platform_bug": 8,
    "infra_issue": 7,
    "rate_limit": 5,
    "context_exhaustion": 3,
    "config_error": 1,
    "external_dep": 1
  },
  "daily": {
    "2026-03-09": {"code_bug": 3, "platform_bug": 1},
    "2026-03-08": {"code_bug": 2, "infra_issue": 1}
  },
  "trends": {
    "platform_bugs_24h": 1,
    "platform_bugs_7d": 5,
    "trend": "increasing"
  },
  "top_errors": [
    {"category": "code_bug", "confidence": 85, "message": "AssertionError: expected 'foo'..."}
  ]
}

Error responses:

  • 500: {"error": {"code": "INTERNAL_ERROR", "message": "Failed to read root cause data"}}

Rate Limiting

Not applicable — internal dashboard endpoint, not public API.

Versioning

No versioning needed — internal API following existing dashboard patterns.

Root Cause Hypothesis (Systematic Debugging — previous plan stage failure)

  1. Most likely: Previous plan stage produced empty plan.md (context exhaustion or timeout) — confirmed by reading the file (empty). The library and tests already exist from the WIP commit, so the plan stage just needs to produce the plan document, not recreate the implementation.
  2. Possible: Previous attempt tried to re-implement everything from scratch instead of recognizing existing code — mitigated by this plan explicitly building on existing work.
  3. Unlikely: Fundamental architectural issue — the feature is straightforward integration work.

Evidence gathered:

  • plan.md was empty (line 1 only)
  • lib/root-cause.sh (427 lines) and test suite (374 lines) exist and are complete
  • sw-pipeline.sh:58 already sources the library
  • daemon-failure.sh has no rootcause references — the integration gap

Fix strategy: This plan documents the integration work needed. It does NOT re-implement the library — it builds on the existing 427-line implementation.

Verification plan: Run sw-root-cause-test.sh after each change; run sw-lib-daemon-failure-test.sh after daemon integration.
