
Pipeline Design 184

Seth Ford edited this page Mar 10, 2026 · 2 revisions



Design: Failure Root Cause Classifier with Automated Platform Issue Creation

Context

Pipeline failures today are classified by daemon-failure.sh:classify_failure() into 6 coarse categories (auth_error, api_error, invalid_issue, context_exhaustion, build_failure, unknown). This classification drives retry strategy but provides no root cause analysis, no fix suggestions, no historical learning, and no platform bug auto-filing.

A parallel library scripts/lib/root-cause.sh (427 lines, 25+ tests) already exists on this branch with 7 finer-grained categories, fix suggestions, a learning system (root-causes.jsonl), and platform issue auto-creation. However, nothing calls it — the daemon doesn't source it, there's no CLI entry point, and the dashboard doesn't surface its data.

Constraints:

  • Bash 3.2 compatibility (no associative arrays, no readarray)
  • Daemon failure path is hot — must not block or crash on rootcause failures
  • root-cause.sh is a sourced library (module guard pattern), not a standalone script
  • Dashboard uses Bun/TypeScript with a pattern of app.get("/api/...") handlers returning Response.json()
  • All new functions must degrade gracefully when dependencies (jq, gh, events.jsonl) are missing

Decision

Integrate the existing lib/root-cause.sh library into the daemon failure path and surface data through CLI + dashboard. Do not build a new classification system.

Data Flow

Pipeline Fails
      │
      ▼
daemon_on_failure()                    ←── scripts/lib/daemon-failure.sh
      │
      ├─ classify_failure()            ←── existing coarse classifier (unchanged)
      │       │
      │       ▼ failure_class
      │
      ├─ record_failure_class()        ←── existing (unchanged)
      │
      ├─ rootcause_main()  ◄── NEW ──── scripts/lib/root-cause.sh
      │       │
      │       ├─ rootcause_classify()      regex → category + confidence
      │       │       │
      │       │       └─ rootcause_boost_from_history()  ◄── NEW
      │       │               reads root-causes.jsonl for pattern frequency
      │       │
      │       ├─ rootcause_suggest_fix()   category → actionable text
      │       │
      │       ├─ rootcause_learn()         record in root-causes.jsonl
      │       │
      │       └─ rootcause_create_platform_issue()  (if platform_bug, conf>70%)
      │               creates GitHub issue via gh CLI
      │
      ├─ emit_event "daemon.root_cause_classified"
      │
      ├─ Enhanced retry comment         ←── adds root cause + suggestions
      │
      └─ Enhanced final failure comment ←── adds root cause analysis section


CLI: shipwright root-cause
      │
      ├─ classify <msg>  → rootcause_main()
      ├─ analyze         → rootcause_analyze_error_log()
      ├─ report          → rootcause_report()
      └─ history         → rootcause_analyze_history()  ◄── NEW


Dashboard: GET /api/root-cause/breakdown
      │
      └─ Reads ~/.shipwright/optimization/root-causes.jsonl
              │
              └─ Aggregates by category, daily, trends
                      │
                      └─ Insights tab: bar chart + trend badge + recent errors
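
The CLI branch of the flow above could be wired with a small dispatcher. This is a minimal sketch, assuming the rootcause_* functions are provided by a sourced lib/root-cause.sh; the names rootcause_cli and _rc_call are hypothetical and only illustrate the subcommand mapping.

```bash
#!/usr/bin/env bash
# Sketch of a subcommand dispatcher for `shipwright root-cause`.
# Assumption: the rootcause_* functions come from a sourced lib/root-cause.sh.
# rootcause_cli and _rc_call are hypothetical names, not part of the library.

# Call a library function only if it is defined; otherwise fail softly.
_rc_call() {
  local fn="$1"
  shift
  if type "$fn" >/dev/null 2>&1; then
    "$fn" "$@"
  else
    echo "root-cause: $fn is unavailable (is lib/root-cause.sh sourced?)" >&2
    return 1
  fi
}

rootcause_cli() {
  local cmd="${1:-}"
  if [ $# -gt 0 ]; then shift; fi
  case "$cmd" in
    classify) _rc_call rootcause_main "$@" ;;
    analyze)  _rc_call rootcause_analyze_error_log "$@" ;;
    report)   _rc_call rootcause_report "$@" ;;
    history)  _rc_call rootcause_analyze_history "$@" ;;
    *)
      echo "usage: shipwright root-cause {classify <msg>|analyze|report|history}" >&2
      return 2
      ;;
  esac
}
```

A case statement keeps the dispatcher Bash 3.2 compatible, and the type check means a missing library produces an error message and a non-zero status rather than a "command not found" crash.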

Component Diagram

┌─────────────────────────────────────────────────────────────┐
│                     DAEMON LAYER                            │
│  sw-daemon.sh                                               │
│    └─ lib/daemon-failure.sh                                │
│         ├─ classify_failure()      [existing, unchanged]    │
│         ├─ record_failure_class()  [existing, unchanged]    │
│         └─ rootcause integration   [NEW: ~30 lines]        │
│              wraps rootcause_main() in || true guard        │
└──────────────────────┬──────────────────────────────────────┘
                       │ sources
┌──────────────────────▼──────────────────────────────────────┐
│                  CLASSIFIER LIBRARY                         │
│  lib/root-cause.sh  [existing: 427 lines, adding ~80]      │
│    ├─ rootcause_classify()          [existing + boost]      │
│    ├─ rootcause_suggest_fix()       [existing, unchanged]   │
│    ├─ rootcause_learn()             [existing, unchanged]   │
│    ├─ rootcause_create_platform_issue() [existing]          │
│    ├─ rootcause_analyze_error_log() [existing, unchanged]   │
│    ├─ rootcause_report()            [existing, unchanged]   │
│    ├─ rootcause_main()              [existing, unchanged]   │
│    ├─ rootcause_analyze_history()   [NEW]                   │
│    └─ rootcause_boost_from_history()[NEW]                   │
└──────────────────────┬──────────────────────────────────────┘
                       │ reads/writes
┌──────────────────────▼──────────────────────────────────────┐
│                   DATA LAYER                                │
│  ~/.shipwright/optimization/root-causes.jsonl               │
│    { category, confidence, message, recorded_at }           │
│  ~/.shipwright/events.jsonl                                 │
│    daemon.root_cause_classified events                      │
│  .claude/pipeline-artifacts/error-log.jsonl                 │
│    raw error entries from PostToolUse hook                   │
└──────────────────────┬──────────────────────────────────────┘
                       │ read by
┌──────────────────────▼──────────────────────────────────────┐
│                  DASHBOARD LAYER                            │
│  dashboard/server.ts                                        │
│    └─ GET /api/root-cause/breakdown  [NEW endpoint]         │
│  dashboard/src/types/api.ts                                 │
│    └─ RootCauseBreakdown interface   [NEW type]             │
│  dashboard/src/core/api.ts                                  │
│    └─ fetchRootCauseBreakdown()      [NEW wrapper]          │
│  dashboard/src/views/insights.ts                            │
│    └─ Root cause breakdown card      [NEW visualization]    │
└─────────────────────────────────────────────────────────────┘
                       │ dispatches to
┌──────────────────────▼──────────────────────────────────────┐
│                    CLI LAYER                                 │
│  scripts/sw                                                 │
│    └─ root-cause) dispatch           [NEW: 1 line]          │
│  scripts/sw-root-cause.sh            [NEW: ~80 lines]       │
│    └─ classify | analyze | report | history                 │
└─────────────────────────────────────────────────────────────┘

Interface Contracts

Bash — rootcause_classify(error_message, stage, exit_code) → JSON

Input:  error_message: string, stage: string ("build"|"test"|...), exit_code: string
Output: {"category": "code_bug"|"infra_issue"|"rate_limit"|"context_exhaustion"|
          "platform_bug"|"config_error"|"external_dep"|"unknown",
         "confidence": 0-100,
         "evidence": string[],
         "suggested_action": string}
Error:  Returns {"category":"unknown","confidence":0,...} on empty input — never fails
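
To make the contract concrete, here is a minimal sketch of a regex-to-category classifier that always emits JSON. The function names, the patterns, and the confidence values are illustrative assumptions, not the ones in lib/root-cause.sh, and the evidence and suggested_action fields are left empty.

```bash
#!/usr/bin/env bash
# Sketch only: patterns and confidence values below are assumptions,
# not the rules used by the real rootcause_classify().

# Case-insensitive extended-regex match; non-zero status means no match.
_rc_matches() {
  printf '%s' "$1" | grep -E -i -q -- "$2"
}

rootcause_classify_sketch() {
  local msg="${1:-}"
  local category="unknown" confidence=0

  if [ -z "$msg" ]; then
    : # empty input keeps the unknown/0 defaults instead of failing
  elif _rc_matches "$msg" 'rate.?limit|429|too many requests'; then
    category="rate_limit"; confidence=85
  elif _rc_matches "$msg" 'context (length|window)|token limit'; then
    category="context_exhaustion"; confidence=80
  elif _rc_matches "$msg" 'connection refused|timed out|no space left'; then
    category="infra_issue"; confidence=70
  fi

  # Built with printf so the sketch works even when jq is missing.
  printf '{"category":"%s","confidence":%d,"evidence":[],"suggested_action":""}\n' \
    "$category" "$confidence"
}
```

Calling rootcause_classify_sketch "rate limit 429" prints a JSON object whose category is rate_limit, and an empty argument yields the unknown/0 fallback described in the Error line.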

Bash — rootcause_analyze_history() → JSON

Input:  none (reads ~/.shipwright/optimization/root-causes.jsonl)
Output: {"total": number, "categories": {"code_bug": N, ...}, "trends": {...}}
Error:  Returns {"total":0,"categories":{},"trends":{}} if file missing
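
A rough sketch of the per-category aggregation, written without associative arrays or jq to respect the Bash 3.2 and graceful-degradation constraints. It assumes each JSONL record is compact JSON containing "category":"<name>" with no spaces around the colon; the function name is hypothetical, and the trends object is left empty because its shape is not specified here.

```bash
#!/usr/bin/env bash
# Sketch only: assumes compact records such as
#   {"category":"code_bug","confidence":80,"message":"...","recorded_at":"..."}
# The "trends" object is emitted empty; its real shape is not defined here.

rootcause_analyze_history_sketch() {
  local file="${1:-$HOME/.shipwright/optimization/root-causes.jsonl}"
  local total=0 cats="" count name

  if [ ! -f "$file" ]; then
    echo '{"total":0,"categories":{},"trends":{}}'
    return 0
  fi

  # sort | uniq -c yields "<count> <category>" lines; the heredoc keeps the
  # while loop in the current shell so total and cats survive the loop.
  while read -r count name; do
    [ -n "$name" ] || continue
    total=$((total + count))
    cats="${cats}${cats:+,}\"${name}\":${count}"
  done <<EOF
$(sed -n 's/.*"category":"\([^"]*\)".*/\1/p' "$file" 2>/dev/null | sort | uniq -c)
EOF

  printf '{"total":%d,"categories":{%s},"trends":{}}\n' "$total" "$cats"
}
```

Counting with sort and uniq makes the result independent of the order in which entries were written to the file.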

Bash — rootcause_boost_from_history(error_message, current_category) → number

Input:  error_message: string, current_category: string
Output: "0" | "5" | "10" (confidence boost in points, printed to stdout)
Error:  Returns "0" if learn file missing or grep fails

Bash — rootcause_main(error_message, stage, exit_code) → JSON

Input:  error_message: string, stage: string, exit_code: string
Output: {"classification": <classify output>, "fix": <suggest_fix output>}
Error:  Writes an error message to stderr and exits with status 1 on empty input
Side effects: adds an entry to root-causes.jsonl (see Risk areas for ordering), may create GitHub issue

TypeScript — GET /api/root-cause/breakdown?days=30

interface RootCauseBreakdown {
  total: number;
  breakdown: Record<string, number>;        // category → count
  daily: Record<string, Record<string, number>>; // "2026-03-09" → {category → count}
  trends: {
    platform_bugs_24h: number;
    platform_bugs_7d: number;
    trend: "increasing" | "stable_or_decreasing" | "no_data";
  };
  top_errors: Array<{
    category: string;
    confidence: number;
    message: string;                         // truncated to 100 chars
  }>;
}
// Error: 500 {"error": {"code": "INTERNAL_ERROR", "message": "..."}}

TypeScript — fetchRootCauseBreakdown(days?: number) → Promise<RootCauseBreakdown>

Error Boundaries

Component | Error source | Handling
--- | --- | ---
daemon-failure.sh rootcause block | rootcause_main crashes | Wrapped in a || true guard
rootcause_classify | jq not available | Falls back to echo '[]' for evidence array
rootcause_analyze_history | Missing jsonl file | Returns {"total":0,...}; empty-state-safe
rootcause_boost_from_history | grep/learn file missing | Returns "0"; no boost applied
rootcause_create_platform_issue | NO_GITHUB set, gh missing, API error | Returns 0 (skip) or 1 (failure logged), never crashes caller
Dashboard endpoint | Missing/corrupt jsonl | Returns {"total":0,"breakdown":{},...}; frontend renders empty state
Dashboard frontend | API 500 | Existing checkDone() pattern handles partial failures; card shows "No data"
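
The skip conditions for platform issue creation can be sketched as a pre-flight check that runs before any gh call. This is a hedged illustration: the function name and the reason strings are hypothetical, the 70% threshold comes from the data flow above, and the sketch deliberately stops short of invoking gh.

```bash
#!/usr/bin/env bash
# Sketch only: platform_issue_preflight and its reason strings are
# hypothetical. It prints "create" or "skip:<reason>" and always returns 0,
# so a caller never fails because of this check.

platform_issue_preflight() {
  local category="${1:-}" confidence="${2:-0}"

  if [ "$category" != "platform_bug" ]; then
    echo "skip:not_platform_bug"
  elif ! [ "$confidence" -gt 70 ] 2>/dev/null; then
    # Also taken when confidence is not an integer.
    echo "skip:low_confidence"
  elif [ -n "${NO_GITHUB:-}" ]; then
    echo "skip:no_github"
  elif ! command -v gh >/dev/null 2>&1; then
    echo "skip:gh_missing"
  else
    echo "create"
  fi
  return 0
}
```

Checking the cheap conditions first means the gh lookup only happens for failures that would actually qualify for an issue.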

Key Design Decisions

1. Two classifiers coexist — rootcause supplements, doesn't replace

Context: classify_failure() in daemon-failure.sh (6 categories) drives retry strategy. rootcause_classify() in root-cause.sh (7 categories) provides finer-grained analysis.

Decision: Keep both. The daemon's existing classify_failure determines retry behavior (auth_error → no retry, api_error → 4 retries with long backoff). The root cause classifier runs after as an enrichment layer — it adds category/confidence/suggestions to comments and events but does not alter retry logic.

Alternative rejected: Replace classify_failure with rootcause_classify. This would require mapping 7 rootcause categories to the 6 daemon categories and changing retry strategy — high risk for no immediate benefit.

Consequence: A failure might be classified as api_error by the daemon (retry 4x with 5min backoff) and rate_limit by root cause (for comments/analytics). The categories overlap but are not identical. This is acceptable: retry strategy should be conservative (daemon), while analytics benefit from precision (rootcause).

2. Historical boosting via pattern frequency, not ML

Context: The plan calls for "decision tree trained on historical patterns." In a bash environment with JSONL files, true ML is impractical.

Decision: Use simple frequency-based boosting: if rootcause_boost_from_history() finds that similar error messages (matched on their first 30 alphanumeric characters) have been classified 3+ times, boost confidence by 5 points; at 10+ times, boost by 10 points. This is a lookup, not a model.

Alternative rejected: Claude API call per failure for intelligent classification. Adds latency (2-5s), cost ($0.01+/failure), and a dependency on API availability in the failure path — precisely when APIs may be down.

Consequence: Classification accuracy is bounded by regex quality. The historical boost only reinforces existing classifications; it cannot reclassify. This is acceptable for v1, because the learning system captures data that could power a more sophisticated classifier later.
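
The frequency lookup described in this decision might look like the sketch below. It simplifies in two labeled ways: it matches on the first 30 raw characters of the message rather than normalizing to alphanumerics, and it assumes compact JSON records containing "category":"<name>". The function name and the optional third argument (the learn file path) are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: prints 0, 5, or 10 depending on how often a similar message
# was already recorded under the same category. Assumes compact JSONL
# records; messages containing JSON-escaped characters may not match.

rootcause_boost_sketch() {
  local msg="${1:-}" category="${2:-}"
  local file="${3:-$HOME/.shipwright/optimization/root-causes.jsonl}"
  local key count

  key="${msg:0:30}"   # substring expansion is available in Bash 3.2
  if [ -z "$key" ] || [ -z "$category" ] || [ ! -f "$file" ]; then
    echo 0
    return 0
  fi

  # grep -F treats the key as a fixed string, so regex metacharacters in
  # error messages cannot break the lookup.
  count=$(grep -F -- "$key" "$file" 2>/dev/null \
    | grep -F -c -- "\"category\":\"${category}\"" 2>/dev/null || true)
  count="${count:-0}"

  if [ "$count" -ge 10 ]; then
    echo 10
  elif [ "$count" -ge 3 ]; then
    echo 5
  else
    echo 0
  fi
}
```

A caller that adds the boost to a 0-100 confidence score would presumably also clamp the result at 100; that clamp is an assumption and is not shown here.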

3. Daemon integration is fail-safe by design

Context: The failure handler is critical path — if it crashes, the daemon may leave issues in a broken state.

Decision: The entire rootcause block in daemon_on_failure is wrapped in guards:

  • type rootcause_main >/dev/null 2>&1 before calling (function may not exist if source failed)
  • All rootcause calls use 2>/dev/null || echo ""
  • Variables are initialized to safe defaults: root_cause_category="unknown", root_cause_confidence=0
  • Comment enhancements use ${var:+...} (only render if non-empty)

Consequence: If root-cause.sh fails to source or any function errors, the daemon behaves exactly as it does today — the rootcause block produces no output and no side effects.
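
The guards listed above can be combined roughly as follows. This is a sketch under stated assumptions: build_failure_comment is a hypothetical helper rather than part of daemon-failure.sh, the one-line comment layout is invented for illustration, and the JSON paths follow the rootcause_main contract in Interface Contracts.

```bash
#!/usr/bin/env bash
# Sketch only: build_failure_comment is a hypothetical helper showing the
# guard pattern; the real daemon_on_failure() may be structured differently.

build_failure_comment() {
  local base_comment="$1" error_message="${2:-}" stage="${3:-}" exit_code="${4:-}"
  local root_cause_category="unknown" root_cause_confidence=0
  local root_cause_json="" root_cause_section=""

  # Guard 1: only call the library if it was sourced successfully.
  if type rootcause_main >/dev/null 2>&1; then
    # Guard 2: swallow stderr and failures. The command substitution runs
    # in a subshell, so even an `exit` inside the call cannot end the caller.
    root_cause_json=$(rootcause_main "$error_message" "$stage" "$exit_code" 2>/dev/null || echo "")
  fi

  # Guard 3: parse only when there is output and jq is installed.
  if [ -n "$root_cause_json" ] && command -v jq >/dev/null 2>&1; then
    root_cause_category=$(printf '%s' "$root_cause_json" \
      | jq -r '.classification.category // "unknown"' 2>/dev/null || echo "unknown")
    root_cause_confidence=$(printf '%s' "$root_cause_json" \
      | jq -r '.classification.confidence // 0' 2>/dev/null || echo 0)
    if [ -n "$root_cause_category" ] && [ "$root_cause_category" != "unknown" ]; then
      root_cause_section="Root cause: ${root_cause_category} (${root_cause_confidence}% confidence)"
    fi
  fi

  # Guard 4: ${var:+...} renders the extra text only when it is non-empty.
  printf '%s%s\n' "$base_comment" "${root_cause_section:+ | $root_cause_section}"
}
```

When rootcause_main is undefined, fails, or returns unparsable output, the function prints the base comment unchanged, which mirrors the no-regression behavior this decision requires.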

4. Dashboard reads JSONL directly, no database

Context: Root cause data lives in ~/.shipwright/optimization/root-causes.jsonl. The dashboard could import it into SQLite (sw-db.sh) or read it directly.

Decision: Read JSONL directly via Bun.file().text() + line splitting + JSON.parse. Filter and aggregate in-memory.

Alternative rejected: SQLite import. Adds a migration, a new table, and a sync mechanism between jsonl and db. Over-engineered for a file that grows by ~1 entry per pipeline failure.

Consequence: Performance degrades if root-causes.jsonl grows very large. This is mitigated by filtering to the last N days (default 30) and reading with a line limit. At typical failure rates (5-20/day), even 6 months of data stays under 4,000 lines (20/day × ~183 days ≈ 3,660), which is trivial to parse.

Alternatives Considered

  1. ML-based classifier using Claude API calls — Pros: context-aware, handles novel error patterns, higher accuracy. Cons: adds API cost per failure ($0.01+), 2-5s latency in the failure path, fragile when the API itself is the failure cause (rate limits, outages), requires Bash-to-API plumbing. Rejected: over-engineered for v1; the regex classifier covers 7 well-defined categories and the learning system captures data for future sophistication.

  2. Build entirely new classification system from scratch — Pros: clean-slate design, no legacy patterns. Cons: discards 427 lines of working, tested code + 374 lines of passing tests. Violates the principle of building on existing work. The existing library already has classification, learning, issue creation, and reporting — the gap is purely integration. Rejected: wasteful.

Implementation Plan

  • Files to create: scripts/sw-root-cause.sh (CLI entry point, ~80 lines)
  • Files to modify:
    • scripts/lib/root-cause.sh — add rootcause_analyze_history(), rootcause_boost_from_history(), enhance rootcause_classify() (~80 lines added)
    • scripts/lib/daemon-failure.sh — source root-cause.sh, wire rootcause_main into daemon_on_failure(), enhance comments (~50 lines added)
    • scripts/sw — add root-cause) dispatch (1 line)
    • dashboard/server.ts — add /api/root-cause/breakdown endpoint (~50 lines)
    • dashboard/src/types/api.ts — add RootCauseBreakdown interface (~15 lines)
    • dashboard/src/core/api.ts — add fetchRootCauseBreakdown() (~2 lines)
    • dashboard/src/views/insights.ts — add breakdown card to Insights tab (~60 lines)
    • scripts/sw-root-cause-test.sh — add tests for new functions (~50 lines)
  • Dependencies: None new. Uses existing jq, gh, Bun.
  • Risk areas:
    • daemon_on_failure is the critical integration point — must be fail-safe (|| true wrapping)
    • rootcause_learn() prepends entries (newest first via tmp+mv) — rootcause_analyze_history() must account for this ordering when using tail
    • rootcause_boost_from_history() uses grep on the first 30 chars of error messages — special regex characters in error messages could cause grep failures; use grep -F (fixed string) not regex
    • The evidence array in rootcause_classify() uses a fragile printf + jq -Rs pattern (line 117) — existing, not being changed, but worth noting

Validation Criteria

  • ./scripts/sw-root-cause-test.sh passes all existing 25+ tests plus new tests for rootcause_analyze_history, rootcause_boost_from_history
  • ./scripts/sw-lib-daemon-failure-test.sh passes — daemon integration doesn't break existing failure handling
  • shipwright root-cause classify "rate limit 429" returns JSON with category: "rate_limit"
  • shipwright root-cause report produces formatted output when root-causes.jsonl exists
  • shipwright root-cause history returns valid JSON
  • When lib/root-cause.sh is absent or fails to source, daemon_on_failure() behaves identically to current behavior (no regression)
  • Dashboard endpoint GET /api/root-cause/breakdown returns valid JSON matching RootCauseBreakdown schema
  • Dashboard Insights tab renders breakdown card without errors when data is empty
  • Platform bugs with confidence >70% emit rootcause.platform_issue_created event (existing behavior, verified not regressed)
  • daemon.root_cause_classified event emitted on every failure with category, confidence, daemon_class fields
  • shipwright templates list still shows all pipeline templates (discoverability check)
