
Pipeline Design 184

Seth Ford edited this page Mar 10, 2026 · 2 revisions


Design: Failure Root Cause Classifier with Automated Platform Issue Creation

Context

Shipwright's daemon processes GitHub issues through a 12-stage pipeline. When pipelines fail, the daemon retries with escalation but has no systematic understanding of why failures occur. The same platform bug can trigger dozens of retries across different issues before a human notices the pattern. Meanwhile, error-log.jsonl (captured by the PostToolUse hook) and events.jsonl accumulate structured error data that goes unanalyzed.

Constraints:

  • All scripts must be Bash 3.2 compatible (macOS default) — no associative arrays, no readarray
  • Must integrate into the existing daemon_on_failure() call chain without breaking retry logic
  • Must not create a circular dependency (classifier must work when Claude CLI is unavailable, ruling out LLM-based classification)
  • Must respect NO_GITHUB for local/offline mode
  • Dashboard is TypeScript/Bun; frontend uses vanilla TS with no framework
  • Persistence must use append-only JSONL (no SQLite schema changes needed)

Decision

Regex-based pattern classifier with historical confidence boosting, integrated at four layers: core library, daemon failure handler, CLI, and dashboard.

Architecture (5 Components)

┌─────────────────────────────────────────────────────────────────┐
│                    Component Boundary Map                       │
│                                                                 │
│  ┌───────────────┐   ┌──────────────────┐   ┌───────────────┐ │
│  │ 1. Classifier  │──▶│ 2. Learning      │──▶│ 3. Issue      │ │
│  │ (pure logic)   │   │ (persistence)    │   │ (side effect) │ │
│  │ root-cause.sh  │   │ root-cause.sh    │   │ root-cause.sh │ │
│  └───────┬───────┘   └──────────────────┘   └───────────────┘ │
│          │                                                      │
│  ┌───────┴───────┐                          ┌───────────────┐  │
│  │ 4. Daemon      │                          │ 5. Dashboard  │  │
│  │ Integration    │                          │ (read-only)   │  │
│  │ daemon-        │                          │ server.ts +   │  │
│  │ failure.sh     │                          │ insights.ts   │  │
│  └───────────────┘                          └───────────────┘  │
│                                                                 │
│  Dependencies flow inward:                                      │
│  Daemon ──▶ Classifier ◀── Dashboard (reads JSONL only)        │
│  Issue Creator ──▶ Classifier (gets classification, then acts) │
└─────────────────────────────────────────────────────────────────┘

Component responsibilities:

  1. Classifier — Pure function: (error_message, stage, exit_code) → {category, confidence, evidence}. Seven pattern-matched categories with cascading regex priority, plus unknown for empty input or classifier errors. Default fallback when no pattern matches: code_bug at 45%.
  2. Learning Store — Append-only JSONL at ~/.shipwright/optimization/root-causes.jsonl. Atomic writes via tmpfile+mv. Capped at 500 entries. Feeds back into classifier via confidence boosting (±10 max).
  3. Issue Creator — Side-effect boundary: creates GitHub issues for platform_bug/config_error with >70% confidence. Deduplicates via cksum signature search. Guarded by NO_GITHUB.
  4. Daemon Integration — Calls rootcause_main() inside daemon_on_failure(), enriches retry/failure GitHub comments with root cause sections, emits daemon.root_cause events.
  5. Dashboard — Read-only consumer of root-causes.jsonl. Server endpoint aggregates by category with period filtering. Frontend renders colored distribution bars.
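
The cascading priority described for the Classifier can be illustrated with a short sketch. The shipped classifier is Bash (scripts/lib/root-cause.sh); the TypeScript below only mirrors the logic against the Classification contract. Every regex and per-rule confidence value is a hypothetical placeholder. Only the 45% code_bug fallback and the 0% unknown result for empty input come from this design.

```typescript
// Illustrative sketch only: patterns and per-rule confidences are assumptions,
// not the rules shipped in root-cause.sh.
type Category =
  | "rate_limit" | "context_exhaustion" | "infra_issue" | "platform_bug"
  | "config_error" | "external_dep" | "code_bug" | "unknown";

interface Classification {
  category: Category;
  confidence: number; // 0-99
  evidence: string[]; // matched pattern fragments
}

interface Rule {
  category: Category;
  pattern: RegExp;
  confidence: number;
}

// Order matters: the first matching rule wins (cascading priority).
const RULES: Rule[] = [
  { category: "rate_limit", pattern: /rate limit|too many requests/i, confidence: 90 },
  { category: "context_exhaustion", pattern: /context (window|length)|token limit/i, confidence: 85 },
  { category: "infra_issue", pattern: /no space left on device|out of memory/i, confidence: 80 },
  { category: "platform_bug", pattern: /shipwright: internal error/i, confidence: 80 },
  { category: "config_error", pattern: /missing config|invalid configuration/i, confidence: 75 },
  { category: "external_dep", pattern: /ECONNREFUSED|ETIMEDOUT/i, confidence: 70 },
  { category: "code_bug", pattern: /syntax error|test failed|TypeError/i, confidence: 70 },
];

// The real signature also takes stage and exit code; this sketch ignores both.
function classify(errorMessage: string, _stage: string, _exitCode: number): Classification {
  if (errorMessage.trim() === "") {
    return { category: "unknown", confidence: 0, evidence: [] };
  }
  for (const rule of RULES) {
    const match = rule.pattern.exec(errorMessage);
    if (match) {
      return { category: rule.category, confidence: rule.confidence, evidence: [match[0]] };
    }
  }
  // No pattern matched: default fallback from this design.
  return { category: "code_bug", confidence: 45, evidence: [] };
}
```

Because the rules are an ordered list, a message that mentions both a rate limit and a TypeError is classified as rate_limit; reordering the list changes the priority without touching the matching loop.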

Interface Contracts

// rootcause_classify(error_message: string, stage: string, exit_code: number): JSON
interface Classification {
  category: "rate_limit" | "context_exhaustion" | "infra_issue" | "platform_bug" | "config_error" | "external_dep" | "code_bug" | "unknown";
  confidence: number;   // 0-99, never 100 (epistemic humility)
  evidence: string[];   // matched pattern fragments
  suggested_action: string;
}

// rootcause_main(error_message, stage, exit_code): JSON
interface RootCauseResult {
  category: string;
  confidence: number;
  evidence: string[];
  suggested_action: string;
  fix_suggestions: string;
  actionability: number;  // 0-100
}

// GET /api/root-cause/breakdown?period=30
interface RootCauseBreakdown {
  breakdown: Array<{
    category: string;
    count: number;
    percentage: number;    // integer 0-100
    avg_confidence: number;
  }>;
  total: number;
  period: number;  // days
}

// Learning entry (one line of root-causes.jsonl)
interface LearningEntry {
  category: string;
  confidence: number;
  message: string;       // first 200 chars (truncated)
  recorded_at: string;   // ISO 8601
}

Error contracts:

  • Classifier never fails — returns {category:"unknown", confidence:0} on any error
  • Learning write failure returns exit 1 but is always called with || true from daemon
  • Issue creation failure logged as warning, never propagates
  • Dashboard endpoint always returns 200, empty {breakdown:[], total:0} on any error
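
The "never fails" contract can be pictured as a wrapper that turns any internal error into the unknown result. This is a minimal TypeScript sketch of the idea, not the Bash implementation; neverFail and Classifier are hypothetical names introduced for illustration.

```typescript
interface Classification {
  category: string;
  confidence: number;
  evidence: string[];
}

type Classifier = (errorMessage: string) => Classification;

// Wraps a classifier so that a thrown error degrades to {category: "unknown", confidence: 0}.
function neverFail(inner: Classifier): Classifier {
  return (errorMessage) => {
    try {
      return inner(errorMessage);
    } catch {
      return { category: "unknown", confidence: 0, evidence: [] };
    }
  };
}
```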

Data Flow

Pipeline failure (exit != 0)
        │
        ▼
daemon_on_failure()
        │
        ├─ 1. Extract last 100 lines of issue log
        │
        ├─ 2. rootcause_classify(log_tail, stage, exit_code)
        │      └─ Cascading regex: rate_limit → context_exhaustion → infra → platform → config → external → code_bug → unknown
        │
        ├─ 3. rootcause_boost_from_history(message, category, confidence)
        │      └─ Read root-causes.jsonl, match first 100 chars of message
        │         Agreement: +2/match (max +10, cap 99)
        │         Disagreement: -5/mismatch (floor 10)
        │
        ├─ 4. rootcause_suggest_fix(category) → actionable suggestions
        │
        ├─ 5. rootcause_learn(category, confidence, message)
        │      └─ Atomic append to root-causes.jsonl
        │
        ├─ 6. rootcause_create_platform_issue() [conditional]
        │      └─ Guards: platform_bug|config_error AND confidence>70 AND !NO_GITHUB
        │      └─ Dedup: cksum signature search in open issues
        │
        ├─ 7. emit_event("daemon.root_cause", category, confidence)
        │
        └─ 8. Enrich GitHub comment with root cause section
               └─ Collapsible <details> with category, confidence, suggestions

Dashboard (async, independent):
  GET /api/root-cause/breakdown
        │
        ├─ Read root-causes.jsonl
        ├─ Filter by recorded_at >= (now - period days)
        ├─ Group by category, aggregate count + avg confidence
        └─ Return sorted breakdown JSON
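
The dashboard steps above can be sketched as a pure aggregation over the JSONL text. This is an illustration under stated assumptions rather than the code in dashboard/server.ts: the function names, sorting by count descending, and rounding avg_confidence to an integer are all assumptions. Malformed lines are skipped and empty input yields an empty breakdown, matching the error contracts.

```typescript
interface LearningEntry {
  category: string;
  confidence: number;
  message: string;
  recorded_at: string; // ISO 8601
}

interface BreakdownRow {
  category: string;
  count: number;
  percentage: number; // integer 0-100
  avg_confidence: number;
}

interface RootCauseBreakdown {
  breakdown: BreakdownRow[];
  total: number;
  period: number; // days
}

// Returns null for malformed lines so the caller can skip them.
function parseEntry(line: string): LearningEntry | null {
  try {
    const v = JSON.parse(line);
    if (v && typeof v.category === "string" && typeof v.confidence === "number" &&
        typeof v.recorded_at === "string") {
      return v as LearningEntry;
    }
  } catch {
    // malformed JSON: fall through
  }
  return null;
}

function buildBreakdown(jsonl: string, periodDays: number, now: Date): RootCauseBreakdown {
  const cutoff = now.getTime() - periodDays * 24 * 60 * 60 * 1000;
  const groups = new Map<string, { count: number; confidenceSum: number }>();
  let total = 0;
  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue;
    const entry = parseEntry(line);
    if (entry === null) continue;
    const recordedAt = Date.parse(entry.recorded_at);
    if (Number.isNaN(recordedAt) || recordedAt < cutoff) continue;
    const group = groups.get(entry.category) ?? { count: 0, confidenceSum: 0 };
    group.count += 1;
    group.confidenceSum += entry.confidence;
    groups.set(entry.category, group);
    total += 1;
  }
  const breakdown = [...groups.entries()]
    .map(([category, g]) => ({
      category,
      count: g.count,
      percentage: Math.round((g.count / total) * 100),
      avg_confidence: Math.round(g.confidenceSum / g.count), // rounding is an assumption
    }))
    .sort((a, b) => b.count - a.count); // sort key is an assumption
  return { breakdown, total, period: periodDays };
}
```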

Error Boundaries

| Boundary | Failure Mode | Behavior |
|---|---|---|
| Classifier regex | No pattern matches | Falls back to code_bug at 45% confidence |
| Empty input | No error message provided | Returns unknown at 0% confidence |
| Learning file missing | First classification ever | Skip boosting, pass through original confidence |
| Learning write fails | Disk full, permissions | Returns exit 1; the daemon invokes it with `\|\| true`, so the failure is ignored |
| GitHub API fails | Network, auth, rate limit | Warning logged, daemon continues with retry logic |
| Duplicate issue check | gh CLI unavailable | Skip dedup, create issue anyway (rare double-create is acceptable) |
| JSONL parse error | Malformed line in history | jq silently skips via 2>/dev/null |
| Dashboard file read | Missing/corrupt JSONL | Returns {breakdown:[], total:0, period:N} |

Key Design Decisions

D1: Regex over LLM classification

  • Context: Need classification to work when Claude CLI is unavailable (since Claude failures are what we're classifying)
  • Decision: Cascading regex with 7 category patterns
  • Consequence: Less accurate on ambiguous errors, but zero external dependencies and sub-millisecond latency

D2: Append-only JSONL over SQLite

  • Context: sw-db.sh exists but adds complexity; learning data is write-heavy, read-infrequent
  • Decision: ~/.shipwright/optimization/root-causes.jsonl with atomic append and 500-entry cap
  • Consequence: Simple, no migrations, but linear scan for reads (acceptable at 500 entries)
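
The append-with-cap behavior can be sketched as a pure function over the file contents. This is an illustration, not the Bash code: the real store writes through a temp file and mv for atomicity, which the sketch leaves out, and appendCapped is a hypothetical name. The 500-entry cap and the 200-character message truncation come from this design.

```typescript
const MAX_ENTRIES = 500;

interface LearningEntry {
  category: string;
  confidence: number;
  message: string;
  recorded_at: string; // ISO 8601
}

// Returns the new JSONL contents: existing lines plus the new entry,
// trimmed to the newest `cap` lines (cap must be >= 1).
function appendCapped(existing: string, entry: LearningEntry, cap: number = MAX_ENTRIES): string {
  const lines = existing.split("\n").filter((line) => line.trim() !== "");
  const stored: LearningEntry = { ...entry, message: entry.message.slice(0, 200) };
  lines.push(JSON.stringify(stored));
  return lines.slice(-cap).join("\n") + "\n";
}
```

A caller would write the returned string to a temporary file and rename it over root-causes.jsonl, so readers such as the dashboard never observe a half-written file.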

D3: Confidence boosting bounded at ±10

  • Context: Historical learning must improve accuracy without runaway feedback loops
  • Decision: +2 per agreeing historical entry (max +10, cap 99), -5 per disagreeing entry (floor 10)
  • Consequence: Self-correcting — misclassifications get penalized, but a single bad entry can't tank confidence below 10
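
The arithmetic in D3 can be worked through in a short sketch. It assumes that history entries match on an exact comparison of the first 100 characters of the message, and that agreement and disagreement are applied together before clamping; neither detail is spelled out here, so treat both as assumptions. The function and type names are hypothetical.

```typescript
interface PastEntry {
  category: string;
  message: string;
}

function boostFromHistory(
  message: string,
  category: string,
  confidence: number,
  pastEntries: PastEntry[],
): number {
  const key = message.slice(0, 100);
  let agree = 0;
  let disagree = 0;
  for (const entry of pastEntries) {
    if (entry.message.slice(0, 100) !== key) continue;
    if (entry.category === category) agree += 1;
    else disagree += 1;
  }
  // No matching history: pass the original confidence through unchanged.
  if (agree === 0 && disagree === 0) return confidence;
  const boost = Math.min(agree * 2, 10); // +2 per agreeing entry, at most +10
  const penalty = disagree * 5;          // -5 per disagreeing entry
  return Math.max(10, Math.min(99, confidence + boost - penalty)); // floor 10, cap 99
}
```

For example, a 45% code_bug classification with three agreeing entries becomes 51%, while the same classification with ten disagreeing entries bottoms out at the floor of 10%.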

D4: Issue creation threshold at >70% confidence

  • Context: False positive GitHub issues create noise and erode trust
  • Decision: Only platform_bug and config_error categories with confidence strictly >70% trigger issue creation
  • Consequence: Misses some real platform bugs (false negatives), but avoids spamming the repo (acceptable trade-off since false negatives still get classified and logged)
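
The issue-creation guard reduces to a small predicate. The sketch below is illustrative: shouldCreateIssue is a hypothetical name, and NO_GITHUB is modeled as a boolean parameter rather than an environment variable.

```typescript
// True only for platform_bug or config_error with confidence strictly above 70,
// and only when GitHub access is enabled.
function shouldCreateIssue(category: string, confidence: number, noGithub: boolean): boolean {
  if (noGithub) return false;
  const eligible = category === "platform_bug" || category === "config_error";
  return eligible && confidence > 70;
}
```

Note the strict comparison: a platform_bug at exactly 70% is logged and learned from but does not open an issue.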

D5: Two-layer classification in daemon

  • Context: daemon-failure.sh already has a simple 5-class classify_failure() for retry strategy
  • Decision: Keep the simple classifier for retry decisions, add deep classifier (rootcause_main()) for analytics and issue creation
  • Consequence: Slight duplication, but the simple classifier drives retry logic (where speed matters) while the deep classifier drives learning and reporting

Alternatives Considered

  1. LLM-based classifier — Pros: Better accuracy on ambiguous errors, can understand novel failure modes / Cons: Creates circular dependency (Claude classifying Claude failures), adds API cost per failure, requires CLI availability, slower. Rejected for operational reliability.

  2. SQLite decision tree — Pros: Proper weighted decision tree, efficient queries, joins with events data / Cons: Requires schema migration, adds sw-db.sh dependency, more complex than needed for ~500 historical entries. Rejected as over-engineering — JSONL is sufficient at current scale.

  3. Inline classification in daemon only (no library) — Pros: Simpler, fewer files / Cons: Not reusable from CLI or pipeline scripts, can't test in isolation, violates single-responsibility. Rejected for testability and reuse.

Implementation Plan

Files Created

| File | Lines | Purpose |
|---|---|---|
| scripts/lib/root-cause.sh | ~428 | Core classifier, learning, issue creation, reporting |
| scripts/sw-root-cause.sh | ~197 | CLI entry point (classify/analyze/report/history) |
| scripts/sw-root-cause-test.sh | ~375 | 53 tests across 10 groups |

Files Modified

| File | Change |
|---|---|
| scripts/lib/daemon-failure.sh | Source root-cause.sh, call rootcause_main() in daemon_on_failure(), enrich GitHub comments |
| scripts/sw | Add root-cause) case to CLI dispatcher (~2 lines) |
| scripts/sw-pipeline.sh | Source lib/root-cause.sh |
| scripts/lib/pipeline-cli.sh | Source lib/root-cause.sh |
| scripts/lib/pipeline-commands.sh | Source lib/root-cause.sh |
| dashboard/server.ts | Add GET /api/root-cause/breakdown endpoint (~65 lines) |
| dashboard/src/types/api.ts | Add RootCauseBreakdown + RootCauseBreakdownEntry interfaces |
| dashboard/src/core/api.ts | Add fetchRootCauseBreakdown() function |
| dashboard/src/views/insights.ts | Add failure breakdown visualization (~35 lines) |
| scripts/sw-server-api-test.sh | Add breakdown endpoint test |

Dependencies

  • None new. Uses existing jq, gh, and Bun runtime.

Risk Areas

| Risk | Severity | Mitigation |
|---|---|---|
| Regex patterns too broad → misclassification | Medium | Conservative patterns requiring Shipwright-specific markers; 45% fallback confidence signals low certainty |
| root-causes.jsonl grows unbounded | Low | Capped at 500 entries via tail slice on write |
| daemon_on_failure() regression | Medium | Entire rootcause_main() call wrapped in `\|\| true`, so a classifier failure leaves the existing retry logic unchanged |
| Dashboard endpoint slow on large JSONL | Low | 500-entry cap makes linear scan trivial (<5ms) |

Validation Criteria

  • scripts/sw-root-cause-test.sh passes all 53 tests (classification per category, learning, boosting, issue creation guards, CLI subcommands)
  • scripts/sw-lib-daemon-failure-test.sh passes all 34 tests (no regression in retry/backoff logic)
  • scripts/sw-server-api-test.sh passes (new breakdown endpoint returns correct shape)
  • npx vitest run --config dashboard/vitest.config.ts passes all 284 tests (TypeScript types compile, API client works, insights view renders)
  • scripts/sw-e2e-smoke-test.sh passes all 19 tests (no pipeline regression)
  • shipwright root-cause classify "rate limit exceeded" returns rate_limit category with ≥90% confidence
  • shipwright root-cause report generates markdown report from empty and populated JSONL states
  • With NO_GITHUB=1, rootcause_create_platform_issue() is a no-op (verified by test)
  • shipwright root-cause appears in CLI help/dispatch (verifies router wiring)

Monitoring Checklist

P0 — First Pipeline Run After Deploy

  • Does daemon_on_failure() still retry correctly? (Check retry comments on a test issue)
  • Does the root cause section appear in failure comments? (Check GitHub issue timeline)
  • Are daemon.root_cause events emitted to events.jsonl?

P1 — First Week

  • Are classifications landing in root-causes.jsonl? (shipwright root-cause history 10)
  • Is confidence boosting working? (Same error type should show increasing confidence)
  • Are platform bug issues being created? (Check for auto-created issues with [Platform Bug] title prefix)
  • Is the dashboard breakdown rendering? (GET /api/root-cause/breakdown)

P2 — Two Weeks (Success Metric)

  • Have repeat platform failures decreased by more than 30%? (Compare root-cause report trend: 24h vs 7d)
  • Are auto-created platform issues being resolved? (Check issue close rate)
  • Is the classifier accuracy acceptable? (Manual spot-check 20 recent classifications)

Anomaly Detection Triggers

  • Spike: >10 platform_bug classifications in 1 hour (possible systemic issue)
  • Absence: Zero classifications after 24h of daemon operation (integration broken)
  • Confidence drift: Average confidence drops below 50% across categories (patterns need updating)
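
The three triggers can be expressed as checks over recent learning entries. This sketch is one possible reading: it assumes the daemon has already run for at least 24 hours, and it computes confidence drift as the mean over all entries from the last 24 hours, which is an assumption since no window is specified. The names are hypothetical.

```typescript
interface ClassifiedEntry {
  category: string;
  confidence: number;
  recorded_at: string; // ISO 8601
}

type Anomaly = "spike" | "absence" | "confidence_drift";

const HOUR_MS = 60 * 60 * 1000;

function detectAnomalies(entries: ClassifiedEntry[], now: Date): Anomaly[] {
  const nowMs = now.getTime();
  // Entries with unparseable timestamps yield NaN and fail every range check.
  const ageMs = (e: ClassifiedEntry): number => nowMs - Date.parse(e.recorded_at);
  const found: Anomaly[] = [];

  // Spike: more than 10 platform_bug classifications in the last hour.
  const recentPlatformBugs = entries.filter(
    (e) => e.category === "platform_bug" && ageMs(e) >= 0 && ageMs(e) <= HOUR_MS,
  ).length;
  if (recentPlatformBugs > 10) found.push("spike");

  // Absence: no classifications at all in the last 24 hours.
  const lastDay = entries.filter((e) => ageMs(e) >= 0 && ageMs(e) <= 24 * HOUR_MS);
  if (lastDay.length === 0) found.push("absence");

  // Confidence drift: mean confidence below 50 (window choice is an assumption).
  if (lastDay.length > 0) {
    const mean = lastDay.reduce((sum, e) => sum + e.confidence, 0) / lastDay.length;
    if (mean < 50) found.push("confidence_drift");
  }
  return found;
}
```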

Auto-Rollback Decision Criteria

Not applicable — this feature degrades gracefully. If the classifier fails, the daemon continues with its existing retry logic unchanged. The || true guard ensures zero blast radius. Manual rollback (remove source lines) is sufficient if issues arise.

Schema Changes

No schema changes. All persistence is append-only JSONL:

  • New file: ~/.shipwright/optimization/root-causes.jsonl
  • Format: One JSON object per line, fields: {category, confidence, message, recorded_at}
  • Rollback: Delete the file. No other state depends on it.

Idempotency Strategy

  • Classification: Pure function given same inputs → same output (modulo history boosting, which is deterministic given same JSONL state)
  • Learning writes: Append-only with timestamps. Duplicates are harmless — frequency counting handles them naturally
  • GitHub issue creation: Deduplicates via cksum signature. Searches open issues before creating. Finding an existing match returns its URL instead of creating a duplicate
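
The deduplication step can be sketched as "compute a signature, then look for it among open issues". The sketch deliberately swaps in a 32-bit FNV-1a hash as a stand-in for the cksum-based signature used by the real script. What gets hashed (category plus the first 100 characters of the message) and where the signature is stored (a marker line in the issue body) are assumptions made for illustration, as are all names below.

```typescript
// 32-bit FNV-1a over UTF-16 code units; a stand-in for cksum, not the real signature.
function fnv1a32(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// Hypothetical signature input: category plus the first 100 chars of the message.
function issueSignature(category: string, message: string): string {
  return fnv1a32(`${category}:${message.slice(0, 100)}`).toString(16).padStart(8, "0");
}

interface OpenIssue {
  url: string;
  body: string;
}

// Returns the URL of an open issue that already carries this signature, or null.
function findExistingIssue(sig: string, openIssues: OpenIssue[]): string | null {
  const marker = `signature: ${sig}`; // hypothetical marker format
  const hit = openIssues.find((issue) => issue.body.includes(marker));
  return hit ? hit.url : null;
}
```

When findExistingIssue returns a URL, the caller reuses it instead of opening a new issue, which is what makes repeated failures with the same signature idempotent.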

Rollback Plan

  1. Remove source "${_daemon_failure_dir}/root-cause.sh" from scripts/lib/daemon-failure.sh
  2. Remove rootcause_main() call block and rc_section/rc_final_section from daemon_on_failure()
  3. Remove root-cause) case from scripts/sw
  4. Remove /api/root-cause/breakdown handler from dashboard/server.ts
  5. Remove TypeScript additions from types/api.ts, core/api.ts, views/insights.ts
  6. Leave root-causes.jsonl in place (append-only, harmless)
  7. No schema migrations to reverse
