Pipeline Design 184

Design: Failure Root Cause Classifier with Automated Platform Issue Creation

Context

Shipwright's daemon processes GitHub issues through a 12-stage pipeline. When pipelines fail, the daemon retries with escalation but has no systematic understanding of why failures occur. The same platform bug can trigger dozens of retries across different issues before a human notices the pattern. Meanwhile, error-log.jsonl (captured by the PostToolUse hook) and events.jsonl accumulate structured error data that goes unanalyzed.

Constraints:

All scripts must be Bash 3.2 compatible (macOS default) — no associative arrays, no readarray
Must integrate into the existing daemon_on_failure() call chain without breaking retry logic
Must not create a circular dependency (classifier must work when Claude CLI is unavailable, ruling out LLM-based classification)
Must respect NO_GITHUB for local/offline mode
Dashboard is TypeScript/Bun; frontend uses vanilla TS with no framework
Persistence must use append-only JSONL (no SQLite schema changes needed)

Decision

Regex-based pattern classifier with historical confidence boosting, integrated at four layers: core library, daemon failure handler, CLI, and dashboard.

Architecture (5 Components)

┌─────────────────────────────────────────────────────────────────┐
│                    Component Boundary Map                       │
│                                                                 │
│  ┌───────────────┐   ┌──────────────────┐   ┌───────────────┐ │
│  │ 1. Classifier  │──▶│ 2. Learning      │──▶│ 3. Issue      │ │
│  │ (pure logic)   │   │ (persistence)    │   │ (side effect) │ │
│  │ root-cause.sh  │   │ root-cause.sh    │   │ root-cause.sh │ │
│  └───────┬───────┘   └──────────────────┘   └───────────────┘ │
│          │                                                      │
│  ┌───────┴───────┐                          ┌───────────────┐  │
│  │ 4. Daemon      │                          │ 5. Dashboard  │  │
│  │ Integration    │                          │ (read-only)   │  │
│  │ daemon-        │                          │ server.ts +   │  │
│  │ failure.sh     │                          │ insights.ts   │  │
│  └───────────────┘                          └───────────────┘  │
│                                                                 │
│  Dependencies flow inward:                                      │
│  Daemon ──▶ Classifier ◀── Dashboard (reads JSONL only)        │
│  Issue Creator ──▶ Classifier (gets classification, then acts) │
└─────────────────────────────────────────────────────────────────┘

Component responsibilities:

Classifier — Pure function: (error_message, stage, exit_code) → {category, confidence, evidence}. Seven categories with cascading regex priority. Default fallback: code_bug at 45%.
Learning Store — Append-only JSONL at ~/.shipwright/optimization/root-causes.jsonl. Atomic writes via tmpfile+mv. Capped at 500 entries. Feeds back into classifier via confidence boosting (±10 max).
Issue Creator — Side-effect boundary: creates GitHub issues for platform_bug/config_error with >70% confidence. Deduplicates via cksum signature search. Guarded by NO_GITHUB.
Daemon Integration — Calls rootcause_main() inside daemon_on_failure(), enriches retry/failure GitHub comments with root cause sections, emits daemon.root_cause events.
Dashboard — Read-only consumer of root-causes.jsonl. Server endpoint aggregates by category with period filtering. Frontend renders colored distribution bars.
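The classifier's cascade can be modeled as an ordered rule list where the first match wins. The sketch below is in TypeScript for readability (the real classifier is Bash in root-cause.sh). Every regex and every per-rule confidence value here is an illustrative assumption; only the category names, the empty-input result, and the code_bug fallback at 45% come from this design.

```typescript
// Sketch of a cascading regex classifier. Patterns and per-rule confidence
// values are assumptions for illustration, not the ones in root-cause.sh.
type Category =
  | "rate_limit" | "context_exhaustion" | "infra_issue" | "platform_bug"
  | "config_error" | "external_dep" | "code_bug" | "unknown";

interface Classification {
  category: Category;
  confidence: number; // 0-99
  evidence: string[]; // matched pattern fragments
}

interface Rule {
  category: Category;
  confidence: number;
  pattern: RegExp;
}

// Order encodes priority: the first matching rule wins.
const RULES: Rule[] = [
  { category: "rate_limit", confidence: 90, pattern: /rate limit|too many requests|\b429\b/i },
  { category: "context_exhaustion", confidence: 85, pattern: /context (window|length)|token limit/i },
  { category: "infra_issue", confidence: 80, pattern: /no space left on device|out of memory|connection refused/i },
  { category: "platform_bug", confidence: 75, pattern: /unbound variable|sw-[a-z-]+\.sh: line \d+/i },
  { category: "config_error", confidence: 75, pattern: /missing config|invalid configuration/i },
  { category: "external_dep", confidence: 70, pattern: /npm ERR!|could not resolve host/i },
  { category: "code_bug", confidence: 70, pattern: /syntax error|assertion failed|TypeError/i },
];

// _stage and _exitCode are part of the documented signature but unused here.
function classify(errorMessage: string, _stage: string, _exitCode: number): Classification {
  if (errorMessage.trim() === "") {
    return { category: "unknown", confidence: 0, evidence: [] };
  }
  for (const rule of RULES) {
    const match = rule.pattern.exec(errorMessage);
    if (match) {
      return { category: rule.category, confidence: rule.confidence, evidence: [match[0]] };
    }
  }
  // No pattern matched: documented fallback.
  return { category: "code_bug", confidence: 45, evidence: [] };
}
```

Keeping the rules in a flat ordered list keeps the priority explicit and carries over to Bash 3.2, where the same cascade is a sequence of if/elif regex tests rather than an associative array.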

Interface Contracts

// rootcause_classify(error_message: string, stage: string, exit_code: number): JSON
interface Classification {
  category: "rate_limit" | "context_exhaustion" | "infra_issue" | "platform_bug" | "config_error" | "external_dep" | "code_bug" | "unknown";
  confidence: number;   // 0-99, never 100 (epistemic humility)
  evidence: string[];   // matched pattern fragments
  suggested_action: string;
}

// rootcause_main(error_message, stage, exit_code): JSON
interface RootCauseResult {
  category: string;
  confidence: number;
  evidence: string[];
  suggested_action: string;
  fix_suggestions: string;
  actionability: number;  // 0-100
}

// GET /api/root-cause/breakdown?period=30
interface RootCauseBreakdown {
  breakdown: Array<{
    category: string;
    count: number;
    percentage: number;    // integer 0-100
    avg_confidence: number;
  }>;
  total: number;
  period: number;  // days
}

// Learning entry (one line of root-causes.jsonl)
interface LearningEntry {
  category: string;
  confidence: number;
  message: string;       // first 200 chars (truncated)
  recorded_at: string;   // ISO 8601
}

Error contracts:

Classifier never fails — returns {category:"unknown", confidence:0} on any error
Learning write failure returns exit 1 but is always called with || true from daemon
Issue creation failure logged as warning, never propagates
Dashboard endpoint always returns 200, empty {breakdown:[], total:0} on any error
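The "never fails" contracts amount to wrapping a fallible function so that any error collapses to a fixed fallback value. A minimal TypeScript sketch of that idea follows; `neverFail` is a hypothetical helper name, and in the Bash code the equivalent role is played by the `|| true` guard at the call site.

```typescript
interface Classification {
  category: string;
  confidence: number;
  evidence: string[];
}

// Documented fallback when classification itself errors out.
const UNKNOWN: Classification = { category: "unknown", confidence: 0, evidence: [] };

// Hypothetical helper: returns a version of fn that never throws.
function neverFail<A extends unknown[], R>(fn: (...args: A) => R, fallback: R): (...args: A) => R {
  return (...args: A): R => {
    try {
      return fn(...args);
    } catch {
      // Swallow the error: callers rely on always getting a value back.
      return fallback;
    }
  };
}

// Usage: a classifier that throws still yields a well-formed result.
const fragileClassify = (message: string): Classification => {
  if (message.length > 10) throw new Error("simulated internal failure");
  return { category: "code_bug", confidence: 45, evidence: [] };
};
const safeClassify = neverFail(fragileClassify, UNKNOWN);
```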

Data Flow

Pipeline failure (exit != 0)
        │
        ▼
daemon_on_failure()
        │
        ├─ 1. Extract last 100 lines of issue log
        │
        ├─ 2. rootcause_classify(log_tail, stage, exit_code)
        │      └─ Cascading regex: rate_limit → context_exhaustion → infra → platform → config → external → code_bug → unknown
        │
        ├─ 3. rootcause_boost_from_history(message, category, confidence)
        │      └─ Read root-causes.jsonl, match first 100 chars of message
        │         Agreement: +2/match (max +10, cap 99)
        │         Disagreement: -5/mismatch (floor 10)
        │
        ├─ 4. rootcause_suggest_fix(category) → actionable suggestions
        │
        ├─ 5. rootcause_learn(category, confidence, message)
        │      └─ Atomic append to root-causes.jsonl
        │
        ├─ 6. rootcause_create_platform_issue() [conditional]
        │      └─ Guards: platform_bug|config_error AND confidence>70 AND !NO_GITHUB
        │      └─ Dedup: cksum signature search in open issues
        │
        ├─ 7. emit_event("daemon.root_cause", category, confidence)
        │
        └─ 8. Enrich GitHub comment with root cause section
               └─ Collapsible <details> with category, confidence, suggestions

Dashboard (async, independent):
  GET /api/root-cause/breakdown
        │
        ├─ Read root-causes.jsonl
        ├─ Filter by recorded_at >= (now - period days)
        ├─ Group by category, aggregate count + avg confidence
        └─ Return sorted breakdown JSON
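The dashboard steps above (filter by period, group by category, aggregate) can be sketched as a pure function over parsed entries. This is a sketch under assumptions: `buildBreakdown` is a hypothetical name, and the descending-by-count sort and the rounding of `avg_confidence` to an integer are not specified in this design.

```typescript
interface LearningEntry {
  category: string;
  confidence: number;
  message: string;
  recorded_at: string; // ISO 8601
}

interface BreakdownEntry {
  category: string;
  count: number;
  percentage: number; // integer 0-100
  avg_confidence: number;
}

interface RootCauseBreakdown {
  breakdown: BreakdownEntry[];
  total: number;
  period: number; // days
}

function buildBreakdown(entries: LearningEntry[], periodDays: number, now: Date = new Date()): RootCauseBreakdown {
  const cutoff = now.getTime() - periodDays * 24 * 60 * 60 * 1000;
  // Keep entries inside the window; unparseable timestamps are dropped.
  const recent = entries.filter((e) => {
    const t = Date.parse(e.recorded_at);
    return !Number.isNaN(t) && t >= cutoff;
  });

  const groups = new Map<string, { count: number; sum: number }>();
  for (const e of recent) {
    const g = groups.get(e.category) ?? { count: 0, sum: 0 };
    g.count += 1;
    g.sum += e.confidence;
    groups.set(e.category, g);
  }

  const total = recent.length;
  const breakdown = Array.from(groups.entries())
    .map(([category, g]) => ({
      category,
      count: g.count,
      percentage: Math.round((g.count / total) * 100),
      avg_confidence: Math.round(g.sum / g.count), // assumption: integer average
    }))
    .sort((a, b) => b.count - a.count); // assumption: most frequent first

  return { breakdown, total, period: periodDays };
}
```

With no entries in the window, the map is empty and the function returns `{breakdown: [], total: 0, period: N}`, which matches the empty response shape in the error contracts.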

Error Boundaries

Boundary	Failure Mode	Behavior
Classifier regex	No pattern matches	Falls back to `code_bug` at 45% confidence
Empty input	No error message provided	Returns `unknown` at 0% confidence
Learning file missing	First classification ever	Skip boosting, pass through original confidence
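Learning write fails	Disk full, permissions	Write returns exit 1; the daemon calls it with `|| true`, so the failure is ignored and the daemon continues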
GitHub API fails	Network, auth, rate limit	Warning logged, daemon continues with retry logic
Duplicate issue check	`gh` CLI unavailable	Skip dedup, create issue anyway (rare double-create is acceptable)
JSONL parse error	Malformed line in history	`jq` silently skips via `2>/dev/null`
Dashboard file read	Missing/corrupt JSONL	Returns `{breakdown:[], total:0, period:N}`
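On the dashboard side, tolerating malformed JSONL means parsing line by line and dropping anything that fails to parse or lacks the expected fields. The following is a sketch rather than the code in server.ts; `parseLearningLog` and `isLearningEntry` are hypothetical names.

```typescript
interface LearningEntry {
  category: string;
  confidence: number;
  message: string;
  recorded_at: string;
}

// Runtime shape check, since JSON.parse returns untyped data.
function isLearningEntry(value: unknown): value is LearningEntry {
  if (typeof value !== "object" || value === null) return false;
  const o = value as Record<string, unknown>;
  return (
    typeof o["category"] === "string" &&
    typeof o["confidence"] === "number" &&
    typeof o["message"] === "string" &&
    typeof o["recorded_at"] === "string"
  );
}

function parseLearningLog(text: string): LearningEntry[] {
  const entries: LearningEntry[] = [];
  for (const line of text.split("\n")) {
    if (line.trim() === "") continue;
    try {
      const parsed: unknown = JSON.parse(line);
      if (isLearningEntry(parsed)) entries.push(parsed);
    } catch {
      // Malformed line: skip it so one bad write cannot break the reader.
    }
  }
  return entries;
}
```

A missing or unreadable file can be treated as empty text, which yields an empty list and therefore an empty breakdown.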

Key Design Decisions

D1: Regex over LLM classification

Context: Need classification to work when Claude CLI is unavailable (since Claude failures are what we're classifying)
Decision: Cascading regex with 7 category patterns
Consequence: Less accurate on ambiguous errors, but zero external dependencies and sub-millisecond latency

D2: Append-only JSONL over SQLite

Context: sw-db.sh exists but adds complexity; learning data is write-heavy, read-infrequent
Decision: ~/.shipwright/optimization/root-causes.jsonl with atomic append and 500-entry cap
Consequence: Simple, no migrations, but linear scan for reads (acceptable at 500 entries)
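The 500-entry cap is a tail slice applied when writing. A minimal sketch of that step as a pure function, assuming a hypothetical name `appendCapped`: it computes the new file content, which the caller would write to a temp file and rename over the original to keep the update atomic.

```typescript
interface LearningEntry {
  category: string;
  confidence: number;
  message: string;
  recorded_at: string;
}

// Returns the new JSONL content: existing lines plus the new entry,
// trimmed to the most recent `cap` lines.
function appendCapped(existing: string, entry: LearningEntry, cap: number = 500): string {
  const lines = existing.split("\n").filter((line) => line.trim() !== "");
  lines.push(JSON.stringify(entry));
  const kept = lines.slice(Math.max(0, lines.length - cap));
  return kept.join("\n") + "\n";
}
```

Dropping the oldest lines first keeps the learning store biased toward recent failures, which is also what bounds the linear scan on reads.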

D3: Confidence boosting bounded at ±10

Context: Historical learning must improve accuracy without runaway feedback loops
Decision: +2 per agreeing historical entry (max +10, cap 99), -5 per disagreeing entry (floor 10)
Consequence: Self-correcting — misclassifications get penalized, but a single bad entry can't tank confidence below 10
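The boosting rule can be worked through as arithmetic on match counts. The sketch below assumes history entries are compared on the first 100 characters of the message, as in the data flow. How agreement and disagreement combine is not spelled out in this design, so capping the total penalty at 10 (to mirror the ±10 bound) is an assumption, as is the name `boostFromHistory`.

```typescript
interface HistoryEntry {
  category: string;
  message: string;
}

function boostFromHistory(
  message: string,
  category: string,
  confidence: number,
  history: HistoryEntry[],
): number {
  const key = message.slice(0, 100);
  let agree = 0;
  let disagree = 0;
  for (const h of history) {
    if (h.message.slice(0, 100) !== key) continue; // unrelated error
    if (h.category === category) agree += 1;
    else disagree += 1;
  }
  // No matching history: pass the original confidence through untouched.
  if (agree === 0 && disagree === 0) return confidence;

  const boost = Math.min(agree * 2, 10); // +2 per agreeing entry, max +10
  const penalty = Math.min(disagree * 5, 10); // -5 per mismatch; cap at 10 is an assumption
  return Math.max(10, Math.min(99, confidence + boost - penalty)); // floor 10, cap 99
}
```

For example, a classification at 90 with three agreeing entries becomes 96, while one at 95 with ten agreeing entries stops at the cap of 99.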

D4: Issue creation threshold at >70% confidence

Context: False positive GitHub issues create noise and erode trust
Decision: Only platform_bug and config_error categories with confidence strictly >70% trigger issue creation
Consequence: Misses some real platform bugs (false negatives), but avoids spamming the repo (acceptable trade-off since false negatives still get classified and logged)
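The issue-creation guard reduces to a three-part predicate. A minimal sketch, with `shouldCreatePlatformIssue` as a hypothetical name; treating any non-empty NO_GITHUB value as "set" is an assumption about how the variable is read.

```typescript
const ISSUE_CATEGORIES = ["platform_bug", "config_error"];

function shouldCreatePlatformIssue(
  category: string,
  confidence: number,
  env: Record<string, string | undefined>,
): boolean {
  const noGithub = env["NO_GITHUB"];
  // Assumption: any non-empty value disables GitHub side effects.
  if (noGithub !== undefined && noGithub !== "") return false;
  if (ISSUE_CATEGORIES.indexOf(category) === -1) return false;
  return confidence > 70; // strictly greater than 70
}
```

Because the comparison is strict, a classification at exactly 70% is logged and learned from but does not open an issue.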

D5: Two-layer classification in daemon

Context: daemon-failure.sh already has a simple 5-class classify_failure() for retry strategy
Decision: Keep the simple classifier for retry decisions, add deep classifier (rootcause_main()) for analytics and issue creation
Consequence: Slight duplication, but the simple classifier drives retry logic (where speed matters) while the deep classifier drives learning and reporting

Alternatives Considered

LLM-based classifier — Pros: Better accuracy on ambiguous errors, can understand novel failure modes / Cons: Creates circular dependency (Claude classifying Claude failures), adds API cost per failure, requires CLI availability, slower. Rejected for operational reliability.
SQLite decision tree — Pros: Proper weighted decision tree, efficient queries, joins with events data / Cons: Requires schema migration, adds sw-db.sh dependency, more complex than needed for ~500 historical entries. Rejected as over-engineering — JSONL is sufficient at current scale.
Inline classification in daemon only (no library) — Pros: Simpler, fewer files / Cons: Not reusable from CLI or pipeline scripts, can't test in isolation, violates single-responsibility. Rejected for testability and reuse.

Implementation Plan

Files Created

File	Lines	Purpose
`scripts/lib/root-cause.sh`	~428	Core classifier, learning, issue creation, reporting
`scripts/sw-root-cause.sh`	~197	CLI entry point (classify/analyze/report/history)
`scripts/sw-root-cause-test.sh`	~375	53 tests across 10 groups

Files Modified

File	Change
`scripts/lib/daemon-failure.sh`	Source root-cause.sh, call `rootcause_main()` in `daemon_on_failure()`, enrich GitHub comments
`scripts/sw`	Add `root-cause)` case to CLI dispatcher (~2 lines)
`scripts/sw-pipeline.sh`	Source `lib/root-cause.sh`
`scripts/lib/pipeline-cli.sh`	Source `lib/root-cause.sh`
`scripts/lib/pipeline-commands.sh`	Source `lib/root-cause.sh`
`dashboard/server.ts`	Add `GET /api/root-cause/breakdown` endpoint (~65 lines)
`dashboard/src/types/api.ts`	Add `RootCauseBreakdown` + `RootCauseBreakdownEntry` interfaces
`dashboard/src/core/api.ts`	Add `fetchRootCauseBreakdown()` function
`dashboard/src/views/insights.ts`	Add failure breakdown visualization (~35 lines)
`scripts/sw-server-api-test.sh`	Add breakdown endpoint test

Dependencies

None new. Uses existing jq, gh, and Bun runtime.

Risk Areas

Risk	Severity	Mitigation
Regex patterns too broad → misclassification	Medium	Conservative patterns requiring Shipwright-specific markers; 45% fallback confidence signals low certainty
`root-causes.jsonl` grows unbounded	Low	Capped at 500 entries via tail slice on write
`daemon_on_failure()` regression	Medium	Entire `rootcause_main()` call wrapped in `|| true`; if the classifier fails, the daemon continues with its existing retry logic unchanged
Dashboard endpoint slow on large JSONL	Low	500-entry cap makes linear scan trivial (<5ms)

Validation Criteria

scripts/sw-root-cause-test.sh passes all 53 tests (classification per category, learning, boosting, issue creation guards, CLI subcommands)
scripts/sw-lib-daemon-failure-test.sh passes all 34 tests (no regression in retry/backoff logic)
scripts/sw-server-api-test.sh passes (new breakdown endpoint returns correct shape)
npx vitest run --config dashboard/vitest.config.ts passes all 284 tests (TypeScript types compile, API client works, insights view renders)
scripts/sw-e2e-smoke-test.sh passes all 19 tests (no pipeline regression)
shipwright root-cause classify "rate limit exceeded" returns rate_limit category with ≥90% confidence
shipwright root-cause report generates markdown report from empty and populated JSONL states
With NO_GITHUB=1, rootcause_create_platform_issue() is a no-op (verified by test)
shipwright root-cause appears in CLI help/dispatch (verifies router wiring)

Monitoring Checklist

P0 — First Pipeline Run After Deploy

Does daemon_on_failure() still retry correctly? (Check retry comments on a test issue)
Does the root cause section appear in failure comments? (Check GitHub issue timeline)
Are daemon.root_cause events emitted to events.jsonl?

P1 — First Week

Are classifications landing in root-causes.jsonl? (shipwright root-cause history 10)
Is confidence boosting working? (Same error type should show increasing confidence)
Are platform bug issues being created? (Check for auto-created issues with [Platform Bug] title prefix)
Is the dashboard breakdown rendering? (GET /api/root-cause/breakdown)

P2 — Two Weeks (Success Metric)

Have repeat platform failures decreased by more than 30%? (Compare root-cause report trend: 24h vs 7d)
Are auto-created platform issues being resolved? (Check issue close rate)
Is the classifier accuracy acceptable? (Manual spot-check 20 recent classifications)

Anomaly Detection Triggers

Spike: >10 platform_bug classifications in 1 hour (possible systemic issue)
Absence: Zero classifications after 24h of daemon operation (integration broken)
Confidence drift: Average confidence drops below 50% across categories (patterns need updating)
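The three triggers can be evaluated over the same learning entries the dashboard reads. This is an illustrative sketch only: `detectAnomalies` is a hypothetical name, no such check is described in the implementation plan, and it assumes "across categories" means the overall mean confidence and that absence is measured from daemon start.

```typescript
interface LearningEntry {
  category: string;
  confidence: number;
  recorded_at: string; // ISO 8601
}

type Anomaly = "spike" | "absence" | "confidence_drift";

const HOUR_MS = 60 * 60 * 1000;

function detectAnomalies(entries: LearningEntry[], daemonStartedAt: Date, now: Date): Anomaly[] {
  const found: Anomaly[] = [];
  const nowMs = now.getTime();
  const startMs = daemonStartedAt.getTime();

  // Spike: more than 10 platform_bug classifications in the last hour.
  const recentPlatformBugs = entries.filter(
    (e) => e.category === "platform_bug" && nowMs - Date.parse(e.recorded_at) <= HOUR_MS,
  ).length;
  if (recentPlatformBugs > 10) found.push("spike");

  // Absence: daemon up for at least 24h with nothing recorded since start.
  const sinceStart = entries.filter((e) => Date.parse(e.recorded_at) >= startMs).length;
  if (nowMs - startMs >= 24 * HOUR_MS && sinceStart === 0) found.push("absence");

  // Confidence drift: overall mean confidence below 50 (skipped with no data).
  if (entries.length > 0) {
    const mean = entries.reduce((sum, e) => sum + e.confidence, 0) / entries.length;
    if (mean < 50) found.push("confidence_drift");
  }
  return found;
}
```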

Auto-Rollback Decision Criteria

Not applicable — this feature degrades gracefully. If the classifier fails, the daemon continues with its existing retry logic unchanged. The || true guard ensures zero blast radius. Manual rollback (remove source lines) is sufficient if issues arise.

Schema Changes

No schema changes. All persistence is append-only JSONL:

New file: ~/.shipwright/optimization/root-causes.jsonl
Format: One JSON object per line, fields: {category, confidence, message, recorded_at}
Rollback: Delete the file. No other state depends on it.

Idempotency Strategy

Classification: Pure function given same inputs → same output (modulo history boosting, which is deterministic given same JSONL state)
Learning writes: Append-only with timestamps. Duplicates are harmless — frequency counting handles them naturally
GitHub issue creation: Deduplicates via cksum signature. Searches open issues before creating. Finding an existing match returns its URL instead of creating a duplicate
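The dedup step can be sketched as "derive a stable signature, then search open issues for it". Several parts below are assumptions: the signature input (category plus the first 100 characters of the message), the marker text embedded in issue bodies, and the hash itself. The design uses cksum; the sketch substitutes 32-bit FNV-1a as a stand-in because it is short to write out.

```typescript
interface OpenIssue {
  url: string;
  body: string;
}

// Hypothetical marker embedded in auto-created issue bodies.
const SIGNATURE_MARKER = "root-cause-signature:";

// Stand-in for cksum: 32-bit FNV-1a over UTF-16 code units, as 8 hex digits.
function signatureOf(category: string, message: string): string {
  const input = `${category}:${message.slice(0, 100)}`;
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return ("00000000" + hash.toString(16)).slice(-8);
}

// Returns the URL of an open issue carrying this signature, or null.
function findExistingIssue(openIssues: OpenIssue[], signature: string): string | null {
  const needle = `${SIGNATURE_MARKER} ${signature}`;
  const hit = openIssues.find((issue) => issue.body.includes(needle));
  return hit ? hit.url : null;
}
```

Since the signature depends only on the classification inputs, repeated failures with the same error resolve to the same open issue instead of creating a new one.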

Rollback Plan

Remove source "${_daemon_failure_dir}/root-cause.sh" from scripts/lib/daemon-failure.sh
Remove rootcause_main() call block and rc_section/rc_final_section from daemon_on_failure()
Remove root-cause) case from scripts/sw
Remove /api/root-cause/breakdown handler from dashboard/server.ts
Remove TypeScript additions from types/api.ts, core/api.ts, views/insights.ts
Leave root-causes.jsonl in place (append-only, harmless)
No schema migrations to reverse
