Pipeline Design 184
Now I have a thorough understanding of the implementation. Here's the ADR:
Shipwright's daemon processes GitHub issues through a 12-stage pipeline. When pipelines fail, the daemon retries with escalation but has no systematic understanding of why failures occur. The same platform bug can trigger dozens of retries across different issues before a human notices the pattern. Meanwhile, error-log.jsonl (captured by the PostToolUse hook) and events.jsonl accumulate structured error data that goes unanalyzed.
Constraints:
- All scripts must be Bash 3.2 compatible (macOS default): no associative arrays, no `readarray`
- Must integrate into the existing `daemon_on_failure()` call chain without breaking retry logic
- Must not create a circular dependency (classifier must work when Claude CLI is unavailable, ruling out LLM-based classification)
- Must respect `NO_GITHUB` for local/offline mode
- Dashboard is TypeScript/Bun; frontend uses vanilla TS with no framework
- Persistence must use append-only JSONL (no SQLite schema changes needed)
Regex-based pattern classifier with historical confidence boosting, integrated at four layers: core library, daemon failure handler, CLI, and dashboard.
┌─────────────────────────────────────────────────────────────────┐
│ Component Boundary Map │
│ │
│ ┌───────────────┐ ┌──────────────────┐ ┌───────────────┐ │
│ │ 1. Classifier │──▶│ 2. Learning │──▶│ 3. Issue │ │
│ │ (pure logic) │ │ (persistence) │ │ (side effect) │ │
│ │ root-cause.sh │ │ root-cause.sh │ │ root-cause.sh │ │
│ └───────┬───────┘ └──────────────────┘ └───────────────┘ │
│ │ │
│ ┌───────┴───────┐ ┌───────────────┐ │
│ │ 4. Daemon │ │ 5. Dashboard │ │
│ │ Integration │ │ (read-only) │ │
│ │ daemon- │ │ server.ts + │ │
│ │ failure.sh │ │ insights.ts │ │
│ └───────────────┘ └───────────────┘ │
│ │
│ Dependencies flow inward: │
│ Daemon ──▶ Classifier ◀── Dashboard (reads JSONL only) │
│ Issue Creator ──▶ Classifier (gets classification, then acts) │
└─────────────────────────────────────────────────────────────────┘
Component responsibilities:
- Classifier: pure function `(error_message, stage, exit_code) → {category, confidence, evidence}`. Seven categories with cascading regex priority. Default fallback: `code_bug` at 45%.
- Learning Store: append-only JSONL at `~/.shipwright/optimization/root-causes.jsonl`. Atomic writes via tmpfile+mv. Capped at 500 entries. Feeds back into classifier via confidence boosting (±10 max).
- Issue Creator: side-effect boundary that creates GitHub issues for `platform_bug`/`config_error` with >70% confidence. Deduplicates via `cksum` signature search. Guarded by `NO_GITHUB`.
- Daemon Integration: calls `rootcause_main()` inside `daemon_on_failure()`, enriches retry/failure GitHub comments with root cause sections, emits `daemon.root_cause` events.
- Dashboard: read-only consumer of `root-causes.jsonl`. Server endpoint aggregates by category with period filtering. Frontend renders colored distribution bars.
// rootcause_classify(error_message: string, stage: string, exit_code: number): JSON
interface Classification {
  category: "rate_limit" | "context_exhaustion" | "infra_issue" | "platform_bug" | "config_error" | "external_dep" | "code_bug" | "unknown";
  confidence: number; // 0-99, never 100 (epistemic humility)
  evidence: string[]; // matched pattern fragments
  suggested_action: string;
}

// rootcause_main(error_message, stage, exit_code): JSON
interface RootCauseResult {
  category: string;
  confidence: number;
  evidence: string[];
  suggested_action: string;
  fix_suggestions: string;
  actionability: number; // 0-100
}

// GET /api/root-cause/breakdown?period=30
interface RootCauseBreakdown {
  breakdown: Array<{
    category: string;
    count: number;
    percentage: number; // integer 0-100
    avg_confidence: number;
  }>;
  total: number;
  period: number; // days
}

// Learning entry (one line of root-causes.jsonl)
interface LearningEntry {
  category: string;
  confidence: number;
  message: string; // first 200 chars (truncated)
  recorded_at: string; // ISO 8601
}

Error contracts:
- Classifier never fails: returns `{category:"unknown", confidence:0}` on any error
- Learning write failure returns exit 1 but is always called with `|| true` from daemon
- Issue creation failure logged as warning, never propagates
- Dashboard endpoint always returns 200, empty `{breakdown:[], total:0}` on any error
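
The cascading classifier and its fallbacks can be illustrated with a minimal TypeScript sketch. The real classifier lives in `scripts/lib/root-cause.sh` and is written in Bash; every regex and every confidence value below other than the 45% `code_bug` fallback and the 0% `unknown` result is an assumption made for illustration, and the `stage`, `exit_code`, and `suggested_action` parts of the contract are omitted for brevity.

```typescript
// Illustrative sketch only: patterns and per-rule confidences are assumptions,
// not the expressions used in scripts/lib/root-cause.sh.
type Category =
  | "rate_limit" | "context_exhaustion" | "infra_issue" | "platform_bug"
  | "config_error" | "external_dep" | "code_bug" | "unknown";

interface Classification {
  category: Category;
  confidence: number; // 0-99
  evidence: string[]; // matched pattern fragments
}

interface Rule {
  category: Category;
  confidence: number;
  pattern: RegExp;
}

// Order matters: rules are tried top to bottom and the first match wins.
const RULES: Rule[] = [
  { category: "rate_limit", confidence: 90, pattern: /rate limit|too many requests|\b429\b/i },
  { category: "context_exhaustion", confidence: 85, pattern: /context (window|length)|token limit/i },
  { category: "infra_issue", confidence: 80, pattern: /no space left on device|out of memory/i },
  { category: "platform_bug", confidence: 75, pattern: /shipwright: internal error/i },
  { category: "config_error", confidence: 75, pattern: /missing config|invalid configuration/i },
  { category: "external_dep", confidence: 70, pattern: /ECONNREFUSED|could not resolve host/i },
  { category: "code_bug", confidence: 60, pattern: /syntax error|test failed|TypeError/i },
];

function classify(errorMessage: string): Classification {
  // Empty input: nothing to match against, so report unknown at 0%.
  if (errorMessage.trim() === "") {
    return { category: "unknown", confidence: 0, evidence: [] };
  }
  for (const rule of RULES) {
    const match = rule.pattern.exec(errorMessage);
    if (match) {
      return { category: rule.category, confidence: rule.confidence, evidence: [match[0]] };
    }
  }
  // No pattern matched: fall back to code_bug at 45%.
  return { category: "code_bug", confidence: 45, evidence: [] };
}
```

Because the rules are evaluated in order, a message that mentions both a 429 response and a DNS failure is reported as `rate_limit`, which is the point of the cascading priority.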
Pipeline failure (exit != 0)
│
▼
daemon_on_failure()
│
├─ 1. Extract last 100 lines of issue log
│
├─ 2. rootcause_classify(log_tail, stage, exit_code)
│ └─ Cascading regex: rate_limit → context_exhaustion → infra → platform → config → external → code_bug → unknown
│
├─ 3. rootcause_boost_from_history(message, category, confidence)
│ └─ Read root-causes.jsonl, match first 100 chars of message
│ Agreement: +2/match (max +10, cap 99)
│ Disagreement: -5/mismatch (floor 10)
│
├─ 4. rootcause_suggest_fix(category) → actionable suggestions
│
├─ 5. rootcause_learn(category, confidence, message)
│ └─ Atomic append to root-causes.jsonl
│
├─ 6. rootcause_create_platform_issue() [conditional]
│ └─ Guards: platform_bug|config_error AND confidence>70 AND !NO_GITHUB
│ └─ Dedup: cksum signature search in open issues
│
├─ 7. emit_event("daemon.root_cause", category, confidence)
│
└─ 8. Enrich GitHub comment with root cause section
└─ Collapsible <details> with category, confidence, suggestions
Dashboard (async, independent):
GET /api/root-cause/breakdown
│
├─ Read root-causes.jsonl
├─ Filter by recorded_at >= (now - period days)
├─ Group by category, aggregate count + avg confidence
└─ Return sorted breakdown JSON
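
The dashboard aggregation steps above can be sketched as a pure function over the file contents. This is a hedged illustration rather than the handler in `dashboard/server.ts`: the function name `buildBreakdown`, sorting by descending count, and rounding `avg_confidence` to an integer are all assumptions.

```typescript
interface BreakdownRow {
  category: string;
  count: number;
  percentage: number; // integer 0-100
  avg_confidence: number;
}

interface RootCauseBreakdown {
  breakdown: BreakdownRow[];
  total: number;
  period: number; // days
}

// Hypothetical helper: aggregates JSONL text into the breakdown response shape.
function buildBreakdown(jsonl: string, periodDays: number, now: Date): RootCauseBreakdown {
  const cutoff = now.getTime() - periodDays * 24 * 60 * 60 * 1000;
  const groups = new Map<string, { count: number; confidenceSum: number }>();
  let total = 0;

  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue;
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      continue; // malformed line: skip it rather than fail the request
    }
    if (typeof parsed !== "object" || parsed === null) continue;
    const entry = parsed as { category?: unknown; confidence?: unknown; recorded_at?: unknown };
    if (
      typeof entry.category !== "string" ||
      typeof entry.confidence !== "number" ||
      typeof entry.recorded_at !== "string"
    ) continue;

    // Keep only entries recorded within the requested period.
    const recordedAt = Date.parse(entry.recorded_at);
    if (Number.isNaN(recordedAt) || recordedAt < cutoff) continue;

    const group = groups.get(entry.category) ?? { count: 0, confidenceSum: 0 };
    group.count += 1;
    group.confidenceSum += entry.confidence;
    groups.set(entry.category, group);
    total += 1;
  }

  const breakdown = Array.from(groups.entries())
    .map(([category, g]) => ({
      category,
      count: g.count,
      percentage: Math.round((g.count / total) * 100),
      avg_confidence: Math.round(g.confidenceSum / g.count),
    }))
    .sort((a, b) => b.count - a.count); // assumed sort key: most frequent first

  return { breakdown, total, period: periodDays };
}
```

An empty or entirely malformed input yields `{breakdown: [], total: 0, period: N}`, which matches the always-200 contract described for the endpoint.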
| Boundary | Failure Mode | Behavior |
|---|---|---|
| Classifier regex | No pattern matches | Falls back to `code_bug` at 45% confidence |
| Empty input | No error message provided | Returns `unknown` at 0% confidence |
| Learning file missing | First classification ever | Skip boosting, pass through original confidence |
| Learning write fails | Disk full, permissions | Write returns exit 1; the daemon calls it with `\|\| true`, so the failure is ignored |
| GitHub API fails | Network, auth, rate limit | Warning logged, daemon continues with retry logic |
| Duplicate issue check | `gh` CLI unavailable | Skip dedup, create issue anyway (rare double-create is acceptable) |
| JSONL parse error | Malformed line in history | `jq` silently skips via `2>/dev/null` |
| Dashboard file read | Missing/corrupt JSONL | Returns `{breakdown:[], total:0, period:N}` |
D1: Regex over LLM classification
- Context: Need classification to work when Claude CLI is unavailable (since Claude failures are what we're classifying)
- Decision: Cascading regex with 7 category patterns
- Consequence: Less accurate on ambiguous errors, but zero external dependencies and sub-millisecond latency
D2: Append-only JSONL over SQLite
- Context: `sw-db.sh` exists but adds complexity; learning data is write-heavy, read-infrequent
- Decision: `~/.shipwright/optimization/root-causes.jsonl` with atomic append and 500-entry cap
- Consequence: Simple, no migrations, but linear scan for reads (acceptable at 500 entries)
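
The 500-entry cap in D2 amounts to appending a line and keeping only the tail. The sketch below shows that step as a pure function over file contents; `appendWithCap` is a hypothetical name, and the write itself is left out: the actual script writes to a temporary file and moves it into place.

```typescript
// Hypothetical helper: returns the new file contents after appending one entry
// and keeping only the newest `cap` lines.
function appendWithCap(existing: string, entry: object, cap: number = 500): string {
  const lines = existing.split("\n").filter((line) => line.trim() !== "");
  lines.push(JSON.stringify(entry));
  const kept = lines.slice(-Math.max(1, cap)); // tail slice drops the oldest entries
  return kept.join("\n") + "\n";
}
```

Reads stay a linear scan, but the scan never covers more than `cap` lines, which is the trade-off the decision accepts.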
D3: Confidence boosting bounded at ±10
- Context: Historical learning must improve accuracy without runaway feedback loops
- Decision: +2 per agreeing historical entry (max +10, cap 99), -5 per disagreeing entry (floor 10)
- Consequence: Self-correcting — misclassifications get penalized, but a single bad entry can't tank confidence below 10
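
As a worked illustration of the bounds in D3, here is a minimal sketch of the boosting arithmetic. It is not the Bash implementation of `rootcause_boost_from_history`. Two things are assumptions: the total penalty is capped at 10 (one reading of the ±10 bound, since the source only states the floor explicitly), and the incoming confidence is already between 10 and 99.

```typescript
interface HistoryEntry {
  category: string;
  message: string;
}

// Hypothetical sketch: adjusts a fresh confidence using past classifications
// whose message shares the same first 100 characters.
function boostFromHistory(
  message: string,
  category: string,
  confidence: number,
  history: HistoryEntry[],
): number {
  const key = message.slice(0, 100);
  let agree = 0;
  let disagree = 0;
  for (const past of history) {
    if (past.message.slice(0, 100) !== key) continue;
    if (past.category === category) agree += 1;
    else disagree += 1;
  }
  // No matching history: pass the original confidence through unchanged.
  if (agree === 0 && disagree === 0) return confidence;

  const boost = Math.min(10, agree * 2);      // +2 per agreeing entry, max +10
  const penalty = Math.min(10, disagree * 5); // -5 per disagreeing entry (cap of 10 is assumed)
  return Math.max(10, Math.min(99, confidence + boost - penalty)); // floor 10, cap 99
}
```

For example, three agreeing entries raise a 90% classification to 96%, while six or more can only take it to the 99 cap.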
D4: Issue creation threshold at >70% confidence
- Context: False positive GitHub issues create noise and erode trust
- Decision: Only `platform_bug` and `config_error` categories with confidence strictly >70% trigger issue creation
- Consequence: Misses some real platform bugs (false negatives), but avoids spamming the repo (acceptable trade-off since false negatives still get classified and logged)
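
The guard in D4 is small enough to state as a predicate. This sketch is illustrative: the function name is hypothetical, and treating any non-empty `NO_GITHUB` value as "set" is an assumption about how the Bash code reads the variable.

```typescript
// Hypothetical predicate mirroring the three guards on issue creation.
function shouldCreatePlatformIssue(
  category: string,
  confidence: number,
  env: Record<string, string | undefined>,
): boolean {
  // Assumption: any non-empty NO_GITHUB value disables GitHub side effects.
  if (env["NO_GITHUB"]) return false;
  const eligible = category === "platform_bug" || category === "config_error";
  return eligible && confidence > 70; // strictly greater than 70
}
```

Note that a classification at exactly 70% does not create an issue, and neither does a high-confidence `code_bug`.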
D5: Two-layer classification in daemon
- Context: `daemon-failure.sh` already has a simple 5-class `classify_failure()` for retry strategy
- Decision: Keep the simple classifier for retry decisions, add deep classifier (`rootcause_main()`) for analytics and issue creation
- Consequence: Slight duplication, but the simple classifier drives retry logic (where speed matters) while the deep classifier drives learning and reporting
- LLM-based classifier. Pros: Better accuracy on ambiguous errors, can understand novel failure modes / Cons: Creates circular dependency (Claude classifying Claude failures), adds API cost per failure, requires CLI availability, slower. Rejected for operational reliability.
- SQLite decision tree. Pros: Proper weighted decision tree, efficient queries, joins with events data / Cons: Requires schema migration, adds `sw-db.sh` dependency, more complex than needed for ~500 historical entries. Rejected as over-engineering: JSONL is sufficient at current scale.
- Inline classification in daemon only (no library). Pros: Simpler, fewer files / Cons: Not reusable from CLI or pipeline scripts, can't test in isolation, violates single-responsibility. Rejected for testability and reuse.
| File | Lines | Purpose |
|---|---|---|
| `scripts/lib/root-cause.sh` | ~428 | Core classifier, learning, issue creation, reporting |
| `scripts/sw-root-cause.sh` | ~197 | CLI entry point (classify/analyze/report/history) |
| `scripts/sw-root-cause-test.sh` | ~375 | 53 tests across 10 groups |
| File | Change |
|---|---|
| `scripts/lib/daemon-failure.sh` | Source `root-cause.sh`, call `rootcause_main()` in `daemon_on_failure()`, enrich GitHub comments |
| `scripts/sw` | Add `root-cause)` case to CLI dispatcher (~2 lines) |
| `scripts/sw-pipeline.sh` | Source `lib/root-cause.sh` |
| `scripts/lib/pipeline-cli.sh` | Source `lib/root-cause.sh` |
| `scripts/lib/pipeline-commands.sh` | Source `lib/root-cause.sh` |
| `dashboard/server.ts` | Add `GET /api/root-cause/breakdown` endpoint (~65 lines) |
| `dashboard/src/types/api.ts` | Add `RootCauseBreakdown` + `RootCauseBreakdownEntry` interfaces |
| `dashboard/src/core/api.ts` | Add `fetchRootCauseBreakdown()` function |
| `dashboard/src/views/insights.ts` | Add failure breakdown visualization (~35 lines) |
| `scripts/sw-server-api-test.sh` | Add breakdown endpoint test |
- None new. Uses existing `jq`, `gh`, and Bun runtime.
| Risk | Severity | Mitigation |
|---|---|---|
| Regex patterns too broad → misclassification | Medium | Conservative patterns requiring Shipwright-specific markers; 45% fallback confidence signals low certainty |
| `root-causes.jsonl` grows unbounded | Low | Capped at 500 entries via tail slice on write |
| `daemon_on_failure()` regression | Medium | Entire `rootcause_main()` call wrapped in a `\|\| true` guard |
| Dashboard endpoint slow on large JSONL | Low | 500-entry cap makes linear scan trivial (<5ms) |
- `scripts/sw-root-cause-test.sh` passes all 53 tests (classification per category, learning, boosting, issue creation guards, CLI subcommands)
- `scripts/sw-lib-daemon-failure-test.sh` passes all 34 tests (no regression in retry/backoff logic)
- `scripts/sw-server-api-test.sh` passes (new breakdown endpoint returns correct shape)
- `npx vitest run --config dashboard/vitest.config.ts` passes all 284 tests (TypeScript types compile, API client works, insights view renders)
- `scripts/sw-e2e-smoke-test.sh` passes all 19 tests (no pipeline regression)
- `shipwright root-cause classify "rate limit exceeded"` returns `rate_limit` category with ≥90% confidence
- `shipwright root-cause report` generates markdown report from empty and populated JSONL states
- With `NO_GITHUB=1`, `rootcause_create_platform_issue()` is a no-op (verified by test)
- `shipwright root-cause` appears in CLI help/dispatch (verifies router wiring)
- Does `daemon_on_failure()` still retry correctly? (Check retry comments on a test issue)
- Does the root cause section appear in failure comments? (Check GitHub issue timeline)
- Are `daemon.root_cause` events emitted to `events.jsonl`?
- Are classifications landing in `root-causes.jsonl`? (`shipwright root-cause history 10`)
- Is confidence boosting working? (Same error type should show increasing confidence)
- Are platform bug issues being created? (Check for auto-created issues with `[Platform Bug]` title prefix)
- Is the dashboard breakdown rendering? (`GET /api/root-cause/breakdown`)
- Are repeat platform failures decreasing >30%? (Compare `root-cause report` trend: 24h vs 7d)
- Are auto-created platform issues being resolved? (Check issue close rate)
- Is the classifier accuracy acceptable? (Manual spot-check 20 recent classifications)
- Spike: >10 `platform_bug` classifications in 1 hour (possible systemic issue)
- Absence: Zero classifications after 24h of daemon operation (integration broken)
- Confidence drift: Average confidence drops below 50% across categories (patterns need updating)
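
The three alert conditions above could be evaluated over the parsed learning entries roughly as follows. This is a sketch under stated assumptions, not an existing Shipwright function: `checkAlerts` and the alert names are hypothetical, the absence check assumes the daemon has been running for the full 24 hours, and confidence drift is averaged over all retained entries because the source does not name a window.

```typescript
interface AlertEntry {
  category: string;
  confidence: number;
  recorded_at: string; // ISO 8601
}

// Hypothetical check: returns the names of the alert conditions that currently hold.
function checkAlerts(entries: AlertEntry[], now: Date): string[] {
  const HOUR_MS = 60 * 60 * 1000;
  const nowMs = now.getTime();
  const alerts: string[] = [];

  // Spike: more than 10 platform_bug classifications in the last hour.
  const recentPlatformBugs = entries.filter(
    (e) => e.category === "platform_bug" && nowMs - Date.parse(e.recorded_at) <= HOUR_MS,
  ).length;
  if (recentPlatformBugs > 10) alerts.push("spike");

  // Absence: no classifications at all in the last 24 hours.
  const lastDay = entries.filter((e) => nowMs - Date.parse(e.recorded_at) <= 24 * HOUR_MS);
  if (lastDay.length === 0) alerts.push("absence");

  // Confidence drift: average confidence below 50 (window is an assumption).
  if (entries.length > 0) {
    const sum = entries.reduce((acc, e) => acc + e.confidence, 0);
    if (sum / entries.length < 50) alerts.push("confidence_drift");
  }
  return alerts;
}
```

Entries with an unparseable `recorded_at` are simply not counted toward the time-windowed checks, since the comparison against `NaN` is false.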
Not applicable — this feature degrades gracefully. If the classifier fails, the daemon continues with its existing retry logic unchanged. The || true guard ensures zero blast radius. Manual rollback (remove source lines) is sufficient if issues arise.
No schema changes. All persistence is append-only JSONL:
- New file: `~/.shipwright/optimization/root-causes.jsonl`
- Format: One JSON object per line, fields: `{category, confidence, message, recorded_at}`
- Rollback: Delete the file. No other state depends on it.
- Classification: Pure function given same inputs → same output (modulo history boosting, which is deterministic given same JSONL state)
- Learning writes: Append-only with timestamps. Duplicates are harmless because frequency counting handles them naturally
- GitHub issue creation: Deduplicates via `cksum` signature. Searches open issues before creating. Finding an existing match returns its URL instead of creating a duplicate
- Remove `source "${_daemon_failure_dir}/root-cause.sh"` from `scripts/lib/daemon-failure.sh`
- Remove `rootcause_main()` call block and `rc_section`/`rc_final_section` from `daemon_on_failure()`
- Remove `root-cause)` case from `scripts/sw`
- Remove `/api/root-cause/breakdown` handler from `dashboard/server.ts`
- Remove TypeScript additions from `types/api.ts`, `core/api.ts`, `views/insights.ts`
- Leave `root-causes.jsonl` in place (append-only, harmless)
- No schema migrations to reverse