
Pipeline Design 170

Seth Ford edited this page Mar 1, 2026 · 1 revision


Design: Dashboard Real-Time Health & Anomaly Visualization

Context

Shipwright's dashboard (Bun WebSocket server + vanilla TypeScript frontend) currently serves 13 tab views via a reactive StoreView pattern. FleetState is computed server-side every 2s and broadcast to up to 50 WebSocket clients. The dashboard already displays cost, DORA grades, pipeline status, and agent health — but these are scattered across Overview, Metrics, and Insights tabs. There is no unified health-at-a-glance surface.

Issue #170 requests: a composite health score (0-100) with trend, anomaly alerts with drill-down, cost burn gauge, stage progress vs historical avg, and DORA cards — all live-updating via WebSocket.

Constraints:

  • Server is a single 5800-line server.ts (Bun runtime, SQLite + JSONL fallback)
  • Frontend uses no framework — raw DOM manipulation with View interface (init/render/destroy)
  • All CSS uses --var design tokens (--cyan, --abyss, --rose, etc.)
  • Tab panels are pre-rendered in HTML, toggled by switchTab()
  • FleetState broadcast uses JSON string deduplication — payload bloat impacts all clients
  • Bash 3.2 compatibility required for test scripts
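The deduplication constraint can be pictured as comparing each serialized FleetState against the previously sent one. The sketch below is illustrative only: createDedupedBroadcaster and its send callback are hypothetical names, and it assumes deduplication is a plain string comparison of the JSON payload.

```typescript
// Hypothetical sketch: skip a broadcast when the serialized state is unchanged.
type SendFn = (payload: string) => void;

function createDedupedBroadcaster(send: SendFn): (state: unknown) => boolean {
  let lastPayload: string | null = null;
  return (state: unknown): boolean => {
    const payload = JSON.stringify(state);
    if (payload === lastPayload) {
      return false; // identical to the previous tick: nothing is sent
    }
    lastPayload = payload;
    send(payload);
    return true;
  };
}
```

Under this model any field that changes on every tick defeats the comparison, which is why payload size and churn matter for all connected clients.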

Decision

Add a dedicated "Health" tab (Approach B) with server-side health score computation piggybacking on the existing getFleetState() cycle. Heavy data (7-day trend, anomaly details, stage history) is served by three new REST endpoints, fetched lazily on tab open, and is kept out of the WebSocket payload.

Key architectural choices:

  1. Health score computed in TypeScript on the server, not by shelling out to sw-pipeline-vitals.sh. This avoids subprocess overhead in the 2s broadcast loop. The four signals (momentum 25%, convergence 35%, budget 20%, error maturity 20%) mirror the vitals engine weights.

  2. FleetState extended with a small health? optional field (~200 bytes) containing only score, verdict, signals, and activePipelines. Trend and anomaly data are NOT in FleetState.

  3. Three REST endpoints for on-demand data: /api/health/trend, /api/health/anomalies, /api/health/stages. Each is cached server-side (60s TTL for trend, 30s for anomalies/stages).

  4. Header health badge provides at-a-glance status without switching tabs. Clicking navigates to the Health tab.

  5. All new types are additive and optional — zero risk to existing 13 views or their tests.
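The weighting in choice 1 reduces to a weighted sum of the four signals. A minimal sketch, assuming each signal has already been normalized to 0-100; SIGNAL_WEIGHTS, clampScore, and combineSignals are illustrative names, and rounding the result to an integer is an assumption rather than documented behavior.

```typescript
interface HealthSignals {
  momentum: number;
  convergence: number;
  budget: number;
  errorMaturity: number;
}

// Weights as stated in this design (they sum to 1.0).
const SIGNAL_WEIGHTS: HealthSignals = {
  momentum: 0.25,
  convergence: 0.35,
  budget: 0.2,
  errorMaturity: 0.2,
};

function clampScore(value: number): number {
  return Math.min(100, Math.max(0, value));
}

// Hypothetical combiner: weighted sum of clamped signals, rounded (assumption).
function combineSignals(signals: HealthSignals): number {
  const weighted =
    clampScore(signals.momentum) * SIGNAL_WEIGHTS.momentum +
    clampScore(signals.convergence) * SIGNAL_WEIGHTS.convergence +
    clampScore(signals.budget) * SIGNAL_WEIGHTS.budget +
    clampScore(signals.errorMaturity) * SIGNAL_WEIGHTS.errorMaturity;
  return Math.round(clampScore(weighted));
}
```

For example, signals of 80, 60, 40, and 20 combine to 20 + 21 + 8 + 4 = 53.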

Component Diagram

┌─────────────────────────────────────────────────────────────────┐
│                          server.ts                               │
│                                                                  │
│  getFleetState()                                                 │
│    ├── readEvents()          ─── events.jsonl / SQLite          │
│    ├── readDaemonState()     ─── daemon-state.json / SQLite     │
│    ├── getCostInfo()         ─── costs.json + budget.json       │
│    ├── calculateDoraGrades() ─── 7-day event scan               │
│    └── computeHealthScore()  ─── NEW: momentum/convergence/     │
│         │                        budget/errorMaturity            │
│         └── returns HealthInfo (appended to FleetState.health)  │
│                                                                  │
│  REST endpoints (lazy-fetched):                                  │
│    GET /api/health/trend     ─── getHealthTrend(days)           │
│    GET /api/health/anomalies ─── getAnomalies(events)           │
│    GET /api/health/stages    ─── getStageProgress(events, jobs) │
│                                                                  │
│  broadcastToClients(fleetState) ── every 2s via WebSocket       │
└────────────┬──────────────────────┬─────────────────────────────┘
             │ WS push              │ REST responses
             ▼                      ▼
┌─────────────────────────────────────────────────────────────────┐
│                       Browser Client                             │
│                                                                  │
│  ws.ts ──onmessage──► store.set("fleetState", data)            │
│                           │                                      │
│                           ├──► header.ts: renderHealthBadge()   │
│                           │    (score + verdict color)           │
│                           │                                      │
│                           └──► health.ts (if active tab):       │
│                                ├── renderHealthGauge(health)     │
│                                ├── renderCostBurnGauge(cost)     │
│                                └── renderDoraCards(dora)         │
│                                                                  │
│  api.ts (on tab open):                                           │
│    fetchHealthTrend()    ──► renderTrendSparkline(points)       │
│    fetchAnomalies()      ──► renderAnomalyAlerts(anomalies)     │
│    fetchStageProgress()  ──► renderStageProgress(stages)        │
└─────────────────────────────────────────────────────────────────┘

Interface Contracts

Server-Side (server.ts)

// Pure function — no I/O, no subprocess
function computeHealthScore(
  events: DaemonEvent[],
  daemonState: DaemonState,
  costInfo: CostInfo
): HealthInfo;
// Errors: returns { score: 100, verdict: "green", ... } when no data (healthy idle)

// File I/O (reads progress snapshots), cached 60s
function getHealthTrend(days: number): { points: Array<{ ts: string; score: number; verdict: string }> };
// Errors: returns { points: [] } on missing/corrupt files

// Reads events, compares to baselines using EMA
function getAnomalies(events: DaemonEvent[]): { anomalies: AnomalyAlert[] };
// Errors: returns { anomalies: [] } on computation failure

// Reads active jobs + historical events
function getStageProgress(events: DaemonEvent[], pipelines: PipelineInfo[]): { stages: StageProgressInfo[] };
// Errors: returns { stages: [] } on empty input
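The "cached 60s" and "30s" behavior for the REST-backed functions could be implemented with a small time-to-live wrapper. This is a sketch under assumptions: cachedFor is a hypothetical helper, and the injectable clock exists only so the expiry logic can be exercised without waiting.

```typescript
// Hypothetical helper: memoize a loader's result for ttlMs milliseconds.
function cachedFor<T>(
  ttlMs: number,
  load: () => T,
  now: () => number = Date.now,
): () => T {
  let entry: { value: T; expiresAt: number } | null = null;
  return (): T => {
    const t = now();
    if (entry !== null && t < entry.expiresAt) {
      return entry.value; // still fresh: reuse the cached value
    }
    const value = load();
    entry = { value, expiresAt: t + ttlMs };
    return value;
  };
}
```

A trend handler would then wrap its file-reading function once at startup with a 60000 ms TTL, so repeated tab opens within a minute do not re-read the progress snapshots.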

Shared Types (api.ts)

export interface HealthInfo {
  score: number;           // 0-100, clamped
  verdict: "green" | "yellow" | "red" | "critical";
  signals: {
    momentum: number;      // 0-100
    convergence: number;   // 0-100
    budget: number;        // 0-100
    errorMaturity: number; // 0-100
  };
  activePipelines: number;
}

export interface AnomalyAlert {
  id: string;
  metric: string;
  value: number;
  baseline: number;
  severity: "warning" | "critical";
  rootCause: string;
  factors: string[];
  actions: string[];
  ts: string;
  issue?: number;
}

export interface StageProgressInfo {
  pipelineIssue: number;
  stage: string;
  currentDuration_s: number;
  avgDuration_s: number;
  count: number;
  status: "on-track" | "slow" | "fast";
}

// Extended (additive):
export interface FleetState {
  /* ... existing fields unchanged ... */
  health?: HealthInfo;  // NEW — optional
}

export type TabId = /* existing 13 */ | "health";
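The status field on StageProgressInfo implies a comparison of the current duration against the historical average. The design does not state the cut-offs, so the sketch below assumes a hypothetical 25% tolerance band; classifyStage and the tolerance parameter are illustrative names.

```typescript
type StageStatus = "on-track" | "slow" | "fast";

// Hypothetical classifier: the 25% tolerance is an assumption, not a documented value.
function classifyStage(
  currentDuration_s: number,
  avgDuration_s: number,
  tolerance: number = 0.25,
): StageStatus {
  if (avgDuration_s <= 0) {
    return "on-track"; // no usable history to compare against
  }
  const ratio = currentDuration_s / avgDuration_s;
  if (ratio > 1 + tolerance) return "slow";
  if (ratio < 1 - tolerance) return "fast";
  return "on-track";
}
```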

Frontend View (health.ts)

// healthView implements the existing View interface (init/render/destroy):
export declare const healthView: View;
//   init(): void                     fetches trend, anomalies, stages via REST
//   render(state: FleetState): void  updates gauge, cost, DORA from WS data
//   destroy(): void                  removes click listeners on anomaly cards

// Render helpers (DOM manipulation, scoped to the element passed in):
function renderHealthGauge(el: HTMLElement, health: HealthInfo): void;
function renderTrendSparkline(el: HTMLElement, points: TrendPoint[]): void;
function renderAnomalyAlerts(el: HTMLElement, anomalies: AnomalyAlert[]): void;
function renderCostBurnGauge(el: HTMLElement, cost: CostInfo): void;
function renderStageProgress(el: HTMLElement, stages: StageProgressInfo[]): void;
function renderDoraCards(el: HTMLElement, dora: DoraGrades): void;

API Client Extensions (api.ts)

export async function fetchHealthTrend(days?: number): Promise<{ points: TrendPoint[] }>;
export async function fetchAnomalies(): Promise<{ anomalies: AnomalyAlert[] }>;
export async function fetchStageProgress(): Promise<{ stages: StageProgressInfo[] }>;
// All follow existing pattern: catch → return default empty response
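The "catch → return default" pattern can be sketched as a generic wrapper. Everything here is an assumption about shape rather than the existing client code: withFallback, JsonLoader, and fetchHealthTrendSketch are hypothetical names, and the loader is injected so the example does not depend on a particular fetch implementation.

```typescript
interface TrendPoint {
  ts: string;
  score: number;
  verdict: string;
}

type TrendResponse = { points: TrendPoint[] };

// Hypothetical loader type; in the browser this would wrap fetch() and res.json().
type JsonLoader = (url: string) => Promise<unknown>;

// Run a request and swallow any rejection, returning the fallback instead.
async function withFallback<T>(request: () => Promise<T>, fallback: T): Promise<T> {
  try {
    return await request();
  } catch {
    return fallback;
  }
}

function fetchHealthTrendSketch(loadJson: JsonLoader, days: number = 7): Promise<TrendResponse> {
  return withFallback<TrendResponse>(
    async () => (await loadJson(`/api/health/trend?days=${days}`)) as TrendResponse,
    { points: [] },
  );
}
```

Because failures resolve to an empty response rather than rejecting, the view can render its empty state without its own try/catch.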

Data Flow

1. REAL-TIME (every 2s):
   server: getFleetState()
     → computeHealthScore(events, daemonState, costInfo)
     → FleetState.health = { score, verdict, signals, activePipelines }
     → broadcastToClients(fleetState)
     → ws.onmessage → store.set("fleetState")
     → header: renderHealthBadge(state.health)
     → health tab (if active): renderHealthGauge, renderCostBurnGauge, renderDoraCards

2. LAZY (on tab open, cached):
   health.init()
     → fetchHealthTrend(7) → GET /api/health/trend?days=7
       → server: getHealthTrend(7) reads ~/.shipwright/progress/issue-*.json
       → returns { points: [...] } → renderTrendSparkline()
     → fetchAnomalies() → GET /api/health/anomalies
       → server: getAnomalies(events) compares stage durations/failures vs EMA baselines
       → returns { anomalies: [...] } → renderAnomalyAlerts()
     → fetchStageProgress() → GET /api/health/stages
       → server: getStageProgress(events, pipelines) compares current vs avg
       → returns { stages: [...] } → renderStageProgress()

3. USER INTERACTION:
   click anomaly card → toggle .anomaly-drilldown visibility (local state)
   click health badge → switchTab("health")
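The EMA comparison in step 2 can be illustrated with a small baseline-and-threshold check. This is a sketch, not the server's logic: the smoothing factor and the 2x / 3x severity multipliers are hypothetical values chosen for the example, and emaBaseline and checkAnomaly are illustrative names.

```typescript
type AnomalySeverity = "warning" | "critical";

interface AnomalyCheck {
  baseline: number;
  severity: AnomalySeverity;
}

// EMA over samples ordered oldest to newest: ema = alpha * x + (1 - alpha) * ema.
function emaBaseline(samples: number[], alpha: number): number | null {
  let ema: number | null = null;
  for (const x of samples) {
    ema = ema === null ? x : alpha * x + (1 - alpha) * ema;
  }
  return ema;
}

// Hypothetical thresholds: 2x the baseline is a warning, 3x is critical.
function checkAnomaly(value: number, history: number[], alpha: number = 0.3): AnomalyCheck | null {
  const baseline = emaBaseline(history, alpha);
  if (baseline === null || baseline <= 0) {
    return null; // no usable baseline, so nothing can be flagged
  }
  const ratio = value / baseline;
  if (ratio >= 3) return { baseline, severity: "critical" };
  if (ratio >= 2) return { baseline, severity: "warning" };
  return null;
}
```

Returning null when there is no history matches the "No baseline data" row in the error boundaries: the endpoint would simply report no anomalies.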

Error Boundaries

| Component | Error Source | Handling |
| --- | --- | --- |
| computeHealthScore() | Missing events/cost data | Returns healthy idle state (score=100, verdict="green") |
| getHealthTrend() | Missing/corrupt progress files | Returns { points: [] }; sparkline shows empty state |
| getAnomalies() | No baseline data | Returns { anomalies: [] }; alerts section shows "No anomalies" |
| getStageProgress() | No active pipelines | Returns { stages: [] }; progress section shows empty state |
| REST fetch failures | Network/auth errors | API client catches, returns default response; view shows last known data |
| WebSocket disconnect | Network loss | Existing reconnect logic preserves last FleetState; health badge shows "—" |
| healthView.init() | Any throw | Caught by existing tab error boundary in router.ts; shows retry button |
| healthView.render() | Bad FleetState shape | Guard: if (!state.health) return; (skip render, no crash) |

Alternatives Considered

  1. Scatter across existing tabs — Pros: no new tab, minimal new code / Cons: no unified health view, user must hop 3 tabs, doesn't meet acceptance criteria for cohesive dashboard. Rejected because it fragments the monitoring experience.

  2. Header overlay/panel — Pros: always visible without tab switching / Cons: cramped header, complex overlay z-index management, hard to fit 5 widgets plus drill-down in a header panel. Rejected because it requires header restructuring with high blast radius.

  3. Shell out to sw-pipeline-vitals.sh — Pros: reuses existing bash logic / Cons: adds ~200ms subprocess per 2s broadcast, doesn't scale with 50 clients, bash output parsing fragile. Rejected for performance — TypeScript computation is synchronous and fast.

  4. Put all data in FleetState — Pros: single data source / Cons: the 7-day trend and anomaly details add ~5KB per broadcast; at 50 clients every 2s that is roughly 125 KB/s of extra egress (5 KB × 50 ÷ 2 s). Rejected for payload bloat. Only the ~200-byte summary goes in FleetState; heavy data is REST-fetched lazily.

Component Hierarchy

App
├── Header (existing)
│    ├── ConnectionDot (existing)
│    ├── CostTicker (existing)
│    └── HealthBadge (NEW) ← state: fleetState.health
│         └── onClick → switchTab("health")
│
├── TabNav (existing, extended)
│    └── "Health" button (NEW)
│
└── Main
     └── HealthView (NEW, tab="health")
          ├── HealthScoreSection ← state: fleetState.health (WS)
          │    ├── HealthGauge (SVG circle)
          │    └── TrendSparkline ← local: healthTrend (REST, cached)
          │
          ├── AnomalyAlertsSection ← local: anomalies (REST, cached)
          │    └── AlertCard[] ← local: expandedAlertId (view state)
          │         └── DrillDown (conditional render)
          │
          ├── BottomRow
          │    ├── CostBurnGauge ← state: fleetState.cost (WS)
          │    └── StageProgressList ← local: stages (REST, cached)
          │         └── StageProgressBar[]
          │
          └── DoraCardsSection ← state: fleetState.dora (WS)
               └── DoraCard × 4

State ownership: WS-driven data lives in the global store (health, cost, dora). REST-fetched data (trend, anomalies, stages) lives as module-level variables in health.ts, refreshed on init().
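For the HealthGauge (SVG circle), one common way to draw a partial ring is to set stroke-dasharray to the circumference and stroke-dashoffset to the unfilled portion. Whether the gauge uses this technique is an assumption; gaugeArc is a hypothetical helper that only computes the two numbers.

```typescript
interface GaugeArc {
  circumference: number; // value for stroke-dasharray
  dashOffset: number;    // value for stroke-dashoffset (the unfilled part of the ring)
}

// Hypothetical helper: map a 0-100 score onto a ring of the given radius.
function gaugeArc(score: number, radius: number): GaugeArc {
  const clamped = Math.min(100, Math.max(0, score));
  const circumference = 2 * Math.PI * radius;
  return {
    circumference,
    dashOffset: circumference * (1 - clamped / 100),
  };
}
```

A score of 100 gives an offset of 0 (full ring) and a score of 0 gives an offset equal to the circumference (empty ring).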

State Management Approach

| Data | Source | Update Frequency | Storage |
| --- | --- | --- | --- |
| health.score/verdict/signals | WebSocket FleetState | Every 2s | store.fleetState.health |
| cost | WebSocket FleetState | Every 2s | store.fleetState.cost (existing) |
| dora | WebSocket FleetState | Every 2s | store.fleetState.dora (existing) |
| 7-day trend points | REST /api/health/trend | On tab open | Module-local in health.ts |
| Anomaly alerts | REST /api/health/anomalies | On tab open | Module-local in health.ts |
| Stage progress | REST /api/health/stages | On tab open | Module-local in health.ts |
| Expanded alert ID | User click | On interaction | Module-local in health.ts |

No new store keys needed — REST data is tab-scoped and discarded on destroy().

Accessibility Checklist (WCAG AA)

  • Health gauge: aria-label="Pipeline health score: {score} out of 100, status {verdict}", not color-only
  • Alert severity: text label + icon, not just colored border
  • Anomaly drill-down: <button> wrapper on cards, aria-expanded, keyboard Enter/Space
  • Focus management: visible focus ring (existing --cyan outline), logical tab order
  • Color contrast: all text uses --text-primary on --abyss (passes 4.5:1 per existing design)
  • Semantic HTML: <section aria-label>, <h2>, <button>, <ul>/<li> for alerts
  • Live updates: aria-live="polite" on health score container for screen reader announcements
  • Touch targets: all clickable elements min 44×44px
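The drill-down item above pairs a single expandedAlertId with an aria-expanded attribute per card. A minimal sketch of that state logic, assuming at most one card is open at a time (the design does not say so explicitly); nextExpandedId and ariaExpandedFor are hypothetical names.

```typescript
// Clicking the open card collapses it; clicking another card switches to it.
function nextExpandedId(current: string | null, clickedId: string): string | null {
  return current === clickedId ? null : clickedId;
}

// aria-expanded takes the string values "true" or "false".
function ariaExpandedFor(cardId: string, expandedId: string | null): "true" | "false" {
  return cardId === expandedId ? "true" : "false";
}
```

When the card is a native button element, Enter and Space already trigger its click handler, so no separate key listener is needed for this toggle.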

Responsive Breakpoints

| Breakpoint | Layout |
| --- | --- |
| 320px (mobile) | Single column stack. Gauge shrinks to 120px. DORA cards 1×4 vertical. Alert cards full width. |
| 768px (tablet) | 2-column grid: gauge + trend side-by-side, cost + stages side-by-side. DORA cards 2×2. |
| 1024px (desktop) | Full designed layout. DORA cards 4×1 row. All sections visible without scroll. |
| 1440px (wide) | Wider cards with more sparkline data points. Gauge at full 200px. |
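Read as minimum widths, the breakpoints above map a viewport width to one of four layouts. That min-width reading is an assumption (the table lists widths without stating a direction), and layoutForWidth is a hypothetical helper useful mainly for tests or for choosing how many sparkline points to draw.

```typescript
type HealthLayout = "mobile" | "tablet" | "desktop" | "wide";

// Assumes each breakpoint is a min-width threshold.
function layoutForWidth(widthPx: number): HealthLayout {
  if (widthPx >= 1440) return "wide";
  if (widthPx >= 1024) return "desktop";
  if (widthPx >= 768) return "tablet";
  return "mobile";
}
```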

Implementation Plan

Files to create

  • dashboard/src/views/health.ts — Health tab view with 6 render functions
  • dashboard/src/views/health.test.ts — Vitest unit tests (9 test cases)

Files to modify

  • dashboard/src/types/api.ts — Add HealthInfo, AnomalyAlert, StageProgressInfo; extend FleetState; extend TabId
  • dashboard/server.ts — Add computeHealthScore(), getHealthTrend(), getAnomalies(), getStageProgress(); extend getFleetState(); register 3 REST endpoints
  • dashboard/src/main.ts — Import and registerView("health", healthView)
  • dashboard/src/core/api.ts — Add fetchHealthTrend(), fetchAnomalies(), fetchStageProgress()
  • dashboard/public/index.html — Add Health tab button + panel
  • dashboard/public/styles.css — Health-specific CSS (~150 lines)
  • dashboard/src/components/header.ts — Add renderHealthBadge() after connection dot
  • scripts/sw-dashboard-e2e-test.sh — Add health endpoint + FleetState verification
  • scripts/sw-server-api-test.sh — Add 3 endpoint tests

Dependencies

  • None new. All data sources (events.jsonl, costs.json, budget.json, progress/) are already read by the server.

Risk areas

  • server.ts size: Already 5800 lines. Adding ~150 lines of health computation is acceptable but approaching the threshold where extraction to a module would help. Monitor.
  • computeHealthScore() in broadcast loop: Must remain synchronous and fast (<5ms). No file I/O, no subprocesses. The REST endpoints handle heavy lifting separately.
  • CSS specificity: Prefix all new classes with health- to avoid collisions with existing 13 views' styles.
  • FleetState deduplication: Adding health to FleetState means the payload can change on most 2s cycles while pipelines are progressing, so fewer broadcasts are deduplicated. Acceptable: the field is small (~200 bytes) and clients need the updates.

Validation Criteria

  • npm run build compiles with zero errors (TypeScript strict mode)
  • npm test passes all existing 102 test suites + new health.test.ts
  • Health tab renders gauge at score=0, score=50, score=100 (boundary cases)
  • Health tab renders correct verdict colors: green (>=75), yellow (>=50), red (>=25), critical (<25)
  • Empty state (no pipelines): score=100, "No active pipelines" message, empty alerts/stages
  • Anomaly drill-down toggles on Enter/Space key (keyboard accessible)
  • FleetState WebSocket message includes health field with valid structure
  • /api/health/trend?days=7 returns an object of the form { points: [...] }, where points is an array
  • /api/health/anomalies returns { anomalies: [...] } with required fields per item
  • /api/health/stages returns { stages: [...] } with status classification
  • Header health badge updates every 2s and navigates to Health tab on click
  • Layout responsive at 320px, 768px, 1024px, 1440px (no horizontal overflow)
  • No regressions in existing Overview, Metrics, or Insights tab rendering
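The verdict thresholds in the criteria above translate directly into a lookup that the boundary tests can target. verdictForScore is a hypothetical name; the cut-offs are the ones listed (green >= 75, yellow >= 50, red >= 25, critical below 25).

```typescript
type HealthVerdict = "green" | "yellow" | "red" | "critical";

// Thresholds taken from the validation criteria in this design.
function verdictForScore(score: number): HealthVerdict {
  if (score >= 75) return "green";
  if (score >= 50) return "yellow";
  if (score >= 25) return "red";
  return "critical";
}
```

Testing each threshold and the value just below it (75/74, 50/49, 25/24) alongside 0 and 100 covers every branch.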
