Pipeline Design 200

Seth Ford edited this page Apr 4, 2026 · 2 revisions

ADR written to .claude/pipeline-artifacts/design.md (291 lines). Key architectural decisions documented:

  1. Static grep-based shared-state detection (6 patterns) over runtime isolation or manual annotation — conservative, portable, zero dependencies
  2. Temp file + grep for parallel result propagation — solves the bash subshell variable loss problem without IPC
  3. CPU-aware worker cap [2, 8] — scales to the machine, doesn't overwhelm constrained systems
  4. Facade with raw bash -c fallback — optimizer is purely additive, never worse than baseline

Four alternatives rejected with rationale: Vitest native (wrong layer), GNU parallel (no intelligence), Makefile deps (maintenance cost), Docker isolation (overhead exceeds gains).

  • Backwards compatible: disable via SW_TEST_OPTIMIZER=false or optimization: off in pipeline config
  • Must not break existing test correctness — parallel tests that share state must be detected and run sequentially
  • History/prioritization must be append-only and self-healing (corrupt JSONL lines skipped on read)

Decision

Introduce a test execution optimizer (scripts/lib/test-optimizer.sh) as a library sourced by both pipeline stages and the loop harness. The optimizer implements a four-phase execution model:

Component Diagram

                 ┌───────────────────────────┐
                 │  Entry Points             │
                 │  stage_test()             │
                 │  run_test_gate()          │
                 └─────────────┬─────────────┘
                               │
                 ┌─────────────▼─────────────┐
                 │  testopt_execute()        │
                 │  Facade orchestrator      │
                 │  - parses options         │
                 │  - gates on <3 tests      │
                 │  - falls back on init err │
                 └─────────────┬─────────────┘
                               │
            ┌──────────────────┼──────────────────┐
            │                  │                  │
   ┌────────▼────────┐ ┌───────▼───────┐ ┌────────▼───────┐
   │ Phase 1: Init   │ │ Phase 2:      │ │ Phase 3:       │
   │ testopt_init()  │ │ Prioritize    │ │ Partition      │
   │ - discover      │ │ testopt_      │ │ testopt_       │
   │   *-test.sh     │ │ prioritize()  │ │ partition_     │
   │ - load history  │ │ - fail_rate   │ │ shared_state() │
   │ - git diff      │ │   DESC        │ │ - 6 patterns   │
   │ - select        │ │ - duration    │ │ → parallel[]   │
   │   affected      │ │   ASC         │ │ → sequential[] │
   └────────┬────────┘ └───────┬───────┘ └────────┬───────┘
            └──────────────────┼──────────────────┘
                               │
              ┌────────────────┴──────────────┐
              │                               │
   ┌──────────▼──────────┐       ┌────────────▼────────────┐
   │ Phase 4a: Parallel  │       │ Phase 4b: Sequential    │
   │ testopt_run_parallel│       │ testopt_run_with_       │
   │ - N workers (2-8)   │       │ fast_fail               │
   │ - dir-grouped       │       │ - stop on first failure │
   │ - temp-file results │       │ - skip if 4a failed +   │
   └──────────┬──────────┘       │   fast_fail enabled     │
              │                  └────────────┬────────────┘
              └────────────────┬──────────────┘
                               │
                 ┌─────────────▼─────────────┐
                 │  Output                   │
                 │  - history → JSONL append │
                 │  - evidence → JSON file   │
                 │  - events → emit_event()  │
                 │  - report → stdout        │
                 └───────────────────────────┘

Key Design Decisions

1. Shared-state detection via static grep analysis (not runtime isolation)

  • Context: Need to determine which tests can run in parallel without race conditions.
  • Decision: Grep each test file for 6 patterns indicating shared state: hardcoded /tmp paths, port binding, SQLite files, PID/lock files, singleton TMPDIR assignments, and global config sourcing.
  • Alternatives rejected: (a) Runtime sandboxing with namespaces/cgroups — too heavy, not portable to macOS. (b) Manual annotation — requires maintainer discipline, gets stale.
  • Consequences: False positives send safe tests to the sequential bucket (slower but correct). False negatives would cause flaky parallel failures. The patterns are intentionally conservative — Pattern 6 (config sourcing) captures anything that sources a config file, which over-classifies but prevents subtle global-state mutation bugs.
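
The classification step might look roughly like the sketch below. The six regexes are illustrative guesses at the categories named in the decision (this page does not show the real expressions in test-optimizer.sh), and `classify_test` is a hypothetical name, not the library's function.

```bash
#!/usr/bin/env bash
# Illustrative regexes only, one per category from the decision above.
# The real patterns in test-optimizer.sh may differ.
SHARED_STATE_PATTERNS=(
  '/tmp/[A-Za-z0-9_]'                                  # 1. hardcoded /tmp path
  'localhost:[0-9]+|--port[= ][0-9]+'                  # 2. port binding
  '\.(sqlite|sqlite3|db)'                              # 3. SQLite file
  '\.(pid|lock)'                                       # 4. PID or lock file
  '^[[:space:]]*(export[[:space:]]+)?TMPDIR='          # 5. singleton TMPDIR assignment
  '^[[:space:]]*(source|\.)[[:space:]]+.*config'       # 6. global config sourcing
)

# Prints "SHARED:<path>" if any pattern matches the file's contents,
# otherwise "INDEPENDENT:<path>".
classify_test() {
  local file="$1" pattern
  # Per the interface contract, a missing file is reported as INDEPENDENT.
  if [[ ! -f "$file" ]]; then
    echo "INDEPENDENT:$file"
    return 0
  fi
  for pattern in "${SHARED_STATE_PATTERNS[@]}"; do
    if grep -Eq -- "$pattern" "$file"; then
      echo "SHARED:$file"
      return 0
    fi
  done
  echo "INDEPENDENT:$file"
}
```

Because the first match wins, a single broad pattern is enough to push a test into the sequential bucket, which is the conservative bias the decision describes.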

2. Subshell variable propagation via temp file + grep (not direct variable)

  • Context: Background jobs in testopt_run_parallel() run in subshells. Variable assignments (all_passed=false) don't propagate back to the parent.
  • Decision: Each background job writes "<file> PASS|FAIL <duration>" lines to a shared temp file. Parent checks grep -q ' FAIL ' on the results file.
  • Consequences: Simple and correct, with no IPC mechanisms. Concurrent jobs append to the same temp file, but each result line is written by a single echo, and such small appends do not interleave in practice on Linux or macOS local filesystems.
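
A minimal sketch of this propagation technique follows. `run_parallel_sketch` is a hypothetical name, and the sketch starts one background job per file without the worker cap or directory grouping that the real runner applies.

```bash
#!/usr/bin/env bash
# Hypothetical illustration of temp-file result propagation.
run_parallel_sketch() {
  local results all_passed=true file
  results="$(mktemp)"

  for file in "$@"; do
    (
      # This subshell cannot update variables in the parent shell,
      # so the outcome is appended to the shared results file instead.
      start=$(date +%s)
      if bash "$file" >/dev/null 2>&1; then status=PASS; else status=FAIL; fi
      echo "$file $status $(( $(date +%s) - start ))" >> "$results"
    ) &
  done
  wait  # block until every background job has finished

  # A single FAIL line anywhere in the file fails the whole phase.
  if grep -q ' FAIL ' "$results"; then all_passed=false; fi
  cat "$results"
  rm -f "$results"
  [[ "$all_passed" == true ]]
}
```

Setting `all_passed=false` inside the parenthesised block would have no effect on the parent, which is why the parent derives the flag from the file after `wait`.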

3. CPU-aware worker detection with hard cap at 8

  • Context: Need to scale parallelism to the machine without overwhelming constrained systems.
  • Decision: Detect cores via sysctl -n hw.ncpu (Darwin) / /proc/cpuinfo (Linux) / nproc, use 75%, clamp to [2, 8].
  • Consequences: 8-core machine gets 6 workers. 2-core CI runner gets 2 workers. Memory-constrained systems with many cores still get capped at 8.
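
The detection and clamping could be sketched as below. The function names are hypothetical, integer floor rounding for the 75% step is an assumption (it reproduces the 8-core and 2-core examples above), and the fallback of 4 follows the interface contract for detection failure.

```bash
#!/usr/bin/env bash

# Pure arithmetic: 75% of the cores (integer floor, an assumption),
# clamped to the range [2, 8].
workers_for_cores() {
  local cores="$1" workers
  workers=$(( cores * 75 / 100 ))
  if (( workers < 2 )); then workers=2; fi
  if (( workers > 8 )); then workers=8; fi
  echo "$workers"
}

# Platform-aware core count, then the clamp above.
detect_workers() {
  local cores=""
  case "$(uname -s)" in
    Darwin)
      cores="$(sysctl -n hw.ncpu 2>/dev/null)"
      ;;
    *)
      if [[ -r /proc/cpuinfo ]]; then
        cores="$(grep -c '^processor' /proc/cpuinfo 2>/dev/null)"
      elif command -v nproc >/dev/null 2>&1; then
        cores="$(nproc 2>/dev/null)"
      fi
      ;;
  esac
  # Detection failure (empty, zero, or non-numeric): use the safe default of 4.
  if ! [[ "$cores" =~ ^[1-9][0-9]*$ ]]; then
    echo 4
    return 0
  fi
  workers_for_cores "$cores"
}
```

Keeping the arithmetic in its own function makes the clamp testable without depending on the host's core count.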

4. Facade pattern with fallback to raw bash -c

  • Context: The optimizer must never be worse than doing nothing.
  • Decision: testopt_execute() falls back to bash -c "$test_cmd" when: (a) init fails, (b) fewer than 3 test files discovered, or (c) the optimizer is disabled via config.
  • Consequences: Zero risk of regression for edge cases. The optimizer is purely additive.
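
The three fallback conditions can be sketched as a small facade. Everything here is a stand-in for illustration: `execute_sketch`, `discover_tests`, and `run_optimized` are hypothetical names, and init failure is modelled simply as a missing project root.

```bash
#!/usr/bin/env bash
# Hypothetical stand-ins so the sketch runs on its own; the real
# initialisation and four-phase runner live in test-optimizer.sh.
discover_tests() {
  find "$1" -type f \( -name '*-test.sh' -o -name '*_test.sh' -o -name 'test_*.sh' \) 2>/dev/null
}
run_optimized() { echo "optimized: $# test files"; }

execute_sketch() {
  local root="$1" test_cmd="$2" line
  local tests=()

  # (c) Kill switch: behave exactly like the pre-optimizer baseline.
  if [[ "${SW_TEST_OPTIMIZER:-true}" == "false" ]]; then
    bash -c "$test_cmd"; return $?
  fi

  # (a) Init failure, modelled here as a missing project root.
  if [[ ! -d "$root" ]]; then
    bash -c "$test_cmd"; return $?
  fi

  while IFS= read -r line; do
    tests+=("$line")
  done < <(discover_tests "$root")

  # (b) Fewer than 3 test files: not worth optimizing.
  if (( ${#tests[@]} < 3 )); then
    bash -c "$test_cmd"; return $?
  fi

  run_optimized "${tests[@]}"
}
```

Every early exit runs the same `bash -c "$test_cmd"` the caller would have run without the optimizer, so the fallback paths return the baseline exit code unchanged.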

Data Flow

1. stage_test() reads pipeline config: optimization != "off" → calls testopt_execute()
   run_test_gate() reads SW_TEST_OPTIMIZER != "false" → calls testopt_execute()

2. testopt_execute(".", "npm test", "--fast-fail")
   └── testopt_init(".")
       ├── find *-test.sh *_test.sh test_*.sh → DISCOVERED_TESTS[]
       ├── read ~/.shipwright/optimization/test-history.jsonl → TEST_HISTORY[]
       ├── git diff HEAD~1..HEAD → CHANGED_FILES[]
       └── testopt_select_affected() → AFFECTED_TESTS[] (directory + source matching)

3. testopt_prioritize(AFFECTED_TESTS)
   └── for each test: score = (fail_rate * 10000) - duration_s
       └── sort -rn → highest-fail-rate, fastest-duration first

4. testopt_partition_shared_state(prioritized_tests)
   └── grep each file for 6 patterns → "SHARED:<path>" or "INDEPENDENT:<path>"
       ├── INDEPENDENT → parallel_tests[]
       └── SHARED → sequential_tests[]

5. Phase 4a: testopt_run_parallel(parallel_tests, workers=detect_cores*0.75)
   └── group tests by directory → background subshells → wait → grep results file

6. Phase 4b: testopt_run_with_fast_fail(sequential_tests)
   └── skipped entirely if Phase 4a failed AND fast_fail=true
   └── otherwise runs one-by-one, breaks on first failure

7. Record results → test-history.jsonl (JSONL append)
   Write evidence → $ARTIFACTS_DIR/test-optimizer-evidence.json
   Emit events → testopt.parallel_done, testopt.sequential_done, testopt.fail_fast
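
The scoring in step 3 can be illustrated with awk and sort. The input format (path, fail rate between 0 and 1, duration in seconds, one test per line on stdin) and the name `prioritize_sketch` are assumptions made for this sketch; the real function reads its data from the history file.

```bash
#!/usr/bin/env bash
# stdin:  "<path> <fail_rate> <duration_s>" per line (assumed format)
# stdout: paths ordered by score = (fail_rate * 10000) - duration_s, descending
prioritize_sketch() {
  awk '{ printf "%d\t%s\n", ($2 * 10000) - $3, $1 }' \
    | sort -rn -k1,1 \
    | cut -f2
}

# Example: the two tests with a 50% fail rate come first, and within
# each fail-rate group the shorter test wins.
printf '%s\n' \
  'a-test.sh 0.5 12' \
  'b-test.sh 0 3' \
  'c-test.sh 0 40' \
  'd-test.sh 0.5 2' \
  | prioritize_sketch
# prints: d-test.sh, a-test.sh, b-test.sh, c-test.sh (one per line)
```

With a weight of 10000, a fail-rate difference of 0.01 is worth 100 points, the same as 100 seconds of duration, so failure history dominates the ordering for all but very long tests.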

Interface Contracts

# Main entry point — called by pipeline and loop
testopt_execute <project_root> <test_cmd> [options...]
  Options:
    --max-workers=N         # int, override CPU-detected worker count (2-8)
    --fast-fail             # bool, stop on first failure (default)
    --continue-on-fail      # bool, run all tests despite failures
    --mode=auto|parallel|sequential  # execution mode (default: auto)
  Returns: exit 0 (all pass) | exit 1 (any failure)
  Errors: falls back to raw bash -c on init failure

# CPU detection — platform-aware core counting
testopt_detect_cores() -> stdout: integer (2-8)
  Errors: returns 4 (safe default) on detection failure

# Shared-state classification — static analysis of test files
testopt_partition_shared_state(file...) -> stdout: "SHARED:<path>" | "INDEPENDENT:<path>"
  Errors: non-existent files classified as INDEPENDENT

# Affected test selection — git-diff-driven test filtering
testopt_select_affected(changed_files...) -> sets AFFECTED_TESTS global array
  Errors: empty changed_files → returns all discovered tests

# Priority ordering — fail-rate weighted sort
testopt_prioritize(tests...) -> stdout: sorted test file paths (one per line)
  Errors: missing history → all tests score equally

# Parallel runner — background subshell execution
testopt_run_parallel(--max-workers=N, tests...) -> exit 0|1
  Errors: writes FAIL lines to temp file, checked via grep

# Sequential runner — fast-fail execution
testopt_run_with_fast_fail([--continue-on-fail], tests...) -> exit 0|1
  Errors: non-existent test files silently skipped
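
As a rough illustration of the sequential runner contract, the sketch below implements the fast-fail and --continue-on-fail behaviours. `run_with_fast_fail_sketch` is a hypothetical name and the "FAILED:" output prefix is an assumption; only the exit codes and the skip-missing-files rule come from the contract above.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the sequential runner contract.
run_with_fast_fail_sketch() {
  local continue_on_fail=false failed=0 file
  if [[ "${1:-}" == "--continue-on-fail" ]]; then
    continue_on_fail=true
    shift
  fi

  for file in "$@"; do
    # Contract: non-existent test files are silently skipped.
    [[ -f "$file" ]] || continue
    if ! bash "$file" >/dev/null 2>&1; then
      echo "FAILED: $file"   # output format is an assumption
      failed=1
      # Fast-fail (the default): stop at the first failing test.
      [[ "$continue_on_fail" == true ]] || break
    fi
  done
  return "$failed"
}
```

A call such as `run_with_fast_fail_sketch --continue-on-fail a-test.sh b-test.sh` runs both files and still returns 1 if either fails.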

Error Boundaries

| Component | Handles | Propagation | Fallback |
| --- | --- | --- | --- |
| testopt_execute() | Init failures, <3 tests | Returns raw bash -c exit code | Transparent fallthrough to original behavior |
| testopt_init() | Missing project root, no git, no history | Logs warning, continues with empty arrays | All tests treated as affected, no prioritization |
| testopt_run_parallel() | Subshell crashes, missing test files | Writes FAIL to results temp file | grep -q ' FAIL ' on results file catches all |
| testopt_run_with_fast_fail() | Test exit code != 0 | Breaks loop (fast-fail) or continues (flag) | Returns 1 with failed test name on stdout |
| testopt_record_history() | Write failures, missing directory | Suppressed via 2>/dev/null \|\| true | Missing history = no prioritization (graceful) |
| testopt_partition_shared_state() | Non-existent files | Classified as INDEPENDENT | Conservative: false positives go sequential |

Alternatives Considered

1. Vitest Native Parallelism (Node-Level)

  • Pros: Vitest already supports --pool threads/forks, parallel file execution, and --reporter for structured output. Would work natively with npm test.
  • Cons: This project's test suite is 102+ bash test scripts, not Vitest test files. The npm test command dispatches to these bash scripts. Vitest parallelism would only help if tests were .ts/.js files. Doesn't address affected-first or shared-state detection for bash scripts.
  • Why rejected: Wrong layer. The optimization target is bash-level test file execution, not Node-level test runner internals.

2. GNU Parallel / xargs -P

  • Pros: Battle-tested parallel execution. Simple: find *-test.sh | parallel -j$(nproc) bash {}.
  • Cons: No shared-state detection — would run all tests in parallel including those that fight over ports/files. No affected-first prioritization. No fast-fail cascade between parallel and sequential phases. Adds a dependency (parallel not installed by default on macOS).
  • Why rejected: Lacks the intelligence layer (prioritization, partitioning, history). Would need the same detection code wrapped around it anyway.

3. Makefile-Based Dependency Graph

  • Pros: Explicit dependency declaration between test files. make -j handles parallelism natively.
  • Cons: Requires maintaining a Makefile with test dependencies — high maintenance burden. Every new test file needs a rule. No automatic shared-state detection. Foreign to the existing bash-centric architecture.
  • Why rejected: Maintenance cost too high for 102+ test files. Static declaration gets stale.

4. Container-Based Isolation (Docker per test)

  • Pros: Perfect isolation — no shared-state concerns. Every test gets a clean filesystem.
  • Cons: Container startup overhead (~2-5s per test) would negate parallelism gains. 102 containers would require significant memory. Not available in all CI environments. Massive complexity increase.
  • Why rejected: Overhead exceeds the time savings from parallelism.

Implementation Plan

Files Created

| File | Purpose |
| --- | --- |
| scripts/lib/test-optimizer.sh (741 lines) | Core library: detection, partitioning, prioritization, execution, history, reporting |
| scripts/sw-test-optimizer-integration-test.sh (438 lines) | 25 integration tests covering all new functions |

Files Modified

| File | Change |
| --- | --- |
| scripts/lib/pipeline-stages-build.sh:569-584 | stage_test() reads optimization from pipeline config, calls testopt_execute() when not off |
| scripts/sw-loop.sh:997-1000 | run_test_gate() checks SW_TEST_OPTIMIZER, calls testopt_execute() when not false |
| templates/pipelines/*.json (9 files) | Added "optimization": "auto", "fast_fail": true to test stage config |

Dependencies

  • None new. Uses only: bash, jq, grep, find, sort, awk, mktemp, sysctl/nproc, git

Risk Areas

  1. Shared-state false negatives: Pattern 6 (config sourcing) is broad but not exhaustive. A test could share state via an unconventional mechanism (e.g., writing to a well-known path without matching any of the 6 patterns). Mitigation: --mode=sequential override and SW_TEST_OPTIMIZER=false kill switch.
  2. Concurrent JSONL writes: multiple parallel pipelines (worktrees) may append to ~/.shipwright/optimization/test-history.jsonl simultaneously. Small single-line appends are atomic in practice on Linux/macOS for typical filesystem block sizes, but this is not guaranteed. Mitigation: JSONL format is self-healing; corrupt lines are skipped on read.
  3. History file unbounded growth: test-history.jsonl grows indefinitely. At 102 tests × 10 runs/day × 365 days, that is ~372K lines per year. Mitigation: not yet implemented. Future work: add rotation or tail-N windowing.
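
The tail-N windowing listed as future work for the history file could be as small as the sketch below. This is not part of the current implementation; `trim_history_sketch` and the default window of 10000 lines are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical tail-N windowing for test-history.jsonl (future work).
# Usage: trim_history_sketch <file> [keep_lines]
trim_history_sketch() {
  local file="$1" keep="${2:-10000}" tmp
  # Nothing to trim if the history file does not exist yet.
  [[ -f "$file" ]] || return 0
  tmp="$(mktemp "${file}.XXXXXX")" || return 1
  # Keep only the newest lines, then swap the file in with a rename so
  # readers never observe a half-written history file.
  if tail -n "$keep" "$file" > "$tmp"; then
    mv "$tmp" "$file"
  else
    rm -f "$tmp"
    return 1
  fi
}
```

One caveat with this approach: a line appended by another pipeline between the `tail` and the `mv` would be lost, which is tolerable for advisory history data but worth knowing.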

Validation Criteria

  • All 9 pipeline templates contain "optimization": "auto" and "fast_fail": true in test stage config
  • stage_test() calls testopt_execute() when optimization != "off" (pipeline-stages-build.sh:576-581)
  • run_test_gate() calls testopt_execute() when SW_TEST_OPTIMIZER != "false" (sw-loop.sh:997-1000)
  • Backwards compatible: SW_TEST_OPTIMIZER=false bypasses optimizer in loop
  • Backwards compatible: optimization: "off" bypasses optimizer in pipeline
  • Fallback on <3 test files (test-optimizer.sh:621)
  • Fallback on init failure (test-optimizer.sh:614-618)
  • CPU detection clamped to [2, 8] (test-optimizer.sh:517-518)
  • 6 shared-state patterns implemented (test-optimizer.sh:537-567)
  • Parallel runner uses temp-file + grep for result propagation (test-optimizer.sh:382-413)
  • Evidence JSON written to $ARTIFACTS_DIR/test-optimizer-evidence.json
  • Events emitted: testopt.parallel_done, testopt.sequential_done, testopt.fail_fast, testopt.recorded
  • All 25 integration tests pass
  • All 20 existing unit tests pass
  • Pipeline template config discoverable via shipwright templates list

Test Pyramid Breakdown

| Layer | Count | Coverage Target | What's Tested |
| --- | --- | --- | --- |
| Unit | 20 (sw-test-optimizer-test.sh) | Core library functions | Discovery, history load/query, affected selection, prioritization sort, fast-fail, parallel execution |
| Integration | 25 (sw-test-optimizer-integration-test.sh) | New functions + wiring | testopt_detect_cores, testopt_partition_shared_state, testopt_execute orchestrator, stage_test() wiring, run_test_gate() wiring |
| E2E | 0 (covered by existing npm test) | Full pipeline flow | Optimizer activates during normal npm test runs; 102+ suites exercise the real path |

Coverage targets: 100% of public functions have direct tests. Error paths (missing root, <3 tests, empty history) explicitly tested. Edge cases (all-shared, all-independent, single file) covered.

Critical Paths Tested

Happy path: 5 independent + 2 shared-state tests → parallel phase runs first with N workers → sequential phase runs second → all pass → exit 0, evidence JSON written.

Error cases:

  1. Failing test in parallel bucket + fast-fail → sequential bucket skipped entirely → exit 1 with testopt.fail_fast event emitted
  2. Failing test with --continue-on-fail → all tests run regardless → exit 1 with full results

Edge cases:

  1. Zero test files discovered → falls through to bash -c "$test_cmd" (no optimizer overhead)
  2. All tests classified as SHARED → parallel bucket empty, sequential bucket gets everything → behaves like original sequential execution

Baseline Metrics

| Metric | Current Value | Source |
| --- | --- | --- |
| Full test suite wall-clock | ~1365s | memory/metrics.json (2026-04-04 baseline) |
| Execution mode | Sequential only | Single bash -c "$test_cmd" |
| Time to first failure | Up to ~1365s (worst case) | No early termination |

Optimization Targets

| Metric | Target | Rationale |
| --- | --- | --- |
| Full suite wall-clock | <900s (34% reduction) | Parallelism across ~75% of tests classified as independent |
| Time to first failure | <200s (70%+ reduction) | Affected-first prioritization + fast-fail stops early |
| Test correctness | Zero regressions | Shared-state partitioning prevents parallel flakes |

Profiling Strategy

  • Wall-clock per phase: testopt.parallel_done and testopt.sequential_done events capture duration per phase
  • Evidence JSON: test-optimizer-evidence.json records total/parallel/sequential counts, workers, mode, exit code per run
  • Historical trending: test-history.jsonl accumulates per-test duration and pass/fail data across runs
  • Dashboard integration: /api/metrics/stage-performance surfaces test stage duration trends from event data

Benchmark Plan

| Step | Method | Success Criteria |
| --- | --- | --- |
| Before | time npm test on main branch | Record wall-clock baseline (~1365s) |
| After | time npm test on feature branch | Wall-clock < 900s |
| Verify | Read test-optimizer-evidence.json | parallel_tests > 0, workers > 1, exit_code: 0 |
| Regression check | Compare test pass counts | Same number of PASS/FAIL as main branch |
