
Pipeline Design 20

Seth Ford edited this page Feb 13, 2026 · 2 revisions

Design: Add end-to-end integration test suite in CI

Context

Shipwright has 22 unit/component test suites that mock external dependencies (Claude API, GitHub API, tmux) but zero integration tests that exercise the real pipeline orchestration end-to-end. This means regressions in stage sequencing, state file management, and cross-script coordination go undetected until manual testing.

Constraints from the codebase:

  • All scripts are Bash 3.2 compatible (set -euo pipefail, no associative arrays, no readarray)
  • Test harness convention: PASS/FAIL counters, ERR trap, colored output via info()/success()/error() helpers
  • Pipeline state lives in .claude/pipeline-state.md and .claude/pipeline-artifacts/
  • Existing CI (test.yml) runs npm test which executes all 22 test suites via package.json scripts
  • Pipeline templates are JSON files in templates/pipelines/ — the pipeline reads them to determine which stages to run, iteration counts, and gating
  • Budget enforcement exists in sw-cost.sh — the pipeline checks remaining budget before each stage
  • --dry-run flag is already supported by sw-pipeline.sh and skips actual Claude invocations
  • $NO_GITHUB env var disables all GitHub API calls throughout the codebase
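The harness convention listed above can be sketched roughly as follows. This is an illustration under assumptions, not the project's actual helper code: only the names info(), success(), error() and the PASS/FAIL counters come from the constraints list, while the function bodies, the color codes, and the assert_eq and summary helpers are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch of a PASS/FAIL harness; Bash 3.2 compatible (no associative arrays).
set -euo pipefail

PASS=0
FAIL=0

# Colored output helpers (names from the design; bodies are assumed).
info()    { printf '\033[0;34m[info]\033[0m %s\n' "$*"; }
success() { printf '\033[0;32m[pass]\033[0m %s\n' "$*"; }
error()   { printf '\033[0;31m[fail]\033[0m %s\n' "$*" >&2; }

# assert_eq is a hypothetical helper: compare two strings and count the result.
assert_eq() {
  local label="$1" expected="$2" actual="$3"
  if [ "$expected" = "$actual" ]; then
    PASS=$((PASS + 1))
    success "$label"
  else
    FAIL=$((FAIL + 1))
    error "$label (expected '$expected', got '$actual')"
  fi
}

# summary is also hypothetical: print totals, fail when any assertion failed.
summary() {
  info "PASS=$PASS FAIL=$FAIL"
  [ "$FAIL" -eq 0 ]
}
```

The counters are incremented with PASS=$((PASS + 1)) rather than ((PASS++)) because the latter returns a non-zero status when the counter is 0, which would abort the script under set -e.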

Decision

Two-tier integration test architecture with clear isolation between deterministic smoke tests (Tier 1) and API-dependent live tests (Tier 2).

Tier 1: Smoke Tests (no API key, no secrets, runs on every PR)

A new script scripts/sw-integration-test.sh following the existing test harness pattern. It exercises real pipeline orchestration with mock Claude/GitHub binaries — the same mocking approach used in sw-pipeline-test.sh but focused on end-to-end flow rather than individual function behavior.

Four smoke test cases:

  1. Dry-run smoke: sw pipeline start --goal "test" --dry-run exits 0, emits "Dry run" on stdout, and creates the .claude/ directory structure
  2. Stage ordering — A mocked pipeline runs the integration template (intake → build → test), verifying each stage executes in sequence by checking ordered timestamps in the state file
  3. State file integrity — After a pipeline run, validates .claude/pipeline-state.md contains required fields (stage names, timestamps, status per stage, goal)
  4. Budget enforcement — Sets budget to $0.00, runs pipeline, verifies it exits cleanly (exit 0 or well-defined exit code) with a budget-exceeded message rather than crashing
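The stage-ordering check in case 2 could look something like the sketch below. The state file layout is an assumption made for illustration: each stage is taken to appear on its own line as "name: ISO-8601-timestamp", which may not match what sw-pipeline.sh actually writes. The function name timestamps_ordered is hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# timestamps_ordered FILE STAGE...: succeed only if every STAGE has a
# timestamp in FILE and the timestamps are non-decreasing in the given order.
# Assumed line format (hypothetical): "<stage>: <ISO-8601 timestamp>".
timestamps_ordered() {
  local file="$1"; shift
  local stage ts stamps=""
  for stage in "$@"; do
    # First timestamp recorded for this stage (empty if the stage is absent).
    ts=$(awk -v s="$stage" '$1 == (s ":") { print $2; exit }' "$file")
    [ -n "$ts" ] || return 1
    stamps="${stamps}${ts}
"
  done
  # ISO-8601 timestamps in one timezone sort lexicographically, so
  # "sort -c" (check only, no output) is enough to verify the order.
  printf '%s' "$stamps" | LC_ALL=C sort -c 2>/dev/null
}
```

The comparison is deliberately non-strict: with mocked binaries, two stages can finish within the same second, and equal timestamps should not count as a sequencing failure.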

Mock strategy: Each test case creates a temp directory, populates it with mock binaries for claude and gh (echoing expected output), sets PATH to prefer mocks, sets NO_GITHUB=1, and runs the pipeline. This is identical to the pattern in sw-pipeline-test.sh:45-80 where mock binaries are set up.
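A minimal sketch of that mock setup, assuming nothing about the actual contents of sw-pipeline-test.sh: the helper names make_mock and setup_mock_env and the canned response strings are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# make_mock DIR NAME OUTPUT: write an executable stub that prints OUTPUT
# and ignores its arguments.
make_mock() {
  local dir="$1" name="$2" output="$3"
  mkdir -p "$dir"
  {
    printf '#!/usr/bin/env bash\n'
    printf 'echo %q\n' "$output"   # %q quotes the text safely for the stub
  } > "$dir/$name"
  chmod +x "$dir/$name"
}

# setup_mock_env: create a temp dir with claude and gh stubs, print its path.
setup_mock_env() {
  local tmp
  tmp=$(mktemp -d)
  make_mock "$tmp/bin" claude "mock claude response"
  make_mock "$tmp/bin" gh "mock gh response"
  printf '%s\n' "$tmp"
}

# Usage inside a test case (the pipeline call is shown as a comment only):
#   tmp=$(setup_mock_env)
#   PATH="$tmp/bin:$PATH" NO_GITHUB=1 sw pipeline start --goal "test" --dry-run
#   rm -rf "$tmp"
```

Prepending the mock directory to PATH, rather than replacing PATH, keeps real tools such as git and jq available while still shadowing claude and gh.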

Tier 2: Live Tests (gated behind INTEGRATION_LIVE=true + ANTHROPIC_API_KEY)

Three live test cases that call the real Claude API with strict budget controls:

  1. README modification — Creates a temp git repo, runs sw pipeline start --goal "Add a one-line description to README.md" --template integration, verifies git diff shows README changes
  2. PR creation — Runs a full fast template pipeline against the temp repo, verifies a PR branch exists and the working tree is clean
  3. Budget cap verification — After the live run, reads ~/.shipwright/costs.json and asserts total spend is under $1.00
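The budget cap assertion in case 3 needs a decimal comparison, which Bash arithmetic cannot do on its own. One possible sketch delegates it to awk. The function name spend_under_cap is hypothetical, and the jq filter in the usage comment assumes a field name that may not match the real costs.json schema.

```shell
#!/usr/bin/env bash
set -euo pipefail

# spend_under_cap SPEND CAP: succeed when SPEND is a plain decimal number
# strictly below CAP. Non-numeric input (for example "null") is rejected,
# because awk would otherwise silently treat it as 0.
spend_under_cap() {
  local spend="$1" cap="$2"
  awk -v s="$spend" -v c="$cap" 'BEGIN {
    ok = (s ~ /^[0-9]+(\.[0-9]+)?$/) && ((s + 0) < (c + 0))
    exit !ok
  }'
}

# Hypothetical usage; ".total" is an assumed field name, not the real schema:
#   spend=$(jq -r '.total' "$HOME/.shipwright/costs.json")
#   spend_under_cap "$spend" 1.00 || echo "budget cap exceeded"
```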

Safety mechanisms:

  • SHIPWRIGHT_BUDGET_LIMIT=1.00 environment variable hard-caps spending
  • 15-minute job timeout in CI prevents runaway API calls
  • Tests run against a throwaway temp repo (not the real Shipwright repo)
  • INTEGRATION_LIVE must be explicitly set to true, so live tests never run by default
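The opt-in gate described above might be sketched like this. The function names live_tests_enabled and run_live_tier and the printed messages are hypothetical. The environment variable names come from the design.

```shell
#!/usr/bin/env bash
set -euo pipefail

# live_tests_enabled: succeed only when the opt-in flag is exactly "true"
# and an API key is present. Unset variables are treated as empty so the
# check is safe under "set -u".
live_tests_enabled() {
  [ "${INTEGRATION_LIVE:-}" = "true" ] && [ -n "${ANTHROPIC_API_KEY:-}" ]
}

# run_live_tier: skip (return 0) instead of failing when the gate is closed.
run_live_tier() {
  if ! live_tests_enabled; then
    echo "SKIP: live tests disabled"
    return 0
  fi
  # Apply the spending cap before anything can call the API.
  export SHIPWRIGHT_BUDGET_LIMIT=1.00
  echo "RUN: live tests (budget limit $SHIPWRIGHT_BUDGET_LIMIT)"
}
```

Requiring the exact string "true" means values such as 1 or yes leave the gate closed, which errs on the side of not spending money.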

Pipeline Template

A new templates/pipelines/integration.json with minimal stages:

  • Stages: intake → build → test (3 stages only)
  • Model: sonnet (cheapest capable model)
  • Max iterations: 3 (enough to verify the loop, cheap enough to cap costs)
  • All gates: auto (no human approval needed)
  • No PR/deploy/monitor stages — keeps costs and complexity minimal
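For orientation, the template could be generated along these lines. Every JSON key name in this sketch is hypothetical, chosen only to mirror the bullets above. The real schema is whatever the existing files in templates/pipelines/ use, and the function name write_integration_template is likewise an assumption.

```shell
#!/usr/bin/env bash
set -euo pipefail

# write_integration_template PATH: emit a minimal template for illustration.
# Key names are assumptions, not the schema sw-pipeline.sh actually reads.
write_integration_template() {
  local out="$1"
  mkdir -p "$(dirname "$out")"
  cat > "$out" <<'JSON'
{
  "name": "integration",
  "model": "sonnet",
  "max_iterations": 3,
  "max_cost": 1.0,
  "stages": [
    { "id": "intake", "gate": "auto" },
    { "id": "build", "gate": "auto" },
    { "id": "test", "gate": "auto" }
  ]
}
JSON
}

# Example: write_integration_template templates/pipelines/integration.json
```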

CI Workflow

A new .github/workflows/integration-test.yml with three jobs:

| Job | Trigger | Secrets | Timeout | Purpose |
| --- | --- | --- | --- | --- |
| smoke | Every PR | None | 5 min | Tier 1 deterministic tests |
| live | PRs when ANTHROPIC_API_KEY secret exists | ANTHROPIC_API_KEY | 15 min | Tier 2 API tests |
| regression | Push to main | ANTHROPIC_API_KEY | 15 min | Post-merge verification |

Each job writes per-test-case results to $GITHUB_STEP_SUMMARY as a markdown table (test name, status, duration).
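A sketch of how the per-test summary table might be written. The helper names summary_header, summary_row, and timed_case are hypothetical. GITHUB_STEP_SUMMARY is the file path that GitHub Actions provides to each step, and the sketch falls back to stdout when it is unset so the script also works locally.

```shell
#!/usr/bin/env bash
set -euo pipefail

# summary_header: start the markdown table (test name, status, duration).
summary_header() {
  {
    echo "| Test | Status | Duration |"
    echo "| --- | --- | --- |"
  } >> "${GITHUB_STEP_SUMMARY:-/dev/stdout}"
}

# summary_row NAME STATUS SECONDS: append one table row.
summary_row() {
  local name="$1" status="$2" seconds="$3"
  printf '| %s | %s | %ss |\n' "$name" "$status" "$seconds" \
    >> "${GITHUB_STEP_SUMMARY:-/dev/stdout}"
}

# timed_case NAME COMMAND...: run one test case, time it with whole-second
# resolution, and record the result without aborting on failure.
timed_case() {
  local name="$1"; shift
  local start status="PASS"
  start=$(date +%s)
  "$@" || status="FAIL"
  summary_row "$name" "$status" "$(( $(date +%s) - start ))"
}
```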

The existing test.yml gets a new parallel job integration-smoke that runs npm run test:integration alongside the existing unit test job — ensuring smoke tests block PRs just like unit tests do.

Data Flow

PR opened
  ├── test.yml → unit tests (existing 22 suites)
  │            → integration-smoke (NEW: Tier 1)
  └── integration-test.yml
       ├── smoke job (Tier 1 — always)
       ├── live job (Tier 2 — when API key available)
       └── regression job (Tier 2 — main branch only)

Error Handling

  • Smoke tests: ERR trap captures failures, logs the failing test case, increments FAIL counter, continues to next test. Final exit code = 1 if any FAIL > 0.
  • Live tests: Same ERR trap pattern. Additionally, if the budget check fails mid-run, the test captures the exit status and verifies it's the expected budget-exceeded code (not a crash).
  • CI: Job-level timeout-minutes prevents infinite hangs. Smoke jobs keep continue-on-error: false, so a failure fails the workflow and blocks merge when the job is a required status check. Live jobs use continue-on-error: true initially (since the API key may not be configured in all forks).
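The "capture the exit status and keep going" behavior can be sketched as below. Note that this illustration uses a subshell with a captured exit status rather than an ERR trap, so it is a stand-in for the pattern described above, not a copy of it. The names expect_exit and finish are hypothetical, and the budget-exceeded exit code is left as a parameter because the design does not state its value.

```shell
#!/usr/bin/env bash
set -euo pipefail

PASS=0
FAIL=0

# expect_exit NAME EXPECTED COMMAND...: run COMMAND in a subshell, capture its
# exit status without aborting the suite, and compare it with EXPECTED.
expect_exit() {
  local name="$1" expected="$2" status=0
  shift 2
  set +e                              # a failing case must not kill the suite
  ( set -e; "$@" ) >/dev/null 2>&1    # errexit still applies inside the case
  status=$?
  set -e
  if [ "$status" -eq "$expected" ]; then
    PASS=$((PASS + 1))
    echo "PASS: $name"
  else
    FAIL=$((FAIL + 1))
    echo "FAIL: $name (exit $status, expected $expected)"
  fi
}

# finish: report totals; returns non-zero when any case failed, so the
# caller can end the script with: finish || exit 1
finish() {
  echo "PASS=$PASS FAIL=$FAIL"
  [ "$FAIL" -eq 0 ]
}

# Hypothetical usage; BUDGET_EXIT_CODE stands for whatever code the pipeline
# documents for "budget exceeded":
#   expect_exit "budget enforcement" "$BUDGET_EXIT_CODE" \
#     sw pipeline start --goal "test" --template integration
```

The subshell is run as a plain command rather than inside an if condition, because Bash ignores set -e for commands executed as part of a condition, including inside a subshell started there.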

Alternatives Considered

  1. Extend sw-pipeline-test.sh with integration cases — Pros: No new file, reuses existing mock setup. / Cons: Mixes unit-level function tests with end-to-end flow tests, making failures harder to diagnose. The existing file is already 1757 lines. Integration tests have fundamentally different setup (full temp repo vs. function-level mocking) and different CI characteristics (Tier 2 needs secrets, longer timeouts).

  2. Use a testing framework (bats-core, shunit2) — Pros: Structured test discovery, TAP output, better assertion primitives. / Cons: Introduces a new dependency not used anywhere else in the project. All 22 existing test suites use the custom PASS/FAIL harness pattern. Adopting a framework for one suite creates inconsistency and requires all contributors to learn a new tool. The custom harness is simple and well-understood.

  3. Docker-based integration tests — Pros: Perfect isolation, reproducible environment, no host contamination. / Cons: Adds Docker as a CI dependency, increases build time significantly, complicates debugging. The temp-directory + mock-binary approach already provides sufficient isolation without the overhead. Shipwright targets macOS developers — Docker adds friction.

  4. Single-tier approach (smoke only, no live tests) — Pros: Simpler, no API costs, no secrets management. / Cons: Misses the highest-value validation — that the pipeline actually produces correct output when talking to a real LLM. The tiered approach gives us both: fast deterministic feedback on every PR + real validation when API access is available.

Implementation Plan

  • Files to create:

    • scripts/sw-integration-test.sh — Main integration test script (~300-400 lines)
    • templates/pipelines/integration.json — Minimal pipeline template (~30 lines)
    • .github/workflows/integration-test.yml — CI workflow (~80 lines)
  • Files to modify:

    • package.json — Add test:integration and test:integration:live scripts
    • .github/workflows/test.yml — Add integration-smoke parallel job
    • .claude/CLAUDE.md — Add test suite 23 to the test suites table and update count
  • Dependencies: None. Uses only existing tools (bash, jq, git, gh).

  • Risk areas:

    • Live test flakiness: Claude API responses are non-deterministic. Tier 2 tests should assert structural properties (file changed, PR exists, cost under cap), not exact content. Use retry logic with --max-retries 1 for transient API failures.
    • Cost creep: If the integration template or iteration count is accidentally increased, live test costs could spike. The $1.00 hard cap set through the SHIPWRIGHT_BUDGET_LIMIT environment variable in CI is the safety net, but the template itself should also specify max_cost: 1.0.
    • State file format changes: If sw-pipeline.sh changes the state file format, Tier 1 test 3 (state file integrity) will break. Mitigate by testing for structural properties (has timestamps, has stage names) rather than exact field positions.
    • Mock binary drift: If sw-pipeline.sh starts calling new external tools that aren't mocked, smoke tests will fail with "command not found." This is actually desirable — it surfaces new dependencies early.
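    • CI secret availability: Forks won't have ANTHROPIC_API_KEY. The live job must gracefully skip (not fail) when the secret is absent. GitHub Actions does not allow the secrets context to be referenced directly in an if: conditional, so expose the secret as a job-level environment variable and gate the live steps on if: env.ANTHROPIC_API_KEY != ''.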
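For the state file format risk above, a structural check might look like the following sketch. It tests only that certain labels appear somewhere in the file, not where they appear. The label list and the function name missing_state_fields are assumptions, since the actual layout of .claude/pipeline-state.md is not specified here.

```shell
#!/usr/bin/env bash
set -euo pipefail

# missing_state_fields FILE: print each required label that is absent and
# succeed only when none is missing. Matching is case-insensitive and
# position-independent, so reordering fields does not break the check.
missing_state_fields() {
  local file="$1" field missing=0
  # Hypothetical labels standing in for goal, stage names, status, timestamps.
  for field in goal stage status timestamp; do
    if ! grep -q -i -- "$field" "$file"; then
      echo "missing: $field"
      missing=1
    fi
  done
  [ "$missing" -eq 0 ]
}
```

Feeding the same function an intentionally truncated state file gives the negative case: it should report the absent labels and return non-zero.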

Validation Criteria

  • ./scripts/sw-integration-test.sh exits 0 with all PASS, no FAIL — no API key needed
  • INTEGRATION_LIVE=true ./scripts/sw-integration-test.sh runs both tiers when ANTHROPIC_API_KEY is set
  • npm run test:integration executes smoke tests and exits 0
  • npm run test:integration:live executes both tiers and exits 0 (when API key present)
  • Existing npm test still passes — no regression in the 22 existing suites
  • CI smoke job runs on PRs without secrets and blocks merge on failure
  • CI live job skips gracefully when ANTHROPIC_API_KEY is not configured
  • Live tests complete within 15 minutes and under $1.00 total spend
  • State file validation catches missing fields (test with intentionally malformed state)
  • Budget enforcement test confirms clean exit (not crash/unhandled error) at $0 budget
  • $GITHUB_STEP_SUMMARY shows per-test markdown table in CI
  • templates/pipelines/integration.json is valid JSON and loadable by sw-pipeline.sh
