Pipeline Design 20

Design: Add end-to-end integration test suite in CI

Context

Shipwright has 22 unit/component test suites that mock external dependencies (Claude API, GitHub API, tmux) but zero integration tests that exercise the real pipeline orchestration end-to-end. This means regressions in stage sequencing, state file management, and cross-script coordination go undetected until manual testing.

Constraints from the codebase:

- All scripts are Bash 3.2 compatible (set -euo pipefail, no associative arrays, no readarray)
- Test harness convention: PASS/FAIL counters, ERR trap, colored output via info()/success()/error() helpers
- Pipeline state lives in .claude/pipeline-state.md and .claude/pipeline-artifacts/
- Existing CI (test.yml) runs npm test, which executes all 22 test suites via package.json scripts
- Pipeline templates are JSON files in templates/pipelines/; the pipeline reads them to determine which stages to run, iteration counts, and gating
- Budget enforcement exists in sw-cost.sh; the pipeline checks remaining budget before each stage
- The --dry-run flag is already supported by sw-pipeline.sh and skips actual Claude invocations
- The $NO_GITHUB env var disables all GitHub API calls throughout the codebase

Decision

Adopt a two-tier integration test architecture that isolates deterministic smoke tests (Tier 1) from API-dependent live tests (Tier 2).

Tier 1: Smoke Tests (no API key, no secrets, runs on every PR)

A new script scripts/sw-integration-test.sh following the existing test harness pattern. It exercises real pipeline orchestration with mock Claude/GitHub binaries — the same mocking approach used in sw-pipeline-test.sh but focused on end-to-end flow rather than individual function behavior.

Four smoke test cases:

1. Dry-run smoke: sw pipeline start --goal "test" --dry-run exits 0, emits "Dry run" on stdout, and creates the .claude/ directory structure.
2. Stage ordering: a mocked pipeline runs the integration template (intake → build → test), and the test verifies that each stage executes in sequence by checking that the timestamps in the state file are ordered.
3. State file integrity: after a pipeline run, the test validates that .claude/pipeline-state.md contains the required fields (stage names, timestamps, status per stage, goal).
4. Budget enforcement: the test sets the budget to $0.00, runs the pipeline, and verifies that it exits cleanly (exit 0 or a well-defined exit code) with a budget-exceeded message rather than crashing.
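The stage ordering check (test 2) could be sketched as below. This is a minimal illustration under stated assumptions: the state file format is not specified here, so the sketch assumes that stages are recorded in execution order and that each carries an ISO-8601 UTC timestamp such as 2024-01-01T10:00:00Z. The function names extract_timestamps and timestamps_in_order are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print every ISO-8601 UTC timestamp found in the file, in file order.
# ASSUMPTION: the state file uses timestamps like 2024-01-01T10:00:00Z.
extract_timestamps() {
  grep -oE '[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z' "$1"
}

# Succeed only if the file contains timestamps and they never go backwards.
# Fixed-width ISO-8601 strings sort lexicographically in chronological
# order, so "sort -c" (check-only mode) is enough. With pipefail set, a
# file with no timestamps fails because grep exits non-zero.
timestamps_in_order() {
  extract_timestamps "$1" | LC_ALL=C sort -c 2>/dev/null
}
```

Because the script runs under set -e, call the check inside a conditional, for example: if timestamps_in_order .claude/pipeline-state.md; then ... fi.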

Mock strategy: Each test case creates a temp directory, populates it with mock binaries for claude and gh (echoing expected output), sets PATH to prefer mocks, sets NO_GITHUB=1, and runs the pipeline. This is identical to the pattern in sw-pipeline-test.sh:45-80 where mock binaries are set up.
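The mock strategy above might look roughly like the following. This is an illustrative sketch, not the code in sw-pipeline-test.sh; the helper name setup_mock_bins and the strings the mocks print are hypothetical placeholders.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Create executable stand-ins for claude and gh in the given directory.
# setup_mock_bins and the echoed strings are hypothetical placeholders.
setup_mock_bins() {
  local mock_dir="$1"
  mkdir -p "$mock_dir"

  # Mock claude: echo its arguments and exit 0.
  cat > "$mock_dir/claude" <<'EOF'
#!/usr/bin/env bash
echo "mock claude: $*"
EOF

  # Mock gh: echo its arguments and exit 0.
  cat > "$mock_dir/gh" <<'EOF'
#!/usr/bin/env bash
echo "mock gh: $*"
EOF

  chmod +x "$mock_dir/claude" "$mock_dir/gh"
}

# Each test case gets its own throwaway directory, removed on exit.
TEST_DIR="$(mktemp -d "${TMPDIR:-/tmp}/sw-integration.XXXXXX")"
trap 'rm -rf "$TEST_DIR"' EXIT

setup_mock_bins "$TEST_DIR/bin"

# Prefer the mocks over any real binaries and disable GitHub API calls.
export PATH="$TEST_DIR/bin:$PATH"
export NO_GITHUB=1
```

Putting the mock directory first on PATH means the pipeline scripts run unmodified: they call claude and gh as usual and get the stand-ins.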

Tier 2: Live Tests (gated behind `INTEGRATION_LIVE=true` + `ANTHROPIC_API_KEY`)

Three live test cases that call the real Claude API with strict budget controls:

- README modification: creates a temp git repo, runs sw pipeline start --goal "Add a one-line description to README.md" --template integration, and verifies that git diff shows README changes
- PR creation: runs a full pipeline using the fast template against the temp repo, then verifies that a PR branch exists and the working tree is clean
- Budget cap verification: after the live run, reads ~/.shipwright/costs.json and asserts that total spend is under $1.00
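The budget cap assertion needs a decimal comparison, which Bash arithmetic cannot do on its own. A minimal sketch, assuming the total spend has already been extracted from ~/.shipwright/costs.json (the schema of that file is not specified here, so the extraction step is left out); the function name spend_under_cap is hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Return 0 when spend ($1) is strictly below cap ($2), compared as decimals.
# Bash integer arithmetic cannot compare 0.42 with 1.00, so awk does it.
# spend_under_cap is a hypothetical helper name.
spend_under_cap() {
  awk -v spend="$1" -v cap="$2" 'BEGIN { exit !(spend + 0 < cap + 0) }'
}
```

A test case would then call it in a conditional, for example: if spend_under_cap "$total" 1.00; then ... fi, where total holds the value read from the costs file.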

Safety mechanisms:

- The SHIPWRIGHT_BUDGET_LIMIT=1.00 environment variable hard-caps spending
- A 15-minute job timeout in CI bounds runaway API calls
- Tests run against a throwaway temp repo (not the real Shipwright repo)
- INTEGRATION_LIVE must be set explicitly, so live tests do not run by accident
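The opt-in gate could be expressed as a small guard at the top of the Tier 2 section of the test script. This is a sketch; the function name live_tests_enabled is hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Return 0 only when live tests were explicitly requested AND a key is set.
# The ${VAR:-} form keeps "set -u" from aborting when a variable is unset.
# live_tests_enabled is a hypothetical helper name.
live_tests_enabled() {
  [ "${INTEGRATION_LIVE:-}" = "true" ] && [ -n "${ANTHROPIC_API_KEY:-}" ]
}

if live_tests_enabled; then
  echo "Tier 2: running live tests"
else
  echo "Tier 2: skipped (INTEGRATION_LIVE=true and ANTHROPIC_API_KEY are both required)"
fi
```

Skipping with a message rather than failing keeps local runs and fork CI green when no key is available.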

Pipeline Template

A new templates/pipelines/integration.json with minimal stages:

- Stages: intake → build → test (3 stages only)
- Model: sonnet (cheapest capable model)
- Max iterations: 3 (enough to verify the loop, cheap enough to cap costs)
- All gates: auto (no human approval needed)
- No PR/deploy/monitor stages, which keeps costs and complexity minimal

CI Workflow

A new .github/workflows/integration-test.yml with three jobs:

| Job | Trigger | Secrets | Timeout | Purpose |
| --- | --- | --- | --- | --- |
| `smoke` | Every PR | None | 5 min | Tier 1 deterministic tests |
| `live` | PRs when `ANTHROPIC_API_KEY` secret exists | `ANTHROPIC_API_KEY` | 15 min | Tier 2 API tests |
| `regression` | Push to `main` | `ANTHROPIC_API_KEY` | 15 min | Post-merge verification |

Each job writes per-test-case results to $GITHUB_STEP_SUMMARY as a markdown table (test name, status, duration).
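Writing the per-test table could be sketched as below. GITHUB_STEP_SUMMARY is the file path GitHub Actions provides for job summaries; the helper names write_summary_header and write_summary_row are hypothetical, as is the choice to report duration in whole seconds.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Append the markdown table header to the job summary.
# Outside GitHub Actions the variable is unset, so output goes to /dev/null.
write_summary_header() {
  local out="${GITHUB_STEP_SUMMARY:-/dev/null}"
  {
    echo "| Test | Status | Duration |"
    echo "| --- | --- | --- |"
  } >> "$out"
}

# Append one result row: $1 = test name, $2 = status, $3 = seconds.
write_summary_row() {
  local out="${GITHUB_STEP_SUMMARY:-/dev/null}"
  printf '| %s | %s | %ss |\n' "$1" "$2" "$3" >> "$out"
}
```

A test runner would call write_summary_header once, then write_summary_row after each test case, for example write_summary_row "dry-run smoke" PASS 2.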

The existing test.yml gets a new parallel job integration-smoke that runs npm run test:integration alongside the existing unit test job — ensuring smoke tests block PRs just like unit tests do.

Data Flow

PR opened
  ├── test.yml → unit tests (existing 22 suites)
  │            → integration-smoke (NEW: Tier 1)
  └── integration-test.yml
       ├── smoke job (Tier 1 — always)
       ├── live job (Tier 2 — when API key available)
       └── regression job (Tier 2 — main branch only)

Error Handling

- Smoke tests: an ERR trap captures failures, logs the failing test case, increments the FAIL counter, and continues to the next test. The final exit code is 1 if FAIL > 0.
- Live tests: same ERR trap pattern. Additionally, if the budget check fails mid-run, the test captures the exit status and verifies that it is the expected budget-exceeded code (not a crash).
- CI: job-level timeout-minutes prevents infinite hangs. Smoke jobs keep continue-on-error: false (the default), so a failing smoke job fails the workflow run and blocks merge wherever the check is required. Live jobs use continue-on-error: true initially (since the API key may not be configured in all forks).
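The count-and-continue behaviour for test cases can be sketched as follows. Note that this is a simplified variant, not the project's existing harness: instead of an ERR trap it runs each test case in a subshell with set -e, which gives the same effect (the first failing command aborts that test case only). The names run_test and final_status are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

PASS=0
FAIL=0

# Run one test function in a subshell so a failure cannot abort the suite.
# errexit is switched off around the subshell; otherwise the parent shell
# would exit on the subshell's non-zero status. Inside the subshell,
# "set -e" makes the first failing command end that test case.
run_test() {
  local name="$1" fn="$2" rc=0
  set +e
  ( set -e; "$fn" )
  rc=$?
  set -e
  if [ "$rc" -eq 0 ]; then
    PASS=$((PASS + 1))
    echo "PASS: $name"
  else
    FAIL=$((FAIL + 1))
    echo "FAIL: $name (exit $rc)"
  fi
}

# Print totals; return non-zero if any test failed.
final_status() {
  echo "Passed: $PASS  Failed: $FAIL"
  [ "$FAIL" -eq 0 ]
}

# Demo test cases (placeholders for the real smoke tests).
test_ok()     { true; }
test_broken() { false; echo "never printed"; }

run_test "ok" test_ok
run_test "broken" test_broken
run_test "ok again" test_ok
```

A real script would end with final_status || exit 1, so the process exits 1 whenever FAIL > 0.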

Alternatives Considered

- Extend sw-pipeline-test.sh with integration cases. Pros: No new file, reuses existing mock setup. Cons: Mixes unit-level function tests with end-to-end flow tests, making failures harder to diagnose. The existing file is already 1757 lines. Integration tests have fundamentally different setup (full temp repo vs. function-level mocking) and different CI characteristics (Tier 2 needs secrets, longer timeouts).
- Use a testing framework (bats-core, shunit2). Pros: Structured test discovery, TAP output, better assertion primitives. Cons: Introduces a new dependency not used anywhere else in the project. All 22 existing test suites use the custom PASS/FAIL harness pattern. Adopting a framework for one suite creates inconsistency and requires all contributors to learn a new tool. The custom harness is simple and well-understood.
- Docker-based integration tests. Pros: Perfect isolation, reproducible environment, no host contamination. Cons: Adds Docker as a CI dependency, increases build time significantly, complicates debugging. The temp-directory + mock-binary approach already provides sufficient isolation without the overhead. Shipwright targets macOS developers, for whom Docker adds friction.
- Single-tier approach (smoke only, no live tests). Pros: Simpler, no API costs, no secrets management. Cons: Misses the highest-value validation: that the pipeline actually produces correct output when talking to a real LLM. The tiered approach gives us both fast deterministic feedback on every PR and real validation when API access is available.

Implementation Plan

Files to create:
- scripts/sw-integration-test.sh — Main integration test script (~300-400 lines)
- templates/pipelines/integration.json — Minimal pipeline template (~30 lines)
- .github/workflows/integration-test.yml — CI workflow (~80 lines)
Files to modify:
- package.json — Add test:integration and test:integration:live scripts
- .github/workflows/test.yml — Add integration-smoke parallel job
- .claude/CLAUDE.md — Add test suite 23 to the test suites table and update count
Dependencies: None. Uses only existing tools (bash, jq, git, gh).
Risk areas:
- Live test flakiness: Claude API responses are non-deterministic. Tier 2 tests should assert structural properties (file changed, PR exists, cost under cap), not exact content. Use retry logic with --max-retries 1 for transient API failures.
- Cost creep: If the integration template or iteration count is accidentally increased, live test costs could spike. The $1.00 hard cap set through the SHIPWRIGHT_BUDGET_LIMIT environment variable in CI is the safety net, but the template itself should also specify max_cost: 1.0.
- State file format changes: If sw-pipeline.sh changes the state file format, Tier 1 test 3 (state file integrity) will break. Mitigate by testing for structural properties (has timestamps, has stage names) rather than exact field positions.
- Mock binary drift: If sw-pipeline.sh starts calling new external tools that aren't mocked, smoke tests will fail with "command not found." This is actually desirable — it surfaces new dependencies early.
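- CI secret availability: Forks won't have ANTHROPIC_API_KEY. The live job must gracefully skip (not fail) when the secret is absent. GitHub Actions does not allow the secrets context to be referenced directly in if: conditionals, so map the secret to a job-level environment variable and have a step check that variable, skipping the live tests when it is empty.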
