Pipeline Design 20
Shipwright has 22 unit/component test suites that mock external dependencies (Claude API, GitHub API, tmux) but zero integration tests that exercise the real pipeline orchestration end-to-end. This means regressions in stage sequencing, state file management, and cross-script coordination go undetected until manual testing.
Constraints from the codebase:
- All scripts are Bash 3.2 compatible (`set -euo pipefail`, no associative arrays, no `readarray`)
- Test harness convention: PASS/FAIL counters, `ERR` trap, colored output via `info()`/`success()`/`error()` helpers
- Pipeline state lives in `.claude/pipeline-state.md` and `.claude/pipeline-artifacts/`
- Existing CI (`test.yml`) runs `npm test`, which executes all 22 test suites via `package.json` scripts
- Pipeline templates are JSON files in `templates/pipelines/`; the pipeline reads them to determine which stages to run, iteration counts, and gating
- Budget enforcement exists in `sw-cost.sh`; the pipeline checks remaining budget before each stage
- `--dry-run` flag is already supported by `sw-pipeline.sh` and skips actual Claude invocations
- `$NO_GITHUB` env var disables all GitHub API calls throughout the codebase
Two-tier integration test architecture with clear isolation between deterministic smoke tests (Tier 1) and API-dependent live tests (Tier 2).
A new script scripts/sw-integration-test.sh following the existing test harness pattern. It exercises real pipeline orchestration with mock Claude/GitHub binaries — the same mocking approach used in sw-pipeline-test.sh but focused on end-to-end flow rather than individual function behavior.
Four smoke test cases:
- Dry-run smoke: `sw pipeline start --goal "test" --dry-run` exits 0, emits "Dry run" on stdout, creates `.claude/` directory structure
- Stage ordering: A mocked pipeline runs the `integration` template (intake → build → test), verifying each stage executes in sequence by checking ordered timestamps in the state file
- State file integrity: After a pipeline run, validates `.claude/pipeline-state.md` contains required fields (stage names, timestamps, status per stage, goal)
- Budget enforcement: Sets budget to `$0.00`, runs pipeline, verifies it exits cleanly (exit 0 or well-defined exit code) with a budget-exceeded message rather than crashing
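The state file integrity case could be sketched roughly as below. This is a minimal illustration, not the actual check: the helper name `state_has_fields`, the demo file layout, and the field labels (`goal:`, `stage:`, `status:`, `started_at:`) are all assumptions, since the real format of `.claude/pipeline-state.md` is not described here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds only if every required label occurs in the file.
state_has_fields() {
  local file="$1"
  shift
  local field missing=0
  for field in "$@"; do
    # -F treats the label as a fixed string, not a regular expression.
    if ! grep -qF -- "$field" "$file"; then
      echo "missing field: $field" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Demo against a throwaway file; the layout and labels are assumed for illustration.
demo_state="$(mktemp)"
printf '%s\n' 'goal: test' 'stage: intake' 'status: complete' \
  'started_at: 2024-01-01T00:00:00Z' > "$demo_state"

state_has_fields "$demo_state" "goal:" "stage:" "status:" "started_at:" \
  && echo "state file OK"
```

Matching on labels rather than line positions keeps the check tolerant of layout changes in the state file.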
Mock strategy: Each test case creates a temp directory, populates it with mock binaries for claude and gh (echoing expected output), sets PATH to prefer mocks, sets NO_GITHUB=1, and runs the pipeline. This is identical to the pattern in sw-pipeline-test.sh:45-80 where mock binaries are set up.
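A minimal sketch of that mock setup, assuming stub binaries that simply echo their name and arguments. The helper name `make_mock_bin_dir` and the stub output are hypothetical; the real mocks in `sw-pipeline-test.sh` presumably emit whatever output each test expects.

```shell
#!/usr/bin/env bash
# Create a temp directory holding a stub executable for each named tool.
make_mock_bin_dir() {
  local dir tool
  dir="$(mktemp -d)"
  for tool in "$@"; do
    # Each stub prints "mock <tool> <args>" and exits 0.
    printf '#!/bin/sh\necho "mock %s $*"\n' "$tool" > "$dir/$tool"
    chmod +x "$dir/$tool"
  done
  echo "$dir"
}

MOCK_BIN="$(make_mock_bin_dir claude gh)"
PATH="$MOCK_BIN:$PATH"   # mocks shadow any real claude/gh on the host
export PATH
export NO_GITHUB=1       # the codebase's switch for disabling GitHub API calls

claude --version         # resolves to the stub, not a real binary
```

Because the mock directory comes first on `PATH`, any tool that is not stubbed still resolves to the host binary or fails with "command not found".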
Three live test cases that call the real Claude API with strict budget controls:
- README modification: Creates a temp git repo, runs `sw pipeline start --goal "Add a one-line description to README.md" --template integration`, verifies `git diff` shows README changes
- PR creation: Runs a full `fast` template pipeline against the temp repo, verifies a PR branch exists and the working tree is clean
- Budget cap verification: After the live run, reads `~/.shipwright/costs.json` and asserts total spend is under $1.00
Safety mechanisms:
- `SHIPWRIGHT_BUDGET_LIMIT=1.00` environment variable hard-caps spending
- 15-minute job timeout in CI prevents runaway API calls
- Tests run against a throwaway temp repo (not the real Shipwright repo)
- `INTEGRATION_LIVE` must be explicitly set, so accidental runs are impossible
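Two of these safeguards can be sketched as small guard functions. Both helper names are hypothetical, and the sketch assumes `INTEGRATION_LIVE=true` is the opt-in value and that the total spend has already been extracted from the cost file as a plain decimal number.

```shell
#!/usr/bin/env bash
# Returns 0 only when live tests were explicitly requested and an API key is present.
live_tests_enabled() {
  [ "${INTEGRATION_LIVE:-}" = "true" ] && [ -n "${ANTHROPIC_API_KEY:-}" ]
}

# Returns 0 when spend is strictly below the cap. awk does the comparison
# because Bash arithmetic is integer-only.
spend_under_cap() {
  awk -v spend="$1" -v cap="$2" 'BEGIN { exit !(spend + 0 < cap + 0) }'
}

if live_tests_enabled; then
  echo "Tier 2 enabled"
else
  echo "Tier 2 skipped"
fi
```

A live run might then finish with something like `spend_under_cap "$total" 1.00`, where `$total` is whatever value the test reads from the cost file.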
A new templates/pipelines/integration.json with minimal stages:
- Stages: intake → build → test (3 stages only)
- Model: sonnet (cheapest capable model)
- Max iterations: 3 (enough to verify the loop, cheap enough to cap costs)
- All gates: auto (no human approval needed)
- No PR/deploy/monitor stages — keeps costs and complexity minimal
A new .github/workflows/integration-test.yml with three jobs:
| Job | Trigger | Secrets | Timeout | Purpose |
|---|---|---|---|---|
| `smoke` | Every PR | None | 5 min | Tier 1 deterministic tests |
| `live` | PRs when `ANTHROPIC_API_KEY` secret exists | `ANTHROPIC_API_KEY` | 15 min | Tier 2 API tests |
| `regression` | Push to `main` | `ANTHROPIC_API_KEY` | 15 min | Post-merge verification |
Each job writes per-test-case results to $GITHUB_STEP_SUMMARY as a markdown table (test name, status, duration).
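One way the summary table could be produced is sketched below. The function names are hypothetical, and the fallback to a temp file is an assumption added so the snippet also runs outside GitHub Actions, where `$GITHUB_STEP_SUMMARY` is unset.

```shell
#!/usr/bin/env bash
# Write the table header once, then append one row per test case.
summary_header() {
  {
    echo "| Test | Status | Duration |"
    echo "|---|---|---|"
  } >> "$GITHUB_STEP_SUMMARY"
}

summary_row() {
  # $1 = test name, $2 = status, $3 = duration in seconds
  printf '| %s | %s | %ss |\n' "$1" "$2" "$3" >> "$GITHUB_STEP_SUMMARY"
}

# Outside GitHub Actions the variable is unset, so fall back to a temp file.
GITHUB_STEP_SUMMARY="${GITHUB_STEP_SUMMARY:-$(mktemp)}"
export GITHUB_STEP_SUMMARY

summary_header
summary_row "dry-run smoke" "PASS" 2
summary_row "budget enforcement" "FAIL" 5
```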
The existing test.yml gets a new parallel job integration-smoke that runs npm run test:integration alongside the existing unit test job — ensuring smoke tests block PRs just like unit tests do.
PR opened
├── test.yml → unit tests (existing 22 suites)
│ → integration-smoke (NEW: Tier 1)
└── integration-test.yml
├── smoke job (Tier 1 — always)
├── live job (Tier 2 — when API key available)
└── regression job (Tier 2 — main branch only)
- Smoke tests: `ERR` trap captures failures, logs the failing test case, increments FAIL counter, continues to next test. Final exit code = 1 if any FAIL > 0.
- Live tests: Same `ERR` trap pattern. Additionally, if the budget check fails mid-run, the test captures the exit status and verifies it's the expected budget-exceeded code (not a crash).
- CI: Job-level `timeout-minutes` prevents infinite hangs. `continue-on-error: false` on smoke jobs means they block merge. Live jobs use `continue-on-error: true` initially (since API key may not be configured in all forks).
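The count-and-continue behavior can be sketched as follows. Note the substitution: the text describes an `ERR` trap, while this sketch uses a plain return-status check, which gives the same PASS/FAIL accounting in fewer lines. The names `run_case`, `suite_status`, and the two placeholder cases are hypothetical.

```shell
#!/usr/bin/env bash
PASS=0
FAIL=0

# Run one test function, record the result, and keep going either way.
run_case() {
  local name="$1" fn="$2"
  if "$fn"; then
    PASS=$((PASS + 1))
    echo "PASS: $name"
  else
    FAIL=$((FAIL + 1))
    echo "FAIL: $name" >&2
  fi
}

# Final status: non-zero if any case failed, 0 otherwise.
suite_status() {
  [ "$FAIL" -eq 0 ]
}

# Two placeholder cases, one passing and one failing.
case_ok()  { [ 1 -eq 1 ]; }
case_bad() { [ 1 -eq 2 ]; }

run_case "always passes" case_ok
run_case "always fails" case_bad
echo "passed=$PASS failed=$FAIL"
```

A real script would end with `suite_status` as its last command (or `exit 1` when it fails), so that CI sees a non-zero exit whenever any case failed.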
- Extend `sw-pipeline-test.sh` with integration cases. Pros: No new file, reuses existing mock setup. Cons: Mixes unit-level function tests with end-to-end flow tests, making failures harder to diagnose. The existing file is already 1757 lines. Integration tests have fundamentally different setup (full temp repo vs. function-level mocking) and different CI characteristics (Tier 2 needs secrets, longer timeouts).
- Use a testing framework (bats-core, shunit2). Pros: Structured test discovery, TAP output, better assertion primitives. Cons: Introduces a new dependency not used anywhere else in the project. All 22 existing test suites use the custom PASS/FAIL harness pattern. Adopting a framework for one suite creates inconsistency and requires all contributors to learn a new tool. The custom harness is simple and well-understood.
- Docker-based integration tests. Pros: Perfect isolation, reproducible environment, no host contamination. Cons: Adds Docker as a CI dependency, increases build time significantly, complicates debugging. The temp-directory + mock-binary approach already provides sufficient isolation without the overhead. Shipwright targets macOS developers, and Docker adds friction.
- Single-tier approach (smoke only, no live tests). Pros: Simpler, no API costs, no secrets management. Cons: Misses the highest-value validation, namely that the pipeline actually produces correct output when talking to a real LLM. The tiered approach gives us both: fast deterministic feedback on every PR, plus real validation when API access is available.
- Files to create:
  - `scripts/sw-integration-test.sh`: Main integration test script (~300-400 lines)
  - `templates/pipelines/integration.json`: Minimal pipeline template (~30 lines)
  - `.github/workflows/integration-test.yml`: CI workflow (~80 lines)
- Files to modify:
  - `package.json`: Add `test:integration` and `test:integration:live` scripts
  - `.github/workflows/test.yml`: Add `integration-smoke` parallel job
  - `.claude/CLAUDE.md`: Add test suite 23 to the test suites table and update count
- Dependencies: None. Uses only existing tools (bash, jq, git, gh).
- Risk areas:
  - Live test flakiness: Claude API responses are non-deterministic. Tier 2 tests should assert structural properties (file changed, PR exists, cost under cap), not exact content. Use retry logic with `--max-retries 1` for transient API failures.
  - Cost creep: If the integration template or iteration count is accidentally increased, live test costs could spike. The `$1.00` hard cap set through the CI environment variable is the safety net, but the template itself should also specify `max_cost: 1.0`.
  - State file format changes: If `sw-pipeline.sh` changes the state file format, Tier 1 test 3 (state file integrity) will break. Mitigate by testing for structural properties (has timestamps, has stage names) rather than exact field positions.
  - Mock binary drift: If `sw-pipeline.sh` starts calling new external tools that aren't mocked, smoke tests will fail with "command not found." This is actually desirable: it surfaces new dependencies early.
  - CI secret availability: Forks won't have `ANTHROPIC_API_KEY`. The live job must gracefully skip (not fail) when the secret is absent. GitHub Actions does not allow the `secrets` context to be referenced directly in `if:` conditionals, so map the secret to a job-level environment variable and gate the live steps on `env.ANTHROPIC_API_KEY != ''`.
- `./scripts/sw-integration-test.sh` exits 0 with all PASS, no FAIL; no API key needed
- `INTEGRATION_LIVE=true ./scripts/sw-integration-test.sh` runs both tiers when `ANTHROPIC_API_KEY` is set
- `npm run test:integration` executes smoke tests and exits 0
- `npm run test:integration:live` executes both tiers and exits 0 (when API key present)
- Existing `npm test` still passes; no regression in the 22 existing suites
- CI `smoke` job runs on PRs without secrets and blocks merge on failure
- CI `live` job skips gracefully when `ANTHROPIC_API_KEY` is not configured
- Live tests complete within 15 minutes and under $1.00 total spend
- State file validation catches missing fields (test with intentionally malformed state)
- Budget enforcement test confirms clean exit (not crash/unhandled error) at $0 budget
- `$GITHUB_STEP_SUMMARY` shows per-test markdown table in CI
- `templates/pipelines/integration.json` is valid JSON and loadable by `sw-pipeline.sh`