
feat(scripts): Docker-only end-to-end /e2e command for example-libpng #552

Open
ret2libc wants to merge 10 commits into main from e2e-commands


@ret2libc
Collaborator

What this adds

A Docker-only end-to-end smoke test of the full Buttercup pipeline against
example-libpng, with no Kubernetes/minikube. It tracks the same milestones as
.github/workflows/system-integration.yml but detects them by tailing
docker compose logs.

  • scripts/e2e.sh — brings the dev/docker-compose/ stack up, submits the
    canned libpng trigger_task, and waits on the pipeline milestones
    (fuzzer build → POV submitted → POV accepted → seed-gen → patch
    generated/approved/passed → bundle submitted; optional SARIF).
  • make e2e (and make e2e E2E_ARGS=...).
  • .claude/commands/e2e.md (/e2e slash command wrapper).

Flags: --budget (LiteLLM per-user max budget, default $3),
--task-duration, --image-tag / BUTTERCUP_IMAGE_TAG, --no-pull,
--keep-up, --skip-wait, --sarif, per-phase timeout overrides.

Image source

By default the stack runs the prebuilt GHCR images via the
compose.prebuilt.yaml overlay (nothing built locally). --no-pull skips the
docker compose pull and uses already-present images (e.g. locally built and
tagged ghcr.io/trailofbits/buttercup/*:<tag>).

.env handling

e2e.sh regenerates dev/docker-compose/.env each run. It resolves each
value as environment → existing .env → placeholder, so manually-set
values (e.g. LANGFUSE_*) are preserved across runs instead of being
clobbered with empty or placeholder values.

Dependency / merge ordering

The prebuilt path invokes
docker compose -f compose.yaml -f compose.prebuilt.yaml. The
compose.prebuilt.yaml overlay is not in this PR; it lives on the
separate compose-prebuilt branch/PR. This PR should land after or together
with that one. If it is merged on its own, the overlay file must already be
present in dev/docker-compose/.

Scope

e2e tooling only — .claude/commands/e2e.md, Makefile, scripts/e2e.sh.
Independent of the three pipeline fixes surfaced while building this
(buttercup-ui internal port, litellm budget enforcement, patcher task
storage), which are their own separate PRs.

Validation

This tooling was used to drive the pipeline end-to-end during development:
fuzzer build → POV submitted → POV accepted, through seed-gen and patch
generation, with budget tracking and Langfuse tracing.

🤖 Generated with Claude Code

@ret2libc ret2libc requested a review from hbrodin as a code owner May 15, 2026 13:15
@ret2libc
Collaborator Author

Addressed CI lint/static failures in 5210208 (shellcheck SC2015 in scripts/e2e.sh).

Comment thread scripts/e2e.sh Outdated
ret2libc and others added 9 commits May 19, 2026 08:25
Adds scripts/e2e.sh, `make e2e`, and a .claude/commands/e2e.md slash
command that bring the Buttercup stack up via dev/docker-compose
(no Kubernetes), submit the example-libpng task, and monitor the
scheduler / seed-gen / patcher logs through the milestones tracked by
.github/workflows/system-integration.yml (fuzzer build, POV submit/
pass, seed-gen, patch generate / approve / pass, bundle submit, and
optionally SARIF). Defaults LITELLM_MAX_BUDGET to $3 so accidental
runs are cheap; tears the stack down on exit unless --keep-up is set.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The e2e driver now brings the stack up through the compose.prebuilt.yaml
overlay and `docker compose pull` (tag configurable via --image-tag /
BUTTERCUP_IMAGE_TAG, default "main") instead of `docker compose build`,
so a run no longer depends on a working local image build (e.g. the
cscope submodule / oss-fuzz base-runner build chain).

- dc() applies `-f compose.yaml -f compose.prebuilt.yaml` and exports
  BUTTERCUP_IMAGE_TAG for every compose subcommand (pull/up/logs/down).
- --no-build kept as a deprecated alias for the new --no-pull.
- Teardown hint and e2e.md updated for the overlay.
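
A minimal sketch of what such a dc() wrapper might look like. Only the two -f files, the BUTTERCUP_IMAGE_TAG export, and the "main" default come from the commit message above; COMPOSE_DIR, DOCKER_BIN, and the subshell-with-cd layout are assumptions made so the sketch can be dry-run without Docker.

```shell
#!/usr/bin/env bash
# Sketch of a compose wrapper; not the actual scripts/e2e.sh code.
# COMPOSE_DIR and DOCKER_BIN are hypothetical knobs for illustration.
COMPOSE_DIR="${COMPOSE_DIR:-dev/docker-compose}"
DOCKER_BIN="${DOCKER_BIN:-docker}"

dc() {
  (
    # Run in a subshell so the cd and the export do not leak to the caller.
    cd "$COMPOSE_DIR" || exit 1
    export BUTTERCUP_IMAGE_TAG="${BUTTERCUP_IMAGE_TAG:-main}"
    # Every subcommand (pull/up/logs/down) sees both files, so the overlay
    # is applied consistently.
    "$DOCKER_BIN" compose -f compose.yaml -f compose.prebuilt.yaml "$@"
  )
}
```

With this shape, `dc pull`, `dc up -d`, `dc logs scheduler`, and `dc down` all resolve the same merged configuration. Setting DOCKER_BIN=echo prints the command line instead of running it.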

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
e2e.sh regenerates dev/docker-compose/.env from scratch every run,
sourcing values only from environment variables. Variables not exported
(notably LANGFUSE_HOST/PUBLIC_KEY/SECRET_KEY) were defaulted to empty and
written back, clobbering values a user had set directly in .env.

Add prev_env() and a 3-tier resolution: environment > existing .env >
placeholder. Manually-set .env values (Langfuse creds, provider keys,
litellm key) now survive subsequent runs.
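
The 3-tier resolution can be sketched roughly as follows. The name prev_env() comes from the commit message; its body, the resolve() helper, the ENV_FILE variable, and the simple KEY=value parsing (no quoting or `export` prefixes) are assumptions for illustration, not the code in scripts/e2e.sh.

```shell
#!/usr/bin/env bash
# Hypothetical location variable; the real script targets dev/docker-compose/.env.
ENV_FILE="${ENV_FILE:-dev/docker-compose/.env}"

# Print the value of KEY from the existing .env file, or nothing if absent.
prev_env() {
  [ -f "$ENV_FILE" ] || return 0
  # Keep the last KEY=... line, on the assumption that later entries win.
  sed -n "s/^$1=//p" "$ENV_FILE" | tail -n 1
}

# resolve NAME PLACEHOLDER: environment > existing .env > placeholder.
resolve() {
  local name="$1" placeholder="$2" val
  # printenv only sees exported variables, i.e. the real environment.
  val="$(printenv "$name" || true)"
  [ -n "$val" ] || val="$(prev_env "$name")"
  [ -n "$val" ] || val="$placeholder"
  printf '%s\n' "$val"
}
```

Because the existing file is consulted before the placeholder, a value typed into .env by hand is read back and rewritten unchanged on the next run.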

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the `wait_for ... && record ok || record TIMEOUT` and
`curl ... && record ok || record fail` constructs with explicit
if-then-else blocks. shellcheck flagged these as SC2015 (A && B || C
is not if-then-else), causing the "Lint shell scripts" step in the
Static Checks workflow to fail. Behavior is unchanged.
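
To illustrate why SC2015 matters, here is a small sketch contrasting the two forms. milestone_reached and record are hypothetical stand-ins (the real wait_for and record helpers are not shown in this PR text), and record is made to fail on purpose so the difference is visible.

```shell
#!/usr/bin/env bash
# Stand-in for a waiter that succeeded.
milestone_reached() { return 0; }

# Stand-in recorder that prints its argument and then fails,
# e.g. because a write to a summary file went wrong.
record() { echo "$1"; return 1; }

# A && B || C: C also runs when A succeeds but B fails.
fragile() {
  milestone_reached && record ok || record TIMEOUT
}

# Explicit branches: exactly one of the two record calls runs.
explicit() {
  if milestone_reached; then
    record ok
  else
    record TIMEOUT
  fi
}
```

Calling fragile prints both "ok" and "TIMEOUT", while explicit prints only "ok". When record cannot fail the two forms behave identically, which is consistent with the commit's note that behavior is unchanged.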

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
With `set -o pipefail`, `dc logs ... | grep -m1` makes the upstream
`docker compose logs` die with SIGPIPE (rc 141) once grep matches the
first line; pipefail then fails the whole pipeline, so milestones whose
log line appears early in a high-volume stream (e.g. seed-gen's 'Copied
N files to corpus') are never registered and wait_for spins until
timeout even though the milestone occurred. Capture grep output with
'|| true' and test for non-empty instead.
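
A minimal sketch of the capture-and-test pattern described above. log_stream stands in for `dc logs ...`, the sample log line is made up around the 'Copied N files to corpus' marker, and milestone_seen is a hypothetical helper name.

```shell
#!/usr/bin/env bash
set -o pipefail

# Stand-in for a log stream: one early match followed by a high-volume tail.
log_stream() {
  echo "seed-gen | Copied 12 files to corpus"
  seq 1 200000
}

# Returns 0 if any line in the stream matches the pattern.
milestone_seen() {
  local match
  # grep -m1 exits after the first hit; the producer may then die with
  # SIGPIPE (rc 141), which pipefail reports as a pipeline failure.
  # `|| true` discards that status; the captured text is the real signal.
  match="$(log_stream | grep -m1 -e "$1" || true)"
  [ -n "$match" ]
}
```

Testing the captured string rather than the pipeline's exit status means an early match in a noisy stream still counts, while a pattern that never appears still yields a non-zero return.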

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop --no-build, --keep-up, --skip-wait, --sarif, --task-json and the
per-phase --*-timeout flags. The stack now always tears down on exit;
milestone timeouts are internal constants.

Addresses PR #552 review:
- provider-key check moved below the .env fallback so keys saved to
  .env on a prior run are accepted (tip is now accurate)
- --task-json removed (was silently falling back to the libpng default)
- trigger_task response uses mktemp + on_exit cleanup instead of a
  predictable /tmp/e2e_task_resp.$$ leaked on SIGINT/SIGTERM
- --no-build phantom "deprecated alias" removed
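
The mktemp plus on_exit cleanup mentioned in the list above might look roughly like this. The on_exit name comes from the commit message; the resp_file variable, the filename template, and the signal traps are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Unpredictable temp file instead of a fixed /tmp/e2e_task_resp.$$ path.
resp_file="$(mktemp "${TMPDIR:-/tmp}/e2e_task_resp.XXXXXX")"

on_exit() {
  # Safe to call more than once: rm -f ignores a missing file.
  rm -f "$resp_file"
}

trap on_exit EXIT
# Convert signals into a normal exit so the EXIT trap runs on
# Ctrl-C and on termination as well.
trap 'exit 130' INT
trap 'exit 143' TERM
```

A request could then write its response body to "$resp_file" (for example with curl -o "$resp_file"), and the file is removed on every exit path.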

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The local litellm master key is an internal detail of the docker-compose
stack, not something the user should set. Remove it from the usage text
and the env/.env resolution; e2e.sh now just writes the local default
(sk-1234) into the generated .env.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
e2e.sh regenerates dev/docker-compose/.env every run and was always
writing LANGFUSE_HOST=/PUBLIC_KEY=/SECRET_KEY= even when unset. Since
.env is loaded last in compose's env_file list, an empty value silently
disabled Langfuse telemetry. Now resolved env -> existing .env, and the
LANGFUSE_* lines are only written when non-empty, so values the user set
in .env survive across runs.
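
Writing a line only when its value is non-empty can be sketched as below. write_if_set is a hypothetical helper name and the file argument is an assumption; the point is that no empty `NAME=` line is emitted that could override an earlier env_file entry.

```shell
#!/usr/bin/env bash
# Append NAME=value to an env file only when the value is non-empty.
write_if_set() {
  local name="$1" value="$2" file="$3"
  if [ -n "$value" ]; then
    printf '%s=%s\n' "$name" "$value" >> "$file"
  fi
}
```

For example, `write_if_set LANGFUSE_HOST "$langfuse_host" "$env_file"` appends nothing when the resolved value is empty, so whatever compose loaded from earlier files stays in effect (langfuse_host and env_file are illustrative variable names).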

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The pov-submit and bundle-submit waiters used
"POV submission response: pov_id=" and "Bundle submission response: bundle_id="
which never match any rendered log line: the only
"... submission response:" logs are logger.debug calls whose payload is an
API object repr (no literal pov_id=/bundle_id=), while pov_id=/bundle_id=
appear only in the separate structured summary line (logger.info) with a
different prefix. Result: both milestones always timed out, so every run —
including fully successful ones — wasted MILESTONE_TIMEOUT+BUNDLE_TIMEOUT
and exited non-zero.

Repoint both to the structured summary tokens (pov_id= / bundle_id=) and
sync the marker list in .claude/commands/e2e.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ults

Three defects found while verifying the pipeline end-to-end:

1. Approval one-shot race: capture_line 'competition_patch_id=' ran once
   right after the patch-generated milestone, but the scheduler logs that
   id only minutes later (after it builds+verifies+submits the patch). The
   capture always lost the race, so approval was always skipped and the
   local stack never reached Patch passed / bundle. Replace with a
   wait_capture() poll loop (mirrors wait_for) so approval actually fires.

2. Default --task-duration 1800 is self-defeating: build->POV->seed-gen->
   patch exceeds 30 min on normal hardware, so the task expires mid-patch
   ("task expired/cancelled? Will discard") and never reaches patch/bundle.
   Default to 7200 so the task outlives the pipeline.

3. Default --budget 3 cannot reach patch/bundle: a full run through patch
   generation costs ~$10; $3 is exhausted around POV. Default to 10.
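
For the first defect above, a poll loop in the spirit of wait_capture() could be sketched like this. The name wait_capture comes from the commit message; its signature, the LOG_FILE variable, and log_stream (a stand-in for tailing the scheduler logs) are assumptions.

```shell
#!/usr/bin/env bash
set -o pipefail

# Stand-in for the log source; the real script reads docker compose logs.
LOG_FILE="${LOG_FILE:-/dev/null}"
log_stream() { cat "$LOG_FILE"; }

# wait_capture PATTERN TIMEOUT_S [INTERVAL_S]
# Polls until a line matching PATTERN appears, prints it, and returns 0.
# Returns 1 if the timeout elapses first.
wait_capture() {
  local pattern="$1" timeout="$2" interval="${3:-5}" deadline line
  deadline=$(( $(date +%s) + timeout ))
  while :; do
    line="$(log_stream | grep -m1 -e "$pattern" || true)"
    if [ -n "$line" ]; then
      printf '%s\n' "$line"
      return 0
    fi
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep "$interval"
  done
}
```

Unlike a one-shot capture, the loop tolerates the id being logged minutes after the patch-generated milestone, as long as it shows up before the timeout.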

e2e.md updated to match (defaults, the cheap --budget 3 caveat, and the
poll-then-approve description).
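
The new defaults and their overrides might be wired up along these lines. The flag names and default values are taken from this PR; the variable names and the parse_args function are hypothetical, and the real option parsing in scripts/e2e.sh may differ.

```shell
#!/usr/bin/env bash
# Defaults chosen so the task outlives the pipeline and the budget
# covers a run through patch generation.
BUDGET=10
TASK_DURATION=7200

parse_args() {
  while [ $# -gt 0 ]; do
    case "$1" in
      --budget)
        [ $# -ge 2 ] || { echo "--budget needs a value" >&2; return 2; }
        BUDGET="$2"; shift 2 ;;
      --task-duration)
        [ $# -ge 2 ] || { echo "--task-duration needs a value" >&2; return 2; }
        TASK_DURATION="$2"; shift 2 ;;
      *)
        echo "unknown option: $1" >&2; return 2 ;;
    esac
  done
}
```

A cheap run that is expected to stop around POV would then be `parse_args --budget 3`, leaving the task duration at its 7200-second default.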

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2 participants