feat(scripts): Docker-only end-to-end /e2e command for example-libpng #552
Open
ret2libc wants to merge 10 commits into
Addressed CI lint/static failures in 5210208 (shellcheck SC2015 in scripts/e2e.sh).
hbrodin
reviewed
May 19, 2026
Adds scripts/e2e.sh, `make e2e`, and a .claude/commands/e2e.md slash command that bring the Buttercup stack up via dev/docker-compose (no Kubernetes), submit the example-libpng task, and monitor the scheduler / seed-gen / patcher logs through the milestones tracked by .github/workflows/system-integration.yml (fuzzer build, POV submit/pass, seed-gen, patch generate / approve / pass, bundle submit, and optionally SARIF). Defaults LITELLM_MAX_BUDGET to $3 so accidental runs are cheap; tears the stack down on exit unless --keep-up is set.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The e2e driver now brings the stack up through the compose.prebuilt.yaml overlay and `docker compose pull` (tag configurable via --image-tag / BUTTERCUP_IMAGE_TAG, default "main") instead of `docker compose build`, so a run no longer depends on a working local image build (e.g. the cscope submodule / oss-fuzz base-runner build chain).
- dc() applies `-f compose.yaml -f compose.prebuilt.yaml` and exports BUTTERCUP_IMAGE_TAG for every compose subcommand (pull/up/logs/down).
- --no-build kept as a deprecated alias for the new --no-pull.
- Teardown hint and e2e.md updated for the overlay.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
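A minimal sketch of a compose wrapper along the lines described above. Only the name `dc()`, the two `-f` files, and `BUTTERCUP_IMAGE_TAG` come from the commit message; the function body, the `COMPOSE_DIR` variable, and the `DOCKER` override (used here for a dry run) are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: the body of dc() is assumed, not copied from scripts/e2e.sh.
COMPOSE_DIR="dev/docker-compose"

# Make the image tag visible to every compose subcommand (default "main").
export BUTTERCUP_IMAGE_TAG="${BUTTERCUP_IMAGE_TAG:-main}"

# Run any compose subcommand with the base file plus the prebuilt overlay.
# DOCKER is a hypothetical override so the command line can be inspected
# without a Docker daemon.
dc() {
  "${DOCKER:-docker}" compose \
    -f "$COMPOSE_DIR/compose.yaml" \
    -f "$COMPOSE_DIR/compose.prebuilt.yaml" \
    "$@"
}

# Dry run: print the arguments instead of invoking docker.
cmdline=$(DOCKER=echo dc pull)
printf '%s\n' "$cmdline"
```

Routing pull/up/logs/down through one function keeps the overlay and the tag consistent across subcommands.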
e2e.sh regenerates dev/docker-compose/.env from scratch every run, sourcing values only from environment variables. Variables not exported (notably LANGFUSE_HOST/PUBLIC_KEY/SECRET_KEY) were defaulted to empty and written back, clobbering values a user had set directly in .env. Add prev_env() and a 3-tier resolution: environment > existing .env > placeholder. Manually-set .env values (Langfuse creds, provider keys, litellm key) now survive subsequent runs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
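The three-tier resolution can be sketched as follows. The commit names `prev_env()` but does not show it, so its body here is an assumption; `resolve()`, `ENV_FILE`, `EXAMPLE_API_KEY`, and the sample values are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: prev_env()'s body and resolve() are assumptions.
ENV_FILE="${ENV_FILE:-$(mktemp)}"   # stand-in for dev/docker-compose/.env

# Print the value of KEY from the existing .env file, or nothing if absent.
prev_env() {
  [ -f "$ENV_FILE" ] || return 0
  sed -n "s/^$1=//p" "$ENV_FILE" | tail -n 1
}

# Resolve KEY as: environment > existing .env > placeholder.
resolve() {
  local key=$1 placeholder=$2 from_env from_file
  eval "from_env=\${$key:-}"
  from_file=$(prev_env "$key")
  if [ -n "$from_env" ]; then
    printf '%s\n' "$from_env"
  elif [ -n "$from_file" ]; then
    printf '%s\n' "$from_file"
  else
    printf '%s\n' "$placeholder"
  fi
}

# Demo with a value that exists only in the .env file.
printf 'LANGFUSE_HOST=https://langfuse.example\n' > "$ENV_FILE"
unset LANGFUSE_HOST EXAMPLE_API_KEY
from_file=$(resolve LANGFUSE_HOST "")                     # taken from .env
fallback=$(resolve EXAMPLE_API_KEY "<placeholder>")       # neither tier set
LANGFUSE_HOST="https://override.example"
from_env=$(resolve LANGFUSE_HOST "")                      # environment wins
rm -f "$ENV_FILE"
```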
Replace the `wait_for ... && record ok || record TIMEOUT` and `curl ... && record ok || record fail` constructs with explicit if-then-else blocks. shellcheck flagged these as SC2015 (A && B || C is not if-then-else), causing the "Lint shell scripts" step in the Static Checks workflow to fail. Behavior is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
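For context, a small illustration of what SC2015 warns about. In the PR's case behavior is unchanged because `record` does not fail; the hypothetical `flaky_record` below fails on purpose to show how `A && B || C` differs from an explicit if-then-else.

```shell
#!/usr/bin/env bash
# Sketch only: record/flaky_record are illustrative stand-ins.
log=""
record() { log="$log$1;"; }
flaky_record() { log="$log$1;"; return 1; }   # records, then reports failure

# A && B || C: C also runs when A succeeds but B fails.
true && flaky_record ok || record TIMEOUT
chained=$log

# Explicit form: the else branch depends only on the condition.
log=""
if true; then
  flaky_record ok
else
  record TIMEOUT
fi
explicit=$log

printf 'chained=%s explicit=%s\n' "$chained" "$explicit"
```

The chained form records both outcomes, while the explicit form records only `ok`.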
With `set -o pipefail`, `dc logs ... | grep -m1` makes the upstream `docker compose logs` die with SIGPIPE (rc 141) once grep matches the first line; pipefail then fails the whole pipeline, so milestones whose log line appears early in a high-volume stream (e.g. seed-gen's 'Copied N files to corpus') are never registered and wait_for spins until timeout even though the milestone occurred. Capture grep output with '|| true' and test for non-empty instead. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
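The failure mode and the fix can be reproduced without Docker. In this sketch `noisy_logs` is a hypothetical stand-in for `dc logs ...`, using `yes` to simulate an endless high-volume stream; the log text is taken from the commit message, with an arbitrary file count.

```shell
#!/usr/bin/env bash
set -o pipefail

# Stand-in for a high-volume `docker compose logs` stream.
noisy_logs() { yes "Copied 3 files to corpus"; }

# Naive check: grep matches and exits, the writer is then killed by
# SIGPIPE (status typically 141), and pipefail fails the whole pipeline.
noisy_logs | grep -m1 -q "Copied"
naive_rc=$?

# Fix: capture grep's output, ignore the pipeline status, test for non-empty.
match=$(noisy_logs | grep -m1 "Copied" || true)
if [ -n "$match" ]; then
  found=yes
else
  found=no
fi

printf 'naive_rc=%s found=%s\n' "$naive_rc" "$found"
```

The naive pipeline reports failure even though the line was present, which is why the milestone was never registered.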
Drop --no-build, --keep-up, --skip-wait, --sarif, --task-json and the per-phase --*-timeout flags. The stack now always tears down on exit; milestone timeouts are internal constants.
Addresses PR #552 review:
- provider-key check moved below the .env fallback so keys saved to .env on a prior run are accepted (tip is now accurate)
- --task-json removed (was silently falling back to the libpng default)
- trigger_task response uses mktemp + on_exit cleanup instead of a predictable /tmp/e2e_task_resp.$$ leaked on SIGINT/SIGTERM
- --no-build phantom "deprecated alias" removed
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
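The mktemp + cleanup pattern mentioned in the review list might look like the sketch below. `on_exit` is named in the commit, but its body, the `fetch_response` wrapper, `resp_file`, and the fake response payload are assumptions; the real script presumably writes the trigger_task HTTP response there.

```shell
#!/usr/bin/env bash
# Sketch only: function bodies are assumed, not copied from scripts/e2e.sh.

# Runs in a subshell so the EXIT trap fires when the function finishes.
fetch_response() (
  resp_file=$(mktemp)               # unpredictable name, unlike /tmp/name.$$
  on_exit() { rm -f "$resp_file"; }
  trap on_exit EXIT                 # normal exit
  trap 'exit 130' INT               # Ctrl-C: exit, which triggers on_exit
  trap 'exit 143' TERM              # kill: exit, which triggers on_exit

  printf '{"status":"accepted"}\n' > "$resp_file"   # stand-in for the curl call
  printf '%s\n' "$resp_file"        # report the path for this demo only
)

path=$(fetch_response)
if [ -e "$path" ]; then
  leaked=yes
else
  leaked=no
fi
printf 'temp file leaked: %s\n' "$leaked"
```

Converting INT and TERM into a plain `exit` lets a single EXIT handler do the cleanup on every path.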
The local litellm master key is an internal detail of the docker-compose stack, not something the user should set. Remove it from the usage text and the env/.env resolution; e2e.sh now just writes the local default (sk-1234) into the generated .env. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
e2e.sh regenerates dev/docker-compose/.env every run and was always writing LANGFUSE_HOST=/PUBLIC_KEY=/SECRET_KEY= even when unset. Since .env is loaded last in compose's env_file list, an empty value silently disabled Langfuse telemetry. Now resolved env -> existing .env, and the LANGFUSE_* lines are only written when non-empty, so values the user set in .env survive across runs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
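A sketch of the "only write when non-empty" rule. The helper `emit_if_set` and the `ENV_OUT` variable are hypothetical names, and the host value is a placeholder.

```shell
#!/usr/bin/env bash
# Sketch only: emit_if_set() and ENV_OUT are not taken from e2e.sh.
ENV_OUT=$(mktemp)   # stand-in for the regenerated dev/docker-compose/.env

# Append KEY=VALUE only when VALUE is non-empty, so an unset variable never
# overrides an earlier env_file entry with an empty string.
emit_if_set() {
  if [ -n "$2" ]; then
    printf '%s=%s\n' "$1" "$2" >> "$ENV_OUT"
  fi
}

emit_if_set LANGFUSE_HOST "https://langfuse.example"
emit_if_set LANGFUSE_PUBLIC_KEY ""      # skipped: no line is written
written=$(cat "$ENV_OUT")
rm -f "$ENV_OUT"
printf '%s\n' "$written"
```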
The pov-submit and bundle-submit waiters used "POV submission response: pov_id=" and "Bundle submission response: bundle_id=" which never match any rendered log line: the only "... submission response:" logs are logger.debug calls whose payload is an API object repr (no literal pov_id=/bundle_id=), while pov_id=/bundle_id= appear only in the separate structured summary line (logger.info) with a different prefix. Result: both milestones always timed out, so every run — including fully successful ones — wasted MILESTONE_TIMEOUT+BUNDLE_TIMEOUT and exited non-zero. Repoint both to the structured summary tokens (pov_id= / bundle_id=) and sync the marker list in .claude/commands/e2e.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ults
Three defects found while verifying the pipeline end-to-end:
1. Approval one-shot race: capture_line 'competition_patch_id=' ran once
right after the patch-generated milestone, but the scheduler logs that
id only minutes later (after it builds+verifies+submits the patch). The
capture always lost the race, so approval was always skipped and the
local stack never reached Patch passed / bundle. Replace with a
wait_capture() poll loop (mirrors wait_for) so approval actually fires.
2. Default --task-duration 1800 is self-defeating: build->POV->seed-gen->
patch exceeds 30 min on normal hardware, so the task expires mid-patch
("task expired/cancelled? Will discard") and never reaches patch/bundle.
Default to 7200 so the task outlives the pipeline.
3. Default --budget 3 cannot reach patch/bundle: a full run through patch
generation costs ~$10; $3 is exhausted around POV. Default to 10.
e2e.md updated to match (defaults, the cheap --budget 3 caveat, and the
poll-then-approve description).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
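The poll-then-capture fix from item 1 can be sketched as below. `wait_capture()` is named in the commit, but its signature and body here are assumptions; `read_logs`, `LOG_FILE`, and the sample log line (including the id `abc123`) are hypothetical stand-ins for the scheduler's compose logs.

```shell
#!/usr/bin/env bash
# Sketch only: signature and body of wait_capture() are assumed.
LOG_FILE=$(mktemp)                 # stand-in for the scheduler log stream
read_logs() { cat "$LOG_FILE"; }

# Poll the logs until a line containing TOKEN appears or TIMEOUT seconds
# elapse. Prints the first matching line; returns 1 on timeout.
wait_capture() {
  local token=$1 timeout=$2 interval=${3:-1} waited=0 line
  while [ "$waited" -lt "$timeout" ]; do
    line=$(read_logs | grep -m1 -- "$token" || true)
    if [ -n "$line" ]; then
      printf '%s\n' "$line"
      return 0
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  return 1
}

# Demo: the id is logged two seconds after polling starts, so a one-shot
# capture would miss it while the poll loop picks it up.
( sleep 2; echo "patch submitted competition_patch_id=abc123" >> "$LOG_FILE" ) &
line=$(wait_capture 'competition_patch_id=' 10)
patch_id=${line##*competition_patch_id=}
wait
rm -f "$LOG_FILE"
printf 'captured id: %s\n' "$patch_id"
```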
What this adds
A Docker-only end-to-end smoke test of the full Buttercup pipeline against example-libpng, with no Kubernetes/minikube. It mirrors the milestones in `.github/workflows/system-integration.yml` but tails `docker compose logs`.
- `scripts/e2e.sh`: brings the `dev/docker-compose/` stack up, submits the canned libpng trigger_task, and waits on the pipeline milestones (fuzzer build → POV submitted → POV accepted → seed-gen → patch generated/approved/passed → bundle submitted; optional SARIF).
- `make e2e` (and `make e2e E2E_ARGS=...`).
- `.claude/commands/e2e.md`: `/e2e` slash command wrapper.
Flags: `--budget` (LiteLLM per-user max budget, default $3), `--task-duration`, `--image-tag` / `BUTTERCUP_IMAGE_TAG`, `--no-pull`, `--keep-up`, `--skip-wait`, `--sarif`, per-phase timeout overrides.
Image source
By default the stack runs the prebuilt GHCR images via the `compose.prebuilt.yaml` overlay (nothing built locally). `--no-pull` skips the `docker compose pull` and uses already-present images (e.g. locally built and tagged `ghcr.io/trailofbits/buttercup/*:<tag>`).
.env handling
`e2e.sh` regenerates `dev/docker-compose/.env` each run. It resolves each value as environment → existing `.env` → placeholder, so manually-set values (e.g. `LANGFUSE_*`) are preserved across runs instead of being clobbered with empty or placeholder values.
Dependency / merge ordering
The prebuilt path invokes `docker compose -f compose.yaml -f compose.prebuilt.yaml`. The `compose.prebuilt.yaml` overlay is not in this PR; it lives on the separate compose-prebuilt branch/PR. This PR should land after or together with that one; on its own the overlay file must already be present in `dev/docker-compose/`.
Scope
e2e tooling only: `.claude/commands/e2e.md`, `Makefile`, `scripts/e2e.sh`. Independent of the three pipeline fixes surfaced while building this (buttercup-ui internal port, litellm budget enforcement, patcher task storage), which are their own separate PRs.
Validation
This tooling was used to drive the pipeline end-to-end during development:
fuzzer build → POV submitted → POV accepted, through seed-gen and patch
generation, with budget tracking and Langfuse tracing.
🤖 Generated with Claude Code