ci(gpu-tests): review fixes — rc capture, ssh keepalive, log artifact, sed guard, doc nits by MauroToscano · Pull Request #777 · yetanotherco/lambda_vm

MauroToscano · 2026-07-03T21:00:54Z

Review follow-ups for #747, targeting its branch. All small, behavior-preserving fixes; shellcheck and actionlint pass.

Fixes

.github/workflows/gpu-tests.yml

rc=$? after if wait_ready ...; fi was always 0 — bash gives a false if with no else exit status 0, so the rc=1 timeout / rc=2 image failure diagnostic in the provisioning warning never worked. Now captured in an else branch (verified empirically).
ssh keepalive on the test session — ConnectTimeout only covers connection setup; a box that wedged or dropped off the network mid-suite would hang the step until the 240-minute job timeout, stalling the merge queue. ServerAliveInterval=60 + ServerAliveCountMax=10 fails ~10 minutes after the box goes dark.
upload the full test log as an artifact (gpu-test-log, 14-day retention) — the workflow's own comment notes the UI truncates/rotates huge step logs; the artifact keeps the complete multi-hour output retrievable. Summary footer now points to it.

scripts/gpu_test.sh

Guard the cudarc-pin sed anchors — if crypto/math-cuda/Cargo.toml's cudarc features are ever renamed or reformatted, the sed would silently no-op, fallback-latest would stay active, and the driver-symbol panic the pin exists to prevent (cuDevSmResourceSplit etc.) would come back with a confusing signature. Now fails loudly.
Restore the mutated Cargo.toml on exit — CI re-checks-out before every run so it never noticed, but a manual run on a dev GPU box was left with a silently dirty tracked file that could be committed by accident.

Makefile

Add test-prover-debug to .PHONY (pre-existing omission; every other test target is listed).
test-cuda-integration comment said the test asserts R1/R2/R3 counters — it also asserts the R4 DEEP-composition and FRI-commit counters.
Scope the test-prover-comprehensive-cuda parity comment: CPU CI's merge-queue comprehensive job also runs test_recursion_execute, which has no GPU leg (see 'Not fixed here' below).

Docs

docs/roadmap.md: GPU FFT, GPU Merkle tree, and GPU FRI were still listed as Planned — all three are implemented (crypto/math-cuda/kernels/), dispatch-counted (crypto/stark/src/gpu_lde.rs), and CI-tested by this very PR. Now Done.
prove_elfs_tests.rs: cargo test --ignored → cargo test -- --ignored in the ignore comment.

Deliberately NOT fixed here (bigger than one-liners)

Vast box leak window: teardown is if: always() in the same job and the cleanup label embeds run_attempt — if the GitHub runner itself dies (infra failure, 6h hard kill), the box leaks and nothing ever sweeps it. Needs a small scheduled sweeper workflow (also covers benchmark-gpu.yml, which has the same gap).
Merge-queue status-check timeout: repo setting, not code — if it's still the 60-min default when gpu-tests becomes a required check, worst-case provisioning (~75 min) fails every queue entry. Check when flipping the required check on.
Recursion on the GPU path: test_recursion_execute is never run with cuda enabled anywhere; adding it needs recursion ELF builds on the box and a test-time budget decision.

…, sed guard, doc nits - gpu-tests.yml: capture wait_ready's exit code in an else branch (rc=$? after 'if ...; fi' is always 0, so the timeout-vs-image-failure diagnostic never worked) - gpu-tests.yml: add ServerAliveInterval/CountMax to the test ssh session so a box that goes dark mid-suite fails in ~10 min instead of eating the 240-min job budget - gpu-tests.yml: upload the full test log as an artifact (the step log gets truncated in the UI for multi-hour runs, as the workflow itself notes) - gpu_test.sh: guard the cudarc-pin sed anchors (a silent no-op after a math-cuda Cargo.toml refactor would resurrect the fallback-latest driver-symbol panic) and restore the mutated Cargo.toml on exit so manual runs don't leave the tree dirty - Makefile: add test-prover-debug to .PHONY; correct the test-cuda-integration comment (the test asserts R1-R4 counters, not R1-R3); scope the comprehensive-cuda parity comment (CPU's merge-queue job also runs test_recursion_execute) - docs/roadmap.md: GPU FFT / Merkle tree / FRI are implemented and CI-tested, not 'Planned' - prove_elfs_tests.rs: fix 'cargo test --ignored' -> 'cargo test -- --ignored' in the ignore comment

* ci: run tests on GPU server * keep instance running for debug * ci: require CUDA 13.1 * ci: test prover on CUDA * ci: improve gpu failed tests summaries * ci: test-threads=1 * remove temporary code * fix: set cuda_max_good>=12.8 * comments * apply code review * dummy comment * remove tmp code * reduce to 48gb * add retries * rent server with cuda 13.1 * print nvidia-smi * print machine info in a separated step * remove temp code * ci(gpu-tests): review fixes — rc capture, ssh keepalive, log artifact, sed guard, doc nits (#777) - gpu-tests.yml: capture wait_ready's exit code in an else branch (rc=$? after 'if ...; fi' is always 0, so the timeout-vs-image-failure diagnostic never worked) - gpu-tests.yml: add ServerAliveInterval/CountMax to the test ssh session so a box that goes dark mid-suite fails in ~10 min instead of eating the 240-min job budget - gpu-tests.yml: upload the full test log as an artifact (the step log gets truncated in the UI for multi-hour runs, as the workflow itself notes) - gpu_test.sh: guard the cudarc-pin sed anchors (a silent no-op after a math-cuda Cargo.toml refactor would resurrect the fallback-latest driver-symbol panic) and restore the mutated Cargo.toml on exit so manual runs don't leave the tree dirty - Makefile: add test-prover-debug to .PHONY; correct the test-cuda-integration comment (the test asserts R1-R4 counters, not R1-R3); scope the comprehensive-cuda parity comment (CPU's merge-queue job also runs test_recursion_execute) - docs/roadmap.md: GPU FFT / Merkle tree / FRI are implemented and CI-tested, not 'Planned' - prove_elfs_tests.rs: fix 'cargo test --ignored' -> 'cargo test -- --ignored' in the ignore comment --------- Co-authored-by: MauroFab <maurotoscano2@gmail.com> Co-authored-by: Mauro Toscano <12560266+MauroToscano@users.noreply.github.com>

MauroToscano merged commit 828000e into ci_run_tests_gpu Jul 3, 2026
6 checks passed

MauroToscano deleted the fix/pr747-review-nits branch July 3, 2026 21:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

ci(gpu-tests): review fixes — rc capture, ssh keepalive, log artifact, sed guard, doc nits#777

ci(gpu-tests): review fixes — rc capture, ssh keepalive, log artifact, sed guard, doc nits#777
MauroToscano merged 1 commit into
ci_run_tests_gpufrom
fix/pr747-review-nits

MauroToscano commented Jul 3, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

MauroToscano commented Jul 3, 2026

Fixes

Deliberately NOT fixed here (bigger than one-liners)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant