ci(gpu-tests): review fixes — rc capture, ssh keepalive, log artifact, sed guard, doc nits#777
Merged
Merged
Conversation
…, sed guard, doc nits - gpu-tests.yml: capture wait_ready's exit code in an else branch (rc=$? after 'if ...; fi' is always 0, so the timeout-vs-image-failure diagnostic never worked) - gpu-tests.yml: add ServerAliveInterval/CountMax to the test ssh session so a box that goes dark mid-suite fails in ~10 min instead of eating the 240-min job budget - gpu-tests.yml: upload the full test log as an artifact (the step log gets truncated in the UI for multi-hour runs, as the workflow itself notes) - gpu_test.sh: guard the cudarc-pin sed anchors (a silent no-op after a math-cuda Cargo.toml refactor would resurrect the fallback-latest driver-symbol panic) and restore the mutated Cargo.toml on exit so manual runs don't leave the tree dirty - Makefile: add test-prover-debug to .PHONY; correct the test-cuda-integration comment (the test asserts R1-R4 counters, not R1-R3); scope the comprehensive-cuda parity comment (CPU's merge-queue job also runs test_recursion_execute) - docs/roadmap.md: GPU FFT / Merkle tree / FRI are implemented and CI-tested, not 'Planned' - prove_elfs_tests.rs: fix 'cargo test --ignored' -> 'cargo test -- --ignored' in the ignore comment
MauroToscano
added a commit
that referenced
this pull request
Jul 3, 2026
* ci: run tests on GPU server * keep instance running for debug * ci: require CUDA 13.1 * ci: test prover on CUDA * ci: improve gpu failed tests summaries * ci: test-threads=1 * remove temporary code * fix: set cuda_max_good>=12.8 * comments * apply code review * dummy comment * remove tmp code * reduce to 48gb * add retries * rent server with cuda 13.1 * print nvidia-smi * print machine info in a separated step * remove temp code * ci(gpu-tests): review fixes — rc capture, ssh keepalive, log artifact, sed guard, doc nits (#777) - gpu-tests.yml: capture wait_ready's exit code in an else branch (rc=$? after 'if ...; fi' is always 0, so the timeout-vs-image-failure diagnostic never worked) - gpu-tests.yml: add ServerAliveInterval/CountMax to the test ssh session so a box that goes dark mid-suite fails in ~10 min instead of eating the 240-min job budget - gpu-tests.yml: upload the full test log as an artifact (the step log gets truncated in the UI for multi-hour runs, as the workflow itself notes) - gpu_test.sh: guard the cudarc-pin sed anchors (a silent no-op after a math-cuda Cargo.toml refactor would resurrect the fallback-latest driver-symbol panic) and restore the mutated Cargo.toml on exit so manual runs don't leave the tree dirty - Makefile: add test-prover-debug to .PHONY; correct the test-cuda-integration comment (the test asserts R1-R4 counters, not R1-R3); scope the comprehensive-cuda parity comment (CPU's merge-queue job also runs test_recursion_execute) - docs/roadmap.md: GPU FFT / Merkle tree / FRI are implemented and CI-tested, not 'Planned' - prove_elfs_tests.rs: fix 'cargo test --ignored' -> 'cargo test -- --ignored' in the ignore comment --------- Co-authored-by: MauroFab <maurotoscano2@gmail.com> Co-authored-by: Mauro Toscano <12560266+MauroToscano@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Review follow-ups for #747, targeting its branch. All small, behavior-preserving fixes;
shellcheckandactionlintpass.Fixes
.github/workflows/gpu-tests.ymlrc=$?afterif wait_ready ...; fiwas always 0 — bash gives a falseifwith noelseexit status 0, so therc=1 timeout / rc=2 image failurediagnostic in the provisioning warning never worked. Now captured in anelsebranch (verified empirically).ConnectTimeoutonly covers connection setup; a box that wedged or dropped off the network mid-suite would hang the step until the 240-minute job timeout, stalling the merge queue.ServerAliveInterval=60+ServerAliveCountMax=10fails ~10 minutes after the box goes dark.gpu-test-log, 14-day retention) — the workflow's own comment notes the UI truncates/rotates huge step logs; the artifact keeps the complete multi-hour output retrievable. Summary footer now points to it.scripts/gpu_test.shcrypto/math-cuda/Cargo.toml's cudarc features are ever renamed or reformatted, the sed would silently no-op,fallback-latestwould stay active, and the driver-symbol panic the pin exists to prevent (cuDevSmResourceSplitetc.) would come back with a confusing signature. Now fails loudly.Cargo.tomlon exit — CI re-checks-out before every run so it never noticed, but a manual run on a dev GPU box was left with a silently dirty tracked file that could be committed by accident.Makefiletest-prover-debugto.PHONY(pre-existing omission; every other test target is listed).test-cuda-integrationcomment said the test asserts R1/R2/R3 counters — it also asserts the R4 DEEP-composition and FRI-commit counters.test-prover-comprehensive-cudaparity comment: CPU CI's merge-queue comprehensive job also runstest_recursion_execute, which has no GPU leg (see 'Not fixed here' below).Docs
docs/roadmap.md: GPU FFT, GPU Merkle tree, and GPU FRI were still listed as Planned — all three are implemented (crypto/math-cuda/kernels/), dispatch-counted (crypto/stark/src/gpu_lde.rs), and CI-tested by this very PR. Now Done.prove_elfs_tests.rs:cargo test --ignored→cargo test -- --ignoredin the ignore comment.Deliberately NOT fixed here (bigger than one-liners)
if: always()in the same job and the cleanup label embedsrun_attempt— if the GitHub runner itself dies (infra failure, 6h hard kill), the box leaks and nothing ever sweeps it. Needs a small scheduled sweeper workflow (also coversbenchmark-gpu.yml, which has the same gap).gpu-testsbecomes a required check, worst-case provisioning (~75 min) fails every queue entry. Check when flipping the required check on.test_recursion_executeis never run withcudaenabled anywhere; adding it needs recursion ELF builds on the box and a test-time budget decision.