feat(cuda): Round 2 composition parts + Round 3 OOD barycentric dispatch #637
Merged
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
gabrielbosio
reviewed
Jun 1, 2026
Collaborator
gabrielbosio
left a comment
It would be nice to move the following test helpers to a new file and make the math-cuda tests use them to avoid code duplication:
- `type Fp` / `type Fp3` aliases
- one `rand_fp` / `rand_fp3` (the random generators, currently `random_fp` / `rand_fp` / `rand_ext3`)
- `ext3_to_u64s` / `u64s_to_ext3` (the interleaved packing)
- the canonicalization family (`canon`, `canon_fp3` / `canon3` / `canon_triplet`, `canon_triplet_raw`)
- `reverse_index`
This can be addressed in another PR though.
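A shared helper file along these lines might look like the sketch below. It is only an illustration: the `Fp3 = [u64; 3]` alias and the signatures are assumptions, not the helpers currently duplicated in the math-cuda tests, and `reverse_index` is assumed here to be a plain bit-reversal of a `log_n`-bit index.

```rust
// Hypothetical stand-in for a degree-3 extension element: three base-field limbs.
type Fp3 = [u64; 3];

/// Packs ext3 elements into a flat interleaved buffer: [a0, a1, a2, b0, b1, b2, ...].
fn ext3_to_u64s(xs: &[Fp3]) -> Vec<u64> {
    xs.iter().flat_map(|x| x.iter().copied()).collect()
}

/// Inverse of `ext3_to_u64s`. Panics if the length is not a multiple of 3.
fn u64s_to_ext3(flat: &[u64]) -> Vec<Fp3> {
    assert_eq!(flat.len() % 3, 0, "interleaved ext3 buffer needs 3 limbs per element");
    flat.chunks_exact(3).map(|c| [c[0], c[1], c[2]]).collect()
}

/// Bit-reverses `i` within a domain of `2^log_n` entries (assumes `i < 2^log_n`).
fn reverse_index(i: usize, log_n: u32) -> usize {
    if log_n == 0 {
        return 0; // a single-entry domain has only index 0
    }
    i.reverse_bits() >> (usize::BITS - log_n)
}
```

Keeping the packing and its inverse side by side makes the round-trip property easy to assert once, instead of once per test file.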
gabrielbosio
reviewed
Jun 1, 2026
Collaborator
gabrielbosio
left a comment
It would be good to add a make command that runs clippy with the cuda feature:
clippy-cuda:
	cargo clippy -p stark --features cuda --all-targets -- -D warnings -A clippy::op_ref
gabrielbosio
previously approved these changes
Jun 2, 2026
MauroToscano
approved these changes
Jun 2, 2026
gabrielbosio
approved these changes
Jun 3, 2026
Summary
Builds on PR-2 (#582). Adds the GPU dispatch for Round 2 composition-poly LDE + Merkle commit (the `number_of_parts > 2` branch, exercised today by the branch and shift tables since they have degree-3 transition constraints) and Round 3 OOD trace barycentric (reads PR-2's `gpu_main` / `gpu_aux` device handles, no host-side LDE traversal).

The cuda feature stays opt-in. CPU is the default and untouched. With cuda on, every new dispatch site falls through to the existing rayon CPU path when the type isn't Goldilocks/ext3, the size is below threshold, the GPU R1 handle is absent for this table, or the math-cuda call returns `Err`.

R4 DEEP, R4 FRI and the `_keep` variant of R2 (which retains a `gpu_composition_parts` handle for R4 DEEP) are deferred to PR-4.

What's in
- `crypto/math-cuda/kernels/barycentric.cu` (new, ~190 LoC). Four kernels: `barycentric_{base,ext3}_batched` and `barycentric_{base,ext3}_batched_strided`. The strided variants read an LDE buffer at a row stride (used by R3 to pick the trace-size coset out of the device-resident LDE without materialising a slab).
- `crypto/math-cuda/src/barycentric.rs` (new, ~215 LoC). Four host wrappers: `barycentric_{base,ext3}` for host data and `barycentric_{base,ext3}_on_device` for `&GpuLdeBase` / `&GpuLdeExt3` handles.
- `crypto/math-cuda/src/device.rs` (+14). `BARY_PTX` const, four `CudaFunction` fields for the new kernels.
- `crypto/math-cuda/build.rs` (+1). `compile_ptx("barycentric.cu", ...)`.
- `crypto/math-cuda/src/lib.rs` (+1). `pub mod barycentric`.
- `crypto/math-cuda/tests/` (new, ~530 LoC across 3 files). `barycentric.rs` and `barycentric_strided.rs` cover the four kernels against a CPU reference summing the unscaled barycentric over base / ext3 columns with optional stride. `comp_poly_tree.rs` exercises the fused `evaluate_poly_coset_batch_ext3_into_with_merkle_tree` end-to-end against the CPU `commit_composition_polynomial` for sizes from `(log_n=2, blowup=2)` up to `(log_n=14, blowup=2)`.
- `crypto/stark/src/gpu_lde.rs` (+~390 LoC). New dispatches:
  - `try_evaluate_parts_on_lde_gpu` (R2, non-`_keep` ext3 LDE for the parts > 2 branch).
  - `try_build_comp_poly_tree_gpu` (R2 row-pair Keccak leaves + inner tree from host evals).
  - `try_barycentric_base_on_handle` + `try_barycentric_ext3_on_handle` (R3 OOD reading PR-2's device handles).
  - `ood_ext3_scalar` and `apply_ext3_scalar` for the per-column scalar application.
  - Two new call counters (`gpu_parts_lde_calls`, `gpu_bary_calls`) and a separate `LAMBDA_VM_GPU_BARY_THRESHOLD` env override (default `2^14`). Both new counters reset via `reset_all_gpu_call_counters()`.
- `crypto/stark/src/prover.rs` (~70 LoC of changes). R2 dispatch in `round_2_compute_composition_polynomial`: pre-compute `composition_poly_parts` once, then GPU-or-CPU for the LDE step, then GPU-or-CPU for the comp-poly Merkle commit. `Round2` struct is unchanged.
- `crypto/stark/src/trace.rs` (~80 LoC of changes). R3 dispatch in `get_trace_evaluations_from_lde`: per eval-point, try `try_barycentric_base_on_handle` for main and `try_barycentric_ext3_on_handle` for aux; on `None`, run the existing rayon CPU loop. `inv_denoms` stays on CPU (documented stream-contention regression). Added `+ 'static` bound on the function's type params to support `TypeId` dispatch in the new GPU branches.

Known limitations carried over from PR-2
- `--test-threads=1`. Math-cuda-side fix, out of scope here.
- `LAMBDA_VM_GPU_LDE_THRESHOLD=0` forces small-domain tables through math-cuda kernels that panic at `log_n < 1`. Pre-existing regression, present on PR-2 baseline, not introduced by this PR.
- `num_AIRs * per_table_LDE` because R1 handles are retained across all rounds. PR-3 does not add new device-resident handles, so the ceiling is unchanged from PR-2. A follow-up PR will introduce a VRAM budget that gracefully falls back to non-`_keep` when retention would OOM the GPU.

Continuation of
Builds on PR-2 (#582). Base branch is `feat/cuda-pr2-r1-gpu-commits`. PR-4 (R4 DEEP + FRI + batch invert + R2 `_keep`) stacks on top.
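
For readers skimming the description, the fall-through rule from the Summary (wrong field type, size below threshold, missing R1 handle, or an `Err` from math-cuda all land on the CPU path) can be sketched roughly as below. This is not the code in `gpu_lde.rs`: `Goldilocks`, `OtherField`, `GpuHandle`, `parse_threshold`, `try_gpu`, `cpu_path` and `dispatch` are hypothetical names, and the "GPU call" is an injected closure so the sketch runs without a device.

```rust
use std::any::TypeId;

// Hypothetical marker types standing in for the real field types.
struct Goldilocks;
struct OtherField;

// Hypothetical stand-in for a device-resident handle retained from Round 1.
struct GpuHandle;

/// Parses a threshold override (e.g. the value of an env var), defaulting to 2^14.
fn parse_threshold(raw: Option<&str>) -> usize {
    raw.and_then(|s| s.trim().parse().ok()).unwrap_or(1 << 14)
}

/// Returns `Some` only when every GPU precondition holds and the call succeeds.
fn try_gpu<F: 'static, G>(
    data: &[u64],
    threshold: usize,
    handle: Option<&GpuHandle>,
    gpu_call: G,
) -> Option<u64>
where
    G: Fn(&GpuHandle, &[u64]) -> Result<u64, String>,
{
    // The `'static` bound on `F` is what makes `TypeId::of` usable here.
    if TypeId::of::<F>() != TypeId::of::<Goldilocks>() {
        return None; // unsupported field type
    }
    if data.len() < threshold {
        return None; // too small to be worth a device round trip
    }
    let handle = handle?; // no handle retained for this table
    gpu_call(handle, data).ok() // an `Err` from the GPU side becomes `None`
}

/// Placeholder for the existing CPU loop.
fn cpu_path(data: &[u64]) -> u64 {
    data.iter().sum()
}

/// GPU-or-CPU: any `None` from `try_gpu` falls through to the CPU path.
fn dispatch<F: 'static, G>(
    data: &[u64],
    threshold: usize,
    handle: Option<&GpuHandle>,
    gpu_call: G,
) -> u64
where
    G: Fn(&GpuHandle, &[u64]) -> Result<u64, String>,
{
    try_gpu::<F, G>(data, threshold, handle, gpu_call).unwrap_or_else(|| cpu_path(data))
}
```

In this sketch the threshold would come from something like `parse_threshold(std::env::var("LAMBDA_VM_GPU_BARY_THRESHOLD").ok().as_deref())`; an unset or unparsable value yields the `2^14` default.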
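
As a reference point for what the strided barycentric kernels compute, here is a small CPU sketch over the Goldilocks base field. It makes simplifying assumptions that may not match the PR: the LDE is in natural order over a plain subgroup (no coset offset, no bit-reversal), there is a single base-field column, and the function names are hypothetical. The sum is computed unscaled and the scalar `(z^n - 1) / n` is applied afterwards, which is one plausible reading of the "unscaled barycentric" plus per-column scalar split described above.

```rust
/// Goldilocks prime, 2^64 - 2^32 + 1. All inputs are assumed reduced mod P.
const P: u64 = 0xFFFF_FFFF_0000_0001;

fn add(a: u64, b: u64) -> u64 {
    ((a as u128 + b as u128) % (P as u128)) as u64
}
fn sub(a: u64, b: u64) -> u64 {
    ((a as u128 + P as u128 - b as u128) % (P as u128)) as u64
}
fn mul(a: u64, b: u64) -> u64 {
    ((a as u128 * b as u128) % (P as u128)) as u64
}
/// Square-and-multiply exponentiation.
fn pow(mut base: u64, mut exp: u64) -> u64 {
    let mut acc = 1;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = mul(acc, base);
        }
        base = mul(base, base);
        exp >>= 1;
    }
    acc
}
/// Inverse via Fermat's little theorem (input must be non-zero).
fn inv(a: u64) -> u64 {
    pow(a, P - 2)
}

/// Unscaled sum: sum_i lde[i * stride] * omega^i / (z - omega^i).
fn unscaled_barycentric_strided(lde: &[u64], stride: usize, n: usize, omega: u64, z: u64) -> u64 {
    let mut sum = 0;
    let mut x = 1; // omega^i
    for i in 0..n {
        let term = mul(mul(lde[i * stride], x), inv(sub(z, x)));
        sum = add(sum, term);
        x = mul(x, omega);
    }
    sum
}

/// Applies the scalar (z^n - 1) / n to the unscaled sum.
fn barycentric_strided(lde: &[u64], stride: usize, n: usize, omega: u64, z: u64) -> u64 {
    let scale = mul(sub(pow(z, n as u64), 1), inv(n as u64));
    mul(scale, unscaled_barycentric_strided(lde, stride, n, omega, z))
}

/// Horner evaluation, used only as a reference to check against.
fn eval_poly(coeffs: &[u64], x: u64) -> u64 {
    coeffs.iter().rev().fold(0, |acc, &c| add(mul(acc, x), c))
}
```

With `stride` equal to the blowup factor, every `stride`-th row of the size `n * blowup` LDE is an evaluation on the size-`n` subgroup, so the trace-size domain is read in place without copying it out first. The point `z` must lie outside that subgroup, otherwise a denominator is zero.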