feat(cuda): Round 2 composition parts + Round 3 OOD barycentric dispatch by ColoCarletti · Pull Request #637 · yetanotherco/lambda_vm

ColoCarletti · 2026-06-01T12:56:15Z

Summary

Builds on PR-2 (#582). Adds GPU dispatch for two steps: the Round 2 composition-polynomial LDE and Merkle commit (the number_of_parts > 2 branch, currently exercised by the branch and shift tables because they have degree-3 transition constraints), and the Round 3 OOD trace barycentric evaluation, which reads PR-2's gpu_main / gpu_aux device handles and needs no host-side LDE traversal.

The cuda feature stays opt-in. CPU is the default and untouched. With cuda on, every new dispatch site falls through to the existing rayon CPU path when the type isn't Goldilocks/ext3, the size is below threshold, the GPU R1 handle is absent for this table, or the math-cuda call returns Err.
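The fall-through rule above can be sketched as a `try_*` function that returns `None` whenever any precondition fails, with the caller running the CPU path in that case. This is only an illustration under stated assumptions: `Goldilocks`, `OtherField`, `fake_gpu_sum`, `try_sum_gpu`, `sum_dispatch`, `parse_threshold` and `GPU_CALLS` are hypothetical stand-ins, not the PR's types or functions, and the "kernel" is a plain sum.

```rust
use std::any::TypeId;
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical marker types standing in for the real field types.
pub struct Goldilocks;
pub struct OtherField;

/// Hypothetical call counter, in the spirit of the PR's atomic counters.
pub static GPU_CALLS: AtomicU64 = AtomicU64::new(0);

/// Parse a threshold override (e.g. the value of an env var);
/// fall back to 2^14 when the value is unset or not a number.
pub fn parse_threshold(raw: Option<&str>) -> usize {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
        .unwrap_or(1 << 14)
}

/// Stand-in for a GPU library call that may fail.
pub fn fake_gpu_sum(data: &[u64], fail: bool) -> Result<u64, String> {
    if fail {
        return Err("device error".to_string());
    }
    Ok(data.iter().fold(0u64, |a, &b| a.wrapping_add(b)))
}

/// Returns Some(result) only if every precondition holds:
/// supported type, size at or above threshold, handle present, call succeeds.
pub fn try_sum_gpu<F: 'static>(
    data: &[u64],
    handle: Option<&()>,
    threshold: usize,
    fail: bool,
) -> Option<u64> {
    // TypeId comparison is why the type parameter needs a 'static bound.
    if TypeId::of::<F>() != TypeId::of::<Goldilocks>() {
        return None;
    }
    if data.len() < threshold {
        return None;
    }
    let _handle = handle?;
    let out = fake_gpu_sum(data, fail).ok()?;
    GPU_CALLS.fetch_add(1, Ordering::Relaxed);
    Some(out)
}

/// Existing CPU path (here: a plain sum).
pub fn sum_cpu(data: &[u64]) -> u64 {
    data.iter().fold(0u64, |a, &b| a.wrapping_add(b))
}

/// GPU-or-CPU dispatch; the bool reports whether the GPU path was taken.
pub fn sum_dispatch<F: 'static>(
    data: &[u64],
    handle: Option<&()>,
    threshold: usize,
    fail: bool,
) -> (u64, bool) {
    match try_sum_gpu::<F>(data, handle, threshold, fail) {
        Some(v) => (v, true),
        None => (sum_cpu(data), false),
    }
}
```

Returning `Option` from the `try_*` function keeps each call site to a single `match`, and every failure mode (wrong type, small input, missing handle, `Err` from the device call) collapses into the same CPU fallback.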

R4 DEEP, R4 FRI and the _keep variant of R2 (which retains a gpu_composition_parts handle for R4 DEEP) are deferred to PR-4.

What's in

crypto/math-cuda/kernels/barycentric.cu (new, ~190 LoC). Four kernels: barycentric_{base,ext3}_batched and barycentric_{base,ext3}_batched_strided. The strided variants read an LDE buffer at a row stride (used by R3 to pick the trace-size coset out of the device-resident LDE without materialising a slab).
crypto/math-cuda/src/barycentric.rs (new, ~215 LoC). Four host wrappers: barycentric_{base,ext3} for host data and barycentric_{base,ext3}_on_device for &GpuLdeBase / &GpuLdeExt3 handles.
crypto/math-cuda/src/device.rs (+14). BARY_PTX const, four CudaFunction fields for the new kernels.
crypto/math-cuda/build.rs (+1). compile_ptx("barycentric.cu", ...).
crypto/math-cuda/src/lib.rs (+1). pub mod barycentric.
crypto/math-cuda/tests/ (new, ~530 LoC across 3 files). barycentric.rs and barycentric_strided.rs cover the four kernels against a CPU reference summing the unscaled barycentric over base / ext3 columns with optional stride. comp_poly_tree.rs exercises the fused evaluate_poly_coset_batch_ext3_into_with_merkle_tree end-to-end against the CPU commit_composition_polynomial for sizes from (log_n=2, blowup=2) up to (log_n=14, blowup=2).
crypto/stark/src/gpu_lde.rs (+~390 LoC). New dispatches:
- try_evaluate_parts_on_lde_gpu (R2, non-_keep ext3 LDE for the parts > 2 branch).
- try_build_comp_poly_tree_gpu (R2 row-pair Keccak leaves + inner tree from host evals).
- try_barycentric_base_on_handle + try_barycentric_ext3_on_handle (R3 OOD reading PR-2's device handles).
- Host helpers ood_ext3_scalar and apply_ext3_scalar for the per-column scalar application.
- Two new atomic counters (gpu_parts_lde_calls, gpu_bary_calls) and a separate LAMBDA_VM_GPU_BARY_THRESHOLD env override (default 2^14). Both new
  counters reset via reset_all_gpu_call_counters().
crypto/stark/src/prover.rs (~70 LoC of changes). R2 dispatch in round_2_compute_composition_polynomial: pre-compute composition_poly_parts once, then GPU-or-CPU for the LDE step, then GPU-or-CPU for the comp-poly Merkle commit. Round2 struct is unchanged.
crypto/stark/src/trace.rs (~80 LoC of changes). R3 dispatch in get_trace_evaluations_from_lde: for each eval point, call try_barycentric_base_on_handle for main and try_barycentric_ext3_on_handle for aux; on None, run the existing rayon CPU loop. inv_denoms stays on CPU (documented stream-contention regression). Added a + 'static bound on the function's type params to support TypeId dispatch in the new GPU branches.
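As a rough CPU illustration of what a strided, unscaled barycentric sum computes, the sketch below works over the Goldilocks base field only. For a polynomial f of degree < n with evaluations on the order-n subgroup {x_i}, one standard form is f(z) = (z^n - 1)/n * sum_i f(x_i) * x_i / (z - x_i); the sum is the "unscaled" part and the prefactor is a host-side scalar. The assumptions here are mine, not the PR's: rows are in natural (not bit-reversed) order, the coset offset is 1, and all function names are hypothetical.

```rust
/// Goldilocks prime p = 2^64 - 2^32 + 1. Inputs are assumed canonical (< P).
pub const P: u64 = 0xFFFF_FFFF_0000_0001;

pub fn add(a: u64, b: u64) -> u64 {
    ((a as u128 + b as u128) % P as u128) as u64
}
pub fn sub(a: u64, b: u64) -> u64 {
    ((a as u128 + P as u128 - b as u128) % P as u128) as u64
}
pub fn mul(a: u64, b: u64) -> u64 {
    ((a as u128 * b as u128) % P as u128) as u64
}
pub fn pow(mut b: u64, mut e: u64) -> u64 {
    let mut acc = 1u64;
    while e > 0 {
        if e & 1 == 1 {
            acc = mul(acc, b);
        }
        b = mul(b, b);
        e >>= 1;
    }
    acc
}
/// Inverse via Fermat's little theorem (a must be nonzero).
pub fn inv(a: u64) -> u64 {
    pow(a, P - 2)
}

/// Unscaled sum S = sum_i f(x_i) * x_i * inv_denoms[i],
/// where row i of the coset is read at lde[i * stride].
pub fn unscaled_barycentric_strided(
    lde: &[u64],
    stride: usize,
    points: &[u64],
    inv_denoms: &[u64],
) -> u64 {
    points
        .iter()
        .zip(inv_denoms)
        .enumerate()
        .fold(0, |acc, (i, (&x, &d))| add(acc, mul(lde[i * stride], mul(x, d))))
}

/// f(z) from evaluations on the order-n subgroup generated by omega.
/// z must lie outside that subgroup.
pub fn barycentric_eval(lde: &[u64], stride: usize, omega: u64, n: usize, z: u64) -> u64 {
    let points: Vec<u64> = (0..n as u64).map(|i| pow(omega, i)).collect();
    // Inverse denominators 1 / (z - x_i), computed on the host.
    let inv_denoms: Vec<u64> = points.iter().map(|&x| inv(sub(z, x))).collect();
    // Scalar prefactor (z^n - 1) / n applied after the sum.
    let scale = mul(sub(pow(z, n as u64), 1), inv(n as u64));
    mul(scale, unscaled_barycentric_strided(lde, stride, &points, &inv_denoms))
}

/// Direct evaluation by Horner's rule, used as a reference.
pub fn horner(coeffs: &[u64], x: u64) -> u64 {
    coeffs.iter().rev().fold(0, |acc, &c| add(mul(acc, x), c))
}
```

With a blowup of 2, reading an order-8 evaluation buffer at stride 2 recovers the order-4 subgroup, so `barycentric_eval(&lde, 2, omega4, 4, z)` and `barycentric_eval(&lde, 1, omega8, 8, z)` should agree for any polynomial of degree below 4. In Goldilocks, 2^24 has order 8 because 2^96 is congruent to -1.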

Known limitations carried over from PR-2

Parallel cuda tests still deadlock under default rayon (pinned-staging mutex contention). Workaround: --test-threads=1. Math-cuda-side fix, out of scope here.
LAMBDA_VM_GPU_LDE_THRESHOLD=0 forces small-domain tables through math-cuda kernels that panic at log_n < 1. Pre-existing regression, present on PR-2 baseline, not introduced by this PR.
Peak VRAM still scales with num_AIRs * per_table_LDE because R1 handles are retained across all rounds. PR-3 does not add new device-resident handles, so the ceiling is unchanged from PR-2. A follow-up PR will introduce a VRAM budget that gracefully falls back to non-_keep when retention would OOM the GPU.

Continuation of

Builds on PR-2 (#582). Base branch is feat/cuda-pr2-r1-gpu-commits. PR-4 (R4 DEEP + FRI + batch invert + R2 _keep) stacks on top.

Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>

gabrielbosio

It would be nice to move the following test helpers to a new file and make the math-cuda tests use them to avoid code duplication:

- type Fp / type Fp3 aliases
- one rand_fp / rand_fp3 (the random generators, currently random_fp/rand_fp/rand_ext3)
- ext3_to_u64s / u64s_to_ext3 (the interleaved packing)
- the canonicalization family (canon, canon_fp3/canon3/canon_triplet, canon_triplet_raw)
- reverse_index

This can be addressed in another PR though.
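A shared helper module along these lines might look roughly like the sketch below. It is a guess at the shape only: the real helpers operate on the project's field types, whereas this version uses raw u64 limbs and [u64; 3] triplets, and the signatures (including reverse_index taking log_n) are assumptions rather than the existing ones.

```rust
/// Goldilocks prime p = 2^64 - 2^32 + 1.
pub const P: u64 = 0xFFFF_FFFF_0000_0001;

/// Canonical representative in [0, P). One subtraction is enough
/// because any u64 is below 2 * P.
pub fn canon(x: u64) -> u64 {
    if x >= P {
        x - P
    } else {
        x
    }
}

/// Canonicalize each limb of an ext3 element stored as three u64s.
pub fn canon_triplet(t: [u64; 3]) -> [u64; 3] {
    t.map(canon)
}

/// Interleaved packing: [a0, a1, a2, b0, b1, b2, ...].
pub fn ext3_to_u64s(xs: &[[u64; 3]]) -> Vec<u64> {
    xs.iter().flatten().copied().collect()
}

/// Inverse of `ext3_to_u64s`; the input length must be a multiple of 3.
pub fn u64s_to_ext3(flat: &[u64]) -> Vec<[u64; 3]> {
    assert!(flat.len() % 3 == 0, "length must be a multiple of 3");
    flat.chunks_exact(3).map(|c| [c[0], c[1], c[2]]).collect()
}

/// Reverse the low `log_n` bits of `i` (assumes i < 2^log_n).
pub fn reverse_index(i: usize, log_n: u32) -> usize {
    if log_n == 0 {
        0
    } else {
        i.reverse_bits() >> (usize::BITS - log_n)
    }
}
```

Keeping one copy of the packing and canonicalization helpers means the GPU-vs-CPU comparisons in each test file agree on the limb layout by construction.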

gabrielbosio

It would be good to add a make command that runs clippy with the cuda feature:

clippy-cuda:
    cargo clippy -p stark --features cuda --all-targets -- -D warnings -A clippy::op_ref

Waiting to resolve previous comments

ColoCarletti and others added 30 commits May 6, 2026 15:12

add first cuda files

d1a0abf

fmt

79634ff

fix clippy

ac6fbb5

gpu 2nd part

2ceb3b0

feat(cuda): Round 1 GPU LDE+commit dispatch + device-resident handles

affceb1

merge main

01172f2

Merge branch 'main' into feat/cuda-pr2-r1-gpu-commits

c4627e1

comments fix

01aa5e4

Merge branch 'main' into feat/cuda-pr2-r1-gpu-commits

cfc5c19

Update crypto/stark/src/gpu_lde.rs

ea5696f

Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>

Update crypto/stark/src/gpu_lde.rs

a8cf265

Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>

Update crypto/stark/src/gpu_lde.rs

fb8d31f

Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>

Update crypto/stark/src/gpu_lde.rs

a79f2b5

Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>

Update crypto/stark/src/gpu_lde.rs

761a2c0

Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>

address reviews

e066e9d

fix review comments

7d3d0f0

Merge remote-tracking branch 'origin/main' into feat/cuda-pr2-r1-gpu-…

cf80771

…commits

address doc comment suggestions

71aba0d

Merge branch 'main' into feat/cuda-pr2-r1-gpu-commits

83d91b8

fix

34cae4b

Merge branch 'main' into feat/cuda-pr2-r1-gpu-commits

f076bf4

Pass replay transcript to bus-balance call in verify_vm_minimal

a2cde0f

Update crypto/math-cuda/src/device.rs

46c305b

Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>

Merge branch 'main' into feat/cuda-pr2-r1-gpu-commits

aca3dca

Update crypto/math-cuda/src/device.rs

63d7c00

Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>

Update crypto/math-cuda/src/device.rs

eb16c02

Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>

Update crypto/math-cuda/src/device.rs

66925b1

Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>

Update crypto/math-cuda/src/lde.rs

4e6daf3

Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>

Update crypto/math-cuda/src/lde.rs

4cd27d9

Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>

Update crypto/math-cuda/src/lde.rs

5fe390f

Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>