feat(cuda): Round 2 composition parts + Round 3 OOD barycentric dispatch #637

Merged
ColoCarletti merged 48 commits into main from feat/cuda-pr3 on Jun 3, 2026
@ColoCarletti
Collaborator

Summary

Builds on PR-2 (#582) and adds GPU dispatch for two prover steps. The first is the Round 2 composition-poly LDE + Merkle commit, specifically the number_of_parts > 2 branch, which is exercised today by the branch and shift tables because they have degree-3 transition constraints. The second is the Round 3 OOD trace barycentric evaluation, which reads PR-2's gpu_main / gpu_aux device handles and needs no host-side LDE traversal.

The cuda feature stays opt-in. CPU is the default and untouched. With cuda on, every new dispatch site falls through to the existing rayon CPU path when the type isn't Goldilocks/ext3, the size is below threshold, the GPU R1 handle is absent for this table, or the math-cuda call returns Err.
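
The fall-through rule above can be pictured with a minimal sketch. Every name here (Goldilocks, OtherField, GpuHandle, gpu_sum, try_sum_gpu, GPU_CALLS, threshold_from_env) is a hypothetical stand-in rather than the PR's actual API, and the "GPU" call is simulated on the host; only the shape of the control flow is meant to carry over: each guard returns None, and the caller then takes the CPU path.

```rust
use std::any::TypeId;
use std::sync::atomic::{AtomicU64, Ordering};

// All names below are hypothetical stand-ins, not the PR's actual types.
pub struct Goldilocks;
pub struct OtherField;

/// Stands in for a device-resident LDE handle retained from Round 1.
pub struct GpuHandle {
    pub healthy: bool,
}

/// Counts successful GPU dispatches (assumed analogue of the PR's counters).
pub static GPU_CALLS: AtomicU64 = AtomicU64::new(0);

/// Reads a size threshold from an environment variable, defaulting to 2^14.
pub fn threshold_from_env(var: &str) -> usize {
    std::env::var(var)
        .ok()
        .and_then(|s| s.parse().ok())
        .unwrap_or(1 << 14)
}

/// Simulated device call: succeeds only when the handle is healthy.
fn gpu_sum(handle: &GpuHandle, data: &[u64]) -> Result<u64, String> {
    if handle.healthy {
        Ok(data.iter().fold(0u64, |acc, &x| acc.wrapping_add(x)))
    } else {
        Err("simulated CUDA error".to_string())
    }
}

/// Existing CPU path (a plain loop here; rayon in the real prover).
fn cpu_sum(data: &[u64]) -> u64 {
    data.iter().fold(0u64, |acc, &x| acc.wrapping_add(x))
}

/// Returns None whenever any of the four fall-through conditions holds.
pub fn try_sum_gpu<F: 'static>(
    handle: Option<&GpuHandle>,
    data: &[u64],
    threshold: usize,
) -> Option<u64> {
    if TypeId::of::<F>() != TypeId::of::<Goldilocks>() {
        return None; // unsupported field type
    }
    if data.len() < threshold {
        return None; // too small to be worth a launch
    }
    let handle = handle?; // no device handle for this table
    let out = gpu_sum(handle, data).ok()?; // device call returned Err
    GPU_CALLS.fetch_add(1, Ordering::Relaxed);
    Some(out)
}

/// Dispatch site: GPU when possible, otherwise the unchanged CPU path.
pub fn sum<F: 'static>(handle: Option<&GpuHandle>, data: &[u64], threshold: usize) -> u64 {
    try_sum_gpu::<F>(handle, data, threshold).unwrap_or_else(|| cpu_sum(data))
}
```

Returning Option keeps the CPU code as the single source of truth: a dispatch site never needs to know why the GPU path declined, only that it did.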

R4 DEEP, R4 FRI and the _keep variant of R2 (which retains a gpu_composition_parts handle for R4 DEEP) are deferred to PR-4.

What's in

  • crypto/math-cuda/kernels/barycentric.cu (new, ~190 LoC). Four kernels: barycentric_{base,ext3}_batched and barycentric_{base,ext3}_batched_strided. The
    strided variants read an LDE buffer at a row stride (used by R3 to pick the trace-size coset out of the device-resident LDE without materialising a slab).
  • crypto/math-cuda/src/barycentric.rs (new, ~215 LoC). Four host wrappers: barycentric_{base,ext3} for host data and barycentric_{base,ext3}_on_device
    for &GpuLdeBase / &GpuLdeExt3 handles.
  • crypto/math-cuda/src/device.rs (+14). BARY_PTX const, four CudaFunction fields for the new kernels.
  • crypto/math-cuda/build.rs (+1). compile_ptx("barycentric.cu", ...).
  • crypto/math-cuda/src/lib.rs (+1). pub mod barycentric.
  • crypto/math-cuda/tests/ (new, ~530 LoC across 3 files). barycentric.rs and barycentric_strided.rs cover the four kernels against a CPU reference
    summing the unscaled barycentric over base / ext3 columns with optional stride. comp_poly_tree.rs exercises the fused
    evaluate_poly_coset_batch_ext3_into_with_merkle_tree end-to-end against the CPU commit_composition_polynomial for sizes from (log_n=2, blowup=2) up to
    (log_n=14, blowup=2).
  • crypto/stark/src/gpu_lde.rs (+~390 LoC). New dispatches:
    • try_evaluate_parts_on_lde_gpu (R2, non-_keep ext3 LDE for the parts > 2 branch).
    • try_build_comp_poly_tree_gpu (R2 row-pair Keccak leaves + inner tree from host evals).
    • try_barycentric_base_on_handle + try_barycentric_ext3_on_handle (R3 OOD reading PR-2's device handles).
    • Host helpers ood_ext3_scalar and apply_ext3_scalar for the per-column scalar application.
    • Two new atomic counters (gpu_parts_lde_calls, gpu_bary_calls) and a separate LAMBDA_VM_GPU_BARY_THRESHOLD env override (default 2^14). Both new
      counters reset via reset_all_gpu_call_counters().
  • crypto/stark/src/prover.rs (~70 LoC of changes). R2 dispatch in round_2_compute_composition_polynomial: pre-compute composition_poly_parts once, then
    GPU-or-CPU for the LDE step, then GPU-or-CPU for the comp-poly Merkle commit. Round2 struct is unchanged.
  • crypto/stark/src/trace.rs (~80 LoC of changes). R3 dispatch in get_trace_evaluations_from_lde: per eval-point, call try_barycentric_base_on_handle for
    main and try_barycentric_ext3_on_handle for aux; on None, run the existing rayon CPU loop. inv_denoms stays on CPU (documented stream-contention regression).
    Added a + 'static bound on the function's type parameters to support TypeId dispatch in the new GPU branches.
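
For readers unfamiliar with the strided barycentric read, here is a host-side sketch over the Goldilocks base field. It assumes that "unscaled" means the sum Σ f(x_i)·x_i/(z − x_i) before the common factor (z^n − h^n)/(n·h^n) is applied, that the LDE is stored in natural order (the real buffers may be bit-reversed), and that z lies outside the coset. All function names are illustrative and are not the PR's kernels or wrappers.

```rust
/// Goldilocks prime 2^64 - 2^32 + 1. Inputs are assumed canonical (< P).
const P: u128 = 0xFFFF_FFFF_0000_0001;

fn add(a: u64, b: u64) -> u64 { ((a as u128 + b as u128) % P) as u64 }
fn sub(a: u64, b: u64) -> u64 { ((a as u128 + P - b as u128) % P) as u64 }
fn mul(a: u64, b: u64) -> u64 { ((a as u128 * b as u128) % P) as u64 }

fn pow(mut base: u64, mut exp: u64) -> u64 {
    let mut acc = 1u64;
    while exp > 0 {
        if exp & 1 == 1 { acc = mul(acc, base); }
        base = mul(base, base);
        exp >>= 1;
    }
    acc
}

/// Inverse via Fermat's little theorem (returns 0 for input 0).
fn inv(a: u64) -> u64 { pow(a, (P - 2) as u64) }

/// Finds a primitive 2^log_n-th root of unity by search (log_n <= 32).
fn root_of_unity(log_n: u32) -> u64 {
    let n = 1u64 << log_n;
    let exp = ((P - 1) as u64) / n;
    (2u64..)
        .map(|g| pow(g, exp))
        .find(|&w| log_n == 0 || pow(w, n / 2) != 1)
        .unwrap()
}

/// Direct evaluation of a coefficient-form polynomial, used as the reference.
fn horner(coeffs: &[u64], x: u64) -> u64 {
    coeffs.iter().rev().fold(0, |acc, &c| add(mul(acc, x), c))
}

/// Unscaled sum over the trace-size coset x_i = h * w^i, reading the
/// column at a row stride so no trace-size slab is copied out of the LDE.
fn unscaled_barycentric(col: &[u64], stride: usize, n: usize, h: u64, w: u64, z: u64) -> u64 {
    let mut x = h;
    let mut acc = 0u64;
    for i in 0..n {
        let weight = mul(x, inv(sub(z, x)));
        acc = add(acc, mul(col[i * stride], weight));
        x = mul(x, w);
    }
    acc
}

/// Applies the common factor (z^n - h^n) / (n * h^n) to the unscaled sum.
fn barycentric_eval(col: &[u64], stride: usize, n: usize, h: u64, w: u64, z: u64) -> u64 {
    let hn = pow(h, n as u64);
    let scale = mul(sub(pow(z, n as u64), hn), inv(mul(n as u64, hn)));
    mul(scale, unscaled_barycentric(col, stride, n, h, w, z))
}
```

With blowup 2, every second row of the LDE over the coset h·ω_N^j is exactly the trace-size coset h·ω_n^i, which is why a stride equal to the blowup factor suffices in this layout.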

Known limitations carried over from PR-2

  • Parallel cuda tests still deadlock under default rayon (pinned-staging mutex contention). Workaround: --test-threads=1. Math-cuda-side fix, out of scope here.
  • LAMBDA_VM_GPU_LDE_THRESHOLD=0 forces small-domain tables through math-cuda kernels that panic at log_n < 1. Pre-existing regression, present on PR-2 baseline, not introduced by this PR.
  • Peak VRAM still scales with num_AIRs * per_table_LDE because R1 handles are retained across all rounds. PR-3 does not add new device-resident handles, so the ceiling is unchanged from PR-2. A follow-up PR will introduce a VRAM budget that gracefully falls back to non-_keep when retention would OOM the GPU.

Continuation of

Builds on PR-2 (#582). Base branch is feat/cuda-pr2-r1-gpu-commits. PR-4 (R4 DEEP + FRI + batch invert + R2 _keep) stacks on top.

ColoCarletti and others added 30 commits May 6, 2026 15:12
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
Collaborator

@gabrielbosio gabrielbosio left a comment

It would be nice to move the following test helpers to a new file and make the math-cuda tests use them to avoid code duplication:

  • type Fp / type Fp3 aliases
  • one rand_fp / rand_fp3 (the random generators, currently random_fp/rand_fp/rand_ext3)
  • ext3_to_u64s / u64s_to_ext3 (the interleaved packing)
  • the canonicalization family (canon, canon_fp3/canon3/canon_triplet, canon_triplet_raw)
  • reverse_index

This can be addressed in another PR though.
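
A shared helper module along these lines might look like the sketch below. It is a guess at the shape only: the real tests use the project's field types, whereas this version represents Fp as a raw u64 and Fp3 as three u64 limbs, and the exact signatures of canon, ext3_to_u64s, u64s_to_ext3 and reverse_index in the existing tests are not shown in this PR, so the ones here are assumptions.

```rust
/// Goldilocks prime 2^64 - 2^32 + 1.
pub const P: u64 = 0xFFFF_FFFF_0000_0001;

/// Assumed representation: an ext3 element as three base-field limbs.
pub type Fp3 = [u64; 3];

/// Reduces a raw limb that may sit in [P, 2^64) to its canonical value.
pub fn canon(x: u64) -> u64 {
    if x >= P { x - P } else { x }
}

/// Canonicalizes each limb of an ext3 element.
pub fn canon_fp3(v: Fp3) -> Fp3 {
    v.map(canon)
}

/// Interleaved packing: [a0, a1, a2, b0, b1, b2, ...].
pub fn ext3_to_u64s(values: &[Fp3]) -> Vec<u64> {
    values.iter().flatten().copied().collect()
}

/// Inverse of `ext3_to_u64s`; trailing limbs that do not fill a triple are dropped.
pub fn u64s_to_ext3(raw: &[u64]) -> Vec<Fp3> {
    raw.chunks_exact(3).map(|c| [c[0], c[1], c[2]]).collect()
}

/// Bit-reversed position of `i` in a domain of size `n` (n a power of two).
pub fn reverse_index(i: usize, n: usize) -> usize {
    let bits = n.trailing_zeros();
    if bits == 0 {
        0
    } else {
        i.reverse_bits() >> (usize::BITS - bits)
    }
}
```

Keeping one canonicalization function and one packing pair in a single file would let the three math-cuda test files compare GPU output against the CPU reference through the same code path.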

ColoCarletti and others added 4 commits June 1, 2026 17:24
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
Collaborator

@gabrielbosio gabrielbosio left a comment

It would be good to add a make command that runs clippy with the cuda feature:

clippy-cuda:
    cargo clippy -p stark --features cuda --all-targets -- -D warnings -A clippy::op_ref

12 comment threads on crypto/stark/src/gpu_lde.rs (all outdated)
gabrielbosio previously approved these changes Jun 2, 2026
@gabrielbosio gabrielbosio dismissed their stale review June 2, 2026 15:22

Waiting to resolve previous comments

@gabrielbosio gabrielbosio added the gpu Related to GPU/CUDA development label Jun 3, 2026
@ColoCarletti ColoCarletti added this pull request to the merge queue Jun 3, 2026
Merged via the queue into main with commit 3ede251 Jun 3, 2026
12 checks passed
@ColoCarletti ColoCarletti deleted the feat/cuda-pr3 branch June 3, 2026 18:26
Labels

gpu Related to GPU/CUDA development

3 participants