gpu: add device-backed WITGEN commit path #1339

Merged: kunxian-xia merged 86 commits into master from feat/witgen_gpu on May 19, 2026

@hero78119 hero78119 commented May 18, 2026

Problem

GPU_WITGEN,CACHE=1 can produce witness traces on GPU, but the prover path still needs a clean device-resident commit flow. The goal is to keep GPU-generated witness data usable through commit without falling back to replay/deferred raw-cache logic or unnecessary host materialization.

Design Rationale

This PR treats GPU witness output as the source of truth for the commit path: traces are normalized into device-backed row-major metadata, committed through the GPU PCS path, and released once q'/commit no longer needs the raw backing. The post-commit proving flow stays aligned with the existing CPU_WITGEN path so correctness-sensitive transcript, opening, and proof assembly logic remain shared.

The design avoids retaining replay plans as a second witness source. This keeps ownership simpler: GPU witness generation owns raw device buffers until q'/commit construction, then releases them before chip proving pressure grows.
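The ownership rule described above can be sketched with Rust move semantics. This is a minimal illustration under stated assumptions, not the PR's implementation: `DeviceBuffer`, `GpuWitness`, `Committed`, and the toy fold standing in for a commitment are all hypothetical names that do not come from ceno_zkvm.

```rust
// Hypothetical types for illustration only; none of these names come from ceno_zkvm.

/// Stand-in for a raw device allocation holding one witness trace.
struct DeviceBuffer {
    bytes: usize,
}

/// Witness output owned by the witness generation stage.
struct GpuWitness {
    raw: Vec<DeviceBuffer>,
}

/// What survives the commit step: only the commitment, no raw backing.
struct Committed {
    commitment: u64,
    committed_bytes: usize,
}

impl GpuWitness {
    fn new(sizes: &[usize]) -> Self {
        GpuWitness {
            raw: sizes.iter().map(|&bytes| DeviceBuffer { bytes }).collect(),
        }
    }

    /// Consumes `self`, so the raw buffers cannot be reused or replayed afterwards.
    /// `self.raw` is dropped when this function returns, which models releasing
    /// the device memory before later proving stages allocate their own.
    fn commit(self) -> Committed {
        let committed_bytes: usize = self.raw.iter().map(|b| b.bytes).sum();
        // Placeholder "commitment": a real PCS commit would run on the device here.
        let commitment = self
            .raw
            .iter()
            .fold(0u64, |acc, b| acc.wrapping_mul(31).wrapping_add(b.bytes as u64));
        Committed {
            commitment,
            committed_bytes,
        }
    }
}
```

Because `commit` takes `self` by value, any later attempt to read the raw witness is a compile error, which is one way to make a "single source of truth, released after commit" rule hard to violate by accident.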

Change Highlights

  • ceno_zkvm: add GPU witness/device-backed trace commit path for GPU_WITGEN,CACHE=1.
  • ceno_zkvm: keep post-commit proving and opening flow shared with the existing GPU prover path.
  • ceno_zkvm: release shard GPU witness caches after proof construction.
  • gkr_iop: support GPU-side batched main-constraint proving integration.

CI Benchmark Summary

Compared CI benchmark runs:

  • GPU_WITGEN: original PR benchmark numbers, kept for context.
  • CPU_WITGEN: 26067686212, branch feat/witgen_gpu, CENO_GPU_ENABLE_WITGEN=0
  • CPU_WITGEN (baseline): 26037135648, branch feat/update_dep, CENO_GPU_ENABLE_WITGEN=0

| Metric | GPU_WITGEN | CPU_WITGEN | CPU_WITGEN (baseline) | Notes |
| --- | --- | --- | --- | --- |
| reth-block E2E | 111s | 80.2s | 83.2s | CPU_WITGEN feature branch is fastest |
| app.prove | 107s | 65.6s | 68.2s | CPU_WITGEN feature branch improves 2.6s vs baseline |
| app_prove.inner | 96.6s | 65.6s | 68.2s | Same trend as app.prove |
| Witness total | 35.43s | 40.85s | 39.84s | GPU_WITGEN remains faster raw witness gen |
| Proof total | 60.70s | 62.15s | 64.78s | CPU_WITGEN feature branch improves proof total vs baseline |
| commit_traces total | 12.35s | 17.410s | 17.450s | GPU_WITGEN commit path remains faster |
| commit_traces avg/shard | 950ms | n/a | n/a | Original GPU_WITGEN per-shard metric kept |
| prove_tower_relation_gpu total | n/a | 119.624s | 22.515s | Nested/overlapped span increased in feature run |
| prove_batched_main_constraints total | n/a | 7.934s | 7.639s | Slight CPU_WITGEN regression |
| pcs_opening total | 9.91s | 9.857s | 10.061s | Stable |
| q commit total | 8.25s device_q | n/a | n/a | Original GPU_WITGEN q metric kept |
| q commit avg/shard | 634ms | n/a | n/a | Original GPU_WITGEN q metric kept |
| q inner commit avg | 449ms | n/a | n/a | Original GPU_WITGEN q metric kept |
| CPU/GPU overlap gap | n/a | 3.170s | 3.200s | CPU_WITGEN overlap unchanged |
| Overall result | 111s | 80.2s | 83.2s | CPU_WITGEN feature branch beats baseline; GPU_WITGEN still loses overall due to lost overlap |

| Conclusion | Evidence |
| --- | --- |
| GPU_WITGEN still improves commit/witness subpaths | Original GPU_WITGEN has faster witness total and commit_traces than CPU_WITGEN |
| GPU_WITGEN still loses overall | 111s E2E vs 80.2s CPU_WITGEN due to lost shard witness/proof overlap |
| CPU_WITGEN feature branch is slightly faster than CPU_WITGEN baseline | reth-block improves by 3.0s; app.prove improves by 2.6s |
| Commit/opening path is stable for CPU_WITGEN | commit_traces and pcs_opening are within ~0.2s across CPU runs |
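The deltas and ratios quoted for these runs can be re-derived from the table values. The helpers below are a small sketch for checking that arithmetic; the function names are illustrative and not part of any benchmark tooling.

```rust
/// Speedup of `new_secs` over `old_secs` as a ratio (> 1.0 means the new run is faster).
fn speedup(old_secs: f64, new_secs: f64) -> f64 {
    old_secs / new_secs
}

/// Absolute wall time saved, in seconds.
fn saved(old_secs: f64, new_secs: f64) -> f64 {
    old_secs - new_secs
}
```

For example, `speedup(83.2, 80.2)` on the reth-block E2E row (CPU_WITGEN baseline vs. feature branch) gives roughly 1.037, and `saved(68.2, 65.6)` on the app.prove row gives roughly 2.6 seconds.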

Benchmark / Performance Impact

This is performance-sensitive. CI benchmark runs are used for comparable end-to-end numbers because local wall time depends heavily on runner scheduling and GPU availability.

Operation

| Operation | master (s) | this PR (s) | Improve (master -> this PR) |
| --- | --- | --- | --- |
| Reth proving benchmark | See benchmark CI | See benchmark CI | See benchmark CI |

Layer

| Layer | master (s) | this PR (s) | Improve (master -> this PR) |
| --- | --- | --- | --- |
| Witness commit/q' path | Host/materialized path | Device-backed GPU path | Reduces host materialization and extra copies |
| Post-commit proving | Existing GPU flow | Existing GPU flow | Intended to remain unchanged |

Benchmark command(s):

# ceno-reth-benchmark CI, GPU_WITGEN,CACHE=1 and CPU_WITGEN,CACHE=1 comparison runs

Environment (CPU/GPU, core count, rust toolchain, commit hash):

CI benchmark runner metadata and commit hashes are recorded in the linked workflow runs.

raw data:

  • master: benchmark CI artifacts
  • this PR: benchmark CI artifacts

Testing

cargo fmt --check
cargo check -p ceno_zkvm --features 'gpu,u16limb_circuit' --config 'patch."https://github.com/scroll-tech/ceno-gpu-mock.git".cuda_hal.path="../ceno-gpu/cuda_hal"'

Risks and Rollout

  • Main risk is lifetime/ownership mistakes around device-backed witness buffers; the rollout keeps release points explicit and avoids replay cache ownership.
  • If regressions appear, disable CENO_GPU_ENABLE_WITGEN to return to the existing CPU_WITGEN GPU proving path.
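As a rough sketch of how such a rollback switch can be read at startup: only the variable name `CENO_GPU_ENABLE_WITGEN` and the value `0` appear in this PR. The accepted values, the default, and both function names below are assumptions for illustration, and the actual parsing in ceno_zkvm may differ.

```rust
use std::env;

/// Hypothetical parser: treats "0", "false", and "off" (case-insensitive) as
/// disabled and any other value as enabled. An unset variable yields `default`.
fn witgen_enabled_from(value: Option<&str>, default: bool) -> bool {
    match value {
        None => default,
        Some(v) => !matches!(
            v.trim().to_ascii_lowercase().as_str(),
            "0" | "false" | "off"
        ),
    }
}

/// Reads the variable named in the PR from the process environment.
fn gpu_witgen_enabled(default: bool) -> bool {
    let value = env::var("CENO_GPU_ENABLE_WITGEN").ok();
    witgen_enabled_from(value.as_deref(), default)
}
```

Splitting the pure parser from the environment read keeps the decision logic testable without mutating process state.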

Follow-ups (optional)

  • Continue profiling per-chip GPU witness generation and q' construction.
  • Add scheduler-level overlap once device memory booking is precise enough.

Copilot Reviewer Directive (keep this section)

When Copilot reviews this PR, apply .github/copilot-instructions.md strictly.

@hero78119 hero78119 marked this pull request as draft May 18, 2026 07:20
Base automatically changed from feat/jagged_pcs to feat/batch_main_sumcheck May 18, 2026 08:57
@hero78119 hero78119 changed the title [WIP] experiment witgen + jagged PCS on gpu as base setting [WIP] experiment gpu witgen giga mle May 18, 2026
Base automatically changed from feat/batch_main_sumcheck to master May 18, 2026 11:43
@hero78119 hero78119 changed the title [WIP] experiment gpu witgen giga mle gpu: add device-backed WITGEN commit path May 18, 2026
@hero78119 hero78119 marked this pull request as ready for review May 18, 2026 12:36

#[cfg(feature = "gpu")]
{
if false

This is a debug leftover. Fixing it brings back concurrent proving and improves e2e performance by 1.037x over the baseline.

@hero78119 hero78119 enabled auto-merge May 19, 2026 03:42
@hero78119 hero78119 disabled auto-merge May 19, 2026 03:45
@kunxian-xia kunxian-xia added this pull request to the merge queue May 19, 2026
Merged via the queue into master with commit 29d826e May 19, 2026
5 checks passed
@kunxian-xia kunxian-xia deleted the feat/witgen_gpu branch May 19, 2026 14:53