gpu: add device-backed WITGEN commit path by hero78119 · Pull Request #1339 · scroll-tech/ceno

hero78119 · 2026-05-18T07:20:06Z

Problem

GPU_WITGEN,CACHE=1 can produce witness traces on GPU, but the prover path still needs a clean device-resident commit flow. The goal is to keep GPU-generated witness data usable through commit without falling back to replay/deferred raw-cache logic or unnecessary host materialization.

Design Rationale

This PR treats GPU witness output as the source of truth for the commit path: traces are normalized into device-backed row-major metadata, committed through the GPU PCS path, and released once q'/commit no longer needs the raw backing. The post-commit proving flow stays aligned with the existing CPU_WITGEN path so correctness-sensitive transcript, opening, and proof assembly logic remain shared.

The design avoids retaining replay plans as a second witness source. This keeps ownership simpler: GPU witness generation owns raw device buffers until q'/commit construction, then releases them before chip proving pressure grows.
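The ownership rule above can be sketched in Rust with move semantics. This is a minimal illustration only, not the code in this PR: `RawDeviceBacking`, `Commitment`, `commit`, and the `live_bytes` counter are all hypothetical names, and the "device" is mocked by a shared counter so that the release point is observable.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical stand-in for raw device witness memory.
/// `live_bytes` mocks a device allocator counter so release is observable.
pub struct RawDeviceBacking {
    bytes: usize,
    live_bytes: Rc<Cell<usize>>,
}

impl RawDeviceBacking {
    /// "Allocates" the backing by bumping the mock device counter.
    pub fn alloc(bytes: usize, live_bytes: Rc<Cell<usize>>) -> Self {
        live_bytes.set(live_bytes.get() + bytes);
        Self { bytes, live_bytes }
    }
}

impl Drop for RawDeviceBacking {
    fn drop(&mut self) {
        // Dropping the backing returns its bytes to the mock device pool.
        self.live_bytes.set(self.live_bytes.get() - self.bytes);
    }
}

/// Hypothetical commitment handle that keeps no reference to the raw backing.
pub struct Commitment {
    pub num_bytes_committed: usize,
}

/// Takes the backing by value: witness generation gives up ownership here,
/// and the raw buffer is released before any later proving step runs.
pub fn commit(backing: RawDeviceBacking) -> Commitment {
    let commitment = Commitment {
        num_bytes_committed: backing.bytes,
    };
    drop(backing); // explicit release point
    commitment
}
```

Because `commit` consumes its argument, the compiler rejects any later use of the raw backing, which is one way to keep a second witness source from lingering after commit.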

Change Highlights

ceno_zkvm: add GPU witness/device-backed trace commit path for GPU_WITGEN,CACHE=1.
ceno_zkvm: keep post-commit proving and opening flow shared with the existing GPU prover path.
ceno_zkvm: release shard GPU witness caches after proof construction.
gkr_iop: support GPU-side batched main-constraint proving integration.

CI Benchmark Summary

Compared CI benchmark runs:

GPU_WITGEN: original PR benchmark numbers, kept for context.
CPU_WITGEN: 26067686212, branch feat/witgen_gpu, CENO_GPU_ENABLE_WITGEN=0
CPU_WITGEN (baseline): 26037135648, branch feat/update_dep, CENO_GPU_ENABLE_WITGEN=0

Metric	GPU_WITGEN	CPU_WITGEN	CPU_WITGEN (baseline)	Notes
reth-block E2E	111s	80.2s	83.2s	CPU_WITGEN feature branch is fastest
app.prove	107s	65.6s	68.2s	CPU_WITGEN feature branch improves 2.6s vs baseline
app_prove.inner	96.6s	65.6s	68.2s	Same trend as app.prove
Witness total	35.43s	40.85s	39.84s	GPU_WITGEN remains faster raw witness gen
Proof total	60.70s	62.15s	64.78s	CPU_WITGEN feature branch improves proof total vs baseline
commit_traces total	12.35s	17.410s	17.450s	GPU_WITGEN commit path remains faster
commit_traces avg/shard	950ms	n/a	n/a	Original GPU_WITGEN per-shard metric kept
prove_tower_relation_gpu total	n/a	119.624s	22.515s	Nested/overlapped span increased in feature run
prove_batched_main_constraints total	n/a	7.934s	7.639s	Slight CPU_WITGEN regression
pcs_opening total	9.91s	9.857s	10.061s	Stable
q commit total	8.25s device_q	n/a	n/a	Original GPU_WITGEN q metric kept
q commit avg/shard	634ms	n/a	n/a	Original GPU_WITGEN q metric kept
q inner commit avg	449ms	n/a	n/a	Original GPU_WITGEN q metric kept
CPU/GPU overlap gap	n/a	3.170s	3.200s	CPU_WITGEN overlap unchanged
Overall result	111s	80.2s	83.2s	CPU_WITGEN feature branch beats baseline; GPU_WITGEN still loses overall due to lost overlap

Conclusion	Evidence
GPU_WITGEN still improves commit/witness subpaths	Original GPU_WITGEN has faster witness total and commit_traces than CPU_WITGEN
GPU_WITGEN still loses overall	111s E2E vs 80.2s CPU_WITGEN due to lost shard witness/proof overlap
CPU_WITGEN feature branch is slightly faster than CPU_WITGEN baseline	reth-block improves by 3.0s; app.prove improves by 2.6s
Commit/opening path is stable for CPU_WITGEN	commit_traces and pcs_opening are within ~0.2s across CPU runs
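The deltas quoted in the tables can be re-derived from the reported wall times. The helpers below are a small sketch for doing that arithmetic; the function names are hypothetical and the inputs are the numbers from the table above.

```rust
/// Absolute improvement in seconds; positive means the feature run is faster.
pub fn improvement_secs(baseline: f64, feature: f64) -> f64 {
    baseline - feature
}

/// Ratio of two wall times; values above 1.0 mean `slow` took longer than `fast`.
pub fn slowdown_ratio(slow: f64, fast: f64) -> f64 {
    slow / fast
}
```

With the table values, `improvement_secs(83.2, 80.2)` gives the 3.0s reth-block E2E gain, `improvement_secs(68.2, 65.6)` gives the 2.6s app.prove gain, and `slowdown_ratio(111.0, 80.2)` is about 1.38, i.e. the GPU_WITGEN run takes roughly 38% longer end to end than the CPU_WITGEN feature run.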

Benchmark / Performance Impact

This is performance-sensitive. CI benchmark runs are used for comparable end-to-end numbers because local wall time depends heavily on runner scheduling and GPU availability.

Operation

Operation	master (s)	this PR (s)	Improve (master -> this PR)
Reth proving benchmark	See benchmark CI	See benchmark CI	See benchmark CI

Layer

Layer	master (s)	this PR (s)	Improve (master -> this PR)
Witness commit/q' path	Host/materialized path	Device-backed GPU path	Reduces host materialization and extra copies
Post-commit proving	Existing GPU flow	Existing GPU flow	Intended to remain unchanged

Benchmark command(s):

# ceno-reth-benchmark CI, GPU_WITGEN,CACHE=1 and CPU_WITGEN,CACHE=1 comparison runs

Environment (CPU/GPU, core count, rust toolchain, commit hash):

CI benchmark runner metadata and commit hashes are recorded in the linked workflow runs.

raw data:

master: benchmark CI artifacts
this PR: benchmark CI artifacts

Testing

cargo fmt --check
cargo check -p ceno_zkvm --features 'gpu,u16limb_circuit' --config 'patch."https://github.com/scroll-tech/ceno-gpu-mock.git".cuda_hal.path="../ceno-gpu/cuda_hal"'

Risks and Rollout

Main risk is lifetime/ownership mistakes around device-backed witness buffers; the rollout keeps release points explicit and avoids replay cache ownership.
If regressions appear, disable CENO_GPU_ENABLE_WITGEN to return to the existing CPU_WITGEN GPU proving path.
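The rollback switch could be read along these lines. This is a sketch under stated assumptions, not the parsing used in ceno: only the variable name `CENO_GPU_ENABLE_WITGEN` and the value `0` for the CPU_WITGEN runs come from this PR. The helper names, and the choice to treat an unset or empty variable as disabled, are assumptions made for illustration.

```rust
/// Interprets the raw value of the toggle.
/// Assumption: unset, empty, or "0" means disabled; anything else means enabled.
pub fn witgen_enabled_from(raw: Option<&str>) -> bool {
    match raw.map(str::trim) {
        None | Some("") | Some("0") => false,
        Some(_) => true,
    }
}

/// Reads the toggle from the process environment (hypothetical helper).
pub fn gpu_witgen_enabled() -> bool {
    witgen_enabled_from(std::env::var("CENO_GPU_ENABLE_WITGEN").ok().as_deref())
}
```

Keeping the parsing in a pure function makes the toggle testable without mutating the process environment.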

Follow-ups (optional)

Continue profiling per-chip GPU witness generation and q' construction.
Add scheduler-level overlap once device memory booking is precise enough.

Copilot Reviewer Directive (keep this section)

When Copilot reviews this PR, apply .github/copilot-instructions.md strictly.

hero78119 · 2026-05-19T03:34:36Z


        #[cfg(feature = "gpu")]
        {
-            if false


This is a leftover from debugging. Removing it restores concurrent proving and improves e2e performance by 1.037x over baseline.

hero78119 added 30 commits April 25, 2026 23:18

refactor GPU compact tower witness flow

ac49ac6

Fix compact tower memory accounting

84a2631

Optimize compact logup ones allocation

12453f6

update dep

7d60f01

Merge branch 'master' into feat/prover_mle_zero_padding

925de92

fix main mem estimation

e9fbe9c

Merge branch 'master' of github.com:scroll-tech/ceno into feat/prover_mle_zero_padding

46e87bb

Merge branch 'feat/prover_mle_zero_padding' of github.com:scroll-tech/ceno into feat/prover_mle_zero_padding

b888fbb

fix mem estimator

5ecce04

snapshot compact tower estimator state

be14006

rollback Cargo.toml, Cargo.lock change

df88dec

fix memory estimation

b57b692

verifier log

c50b793

Pass tower input by value for GPU proving

89b8698

split tower layer by view

f210e1f

Use dense tower build for compact GPU input

99b7a94

Pass logup shape to tower prove estimator

f0d81b6

Deduplicate borrowed tower input booking

917810c

fix logging

4fc8dae

Check scheduler memory estimate in mem tracking

ef9fa30

Refine replay tower proof memory estimate

011a898

clippy fix

f3ca1cf

add missing synchronization, avoid race condition

147f567

Account ShardRam tower prove allocator overhead

94fc7bf

misc: clippy fix

c9401d1

Fix GPU proof memory estimation

d14e66a

Fix GPU proof estimate row basis

ceced51

Tune ShardRam tower proof estimate

d1ab71a

Batch main constraints into single sumcheck

7c6e97c

Restore replay backing before batched main

505e258

hero78119 added 14 commits May 16, 2026 11:41

bump ci timeout to 40 min

a943b7c

misc: cleanup and refactor

0a08bca

feat: simplify gpu witgen qprime flow

8e6a697

feat: release shard gpu caches after backing

b9820c6

chore: point deps to witgen gpu branch

77c5e71

fix: keep cpu prover imports available

ae133eb

update dep

cbfc23a

fix: align witgen jagged commit retention

02d6400

fix concurrent

0d11ae7

fix stale logic of gpu witgen

9ce9023

fix wrong ecc estimation on all chip

1f697a9

more debug log

c5057d6

assert cache release

f2aaf52

restore multi stream binding

0210113

hero78119 marked this pull request as draft May 18, 2026 07:20

Base automatically changed from feat/jagged_pcs to feat/batch_main_sumcheck May 18, 2026 08:57

hero78119 changed the title ~~[WIP] experiment witgen + jagged PCS on gpu as base setting~~ [WIP] experiment gpu witgen giga mle May 18, 2026

Base automatically changed from feat/batch_main_sumcheck to master May 18, 2026 11:43

merge with master

e8d1dba

hero78119 changed the title ~~[WIP] experiment gpu witgen giga mle~~ gpu: add device-backed WITGEN commit path May 18, 2026

hero78119 marked this pull request as ready for review May 18, 2026 12:36

hero78119 commented May 19, 2026

hero78119 enabled auto-merge May 19, 2026 03:42

hero78119 disabled auto-merge May 19, 2026 03:45

update dependency

65938ed

kunxian-xia approved these changes May 19, 2026

kunxian-xia added this pull request to the merge queue May 19, 2026

Merged via the queue into master with commit 29d826e May 19, 2026
5 checks passed

kunxian-xia deleted the feat/witgen_gpu branch May 19, 2026 14:53

kunxian-xia merged 86 commits into master from feat/witgen_gpu
