Add frontload CPU sumcheck prover#49
Conversation
Agent-Logs-Url: https://github.com/scroll-tech/gkr-backend/sessions/d9fe64b2-2a93-463a-8ddd-99885bb15b90 Co-authored-by: kunxian-xia <1082586+kunxian-xia@users.noreply.github.com>
# Conflicts: # crates/multilinear_extensions/src/mle.rs
|
@copilot, after this pr, the behaviors for
are also modified, right? please don't make any changes to the current pr on my behalf. |
Yes, both are modified: 1. Expected sum for batched sumcheck ( 2. Final expected evaluation for batched sumcheck ( The frontload |
Problem
Ceno needs a CPU sumcheck path that interprets mixed-length MLEs in frontload layout: short MLEs bind to early variables and missing later variables contribute
product x_itail factors. The existing suffix/backend-loading prover remains available for comparison and fallback routing.Design Rationale
Frontload MLE Embedding
For an MLE
fwithkvariables embedded into anN-variable sumcheck domain, frontload bindsfto the early variables and represents every missing later variable as a multiplicative tail:This preserves the Boolean-hypercube sum because the tail product contributes only on the all-one assignment:
At a verifier point
r, the same compact MLE is evaluated as:So short MLEs are not expanded to
2^N; they stay compact and receive frontload tail factors after their own variables have been fixed.Difference From Suffix Load
Suffix/backend loading aligns short MLEs with later variables. Missing earlier variables are summed out, so the asserted sum needs the usual
2^(N-k)scaling for random mixed-length monomials.Frontload does the opposite: short MLEs align with early variables, and missing later variables are encoded by the
product x_itail. That tail sums to1, so frontload random monomial sums do not apply suffix missing-variable scaling.Operationally:
2^(N-k)product x_i1Worker-Aware Frontload Split
A subtle but important point is how a medium-size MLE is embedded when the global domain is larger than the MLE and the prover is split across workers.
Assume four workers and a
2^10global domain:For a 7-variable MLE:
A naive split would first pad
eto all 10 variables and then split by the last worker bits:This is the wrong mental model for frontload. The original
ehas only 7 real variables, so treatingi0,i1as variables afterx0..x6would mostly route workers through padded tail space instead of through real MLE data.Frontload uses a worker-aware split instead. The worker bits occupy the last real variables of
e, and the missing local variables become tail factors:Therefore each worker
(i0,i1)receives:Equivalently, the variables are classified as:
This split distributes the non-zero real MLE workload across workers before applying frontload padding. Each worker gets a real
2^5slice ofe, and the remaining local variables are handled as cheap multiplicative tail factors.Toy Example:
a + b + cWith Four WorkersAssume four workers, so
log2(num_workers) = 2, and a global2^10sumcheck domain:The frontload metadata categories are:
Normalmeans the MLE is split across workers and then combined in phase 2.Phase1Onlymeans the MLE is duplicated to every worker and is not split by worker bits; missing worker bits are represented by frontload tail factors.For
a, each worker receives a2^8chunk. Phase 1 folds each chunk overx0..x7, producing four scalars. Phase 2 builds a two-variable MLE from those scalars and folds the worker bitsx8,x9.For
b, every worker receives the same compact2^1MLE. Frontload embeds it as:Round 0 folds the real
bvariable. Rounds 1..7 only apply local tail factors. Becausebhas no worker-bit data, the missing worker-bit tail requiresx8 = 1andx9 = 1, so only worker3 = 0b11contributes this term during phase 1. Phase 2 carriesbas a compact constant with the remaining worker-bit tail.For
c, the MLE is larger than the worker space, so it is split across workers. Each worker receives a2^3local chunk:After phase-1 round 2, each worker has one scalar
s_w = c_w(r0,r1,r2). There is no exchange within phase 1 to squeeze those four scalars into one. During rounds 3..7, each worker independently applies the missing local-variable tail:Phase 2 then builds a two-variable MLE from
[s'_0, s'_1, s'_2, s'_3]and folds the worker bitsx8,x9.Two-Phase Worker Rule
prove_2phasekeeps the existing worker split: local variables in phase 1 and worker-index variables in phase 2. If a term does not depend on a worker-index bit, frontload treats that bit as part of the tail product, so the worker contributes only when the missing worker bits are1. This avoids duplicated contributions across workers while keeping phase-1-only MLEs compact.Change Highlights
sumcheck::frontloadIOPProverState::provedefaults toSumcheckProverMode::Frontload;prove_suffixkeeps the suffix path explicit.PolyMeta::Phase2Onlyto frontloadPhase1Onlyand keeps short MLEs compact across phase boundaries.frontload_uniform_sumcheck_code_genfor dense uniform degree 2/3/4 terms.Benchmark / Performance Impact
Mixed product benchmarks, vars
22/16/2, sample size10. Positive percentage means frontload is faster than suffix.a1*a2 + b1*b2 + c1*c2Command:
Testing
cargo fmt --allcargo check -p multilinear_extensions -p sumcheck --all-targetscargo test -p sumcheck -- --nocapture10 passed; 0 failedRisks and Rollout
worker_matches_frontload_tailrequires all missing worker-index bits to be1.prove_suffix.Rollback is to route callers back to
prove_suffixor revert this PR.Follow-ups (optional)
Use these benchmark cases as regression coverage while integrating tower/main sumchecks in Ceno and GPU frontload kernels.
Copilot Reviewer Directive (keep this section)
When Copilot reviews this PR, apply
.github/copilot-instructions.mdstrictly.