Add frontload CPU sumcheck prover #49

Merged: hero78119 merged 15 commits into main from feat/frontload on May 7, 2026
@hero78119
Collaborator

@hero78119 hero78119 commented May 4, 2026

Problem

Ceno needs a CPU sumcheck path that interprets mixed-length MLEs in frontload layout: short MLEs bind to the early variables, and each missing later variable contributes an x_i factor to a multiplicative tail. The existing suffix/backend-loading prover remains available for comparison and fallback routing.

Design Rationale

Frontload MLE Embedding

For an MLE f with k variables embedded into an N-variable sumcheck domain, frontload binds f to the early variables and represents every missing later variable as a multiplicative tail:

F(x_0, ..., x_{N-1}) = f(x_0, ..., x_{k-1}) * product_{i=k}^{N-1} x_i

This preserves the Boolean-hypercube sum because the tail product is nonzero only when every tail variable equals 1:

sum_{x in {0,1}^N} F(x) = sum_{x_0..x_{k-1}} f(x_0..x_{k-1})

At a verifier point r, the same compact MLE is evaluated as:

F(r_0, ..., r_{N-1}) = f(r_0, ..., r_{k-1}) * product_{i=k}^{N-1} r_i

So short MLEs are not expanded to 2^N; they stay compact and receive frontload tail factors after their own variables have been fixed.
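The embedding above can be checked numerically with a minimal sketch. It assumes a toy prime modulus and a little-endian evaluation layout (index bit i corresponds to x_i). The helper names `eval_mle`, `frontload_eval`, and `frontload_hypercube_sum` are hypothetical and are not taken from the PR.

```rust
/// Toy prime modulus (2^31 - 1); chosen for illustration only, not Ceno's field.
const P: u64 = 2_147_483_647;

// All inputs are assumed to be already reduced below P.
fn add(a: u64, b: u64) -> u64 { (a + b) % P }
fn sub(a: u64, b: u64) -> u64 { (a + P - b) % P }
fn mul(a: u64, b: u64) -> u64 { (a * b) % P }

/// Evaluate a compact MLE from its hypercube evaluations.
/// Assumed layout: index bit i of `evals` corresponds to variable x_i.
fn eval_mle(evals: &[u64], point: &[u64]) -> u64 {
    assert_eq!(evals.len(), 1 << point.len());
    let mut cur = evals.to_vec();
    for &r in point {
        // Fix the lowest remaining variable: lo + r * (hi - lo).
        cur = cur
            .chunks(2)
            .map(|pair| add(pair[0], mul(r, sub(pair[1], pair[0]))))
            .collect();
    }
    cur[0]
}

/// F(r_0..r_{N-1}) = f(r_0..r_{k-1}) * product_{i=k}^{N-1} r_i, without expanding f to 2^N.
fn frontload_eval(evals: &[u64], point: &[u64]) -> u64 {
    let k = evals.len().trailing_zeros() as usize;
    let f_r = eval_mle(evals, &point[..k]);
    point[k..].iter().fold(f_r, |acc, &r| mul(acc, r))
}

/// Brute-force sum of F over {0,1}^n; only useful for small n.
fn frontload_hypercube_sum(evals: &[u64], n: usize) -> u64 {
    let mut total = 0u64;
    for idx in 0..(1u64 << n) {
        let point: Vec<u64> = (0..n).map(|i| (idx >> i) & 1).collect();
        total = add(total, frontload_eval(evals, &point));
    }
    total
}
```

For a 2-variable f with evaluations [3, 5, 7, 11] embedded into a 4-variable domain, the brute-force sum equals 3 + 5 + 7 + 11 = 26, matching the compact sum of f.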

Difference From Suffix Load

Suffix/backend loading aligns short MLEs with later variables. Missing earlier variables are summed out, so the asserted sum needs the usual 2^(N-k) scaling for random mixed-length monomials.

Frontload does the opposite: short MLEs align with early variables, and missing later variables are encoded by the tail product of x_i. Summed over the Boolean hypercube, that tail equals 1, so frontload random monomial sums need no suffix-style missing-variable scaling.

Operationally:

| Case | Suffix load | Frontload |
| --- | --- | --- |
| Short MLE variables | Later variables | Early variables |
| Missing variables | Summed out / scale by 2^(N-k) | Tail product: product x_i |
| Round after MLE is fixed | Scalar remains independent of current var | Scalar is multiplied by current/future tail factors |
| Lane filtering | Based on suffix-aligned small index | Future tail bits must be 1 |
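The difference in the asserted sum can be sketched for a single short MLE. This is an illustration under assumptions: a toy modulus, and hypothetical function names that do not correspond to the PR's monomial generators.

```rust
/// Toy prime modulus (2^31 - 1); for illustration only.
const P: u64 = 2_147_483_647;

/// Sum of the compact MLE's own evaluations, mod P.
fn compact_sum(evals: &[u64]) -> u64 {
    evals.iter().fold(0u64, |acc, &v| (acc + v % P) % P)
}

/// Suffix load: the N - k missing variables are summed out,
/// so the compact sum is scaled by 2^(N-k).
fn suffix_expected_sum(evals: &[u64], n: usize) -> u64 {
    let k = evals.len().trailing_zeros() as usize;
    assert!(evals.len().is_power_of_two() && k <= n && n - k < 31);
    (compact_sum(evals) * (1u64 << (n - k))) % P
}

/// Frontload: the tail product sums to 1 over the hypercube, so no scaling.
fn frontload_expected_sum(evals: &[u64], n: usize) -> u64 {
    let k = evals.len().trailing_zeros() as usize;
    assert!(evals.len().is_power_of_two() && k <= n);
    compact_sum(evals)
}
```

With evaluations [3, 5, 7, 11] and N = 4, the suffix interpretation asserts 26 * 4 = 104 while the frontload interpretation asserts 26.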

Worker-Aware Frontload Split

A subtle but important point is how a medium-size MLE is embedded when the global domain is larger than the MLE and the prover is split across workers.

Assume four workers and a 2^10 global domain:

global variables:  x0..x9
phase-1 variables: x0..x7
worker variables:  x8,x9
num_workers:       4 = 2^2

For a 7-variable MLE:

e(x0, x1, x2, x3, x4, x5, x6)

A naive split would first pad e to all 10 variables and then split by the last worker bits:

e_naive_{i0,i1}(x0..x7) = e(x0, x1, x2, x3, x4, x5, x6, i0, i1)

This is the wrong mental model for frontload. The original e has only 7 real variables, so treating i0,i1 as variables after x0..x6 would mostly route workers through padded tail space instead of through real MLE data.

Frontload uses a worker-aware split instead. The worker bits occupy the last real variables of e, and the missing local variables become tail factors:

pad_e(x0..x9) = e(x0, x1, x2, x3, x4, x8, x9) * x5 * x6 * x7

Therefore each worker (i0,i1) receives:

pad_e_{i0,i1}(x0..x7) = e(x0, x1, x2, x3, x4, i0, i1) * x5 * x6 * x7

Equivalently, the variables are classified as:

real local variables:  x0..x4
worker real variables: x8,x9
frontload tail:        x5,x6,x7

This split distributes the non-zero real MLE workload across workers before applying frontload padding. Each worker gets a real 2^5 slice of e, and the remaining local variables are handled as cheap multiplicative tail factors.
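A minimal sketch of this split, assuming a little-endian evaluation layout in which the last real variables of e are the top index bits, so each worker's slice is a contiguous chunk. The function names are hypothetical, and the actual worker ordering in the PR may differ.

```rust
/// Split a compact MLE across 2^log_workers workers.
/// Assumption: index bit i corresponds to x_i, so the last `log_workers`
/// real variables select a contiguous chunk per worker.
fn worker_slices(evals: &[u64], log_workers: usize) -> Vec<&[u64]> {
    let k = evals.len().trailing_zeros() as usize;
    assert!(evals.len().is_power_of_two() && k > log_workers);
    evals.chunks(evals.len() >> log_workers).collect()
}

/// Number of local (phase-1) variables that become frontload tail factors.
fn local_tail_vars(num_vars: usize, global_vars: usize, log_workers: usize) -> usize {
    let local_vars = global_vars - log_workers; // x0..x7 in the example
    let real_local = num_vars - log_workers;    // x0..x4 in the example
    local_vars - real_local                     // x5, x6, x7 in the example
}
```

For the 7-variable e with four workers in a 2^10 domain, this yields four slices of 2^5 evaluations each and three local tail variables.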

Toy Example: a + b + c With Four Workers

Assume four workers, so log2(num_workers) = 2, and a global 2^10 sumcheck domain:

global variables:  x0..x9
phase-1 variables: x0..x7
phase-2 variables: x8,x9
polynomial:        a + b + c
a size:            2^10
b size:            2^1
c size:            2^5

The frontload metadata categories are:

a: 10 variables > 2 worker bits => Normal
b:  1 variable  <= 2 worker bits => Phase1Only
c:  5 variables > 2 worker bits => Normal

Normal means the MLE is split across workers and then combined in phase 2. Phase1Only means the MLE is duplicated to every worker and is not split by worker bits; missing worker bits are represented by frontload tail factors.
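The classification rule from the toy example can be written as a small sketch. The enum `FrontloadMeta` and the function `classify` are hypothetical stand-ins for illustration; the PR's own metadata type may be shaped differently.

```rust
/// Hypothetical stand-in for the frontload metadata category.
#[derive(Debug, PartialEq)]
enum FrontloadMeta {
    /// Split across workers, then combined in phase 2.
    Normal,
    /// Duplicated to every worker; worker bits become tail factors.
    Phase1Only,
}

/// An MLE is split only if it has more variables than there are worker bits.
fn classify(num_vars: usize, log_workers: usize) -> FrontloadMeta {
    if num_vars > log_workers {
        FrontloadMeta::Normal
    } else {
        FrontloadMeta::Phase1Only
    }
}
```

With two worker bits, a (10 variables) and c (5 variables) classify as Normal, and b (1 variable) as Phase1Only.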

For a, each worker receives a 2^8 chunk. Phase 1 folds each chunk over x0..x7, producing four scalars. Phase 2 builds a two-variable MLE from those scalars and folds the worker bits x8,x9.

For b, every worker receives the same compact 2^1 MLE. Frontload embeds it as:

B(x0..x9) = b(x0) * x1 * x2 * ... * x9

Round 0 folds the real b variable. Rounds 1..7 only apply local tail factors. Because b has no worker-bit data, the missing worker-bit tail requires x8 = 1 and x9 = 1, so only worker 3 = 0b11 contributes this term during phase 1. Phase 2 carries b as a compact constant with the remaining worker-bit tail.

For c, the MLE is larger than the worker space, so it is split across workers. Each worker receives a 2^3 local chunk:

worker 0: c_00(x0,x1,x2)
worker 1: c_01(x0,x1,x2)
worker 2: c_10(x0,x1,x2)
worker 3: c_11(x0,x1,x2)

After phase-1 round 2, each worker has one scalar s_w = c_w(r0,r1,r2). Workers do not exchange data within phase 1, so those four scalars are not combined into one yet. During rounds 3..7, each worker independently applies the missing local-variable tail:

s'_w = s_w * r3 * r4 * r5 * r6 * r7

Phase 2 then builds a two-variable MLE from [s'_0, s'_1, s'_2, s'_3] and folds the worker bits x8,x9.
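The two-phase treatment of c can be checked against the direct frontload evaluation c(r0,r1,r2,r8,r9) * r3 * ... * r7. The sketch below only verifies the final evaluation, not the per-round polynomials. It assumes a toy modulus and a little-endian layout where worker w owns the w-th contiguous chunk; all function names are hypothetical.

```rust
/// Toy prime modulus (2^31 - 1); for illustration only.
const P: u64 = 2_147_483_647;

// All inputs are assumed to be already reduced below P.
fn add(a: u64, b: u64) -> u64 { (a + b) % P }
fn sub(a: u64, b: u64) -> u64 { (a + P - b) % P }
fn mul(a: u64, b: u64) -> u64 { (a * b) % P }

/// Evaluate an MLE; index bit i of `evals` corresponds to the i-th point entry.
fn eval_mle(evals: &[u64], point: &[u64]) -> u64 {
    assert_eq!(evals.len(), 1 << point.len());
    let mut cur = evals.to_vec();
    for &r in point {
        cur = cur.chunks(2).map(|p| add(p[0], mul(r, sub(p[1], p[0])))).collect();
    }
    cur[0]
}

/// Phase 1 for one worker: fold the real local variables, then apply the local tail.
fn phase1_scalar(chunk: &[u64], local_point: &[u64]) -> u64 {
    let real = chunk.len().trailing_zeros() as usize;
    let s = eval_mle(chunk, &local_point[..real]);
    local_point[real..].iter().fold(s, |acc, &r| mul(acc, r))
}

/// Phase 2: treat the per-worker scalars as an MLE over the worker bits.
fn two_phase_eval(evals: &[u64], log_workers: usize, point: &[u64]) -> u64 {
    let local_vars = point.len() - log_workers;
    let scalars: Vec<u64> = evals
        .chunks(evals.len() >> log_workers)
        .map(|chunk| phase1_scalar(chunk, &point[..local_vars]))
        .collect();
    eval_mle(&scalars, &point[local_vars..])
}

/// Reference: evaluate c on (real local vars, worker vars), then multiply by the tail.
fn direct_eval(evals: &[u64], log_workers: usize, point: &[u64]) -> u64 {
    let k = evals.len().trailing_zeros() as usize;
    let (real_local, local_vars) = (k - log_workers, point.len() - log_workers);
    let mut args = point[..real_local].to_vec();
    args.extend_from_slice(&point[local_vars..]);
    let f = eval_mle(evals, &args);
    point[real_local..local_vars].iter().fold(f, |acc, &r| mul(acc, r))
}
```

Because MLE evaluation is linear in the table entries, scaling each worker's scalar by the same local tail before phase 2 gives the same result as scaling the combined evaluation afterwards.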

Two-Phase Worker Rule

prove_2phase keeps the existing worker split: local variables in phase 1 and worker-index variables in phase 2. If a term does not depend on a worker-index bit, frontload treats that bit as part of the tail product, so the worker contributes only when the missing worker bits are 1. This avoids duplicated contributions across workers while keeping phase-1-only MLEs compact.
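The worker rule can be sketched as a bitmask test. This is a hypothetical illustration of the rule as described here, not the signature or body of the PR's `worker_matches_frontload_tail`; the names `missing_worker_mask` and `worker_contributes` are assumptions.

```rust
/// Mask of worker-index bits that a term does NOT depend on.
/// Assumption (from the toy example): an MLE with more variables than worker
/// bits uses every worker bit; otherwise it uses none of them.
fn missing_worker_mask(num_vars: usize, log_workers: usize) -> usize {
    if num_vars > log_workers {
        0
    } else {
        (1usize << log_workers) - 1
    }
}

/// A worker contributes only if every missing worker bit is 1 in its index.
fn worker_contributes(worker_id: usize, missing_mask: usize) -> bool {
    (worker_id & missing_mask) == missing_mask
}
```

With four workers, the 1-variable b has mask 0b11, so only worker 3 contributes it, while the 10-variable a has mask 0 and is contributed by every worker.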

Change Highlights

| Area | Change |
| --- | --- |
| sumcheck::frontload | Adds the CPU frontload prover, evaluator, two-phase worker handling, tail masks, and final evaluations. |
| Prover routing | IOPProverState::prove defaults to SumcheckProverMode::Frontload; prove_suffix keeps the suffix path explicit. |
| Mixed-length metadata | Maps PolyMeta::Phase2Only to frontload Phase1Only and keeps short MLEs compact across phase boundaries. |
| Uniform fast path | Adds frontload_uniform_sumcheck_code_gen for dense uniform degree 2/3/4 terms. |
| Bench/test utilities | Adds frontload-vs-suffix benchmark cases and suffix-scaled random monomial generation. |
| PR guidance | Updates Copilot instructions to review first and avoid in-place PR pushes unless explicitly requested. |

Benchmark / Performance Impact

Mixed product benchmarks, vars 22/16/2, sample size 10. Positive percentage means frontload is faster than suffix.

| Benchmark | Suffix mean | Frontload mean | Change |
| --- | --- | --- | --- |
| a1*a2 + b1*b2 + c1*c2 | 151.17 ms | 43.88 ms | +70.97% |
| degree-3 product sum | 315.76 ms | 91.68 ms | +70.96% |
| degree-4 product sum | 509.92 ms | 122.24 ms | +76.03% |
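The Change column appears to be the relative time reduction, (suffix - frontload) / suffix. That formula is inferred from the reported numbers rather than stated in the PR; a small sketch with a hypothetical helper name:

```rust
/// Relative time reduction of frontload versus suffix, in percent.
/// Positive means frontload is faster.
fn speedup_percent(suffix_ms: f64, frontload_ms: f64) -> f64 {
    (suffix_ms - frontload_ms) / suffix_ms * 100.0
}
```

Applied to the rounded means above, this reproduces each reported percentage to within 0.01 percentage points.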

Command:

cargo bench -p sumcheck --bench devirgo_sumcheck 'mixed_product_sum_nv_22_16_2' -- --sample-size 10
cargo bench -p sumcheck --bench devirgo_sumcheck 'mixed_product3_sum_nv_22_16_2' -- --sample-size 10
cargo bench -p sumcheck --bench devirgo_sumcheck 'mixed_product4_sum_nv_22_16_2' -- --sample-size 10

Testing

| Command | Result |
| --- | --- |
| cargo fmt --all | Pass |
| cargo check -p multilinear_extensions -p sumcheck --all-targets | Pass |
| cargo test -p sumcheck -- --nocapture | Previously passed: 10 passed; 0 failed |

Risks and Rollout

| Risk | Mitigation |
| --- | --- |
| Frontload tail factor correctness | Tests cover mixed-size, compact small-only, and two-phase behavior. |
| Worker duplication in phase 2 | worker_matches_frontload_tail requires all missing worker-index bits to be 1. |
| Compatibility with existing suffix users | Suffix prover remains available through prove_suffix. |

Rollback is to route callers back to prove_suffix or revert this PR.

Follow-ups (optional)

Use these benchmark cases as regression coverage while integrating tower/main sumchecks in Ceno and GPU frontload kernels.

Copilot Reviewer Directive (keep this section)

When Copilot reviews this PR, apply .github/copilot-instructions.md strictly.

@hero78119 hero78119 changed the title Feat/frontload Add frontend-loaded CPU sumcheck prover May 4, 2026
@hero78119 hero78119 marked this pull request as draft May 4, 2026 09:14
@hero78119 hero78119 marked this pull request as ready for review May 4, 2026 12:58
Copilot AI requested a review from kunxian-xia May 5, 2026 02:10
@hero78119 hero78119 changed the title Add frontend-loaded CPU sumcheck prover Add frontload CPU sumcheck prover May 6, 2026
hero78119 added 2 commits May 6, 2026 14:49
@scroll-tech scroll-tech deleted a comment from Copilot AI May 6, 2026
@kunxian-xia
Collaborator

@copilot, after this pr, the behaviors for

  1. computing expected sum for batched sumcheck
  2. computing the final expected evaluation for batched sumcheck

are also modified, right? please don't make any changes to the current pr on my behalf.

Contributor

Copilot AI commented May 6, 2026

@copilot, after this pr, the behaviors for

  1. computing expected sum for batched sumcheck
  2. computing the final expected evaluation for batched sumcheck
    ...

Yes, both are modified:

1. Expected sum for batched sumcheck (virtual_polys.rs):
random_monimials no longer applies the 2^(N-k) suffix-scaling for short MLEs — it now uses scale 1 (frontload interpretation, where the tail product x_i sums to 1 on the boolean hypercube). The old suffix-scaling behavior is preserved separately in the new random_suffixload_monimials function.

2. Final expected evaluation for batched sumcheck (prover.rs get_mle_final_evaluations):
When in frontload mode, this returns the stored FrontLoadedProverState.final_evaluations directly. Those values are the per-MLE residual scalars after all tail factors have been folded in during the frontload prover rounds — so they already encode the frontload embedding (f(r_0..r_{k-1}) * product_{i=k}^{N-1} r_i) rather than the raw MLE evaluation at the last fixed point.

The frontload evaluate helper in frontload.rs (used by tests) also applies this same tail product when computing expected evaluations at a verifier point.

@hero78119 hero78119 merged commit 17e9a59 into main May 7, 2026
2 checks passed
@hero78119 hero78119 deleted the feat/frontload branch May 7, 2026 08:10