Add frontload CPU sumcheck prover by hero78119 · Pull Request #49 · scroll-tech/gkr-backend

hero78119 · 2026-05-04T08:51:20Z

Problem

Ceno needs a CPU sumcheck path that interprets mixed-length MLEs in frontload layout: short MLEs bind to the early variables, and each missing later variable x_i contributes a multiplicative tail factor x_i. The existing suffix/backend-loading prover remains available for comparison and fallback routing.

Design Rationale

Frontload MLE Embedding

For an MLE f with k variables embedded into an N-variable sumcheck domain, frontload binds f to the early variables and represents every missing later variable as a multiplicative tail:

F(x_0, ..., x_{N-1}) = f(x_0, ..., x_{k-1}) * product_{i=k}^{N-1} x_i

This preserves the Boolean-hypercube sum because the tail product contributes only on the all-one assignment:

sum_{x in {0,1}^N} F(x) = sum_{x_0..x_{k-1}} f(x_0..x_{k-1})
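The identity follows because the sum factorizes over the tail variables and each tail variable contributes 0 + 1 = 1. A short derivation, using only the definitions above:

```latex
\sum_{x \in \{0,1\}^N} F(x)
  = \sum_{x_0,\dots,x_{k-1}} f(x_0,\dots,x_{k-1})
    \prod_{i=k}^{N-1} \Bigl( \sum_{x_i \in \{0,1\}} x_i \Bigr)
  = \sum_{x_0,\dots,x_{k-1}} f(x_0,\dots,x_{k-1}) \cdot 1^{N-k}
```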

At a verifier point r, the same compact MLE is evaluated as:

F(r_0, ..., r_{N-1}) = f(r_0, ..., r_{k-1}) * product_{i=k}^{N-1} r_i

So short MLEs are not expanded to 2^N; they stay compact and receive frontload tail factors after their own variables have been fixed.
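The embedding can be checked with a small standalone sketch. It is not the PR's code: the toy prime field, the helper names (`eval_mle`, `eval_frontload`, `hypercube_sum`), and the convention that index bit j corresponds to x_j are all assumptions made for illustration.

```rust
// Toy prime field (2^31 - 1); chosen only for illustration, not the field used by the PR.
const P: u64 = 2_147_483_647;

fn add(a: u64, b: u64) -> u64 { (a + b) % P }
fn sub(a: u64, b: u64) -> u64 { (a + P - b) % P }
fn mul(a: u64, b: u64) -> u64 { (a * b) % P }

/// Evaluate a compact MLE (2^k hypercube values, index bit j <-> x_j) at `point`,
/// fixing x_0 first.
fn eval_mle(evals: &[u64], point: &[u64]) -> u64 {
    let mut cur = evals.to_vec();
    for &r in point {
        cur = cur
            .chunks(2)
            .map(|c| add(c[0], mul(r, sub(c[1], c[0]))))
            .collect();
    }
    cur[0]
}

/// Frontload embedding: F(r) = f(r_0..r_{k-1}) * prod_{i=k}^{N-1} r_i.
/// The MLE itself stays at size 2^k; only the tail product sees the extra variables.
fn eval_frontload(evals: &[u64], point: &[u64]) -> u64 {
    let k = evals.len().trailing_zeros() as usize;
    let tail = point[k..].iter().fold(1, |acc, &r| mul(acc, r));
    mul(eval_mle(evals, &point[..k]), tail)
}

/// Brute-force sum of F over {0,1}^n.
fn hypercube_sum(evals: &[u64], n: usize) -> u64 {
    (0..1u64 << n).fold(0, |acc, idx| {
        let point: Vec<u64> = (0..n).map(|j| (idx >> j) & 1).collect();
        add(acc, eval_frontload(evals, &point))
    })
}

fn main() {
    let f: [u64; 4] = [3, 1, 4, 1]; // k = 2 variables
    let n = 5; // embed into a 5-variable domain
    let compact_sum = f.iter().fold(0, |acc, &v| add(acc, v));
    // The embedded sum equals the sum over the compact MLE.
    assert_eq!(hypercube_sum(&f, n), compact_sum);
    println!("sum over 2^{} points = {}", n, hypercube_sum(&f, n));
}
```

The brute-force sum is only there to confirm the identity on a tiny case; a prover would never enumerate the 2^N points.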

Difference From Suffix Load

Suffix/backend loading aligns short MLEs with later variables. Missing earlier variables are summed out, so the asserted sum needs the usual 2^(N-k) scaling for random mixed-length monomials.

Frontload does the opposite: short MLEs align with early variables, and missing later variables are encoded by the product x_i tail. That tail sums to 1, so frontload random monomial sums do not apply suffix missing-variable scaling.

Operationally:

| Case | Suffix load | Frontload |
| --- | --- | --- |
| Short MLE variables | Later variables | Early variables |
| Missing variables | Summed out / scale by `2^(N-k)` | Tail product `product x_i` |
| Round after MLE is fixed | Scalar remains independent of current var | Scalar is multiplied by current/future tail factors |
| Lane filtering | Based on suffix-aligned small index | Future tail bits must be `1` |
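The difference in the asserted sum can be illustrated on Boolean points with plain integers. This is a minimal sketch under assumed conventions (index bit j corresponds to x_j; `suffix_at` and `frontload_at` are hypothetical names, not functions from the PR).

```rust
/// Suffix embedding on Boolean index `idx` of an n-variable domain:
/// f is read from the last k variables; the earlier variables are ignored (summed out).
fn suffix_at(f: &[u64], n: usize, idx: usize) -> u64 {
    let k = f.len().trailing_zeros() as usize;
    f[idx >> (n - k)]
}

/// Frontload embedding on a Boolean index: f is read from the first k variables,
/// and the point contributes only when every tail bit is 1.
fn frontload_at(f: &[u64], n: usize, idx: usize) -> u64 {
    let k = f.len().trailing_zeros() as usize;
    let tail_ones = (1usize << (n - k)) - 1;
    if (idx >> k) == tail_ones { f[idx & (f.len() - 1)] } else { 0 }
}

fn main() {
    let f: [u64; 4] = [3, 1, 4, 1]; // k = 2
    let n = 5;
    let base: u64 = f.iter().sum();
    let suffix_sum: u64 = (0..1usize << n).map(|i| suffix_at(&f, n, i)).sum();
    let frontload_sum: u64 = (0..1usize << n).map(|i| frontload_at(&f, n, i)).sum();
    // Suffix needs the 2^(N-k) scaling; frontload does not.
    assert_eq!(suffix_sum, base << (n - 2));
    assert_eq!(frontload_sum, base);
    println!("suffix = {}, frontload = {}", suffix_sum, frontload_sum);
}
```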

Worker-Aware Frontload Split

A subtle but important point is how a medium-size MLE is embedded when the global domain is larger than the MLE and the prover is split across workers.

Assume four workers and a 2^10 global domain:

global variables:  x0..x9
phase-1 variables: x0..x7
worker variables:  x8,x9
num_workers:       4 = 2^2

For a 7-variable MLE:

e(x0, x1, x2, x3, x4, x5, x6)

A naive split would first pad e to all 10 variables and then split by the last worker bits:

e_naive_{i0,i1}(x0..x7) = e(x0, x1, x2, x3, x4, x5, x6, i0, i1)

This is the wrong mental model for frontload. The original e has only 7 real variables, so treating i0,i1 as variables after x0..x6 would mostly route workers through padded tail space instead of through real MLE data.

Frontload uses a worker-aware split instead. The worker bits occupy the last real variables of e, and the missing local variables become tail factors:

pad_e(x0..x9) = e(x0, x1, x2, x3, x4, x8, x9) * x5 * x6 * x7

Therefore each worker (i0,i1) receives:

pad_e_{i0,i1}(x0..x7) = e(x0, x1, x2, x3, x4, i0, i1) * x5 * x6 * x7

Equivalently, the variables are classified as:

real local variables:  x0..x4
worker real variables: x8,x9
frontload tail:        x5,x6,x7

This split distributes the non-zero real MLE workload across workers before applying frontload padding. Each worker gets a real 2^5 slice of e, and the remaining local variables are handled as cheap multiplicative tail factors.
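The index mapping behind this split can be sketched as follows. The function names (`worker_slice`, `pad_e_at`) and the little-endian indexing (bit j of an index corresponds to variable j, worker bits on top) are assumptions for illustration, not the PR's API.

```rust
/// Worker w's real slice of e: the worker bits are the top bits of e's index,
/// so each worker owns one contiguous chunk.
fn worker_slice(e: &[u64], num_workers: usize, w: usize) -> &[u64] {
    let chunk = e.len() / num_workers;
    &e[w * chunk..(w + 1) * chunk]
}

/// pad_e on a Boolean global index g: real data is looked up in the worker's slice,
/// and the point contributes only if the local tail bits (x5, x6, x7 here) are all 1.
fn pad_e_at(e: &[u64], local_vars: usize, num_workers: usize, g: usize) -> u64 {
    let local = g & ((1usize << local_vars) - 1); // x0..x7
    let w = g >> local_vars; // x8, x9
    let slice = worker_slice(e, num_workers, w);
    let real_mask = slice.len() - 1; // x0..x4
    let tail_mask = ((1usize << local_vars) - 1) & !real_mask; // x5, x6, x7
    if (local & tail_mask) == tail_mask { slice[local & real_mask] } else { 0 }
}

fn main() {
    let e: Vec<u64> = (1..=128).collect(); // a 7-variable MLE
    let (local_vars, num_workers) = (8, 4);
    let total: u64 = (0..1usize << 10)
        .map(|g| pad_e_at(&e, local_vars, num_workers, g))
        .sum();
    // The embedding preserves the sum of e over its own hypercube.
    assert_eq!(total, e.iter().sum::<u64>());
    for w in 0..num_workers {
        let slice = worker_slice(&e, num_workers, w);
        let s: u64 = slice.iter().sum();
        println!("worker {}: {} real values, slice sum {}", w, slice.len(), s);
    }
}
```

Every worker ends up with 2^5 real values, which is the load-balancing property the split is meant to provide.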

Toy Example: `a + b + c` With Four Workers

Assume four workers, so log2(num_workers) = 2, and a global 2^10 sumcheck domain:

global variables:  x0..x9
phase-1 variables: x0..x7
phase-2 variables: x8,x9
polynomial:        a + b + c
a size:            2^10
b size:            2^1
c size:            2^5

The frontload metadata categories are:

a: 10 variables > 2 worker bits => Normal
b:  1 variable  <= 2 worker bits => Phase1Only
c:  5 variables > 2 worker bits => Normal

Normal means the MLE is split across workers and then combined in phase 2. Phase1Only means the MLE is duplicated to every worker and is not split by worker bits; missing worker bits are represented by frontload tail factors.
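The classification rule from the toy example can be written as a one-line comparison. The enum and function below are hypothetical stand-ins that mirror the category names in the text; the actual metadata type in the crate may differ.

```rust
/// Hypothetical mirror of the two frontload categories described above.
#[derive(Debug, PartialEq, Clone, Copy)]
enum FrontloadMeta {
    /// Split across workers, combined in phase 2.
    Normal,
    /// Duplicated to every worker; missing worker bits become tail factors.
    Phase1Only,
}

/// An MLE is split only if it has more variables than there are worker bits.
fn classify(num_vars: usize, num_workers: usize) -> FrontloadMeta {
    let worker_bits = num_workers.trailing_zeros() as usize;
    if num_vars > worker_bits {
        FrontloadMeta::Normal
    } else {
        FrontloadMeta::Phase1Only
    }
}

fn main() {
    for (name, num_vars) in [("a", 10), ("b", 1), ("c", 5)] {
        println!("{}: {} variables => {:?}", name, num_vars, classify(num_vars, 4));
    }
}
```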

For a, each worker receives a 2^8 chunk. Phase 1 folds each chunk over x0..x7, producing four scalars. Phase 2 builds a two-variable MLE from those scalars and folds the worker bits x8,x9.

For b, every worker receives the same compact 2^1 MLE. Frontload embeds it as:

B(x0..x9) = b(x0) * x1 * x2 * ... * x9

Round 0 folds the real b variable. Rounds 1..7 only apply local tail factors. Because b has no worker-bit data, the missing worker-bit tail requires x8 = 1 and x9 = 1, so only worker 3 = 0b11 contributes this term during phase 1. Phase 2 carries b as a compact constant with the remaining worker-bit tail.

For c, the MLE is larger than the worker space, so it is split across workers. Each worker receives a 2^3 local chunk:

worker 0: c_00(x0,x1,x2)
worker 1: c_01(x0,x1,x2)
worker 2: c_10(x0,x1,x2)
worker 3: c_11(x0,x1,x2)

After phase-1 round 2, each worker has one scalar s_w = c_w(r0,r1,r2). Phase 1 has no cross-worker exchange that would combine those four scalars into one. During rounds 3..7, each worker independently applies the missing local-variable tail:

s'_w = s_w * r3 * r4 * r5 * r6 * r7

Phase 2 then builds a two-variable MLE from [s'_0, s'_1, s'_2, s'_3] and folds the worker bits x8,x9.
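The final evaluation for c can be cross-checked against the embedded form c(r0, r1, r2, r8, r9) * r3 * ... * r7. The sketch below covers only that end-to-end value, not the per-round polynomials; the toy field, the helper names, and the stand-in challenge values are assumptions for illustration.

```rust
// Toy prime field (2^31 - 1), for illustration only.
const P: u64 = 2_147_483_647;

fn add(a: u64, b: u64) -> u64 { (a + b) % P }
fn sub(a: u64, b: u64) -> u64 { (a + P - b) % P }
fn mul(a: u64, b: u64) -> u64 { (a * b) % P }

/// Evaluate an MLE (index bit j <-> variable j) at `point`, fixing variable 0 first.
fn eval_mle(evals: &[u64], point: &[u64]) -> u64 {
    let mut cur = evals.to_vec();
    for &r in point {
        cur = cur.chunks(2).map(|c| add(c[0], mul(r, sub(c[1], c[0])))).collect();
    }
    cur[0]
}

/// Phase 1 + phase 2 for a split ("Normal") term.
fn two_phase_eval(c: &[u64], r: &[u64], local_vars: usize, num_workers: usize) -> u64 {
    let chunk = c.len() / num_workers; // 2^3 values per worker in the toy example
    let real_local = chunk.trailing_zeros() as usize; // 3 real local variables
    // Tail factors r3..r7 for the missing local variables.
    let tail = r[real_local..local_vars].iter().fold(1, |acc, &x| mul(acc, x));
    // Phase 1: s'_w = c_w(r0, r1, r2) * tail, computed independently per worker.
    let scalars: Vec<u64> = c
        .chunks(chunk)
        .map(|c_w| mul(eval_mle(c_w, &r[..real_local]), tail))
        .collect();
    // Phase 2: fold the worker bits x8, x9 over [s'_0, ..., s'_3].
    eval_mle(&scalars, &r[local_vars..])
}

/// Reference value: C(r) = c(r0, r1, r2, r8, r9) * r3 * ... * r7.
fn direct_eval(c: &[u64], r: &[u64], local_vars: usize, num_workers: usize) -> u64 {
    let real_local = (c.len() / num_workers).trailing_zeros() as usize;
    let mut point = r[..real_local].to_vec();
    point.extend_from_slice(&r[local_vars..]);
    let tail = r[real_local..local_vars].iter().fold(1, |acc, &x| mul(acc, x));
    mul(eval_mle(c, &point), tail)
}

fn main() {
    let c: Vec<u64> = (1..=32).collect(); // 5-variable MLE
    let r: [u64; 10] = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]; // stand-in challenges
    let got = two_phase_eval(&c, &r, 8, 4);
    assert_eq!(got, direct_eval(&c, &r, 8, 4));
    println!("C(r) = {}", got);
}
```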

Two-Phase Worker Rule

prove_2phase keeps the existing worker split: local variables in phase 1 and worker-index variables in phase 2. If a term does not depend on a worker-index bit, frontload treats that bit as part of the tail product, so the worker contributes only when the missing worker bits are 1. This avoids duplicated contributions across workers while keeping phase-1-only MLEs compact.
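The worker rule amounts to a bitmask test. The helper below is a hypothetical sketch of that idea (the name, the mask parameter, and the claim computation are assumptions, not the signature of any function in the PR); it shows that a duplicated term such as b is counted exactly once across workers.

```rust
/// Hypothetical helper: a worker contributes a term only if every worker-index bit
/// that the term does not depend on (bits set in `missing_worker_mask`) is 1.
fn worker_contributes(worker_id: usize, missing_worker_mask: usize) -> bool {
    (worker_id & missing_worker_mask) == missing_worker_mask
}

/// Per-worker share of the sum for a term duplicated to every worker.
fn phase1_claims(b: &[u64], num_workers: usize, missing_worker_mask: usize) -> Vec<u64> {
    (0..num_workers)
        .map(|w| {
            if worker_contributes(w, missing_worker_mask) {
                b.iter().sum::<u64>()
            } else {
                0
            }
        })
        .collect()
}

fn main() {
    let b: [u64; 2] = [5, 7]; // a 1-variable MLE duplicated to 4 workers
    let claims = phase1_claims(&b, 4, 0b11); // b depends on neither worker bit
    // Only worker 3 = 0b11 contributes, so the term is counted once, not four times.
    assert_eq!(claims.iter().sum::<u64>(), b.iter().sum::<u64>());
    println!("per-worker claims: {:?}", claims);
}
```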

Change Highlights

| Area | Change |
| --- | --- |
| `sumcheck::frontload` | Adds the CPU frontload prover, evaluator, two-phase worker handling, tail masks, and final evaluations. |
| Prover routing | `IOPProverState::prove` defaults to `SumcheckProverMode::Frontload`; `prove_suffix` keeps the suffix path explicit. |
| Mixed-length metadata | Maps `PolyMeta::Phase2Only` to frontload `Phase1Only` and keeps short MLEs compact across phase boundaries. |
| Uniform fast path | Adds `frontload_uniform_sumcheck_code_gen` for dense uniform degree 2/3/4 terms. |
| Bench/test utilities | Adds frontload-vs-suffix benchmark cases and suffix-scaled random monomial generation. |
| PR guidance | Updates Copilot instructions to review first and avoid in-place PR pushes unless explicitly requested. |

Benchmark / Performance Impact

Mixed product benchmarks, vars 22/16/2, sample size 10. Change is the reduction in mean time relative to the suffix mean; a positive percentage means frontload is faster than suffix.

| Benchmark | Suffix mean | Frontload mean | Change |
| --- | --- | --- | --- |
| `a1*a2 + b1*b2 + c1*c2` | 151.17 ms | 43.88 ms | +70.97% |
| degree-3 product sum | 315.76 ms | 91.68 ms | +70.96% |
| degree-4 product sum | 509.92 ms | 122.24 ms | +76.03% |

Command:

cargo bench -p sumcheck --bench devirgo_sumcheck 'mixed_product_sum_nv_22_16_2' -- --sample-size 10
cargo bench -p sumcheck --bench devirgo_sumcheck 'mixed_product3_sum_nv_22_16_2' -- --sample-size 10
cargo bench -p sumcheck --bench devirgo_sumcheck 'mixed_product4_sum_nv_22_16_2' -- --sample-size 10

Testing

| Command | Result |
| --- | --- |
| `cargo fmt --all` | Pass |
| `cargo check -p multilinear_extensions -p sumcheck --all-targets` | Pass |
| `cargo test -p sumcheck -- --nocapture` | Previously passed: `10 passed; 0 failed` |

Risks and Rollout

| Risk | Mitigation |
| --- | --- |
| Frontload tail factor correctness | Tests cover mixed-size, compact small-only, and two-phase behavior. |
| Worker duplication in phase 2 | `worker_matches_frontload_tail` requires all missing worker-index bits to be `1`. |
| Compatibility with existing suffix users | Suffix prover remains available through `prove_suffix`. |

Rollback is to route callers back to `prove_suffix` or revert this PR.

Follow-ups (optional)

Use these benchmark cases as regression coverage while integrating tower/main sumchecks in Ceno and GPU frontload kernels.

Copilot Reviewer Directive (keep this section)

When Copilot reviews this PR, apply .github/copilot-instructions.md strictly.

Agent-Logs-Url: https://github.com/scroll-tech/gkr-backend/sessions/d9fe64b2-2a93-463a-8ddd-99885bb15b90 Co-authored-by: kunxian-xia <1082586+kunxian-xia@users.noreply.github.com>

# Conflicts: # crates/multilinear_extensions/src/mle.rs

kunxian-xia · 2026-05-06T12:31:26Z

@copilot, after this pr, the behaviors for

- computing expected sum for batched sumcheck
- computing the final expected evaluation for batched sumcheck

are also modified, right? please don't make any changes to the current pr on my behalf.

Copilot · 2026-05-06T12:33:15Z

> @copilot, after this pr, the behaviors for
>
> computing expected sum for batched sumcheck
>
> computing the final expected evaluation for batched sumcheck
> ...

Yes, both are modified:

1. Expected sum for batched sumcheck (`virtual_polys.rs`):
`random_monimials` no longer applies the 2^(N-k) suffix scaling for short MLEs; it now uses scale 1 (the frontload interpretation, where the tail product of x_i sums to 1 over the Boolean hypercube). The old suffix-scaling behavior is preserved separately in the new `random_suffixload_monimials` function.

2. Final expected evaluation for batched sumcheck (`prover.rs`, `get_mle_final_evaluations`):
When in frontload mode, this returns the stored `FrontLoadedProverState.final_evaluations` directly. Those values are the per-MLE residual scalars after all tail factors have been folded in during the frontload prover rounds, so they already encode the frontload embedding (f(r_0..r_{k-1}) * product_{i=k}^{N-1} r_i) rather than the raw MLE evaluation at the last fixed point.

The frontload `evaluate` helper in `frontload.rs` (used by tests) also applies this same tail product when computing expected evaluations at a verifier point.

hero78119 added 2 commits April 23, 2026 16:27

[ci] add line level comment

75958d1

Add frontend-loaded sumcheck prover benchmarks

60468d0

hero78119 changed the title ~~Feat/frontload~~ Add frontend-loaded CPU sumcheck prover May 4, 2026

Clarify frontend-loaded phase metadata

6a907de

hero78119 marked this pull request as draft May 4, 2026 09:14

make frontload sumcheck as default

761a349

hero78119 marked this pull request as ready for review May 4, 2026 12:58

Copilot started work on behalf of kunxian-xia May 5, 2026 01:58

rename: frontloaded → front_loaded throughout sumcheck crate

94b8f5b

Agent-Logs-Url: https://github.com/scroll-tech/gkr-backend/sessions/d9fe64b2-2a93-463a-8ddd-99885bb15b90 Co-authored-by: kunxian-xia <1082586+kunxian-xia@users.noreply.github.com>

Copilot finished work on behalf of kunxian-xia May 5, 2026 02:10

Copilot AI requested a review from kunxian-xia May 5, 2026 02:10

hero78119 added 6 commits May 5, 2026 16:20

Optimize frontloaded sumcheck CPU path

31b8638

add new api

ad8b873

optimize frontloaded sumcheck uniform terms

caaf387

fuse frontloaded worker fold and evaluation

5c21212

generate frontloaded uniform kernels

d4b87f8

canonicalize frontload naming

f8847d8

hero78119 changed the title ~~Add frontend-loaded CPU sumcheck prover~~ Add frontload CPU sumcheck prover May 6, 2026

hero78119 added 2 commits May 6, 2026 14:49

Merge branch 'main' into feat/frontload

f7544b9

# Conflicts: # crates/multilinear_extensions/src/mle.rs

fix frontload compact mle endpoint reads

5ebca2f

scroll-tech deleted a comment from Copilot AI May 6, 2026

Copilot started work on behalf of kunxian-xia May 6, 2026 12:31

Copilot finished work on behalf of kunxian-xia May 6, 2026 12:34

Copilot started work on behalf of kunxian-xia May 6, 2026 12:51

Copilot finished work on behalf of kunxian-xia May 6, 2026 12:53

Copilot started work on behalf of kunxian-xia May 6, 2026 13:53

Copilot finished work on behalf of kunxian-xia May 6, 2026 13:54

more docs

b1ef77f

kunxian-xia approved these changes May 7, 2026

refactor boilerplate code

89d1501

hero78119 merged commit 17e9a59 into main May 7, 2026
2 checks passed

hero78119 deleted the feat/frontload branch May 7, 2026 08:10

kunxian-xia mentioned this pull request May 7, 2026

Frontload 2phase: wrong round polynomials for medium-sized Normal MLEs #50

Closed

hero78119 mentioned this pull request May 8, 2026

fix frontload bug and follow design rationales in PR 49 #51

Merged

Add frontload CPU sumcheck prover #49: hero78119 merged 15 commits into main from feat/frontload
