You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I built a research prover that overlaps a lot with what SP1 and Plonky3 are doing, and I wanted to ask whether the problem it goes after is one you actually care about or a dead end.
Short version: it's a DEEP-FRI STARK over Goldilocks, except every stage that would normally hold the whole trace in RAM streams to disk instead. So peak memory stops depending on
how big the trace is.
How it's built
A few mechanics, since this crowd will want them:
NTT/LDE. Four-step (Bailey) split, N = n1 * n2: NTT the rows, apply twiddles, NTT the columns, transpose. Each pass only touches one row or column, so the working set is O(sqrt N). The matrix sits on NVMe and tiles move through a fixed buffer.
Merkle commitment. Built bottom-up straight off a file of leaves, one chunk per pread, so the full tree never lives in memory. Columns commit in batched groups (a leaf is the
Poseidon hash of a group's columns at one row).
FRI. Each fold reads the previous round's file and writes the next. I commit every other layer (arity-4) and let the verifier recompute the skipped layer from the committed
endpoints.
Page cache. This one bit me. Under a hard cgroup with swap off, dirty pages count against you and get you OOM-killed even with a flat heap. A flusher that writes back and drops
every ~64 MiB keeps it bounded.
Why I trust the numbers
Every memory number is measured under a real systemd cgroup (MemoryMax=256M, MemorySwapMax=0). With swap off the limit is a hard wall, so when I say the in-core prover dies
and the streaming one lives, that's something I watched, not a model.
Soundness gets poked at too: flip any field in a proof, tamper with a lookup table, hand it wrong LogUp multiplicities, or feed an invalid trace, and it rejects.
Numbers (256 MB cgroup, swap off)
2^28 NTT, 2 GB of data: in-core OOM-killed, out-of-core runs in 40 MB and gives the identical checksum.
Full FRI low-degree proof at 2^26 / 2^27 / 2^28 (512 MB up to 2 GB): in-core OOMs at every size, out-of-core stays flat at 34 MB across the whole 8x range.
Full Fibonacci STARK at 2^22, roughly 7 GB if kept in core: OOM in core, 45 MB and 61 s out of core.
A sponge hash proving Sponge(public msg) = digest, 54 columns, 2^18 blocks: OOM in core, 121 MB and 15.5 s out of core.
Resident set tracks the tile size, not N. The CPU side is already parallel (a 2^24 self-proof goes 65.7 s to 11.2 s on 16 cores). The thing eating the time is Poseidon hashing in
the commitment, not the streaming I/O.
The part I want your read on
Same Goldilocks/FRI/STARK stack SP1 runs on, so you'd know better than me where this lands. I don't think the interesting axis is cheap CPU RAM, it's VRAM. A GPU prover can't hold a
trace bigger than its VRAM, and that caps how big a proof a given card can do. Does a streaming layer that spills past VRAM actually buy anything (bigger traces per card, or a lower
hardware bar for people running provers in the network), or is it moot because you'd shard the trace or just rent a bigger card?
What this isn't
Straight about the gaps: it's a research artifact. The AIRs are small, though they go up to a real sponge, and the Poseidon is a research one, not a standardized constant set. It is
not wired into SP1 or Plonky3. And since the bottleneck is Poseidon, the obvious next step is GPU hashing, which is the one piece I haven't had the hardware to test.
So is out-of-core or VRAM-bounded proving something you'd find useful, and if so where would it hurt first? Fine to tell me it's irrelevant, that's useful to know too.
reacted with thumbs up emoji reacted with thumbs down emoji reacted with laugh emoji reacted with hooray emoji reacted with confused emoji reacted with heart emoji reacted with rocket emoji reacted with eyes emoji
Uh oh!
There was an error while loading. Please reload this page.
-
I built a research prover that overlaps a lot with what SP1 and Plonky3 are doing, and I wanted to ask whether the problem it goes after is one you actually care about or a dead end.
Short version: it's a DEEP-FRI STARK over Goldilocks, except every stage that would normally hold the whole trace in RAM streams to disk instead. So peak memory stops depending on
how big the trace is.
How it's built
A few mechanics, since this crowd will want them:
N = n1 * n2: NTT the rows, apply twiddles, NTT the columns, transpose. Each pass only touches one row or column, so the working set isO(sqrt N). The matrix sits on NVMe and tiles move through a fixed buffer.pread, so the full tree never lives in memory. Columns commit in batched groups (a leaf is thePoseidon hash of a group's columns at one row).
endpoints.
every ~64 MiB keeps it bounded.
Why I trust the numbers
MemoryMax=256M,MemorySwapMax=0). With swap off the limit is a hard wall, so when I say the in-core prover diesand the streaming one lives, that's something I watched, not a model.
Soundness gets poked at too: flip any field in a proof, tamper with a lookup table, hand it wrong LogUp multiplicities, or feed an invalid trace, and it rejects.
Numbers (256 MB cgroup, swap off)
Sponge(public msg) = digest, 54 columns, 2^18 blocks: OOM in core, 121 MB and 15.5 s out of core.Resident set tracks the tile size, not
N. The CPU side is already parallel (a 2^24 self-proof goes 65.7 s to 11.2 s on 16 cores). The thing eating the time is Poseidon hashing inthe commitment, not the streaming I/O.
The part I want your read on
Same Goldilocks/FRI/STARK stack SP1 runs on, so you'd know better than me where this lands. I don't think the interesting axis is cheap CPU RAM, it's VRAM. A GPU prover can't hold a
trace bigger than its VRAM, and that caps how big a proof a given card can do. Does a streaming layer that spills past VRAM actually buy anything (bigger traces per card, or a lower
hardware bar for people running provers in the network), or is it moot because you'd shard the trace or just rent a bigger card?
What this isn't
Straight about the gaps: it's a research artifact. The AIRs are small, though they go up to a real sponge, and the Poseidon is a research one, not a standardized constant set. It is
not wired into SP1 or Plonky3. And since the bottleneck is Poseidon, the obvious next step is GPU hashing, which is the one piece I haven't had the hardware to test.
Full writeup with diagrams and a screen recording of the cgroup run (in-core dies, out-of-core verifies in tens of MB):
https://dev.to/nzengi/a-zk-provers-ram-is-a-dial-not-a-wall-3blg
So is out-of-core or VRAM-bounded proving something you'd find useful, and if so where would it hurt first? Fine to tell me it's irrelevant, that's useful to know too.
Beta Was this translation helpful? Give feedback.
All reactions