
Resolve Prover optimization: memory reduction #77 #6

Merged (2 commits) on Jun 26, 2023

Conversation

@einar-taiko commented Jun 12, 2023

@einar-taiko (Author)

One test fails:

cargo test -- test_coeff_to_extended_slice

@einar-taiko (Author)

cargo test -- test_coeff_to_extended_slice now succeeds.

@einar-taiko einar-taiko force-pushed the einar/pr/mem branch 2 times, most recently from 4d4d459 to 4335944 Compare June 15, 2023 02:17
@mratsim commented Jun 15, 2023

What are or will be the old/new memory requirements?

@einar-taiko (Author) commented Jun 16, 2023

Todo:

  • Measure reduction in memory use
  • Check all tests pass
  • Check Clippy is happy

@einar-taiko einar-taiko marked this pull request as ready for review June 16, 2023 09:06
@mratsim mratsim mentioned this pull request Jun 20, 2023
@einar-taiko (Author)

The most interesting benchmark is the super circuit:

DEGREE={} make super_bench    

where {} is the degree, e.g. 22 or higher.

Since this process of patching a dependency was not familiar to me, I document it here. Apply the patch below to the Cargo.toml of the zkevm-circuits workspace and then run:

cargo clean
DEGREE=19 make tx_bench

to check that it works.
It takes 6 min of compiling + 6 min of computation on my laptop.

Patch

diff --git i/Cargo.toml w/Cargo.toml
index c8155a1b..1144d9a1 100644
--- i/Cargo.toml
+++ w/Cargo.toml
@@ -12,8 +12,8 @@ members = [
     "testool"
 ]
 
-[patch.crates-io]
-halo2_proofs = { git = "https://github.com/privacy-scaling-explorations/halo2.git", tag = "v2023_04_20" }
+[patch."https://github.com/privacy-scaling-explorations/halo2.git"]
+halo2_proofs = { git = "https://github.com/einar-taiko/halo2.git", branch = "einar/pr/mem" }
 
 # Definition of benchmarks profile to use.
 [profile.bench]

@einar-taiko (Author) commented Jun 21, 2023

I have confirmed the memory reduction on my laptop by running one of the cheaper circuits:

  1. before (unpatched halo2_proofs dependency)

        DEGREE=19 command time --verbose -- make tx_bench
        ~>
        Maximum resident set size (kbytes): 8967628
    
  2. after (this patched halo2_proofs dependency)

        DEGREE=19 command time --verbose -- make tx_bench
        ~>
        Maximum resident set size (kbytes): 3298872
    

That is a factor of ~0.37.
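As a sanity check, the reduction factor can be recomputed from the two peak-RSS figures above (a trivial shell sketch; the kbyte values are the ones reported in this comment):

```shell
# Recompute the memory-reduction factor from the two reported peak-RSS values.
before_kb=8967628   # Maximum resident set size, unpatched halo2_proofs
after_kb=3298872    # Maximum resident set size, patched halo2_proofs
awk -v a="$after_kb" -v b="$before_kb" 'BEGIN { printf "factor: %.2f\n", a / b }'
# → factor: 0.37
```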

@mratsim could I ask you to please run

DEGREE=26 command time --verbose -- make super_bench

before and after applying the patch and report the full output here? I am not sure if 26 is the right value.

@mratsim commented Jun 22, 2023

I have good news and bad news.

When I benchmarked the super-circuit on June 2 (degree 20, IIRC), memory usage went from 8 GB after about 8 min to 90 GB after about 12 min:
[screenshot: ksnip_20230602-145623]
[screenshot: ksnip_20230602-150804]
and then the process was killed by the OOM killer.

With the new patch, after 30 min, the max memory used was about 20 GB; however, there is a panic:
[screenshot: ksnip_20230622-115754]

at this location:

let mut t_evaluations = Vec::with_capacity(1 << (extended_k - k));
{
    // Compute the evaluations of t(X) = X^n - 1 in the coset evaluation domain.
    // We don't have to compute all of them, because it will repeat.
    let orig = F::ZETA.pow_vartime(&[n as u64, 0, 0, 0]);
    let step = extended_omega.pow_vartime(&[n as u64, 0, 0, 0]);
    let mut cur = orig;
    loop {
        t_evaluations.push(cur);
        cur *= &step;
        if cur == orig {
            break;
        }
    }
    assert_eq!(t_evaluations.len(), 1 << (extended_k - k));

@han0110 commented Jun 22, 2023

This happens because degree + log2_ceil(cs.degree() - 1) is larger than 28, which is the maximum log size of FFT we can do on bn256::Fr. Perhaps the circuit being tested has degree larger than 9.
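A minimal sketch of the size check @han0110 describes, with hypothetical helper names (this is not the halo2 API); the limit 28 is the two-adicity of bn256::Fr, and the circuit degree 10 below is made up for illustration:

```rust
// Sketch of the FFT-size constraint: the extended evaluation domain has
// size 2^(k + log2_ceil(cs_degree - 1)), and bn256::Fr only supports FFTs
// up to size 2^28 (its two-adicity).
const MAX_LOG_FFT: u32 = 28;

fn log2_ceil(x: u32) -> u32 {
    32 - (x - 1).leading_zeros()
}

/// Returns true if extended_k = k + log2_ceil(cs_degree - 1) fits in an FFT.
fn extended_k_fits(k: u32, cs_degree: u32) -> bool {
    k + log2_ceil(cs_degree - 1) <= MAX_LOG_FFT
}

fn main() {
    // A hypothetical circuit of degree 10: log2_ceil(9) = 4.
    assert_eq!(log2_ceil(9), 4);
    // DEGREE=26 would need extended_k = 30 > 28: panic territory.
    assert!(!extended_k_fits(26, 10));
    // DEGREE=21 needs extended_k = 25 <= 28: fine.
    assert!(extended_k_fits(21, 10));
    println!("ok");
}
```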

@einar-taiko (Author)

@mratsim reports setting DEGREE=21 and adding sufficient swap space yields 129145428 kbytes ~ 129GB for the after benchmark.

[screenshot]

The before benchmark should be under way.

@mratsim commented Jun 22, 2023

The before benchmark was OOM-killed even with 128 GB RAM + 128 GB swap, so there is definitely at least a 2x reduction.
[screenshot: ksnip_20230622-171357]

@einar-taiko einar-taiko requested a review from mratsim June 26, 2023 08:28
@mratsim left a review comment

Some helpful background for reviewing this PR:

This PR ports an external change, and many high-level details from the original idea in zcash#427 got lost along the way.

It would be nice for the evaluate_h inner workings to be split into steps (and potentially substeps), because the function is now large and will likely become hard to get into (lots of state), audit, and refactor.

This can be done either in this PR, or we can create an issue for a later PR.

@@ -179,13 +178,41 @@ impl Calculation {
}
}

#[derive(Clone, Default, Debug)]
struct ConstraintCluster<C: CurveAffine> {

Ideally, this needs an explanation of the purpose of this data structure.


// Lookups
for lookup in cs.lookups.iter() {
constraint_idx += 5;

Where does this 5 come from?

.zip(instance.iter())
.zip(lookups.iter())
.zip(permutations.iter())
let need_to_compute = |part_idx, cluster_idx| part_idx % (num_parts >> cluster_idx) == 0;

This is a key state that is checked 10+ times, ideally there is an explanation of the logic.
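To illustrate the assumed semantics (a standalone reading of the closure quoted above, not halo2 code): cluster i only needs to be recomputed on every (num_parts >> i)-th part, so higher-index clusters are computed more often:

```rust
fn main() {
    let num_parts: usize = 8;
    // Same predicate as in the PR: cluster `cluster_idx` is only computed
    // on parts whose index is a multiple of num_parts >> cluster_idx.
    let need_to_compute =
        |part_idx: usize, cluster_idx: usize| part_idx % (num_parts >> cluster_idx) == 0;

    for cluster_idx in 0..4 {
        let parts: Vec<usize> = (0..num_parts)
            .filter(|&p| need_to_compute(p, cluster_idx))
            .collect();
        println!("cluster {cluster_idx}: parts {parts:?}");
    }
    // cluster 0: parts [0]
    // cluster 1: parts [0, 4]
    // cluster 2: parts [0, 2, 4, 6]
    // cluster 3: parts [0, 1, 2, 3, 4, 5, 6, 7]
}
```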


// Calculate the quotient polynomial for each part
let mut current_extended_omega = one;
for part_idx in 0..num_parts {

The evaluate_h function is now almost 600 lines, with almost 500 lines spent within this for loop.

This will make it hard to keep the state and flow in one's head when reading and refactoring the code later.

Either now or in a later sprint, we likely want a refactoring that factors things into separate functions.

For example:

  • evaluate_custom_gates
  • evaluate_permutations
  • evaluate_lookups
  • ...

&& !need_to_compute(part_idx, 2)
&& !need_to_compute(part_idx, running_prod_cluster)
{
constraint_idx += 5;

Where does this 5 come from?

stop_measure(start);
// Align the constraints by different powers of y.
for (i, cluster) in value_part_clusters.iter_mut().enumerate() {
if need_to_compute(part_idx, i) && cluster_last_constraint_idx[i] > 0 {

All other instances of if need_to_compute were preceded by constraint_idx += 1;, constraint_idx += sets.len();, or constraint_idx += sets.len() - 1;.

As the evaluation is now quite complex, it would be easier to read and review by separating the inner for loop (500 lines) from the prologue and epilogue (this part).

values: transposed.into_iter().flatten().collect(),
_marker: PhantomData,
}
}

Future TODO: optimize this.

The transposition could be done without an intermediary step + flatten at the end.

Also, if this is a bottleneck, transposition can be improved 4x even on serial code, with cache blocking.

See my benchmarks of transposition algorithms at: https://github.com/mratsim/laser/blob/e23b5d63f58441968188fb95e16862d1498bb845/benchmarks/transpose/transpose_bench.nim#L558-L674

The change in algorithm is simple.

This version is 3x slower:

    for (int i = 0; i < `M`; i++)
      #pragma omp parallel for simd
      for (int j = 0; j < `N`; j++)
        `po`[i+j*`M`] = `pa`[j+i*`N`];

than this (1D-blocking)

    // No min function in C ...
    #define min(a,b) (((a)<(b))?(a):(b))

    #pragma omp parallel for
    for (int i = 0; i < `M`; i+=`blck`)
      for (int j = 0; j < `N`; ++j)
        #pragma omp simd
        for (int ii = i; ii < min(i+`blck`,`M`); ++ii)
          `po`[ii+j*`M`] = `pa`[j+ii*`N`];

or 4x slower than this (2D blocking)

    #define min(a,b) (((a)<(b))?(a):(b))

    #pragma omp parallel for collapse(2)
    for (int j = 0; j < `N`; j+=`blck`)
      for (int i = 0; i < `M`; i+=`blck`)
        for (int jj = j; jj<j+`blck` && jj<`N`; jj++)
          #pragma omp simd
          for (int ii = i; ii<min(i+`blck`,`M`); ii++)
            `po`[ii+jj*`M`] = `pa`[jj+ii*`N`];
