
Resolve Prover optimization: memory reduction #77 #6

Merged (2 commits) on Jun 26, 2023

Conversation

@einar-taiko commented Jun 12, 2023

@einar-taiko (Author)

One test fails:

cargo test -- test_coeff_to_extended_slice

@einar-taiko (Author)

cargo test -- test_coeff_to_extended_slice now succeeds.

@einar-taiko einar-taiko force-pushed the einar/pr/mem branch 2 times, most recently from 4d4d459 to 4335944 Compare June 15, 2023 02:17
@mratsim commented Jun 15, 2023

What are or will be the old/new memory requirements?

@einar-taiko (Author) commented Jun 16, 2023

Todo:

  • Measure reduction in memory use
  • Check all tests pass
  • Check Clippy is happy

@einar-taiko einar-taiko marked this pull request as ready for review June 16, 2023 09:06
@mratsim mratsim mentioned this pull request Jun 20, 2023
@einar-taiko (Author)

The most interesting benchmark is the super circuit:

DEGREE={} make super_bench    

where {} is the degree, e.g. 22 or higher.

Since this process of patching a dependency was not familiar to me, I document it here. Apply the patch below to the Cargo.toml of the zkevm-circuits workspace and then run:

cargo clean
DEGREE=19 make tx_bench

to check that it works.
It takes 6 min of compiling + 6 min of computation on my laptop.

Patch

diff --git i/Cargo.toml w/Cargo.toml
index c8155a1b..1144d9a1 100644
--- i/Cargo.toml
+++ w/Cargo.toml
@@ -12,8 +12,8 @@ members = [
     "testool"
 ]
 
-[patch.crates-io]
-halo2_proofs = { git = "https://github.com/privacy-scaling-explorations/halo2.git", tag = "v2023_04_20" }
+[patch."https://github.com/privacy-scaling-explorations/halo2.git"]
+halo2_proofs = { git = "https://github.com/einar-taiko/halo2.git", branch = "einar/pr/mem" }
 
 # Definition of benchmarks profile to use.
 [profile.bench]

@einar-taiko (Author) commented Jun 21, 2023

I have confirmed the memory reduction on my laptop by running one of the cheaper circuits:

  1. before (unpatched halo2_proofs dependency)

        DEGREE=19 command time --verbose -- make tx_bench
        ~>
        Maximum resident set size (kbytes): 8967628
    
  2. after (this patched halo2_proofs dependency)

        DEGREE=19 command time --verbose -- make tx_bench
        ~>
        Maximum resident set size (kbytes): 3298872
    

That is a factor of ~0.37.
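As a sanity check, the reduction factor can be recomputed from the two peak-RSS figures above (a trivial shell sketch; the kbyte values are the ones reported in this comment):

```shell
# Recompute the memory-reduction factor from the two reported peak-RSS values.
before_kb=8967628   # Maximum resident set size, unpatched halo2_proofs
after_kb=3298872    # Maximum resident set size, patched halo2_proofs
awk -v a="$after_kb" -v b="$before_kb" 'BEGIN { printf "factor: %.2f\n", a / b }'
# → factor: 0.37
```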

@mratsim could I ask you to please run

DEGREE=26 command time --verbose -- make super_bench

before and after applying the patch and report the full output here? I am not sure if 26 is the right value.

@mratsim commented Jun 22, 2023

I have good news and bad news.

When I benchmarked the super-circuit on June 2 (degree 20, IIRC), memory usage went from 8 GB after about 8 min to 90 GB after about 12 min:
[screenshot: ksnip_20230602-145623]
[screenshot: ksnip_20230602-150804]
and then the process was killed by the OOM killer.

With the new patch, after 30 min, the max memory used was about 20 GB; however, there is a panic:
[screenshot: ksnip_20230622-115754]

at this location:

let mut t_evaluations = Vec::with_capacity(1 << (extended_k - k));
{
    // Compute the evaluations of t(X) = X^n - 1 in the coset evaluation domain.
    // We don't have to compute all of them, because it will repeat.
    let orig = F::ZETA.pow_vartime(&[n as u64, 0, 0, 0]);
    let step = extended_omega.pow_vartime(&[n as u64, 0, 0, 0]);
    let mut cur = orig;
    loop {
        t_evaluations.push(cur);
        cur *= &step;
        if cur == orig {
            break;
        }
    }
    assert_eq!(t_evaluations.len(), 1 << (extended_k - k));

@han0110 commented Jun 22, 2023

This happens because degree + log2_ceil(cs.degree() - 1) is larger than 28, which is the maximum log size of FFT we can do on bn256::Fr. Perhaps the circuit being tested has degree larger than 9.
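A minimal sketch of the size check @han0110 describes, with hypothetical helper names (this is not the halo2 API); the limit 28 is the two-adicity of bn256::Fr, and the circuit degree 10 below is made up for illustration:

```rust
// Sketch of the FFT-size constraint: the extended evaluation domain has
// size 2^(k + log2_ceil(cs_degree - 1)), and bn256::Fr only supports FFTs
// up to size 2^28 (its two-adicity).
const MAX_LOG_FFT: u32 = 28;

fn log2_ceil(x: u32) -> u32 {
    32 - (x - 1).leading_zeros()
}

/// Returns true if extended_k = k + log2_ceil(cs_degree - 1) fits in an FFT.
fn extended_k_fits(k: u32, cs_degree: u32) -> bool {
    k + log2_ceil(cs_degree - 1) <= MAX_LOG_FFT
}

fn main() {
    // A hypothetical circuit of degree 10: log2_ceil(9) = 4.
    assert_eq!(log2_ceil(9), 4);
    // DEGREE=26 would need extended_k = 30 > 28: panic territory.
    assert!(!extended_k_fits(26, 10));
    // DEGREE=21 needs extended_k = 25 <= 28: fine.
    assert!(extended_k_fits(21, 10));
    println!("ok");
}
```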

@einar-taiko (Author)

@mratsim reports setting DEGREE=21 and adding sufficient swap space yields 129145428 kbytes ~ 129GB for the after benchmark.

[screenshot]

The before benchmark should be under way.

@mratsim commented Jun 22, 2023

The before benchmark was OOM-killed even with 128 GB RAM + 128 GB swap, so there is definitely at least a 2x reduction.
[screenshot: ksnip_20230622-171357]

@einar-taiko einar-taiko requested a review from mratsim June 26, 2023 08:28
@mratsim left a review comment

Some helpful background for reviewing this PR:

This PR ports an external change, and many high-level details from the original idea in zcash#427 got lost along the way.

It would be nice for the evaluate_h inner workings to be split into steps (and potentially substeps), because the function is now large and will likely become hard to get into (lots of state), audit, and refactor.

This can be done either in this PR, or we can create an issue for a later PR.

@@ -179,13 +178,41 @@ impl Calculation {
}
}

#[derive(Clone, Default, Debug)]
struct ConstraintCluster<C: CurveAffine> {

Ideally, this needs an explanation of the purpose of this data structure.


// Lookups
for lookup in cs.lookups.iter() {
constraint_idx += 5;

Where does this 5 come from?

.zip(instance.iter())
.zip(lookups.iter())
.zip(permutations.iter())
let need_to_compute = |part_idx, cluster_idx| part_idx % (num_parts >> cluster_idx) == 0;

This is a key state that is checked 10+ times, ideally there is an explanation of the logic.
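To illustrate the assumed semantics (a standalone reading of the closure quoted above, not halo2 code): cluster i only needs to be recomputed on every (num_parts >> i)-th part, so higher-index clusters are computed more often:

```rust
fn main() {
    let num_parts: usize = 8;
    // Same predicate as in the PR: cluster `cluster_idx` is only computed
    // on parts whose index is a multiple of num_parts >> cluster_idx.
    let need_to_compute =
        |part_idx: usize, cluster_idx: usize| part_idx % (num_parts >> cluster_idx) == 0;

    for cluster_idx in 0..4 {
        let parts: Vec<usize> = (0..num_parts)
            .filter(|&p| need_to_compute(p, cluster_idx))
            .collect();
        println!("cluster {cluster_idx}: parts {parts:?}");
    }
    // cluster 0: parts [0]
    // cluster 1: parts [0, 4]
    // cluster 2: parts [0, 2, 4, 6]
    // cluster 3: parts [0, 1, 2, 3, 4, 5, 6, 7]
}
```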


// Calculate the quotient polynomial for each part
let mut current_extended_omega = one;
for part_idx in 0..num_parts {

The evaluate_h function is now almost 600 lines, with almost 500 lines spent within this for loop.

This will make it hard to keep the state and flow in one's head when reading and refactoring the code later.

Either now or in a later sprint, we likely want a refactoring that factors things into separate functions.

For example:

  • evaluate_custom_gates
  • evaluate_permutations
  • evaluate_lookups
  • ...

&& !need_to_compute(part_idx, 2)
&& !need_to_compute(part_idx, running_prod_cluster)
{
constraint_idx += 5;

Where does this 5 come from?

stop_measure(start);
// Align the constraints by different powers of y.
for (i, cluster) in value_part_clusters.iter_mut().enumerate() {
if need_to_compute(part_idx, i) && cluster_last_constraint_idx[i] > 0 {

All other instances of if need_to_compute were preceded by constraint_idx += 1;, constraint_idx += sets.len();, or constraint_idx += sets.len() - 1;.

As the evaluation is now quite complex, it would be easier to read and review by separating the inner for loop (500 lines) from the prologue and epilogue (this part).

values: transposed.into_iter().flatten().collect(),
_marker: PhantomData,
}
}

Future TODO: optimize this.

The transposition could be done without an intermediary step + flatten at the end.

Also, if this is a bottleneck, transposition can be improved 4x even on serial code, with cache blocking.

See my benchmarks of transposition algorithms at: https://github.com/mratsim/laser/blob/e23b5d63f58441968188fb95e16862d1498bb845/benchmarks/transpose/transpose_bench.nim#L558-L674

The change in algorithm is simple.

This version is 3x slower:

    for (int i = 0; i < `M`; i++)
      #pragma omp parallel for simd
      for (int j = 0; j < `N`; j++)
        `po`[i+j*`M`] = `pa`[j+i*`N`];

than this (1D-blocking)

    // No min function in C ...
    #define min(a,b) (((a)<(b))?(a):(b))

    #pragma omp parallel for
    for (int i = 0; i < `M`; i+=`blck`)
      for (int j = 0; j < `N`; ++j)
        #pragma omp simd
        for (int ii = i; ii < min(i+`blck`,`M`); ++ii)
          `po`[ii+j*`M`] = `pa`[j+ii*`N`];

or 4x slower than this (2D blocking)

    #define min(a,b) (((a)<(b))?(a):(b))

    #pragma omp parallel for collapse(2)
    for (int j = 0; j < `N`; j+=`blck`)
      for (int i = 0; i < `M`; i+=`blck`)
        for (int jj = j; jj<j+`blck` && jj<`N`; jj++)
          #pragma omp simd
          for (int ii = i; ii<min(i+`blck`,`M`); ii++)
            `po`[ii+jj*`M`] = `pa`[jj+ii*`N`];
