Resolve Prover optimization: memory reduction #77 #6
Conversation
1 test fails:
What are or will be the old/new memory requirements?
Todo:
The most interesting benchmark is the super circuit.
Since this process of patching a dependency is not familiar to me, I document it here. Apply the patch below to the
to check it works. Patch
|
I have confirmed the memory reduction on my laptop by running one of the cheaper circuits:
that is a factor of 0.37. @mratsim, could I ask you to please run
before and after applying the patch and report the full output here? I am not sure if 26 is the right value.
I have good news and bad news. When I benchmarked the super-circuit on June 2 (with degree 20, IIRC, in about 8 min), memory usage went from 8 GB to 90 GB in about 12 min. With the new patch, after 30 min, the max memory used was about 20 GB; however, there is a panic at this location:
halo2/halo2_proofs/src/poly/domain.rs, lines 98 to 112, in d0b65f4
This happens because the
@mratsim reports setting
The "before" benchmark should be underway.
Some helpful background for reviewing this PR:
- https://learn.0xparc.org/materials/halo2/learning-group-1/introduction/
- On the circuit layout, at 45 min: https://youtu.be/W_zlH2mmtZA?t=2729

This is porting an external change over, and many of the high-level details from the original idea in zcash#427 got lost.
It would be nice for the inner workings of evaluate_h to be split into steps (potentially substeps), because the function is now large and will likely become hard to get into (lots of state), audit, and refactor.
This can be done either in this PR, or we can create an issue for a later PR.
@@ -179,13 +178,41 @@ impl Calculation {
}
}

#[derive(Clone, Default, Debug)]
struct ConstraintCluster<C: CurveAffine> {
Ideally, this needs an explanation of the purpose of this data structure.
// Lookups
for lookup in cs.lookups.iter() {
constraint_idx += 5;
Where does this 5 come from?
.zip(instance.iter())
.zip(lookups.iter())
.zip(permutations.iter())
let need_to_compute = |part_idx, cluster_idx| part_idx % (num_parts >> cluster_idx) == 0;
This is a key condition that is checked 10+ times; ideally there would be an explanation of the logic.
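To illustrate my reading of the predicate (an interpretation, not confirmed by the PR author), here is a standalone sketch; the original is a closure capturing `num_parts`, made an explicit argument here:

```rust
// Standalone sketch of the PR's `need_to_compute` predicate. Cluster `c` is
// evaluated only on every (num_parts >> c)-th part, i.e. on 2^c of the parts
// in total, so lower-indexed clusters are recomputed on fewer parts.
fn need_to_compute(num_parts: usize, part_idx: usize, cluster_idx: usize) -> bool {
    part_idx % (num_parts >> cluster_idx) == 0
}
```

For example, with `num_parts = 8`: cluster 0 is evaluated only on part 0, cluster 1 on parts 0 and 4, and cluster 3 on every part.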
// Calculate the quotient polynomial for each part
let mut current_extended_omega = one;
for part_idx in 0..num_parts {
The evaluate_h function is now almost 600 lines, with almost 500 of them spent within this for loop.
This will make it hard to keep the state and flow in one's head when reading and refactoring the code later.
We likely want, either now or in a later sprint, a refactoring that splits things into different functions, for example:
- evaluate_custom_gates
- evaluate_permutations
- evaluate_lookups
- ...
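A minimal sketch of what that split could look like; the context struct and all signatures below are invented for illustration and are not the actual halo2 types (the real helpers would also carry polynomial values, challenges, etc.):

```rust
// Hypothetical skeleton only: bundle the loop-carried state into one context
// struct, then give each constraint family its own helper function.
struct EvalContext {
    constraint_idx: usize,
    num_lookups: usize,
}

fn evaluate_custom_gates(ctx: &mut EvalContext) {
    // ...evaluate gate constraints for this part...
    ctx.constraint_idx += 1;
}

fn evaluate_permutations(ctx: &mut EvalContext) {
    // ...evaluate permutation constraints for this part...
    ctx.constraint_idx += 1;
}

fn evaluate_lookups(ctx: &mut EvalContext) {
    // Per the diff above, each lookup contributes 5 constraints.
    ctx.constraint_idx += 5 * ctx.num_lookups;
}

// The ~500-line loop body then reads as a sequence of named steps.
fn evaluate_h_part(ctx: &mut EvalContext) {
    evaluate_custom_gates(ctx);
    evaluate_permutations(ctx);
    evaluate_lookups(ctx);
}
```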
&& !need_to_compute(part_idx, 2)
&& !need_to_compute(part_idx, running_prod_cluster)
{
constraint_idx += 5;
Where does this 5 come from?
stop_measure(start);
// Align the constraints by different powers of y.
for (i, cluster) in value_part_clusters.iter_mut().enumerate() {
if need_to_compute(part_idx, i) && cluster_last_constraint_idx[i] > 0 {
All other instances of if need_to_compute were preceded by constraint_idx += 1;, constraint_idx += sets.len();, or constraint_idx += sets.len() - 1;.
As the evaluation is now quite complex, it would be easier to read and review if the inner for loop (500 lines) were separated from the prologue and epilogue (this part).
values: transposed.into_iter().flatten().collect(),
_marker: PhantomData,
}
}
Future TODO: optimize this.
The transposition could be done without an intermediate step + flatten at the end.
Also, if this is a bottleneck, transposition can be improved 4x even on serial hardware, with cache blocking.
See my benchmarks of transposition algorithms at: https://github.com/mratsim/laser/blob/e23b5d63f58441968188fb95e16862d1498bb845/benchmarks/transpose/transpose_bench.nim#L558-L674
The change in algorithm is simple; this is 3x slower:

```c
for (int i = 0; i < `M`; i++)
  #pragma omp parallel for simd
  for (int j = 0; j < `N`; j++)
    `po`[i+j*`M`] = `pa`[j+i*`N`];
```

than this (1D blocking):

```c
// No min function in C ...
#define min(a,b) (((a)<(b))?(a):(b))

#pragma omp parallel for
for (int i = 0; i < `M`; i+=`blck`)
  for (int j = 0; j < `N`; ++j)
    #pragma omp simd
    for (int ii = i; ii < min(i+`blck`,`M`); ++ii)
      `po`[ii+j*`M`] = `pa`[j+ii*`N`];
```

or 4x slower than this (2D blocking):

```c
#define min(a,b) (((a)<(b))?(a):(b))

#pragma omp parallel for collapse(2)
for (int j = 0; j < `N`; j+=`blck`)
  for (int i = 0; i < `M`; i+=`blck`)
    for (int jj = j; jj<j+`blck` && jj<`N`; jj++)
      #pragma omp simd
      for (int ii = i; ii<min(i+`blck`,`M`); ii++)
        `po`[ii+jj*`M`] = `pa`[jj+ii*`N`];
```
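On the Rust side, a hedged sketch of the same 1D-blocking idea, writing transposed values directly into one flat Vec (avoiding the intermediate nested structure plus flatten); the element type, the helper name `transpose_flat`, and the block size are illustrative assumptions, not the PR's code:

```rust
/// Transpose a row-major `rows x cols` matrix directly into a flat,
/// row-major `cols x rows` output, with 1D cache blocking as in the
/// C benchmark above. BLCK = 32 is an illustrative tuning constant.
fn transpose_flat(src: &[u64], rows: usize, cols: usize) -> Vec<u64> {
    const BLCK: usize = 32;
    assert_eq!(src.len(), rows * cols);
    let mut dst = vec![0u64; rows * cols];
    for i in (0..rows).step_by(BLCK) {
        for j in 0..cols {
            for ii in i..(i + BLCK).min(rows) {
                // Output element (j, ii) = input element (ii, j).
                dst[j * rows + ii] = src[ii * cols + j];
            }
        }
    }
    dst
}
```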
taikoxyz/zkevm-circuits#77