
# Understand how to optimize a circuit implementation of Groth16 verification #3425

opened this issue Jul 29, 2018 · 74 comments

This ticket will be a discussion of optimizations for implementing recursive SNARKs. For Groth16 verification we need pairings, and for pairings we need extension field arithmetic. Assume that the base field arithmetic is efficient in the circuit — which in practice requires it to be the same as the R1CS field. Call that field Fp. We'll use C to refer to the cost of an R1CS constraint.

Table 1 of Beuchat, González-Díaz, Mitsunari, Okamoto, Rodríguez-Henríquez, and Teruya gives costs for operations in Fp2, Fp6, and Fp12, using Karatsuba multiplication and squaring. (This part, and in fact most of the pairing algorithm, doesn't differ between BN and BLS curves.) Let (a, m, s, d), (a~, m~, s~, d~), and (A, M, S, D) denote the costs of field addition, multiplication, squaring, and division in Fp, Fp2, and Fp6, respectively. This matches the paper except that we use "d" for division instead of "i" for inversion, since a division is its own operation which need not be decomposed into a multiplication and an inversion.

In the circuit we have a = 0 and m = s = d = 1C. Karatsuba is certainly optimal for multiplication and squaring in Fp2, so that gives m~ = 3C and s~ = 2C. [Edit: I was thinking of Complex squaring and calling it Karatsuba here.] Division (or inversion) can be implemented via multiplication: d~ = 3C.

It isn't clear to me that Karatsuba is optimal for the extension of Fp2 to Fp6. It gives 18C for multiplication, but I believe we can do better by applying Toom-3 to get M = 5m~ = 15C. Note that the reduction from 5 Fp2 terms to 3 terms is linear, and so free. On the other hand, that appears to beat Montgomery's 17C here, so maybe I made a mistake. (Major impostor syndrome here :-p )
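To make the Fp2 costs concrete, here is a sketch of Karatsuba multiplication (3 base-field multiplications, hence 3C) and Complex squaring (2 multiplications, hence 2C). The modulus and non-residue are toy stand-ins for illustration, not real curve parameters; all the additions and constant multiplications are linear and therefore free in R1CS.

```python
# Fp2 elements are pairs (a0, a1) representing a0 + a1*i with i^2 = BETA.
P = 101          # toy prime, stand-in for the R1CS field modulus
BETA = -1 % P    # quadratic non-residue defining Fp2 = Fp[i]/(i^2 - BETA)

def mul_karatsuba(a, b):
    """Karatsuba: 3 base-field multiplications => m~ = 3C."""
    a0, a1 = a
    b0, b1 = b
    v0 = a0 * b0 % P                                # mult 1
    v1 = a1 * b1 % P                                # mult 2
    c1 = ((a0 + a1) * (b0 + b1) - v0 - v1) % P      # mult 3
    c0 = (v0 + BETA * v1) % P                       # linear, free in R1CS
    return (c0, c1)

def sqr_complex(a):
    """Complex squaring: 2 base-field multiplications => s~ = 2C."""
    a0, a1 = a
    v0 = a0 * a1 % P                                # mult 1
    c0 = ((a0 + a1) * (a0 + BETA * a1) - v0 - BETA * v0) % P  # mult 2
    c1 = 2 * v0 % P
    return (c0, c1)
```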

Fp6 squaring can be done by Toom-3 in 5s~ = 10C, which beats "Squaring Method 3" (1m~ + 4s~ = 11C) in Chung and Hasan by an m~ - s~ tradeoff. Impostor syndrome intensifies! (Granted, they were considering the case where linear combinations are not free.) Edit: actually Chung and Hasan do consider this variant; it's the first row of Table 2.

### Pratyush commented Jul 30, 2018

 Just as a friendly heads up, BLS12-381's Fq is not highly 2-adic (q - 1 is not divisible by a large power of 2), and so using it in a SNARK is not so efficient (one would need to use different, less efficient evaluation domains instead of the efficient powers-of-2 stuff that most SNARK impls use). For comparison, BLS12-381's Fr is highly 2-adic. (We ran into the same issues in our project, and had to sample a new BLS curve that was highly 2-adic on both sides.)
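This is easy to check directly. A small sketch (the hex constants are the standard BLS12-381 moduli):

```python
# 2-adicity = number of times 2 divides (modulus - 1); a large value is what
# makes radix-2 FFT evaluation domains available to the SNARK prover.
Q = 0x1a0111ea397fe69a4b1ba7b6434bacd764774b84f38512bf6730d2a0f6b0f6241eabfffeb153ffffb9feffffffffaaab  # BLS12-381 base field Fq
R = 0x73eda753299d7d483339d80809a1d80553bda402fffe5bfeffffffff00000001  # BLS12-381 scalar field Fr

def two_adicity(n):
    v = 0
    while n % 2 == 0:
        n //= 2
        v += 1
    return v
```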

### Pratyush commented Jul 30, 2018

 Also, there's a good comparison of the number of multiplications (and hence constraints) required to implement various finite-field extension arithmetic here: https://eprint.iacr.org/2006/471.pdf

 @Pratyush Yes, I knew that but thanks for clarifying. Do you mind giving the parameters of your curve? The comments on Fp2 and Fp6 arithmetic above are not specific to the curve, only (somewhat) to the extension degree. Thanks for the reference. Edit: the best (circuit) algorithm for multiplication in Fp6 seems to be Toom-3 over Karatsuba, which costs 15m = 15C as given in table 9 of https://eprint.iacr.org/2006/471.pdf . So we have M = D = 15C and S = 10C. (A = a~ = a = 0C, and reductions are also free.)
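As an illustration of why only the pointwise products cost constraints, here is a Toom-3 sketch for degree-2 polynomials (the shape of Fp6 over Fp2), over a toy prime field: 5 multiplications, with evaluation and interpolation both linear and hence free.

```python
P = 101  # toy prime modulus standing in for the coefficient field

def toom3_mul(a, b):
    """a, b: coefficient triples (a0, a1, a2). Returns the 5 product coefficients."""
    a0, a1, a2 = a
    b0, b1, b2 = b
    # Evaluate both operands at x = 0, 1, -1, 2, infinity (all linear).
    e0   = a0 * b0 % P                                    # mult 1
    e1   = (a0 + a1 + a2) * (b0 + b1 + b2) % P            # mult 2
    em1  = (a0 - a1 + a2) * (b0 - b1 + b2) % P            # mult 3
    e2   = (a0 + 2*a1 + 4*a2) * (b0 + 2*b1 + 4*b2) % P    # mult 4
    einf = a2 * b2 % P                                    # mult 5
    # Interpolate c(x) = c0 + c1*x + ... + c4*x^4 (linear, free in R1CS).
    inv2, inv3 = pow(2, -1, P), pow(3, -1, P)
    c0, c4 = e0, einf
    c2 = ((e1 + em1) * inv2 - c0 - c4) % P
    t = (e1 - em1) * inv2 % P                   # t = c1 + c3
    u = (e2 - c0 - 4*c2 - 16*c4) * inv2 % P     # u = c1 + 4*c3
    c3 = (u - t) * inv3 % P
    c1 = (t - c3) % P
    return (c0, c1, c2, c3, c4)
```

The division by 3 in the interpolation is a multiplication by a fixed constant, so it stays free in the circuit.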

 The next step is curve arithmetic on E/Fp and E'/Fp2. For short Weierstrass curves (e.g. any BN or BLS curve), it is most efficient to use affine coordinates; the formulae are the same as for Montgomery curves except for the numerator of λ in doubling, which does not affect the constraint count. So, for E/Fp, incomplete addition costs 1s + 1m + 1d = 3C, and doubling costs 2s + 1m + 1d = 4C. For E'/Fp2, incomplete addition costs 1s~ + 1m~ + 1d~, and doubling costs 2s~ + 1m~ + 1d~. With Fp2 implemented using Karatsuba multiplication (m~ = d~ = 3m) and Complex squaring (s~ = 2m), that is (2 + 3 + 3)m = 8C for incomplete addition, and (4 + 3 + 3)m = 10C for doubling. [Edit: use d~ for division instead of i~ for inversion, since this more accurately represents the circuit operation.]
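A sketch of these affine formulae with a constraint counter, on a toy curve y² = x³ + 3 over F₁₀₁ (illustrative parameters only): each call to fmul/fsqr/fdiv stands for one constraint, matching the 3C / 4C counts above.

```python
P, B = 101, 3   # toy short-Weierstrass curve y^2 = x^3 + 3 over F_101
cost = {"m": 0, "s": 0, "d": 0}   # constraint counter: each op costs 1C

def fmul(a, b): cost["m"] += 1; return a * b % P
def fsqr(a):    cost["s"] += 1; return a * a % P
def fdiv(a, b): cost["d"] += 1; return a * pow(b, -1, P) % P

def add(p, q):
    """Incomplete addition: 1d + 1s + 1m = 3C (fails if x1 == x2)."""
    (x1, y1), (x2, y2) = p, q
    lam = fdiv(y2 - y1, x2 - x1)
    x3 = (fsqr(lam) - x1 - x2) % P
    y3 = (fmul(lam, x1 - x3) - y1) % P
    return (x3, y3)

def dbl(p):
    """Doubling: 1d + 2s + 1m = 4C (numerator of lambda differs, cost doesn't)."""
    x1, y1 = p
    lam = fdiv(3 * fsqr(x1), 2 * y1)
    x3 = (fsqr(lam) - 2 * x1) % P
    y3 = (fmul(lam, x1 - x3) - y1) % P
    return (x3, y3)
```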

 Next is the final exponentiation, for which we need arithmetic in Fp12 and its cyclotomic subgroup GΦ6(Fp2).

For multiplication in Fp12, we can use Karatsuba over Fp6, which gives 3M = 45C. For squaring we can use Complex squaring over Fp6, which gives 2M = 30C. (Those costs are for the tower Fp2 → Fp6 → Fp12. If we instead used Fp2 → Fp4 → Fp12 with Toom-3 over Karatsuba over Karatsuba, multiplication would require 15m~ = 45C and squaring 10m~ = 30C, so there would be no difference.)

For squaring in GΦ6, we represent elements in the compressed representation (g2, g3, g4, g5) of Karabina. We can use the main algorithm in Karabina's paper which requires 4m~ = 12C, or the "SQR2345" one from that paper (with 2 squarings in Fp4) which is also 12C, or the algorithm in section 5.2 of Aranha, Karabina, Longa, Gebotys and López which requires 6s~, which again is 12C. So we can see that this saves significantly over the 30C needed to do squarings directly in Fp12.

For multiplication in GΦ6, we cannot use the compressed representation. In the case needed for the final exponentiation, where the exponent is sparse, it is most efficient to compute all of the squarings in compressed format and then decompress the ones we need. A decompression uses the formulae:

    g0 = (2.g1² + g2.g5 - 3.g3.g4).ξ + 1
    g1 = (g5².ξ + 3.g4² - 2.g3) / 4.g2,  if g2 ≠ 0
    g1 = (2.g4.g5) / g3,                 if g2 = 0

The formula for g0 can be computed in terms of g1 without any conditionals. We require 3C for a "boolean scaling" which computes a boolean b which is 0 when g2 = 0, otherwise 1:

    (1 - b) × (b) = 0
    (g2) × (λ) = (b)
    (μ) × (b) = (g2)

and then 4C to select the numerator and denominator of the division (since each element of Fp2 is 2 field elements). Note that computing two divisions and then paying 2C for a single selection is less efficient; trying to combine the boolean scaling with the selection (which would work if we only had a single selection) is also less efficient.

We then need 2s~ + m~ to compute g5², g4², and g4.g5, and d~ for the division. We need s~ + 2m~ to compute g0. Overall this costs 3C + 4C + 2s~ + m~ + d~ + s~ + 2m~ = (3 + 4 + 4 + 3 + 3 + 2 + 6)C = 25C for each decompression.

So when u is sparse with k bits set and is ℓ bits in length, exponentiation by u (which is a component of the "hard part" of the final exponentiation) costs ℓ * 12C + (k-1) * (25 + 45)C = ℓ * 12C + (k-1) * 70C. The Karabina squaring and decompression pays off whenever (30 - 12)*ℓ > 25*(k-1), i.e. whenever k < 0.72*ℓ + 1, which is true for typical pairing-friendly curves.

We also need p-power and p²-power Frobenius operations, i.e. g ↦ g^p and g ↦ g^(p²), for which we can use the formulae in section 3.2 of Beuchat, González-Díaz, Mitsunari, Okamoto, Rodríguez-Henríquez, and Teruya. These formulae only involve constant multiplications and conjugations, so they are free in the circuit.

For the curve used in Aranha, Karabina, Longa, Gebotys and López: the final exponentiation executes in total 1 inversion, 4 conjugations, 15 multiplications, 3 u-th powers, 4 cyclotomic squarings, 5 p-power Frobenius, and 3 p²-power Frobenius. So that is:

* 45C for the inversion in Fp12 (the conjugations are free);
* 15 * 45C for the multiplications in Fp12;
* 3 * (62 * 12C + 2 * 70C) = 3 * 884C for the 3 u-th powers;
* 4 * 12C for the cyclotomic squarings (the Frobenius computations are free)

giving a total of 3420C for the final exponentiation. [Edit: corrected arithmetic error.] [Edit: we can combine the inversion and one of the multiplications into a division in Fp12, saving 45C. This gives a total of 3375C.]
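The boolean scaling gadget can be sketched as follows (toy modulus; λ and μ are prover-supplied hint witnesses, and the three asserts stand for the three R1CS constraints):

```python
P = 101  # toy stand-in for the circuit field modulus

def boolean_scaling(g2):
    """Witness b, lam, mu such that b = 0 iff g2 = 0, enforced by 3 constraints."""
    g2 %= P
    if g2 == 0:
        b, lam, mu = 0, 0, 0          # mu is unconstrained in this branch; 0 works
    else:
        b, lam, mu = 1, pow(g2, -1, P), g2
    # the three R1CS constraints (3C):
    assert (1 - b) * b % P == 0       # b is boolean
    assert g2 * lam % P == b          # g2 = 0 forces b = 0
    assert mu * b % P == g2           # b = 0 forces g2 = 0
    return b
```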

The remaining part of a pairing computation is the Miller loop. The basic operations in the body of the loop are "point doubling with line evaluation" and "point addition with line evaluation".

### Doubling with line evaluation

A doubling on E'/Fp2 requires 10C as computed above. Note that this already computes x1². So the line evaluation (based on formula (2) of Aranha et al):

l = (−2.y1.yP).v.w + (3.x1².xP).v² + ξ.(3b' - y1²)

can be computed in an additional 2m~ + s~ = 8C, for a total of 18C. There are alternative ways of combining doubling and line evaluation, but I wasn't able to find any further improvement in the constraint cost.

TODO: check whether (xP, yP) is sometimes constant (this depends on how the pairing is used).

### Addition with line evaluation

An incomplete addition on E'/Fp2 costs 8C, also as computed above. Incomplete addition is sufficient provided that the input points P and Q are of large order. Aranha et al don't give formulae for the line evaluation used in the addition step, but their work is based on section 5 of Costello, Lange, and Naehrig (in turn based on section 4 of Costello, Hısıl, Boyd, Nieto, and Wong) which has very similar formulae for the doubling case. Switching to affine coordinates:

l = (y1 - yP).(xP - xQ) + (x1 - xP).(yQ - yP)

or something like that. (This formula is probably slightly wrong because it is the version that eliminates the twisting of the point P, which is an optimization from Costello, Lange, and Naehrig that is handled differently in Aranha et al; but I don't think that affects the operation count.) So that is an additional 2m~, for a total of 14C.

### Miller loop

Aranha et al section 6 says:

> For the selected parameters and with the presented improvements, the Miller Loop in Algorithm 6 executes 64 point doublings with line evaluations, 6 point additions with line evaluations (4 inside Miller Loop and 2 more at the final steps), 1 negation in Fp2 to precompute yP, 1 p-power Frobenius, 1 p2-power Frobenius and 2 negations in E'(Fp2); and 1 conjugation, 1 multiplication, 66 sparse multiplications, 2 sparser multiplications and 63 squarings in Fp12.

For now we won't bother to optimize the sparse[r] multiplications (partly because the paper doesn't describe them), so we will count those as full multiplications requiring 45C each. (This adds up to 3060C, so it is an obvious area for improvement.)

The resulting total cost of the rest of the pairing, i.e. everything but the final exponentiation, is 64 * 18C + 6 * 14C + (1+66+2) * 45C + 63 * 30C = 6231C.

 Grewal, Azarderakhsh, Longa, Hu, and Jao section 4.1 gives doubling/addition-with-line-evaluation formulae for affine coordinates on y² = x³ + b curves. In both cases they require only an extra 2m~ for the line evaluation, i.e. 2s~ + 1m~ + 1d~ + 2m~ = 16C for doubling with line evaluation (improving on 18C above), and 1s~ + 1m~ + 1d~ + 2m~ = 14C for incomplete addition with line evaluation (same as above).

The sparse multiplication is described in "Pseudo 8-Sparse Multiplication for Efficient Ate-Based Pairing on Barreto-Naehrig Curve", by Yuki Mori, Shoichi Akagi, Yasuyuki Nogami, and Masaaki Shirase, DOI: 10.1007/978-3-319-04873-4_11. This paper does not seem to be online, but the journal it is published in is available from Library Genesis. The relevant algorithm has 10 general multiplications (m~u in the right-hand column of their table), so the cost is 10m~ = 30C. We'll treat the "sparser" multiplications as just sparse for now.

So we can improve the cost excluding the final exponentiation to 64 * 16C + 6 * 14C + 45C + (66+2) * 30C + 63 * 30C = 5083C. The total cost of a single pairing is (3375 + 5083)C = 8458C (less than a tenth of the cost of the Sapling Spend circuit).
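Tallying these counts in a quick script reproduces the totals above (the constants are the per-operation constraint costs derived earlier in this thread):

```python
# per-operation constraint costs from the discussion above
DBL_LINE, ADD_LINE = 16, 14        # affine doubling/addition with line evaluation
MUL12, SPARSE, SQR12 = 45, 30, 30  # Fp12 mult, pseudo-8-sparse mult, Fp12 squaring

# operation counts quoted from Aranha et al section 6
miller = 64 * DBL_LINE + 6 * ADD_LINE + 1 * MUL12 + (66 + 2) * SPARSE + 63 * SQR12

uth_power = 62 * 12 + 2 * 70       # Karabina cyclotomic squarings + decompress/multiply
final_exp = 45 + 15 * 45 + 3 * uth_power + 4 * 12 - 45  # -45C: fold the inversion into a division
pairing = miller + final_exp
```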


 We may be able to improve the cost of the Fp12 arithmetic by implementing Fp4 multiplication directly using Toom-4, and using that in an Fp4 → Fp12 tower. Toom-4 multiplication costs 7m, instead of 3m~ = 9m for Karatsuba-over-Karatsuba. So Toom-3-over-Toom-4 costs 5 * 7m = 35C rather than 45C. Note that conversions between Fp4 → Fp12 and Fp2 → Fp6 → Fp12 representations are free. This helps most for the final exponentiation, where it saves 10C for the Fp12 inversion and each of 21 Fp12 multiplications (for the Aranha et al curve). Note that 6 of the multiplications are in the u-th power computations. There's also a single Fp12 multiplication in the Miller loop (I'm conservatively assuming we can't apply Toom-4 to the sparse multiplications). So overall we save 230C giving a total cost of 8228C; around a 2.7% improvement.

### daira commented Sep 25, 2018

 It seems that we can also just use Toom-12 directly to implement Fp12 multiplication and squaring. This costs 23m for multiplication and 23s for squaring, i.e. 23C in both cases (this is better than the sparse multiplication algorithm; in this approach sparse multiplications have fewer terms but no fewer constraints). It turns out that the cyclotomic squaring method still pays off whenever k < 0.44*ℓ + 1.

For the Aranha et al curve, the final exponentiation takes (excluding free operations):

* 23C for the inversion in Fp12;
* 15 * 23C for the multiplications in Fp12;
* 3 * (62 * 12C + 2 * (25+23)C) = 3 * 840C for the 3 u-th powers;
* 4 * 12C for the cyclotomic squarings

giving a total of 2629C, a saving of 22.1%. For the same curve, the Miller loop takes (again excluding free operations):

* 64 * 16C for the doublings with line evaluation;
* 6 * 14C for the incomplete additions with line evaluation;
* (1+66+2) * 23C for the multiplications in Fp12;
* 63 * 23C for the squarings in Fp12

giving a total of 4144C, a saving of 18.5%. The overall cost of a pairing on the Aranha et al curve is (2629 + 4144)C = 6773C, which is a 19.9% improvement.
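The new crossover and the Miller loop total can be checked the same way (the 0.44 comes from comparing 12ℓ + 48(k−1) against 23ℓ + 23(k−1)):

```python
MUL12 = SQR12 = 23   # Toom-12 Fp12 multiplication and squaring, 23C each

def cyclotomic_pays_off(ell, k):
    """Compressed (Karabina) squaring beats direct Fp12 squaring iff k < 0.44*ell + 1."""
    compressed = 12 * ell + (25 + MUL12) * (k - 1)  # squarings + decompress/multiply
    direct = SQR12 * ell + MUL12 * (k - 1)          # plain square-and-multiply
    return compressed < direct

# Miller loop with Toom-12 Fp12 arithmetic; operation counts from Aranha et al
miller_toom12 = 64 * 16 + 6 * 14 + (1 + 66 + 2) * MUL12 + 63 * SQR12
```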

 From Appendix B.2 of the protocol spec, adapting notation to be consistent with the above: neglecting multiplications in the r-order subgroup of Fp12 and Fp, and other trivial operations, the cost of batched verification [for N proofs] is therefore:

* for each proof: the cost of decoding the proof representation to the form Groth16.Proof, which requires three point decompressions and three prime-order subgroup checks (two for G1 and one for G2);
* for each successfully decoded proof: a Miller loop; and a 128-bit scalar multiplication by zj;
* for each verification key: two Miller loops; an exponentiation in Fp12; a multiscalar multiplication with N 128-bit terms to compute Accum∆; and a multiscalar multiplication with ℓ + 1 255-bit terms to compute ∑i = 0..ℓ [AccumΓ,i] Ψi;
* one final exponentiation.

A point decompression for Aranha-curve G1 costs 291C for the range check + 3C for the on-curve check. (The Aranha BN curve has p = 36.u⁴ + 36.u³ + 24.u² + 6.u + 1 where u = -(2⁶² + 2⁵⁵ + 1). The comparison in the range check is with (p-1)/2 because we're using SORT compression.) A point decompression for Aranha-curve G2 costs 291C + 5C + 3C (where the 5C is for a trick to avoid having to do two comparisons, by selecting the element of the y polynomial to compare based on whether the most significant coefficient of the polynomial is 0).

The Aranha G1 is prime-order, so no subgroup checks are needed for πA and πC (this might be different for BLS and other curves). A subgroup check is needed for πB. This can be done using the QED-it trick of witnessing πB/h and then checking that [h] (πB/h) = πB ≠ 𝓞, where h is the cofactor in G2. Since h is fixed, the best way to do this is to find a short addition-subtraction chain; unfortunately I don't know the order of E'/Fp2 (the curve of which G2 is a subgroup) so I can't compute h exactly. However, we know that by the Hasse bound, h must be around lg(p²/r) ~= 254 bits, so we can implement the cofactor multiplication by h with roughly 253 doublings (10C each) and 84 incomplete additions or subtractions (8C each); probably better. That costs 253 * 10C + 84 * 8C = 3202C (quite expensive compared to the pairing, so worth optimizing further).

We will also need 128-bit variable-base scalar multiplications in G1. Using "Montgomery without clamping", each of these costs (128 * 9 - 1)C = 1151C. (This includes conversion of πj,A and πj,C from Edwards to Montgomery.)

Let n = ceiling(lg(N)), so that AccumY has 128+n bits. (This is always less than r in practice, so reduction mod r is not useful.) The exponentiation Y^AccumY in Fp12, using a simple square-and-conditional-multiply algorithm, costs (127+n) * 23C + (128+n) * (23 + 12)C = (7401 + 58 * n)C.

The computation of [AccumΓ,i] Ψi is tricky. It's possible to do it by splitting each aj,i into three 85-bit limbs. (One limb could be 84 bits, but let's ignore that for simplicity.) Then we precompute bases [2^(85k)] Ψi for k = 0..2, and do 3.(ℓ + 1) scalar multiplications, each with a (128+85+n)-bit scalar. If we use the "fixed-base mostly Montgomery" approach, we only need a single adjustment at the end to correct for the digit offsets. If we assume for simplicity that n is a multiple of 3, then we have 71 + n/3 windows for each multiplication, so the cost per multiplication is (71 + n/3)*3 for the window lookups, (70 + n/3)*3 for the incomplete additions, and 4C to add the result of this multiplication to the partial result for [AccumΓ,i] Ψi (with error checking for equal x-coordinates). So the cost of this part is 3 * (ℓ + 1) * (426 + n * 2 + 4)C = (ℓ + 1) * (1290 + n * 6)C. This is expensive so I'm sure there must be a better way; I'll see what I can come up with later.

Assume a single verification key, with ℓ field elements for the public input. In that case we need, for a batch of N Groth16 verifications on the Aranha curve:

* N * 294C for the G1 decompression;
* N * 299C for the G2 decompression;
* N * 3202C for the G2 subgroup check;
* (N+2) * 4144C for N+2 Miller loops;
* (N+1) * 1151C for N+1 128-bit scalar multiplications by zj in G1;
* N * ℓ * 3C for computation of the three limbs of AccumΓ,i = ∑j = 0..N-1 zj · aj,i for each i ∈ {1..ℓ};
* (ℓ + 1) * (1290 + n * 6)C for the computation of [AccumΓ,i] Ψi;
* (7401 + n * 58)C for an exponentiation in Fp12;
* 2629C for the final exponentiation;
* 3 * 23C for three multiplications in Fp12

for a total of (N * 9090 + 20828 + ℓ * (1290 + N * 3 + n * 6) + n * 64)C.

[Edit: the cost of the exponentiation was too low because it didn't take into account that the multiplies are conditional. The fix added an extra (128+n) * 12C.] [Edit: fixed an arithmetic error in the cost of the Fp12 exponentiation, adding (n * 23)C.] [Edit: I had forgotten to include the final exponentiation. Oops!]
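The itemized costs can be folded into one function; the closed form agrees with the itemization term by term:

```python
def batch_cost(N, ell, n):
    """Constraint cost of verifying a batch of N Groth16 proofs (Aranha curve),
    itemized as in the comment above. ell = public-input length, n = ceil(lg N)."""
    cost  = N * 294                       # G1 decompressions
    cost += N * 299                       # G2 decompressions
    cost += N * 3202                      # G2 subgroup checks
    cost += (N + 2) * 4144                # Miller loops
    cost += (N + 1) * 1151                # 128-bit scalar mults by z_j in G1
    cost += N * ell * 3                   # limbs of Accum_Gamma,i
    cost += (ell + 1) * (1290 + 6 * n)    # [Accum_Gamma,i] Psi_i
    cost += 7401 + 58 * n                 # exponentiation in Fp12
    cost += 2629                          # final exponentiation
    cost += 3 * 23                        # three Fp12 multiplications
    return cost

def batch_cost_closed(N, ell, n):
    # the closed form stated at the end of the comment
    return N * 9090 + 20828 + ell * (1290 + N * 3 + n * 6) + n * 64
```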


 Reminder to investigate Katherine Stange's alternative "elliptic net" pairing algorithm replacing the Miller loop. It's described here (paper and slides), and optimized somewhat here (available from Libgen). At first sight it looks strictly slower, but maybe I missed something.

 The elliptic net algorithms are indeed much slower; the equivalent operation to doubling with line addition takes at least 32C in that approach, compared to 16C above.

### imeckler commented Dec 6, 2018

 @daira Am I wrong to think that you need a simulation-extractable SNARK for composition to be sound? Of course both GM17 and Groth16 are doing very similar computations so this whole thread transfers over. Or maybe (assuming SE is really necessary) @ebfull and @arielgabizon's construction.

### daira commented Dec 7, 2018

 Why do you think so @imeckler? It seems all the papers on recursive SNARKs - Valiant, BCCT, BCTV - just use knowledge soundness.

### imeckler commented Dec 10, 2018

 I'm not an expert in the kind of malleability that the non-SE pairing-based SNARKs are subject to, but my understanding is that they can be malleated to verify against instances for which the prover does not know the witness. As such, some counter-measure must be taken when using them. Again, not an expert in Zcash, but my understanding is that a spender of a note must sign the proof with some key used in the witness generation, which somehow prevents malleability. If you're composing SNARKs together, wouldn't you need either a general non-malleability (like SE) or some kind of specialized non-malleability mechanism à la Zcash, so that the top SNARK can't be malleated?

 So I think you've got a good picture of it. Knowledge-sound (KS) SNARKs provide no guarantee of non-malleability. Previous works showed that composition of KS SNARKs preserves KS; you might try to prove SE is preserved under composition. At least after thinking about it a little I don't think it's immediate - the natural argument would be something like: if the adversary mauls a proof in the composed system, you'd want to get some mauled proof in one of the inner systems to contradict the SE of the inner system you're composing - but I don't see how you'd get this inner proof from the composed proof.

Like you said, Zcash does fine with just a KS SNARK for two reasons:

1. We combine signatures and SNARKs in a way that the signature helps with the malleability.
2. In parts of the argument that at first seem to require SE, KS is enough because you can think of the adversary and honest party, i.e. the whole system, as just one party. For example: if a certain case of mauling the proof after seeing transactions of an honest party would lead to finding a hash collision, then for that case you can think of the honest party and adversary as one party, use just KS (now there is no extra information from the outside for this combined party), extract a witness using just KS, and if that witness gets you a collision that's a contradiction - cause the honest party and adversary shouldn't be able to make a hash function collision even when working together.

This might not make full sense at first - I recommend looking at the non-malleability proof in the Zerocash paper (appendix C or D) to get a feel for what I mean. Maybe a similar argument can be used for Coda and you can avoid using SE. And, again, if you do end up wanting SE from the SNARK, you'll need to prove it holds under composition cause I don't think there's such a result so far.

### imeckler commented Dec 10, 2018

> And, again, if you do end up wanting SE from the SNARK, you'll need to prove it holds under composition cause I don't think there's such a result so far.

Groth-Maller at least claim that this holds (see the paragraph beginning at the top of page 5). I think basically you should be able to extract inner proofs from the adversarially-produced outer proof, then recursively extract from those, just as in the KS argument, no? The only difference is that the adversary can see simulated proofs, but the inner extractors are sound against that as well, so it seems it should be fine.

Actually @arielgabizon I wanted to ask about your SE SNARK construction as well (I'll call it the BG SNARK in the following) - though I'm not sure if this issue thread is the right place to do so. Basically it seems to me you can actually malleate proofs, so any pointers on where to look to correct my understanding would be much appreciated.

There is a way to convert a Groth16 proof into a BG proof: send π = (A, B, C) to F(π) = (A, B, C, delta, H(A, B, C, delta) * g) (where g is the generator of G_1). Suppose π = (A, B, C) is malleated to a proof π' = (A', B', C') for which no one knows the witness. Then would not F(π') be a BG proof for which no one knows the witness? Granted, the simulator would not produce such a proof since it produces proofs with random delta', but still it seems wrong that one can produce proofs of invalid statements if one has access to proofs generated with bad delta' values.

 It's interesting that they indeed write that; maybe it's fine. I don't see the argument directly - cause you see simulated outer proofs and then from SE extract an outer witness - which contains proofs of the inner system; and then you want to use the inner system SE to get an inner witness from those proofs - but that inner system is SE "against" proofs of the inner system, not necessarily of the outer one. (Hmm.. maybe it's not a big deal to deal with that; all these proof systems down the recursion will be, say, Groth-Maller with two different sets of parameters (guessing that's what happens in Coda), so you need to strengthen the SE argument to a simulator that simulates proofs at request from each of these systems, and an adversary that just needs to succeed in making a new proof at one of them. Maybe that's a straightforward extension.. though perhaps tedious to write formally.)

Yeah, that's a valid point about BG - the non-malleability requires the honest party to follow the protocol and choose a random delta'. I don't think it's a big deal - it's just saying you need to follow the protocol if you don't want your proofs mauled. But maybe you could say in some sense that's weaker than Groth-Maller, where I think seeing any set of valid proofs doesn't allow you to create a new one (there is an r element the prover needs to choose randomly there, but afaics that's just for zero-knowledge, and SE holds also when the honest prover/simulator doesn't choose it randomly).

### imeckler commented Dec 11, 2018

 As I think about it more I realize I’m a bit confused. In the Groth16 case, if π gets malleated to π’, the “compound prover” consisting of the prover of π and the malleater knows a witness for the statement against which π’ verifies, by knowledge soundness, yes? In that case using a mere KS SNARK should be ok. Do any of you know of a note somewhere describing the kinds of malleability that e.g. Groth16 or the Pinocchio SNARK are subject to? I know I’ve read something before but can’t track it down.

### arielgabizon commented Dec 11, 2018

> As I think about it more I realize I’m a bit confused. In the Groth16 case, if π gets malleated to π’, the “compound prover” consisting of the prover of π and the malleater knows a witness for the statement against which π’ verifies, by knowledge soundness, yes? In that case using a mere KS SNARK should be ok.

Yes, in cases where you can think of both as a compound prover, KS suffices. As @daira mentioned in discussion, this could be the case for Coda, where you just care that there is a proof-of-work chain leading to the current state, and you're not too concerned with who knows it and who doesn't.

> Do any of you know of a note somewhere describing the kinds of malleability that e.g. Groth16 or the Pinocchio SNARK are subject to? I know I’ve read something before but can’t track it down.

There are two examples in GM of malleating Groth16; they just malleate the proof and not the instance. Mary Maller showed me a simple example where you can maul the instance too, when you have a public input that doesn't participate in any constraint.

### imeckler commented Dec 14, 2018

 Cool -- thanks so much for the wisdom @daira and @arielgabizon. @vanishreerao and I came up with an extension of the BG construction yesterday that lets you have a SE proof which can be converted into a cheaper-to-verify-inside-the-snark KS proof when recursively composing, I will link when we write something up on it in case it ends up being of use to someone.

### arielgabizon commented Dec 15, 2018

 Sounds awesome!

### daira commented Feb 4, 2019

 https://eprint.iacr.org/2019/077 is a recent paper by Michael Scott on optimized multipairing computation (i.e. computing products of pairings). It works by splitting the line function part of the Miller loop from the G2 point arithmetic; the former has to be done once per pairing, but the latter only once per product. This is directly applicable to the batch Groth16 validation in appendix B.2 of the protocol spec (most pairing-based proof systems can be optimized in a similar way). It's nice that the paper explicitly considers optimizations applicable to affine coordinates. Once I'm not so busy, I'll get around to applying the techniques of this paper to reduce the constraint count of in-circuit batched validation.

### arielgabizon commented Apr 25, 2019

 @daira I think potentially the whole Γ accumulator can just be done outside of the circuit by the verifier, cause it only depends on the public input, right? Same for, in the notation of app B.2 of the spec, the element Y^AccumY. More generally you could have the circuit compute just AccumAB * ML(AccumΔ, Δ), and then verify outside the circuit, using the public input and the random zj's, that the batched verification equation holds.

### arielgabizon commented Apr 25, 2019

 I guess what I wrote above depends on whether in your scenario all of the public inputs in the batch are public or not - e.g. if just a commitment to them is public.

### daira commented Apr 26, 2019

 The primary motivation for this ticket is the case where you only know the public inputs within the verification circuit. Of course if they are known outside the verification circuit then you can precompute. In a fully succinct block chain there is no "outside" in that sense, though.

### daira commented Apr 27, 2019

 According to Wahby and Boneh 2019 (Fast and simple constant-time hashing to the BLS12-381 elliptic curve), there is a fast way to multiply by the G2 cofactor — and therefore to do G2 subgroup checks via the QED-it trick:

> Ψ : E2(Fp2) → E2(Fp2) is the endomorphism of Budroni and Pintore [BP17, §4.1], based on the work of Scott et al. [SBC+09] and Fuentes-Castañeda et al. [FKR12]. This is effectively a fast method of exponentiating by the cofactor h2 of E2(Fp2).

 Hmm, looking at Sec 4.1 of Budroni and Pintore, the dominant cost is 2 scalar mults by x, where x is the integer parameter for constructing the BLS curve, so the question is what is the Hamming weight of this x, e.g. for BLS12-381. In particular:

> [h(a)]P = [h(ψ)]P = [x² − x − 1]P + [x − 1]ψ(P) + ψ²([2]P)
>
> that can be computed at the cost of 5 point additions, 1 point doubling, 2 scalar multiplications by the parameter x and 3 applications of ψ

 The parameter is -15132376222941642752 (see section 5.4.8.2 of the protocol spec), which in binary is -1101001000000001000000000000000000000000000000010000000000000000. So its Hamming weight is 6, and its bit length is 64.

 Using the technique of #3924, a double-and-add step in G2 can be implemented in 3m~ + 2s~ = 13C. (This is not applicable to the first double-and-add for this bit pattern, to avoid an exceptional case for addition; the first addition therefore takes 8C.) A doubling in G2 takes 10C. So multiplying by the above parameter takes (10 + 8 + 10 + 13 + 2*10 + 13 + 8*10 + 13 + 31*10 + 13 + 16*10)C = 650C. [Edit: corrected to use the correct operation costs for G2.]
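A quick script confirms the Hamming weight and the 650C total (the 10C / 13C / 8C figures are the G2 doubling, combined double-and-add, and exceptional first addition from above):

```python
X = 15132376222941642752   # |x| for BLS12-381; the sign is handled by a free negation

def param_mult_cost(x, dbl=10, dbl_add=13, first_add=8):
    """Double-and-add cost model with the first addition done separately
    (a 10C doubling followed by an 8C addition, avoiding the exceptional case)."""
    bits = bin(x)[3:]   # skip '0b' and the MSB, which supplies the initial point
    cost, seen_add = 0, False
    for b in bits:
        if b == '0':
            cost += dbl
        elif not seen_add:
            cost += dbl + first_add   # exceptional first addition
            seen_add = True
        else:
            cost += dbl_add           # combined double-and-add
    return cost
```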

 [x - 1] R can be computed in 653C. (Replace the last doubling with a 13C double-and-subtract.) We have [x² - x - 1] P + [x - 1] ψ(P) + ψ²([2] P) = [x - 1] ([x] P + ψ(P)) - P + ψ²([2] P). R = [x] P + ψ(P) and [2] P can be computed at the same time, in 661C plus the cost of computing ψ. ψ is described in [Galbraith and Scott 2008], but that doesn't give much detail on how to compute it. In any case, the cost of G2 cofactor multiplication or subgroup checking is (653 + 661 + 2*8)C = 1330C plus three ψ computations (for BLS12-381). I had previously estimated 3202C for the whole cofactor multiplication (for the Aranha curve).

 As I pointed out in #3425 (comment), the Miller loop for the Ate pairing e(P, Q) computes as a side effect [x] Q. This is part (roughly half, excluding the ψ computations which I haven't figured out yet) of what we need for the optimized G2 cofactor multiplication above. The cost becomes (653 + 8 + 2*8)C = 677C plus three ψ computations, I think (for BLS12-381). (When applying this to batch Groth16 validation, it is compatible with using a multi-Miller loop to compute AccumAB. The GT squarings are shared but the G2 arithmetic is separate, and so we still compute each [x] Q.)

 Quoting for reference on ψ from Section 2 of BP17:

> Such an endomorphism is defined as ψ = φ⁻¹ ∘ π ∘ φ, where π is the q-power Frobenius on E and φ is an isomorphism from the twist curve E˜ to E

 Right. π is free in R1CS; it's φ that I don't yet know the cost of. (φ and φ⁻¹ have the same cost in R1CS.)

 iiuc, according to the last paragraph of page 71 here http://www.craigcostello.com.au/pairings/PairingsForBeginners.pdf, φ takes (x′, y′) on the twist to (x′/ω², y′/ω³), where ω is in Fp12. So it takes one mult in Fp6 and one in Fp4, so I guess at most 46C, given you estimate 23C for multiplication in Fp12 above.


 Thanks! ω⁻² and ω⁻³ are constants. Multiplications by any constant in Fpk are linear, and hence free in R1CS. So the full cost of the G2 subgroup check (given that we've already computed [x] Q in the Miller loop) is just 677C for BLS12-381. Edit: actually 1330C, since [x] Q can't be reused as discussed below: #3425 (comment)

### daira commented Apr 28, 2019

 In my copy of Pairings for Beginners the relevant part is on page 64 in section 4.3: In the general case according to our context, a twist of E : y² = x³ + ax + b is given by E′ : y² = x³ + aω⁴x + bω⁶, with Ψ : E′ → E : (x′, y′) → (x′/ω², y′/ω³), ω ∈ Fq^k. We can only achieve specific degrees d through combinations of zero and non-zero values for a and b. [...] d = 6 sextic twists. Sextic twists are only available when a = 0, so if E/Fq : y² = x³ + b, then E′/Fq^(k/6) : y² = x³ + bω⁶, with ω⁶ ∈ Fq^(k/6), ω³ ∈ Fq^(k/3) and ω² ∈ Fq^(k/2). Thus, Ψ pushes elements in E′(Fq^(k/6)) up to E(Fq^k), whilst Ψ⁻¹ pulls elements from E(Fq^k) all the way down to E′(Fq^(k/6)). This is consistent with the constraint cost calculated above, but it's important that we end up mapping back into Fq^(k/6) (where k/6 = 2).

 One last estimate on the Aranha et al curve before we switch to considering BLS12 curves. The Aranha BN parameter is x = −(2^62 + 2^55 + 1). Since it is a BN curve, the relevant formula for the cofactor multiplication is in [Fuentes-Castañeda, Knapp, and Rodríguez-Henríquez 2012, section 6.1] (notation adapted): We can compute Q → [x] Q → [2x] Q → [3x] Q using one doubling, one addition, and one multiply-by-x. Given Q, [x] Q, [3x] Q, we can compute [h(a)] Q using three ψ-maps, and three additions. In total, we require one doubling, four additions, one multiply-by-x, and three ψ-maps.

Computing [x] Q would cost (6*10 + 13 + 54*10 + 13)C = 626C, but we get it for free from the Miller loop. Unlike the BLS12 case, we don't need to also compute a multiplication by x-1. The ψ-maps are free, so the cost of the cofactor multiplication (and hence a G2 subgroup check) is just (10 + 8 + 3*8)C = 42C. This replaces the previous estimate of 3202C, so we get a quite spectacular saving of N * 3160C. The total cost for a batch of N Groth16 verifications becomes (N * 4085 + 15801 + ℓ * (1290 + N * 3 + n * 6) + n * 61)C. Note that this is even before applying the potential optimizations of #3425 (comment) .

(I'm somewhat confused as to why the Aranha et al paper says there are 6 additions with line evaluation in the Miller loop, when x has a Hamming weight of only 3. I'll figure that out later.)
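
The 626C figure for [x] Q can be reproduced the same way (a sketch under the same 10C per doubling and 13C per merged double-and-add assumptions; no first-addition special case is included, matching the breakdown above):

```python
# Tally the cost of multiplying by |x| = 2^62 + 2^55 + 1 for the Aranha BN curve.
DBL, DBL_ADD = 10, 13

x = 2**62 + 2**55 + 1
bits = bin(x)[2:]
cost = 0
for b in bits[1:]:   # the MSB seeds the accumulator
    cost += DBL if b == "0" else DBL_ADD
print(cost)  # 626
```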

 Oops. As @ebfull pointed out, the [x] Q calculated by the Miller loop is not what we need to use the QED-it trick to check that Q is in G2. We would need [x/h] Q for that. This means the cofactor multiplication costs (626 + 42)C = 668C for BN and 1330C for BLS12. On the Aranha curve, the total cost for a batch of N Groth16 verifications becomes (N * 4711 + 15801 + ℓ * (1290 + N * 3 + n * 6) + n * 61)C. The proportion of the per-proof cost taken by the G2 subgroup check drops from ~44.2% to ~14.2%. I'll see if I can figure out how to reuse the Miller loop computation again.
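
The quoted percentages can be checked against the figures above (my own arithmetic: the old per-proof cost was 4085C + 3160C = 7245C with a 3202C subgroup check, and the new per-proof cost is 4711C with a 668C check):

```python
# Fraction of the per-proof cost taken by the G2 subgroup check,
# before and after losing the [x] Q reuse.
old = 3202 / (4085 + 3160)
new = 668 / 4711
print(f"{old:.1%} -> {new:.1%}")  # 44.2% -> 14.2%
```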


### arielgabizon commented May 8, 2019

 One thing I'd like to pick your brains on @Pratyush @daira ; if someone were to implement a Groth verifier circuit right now, would you use the pair of curves from ZEXE, or is there something else/better?

### Pratyush commented May 8, 2019

 If you're fine with depth-2 recursion then the curves in Zexe should meet the 128-bit security level. I don't think there's a more efficient way. If you're looking for unlimited recursion, then Coda's curves are probably better. Note that BLS12-381 is unsuitable for recursion because the base field prime has low 2-adicity.

 For P : G1, Sean pointed out to me a different technique (not included in his paper, but more efficient in R1CS), due to Michael Scott. It is described for BLS12-381 in [Wahby and Boneh 2019, section 5]: Clearing cofactors. The maps of Sections 3 and 4 output elements of E1(Fp) or E2(Fp2), but we want elements of G1 or G2. For G2, we use the endomorphism Ψ (see above). For G1, Scott observes [personal communication] that there is a faster way than exponentiating by [the G1 cofactor] h1. Recall (§2.1) that h1 = (z − 1)^2 / 3; for BLS12-381, 1 − z = 3 ∏_k p_k for primes p_k, i.e. h1 = 3 ∏_k p_k^2. It is easy to check that E1(Fp) has no points of order p_k^2 for any k. As a result, exponentiating by 1 − z ≈ √h1 gives an element of G1. This speeds up cofactor clearing by more than 2× versus exponentiating by h1, in part because 1 − z has low Hamming weight.

In R1CS, we can use the QED-it trick to implement subgroup testing in terms of cofactor multiplication, and this extends to the [1 − z] map. That is, we compute Q = [(1 − z)^-1 (mod n)] P outside the circuit, then calculate the desired point P = [1 − z] Q inside.

As for G2, the scalar multiplication must be done using complete formulae. As above, we focus on the case of a run of k > 0 doublings, followed by an addition of some point R = (xR, yR) ≠ 𝓞, i.e. we are calculating U = [2^k] Q + R. Fortunately G1 is odd-order, so we don't need to consider points of order 2 as inputs to doubling: [2^k] Q = 𝓞 iff Q = 𝓞. Assume that we have a boolean flag isNonzeroQ which is zero iff Q = 𝓞.

The doublings themselves cost 4C each. Let S = (xS, yS) be the output of the last doubling (which is undefined if isNonzeroQ is false). Let T = S + R, using complete addition. (T will be undefined if isNonzeroQ is false or S = −R, but will be correct if S = R.) It costs 2C to select between T and R according to isNonzeroQ.

To construct the complete addition, check whether the inputs have different x-coordinates, i.e. set isDifferentX true iff xS ≠ xR. Then use the short Weierstrass addition formulae, computing λ for both the xS = xR and xS ≠ xR cases, and select between them according to isDifferentX. The cost of the complete addition comes out to 8C. We also need to set isNonzeroU true iff S ≠ −R, i.e. iff isDifferentX is true or yS = yR; this can be done in 2C. isNonzeroU will become isNonzeroQ for the next run.

Writing this out, we have:

```
// [computing the doublings costs k*4C]
// complete addition for y^2 = x^3 + b
let isDifferentX = boolean(xR ≠ xS)    // 2C
(xR) × (xR) = (xxR)
(2.yR) × (λ1) = (3.xxR)
(xR - xS) × (λ2) = (yR - yS)
let λ = isDifferentX ? λ2 : λ1    // 1C
(λ) × (λ) = (xR + xS + xT)
(xS - xT) × (λ) = (yT + yS)
let U = isNonzeroQ ? T : R    // 2C
// isSEqualToR is 1 if S = R, 0 if S = −R, and undefined if isDifferentX is true
(yR) × (2.isSEqualToR - 1) = (yS)
// use DeMorgan for OR
(1 - isDifferentX) × (1 - isSEqualToR) = (1 - isNonzeroU)
```

So a run of k doublings followed by an addition costs (4*k + 8 + 2 + 2)C = (4*k + 12)C.

Assume that Q ≠ 𝓞. Now we need to calculate [1 − z] Q. Since 1 − z is 1101001000000001000000000000000000000000000000010000000000000001 in binary representation, we can express this as 6 runs of the above algorithm with k = [1, 2, 3, 9, 32, 16]. So the overall cost is (4*(1 + 2 + 3 + 9 + 32 + 16) + 12*6)C = 324C. Then we assert that isNonzeroU is false after the last iteration.

So, the BLS12-381 G1 subgroup check can be done in 324C.
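
The runs k = [1, 2, 3, 9, 32, 16] and the 324C total follow mechanically from the binary expansion of 1 − z. A sketch:

```python
# Derive the runs of doublings from the bits of 1 - z and apply the
# (4k + 12)C per-run cost for k doublings followed by a complete addition.
one_minus_z = int(
    "1101001000000001000000000000000000000000000000010000000000000001", 2)

bits = bin(one_minus_z)[2:]
runs, k = [], 0
for b in bits[1:]:       # the MSB seeds the accumulator
    k += 1
    if b == "1":         # each run ends at a set bit (an addition)
        runs.append(k)
        k = 0
assert runs == [1, 2, 3, 9, 32, 16]
total = sum(4*k + 12 for k in runs)
print(total)  # 324
```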

### daira commented Jul 15, 2019

 Next we have to translate the other components from the Aranha curve to BLS12-381...


 For #4092, it would also be useful to work out the cost in a Spartan-style arithmetic circuit, where addition gates also count toward the cost. Most of the work we've done on optimizing elliptic curve arithmetic carries over directly to the gate cost model, because the Montgomery curve arithmetic also has relatively few additions. The Toom-12 arithmetic makes no sense in that context, and should probably be replaced by towering. See #4094 for discussion of an appropriate cost model.

### daira commented Nov 2, 2020

 If you've read this far, I think what should be quite clear is that the complexity of optimised pairing computation in a circuit is pretty horrendous. And this is just with R1CS, before we've started optimising for PLONK. We might use this in future but it's looking increasingly likely that we will just focus on DL-based proof systems that don't need pairings in the circuit. Or possibly, we may use an accumulation scheme that defers use of the pairing to the final decision algorithm.
