Faster variable-base scalar multiplication in zk-SNARK circuits #3924

Open · daira opened this issue Mar 26, 2019 · 15 comments

@daira commented Mar 26, 2019
The best general way to perform a variable-base scalar multiplication in an R1CS circuit required (before this write-up) 9 constraints per scalar bit. There is a way that requires 8.5 constraints/bit, but only with a recoded scalar using a nonstandard and inconvenient digit set (e.g. [0, 1, 2, -1]).

I believe it's possible to implement this in 6 constraints/bit, by using a modification of a technique from [Eisentraeger, Lauter, and Montgomery]. They state their technique for curves in short Weierstrass form, but it is also applicable to Montgomery form. (Everything in this ticket is easily adapted to both.) The basic idea is that when we need to find [2] P + Q, where P is the accumulator in a double-and-add algorithm, we compute it as (P + Q) + P. This allows two optimizations:

  • we do not need to compute the y-coordinate of the intermediate point P + Q (details below);
  • we've replaced a doubling with an addition, which is more efficient in R1CS by one constraint.

Here we adapt [Eisentraeger, Lauter, and Montgomery]'s formulae to a Montgomery curve with equation B·y² = x³ + A·x² + x. The constraint system for the incomplete addition P + Q = R on its own would be:

    (xQ - xP) × (λ1) = (yQ - yP)
    (B·λ1) × (λ1) = (A + xP + xQ + xR)
    (xP - xR) × (λ1) = (yR + yP)

When computing (P + Q) + P = S, we drop the last constraint which is only needed for yR. The outer addition requires the gradient λ2 = (yP - yR)/(xP - xR), but we have:

    λ1 = (yR + yP)/(xP - xR)

therefore

    λ2 = 2·yP/(xP - xR) - λ1.

So in R1CS we can write:

    (xQ - xP) × (λ1) = (yQ - yP)
    (B·λ1) × (λ1) = (A + xP + xQ + xR)
    (xP - xR) × (λ1 + λ2) = (2·yP)

and then complete the outer addition with:

    (B·λ2) × (λ2) = (A + xR + xP + xS)
    (xP - xS) × (λ2) = (yS + yP)
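
As a sanity check on these formulae, here is a small Python sketch (my own illustration, not from this issue) that computes S = (P + Q) + P using only the witness values determined by the five constraints above, and compares the result against naive affine addition on a toy Montgomery curve. The parameters p = 1019, A = 6, B = 1 are arbitrary illustrative choices.

```python
# Toy Montgomery curve B*y^2 = x^3 + A*x^2 + x over F_p.
# p, A, B are arbitrary illustrative choices (p is prime, p % 4 == 3).
import itertools

p, A, B = 1019, 6, 1

def inv(a):
    return pow(a, p - 2, p)

def on_curve(P):
    x, y = P
    return (B * y * y - (x * x * x + A * x * x + x)) % p == 0

def add(P, Q):
    # Naive affine incomplete addition (or doubling when P == Q).
    (x1, y1), (x2, y2) = P, Q
    if P == Q:
        lam = (3 * x1 * x1 + 2 * A * x1 + 1) * inv(2 * B * y1) % p
    else:
        lam = (y2 - y1) * inv(x2 - x1) % p
    x3 = (B * lam * lam - A - x1 - x2) % p
    return (x3, (lam * (x1 - x3) - y1) % p)

def double_add(P, Q):
    # S = (P + Q) + P via the five-constraint gadget's witness values:
    # lam1, xR, lam2, xS, yS. The y-coordinate of P + Q is never computed.
    (xP, yP), (xQ, yQ) = P, Q
    lam1 = (yQ - yP) * inv(xQ - xP) % p        # (xQ-xP) x lam1 = yQ-yP
    xR = (B * lam1 * lam1 - A - xP - xQ) % p   # B*lam1 x lam1 = A+xP+xQ+xR
    lam2 = (2 * yP * inv(xP - xR) - lam1) % p  # (xP-xR) x (lam1+lam2) = 2*yP
    xS = (B * lam2 * lam2 - A - xR - xP) % p   # B*lam2 x lam2 = A+xR+xP+xS
    yS = (lam2 * (xP - xS) - yP) % p           # (xP-xS) x lam2 = yS+yP
    return (xS, yS)

# Find a few affine points (p % 4 == 3, so square roots are easy).
pts = []
x = 1
while len(pts) < 6:
    r = (x * x * x + A * x * x + x) * inv(B) % p
    s = pow(r, (p + 1) // 4, p)
    if r != 0 and s * s % p == r:
        pts.append((x, s))
    x += 1

checked = 0
for P, Q in itertools.permutations(pts, 2):
    if add(P, Q)[0] != P[0]:                   # gadget needs xR != xP
        assert double_add(P, Q) == add(add(P, Q), P)
        checked += 1
assert checked > 0
```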

The practical problem with applying this technique in an R1CS circuit is that we cannot efficiently implement a conditional that sometimes computes [2] P, and sometimes [2] P + Q. (Conditionally replacing Q with 𝓞 does not work, because 𝓞 does not have an affine Montgomery representation.)

The following trick works around this. Suppose that r is of length n bits. Consider the following algorithm:

    Acc := [2] T
    for i from n-1 down to 0 {
        Q := rᵢ ? T : −T
        Acc := (Acc + Q) + Acc
    }

For each step we can compute the y-coordinate of Q using:

    (yT) × (2·rᵢ - 1) = (yQ)

This requires a total of 6 constraints per scalar bit. However, at the end we have computed [2ⁿ⁺¹ - (2ⁿ - 1) + 2·r] T = [2ⁿ + 1 + 2·r] T.

Not to worry. Suppose that we actually want to compute [2ⁿ + k] T, where k < 2ⁿ⁺¹. Without loss of generality, assume k is odd (if it is even then add one to k and subtract T from the result). Let k = 1 + 2·r, and solve to give r = (k - 1)/2. Conveniently, this is equivalent to setting r = k >> 1, where >> is the bitwise right-shift operator.

So the full algorithm is:

    Acc := [2] T
    for i from n-1 down to 0 {
        Q := kᵢ₊₁ ? T : −T
        Acc := (Acc + Q) + Acc
    }
    return (k₀ = 0) ? (Acc - T) : Acc
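
Since incomplete point addition acts on indices (multiples of T) as integer addition, the bookkeeping above is easy to check mechanically. This Python sketch (my own, not from the issue) simulates the full algorithm over indices and confirms that it returns [2ⁿ + k] T for every k < 2ⁿ⁺¹:

```python
def mul_index(k, n):
    # Simulate the algorithm over indices: a point [m] T is represented
    # by the integer m, so point addition becomes integer addition.
    acc = 2                                   # Acc := [2] T
    for i in range(n - 1, -1, -1):
        q = 1 if (k >> (i + 1)) & 1 else -1   # Q := k_{i+1} ? T : -T
        acc = (acc + q) + acc                 # Acc := (Acc + Q) + Acc
    return acc - 1 if k & 1 == 0 else acc     # (k_0 = 0) ? Acc - T : Acc

n = 8
for k in range(2 ** (n + 1)):
    assert mul_index(k, n) == 2 ** n + k
```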

This requires 4C for the initial doubling, n * 6C for the loop, 3C to compute Acc - T, and 2C for the conditional.

There is a minor further improvement from specializing for kₙ = 0. In that case the first iteration of the loop calculates Acc = [3] T, which can be implemented directly as [2] T + T, saving 3C (since we have replaced one loop iteration by an incomplete addition):

    Acc := [2] T + T
    for i from n-2 down to 0 {
        Q := kᵢ₊₁ ? T : −T
        Acc := (Acc + Q) + Acc
    }
    return (k₀ = 0) ? (Acc - T) : Acc

Let s be the order of the large prime-order subgroup. Assume that T is of order s and that 2ⁿ⁺¹ - 1 ≤ (s-1)/2. Under these conditions, we can calculate [2ⁿ + k] T for k < 2ⁿ in (n+1) * 6C.

It remains to check that the x-coordinates of each pair of points to be added are distinct.

When adding points in the large prime-order subgroup, we can rely on Theorem A.3.4 from the Zcash protocol spec, which says that if we have two such points with nonzero indices with respect to a given odd-prime-order base, and their indices, taken in the range -(s-1)/2..(s-1)/2, are distinct disregarding sign, then the points have different x-coordinates. This is helpful, because it is easier to reason about the indices of points occurring in the scalar multiplication algorithm than to reason about their x-coordinates directly.

So, the required check is equivalent to saying that the following program never asserts:

    acc := 3
    for i from n-2 down to 0 {
        q := kᵢ₊₁ ? 1 : −1
        assert acc ≠ ± q
        assert (acc + q) ≠ acc    // X
        acc := (acc + q) + acc
        assert 0 < acc ≤ (s-1)/2
    }
    if k₀ = 0 {
        assert acc ≠ 1
        acc := acc - 1
    }

The assertion labelled X obviously cannot fail, because q ≠ 0. It is easy to see that acc is monotonically increasing except in the last conditional. It reaches its largest value, 2ⁿ⁺¹ - 1, when k is maximal; this justifies the condition on n above. This discharges all of the other assertions.
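
To double-check that the program never asserts, one can simply run it over indices for every k, which is cheap. Here is a Python sketch (my own, not from the issue); the toy values n = 8 and s = 1031 are illustrative choices satisfying 2ⁿ⁺¹ - 1 ≤ (s-1)/2:

```python
# Exhaustively run the index-level assertion program for all k < 2^n.
# n = 8 and s = 1031 are illustrative toy values with 2^(n+1) - 1 <= (s-1)/2.
n, s = 8, 1031
for k in range(2 ** n):
    acc = 3
    for i in range(n - 2, -1, -1):
        q = 1 if (k >> (i + 1)) & 1 else -1
        assert acc != q and acc != -q
        assert acc + q != acc              # X: trivial, since q != 0
        acc = (acc + q) + acc
        assert 0 < acc <= (s - 1) // 2
    if k & 1 == 0:
        assert acc != 1
        acc -= 1
    assert acc == 2 ** n + k               # the result index is as intended
```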

[Edit: the constraint count was off-by-one.]

@arielgabizon commented Mar 27, 2019

I'm wondering how much this improves the circuit size for batched Groth16 verification... perhaps I'll do the computation later.

@daira commented Mar 28, 2019

@arielgabizon The Miller loop curve arithmetic is in G2, so the constraint counts are different; also in the Miller loop we need all the intermediate points in order to compute the line functions. It might still be possible to apply this optimization, but it needs more analysis.

It can definitely be applied to the double-and-add steps in the subgroup checks. There the exponent is fixed so we should find addition-subtraction chains that minimize the constraint cost taking into account this optimization.

We could also apply it to [zⱼ] πⱼ,A and to the multiscalar multiplication in the computation of AccumΔ. A complicating factor in the latter is that the base points πⱼ,C do not necessarily have distinct x-coordinates; also, the saving would not be large in that case, because there is only one doubling per scalar bit regardless of N.

The optimization is not applicable to [AccumΓ,ᵢ] Ψᵢ, because that is best done as a fixed-base scalar multiplication, which can already be done in fewer constraints.

@daira commented Mar 30, 2019

For comparison, a fixed-base scalar multiplication takes 2 constraints/bit. One way of doing it using 3-bit windows is described in #2230 (comment). I think there's a simpler way using 2-bit windows: that requires 1C to AND each pair of scalar bits, the lookup is free because it is linear, and 3C for an incomplete addition, for a total of 4C per 2-bit window. The adjustment to avoid zero points does not affect the per-bit cost.
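
To illustrate why the 2-bit lookup costs nothing beyond the single AND, here is a Python sketch (my own illustration, not from the comment): once the product b0·b1 is available, selecting from any 4-entry table, say the coordinates of the window points, is a fixed linear combination of {1, b0, b1, b0·b1}, which is free in R1CS.

```python
def lookup_coeffs(t):
    # Multilinear interpolation of a 4-entry table over bits (b0, b1):
    # t[b1*2 + b0] = c0 + c1*b0 + c2*b1 + c3*b0*b1.
    return (t[0], t[1] - t[0], t[2] - t[0], t[3] - t[2] - t[1] + t[0])

t = [17, 42, 5, 99]              # hypothetical table entries (e.g. x-coords)
c0, c1, c2, c3 = lookup_coeffs(t)
for w in range(4):
    b0, b1 = w & 1, w >> 1
    b01 = b0 * b1                # the one AND constraint per bit pair
    assert c0 + c1 * b0 + c2 * b1 + c3 * b01 == t[w]
```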

(If you share the same scalar between different scalar multiplications, it's 0.5C/bit to prepare the scalar and then 1.5C/bit to actually do the fixed-base multiplication.)

@daira commented Apr 3, 2019

I reread this and experienced a pang of guilt that I might be teaching people to do something that would cause security bugs in their constraint systems in general:

The constraint system for the incomplete addition P + Q = R on its own would be:

    (xQ - xP) × (λ1) = (yQ - yP)
    (B·λ1) × (λ1) = (A + xP + xQ + xR)
    (xP - xR) × (λ1) = (yR + yP)

When computing (P + Q) + P = S, we drop the last constraint which is only needed for yR. The outer addition requires the gradient λ2 = (yP - yR)/(xP - xR), but we have:

    λ1 = (yR + yP)/(xP - xR)

therefore

    λ2 = 2·yP/(xP - xR) - λ1.

So in R1CS we can write:

    [2 constraints not important to this point snipped]
    (xP - xR) × (λ1 + λ2) = (2·yP)

Notice that we drop a constraint that is equivalent to λ1 = (yR + yP)/(xP - xR), and then we rely on that equation!

What is going on here that allows us to drop a constraint and then apparently rely on it? In general this can definitely cause bugs. For example, there are two similar attempted optimizations in the suggested constraint systems for MiMC and LowMC in section 6.1 of the MiMC paper ([Albrecht, Grassi, Rechberger, Roy, and Tiessen], page 22), which are completely insecure as a result.

The argument for why this is okay in this case goes as follows:

  • Under the assumption xP ≠ xQ (which we discharge later), the first constraint determines λ1 as a function of variables {xP, xQ, yP, yQ}.
  • The second constraint determines xR as a function of variables {λ1, xP, xQ} (and some constants).
  • Under the assumption xP ≠ xR (which we also discharge), the last constraint determines λ2 as a function of variables {λ1, xP, xR, yP}.

Combining these dependencies in a quite mechanical way (not depending on any higher-level understanding of the elliptic curve math), we find that under the assumptions xP ≠ xQ and xP ≠ xR, λ2 is determined by {xP, xQ, yP, yQ}, both before and after the optimization.

Notice that this is the kind of analysis that can and should be doable by an automated tool. Given:

  • two constraint systems, before and after a purported optimization;
  • a set of "variables of interest";
  • a set of assumptions (here, and for many similar problems, it's sufficient to specify a set of linear combinations that are assumed to be nonzero)

we should be able to ask the tool whether the dependency relations between the variables of interest changed, and what they are. This would help to catch a class of mistakes without requiring any sophisticated machinery for equivalence-proving. For instance, it would catch the mistakes in the MiMC paper (although tbh the manual analysis is sufficient for that).
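
As a sketch of what such a (hypothetical) tool might do, the following Python models each constraint only by which variable it determines from which others, under which nonzero assumption, and computes the set of determined variables by fixpoint iteration. All names and the rule encoding here are my own illustration, not an existing tool:

```python
def determined(rules, known, assumptions):
    # rules: list of (needed_vars, assumption_or_None, determined_var).
    # Propagate "is determined" from the inputs to a fixpoint.
    known = set(known)
    changed = True
    while changed:
        changed = False
        for needs, assume, out in rules:
            if (out not in known and set(needs) <= known
                    and (assume is None or assume in assumptions)):
                known.add(out)
                changed = True
    return known

inputs = {"xP", "yP", "xQ", "yQ"}
assumptions = {"xP != xQ", "xP != xR"}

before = [  # unoptimized (P + Q) + P: lam2 computed from yR
    ({"xP", "yP", "xQ", "yQ"}, "xP != xQ", "lam1"),
    ({"lam1", "xP", "xQ"}, None, "xR"),
    ({"lam1", "xP", "xR", "yP"}, None, "yR"),
    ({"yP", "yR", "xP", "xR"}, "xP != xR", "lam2"),
]
after = [   # optimized: the yR constraint is dropped
    ({"xP", "yP", "xQ", "yQ"}, "xP != xQ", "lam1"),
    ({"lam1", "xP", "xQ"}, None, "xR"),
    ({"lam1", "xP", "xR", "yP"}, "xP != xR", "lam2"),
]

# lam2 (and xR) are determined by the same inputs in both systems.
d_before = determined(before, inputs, assumptions)
d_after = determined(after, inputs, assumptions)
assert "lam2" in d_before and "lam2" in d_after
assert "xR" in d_before and "xR" in d_after
```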

[Edit: as if to prove my point about this needing an automated tool, I initially made a mistake that resulted in omitting the required assumption xP ≠ xQ.]

@daira commented Apr 3, 2019

A subtlety is that the assumption xP ≠ xR depends on xR being sufficiently constrained, and so xR should be one of our "variables of interest" even though it is an intermediate. In this case, we can infer that under the assumption xP ≠ xQ, xR is determined as a function of variables {xP, xQ, yP, yQ}. Note that the assumption here does not itself depend on xR (which would be circular reasoning).

@daira commented Apr 28, 2019

For #3425 we need to be able to implement multiplications on E'/Fp².

A conditional negation takes 2C (since both elements of the y-coordinate need to be negated). So processing each scalar bit requires 3m̃ + 2s̃ + 2C, which is (9 + 4 + 2)C = 15C using Karatsuba multiplication and complex squaring.
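
These counts can be made concrete in a short Python sketch (my own illustration, not from the comment). Modelling Fp² as Fp[u]/(u² - β), with illustrative toy values p = 1019 and β = -1 (a non-residue since p ≡ 3 mod 4), Karatsuba multiplication uses 3 base-field products (m̃ = 3C) and complex squaring uses 2 (s̃ = 2C):

```python
p, beta = 1019, -1    # toy field; beta = -1 is a non-residue as p % 4 == 3
MULS = 0              # count base-field multiplications (1C each in R1CS)

def fpmul(a, b):
    global MULS
    MULS += 1
    return a * b % p

def mul(a, b):
    # Karatsuba multiplication in Fp2: 3 base-field products.
    (a0, a1), (b0, b1) = a, b
    v0, v1 = fpmul(a0, b0), fpmul(a1, b1)
    v2 = fpmul(a0 + a1, b0 + b1)
    return ((v0 + beta * v1) % p, (v2 - v0 - v1) % p)

def sqr(a):
    # Complex squaring (valid for beta = -1): 2 base-field products.
    a0, a1 = a
    return (fpmul(a0 + a1, a0 - a1), fpmul(a0, 2 * a1))

a, b = (123, 456), (789, 321)
c = mul(a, b)
assert MULS == 3      # one Fp2 multiplication = 3 base-field products
assert c == ((123*789 - 456*321) % p, (123*321 + 456*789) % p)
MULS = 0
s2, m2 = sqr(a), mul(a, a)
assert s2 == m2 and MULS == 2 + 3   # squaring used only 2 products
```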

The overall cost to compute [2ⁿ + k] T in E'/Fp² for k < 2ⁿ is (10 + 8)C to compute [3] T, (n-1) * 15C for the loop, 8C to compute Acc - T, and 4C for the final conditional, for a total of (n+1) * 15C.

If the scalar is fixed (and there is no addition-subtraction chain better than the binary one), it is more efficient to use the (P + Q) + P technique only for the double-and-add steps (13C each) and not for the plain doublings (10C each). No conditional negation is needed in this case.

@hdevalence commented Apr 30, 2019

This is possibly a tangent from the topic in this issue, but it may be possible to use this together with a Ristretto-for-JubJub along the following lines:

As described in section 4.3 of [Costello-Smith 2017], it's possible to recover the y-coordinate of the output point from data computed during the Montgomery ladder (Algorithm 5, Okeya-Sakurai y-coordinate recovery).

Since Ristretto can be implemented using a Montgomery curve internally, it seems possible to implement Ristretto inside of a circuit using @daira's few-constraints Montgomery formulas above plus Okeya-Sakurai y-coordinate recovery, and to implement it outside of a circuit using the Hisil-Wong-Carter-Dawson extended coordinates.

@daira commented May 19, 2019

Okeya-Sakurai in R1CS is described here. However, it's cheaper to do variable-base scalar multiplication in R1CS in Montgomery coordinates without the Montgomery ladder: a ladder-based multiplication would require at least 10 constraints per scalar bit, whereas the method described in this issue requires only 6 constraints/bit. Note that it doesn't require Okeya-Sakurai at the end, since it computes both coordinates already.

The difference in trade-offs is due to the fact that R1CS greatly favours using (x, y) affine coordinates, because division is cheap. The [Eisentraeger, Lauter, and Montgomery] method does avoid computing some of the intermediate y-coordinates, but as far as I can see, there's no further improvement to be made by staying x-coordinate-only, because computing the y-coordinate is the fastest way to compute the needed slope (λ) for the next addition or doubling.

The additional Ristretto operations can be implemented just by converting to/from Edwards and following the affine Ristretto formulae. I haven't calculated the exact costs, but it looks straightforward, with some caveats about special-casing the zero point. Note that the inverse square root operation is particularly efficient in R1CS, because you basically just need to confirm the square. You also need to constrain the square root to be positive, which I think can only be done by boolean-unpacking it (similar to appendix A.3.3.2 in the protocol spec).

@hdevalence commented May 20, 2019

Ah, thanks for the explanation on the Montgomery formulas, and the pointer on checking positivity!

I'm not sure but I think that rather than converting to/from Edwards and following the affine Edwards Ristretto formulas, it will be simpler to use the Ristretto-to-Montgomery formulas directly (since the affine Edwards formulas are those formulas, composed with the Montgomery-to-Edwards map). @cathieyun and I have been planning to prototype this using Bulletproofs, which could be useful as a test case.

@daira commented Jun 14, 2019

Using the prime-order subgroup turns out to be always more efficient in R1CS than Ristretto: #4024 (comment)

@kidker commented Jul 15, 2019

Daira, please give me a book for reading;) thanks;)

@daira mentioned this issue Sep 10, 2019

@daira commented Sep 10, 2019

multiply_fast in https://github.com/ebfull/halo/blob/master/src/gadgets/ecc.rs implements this algorithm for short Weierstrass curves (in Sonic/Halo-style constraints, not bellman/R1CS).

@daira commented Sep 21, 2019

This technique can be adapted to double-variable-base scalar multiplication, i.e. [u₁] P₁ + [u₂] P₂ where P₁ and P₂ are both variable (but P₁ ≠ ±P₂). We can compute Q in each iteration as the sum of two conditionally-negated points, which costs 2C for the conditional negations and 3C for the incomplete addition. Therefore the overall cost is (5+2+3)C = 10C per bit, a modest improvement over the 12C per bit needed to compute [u₁] P₁ + [u₂] P₂ naively.

The fixed-and-variable case (as above but with P₁ fixed) is interesting for EdDSA, RedDSA, or ECDSA signature verification. We can do a fixed-base scalar multiplication in 2C per bit, so naively adding the fixed-base and variable-base results gives 8C per bit (or 0.5C per bit to prepare the fixed-base scalar and 7.5C per bit for the double multiplication). I cannot find any improvement on this; adapting the approach used for double-variable-base above would take 9C per bit.

@daira commented Sep 21, 2019

In general we can do a k-way all-variable-base multiplication in (2+4k)C per bit using the same approach.

@daira added this to Needs Prioritization in Protocol Team Sep 21, 2019

@daira commented Dec 11, 2019

Note that the correctness argument for using incomplete addition does not work for the multiple-base variant without additional work.
