Faster variable-base scalar multiplication in zk-SNARK circuits #3924

Open · daira opened this issue Mar 26, 2019 · 15 comments

@daira commented Mar 26, 2019
The best general way to perform a variable-base scalar multiplication in an R1CS circuit required (before this write-up) 9 constraints per scalar bit. There is a way that requires 8.5 constraints/bit, but only with a recoded scalar using a nonstandard and inconvenient digit set (e.g. [0, 1, 2, -1]).

I believe it's possible to implement this in 6 constraints/bit, by using a modification of a technique from [Eisentraeger, Lauter, and Montgomery]. They state their technique for curves in short Weierstrass form, but it is also applicable to Montgomery form. (Everything in this ticket is easily adapted to both.) The basic idea is that when we need to find [2] P + Q, where P is the accumulator in a double-and-add algorithm, we compute it as (P + Q) + P. This allows two optimizations:

  • we do not need to compute the y-coordinate of the intermediate point P + Q (details below);
  • we've replaced a doubling with an addition, which is more efficient in R1CS by one constraint.

Here we adapt [Eisentraeger, Lauter, and Montgomery]'s formulae to a Montgomery curve with equation B·y² = x³ + A·x² + x. The constraint system for the incomplete addition P + Q = R on its own would be:

    (xQ - xP) × (λ1) = (yQ - yP)
    (B·λ1) × (λ1) = (A + xP + xQ + xR)
    (xP - xR) × (λ1) = (yR + yP)

When computing (P + Q) + P = S, we drop the last constraint which is only needed for yR. The outer addition requires the gradient λ2 = (yP - yR)/(xP - xR), but we have:

    λ1 = (yR + yP)/(xP - xR)

therefore

    λ2 = 2·yP/(xP - xR) - λ1.

So in R1CS we can write:

    (xQ - xP) × (λ1) = (yQ - yP)
    (B·λ1) × (λ1) = (A + xP + xQ + xR)
    (xP - xR) × (λ1 + λ2) = (2·yP)

and then complete the outer addition with:

    (B·λ2) × (λ2) = (A + xR + xP + xS)
    (xP - xS) × (λ2) = (yS + yP)
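
As a sanity check on these formulae, here is a small Python sketch (my own illustration, not from this issue) that computes S = (P + Q) + P using only the witness values determined by the five constraints above, and compares the result against naive affine addition on a toy Montgomery curve. The parameters p = 1019, A = 6, B = 1 are arbitrary illustrative choices.

```python
# Toy Montgomery curve B*y^2 = x^3 + A*x^2 + x over F_p.
# p, A, B are arbitrary illustrative choices (p is prime, p % 4 == 3).
import itertools

p, A, B = 1019, 6, 1

def inv(a):
    return pow(a, p - 2, p)

def on_curve(P):
    x, y = P
    return (B * y * y - (x * x * x + A * x * x + x)) % p == 0

def add(P, Q):
    # Naive affine incomplete addition (or doubling when P == Q).
    (x1, y1), (x2, y2) = P, Q
    if P == Q:
        lam = (3 * x1 * x1 + 2 * A * x1 + 1) * inv(2 * B * y1) % p
    else:
        lam = (y2 - y1) * inv(x2 - x1) % p
    x3 = (B * lam * lam - A - x1 - x2) % p
    return (x3, (lam * (x1 - x3) - y1) % p)

def double_add(P, Q):
    # S = (P + Q) + P via the five-constraint gadget's witness values:
    # lam1, xR, lam2, xS, yS. The y-coordinate of P + Q is never computed.
    (xP, yP), (xQ, yQ) = P, Q
    lam1 = (yQ - yP) * inv(xQ - xP) % p        # (xQ-xP) x lam1 = yQ-yP
    xR = (B * lam1 * lam1 - A - xP - xQ) % p   # B*lam1 x lam1 = A+xP+xQ+xR
    lam2 = (2 * yP * inv(xP - xR) - lam1) % p  # (xP-xR) x (lam1+lam2) = 2*yP
    xS = (B * lam2 * lam2 - A - xR - xP) % p   # B*lam2 x lam2 = A+xR+xP+xS
    yS = (lam2 * (xP - xS) - yP) % p           # (xP-xS) x lam2 = yS+yP
    return (xS, yS)

# Find a few affine points (p % 4 == 3, so square roots are easy).
pts = []
x = 1
while len(pts) < 6:
    r = (x * x * x + A * x * x + x) * inv(B) % p
    s = pow(r, (p + 1) // 4, p)
    if r != 0 and s * s % p == r:
        pts.append((x, s))
    x += 1

checked = 0
for P, Q in itertools.permutations(pts, 2):
    if add(P, Q)[0] != P[0]:                   # gadget needs xR != xP
        assert double_add(P, Q) == add(add(P, Q), P)
        checked += 1
assert checked > 0
```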

The practical problem with applying this technique in an R1CS circuit is that we cannot efficiently implement a conditional that sometimes computes [2] P, and sometimes [2] P + Q. (Conditionally replacing Q with 𝓞 does not work, because 𝓞 does not have an affine Montgomery representation.)

The following trick works around this. Suppose that r is of length n bits. Consider the following algorithm:

    Acc := [2] T
    for i from n-1 down to 0 {
        Q := rᵢ ? T : −T
        Acc := (Acc + Q) + Acc
    }

For each step we can compute the y-coordinate of Q using:

    (yT) × (2·rᵢ - 1) = (yQ)

This requires a total of 6 constraints per scalar bit. However, at the end we have computed [2ⁿ⁺¹ - (2ⁿ - 1) + 2·r] T = [2ⁿ + 1 + 2·r] T.

Not to worry. Suppose that we actually want to compute [2ⁿ + k] T, where k < 2ⁿ⁺¹. Without loss of generality, assume k is odd (if it is even then add one to k and subtract T from the result). Let k = 1 + 2·r, and solve to give r = (k - 1)/2. Conveniently, this is equivalent to setting r = k >> 1, where >> is the bitwise right-shift operator.

So the full algorithm is:

    Acc := [2] T
    for i from n-1 down to 0 {
        Q := kᵢ₊₁ ? T : −T
        Acc := (Acc + Q) + Acc
    }
    return (k₀ = 0) ? (Acc - T) : Acc
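
Since incomplete point addition acts on indices (multiples of T) as integer addition, the bookkeeping above is easy to check mechanically. This Python sketch (my own, not from the issue) simulates the full algorithm over indices and confirms that it returns [2ⁿ + k] T for every k < 2ⁿ⁺¹:

```python
def mul_index(k, n):
    # Simulate the algorithm over indices: a point [m] T is represented
    # by the integer m, so point addition becomes integer addition.
    acc = 2                                   # Acc := [2] T
    for i in range(n - 1, -1, -1):
        q = 1 if (k >> (i + 1)) & 1 else -1   # Q := k_{i+1} ? T : -T
        acc = (acc + q) + acc                 # Acc := (Acc + Q) + Acc
    return acc - 1 if k & 1 == 0 else acc     # (k_0 = 0) ? Acc - T : Acc

n = 8
for k in range(2 ** (n + 1)):
    assert mul_index(k, n) == 2 ** n + k
```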

This requires 4C for the initial doubling, n * 6C for the loop, 3C to compute Acc - T, and 2C for the conditional.

There is a minor further improvement from specializing for kₙ = 0. In that case the first iteration of the loop calculates Acc = [3] T, which can be implemented directly as [2] T + T, saving 3C (since we have replaced one loop iteration by an incomplete addition):

    Acc := [2] T + T
    for i from n-2 down to 0 {
        Q := kᵢ₊₁ ? T : −T
        Acc := (Acc + Q) + Acc
    }
    return (k₀ = 0) ? (Acc - T) : Acc

Let s be the order of the large prime-order subgroup. Assume that T is of order s and that 2ⁿ⁺¹ - 1 ≤ (s-1)/2. Under these conditions, we can calculate [2ⁿ + k] T for k < 2ⁿ in (n+1) * 6C.

It remains to check that the x-coordinates of each pair of points to be added are distinct.

When adding points in the large prime-order subgroup, we can rely on Theorem A.3.4 from the Zcash protocol spec, which says that if we have two such points with nonzero indices with respect to a given odd-prime-order base, and their indices, taken in the range -(s-1)/2..(s-1)/2, are distinct disregarding sign, then the points have different x-coordinates. This is helpful, because it is easier to reason about the indices of points occurring in the scalar multiplication algorithm than to reason about their x-coordinates directly.

So, the required check is equivalent to saying that the following program never asserts:

    acc := 3
    for i from n-2 down to 0 {
        q := kᵢ₊₁ ? 1 : −1
        assert acc ≠ ± q
        assert (acc + q) ≠ acc    // X
        acc := (acc + q) + acc
        assert 0 < acc ≤ (s-1)/2
    }
    if k₀ = 0 {
        assert acc ≠ 1
        acc := acc - 1
    }

The assertion labelled X obviously cannot fail, because q ≠ 0. It is easy to see that acc is monotonically increasing except in the last conditional. It reaches its largest value, 2ⁿ⁺¹ - 1, when k is maximal; this justifies the condition on n above. This discharges all of the other assertions.
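
To double-check that the program never asserts, one can simply run it over indices for every k, which is cheap. Here is a Python sketch (my own, not from the issue); the toy values n = 8 and s = 1031 are illustrative choices satisfying 2ⁿ⁺¹ - 1 ≤ (s-1)/2:

```python
# Exhaustively run the index-level assertion program for all k < 2^n.
# n = 8 and s = 1031 are illustrative toy values with 2^(n+1) - 1 <= (s-1)/2.
n, s = 8, 1031
for k in range(2 ** n):
    acc = 3
    for i in range(n - 2, -1, -1):
        q = 1 if (k >> (i + 1)) & 1 else -1
        assert acc != q and acc != -q
        assert acc + q != acc              # X: trivial, since q != 0
        acc = (acc + q) + acc
        assert 0 < acc <= (s - 1) // 2
    if k & 1 == 0:
        assert acc != 1
        acc -= 1
    assert acc == 2 ** n + k               # the result index is as intended
```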

[Edit: the constraint count was off-by-one.]

@arielgabizon commented Mar 27, 2019

I'm wondering how much this improves the circuit size for batched Groth16 verification... perhaps I'll do the computation later.

@daira commented Mar 28, 2019

@arielgabizon The Miller loop curve arithmetic is in G2, so the constraint counts are different; also in the Miller loop we need all the intermediate points in order to compute the line functions. It might still be possible to apply this optimization, but it needs more analysis.

It can definitely be applied to the double-and-add steps in the subgroup checks. There the exponent is fixed so we should find addition-subtraction chains that minimize the constraint cost taking into account this optimization.

We could also apply it to [zⱼ] πⱼ,A and to the multiscalar multiplication in the computation of AccumΔ. A complicating factor in the latter is that the base points πⱼ,C do not necessarily have distinct x-coordinates; also, the saving would not be large in that case, because there is only one doubling per scalar bit regardless of N.

The optimization is not applicable to [AccumΓ,ᵢ] Ψᵢ, because that is best done as a fixed-base scalar multiplication, which can already be done in fewer constraints.

@daira commented Mar 30, 2019

For comparison, a fixed-base scalar multiplication takes 2 constraints/bit. One way of doing it using 3-bit windows is described in #2230 (comment). I think there's a simpler way using 2-bit windows: that requires 1C to AND each pair of scalar bits, the lookup is free because it is linear, and 3C for an incomplete addition, for a total of 4C per 2-bit window. The adjustment to avoid zero points does not affect the per-bit cost.
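
To illustrate why the 2-bit lookup costs nothing beyond the single AND, here is a Python sketch (my own illustration, not from the comment): once the product b0·b1 is available, selecting from any 4-entry table, say the coordinates of the window points, is a fixed linear combination of {1, b0, b1, b0·b1}, which is free in R1CS.

```python
def lookup_coeffs(t):
    # Multilinear interpolation of a 4-entry table over bits (b0, b1):
    # t[b1*2 + b0] = c0 + c1*b0 + c2*b1 + c3*b0*b1.
    return (t[0], t[1] - t[0], t[2] - t[0], t[3] - t[2] - t[1] + t[0])

t = [17, 42, 5, 99]              # hypothetical table entries (e.g. x-coords)
c0, c1, c2, c3 = lookup_coeffs(t)
for w in range(4):
    b0, b1 = w & 1, w >> 1
    b01 = b0 * b1                # the one AND constraint per bit pair
    assert c0 + c1 * b0 + c2 * b1 + c3 * b01 == t[w]
```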

(If you share the same scalar between different scalar multiplications, it's 0.5C/bit to prepare the scalar and then 1.5C/bit to actually do the fixed-base multiplication.)

@daira commented Apr 3, 2019

I reread this and experienced a pang of guilt that I might be teaching people to do something that would cause security bugs in their constraint systems in general:

The constraint system for the incomplete addition P + Q = R on its own would be:

    (xQ - xP) × (λ1) = (yQ - yP)
    (B·λ1) × (λ1) = (A + xP + xQ + xR)
    (xP - xR) × (λ1) = (yR + yP)

When computing (P + Q) + P = S, we drop the last constraint which is only needed for yR. The outer addition requires the gradient λ2 = (yP - yR)/(xP - xR), but we have:

    λ1 = (yR + yP)/(xP - xR)

therefore

    λ2 = 2·yP/(xP - xR) - λ1.

So in R1CS we can write:

    [2 constraints not important to this point snipped]
    (xP - xR) × (λ1 + λ2) = (2·yP)

Notice that we drop a constraint that is equivalent to λ1 = (yR + yP)/(xP - xR), and then we rely on that equation!

What is going on here that allows us to drop a constraint and then apparently rely on it? In general this can definitely cause bugs. For example, there are two similar attempted optimizations in the suggested constraint systems for MiMC and LowMC in section 6.1 of the MiMC paper ([Albrecht, Grassi, Rechberger, Roy, and Tiessen], page 22), which are completely insecure as a result.

The argument for why this is okay in this case goes as follows:

  • Under the assumption xP ≠ xQ (which we discharge later), the first constraint determines λ1 as a function of variables {xP, xQ, yP, yQ}.
  • The second constraint determines xR as a function of variables {λ1, xP, xQ} (and some constants).
  • Under the assumption xP ≠ xR (which we also discharge), the last constraint determines λ2 as a function of variables {λ1, xP, xR, yP}.

Combining these dependencies in a quite mechanical way (not depending on any higher-level understanding of the elliptic curve math), we find that under the assumptions xP ≠ xQ and xP ≠ xR, λ2 is determined by {xP, xQ, yP, yQ}, both before and after the optimization.

Notice that this is the kind of analysis that can and should be doable by an automated tool. Given:

  • two constraint systems, before and after a purported optimization;
  • a set of "variables of interest";
  • a set of assumptions (here, and for many similar problems, it's sufficient to specify a set of linear combinations that are assumed to be nonzero)

we should be able to ask the tool whether the dependency relations between the variables of interest changed, and what they are. This would help to catch a class of mistakes without requiring any sophisticated machinery for equivalence-proving. For instance, it would catch the mistakes in the MiMC paper (although tbh the manual analysis is sufficient for that).
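
As a sketch of what such a (hypothetical) tool might do, the following Python models each constraint only by which variable it determines from which others, under which nonzero assumption, and computes the set of determined variables by fixpoint iteration. All names and the rule encoding here are my own illustration, not an existing tool:

```python
def determined(rules, known, assumptions):
    # rules: list of (needed_vars, assumption_or_None, determined_var).
    # Propagate "is determined" from the inputs to a fixpoint.
    known = set(known)
    changed = True
    while changed:
        changed = False
        for needs, assume, out in rules:
            if (out not in known and set(needs) <= known
                    and (assume is None or assume in assumptions)):
                known.add(out)
                changed = True
    return known

inputs = {"xP", "yP", "xQ", "yQ"}
assumptions = {"xP != xQ", "xP != xR"}

before = [  # unoptimized (P + Q) + P: lam2 computed from yR
    ({"xP", "yP", "xQ", "yQ"}, "xP != xQ", "lam1"),
    ({"lam1", "xP", "xQ"}, None, "xR"),
    ({"lam1", "xP", "xR", "yP"}, None, "yR"),
    ({"yP", "yR", "xP", "xR"}, "xP != xR", "lam2"),
]
after = [   # optimized: the yR constraint is dropped
    ({"xP", "yP", "xQ", "yQ"}, "xP != xQ", "lam1"),
    ({"lam1", "xP", "xQ"}, None, "xR"),
    ({"lam1", "xP", "xR", "yP"}, "xP != xR", "lam2"),
]

# lam2 (and xR) are determined by the same inputs in both systems.
d_before = determined(before, inputs, assumptions)
d_after = determined(after, inputs, assumptions)
assert "lam2" in d_before and "lam2" in d_after
assert "xR" in d_before and "xR" in d_after
```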

[Edit: as if to prove my point about this needing an automated tool, I initially made a mistake that resulted in omitting the required assumption xP ≠ xQ.]

@daira commented Apr 3, 2019

A subtlety is that the assumption xP ≠ xR depends on xR being sufficiently constrained, and so xR should be one of our "variables of interest" even though it is an intermediate. In this case, we can infer that under the assumption xP ≠ xQ, xR is determined as a function of variables {xP, xQ, yP, yQ}. Note that the assumption here does not itself depend on xR (which would be circular reasoning).

@daira commented Apr 28, 2019

For #3425 we need to be able to implement multiplications on E'/Fp².

A conditional negation takes 2C (since both elements of the y-coordinate need to be negated). So processing each scalar bit requires 3m̃ + 2s̃ + 2C, which is (9 + 4 + 2)C = 15C using Karatsuba multiplication and complex squaring.
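
These counts can be made concrete in a short Python sketch (my own illustration, not from the comment). Modelling Fp² as Fp[u]/(u² - β), with illustrative toy values p = 1019 and β = -1 (a non-residue since p ≡ 3 mod 4), Karatsuba multiplication uses 3 base-field products (m̃ = 3C) and complex squaring uses 2 (s̃ = 2C):

```python
p, beta = 1019, -1    # toy field; beta = -1 is a non-residue as p % 4 == 3
MULS = 0              # count base-field multiplications (1C each in R1CS)

def fpmul(a, b):
    global MULS
    MULS += 1
    return a * b % p

def mul(a, b):
    # Karatsuba multiplication in Fp2: 3 base-field products.
    (a0, a1), (b0, b1) = a, b
    v0, v1 = fpmul(a0, b0), fpmul(a1, b1)
    v2 = fpmul(a0 + a1, b0 + b1)
    return ((v0 + beta * v1) % p, (v2 - v0 - v1) % p)

def sqr(a):
    # Complex squaring (valid for beta = -1): 2 base-field products.
    a0, a1 = a
    return (fpmul(a0 + a1, a0 - a1), fpmul(a0, 2 * a1))

a, b = (123, 456), (789, 321)
c = mul(a, b)
assert MULS == 3      # one Fp2 multiplication = 3 base-field products
assert c == ((123*789 - 456*321) % p, (123*321 + 456*789) % p)
MULS = 0
s2, m2 = sqr(a), mul(a, a)
assert s2 == m2 and MULS == 2 + 3   # squaring used only 2 products
```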

The overall cost to compute [2ⁿ + k] T in E'/Fp² for k < 2ⁿ is (10 + 8)C to compute [3] T, (n-1) * 15C for the loop, 8C to compute Acc - T, and 4C for the final conditional, for a total of (n+1) * 15C.

If the scalar is fixed (and there is no addition-subtraction chain better than the binary one), it is more efficient to use the (P + Q) + P technique only for the double-and-add steps (13C each) and not for the plain doublings (10C each). No conditional negation is needed in this case.

@hdevalence commented Apr 30, 2019

This is possibly a tangent from the topic in this issue, but it may be possible to use this together with a Ristretto-for-JubJub along the following lines:

As described in section 4.3 of [Costello-Smith 2017], it's possible to recover the y-coordinate of the output point from data computed during the Montgomery ladder (Algorithm 5, Okeya-Sakurai y-coordinate recovery).

Since Ristretto can be implemented using a Montgomery curve internally, it seems possible to implement Ristretto inside of a circuit using @daira's few-constraints Montgomery formulas above plus Okeya-Sakurai y-coordinate recovery, and to implement it outside of a circuit using the Hisil-Wong-Carter-Dawson extended coordinates.

@daira commented May 19, 2019

Okeya-Sakurai in R1CS is described here. However, it's cheaper to do variable-base scalar multiplication in R1CS in Montgomery coordinates without the Montgomery ladder: a ladder-based multiplication would require at least 10 constraints per scalar bit, whereas the method described in this issue requires only 6 constraints/bit. Note that it doesn't require Okeya-Sakurai at the end, since it computes both coordinates already.

The difference in trade-offs is due to the fact that R1CS greatly favours using (x, y) affine coordinates, because division is cheap. The [Eisentraeger, Lauter, and Montgomery] method does avoid computing some of the intermediate y-coordinates, but as far as I can see, there's no further improvement to be made by staying x-coordinate-only, because computing the y-coordinate is the fastest way to compute the needed slope (λ) for the next addition or doubling.

The additional Ristretto operations can be implemented just by converting to/from Edwards and following the affine Ristretto formulae. I haven't calculated the exact costs, but it looks straightforward, with some caveats about special-casing the zero point. Note that the inverse square root operation is particularly efficient in R1CS, because you basically just need to confirm the square. You also need to constrain the square root to be positive, which I think can only be done by boolean-unpacking it (similar to appendix A.3.3.2 in the protocol spec).

@hdevalence commented May 20, 2019

Ah, thanks for the explanation on the Montgomery formulas, and the pointer on checking positivity!

I'm not sure but I think that rather than converting to/from Edwards and following the affine Edwards Ristretto formulas, it will be simpler to use the Ristretto-to-Montgomery formulas directly (since the affine Edwards formulas are those formulas, composed with the Montgomery-to-Edwards map). @cathieyun and I have been planning to prototype this using Bulletproofs, which could be useful as a test case.

@daira commented Jun 14, 2019

Using the prime-order subgroup turns out to be always more efficient in R1CS than Ristretto: #4024 (comment)

@kidker commented Jul 15, 2019

Daira, please give me a book for reading;) thanks;)

@daira mentioned this issue Sep 10, 2019

@daira commented Sep 10, 2019

multiply_fast in https://github.com/ebfull/halo/blob/master/src/gadgets/ecc.rs implements this algorithm for short Weierstrass curves (in Sonic/Halo-style constraints, not bellman/R1CS).

@daira commented Sep 21, 2019

This technique can be adapted to double-variable-base scalar multiplication, i.e. [u₁] P₁ + [u₂] P₂ where P₁ and P₂ are both variable (but P₁ ≠ ±P₂). We can compute Q in each iteration as the sum of two conditionally-negated points, which costs 2C for the conditional negations and 3C for the incomplete addition. Therefore the overall cost is (5+2+3)C = 10C per bit, a modest improvement over the 12C per bit needed to compute [u₁] P₁ + [u₂] P₂ naively.

The fixed-and-variable case (as above but with P₁ fixed) is interesting for EdDSA, RedDSA, or ECDSA signature verification. We can do a fixed-base scalar multiplication in 2C per bit, so naively adding the fixed-base and variable-base results gives 8C per bit (or 0.5C per bit to prepare the fixed-base scalar and 7.5C per bit for the double multiplication). I cannot find any improvement on this; adapting the approach used for double-variable-base above would take 9C per bit.

@daira commented Sep 21, 2019

In general we can do a k-way all-variable-base multiplication in (2+4k)C per bit using the same approach.

@daira added this to Needs Prioritization in Protocol Team Sep 21, 2019

@daira commented Dec 11, 2019

Note that the correctness argument for using incomplete addition does not work for the multiple-base variant without additional work.
