Perf: Improve M31 mul #622

schouhy · 2024-05-08T18:56:20Z

Description

This PR slightly changes the implementation of the reduce algorithm, saving one operation and yielding an improvement of about 8%.

The current implementation of the multiplication of two M31 elements first computes the integer product of their representatives and then applies a reduce function to it. The reduce function works for every integer in the range $[0, p^2)$. This can be refined by a special reduce function for elements in the range $[0, p^2)$ that are not multiples of $p$.

General

The general idea is that taking remainder modulo $p = 2^{31} - 1$ is easy if the elements are written in base $2^{31}$. This is because the element $2^{31}$ equals $1$ modulo $p$. So, an element of the form
$$a_n (2^{31})^n + a_{n-1} (2^{31})^{n-1} + \cdots + a_1 2^{31} + a_0$$
is equivalent modulo $p$ to the element $$a_{n} + \cdots + a_{0}.$$ This gives an integer in the same residue class as the original element. It may not lie in the interval $[0, p)$, but it does lie in the interval $[0, (n+1)(2^{31}-1)]$, which (for $n>0$) is smaller than the original range of values. So iterating this process eventually leads to a representative in the range $[0, p]$.
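The folding step described above can be sketched in Rust as follows. This is only an illustration: the names `fold_once` and `fold_to_range` are hypothetical and not part of this PR, and the loop is written for clarity rather than speed.

```rust
/// The modulus p = 2^31 - 1, which is also the mask for one base-2^31 digit.
const P: u64 = (1 << 31) - 1;

/// Hypothetical helper: returns the sum of the base-2^31 digits of `v`.
/// The result is congruent to `v` modulo `P` because 2^31 = 1 (mod P).
fn fold_once(mut v: u64) -> u64 {
    let mut sum = 0u64;
    while v > 0 {
        sum += v & P; // lowest base-2^31 digit
        v >>= 31; // drop that digit
    }
    sum
}

/// Hypothetical helper: repeats the folding step until the value lies in [0, P].
/// Each step strictly decreases any value larger than `P`, so the loop terminates.
fn fold_to_range(mut v: u64) -> u64 {
    while v > P {
        v = fold_once(v);
    }
    v
}
```

Note that the result may be `P` itself, which represents the residue $0$.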

General remarks

Let $b > 1$ be an integer.

Let's consider nonnegative integers in base $b$. If $v < b^2$, then there exist $a_0, a_1 \in [0, b)$ such that $$v = a_1 b + a_0.$$ Let $w = a_0 + a_1$. Then $w \in [0, 2b - 1)$. Writing $w$ in base $b$, we therefore obtain $$w = b_1 b + b_0,$$ with either

$b_1 = 0$ and $b_0 \in [0, b)$, or
$b_1 = 1$ and $b_0 \in [0, b-1)$.

In particular, $b_0 + b_1 \in [0, b)$.
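The bound above can be checked by brute force for small bases. The following is a minimal sketch; `fold_twice` and `bound_holds` are hypothetical names introduced only for this illustration.

```rust
/// Hypothetical helper: adds the two base-`b` digits of `v`, then adds the
/// two base-`b` digits of that sum. Assumes `b > 1` and `v < b * b`.
fn fold_twice(v: u64, b: u64) -> u64 {
    let (a1, a0) = (v / b, v % b);
    let w = a0 + a1; // at most 2b - 2
    let (b1, b0) = (w / b, w % b); // b1 is 0 or 1
    b0 + b1
}

/// Hypothetical check: returns true if `fold_twice(v, b) < b` for every `v < b * b`.
fn bound_holds(b: u64) -> bool {
    (0..b * b).all(|v| fold_twice(v, b) < b)
}
```

For instance, with $b = 10$ and $v = 99$ the first sum is $18$ and the second is $1 + 8 = 9$, which is still a single decimal digit.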

The case of M31

Let $b = 2^{31}$ and $p = 2^{31} - 1 = b - 1$. By the argument above, for all $v \in [0, b^2)$, if we write $v = a_1 b + a_0$ and write $a_0 + a_1 = b_1 b + b_0$, we obtain $b_0 + b_1 \in [0, b)$, which is the same as $b_0 + b_1 \in [0, p].$

Suppose in addition that $v = xy$ is the product of two elements $x, y \in (0, p)$. Since $p$ is prime and divides neither $x$ nor $y$, $p$ does not divide $v$. On the other hand, $v \equiv b_0 + b_1 \pmod{p}$. This implies that $b_0 + b_1$ is not divisible by $p$; in particular, $b_0 + b_1$ is different from $0$ and $p$.

Putting it all together, we obtain that $b_0 + b_1 \in (0, p)$ whenever $v = xy$ is the product of two nonzero elements strictly less than $p$ (such a $v$ automatically lies in $[0, b^2)$).

Alternative algorithm

This follows the same idea as the reduce algorithm already implemented. However, by taking into account the particular case of $v$ being the product of two nonnegative integers less than $p$, we are able to remove the + 1 after the first shift in the current algorithm.

Let $b = 2^{31}$ and $p = 2^{31} - 1$. Suppose $v = xy$ is the product of two elements $x, y \in [0, p)$. Then $v$ belongs to the interval $[0, p^2)$ and is either $0$ or not a multiple of $p$. If we write $v = a_1 b + a_0$ with $a_1, a_0 \in [0, b)$, then $a_1$ can't be equal to $b-1$: otherwise $v = (b-1)b + a_0 = pb + a_0$, which is larger than $p^2$. So $a_1 \leq b-2$. (When $v = 0$, every quantity below is $0$ and the argument holds trivially.)

Going back to the algorithm: instead of computing $a_1, a_0$, adding them, then computing $b_1, b_0$ and adding them, there's a shortcut.

Say $v = a_1 b + a_0$ and let $b_1, b_0$ be the elements such that $a_1 + a_0 = b_1 b + b_0$. As before, we know that $b_1$ is either $0$ or $1$. Let's consider the following elements:

Let $w := v + a_1$. If we expand this we obtain $$w = a_1 b + a_0 + a_1 = a_1 b + b_1 b + b_0 = (a_1 + b_1) b + b_0.$$ Since $b_1$ is either $0$ or $1$ and $a_1 \leq b-2$, we obtain that $a_1 + b_1 \leq b-1$. Therefore, the above expression is the decomposition of $v + a_1$ in base $b$, and its high digit is $a_1 + b_1$.
Let $u := v + a_1 + b_1$. Expanding once again we obtain $$u = a_1 b + a_0 + a_1 + b_1 = (a_1 + b_1) b + b_0 + b_1.$$ Since we know from the previous section that $b_0 + b_1$ is less than $p$, the above expression is the decomposition of $u$ in base $b$, and its low digit $b_0 + b_1$ is the reduced value of $v$.
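The two identities above can be checked exhaustively on small Mersenne primes. The sketch below is only illustrative: `two_step`, `shortcut` and `agree_on_products` are hypothetical names, and the `bits` parameter is an assumption that generalizes $b = 2^{31}$ to $b = 2^{\text{bits}}$ so that the search space stays small.

```rust
/// Hypothetical helper: digit sum of `v` in base 2^bits, applied twice,
/// i.e. computes b_0 + b_1 explicitly.
fn two_step(v: u64, bits: u32) -> u64 {
    let mask = (1u64 << bits) - 1;
    let (a1, a0) = (v >> bits, v & mask);
    let s = a1 + a0;
    let (b1, b0) = (s >> bits, s & mask);
    b0 + b1
}

/// Hypothetical helper: the shortcut, using w = v + a_1 and u = v + a_1 + b_1.
fn shortcut(v: u64, bits: u32) -> u64 {
    let mask = (1u64 << bits) - 1;
    let w = v + (v >> bits); // high digit of w is a_1 + b_1
    let u = v + (w >> bits); // low digit of u is b_0 + b_1
    u & mask
}

/// Hypothetical check: for p = 2^bits - 1 (assumed prime), verifies that both
/// routes agree with `%` on every product of two elements of [0, p).
fn agree_on_products(bits: u32) -> bool {
    let p = (1u64 << bits) - 1;
    (0..p).all(|x| {
        (0..p).all(|y| {
            let v = x * y;
            shortcut(v, bits) == two_step(v, bits) && shortcut(v, bits) == v % p
        })
    })
}
```

For example, with `bits = 3` (so $p = 7$) and $v = 6 \cdot 6 = 36$, we get $w = 40$, $u = 41$, and the low digit of $u$ is $1 = 36 \bmod 7$.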

/// Assumes that `v` is in the range [0, `P`.pow(2)) and `v` is not a multiple of `P`.
///
/// Returns `v` % `P`.
fn reduce_alternative_algorithm(v: u64) -> Self {
    let w = v + (v >> 31);
    let u = v + (w >> 31);
    Self(u as u32 & P)
}
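To try the function outside of the field type, here is a standalone sketch that returns a bare `u32` instead of `Self`. The constant `P` is redefined locally, and `mul` and `mul_reference` are hypothetical wrappers added only for comparison against the `%` operator.

```rust
/// The modulus p = 2^31 - 1.
const P: u32 = (1 << 31) - 1;

/// Standalone copy of the reduction above, returning a bare `u32`.
/// Assumes `v` is in [0, P^2) and is not a nonzero multiple of `P`.
fn reduce_alternative(v: u64) -> u32 {
    let w = v + (v >> 31);
    let u = v + (w >> 31);
    (u as u32) & P
}

/// Hypothetical wrapper: multiplies two representatives in [0, P).
fn mul(x: u32, y: u32) -> u32 {
    reduce_alternative(x as u64 * y as u64)
}

/// Reference implementation using the `%` operator.
fn mul_reference(x: u32, y: u32) -> u32 {
    ((x as u64 * y as u64) % P as u64) as u32
}
```

The precondition matters: for the input `P` itself, a nonzero multiple of `P`, both shifts contribute nothing and the function returns `P` rather than `0`.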

This gives about an 8% improvement over the current reduce algorithm on an x86 laptop (Core i7).


use special reduce function for mul

2985733
